Feature request: automatically generated ids for headers

The use of square brackets is quite common. I can see the [My first section] syntax creating links in Markdown documents where a link was not intended by the author. Unless you meant the reference in reference style links only? e.g. [Click/tap this text to visit my first section][My first section]. I can’t think of any strong objections to this latter syntax. But…

It complicates testing, and there’s the problem of the IDs being different between implementations. As an opt-in extension, having a consistent algorithm for generating the IDs would be useful, as this would allow a subset of CommonMark documents to count on external links to headings not breaking if one CommonMark implementation is swapped out for another. If such an extension existed, you could define a link such as…

[Click/tap this text to visit my first section](#my-first-section)

…and the heading…

## My First Section

…and count on it working across all CommonMark implementations which opt in to the Implicit IDs extension. It’s not going to solve all of the issues raised in this topic (reordering headings, etc.), but for a sizeable number of documents (wiki articles, forum posts, etc.) it would probably be reliable enough for their use cases. For documents that require more certainty in how the IDs are generated, we could have a separate explicit IDs extension and leave it up to the application developer to choose which extension (implicit IDs, explicit IDs, or both) to use.
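As a rough sketch of what a consistent algorithm could look like, the Python below turns heading text into an ID. The function name `implicit_id` and the exact rules (lowercase, ASCII letters and digits only, hyphens between words) are assumptions chosen for illustration; no implementation or spec draft is being quoted.

```python
import re

def implicit_id(heading_text: str) -> str:
    """Derive an ID from heading text (hypothetical rule set, not from any spec)."""
    # Lowercase so that "My First Section" and "my first section" agree.
    slug = heading_text.strip().lower()
    # Collapse every run of characters other than a-z and 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    # Trim hyphens left over from leading or trailing punctuation.
    return slug.strip("-")

print(implicit_id("My First Section"))  # my-first-section
```

If every opted-in implementation applied the same rules, the link `(#my-first-section)` would resolve to `## My First Section` regardless of which one rendered the document.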

No, authors must never be required to use (and determine first) the actual ID. Some may choose to do that, though.

With default settings, Pandoc is the only implementation currently available in Babelmark that seems to get it right.

For the nth time, this is a request for mandatory generation of ids for all headers, as a core component of the specification. An extension is not good enough. Optional-to-implement is not good enough. Only-if-the-author-does-something is not good enough. It must be in the core, it must be mandatory to implement, and it must apply to all headers. Only that will move us toward a world where all HTML documents always have IDs on all of their headers.

All the picayune stuff this request keeps getting sidetracked on - what the IDs actually are, how the author can control them, whether the author should be able to opt-out some headers (no), etc - is not as important as the principle.

Mandatory generation of IDs.
For all headers.
In the core specification.


Not if the identity of id includes the position of the header in the outline. Not that I think such should be mandated by the specification, but it can be explained in an informative section on “best practice”.

I agree that it’s preferable, but I don’t see it as an absolute requirement. It’s a nice-to-have and something that can be achieved given a best practice algorithm, but since we don’t yet know what that algorithm will look like, I think the feature can be added to the CommonMark syntax and eventually that algorithm will surface. When it does, it can be added as a reference to the core language specification.

I don’t see what problem an explicit id is solving here. While I do agree it should be possible to have explicit ids, they will suffer from the exact same synchronicity problem as implicit id’s. The link referencing the anchor will have to reference something. That something can change, whether it is an explicit id or the text of a header.

:thumbsup:

It is solving the problem that the auto-generated ID will change if you change the heading text or the ordering of 2 identically-named headings. Your anchor ID will stay the same unless you explicitly change it, meaning that your links to it will not get broken.

If the link is written as @an3ss suggests:

## The Philosophy of CommonMark

You need to understand [the philosophy of CommonMark] because blah, blah, blah...

The reference and the anchor are synchronized through the text. If you change the text of the header, you most likely want to change the ID as well, unless you just use UUIDs. I don’t think this is a very compelling argument for requiring an explicit ID, but if it is what it takes to achieve consensus, I can live with that.

The reference and the anchor are synchronized through the text.

Not necessarily; it’s very conceivable you could use text other than the header text to refer to a section. This also doesn’t take into account links from external documents.

I suppose it is nicer for the author not to have to manually write the ID of a heading in order to link to it. I am concerned that adding square bracket heading links to the core spec would break a lot of existing Markdown documents (imagine how many GitHub README.md files would use square brackets for some other purpose and shouldn’t link to a heading), but as an opt-in extension it could be useful. Or some other syntax could be used besides square brackets.

How would that get around the reordering issue without the author defining explicit IDs? Can you provide an example?

To be honest, I’m not sure what I was thinking of, but I think it was something like this: If the generated ID consists of both the position of the header in the outline as well as the header’s name, reordering of headers shouldn’t be a problem. For the following outline:

# Level One
## Level Two
# Level One
## Level Two

The generated ID’s of the headers could be something like:

  1. section1-level-one
  2. section1.1-level-two
  3. section2-level-one
  4. section2.1-level-two

If you reorder or rename a header, it will get a new ID. You won’t have backward compatibility with incoming links, but you will avoid conflicts.
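A small Python sketch of that numbering scheme follows. It assumes headings arrive as `(level, text)` pairs in document order; the function name `outline_ids` and the slug rules are made up for this illustration.

```python
import re

def outline_ids(headings):
    """headings: list of (level, text) tuples in document order.

    Returns IDs such as 'section1.1-level-two' (hypothetical format).
    """
    counters = []  # counters[i] is the current number at depth i + 1
    ids = []
    for level, text in headings:
        # Pad so that a skipped level (h1 followed by h3) still has a counter.
        while len(counters) < level:
            counters.append(0)
        # Drop deeper counters: a new h1 restarts the numbering of its h2s.
        del counters[level:]
        counters[level - 1] += 1
        number = ".".join(str(n) for n in counters)
        slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
        ids.append(f"section{number}-{slug}")
    return ids

outline = [(1, "Level One"), (2, "Level Two"), (1, "Level One"), (2, "Level Two")]
for generated in outline_ids(outline):
    print(generated)
```

For the four-heading outline above this prints `section1-level-one`, `section1.1-level-two`, `section2-level-one` and `section2.1-level-two`, so identically named headings never collide.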

To the argument of being backwards compatible and conserving incoming links, I think that’s impossible unless your IDs have absolutely nothing to do with the document structure at all, i.e. they are semantically meaningless. UUIDs will give you this detachment if you really want it, and those who do should by all means be able to explicitly name the ID of their headers and put a UUID in there. Those of us who like semantically accurate and human-intelligible IDs can go with a more attached and brittle autogenerated ID.

I think no one has noticed this issue, which affects users of other languages (English is fine). Most implementations of automatic IDs ignore accented characters. Excluding them is actually correct, since the generated ID might cause problems if the accent were included (the URL can’t contain accents), but the character is not converted to its non-accented counterpart; it’s just omitted.

For example, for the title:

# “Techné” as the greek word for “Art”

The id would be techn-as-the-greek-word-for-art, instead of the correct way: techne-as-the...

Most automatic ID generators commit this error and I don’t think they’re to blame; language is really complicated, and even I have no idea what issues this might cause for other languages such as Japanese or Korean.

Markdown should not behave in an opinionated way (which is inevitable with automatic IDs), unless it clearly provides an alternative to use your own criteria to generate ids.

I really like the {#id-goes-here} approach because it’s clearly understandable and for reasons stated above.


We could define some rules to automatically convert the commonly used accented characters. But you might be right about it being difficult to anticipate the correct behaviour for all languages. This is a compelling reason to include an override method as part of the extension.
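One possible rule, sketched in Python: decompose each character with Unicode NFKD normalization and drop the combining marks, so that "é" becomes "e" instead of vanishing. The helper names are hypothetical, and this is only a partial answer: letters without a decomposition (for example "ß" or "ø") and non-Latin scripts would still be dropped by the ASCII filter below.

```python
import re
import unicodedata

def fold_accents(text: str) -> str:
    # NFKD splits "é" into "e" plus a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop only the combining marks, keeping the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def implicit_id(heading_text: str) -> str:
    # Hypothetical slug rules: lowercase ASCII letters and digits, hyphen-separated.
    slug = fold_accents(heading_text).lower()
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    return slug.strip("-")

print(implicit_id("“Techné” as the greek word for “Art”"))
# techne-as-the-greek-word-for-art
```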


I’m kinda in favor of the simplest solution, the one that GitHub has adopted, even at the risk of collisions (though GitHub avoids collisions by appending a suffix when a collision is detected).
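For illustration, collision handling by suffix might look like the Python sketch below. The `-1`, `-2` suffix format and the function name `unique_ids` are assumptions; the thread does not spell out the exact format GitHub uses.

```python
def unique_ids(slugs):
    """Append -1, -2, ... to repeated slugs, in document order."""
    used = set()
    next_suffix = {}  # last suffix number tried for each base slug
    result = []
    for slug in slugs:
        candidate = slug
        n = next_suffix.get(slug, 0)
        # Keep counting until the candidate is free; this also avoids
        # clashing with a heading whose own slug is already "setup-1".
        while candidate in used:
            n += 1
            candidate = f"{slug}-{n}"
        next_suffix[slug] = n
        used.add(candidate)
        result.append(candidate)
    return result

print(unique_ids(["setup", "usage", "setup", "setup"]))
# ['setup', 'usage', 'setup-1', 'setup-2']
```

The first occurrence keeps the plain ID, so links to it survive when a later duplicate heading is added; reordering the duplicates still changes which heading gets which suffix.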

The thing to remember is that links should be easy to author, just as with the rest of markdown/commonmark.

I haven’t thought through the implications enough, hence the “kinda”.


Babelmark shows just how different approaches are for spaces and roman non-ASCII letters.

With an info string, authors could override automatic IDs.


We don’t need namespaces. We already have scopes.

Within the scope of the content I’m authoring, [me too](#me-too) can unambiguously link to # Me Too. I as an author should not have to think about any containing scopes. This is perfectly analogous to block scopes in most programming languages.

It is the responsibility of the embedding context to respect and protect my scope. Whether in its rendering it avoids ID collisions by altering its IDs or mine, or demotes my heading levels to avoid multiple H1s, it is its business, its job to make it work.
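To make the embedding side concrete, here is a minimal Python sketch of a publishing tool protecting an author's scope by prefixing every ID and same-document link in an already rendered HTML fragment. The function name `scope_fragment` and the prefix format are hypothetical, and a regex is used only for brevity; a real tool would more likely walk a parsed HTML tree.

```python
import re

def scope_fragment(html: str, prefix: str) -> str:
    """Prefix ids and in-fragment links so embedded content cannot collide."""
    # Rewrite id="x" to id="prefix-x" (the lookbehind skips data-id and similar).
    html = re.sub(
        r'(?<![\w-])id="([^"]+)"',
        lambda m: f'id="{prefix}-{m.group(1)}"',
        html,
    )
    # Rewrite href="#x" the same way; links to other pages are left alone.
    return re.sub(
        r'(?<![\w-])href="#([^"]+)"',
        lambda m: f'href="#{prefix}-{m.group(1)}"',
        html,
    )

fragment = '<h1 id="me-too">Me Too</h1><p><a href="#me-too">me too</a></p>'
print(scope_fragment(fragment, "post42"))
```

The author still writes `[me too](#me-too)`; only the published output carries the `post42-` prefix.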

There should be a clear separation of concerns between authoring content and publishing mechanics. I shouldn’t have to manage the technicalities of an output format while authoring in a format that is supposed to be independent and portable. As the content author, the only thing that matters is that I have a semantically unambiguous and intuitive way to create internal links. I don’t care how they are ultimately rendered.

The one case where I do care, when I want the world to be able to deep link into my published content, I make sure to choose a publishing tool that produces predictable, “exportable” header ids and deep links, perhaps one that retains the unaltered CommonMark anchors or perhaps one that doesn’t. Who knows, maybe the content will get published in a relational database, and all anchors get translated into foreign keys. These are publishing concerns, not authoring concerns, not CommonMark spec concerns.


CommonMark should (with “support” meaning either in the mandatory core or via an optional extension) …

  • require implementations to automatically generate implicit IDs for headings
  • specify how to generate implicit IDs from heading text
    (e.g. “Überschrift 1” ⇒ #uberschrift_1)
  • specify how to generate implicit IDs from document structure
    (e.g. ### Heading ⇒ #section-3.6.1)
  • specify how to generate safe IDs for user-generated content
    (e.g. {#window.evil');\ DROP\ TABLE\ *;--} ⇒ #user-window-evil-drop-table)
  • support manually entered, explicit IDs for headings
    (e.g. ## Heading ## #ID)
  • support manually entered, explicit IDs for any block
    (e.g. ~~~ #ID)
  • support manually entered, explicit IDs for links
    (e.g. [text](target #ID))
  • support manually entered, explicit IDs for any inline markup
    (e.g. *emphasis*{#ID})
  • support overriding implicit IDs with explicit ones
    (e.g. [heading text]: #ID)
  • support relative links to headings/sections
    (e.g. [next section][>] or [this section](#.))
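The "safe IDs for user-generated content" item can be sketched in Python as follows. The `user-` prefix comes from the example in the list; the function name `safe_user_id`, the fallback token and the character whitelist are assumptions made for this illustration.

```python
import re

def safe_user_id(raw: str, prefix: str = "user-") -> str:
    # Keep only ASCII letters and digits; every other run becomes one hyphen.
    cleaned = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
    # The prefix keeps user-chosen IDs apart from IDs the page itself uses.
    # Fall back to a fixed token if nothing survives the cleaning.
    return prefix + (cleaned or "anonymous")

print(safe_user_id(r"window.evil');\ DROP\ TABLE\ *;--"))
# user-window-evil-drop-table
```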


One problem with implicit IDs is that they change when the heading changes. If you’re serious about cross-referencing in large documents, you want to guard against this; otherwise you’re spending your time updating links whenever you change headings. There are many reasons you might want to change the heading but keep the section the same: to make it more informative, to replace a word with another, etc.

So the best of the options in the above vote is to support explicit IDs that may displace implicit IDs. Alternatively, don’t specify implicit IDs in the spec at all; systems which already generate implicit IDs would then have to displace the implicit ID with the explicit one whenever an explicit ID is specified.
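A minimal Python sketch of an explicit ID displacing the implicit one, assuming the `{#id}` attribute syntax mentioned elsewhere in this thread. The function name `heading_id` and the fallback slug rules are hypothetical.

```python
import re

# A trailing attribute such as " {#philosophy}" at the end of the heading text.
ATTR = re.compile(r"\s*\{#([A-Za-z][\w.-]*)\}\s*$")

def heading_id(heading_text: str):
    """Return (visible_text, id); an explicit {#id} wins over the generated one."""
    match = ATTR.search(heading_text)
    if match:
        return heading_text[:match.start()], match.group(1)
    # No explicit ID: fall back to an implicit one derived from the text.
    slug = re.sub(r"[^a-z0-9]+", "-", heading_text.lower()).strip("-")
    return heading_text.strip(), slug

print(heading_id("The Philosophy of CommonMark"))
print(heading_id("The Philosophy of CommonMark {#philosophy}"))
print(heading_id("Design Philosophy {#philosophy}"))
```

The last two calls both yield the ID `philosophy`, which is the point of the override: the heading text can be reworded without breaking links to it.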


With either implicit or explicit IDs, there is no magic solution that works for large documents other than using a smart CommonMark-aware editor that handles the issues for you automatically.

For example, with explicit IDs:

  • How do you avoid ID collision?
  • How do you avoid off-by-one-char invalid refs?
  • To create a link to a section many pages up, you have to now go back and add an explicit ID to that target header, and return to where you are creating the link, even if you know the title of the section you are linking to.
  • If you delete a section, how do you know there are now broken references to its heading?
  • When you are reading/reviewing (see quote below), it’s hard to know what the link is pointing to, since the explicit ID reference no longer matches the title of a section. You can’t rely on the anchor text either, because the premise here is that the target heading was renamed (if we are avoiding updating link refs we’re probably also avoiding updating anchors too).
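The first, second and fourth questions are the kind of check a tool could automate. The Python sketch below assumes explicit IDs are written as `{#id}` and links as `[text](#id)`; the function name `check_anchors` is made up, and the regexes ignore code blocks and other edge cases a real linter would have to handle.

```python
import re
from collections import Counter

def check_anchors(markdown: str):
    """Return (duplicate_ids, broken_refs) for explicit {#id} anchors."""
    # Explicit IDs written as {#some-id}.
    defined = re.findall(r"\{#([\w.-]+)\}", markdown)
    # Inline links whose destination is a fragment, e.g. [text](#some-id).
    referenced = re.findall(r"\]\(#([\w.-]+)\)", markdown)
    duplicates = sorted(i for i, n in Counter(defined).items() if n > 1)
    broken = sorted(set(referenced) - set(defined))
    return duplicates, broken

doc = """
## Intro {#intro}
## Setup {#setup}
## More setup {#setup}
See [the intro](#intro) and [the setup](#setpu).
"""
print(check_anchors(doc))  # (['setup'], ['setpu'])
```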

Explicit ID’s go against the spirit of Markdown as described by Gruber and as quoted in the introduction to the CommonMark spec:

Readability, however, is emphasized above all else. A Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.


This is a good reason to use implicit IDs, although in fairness to the explicit ID camp, Markdown also supports reference links which have a lot in common with explicitly linking a heading to a link destination with a unique ID.

Perhaps the point to take away here is that Markdown supports more than one way of defining regular links, so why not support more than one way of setting the ID for a heading?

  • Implicit IDs only
  • Explicit IDs only
  • Implicit IDs overridable with explicit IDs

These all seem like desirable options for different use cases.


I hear you about it being easy enough to support both (I’m not implementing so I can’t say!). But from a user perspective I’m a strong proponent of keeping it simple and sticking to the design philosophy…

Not sure I agree. Reference links arguably improve readability, push URLs to a footnote so they don’t interrupt the flow of text, and look even less like markup. For example:

Markdown also supports [reference links](https://spec.commonmark.org/0.28/#reference-link)
which have a lot in common with explicitly linking a heading
to a link destination with a unique ID.

reads better with a reference link:

Markdown also supports [reference links][1] which have a
lot in common with explicitly linking a heading to a link
destination with a unique ID.

[1]: https://spec.commonmark.org/0.28/#reference-link

and even better with a shortcut reference link:

Markdown also supports [reference links] which have a
lot in common with explicitly linking a heading to a link
destination with a unique ID.

[reference links]: https://spec.commonmark.org/0.28/#reference-link

But explicit IDs interrupt the content with markup and IDs:

## To Be Explicit Or Not To Be Explicit ## 2BOrNot2B
pros: some pros
cons: some cons

〰〰〰〰〰

See [the pros and cons of explicit IDs](#2BOrNot2B) above.

shortcut anchor links

The last example does suggest supporting the following:

## To Be Explicit Or Not To Be Explicit
pros: some pros
cons: some cons

〰〰〰〰〰

See [To Be Explicit Or Not To Be Explicit] above.

The above would work because the heading generates both an implicit anchor ID and the following implicit ref link definition:

[To Be Explicit Or Not To Be Explicit]: #To-Be-Explicit-Or-Not-To-Be-Explicit
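A sketch in Python of how a processor might collect those implicit definitions from ATX headings. The name `implicit_definitions` is hypothetical, the heading regex is a simplification of the real ATX rules, and the anchor format (spaces replaced by hyphens, case kept) simply mirrors the definition shown above.

```python
import re

# Simplified ATX heading: up to 3 spaces, 1-6 "#", text, optional closing "#"s.
ATX = re.compile(r" {0,3}#{1,6}[ \t]+(.*?)(?:[ \t]+#+)?[ \t]*$")

def implicit_definitions(markdown: str):
    """Map heading text to the anchor an implicit ref link definition would use."""
    defs = {}
    for line in markdown.splitlines():
        m = ATX.match(line)
        if not m or not m.group(1):
            continue
        text = m.group(1)
        anchor = "#" + re.sub(r"\s+", "-", text)
        # The first heading with a given text wins, as with ordinary definitions.
        defs.setdefault(text, anchor)
    return defs

doc = "## To Be Explicit Or Not To Be Explicit\npros: some pros\ncons: some cons\n"
for label, dest in implicit_definitions(doc).items():
    print(f"[{label}]: {dest}")
```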

non-heading anchors

Ideally non-heading anchors are also supported with minimal markup. Perhaps:

This line contains a @link anchor@. 

This line links to the [above anchor](#link-anchor) via
inline link.

This line links to the above [link anchor] with a shortcut
ref link to the implicit ref link definition.

This line links to the above using a #link-anchor in the form
of a hashtag link that everyone is already familiar with.

Yes, and if the author specifies a link definition whose link label matches the text of a heading (and potentially shortcut reference links) and whose link target starts with a hash sign #, the shortcut anchor link gets an explicit ID.

## To Be Explicit Or Not To Be Explicit
pros: some pros
cons: some cons

〰〰〰〰〰

See [To Be Explicit Or Not To Be Explicit] above. [Same target](#2BX). 

〰〰〰〰〰
  [To Be Explicit Or Not To Be Explicit]: #2BX

## Foo ##

[Foo] 
[Foo][] 
[Foo][Foo] 
[Foo](#bar) 

[Foo]: #bar
. 
<h2 id="bar">Foo</h2>
<p><a href="#bar">Foo</a>
<a href="#bar">Foo</a>
<a href="#bar">Foo</a>
<a href="#bar">Foo</a></p>

If the parser understands some attribute syntax extension, there would be an alternative, inline way to achieve an explicit ID for the heading, but shortcut anchor links would probably fail then, unless the system supported multiple IDs per element.

## Foo ## {#bar} 

[Foo] 
[Foo][] 
[Foo][Foo] 
[Foo](#bar)
. 
<h2 id="bar">Foo</h2>
<p>[Foo] 
[Foo][] 
[Foo][Foo] 
<a href="#bar">Foo</a></p>

Also, if an author puts the heading text inside square brackets, this should be an easy way to provide a link to itself.

[Foo] 
=====

## [Bar] ##

[Foo] [Bar] 
. 
<h1 id="Foo"><a href="#Foo">Foo</a></h1>
<h2 id="Bar"><a href="#Bar">Bar</a></h2>
<p><a href="#Foo">Foo</a> <a href="#Bar">Bar</a></p>

If, however, a proper reference link definition with that link label exists, its link destination would get preferred (for backwards compatibility), although the implicit ID is still generated.

## [Foo] ##

[Foo] 

[Foo]: /bar
. 
<h2 id="Foo"><a href="/bar">Foo</a></h2>
<p><a href="/bar">Foo</a></p>
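
The precedence described in these examples can be summarized in a short Python sketch. Both function names are hypothetical, label matching is simplified to exact string equality, and the implicit ID is just the heading text with spaces replaced by hyphens.

```python
def resolve(label, explicit_defs, headings):
    """Return the destination for [label], or None if it stays plain text."""
    # A proper reference link definition always wins (backwards compatibility).
    if label in explicit_defs:
        return explicit_defs[label]
    # Otherwise a heading with the same text provides an implicit definition.
    if label in headings:
        return "#" + label.replace(" ", "-")
    return None

def heading_html_id(text, explicit_defs):
    """A definition whose target starts with '#' doubles as the explicit ID."""
    dest = explicit_defs.get(text, "")
    if dest.startswith("#"):
        return dest[1:]
    return text.replace(" ", "-")

headings = {"Foo"}
print(resolve("Foo", {}, headings), heading_html_id("Foo", {}))
print(resolve("Foo", {"Foo": "#bar"}, headings), heading_html_id("Foo", {"Foo": "#bar"}))
print(resolve("Foo", {"Foo": "/bar"}, headings), heading_html_id("Foo", {"Foo": "/bar"}))
```

The three lines printed correspond to the three cases above: no definition (`#Foo`, id `Foo`), a fragment definition (`#bar`, id `bar`), and an ordinary definition (`/bar`, id `Foo`).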