Feature request: automatically generated ids for headers

zwol · September 3, 2014, 9:49pm

One of the big lacunae in Markdown (IMNSHO) is that there’s no way to get id="" attributes in the generated HTML, and therefore, no way to use fragment references when linking to the containing document.

A general mechanism for adding id attributes to generated HTML elements seems Too Hard, and any mechanism for adding user-specified attributes to generated elements is going to have backward compatibility headaches. However, id attributes are most useful on headers, and it’s possible to generate id attributes algorithmically from the header text, so I’d like to propose that Markdown should do that. A reasonable algorithm might look something like:

1. Apply aggressive Unicode normalization (NFKC + lowercasing) to the text of the header.
2. Replace all characters not in Unicode categories [L*] and [N*] (that is, that are neither letters nor numbers) with - characters.
3. Compress all runs of - characters to a single -.
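
As a rough illustration of the three steps above, here is a minimal Python sketch. The function name auto_id is hypothetical, and the sketch deliberately does nothing beyond the listed steps: it does not trim a leading or trailing -, and it does not deal with duplicate headers.

```python
import re
import unicodedata


def auto_id(header_text):
    """Hypothetical helper: derive an id from header text using the three steps."""
    # Step 1: NFKC normalization followed by lowercasing.
    text = unicodedata.normalize("NFKC", header_text).lower()
    # Step 2: keep letters (L*) and numbers (N*); replace everything else with "-".
    text = "".join(
        ch if unicodedata.category(ch)[0] in ("L", "N") else "-" for ch in text
    )
    # Step 3: collapse each run of "-" into a single "-".
    return re.sub(r"-+", "-", text)
```

Under these assumptions, auto_id("JS Conf slides - 2011") would give js-conf-slides-2011, while auto_id("Hello, World!") keeps a trailing hyphen (hello-world-), which is one of the details a spec would have to settle.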

I’m sure this can be bikeshedded, but the important thing is to get implementations consistently doing something, not the exact details.

imsky · September 3, 2014, 10:06pm

This needs to be optional or at least namespaced. Two cases present problems:

Existing IDs in the DOM, e.g. # Main should not conflict with <div id="main">
Existing IDs in the Markdown can cause unnecessary confusion, e.g. # Main ... # Main causes two <h1 id="main"> elements. GitHub solves this with an additional suffix, though that seems like a kludge.

poke · September 3, 2014, 10:12pm

I actually think the opposite. If you would be adding automatically generated ids to the standard specification, then there absolutely must be a clear specification of how that’s going to work. So when using links in the document that link to separate sections/headers, those links can be specified exactly without having implementation-specific differences.

zwol · September 4, 2014, 12:13am

@imsky said: This needs to be optional or at least namespaced.

I would like not to have to make this feature depend on adding a file-metadata mechanism, which seems like a can of worms in itself.

@imsky said: Existing IDs in the DOM, e.g. # Main should not conflict with <div id="main">

I sort-of feel like resolving this class of conflict should be the responsibility of whatever template engine is embedding Markdown content in a larger structure with existing IDs. Of course the problem with that logic is then the author of the Markdown content can no longer predict what the fragment IDs are going to be.

@imsky said: # Main ... # Main causes two <h1 id="main"> elements. GitHub solves this with an additional suffix, though that seems like a kludge.

I’m not sure there’s any way to deal with that that isn’t gonna feel like a kludge to some extent.

zwol · September 4, 2014, 12:17am

@poke said: If you would be adding automatically generated ids to the standard specification, then there absolutely must be a clear specification of how that’s going to work.

You misunderstand me – I definitely think the algorithm should be spelled out in the spec; I just think there are several possible ways the algorithm could work and any one of them would be fine. All it needs to do is generate unique-in-the-document fragment IDs, and I suppose it would be nice (but not strictly necessary, because you can always \-escape) if they were valid CSS #name tokens.

tabatkins · September 4, 2014, 1:00am

Markdown Extra allows special attributes to add classes and ids to some constructs, notably headers. It looks like:

I'm a header {#the-id}
============
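
To make the attribute syntax concrete, here is a small Python sketch that splits a trailing {...} block off a header's title line and reads an id and classes from it. The function name and the decision to silently ignore unrecognized tokens are assumptions made for illustration; this is not Markdown Extra's actual parser.

```python
import re

# A trailing "{...}" block at the end of the header's title line.
ATTR_BLOCK = re.compile(r"\s*\{([^{}]*)\}\s*$")


def split_header_attributes(line):
    """Hypothetical helper: return (title, id or None, list of classes)."""
    match = ATTR_BLOCK.search(line)
    if not match:
        return line.strip(), None, []
    header_id = None
    classes = []
    for token in match.group(1).split():
        if token.startswith("#") and len(token) > 1:
            header_id = token[1:]       # "#the-id" -> id "the-id"
        elif token.startswith(".") and len(token) > 1:
            classes.append(token[1:])   # ".note" -> class "note"
        # Any other token is ignored in this sketch.
    return line[:match.start()].strip(), header_id, classes
```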

zwol · September 4, 2014, 1:31am

That syntax seems as good as any I could come up with (I’d probably have omitted the #, but that’s a detail) … but I do not think it obviates the need for mechanically generated ids in general. It’s valuable to have a guarantee that all subheadings will have an id, whether or not it was explicitly specified.

codinghorror · September 4, 2014, 1:35am

I agree that whenever you have headers, you pretty much always want an implicit anchor there too.

tabatkins · September 4, 2014, 1:43am

@zwol: The reason for the # in there is that it also supports classes, by preceding the word with a . (period) instead. It’s CSS syntax, basically.

My own experience with auto-genning IDs suggests that it’s often not a great idea. If you try to gen them from text, you get some terrible anchors, and they’re not stable against wording changes; if you gen them from a counter or something similar, they’re not stable against reordering or adding/deleting sections above them. You can make it slightly better if you pay attention to the outline generated by the headings, as then sub-headings before you won’t matter, but still.

In my spec-generation tool that accepts some Markdown, I auto-gen IDs from text but log a warning that the author should be providing an explicit ID.
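
A minimal Python sketch of that "generate, but warn" behavior might look like the following. The function name, the logger name, and the placeholder slug rule are all assumptions for illustration; they are not taken from the tool mentioned above.

```python
import logging
import re

log = logging.getLogger("heading_ids")  # hypothetical logger name


def heading_id(text, explicit_id=None):
    """Prefer the author's explicit id; otherwise generate one and warn."""
    if explicit_id:
        return explicit_id
    # Placeholder generation rule (an assumption): lowercase ASCII slug.
    generated = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    log.warning("heading %r has no explicit id; generated %r", text, generated)
    return generated
```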

If we do end up supporting autogenned IDs on headings, however, I do suggest it be based on the outline level of the heading, like “heading-2-1” for the first h2 after the second h1. As I said above, it makes it slightly more robust against document changes; only ancestors and preceding siblings of ancestors (in the outline tree) can affect your ID, rather than all preceding headings like what you get if you just number them sequentially.
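
For illustration, outline-based ids of this kind could be computed roughly as below. This is a sketch: the function name is hypothetical, and padding a skipped level with 0 (for example an h3 directly under an h1) is an assumption the post does not address.

```python
def outline_ids(levels):
    """levels: heading levels in document order, e.g. [1, 2, 1, 2].

    Returns ids such as "heading-2-1" for the first h2 after the second h1.
    """
    counters = []  # counters[i] = number of headings seen at depth i + 1
    ids = []
    for level in levels:
        # Leaving a deeper section: forget its counters.
        del counters[level:]
        # Assumption: a skipped level is padded with 0.
        while len(counters) < level:
            counters.append(0)
        counters[level - 1] += 1
        ids.append("heading-" + "-".join(str(n) for n in counters))
    return ids
```

With levels [1, 2, 1, 2, 2, 3] this yields heading-1, heading-1-1, heading-2, heading-2-1, heading-2-2, heading-2-2-1, so inserting a subsection under the first h1 leaves every id under the second h1 unchanged.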

imsky · September 4, 2014, 1:54am

Right, and how about Markdown content being loaded into independent containers - now the template engine has to check IDs globally to make sure that “news-3” does not conflict with another “news-3” somewhere else on the page.

Stack Overflow and Reddit don’t add any attributes to headers. GitHub does a workaround where it wraps the header text into a link and uses JS to make document fragments (page.html#heading URLs) work:

<h1><a name="user-content-js-conf-slides---2011" class="anchor" href="#js-conf-slides---2011" rel="noreferrer"><span class="octicon octicon-link"></span></a>JS Conf slides - 2011</h1>

In Babelmark 2, the parsers that add attributes to headers are outnumbered by those that don’t: http://johnmacfarlane.net/babelmark2/?text=%23+Heading

Namespacing or leaving the anchoring behavior up to the template engine are the best ways to ensure that there are no conflicts with existing elements on the page.
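
As a sketch of what that global check and a namespacing fix could look like in a template engine, consider the Python below. The container names and the "container name plus hyphen" prefix scheme are assumptions for illustration only.

```python
from collections import Counter


def conflicting_ids(containers):
    """Return the ids that occur more than once across all containers on a page."""
    counts = Counter(i for ids in containers.values() for i in ids)
    return {i for i, n in counts.items() if n > 1}


def namespaced(containers):
    """Hypothetical scheme: prefix every id with the name of its container."""
    return {
        name: [f"{name}-{i}" for i in ids] for name, ids in containers.items()
    }


# Two independently rendered Markdown fragments on one page.
page = {"sidebar": ["news-3", "about"], "main": ["news-3", "intro"]}
```

Here conflicting_ids(page) reports news-3, while conflicting_ids(namespaced(page)) is empty. The cost is the one already noted in the thread: the Markdown author can no longer predict the final fragment from the source alone.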

imsky · September 4, 2014, 1:59am

Stability is a great point, I’ve seen a few spec documents where structure was preserved simply for the sake of keeping references intact.

Another thing about implicit anchors: users won’t know what they are unless you add a linking element of some kind so they can copy the URL. That’s what GitHub does and it works for them, at the cost of broken functionality when JS is disabled.

barraponto · September 4, 2014, 2:06am

I think automatically generated ids would be awesomely useful, but should the feature also add some sort of link, to make it easier to extract the full URL with the appropriate fragment? GitHub does that, but it seems to me that it should be the responsibility of the template engine that is embedding Markdown content.

There are some issues with autogenerating ids that are hard to crack: should there be a limit on the length of the ids, so that they are still useful as a URL fragment? Should the header text be transliterated to eliminate/transform non-ASCII characters? I think transliteration rules for several languages are still under heavy debate…

Instead, I think we should push for browser features for deep-linking to custom fragments, as proposed by Simon in his CSS Fragments spec and already implemented in some browser extensions.

stof · September 4, 2014, 2:06am

The functionality is not broken at all without JavaScript. The element is not added in JavaScript but by the renderer. Hiding it until you hover the title is a matter of approximately 5 lines of CSS.

imsky · September 4, 2014, 2:10am

Disable JavaScript and visit https://gist.github.com/mythz/957816#track-b

Click on the link icon to test the anchors as well.

This is broken in Chrome 37 and Firefox 32 (with javascript.enabled set to false).

poke · September 4, 2014, 2:13am

It is. The point is that the exported HTML for a header looks for example like this:

<h2><a name="user-content-configuration" class="anchor" href="#configuration" aria-hidden="true"><span class="octicon octicon-link"></span></a>Configuration</h2>

As you can see, there is no id, and the name (which works as an anchor) is user-content-configuration while the link obviously goes to just #configuration.

GitHub actually uses a hashchange event listener to see if there is a user-content-foo for a hash #foo and navigates there using JavaScript. It is broken without JavaScript.

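// Fallback for a hash that matches no element via :target: look up the
// element named "user-content-" + hash and scroll to it.
$.hashChange(function () {
    if (location.hash && !document.querySelector(':target')) {
        var name = 'user-content-' + location.hash.slice(1);
        var elements = document.getElementsByName(name);
        return $(elements).scrollTo();
    }
})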

imsky · September 4, 2014, 2:31am

CSS Fragments would be a cool feature for developers, but not so great for documents. Consider this:

http://example.com/lorem.html#css(.content:nth-child(2))

There’s nothing semantic communicated in the link at all.

That said, deep linking of some kind would be very useful. Perhaps a fuzzy search query can be helpful here, for example:

<h1>Production</h1>
...
<h1>Models</h1>
<h2>Currently in production</h2>
<h3>Trucks</h3>
<h3>Buses</h3>

A search “deep link” for Production would look like this: page.html#@production, implying /^production$/i.
A link for Currently in production would look like this: page.html#@models/*production, implying /^.*production$/i under the element matched by /^models$/i.
Finally, a link for Buses would look like this: page.html#@models/*production/buses or like this: page.html#@models/*/buses.

The above approach has its issues, but it could be used to provide stability without requiring precision.
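
To show how such a link might be resolved, here is a Python sketch that matches each path segment against heading text. It assumes that * stands for any run of characters, that "under" means any descendant in the outline rather than only direct children, and that the first match in document order wins; the function names are hypothetical.

```python
import re


def segment_regex(segment):
    """Turn "*production" into /^.*production$/i; other characters are literal."""
    pattern = ".*".join(re.escape(part) for part in segment.split("*"))
    return re.compile(f"^{pattern}$", re.IGNORECASE)


def resolve(fragment, headings):
    """headings: list of (level, text) in document order.

    Returns the index of the first matching heading, or None.
    """
    if not fragment.startswith("@"):
        return None
    segments = fragment[1:].split("/")

    def search(start, end, remaining):
        pattern = segment_regex(remaining[0])
        for i in range(start, end):
            level, text = headings[i]
            if not pattern.match(text):
                continue
            if len(remaining) == 1:
                return i
            # The subtree of heading i ends at the next heading of equal or higher rank.
            sub_end = i + 1
            while sub_end < end and headings[sub_end][0] > level:
                sub_end += 1
            found = search(i + 1, sub_end, remaining[1:])
            if found is not None:
                return found
        return None

    return search(0, len(headings), segments)


headings = [
    (1, "Production"),
    (1, "Models"),
    (2, "Currently in production"),
    (3, "Trucks"),
    (3, "Buses"),
]
```

With the headings from the example, resolve("@production", headings) picks the first h1, and both "@models/*production/buses" and "@models/*/buses" pick the Buses heading.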

jericson · September 4, 2014, 5:59pm

I suggest adopting Wikipedia’s section anchor scheme, which takes the text of the header and converts it to a valid id:

1. spaces => _
2. certain punctuation is converted to character entity references
3. second and following duplicate headers are appended with _2 to _n suffixes (N.B.: the first instance is unchanged.)
4. header depth is ignored

This would allow authors to refer to headers without imposing any extra syntax:

# A good place to start

...

Please begin at [the beginning](#A_good_place_to_start).
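
A simplified Python sketch of the scheme follows. It covers only the space replacement and the duplicate suffixes; punctuation handling is left out, and the function name is hypothetical.

```python
def wiki_anchors(headers):
    """Return one anchor per header text, in document order."""
    seen = {}      # base anchor -> number of times it has occurred so far
    anchors = []
    for text in headers:
        base = text.strip().replace(" ", "_")
        seen[base] = seen.get(base, 0) + 1
        # The first occurrence is unchanged; later ones get _2, _3, ...
        if seen[base] == 1:
            anchors.append(base)
        else:
            anchors.append(f"{base}_{seen[base]}")
    return anchors
```

Note that the suffix depends on document order, so swapping two headers with the same text would swap their anchors as well.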

Downsides:

The requirement for unique ids (point #3) means that the converter will need to keep track of duplicate header names. Further, if there are multiple Markdown documents on a single page, the page will need to be validated to be sure that no documents duplicate the same ids. These are concerns with any system that allows URL fragments in Markdown, however. (This leads me to believe that link anchors should be an extension and not part of the core standard.)
Edits to the header text will break links (both internal and external). This could be solved by adding manual anchors and creating redirects (which is what Wikipedia does). If authors are able to specify anchors manually, they could future-proof their own documents at the cost of giving up the convenience of automation. (Again, this seems more appropriate for an extension that also adds manual anchors.)

On the plus side:

Authors can predict what ids will be generated. And perhaps more importantly, so can their authoring tools. (Though, again, #3 complicates matters.)
Editors can reorganize a document without breaking it as long as they don’t change the text of the header. (Oh, they also need to be careful not to rearrange the order of duplicate headers. Sigh.)
People familiar with Wikipedia section links will feel at home.
There is a pleasing parallelism between the syntax of ATX headers and the links to them.

zzzzBov · September 4, 2014, 9:00pm

I think that this feature, just like tables in Markdown, is useful but should be considered an extension of standard Markdown.

mofosyne · September 5, 2014, 5:58am

I like your approach. For conflicting IDs or commonly jumped-to sections, this may be a good addition:

# header text here {#anchorName}

As for making this and tables an extension: both are used so commonly that they shouldn’t be excluded. (Otherwise implementations will constantly lack them.)

EnCey · September 6, 2014, 12:57pm

I propose that anchors only be generated explicitly when the user requests it, or at least provide the option to explicitly declare anchor names (e.g. by adding {#anchorName} as already mentioned).

The problem with automatically generated anchor names is that they break when you change the header content. Suppose you use Standard Markdown as the title of your document and link to it via [link](#Standard_Markdown). If you later have to change the title to CommonMark, you must update every link to that anchor.

This is definitely a problem that users don’t see until they are faced with it. Requiring a user to pick an explicit anchor name may appear cumbersome, but I believe it helps users more than it burdens them.

I understand that sites like GitHub use automatically generated names and that this feature is very useful. However, I believe this is something where the spec should only give a recommendation as to how these links are generated. There should be an explicit alternative in the spec that is obligatory, whereas automatic anchor name generation should, in my opinion, be an optional feature.