Feature request: automatically generated ids for headers

I think automatically generated ids would be awesomely useful, but should the feature also add some sort of link, to make it easier to extract the full URL with the appropriate fragment? GitHub does that, but it seems to me that this should be the responsibility of the template engine that embeds the Markdown content.

There are some issues with autogenerating ids that are hard to crack: should there be a length limit on the ids, so that they remain useful as URL fragments? Should the header text be transliterated to eliminate or transform non-ASCII characters? I think transliteration rules for several languages are still under heavy debate…

Instead, I think we should push for browser features for deep-linking to custom fragments, as proposed by Simon in his CSS Fragments spec and already implemented in some browser extensions.

The functionality is not broken at all without JavaScript. The element is added by the renderer, not by JavaScript. Hiding it until you hover over the title is a matter of approximately 5 lines of CSS.

Disable JavaScript and visit https://gist.github.com/mythz/957816#track-b

Click on the link icon to test the anchors as well.

This is broken in Chrome 37 and Firefox 32 (with javascript.enabled set to false).

It is. The point is that the exported HTML for a header looks for example like this:

<h2><a name="user-content-configuration" class="anchor" href="#configuration" aria-hidden="true"><span class="octicon octicon-link"></span></a>Configuration</h2>

As you can see, there is no id, and the name (which works as an anchor) is user-content-configuration while the link obviously goes to just #configuration.

GitHub actually uses a hashchange event listener to see if there is a user-content-foo for a hash #foo and navigates there using JavaScript. It is broken without JavaScript.

$.hashChange(function () {
    var t, e;
    if (location.hash && !document.querySelector(':target'))
        return t = 'user-content-' + location.hash.slice(1),
            e = document.getElementsByName(t),
            $(e).scrollTo()
})

CSS Fragments would be a cool feature for developers, but not so great for documents. Consider this:

http://example.com/lorem.html#css(.content:nth-child(2))

There’s nothing semantic communicated in the link at all.

That said, deep linking of some kind would be very useful. Perhaps a fuzzy search query can be helpful here, for example:

<h1>Production</h1>
...
<h1>Models</h1>
<h2>Currently in production</h2>
<h3>Trucks</h3>
<h3>Buses</h3>

A search “deep link” for Production would look like this: page.html#@production, implying /^production$/i.
A link for Currently in production would look like this: page.html#@models/*production, implying /^.*production$/i under the element matched by /^models$/i.
Finally, a link for Buses would look like this: page.html#@models/*production/buses or like this: page.html#@models/*/buses.
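
To make the matching rules above concrete, here is a minimal sketch of a resolver. Everything in it is hypothetical: the function names, the flat { level, text } heading list, and the reading of "under" as "anywhere inside that heading's section" are my assumptions, not part of any spec or browser.

```javascript
// Hypothetical resolver for the "#@a/b/c" deep links sketched above.
// `headings` is a flat list in document order: { level: 1..6, text: string }.

// Turn one path segment into an anchored, case-insensitive regex.
// "*" becomes ".*"; everything else is matched literally.
function segmentToRegex(segment) {
  const escaped = segment
    .split('*')
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + escaped + '$', 'i');
}

// End (exclusive) of the section opened by headings[i]: the next heading
// at the same or a shallower level, or `limit`.
function sectionEnd(headings, i, limit) {
  let j = i + 1;
  while (j < limit && headings[j].level > headings[i].level) j++;
  return j;
}

// Depth-first search: match segment `depth` in [start, end), then look for
// the remaining segments only inside the matched heading's section.
function search(headings, regexes, depth, start, end) {
  for (let i = start; i < end; i++) {
    if (!regexes[depth].test(headings[i].text)) continue;
    if (depth === regexes.length - 1) return i;
    const found = search(headings, regexes, depth + 1, i + 1, sectionEnd(headings, i, end));
    if (found !== -1) return found;
  }
  return -1;
}

// Returns the index of the first matching heading, or -1 if nothing matches.
function resolveDeepLink(fragment, headings) {
  if (!fragment.startsWith('#@')) return -1;
  const segments = decodeURIComponent(fragment.slice(2)).split('/');
  return search(headings, segments.map(segmentToRegex), 0, 0, headings.length);
}

const exampleHeadings = [
  { level: 1, text: 'Production' },
  { level: 1, text: 'Models' },
  { level: 2, text: 'Currently in production' },
  { level: 3, text: 'Trucks' },
  { level: 3, text: 'Buses' },
];
console.log(resolveDeepLink('#@models/*/buses', exampleHeadings)); // 4
```

Returning the first match in document order keeps the result deterministic when several headings match, which is one of the issues a real design would have to settle.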

The above approach has its issues, but it could be used to provide stability without requiring precision.

I suggest adopting Wikipedia’s section anchor scheme, which takes the text of the header and converts it to a valid id:

  1. spaces => _
  2. certain punctuation is converted to character entity references
  3. second and following duplicate headers are appended with _2 to _n suffixes (N.B.: the first instance is unchanged.)
  4. header depth is ignored

This would allow authors to refer to headers without imposing any extra syntax:

# A good place to start

...

Please begin at [the beginning](#A_good_place_to_start).
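
The four rules above can be sketched roughly as follows. This is not MediaWiki's actual code: the function name is made up, and since I haven't listed which punctuation rule 2 covers, the entity table is only a placeholder assumption.

```javascript
// Placeholder table for rule 2; which punctuation is affected is an
// assumption made for this sketch.
const ENTITY_FOR = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' };

// Returns a function that assigns an id to each header text.
// Call it once per header, in document order. Header depth is never
// consulted (rule 4).
function createIdGenerator() {
  const seen = new Map(); // base id -> how many times it has occurred
  return function idFor(headerText) {
    const base = headerText
      .trim()
      .replace(/[&<>"]/g, (ch) => ENTITY_FOR[ch]) // rule 2
      .replace(/ /g, '_');                         // rule 1
    const count = (seen.get(base) || 0) + 1;
    seen.set(base, count);
    // Rule 3: the first instance is unchanged, later ones get _2 .. _n.
    return count === 1 ? base : base + '_' + count;
  };
}

const idFor = createIdGenerator();
console.log(idFor('A good place to start')); // "A_good_place_to_start"
console.log(idFor('A good place to start')); // "A_good_place_to_start_2"
```

The Map of seen ids is exactly the duplicate tracking that the first downside below is about.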

Downsides:

  • The requirement for unique ids (point #3) means that the converter will need to keep track of duplicate header names. Further, if there are multiple Markdown documents on a single page, the page will need to be validated to be sure that no documents duplicate the same ids. These are concerns with any system that allows URL fragments in Markdown, however. (This leads me to believe that link anchors should be an extension and not part of the core standard.)

  • Edits to the header text will break links (both internal and external). This could be solved by adding manual anchors and creating redirects (which is what Wikipedia does). If authors are able to specify anchors manually, they could future-proof their own documents at the cost of giving up the convenience of automation. (Again, this seems more appropriate for an extension that also adds manual anchors.)

On the plus side:

  • Authors can predict what ids will be generated. And perhaps more importantly, so can their authoring tools. (Though, again, #3 complicates matters.)

  • Editors can reorganize a document without breaking it as long as they don’t change the text of the header. (Oh, they also need to be careful not to rearrange the order of duplicate headers. Sigh.)

  • People familiar with Wikipedia section links will feel at home.

  • There is a pleasing parallelism between the syntax of ATX headers and the links to them.

I think that this feature, just like tables in markdown, is useful but should be considered an extension of standard markdown.

I like your approach. For conflicting IDs or commonly jumped-to sections, this may be a good addition:

# header text here {#anchorName}

As for making this and tables an extension: both are so commonly used that they shouldn’t be excluded from the core. (Otherwise implementations will constantly lack them.)

I propose that anchors only be generated explicitly when the user requests it, or at least provide the option to explicitly declare anchor names (e.g. by adding {#anchorName} as already mentioned).

The problem with automatically generated anchor names is that they break when you change the header content. Suppose you use Standard Markdown as the title of your document and link to it via [link][Standard_Markdown]. Suppose further that you later have to change the title to CommonMark: you must now update every link to that anchor.

This is definitely a problem that users don’t see until they are faced with it. Requiring a user to pick an explicit anchor name may appear cumbersome, but I believe it helps users more than it burdens them.
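
As a rough illustration of why explicit names survive a retitle, here is a sketch that reads the {#anchorName} suffix from an ATX header line. The function name and the exact suffix grammar (no whitespace or braces inside the name) are assumptions for this example; closing # sequences and other header details are ignored.

```javascript
// Parse one ATX header line, picking up an optional trailing {#id}.
// Returns null if the line is not an ATX header.
function parseAtxHeader(line) {
  const atx = /^(#{1,6})[ \t]+(.*?)[ \t]*$/.exec(line);
  if (!atx) return null;
  const level = atx[1].length;
  // Explicit anchor: "{#name}" at the very end of the header content.
  const explicit = /^(.*?)[ \t]*\{#([^\s{}]+)\}$/.exec(atx[2]);
  if (explicit) return { level, text: explicit[1], id: explicit[2] };
  return { level, text: atx[2], id: null };
}

// The title changes, the anchor does not, so existing links keep working.
console.log(parseAtxHeader('# Standard Markdown {#spec}').id); // "spec"
console.log(parseAtxHeader('# CommonMark {#spec}').id);        // "spec"
```

A header without the suffix comes back with id set to null, which is where an optional automatic scheme could take over.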

I understand that sites like GitHub use automatically generated names and that this feature is very useful. However I believe this is something where the spec should only give a recommendation as to how these links are generated. There should be an explicit alternative in the spec that is obligatory, whereas automatic anchor name generation appears to be an optional feature in my opinion.

Is there likely to be any progress on this?

I really like the idea of headings having a default generated id; however, I share @tabatkins’s concern that it creates more issues than it solves.

Having a standardised way to provide ids, as suggested, is I feel the most important feature Markdown lacks.

Is there any chance that adding IDs will happen, @codinghorror?

I don’t like the idea of unifying IDs. They are too specific to different projects and languages. There can be a recommendation or a default implementation, but not a mandatory requirement.

If the parser provides an easy way to customize the renderer, it’s not a big problem to add ids as needed.

My old view was that this is merely a writer implementation issue. The spec just tells you what to parse as a header, what its contents are, and what level it is. It’s up to the writer to determine exactly how to render it in a given format.

But of course I do see the problem if different implementations use different schemes for auto-numbering IDs. It means that a document that works on one implementation may not work on another (because links to headers are broken). So maybe it isn’t crazy to explore adding something about this to the spec.

On the other hand, there are many issues. (I have experience with this from pandoc, which has supported automatic header ID numbering for many years.) Chief among them:

  • How to avoid duplicate IDs, both with other headers and with IDs that
    may be defined in other ways.

  • How to deal with punctuation in headers.

Handling of non-ASCII alphabetics also used to be a problem, but HTML5 no longer restricts identifiers to ASCII.

Pandoc’s system is described here:
http://johnmacfarlane.net/pandoc/README.html#header-identifiers-in-html-latex-and-context
See also the section on “Implicit header references.”

I’ll note that I’ve often thought, in dealing with issues that come up with these, that it would be best just to make people specify header IDs explicitly when they’re needed.

The problem is twofold:

  • you want ids to be document-consistent, e.g. you want a reference to #foo to land on the right heading every time,
  • you want the ids to stay generally constant, which gets hairier if there is an include directive in the mix.

The former can be addressed using some [][]-compatible anchor system; the latter probably requires more attention.

If it isn’t going to be specified, it should at least be mentioned in the spec as a recommendation.

I think I like the method that is used in Wikipedia, as mentioned by jericson:

http://talk.commonmark.org/t/feature-request-automatically-generated-ids-for-headers/115/17?u=mofosyne

It’s simple and gets the job mostly done.

If collisions are an issue, try adding section numbers to the id, like id="my_header_name--3.2.1--". The benefit of this approach is that section numbers are guaranteed to be unique. (Plus you won’t need to track duplicate IDs.)
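
A small sketch of how that could work, assuming the renderer sees the headings as a flat list of { level, text } in document order. The function name and the lowercase-plus-underscore slug are my own choices for the example.

```javascript
// Append the outline number (e.g. 3.2.1) to a slug of the header text.
function numberedIds(headings) {
  const counters = [0, 0, 0, 0, 0, 0]; // one counter per heading level
  return headings.map(({ level, text }) => {
    counters[level - 1] += 1;
    counters.fill(0, level); // a new section restarts all deeper counters
    const number = counters.slice(0, level).join('.');
    const slug = text.trim().toLowerCase().replace(/\s+/g, '_');
    return slug + '--' + number + '--';
  });
}

console.log(numberedIds([
  { level: 1, text: 'Intro' },
  { level: 2, text: 'Setup' },
  { level: 1, text: 'Usage' },
  { level: 2, text: 'Setup' },
]));
// [ 'intro--1--', 'setup--1.1--', 'usage--2--', 'setup--2.1--' ]
```

The two Setup headers get distinct ids without any duplicate tracking, but note that the numbers shift whenever a section is added, removed, or moved ahead of them.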

There’s a library for Ruby called Stringex that generates URI-friendly strings and handles Unicode characters elegantly. For example, the string “Rock & Roll” is converted to “rock-and-roll”, which is much more readable than “rock-roll”. A similar approach for auto-generated header ids could make a nice CommonMark extension.
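
To show the kind of conversion I mean, here is a hand-rolled sketch. It is not Stringex and does not reproduce its rules; the function name and the handful of steps (strip accents, spell out the ampersand, hyphenate) are assumptions for illustration only.

```javascript
// Turn header text into a lowercase, hyphen-separated, ASCII-only id.
function slugify(text) {
  return text
    .normalize('NFKD')                // split accented letters into base + mark
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining marks
    .replace(/&/g, ' and ')           // spell out the ampersand
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')      // any other run of characters -> one hyphen
    .replace(/^-+|-+$/g, '');         // no leading or trailing hyphens
}

console.log(slugify('Rock & Roll'));  // "rock-and-roll"
console.log(slugify('Café au lait')); // "cafe-au-lait"
```

Characters with no ASCII decomposition (for example CJK text) are simply dropped here, so a real extension would need proper transliteration tables or would have to keep non-ASCII ids.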

To handle duplicate header names, a simple approach would be to add a hyphen and a number to the end of the ID. E.g. the second instance of “Rock & Roll” would have the ID “rock-and-roll-2”. The parser would need to keep track of previous IDs and add the next number to the end of the ID based on some conventional rules.

  1. That’s still not enough if you have two different forum posts with the same headings on the same page. Also, some SEO people will object to having garbage at the tail of the id.
  2. International characters in ids will work only in HTML5. Also, I don’t know of an easy way to transliterate arbitrary Unicode characters to English in the browser (without huge mapping tables).

I’d like to have automatic IDs, but I don’t see a universal solution that is good for everyone.

@jgm I quite like the self referencing inline links you seem to be using in the spec code: [line](@line)

Could the same be used for headings perhaps?

## Great attractive heading [#](@wow-look-at-this-heading)

For:

Great attractive heading #

Implementations could then choose how to render the #

+++ Chris Alley [Dec 08 14 07:10 ]:

There’s a library for Ruby called Stringex that generates URI-friendly strings and handles Unicode characters elegantly. For example, the string “Rock & Roll” is converted to “rock-and-roll”, which is much more readable than “rock-roll”. A similar approach for auto-generated header ids could make a nice CommonMark extension.

To handle duplicate header names, a simple approach would be to add a hyphen and a number to the end of the ID. E.g. the second instance of “Rock & Roll” would have the ID “rock-and-roll-2”. The parser would need to keep track of previous IDs and add the next number to the end of the ID based on some conventional rules.

This is similar to what pandoc does. One problem, though, is that when you insert a like-named section in your document before the section that is linked to, it breaks the link because it changes the generated identifier.
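
The breakage is easy to reproduce with a toy version of the quoted scheme. This sketch follows the hyphen-and-number rule from the quote, not pandoc's exact algorithm, and the function name is made up.

```javascript
// Assign ids in document order; repeats get "-2", "-3", ... appended.
function assignIds(titles) {
  const seen = new Map(); // base id -> how many times it has occurred
  return titles.map((title) => {
    const base = title
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, '-')
      .replace(/^-+|-+$/g, '');
    const n = (seen.get(base) || 0) + 1;
    seen.set(base, n);
    return n === 1 ? base : base + '-' + n;
  });
}

const before = assignIds(['Intro', 'Notes', 'Usage']);
const after = assignIds(['Notes', 'Intro', 'Notes', 'Usage']);
console.log(before[1]); // "notes"   (what existing links point to)
console.log(after[2]);  // "notes-2" (same section after the insertion)
```

A link to #notes still resolves after the edit, but it now lands on the newly inserted section, which is arguably worse than a dead link.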

This discussion has gone pretty far into the weeds from the original feature request.

What’s important to me is moving toward a world where I can have confidence that every header within an arbitrarily chosen HTML document will have a fragment identifier.

Therefore, this is a feature request for mandatory generation of fragment identifiers for all headers that don’t have one explicitly specified.

I do not consider an optional recommendation to be adequate, and neither do I consider a mechanism for author-specified explicit fragment identifiers (even if that mechanism is mandatory-to-implement) to be adequate.

I also don’t care about the details of the generation algorithm, or how conflicts are resolved (either within the document, or with surrounding chrome), although I acknowledge that the standard must specify both.

I would suggest also having implicit header link references, using the header title as the link’s label. I haven’t seen this discussed yet.

Specifically, for each header in the document, the following link reference would be automatically generated by the parser:

[Header title]: #generated-id

If N headers shared the title, then N+1 link references would be generated:

[Header title]: #generated-id
[Header title #1]: #generated-id
[Header title #2]: #generated-id-for-2nd-repeat-header
 ...

If a link reference defined by the author had the same label as an implicit header reference, then the defined link reference would take preference.

This would make including links to sections in a document a breeze:

see [Header title]
see [Header title #2]
this is covered in the [section on subject][Header title]
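
Here is a rough sketch of how a parser could build those implicit references. The function name, the { title, id } input shape, and the use of a Map are assumptions for the example, and label normalization (case folding, whitespace collapsing) is left out to keep it short.

```javascript
// headers: [{ title, id }] in document order, with ids already generated.
// authorRefs: { label: url } for references defined explicitly by the author.
function buildReferenceMap(headers, authorRefs = {}) {
  // First pass: how many headers share each title?
  const totals = new Map();
  for (const { title } of headers) {
    totals.set(title, (totals.get(title) || 0) + 1);
  }

  // Second pass: the plain label points at the first header with that
  // title; numbered labels are added only when the title is shared.
  const refs = new Map();
  const seen = new Map();
  for (const { title, id } of headers) {
    const n = (seen.get(title) || 0) + 1;
    seen.set(title, n);
    if (n === 1) refs.set(title, '#' + id);
    if (totals.get(title) > 1) refs.set(`${title} #${n}`, '#' + id);
  }

  // Author-defined references take preference over implicit ones.
  for (const [label, url] of Object.entries(authorRefs)) {
    refs.set(label, url);
  }
  return refs;
}

const refs = buildReferenceMap(
  [
    { title: 'Header title', id: 'generated-id' },
    { title: 'Header title', id: 'generated-id-for-2nd-repeat-header' },
  ],
  {}
);
console.log(refs.get('Header title #2')); // "#generated-id-for-2nd-repeat-header"
```

Because the author's definitions are applied last, an existing document that already defines a reference with the same label keeps its current meaning.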

What do you think?
