I don’t like the idea of unifying IDs. They are too specific to different projects and languages. There can be a recommendation or a default implementation, but not a mandatory requirement.
If the parser provides an easy way to customize the renderer, it’s not a big problem to add IDs as needed.
My old view was that this is merely a writer implementation issue. The spec just tells you what to parse as a header, what its contents are, and what level it is. It’s up to the writer to determine exactly how to render it in a given format.
But of course I do see the problem if different implementations use different schemes for auto-numbering IDs. It means that a document that works on one implementation may not work on another (because links to headers are broken). So maybe it isn’t crazy to explore adding something about this to the spec.
On the other hand, there are many issues. (I have experience with this from pandoc, which has supported automatic header ID numbering for many years.) Chief among them:
- How to avoid duplicate IDs, both with other headers and with IDs that may be defined in other ways.
- How to deal with punctuation in headers.
Handling of non-ASCII alphabetics also used to be a problem, but HTML5 no longer restricts identifiers to ASCII.
I’ll note that I’ve often thought, in dealing with issues that come up with these, that it would be best just to make people specify header IDs explicitly when they’re needed.
If collisions are an issue, try adding section numbers to the ID, like id="my_header_name--3.2.1--". The benefit of this approach is that section numbering is guaranteed to be unique. (Plus, you won’t need to track duplicate IDs.)
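To make the section-number idea concrete, here is a minimal Python sketch. The function names `section_numbers` and `numbered_id` are made up for illustration, and the `slug--number--` layout is simply copied from the example ID above; nothing here comes from an existing implementation.

```python
def section_numbers(levels):
    """Map a sequence of header levels (1-6) to dotted section numbers."""
    counters = [0] * 6
    numbers = []
    for level in levels:
        counters[level - 1] += 1
        # Starting a new section resets the counters of all deeper levels.
        for deeper in range(level, 6):
            counters[deeper] = 0
        numbers.append(".".join(str(c) for c in counters[:level]))
    return numbers


def numbered_id(slug, number):
    """Append the section number to a slug, in the format suggested above."""
    return f"{slug}--{number}--"


# Header levels in document order: h1, h2, h2, h3, h1
numbers = section_numbers([1, 2, 2, 3, 1])
print(numbers)  # ['1', '1.1', '1.2', '1.2.1', '2']
print(numbered_id("my_header_name", numbers[3]))  # my_header_name--1.2.1--
```

A document that skips a level (an h3 directly under an h1) would get a number like `1.0.1` with this sketch, which is odd to read but still unique.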
There’s a library for Ruby called Stringex that generates URI-friendly strings and handles Unicode characters elegantly. For example, the string “Rock & Roll” is converted to “rock-and-roll”, which is much more readable than “rock-roll”. A similar approach for auto-generated header IDs could make a nice CommonMark extension.
To handle duplicate header names, a simple approach would be to add a hyphen and a number to the end of the ID. E.g. the second instance of “Rock & Roll” would have the ID “rock-and-roll-2”. The parser would need to keep track of previous IDs and add the next number to the end of the ID based on some conventional rules.
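Roughly, the two ideas above could look like this in Python. This is only a sketch: `slugify` and `assign_ids` are hypothetical names, the `&` to `and` replacement is hard-coded to mimic the example rather than taken from Stringex, and the slug is restricted to ASCII letters and digits as a simplifying assumption.

```python
import re


def slugify(title: str) -> str:
    """Turn a header title into a lowercase, hyphen-separated slug."""
    text = title.lower().replace("&", " and ")
    # Collapse every run of characters other than a-z and 0-9 into one hyphen.
    text = re.sub(r"[^a-z0-9]+", "-", text)
    return text.strip("-")


def assign_ids(titles):
    """Return one ID per title, numbering repeated slugs from 2 upward."""
    used = set()
    ids = []
    for title in titles:
        base = slugify(title) or "section"  # fallback for titles with no usable characters
        candidate, n = base, 1
        while candidate in used:
            n += 1
            candidate = f"{base}-{n}"
        used.add(candidate)
        ids.append(candidate)
    return ids


print(assign_ids(["Rock & Roll", "Intro", "Rock & Roll"]))
# ['rock-and-roll', 'intro', 'rock-and-roll-2']
```

The `while` loop rather than a plain counter is there so that a header literally titled “Rock & Roll 2” cannot collide with a generated suffix.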
That’s still not enough if you have two different forum posts with the same headings on the same page. Also, some SEO people will not accept garbage at the tail of the ID.
International characters in IDs will work only in HTML5. Also, I don’t know an easy way to transliterate arbitrary Unicode characters into English in the browser (without huge mapping tables).
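To illustrate how far you get without mapping tables, here is a small Python sketch (not browser code) that uses only Unicode normalization from the standard library. `ascii_fold` is a hypothetical helper name. It strips diacritics from Latin letters but does no real transliteration, so non-Latin scripts are simply dropped.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Decompose characters (NFKD), then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(ascii_fold("Café Über"))  # Cafe Uber
print(repr(ascii_fold("Привет")))  # '' (Cyrillic has no ASCII decomposition)
```

Anything beyond accented Latin letters still needs transliteration data of some kind.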
I’d like to have automatic IDs, but I don’t see a universal solution that is good for everyone.
Adding a hyphen and a number to the IDs of duplicate header names is similar to what pandoc does. One problem, though, is that when you insert a like-named section in your document before the section that is linked to, it breaks the link because it changes the generated identifier.
This discussion has gone pretty far into the weeds from the original feature request.
What’s important to me is moving toward a world where I can have confidence that every header within an arbitrarily chosen HTML document will have a fragment identifier.
Therefore, this is a feature request for mandatory generation of fragment identifiers for all headers that don’t have one explicitly specified.
I do not consider an optional recommendation to be adequate, and neither do I consider a mechanism for author-specified explicit fragment identifiers (even if that mechanism is mandatory-to-implement) to be adequate.
I also don’t care about the details of the generation algorithm, or how conflicts are resolved (either within the document, or with surrounding chrome), although I acknowledge that the standard must specify both.
But I think that header identifiers should be case-insensitive to be consistent with normal link references and to be more flexible. For example, in this case the link looks better with a different capitalization:
## The Philosophy of CommonMark
You need to understand [the philosophy of CommonMark] because blah, blah, blah...
Am I missing something that makes it more advisable to be case-sensitive here?
Let me add something else for completeness:
The header label used in this kind of link will have to exclude inline formatting characters (emphasis or code) and include character entities unchanged (not as their single-character equivalent). Also, if the header contains links (it seems that it’s possible), only the link text will be used in the header label.
However, to make things easier, maybe the use of formatting, entities or links in a header should preclude the generation of an automatic link reference. I don’t know.
Generated labels should not allow arbitrary content. They must have a prefix or suffix for security. They should also be tunable, because multiple documents can be displayed on one page.
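A configurable prefix could be sketched in Python as follows. `make_id_generator` and the `post-1-` / `post-2-` prefixes are invented for the example; the slug rule (ASCII letters and digits, hyphen-separated) is an assumption, not something proposed in this thread.

```python
import re


def make_id_generator(prefix: str):
    """Return a function that builds header IDs namespaced under `prefix`."""
    def generate(title: str) -> str:
        slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        return f"{prefix}{slug or 'section'}"
    return generate


# Two documents rendered on the same page, each with its own prefix.
post_1 = make_id_generator("post-1-")
post_2 = make_id_generator("post-2-")
print(post_1("Login Form"))  # post-1-login-form
print(post_2("Login Form"))  # post-2-login-form
```

With a prefix, a user-written heading such as “Login Form” cannot produce the same ID as an element of the surrounding page (say `id="login-form"`), and the same heading in two posts stays distinct.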
Notice that how the section title is used to generate an ID is irrelevant to the mechanism that I suggest (and that is implemented in @jgm’s Pandoc).
I’m only talking about the header title (as written in the CM document) acting as a case-insensitive link label (as defined by the CM spec), with its spaces, punctuation and all. This label is used by the parser to create a link reference for the corresponding header. This link reference is then used by the parser to resolve any reference links that use it. The header label does not end up in the generated HTML.
I know that this feature goes beyond the original feature request, but I’ve included it here because I think that there is already a consensus that ids have to be generated for headers (in a safe way) and this feature needs generated ids to create the actual <a> elements.
@an3ss How would your proposal work if the writer changed the order of two headings that had the same text (e.g. both called “The Philosophy of CommonMark”)? This seems to be the biggest issue for automatically generated header IDs.
Of course, the writer has to manually keep links in sync with changes to header titles and to the order of repeated headers.
If you have two headers titled “Scene 1”, like in your example, then the following header references will be made available automatically (the IDs are hypothetical auto-generated IDs):
So, if you use the reference link [Scene 1 #1] (or just [Scene 1]), you will be pointing to the first “Scene 1” header that occurs in the document, and with [Scene 1 #2] you’ll be pointing to the second.
If at some point you decide to exchange the contents of the two “Scene 1” sections, then you have to take care of updating all links to them accordingly (exchange #1 and #2 in their link labels).
Note that the generated id is irrelevant with this mechanism; you are linking to the header with title “Scene 1” in the order that you specify, that’s all. The parser will take care of generating the id and using it in the generated <a> element.
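As a rough Python sketch of this mechanism: the parser could build a table of implicit link references from the headers it has seen. The names `normalize_label` and `build_header_refs` and the IDs `scene-1`, `scene-2` and `scene-1-2` are all hypothetical. The label normalization (case folding plus whitespace collapsing) is meant to approximate how CommonMark matches link labels.

```python
def normalize_label(label: str) -> str:
    """Collapse whitespace and case-fold, approximating CommonMark label matching."""
    return " ".join(label.split()).casefold()


def build_header_refs(headers):
    """Build implicit link references from (title, generated_id) pairs in document order."""
    counts = {}
    refs = {}
    for title, header_id in headers:
        key = normalize_label(title)
        counts[key] = counts.get(key, 0) + 1
        # "Title #n" always points at the n-th header with that title.
        refs[normalize_label(f"{title} #{counts[key]}")] = f"#{header_id}"
        # The bare title points at the first header with that title.
        refs.setdefault(key, f"#{header_id}")
    return refs


refs = build_header_refs([
    ("Scene 1", "scene-1"),
    ("Scene 2", "scene-2"),
    ("Scene 1", "scene-1-2"),
])
print(refs[normalize_label("scene 1")])     # #scene-1
print(refs[normalize_label("Scene 1 #2")])  # #scene-1-2
```

The sketch does not handle a header whose own title already ends in something like “#2”, nor explicitly defined references that should take precedence.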
On the other hand, sometimes it may be better to use explicit ids, declaring the id in the header (with whatever syntax is decided):
### Scene 1 {#they_kiss}
and defining the link reference explicitly:
[They kiss]: #they_kiss
so you can link to the scene with, for example:
in [the scene where they kiss][they kiss].
or simply:
in the scene where [they kiss].
This option is perfectly compatible with the implicit header references that I am proposing.
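For illustration, here is a small Python sketch of how a parser might pick up an explicit ID from a header line. It assumes the `{#id}` suffix used in the example above, which is only one possible syntax. `parse_heading` is a hypothetical name, and closing `#` sequences and Setext headers are ignored.

```python
import re

# ATX header: 1-6 hashes, whitespace, title, optional trailing {#explicit_id}.
HEADING = re.compile(
    r"^(#{1,6})[ \t]+(.*?)(?:[ \t]+\{#([A-Za-z][\w-]*)\})?[ \t]*$"
)


def parse_heading(line: str):
    """Return (level, title, explicit_id or None), or None if not a header line."""
    match = HEADING.match(line)
    if match is None:
        return None
    hashes, title, explicit_id = match.groups()
    return len(hashes), title, explicit_id


print(parse_heading("### Scene 1 {#they_kiss}"))
# (3, 'Scene 1', 'they_kiss')
print(parse_heading("## The Philosophy of CommonMark"))
# (2, 'The Philosophy of CommonMark', None)
```

When `explicit_id` is `None`, the parser would fall back to whatever automatic ID generation is agreed on, so both options can coexist.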