Feature request: automatically generated ids for headers

My old view was that this is merely a writer implementation issue. The spec just tells you what to parse as a header, what its contents are, and what level it is. It’s up to the writer to determine exactly how to render it in a given format.

But of course I do see the problem if different implementations use different schemes for auto-numbering IDs. It means that a document that works on one implementation may not work on another (because links to headers are broken). So maybe it isn’t crazy to explore adding something about this to the spec.

On the other hand, there are many issues. (I have experience with this from pandoc, which has supported automatic header ID numbering for many years.) Chief among them:

  • How to avoid duplicate IDs, both with other headers and with IDs that
    may be defined in other ways.

  • How to deal with punctuation in headers.

Handling of non-ASCII alphabetic characters also used to be a problem, but HTML5 no longer restricts identifiers to ASCII.

Pandoc’s system is described here:
http://johnmacfarlane.net/pandoc/README.html#header-identifiers-in-html-latex-and-context
See also the section on “Implicit header references.”

I’ll note that I’ve often thought, in dealing with issues that come up with these, that it would be best just to make people specify header IDs explicitly when they’re needed.

The problem is twofold:

  • you want ids to be consistent within a document, e.g. a reference to #foo should always land on the right heading,
  • you want ids to be generally stable, which gets hairier if there is an include directive in the mix.

The former can be addressed using some [][]-compatible anchor system; the latter probably requires more attention.

Even if it isn’t specified, it should at least get a mention in the spec as a recommendation.

I think I like the method used on Wikipedia, as mentioned by jericson:

http://talk.commonmark.org/t/feature-request-automatically-generated-ids-for-headers/115/17?u=mofosyne

It’s simple and mostly gets the job done.

If collisions are an issue, try adding section numbers to the ID, like id="my_header_name--3.2.1--". The benefit of this approach is that section numbers are guaranteed to be unique within a document. (Plus, you won’t need to track duplicate IDs.)
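A minimal Python sketch of that idea, under some assumptions: headings arrive as (level, title) pairs, the slug uses underscores as in the example id, and a skipped level is counted as 0. The names `section_numbers` and `numbered_ids` are hypothetical, not part of any spec or implementation.

```python
import re


def section_numbers(levels):
    """Return dotted section numbers (e.g. '3.2.1') for a list of heading levels."""
    counters = []
    numbers = []
    for level in levels:
        # Forget counters that belong to deeper levels than this heading.
        del counters[level:]
        # Assumption: a skipped level (h1 followed by h3) is padded with 0.
        while len(counters) < level:
            counters.append(0)
        counters[level - 1] += 1
        numbers.append(".".join(str(c) for c in counters))
    return numbers


def numbered_ids(headings):
    """Build ids of the form 'my_header_name--3.2.1--' from (level, title) pairs."""
    numbers = section_numbers([level for level, _ in headings])
    ids = []
    for (_, title), number in zip(headings, numbers):
        slug = re.sub(r"\W+", "_", title.lower()).strip("_")
        ids.append(f"{slug}--{number}--")
    return ids
```

Two headings with the same title get different ids because their section numbers differ, but the number (and so the id) changes whenever sections are reordered or a section is inserted earlier in the document.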

There’s a library for Ruby called Stringex that generates URI-friendly strings and handles unicode characters elegantly. For example, the string “Rock & Roll” is converted to “rock-and-roll” which is much more readable than “rock-roll”. A similar approach for auto generated header ids could make a nice CommonMark extension.

To handle duplicate header names, a simple approach would be to add a hyphen and a number to the end of the ID. E.g. the second instance of “Rock & Roll” would have the ID “rock-and-roll-2”. The parser would need to keep track of previous IDs and add the next number to the end of the ID based on some conventional rules.
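As a rough illustration of the hyphen-and-number approach, here is a minimal Python sketch. It is neither Stringex nor pandoc’s algorithm: `slugify` and `assign_ids` are hypothetical names, and turning “&” into “and” is a hard-coded assumption that only makes sense for English titles.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Turn a heading title into a lowercase, hyphen-separated identifier."""
    # Assumption: spell out '&' so "Rock & Roll" becomes "rock-and-roll".
    text = title.replace("&", " and ")
    text = unicodedata.normalize("NFKC", text).lower()
    # Keep Unicode letters, digits and underscores; collapse the rest to hyphens.
    text = re.sub(r"[^\w]+", "-", text)
    # Fall back to a fixed name if nothing usable is left.
    return text.strip("-") or "section"


def assign_ids(titles):
    """Give every title a unique id, numbering repeats from 2."""
    used = set()
    ids = []
    for title in titles:
        base = slugify(title)
        candidate = base
        n = 2
        while candidate in used:
            candidate = f"{base}-{n}"
            n += 1
        used.add(candidate)
        ids.append(candidate)
    return ids
```

The suffix depends on document order, so the id of a repeated heading is only stable as long as no like-named heading is added or removed before it.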

  1. That’s still not enough if you have two different forum posts with the same headings on the same page. Also, some SEO people will object to having garbage at the tail of the ID.
  2. International characters in IDs will work only in HTML5. Also, I don’t know an easy way to transliterate arbitrary Unicode characters into English in the browser (without huge mapping tables).

I’d like to have automatic IDs, but I don’t see a universal solution that works for everyone.

@jgm I quite like the self referencing inline links you seem to be using in the spec code: [line](@line)

Could the same be used for headings perhaps?

## Great attractive heading [#](@wow-look-at-this-heading)

For:

Great attractive heading #

Implementations could then choose how to render the #


+++ Chris Alley [Dec 08 14 07:10 ]:

There’s a library for Ruby called Stringex that generates URI-friendly strings and handles unicode characters elegantly. For example, the string “Rock & Roll” is converted to “rock-and-roll” which is much more readable than “rock-roll”. A similar approach for auto generated header ids could make a nice CommonMark extension.

To handle duplicate header names, a simple approach would be to add a hyphen and a number to the end of the ID. E.g. the second instance of “Rock & Roll” would have the ID “rock-and-roll-2”. The parser would need to keep track of previous IDs and add the next number to the end of the ID based on some conventional rules.

This is similar to what pandoc does. One problem, though, is that when you insert a like-named section in your document before the section that is linked to, it breaks the link because it changes the generated identifier.

This discussion has gone pretty far into the weeds from the original feature request.

What’s important to me is moving toward a world where I can have confidence that every header within an arbitrarily chosen HTML document will have a fragment identifier.

Therefore, this is a feature request for mandatory generation of fragment identifiers for all headers that don’t have one explicitly specified.

I do not consider an optional recommendation to be adequate, and neither do I consider a mechanism for author-specified explicit fragment identifiers (even if that mechanism is mandatory-to-implement) to be adequate.

I also don’t care about the details of the generation algorithm, or how conflicts are resolved (either within the document, or with surrounding chrome), although I acknowledge that the standard must specify both.

I would suggest also having implicit header link references, using the header title as the link’s label. I haven’t seen this discussed yet.

Specifically, for each header in the document, the following link reference would be automatically generated by the parser:

[Header title]: #generated-id

If N headers shared the title, then N+1 link references would be generated:

[Header title]: #generated-id
[Header title #1]: #generated-id
[Header title #2]: #generated-id-for-2nd-repeat-header
 ...

If a link reference defined by the author had the same label as an implicit header reference, then the defined link reference would take preference.

This would make including links to sections in a document a breeze:

see [Header title]
see [Header title #2]
this is covered in the [section on subject][Header title]
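To make the proposal concrete, here is a small Python sketch of how a parser might build these implicit references. It assumes the heading ids have already been generated somehow, and it matches labels the way the spec matches ordinary link labels (case-folded, whitespace collapsed). `normalize_label` and `implicit_references` are hypothetical names.

```python
from collections import defaultdict


def normalize_label(label: str) -> str:
    """Case-fold and collapse whitespace, as for ordinary link labels."""
    return " ".join(label.casefold().split())


def implicit_references(headings, explicit=None):
    """Map labels to destinations for a list of (title, id) pairs.

    For N headings sharing a title this yields N + 1 entries: the bare
    title (pointing at the first occurrence) plus 'title #1' .. 'title #N'.
    """
    groups = defaultdict(list)
    for title, ident in headings:
        groups[normalize_label(title)].append(ident)

    refs = {}
    for label, idents in groups.items():
        refs[label] = "#" + idents[0]
        for n, ident in enumerate(idents, start=1):
            refs[f"{label} #{n}"] = "#" + ident

    # Link references defined by the author take precedence.
    for label, destination in (explicit or {}).items():
        refs[normalize_label(label)] = destination
    return refs
```

With two headings titled “Header title”, looking up `normalize_label("Header title #2")` in the result gives the id of the second one, while an author-defined `[Header title]: ...` replaces the implicit entry.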

What do you think?


+++ Andrés M [Jan 07 15 11:16 ]:

 I would suggest also having implicit header link references, using the
 header title as the link's label. I haven't seen this discussed yet.

Yes, I think this makes sense. Pandoc already does this (though without the disambiguation strategy you suggest).

That’s great!

But I think that header identifiers should be case-insensitive to be consistent with normal link references and to be more flexible. For example, in this case the link looks better with a different capitalization:

## The Philosophy of CommonMark

You need to understand [the philosophy of CommonMark] because blah, blah, blah...

Am I missing something that makes it more advisable to be case-sensitive here?


Let me add something else for completeness:

The header label used in this kind of link will have to exclude inline formatting characters (emphasis or code) and include character entities unchanged (not as their single-character equivalent). Also, if the header contains links (it seems that it’s possible), only the link text will be used in the header label.

However, to make things easier, maybe the use of formatting, entities or links in a header should preclude the generation of an automatic link reference. I don’t know.
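For the first option (deriving a label from a formatted header), a very rough Python approximation might look like the following. It works on the raw heading source with regular expressions, which is an assumption made for brevity; a real implementation would walk the parsed inline nodes instead, and this version would also strip underscores inside words. `header_label` is a hypothetical name.

```python
import re


def header_label(raw: str) -> str:
    """Reduce raw heading source to a plain label for reference matching."""
    # Inline links: keep only the link text, drop the destination.
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", raw)
    # Drop emphasis and code-span delimiters; entities such as &amp; are kept as written.
    text = re.sub(r"[*_`]", "", text)
    # Collapse runs of whitespace left behind by the removals.
    return " ".join(text.split())
```

For example, `header_label("The *Philosophy* of [CommonMark](http://commonmark.org)")` gives `"The Philosophy of CommonMark"`.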


Note:

Generated labels should not allow arbitrary content: they must have a prefix or suffix for security. They should also be configurable, because multiple documents can be displayed on one page.

Notice that how the section title is used to generate an id is irrelevant to the mechanism that I suggest (and that is implemented in @jgm’s Pandoc).

I’m only talking about the header title (as written in the CM document) acting as a case-insensitive link label (as defined by the CM spec), with its spaces, punctuation and all. This label is used by the parser to create a link reference for the corresponding header. This link reference is then used by the parser to resolve any reference links that use it. The header label does not end up in the generated HTML.

I know that this feature goes beyond the original feature request, but I’ve included it here because I think that there is already a consensus that ids have to be generated for headers (in a safe way) and this feature needs generated ids to create the actual <a> elements.


I understand. That’s an implementation note for everyone. IDs are not generated now, but I think they will be at some point, because it’s a very useful thing.


@an3ss How would your proposal work if the writer changed the order of two headings that had the same text (e.g. both called “The Philosophy of CommonMark”)? This seems to be the biggest issue for automatically generated header IDs.

Is it normal to have headers with exactly the same content?

You might have a document with headings like this:

## Episode 1
### Scene 1
### Scene 2
## Episode 2
### Scene 1
### Scene 2

Of course, the writer has to manually keep links in sync with changes to header titles and to the order of repeated headers.

If you have two headers titled “Scene 1”, like in your example, then the following header references will be made available automatically (the ids are hypothetical auto-generated ids):

[Scene 1]: #id-of-scene-1-1
[Scene 1 #1]: #id-of-scene-1-1
[Scene 1 #2]: #id-of-scene-1-2

So, if you use the reference link [Scene 1 #1] (or just [Scene 1]), you will be pointing to the first “Scene 1” header that occurs in the document, and with [Scene 1 #2] you’ll be pointing to the second.

If at some point you decide to exchange the contents of the two “Scene 1” sections, then you have to take care of updating all links to them accordingly (exchange #1 and #2 in their link labels).

Note that the generated id is irrelevant with this mechanism; you are linking to the header with title “Scene 1” in the order that you specify, that’s all. The parser will take care of generating the id and using it in the generated <a> element.

On the other hand, sometimes it may be better to use explicit ids, declaring the id in the header (with whatever syntax is decided):

### Scene 1 {#they_kiss}

and defining the link reference explicitly:

[They kiss]: #they_kiss

so you can link to the scene with, for example:

in [the scene where they kiss][they kiss].

or simply:

in the scene where [they kiss].

This option is perfectly compatible with the implicit header references that I am proposing.

I think this solution is adequate for shorter documents (say, a Wikipedia page) where repeated headers are unlikely. Requiring explicit header IDs would add significant overhead for the writer. For longer documents, that likelihood increases and it becomes more difficult for the writer to keep track of which links point where. It might be wise for the writer to define explicit IDs in the case of longer documents.

So, automatic header IDs, with the option of overwriting them with explicit header IDs.
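A sketch of the override in Python, assuming the trailing `{#id}` syntax used in the example earlier in this thread (the actual syntax is undecided) and an automatic id supplied by the caller. `split_explicit_id` and `heading_id` are hypothetical names.

```python
import re

# Assumed syntax: a trailing "{#some-id}" at the end of the heading text.
EXPLICIT_ID = re.compile(r"\s*\{#([A-Za-z][\w-]*)\}\s*$")


def split_explicit_id(heading_text: str):
    """Return (title, explicit_id); explicit_id is None when none is given."""
    match = EXPLICIT_ID.search(heading_text)
    if match is None:
        return heading_text.strip(), None
    return heading_text[: match.start()].strip(), match.group(1)


def heading_id(heading_text: str, automatic_id: str) -> str:
    """Prefer the author's explicit id and fall back to the automatic one."""
    _, explicit = split_explicit_id(heading_text)
    return explicit if explicit is not None else automatic_id
```

Here `heading_id("Scene 1 {#they_kiss}", "scene-1")` returns `"they_kiss"`, while `heading_id("Scene 1", "scene-1")` returns `"scene-1"`.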

Another issue with automatic IDs is that they may clash with other IDs on the page. Imagine two posts in a forum topic having the same header text. Now, suppose the first post is deleted. The order of the IDs would change and any links to the second post would break. As a solution, the parser could accept an optional namespace parameter that would be added to the start of the ID. The ID of the header “The Philosophy of CommonMark” would become #discourse-topic-115-post-40-the-philosophy-of-commonmark, for example.
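In code, the namespace idea could be as small as this Python sketch. The function name `namespaced_id` and the hyphen separator are assumptions; the header id is taken as already generated.

```python
def namespaced_id(header_id: str, namespace: str = "") -> str:
    """Prefix a generated header id with an optional, caller-supplied namespace."""
    if not namespace:
        return header_id
    return f"{namespace}-{header_id}"


# The example from the post: topic 115, post 40.
example = namespaced_id("the-philosophy-of-commonmark", "discourse-topic-115-post-40")
```

Because the namespace comes from the embedding application rather than from the document, deleting one post does not change the ids generated for the others.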

I agree.

I assume you’re talking here about automatic IDs in general, not implicit header references.

Let me insist, just in case: In order to use IHRs, an author does NOT need to know anything about automatic or explicit header IDs. As mentioned before, IHRs are already implemented in Pandoc. They are documented here:
http://johnmacfarlane.net/pandoc/README.html#extension-implicit_header_references

That said, I feel we shouldn’t discuss IHRs under this topic any further. If there is any chance that they end up in the core spec or as an extension, then @jgm or @codinghorror will create a new topic when the time is right :wink:

My vote is to include IHRs in the core spec because they are simple, useful, and can be implemented in a backwards-compatible way.
