Feature request: automatically generated ids for headers

chrisalley · June 28, 2015, 12:23am

In practice that seems likely. In theory, h5-h6 pairs could be identical.

This feature should probably be an extension. Sites could then use either explicit IDs that are guaranteed not to change (but add clutter to the document) or implicit IDs knowing that there is a risk that the headings may be reordered, breaking links. The length of the document (and likelihood of duplicate headings) could be a deciding factor.

zwol · June 28, 2015, 12:49am

This feature should probably be an extension

What part of “this feature must be in the core as mandatory to implement, because nothing less than that moves us toward a world where every heading in every HTML document has a fragment identifier” is unclear? Are my posts not actually getting through or something?

chrisalley · June 28, 2015, 12:59am

It’s clear what you’re requesting, @zwol. What’s not clear is why every heading needs to have an ID.

zwol · June 28, 2015, 1:19am

What’s not clear is why every heading needs to have an ID

So that it is always possible to link to a specific section of a document. Why else? (Yes, I regularly encounter cases where I can’t write an appropriate hyperlink because the section header I need to point at doesn’t have an ID.)

chrisalley · June 28, 2015, 3:31am

The author may not wish to allow headings to be linked to. For example, the headings may be subject to change in the future (we discussed reordering headings above), so a link to the overall document may be preferred. This is why an extension may be more appropriate (with a predictable method of generating the IDs for the implementations that adopt the extension).

Crissov · June 28, 2015, 1:32pm

For what it’s worth, I agree with @zwol that every heading (including captions and maybe every reference) should automatically become a link target. Since neither XLink nor Selectors can be expected to be used in general, explicit IDs remain the only viable solution. If CommonMark was (to become) a modular specification where everything but the core was optional to implement (i.e. an extension), there should be a module for implicit header references which also included the requirement for automatically generated IDs and probably a way for authors to set an explicit value.

Such internal links, which include an automatically generated TOC, would be the major use of automatic heading identifiers, I assume. If IDs were generated from textual content (or were arbitrary/random), they would stay in sync automatically when the author rearranges the document structure, except perhaps for headings with canonically equal content. If IDs were generated by hierarchic position, on the other hand, they would be safe against subsequent textual changes. I don’t see how we could get both, but I prefer “speaking” names for all parts of a URL.
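The trade-off described above can be made concrete with a minimal sketch. Both ID schemes here are assumptions made up for illustration (`text_id`, `positional_ids` and the `h1.2` format are hypothetical names, not anything the spec or a particular implementation defines). It shows that a text-based ID survives reordering but not retitling, while a position-based ID behaves the other way round.

```python
import re

def text_id(title):
    # Hypothetical content-based ID: lowercase, non-alphanumeric runs -> one dash.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def positional_ids(levels):
    # Hypothetical position-based IDs such as "h1.2", derived from heading levels only.
    counters, ids = [], []
    for level in levels:
        del counters[level:]              # leaving a deeper subtree resets its counters
        while len(counters) < level:      # entering a deeper level starts at zero
            counters.append(0)
        counters[level - 1] += 1
        ids.append("h" + ".".join(str(c) for c in counters))
    return ids

def describe(headings):
    # Map each heading title to its (text-based, position-based) ID pair.
    positions = positional_ids([level for level, _ in headings])
    return {title: (text_id(title), pos)
            for (_, title), pos in zip(headings, positions)}

original = [(1, "Top chapter"), (2, "First section"), (2, "Next section")]
reordered = [(1, "Top chapter"), (2, "Next section"), (2, "First section")]
retitled = [(1, "Top chapter"), (2, "First section"), (2, "Following section")]

print(describe(original)["Next section"])        # ('next-section', 'h1.2')
print(describe(reordered)["Next section"])       # ('next-section', 'h1.1')
print(describe(retitled)["Following section"])   # ('following-section', 'h1.2')
```

After the reorder only the positional ID changes; after the retitle only the text-based ID changes, which is the "can't get both" situation from the paragraph above.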

I fail to see @chrisalley’s latest point since (external) links with hash target will also work if that target is not found, i.e. the reader gets directed to the top of the whole document. (There are possible scenarios where the reader would see an unintended section instead.)
I also consider it not very important for an author or reader to be able to predict the exact ID of a heading (by either position or content) by applying some canonization algorithm mentally.

There is one way we could deal with internal links by structure rather than name: symbolic relative links, but I should probably open a separate thread for that.

# Top chapter
## First section
## Previous section
## Current section
Chainable relative links for simple siblings and ancestors:
* [Current][@]
* [Top or Upper][^] – cf. [^footnote]
* [Previous][<]
* [Next][>]
* [First][|<]
* [Last][>|]
* [Document or Top][.] – almost as in a POSIX file system
## Next section
## Last section
Implicit links like [top chapter] work always, 
whereas the following explicit ones only work in certain implementations:
* [hierarchic][#heading1]
* [hierarchic][#chapter1]
* [textual][#top chapter]
* [textual][#top-chapter]
* [textual][#top_chapter]
* [textual][#top%20chapter]
* [textual][#topchapter]
* [textual][#explicit override ID]

  [#Top chapter]: explicit override ID

chrisalley · June 29, 2015, 9:38am

I fail to see @chrisalley’s latest point since (external) links with hash target will also work if that target is not found, i.e. the reader gets directed to the top of the whole document. (There are possible scenarios where the reader would see an unintended section instead.)

As you said, the reader may see an unintended section in some scenarios. That was essentially my point. By making the implicit header IDs opt-in, the developer first has to make a decision as to whether this is acceptable behaviour. If it is not considered acceptable behaviour, the developer can choose the explicit header ID extension instead.

Crissov · June 29, 2015, 12:06pm

I consider these scenarios too unlikely to outweigh the benefits of linkable headings in general. They are another argument in favor of name-based IDs, though, because these are more likely to stay unique over time than simple hierarchy-based ones.

matmuchrapna · June 29, 2015, 1:06pm

Why can’t this feature be delegated to extensions?

After some time, everybody will be able to choose the implementation they prefer.

jgm · June 29, 2015, 5:02pm

My main worry about automatically generated header IDs is that, in order to ensure uniqueness, you have to add subscripts or use some other mode of disambiguation. And then the problem is that the ID of a particular element might change due to changes elsewhere in the document, which can lead to broken links. @Crissov’s proposal is quite nice, and would substantially reduce the need for such disambiguation, but not eliminate it entirely. (Maybe it would eliminate it enough?)

For internal links it’s nice if the Markdown renderer creates both the target IDs and the links, as is done in pandoc:

## My header

See [My header], or [above][My header].

But of course there’s still a problem about disambiguation, and links from outside the document still need to know the generated ID.
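To illustrate the disambiguation worry, here is a minimal sketch. It assumes a made-up scheme in which repeated heading texts receive a numeric suffix in document order; `slug` and `unique_ids` are hypothetical helpers, not pandoc's actual algorithm.

```python
import re
from collections import Counter

def slug(title):
    # Assumed slug rule: lowercase, non-alphanumeric runs -> one dash.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def unique_ids(titles):
    # Assign IDs in document order; the 2nd, 3rd, ... repeat gets -1, -2, ...
    # (This sketch does not guard against a heading whose own slug
    # already looks like "examples-1".)
    seen = Counter()
    ids = []
    for title in titles:
        base = slug(title)
        count = seen[base]
        ids.append(base if count == 0 else f"{base}-{count}")
        seen[base] += 1
    return ids

before = ["Install", "Examples", "Usage", "Examples"]
# The author later adds a new "Examples" section at the very top.
after = ["Examples", "Install", "Examples", "Usage", "Examples"]

print(unique_ids(before))  # ['install', 'examples', 'usage', 'examples-1']
print(unique_ids(after))   # ['examples', 'install', 'examples-1', 'usage', 'examples-2']
```

The last section was reachable as `#examples-1` before the edit and is `#examples-2` afterwards, while `#examples-1` now silently points at a different section: a change elsewhere in the document has altered the meaning of an existing link.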

hulkur · June 29, 2015, 8:34pm

Being new to Markdown, I don’t know all the right syntax, but I want to mention my use case for referencing headers.

Previously I used my own markup and renderer, and there I had auto-generated IDs for headers and a {TOC} tag which generated a table of contents for these headers. This way I could change headers as needed and not worry about changing them in the ToC.

Of course, it had the problems mentioned above: autogenerated IDs were not constant over time and not predictable, so they were not linkable from outside. For that reason I also added explicit IDs.

Looking at the spec, I think the most understandable approach would be to use reference syntax and explicit IDs (implicitly generated IDs can be added). Usage would not be limited to the ToC; references could also be used internally in the text.

I think something like the following would work:

Table of Content
1. [header_ref]
1.1 [section_ref]
1.2 [other_ref]

[header_ref]: # Top Header {#explicitID}
Some text with mention of [section_ref] to link internally.
Maybe this could also have [section_ref](some other link text) for internal linking

[section_ref]: ## Section Title
Some more text

There is still the problem of reordering headers, but this is easier to handle with only references to worry about (references don’t change that often).

The ToC could be fully auto-generated (by post-processing) if requested (like my {TOC} tag), but that is a whole other issue.

On the topic of conflicting autogenerated IDs: add a prefix like md- or mdref-.
You can’t account for all cases, but you can make an educated guess that it “will work in most cases”.

asbjornu · January 28, 2016, 8:39am

Sorry for bumping this old thread, but it doesn’t seem to have reached consensus and I think Markdown/CommonMark really needs this, so I want to give @zwol a +1.

The author may not wish to allow headings to be linked to. For example, the headings may be subject to change in the future (we discussed reordering headings above), so a link to the overall document may be preferred.

A link to a non-existing header will become a link to the overall document, so HTML takes care of that use case right out of the box. I thus find this to be a weak argument against adding an id attribute to all headers.

I also want to add that I agree with @an3ss in using the text content of the header as the reference, leaving it to the implementation how the IDs are generated. Mandating how the resulting HTML needs to look is, in my opinion, not required; it is only required that CommonMark itself can make self-references to headers defined within the same document.

This of course requires an id attribute to be added to all headers, but the algorithm that generates these id attributes does not need to be the same across implementations. I thus think this is a pretty simple and small addition to the CommonMark language.

chrisalley · January 29, 2016, 9:02am

If the headings are reordered, an old link to the heading would now point to the wrong heading, which could mislead the reader. For this reason, the author may not wish to allow direct links to the heading if it is subject to change later.

If the algorithms are different, and a CommonMark parser is swapped out with another CommonMark parser, the IDs may no longer be the same. This strikes me as problematic; it’s preferable that links to headings continue to work across implementations.

Crissov · January 31, 2016, 2:42pm

That’s only true if ordinal IDs are used (e.g. for multiple headings with the same textual content). That’s an edge case, and alternatives that avoid the problem have been demonstrated.

Internal links would still work, because IDs are not used directly by authors.

If this was indeed considered a problem to be solved by the spec, the solution would be explicit ID overrides. I’m not 100% sure how this should look. I present two variants here:

# Variant 1
## Implicit ID

Paragraph with links to [implicit ID] and [explicit ID][#ExplicitID]. 

  [Implicit ID]: #ExplicitID

(Both links are the same if the parser supports explicit IDs, 
 otherwise both links will fail in output, 
 because there is no `ID` attribute value ‘ExplicitID’.)

# Variant 2
## Implicit ID

Paragraph with links to [implicit ID] and [explicit ID][#ExplicitID]. 

  [#Implicit ID]: #ExplicitID

(Only the second link works – perhaps – if the parser supports explicit IDs, 
 otherwise only the first link works, 
 unless the output format supports multiple IDs per element.)

  [Implicit ID]: http://example.com/overwritten

(Both links now work if the parser supports explicit IDs, 
 but they have different targets, 
 otherwise only the first link, to an external site, works.)

chrisalley · February 1, 2016, 8:54am

My point was in response to @asbjornu’s comment stating that the algorithms for generating the IDs could be different. If indeed the algorithms are different then the generated links (from those IDs) could be different.

Regarding the reordering of headings, I agree that the IDs could be generated in a way that ensures uniqueness in most, but not all, cases. Whether this can be done in a way that keeps the URLs aesthetically pleasing, I’m not sure. If the URL is considered part of a website’s design, then flat ordinal IDs might be preferable (to the designer/author) over longer IDs which concatenate heading strings together and are less prone to duplication (as you suggested earlier, @Crissov). Aesthetic considerations of the generated IDs shouldn’t be overlooked here, because designers may wish to make their URLs beautiful and easy to read and write.

tmpfs · February 2, 2016, 11:56am

I agree with this, however maybe implicit IDs is the first thing to support.

Conflicting IDs between Markdown rendering and a containing HTML document should be resolved by a preprocessor (or some post-processing or validation) so I don’t think that’s too much of an issue if IDs were automatically generated.

If IDs are automatically generated then it needs to be clearly specified, but this sounds like a feature extension and post 1.0.

However, if they are implicit then I think something like:

#headingid Heading Title
Heading Title
=============heading-id

Is the cleanest and is in tune with the info string on fenced code blocks. However, you would not be able to use spaces in the ID, which I (and, I think, most people) would consider bad practice.

I believe I saw a discussion about requiring a single space after the # in ATX headings; that would also need to be ratified for the above to be possible.

Jeremy_Morton · February 2, 2016, 1:59pm

My 2 cents: I pretty much agree with hulkur here.

It would be a nice feature to have anchor links in Markdown, but given the problems of auto-generating them based on heading text (heading text changing, identical headings being re-ordered) it should probably require an explicit ID. The explicit ID could either be applied to the header or even just put anywhere in the document in order to generate an empty a tag with that ID, to link to that point in the document. Normalize the ID by lowercasing it, changing spaces to dashes, and removing all other non-dash punctuation.

I would also prefix the IDs in an attempt at giving them a unique namespace (obviously one can never guarantee this unless one has access to the entire HTML document but it’s a reasonable precaution) - perhaps markdown-anchor-? For example:

[# Step 2 - config]{Step 2!}
Configure the software by doing stuff...
More text...
Here is somewhere inline you can []{inline-link}link to.

… generates:

<h1 id="markdown-anchor-step-2">Step 2 - config</h1>
<p>Configure the software by doing stuff...
More text...
Here is somewhere inline you can <a id="markdown-anchor-inline-link"></a>link to.</p>

If there are any duplicate anchor IDs, the parser should warn the user (or maybe refuse to generate any output until the duplicates are removed).

Any links in the document could then be normalized so that they linked to the generated fragment IDs:

[Link to heading](#STEP-2)

[Link to inline](#inline link)

… generates:

<p><a href="#markdown-anchor-step-2">Link to heading</a></p>
<p><a href="#markdown-anchor-inline-link">Link to inline</a></p>

JavaScript could optionally be used to allow linking to fragments without the markdown-anchor- prefix, much as GitHub does.
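The normalization rule described in this post (lowercase, spaces to dashes, drop other punctuation, add a prefix) could be sketched roughly as follows. The function names are hypothetical, and treating "punctuation" as ASCII punctuation only is an assumption; the post does not say how non-ASCII characters should be handled.

```python
import string

PREFIX = "markdown-anchor-"  # namespace prefix suggested in the post

def normalize_id(raw):
    # Lowercase, turn spaces into dashes, then drop all punctuation except dashes.
    # Assumption: "punctuation" means ASCII punctuation (string.punctuation).
    lowered = raw.strip().lower().replace(" ", "-")
    kept = "".join(ch for ch in lowered
                   if ch == "-" or ch not in string.punctuation)
    return PREFIX + kept

def find_duplicates(raw_ids):
    # Return normalized IDs that occur more than once, so a parser could warn.
    seen, dupes = set(), []
    for raw in raw_ids:
        nid = normalize_id(raw)
        if nid in seen and nid not in dupes:
            dupes.append(nid)
        seen.add(nid)
    return dupes

print(normalize_id("Step 2!"))       # markdown-anchor-step-2
print(normalize_id("STEP-2"))        # markdown-anchor-step-2
print(normalize_id("inline link"))   # markdown-anchor-inline-link
print(find_duplicates(["Step 2!", "inline-link", "STEP-2"]))
```

Because link targets such as `#STEP-2` go through the same function as the anchor definitions, a link and its anchor meet at the same generated fragment ID, which is what the examples above rely on.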

Crissov · February 2, 2016, 11:15pm

I prefer overwriting implicit IDs with extended reference link syntax, i.e. an indirect approach. If there needed to be direct explicit IDs (and classes) anyhow, I always thought the obvious way was like this (remember line suffixes?):

# Heading Title # .class #headingid "title"

Heading Title
============= #heading-id .class @for

jgm · February 2, 2016, 11:30pm

I agree that different sites may have different needs for automatically generated header IDs. So I’d hate to specify this in the spec. But it might be worth adding implicit links to headers, leaving the assignment of IDs up to the implementation. They might work like this: [My first section] links to a section with contents My first section, unless the reference label [My first section] is explicitly defined in the document. If there are multiple sections with contents My first section, it links to the first one.

This would be fairly simple to implement and would solve most everyday section linking needs. For more control, we could consider a way to specify IDs explicitly. In pandoc and several other implementations, you do it this way:

# Heading {#myid}

[EDIT: of course, leaving the precise IDs undefined complicates testing.]
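The lookup rule proposed in this post can be sketched as below. This is only an illustration under stated assumptions: `resolve_reference` and `normalize_label` are hypothetical names, the IDs (`sec-1` and so on) stand in for whatever an implementation chooses, and matching heading text with the same case-folding and whitespace-collapsing used for reference labels is an assumption, not something the post specifies.

```python
def normalize_label(label):
    # Assumed matching rule, borrowed from reference-label matching:
    # case-fold, trim, and collapse internal whitespace.
    return " ".join(label.split()).casefold()

def resolve_reference(label, definitions, headings):
    """Resolve [label] to a link destination, or None if nothing matches.

    definitions: dict of normalized label -> URL (explicit reference definitions)
    headings:    list of (heading text, implementation-chosen ID) in document order
    """
    key = normalize_label(label)
    if key in definitions:                 # an explicit definition always wins
        return definitions[key]
    for text, heading_id in headings:      # otherwise the first matching heading
        if normalize_label(text) == key:
            return "#" + heading_id
    return None

headings = [("My first section", "sec-1"),
            ("Other", "sec-2"),
            ("My first section", "sec-3")]

print(resolve_reference("My first section", {}, headings))   # #sec-1
print(resolve_reference("My first section",
                        {"my first section": "http://example.com/"},
                        headings))                            # http://example.com/
print(resolve_reference("Missing", {}, headings))             # None
```

Note that the author never writes an ID here; only the heading text appears in the source, so the ID format can differ between implementations without breaking internal links.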

tmpfs · February 3, 2016, 1:30am

I prefer this for legibility and alignment. I think we should address classes and other attributes separately, but your suggestion does allow for them neatly.