Feature request: automatically generated ids for headers


#72

I prefer this for legibilty and alignment. I think we should address classes and other attributes separately but your suggestion does allow for them neatly.


#73

That’s exactly why I recommended we drop this feature, but it didn’t seem to get noticed much and everyone kept on discussing it as though it’s a good idea to violate SRP for a one-size fits nobody approach to adding some [id] attributes.


#74

Then it makes sense to me to do it that way. I do however like the ability to specify IDs; if the {#myid} syntax is already familiar to people then I suggest we follow than convention too (I haven’t used it yet).

I agree that different sites may have different needs for automatically generated header IDs

For me, that rules out automatic IDs as part of Commonmark. I suggest any automatic ID generation is left to a processor.

leaving the assignment of IDs up to the implementation.

I worry that implementations would treat this so differently that it could cause headaches for users wishing to switch implementations.

After more consideration, I prefer that Commonmark not automatically generate IDs but implicit IDs are implemented likely following the existing convention {#myid} although I do wonder if those curly braces can be dropped :slightly_smiling:


#75

The syntax with curly braces may have one advantage in that it can be applied to all kinds of blocks and even inline markup. Like all explicit metadata it makes the input source code, which, in the case of Markdown/Commonmark, is always also the simplest form of output, less readable. IOW, it’s against the spirit.

The “prefixed line suffixes” variant I introduced could work well for headings (both ATX and Setext) and fenced blocks as well as thematic breaks, but needs further work for other types:

With optional line suffix / terminator

1. Enumerated list item . #ID
* Bullet list item * .class
> Quotation < "title"

Without optional line suffix / terminator

1. Enumerated list item #ID
* Bullet list item .class
> Quotation "title"

I’d expect #ID and .class to work surprisingly well, except that the former may clash with hashtags, but the title syntax derived from links is just asking for trouble. You could still add optional curly braces for ambiguous cases, of course.


#76

The use of square brackets is quite common. I can see the [My first section] syntax creating links in Markdown documents where a link was not intended by the author. Unless you meant the reference in reference style links only? e.g. [Click/tap this text to visit my first section][My first section]. I can’t think of any strong objections to this latter syntax. But…

It complicates testing and there’s the problem of the IDs being different between implementations. As an opt-in extension, having a consistent algorithim for generating the IDs would be useful as this would allow a subset of CommonMark documents to all count on external links to headings not breaking if the CommonMark implementation is swapped out with another implementation. If such an extension existed then you define a link such as…

[Click/tap this text to visit my first section](#my-first-section)

…and the heading…

## My First Section

…and count on it working across all CommonMark implementations which opt-in to the Implict IDs extension. It’s not going to solve all of the issues raised in this topic (reordering headings, etc), but for a sizeable number of documents (wiki articles, forum posts, etc) it would probably be reliable enough for their use cases. For documents that require more certainty in how the IDs are generated we could have a seperate explicit IDs extension and leave it up to the application developer to choose which extension (implicit IDs, explicit IDs, or both) to use.


#77

No, authors must never be required to use (and determine first) the actual ID. Some may choose to do that, though.

With default settings, Pandoc is the only current implementation in available Babelmark that seems to get it right.


#78

For the nth time, this is a request for mandatory generation of ids for all headers, as a core component of the specification. An extension is not good enough. Optional-to-implement is not good enough. Only-if-the-author-does-something is not good enough. It must be in the core, it must be mandatory to implement, and it must apply to all headers. Only that will move us toward a world where all HTML documents always have IDs on all of their headers.

All the picayune stuff this request keeps getting sidetracked on - what the IDs actually are, how the author can control them, whether the author should be able to opt-out some headers (no), etc - is not as important as the principle.

Mandatory generation of IDs.
For all headers.
In the core specification.


#79

Not if the identity of id includes the position of the header in the outline. Not that I think such should be mandated by the specification, but it can be explained in an informative section on “best practice”.

I agree that it’s preferable, but I don’t see it as an absolute requirement. It’s a nice-to-have and something that can be achieved given a best practice algorithm, but since we don’t yet know what that algorithm will look like, I think the feature can be added to the Commonmark syntax and eventually that algorithm will surface. When it does, it can be added as a reference to the core language specification.

I don’t see what problem an explicit id is solving here. While I do agree it should be possible to have explicit ids, they will suffer from the exact same synchronicity problem as implicit id's. The link referencing the anchor will have to reference something. That something can change, whether it is an explicit id or the text of a header.

:thumbsup:


#80

I don’t see what problem an explicit id is solving here. While I do agree it should be possible to have explicit ids, they will suffer from the exact same synchronicity problem as implicit id’s. The link referencing the anchor will have to reference something. That something can change, whether it is an explicit id or the text of a header.

It is solving the problem that the auto-generated ID will change if you change the heading text or the ordering of 2 identically-named headings. Your anchor ID will stay the same unless you explicitly change it, meaning that your links to it will not get broken.


#81

If the link is written as @an3ss suggests:

## The Philosophy of CommonMark

You need to understand [the philosophy of CommonMark] because blah, blah, blah...

The reference and the anchor are synchronized through the text. If you change the text of the header, you most likely want to change the ID as well, unless you just use UUIDs. I don’t think this is a very compelling argument for requiring an explicit ID, but if it is what it takes to achieve consensus, I can live with that.


#82

The reference and the anchor are synchronized through the text.

Not necessarily; it’s very conceivable you could use text other than the header text to refer to a section. This also doesn’t take into account links from external documents.


#83

I suppose it is nicer for the author not to have to manually write the ID of a heading in order to link to it. I am concerned that adding square bracket heading links to the core spec would break a lot of existing Markdown documents (imagine how many GitHub README.md files would use square brackets for some other purpose and shouldn’t link to a heading), but as an opt-in extension it could be useful. Or some other syntax could be used besides square brackets.

How that would get around the reordering issue without the author defining explicit IDs? Can you provide an example?


#84

To be honest, I’m not sure what I was thinking of, but I think it was something like this: If the generated ID consists of both the position of the header in the outline as well as the header’s name, reordering of headers shouldn’t be a problem. For the following outline:

# Level One
## Level Two
# Level One
## Level Two

The generated ID’s of the headers could be something like:

  1. section1-level-one
  2. section1.1-level-two
  3. section2-level-one
  4. section2.1-level-two

If you reorder or rename a header, it will get a new ID. You won’t have backward compatibility with incoming links, but you will avoid conflicts.

To the argument of being backwards compatible and conserving incoming links, I think that’s impossible unless your ID’s have absolutely nothing to do with the document structure at all; i.e. semantically nonsense. UUID’s will give you this detachment if you really want it and for those who do, they should by all means be able to explicitly name the ID of their headers and stuff an UUID in there, but for those of us who like semantically accurate and human intelligible ID’s, we can go with a more attached and brittle autogenerated ID.


#85

I think no one has noticed this issue, it affects users from other languages (english is fine with this). Most implementation of automatic IDs have the lame effect of ignoring accented characters, which is correct actually, since the markdown might be incorrect and cause issues if the accent was included (the URL can’t contain accents), but the issue is that it’s not properly converted to the non-accented counterpart, it’s just ommited.

For example, for the title:

# “Techné” as the greek word for “Art”

The id would be techn-as-the-greek-word-for-art, instead of the correct way: techne-as-the...

Most automatic ID generators commit this error and I don’t think they’re to blame, I just think that language is really complicated and even I have no idea what issues this might be for other languages such as Japanese or Corean.

Markdown should not behave in an opinionated way (which is inevitable with automatic IDs), unless it clearly provides an alternative to use your own criteria to generate ids.

I really like the {#id-goes-here} approach because it’s clearly understandable and for reasons stated above.


#86

We could define some rules to automatically convert the commonly used accented characters. But you might be right about it being difficult to anticipate the correct behaviour for all languages. This is a compelling reason to include an override method as part of the extension.


#87

I’m kinda in favor of the simplest solution, the one that GitHub has adopted, even at the risk of collisions (though GitHub avoids collisions by appending a suffix when a collision is detected).

The thing to remember is that links should be easy to author, just as with the rest of markdown/commonmark.

I haven’t thought through the implications enough, hence the “kinda”.


#88

Babelmark shows just how different approaches are for spaces and roman non-ASCII letters.

With an info string, authors could override automatic IDs.


#89

We don’t need namespaces. We already have scopes.

Within the scope of the content I’m authoring, [me too](#me-too) can unambiguously link to # Me Too. I as an author should not have to think about any containing scopes. This is perfectly analogous to block scopes in most programming languages.

It is the responsibility of the embedding context to respect and protect my scope. Whether in its rendering it avoids ID collisions by altering its IDs or mine, or demotes my heading levels to avoid multiple H1s, it is its business, its job to make it work.

There should be a clear separation of concerns between authoring content and publishing mechanics. I shouldn’t have to manage the technicalities of an output format while authoring in a format that is supposed to be independent and portable. As the content author, the only thing that matters is that I have a semantically unambiguous and intuitive way to create internal links. I don’t care how they are ultimately rendered.

The one case where I do care, when I want the world to be able to deep link into my published content, I make sure to choose a publishing tool that produces predictable, “exportable” header ids and deep links, perhaps one that retains the unaltered CommonMark anchors or perhaps one that doesn’t. Who knows, maybe the content will get published in a relational database, and all anchors get translated into foreign keys. These are publishing concerns, not authoring concerns, not CommonMark spec concerns.


Anchors in markdown
#90

Commonmark should (with “support” meaning either in the mandatory core or via an optional extension) …

  • require implementations to automatically generate implicit IDs for headings
  • specify how to generate implicit IDs from heading text
    (e.g. “Überschrift 1” ⇒ #uberschrift_1)
  • specify how to generate implicit IDs from document structure
    (e.g. ### Heading#section-3.6.1)
  • specify how to generate safe IDs for user-generated content
    (e.g. {#window.evil');\ DROP\ TABLE\ *;--}#user-window-evil-drop-table)
  • support manually entered, explicit IDs for headings
    (e.g. ## Heading ## #ID)
  • support manually entered, explicit IDs for any block
    (e.g. ~~~ #ID)
  • support manually entered, explicit IDs for links
    (e.g. [text](target #ID))
  • support manually entered, explicit IDs for any inline markup
    (e.g. *emphasis*{#ID})
  • support overriding implicit IDs with explicit ones
    (e.g. [heading text]: #ID)
  • support relative links to headings/sections
    (e.g. [next section][>] or [this section](#.))

0 voters


#91

One problem with implicit IDs is that they change when the heading changes. If you’re serious about cross-referencing in large documents, you want to guard against this, otherwise you’re spending your time updating links when you change headings. There are many reasons you might want to change the heading but keep the section same - to make it more informative, replace a word with another, etc.

So the best option of the options in the above vote is to support explicit IDs which may displace implicit IDs, or don’t specify implicit IDs in the spec at all and then systems which already generate implicit IDs will have to choose to displace the implicit ID with explicitly specified ID, when specified.