Ordered list detection has precedence over title detection

This is a good case. Here are some related ones.

The relevant part of the spec is the first paragraph of the Setext Headers section:

A setext header consists of a line of text, containing at least one non-space character, with no more than 3 spaces indentation, followed by a setext header underline. The line of text must be one that, were it not followed by the setext header underline, would be interpreted as part of a paragraph: it cannot be interpretable as a code fence, ATX header, block quote, horizontal rule, list item, or HTML block.

Thanks for pinning down the relevant part of the spec, jgm.

So my question is now whether this particular part of the specification is good. Do you know the reasons behind it? I stick by my opinion that setext headers should have a higher precedence than the other sorts of structures the quotation from the specification lists, not least because they look so visually striking. I can’t imagine anybody looking at the example markdown in my original post and seeing those as being anything but headings.

This seems reasonable to me.

Yes, this does not seem so reasonable. Seems like a very special exception which could catch people off guard.

It’s a very reasonable question. I agree that, in many cases at least, one would naturally expect

1. Foo
-------

to be a header. However, there are many factors to balance here. One is uniformity. The current rule is clear and uniform. If we allowed the example above to be a header, what about the other cases? Should this be a header?

----
----

What about this?

> bar
-----

What about

    code
----

I think it’s a bad idea to have a rule that’s very complicated with lots of exceptions (which is where @chrisalley’s suggestion is going).

Another consideration is parsing (though I’d give this secondary weight). The current spec allows us to use a very elegant method for parsing blocks; we can do it line by line, discarding the input after each line, with no lookahead whatsoever. Allowing headers whose contents could be lists complicates this considerably. (This is a secondary consideration, because we could use a different parsing algorithm that looks ahead one line.)
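To make the no-lookahead point concrete, here is a rough Python sketch of that style of block classification. It is a simplification for illustration, not the commonmark.js algorithm: the names `classify`, `UNDERLINE`, `THEMATIC`, `LIST_ITEM` and `BLOCK_QUOTE` are made up, container nesting and multi-line paragraphs are ignored, and only hyphen thematic breaks are handled. Each line is examined once, and the only state kept is the most recently emitted block.

```python
import re

# Hypothetical, simplified patterns (not taken from any real implementation).
UNDERLINE = re.compile(r"^ {0,3}(=+|-+)\s*$")
THEMATIC = re.compile(r"^ {0,3}(-\s*){3,}$")
LIST_ITEM = re.compile(r"^ {0,3}(\d{1,9}[.)]|[-+*])\s+\S")
BLOCK_QUOTE = re.compile(r"^ {0,3}>")


def classify(lines):
    """Classify lines one at a time, never looking at later input."""
    blocks = []
    for line in lines:
        prev = blocks[-1] if blocks else None
        if not line.strip():
            blocks.append(("blank", ""))
        elif UNDERLINE.match(line) and prev and prev[0] == "paragraph":
            # Only an already-open paragraph can be promoted to a heading.
            level = 1 if "=" in line else 2
            blocks[-1] = (f"h{level}", prev[1])
        elif THEMATIC.match(line):
            blocks.append(("hr", ""))
        elif LIST_ITEM.match(line):
            blocks.append(("list_item", line.strip()))
        elif BLOCK_QUOTE.match(line):
            blocks.append(("block_quote", line.strip()))
        else:
            blocks.append(("paragraph", line.strip()))
    return blocks
```

With this rule, `Foo` followed by `---` becomes a level-2 heading, while `1. Foo` followed by `-------` comes out as a list item and a horizontal rule, because the first line has already been committed as a list item by the time the underline arrives.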

Bottom line is that I’m open to reconsidering this, but I don’t think it’s obvious what a better rule would look like.

To a reader of the specification, perhaps. But not to a user. Very few users of Reddit or similar will read the Commonmark spec! The rule in my head, at least, was “if there’s a line of dashes under something it’s a title”, not “as long as it’d otherwise be seen as a paragraph (after testing for these other dozen structures), and there is a line of dashes under, make it a title”.

Your other examples are interesting. In order, my gut feelings are that they’d be <h2>----</h2>, <h2>&gt; bar</h2> and <pre><code>code</code></pre><hr>.

I suppose this means for me that indentation and its effects have a very high precedence, then setext headers are way up there too, and # style headers, blockquotes and so on take less precedence. Having said that, I wouldn’t take major exception to indentation and setext’s precedence being the other way around. It just seems very unlikely you’ll indent a title line even though it won’t have an effect on the output, and put the dashes underneath which won’t be lining up (because they’re not allowed to be indented more than three spaces), intending for a title to be rendered.
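A minimal sketch of that precedence order in Python, assuming one line of lookahead. The function name `classify_by_precedence` and the regular expressions are hypothetical, only hyphen rules are recognised, and everything ranked below setext headers is lumped together as `other`.

```python
import re

# Hypothetical, simplified patterns for illustration only.
UNDERLINE = re.compile(r"^ {0,3}(=+|-+)[ \t]*$")
INDENTED = re.compile(r"^(?: {4}|\t)")
THEMATIC = re.compile(r"^ {0,3}(?:-[ \t]*){3,}$")


def classify_by_precedence(lines):
    """Indented code first, then setext headers, then everything else."""
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        if not line.strip():
            i += 1  # blank lines only separate blocks in this sketch
        elif INDENTED.match(line):
            out.append(("code", line.strip()))
            i += 1
        elif nxt is not None and UNDERLINE.match(nxt):
            # Whatever the line looks like, an underline makes it a header.
            level = 1 if "=" in nxt else 2
            out.append((f"h{level}", line.strip()))
            i += 2
        elif THEMATIC.match(line):
            out.append(("hr", ""))
            i += 1
        else:
            out.append(("other", line.strip()))
            i += 1
    return out
```

Run on the three examples, this reproduces the gut feelings above: a level-2 header containing `----`, a level-2 header containing `> bar`, and a code line followed by a horizontal rule.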

If an order of precedence isn’t enough, or it is too complicated as a rule (though I hardly think so: schoolchildren deal easily for the most part with the order of precedence of basic mathematical operators, for example), and things still get ambiguous I would encourage requiring blank lines (or start/end of file) around things. Things look awfully cluttered without them anyway, when viewing as plain text. I realize this would probably shake things up quite a bit, but it seems to me it’d solve a lot of the problems. If horizontal rule and blockquote required blank lines before and after, for example, the first and second examples are no longer ambiguous. Indented code could require them too, and setext headers, and groups of list items.

Just to make sure we’re clear, I’m not suggesting things should be nested like that. I’m aware that some things can be nested already (I’d hazard a guess that it’s only span things like italic and bold?), but I don’t think HTML even allows block level things like lists within heading tags, off the top of my head. No, I’d definitely be expecting the likes of <h2>1. Foo</h2>.

I’m in favour of all of these cases producing a header. This is consistent with Markdown.pl and a number of other existing implementations. CommonMark would be in good company.

The original Markdown syntax guide says this about horizontal rules:

You can produce a horizontal rule tag (<hr />) by placing three or more hyphens, asterisks, or underscores on a line by themselves.

This doesn’t exclude the header taking precedence over the horizontal rule. It’s ambiguous.

I agree with @tremby that horizontal rules look cluttered without blank lines surrounding them. It is convenient to quickly type some Markdown without adding aesthetically-pleasing whitespace around the horizontal rule. But readability suffers as a result. Markdown’s philosophy is that readability is emphasised above all else. This is another reason to prioritise the header (which makes the markup more readable) over the horizontal rule without a blank line (which makes the markup less readable).

The issue won’t be solved by requiring blanks around horizontal rules. With that change, commonmark.js would still not produce a setext header; you’d just get a literal string ---.

Here’s a nice case that I think illustrates some of the parsing difficulties:

> > 1. hi
> --

As this case shows, it’s not enough just to peek at the next line to see if it contains a setext header underline. For the underline might be embedded in nested block quotes and/or list items.

Note that Markdown.pl and the like aren’t very consistent about their behavior. Here’s another “nested” case where you don’t get a setext header:

- hi
    - there
    ---

even though it works without the nesting.

If the rule of uniformity is to be followed here, I’d expect the behaviour to carry over to different list and quote levels. E.g. check whether the next line contains a setext header underline and is indented by the same amount. I can see this getting complicated to parse, but if the text is more readable then perhaps it is worth it.

Given

1. Juli
------

- Event 1
- Event 2

Is it really so onerous to ask the user to escape the . there? There are a lot of other situations where accidental numbered lists need to be escaped, such as:

1986. What a great season. Perhaps the finest season in the history of the franchise.

and I’m not sure why this one is so special or different, certainly not special or different enough to warrant a ton of special casing and unusual rules.
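For what escaping amounts to mechanically, here is a small hedged Python sketch. `escape_ordered_marker` is a hypothetical helper, not part of any CommonMark implementation; it assumes an ordered list marker is 1 to 9 digits followed by `.` or `)` and then a space, a tab or the end of the line, and it puts a backslash before the delimiter.

```python
import re

# Assumed shape of an ordered list marker at the start of a line.
ORDERED_MARKER = re.compile(r"^( {0,3}\d{1,9})([.)])(?=[ \t]|$)")


def escape_ordered_marker(line):
    """Backslash-escape the delimiter of a leading ordered list marker."""
    return ORDERED_MARKER.sub(r"\1\\\2", line, count=1)


print(escape_ordered_marker("1. Juli"))
print(escape_ordered_marker("1986. What a great season."))
```

Lines that do not start with such a marker are returned unchanged.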

I agree that the situations are similar. That said, the requirement to use the escape character isn’t desirable (for readability reasons) in either case. I posted a related topic regarding list items on their own which might be helpful here (apologies for bringing up yet another spec change proposal so late in the game).

Let me phrase the issue in this way:

According to the current specification, we have

  • ATX heading syntax and
  • setext heading syntax

as two alternative input syntax variations to specify a heading (of level 1 or 2). Given that these options confer no difference in meaning whatsoever, and produce the exact same result, I’d say that having a “rewrite rule” connecting them would be “desirable”.

Consider a string {string}, and on the one hand

#⎵{string}⎵#

##⎵{string}⎵##

compared to

{string}
========

{string}
--------

It would be IMO “desirable” to impose as few side-conditions as possible on the content or “form” of {string} in order to make the rewrite rule work, that is, in order to make both input forms in fact equivalent.
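The rewrite rule can be sketched as a small Python function. This is only an illustration under assumptions: `atx_to_setext` and the `ATX` pattern are hypothetical names, only levels 1 and 2 are handled, and backslash-escaped closing `#` characters are not treated specially.

```python
import re

# Hypothetical pattern: 1 or 2 opening '#', required space, optional closing '#'s.
ATX = re.compile(r"^ {0,3}(#{1,2})[ \t]+(.*?)(?:[ \t]+#+)?[ \t]*$")


def atx_to_setext(line):
    """Rewrite a level 1 or 2 ATX heading line as setext text plus underline."""
    m = ATX.match(line)
    if m is None:
        return None
    hashes, text = m.groups()
    if not text.strip():
        return None  # an empty {string} has no setext form
    underline_char = "=" if len(hashes) == 1 else "-"
    return text + "\n" + underline_char * max(len(text), 3)
```

For {string} = `2. Second level heading` the function produces a text line and an underline that the current spec reads as a list item followed by a thematic break, which is exactly where the two forms stop being equivalent.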

Now the ATX heading syntax works with pretty much any value for {string}: it may contain arbitrarily many NUMBER SIGN characters, even at the start or end (see example 40), or start with “>⎵”. In fact, I see no other restriction for {string} than that it must not straddle input lines.

In contrast to that, the setext heading syntax description has a whole paragraph of restrictions on {string}:

[…] a line of text, containing at least one non-whitespace character, with no more than 3 spaces indentation, followed by a setext heading underline. The line of text must be one that, were it not followed by the setext heading underline, would be interpreted as part of a paragraph: it cannot be interpretable as a code fence, ATX heading, block quote, thematic break, list item, or HTML block.

For my taste too, the syntactic side-conditions for {string} in ATX headings are so much weaker than those applicable in setext headings that one can in fact ask why this is required.

It seems that the weakest possible requirement on {string} in setext headings would be that {string} is not blank: that is, contains at least one graphical character.

It would then be IMO appropriate to require the {string} line and hence the setext heading to be preceded by a blank line (which I would consider reasonable anyway). Consequently, when parsing CommonMark, recognition of a line following a blank line as (the starting line of) a:

  • code fence,
  • ATX heading,
  • block quote,
  • thematic break,
  • list item, or
  • HTML block

would be “suppressed” iff the next line is a setext heading underline.

Algorithmically, this would require only look-ahead over one input line, and would only be executed in the transition from blank lines to block of non-blank lines.
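Sketched in Python, and ignoring container blocks such as block quotes and list items entirely, the proposed check might look like the following. The name `find_setext_headings` is hypothetical, and the sketch only reports headings; it does not build a document tree.

```python
import re

# Hypothetical pattern for a setext heading underline.
UNDERLINE = re.compile(r"^ {0,3}(=+|-+)[ \t]*$")


def find_setext_headings(lines):
    """Report (level, text) for non-blank lines that follow a blank line
    (or start the input) and are directly followed by an underline."""
    headings = []
    for i, line in enumerate(lines):
        after_blank = i == 0 or not lines[i - 1].strip()
        has_next = i + 1 < len(lines)
        if after_blank and line.strip() and has_next and UNDERLINE.match(lines[i + 1]):
            # The underline wins: list, quote, etc. recognition is suppressed.
            level = 1 if "=" in lines[i + 1] else 2
            headings.append((level, line.strip()))
    return headings
```

On the lines `1. Juli`, `------`, a blank line, `- Event 1`, `- Event 2` this reports a single level-2 heading with the text `1. Juli`. A nested underline such as `> --` is not matched by this flat check.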

In this way, the “rewrite rule” would in fact “always” work—except in the “degenerate” case when {string} is a null string, or consists of white space only. But it seems that this exception can not be avoided in order to still allow recognition of a thematic break (formerly known as <HR>, or harsh rupture :wink: ) comprising only HYPHEN-MINUS characters.

I worry, though, that this case – numbered section headings – is exceptionally common, so asking authors to escape here is onerous. Keep in mind, also, that there may be a large number of existing Markdown documents that have numbered setext section headings, and these might all be broken by the current CommonMark behavior.

Maybe, but since we’ve decided that

#heading

is no longer a heading (the space is required), some similar adjustment to headers in the form of

2. Second level heading
---

wouldn’t be too surprising, would it? Headers are, by definition, not super common elements in a document, so changing them is a bit easier than something that affected every para in a document.

Sorry, but I have to ask: how are “1986.␣What a great season.” and “2.␣Second level heading” examples of numbered section headings? I thought they were ordered list items a moment ago?

If one has the desire (IMO misguided, although e.g. IEEE does it too) to ignore most modern style guides and to use the string “2.␣Second level heading” as a section heading, then I’d say it is only reasonable to require the input syntax:

2\.␣Second level heading
-------------------------

In any case, there’s still the ATX syntax, where no escaping is needed even today:

##␣2.␣Second level heading␣##

So that desire seems a weak reason to change the rules, for my taste.

Just to show my face again:

The same client (who came across this issue some time ago and couldn’t figure out why the English looked fine while the German did not) just had the same issue again.

I’m not about to explain escaping characters to non-tech clients. I’ve instructed her to just use the ## style instead, but I have the feeling she had just gotten comfortable with underlining with dashes. It’s a shame this isn’t more intuitive.

I agree that there are some good reasons to change something here.
BabelMark2 shows three interpretations of your original case:

  1. ordered list, hrule (commonmarks, cheapskate, maruku)
  2. ordered list with header in content (lunamark, parsedown, kramdown)
  3. header with number in content (all the others)

I think 3 is the best interpretation. The question is just how to get the spec to produce it without making the rules complicated or hard to remember.
I’ve already got this on the list of things to figure out before a 1.0 release.

Agreed. Great; thanks. I look forward to the resolution.

It’s not easy to see how to modify the existing spec and parsing strategy to get (3).
I will think about it, but I can’t promise that the resolution will go that way.

What about tin-pot’s suggestions above, from Jan 4 and 5? Do they not help in this regard?

I think the suggestions were not taking into account that setext headers can be deeply nested. This adds a lot of complexity.
