Ordered list detection has precedence over title detection

tremby · June 28, 2015, 3:09am

I’m not sure if this belongs in spec or implementation, since I don’t have time to dig through the spec at this moment.

I had a client just asking me why the German version of one of her pages is displaying so differently from the English.

It’s a list of dates and events on those dates. The English:

July 1
------

- Event 1
- Event 2

July 2
------

- Event 3, etc

The German:

1. Juli
------

- Event 1
- Event 2

2. Juli
------

- Event 3, etc

The English is rendered as I expect, with h2 titles and unordered lists beneath each. But in the German, since the lines start with a number and a dot, they’re interpreted as ordered lists, and then the hyphens underneath make a horizontal rule.

Is this really what’s intended? I’m aware I can do ## 1. Juli instead to get the effect I’m looking for, or escape the dot like 1\. Juli. But it seems to me that the underlined title syntax should have precedence over an ordered list.
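To illustrate why the German line is caught while the two workarounds are not, here is a minimal sketch. It assumes a simplified ordered-list-marker pattern; the regex and the name `looks_like_ordered_item` are illustrative and not taken from any implementation.

```python
import re

# Simplified ordered list marker: up to 3 spaces of indentation, 1-9 digits,
# then "." or ")", then a space or end of line.
# This is an approximation for illustration, not the spec's full rule.
ORDERED_MARKER = re.compile(r"^ {0,3}\d{1,9}[.)](?: |$)")

def looks_like_ordered_item(line: str) -> bool:
    """True if the line starts with something shaped like an ordered list marker."""
    return ORDERED_MARKER.match(line) is not None

if __name__ == "__main__":
    for line in ["July 1", "1. Juli", r"1\. Juli", "## 1. Juli"]:
        print(repr(line), looks_like_ordered_item(line))
```

Only the plain `1. Juli` line matches the marker pattern; the escaped and the `##` variants do not, which is why they work as workarounds.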

Knagis · June 29, 2015, 5:56pm

The same issue is present in many languages and is much broader than just the titles. In the past the more common case has been a paragraph that starts with the year:

English:

In the year 1985 ...

Latvian (and probably many more):

1985. gadā ...

tremby · June 29, 2015, 11:05pm

Yes, I’ve noticed that in the past.

In those cases it’s a little more obvious what went wrong. In the case of the title, the line of dashes below seems to me to make the intent extremely clear.

In any case, would a possible solution be to only treat it as an ordered list if at least two items are found?
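A rough sketch of what such a "two items minimum" heuristic could look like. This is only an illustration of the idea: the function name `ordered_list_runs` and the `min_items` parameter are hypothetical, and the sketch ignores loose lists (items separated by blank lines), nesting, and continuation lines.

```python
import re

# Simplified ordered list marker (an approximation, not the spec's full rule).
ORDERED_MARKER = re.compile(r"^ {0,3}\d{1,9}[.)](?: |$)")

def ordered_list_runs(lines, min_items=2):
    """Return (start, end) index pairs for runs of consecutive marker lines
    that contain at least min_items lines; shorter runs are not lists."""
    runs, start = [], None
    for i, line in enumerate(list(lines) + [""]):  # sentinel closes a trailing run
        if ORDERED_MARKER.match(line):
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_items:
                runs.append((start, i))
            start = None
    return runs

if __name__ == "__main__":
    print(ordered_list_runs(["1. Juli", "------", "", "- Event 1"]))
    print(ordered_list_runs(["1. one", "2. two", "---"]))
```

With this heuristic the single `1. Juli` line is no longer a list, so it would be free to become a title.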

chrisalley · June 30, 2015, 6:24am

Starting a title with a number is a common scenario, as sections may be numbered. It is much less common for a paragraph to start with a number followed by a full stop unless the paragraph is intended to be a list item.

This issue is similar to hashtags at the start of a line vs atx-style headers without a space. As argued in that topic, requiring a space after a header marker makes the document look cleaner. Similarly, requiring a blank line (to end the preceding list) produces a cleaner looking document and would simultaneously solve @tremby’s issue.

tremby · June 30, 2015, 6:37am

Not sure I follow. Where exactly in my example would this blank line be looked for? Do I already have it?

chrisalley · June 30, 2015, 11:47am

Sorry, I thought you had an ordered list in your example. I was actually referring to your comment about treating the “2. Juli” line as an ordered list if at least two items are found. Something like this might be an appropriate solution:

This creates an ordered list and a horizontal rule:

1. First ordered list Item
2. Second ordered list item
---

This creates an ordered list and a header:

1. Ordered List Item

2. Second level heading
---

In other words, the blank line ends the first list and a second list is only started if the line isn’t followed by the hyphens.

Alternatively, lines that start with a number and full stop followed directly by a line containing the hyphens could always be treated as a second level header. But this seems more likely to break backward compatibility.

jgm · June 30, 2015, 4:08pm

This is a good case. Here are some related ones.

The relevant part of the spec is the first paragraph of the Setext Headers section:

A setext header consists of a line of text, containing at least one non-space character, with no more than 3 spaces indentation, followed by a setext header underline. The line of text must be one that, were it not followed by the setext header underline, would be interpreted as part of a paragraph: it cannot be interpretable as a code fence, ATX header, block quote, horizontal rule, list item, or HTML block.
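The quoted rule can be sketched roughly as follows. The patterns are deliberately simplified approximations of each construct (the HTML block check in particular is much looser than the spec), and the names `blocking_construct` and `is_setext_header` are hypothetical, chosen for this illustration only.

```python
import re

SETEXT_UNDERLINE = re.compile(r"^ {0,3}(=+|-+) *$")

# Simplified "could start this construct" patterns, for illustration only.
BLOCK_STARTS = {
    "code fence": re.compile(r"^ {0,3}(`{3,}|~{3,})"),
    "ATX header": re.compile(r"^ {0,3}#{1,6}(?: |$)"),
    "block quote": re.compile(r"^ {0,3}>"),
    "horizontal rule": re.compile(r"^ {0,3}([-*_])(?: *\1){2,} *$"),
    "list item": re.compile(r"^ {0,3}(?:[-+*]|\d{1,9}[.)])(?: |$)"),
    "HTML block": re.compile(r"^ {0,3}<[A-Za-z/!?]"),
}

def blocking_construct(line):
    """Name of the first construct the line could start,
    or None if it would be plain paragraph text."""
    if not line.strip() or line.startswith("    "):
        return "blank or indented"
    for name, pattern in BLOCK_STARTS.items():
        if pattern.match(line):
            return name
    return None

def is_setext_header(text_line, next_line):
    """Apply the rule: paragraph-like text line followed by an underline."""
    return (blocking_construct(text_line) is None
            and SETEXT_UNDERLINE.match(next_line) is not None)

if __name__ == "__main__":
    for text in ["July 1", "1. Juli", "> bar", "----"]:
        print(repr(text), blocking_construct(text), is_setext_header(text, "------"))
```

Under this reading, `July 1` qualifies as a header line, while `1. Juli` is ruled out because it is interpretable as a list item.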

tremby · June 30, 2015, 10:45pm

Thanks for pinning down the relevant part of the spec, jgm.

So my question is now whether this particular part of the specification is good. Do you know the reasons behind it? I stick by my opinion that setext headers should have a higher precedence than the other sorts of structures the quotation from the specification lists, not least because it looks so visually striking. I can’t imagine anybody looking at the example markdown in my original post and seeing those as being anything but headings.

tremby · June 30, 2015, 10:46pm

This seems reasonable to me.

Yes, this does not seem so reasonable. Seems like a very special exception which could catch people off guard.

jgm · July 1, 2015, 3:46am

It’s a very reasonable question. I agree that, in many cases at least, one would naturally expect

1. Foo
-------

to be a header. However, there are many factors to balance here. One is uniformity. The current rule is clear and uniform. If we allowed the example above to be a header, what about the other cases? Should this be a header?

----
----

What about this?

> bar
-----

What about

    code
----

I think it’s a bad idea to have a rule that’s very complicated with lots of exceptions (which is where @chrisalley’s suggestion is going).

Another consideration is parsing (though I’d give this secondary weight). The current spec allows us to use a very elegant method for parsing blocks; we can do it line by line, discarding the input after each line, with no lookahead whatsoever. Allowing headers whose contents could be lists complicates this considerably. (This is a secondary consideration, because we could use a different parsing algorithm that looks ahead one line.)
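To make the "no lookahead" point concrete, here is a toy sketch of a line-by-line block pass. It is not how any real parser is written; the block kinds, the function name `parse_blocks`, and the reduced set of constructs (paragraphs, ordered list items, horizontal rules, setext headers) are assumptions made for the illustration. Each line is classified when it is read, and an underline can only promote a paragraph that is already open.

```python
import re

SETEXT_UNDERLINE = re.compile(r"^ {0,3}(=+|-+) *$")
ORDERED_MARKER = re.compile(r"^ {0,3}\d{1,9}[.)] ")
HRULE = re.compile(r"^ {0,3}-{3,} *$")

def parse_blocks(lines):
    """Toy block pass: every line is classified once, with no lookahead."""
    blocks = []             # list of (kind, text) tuples
    open_paragraph = False  # True while the last block is an unfinished paragraph
    for line in lines:
        underline = SETEXT_UNDERLINE.match(line)
        if open_paragraph and underline:
            # Promote the paragraph that is already open to a header.
            level = 1 if underline.group(1)[0] == "=" else 2
            _, text = blocks.pop()
            blocks.append((f"h{level}", text))
            open_paragraph = False
        elif not line.strip():
            open_paragraph = False
        elif ORDERED_MARKER.match(line):
            # Already committed to a list item; a later underline cannot undo it.
            blocks.append(("ol-item", line.lstrip().split(" ", 1)[1].strip()))
            open_paragraph = False
        elif HRULE.match(line):
            blocks.append(("hr", ""))
            open_paragraph = False
        elif open_paragraph:
            kind, text = blocks.pop()
            blocks.append((kind, text + "\n" + line))
        else:
            blocks.append(("p", line))
            open_paragraph = True
    return blocks

if __name__ == "__main__":
    print(parse_blocks(["July 1", "------"]))
    print(parse_blocks(["1. Juli", "------"]))
```

In this sketch, by the time the dashes arrive the `1. Juli` line has already been committed as a list item, so the dashes can only become a horizontal rule. Treating it as a header instead would require either lookahead or revising a decision already made.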

Bottom line is that I’m open to reconsidering this, but I don’t think it’s obvious what a better rule would look like.

tremby · July 3, 2015, 7:58am

To a reader of the specification, perhaps. But not to a user. Very few users of Reddit or similar will read the Commonmark spec! The rule in my head, at least, was “if there’s a line of dashes under something it’s a title”, not “as long as it’d otherwise be seen as a paragraph (after testing for these other dozen structures), and there is a line of dashes under, make it a title”.

Your other examples are interesting. In order, my gut feelings are that they’d be <h2>----</h2>, <h2>> bar</h2> and <pre><code>code</code></pre><hr>.

I suppose this means for me that indentation and its effects have a very high precedence, then setext headers are way up there too, and # style headers, blockquotes and so on take less precedence. Having said that, I wouldn’t take major exception to indentation and setext’s precedence being the other way around. It just seems very unlikely you’ll indent a title line even though it won’t have an effect on the output, and put the dashes underneath which won’t be lining up (because they’re not allowed to be indented more than three spaces), intending for a title to be rendered.

If an order of precedence isn’t enough, or it is too complicated as a rule (though I hardly think so: schoolchildren deal easily for the most part with the order of precedence of basic mathematical operators, for example), and things still get ambiguous I would encourage requiring blank lines (or start/end of file) around things. Things look awfully cluttered without them anyway, when viewing as plain text. I realize this would probably shake things up quite a bit, but it seems to me it’d solve a lot of the problems. If horizontal rule and blockquote required blank lines before and after, for example, the first and second examples are no longer ambiguous. Indented code could require them too, and setext headers, and groups of list items.

Just to make sure we’re clear, I’m not suggesting things should be nested like that. I’m aware that some things can be nested already (I’d hazard a guess that it’s only span things like italic and bold?), but I don’t think HTML even allows block level things like lists within heading tags, off the top of my head. No, I’d definitely be expecting the likes of <h2>1. Foo</h2>.

chrisalley · July 3, 2015, 12:50pm

jgm:

It’s a very reasonable question. I agree that, in many cases at least, one would naturally expect

1. Foo
-------

to be a header. However, there are many factors to balance here. One is uniformity. The current rule is clear and uniform. If we allowed the example above to be a header, what about the other cases? Should this be a header?

----
----

What about this?

> bar
-----

What about

    code
----

I’m in favour of all of these cases producing a header. This is consistent with Markdown.pl and a number of other existing implementations. CommonMark would be in good company.

The original Markdown syntax guide says this about horizontal rules:

You can produce a horizontal rule tag (<hr />) by placing three or more hyphens, asterisks, or underscores on a line by themselves.

This doesn’t exclude the header taking precedence over the horizontal rule. It’s ambiguous.

I agree with @tremby that horizontal rules look cluttered without blank lines surrounding them. It is convenient to quickly type some Markdown without adding aesthetically-pleasing whitespace around the horizontal rule. But readability suffers as a result. Markdown’s philosophy is that readability is emphasised above all else. This is another reason to prioritise the header (which makes the markup more readable) over the horizontal rule without a blank line (which makes the markup less readable).

jgm · July 3, 2015, 4:57pm

The issue won’t be solved by requiring blanks around horizontal rules. With that change, commonmark.js would still not produce a setext header; you’d just get a literal string ---.

Here’s a nice case that I think illustrates some of the parsing difficulties:

> > 1. hi
> --

As this case shows, it’s not enough just to peek at the next line to see if it contains a setext header underline. For the underline might be embedded in nested block quotes and/or list items.
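A small sketch of the container problem, assuming a simplified block quote marker pattern. The helper name `strip_quote_markers` is hypothetical. It peels leading `>` markers off a line and reports how many there were, which shows that the text line and the would-be underline in the example sit at different nesting depths.

```python
import re

# Simplified block quote marker: up to 3 spaces, ">", optional single space.
QUOTE_MARKER = re.compile(r"^ {0,3}> ?")

def strip_quote_markers(line):
    """Return (depth, rest): the number of leading block quote markers
    and the remaining text once they are removed."""
    depth = 0
    while True:
        match = QUOTE_MARKER.match(line)
        if not match:
            return depth, line
        line = line[match.end():]
        depth += 1

if __name__ == "__main__":
    print(strip_quote_markers("> > 1. hi"))
    print(strip_quote_markers("> --"))
```

The first line yields depth 2 and the second depth 1, so a peek at the raw next line is not enough: the parser also has to work out which containers the underline belongs to.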

Note that Markdown.pl and the like aren’t very consistent about their behavior. Here’s another “nested” case where you don’t get a setext header:

- hi
    - there
    ---

even though it works without the nesting.

chrisalley · July 5, 2015, 10:26am

If the rule of uniformity is to be followed here, I’d expect the behaviour to carry over to different list and quote levels. E.g. check if the next line contains a setext header underline and is indented the same amount. I can see this getting complicated to parse, but if the text is more readable then perhaps it is worth it.

codinghorror · January 3, 2016, 10:37am

Given

1. Juli
------

- Event 1
- Event 2

Is it really so onerous to ask the user to escape the . there? There are a lot of other situations where accidental numbered lists need to be escaped, such as:

1986. What a great season. Perhaps the finest season in the history of the franchise.

and I’m not sure why this one is so special or different, certainly not special or different enough to warrant a ton of special casing and unusual rules.

chrisalley · January 3, 2016, 12:51pm

I agree that the situations are similar. That said, the requirement to use the escape character isn’t desirable (for readability reasons) in either case. I posted a related topic regarding list items on their own which might be helpful here (apologies for bringing up yet another spec change proposal so late in the game).

tin-pot · January 4, 2016, 5:38pm

Let me phrase the issue in this way:

According to the current specification, we have

ATX heading syntax and
setext heading syntax

as two alternative input syntax variations to specify a heading (of level 1 or 2). Given that these options confer no difference in meaning whatsoever, and produce the exact same result, I’d say that having a “rewrite rule” connecting them would be “desirable”.

Consider a string {string}, and on the one hand

#⎵{string}⎵#

##⎵{string}⎵##

compared to

{string}
========

{string}
--------

It would be IMO “desirable” to impose as little as possible side-conditions on the content or “form” of {string} in order to make the re-write rule work, that is: in order to make both input forms in fact equivalent.
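The two input forms being compared can be generated mechanically, as in the sketch below. The function names `atx_form` and `setext_form` are hypothetical, and the choice to make the underline as long as the text (with a minimum of three characters) is a cosmetic assumption, not a requirement.

```python
def atx_form(text, level):
    """Render text as an ATX heading with a closing sequence."""
    if level not in (1, 2):
        raise ValueError("only levels 1 and 2 have a setext equivalent")
    marks = "#" * level
    return f"{marks} {text} {marks}"

def setext_form(text, level):
    """Render text as a setext heading: '=' underline for level 1, '-' for level 2."""
    if level not in (1, 2):
        raise ValueError("setext headings exist only for levels 1 and 2")
    underline_char = "=" if level == 1 else "-"
    return f"{text}\n{underline_char * max(len(text), 3)}"

if __name__ == "__main__":
    print(atx_form("1. Juli", 2))
    print(setext_form("1. Juli", 2))
```

The question raised here is for which values of the text both outputs are parsed as the same heading.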

Now the ATX heading syntax works with pretty much any value for {string}: it may contain arbitrarily many NUMBER SIGN characters, even at the start or end (see example 40), or start with “>⎵”. In fact, I see no other restriction for {string} than that it must not straddle input lines.

In contrast to that, the setext heading syntax description has a whole paragraph of restrictions on {string}:

[…] a line of text, containing at least one non-whitespace character, with no more than 3 spaces indentation, followed by a setext heading underline. The line of text must be one that, were it not followed by the setext heading underline, would be interpreted as part of a paragraph: it cannot be interpretable as a code fence, ATX heading, block quote, thematic break, list item, or HTML block.

For my taste too, the syntactic side-conditions for {string} in ATX headings are so much weaker than those applicable in setext headings that one can in fact ask why this is required.

It seems that the weakest possible requirement on {string} in setext headings would be that {string} is not blank: that is, contains at least one graphical character.

It would then be IMO appropriate to require the {string} line and hence the setext heading to be preceded by a blank line (which I would consider reasonable anyway). Consequently, when parsing CommonMark, recognition of a line following a blank line as (the starting line of) a:

code fence,
ATX heading,
block quote,
thematic break,
list item, or
HTML block

would be “suppressed” iff the next line is a setext heading underline.

Algorithmically, this would require only look-ahead over one input line, and would only be executed in the transition from blank lines to block of non-blank lines.
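A minimal sketch of this proposed rule, under stated assumptions: the start of the document counts as "preceded by a blank line", only single-line headings are considered, and container nesting is ignored. The function name `find_setext_headings` is hypothetical.

```python
import re

SETEXT_UNDERLINE = re.compile(r"^ {0,3}(=+|-+) *$")

def find_setext_headings(lines):
    """Return (line_index, level, text) for every non-blank line that follows
    a blank line (or starts the document) and is directly followed by a
    setext underline, regardless of what the line would otherwise start."""
    found = []
    for i in range(len(lines) - 1):
        after_blank = i == 0 or not lines[i - 1].strip()
        text = lines[i]
        if not after_blank or not text.strip():
            continue
        underline = SETEXT_UNDERLINE.match(lines[i + 1])  # one line of lookahead
        if underline:
            level = 1 if underline.group(1)[0] == "=" else 2
            found.append((i, level, text.strip()))
    return found

if __name__ == "__main__":
    print(find_setext_headings(["1. Juli", "------", "", "- Event 1"]))
    print(find_setext_headings(["", "---"]))
```

The lookahead is only consulted at the first non-blank line after a blank one, and a blank {string} never matches, so a line of hyphens on its own can still be a thematic break.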

In this way, the “rewrite rule” would in fact “always” work—except in the “degenerate” case when {string} is a null string, or consists of white space only. But it seems that this exception can not be avoided in order to still allow recognition of a thematic break (formerly known as <HR>, or harsh rupture ) comprising only HYPHEN-MINUS characters.

jgm · January 5, 2016, 10:41pm

I worry, though, that this case – numbered section headings – is exceptionally common, so asking authors to escape here is onerous. Keep in mind, also, that there may be a large number of existing Markdown documents that have numbered setext section headings, and these might all be broken by the current CommonMark behavior.

codinghorror · January 5, 2016, 11:03pm

Maybe, but since we’ve decided that

#heading

is no longer a heading (the space is required), some similar adjustment to headers in the form of

2. Second level heading
---

wouldn’t be too surprising, would it? Headers are, by definition, not super common elements in a document, so changing them is a bit easier than something that affected every para in a document.

tin-pot · January 5, 2016, 11:32pm

Sorry, but I have to ask: how are “1986.␣What a great season.” and “2.␣Second level heading” examples for numbered section headings? I thought they were ordered list items the moment before?

If one has the (IMO misguided, although e.g. IEEE does it too) desire to ignore most modern style guides and to use the string “2.␣Second level heading” as a section heading, then I’d say it is only fair to require the input syntax:

2\.␣Second level heading
-------------------------

In any case, there’s still the ATX syntax, where no escaping is needed even today:

##␣2.␣Second level heading␣##

So that desire seems a weak reason to change the rules, for my taste.