Spec section 4.3 ("setext heading") - Ref-impl bug or unclear spec?

tin-pot · January 9, 2016, 11:49pm

In the description of setext headings (section 4.3), the specification states:

A setext heading underline is a sequence of “=” characters or a sequence of “-” characters, with no more than 3 spaces indentation and any number of trailing spaces. If a line containing a single “-” can be interpreted as an empty list items, it should be interpreted this way and not as a setext heading underline.

In this example:

Dolor
-

the condition written in the second sentence seems to apply: The line below “Dolor” contains a single “-” HYPHEN-MINUS character, and it can be interpreted as “an empty list items” [is that “an empty list item”?]

Thus the expected result seems to be that in the example the second line is not a setext heading underline, and thus the example is not a setext heading.

In this example the situation is similar, but less ambiguous:

Dolor
1.

The line below “Dolor” here is (and must be) “interpreted as an empty list item”.

The reference implementation translates the first example to

<h2>Dolor</h2>

but the second to

<p>Dolor</p>
<ol>
  <li></li>
</ol>

Is this a bug in the reference implementation? Do I misread the specification? Is the specification wrong?
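
To make the ambiguity concrete, here is a minimal sketch that checks a single line against both readings. The two regular expressions are simplified assumptions based on the wording quoted above, not the reference implementation's actual rules, and the name `readings` is made up for illustration.

```python
import re

# Setext underline, following the quoted spec text: up to 3 spaces of
# indentation, a run of "=" or a run of "-", then optional trailing spaces.
SETEXT_UNDERLINE = re.compile(r"^ {0,3}(=+|-+) *$")

# Simplified (assumed) empty list item: a bullet marker or an ordered
# marker such as "1." or "1)" with nothing after it.
EMPTY_LIST_ITEM = re.compile(r"^ {0,3}([-+*]|\d{1,9}[.)]) *$")


def readings(line):
    """Return the set of block-level readings that the line admits."""
    found = set()
    if SETEXT_UNDERLINE.match(line):
        found.add("setext underline")
    if EMPTY_LIST_ITEM.match(line):
        found.add("empty list item")
    return found


for candidate in ["-", "1.", "---", "="]:
    print(repr(candidate), sorted(readings(candidate)))
```

Under these assumptions only the lone "-" admits both readings, which is exactly the case the second sentence of the spec text tries to settle.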

jgm · January 10, 2016, 1:23am

Yes, you’re right, this is problematic.

This bit of the spec was introduced in response to

Maybe too hastily. This needs rethinking. Any suggestions?

codinghorror · January 10, 2016, 1:28am

Why can’t we just require more than one - or = on a line to make a setext style header?

Header
===

header
---

Some of this “flexibility” is awfully destructive.

And we are already tightening the rules around headings with the space required here

#not-a-header
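
The stricter rule proposed above (more than one - or = on the underline) could be sketched roughly as follows. This is only an illustration: the pattern and the function name `is_setext_underline_strict` are hypothetical, and the 3-space indentation limit is carried over from the current spec text.

```python
import re

# Hypothetical stricter rule: at least two "=" or at least two "-",
# with at most 3 spaces of indentation and optional trailing spaces.
STRICT_UNDERLINE = re.compile(r"^ {0,3}(={2,}|-{2,}) *$")


def is_setext_underline_strict(line):
    """True if the line would count as an underline under the stricter rule."""
    return STRICT_UNDERLINE.match(line) is not None


for candidate in ["===", "---", "-", "="]:
    print(repr(candidate), is_setext_underline_strict(candidate))
```

With this rule a lone "-" is never an underline, so it no longer competes with the empty list item reading.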

tin-pot · January 10, 2016, 1:37am

Why can’t we just require more than one - or = on a line to make a setext style header?

That would be a simple and reasonable solution (after all, the code fence line has a minimum count, too).

It would, however, also be an incompatible solution.

Given that the syntactic construct has two possible interpretations, and that “empty list items” are not exactly common, or “portable”, I honestly see no merit in preferring an “empty list item” interpretation over the “setext heading” interpretation.

If you really want an “empty” list item, you can always (and even somewhat more “portable” and clear) write

Dolor
- <span></span>

Yes, technically this is not really an empty item. In CommonMark you could write alternatively

Dolor
- <!-- empty item -->

Depending on the implementation, also not “really empty”.

With my suggested (don’t remember if I actually wrote about it here) “discardable input tag”, the empty comment declaration <!>, this would look like this:

Dolor
- <!>

In this case the <!> would be “seen” in parsing, but discarded when it comes to “inline” text. (The <!> is valid SGML, but not XML or HTML, so discarding it is no harm anyway.) I think <!> could be useful in other places, too.

jgm · January 10, 2016, 2:36am

I’d be in favor of that, but worry about compatibility. I suspect that it’s not too uncommon to use single-character underlines, just because people are lazy. It would be nice if we had data on this.

tin-pot · January 10, 2016, 2:41am

Any hunch how common it is to

- use an empty list item
- in an unordered list
- as the first item
- after a “vanilla” text line (i.e. without a preceding blank line)?

I have no data either, but my wild guess would be that “using a single-character setext underline” is more common than this combination of circumstances.

Cetero censeo: Recognizing list items in a “vanilla paragraph” is a bad idea (and that is what is going on here, after all).

codinghorror · January 10, 2016, 4:07am

Given that we are already introducing the breaking

#this-is-not-a-header

change, I don’t see any harm in tightening up the rules on setext headers to require more than one - or =.

As I have mentioned before, headers are by definition rare elements in a doc and thus pretty easy to fix, as they are a) obvious and b) big thematic breaks, so easy to find.

tin-pot · January 10, 2016, 4:20am

I don’t know whether this is an argument or a mere curiosity—but it’s in your position’s favor

Currently this:

foo
bar
-
baz

produces (in CommonMark and cheapskate only) a list with an empty item (I think that’s by design, and I hate it). So why shouldn’t this:

bar
-
baz

and this:

bar
-

too? (Only cheapskate sees empty list items in both cases already; the clear majority of parsers see a setext heading in both examples.)

Crissov · January 10, 2016, 12:27pm

It’s certainly uncommon for this to be the desired end result, but it happens quite often while typing. I’ve seen implementations with integrated live preview or preview-like syntax highlighting that indicate a heading at first when a line begins with hyphen-minus (and whitespace), but switch to a list item as soon as any other character is typed. It’s annoying, so I support a minimum number of - or = for Setext headings that is greater than 1.

tin-pot · January 10, 2016, 12:47pm

As I wrote, I find the whole idea of splitting a “vanilla-starting” paragraph into items and whatnot “after the fact” to be bogus. Thus I have no horse one way or the other in this race over just another irritating consequence of this idea.

In fact, I agree with you and @codinghorror that introducing one more little incompatibility with all the other Markdown dialects out there (in your case, to tame a particular GUI implementation’s annoying behavior) would likely not do any harm. For an appropriate definition of “harm”, that is.

jgm · January 10, 2016, 6:06pm

@codinghorror - do you have access to a large corpus of Markdown documents you could search for lines matching /^\s*-\s*$/?

It would be good to get a sense for how common these are.

It’s hard to see a good alternative to requiring 2 or more dashes in a setext header, given other decisions which are deeply embedded in the spec.
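
A corpus search along these lines could start from a small counting script like the one below. It is a rough sketch, not a finished analysis: the function name `count_single_dash_lines` and the split into "after text" and "after blank" are assumptions made for illustration, and a real survey would still need to inspect the hits by hand.

```python
import re

# The pattern suggested above: a line holding a single "-" and whitespace only.
SINGLE_DASH = re.compile(r"^\s*-\s*$")


def count_single_dash_lines(text):
    """Count lone "-" lines, split by whether the previous line has content.

    A lone "-" directly after a non-blank line is a candidate single-character
    setext underline; one after a blank line (or at the start of the text)
    can only be an empty list item.
    """
    counts = {"after_text": 0, "after_blank": 0}
    previous = ""
    for line in text.splitlines():
        if SINGLE_DASH.match(line):
            key = "after_text" if previous.strip() else "after_blank"
            counts[key] += 1
        previous = line
    return counts


sample = "Dolor\n-\n\n-\n\n- item\n"
print(count_single_dash_lines(sample))
```

Running it over every file in a corpus and summing the two counters would give a first estimate of how often each case occurs.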

codinghorror · January 10, 2016, 6:15pm

Well, kind of, I can get to the Creative Commons data dump of all Stack Exchange posts. Or all Discourse posts we host.

What would be a better source I wonder? All public github .md files, maybe?

jgm · January 10, 2016, 6:21pm

github .md files would be an excellent source, since they tend to include longer documents which are lacking in SO. Since these are public, I suppose we could grab them ourselves, or maybe ask Vicent Marti. I’d love to have this data.

douginamug · July 18, 2021, 10:38am

Did anyone manage to find a corpus of notes to analyse?

I would be interested to know how badly breaking it would be:

- to require multiple underlines for setext headers
- to require setext headers and/or underlines to start at the beginning of lines

I believe the empty, hyphenated list-item is a non-trivial issue, but we don’t see it, because people realize it looks weird, then avoid it. With the rise of interactive markdown pads for group meeting notes, where bullet points are left empty to fill later, solving this could have significant benefits.

I am motivated to do research if necessary!

vas · July 18, 2021, 11:00am

As jgm points out, github .md files. That’s a huge dataset! You could figure out how to take a random sample.

I would think this would be rare in GitHub README.md files, so a simple count of

- text
  -

occurrences would tell you a lot.

jgm · July 18, 2021, 6:42pm

Long ago I did search a corpus and found quite a few instances of a single - underline for setext headers.

douginamug · July 25, 2021, 12:47pm

I think I’ve come up with a method for random sampling:

1. Generate a list of random numbers between 0 and 200,000,000 (the approximate range of github.com repo IDs).
2. Scrape README.md from repos until the desired sample size is reached (10,000?):
   - curl -H "Accept: application/vnd.github.v3+json" https://api.github.com/repositories/<ID>
   - Many randomly generated IDs will correspond to private or deleted projects, so significantly more calls will need to be made than the desired sample size.

After that, I should be able to figure out how to do the appropriate regex searching.
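
The ID-sampling step described above might look roughly like this. It is a sketch under stated assumptions: the upper bound of 200,000,000 is the approximate figure given above, the URL template is the endpoint from the curl command, and `sample_repo_urls` is a made-up helper name. It only builds the list of URLs to request and does no network access.

```python
import random

# Endpoint taken from the curl command above; the upper bound is the
# approximate figure quoted in the post, not a verified limit.
API_URL = "https://api.github.com/repositories/{repo_id}"
MAX_REPO_ID = 200_000_000


def sample_repo_urls(n, seed=None):
    """Return n API URLs for distinct, uniformly sampled repository IDs."""
    rng = random.Random(seed)
    repo_ids = rng.sample(range(1, MAX_REPO_ID + 1), n)
    return [API_URL.format(repo_id=repo_id) for repo_id in repo_ids]


for url in sample_repo_urls(3, seed=2021):
    print(url)
```

Because many IDs will belong to private or deleted repositories, `n` would have to be noticeably larger than the target sample size, and fixing the seed keeps the sample reproducible.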

I would be interested to hear if an investigation as I’ve described would have the necessary weight to influence decisions (in any direction), and if not, what would need to be changed to make it so? @jgm @vas (I ask because if the result would in any case be inconsequential, it’s not worth me doing the work)