Spec section 5.1: Ref-impl bug and/or unclear spec?

tin-pot · January 10, 2016, 5:46pm

Section 5.1 gives the rules for blockquotes.

Consider (a) this input text:

1.  top level
    -   sub level
    -   sub level
2.  top level

and (b) this input text:

> 1.  top level
      -   sub level
      -   sub level
> 2.  top level

Reading the description of the “basic case”, together with the details for “laziness”, it seems to me that (b) is a block quote wrapping (a) as content. The reference implementation (announced as version 0.21.0 in BabelMark) produces—as expected—a nested list for the input text (a); but for input text (b) it insists on seeing the two sub level lines as a code block, producing:

<blockquote>
    <ol>
        <li>top level</li>
    </ol>
</blockquote>
<pre><code>-   sub level
-   sub level
</code></pre>

<blockquote>
    <ol start="2">
        <li>top level</li>
    </ol>
</blockquote>

The implementation does the same even if two spaces are removed (c) from each of the sub level lines’ indent (to account for the absence of the “>␣” block quote marker, so to say).
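
To make the relationship between inputs (a), (b) and (c) concrete, here is a small Python sketch that derives (b) and (c) from (a). The helper names `quote` and `drop_markers` are made up for this illustration and do not come from any CommonMark implementation; the sketch only manipulates strings and does not parse anything.

```python
def quote(lines):
    """Prefix every line with the block quote marker '> '."""
    return ["> " + line for line in lines]


def drop_markers(quoted, indices, keep_width):
    """Remove the '> ' marker from the lines at the given indices.

    keep_width=True replaces the marker with two spaces (input (b));
    keep_width=False deletes it outright (input (c)).
    """
    out = []
    for i, line in enumerate(quoted):
        if i in indices and line.startswith("> "):
            line = ("  " if keep_width else "") + line[2:]
        out.append(line)
    return out


# Input (a) from the post above.
text_a = [
    "1.  top level",
    "    -   sub level",
    "    -   sub level",
    "2.  top level",
]

fully_quoted = quote(text_a)  # every line carries a marker
text_b = drop_markers(fully_quoted, {1, 2}, keep_width=True)
text_c = drop_markers(fully_quoted, {1, 2}, keep_width=False)

print("\n".join(text_b))
print("\n".join(text_c))
```

Whether dropping those two markers preserves the meaning of the fully quoted text is exactly the question raised in this thread.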

The (presumably, as there’s no version indication) current commonmark.js implementation, available as the dingus, sees neither a code block nor an unordered list in inputs (b) and (c), which is still not what I would expect [this matches the result produced by cmark 0.22].

<blockquote data-sourcepos="3:1-6:15">
<ol data-sourcepos="3:3-6:15">
<li data-sourcepos="3:3-5:19">top level
-   sub level
-   sub level</li>
<li data-sourcepos="6:3-6:15">top level</li>
</ol>
</blockquote>

Is this the behavior intended by the specification?
If so, how can it be inferred from the wording in section 5.1 (and/or elsewhere)?
Do authors have to write explicit block quote markers in front of the two sub level lines here?
If so, what is the rationale for this?

jgm · January 10, 2016, 6:41pm

No. The laziness rule defines a possible transformation: it says that in some cases you can move from

> text1
> text2

to

> text1
text2

But whether you can do this depends on whether text2 is a “paragraph continuation line,” that is, on whether in the original case it would be parsed as a continuation of a preceding paragraph. You don’t have that in this case, because the relevant line would be parsed as a list item.

Current (dev) versions of cmark and commonmark.js both parse your text (b) as:

<blockquote>
<ol>
<li>top level
-   sub level
-   sub level</li>
<li>top level</li>
</ol>
</blockquote>

So does the dingus.

I have just updated the version of commonmark.js in BabelMark2.

tin-pot · January 10, 2016, 7:48pm

But whether you can do this depends on whether text2 is a “paragraph continuation line,” that is, on whether in the original case it would be parsed as a continuation of a preceding paragraph. You don’t have that in this case, because the relevant line would be parsed as a list item.

Errrm. Okay. A blockquote can’t have a lazy continuation line which is also a list item then, but only if this is the first lazy continuation line? Otherwise, a lazy continuation line will be recognized as a list item:

> 1.  top level
foo
  -   sub level
  -   sub level
> 2.  top level

Current (dev) versions of cmark and commonmark.js both parse your text (b) as: [ not an unordered sub-list ]

That’s what the dingus did, and cmark 0.22 too.

So, trying to answer my own questions based on the information you gave:

1. Is this the behavior intended by the specification?

Yes (not the code block output, but the “no unordered sub-list here” output is intended, yes).

2. If so, how can it be inferred from the wording in section 5.1 (and/or elsewhere)?

By “simply” checking whether in the original case it would be parsed as a continuation of a preceding paragraph.

The summary rule is: A list item line (without a block quote marker) will be recognized as such

everywhere
in any block that
started “vanilla”,
or as a list item,
or as a heading,
or as a blockquote,

but not if the preceding line has a blockquote marker.

3. Do authors have to write explicit block quote markers in front of the two sub level lines here?

Because you don’t have that in this case, my guess is: Yes.

4. If so, what is the rationale for this?

I still don’t know. My guess would be the “uniformity principle” at work again. Compared with other parsers’ results, and with the maybe not quite so unfounded expectation of seeing an unordered list output, the stress seems to be on “principle”, not “uniformity”.

jgm · January 11, 2016, 12:16am

No, in your example the third and fourth lines aren’t lazy continuation lines. Note that the unordered list is not inside the block quote.

Think about it this way. Laziness is a convenience for the writer (not the reader – for the reader it’s better to avoid laziness). So the way to think of it is from the writer’s point of view. Laziness is a transformation you can perform on a non-lazy document that creates a document with the same meaning. The transformation is this: if you’ve got paragraph continuation lines with block quote markers in front of them, you can remove the block quote markers.

On lines with indicators of new block structure (i.e., non-paragraph continuation lines), you can’t remove the block quote markers. That would just cause too much confusion. Note that current implementations disagree about this.
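
The transformation described above can be sketched in Python roughly as follows. This is only an illustration, not how cmark or commonmark.js work: `STARTS_BLOCK`, `split_marker` and `lazify` are hypothetical names, and the sketch assumes the quote contains only plain paragraphs and list items whose first line carries text. It ignores indentation context and treats any line that looks like a block start as one, so it errs on the side of keeping markers.

```python
import re

# Rough test for "this line opens a new block" (an approximation, not the
# spec's algorithm; the exact set of interrupting constructs is spec-defined).
STARTS_BLOCK = re.compile(
    r"""
      (?:[-+*]|\d{1,9}[.)])(?:[ \t]|$)   # bullet or ordered list marker
    | \#{1,6}(?:[ \t]|$)                 # ATX heading
    | [<>]                               # nested quote or possible HTML block
    | `{3,} | ~{3,}                      # code fence
    | ([-*_])(?:[ \t]*\1){2,}[ \t]*$     # thematic break
    | (?:=+|-+)[ \t]*$                   # setext heading underline
    """,
    re.VERBOSE,
)


def split_marker(line):
    """Return (has_marker, content) for one line of a block quote."""
    m = re.match(r" {0,3}> ?", line)
    if m:
        return True, line[m.end():]
    return False, line


def lazify(lines):
    """Drop '>' markers from paragraph continuation lines only."""
    out = []
    paragraph_open = False  # does the previous line leave a paragraph open?
    for line in lines:
        has_marker, content = split_marker(line)
        text = content.strip()
        is_continuation = (
            paragraph_open and bool(text) and not STARTS_BLOCK.match(text)
        )
        out.append(content if has_marker and is_continuation else line)
        # Assumption: any non-blank line leaves a paragraph open.
        paragraph_open = bool(text)
    return out


print(lazify(["> text1", "> text2"]))  # ['> text1', 'text2']

quoted_a = [
    "> 1.  top level",
    ">     -   sub level",
    ">     -   sub level",
    "> 2.  top level",
]
print(lazify(quoted_a) == quoted_a)  # True: every line starts a new block
```

Under these assumptions the fully quoted version of input (a) is left untouched, because each of its lines begins a list item rather than continuing a paragraph.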

tin-pot · January 11, 2016, 12:57am

Yes, yes: that’s in the spec, and [example 186][ex-186] illustrates exactly this …

Laziness is a transformation you can perform on a non-lazy document that creates a document with the same meaning.

That’s precisely how I think of it.

The transformation is this: if you’ve got paragraph continuation lines with block quote markers in front of them, you can remove the block quote markers.

That’s precisely what I want to do—I’m that lazy.

On lines with indicators of new block structure (i.e., non-paragraph continuation lines), you can’t remove the block quote markers. That would just cause too much confusion.

To be honest, I find the precondition for removing the block quote marker not exactly “non-confusing”:

If a string of lines Ls constitute a block quote with contents Bs, then the result of deleting the initial block quote marker from one or more lines in which the next non-whitespace character after the block quote marker is paragraph continuation text is a block quote with Bs as its content. Paragraph continuation text is text that will be parsed as part of the content of a paragraph, but does not occur at the beginning of the paragraph.

Could it be that much of the cause for confusion (which the lazy continuation line rule is supposed to avoid?) stems—again—from allowing nested constructs to begin right in the middle of an ordinary block?

I’ll have to think about this; for now I’d like to recap:

- I find the lazy continuation rules for blockquote pretty unintelligible: both hard to understand, and even harder to relate to a rationale;
- There’s quite some variation in the handling of this issue in other implementations.

Thanks for your explanations, though!
[ex-186]: CommonMark Spec (CommonMark spec, ex. 186)

tin-pot · January 11, 2016, 1:46am

The disagreement you point to seems to be about ATX headings right in the middle of a block, and not so much about blockquote.

Try this—simple—one:

> 1. one
2. two

I see much less disagreement here:

Parsers that include the “two” line in the list:
showdown 0.3.1; marked 0.2.6; Markdown.pl 1.0.1; Markdown.pl 1.0.2b8; lunamark 0.2; pandoc (strict) 1.16; RedCarpet 2.1.1; RDiscount 1.6.8; PHP Markdown Extra 1.2.8; Python-Markdown 2.6.5; PHP Markdown 1.0.2; Minima 0.8.0a3_20140907; kramdown 1.2.0; Parsedown 1.6.0; s9e\TextFormatter (Fatdown/PHP) ; cebe/markdown GFM 1.1.0; cebe/markdown 1.1.0; cebe/markdown MarkdownExtra 1.1.0; pandoc 1.16; Blackfriday.

I would say that *every* relevant implementation represented in _BabelMark_ (except one!) is included in this list.

Parsers that exclude the line, and close the blockquote before it:
commonmark 0.21.0; cheapskate 0.1.0.1; league/commonmark 0.12.0; markdown-it 4.1.0; Gambas 3.8.90 [ edit: forgot Maruku 0.7.2; Maruku (Math-Enabled) 0.7.3.beta1 ].

Note that *commonmark* is (twice!) in this list, among crap like Gambas (excuse me, but look at its output!).

That’s no substitute for a decision and a clear and useful rule, of course. But I don’t think there can be any doubt what is “common”, and what a “highly compatible” specification should specify in this case.

Therefore in my view the question is rather: what rule or set of rules would imply this result, while being simple, general, etc.