Am I Missing Something? Empty Lists and HTML Rendering

jackdw · December 4, 2020, 6:46pm

I’m going through a number of cases with my parser, making sure the edge cases are all captured.

I ran into a series of cases where I don’t understand something. I’m including babelmark links for repeatability. Of the 4 cases below, case 4 is the one I cannot figure out.

Based on what I have read in the specification, as proved out by the first 3 examples, the second line is indented properly to get considered for the start of a sublist. But in the case of an empty sublist, it seems that it is being ignored in favor of being considered as straight text.

Did I miss something in the specification?

Case 1 (good): Empty sublist within empty list

<ol>
<li>
<ol>
<li></li>
</ol>
</li>
</ol>

Case 2 (good): Normal sublist within normal list

<ol>
<li>abc
<ol>
<li>abc</li>
</ol>
</li>
</ol>

Case 3 (good): Normal sublist within empty list

<ol>
<li>
<ol>
<li>abc</li>
</ol>
</li>
</ol>

Case 4 (huh?): Empty sublist within normal list

<ol>
<li>abc
1.</li>
</ol>

mity · December 4, 2020, 8:23pm

See chapter 5.2 of the specification, and read the exceptions under the basic case multiple times (at least I needed to re-read them many times when implementing MD4C).

One of the rules is that a list item mark candidate followed by end-of-line or end-of-file cannot form a list item if it would interrupt a paragraph.

That’s what makes the difference in your examples: Empty parent list item works because there is no paragraph and hence no paragraph interruption.

Similarly, if you put an empty line between the parent list item and the child item, then it should work too, because there is no paragraph interruption anymore, the paragraph is ended by the blank line:

1. paragraph

   1.

Or, to make it clearer, get rid of the parent list, and just focus on the different behavior of these two examples:

paragraph
1.

(not a list item, because list item mark is followed by end-of-line and hence cannot interrupt the paragraph)

versus

paragraph

1.

(a valid list item)

jgm · December 4, 2020, 10:50pm

See under 5.2:

Exceptions:

When the first list item in a list interrupts a paragraph—that is, when it starts on a line that would otherwise count as paragraph continuation text—then (a) the lines Ls must not begin with a blank line, and (b) if the list item is ordered, the start number must be 1.

In this case, you have (b) but not (a).
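
The two conditions in that exception can be sketched in Python. This is only an illustration under stated assumptions, not code from any CommonMark implementation: the function name and the regular expression are hypothetical, only ordered list markers are handled, and the start number is compared as an integer.

```python
import re

# Hypothetical pattern: up to 3 spaces of indentation, 1-9 digits, "." or ")",
# then either end of line or whitespace followed by the rest of the line.
ORDERED = re.compile(r"^ {0,3}(\d{1,9})[.)](?:[ \t]+(.*))?$")


def ordered_marker_may_interrupt_paragraph(line):
    """Return True if `line` could start an ordered list in the middle of a paragraph."""
    match = ORDERED.match(line)
    if match is None:
        return False  # not an ordered list marker at all
    content = (match.group(2) or "").strip()
    starts_with_blank = content == ""          # condition (a) is violated
    starts_at_one = int(match.group(1)) == 1   # condition (b)
    return (not starts_with_blank) and starts_at_one


for candidate in ["1.", "1. abc", "2. abc"]:
    print(repr(candidate), ordered_marker_may_interrupt_paragraph(candidate))
```

Only `1. abc` satisfies both conditions. The bare `1.` from Case 4 satisfies (b) but not (a), so under this sketch it stays part of the paragraph.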

jackdw · December 6, 2020, 5:46pm

Thanks for the help with that. I had that working well for the base case (starting a list), but starting a sub-list needed a bit of help.

Based on that, I had a couple of questions, and another request to check whether my reading of the specification is correct.

Item 1: Does the specification need more examples?

When testing, I easily hit the case for a list where this happens. But it wasn’t until I created sublist cases that I was able to figure out that sublists have a problem. While I can see that part of that is because of how I implemented my parser, would it be useful to add cases of sublists with (a) and (b), then (a) not (b), then (b) not (a) for comparison? I found that comparison very helpful.

Item 2: What is “start number”?

I tried a couple of things with this, and just have one question: what is “start number”?

First:

If the list item is ordered, then it is also assigned a start number, based on the ordered list marker.

Then:

if the list item is ordered, the start number must be 1.

From the reference implementation, it appears that “1” is a string. If I use a “start number” of “01”, it is not recognized the same as “1”.

Is this intentional? If so, would a slight change to the specification be appropriate? Something like “the start number must be the character 1.”?

Item 3: Example 282

I got everything else working, but then I had problems with this example:

1. foo
2. bar
3) baz

This is my reading… please correct me if I am wrong.

In the Markdown, the third line should start a new list. This happens because the list marker has been changed from . to ). Following through with the rest of the rules, the rules discussed in this thread come into play, as bar opens a paragraph, and by the time we get to the third line, it is still open.
Hence:

When the first list item in a list interrupts a paragraph—that is, when it starts on a line that would otherwise count as paragraph continuation text

Unless I am misunderstanding some information from above, the above information comes into play.
Part a) says that the lines must not begin with a blank line. Check. It starts with baz. Part b) says that the start number must be 1. There is my problem. When I get to that point, in my head and in code, 3 != 1, and that fails, hence it is a continuation of the bar line, not a list start.

Did I read something wrong in this? (again?).

mity · December 6, 2020, 6:15pm

I think you are mixing precedence of two concepts: Ending the list item as a whole and a paragraph continuation.

When the parser sees the 3rd line, first of all it has to decide whether the line is part of the list item started on the preceding line. It is not, because of the rule “if the line can be interpreted as a list item, it is not a lazy continuation and is interpreted as a list item”.
Then the parser needs to test whether it is a new item of the same list: no, it is not (the marker delimiters do not match).
So, the parser has to conclude that the 3rd line starts a completely new block. It then analyzes that block and eventually sees that it is an ordered list.

The rule about list starting with 1 does not play any role here because the paragraph is already implicitly ended when the list item it lives in ends.
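
The order of those decisions can be sketched roughly as follows. This is a simplified illustration, not the algorithm of MD4C or of the reference implementations: `classify_line` and its parameters are hypothetical names, only ordered list markers are recognized, and every other block type is ignored.

```python
import re

# An ordered list marker: 1-9 digits, then "." or ")", then whitespace or end of line.
MARKER = re.compile(r"^(\d{1,9})([.)])(?:[ \t]|$)")


def classify_line(line, open_item_delimiter, open_item_content_indent):
    """Classify a line that follows a paragraph inside an open ordered list item."""
    if not line.strip():
        return "blank line"
    stripped = line.lstrip(" ")
    indent = len(line) - len(stripped)
    # Step 1: a line indented to the item's content column belongs to the item.
    if indent >= open_item_content_indent:
        return "content of the open list item"
    marker = MARKER.match(stripped)
    # A lazy continuation is only considered when the line cannot start a list item.
    if marker is None:
        return "lazy continuation of the paragraph"
    # Step 2: the same delimiter means another item of the same list.
    if marker.group(2) == open_item_delimiter:
        return "new item of the same list"
    # Step 3: otherwise the open list ends (closing its paragraph) and a new block
    # starts, so the "start number must be 1" exception is never consulted.
    return "start of a new list"


# Third line of the example, following "2. bar" (delimiter ".", content at column 3).
print(classify_line("3) baz", ".", 3))
```

For `3) baz` the sketch reaches step 3 and reports the start of a new list. With `3. baz` it would stop at step 2, and with plain `baz` it would fall back to a lazy continuation.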

The specification sometimes makes it quite complicated to get all the precedence among the many rules it provides. So if you’re implementing a new parser implementation, you may perhaps find this helpful: https://github.com/commonmark/commonmark-spec/issues/438#issuecomment-476867464 (note the Lazy continuation line has very low precedence).

jgm · December 6, 2020, 7:00pm

Jack via CommonMark Discussion writes:

When testing, I easily hit the case for a list where this happens. But it wasn’t until I created sublist cases that I was able to figure out that sublists have a problem. While I can see that part of that is because of how I implemented my parser, would it be useful to add cases of sublists with (a) and (b), then (a) not (b), then (b) not (a) for comparison? I found that comparison very helpful.

If you like, you can propose new examples (maybe on the GitHub tracker). Note that the cases are supposed to serve as illustrative examples, not exhaustive tests. (If we included exhaustive tests it would make the spec unreadable.)

I tried a couple of things with this, and just have one question: what is “start number”?

It’s the number to start with in an enumerated list (start attribute in HTML ol).

if the list item is ordered, the start number must be 1.

From the reference implementation, it appears that “1” is a string. If I use a “start number” of “01”, it is not recognized the same as “1”.

Is this intentional? If so, would a slight change to the specification be appropriate? Something like “the start number must be the character 1.”?

It’s an int in the C reference implementation. Anyway, it’s supposed to be a number. If the JS impl does otherwise, you could submit an issue there.
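
A minimal sketch of treating the start number as a number rather than a string. The helper names are hypothetical and this is not taken from either reference implementation; it only assumes that an ordered list marker is 1 to 9 digits followed by `.` or `)`, and that the start number maps to the `start` attribute of the HTML `ol` element.

```python
import re

# Hypothetical pattern for a bare ordered list marker such as "1." or "01)".
ORDERED_MARKER = re.compile(r"(\d{1,9})([.)])")


def parse_start_number(marker):
    """Return the start number of an ordered list marker as an int."""
    match = ORDERED_MARKER.fullmatch(marker)
    if match is None:
        raise ValueError(f"not an ordered list marker: {marker!r}")
    return int(match.group(1))  # leading zeros are dropped: "01" -> 1


def ol_open_tag(start_number):
    """Render the opening tag, omitting the start attribute when it is 1."""
    return "<ol>" if start_number == 1 else f'<ol start="{start_number}">'


print("01" == "1")                                        # string comparison
print(parse_start_number("01.") == parse_start_number("1."))  # numeric comparison
print(ol_open_tag(parse_start_number("3)")))
```

Compared as strings, `01` and `1` differ. Compared as integers they are the same start number, which is the reading described above.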

I got everything else working, but then I had problems with this example:

1. foo
2. bar
3) baz

This is my reading… please correct me if I am wrong.

In the Markdown, the third line should start a new list. This happens because the list marker has been changed from . to ). Following through with the rest of the rules, the rules discussed in this thread come into play, as bar opens a paragraph, and by the time we get to the third line, it is still open.

Correct.

Hence:

When the first list item in a list interrupts a paragraph—that is, when it starts on a line that would otherwise count as paragraph continuation text

Unless I am misunderstanding some information from above, the above information comes into play.

I agree, this formulation really isn’t adequate; it’s not clear why this doesn’t apply here.

The intent was to have language that applies to the starting of list items in the middle of a paragraph that isn’t a direct child of a list item at the same level. There should be a better way to put this; maybe you could put something on the commonmark-spec GitHub tracker so we don’t lose track of the issue.

jgm · December 6, 2020, 7:03pm

It’s possible to explain this case in terms of a procedure for parsing. But the idea behind the spec was to state declaratively what combinations of lines make up what sorts of blocks. That turned out to be quite difficult!

mity · December 6, 2020, 7:25pm

@jgm I have no doubt writing and maintaining a specification of such complexity is a real challenge. Especially as (I assume) the vast majority of its readers are Markdown document writers with limited technical background and not people implementing new parsers.

My comment was posted in the hope it would help @jackdw, and in no way it was meant to defame the specification or you, and I’m truly sorry, if I offended you.

I believe its very existence, all the details it covers, the rich set of examples with great coverage it provides, as well as all your help whenever I needed it, has played an important role in MD4C’s development and success and I can never thank you enough for it.

jackdw · December 7, 2020, 2:54am

Btw, if I didn’t mention it already, I view it in a similar light. For me, a specification is an evolving thing that has a specific focus. As light shines on a given aspect of that specification, it either stands the test of time, or the specification grows and becomes more clear based on that light. Having written specifications before, I have admiration for anyone who takes a similar action, even more for people like @jgm who not only took that action, but engages in conversation on it.

And finding that balance between clarity and “oh no, not another example” is tough.

With that in mind, I have some debug from the process I am using, and I want to similarly shine a light on it. As @jgm did to help me with another issue, I share it for this issue:

DEBUG    pymarkdown.list_block_processor:list_block_processor.py:222 is_olist_start>>result>>True
DEBUG    pymarkdown.list_block_processor:list_block_processor.py:225 is_in_paragraph>>True
DEBUG    pymarkdown.list_block_processor:list_block_processor.py:227 at_end_of_line>>False
DEBUG    pymarkdown.list_block_processor:list_block_processor.py:250 xx>>)!=.
DEBUG    pymarkdown.list_block_processor:list_block_processor.py:262 is_first_item_in_list>>True
DEBUG    pymarkdown.list_block_processor:list_block_processor.py:264 olist_index_number>>3
DEBUG    pymarkdown.list_block_processor:list_block_processor.py:266 is_not_one>>True
DEBUG    pymarkdown.list_block_processor:list_block_processor.py:270 is_in_para>>True>>EOL>False>is_first>True
DEBUG    pymarkdown.list_block_processor:list_block_processor.py:278 is_start>>False
DEBUG    pymarkdown.list_block_processor:list_block_processor.py:280 is_olist_start>>result>>False

For 282, this section picks up in the function is_olist_start. Once I have worked through everything, that function has determined that I have an ordered list start (is_olist_start>>result>>True). At that point, the code has determined that the text on that 3rd line is eligible to start/continue an ordered list. From there, the debug output relates that the current token stack denotes a paragraph is in progress (is_in_paragraph>>True) and whether there is no more text on the line after the list start/continue (at_end_of_line>>False), i.e. whether this is an empty list start. After that, because the list character has changed (xx>>)!=.), that newly eligible start/continuation is determined to be the start of a new list (is_first_item_in_list>>True). Finally, a check is made against the index number (olist_index_number>>3) as to whether or not the index number is “1” (is_not_one>>True). (I haven’t factored in the previous discussion about “1” versus int(“1”) yet.)

Given all of that… the last piece of code determines whether or not this was a “false start”.

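            # Reject a candidate that would interrupt a paragraph when it is the
            # first item of a new list and is either empty or not numbered 1.
            if (
                is_in_paragraph
                and (at_end_of_line or is_not_one)
                and is_first_item_in_list
            ):
                is_start = False
                LOGGER.debug("is_start>>%s", str(is_start))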

In this case:

from debug is_in_paragraph is True
from debug at_end_of_line is False, but is_not_one is True
from debug is_first_item_in_list is True
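
To make that evaluation concrete, the condition can be wrapped in a small function and fed the values from the debug output. This is only a sketch: the name `is_false_start` is hypothetical and does not exist in pymarkdown, and the logic is copied from the snippet above without any correction.

```python
def is_false_start(is_in_paragraph, at_end_of_line, is_not_one, is_first_item_in_list):
    """Mirror the condition above: True means the candidate list start is rejected."""
    return (
        is_in_paragraph
        and (at_end_of_line or is_not_one)
        and is_first_item_in_list
    )


# Values from the debug output for the third line, "3) baz".
print(is_false_start(True, False, True, True))

# Same line with the marker replaced by "1)": is_not_one becomes False.
print(is_false_start(True, False, False, True))
```

With the debug values the function returns True, so the candidate is rejected. With `1)` in place of `3)` it returns False and the list start is accepted.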

Due to that logic, unless I am missing something in translation, which once again is possible, it should not be a list start. @jgm, regarding your comment

The intent was to have language that applies to the starting of list items in the middle of a paragraph that isn’t a direct child of a list item at the same level.

I agree with this. Keep in mind that replacing the 3) on line 3 with 1) also resolves that issue. Which is more in keeping with what you are thinking?

For what it is worth to both of you, I am grateful that both of you are conversing with me on this. Please know that.

jackdw · December 7, 2020, 2:54am

@mity, I tried to understand your logic, but fell down on step 1. When line 3 is started, the logic can tell that it is in a paragraph, but without the extra logic, I don’t believe it can tell whether that 3rd line is part of the previous list item. From my viewpoint, the best it can do is to denote that it is “eligible” to be a part of that line, not that it is or is not. From there, once it has been determined to be possible, the logic then figures out if it is a valid list start/continuation. From there, it determines whether it satisfies all the conditions or not. Did I miss something in your explanation?

mity · December 7, 2020, 7:45am

Not sure what I can add to make it clearer, so just a few assorted notes:

A paragraph has the lowest precedence of all block types. It’s a kind of fallback block type: only what cannot be interpreted in another way is a paragraph. Actually, the paragraphs in the spec are defined that way.
Once you realize the 3rd line is not another line of the preceding paragraph (it could only be one as a lazy continuation line because it is not indented, but the spec explicitly declares it cannot be if the line could be seen as the beginning of another list), then it is known it has to be the 1st line of a completely new block and, due to the zero indentation, a top-level block. Yes, with just that knowledge, it could still be another paragraph, but it could also be something else entirely. When you come to this point you have to (re)try all the possible interpretations as if a blank line preceded it.

jgm · December 7, 2020, 6:16pm

My comment was posted in the hope it would help @jackdw, and in no way it was meant to defame the specification or you, and I’m truly sorry, if I offended you.

No worries, no offense was taken, and I’m sorry if my response suggested otherwise!