Blank lines before lists, revisited

I don’t like the hack giving special treatment to 1. either, and I’d like to remove it. Here is the difficulty. Most Markdown implementations that require a blank line before a list starts only do this at the outer level. Thus, in

paragraph A
1.  list item

1.  paragraph B
    1. list item

the list item is allowed to break paragraph A (which is not itself in a list item), but not paragraph B (which is contained in a list item). BabelMark2 example

The way the current CommonMark spec is constructed, it’s not possible to build in this kind of context-sensitivity. The rule for list items has the form: if you’ve got some lines that constitute a sequence of blocks, then the result of indenting them and adding a list marker is a list item containing these same blocks.

That implies that the two cases above must be treated the same way. The contents of the list item should be exactly what you get if you take off the list marker and deindent the lines. So, either the blank line must be required in both places, or in neither.
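
To make the rule concrete, here is a minimal Python sketch of "take off the list marker and deindent the lines". The names `MARKER` and `strip_list_item` are hypothetical, chosen for illustration; they are not taken from the spec or from any reference implementation. The sketch ignores lazy continuation lines and the spec's limits on marker spacing.

```python
import re

# Hypothetical pattern: an ordered marker ("1." / "1)") or a bullet, then spaces.
MARKER = re.compile(r'^(\d{1,9}[.)]|[-+*])( +)')

def strip_list_item(lines):
    """Drop the list marker from the first line and deindent the
    remaining lines by the width of marker plus following spaces."""
    m = MARKER.match(lines[0])
    if m is None:
        raise ValueError("first line does not start with a list marker")
    width = m.end()
    out = [lines[0][width:]]
    for line in lines[1:]:
        if line.strip() == "":
            out.append("")              # blank lines pass through
        elif line[:width].strip() == "":
            out.append(line[width:])    # indented far enough: deindent
        else:
            raise ValueError("line is not indented under the list item")
    return out
```

Applied to the second case above, `strip_list_item(["1.  paragraph B", "    1. list item"])` yields the same two lines as the first case, with `paragraph B` followed directly by `1. list item`. That is why the rule forces both cases to parse the same way.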

I’d actually be happy to require it in both places, but this would be more revisionary and break more existing documents.


It looks as if the hack of treating ordered list markers differently in some contexts is harder than I thought to integrate with the parsing strategy in our reference implementations: see https://github.com/jgm/cmark/issues/204

I’m not sure, actually, how to handle this, but it may be that this decision needs to be rethought. (It always seemed ugly.)

I feel like this matters because it is arguably one of the things normal people mess up constantly in Markdown, and I mean literally daily:

blah blah blah
1. list item one
2. list item two

Then I have to go in and constantly fix their markup:

blah blah blah

1. list item one
2. list item two

However, if adding this rule is irreconcilable and impossible given the weird unspecified state of classic markdown, adding more irreconcilable weirdness is probably not a great idea.

So if we’ve definitively concluded that this is an unsolvable problem, then we should just revert to the old behavior, which requires a blank line above a list. I’m OK with that, if you feel we’ve exhausted all avenues here, @jgm.

One solution that can work is for a block quote to be logically extended until a blank line. This way, any lines not separated by a blank line will be treated as continuation lines or new block elements, but still inside the block quote. Terminating a block quote requires a blank line.

I added this option in my implementation for compatibility with other markdown processors and found it to be more intuitive when it comes to handling block quote continuation lines and new element lines.
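
As a rough illustration of the "block quote runs until a blank line" idea, here is a small Python sketch. It is not the implementation mentioned above; `group_quotes` is a hypothetical helper that only groups lines into chunks and does no further block parsing.

```python
def group_quotes(lines):
    """Group lines into ("quote", [...]) and ("other", [...]) chunks.
    A chunk opened by a ">" line stays a quote until the next blank line,
    even if later lines in it lack the ">" prefix."""
    chunks = []
    current = None  # the chunk currently being extended, if any
    for line in lines:
        if line.strip() == "":
            current = None          # a blank line terminates any open chunk
            continue
        is_quote_line = line.lstrip().startswith(">")
        if current is None or (current[0] == "other" and is_quote_line):
            current = ("quote" if is_quote_line else "other", [])
            chunks.append(current)
        text = line
        if current[0] == "quote" and is_quote_line:
            text = line.lstrip()[1:]        # drop the ">"
            if text.startswith(" "):
                text = text[1:]             # and one optional space
        current[1].append(text)
    return chunks
```

With this grouping, a line such as `2. because ...` that directly follows a `>` line lands in the same quote chunk whether or not it carries its own `>` prefix.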

This would have:

> This is _a_ paragraph continuation text
> 2. because the line starts with `2`, not `1`.
> This is _a_ paragraph continuation text
2. because the line starts with `2`, not `1`.

both result in:

This is a paragraph continuation text 2. because the line starts with 2, not 1.

and

> This is _a_ paragraph continuation text
> 1. because the line starts with `1`, not `2`.
> This is _a_ paragraph continuation text
1. because the line starts with `1`, not `2`.

both result in:

This is a paragraph continuation text

  1. because the line starts with 1, not 2.

As for ugliness of 1 vs non-1 list starts, I don’t think you can find a solution that everyone will like.
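
For reference, the 1 versus non-1 rule under discussion amounts to roughly the following check. This is a simplified Python sketch with hypothetical names (`ORDERED`, `BULLET`, `can_interrupt_paragraph`); it only looks at a single line and does not model container blocks.

```python
import re

# Up to three spaces of indentation, a marker, at least one space, then content.
ORDERED = re.compile(r'^ {0,3}(\d{1,9})[.)] +\S')
BULLET = re.compile(r'^ {0,3}[-+*] +\S')

def can_interrupt_paragraph(line):
    """True if `line` may start a list directly after paragraph text:
    bullets always may, ordered markers only when they start at 1.
    An item with no content never interrupts a paragraph."""
    m = ORDERED.match(line)
    if m is not None:
        return int(m.group(1)) == 1
    return BULLET.match(line) is not None
```

So `2. because the line starts with 2` is treated as paragraph continuation text, while `1. because the line starts with 1` opens a list.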

The solutions suggested so far seem to deal with wrapping issues in paragraphs, but not in lists.

Consider this example:

This sentence has been wrapped by
1. This is not a new list item.
1. An actual new list item.
2. Another new list item.

As already noted, this ambiguity can be solved by requiring a blank line before the list.

Now consider the case where the opening paragraph is itself a list item at the same level as the other list items:

1. This sentence has been wrapped by
1. This is not a separate list item.
2. An actual new list item.
3. Another new list item.

Solving this ambiguity requires a blank line before every new list item, not just the first item in the list, or even the first item at each new sublist level, like this:

1. This sentence has been wrapped by
1. This is not a separate list item.

2. An actual new list item.

3. Another new list item.

Perhaps the choice should be between:

  1. Always requiring a blank line before every new item in a list, including the first one.
    • This obeys the Principle of Uniformity and avoids ambiguity with line wrapping, setext headings, etc.
    • However, it is clearly a major change to current practice, and interferes with the definition of loose lists.
  2. Never requiring a blank line before any new list item, not even the first one.
    • This obeys the Principle of Uniformity and is consistent with current practice.
    • However, it requires escape characters or special heuristics to resolve issues with wrapping and setext headings.

My preference would be to never require blank lines before list items, and to use heuristics to determine whether something is a new list item or a continuation of the previous paragraph/list item. I know you guys are (understandably) reluctant to use heuristics, but they do have these benefits:

  1. A human reading the text is essentially using heuristics to determine what is and is not a new list item. CommonMark would just be codifying this process.
  2. The heuristics don’t have to be perfect and can be improved over time as more edge cases are revealed.
  3. If there are exceptional cases where the heuristics get it wrong people can always use blank lines and escaping to get the result they intended.
  4. Heuristics can obey the Principle of Uniformity. The same heuristics can be applied at any sublist level to determine whether a new line is:
    • A) a continuation of the previous line.
    • B) a new list item at the same level of the previous item.
    • C) a child of the previous list item (i.e. a new sublist item)

The current hack for 1. is causing problems and confusion. It should be replaced by a better solution. Several possibilities:

  1. Allow any single-digit list marker. (Still a hack, but covers more cases, including most of #2704.)
  2. Make the colon : at the end of a line (optionally followed by a single space) an indicator for special treatment of the next line, similar to double space and backslash \ for hard line breaks. This could be reused (by future extensions) for other things, e.g. blockquote attributions and table or figure captions, but it is a more severe deviation from Gruber Markdown.
  3. Require at least two (tight) list items (of the same type).
  4. Do not support loose lists to interrupt paragraphs.
  5. Allow a single-line paragraph (preceded by a blank line) with less than 80 characters to precede a list without the otherwise mandatory intervening blank line.

This is similar but not identical to heuristics I have suggested previously. I’m proposing to choose one of these or a mandatory combination of them, not multiple alternative options.

I really like the rules (3) and (4), however I would limit the rule (3) only to the numbered list items. The same rules should apply in nested lists (uniformity) and IMHO it’s quite common to have single-item bullet lists, especially when nested in a tight list:

* foo
  * subitem of foo, not continuation line
* bar

I find (1) and (2) too hackish and incompatible with current practice. I’m also not sure how : is used in other languages, e.g. Chinese, Japanese, or Arabic.

The rule (5) looks very strange to me: a user who adds a word somewhere in the middle or beginning of the previous sentence can cause a disproportionately large rendering change elsewhere, after the sentence. IMHO, very unintuitive behavior.

But if we find (3) and (4) unsatisfactory, it might make sense to require the 1st list item (not the preceding line) to be quite short in order to interrupt the previous paragraph. Consider that tight lists tend to have short item contents. And if they contain longer text, the author may want to place a blank line to visually separate it from the preceding text, or use a loose list right away, and would do so intuitively.


I like 2+4 slightly better than 3+4, for what it’s worth.

It seems sensible to require at least two numbered list items.

How many situations are there in which someone actually wants to 
interrupt a paragraph with a single-item numbered list? Probably
1. Or maybe a couple more. Other than that, a numbered list would
be created by happenstance, without intending or expecting it.

See what I did there? :wink: Babelmark


jkdev noreply@talk.commonmark.org writes:

It seems sensible to require at least two numbered list items.

How many situations are there in which someone actually wants to 
interrupt a paragraph with a single-item numbered list? Probably
1. Or maybe a couple more. Other than that, a numbered list would
be created by happenstance, without intending or expecting it.

It’s actually pretty common in some contexts. For example, in a
paper one might discuss a number of numbered examples, with regular
paragraph text in between. I’ve frequently used one-item numbered
lists (with different start numbers) for this kind of thing.

To weigh in on the proposals:

(1) (single-digit start numbers) seems okay to me. It might fix some
practical problems (and might cause some others). On the other hand,
even with this fix we’d have a mismatch between parser behavior and the
spec of the sort described in commonmark/cmark#204.

(3) is bad for the reason I just gave in my previous post.
One-item lists are, in general, useful, so we don’t want to
rule them out altogether. I suppose we could require at least
two items when the list interrupts a paragraph, but that creates
difficulties for parsing: you can’t know you’ve got a list until
you’ve parsed the whole list item and seen what comes after it.

(4) has a similar issue: we can’t tell if the list is going
to be loose unless we’ve parsed the whole thing. And I’m not
sure what is gained by allowing only tight lists to interrupt
a paragraph.

(5) is going to be problematic for people who wrap their text
to a reasonable width (say, 72 characters), and also for people
who don’t hard-wrap at all. And I echo @mity’s worry about surprising
behavior.

(2) seems the most promising to me, but there is the worry
about languages with different punctuation conventions.

I guess it might be a good idea at this point if someone
summarized clearly and concisely why we need to change things.
My own intervention above was motivated by
https://github.com/commonmark/cmark/issues/204, but I believe
that issue could be handled at the implementation level without
a change in the spec. At any rate, I’ve written a Haskell
implementation, roughly following the same strategy as cmark,
which gives the right results in this case.


An enumerated example is not a list item. I don’t think this counts as a strong argument against (3). I did assume a minimum of two list items would only be expected in lists that are not preceded by a blank line. If this makes it too complicated, I’m fine with dropping this idea.

The reasoning behind (4) is that a tight list could well be a child of a paragraph (in output formats which support this nesting), whereas a loose list, which can contain paragraphs (i.e. blank lines) itself, seems strange inside a paragraph and thus could only end it.

<p><list.tight/></p>

<p/><list.loose/><p/>

Anyhow, I prefer (2), too. The colon at the end of the line preceding a list works in two ways:

  1. Existing content in many languages will have a colon introduce a list without an intervening blank line. It works as a heuristic rule.
  2. New content in any language can be authored with the colon as a new active markup character.

The problem is that for much of variant 1 the colon should be retained in the output, whereas it should be dropped for many cases of variant 2.
This can be done with an additional rule, but I’m not sure whether that would still be acceptable.

Christoph Päper noreply@talk.commonmark.org writes:

An enumerated example is not a list item.

Well, a list item is the closest thing in commonmark to represent it
with. If you make it a regular paragraph, the indentation will be
wrong and it won’t stand out.

The reasoning behind (4) is that a tight list could well be a child of
a paragraph (in output formats which support this nesting), whereas a
loose list, which can contain paragraphs (i.e. blank lines) itself,
seems strange inside a paragraph and thus could only end it.

I see. But the way the spec is designed, a paragraph can never resume
after a tight list either. So, without much larger changes, (4) doesn’t
seem motivated.

Anyhow, I prefer (2), too. The colon at the end of the line preceding a list works in two ways:

  1. Existing content in many languages will have a colon introduce a list without an intervening blank line. It works as a heuristic rule.
  2. New content in any language can be authored with the colon as a new active markup character.

The problem is that for much of variant 1 the colon should be retained in the output, whereas it should be dropped for many cases of variant 2.
This can be done with an additional rule, but I’m not sure whether that would still be acceptable.

For something a bit like this, see
http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#literal-blocks

‘As a convenience, the “::” is recognized at the end of any
paragraph. If immediately preceded by whitespace, both colons will be
removed from the output (this is the “partially minimized” form). When
text immediately precedes the “::”, one colon will be removed from the
output, leaving only one colon visible (i.e., “::” will be replaced by
“:”; this is the “fully minimized” form).’

So, we could say that if the final colon is preceded by whitespace,
it gets removed.
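
A minimal Python sketch of that idea, assuming a trailing colon marks the next line as a list start and that the colon is dropped only when whitespace precedes it. The function name `render_intro_line` and its return shape are hypothetical, and the sketch tolerates any number of trailing spaces after the colon, which is looser than the proposal.

```python
def render_intro_line(line):
    """Return (text_to_output, introduces_list) for a paragraph's last line.
    "Items :" -> colon removed; "Items:" -> colon kept; no colon -> unchanged."""
    stripped = line.rstrip(" ")
    if not stripped.endswith(":"):
        return line, False
    before_colon = stripped[:-1]
    if before_colon.endswith((" ", "\t")):
        # Colon preceded by whitespace: treat it as pure markup and drop it.
        return before_colon.rstrip(), True
    # Colon attached to text: keep it visible in the output.
    return stripped, True
```

For example, `render_intro_line("Consider these :")` gives `("Consider these", True)`, while `render_intro_line("Consider these:")` keeps the colon.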


<space><colon> would not work well with French practice, I guess. I thought about leaving a single colon as is :, but removing both if there are two ::.

I honestly believed that list items needed to have some whitespace and not start at the left margin.

What if list items needed either a blank line separating them from the paragraph or some initial whitespace before the first bullet/number? Then you wouldn’t get accidental list items from linewrapped paragraphs.
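
One way to picture this rule is the following Python sketch. The names `MARKER` and `starts_list` are hypothetical, and the check covers only the decision for the first item of a list.

```python
import re

# Optional leading whitespace, a list marker, whitespace, then some content.
MARKER = re.compile(r'^(\s*)(\d+[.)]|[-+*])\s+\S')

def starts_list(prev_line, line):
    """A marker line starts a list only if it follows a blank line
    (or the start of the document) or is itself indented."""
    m = MARKER.match(line)
    if m is None:
        return False
    after_blank = prev_line is None or prev_line.strip() == ""
    indented = len(m.group(1)) > 0
    return after_blank or indented
```

Under this rule a hard-wrapped line that happens to begin with `1.` at the left margin stays part of the paragraph, because it is neither indented nor preceded by a blank line.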


I like that kind of flexibility. You could even mix-and-match blank lines and initial whitespace according to your preferences.

For example: a blank line for a new list, and initial whitespace / additional indentation for sublists.

Paragraph text here.

1. List after blank line.
    1. Sublist after indentation.
2. Another list item.
    1. Another new sublist after indentation.

This is very readable and intuitive, and backwards-compatible as well.


I’ve thought about this a lot. @Crissov’s #3 makes the most sense. Every example of why a list should be able to interrupt a paragraph, on this thread and in others on this forum, has at least two items in the example. e.g.:

If you think about it, I’m pretty sure that’s how humans parse it. Compare

In Markdown 0.8 and earlier and version
1. This line turns into a list item.

and

In Markdown 0.8 and earlier and version
1. This line turns into a list item. Also in version
2. This line turns into another list item.

I think humans parse the latter as a list, irrespective of the content, at least until they read it. In such cases the plain text author is likely to see it that way too, and will do something to fix it, e.g. move the numbers to the end of the preceding line.

It’s the pattern that makes us see it as a list. Given that this is a plain text format designed for how humans read, it makes sense to have rules that jibe with that.

As to parsing, @jgm, looking at the source code for commonmark.js, it doesn’t seem that hard to peek ahead to the same text column on the next line. Am I wrong?


As to parsing, @jgm, looking at the source code for commonmark.js , it doesn’t seem that hard to peek ahead to the same text column on the next line. Am I wrong?

IMHO, you are wrong, because you may need to peek much further than that, and in a non-trivial way. Consider:

  • Loose list (there may be a blank line): You may need to peek after a blank line.
  • Multi-line list item: You may need to peek to the 1st line after the potential list item ends, but it may be long.
  • Nested list in the 1st item: You may need to peek after the nested list ends.
  • All of it combined together.

So peeking is more like full block parsing looking speculatively ahead (without any hard limit) until we know there is a 2nd item, and then possibly reverting back if there is not one.

And when I say block parsing looking speculatively ahead (without any hard limit), I worry that malicious input could exploit such a feature, leading to O(n^2) parsing times.

EDIT: And also note the nested list in the 1st list item may need the same treatment, so you may need to perform a speculation in speculation:

Lorem ipsum.
1. Is this 1st item of a list?
   1. Is this 1st item of a sub-list? Note we still do not even know yet whether there is a parent list...
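
The lookahead described here can be sketched in Python as follows. This is an illustration only, with hypothetical names (`ITEM`, `has_second_item`); it handles ordered markers, ignores lazy continuation lines, and is not how commonmark.js or cmark work.

```python
import re

# Leading spaces, an ordered marker, and the spaces after it.
ITEM = re.compile(r'^( *)(\d{1,9})[.)] +')

def has_second_item(lines, i):
    """Scan forward from the candidate first item at lines[i].
    Returns (found_sibling, lines_scanned)."""
    m = ITEM.match(lines[i])
    if m is None:
        raise ValueError("lines[i] is not an ordered list item")
    indent = len(m.group(1))
    content_col = m.end()       # column where the item's content starts
    scanned = 0
    for line in lines[i + 1:]:
        scanned += 1
        if line.strip() == "":
            continue            # loose list: a blank line decides nothing
        n = ITEM.match(line)
        if n is not None and len(n.group(1)) == indent:
            return True, scanned    # sibling marker at the same indent
        if len(line) - len(line.lstrip(" ")) < content_col:
            return False, scanned   # item ended without a sibling
        # Otherwise the line belongs to the first item (continuation
        # text or a nested block), so keep scanning.
    return False, scanned
```

The second return value shows the cost: the scan has no fixed bound, since it must pass over blank lines, multi-line content, and nested lists before it can answer. If each nested candidate item triggers its own scan of this kind, the total work can grow much faster than the input length.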

@mity CommonMark.js, at least, isn’t streaming, takes two passes, has random access to every line. Still, you may be right.

The intent of my post was more philosophical. I’m working on an idea for how to extend Markdown in a way that stays true to its plain text human reader philosophy. I may even be able to apply that work to help make progress on the open issues keeping us from a 1.0 release.

There are a lot of posts on this forum that seem to not know that philosophy, or seem to not care about it. A lot of posters see Markdown as source code for HTML. I think it would be good if everyone got on the same plain text page.

[PS. I think the current rule, that a list can interrupt a paragraph if it starts with 1, is good. Again, I didn’t mean to reopen an issue that I think is settled. As long as we don’t go back to requiring a blank line always.]


That wouldn’t help. O(n^2) where n is number of lines (and not bytes) is still bad.