Blank lines before lists, revisited

jkdev noreply@talk.commonmark.org writes:

It seems sensible to require at least two numbered list items.

How many situations are there in which someone actually wants to 
interrupt a paragraph with a single-item numbered list? Probably
1. Or maybe a couple more. Other than that, a numbered list would
be created by happenstance, without intending or expecting it.

It’s actually pretty common in some contexts. For example, in a
paper one might discuss a number of numbered examples, with regular
paragraph text in between. I’ve frequently used one-item numbered
lists (with different start numbers) for this kind of thing.

To weigh in on the proposals:

(1) (single-digit start numbers) seems okay to me. It might fix some
practical problems (and might cause some others). On the other hand,
even with this fix we’d have a mismatch between parser behavior and the
spec of the sort described in commonmark/cmark#204.

(3) is bad for the reason I just gave in my previous post.
One-item lists are, in general, useful, so we don’t want to
rule them out altogether. I suppose we could require at least
two items when the list interrupts a paragraph, but that creates
difficulties for parsing: you can’t know you’ve got a list until
you’ve parsed the whole list item and seen what comes after it.

(4) has a similar issue: we can’t tell if the list is going
to be loose unless we’ve parsed the whole thing. And I’m not
sure what is gained by allowing only tight lists to interrupt
a paragraph.

(5) is going to be problematic for people who wrap their text
to a reasonable width (say, 72 characters), and also for people
who don’t hard-wrap at all. And I echo @mity’s worry about surprising
behavior.

(2) seems the most promising to me, but there is the worry
about languages with different punctuation conventions.

I guess it might be a good idea at this point if someone
summarized clearly and concisely why we need to change things.
My own intervention above was motivated by
https://github.com/commonmark/cmark/issues/204, but I believe
that issue could be handled at the implementation level without
a change in the spec. At any rate, I’ve written a Haskell
implementation, roughly following the same strategy as cmark,
which gives the right results in this case.

An enumerated example is not a list item. I don’t think this counts as a strong argument against (3). I did assume a minimum of two list items would only be expected in lists that are not preceded by a blank line. If this makes it too complicated, I’m fine with dropping this idea.

The reasoning behind (4) is that a tight list could well be a child of a paragraph (in output formats which support this nesting), whereas a loose list, which can contain paragraphs (i.e. blank lines) itself, seems strange inside a paragraph and thus could only end it.

<p><list.tight/></p>

<p/><list.loose/><p/>

Anyhow, I prefer (2), too. The colon at the end of the line preceding a list works in two ways:

  1. Existing content in many languages will have a colon introduce a list without an intervening blank line. It works as a heuristic rule.
  2. New content in any language can be authored with the colon as a new active markup character.

The problem is that for much of variant 1 the colon should be retained in the output, whereas it should be dropped for many cases of variant 2.
This can be done with an additional rule, but I’m not sure whether that would still be acceptable.

Christoph Päper noreply@talk.commonmark.org writes:

An enumerated example is not a list item.

Well, a list item is the closest thing in commonmark to represent it
with. If you make it a regular paragraph, the indentation will be
wrong and it won’t stand out.

The reasoning behind (4) is that a tight list could well be a child of
a paragraph (in output formats which support this nesting), whereas a
loose list, which can contain paragraphs (i.e. blank lines) itself,
seems strange inside a paragraph and thus could only end it.

I see. But the way the spec is designed, a paragraph can never resume
after a tight list either. So, without much larger changes, (4) doesn’t
seem motivated.

Anyhow, I prefer (2), too. The colon at the end of the line preceding a list works in two ways:

  1. Existing content in many languages will have a colon introduce a list without an intervening blank line. It works as a heuristic rule.
  2. New content in any language can be authored with the colon as a new active markup character.

The problem is that for much of variant 1 the colon should be retained in the output, whereas it should be dropped for many cases of variant 2.
This can be done with an additional rule, but I’m not sure whether that would still be acceptable.

For something a bit like this, see
http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#literal-blocks

‘As a convenience, the “::” is recognized at the end of any
paragraph. If immediately preceded by whitespace, both colons will be
removed from the output (this is the “partially minimized” form). When
text immediately precedes the “::”, one colon will be removed from the
output, leaving only one colon visible (i.e., “::” will be replaced by
“:”; this is the “fully minimized” form).’

So, we could say that if the final colon is preceded by whitespace,
it gets removed.
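
A minimal sketch of that rule in Python, just to make it concrete. The function name `strip_intro_colon` is hypothetical and not taken from any CommonMark implementation; it only handles the single line that precedes the list.

```python
def strip_intro_colon(line: str) -> str:
    """Drop a trailing colon only when whitespace immediately precedes it.

    Sketch of the suggested rule: "text :" loses the colon (it acted as
    markup), while "text:" keeps it (it is ordinary punctuation).
    """
    text = line.rstrip()
    if len(text) >= 2 and text.endswith(":") and text[-2].isspace():
        # Remove the colon and the whitespace that set it apart.
        return text[:-1].rstrip()
    return text
```

Under this sketch, "Shopping list :" becomes "Shopping list", while "Shopping list:" is left alone.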

<space><colon> would not work well with French practice, I guess. I thought about leaving a single colon as is :, but removing both if there are two ::.

I honestly believed that list items needed to have some whitespace and not start at the left margin.

What if list items needed either a blank line separating them from the paragraph or some initial whitespace before the first bullet/number? Then you wouldn’t get accidental list items from linewrapped paragraphs.
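
Roughly, the check I have in mind could look like the following Python sketch. The name `may_start_list` and the regular expression are my own assumptions for illustration; this is not how any existing parser is written, and it ignores everything except the previous line and the candidate line.

```python
import re

# A bullet or ordered-list marker with up to three leading spaces,
# followed by at least one space and some content.
MARKER = re.compile(r"^( {0,3})(?:[-+*]|\d{1,9}[.)]) +\S")


def may_start_list(prev_line: str, line: str) -> bool:
    """True if `line` may open a list under the proposed rule.

    A list item is accepted either after a blank line, or directly after
    paragraph text when the marker has some initial whitespace before it.
    """
    m = MARKER.match(line)
    if m is None:
        return False
    if not prev_line.strip():
        return True  # separated by a blank line
    return len(m.group(1)) > 0  # needs leading indentation otherwise
```

With this, a hard-wrapped line such as "1. This line ..." directly under paragraph text stays part of the paragraph, while the same line indented by one space starts a list.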

I like that kind of flexibility. You could even mix-and-match blank lines and initial whitespace according to your preferences.

For example: a blank line for a new list, and initial whitespace / additional indentation for sublists.

Paragraph text here.

1. List after blank line.
    1. Sublist after indentation.
2. Another list item.
    1. Another new sublist after indentation.

This is very readable and intuitive, and backwards-compatible as well.

I’ve thought about this a lot. @Crissov’s #3 makes the most sense. Every example of why a list should be able to interrupt a paragraph in this thread and in others on this forum has at least two items in the example. e.g.:

If you think about it, I’m pretty sure that’s how humans parse it. Compare

In Markdown 0.8 and earlier and version
1. This line turns into a list item.

and

In Markdown 0.8 and earlier and version
1. This line turns into a list item. Also in version
2. This line turns into another list item.

I think humans parse the latter as a list, irrespective of the content, at least until they read it. In such cases the plain text author is likely to see it that way too, and will do something to fix it, e.g. move the numbers to the end of the preceding line.

It’s the pattern that makes us see it as a list. Given that this is a plain text format designed for how humans read, it makes sense to have rules that jibe with that.

As to parsing, @jgm, looking at the source code for commonmark.js, it doesn’t seem that hard to peek ahead to the same text column on the next line. Am I wrong?

As to parsing, @jgm, looking at the source code for commonmark.js, it doesn’t seem that hard to peek ahead to the same text column on the next line. Am I wrong?

IMHO, you are wrong, because you may need to peek much further than that, and in a non-trivial way. Consider:

  • Loose list (there may be a blank line): You may need to peek after a blank line.
  • Multi-line list item: You may need to peek to the 1st line after the potential list item ends, but it may be long.
  • Nested list in the 1st item: You may need to peek after the nested list ends.
  • All of it combined together.

So peeking is more like full block parsing, looking speculatively ahead (without any hard limit) until we know there is a 2nd item, and then possibly reverting if there is not one.

And when I say block parsing looking speculatively ahead (without any hard limit), I start to worry that a malicious input could exploit such a feature, leading to O(n^2) parsing times.

EDIT: And also note the nested list in the 1st list item may need the same treatment, so you may need to perform a speculation in speculation:

Lorem ipsum.
1. Is this 1st item of a list?
   1. Is this 1st item of a sub-list? Note we still do not even know yet whether there is a parent list...
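
To illustrate the concern, here is a rough Python sketch of a naive lookahead that counts how many lines it inspects. All names (`has_sibling`, `nested_input`, `total_inspected`) are hypothetical, and the model is heavily simplified: it only compares indentation, ignores lazy continuation lines, and re-scans independently for every candidate item. A real parser could be smarter, so this shows the worst case of the naive approach only.

```python
import re

# Ordered-list marker at any indentation (simplified).
ITEM = re.compile(r"^( *)(\d{1,9})[.)] ")


def has_sibling(lines, i):
    """Scan forward from the candidate item at lines[i].

    Returns (found, inspected): whether a second item at the same
    indentation follows, and how many lines had to be looked at.
    """
    indent = len(ITEM.match(lines[i]).group(1))
    inspected = 0
    for line in lines[i + 1:]:
        inspected += 1
        if not line.strip():
            continue  # blank line: a loose list may still continue
        cur_indent = len(line) - len(line.lstrip(" "))
        if ITEM.match(line) and cur_indent == indent:
            return True, inspected
        if cur_indent <= indent:
            return False, inspected  # the candidate item has ended
    return False, inspected


def nested_input(depth):
    """A paragraph followed by `depth` ever-deeper candidate first items."""
    return ["Lorem ipsum."] + [" " * (3 * k) + "1. item" for k in range(depth)]


def total_inspected(lines):
    """Total lines inspected if every candidate item speculates on its own."""
    return sum(has_sibling(lines, i)[1]
               for i, line in enumerate(lines) if ITEM.match(line))
```

For the nested input, each candidate has to scan to the end of the document, so a depth of d costs d(d-1)/2 line inspections in this model.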

@mity CommonMark.js, at least, isn’t streaming, takes two passes, has random access to every line. Still, you may be right.

The intent of my post was more philosophical. I’m working on an idea for how to extend Markdown in a way that stays true to its plain text human reader philosophy. I may even be able to apply that work to help make progress on the open issues keeping us from a 1.0 release.

There are a lot of posts on this forum that seem to not know that philosophy, or seem to not care about it. A lot of posters see Markdown as source code for HTML. I think it would be good if everyone got on the same plain text page.

[PS. I think the current rule, that a list can interrupt a paragraph if it starts with 1, is good. Again, I didn’t mean to reopen an issue that I think is settled. As long as we don’t go back to requiring a blank line always.]

That wouldn’t help. O(n^2) where n is the number of lines (and not bytes) is still bad.

What about requiring an inline space before any number interrupting a paragraph? The spec allows space before starting a numbered list, right (to allow lining 1. up with 10.)?

It seems sensible to require a blank line before a single-item list that follows a paragraph. Without it, wrapping could create an inadvertent list, and an intentional single-item list, to me, needs to be set off with blank lines in the middle of a paragraph, else it is hard to see in the Markdown source. A multiple-item list without blank lines is easier to see in a paragraph, so a multi-item list need not require a blank line, and I think we would avoid most of the inadvertent lists resulting from wrapping.

An exception would be the multi-item list with no blank line following a wrapped line starting with a number (as in the example given by @shoogle). That could be mostly dealt with by looking at the numbers, but others mention lists being re-arranged without renumbering, which makes this not easy. Trying to accommodate all these lazy forms is going to be a challenge. It also seems that these wrapping cases involve hard breaks where the author never sees the result of the wrap and so cannot notice the obvious problem. Where does this happen?

Sorry for bumping this old thread, but I wanted to mention that it seems to me like requiring a space before list items that interrupt a paragraph doesn’t really solve anything. The issue is not that some authors prefer to start list items immediately after paragraphs, but rather that this is (perhaps by accident) a common occurrence in Markdown usage. People start list items immediately following a paragraph, and expect it to just work.

I wonder if it would make sense to extend the current rule to allow a list to interrupt a paragraph only if it both starts at “1” and has multiple list items. In the infrequent situations where authors actually want to have a list with a single item, they would still have the option to put an empty line before it.

Regarding the O(n²) issue, I think it’s interesting to note that parsing Markdown already requires arbitrary lookahead, so it seems to me like this doesn’t introduce any potential for exploit that isn’t already there. Cf. https://github.com/micromark/micromark/issues/8

Examples:

The following would not be a list, because it would have a single item:

We are talking about the number
1. It is a natural number.

The following would be a list, because it has multiple items:

We are talking about the number two.
1. It is a natural number.
2. It is a prime number.

The following would be a list, because it is not interrupting a paragraph anyway:

We are talking about some number.

1. It is a natural number.

The following would not be a list, because it doesn’t start at “1”.

My favorite number is the number
2. I heard it was created around
1982. I’m not sure that’s right.

If the “favorite number” in the example above were actually “1”, then it would unfortunately become a list. Perhaps it’d make sense to require lists interrupting paragraphs to actually be sequential, but I feel like that kind of situation is rare enough to not actually warrant worry.
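
For what it’s worth, here is a small Python sketch of the rule as I imagine it, checked against the examples above. It is only an approximation under loud assumptions: `starts_list` and the regular expression are hypothetical names of mine, the second item is expected on the very next line (no multi-line items, nested lists, or loose lists), and any non-blank previous line is treated as paragraph text.

```python
import re

# Ordered-list marker with up to three leading spaces and some content.
ORDERED = re.compile(r"^ {0,3}(\d{1,9})[.)] +\S")


def starts_list(lines, i):
    """True if lines[i] would open an ordered list under the proposed rule."""
    m = ORDERED.match(lines[i])
    if m is None:
        return False
    if i == 0 or not lines[i - 1].strip():
        return True  # not interrupting a paragraph: nothing changes
    if int(m.group(1)) != 1:
        return False  # existing rule: must start at 1 to interrupt
    # Proposed extra condition: a second item must follow immediately.
    return i + 1 < len(lines) and ORDERED.match(lines[i + 1]) is not None
```

Note that the sketch does not check that the numbers are sequential, so the “favorite number is 1” case still comes out as a list, as described above.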

Regarding the O(n²) issue, I think it’s interesting to note that parsing Markdown already requires arbitrary lookahead, so it seems to me like this doesn’t introduce any potential for exploit that isn’t already there. Cf. https://github.com/micromark/micromark/issues/8

No, it doesn’t. The reference parser (cmark) parses one line at
a time with no backtracking.

I would be interested in figuring out how the reference parser does that and supports LRDs (link reference definitions). I know that LRDs caused me a lot of effort to get right, as I had to support backtracking to be able to handle the failure cases properly.