Setext headers: single line?

elibarzilay · April 27, 2015, 2:33pm

According to the setext section:

The line of text must be one that, were it not followed by the setext header underline, would be interpreted as part of a paragraph […]

…

[…] it cannot interrupt a paragraph, so when a setext header comes after a paragraph, a blank line is needed between them.

IIUC, these two bits say that the header text must be a single line–? If so, then IMO it should be more explicit than inferring it from the combination of these sentences.

It could also be nice to have some example like

*foo
bar*
====

as showing a motivation, where without the limitation to a single line this would mean backtracking and changing already-parsed text.

jgm · April 27, 2015, 4:18pm

+++ elibarzilay [Apr 27 15 14:44 ]:

It could also be nice to have some example like
*foo
bar*
====
as showing a motivation, where without the limitation to a single line this
would mean backtracking and changing already-parsed text.

Actually, this isn’t the motivation. We don’t parse inlines until we’ve divided text into blocks, so there is no reason in principle why we couldn’t allow setext headers to interrupt a paragraph. So this might be a good time to reconsider this issue.

If you look at BabelMark2, you’ll see that current Markdown implementations go three different ways on this text:

This is a big
header which *can't
fit on one line*
================

Some parse this as one header (parsedown)
Some do not recognize a header here at all, on grounds that a header can’t interrupt a paragraph (commonmark, python-markdown, pandoc, kramdown)
Some parse this as a paragraph followed by a header, with only the last line as the header’s contents (Markdown.pl, marked, RedCarpet, Maruku, cheapskate)

Given that there isn’t complete uniformity here, a case might be made for moving commonmark into group 1, or group 3. What are people’s thoughts on this? Group 2 doesn’t seem a very useful behavior.

I know that I have periodically had requests to allow multiline setext headers in pandoc. Those who hard-wrap text and include links in their headers can often run into problems fitting a header into their column width.

Allowing multiline setext headers would also open up the possibility of having headers with hard line breaks in them (which I’ve also had requests for in pandoc).

elibarzilay · April 27, 2015, 4:53pm

Actually, this isn’t the motivation. We don’t parse inlines until we’ve divided text into blocks, so there is no reason in principle why we couldn’t allow setext headers to interrupt a paragraph. So this might be a good time to reconsider this issue.

OK, rephrase a bit: it’s not arbitrary back-tracking, just one block. IOW, you can’t do a simple parse-block-then-parse-inlines, since the inlines can change if there’s a setext marker following the text. It’s of course doable, but I was relieved that there’s no need to do that…

Given that there isn’t complete uniformity here, a case might be made for moving commonmark into group 1, or group 3. What are people’s thoughts on this? Group 2 doesn’t seem a very useful behavior.

(A big “FWIW” around the following.)

IMO #2 makes most sense.

With #1, and especially when just one - is needed, it seems easy to end up with huge mistaken headers. (Plus, imagine an editor with highlights: if you want to enter some - foo bullet after some text you’ll see it flash as a header line; not a strong point, but I think that such flashes are signs of surprising text parsing, which is a problem.)

Option #3 seems less dangerous for that, but adds a result that looks even more broken.

With both of these it becomes impossible to use ---s to separate a block of text with rules without newlines (and that seems to me like a useful thing).

Maybe a good summary is that I think that the problem of breaking text where people expect #1/#3 is smaller IMO than breaking a #2 expectation.

(Aren’t # more suited for multiple-lines?)

jgm · April 27, 2015, 7:03pm

+++ elibarzilay [Apr 27 15 17:04 ]:

OK, rephrase a bit: it’s not arbitrary back-tracking, just one block.
IOW, you can’t do a simple parse-block-then-parse-inlines, since the
inlines can change if there’s a setext marker following the text. It’s
of course doable, but I was relieved that there’s no need to do that…

The reference parser doesn’t parse any inline content until the block structure of the entire document has been determined. (Parsing inlines as you go doesn’t make sense anyway, since you don’t know about reference link definitions until you’ve gotten to the end of the document, and that affects how you parse inlines.) So, even if we allowed setext headers to interrupt paragraphs, this could be done without any backtracking.

(The parser currently just adds lines to a Paragraph node, and when it encounters a setext header line, then, if there’s only one line in the Paragraph node, it converts the node to a Header node. If we wanted a #1 type behavior, we could just remove the “only one line” condition. For a #3 type solution, we could create a new node if there’s more than one line in the Paragraph node, and move the last line into the new Header node. Either way there’d be no backtracking.)
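For anyone who wants to play with the three behaviors, here is a minimal sketch of the block phase described above. It is not the reference parser's code: `parse_blocks`, the `mode` names, and the tuple-based block representation are all made up for illustration, and it only models paragraphs, setext underlines, and `---` rules that follow an open paragraph.

```python
import re

# Simplified, hypothetical underline pattern: a run of '=' or '-' only.
UNDERLINE = re.compile(r"^(=+|-+)\s*$")


def parse_blocks(lines, mode="single"):
    """Block phase only; inline content is never inspected here.

    mode (names invented for this sketch):
      'single' - header only if the paragraph has one line (behavior #2)
      'whole'  - the whole paragraph becomes the header    (behavior #1)
      'last'   - only the last line becomes the header     (behavior #3)
    """
    blocks = []
    para = []  # lines of the currently open paragraph

    def close_para():
        if para:
            blocks.append(("para", list(para)))
            para.clear()

    for line in lines:
        if not line.strip():
            close_para()  # a blank line ends the paragraph
            continue
        if UNDERLINE.match(line) and para:
            level = 1 if line[0] == "=" else 2
            if len(para) == 1 or mode == "whole":
                # Convert the open paragraph into a header in place.
                blocks.append(("header", level, list(para)))
                para.clear()
                continue
            if mode == "last":
                # Split off the last line into a new header block.
                last = para.pop()
                close_para()
                blocks.append(("header", level, [last]))
                continue
            if level == 2 and len(line.rstrip()) >= 3:
                # 'single' mode: '---' after a multi-line paragraph is a rule.
                close_para()
                blocks.append(("hrule",))
                continue
            # Otherwise fall through: the line is ordinary paragraph text.
        para.append(line)

    close_para()
    return blocks


lines = [
    "This is a big",
    "header which *can't",
    "fit on one line*",
    "================",
]
for mode in ("single", "whole", "last"):
    print(mode, parse_blocks(lines, mode))
```

In all three modes each input line is visited exactly once and the decision is made on the already-collected paragraph lines, so nothing is re-parsed; the modes differ only in what happens to the open paragraph when the underline arrives.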

lu_zero · April 28, 2015, 10:04am

Moving to the parsedown approach is probably the most useful, all things considered.

jgm · April 29, 2015, 5:22am

I think I prefer #1 to #2. Against #1, you say “it seems easy to end up with huge mistaken headers.” But the #2 behavior is probably a mistake too – you probably didn’t want a long string of = signs in the middle of a paragraph. If there’s going to be a mistake, it seems better that it be BIG and easily recognizable as such, rather than hidden away.

The point about rules interrupting paragraphs is a good one. But, to me, it seems a bit unexpected that

foo
---
bar

is a header followed by a paragraph, while

foo
bar
---
baz

is two paragraphs with a horizontal rule between. Besides, it makes sense to leave blank lines around a horizontal rule separator, so I don’t think one loses much by not being able to do it without the blanks. (Perhaps an intervening blank line should always be required?)

Aren’t # more suited for multiple-lines?

It’s pretty entrenched now in Markdown land that paragraph text can start right after an ATX (#) header, without an intervening blank line.

elibarzilay · April 29, 2015, 10:36am

Parsing inlines as you go doesn’t make sense anyway, since you don’t know about reference link definitions until you’ve gotten to the end of the document […]

Yeah, that’s a point.

I think I prefer #1 to #2. Against #1, you say “it seems easy to end up with huge mistaken headers.” But the #2 behavior is probably a mistake too – you probably didn’t want a long string of = signs in the middle of a paragraph. If there’s going to be a mistake, it seems better that it be BIG and easily recognizable as such, rather than hidden away.

I still prefer #2 as the conservative option, thinking about a stray - or = messing things up. (And I can imagine real editing sequences that will end up with such strays.) To refine the thing I said earlier about flashing text: the main principle that makes sense to me is that small changes in the source shouldn’t lead to big changes in the output, and that’s why I’m so worried about a stray character messing things up. Reading your text, the thing that jumps out is “long”: if there was some required length that is >> 1, I’d probably feel much more comfortable with it.

[…] so I don’t think one loses much by not being able to do it without the blanks. (Perhaps an intervening blank line should always be required?)

Yeah, it makes sense to me to require empty lines: if I wanted that I’d probably put them in either way. But again, re the point for a header, this is also talking about a longer-than-one sequence…

Aren’t # more suited for multiple-lines?
It’s pretty entrenched now in Markdown land that paragraph text can start right after an ATX (#) header, without an intervening blank line.

No, I meant

# Some long
# header text

But that was a comment, I know that it’s too much of a break…

xim · September 4, 2015, 7:39pm

I wish for multi-line headings quite often, especially when converting existing documents to markdown. I was a bit surprised when I noticed this topic was not mentioned in the spec at all. So I clearly opt for #1.

Just as an example, this is an actual heading I wrote today (source):

# Following May Be Said as to What We May Expect by Way of Implementation of Basic Soviet Policies on Unofficial, or Subterranean Plane, i.e., on Plane for Which Soviet Government Accepts No Responsibility.