End conditions within end conditions

tvaneerd · March 27, 2017, 11:04pm

The end condition for <pre> is, basically, </pre>. The end condition for <table> and/or <td> includes blank lines.

So if I have <pre> within a table or <td>, then the pre ends “early” - the blank line is seen as the end of the table (or td or both, hard to tell).

Which means you can’t reliably put code in tables. Which sucks.

Example:

I don’t know why the example doesn’t show up. I assumed these comments were in markdown…

Examples can be found at https://github.com/tvaneerd/cpp17_in_TTs/blob/master/if_constexpr.md

So my suggestion would be for the end conditions to be scoped - the table or td can’t end because the pre hasn’t ended yet, for example.

class Foo {
    // no blank line above here
    should_be_indented();
};

class Foo {
// blank line above here
should_be_indented();

};

kivikakk · March 28, 2017, 2:02am

It’s possible to work around this with a simple newline before each <pre>:

<table>
<tr>

<pre>
hi

there
</pre>
</tr>
<tr>

<pre>
hi

ok
</pre>
</tr>
</table>

This works perfectly fine because the </pre> matches the end condition for start condition 1, and then </tr> matches start condition 6 again. No extra <p> tags or similar are created, and the output is thus as desired.

tvaneerd · March 28, 2017, 1:44pm

This sounds more like “happens to work” than “follows from the spec”. So thanks: that does work on GitHub, so it solves my immediate problem. But I think the spec needs clarification.

According to the spec, shouldn’t the blank line before the <pre> end the <tr>?
“it ends with the first subsequent line that meets a matching end condition”
“End condition: line is followed by a blank line.”

And if not that blank line, why not the blank line between hi and there? Should that not end the <tr> or the <table> (or both)? That is, nothing says anything about nesting blocks (in fact HTML is marked as a leaf block, not a container block), nor whether the same blank line can be the end condition of more than one start condition. Nothing says blank lines within <pre> are somehow not blank lines for purposes of end conditions.

mgeier · March 28, 2017, 7:06pm

I’m sure the wording of the spec could be improved, but I think if you ponder on it long enough, eventually you will see what it means …

I’m not sure if I’m there yet, but let me try to explain how I understand it:

First of all, CommonMark is not a full HTML parser. It has a bunch of rules to take care of many common patterns, but you’ll always find some edge cases that CommonMark will “misunderstand”.

Then, there are two types of HTML blocks: ones that look for a matching end token and therefore should yield “whole” HTML elements (kinds 1 to 5), and ones that just detect a “partial” HTML element (kinds 6 and 7).

The latter kind doesn’t care if it gets an opening or a closing tag and it doesn’t try to match them!

So in the example above, the part starting with <table> is an HTML block of kind 6, which ends at a blank line. It does not check for a closing </table> tag! The next line, containing <tr>, could be anything; the CommonMark parser doesn’t care whether it even looks like an HTML tag, as long as it’s not a blank line. The next blank line closes the HTML block of kind 6. The parser doesn’t really “know” that the block holds an opening “table” tag and an opening “table row” tag. It could as well be some nonsense like:

<table *xyz*
>tr<<<>>>rt<

This would be the same kind of HTML block (kind 6). And this is the whole block. If a matching closing tag happens to follow further down, the CommonMark parser doesn’t care.

You can check this by clicking on “AST” on this page.

As for whether the blank line before the <pre> should end the <tr>: no, because the <tr> is meaningless for the parser. The blank line (in line 3) ends the HTML block consisting of

<table>
<tr>

… and doesn’t care if the table or the row tag is ever closed.

And now to your original problem:

If you have a <pre> following a line that starts with <table, and no blank line in between, the HTML block of kind 6 “eats” your <pre>. There will be no HTML block of kind 1, as you probably would expect. Therefore, as @kivikakk suggested, you should insert a blank line to make sure the kind 6 block is finished, in order for the block with <pre> to properly become a kind 1 block.

The spec actually illustrates this behavior with several examples; have a look around example 132.
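To make the two kinds concrete, here is a rough Python sketch of a block splitter that knows only start conditions 1 and 6. It is not how cmark or any real implementation works: the tag list for kind 6 is cut down to a handful of names, kinds 2 to 5 and 7 are ignored, and `split_blocks`, `START_1`, `END_1` and `START_6` are names made up for this illustration.

```python
import re

# Simplified conditions. The real spec has seven kinds and a much longer
# tag list for kind 6; only a few tag names are listed here (assumption).
START_1 = re.compile(r" {0,3}<(?:script|pre|style)(?:\s|>|$)", re.IGNORECASE)
END_1 = re.compile(r"</(?:script|pre|style)>", re.IGNORECASE)
START_6 = re.compile(
    r" {0,3}</?(?:table|tr|td|th|tbody|thead|div|p)(?:\s|/?>|$)", re.IGNORECASE
)


def split_blocks(text):
    """Return (kind, lines) pairs; kind is 1, 6 or "paragraph"."""
    lines = text.split("\n")
    blocks = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # blank lines between blocks are skipped
            continue
        end = i
        if START_1.match(lines[i]):
            kind = 1
            # Kind 1 runs until a line containing an end tag (or end of input).
            while end < len(lines) - 1 and not END_1.search(lines[end]):
                end += 1
        else:
            kind = 6 if START_6.match(lines[i]) else "paragraph"
            # Kind 6 and paragraphs both stop before a blank line.
            while end < len(lines) - 1 and lines[end + 1].strip():
                nxt = lines[end + 1]
                # An HTML block start may interrupt a paragraph, but not kind 6.
                if kind == "paragraph" and (START_1.match(nxt) or START_6.match(nxt)):
                    break
                end += 1
        blocks.append((kind, lines[i:end + 1]))
        i = end + 1
    return blocks


without_blank = "<table>\n<tr>\n<pre>\nhi\n\nthere\n</pre>\n</tr>\n</table>"
with_blank = "<table>\n<tr>\n\n<pre>\nhi\n\nthere\n</pre>\n</tr>\n</table>"
for sample in (without_blank, with_blank):
    print([kind for kind, _ in split_blocks(sample)])
```

Without the blank line, the first kind 6 block swallows <pre> and hi, the rest up to </pre> becomes a paragraph, and </tr> opens another kind 6 block. With the blank line after <tr>, the input splits into kind 6, kind 1, kind 6.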

tvaneerd · March 28, 2017, 8:31pm

OK, so the main rule is: if it sees something “HTML-ish” it says “I don’t know what that is, but I heard it was important, and I should pass it along untouched”, and continues that way until a blank line or matching tag (for most HTML things). So <table><tr> followed by a blank line is a complete “HTML-ish” block. That HTML-ish block is considered totally unrelated to the </tr></table> HTML-ish block at the end (if it exists at all).

And <pre> and a few others are “special” in needing an actual end tag; blank lines do not end them.

OK, makes sense.
I guess.

Goes a bit against the original goal of “html just works”, but you need to balance it against the other magic going on, so it can’t be perfect.

Thanks

mgeier · March 28, 2017, 10:01pm

I guess the end conditions of HTML block kinds 6 and 7 could be extended to also include that they should stop if the line is followed by another line that satisfies one of the start conditions of kinds 1 to 5. This should fix your problem.

This would make parsing slightly more complicated and the same line might be checked two times for the same condition (in a simple implementation), but I think that’s acceptable.

I don’t know if that was discussed before or if there are any drawbacks to this …
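A minimal sketch of that extended end condition, assuming only start condition 1 as the interrupting kind (kinds 2 to 5 are left out). The function name `end_of_kind6` and the `interruptible` flag are invented here and do not come from the spec or from cmark.

```python
import re

# Start condition 1 only; kinds 2 to 5 (comments, processing instructions,
# declarations, CDATA) are omitted from this sketch.
START_1 = re.compile(r" {0,3}<(?:script|pre|style)(?:\s|>|$)", re.IGNORECASE)


def end_of_kind6(lines, start, interruptible=False):
    """Index of the last line of a kind 6 block that begins at lines[start]."""
    end = start
    while end + 1 < len(lines):
        nxt = lines[end + 1]
        if not nxt.strip():
            break  # current rule: the line is followed by a blank line
        if interruptible and START_1.match(nxt):
            break  # proposed rule: the next line would start a kind 1 block
        end += 1
    return end


lines = ["<table>", "<tr>", "<pre>", "hi", "", "there", "</pre>", "</tr>", "</table>"]
print(end_of_kind6(lines, 0))                      # 3: block runs through "hi"
print(end_of_kind6(lines, 0, interruptible=True))  # 1: block stops before <pre>
```

The <pre> line is looked at once as lookahead here and would be tested again when the parser opens the next block, which is the double check mentioned above.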

kivikakk · March 30, 2017, 2:02am

I feel like the spec could be updated with some examples to make this behaviour more clear. It’s definitely correct on a close reading, but the result can still be surprising and it can take some explanation to communicate why it happens as it does.

A suggestion: (@jgm?)

HTML blocks continue until they are closed by their appropriate [end condition], or the last line of the document or other [container block]. This means any HTML within an HTML block that might otherwise be recognised as a start condition will be ignored by the parser and passed through as-is, without changing the parser’s state.

For instance, <pre> within an HTML block started by <table> will not affect the parser state; as the HTML block was started by start condition 6, it will end at any blank line. This can be surprising:
<table><tr><td>
<#pre>
**Hello**,

_world_.
</pre>
</td></tr></table>
In this case, the HTML block is terminated by the newline — the **hello** text remains verbatim — and regular parsing resumes, with a paragraph, emphasised world and inline and block HTML following.

(Ignore the # in <pre>, Discourse’s Markdown support is a bit fun.)

jgm · March 30, 2017, 7:10am

@kivikakk I think your suggestion is good, if you want to submit a PR.

jgm · March 30, 2017, 7:11am

Matthias Geier [Mar 28 17 22:12] wrote:

I guess the end conditions of HTML block kinds 6 and 7 could be
extended to also include that they should stop if the line is followed
by another line that satisfies one of the start conditions of kinds 1
to 5. This should fix your problem.

This is also an interesting suggestion (which I read only after
replying to the other suggestion). Let me think about it.

kivikakk · March 30, 2017, 8:07am

I think I actually prefer this idea (that 6 and 7 can be ‘interrupted’ by 1–5), but it starts to get hairy at edge cases. For instance:

<table>
<tr>
<pre>

Still part of the HTML block. **Not Markdown**.
</pre>
</tr>
</table>

Yet:

<table>
<tr><pre>
**Not Markdown**.

No longer part of HTML block (`<pre>` didn't trigger
because not at start of line).
**Markdown**.
</pre></tr>
**Still Markdown**.
</table>
Finally not Markdown (SC6 triggered on `</table>`).

I feel like the smarter we try to be here, the worse we fail on the edge cases, as they become ever more surprising. (And the only way to completely remove the edge cases is to do full HTML parsing.)
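The difference between the two cases comes down to where start condition 1 is allowed to match. A small sketch, assuming a regex approximation of that condition (`starts_kind1` is a hypothetical helper, not something from the spec):

```python
import re

# Approximation of start condition 1: up to three spaces of indentation,
# then <script, <pre or <style, then whitespace, ">" or end of line.
START_1 = re.compile(r" {0,3}<(?:script|pre|style)(?:\s|>|$)", re.IGNORECASE)


def starts_kind1(line):
    """True if the line could open a kind 1 HTML block."""
    return START_1.match(line) is not None


for line in ["<pre>", "<PRE class='x'>", "<tr><pre>", "<prefix>"]:
    print(f"{line!r}: {starts_kind1(line)}")
```

Only the first two lines qualify. In <tr><pre> the <pre> is not at the start of the line, so under the interruption idea it would never be seen as a kind 1 start.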

mgeier · April 12, 2017, 8:12am

@kivikakk The line has to be drawn somewhere, and I think it should be exactly between your two examples above.

HTML blocks have to start on their own line (and with good reason), so it makes sense that in your first example the <pre> block should be recognized as such.

In your second example there is something before the <pre>, so it wouldn’t make sense to start an HTML block there, even though it is valid HTML.

I think your first example is much more common and worth changing the spec for.
The second example will just stay one of those sad edge cases …

kivikakk · April 13, 2017, 1:33am

@mgeier I totally agree. I think it’ll cover the 95% case for GitHub’s users, too, judging by the support tickets I’ve seen come through.

@jgm Would you be interested in PRs to cmark and the spec that made such an adjustment? I think the effort and complexity in cmark won’t be too high.