Parsing strategy for tables?

xoofx · February 27, 2016, 2:49pm

Hi there!

I have recently been implementing the CommonMark spec for another .NET library, and while all tests are passing with the latest spec (\o/ but god, never thought that Markdown parsing would be sooo difficult!), I’m now tackling the problem of implementing some extensions like tables.

I have followed the parsing strategy described in CommonMark specs and it has been working well. But for tables, I’m not sure if it can fit in this strategy.

If we take the principle that a table contains at least one pipe |, we still need to analyze things like backticks and escapes in order to avoid creating a TableBlock instead of, say, a ParagraphBlock.

So for example, this should be correctly parsed as a table:

`Column1 |` | Column2
----------- | -------
0           | 1

but this should not:

`Column1 |` Column2
0       `|` 1

But now, if we follow the parsing strategy from the spec, the parsing of inlines is supposed to happen after the parsing of the block structure. So we would not be able to track the backticks, escapes, etc.

I have been thinking of allowing the inline parser to transform the block structure (the current leaf block) if it finds a pipe | while parsing the inlines. Not sure it is entirely feasible, but it seems the most obvious way.

Do you have any other ideas about how to handle this?

bp_ · February 27, 2016, 7:08pm

I think the spec has the right idea. That markup might look silly, but what about this?

Keybind | Meaning
------- | ---------------------------------------------------
Ctrl-`  | Switch to the next window (see also alt-`)
Alt-`   | Switch to the previous window (see also ctrl-`)

jgm · February 27, 2016, 8:31pm

This is a nice observation. I hadn’t really considered how one might add tables. You’re right that this puts some pressure on the idea that block parsing can be done independent of inline parsing.

We could bite the bullet and say that vertical bars always indicate table cell divisions, even if escaped or inside backticks. I don’t like this idea, though, because it makes it impossible to put certain things in table cells.

Babelmark2 shows some differences in current parsers. Pandoc, PHP Markdown Extra, Minima, cebe/markdown recognize the vertical bar in a code span; RDiscount, Maruku, Parsedown don’t.

We might have to break the block/inline parsing separation in this case. (I had thought it might work to tokenize the line into backtick spans and other characters, but consider <a href="`">foo`</a>; you can’t just blindly look for backtick spans without considering at least some other inline constructions.)
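To make the parenthetical concrete, here is a small Python sketch of a naive backtick tokenizer. The regex and the function name are illustrative assumptions, not code from any of the parsers mentioned. It pairs up backticks without looking at other inline constructs, so it reports a code span that starts inside the HTML attribute value.

```python
import re

# Naive assumption: any two backticks on a line delimit a code span.
# Real inline parsing must also recognize raw HTML tags, autolinks, etc.
NAIVE_CODE_SPAN = re.compile(r"`[^`]*`")


def naive_code_spans(line):
    """Return every substring the naive tokenizer would treat as a code span."""
    return NAIVE_CODE_SPAN.findall(line)


line = '<a href="`">foo`</a>'
# The first backtick belongs to the href attribute, yet it gets paired
# with the backtick after "foo".
print(naive_code_spans(line))  # ['`">foo`']
```

A pipe placed between those two backticks would therefore be hidden from a table splitter built on this tokenizer, even though it is not inside a real code span.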

As a matter of table syntax design: It would help a lot if there were an unambiguous signal (not requiring inline parsing) that the line was to be parsed as a table line. For example, we might require a | character at the beginning and end of the table row. Then the parser would know, ahead of time, that the line should be parsed as a table row. We could then set up a version of the inline parser that returns when it hits an unescaped | character. Without the signal of the leading |, we’d have to parse every line as inlines (or at least every line containing a |), and this would be inefficient.
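A minimal Python sketch of that design, under stated assumptions: the helper names `split_table_row` and `_closing_run_end` are hypothetical, a row must start and end with an unescaped |, and only backslash escapes and backtick code spans are understood (inline HTML, links and other constructs are ignored here).

```python
def _closing_run_end(text, start, length):
    """Index just past the next backtick run of exactly `length`, or -1."""
    i, n = start, len(text)
    while i < n:
        if text[i] == "`":
            j = i
            while j < n and text[j] == "`":
                j += 1
            if j - i == length:
                return j
            i = j
        else:
            i += 1
    return -1


def split_table_row(line):
    """Return the cells of a |-delimited row, or None if it is not a row.

    Hypothetical helper: pipes that are backslash-escaped or inside a
    code span do not end a cell.
    """
    text = line.strip()
    if len(text) < 2 or not text.startswith("|"):
        return None  # no leading pipe: not a table row, nothing to scan
    cells, current = [], []
    i, n = 1, len(text)  # skip the leading pipe
    ended_on_pipe = False
    while i < n:
        ch = text[i]
        ended_on_pipe = False
        if ch == "\\" and i + 1 < n:
            current.append(text[i:i + 2])  # keep the escape for the inline phase
            i += 2
        elif ch == "`":
            j = i
            while j < n and text[j] == "`":
                j += 1
            end = _closing_run_end(text, j, j - i)
            if end == -1:
                current.append(text[i:j])  # unmatched run: literal backticks
                i = j
            else:
                current.append(text[i:end])  # whole code span, pipes included
                i = end
        elif ch == "|":
            cells.append("".join(current).strip())
            current = []
            ended_on_pipe = True
            i += 1
        else:
            current.append(ch)
            i += 1
    return cells if ended_on_pipe else None  # a trailing pipe is required
```

With this sketch, `split_table_row("| `Column1 |` | Column2 |")` yields two cells, while a line without the leading | is rejected immediately, before any scanning takes place.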

xoofx · February 27, 2016, 11:54pm

It would, but markdown is more about making life easier for writers than for implementers!

In fact, in my parser, extensibility points are important, so adding a new inline parser is possible without hurting the performance (there is an early lookup for an opening character which is just a table lookup). So I will try processing the table at inline processing time instead of block time (and allowing the inline parser to change the block structure if necessary). I will get back here with some feedback.

xoofx · February 27, 2016, 11:57pm

Indeed! That would require either escaping the backtick in order to work, or processing the current stack of inlines when we hit a | (as we do with a ] for an inline image). I will try the 2nd approach if possible.

[Edit] Actually, this is incorrect. The only possible workaround is using the \ [/Edit]

jgm · February 28, 2016, 1:55am

Yes, that approach might work too. Let me know how it works for you.
Precedent: that is how we currently handle reference link definitions in cmark and commonmark.js.

MathieuDuponchelle · February 28, 2016, 11:31pm

My test table extension (https://github.com/MathieuDuponchelle/cmark/commits/extensions_draft_3, https://github.com/jgm/cmark/issues/100) fails to parse the first sample, as the “setext-like” line is correctly interpreted as 2 columns, and the first line is “incorrectly” interpreted as being 3 columns.

I can imagine a way out of this that preserves the block / inline parsing separation, though it relies on a pretty nitpicky interpretation of the rules:

It is postulated that blocks bind more closely than any inline parsing rules; it could thus be considered “legal” to do the following:

Change the first line to :

`Column1 \|` | Column2

At block parsing time, we could decide to interpret the backslash as an escape character, even though the rule in inline code nodes is to not interpret these. As we do match (and with two columns now, yay), we remove the backslash character for the inline parsing phase. Would that be too much bending of the rules?
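The proposal above could be sketched roughly as follows in Python. This is an illustration under assumptions, not the code of the cmark extension: the function name `split_row_block_level` is hypothetical, rows are not required to have leading or trailing pipes, and an escaped backslash in front of a pipe is not handled.

```python
def split_row_block_level(line):
    """Split a row on pipes at block-parsing time, ignoring inline syntax.

    Every unescaped '|' ends a cell, even inside backticks. A '\\|' pair
    never ends a cell; its backslash is dropped so that the inline phase
    later sees a plain '|'.
    """
    cells, current = [], []
    i, n = 0, len(line)
    while i < n:
        if line[i] == "\\" and i + 1 < n and line[i + 1] == "|":
            current.append("|")  # escaped pipe: keep the pipe, drop the backslash
            i += 2
        elif line[i] == "|":
            cells.append("".join(current).strip())  # cell boundary
            current = []
            i += 1
        else:
            current.append(line[i])
            i += 1
    cells.append("".join(current).strip())  # last cell (no trailing pipe required)
    return cells


# Escaped pipe inside the code span: two cells, backslash removed.
print(split_row_block_level("`Column1 \\|` | Column2"))
# Unescaped pipe inside the code span: the row splits into three cells.
print(split_row_block_level("`Column1 |` | Column2"))
```

The second call reproduces the three-column reading of the original header line, which no longer matches the two-column delimiter row.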

MathieuDuponchelle · February 28, 2016, 11:42pm

By the way afaiu the second example should be parsed as table (if you don’t require the setext-like second line), following the same logic, ie block-level syntax rules bind more closely than inline ones.

MathieuDuponchelle · February 29, 2016, 1:09am

I have modified my implementation to what I consider to be the correct behaviour:

Given this input:

| `Column1 |` | Column2 |
| ----------- | ------- |
| 0           | 1       |

the output is:

[meh@meh-host cmark]$ ./build/src/cmark example.md -e piped-tables -t xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text>| </text>
    <code>Column1 |</code>
    <text> | Column2 |</text>
    <softbreak />
    <text>| ----------- | ------- |</text>
    <softbreak />
    <text>| 0           | 1       |</text>
  </paragraph>
</document>
[meh@meh-host cmark]$

Given this input:

| `Column1 \|` | Column2 |
| ------------ | ------- |
| 0            | 1       |

The output is:

[meh@meh-host cmark]$ ./build/src/cmark example.md -e piped-tables -t xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <table>
    <table_header>
      <table_cell>
        <text> </text>
        <code>Column1 |</code>
      </table_cell>
      <table_cell>
        <text> Column2</text>
      </table_cell>
    </table_header>
    <table_row>
      <table_cell>
        <text> 0</text>
      </table_cell>
      <table_cell>
        <text> 1</text>
      </table_cell>
    </table_row>
  </table>
</document>
[meh@meh-host cmark]$

Given this input:

| `Column1 |` Column2 |
| -------- | -------- |
| 0       `|` 1       |

Output is:

[meh@meh-host cmark]$ ./build/src/cmark example.md -e piped-tables -t xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <table>
    <table_header>
      <table_cell>
        <text> </text>
        <text>`</text>
        <text>Column1</text>
      </table_cell>
      <table_cell>
        <text>`</text>
        <text> Column2</text>
      </table_cell>
    </table_header>
    <table_row>
      <table_cell>
        <text> 0       </text>
        <text>`</text>
      </table_cell>
      <table_cell>
        <text>`</text>
        <text> 1</text>
      </table_cell>
    </table_row>
  </table>
</document>
[meh@meh-host cmark]$

Finally, if one wishes to have this last sample not render as a table, it is enough to escape any of the pipes outside of a code span:

| `Column1 |` Column2 \|
| -------- | -------- |
| 0       `|` 1       |

Output is:

[meh@meh-host cmark]$ ./build/src/cmark example.md -e piped-tables -t xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text>| </text>
    <code>Column1 |</code>
    <text> Column2 </text>
    <text>|</text>
    <softbreak />
    <text>| -------- | -------- |</text>
    <softbreak />
    <text>| 0       </text>
    <code>|</code>
    <text> 1       |</text>
  </paragraph>
</document>
[meh@meh-host cmark]$

I’m happy with this behaviour, seems to solve the issues that were raised here as far as I can tell?

xoofx · February 29, 2016, 11:56pm

The parsing at inline time worked pretty well, no annoying issues and overall code to implement this feature was quite simple.

MathieuDuponchelle · March 1, 2016, 12:26am

Any impact on performance ?

xoofx · March 1, 2016, 12:32am

The impact is null if tables are not used in a document. The paragraph-to-table transformation is only triggered if at least one | is found while processing the text on the first line of a paragraph. Then, at the end of the paragraph, it processes all the | delimiters but exits early if it is not a valid table (e.g. if it finds a line without a |, etc.)
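A rough Python sketch of that trigger and early exit, under assumptions: `count_cell_pipes` and `is_pipe_table` are hypothetical names, only backslash escapes and single-line backtick code spans are recognized, and the delimiter-row check that a real implementation would also need is left out.

```python
import re

# Tokens that matter for pipe counting: a backslash escape, a whole
# code span (backtick runs of equal length), or a bare pipe.
TOKEN = re.compile(
    r"\\."
    r"|(?<!`)(?P<ticks>`+)(?!`).+?(?<!`)(?P=ticks)(?!`)"
    r"|(?P<pipe>\|)"
)


def count_cell_pipes(line):
    """Count pipes that are neither escaped nor inside a code span."""
    return sum(1 for m in TOKEN.finditer(line) if m.group("pipe"))


def is_pipe_table(paragraph_lines):
    """Decide, at inline time, whether a paragraph should become a table."""
    # Trigger: nothing happens unless the first line has a cell delimiter.
    if not paragraph_lines or count_cell_pipes(paragraph_lines[0]) == 0:
        return False
    # Early exit: a single line without a delimiter means "not a table".
    for line in paragraph_lines[1:]:
        if count_cell_pipes(line) == 0:
            return False
    return True


table = ["`Column1 |` | Column2", "----------- | -------", "0           | 1"]
not_table = ["`Column1 |` Column2", "0       `|` 1"]
print(is_pipe_table(table), is_pipe_table(not_table))  # True False
```

Under these simplified rules the two backticks in the Ctrl-` row from earlier in the thread pair up into a code span that swallows the pipe, so that row would need a backslash escape.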

MathieuDuponchelle · March 1, 2016, 12:34am

Hm yeah obviously a non-activated extension should not have an effect on performance.

However I still wonder if my interpretation of the rules is the correct one, as this is a pretty novel syntax type it would be good to make sure.

MathieuDuponchelle · March 1, 2016, 12:37am

To be more clear, it seems to me that considering your first table example as an actual table, with its header row containing 2 columns is in contradiction with the rule that block syntax should take precedence over inline syntax, which is also what enables an absolute separation between the block and inline parsing phases.

xoofx · March 1, 2016, 12:55am

I meant that the extension is activated, but it is only triggered if a | is found in a paragraph block at inline parsing time. On a regular document not using | it has no impact (like the CommonMark specs)

Note that I have just followed the syntax for pipe tables that is common in many Markdown processors (e.g. PHP Markdown Extra and Pandoc); I haven’t changed the existing “rules”.

Concerning the rule of “block syntax should take precedence over inline syntax”, I don’t mind having an inline parser that can transform a block if necessary. It allows some context sensitive discovery without putting the burden on the syntax and writer (and well, Markdown is all about this!). It is more to allow late discovery of block syntax at inline time, and this should not be restricted by a rule in the spec (it is more a parsing strategy/recommendation)

MathieuDuponchelle · March 1, 2016, 1:08am

“It allows some context sensitive discovery without putting the burden on the syntax and writer”

This edge case is equivocal for the reader / writer of a markdown document, ie your mind sees

foo `|` bar

as text + inline code + text, and my mind sees it as a table row, and no one’s right or wrong

Consistent rules here are beneficial both to the reader / writer and the implementer, and IMHO the rule is already set that disambiguation should happen in favor of block-level rules.

As for “many existing parsers”, here’s the result with github for this input:

| Hello   `|` You |
| -------- | ---- |
| Foo      | Bar  |

GitHub - MathieuDuponchelle/dbus-deviation: A project for parsing D-Bus introspection XML and processing it in various ways. (disregard the random github repo)

It treats this as both a table and a line with inline code, which is silly but shows that defining clear rules is really needed.

xoofx · March 1, 2016, 1:26am

The precedence rule of ` over | is fine for me. It also allows things like inline HTML to take precedence over pipe tables, etc. That’s why it has to be done at inline time.

Regarding the rules and the babelmark2 results: for sure, the rules will have to be clarified in an extension spec; corner cases are what a spec is for. At least the general behavior that a ` can escape a | (same for inline HTML) seems quite solid to me. I will be happy to join in and participate in a CommonMark pipe-table spec later, once CommonMark 1.0 is out!

MathieuDuponchelle · March 1, 2016, 1:33am

Well not much more to say there, I think I’ve made my point pretty clear, let’s see what @jgm and others have to say.

xoofx · March 8, 2016, 1:16am

I have added support for grid tables, which don’t suffer from the | escaping problem of pipe tables. Not surprising, but at least no special case is needed for them.

Thinking a bit more about it, imho, handling pipe tables at inline parsing time is the least bad way to go. Using the escape \| would break many existing inlines that use | (including content in links [...], HTML inlines, code spans…), and even if there are some babelmark2 discrepancies (among those that handle this case, of course), it is how they are currently handling this case (even if some of them do it in an ugly way, like splitting on | in the final HTML string, which is not really robust…)

MathieuDuponchelle · March 12, 2016, 11:19pm

Well, I still disagree for the same reasons I already stated. To be honest, I should also say that not “breaking” things that you describe as not robust doesn’t really worry me.