Parsing strategy for tables?

vsch · December 5, 2016, 6:10pm

I don’t see how tables can be properly implemented if they cannot do pipe parsing with inlines taken into account.

I implemented a table extension in my parser, flexmark-java (https://github.com/vsch/flexmark-java), using the paragraph pre-processing strategy like reference definitions. So tables have to be the first element of a paragraph. Which is the case for all processors that have this extension, that I know of. I split the line on pipe boundaries along with inline processing for the table and save the parsed inlines for the table cells so that they don’t need to be re-parsed.

Without doing this you get all kinds of issues with pipes embedded in inline code, typographical quotes, etc and makes table usage difficult and non-standard compared to already existing markdown table extensions. My implementation is intuitive, works great with all inline markup and does not interfere with block parsing strategy, which I love for its speed, ease of maintenance and ease of extending.

The extension implements column spans using consecutive | at the end of a cell. One extra pipe for every extra column span. I think this is Php Markdown Extra or MultiMarkdown syntax. I got the syntax from pegdown, which I used to use for my markdown plugin for JetBrains IDEs before writing my own parser.

I found in my implementation of extensions that there is no proper way of handling some of them without extending the core parser and/or core inline parser, which is exactly what I did with the original commonmark-java. I removed all the special cases and exposed them as extension API usable by extensions.

Special case paragraph reference preprocessing became a ParagraphPreProcessor extension interface that worked well for implementing the table extension.

* flexmark-java is a rewrite of commonmark-java for a lossless source based AST with source tracking for all elements and parts of elements.

mity · December 5, 2016, 10:13pm

I have recently implemented GitHub-style tables for MD4C (as an extension which has to be turned explicitly on), and it is implemented at a block level, not inline level.

Details what is or is not parsed as a table by MD4C is more or less described in tables.txt which serves a little bit as a preliminary documentation as well as a test suite. (It’s the same format as CommonMark’s spec.txt.)

As of MD4C, the basic approach for parsing is based on this:

Table cannot interrupt a paragraph.
Table must have table header underline (specific mixture of -, : | characters and some optional whitespace, no other char is allowed; see the file tables.txt).
If such underline line is encountered and we have only one preceding line in the block currently being built, we check whether the preceding line contains (unescaped) |.
If yes we call it a table. (The preceding line that contains the head part of the table/)
Then we continue adding into the table any subsequent lines as long as they contain at least one (unescaped) |. First line without the | ends the table block.

EDIT: I believe the presence of the table head underline line (which cannot contain any backtick) is strong enough indicator we can interpret it all as a table. And given the table cell is actually block too (leaf one), then the pipe on each line should have higher priority then code span, to keep the uniformity with the rule “blocks first, inlines then”.

mity · December 5, 2016, 10:29pm

Please no. Consider that if you allow to recurse from block analysis via inline analysis back to block analysis, you may easily open a can of worms with new class of pathologic input with horrible computational complexity.

vsch · December 5, 2016, 10:40pm

There is no need to break the block/inline model which in my opinion is gold for Markdown parsing, if you have the option of extracting tables or other elements from the start of paragraphs.

At that point all the blocks have been parsed and what remains is paragraph pre-processing before processing it for inlines.

This has the best of both worlds. The block/inline parsing separation and access to inlines. I tried other approaches to parsing tables and nothing is as bullet because without having inlines parsed there are always possibilities of parsing a table block only to realize it does not meet constraints once you parse the inlines.

In my approach, I parse the table and if it does not meet constraints, I simply abort and leave the paragraph the way it was. If constraints are met, the table prefix part of the paragraph is chopped off into its own block node and what remains of the paragraph, if anything, is processed for inlines.

jgm · December 6, 2016, 12:54pm

I do not want to break block over inline priority.

However, how do you all propose to handle cases where you have a code span containing a pipe character inside a table cell?

kivikakk · December 12, 2016, 3:18am

The way we have row parsing implemented currently at GitHub is as follows:

Parse inlines in the input line into a transient paragraph container.
Split the paragraph into table cells whenever we encounter a pipe (|) in a top-level text node.

I like to think that this doesn’t actually break the block/inline priority order — because we post-process inlines only, the table cells (blocks) are essentially one level up from inlines, and there is no block parsed within an inline. It’s a bit of a semantic argument, admittedly. Alternatively, view table cells as inlines within a table row block, and then it’s fairly okay.

jgm · December 12, 2016, 8:46am

I’d love to get comments from others on @kivikakk’s strategy. This is perhaps the most fundamental issue in figuring out how to spec out pipe tables.

If I understand this right, it means that | cell breaks get lower priority than any other inline grouping constructs. So, for example, in the following we’ll just have one table cell rather than two:

| *a | b* |

EDIT: I should also note that @kivikakk’s strategy departs from the original strategy from Matthieu Duponchelle’s patch, which split on | characters before doing any inline parsing, excepting only backslash-escaped pipes. (I believe backslash-escaped pipes were also transformed into regular pipes, even in code spans, when they occurred in table cells.) @kivikakk reports that the original approach broke too many existing tables in people’s repositories.

kivikakk · December 12, 2016, 9:15am

That’s correct; the pipe isn’t in a text node that’s a direct child of the table cell [1], so we don’t split there. Mathieu’s strategy first splits, and so you get two cells with text *a and b*. Which one is “better” feels more like a matter of preference than correctness; I think you could argue for either [2], and we’ve gone with the other way to minimize breakage as we transitioned to CommonMark.

[1] actually, it’s the transient paragraph node at this point.
[2] my argument for splitting at top level pipes only is that it means you can transport an inline code span from outside a table into one without needing to additionally escape anything; likewise a [link](…) with a stray pipe in the URL, link text or title — not making users manually escape pipes in these instances feels like the best user experience.

xoofx · December 12, 2016, 10:17am

For markdig, the strategy followed is that | has a higher priority over emphasis inline elements but lower priority to code backsticks or links. The parsing of the table is done when parsing the inlines of a paragraph block (so that when match occurs, it will transform the ParagraphBlock into a Table)

If you look at this test on babelmark3:

| c | d |
| - | - |
| *a | b* |
| `e | f` |
| [g | h](http://a.com) |

the combination of resuts are still quite different across the different markdown processors.

But with a closer look:

For | *a | b* | most of the markdown parsers treat | as higher precedence than emphasis
For backsticks most parsers treat them with higher precedence over |
For | [g | h](http://a.com) | most parsers treat | with higher precedence than links

So in markdig, I followed 1) and 2), but for 3) I followed to give | lower precedence to links

mity · December 12, 2016, 10:37am

@kivikakk: So, if I can understand, you process inlines in the table block twice: Once during block processing to detect it is still a line belonging to the table (i.e. that it contains some valid cell boundary |) and then again when contents of the cells is actually processed. Correct?

kivikakk · December 12, 2016, 11:57am

@mity: yes, exactly. It’s not super efficient, and tomorrow I’ll probably look at making the twice-processing into once-processing

mity · December 12, 2016, 12:39pm

Well, I am not afraid about performance here: Tables likely are not that frequent/long. It was more about whether I really understand the strategy.

@jgm: I think I can ack this approach. It is both doable and reasonable. I’ll try to update MD4C to match this.

vsch · December 28, 2016, 12:45am

Why not process tables like reference definitions at start of paragraphs?

If it matches a table, it is broken away from the paragraph into a separate node. If it does not then the paragraph is not changed and all proceeds as if nothing happened.

There is no need to pre-parse for table blocks separately or parse the inlines twice since at the point where reference definitions are split off from beginning of paragraphs inline processing is performed.

I used this strategy and look for the cell terminating pipe before processing the next inline sequence. If there is a pipe at this point then it is not part of any inline.

I also use the parsed inlines as the content of the cell so that there is no need to re-parse the cells for inlines.

There is double parsing if a table is partially correct but aborts before completion because it does not meet some constraints. In this case all the parsed inlines will be re-parsed as part of the paragraph. This should be significantly less frequent than correct table parsing and should be more acceptable to pay a penalty on failed table parsing than have to pay it on every parsing correct tables.

I have it implemented in flexmark-java tables extension. Implemented parsing options handle GFM table syntax, Php Extras table syntax: multiple header rows, column spans and caption or can be configured to handle anything in between.

The advantage with this approach is that it does not break the blocks first strategy, does not have to double parse and extends on the existing approach for handling reference definitions at the beginning of paragraphs to make it a generic extension strategy.