I have recently been implementing the CommonMark spec for another .NET library, and while all tests are passing with the latest spec (\o/ but god, I never thought that Markdown parsing would be sooo difficult!), I’m now tackling the problem of implementing some extensions like tables.
I have followed the parsing strategy described in CommonMark specs and it has been working well. But for tables, I’m not sure if it can fit in this strategy.
If we take the principle that a table contains at least one pipe |, we still need to analyze things like backticks and escapes in order to avoid creating a TableBlock instead of, say, a ParagraphBlock.
So for example, this should be correctly parsed as a table:
`Column1 |` | Column2
----------- | -------
0 | 1
but this should not:
`Column1 |` Column2
0 `|` 1
But now, if we follow the parsing strategy from the specs, the parsing of inlines is supposed to happen after the parsing of the block structure. So we would not be able to track the backticks, escapes, etc.
I have been thinking of allowing the inline parser to transform the block structure (the current leaf block) when it finds a pipe |. I’m not sure it is entirely feasible, but it seems the most obvious way.
Do you have any other ideas about how to handle this?
This is a nice observation. I hadn’t really considered how one might add tables. You’re right that this puts some pressure on the idea that block parsing can be done independent of inline parsing.
We could bite the bullet and say that vertical bars always indicate table cell divisions, even if escaped or inside backticks. I don’t like this idea, though, because it makes it impossible to put certain things in table cells.
Babelmark2 shows some differences in current parsers. Pandoc, PHP Markdown Extra, Minima, cebe/markdown recognize the vertical bar in a code span; RDiscount, Maruku, Parsedown don’t.
We might have to break the block/inline parsing separation in this case. (I had thought it might work to tokenize the line into backtick spans and other characters, but consider <a href="`">foo`</a>; you can’t just blindly look for backtick spans without considering at least some other inline constructions.)
As a matter of table syntax design: It would help a lot if there were an unambiguous signal (not requiring inline parsing) that the line was to be parsed as a table line. For example, we might require a | character at the beginning and end of the table row. Then the parser would know, ahead of time, that the line should be parsed as a table row. We could then set up a version of the inline parser that returns when it hits an unescaped | character. Without the signal of the leading |, we’d have to parse every line as inlines (or at least every line containing a |), and this would be inefficient.
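To make the "leading and trailing |" idea concrete, here is a minimal sketch in Python. The function name `split_row` and its helpers are hypothetical, not taken from any existing parser. It only knows about backslash escapes and backtick code spans, so it deliberately ignores raw inline HTML and other constructs, which is exactly the limitation raised above.

```python
def _backtick_run(s, i):
    """Length of the run of backticks starting at s[i]."""
    j = i
    while j < len(s) and s[j] == "`":
        j += 1
    return j - i


def _code_span_end(s, i):
    """Index just past the code span opening at s[i], or None if it never closes."""
    n = _backtick_run(s, i)
    j = i + n
    while j < len(s):
        if s[j] == "`":
            m = _backtick_run(s, j)
            if m == n:          # a closing run must have the same length
                return j + m
            j += m
        else:
            j += 1
    return None


def split_row(line):
    """Split a row on '|' that is neither backslash-escaped nor inside a code span.

    Assumption: the row is flagged by a leading and trailing '|'.
    Simplification: a trailing '\\|' is not detected as an escaped pipe.
    """
    body = line.strip()
    if len(body) < 2 or body[0] != "|" or body[-1] != "|":
        raise ValueError("row must start and end with '|'")
    body = body[1:-1]
    cells, start, i = [], 0, 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            i += 2                      # skip the escaped character
        elif ch == "`":
            end = _code_span_end(body, i)
            # an unmatched backtick run is literal text, so just step over it
            i = end if end is not None else i + _backtick_run(body, i)
        elif ch == "|":
            cells.append(body[start:i].strip())
            start = i + 1
            i += 1
        else:
            i += 1
    cells.append(body[start:].strip())
    return cells
```

With this sketch, `split_row("| `Column1 |` | Column2 |")` yields two cells, because the pipe inside the code span is skipped. Cell contents are returned unparsed, so a real implementation would still hand each cell to the regular inline parser.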
It would, but markdown is more about making life easier for writers than for implementers!
In fact, in my parser, extensibility points are important, so adding a new inline parser is possible without hurting the performance (there is an early lookup for an opening character which is just a table lookup). So I will try processing the table at inline processing time instead of block time (and allowing the inline parser to change the block structure if necessary). I will get back here with some feedback.
Indeed! That would require either an escape for the backtick in order to work, or processing the current stack of inlines when we hit a | (as we do for a ] for an inline image). I will try the second approach if possible.
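For readers wondering what such a lookup might look like, here is a rough Python sketch. It is not the actual .NET implementation, and `InlineDispatcher`, `register`, `tokenize` and `pipe_handler` are all made-up names. The point is only that a character with no registered handler costs one dictionary lookup.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A handler receives (text, position) and returns (token, new_position),
# or None to decline. Assumption: new_position is always > position.
Handler = Callable[[str, int], Optional[Tuple[str, int]]]


class InlineDispatcher:
    """Hypothetical dispatcher keyed on the opening character of an inline."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, opening_char: str, handler: Handler) -> None:
        self._handlers[opening_char] = handler

    def tokenize(self, text: str) -> List[str]:
        tokens: List[str] = []
        literal_start = 0
        i = 0
        while i < len(text):
            handler = self._handlers.get(text[i])   # the cheap early lookup
            result = handler(text, i) if handler else None
            if result is None:
                i += 1                               # plain character
                continue
            token, new_i = result
            if literal_start < i:
                tokens.append(text[literal_start:i])  # flush pending literal text
            tokens.append(token)
            i = literal_start = new_i
        if literal_start < len(text):
            tokens.append(text[literal_start:])
        return tokens


def pipe_handler(text: str, i: int) -> Tuple[str, int]:
    """Emit a marker token for a table cell delimiter."""
    return ("<PIPE>", i + 1)
```

Without the extension registered, `tokenize("a | b")` returns the text as a single literal; after `register("|", pipe_handler)` the pipe becomes its own token, which a later pass could use to rewrite the enclosing paragraph into a table.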
[Edit] Actually, this is incorrect. The only possible workaround is using the \ [/Edit]
I can imagine a way out of this, though, that preserves the block/inline parsing separation. It’s a pretty nitpicky interpretation of the rules, though:
It is postulated that block rules bind more tightly than any inline parsing rules; it could thus be considered “legal” to do the following:
Change the first line to:
`Column1 \|` | Column2
At block parsing time, we could decide to interpret the backslash as an escape character, even though the rule in inline code nodes is to not interpret these. As we do match (and with two columns now, yay), we remove the backslash character for the inline parsing phase. Would that be too much bending of the rules?
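A small Python sketch of that block-level interpretation, with a hypothetical function name: split on every pipe not preceded by a backslash, code span or not, then drop the backslash before handing the cells to the inline phase. It does not handle an escaped backslash directly before a pipe (`\\|`), which a real rule would have to address.

```python
import re

# A '|' that is not immediately preceded by a backslash.
# Assumption: backslash-escaping is decided at block level, even inside backticks.
_UNESCAPED_PIPE = re.compile(r"(?<!\\)\|")


def split_cells_block_level(line):
    """Split a row at block-parsing time, ignoring all inline structure."""
    parts = _UNESCAPED_PIPE.split(line)
    # Remove the escaping backslash so the inline parser sees a plain '|'.
    return [part.replace("\\|", "|").strip() for part in parts]
```

Under this rule the escaped header line gives two cells, `` `Column1 |` `` and `Column2`, while the unescaped original splits into three cells because the pipe inside the backticks is treated as a delimiter.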
By the way, afaiu the second example should be parsed as a table (if you don’t require the setext-like second line), following the same logic, i.e. block-level syntax rules bind more tightly than inline ones.
There is no impact if tables are not used in a document. The paragraph-to-table transformation is only triggered if at least one | is found while processing the text on the first line of a paragraph. Then, at the end of the paragraph, it processes all the | but exits early if it is not a valid table (e.g. lines without a | are found, etc.).
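Roughly, the trigger and early exit described here could be sketched like this in Python. The name `paragraph_to_table` is invented, and a regex for unescaped pipes stands in for the delimiters that the real inline parser would report, so pipes inside code spans or inline HTML are not excluded in this simplified version.

```python
import re
from typing import List, Optional

# Stand-in for "a pipe delimiter found by the inline parser" (assumption).
_PIPE = re.compile(r"(?<!\\)\|")


def paragraph_to_table(lines: List[str]) -> Optional[List[List[str]]]:
    """Return table rows for a paragraph, or None to keep it a paragraph."""
    # Trigger: only paragraphs whose first line contains a pipe are considered,
    # so documents without tables pay for a single search per paragraph.
    if not lines or not _PIPE.search(lines[0]):
        return None
    rows = []
    for line in lines:
        if not _PIPE.search(line):
            return None          # early exit: a line without a pipe is not a row
        rows.append([cell.strip() for cell in _PIPE.split(line)])
    return rows
```

A paragraph such as `["a | b", "0 | 1"]` becomes two rows of two cells, while `["a | b", "no pipe here"]` stays a paragraph.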
To be more clear, it seems to me that considering your first table example as an actual table, with its header row containing 2 columns is in contradiction with the rule that block syntax should take precedence over inline syntax, which is also what enables an absolute separation between the block and inline parsing phases.
I meant that the extension is activated, but it is only triggered if a | is found in a paragraph block at inline parsing time. On a regular document not using | it has no impact (like the CommonMark specs)
Note that I have just followed the syntax for pipe tables that is common in many Markdown processors (e.g. PHP Markdown Extra and Pandoc); I haven’t changed the existing “rules”.
Concerning the rule that “block syntax should take precedence over inline syntax”, I don’t mind having an inline parser that can transform a block if necessary. It allows some context-sensitive discovery without putting the burden on the syntax and the writer (and well, Markdown is all about this!). It is more about allowing late discovery of block syntax at inline time, and this should not be restricted by a rule in the spec (it is more a parsing strategy/recommendation).
The precedence rule of ` over | is fine for me. It also allows things like inline HTML to take precedence over pipe tables, etc. That’s why it has to be done at inline time.
Regarding the rules and the babelmark2 results: for sure, the rules will have to be clarified for an extension spec; that is what a spec is for, corner cases. At least the general behavior that ` can escape a | (same for inline HTML) seems quite solid to me. I will be happy to join the wagon and participate in a CommonMark pipe-table spec later, when CommonMark 1.0 is out!
I have added support for grid tables, which don’t suffer from the | escape problem of pipe tables. Not surprising, but at least no special case is needed for them.
Thinking a bit more about it, imho, handling pipe tables at inline parsing time is the least bad way to go. Requiring the escape \| would break many existing inlines that use | (including content in links [...], HTML inlines, code spans…), and even if there are some babelmark2 discrepancies (among those that handle this case, of course), this is how they currently handle it (even if some of them do so in an ugly way, like splitting by | on the final HTML string, which is not really robust…).