Parsing strategy for tables?


#34

But conversely, any issues with the implementation(s) will be seen by less eyes.


#35

This is true. In cases of inconsistencies in other implementations, I think that is a good reason to improve the spec tests (which all CommonMark implementations should pass).


#36

To give you a glimpse, If I wanted to implement this in C, It would be a major rewrite of the implem (and would be pretty similar to my C# implem), not an incremental evolution and I would do it very differently from how it is done today. Even in your branch inlines.c, the switch case is hardcoded in the old-school way in this method, even if the try_extensions is added as an escape, which is bttw far from being performance friendly, as you need to go through all extensions everytime a character is not handled by static cases to find a good candidate while all of this should be cached in advanced and be performed by a simple table lookup.

In my code, everything, including standard markdown parsing is done through plugins, I can disable code backsticks if I want, an extension could decide to change the opening character of for parsing heading text and replace # by @ for example…etc. The main parsing method for inlines is just a loop of 100 lines, and everything else is self contained in one feature per file (instead of having a big blob file like inlines.c). So yes, that justifies what I talked about “radical” changes.

Also, as @chrisalley noted about implems strengthening the specs, it is typically what happened while I implemented CommonMark from scratch: See for example issue #395


#37

Even in your branch inlines.c, the switch case is hardcoded in the old-school way in this method, even if the try_extensions is added as an escape, which is bttw far from being performance friendly, as you need to go through all extensions everytime a character is not handled by static cases to find a good candidate while all of this should be cached in advanced and be performed by a simple table lookup.

That’s a very trivial remark. The check is only done for special characters, which makes it mostly harmless, and making this use a table would take around 5 minutes if profiling showed it was indeed draining performance.

So yes, that justifies what I talked about “radical” changes.

Not really, that simply shows you did this in your parser, which you develop in a closed-source manner, making it difficult to compare it factually and benchmark against the current cmark implementation.


#38

I will release it as soon as the API is stable, naming of the project is “secured”, another related project is finished, and the proper website and documentation are done… It will be published on github and announced on this forum. It will come also along accompanied benchmark compare to other .NET solutions and to cmark for raw performance (for now, at least 20-40% slower, but not feature wise equivalent). So, stay tuned…


#39

Cool :slight_smile: Care to share the list of extensions you currently have implemented, and detail the issues you might have had if any?


#40

So far, I have implemented the following extensions:

  • abbreviations
  • auto identifiers and auto link (similar to pandoc)
  • css bootstrap (to output some classes specific to bootstrap for some elements)
  • custom containers (mostly block ::: and inline ::)
  • definition lists
  • Emojy/Emoticons
  • Emphasis extra (strikethrough, subscript, superscript, inserted, marked)
  • Figures (as described here)
  • Footnotes
  • Special/Attached attributes (allows to attach css attributes to the current block or the previous inlines with the syntax {...}), works with all blocks and all inlines, including headings, fencedblocks
  • Softline as Hardline
  • Lettered list (a. b., A. B., i, ii. iv, …etc.)
  • Mathematics (block/inline escape $$...$$, and $...$)
  • Medias (renders differently image with url that have a mime/type that is a video or music, or links to youtube/vimeo)
  • Smartypants
  • Pipe tables
  • Grid tables (slightly extended version of pandoc)

No real issues. Maybe attached attributes are a bit specials, as they need to have hooks for some other block parsers (like fenced code blocks, to parse the attributes before parsing the info string, or for headings before removing the optional trailings ###).

In terms of performance, the side effect of adding proper extensibility for all these extensions (including the ability to remove/modify standard CommonMark parsing) has slow down the whole by around 20% (in C#), which is quite acceptable.


#41

I don’t see how tables can be properly implemented if they cannot do pipe parsing with inlines taken into account.

I implemented a table extension in my parser, flexmark-java (https://github.com/vsch/flexmark-java), using the paragraph pre-processing strategy like reference definitions. So tables have to be the first element of a paragraph. Which is the case for all processors that have this extension, that I know of. I split the line on pipe boundaries along with inline processing for the table and save the parsed inlines for the table cells so that they don’t need to be re-parsed.

Without doing this you get all kinds of issues with pipes embedded in inline code, typographical quotes, etc and makes table usage difficult and non-standard compared to already existing markdown table extensions. My implementation is intuitive, works great with all inline markup and does not interfere with block parsing strategy, which I love for its speed, ease of maintenance and ease of extending.

The extension implements column spans using consecutive | at the end of a cell. One extra pipe for every extra column span. I think this is Php Markdown Extra or MultiMarkdown syntax. I got the syntax from pegdown, which I used to use for my markdown plugin for JetBrains IDEs before writing my own parser.

I found in my implementation of extensions that there is no proper way of handling some of them without extending the core parser and/or core inline parser, which is exactly what I did with the original commonmark-java. I removed all the special cases and exposed them as extension API usable by extensions.

Special case paragraph reference preprocessing became a ParagraphPreProcessor extension interface that worked well for implementing the table extension.

* flexmark-java is a rewrite of commonmark-java for a lossless source based AST with source tracking for all elements and parts of elements.


#42

I have recently implemented GitHub-style tables for MD4C (as an extension which has to be turned explicitly on), and it is implemented at a block level, not inline level.

Details what is or is not parsed as a table by MD4C is more or less described in tables.txt which serves a little bit as a preliminary documentation as well as a test suite. (It’s the same format as CommonMark’s spec.txt.)

As of MD4C, the basic approach for parsing is based on this:

  1. Table cannot interrupt a paragraph.
  2. Table must have table header underline (specific mixture of -, : | characters and some optional whitespace, no other char is allowed; see the file tables.txt).
  3. If such underline line is encountered and we have only one preceding line in the block currently being built, we check whether the preceding line contains (unescaped) |.
  4. If yes we call it a table. (The preceding line that contains the head part of the table/)
  5. Then we continue adding into the table any subsequent lines as long as they contain at least one (unescaped) |. First line without the | ends the table block.

EDIT: I believe the presence of the table head underline line (which cannot contain any backtick) is strong enough indicator we can interpret it all as a table. And given the table cell is actually block too (leaf one), then the pipe on each line should have higher priority then code span, to keep the uniformity with the rule “blocks first, inlines then”.


#43

Please no. Consider that if you allow to recurse from block analysis via inline analysis back to block analysis, you may easily open a can of worms with new class of pathologic input with horrible computational complexity.


#44

There is no need to break the block/inline model which in my opinion is gold for Markdown parsing, if you have the option of extracting tables or other elements from the start of paragraphs.

At that point all the blocks have been parsed and what remains is paragraph pre-processing before processing it for inlines.

This has the best of both worlds. The block/inline parsing separation and access to inlines. I tried other approaches to parsing tables and nothing is as bullet because without having inlines parsed there are always possibilities of parsing a table block only to realize it does not meet constraints once you parse the inlines.

In my approach, I parse the table and if it does not meet constraints, I simply abort and leave the paragraph the way it was. If constraints are met, the table prefix part of the paragraph is chopped off into its own block node and what remains of the paragraph, if anything, is processed for inlines.


#45

I do not want to break block over inline priority.

However, how do you all propose to handle cases where you have a code span containing a pipe character inside a table cell?


#46

The way we have row parsing implemented currently at GitHub is as follows:

  • Parse inlines in the input line into a transient paragraph container.
  • Split the paragraph into table cells whenever we encounter a pipe (|) in a top-level text node.

I like to think that this doesn’t actually break the block/inline priority order — because we post-process inlines only, the table cells (blocks) are essentially one level up from inlines, and there is no block parsed within an inline. It’s a bit of a semantic argument, admittedly. Alternatively, view table cells as inlines within a table row block, and then it’s fairly okay.


#47

I’d love to get comments from others on @kivikakk’s strategy. This is perhaps the most fundamental issue in figuring out how to spec out pipe tables.

If I understand this right, it means that | cell breaks get lower priority than any other inline grouping constructs. So, for example, in the following we’ll just have one table cell rather than two:

| *a | b* |

EDIT: I should also note that @kivikakk’s strategy departs from the original strategy from Matthieu Duponchelle’s patch, which split on | characters before doing any inline parsing, excepting only backslash-escaped pipes. (I believe backslash-escaped pipes were also transformed into regular pipes, even in code spans, when they occurred in table cells.) @kivikakk reports that the original approach broke too many existing tables in people’s repositories.


#48

That’s correct; the pipe isn’t in a text node that’s a direct child of the table cell [1], so we don’t split there. Mathieu’s strategy first splits, and so you get two cells with text *a and b*. Which one is “better” feels more like a matter of preference than correctness; I think you could argue for either [2], and we’ve gone with the other way to minimize breakage as we transitioned to CommonMark.

[1] actually, it’s the transient paragraph node at this point.
[2] my argument for splitting at top level pipes only is that it means you can transport an inline code span from outside a table into one without needing to additionally escape anything; likewise a [link](…) with a stray pipe in the URL, link text or title — not making users manually escape pipes in these instances feels like the best user experience.


#49

For markdig, the strategy followed is that | has a higher priority over emphasis inline elements but lower priority to code backsticks or links. The parsing of the table is done when parsing the inlines of a paragraph block (so that when match occurs, it will transform the ParagraphBlock into a Table)

If you look at this test on babelmark3:

| c | d |
| - | - |
| *a | b* |
| `e | f` |
| [g | h](http://a.com) |

the combination of resuts are still quite different across the different markdown processors.

But with a closer look:

  1. For | *a | b* | most of the markdown parsers treat | as higher precedence than emphasis
  2. For backsticks most parsers treat them with higher precedence over |
  3. For | [g | h](http://a.com) | most parsers treat | with higher precedence than links

So in markdig, I followed 1) and 2), but for 3) I followed to give | lower precedence to links


#50

@kivikakk: So, if I can understand, you process inlines in the table block twice: Once during block processing to detect it is still a line belonging to the table (i.e. that it contains some valid cell boundary |) and then again when contents of the cells is actually processed. Correct?


#51

@mity: yes, exactly. It’s not super efficient, and tomorrow I’ll probably look at making the twice-processing into once-processing :slight_smile:


#52

Well, I am not afraid about performance here: Tables likely are not that frequent/long. It was more about whether I really understand the strategy. :wink:

@jgm: I think I can ack this approach. It is both doable and reasonable. I’ll try to update MD4C to match this.


#53

Why not process tables like reference definitions at start of paragraphs?

If it matches a table, it is broken away from the paragraph into a separate node. If it does not then the paragraph is not changed and all proceeds as if nothing happened.

There is no need to pre-parse for table blocks separately or parse the inlines twice since at the point where reference definitions are split off from beginning of paragraphs inline processing is performed.

I used this strategy and look for the cell terminating pipe before processing the next inline sequence. If there is a pipe at this point then it is not part of any inline.

I also use the parsed inlines as the content of the cell so that there is no need to re-parse the cells for inlines.

There is double parsing if a table is partially correct but aborts before completion because it does not meet some constraints. In this case all the parsed inlines will be re-parsed as part of the paragraph. This should be significantly less frequent than correct table parsing and should be more acceptable to pay a penalty on failed table parsing than have to pay it on every parsing correct tables.

I have it implemented in flexmark-java tables extension. Implemented parsing options handle GFM table syntax, Php Extras table syntax: multiple header rows, column spans and caption or can be configured to handle anything in between.

The advantage with this approach is that it does not break the blocks first strategy, does not have to double parse and extends on the existing approach for handling reference definitions at the beginning of paragraphs to make it a generic extension strategy.