Parsing strategy for tables?

Why not work on this in the reference implementation instead of reinventing the wheel though? C should be portable enough not to go for vendor locked-in languages instead?

Good question! :slight_smile: First, my main language of development for the past 8 years has been C#. I worked already in C/C++ for years before that (and java, and several others), and I dislike a lot too many things in these languages to enjoy using them anymore, moreover when I can develop something much more efficiently in a different language, with an acceptable 20-40% performance degradation (at the same feature level). Also, C# has been evolving to some great directions in the past 2 years, as It is not really no longer vendor locked-in, all the runtime and compilers are MIT on github, and it is now running on Linux and MacOSX.

I had also a look at CommonMark.NET but as the implementation was a port of the C implem, it was not suitable to the kind of extensibility I was looking for. Also, I was not really happy about some design decisions in their implem and the work done in their new implem branch was not going into the kind of direction/simplicity I’m seeking. And well, I have been through many cases in practice in the past that it is often not possible to make “radical” changes to an existing project you don’t “own”. Sometimes, the wheels are so different that, e.g, square vs circle, and things will not move the same way! :smiley:

I also think that it has many benefits for the CommonMark initiative that implems are not coming from the same implem seed

I also think that it has many benefits for the CommonMark initiative that implems are not coming from the same implem seed

I actually think that is detrimental, the only reimplementation which I perceive as having any value is the javascript one, as browsers will not let you execute native code (at least yet), and I don’t know whether transpiling the C implementation to javascript would be a viable solution.

As for the other implementations, I simply see them as wasted energy and useless fragmentation when all they do is reimplement the same parsing strategy in a different language. libcmark’s upstream is willing to make these “radical” changes you’re interested in, it’s of course difficult to get things upstream because there’s a requirement of quality, but getting through that admittedly frustrating process will mean better code for everyone in the end.

Having numerous implementations may be less efficient, but having a CommonMark ecosystem (rather than a single implementation) will make the spec more robust in the long run. For example, as each implementer has to understand the spec before implementing it, any issues with the spec will be seen by more eyes.

But conversely, any issues with the implementation(s) will be seen by less eyes.

This is true. In cases of inconsistencies in other implementations, I think that is a good reason to improve the spec tests (which all CommonMark implementations should pass).

To give you a glimpse, If I wanted to implement this in C, It would be a major rewrite of the implem (and would be pretty similar to my C# implem), not an incremental evolution and I would do it very differently from how it is done today. Even in your branch inlines.c, the switch case is hardcoded in the old-school way in this method, even if the try_extensions is added as an escape, which is bttw far from being performance friendly, as you need to go through all extensions everytime a character is not handled by static cases to find a good candidate while all of this should be cached in advanced and be performed by a simple table lookup.

In my code, everything, including standard markdown parsing is done through plugins, I can disable code backsticks if I want, an extension could decide to change the opening character of for parsing heading text and replace # by @ for example…etc. The main parsing method for inlines is just a loop of 100 lines, and everything else is self contained in one feature per file (instead of having a big blob file like inlines.c). So yes, that justifies what I talked about “radical” changes.

Also, as @chrisalley noted about implems strengthening the specs, it is typically what happened while I implemented CommonMark from scratch: See for example issue #395

Even in your branch inlines.c, the switch case is hardcoded in the old-school way in this method, even if the try_extensions is added as an escape, which is bttw far from being performance friendly, as you need to go through all extensions everytime a character is not handled by static cases to find a good candidate while all of this should be cached in advanced and be performed by a simple table lookup.

That’s a very trivial remark. The check is only done for special characters, which makes it mostly harmless, and making this use a table would take around 5 minutes if profiling showed it was indeed draining performance.

So yes, that justifies what I talked about “radical” changes.

Not really, that simply shows you did this in your parser, which you develop in a closed-source manner, making it difficult to compare it factually and benchmark against the current cmark implementation.

I will release it as soon as the API is stable, naming of the project is “secured”, another related project is finished, and the proper website and documentation are done… It will be published on github and announced on this forum. It will come also along accompanied benchmark compare to other .NET solutions and to cmark for raw performance (for now, at least 20-40% slower, but not feature wise equivalent). So, stay tuned…

Cool :slight_smile: Care to share the list of extensions you currently have implemented, and detail the issues you might have had if any?

So far, I have implemented the following extensions:

  • abbreviations
  • auto identifiers and auto link (similar to pandoc)
  • css bootstrap (to output some classes specific to bootstrap for some elements)
  • custom containers (mostly block ::: and inline ::)
  • definition lists
  • Emojy/Emoticons
  • Emphasis extra (strikethrough, subscript, superscript, inserted, marked)
  • Figures (as described here)
  • Footnotes
  • Special/Attached attributes (allows to attach css attributes to the current block or the previous inlines with the syntax {...}), works with all blocks and all inlines, including headings, fencedblocks
  • Softline as Hardline
  • Lettered list (a. b., A. B., i, ii. iv, …etc.)
  • Mathematics (block/inline escape $$...$$, and $...$)
  • Medias (renders differently image with url that have a mime/type that is a video or music, or links to youtube/vimeo)
  • Smartypants
  • Pipe tables
  • Grid tables (slightly extended version of pandoc)

No real issues. Maybe attached attributes are a bit specials, as they need to have hooks for some other block parsers (like fenced code blocks, to parse the attributes before parsing the info string, or for headings before removing the optional trailings ###).

In terms of performance, the side effect of adding proper extensibility for all these extensions (including the ability to remove/modify standard CommonMark parsing) has slow down the whole by around 20% (in C#), which is quite acceptable.

2 Likes

I don’t see how tables can be properly implemented if they cannot do pipe parsing with inlines taken into account.

I implemented a table extension in my parser, flexmark-java (https://github.com/vsch/flexmark-java), using the paragraph pre-processing strategy like reference definitions. So tables have to be the first element of a paragraph. Which is the case for all processors that have this extension, that I know of. I split the line on pipe boundaries along with inline processing for the table and save the parsed inlines for the table cells so that they don’t need to be re-parsed.

Without doing this you get all kinds of issues with pipes embedded in inline code, typographical quotes, etc and makes table usage difficult and non-standard compared to already existing markdown table extensions. My implementation is intuitive, works great with all inline markup and does not interfere with block parsing strategy, which I love for its speed, ease of maintenance and ease of extending.

The extension implements column spans using consecutive | at the end of a cell. One extra pipe for every extra column span. I think this is Php Markdown Extra or MultiMarkdown syntax. I got the syntax from pegdown, which I used to use for my markdown plugin for JetBrains IDEs before writing my own parser.

I found in my implementation of extensions that there is no proper way of handling some of them without extending the core parser and/or core inline parser, which is exactly what I did with the original commonmark-java. I removed all the special cases and exposed them as extension API usable by extensions.

Special case paragraph reference preprocessing became a ParagraphPreProcessor extension interface that worked well for implementing the table extension.

* flexmark-java is a rewrite of commonmark-java for a lossless source based AST with source tracking for all elements and parts of elements.

2 Likes

I have recently implemented GitHub-style tables for MD4C (as an extension which has to be turned explicitly on), and it is implemented at a block level, not inline level.

Details what is or is not parsed as a table by MD4C is more or less described in tables.txt which serves a little bit as a preliminary documentation as well as a test suite. (It’s the same format as CommonMark’s spec.txt.)

As of MD4C, the basic approach for parsing is based on this:

  1. Table cannot interrupt a paragraph.
  2. Table must have table header underline (specific mixture of -, : | characters and some optional whitespace, no other char is allowed; see the file tables.txt).
  3. If such underline line is encountered and we have only one preceding line in the block currently being built, we check whether the preceding line contains (unescaped) |.
  4. If yes we call it a table. (The preceding line that contains the head part of the table/)
  5. Then we continue adding into the table any subsequent lines as long as they contain at least one (unescaped) |. First line without the | ends the table block.

EDIT: I believe the presence of the table head underline line (which cannot contain any backtick) is strong enough indicator we can interpret it all as a table. And given the table cell is actually block too (leaf one), then the pipe on each line should have higher priority then code span, to keep the uniformity with the rule “blocks first, inlines then”.

Please no. Consider that if you allow to recurse from block analysis via inline analysis back to block analysis, you may easily open a can of worms with new class of pathologic input with horrible computational complexity.

There is no need to break the block/inline model which in my opinion is gold for Markdown parsing, if you have the option of extracting tables or other elements from the start of paragraphs.

At that point all the blocks have been parsed and what remains is paragraph pre-processing before processing it for inlines.

This has the best of both worlds. The block/inline parsing separation and access to inlines. I tried other approaches to parsing tables and nothing is as bullet because without having inlines parsed there are always possibilities of parsing a table block only to realize it does not meet constraints once you parse the inlines.

In my approach, I parse the table and if it does not meet constraints, I simply abort and leave the paragraph the way it was. If constraints are met, the table prefix part of the paragraph is chopped off into its own block node and what remains of the paragraph, if anything, is processed for inlines.

I do not want to break block over inline priority.

However, how do you all propose to handle cases where you have a code span containing a pipe character inside a table cell?

The way we have row parsing implemented currently at GitHub is as follows:

  • Parse inlines in the input line into a transient paragraph container.
  • Split the paragraph into table cells whenever we encounter a pipe (|) in a top-level text node.

I like to think that this doesn’t actually break the block/inline priority order — because we post-process inlines only, the table cells (blocks) are essentially one level up from inlines, and there is no block parsed within an inline. It’s a bit of a semantic argument, admittedly. Alternatively, view table cells as inlines within a table row block, and then it’s fairly okay.

I’d love to get comments from others on @kivikakk’s strategy. This is perhaps the most fundamental issue in figuring out how to spec out pipe tables.

If I understand this right, it means that | cell breaks get lower priority than any other inline grouping constructs. So, for example, in the following we’ll just have one table cell rather than two:

| *a | b* |

EDIT: I should also note that @kivikakk’s strategy departs from the original strategy from Matthieu Duponchelle’s patch, which split on | characters before doing any inline parsing, excepting only backslash-escaped pipes. (I believe backslash-escaped pipes were also transformed into regular pipes, even in code spans, when they occurred in table cells.) @kivikakk reports that the original approach broke too many existing tables in people’s repositories.

That’s correct; the pipe isn’t in a text node that’s a direct child of the table cell [1], so we don’t split there. Mathieu’s strategy first splits, and so you get two cells with text *a and b*. Which one is “better” feels more like a matter of preference than correctness; I think you could argue for either [2], and we’ve gone with the other way to minimize breakage as we transitioned to CommonMark.

[1] actually, it’s the transient paragraph node at this point.
[2] my argument for splitting at top level pipes only is that it means you can transport an inline code span from outside a table into one without needing to additionally escape anything; likewise a [link](…) with a stray pipe in the URL, link text or title — not making users manually escape pipes in these instances feels like the best user experience.

For markdig, the strategy followed is that | has a higher priority over emphasis inline elements but lower priority to code backsticks or links. The parsing of the table is done when parsing the inlines of a paragraph block (so that when match occurs, it will transform the ParagraphBlock into a Table)

If you look at this test on babelmark3:

| c | d |
| - | - |
| *a | b* |
| `e | f` |
| [g | h](http://a.com) |

the combination of resuts are still quite different across the different markdown processors.

But with a closer look:

  1. For | *a | b* | most of the markdown parsers treat | as higher precedence than emphasis
  2. For backsticks most parsers treat them with higher precedence over |
  3. For | [g | h](http://a.com) | most parsers treat | with higher precedence than links

So in markdig, I followed 1) and 2), but for 3) I followed to give | lower precedence to links

1 Like