How to move ahead with extending CommonMark

chrisalley · November 26, 2020, 4:57am

Since there was some interest in making Tables the first CommonMark extension, there’s the question of how to proceed with extending CommonMark. @jgm writes:

It might be worth starting with a feature that is less complex than tables as a proof of concept for how an extension spec could work.

There’s the question of authorship and ownership for these extensions, e.g. would extensions be an official part of the project, hosted on the CommonMark website and GitHub repository, or created as a separate project on someone else’s repository? There’s a middle option as well, which is what the ReactJS community did:

The reactjs GitHub organization (henceforth “reactjs” or “the org”) was created to ensure critical open source projects in the React community receive long-term support and maintenance.

So in React’s case they have important libraries in the ecosystem created by separate authors but hosted under the same organisation so that maintenance is looked after by the group.

There has also been some discussion about adding features to the spec versus creating a new extension spec. For comparison, you have the HTML spec which is a single specification, and CSS snapshots which are comprised of multiple CSS module specs at a certain point in time. Given how large the existing CommonMark spec is, a modular approach to extensions that can be maintained independently might be the way to go.

Crissov · November 26, 2020, 8:53pm

There have been similar threads before:

jgm · November 26, 2020, 9:18pm

Let me just elaborate a bit on why this can be tricky. (My own practical experience is with my Haskell commonmark library, which supports a number of extensions: https://github.com/jgm/commonmark-hs/).

GitHub style table cells are separated by |. If we go for the policy that block structure can be discerned independently of inline structure – this is embodied in the current spec – then we have to worry about how to deal with | characters that are not supposed to be cell separators. In regular text they can just be escaped as \|. But what about | characters that appear in inline code, e.g. `a|b`? These shouldn’t be cell separators. But we can’t tell they’re in inline code without parsing inlines, and we’re trying to discern block structure before we do that.

The solution adopted by Mathieu Duponchelle, who devised the table extension that was taken over by GitHub, was to require | characters inside inline code to be backslash escaped, when they occur inside table cells. This is an exception to the normal behavior of inline code backticks, which normally interpret backslash literally.

For better or worse, this is embodied now in GitHub’s table format (https://github.github.com/gfm/#tables-extension-). But that means that this extension implies an exception to the core spec (and gfm’s own spec does not update the section on backslash escapes or inline code to reflect that). It still says: “Backslash escapes do not work in code blocks, code spans, autolinks, or raw HTML.”
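To make the exception concrete, here is a minimal Python sketch of a block-first row splitter in the GFM style. It is an illustration only, not code from cmark-gfm or commonmark-hs: the name `split_row_gfm_style` is made up, and the lookbehind regex deliberately ignores corner cases such as a doubled backslash in front of a pipe.

```python
import re


def split_row_gfm_style(line: str) -> list[str]:
    """Split a pipe-table row on unescaped pipes without parsing inlines."""
    body = line.strip()
    # Drop one optional leading pipe and one optional unescaped trailing pipe.
    if body.startswith("|"):
        body = body[1:]
    if body.endswith("|") and not body.endswith("\\|"):
        body = body[:-1]
    # A pipe separates cells unless a backslash comes right before it.
    # Code spans are NOT consulted: block structure is decided first.
    cells = re.split(r"(?<!\\)\|", body)
    # The escaping backslash is removed everywhere, even inside backticks.
    return [cell.strip().replace("\\|", "|") for cell in cells]


# An unescaped pipe inside a code span still splits the row:
print(split_row_gfm_style("| `a|b` | c |"))    # → ['`a', 'b`', 'c']
# Escaping it keeps one cell, and the backslash disappears from the code span:
print(split_row_gfm_style(r"| `a\|b` | c |"))  # → ['`a|b`', 'c']
```

The second call shows the conflict with the core rule quoted above: the backslash is consumed even though it sits inside a code span.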

OK, say we decide to live with this exception (which we could codify properly if we added tables to the spec). But then someone comes along and adds an extension for LaTeX math (say, between $ characters). We then get the same problem with pipe characters inside LaTeX math, e.g. \{ x | x < 3 \}. Someone who is using this in a table may find that it unexpectedly creates a new table cell. So we need to say that | characters in LaTeX math need to be backslash escaped, and that the backslash will be removed in parsing. To specify this properly, we need the independent extensions to talk about each other.

Bottom line is that extensions can have a lot of unexpected ramifications both for the core spec and for other extensions. Developing a composable set of extensions is really difficult. And tables (at least the common pipe table format) create special difficulties.

codinghorror · November 27, 2020, 5:00am

I think we’re stuck with the | escaping exception on the table one for historical reasons, though it might be worth reaching out to Mathieu to see what he thinks about this? I guess it depends how willing we are to break backwards compatibility for people that used the | character inside a table, and how bad the breakage would be?

We can consider it another “quirk” of the “original” unspecified Markdown and indicate that we will strongly avoid this for future extensions, and no one should use it as anything other than a historical artifact – certainly never as an example of things future extensions can do.

ohAitch · December 1, 2020, 12:54am

Given I still regularly encounter github READMEs / docs broken by the “# headers must have a space” change, I would err on the side of pessimism.

zamfofex · December 3, 2020, 11:37pm

Personally, I never really liked how the syntax for tables in GFM only makes sense if you use a monospaced font, especially because GitHub’s input fields (e.g. “new issue”, “comment”, etc.) do not use a monospaced font.

Maybe I’m being too hopeful, but I wish a new syntax could be coined.

chrisalley · December 5, 2020, 11:13pm

This approach seems like a pragmatic way forward. If we accept that the existing implementations aren’t perfectly uniform but are widely used enough for there to be value in strongly specifying them, that goal could be prioritised over uniformity. The lessons learned during specifying Markdown and its various commonly used extensions could be used when designing a successor language, should applications wish to migrate to one in the future.

codinghorror · December 5, 2020, 11:31pm

Let’s see… Mathieu Duponchelle is here on Twitter and has a website at https://www.centricular.com/ ?

I went ahead and pinged him via the contact form on the website and pointed him here, to this topic, to see if he has any feedback.

cben · December 6, 2020, 4:56pm

I think that idea currently works because all block structure is encoded at start of lines: indentation, list bullets, >, blank lines. Will this be a problem for any attempt to introduce block structure in the middle of a line?

Possibly silly questions: In what way do we want tables to be block structure? Could we only consider table start/end to be block structure, and cell boundaries to be inline structure?

Well, putting code spans and \| problem aside, it makes typographic sense to think of each cell as having a separate inline structure. For example:

| table | head |
|-------|------|
| A*B   | C*D  |

GFM and almost all table implementations treat these as unmatched asterisks, not markup.
a couple (maruku, s9e/TextFormatter) make B and/or D italic, but still treat C as a “fresh start of independent cell”.
nobody makes B C italic across cells. Good! That would make little sense as AST and would not fit HTML at all…
But multimarkdown and cebe/gfm have an interesting alternative: a single cell “AB | CD” where the inner | is NOT a cell separator, just regular text.

=> I guess this is what it means to treat cell boundaries as inline structure.
I suspect it’s a bit more error-prone than parsing each cell separately, and previews will flicker more during editing… But at least it’s a consistent position!

Also, what about escaped \| outside backticks resulting in a single cell with textual “|”?
Well, backslashes can inhibit block AND inline constructs in markdown, so it’s consistent with both positions. And it’s important to have a way to spell “|” inside a cell (other than ugly | or |).

Now let’s talk code spans. I’d think that if we want:

| I`J   | K`L  |

to mean a single cell with a “IJ | KL” content, we better treat “AB | CD” similarly.

Unfortunately, the reality is more fragmented: https://babelmark.github.io/?text=|+table+|+head+| |-------|------| |+A*B+++|+C*D++| |+E*F++\|+G*H++| |+I`J+++|+K`L++| |+M`N++\|+O`P++|

github/cmark and a few others are consistent in first parsing cell boundaries, then treating A*B and I`J as unterminated asterisk and backtick.
markdown-it and a few others do A*B but a single cell with J | K code span.
maruku does the opposite! But its table support is weird in other ways, and apparently it doesn’t allow escaping | by any way — neither \ nor code span nor even \ inside code span
multimarkdown consistently treats all 4 combinations as a single cell. But nobody else does.
There is more variation about \| inside code span becoming | vs \| in the output
I’ll post more thoughts about this soon.

mity · December 6, 2020, 6:52pm

My counter-argument is as follows:

The CommonMark specification is a specification of CommonMark, not specification of HTML. Unlike in HTML, in GFM-like tables, the cell separators actually IMHO behave more like inline marks, not block marks. They don’t allow to specify any multi-line contents, do they?

Also, I believe we don’t really need to do any inline analysis for the pipes in most lines to determine whether to start a table or not: The 2nd table line (i.e. underline of the table header) is imho specific enough to be used as an indicator of a table. And for that line you don’t really need full inline analysis either, you can specify it similarly as a Setext underline is specified.

I.e., when the parser encounters it (and it follows a normal paragraph; either of a single line or any number of lines; depending on whether we want to allow tables to interrupt paragraphs), the parser would just change the preceding line interpretation to a table header, again similarly as we already do for Setext headers. The number of --- sections delimited by pipes in the header underline then specifies count of columns in the table.

And the table then continues until a blank line (or an enclosing container block ends).

It’s imho no problem if some lines (including the header) provide a different count of columns (there may even be no cell separator at all): We simply ignore the extra cells, and implicitly add virtual empty cells to those lines with too few cells.
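The steps described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not MD4C’s actual code: every function name is hypothetical, the optional colons in the underline are the GFM alignment markers (not mentioned in this post), and cells are split naively on every pipe, which sidesteps the code-span question entirely.

```python
import re

# One underline cell: dashes, optionally flanked by GFM alignment colons.
DELIM_CELL = re.compile(r"^\s*:?-+:?\s*$")


def strip_outer_pipes(line):
    """Remove one optional leading and one optional trailing pipe."""
    body = line.strip()
    if body.startswith("|"):
        body = body[1:]
    if body.endswith("|"):
        body = body[:-1]
    return body


def column_count(line):
    """Return the column count if `line` is a header underline, else None."""
    if "|" not in line:
        return None  # e.g. a Setext underline or thematic break
    cells = strip_outer_pipes(line).split("|")
    if all(DELIM_CELL.match(cell) for cell in cells):
        return len(cells)
    return None


def normalize(cells, ncols):
    """Ignore extra cells and pad short rows with empty cells."""
    return (cells + [""] * ncols)[:ncols]


def parse_table(lines):
    """Reinterpret lines[0] as a header when lines[1] is a header underline."""
    if len(lines) < 2:
        return None
    ncols = column_count(lines[1])
    if ncols is None:
        return None
    rows = []
    for line in [lines[0]] + lines[2:]:
        if not line.strip():
            break  # a blank line ends the table
        cells = [cell.strip() for cell in strip_outer_pipes(line).split("|")]
        rows.append(normalize(cells, ncols))
    return rows
```

Only the underline decides whether a table starts and how many columns it has; no other line needs to be inspected for that decision.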

And last but not least: Feel free to experiment with such an approach. MD4C works exactly this way. AFAIK, there are two differences from the cmark-gfm’s behavior:

GFM requires the 1st line and the 2nd line to provide the same number of cells. (It does not require the same for non-header lines). MD4C on the other hand does not, exactly to avoid the inline analysis while still determining whether it is a table block.
GFM requires the pipes inside the code spans to be escaped. MD4C does not, MD4C still sees a code span contents purely verbatim. I consider this a good thing, given my arguments above, and because I consider it a very important feature for code spans in general. So, in MD4C, the | is simply an inline mark with lower precedence than the code spans.
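As a rough sketch of what “lower precedence than the code spans” could mean in practice, the Python below scans a row left to right, lets a backtick run claim everything up to a matching run of the same length, and only then treats the remaining pipes as separators. It is written from the description in this post, not from MD4C’s source, and `split_cells` and `find_closing_run` are hypothetical names.

```python
def find_closing_run(text, start, length):
    """Return the index of the next backtick run of exactly `length`, or -1."""
    i = start
    while i < len(text):
        if text[i] == "`":
            j = i
            while j < len(text) and text[j] == "`":
                j += 1
            if j - i == length:
                return i
            i = j
        else:
            i += 1
    return -1


def split_cells(row):
    """Split a table row on pipes, with code spans taking precedence."""
    row = row.strip()
    cells, current = [], []
    i, n = 0, len(row)
    while i < n:
        ch = row[i]
        if ch == "\\" and i + 1 < n:
            current.append(row[i:i + 2])  # escaped char is never a separator
            i += 2
        elif ch == "`":
            j = i
            while j < n and row[j] == "`":
                j += 1
            close = find_closing_run(row, j, j - i)
            if close == -1:
                current.append(row[i:j])  # unmatched run: literal backticks
                i = j
            else:
                end = close + (j - i)
                current.append(row[i:end])  # whole code span, kept verbatim
                i = end
        elif ch == "|":
            cells.append("".join(current).strip())
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    cells.append("".join(current).strip())
    # Leading and trailing pipes produce empty edge cells; drop them.
    if cells and cells[0] == "":
        cells = cells[1:]
    if cells and cells[-1] == "":
        cells = cells[:-1]
    return cells
```

Under this scheme a row such as "| `a|b` | c |" gives two cells without any escaping, and the I`J / K`L row from the earlier post becomes a single cell because the two backticks pair up across the pipe.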

jgm · December 6, 2020, 7:07pm

Beni Cherniavsky-Paskin via CommonMark Discussion
noreply@talk.commonmark.org writes:

jgm:

If we go for the policy that block structure can be discerned independently of inline structure – this is embodied in the current spec – then we have to worry about how to deal with | characters that are not supposed to be cell separators.

I think that idea currently works because all block structure is encoded at start of lines: indentation, list bullets, >, blank lines. Will this be a problem for any attempt to introduce block structure in the middle of a line?

Yes, I believe so.

Possibly silly questions: In what way do we want tables to be block structure? Could we only consider table start/end to be block structure, and cell boundaries to be inline structure?

Normally tables can’t occur inside paragraphs, so yes, they are conceptually block structure. (In addition, one would hope eventually for a table syntax that allows block-level elements inside cells: this is not uncommon in real tables.)

Well, putting code spans and \| problem aside, it makes typographic sense to think of each cell as having a separate inline structure. For example:
| table | head |
|-------|------|
| A*B   | C*D  |
GFM and almost all table implementations treat these as unmatched asterisks, not markup.

Yes, this follows from our general policy of discerning block structure first and only then inline structure. (And this is the policy that makes pipes in code spans problematic here.)

MathieuDuponchelle · December 7, 2020, 1:23pm

Hey @codinghorror, re. table syntax specifically here’s the relevant discussion: https://talk.commonmark.org/t/parsing-strategy-for-tables/2027 , I came up with an interpretation of the spec that at least put my own mind at ease re backslash-escaping in https://talk.commonmark.org/t/parsing-strategy-for-tables/2027/7 , this thread is pretty old and I haven’t actively thought about the issue since then so if you have a more specific question please ask away

Re (composable) extensions, it’s a thorny subject. One interesting question is whether one wants to solve the problem of composing extensions from multiple sources, as the parsing process requires knowledge of whether a given node type in the AST can be contained by / contain another node type.

In any case, guaranteeing that any extension is compatible with any other one is not feasible, as two extensions may want to interpret the same character for two different purposes. All the extension system could strive for is detecting such conflicts at runtime, my extension system doesn’t implement that, but as syntax extensions are expected to register special characters it might be implementable.
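A minimal sketch of what such runtime detection might look like, assuming each extension declares the characters it wants to claim. Nothing here corresponds to an existing API: `ExtensionRegistry`, `ExtensionConflict` and the extension names are invented for illustration.

```python
class ExtensionConflict(Exception):
    """Raised when two extensions claim the same special character."""


class ExtensionRegistry:
    """Hypothetical registry that records who owns each special character."""

    def __init__(self):
        self._owners = {}  # special character -> name of owning extension

    def register(self, name, special_chars):
        # Collect every character that some other extension already owns.
        clashes = {c: self._owners[c] for c in special_chars if c in self._owners}
        if clashes:
            details = ", ".join(f"{c!r} (owned by {o})" for c, o in clashes.items())
            raise ExtensionConflict(f"{name} conflicts on {details}")
        for c in special_chars:
            self._owners[c] = name


registry = ExtensionRegistry()
registry.register("tables", "|")
registry.register("math", "$")
try:
    registry.register("other-pipes", "|")
except ExtensionConflict as err:
    print(err)
```

A check like this only catches direct clashes over a character. It would not flag the interaction described earlier in the thread, where tables and math claim different characters yet a pipe inside math still splits a cell.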

The more pragmatic approach, which is in effect what github uses my extension system for, is to consider that extensions can only be used as part of a “distribution”, ie you have the core implementation and a finite set of syntax extensions, which know about each other at compile-time.

jgm · December 7, 2020, 6:14pm

My counter-argument is as follows:

The CommonMark specification is a specification of CommonMark, not specification of HTML. Unlike in HTML, in GFM-like tables, the cell separators actually IMHO behave more like inline marks, not block marks. They don’t allow to specify any multi-line contents, do they?

Currently not. But it would be desirable to support multiline and block-level content in table cells in the future, and there have been some proposals about how this might be done.

Also, I believe we don’t really need to do any inline analysis for the pipes in most lines to determine whether to start a table or not: The 2nd table line (i.e. underline of the table header) is imho specific enough to be used as an indicator of a table. And for that line you don’t really need full inline analysis either, you can specify it similarly as a Setext underline is specified.

It’s not just a question of recognizing when you have a table. It’s also a question of splitting the cells in each row. For that you need to be able to distinguish pipes that serve as separators from pipes in other inline contexts (code spans, math, etc.).

But I do agree that it would be possible to treat | as an inline element called a “cell separator” that would have a context-sensitive meaning: in a paragraph it would simply render as a literal |, and in a table row it would split cells. This approach would require making tables a part of the core spec (which is not a problem really). This is certainly an approach worth considering.
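One way to picture this, as a sketch only: tokenize the line once, emitting a separate token for each unprotected pipe, and let the enclosing block decide what that token means. The tokenizer below is a deliberate simplification (single-backtick code spans only, no math), and `Token`, `tokenize`, `render_in_paragraph` and `split_in_table_row` are hypothetical names, not part of any spec or library.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    kind: str  # "code", "escape", "sep" or "text"
    text: str


# Alternatives are tried in order, so a code span wins over the pipes inside it.
TOKEN_RE = re.compile(
    r"(?P<code>`[^`]*`)|(?P<escape>\\.)|(?P<sep>\|)|(?P<text>[^`\\|]+|.)"
)


def tokenize(line):
    """Turn a line into inline tokens; unprotected pipes become 'sep' tokens."""
    return [Token(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(line)]


def render_in_paragraph(tokens):
    """In a paragraph, a cell separator is just a literal pipe."""
    return "".join(t.text for t in tokens)


def split_in_table_row(tokens):
    """In a table row, a cell separator ends the current cell."""
    cells, current = [], []
    for t in tokens:
        if t.kind == "sep":
            cells.append("".join(current).strip())
            current = []
        else:
            current.append(t.text)
    cells.append("".join(current).strip())
    return cells


tokens = tokenize("`a|b` | c")
print(render_in_paragraph(tokens))  # → `a|b` | c
print(split_in_table_row(tokens))   # → ['`a|b`', 'c']
```

The same token stream serves both contexts, which is why this approach would tie the separator into the core inline grammar rather than leaving it to a standalone extension.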