CommonMark Formal Grammar

My point was, that if you specify the Markdown grammar using PEG, you will mandate all implementers to use PEG toolchain (whichever they choose out of many) to build a conforming implementation.

Moreover, PEG-based parser generators produce extremely complex parsers which are very difficult to reason about for humans. Even parsers created using different PEG toolchains might have a different corner-case behavior because PEG grammar is not universally suitable to express Markdown syntax. See jgm’s comments above.

There was peg-markdown and lunamark output on Babelmark2 in the past. They are both based on PEG and even written by the same author (jgm) and yet I saw differences in their corner-case behavior. I cannot prove it now, because seemingly peg-markdown is not there anymore.

You cannot compare PEG with ABNF. ABNF is much easier to understand or implement a parser based on it.

I’m not interested in ratholing on the relative merits of PEG versus other formalisms until we’ve figured out which class of language STMD actually falls into. jgm’s complaint about “isn’t the same as having a single PEG grammar” is vacuous, because PEGs are closed under composition. I’d have to compare the peg-markdown grammar and the lunamark grammar to determine why they have different corner-case behaviour, but the most likely reason is that they’re not actually implementing the same grammar (the equivalence of two arbitrary PEGs is undecidable, so to be sure that they’re implementing the same grammar, they have to express identical rules) and the next most likely reason is that the parser generators themselves differ in corner-case behaviour (typically having to do with differing opinions on how to handle left-recursion). Both of these are implementation flaws, not flaws in PEG itself. (If you want to argue that the fact that PEG generator implementors can’t agree on how to handle left-recursion is a show-stopper, I’ll accept that.)

In any case, I’ve finished a preliminary pass on a mostly ABNF grammar; I say “mostly” because it’s more like an L-attributed boolean grammar, and RFC5234 doesn’t have notation for attributes or for disjunction. It’s still multi-stage, following the lines → blocks → inlines sequence described in the spec. I’m going to do a second pass to make sure I got the precedences correct and sanity-check the rules for lists and blockquotes, and probably do a prototype in hammer, and then I’ll push to my fork.

1 Like

Proving a grammar unambiguous is in the general case undecidable; Hopcroft, Motwani and Ullman show this by equivalence to the Post correspondence problem in Introduction to Automata Theory, Languages and Computation. But every language that can be parsed with a deterministic pushdown automaton is unambiguous, and IIRC there are some other techniques as well.

I read through your STOKE paper, and I’m a little confused as to how transformation correctness would translate to this use case; is the generator function essentially the grammar itself? (I have a bad habit of thinking of grammars as discrete Markov chains, so I may be confusing myself.) Either way, though, it’s a really clever way of looking at the problem and I wouldn’t mind exploring it further.

I think I’ve gotten this one down to an L-attributed boolean grammar (CFG plus conjunction and negation), and if I understand Okhotin’s paper correctly, it sounds like it’s actually fairly easy to determine whether a boolean grammar is ambiguous or not. (I might well be wrong here, though.)

2 Likes

whoosh

Single word exclamation, accompanied by a gesture where the hand is swept palm down over the head from front to back with about three inches clearance.

Indicates that the post just read was too sophisticated for the reader and has gone “way over their head”.

Would be really interesting to see the CFG for Markdown.

So admittedly, STOKE was never intended for parsing, but for compiler optimization (e.g. as an alternative to running GCC -O3). But the MCMC algorithm used in STOKE can be driven by basically any cost function, so it seems to me that if you made the cost function something like edit distance, then you could potentially find all ASTs which render to the same string (of which there should hopefully be exactly one).

You’re right though that walking the space of ASTs is not so very different from doing a random walk over a CFG (or other grammar), so perhaps this doesn’t provide any net benefit. Most likely, I have muddled my thinking. I guess my impression was that:

  1. People here seem to be pessimistic about the feasibility of writing a formal grammar
  2. On the other hand, walking the space of ASTs seems not so hard by comparison
  3. And at any rate, the spec itself is phrased constructively (in terms of how to build blocks out of inlines and other blocks, etc.), so such an algorithm would have a convenient correspondence to the spec

I suspect the sticking point would probably be making sure that the transformation rules satisfy all the requirements for MCMC (e.g. ergodicity), which is easy enough with straight-line assembly and not so obvious with an arbitrary AST.

I will go read the Okhotin paper to try grok the rest of your posts, but I am hopeful that you will be successful.

I should have noted in my last post that MCMC itself is just an optimization; as long as you can enumerate all possible ASTs and render each one you can still write the algorithm… it might just take the rest of eternity to get around to the string you happen to be interested in parsing. I might give this a go to see exactly how hard it is and whether e.g. inline syntax makes this difficult.

Well, my thinking was, it’s certainly possible for a random walk (esp. w/MCMC, which I have played with a little and like a lot) to outperform a traditional parser for sufficiently complicated grammars. (“How complicated?” is an interesting question and I wonder if anyone’s investigated it.) You’re right about ergodicity, though; I’m going to have to chew on that one. I hope you do give it a try and I’m interested to see your results.

Have any of you considered using Marpa?

2 Likes

Looks very interesting, though a JS implementation would be even more useful.

I don’t see a JS implementation happening in the short term, and we’re not short of practical applications as it is.

What’s the status here? Does anyone currently work on a formal grammar? Has anyone tried and given up? If so, where was the main problem? Is there a more recent thread on this?

@tin-pot mentioned (in Enumerated lists without explicit number, ATX headings with explicit number) some unpublished material related to a grammar. I’d like to hear more about that as well.

To me it looks as if the “match number of backticks resp. tildes, no matter how many there are” specification of a fenced code block would be particularly difficult to express. Looking at the Wikipedia category “Formal Languages” for likely candidates, I found indexed grammars to be the most promising. One could use the index stack to count various kinds of things, most notably the number of characters in a code fence, or the number of spaces required to line up a list item. Counting two things in parallel (i.e. spaces and fence characters) would already be tricky, I think, so the translation would be far from obvious.

So far I know very little about indexed grammars. In particular I know of no tools making use of these, and I know of no methods to decide whether a given index grammar is ambiguous. So this may need some more research. If someone found a way to make standard context free grammars or something close to that work for CommonMark, that would of course be great, but I have my doubts.

In a comment on Vanilla-flavored Markdown as basis for state machine spec, @riking conjectured that CommonMark would have to be Chomsky 0. If that were indeed the case, indexed grammars would be bound to fail as well, since they are only Chomsky 1. However, I don’t yet see the argument why this has to be Chomsky 0.

For the sake of completeness: Indian parallel grammars sounded like a good candidate for CommonMark as well. The parallel rewriting of non-terminals can be used to insert matching fences, or matching indentation at the beginning of lines. I can’t see a way to forbid longer runs of backticks inside a fenced code block using this formalism, though. I know of no approach which would combine such parallelism with conjunctions.

@gagern I’d love to have a formal grammar; I just don’t know how to do it.

A while back I wrote a couple of PEG grammars for Markdown:
peg-markdown (in C) and lunamark (in lua).

I had to work around various rough edges (the backtick code spans were one of them, and I think I just supported up through 5 backticks). I’m not happy with either of these efforts.

I’m not familiar with some of the more esoteric grammars you mention in your post, so if you want to investigate this, it would be welcome. I think it’s possible to use the Pumping Lemma to prove that the grammar for backtick code spans is not context-free.

1 Like

Only because you specifically asked about me mentioning some formal grammar work: This was (long ago and not related nor updated to CommonMark) based using (ISO 14977 standard) EBNF, augmented with a simple parameterization facility, so that one can specify “rule schemes”, so to say.

If you’re interested, drop me a line, to “mh, commercial at, tin-pot, full stop, net”. [ I can’t comment or post here. ]

I just started doing research in conjunction with my own attempt to write a CommonMark PEG grammar Lua, and ran across this attempt at an EBNF for CommonMark. At this point I have no idea about the completeness or correctness, but it’s clearly an interesting foray. It seems to have cropped up a month or two after the last activity in this thread.

I just started doing research in conjunction with my own
attempt to write a CommonMark PEG grammar
Lua
, and ran across
this attempt at an EBNF for
CommonMark
.
At this point I have no idea about the completeness or
correctness, but it’s clearly an interesting foray. It seems to
have cropped up a month or two after the last activity in this
thread.

I can see at a glance that this is not even close to being right.
Just look at the rule for emphasis and compare to the commonmark
spec.

As someone who has written two PEG grammars for markdown
implementations (jgm/peg-markdown and jgm/lunamark on GitHub),
I’m skeptical that this can be done for commonmark.

1 Like

I’m resurrecting a zombie thread for what might be a silly idea: For the backtick problem, how about we consider any consecutive backticks as a single token? For example, ````` (five backticks) would be considered as a single terminal symbol, say, `5. And we will have rule schemes like anything trapped inside two `ns will be considered as inline code blocks for any integer n >= 1. I haven’t checked but this might be the same idea from what @tin-pod mentions as EBNF with parametrization. I wonder what could be other rough corners for formalizing CommonMark.

2 Likes