CommonMark Formal Grammar


PEG grammar or re2c are not suitable for inclusion into the standard specification. They are both too specific and tied to their respective parsing toolchains. Specifying STDM in such way would make it look more like a parsing implementation than a format specification. Let alone difficulties to elegantly express complete markdown using either parsing language.

I think that a human readable format specification, plus examples to cover corner cases, is the most future-proof way to make the standard effective and extensible.


Not to discurage you, but have you taken a look at jgm’s (yes, the one on this thread) Pandoc?


PEG isn’t a toolchain, it’s a class of grammars, like context-free or mildly context-sensitive. There are many toolchains that can handle PEGs, including Haskell’s parsec (which has been cloned into quite a few languages), Scala’s scala.util.parsing.combinator, Python’s pyparsing, and my own Hammer, which is written in C but provides bindings to many scripting languages.

RFCs that define formats – such as IP datagrams or TCP packets – include a formal grammar defined in ABNF (which is itself currently defined in RFC 5234) so that implementors have an unambiguous reference for the layout of the format. Such grammars are included in an appendix, after the natural-language description of the protocol. People who don’t know how to read BNF aren’t expected to understand them, but they are an essential component of the specs they belong to, and without them, “implementation-dependent” behaviour proliferates and we’re back to why we wanted a Standard Markdown spec in the first place.

The STDM spec is incomplete without a formal grammar. If the formal grammar does not cover the corner cases you describe, then the formal grammar is incorrect and incomplete. But this standard must not ship without one.


Maradydd speaks sense, she’s an expert in her field and is one of the creators of the language security or langsec fields. Hammer was born out of successful attacks against slight incongruities between parser implementations of the same grammar. I’d go as far as to say one should certainly use hammer. As langsec is so new, many people just are not aware of the implications.

(I’ve never met maradydd, nor accepted any money from her, I’m keen on seeing a finalized, correctly designed and parsed grammar.)


(post withdrawn by author, will be automatically deleted in 24 hours unless flagged)


People will probably think I’m crazy for suggesting this, but what about generative grammars?

Basically, instead of trying to parse a given string, you do a randomized walk over the space of possible ASTs, rendering each one until you happen to generate the string you’re looking for. In order for the grammar to be unambiguous, the generator function must never find two ASTs that produce the same output string, and in order for the grammar to be complete, the generator must be able to produce all possible strings from at least one AST.

I know it sounds insane, but we do this already with assembly in STOKE, and are able to generate codes up to hundreds of instructions (small, I know, but much better than prior work).

Now, this may make the two proofs (unambiguity and completeness) intractable (I’m not an expert in parsers), but the advantage of the approach would be that it would at least reflect the structure of the English language spec as it is currently written.


My point was, that if you specify the Markdown grammar using PEG, you will mandate all implementers to use PEG toolchain (whichever they choose out of many) to build a conforming implementation.

Moreover, PEG-based parser generators produce extremely complex parsers which are very difficult to reason about for humans. Even parsers created using different PEG toolchains might have a different corner-case behavior because PEG grammar is not universally suitable to express Markdown syntax. See jgm’s comments above.

There was peg-markdown and lunamark output on Babelmark2 in the past. They are both based on PEG and even written by the same author (jgm) and yet I saw differences in their corner-case behavior. I cannot prove it now, because seemingly peg-markdown is not there anymore.

You cannot compare PEG with ABNF. ABNF is much easier to understand or implement a parser based on it.


I’m not interested in ratholing on the relative merits of PEG versus other formalisms until we’ve figured out which class of language STMD actually falls into. jgm’s complaint about “isn’t the same as having a single PEG grammar” is vacuous, because PEGs are closed under composition. I’d have to compare the peg-markdown grammar and the lunamark grammar to determine why they have different corner-case behaviour, but the most likely reason is that they’re not actually implementing the same grammar (the equivalence of two arbitrary PEGs is undecidable, so to be sure that they’re implementing the same grammar, they have to express identical rules) and the next most likely reason is that the parser generators themselves differ in corner-case behaviour (typically having to do with differing opinions on how to handle left-recursion). Both of these are implementation flaws, not flaws in PEG itself. (If you want to argue that the fact that PEG generator implementors can’t agree on how to handle left-recursion is a show-stopper, I’ll accept that.)

In any case, I’ve finished a preliminary pass on a mostly ABNF grammar; I say “mostly” because it’s more like an L-attributed boolean grammar, and RFC5234 doesn’t have notation for attributes or for disjunction. It’s still multi-stage, following the lines -> blocks -> inlines sequence described in the spec. I’m going to do a second pass to make sure I got the precedences correct and sanity-check the rules for lists and blockquotes, and probably do a prototype in hammer, and then I’ll push to my fork.


Proving a grammar unambiguous is in the general case undecidable; Hopcroft, Motwani and Ullman show this by equivalence to the Post correspondence problem in Introduction to Automata Theory, Languages and Computation. But every language that can be parsed with a deterministic pushdown automaton is unambiguous, and IIRC there are some other techniques as well.

I read through your STOKE paper, and I’m a little confused as to how transformation correctness would translate to this use case; is the generator function essentially the grammar itself? (I have a bad habit of thinking of grammars as discrete Markov chains, so I may be confusing myself.) Either way, though, it’s a really clever way of looking at the problem and I wouldn’t mind exploring it further.

I think I’ve gotten this one down to an L-attributed boolean grammar (CFG plus conjunction and negation), and if I understand Okhotin’s paper correctly, it sounds like it’s actually fairly easy to determine whether a boolean grammar is ambiguous or not. (I might well be wrong here, though.)



Single word exclamation, accompanied by a gesture where the hand is swept palm down over the head from front to back with about three inches clearance.

Indicates that the post just read was too sophisticated for the reader and has gone “way over their head”.


Would be really interesting to see the CFG for Markdown.


So admittedly, STOKE was never intended for parsing, but for compiler optimization (e.g. as an alternative to running GCC -O3). But the MCMC algorithm used in STOKE can be driven by basically any cost function, so it seems to me that if you made the cost function something like edit distance, then you could potentially find all ASTs which render to the same string (of which there should hopefully be exactly one).

You’re right though that walking the space of ASTs is not so very different from doing a random walk over a CFG (or other grammar), so perhaps this doesn’t provide any net benefit. Most likely, I have muddled my thinking. I guess my impression was that:

  1. People here seem to be pessimistic about the feasibility of writing a formal grammar
  2. On the other hand, walking the space of ASTs seems not so hard by comparison
  3. And at any rate, the spec itself is phrased constructively (in terms of how to build blocks out of inlines and other blocks, etc.), so such an algorithm would have a convenient correspondence to the spec

I suspect the sticking point would probably be making sure that the transformation rules satisfy all the requirements for MCMC (e.g. ergodicity), which is easy enough with straight-line assembly and not so obvious with an arbitrary AST.

I will go read the Okhotin paper to try grok the rest of your posts, but I am hopeful that you will be successful.


I should have noted in my last post that MCMC itself is just an optimization; as long as you can enumerate all possible ASTs and render each one you can still write the algorithm… it might just take the rest of eternity to get around to the string you happen to be interested in parsing. I might give this a go to see exactly how hard it is and whether e.g. inline syntax makes this difficult.


Well, my thinking was, it’s certainly possible for a random walk (esp. w/MCMC, which I have played with a little and like a lot) to outperform a traditional parser for sufficiently complicated grammars. (“How complicated?” is an interesting question and I wonder if anyone’s investigated it.) You’re right about ergodicity, though; I’m going to have to chew on that one. I hope you do give it a try and I’m interested to see your results.


Have any of you considered using Marpa?


Looks very interesting, though a JS implementation would be even more useful.


I don’t see a JS implementation happening in the short term, and we’re not short of practical applications as it is.


What’s the status here? Does anyone currently work on a formal grammar? Has anyone tried and given up? If so, where was the main problem? Is there a more recent thread on this?

@tin-pot mentioned (in Enumerated lists without explicit number, ATX headings with explicit number) some unpublished material related to a grammar. I’d like to hear more about that as well.

To me it looks as if the “match number of backticks resp. tildes, no matter how many there are” specification of a fenced code block would be particularly difficult to express. Looking at the Wikipedia category “Formal Languages” for likely candidates, I found indexed grammars to be the most promising. One could use the index stack to count various kinds of things, most notably the number of characters in a code fence, or the number of spaces required to line up a list item. Counting two things in parallel (i.e. spaces and fence characters) would already be tricky, I think, so the translation would be far from obvious.

So far I know very little about indexed grammars. In particular I know of no tools making use of these, and I know of no methods to decide whether a given index grammar is ambiguous. So this may need some more research. If someone found a way to make standard context free grammars or something close to that work for CommonMark, that would of course be great, but I have my doubts.

In a comment on Vanilla-flavored Markdown as basis for state machine spec, @riking conjectured that CommonMark would have to be Chomsky 0. If that were indeed the case, indexed grammars would be bound to fail as well, since they are only Chomsky 1. However, I don’t yet see the argument why this has to be Chomsky 0.

For the sake of completeness: Indian parallel grammars sounded like a good candidate for CommonMark as well. The parallel rewriting of non-terminals can be used to insert matching fences, or matching indentation at the beginning of lines. I can’t see a way to forbid longer runs of backticks inside a fenced code block using this formalism, though. I know of no approach which would combine such parallelism with conjunctions.


@gagern I’d love to have a formal grammar; I just don’t know how to do it.

A while back I wrote a couple of PEG grammars for Markdown:
peg-markdown (in C) and lunamark (in lua).

I had to work around various rough edges (the backtick code spans were one of them, and I think I just supported up through 5 backticks). I’m not happy with either of these efforts.

I’m not familiar with some of the more esoteric grammars you mention in your post, so if you want to investigate this, it would be welcome. I think it’s possible to use the Pumping Lemma to prove that the grammar for backtick code spans is not context-free.


Only because you specifically asked about me mentioning some formal grammar work: This was (long ago and not related nor updated to CommonMark) based using (ISO 14977 standard) EBNF, augmented with a simple parameterization facility, so that one can specify “rule schemes”, so to say.

If you’re interested, drop me a line, to “mh, commercial at, tin-pot, full stop, net”. [ I can’t comment or post here. ]