CommonMark Formal Grammar

What have your experiences been like with memoization? I implemented this in a rough form but found that in some cases it was spending more time doing the memoization than the actual parsing - which ended up slowing things down. Do you think memoization is viable in this context?

Can the re2c specifications be made portable so that the two implementations (and, in the future, additional implementations) could rely on one grammar?

Can the C and JS implementations of stmd output the serialized in-memory representation somehow? In the JS case, this could be a JSON object with tokens and line numbers, perhaps even with nesting (i.e. H2s are nested under H1s).

FWIW, if PEG turns out to be a non-starter, it’s worth taking a stab at one of the context-sensitive formalisms. I maintain hammer, a multi-backend parser generator with a parser combinator API, and our packrat backend has a few extras that allow it to express PEGs that are stronger than mildly context-sensitive (e.g., ones that have non-adjacent attribute dependencies, such as a length field that doesn’t immediately precede the value it bounds).

Is ordered choice an issue with Markdown? If so, ALL(*) might be a better algorithm than PEG or packrat. Hammer doesn’t implement ALL(*) yet, but it’s on my to-do list, and of course ANTLR already does.

A funny thing that Joe Rozner, one of the contributors to hammer, noticed when working with our packrat backend is that not memoizing short partial parses – 8 bits or fewer, IIRC – speeds up parsing considerably. He has a Go implementation he’s been experimenting with, although I don’t think he’s released it yet.
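For what it’s worth, the trade-off Joe noticed is easy to reproduce in a toy packrat parser. The sketch below (Python, with an invented MIN_MEMO_LEN threshold and a made-up two-rule grammar; this is not Hammer’s implementation) only caches results that consumed at least a minimum number of characters:

```python
# Toy packrat parser that skips memoization for short partial parses.
# MIN_MEMO_LEN and the grammar are invented for illustration.

MIN_MEMO_LEN = 8  # don't cache results that consumed fewer than 8 chars

class Parser:
    def __init__(self, text):
        self.text = text
        self.memo = {}

    def apply(self, rule, pos):
        key = (rule.__name__, pos)
        if key in self.memo:
            return self.memo[key]
        result = rule(self, pos)
        # Only memoize "long enough" parses; caching tiny ones can cost
        # more in table overhead than simply re-parsing them would.
        if result is not None and result[1] - pos >= MIN_MEMO_LEN:
            self.memo[key] = result
        return result

def digit(p, pos):
    if pos < len(p.text) and p.text[pos].isdigit():
        return (p.text[pos], pos + 1)
    return None

def expr(p, pos):
    # expr <- digit '+' expr / digit   (ordered choice, PEG-style)
    left = p.apply(digit, pos)
    if left is None:
        return None
    val, nxt = left
    if nxt < len(p.text) and p.text[nxt] == '+':
        rest = p.apply(expr, nxt + 1)
        if rest is not None:
            return (val + '+' + rest[0], rest[1])
    return left

print(Parser("1+2+3").apply(expr, 0))  # ('1+2+3', 5)
```

With the threshold at 8, none of the sub-parses of `"1+2+3"` get cached at all, which is exactly the point: for inputs this shaped, the cache buys nothing and only costs lookups.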


I think the only reason PEG is a non-starter (given @jgm’s experience writing PEG-based Markdown parsers) is performance. ANTLR seems like a good way to try the BNF side of Markdown parsing.

+++ imsky [Sep 03 14 21:12 ]:

Can the re2c specifications be made portable so that the two
implementations (and, in the future, additional implementations) could
rely on one grammar?

What do you mean, “portable”?

Yes, and that is exactly what both the C and javascript
implementations of stmd do.

Can the C and JS implementations of stmd output the serialized
in-memory representation somehow? In the JS case, this could be a JSON
object with tokens and line numbers, perhaps even with nesting (i.e.
H2s are nested under H1s).

You can use the --ast flag with the C implementation, and it will
give you a representation of the syntax tree (not in any
machine-readable format, but that could be added).

With the js implementation, you can easily get it to output the parse
tree. Apply the following patch to js/markdown:

--- js/markdown 2014-08-14 11:00:36.000000000 -0700
+++ js/markdownast      2014-09-03 14:25:30.000000000 -0700
@@ -10,6 +10,5 @@
     return console.log(err);
   var parser   = new stmd.DocParser();
-  var renderer = new stmd.HtmlRenderer();
-  console.log(renderer.render(parser.parse(data)));
+  console.log(util.inspect(parser.parse(data), { depth: null }));

The re2c files, as I understand, are used to create a lexer, but (AFAIK) there’s no way to use them for other backends like Python or Ruby. “Portable” in this sense just means “something that can be used with a multi-backend lexer generator.”

That’s excellent, thanks.

I advocate a state-machine-style specification, like the tokenization and tree construction algorithms described in the HTML5 spec.


Neither a PEG grammar nor re2c is suitable for inclusion in the standard specification. They are both too specific and too tied to their respective parsing toolchains. Specifying STMD in such a way would make it look more like a parsing implementation than a format specification, to say nothing of the difficulty of elegantly expressing complete Markdown in either parsing language.

I think that a human readable format specification, plus examples to cover corner cases, is the most future-proof way to make the standard effective and extensible.

Not to discourage you, but have you taken a look at jgm’s (yes, the one on this thread) Pandoc?

PEG isn’t a toolchain, it’s a class of grammars, like context-free or mildly context-sensitive. There are many toolchains that can handle PEGs, including Haskell’s parsec (which has been cloned into quite a few languages), Scala’s scala.util.parsing.combinator, Python’s pyparsing, and my own Hammer, which is written in C but provides bindings to many scripting languages.
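To make the “class of grammars” point concrete, here is a minimal PEG-style combinator sketch in Python. The names lit/seq/choice are invented for illustration and don’t correspond to any particular library’s API. The choice combinator is ordered, which is PEG’s defining feature, and the last two lines show the classic pitfall ordering creates:

```python
# Minimal PEG-style combinators (illustrative; not any specific library).
# Each parser is a function: (text, pos) -> (value, newpos) or None.

def lit(s):
    def parse(text, pos):
        if text.startswith(s, pos):
            return (s, pos + len(s))
        return None
    return parse

def seq(*parsers):
    def parse(text, pos):
        values = []
        for p in parsers:
            r = p(text, pos)
            if r is None:
                return None
            values.append(r[0])
            pos = r[1]
        return (values, pos)
    return parse

def choice(*parsers):
    # PEG's defining feature: ORDERED choice. The first alternative
    # that succeeds wins; later alternatives are never tried.
    def parse(text, pos):
        for p in parsers:
            r = p(text, pos)
            if r is not None:
                return r
        return None
    return parse

strong = seq(lit("**"), lit("hi"), lit("**"))
print(strong("**hi**", 0))                 # (['**', 'hi', '**'], 6)

# Ordering matters: put "*" first and it shadows "**" forever.
print(choice(lit("*"), lit("**"))("**", 0))  # ('*', 1)
print(choice(lit("**"), lit("*"))("**", 0))  # ('**', 2)
```

The same grammar written with the two alternatives swapped recognizes a different language, which is one reason two “PEG-based” Markdown parsers can disagree while both being faithful to their own rules.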

RFCs that define formats – such as IP datagrams or TCP packets – include a formal grammar defined in ABNF (which is itself currently defined in RFC 5234) so that implementors have an unambiguous reference for the layout of the format. Such grammars are included in an appendix, after the natural-language description of the protocol. People who don’t know how to read BNF aren’t expected to understand them, but they are an essential component of the specs they belong to, and without them, “implementation-dependent” behaviour proliferates and we’re back to why we wanted a Standard Markdown spec in the first place.
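For a sense of what such an appendix rule looks like, here is a deliberately simplified sketch of an ATX heading in RFC 5234-style ABNF. This is an illustration only, not the actual CommonMark rule (real ATX headings also allow leading spaces, an optional closing hash sequence, and an empty heading):

```
; simplified sketch, NOT the real CommonMark rule
atx-heading  = 1*6"#" 1*SP heading-text LF
heading-text = *(%x20-7E)        ; printable ASCII, for illustration
```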

The STMD spec is incomplete without a formal grammar. If the formal grammar does not cover the corner cases you describe, then the formal grammar is incorrect and incomplete. But this standard must not ship without one.


Maradydd speaks sense; she’s an expert in her field and one of the creators of the language security (langsec) field. Hammer was born out of successful attacks that exploited slight incongruities between parser implementations of the same grammar. I’d go as far as to say one should certainly use Hammer. As langsec is so new, many people simply are not aware of the implications.

(I’ve never met maradydd, nor accepted any money from her, I’m keen on seeing a finalized, correctly designed and parsed grammar.)


People will probably think I’m crazy for suggesting this, but what about generative grammars?

Basically, instead of trying to parse a given string, you do a randomized walk over the space of possible ASTs, rendering each one until you happen to generate the string you’re looking for. In order for the grammar to be unambiguous, the generator function must never find two ASTs that produce the same output string, and in order for the grammar to be complete, the generator must be able to produce all possible strings from at least one AST.

I know it sounds insane, but we already do this with assembly in STOKE, and are able to generate code sequences of up to hundreds of instructions (small, I know, but much better than prior work).

Now, this may make the two proofs (unambiguity and completeness) intractable (I’m not an expert in parsers), but the advantage of the approach would be that it would at least reflect the structure of the English language spec as it is currently written.
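As a toy illustration of the idea (not STOKE, and with a made-up two-production grammar over one-letter texts), a pure rejection-sampling version of this loop looks like:

```python
import random

# Toy sketch of "parsing by generation": randomly walk a space of tiny
# ASTs, render each one, and stop when the rendering equals the input.
# The grammar (h1/p nodes over one-letter texts) is invented purely to
# illustrate the search loop.

TEXTS = ["a", "b"]

def random_ast(rng):
    return [(rng.choice(["h1", "p"]), rng.choice(TEXTS))
            for _ in range(rng.randint(1, 2))]

def render(ast):
    out = []
    for kind, text in ast:
        out.append(("# " + text) if kind == "h1" else text)
    return "\n\n".join(out)

def generate_until_match(target, seed=0, max_tries=100000):
    rng = random.Random(seed)
    for _ in range(max_tries):
        ast = random_ast(rng)
        if render(ast) == target:
            return ast
    return None  # gave up: target not reachable (or we got unlucky)

print(generate_until_match("# a\n\nb"))  # [('h1', 'a'), ('p', 'b')]
```

On a grammar this small the walk terminates almost immediately; the hard part, as noted above, is that on a realistic grammar the space explodes, so the walk needs guidance (which is where a cost function comes in).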


My point was that if you specify the Markdown grammar using PEG, you will mandate that all implementers use a PEG toolchain (whichever they choose out of many) to build a conforming implementation.

Moreover, PEG-based parser generators produce extremely complex parsers which are very difficult for humans to reason about. Even parsers created using different PEG toolchains might have different corner-case behavior, because a PEG grammar is not universally suitable for expressing Markdown syntax. See jgm’s comments above.

There were peg-markdown and lunamark outputs on Babelmark2 in the past. Both are based on PEG, and even though they were written by the same author (jgm), I saw differences in their corner-case behavior. I cannot prove it now, because peg-markdown apparently is no longer there.

You cannot compare PEG with ABNF. ABNF is much easier to understand, and much easier to implement a parser from.

I’m not interested in ratholing on the relative merits of PEG versus other formalisms until we’ve figured out which class of language STMD actually falls into. jgm’s complaint about “isn’t the same as having a single PEG grammar” is vacuous, because PEGs are closed under composition. I’d have to compare the peg-markdown grammar and the lunamark grammar to determine why they have different corner-case behaviour, but the most likely reason is that they’re not actually implementing the same grammar (the equivalence of two arbitrary PEGs is undecidable, so to be sure that they’re implementing the same grammar, they have to express identical rules) and the next most likely reason is that the parser generators themselves differ in corner-case behaviour (typically having to do with differing opinions on how to handle left-recursion). Both of these are implementation flaws, not flaws in PEG itself. (If you want to argue that the fact that PEG generator implementors can’t agree on how to handle left-recursion is a show-stopper, I’ll accept that.)

In any case, I’ve finished a preliminary pass on a mostly ABNF grammar; I say “mostly” because it’s more like an L-attributed boolean grammar, and RFC5234 doesn’t have notation for attributes or for disjunction. It’s still multi-stage, following the lines -> blocks -> inlines sequence described in the spec. I’m going to do a second pass to make sure I got the precedences correct and sanity-check the rules for lists and blockquotes, and probably do a prototype in hammer, and then I’ll push to my fork.


Proving a grammar unambiguous is in the general case undecidable; Hopcroft, Motwani and Ullman show this by equivalence to the Post correspondence problem in Introduction to Automata Theory, Languages and Computation. But every language that can be parsed with a deterministic pushdown automaton is unambiguous, and IIRC there are some other techniques as well.

I read through your STOKE paper, and I’m a little confused as to how transformation correctness would translate to this use case; is the generator function essentially the grammar itself? (I have a bad habit of thinking of grammars as discrete Markov chains, so I may be confusing myself.) Either way, though, it’s a really clever way of looking at the problem and I wouldn’t mind exploring it further.

I think I’ve gotten this one down to an L-attributed boolean grammar (CFG plus conjunction and negation), and if I understand Okhotin’s paper correctly, it sounds like it’s actually fairly easy to determine whether a boolean grammar is ambiguous or not. (I might well be wrong here, though.)
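For readers unfamiliar with boolean grammars: conjunction lets a rule require that the same substring satisfy two conditions at once. The classic example is aⁿbⁿcⁿ, which is not context-free but is the conjunction of two context-free languages. A recognizer sketch (illustrative Python, using regexes for the easy parts):

```python
import re

# Conjunction in a boolean grammar, sketched as the intersection of two
# context-free checks. Neither conjunct alone captures a^n b^n c^n
# (which is not context-free), but their conjunction does.

def matches_anbn_cstar(s):
    # { a^n b^n c^m } : equal a's and b's, any number of c's
    m = re.fullmatch(r"(a*)(b*)(c*)", s)
    return m is not None and len(m.group(1)) == len(m.group(2))

def matches_astar_bncn(s):
    # { a^m b^n c^n } : any number of a's, equal b's and c's
    m = re.fullmatch(r"(a*)(b*)(c*)", s)
    return m is not None and len(m.group(2)) == len(m.group(3))

def matches_anbncn(s):
    # The boolean-grammar rule  S -> AB & BC  as a conjunction of
    # recognizers: both conditions must hold for the SAME string.
    return matches_anbn_cstar(s) and matches_astar_bncn(s)

print(matches_anbncn("aabbcc"))  # True
print(matches_anbncn("aabbc"))   # False
```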



Single word exclamation, accompanied by a gesture where the hand is swept palm down over the head from front to back with about three inches clearance.

Indicates that the post just read was too sophisticated for the reader and has gone “way over their head”.

Would be really interesting to see the CFG for Markdown.

So admittedly, STOKE was never intended for parsing, but for compiler optimization (e.g. as an alternative to running GCC -O3). But the MCMC algorithm used in STOKE can be driven by basically any cost function, so it seems to me that if you made the cost function something like edit distance, then you could potentially find all ASTs which render to the same string (of which there should hopefully be exactly one).

You’re right though that walking the space of ASTs is not so very different from doing a random walk over a CFG (or other grammar), so perhaps this doesn’t provide any net benefit. Most likely, I have muddled my thinking. I guess my impression was that:

  1. People here seem to be pessimistic about the feasibility of writing a formal grammar
  2. On the other hand, walking the space of ASTs seems not so hard by comparison
  3. And at any rate, the spec itself is phrased constructively (in terms of how to build blocks out of inlines and other blocks, etc.), so such an algorithm would have a convenient correspondence to the spec

I suspect the sticking point would probably be making sure that the transformation rules satisfy all the requirements for MCMC (e.g. ergodicity), which is easy enough with straight-line assembly and not so obvious with an arbitrary AST.
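A rough sketch of what that cost-driven search could look like, with edit distance as the cost and Metropolis-style acceptance. The grammar, mutation moves, and parameters here are all invented for illustration and are far simpler than anything STOKE does:

```python
import math
import random

# Toy MCMC-style search over ASTs using edit distance as the cost.
# The "grammar" (h1/p nodes over short texts) and the mutation moves
# are made up for illustration; this is not STOKE's transformation set.

TEXTS = ["a", "b", "c"]
KINDS = ["h1", "p"]

def render(ast):
    return "\n\n".join(("# " + t) if k == "h1" else t for k, t in ast)

def edit_distance(x, y):
    # Standard Levenshtein DP, one row at a time.
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,            # delete
                           cur[j - 1] + 1,         # insert
                           prev[j - 1] + (cx != cy)))  # substitute
        prev = cur
    return prev[-1]

def mutate(ast, rng):
    ast = list(ast)
    move = rng.choice(["edit", "insert", "delete"])
    if move == "insert":
        ast.insert(rng.randint(0, len(ast)),
                   (rng.choice(KINDS), rng.choice(TEXTS)))
    elif move == "delete" and len(ast) > 1:
        del ast[rng.randrange(len(ast))]
    else:
        i = rng.randrange(len(ast))
        ast[i] = (rng.choice(KINDS), rng.choice(TEXTS))
    return ast

def search(target, steps=20000, beta=2.0, seed=0):
    rng = random.Random(seed)
    ast = [("p", "a")]
    cost = edit_distance(render(ast), target)
    for _ in range(steps):
        if cost == 0:
            break
        cand = mutate(ast, rng)
        c = edit_distance(render(cand), target)
        # Metropolis acceptance: always take improvements; sometimes
        # accept worse candidates to escape local minima.
        if c <= cost or rng.random() < math.exp(-beta * (c - cost)):
            ast, cost = cand, c
    return ast, cost

ast, cost = search("# a\n\nb")
print(cost, render(ast))
```

The ergodicity worry above shows up here concretely: the insert/delete/edit moves have to be able to reach every AST from every other AST, which is easy to argue for a flat list of nodes and much less obvious once the ASTs nest.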

I will go read the Okhotin paper to try to grok the rest of your posts, but I am hopeful that you will be successful.