CommonMark Formal Grammar

imsky · September 3, 2014, 7:45pm

Is a language-independent formal grammar for CMD available? The re2c files in the stmd repo are a good reference point, but a standalone PEG would be really useful.

peterkelly · September 3, 2014, 8:01pm

As far as I’m concerned this is an absolute requirement for any standardisation. I had a brief look at the spec, and it seems to be comprised entirely of examples and informal prose, which is the same problem with Gruber’s original spec.

Multimarkdown already has a very comprehensive grammar defined (see master/parser.leg in the MultiMarkdown repository on github).

This grammar file has C code mixed in with the actual grammar definitions, but that could be removed easily enough so we just have the pure grammar. This can then be used as input to parser generators, to deal with how they like.

The way I view markdown is that although it presents itself as “just plain text with some special notation”, there’s really an underlying schema/object model underlying it, of the same basic nature as HTML, WordProcessingML (part of OOXML), and ODF text documents. For any program that’s converting from Markdown to another format, it’s easiest to first get an in-memory representation of this object model, and then have a serialisation process that writes out the document in the target format.

I’m currently in the process of launching a project to convert between all the major word processing/writing formats, for use in both editors and conversion utilities (see https://github.com/uxproductivity/DocFormats), and my plans for Markdown to date have been based on a parser built from MultiMarkdown’s grammar. To be honest, I’d find it difficult to justify going with an informally-specified variant like this as opposed to MultiMarkdown due the lack of an available formal grammar.

I’m willing to work on producing such a grammar and tools to process it if there’s interest. I already have much of the code to do this for PEG-based grammars (separate part of the DocFormats project above) which I can make available, and perhaps this could form the basis of an alternate implementation (I always think it’s good to have at least two implementations of a given specification before it comes an official standard).

Forgive me if I’ve missed anything in the approach - I’ve only have a very brief look at the spec, and it’s very late here But I think the notion of formal specification is something really important we should be aiming for.

imsky · September 3, 2014, 8:15pm

I’m very interested in researching the uses for the Markdown “schema/object model.” Could you provide a link to your code if possible?

peterkelly · September 3, 2014, 8:39pm

It’s currently a bit of a mess, and I haven’t done anything on it over 9 months. But given the interest in standard markdown, I’ll try to get it into a decent form and made available in the next few days.

Basically the idea is that you provide a PEG-based grammar and an input file, and from that it produces an XML document conforming to a schema which is derived from the PEG grammar. Then you can use traditional techniques for transforming XML to produce another XML-based format, or alternatively a tree-traversal algorithm which outputs text.

I’ll post here again once I have something in a usable form.

imsky · September 3, 2014, 8:46pm

Sounds interesting, looking forward to the release.

jgm · September 3, 2014, 8:50pm

+++ peterkelly [Sep 03 14 20:12 ]:

peterkelly [1]peterkelly
September 3

As far as I’m concerned this is an absolute requirement for any
standardisation. I had a brief look at the spec, and it seems to be
comprised entirely of examples and informal prose, which is the same
problem with Gruber’s original spec.

Multimarkdown already has a very comprehensive grammar defined (see
master/parser.leg in the MultiMarkdown repository on github).

I am the author of both the standard markdown spec and the peg
grammar that was the basis for the one in the MultiMarkdown repository.
(MultiMarkdown began as a modification of my peg-markdown, which was
the first PEG-based markdown implementation.) I also wrote a PEG-based
markdown in lua, lunamark.

You’re welcome to try writing a PEG that fits the standard markdown
spec, but in my experience PEG isn’t a great fit with markdown.
peg-markdown handles constructs like lists and blockquotes by
recursively parsing the result of another parser, which isn’t the
same as having a single PEG grammar. And notice that there’s no
way to write a PEG for inline code. peg-markdown fakes it by
only supporting up to 5 backticks for delimiters, and lunamark
uses nonstandard PEG extensions in the lpeg library.

The parsing algorithm used in the stmd implementation is also far more
efficient than peg-markdown, which in addition to being a bit slow
will go exponential on certain inputs. stmd has been extensively
fuzz tested; you can throw it even garbage and parse time should vary
roughly linearly with the length of input.

The way I view markdown is that although it presents itself as “just
plain text with some special notation”, there’s really an underlying
schema/object model underlying it, of the same basic nature as HTML,
WordProcessingML (part of OOXML), and ODF text documents. For any
program that’s converting from Markdown to another format, it’s easiest
to first get an in-memory representation of this object model, and then
have a serialisation process that writes out the document in the target
format.

Yes, and that is exactly what both the C and javascript implementations
of stmd do.

peterkelly · September 3, 2014, 8:54pm

What have your experiences been like with memoization? I implemented this in a rough form but found that in some cases it was spending more time doing the memoization than the actual parsing - which ended up slowing things down. Do you think memoization is viable in this context?

imsky · September 3, 2014, 9:02pm

Can the re2c specifications be made portable so that the two implementations (and, in the future, additional implementations) could rely on one grammar?

Can the C and JS implementations of stmd output the serialized in-memory representation somehow? In the JS case, this could be a JSON object with tokens and line numbers, perhaps even with nesting (i.e. H2s are nested under H1s).

maradydd · September 3, 2014, 9:11pm

FWIW, if PEG turns out to be a non-starter, it’s worth taking a stab at one of the context-sensitive formalisms. I maintain hammer, a multi-backend parser generator with a parser combinator API, and our packrat backend has a few extras that allow it to express PEGs that are stronger than mildly context-sensitive (e.g., ones that have non-adjacent attribute dependencies, such as a length field that doesn’t immediately precede the value it bounds).

Is ordered choice an issue with markdown? If so, ALL* might be a better algorithm than PEG or packrat. Hammer doesn’t implement ALL* yet, but it’s on my to-do list, and of course ANTLR already does.

maradydd · September 3, 2014, 9:19pm

A funny thing that Joe Rozner, one of the contributors to hammer, noticed when working with our packrat backend is that not memoizing short partial parses – 8 bits or fewer, IIRC – speeds up parsing considerably. He has a Go implementation he’s been experimenting with, although I don’t think he’s released it yet.

imsky · September 3, 2014, 9:21pm

I think the only reason PEG is a non-starter (given @jgm’s experience writing PEG-based Markdown parsers) is performance. ANTLR seems like a good way to try the BNF side of Markdown parsing.

jgm · September 3, 2014, 9:31pm

+++ imsky [Sep 03 14 21:12 ]:

Can the re2c specifications be made portable so that the two
implementations (and, in the future, additional implementations) could
rely on one grammar?

What do you mean, “portable”?

Yes, and that is exactly what both the C and javascript
implementations
of stmd do.
Can the C and JS implementations of stmd output the serialized
in-memory representation somehow? In the JS case, this could be a JSON
object with tokens and line numbers, perhaps even with nesting (i.e.
H2s are nested under H1s).

You can use the --ast flag with the C implementation, and it will
give you a representation of the syntax tree (not in any
machine-readable format, but that could be added).

With the js implementation, you can easily get it to output the parse
tree. Apply the following patch to js/markdown:

--- js/markdown 2014-08-14 11:00:36.000000000 -0700
+++ js/markdownast      2014-09-03 14:25:30.000000000 -0700
@@ -10,6 +10,5 @@
     return console.log(err);
   }
   var parser   = new stmd.DocParser();
-  var renderer = new stmd.HtmlRenderer();
-  console.log(renderer.render(parser.parse(data)));
+  console.log(util.inspect(parser.parse(data), { depth: null }));
 });

imsky · September 3, 2014, 9:40pm

The re2c files, as I understand, are used to create a lexer, but (AFAIK) there’s no way to use them for other backends like Python or Ruby. “Portable” in this sense just means “something that can be used with a multi-backend lexer generator.”

That’s excellent, thanks.

mmun · September 4, 2014, 1:54am

I advocate a state machine style specification like the tokenization and tree construction algorithms described in the HTML5 spec.

jks · September 4, 2014, 3:03pm

PEG grammar or re2c are not suitable for inclusion into the standard specification. They are both too specific and tied to their respective parsing toolchains. Specifying STDM in such way would make it look more like a parsing implementation than a format specification. Let alone difficulties to elegantly express complete markdown using either parsing language.

I think that a human readable format specification, plus examples to cover corner cases, is the most future-proof way to make the standard effective and extensible.

mb21 · September 4, 2014, 3:51pm

Not to discurage you, but have you taken a look at jgm’s (yes, the one on this thread) Pandoc?

maradydd · September 4, 2014, 7:30pm

PEG isn’t a toolchain, it’s a class of grammars, like context-free or mildly context-sensitive. There are many toolchains that can handle PEGs, including Haskell’s parsec (which has been cloned into quite a few languages), Scala’s scala.util.parsing.combinator, Python’s pyparsing, and my own Hammer, which is written in C but provides bindings to many scripting languages.

RFCs that define formats – such as IP datagrams or TCP packets – include a formal grammar defined in ABNF (which is itself currently defined in RFC 5234) so that implementors have an unambiguous reference for the layout of the format. Such grammars are included in an appendix, after the natural-language description of the protocol. People who don’t know how to read BNF aren’t expected to understand them, but they are an essential component of the specs they belong to, and without them, “implementation-dependent” behaviour proliferates and we’re back to why we wanted a Standard Markdown spec in the first place.

The STDM spec is incomplete without a formal grammar. If the formal grammar does not cover the corner cases you describe, then the formal grammar is incorrect and incomplete. But this standard must not ship without one.

sjm · September 4, 2014, 11:55pm

Maradydd speaks sense, she’s an expert in her field and is one of the creators of the language security or langsec fields. Hammer was born out of successful attacks against slight incongruities between parser implementations of the same grammar. I’d go as far as to say one should certainly use hammer. As langsec is so new, many people just are not aware of the implications.

(I’ve never met maradydd, nor accepted any money from her, I’m keen on seeing a finalized, correctly designed and parsed grammar.)

jc404 · September 5, 2014, 2:44am

(post withdrawn by author, will be automatically deleted in 24 hours unless flagged)

elliottslaughter · September 5, 2014, 2:53am

People will probably think I’m crazy for suggesting this, but what about generative grammars?

Basically, instead of trying to parse a given string, you do a randomized walk over the space of possible ASTs, rendering each one until you happen to generate the string you’re looking for. In order for the grammar to be unambiguous, the generator function must never find two ASTs that produce the same output string, and in order for the grammar to be complete, the generator must be able to produce all possible strings from at least one AST.

I know it sounds insane, but we do this already with assembly in STOKE, and are able to generate codes up to hundreds of instructions (small, I know, but much better than prior work).

Now, this may make the two proofs (unambiguity and completeness) intractable (I’m not an expert in parsers), but the advantage of the approach would be that it would at least reflect the structure of the English language spec as it is currently written.