Vanilla-flavored Markdown as basis for state machine spec

A known concern with the current STMD spec is that it defines Markdown parsing through many plain-English examples but contains no formal grammar. It’s been suggested that a state machine like the one in the HTML5 spec would be more appropriate.

Vanilla-flavored Markdown (vfmd), which is MIT-licensed and whose author has commented on STMD [1], has a fully defined specification [2] in the form of a state machine parser. That may be exactly what implementors are looking for in addition to the examples.

This project has not yet been brought up on the forums, so I thought I would suggest the idea of collaboration in a new topic. I’m not affiliated with the project.

[1] https://news.ycombinator.com/item?id=8267039
[2] http://www.vfmd.org/vfmd-spec/specification/

Don’t get me wrong, I love state machines, but holy crap those regexps make me want to put my eyes out.

Has anyone even tried writing a context-free-grammar version of a markdown spec? I’m starting to get the impression that the answer is no.

I’m guessing that the answer is “Yes, they failed.”

Markdown is not context-free, it has several constructs that are clearly context-sensitive, then you have lists which introduce recursion, and hello! You have arrived at level 0 of the Chomsky hierarchy.

Fine, then a context-sensitive formalism it is. But even context-free grammars can recurse on existing elements, including in lists; it is no problem at all, for instance, to have rules like

foo ::= bar* baz qux
bar ::= '[' foo ']'

That form of recursion is what distinguishes the context-free languages from the regular languages.
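
To make the point concrete, here is a minimal recursive-descent recognizer for the two rules above. The post leaves baz and qux undefined, so this sketch assumes they are the single characters "b" and "q"; those terminals and the function names are purely illustrative.

```python
# Hypothetical terminals: baz and qux are not defined in the post,
# so this sketch treats them as the single characters "b" and "q".
BAZ = "b"
QUX = "q"


def parse_foo(text: str, pos: int = 0) -> int:
    """Match foo ::= bar* baz qux at pos; return the end position or -1."""
    # bar*: keep consuming bracketed groups while one starts here.
    while pos < len(text) and text[pos] == "[":
        pos = parse_bar(text, pos)
        if pos == -1:
            return -1
    if text.startswith(BAZ + QUX, pos):
        return pos + len(BAZ) + len(QUX)
    return -1


def parse_bar(text: str, pos: int) -> int:
    """Match bar ::= '[' foo ']' at pos; return the end position or -1."""
    if not text.startswith("[", pos):
        return -1
    pos = parse_foo(text, pos + 1)  # recursion back into foo
    if pos == -1 or not text.startswith("]", pos):
        return -1
    return pos + 1


def matches(text: str) -> bool:
    """True if the whole string is a foo."""
    return parse_foo(text) == len(text)
```

Under those assumptions, `matches("[[bq]bq][bq]bq")` succeeds while `matches("[bq")` fails. The nesting depth is unbounded, which is what a regular expression alone cannot track.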

I’m working on an ABNF spec (per RFC 5234) in my Copious Free Time and am aiming to have that done by the weekend.

Thanks for bringing up vfmd here.

In the HN conversation you mentioned, jgm said he wanted to see what the substantive differences are between CommonMark and vfmd. Following that, I made a comparison of the CommonMark and vfmd syntaxes, and I think this will be a good place to link it from:

@roop, fantastic, for some reason I just noticed this. It will be very helpful.

(Note: the most recent version of the emph/strong spec restores the symmetry between **foo* bar* and *foo *bar**.)

@maradydd, if you can do it that would be lovely! But I despair. Consider code spans, for example (assuming we want them to be compatible with John Gruber’s syntax description). Can you start by giving a BNF for them? I just don’t see how to do it. Note that the number of backticks you can have as delimiters on each side of the code span depends on the length of the longest contiguous string of backticks in the code you’re quoting.
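
As a rough illustration of why this is awkward for a BNF, here is a small procedural sketch of the counting involved. It is not taken from any spec: the function names are made up, choosing "longest run plus one" is just one safe delimiter length, and stripping all surrounding whitespace is a simplification.

```python
import re

BACKTICK_RUN = re.compile(r"`+")


def longest_backtick_run(code: str) -> int:
    """Length of the longest contiguous string of backticks in code."""
    return max((len(m.group()) for m in BACKTICK_RUN.finditer(code)), default=0)


def wrap_code_span(code: str) -> str:
    """Wrap code in a delimiter one backtick longer than any run inside it."""
    fence = "`" * (longest_backtick_run(code) + 1)
    # Pad with a space when the code starts or ends with a backtick,
    # so its backticks do not merge with the delimiter.
    pad = " " if code.startswith("`") or code.endswith("`") else ""
    return f"{fence}{pad}{code}{pad}{fence}"


def first_code_span(text: str):
    """Return the content of the first code span in text, or None.

    An opening run of n backticks is closed by the next run of exactly
    n backticks; runs of any other length are part of the content.
    """
    runs = list(BACKTICK_RUN.finditer(text))
    for i, opener in enumerate(runs):
        for closer in runs[i + 1:]:
            if len(closer.group()) == len(opener.group()):
                # Simplification: strip all surrounding whitespace.
                return text[opener.end():closer.start()].strip()
    return None
```

The closing delimiter has to match the opening one in length, so a grammar would need one production per delimiter length, whereas the scanner above just compares two counts.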

@roop, a few comments on your nice write-up:

  • I didn’t write the statement that said one of our goals was to make something easy to parse, and it wasn’t in fact one of my major goals in writing the spec. (If we ended up with something that just couldn’t be parsed efficiently, I think we’d need to go back to the drawing board; but readability and predictability were the main things on my mind.)

  • I hadn’t realized that vfmd had a similar rule for HTML blocks; that’s interesting. I agree that, if we stay with this rule, making an exception for pre, script, and style blocks is a good idea.

  • I think it makes sense for tight/loose to be per-list, since lists with some tight and some loose items look bad. Also, although a per-item distinction makes sense in HTML, it makes less sense in other output formats we might want to target, e.g. LaTeX.

  • As I mentioned, the new emph/strong rules restore the symmetry between **foo* bar* and *foo *bar**. We still skip over spans of four or more * or _ characters, but I will experiment with dropping this. One reason for including it is that people sometimes write ________ for a blank in a form etc. But I do think that, if we’re allowing emph within emph, it makes sense to allow strong within strong by symmetry, and to allow multiple levels, all of which suggest relaxing the four-character limitation.

  • I’m not so concerned that CommonMark will yield bad HTML when bad HTML is fed into it.

  • I wouldn’t want two blank lines to end a code block; after all, you might have code with two blank lines in it. Indeed, I’m now tempted to remove the rule that two blank lines break out of a list, because it violates the principle that the meaning of a run of text stays the same when it is put into a list item.

  • As you point out, currently CommonMark does not require a blank line between a paragraph and a following list or blockquote. I do see the arguments for requiring a blank line, but the consideration alluded to above, that the meaning of a text should not change when it’s put into a list item, makes me think that a blank line should not be required before a list. Compare:

    ```
    Foo
    - bar
    ```

    ```
    1. Foo
       - bar
    ```

    Then I'd argue for a parallel treatment of blockquotes. This discussion occurs somewhere else on this forum.
    
  • Alt text should probably be preserved as is, rather than processed as it currently is.

  • Recognizing international email addresses is probably a good idea; I just used the non-normative regex from the HTML 5 spec.
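
For readers who want the per-list tight/loose idea from the bullets above in concrete form, here is a deliberately simplified sketch. The `ListItem` type and its `followed_by_blank` flag are hypothetical, each item is assumed to hold a single line of already-escaped text, and a real implementation would also have to look at blank lines inside items.

```python
from dataclasses import dataclass


@dataclass
class ListItem:
    text: str
    followed_by_blank: bool  # True if a blank line separates it from the next item


def list_is_loose(items):
    """Per-list rule: a blank line between any two items makes the whole list loose."""
    return any(item.followed_by_blank for item in items[:-1])


def render_list(items):
    """Render every item the same way, based on the list-wide decision."""
    loose = list_is_loose(items)
    lines = ["<ul>"]
    for item in items:
        body = f"<p>{item.text}</p>" if loose else item.text
        lines.append(f"<li>{body}</li>")
    lines.append("</ul>")
    return "\n".join(lines)
```

A per-item variant would instead decide the `<p>` wrapping from each item's own flag, which is where mixed tight and loose items in one list would come from.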

@jgm Glad you think my comparison piece could be of help. You might also want to take a look at my other post, in case you haven’t already.

If “ease-of-parsing” is not a goal of your spec, I urge you to reword that statement from the homepage (the one that says: “one of our major goals is to make Markdown easier to parse”) asap, since it’s misleading. (I’m surprised you haven’t corrected it already.)

Nevertheless, I felt that some parts of the CommonMark spec did seem to prioritize ease-of-parsing over readability / intuitiveness (e.g., the suggestion to use &#10; within pre elements).

I would consider it essential to make an exception for those tags. Do note that vfmd goes further and checks whether the pre / script / style tags are actually closed correctly as well.

In my opinion, a Markdown spec needn’t restrict itself to the common-denominator-only feature set. If it makes sense in HTML but doesn’t in LaTeX, then that particular feature can have an effect in the output HTML but not in the output LaTeX.

Even if “lists with some tight and some loose items look bad”, if that’s what the input suggests, that’s what we should interpret it as.

Great. I haven’t studied the updated spec, though.

I don’t understand what the problem is with _____ for use in forms. Wouldn’t the underscores be surrounded by whitespace on both sides and therefore not be interpreted as emphasis anyway?

Are you talking about the stuff under the “Span-level HTML” heading in my comparison piece? If so, I don’t think the inputs *foo <u>bar* baz</u>* and *<p>foo</p>* can be called bad input HTML.

I think it’s quite obvious that the alt text shouldn’t be subject to any processing. I thought that was a bug in your code that seeped into the spec.


I don't have a strong opinion on the rest of the bullet points in your post. They can all have valid arguments either way, and I think it's best to just pick one and run with it.

+++ Roopesh Chander [Nov 03 14 10:13 ]:

Nevertheless, I felt that some parts of the CommonMark spec did seem to prioritize ease-of-parsing over readability / intuitiveness (e.g., the suggestion to use &#10; within pre elements).

The general approach to HTML blocks certainly does give ease of parsing, but it has an independent motivation – making it easy to have Markdown content inside block-level HTML tags, if you want to. I just didn’t think of the idea of treating pre, script, and style specially.

I think it’s quite obvious that the alt text shouldn’t be subject to any processing. I thought that was a bug in your code that seeped into the spec.

I think it’s very natural to parse the alt text as Markdown. After all, ![my *text*](/url) is very much like [my *text*](/url), and in the latter case the contents of the square brackets are parsed as Markdown. So my inclination is to handle this in the HTML renderer, rendering the inlines in the alt attribute without formatting. Note that in pandoc and some other implementations, a paragraph containing just an image gets rendered as a figure with a caption (derived from the alt text). Obviously one might want formatting in the caption. Having the formatting stored in the AST makes this kind of thing possible. But I agree that we shouldn’t be rendering HTML tags inside the alt attribute.
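
Here is a rough sketch of what that could look like: the parsed inlines stay in the AST, and the HTML renderer flattens them to plain text only when it writes the alt attribute. The `Inline` node type and both function names are invented for illustration and are not any implementation's actual API.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Inline:
    kind: str                      # "text", "emph", "strong", or "image"
    text: str = ""                 # literal content for "text" nodes
    children: list = field(default_factory=list)
    url: str = ""                  # destination for "image" nodes


def plain_text(nodes):
    """Concatenate the literal text of the nodes, dropping all formatting."""
    return "".join(
        n.text if n.kind == "text" else plain_text(n.children) for n in nodes
    )


def render_html(nodes):
    """Render inlines to HTML; image descriptions become unformatted alt text."""
    out = []
    for n in nodes:
        if n.kind == "text":
            out.append(escape(n.text))
        elif n.kind == "emph":
            out.append(f"<em>{render_html(n.children)}</em>")
        elif n.kind == "strong":
            out.append(f"<strong>{render_html(n.children)}</strong>")
        elif n.kind == "image":
            alt = escape(plain_text(n.children), quote=True)
            out.append(f'<img src="{escape(n.url, quote=True)}" alt="{alt}" />')
    return "".join(out)
```

With this shape, ![my *text*](/url) would render with alt="my text", while a different renderer (say, one producing figure captions) could still walk the same children and keep the emphasis.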