Consider rewriting the spec in a non-declarative style


#1

The CommonMark spec uses a declarative style to describe what the syntax for each functionality is, with additional details and corner cases explained by examples. This does not appear to be a good design for a Markdown spec.

Ambiguity

Many of the ambiguities in parsing Markdown concern how interleaved syntax constructs should be handled. The spec is silent on many of these cases, which leaves it ambiguous.

For example:

  1. If a line like ~~~ (or [ref]: /url) is followed by a setext underline, is that a header, or is that the start of a fenced code block (or ref definition)? Looking at the setext header section, it appears that it should be interpreted as a setext header. Looking at the fenced code block section (or the link ref defn. section), it appears that it should be interpreted as a fenced code block start (or a link ref defn.).

  2. Similarly, if a list is immediately followed by a setext underline, the spec is ambiguous about whether the last line of the list should be treated as a setext header or as a list item. If we look at the list section, it appears that it should be interpreted as a list item, and if we look at the setext header section, it appears that it should be interpreted as a setext header.

     * One
     * Two
     ---
    
  3. Extending (2) above, if the setext underline is indented to match the list indentation, and if the indentation is less than 3 spaces, it still matches the description of a top-level setext header as given in the setext header section.

     * One
     * Two
       ---
    
  4. Going by what is said in the raw HTML tag and the code span sections, it appears that an HTML tag enclosed in backticks shouldn’t be interpreted as a code span, which is obviously wrong.

  5. As a pedantic example, it is unclear how _foo *bar_ baz* should be parsed. Based on the descriptions in the spec, it could be parsed either emphasizing foo *bar or emphasizing bar_ baz.

The spec does resolve some of these interleaved constructs in examples, and the above examples can of course be similarly mentioned as examples in the spec, but there will always be potential corner cases like this, so we can never be sure that the spec is totally unambiguous.

So, just like John Gruber’s original Markdown spec, the CommonMark spec has ambiguities, and it looks unlikely to become totally unambiguous as long as it sticks to the declarative style.
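To make the contrast concrete, here is a minimal sketch of how an algorithm-style rule could settle example 1 above. The function name `classify_pair`, the simplified patterns, and above all the precedence order are assumptions made for illustration; this is not what the CommonMark spec or any implementation prescribes. The point is only that an ordered procedure forces one reading to be chosen.

```python
import re

# Hypothetical, simplified patterns (illustration only).
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})")
SETEXT_UNDERLINE = re.compile(r"^ {0,3}(=+|-+) *$")


def classify_pair(first: str, second: str) -> str:
    """Classify `first`, given the line that follows it.

    Assumed precedence: a fence opener wins over a setext header.
    Swapping the two checks yields the other reading; either way,
    the order of the checks removes the ambiguity.
    """
    if FENCE.match(first):
        return "fenced_code_start"
    if first.strip() and SETEXT_UNDERLINE.match(second):
        return "setext_header"
    return "paragraph"


print(classify_pair("~~~", "---"))     # fenced_code_start
print(classify_pair("Title", "---"))   # setext_header
print(classify_pair("Title", "text"))  # paragraph
```

A declarative description of each construct in isolation cannot express this ordering; an algorithmic description cannot avoid expressing it.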

Not the best fit for parser developers

A fallout of this example-based resolution of corner cases is that information on parsing a particular construct is not confined to one part of the spec but is spread throughout it. To understand how to handle a particular construct, a parser developer cannot restrict herself to reading just that section of the spec; she may have to look for examples involving that construct, which might be placed anywhere in the spec. Even if she spots all those examples, it is not always apparent what strategy should be followed so that the resulting parser’s behaviour is consistent with all the provided examples.

As things stand now, it appears that a parser developer who wants to write a compliant Markdown parser is better off using one of the CommonMark implementations as a reference point rather than the specification document. Note that this was precisely the situation earlier as well (many Markdown parsers are adaptations of an implementation to another programming language).

Not the best fit for document writers

It appears that one of the reasons for the declarative style of the spec (as opposed to an algorithm-based/state-machine style) was that the declarative style was “closer to the way a human reader or writer would think, as opposed to a computer”.

While the declarative style itself is a good fit for document writers, the multitude of examples used to resolve the corner cases can make it unnecessarily complicated for that audience. Making a readable specification for document-writers and making an unambiguous specification for parser-developers are opposing objectives. The document writer asks “What should I do to get a heading?”, while a parser developer asks “How should I interpret a line starting with a hash?”. So it’s better to have a different document explaining the syntax for document writers.
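As a rough illustration of the parser developer's side of that question, the sketch below answers "How should I interpret a line starting with a hash?" as a procedure. The name `interpret_hash_line` is hypothetical, and the rule used (up to 3 leading spaces, 1 to 6 hashes, then whitespace or end of line) is a simplified assumption that ignores closing hash sequences and other details.

```python
import re

# Assumed, simplified rule: up to 3 spaces of indent, 1-6 '#' characters,
# then either end of line or whitespace followed by the heading text.
ATX_HEADING = re.compile(r"^ {0,3}(#{1,6})(?:[ \t]+(.*?))?[ \t]*$")


def interpret_hash_line(line: str):
    """Return a tuple describing how the line is interpreted."""
    m = ATX_HEADING.match(line)
    if m is None:
        # Not a heading under the assumed rule; fall back to paragraph text.
        return ("paragraph", line.strip())
    level = len(m.group(1))
    text = (m.group(2) or "").strip()
    return ("heading", level, text)


print(interpret_hash_line("# Title"))        # ('heading', 1, 'Title')
print(interpret_hash_line("#hashtag"))       # ('paragraph', '#hashtag')
print(interpret_hash_line("####### seven"))  # ('paragraph', '####### seven')
```

A document writer only needs the first case; a parser developer needs all three, which is why the two audiences are served poorly by a single description.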

Summary

The declarative style of the spec:

  • makes it ambiguous (even though one of the goals of the project is an unambiguous spec)
  • makes it hard for a parser-developer to use the spec as a reference (even though one of the goals of the project is to make Markdown easier to parse)

I suggest that:

  • there be separate documents targeting (a) document writers, and (b) parser developers
  • the spec meant for parser developers should not be in the declarative style (algorithm-based/state-machine style is good)

The above concerns have been previously discussed at https://news.ycombinator.com/item?id=8267039 and http://roopc.net/posts/2014/eval-stmd/.


#2

Why couldn’t you simply collapse the sections of the spec that contain many detailed examples of corner cases? Hide them behind a “click to show 40 more examples” or similar?

I understand what you’re saying, but am not convinced they need to be two totally different documents. Think of it as footnotes that lead to lots of additional detail perhaps?


#3

I have to disagree with the topic title. The declarative style of the test cases in the spec doesn’t make its goals unachievable or make the specification ambiguous. But it (obviously) doesn’t guarantee success either. But IMHO the spec is a great start.

But I agree it’s not the best format for end-users who want to write in CommonMark. End-users aren’t a specification’s target audience; it’s primarily for implementers and testers. It’s testing that the declarations serve, and they do so quite well, which is part of what makes this a good spec.

As to the ambiguities you list, have you tried testing the cases you cite in the reference implementation at http://jgm.github.io/stmd/js/? You may have identified some test-case gaps that could be valuable to include in future versions of the specification.

P.S. I tried several of your examples in the reference implementation, and the output seems fine to me. But I don’t understand your point #4.


#4

Just to put things in perspective, I’d like to quote jgm from our HN conversation:

We considered writing the spec in the state machine vein, but I advocated for the
declarative style. It may be worth rethinking that and rewriting it, essentially spelling
out the parsing algorithm. As you suggest, a parallel document could be created for writers.

(Source: https://news.ycombinator.com/item?id=8274869)

@Burt_Harris: I’m talking about ambiguities in the spec, and you suggest that I see what the js implementation does, and state that the output of the js implementation is fine.

Are you saying that the best way to figure out what a parser should do on an input would be to see what the js implementation does, as opposed to referring to the spec? If yes, you’re just reinforcing my point.


#6

Thanks for updating the title of the topic.

I’m not sure everyone understands the reason the spec is written this way. I tried to explain in reply to a different posting, normative-informative. Please have a look at that and see if it helps. Also please look at improving the introduction to see if that might address your concerns.


#7

One major frustration I have with the spec is that many of the links that you’d think would point to unambiguous descriptions of the linked-to element simply point back to the first time the name of the element is used. For example, the Laziness rule in section 5.2 refers to “paragraph continuation text” without any explanation of what “paragraph continuation text” even is, and all links to “paragraph continuation text” simply point back to the Laziness rule. Does a line in a blockquote count as paragraph continuation text if it has no ‘>’ and starts with a space, for instance? There’s no way to know because it’s never defined, and having to trial-and-error it in the dingus is simply unacceptable for a specification.

I’m at the Software Language Engineering conference this week in Vasteras, Sweden, and brought up CommonMark as a spec in need of attention from the field. Several luminaries in language construction (e.g., Sebastian Erdweg) are now very focused on trying to formalise this spec so that correct implementations can be written, but the spec’s vagueness makes it extremely resistant to formalisation.

At the same time, we’ve also identified some shortcomings in the language design community’s tools, e.g. SDF3; remedying them will make it easier to create an executable specification. But let’s work both ends toward the middle here, please.


#8

Thanks @maradydd. I agree there are still some ambiguities in the spec, and paragraph continuation text is one I noticed as well.

Markup languages have a long history of defying classification within language engineering formalism. SGML, for example, was very feature rich, but very hard to categorize in formal language notation, let alone write a parser for.

Those who started work on CommonMark have chosen a different approach: it’s not really declarative or based on ABNF, but is instead more test driven. The reason for this is a desire to conform to a de facto standard that’s not well formalized, and I for one think this is a very sensible decision.

So it seems to me that if it becomes possible to write an executable SDF3 that conforms to the test cases in a final version of the CommonMark spec, then that would be an implementation, not a specification. What we need to do in discussing the CommonMark spec is incrementally improve it, rather than wish it to be something it isn’t.