Specify when (if ever) parsers should give up

Hoylen · January 3, 2015, 4:28am

All inputs should be parsed

It would be really good if this stance were explicitly stated at the beginning of the specification. That helps set one’s frame of mind when reading it.

Currently, the only mention of errors is in Section 1.2 where it says, “to make matters worse, because nothing in Markdown counts as a ‘syntax error’, the divergence often isn’t discovered right away.” That could mislead the reader into thinking that it would be useful to have syntax errors, and then into incorrectly assuming that CommonMark introduces them.

So essentially there is no such thing as invalid CommonMark. Any sequence of characters is valid CommonMark; and any conforming parser must successfully consume it. The results might not be what the document author had intended, but it won’t be “wrong” as far as CommonMark syntax is concerned. Same for output: any program that generates any sequence of characters generates valid CommonMark.

There are inputs that raise errors, but those errors occur outside of CommonMark parsing. For example, malformed UTF-8 encodings are outside the scope of CommonMark, since CommonMark operates on Unicode characters and not bytes. Perhaps the presence of U+0000 can be pushed outside the scope of CommonMark? Anko could define the behaviour of his “bytes to characters” converter to flag U+0000 as invalid input, before it is passed to the CommonMark parser.
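To make that layering concrete, here is a rough Python sketch of such a “bytes to characters” step. The names `InvalidInput` and `bytes_to_characters` are made up for illustration and are not part of CommonMark or of any existing parser; rejecting U+0000 here is just the option suggested above, not something the spec requires.

```python
class InvalidInput(ValueError):
    """Raised by the converter, before any CommonMark parsing happens."""


def bytes_to_characters(data: bytes) -> str:
    """Turn raw bytes into the Unicode characters a parser would consume.

    Hypothetical pre-parsing step: every error raised here is an
    encoding or input error, not a CommonMark syntax error.
    """
    try:
        # bytes.decode is strict by default and rejects malformed UTF-8.
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise InvalidInput(f"malformed UTF-8 at byte offset {exc.start}") from exc

    # Assumed policy: treat U+0000 as invalid input instead of
    # leaving it for the parser to strip or replace.
    nul_position = text.find("\x00")
    if nul_position != -1:
        raise InvalidInput(f"U+0000 at character offset {nul_position}")

    return text
```

Whatever string comes out of a step like this can then be handed to the parser, which never needs to give up on it.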

P.S. What does “replace the Unicode character” mean: replace it with what? To avoid confusion, how about making “must strip” the only way to handle U+0000?