Specify when (if ever) parsers should give up

If CommonMark intends to parse all inputs even if they look really bad, we should clearly document that. If not, that too. I can’t find any part of the current 0.13 spec that takes a definite stance on this.


Actually, I’m not even sure what the definite stance is.

The current spec requires

For security reasons, a conforming parser must strip or replace the Unicode character U+0000.

To me (as someone who writes parsers for programming languages), presence of a null character in input indicates disaster, but I do understand that a parser accepting input generated by humans needs to be more lenient.
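
For illustration, here is a minimal sketch of the “replace” option, substituting U+FFFD (the replacement character) for each NUL before the text reaches a parser. The function name and the choice of U+FFFD are my own assumptions; the spec text quoted above only says “strip or replace”.

```typescript
// Minimal sketch (not from the spec): replace every U+0000 with
// U+FFFD before handing the text to a CommonMark parser.
// The name `sanitizeInput` and the choice of U+FFFD are assumptions.
function sanitizeInput(source: string): string {
  return source.replace(/\u0000/g, "\uFFFD");
}

console.log(sanitizeInput("abc\u0000def")); // -> "abc\uFFFDdef"
```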

Does CommonMark aim to parse any input?

All inputs should be parsed. And libcmark/cmark is designed to parse even really pathological input in linear time.



All inputs should be parsed

It would be really good if this stance were explicitly stated at the beginning of the specification. That helps set one’s frame of mind when reading it.

Currently, the only mention of errors is in Section 1.2, where it says, “to make matters worse, because nothing in Markdown counts as a ‘syntax error’, the divergence often isn’t discovered right away.” That could mislead readers into thinking that syntax errors would be useful, and then into incorrectly assuming that CommonMark introduces them.

So essentially there is no such thing as invalid CommonMark. Any sequence of characters is valid CommonMark, and any conforming parser must successfully consume it. The results might not be what the document author intended, but they won’t be “wrong” as far as CommonMark syntax is concerned. The same goes for output: any program that generates any sequence of characters generates valid CommonMark.

There are inputs that raise errors, but those errors occur outside of CommonMark parsing. For example, malformed UTF-8 encodings are outside the scope of CommonMark, since CommonMark operates on Unicode characters rather than bytes. Perhaps the presence of U+0000 can also be pushed outside the scope of CommonMark? Anko could define the behaviour of his “bytes to characters” converter to flag U+0000 as invalid input before it is passed to the CommonMark parser.
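
As a rough sketch of that idea (the names and error handling here are mine, not anything from the spec), the converter could decode UTF-8 strictly and reject U+0000 before a CommonMark parser ever sees the text:

```typescript
// Hypothetical "bytes to characters" front end: decode UTF-8 strictly
// and reject U+0000 before invoking any CommonMark parser.
// `decodeForCommonMark` is an illustrative name, not a real API.
function decodeForCommonMark(bytes: Uint8Array): string {
  // { fatal: true } makes TextDecoder throw on malformed UTF-8.
  const text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  if (text.includes("\u0000")) {
    throw new Error("Input rejected: contains U+0000");
  }
  return text; // now safe to pass to a conforming parser
}
```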

P.S. What does “replace the Unicode character” mean: replace it with what? To avoid confusion, how about making “must strip” the only way to handle U+0000?


@jgm @Hoylen Thanks for your inputs! (No pun intended.)

It seems we have an obvious conclusion to “should parsers give up”: they shouldn’t—all inputs are valid. So I’ve created a pull request to clarify the spec.


As for null (U+0000) characters:

That seems to be the current situation. I’m OK with that. “The parser always succeeds” is a really strong guarantee. I wouldn’t want to dull it by making it conditional on whether the input includes null characters.

I wondered that too. But since it’s a separate issue, let’s take that to a new topic.

For closure: This was clarified in commit 7817cd7. :sun_with_face:

Is it acceptable for a parser to enforce limits on the input (maximum input length, maximum nesting depth, etc.) if some constructs are impossible to parse in linear time? And if so, what should the behavior be when such inputs are encountered?

I do not agree with “all”. There are special cases where the input is intentionally malformed with very deep nesting at the block and inline levels.

Currently we limit nesting to 20 nested blocks plus 20 nested inlines (tuneable via options), and cut the tail of the output when the limit is hit. This overflow can never happen in a normal case. I see no reason to complicate the code and increase CPU load just to accommodate people who try to break the parser intentionally.
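
As a rough sketch of the capping idea (this is illustrative TypeScript, not markdown-it’s actual code; the option name and node shape are assumptions):

```typescript
// Toy inline parser that stops nesting once a configurable limit is
// reached and treats the remainder as plain text. Illustrative only.
interface Options {
  maxNesting: number; // tuneable, e.g. 20
}

type Node =
  | { type: "text"; value: string }
  | { type: "em"; children: Node[] };

function parseEmphasis(src: string, depth: number, opts: Options): Node {
  // Each surrounding "*...*" pair nests one level deeper.
  if (depth < opts.maxNesting && src.length >= 2 && src.startsWith("*") && src.endsWith("*")) {
    return {
      type: "em",
      children: [parseEmphasis(src.slice(1, -1), depth + 1, opts)],
    };
  }
  // Depth limit reached (or no emphasis markers): emit plain text.
  return { type: "text", value: src };
}

// parseEmphasis("*".repeat(50) + "x" + "*".repeat(50), 0, { maxNesting: 20 })
// stops building new levels after 20 instead of nesting 50 deep.
```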

I think that, for markdown-it’s goals, output and CPU safety are more important than unlimited nesting support. So I’d vote against an official statement in the spec that a parser “should process everything”. Not everyone needs that, and it could constrain developers.

We’ve been very careful in designing libcmark to handle deep nesting. This required eliminating recursion in the code so as not to blow the stack. You can have a 50,000-deep emphasis, link brackets, or block quote with no problem, and these don’t tie up the CPU. (See test/pathological_tests.py.) So, it IS possible.

commonmark.js is more delicate and will blow the stack on some deep nestings. It might be possible to rewrite it to avoid this, using some of the techniques we used in libcmark.

I think you could just say your parser is CommonMark compatible but limits nesting to 20 (tuneable).
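
To illustrate the recursion-elimination technique described above (this is a generic sketch, not libcmark’s or commonmark.js’s actual code), a renderer can walk an arbitrarily deep tree with an explicit stack of enter/exit entries instead of recursive calls:

```typescript
// Sketch: iterative tree rendering with an explicit stack, so nesting
// depth is limited by heap memory rather than the call stack.
interface TreeNode {
  type: string;
  children: TreeNode[];
}

function renderIterative(root: TreeNode): string {
  const out: string[] = [];
  // Each stack entry is a node plus whether we are entering or leaving it.
  const stack: Array<{ node: TreeNode; entering: boolean }> = [
    { node: root, entering: true },
  ];

  while (stack.length > 0) {
    const { node, entering } = stack.pop()!;
    if (entering) {
      out.push(`<${node.type}>`);
      // Schedule the closing tag, then the children (reversed so they
      // come off the stack in document order).
      stack.push({ node, entering: false });
      for (let i = node.children.length - 1; i >= 0; i--) {
        stack.push({ node: node.children[i], entering: true });
      }
    } else {
      out.push(`</${node.type}>`);
    }
  }
  return out.join("");
}

// A 50,000-deep chain of block quotes renders without ever approaching
// the JavaScript call stack's depth limit.
```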


Pathological inputs will always exist because computers are, in effect, linear bounded automata: they have finite resources. That’s the computer’s limitation, not CommonMark’s.

Update: I have revised the data structures and algorithms in commonmark.js to avoid recursion completely. So, now it will happily parse nested structures like block quotes or emphasis 50,000 deep, without getting bogged down or imposing any artificial limits on nesting. The change also dramatically simplifies the renderer code.
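
A quick way to check that behaviour (assuming the commonmark npm package and its Parser/HtmlRenderer interface, as described in the commonmark.js README):

```typescript
import { Parser, HtmlRenderer } from "commonmark";

// 50,000 nested block quotes: "> > > ... deep"
const input = "> ".repeat(50000) + "deep";

const parser = new Parser();
const renderer = new HtmlRenderer();

// With the non-recursive data structures this completes without a
// stack overflow or any artificial nesting limit.
const html = renderer.render(parser.parse(input));
console.log(html.length);
```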
