Pulldown-cmark (CommonMark in Rust)

It is indeed a nice feature of cmark. However, the need to handle deeply nested constructs is arguably a spec issue as well. These deeply nested constructs are hard to parse, especially for a simple pull parser like pulldown-cmark. The problem I have is that really deep nesting only benefits potential attackers; it's exceedingly unlikely to be valuable to real users.

When you have the luxury of being able to parse from the inside out (i.e., the order in which you generate elements doesn't have to match either the input or output sequence), then the cost (in code complexity, etc.) doesn't matter much. In a pull parser, you have to scan all the way to the closing construct to decide what to do with the opening one - that's where the O(n²) or worse behavior comes from.
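To make the quadratic behavior concrete, here is a toy sketch (my own illustration, not pulldown-cmark's actual code): a naive pull-style matcher that, on seeing each `[`, scans ahead for its matching `]`. On input like `[[[a]]]`, each opener re-scans most of the remainder, so the total work grows as O(n²) in the nesting depth.

```rust
// Toy illustration of naive forward-scanning (NOT pulldown-cmark's code):
// for every '[', scan ahead to find its matching ']'. Deeply nested input
// forces each opener to walk the rest of the string, giving O(n^2) work.
fn match_brackets_naive(s: &str) -> usize {
    let bytes = s.as_bytes();
    let mut matched = 0;
    for (i, &b) in bytes.iter().enumerate() {
        if b == b'[' {
            // Scan forward for the matching close, tracking nesting depth.
            let mut depth = 1;
            for &c in &bytes[i + 1..] {
                match c {
                    b'[' => depth += 1,
                    b']' => {
                        depth -= 1;
                        if depth == 0 {
                            matched += 1;
                            break;
                        }
                    }
                    _ => {}
                }
            }
        }
    }
    matched
}
```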

I haven't yet decided what I'm going to do about this. It's very possible I'll just come up with O(n) algorithms that work well in the pull parser architecture, because it's a chance to show off my prowess in solving programming puzzles. But right now it's clearly not the best use of my time, because I need to sort out tables and so on before it becomes usable for rustdoc. Another possibility would be to add language to the spec introducing maximum nesting limits, as is done in several other places. These could even be enforced, to prevent two implementations from producing different outputs (so that '[' * n + 'a' + ']' * n + '(/url)' would not be parsed as a link for n above a defined limit).
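A nesting cap like that could be enforced with a single linear scan before (or during) inline parsing. The sketch below is purely hypothetical - neither the spec nor pulldown-cmark defines such a limit, and the constant is illustrative only:

```rust
// Hypothetical nesting cap (NOT in the CommonMark spec or pulldown-cmark):
// once bracket nesting exceeds MAX_NESTING, the input could be rejected
// or the brackets treated as literal text, so '[' * n + "a" + ']' * n
// + "(/url)" stops parsing as a link for large n.
const MAX_NESTING: usize = 32; // illustrative value only

fn exceeds_nesting_limit(s: &str) -> bool {
    let mut depth: usize = 0;
    for b in s.bytes() {
        match b {
            b'[' => {
                depth += 1;
                if depth > MAX_NESTING {
                    return true;
                }
            }
            b']' => depth = depth.saturating_sub(1),
            _ => {}
        }
    }
    false
}
```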

It would be extremely helpful if the efficient parsing algorithm were documented somewhere (and this is even more true for the state machine @vitaly describes in #7, if he hopes other implementors will adopt it). I wasn't able to find any such documentation, other than the code itself, and I was trying not to rely too much on that. I found the description in the spec to be extremely declarative and thus not all that helpful - I mostly worked through examples to understand what was going on. For me personally, an imperative description in terms of a stack would actually be clearer than the declarative language in the spec.
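As an example of what I mean by an imperative, stack-based description, the bracket pass could be phrased like this (my own sketch, not the spec's algorithm or pulldown-cmark's): push the position of each `[`, and on each `]` pop the most recent opener to form a pair. Each character is touched once, so the pass is O(n).

```rust
// Stack-based bracket pairing in one linear pass (a sketch, not the
// spec's or pulldown-cmark's algorithm): openers are pushed, and each
// closer pops the nearest unmatched opener. Pairs come out innermost
// first, which suits inside-out construction of nested links.
fn pair_brackets(s: &str) -> Vec<(usize, usize)> {
    let mut stack = Vec::new();
    let mut pairs = Vec::new();
    for (i, b) in s.bytes().enumerate() {
        match b {
            b'[' => stack.push(i),
            b']' => {
                // Unmatched ']' with an empty stack is simply literal text.
                if let Some(open) = stack.pop() {
                    pairs.push((open, i));
                }
            }
            _ => {}
        }
    }
    pairs
}
```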

I also struggled quite a bit with the fact that whether link references resolve profoundly affects the parse. I started down the route of partially parsing them, then fixing up the output later when all references were known, but had to abandon it. I don't think this would be considered a good design in a from-scratch light markup language, but for this issue I wouldn't change the spec to simplify parsing - it would probably break compatibility with much existing input. When parsing CommonMark, you just have to accept that you need a full pass through the link reference definitions before you can properly begin parsing inlines.
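The two-pass structure looks roughly like the sketch below (simplified far beyond the real CommonMark label rules - no multi-line definitions, titles, or escape handling): first sweep the document for reference definitions, then parse inlines with that map in hand. Whether `[text][label]` becomes a link or stays literal text depends entirely on what the first pass collected.

```rust
use std::collections::HashMap;

// First pass of a two-pass parse: collect reference definitions of the
// form "[label]: url" into a map, normalizing labels to lowercase.
// This is a deliberately simplified sketch of the idea, not a correct
// implementation of CommonMark's link reference definition rules.
fn collect_refs(doc: &str) -> HashMap<String, String> {
    let mut refs = HashMap::new();
    for line in doc.lines() {
        let line = line.trim();
        if let Some(rest) = line.strip_prefix('[') {
            if let Some((label, url)) = rest.split_once("]:") {
                refs.insert(label.to_lowercase(), url.trim().to_string());
            }
        }
    }
    refs
}
```

Only once this map exists can the inline pass decide how `[foo][bar]` should parse, which is exactly why the partial-parse-then-fixup approach got so tangled.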

Regarding extensions, my ideas are still bubbling, but I do not think that a plugin architecture, by itself, solves the problem of divergence of extensions.