Pulldown-cmark (CommonMark in Rust)

It managed to beat the nom author to implementing CommonMark in Rust.

1 Like

This looks very good. I tried it out, and an optimized build was actually a little faster than cmark in my benchmark. It also passed the 0.19 tests. I’ve added it to the list of implementations, which is starting to look pretty full.

4 Likes

I’m glad you like it! I was going to write a post here, but looks like @lu_zero beat me to it.

My benchmarks also show it being a little faster than cmark, but slower than hoedown. I think the reason it’s faster is that it’s very careful not to allocate or copy - it uses a pull-parser architecture, so it’s based on returning events from an iterator method.

It’s also a completely independent codebase, not based on any existing implementation. I tried to use the spec as my primary source, mostly playing with cmark as a black box when I was confused about edge cases. For the most part, I felt this worked well, though I think the spec is still a long way from being a definition of a total function from input to output. Now that it’s released, I’m happy to expand on issues I found or answer any questions.
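For readers who have not used a pull parser before, here is a minimal sketch of what consuming such an event iterator looks like. It uses the pulldown-cmark API as it exists in current releases (Parser, Event); names and variants may differ from the 2015 version discussed in this thread.

    use pulldown_cmark::{Event, Parser};

    fn main() {
        let input = "Hello *world*, see [the spec](https://spec.commonmark.org).";

        // Parser implements Iterator<Item = Event>: events are produced on
        // demand as the input is scanned, rather than building a full tree
        // up front.
        for event in Parser::new(input) {
            match event {
                Event::Start(tag) => println!("start {:?}", tag),
                Event::End(tag) => println!("end   {:?}", tag),
                Event::Text(text) => println!("text  {:?}", text),
                other => println!("other {:?}", other),
            }
        }
    }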

3 Likes

@raph, if we’re talking about the global design problems of markdown parsers, there are two big open questions:

  • syntax plugin support (available in markdown-it, but not ideal)
  • sourcemap support (partially available in the reference parser, but only for blocks, and without tabs)

I think those are the goals for the next wave of parsers. All other issues are minor, IMHO.

1 Like

@raph Try this (from the cmark directory):

% python3 test/pathological_tests.py --prog ../pulldown-cmark/target/release/pulldown-cmark

Thanks. It’s sorta fixed now. Nested links are still O(n^3), which means your test would take about 10 days to run, but this doesn’t worry me greatly, because it’s just not something that’s going to happen with non-malicious input. If I was concerned about denial of service, I’d probably put in an error on exceeding a nesting limit.

Plugins are hard. Adding arbitrary inline markup should be relatively easy through a general mechanism (math, strikethrough, etc.). But consider one of the most important extensions, tables. In Kramdown syntax, the criterion for whether a block is a table is very complicated (it has to do with the presence of vertical bars), as is the structure of the resulting parse. It’s hard for me to imagine an extension mechanism general enough to handle tables without exposing a lot of the internals of the parser. I’m strongly leaning towards just implementing all the popular extensions, and making them available via flag. Being open source makes this approach more viable.

Well, here is good news anyway. My parser supports fine-grained, precise source maps. Even text elements such as entities and backslash escapes get their own text event in the output, each with an offset into the source. I’m looking forward to applications, as I haven’t tested it extensively. I’m sure there are opportunities to make cleaner source maps by ascribing blank lines to different events.
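To show what that looks like from the consuming side, current releases of pulldown-cmark expose the offsets through an into_offset_iter() method that yields each event together with its source byte range; the 2015 API differed, so treat this as a sketch of the idea rather than of the exact code being discussed:

    use pulldown_cmark::{Event, Parser};

    fn main() {
        // Contains a backslash escape (\*) and an entity (&amp;).
        let input = "A \\* literal star and an &amp; entity.";

        // Each event is paired with the byte range of the source text that
        // produced it; those ranges are the raw material for a source map.
        for (event, range) in Parser::new(input).into_offset_iter() {
            if let Event::Text(text) = event {
                println!("{:?} <- bytes {}..{}", text, range.start, range.end);
            }
        }
    }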

You are right, it’s not easy. But I see no alternative if we want to make users less dependent on parser and md/cm spec maintainers. IMHO that’s the reason for the huge divergence among markdown parsers: one day people start creating “yet another parser for yet another markdown” because the old one can’t be extended. On the other hand, I have some disagreements with jgm, but that doesn’t prevent markdown-it from following the CommonMark spec, because all the conflicting cases are moved into plugins.

I’m interested in exchanging experience with other authors who decide to design a parser with plugins. We did a lot of work in this direction, but the existing state machine needs one or two more rework iterations to reach its optimum.

Well, I can’t say I’m happy if you decide to go that way, but it’s your choice. Anyway, I advise digging through keywords:markdown-it-plugin - npm search. It gives a more complete picture of what people really need (of course, it includes all the popular extensions too).

Try searching the cmark, commonmark.js, and markdown-it trackers for sourcemap requests, and contacting the issue authors. AFAIK, the most requested areas are syntax highlighting and range selection for copy-paste in editors.

@vitaly, perhaps we can open another discussion about it.

No problem, if you wish. In this topic I’m writing not about specific plugins, but about parser design approaches. Pulldown-cmark is one of the rare cases where a new architecture was used.

+++ raph [Jun 07 15 06:14 ]:

Thanks. It’s sorta fixed now. Nested links are still O(n^3), which
means your test would take about 10 days to run, but this doesn’t worry
me greatly, because it’s just not something that’s going to happen with
non-malicious input. If I was concerned about denial of service, I’d
probably put in an error on exceeding a nesting limit.

This is a nice feature of cmark – it handles these cases
without slowdown, arbitrary limits, or stack overflows.

2 Likes

It is indeed a nice feature of cmark. However, the need to handle deeply nested constructs is arguably a spec issue as well. These deeply nested constructs are hard to parse, especially for a simple pull parser like pulldown-cmark. The problem that I have is that really large values of n only benefit potential attackers; they’re exceedingly unlikely to be valuable to real users.

When you have the luxury of being able to parse from the inside out (i.e., the order in which you generate elements doesn’t have to match either the input or output sequence), then the cost (in code complexity, etc.) doesn’t matter much. In a pull parser, you have to scan all the way to the closing construct to decide what to do with the opening one - that’s where the O(n^2) or worse behavior comes from.

I haven’t yet decided what I’m going to do about this. It’s very possible I’ll just come up with O(n) algorithms that work well in the pull parser architecture, because it’s a chance to show off my prowess in solving programming puzzles. But right now it’s clearly not the best use of my time, because I need to sort out tables and so on before it becomes usable for rustdoc. Another possibility would be to add language to the spec introducing maximum nesting limits, as is done in several other places. These could even be enforced, to prevent two implementations from having two different outputs (so that '[' * n + 'a' + ']' * n + '(/url)' would not be parsed as a link above a defined value of n).
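As a sketch of what an enforced limit could look like (purely illustrative: the constant, the function, and the policy below are invented for this example and are not pulldown-cmark’s or cmark’s actual code), a parser can track bracket depth while scanning and refuse to treat anything beyond the cap as markup:

    /// Illustrative only: count how deeply '[' ... ']' constructs nest and
    /// bail out once a fixed cap is exceeded, so that input like
    /// "[" * n + "a" + "]" * n + "(/url)" cannot trigger super-linear work
    /// for huge n.
    const MAX_NESTING: usize = 32; // hypothetical limit

    fn exceeds_nesting_limit(input: &str) -> bool {
        let mut depth: usize = 0;
        for ch in input.chars() {
            match ch {
                '[' => {
                    depth += 1;
                    if depth > MAX_NESTING {
                        return true; // treat the rest as plain text, or report an error
                    }
                }
                ']' => depth = depth.saturating_sub(1),
                _ => {}
            }
        }
        false
    }

    fn main() {
        let pathological = "[".repeat(50_000) + "a" + &"]".repeat(50_000) + "(/url)";
        assert!(exceeds_nesting_limit(&pathological));
        assert!(!exceeds_nesting_limit("[a](/url) and [b](/other)"));
    }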

It would be extremely helpful if the efficient parsing algorithm were documented somewhere (and this is even more true for the state machine @vitaly describes in #7, if he hopes other implementors will adopt it). I wasn’t able to find any, other than looking at the code, and I was trying not to rely too much on that. I found the description in the spec to be extremely declarative and thus not all that helpful - I mostly worked through examples to understand what was going on. For me personally, an imperative description in terms of a stack would actually be clearer than the declarative language in the spec.

I also struggled quite a bit with the fact that whether link references resolve profoundly affects the parse. I started down the route of partially parsing them, then fixing up the output later when all references were known, but had to abandon it. I don’t think it would be considered a good design in a from-scratch light markup language, but for this issue I probably wouldn’t change the spec to simplify parsing - it would probably break compatibility with much existing input. When parsing CommonMark, you just have to accept that you need to take a full pass through link references before you can properly begin parsing inline.

Regarding extensions, my ideas are still bubbling, but I do not think that a plugin architecture, by itself, solves the problem of divergence of extensions.

IMHO it’s impossible at the current stage (a useless waste of time). The problem with markdown is that its syntax cannot be described in BNF or a similar notation, which forces heuristics into the code. In theory, the spec could change in a way that breaks the implementation logic and requires a serious rewrite (at least in my case). I think any kind of algorithm freeze is possible only after the spec stabilizes (read: not soon).

Also, implementations are different, because authors have different priorities. For example:

  • the reference implementation by @jgm is available in many languages and has a stackless parser, allowing unlimited nesting.
  • the “alternative” JS implementation by me and Alex is focused on the web and end users’ needs; it’s expected to generate safe HTML by default and to allow syntax plugins. For example, we simply limited the nesting level instead of fighting for performance, because users don’t care; parser flexibility is much more important to them.

Since the parser logic can’t currently be formalized in one best way, development is an iterative process, and each iteration needs a lot of resources for experimenting and checking ideas. For example, we did one big rewrite of markdown-it, and I expect one or two more (for sourcemaps and plugin improvements). Right now I’m a bit tired of this, and have decided to postpone the next attempt for 0.5-1 years, until the spec becomes more complete.

I believe that in the far future someone will invent an ideal parser algorithm that satisfies all requirements, but it isn’t available now. It would be great if someone decided to reuse the existing experience and move forward. There are serious achievements already: @jgm demonstrated that nested tags can be parsed with linear complexity, and Alex and I proved that syntax plugins are possible.

IMHO we can’t completely stop divergence, but divergence in extensions is much better than divergence in parsers: fewer splits in the developer community. There are many people (me, for example) who need more than what the CM spec describes, but who don’t see an alternative to CommonMark in general. It’s a waste of human resources if we send such people to hell :slight_smile:

+++ raph [Jun 07 15 22:02 ]:

When you have the luxury of being able to parse from the inside out
(i.e., the order in which you generate elements doesn’t have to match
either the input or output sequence), then the cost (in code
complexity, etc) doesn’t matter much. In a pull parser, you have to
scan all the way to the closing construct to decide what to do with the
opening one - that’s where the O(n^2) or worse behavior is coming from.

I don’t really understand how you can avoid scanning to the closing construct. (This needn’t involve poor performance—cmark does it without any backtracking).

After all, a ** might be an open-strong-emphasis token, but it might just be a literal **. Or it might be, for example, two open-regular-emphasis tokens in a row. I don’t see how you can tell which it is without parsing ahead. How do you handle this with the pull parser?

I haven’t yet decided what I’m going to do about this. It’s very
possible I’ll just come up with O(n) algorithms that work well in the
pull parser architecture, because it’s a chance to show off my prowess
in solving programming puzzles. But right now it’s clearly not the best
use of my time, because I need to sort out tables and so on before it
becomes usable for rustdoc. Another possibility would be to add
language to the spec introducing maximum nesting limits, as is done in
several other places.

I’d rather not do this, since it is possible to parse efficiently without nesting limits. But, let’s be pragmatic. I don’t see anything terrible about saying that an implementation implements the CommonMark spec, but with an arbitrary emphasis nesting limit of N.

It would be extremely helpful if the efficient parsing algorithm were
documented somewhere (and this is even more true for the state machine
@vitaly describes in #7, if he hopes other implementors will adopt
it). I wasn’t able to find any, other than looking at the code, and I
was trying not to rely too much on that. I found the description in the
spec to be extremely declarative and thus not all that helpful - I
mostly worked through examples to understand what was going on. For me
personally, an imperative description in terms of a stack would
actually be clearer than the declarative language in the spec.

The idea behind providing spec + code was that those who wanted an algorithm could look at the code, while those who wanted a declarative specification could look at the spec. I agree that a description of the efficient algorithm for parsing nested emphasis would be useful to have – it’s just not a priority for me, since interested parties can find this in the code.

(I don’t know anything about vitaly’s state machine; it’s not part of my implementations. It has something to do with his implementation, markdown-it.)

I also struggled quite a bit with the fact that whether link references
resolve profoundly affects the parse. I started down the route of
partially parsing them, then fixing up the output later when all
references were known, but had to abandon it. I don’t think it would be
considered a good design in a from-scratch light markup language, but
for this issue I probably wouldn’t change the spec to simplify parsing -
it would probably break compatibility with much existing input. When
parsing CommonMark, you just have to accept that you need to take a
full pass through link references before you can properly begin parsing
inline.

I was wondering how you handled this in the pull parser! So, are you saying that you take a pass through the whole input just looking for link references, and then do the real parse? It’s pretty non-trivial to recognize link references without doing a full parse, though, especially because in CommonMark (a) link references can occur in list items, blockquotes, etc., and thus needn’t be at the left margin, and (b) things that look exactly like link references can occur in code blocks.

Regarding extensions, my ideas are still bubbling, but I do not think
that a plugin architecture, by itself, solves the problem of divergence
of extensions.

Yes. It’s a hard issue. The same issues that motivated efforts to attain convergence on core elements apply to extensions too. On the other hand, we have to recognize that people have very divergent needs and some customization may often be needed.

I should have been clearer, sorry. I’m not saying you can avoid scanning to the closing construct. That pretty much has to happen. The specific performance problem is that it has to scan to the closing construct for every opening one.

Currently, when it sees a potential open element, it scans ahead to find the closing one. In so doing, it runs a stack, pushing any open tags, and popping when it finds a close. Then (and here’s the performance problem when deeply nesting), it discards the result of that analysis, rerunning it the next time it sees a potential open tag.

I think that’s what I’ll do for now, especially having learned that other implementations are similar (hoedown also seems to implement limits).

Right now, I just run it twice. The first run is primarily to collect link references, and also a map of which lists are loose. It’s almost identical to the full run, but the list of “active” characters in inline markup is trimmed, avoiding most inline processing other than detecting block end. I think there are opportunities to do a little less work in the first pass, but I’m not sure yet how much it’ll save. In any case, not having separate code keeps things simpler and also makes correctness easier, as you point out.
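To make the two-pass idea concrete, here is a toy sketch (not pulldown-cmark’s actual code) in which the first pass does nothing but collect link reference definitions of the simplest [label]: destination form. As noted above, a real first pass is much harder: definitions may be indented inside list items or blockquotes, and lines that merely look like definitions can sit inside code blocks.

    use std::collections::HashMap;

    /// Toy first pass: collect `[label]: destination` definitions so that a
    /// second, full parse can resolve reference links against the map.
    /// Deliberately ignores titles, indentation rules, and code blocks.
    fn collect_link_refs(input: &str) -> HashMap<String, String> {
        let mut refs = HashMap::new();
        for line in input.lines() {
            let line = line.trim_start();
            let Some(rest) = line.strip_prefix('[') else { continue };
            let Some((label, after)) = rest.split_once("]:") else { continue };
            let dest = after.trim();
            if !label.is_empty() && !dest.is_empty() {
                // CommonMark keeps the first definition for a given label.
                refs.entry(label.to_lowercase())
                    .or_insert_with(|| dest.to_string());
            }
        }
        refs
    }

    fn main() {
        let doc = "See [the spec][spec].\n\n[spec]: https://spec.commonmark.org\n";
        let refs = collect_link_refs(doc);
        assert_eq!(
            refs.get("spec").map(|d| d.as_str()),
            Some("https://spec.commonmark.org")
        );
        // A second, full parse would now resolve [the spec][spec] using `refs`.
    }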

I totally understand this, being busy myself. But to play devil’s advocate, has this strategy been successful, in the sense of anyone else implementing the efficient algorithm?

Thanks for engaging in the discussion. I’m happy to answer more questions about the implementation.

+++ raph [Jun 07 15 23:39 ]:

jgm:

I don't really understand how you can avoid scanning to the closing
construct.

I should have been clearer, sorry. I’m not saying you can avoid
scanning to the closing construct. That pretty much has to happen. The
specific performance problem is that it has to scan to the closing
construct for every opening one.

Roughly, the way we handle this in cmark and commonmark.js is this (I’m indebted to @Knagis for the idea). As we initially parse inlines, we don’t create any emph or strong elements. Instead, we just store raw text elements for strings of potential strong or emph delimiters. But at the same time, we store pointers to these in a special “emphasis stack.” Each pointer indicates how many delimiters are in the delimiter run, and whether it is eligible to open and/or close emphasis.

Once we’ve parsed all the inlines, we run a function that processes emphasis and strong emphasis (and, in cmark, smart quotes). This uses the emphasis stack. It finds the first eligible closer, then looks back for a matching opener. If it finds a match, it creates an emphasis or strong emphasis element and adjusts the number of delimiters in the delimiter runs. (For example, if it found emphasis, then it reduces this number by 1 on each side.) In some cases this may lead to the removal of text nodes and items in the emphasis stack; in other cases, these nodes are simply shortened.

Then we move to the next eligible closer, and so on until we’re done traversing the emphasis stack.

It’s a pretty efficient algorithm, far better than what you’d naively do if you try to parse emphasis as you go.

You could probably just copy this approach, though it means you have to build up more stuff in memory before emitting any start/end inline events.
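To make the shape of that algorithm concrete, here is a heavily simplified sketch in Rust. It handles only '*' delimiters, skips the left-/right-flanking and multiple-of-3 rules, and works on a flat node list instead of a real tree, so it illustrates the delimiter-stack idea rather than reproducing cmark’s code.

    // Pass 1 stores '*' runs as ordinary nodes; pass 2 matches each run,
    // acting as a closer, against earlier openers, shrinking both runs and
    // recording the <em>/<strong> tags to emit around the matched region.
    enum Node {
        Text(String),
        Delims(usize), // number of unmatched '*' characters in this run
    }

    fn tokenize(input: &str) -> Vec<Node> {
        let mut nodes = Vec::new();
        let mut text = String::new();
        let mut chars = input.chars().peekable();
        while let Some(ch) = chars.next() {
            if ch == '*' {
                if !text.is_empty() {
                    nodes.push(Node::Text(std::mem::take(&mut text)));
                }
                let mut count = 1;
                while chars.peek() == Some(&'*') {
                    chars.next();
                    count += 1;
                }
                nodes.push(Node::Delims(count));
            } else {
                text.push(ch);
            }
        }
        if !text.is_empty() {
            nodes.push(Node::Text(text));
        }
        nodes
    }

    fn to_html(input: &str) -> String {
        let mut nodes = tokenize(input);
        // Tags to emit just before (closing) and just after (opening) each node.
        let mut open_tags: Vec<Vec<&str>> = vec![Vec::new(); nodes.len()];
        let mut close_tags: Vec<Vec<&str>> = vec![Vec::new(); nodes.len()];
        let mut openers: Vec<usize> = Vec::new(); // indices of runs with unmatched '*'

        for i in 0..nodes.len() {
            let mut count = match &nodes[i] {
                Node::Delims(n) => *n,
                Node::Text(_) => continue,
            };
            // First let this run act as a closer against earlier openers.
            while count > 0 {
                let Some(&j) = openers.last() else { break };
                let Node::Delims(open_count) = &mut nodes[j] else { unreachable!() };
                let used = count.min(*open_count).min(2); // 2 => strong, 1 => emph
                let (open, close) = if used == 2 {
                    ("<strong>", "</strong>")
                } else {
                    ("<em>", "</em>")
                };
                open_tags[j].push(open);
                close_tags[i].push(close);
                *open_count -= used;
                count -= used;
                if *open_count == 0 {
                    openers.pop();
                }
            }
            nodes[i] = Node::Delims(count);
            // Whatever is left over can still open emphasis later on.
            if count > 0 {
                openers.push(i);
            }
        }

        // Render: matched delimiters become tags; leftovers stay literal '*'s.
        let mut out = String::new();
        for i in 0..nodes.len() {
            for tag in &close_tags[i] {
                out.push_str(tag);
            }
            match &nodes[i] {
                Node::Text(t) => out.push_str(t),
                Node::Delims(n) => out.push_str(&"*".repeat(*n)),
            }
            for tag in open_tags[i].iter().rev() {
                out.push_str(tag);
            }
        }
        out
    }

    fn main() {
        assert_eq!(to_html("*hello* **world**"), "<em>hello</em> <strong>world</strong>");
        println!("{}", to_html("***both***")); // <em><strong>both</strong></em>
    }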

it's just not a priority for me, since interested parties can find
this in the code

I totally understand this, being busy myself. But to play devil’s
advocate, has this strategy been successful, in the sense of anyone
else implementing the efficient algorithm?

Several of the implementations in other languages are ports of the commonmark.js code, so to that extent, yes.

1 Like

Thanks for the description of the algorithm, it’s helpful. I only dimly understood it before, now it’s quite clear.

It’s also clear that my algorithm is computing the same result, but using a rather different data structure and approach. I’ll outline it in a little more detail than before:

When I hit an opening delimiter run, I basically want to know three pieces of information:

  • whether it has a matching closer
  • if so, whether it’s 1 or 2 (emph or strong)
  • also if so, where the closing delimiter is

So I maintain a stack of potential opener run lengths, initialized with the length of the first one. Then I scan forward. When I hit a delimiter run, I first check if it can close. If so, I match it against the stack the same way you do, repeatedly decreasing both top-of-stack and the closing delimiter length by the minimum of the two, popping the stack when it reaches zero. If the stack empties, I have my match, so I emit the opening tag, and push it onto the tag stack along with a “limit” corresponding to the beginning of the closer.

If the delimiter run can open but not close, I push its length onto the stack.

If I scan to the end of the block without emptying the stack, then it’s treated as text rather than markup.

The problem with this scheme is that I compute a bunch of information that I throw away. However, I think right there is the seed to a more efficient solution - just don’t throw it away. By the time I’ve reached the end of scanning (either finding a matching closer or reaching the end of the block), I know all matches in that range. I think the solution is just to store that information in a map keyed by offset of the opening delimiter run. Then, as I encounter delimiter runs inside the scan range, just look them up in the map instead of recomputing.
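The contrast between rescanning and keeping what the scan learned can be seen in miniature with plain brackets (a toy example, not pulldown-cmark’s code): the naive version scans forward again from every opener, while the single-pass version records each match the moment it is found, keyed by the opener’s offset, so nothing is recomputed.

    use std::collections::HashMap;

    /// Naive version: for every '[' scan forward, tracking nesting depth,
    /// until its matching ']' is found. Quadratic when the input is deeply
    /// nested, because the same ground is covered repeatedly.
    fn matches_naive(s: &[u8]) -> HashMap<usize, usize> {
        let mut out = HashMap::new();
        for (i, &b) in s.iter().enumerate() {
            if b != b'[' {
                continue;
            }
            let mut depth = 0usize;
            for (j, &c) in s.iter().enumerate().skip(i) {
                match c {
                    b'[' => depth += 1,
                    b']' => {
                        depth -= 1;
                        if depth == 0 {
                            out.insert(i, j);
                            break;
                        }
                    }
                    _ => {}
                }
            }
        }
        out
    }

    /// Single-pass version: push opener offsets, pop on ']', and record each
    /// pair the first (and only) time it is seen.
    fn matches_cached(s: &[u8]) -> HashMap<usize, usize> {
        let mut out = HashMap::new();
        let mut stack = Vec::new();
        for (i, &b) in s.iter().enumerate() {
            match b {
                b'[' => stack.push(i),
                b']' => {
                    if let Some(open) = stack.pop() {
                        out.insert(open, i);
                    }
                }
                _ => {}
            }
        }
        out
    }

    fn main() {
        let doc = b"[a [b] [c [d]]] e";
        assert_eq!(matches_naive(doc), matches_cached(doc));
        println!("{:?}", matches_cached(doc));
    }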

I think I’ll implement that because it’s elegant, but the priority is low for the reasons given above. I also suspect the same approach can be used for links, but it’s probably more fiddly because there are a bunch of cases (inline links and three kinds of link references, plus whether the reference resolves or not), and there’s very little to be gained over just limiting nesting depth.

Sorry if I came off as aggressive there. Of course, that’s true, and not everything has to be judged on the basis of how easy it is to do a completely independent implementation from the spec alone. In any case, the fact that there can be at least two good implementations, with different parser architectures, strongly suggests that you’re standardizing on the right thing. Eventually someone will write a clear description of what the spec describes and how to implement it efficiently (hopefully drawing on this thread), and then all will be good.

1 Like

As a quick followup, on thinking more deeply through this I find that this fragment gives different results on cmark and pulldown-cmark:

*a __b *c d__ e*

I believe the problem is with my implementation, and the issue is that because d closes b, c is not an opener. My implementation just scans the * delimiters, so the stack is nonempty at the end of the block. Thus, to fix this I think I have to run a stack that contains all inline markup, not just the one for the opener in question. So I will probably rewrite this to be both more correct and O(n). I still think it’s a bit of an edge case, but I do want pulldown-cmark to be a truly high-quality implementation.

It might be worth adding as an example, to illustrate the interactions between delimiter runs of two different types. Looking carefully, I think the declarative language is clear enough, but my experience shows that it’s possible to miss it.

1 Like

+++ raph [Jun 10 15 00:13 ]:

As a quick followup, on thinking more deeply through this I find that
this fragment gives different results on cmark and pulldown-cmark:
*a __b *c d__ e*

I believe the problem is with my implementation, and the issue is that
because d closes b, c is not an opener.

The part of the spec that most directly addresses this is:

  When two potential emphasis or strong emphasis spans overlap, so that
  the second begins before the first ends and ends after the first ends,
  the first takes precedence. Thus, for example, *foo _bar* baz_ is parsed
  as <em>foo _bar</em> baz_ rather than *foo <em>bar* baz</em>.

But clearly, since your implementation passed all the spec
tests while diverging on this input, we need a spec test
of this kind. I’ll add one.

2 Likes