Why is MD4C so fast? [C]

mity · July 13, 2017, 9:30am

MD4C is, to my knowledge, by far the fastest CommonMark implementation

Thank you

Why is it so fast?

Because it was designed to be fast. At the beginning of its development it involved several complete rewrites to find the right way how to approach the problem.

The main reasons of its success are likely as follows:

It is SAX-like parser, unlike many other implementations which are DOM-parsers. (Actually it’s the only SAX-like implementation in C I am aware of and that was also one of the reasons I started to work on it.) That means MD4C does not construct any AST tree or anything like that, it just calls a callback when beginning/end of block or span is reached. That is advantage (speed) as well as disadvantage (repertoire of available functions). Certain functionality like manipulation with the AST tree cannot be easily done on top of it. But if all you want to do is to parse Markdown input, then MD4C is your soldier: effective and deadly.
It does not copy text buffers here and there (as much as other implementations). Well, sometimes it is inevitable, but mostly it can be avoided. This means the callbacks just get pointers pointing to inside of the buffer of the input document being parsed. The cost is that application must deal with string lengths carefully (callbacks cannot expect the string terminator '\0').
The two points above also minimize work for heap allocator which likely plays important slow-down factor in other parsers. Actually my original aim was to avoid heap allocator altogether. That was eventually found out as too ambitious but still, MD4C likely allocates memory on heap much less frequently then other parsers. Working with the document mostly in the single buffer is also likely more friendly to CPU cache (data locality).
Its inline parsing (parsing of block contents) is very fast because it starts by collecting all potentially meaningful marks into a helper and very compact buffer and most of the work is done with this buffer instead of the full text buffer, and for normal (non-malicious) input, this helper buffer is about an order of magnitude smaller then the corresponding full text.
Because it does not do almost any input Unicode validation. To do its work, Markdown parser needs to understand Unicode in few very limited contexts if you read CommonMark standard carefully, and so MD4C does so only in those contexts. In most cases, it just propagates the (potentially Unicode-invalid) text into the callbacks unchanged.
And last but not least, because I spent hours with profiler, optimizing some bottlenecks quite well. Many developers simply never do that so I believe that in many parsers there may be quite good opportunities to optimize further, but nobody really cared. Just search for ‘optimization’ or ‘optimize’ in https://github.com/mity/md4c/blob/master/md4c/md4c.c. In parsers, the loop unrolling can generally help a lot when used on some hot paths. C compilers are usually able to do this kind of optimization automatically only with some additional hints, e.g. when using profile guided optimization. (BTW, if you try to build any Markdown parser with PGO, you can get quite good boost.)