Why is MD4C so fast? [C]

Demi_Obenour · July 13, 2017, 12:01am

MD4C is, to my knowledge, by far the fastest CommonMark implementation, beating out the non-conformant implementations Hoedown and pulldown-cmark as well as the C reference implementation.

Why is it so fast? What would (say) pulldown-cmark need to do to beat it? (Other than update to the latest version of the spec.)

mity · July 13, 2017, 9:30am

MD4C is, to my knowledge, by far the fastest CommonMark implementation

Thank you

Why is it so fast?

Because it was designed to be fast. At the beginning of its development it involved several complete rewrites to find the right way how to approach the problem.

The main reasons of its success are likely as follows:

It is SAX-like parser, unlike many other implementations which are DOM-parsers. (Actually it’s the only SAX-like implementation in C I am aware of and that was also one of the reasons I started to work on it.) That means MD4C does not construct any AST tree or anything like that, it just calls a callback when beginning/end of block or span is reached. That is advantage (speed) as well as disadvantage (repertoire of available functions). Certain functionality like manipulation with the AST tree cannot be easily done on top of it. But if all you want to do is to parse Markdown input, then MD4C is your soldier: effective and deadly.
It does not copy text buffers here and there (as much as other implementations). Well, sometimes it is inevitable, but mostly it can be avoided. This means the callbacks just get pointers pointing to inside of the buffer of the input document being parsed. The cost is that application must deal with string lengths carefully (callbacks cannot expect the string terminator '\0').
The two points above also minimize work for heap allocator which likely plays important slow-down factor in other parsers. Actually my original aim was to avoid heap allocator altogether. That was eventually found out as too ambitious but still, MD4C likely allocates memory on heap much less frequently then other parsers. Working with the document mostly in the single buffer is also likely more friendly to CPU cache (data locality).
Its inline parsing (parsing of block contents) is very fast because it starts by collecting all potentially meaningful marks into a helper and very compact buffer and most of the work is done with this buffer instead of the full text buffer, and for normal (non-malicious) input, this helper buffer is about an order of magnitude smaller then the corresponding full text.
Because it does not do almost any input Unicode validation. To do its work, Markdown parser needs to understand Unicode in few very limited contexts if you read CommonMark standard carefully, and so MD4C does so only in those contexts. In most cases, it just propagates the (potentially Unicode-invalid) text into the callbacks unchanged.
And last but not least, because I spent hours with profiler, optimizing some bottlenecks quite well. Many developers simply never do that so I believe that in many parsers there may be quite good opportunities to optimize further, but nobody really cared. Just search for ‘optimization’ or ‘optimize’ in https://github.com/mity/md4c/blob/master/md4c/md4c.c. In parsers, the loop unrolling can generally help a lot when used on some hot paths. C compilers are usually able to do this kind of optimization automatically only with some additional hints, e.g. when using profile guided optimization. (BTW, if you try to build any Markdown parser with PGO, you can get quite good boost.)

mity · July 13, 2017, 12:02pm

To elaborate on point 2, many projects say so, but when I took a look, they usually construct buffers for contents of each block (to get rid of block indentations, blockquote marks, list item marks etc.). That effectively means extra copy (part by part) of whole document.

Output of block analysis in MD4C looks very differently, as a vector of MD_LINE structures. The structure has members denoting starting and ending offsets of each line the block is composed of. The beginnings are advanced to skip any block decorations (indentations, blockquote marks, list item marks) and ends are (with exception of code blocks) also trimmed of extra whitespace and new-line characters.

This means that almost all inline processing can access directly the input document buffer without copying its contents. It has just to iterate the chars in two (nested) loops.

It also means the application is fed by multiple calls to the text callback instead of merging the strings together and then making a single call.

notriddle · July 14, 2017, 2:17am

To answer the questions about pulldown-cmark:

pulldown-cmark is more like StAX than SAX, but it should be about the same from a perf standpoint (and, IMO, provides a better API).

pulldown-cmark will copy strings sometimes, if backslash escapes are used. I assume MD4C is the same way.

p-c’s main of the heap are:

It’s internal stacks (pulldown-cmark is built like a pushdown automata, though it probably doesn’t comply with the mathematical definition).
The reference link and loose list tables.
Copying strings with stripped-out backslashes.

It may grow more heap usage in pursuit of O(n) runtime.

pulldown-cmark might want to copy this approach. It seems doable.

pulldown-cmark uses byte indexing for everything. The only UTF-8 validation it does is making sure that the text slices it returns don’t split codepoints in twain. However, if you’re grabbing your data off the hard drive, you’ll need to do a pass of validation or use unsafe to cast your byte array to a string.

pulldown-cmark needs a good dose of some of this. Currently, though, a lot of the work is going into conformance and algorithm-level tweaks (a long-term goal is to make it O(n) for all input).

JoeDuarte · July 14, 2017, 3:10am

Martin, I’d be thrilled if you ever decided to build an HTML minifier using the same principles and level of quality and rigor that you’ve applied to MD4C.

Does md2html generate bloated or minified output? Something that always struck me as strange is that all HTML generators, CMSes, web frameworks, etc. generate bloated HTML instead of minified. It would be so much better if HTML was “born” minified by default. It would have helped if the HTML spec defined HTML as minified, or at least defined some rules for minification.

Demi_Obenour · July 14, 2017, 3:11am

If I understand MD4C, it doesn’t do any copying ever –even when the input
has backslash escapes. Instead, it makes multiple calls to the rendering
callback. pulldown-cmark could do the same.

mity · July 14, 2017, 6:04am

Mostly, no. For foo \* bar, MD4C rather calls text callback twice, for string portion before the backslash and after it.

If I haven’t overlooked anything, MD4C has three main dynamically growing flat buffers (i.e. amortized complexity O(1)). One is used for storing block analysis output. The other is used for inline analysis output (list of potentially meaningful marks) and it is reused for each block. The 3rd is used for building dictionary of reference link definitions.

Other allocations are rather exceptional:

One malloc() per link reference definition, but only if its label is multi-line.
One malloc() per link reference definition, but only if its title is multi-line.
One malloc() per link/image reference, but only if its contents is multi-line.
One malloc() per inline link/image, but only if its title is multi-line.
One malloc() in context where text of different type may appear may appear at once, but where it is not propagated to a callback via the text callback. This happens for link’s href and title attributes, (or src if <img>), and also for info string of fenced code block. (I know. Sounds difficult. Consider for example [a](/url/with/&entity;) or look into md4c.h for structure MD_ATTRIBUTE and where it is used as a member in the other structures.)

Due the last point, allocations are currently O(n) where n corresponds to number of links, images and fenced code blocks (with non-empty info string). I plan to optimize it to avoid the allocation if the string happens to be uniform, which should be most cases in normal input.

For normal input, yes. But making it a guaranty is (with current specification) impossible.

Consider document constructed of link reference definitions and link references. I.e. count of link reference definitions grows lineary with the document size, as the count of link references. Every lookup of link reference definition cannot be better then O(log(n)), making overall complexity about O(n * log(n)) when parsing it.

Demi_Obenour · July 14, 2017, 6:38am

Can hash tables make the effective complexity O(1) in practice? To avoid HashDOS, one can use a universal hash function family with a key generated by a CSPRNG. (This is probably secure as long as the key is chosen after all input has arrived).

mity · July 14, 2017, 7:13am

You got me, you are right. Their amortized complexity can be O(1). Maybe I am too locked in my implementation where I eventually decided to not use the hash table, at least for now, to keep it more simple.

I’m not sure how fast the cryptographically strong hash functions are though, so for how large N it really starts to make sense.

mity · July 14, 2017, 8:11am

As the hash can be completely constructed after we know all reference definition links (i.e. between block analysis and inline analysis), it might be also possible to use perfect hashing. The construction is O(n), retrieval O(1). There is no need to fear what kind of cryptographic weakness shall be found tomorrow or whether you generate key in secure-enough way.

notriddle · July 14, 2017, 1:52pm

That’s exactly how pulldown-cmark does it. The Rust standard library ships with a generic, and increasingly battle tested, hash map.

mity · July 25, 2017, 12:50am

Well, this thread made me finally do what I was postponing for a long time: Doing some better bechmarking of MD4C. For now, I compared it just against Cmark, as it is probably the most relevant competitor.

As these numbers provide some hard data for the discussion above, let me to publish them here.

The testing was done on 64-bit Linux machine (Slackware 14.2). All input files were placed in tmpfs filesystem to mitigate any I/O impact. The script used for the testing can be found in this gist but note it is not easily reusable without some manual tweaking and that it uses some scripts from Cmark’s repo.
Fresh release build of current master head was used both for MD4C as well as for Cmark.

The test composed of several samples. Usually the samples try to target dominantly one particular aspect of the parsers implementation. For example, the test many-paragraphs.md contains just 1,000,000 trivial paragraphs and tries to examine mainly how block parser behaves. Similarly all the tests are just made by huge repetitions of some very simple pattern which tends to be used frequently in any markdown document. (Some tests use different count of repetitions to give some measurable numbers).

Just the sample cmark-benchinput.md is different: It is compilation of mutiple language version of the pro-git book, as generated by make bench from Cmark’s repo. Unlike the other samples, it can be seen as a representative of “normal input”.

Each sample was performed 10 times. Given that stddev was always negligible, the table below contains only mean times in seconds (complete output of the script is in the comment of the script gist)

The benchmaring also helped to find one nasty bug in MD4C (the results below are after the fix applied).

Test name	Sample input	MD4C (seconds)	Cmark (seconds)
`cmark-benchinput.md`	(benchmark from CMark)	0.3650	0.7060
`long-block-multiline.md`	`"foo\n" * 1000000`	0.0400	0.2300
`long-block-oneline.md`	`"foo " * 10 * 1000000`	0.0700	0.1000
`many-atx-headers.md`	`"###### foo\n" * 1000000`	0.0900	0.4670
`many-blanks.md`	`"\n" * 10 * 1000000`	0.0700	0.3110
`many-emphasis.md`	`"foo " * 1000000`	0.1100	0.8460
`many-fenced-code-blocks.md`	`"~~~\nfoo\n~~~\n\n" * 1000000`	0.1600	0.4010
`many-links.md`	`"[a](/url) " * 1000000`	0.2100	0.5110
`many-paragraphs.md`	`"foo\n\n" * 1000000`	0.0900	0.4860

I find quite surprising that the performance ratio between the two competitors varies so much among the samples.

For the cmark-benchinput.md test, I also compared the memory consumption with the memusage(1) utility. Just few numbers from it:

	MD4C	Cmark
Count of `malloc()` calls	5	3
Count of `realloc()` calls	36	1304578
Count of `calloc()` calls	1	1587507
Count of `free()` calls	10	2369043
Heap peak	275504032 bytes (~262.74 MB)	495063570 bytes (~472.13 MB)
Heap total	275508128 bytes (~262.75 MB)	504309058 bytes (~ 480.94 MB)

Given that size of the input document is 110648441 bytes (~105.52 MB) and that MD4C’s simplistic md2html utility renders the output into one big growing memory buffer before outputting it, it shows its overhead (approx. heap peak - 2 * document size) is pretty low.

But keep in mind, the memory comparison is very unfair as MD4C is SAX-style parser and does not build any AST representation of the document. That gives MD4C huge advantage in this regard. Given the number of allocation calls in Cmark, this for sure also plays some role for the performance.

jgm · August 2, 2017, 5:55pm

This is very impressive!

It would be interesting to construct a test that imports cmark and uses its node-constructing functions to create an AST in response to the events generated by MD4C. (That is, use MD4C’s parser to generate a cmark-style node tree.) Then benchmark this against cmark’s own parser. This might give a more meaningful (apples to apples) comparison. We’d see how much of cmark’s extra time and memory usage is due to construction of the AST, and how much is attributable to the different design of the parser.

mity · August 3, 2017, 9:44am

@jgm I tried to cook something. It is still very buggy.
From the standard spec suite, 191 tests still fail.

It seems I do not really understand how the AST should look in many cases. Maybe if you may look at it and give some advice. The most relevant code for you should be in md2html+ast/build_ast.c.

For example code spans, code inlines, tight lists do not work at all. I guess you could tell from the 1st sight, what I should do differently to produce AST palatable by cmark_render_html().

Neverthless I tried few tests where it seems to work already with updated test script. The script now also collects some heap info about all the tests. (But many tests are commented out as they are broken for md2html-ast)

Few notes:

Cmark 0.28 is used with few additions (node setters who accept strings not terminated with zero byte (they have extra size argument) so MD4C callbacks do not need to create temp. buffers to add just the zero, call the Cmark function who just again uses it to just call strlen().
There is still some slowdown caused by the fact that cmark_render_html() returns zero-terminated string forcing the caller to do strlen() on it, making effectively an extra iteration over all output. This makes some slowdown in md2html+ast on its own. It would be good to get rid of it.
IMHO, most users of the API likely have to face the same problems.
I did not study if Cmake internally uses strlen() as extensively on its own. If it does, it may play big role in the performance difference.

So the results gathered so far:

/home/mity/prj/md4c/bin/md2html/md2html [performance]:
samples/empty.md: mean = 0.0000, median = 0.0000, stdev = 0.0000
samples/long-block-oneline.md: mean = 0.0700, median = 0.0700, stdev = 0.0000
samples/many-atx-headers.md: mean = 0.0800, median = 0.0800, stdev = 0.0000
samples/many-paragraphs.md: mean = 0.0800, median = 0.0800, stdev = 0.0000


/home/mity/prj/md4c+ast/bin-release/md2html+ast/md2html+ast [performance]:
samples/empty.md: mean = 0.0000, median = 0.0000, stdev = 0.0000
samples/long-block-oneline.md: mean = 0.0700, median = 0.0700, stdev = 0.0000
samples/many-atx-headers.md: mean = 0.3200, median = 0.3200, stdev = 0.0000
samples/many-paragraphs.md: mean = 0.3170, median = 0.3200, stdev = 0.0048


/home/mity/prj/cmark/build/src/cmark [performance]:
samples/empty.md: mean = 0.0000, median = 0.0000, stdev = 0.0000
samples/long-block-oneline.md: mean = 0.1060, median = 0.1100, stdev = 0.0052
samples/many-atx-headers.md: mean = 0.4610, median = 0.4600, stdev = 0.0032
samples/many-paragraphs.md: mean = 0.4800, median = 0.4800, stdev = 0.0000


/home/mity/prj/md4c/bin/md2html/md2html [memory consumption]:
samples/empty.md:  heap total: 37480, heap peak: 37480, stack peak: 464
samples/long-block-oneline.md:  heap total: 112119465, heap peak: 112117673, stack peak: 1600
samples/many-atx-headers.md:  heap total: 58314592, heap peak: 58310496, stack peak: 1600
samples/many-paragraphs.md:  heap total: 36425988, heap peak: 36421892, stack peak: 1600


/home/mity/prj/md4c+ast/bin-release/md2html+ast/md2html+ast [memory consumption]:
samples/empty.md:  heap total: 37601, heap peak: 37601, stack peak: 1104
samples/long-block-oneline.md:  heap total: 167119864, heap peak: 167113976, stack peak: 1568
samples/many-atx-headers.md:  heap total: 324175128, heap peak: 309560496, stack peak: 1568
samples/many-paragraphs.md:  heap total: 317397400, heap peak: 301171888, stack peak: 1568


/home/mity/prj/cmark/build/src/cmark [memory consumption]:
samples/empty.md:  heap total: 5649, heap peak: 5504, stack peak: 4608
samples/long-block-oneline.md:  heap total: 229714616, heap peak: 169710232, stack peak: 8640
samples/many-atx-headers.md:  heap total: 342620248, heap peak: 342614784, stack peak: 8640
samples/many-paragraphs.md:  heap total: 344231120, heap peak: 344225664, stack peak: 8640

So if these preliminary results can be trusted, the performance md2html+ast is somewhere between md2html and cmark.

EDIT: I have added the bench.sh script into the https://github.com/mity/md4c-ast repo for case someone wants to play with it. Follow its README to perform the tests on your machine.

jgm · August 3, 2017, 4:52pm

Great! Independently of benchmarking, I think it would be valuable to have an optional module for producing an AST with md4c.

With code blocks and spans, I think the problem is that md2html+ast is producing this structure:

CODE
    -- CODE "print('hi')"
    -- CODE "\n"
    -- CODE "exit(0)"

rather than:

CODE "print('hi')\nexit(0"

CODE and CODE_BLOCK nodes are leaf nodes; they have “literal” string content but no children.

For tight lists, I’ll bet the problem is that md4c isn’t producing a PARAGRAPH node to hold the contents of the list item. If you look at the CommonMark.dtd, you’ll see that an ITEM can’t contain inline nodes directly; they have to go in a paragraph container.

Thanks for the remarks about strlen. We went for simplicity in returning a 0-terminated string, but you’re certainly right that it would be more efficient to return a string and a length.

mity · August 3, 2017, 6:58pm

Thanks I will take a look at the CommonMark.dtd, I was not ware of it.

A pity that “leaf node” is not the same as “leaf block”, or that “node with single textual child node” cannot be treated semantically the same as “leaf node with literal set”. IMHO it would make usage of the API definitely easier and much more flexible, and apps creating an AST from scratch could approach the leaf blocks in more uniform way.

jgm · August 3, 2017, 9:23pm

It shouldn’t be too hard to fix this in your build_ast.c. You just need to have the MD_TEXT_CODE event append to the literal content of the current node, rather than adding a new child node.

In the case of lists, you just need to make sure that when you’re adding inline nodes, the parent node can accept them. Note: cmark_node_append_child will return 0 if the parent node is not allowed to contain the node you’re appending, so you can check return status.