MD4C: New implementation in C

mity · November 24, 2016, 2:16pm

Hello folks.

I would like to introduce here new C implementation, MD4C. It is very new but I hope it already deserves some attention.

Why?

Because I needed something with support of Windows Unicode (UTF-16 LE) natively and it would be too much work incorporate that into existing solutions which are too integrated with some UTF-8 code everywhere.
Because cmark is too big to my liking (according to cloc, it is 26.2K lines of code, MD4C has currently 3.7K lines of code).
Because cmark, given it is reference implementation, could never accept any extensions and maintaining forks is often too much work.
Because I disliked Hoedown’s API. (So many callbacks? Should API enforce some kind of its own buffer to an application using it? Etc.)
And last but not least, because I thought it cannot be difficult. Here I was really wrong. Especially the pathological_tests.py from Cmark made my head almost explode.

Current status:

All major CommonMark features are implemented. As of 622 tests provided by CommonMark specification 0.27: 568 tests pass and 54 fail. I hope it will be eventually even better in a foreseeable future.
MD4C deals fine with the infamous cmark’s /pathological_tests.py testsuite.
Few extensions like tables or permissive autolinks are implemented. (They have to be explicitly enabled.)

Performance:

Not yet tested it thoroughly but on my Windows machine with the benchinput.md and pathological_tests.py from cmark, I get these results:

Cmark:

$ time build/src/cmark bench/benchinput.md >/dev/null
real    0m1.634s
user    0m0.015s
sys     0m0.000s

$ time test/pathological_tests.py --program build/src/cmark.exe >/dev/null
real    0m0.718s
user    0m0.015s
sys     0m0.045s

MD4C:

$ time md2html/md2html.exe ../../cmark/bench/benchinput.md >/dev/null
real    0m1.391s
user    0m0.000s
sys     0m0.000s

$ time test/pathological_tests.py --program ../md4c/bin/md2html/md2html.exe >/dev/null
real    0m0.382s
user    0m0.015s
sys     0m0.030s

Disclaimer:

Please note it is still quite fresh piece of code. It needs more testing. So I am pretty sure there are still some bugs lurking.

If you encounter any, please report them, ideally as GitHub issues.

Big thanks:

Big thanks belong to all those people working on the CommonMark specification, cmark test suite and other stuff. MD4C would be much worse then it currently is without it. I especially appreciate that the CommonMark specification tests are responsible for most of the great code coverage.

jgm · November 25, 2016, 11:00am

Interesting! Can you say something about the general parsing approach you took – e.g. how it differs from cmark’s?

Does md4c have a modular extension system, and if so how does it work?

As for code size, I think your comparison is a bit misleading. Most of the size you’re recording for cmark is from scanners.c and scanners.h, which are automatically generated by re2c from scanners.re and shouldn’t really be considered part of the source. (We include it in the repository to avoid a build dependency on re2c.) Omitting these, I get 6k lines of code, including the build system (but excluding tests). And of course a lot of this is for functionality that md4c doesn’t seem to have – multiple output formats (man, xml, latex, html, commonmark itself, including sensible hard wrapping), for example, and an iterator interface for modifying the AST. Note that because we support multiple output formats, we have to translate entities, so part of the code is a giant list of character references and their corresponding unicode characters…

jgm · November 25, 2016, 11:51am

On the benchmarks: a more reliable way to do this is to use make bench from cmark. This will run the program on the benchmark input ten times and take mean and median of the user time.

On the machine I’m using now, cmark does quite a bit better on the benchmark:

md2html:

~/src/cmark % make bench PROG=md2html
{ for x in `seq 1 10` ; do \
		/usr/bin/env time -p md2html </dev/null >/dev/null ; \
		/usr/bin/env time -p md2html bench/benchinput.md >/dev/null ; \
		done \
	} 2>&1  | grep 'real' | awk '{print $2}' | python3 'bench/stats.py'
mean = 1.6660, median = 1.6700, stdev = 0.0097

cmark:

~/src/cmark % make bench PROG=./cmark
{ for x in `seq 1 10` ; do \
		/usr/bin/env time -p ./cmark </dev/null >/dev/null ; \
		/usr/bin/env time -p ./cmark bench/benchinput.md >/dev/null ; \
		done \
	} 2>&1  | grep 'real' | awk '{print $2}' | python3 'bench/stats.py'
mean = 1.2850, median = 1.2750, stdev = 0.0324

I don’t know what’s going on with the numbers you’ve posted above; why is real so much higher than user and why is user 0 in your md2html test?

jgm · November 25, 2016, 12:02pm

Another nice diagnostic from the cmark Makefile is make fuzztest, which runs the program against ten randomly generated 200k strings of UTF-8 characters.

cmark is nice and even:

~/src/cmark % make fuzztest PROG=./cmark
{ for i in `seq 1 10`; do \
	  cat /dev/urandom | head -c 2000000   | iconv -f latin1 -t utf-8 | tee fuzz-$i.txt | \
		/usr/bin/env time -p ./cmark >/dev/null && rm fuzz-$i.txt ; \
	done } 2>&1 | grep 'user\|abnormally'
user         0.04
user         0.05
user         0.05
user         0.05
user         0.04
user         0.05
user         0.04
user         0.05
user         0.05
user         0.05

md2html doesn’t do so well on this:

~/src/cmark % make fuzztest PROG=md2html
{ for i in `seq 1 10`; do \
	  cat /dev/urandom | head -c 2000000   | iconv -f latin1 -t utf-8 | tee fuzz-$i.txt | \
		/usr/bin/env time -p md2html >/dev/null && rm fuzz-$i.txt ; \
	done } 2>&1 | grep 'user\|abnormally'
Command terminated abnormally.
user        20.88
user         0.15
user         0.17
user         0.31
Command terminated abnormally.
user        21.04
Command terminated abnormally.
user        20.99
user         0.22
user         0.23
user         0.38
Command terminated abnormally.
user        20.80

mity · November 25, 2016, 12:16pm

Ad the comparison of sizes:

I see. However I took it more from a perspective of someone who considers embedding into another application rather then from a point of view of its code maintainance. Generated code or not, it would get in. In contrast MD4C is designed to be easily embeddable in 3rd part apps, not as a general purpose utility, although it should be possible to write such utility on top of it.

About the parsing approach:

I don’t know enough about Cmark to make a direct comparison. There were few design decisions made just at the beginning of the project:

As much as possible, MD4C does not work with any buffers but it mainly just provides pointer (and size) into the input buffer back to the caller through the callback functions. This minimizes a need for any temporary buffers and keeps memory allocation under control.
The main goal is to have good parser which can be easily reused in other apps, not on creation of just a command line conversion utility. (Of course an utility (md2html) implementing a simple renderer was created as a testing app.)

The block parsing goes as follows:

Block parser is heavily line-oriented. The core of the work is in the function md_analyze_line(). On input it gets an offset (where the examined line starts) and also pointer to some previous analyzed line which decided about starting of the currently built block (“pivot line”). The function determines line indentation, container marks (and localizes the container nesting by comparison to current containers stack), type of line (e.g. setext underline, blank line, textual line etc.) and of course also where the line ends.
Then md_process_line() is called. This just determines whether the line as analyzed is compatible with the pivot line and whether the same block continues. This function builds vector of blocks and lines info in a relatively condensed fashion (to keep memory consumption in reasonable limits).

Container block starter/closer.
Leaf block head.
Line info (start/end offset, stripping any indentation or container marks from it)

The function md_analyze_line() also manages a stack of current nesting in containers (block quotes and lists). On any change of the nesting it enforces leaf block start/end as appropriate. Each node (struct MD_CONTAINER) also keeps offset into the vector where it opener and this allows to change its flags in processing of some later lines. This is used e.g. to change tight list into a loose one when we see a blank line inside of it. When entering/leaving a container, the appropriate block starter/closer node into the vector mention above is added.
When the vector is completely built (i.e. we reached end of document), we simply iterate over it. For container starter/closer, the appropriate renderer callbacks is called directly. For leaf block, the block info and vector of its line (simply subvector of the main vector) is passed to function able to handle that kind of bloc. Most tricky is of course a normal block with sequence of inlines or spans.

The inlines are parsed as follows:

On input we have a sequence of lines (start and end of them as analyzed above).
We iterate through valid chars inside the lines (i.e. skipping the “gaps” between them) of the whole block contents and collect “marks”, i.e. fundamental characters which need our attention in decision making whether they form some Markdown element or not.
We “resolve” the marks. I.e. we enumerate them from left to right over the collected marks several times.

Each “mark” is resolved a bit differently, but its mostly about finding closer marks to opener marks, e.g. for ‘[’ and ‘]’ we pair them so that any ‘[’ is added int o stack and ‘]’ is then paired with the top ‘]’.
The multiple passes are done to reflect precedence of various marks. (So first pass does e.g. code spans, 2nd pass links and 3rd pass emphasis). The important point is that subsequent passes skip marks inside “resolved pairs”.
Marks of same precedence priority (e.g. ‘*’ and ‘_’) are done in a single pass.
Recursion into link/image contents is then done manually for more nested passes.

Resolver of ‘[’ and ‘]’ is a bit exceptional because links need more context for resolving (e.g. how they are nested in each other). So it only pairs the brackets together and builds list of “potential links”. Then md_resolve_links() is called which handles the context and either really resolve the bracket pair as a link (usually expanding the closer mark to cover also the ‘( … )’ or the 2nd ‘[…]’) or keeps the marks unresolved so the marks are ignored in any further processing.
When all marks are resolved (or decided to be ignored), we do yet another pass when we just call callbacks for enter/leave span (when we reach any resolved mark) or a textual callback for text between any two resolved marks.

The “marks” (struct MD_MARK) and its management is the key why MD4C is quite fast. Each MD_MARK has members next and prev which may point to another marks and forms some chains or lists.

This is used in many situations, e.g.

Many resolver functions (e.g. for ‘[’ ‘]’ or ‘*’) manage list of seen-but-not-yet-resolved potential openers so when closer is reached, opener can be just found by getting last element of the list.
When finally resolved, opening mark’s next points to its closer and similarly, closer’s prev points to its opener. So again we can iterate over the block contents and skip nested spans effectively.

Does md4c have a modular extension system?

No. Just some set of hard-coded extensions which can be turned on or off with some flags when calling md_parse().

mity · November 25, 2016, 12:17pm

Maybe because it was done inside a MSYS environment on Windows.

mity · November 25, 2016, 12:20pm

Have you used a Release build? It makes quite a big difference. On my machine the debug build times are about two times worse then release.

Try

$ export CMAKE_BUILD_TYPE=Release

before running the cmake.

mity · November 25, 2016, 12:22pm

I will try. I plan to do also some proper fuzzing testing with American Fuzzy Lop. But I have to learn how to use it first

jgm · November 25, 2016, 2:25pm

I built the Release version and indeed it beats cmark on our benchmark, by quite a good margin!

cmark:

mean = 1.3980, median = 1.2650, stdev = 0.4438

md2html:

mean = 0.6720, median = 0.6700, stdev = 0.0042

fuzztest still indicates a lot less consistency than with cmark (and some crashes).

Thanks for the writeup about parsing strategy.

As for size: I don’t see why md4c is any easier than cmark to embed in other applications. For cmark we have a portable build system and we generate both a static and a dynamic library, which can be used in the normal way. Anyway, for these purposes the relevant comparison is probably not lines of source, but rather the size of the static library, which (on the system I’m now using) is 57k for md4c and 271k for cmark.

jgm · November 25, 2016, 2:54pm

cmark’s renderers can operate on nodes that are created programatically (in which case there isn’t an input string that was parsed), and the iterator mechanism allows you to insert new nodes into a parsed document (with string contents that might not be substrings of the original input string) or modify existing nodes. But if you don’t need the flexibility to do this, then your approach, storing pointers into the input string, is terrific: it avoids lots of memory allocations and also gives you easy access to source map information. I would guess that the memory allocation savings are a big part of the reason for the speed advantage over cmark.

The way you deal with “marks” seems very similar to how we deal with [, *, and _ (and also smart quotes), to avoid backtracking. However, you extend a similar approach to all the other marks. I’m still not persuaded that your approach here is better, though it might turn out to be.

mity · November 25, 2016, 6:01pm

I would also argue with uniformity of design and with dividing the algorithm into relatively easy-to-understand steps. If you do some one-pass-does all for some set of marks, it likely gets messy easily.

To a degree, it’s about personal preference. I mainly meant by it that MD4C parser is meant to be easily usable like the SQLite amalgamation directly in the 3rd party app, instead of using relatively large library a an external dependency.

BTW, how simple is it to get just the parser without any renderer? Consider you make a GUI app which does its own rendering, why should it include ever-growing set of built-in renderers with the package?

As for a new comer to Cmark package, there is no clear distinction which headers are the public one and which are just internal.

jgm · November 25, 2016, 11:23pm

Currently, you have to take the whole library as a unit. But my impression was that C compilers/linkers were smart enough not to include unused objects when you link against a static library, so the renderers would not be included unless you actually used them.

As for the headers, the only public ones are cmark.h, cmark_export.h, and cmark_version.h. I suppose this isn’t explicit enough, but these are the only ones that get installed with make install and they’re the only ones documented in the man page.

mity · December 4, 2016, 11:27pm

Just a short progress info:

Improved stability a lot. Spent hours with Valgrind and American Fuzzy Lop and fixed all what has been found so far. Afterwards a complete AFL cycle performed without finding any new issues.

CommonMark 0.27 test suite:

610 tests pass (~98%)
12 tests fail (~2%)

Cmark’s make bench:

CMark: mean = 0.6590, median = 0.6600, stdev = 0.0032
MD4C: mean = 0.4410, median = 0.4400, stdev = 0.0032 (~ 33.1% faster)

Cmark’s pathological_tests.py:

Cmark: 0m0.803s
MD4C: 0m0.517s (~ 35.6% faster)

(Tested with current git heads of MD4C and Cmark on a Linux box)

jgm · December 6, 2016, 12:48pm

Those are good numbers! Here’s a comparison run on an iMac (make bench):

cmark: 1.275s
github’s cmark fork, which uses a custom slab allocator: 0.895s
md4c’s md2html: 0.670s

It’s a more pronounced difference than you’re seeing on linux.

mity · December 16, 2016, 10:53am

Let me to bring more hot news from MD4C.

Most notably, MD4C (master head) now passes all 622 CommonMark 0.27 tests.
The parser now has a more solid interface (md4c.h) which is unlikely to change anymore in some fundamentally incompatible ways.
The md2html utility has been significantly refactored so that its HTML renderer can be relatively easily reused. (Some more work may yet happen on this front.)

Overall, I would consider it to be a beta quality and I will think about making a release in a foreseeable future.

tin-pot · December 16, 2016, 11:57am

Congratulations - great work!

I’m looking forward to port and add my own front end to it (ie, as a replacement for md2html), but the latest interface changes kept me on my toes to adapt my - only slightly differing - clone of md2html, so this was a bit daunting

And would you still incorporate some code improvements (in my view), like using table lookup for character classification instead of complex comparison and logical expressions (packaged as macros), before making an “official” release?

mity · December 16, 2016, 12:24pm

I’m looking forward to port and add my own front end to it (ie, as a replacement for md2html), but the latest interface changes kept me on my toes to adapt my - only slightly differing - clone of md2html, so this was a bit daunting

I can imagine changes of these kinds:

some new constants in the enums (e.g. MD_TEXTTYPE, MD_SPANTYPE, MD_BLOCKTYPE) are added as new extensions (or new features to CommonMark) are added, but it should keep source-level compatibility. Other then that the API of the parser is more or less as flexible as we need it.
Related new ‘detail’ structures might be added.
Current detailed strructures might be enriched where it is useful.

All such changes should be mostly source-level compatible and I think there is already time to start taking this into account.

Other then, the way how the interface looks and how it is flexible should be hopefully sufficient for long time. (Fingers crossed.)

And would you still incorporate some code improvements (in my view),
like using table lookup for character classification instead of complex
comparison and logical expressions (packaged as macros), before making
an “official” release?

According to my analysis with gprof there are three functions which deserve some optimizations:

md_collect_marks()
md_analyze_line()
render_html_escaped() (but this is on the HTML renderer side)

Each of these takes about between 20% and 30% of time.

There already is such optimization based on table lookup for some time in md_collect_marks() and it helped way a lot for its performance.

I tried implementing something similar for md_analyze_line() but there was no measurable gain so I did not commit that. I believe that’s because most of the time the function spends in just two lines when scanning for end of the line (just after the label done:).

I did not care about the renderer side much yet and I am unsure whether the render_html_escaped() may be considerably faster then it is.

Impact of all other functions is more or less negligible. It’s IMHO not worth of making them more complicated.

BTW, when talking about performance, I also tried profile guided optimization with gcc using long calibrating input created by concatenating MD4C’s README.md 100 times together. The result was another 30% performance gain over normal Release build without the PGO (measured with Cmark’s make bench).

I don’t know what optimizations gcc did although such analysis would be interesting. I would guess there was some aggressive loop unrolling and/or inlining because the binary was about 10% or 20% larger then without PGO.

nwellnhof · December 22, 2016, 5:03pm

Great work! Embedding C source directly is what we do for Apache Lucy, so MD4C looks like the ideal replacement for cmark (which is getting more and more bloated lately).

mity · January 1, 2017, 11:38pm

On the same Linux box as my previous report, I now get this from Cmark’s make bench:

MD4C (head): mean = 0.3590, median = 0.3600, stdev = 0.0120
Cmark (0.27.1): mean = 0.6360, median = 0.6400, stdev = 0.0052
Cmark (head): mean = 1.0740, median = 1.0700, stdev = 0.0084

So MD4C got better then the last time, for about 20%. It’s mostly due my today’s work on optimizing the two hottest loops in the parser.

(But Cmark now seems to perform much slower then it used to. Seems like some performance regression has been introduced there recently. @jgm Are you aware of it?)

jgm · January 4, 2017, 6:18am

It seems that the performance hit from a PR I recently merged adding accurate source extent information was too high. I’ve reverted that commit for now and suggested that the feature be made conditional on a flag.