Interesting! Can you say something about the general parsing approach you took – e.g. how it differs from cmark’s?
Does md4c have a modular extension system, and if so how does it work?
As for code size, I think your comparison is a bit misleading. Most of the size you’re recording for cmark comes from scanners.c and scanners.h, which are generated automatically by re2c from scanners.re and shouldn’t really be counted as part of the source. (We include them in the repository to avoid a build dependency on re2c.) Omitting these, I get 6k lines of code, including the build system but excluding tests. And of course a lot of this is for functionality that md4c doesn’t seem to have: multiple output formats (man, XML, LaTeX, HTML, and CommonMark itself, including sensible hard wrapping), for example, and an iterator interface for modifying the AST. Note that because we support multiple output formats, we have to translate entities, so part of the code is a giant list of character references and their corresponding Unicode characters…
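
To give a rough idea of why entity translation costs lines, here is a minimal sketch of a name-to-character lookup. This is not cmark’s actual code: the `entity` struct, the five-entry table, and `lookup_entity` are all made up for illustration, and a real table has a couple of thousand entries.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, heavily abbreviated entity table (illustration only).
 * Entries must stay sorted by name in strcmp order so bsearch works. */
struct entity {
    const char *name; /* reference name without the leading '&' and trailing ';' */
    const char *utf8; /* replacement text, UTF-8 encoded */
};

static const struct entity entities[] = {
    {"amp",  "&"},
    {"copy", "\xC2\xA9"}, /* U+00A9 COPYRIGHT SIGN */
    {"gt",   ">"},
    {"lt",   "<"},
    {"quot", "\""},
};

/* bsearch comparator: the key is a plain C string, the element a table row. */
static int cmp_entity(const void *key, const void *elem)
{
    const struct entity *e = elem;
    return strcmp(key, e->name);
}

/* Returns the UTF-8 replacement for a named reference, or NULL if the
 * name is not in the table. Entity names are case-sensitive. */
const char *lookup_entity(const char *name)
{
    const struct entity *e = bsearch(name, entities,
                                     sizeof entities / sizeof entities[0],
                                     sizeof entities[0], cmp_entity);
    return e ? e->utf8 : NULL;
}
```

A renderer that cannot emit `&copy;` verbatim (man or LaTeX output, say) would call something like `lookup_entity("copy")` and write the resulting bytes; an HTML-only converter can pass most references through untouched and skip the table entirely.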