MD4C: New implementation in C

mity · November 25, 2016, 12:16pm

Regarding the comparison of sizes:

I see. However, I took it more from the perspective of someone who considers embedding it into another application rather than from the point of view of code maintenance. Generated code or not, it would get in. In contrast, MD4C is designed to be easily embeddable in 3rd-party apps, not as a general-purpose utility, although it should be possible to write such a utility on top of it.

About the parsing approach:

I don’t know enough about Cmark to make a direct comparison. There were a few design decisions made right at the beginning of the project:

As much as possible, MD4C does not work with any buffers of its own; it mainly just passes a pointer (and size) into the input buffer back to the caller through the callback functions. This minimizes the need for temporary buffers and keeps memory allocation under control.
The main goal is to have a good parser which can be easily reused in other apps, not to create just a command-line conversion utility. (Of course, a utility (md2html) implementing a simple renderer was created as a testing app.)
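To illustrate the first point, here is a minimal sketch of the "pointer and size into the input buffer" callback style. It is not MD4C's actual API: `slice_cb`, `scan_lines`, `slice_stats` and `count_slice` are hypothetical names made up for this example, and the "parser" only splits the input into non-empty lines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: receives a slice of the caller's own buffer.
 * A non-zero return value aborts the scan. */
typedef int (*slice_cb)(const char *text, size_t size, void *userdata);

/* Reports every non-empty line of `input` to `cb` without copying anything:
 * each call gets a pointer into `input` plus the length of the line. */
int scan_lines(const char *input, size_t size, slice_cb cb, void *userdata)
{
    size_t off = 0;

    assert(cb != NULL);
    while (off < size) {
        size_t beg = off;
        while (off < size && input[off] != '\n')
            off++;
        if (off > beg) {
            int ret = cb(input + beg, off - beg, userdata);
            if (ret != 0)
                return ret;
        }
        if (off < size)
            off++;  /* step over the '\n' */
    }
    return 0;
}

/* Example consumer: counts slices and checks they all lie inside the input. */
struct slice_stats {
    const char *base;
    size_t base_size;
    size_t count;
    size_t bytes;
    int all_inside;
};

int count_slice(const char *text, size_t size, void *userdata)
{
    struct slice_stats *st = userdata;

    st->count++;
    st->bytes += size;
    if (text < st->base || text + size > st->base + st->base_size)
        st->all_inside = 0;
    return 0;
}
```

The scanner itself allocates nothing; whether to copy, concatenate or stream the slices is entirely up to the callback, which is what keeps memory use in the caller's hands.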

The block parsing goes as follows:

The block parser is heavily line-oriented. The core of the work is in the function md_analyze_line(). On input it gets an offset (where the examined line starts) and also a pointer to a previously analyzed line which started the block currently being built (the “pivot line”). The function determines the line indentation, container marks (and locates the container nesting by comparison with the current container stack), the type of line (e.g. setext underline, blank line, textual line etc.) and of course also where the line ends.
Then md_process_line() is called. This just determines whether the line as analyzed is compatible with the pivot line and whether the same block continues. This function builds a vector of block and line info in a relatively condensed fashion (to keep memory consumption within reasonable limits). The vector holds these kinds of entries:

Container block starter/closer.
Leaf block head.
Line info (start/end offsets, with any indentation or container marks stripped).

The function md_analyze_line() also manages a stack of the current nesting in containers (block quotes and lists). On any change of the nesting it enforces a leaf block start/end as appropriate. Each node (struct MD_CONTAINER) also keeps the offset into the vector where its opener is stored, which allows its flags to be changed while processing later lines. This is used e.g. to change a tight list into a loose one when we see a blank line inside of it. When entering/leaving a container, the appropriate block starter/closer node is added to the vector mentioned above.
When the vector is completely built (i.e. we have reached the end of the document), we simply iterate over it. For a container starter/closer, the appropriate renderer callback is called directly. For a leaf block, the block info and the vector of its lines (simply a subvector of the main vector) are passed to a function able to handle that kind of block. The trickiest is of course a normal block with a sequence of inlines or spans.
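The flat vector and the "remember where the opener is" trick can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not MD4C's real data layout: `struct entry`, `struct block_vec`, `struct container`, `FLAG_LOOSE` and all function names are hypothetical, the vector has a fixed capacity, and the leaf handler only counts what it receives.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES 64
#define FLAG_LOOSE  0x1u

enum entry_kind {
    ENTRY_CONTAINER_OPEN, ENTRY_CONTAINER_CLOSE, ENTRY_LEAF_HEAD, ENTRY_LINE
};

struct entry {
    enum entry_kind kind;
    unsigned flags;   /* ENTRY_CONTAINER_OPEN: e.g. FLAG_LOOSE */
    size_t n_lines;   /* ENTRY_LEAF_HEAD: number of ENTRY_LINE entries that follow */
    size_t beg, end;  /* ENTRY_LINE: offsets into the input buffer */
};

struct block_vec { struct entry items[MAX_ENTRIES]; size_t count; };

/* Appends a zeroed entry of the given kind and returns its index. */
size_t vec_push(struct block_vec *v, enum entry_kind kind)
{
    struct entry e = { kind, 0u, 0, 0, 0 };

    assert(v->count < MAX_ENTRIES);
    v->items[v->count] = e;
    return v->count++;
}

/* Appends a leaf head followed by n_lines (empty) line entries. */
void vec_push_leaf(struct block_vec *v, size_t n_lines)
{
    size_t head = vec_push(v, ENTRY_LEAF_HEAD);

    v->items[head].n_lines = n_lines;
    for (size_t i = 0; i < n_lines; i++)
        vec_push(v, ENTRY_LINE);
}

/* A container on the nesting stack remembers where its opener lives. */
struct container { size_t opener_index; };

/* Called when a later line (e.g. a blank one) turns a tight list loose. */
void mark_loose(struct block_vec *v, const struct container *c)
{
    v->items[c->opener_index].flags |= FLAG_LOOSE;
}

struct walk_stats { size_t opened, closed, loose, leaves, lines; };

/* Stand-in for a leaf handler: receives the leaf's lines as a subvector. */
void handle_leaf(const struct entry *lines, size_t n_lines, struct walk_stats *st)
{
    (void)lines;
    st->leaves++;
    st->lines += n_lines;
}

/* Final iteration over the finished vector. */
void walk(const struct block_vec *v, struct walk_stats *st)
{
    for (size_t i = 0; i < v->count; i++) {
        const struct entry *e = &v->items[i];
        switch (e->kind) {
        case ENTRY_CONTAINER_OPEN:
            st->opened++;
            if (e->flags & FLAG_LOOSE)
                st->loose++;
            break;
        case ENTRY_CONTAINER_CLOSE:
            st->closed++;
            break;
        case ENTRY_LEAF_HEAD:
            handle_leaf(e + 1, e->n_lines, st);
            i += e->n_lines;  /* skip the lines just handed over */
            break;
        case ENTRY_LINE:
            break;            /* not reached when the vector is well formed */
        }
    }
}
```

Because the opener is only an entry in the vector and nothing has been emitted yet, flipping its flag after the fact costs one indexed write; the renderer sees the final state when the vector is walked.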

The inlines are parsed as follows:

On input we have a sequence of lines (start and end of them as analyzed above).
We iterate through valid chars inside the lines (i.e. skipping the “gaps” between them) of the whole block contents and collect “marks”, i.e. fundamental characters which need our attention in decision making whether they form some Markdown element or not.
We “resolve” the marks, i.e. we iterate over the collected marks from left to right several times.

Each “mark” is resolved a bit differently, but it’s mostly about matching closer marks to opener marks, e.g. for ‘[’ and ‘]’ we pair them so that any ‘[’ is pushed onto a stack and ‘]’ is then paired with the top ‘[’.
The multiple passes are done to reflect the precedence of the various marks. (So the first pass does e.g. code spans, the 2nd pass links and the 3rd pass emphasis.) The important point is that subsequent passes skip marks inside “resolved pairs”.
Marks of the same precedence (e.g. ‘*’ and ‘_’) are done in a single pass.
Recursion into link/image contents is then done manually for more nested passes.

The resolver of ‘[’ and ‘]’ is a bit exceptional because links need more context for resolving (e.g. how they are nested in each other). So it only pairs the brackets together and builds a list of “potential links”. Then md_resolve_links() is called, which handles the context and either really resolves the bracket pair as a link (usually expanding the closer mark to also cover the ‘( … )’ or the 2nd ‘[…]’) or keeps the marks unresolved so that they are ignored in any further processing.
When all marks are resolved (or decided to be ignored), we do yet another pass in which we just call callbacks for entering/leaving a span (when we reach any resolved mark) or a text callback for the text between any two resolved marks.
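The collect-then-resolve idea with precedence passes can be sketched like this. It is a heavily simplified illustration, not MD4C's code: only ‘`’ and ‘*’ are collected, links are left out, the pairing rule is a plain "first unpaired mark opens, next one closes" toggle instead of the real Markdown rules, and `struct mark`, `collect_marks`, `resolve_pass` and `resolve_marks` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MARKS  32
#define UNRESOLVED (-1)

struct mark {
    char ch;     /* the mark character, '`' or '*' in this sketch */
    size_t off;  /* offset of the mark in the block text */
    int pair;    /* index of the partner mark, or UNRESOLVED */
};

/* Step 1: collect the characters that may form some element. */
size_t collect_marks(const char *text, size_t size, struct mark *marks)
{
    size_t n = 0;

    for (size_t i = 0; i < size && n < MAX_MARKS; i++) {
        if (text[i] == '`' || text[i] == '*') {
            marks[n].ch = text[i];
            marks[n].off = i;
            marks[n].pair = UNRESOLVED;
            n++;
        }
    }
    return n;
}

/* Step 2: one left-to-right pass pairing marks of character `ch`.
 * Pairs resolved by an earlier pass are jumped over as a whole, so marks
 * nested inside them are never considered. */
void resolve_pass(struct mark *marks, size_t n, char ch)
{
    int opener = UNRESOLVED;

    assert(n <= MAX_MARKS);
    for (size_t i = 0; i < n; i++) {
        if (marks[i].pair != UNRESOLVED) {
            if ((size_t)marks[i].pair > i)
                i = (size_t)marks[i].pair;  /* jump to the closer */
            continue;
        }
        if (marks[i].ch != ch)
            continue;
        if (opener == UNRESOLVED) {
            opener = (int)i;
        } else {
            marks[opener].pair = (int)i;
            marks[i].pair = opener;
            opener = UNRESOLVED;
        }
    }
}

/* Passes run in precedence order: code spans first, emphasis afterwards. */
void resolve_marks(struct mark *marks, size_t n)
{
    resolve_pass(marks, n, '`');
    resolve_pass(marks, n, '*');
}
```

For the text "*a `b*` c*" the first pass pairs the two backticks, and the second pass pairs the first and the last ‘*’; the ‘*’ inside the code span stays unresolved and would later be emitted as ordinary text.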

The “marks” (struct MD_MARK) and their management are the key reason why MD4C is quite fast. Each MD_MARK has members next and prev which may point to other marks and form chains or lists.

This is used in many situations, e.g.

Many resolver functions (e.g. for ‘[’ ‘]’ or ‘*’) manage a list of seen-but-not-yet-resolved potential openers, so when a closer is reached, the opener can be found just by getting the last element of the list.
When finally resolved, the opening mark’s next points to its closer and similarly, the closer’s prev points to its opener. So again we can iterate over the block contents and skip nested spans efficiently.
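Both uses of next and prev can be shown in a small sketch. The struct below is a hypothetical stand-in, not the real MD_MARK: marks live in an array, prev and next are array indices, and only ‘[’ and ‘]’ are handled. While an opener is still pending, its prev links it to the previously pending opener; once resolved, next and prev connect opener and closer.

```c
#include <assert.h>
#include <stddef.h>

#define NONE (-1)

struct mark {
    char ch;
    int prev;  /* pending opener: previous pending opener; closer: its opener */
    int next;  /* resolved opener: its closer */
};

/* Fills marks[] from a string of mark characters; returns the count. */
int init_marks(struct mark *marks, int cap, const char *chars)
{
    int n = 0;

    for (; n < cap && chars[n] != '\0'; n++) {
        marks[n].ch = chars[n];
        marks[n].prev = NONE;
        marks[n].next = NONE;
    }
    return n;
}

/* Chain of not-yet-resolved openers; only its last element is tracked. */
struct opener_chain { int tail; };

void chain_push(struct mark *marks, struct opener_chain *chain, int idx)
{
    assert(idx >= 0);
    marks[idx].prev = chain->tail;
    chain->tail = idx;
}

/* Removes and returns the most recent pending opener, or NONE. */
int chain_pop(struct mark *marks, struct opener_chain *chain)
{
    int idx = chain->tail;

    if (idx != NONE) {
        chain->tail = marks[idx].prev;
        marks[idx].prev = NONE;  /* the chain link is no longer needed */
    }
    return idx;
}

void resolve_brackets(struct mark *marks, int n)
{
    struct opener_chain chain = { NONE };

    for (int i = 0; i < n; i++) {
        if (marks[i].ch == '[') {
            chain_push(marks, &chain, i);
        } else if (marks[i].ch == ']') {
            int opener = chain_pop(marks, &chain);
            if (opener == NONE)
                continue;            /* stray closer stays unresolved */
            marks[opener].next = i;  /* opener -> closer */
            marks[i].prev = opener;  /* closer -> opener */
        }
    }
}

/* Counts outermost resolved spans, jumping over nested ones via next. */
int count_top_level_spans(const struct mark *marks, int n)
{
    int count = 0;

    for (int i = 0; i < n; i++) {
        if (marks[i].ch == '[' && marks[i].next != NONE) {
            count++;
            i = marks[i].next;
        }
    }
    return count;
}
```

Finding the opener for a closer is a constant-time pop, and skipping a resolved span is a single index jump, which is where the speed comes from in this scheme.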

Does md4c have a modular extension system?

No. Just a set of hard-coded extensions which can be turned on or off with flags when calling md_parse().