Currently the examples in the spec (which comprise the test suite) use HTML for the intended results. This has the advantage of being familiar, and of being directly usable for testing by parsers that generate HTML (though a normalization step is needed, and is provided by spec_tests.py).
The problem is that this output is the result of two things: parsing the CommonMark input into an AST, and rendering the AST to HTML. It would make more sense, I think, to use a direct representation of the AST that matches it exactly. cmark now has a -t xml option that produces an XML representation of the AST (there is even a DTD for the format, commonmark.dtd).
Here’s an example:
% ./cmark -t xml
Hi *there*.
A new line

A new paragraph.

    f >>= g

<div>
raw html
</div>

A [link](/url "title")
^D
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CommonMark SYSTEM "CommonMark.dtd">
<document>
<paragraph>
<text>Hi </text>
<emph>
<text>there</text>
</emph>
<text>.</text>
<softbreak />
<text>A new line</text>
</paragraph>
<paragraph>
<text>A new paragraph.</text>
</paragraph>
<code_block>f &gt;&gt;= g
</code_block>
<html>&lt;div&gt;
raw html
&lt;/div&gt;
</html>
<paragraph>
<text>A </text>
<link url="/url" title="title">
<text>link</text>
</link>
</paragraph>
</document>
(If the --sourcepos option is used, start and end line and column information will be included as attributes.)
So, the proposal for consideration is to replace the HTML parts of the spec examples with this XML format. To make it more compact, we’d strip out the <?xml declaration, the DOCTYPE, and the top <document> element.
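To make the stripping step concrete, here is a rough sketch in Python (the language spec_tests.py is already written in). The function name strip_wrapper is hypothetical, and the regex approach is just an assumption about how one might do this; it relies on the output looking like the example above.

```python
import re


def strip_wrapper(xml: str) -> str:
    """Drop the XML declaration, the DOCTYPE, and the outer <document> element.

    Hypothetical helper: assumes output shaped like the example above.
    """
    # Remove the <?xml ...?> declaration and the <!DOCTYPE ...> line.
    xml = re.sub(r'<\?xml[^>]*\?>\s*', '', xml)
    xml = re.sub(r'<!DOCTYPE[^>]*>\s*', '', xml)
    # Keep only what lies between the outer <document ...> tags.
    match = re.search(r'<document[^>]*>\n?(.*)</document>', xml, re.DOTALL)
    # If no <document> wrapper is found, return the remaining text unchanged.
    return match.group(1) if match else xml
```

Applied to the output above, this would leave just the block-level elements (paragraph, code_block, and so on), which is what would go into the spec examples.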
It’s probably a good idea to retain the ability for people to test a converter that just produces HTML against the spec. We could do this by providing a standalone pipe that converts HTML to CommonMark’s XML format, raising an error if structures are encountered that can’t be converted. The test suite could then have an option that would run the HTML output of the program being tested through this pipe. (This would, in effect, replace normalization.)
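A very rough sketch of what such a pipe might look like, using only the Python standard library. Everything here is an assumption for illustration: the names TAG_MAP, UnconvertibleHTML, HtmlToAst and html_to_xml are made up, the tag mapping covers only a handful of inline elements inside paragraphs, and a real converter would need to handle lists, code blocks, raw HTML and much more. The sketch keeps the <document> root so that its output is a single well-formed XML element.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical and deliberately incomplete mapping from HTML tags to AST nodes.
TAG_MAP = {"p": "paragraph", "em": "emph", "strong": "strong", "a": "link"}


class UnconvertibleHTML(Exception):
    """Raised when the HTML contains a structure the sketch cannot convert."""


class HtmlToAst(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self.stack = [self.root]  # currently open AST elements

    def handle_starttag(self, tag, attrs):
        if tag not in TAG_MAP:
            raise UnconvertibleHTML(f"cannot convert <{tag}>")
        node = ET.SubElement(self.stack[-1], TAG_MAP[tag])
        if tag == "a":
            attrs = dict(attrs)
            node.set("url", attrs.get("href") or "")
            node.set("title", attrs.get("title") or "")
        self.stack.append(node)

    def handle_endtag(self, tag):
        # The closing tag must match the innermost open element.
        if len(self.stack) < 2 or self.stack[-1].tag != TAG_MAP.get(tag):
            raise UnconvertibleHTML(f"unexpected </{tag}>")
        self.stack.pop()

    def handle_data(self, data):
        parent = self.stack[-1]
        if parent is self.root:
            # Only whitespace is allowed between block-level elements.
            if data.strip():
                raise UnconvertibleHTML("text outside a block element")
            return
        # A newline inside inline content corresponds to a softbreak.
        for i, line in enumerate(data.split("\n")):
            if i > 0:
                ET.SubElement(parent, "softbreak")
            if line:
                ET.SubElement(parent, "text").text = line


def html_to_xml(html: str) -> str:
    parser = HtmlToAst()
    parser.feed(html)
    parser.close()
    if len(parser.stack) != 1:
        raise UnconvertibleHTML("unclosed element")
    return ET.tostring(parser.root, encoding="unicode")
```

The point of raising an exception rather than guessing is that a failed conversion then shows up as a test failure, instead of silently comparing against the wrong tree.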
Note that currently the parser sometimes produces adjacent text elements. For example, hi&lo turns into:
<paragraph>
<text>hi</text>
<text>&amp;</text>
<text>lo</text>
</paragraph>
This is equivalent to:
<paragraph>
<text>hi&amp;lo</text>
</paragraph>
It would probably be good to tweak the parser so that adjacent text nodes are never produced, but we could also just add a normalization step to the test suite.
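If we went the normalization route, the step could be quite small. Here is a sketch using Python's xml.etree.ElementTree; the function name merge_adjacent_text is hypothetical, and it ignores --sourcepos attributes (a merged node would keep only the first node's position).

```python
import xml.etree.ElementTree as ET


def merge_adjacent_text(element: ET.Element) -> None:
    """Merge runs of sibling <text> elements in place, recursively."""
    previous = None
    for child in list(element):  # iterate over a copy, since we remove children
        if child.tag == "text" and previous is not None and previous.tag == "text":
            # Fold this text node into the one immediately before it.
            previous.text = (previous.text or "") + (child.text or "")
            element.remove(child)
        else:
            merge_adjacent_text(child)
            previous = child
```

Running this over both the expected and the actual tree before comparing them would make the two forms above compare equal.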
Comments on this?