Use XML for the spec examples and tests? (comments welcome)

jgm · January 4, 2015, 4:36pm

Currently the examples in the spec (which comprise the test suite) use HTML for the intended results. This has the advantage of being familiar, and of being directly useable for testing by parsers that generate HTML (though a normalization step is needed, and provided by spec_tests.py).

The problem is that this output is the result of two steps: parsing the CommonMark input into an AST, and rendering the AST to HTML. It would make more sense, I think, to use a direct representation of the AST that matches it exactly. cmark now has a -t xml option that produces an XML representation of the AST (there is even a DTD for the format, CommonMark.dtd).

Here’s an example:

% ./cmark -t xml
Hi *there*.
A new line

A new paragraph.

    f >>= g

<div>
  raw html
</div>

A [link](/url "title")
^D
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CommonMark SYSTEM "CommonMark.dtd">
<document>
  <paragraph>
    <text>Hi </text>
    <emph>
      <text>there</text>
    </emph>
    <text>.</text>
    <softbreak />
    <text>A new line</text>
  </paragraph>
  <paragraph>
    <text>A new paragraph.</text>
  </paragraph>
  <code_block>f &gt;&gt;= g
</code_block>
  <html>&lt;div&gt;
  raw html
&lt;/div&gt;
</html>
  <paragraph>
    <text>A </text>
    <link url="/url" title="title">
      <text>link</text>
    </link>
  </paragraph>
</document>

(If the --sourcepos option is used, start and end line and column information will be included as attributes.)

So, the proposal for consideration is to replace the HTML parts of the spec examples with this XML format. To make it more compact, we’d strip out the <?xml declaration, the DOCTYPE, and the top <document> element.
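A minimal sketch of that compaction step, assuming the output has the shape shown in the example above. The function name compact_xml is made up for illustration and is not part of cmark or spec_tests.py. Indentation is left alone, because code blocks and raw HTML carry significant leading whitespace.

```python
import re


def compact_xml(xml_output: str) -> str:
    """Strip the XML declaration, the DOCTYPE and the outer <document> wrapper.

    Hypothetical helper: it assumes the wrapper lines appear exactly once,
    at the very start and very end of the cmark -t xml output.
    """
    body = re.sub(r'^\s*<\?xml[^>]*\?>\s*', '', xml_output)  # XML declaration
    body = re.sub(r'^<!DOCTYPE[^>]*>\s*', '', body)          # DOCTYPE line
    body = re.sub(r'^<document[^>]*>\n?', '', body)          # opening <document>
    body = re.sub(r'</document>\s*$', '', body)              # closing </document>
    return body


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE CommonMark SYSTEM "CommonMark.dtd">\n'
    '<document>\n'
    '  <paragraph>\n'
    '    <text>hi</text>\n'
    '  </paragraph>\n'
    '</document>\n'
)
print(compact_xml(sample), end='')
```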

It’s probably a good idea to retain the ability for people to test a converter that just produces HTML against the spec. We could do this by providing a standalone pipe that converts HTML to CommonMark’s XML format, raising an error if structures are encountered that can’t be converted. The test suite could then have an option that would run the HTML output of the program being tested through this pipe. (This would, in effect, replace normalization.)
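To make the idea of such a pipe concrete, here is a rough sketch using Python's standard html.parser. The tag mapping is an assumption for illustration and covers only three node types; a real converter would need to handle every node type in the DTD, as well as soft breaks, attributes and raw HTML. Anything outside the mapping raises an error, as described above.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Hypothetical, deliberately tiny mapping from HTML tags to AST element names.
TAG_MAP = {"p": "paragraph", "em": "emph", "strong": "strong"}


class HtmlToAstXml(HTMLParser):
    """Convert a small HTML subset to the XML form, failing on anything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in TAG_MAP:
            raise ValueError(f"cannot convert <{tag}>")
        self.out.append(f"<{TAG_MAP[tag]}>")

    def handle_endtag(self, tag):
        if tag not in TAG_MAP:
            raise ValueError(f"cannot convert </{tag}>")
        self.out.append(f"</{TAG_MAP[tag]}>")

    def handle_data(self, data):
        # Whitespace-only runs between block tags are formatting, not content.
        if data.strip():
            self.out.append(f"<text>{escape(data)}</text>")


def html_to_xml(html: str) -> str:
    parser = HtmlToAstXml()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(html_to_xml("<p>Hi <em>there</em>.</p>\n"))
```

The test suite could then compare this converted form against the expected XML instead of normalizing HTML on both sides.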

Note that currently the parser sometimes produces adjacent text elements. For example, hi&lo turns into:

  <paragraph>
    <text>hi</text>
    <text>&amp;</text>
    <text>lo</text>
  </paragraph>

This is equivalent to

  <paragraph>
    <text>hi&amp;lo</text>
  </paragraph>

It would probably be good to tweak the parser so that adjacent text nodes are never produced, but we could also just add a normalization step to the test suite.
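Such a normalization step could look roughly like the following, using xml.etree.ElementTree. This is a sketch, not code from spec_tests.py: the function name is made up, and it assumes element names carry no namespace prefix, as in the examples above.

```python
import xml.etree.ElementTree as ET


def merge_adjacent_text(parent):
    """Merge runs of sibling <text> elements in place, recursing into children."""
    previous = None
    for child in list(parent):  # copy, because we remove children while iterating
        if child.tag == "text" and previous is not None and previous.tag == "text":
            previous.text = (previous.text or "") + (child.text or "")
            parent.remove(child)
        else:
            merge_adjacent_text(child)
            previous = child
    return parent


root = ET.fromstring(
    "<paragraph><text>hi</text><text>&amp;</text><text>lo</text></paragraph>"
)
merge_adjacent_text(root)
print(ET.tostring(root, encoding="unicode"))
```

Running the same step over both the expected and the actual tree would make the two forms above compare equal.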

Comments on this?

jgm · January 4, 2015, 4:43pm

Links to related discussions:

First steps towards an AST (@riking, I’d completely missed your effort when I added the XML export format and DTD; I see from your DTD some ways I can improve mine, so thanks!)
A handful of small spec issues

Here is the DTD for the XML format:

github.com/commonmark/CommonMark/blob/master/CommonMark.dtd

<!-- DTD for CommonMark xml export format -->

<!ENTITY % block
         'block_quote|list|code_block|paragraph|heading|thematic_break|html_block|custom_block'>
<!ENTITY % inline
         'text|softbreak|linebreak|code|emph|strong|link|image|html_inline|custom_inline'>

<!ELEMENT document (%block;)*>
<!ATTLIST document
    xmlns CDATA #FIXED "http://commonmark.org/xml/1.0">

<!-- block elements -->

<!ELEMENT block_quote (%block;)*>

<!ELEMENT list (item)+>
<!ATTLIST list
          type (bullet|ordered) #REQUIRED
          start CDATA #IMPLIED
          tight (true|false) #REQUIRED

This file has been truncated.

Knagis · January 4, 2015, 9:33pm

Could the spec contain both HTML and XML? Or at least, can the spec that will be put on the website show both (or allow the user to switch between formats)?

Perhaps the spec could contain a slightly simpler format for others to parse from the .txt by putting an extended XML there:

<example id="123">
  <source>**foo**</source>
  <html>&lt;strong&gt;foo&lt;/strong&gt;</html> <!--either encoded text or xhtml-->
  <document><strong><text>foo</text></strong></document>
</example>
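As a rough illustration of how a test runner might read one such entry, here is a sketch with xml.etree.ElementTree. It assumes the element is closed with </example> and that the HTML is stored as escaped text; read_example is a made-up name.

```python
import xml.etree.ElementTree as ET

sample = """<example id="123">
  <source>**foo**</source>
  <html>&lt;strong&gt;foo&lt;/strong&gt;</html>
  <document><strong><text>foo</text></strong></document>
</example>"""


def read_example(xml_text):
    """Split one <example> entry into its id, Markdown source, HTML and AST parts."""
    example = ET.fromstring(xml_text)
    return {
        "id": example.get("id"),
        "source": example.findtext("source"),
        "html": example.findtext("html"),  # entities are decoded by the parser
        "ast": ET.tostring(example.find("document"), encoding="unicode").strip(),
    }


print(read_example(sample))
```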

vitaly · January 4, 2015, 11:33pm

I know 2 real cases to solve with XML:

\n between tags. 0.13 solved almost everything, but there are still a couple of cases where removing \n is preferable (IMHO).
XML-like output (and a kludge option) in markdown-it is needed for CM tests only. The default renderer targets HTML5.

In general it should be good. But I have never worked with this kind of spec (with embedded XML), and don’t know about possible difficulties.

lu_zero · January 5, 2015, 6:50pm

XML for that is surely better than other serializations, even if pretty-printed JSON is usually nicer to read.

colinodell · January 7, 2015, 8:13pm

+1

I think both formats would be incredibly valuable. XML is a great idea for verifying the AST, whereas HTML is beneficial for human understanding and for libraries that want to keep their output similar to that of the reference implementations.

On the website, perhaps the users could toggle between HTML and XML renderings, similar to how Microsoft shows code examples in C# and VB?

zudov · February 23, 2015, 1:32am

That would be nice. For myself, I’ve added this script to spec.html to replace html with xml:

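<script type="text/javascript">
// Run once the DOM is ready so that the .example blocks exist.
$(function () {
    var reader = new commonmark.Parser();
    var writer = new commonmark.XmlRenderer();

    // Replace the HTML rendering of one example with the XML form of its AST.
    // jQuery's .each() passes (index, element) to the callback.
    var toAST = function (index, example) {
        example = $(example);
        var markdown = example.find(".language-markdown"),
            html = example.find(".language-html");
        html.text(writer.render(reader.parse(markdown.text())));
    };
    $(".example").each(toAST);
});
</script>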

(js is not my language)

vitaly · March 11, 2015, 1:32pm

@jgm, we finished refactoring the renderer of markdown-it, and I’d like to summarize my personal impressions related to this topic:

I don’t need XML. The current HTML samples are OK for my needs and simpler.

Details:

Links.

The current links in the 0.18 spec are friendly enough to “critical” normalizations.
In href we encode the hostname with IDNA and percent-encode the rest. No conflicts.
In link texts (autolinks and linkify, not manually typed ones) we decode the hostname with IDNA and percent-decode the text. No conflicts, because the spec has no conditions or tests for that…

Line breaks between tags:

The only difference is that you add \n in an empty blockquote. We update the fixtures with a simple regexp; that’s enough.

So, I have nothing against XML, but I vote to keep HTML too and leave the choice of what to use to developers. For examples, HTML is more convenient. For tests, it depends on the implementation.

tin-pot · October 25, 2015, 6:28pm

Excuse me for “unearthing” this old post, but it relates to some questions and experiences I recently had in my own project.

I think the “right thing” to concentrate on when it comes to testing is a representation which is independent of the vagaries of mark-up: the “Canonical XML” form seems to be exactly that.

So in order to compare actual output with reference material (e.g. for regression testing), one would either

devise a mode in which Canonical XML is output; or
“canonicalize” the actual output and then compare with the (canonical) reference.

As long as the output has the same canonical form, any remaining differences in the regular XML output should be immaterial.

Note however that the current output of cmark -t xml does include white space which will show up in the canonical form: only when validating against the DTD can the decision be made that this white space is in fact irrelevant (e.g. inside elements that do not have character data content). So from a testing perspective, and for “sanitary” reasons generally, cmark should probably write XML that does not have this gratuitous white space in its canonical form. Or do it optionally, if the resulting XML would be “too ugly”.
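Both points can be seen in a small sketch that uses the canonicalize function from Python's xml.etree.ElementTree (available from Python 3.8, implementing C14N 2.0). The sample strings are invented for illustration and are not actual cmark output.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8+


def same_canonical_form(actual_xml: str, expected_xml: str) -> bool:
    """Compare two XML documents by their canonical (C14N 2.0) serialization."""
    return canonicalize(actual_xml) == canonicalize(expected_xml)


# Differences in the declaration, attribute order, quoting style and
# empty-element syntax disappear in canonical form.
a = '<?xml version="1.0"?><link url="/url" title="title"><softbreak /></link>'
b = "<link title='title' url='/url'><softbreak></softbreak></link>"

# Pretty-printing white space does not disappear: it is character data
# as far as canonicalization (without a DTD) is concerned.
c = '<link url="/url" title="title">\n  <softbreak />\n</link>'

print(same_canonical_form(a, b))
print(same_canonical_form(a, c))
```

canonicalize also accepts strip_text=True, but that trims leading and trailing white space inside every text run as well, including meaningful content such as the trailing space in <text>Hi </text>, so it is not a safe substitute for DTD-aware handling.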

jgm · October 27, 2015, 8:49pm

See https://github.com/jgm/CommonMark/issues/274

I’ll note that having white space is very useful for diffs.

tin-pot · December 23, 2015, 2:48am

I have added some remarks on the issue over there at https://github.com/jgm/CommonMark/issues/274.

Btw, what do you mean by “having white space is very useful for diffs”? Maybe that “having the text broken into lines” is very useful for diffs, because diffs are usually line-oriented?