ANN: CommonMark 0.23

Version 0.23 of the CommonMark spec has been released.

I have also released conforming versions of cmark and commonmark.js.


Great to see the DTD element types for “raw HTML content” renamed to html_inline and html_block!

With the addition of CUSTOM_INLINE plus CUSTOM_BLOCK and their corresponding XML element types, the XML model seems actually usable, and I may one day ditch (or make optional) the SGML reference concrete syntax compatible element type names I currently use in my tool chain, so thank you for that!

In the new 0.23 spec, in example 289 in the [aptly named ;-)] section “6.2 Entity and numeric character references”, something seems to have happened to the & example: it is the only one that appears “escaped” as


and not as a UTF-8 encoded character in the right-hand side, like the rest of the “entity” examples.

Also in section “6.2 Entity and numeric character references”, maybe one could use the “official” name REPLACEMENT CHARACTER for U+FFFD instead of (or in addition to) the term “unknown code point character” as in

Invalid Unicode code points will be replaced by the “unknown code point” character (U+FFFD).

This is purely a matter of terminology, as U+FFFD is certainly the right character to, well, replace invalid input code points.
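To make the replacement rule concrete, here is a minimal Ruby sketch of how a numeric character reference might be decoded. The helper name `decode_numeric_reference` is hypothetical, and the set of code points treated as invalid here (zero, surrogates, anything above U+10FFFF) is my assumption, not wording taken from the spec.

```ruby
# U+FFFD, officially named REPLACEMENT CHARACTER in Unicode.
REPLACEMENT_CHARACTER = "\u{FFFD}"

# Hypothetical helper: turn the code point of a numeric character
# reference into a UTF-8 string, substituting U+FFFD for input that
# this sketch assumes to be invalid.
def decode_numeric_reference(code_point)
  invalid = code_point.zero? ||
            code_point > 0x10FFFF ||
            (0xD800..0xDFFF).cover?(code_point) # surrogate range
  invalid ? REPLACEMENT_CHARACTER : [code_point].pack("U")
end
```

With this sketch, `decode_numeric_reference(35)` yields `"#"`, while `decode_numeric_reference(0)` yields the replacement character.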

While I’m on a nitpicking streak … In example 292, the use of


is a bit unfortunate and misleading – the problem is not the length of the entity name (even HTML 3.2 used NAMELEN 65536, effectively eliminating any name length limit), but that the name is not pre-defined in HTML5 (if I understand the example’s intent right).

Note that all the other “nonentities” in example 292 show syntactically invalid attempts at writing entity references, so I suppose the point of this one is in fact to point out that an undefined name can’t be used, rather than a long name (and 34 characters isn’t that long, is it? :wink: ).
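The distinction between “undefined” and “too long” can be sketched in Ruby as a plain table lookup. Everything here is illustrative: `ENTITY_TABLE` is a tiny hypothetical stand-in for the full HTML5 entity list, and `resolve_entities` is not the name of any function in cmark or commonmark.js.

```ruby
# Tiny hypothetical stand-in for the pre-defined HTML5 entity names.
ENTITY_TABLE = {
  "amp"  => "&",
  "copy" => "\u{A9}",
  "ouml" => "\u{F6}"
}.freeze

# Replace syntactically valid references whose name is in the table.
# A name that is not defined is left as literal text, whatever its length.
def resolve_entities(text)
  text.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do
    ENTITY_TABLE.fetch(Regexp.last_match(1), Regexp.last_match(0))
  end
end
```

Under these assumptions, `resolve_entities("&copy; &ThisIsNotDefined;")` resolves the first reference and leaves the second untouched, purely because the name is missing from the table.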

[I would argue that it is a bad choice for the CommonMark specification to simply confine the names of entities that are usable in a text and recognized by a parser to a fixed set (be it the set of pre-defined entity names of HTML 5 or whatever); but this is a separate topic …]

One last nitpicking remark regarding “6.2 Entity and numeric character references”: The text introducing examples 299, 300, and 301 reads:

Entity and numeric character references are treated as literal text in code spans and code blocks, and in raw HTML.

But comparing example 299 with example 301 clearly shows that these references are treated differently in code blocks and code spans on the one hand, and inside raw HTML on the other (as they should!): the opening “&” is translated to “&amp;” only in the first case.

It seems that a precise meaning of “treated as literal text” is missing, and it can’t mean both.
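A rough Ruby sketch of the two behaviours, using hypothetical helper names (`render_code_span`, `render_raw_html`) and my own sample inputs rather than the spec’s exact examples: code-span content is HTML-escaped on output, so the “&” of a reference becomes “&amp;”, while raw HTML is passed through byte for byte.

```ruby
require "cgi"

# Hypothetical renderer for code-span content: the text is literal,
# so it is HTML-escaped when written to the output.
def render_code_span(source)
  "<code>#{CGI.escapeHTML(source)}</code>"
end

# Hypothetical renderer for raw HTML: passed through unchanged.
def render_raw_html(source)
  source
end

puts render_code_span("f&ouml;&ouml;")
puts render_raw_html('<a href="&ouml;&ouml;.html">')
```

In both cases the reference is not resolved to “ö”, but only the code span gets its “&” escaped, which is the difference the sentence in the spec glosses over.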

That’s correct. The other characters can all occur verbatim in HTML source, but of course & cannot, so it is escaped, otherwise we’d have invalid HTML.

I’ve just changed it.

I see the point. I’ve changed it to ThisIsNotDefined.

Right. This was a regression I mistakenly introduced at the last minute. I’ve rewritten it so it doesn’t make this claim about raw HTML. Thanks.

That’s correct. The other characters can all occur verbatim in HTML source, but of course & cannot, so it is escaped, otherwise we’d have invalid HTML.

I see - I was confused about the three levels of “escaping” here: the HTML (UTF-8 encoded) source text for the example, how the example is rendered in the spec, and how the “output code” displayed in the example would be rendered in e.g. a browser …

Hmm. Did something change with regard to smart parsing? The Ruby Commonmarker wrapper is failing on all tests around smart punctuation. For example:

--- expected
+++ actual
@@ -1 +1 @@

The Ruby wrapper simply parses the same smart_punct.txt file as the C tests, which leads me to believe that I am no longer correctly passing the smart option. It looks like this:

doc = CommonMarker.render_doc(testcase[:markdown], :smart)

@gjtorikian I’m sure I know what happened.

I changed the numerical values of the CMARK_OPT_* constants in cmark.h, separating them into two groups (options affecting parsing and options affecting rendering). This wouldn’t affect wrappers that used the symbols themselves, like CMARK_OPT_SMART, but if your wrapper used their numerical values, it would break things.
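Here is a small Ruby sketch of that failure mode. The two tables and every number in them are made up for illustration and are not copied from any version of cmark.h; `smart_enabled?` is likewise a hypothetical helper standing in for the library’s flag check.

```ruby
# Illustrative values only, NOT the real contents of cmark.h.
OLD_HEADER = { sourcepos: 1, hardbreaks: 2, normalize: 4, smart: 8 }.freeze
NEW_HEADER = { sourcepos: 1 << 1, hardbreaks: 1 << 2,
               normalize: 1 << 8, smart: 1 << 10 }.freeze

# A wrapper that copied the number instead of referring to the symbol.
HARD_CODED_SMART = 8

# Hypothetical check: is the "smart" bit set, according to this header?
def smart_enabled?(flags, header)
  (flags & header[:smart]) != 0
end

puts smart_enabled?(HARD_CODED_SMART, OLD_HEADER)   # hard-coded value, old header
puts smart_enabled?(HARD_CODED_SMART, NEW_HEADER)   # hard-coded value, new header
puts smart_enabled?(NEW_HEADER[:smart], NEW_HEADER) # value looked up by name
```

The hard-coded value only matches the old table. Looking the value up by name keeps working after a renumbering, which is why wrappers that use the symbols were unaffected.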

Fantastic, that was it: 7549b7f0 (Followed quickly by the lazier 5cc57641)

Thanks for the quick response. I don’t think there’s a way for me to automate this in the future but I’ll keep it in mind should it occur again.
