Spec Issues: (Character) Entity References

tin-pot · November 29, 2016, 4:07pm

Thanks for your reply!

Yes, this seems more appropriate (in the sense of “universal” and “stable”) than the WHATWG/W3C HTML5 set.

We seem to agree here.

I admit that it is convenient to have implementations that “know out of the box” about common entity sets like the HTML5, MathML, HTML 4.01, or ISO 15445 ones, so that they can perform the “checking” (aka validating) and “substitution” functions I mentioned. And it would probably be wise if the CommonMark spec “recommended” some minimal set of such entity names (though I’m not sure about this). But hard-wiring this set into the implementation (or even the spec) seems too inflexible for my taste.

But it is IMO **not obvious what, if anything,** an implementation should substitute for “known” entity references. What is gained when HTML output contains UTF-8 encoded characters instead of HTML5 or HTML 4.01 entity names? What if I want ISO 8859-1 or even ISO 646-IRV encoding of my generated HTML? What if I want &auml; mapped to U+00E4 “ä” (since that character is available “everywhere”), but want to leave &CounterClockwiseContourIntegral; alone, since my editor has trouble handling it, or lacks an appropriate font (let alone my trouble entering this as a Unicode character)?
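To make the “substitute only what the target encoding can hold” idea concrete, here is a minimal Python sketch. It assumes the HTML 4.01 names from the standard library's `html.entities.name2codepoint` as the “known” set; the function name `substitute_encodable` and the decision to never touch `&amp;`, `&lt;`, `&gt;` and `&quot;` are my own assumptions, not anything the spec prescribes.

```python
import re
from html.entities import name2codepoint  # HTML 4.01 entity names -> code points

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Assumption: references to markup-significant characters always stay as they are.
KEEP_VERBATIM = {"amp", "lt", "gt", "quot"}


def substitute_encodable(text: str, encoding: str) -> str:
    """Replace a known entity reference only if its character fits `encoding`."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        codepoint = name2codepoint.get(name)
        if codepoint is None or name in KEEP_VERBATIM:
            return match.group(0)  # unknown or protected name: leave it alone
        char = chr(codepoint)
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            return match.group(0)  # not representable in the target encoding
        return char

    return ENTITY_REF.sub(replace, text)
```

With `"iso-8859-1"` as the target, `&auml;` becomes “ä” while `&infin;` stays a reference; with `"ascii"` both stay. A name outside the HTML 4.01 set, such as `&CounterClockwiseContourIntegral;`, is passed through untouched in every case.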

And if, for example, LaTeX output is desired, it seems much easier to have a LaTeX-specific definition for &infin; along the lines of

<!ENTITY infin "\infty"> <!-- LaTeX control word for U+221E INFINITY -->

than to first insert a literal U+221E into the LaTeX text (hope you’re using XeLaTeX …;-)) and then struggle with how to map it to the proper CMSY font.

So Unicode/UTF-8 might not be ideal for all “downstream” processing after all.
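The LaTeX case could look roughly like this on the processor side: an output-specific table plays the role of the `<!ENTITY infin "\infty">` declaration above. This is only a sketch; the table `LATEX_ENTITIES`, its entries beyond `infin`, and the function name are hypothetical.

```python
import re

# Hypothetical LaTeX-specific replacement table (entity name -> control word).
LATEX_ENTITIES = {
    "infin": r"\infty",
    "alpha": r"\alpha",
    "le": r"\leq",
}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def entities_to_latex(text: str) -> str:
    """Replace entity references that have a LaTeX definition; keep the rest."""

    def replace(match):
        name = match.group(1)
        if name in LATEX_ENTITIES:
            # The trailing "{}" ends the control word, so that a letter
            # following the reference is not swallowed into its name.
            return LATEX_ENTITIES[name] + "{}"
        return match.group(0)

    # A function is used as the replacement so that backslashes in the
    # table values are inserted literally.
    return ENTITY_REF.sub(replace, text)
```

No Unicode character ever appears in the LaTeX source this way, so the font question does not arise.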

Technically, one can’t use ENTITY markup declarations in CommonMark like this, because they MUST occur in the internal subset of the generated XML/HTML/SGML document. As far as I understand CommonMark, everything it produces goes into the document instance set (unless a custom-tailored implementation is “smart” enough to keep them apart). So the only “legal” markup declarations in CommonMark would be <!USEMAP ... > and <!USELINK ...> anyway (apart from comment declarations, of course).

But the “dumb” behaviour (like MD4C’s md2html with the --fverbatim-entities option, which is what most other Markdown processors also do) is already quite useful in this case, since a reference to the appropriate entity set is trivial to insert into the output document (even with sed).
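As an illustration of how little is needed, the following Python sketch prepends a document type declaration whose internal subset pulls in an external entity set. It assumes the processor output is a complete XML-style document with the given root element, and it uses the Latin-1 entity set as referenced by the XHTML 1.0 DTDs as an example; the function name `with_entity_set` is hypothetical.

```python
def with_entity_set(document: str, root: str = "html") -> str:
    """Prepend a DOCTYPE whose internal subset references an entity set.

    Assumption: `document` has no DOCTYPE yet and its root element is `root`.
    """
    prolog = (
        f"<!DOCTYPE {root} [\n"
        "  <!ENTITY % HTMLlat1 PUBLIC\n"
        '    "-//W3C//ENTITIES Latin 1 for XHTML//EN" "xhtml-lat1.ent">\n'
        "  %HTMLlat1;\n"
        "]>\n"
    )
    return prolog + document
```

The entity references in the body are left exactly as the author wrote them; resolving them becomes the job of whatever parser reads the result.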

Speaking of custom entities in CommonMark: This is a most important and interesting point, and both external and internal user-defined entities could seemingly be added with minimal disruption (obviating cruft like <!ENTITY ...> declarations in CommonMark text). See the discussion about “transclusion”.

In this case (and only in this or a very similar case) it would IMO be worthwhile for a CommonMark processor to maintain an explicit “entity table”, where each name would map to

- a Unicode character, or
- some other (user-defined) replacement text, or
- a URL (user-defined, for “transclusion”), or
- nothing (just indicating a “known” or “valid” entity name).
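Such an entity table might be sketched as follows. All names here (`Kind`, `Entry`, `ENTITY_TABLE`, `resolve`) and the sample entries are hypothetical, and actually fetching a URL for transclusion is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Kind(Enum):
    CHARACTER = auto()  # maps to a Unicode character
    TEXT = auto()       # maps to user-defined replacement text
    URL = auto()        # maps to a URL, for "transclusion"
    KNOWN = auto()      # maps to nothing; the name is merely valid


@dataclass(frozen=True)
class Entry:
    kind: Kind
    value: Optional[str] = None


# Illustrative entries only.
ENTITY_TABLE = {
    "auml": Entry(Kind.CHARACTER, "\u00e4"),
    "infin": Entry(Kind.TEXT, r"\infty"),
    "chapter2": Entry(Kind.URL, "chapter2.md"),
    "CounterClockwiseContourIntegral": Entry(Kind.KNOWN),
}


def resolve(name: str) -> Optional[str]:
    """Return replacement text, or None to leave the reference verbatim.

    Raises KeyError for names that are not in the table at all, which is
    the "checking" function.
    """
    entry = ENTITY_TABLE[name]
    if entry.kind in (Kind.CHARACTER, Kind.TEXT):
        return entry.value
    # URL entries would need a fetch step, which this sketch does not do;
    # KNOWN entries are valid but have no replacement.
    return None
```

The point of the fourth kind is that validation and substitution are separated: a name can be accepted as valid without the processor committing to any particular replacement.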