Entity References in CommonMark
In my opinion the way in which (character) entity references are treated in the CommonMark specification and—consequently—the way they are handled in implementations like cmark
is currently unfortunate.
In the following I argue that this actually
- reduces the usefulness of CommonMark as specified, and
- encumbers the implementation of CommonMark processors,
while achieving no practical or significant gain as far as I can tell.
Below I also sketch some ideas for how this could be improved, in the hope of inviting a useful discussion and subsequent changes in CommonMark.
There have been scattered comments on roughly this topic before, but no thorough discussion that I know of.
Current situation
The CommonMark specification requires that
All valid HTML entity references […] are recognized as such and treated as equivalent to the corresponding Unicode characters.
— CommonMark spec version 0.27, section 6.2
What constitutes a “valid HTML entity reference” is then defined by reference to the WHATWG Entity Set, which is itself defined in a JSON file hosted on the WHATWG HTML web site.
This is declared to be
[…] an authoritative source for the valid entity references and their corresponding code points.
— CommonMark spec version 0.27, section 6.2
And (mostly) true to this wording, the reference implementation has this WHATWG list of entity names hard-coded and built into the executable, and will recognize exactly those names as entity references. (That is, “all valid HTML entities” in the spec should really be read as “all valid HTML entities and only these”!)
For example, &and;
(being in the WHATWG set) is replaced by (meaning “treated as equivalent” to?) the UTF-8 encoding of U+2227 LOGICAL AND in the output, regardless of whether the output is XML, troff man, or LaTeX source.
On the other hand, &implies;
(not being on the WHATWG list) is not even considered to be an entity reference but gets treated simply as character data, that is, it is output as &amp;implies;
in XML (and similarly “escaped” for the other output formats).
Note that the new “MD4C” implementation provides a “Markdown extension” option --fverbatim-entities
for presumably this reason …
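The behaviour described above can be approximated in a few lines of Python. This is only a sketch, not the code of cmark or any other implementation: it assumes that the html5 table in Python's html.entities module mirrors the WHATWG list, it uses a simplified ASCII-only name pattern, and the function name render_current is made up for illustration.

```python
import re
from html import escape
from html.entities import html5  # assumed to mirror the WHATWG entity list

# Simplified, ASCII-only pattern for "&name;" (an assumption of this sketch).
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def render_current(text):
    """Mimic the current rule for XML-ish output: names found in the fixed
    table are replaced by their characters, everything else is escaped
    as plain character data."""
    out = []
    pos = 0
    for m in ENTITY_RE.finditer(text):
        out.append(escape(text[pos:m.start()], quote=False))
        key = m.group(1) + ";"
        if key in html5:
            # Known name: substitute the character(s), then escape for output.
            out.append(escape(html5[key], quote=False))
        else:
            # Unknown name: not an entity reference at all, so "&" is escaped.
            out.append(escape(m.group(0), quote=False))
        pos = m.end()
    out.append(escape(text[pos:], quote=False))
    return "".join(out)


print(render_current("A &and; B &implies; C"))
```

With this sketch the reference to "and" becomes the character U+2227, while the reference to "implies" comes out with its ampersand escaped, which is the asymmetry criticised here.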
The spec mandates the wrong fixed Character Entity Set
While the WHATWG set—currently containing 2231 names—seems huge, and is in fact also the Entity Set specified in W3C HTML5, it obviously covers only a small fraction of even just the Unicode BMP.
I don’t know if W3C or WHATWG have any plans or mechanisms to extend this set in the future, or how to refer to different versions of it should this happen.
Curiously, the HTML5 spec includes MathML in section 4.7.14 and refers (currently) to MathML Ver. 3.0 2nd Edition. And MathML comes with its own, different Entity Set, namely the “combined HTML MathML entity set” maintained by W3C with the FPI
"-//W3C//ENTITIES HTML MathML Set//EN//XML".
In section 7.3 “Entity Declarations”, ISO/IEC 40314:2016 Information technology — Mathematical Markup Language (MathML) Version 3.0 2nd Edition (and the identical text in W3C MathML Version 3 2nd Ed) it is explained that (emphasis mine):
Earlier versions of this MathML specification included detailed listings of the entity definitions to be used with the MathML DTD. These entity definitions are of more general use, and have now been separated into an ancillary document, XML Entity Definitions for Characters [Entities]. The tables there list the entity names and the corresponding Unicode character references. That document describes several entity sets; not all of them are used in the MathML DTD. The MathML DTD references the combined HTML MathML entity set defined in [Entities].
— ISO/IEC 40314:2016, clause 7.3
So this “combined HTML MathML entity set” referred to as [Entities] is the normative one referenced in ISO/IEC 40314:2016.
This is a proper superset of the JSON-defined Entity Set of HTML5, and contains 112 more entity names, mostly from
"ISO 8879-1986//ENTITIES Greek Letters//EN"
"ISO 8879-1986//ENTITIES Alternative Greek Symbols //EN"
If the CommonMark spec sees a need to define an Entity Set which
- contains “all the characters a Web author would ever need”,
- comprises all entities available in HTML5 and related standards,
- is defined in a recognized, stable, publicly available standard,
- can be identified, referenced, and readily used in XML documents
then "-//W3C//ENTITIES HTML MathML Set//EN//XML"
would be a better choice than “this JSON file on the WHATWG web site”.
(Note that the responsibility for public entity sets was transferred from ISO to W3C some time ago, so W3C is today the official body maintaining such entity sets—for better or worse.)
However, arguments about “the right” entity set to define in the CommonMark specification are in my opinion moot anyway, because the notion of such a “right” set is bogus to start with, and the specification should not mandate any such set.
Mandating a fixed Character Entity Set in the spec is wrong
Standards for public entity sets are a good thing, particularly because there are so many to choose from …
But selecting one such set and mandating it in the spec as “the valid set” of entity names is in my view a bad idea anyway, and misses the whole point and purpose of general entities. The following examples should make this clear.
-
Maybe I just don’t like the “official” entity name. Case in point would be propositional logic, where
A &and; B &implies; C
is much nicer and clearer than
A &and; B &rArr; C
The ability in XML/SGML to just define
<!ENTITY implies "⇒">
is very handy here; and it is also “officially” encouraged:
NOTE – If a different name would be more expressive in the context of a particular document, the entity can be redefined within the document.
— ISO 8879:1986, annex D.4.1.3
-
There are (a lot of!) Unicode code points for which there is no name defined in any public entity set. For example, maybe I’d like to use U+2981 Z NOTATION SPOT (introduced in Unicode 3.2), and refer to it with
&spot;
defined as
<!ENTITY spot "⦁">
Or, when targeting HTML, defined as
<!ENTITY spot "•">
or as a “definitional” entity
<!ENTITY spot SDATA "[spot ]">
-
Other typical examples would be superscript and subscript digits like
&sup3;
for U+00B3 SUPERSCRIPT THREE (which, being in ISO 8859-1, has “always” been available in HTML), but also
&sup4;
for U+2074 SUPERSCRIPT FOUR. Bad luck: in CommonMark you can’t use the latter (and neither can you in HTML5).
-
The need to mention carbon dioxide in texts is sadly commonplace nowadays. If an author wants the “proper” formula “CO₂” instead of the slightly wrong “CO2”, it would be very convenient to just use
&co2;
in the text, and define
<!ENTITY co2 "CO<SUB>2</SUB>">
or
<!ENTITY co2 "CO₂"> <!-- U+2082 SUBSCRIPT TWO -->
-
It is a common technique to use entity references to place “logos” in documents. Authors of tutorials might want to write about
&TeX;
and invoke by this entity reference the appropriate CSS/FO/LaTeX/groff/pixmap magic in their output document. In CommonMark they can’t.
-
Another common use for general entities is simply as “text macros”. For example, I find the word “ubiquitous” pretty hard to spell and type, and would prefer to use, say,
&uq;
<!ENTITY uq "ubiquitous">
when writing about “ubiquitous computing”, for example. Again, in CommonMark you can’t. (But of course I still had to first check that
&uq;
is not already defined in the WHATWG set …)
These are all examples of the intended and proposed uses of general entities:
References permit a number of useful techniques:
A short name can be used to refer to a lengthy or text string, or to one that cannot be entered conveniently with the available keyboard.
Parts of the document that are stored in separate system files can be imbedded.
Documents can be exchanged among different systems more easily because references to system-specific objects (such as characters that cannot be keyed directly) can be in the form of entity references that are resolved by the receiving system.
The result of a dynamically executed processing instruction (such as an instruction to retrieve the current date) can be imbedded as part of the document.
— ISO 8879:1986, annex B.6
but CommonMark precludes them all.
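To make the “text macro” use concrete, here is a minimal Python sketch of what a processor could do with author-supplied declarations. It is an illustration under stated assumptions, not a proposal for actual syntax: it handles only the simple internal form <!ENTITY name "text"> (not the SDATA variant shown above), the name pattern is a rough approximation of the XML Name production, and the helper names parse_declarations and expand are hypothetical.

```python
import re

# Only the simple internal form  <!ENTITY name "replacement">  is handled.
DECL_RE = re.compile(r'<!ENTITY\s+([A-Za-z_:][\w.:-]*)\s+"([^"]*)"\s*>')
# Rough approximation of "&" NAME ";" (an assumption of this sketch).
REF_RE = re.compile(r"&([A-Za-z_:][\w.:-]*);")


def parse_declarations(dtd_text):
    """Collect author-defined entities as a name -> replacement text mapping."""
    return {name: value for name, value in DECL_RE.findall(dtd_text)}


def expand(text, entities):
    """Replace references to declared entities; leave all others untouched."""
    return REF_RE.sub(lambda m: entities.get(m.group(1), m.group(0)), text)


declarations = parse_declarations(
    '<!ENTITY uq "ubiquitous">\n'
    '<!ENTITY co2 "CO<SUB>2</SUB>">'
)
print(expand("&uq; computing, &co2; and &TeX;", declarations))
```

References without a declaration, such as the one to TeX here, are passed through unchanged so that a later processing stage can still resolve them.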
What could be done?
-
The spec should require that (except in code spans etc.) the (simplified, XML-like) syntax
general entity reference = "&" , NAME , ";" ;
should be recognized as a general entity reference and treated “appropriately”, with NAME having the usual definition
NAME = NMSTART , { NMCHAR } ;
(see production [5] in the XML 1.0 spec).
Should the spec mandate, as it does now, that implementations be prepared to handle a NAME that is thousands of characters long? I think this puts an unreasonable burden on implementors for no recognizable gain. Note that the longest name in the WHATWG/HTML5 entity set is
CounterClockwiseContourIntegral
which consists of 31 characters. Mandating a “minimum maximum” length (that implementations must be able to handle NAMEs up to, say, 64 characters long) seems more practical.
-
What does “appropriate” treatment of entity references mean? In my opinion, this is largely a “quality of implementation” issue. What it does not and can not mean is that these references should be
[…] treated as equivalent to the corresponding Unicode characters.
— CommonMark spec version 0.27, section 6.2
as the spec unfortunately says (but certainly does not mean) now. The cases of
&lt;
and
&amp;
make this obvious.
-
As a minimum, the
&lt;
and
&amp;
references MUST be reproduced in “XML-ish” output (even replacing them with
<
and
&
respectively would be incorrect there!)
-
For other entity references, there is a range of sensible implementation behaviour:
-
Just reproduce it in the (XML/HTML/XHTML/SGML) output. This is the simplest to implement, and would suffice for the
&TeX;
and all other examples above.
-
Check the name against a list of “known” entity names; warn if the name is unknown, but reproduce the reference anyway (again, see the examples above).
-
If the name is a “known” one, also check it against a list of “known definitions”, and if found there, replace the entity reference accordingly in the output with the defined replacement text.
-
Whether or not an implementation has such lists of “known” and optionally “defined” entities, whether and how these lists can be provided or changed by the user—these are all implementation issues in my opinion.
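The range of behaviours listed above can be sketched in Python as follows. Everything here is an assumption made for illustration: the name pattern is an ASCII-only simplification of the XML Name production, the 64-character limit is the “minimum maximum” suggested above, and the function process with its known and definitions parameters is hypothetical. Text outside of entity references is passed through unchanged, so this is not a complete renderer.

```python
import re
import warnings

MAX_NAME_LEN = 64  # the "minimum maximum" suggested above (an assumption)

# ASCII-only simplification of  "&" NAME ";"  with NAME = NMSTART {NMCHAR}.
REF_RE = re.compile(r"&([A-Za-z_:][A-Za-z0-9_:.\-]*);")


def process(text, known=frozenset(), definitions=None):
    """Treat entity references for XML-ish output.

    known       -- optional set of names; unknown names only trigger a warning
    definitions -- optional name -> replacement text mapping
    """
    definitions = definitions or {}

    def treat(m):
        name = m.group(1)
        if len(name) > MAX_NAME_LEN:
            # Over the limit: treat as character data, escaping the ampersand.
            return m.group(0).replace("&", "&amp;", 1)
        if name in definitions:
            # A defined name: substitute the replacement text.
            return definitions[name]
        if known and name not in known:
            # Unknown name: warn, but reproduce the reference anyway.
            warnings.warn(f"unknown entity name: {name}")
        # Default: reproduce the reference as written.
        return m.group(0)

    return REF_RE.sub(treat, text)


print(process("&TeX; uses &lt; and &amp;"))
print(process("a &spot; b", definitions={"spot": "\u2981"}))
```

With no lists supplied, every reference is simply reproduced, which is the simplest of the behaviours; supplying known or definitions moves the same function to the stricter ones without changing how references are recognized.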
Summary
The important points are:
-
General entities are much more, well, “general” and useful than the spec sees them now, and
-
in particular they are not just stand-in “equivalents” to some Unicode characters.
-
The CommonMark specification and CommonMark implementations should not preclude this usefulness for no good reason.
-
Requiring implementations to handle NAMEs of unconstrained length places an unreasonable burden on implementors without achieving anything practically useful.
-
Requiring implementations to know about a fixed list of entity names also places an unreasonable burden on implementors without achieving anything practically useful. (To the contrary, it reduces the possible uses of implementations!)
-
The notion that entity references should be “treated as equivalent to the corresponding Unicode characters” is misleading at best, if not plain wrong.
-
As always: the specification should not shackle itself and thus authors to (whatever flavour of) HTML.
-
There are three distinct processing aspects that should be dealt with separately in the specification:
- Where do pieces of text (lexically) constitute an entity reference?
- Which (if not all) NAMEs are considered “valid”?
- What (if anything) is substituted for the entity reference?
-
Of these three aspects, only the first is the proper concern of the specification. The other two aspects are largely dependent on the specific application, document type, subsequent processing steps and tools, output format etc. The specification should probably give some guidelines and a “model” scenario to encourage interoperability between implementations, but should certainly not try to “nail everything down”.