Spec Issues: (Character) Entity References

tin-pot · November 29, 2016, 12:08pm

Entity References in CommonMark

In my opinion the way in which (character) entity references are treated in the CommonMark specification and—consequently—the way they are handled in implementations like cmark is currently unfortunate.

In the following I argue that this actually

reduces the usefulness of CommonMark as specified, and
encumbers the implementation of CommonMark processors,

while achieving no practical or significant gain as far as I can tell.

Below I also sketch some ideas on how this could be improved, and I hope this invites a useful discussion and subsequent changes in CommonMark.

There have been a few comments on roughly this topic before, but no thorough discussion that I know of.

Current situation

The CommonMark specification requires that

All valid HTML entity references […] are recognized as such and treated as equivalent to the corresponding Unicode characters.
— CommonMark spec version 0.27, section 6.2

What constitutes a “valid HTML entity reference” is then defined by reference to the WHATWG Entity Set, which is itself defined in a JSON file hosted on the WHATWG HTML web site.

This is declared to be

[…] an authoritative source for the valid entity references and their corresponding code points.
— CommonMark spec version 0.27, section 6.2

And (mostly) true to this wording, the reference implementation has this WHATWG list of entity names hard-coded and built into the executable, and will recognize exactly those names as entity references. (That is, “all valid HTML entities” in the spec should really be read as “all valid HTML entities and only these”!)

For example, &and; (being in the WHATWG set) is replaced by (meaning “treated as equivalent” to?) the UTF-8 encoding of U+2227 LOGICAL AND in the output, regardless of whether the processor generates XML, troff man, or LaTeX source.

On the other hand, &implies; (not being on the WHATWG list) is not even considered to be an entity reference but gets treated simply as character data, that is, it is output as &amp;implies; in XML (and similarly “escaped” for the other output formats).
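The behaviour described above can be imitated with a minimal Python sketch. It is not cmark's actual code: it assumes a simplified reference syntax and uses the HTML5 table that ships with Python's standard library (`html.entities.html5`) as a stand-in for the built-in WHATWG list. The function name `resolve_fixed` is hypothetical.

```python
import re
from html.entities import html5  # HTML5 named references shipped with Python

# Simplified pattern (an assumption): "&", an ASCII alphanumeric name, ";"
_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def resolve_fixed(text: str) -> str:
    """Replace references found in the fixed table; escape all others."""
    def repl(m: re.Match) -> str:
        char = html5.get(m.group(1) + ";")  # keys in this table include the ";"
        if char is not None:
            return char                     # known name: emit the character itself
        return "&amp;" + m.group(0)[1:]     # unknown name: plain character data
    return _REF.sub(repl, text)

print(resolve_fixed("A &and; B &implies; C"))
```

With such a fixed table the author has no way to make `&implies;` meaningful, which is the limitation discussed in the following sections.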

Note that the new “MD4C” implementation provides a “Markdown extension” option --fverbatim-entities for presumably this reason …

The spec mandates the wrong fixed Character Entity Set

While the WHATWG set—currently containing 2231 names—seems huge, and is in fact also the Entity Set specified in W3C HTML5, it obviously covers only a small fraction of even just the Unicode BMP.

I don’t know if W3C or WHATWG have any plans or mechanisms to extend this set in the future, or how to refer to different versions of it should this happen.

Curiously, the HTML5 spec includes MathML in section 4.7.14 and refers (currently) to MathML Ver. 3.0 2nd Edition. And MathML comes with its own, different Entity Set, namely the “combined HTML MathML entity set” maintained by W3C with the FPI

"-//W3C//ENTITIES HTML MathML Set//EN//XML".

In section 7.3 “Entity Declarations”, ISO/IEC 40314:2016 Information technology — Mathematical Markup Language (MathML) Version 3.0 2nd Edition (and the identical text in W3C MathML Version 3 2nd Ed) it is explained that (emphasis mine):

Earlier versions of this MathML specification included detailed listings of the entity definitions to be used with the MathML DTD. These entity definitions are of more general use, and have now been separated into an ancillary document, XML Entity Definitions for Characters [Entities]. The tables there list the entity names and the corresponding Unicode character references. That document describes several entity sets; not all of them are used in the MathML DTD. The MathML DTD references the combined HTML MathML entity set defined in [Entities].
— ISO/IEC 40314:2016, clause 7.3

So this “combined HTML MathML entity set” referred to as [Entities] is the normative one referenced in ISO/IEC 40314:2016.

This is a proper superset of the JSON-defined Entity Set of HTML5, and contains 112 more entity names, mostly from

"ISO 8879-1986//ENTITIES Greek Letters//EN"
"ISO 8879-1986//ENTITIES Alternative Greek Symbols //EN"

If the CommonMark spec sees a need to define an Entity Set which

contains “all the characters a Web author would ever need”,
comprises all entities available in HTML5 and related standards,
is defined in a recognized, stable, publicly available standard,
can be identified, referenced, and readily used in XML documents

then "-//W3C//ENTITIES HTML MathML Set//EN//XML" would be a better choice than “this JSON file on the WHATWG web site”.

(Note that the responsibility for public entity sets was transferred from ISO to W3C some time ago, so W3C is today the official body maintaining such entity sets—for better or worse.)

However, arguments about “the right” entity set to define in the CommonMark specification are in my opinion moot anyway, because the notion of such a “right” set is bogus to start with, and the specification should not mandate any such set.

Mandating a fixed Character Entity Set in the spec is wrong

Standards for public entity sets are a good thing, particularly because there are so many to choose from …

But selecting one such set and mandating it in the spec as “the valid set” of entity names is in my view a bad idea anyway, and misses the whole point and purpose of general entities. The following examples should make this clear.

Maybe I just don’t like the “official” entity name. Case in point would be propositional logic, where
```
A &and; B &implies; C
```
is much nicer and clearer than
```
A &and; B &rArr; C
```
The ability in XML/SGML to just define
```
<!ENTITY implies "&rArr;">
```
is very handy here; and it is also “officially” encouraged:

NOTE – If a different name would be more expressive in the context of a particular document, the entity can be redefined within the document.
— ISO 8879:1986, annex D.4.1.3
There are (a lot of!) Unicode code points for which there is no name defined in any public entity set. For example, maybe I’d like to use U+2981 Z NOTATION SPOT (introduced in Unicode 3.2), and refer to it with &spot;:
```
<!ENTITY spot "&#10625;">
```
Or, maybe when targeting HTML, defining this to
```
<!ENTITY spot "&bullet;">
```
or as a “definitional” entity
```
<!ENTITY spot SDATA "[spot  ]">
```
Other typical examples would be superscript and subscript digits: there is &sup3; for U+00B3 SUPERSCRIPT THREE (which, being in ISO 8859-1, has “always” been available in HTML), but one might also want to refer to U+2074 SUPERSCRIPT FOUR as &sup4;. Bad luck, in CommonMark you can’t (and neither in HTML5).
The need to mention carbon dioxide in texts is sadly commonplace nowadays. If an author wants the “proper” formula “CO₂” instead of the slightly wrong “CO2”, it would be very convenient to just use &co2; in the text, and define
```
<!ENTITY co2 "CO<SUB>2</SUB>">
```
or
```
<!ENTITY co2 "CO&#x2082;">
```
It is a common technique to use entity references to place “logos” in documents. Authors of tutorials might want to write about &TeX; and invoke by this entity reference the appropriate CSS/FO/LaTeX/groff/pixmap magic in their output document. In CommonMark they can’t.
Another common use for general entities is simply for “text macros”, for example I find the word “ubiquitous” pretty hard to spell and type, and would prefer to use say &uq;
```
<!ENTITY uq "ubiquitous">
```
when writing about “ubiquitous computing”, for example. Again, in CommonMark you can’t. (But of course I still had to first check that &uq; is not already defined in the WHATWG set …)

These are all examples for the intended and proposed uses of general entities:

References permit a number of useful techniques:

A short name can be used to refer to a lengthy or text string, or to one that cannot be entered conveniently with the available keyboard.

Parts of the document that are stored in separate system files can be imbedded.

Documents can be exchanged among different systems more easily because references to system-specific objects (such as characters that cannot be keyed directly) can be in the form of entity references that are resolved by the receiving system.

The result of a dynamically executed processing instruction (such as an instruction to retrieve the current date) can be imbedded as part of the document.

— ISO 8879:1986, annex B.6

but CommonMark precludes them all.
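As a rough illustration of what user-defined general entities would involve, here is a minimal Python sketch. It assumes the declarations are supplied separately from the CommonMark text, supports only the simple `<!ENTITY name "text">` form, and leaves every undeclared reference untouched. The function names and the recursion limit are assumptions of this sketch, not part of any existing processor.

```python
import re

# Only the simple internal-entity form is handled; SDATA and external entities are not.
_DECL = re.compile(r'<!ENTITY\s+([A-Za-z_:][\w.:-]*)\s+"([^"]*)"\s*>')
_REF = re.compile(r"&([A-Za-z_:][\w.:-]*);")

def parse_declarations(decls: str) -> dict[str, str]:
    """Collect name -> replacement text from a block of ENTITY declarations."""
    return {name: text for name, text in _DECL.findall(decls)}

def expand(text: str, table: dict[str, str], depth: int = 8) -> str:
    """Expand declared entities; the depth limit guards against self-reference."""
    def repl(m: re.Match) -> str:
        name = m.group(1)
        if name not in table or depth == 0:
            return m.group(0)               # undeclared: pass through verbatim
        return expand(table[name], table, depth - 1)
    return _REF.sub(repl, text)

table = parse_declarations('<!ENTITY implies "&rArr;">\n<!ENTITY uq "ubiquitous">')
print(expand("A &and; B &implies; C, &uq; computing", table))
```

The expanded text still contains `&and;` and `&rArr;`, so resolving those is left to whatever consumes the output.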

What could be done?

The spec should require that (except in code spans etc.) the (simplified, XML-like) syntax
```
general entity reference = "&" , NAME , ";" ;
```
should be recognized as a general entity reference and treated “appropriately”, with NAME having the usual definition
```
NAME = NMSTART , { NMCHAR } ;
```
(see production [5] in the XML 1.0 spec).

Should the spec mandate, as it does now, that implementations be prepared to handle a NAME that is thousands of characters long? I think this puts an unreasonable burden on implementors for no recognizable gain. Note that the longest name in the WHATWG/HTML5 entity set is CounterClockwiseContourIntegral, consisting of 31 characters. Mandating a “minimum maximum” length (that is, implementations must be able to handle NAMEs up to, say, 64 characters long) seems more practical.
What does “appropriate” treatment of entity references mean? In my opinion, this is largely a “quality of implementation” issue. What it does not and can not mean is that these references should be

[…] treated as equivalent to the corresponding Unicode characters.
— CommonMark spec version 0.27, section 6.2

as the spec unfortunately says (but certainly does not mean) now. The cases of &lt; and &amp; make this obvious.
As a minimum, the &lt; and &amp; references MUST be reproduced in “XML-ish” output (even replacing them with < resp. & there would be incorrect!)
For other entity references, there is a range of sensible implementation behaviour:
- Just reproduce it in the (XML/HTML/XHTML/SGML) output. This is the simplest to implement, and would suffice for the &TeX; and all other examples above.
- Check the name against a list of “known” entity names; warn if the name is unknown, but reproduce the reference anyway (again, see the examples above).
- If the name is a “known” one, also check it against a list of “known definitions”, and if found there, replace the entity reference accordingly in the output with the defined replacement text.
Whether or not an implementation has such lists of “known” and optionally “defined” entities, whether and how these lists can be provided or changed by the user—these are all implementation issues in my opinion.
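The range of behaviours listed above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a proposal for a concrete API: the function `render_refs`, its `known` and `definitions` parameters, and the ASCII-only name pattern are all hypothetical.

```python
import re
import warnings

# ASCII-only approximation of the NAME syntax (an assumption of this sketch)
_REF = re.compile(r"&([A-Za-z_:][A-Za-z0-9_:.-]*);")

def render_refs(text, known=None, definitions=None):
    """known=None: reproduce everything; otherwise warn on unknown names.
    definitions: optional replacement texts for known names."""
    definitions = definitions or {}
    def repl(m):
        name = m.group(1)
        if name in ("lt", "amp"):
            return m.group(0)               # always reproduced in XML-ish output
        if known is not None and name not in known:
            warnings.warn(f"unknown entity name: {name}")
            return m.group(0)               # warn, but reproduce the reference anyway
        return definitions.get(name, m.group(0))
    return _REF.sub(repl, text)

print(render_refs("&TeX; is &lt; fun"))
print(render_refs("A &and; B", known={"and"}, definitions={"and": "\u2227"}))
```

Called without any lists, the function simply reproduces every reference, which corresponds to the simplest behaviour in the list.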

Summary

The important points are:

General entities are much more, well, “general” and useful than the spec sees them now, and
in particular they are not just stand-in “equivalents” to some Unicode characters.
The CommonMark specification and CommonMark implementations should not preclude this usefulness for no good reason.
Requiring implementations to handle NAMEs of unconstrained length places an unreasonable burden on implementors without achieving anything practically useful.
Requiring implementations to know about a fixed list of entity names also places an unreasonable burden on implementors without achieving anything practically useful. (To the contrary, it reduces the possible uses of implementations!)
The notion that entity references should be “treated as equivalent to the corresponding Unicode characters” is misleading at best, if not plain wrong.
As always: the specification should not shackle itself and thus authors to (whatever flavour of) HTML.
There are three distinct processing aspects that should be dealt with separately in the specification:
1. Where do pieces of text (lexically) constitute an entity reference?
2. Which (if not all) NAMEs are considered “valid”?
3. What (if anything) is substituted for the entity reference?
Of these three aspects, only the first is the proper concern of the specification. The other two aspects are largely dependent on the specific application, document type, subsequent processing steps and tools, output format etc. The specification should probably give some guidelines and a “model” scenario to encourage interoperability between implementations, but should certainly not try to “nail everything down”.

jgm · November 29, 2016, 3:12pm

This is useful. Let me summarize several separate suggestions/questions here:

If the spec requires entity resolution for a range of entities, we should at least use the larger list at http://www.w3.org/2003/entities/2007/htmlmathml.ent
A good case can be made for just limiting the spec to identification of entities (without mandating that they be resolved in any particular way), and for reducing the maximum length. This would reduce the burden on conforming implementations and provide more flexibility.
People might want to define custom entities in CommonMark files, using <!ENTITY...>. So there’s a question whether conforming parsers should handle these appropriately, e.g. by constructing a custom entity table to use in parsing. Alternatively nothing could be said about this; it could be up to implementations to do this if they wanted to. Note that this flexibility would mean that behavior for certain inputs was not defined, even up to normalization.

If we went with (2), then probably implementations that construct an AST would need a special Entity node type. I avoided this before because the concept of Entity is XML/HTML-centric and seemed a bit odd in an abstract representation of a document that might be rendered in any number of formats. It would put the burden on renderers (or some intermediate filtering step) to resolve the entities in formats where they can’t be passed through. But maybe this is the way to go.

Comments from others welcome.

tin-pot · November 29, 2016, 4:07pm

Thanks for your reply!

Yes, this seems more appropriate (in the sense of “universal” and “stable”) than the WHATWG/W3C HTML5 set.

We seem to agree here.

I admit that it is convenient to have implementations that “know out of the box” about common entity sets like the HTML5 or MathML or HTML 4.01 or ISO 15445 ones, so that they can perform the “checking” (aka validating) and “substitution” functions I mentioned. And it would probably be wise if CommonMark “recommends” some minimal set of such entity names (though I’m not sure about this). But hard-wiring this set into the implementation (or even the spec) seems too inflexible for my taste.

But it is IMO not obvious what, if anything, an implementation should substitute for “known” entity references. What is gained when HTML output contains UTF-8 encoded characters instead of HTML5 or HTML 4.01 entity names? What if I want ISO 8859-1 or even ISO 646-IRV encoding of my generated HTML? What if I want &auml; mapped to U+00E4 “ä” (since that character is available “everywhere”), but leave &CounterClockwiseContourIntegral; alone, since my editor has trouble handling it, or lacks an appropriate font (let alone my trouble entering this as a Unicode character)?

And if for example LaTeX output is desired, it seems to be much easier to have a LaTeX-specific definition for &infin; along the lines of

<!ENTITY infin "\infty"> <!-- LaTeX control word for U+221E INFINITY -->

than to first insert a literal U+221E into the LaTeX text (hope you’re using XeLaTeX …;-)) and then struggle how to map this into the proper CMSY font.

So Unicode/UTF-8 might not be ideal for all “downstream” processing after all.
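A minimal Python sketch of such format-specific definitions follows. Only the LaTeX replacement text is taken from the example above; the table layout, the format keys, and the function name `render` are assumptions made for illustration.

```python
import re

# Hypothetical per-format definition tables; only the LaTeX entry comes from the post.
DEFINITIONS = {
    "latex": {"infin": r"\infty"},
    "utf8": {"infin": "\u221e"},
}
_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def render(text: str, fmt: str) -> str:
    """Substitute references using the table for fmt; leave all others verbatim."""
    table = DEFINITIONS.get(fmt, {})
    # A function is used as the replacement so that backslashes are taken literally.
    return _REF.sub(lambda m: table.get(m.group(1), m.group(0)), text)

print(render("n < &infin;", "latex"))
print(render("n < &infin;", "html"))
```

A format without a table (here `"html"`) just gets the reference passed through, leaving the resolution to the consumer of the output.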

Technically, one can’t use ENTITY markup declarations in CommonMark like this, because they MUST occur in the internal subset of the generated XML/HTML/SGML document. As far as I understand CommonMark, everything it produces goes into the document instance set (unless a custom-tailored implementation is “smart” enough to keep them apart). So the only “legal” markup declarations in CommonMark would be <!USEMAP ... > and <!USELINK ...> anyway (apart from comment declarations, of course).

But the “dumb” behaviour (like MD4C’s md2html with the --fverbatim-entities option, which is what most other Markdown processors also do) is quite useful already in this case, since a reference to the appropriate entity set in the output document is trivial to insert (even with sed).

Speaking of custom entities in CommonMark: This is a most important and interesting point, and both external and internal user-defined entities could be added seemingly with minimal disruption (obviating cruft like <!ENTITY ...> declarations in CommonMark text). See the discussion about “transclusion”.

In this case (and only in this or a very similar case) it would IMO be worthwhile for a CommonMark processor to maintain an explicit “entity table”, where each name would map to

a Unicode character, or
some other (user-defined) replacement text, or
a URL (user-defined, for “transclusion”), or
to nothing (just indicating a “known” or “valid” entity name).
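One way to picture such an entity table is the following Python sketch. The `Entry` class, the `describe` function, and the sample names (including the file name `intro.md`) are hypothetical and serve only to show the four kinds of mapping.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Entry:
    """One table entry; all fields empty means 'known name, no replacement'."""
    char: Optional[str] = None   # a Unicode character
    text: Optional[str] = None   # user-defined replacement text
    url: Optional[str] = None    # user-defined target for transclusion

def describe(name: str, table: dict) -> tuple:
    """Return (kind, value) for a name, or ('unknown', None) if not in the table."""
    if name not in table:
        return ("unknown", None)
    entry = table[name]
    if entry.char is not None:
        return ("char", entry.char)
    if entry.text is not None:
        return ("text", entry.text)
    if entry.url is not None:
        return ("transclude", entry.url)
    return ("known", None)

TABLE = {
    "and": Entry(char="\u2227"),
    "uq": Entry(text="ubiquitous"),
    "intro": Entry(url="intro.md"),   # hypothetical file name
    "TeX": Entry(),                   # merely a valid name
}
print(describe("uq", TABLE))
```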

mity · November 30, 2016, 10:17am

Well, the truth is that the motivation was more technical: to make the core of MD4C encoding-agnostic. Therefore the translation of entities into the output encoding was left to the renderer, which should in general know more about the output encoding. This was implemented before any Unicode support, which eventually got in for, e.g., the Unicode case folding needed to resolve reference links.

So the mentioned command line option --fverbatim-entities only affects the renderer, not the parser.

The outcome that the current design allows the renderer to support only a subset (or superset) of the entities is more a side effect than an intended goal. But I understand it may sometimes be useful.

FYI, currently the parser sees anything matching the regexp &[a-zA-Z][a-zA-Z0-9]{1,47}; as a (potential) named entity and passes it to the renderer as text of type MD_TEXT_ENTITY. The renderer may translate it to something or output it verbatim as it sees fit.
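The pattern can be tried out in Python, reading the repetition count as “between 1 and 47 further characters”. This is only a transliteration of the regexp quoted above, not MD4C's actual C code, and the helper name `find_entities` is made up for the example.

```python
import re

# One ASCII letter followed by 1 to 47 ASCII letters or digits, i.e. names of 2 to 48 characters
NAMED_ENTITY = re.compile(r"&[a-zA-Z][a-zA-Z0-9]{1,47};")

def find_entities(text: str) -> list:
    """Return every substring that would be passed on as a potential named entity."""
    return NAMED_ENTITY.findall(text)

print(find_entities("a &amp; b &x; &b.Delta; &CounterClockwiseContourIntegral;"))
```

Under this reading, one-letter names and names containing `.` are not treated as entities and stay ordinary text.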

The second point is that I will likely need to revisit the approach to deal with situations that do not fit into the current interface. Due to these limitations, entities are handled correctly only in normal text flow, and not in link/image URLs or titles as they should be. But so far I have no idea what the solution should look like, so don’t ask.

mity · November 30, 2016, 11:08am

Not that trivial if you consider entities inside a code span or code block should not be expanded.

BTW, should they be expanded in raw HTML? I guess not, but then this sentence in the spec. should likely be updated:

Entity and numeric character references are recognized in any context besides code spans or code blocks, including URLs, link titles, and fenced code block info strings

Crissov · November 30, 2016, 4:53pm

I just want to remind everyone that the character substitution feature many authors are currently more likely to encounter is emoji “short names” or “short codes”.

https://discourse.wicg.io/t/named-emoji-entities-or-short-names/1636

tin-pot · November 30, 2016, 6:27pm

I didn’t want to say that handling of entity references would be trivial—but only that a fixed entity set does not make it easier. In which contexts (not in code spans, for example) an entity reference should be recognized as such or should be treated as character data is independent from the specific name of the entity and from the replacement text, if any.

Talking about “recognizing an entity reference” is a bit misleading, because basically this means the opposite for the implementor than it does for the user:

A lexical item that “looks like an entity reference” in a code span in CommonMark actually must be recognized by the processor and must be “escaped”, so that later the user (ie the user’s browser, say) does not recognize it as an entity reference;
And vice versa: Whether or not some substrings in an HTML block “look like an entity reference” can be ignored by the processor, because these will be recognized later by the user (in her browser or similar tool).

At least that’s how I understand this confusion …

tin-pot · November 30, 2016, 6:43pm

I see, that’s a good reason too. And it also makes (or could make) the output of md2html encoding-agnostic: to produce ASCII output from ASCII input, it would suffice to use an entity set where all replacement texts are numeric character references.
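A minimal Python sketch of an entity set whose replacement texts are all numeric character references is shown below. It derives the set from the HTML 4.01 names in Python's standard `html.entities.name2codepoint` table, which is an assumption made for brevity; the names `NUMERIC_SET` and `to_numeric` are hypothetical, and the markup-significant references are deliberately left alone.

```python
import re
from html.entities import name2codepoint  # HTML 4.01 names -> code points

_KEEP = {"lt", "gt", "amp", "quot"}  # markup-significant references stay as they are
NUMERIC_SET = {
    name: f"&#{cp};" for name, cp in name2codepoint.items() if name not in _KEEP
}
_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def to_numeric(text: str) -> str:
    """Rewrite named references into numeric ones; unknown names pass through."""
    return _REF.sub(lambda m: NUMERIC_SET.get(m.group(1), m.group(0)), text)

print(to_numeric("A &and; B &lt; C &auml;"))
```

Because every replacement text is pure ASCII, ASCII input yields ASCII output regardless of which characters the references denote.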

I have been secretly cloning and studying your code already, so thanks for your work! I quite like how much smaller this implementation is, and hope to experiment with it a bit more.

To nitpick based on your remark: an entity name is just a NAME (in the SGML/XML sense), so the proper syntax (restricted to ISO 646-IRV) would in my opinion be

entity reference = "&" , NAME , ";" ;
NAME = NMSTART , { NMCHAR } ;
NMSTART = "a".."z" | "A".."Z" | ":" | "_" ;
NMCHAR = NMSTART | "0".."9" | "-" | "." ;

(There actually are entity names like b.Delta …)
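The grammar above translates directly into a regular expression. The following Python sketch is restricted to ISO 646-IRV (ASCII) exactly as the productions are; the helper name `entity_names` is an assumption.

```python
import re

NMSTART = r"[A-Za-z:_]"
NMCHAR = r"[A-Za-z:_0-9.-]"
# entity reference = "&" , NAME , ";"  with  NAME = NMSTART , { NMCHAR }
ENTITY_REF = re.compile(rf"&({NMSTART}{NMCHAR}*);")

def entity_names(text: str) -> list:
    """Return the NAME of every entity reference found in text."""
    return ENTITY_REF.findall(text)

print(entity_names("&b.Delta; &x; &1a; &_y-z;"))
```

Unlike a purely alphanumeric pattern, this accepts names such as `b.Delta`, while `&1a;` is rejected because a digit cannot start a NAME.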

tin-pot · November 30, 2016, 7:07pm

Looks like SGML short references all over again to me, where you would map strings (short reference delimiters) to entity names; an occurrence of such a string in character content is then equivalent to referencing the designated entity, which then gets replaced in the usual way.

So this can be used to not only give single characters “short names”, but insert basically anything: tags, phrases, elements like images, subdocuments.

I think the interesting problem is how to reasonably limit the scope and context where such substitutions happen (collecting them into distinct “maps”, and make these “maps” active only in specific elements or through <!USEMAP ...> declarations is how short references can be tamed).

Alternatively it is easy to just do stuff like this in a preprocessing step. In my CommonMark processor based on cmark I have a simple digraph-substituting preprocessor, so one can feel a bit like in groff and type \Co to enter a COPYRIGHT SIGN (even in CommonMark code sections obviously …). Not extremely useful in my experience, but I think a viable approach for things like this Emoji desire.
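Such a preprocessing step can be very small. The following Python sketch is not the cm2doc code: only the `\Co` digraph is taken from the description above, the second table entry and the function name `preprocess` are made up for illustration.

```python
import re

DIGRAPHS = {
    "\\Co": "\u00a9",  # COPYRIGHT SIGN, as in the post
    "\\Rg": "\u00ae",  # REGISTERED SIGN, a hypothetical further entry
}
_PATTERN = re.compile("|".join(re.escape(key) for key in DIGRAPHS))

def preprocess(text: str) -> str:
    """Substitute digraphs everywhere, before the CommonMark parser sees the text."""
    return _PATTERN.sub(lambda m: DIGRAPHS[m.group(0)], text)

print(preprocess("\\Co 2016 tin-pot"))
```

Since the substitution runs before parsing, it also applies inside code spans and code blocks, which is the drawback mentioned above.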

mity · November 30, 2016, 7:28pm

Well, MD4C is new. Very new. I am still working on MD4C itself, and for now I see md2html more as a tool how to test MD4C, especially its (still incomplete) compliance with CommonMark specification. As the CommonMark test suite assumes UTF-8 output, md2html has to produce UTF-8 output (at least as the default option).

I don’t think I will have time/motivation in the foreseeable future for expanding md2html into general purpose tool for very broad audience, providing plethora of options or features: Even after MD4C gets some robustness, stability of API and implementation, I will rather work on incorporating it within my other projects.

That said, however, I am very open to welcome and accept any pull requests improving MD4C, md2html, adding new tool e.g. to convert Markdown to other formats, or any other initiative making the project useful to more people.

tin-pot · November 30, 2016, 7:50pm

I myself have lately been working again on my cm2doc tool, which is based on the cmark reference implementation. It is intended to be (or rather: grow into) such a tool, in particular to generate output in various forms based on a templating mechanism, and to provide a way to incorporate “foreign syntaxes” (like ASCIIMath for example) into CommonMark, all using a CSS-like configuration text file rather than compiling specialized “renderers” on each occasion.

I’m very tempted to “port” this stuff atop MD4C, if only for code size, getting rid of the 400k source file generated by re2c, and the cleaner and narrower API …

It is still very much in flux, needs testing and documenting etc. But making the transition to MD4C, at least as an experiment, should not be too much effort (developer’s last words …;-))

tin-pot · November 30, 2016, 10:16pm

I forgot in the above to point out one more good reason why a CommonMark (or Markdown) processor should not be required to indiscriminately replace all character entity references, and not even all numeric character references. This reason has nothing to do with encodings, fonts, or “exotic” Unicode characters:

When a CommonMark author writes, for example, an entity reference such as &verbar; or a numeric character reference such as &#124; instead of a literal | for the U+007C VERTICAL LINE in some text, he probably has a reason for making this distinction: very likely, the literal | character will have some significance in a post-processing tool (maybe to separate columns of some sort), which the reference will not entail. (Note that this is the very same distinction as between [ and \[ in CommonMark.)

And it seems quite unhelpful to mandate that a CommonMark processor, when generating some sort of XML/HTML/SGML/DocBook etc. output, should obliterate this distinction, thereby making such post-processing much harder if not impossible.

So a simple recommendation would be that at least entity references and character references to ASCII characters should be preserved and reproduced in the processor’s output (at the user’s option, maybe?).
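This recommendation could look like the following Python sketch, which resolves references with the standard `html.unescape` function but reproduces any reference that denotes only ASCII characters. The function name `resolve_non_ascii` and the choice of `html.unescape` as resolver are assumptions of this example.

```python
import html
import re

# Decimal, hexadecimal, and (simplified) named references
_REF = re.compile(r"&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def resolve_non_ascii(text: str) -> str:
    """Resolve references, except those that stand for ASCII characters."""
    def repl(m: re.Match) -> str:
        ref = m.group(0)
        value = html.unescape(ref)
        if value == ref or value.isascii():
            return ref      # unknown name or ASCII target: keep the reference
        return value        # anything else: substitute the character
    return _REF.sub(repl, text)

print(resolve_non_ascii("a &#124; b &verbar; c &and; d | e"))
```

A literal `|` and a referenced one thus remain distinguishable for later processing steps.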

jgm · December 2, 2016, 11:17am

I’m inclined to make changes along these lines.
I’ve opened an issue to keep track:

Details still need working out.