Spec Issues: (Character) Entity References

tin-pot · November 30, 2016, 10:16pm

I forgot in the above to point out one more good reason why a CommonMark (or Markdown) processor should not be required to undiscriminatingly replace all character entity references and not even all numeric character references, and this has nothing to do with encodings, fonts, or “exotic” Unicode characters:

When a CommonMark author writes, for example, | or | instead of | for the U+007C VERTICAL LINE in some text, he probably has a reason for making this distinction: Very likely, the literal | character will have some significance in a post-processing tool (maybe to separate columns of some sort), which the | will not entail. (Note that is the very same distinction between [ and \[ in CommonMark.)

And it seems quite unhelpful to mandate that a CommonMark processor, when generating some sort of XML/HTML/SGML/DocBook etc. output, should obliterate this distinction, hereby making such post-processing much harder if not impossible.

So a simple recommendation would be that at least entity references and character references to ASCII characters should be preserved and reproduced in the processor’s output (at the user’s option, maybe?).