Use escaped space as &nbsp;

tin-pot · November 4, 2015, 8:14am

Thinking about treating “\”,SP to mean a NBSP:

This would also contradict the established meaning of the backslash-escape in CommonMark, which is to tell the processor something along the lines of:

The next character is to be taken literally, and does not count as mark-up.

In the words of the current CommonMark specification 0.22:

Escaped characters are treated as regular characters and do not have their usual Markdown meanings: […]

So again, re-using the backslash-escape to introduce a notation for NBSP would not be a good idea.

What might be useful, in my view, is a more general notation: introducing backslash-sequences to denote UCS characters by their code points (in hexadecimal):

Here a \xA0 NBSP.
And here a \u00A0 NBSP.
And another \U000000A0 NBSP.

[That would match the C99 (Java? JavaScript?) syntax and meaning for denoting Unicode characters in character string constants, I think, or rather hope.]
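For comparison only: Python string literals accept the same three spellings, so the intended meaning can be checked quickly there. This says nothing about C99, Java, or JavaScript, and the variable names are just illustrative.

```python
import unicodedata

# The three proposed spellings, written as Python string literals.
nbsp_x = "\xA0"
nbsp_u = "\u00A0"
nbsp_U = "\U000000A0"

# All three denote the same single character, U+00A0.
print(nbsp_x == nbsp_u == nbsp_U)   # True
print(unicodedata.name(nbsp_x))     # NO-BREAK SPACE
```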

So far this does not buy us any advantage over the current possibility to write numeric character references in CommonMark:

Here a &#160; NBSP.
And here a &#xA0; NBSP.

This is because the current cmark implementation replaces these references with their respective UCS characters in the output (XML/HTML/XHTML etc.), and does so early enough that you can even write mark-up with character references:

  - Item one.
  &#45; Item two.

and cmark will recognize the &#45; as a HYPHEN-MINUS introducing a list item just fine.
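To illustrate why early replacement has this effect, here is a small sketch using Python's standard `html.unescape` as a stand-in for the reference-resolution step. It is not cmark's code, and the variable names are only illustrative. Once the reference is resolved, the line is indistinguishable from one written with a literal hyphen, so any later mark-up scan would treat both the same way.

```python
import html

with_reference = "&#45; Item two."
with_literal = "- Item two."

# Resolve the numeric character reference before any mark-up scanning.
resolved = html.unescape(with_reference)

# After resolution the two lines are identical.
print(resolved == with_literal)   # True

# The decimal and hexadecimal forms name the same code point, U+00A0.
print(html.unescape("&#160;") == html.unescape("&#xA0;") == "\u00a0")   # True
```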

To introduce two different meanings for numeric character references and backslash-coded characters one could use these two rules:

Numeric character references like &#45; are not recognized as mark-up, but are shipped out literally (into HTML/XML/XHTML output; LaTeX output would receive an equivalent notation);
Backslash-escapes for Unicode characters like

        \x12
        \u1234
        \U12345678

are treated just as an encoding of input text: they are replaced with their respective Unicode characters even before the input text is scanned for CommonMark mark-up. Therefore, e.g., a \x2D in the input text is exactly equivalent to writing a - (U+002D HYPHEN-MINUS) directly.
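A minimal sketch of what such a pre-scan decoding pass could look like, in Python. Everything here is an assumption for illustration: the function name `decode_backslash_codes` is hypothetical, the fixed digit counts (2, 4, and 8) are inferred from the examples above, and leaving invalid code points untouched is one possible choice among several. It also ignores how an escaped backslash in front of such a sequence should behave.

```python
import re

# Hypothetical pre-scan decoder; not part of CommonMark or cmark.
# Assumes exactly 2, 4, or 8 hexadecimal digits after \x, \u, \U.
_BACKSLASH_CODE = re.compile(
    r"\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))"
)


def decode_backslash_codes(text: str) -> str:
    r"""Replace \xHH, \uHHHH and \UHHHHHHHH with the characters they denote."""

    def replace(match: re.Match) -> str:
        digits = match.group(1) or match.group(2) or match.group(3)
        code_point = int(digits, 16)
        # Assumption: values beyond U+10FFFF and surrogates are left as written.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return _BACKSLASH_CODE.sub(replace, text)


# The decoded line starts with a plain HYPHEN-MINUS, so a mark-up scan
# that runs afterwards would see an ordinary list marker.
print(decode_backslash_codes(r"\x2D Item two."))   # - Item two.
```

Running the mark-up scanner only on the result of this pass is what would make \x2D "exactly equivalent" to a literal hyphen, in contrast to a numeric character reference under the first rule.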

It is my understanding that the CommonMark specification would need no change with respect to numeric character references, but an addition to define backslash-coded characters (or whatever the terminology should be).

[NOTE: Alas, the specification talks in terms of “HTML Entities” and, worse, about storing Unicode characters in the AST, or the kind of output that renderers receive: all of this has no place in a CommonMark specification IMO. The current wording has multiple flaws (independent of the meaning it tries to convey):

too HTML-centric (numeric character references are at the core of HTML and XML, both of which inherited them from SGML);
wrong in the sense that e.g. “&amp;” is not an entity, but an entity reference,
misleadingly wrong in the case of “&#45;”, which is not even an entity reference, but a (numeric) character reference,
too implementation-centric: what is relevant (for CommonMark authors and implementors) is whether the CommonMark parser “sees” the replaced characters or the literal character references: it turns out to be the former, but this has nothing to do with “storing Unicode characters in the AST”, or what kind of output a “renderer receives”.

I think that the CommonMark specification should talk only about the syntactic conventions used in CommonMark as far as possible, and refrain from prescribing implementation details.]

Are there scenarios where this definition (or: distinction) would turn out to be helpful?
