Thinking about treating “\” followed by SP (a space) to mean a NBSP:
This would also contradict the established meaning of the backslash-escape in CommonMark, which is to tell the processor something along the lines of:
The next character is to be taken literally, and does not count as mark-up.
In the words of the current CommonMark specification 0.22:
Escaped characters are treated as regular characters and do not have their usual Markdown meanings: […]
So again, re-using the backslash-escape to introduce a notation for NBSP would not be a good idea.
What might be useful, in my view, is a more general notation: backslash-sequences that denote UCS characters by their code points (in hexadecimal):
Here a \xA0 NBSP.
And here a \u00A0 NBSP.
And another \U000000A0 NBSP.
[That would match the C99 (Java? JavaScript?) syntax and meaning for denoting Unicode characters in character string constants, I think, or rather hope.]
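As a quick sanity check of the three notations: Python string literals happen to accept the same `\x`, `\u` and `\U` forms, so the following small sketch shows that all three spell the same code point. It only illustrates the notation and says nothing about how a CommonMark processor would treat it.

```python
import unicodedata

# Three spellings of U+00A0 in Python string literals,
# mirroring the \x, \u and \U forms proposed above.
nbsp_x = "\xA0"
nbsp_u = "\u00A0"
nbsp_U = "\U000000A0"

# All three literals denote the very same one-character string.
all_equal = nbsp_x == nbsp_u == nbsp_U

# The Unicode character name confirms it is the no-break space.
name = unicodedata.name(nbsp_x)
```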
So far this offers no advantage over what CommonMark already allows, namely writing numeric character references:
Here a &#xA0; NBSP.
And here a &#160; NBSP.
This is because the current cmark implementation replaces these references with their respective UCS characters in the output (XML/HTML/XHTML etc.), and does so early enough that you can even write mark-up with character references:
&#x2D; Item one.
&#x2D; Item two.
and cmark will recognize the &#x2D; as a HYPHEN-MINUS introducing a list item just fine.
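To make the "replaced early" behaviour described above concrete, here is a minimal sketch that decodes character references before looking for a list marker. It uses Python's standard `html.unescape`. The helper name `decode_char_refs` and the spelling `&#x2D;` for the reference are assumptions chosen for illustration. This is not cmark's actual code, only a model of the ordering.

```python
import html

def decode_char_refs(line: str) -> str:
    """Replace character references with the characters they denote."""
    return html.unescape(line)

# A line that starts with a numeric character reference for U+002D.
source = "&#x2D; Item one."

# Step 1: decode references first (the "early" replacement).
decoded = decode_char_refs(source)

# Step 2: only then check for a bullet list marker.
starts_list_item = decoded.startswith("- ")
```

If the two steps were swapped, the marker check would see `&` as the first character and the line would not be taken for a list item.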
To give numeric character references and backslash-coded characters two different meanings, one could use these two rules:
- Numeric character references like &#x2D; are not recognized as mark-up, but are shipped out literally (into HTML/XML/XHTML output; LaTeX output would receive an equivalent notation);
- Backslash-escapes for Unicode characters like \x12, \u1234 and \U12345678 are treated just as an encoding of the input text: they are replaced with their respective Unicode characters even before the input text is scanned for CommonMark mark-up. Therefore, e.g., a \x2D in the input text is exactly equivalent to writing a - (U+002D HYPHEN-MINUS) directly.
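The second rule amounts to a preprocessing pass over the raw input. A minimal sketch, under stated assumptions: the function name `decode_backslash_codes` is hypothetical, the digit counts are fixed at 2, 4 and 8 hexadecimal digits, a doubled backslash is left alone, and values that are not valid Unicode scalar values are left untouched. None of this is specified anywhere; it is just one way the rule could be read.

```python
import re

# Either an escaped backslash (kept as-is), or one of the three
# proposed forms with a fixed number of hexadecimal digits.
_CODE = re.compile(
    r"\\\\|\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))"
)

def decode_backslash_codes(text: str) -> str:
    """Replace \\xHH, \\uHHHH and \\UHHHHHHHH with the characters they denote.

    Intended to run before any scanning for CommonMark mark-up.
    """
    def replace(match):
        digits = match.group(1) or match.group(2) or match.group(3)
        if digits is None:
            # Matched a doubled backslash: not a code, leave it alone.
            return match.group(0)
        code_point = int(digits, 16)
        # Leave out-of-range values and surrogates untouched (an assumption).
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return _CODE.sub(replace, text)
```

With this pass in front of the parser, `decode_backslash_codes(r"\x2D Item one.")` returns `"- Item one."`, so the parser would see an ordinary list item, whereas a numeric character reference would pass through this step unchanged.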
It is my understanding that the CommonMark specification would need no change with respect to numeric character references, but an addition to define backslash-coded characters (or whatever the terminology should be).
[NOTE: Unfortunately, the specification talks in terms of “HTML Entities” and, worse, about storing Unicode characters in the AST, or the kind of output that renderers receive: all of this has no place in a CommonMark specification IMO. The current wording has multiple flaws (independent of the meaning it tries to convey):
- too HTML-centric (numeric character references are at the core of HTML and XML, namely inherited from SGML);
- wrong in the sense that e.g. “&amp;” is not an entity, but an entity reference;
- misleadingly wrong in the case of “&#x2D;”, which is not even an entity reference, but a (numeric) character reference;
- too implementation-centric: what is relevant (for CommonMark authors and implementors) is whether the CommonMark parser “sees” the replaced characters or the literal character references: it turns out the former, but this has nothing to do with “storing Unicode characters in the AST”, or what kind of output a “renderer receives”.
I think that the CommonMark specification should talk only about the syntactic conventions used in CommonMark as far as possible, and refrain from prescribing implementation details.]
Are there scenarios where this definition (or: distinction) would turn out to be helpful?