Use escaped space as &nbsp;

In LaTeX, my main use of non-breaking spaces is in cross-references, e.g. to avoid something like

...
and as you can see from fig. 
3 there are appropriate
...

In such cases writing &nbsp; would be undesirable; on the other hand, such references should probably be generated by some extension anyway.

@dikmax What use-case did you have in mind?

You mean in English there are very few instances. But Markdown can be used for other languages too!

I write a blog in Russian and insert non-breaking spaces hundreds of times in every post for better typography and readability.
Writing &nbsp; every time would be a pain for me.

Here’s a sample Markdown source file, for example: https://github.com/dikmax/dikmax.name/blob/master/post/2015/2015-11-02-satrip-2015-to-bolivia.md


@dikmax, good to know.

I think using NBSP has primarily to do with whether you care about typography or not: in my Markdown texts I use NBSP pretty often, and I write in English or German.

Consider cases like these (using the tilde “~” as a “visible” NBSP, which btw I would prefer much over the “\” followed by a SP which the OP suggested):

25.4~mm is 1.00~in.
[...] in section~1. But in the following [...]
Up to~3 spaces at the beginning of a line.
"The Hobbit" is a novel by J~R~R Tolkien.
There is~- or is not~- an example for this.
There is~-- or is not~-- an example for this.
There is---or is not---an example for this.

[The last three lines show different typescript styles for entering what in printed text would be an EN DASH or an EM DASH, respectively; or, put differently, for prompting the --smart option of cmark to produce such dashes.]

The purpose of each of these NBSP uses is obvious: to prevent the character(s) after it from ending up at the beginning of a line, or the character(s) before it from ending up at the end of a line. This applies to both kinds of line-breaking involved:

  1. Reformatting the Markdown plain text itself (which I do regularly);
  2. Formatting the output of processing Markdown, eg line-breaking in a browser.

Without NBSP, we risk that said characters will end up at places where they either look ugly, or even confuse the Markdown processor (like in the case of HYPHEN-MINUS followed by SP).


So yes: NBSP in Markdown text is important (even literal NBSP, for formatting Markdown plain text).

And no: I don’t think CommonMark or any other Markdown syntax needs a notation for NBSP. It is both easier and more useful to enter NBSP directly (easy in a proper text editor like Vim), or maybe to use “@” or “~” or whatever as a stand-in while editing, and then do a global search-and-replace on this stand-in string to turn its occurrences into real NBSPs.

If you will not re-format the plain text typescript, a feature aimed at treating problems like the ones I listed above, implemented in the Markdown processor, would still be more useful than simply providing a notation for entering NBSP:

The Markdown processor could recognize cases like

  • “decimal digits, followed by space, preceded by space or nothing”,
  • “hyphen(s), followed by space, preceded by space or nothing, but not the start of an “unordered” list item”,

and so on, and “insert NBSP” by itself (ie treat the respective SP as if it was NBSP)—which would obviate the need for the author to enter NBSP explicitly, at least in trivially recognizable cases like these. How you can enter a “real” NBSP into your typescript is rather a question you should ask your editing tool, not the CommonMark processor, IMO.
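As a rough illustration of such automatic rules, here is a minimal Python sketch. It is not something cmark does; the name `glue_spaces` and both regular expressions are assumptions made up for this example. Following the tilde examples further up, the hyphen rule puts the NBSP before the hyphen(s).

```python
import re

NBSP = "\u00a0"

# A number (optionally with a decimal part) at the start of the text or after
# whitespace, followed by one ordinary space and then a non-space character.
_NUMBER_THEN_SPACE = re.compile(r"(?<!\S)(\d+(?:\.\d+)?) (?=\S)")

# One ordinary space followed by one or more hyphens and then whitespace, as in
# "is - or is not". Requiring a non-space character before that space leaves
# list markers at the start of a line ("- item", "  - item") untouched.
_SPACE_THEN_HYPHENS = re.compile(r"(?<=\S) (-+)(?=\s)")


def glue_spaces(text: str) -> str:
    """Turn the ordinary space in the two trivially recognizable cases into NBSP."""
    text = _NUMBER_THEN_SPACE.sub(lambda m: m.group(1) + NBSP, text)
    text = _SPACE_THEN_HYPHENS.sub(lambda m: NBSP + m.group(1), text)
    return text


if __name__ == "__main__":
    # repr() makes the otherwise invisible NBSP show up as \xa0.
    print(repr(glue_spaces("25.4 mm is 1.00 in.")))
    print(repr(glue_spaces("There is - or is not - an example.")))
```

A real implementation would have to run inside the parser (so that code spans and code blocks are skipped); this sketch works on raw text only.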

And remember (if you have to use a sh***y editor) that you can always use a numeric character reference in your Markdown input text—ugly but reliable:

25.4&#160;mm is 1&#160;in

The named character reference is only slightly “nicer”, but less reliable:

25.4&nbsp;mm is 1&nbsp;in
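To see what the numeric and the named reference resolve to, here is a small Python check based on the standard library’s `html.unescape`. The wrapper name `resolve_references` is made up for this sketch, and it only mimics the reference-resolution step, not a Markdown processor.

```python
import html

NBSP = "\u00a0"


def resolve_references(text: str) -> str:
    """Replace numeric and named character references with their characters."""
    return html.unescape(text)


if __name__ == "__main__":
    numeric = resolve_references("25.4&#160;mm is 1&#160;in")
    named = resolve_references("25.4&nbsp;mm is 1&nbsp;in")
    # Both spellings end up as the same text containing U+00A0.
    print(numeric == named, NBSP in numeric)
```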

Thinking about treating “\”,SP to mean a NBSP:

This would also contradict the established meaning of the backslash-escape in CommonMark, which is to tell the processor something along the line:

The next character is to be taken literally, and does not count as mark-up.

In the words of the current CommonMark specification 0.22:

Escaped characters are treated as regular characters and do not have their usual Markdown meanings: […]

So again, re-using the backslash-escape to introduce a notation for NBSP would not be a good idea.


What might or could be useful in my view is a more general notation: introducing backslash-sequences to denote UCS characters by their code points (in hexadecimal):

Here a \xA0 NBSP.
And here a \u00A0 NBSP.
And another \U000000A0 NBSP.

[That would match the C99 (Java? JavaScript?) syntax and meaning for denoting Unicode characters in character string constants, I think, or rather hope.]

So far this does not buy us any advantage over the current possibility to write numeric character references in CommonMark:

Here a &#xA0; NBSP.
And here a &#160; NBSP.

This is because the current cmark implementation will replace these references with their respective UCS characters in the output (XML/HTML/XHTML etc), and replaces them so early that you can even write mark-up with character references:

  &#45; Item one.
  &#45; Item two.

and cmark will recognize the &#45; as a HYPHEN-MINUS introducing a list item just fine.


To introduce two different meanings for numeric character references and backslash-coded characters one could use these two rules:

  1. Numeric character references like &#45; are not recognized as mark-up, but are shipped out literally (into HTML/XML/XHTML output, whereas LaTeX output would receive an equivalent notation);

  2. Backslash-escapes for Unicode characters like

        \x12
        \u1234
        \U12345678

are treated just as encoding of input text: they are replaced with their respective Unicode characters even before the input text is scanned for CommonMark mark-up. Therefore eg a \x2D in the input text is exactly equivalent to writing a - U+002D HYPHEN-MINUS directly.
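Rule 2 could be sketched as a preprocessing pass that runs before any mark-up is scanned. This is only an illustration in Python: the function name `decode_code_point_escapes` is hypothetical, nothing like it exists in cmark, and the sketch does not handle an escaped backslash in front of the sequence.

```python
import re

# \xHH, \uHHHH or \UHHHHHHHH with exactly 2, 4 or 8 hexadecimal digits.
_ESCAPE = re.compile(
    r"\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))"
)


def decode_code_point_escapes(text: str) -> str:
    """Replace backslash-coded characters with the characters they denote."""

    def replace(match: re.Match) -> str:
        digits = match.group(1) or match.group(2) or match.group(3)
        code_point = int(digits, 16)
        # Values outside Unicode, and surrogates, are left as typed.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return _ESCAPE.sub(replace, text)


if __name__ == "__main__":
    print(repr(decode_code_point_escapes(r"Here a \xA0 NBSP.")))
    # After decoding, this line starts with a plain HYPHEN-MINUS, so a
    # Markdown parser running afterwards would see a list item.
    print(decode_code_point_escapes(r"\x2D Item one."))
```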

It is my understanding that the CommonMark specification would need no change with respect to numeric character references, but an addition to define backslash-coded characters (or whatever the terminology should be).

[NOTE: Alas, the specification talks in terms of “HTML Entities” and, worse, about storing Unicode characters in the AST, or the kind of output that renderers receive: all of this has no place in a CommonMark specification IMO. The current wording has multiple flaws (independent of the meaning it tries to convey):

  1. too HTML-centric (numeric character references are at the core of HTML and XML, namely inherited from SGML);

  2. wrong in the sense that eg “&amp;” is not an entity, but an entity reference,

  3. misleadingly wrong in the case of “&#45;”, which is not even an entity reference, but a (numeric) character reference,

  4. too implementation-centric: what is relevant (for CommonMark authors and implementors) is whether the CommonMark parser “sees” the replaced characters or the literal character references: it turns out the former, but this has nothing to do with “storing Unicode characters in the AST”, or what kind of output a “renderer receives”.

I think that the CommonMark specification should talk only about the syntactic conventions used in CommonMark as far as possible, and refrain from prescribing implementation details.]

Are there scenarios where this definition (or: distinction) would turn out to be helpful?


Speaking of contradiction with spec for escaped chars. We already have one special case.
Escaped new line leads to hard break, don’t see why we can’t introduce another one.

Using ~ as a visible NBSP could lead to problems in some parsers, such as Pandoc, where a single tilde is used for subscript.


Speaking of contradiction with spec for escaped chars. We already have one special case.

Bad enough, but why make it even worse?

Escaped new line leads to hard break, don’t see why we can’t introduce another one.

I didn’t even know that—but I do use the alternative (two or more spaces at the end of a line) frequently, and my plain text formatter knows about it too.

You can’t deny that

... end of line:  
continued after the break.

looks better than

... end of line:\
continued after the break.

can you? Not to mention the bonus of consistency vs malus of contradicting other uses of backslash-escape. So I clearly prefer the “trailing-spaces” alternative to signify a “hard line break”.

The “visibility problem” is easily solved (again) by using a decent editor (ie here: one capable of displaying trailing white space in lines).


Regarding the use of tilde “~”: I didn’t mean to propose that the CommonMark processor “sees” this tilde (let alone reacts to it), nor that the CommonMark specification should be extended in order to mention the tilde character as something special—my simple suggestion was:

  1. to type this tilde (or “&nbsp;” or whatever other string you fancy) as a “temporary NBSP-stand-in”; and

  2. somewhere along the line do a search-and-replace in your editor so that the tilde (or whatever string you used) is replaced by a real NBSP.

  3. And only then push the typescript through cmark or whatever Markdown processor.

If I understand you correctly, then all you want to do is use "\ " (ie backslash-then-space) as your favourite stand-in string for NBSP? And let the cmark processor do the substitution?—Because your editor can’t do a global search-and-replace?


If you can’t use a process like this during writing/editing your Markdown text, you should really think about a better tool environment.

For example, I can recommend (and do use right in this moment) the “It’s All Text” add-on for Firefox, so that I don’t have to write text postings in Markdown syntax inside the tiny sh*y browser <textarea>, but can use my editor of choice for that.


@tin-pot - if you spot places where we’re misusing terminology, focused reports on the jgm/CommonMark issue tracker, or pull requests, would be helpful.

On contradiction with the spec for escaped characters: well, technically any change is going to contradict the existing spec. Note, though, that according to the spec, only ASCII punctuation characters can be backslash-escaped. So the current spec says that if you type backslash + space, you get a backslash and a space. Adding a rule for backslash + space wouldn’t be crazy; it would be much like the rule for backslash + newline. (In both cases, what you get is sort of “this literal character” – you get a space or a line break.)

I usually prefer the two-space rule for line breaks, too, and I set up my editor so that it is visible. But you have to understand that in the period of discussion leading up to the spec, one of the most common complaints about Markdown syntax was about the “invisible” syntax for line breaks. Many people were very happy with the backslash + newline addition for line breaks. In fact, John Gruber told me that he really liked the idea and that he couldn’t believe he hadn’t thought of it himself.


@jgm:

if you spot places where we’re misusing terminology, focused reports on the jgm/CommonMark issue tracker, or pull requests, would be helpful.

I had this in mind the whole day, but didn’t find the time to write a “formal” issue about the Spec (and the Impl too, more or less)—sorry! You are just too quick in reading and answering posts … :wink:

If you’re still interested, we could indeed take (this part of) the discussion to the jgm/CommonMark issue tracker.


On the question about backslash “\” meanings: I find three “categories of Prior Art” how this is conventionally done:

  1. Use “\” to suppress interpretation of the following character: this is, if I understand it correctly, what the CommonMark spec now says and intends. The W3C CSS syntax has the same convention (among others), for example.

  2. Use “\”, followed by some incantation involving digits, to specify a (Unicode) code position: there are examples of this convention galore, in all kinds of syntaxes.

  3. Use “\”, followed by some distinguished character, to specify a specific (typically: control) character: the \n in C etc is a typical example.

Thus CommonMark uses only (or: mostly) choice 1 for now; introducing a special meaning for "\ " would open up choice 3; and my quick idea about entering Unicode characters via “\” would of course be choice 2.

The problem I have with “choice 3” is quite simple: where to stop? Note that this choice amounts not to introducing a rule, but to an enumerated list of special cases: why not use “\-” to signify EN DASH, or “\=” for EM DASH, or “\<” for SINGLE LEFT-POINTING ANGLE QUOTATION MARK and so on and on.

The special rule for backslash at end-of-line is a current exception (and one for which a better alternative exists, IMO: I have never used backslash to mark up a “hard line break”!). That John Gruber (who presumably came up with the “two-spaces-at-end-of-line” rule in the first place), seemingly likes the backslash better now, is a curiosity, but: If I had to give a one-sentence description of what distinguishes Markdown from syntaxes with a similar purpose (like Textile, or Wikimedia and so on), it would be:

Let authors write the text in a form that they would know and remember and use anyway (from typewriting tradition, from plain text e-mail style, etc).

I would hope that you’d agree that this is the over-arching principle in Gruber’s original Markdown design, that CommonMark does and should follow it, and that additions to the CommonMark spec should be carefully weighed in light of this principle: is the benefit of extending the syntax rules greater than the violation / diminishing importance / decreased simplicity it would incur on the core principle behind the whole syntax?

In the case of NBSP (and EM DASH etc) my conclusion is clear: no, the benefit would be minimal compared to the “uglyfication” it would bring into the syntax spec and the written typescript.

Just use a proper, Unicode-enabled editor, for G*d’s sake!

Or [even] more polemic:

If one does not mind writing into the typescript all kinds of strange character strings to invoke an ever-increasing list of features (like inserting non-ASCII Unicode characters, or specifying HTML attributes and values and so on), well: there are a lot of other mark-up “languages” out there, with much more advanced features and processors than CommonMark (or even the various Markdown extensions) will probably ever have. Why not use texinfo, or ASCIIdoc, or UDO, etc? (Or just go for SGML or even XML?)

So in my opinion introducing wanton syntax extensions for one special case after the other is not a good idea. And in particular not for the purpose of introducing features which are really just the job of an editing tool or preprocessor after all, like in the NBSP case.

Don’t get me wrong, but I like how \ + space looks in the editor more than &nbsp; and other stuff. It makes the text more readable for me as a writer. Using some special char while writing and then replacing it before publishing seems like an unnecessary hassle. We should use the language’s capabilities and not introduce steps which can lead to errors.


I understand what you mean, but I don’t think that “looks better in the editor” justifies introducing a special case of syntax extension: maybe I find that “\-” looks best to write EN DASH, or “\c” for COPYRIGHT SIGN. We will never reach an end of this, or an agreement on these kinds of personal preferences.

Here is how I actually deal with NBSP and friends:

  1. Most of the time I enter NBSP, or EN DASH, etc directly into my CommonMark text (which is a UTF-8 encoded text file), using the “digraph” feature of Vim: that is, I type “CTRL-K,N,S” for NBSP, or “CTRL-K,-,N” for an EN DASH.

  2. If you ask me, having the final and correct Unicode characters in your CommonMark text right from the start looks and works best.

  3. If I’m too lazy, or outside of Vim, I simply use a character reference, like &nbsp; or &#160;.

  4. Alternatively, I could use my formatter (invoked from inside the editor) to substitute eg \NS by NBSP, \-N by EN DASH and so on: there is a “semi-official” list of digraphs, namely RFC 1345, so I don’t have to invent my own, and Vim luckily uses these RFC 1345 digraphs as the default list of defined digraphs.

  5. For my convenience however, because I often write German texts using a US keyboard layout, I added “\ae” as an alternative to the RFC 1345 “\a:” for LATIN SMALL LETTER A WITH DIAERESIS (U+00E4), because “ä” is a frequent character in German texts, and “ae” is the common transliteration of it. The same goes for ö, ü, Ä, Ö, Ü, and ß of course. (The same “private” digraphs could be added in Vim.)

  6. All of this takes place while editing the CommonMark text, and the CommonMark processor sees nothing of it: no syntax extension, no special case needed. And this approach does already cover the whole range of RFC 1345 characters (including the examples EN DASH ("\-N") and COPYRIGHT SIGN ("\Co") from above); it is simple to use IMO, and is not error-prone once you get the hang of it.
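The substitution step from items 4 and 5 might look roughly like this in Python. This is not my actual formatter, just a minimal sketch: the table holds only the digraphs mentioned in this post, and the names `DIGRAPHS` and `expand_digraphs` are made up for the example.

```python
import re

# Backslash digraphs mentioned above. "ae" is the private addition; the
# others are RFC 1345 mnemonics.
DIGRAPHS = {
    "NS": "\u00a0",  # NO-BREAK SPACE
    "-N": "\u2013",  # EN DASH
    "Co": "\u00a9",  # COPYRIGHT SIGN
    "a:": "\u00e4",  # LATIN SMALL LETTER A WITH DIAERESIS
    "ae": "\u00e4",  # private alias for the same character
}

# A backslash followed by any one of the known digraphs.
_DIGRAPH = re.compile(r"\\(" + "|".join(re.escape(key) for key in DIGRAPHS) + ")")


def expand_digraphs(text: str) -> str:
    """Replace each known backslash digraph with its Unicode character."""
    return _DIGRAPH.sub(lambda m: DIGRAPHS[m.group(1)], text)


if __name__ == "__main__":
    print(repr(expand_digraphs(r"25.4\NSmm")))
    print(expand_digraphs(r"pages 3\-N5, \Co 2015, M\aedchen"))
```

Backslash sequences that are not in the table (such as “\[”) pass through unchanged, so the CommonMark processor still gets to see its own escapes.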

So if you really insist on new syntax to enter non-ASCII characters, I would suggest extending the CommonMark syntax to

  1. allow RFC 1345 digraphs in CommonMark texts;

  2. allow hexadecimal numerical character specifications in CommonMark text, like “\x12”, “\u1234”, and “\U12345678”.

But I guess that “\NS” wouldn’t look good enough for your taste? (I can assure you that as a German native I do cringe too, but for a whole different reason I suppose …).

I’d be in favour of dropping backslashes for newlines and spaces and instead include support for pre-formatted line blocks. As discussed here and here. The end result would be more readable prose.

I second that. (With the remark that backslash-space does not exist yet in the spec and thus can’t be dropped :wink: )

The proposed new “kind of block” syntax using VERTICAL LINE (U+007C) is nice and useful, but would IMO solve a somewhat different problem—and not the OP’s “problem” with NBSP (or generally: how to enter non-ASCII characters in CommonMark).

I suppose that the VERTICAL LINE character is placed in the same “column” as would be the GREATER THAN SIGN (U+003E), ie the new “kind of block” would respect the nesting of blocks (like in lists).

I always thought of the line-block syntax as a container for lines, rather than arbitrary block elements, but the latter makes sense. The effect of the | prefix could be to turn on an option that treats newlines as hard breaks. So, you could have something like

| 1. This is a
|    list with a line break

and it would parse as

<ol>
<li>This is a<br />
list with a line break</li>
</ol>
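One way to picture this “turn on hard breaks locally” behaviour is as a rewrite into syntax the spec already has. The following Python sketch is only an assumption about how it could work, not an existing cmark feature: the hypothetical `expand_line_blocks` strips the `| ` prefix and ends every line of the block except the last with a backslash.

```python
def expand_line_blocks(text: str) -> str:
    """Rewrite "|"-prefixed lines so that newlines inside a block are hard breaks."""
    lines = text.split("\n")
    result = []
    for index, line in enumerate(lines):
        if not line.startswith("|"):
            result.append(line)
            continue
        # Drop the marker and the single space after it, if present.
        content = line[2:] if line.startswith("| ") else line[1:]
        # Only lines followed by another line of the same block get a break.
        continues = index + 1 < len(lines) and lines[index + 1].startswith("|")
        result.append(content + ("\\" if continues else ""))
    return "\n".join(result)


if __name__ == "__main__":
    print(expand_line_blocks("| 1. This is a\n|    list with a line break"))
```

Fed to a CommonMark processor, the rewritten text is an ordered list item containing a backslash hard break. Nested line blocks and pipe tables are not considered here.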

One problem with allowing full nestability would be determining what to do with line blocks inside line blocks.

| What is | this?

especially if pipe tables are added, since there might be ambiguities there.

I always think of a line/pipe block as a container for rows, i.e. a one-column table which is (perhaps optionally) realized with different (simpler) markup, i.e. <br> instead of <tr> etc.

| What is | this?

A headerless single-row two-column table, accordingly.

<table><tbody>
  <tr><td> What is </td><td> this? </td></tr>
</tbody></table>

| 1. This is a
|    list with a line break

This could also become a two-row table.

<table><tbody>
  <tr><td><ol><li> This is a </li></ol></td></tr>
  <tr><td> list with a line break </td></tr>
</tbody></table>

I like this idea of seeing a table in both uses: it is always good to eliminate special syntax for special cases.

I have written a proposal to extend the CommonMark specification to include tables which would render both your examples in the way as intended (well, I hope the rules do have this effect, and that the following is the intended table structure in each example):

  1. The first one as a single-row, two-column table without a table header, as shown;

  2. the second one as a single-row, single-column table with an ordered list in its (sole) table data cell; and this list contains one item with a <br> after “This is a”.

So the second example should produce “what it says in the content” :wink: :

<table><tbody>
  <tr><td><ol><li>This is a<br>list with a line break</li></ol></td></tr>
</tbody></table>

You’re welcome to take a look, and feel free to comment or suggest improvements:

http://talk.commonmark.org/t/rfc-spec-extension-for-tables-syntax-and-transformation-rules/1910/1

Using a table seems like overkill to me. It would give bad output in many cases (if default table displays include borders, padding, and so on). Far simpler just to have the line block syntax turn on the (already implemented) “newlines as hard breaks” extension locally.


@jgm: I see what you mean by this, and I think you have a point there:

Using a table seems like overkill to me. It would give bad output in many cases (if default table displays include borders, padding, and so on). Far simpler just to have the line block syntax turn on the (already implemented) “newlines as hard breaks” extension locally.

It should be easy to change the way a “degenerate table” (containing only one cell) is treated when it is output, and then to generate not a <table> (or similar) element, but say a <div>, or a specially-classed <p>, or whatever you can think of. (A specific-classed <table> would be able to eliminate or ameliorate all your “bad output” consequences too. But only in the HTML/CSS case of output target, that is …)

But what I like about this “one syntax rule fits two purposes” similarity is that on the input syntax side both cases can be

  • specified,
  • explained,
  • parsed,
  • debugged

at once, in one go, using only one syntax rule, specification or manual section, parser code extension etc.

I like to think that this would be A Good Thing (but of course not at any cost or price, eg diminishing usability).

Getting back to nbsp. I still don’t understand what’s wrong with using escaped space. Yes, you can press Alt+Space on Mac (or some other combination on other systems) to insert the Unicode nbsp char, but it will be invisible. Same thing as with two spaces at the end of the line.

You say: take a decent editor which highlights invisible chars. But Markdown is used everywhere, not just in your preferred editor. Try to see where you inserted a line break or nbsp in a Discourse comment field, or a GitHub reply field, or even a specialized web editor like prose.io.


@dikmax

Getting back to nbsp. I still don’t understand what’s wrong with using escaped space.

Rest assured that there is absolutely nothing wrong with using “escaped space” "\ " (REVERSE SOLIDUS, SPACE) as a keyboarding string for NBSP, if you so wish.

What would be “wrong” (or rather: unjustified, an unneeded complication, not worth the hassle; you know what I mean) in my opinion is: changing the CommonMark specification to accommodate this single itch of yours, and in doing so to also weaken and complicate the “basic rule for using backslash” in Markdown, namely to:

  • use backslash (“\”) to “hide” a mark-up significant character from the parser, like in “\[”.

This is the primary purpose of backslash in any Markdown variant, and in CommonMark as well. Note that this rule

  • corresponds exactly to the purpose and use of &lt;—or equivalently &#60; or &#x3c;—in XML/XHTML/SGML/HTML; while

  • also corresponds to the use of “\” in W3C [CSS level 2][css] syntax (but “\” is used there for a range of other things too, like specifying characters by hex-digits giving the code point); and

  • has a sole exception in the current CommonMark specification (AFAIK), namely that:

    • a backslash at end-of-line effects a hard line break,

    which is kind of the opposite to “hiding the end-of-line”, and thus this exception is in my view a mistake, even more so as there is a (IMO better, as in: more consistent and unobtrusive and easier to type) alternative already, namely:

    • use two or more " " (SPACE) at end-of-line to effect a hard line break.

Of course you can have a completely different opinion concerning this “use of backslash” issue in CommonMark, but do you at least understand where I’m coming from with these remarks?


Yes, you can press Alt+Space on Mac (or some other combination on other systems) to insert the Unicode nbsp char, but it will be invisible. Same thing as with two spaces at the end of the line.

Yes, so what? I can’t see a NBSP, or EM SPACE or EN SPACE either. If I can’t use a decent editor (I’m typing this text into a shabby HTML <textarea> in my browser right now!), I simply type &nbsp; or &#xa0; or &#160;, or &emsp;.

Or I write the goddam <BR> (or <br />) at the end-of-line by myself, but most of the time I simply remember that I have typed two or more SPACEs there and be done with it.

I really can’t understand why you find that so hard on the one hand, and still insist that you don’t have nor want to use better tools like a decent editor …


That said …

… I agree that one could define a “better” (ie more versatile) use of backslash “\” in CommonMark, and IMO the use of backslash in CSS would be a good starting point for such an extension. This could introduce new rules for using “\” which can be more or less literally cloned from the CSS rules about backslash:


  1. Hexadecimal character specification: “a non-breaking\A0space” inserts a NBSP (U+00A0), but in “a space\A0 after a hex-sequence is gobbled up” the SPACE after “A0” will get “gobbled up”, resulting in only the NBSP being between the “e” and the “a”.

  2. The “usual” backslash-escape sequences like (most of them are probably not that useful and needed):

  • “\t” for HT,
  • “\n” for a new-line or line break (not equivalent to entering a U+000A LF control character!),
  • “\f” for a FF,
  • “\b” for BS,
  • “\a” for BEL,
  • And I would welcome “\s” for NBSP as a new backslash-escape sequence too!

  1. As usual (like in the C preprocessor, and also inside strings in CSS): a backslash followed by an end-of-line is ignored, the result is as if neither the backslash nor the line break had been in the input [or the line break is replaced by a single SPACE]. So you can write eg long section heading texts into multiple lines connected by “\” at EOL. (C programmers know how to do this :wink: ) But the parser (and the CommonMark rules too!) would still see only a single line:

    This section title \
    \nis a bit long so we write it into \
    multiple input lines
    --------------------------------
    
    But the parser sees only *one* title line here! While in the *output* we will get *two* lines in the section
    title, because of the "hard line break" introduced by "`\n`".
    
  2. Also as usual: “\\” will enter a literal U+005C REVERSE SOLIDUS character.

  3. If the backslash is not followed by a decimal digit or a lower-case letter, the usual “hide it from the parser” rule continues to apply.
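To make the hex rule and the proposed “\s” concrete, here is a minimal Python sketch of just these two cases. Everything in it is an assumption for illustration: the name `decode_css_style_escapes` is made up, and only upper-case hex digits are accepted so that the letter escapes “\a”, “\b” and “\f” from the list above do not get swallowed as one-digit hex sequences (CSS itself accepts both cases).

```python
import re

NBSP = "\u00a0"

# Either: backslash, 1 to 6 upper-case hex digits, and optionally one space
# that terminates the sequence and is dropped ("gobbled up").
# Or: the proposed "\s" for NBSP.
_ESCAPE = re.compile(r"\\(?:([0-9A-F]{1,6}) ?|(s))")


def decode_css_style_escapes(text: str) -> str:
    """Expand CSS-like hex escapes and "\\s"; leave other backslash uses alone."""

    def replace(match: re.Match) -> str:
        if match.group(1) is None:
            return NBSP  # matched "\s"
        code_point = int(match.group(1), 16)
        # Values outside Unicode, and surrogates, are left as typed.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return _ESCAPE.sub(replace, text)


if __name__ == "__main__":
    print(repr(decode_css_style_escapes(r"a non-breaking\A0space")))
    print(repr(decode_css_style_escapes(r"a space\A0 after a hex-sequence")))
    print(repr(decode_css_style_escapes(r"25.4\smm")))
```

The remaining rules (line continuation, “\n”, “\t” and the fall-back “hide it from the parser” rule) would need to live in the parser itself and are not covered by this sketch.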


NOTE 1: The explicit EOL sequence “\n” would also obviate the ugly use of backslash to mark-up a hard line break.

NOTE 2: We’re writing text, not program code, so control characters like CR and LF (and BEL, and BS) are not really useful here. Personally, I would like to have backslash-escape sequences for often-used typographical characters like EM SPACE, or EN DASH and EM DASH and so on.

NOTE 3: By the simple rule (3) above, the sequence "\ " (backslash, then space) would now be well-defined, and would “hide” the space from the parser (I think there are situations where SPACE is relevant in CommonMark parsing, so this could also be useful, beyond being consistent).

NOTE 4: According to the current CommonMark rules, the backslash only “hides” punctuation characters, and is taken literally otherwise: I find this rule pretty byzantine, too, and worth replacing by simpler, more versatile, and more useful rules.


This would be an extension of the CommonMark specification I’d be happy to support—and I hope that it would at least be acceptable for you to use “\s” (and not backslash-followed-by-space) to “mark up” (or just: type) a NBSP …

Would it?