Leading and trailing white spaces in code blocks

I would like to ask about the behavior named in the title, as described in the spec. Is it intended that leading and trailing white space is trimmed even in code blocks? I would argue that code blocks should work like <pre> tags and should not modify the text in any way (and of course should remain an inline element instead of being a block like the <pre> tag).

Here is a short example:

  • this is a code block
  • this is a code block with 8 trailing spaces: and content after it
  • content before it:this is a code block with 8 leading spaces

What do you think about this?

From the spec:

This example shows the motivation for stripping leading and trailing spaces:

` `` `

renders as:

<p><code>``</code></p>

I understand that the example is a good use case for the original intention. But I still think that code spans should be able to show white space as it appears in the original text. For example, it is currently not possible to put Python in a code span, because multiple spaces are collapsed and leading spaces are stripped. See this example:

x = 1 if x == 1: # indented four spaces print "x is 1."

Which has this source:

`
x = 1
if x == 1:
    # indented four spaces
    print "x is 1."
`

I know this is a block, so <pre> tags should be used for it anyway, but I think the backtick code feature should not change the white space in the string this heavily; otherwise it is not useful in cases where newlines and multiple white space characters are an important part of the text.

Well, if you want a <pre> tag, it means you don’t want a code span anymore, but a code block. And code blocks respect the indentation.


In other words, this

`are trailing spaces trimmed from here-->   `

has the trailing spaces trimmed.

I think we should consider what BabelMark says here.

Of the implementations tested:

  • 10 preserve the trailing spaces
  • 17 do not

My feeling is that we should preserve spaces in code spans. Unless this is very complicated from an implementation standpoint.


Yes, and it is horrible to do so!

My feeling is that we should preserve spaces in code blocks. Unless this is very complicated from an implementation standpoint.

Wait – code blocks? Didn’t your example above show the vanishing of SPACE at the end of a “backtick”-delimited code span?

Anyway: Yes, SPACE should be preserved in code spans delimited by GRAVE ACCENT! Yes yes yes!

But line breaks probably should not (if anything, they should probably be “normalized” into single SPACE characters).

Sorry, I clearly meant code spans.

What differentiates the specification’s example 304 code span

`⎵``⎵`

from cases like

"`⎵X⎵`"

is the fact that in example 304 the “leading” and “trailing” SPACE is only there for syntactic reasons. Strictly speaking, both SPACE characters are part of the “opening delimiter” (GRAVE ACCENT, SPACE) and the “closing delimiter” (SPACE, GRAVE ACCENT), respectively, of the code span: removing one of these SPACE characters disrupts parsing of the whole code span.

When there are no GRAVE ACCENT characters (aka “backtick”) adjacent to the code span delimiter, as in the second example with the enclosed “X”, one can obviously add and remove SPACE adjacent to the GRAVE ACCENT without any consequences for parsing.

In this light, it only makes sense to discard SPACE characters in the situation of example 304 (where they were inserted for purely syntactical reasons in the first place!), but not in the second example, IMO.

A rule concerning leading and trailing SPACE in code spans based on this difference could be:

Any SPACE after the opening “backtick” character(s) and before the first “inner backtick” character (if there is one, that is, not one of the closing “backtick” characters) is discarded; and vice versa, any SPACE in front of the closing “backtick” character(s) which follows the last “inner backtick” character is discarded.

However, I have the feeling that this could (and should!) be said in a much cleaner, simpler way, maybe just using a grammar of sorts?


[ Edit: I have posted a proposed modification of the CommonMark specification over there. ]

Are you saying that discarding SPACE inside ā€œcode spansā€ delimited by GRAVE ACCENT was the intended behaviour?

I have actually started to use U+23B5 BOTTOM SQUARE BRACKET as a stand-in for SPACE where it is important (pasted into the text via clipboard!), because not only SPACE but NO-BREAK SPACE would get removed by the parser without trace.

I would for sure welcome the preservation of SPACE in code spans! – See my rant at the end of my post here for details.

I am saying look at the Babelmark results yourself and see that there is no clear consensus one way or the other. It is just about 50/50.

Dammit! I had the firm conviction that (Markdown and) CommonMark input like

foo⎵"`X⎵`"⎵bar

would be equivalent in any way, shape, or form to writing

foo⎵"<code>X⎵</code>"⎵bar

and both would produce the fragment (in HTML):

foo⎵&quot;<code>X⎵</code>&quot;⎵bar

(with or without using named character references), and that this would be the only sane and useful behavior of a CommonMark processor. At least, I could not remember ever having a problem with Markdown in this regard.

Turns out: the reason for the latter seems to be that I never had the need for examples like this one, where SPACE at the end of a code span is significant, and where it is therefore important for this SPACE to show up in the HTML <code> element’s character content too.

In fact, parsers like discount or soldout, which I also use (less frequently today than in the past), also discard the trailing SPACE in the example above, and I would also have to use the funny U+23B5 BOTTOM SQUARE BRACKET to force a visually unambiguous output with those parsers, or use similar workarounds (alas, I have to admit, I’m getting used to it, and it does look kind of nice, doesn’t it?).

[ To clarify: Omission of the SPACE in question only happens in the code span example using “backticks”. ]


So I owe you an apology: sorry, I was wrong. Discarding the SPACE character is neither your nor your implementation’s fault, but seems to be a common practice instead, and I’d like to take back any harsh words addressed at you about it.

But may I still maintain the opinion that keeping leading and trailing SPACE in code spans is the only proper way to process them? Or is there some deep reason to discard these SPACEs that you know of and which I just can’t see?

I support keeping the spaces if the parser hurdles are not extreme.

It should not be too hard to come up with a sensible rule for handling SPACE in code spans (in particular, leading and trailing spaces); see my attempt here. In fact, implementing the rule I propose is probably easier than phrasing the syntax rule in simple, plain English.

Here is my take on rephrasing subsection “6.3 Code spans” in the CommonMark specification, and adapting the example HTML output by “brain-parsing” the input according to the modified rules.

The main purpose was to trim leading and trailing SPACE from code span content only when necessary (and even that could be done more parsimoniously, and more “correctly”, by removing at most one SPACE at each end!).

I also took the liberty of changing the treatment of “inner white space” in code spans a bit; see the rephrased “Q&A” fragment below: only line breaks are “normalized”, but in-line strings of multiple SPACE persist.

The example input and output fragments are simply placed into code blocks here, with just a blank line in between. The MIDDLE DOT takes the role of a “visible SPACE”, as in the specification.

The example labels are here used as headings, so that the pertaining comment follows immediately after a heading giving the example number.


6.3 Code spans

A backtick string is a string of one or more backtick characters ("`" U+0060 GRAVE ACCENT) that is neither preceded nor followed by another backtick character.

A code span begins with a backtick string and ends with a backtick string of equal length n, thereby enclosing a non-empty content string. There may be backtick characters in the content string, but not another backtick string of length n.

The character content in the result of parsing the code span is the content string after

  1. removing spaces and line breaks that precede the first backtick in the content string (if any), and

  2. removing spaces and line breaks that follow the last backtick in the content string (if any), and

  3. replacing line breaks inside the content string, along with adjacent white space, with a single SPACE.
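
As a rough illustration, the three steps above can be sketched in Python. This is only one reading of the proposed wording, not a reference implementation: the function name `proposed_code_span_content` is made up for this sketch, and “precede the first backtick” is taken to mean “directly precede”, which is what Examples 303 and 304 below suggest.

```python
import re


def proposed_code_span_content(content: str) -> str:
    """Apply the three proposed steps to the content string of a code span.

    `content` is the text between the opening and closing backtick strings.
    Assumption: leading/trailing white space is only dropped when it sits
    directly next to an inner backtick.
    """
    # Step 1: drop leading spaces/line breaks only if a backtick follows them.
    content = re.sub(r"\A[ \n]+(?=`)", "", content)
    # Step 2: drop trailing spaces/line breaks only if a backtick precedes them.
    content = re.sub(r"(?<=`)[ \n]+\Z", "", content)
    # Step 3: a line break, together with adjacent white space, becomes one SPACE.
    content = re.sub(r" *\n[ \n]*", " ", content)
    return content


print(repr(proposed_code_span_content(" `` ")))         # SPACE next to inner backticks is dropped
print(repr(proposed_code_span_content(" foo ` bar ")))  # SPACE next to other text is kept
```

Under this reading, SPACE is only discarded where it was needed to separate inner backticks from the delimiters; everywhere else the author's spacing survives.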

Example 302

This is a simple code span:

`foo`

<p><code>foo</code></p>

Example 303

Here two backticks are used, because the code contains a backtick. This example also illustrates that leading and trailing spaces are not stripped in this case:

``·foo·`·bar·``

<p><code>·foo·`·bar·</code></p>

Example 304

This example shows a case where both the leading and trailing space characters are trimmed, because they precede or follow a backtick character in the content string:

`·``·`

<p><code>``</code></p>

Example 305

Line breaks are “normalized” to spaces:

``
foo
``

<p><code>·foo·</code></p>

Example 306

Interior line breaks and surrounding white space are “normalized” into single spaces:

`foo···bar
··baz`

<p><code>foo···bar·baz</code></p>

Q: Why not “collapse” the inner spaces between foo and bar too, although browsers will collapse them in many cases anyway?

A: Because this depends on the style sheet used for rendering HTML, and we shouldn’t rely on any specific rendering assumptions.

Example 307

(Existing implementations differ in their treatment of internal spaces and line endings. Some, including Markdown.pl and showdown, convert an internal line break into a <BR> element. But this makes things difficult for those who like to hard-wrap their paragraphs, since a line break in the midst of a code span will cause an unintended line break in the output. Others just leave internal spaces as they are, which is fine if only HTML is being targeted.)

`foo·``·bar`

<p><code>foo·``·bar</code></p>

Example 308

Note that backslash escapes do not work in code spans. All backslashes are treated literally:

`foo\`bar`

<p><code>foo\</code>bar`</p>

Backslash escapes are never needed, because one can always choose a backtick string of length n to delimit code that does not contain any string of n backtick characters.

There is a deeper reason. Suppose you want to put a single backtick in a code span. The way to do it is:

`` ` ``

or

``` ` ```

and so on; here the spaces are needed, or you’d just have a long string of backticks.

So we need to ignore at least one leading and trailing space. What’s in question is:

  1. should internal spaces be collapsed?
  2. should all leading and trailing spaces be ignored, or just the first and last?
  3. what should we do with internal newlines?

The current spec says: 1 - yes, 2 - all, 3 - treat as spaces.
The answers I currently think are best are: 1 - no, 2 - first and last, 3 - not sure.
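
For comparison, here is a minimal Python sketch of the behaviour described above as the current spec (collapse internal spaces, strip all leading and trailing spaces, treat newlines as spaces). The function name `current_spec_content` is invented for this illustration, and the sketch only covers the content string between the backtick delimiters, not the parsing of the delimiters themselves.

```python
import re


def current_spec_content(content: str) -> str:
    """Normalize code span content as the spec at the time of this thread does.

    Answers "1 - yes, 2 - all, 3 - treat as spaces" from the list above.
    """
    # 2: strip every leading and trailing space or newline.
    stripped = content.strip(" \n")
    # 1 and 3: collapse each run of spaces and newlines into a single space.
    return re.sub(r"[ \n]+", " ", stripped)


python_source = '\nx = 1\nif x == 1:\n    # indented four spaces\n    print "x is 1."\n'
print(current_spec_content(python_source))
```

Fed with the Python snippet from the start of the thread, this reproduces the flattened one-line result shown there, which is the behaviour the thread is objecting to.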

EDIT: Sorry, @tin-pot, I made this comment before reading your next post.

What you suggest, namely,

  • removing spaces and line breaks that precede the first backtick in the content string (if any), and
  • removing spaces and line breaks that follow the last backtick in the content string (if any), and
  • replacing line breaks inside the content string, along with adjacent white space, with a single SPACE.

sounds reasonable to me – I hadn’t thought of this more moderate proposal about space-stripping. However, I think I’d still prefer to strip an initial and final space in every case, for a greater degree of compatibility with existing Markdown renderers. (Some people may be in the habit of always having space around the contents of their inline code spans, I suppose.)

About newlines: what is the rationale for collapsing spaces around newlines, if we’re not doing it generally?


Yes, one SPACE following the opening (preceding the closing) backtick string should be seen as part of the delimiter, and thus get discarded. I’d guess that is what Gruber means here when he writes “may include”:

The backtick delimiters surrounding a code span may include spaces – one after the opening, one before the closing.

Technically, one would need to drop one SPACE only if that SPACE was needed at this place to begin with, that is, here:

`` `x` ``

but not here:

`` x ``

But that is probably too much sophistry.


On the other questions,

  1. should internal spaces be collapsed? – I don’t see why, except in the case of line breaks;
  2. should all leading and trailing spaces be ignored, or just the first and last? – I’d say only one at each end;
  3. what should we do with internal newlines? – Internal newlines would best be “contracted”, see below.

The rationale for handling line breaks by replacing them, along with adjacent white space, with a single SPACE is the use case where lines get re-flowed. Consider a code span like this, in a paragraph that is indented by four:

⎵⎵⎵⎵Aurea prima sata est aetas `one⎵⎵/⎵⎵two⎵⎵/⎵⎵three⎵⎵/⎵⎵four` quae vindice nullo

I’d say (regarding item 1) that there’s no reason to modify the double SPACEs here, if that’s what the author intended.

Now imagine a tool like fmt(1) runs through this paragraph, re-formats lines, and the result is:

⎵⎵⎵⎵Aurea prima sata est aetas `one⎵⎵/⎵⎵two⎵⎵/
⎵⎵⎵⎵three⎵⎵/⎵⎵four` quae vindice nullo

If the newline and/or adjacent SPACE were preserved in the output, the code span would now have the character content:

one⎵⎵/⎵⎵two⎵⎵/\n⎵⎵⎵⎵three⎵⎵/⎵⎵four

(here “\n” means the EOL character). I think this is worse than:

one⎵⎵/⎵⎵two⎵⎵/⎵three⎵⎵/⎵⎵four

which would be the (intended) result of replacing EOL and adjacent SPACEs with a single SPACE.

(This is, btw, what attribute value normalization does with attribute value literals in XML, and the reasoning behind it is probably similar.)
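
The re-flow case can be tried out with a small Python sketch. The helper name `contract_line_breaks` is made up here; it only implements the “EOL plus adjacent SPACEs becomes one SPACE” step and nothing else from the proposal.

```python
import re


def contract_line_breaks(content: str) -> str:
    """Replace each line break, with the spaces next to it, by one SPACE."""
    return re.sub(r" *\n *", " ", content)


# Code span content after fmt(1) broke the line and re-indented by four.
reflowed = "one  /  two  /\n    three  /  four"
print(repr(contract_line_breaks(reflowed)))
```

The double spaces the author typed inside a line are left alone; only the white space introduced around the line break is contracted.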


Oh, I like this character technique – I can use it to indicate spaces in the CommonMark reference!

Looking back at this thread, I think the cleanest option would be:

  1. Collapse one initial space if the next non-space character is a backtick.
  2. Collapse one final space if the preceding non-space character is a backtick.
  3. Do not collapse interior spaces.
  4. Treat an interior newline as a space unless it’s adjacent to a space, in which case ignore it.

But 1 and 2 might lead to backwards compatibility issues, so itā€™s probably better to do something like this:

  1. Collapse one initial and one final space, but only if there is both initial and final space. So, for example BACKTICK + SPACE + A + BACKTICK gives you a code span with SPACE + A, while BACKTICK + SPACE + A + SPACE + BACKTICK gives you one with just A.
  2. Do not collapse interior spaces.
  3. Treat an interior newline as a space unless it’s adjacent to a space, in which case ignore it.
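
A minimal Python sketch of this second, backwards-compatible variant might look as follows. The name `final_proposal_content` is hypothetical, and handling newlines before the leading/trailing space check is an assumption of this sketch; the list above does not fix an order.

```python
def final_proposal_content(content: str) -> str:
    """Sketch of the three rules above, applied to a code span's content string."""
    # Rule 3: a newline next to a space is dropped; any other newline becomes a space.
    pieces = []
    for i, ch in enumerate(content):
        if ch != "\n":
            pieces.append(ch)
            continue
        before = content[i - 1] if i > 0 else ""
        after = content[i + 1] if i + 1 < len(content) else ""
        if before != " " and after != " ":
            pieces.append(" ")
    text = "".join(pieces)
    # Rule 1: remove one space from each end, but only if both ends have one.
    if len(text) >= 2 and text.startswith(" ") and text.endswith(" "):
        text = text[1:-1]
    # Rule 2: interior spaces are left untouched.
    return text


print(repr(final_proposal_content(" A")))   # only a leading space: kept
print(repr(final_proposal_content(" A ")))  # space at both ends: one removed from each
```

With this variant a lone leading or trailing space survives, while the habitual "space on both sides" style still renders without the padding.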

I’m the original creator of this thread. Somehow I forgot all about it along the way a long time ago; now I got a new notification and just reread most of the thread.

I am not sure whether it is still relevant, but after some thought, this “final” proposal pretty much covers what my original expectation was about handling spaces inside a code span. So I would be happy with this.


I’d really like to push for this! It’s already on the “issues that must be resolved before 1.0” list, and I definitely think it needs to be changed somehow. I’m sure people have already had the thought, but I think it is worth reiterating in this particular way: surrounding spaces in code spans are impossible as it stands now. I recently wanted to illustrate a leading space here on talk.commonmark.org, and I ended up having to use raw <code> tags! It was a sad day.

Personally I’d prefer the “cleanest” option that @jgm specified, but of course that does introduce potential issues with people who have ` typed their code spans like this for style `. The second option is of course totally fine (if potentially a bit confusing), though; the main point of importance is that surrounding spaces become possible. Multiple spaces are also very common in code, but everyone seems to already be in agreement on not collapsing those anymore.