Leading and trailing white spaces in code blocks

nightw · September 17, 2014, 8:41pm

I would like to ask about the title in the specs. Is it intended that leading and trailing white spaces are trimmed even in code blocks? I would argue that code blocks should behave like <pre> tags and not modify the text in any way (while of course remaining an inline element instead of being a block like the <pre> tag).

Here is a short example:

this is a code block
this is a code block with 8 trailing spaces: and content after it
content before it:this is a code block with 8 leading spaces

What do you think about this?

mb21 · September 17, 2014, 9:16pm

From the spec:

This example shows the motivation for stripping leading and trailing spaces:
` `` `
renders as:
<p><code>``</code></p>

nightw · September 18, 2014, 12:32pm

I understand that example is a good use case for the original intention. But I still think that code spans should be able to show whitespace as it is in the original text. For example, it’s currently not possible to insert Python in a code span, because multiple spaces are collapsed and leading spaces are stripped. See this example:

x = 1 if x == 1: # indented four spaces print "x is 1."

Which has this source:

`
x = 1
if x == 1:
    # indented four spaces
    print "x is 1."
`

I know this is a block, so <pre> tags should be used for it anyway, but I think that the code block feature with the backtick should not change the whitespace in the string this heavily; otherwise it is not useful in cases where newlines and multiple whitespaces are an important part of the text.
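To make the effect concrete, here is a minimal Python sketch of the behaviour described above. It assumes the rule is simply "trim both ends, then collapse every run of whitespace into one space"; the function name `collapse_code_span` is made up for illustration and is not taken from any Markdown implementation.

```python
def collapse_code_span(content: str) -> str:
    """Trim both ends and collapse each whitespace run to one space.

    Assumption: this only approximates the stripping rule discussed in
    this thread; it is not code from an actual parser.
    """
    # str.split() without arguments splits on runs of whitespace and
    # drops leading/trailing whitespace; joining restores single spaces.
    return " ".join(content.split())


# The text between the two backticks in the source shown above.
source = (
    "\n"
    "x = 1\n"
    "if x == 1:\n"
    "    # indented four spaces\n"
    '    print "x is 1."\n'
)

print(collapse_code_span(source))
```

Under that assumption, the indentation and the line structure are both lost, which is the single-line rendering shown in the example above.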

stof · September 18, 2014, 2:35pm

Well, if you want a <pre> tag, it means you don’t want a code span anymore, but a code block. And code blocks respect the indentation.

codinghorror · January 3, 2016, 10:05am

In other words, this

`are trailing spaces trimmed from here-->   `

has the trailing spaces trimmed.

I think we should consider what BabelMark says here.

Of the implementations tested:

10 preserve the trailing spaces
17 do not

My feeling is that we should preserve spaces in code spans. Unless this is very complicated from an implementation standpoint.

tin-pot · January 3, 2016, 3:01pm

Yes, and it is horrible to do so!

My feeling is that we should preserve spaces in code blocks. Unless this is very complicated from an implementation standpoint.

Wait – code blocks? Didn’t your example above show the vanishing of SPACE at the end of a “backtick”-delimited code span?

Anyway: Yes, SPACE should be preserved in code spans delimited by GRAVE ACCENT! Yes yes yes!

But line breaks probably should not (if anything, they should probably be “normalized” into single SPACE characters).

codinghorror · January 3, 2016, 6:56pm

Sorry, I clearly meant code spans.

tin-pot · January 3, 2016, 7:56pm

What differentiates the specification’s example 304 code span

`⎵``⎵`

from cases like

"`⎵X⎵`"

is the fact that in example 304 the “leading” and “trailing” SPACE is only there for syntactic reasons, and strictly speaking both SPACE characters are part of the “opening delimiter” ( GRAVE ACCENT , SPACE ) respectively the “closing delimiter” ( SPACE , GRAVE ACCENT ) of the code span: removing one of these SPACE characters disturbs and disrupts parsing of the whole code span.

When there are no GRAVE ACCENT characters (aka “backtick”) adjacent to the code span delimiter, as in the second example with the enclosed “X”, one can obviously add and remove SPACE adjacent to the GRAVE ACCENT without any consequences for parsing.

In this light, it only makes sense to discard SPACE characters in the situation of example 304 (where they were inserted for purely syntactical reasons in the first place!), but not in the second example, IMO.

A rule concerning leading and trailing SPACE in code spans based on this difference could be:

Any SPACE after the opening “backtick” character(s) and before the first “inner backtick” character (if there is one, that is, not one of the closing “backtick” characters) is discarded; and vice versa any SPACE in front of the closing “backtick” character(s) which follows the last “inner backtick” character.

However, I have the feeling that this could (and should!) be said in a much cleaner, simpler way, maybe just using a grammar of sorts?

[ Edit: I have posted a proposed modification of the CommonMark specification over there. ]

tin-pot · January 3, 2016, 2:52pm

Are you saying that discarding SPACE inside “code spans” delimited by GRAVE ACCENT was the intended behaviour?

I have actually started to use U+23B5 BOTTOM SQUARE BRACKET as a stand-in for SPACE where it is important (pasted into the text via clipboard!), because not only SPACE but NO-BREAK SPACE would get removed by the parser without trace.

I would for sure welcome the preservation of SPACE in code spans! – See my rant at the end of my post here for details.

codinghorror · January 3, 2016, 6:58pm

I am saying look at the Babelmark results yourself and see that there is no clear consensus one way or the other. It is just about 50/50.

tin-pot · January 3, 2016, 7:33pm

Dammit! I had the firm conviction that (Markdown and) CommonMark input like

foo⎵"`X⎵`"⎵bar

would be equivalent in any way, shape, or form to writing

foo⎵"<code>X⎵</code>"⎵bar

and both would produce the fragment (in HTML):

foo⎵&quot;<code>X⎵</code>&quot;⎵bar

(with or without using named character references) — and that this would be the only sane and useful behavior of a CommonMark processor. At least, I could not remember ever having a problem with Markdown in this regard.

Turns out: the reason for the latter seems to be that I never had the need for examples like this one, where SPACE at the end of a code span is significant, and where it is therefore important for this SPACE to show up in the HTML <code> element’s character content too.

In fact, parsers like discount or soldout which I also use (less frequently today than in the past) do also discard the trailing SPACE in the example above, and I would also have to use the funny U+23B5 BOTTOM SQUARE BRACKET to force a visually unambiguous output with those parsers, or use similar workarounds (alas, I have to admit, I’m getting used to it, and it does look kind of nice, doesn’t it?).

[ To clarify: Omission of the SPACE in question only happens in the code span example using “backticks”. ]

So I’m owing you an apology: sorry, I was wrong—discarding the SPACE character is neither your nor your implementation’s fault, but seems to be a common practice instead, and I’d like to take back any harsh words addressed at you about it.

But may I still maintain the opinion that keeping leading and trailing SPACE in code spans is the only proper way to process them? Or is there some deep reason to discard these SPACEs that you know of and which I just can’t see?

codinghorror · January 3, 2016, 7:35pm

I support keeping the spaces if the parser hurdles are not extreme.

tin-pot · January 3, 2016, 8:03pm

It should not be too hard to come up with a sensible rule for handling SPACE in code spans (in particular, leading and trailing spaces), see my attempt here. In fact, implementing the rule I propose is probably easier than phrasing the syntax rule in a simple, plain-english way.

tin-pot · January 3, 2016, 9:42pm

Here is my take on rephrasing subsection “6.3 Code spans” in the CommonMark specification, and adapting the example HTML output by “brain-parsing” the input according to the modified rules.

The main purpose was to trim leading and trailing SPACE from code span content only when necessary (and even that could be done more parsimoniously, and more “correctly”, by removing at most one SPACE at each end!).

I also took the liberty to change the treatment of “inner white space” in code spans a bit, see the rephrased “Q&A” fragment below: only line breaks are “normalized”, but in-line strings of multiple SPACE persist.

The example input and output fragments are simply placed into code blocks here, with just a blank line in between. The MIDDLE DOT takes the role of a “visible SPACE”, as in the specification.

The example labels are here used as headings, so that the pertaining comment follows immediately after a heading giving the example number.

6.3 Code spans

A backtick string is a string of one or more backtick characters ("`" U+0060 GRAVE ACCENT) that is neither preceded nor followed by another backtick character.

A code span begins with a backtick string and ends with a backtick string of equal length n, thereby enclosing a non-empty content string. There may be backtick characters in the content string, but not another backtick string of length n.

The character content in the result of parsing the code span is the content string after

removing spaces and line breaks that precede the first backtick in the content string (if any), and
removing spaces and line breaks that follow the last backtick in the content string (if any), and
replacing line breaks inside the content string, along with adjacent white space, with a single SPACE.
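
As a rough illustration, the three steps above could be sketched in Python as follows. This is only one reading of the proposed wording: it assumes "precede the first backtick" means a run of spaces or line breaks at the very start of the content string that is immediately followed by a backtick (and likewise at the end), and that "white space" next to a line break means spaces and tabs. The function name `proposed_content` is hypothetical.

```python
import re


def proposed_content(content: str) -> str:
    """Apply the three proposed steps to the raw content string.

    `content` is the text between the opening and closing backtick
    strings. Assumed interpretation, not a reference implementation.
    """
    # Step 1: drop leading spaces/line breaks only when a backtick
    # follows them directly.
    content = re.sub(r"\A[ \n]+(?=`)", "", content)
    # Step 2: drop trailing spaces/line breaks only when a backtick
    # directly precedes them.
    content = re.sub(r"(?<=`)[ \n]+\Z", "", content)
    # Step 3: replace each line break, together with adjacent white
    # space, with a single SPACE. Other runs of spaces are kept.
    content = re.sub(r"[ \t]*\n[ \t\n]*", " ", content)
    return content


# Content strings corresponding to the examples in this post
# (ordinary spaces instead of the visible-space dot).
assert proposed_content(" foo ` bar ") == " foo ` bar "
assert proposed_content(" `` ") == "``"
assert proposed_content("\nfoo\n") == " foo "
assert proposed_content("foo   bar\n  baz") == "foo   bar baz"
```

The order of the steps matters in this sketch: the edge trimming runs first, so a line break that sits between the delimiter and a backtick is removed rather than turned into a space.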

Example 302

This is a simple code span:

`foo`

<p><code>foo</code></p>

Example 303

Here two backticks are used, because the code contains a backtick. This example also illustrates that leading and trailing spaces are not stripped in this case:

``·foo·`·bar·``

<p><code>·foo·`·bar·</code></p>

Example 304

This example shows a case where both the leading and trailing space characters are trimmed, because they precede or follow a backtick character in the content string:

`·``·`

<p><code>``</code></p>

Example 305

Line breaks are “normalized” to spaces:

``
foo
``

<p><code>·foo·</code></p>

Example 306

Interior line breaks and surrounding white space are “normalized” into single spaces:

`foo···bar
··baz`

<p><code>foo···bar·baz</code></p>

Q: Why not “collapse” the inner spaces between foo and bar too, although browsers will collapse them in many cases anyway?

A: Because this depends on the style sheet used for rendering HTML, and we shouldn’t rely on any specific rendering assumptions.

Example 307

(Existing implementations differ in their treatment of internal spaces and line endings. Some, including Markdown.pl and showdown, convert an internal line break into a <BR> element. But this makes things difficult for those who like to hard-wrap their paragraphs, since a line break in the midst of a code span will cause an unintended line break in the output. Others just leave internal spaces as they are, which is fine if only HTML is being targeted.)

`foo·``·bar`

<p><code>foo·``·bar</code></p>

Example 308

Note that backslash escapes do not work in code spans. All backslashes are treated literally:

`foo\`bar`

<p><code>foo\</code>bar`</p>

Backslash escapes are never needed, because one can always choose a backtick string of length n to delimit code that does not contain any string of n backtick characters.
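
The claim that a suitable backtick string always exists can be sketched in a few lines of Python. This is an illustrative helper, not part of any CommonMark tool: the name `wrap_code_span` is made up, and the padding step assumes a parser that discards a SPACE sitting between a delimiter and an adjacent backtick in the content (as in Example 304).

```python
import re


def wrap_code_span(code: str) -> str:
    """Wrap non-empty `code` in the shortest usable backtick delimiter.

    Hypothetical helper; assumes the parser strips a SPACE between the
    delimiter and a backtick at the edge of the content.
    """
    # Lengths of all maximal backtick runs inside the code.
    runs = {len(run) for run in re.findall(r"`+", code)}
    # Pick the smallest n such that no run of exactly n backticks occurs.
    n = 1
    while n in runs:
        n += 1
    fence = "`" * n
    # Pad with a SPACE only where the content touches the delimiter
    # with a backtick, otherwise the runs would merge.
    left = " " if code.startswith("`") else ""
    right = " " if code.endswith("`") else ""
    return f"{fence}{left}{code}{right}{fence}"


print(wrap_code_span("foo"))
print(wrap_code_span("foo ` bar"))
print(wrap_code_span("``"))
```

The loop always terminates because the code contains only finitely many run lengths, which is the reason no escape mechanism is needed.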

jgm · January 5, 2016, 10:26pm

There is a deeper reason. Suppose you want to put a single backtick in a code span. The way to do it is:

`` ` ``

or

``` ` ```

or etc., and here the spaces are needed, or you’d just have a long string of backticks.

So we need to ignore at least one leading and trailing space. What’s in question is:

1. should internal spaces be collapsed?
2. should all leading and trailing spaces be ignored, or just the first and last?
3. what should we do with internal newlines?

The current spec says: 1 - yes, 2 - all, 3 - treat as spaces.
The answers I currently think are best are: 1 - no, 2 - first and last, 3 - not sure.

EDIT: Sorry, @tin-pot, I made this comment before reading your next post.

What you suggest, namely,

removing spaces and line breaks that precede the first backtick in the content string (if any), and

removing spaces and line breaks that follow the last backtick in the content string (if any), and

replacing line breaks inside the content string, along with adjacent white space, with a single SPACE.

sounds reasonable to me – I hadn’t thought of this more moderate proposal about space-stripping. However, I think I’d still prefer to strip an initial and final space in every case, for a greater degree of compatibility with existing Markdown renderers. (Some people may be in the habit of always having space around the contents of their inline code spans, I suppose.)

About newlines: what is the rationale for collapsing spaces around newlines, if we’re not doing it generally?

tin-pot · January 5, 2016, 10:47pm

Yes, one SPACE following the opening (preceding the closing) backtick string should be seen as part of the delimiter, and thus get discarded. I’d guess that is meant by Gruber here when he writes “may include”:

The backtick delimiters surrounding a code span may include spaces—one after the opening, one before the closing.

Technically, one would need to drop one SPACE only if that SPACE was needed at this place to begin with, that is, here:

`` `x` ``

but not here:

`` x ``

But that is probably too much sophistry.

On the other questions,

should internal spaces be collapsed? – I don’t see why, except in the case of line breaks;
should all leading and trailing spaces be ignored, or just the first and last? – I’d say only one at each end;
what should we do with internal newlines? – Internal newlines would best be “contracted”, see below.

The rationale for handling line breaks by replacing them, along with adjacent white space, with a single SPACE is the use case where lines get re-flowed. Consider a code span like this, in a paragraph that is indented by four:

⎵⎵⎵⎵Aurea prima sata est aetas `one⎵⎵/⎵⎵two⎵⎵/⎵⎵three⎵⎵/⎵⎵four` quae vindice nullo

I’d say (regarding item 1) that there’s no reason to modify the double SPACEs here, if that’s what the author intended.

Now imagine a tool like fmt(1) runs through this paragraph, re-formats lines, and the result is:

⎵⎵⎵⎵Aurea prima sata est aetas `one⎵⎵/⎵⎵two⎵⎵/
⎵⎵⎵⎵three⎵⎵/⎵⎵four` quae vindice nullo

If newline and/or adjacent SPACE would be preserved in the output, the code span would now have the character content:

one⎵⎵/⎵⎵two⎵⎵/\n⎵⎵⎵⎵three⎵⎵/⎵⎵four

(here “\n” means the EOL character). I think this is worse than:

one⎵⎵/⎵⎵two⎵⎵/⎵three⎵⎵/⎵⎵four

Which would be the (intended) result of replacing EOL and adjacent SPACEs with a single SPACE.

(This is, btw, what attribute value normalization does with attribute value literals in XML, and the reasoning behind it is probably similar.)
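
The re-flow case above can be checked with a small Python sketch. It assumes "adjacent white space" means spaces and tabs directly before or after the line break; `contract_line_breaks` is a name invented for this illustration.

```python
import re


def contract_line_breaks(content: str) -> str:
    """Replace each line break plus adjacent spaces/tabs with one SPACE.

    Runs of spaces that do not touch a line break are left alone.
    """
    return re.sub(r"[ \t]*\n[ \t]*", " ", content)


# Code span content after the paragraph was re-flowed and re-indented.
reflowed = "one  /  two  /\n    three  /  four"

print(repr(contract_line_breaks(reflowed)))
```

The double spaces the author typed survive, while the line break and the four indentation spaces introduced by the re-flow become one SPACE.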

codinghorror · June 28, 2016, 10:28am

Oh, I like this character technique – I can use it to indicate spaces in the CommonMark reference!

jgm · November 8, 2016, 9:56am

Looking back at this thread, I think the cleanest option would be:

1. Collapse one initial space if the next non-space character is a backtick.
2. Collapse one final space if the preceding non-space character is a backtick.
3. Do not collapse interior spaces.
4. Treat an interior newline as a space unless it’s adjacent to a space, in which case ignore it.

But 1 and 2 might lead to backwards compatibility issues, so it’s probably better to do something like this:

1. Collapse one initial and one final space, but only if there is both initial and final space. So, for example BACKTICK + SPACE + A + BACKTICK gives you a code span with SPACE + A, while BACKTICK + SPACE + A + SPACE + BACKTICK gives you one with just A.
2. Do not collapse interior spaces.
3. Treat an interior newline as a space unless it’s adjacent to a space, in which case ignore it.
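
A minimal Python sketch of this second set of rules might look like the following. Several details are assumptions not stated above: newlines are handled before the edge spaces, every newline in the content is treated as "interior", and content consisting only of spaces is left untouched. The function name `second_option_content` is hypothetical.

```python
def second_option_content(content: str) -> str:
    """Sketch of the backwards-compatible option for code span content.

    `content` is the text between the backtick delimiters. The order of
    the steps and the all-spaces case are assumptions of this sketch.
    """
    chars = []
    for i, ch in enumerate(content):
        if ch != "\n":
            chars.append(ch)
            continue
        before = content[i - 1] if i > 0 else ""
        after = content[i + 1] if i + 1 < len(content) else ""
        # A newline next to a space is dropped; otherwise it becomes
        # a single space.
        if before == " " or after == " ":
            continue
        chars.append(" ")
    text = "".join(chars)
    # Remove one space from each end, but only if both ends have one
    # and the content is not made of spaces only.
    if text.startswith(" ") and text.endswith(" ") and text.strip(" "):
        text = text[1:-1]
    return text


assert second_option_content(" A") == " A"    # only a leading space: kept
assert second_option_content(" A ") == "A"    # both ends: one space removed each
assert second_option_content("a  b") == "a  b"  # interior spaces kept
assert second_option_content("a\nb") == "a b"   # newline becomes a space
```

With this reading, a single backtick can still be written with a padded delimiter, and a leading or trailing space can be shown by adding one extra space on each side.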

nightw · November 8, 2016, 7:58pm

I’m the original creator of this thread. Somehow I forgot all about it in the meantime, and now I got a new notification and just reread most of the thread.

I am not sure if it is really relevant now or not, but after some thought for me this “final” proposal pretty much covers what my original expectation was about handling spaces inside a code span. So I would be happy with this.

obskyr · July 11, 2017, 9:51am

I’d really like to push for this! It’s already on the “issues that must be resolved before 1.0” list, and I definitely think it needs to be changed somehow. I’m sure people have already had the thought, but I think it might be worth reiterating in this particular way: surrounding spaces in code spans are impossible as it stands now. I recently wanted to illustrate a leading space here on talk.commonmark.org, and I ended up having to use raw <code> tags! It was a sad day.

Personally I’d prefer the “cleanest” option that @jgm specified, but of course that does introduce potential issues with people who have ` typed their code spans like this for style `. The second option is of course totally fine (if potentially a bit confusing), though - the main point of importance is that surrounding spaces become possible. Multiple spaces are also very common in code, but everyone seems to already be in agreement on not collapsing those anymore.