A problem with backtick code fences

jgm · July 1, 2016, 5:57am

@codinghorror - You can already have tilde code fences for code blocks. This doesn’t help with the problem, which is how to express a certain kind of code span (not a code block). And even if we allowed tildes to be used for code spans (which would conflict with common extensions that use them for strikeout), we’d still have the problem.

codinghorror · July 1, 2016, 8:35am

I see, I think I misunderstood the example:

``` hello
this is NOT inline code with one backtick ` and two backticks ``;
it is a code block!
```

~~~ hello
this is NOT inline code with one backtick ` and two backticks ``;
it is a code block!
~~~

Crissov · July 1, 2016, 12:58pm

Forbid the space before the info string if it contains more words/spaces:

```hello
this is NOT inline code with one backtick ` and two backticks ``;
it is a code block, because there is no space after the backticks!
```

```.hello
this is NOT inline code with one backtick ` and two backticks ``;
it is a code block, because there is no space after the backticks!
```

```␣hello
this is NOT inline code with one backtick ` and two backticks ``;
it is a code block, because it only contains a single word after the backticks!
```

```␣hello␣
this IS inline code with one backtick ` and two backticks ` (?)
```

```␣hello␣world
this IS inline code with one backtick ` and two backticks `
```

```hello␣world
this is NOT inline code with one backtick ` and two backticks ``;
it is a code block, because there is no space after the backticks!
```

```.hello␣world
this is NOT inline code with one backtick ` and two backticks ``;
it is a code block, because there is no space after the backticks!
```

```␣.hello␣world
this is NOT inline code with one backtick ` and two backticks ``;
it is a code block, because there is no word but punctuation after the space!
```
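
Read as a decision procedure, the examples above could be sketched roughly like this. It is only an illustration: `classify_opening` is a hypothetical name, a "word" is approximated as "starts with a letter or digit", and the trailing-space case follows the example marked (?) above, so it is a guess rather than a settled rule.

```python
import re

# Three or more backticks at the start of the line, then the rest of the line.
FENCE = re.compile(r"^(`{3,})(.*)$")


def classify_opening(line: str) -> str:
    """Classify an opening line under the proposed rule (hypothetical helper)."""
    match = FENCE.match(line)
    if match is None:
        return "not a fence"
    rest = match.group(2)
    if "`" in rest:
        # Existing rule: a backtick info string may not contain backticks.
        return "inline"
    if not rest.startswith(" "):
        # No space after the backticks: always a code block.
        return "block"
    info = rest[1:]
    if info and not info[0].isalnum():
        # Punctuation instead of a word after the space: code block.
        return "block"
    if " " not in info:
        # A single word (or nothing) after the space: code block.
        return "block"
    # A space, a word, then more spaces or words: inline code.
    return "inline"
```

With this sketch, only the `␣hello␣` and `␣hello␣world` openers come out as inline; every other example above is a code block.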

jkdev · July 1, 2016, 10:00pm

Maybe we’re overthinking this. Simpler rules are better, right?

How about:

Triple backticks or triple tildes: Code fence.
Opening fence begins a line, closing fence begins a new line: Code block.
Opening fence begins a line, but closing fence does not begin a new line: Inline code.
Text on the same line as the opening fence of a code block: Info string.

Any info string is thus allowed, even multiple words, even with a preceding space.

As for the issue that began this thread…

``` To include ` and `` backticks in inline code,
the closing fence should not be at the start of a new line,
but rather after code, like this. ``` And here's some non-code inline text.

This starts a paragraph with inline code, including single and double backticks.

It works in the current commonmark, markdown.pl, and most other flavors. (Babelmark 2 test, Babelmark 3 test.)

And, as pointed out by cben and Ajedi32, it follows the convention of *other* types of **delimiters** being written _inline_.

jgm · July 2, 2016, 12:22am

@jkdev - your proposal does nothing to remove the ambiguities.

1 ``` code
2 foo
3 bar ```
4 ```

Is this a code block that ends on line 4? Or a code span that ends on line 3? You could resolve this in favor of the latter by saying that we close as soon as we can, but this breaks backwards compatibility for code blocks containing strings of backticks, and creates difficulties expressing code blocks containing backticks.
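
To make the two readings concrete, here is a small sketch that reports both candidate closers for an example like the one above. The function name `closing_candidates` is made up for illustration, and the sketch ignores blank lines and indentation of the opening fence.

```python
import re


def closing_candidates(lines):
    """Return (first mid-line or equal-length closer, first line-start closer).

    Line numbers are 1-based; None means no such closer was found.
    """
    opening = re.match(r"`{3,}", lines[0])
    if opening is None:
        raise ValueError("first line does not start with a backtick fence")
    n = len(opening.group())
    # Code block reading: a line holding only a fence of at least n backticks.
    block_close = re.compile(r" {0,3}`{%d,} *$" % n)
    # Code span reading: a run of exactly n backticks anywhere in a line.
    span_close = re.compile(r"(?<!`)`{%d}(?!`)" % n)
    first_span = first_block = None
    for number, line in enumerate(lines[1:], start=2):
        if first_span is None and span_close.search(line):
            first_span = number
        if first_block is None and block_close.match(line):
            first_block = number
    return first_span, first_block


example = ["``` code", "foo", "bar ```", "```"]
print(closing_candidates(example))  # (3, 4)
```

"Close as soon as we can" picks the first number; the current block-first behavior picks the second.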

Not being able to identify a code block by the first line would also break a lot of very nice properties of our present parsers, which identify block structure first, inline structure later.

It may be that this issue is enough of a corner case that we shouldn’t obsess about it. The only real “blind spot” there is is for inline code that contains strings of two backticks and occurs at the beginning of a paragraph (otherwise you can reorganize it so it doesn’t start at the beginning of the line).

I suppose another solution would be to allow only one-word info strings with backtick code blocks, while allowing free-form info strings with tilde code blocks. I hesitate to do that, though, as it complicates the mental model. (Why can I do this with tildes but not backticks?)

cben · September 2, 2016, 3:37pm

While it wouldn’t resolve the original problem,
I’m wondering if changing code block rules to close on triple (or however many) backticks anywhere in the line is feasible. In your example, it’d be a code block that ends at line 3.

Motivation for closing early:

Current behavior is not essential: I never realized that I can safely include ``` in code blocks as long as they don’t start the line. But I can always use more backticks (````) to start/end the code block, which is a simple rule covering all cases.
“Compatibility” with original markdown: backtick-fenced code block syntax degrades gracefully to inline code in tools that don’t understand fenced blocks [^1]. That’s a good property and IMO should be maximized. However, tools that think ``` starts inline code will stop on the first ``` anywhere.

Babelmark confirms about half implementations support fenced blocks and only stop on final start-of-line backticks, while half don’t understand fenced blocks and stop the inline code on line 3.

marked (0.2.6) is the only one that supports fenced blocks AND stops early on line 3. It only does that for ``` at the end of a line; if text follows, it doesn’t close the block there.
AFAICT no tool follows @jkdev’s proposal of switching from block to inline when closing fence is mid-line.

[^1] I lied: it only degrades gracefully without empty lines — empty lines abort inline code but not fenced blocks.
That’s why deciding block structure first is important. Consider this paradox:

```
Is this inline code or code block?

Closing fence is not at start of line: ``` And here's some non-code inline text.

If the block/inline decision depends on lookahead to where closing backticks are, it can’t be either — inline shouldn’t cross the empty line and block shouldn’t have mid-line termination. I.e. you don’t know how far to look for termination before you found the termination…

Interoperability is especially critical for agreeing where code starts and ends

Code is fundamental like escaping — it suppresses markdown-significant constructs, so if you don’t agree about whether it’s code, you get cascading confusion…
Fenced blocks make it worse — disagreeing about just one top-level fence can catastrophically flip the meanings of everything till the end of the document!

That’s why any limits on info strings worry me. The simplest rule “3+ backticks/tildes followed by anything [without backticks] starts a block” is probably our best chance for agreement. (Ignoring, or treating as code, info strings you don’t understand is fine, as long as you still consider it a code block.)
If we agree it’s code but don’t agree block vs inline, that’s still rather good!
IIUC the only exceptions are (1) empty lines (2) mid-line closing backticks.
Fenced blocks without empty lines are probably a non-starter, but perhaps (2) could be harmonized?
As noted above, all but one of the implementations with fenced blocks disregard mid-line backticks. So there is no single best-compatibility answer here.
- “Parse block structure before inline” principle is I think new to CommonMark? It’s more or less implicit in the language, but I think many existing parsers have more ad-hoc structure. Ad-hoc parsing favors a single, simple, rule for code termination, whether it’s block or inline. [I’m thinking here not only of full parsers but about approximations like editor syntax highlighting…]
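
The "simplest rule" quoted above (3+ backticks/tildes followed by anything without backticks starts a block) fits in one regular expression, which is part of its appeal for ad-hoc parsers and syntax highlighters. This is a sketch, not any implementation's actual code; applying the no-backtick restriction only to backtick fences is an assumption.

```python
import re

# Up to three spaces of indentation, then either
#   - 3+ backticks followed by text containing no backtick, or
#   - 3+ tildes followed by any text (assumption: only backtick fences
#     need the no-backtick restriction).
OPENING_FENCE = re.compile(r"^ {0,3}(?:`{3,}[^`]*|~{3,}.*)$")


def starts_code_block(line: str) -> bool:
    """True if the line alone is enough to open a fenced code block."""
    return OPENING_FENCE.match(line) is not None
```

The decision needs no lookahead: the opening line alone says whether a block starts, whatever the info string looks like.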

jgm · November 8, 2016, 9:51am

Revisiting this thread, I see only two possible solutions.

One is @Crissov’s, which is to require the info string to start immediately after the backticks (with no intervening space) if it contains more than one word. (If it is just one word, then a space is okay; we want this for backwards compatibility since many implementations allow a space.) That is:

```␣hello␣world
this IS inline code with one backtick ` and two backticks `
```

```hello␣world
this is NOT inline code with one backtick ` and two backticks ``;
it is a code block, because there is no space after the backticks!
```

The second is to constrain the info string; instead of allowing it to be anything, we could limit to, say, a bracketed list of key/value pairs:

Example:

``` haskell {class="numberLines" id="mycodesample" startline="15"}
let x = x + 1 in x
```

One option would be to allow any pandoc-style attributes, e.g. {#id .class1 .class2 key="value" booleankey}.
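
As a rough illustration of what a constrained info string could look like to a parser, here is a sketch that accepts an optional single word followed by an optional brace-delimited attribute block. The grammar is an approximation invented for this example (it is not pandoc's actual parser), and `parse_info` is a hypothetical name.

```python
import re

# Optional single word, then an optional {...} attribute block.
INFO = re.compile(
    r"^\s*(?P<word>[^\s{}`]+)?\s*(?:\{(?P<attrs>[^{}`]*)\})?\s*$"
)
# Inside the braces: #id, .class, key="value", or a bare boolean key.
TOKEN = re.compile(
    r'#(?P<id>[\w-]+)|\.(?P<cls>[\w-]+)|(?P<key>[\w-]+)(?:="(?P<val>[^"]*)")?'
)


def parse_info(info: str):
    """Return a dict for a well-formed info string, or None if it is rejected."""
    match = INFO.match(info)
    if match is None:
        return None
    result = {"language": match.group("word"), "id": None,
              "classes": [], "attributes": {}}
    for token in TOKEN.finditer(match.group("attrs") or ""):
        if token.group("id"):
            result["id"] = token.group("id")
        elif token.group("cls"):
            result["classes"].append(token.group("cls"))
        else:
            value = token.group("val")
            # A key without ="..." is treated as a boolean attribute.
            result["attributes"][token.group("key")] = True if value is None else value
    return result
```

Under this sketch `haskell {class="numberLines" id="mycodesample" startline="15"}` is accepted, while a free-form string such as `hello world` returns `None`. What the parser should do with the line in that case is a separate question.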

I think I prefer the option of giving some structure to the info string to the option of forbidding the space when there’s more than a single word in the info string, since the latter makes the presence of a single space have a big effect (and only in some cases), which might be surprising.

But nobody on this thread has actually commented on the idea of giving more structure to the info string.

vitaly · November 17, 2016, 6:54am

I have no personal preferences. Currently I use this syntax for fenced quotes extension:

```quote http://link.to/origin
multiline
markdown
content
```

but it’s not a big problem to switch to ~~~ delimiters, if constraints are for backticks only.

cben · November 21, 2016, 12:28am

Solution 0: do nothing.

jgm:

If you have a paragraph beginning with inline code that contains sequences of backticks with lengths 1 and 2, and it doesn’t fit on one line, then you’re completely out of luck; there is no way to write this in CommonMark. Of course, you could avoid the hard break and put everything on one line, but that is ugly and doesn’t usually have to be done.

It’s ugly, but it’s needed only for the very rare combination of (1) inline code (2) which is annoyingly long (3) containing `` (4) at start of paragraph. Are we still discussing that one use case, or is there a bigger purpose?

Is hard-wrapping so critical that a long line == “no way to write this”? Yes, most of the syntax is wrappable (even setext headers now!), but here we’re dealing with a conflict between inline code and line blocks, of which the latter is inherently not wrappable. If you wanted to express exactly the same long line of code in a code block, there would be no question that writing it as one line is the only option, and that’s OK.
IMHO allowing ``` oneword but requiring no space in ```two words is too surprising.
Do any existing implementations constrain the info string?
Let’s see, Babelmark shows few varying on space before word, and few varying on multi-word info string. But both are rare.
What happens when info string doesn’t adhere to the constraint? It’s still code, but inline code, right?

I’m worried about different implementations disagreeing on what’s code and what’s markdown. Especially on top level where off-by-one-fence can flip the meaning of a whole document.
Disagreement between block and inline is less severe, but with empty line it can grow into a what’s-code disagreement:

Gist: https://gist.github.com/cben/cb6b8354ac083ae94e7aa30486c83907 (info-string-what-is-code.txt)

``` info string *of unconstrained form*.
Is this code?  Is this a code block?

And this?
```
This is text IFF above sentence was code.

However this situation already exists between implementations that understand fenced blocks and those that don’t.
Not sure we’d be increasing the risk by constraining info string syntax.

The above Babelmark link shows some catastrophic text-became-code disagreements — even without empty lines!
(marked, PHP Markdown Extra, Maruku) and some milder code-became-text cases.
It seems some implementations do direct code block → text fallback, which is bad for interoperability

Intuitively, I expect any constraints on info strings should reduce interoperability; the current “anything goes (except backticks)” rule is simplest so should have better chance to be widely adopted. However, in practice it might not matter…

cben · November 21, 2016, 1:24am

Oh. The disagreements in above babelmark are exactly like what you explained to me above in 2015 as not a bug:

``` something that's illegal info string
The line above can NOT start a fenced block,
(this will be treated as text rather than code)
but the line below can:
```

(This text will be treated as code.)

``` another info string
The line above can NOT terminate the fenced block,
(which is good, this is treated as code as intended)
only the line below terminates the block:
```

I’m not sure that’s what happens in marked, PHP Markdown Extra, Maruku, but it’s what CommonMark would do if you constrain the info string.
A human reading sequentially thinks “opening fence, or at least start of inline code”; but the spec looks for block structure before inline, so as soon as you reject an intended opening fence, you’ll lock on the closing fence instead.
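
The flip can be reproduced with a toy block-first scanner that takes the info-string constraint as a parameter. This is a simplified sketch (backtick fences only, no containers), and `fenced_ranges` and the two predicates are hypothetical names, not any implementation's API.

```python
import re

# Up to 3 spaces, a run of 3+ backticks, then an info string without backticks.
FENCE = re.compile(r"^ {0,3}(`{3,})([^`]*)$")


def fenced_ranges(lines, info_ok):
    """Return (start, end) line numbers of fenced blocks, 1-based inclusive."""
    ranges = []
    open_at, open_len = None, 0
    for number, line in enumerate(lines, start=1):
        match = FENCE.match(line)
        if open_at is None:
            # A rejected info string means this line is not an opener at all.
            if match and info_ok(match.group(2).strip()):
                open_at, open_len = number, len(match.group(1))
        elif match and len(match.group(1)) >= open_len and not match.group(2).strip():
            ranges.append((open_at, number))
            open_at = None
    if open_at is not None:
        # An unclosed fence runs to the end of the document.
        ranges.append((open_at, len(lines)))
    return ranges


DOC = [
    "``` something that's illegal info string",
    "The line above can NOT start a fenced block,",
    "```",
    "(This text will be treated as code.)",
    "``` another info string",
    "only the line below terminates the block:",
    "```",
]

anything_goes = lambda info: True
one_word_only = lambda info: len(info.split()) <= 1

print(fenced_ranges(DOC, anything_goes))  # [(1, 3), (5, 7)]
print(fenced_ranges(DOC, one_word_only))  # [(3, 7)]
```

With the unconstrained rule, lines 1-3 and 5-7 are code and line 4 is text. With the one-word constraint, the intended closer on line 3 becomes an opener and line 4 turns into code.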

Here is another way this fails:

````` outer fence

Backticks on next line are just code
``` all inside outer fence, right?
foo

`````

But if “outer fence” is some illegal info string, it stops being a fence, and the 5 backticks no longer shield the 3 backticks inside!

The only solution I see is never constraining info strings (well we must outlaw backticks for inline one-liners).
IIUC anything else inevitably creates interoperability problems.

jgm · November 21, 2016, 9:33pm

Nice point that “Option 0” may actually be better than the other options, given the costs of each.

There are implementations that constrain the info string. Pandoc allows either a GitHub-style single word or pandoc-style attributes like {#id .class .class2 key="value"}.

vitaly · November 23, 2016, 2:56am

There is similar problem with ~~~text~~~

codinghorror · January 21, 2019, 8:29am

Can we remove this from the list of blockers for 1.0? “Do nothing” is a very appealing choice @jgm.

vanou · April 28, 2019, 3:35pm

Proposal

This is a proposal to solve this problem, and consists of these 5 statements.
(But #5 is optional.)

1. We prohibit a paragraph that consists of only one code span whose
   - backtick string is more than 2 backticks
   AND
   - end backtick string has its own line, and is preceded by only 0~3 space(s) AND followed by any number of spaces
2. Code span doesn’t convert line ending to space. Just remove line endings.
3. Length of code fence must be longer than sequence of backticks this fenced code block contains.
4. Length of backtick string must be longer than sequence of backticks this code span contains.
5. Fenced code block requires blank line before & after it.

Below, I will explain these. Sorry for long script.

About #1 & #2

If backtick string consists of less than 3 backticks

This is a code span. It doesn’t matter whether the backtick string has its own line or not.
Because a code fence must be at least 3 backticks long, such strings are recognized as code spans.

If text which precedes end backtick string contains non-space character

This is a code span. It doesn’t matter whether the start backtick string has its own line or not, or how long the backtick string is, because a closing code fence must have its own line.

If text which follows end backtick string contains non-space character

This is a code span. It doesn’t matter whether the start backtick string has its own line or not, or how long the backtick string is, because a closing code fence must have its own line.

If end backtick string has its own line, and is preceded by only 0~3 space(s) AND followed by any number of spaces, and length of backtick strings is longer than 2

This is when ambiguity comes.

``` nice
days
```

Is this meant to be

this one

<p><code> nice days</code></p>

or this one?

<pre><code class="language-nice">days
</code></pre>

To determine whether this is fenced code block or code span, I introduce one restriction on paragraph (#1).
(Why on paragraph? Because this fenced code block vs code span ambiguity only happens in paragraph, I think.)

We prohibit a paragraph that consists of only one code span whose

backtick string is more than 2 backticks

AND

end backtick string has its own line, and is preceded by only 0~3 space(s) AND followed by any number of spaces

Then, if we want a paragraph that consists of only one code span whose backtick string is more than 2 backticks, we make the end backtick string preceded by content of the code span, by splitting the content with a line ending.

```content
of code spa
n```

But as of CommonMark 0.29, if content of code span consists of only one long string, this doesn’t work well.

Because the content of a code span is normalized as follows:

First, line endings are converted to spaces.

If the resulting string both begins and ends with a space
character, but does not consist entirely of space
characters, a single space character is removed from the
front and back. This allows you to include code that begins
or ends with backtick characters, which must be separated by
whitespace from the opening or closing backtick strings.

So, if we make code span only contains one long string (e.g. sha256 hash) and want to hard-wrap it, this normalization process introduces problem:

This one code span

`sha256:e3b0c44298fc1c149afbf4c8996fb
92427ae41e4649b934ca495991b7852b855`

will result in

sha256:e3b0c44298fc1c149afbf4c8996fb 92427ae41e4649b934ca495991b7852b855

But, if you copy&paste this result, you see a space between 8996fb and 92427a.
This is not what I expect.

So, #2 comes:

Code span doesn’t convert line ending to space. Just remove line endings.
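
The difference between the 0.29 normalization quoted above and proposal #2 can be shown side by side. This is a minimal sketch of just the two steps described in this post; the function names are made up for illustration.

```python
def _strip_one_space(text: str) -> str:
    # Remove one space from each end, but only if the text both begins and
    # ends with a space and is not made up entirely of spaces.
    if text.startswith(" ") and text.endswith(" ") and text.strip(" "):
        return text[1:-1]
    return text


def normalize_029(content: str) -> str:
    """CommonMark 0.29 behavior as quoted above: line endings become spaces."""
    return _strip_one_space(content.replace("\n", " "))


def normalize_proposal_2(content: str) -> str:
    """Proposal #2: line endings are simply removed."""
    return _strip_one_space(content.replace("\n", ""))


wrapped = "sha256:e3b0c44298fc1c149afbf4c8996fb\n92427ae41e4649b934ca495991b7852b855"
print(normalize_029(wrapped))         # hash with a space in the middle
print(normalize_proposal_2(wrapped))  # hash joined back into one string
```

The trade-off is that ordinary multi-word code spans that wrap between words would lose the space at the wrap point under #2.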

About #3 & #4

#1 & #2 are not sufficient.

The following examples are still ambiguous.

#one fenced code block OR one code span followed by ```?
```cannot determine
whether fenced code block containing ```
```
#one code span OR two code spans?
```
two``` ```code spans?```

With #3 & #4, we resolve the ambiguity in both examples.

For first example,

if you intend this to be one fenced code block, following #3, the code fence’s length must be longer than 3.
`````cannot determine
whether fenced code block containing ```
`````

if you intend this to be one code span followed by ```, no change makes sense.
(If parser obeys #3, then it thinks this is not a fenced code block.)
```cannot determine
whether fenced code block containing ```
```

For second example,

if you intend this to be one code span, following #4, length of backtick strings must be longer than 3.
`````
two``` ```code spans?`````

if you intend this to be two code spans, no change makes sense.
(If parser obeys #4, then it thinks this is not one code span.)
```
two``` ```code spans?```
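
For an author or a tool that generates Markdown, #3 and #4 boil down to "use one more backtick than the longest run in the content". A small sketch, with hypothetical helper names:

```python
import re


def longest_backtick_run(text: str) -> int:
    """Length of the longest consecutive run of backticks in text (0 if none)."""
    return max((len(run) for run in re.findall(r"`+", text)), default=0)


def min_fence_length(code: str) -> int:
    # Rule #3: the fence must be longer than any backtick run in the block,
    # and a code fence is never shorter than 3 backticks.
    return max(3, longest_backtick_run(code) + 1)


def min_span_delimiter(code: str) -> int:
    # Rule #4: the delimiter must be longer than any backtick run in the span.
    return longest_backtick_run(code) + 1
```

For the first example this gives a fence of at least 4 backticks; the 5-backtick fences shown above satisfy the rule as well.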

About #5

I think there is one problem, if I follow only #1, #2, #3 and #4 (but not #5).

If I want one code span to be embedded in paragraph like below,

abc def
```ghi
jkl mno
```
pqr stu

this will be understood by parser as ‘one paragraph’ + ‘one fenced code block’ + ‘one paragraph’.

Because, as of CommonMark 0.29,

A fenced code block may interrupt a paragraph, and does not require a blank line either before or after.

To deal with this situation, we need #5:

Fenced code block requires blank line before & after it.

Thank you for reading.

jgm · April 28, 2019, 5:14pm

@vanou I think this is too complex; in addition, requiring blank space before and after a fenced code block would break too many existing documents.

@codinghorror - I think “do nothing” is probably okay for now. A better solution, I think, is the pandoc one: constrain the info string so that it is either (a) a single word or (b) a pandoc-style attribute block in curly braces. This would require agreement on the attribute syntax and would rule out free-form info strings.

vanou · April 28, 2019, 11:53pm

@jgm

(In response to your feedback, I am relaxing the proposal.)
It’s enough to only follow #1.
And putting #2 & #3 & #4 aside is ok.

As I said, #5 is optional. It’s ok to reject #5, because of backward incompatibility. Backward compatibility is important.

About pandoc style

Constraining style of info string means constraining code span style.

Following your idea, if code span’s start backtick string has its own line, (a) a single word or (b) a pandoc-style attribute block in curly braces cannot follow start backtick in same line?

Does this constraint introduce a backward incompatibility?

Crissov · April 29, 2019, 10:46am

a) red or seem backwards compatible. In addition to an optional single “word” at the start, CM would need to allow any number of “words” preceded by any of #, ., @, ?, !, _, -, +, and both sides of = and :, as well as quoted (', ") and parenthetical ((), [], <>, {}) strings that may include spaces. Other ASCII punctuation may be in current use as prefix as well.

vanou · April 30, 2019, 2:29am

Sorry. This is same as @cben 's idea.

I agree with @cben 's idea. It’s easy to understand and simple.

jgm · April 30, 2019, 4:14am

cben:

I don’t leave a space before the closing backtick, even a hard-wrapping editor won’t accidentally put it at start of line.

But of course there are cases where you need that
space: when the content being quoted ends with a
backtick.

Admittedly, though, this is a rare kind of case.
Which is why, in practice, this issue can probably
be ignored for now (option 0).

vanou · April 30, 2019, 6:22am

@jgm

Ah. Yes, it could happen. I agree.
I agree that @cben 's idea is not enough.

How about this approach?

The line including closing backtick must have at least one non-space character, other than closing backtick, that comes before OR after closing backtick.
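
As a check on what this rule would accept, here is a sketch of the test a parser might apply to a candidate closing line. `may_close_code_span` is a hypothetical name, and `n` is the length of the opening backtick string.

```python
import re


def may_close_code_span(line: str, n: int) -> bool:
    """True if line has a run of exactly n backticks plus other non-space text."""
    for match in re.finditer(r"(?<!`)`{%d}(?!`)" % n, line):
        # Everything on the line except this candidate closing run.
        rest = line[:match.start()] + line[match.end():]
        if rest.strip():
            return True
    return False
```

A line holding only the backtick string (with optional spaces around it) is rejected, so it stays available as a code fence, while `n```` or ```` tail` can close a span.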