Insist that fenced code blocks are properly closed

gjtorikian · September 4, 2014, 7:48am

Say I write some text like this:

```
hello
<eod>

Because I’ve neglected the closing fence mark, the entire document after that point is marked up within a pre block.

This produces unexpected results: no other block element (or inline element, for that matter) can accidentally alter the entire rest of the document like this.

Wouldn’t it be possible to say something to the effect of:

1. Detect a fenced code block.
2. Continue looking for a matching closing fence.
3. If I reach the end of the document, break out, render the opening fence as literal backticks (or something), and continue parsing.
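
The steps above can be sketched roughly as follows. This is a minimal Python illustration, not the behaviour of any CommonMark implementation: `FENCE_RE`, `find_closing_fence` and `split_blocks` are hypothetical names, and the fence rule is a simplified assumption (basic fence shape only, no blockquotes or list containers).

```python
import re

# Simplified fence shape (assumption): up to 3 spaces, then 3+ backticks or tildes.
FENCE_RE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def find_closing_fence(lines, start, char, length):
    """Return the index of the first closing fence at or after `start`, or None."""
    for i in range(start, len(lines)):
        m = FENCE_RE.match(lines[i])
        if (m and m.group(1)[0] == char and len(m.group(1)) >= length
                and not m.group(2).strip()):
            return i
    return None


def split_blocks(lines):
    """Split lines into ("code", [lines]) and ("text", line) items."""
    out = []
    i = 0
    while i < len(lines):
        m = FENCE_RE.match(lines[i])
        # A backtick fence whose info string contains a backtick is not an opener.
        if m and not (m.group(1)[0] == "`" and "`" in m.group(2)):
            close = find_closing_fence(lines, i + 1, m.group(1)[0], len(m.group(1)))
            if close is not None:
                out.append(("code", lines[i + 1:close]))
                i = close + 1
                continue
        # Ordinary line, or an opener with no closer: keep it as literal text.
        out.append(("text", lines[i]))
        i += 1
    return out


print(split_blocks(["```", "hello"]))
```

Note the cost of this approach: every opener that turns out to be unclosed triggers a scan to the end of the input before parsing can resume.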

SvenDowideit · September 4, 2014, 10:27am

Yes please. Having rigorous validity feedback for the writer would help immensely - especially when the ‘compiler’ can give that feedback at the right location - i.e., ‘Sorry, you started a fenced code block here, and need to close it somewhere’.

jgm · September 4, 2014, 5:00pm

SvenDowideit [Sep 04 14 10:37] writes:

> yes please. Having rigorous validity feedback for the writer would help immensely - especially when the ‘compiler’ can give that feedback at the right location - i.e., ‘Sorry, you started a fenced code block here, and need to close it somewhere’

There is no such thing as an invalid Markdown document, according to
this spec (and nearly every implementation). It’s a separate question
whether to issue warnings when we see something that looks like it
was probably not intended. That could definitely be done.

The question of what the parser should return for an unclosed
backtick fence is separate from the issue of warnings. Either option
is probably unintended by the user, so I don’t see a great difference
from that point of view. (Putting the whole rest of the document in
the code block makes it more likely that the error will be noticed.)

From a parsing point of view, it’s much better if you don’t have to
backtrack. The current implementations have the nice property that
they can parse input line by line, throwing away each line after
they parse it.
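
As an illustration of the line-by-line property described above, here is a minimal Python sketch of a fence tracker that keeps only one small piece of state and never looks back. It is not the reference implementation's code: `FENCE_RE` and `stream_blocks` are hypothetical names, and the fence rule is a simplified assumption that ignores container blocks.

```python
import re

# Simplified fence shape (assumption): up to 3 spaces, then 3+ backticks or tildes.
FENCE_RE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def stream_blocks(lines):
    """Yield ("text", line) or ("code", line), one input line at a time."""
    fence = None  # (char, length) of the currently open fence, if any
    for line in lines:
        m = FENCE_RE.match(line)
        if fence is None:
            # A backtick fence whose info string contains a backtick is not an opener.
            if m and not (m.group(1)[0] == "`" and "`" in m.group(2)):
                fence = (m.group(1)[0], len(m.group(1)))
            else:
                yield ("text", line)
        elif (m and m.group(1)[0] == fence[0] and len(m.group(1)) >= fence[1]
                and not m.group(2).strip()):
            fence = None  # closing fence found
        else:
            yield ("code", line)
    # End of input: an open fence is simply treated as closed; nothing is re-parsed.


print(list(stream_blocks(["```", "hello", "# title"])))
```

Each line is classified as soon as it is read, which is why an unclosed fence turns everything after it into code.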

gjtorikian · September 4, 2014, 6:01pm

Could you expand on that? Why is it much better? Because from a user point of view, I think it’s pretty poor.

I guess I’m looking at it from a state machine perspective. You keep going down the document, wrapping things in a pre, until you realize you hit the end and never found a closing pre. Then you head back to where you were and pop off. There might be a performance implication, to be sure, but a performance degradation in an error state makes more sense than rendering some unexpected content, IMHO.

nebula · September 21, 2018, 3:08am

I would like to come back to this point as it is a more fundamental one than it may appear. I am implementing a CommonMark parser and I am stumbling on this point.

For me, and I think for most users, text should be text unless it is stated clearly that we are formatting using Markdown. All syntactic elements of Markdown follow this philosophy:

- A header is not a header unless it is properly formed, e.g. #header is not a header.
- A link is not a link unless it is properly closed, e.g. [link name](link destination is not a link.
- Emphasis needs to be closed to be considered emphasis, e.g. *emphasis is not emphasis.
- etc.

For a reason I cannot understand, the fenced code block rule seems to be different: a fenced code block is a code block even without a proper closing fence.

The more important point, though, is the user perspective. As the current CommonMark spec states it, if I write three backticks, all the text after them becomes code. So we assume the user intends to turn all the following text into code, but if I write three backticks, my intention will 95% of the time be to start a new code block that does not include the text below it.

John MacFarlane says:

> Either option is probably unintended by the user, so I don’t see a great difference from that point of view.

This comment is true if you look at the problem from a final-document perspective, but if we treat the document as a document “to become” or “in progress”, as is the case in an editor, there is a big difference for the user.

So, more than just deviating from the Markdown parsing philosophy as I see it, the current way of handling fenced code is unintuitive from a user perspective.

In other implementations, e.g. MultiMarkdown, a fenced code block is not a fenced code block unless it is closed. GitHub seems to agree with the current spec, though.

In conclusion, I think CommonMark should not consider three backticks as being part of a block unless there is a corresponding closing fence. Text should be text unless the user has clearly and correctly stated the intention to create a new Markdown element.

Since CommonMark has not reached 1.0 spec, I think it is still time to have these discussions.

Any thoughts?

jgm · September 21, 2018, 4:07am

I see the point about possible effects on syntax
highlighting or live preview in a document that
is being edited.

However, the parsing method used in the reference
implementations really doesn’t allow us to require
a closing fence. If you require a closing fence,
then you have potentially unlimited backtracking:
you may have to parse to the end of the document
before you know that there’s no closing fence.
Having made that determination, everything needs
to be redone.

The current spec allows us to construct efficient
non-backtracking parsers. This is important both
for performance and to avoid DDOS attacks. Is
it worth giving this up to require closing fences?

I’m open to suggestions here. The reasoning
behind the current rule was: this is probably a
“don’t care,” because an unclosed code block is
probably a mistake; therefore we can decide based
on parser efficiency considerations. You are pointing
out that it’s not completely “don’t care,” because
these things can occur in “in-progress” documents
and would affect syntax highlighting and live preview.

nebula · September 21, 2018, 6:52am

There is indeed a problem with fenced code. It is the only kind of block whose continuation
is implicitly stated, at the same level as a paragraph, and it therefore requires backtracking.

A blockquote continuation is explicitly stated using > in front of every line; indented code blocks are explicitly stated using the spaces in front of every line. Paragraph continuations are implicitly stated because paragraphs are the default block unless another block is explicitly stated.

As I see it, the purpose of fenced code blocks is to make it easier to introduce code in a Markdown document; they are basically an easier alternative to the indented code blocks that were part of the original specification. GitHub introduced them because they are a “code”-oriented company and wanted to facilitate this particular use, but I don’t see any important reason why CommonMark should follow this path other than compatibility.

Here are the options I can see:

1. Introduce backtracking in the reference implementation, which may be difficult and may impact performance and security, but would bring a cleaner syntax parsing philosophy. I am not that familiar with the reference implementation, but there must be a way to backtrack efficiently. Basically, parse in a pessimistic way: when we encounter a fence, consider it a possible fenced code block but continue parsing as normal until we see another fence; we then pop() the last fence, create one fenced code block, and replace the nodes parsed since the first fence with the newly created fenced code block. I don’t see any big performance impact here… or…
2. Remove the need for backtracking for fenced code blocks by requiring explicit block continuation, as is done for other kinds of blocks. CommonMark is already distancing itself from the original specification by using fenced code blocks, but I think it would be better aligned with it by using the “explicit block continuation” philosophy of the first specification by John Gruber.

The one I can think of right now is the following:

``` swift 
```	
```	func test() {
```		
```	}

or simply:

`   swift 
`	
`	func test() {
`		
`	}

It’s not as simple as the previous fenced code block syntax, but it has the advantage of being clear and
forcing the user to state their intention explicitly. Any other suggestions would be welcome!

Note: It would also solve this problem:

jgm · September 21, 2018, 5:00pm

Actually, GitHub didn’t introduce it. Delimited code blocks (with ~~~ instead of backticks) have been supported in pandoc and PHP-Markdown for a long time. Michel Fortin and I hashed out the syntax on the markdown-discuss list in October 2007, before GitHub even existed. What motivated this was not ease of writing or cut-and-pasting, or the need to specify a language syntax, but rather the problem of how to put an indented code block after a list item (since indented content gets interpreted as a list continuation paragraph). Here’s the history for those who are interested.

I agree that there is a conceptual case against fenced code blocks. However, they’ve caught on and aren’t going away. We need to support them. They have several advantages over indented code blocks:

1. easier to write
2. easier to cut-and-paste from the markdown source
3. can occur directly after a list without being captured as part of the list item
4. allow specification of a language syntax

Your idea of requiring an explicit delimiter (besides an indent) on every line would solve 3 and 4, but I can’t see it catching on, because of 1 and 2.

So the choices I see are:

A. Keep things as they are. This leaves an issue for rendering a partially written document (e.g. in an editor), but perhaps this could be solved in the editor itself. For example, the editor could parse the document after an edit, and if the parse tree ends with a fenced code block, the editor could check for a closing code fence and then take some appropriate action.

B. Change the spec and require some kind of backtracking. You may be right that there is a way to do this efficiently, but I’m not yet convinced.
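
The editor-side check mentioned under option A could look roughly like the sketch below. It is only an illustration under simplifying assumptions: `FENCE_RE` and `find_unclosed_fence` are hypothetical names, the scan works on raw lines instead of a real parse tree, and container blocks such as blockquotes and list items are ignored.

```python
import re

# Simplified fence shape (assumption): up to 3 spaces, then 3+ backticks or tildes.
FENCE_RE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def find_unclosed_fence(lines):
    """Return the 0-based index of an opening fence that is never closed, else None."""
    open_idx = None
    open_char = ""
    open_len = 0
    for i, line in enumerate(lines):
        m = FENCE_RE.match(line)
        if not m:
            continue
        fence, rest = m.group(1), m.group(2)
        if open_idx is None:
            # A backtick fence whose info string contains a backtick is not an opener.
            if fence[0] == "`" and "`" in rest:
                continue
            open_idx, open_char, open_len = i, fence[0], len(fence)
        elif fence[0] == open_char and len(fence) >= open_len and not rest.strip():
            open_idx = None  # this line closes the open fence
    return open_idx


print(find_unclosed_fence(["intro", "``` swift", "func test() {", "}"]))  # 1
```

An editor could run such a check after an edit and, for example, show a warning at the returned line instead of restyling the rest of the buffer.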

nebula · September 22, 2018, 3:42am

There may be a third option.

It all boils down to this simple question: is Markdown code?

In the original specification by John Gruber, Markdown elements could be easily embedded in indented code blocks because the indentation itself removed the embedded elements from the Markdown syntactic space. So it was possible to write this:

    # indented header level 1 that is not one, it's code

as the header here is not a header according to the current syntactic space. In the original specification there was no regionality: all the rules were valid everywhere.

The fenced code block introduces regionality by saying: “everything between the opening fence and an eventual closing fence, present or not, is code”. Even if it’s valid Markdown, it’s code. In the original spec, all valid Markdown stayed valid regardless of position or context.

I think the CommonMark spec should follow the original specification and all valid Markdown should always be valid.

As I said before, the fenced code block shares a parsing property with the paragraph: its continuation is implicit. I think the same closing rule as for the paragraph should apply. Therefore, the solution would be to end an unclosed fenced code block when reaching a valid Markdown element according to the current syntactic space. A fenced code block could be ended by explicit block elements like headers, HTML blocks, or anything valid that tells us that we are no longer in a fenced code block.

A fenced code block could not be stopped by a paragraph, so the ending fence would be necessary to distinguish the end of a fenced code block in case it is followed by a paragraph that we want to exclude from it.

This resolves all the problems, aligns with the original spec, and introduces a nice property: all valid Markdown elements according to the current syntactic space are always valid, i.e. they are not code. It also keeps a fenced code block from continuing all the way to the end of the document, which is certainly not what the user intends, especially if, for example, a header follows. So handling a fenced code block this way is also intuitive from a user perspective.

This introduces something else, though: Markdown could no longer be included in a fenced code block unless it is indented or “not valid”. So to make a code block with Markdown inside it, someone would have to use either the indented code block or the

<pre><code># Valid markdown header</code></pre>

HTML tags. But I think this limitation is small, and in fact the new rule fixes the conceptual case against fenced code blocks. More importantly, the reference implementation can continue with the streamed parsing strategy already used, no backtracking is introduced, compatibility with other implementations is kept in most cases (unless fenced code blocks are used to enclose valid Markdown in the current context), and the gap in parsing philosophy with the original specification is eliminated.

nebula · September 29, 2018, 5:32am

After giving some more thought on this, here are my conclusions…

The way fenced code blocks are handled is not practical in a live context like an editor, because we can force reparsing of files of possibly tens of thousands of lines, unlike in programming languages, where a file hardly ever goes beyond 2000 lines (in which case it’s a good indication there is too much there, but that’s another subject…).

So the use case for Markdown is to be able to handle big files fast (novels, technical documentation, etc.). As anybody can read in the README of the C reference implementation, it can parse War and Peace in 127 milliseconds. It’s fast, but not fast enough, and the War and Peace file is basically just text; there is no Markdown formatting in there, which, I suppose, would surely slow things down.

So, to come back to your options (John MacFarlane), option A is not really an option because of the speed required to compute properly in a live environment, and since computer chips do not seem to be improving much in terms of speed, I think going forward with option A would severely limit editor applications.

Option B is more interesting, but it also has severe limitations, one of which is that since there is no distinct syntax for opening and closing fences (except when a language is added to the opening fence), any fence can match any other. It adds a lot of complexity to the parsing strategy, and a lot of state has to be kept to handle things properly. So I wouldn’t go down that road either.

The only option left is my last proposal, but I think it may be too restrictive. It also removes the possibility of using a library like highlight.js to parse Markdown code inside a code block. So I would suggest allowing Markdown inside a fenced code block if the opening fence is specifically marked with markdown. Otherwise, I would close the fenced code block as in the previous proposal: end any fenced code block when encountering a valid explicit Markdown block (excluding paragraphs), like a header, a blockquote, or anything that makes sense; the exact list is to be defined.

So the only case left is when we don’t close a fence specifically marked with the markdown language. I would handle this case as the reference implementation does right now. And I think application developers can live with that.

So both examples below would be parsed as fenced code blocks:


``` markdown 

# header 1 

```

parsed as:

# header 1

and:


``` 

func isEven(number: Int) {
	
	...
}


# header

would be parsed as:

func isEven(number: Int) {
	
	...
}

header

jgm · September 29, 2018, 7:07am

Sébastien Hamel writes:

> The way fenced code blocks are handled is not
> practical in a live context like an editor because
> we can force reparsing files of possibly tens of
> thousands of lines unlike any programming languages
> where a file hardly ever goes beyond 2000 lines (in
> which case it’s a good indication there is too much
> there, but that’s another subject…)

Requiring fenced code blocks to be closed would not
help with this.

Suppose you’re on line 500 of a 10,000-line file. The
editor is displaying lines 490-520. You insert
three backticks on line 500. With the current spec,
that instantly changes the meaning of every line
501-10,000.

But suppose the spec were changed to require fenced
blocks to be closed (not just by the end of the
document). Then we’d STILL need to parse to the
end of the file before we knew how to render
lines 501-10,000. (In fact, it’s better with the
current spec, since you can KNOW after parsing
lines 490-520 that lines 501-520 should be rendered
as code. You don’t need to scan past the editor’s
window.)

The problem of non-local changes would remain even if
we got rid of fenced code blocks altogether. Suppose
you have 1000 blocks of text separated by blank lines.
The first one is unindented; the rest are all indented
4 spaces. Now, adding 1. to the beginning of the
first paragraph (making it a list item) will change
the meaning of the entire rest of the document.
Previously, the indented blocks were (indented)
code blocks. Now, they are continuation paragraphs
of the list item.

Or, just imagine your editor window is lines 5000-5030
of a 10,000 line file. All of these lines are
indented eight spaces. That doesn’t tell you much
about how to highlight them! They could be
continuation paragraphs for a deeply nested list.
Or they could be indented code for code that starts
with whitespace. Or they could be indented code for
code that doesn’t start with whitespace, but inside
a list item. Or they might be inside a fenced code
block that starts much earlier and ends much later.

Reference links also cause non-locality (see my post
on “Beyond Markdown” for more on this). With the
current spec, you can’t know if something is a link
until you’ve parsed the whole document, because there
might be a reference link definition at the very end.

jgm · September 29, 2018, 7:14am

Regarding the suggestion that

```
func isEven(number: Int) {
	
	...
}


# header

be parsed as a code block followed by a header with
content “header”: I don’t see how anything like that
could work. You’re asking the parser to figure out
what is code and what is content. Of course, a human
can usually do this, but that takes a lot of
intelligence. I don’t see any mechanical rule that
would do it. Note that # header is a perfectly
good line in many programming languages (a comment).

Note also that code blocks can be used for any kind
of literal content, including for example samples of
markdown!

So I wouldn’t go this direction with the spec. That
said, nothing stops you, in implementing an editor,
from building in heuristics like this, to do fast
syntax highlighting without looking at the whole
document.
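
To make the objection concrete, here is a minimal Python sketch of the proposed rule ("an unclosed fenced block ends at the first explicit Markdown block, unless the fence is marked markdown"). It is an illustration only: `code_lines_under_proposal` and both regexes are hypothetical, and ATX headings stand in for the whole, still undefined, list of "explicit blocks".

```python
import re

# Simplified fence shape (assumption): up to 3 spaces, then 3+ backticks or tildes.
FENCE_RE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")
# ATX heading: 1-6 '#' characters followed by whitespace or end of line.
ATX_HEADING_RE = re.compile(r"^ {0,3}#{1,6}(\s|$)")


def code_lines_under_proposal(lines):
    """lines[0] must be an opening fence; return the lines kept as code."""
    opener = FENCE_RE.match(lines[0])
    char, length = opener.group(1)[0], len(opener.group(1))
    info = opener.group(2).strip()
    code = []
    for line in lines[1:]:
        m = FENCE_RE.match(line)
        if (m and m.group(1)[0] == char and len(m.group(1)) >= length
                and not m.group(2).strip()):
            break  # ordinary closing fence
        if info != "markdown" and ATX_HEADING_RE.match(line):
            break  # proposed rule: a heading ends the block
        code.append(line)
    return code


shell_example = ["``` sh", "# install dependencies", "make install", "```"]
print(code_lines_under_proposal(shell_example))  # [] : the comment was taken for a heading
```

In this sketch the properly closed shell block loses all of its content, because its first line is a comment that is also a well-formed heading.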

nebula · September 30, 2018, 4:29am

Thanks for the examples! Enlightening.

I do not suggest that we should require the fenced code block to be closed; it’s the opposite, actually. I say we should close fenced code blocks when we reach an explicit valid Markdown block element, when the language specified is different from markdown or there is no language specified at all.

No, I’m asking the code to figure out what is Markdown. It’s this example that actually shows how to remove the need for an ending fence: we stop when we reach valid Markdown. If the content the user wants in is excluded from the block because the language shares common syntax with Markdown, then the user just has to indent it.

About the non-locality of references, I am actually working on this part in the parser. And I realise that there are some fundamental problems in the language which make it impossible to fix now. So, leave everything as it is. I read Beyond Markdown: I am all in.

jgm · September 30, 2018, 5:36am

Sébastien Hamel writes:

> No, I’m asking the code to figure out what is
> Markdown. It’s this example that shows how to
> relieve the necessity of a ending fence actually: we
> stop when we reach valid Markdown. If the content

Well, any string of characters is valid Markdown, so
that’s not going to work…