Gap in spec section 2.2, 4.4: HT vs SP vs code block

Section 2.2 requires that tabs (HT) are not expanded into spaces (SP); I take this to mean that HT is required to be “preserved” and passed through into the output:

Tabs in lines are not expanded to spaces.

For code block lines, section 4.4 requires removing “four spaces of indentation”:

The contents of the code block are the literal contents of the lines, including trailing line endings, minus four spaces of indentation.

Consider a code block input line with ( HT , HT , SP ) as indentation (visualized here as “^I^I␣”), like the line containing “A” in this input:

12345678 Consider this code block:

␣␣␣␣␣␣Code:
^I^I␣A

The words of the specification require removing “four spaces of indentation” from “^I^I␣A”, but do not say how: there’s only one SP that could be removed, and the preceding HT is not supposed to be “expanded to spaces” by the clear words of section 2.2.

The actual result produced by the reference implementation is this:

␣␣Code:
^I␣A

But I don’t see how this can be inferred from the specification (without silently, just on this occasion, skipping the “HT is not expanded to SP” rule!). To cover the actual behavior in the given example, the “minus four spaces” rule for code block lines could be rephrased as:

[…] minus four spaces of indentation if the indentation ends in four or more spaces, or else minus the last tab character in the indentation.

This would explain the above case nicely, but would fail for an input line like “^I␣^I␣X” in a code block: The “X” is in column 10 (counting from 1), and removing the last HT gives “^I␣␣X”, which leaves “X” in column 7: it is thus akin to removing only three SP here. Oops! We could instead remove the first HT here, producing “␣^I␣X”, where “X” ends up in column 6, as if we removed “four spaces of indentation”, as required.
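The column figures above can be checked with a small sketch. It assumes a tab stop of 4 and 1-based column counting, as in this post; the function name `column_of_content` is made up for illustration and comes from neither the spec nor any implementation.

```python
TAB_STOP = 4  # assumption: the 4-column tab stop used throughout this thread


def column_of_content(line: str, tab_stop: int = TAB_STOP) -> int:
    """Return the 1-based column of the first character that is not SP or HT."""
    col = 0  # columns consumed so far
    for ch in line:
        if ch == "\t":
            col += tab_stop - (col % tab_stop)  # jump to the next tab stop
        elif ch == " ":
            col += 1
        else:
            break
    return col + 1


# Original line, last HT removed, first HT removed.
for text in ("\t \t X", "\t  X", " \t X"):
    print(repr(text), column_of_content(text))
```

This prints columns 10, 7 and 6 for the three lines, matching the numbers in the paragraph above.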

The problem here, I would say, is that in the general case, in a sequence of SP and HT:

  • not every SP effectively moves the active position to the right by one (if followed by HT), and
  • not every HT effectively moves the active position to the right by four (by definition).

So the “no HT expansion” requirement, combined with the “as if expanded” description of behavior, makes things rather messy here: exactly which characters of the indentation string are supposed to be deleted, or expanded and replaced?


Which brings me to this question: Why is it that HT must not be expanded into SP, even in a line’s indentation? Are there actual applications where this is crucial? Is it prudent to, say, try to extract a Makefile (where the distinction between SP and HT is obviously of utmost importance) from a parsed code block that comes out of a CommonMark processor? Is this a common usage scenario?

This requirement certainly complicates the specification, as demonstrated above, and I’m sure it complicates implementations too. So what is the rationale behind it, and how is this complication justified?

There are certain places where you want to be able to have literal tabs (mainly, code blocks and code spans – imagine you’re writing a tutorial on Makefiles, and you want people to be able to copy and paste).

The reference implementations keep track of a virtual column position as they consume characters, so they don’t need to expand tabs. This is not too complex.

However, as I’ve already pointed out in the “1.0 issues” thread, there are some rough edges, and the spec was never fully updated after the tab change. The rough edges are cases where, if tabs had been expanded to spaces, part of a tab’s worth of spaces would be consumed as indentation, the rest belonging to a code block.

Example

In this example, indented code would normally need to be indented 6 spaces, so given what the spec currently says you’d expect the code block to begin with 2 spaces. But it doesn’t – both tabs are gobbled up and the code block has no initial spaces. I think the implementation’s behavior is what we want; after all, no spaces were typed, so why should the code block begin with spaces? But the spec needs to be adjusted so this is all clear.
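To make the “virtual column” idea concrete, here is a minimal sketch of one way to strip a given number of columns of indentation without expanding tabs. It is not the reference implementations’ actual code: the function name and signature are hypothetical, and the tab stop of 4 is taken from this thread. A tab that straddles the target column is consumed whole, which reproduces the behavior described above.

```python
def strip_columns_whole_tabs(line: str, columns: int, tab_stop: int = 4) -> str:
    """Drop leading SP/HT until at least `columns` columns have been consumed.

    Hypothetical helper: a tab that crosses the target column is consumed
    entirely, so no spaces are ever synthesized in the output.
    """
    col = 0  # virtual column position
    i = 0    # index of the next unconsumed character
    while i < len(line) and col < columns and line[i] in " \t":
        if line[i] == "\t":
            col += tab_stop - (col % tab_stop)  # advance to the next tab stop
        else:
            col += 1
        i += 1
    return line[i:]


print(repr(strip_columns_whole_tabs("\t\tcode", 6)))      # both tabs consumed
print(repr(strip_columns_whole_tabs("\t\t A", 4)))        # only the first tab consumed
print(repr(strip_columns_whole_tabs("      Code:", 4)))   # four of six spaces consumed
```

With these inputs the sketch yields "code", "\t A" and "  Code:", which agrees with the results reported in this thread for the reference implementation.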

[…] so given what the spec currently says you’d expect the code block to begin with 2 spaces.

Well, no: Given that the spec says “remove four SP characters”, and that there are no SP characters in the line, strictly speaking I don’t know what to expect, hence my initial posting. Kind of like “division by zero”, I’d say. Why shouldn’t I expect to see both HT characters in the output, thinking “no SP here, so nothing gets deleted”?

Erm, after looking it up again, the spec says not “remove four SP characters”, but:

[…] literal contents of the lines, including trailing line endings, minus four spaces of indentation. […]

The crux here is that the word “spaces” seems to be used with two meanings: (1) the SP character; and (2) a character position in the line. So “minus four spaces” could mean:

  • delete four SP characters (that’s what I thought, seduced by the “literal contents” words), or
  • move four positions to the left, or four positions less indentation.

And I’m starting to think that you meant the second interpretation all the time? But “literal contents” has only the meaning “a string of characters” in my book, so what is your intended interpretation of a phrase like

[[phrase denoting some character string]] “minus four spaces”

if not “the result obtained by removing four SP characters from [[phrase denoting some character string]]”?


But the spec needs to be adjusted so this is all clear.

My take on it would be—off the top of my head, but I think this is viable:

  1. The indentation (a string of HT and/or SP) of the code block line is

    • “tab-expanded” (there’s no better word for this?) from left to right,
    • until it has a prefix consisting of enough SP characters to “reach” (or “fill”?) the code block’s indentation,
  2. Then the number of SP that corresponds to the code block’s indentation is removed from (the start of) the code block line,

  3. The remainder of the code block line is “literally” output as a content line for the code block.

NOTE - This is equivalent to “expand all HT to SP, then remove surplus indentation”, except that we preserve (and output) as many as possible HT in the “right part” (in both senses of the word “right” :wink: ) of the code block line.

EXAMPLE
In your example, the code block’s indent is 6 (blank positions in front of each line), the code block line is ( HT , HT , “code” ). So according to my little spec, we must expand (starting from the left) the first HT, which yields ( SP , SP , SP , SP , HT , “code” ). We still need two more SP to reach the indentation of 6, so one more round, and we have ( 8 * SP , “code” ). Deleting 6 SP (equivalent to the code block’s indentation of 6) from the left gives ( 2 * SP , “code” ), which is the content we place into the code block.
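The three steps of the proposal can be sketched as follows. This is only an illustration of the wording suggested in this post, not spec text or code from any implementation; the name `strip_indent_expanding` and the default tab stop of 4 are assumptions.

```python
def strip_indent_expanding(line: str, indent: int, tab_stop: int = 4) -> str:
    """Expand leading HT/SP left to right only as far as needed, then drop `indent` SP.

    Hypothetical helper: tabs to the right of the expanded prefix are
    passed through literally.
    """
    prefix = ""  # the SP-only prefix built so far; its length is the column
    i = 0        # index of the next unexpanded character
    while i < len(line) and len(prefix) < indent and line[i] in " \t":
        if line[i] == "\t":
            prefix += " " * (tab_stop - len(prefix) % tab_stop)  # fill to next tab stop
        else:
            prefix += " "
        i += 1
    # Step 2: remove `indent` SP from the start; step 3: output the rest literally.
    return prefix[indent:] + line[i:]


print(repr(strip_indent_expanding("\t\tcode", 6)))
print(repr(strip_indent_expanding("\t \t X", 4)))
```

For ( HT , HT , “code” ) with an indent of 6 this gives two SP followed by “code”, as in the example above. For “^I␣^I␣X” with an indent of 4 it gives “␣^I␣X”: only the first HT is expanded and the second is preserved. If a line has less indentation than `indent`, this sketch simply removes all of it.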

The spec does say:

However, in contexts where indentation is significant for the document’s structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.

That was the cheap change I made after changing the parsers to preserve tabs. As I’ve said, I think more needs to be done; I just haven’t had time to think through the changes that are needed.

As for whether my example should give you a code block with two leading spaces or with none: I think there’s something to be said on each side, and it doesn’t matter hugely how the issue is resolved. The wording above suggests it should be resolved in favor of “two leading spaces,” which is not what the reference implementations currently do. But I’m a bit uncomfortable with the fact that the code block contains leading spaces when there are none in the input.

I think a strong argument in favor of having ( SP , SP , “code” ) as the code block content line is:

  • it is what one would expect visually (from looking at a printout/editor screen, of course using 4 column-tabs);
  • which is another way of saying: it is the result one would expect if tabs had been expanded from the start.
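The second point can be checked with Python’s built-in `str.expandtabs`, again assuming a tab stop of 4 and the code block indentation of 6 from the example under discussion. Note that `expandtabs` expands every tab in the line, not just those in the indentation, so this is a check of the expected result rather than a way to preserve tabs.

```python
INDENT = 6  # assumption: the code block indentation from the example under discussion

line = "\t\tcode"
expanded = line.expandtabs(4)  # every HT replaced by SP up to the next 4-column tab stop
content = expanded[INDENT:]    # drop the code block's indentation

print(repr(expanded))
print(repr(content))
```

The content line comes out as two SP followed by “code”.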

Note that this has no influence on your Makefile tutorial example: whether the Makefile code block lines start with two SP or no HT is equally bad, and arranging the input text to avoid this is trivial: indent (by whatever means) in the editor to the start position of the code block, and then hit TAB on the keyboard. (Alas, the editor could still mangle the lines …). An even simpler approach is to use TAB for indenting the code block throughout; that should be safe, don’t you think?