Examples 5, 6, and 7

aeslaughter · August 17, 2017, 11:13pm

Why does the second tab become two spaces in the HTML output of Examples 5, 6, and 7? I can’t find any discussion of this in the surrounding text. The beginning of the section has the following passage, which says that tabs are equivalent to 4 spaces for indentation:

Tabs in lines are not expanded to spaces. However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.

jgm · August 17, 2017, 11:19pm

I’ll just answer for Example 5, since it’s the same reason in each case. The two tabs at the beginning of the line are treated just as if they were 8 spaces. You might then ask, why does a word indented 8 spaces in this context turn into a code block starting with two spaces? Answer: see the spec for lists and list items. Two spaces are eaten up to line up with the text at the beginning of the list item (foo), then four are eaten up as indentation for an indented code block, and that leaves two spaces, which belong in the code block’s contents.
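
The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not code from any real parser; `indent_width` and the variable names are made up for the example, and the list item offset of 2 assumes the `- foo` list item from Example 5.

```python
TAB_STOP = 4

def indent_width(line: str, start_col: int = 0) -> int:
    """Count the columns of leading whitespace, advancing each tab to the next tab stop."""
    col = start_col
    for ch in line:
        if ch == "\t":
            col += TAB_STOP - (col % TAB_STOP)
        elif ch == " ":
            col += 1
        else:
            break
    return col - start_col

width = indent_width("\t\tbar")   # two tabs starting at column 0 -> 8 columns
list_content_offset = 2           # columns taken by "- " before "foo"
code_indent = 4                   # indentation of an indented code block
leftover = width - list_content_offset - code_indent
print(leftover)  # 2
```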

bhelyer · August 22, 2017, 9:53am

Recently wrote a CommonMark parser, and this confused me at first too. I wrote a comment in our implementation that might be helpful:

/*!
 * Represents a line of input.
 *
 * CommonMark treats tabs as tabs, with a tab stop every 4 columns.
 * The tab stop is measured with regard to the whole line,
 * even parts that a given parsing function won't see.
 *
 * Consider the following, a >, two tabs, and the string "content".
 *
 *      >		content
 *
 * The rules for blockquotes state that a block quote is a '>', an
 * optional space, and then the rest of the line as its content.
 *
 * The rules for indented code blocks state that an indented codeblock
 * is four spaces of indentation, and then the content of the codeblock.
 * If the rule was "tabs are expanded to four spaces", the above would be
 * trivial.
 *
 *     ><tab><tab>content
 *
 * Becomes (where '.' is space)
 *
 *     >........content
 *
 * The '> ' is consumed when the blockquote is parsed, and then
 *
 *     .......content
 *
 * Is parsed as a code block, with the result being
 *
 *     ...content
 *
 * A codeblock with content indented by three spaces.
 * But that is not how CommonMark handles tabs.
 *
 * Tabs are expanded to spaces only where they have to be (when you're
 * removing leading whitespace, as above, for example), and they expand
 * as tab stops: each tab advances to the next multiple of 4 columns,
 * counted over the <entire line>. So (given that | is invisible and marks
 * every fourth column):
 *
 *     ><tab>|<tab>|cont|ent
 *
 * Becomes (where '.' is space)
 *
 *     >...|<tab>|cont|ent
 *
 * The '> ' is removed, as before (the text in parens has been removed,
 * but we still need to count it so as not to break tab stops).
 *
 *     (>.)..|<tab>|cont|ent
 *
 * Then the codeblock parser needs to remove four spaces of indentation, and as
 * we only have two spaces in front, we need to expand the tab again.
 *
 *     (>.)..|....|cont|ent
 *
 * And remove four spaces.
 *
 *     (>...)|(..)..|cont|ent
 *
 * And that's how '>		content' becomes '  content' in a quoted codeblock. Simple!
 *
 * Anyway, we need to be able to look at the entire string to remove leading whitespace,
 * but the following parsing functions need to see what's been removed.
 * The expansion of tabs occurs in place.
 */
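
The column-based stripping described in the comment above could look roughly like this in Python. It is a minimal sketch, not bhelyer's implementation: `strip_columns` is a hypothetical helper, and it assumes the caller tracks the absolute column at which the remaining text starts.

```python
from typing import Tuple

TAB_STOP = 4

def strip_columns(text: str, col: int, count: int) -> Tuple[str, int]:
    """Remove up to `count` columns of leading whitespace from `text`.

    `col` is the absolute column where `text` starts in the original line.
    Returns the remaining text and its new starting column. A tab that is
    only partly consumed is replaced by the spaces that are left over.
    """
    i = 0
    while count > 0 and i < len(text):
        ch = text[i]
        if ch == " ":
            col += 1
            count -= 1
            i += 1
        elif ch == "\t":
            width = TAB_STOP - (col % TAB_STOP)  # columns to the next tab stop
            if width <= count:
                col += width
                count -= width
                i += 1
            else:
                # Only part of the tab is consumed; keep the rest as spaces.
                col += count
                return " " * (width - count) + text[i + 1:], col
        else:
            break
    return text[i:], col

# '>' sits in column 0, so the rest of the line starts at column 1.
after_marker, col = strip_columns("\t\tcontent", 1, 1)  # optional space after '>'
code_line, col = strip_columns(after_marker, col, 4)     # code block indentation
print(repr(code_line))  # '  content'
```

Passing the starting column along with the text is what lets each step honor tab stops that were set by parts of the line it no longer sees.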