Proposed change in tab handling

I wanted to get feedback on a proposed change in tab handling.

Currently the spec says that parsing is preceded by conversion of tabs to spaces, with a 4-space tab stop. This mirrors the behavior of most existing implementations, and allows the parser to ignore complications arising from a mix of tabs and spaces. However, it has a serious drawback: you cannot include literal tabs in code blocks or code spans.

Surely there are cases where you need to put literal tabs in code blocks. If you’re writing a tutorial on Makefiles, for example, you need to give code samples with tabs, or they won’t work when copied and pasted.

In this thread @flying_sheep pointed out that John Gruber’s original Markdown syntax description does not call for destructive tab-to-space conversion. Here is what it says about tabs:

  • start and end tags in HTML blocks should not be indented by “tabs or spaces”
  • “a line containing nothing but spaces or tabs is considered blank.”
  • “Normal paragraphs should not be indented with spaces or tabs.”
  • “List markers must be followed by one or more spaces or a tab.”
  • “List items may consist of multiple paragraphs. Each subsequent paragraph in a list item must be indented by either 4 spaces or one tab.”
  • “To put a code block within a list item, the code block needs to be indented twice—8 spaces or two tabs:”
  • “To produce a code block in Markdown, simply indent every line of the block by at least 4 spaces or 1 tab.”
  • A label in a link reference definition should be “followed by one or more spaces (or tabs)”
  • “You can put the title attribute on the next line and use extra spaces or tabs for padding”

So, the following change would be perfectly consistent with Gruber’s syntax description (maybe even more consistent than the current behavior, since his description never says that tabs in code blocks should be turned into spaces).

Proposal:

  • Remove the part of the spec that calls for tab-to-space preprocessing.
  • Replace it with language that says that for purposes of determining block structure (indentation etc.), tabs are to be treated as if they had been expanded to spaces with a 4-space tab stop.
  • Change test cases appropriately, and include test cases that show tabs being preserved in code blocks and spans.
  • Modify reference implementations accordingly.
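
To make the difference concrete, here is a minimal Python sketch. It is not the cmark or commonmark.js code; the function names are made up for illustration, and it assumes the line is already known to be indented by at least four columns. It contrasts the current detab-first behavior with the proposed column-based rule on a Makefile recipe line inside an indented code block:

```python
TAB_STOP = 4      # tab stop named in the proposal
CODE_INDENT = 4   # columns of indentation that open an indented code block


def old_code_line(line: str) -> str:
    """Current spec: expand every tab first, then strip four spaces."""
    detabbed = line.expandtabs(TAB_STOP)
    return detabbed[CODE_INDENT:]


def new_code_line(line: str) -> str:
    """Proposal: measure indentation in columns, leave other tabs alone."""
    column = 0
    offset = 0
    while column < CODE_INDENT and offset < len(line) and line[offset] in " \t":
        if line[offset] == "\t":
            column += TAB_STOP - (column % TAB_STOP)  # jump to next tab stop
        else:
            column += 1
        offset += 1
    return line[offset:]


recipe = "\t\tcc -o app main.c"   # hypothetical Makefile recipe line
old_result = old_code_line(recipe)   # "    cc -o app main.c" (tab lost)
new_result = new_code_line(recipe)   # "\tcc -o app main.c"   (tab kept)
```

Because the code-block indent equals the tab stop here, a leading tab never overshoots the four columns being removed, so the sketch does not need to handle a partially consumed tab.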

I have working code, and a lightly revised test/spec.txt, in the no-detab branches of jgm/cmark and jgm/commonmark.js. The new block parser works by keeping track of both a (byte or character) “offset” into the line being parsed, and a virtual “column,” which takes into account tabs. Any syntax that cares about indentation makes reference to the “column.” We only need to keep track of this in parsing the beginnings of lines (which determines block structure), since beyond that tabs are simply preserved.
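
The offset/column idea can be sketched roughly as follows. This is a simplified Python illustration, not the code in the no-detab branches; `LineCursor` and its methods are hypothetical names, and the offset here counts characters rather than bytes:

```python
TAB_STOP = 4  # tab stop named in the proposal


class LineCursor:
    """Tracks a real offset and a virtual column while scanning a line's start."""

    def __init__(self, line: str) -> None:
        self.line = line
        self.offset = 0   # index into the string as stored
        self.column = 0   # position as if tabs had been expanded

    def advance(self) -> None:
        """Consume one character, updating both counters."""
        if self.line[self.offset] == "\t":
            # A tab moves the column to the next multiple of TAB_STOP.
            self.column += TAB_STOP - (self.column % TAB_STOP)
        else:
            self.column += 1
        self.offset += 1

    def skip_indent(self) -> int:
        """Consume leading spaces and tabs; return the indentation in columns."""
        while self.offset < len(self.line) and self.line[self.offset] in " \t":
            self.advance()
        return self.column


cursor = LineCursor("  \tfoo\tbar")
indent = cursor.skip_indent()        # 4 columns: two spaces, then a tab to column 4
rest = cursor.line[cursor.offset:]   # "foo\tbar": the later tab is untouched
```

Block-structure decisions would consult `indent` (the column), while the text handed on to later stages is sliced by `offset`, so tabs after the indentation survive unchanged.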

Advantages:

The primary advantage is that we can have tabs in code blocks.

An additional advantage is that this change removes the need for a detabbing pass in the parser. This makes the parsers more efficient, because UTF-8-aware detabbing is fairly expensive. (Taking this out gave cmark a significant speed boost, around 15%.) But that is not the main motivation for the proposed change.

Possible drawbacks:

The only real worry I’ve heard expressed so far is that code blocks containing tabs may show up differently in browsers than they did before. This is because browsers default to an 8-space tab stop, whereas the old preprocessing expanded tabs with a 4-space tab stop. The tab width is configurable via CSS (the tab-size property), but not yet in a way that works in all browsers.

I am not too worried about this, because it will only affect people who do both of the following:

  • use tabs in their code
  • configure their editors for a 4-space tab stop

People who use tabs in their code and configure their editors for an 8-space tab stop (e.g. many C programmers) will find the new behavior better. People who don’t use tabs in code will see no change.
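
As a rough illustration of the size of the effect, Python's built-in `str.expandtabs` can stand in for what an editor or browser does when displaying a tab-indented line (the sample line is made up):

```python
line = "\t\treturn 0;"             # two levels of tab indentation

editor_view = line.expandtabs(4)   # author's editor, 4-space tab stop
browser_view = line.expandtabs(8)  # browser default, 8-space tab stop

# Width of the leading indentation in each rendering.
editor_indent = len(editor_view) - len(editor_view.lstrip(" "))
browser_indent = len(browser_view) - len(browser_view.lstrip(" "))
```

Here `editor_indent` is 8 and `browser_indent` is 16: the code is intact, but each indentation level appears twice as wide in the browser as it did in the author's editor.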

Perhaps, though, there are other drawbacks I haven’t considered.

The downside seems minimal. Errant tabs in input are annoying when you do not want them, since they are yet more invisible, magical characters. So I would count it as a mild drawback that we are no longer cleaning tabs out.

I have pushed a minimal change to the spec that implements this, along with changes to the reference implementations. I think quite a bit more work will be needed on both to get all the details right, but this seems like the right way to go.

While tabs can be annoying, preserving them is safer. We cannot know if all users will want or intend for the tabs to be removed. Some text editors will convert tabs to spaces; if that is the intended behaviour for a particular use case then there is still a clean tabs-to-spaces solution, just at the editor level instead of the parser level.