The width of characters in the tab expansion

lifthrasiir · September 4, 2014, 3:01am

The current tab expansion process always treats every character as having the same width as a space. Unfortunately, this is quite bad for non-English uses of Markdown: for example, CJK environments have traditionally treated CJK characters as double-width, i.e. they take two columns even in monospace fonts. (See UAX #11 for Unicode’s approach to this issue.)

The core specification is less affected by this issue (since there is no syntax requiring an arbitrary character before the indentation), but it will greatly affect two use cases:

Any additional syntax built upon the core specification will be less usable for non-English writers.
Code blocks will still be affected.

I hereby propose some “undefined behavior” (or rather, some latitude for implementations) in the tab expansion:

Every line is composed of N ASCII (U+0000 to U+007F) characters (N >= 0) followed by an optional sequence of a non-ASCII character and any other characters.
Any tab character that is the M-th character in the line (counting from zero), with 0 <= M < N, should be replaced with (4 - M mod 4) space characters.
Any tab character that is not among the first N characters should still be replaced with 1, 2, 3 or 4 space characters, but the exact number of spaces is up to the implementation. The number of spaces should be a function of the preceding characters and nothing else.
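
To make the rules above concrete, here is a minimal Python sketch of one expansion an implementation could choose. It is only an illustration under assumptions: the names `char_width` and `expand_tabs` are hypothetical, the width heuristic (East Asian Width for wide characters, zero width for combining marks and format characters) is just one possible choice, and M is read as the column in the already-expanded output rather than the raw character index.

```python
import unicodedata

TAB_STOP = 4  # tab stop width used in the rules above


def char_width(ch: str) -> int:
    """Approximate column width of one code point (a heuristic, not a standard)."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # combining marks and format characters such as U+200C
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters per UAX #11
    return 1  # everything else, including all ASCII characters


def expand_tabs(line: str, width=char_width) -> str:
    """Replace each tab with 1 to 4 spaces, based only on preceding characters."""
    out = []
    col = 0  # assumption: column in the expanded output so far
    for ch in line:
        if ch == "\t":
            n = TAB_STOP - col % TAB_STOP  # always between 1 and 4
            out.append(" " * n)
            col += n
        else:
            out.append(ch)
            col += width(ch)
    return "".join(out)
```

Because every ASCII character gets width 1 here, the ASCII prefix is expanded exactly as the second rule requires; only what follows the first non-ASCII character depends on the heuristic. Passing `width=lambda ch: 1` gives the "every character is one column" behavior, which the third rule also permits.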

This opens the possibility for implementations to use the proper East Asian Width or other heuristics for the tab expansion, while still allowing them not to handle such cases at all.

(I do think that a future specification may have a profile that mandates the use of East Asian Width or similar, but that would be too much at the moment.)

uranusjr · September 26, 2014, 1:18am

This affects not only CJK characters, but all characters that don’t logically occupy the same width as ASCII characters. The CommonMark spec is pretty vague on this aspect. The intention of the behaviour is, IMO, to retain the appearance of documents after tabs are expanded.

The current implementation (at least the JavaScript one) treats zero-width and combining characters the same way as ASCII (and all other Unicode) characters:

t	ASCII character

‌	Zero-width non-joiner

é	Non-ASCII character

oͭ	ASCII character with combining character

produces

<p>t   test</p>
<p>‌   test</p>
<p>é   test</p>
<p>oͭ  test</p>
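
For reference, the output above is what you get if every code point counts as exactly one column. A minimal Python sketch of that counting (not the actual JavaScript implementation; the function name is hypothetical, and I am assuming the é is the precomposed U+00E9):

```python
TAB_STOP = 4


def expand_tabs_by_code_point(line: str) -> str:
    """Expand tabs to the next multiple of 4, counting each code point as one column."""
    out = []
    col = 0
    for ch in line:
        if ch == "\t":
            n = TAB_STOP - col % TAB_STOP
            out.append(" " * n)
            col += n
        else:
            out.append(ch)
            col += 1  # zero-width and combining characters still count as 1
    return "".join(out)
```

With this counting, `t`, the zero-width non-joiner and `é` each advance one column and get three spaces, while `o` plus U+036D advances two columns and gets only two spaces, even though it is displayed as a single character.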

I would really like the spec to address this more clearly, either by providing a clearer definition, adding more tests, or just declaring it implementation-defined as proposed above. But there should be something.

mb21 · September 26, 2014, 12:40pm

I guess most people (especially programmers) recommend not using tabs at all. For example, you can set up your editor to enter four spaces when you press the tab key. That’s probably why it’s not seen as too important an issue.

uranusjr · September 26, 2014, 1:01pm

Then they could have just omitted that part completely. HTML doesn’t care a lick about tabs and spaces, and by including this (mandatory!) tab expansion thing they created problems for people trying to adopt their specification, while at the same time making the feature a lot less useful than it should be. Tab expansion is good, but CommonMark is not making it worth the effort of implementing because the specification is so flawed.