The current tab expansion process treats every character as having the same width as a space. Unfortunately, this works poorly for non-English uses of Markdown: for example, CJK environments have traditionally treated CJK characters as double-width, i.e. they take two columns even in monospace fonts. (See UAX #11 for Unicode’s approach to this issue.)
The core specification is less affected by this issue (since there is no syntax requiring an arbitrary character before the indentation), but it will greatly affect two use cases:
- Any additional syntax built upon the core specification will be less usable for non-English writers.
- Code blocks will still be affected.
I hereby propose the following “undefined behavior” (or rather, a right granted to implementations) for tab expansion:
- Every line is composed of N ASCII (U+0000 to U+007F) characters (N >= 0), optionally followed by a non-ASCII character and any further characters.
- Any tab character that is the M-th character in the line, where 0 <= M < N, should be replaced with (4 - M mod 4) space characters.
- Any tab character that is not among the first N characters should still be replaced with 1, 2, 3 or 4 space characters, but the exact number of spaces is up to the implementation. The number of spaces should be a function of the preceding characters and nothing else.
This opens the possibility for implementations to use the proper East Asian Width or other heuristics for tab expansion, while still allowing them to ignore character width altogether.
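To make the rules above concrete, here is a minimal Python sketch of one way an implementation might follow them. It is only an illustration, not part of the proposal: the names `expand_tabs`, `one_column` and `eaw_width` are hypothetical, and I am assuming that M is read as the column reached after earlier tabs on the line have been expanded, which equals the character index when no earlier tab is present. ASCII characters always count as one column; the width of non-ASCII characters is delegated to a pluggable function, which is the part left up to the implementation.

```python
import unicodedata

TAB_STOP = 4


def one_column(ch: str) -> int:
    """Width-unaware option: every non-ASCII character takes one column."""
    return 1


def eaw_width(ch: str) -> int:
    """Width-aware option: Wide (W) and Fullwidth (F) characters, as
    classified by UAX #11, take two columns; everything else takes one."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def expand_tabs(line: str, width=one_column) -> str:
    """Replace each tab with 1 to 4 spaces, advancing to the next tab stop.

    Assumption: the column is tracked after expansion of earlier tabs.
    The number of spaces depends only on the characters before the tab.
    """
    out = []
    column = 0
    for ch in line:
        if ch == "\t":
            spaces = TAB_STOP - column % TAB_STOP  # always in 1..4
            out.append(" " * spaces)
            column += spaces
        else:
            out.append(ch)
            # ASCII is fixed at one column; non-ASCII is the implementation's choice.
            column += 1 if ord(ch) < 0x80 else width(ch)
    return "".join(out)
```

With this sketch, `expand_tabs("a\tb")` gives the same result under either width function, because the tab sits inside the ASCII prefix. For `"日\tx"` the two options differ: `one_column` inserts three spaces, while `eaw_width` inserts two, and both are permitted by the third rule.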
(I do think that a future specification may have a profile mandating the use of East Asian Width or similar, but that would be too much at the moment.)