Support for full-width formatting characters

obskyr · July 11, 2017, 1:48pm

Proposal

I would like to propose supporting full-width punctuation for formatting (e.g. [] ⇔ ［］, etc.). The rationale for this is twofold:

1. It makes CommonMark-formatted CJK text more natural and doesn’t break character alignment (＊漢字＊です vs. *漢字*です).
2. Full-width characters are easier to type with an IME, making the composition of CommonMark-formatted CJK text easier.

With the proposed ruby tag extension, this becomes especially important, as ruby tags are very likely to be used within blocks of CJK text. Personally, I think reducing friction when writing CommonMark in CJK languages (along with localization efforts for the spec, which I’ve seen some work on) could even lead to increased adoption in East Asian communities, which is a utopian future I’d like to strive for.

Example

＃ タイトル

［コモンマークのホームページ］（https://commonmark.org/）を見て。＊特異＊な仕様だけど、僕は＊＊重要＊＊だと思う。

＊ スパム
＊ 卵
＊ ベイクドビーンズ

１． スパム
２． 卵
３． ベイクドビーンズ

Characters

Following is an exhaustive list of alternate characters (collected by hand; tell me if I missed any). Characters marked with a dagger (†) were not immediately accessible in IMEs in my tests (Microsoft’s and Google’s Japanese IMEs) – they may still be included for the natural look and alignment issue, but are less important and may end up languishing in obscurity if included.

(space) ⇔ 　 (full-width space, U+3000 IDEOGRAPHIC SPACE)
# ⇔ ＃
= ⇔ ＝
[ and ] ⇔ ［ and ］
( and ) ⇔ （ and ）
* ⇔ ＊
_ ⇔ ＿
- ⇔ －
~ ⇔ ～
1234567890 ⇔ １２３４５６７８９０
. ⇔ ． (note: not 。, as that’s a different character – pressing 1 and subsequently . with an IME usually results in “１．”)
<> ⇔ ＜＞
! ⇔ ！
: ⇔ ：
` ⇔ ｀†
" and ' ⇔ ＂† and ＇†
\ ⇔ ＼†

Of course, this is probably not a small ask – that is quite a list of characters, after all! Since the impact on the current Latin-only user experience is nonexistent, however, I believe it’s worth it.
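
As a rough illustration of how the list above could be represented in a parser, here is a minimal Python sketch. It relies on the fact that the Unicode full-width forms U+FF01 through U+FF5E mirror ASCII U+0021 through U+007E at a fixed offset of 0xFEE0; the ideographic space U+3000 sits outside that block and is handled separately. The names `FULLWIDTH_OFFSET`, `PROPOSED`, `fullwidth_of`, and `ALTERNATES` are hypothetical and not part of any CommonMark implementation.

```python
# Sketch only: derive the full-width alternates proposed above.
FULLWIDTH_OFFSET = 0xFEE0  # U+FF01..U+FF5E mirror ASCII U+0021..U+007E

# ASCII characters from the list above (the space is handled separately).
PROPOSED = "#=[]()*_-~1234567890.<>!:`\"'\\"

def fullwidth_of(ch: str) -> str:
    """Return the full-width alternate of a single ASCII character."""
    if ch == " ":
        return "\u3000"  # IDEOGRAPHIC SPACE is outside the FF01..FF5E block
    if not 0x21 <= ord(ch) <= 0x7E:
        raise ValueError(f"no full-width form for {ch!r}")
    return chr(ord(ch) + FULLWIDTH_OFFSET)

# Mapping from each ASCII character to its proposed full-width alternate.
ALTERNATES = {ch: fullwidth_of(ch) for ch in " " + PROPOSED}
```

Note that this only builds a lookup table; it says nothing about which of the daggered characters should actually be accepted.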

Considerations

The following paragraph of the spec raises a question:

This spec does not specify an encoding; it thinks of lines as composed of characters rather than bytes. A conforming parser may be limited to a certain encoding.

The question is: how to handle parsers that are limited to an encoding that can’t represent full-width characters? This is already technically a problem, since a parser can be limited to an encoding that can’t represent all ASCII characters. Since almost all encodings in practice support ASCII, however, it hasn’t come up. Basically, this line will need to be considered and potentially clarified. I see three possibilities here:

1. Leave the paragraph as-is, and specify full-width characters normally (e.g. “[…] square brackets ([ and ] or ［ and ］)”).

   I don’t recommend this, as it may be interpreted as meaning that any parser confined to an encoding that doesn’t support full-width characters is non-conforming.

2. Specify full-width characters normally and change the paragraph to something along the lines of:

   This spec does not specify an encoding; it thinks of lines as composed of characters rather than bytes. A conforming parser may be limited to a certain encoding. A conforming parser may also refrain from implementing characters that its encoding cannot represent.

   I personally like this solution – it’s simple, it fixes the existing issue, and it’s A-OK to skip full-width characters if (and only if) your encoding doesn’t contain them.

3. Leave the paragraph as-is, and specify that Unicode’s full-width characters are the same characters as the ASCII equivalents.

   Since CommonMark deals in characters rather than bytes, you could consider the full-width characters to be the same characters as the ASCII equivalents. In other words, ［］ are square brackets just as much as [] are, and should be supported. This has the potential to open up a can of worms with regards to other variant glyphs, though, which might be a bit much – I’m fairly certain it wouldn’t be fun if CommonMark had to specify what variant glyphs were valid for… well, every encoding out there.
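
The second possibility could be implemented roughly as follows: a parser limited to one encoding keeps only the delimiter alternates that encoding can represent. This is a sketch under assumptions; `CANDIDATES` is a hypothetical subset of the character list, and `representable` and `supported_delimiters` are illustrative helpers, not spec wording.

```python
# Sketch only: drop delimiter alternates that the parser's encoding cannot represent.
CANDIDATES = ["[", "]", "\uff3b", "\uff3d", "*", "\uff0a"]  # [ ] ［ ］ * ＊

def representable(ch: str, encoding: str) -> bool:
    """True if `ch` can be encoded in `encoding`."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def supported_delimiters(encoding: str) -> list:
    """Delimiters a parser confined to `encoding` would still implement."""
    return [ch for ch in CANDIDATES if representable(ch, encoding)]
```

With `"ascii"` or `"latin-1"` only the three ASCII delimiters survive, while `"utf-8"` keeps all six.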

Pairs should probably be required to match - for example, it’s probably best if ［example]（http://example.com/) doesn’t work.

If the third option above is chosen, this needs to be specified explicitly: other encodings may have their own variant glyphs that need to match.
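
The matching requirement can be sketched as a lookup from each opening character to the single closer it accepts. This is an illustrative fragment, not a link parser; `CLOSER_FOR` and `pair_matches` are hypothetical names.

```python
# Sketch only: an opener is closed only by the closer of the same width.
CLOSER_FOR = {
    "[": "]",
    "\uff3b": "\uff3d",  # ［ ］
    "(": ")",
    "\uff08": "\uff09",  # （ ）
}

def pair_matches(opener: str, closer: str) -> bool:
    """True if `closer` is the one closing character allowed for `opener`."""
    return CLOSER_FOR.get(opener) == closer
```

Under this rule both halves of ［example]（http://example.com/) are rejected, since ［ is closed by ] and （ by ). Whether a full-width ［…］ may be followed by a half-width (…) is a separate question that this sketch leaves open.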

As far as I know, this proposal shouldn’t add considerably to parser complexity, as it doesn’t introduce any new syntactical constructs.

Some of these characters are more important than others. Brackets are especially important for the ruby tag extension, if that is greenlit. Things that aren’t inline, like heading characters and list markers, are a bit less important (as they won’t break alignment in the middle of a paragraph). Personally I don’t see much reason to pick and choose – all of these help CJK text in some way – though there may be issues I haven’t considered.

Some characters may warrant separate discussion: full-width numbers in lists, for example (although I’d say those are fairly important for point 1), or the rarely-used full-width grave accent (｀), quotation marks (＂ and ＇), and backslash (＼) characters. In particular, the quotation marks are very odd and should probably not be included – in Microsoft’s Japanese IME, ＂ is the 10th choice among 16 other full-width quotation marks and would thus probably just lead to confusion. An option there would be to support 「 and 」 as an alternative, although I’m not sure whether CommonMark wants to support any sort of directional quotation marks.

No special treatment is necessary – simply considering full-width characters 1:1 alternatives to the symbols already in place is fine. As the spec stands now, the full-width space should be considered a space and not just Unicode whitespace, as it would otherwise be unusable in things like headings and lists (“[…] must be followed by a space”). If you’d rather not consider the full-width space a “space”, headings and lists (among other things) do need to have their specifications updated to support Unicode whitespace, or full-width spaces won’t be supported there. Worth mentioning is that considering the full-width space a space would bring a few idiosyncrasies with it – for example, full-width spaces would be collapsed just like normal spaces, which may or may not be desirable.
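
To illustrate the distinction: U+3000 is Unicode whitespace (general category Zs), but it is not the space character U+0020 that rules like “must be followed by a space” refer to. The sketch below is a simplified ATX heading check that accepts either marker and either space. It ignores indentation, closing sequences, and empty headings, and it assumes a marker run may not mix widths. `SPACES`, `HASHES`, and `heading_level` are hypothetical names.

```python
import unicodedata

# Sketch only: simplified ATX heading detection with full-width alternates.
SPACES = {" ", "\u3000"}   # space and IDEOGRAPHIC SPACE
HASHES = {"#", "\uff03"}   # # and ＃

def heading_level(line: str) -> int:
    """Return 1-6 for a heading line, or 0 if the line is not a heading."""
    if not line or line[0] not in HASHES:
        return 0
    marker = line[0]
    level = 0
    # Count the run of identical markers (no mixing of widths).
    while level < len(line) and line[level] == marker:
        level += 1
    if level > 6:
        return 0
    # The marker run must be followed by a space of either width.
    if level < len(line) and line[level] in SPACES:
        return level
    return 0

# U+3000 is whitespace by category, yet a different character from U+0020.
IDEOGRAPHIC_SPACE_CATEGORY = unicodedata.category("\u3000")
```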

In closing

Being friendly toward CJK languages may or may not be an important goal of CommonMark. However, whatever the case may be, I can’t come up with a single use case where this would hurt – and several where it’d help.

notriddle · July 11, 2017, 10:00pm

If you’re using the fullwidth characters to type up a link, like ［１２３４５６７８９０］（．．．）, how do you represent the URL part of the link without losing your alignment?

obskyr · July 12, 2017, 7:04am

You don’t. That part’s inevitable. However, you can use any of the 3 types of reference links to preserve alignment.

cben · July 12, 2017, 7:49am

Option 3 could be misread as requiring implementations to normalize full-width characters to ASCII wherever they appear, even when they’re just part of the text. (It depends on the exact wording, but it sounds like it suggests an implementation that first normalizes characters and then parses.)

Option 2 (“may also refrain …”) sounds clearest to me.

obskyr · July 12, 2017, 7:57am

Oh, you’re right, that’s an interpretation I didn’t think about. I suppose you’d have to be particularly clear with the wording of that one. What I meant wasn’t that they’d be normalized at all, but rather that they would be considered the same for formatting and escaping purposes. No character replacement would actually take place.
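
One way to read “considered the same for formatting purposes” is that the parser compares each character against a set of equivalents while leaving the source text untouched. Below is a minimal sketch, assuming a deliberately naive emphasis scanner that only handles single, non-nested ＊…＊ or *…* spans and requires the closer to be the same character as the opener. `STAR_EQUIVALENTS` and `emphasis_spans` are hypothetical names.

```python
# Sketch only: recognise emphasis delimiters of either width without rewriting text.
STAR_EQUIVALENTS = {"*", "\uff0a"}  # * and ＊

def emphasis_spans(text: str) -> list:
    """Return (start, end) index pairs of the content between matching delimiters."""
    spans = []
    i = 0
    while i < len(text):
        if text[i] in STAR_EQUIVALENTS:
            # The closer must be the same character as the opener.
            j = text.find(text[i], i + 1)
            if j != -1:
                spans.append((i + 1, j))
                i = j + 1
                continue
        i += 1
    return spans
```

The input string is never modified, so full-width characters that are just part of the text stay as typed, and a mixed pair such as *a＊ yields no span.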

That option does have the potential to make variant glyphs in general a headache, though. What’s more, it’d take an extra kink or two to satisfy the “must match” requirement – something along the lines of “characters must match within the encoding used and not only as characters”, to prevent things like ~～~ and [］（). That whole thing sounds a bit messy to me, so I also prefer option 2.

mb21 · July 17, 2017, 2:47pm

I’m not sure I understand correctly, but can’t you simply use a monospaced font in your Markdown editor to accomplish that? That’s what a lot of Latin-script users (myself included) do as well.

notriddle · July 17, 2017, 3:00pm

The IME defaults to writing the fullwidth version of the characters.

codinghorror · July 21, 2017, 10:04pm

I concur with @mb21 – why not just specify a monospace font in your CSS stylesheet for the editor?

(fun fact: when Discourse was under alpha development, we originally used a monospace font in the editor, but @sam switched it out for a proportional font – which I agree is the correct choice for an editor not aimed at programmers, primarily.)

obskyr · July 22, 2017, 1:31pm

If only it were that easy… Sadly, that wouldn’t solve any of the issues. There are a couple of reasons why:

1. Very often, you’re in a context where you’re not in control of the font. 90% of the time when I’m writing Markdown/CommonMark, I’m in an online editor with a font set by the website (Stack Exchange, GitHub, reddit, Discourse…). While you could use a userstyle to set the font on each site you use CommonMark on, that’s hardly something you want to require users to do.
2. The IME friction is left. The second point in the OP is to make Markdown easier to type with an IME – using a different font but still ASCII characters leaves you having to switch out of IME mode (or scroll through countless alternatives) to type punctuation, just as it is now.
3. Perhaps most importantly, even if you do happen to be in an environment where you can easily change fonts, full-width and half-width characters aren’t the same widths even in monospace fonts.

In fact, I wrote a script to check: among Windows 10’s default fonts and the rag-tag collection of custom fonts I have installed, not a single one had the same width for kanji as for ASCII characters. This means that it’s impossible to get the same effect as supporting full-width characters, unless you manage to find a custom font with the same width for ASCII as for CJK characters (which I’ve never seen).

There are a couple of fonts where ASCII characters are exactly half the width of CJK characters (in Windows 10, that list consists of MingLiU, SimSun, MS Gothic, and MS Mincho), which at least lets you get back into alignment fairly easily, but is still not optimal. None of the usual monospace fonts I tried (e.g. Consolas, Source Code Pro, Inconsolata…) lined up to half-CJK-width either.

If you assume that you’re in an environment where you can change fonts, and you assume that you’re fine with switching to one of a very limited set of monospace CJK fonts, and you assume that you’re fine with (the relatively benign) misalignment of half a CJK character (a whole load of assumptions, isn’t it), this last point isn’t an issue. Even in that very specific case, though, the previous two points are present as ever.
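
The half-width arithmetic can be illustrated with Unicode’s East Asian Width property as a rough model of such half-width/full-width fonts: wide (W) and full-width (F) characters take two cells, everything else one. This is not the font-measuring script mentioned above, just a sketch of the cell counting; `cell_width` is a hypothetical helper that ignores combining marks and ambiguous-width characters.

```python
import unicodedata

def cell_width(text: str) -> int:
    """Count display cells, treating Wide and Fullwidth characters as two cells."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )
```

In this model ＊漢字＊です occupies 12 cells with every character on the two-cell grid, while *漢字*です occupies 10, and the characters between the two asterisks sit half a CJK character off that grid.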

So… that’s why you can’t just specify a monospace font.

On a more personal note, I can vouch for the impact these problems have in practice. I type quite a bit of CJK Markdown regularly, and due to the issues outlined above it’s never a very pleasant experience – whether on a website or in VS Code or wherever.