Proposal
I would like to propose supporting full-width punctuation for formatting (e.g. [] ⇔ []). The rationale for this is twofold:
- It makes CommonMark-formatted CJK text more natural and doesn’t break character alignment (*漢字*です vs. *漢字*です).
- Full-width characters are easier to type with an IME, making the composition of CommonMark-formatted CJK text easier.
With the proposed ruby tag extension, this becomes especially important, as ruby tags are very likely to be used within blocks of CJK text. Personally, I think reducing friction when writing CommonMark in CJK languages (along with localization efforts for the spec, which I’ve seen some work on) could even lead to increased adoption in East Asian communities, which is a utopian future I’d like to strive for.
Example
# タイトル
[コモンマークのホームページ](https://commonmark.org/)を見て。*特異*な仕様だけど、僕は**重要**だと思う。
* スパム
* 卵
* ベイクドビーンズ
1. スパム
2. 卵
3. ベイクドビーンズ
Characters
The following is meant to be an exhaustive list of alternate characters (collected by hand; tell me if I missed any). Characters marked with a dagger (†) were not immediately accessible in the IMEs I tested (Microsoft’s and Google’s Japanese IMEs) – they could still be included for the sake of natural appearance and alignment, but they are less important and may end up languishing in obscurity if included.
# ⇔ #
= ⇔ =
[ and ] ⇔ [ and ]
( and ) ⇔ ( and )
* ⇔ *
_ ⇔ _
- ⇔ -
~ ⇔ ~
1234567890 ⇔ 1234567890
. ⇔ . (note: not 。, as that’s a different character – pressing 1 and subsequently . with an IME usually results in “1.”)
<> ⇔ <>
! ⇔ !
: ⇔ :
` ⇔ ` †
" and ' ⇔ " † and ' †
\ ⇔ \ †
Of course, this is probably not a small ask – that is quite a list of characters, after all! Since the impact on the current Latin-only user experience is nonexistent, however, I believe it’s worth it.
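To give a feel for how small the mapping is, here is a minimal Python sketch. It relies on the fact that the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 above their ASCII counterparts. The names `MARKUP_ASCII` and `fold_markup` are hypothetical and not taken from any CommonMark implementation, and folding a whole string like this is only an illustration: a real parser would presumably treat the characters as equivalent only where they act as delimiters, leaving ordinary text untouched.

```python
# Hypothetical helper, not part of any existing CommonMark parser.
# Full-width forms U+FF01..U+FF5E are exactly 0xFEE0 above U+0021..U+007E.
FULLWIDTH_OFFSET = 0xFEE0

# The ASCII characters from the list above that take part in CommonMark syntax.
MARKUP_ASCII = "#=[]()*_-~1234567890.<>!:`\"'\\"

# Translation table: full-width code point -> ASCII character.
FULLWIDTH_TO_ASCII = {ord(c) + FULLWIDTH_OFFSET: c for c in MARKUP_ASCII}


def fold_markup(text: str) -> str:
    """Replace full-width markup characters with their ASCII equivalents."""
    return text.translate(FULLWIDTH_TO_ASCII)


if __name__ == "__main__":
    # The ideographic full stop (U+3002) is deliberately not in the table.
    print(fold_markup("\uff11\uff0e \uff0a\u6f22\u5b57\uff0a\u3002"))
```

Characters outside the table, such as kanji or 。, pass through unchanged.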
Considerations
The following paragraph of the spec raises a question:
This spec does not specify an encoding; it thinks of lines as composed of characters rather than bytes. A conforming parser may be limited to a certain encoding.
The question is: how should parsers be handled that are limited to an encoding that can’t represent full-width characters? This is already technically a problem, since a parser can be limited to an encoding that can’t represent all ASCII characters. Since almost all encodings in practice support ASCII, however, it hasn’t come up. Basically, this paragraph will need to be considered and potentially clarified. I see 3 possibilities here:
- Leave the paragraph as-is, and specify full-width characters normally (e.g. “[…] square brackets ([ and ] or [ and ])”).
  I don’t recommend this, as it may be interpreted as meaning that any parser confined to an encoding that doesn’t support full-width characters is non-conforming.
- Specify full-width characters normally and change the paragraph to something along the lines of:
  “This spec does not specify an encoding; it thinks of lines as composed of characters rather than bytes. A conforming parser may be limited to a certain encoding. A conforming parser may also refrain from implementing characters that its encoding cannot represent.”
  I personally like this solution – it’s simple, it fixes the existing issue, and it’s A-OK to skip full-width characters if (and only if) your encoding doesn’t contain them.
- Leave the paragraph as-is, and specify that Unicode’s full-width characters are the same characters as the ASCII equivalents.
  Since CommonMark deals in characters rather than bytes, you could consider the full-width characters to be the same characters as the ASCII equivalents. In other words, [] are square brackets just as much as [] are, and should be supported. This has the potential to open up a can of worms with regard to other variant glyphs, though, which might be a bit much – I’m fairly certain it wouldn’t be fun if CommonMark had to specify what variant glyphs were valid for… well, every encoding out there.
Pairs should probably be required to match – for example, it’s probably best if [example](http://example.com/) doesn’t work.
If the third option above is chosen, this needs to be specified explicitly: other encodings may have their own variant glyphs that need to match.
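A minimal sketch of what “pairs must match” could look like in code, assuming a simple lookup table from each opening delimiter to the single closer it accepts. `CLOSER_FOR` and `pairs_match` are hypothetical names. The sketch checks each pair on its own and takes no position on whether a full-width bracket pair may be followed by an ASCII parenthesis pair, which is left open above.

```python
# Hypothetical table: each opening delimiter maps to the only closer it accepts.
CLOSER_FOR = {
    "[": "]",
    "(": ")",
    "\uff3b": "\uff3d",  # full-width [ closes only with full-width ]
    "\uff08": "\uff09",  # full-width ( closes only with full-width )
}


def pairs_match(opener: str, closer: str) -> bool:
    """True if `closer` is the same-width partner of `opener`."""
    return CLOSER_FOR.get(opener) == closer


if __name__ == "__main__":
    print(pairs_match("[", "]"))            # ASCII pair
    print(pairs_match("\uff3b", "\uff3d"))  # full-width pair
    print(pairs_match("\uff3b", "]"))       # mixed widths
```

An unknown opener has no entry in the table, so it never matches anything.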
As far as I know, this proposal shouldn’t add considerably to parser complexity, as it doesn’t introduce any new syntactical constructs.
Some of these characters are more important than others. Brackets are especially important for the ruby tag extension, if that is greenlit. Things that aren’t inline, like heading characters and list markers, are a bit less important (as they won’t break alignment in the middle of a paragraph). Personally I don’t see much reason to pick and choose – all of these help CJK text in some way – though there may be issues I haven’t considered.
Some characters may warrant separate discussion: full-width numbers in lists, for example (although I’d say those are fairly important for the first point of the rationale), or the rarely used full-width grave accent (`), quotation mark (" and '), and backslash (\) characters. In particular, the quotation marks are very odd and should probably not be included – in Microsoft’s Japanese IME, " is the 10th choice among 16 other full-width quotation marks and would thus probably just lead to confusion. An option there would be to support 「 and 」 as an alternative, although I’m not sure whether CommonMark wants to support any sort of directional quotation marks.
No special treatment is necessary – simply considering full-width characters 1:1 alternatives to the symbols already in place is fine. As the spec stands now, the full-width space (U+3000) should be considered a space and not just Unicode whitespace, as it would otherwise be unusable in things like headings and lists (“[…] must be followed by a space”). If you’d rather not consider the full-width space a “space”, headings and lists (among other things) need to have their specifications updated to accept Unicode whitespace, or full-width spaces won’t be supported there. Worth mentioning is that considering the full-width space a space would bring a few idiosyncrasies with it – for example, full-width spaces would be collapsed just like normal spaces, which may or may not be desirable.
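To make the heading case concrete, here is a toy matcher for ATX heading openers that accepts either an ASCII or a full-width # run, followed by an ASCII or full-width space. It is a simplified sketch, not the spec’s full heading rule (it ignores tabs, leading indentation, closing sequences and empty headings), and `HEADING` and `heading_title` are hypothetical names.

```python
import re

# 1-6 '#' characters (all ASCII or all full-width) followed by at least one
# space. Accepting U+3000 here is the assumption under discussion.
HEADING = re.compile(r"^(?:#{1,6}|\uff03{1,6})[ \u3000]+(?P<title>.*)$")


def heading_title(line: str):
    """Return the heading text, or None if the line is not a heading."""
    m = HEADING.match(line)
    return m.group("title") if m else None


if __name__ == "__main__":
    print(heading_title("# \u30bf\u30a4\u30c8\u30eb"))            # ASCII marker and space
    print(heading_title("\uff03\u3000\u30bf\u30a4\u30c8\u30eb"))  # full-width marker and space
    print(heading_title("\uff03\u30bf\u30a4\u30c8\u30eb"))        # no space after the marker
```

If U+3000 were removed from the character class, the second line would stop being recognized as a heading, which is exactly the usability problem described above.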
In closing
Being friendly toward CJK languages may or may not be an important goal of CommonMark. However, whatever the case may be, I can’t come up with a single use case where this would hurt – and several where it’d help.