Emphasis and East Asian text

(I’m sorry to re-open this particular can of worms — I know emphasis parsing is a real pain.)

A user raised an issue today regarding how emphasis is treated in the context of East Asian text and punctuation. I’m going to paste part of my text from the linked issue here:

The problem here is that this definition of punctuation character only makes sense in the context of the specification if we assume Unicode whitespace is part of the text (as in most Latin alphabet-derived languages); we expect to see `The cat is called "Nodoka".`, but in `猫は「のどか」という。` there is no space or punctuation character separating the 「」 from the surrounding text.

Hence, when we add emphasis (e.g. around “Nodoka”), the English parses but the Japanese does not: we get `The cat is called **"Nodoka"**.` but not `猫は**「のどか」**という。`

With the English text, the opening `**` satisfies the definition of a “left-flanking delimiter run”: it is (a) not followed by Unicode whitespace (`"`), and (b) preceded by Unicode whitespace. The closing `**` satisfies the definition of a “right-flanking delimiter run”: it is (a) not preceded by Unicode whitespace (`"`), and (b) followed by a punctuation character (`.`).

With the Japanese text, however, the opening `**` does not satisfy the definition of a “left-flanking delimiter run”: it is (a) not followed by Unicode whitespace (「), but (b) it is followed by a punctuation character (「), and it is not preceded by Unicode whitespace or a punctuation character (は). Likewise, the closing `**` does not satisfy the definition of a “right-flanking delimiter run”: it is (a) not preceded by Unicode whitespace (」), but (b) it is preceded by a punctuation character (」), and it is not followed by Unicode whitespace or punctuation (と).
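
To make those checks concrete, here is a minimal Python sketch of the two flanking tests as the spec defines them. The `is_space`/`is_punct` helpers are my own simplification: start/end of line counts as whitespace, and “punctuation” is approximated by the Unicode `P*` general categories:

```python
import unicodedata

def is_space(ch):
    # Start/end of line (ch is None) counts as whitespace, per the spec.
    return ch is None or ch.isspace()

def is_punct(ch):
    # Approximate the spec's "punctuation character" with the
    # Unicode general categories P* (Pc, Pd, Pe, Pf, Pi, Po, Ps).
    return ch is not None and unicodedata.category(ch).startswith("P")

def left_flanking(before, after):
    # Not followed by whitespace, AND (not followed by punctuation,
    # OR followed by punctuation and preceded by whitespace/punctuation).
    return not is_space(after) and (
        not is_punct(after) or is_space(before) or is_punct(before))

def right_flanking(before, after):
    # Mirror image of left_flanking.
    return not is_space(before) and (
        not is_punct(before) or is_space(after) or is_punct(after))

# English: opening ** sits between a space and `"`; closing ** between `"` and `.`.
print(left_flanking(" ", '"'))    # True  -> can open emphasis
print(right_flanking('"', "."))   # True  -> can close emphasis

# Japanese: opening ** sits between は and 「; closing ** between 」 and と.
print(left_flanking("は", "「"))   # False -> cannot open emphasis
print(right_flanking("」", "と"))  # False -> cannot close emphasis
```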

tl;dr: East Asian text doesn’t use whitespace inline in sentences. Accordingly, the left-/right-flanking delimiter run definitions are pretty unhelpful because they expect whitespace to help guide interpretation of the author’s intent.

I’m not sure what the correct solution would be, short of hacky ones (e.g. restricting what we consider punctuation to exclude East Asian non-sentence-ending punctuation, such as 「『【《」』】》. Would that even cover more than this specific case, though? I’m not sure.)

kivikakk, thanks for the guidance.

IMHO, East Asian punctuation can be identified by the East_Asian_Width property. If a punctuation character has the property value A, W, F, or H, it would be better treated in an East Asian-specific manner, i.e. without requiring whitespace around it.

With Unicode 9.0.0, I could distinguish 149 such characters, and I found they correspond reasonably well to the characters listed in CLREQ, JLREQ, and KLREQ.
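
For the curious, this check is nearly a one-liner with Python’s `unicodedata` module; a minimal sketch (the helper name `is_east_asian_punct` is mine):

```python
import sys
import unicodedata

EA_WIDE = ("A", "W", "F", "H")  # Ambiguous, Wide, Fullwidth, Halfwidth

def is_east_asian_punct(ch):
    # Punctuation (general category P*) whose East_Asian_Width is
    # Ambiguous, Wide, Fullwidth, or Halfwidth.
    return (unicodedata.category(ch).startswith("P")
            and unicodedata.east_asian_width(ch) in EA_WIDE)

print(is_east_asian_punct("「"))  # True  (East_Asian_Width = W)
print(is_east_asian_punct('"'))  # False (East_Asian_Width = Na)

# With Unicode 9.0.0 data this count should land near the 149 characters
# mentioned above; the exact number depends on the Unicode version your
# Python build ships with.
print(sum(is_east_asian_punct(chr(cp)) for cp in range(sys.maxunicode + 1)))
```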

@jgm, as ever, if you’d like a pull request to the reference implementation to test out this behaviour, I’d be more than happy to put one together.

At first glance, anyway, this seems a reasonable change. It means that in East Asian text there’d be less information available to disambiguate right and left delimiter runs, and hence probably more unintended interpretations. I don’t know what to do about that, really, within the general framework of Markdown, so your solution may be the best there is. So yes, please do go ahead and prepare a PR for this.

Thanks for the quick response.

I made a pull request.

I fear that distinguishing Western and East Asian punctuation marks may not be an ideal solution.

Although this PR works for Japanese and Chinese text (please note that Korean text uses “Western” punctuation marks), it does not solve a related but slightly different issue in Korean text reported here (github/javascript-tutorial, #2040).

Koreans expect `*스크립트(script)*라고` to be rendered as `<em>스크립트(script)</em>라고`. Since Korean text uses “Western” punctuation marks, neither the current CommonMark spec nor this PR renders the above Korean text “correctly.”

This Korean-text issue may be resolved by adding one more condition to @jgm’s simple rule in this comment:

Right flanking:

  • before char is non-space, AND
  • one of the following:
    • before char is EA punctuation or non-punctuation
    • after char is space or punctuation or any EA character,

although it will break nested emphasis more severely (a rough sketch of the modified check is below).
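
A rough Python sketch of what I mean, reusing the simplified `is_space`/`is_punct` helpers from earlier in the thread and treating “EA” as East_Asian_Width A/W/F/H (my reading of the rule, not a tested implementation):

```python
import unicodedata

def is_space(ch):
    return ch is None or ch.isspace()

def is_punct(ch):
    return ch is not None and unicodedata.category(ch).startswith("P")

def is_east_asian(ch):
    # East_Asian_Width in {Ambiguous, Wide, Fullwidth, Halfwidth}.
    return (ch is not None
            and unicodedata.east_asian_width(ch) in ("A", "W", "F", "H"))

def right_flanking_modified(before, after):
    # before char is non-space, AND one of:
    #   - before char is EA punctuation or non-punctuation,
    #   - after char is space, punctuation, or any EA character.
    return not is_space(before) and (
        (is_punct(before) and is_east_asian(before)) or not is_punct(before)
        or is_space(after) or is_punct(after) or is_east_asian(after))

# Korean example: the closing * in *스크립트(script)*라고 sits between ")" and 라.
# ")" is Western punctuation, so the current spec rejects the run, but 라 is
# an East Asian character (East_Asian_Width = W), so the extra condition
# accepts it.
print(right_flanking_modified(")", "라"))  # True -> can close emphasis
```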


By the way, I think a better way to solve CJK-related emphasis issues is to introduce the new syntax `~_`, `_~`, `~*`, and `*~` originally suggested by Prof. John MacFarlane for intra-word emphasis. Though proposed for intra-word emphasis, his suggestion is equally applicable to any CJK-related emphasis issue arising from the lack of whitespace:

```
猫は~**「のどか」**~という。
~**有经验的人总会猜测对手会怎么做。**~这样的话
~*스크립트(script)*~라고
```

It’s been 5 years. Is there any progress?

@rxliuli CommonMark prioritizes compatibility with Markdown, so unless a change is designed to have minimal impact on the existing large corpus of documents written in it, I doubt a solution for CJK will be incorporated into CommonMark.

My sense is that it is difficult, if not impossible, for a single static lightweight syntax (be it CommonMark, AsciiDoc, RST, or whatever) to support languages with such different natural syntaxes (i.e. the syntax of the natural language as written). For example, Markdown’s emphasis markup is designed specifically for languages with space-delimited words. If you change it to work better for CJK, it will work less well for space-delimited languages.

It would make more sense to design a new markup language oriented toward the CJK language group, or perhaps dedicated ones for each of those languages, since, as @barro350 suggests, they have significant differences between them as well. I only know Chinese, so I can only guess.

I’m working on a solution (Plain Text Style Sheets) that makes it easy to declaratively define custom syntaxes or extend existing ones like CommonMark, and I will endeavor to offer an option tailored specifically to CJK languages. I’d need help figuring out what all the needs are, so anyone who considers themselves an expert should feel free to drop me a line.