Emphasis and East Asian text

#1

(I’m sorry to re-open this particular can of worms — I know emphasis parsing is a real pain.)

A user raised an issue today regarding how emphasis is treated in the context of East Asian text and punctuation. I’m going to paste part of my text from the linked issue here:

The problem here is that this definition of punctuation character makes sense in the context of the specification if we assume “Unicode whitespace” is a part of the text used (as with most Latin alphabet-derived languages); we expect to see The cat is called "Nodoka". but not 猫は「のどか」という。, where the latter has no space or punctuation character separating the 「」 from the surrounding text.

Hence, when we add emphasis (e.g. around "Nodoka"), we get: The cat is called **"Nodoka"**. but not 猫は**「のどか」**という。

With the English text, the opening ** satisfies the definition of a “left-flanking delimiter run”: it is (a) not followed by Unicode whitespace ("), and (b) preceded by Unicode whitespace. The closing ** satisfies the definition of a “right-flanking delimiter run”: it is (a) not preceded by Unicode whitespace ("), and (b) followed by a punctuation character (.).

With the Japanese text, however, the opening ** does not satisfy the definition of a “left-flanking delimiter run”: it is (a) not followed by Unicode whitespace, but (b) it is followed by a punctuation character (「), and it is not preceded by Unicode whitespace or a punctuation character (は). Likewise, the closing ** does not satisfy the definition of a “right-flanking delimiter run”: it is (a) not preceded by Unicode whitespace, but (b) it is preceded by a punctuation character (」), and it is not followed by Unicode whitespace or punctuation (と).

tl;dr: East Asian text doesn’t use whitespace inline in sentences. Accordingly, the left-/right-flanking delimiter run definitions are pretty unhelpful because they expect whitespace to help guide interpretation of the author’s intent.
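For concreteness, the flanking checks above can be sketched in Python. This is only an illustration, not the reference implementation: `is_punct` approximates the spec’s “punctuation character” with the Unicode `P*` general categories, and `None` stands in for the start or end of the line:

```python
import unicodedata

def is_punct(ch):
    # Approximates the spec's "punctuation character" using the
    # Unicode general categories Pc, Pd, Pe, Pf, Pi, Po, Ps.
    return ch is not None and unicodedata.category(ch).startswith("P")

def is_space(ch):
    # The beginning and end of a line count as Unicode whitespace.
    return ch is None or ch.isspace()

def left_flanking(before, after):
    # Left-flanking: (1) not followed by whitespace, and (2a) not
    # followed by punctuation, or (2b) followed by punctuation and
    # preceded by whitespace or punctuation.
    return (not is_space(after)) and (
        not is_punct(after) or is_space(before) or is_punct(before))

def right_flanking(before, after):
    # Right-flanking: the mirror image of left_flanking.
    return (not is_space(before)) and (
        not is_punct(before) or is_space(after) or is_punct(after))

# English example: The cat is called **"Nodoka"**.
print(left_flanking(' ', '"'))   # opening ** flanks left
print(right_flanking('"', '.'))  # closing ** flanks right

# Japanese example: 猫は**「のどか」**という。
print(left_flanking('は', '「'))  # opening ** does NOT flank left
print(right_flanking('」', 'と'))  # closing ** does NOT flank right
```

Running this reproduces the behaviour described above: both English delimiters flank correctly, and neither Japanese delimiter does.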

I’m not sure what the correct solution would be, short of hacky ones. (e.g. restricting what we consider punctuation to exclude East Asian non-sentence-ending punctuation, such as 「『【《」』】》 — though would that even cover more than this specific case? I’m not sure.)
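Just to make the hack concrete, here’s a sketch of such a restricted punctuation test (the exclusion set is only the brackets listed above, not a vetted list):

```python
import unicodedata

# Hypothetical tweak: treat CJK bracket punctuation as ordinary
# text, so that e.g. 「 no longer blocks a left-flanking run.
EXCLUDED = set("「『【《」』】》")

def is_punct_restricted(ch):
    # Unicode P* categories, minus the excluded CJK brackets.
    return unicodedata.category(ch).startswith("P") and ch not in EXCLUDED
```

With this, the opening ** in 猫は**「のどか」**という。 is no longer “followed by a punctuation character”, so it would satisfy the left-flanking definition — but, as noted, it’s unclear how far this generalises.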


#2

kivikakk, thanks for the guidance.

IMHO East Asian punctuation can be identified via the East_Asian_Width property. If a punctuation character has the property value A, W, F, or H, it would be better treated in an East Asian-specific manner, i.e. without requiring whitespace around it.

With Unicode 9.0.0, I could identify 149 such characters, and I found they correspond fairly well to the characters listed in CLREQ, JLREQ, and KLREQ.
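For reference, the proposed test is easy to express with Python’s unicodedata module (just an illustration; the exact character count depends on the Unicode version the module was built against, so it may not come out to exactly 149):

```python
import unicodedata

def is_east_asian_punct(ch):
    """Punctuation (general category P*) whose East_Asian_Width is
    A (Ambiguous), W (Wide), F (Fullwidth), or H (Halfwidth)."""
    return (unicodedata.category(ch).startswith("P")
            and unicodedata.east_asian_width(ch) in ("A", "W", "F", "H"))

print(is_east_asian_punct('「'))  # Wide bracket: East Asian punctuation
print(is_east_asian_punct('.'))  # Narrow full stop: not East Asian
```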

(21 Jun update: Adding links to CLREQ and KLREQ.)


#3

@jgm, as ever, if you’d like a pull request to the reference implementation to test out this behaviour, I’d be more than happy to put one together.


#4

At first glance, anyway, this seems a reasonable change. It means that in East Asian text there’d be less information available to disambiguate right and left delimiter runs, and hence probably more unintended interpretations. I don’t know what to do about that, really, within the general framework of Markdown, so your solution may be the best there is. So yes, please do go ahead and prepare a PR for this.


#5

Thanks for the quick response.

I made a pull request.
