Emphasis and East Asian text

#1

(I’m sorry to re-open this particular can of worms — I know emphasis parsing is a real pain.)

A user raised an issue today regarding how emphasis is treated in the context of East Asian text and punctuation. I’m going to paste part of my text from the linked issue here:

The problem here is that this definition of punctuation character makes sense in the context of the specification if we assume “Unicode whitespace” is a part of the text used (as with most Latin alphabet-derived languages); we expect to see The cat is called "Nodoka". but not 猫は「のどか」という。, where the latter has no space or punctuation character separating the 「」 from the surrounding text.

Hence, when we add emphasis (e.g. around "Nodoka"), we get: The cat is called **"Nodoka"**. but not 猫は**「のどか」**という。

With the English text, the opening ** satisfies the definition of a “left-flanking delimiter run”: it is (a) not followed by Unicode whitespace ("), and (b) preceded by Unicode whitespace. The closing ** satisfies the definition of a “right-flanking delimiter run”: it is (a) not preceded by Unicode whitespace ("), and (b) followed by a punctuation character (.).

With the Japanese text, however, the opening ** does not satisfy the definition of a “left-flanking delimiter run”: it is (a) not followed by Unicode whitespace, but (b) it is followed by a punctuation character (「), and it is not preceded by Unicode whitespace or a punctuation character (は). Likewise, the closing ** does not satisfy the definition of a “right-flanking delimiter run”: it is (a) not preceded by Unicode whitespace, but (b) it is preceded by a punctuation character (」), and it is not followed by Unicode whitespace or punctuation (と).

tl;dr: East Asian text doesn’t use whitespace inline in sentences. Accordingly, the left-/right-flanking delimiter run definitions are pretty unhelpful because they expect whitespace to help guide interpretation of the author’s intent.
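For concreteness, the flanking checks above can be sketched in Python. This is only an illustration, not the reference implementation: `is_punct` approximates the spec’s “punctuation character” with the Unicode `P*` general categories, and `None` stands in for the start or end of the line:

```python
import unicodedata

def is_punct(ch):
    # Approximates the spec's "punctuation character" using the
    # Unicode general categories Pc, Pd, Pe, Pf, Pi, Po, Ps.
    return ch is not None and unicodedata.category(ch).startswith("P")

def is_space(ch):
    # The beginning and end of a line count as Unicode whitespace.
    return ch is None or ch.isspace()

def left_flanking(before, after):
    # Left-flanking: (1) not followed by whitespace, and (2a) not
    # followed by punctuation, or (2b) followed by punctuation and
    # preceded by whitespace or punctuation.
    return (not is_space(after)) and (
        not is_punct(after) or is_space(before) or is_punct(before))

def right_flanking(before, after):
    # Right-flanking: the mirror image of left_flanking.
    return (not is_space(before)) and (
        not is_punct(before) or is_space(after) or is_punct(after))

# English example: The cat is called **"Nodoka"**.
print(left_flanking(' ', '"'))   # opening ** flanks left
print(right_flanking('"', '.'))  # closing ** flanks right

# Japanese example: 猫は**「のどか」**という。
print(left_flanking('は', '「'))  # opening ** does NOT flank left
print(right_flanking('」', 'と'))  # closing ** does NOT flank right
```

Running this reproduces the behaviour described above: both English delimiters flank correctly, and neither Japanese delimiter does.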

I’m not sure what the correct solution would be, short of hacky ones. (e.g. restricting what we consider punctuation to exclude East Asian non-sentence-ending punctuation, such as 「『【《」』】》 — though would that even cover more than this specific case? I’m not sure.)
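Just to make the hack concrete, here’s a sketch of such a restricted punctuation test (the exclusion set is only the brackets listed above, not a vetted list):

```python
import unicodedata

# Hypothetical tweak: treat CJK bracket punctuation as ordinary
# text, so that e.g. 「 no longer blocks a left-flanking run.
EXCLUDED = set("「『【《」』】》")

def is_punct_restricted(ch):
    # Unicode P* categories, minus the excluded CJK brackets.
    return unicodedata.category(ch).startswith("P") and ch not in EXCLUDED
```

With this, the opening ** in 猫は**「のどか」**という。 is no longer “followed by a punctuation character”, so it would satisfy the left-flanking definition — but, as noted, it’s unclear how far this generalises.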


#2

kivikakk, thanks for the guidance.

IMHO East Asian punctuation can be identified via the East_Asian_Width property. If a punctuation character has the property value A, W, F, or H, it would be better treated in an East Asian-specific manner, i.e. without requiring whitespace around it.

With Unicode 9.0.0, I could identify 149 such characters, and I found they correspond fairly well to the characters listed in CLREQ, JLREQ, and KLREQ.
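For reference, the proposed test is easy to express with Python’s unicodedata module (just an illustration; the exact character count depends on the Unicode version the module was built against, so it may not come out to exactly 149):

```python
import unicodedata

def is_east_asian_punct(ch):
    """Punctuation (general category P*) whose East_Asian_Width is
    A (Ambiguous), W (Wide), F (Fullwidth), or H (Halfwidth)."""
    return (unicodedata.category(ch).startswith("P")
            and unicodedata.east_asian_width(ch) in ("A", "W", "F", "H"))

print(is_east_asian_punct('「'))  # Wide bracket: East Asian punctuation
print(is_east_asian_punct('.'))  # Narrow full stop: not East Asian
```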

(21 Jun update: Adding links to CLREQ and KLREQ.)


#3

@jgm, as ever, if you’d like a pull request to the reference implementation to test out this behaviour, I’d be more than happy to put one together.


#4

At first glance, anyway, this seems a reasonable change. It means that in East Asian text there’d be less information available to disambiguate right and left delimiter runs, and hence probably more unintended interpretations. I don’t know what to do about that, really, within the general framework of Markdown, so your solution may be the best there is. So yes, please do go ahead and prepare a PR for this.


#5

Thanks for the quick response.

I made a pull request.
