Bold fails in Chinese sentences because Chinese punctuation usage differs from English

I ran into a bold rendering error, which may be caused by Chinese punctuation usage differing from English. Here are some examples:

Error:

**水温适度。**水的温度与室温相同,可以有效的减少对胃肠道的刺激,白开水具有生物活性,可以透过细胞膜促进人体的新陈代谢。

**水温适度。**水的温度与室温相同,可以有效的减少对胃肠道的刺激,白开水具有生物活性,可以透过细胞膜促进人体的新陈代谢。

In Chinese typography, no space is required between a Chinese punctuation mark and the sentence that follows it.

So the following, with a space inserted, does not follow Chinese writing conventions:

**水温适度。** 水的温度与室温相同,可以有效的减少对胃肠道的刺激,白开水具有生物活性,可以透过细胞膜促进人体的新陈代谢。

水温适度。 水的温度与室温相同,可以有效的减少对胃肠道的刺激,白开水具有生物活性,可以透过细胞膜促进人体的新陈代谢。

Correct:

**水温适度。水的温度与室温相同,可以有效的减少对胃肠道的刺激,白开水具有生物活性,可以透过细胞膜促进人体的新陈代谢。**

水温适度。水的温度与室温相同,可以有效的减少对胃肠道的刺激,白开水具有生物活性,可以透过细胞膜促进人体的新陈代谢。

The best solution at present might be to use a zero-width space. I had suggested adding a subsection about it to the spec:

A left-flanking delimiter run is a delimiter run that is
(1) not followed by Unicode whitespace,
and either
(2a) not followed by a punctuation character,
or
(2b) followed by a punctuation character and preceded by Unicode whitespace or a punctuation character. (…)

A right-flanking delimiter run is a delimiter run that is
(1) not preceded by Unicode whitespace,
and either
(2a) not preceded by a punctuation character,
or
(2b) preceded by a punctuation character and followed by Unicode whitespace or a punctuation character. (…)

A punctuation character is
an ASCII punctuation character or
anything in the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po, or Ps.
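
To see how the quoted rules play out on the failing example, here is a minimal Python sketch. It is not taken from any Markdown implementation: the function names are made up for illustration, and it assumes (as the full spec text does) that the start and end of a line count as Unicode whitespace, and that Unicode whitespace means tab, line feed, form feed, carriage return, or anything in category Zs.

```python
import string
import unicodedata


def is_unicode_whitespace(ch: str) -> bool:
    # An empty string stands for start/end of line, which is assumed
    # here to count as whitespace.
    return ch == "" or ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"


def is_punctuation(ch: str) -> bool:
    # ASCII punctuation, or general categories Pc, Pd, Pe, Pf, Pi, Po, Ps.
    if ch == "":
        return False
    if ch in string.punctuation:
        return True
    return unicodedata.category(ch) in {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"}


def flanking(text: str, start: int, end: int) -> tuple:
    """Return (left_flanking, right_flanking) for the run text[start:end]."""
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""
    left = not is_unicode_whitespace(after) and (
        not is_punctuation(after)
        or is_unicode_whitespace(before)
        or is_punctuation(before)
    )
    right = not is_unicode_whitespace(before) and (
        not is_punctuation(before)
        or is_unicode_whitespace(after)
        or is_punctuation(after)
    )
    return left, right


if __name__ == "__main__":
    samples = ["**水温适度。**水的温度", "**水温适度。** 水的温度", "**funny!** As"]
    for sample in samples:
        pos = sample.index("**", 2)  # position of the second (closing) run
        print(repr(sample), flanking(sample, pos, pos + 2))
```

In the first sample the second `**` comes out as left-flanking only, because it is preceded by punctuation (。) and followed by an ordinary character, so it cannot close the bold span. With a space after it, or in the English sample, it is right-flanking and can close.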

Unicode Punctuation P character property (major category)

| Value | Minor category | Count (13.0) | Remarks |
|---|---|---|---|
| Pc | connector | 10 | Includes “_” underscore |
| Pd | dash | 25 | Includes several hyphen characters |
| Ps | open | 75 | Opening bracket characters |
| Pe | close | 73 | Closing bracket characters |
| Pi | initial quote | 12 | Opening quotation mark. Does not include the ASCII “neutral” quotation mark. May behave like Ps or Pe depending on usage |
| Pf | final quote | 10 | Closing quotation mark. May behave like Ps or Pe depending on usage |
| Po | other | 593 | |

Ideographic Full Stop 。 and Comma 、, U+3002 and U+3001, are both Po other.
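
The categories can be checked with Python's standard `unicodedata` module. This is only a quick sketch; `describe` is a helper name invented here, and the result depends on the Unicode version bundled with the Python build.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format a character's code point, Unicode name, and general category."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"


if __name__ == "__main__":
    for ch in "。、":
        print(describe(ch))
```

Both characters should be reported with category Po, which is why the flanking rules treat them like any other punctuation.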

The flanking behavior could respect the Unicode minor category more, but I donʼt think this would help in this particular case.


thx!
I wish there was a way to fix this bug, which is a product flaw that will cause a lot of user dissatisfaction.

Good job! :+1:
BTW: Because of this flaw, blackfriday gives a better experience than goldmark.

Good job!

So a variant with Z-WS could look like this (and the spec really needs more documentation on Z-WS, as in the aforementioned PR):

**水温适度。**​水的温度与室温相同,可以有效的减少对胃肠道的刺激,白开水具有生物活性,可以透过细胞膜促进人体的新陈代谢。

If we want a proper solution, one that remains true to the spirit of Markdown, we have to return to its basic principle:

The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible. The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.

This implies that, from the reader’s perspective, styling of the plain text functions analogously to the corresponding styling of the rich text. Thus while a bold font weight is a good way to highlight words in rich text, wrapping text segments with ** has a similar effect in plain text:

I met **bold** error, may caused by *the usage of 
Chinese punctuation is different from English.* Here
are some examples:

But this only works well for languages like English, where words are whitespace-delimited. It does not work for languages like Chinese, for two reasons:

  1. Words or phrases so marked do not stand out in the plain text:
    水温适度。水的温度与室温相同,*可以*有效的减少对**胃肠道**的刺
    激,白开水具有生物活性,可以透过**细胞膜**促进人体的新陈代谢。
    
  2. We cannot rely on space characters to help determine whether the markup is starting or ending a highlighted segment.

We can address #1 by choosing alternatives to the asterisk *. Unlike English, for CJK it is a disadvantage not an advantage to limit markup symbols to ASCII. While I would defer to native CJK readers on the actual symbol selection (What do CJK writers naturally use in text messages today?), here is the same using alternate emphasis and strong markup symbols:

水温适度。水的温度与室温相同,◆可以◆有效的减少对◼︎胃肠道◼︎的刺
激,白开水具有生物活性,可以透过◼︎细胞膜◼︎促进人体的新陈代谢。

We can solve #2 by choosing different symbols for opening and closing. So perhaps:

水温适度。水的温度与室温相同,┏可以┓有效的减少对┣胃肠道┫的刺
激,白开水具有生物活性,可以透过┣细胞膜┫促进人体的新陈代谢。

notes

  • The new markup symbols would be plain text alternatives to * and _. They would still be rendered in bold or italics or whatever strong and emphasis are mapped to in the output format (e.g. HTML).
  • Having separate but visually symmetrical open and close markup symbols eliminates ambiguities that exist even in space-delimited languages. Such alternate markup symbols would have universal applicability.
  • Symbols could be selected that gracefully degrade under legacy Markdown rendering – for example, CJK readers might feel that the above alternate symbols appearing in the rendered rich text still serve the same purpose as, even if not as ideal as, italic/bold rendering.
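
As a rough illustration of why distinct open and close symbols are easy to process, here is a minimal Python sketch that maps the example symbols above to HTML. No Markdown implementation supports this syntax; the symbol pairs come from the example, and the function name `render_alt_emphasis` is hypothetical.

```python
import html
import re

# Hypothetical mapping taken from the example above:
# ┣…┫ for strong emphasis, ┏…┓ for emphasis.
PAIRS = [("┣", "┫", "strong"), ("┏", "┓", "em")]


def render_alt_emphasis(text: str) -> str:
    """Replace each open/close symbol pair with the matching HTML tag."""
    out = html.escape(text, quote=False)
    for open_ch, close_ch, tag in PAIRS:
        # Non-greedy match between one opening and the next closing symbol.
        pattern = re.compile(re.escape(open_ch) + r"(.+?)" + re.escape(close_ch))
        out = pattern.sub(rf"<{tag}>\1</{tag}>", out)
    return out


if __name__ == "__main__":
    print(render_alt_emphasis("┏可以┓有效的减少对┣胃肠道┫的刺激"))
```

Because the opening and closing symbols differ, the sketch never has to look at neighbouring whitespace or punctuation to decide which role a symbol plays.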

Yeah, you are right!
Quotation marks can be used for emphasis, just like this:

水温适度。水的温度与室温相同,可以有效的减少对「胃肠道」的刺激,「白开水」具有生物活性,可以透过「细胞膜」促进人体的新陈代谢。

OR

水温适度。水的温度与室温相同,可以有效的减少对“胃肠道”的刺激,“白开水”具有生物活性,可以透过“细胞膜”促进人体的新陈代谢。

OR

着重号

But for markdown users, bold text is recommended:

水温适度。水的温度与室温相同,可以有效的减少对胃肠道的刺激,白开水具有生物活性,可以透过细胞膜促进人体的新陈代谢。


The problem I encountered is that bolding a sentence within a paragraph fails under the current spec:

**水温适度。**水的温度与室温相同,可以有效的减少对胃肠道的刺激,白开水
具有生物活性,**可以透过细胞膜促进人体的新陈代谢。**

**水温适度。**水的温度与室温相同,可以有效的减少对胃肠道的刺激,白开水具有生物活性,可以透过细胞膜促进人体的新陈代谢。

If it’s in English, everything is fine.

**It looks a little funny!** As you can see, **the bold sentences
(Including punctuation at the end of the sentence) are NOT rendered.**

It looks a little funny! As you can see, the bold sentences (Including punctuation at the end of the sentence) are NOT rendered.

@keaton, I think you misunderstand my proposal. I’m not saying that new symbols should be used instead of bold. I’m proposing alternate markup symbols, optimized for CJK plain text, that would also render as bold in the output. Thus:

| Language | Plain text (Markdown) | Rich text (e.g. rendered HTML) |
|---|---|---|
| English | `This **word** is highlighted.` | This **word** is highlighted. |
| Chinese | `该┏单词┓被突出显示。` | **单词**被突出显示。 |
| English | `This ┏word┓ is highlighted.` | This **word** is highlighted. |

It is important to understand that * was a choice optimized for languages like English. If Markdown had first been invented in China, for example, some other symbol would most definitely have been chosen.

@vas Sorry that I misunderstood you.
Your proposal is great. But I think the symbols need to be easily typed on the keyboard.

half-width characters

`~ ! @ # $ % ^ & * ( ) -_ =+ [{ ]} \| ;: '" ,< .> /?

full-width characters

* —— “”「」『』【】〖〗〈〉《》

Therefore, there are not many selectable symbols.

My idea is not to change the bold syntax.

Yes, typeability is important.

Without augmenting the grammar, I don’t think there is a solution to your problem. As I said above, Markdown’s *-based scheme works for whitespace-delimited languages (see reason number 2 above). See the spec section on emphasis and strong emphasis, particularly the part about left-flanking and right-flanking delimiter runs and the rules for opening and closing. For example, the following can be supported precisely because of the whitespace between words:

***strong** in emph*
*emph *with emph* in it*
**strong **with strong** in it**

and the following can be handled without ambiguity:

*foo *bar**
*foo **bar** baz*

I’m afraid making changes to the semantics of * to handle your original example above would break Markdown for space-delimited languages. Perhaps there is a brilliant solution that no one thought of over all these years, but I think it unlikely.

(I guess we could come up with an alternate rule set that is invoked only in the context of CJK text? If we could do that, what would those rules be?)

If I am correct, we either have to accept the limitations for CJK (and use workarounds like @haqer1 suggested) or we have to augment Markdown. If the latter, I’d start with my question: If Markdown were invented in China, what symbol or combination of symbols would have been chosen?


Internationally, there is no need to change the current Markdown syntax because of CJK. For example, an article that mixes Chinese and English must still use the standard Markdown syntax. So we have to find another way. What @haqer1 said is feasible.

I don’t know how blackfriday works, but it does not have this problem. There might be a solution in there, but I can’t understand the code.

Well, there is the legacy reduplication of printable ASCII characters compatible with CJK sinograms in the Unicode block Halfwidth and Fullwidth Forms, which, of course, includes U+FF0A ＊ and U+FF3F ＿.

I am a macOS user who has used the system’s default Chinese input method for a long time, and I still don’t know any simple way to input those fullwidth *s and _s.
The only three ways I know to input them are:

  1. search the web for “fullwidth asterisk/underline” and copy/paste;
  2. add “Unicode Hex Input” to my input methods, and use magic numbers to input them;
  3. use a Japanese input method to input them.

None of these ways is fast or convenient, and each one breaks the flow of writing.

And even if inputting them were not a problem for Chinese/Japanese users, what about users of other languages that neither use spaces to separate words nor have fullwidth characters, like Thai…?


Introducing new symbols also causes inconsistency and confusion. All other markup is ASCII-based, so people will always expect halfwidth *s and _s to work, and will never expect that they must use fullwidth *s and _s specifically to mark up Chinese sentences.

Also, I have never seen people use fullwidth *s and _s in Chinese communities. Maybe it is because those two characters are harder to input. Anyway, I don’t think many Chinese users know how to input them, or at least few use them on a daily basis.


By the way, could someone give an example of why nested emphasis is needed? I don’t quite understand the use case for that “delimiter run” thing.


Quick idea:
How about just treating *s and _s that come before and after CJKV characters as left-flanking and right-flanking delimiter runs, respectively?
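
One possible reading of this idea, as a Python sketch only: keep the spec's whitespace condition, but let a neighbouring CJK character satisfy the punctuation condition. The code point ranges in `CJK_RANGES` are an assumption made for illustration (they are neither complete nor taken from the spec), and `flanking_cjk` is a made-up name.

```python
import string
import unicodedata

# Assumed ranges, for illustration only (not exhaustive):
CJK_RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
]


def is_cjk(ch: str) -> bool:
    return ch != "" and any(lo <= ord(ch) <= hi for lo, hi in CJK_RANGES)


def is_ws(ch: str) -> bool:
    # Empty string stands for start/end of line, assumed to count as whitespace.
    return ch == "" or ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"


def is_punct(ch: str) -> bool:
    if ch == "":
        return False
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")


def flanking_cjk(text: str, start: int, end: int) -> tuple:
    """Return (left_flanking, right_flanking) with the CJK relaxation applied."""
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""
    left = not is_ws(after) and (
        is_cjk(after) or not is_punct(after) or is_ws(before) or is_punct(before)
    )
    right = not is_ws(before) and (
        is_cjk(before) or not is_punct(before) or is_ws(after) or is_punct(after)
    )
    return left, right


if __name__ == "__main__":
    text = "**水温适度。**水的温度"
    print(flanking_cjk(text, 7, 9))  # the closing run from the original example
```

Under this reading the second `**` in the original example becomes both left- and right-flanking, so it is at least eligible to close the bold span, while the English example behaves as before. Since ideographs are already ordinary (non-punctuation) characters under the current rules, the practical change is limited to CJK punctuation such as 。 and 、. Whether this interacts badly with the rest of the emphasis algorithm is not something this sketch checks.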