A left-flanking delimiter run is a delimiter run that is
(1) not followed by Unicode whitespace,
and either
(2a) not followed by a punctuation character,
or
(2b) followed by a punctuation character and preceded by Unicode whitespace or a punctuation character. (…)
A right-flanking delimiter run is a delimiter run that is
(1) not preceded by Unicode whitespace,
and either
(2a) not preceded by a punctuation character,
or
(2b) preceded by a punctuation character and followed by Unicode whitespace or a punctuation character. (…)
A punctuation character is
an ASCII punctuation character or
anything in the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po, or Ps.
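To make the quoted definitions easier to follow, here is a minimal Python sketch of them. The function names are my own, not taken from any parser, and I am assuming the spec's conventions that the beginning and end of a line count as Unicode whitespace and that Unicode whitespace means category Zs plus tab, line feed, form feed and carriage return.

```python
import string
import unicodedata

PUNCTUATION_CATEGORIES = {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"}


def is_unicode_whitespace(ch: str) -> bool:
    # Assumption: Zs category, or tab, line feed, form feed, carriage return.
    return ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"


def is_punctuation(ch: str) -> bool:
    # ASCII punctuation, or any of the Unicode P categories listed above.
    return ch in string.punctuation or unicodedata.category(ch) in PUNCTUATION_CATEGORIES


def flanking(text: str, start: int, end: int):
    """Return (left_flanking, right_flanking) for the delimiter run text[start:end]."""
    # Assumption: line boundaries behave like whitespace.
    before = text[start - 1] if start > 0 else " "
    after = text[end] if end < len(text) else " "
    left = not is_unicode_whitespace(after) and (
        not is_punctuation(after)
        or is_unicode_whitespace(before)
        or is_punctuation(before)
    )
    right = not is_unicode_whitespace(before) and (
        not is_punctuation(before)
        or is_unicode_whitespace(after)
        or is_punctuation(after)
    )
    return left, right


sample = "a **b** c"
print(flanking(sample, 2, 4))  # opening run: (True, False)
print(flanking(sample, 5, 7))  # closing run: (False, True)
```

This only classifies a single run. It does not attempt the rest of the emphasis algorithm (matching openers with closers, the extra rules for `_`, and so on).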
Unicode Punctuation (P) character property (major category)

| Value | Minor category | Count (13.0) | Remarks |
|---|---|---|---|
| Pc | connector | 10 | Includes “_” underscore |
| Pd | dash | 25 | Includes several hyphen characters |
| Ps | open | 75 | Opening bracket characters |
| Pe | close | 73 | Closing bracket characters |
| Pi | initial quote | 12 | Opening quotation mark. Does not include the ASCII “neutral” quotation mark. May behave like Ps or Pe depending on usage |
| Pf | final quote | 10 | Closing quotation mark. May behave like Ps or Pe depending on usage |
| Po | other | 593 | |
Ideographic Full Stop 。 and Comma 、, U+3002 and U+3001, are both Po (other).
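The categories are easy to check with Python's standard `unicodedata` module. This is only a quick sketch. The helper name `describe` is mine, and the ideograph 后 is an arbitrary sample character for comparison.

```python
import unicodedata

PUNCTUATION_CATEGORIES = {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"}


def describe(ch: str) -> str:
    """Return code point, Unicode name and general category for one character."""
    cat = unicodedata.category(ch)
    kind = "punctuation" if cat in PUNCTUATION_CATEGORIES else "not punctuation"
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} [{cat}, {kind}]"


# Ideographic full stop, ideographic comma, and an ordinary ideograph.
for ch in "。、后":
    print(describe(ch))
```

Both punctuation marks come out as Po, while the ideograph is Lo (a letter). So a closing `**` that sits between 。 and an ideograph is preceded by punctuation but followed by neither whitespace nor punctuation, which means rule (2b) for right-flanking runs is not satisfied.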
The flanking behavior could respect the Unicode minor category more, but I donʼt think this would help in this particular case.
If we want a proper solution, one that remains true to the spirit of Markdown, we have to return to its basic principle:
The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible. The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.
This implies that, from the reader’s perspective, styling of the plain text functions analogously to the corresponding styling of the rich text. Thus while a bold font weight is a good way to highlight words in rich text, wrapping text segments with ** has a similar effect in plain text:
I met **bold** error, may caused by *the usage of
Chinese punctuation is different from English.* Here
are some examples:
But this only works well for languages like English, where words are delimited by white space. It does not work for languages like Chinese, for two reasons:
1. Words or phrases so marked do not stand out in the plain text:
2. We cannot rely on space characters to help determine whether the markup is starting or ending a highlighted segment.
We can address #1 by choosing alternatives to the asterisk *. Unlike English, for CJK it is a disadvantage not an advantage to limit markup symbols to ASCII. While I would defer to native CJK readers on the actual symbol selection (What do CJK writers naturally use in text messages today?), here is the same using alternate emphasis and strong markup symbols:
The new markup symbols would be plain text alternatives to * and _. They would still be rendered in bold or italics or whatever strong and emphasis are mapped to in the output format (e.g. HTML).
Having separate but visually symmetrical open and close markup symbols eliminates ambiguities that exist even in space-delimited languages. Such alternate markup symbols would have universal applicability.
Symbols could be selected that degrade gracefully under legacy Markdown rendering. For example, CJK readers might feel that the above alternate symbols, appearing in the rendered rich text, still serve the same purpose as italic/bold rendering, even if not as ideally.
@keaton, I think you misunderstand my proposal. I’m not saying that new symbols be used instead of bold. I’m saying provide alternate markup symbols optimized for CJK plain text that will also render as bold in the output. Thus:
| | plain text (Markdown) | rich text (e.g. rendered HTML) |
|---|---|---|
| English | `This **word** is highlighted.` | This **word** is highlighted. |
| Chinese | `该┏单词┓被突出显示。` | 该**单词**被突出显示。 |
| English | `This ┏word┓ is highlighted.` | This **word** is highlighted. |
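To illustrate what I mean, here is a rough Python sketch of the idea as a standalone pre-renderer. It is purely hypothetical: no Markdown implementation supports these symbols, the function name `render_alt_strong` is made up, and ┏ ┓ are just the placeholder symbols from the table, not a recommendation.

```python
import html
import re

# Hypothetical paired delimiters: U+250F opens, U+2513 closes.
ALT_STRONG = re.compile(r"┏([^┏┓]+)┓")


def render_alt_strong(text: str) -> str:
    """Render ┏...┓ spans as <strong> elements; no nesting is handled."""
    # Escape HTML-significant characters first so the source text stays inert.
    escaped = html.escape(text, quote=False)
    return ALT_STRONG.sub(r"<strong>\1</strong>", escaped)


print(render_alt_strong("该┏单词┓被突出显示。"))
# 该<strong>单词</strong>被突出显示。
print(render_alt_strong("This ┏word┓ is highlighted."))
# This <strong>word</strong> is highlighted.
```

Because the opening and closing symbols are distinct, the sketch needs no whitespace or flanking checks at all, which is the point of the proposal.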
It is important to understand that * was a choice optimized for languages like English. If Markdown had first been invented in China, for example, some other symbol would most definitely have been chosen.
Without augmenting the grammar, I don’t think there is a solution to your problem. As I said above, Markdown’s *-based scheme works for whitespace-delimited languages (see reason number 2 above). See the spec for Emphasis and strong emphasis, particularly the part about left-flanking and right-flanking delimiter runs and the rules for opening and closing. For example, the following can be supported precisely because of the whitespace between words:
***strong** in emph*
*emph *with emph* in it*
**strong **with strong** in it**
and the following can be handled without ambiguity:
*foo *bar**
*foo **bar** baz*
I’m afraid making changes to the semantics of * to handle your original example above would break Markdown for space-delimited languages. Perhaps there is a brilliant solution that no one thought of over all these years, but I think it unlikely.
(I guess we could come up with an alternate rule set that is invoked only in the context of CJK text? If we could do that, what would those rules be?)
If I am correct, we either have to accept the limitations for CJK (and use workarounds like the one @haqer1 suggested) or we have to augment Markdown. If the latter, I’d start with my question: If Markdown were invented in China, what symbol or combination of symbols would have been chosen?
In an international context, there is no need to change the current Markdown syntax because of CJK. For example, an article that mixes Chinese and English must use the standard Markdown syntax. So we have to find another way. What @haqer1 said is feasible.
I don’t know how blackfriday works. Using blackfriday will not cause this problem. There might be a solution, but I can’t understand the code.
Well, there is the legacy reduplication of printable ASCII characters compatible with CJK sinograms in the Unicode block Halfwidth and Fullwidth Forms, which, of course, includes U+FF0A ＊ and U+FF3F ＿.
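If a processor wanted to accept those two characters, one possible approach (just a sketch, not something any existing Markdown implementation does as far as I know) would be to fold them to their ASCII counterparts before parsing. The function name below is made up for illustration.

```python
import unicodedata

# Fullwidth forms sit at a fixed offset from their ASCII counterparts.
FULLWIDTH_OFFSET = 0xFEE0

# Map only the two delimiters; other fullwidth punctuation is left alone.
DELIMITER_MAP = {0xFF0A: "*", 0xFF3F: "_"}


def fold_fullwidth_delimiters(text: str) -> str:
    """Replace U+FF0A and U+FF3F with ASCII * and _ respectively."""
    return text.translate(DELIMITER_MAP)


print(fold_fullwidth_delimiters("\uFF0A\uFF0A单词\uFF0A\uFF0A"))  # **单词**
print(chr(0xFF0A - FULLWIDTH_OFFSET))  # *
print(unicodedata.normalize("NFKC", "\uFF0A\uFF3F"))  # *_
```

NFKC normalization produces the same mapping, but it would also rewrite every other compatibility character in the text (fullwidth commas, for instance), so the sketch translates only the two delimiters.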
I am a macOS user who has used the system’s default Chinese input method for a long time, and I still don’t know any simple way to input those fullwidth ＊s and ＿s.
The only three ways I know to input them are:
search for “fullwidth asterisk/underline” on the web and copy/paste;
add “Unicode Hex Input” to my input method, and use magic numbers to input them;
use Japanese input method to input them.
None of these is fast or convenient, and each of them breaks the flow of writing.
And even if inputting them were not a problem for Chinese/Japanese users, what about users of other languages that neither use spaces to separate words nor have fullwidth characters, like Thai…?
Introducing new symbols also causes inconsistency and confusion. All other markup is ASCII-based, so people will always expect halfwidth *s and _s to work, and will never expect that they must use fullwidth ＊s and ＿s specifically to mark up Chinese sentences.
Also, I have never seen people use fullwidth ＊s and ＿s in Chinese communities. Maybe that is because those two characters are harder to input. Anyway, I don’t think many Chinese users are familiar with inputting them, or at least they don’t use them on a daily basis.
By the way, could someone give an example of why nested emphasis is needed? I don’t quite understand the use case for that “delimiter run” thing.
It is quite surprising that zero-width space (ZWSP) is not a whitespace character according to the spec.
Therefore, inserting ZWSP as an HTML entity (such as **super-**&#8203;man) works, but directly inserting ZWSP as a Unicode character (**super-**​man; here, although invisible, ZWSP is inserted between * and m) does not.
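A quick check with Python's `unicodedata` shows why. This assumes the spec's definition of Unicode whitespace (category Zs, plus tab, line feed, form feed and carriage return). The helper name is mine.

```python
import unicodedata

ZWSP = "\u200b"


def is_unicode_whitespace(ch: str) -> bool:
    # Assumption: Zs category, or tab, line feed, form feed, carriage return.
    return ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"


print(unicodedata.category(ZWSP))       # Cf (a format character)
print(is_unicode_whitespace(ZWSP))      # False
print(is_unicode_whitespace("\u3000"))  # True (ideographic space is Zs)
```

ZWSP is in category Cf, so it counts as neither whitespace nor punctuation, and the closing ** after the hyphen is not right-flanking. With the entity form, the character that follows ** in the source is `&`, which is ASCII punctuation, so rule (2b) is satisfied.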
If we augment Markdown as @vas suggested, perhaps combinations of ASCII characters like ~*, *~, ~**, **~, etc. would be better when it comes to typeability.