Single asterisks in subsequent words should not lead to emphasis

Exactly.
In English I’d say: intraword emphasis can only be opened and closed in the same word.
In the language of the spec, I think it might be something along the lines of:

A single * can close emphasis started by a * that is both right and left flanked iff that delimiter run was itself opened by a * that was both right and left flanked.

Or maybe along the lines of point 9

Emphasis begins with a delimiter that can open emphasis and ends with a delimiter that can close emphasis, and that uses the same character (_ or *) as the opening delimiter. There must be a nonempty sequence of inlines between the open delimiter and the closing delimiter; these form the contents of the emphasis inline.

added: if emphasis was opened by a right and left flanked delimiter, no non-word characters (whitespace, punctuation) can occur in between.

The difficulty of phrasing this in the language of the spec may imply that it’s technically difficult (or that I suck at writing spec :slight_smile: )

A single * can close emphasis started by a * that is both right and left flanked iff that delimiter run was itself opened by a * that was both right and left flanked.
added: if emphasis was opened by a right and left flanked delimiter, no non-word characters (whitespace, punctuation) can occur in between.

One fact that seems to be constantly overlooked by markup language specs (see: reStructuredText) is that not every language uses non-word characters to delimit words. Namely, CJK languages do not have word delimiters at all. Intra-word emphasis is essential to them.

It is, however, not uncommon to have Indo-European words in an article written in CJK. An example:

Goose house是由日本一群創作歌手所組成的樂團。

Now if I’m to emphasise Goose house in this sentence (as in the original page), I would need to turn on intra-word emphasis, and write

**Goose house**是由日本一群創作歌手所組成的樂團。

But with the proposed logic, this simply will not work.

In short:

  1. Please do not do this.
  2. When trying to improve usability in your own language, please also try to be considerate to other languages. Not everyone in the world writes like you, and we want to use Markdown, too.

Thank you.


+++ Tzu-ping Chung [Feb 02 15 00:16 ]:

 Now if I’m to emphasise Goose house in this sentence (as in the
 original page), I would need to turn on intra-word emphasis, and write

**Goose house**是由日本一群創作歌手所組成的樂團。

 But with the proposed logic, this simply will not work.

 In short:
  1. Please do not do this.
  2. When trying to improve usability in your own language, please also
     try to be considerate to other languages. Not everyone in the world
     writes like you, and we want to use Markdown, too.

Excellent point. I think this is a compelling argument against the proposed change.

When trying to improve usability in your own language, please also try to be considerate to other languages.

I was trying to, in German, where this behaviour is annoying, while this may not have occurred to English users, who predominate here (and don’t have as many gendered nouns). It’s difficult to consider edge cases in languages you don’t know, but that’s why there’s a proposal.

It seems to me that
**Goose** **house**是由日本一群創作歌手所組成的樂團。
would also do the job for you.

So I don’t find this argument compelling. In your case it looks OK, because CJK characters look very different from the Latin alphabet, so there’s a visual demarcation of sorts. But in the Latin alphabet, “Liebe Professor*innen und Mitarbeiter*innen,” is more readable to a parser than to a human if it’s intended for emphasis. A human would write something like “Liebe Professor*innen* *und* *Mitarbeiter*innen,”.
It’s an edge case in a language written by millions against an edge case in languages written by more than a billion.

I don’t know if there’s been a big discussion, but if CJK scripts handle whitespace very differently (see also the discussion on linebreaks), maybe it makes sense to have a different set of defaults.

+++ Ruben [Feb 02 15 08:41 ]:

I was trying to, in German, where this behaviour is annoying, while this may not have occurred to English users, who predominate here (and don’t have as many gendered nouns). It’s difficult to consider edge cases in languages you don’t know, but that’s why there’s a proposal.

And it’s great to hear about things like this; such comments are very helpful.

It seems to me that
**Goose** **house**是由日本一群創作歌手所組成的樂團。
would also do the job for you.

So I don’t find this argument compelling. In your case it looks OK, because CJK characters look very different from the Latin alphabet, so there’s a visual demarcation of sorts. But in the Latin alphabet, “Liebe Professor*innen und Mitarbeiter*innen,” is more readable to a parser than to a human if it’s intended for emphasis. A human would write something like “Liebe Professor*innen* *und* *Mitarbeiter*innen,”.
It’s an edge case in a language written by millions against an edge case in languages written by more than a billion.

Hm, weren’t you the one advocating resolving issues by voting?

But seriously, I don’t think it’s a matter of numbers. Just like with the line breaks, we have a situation where making something slightly easier for people who write in language or style X would make things impossible for people who write in language or style Y. It’s not very nice to make it impossible for Chinese writers to emphasize foreign phrases in order to make it a bit easier for German writers to write gender-ambiguous nouns.

I don’t know if there’s been a big discussion, but if CJK scripts handle whitespace very differently (see also the discussion on linebreaks), maybe it makes sense to have a different set of defaults.

Currently there is no localization in the spec. One might consider adding it, so that the syntax rules for Chinese locales would differ from, say, German ones.

But I’d really like to avoid this kind of complexity. I think it also has a high potential to confuse people. And what if you have a mixed text, with passages in Chinese and passages in German? Doing this right would require adding “start-Chinese” and “end-Chinese” commands, and now it’s looking like LaTeX.

A related area that probably needs more thought here is languages with right-to-left writing: Explicit RTL indication in pure Markdown


+++ John MacFarlane [Feb 02 15 23:22 ]:

It seems to me that
**Goose** **house**是由日本一群創作歌手所組成的樂團。
would also do the job for you.

Sorry, @Ruben, in my reply to your message I somehow missed this argument completely. @uranusjr, what do you think of this as a solution?

weren’t you the one advocating resolving issues by voting

Believe me, I noticed the irony :smile:

I can understand that MD can’t be all things to all users and that it’s better to allow both usages where necessary. In this case, I thought since it’s edge cases in both languages (but can definitely lead to confusion every once in a while), it might be agreeable to make it more explicit.

Would it maybe also make sense to treat CJK characters more like words? If that is technically possible at all. IANALinguist, but from some of the discussions here, it seems like this would be closer to the way they’re used.

I’d hope this wouldn’t be the solution. It does the job, but feels clumsy and unintuitive. I can’t speak for others, but I would guess many East Asians feel the same. Also, the rendered result (<strong>Goose</strong> <strong>house</strong>) would be semantically different from <strong>Goose house</strong>, wouldn’t it?

I’m thinking, though, since preferences for intra-word asterisks are kind of language-dependent, maybe it would be possible to detect the characters around them to determine how they should be rendered? Something along the lines of: if emphasis was opened by a delimiter flanked by non-CJK word characters on both sides, no non-word characters can occur in between. I’m not sure which languages would prefer which rule, but that’s the basic idea.

Another (probably much simpler) solution is to borrow invisible word delimiters from other markup languages. reStructuredText (sorry for the constant references to it; I’m a Python programmer, and it is my primary markup language) does not support intra-word emphasis at all, but treats \ (a backslash followed by a space) as a delimiter that renders to nothing. So this

中文\ **foo bar**\ 更多

would become

<p>中文<strong>foo bar</strong>更多</p>

which is still a bit clumsy but acceptable. I suspect the change would be well received in any case, though.
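This reST-style invisible delimiter could in principle be bolted onto a Markdown pipeline as a preprocessing pass. A minimal sketch in Python, assuming a hypothetical `strip_invisible_delimiters` helper (a real implementation would also need to treat the stripped positions as word boundaries when computing flanking, which this sketch does not do):

```python
import re

def strip_invisible_delimiters(text: str) -> str:
    """Remove reST-style invisible word delimiters (a backslash
    followed by a space) before the emphasis parser runs."""
    # A backslash immediately followed by a space renders to nothing.
    return re.sub(r'\\ ', '', text)

# '中文\ **foo bar**\ 更多' becomes '中文**foo bar**更多'
print(strip_invisible_delimiters('中文\\ **foo bar**\\ 更多'))
```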


There’s certainly much to be said for the reST solution, which allows you to have a simple rule: emphasis must begin with a delimiter string that is left flanking but not right flanking, and end with one that is right flanking but not left flanking. Exceptions, which are not common in languages that use spaces to separate words, could be handled with the invisible word delimiter. CJK languages would have to use these for nearly all emphasis, though, which might be unnatural.

The main disadvantages are (a) you have to learn a fairly technical device to do intraword emphasis, (b) this looks unnatural on the page, (c) it’s not part of original Markdown and further breaks backward compatibility. A minor disadvantage, perhaps, is that you lose the ability to use a backslash-escaped space for other purposes (pandoc uses it as a nonbreaking space).

Making behavior sensitive to whether the characters around the emphasis are CJK is tempting, but adds a lot of complexity, and I’m not sure it would always be what you want. (Is it universally true that spaces are not used as word separators for CJK?) One way it could be implemented would be by treating CJK characters as between non-CJK word characters and punctuation characters, when it comes to determining flankingness. So, in

中文*foo bar*更多

the first * would be left flanking and the second right flanking, so the special rule governing intraword emphasis wouldn’t kick in, whereas in

Mitarbeiter*innen und Professor*innen

it would, since both *s are both left- and right-flanking. But again, this is making a complex rule even more complex.
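To make the idea concrete, here is a rough Python sketch of flanking computation with CJK ideographs optionally treated as punctuation-like. The names `char_class`, `flanking`, and the `cjk_as_punct` toggle are illustrative, not spec terminology, and the character classification is much cruder than the spec’s (only the base CJK Unified Ideographs block is checked):

```python
import unicodedata

# ASCII punctuation per the spec, plus Unicode P* categories below.
ASCII_PUNCT = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

def char_class(ch, cjk_as_punct=True):
    if ch is None or ch.isspace():
        return 'space'                      # line ends count as whitespace
    if cjk_as_punct and 0x4E00 <= ord(ch) <= 0x9FFF:
        return 'punct'                      # the tweak discussed above
    if ch in ASCII_PUNCT or unicodedata.category(ch).startswith('P'):
        return 'punct'
    return 'word'

def flanking(before, after, cjk_as_punct=True):
    """Return (left_flanking, right_flanking) for a delimiter run with
    the given neighbouring characters, per the spec's definitions."""
    b = char_class(before, cjk_as_punct)
    a = char_class(after, cjk_as_punct)
    left = a != 'space' and (a != 'punct' or b in ('space', 'punct'))
    right = b != 'space' and (b != 'punct' or a in ('space', 'punct'))
    return left, right

# 中文*foo bar*更多: the opening * comes out left-flanking only and the
# closing * right-flanking only, so a "no intraword emphasis for
# both-flanked delimiters" rule would not block it.
print(flanking('文', 'f'))   # (True, False)
print(flanking('r', '更'))   # (False, True)
# Mitarbeiter*innen: both-flanked, so the special rule would apply.
print(flanking('r', 'i'))    # (True, True)
```

With `cjk_as_punct=False` the first case becomes `(True, True)`, which is roughly the current spec behaviour.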


Making behavior sensitive to whether the characters around the emphasis are CJK is tempting, but adds a lot of complexity, and I’m not sure it would always be what you want. (Is it universally true that spaces are not used as word separators for CJK?) One way it could be implemented would be by treating CJK characters as between non-CJK word characters and punctuation characters, when it comes to determining flankingness.

AFAIK spaces are quite commonly used as word separators in contemporary Korean, but extremely rarely in Japanese and Chinese. And since it is also rare to have asterisks in CJK documents (you are more likely to see full-width asterisks ＊), it should not cause many problems (IANALinguist either). Not sure about other languages. I think treating CJK characters as something like “semi-punctuation flankers” would work too, but the main problem here will always be exactly which characters should be put into this category.


CommonMark generates a fairly good result for intra-word emphasis in Chinese so far. :beers:

:heavy_check_mark: 中文**foo bar**更多 => <p>中文<strong>foo bar</strong>更多</p>

However, I found an edge case:

:negative_squared_cross_mark: Chinese 中文**`*asterisks*`** => <p>Chinese 中文**<code>*asterisks*</code>**</p>

http://spec.commonmark.org/dingus/?text=Chinese%20中文**`asterisks%60**%0A%0Aenglish%20**%60asterisks%60**

I consider it a kind of bug, since it’s inconsistent with the behaviour when the text is surrounded by English.

EDIT:

It’s not an edge case at all. It breaks links, too.
https://markdown-it.github.io/#md3={"source"%3A"CommonMark%20建立了[简要的帮助](http%3A%2F%2Fcommonmark.cn%2Fhelp%2F)和**[This%20should%20be%20strong交互式教程](http%3A%2F%2Fcommonmark.cn%2Fhelp%2Ftutorial%2F)**\n"%2C"defaults"%3A{"html"%3Afalse%2C"xhtmlOut"%3Afalse%2C"breaks"%3Afalse%2C"langPrefix"%3A"language-"%2C"linkify"%3Atrue%2C"typographer"%3Atrue%2C"_highlight"%3Atrue%2C"_strict"%3Afalse%2C"_view"%3A"html"}}

This has nothing to do with Chinese. You can use an English example:

english**`*asterisks*`**

Here we don’t get strong emphasis. Why not? Because the first ** is not “left-flanking,” as defined by the spec. (It is preceded by a letter and followed by punctuation, so it’s right-flanking and not left-flanking.)

It’s the same reason

*(*foo*)

gets parsed as

*(<em>foo</em>)

not as

<em>(</em>foo*)

Changing this case would require major revisions in the spec for emphasis, which would probably have bad consequences elsewhere, so I’m not sure what to do.
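For readers following along, the left-flanking test described above can be sketched in Python. This is a simplification of the spec’s definition; `is_punct` approximates “ASCII punctuation character or Unicode class P*”, and `None` stands in for a line boundary:

```python
import unicodedata

ASCII_PUNCT = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

def is_punct(ch):
    return ch is not None and (ch in ASCII_PUNCT
                               or unicodedata.category(ch).startswith('P'))

def is_space(ch):
    return ch is None or ch.isspace()   # line ends count as whitespace

def left_flanking(before, after):
    # Not followed by whitespace, and either not followed by punctuation
    # or preceded by whitespace/punctuation.
    return (not is_space(after) and
            (not is_punct(after) or is_space(before) or is_punct(before)))

# In english**`...`** the opening ** is preceded by 'h' and followed
# by '`' (punctuation), so it is not left-flanking:
print(left_flanking('h', '`'))   # False
# With a space before it, as in english **`...`**, it is:
print(left_flanking(' ', '`'))   # True
```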


Thanks for the details. It takes time for me to read and understand parts of the spec and connect them with my own language.

As a continuation of your example:

english**`*asterisks*`**english

That first ** fails to open emphasis because it is not left-flanking.

english **`*asterisks*`**

This works because the opening ** is left-flanking (preceded by a space) and the closing ** is right-flanking (followed by the end of the line).

Language nature

As mentioned earlier, some languages don’t use blanks as word delimiters (in the linguistic sense; word segmentation for them is a whole class of NLP problems in CS).

It is natural for users to type ASCII punctuation characters (the ones Markdown uses) and CJK characters in a sequence, without any blank in between.

Delimiter run

For left-flanking:

(b) either not followed by a punctuation character, or preceded by Unicode whitespace or a punctuation character.

I would suggest that it may be preceded not only by Unicode whitespace but also by certain sets of Unicode code points: CJK in this case.

For right-flanking:

or followed by Unicode whitespace or a punctuation character.

By the same principle, it could be extended to cover certain sets of Unicode code points.

I hope I used the right words to make sense :worried:

Unicode code point blocks

These characters are described as CJK Unified Ideographs.

Unfortunately, apart from the six blocks of Unified Ideographs, there are also some non-unified CJK characters:

Apart from the six blocks of “Unified Ideographs”, Unicode has about a dozen more blocks with not-unified CJK-characters. These are mainly CJK radicals, strokes, punctuation, marks, symbols and compatibility characters. Although some characters have their (decomposable) counterparts in other blocks, the usages can be different.
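As a concrete illustration of block membership, here is a rough Python check against the Unified Ideograph blocks. The range list is illustrative and deliberately incomplete (it is taken from the Unicode block charts; as noted above, the non-unified blocks complicate the picture):

```python
# Membership test against some CJK Unified Ideograph blocks.
# Illustrative, not exhaustive.
CJK_RANGES = [
    (0x4E00, 0x9FFF),     # CJK Unified Ideographs
    (0x3400, 0x4DBF),     # Extension A
    (0x20000, 0x2A6DF),   # Extension B
    (0x2A700, 0x2B73F),   # Extension C
    (0x2B740, 0x2B81F),   # Extension D
    (0xF900, 0xFAFF),     # CJK Compatibility Ideographs
]

def is_cjk(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

print(is_cjk('中'), is_cjk('a'))   # True False
```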

Random thoughts

This way, not much in the spec needs to change. But it might be a little hard for JS engines to distinguish these code points?

I may be horribly wrong. But those are just my own thoughts.

中文\ **foo bar**\ 更多

Another possibility could be using fancy Unicode characters, like the zero-width non-joiner. The result would be readable both for humans and for the CommonMark parser.

I agree that the current rules make things difficult in languages that don’t use spaces. There are also occasional difficulties like this:

*aaa**bbb**ccc*

which CommonMark parses as emph “aaa”, emph “bbb”, emph “ccc” rather than emph (“aaa”, strong “bbb”, “ccc”). Having some way to force a delimiter to be left- or right-flanking would help here. I believe @tinpot suggested allowing <!> to be used this way:

*aaa<!>**bbb**<!>ccc*

I would like to reopen this discussion as this behavior is annoying.

Often in chat (like Discord) or forums I end up posting something like:

The formula for getting the perimeter of a circle is 2*pi*radius.

Or (this also occurs with double asterisks that represent the “power” operator in languages like Python) :

A googolplex is 10**10**100.

Or even:

The player score calculation is now kills*power - deaths. It was changed from 1.3, where it was kills*assists - deaths.

As you can see, the asterisks break the meaning of the message, with part of the sentence being italicized. If the reader isn’t familiar with Markdown quirks, they very likely won’t get what “killspower” is, or why I seemingly randomly put part of the message in italics. I then have to go back and make sure to backslash every instance of “*”, which is annoying with long posts.

The solution is to require a word boundary on the left of the left asterisk, and on the right of the right asterisk.

For the CJK problem that has been mentioned earlier in this thread, it indeed wouldn’t italicize the “由” character in 是*由*日. I do not speak Chinese, but it seems from http://hua.umf.maine.edu/Chinese/topics/math/douying.html that Chinese does not use the “*” character for multiplication, and as such does not have the ambiguity that English and other Latin-script languages have.

The solution would then be to require that the left asterisk does not have a latin character to its left, and the right asterisk does not have a latin character to its right.

This makes it so:

  • A*B*C is displayed as A*B*C
  • 是*由*日 is displayed as 是由日, with “由” in italics

As I understand it, Markdown is just a way to properly render the pseudo-syntax that was used in emails. But in emails, if a human sees A*B*C, they would parse it as a multiplication and not italics. Markdown parsers should therefore not parse that as italics, in order to be more intuitive.
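The proposed Latin-adjacency rule can be sketched with a regex. `render_em` is a hypothetical helper handling only single-asterisk emphasis, not a real parser, and the pattern is a rough approximation of the rule:

```python
import re

# Proposed rule: a * opens emphasis only if the character to its left
# is not a Latin letter, and closes only if the character to its right
# is not a Latin letter.
LATIN = 'A-Za-z'
EM_PATTERN = re.compile(rf'(?<![{LATIN}])\*([^*\s][^*]*?)\*(?![{LATIN}])')

def render_em(text: str) -> str:
    """Render single-asterisk emphasis under the Latin-adjacency rule."""
    return EM_PATTERN.sub(r'<em>\1</em>', text)

print(render_em('A*B*C'))      # A*B*C  (stays literal)
print(render_em('是*由*日'))    # 是<em>由</em>日
```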

Thanks :slight_smile:


The asterisks and intra-word emphasis issue is definitely frustrating. That said, I would posit that

A×B×C

is better formatting in our Unicode-everywhere (or even UTF-8-everywhere) world?

Sadly, many keyboards do not offer the option to enter the “×” character (unless you learn the alt-code by heart). The asterisk is going to be used in the huge majority of cases.

FWIW, many phones have × available under a long-press * and on Windows 10 you find it easily in the pop-up that appears when you hit Windows key plus dot ., which is primarily used for emoji input.

Indeed. However, it is hidden away and users do not use it; even though most people are on phones, I have never seen anyone use the “×” character.