Single asterisks in subsequent words should not lead to emphasis

I’ve seen a lot of discussion about intra-word emphasis in Markdown. It seems the spec has come down in favour of it when using asterisks.

I haven’t seen these discussions touch on the case of subsequent words each containing a single asterisk/underscore. I’m having trouble finding a use of Markdown where I would want to italicise the end of one word, the words in between, and the beginning of a later word, and I haven’t found a documented instance in the spec either.
Yet, this is handled just like intra-word emphasis by all implementations.

A case where it conflicts with the way my German users write is when gendering nouns. This has become very common in university contexts. Addressing both female and male professors and co-workers in an email, you’d write something like Liebe Professor*innen und Mitarbeiter*innen, (though the same with underscores, and capitalising the i, are also common). When using asterisks, this ends up italicising the innen und Mitarbeiter part in CommonMark.

I made an example in Babelmark2

There are cases where you need intraword emphasis, so it seemed to us that we should not outlaw it altogether. In this case you can backslash-escape the asterisks, though I realize this is awkward.
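For what it’s worth, the escaping can be automated in a preprocessing step; a minimal sketch in Python (the function name is mine, and note it escapes every asterisk, so it only suits text that uses * exclusively for gendered forms):

```python
def escape_gender_asterisks(text: str) -> str:
    """Backslash-escape literal asterisks so CommonMark renders them
    verbatim instead of treating them as emphasis delimiters."""
    return text.replace("*", "\\*")

print(escape_gender_asterisks("Liebe Professor*innen und Mitarbeiter*innen,"))
# The escaped form renders with literal asterisks and no <em> tags.
```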

Oh, but intra-word emphasis is not what I’m talking about (I think the solution of allowing asterisks but not underscores is okay).

I’m talking about one asterisk in one word and another asterisk in a subsequent word. I struggle to think of a case where this should lead to italics.
I also don’t think it would break much, because I’ve never seen people use this.

I think I missed what you were getting at when I replied earlier. I think what you’re proposing is that emphasis beginning with an asterisk delimiter inside a word can only be closed by an asterisk delimiter inside the same word. I haven’t looked into the technical difficulties, but this seems to me a promising suggestion.


Exactly.
In English I’d say: intraword emphasis can only be opened and closed in the same word.
In the language of the spec, I think it might be something along the lines of:

A single * that is both left- and right-flanking can close emphasis iff that emphasis was itself opened by a * that was both left- and right-flanking.

Or maybe along the lines of point 9

Emphasis begins with a delimiter that can open emphasis and ends with a delimiter that can close emphasis, and that uses the same character (_ or *) as the opening delimiter. There must be a nonempty sequence of inlines between the open delimiter and the closing delimiter; these form the contents of the emphasis inline.

added: if emphasis was opened by a left- and right-flanking delimiter, no non-word characters (whitespace, punctuation) can occur in between.

The difficulty of phrasing this in the language of the spec may imply that it’s technically difficult (or that I suck at writing spec :slight_smile: )
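For illustration, the proposed restriction might be checked like this (a rough sketch with my own helper names, assuming the opening delimiter has already been classified as intraword, i.e. both left- and right-flanking):

```python
import unicodedata

# ASCII punctuation plus Unicode P* categories, per the spec's definition.
ASCII_PUNCT = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

def is_word_char(ch: str) -> bool:
    # Neither whitespace nor punctuation.
    return not (ch.isspace() or ch in ASCII_PUNCT
                or unicodedata.category(ch).startswith("P"))

def intraword_close_allowed(text: str, open_idx: int, close_idx: int) -> bool:
    """Proposed rule: emphasis opened by an intraword (both left- and
    right-flanking) '*' may only be closed inside the same word, i.e.
    everything between the two delimiters must be word characters."""
    return all(is_word_char(c) for c in text[open_idx + 1:close_idx])

salutation = "Liebe Professor*innen und Mitarbeiter*innen,"
first, last = salutation.index("*"), salutation.rindex("*")
print(intraword_close_allowed(salutation, first, last))  # False: spaces between
print(intraword_close_allowed("in*cred*ible", 2, 7))     # True: same word
```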

A single * that is both left- and right-flanking can close emphasis iff that emphasis was itself opened by a * that was both left- and right-flanking.
added: if emphasis was opened by a left- and right-flanking delimiter, no non-word characters (whitespace, punctuation) can occur in between.

One fact that seems to be constantly overlooked by markup language specs (see: reStructuredText) is that not every language uses non-word characters to delimit words. Namely, CJK languages do not have word delimiters at all. Intra-word emphasis is essential to them.

It is however not uncommon to have Indo-European words in an article written in a CJK language. An example:

Goose house是由日本一群創作歌手所組成的樂團。

Now if I’m to emphasise Goose house in this sentence (as in the original page), I would need to turn on intra-word emphasis, and write

**Goose house**是由日本一群創作歌手所組成的樂團。

But with the proposed logic, this simply will not work.

In short:

  1. Please do not do this.
  2. When trying to improve usability in your own language, please also try to be considerate to other languages. Not everyone in the world writes like you, and we want to use Markdown, too.

Thank you.


+++ Tzu-ping Chung [Feb 02 15 00:16 ]:

 Now if I’m to emphasise Goose house in this sentence (as in the
 original page), I would need to turn on intra-word emphasis, and write

**Goose house**是由日本一群創作歌手所組成的樂團。

 But with the proposed logic, this simply will not work.

 In short:
  1. Please do not do this.
  2. When trying to improve usability in your own language, please also
     try to be considerate to other languages. Not everyone in the world
     writes like you, and we want to use Markdown, too.

Excellent point. I think this is a compelling argument against the proposed change.

When trying to improve usability in your own language, please also try to be considerate to other languages.

I was trying to, in German, where this behaviour is annoying, while it may not have occurred to the English users who predominate here (and don’t have as many gendered nouns). It’s difficult to consider edge cases in languages you don’t know, but that’s why there’s a proposal.

It seems to me that
**Goose** **house**是由日本一群創作歌手所組成的樂團。
would also do the job for you.

So I don’t find this argument compelling. In your case it looks ok, because CJK characters look very different from the Latin alphabet, so there’s a visual demarcation of sorts. But in the Latin alphabet, this Liebe Professor*innen und Mitarbeiter*innen, is more readable to a parser than to a human if it’s intended for emphasis. A human would write something like this: Liebe Professor*innen* *und* *Mitarbeiter*innen,.
It’s an edge case in a language written by millions against an edge case in languages written by more than a billion.

I don’t know if there’s been a big discussion, but if CJK scripts handle whitespace very differently (see also the discussion on linebreaks), maybe it makes sense to have a different set of defaults.

+++ Ruben [Feb 02 15 08:41 ]:

I was trying to, in German, where this behaviour is annoying, while it may not have occurred to the English users who predominate here (and don’t have as many gendered nouns). It’s difficult to consider edge cases in languages you don’t know, but that’s why there’s a proposal.

And it’s great to hear about things like this; such comments are very helpful.

It seems to me that
**Goose** **house**是由日本一群創作歌手所組成的樂團。
would also do the job for you.

So I don’t find this argument compelling. In your case it looks ok, because CJK characters look very different from the Latin alphabet, so there’s a visual demarcation of sorts. But in the Latin alphabet, this Liebe Professor*innen und Mitarbeiter*innen, is more readable to a parser than to a human if it’s intended for emphasis. A human would write something like this: Liebe Professor*innen* *und* *Mitarbeiter*innen,.
It’s an edge case in a language written by millions against an edge case in languages written by more than a billion.

Hm, weren’t you the one advocating resolving issues by voting?

But seriously, I don’t think it’s a matter of numbers. Just like with the line breaks, we have a situation where making something slightly easier for people who write in language or style X would make things impossible for people who write in language or style Y. It’s not very nice to make it impossible for Chinese writers to emphasize foreign phrases in order to make it a bit easier for German writers to write gender-ambiguous nouns.

I don’t know if there’s been a big discussion, but if CJK scripts handle whitespace very differently (see also the discussion on linebreaks), maybe it makes sense to have a different set of defaults.

Currently there is no localization in the spec. One might consider adding it, so that the syntax rules for Chinese locales were different than, say, German.

But I’d really like to avoid this kind of complexity. I think it also has a high potential to confuse people. And what if you have a mixed text, with passages in Chinese and passages in German? Doing this right would require adding “start-Chinese” and “end-Chinese” commands, and now it’s looking like LaTeX.

A related area that probably needs more thought here is languages with right-to-left writing: Explicit RTL indication in pure Markdown


+++ John MacFarlane [Feb 02 15 23:22 ]:

It seems to me that
**Goose** **house**是由日本一群創作歌手所組成的樂團。
would also do the job for you.

Sorry, @Ruben, in my reply to your message I somehow missed this argument completely. @uranusjr, what do you think of this as a solution?

weren’t you the one advocating resolving issues by voting

Believe me, I noticed the irony :smile:

I can understand that MD can’t be all things to all users and that it’s better to allow both usages where necessary. In this case, I thought since it’s edge cases in both languages (but can definitely lead to confusion every once in a while), it might be agreeable to make it more explicit.

Would it maybe also make sense to treat CJK characters more like words? If that is technically possible at all. IANALinguist, but from some of the discussions here, it seems like this would be closer to the way they’re used.

I’d hope this wouldn’t be the solution. It does the job, but feels clumsy and unintuitive. I can’t speak for others, but I would guess many eastern Asians feel the same. Also the rendered result (<strong>Goose</strong> <strong>house</strong>) would semantically be different from <strong>Goose house</strong>, wouldn’t it?

I’m thinking, though, since preferences for intra-word asterisks are kind of language-dependent, maybe it would be possible to detect the characters around them to determine how they should be rendered? Something along the lines of if emphasis was opened by a delimiter flanked by non-CJK word characters on both sides, no non-word characters can occur in between. I’m not sure which languages would prefer which rule, but that’s the basic idea.

Another (probably much simpler) solution is to borrow invisible word delimiters from other markup languages. reStructuredText (sorry for the constant references to it; I’m a Python programmer, and it is my primary markup language) does not support intra-word emphasis at all, but treats \ (backslash and a space) as a delimiter that renders to nothing. So this

中文\ **foo bar**\ 更多

would become

<p>中文<strong>foo bar</strong>更多</p>

which is still a bit clumsy but acceptable. I’d suspect the change will be well-received in any case though.
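A toy illustration of how such an invisible delimiter could be handled in a preprocessing pass (the function name is mine, not from any implementation):

```python
def split_on_invisible_delims(src: str) -> list[str]:
    """'\\ ' (backslash-space) renders to nothing but marks an explicit
    word boundary, so each segment can then be parsed with the ordinary
    'emphasis starts/ends at a word boundary' rules."""
    return src.split("\\ ")

segments = split_on_invisible_delims("中文\\ **foo bar**\\ 更多")
print(segments)  # ['中文', '**foo bar**', '更多']
```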


There’s certainly much to be said for the reST solution, which allows you to have a simple rule: emphasis must begin with a delimiter string that is left flanking but not right flanking, and end with one that is right flanking but not left flanking. Exceptions, which are not common in languages that use spaces to separate words, could be handled with the invisible word delimiter. CJK languages would have to use these for nearly all emphasis, though, which might be unnatural.

The main disadvantages are (a) you have to learn a fairly technical device to do intraword emphasis, (b) this looks unnatural on the page, (c) it’s not part of original Markdown and further breaks backward compatibility. A minor disadvantage, perhaps, is that you lose the ability to use a backslash-escaped space for other purposes (pandoc uses it as a nonbreaking space).

Making behavior sensitive to whether the characters around the emphasis are CJK is tempting, but adds a lot of complexity, and I’m not sure it would always be what you want. (Is it universally true that spaces are not used as word separators for CJK?) One way it could be implemented would be by treating CJK characters as between non-CJK word characters and punctuation characters, when it comes to determining flankingness. So, in

中文*foo bar*更多

the first * would be left flanking and the second right flanking, so the special rule governing intraword emphasis wouldn’t kick in, whereas in

Mitarbeiter*innen und Professor*innen

it would, since both *s are both left- and right-flanking. But again, this is making a complex rule even more complex.
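To make the idea concrete, here is a purely illustrative sketch (my own classification and helper names; the CJK test covers only the basic Unified Ideographs block, and CJK here simply behaves like punctuation for flanking purposes, which already reproduces the outcomes described above):

```python
import unicodedata

ASCII_PUNCT = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

def kind(ch):
    """Classify a character; None stands for the string boundary,
    which the spec treats as whitespace."""
    if ch is None or ch.isspace():
        return "space"
    if ch in ASCII_PUNCT or unicodedata.category(ch).startswith("P"):
        return "punct"
    if "\u4e00" <= ch <= "\u9fff":  # basic CJK Unified Ideographs only
        return "cjk"
    return "word"

def left_flanking(before, after):
    if kind(after) == "space":
        return False
    if kind(after) in ("punct", "cjk"):
        return kind(before) in ("space", "punct", "cjk")
    return True

def right_flanking(before, after):
    if kind(before) == "space":
        return False
    if kind(before) in ("punct", "cjk"):
        return kind(after) in ("space", "punct", "cjk")
    return True

# 中文*foo … : the opener is left- but not right-flanking,
# so the intraword restriction would not apply.
print(left_flanking("文", "f"), right_flanking("文", "f"))    # True False
# … bar*更多 : the closer is right- but not left-flanking.
print(left_flanking("r", "更"), right_flanking("r", "更"))    # False True
# Professor*innen : both-flanking, so the intraword rule would kick in.
print(left_flanking("r", "i"), right_flanking("r", "i"))      # True True
# 中*文 : still both-flanking, so emphasis inside pure CJK text keeps working.
print(left_flanking("中", "文"), right_flanking("中", "文"))   # True True
```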


Making behavior sensitive to whether the characters around the emphasis are CJK is tempting, but adds a lot of complexity, and I’m not sure it would always be what you want. (Is it universally true that spaces are not used as word separators for CJK?) One way it could be implemented would be by treating CJK characters as between non-CJK word characters and punctuation characters, when it comes to determining flankingness.

AFAIK spaces are quite commonly used as word separators in contemporary Korean, but extremely rare in Japanese and Chinese. And since it is also rare to have asterisks in CJK documents (more likely to see full-width asterisks, ＊), it should not cause much of a problem (IANALinguist either). Not sure about other languages. I think treating CJK characters as something like “semi-punctuation flankers” would work too, but the main problem here will always be exactly which characters should be put into this category.


CommonMark generates a fairly good result for intra-word emphasis in Chinese so far. :beers:

:heavy_check_mark: 中文**foo bar**更多 => <p>中文<strong>foo bar</strong>更多</p>

However, I found an edge case:

:negative_squared_cross_mark: Chinese 中文**`asterisks`** => <p>Chinese 中文**<code>asterisks</code>**</p>

http://spec.commonmark.org/dingus/?text=Chinese%20中文**`asterisks%60**%0A%0Aenglish%20**%60asterisks%60**

I consider it a kind of bug, since it’s inconsistent with the behaviour when the emphasis is surrounded by English.

EDIT:

It’s not an edge case at all. It prevents links, too.
https://markdown-it.github.io/#md3={"source"%3A"CommonMark%20建立了[简要的帮助](http%3A%2F%2Fcommonmark.cn%2Fhelp%2F)和**[This%20should%20be%20strong交互式教程](http%3A%2F%2Fcommonmark.cn%2Fhelp%2Ftutorial%2F)**\n"%2C"defaults"%3A{"html"%3Afalse%2C"xhtmlOut"%3Afalse%2C"breaks"%3Afalse%2C"langPrefix"%3A"language-"%2C"linkify"%3Atrue%2C"typographer"%3Atrue%2C"_highlight"%3Atrue%2C"_strict"%3Afalse%2C"_view"%3A"html"}}

This has nothing to do with Chinese. You can use an English example:

english**`*asterisks*`**

Here we don’t get strong emphasis. Why not? Because the first ** is not “left-flanking,” as defined by the spec. (It is preceded by a letter and followed by punctuation, so it’s right-flanking and not left-flanking.)

It’s the same reason

*(*foo*)

gets parsed as

*(<em>foo</em>)

not as

<em>(</em>foo*)

Changing this case would require major revisions in the spec for emphasis, which would probably have bad consequences elsewhere, so I’m not sure what to do.
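The flanking test being discussed here can be written out directly from the spec’s definition; a minimal sketch in Python (helper names are mine; the string boundary is treated as whitespace, as the spec prescribes):

```python
import unicodedata

# The spec's punctuation = ASCII punctuation plus the Unicode P* categories.
ASCII_PUNCT = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

def is_space(ch):
    return ch is None or ch.isspace()  # None = start/end of the string

def is_punct(ch):
    return ch is not None and (ch in ASCII_PUNCT
                               or unicodedata.category(ch).startswith("P"))

def left_flanking(before, after):
    """Not followed by whitespace, and either not followed by punctuation,
    or followed by punctuation and preceded by whitespace/punctuation."""
    if is_space(after):
        return False
    if is_punct(after):
        return is_space(before) or is_punct(before)
    return True

def right_flanking(before, after):
    """The mirror image: not preceded by whitespace, and either not
    preceded by punctuation, or preceded by punctuation and followed by
    whitespace/punctuation."""
    if is_space(before):
        return False
    if is_punct(before):
        return is_space(after) or is_punct(after)
    return True

# english**`…`** — the first ** sits between a letter and a backtick:
print(left_flanking("h", "`"))   # False, so it cannot open emphasis
print(right_flanking("h", "`"))  # True
# english **`…`** — a space before the ** makes it left-flanking:
print(left_flanking(" ", "`"))   # True
```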


Thanks for the details. It takes time for me to read & understand parts of the spec and connect them with my own language.

As a continuation of your example:

english**`*asterisks*`**english

That first ** breaks emphasis because it’s not left-flanking.

english **`*asterisks*`**

This works because the opening ** is left-flanking and the closing ** is right-flanking (it’s followed by the end of the line).

Language nature

As mentioned earlier, some languages don’t use a blank as a delimiter (in the linguistic sense; word segmentation for these languages is a whole class of NLP problems in CS).

It’s natural for those users to type punctuation characters (ASCII, as used by Markdown) and CJK characters in sequence without a blank.

Delimiter run

For left-flanking:

(b) either not followed by a punctuation character, or preceded by Unicode whitespace or a punctuation character.

I would suggest that it can be preceded not only by Unicode whitespace but also by certain sets of Unicode code points, CJK in this case.

For right-flanking:

or followed by Unicode whitespace or a punctuation character.

Same principle: it could be opened up to certain sets of Unicode code points.

Hope I’m using the right words to make sense :worried:

Unicode code point blocks

These characters are described as CJK Unified Ideographs.

Unfortunately, apart from the six unified blocks, there are also some non-unified CJK characters:

Apart from the six blocks of “Unified Ideographs”, Unicode has about a dozen more blocks with not-unified CJK-characters. These are mainly CJK radicals, strokes, punctuation, marks, symbols and compatibility characters. Although some characters have their (decomposable) counterparts in other blocks, the usages can be different.
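For reference, the six unified blocks can be matched with plain code-point ranges; a sketch (later Unicode versions add further extension blocks beyond these):

```python
# The six blocks of CJK Unified Ideographs (basic block plus Extensions A–E).
CJK_UNIFIED_RANGES = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # Extension A
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0x2B820, 0x2CEAF),  # Extension E
]

def is_cjk_unified(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_UNIFIED_RANGES)

print(is_cjk_unified("中"))  # True
print(is_cjk_unified("a"))   # False
```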

Random thoughts

In this way, it doesn’t change much in the spec. But it looks like it might be a bit hard for JS engines to distinguish the code points?

I may be horribly wrong. But these are just my own thoughts.

中文\ **foo bar**\ 更多

Another possibility could be using fancy Unicode characters, like the zero-width non-joiner. The result would be readable for both humans and CommonMark parsers.
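One caveat with that idea, as far as I can tell: U+200C ZERO WIDTH NON-JOINER is a format character (Unicode category Cf), neither whitespace (Zs) nor punctuation (P*), so under the current flanking definitions a parser would treat it like an ordinary word character, and the spec would need to special-case it. A quick check:

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER: invisible when rendered

print(unicodedata.category(ZWNJ))  # 'Cf' — a format character,
# i.e. neither whitespace (Zs) nor punctuation (P*), so today's flanking
# rules would still classify a '**' next to it as intraword.
print("中文" + ZWNJ + "**foo bar**" + ZWNJ + "更多")
```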

I agree that the current rules make things difficult in languages that don’t use spaces. There are also occasional difficulties like this:

*aaa**bbb**ccc*

which CommonMark parses as emph “aaa”, emph “bbb”, emph “ccc” rather than emph (“aaa”, strong “bbb”, “ccc”). Having some way to force a delimiter to be left- or right- flanking would help here. I believe @tinpot suggested allowing <!> to be used this way:

*aaa<!>**bbb**<!>ccc*