`测试**foo**可以` works but `测试**foo()**可以` fails

fantasticfears · July 18, 2017, 9:29am

Since Discourse moved to markdown-it, quite a few CJK cases now work normally. But I found a corner case that surprised me, though I believe it follows from the spec somehow. This is what gets rendered (only the `foo` without parentheses comes out bold):

测试**foo()**可以

测试foo可以

测试foo()可以
测试foo可以

I would rather expect `fun()` and `fun` to be bold in every case:

测试fun()可以

测试fun可以

测试fun()可以
测试fun可以

The test code is:

测试**fun()**可以

测试**fun**可以

测试**fun()**可以
测试**fun**可以

The behaviour is tested on
https://johnmacfarlane.net/babelmark2/?text=%3E+%E6%B5%8B%E8%AF%95foo()%E5%8F%AF%E4%BB%A5%0A%3E+%0A%3E+%E6%B5%8B%E8%AF%95foo%E5%8F%AF%E4%BB%A5%0A%3E+%0A%3E+%E6%B5%8B%E8%AF%95foo()%E5%8F%AF%E4%BB%A5%0A%3E+%E6%B5%8B%E8%AF%95foo%E5%8F%AF%E4%BB%A5%0A

mity · July 18, 2017, 12:43pm

The spec says that

A right-flanking delimiter run is a delimiter run that is (a) not preceded by Unicode whitespace, and (b) either not preceded by a punctuation character, or followed by Unicode whitespace or a punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

and a bit later

A double ** can close strong emphasis iff it is part of a right-flanking delimiter run.

Note that when non-whitespace, non-punctuation characters are both before the ** and after it, it does not form left-flanking and right-flanking delimiter runs, so it cannot be used to begin or end a strong emphasis span.

So as I can see it, the example follows the specification correctly.
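To make the quoted rule concrete, here is a minimal Python sketch of the left- and right-flanking tests applied to the closing `**` of both examples. The helper names (`flanking`, `last_run_flanking`) are made up for illustration, and the sketch covers only the flanking definitions, not the rest of the emphasis algorithm.

```python
import string
import unicodedata
from typing import Tuple


def is_whitespace(ch: str) -> bool:
    # "" stands for the beginning or end of the line, which counts as whitespace.
    return ch == "" or ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"


def is_punctuation(ch: str) -> bool:
    # ASCII punctuation, or any character in a Unicode P* category.
    return ch != "" and (ch in string.punctuation
                         or unicodedata.category(ch).startswith("P"))


def flanking(text: str, start: int, end: int) -> Tuple[bool, bool]:
    """Return (left_flanking, right_flanking) for the run text[start:end]."""
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""
    left = not is_whitespace(after) and (
        not is_punctuation(after)
        or is_whitespace(before) or is_punctuation(before))
    right = not is_whitespace(before) and (
        not is_punctuation(before)
        or is_whitespace(after) or is_punctuation(after))
    return left, right


def last_run_flanking(text: str) -> Tuple[bool, bool]:
    # Hypothetical helper: inspect only the last "**" in the text.
    start = text.rfind("**")
    return flanking(text, start, start + 2)


print(last_run_flanking("测试**foo()**可以"))  # (True, False): cannot close
print(last_run_flanking("测试**foo**可以"))    # (True, True): can close
```

In the first case the closing `**` is preceded by the punctuation character `)` and followed by `可`, which is neither whitespace nor punctuation, so clause (b) fails and the run is not right-flanking.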

fantasticfears · July 18, 2017, 5:53pm

Thanks for the explanation. I am looking forward to some insight into clause (b).

Chinese, Japanese and Korean do not use whitespace to separate words, so my example is a natural thing to write in those languages.

That note is not true, though. I think it works for Chinese: these ** are actually treated as left-flanking delimiters.

测试可能这样

> 测试**可能**这样

vsch · July 18, 2017, 11:19pm

This behaviour is caused by opening and closing brackets being treated as punctuation characters for both left flanking and right flanking delimiter logic.

This is actually something that may need to be addressed in the spec by changing the definition of punctuation characters for left and right flanking delimiters to only include appropriate brackets to make left/right flanking logic take the bracket type into account.

Currently punctuation includes `(){}<>[]`. However, for the left-flanking delimiter, the "before is punctuation" condition should not include the opening brackets `(<{[`. Similarly, for the right-flanking delimiter, the "after is punctuation" condition should not include the closing brackets `)>}]`.

For example, `(**` cannot be right-flanking but can be left-flanking. Similarly, `**)` cannot be left-flanking but can be right-flanking. This makes sense since it is assumed that opening brackets delimit the start of inner text and closing brackets do the same for the end of text.

This way `aa(**foo**)aa`, `aa(**foo)**aa` and `aa**foo()**aa` will recognize the strong emphasis but `aa**)foo(**aa` will not.

This seems more intuitive than the current spec, where `aa**foo()**aa` is not recognized as delimited but `aa**foo**aa` is.

BTW, I tested this delimiter logic in my implementation and the original spec 0.27 tests pass, so it does not break existing cases, but a couple of new test cases will need to be added to catch the added nuances.
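The proposal is stated tersely, so the following Python sketch is only one possible reading of it, chosen because it reproduces the four examples above: a closing bracket before a run no longer blocks right-flanking, and an opening bracket after a run no longer blocks left-flanking. The `bracket_aware` flag and the `strong_recognized` helper are hypothetical, and the check looks only at the first and last `**` rather than running the full emphasis algorithm.

```python
import string
import unicodedata
from typing import Tuple

OPENING = set("(<{[")
CLOSING = set(")>}]")


def is_ws(ch: str) -> bool:
    # "" stands for the beginning or end of the line.
    return ch == "" or ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"


def is_punct(ch: str) -> bool:
    return ch != "" and (ch in string.punctuation
                         or unicodedata.category(ch).startswith("P"))


def flanking(text: str, start: int, end: int,
             bracket_aware: bool = False) -> Tuple[bool, bool]:
    """Return (left_flanking, right_flanking) for the run text[start:end]."""
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""
    # Assumed reading: an opening bracket after the run does not count as
    # punctuation for the left test; a closing bracket before the run does
    # not count as punctuation for the right test.
    after_blocks_left = is_punct(after) and not (
        bracket_aware and after in OPENING)
    before_blocks_right = is_punct(before) and not (
        bracket_aware and before in CLOSING)
    left = not is_ws(after) and (
        not after_blocks_left or is_ws(before) or is_punct(before))
    right = not is_ws(before) and (
        not before_blocks_right or is_ws(after) or is_punct(after))
    return left, right


def strong_recognized(text: str, bracket_aware: bool) -> bool:
    # Simplification: the first "**" must be able to open (left-flanking)
    # and the last "**" must be able to close (right-flanking).
    first, last = text.find("**"), text.rfind("**")
    can_open = flanking(text, first, first + 2, bracket_aware)[0]
    can_close = flanking(text, last, last + 2, bracket_aware)[1]
    return can_open and can_close


for sample in ["aa(**foo**)aa", "aa(**foo)**aa",
               "aa**foo()**aa", "aa**)foo(**aa"]:
    print(sample, strong_recognized(sample, False),
          strong_recognized(sample, True))
```

With `bracket_aware=False` the sketch follows the current flanking rules, so `aa**foo()**aa` is rejected; with `bracket_aware=True` it is accepted, while `aa**)foo(**aa` is rejected in both modes.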

fantasticfears · July 24, 2017, 7:59am

This is stunningly good news for me as well as for millions of CJK users. I would be glad to help if any assistance is needed.

lifthrasiir · July 30, 2017, 5:08pm

If my understanding is correct, this also means that a certain construct can never be written satisfactorily in CommonMark:

Inter**[web](http://example.com/)**

I guess there is no CommonMark input, apart from raw HTML, that would give the intended result. This would be especially problematic if the CommonMark input is the output of another application (e.g. a document processor or converter).

Edit: There are two, possibly conflicting, issues here:

1. One wants a punctuation character to participate in the flanking business (e.g. )*x is currently not left-flanking but some may want it to be).
2. One wants a punctuation character not to participate in the flanking business (e.g. (*x used to be right-flanking but it no longer is).

@vsch’s suggestion addresses issue 1 by narrowing down the applicable punctuation characters. This, however, has the side effect that [Inter]**web** is no longer recognized as emphasis! While this seems unnatural for English, remember that the original problem came from CJK scripts, where intraword links and emphasis are possibly more common.

I hereby suggest the following additional rules, not necessarily in conflict with @vsch’s proposal.

1. Escaped punctuation characters are treated like normal letters when determining left- and right-flanking delimiter runs. Therefore (*x would be left-flanking even under the proposal, but \(*x won’t be. This solves issue 2.
2. Any HTML comment is, no matter what precedes or follows it, treated like an active punctuation character. Therefore )*x wouldn’t be left-flanking under the proposal, but the same input with an HTML comment inserted between ) and * will be. This solves issue 1.

(We can extend this logic to other constructs, but that may cause backtracking, e.g. checking whether x*[very very long text here]() is right-flanking could take a long time. Let’s consider this an escape hatch and not a general mechanism.)

Therefore, no matter how we optimize the rule, we will still have some way to express the other interpretation. This will require some lookahead and lookbehind, but they will be limited to at most 4 bytes, so I believe this is a reasonable bet.

vsch · July 31, 2017, 8:48pm

@lifthrasiir, what would be the side effects of removing the brackets from the list of punctuation characters, so as to allow delimiters around brackets?

This would allow delimiters before/after brackets in general so a plain * would require escaping in these cases. I think this would make the behavior more intuitive and affect mostly unspaced * in math equations: 5*(x+1)*y. Putting spaces gets around the problem: 5 * (x+1) * y or escaping 5\*(x+1)\*y.

The other occurrence would be a _ at the end of function names, as in abc_(); def_(); _abc_(), but this can be addressed by treating _ as a special case for which the brackets remain punctuation characters.

Is there any way to get statistics on the use of such problem cases in markdown sources?

This would give a better idea of how much markdown source would break in such cases. Maybe the benefit of having a less restrictive delimiter parsing would outweigh the few cases that would require escaping or spacing delimiters.
