Emphasis/strong emphasis corner cases

jgm · June 5, 2016, 4:16am

Fletcher Penney pointed out to me that the current reference implementations give unintuitive results in a few cases.

The first kind of case is an unexpected symmetry violation:

***a**b*
<p><em><strong>a</strong>b</em></p>
*b**a***
<p><em>b</em><em>a</em>**</p>

Intuitively, the second case should yield

<p><em>b<strong>a</strong></em></p>

The second kind of case involves unnatural groupings:

*a**b**c*
<p><em>a</em><em>b</em><em>c</em></p>
**a*b*c**
<p><em><em>a</em>b</em>c**</p>
**a b*b*b c**
<p><em><em>a b</em>b</em>b c**</p>

Here you’d intuitively expect:

*a**b**c*
<p><em>a<strong>b</strong>c</em></p>
**a*b*c**
<p><strong>a<em>b</em>c</strong></p>
**a b*b*b c**
<p><strong>a b<em>b</em>b c</strong></p>

There are two problems here. One is that the current declarative spec for emph/strong doesn’t say quite enough to determine the interpretations of the reference implementations in every case. The other is that the algorithm used by these implementations is giving fairly counterintuitive results, at least in these cases.

In the newemph5 branch of jgm/cmark I have made a small tweak to the parsing algorithm that gives much better results in these cases. The change is this: when considering matches between an interior delimiter run (one that can open and can close) and another delimiter run, we require that the sum of the lengths of the two delimiter runs is not a multiple of 3.

Thus, for example, in

*a**b*
1 23 4

delimiter 1 cannot match delimiter 2, since the lengths of the first delimiter run (delimiter 1, length 1) and the second (delimiters 2 and 3, length 2) sum to 3. Thus we get <em>a**b</em> instead of <em>a</em><em>b</em>.
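
A minimal Python sketch of that check may make the arithmetic easier to follow. The function name `sum_rule_blocks` and its boolean flags are hypothetical, chosen for illustration; this is not the code in cmark or commonmark.js.

```python
def sum_rule_blocks(opener_len, closer_len, opener_can_close, closer_can_open):
    """Return True if the proposed restriction forbids pairing these two runs.

    Hypothetical helper: an "interior" run is one that can both open and close,
    so the rule applies when the opener can also close or the closer can also open.
    """
    involves_interior_run = opener_can_close or closer_can_open
    return involves_interior_run and (opener_len + closer_len) % 3 == 0


# In "*a**b*", the first run "*" only opens and the second run "**" is interior.
print(sum_rule_blocks(1, 2, False, True))   # True: 1 + 2 == 3
# Two interior "**" runs, as in "*a**b**c*", may still pair up.
print(sum_rule_blocks(2, 2, True, True))    # False: 2 + 2 == 4
# With no interior run involved, the sum is not checked at all.
print(sum_rule_blocks(1, 2, False, False))  # False
```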

This gives better behavior on things like

*a**b**c*

which previously got parsed as

<em>a</em><em>b</em><em>c</em>

and now would be parsed as

<em>a<strong>b</strong>c</em>
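
To experiment with the before and after behavior, here is a toy Python sketch of a delimiter-matching pass with the restriction as a switch. It is an illustration under several assumptions, not the cmark algorithm: it handles only `*`, it decides whether a run can open or close by looking for adjacent whitespace only (the punctuation conditions are ignored), it uses strong emphasis whenever both runs still have at least two asterisks, and all names (`Delim`, `tokenize`, `render`, `rule_of_3`) are made up for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Delim:
    orig: int        # original run length, used for the sum test
    count: int       # asterisks not yet consumed by a match
    can_open: bool
    can_close: bool


def tokenize(text):
    """Split text into plain strings and Delim objects (asterisks only)."""
    nodes = []
    for m in re.finditer(r"\*+|[^*]+", text):
        run = m.group()
        if run[0] != "*":
            nodes.append(run)
            continue
        # Simplification: start and end of input count as whitespace.
        before = text[m.start() - 1] if m.start() > 0 else " "
        after = text[m.end()] if m.end() < len(text) else " "
        nodes.append(Delim(len(run), len(run),
                           not after.isspace(), not before.isspace()))
    return nodes


def literal(node):
    """Render a node as-is; unused asterisks become literal text."""
    return node if isinstance(node, str) else "*" * node.count


def blocked(opener, closer):
    """The proposed restriction: interior run involved and sum divisible by 3."""
    interior = opener.can_close or closer.can_open
    return interior and (opener.orig + closer.orig) % 3 == 0


def render(text, rule_of_3=True):
    nodes = tokenize(text)
    i = 0
    while i < len(nodes):
        closer = nodes[i]
        if isinstance(closer, str) or not closer.can_close or closer.count == 0:
            i += 1
            continue
        # Walk backwards to the nearest usable opener.
        j = i - 1
        while j >= 0:
            o = nodes[j]
            if (isinstance(o, Delim) and o.can_open and o.count > 0
                    and not (rule_of_3 and blocked(o, closer))):
                break
            j -= 1
        if j < 0:
            i += 1
            continue
        opener = nodes[j]
        use = 2 if opener.count >= 2 and closer.count >= 2 else 1
        tag = "strong" if use == 2 else "em"
        # Anything between the pair, including unmatched runs, becomes content.
        inner = "".join(literal(n) for n in nodes[j + 1:i])
        opener.count -= use
        closer.count -= use
        nodes[j + 1:i] = [f"<{tag}>{inner}</{tag}>"]
        i = j + 2  # the closer's new index; revisit it if asterisks remain
    return "".join(literal(n) for n in nodes)


print(render("*a**b**c*", rule_of_3=False))  # <em>a</em><em>b</em><em>c</em>
print(render("*a**b**c*"))                   # <em>a<strong>b</strong>c</em>
print(render("**a*b*c**"))                   # <strong>a<em>b</em>c</strong>
```

With `rule_of_3=False` the sketch reproduces the old groupings quoted in this post, and with the default it produces the proposed ones, at least for the inputs discussed here.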

With this change we get four spec test failures, but in each case the output seems more “intuitive”:

Example 386 (lines 6490-6494) Emphasis and strong emphasis
*foo**bar**baz*

--- expected HTML
+++ actual HTML
@@ -1 +1 @@
-<p><em>foo</em><em>bar</em><em>baz</em></p>
+<p><em>foo<strong>bar</strong>baz</em></p>

Example 389 (lines 6518-6522) Emphasis and strong emphasis
*foo**bar***

--- expected HTML
+++ actual HTML
@@ -1 +1 @@
-<p><em>foo</em><em>bar</em>**</p>
+<p><em>foo<strong>bar</strong></em></p>

Example 401 (lines 6620-6624) Emphasis and strong emphasis
**foo*bar*baz**

--- expected HTML
+++ actual HTML
@@ -1 +1 @@
-<p><em><em>foo</em>bar</em>baz**</p>
+<p><strong>foo<em>bar</em>baz</strong></p>

Example 442 (lines 6944-6948) Emphasis and strong emphasis
**foo*bar**

--- expected HTML
+++ actual HTML
@@ -1 +1 @@
-<p><em><em>foo</em>bar</em>*</p>
+<p><strong>foo*bar</strong></p>

So I’d like to propose making this change to the parsing algorithm, and adjusting the declarative spec accordingly (as well as perhaps adding some language to rule out ambiguities). Comments welcome.

xoofx · June 6, 2016, 12:24am

That does indeed look a lot more intuitive (though I have never felt the need to use nested emphasis), so it would be good if it makes it into the spec.

jgm · June 25, 2016, 5:05am

I have implemented this change in the dev version of the spec, cmark, and commonmark.js.

xoofx · June 25, 2016, 9:36am

Great, I have replicated the changes in markdig as well.

haqer1 · April 24, 2020, 8:10am

FWIW, I thought things like the following were a bug, until I skimmed the complex emphasis and strong emphasis rules:
Hello *Super-*man
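
For anyone else puzzled by this, here is a rough Python sketch of the left-flanking and right-flanking tests as I understand them from the spec. The helper names `is_punct` and `flanking` are made up, and the punctuation test is an approximation (ASCII punctuation plus Unicode "P" categories), so treat it as an illustration rather than a reference.

```python
import string
import unicodedata


def is_punct(ch):
    # Approximation of the spec's punctuation class (an assumption).
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")


def flanking(text, start, end):
    """Return (left_flanking, right_flanking) for the run text[start:end]."""
    # Start and end of the line are treated as whitespace.
    before = text[start - 1] if start > 0 else " "
    after = text[end] if end < len(text) else " "
    left = (not after.isspace()) and (
        not is_punct(after) or before.isspace() or is_punct(before))
    right = (not before.isspace()) and (
        not is_punct(before) or after.isspace() or is_punct(after))
    return left, right


text = "Hello *Super-*man"
first = text.index("*")
second = text.index("*", first + 1)
print(flanking(text, first, first + 1))    # (True, False)
print(flanking(text, second, second + 1))  # (True, False)
```

The second `*` is preceded by punctuation (`-`) and followed by a letter, so it is not right-flanking and cannot close the emphasis that the first `*` would open.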

Since it appears not to be a bug (or at least it’s spec-compliant behavior), I added it to a PR as a suggestion for improving the spec:

P.S. But please do let me know if it’s a bug; then I’d revert the commit and log a bug.

Perhaps some ZWS-requiring things like that could be simplified Beyond MD as well… E.g., catching such stuff in auto transformation from HTML is tricky, to say the least.