Fletcher Penney pointed out to me that the current reference implementations give unintuitive results in a few cases.
The first kind of case is an unexpected symmetry violation:
***a**b*
<p><em><strong>a</strong>b</em></p>
*b**a***
<p><em>b</em><em>a</em>**</p>
Intuitively, the second case should yield
<p><em>b<strong>a</strong></em></p>
The second kind of case involves unnatural groupings:
*a**b**c*
<p><em>a</em><em>b</em><em>c</em></p>
**a*b*c**
<p><em><em>a</em>b</em>c**</p>
**a b*b*b c**
<p><em><em>a b</em>b</em>b c**</p>
Here you’d intuitively expect:
*a**b**c*
<p><em>a<strong>b</strong>c</em></p>
**a*b*c**
<p><strong>a<em>b</em>c</strong></p>
**a b*b*b c**
<p><strong>a b<em>b</em>b c</strong></p>
There are two problems here. One is that the current declarative spec for emph/strong doesn’t say quite enough to determine the interpretations of the reference implementations in every case. The other is that the algorithm used by these implementations is giving fairly counterintuitive results, at least in these cases.
In the newemph5 branch of jgm/cmark, I have made a small tweak to the parsing algorithm that gives much better results in these cases. The change is this: when considering matches between an interior delimiter run (one that can open and can close) and another delimiter run, we require that the sum of the lengths of the two delimiter runs is not a multiple of 3.
Thus, for example, in
*a**b*
1 23 4
the delimiter at position 1 cannot match the run at positions 2 and 3, since the lengths of the first delimiter run (1) and the second (2) sum to 3, a multiple of 3. Thus we get <em>a**b</em> instead of <em>a</em><em>b</em>.
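The check itself is small. As a minimal sketch in Python (this is not the code in cmark; the `DelimRun` type and the `may_match` name are made up for illustration), the rule as stated above can be written as a predicate and applied to the three runs of `*a**b*`:

```python
from dataclasses import dataclass


@dataclass
class DelimRun:
    """Hypothetical record for one delimiter run; not a cmark type."""
    length: int
    can_open: bool
    can_close: bool

    @property
    def interior(self):
        # "Interior" in the sense used above: the run can both open and close.
        return self.can_open and self.can_close


def may_match(opener, closer):
    """Apply only the proposed restriction; all other matching checks are assumed done."""
    if opener.interior or closer.interior:
        return (opener.length + closer.length) % 3 != 0
    return True


# The three delimiter runs of *a**b*
first = DelimRun(1, can_open=True, can_close=False)
middle = DelimRun(2, can_open=True, can_close=True)
last = DelimRun(1, can_open=False, can_close=True)

print(may_match(first, middle))  # 1 + 2 is a multiple of 3, and middle is interior
print(may_match(middle, last))   # 2 + 1 is a multiple of 3, and middle is interior
print(may_match(first, last))    # neither run is interior, so the rule does not apply
```

Only the last pairing is allowed, so the outer delimiters match each other and the `**` in the middle is left as literal text.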
This gives better behavior on things like
*a**b**c*
which previously got parsed as
<em>a</em><em>b</em><em>c</em>
and now would be parsed as
<em>a<strong>b</strong>c</em>
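To see how the rule changes the groupings end to end, here is a rough, self-contained Python model of a delimiter-matching pass with the restriction added. It is a sketch under several assumptions, not a port of cmark: it handles only `*`, treats a run as able to open when the next character is not whitespace and able to close when the previous character is not whitespace, expects input made of letters, spaces and asterisks (no HTML escaping), and picks strong emphasis whenever both runs still have at least two delimiters. The names `Node`, `tokenize`, `blocked`, `find_opener` and `render` are all invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    text: str                # literal text, or the unused '*' characters of a run
    is_delim: bool = False
    can_open: bool = False
    can_close: bool = False


def tokenize(src):
    """Split src into text nodes and '*' runs, using simplified flanking rules."""
    nodes = []
    for m in re.finditer(r"\*+|[^*]+", src):
        tok = m.group()
        if tok[0] != "*":
            nodes.append(Node(tok))
            continue
        before = src[m.start() - 1] if m.start() > 0 else " "
        after = src[m.end()] if m.end() < len(src) else " "
        nodes.append(Node(tok, True, not after.isspace(), not before.isspace()))
    return nodes


def blocked(opener, closer):
    """The proposed rule: a pair involving an interior run may not sum to a multiple of 3."""
    interior = (opener.can_open and opener.can_close) or \
               (closer.can_open and closer.can_close)
    return interior and (len(opener.text) + len(closer.text)) % 3 == 0


def find_opener(nodes, i):
    """Index of the nearest earlier run that may open for nodes[i], or -1."""
    for j in range(i - 1, -1, -1):
        n = nodes[j]
        if n.is_delim and n.can_open and not blocked(n, nodes[i]):
            return j
    return -1


def render(src):
    nodes = tokenize(src)
    i = 0
    while i < len(nodes):
        closer = nodes[i]
        if not (closer.is_delim and closer.can_close):
            i += 1
            continue
        j = find_opener(nodes, i)
        if j < 0:
            if not closer.can_open:
                closer.is_delim = False  # can never match: stays literal text
            i += 1
            continue
        opener = nodes[j]
        # Assumption of this sketch: strong when both runs have 2+ left, else em.
        use = 2 if len(opener.text) >= 2 and len(closer.text) >= 2 else 1
        tag = "strong" if use == 2 else "em"
        # Unmatched runs between opener and closer are absorbed as literal text.
        inner = "".join(n.text for n in nodes[j + 1:i])
        opener.text, closer.text = opener.text[use:], closer.text[use:]
        nodes[j + 1:i] = [Node("<%s>%s</%s>" % (tag, inner, tag))]
        i = j + 2                        # closer now sits right after the new node
        if not closer.text:
            del nodes[i]                 # closer used up; i points at the next node
        if not opener.text:
            del nodes[j]                 # opener used up; shift i to compensate
            i -= 1
    return "<p>" + "".join(n.text for n in nodes) + "</p>"


for sample in ["*a**b*", "*a**b**c*", "**a*b*c**", "*b**a***"]:
    print(sample, "->", render(sample))
```

Under these assumptions the model reproduces the groupings argued for above: `*a**b**c*` comes out as a single `<em>` containing a `<strong>`, and `*b**a***` mirrors `***a**b*`. Removing the `blocked` check from `find_opener` lets the first `*` pair with the following `**` again, which is where the nested or repeated `<em>` output comes from.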
With this change we get four spec test failures, but in each case the output seems more “intuitive”:
Example 386 (lines 6490-6494) Emphasis and strong emphasis
*foo**bar**baz*
--- expected HTML
+++ actual HTML
@@ -1 +1 @@
-<p><em>foo</em><em>bar</em><em>baz</em></p>
+<p><em>foo<strong>bar</strong>baz</em></p>
Example 389 (lines 6518-6522) Emphasis and strong emphasis
*foo**bar***
--- expected HTML
+++ actual HTML
@@ -1 +1 @@
-<p><em>foo</em><em>bar</em>**</p>
+<p><em>foo<strong>bar</strong></em></p>
Example 401 (lines 6620-6624) Emphasis and strong emphasis
**foo*bar*baz**
--- expected HTML
+++ actual HTML
@@ -1 +1 @@
-<p><em><em>foo</em>bar</em>baz**</p>
+<p><strong>foo<em>bar</em>baz</strong></p>
Example 442 (lines 6944-6948) Emphasis and strong emphasis
**foo*bar**
--- expected HTML
+++ actual HTML
@@ -1 +1 @@
-<p><em><em>foo</em>bar</em>*</p>
+<p><strong>foo*bar</strong></p>
So I’d like to propose making this change to the parsing algorithm, and adjusting the declarative spec accordingly (as well as perhaps adding some language to rule out ambiguities). Comments welcome.