I don't understand how emphasis is parsed

Aloso · May 7, 2021, 11:49pm

I wrote a Markdown implementation, and the emphasis (**Bold**, *Italic*) should follow the CommonMark specification. But while implementing it, I came across some weird edge cases where the reference implementation seems to diverge from the spec. If I just understood something wrong, can you please help me understand?

When I type *****Hello*world**** in the dingus, the result is

<p>*****Hello<em>world</em>***</p>

But I think it should be

<p>**<em><strong>Hello<em>world</em></strong></em></p>

Can someone explain why that is?

The spec contains this interesting rule:

Emphasis begins with a delimiter that can open emphasis and ends with a delimiter that can close emphasis, and that uses the same character (_ or *) as the opening delimiter.

The opening and closing delimiters must belong to separate delimiter runs.

If one of the delimiters can both open and close emphasis, then the sum of the lengths of the delimiter runs containing the opening and closing delimiters must not be a multiple of 3 unless both lengths are multiples of 3.

In the above example, there are 5 stars on the left and 4 on the right, so the sum of the lengths of their delimiter runs (9) is a multiple of 3. However, neither of them can both open and close emphasis (the stars on the left can only open emphasis, and the stars on the right can only close emphasis), so the last sentence of this rule should not apply.
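The last sentence of the quoted rule can be written as a small predicate. This is only a sketch of that one clause; the function name and parameters are made up for illustration and are not taken from cmark or commonmark.js.

```python
def rule_of_three_blocks(opener_len: int, closer_len: int,
                         opener_can_close: bool, closer_can_open: bool) -> bool:
    """Return True if the multiple-of-3 clause forbids pairing these two runs."""
    # The clause only applies when at least one of the two runs can both
    # open and close. The opener can open by definition, so it is "both"
    # exactly when it can also close, and vice versa for the closer.
    if not (opener_can_close or closer_can_open):
        return False
    total_is_multiple = (opener_len + closer_len) % 3 == 0
    both_are_multiples = opener_len % 3 == 0 and closer_len % 3 == 0
    return total_is_multiple and not both_are_multiples


# 5 stars (open only) against 4 stars (close only): 9 is a multiple of 3,
# but neither run can do both, so the clause does not apply.
print(rule_of_three_blocks(5, 4, False, False))  # False
```

Under this reading, the outer runs of *****Hello*world**** are allowed to match each other.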

mity · May 8, 2021, 8:49am

Your interpretation breaks rule 11:

A literal * character cannot occur at the beginning or end of *-delimited emphasis or **-delimited strong emphasis, unless it is backslash-escaped.

I.e. you either have to use all (un-escaped) * in the whole delimiter run for starting or ending emphasis or strong emphasis spans, or none of them.

(But you’re not alone: looking at it, my MD4C parser exhibits the same problem. There’s no example that catches this, so I overlooked it too.)

mity · May 8, 2021, 8:59am

But a second look shows that the specification itself provides example 416, which imho breaks the rule too, so the rule is likely implemented incompletely in cmark (maybe, as of now, it guarantees it only for the opening delimiter?)

Furthermore, I’m not sure whether it’s possible to implement rules 11 and 12 without opening the door to quadratic parsing time. The implementation would need some rollback mechanism: if it eventually sees that some * or _ would be left unused in a partially used opening or closing delimiter run, it would have to go back and try to match the openers and closers differently.

So this definitely needs some attention.

@jgm?

jgm · May 8, 2021, 2:25pm

The intended interpretation of rule 11 is that you can’t have an unescaped * at the beginning or end of the text that is inside *-delimited emphasis or strong emphasis. Sorry if that wasn’t clear enough from the wording. So for example **foo bar ** can’t be interpreted as emphasized *foo bar * (Example 390). But Example 416 is okay because the unescaped *s come outside of the emphasis.

jgm · May 8, 2021, 2:27pm

In *****Hello*world**** the first delimiter run is ***** and the second is the lone *. Sum of the number of stars is 6, a multiple of 3, and the second one here can both open and close, so rule 9 is triggered.
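Which runs can open, close, or both follows from the spec's left-flanking and right-flanking definitions. The sketch below classifies the * runs of a string under simplifying assumptions: it ignores backslash escapes and code spans, approximates Unicode whitespace with str.isspace(), treats the start and end of the string as whitespace, and only handles * (for which "can open" is the same as left-flanking and "can close" the same as right-flanking). The helper names are hypothetical.

```python
import re
import string
import unicodedata


def is_punct(ch: str) -> bool:
    # ASCII punctuation, or any character in a Unicode "P" category.
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")


def classify_star_runs(text: str):
    """Return (run, can_open, can_close) for every run of '*' in text."""
    runs = []
    for m in re.finditer(r"\*+", text):
        before = text[m.start() - 1] if m.start() > 0 else " "
        after = text[m.end()] if m.end() < len(text) else " "
        left_flanking = (not after.isspace()) and (
            not is_punct(after) or before.isspace() or is_punct(before))
        right_flanking = (not before.isspace()) and (
            not is_punct(before) or after.isspace() or is_punct(after))
        runs.append((m.group(), left_flanking, right_flanking))
    return runs


for run, can_open, can_close in classify_star_runs("*****Hello*world****"):
    print(run, "open" if can_open else "-", "close" if can_close else "-")
# ***** open -
# * open close
# **** - close
```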

Aloso · May 8, 2021, 3:14pm

I’m aware of that, that’s why Hello isn’t emphasized. However, world can be emphasized (it’s surrounded by 1 star on the left and 4 stars on the right, and 5 isn’t divisible by 3). That leaves 5 unmatched stars on the left and 3 stars on the right, which for some reason don’t emphasize the text, unless the star in the middle is removed.

Aloso · May 8, 2021, 3:16pm

That’s also how I read it, but I agree that the sentence could be phrased better to avoid confusion.

mity · May 8, 2021, 5:24pm

The intended interpretation of rule 11 is that you can’t have an unescaped * at the beginning or end of the text that is inside *-delimited emphasis or strong emphasis. Sorry if that wasn’t clear enough from the wording. So for example **foo bar ** can’t be interpreted as emphasized *foo bar * (Example 390). But Example 416 is okay because the unescaped *s come outside of the emphasis.

Ok, got it but then I don’t understand the behavior.

In *****Hello*world**** the first delimiter run is ***** and the second is the lone *. Sum of the number of stars is 6, a multiple of 3, and the second one here can both open and close, so rule 9 is triggered.

Yes, that explains why the last star from the 1st delimiter run and the star of the 2nd run do not form an emphasis of Hello. None of the mentioned parsers does that.

Once that is resolved, the parser then tries to use the middle star as an opener for the last delimiter run, which has 4 stars. 4 + 1 = 5 is not divisible by 3. All the mentioned parsers follow that and produce the emphasis of world.

However, why does dingus/cmark then not continue and use the stars of the 1st and 3rd delimiter runs for additional emphasis/strong emphasis? The 2nd delimiter run plays no role here anymore. Neither the 1st nor the 3rd delimiter run can be both an opener and a closer. Therefore rule 9 should no longer apply in this step.

jgm · May 11, 2021, 1:22pm

OK, I see the issue now. This does look like a bug; I’ll have to look into the algorithm to see why it’s happening. If you don’t mind putting up an issue at GitHub commonmark/cmark, that will help me keep track of it.

We should probably also have a spec example like this (so maybe we also need an issue on commonmark/commonmark-spec).

EDIT: just reading the algorithm at the end of the spec, I think I see the issue. We have an openers_bottom table that limits how far back you have to look for an opener. It is indexed to the type of delimiter (_, *) and the length of the closing delimiter mod 3. So after we fail to match the opener ***** to *, we set the openers_bottom for (*, 1) to the location of *, effectively removing the ***** as a possible opener for any run of *s with a length mod 3 of 1, including the final **** in this example. This procedure ignores the fact that the length mod 3 thing only matters if one of the delimiters can be both an opener and a closer.
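The effect described above can be reproduced with a heavily simplified model of the delimiter-matching loop. This is a sketch, not cmark's code: it handles only *, takes runs whose open/close flags were classified by hand, and the names Run, process_emphasis and key_on_can_open are hypothetical. It assumes the table is keyed on the original run length mod 3. With key_on_can_open=True the table is additionally keyed on whether the closer can also open, which is one possible repair suggested by the last sentence above.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    count: int                  # delimiters in the run that are still unused
    can_open: bool
    can_close: bool
    orig: int = 0               # original run length, fixed at creation
    active: bool = True
    opens: list = field(default_factory=list)   # tags after leftover stars
    closes: list = field(default_factory=list)  # tags before leftover stars

    def __post_init__(self):
        self.orig = self.count


def process_emphasis(tokens, key_on_can_open):
    openers_bottom = {}  # key -> index; openers at or below it are not searched
    i = 0
    while i < len(tokens):
        closer = tokens[i]
        if not (isinstance(closer, Run) and closer.active
                and closer.can_close and closer.count > 0):
            i += 1
            continue
        key = closer.orig % 3
        if key_on_can_open:
            key = (key, closer.can_open)
        found = None
        # Walk backwards from the closer down to (not including) the bottom.
        for j in range(i - 1, openers_bottom.get(key, -1), -1):
            op = tokens[j]
            if not (isinstance(op, Run) and op.active
                    and op.can_open and op.count > 0):
                continue
            total = op.orig + closer.orig
            odd_match = ((closer.can_open or op.can_close) and total % 3 == 0
                         and not (op.orig % 3 == 0 and closer.orig % 3 == 0))
            if not odd_match:
                found = j
                break
        if found is None:
            openers_bottom[key] = i - 1   # lower bound for later searches
            if not closer.can_open:
                closer.active = False
            i += 1
            continue
        op = tokens[found]
        use = 2 if op.count >= 2 and closer.count >= 2 else 1
        tag = "strong" if use == 2 else "em"
        op.count -= use
        closer.count -= use
        op.opens.insert(0, f"<{tag}>")      # later matches wrap earlier ones
        closer.closes.append(f"</{tag}>")
        for between in tokens[found + 1:i]:
            if isinstance(between, Run):
                between.active = False
        if closer.count == 0:
            i += 1
    return "".join(
        t if isinstance(t, str)
        else "".join(t.closes) + "*" * t.count + "".join(t.opens)
        for t in tokens)


def example_tokens():
    # *****Hello*world**** with hand-assigned open/close flags
    return [Run(5, True, False), "Hello", Run(1, True, True),
            "world", Run(4, False, True)]


print(process_emphasis(example_tokens(), key_on_can_open=False))
# *****Hello<em>world</em>***
print(process_emphasis(example_tokens(), key_on_can_open=True))
# **<em><strong>Hello<em>world</em></strong></em>
```

In the first call, the failed match of the lone * raises the bottom for every closer whose length mod 3 is 1, so the final **** can never reach *****. In the second call the lone * (which can also open) and the final **** (which cannot) use different table entries, so the outer runs still pair up.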

mity · May 12, 2021, 6:07am

Filed as issue #383 in commonmark/cmark and issue #221 in commonmark/commonmark.js, both titled “Incorrect emphasis handling”.