Emphasis inside Strong broken in JS implementation when parenthesis involved

Nicolas_Cadilhac · November 12, 2014, 7:19pm

Why does the following markdown string:

**Gomphocarpus (*Gomphocarpus physocarpus*, syn. *Asclepias physocarpa*)**

produce:

Gomphocarpus (Gomphocarpus physocarpus, syn. Asclepias physocarpa)**

i.e.

<p><em><em>Gomphocarpus (</em>Gomphocarpus physocarpus</em>, syn. <em>Asclepias physocarpa</em>)**</p>

I experience this in the Try page of this site and on my own web site where I installed the prebuilt JS parser. Strangely, it is well formatted in the Preview pane at the right of the box where I typed this message… (same on stackoverflow, so I guess that a different parser is used here for the preview of WMD).

Knagis · November 12, 2014, 7:22pm

It is because of rule 3 for the emphasis parsing:

3. A single * character can close emphasis iff it is not preceded by whitespace.

This rule should probably be extended to also consider an opening parenthesis, unless there is a good example of why not?

Nicolas_Cadilhac · November 12, 2014, 7:28pm

You’re right. Thanks.
It would be great to include parentheses. Writing my sentence as I did seemed so natural (and this is the right way to write in botany).

May I add an unrelated question? Is the source of this WMD editor available somewhere? Like here and on StackOverflow, I’d like to be able to plug in my image upload service.

jgm · November 12, 2014, 8:06pm

Yes, it’s probably a good idea to extend the rules so that closers can’t be preceded by ( (or [ or {?) and openers can’t be followed by ) (or ] or }?). Does anyone see problems with this?

Of course, it would mean you’d need to write

 *`)`*

if you wanted an italicized parenthesis.

jgm · November 12, 2014, 8:11pm

By the way, you can work around this for now by using _ for the inner emphasis markers.

Nicolas_Cadilhac · November 12, 2014, 8:17pm

but I’m still in dev process right now so this is not a hurry…

Knagis · November 18, 2014, 3:32pm

@jgm - There could be a better approach for handling cases such as this - skip all symbols and punctuation before checking for whitespace before or after the emphasis char.

This would also handle cases like:

foo*. bar -- only closer because followed by space (skipping the dot)
**foo "*bar*" foo** -- skips quotes
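A rough sketch of how this skipping could look, assuming that "symbols and punctuation" means the Unicode P* and S* general categories and that the start and end of the text count as whitespace. The helper names `is_skippable`, `can_open` and `can_close` are made up for illustration and do not come from any existing parser:

```python
import unicodedata

def is_skippable(ch):
    # Assumption: "symbols and punctuation" = Unicode general categories P* and S*.
    return unicodedata.category(ch)[0] in "PS"

def can_open(text, end):
    """`end` is the index just past the delimiter run."""
    i = end
    # Skip any punctuation after the delimiter, then look for non-whitespace.
    while i < len(text) and is_skippable(text[i]):
        i += 1
    # Running off the end of the text is treated like whitespace (assumption).
    return i < len(text) and not text[i].isspace()

def can_close(text, start):
    """`start` is the index of the first character of the delimiter run."""
    i = start - 1
    # Skip any punctuation before the delimiter, then look for non-whitespace.
    while i >= 0 and is_skippable(text[i]):
        i -= 1
    # Running off the start of the text is treated like whitespace (assumption).
    return i >= 0 and not text[i].isspace()

example = "foo*. bar"
star = example.index("*")
print(can_open(example, star + 1), can_close(example, star))
```

Note that the `while` loops have no fixed bound: a long pile of punctuation next to the delimiter is scanned in full.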

jgm · November 18, 2014, 8:29pm

There are some good ideas here. It looks hairy, but if I understand correctly, the basic idea is fairly simple:

Strings of * or _ are divided into “left flanking” and “right flanking,” based on two things: the character immediately before them and the character immediately after.
Left-flanking delimiters can open emphasis, right flanking can close, and non-flanking delimiters are just regular text.
A delimiter is left-flanking if the character to the left has a lower rank than the character to the right, according to the following ranking: spaces and newlines are 0, punctuation (unicode categories Pc, Pd, Ps, Pe, Pi, Pf, Po, Sc, Sk, Sm or So) is 1, the rest 2. And similarly a delimiter is right-flanking if the character to the left has a higher rank than the character to the right.
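The ranking can be sketched in a few lines of Python. This is only an illustration of the idea as described here, not code from any of the implementations; the names `rank` and `flanking` are hypothetical, and treating the start or end of the text as whitespace is an assumption:

```python
import unicodedata

# Unicode general categories that the ranking above treats as "punctuation".
PUNCT_CATEGORIES = {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sc", "Sk", "Sm", "So"}

def rank(ch):
    """Return 0 for whitespace, 1 for punctuation, 2 for anything else."""
    # Assumption: None (or "") marks the start or end of the text and ranks as whitespace.
    # Assumption: str.isspace() is a close enough stand-in for "spaces and newlines".
    if not ch or ch.isspace():
        return 0
    if unicodedata.category(ch) in PUNCT_CATEGORIES:
        return 1
    return 2

def flanking(before, after):
    """Classify a delimiter run by the characters on either side of it."""
    left, right = rank(before), rank(after)
    if left < right:
        return "left"   # may open emphasis
    if left > right:
        return "right"  # may close emphasis
    return "none"       # equal ranks: plain text

# The delimiters from the example at the top of the thread:
print(flanking("(", "G"))   # the * in "(*Gomphocarpus"
print(flanking("a", ")"))   # the * in "physocarpa*)"
```

Since only the two neighbouring characters are inspected, no extended lookahead or lookbehind is needed.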

jgm · November 18, 2014, 9:46pm

@Knagis, your idea is also a good one, I just wanted to mention another alternative that might be worth looking into. Both require recognizing unicode punctuation, which adds another level of complexity to the C parser. And your suggestion might require indefinite lookbehind or lookahead in cases where you have a whole pile of punctuation characters before or after the delimiter.

jgm · December 15, 2014, 2:31am

I’ve implemented a solution on the newemph branch; I still need to update the spec and the JS parser. But the examples in this thread are now handled well. If anyone is interested, the commit is here:

jgm · December 25, 2014, 6:51pm

I’ve polished this change and merged it into master. Spec, C, and JS implementations have all been updated.

% ./cmark
**Gomphocarpus (*Gomphocarpus physocarpus*, syn. *Asclepias physocarpa*)**
^D
<p><strong>Gomphocarpus (<em>Gomphocarpus physocarpus</em>, syn. <em>Asclepias physocarpa</em>)</strong></p>

lu_zero · December 25, 2014, 10:44pm

Do you have the commit line handy? I’d sync up the Python implementation later.

Crissov · November 26, 2023, 10:05pm

Nine years ago, @jgm concluded:

Strings of * or _ are divided into “left flanking” and “right flanking,” based on two things:
the character immediately before them
and the character immediately after.

Left-flanking delimiters can open emphasis,
right flanking can close,
and non-flanking delimiters are just regular text.

A delimiter is left-flanking if the character to the left has a lower rank than the character to the right, according to the following ranking:
spaces and newlines are 0,
punctuation (unicode categories Pc, Pd, Ps, Pe, Pi, Pf, Po, Sc, Sk, Sm or So) is 1,
the rest 2.
And similarly a delimiter is right-flanking if the character to the left has a higher rank than the character to the right.

With minor subsequent changes, this led to the following text in the specification:

A left-flanking delimiter run is a delimiter run that is
(1) not followed by Unicode whitespace, and
either (2a) not followed by a Unicode punctuation character,
or (2b) followed by a Unicode punctuation character and preceded by Unicode whitespace or a Unicode punctuation character. (…)

A right-flanking delimiter run is a delimiter run that is
(1) not preceded by Unicode whitespace, and
either (2a) not preceded by a Unicode punctuation character,
or (2b) preceded by a Unicode punctuation character and followed by Unicode whitespace or a Unicode punctuation character.

A single * character can open emphasis
iff (if and only if) it is part of a left-flanking delimiter run.

A single _ character can open emphasis
iff it is part of a left-flanking delimiter run and
either (a) not part of a right-flanking delimiter run
or (b) part of a right-flanking delimiter run preceded by a Unicode punctuation character.

A single * character can close emphasis
iff it is part of a right-flanking delimiter run.

A single _ character can close emphasis
iff it is part of a right-flanking delimiter run and
either (a) not part of a left-flanking delimiter run
or (b) part of a left-flanking delimiter run followed by a Unicode punctuation character.
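As a reading aid, the quoted rules translate fairly directly into predicates on the characters before and after a delimiter run. The following is a minimal sketch, not the cmark or commonmark.js code. It assumes that `None` stands for the start or end of a line, that Unicode whitespace is Zs plus tab, line feed, form feed and carriage return, and that Unicode punctuation is ASCII punctuation plus the P* general categories. All function names are hypothetical:

```python
import string
import unicodedata

def is_whitespace(ch):
    # Assumption: None marks the start or end of the line, which counts as whitespace.
    return ch is None or ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"

def is_punctuation(ch):
    # Assumption: ASCII punctuation plus the Unicode P* general categories.
    return ch is not None and (
        ch in string.punctuation or unicodedata.category(ch).startswith("P")
    )

def left_flanking(before, after):
    if is_whitespace(after):            # fails (1)
        return False
    if not is_punctuation(after):       # (2a)
        return True
    return is_whitespace(before) or is_punctuation(before)  # (2b)

def right_flanking(before, after):
    if is_whitespace(before):           # fails (1)
        return False
    if not is_punctuation(before):      # (2a)
        return True
    return is_whitespace(after) or is_punctuation(after)    # (2b)

def can_open(delim, before, after):
    left = left_flanking(before, after)
    if delim == "*":
        return left
    # "_" additionally needs (a) not right-flanking or (b) punctuation before it.
    return left and (not right_flanking(before, after) or is_punctuation(before))

def can_close(delim, before, after):
    right = right_flanking(before, after)
    if delim == "*":
        return right
    # "_" additionally needs (a) not left-flanking or (b) punctuation after it.
    return right and (not left_flanking(before, after) or is_punctuation(after))

# The * in "(*Gomphocarpus" from the start of this thread:
print(can_open("*", "(", "G"), can_close("*", "(", "G"))
```

With these rules the `*` after the opening parenthesis can open emphasis but cannot close it, which is the behavior the original report asked for.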

In an issue comment on GitHub, I suggested that it may be useful, especially for East Asian languages and others written without inter-word spacing, to treat the different Unicode punctuation classes differently for flanking (or can open/close emphasis) behavior.

Open/Start Ps, e.g. (
Initial Pi, e.g. “
Close/End Pe, e.g. )
Final Pf, e.g. ”
Connector Pc, e.g. _
Dash Pd, e.g. -
Other Po, e.g. ,

Current classification of delimiter runs

␠: any (Unicode) whitespace, including line start and end
.: any (Unicode) punctuation
a: neither punctuation nor whitespace

Before	After	Flanking	Open	Close
␠	␠	none
a	a	both (2a)	`*`	`*`
.	.	both (2b2)	`*`, `_` (b)	`*`, `_` (b)
.	a	left (2a)	`*`, `_` (a)
␠	a	left (2a)	`*`, `_` (a)
␠	.	left (2b1)	`*`, `_` (a)
a	.	right (2a)		`*`, `_` (a)
a	␠	right (2a)		`*`, `_` (a)
.	␠	right (2b1)		`*`, `_` (a)

Proposed differentiation of punctuation classes

(: Ps or Pi
): Pe or Pf
,: Pc, Pd or Po

Before	After	Flanking	Open	Close
,	,	both	`*`, `_`	`*`, `_`
(	(	left (~~both~~)	`*`, `_`	`*`, `_`
)	)	right (~~both~~)	`*`, `_`	`*`, `_`
(	)	both	`*`, `_`?	`*`, `_`?
)	(	both (none?)	`*`, `_`	`*`, `_`
(	,	both (left?)	`*`, `_`	`*`, `_`
)	,	right (~~both~~)	`*`, `_`	`*`, `_`
,	(	left (~~both~~)	`*`, `_`	`*`, `_`
,	)	both (right?)	`*`, `_`	`*`, `_`
(	a	left	`*`, `_`
␠	(	left	`*`, `_`
a	(	both ~~right~~	**`*`, `_`**	`*`, `_`
(	␠	right		`*`, `_`
)	a	both ~~left~~	`*`, `_`	**`*`, `_`**
␠	)	left	`*`, `_`
a	)	right		`*`, `_`
)	␠	right		`*`, `_`
,	a	left	`*`, `_`
␠	,	left	`*`, `_`
a	,	right		`*`, `_`
,	␠	right		`*`, `_`

A left-flanking delimiter run is a delimiter run that is
(1) not followed by whitespace, and
either (2a) not followed by punctuation,
or (2b) followed by non-starting punctuation and preceded by whitespace or non-ending punctuation,
or (2c) followed by starting punctuation.
A strongly left-flanking delimiter run is a left-flanking delimiter run that is
either (a) not part of a right-flanking delimiter run
or (b) part of a right-flanking delimiter run preceded by punctuation
(…)

A single * character can open emphasis
iff (if and only if) it is part of a left-flanking delimiter run.

A single _ character can open emphasis
iff it is part of a strongly left-flanking delimiter run

(…)
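To make the proposed left-flanking rule easier to check against the table, here is a small sketch of just that one predicate. The right-flanking counterpart is elided above, so it is not reconstructed here, and the "strongly left-flanking" test that depends on it is left out as well. The sketch reads "non-ending punctuation" in (2b) as punctuation that is not Pe or Pf, and it assumes the same whitespace and punctuation definitions as the current spec text. All names are hypothetical:

```python
import string
import unicodedata

STARTING = {"Ps", "Pi"}  # opening brackets and initial quotes
ENDING = {"Pe", "Pf"}    # closing brackets and final quotes

def is_whitespace(ch):
    # Assumption: None marks the start or end of the line, which counts as whitespace.
    return ch is None or ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"

def is_punctuation(ch):
    # Assumption: ASCII punctuation plus the Unicode P* general categories.
    return ch is not None and (
        ch in string.punctuation or unicodedata.category(ch).startswith("P")
    )

def is_starting(ch):
    return ch is not None and unicodedata.category(ch) in STARTING

def is_ending(ch):
    return ch is not None and unicodedata.category(ch) in ENDING

def proposed_left_flanking(before, after):
    if is_whitespace(after):       # fails (1)
        return False
    if not is_punctuation(after):  # (2a)
        return True
    if is_starting(after):         # (2c)
        return True
    # (2b): followed by non-starting punctuation, so the character before decides.
    return is_whitespace(before) or (
        is_punctuation(before) and not is_ending(before)
    )

# A delimiter between a letter and an opening parenthesis becomes left-flanking:
print(proposed_left_flanking("a", "("))
```

Running the rows of the table through `proposed_left_flanking` is a quick way to see where the wording and the table agree.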