Emphasis inside Strong broken in JS implementation when parentheses are involved

Why does the following markdown string:

**Gomphocarpus (*Gomphocarpus physocarpus*, syn. *Asclepias physocarpa*)**

produce:

Gomphocarpus (Gomphocarpus physocarpus, syn. Asclepias physocarpa)**

i.e.
<p><em><em>Gomphocarpus (</em>Gomphocarpus physocarpus</em>, syn. <em>Asclepias physocarpa</em>)**</p>

I experience this in the Try page of this site and on my own web site where I installed the prebuilt JS parser. Strangely, it is well formatted in the Preview pane at the right of the box where I typed this message… (same on stackoverflow, so I guess that a different parser is used here for the preview of WMD).

It is because of rule 3 for the emphasis parsing:

3. A single * character can close emphasis iff it is not preceded by whitespace.

This rule should probably be extended to consider an opening parenthesis as well, unless someone has a good example of why it shouldn't?
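To make the failure concrete, here is a minimal sketch that applies only rule 3 as quoted above. The function name `old_can_close` is made up for illustration, it ignores every other emphasis rule, and treating the start of the string as whitespace is an assumption.

```python
def old_can_close(text: str, i: int) -> bool:
    """Rule 3 as quoted: a single * can close emphasis iff it is
    not preceded by whitespace.

    Assumption of this sketch: the start of the string counts as whitespace.
    """
    if text[i] != "*":
        raise ValueError("position i must hold a * character")
    return i > 0 and not text[i - 1].isspace()


line = "**Gomphocarpus (*Gomphocarpus physocarpus*, syn. *Asclepias physocarpa*)**"
i = line.index("(*") + 1  # the * right after the opening parenthesis

# The * follows "(" rather than whitespace, so rule 3 allows it to close
# emphasis even though it was written as an opener.
print(old_can_close(line, i))  # True
```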

You’re right. Thanks.
It would be great to include parentheses. Writing my sentence as I did seemed so natural (and this is the right way to write in botany).

May I add an unrelated question? Is the source of this WMD editor available somewhere? As here and on StackOverflow, I'd like to be able to plug in my image upload service.

Yes, it’s probably a good idea to extend the rules so that closers can’t be preceded by ( (or [ or {?) and openers can’t be followed by ) (or ] or }?). Does anyone see problems with this?

Of course, it would mean you’d need to write

 *`)`*

if you wanted an italicized parenthesis.

By the way, you can work around this for now by using _ for the inner emphasis markers.

:smile: But I’m still in development right now, so there is no hurry…

@jgm - There could be a better approach for handling cases such as this - skip all symbols and punctuation before checking for whitespace before or after the emphasis char.

This would also handle cases like:

foo*. bar -- only closer because followed by space (skipping the dot)
**foo "*bar*" foo** -- skips quotes
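A rough sketch of the "skip punctuation, then check for whitespace" idea, under stated assumptions: "symbols and punctuation" is read as Unicode general categories P* and S*, the start and end of the text count as whitespace, and the helper names (`is_punct_or_symbol`, `skip_punct_flags`) are invented here. It is not taken from any existing parser.

```python
import unicodedata
from typing import Tuple


def is_punct_or_symbol(ch: str) -> bool:
    # Assumption: "symbols and punctuation" means categories P* and S*.
    return unicodedata.category(ch)[0] in ("P", "S")


def skip_punct_flags(text: str, start: int, end: int) -> Tuple[bool, bool]:
    """Return (can_open, can_close) for the delimiter run text[start:end]."""
    # Walk left over punctuation to find what really precedes the run.
    left = start - 1
    while left >= 0 and is_punct_or_symbol(text[left]):
        left -= 1
    # Walk right over punctuation to find what really follows the run.
    right = end
    while right < len(text) and is_punct_or_symbol(text[right]):
        right += 1
    # Assumption: running off either end of the text counts as whitespace.
    before_is_space = left < 0 or text[left].isspace()
    after_is_space = right >= len(text) or text[right].isspace()
    return (not after_is_space, not before_is_space)


print(skip_punct_flags("foo*. bar", 3, 4))             # (False, True): closer only
print(skip_punct_flags('**foo "*bar*" foo**', 7, 8))   # (True, False): opener only
print(skip_punct_flags('**foo "*bar*" foo**', 11, 12)) # (False, True): closer only
```

Note that each loop may scan arbitrarily far when a long run of punctuation sits next to the delimiter.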

There are some good ideas here. It looks hairy, but if I understand correctly, the basic idea is fairly simple:

  1. Strings of * or _ are divided into “left flanking” and “right flanking,” based on two things: the character immediately before them and the character immediately after.
  2. Left-flanking delimiters can open emphasis, right flanking can close, and non-flanking delimiters are just regular text.
  3. A delimiter is left-flanking if the character to the left has a lower rank than the character to the right, according to the following ranking: spaces and newlines are 0, punctuation (unicode categories Pc, Pd, Ps, Pe, Pi, Pf, Po, Sc, Sk, Sm or So) is 1, the rest 2. And similarly a delimiter is right-flanking if the character to the left has a higher rank than the character to the right.
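The ranking in the three points above can be sketched as follows. This is only an illustration: the names `rank` and `flanking` are hypothetical, `str.isspace()` stands in for "spaces and newlines", and an empty string is assumed to represent the start or end of the line and to rank like whitespace.

```python
import unicodedata

# Categories listed above as "punctuation".
PUNCT_CATEGORIES = {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sc", "Sk", "Sm", "So"}


def rank(ch: str) -> int:
    """0 for whitespace, 1 for punctuation, 2 for everything else."""
    # Assumption: "" marks the start or end of the line and ranks as whitespace.
    if ch == "" or ch.isspace():
        return 0
    if unicodedata.category(ch) in PUNCT_CATEGORIES:
        return 1
    return 2


def flanking(before: str, after: str) -> str:
    """Classify a delimiter run by the characters on either side of it."""
    if rank(before) < rank(after):
        return "left"   # may open emphasis
    if rank(before) > rank(after):
        return "right"  # may close emphasis
    return "none"       # equal ranks: treated as regular text


print(flanking("(", "G"))  # left: the * in "(*Gomphocarpus"
print(flanking("o", "."))  # right: the * in "foo*. bar"
```

Under this reading, equal ranks on both sides (for example a letter on each side) come out as non-flanking.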

@Knagis, your idea is also a good one, I just wanted to mention another alternative that might be worth looking into. Both require recognizing unicode punctuation, which adds another level of complexity to the C parser. And your suggestion might require indefinite lookbehind or lookahead in cases where you have a whole pile of punctuation characters before or after the delimiter.

I’ve implemented a solution on the newemph branch – still need to update the spec and the JS parser. But the examples in this thread are now handled well. If anyone is interested, the commit is here:

I’ve polished this change and merged it into master. Spec, C, and JS implementations have all been updated.

% ./cmark
**Gomphocarpus (*Gomphocarpus physocarpus*, syn. *Asclepias physocarpa*)**
^D
<p><strong>Gomphocarpus (<em>Gomphocarpus physocarpus</em>, syn. <em>Asclepias physocarpa</em>)</strong></p>

Do you have the commit line handy? I’d sync up the Python implementation later.

Nine years ago, @jgm concluded:

  1. Strings of * or _ are divided into “left flanking” and “right flanking,” based on two things:
    the character immediately before them
    and the character immediately after.
  2. Left-flanking delimiters can open emphasis,
    right flanking can close,
    and non-flanking delimiters are just regular text.
  3. A delimiter is left-flanking if the character to the left has a lower rank than the character to the right, according to the following ranking:
    spaces and newlines are 0,
    punctuation (unicode categories Pc, Pd, Ps, Pe, Pi, Pf, Po, Sc, Sk, Sm or So) is 1,
    the rest 2.
    And similarly a delimiter is right-flanking if the character to the left has a higher rank than the character to the right.

With minor subsequent changes, this led to the following text in the specification:

A left-flanking delimiter run is a delimiter run that is
(1) not followed by Unicode whitespace, and
either (2a) not followed by a Unicode punctuation character,
or (2b) followed by a Unicode punctuation character and preceded by Unicode whitespace or a Unicode punctuation character. (…)

A right-flanking delimiter run is a delimiter run that is
(1) not preceded by Unicode whitespace, and
either (2a) not preceded by a Unicode punctuation character,
or (2b) preceded by a Unicode punctuation character and followed by Unicode whitespace or a Unicode punctuation character.

  1. A single * character can open emphasis
    iff (if and only if) it is part of a left-flanking delimiter run.
  2. A single _ character can open emphasis
    iff it is part of a left-flanking delimiter run and
    either (a) not part of a right-flanking delimiter run
    or (b) part of a right-flanking delimiter run preceded by a Unicode punctuation character.
  3. A single * character can close emphasis
    iff it is part of a right-flanking delimiter run.
  4. A single _ character can close emphasis
    iff it is part of a right-flanking delimiter run and
    either (a) not part of a left-flanking delimiter run
    or (b) part of a left-flanking delimiter run followed by a Unicode punctuation character.
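The quoted definitions translate fairly directly into code. The sketch below is not the reference implementation. It makes three assumptions: `str.isspace()` approximates "Unicode whitespace", categories P* and S* count as "Unicode punctuation", and an empty string stands for the beginning or end of the line and is treated as whitespace. All function names are made up for illustration.

```python
import unicodedata


def is_ws(ch: str) -> bool:
    # Assumption: "" marks line start/end and counts as whitespace.
    return ch == "" or ch.isspace()


def is_punct(ch: str) -> bool:
    # Assumption: general categories P* and S* count as punctuation.
    return ch != "" and unicodedata.category(ch)[0] in ("P", "S")


def left_flanking(before: str, after: str) -> bool:
    if is_ws(after):                          # (1)
        return False
    if not is_punct(after):                   # (2a)
        return True
    return is_ws(before) or is_punct(before)  # (2b)


def right_flanking(before: str, after: str) -> bool:
    if is_ws(before):                         # (1)
        return False
    if not is_punct(before):                  # (2a)
        return True
    return is_ws(after) or is_punct(after)    # (2b)


def can_open(delim: str, before: str, after: str) -> bool:
    left = left_flanking(before, after)
    if delim == "*":                          # rule 1
        return left
    right = right_flanking(before, after)     # rule 2, for _
    return left and (not right or is_punct(before))


def can_close(delim: str, before: str, after: str) -> bool:
    right = right_flanking(before, after)
    if delim == "*":                          # rule 3
        return right
    left = left_flanking(before, after)       # rule 4, for _
    return right and (not left or is_punct(after))


# The * in "(*Gomphocarpus" can now open but not close.
print(can_open("*", "(", "G"), can_close("*", "(", "G"))  # True False
```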

In an issue comment on GitHub, I suggested that it may be useful, especially for East Asian languages and others written without inter-word spacing, to treat the different Unicode punctuation classes differently for flanking (or can open/close emphasis) behavior.

  • Open/Start Ps, e.g. (
  • Initial Pi, e.g. «
  • Close/End Pe, e.g. )
  • Final Pf, e.g. »
  • Connector Pc, e.g. _
  • Dash Pd, e.g. -
  • Other Po, e.g. ,

Current classification of delimiter runs

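  • ␣: any (Unicode) whitespace, including line start and end
  • .: any (Unicode) punctuation
  • a: neither punctuation nor whitespace

| Before | After | Flanking    | Open     | Close    |
|--------|-------|-------------|----------|----------|
| ␣      | ␣     | none        |          |          |
| a      | a     | both (2a)   | *        | *        |
| .      | .     | both (2b2)  | *, _ (b) | *, _ (b) |
| .      | a     | left (2a)   | *, _ (a) |          |
| ␣      | a     | left (2a)   | *, _ (a) |          |
| ␣      | .     | left (2b1)  | *, _ (a) |          |
| a      | .     | right (2a)  |          | *, _ (a) |
| a      | ␣     | right (2a)  |          | *, _ (a) |
| .      | ␣     | right (2b1) |          | *, _ (a) |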

Proposed differentiation of punctuation classes

  • (: Ps or Pi
  • ): Pe or Pf
  • ,: Pc, Pd or Po
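
| Before | After | Flanking      | Open  | Close |
|--------|-------|---------------|-------|-------|
| ,      | ,     | both          | *, _  | *, _  |
| (      | (     | left (both)   | *, _  | *, _  |
| )      | )     | right (both)  | *, _  | *, _  |
| (      | )     | both          | *, _? | *, _? |
| )      | (     | both (none?)  | *, _  | *, _  |
| (      | ,     | both (left?)  | *, _  | *, _  |
| )      | ,     | right (both)  | *, _  | *, _  |
| ,      | (     | left (both)   | *, _  | *, _  |
| ,      | )     | both (right?) | *, _  | *, _  |
| (      | a     | left          | *, _  |       |
| ␣      | (     | left          | *, _  |       |
| a      | (     | both right    | *, _  | *, _  |
| (      | ␣     | right         |       | *, _  |
| )      | a     | both left     | *, _  | *, _  |
| ␣      | )     | left          | *, _  |       |
| a      | )     | right         |       | *, _  |
| )      | ␣     | right         |       | *, _  |
| ,      | a     | left          | *, _  |       |
| ␣      | ,     | left          | *, _  |       |
| a      | ,     | right         |       | *, _  |
| ,      | ␣     | right         |       | *, _  |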

A left-flanking delimiter run is a delimiter run that is
(1) not followed by whitespace, and
either (2a) not followed by punctuation,
or (2b) followed by non-starting punctuation and preceded by whitespace or non-ending punctuation,
or (2c) followed by starting punctuation.
A strongly left-flanking delimiter run is a left-flanking delimiter run that is
either (a) not part of a right-flanking delimiter run
or (b) part of a right-flanking delimiter run preceded by punctuation
(…)

  1. A single * character can open emphasis
    iff (if and only if) it is part of a left-flanking delimiter run.
  2. A single _ character can open emphasis
    iff it is part of a strongly left-flanking delimiter run
  3. (…)
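As a rough illustration of the proposed left-flanking definition only (the right-flanking counterpart is elided above, so it is not attempted here), the following sketch maps the punctuation classes to Unicode general categories as listed earlier: Ps/Pi as starting, Pe/Pf as ending, Pc/Pd/Po as other. The names `punct_class` and `proposed_left_flanking` are hypothetical, symbol categories (S*) are deliberately left out because the proposal does not mention them, and an empty string is assumed to mean line start or end.

```python
import unicodedata


def is_ws(ch: str) -> bool:
    # Assumption: "" marks the start or end of the line and counts as whitespace.
    return ch == "" or ch.isspace()


def punct_class(ch: str) -> str:
    """Return "start", "end", "other", or "" for non-punctuation."""
    if ch == "":
        return ""
    cat = unicodedata.category(ch)
    if cat in ("Ps", "Pi"):
        return "start"
    if cat in ("Pe", "Pf"):
        return "end"
    if cat in ("Pc", "Pd", "Po"):
        return "other"
    return ""


def proposed_left_flanking(before: str, after: str) -> bool:
    if is_ws(after):          # (1) not followed by whitespace
        return False
    after_cls = punct_class(after)
    if after_cls == "":       # (2a) not followed by punctuation
        return True
    if after_cls == "start":  # (2c) followed by starting punctuation
        return True
    # (2b) followed by non-starting punctuation: requires whitespace
    # or non-ending punctuation before the run.
    return is_ws(before) or punct_class(before) in ("start", "other")


# A letter directly followed by a full-width opening bracket, as in text
# written without inter-word spaces, becomes left-flanking via (2c).
print(proposed_left_flanking("語", "（"))  # True
print(proposed_left_flanking(")", ")"))    # False
```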