Emphasis inside Strong broken in JS implementation when parentheses are involved


#1

Why does the following markdown string:

**Gomphocarpus (*Gomphocarpus physocarpus*, syn. *Asclepias physocarpa*)**

produce:

Gomphocarpus (Gomphocarpus physocarpus, syn. Asclepias physocarpa)**

i.e.
<p><em><em>Gomphocarpus (</em>Gomphocarpus physocarpus</em>, syn. <em>Asclepias physocarpa</em>)**</p>

I see this on the Try page of this site and on my own website, where I installed the prebuilt JS parser. Strangely, it is formatted correctly in the Preview pane to the right of the box where I typed this message (same on Stack Overflow, so I guess a different parser is used here for the WMD preview).


#2

It is because of rule 3 for the emphasis parsing:

3. A single * character can close emphasis iff it is not preceded by whitespace.

This rule should probably be extended to also exclude a preceding opening parenthesis, unless someone has a good counterexample?


#3

You’re right. Thanks.
It would be great to include parentheses. Writing my sentence as I did seemed so natural (and this is the right way to write in botany).

May I add an unrelated question? Is the source of this WMD editor available somewhere? As here and on Stack Overflow, I'd like to be able to plug in my image upload service.


#4

Yes, it’s probably a good idea to extend the rules so that closers can’t be preceded by ( (or [ or {?) and openers can’t be followed by ) (or ] or }?). Does anyone see problems with this?

Of course, it would mean you’d need to write

 *`)`*

if you wanted an italicized parenthesis.
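To make the proposed extension concrete, here is a minimal sketch, not the actual parser code. It assumes a single-character delimiter at index `pos`, and it assumes the opener rule mirrors rule 3 (an opener must not be followed by whitespace). The names `can_open` and `can_close` are made up for illustration.

```python
OPEN_BRACKETS = "([{"
CLOSE_BRACKETS = ")]}"

def can_close(text, pos):
    """Rule 3 plus the proposed extension: the delimiter at pos must not be
    preceded by whitespace or by an opening bracket."""
    if pos == 0:
        return False  # assumption: nothing to close at the start of input
    prev = text[pos - 1]
    return not prev.isspace() and prev not in OPEN_BRACKETS

def can_open(text, pos):
    """Assumed mirror rule: the delimiter at pos must not be followed by
    whitespace or by a closing bracket."""
    if pos + 1 >= len(text):
        return False  # assumption: nothing to open at the end of input
    nxt = text[pos + 1]
    return not nxt.isspace() and nxt not in CLOSE_BRACKETS
```

With this check, the `*` right after `(` in the botany example can no longer close emphasis, and the `*` in front of a bare `)` can no longer open it, which is why the code-span form above would be needed.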


#5

By the way, you can work around this for now by using _ for the inner emphasis markers.


#6

:smile: But I'm still in the dev process right now, so there is no hurry…


#7

@jgm - There could be a better approach for handling cases such as this - skip all symbols and punctuation before checking for whitespace before or after the emphasis char.

This would also handle cases like:

foo*. bar -- only closer because followed by space (skipping the dot)
**foo "*bar*" foo** -- skips quotes
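A rough sketch of this idea, under some assumptions: "symbols and punctuation" is taken to mean any Unicode general category starting with P or S, and the start and end of the input are treated like whitespace. The helper names are hypothetical and this is not taken from any existing implementation.

```python
import unicodedata

def is_punct_or_symbol(ch):
    # Assumption: general categories P* (punctuation) and S* (symbols).
    return unicodedata.category(ch)[0] in ("P", "S")

def neighbour(text, pos, step):
    """Walk from pos in direction step, skipping punctuation and symbols.
    Returns the first other character, or a space at either edge."""
    while 0 <= pos < len(text) and is_punct_or_symbol(text[pos]):
        pos += step
    return text[pos] if 0 <= pos < len(text) else " "

def can_open_close(text, start, end):
    """For the delimiter run text[start:end], return (can_open, can_close)."""
    before = neighbour(text, start - 1, -1)
    after = neighbour(text, end, +1)
    return (not after.isspace(), not before.isspace())
```

For `foo*. bar` the dot is skipped, the next character is a space, and the `*` comes out as closer only. Note that the loop in `neighbour` can walk arbitrarily far when a long run of punctuation surrounds the delimiter.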

#8

There are some good ideas here. It looks hairy, but if I understand correctly, the basic idea is fairly simple:

  1. Strings of * or _ are divided into “left flanking” and “right flanking,” based on two things: the character immediately before them and the character immediately after.
  2. Left-flanking delimiters can open emphasis, right-flanking delimiters can close it, and non-flanking delimiters are just regular text.
  3. A delimiter is left-flanking if the character to the left has a lower rank than the character to the right, according to the following ranking: spaces and newlines are 0, punctuation (Unicode categories Pc, Pd, Ps, Pe, Pi, Pf, Po, Sc, Sk, Sm or So) is 1, and everything else is 2. Similarly, a delimiter is right-flanking if the character to the left has a higher rank than the character to the right.
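
The three points above can be sketched in a few lines. This is only an illustration of the ranking as summarized here, not the code of any implementation; `rank` and `classify` are made-up names, and treating the start and end of input as rank 0 is an assumption.

```python
import unicodedata

# Categories counted as punctuation (rank 1) in the summary above.
PUNCT_CATEGORIES = {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
                    "Sc", "Sk", "Sm", "So"}

def rank(ch):
    """0 for whitespace (or, by assumption, the edge of the input),
    1 for punctuation, 2 for everything else."""
    if ch == "" or ch.isspace():
        return 0
    if unicodedata.category(ch) in PUNCT_CATEGORIES:
        return 1
    return 2

def classify(text, start, end):
    """For the delimiter run text[start:end], return
    (left_flanking, right_flanking)."""
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""
    return (rank(before) < rank(after), rank(before) > rank(after))
```

On the botany example, the `*` after `(` sits between rank 1 and rank 2, so it is left-flanking only, and the final `**` after `)` sits between rank 1 and rank 0, so it is right-flanking only. Only the characters immediately adjacent to the run are inspected.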

#9

@Knagis, your idea is also a good one; I just wanted to mention another alternative that might be worth looking into. Both require recognizing Unicode punctuation, which adds another level of complexity to the C parser. And your suggestion might require indefinite lookbehind or lookahead in cases where you have a whole pile of punctuation characters before or after the delimiter.


#10

I’ve implemented a solution on the newemph branch – still need to update the spec and the JS parser. But the examples in this thread are now handled well. If anyone is interested, the commit is here:


#11

I’ve polished this change and merged it into master. Spec, C, and JS implementations have all been updated.

% ./cmark
**Gomphocarpus (*Gomphocarpus physocarpus*, syn. *Asclepias physocarpa*)**
^D
<p><strong>Gomphocarpus (<em>Gomphocarpus physocarpus</em>, syn. <em>Asclepias physocarpa</em>)</strong></p>

#12

Do you have the commit line handy? I'd like to sync up the Python implementation later.