Inline emphasis, Unicode punctuation & spacing, and proper grammar

worstje · July 11, 2020, 7:37am

I just signed up because I found myself unable to represent the intent of a writer and editor in Markdown. And before we continue: yes, I know it may look like a tiny, meaningless issue, but it might affect more people than the particular instance I ran into.

Take the following sample text, with the Markdown markup intentionally removed since it won’t work anyway:

Stupid—stupid—why is the world such a mess?

The writer’s intent in the original document is to have the second ‘stupid’ plus the following em-dash emphasized. However, they do not want a space after the em-dash; this punctuation is most commonly used without spaces on either side. (For example, see: this question or one of the other discussions that are easily found on this subject.)

Thus, conversion from the original writing medium led to the following naive approach:

Stupid—*stupid—*why is the world such a mess?

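This and its underscore alternative both fail to parse as valid emphasis. I have read the spec (6.4) about left- and right-flanking delimiter runs, and it makes sense: the closing delimiter is preceded by the em-dash (U+2014), which is punctuation, and followed by the first letter of the next word, which is neither Unicode whitespace nor punctuation. That makes the run left-flanking but not right-flanking, so it cannot close emphasis.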
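The flanking check described above can be sketched in a few lines of Python. This is only an illustration of my reading of the spec text, not the reference implementation: the helper names (`is_ws`, `is_punct`, `flanking`) are made up for this post, and the sketch assumes delimiter runs of exactly one character.

```python
import string
import unicodedata

def is_ws(ch):
    # Unicode whitespace as quoted from the spec: Zs, tab, CR, newline, form feed.
    return ch in "\t\r\n\f" or unicodedata.category(ch) == "Zs"

def is_punct(ch):
    # ASCII punctuation, or any code point in a Unicode P* general category.
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")

def flanking(text, i):
    """Return (left_flanking, right_flanking) for the one-character run at text[i]."""
    # The start and end of the line count as whitespace.
    prev = text[i - 1] if i > 0 else " "
    nxt = text[i + 1] if i + 1 < len(text) else " "
    left = not is_ws(nxt) and (not is_punct(nxt) or is_ws(prev) or is_punct(prev))
    right = not is_ws(prev) and (not is_punct(prev) or is_ws(nxt) or is_punct(nxt))
    return left, right

text = "Stupid\u2014*stupid\u2014*why is the world such a mess?"
first = text.index("*")
second = text.index("*", first + 1)
print(flanking(text, first))   # intended opener
print(flanking(text, second))  # intended closer
```

Both delimiters come out as left-flanking only, so the second one has nothing that lets it close the emphasis.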

I was like… OK, annoying. But I’ll work around it by inputting a zero-width space (U+200B) that does not render. It’s not a nice thing to do since grammar says not to put spaces on either side of an em-dash, but if it works, it works.

But it doesn’t work. Why? Simple: U+200B falls in the Cf category and is not listed as Zs in Unicode.
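For anyone who wants to check the categories themselves, Python's standard `unicodedata` module reports them directly. The selection of code points below is just my own choice for comparison.

```python
import unicodedata

# Code points relevant to this thread: two ordinary spaces, the ZWSP, and the em-dash.
code_points = (0x0020, 0x00A0, 0x200B, 0x2014)

categories = {cp: unicodedata.category(chr(cp)) for cp in code_points}

for cp, cat in categories.items():
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {cat}")
```

Only the code points reported as Zs satisfy the spec's whitespace definition; the zero width space is reported as Cf and the em-dash as Pd.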

At this point, how can I represent the desired markup in Markdown so that it renders as intended? Perhaps there is a codepoint that does not render and has all the correct behaviour, but if there is, I have not found it, nor have I been able to ascertain whether using that character would conflict with its established meaning.

At present, I see only two less-than-ideal workarounds, both of which leave visual distinction:

1. I insert a space after the em-dash, ruining the established and intended grammar, or
2. I adjust the markup to close the emphasis before the em-dash, but in turn I risk ruining the typographical look. An italicized em-dash is likely to look different than a plain one.

I realize option 2 may seem like the ideal solution, but that is only because this character tends to render as a mere strip of pixels in most text. Increase the font size, and the difference may become more obvious.

Similarly, I think other characters could suffer in the same way, given the variety of punctuation symbols that exist in Unicode and the grammar rules of all the languages of this world, most of which I am not familiar with.

Accounting for infinite possibilities of characters seems impossible, especially with regard to ambiguity. So despite wanting to say ‘please fix the exotic dashes and other punctuation’, that would either turn into some really nasty, error-prone special-casing of em-dashes in particular, or an outright butchery of existing syntax in many places.

So instead, I would like to propose the inclusion of the Zero Width Space in the list of characters that CommonMark classifies as Unicode whitespace:

A Unicode whitespace character is any code point in the Unicode Zs general category, or a tab ( U+0009 ), carriage return ( U+000D ), newline ( U+000A ), or form feed ( U+000C ).

Would be changed to:

A Unicode whitespace character is any code point in the Unicode Zs general category, or a tab ( U+0009 ), carriage return ( U+000D ), newline ( U+000A ), form feed ( U+000C ), or zero width space ( U+200B ).

This would allow users to insert invisible spacing to satisfy CommonMark syntax and parsing where needed in those cases where it is unavoidable. Besides, I think most users would expect a Zero Width Space to count as whitespace to begin with!
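To illustrate what the proposed wording would change, here is a rough Python sketch that runs the same flanking check under both whitespace definitions. `is_ws_proposed` is hypothetical (it encodes my proposal, not anything in the current spec), the other helper names are made up for this post, and the sketch again assumes one-character delimiter runs.

```python
import string
import unicodedata

ZWSP = "\u200b"

def is_ws_current(ch):
    # The definition as quoted from the spec.
    return ch in "\t\r\n\f" or unicodedata.category(ch) == "Zs"

def is_ws_proposed(ch):
    # Hypothetical: the current definition plus U+200B.
    return ch == ZWSP or is_ws_current(ch)

def is_punct(ch):
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")

def flanking(text, i, is_ws):
    """Return (left_flanking, right_flanking) using the given whitespace test."""
    prev = text[i - 1] if i > 0 else " "
    nxt = text[i + 1] if i + 1 < len(text) else " "
    left = not is_ws(nxt) and (not is_punct(nxt) or is_ws(prev) or is_punct(prev))
    right = not is_ws(prev) and (not is_punct(prev) or is_ws(nxt) or is_punct(nxt))
    return left, right

# Em-dash, closing star, then a zero width space before the next word.
text = "Stupid\u2014*stupid\u2014*" + ZWSP + "why?"
closer = text.rindex("*")

print(flanking(text, closer, is_ws_current))
print(flanking(text, closer, is_ws_proposed))
```

Under the current definition the closing delimiter is left-flanking only. Under the proposed one it becomes right-flanking only, which by my reading would let it close emphasis for both `*` and `_`.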

Admittedly, I lack experience with the specification, so I may be missing ways in which this is not viable. Intuitively, I feel it is unlikely for the Zero Width Space to exist in any existing documents in places that would change the way CommonMark syntax is interpreted, or for this change to break any existing features. (I imagine the most common reason for its existence is to allow wrapping to happen, but that would typically already happen when punctuation is present to begin with.)

If there is a more suitable Unicode character for this very purpose than U+200B that I have missed, and that already enables the desired behaviour without falling into codepoint abuse territory, I’d be very happy to hear it.

To those who made it here: thank you for taking the time to read this post from an outsider.

Crissov · July 11, 2020, 12:39pm

Where did you put the ZWSP? If it comes after the second asterisk, it works as expected.

worstje · July 15, 2020, 1:45am

I have finally had a chance to look at this issue again, because I wondered if I missed something, and if so, what I missed. I did my tests using the CommonMark interactive ‘dingus’.

My results, with all symbols replaced by placeholders for everyone’s sanity (but especially my own):

1. Stupid<EM-DASH><STAR>stupid<EM-DASH><STAR>why? nope
2. Stupid<EM-DASH><UNDERSCORE>stupid<EM-DASH><UNDERSCORE>why? nope
3. Stupid<EM-DASH><STAR>stupid<EM-DASH><STAR><ZWS>why? nope
4. Stupid<EM-DASH><UNDERSCORE>stupid<EM-DASH><UNDERSCORE><ZWS>why? nope
5. Stupid<EM-DASH><STAR>stupid<EM-DASH><ZWS><STAR>why? YAY!!!
6. Stupid<EM-DASH><UNDERSCORE>stupid<EM-DASH><ZWS><UNDERSCORE>why? nope

So your suggestion, which is example 3, does not in fact work. I wonder where you got that test result from? However, example 5 (em-zws-star) does work.
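The six results above can be reproduced with a simplified model of the spec's open/close rules. This is a sketch under assumptions, not the full emphasis algorithm: it only asks whether the first delimiter can open and the last one can close, it assumes one-character delimiter runs, and all names in it are invented for this post.

```python
import string
import unicodedata

EM, ZWS = "\u2014", "\u200b"

def is_ws(ch):
    return ch in "\t\r\n\f" or unicodedata.category(ch) == "Zs"

def is_punct(ch):
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")

def flags(text, i):
    """Return (left_flanking, right_flanking, prev_is_punct, next_is_punct)."""
    prev = text[i - 1] if i > 0 else " "
    nxt = text[i + 1] if i + 1 < len(text) else " "
    left = not is_ws(nxt) and (not is_punct(nxt) or is_ws(prev) or is_punct(prev))
    right = not is_ws(prev) and (not is_punct(prev) or is_ws(nxt) or is_punct(nxt))
    return left, right, is_punct(prev), is_punct(nxt)

def can_open(text, i):
    left, right, prev_punct, _ = flags(text, i)
    if text[i] == "*":
        return left
    # Underscore: stricter, to avoid intraword emphasis.
    return left and (not right or prev_punct)

def can_close(text, i):
    left, right, _, next_punct = flags(text, i)
    if text[i] == "*":
        return right
    return right and (not left or next_punct)

def emphasizes(text):
    delim = "*" if "*" in text else "_"
    return can_open(text, text.index(delim)) and can_close(text, text.rindex(delim))

cases = [
    f"Stupid{EM}*stupid{EM}*why?",
    f"Stupid{EM}_stupid{EM}_why?",
    f"Stupid{EM}*stupid{EM}*{ZWS}why?",
    f"Stupid{EM}_stupid{EM}_{ZWS}why?",
    f"Stupid{EM}*stupid{EM}{ZWS}*why?",
    f"Stupid{EM}_stupid{EM}{ZWS}_why?",
]
results = [emphasizes(t) for t in cases]
print(results)
```

In this model only the em-zws-star case succeeds. With the ZWS before the closing delimiter, the run is both left- and right-flanking; a `*` may close in that situation, while a `_` may not unless it is followed by punctuation.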

Basically, I suspect my problem when running my original tests may have been two-fold:

a) I had tested with the ZWS after the closing *, but not before it.

b) I may have forgotten to test with the * and only tested with _ on some occasions, since that is the format my pipeline produces and the results had not differed in any previous tests. My bad. (The realization that * and _ are subject to different rules in the spec definitely did not cross my mind when I ran my previous tests!)

I am a bit too tired to dig into the specification at this point, but your suggestion (example 3) is definitely what makes more intuitive sense to me than the one I ended up going with. I wonder why they differ, but right now I am too happy to possibly have a working situation to care. xD

Thank you for your comment that made me give all these tests a second go!

Crissov · July 15, 2020, 5:43am

I tested in Dingus as well and it works like I said.