Bug in commonmark.js about initial and final Unicode whitespace in a paragraph

squeuei · July 23, 2017, 8:53am

Given:

　aaa　

There are two IDEOGRAPHIC SPACEs (U+3000), one before and one after the word “aaa”.
(Update: I found that the characters weren’t actually IDEOGRAPHIC SPACEs; now corrected.)

Section 4.8 (Paragraphs) of the spec says:

The paragraph’s raw content is formed by concatenating the lines and removing initial and final whitespace.

However, the commonmark.js dingus returns:
(Update: I changed the link; it now points to Babelmark.)

<p>aaa</p>

These U+3000s shouldn’t be removed, because the spec defines U+3000 as a Unicode whitespace character but not as a whitespace character, and section 4.8 only removes the latter.
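To make the distinction concrete, here is a minimal sketch in plain JavaScript. It assumes the spec’s definition of a whitespace character as it stood around the time of this thread (space, tab, newline, line tabulation, form feed, carriage return). `trimSpecWhitespace` is a hypothetical helper written for illustration and is not part of commonmark.js. The comparison with `String.prototype.trim()` only shows one way the U+3000s could end up being lost; whether commonmark.js actually does this is an assumption, not something I have confirmed.

```javascript
// CommonMark "whitespace character" (assumed list, per the spec text of the
// time): space, tab, newline, line tabulation, form feed, carriage return.
const SPEC_WHITESPACE = " \t\n\u000B\f\r";

// Hypothetical helper (not a commonmark.js API): strip only spec-defined
// whitespace characters from both ends of a paragraph's raw content.
function trimSpecWhitespace(text) {
  let start = 0;
  let end = text.length;
  while (start < end && SPEC_WHITESPACE.includes(text[start])) start++;
  while (end > start && SPEC_WHITESPACE.includes(text[end - 1])) end--;
  return text.slice(start, end);
}

// An ASCII space and an IDEOGRAPHIC SPACE on each side of "aaa".
const raw = " \u3000aaa\u3000 ";

// trim() uses ECMAScript's own WhiteSpace definition, which covers every
// code point in general category Zs, so U+3000 is removed as well.
const trimmedByJs = raw.trim(); // "aaa"

// The spec-based trim removes only the ASCII spaces and keeps the U+3000s.
const trimmedBySpec = trimSpecWhitespace(raw); // "\u3000aaa\u3000"
```

With a trim like the second one, the paragraph content would keep both IDEOGRAPHIC SPACEs, which is what I would expect from the wording of 4.8.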

Brian_Lalonde · July 26, 2017, 5:23am

Interesting. I wonder what the rationale is for distinguishing between a (CommonMark) whitespace character and a Unicode whitespace character.

squeuei · July 26, 2017, 12:26pm

As far as I know, when CJKV writers use fullwidth whitespace, it is for text layout, because they don’t use fullwidth whitespace as a word divider. And Unicode whitespace characters in general seem to be for text layout, too.