Soft line-breaks should not introduce spaces

Relevant part of the spec: http://jgm.github.io/stmd/spec.html#soft-line-breaks

A conforming parser may render a soft line break in HTML either as a line break or as a space.

This introduces problems with languages like Chinese that traditionally do not use spaces to separate words but do use soft line breaks to format text. A good example can be seen in a recent discussion at Ghost.

People who want this can configure stmd’s html writer to treat soft
breaks as hard breaks. That is explicitly allowed in the spec.

This may need to be handled by language sensitivity in the renderer. Writing in English with soft-wrapped paragraphs, as is my habit, I very much do not want to have to put a trailing space at the end of each line just to keep the words from running together in the rendered output. Concretely:

In this same New Bedford there stands a Whaleman's Chapel,↩
and few are the moody fishermen, shortly bound for the Indian↩
Ocean or Pacific, who fail to make a Sunday visit to the↩
spot. I am sure that I did not.↩

should not become

In this same New Bedford there stands a Whaleman’s Chapel,and few are the moody fishermen, shortly bound for the IndianOcean or Pacific, who fail to make a Sunday visit to thespot. I am sure that I did not.

in the rendering (the malformatted words are "Chapel,and", "IndianOcean", and "thespot").

Conversely, if I understand the Ghost discussion correctly (as a non-speaker of any East Asian language),

不必说碧绿的菜畦,光滑的石井栏,高大的皂荚树,紫红的桑椹;也不必说鸣蝉↩
在树叶里长吟,肥胖的黄蜂伏在菜花上,轻捷的叫天子(云雀)忽然从草间直窜↩
向云霄里去了。

should not become

不必说碧绿的菜畦,光滑的石井栏,高大的皂荚树,紫红的桑椹;也不必说鸣蝉_在树叶里长吟,肥胖的黄蜂伏在菜花上,轻捷的叫天子(云雀)忽然从草间直窜_向云霄里去了。

in the rendering (_ characters indicate inappropriate spaces).

There’s really no way to get this right without language sensitivity. The typographic conventions are just too different.

Aren’t line breaks in HTML also normalised to spaces? Has this ever been discussed on that end of things? It would be interesting to see if people have gone through this before.

This doesn’t seem too hard to incorporate in Standard Markdown, just a lot of trouble for implementations to go through. The spec could talk about rendering a line break “either as a line break or as a language-appropriate in-line replacement, e.g. a space”, but that alone would not magically fix things for Chinese.

Should this be done at the parser level or in the renderer? @vitaly @codinghorror

Here is a simple heuristic I think would work well:

  • if both the last character of a line and the first character of the next line are CJK, then the softbreak should be rendered as nothing.
  • if the character on either side is a non-CJK character, then the softbreak should be rendered as a space.
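The two rules above can be sketched in a few lines of Python. This is only an illustration on plain text, not how any particular renderer does it: the names `is_cjk` and `render_soft_breaks` are made up for this example, and "CJK" is approximated here as "East Asian Wide or Fullwidth" according to the Unicode data that ships with Python.

```python
import unicodedata


def is_cjk(ch: str) -> bool:
    # Assumption: treat East Asian Wide ("W") and Fullwidth ("F")
    # characters as "CJK". Other definitions are possible.
    return unicodedata.east_asian_width(ch) in ("W", "F")


def render_soft_breaks(paragraph: str) -> str:
    """Join the soft-wrapped lines of one plain-text paragraph."""
    lines = [line.strip() for line in paragraph.splitlines()]
    lines = [line for line in lines if line]
    if not lines:
        return ""
    out = lines[0]
    for nxt in lines[1:]:
        # Rule 1: CJK on both sides of the break -> render nothing.
        # Rule 2: anything else -> render a single space.
        sep = "" if is_cjk(out[-1]) and is_cjk(nxt[0]) else " "
        out += sep + nxt
    return out
```

With this sketch, "shortly bound for the Indian" followed by "Ocean or Pacific" is joined with a space, while "也不必说鸣蝉" followed by "在树叶里长吟" is joined with nothing.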

https://www.w3.org/TR/css-text-3/#word-spacing-property

I don’t know. I have no experience with Asian languages.

I’d say a heuristic is not a good basis for a standard, because it’s not 100% predictable. That can be attractive for private use, but not for a spec.

Another problem is that the break can fall on a border between a link and text, etc. Comparing the neighbouring visible characters then becomes non-trivial.
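To make the link/text border problem concrete, here is a rough sketch of a renderer that has to look past markup to find the visible characters around a soft break. The token format (a list of `(kind, content)` pairs) and the names `neighbour_char` and `render` are hypothetical and only loosely modelled on token-stream parsers; a real implementation would have more inline kinds to consider.

```python
import unicodedata


def is_wide(ch):
    # Assumption: "CJK" is approximated as East Asian Wide/Fullwidth.
    return unicodedata.east_asian_width(ch) in ("W", "F")


def neighbour_char(tokens, start, step):
    """Return the nearest visible character before (step=-1) or after
    (step=+1) tokens[start], skipping tokens that carry no text."""
    i = start + step
    while 0 <= i < len(tokens):
        kind, content = tokens[i]
        if kind == "text" and content:
            return content[0] if step > 0 else content[-1]
        i += step
    return None


def render(tokens):
    out = []
    for i, (kind, content) in enumerate(tokens):
        if kind == "text":
            out.append(content)
        elif kind == "softbreak":
            before = neighbour_char(tokens, i, -1)
            after = neighbour_char(tokens, i, +1)
            both_wide = (before is not None and after is not None
                         and is_wide(before) and is_wide(after))
            out.append("" if both_wide else " ")
        # Markup tokens (link_open, link_close, ...) emit nothing here.
    return "".join(out)
```

Even in this toy version the soft break cannot simply inspect the previous and next token; it has to walk outward until it finds text, which is the extra complexity described above.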

Thanks for the reply @vitaly. I guess the solution I’m going to go with is avoiding soft line breaks in CJK Markdown source. Users can do this to retain control when they don’t want soft breaks turning into spaces between their Chinese characters. This solution is a bit surprising at first, though. For someone who has put little thought into Markdown rendering, it’s not obvious that those returns are parsed into whitespace!

It would be nice if the parser enforced a white-space rule that is language aware. The “correct” way is to insert one space between a CJK character and anything else, including links and so on. This could be done using a Unicode range look-up.
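As a rough idea of what a Unicode range look-up could look like, here is a small Python sketch. The ranges listed (kana, CJK Extension A, CJK Unified Ideographs) are an assumption and far from complete, the function name `space_cjk_boundaries` is invented for this example, and "anything else" is narrowed to ASCII letters and digits so that no spaces are added next to punctuation.

```python
import re

# Assumption: a deliberately small set of ranges, for illustration only.
# U+3040-U+30FF kana, U+3400-U+4DBF Extension A, U+4E00-U+9FFF ideographs.
CJK = "\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff"

_CJK_THEN_LATIN = re.compile(f"([{CJK}])([A-Za-z0-9])")
_LATIN_THEN_CJK = re.compile(f"([A-Za-z0-9])([{CJK}])")


def space_cjk_boundaries(text: str) -> str:
    """Insert one space wherever a CJK character touches an ASCII
    letter or digit; text that is already spaced is left unchanged."""
    text = _CJK_THEN_LATIN.sub(r"\1 \2", text)
    return _LATIN_THEN_CJK.sub(r"\1 \2", text)
```

For example, `space_cjk_boundaries("使用Markdown写作")` gives `"使用 Markdown 写作"`, while purely Chinese or purely English text passes through untouched.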

On the other hand, I completely agree with you, that the current space is simple and effective for a spec.

Looks like Pandoc has implemented this intelligent white-space handling; see the “linebreak smart option” discussion.

@episodeyang your idea sounds promising. In general, it seems to me that this is something that is best handled at the renderer level (either automatically as you suggest or through an option), rather than in the parser.

EDIT: pandoc does have a Markdown extension ignore_line_breaks which does affect the parser; it causes newlines within the paragraph to be ignored rather than being treated as spaces or as hard line breaks. It also has east_asian_line_breaks, which causes the newlines to be ignored when they occur between two East Asian wide characters.

Plugin for markdown-it to drop softbreaks between CJK chars. Similar to one in pandoc.