A conforming parser may render a soft line break in HTML either as a line break or as a space.
This introduces problems with languages like Chinese, which traditionally do not use spaces to separate words but do use soft line breaks to format text. A good example can be seen in a recent discussion at Ghost.
People who want this can configure stmd’s html writer to treat soft
breaks as hard breaks. That is explicitly allowed in the spec.
This may need to be handled by language sensitivity in the renderer. Writing in English with soft-wrapped paragraphs, as is my habit, I very much do not want to have to put a trailing space at the end of each line to keep the words from running together in the rendered output. Concretely:
In this same New Bedford there stands a Whaleman's Chapel,↩
and few are the moody fishermen, shortly bound for the Indian↩
Ocean or Pacific, who fail to make a Sunday visit to the↩
spot. I am sure that I did not.↩
should not become
In this same New Bedford there stands a Whaleman’s Chapel,and few are the moody fishermen, shortly bound for the IndianOcean or Pacific, who fail to make a Sunday visit to thespot. I am sure that I did not.
in the rendering (boldface indicates malformatted words).
Conversely, if I understand the Ghost discussion correctly (as a non-speaker of any East Asian language),
Aren’t line breaks in HTML also normalised to spaces? Has this ever been discussed on that end of things? It would be interesting to see if people have gone through this before.
This doesn’t seem too hard to incorporate in Standard Markdown, just a lot of trouble for implementations to go through. The spec could talk about rendering a line break “either as a line break or as a language-appropriate in-line replacement, e.g. a space”, but that alone would not magically fix things for Chinese text.
Thanks for the reply @vitaly. I guess the solution I’m going to go with is avoiding soft line breaks in CJK Markdown source. Users can do this to retain control when they don’t want soft breaks turning into spaces between their Chinese characters. This solution is a bit surprising at first, though. For someone who has put little thought into Markdown rendering, it’s not obvious that those returns are parsed into whitespace!
It would be nice if the parser enforced a whitespace rule that is language-aware. The “correct” way is to insert one space between a CJK character and anything else, including links and so on. This could be done using a Unicode range lookup.
On the other hand, I completely agree with you, that the current space is simple and effective for a spec.
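To make the Unicode range lookup idea a bit more concrete, here is a rough Python sketch. It assumes a soft break should become a space unless it sits between two CJK characters, in which case it is dropped. The names `CJK_RANGES`, `is_cjk` and `join_soft_breaks` are made up for illustration, and the ranges are only a small, incomplete sample of the relevant Unicode blocks.

```python
# Hypothetical helpers for illustration; not part of any Markdown implementation.
# The ranges are an incomplete sample of blocks used by Chinese and Japanese text.
CJK_RANGES = (
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
)


def is_cjk(ch: str) -> bool:
    """Return True if the character falls in one of the sampled CJK ranges."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in CJK_RANGES)


def join_soft_breaks(paragraph: str) -> str:
    """Join the lines of one paragraph, choosing the separator per line break.

    Assumption: a break between two CJK characters is dropped entirely;
    every other break becomes a single space.
    """
    lines = [line.strip() for line in paragraph.splitlines()]
    lines = [line for line in lines if line]
    if not lines:
        return ""
    result = lines[0]
    for line in lines[1:]:
        if is_cjk(result[-1]) and is_cjk(line[0]):
            result += line          # CJK on both sides: no separator
        else:
            result += " " + line    # anything else: one space
    return result
```

With this rule, soft-wrapped English keeps its spaces, Chinese text wrapped mid-sentence is joined without a gap, and a break between a CJK character and a Latin word still yields one space.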
@episodeyang your idea sounds promising. In general, it seems to me that this is something that is best handled at the renderer level (either automatically as you suggest or through an option), rather than in the parser.
EDIT: pandoc does have a Markdown extension ignore_line_breaks which does affect the parser; it causes newlines within the paragraph to be ignored rather than being treated as spaces or as hard line breaks. It also has east_asian_line_breaks, which causes the newlines to be ignored when they occur between two East Asian wide characters.
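For anyone who wants to experiment with doing this at the renderer level, here is a minimal Python sketch of a soft-break option. It is not pandoc's implementation. The function `render_paragraph` and its `softbreak` modes are hypothetical names, and "wide" is assumed to mean an East Asian Width property of `W` or `F` as reported by Python's `unicodedata` module.

```python
import html
import unicodedata


def is_wide(ch: str) -> bool:
    """True for East Asian Wide (W) or Fullwidth (F) characters."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def render_paragraph(lines, softbreak="space"):
    """Render the lines of one paragraph as an HTML <p> element.

    softbreak is a hypothetical renderer option:
      "space"      - every soft break becomes a space
      "hard"       - every soft break becomes <br />
      "east_asian" - a soft break between two wide characters is dropped,
                     any other soft break becomes a space
    """
    lines = [line.strip() for line in lines if line.strip()]
    if not lines:
        return "<p></p>"
    parts = [html.escape(lines[0], quote=False)]
    for prev, cur in zip(lines, lines[1:]):
        if softbreak == "hard":
            sep = "<br />\n"
        elif softbreak == "east_asian" and is_wide(prev[-1]) and is_wide(cur[0]):
            sep = ""
        else:
            sep = " "
        parts.append(sep + html.escape(cur, quote=False))
    return "<p>" + "".join(parts) + "</p>"
```

The parser output stays the same in all three modes; only the separator chosen when writing HTML changes, which is why this fits more naturally as a renderer option than as a parsing rule.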