Hi all – In the spec, why is “Unicode whitespace character” a superset of “Whitespace character”? It’s counterintuitive that a more specific term is a superset of a broader term.
In the spec’s section 2.1, “Whitespace character” is defined as a space, tab, newline, line tabulation, form feed, or carriage return.
In the same section, “Unicode whitespace character” is defined as “any code point in the Unicode Zs general category, or a tab (U+0009), carriage return (U+000D), newline (U+000A), or form feed (U+000C)”.
Do we need two different categories of whitespace character? Would it be better to just give “Whitespace character” the definition currently used for “Unicode whitespace character”?
Why would “Whitespace character” be a subset of “Unicode whitespace character”?
There’s an error in the definition of “Unicode whitespace character”: line tabulation (U+000B) is missing. It’s not in the list, and it’s not in the Unicode Zs category. (You might think space is also missing, because it isn’t listed, but it’s covered by the Zs category.)
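This is easy to check mechanically. Here’s a quick sketch in Python, using the stdlib `unicodedata` module to stand in for the Zs-category test:

```python
import unicodedata

# ASCII "whitespace character" per the spec's section 2.1
ascii_ws = {"\u0020", "\u0009", "\u000A", "\u000B", "\u000C", "\u000D"}

# "Unicode whitespace character": the Zs general category, plus the four
# explicitly listed code points (tab, carriage return, newline, form feed)
def is_unicode_ws(ch):
    return unicodedata.category(ch) == "Zs" or ch in "\u0009\u000D\u000A\u000C"

# Which ASCII whitespace characters fail the Unicode-whitespace test?
missing = {ch for ch in ascii_ws if not is_unicode_ws(ch)}
print(sorted(f"U+{ord(c):04X}" for c in missing))  # ['U+000B']
```

Line tabulation falls through both tests: its general category is Cc, not Zs, and it isn’t one of the four listed code points.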
It took me a couple passes to understand your point.
The spec uses “whitespace character” to mean “ASCII whitespace character”, as opposed to “Unicode whitespace character”. The latter is most definitely a superset of the former.
- Do we need two different categories of whitespace character? Would it be better to just give “Whitespace character” the definition currently used for “Unicode whitespace character”?
Yes, I think we do need both. Whitespace serves two distinct purposes in the spec:
- a lexical purpose within Markdown’s markup grammar, e.g. the space required between the list marker and the content of this list item.
- its normal linguistic role, e.g. the spaces between the words I am writing now.
I haven’t reviewed the entire spec just now, but I’m sure that only ASCII whitespace is recognized in role one, probably for parsing-performance reasons. Unicode whitespace should obviously be recognized in role two.
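The two roles can be illustrated with a toy sketch (a hypothetical one-rule grammar, not the spec’s actual list syntax): the marker must be followed by an ASCII space (role one), while the content after it may be any Unicode text (role two):

```python
import re

# Toy illustration, NOT the spec's grammar: a list item is "-" followed by
# exactly one ASCII space, then arbitrary content.
LIST_ITEM = re.compile(r"^- (?P<content>.*)$")

# Unicode is fine inside the content (role two)...
print(bool(LIST_ITEM.match("- héllo wörld")))  # True

# ...but an ideographic space (U+3000) after the marker is not lexical
# whitespace (role one), so this is not recognized as a list item.
print(bool(LIST_ITEM.match("-\u3000héllo")))   # False
```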
- Why would “Whitespace character” be a subset of “Unicode whitespace character”?
Yes, it’s counterintuitive, as you put it, if you look at the terms rather than their definitions. I’m sure “whitespace character” (and often just “whitespace”) is used as shorthand for “ASCII whitespace character” because otherwise the body of the spec would become unwieldy. Try reading through it while inserting “ASCII whitespace character” every place you see “whitespace character” or “whitespace”, and “Unicode character that is not an ASCII whitespace character” every place you see “non-whitespace”.
Perhaps the terms section should clarify this shorthand.
From a quick scan of their specs, HTML, Python, and Swift also only allow ASCII whitespace to serve as whitespace in the lexical context.
The terminology is confusing, agreed. It is cleaned up in a PR that has not yet been merged but probably should be (the author dropped out before discussion was complete, but I think only minor changes are needed now):
What’s the significance of whitespace as allowed by HTML, Python, and Swift? I guess HTML is the presumptive compilation/rendering target for Markdown (though the relationship between Markdown and HTML could use some formal clarification).
The significance is that all of these, as well as most other text forms that must be parsed efficiently, only recognize ASCII whitespace lexically, even when they recognize Unicode for data. The only exception I found was ECMAScript, which accepts a broader set of whitespace lexically, but still not the full Unicode set – I’ll bet that was a performance choice as well.
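For comparison, here’s a sketch of ECMAScript’s lexical WhiteSpace set as I understand it (tab, VT, FF, space, NBSP, ZWNBSP, plus the Zs category); note that it still excludes code points like U+0085 that carry the Unicode White_Space property:

```python
import unicodedata

# ECMAScript's lexical WhiteSpace production (as I read the spec):
# TAB, VT, FF, SP, NBSP, ZWNBSP, plus any Zs code point.
# Line terminators (LF, CR, LS, PS) are a separate production.
def es_whitespace(ch):
    return ch in "\t\x0b\x0c \u00a0\ufeff" or unicodedata.category(ch) == "Zs"

# U+0085 (NEXT LINE) has the Unicode White_Space property (Python's
# str.isspace() reflects that), yet ECMAScript does not treat it as whitespace.
print("\u0085".isspace())        # True
print(es_whitespace("\u0085"))   # False

# The ideographic space (Zs) IS ECMAScript whitespace, unlike in CommonMark.
print(es_whitespace("\u3000"))   # True
```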
There is nothing to gain by supporting other forms of whitespace lexically. It isn’t an internationalization issue. Unicode needs to be supported for content, not markup.
Are you saying that the “Unicode whitespace” category in the spec is currently supported for markup?
Do we need to define two separate whitespace categories in the spec? Do we need to define any whitespace categories, or can we instead just specify particular whitespace characters as needed?
No, I’m saying exactly the opposite: not for markup, only for content. That’s what I mean by “lexically”. Take a look at the spec and you’ll see that the places where Unicode is supported are content segments only, not markup or the whitespace used to delimit markup.
The PR that @jgm links to does exactly what you suggest: “specify particular whitespace characters as needed” in the lexical/markup contexts instead of the ASCII “whitespace character” set.
I imagine CJK users may sometimes prefer a fullwidth space (U+3000), say between fullwidth text and markup? No idea if it helps unless we also support fullwidth punctuation for markup, which is a separate discussion.
I really don’t know what I’m talking about ;-), just wanted to point out “nothing to gain” is not self-evident…
OTOH there is the problem that the set of Unicode whitespace characters grows over time, so recognizing it in markup may create an interoperability risk.
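One quick way to see the moving target is to enumerate the Zs category for whatever Unicode version your runtime ships (a sketch in Python; the exact count depends on that version, so two implementations built against different Unicode data can disagree on what counts as “Unicode whitespace”):

```python
import sys
import unicodedata

# Enumerate the Zs general category for the Unicode version bundled with
# this Python build. Membership shifts as Unicode evolves, which is the
# interoperability concern for markup.
zs = [c for c in map(chr, range(sys.maxunicode + 1))
      if unicodedata.category(c) == "Zs"]

print(unicodedata.unidata_version, len(zs))
print(f"U+{ord(zs[-1]):04X}")  # U+3000, the fullwidth (ideographic) space
```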