Why is "Unicode whitespace character" a superset of "Whitespace character"?; related error

Hi all – In the spec, why is “Unicode whitespace character” a superset of “Whitespace character”? It’s counterintuitive that a more specific term is a superset of a broader term.

In the spec’s section 2.1, “Whitespace character” is defined as a space, tab, newline, line tab, form feed, or carriage return.

In the same section, “Unicode whitespace character” is defined as “any code point in the Unicode Zs general category, or a tab ( U+0009 ), carriage return ( U+000D ), newline ( U+000A ), or form feed ( U+000C ).”

  1. Do we need two different categories of whitespace character? Would it be better to just give “Whitespace character” the definition currently used for “Unicode whitespace character”?

  2. Why would “Whitespace character” be a subset of “Unicode whitespace character”?

  3. There’s an error in the definition of Unicode whitespace character: line tab is missing (U+000B). It’s not in the list and it’s not in the Unicode Zs category. (You might think space is also missing, because it’s not listed, but it’s in the Zs category.)
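To make point 3 concrete, here is a minimal sketch that encodes the two definitions as quoted above and checks which "whitespace characters" fall outside "Unicode whitespace character". The helper names (`is_whitespace_character`, `is_unicode_whitespace_character`) are made up for illustration and are not taken from the spec or from any implementation.

```python
import unicodedata

# "Whitespace character" as quoted from section 2.1:
# space, tab, newline, line tab, form feed, carriage return.
ASCII_WHITESPACE = {"\u0020", "\u0009", "\u000A", "\u000B", "\u000C", "\u000D"}

# Code points listed explicitly in the "Unicode whitespace character"
# definition, in addition to the Zs general category.
UNICODE_EXTRAS = {"\u0009", "\u000D", "\u000A", "\u000C"}


def is_whitespace_character(ch):
    """Hypothetical helper: membership in the ASCII whitespace set."""
    return ch in ASCII_WHITESPACE


def is_unicode_whitespace_character(ch):
    """Hypothetical helper: Zs category, or one of the listed extras."""
    return unicodedata.category(ch) == "Zs" or ch in UNICODE_EXTRAS


# Which ASCII whitespace characters are NOT Unicode whitespace characters?
missing = sorted(
    f"U+{ord(ch):04X}"
    for ch in ASCII_WHITESPACE
    if not is_unicode_whitespace_character(ch)
)
print(missing)  # ['U+000B']
```

Space (U+0020) passes the check because it is in the Zs category, while line tab (U+000B) has category Cc and is not listed, so under the quoted wording the superset relation does not strictly hold.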

It took me a couple passes to understand your point.

The spec uses “whitespace character” to mean “ASCII whitespace character”, as opposed to “Unicode whitespace character”. The latter is most definitely a superset of the former.

  1. Do we need two different categories of whitespace character? Would it be better to just give “Whitespace character” the definition currently used for “Unicode whitespace character”?

Yes, I think we do need both. Whitespace serves two distinct purposes in the spec:

  1. a lexical purpose within Markdown’s markup grammar, e.g. the space required between the list marker and the content of this line item.
  2. its normal linguistic role, e.g. the spaces between the words I am writing now.

I haven’t reviewed the entire spec just now, but I’m sure that only ASCII whitespace is recognized for role one, and probably for parsing performance reasons. Obviously, Unicode should be recognized in role two.
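As a rough illustration of the two roles, here is a deliberately simplified sketch. It is not the spec's list-item algorithm, and `split_bullet_item` is a hypothetical helper. It assumes a bullet item is a `-` followed by one ASCII space or tab, and treats everything after that as content where any Unicode character may appear.

```python
def split_bullet_item(line):
    """Return the content of a '-' bullet item, or None if it is not one.

    Role one (lexical): only an ASCII space or tab after the marker counts.
    Role two (content): the rest of the line is passed through untouched,
    including any Unicode whitespace it contains.
    """
    if len(line) >= 2 and line[0] == "-" and line[1] in (" ", "\t"):
        return line[2:]
    return None


# ASCII space after the marker: recognized; U+00A0 survives in the content.
print(repr(split_bullet_item("- caf\u00e9\u00a0au lait")))

# Ideographic space (U+3000) after the marker: not recognized as markup.
print(repr(split_bullet_item("-\u3000item")))
```

The lexical check is a comparison against two fixed characters, while the content is never inspected at all, which is one way to read the "performance" argument.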

  2. Why would “Whitespace character” be a subset of “Unicode whitespace character”?

Yes, it’s counterintuitive, as you put it, if you look at the terms rather than their definitions. I’m sure “whitespace character”, and often just “whitespace”, is used as shorthand for “ASCII whitespace character” because otherwise the body of the spec would become unwieldy. Try reading through it but inserting “ASCII whitespace character” every place you see “whitespace character” and “whitespace”, and “Unicode character that is not an ASCII whitespace character” every place you see “non-whitespace”.

Perhaps the terms section should clarify this shorthand.

From a quick scan of the specs, the following languages also only allow ASCII whitespace to serve as whitespace in the lexical context:

ECMAScript (i.e. JavaScript) apparently accepts a broader set, though not the full Unicode whitespace set.

The terminology is confusing, agreed. It is cleaned up in a PR
that has not yet been merged but probably should be (the author
dropped out before discussion was complete, but I think only
minor changes are needed now):

What’s the significance of whitespace as allowed by HTML, Python, and Swift? I guess HTML is the presumptive compilation/rendering target for Markdown (though the relationship between Markdown and HTML could use some formal clarification).

The significance is that all of these, as well as most other text formats that must be parsed efficiently, recognize only ASCII whitespace lexically, even when they accept Unicode for data. The only exception I found was ECMAScript, which accepts a broader set of whitespace lexically, but still not the full Unicode set – I’ll bet that was a performance choice as well.

There is nothing to gain by supporting other forms of whitespace lexically. It isn’t an internationalization issue. Unicode needs to be supported for content, not markup.

Good point.

Are you saying that the “Unicode whitespace” category in the spec is currently supported for markup?

Do we need to define two separate whitespace categories in the spec? Do we need to define any whitespace categories, or can we instead just specify particular whitespace characters as needed?

No, I’m saying exactly the opposite: not for markup, only for content. That’s what I mean by “lexically”. Take a look at the spec and you’ll see that Unicode whitespace is supported in content segments only, not in markup or in the whitespace used to delimit markup.

The PR that @jgm links to does exactly what you suggest: “specify particular whitespace characters as needed” in the lexical/markup contexts instead of the ASCII “whitespace character” set.

I imagine CJK users may sometimes prefer fullwidth space (U+3000?), say between fullwidth text and markup? No idea if it helps unless we also support fullwidth punctuation for markup, which is a separate discussion.
I really don’t know what I’m talking about ;-), just wanted to point out “nothing to gain” is not self-evident…

OTOH there is the problem that the set of Unicode whitespace is growing with time, so recognizing it in markup may open up an interoperability risk.