The terminology may need some tweaking (it’s awkward to have “unicode whitespace” not a species of “whitespace”).
Yes. It really is. [ Although I’m confused about your term “species of”? These characters don’t reproduce while we’re not looking, right? That’s equivalent to “superset”? Generalization? Every “whitespace” should be a “Unicode whitespace”? Or the other way round? Wouldn’t “ASCII whitespace” as a subset or sub-character-class of “whitespace” make more sense? These characters are all Unicode characters (or rather: are supposed to be all Unicode characters, alas the specification manages to not even fix its character repertoire). ]
It’s also “awkward”, for example, to have no word (no term available for use) for the set of (what everyone else just calls) graphic characters. In the specification, all ASCII characters are simply partitioned into
- whitespace characters,
- non-whitespace characters
and a similar, but overlapping partition exists among Unicode characters into
- Unicode whitespace characters
- eeeehm, well: everything else. No name for that either.
Note that the spec’s definition of non-whitespace character encompasses not only all remaining control characters, but also, for example, all surrogates and all private use planes of Unicode. There’s not a single word about what feeding some or all of these “characters” to an implementation is supposed to achieve, or what the result of processing them should be. (Well, the null character is mentioned. Nice.)
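To make the overlap concrete, here is a minimal sketch of the two classes as I read the spec draft under discussion. The function names are mine, and the two definitions are my reading of the draft (an assumption, not quoted text). The Zs list is the Unicode "space separator" category.

```cpp
// "Whitespace character", as I read the draft:
// TAB, LF, LINE TABULATION, FORM FEED, CR (U+0009..U+000D) and SPACE.
constexpr bool is_whitespace(char32_t c) {
    return c == 0x20 || (c >= 0x09 && c <= 0x0D);
}

// Unicode general category Zs (space separators).
constexpr bool is_zs(char32_t c) {
    return c == 0x0020 || c == 0x00A0 || c == 0x1680 ||
           (c >= 0x2000 && c <= 0x200A) ||
           c == 0x202F || c == 0x205F || c == 0x3000;
}

// "Unicode whitespace character", as I read the draft:
// Zs plus TAB, LF, FORM FEED, CR. LINE TABULATION (U+000B) is missing.
constexpr bool is_unicode_whitespace(char32_t c) {
    return is_zs(c) || c == 0x09 || c == 0x0A || c == 0x0C || c == 0x0D;
}

// Neither class contains the other:
static_assert(is_whitespace(0x000B) && !is_unicode_whitespace(0x000B),
              "LINE TABULATION is whitespace only");
static_assert(!is_whitespace(0x2002) && is_unicode_whitespace(0x2002),
              "EN SPACE is Unicode whitespace only");
```

Under these assumptions LINE TABULATION and EN SPACE each fall into exactly one of the two classes, which is why neither term can be a "species" of the other.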
The only place the category of Unicode whitespace matters in the spec is in the spec for emphasis/strong emphasis.
Yes. If you deem this area “unimportant”, why the set of 17 rules to describe a syntax which is supposedly “more complicated than HTML”? If it doesn’t matter, why not simply stick with SPACE here and “non-SPACE” there (which in a line could very easily, very cleanly mean: U+0021 … U+007E, resp. “expanded” to Unicode)?
The argument “the only place where X matters is Y” is really, really weak if you think about it.
I don’t think it would matter too much if we used “whitespace character” instead.
But I certainly do think it would matter, because inline “whitespace” would then for example suddenly exclude EN SPACE too, and how are you going to explain that to your users?
But it’s a pretty unusual case where this will matter, so I don’t think this decision is particularly important.
Maybe. Maybe not. Who knows? But …
There’s one catch though. If you decide now one way or the other (remember: the specification wants to be “unambiguous”, in the sense of “prescribing every little detail of translation”, so there’s no room for “implementation-specified” behavior variation!), you can’t change that decision afterwards without creating two mutually incompatible versions of CommonMark, because every implementation can be “conforming” to at most one of the two specifications.
But you can’t defer that decision either: Once “release 1.0” is out, what’s in the spec and/or what the reference implementation does is the arbiter of CommonMark conformance, so “feature-wise”, and “tweaking-wise”: that’s it.
But never mind, that’s really not particularly important.
In any case, treating both no-break space and form feed as whitespace for the purposes of emphasis delineation seems reasonable to me.
Well. For the purpose of emphasis markup, maybe.
On the other hand, to have a line containing a FORM FEED character right in the middle of it, so that emphasis could possibly come into play in the first place, seems not that reasonable to me. Whether you treat it the same as NBSP then, well, who cares.
Quick reminder:
FF - FORM FEED
Notation: (C0)
Representation: 00/12
FF causes the active presentation position to be moved to the corresponding character position of the line at the page home position of the next form or page in the presentation component. The page home position is established by the parameter value of SET PAGE HOME (SPH).
Pretty much the same thing as:
NO-BREAK SPACE (NBSP)
A graphic character the visual representation of which consists of the absence of a graphic symbol, for use when a line break is to be prevented in the text as presented.
The only place the category of “ASCII punctuation character” matters in the spec is the part about backslash escapes.
Nope. I was talking about “punctuation character” in the CommonMark section 6.2 sense. You’re confused by your own terminology here, which is understandable:
- DOLLAR SIGN is by definition an “ASCII punctuation character” (section 6.2).
- It is thus by definition a general “punctuation character” (section 6.2).
- POUND SIGN is not in ISO 646 IRV, hence it is not an “ASCII character”.
- So it is no “ASCII punctuation character” either, right?
- But it has (as has DOLLAR SIGN, btw) the General category Sc [symbol, currency].
- “Punctuation characters” are “ASCII punctuation characters” and Unicode category P*.
- So POUND SIGN is not a general “punctuation character”.
- Being a general “punctuation character” matters not for “backslash escapes”, but for emphasis markup.
Maybe you misread something about “escaping characters” into my observation.
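The chain of definitions above can be written down as a small sketch. Everything here is illustrative: `toy_general_category` is a hypothetical stand-in for a real Unicode property lookup and knows only the three code points named in its comments, and the function names are mine.

```cpp
// ASCII punctuation: the ranges ! to /, : to @, [ to `, and { to ~.
constexpr bool is_ascii_punctuation(char32_t c) {
    return (c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x40) ||
           (c >= 0x5B && c <= 0x60) || (c >= 0x7B && c <= 0x7E);
}

// Toy subset of Unicode general categories (assumption: a real
// implementation would consult the full Unicode character database
// and cover every P* category: Pc, Pd, Ps, Pe, Pi, Pf, Po).
enum class Category { Po, Sc, Other };

constexpr Category toy_general_category(char32_t c) {
    return c == 0x00A1 ? Category::Po   // INVERTED EXCLAMATION MARK
         : c == 0x0024 ? Category::Sc   // DOLLAR SIGN
         : c == 0x00A3 ? Category::Sc   // POUND SIGN
         : Category::Other;
}

// General "punctuation character": ASCII punctuation, or category P*.
constexpr bool is_punctuation(char32_t c) {
    return is_ascii_punctuation(c) ||
           toy_general_category(c) == Category::Po;
}

static_assert(is_punctuation(0x0024), "DOLLAR SIGN: Sc, but ASCII punctuation");
static_assert(!is_punctuation(0x00A3), "POUND SIGN: Sc, and not ASCII");
```

So two characters with the same general category, Sc, end up on opposite sides of the "punctuation character" line, and that difference is visible only in emphasis markup.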
Well, they aren’t indistinguishable; they are treated as different characters. They just don’t have any special meaning in CommonMark. Line Separator will be passed through unchanged to the output format, which may do whatever it likes with it.
Yes, that’s precisely what I meant by the, admittedly, fuzzy term “undistinguishable”, except of course that they are not treated as different characters: they are, in each and every aspect other than their code point value, treated exactly the same (or is your cop-out that you could use them in link identifiers to distinguish them? Really?). Since you can’t “observe” the code point except in the output, they are “undistinguishable” in CommonMark. As are, for example, the letters “Q” and “R”: but that’s not a problem, for the vast majority of characters. But SOFT HYPHEN and LINE SEPARATOR? No “special meaning”?
I would say that SOFT HYPHEN has a pretty well-established “special meaning” everywhere else, and I actually would have a reasonable, useful, standard-conforming special meaning ascribed to SOFT HYPHEN to propose for CommonMark, but that’s not the time here. Pretty few CommonMark writers seem to use NBSP or SHY anyway.
As I wrote: there’s nothing inherently wrong about ignoring LINE SEPARATOR, NO-BREAK SPACE, SOFT HYPHEN, NEW LINE and various others. It’s perfectly fine to ignore, say, Unicode line breaking. There’s nothing inherently weird about
- on the one hand it purporting to not prescribe any “encoding”, but
- on the other hand requiring that lines in the input text are separated by line ending sequences, of which there are exactly three in the spec: LF, CR, and (CR, LF).
By the way: Note that this definition has consequences for an implementation, as I would argue.
How could a CommonMark processor (not that there is this term, or a concept or a model of it in the specification, but anyway), how could a CommonMark processor (or implementation) with an API like
int commonmark::process(const std::list<std::string>& text);
or
int cmk_process(const char *const *line);
or
int cmk_add_line(const char *line);
be conformant? There are no line endings here, so these APIs can kiss conformance goodbye?
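For comparison, here is a rough sketch of the front end that a text-based definition implicitly assumes sits before any line-oriented API like the ones above. The name `split_lines` is hypothetical, and so is the choice that a trailing line ending does not produce an extra empty line. It recognises exactly the three line endings named above: LF, CR, and CR LF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split a text into lines at LF, CR, or CR LF (treated as one ending).
// The line endings themselves are not part of the returned lines.
std::vector<std::string> split_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::size_t start = 0;  // index where the current line begins
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] != '\n' && text[i] != '\r') continue;
        lines.push_back(text.substr(start, i - start));
        // CR immediately followed by LF counts as a single line ending.
        if (text[i] == '\r' && i + 1 < text.size() && text[i + 1] == '\n') ++i;
        start = i + 1;
    }
    assert(start <= text.size());
    // Keep a final line that has no line ending of its own.
    if (start < text.size()) lines.push_back(text.substr(start));
    return lines;
}
```

An implementation that is handed a list of lines never sees which of the three endings was used, or whether there was one at all, and that is exactly the information the spec’s wording presupposes.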
The same goes for the output end, so to speak, of the specification. What is the result of processing? An XML file? An “AST”? That’s a data structure: how is it represented, how is it accessible? What information is and is not contained in the result? Is an implementation conformant if it simply “emits” calls to one or more callback functions, which are all of course “implementation-defined” (another term that’s missing in the specification)?
You’d probably say that this is not how the part about line endings was meant, and that such APIs (and many other manners of feeding input text to a processor) should of course not be precluded by the spec. And of course “output via callbacks” should not be non-conformant either.
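To show what “output via callbacks” could look like, here is a toy sketch. It is not a CommonMark parser: the `Events` struct and `emit_paragraphs` are hypothetical names, and the only rule implemented is “blank lines separate paragraphs”.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical event sink: the result of processing is delivered
// only as a sequence of calls, never as a document or tree.
struct Events {
    std::function<void()> paragraph_start;
    std::function<void(const std::string&)> text_line;
    std::function<void()> paragraph_end;
};

// Toy driver: consecutive non-empty lines form one paragraph,
// empty lines end it. Nothing else is recognised.
void emit_paragraphs(const std::vector<std::string>& lines, const Events& ev) {
    assert(ev.paragraph_start && ev.text_line && ev.paragraph_end);
    bool open = false;  // true while inside a paragraph
    for (const std::string& line : lines) {
        if (line.empty()) {
            if (open) { ev.paragraph_end(); open = false; }
            continue;
        }
        if (!open) { ev.paragraph_start(); open = true; }
        ev.text_line(line);
    }
    if (open) ev.paragraph_end();
}
```

Whether a processor shaped like this can be called conforming depends entirely on what the specification says the “result” is, which is the open question here.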
But they are right now. And I see other, related, fundamental issues and open questions with the current scope, terminology, and method of description of the specification. [ We already had a short debate about the aspirations and ambitions, as well as the sheer size of it, to no avail. — Do you have any comment on my suggested single-page re-write of the debated section, by the way? ]
Heck, the specification right now does not even state whether or not it imposes requirements
- on the format of a file (or octet string),
- on the syntax of a text,
- on the way one writes CommonMark texts,
- on the processor of said text pertaining to
  - how it is invoked,
  - how the processor acquires the text,
  - what situations it is supposed to diagnose,
  - what the minimum implementation limits are, regarding, for example:
    - minimal line length supported,
    - minimal URL length supported,
    - number of reference-style links supported,
  - the manner that the result of processing is made available,
  - what information this result of processing should encompass,
- how conformance is defined and assessed
- how processor-defined variations are to be documented,
- in what areas implementations might vary,
- in what areas an implementation might extend the specification,
- if and how such extensions are indicated, or “switchable”,
and so on.
And it really, and increasingly (now that an official release 1.0 draws closer), worries me that no one talks about them, let alone seems to find them important, not to mention wants to do something about it.
Because afterwards changing the fundamental terminology (or, as you prefer to call it, “tweaking the terminology”), and adjusting the concepts, scope, and models used or defined in the specification, is going to be
- really complicated and messy, and
- really embarrassing.
Let alone the little problem of multiple, up- and downwards incompatible specification (and implementation) versions I mentioned above.
I don’t know, maybe you see all my remarks just as some kind of heckling and nitpicking. Some of your answers read like it could be that way.
I see it as an attempt to bring, late but not too late (as long as there’s still time), some fundamental contents and qualities into a specification in a field that’s important to me, a specification that could have high relevance and influence. I think that every specification of this kind should have these qualities and contents.
But if you think that basically, the spec is all right in the aspects I listed here, and it’s just some “tweaking”, which could be done some time later on, maybe after we have for example fixed the syntax for “embedded video and audio”—
please tell me, and spare me and yourself the time wasted in pointless discussions; because if that’s the kind of specification you and everyone around here would be satisfied with —
well, then I’m out of here.
Thanks for your time.