Fun with specs: NO-BREAK SPACE = FORM FEED; DOLLAR SIGN /= POUND SIGN; SOFT HYPHEN = LINE SEPARATOR

Without calling for lengthy discussions (which could hold up the 1.0 release, or important features like “embedded audio and video” or issues like “email addresses regex”), I’d just like to quickly point out (not that there’s anything wrong with this, it’s just … “interesting”) that according to my understanding of the specification:

NO-BREAK SPACE and FORM FEED (!) are treated the same

At least inline (which is where FORM FEED will end up anyway): both are Unicode whitespace characters. Not that FORM FEED is a Unicode “space character”, but never mind. NO-BREAK SPACE, meanwhile, has General category Zs [separator, space] but lies outside “ASCII”, so it is not a whitespace character, hence it is a non-whitespace character, yet at the same time a Unicode whitespace character (huh?). Neither can be used to indent lines and such. That’s all the specification has to say about them, and being Unicode whitespace is all that matters inline when it comes to emphasis markup and such.

DOLLAR SIGN and POUND SIGN are treated differently

One is an ASCII punctuation character and thus a punctuation character. The other is not ASCII, hence not an ASCII punctuation character (duh!); it has General category Sc (symbol, currency), just like DOLLAR SIGN, but is (therefore!) in contrast to DOLLAR SIGN not a punctuation character, and hence behaves differently when it comes to emphasis markup.

SOFT HYPHEN and LINE SEPARATOR are indistinguishable

They are both neither ASCII punctuation (obviously) nor Unicode punctuation characters (one has General category Cf [other, format], the other Zl [separator, line]). So, being non-whitespace characters and also “not Unicode whitespace characters”, they are treated just the same. And why does LINE SEPARATOR not, well: separate lines?

Of course, EN SPACE is indistinguishable from NO-BREAK SPACE in the specification’s eyes, too: they are “just some Unicode whitespace characters”.
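
To make these three observations concrete, here is a minimal sketch of the character classes as I read them from the spec. The two General-category lookups stand in for whatever Unicode tables a real implementation would consult; for this sketch they only cover the code points discussed in this post.

#include <cstdint>

// General-category lookups; a real implementation would consult Unicode
// tables. Here they only cover the code points mentioned in this post.
bool is_category_Zs(std::uint32_t cp) {   // separator, space
    return cp == 0x0020 || cp == 0x00A0 /* NBSP */ || cp == 0x2002 /* EN SPACE */;
}
bool is_category_P(std::uint32_t cp) {    // Pc, Pd, Pe, Pf, Pi, Po, Ps
    (void)cp;
    return false;   // none of $, £, SOFT HYPHEN, LINE SEPARATOR are P*
}

// "whitespace character": space, tab, newline, line tabulation, form feed, CR.
bool is_whitespace(std::uint32_t cp) {
    return cp == 0x20 || (cp >= 0x09 && cp <= 0x0D);
}

// "Unicode whitespace character": category Zs, or tab, CR, LF, form feed.
bool is_unicode_whitespace(std::uint32_t cp) {
    return is_category_Zs(cp) || cp == 0x09 || cp == 0x0A || cp == 0x0C || cp == 0x0D;
}

// "ASCII punctuation character", and the general "punctuation character"
// (ASCII punctuation, or Unicode category P*).
bool is_ascii_punctuation(std::uint32_t cp) {
    return (cp >= 0x21 && cp <= 0x2F) || (cp >= 0x3A && cp <= 0x40) ||
           (cp >= 0x5B && cp <= 0x60) || (cp >= 0x7B && cp <= 0x7E);
}
bool is_punctuation(std::uint32_t cp) {
    return is_ascii_punctuation(cp) || is_category_P(cp);
}

// Consequences: NO-BREAK SPACE (U+00A0) and FORM FEED (U+000C) are both
// Unicode whitespace; DOLLAR SIGN (U+0024) is punctuation, POUND SIGN
// (U+00A3) is not; SOFT HYPHEN (U+00AD) and LINE SEPARATOR (U+2028) fall
// into none of these classes.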

The terminology may need some tweaking (it’s awkward to have “unicode whitespace” not a species of “whitespace”).

The only place the category of Unicode whitespace matters in the spec is in the spec for emphasis/strong emphasis. I don’t think it would matter too much if we used “whitespace character” instead. But it’s a pretty unusual case where this will matter, so I don’t think this decision is particularly important. In any case, treating both no-break space and form feed as whitespace for the purposes of emphasis delineation seems reasonable to me.

The only place the category of “ASCII punctuation character” matters in the spec is the part about backslash escapes. Original Markdown allowed only the characters with special meanings in Markdown to be escaped. It was always hard to remember which those were, so a more general rule was adopted: any ASCII punctuation can be escaped. I don’t think anybody is going to think that the pound sign has a special meaning in Markdown, so I don’t see a big problem with allowing dollar sign, but not pound sign, to be escaped. (Limiting to ASCII also makes things easier for parsers.)
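In code, that rule amounts to something like the following (reusing the is_ascii_punctuation sketch above; this is just an illustration of the rule, not the reference implementation’s logic):

// A backslash escapes the following character only if that character is
// ASCII punctuation; otherwise the backslash is treated as a literal
// backslash. So "\$" yields a plain "$", while "\£" keeps the backslash.
bool backslash_escapes_next(std::uint32_t next_cp) {
    return is_ascii_punctuation(next_cp);
}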

Well, they aren’t indistinguishable; they are treated as different characters. They just don’t have any special meaning in CommonMark. Line Separator will be passed through unchanged to the output format, which may do whatever it likes with it.

Why did we not use Line Separator as the hard line break character? It’s invisible, for one thing, and hard to type, for another.

The terminology may need some tweaking (it’s awkward to have “unicode whitespace” not a species of “whitespace”).

Yes. It really is. [ Although I’m confused about your term “species of”? These characters don’t reproduce while we’re not looking, right? That’s equivalent to “superset”? Generalization? Every “whitespace” should be a “Unicode whitespace”? Or the other way round? Wouldn’t “ASCII whitespace” as a subset or sub-character-class of “whitespace” make more sense? These characters are all Unicode characters (or rather: are supposed to be all Unicode characters, alas the specification manages to not even fix its character repertoire). ]

It’s also “awkward”, for example, to have no word (no term available for use) for the set of (what everyone else just calls) graphic characters. In the specification, all ASCII characters are simply partitioned into

  • whitespace characters,
  • non-whitespace characters

and a similar, but overlapping partition exists among Unicode characters into

  • Unicode whitespace characters
  • eeeehm, well: everything else. No name for that either.

Note that the spec’s definition of non-whitespace character encompasses not only all remaining control characters, but also, for example, all surrogates and all private use planes of Unicode. There’s not a single word about what feeding some or all of these “characters” to an implementation is supposed to achieve, or what the result of processing them should be. (Well, the null character is mentioned. Nice.)

The only place the category of Unicode whitespace matters in the spec is in the spec for emphasis/strong emphasis.

Yes. If you deem this area “unimportant”, why the set of 17 rules to describe a syntax which is supposedly “more complicated than HTML”? If it doesn’t matter, why not simply stick with SPACE here and “non-SPACE” there (which in a line could very easily, very cleanly mean: U+0021 … U+007E, resp. “expanded” to Unicode)?

The argument “the only place where X matters is Y” is really, really weak if you think about it.

I don’t think it would matter too much if we used “whitespace character” instead.

But I certainly do think it would matter, because inline “whitespace” would then for example suddenly exclude EN SPACE too, and how are you going to explain that to your users?

But it’s a pretty unusual case where this will matter, so I don’t think this decision is particularly important.

Maybe. Maybe not. Who knows? But …

There’s one catch though: If you decide now this way or the other (remember: the specification wants to be “unambiguous”, in the sense of “prescribing every little detail of translation”, so there’s no room for “implementation-specified” behavior variation!): if you decide it now (in whatever way), you can’t change that decision afterwards without creating two mutually incompatible versions of CommonMark, because every implementation can be “conforming” to at most one of the two specifications.

But you can’t defer that decision either: Once “release 1.0” is out, what’s in the spec and/or what the reference implementation does is the arbiter of CommonMark conformance, so “feature-wise”, and “tweaking-wise”: that’s it.

But never mind, that’s really not particularly important.

In any case, treating both no-break space and form feed as whitespace for the purposes of emphasis delineation seems reasonable to me.

Well. For the purpose of emphasis markup, maybe.

On the other hand, having a line with a FORM FEED character right in the middle of it, so that emphasis could possibly come into play in the first place, seems not that reasonable to me. Whether you then treat it the same as NBSP: well, who cares.

Quick reminder:

FF - FORM FEED
Notation: (C0)
Representation: 00/12
FF causes the active presentation position to be moved to the corresponding character position of the line at the page home position of the next form or page in the presentation component. The page home position is established by the parameter value of SET PAGE HOME (SPH).

Pretty much the same thing as:

NO-BREAK SPACE (NBSP)
A graphic character the visual representation of which consists of the absence of a graphic symbol, for use when a line break is to be prevented in the text as presented.


The only place the category of “ASCII punctuation character” matters in the spec is the part about backslash escapes.

Nope. I was talking about “punctuation character” in the CommonMark section 6.2 sense. You’re confused by your own terminology here, which is understandable:

  1. DOLLAR SIGN is by definition an “ASCII punctuation character” (section 6.2).
  2. It is thus by definition a general “punctuation character” (section 6.2).
  3. POUND SIGN is not in ISO 646 IRV, hence it is not an “ASCII character”.
  4. So it is no “ASCII punctuation character” either, right?
  5. But it has (as has DOLLAR SIGN, btw) the General category Sc [symbol, currency].
  6. “Punctuation characters” are “ASCII punctuation characters” or characters in a Unicode P* category.
  7. So POUND SIGN is not a general “punctuation character”.
  8. Being a general “punctuation character” matters not for “backslash escapes”, but for emphasis markup.

Maybe you misread something about “escaping characters” into my observation.


Well, they aren’t indistinguishable; they are treated as different characters. They just don’t have any special meaning in CommonMark. Line Separator will be passed through unchanged to the output format, which may do whatever it likes with it.

Yes, that’s precisely what I meant by the, admittedly, fuzzy term “indistinguishable”, except of course that they are not treated as different characters: they are, in each and every aspect other than their code point value, treated exactly the same (or is your cop-out that you could use them in link identifiers to distinguish them? Really?). Since you can’t “observe” the code point except in the output, they are “indistinguishable” in CommonMark. As are, for example, the letters “Q” and “R”: but that’s not a problem for the vast majority of characters. But SOFT HYPHEN and LINE SEPARATOR? No “special meaning”?

I would say that SOFT HYPHEN has a pretty well-established “special meaning” everywhere else, and I actually have a reasonable, useful, standards-conforming special meaning for SOFT HYPHEN to propose for CommonMark, but this isn’t the place for it. Pretty few CommonMark writers seem to use NBSP or SHY anyway.


As I wrote: there’s nothing inherently wrong with ignoring LINE SEPARATOR, NO-BREAK SPACE, SOFT HYPHEN, NEW LINE and various others. It’s perfectly fine to ignore, say, Unicode line breaking. And there’s nothing inherently weird about the specification

  • on the one hand purporting not to prescribe any “encoding”, but

  • on the other hand requiring that lines in the input text be separated by line ending sequences, of which there are exactly three in the spec: LF, CR, and (CR, LF).
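
Taken literally, this clause presupposes that the processor itself receives one flat stream of characters and chops it into lines. A sketch of the only splitting it licenses (just the three sequences named in the spec, nothing taken from the reference implementation):

#include <string>
#include <vector>

// Split an input character stream into lines at LF, CR, or CR LF,
// the three line ending sequences the spec names.
std::vector<std::string> split_lines(const std::string& input) {
    std::vector<std::string> lines;
    std::string current;
    for (std::size_t i = 0; i < input.size(); ++i) {
        char c = input[i];
        if (c == '\n' || c == '\r') {
            if (c == '\r' && i + 1 < input.size() && input[i + 1] == '\n')
                ++i;                        // CR LF counts as one line ending
            lines.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty())
        lines.push_back(current);           // final line without a line ending
    return lines;
}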


By the way: Note that this definition has consequences for an implementation, as I would argue.

How could a CommonMark processor (not that this term, or a concept or model of it, exists in the specification, but anyway), how could a CommonMark processor (or implementation) with an API like

int commonmark::process(const std::list<std::string>& text);

or

int cmk_process(const char *const *line);

or

int cmk_add_line(const char *line);

be conformant? There are no line endings here, so these APIs can kiss conformance goodbye?


The same goes for the output end, so to speak, of the specification. What is the result of processing? An XML file? An “AST”? That’s a data structure: how is it represented, how is it accessible? What information is and is not contained in the result? Is an implementation conformant if it simply “emits” calls to one or more callback functions, which are all of course “implementation-defined” (another term that’s missing in the specification)?

You’d probably say that this is not how the part about line endings was meant, and that such APIs (and many other manners of feeding input text to a processor) should of course not be precluded by the spec. And of course “output via callbacks” is not meant to be non-conformant either.

But they are right now. And I see other, related, fundamental issues and open questions with the current scope, terminology, and method of description of the specification. [ We already had a short debate about the aspirations and ambitions, as well as the sheer size of it, to no avail. — Do you have any comment on my suggested single-page re-write of the debated section, by the way? ]


Heck, the specification right now does not even state whether or not it imposes requirements

  1. on the format of a file (or octet string),
  2. on the syntax of a text,
  3. on the way one writes CommonMark texts,
  4. on the processor of said text pertaining to
    • how it is invoked,
    • how the processor acquires the text,
    • what situations it is supposed to diagnose,
    • what the minimum implementation limits are, regarding, for example:
      • minimal line length supported,
      • minimal URL length supported,
      • number of reference-style links supported
    • the manner in which the result of processing is made available,
  5. what information this result of processing should encompass,
  6. how conformance is defined and assessed,
  7. how processor-defined variations are to be documented,
  8. in what areas implementations might vary,
  9. in what areas an implementation might extend the specification,
  10. if and how such extensions are indicated, or “switchable”,

and so on.


And it really worries me, increasingly so now that an official release 1.0 draws closer, that no one talks about these issues, let alone seems to find them important, not to mention wants to do something about them.

Because changing the fundamental terminology afterwards (or, as you prefer to call it, “tweaking the terminology”), and adjusting the concepts, scope, and models used or defined in the specification, is going to be

  1. really complicated and messy, and
  2. really embarrassing.

Not to mention the little problem of multiple, upwards- and downwards-incompatible specification (and implementation) versions I mentioned above.


I don’t know, maybe you see all my remarks just as some kind of heckling and nitpicking. Some of your answers read like it could be that way.

I see it as an attempt, late but not too late (as long as there’s still time), to bring some fundamental content and qualities into a specification in a field that’s important to me, a specification that could have high relevance and influence; and I think that every specification of this kind should have these qualities and contents.

But if you think that basically, the spec is all right in the aspects I listed here, and it’s just some “tweaking”, which could be done some time later on, maybe after we have for example fixed the syntax for “embedded video and audio”

please tell me, and spare me and yourself the time wasted in pointless discussions; because if that’s the kind of specification you and everyone around here would be satisfied with —

well, then I’m out of here.

Thanks for your time.

I think we always appreciate constructive input, but forcing everyone to read thousands and thousands of words to get at those points does indeed become a kind of heckling, I am sorry to say, whether intentional or not.

It is problematic because you write thousands of words, over and over. Can you state your points more succinctly?

I also find it a bit ironic to complain that the spec is too long using… extremely long, verbose posts :wink:

I’m sorry. In my defense: there are “thematic breaks” between the topics I discussed (horizontal rules). You can pick and read them more or less independently. But then, one can’t really discuss details of the spec without discussing details, I’d say.


Executive summary so far:

  1. If the specification nails down exactly how each input is processed (and remember that every input is valid!), there is no room for upwards-compatible changes later.

  2. The terms employed are (in part) confusing, idiosyncratic, undefined, non-standard, or incomplete.

  3. Treatment of Unicode character properties (General Categories) is unbalanced and implementation-hostile.

  4. The “scope” of the specification (that is: what is and what is not supposed to be in the spec’s reach) is not stated.

  5. There’s no (abstract but precise enough) model of the input text, nor of the parsing result. “Any character sequence” is not good enough as an input model, “we mean AST, but show you HTML” is not good enough as an output model.

  6. There are no implementation limits given (every implementation will have limits).

  7. There is nothing said about “extensions” (like: these are extension points, this is left for extensions to specify). How do you tell apart (a) a “legitimate” extension from (b) just a “non-conforming” deviation?

The most pressing issues IMO are (1.), (4.), (5.) and (7.). Your mileage may vary.

These are all legitimate areas for improvement, I agree. Many of them are things I’ve wanted to improve myself. For example, I suggested long ago that it would be better to have a non-HTML representation of the parse tree in the spec tests. But there are practical difficulties. If we did this, we’d either have to require that all conforming implementations produce this non-HTML format (otherwise they couldn’t be tested), or provide a converter from HTML to this format (which would introduce new places for bugs in the test process). I’d still like to do this eventually, but it’s not a simple thing to change.

But it’s worth saying that “the perfect is the enemy of the good.” I’m just working on this in what doesn’t add up to a lot of free time. And I have no previous experience writing such a document. If this were my full time job, I’d aim for perfection. But if I aim for perfection given how things are, then we’ll simply never have a 1.0.

I wouldn’t mind that so much. I’m a tinkerer. I think we do need to ask what 1.0 would mean – since as you point out quite correctly, virtually any change would be a breaking change. But there’s something to be said for calling some imperfect version 1.0, and improving it later (in subsequent versions).

@jgm Thank you for your substantive and constructive answer!

These are all legitimate areas for improvement, I agree. Many of them are things I’ve wanted to improve myself.

Yes, it’s my impression that there are several (mostly minor) improvements that have been discussed long ago but didn’t make it into the spec. Case in point: the code point vs scalar value vs character discussion. And I certainly understand that you have other things to do as well (I did take a look at your Berkeley page :wink: )


If we did this, we’d either have to require that all conforming implementations produce this non-HTML format (otherwise they couldn’t be tested), or provide a converter from HTML to this format (which would introduce new places for bugs in the test process). I’d still like to do this eventually, but it’s not a simple thing to change.

It’s not so much the fact that HTML is used, but that the intention behind its use is barely explained. I take it that you see room for improvement here, too; maybe I’ll try to come up with some suggestions in this area.


And I have no previous experience writing such a document. If this were my full time job, I’d aim for perfection. But if I aim for perfection given how things are, then we’ll simply never have a 1.0.

As you probably have noticed by now, I’m kind of a stickler for standards and terms. If we could somehow “organize” (i.e., agree on some kind of plan and priorities), I’d be happy to help out.


[…] – since as you point out quite correctly, virtually any change would be a breaking change. But there’s something to be said for calling some imperfect version 1.0, and improving it later (in subsequent versions).

That’s exactly my main point of concern: the ambition to “strongly define” every aspect of how input text is to be interpreted could well turn out to backfire and force breaking changes later. Improving a less ambitious (but nevertheless precise and useful!) version 1.0 later would be much easier and smoother if it left some pieces deliberately and explicitly out. Though drawing this line and presenting it clearly is probably not that trivial.

This is closely related to my (non-rhetorical!) question of how to tell apart (1) “legitimate” extensions from (2) “non-conforming” deviations.

Solving one of these problems would be a solution for the other, I reckon.

I personally feel the embedded test cases in the spec are more important than intensely examining the meanings of various English words and statements.

If the language clarity can be improved, great, but I favor adding lots of test cases to clarify correct behavior versus wording changes alone.

I personally feel the embedded test cases in the spec are more important than intensely examining the meanings of various English words and statements.

Not that I deem test cases unimportant, but we can agree to disagree on this one. I personally feel that moving the specification further toward a kind of test suite with embedded comments does not help to improve it. But obviously I agree on the “more test cases is better” part, as long as they don’t clutter the specification body itself.

But I see no exclusive alternative here; maybe we could dedicate (“assign”?) work items or so and achieve (some of) both goals?

The problem is that in Markdown every input is legal. There are no “syntax errors.” If you put a * character in the wrong place for emphasis, well, it’s just treated as regular text. Everything that does not have a structural meaning falls under the remaining “plain text” clause. So, any substantive change to the spec will “break” something. To be pedantic about it, there’s no such thing as “fully conforming to the CommonMark spec but introducing some extensions.”

The basic pragmatic guideline for extensions is not to use something that people might have used as regular text in a real document.

Now, what we could do is introduce a generic extension mechanism, similar to what reStructuredText has (look up “interpreted text roles”). There’s some talk on this forum about various ways that might be done. But it’s really hard to know how to do this in a way that is sufficiently flexible and syntactically unobtrusive.

I absolutely agree. The “every text is legal” property, if one needs a name for it, makes the description somewhat peculiar: if every sentence is in the language, is there a “syntax” after all? Is a formal grammar of any help? [ IMO: 1. No. 2. Yes. ]

So, any substantive change to the spec will “break” something. To be pedantic about it, there’s no such thing as “fully conforming to the CommonMark spec but introducing some extensions.”

Hmm. That’s true for the CommonMark specification (or at least: the specification’s intent is that this is true): it specifies a total function from input texts to parsed results, right?

Now, a specification can of course as well specify a total relation from input texts to parsed results (giving, for each input text, a set of “allowed”, “acceptable”, or “correct” outputs).

And in this setting, there’s a perfectly well-defined, precise concept what it means to have a “compatible extension”, or what it means to “conform” to such a spec: that’s often called a “refinement”.

To make this more concrete, consider emphasis markup (the use of “*” and “_” in it):

  • The CommonMark specification does say, for each and every occurrence of strings of “*” or “_”, whether it will or will not be used to (or be “eligible” to) open or close an emphasized span of text.

  • You could certainly write a specification that does say something that amounts to:

    1. These will be recognized as opening/closing a span (example: “␣*A” and “O*␣”);
    2. Those will not be recognized (example: “␣*␣”);
    3. And the rest may or may not be recognized.

I’d say that both specifications can be written as precisely and unambiguously as one wishes (and is ready to toil over it), but the important difference is that the first does not leave room for variation (it specifies a function), while the second does leave room for such variation; in fact, it precisely specifies what that room is: the set of “possible” or “allowed” results. (In our case, you could enumerate this set by “switching on and off” those delimiters where there’s a choice.)
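
To sketch what “enumerating the allowed results” could look like: below, parse_with_choices is a made-up stand-in for a parser that takes one yes/no decision per delimiter the loose spec leaves undetermined; everything else is just set bookkeeping.

#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical: parse the input, resolving each undetermined delimiter
// according to the corresponding entry in 'choices'.
std::string parse_with_choices(const std::string& input,
                               const std::vector<bool>& choices);

// The set of outputs the loose spec allows for this input: every way of
// switching the undetermined delimiters on and off.
std::set<std::string> allowed_outputs(const std::string& input,
                                      std::size_t undetermined) {
    std::set<std::string> allowed;
    for (std::size_t mask = 0; mask < (std::size_t{1} << undetermined); ++mask) {
        std::vector<bool> choices(undetermined);
        for (std::size_t i = 0; i < undetermined; ++i)
            choices[i] = ((mask >> i) & 1) != 0;
        allowed.insert(parse_with_choices(input, choices));
    }
    return allowed;
}

// An implementation conforms (for this input) iff its output is in that set.
bool conforms(const std::string& output, const std::set<std::string>& allowed) {
    return allowed.count(output) != 0;
}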

And an author could actually write text based on this spec. If she wants to be sure, she’d write “\*” where the spec leaves something open, or where she just isn’t sure. And lo and behold: any conforming processor will process that as intended. (If the author knows the implementation, which hopefully documents what it does, she can omit the backslash, of course.)

By the way: that’s pretty much how I write Markdown in places like here: rely pedantically on only the most basic Markdown rules and features, and when in doubt, insert blank lines and backslashes, re-flow line breaks, etc. Works pretty predictably. (BabelMark is of course the place to go to test such rules …)

So no, neither

  • the fact that every input text is legal, nor
  • the requirement to have a precise and unambiguous specification

preclude

[…] such thing as “fully conforming to the CommonMark spec but introducing some extensions.”

That’s only precluded by the (arbitrary or reasonable) decision to specify a function.


Now, what we could do is introduce a generic extension mechanism, similar to what reStructuredText has (look up “interpreted text roles”). There’s some talk on this forum about various ways that might be done. But it’s really hard to know how to do this in a way that is sufficiently flexible, and syntactically unobtrusive, enough.

Now that’s the next step: in contrast to the “just loosely defined” example above, a generic extension mechanism would be an intentional piece of syntax where the spec says “the implementation can do with this whatever it wants” (maybe not worded quite that clumsily!). Like #pragma in C or Ada.

I’ll take a look at reStructuredText in a moment. Let me just point to a very obvious, very flexible, very standards-conforming, but admittedly a bit ugly, option that seems to be overlooked: the processing instruction.

  • It is already part of the syntax,
  • it is already recognized by the parser (not just of CommonMark),
  • it is already (and has been for decades) an “extension” loophole for applications’ ad-hoc needs (it was designed to be one!).

You could, with practically no effort in either the parser or the specification, introduce something like

<?cmk ... ?>

and put whatever options, parameters, identifier definitions, and so on, in there for the parser to find them.
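
To illustrate (the “cmk” target name and the very idea of a payload format are made up on the spot): an application could fish the payload out of the raw HTML text that the parser already delivers for such a node, without touching the parser at all.

#include <optional>
#include <string>

// Return the text between "<?cmk" and the closing "?>", if the raw content
// of a node is such a processing instruction; std::nullopt otherwise.
std::optional<std::string> cmk_pi_payload(const std::string& raw) {
    const std::string open = "<?cmk";
    const std::string close = "?>";
    if (raw.compare(0, open.size(), open) != 0)
        return std::nullopt;
    std::size_t end = raw.find(close, open.size());
    if (end == std::string::npos)
        return std::nullopt;
    return raw.substr(open.size(), end - open.size());
}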

But that’s a different topic …

a generic extension mechanism, similar to what reStructuredText has (look up “interpreted text roles”).

Looked it up. Sure, I was just about to write, as a concrete example of what could be done in the spec (but with out-of-thin-air, made-up syntax, so don’t take this as a syntax proposal!): The spec could say:

Listen, don’t write “{{” in your inline text, or bad things could happen to your paragraph. We don’t tell you what things, just don’t write this: write “{\{” instead, if you need two curly braces.

And there’s your generic extension mechanism, ready to be filled with meaning (either in an “implementation-defined” way, or in a future version of the spec).

The reStructuredText thing seems similar, kind of, but reminds me much more of the discussions we had about “foreign syntax” used in code blocks and code spans. The “interpreted text role” looks an awful lot like an info string on a code span, if I see this right. For CommonMark, a quite similar thing could be done. One syntax I had in mind would look like

`role|interpreted text`

So for example, a LaTeX formula could be written as `$| a^2 + b^2 = c^2`. And Z Notation as `Z| f: 1..4 >--> %N`.

Why the delimiter (the specific choice of character is, well, a matter of taste) inside the code span? Because in this way it is dirt simple to add this “feature” to existing implementations: just pick out, say, <code> elements whose character content starts with the string “$|” or “Z|”. The parser thus needs no modification at all.
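
A sketch of that post-processing step (render_latex_math is a hypothetical hook, and the string handling is deliberately naive; the only point is that everything happens after the CommonMark parser is done):

#include <string>

// Hypothetical renderer for the "$|" role.
std::string render_latex_math(const std::string& formula);

// Replace <code>$|...</code> elements in the emitted HTML with whatever the
// role renderer produces; the CommonMark parser itself is left untouched.
std::string rewrite_math_roles(std::string html) {
    const std::string open = "<code>$|";
    const std::string close = "</code>";
    std::size_t pos = 0;
    while ((pos = html.find(open, pos)) != std::string::npos) {
        std::size_t end = html.find(close, pos + open.size());
        if (end == std::string::npos)
            break;
        std::string formula = html.substr(pos + open.size(),
                                          end - (pos + open.size()));
        std::string rendered = render_latex_math(formula);
        html.replace(pos, end + close.size() - pos, rendered);
        pos += rendered.size();
    }
    return html;
}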

But that, again, is a different topic …

Thank you very much for the inspiring discussion!

@tin-pot, you are right that a precise spec could, in principle, specify a partial function from inputs to structured documents. But this is precluded by the following bit of the spec:

6.11 Textual content

Any characters not given an interpretation by the above rules will be parsed as plain textual content.

If we wanted a partial function, we’d need something else here, right? And what?

I like your idea of using processing instructions for extensions. I’d never thought of that, but it really makes sense, especially because it would degrade nicely with legacy Markdown parsers.

I had thought about the idea of letting {{...}} delineate bits that would be passed through as “custom block” or “custom inline” elements (that could be further handled by filters). I don’t think {{ is going to occur in ordinary text, so from that point of view it seems safe. But there are some applications where this wouldn’t work so well – e.g., if you want to embed LaTeX math in the {{...}}, since LaTeX math might well contain a }}.

The processing instruction idea has a similar limitation – technically, you’d have to escape &, <, > inside, and that might be awkward for some applications. (Though maybe it isn’t important if XML rules are followed – I have been told that in PHP processing instructions, > can be used unescaped.)

@jgm: [ There are some notes regarding partial functions below ]

But this is precluded by the following bit of the spec:

6.11 Textual content
Any characters not given an interpretation by the above rules will be parsed as plain textual content.

If we wanted a partial function, we’d need something else here, right? And what?

I fail to see how this rule 6.11 would preclude a “loose specification” of some (syntactic, if you want) aspect of the transformation. If the description in 6.4 of emphasis markup recognition (which is within “the reach of” clause 6.11) were “loose” in the style I sketched above, and left open for some delimiters whether they are seen as markup opening/closing a text span or just as “literal” character data, then in my view this would still constitute “giving an interpretation” to the characters in question (namely, strings of asterisks or low lines in this case).

So I really don’t think the summary clause 6.11 would preclude such a “loose” version of 6.4; or maybe you have a much stronger concept of “interpretation” in mind than I used here?


I like your idea of using processing instructions for extensions. I’d never thought of that, but it really makes sense, especially because it would degrade nicely with legacy Markdown parsers.

It has the nice “side effect” too that the same scheme can be used to control the CommonMark parser itself, and/or the application “behind it”, if the parser at least passes along the unknown options/parameters. Seen from the front, you couldn’t tell the difference.


I don’t think {{ is going to occur in ordinary text […]

Wait, wait! I made the delimiter string “{{” up in a moment’s impulse; I have no idea what conflicts it would introduce if actually put in place. Nested groups in LaTeX, nested set extension expressions in Z Notation, and nested blocks in every “curly-braces language” from C to ECMAScript are all places where this string occurs “in the wild”; I’m not sure that would be a good choice at all …


The processing instruction idea has a similar limitation – technically, you’d have to escape &, <, > inside, and that might be awkward for some applications. (Though maybe it isn’t important if XML rules are followed – I have been told that in PHP processing instructions, > can be used unescaped.)

Uuuh. Here we go :wink: [ Fun quote from thirty years ago: Processing instructions are deprecated, as they reduce portability of the document. (ISO 8879:1986, Clause 8) ]

Technically (according to ISO 8879:1986, which introduced SGML and the whole angle-bracket markup thing), a parser is supposed to simply plow through the processing instruction’s text until a pic is found, which is the “processing instruction close” delimiter string. Which in turn is “>” in [the “reference concrete syntax” of] SGML, and “?>” in XML. “Traditional HTML” was based on SGML, so no difference there. So far, so good.

The XML 1.0 specification also says the same thing about terminating the PI: it ends simply at the first occurrence of the literal, two-character pic “?>”. [ It also requires a PI to start with a Name, but let’s ignore that here. ]

So technically, you simply can’t use a literal pic delimiter string inside a processing instruction, that is, you can’t use

  • the single-character pic delimiter string “>” in SGML’s (and hence HTML’s) processing instruction,
  • the two-character pic delimiter string “?>” in XML’s (and hence XHTML’s) processing instruction (and XML declaration of course).

But what tools like PHP, or the zoo of real or pretend Markdown parsers, require from or do with a processing instruction (one which conforms to the above standards) is certainly a different matter.

But to be quite sure: you simply cannot use the pic string in the content, and there’s no way to “escape” it.

“>” can be used unescaped.

I have no idea what a “PHP processing instruction” is. (I have very little experience with PHP.) Maybe PHP “sees” these so-called “PHP processing instructions” first, before any real SGML/HTML/XML parser can take a look. And then removes the whole “PHP processing instruction” from the stream, again before any real parser would see it (and barf)?

Because, according to the relevant specifications (ISO respectively W3C), anything can be used unescaped in a processing instruction except the pic delimiter; and the pic in turn can’t be “escaped”. What other tools and parsers (not conforming to these ISO and W3C rules, or not even trying …) require or do: all bets are off.


Remarks on partial functions vs relations

[…] a precise spec could, in principle, specify a partial function from inputs to structured documents.

Not “in principle”, it obviously could. See, for example, any programming language specification of your choice (I happen to know ISO/IEC 9899 “C programming language” pretty well, and there it is called “undefined behavior”).

But a partial function was not what I was proposing; I wrote:

Now, a specification can of course as well specify a total relation from input texts to parsed results (giving, for each input text, a set of “allowed”, “acceptable”, or “correct” outputs).

Note the word “relation” here, and “set” of outputs (“set” as in “multiple” entities). The essential difference between a (total) function and a (total) relation is that

  • a function specifies, for each input, exactly one output,
  • a relation specifies, for each input, a set of outputs.

[ In a total relation, there’s at least one output for every input. ]

Only with the (total) relation concept can you “reasonably circumscribe” an allowed behavior for all possible inputs, and I think this is essential here (think “no syntax errors” vs the usual programming language syntax constraints).
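
Or, in plain set-theoretic symbols (nothing deeper than basic notation; I is the set of input texts, O the set of parse results):

\[
\begin{aligned}
R \subseteq I \times O \text{ is total} &\iff \forall i \in I\ \exists o \in O : (i, o) \in R,\\
g : I \to O \text{ conforms to } R &\iff \forall i \in I : (i, g(i)) \in R,\\
R' \text{ refines } R &\iff R' \subseteq R \text{ and } R' \text{ is total.}
\end{aligned}
\]

A function is then just the special case where each input has exactly one allowed output; a “compatible extension” or a later spec version is a refinement, and anything conforming to a refinement automatically conforms to the original relation.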