Fun with specs: NO-BREAK SPACE = FORM FEED; DOLLAR SIGN /= POUND SIGN; SOFT HYPHEN = LINE SEPARATOR

tin-pot · January 10, 2016, 7:18pm

@jgm: [ There are some notes regarding partial functions below ]

But this is precluded by the following bit of the spec:

6.11 Textual content
Any characters not given an interpretation by the above rules will be parsed as plain textual content.

If we wanted a partial function, we’d need something else here, right? And what?

I fail to see how this rule 6.11 would preclude a “loose specification” of some (syntactic, if you want) aspect of the transformation. Suppose the description of emphasis markup recognition in 6.4 (which is within “the reach of” clause 6.11) were “loose” in the style I sketched above, leaving open for some delimiters whether they are seen as markup opening/closing a text span or just as “literal” character data. In my view, this would still constitute “giving an interpretation” for the characters in question (namely, strings of asterisk or low line in this case).

So I really don’t think the summary clause 6.11 would preclude such a “loose” version of 6.4. Or maybe you have a much stronger concept of “interpretation” than the one I used here?

I like your idea of using processing instructions for extensions. I’d never thought of that, but it really makes sense, especially because it would degrade nicely with legacy Markdown parsers.

It has the nice “side-effect” too that the same schema can be used to control the CommonMark parser itself, and/or the application “behind it”, if the parser passes along at least the unknown options/parameters. Seen from the front side, you couldn’t tell the difference.

I don’t think {{ is going to occur in ordinary text […]

Wait, wait! I made the delimiter string “{{” up in a moment’s impulse, I have no idea what conflicts it would introduce if actually put in place. Nested groups in LaTeX, and nested set extension expressions in Z Notation, and nested blocks in every “curly-braces-language” from C to ECMAScript are all places where this string occurs “in the wild”, I’m not sure that would be a good choice at all …

The processing instruction idea has a similar limitation – technically, you’d have to escape &, <, > inside, and that might be awkward for some applications. (Though maybe it isn’t important if XML rules are followed – I have been told that in PHP processing instructions, > can be used unescaped.)

Uuuh. Here we go [ Fun quote from thirty years ago: Processing instructions are deprecated, as they reduce portability of the document. (ISO 8879:1986, Clause 8) ]

Technically (according to ISO 8879:1986, which introduced SGML and the whole angle-bracket markup thing), a parser is supposed to simply plow through the processing instruction’s text until a pic is found, which is the “processing instruction close” delimiter string. Which in turn is “>” in [the “reference concrete syntax” of] SGML, and “?>” in XML. “Traditional HTML” was based on SGML, so no difference there. So far, so good.

The XML 1.0 specification says the same thing about terminating the PI: it simply ends at the first occurrence of a literal, two-character “?>” pic. [ It also requires a PI to start with a Name, but let’s ignore that here. ]

So technically, you simply can’t use a literal pic delimiter string inside a processing instruction, that is, you can’t use

the single-character pic delimiter string “>” in SGML’s (and hence HTML’s) processing instruction,
the two-character pic delimiter string “?>” in XML’s (and hence XHTML’s) processing instruction (and XML declaration of course).

But what tools like PHP, or the zoo of real or pretending Markdown parsers require from or do with a processing instruction (which conforms to the above standards), that’s certainly a different thing.

But I am quite sure you simply cannot use the pic string in the content, and there’s no way to “escape” it.
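To make the termination rule concrete, here is a minimal Python sketch of “plow through the text until a pic is found”. It is not taken from any real SGML or XML parser: the function name `read_pi` and the constant names are made up for illustration, and (like the discussion above) it ignores XML’s requirement that a PI start with a Name.

```python
from typing import Optional, Tuple

# Delimiter strings as described above; the constant names are hypothetical.
PIO = "<?"        # "processing instruction open", the same in both syntaxes
SGML_PIC = ">"    # pic in SGML's reference concrete syntax (and hence HTML)
XML_PIC = "?>"    # pic in XML (and hence XHTML)


def read_pi(text: str, start: int, pic: str) -> Optional[Tuple[str, int]]:
    """Read one processing instruction beginning at text[start].

    Returns (content, index just past the pic), or None if no PI starts
    at `start` or if the PI is never closed. The content always ends at
    the FIRST occurrence of `pic`; no escape mechanism is recognized.
    """
    if not text.startswith(PIO, start):
        return None
    content_start = start + len(PIO)
    pic_at = text.find(pic, content_start)  # first occurrence wins
    if pic_at == -1:
        return None
    return text[content_start:pic_at], pic_at + len(pic)


# The same input, read under the two pic conventions.
doc = "<?ext a > b ?> tail"
xml_result = read_pi(doc, 0, XML_PIC)    # a lone ">" is ordinary content here
sgml_result = read_pi(doc, 0, SGML_PIC)  # the first ">" already closes the PI
```

Under the XML convention the content is `ext a > b `, while under the SGML convention the PI already ends after `ext a `, and the remainder ` b ?> tail` is left over as ordinary document text.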

“>” can be used unescaped.

I have no idea what a “PHP processing instruction” is. (I have very little experience with PHP.) Maybe PHP “sees” these so-called “PHP processing instructions” first, before any real SGML/HTML/XML parser can take a look. And then removes the whole “PHP processing instruction” from the stream, again before any real parser would see it (and barf)?

Because, according to the relevant specifications (from ISO and W3C, respectively), anything can be used unescaped in a processing instruction except the pic delimiter; and the pic in turn can’t be “escaped”. What other tools and parsers (not conforming to these ISO and W3C rules, or not even trying …) require or do: all bets are off.

Remarks on partial functions vs relations

[…] a precise spec could, in principle, specify a partial function from inputs to structured documents.

Not “in principle”, it obviously could. See, for example, any programming language specification of your choice (I happen to know ISO/IEC 9899 “C programming language” pretty well, and there it is called “undefined behavior”).

But a partial function was not what I was proposing; I wrote:

Now, a specification can of course as well specify a total relation from input texts to parsed results (giving, for each input text, a set of “allowed”, “acceptable”, or “correct” outputs).

Note the word “relation” here, and “set” of outputs (“set” as in “multiple” entities). The essential difference between a (total) function and a (total) relation is that

a function specifies, for each input, exactly one output,
a relation specifies, for each input, a set of outputs.

[ In a total relation, there’s at least one output for every input. ]
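A small Python sketch may help to pin down the distinction. Everything in it is a toy: the two-entry “spec”, its inputs and outputs, and the three parsers are invented for illustration and are not taken from the CommonMark spec. A specification is modelled as a mapping from each input text to the set of outputs it allows.

```python
import re
from typing import Callable, Dict, FrozenSet, Iterable

# A "spec" maps each input text to the set of outputs it allows.
Spec = Dict[str, FrozenSet[str]]

INPUTS = ["*a*", "a*b*c"]

# Hypothetical "loose" spec: the intraword case is left open on purpose.
LOOSE: Spec = {
    "*a*": frozenset({"<em>a</em>"}),
    "a*b*c": frozenset({"a<em>b</em>c", "a*b*c"}),
}

# Hypothetical partial spec: it says nothing at all about "a*b*c".
PARTIAL: Spec = {"*a*": frozenset({"<em>a</em>"})}


def is_total(spec: Spec, inputs: Iterable[str]) -> bool:
    """Total: at least one allowed output for every input."""
    return all(spec.get(text) for text in inputs)


def is_function(spec: Spec, inputs: Iterable[str]) -> bool:
    """Total function: exactly one allowed output for every input."""
    return all(len(spec.get(text, ())) == 1 for text in inputs)


def conforms(parser: Callable[[str], str], spec: Spec,
             inputs: Iterable[str]) -> bool:
    """A parser conforms if each of its outputs is in the allowed set."""
    return all(parser(text) in spec.get(text, frozenset()) for text in inputs)


def eager(text: str) -> str:
    """Toy parser: every *...* pair becomes emphasis."""
    return re.sub(r"\*([^*]+)\*", r"<em>\1</em>", text)


def cautious(text: str) -> str:
    """Toy parser: *...* is emphasis only when not inside a word."""
    return re.sub(r"(?<!\w)\*([^*]+)\*(?!\w)", r"<em>\1</em>", text)


def literal(text: str) -> str:
    """Toy parser: never recognizes emphasis."""
    return text
```

With these definitions, `LOOSE` is total but not a function, `PARTIAL` is not total, and both `eager` and `cautious` conform to `LOOSE` even though they disagree on `a*b*c`; `literal` does not conform, because the spec leaves no freedom for `*a*`.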

Only with the (total) relation concept can you “reasonably circumscribe” an allowed behavior for all possible inputs, and I think this is essential here (think “no syntax errors” vs the usual programming language syntax constraints).