Is the spec too big?


#1

This:

Well, it’s not so much the word “precedence” (and “priority” isn’t better, either). But I do see much harm done in section 6.4 which is kind-of related:

came up here.

But the following is independent from that terminology question.


Just compare a print-out of section 6.4 of the CommonMark specification (which alone is listing 17 rules!) with, for example, a print-out of the whole HTML 5 Syntax specification by the W3C:

  • On the one hand: Text spans with emphasis. Or with strong emphasis.

  • On the other hand: The f***ing whole syntax of HTML5: Doctype declarations. Character encoding declarations. Elements. Attributes. Text and character data. Character references. Comments. SVG. MathML. CDATA sections in SVG and MathML.

The former is 10 pages long. The latter is 7 pages long.

Let me repeat that: Section 6.4 of the CommonMark specification is three pages longer than the whole HTML5 syntax specification by the W3C. It needs 17 rules, explained on 10 pages, to define how “*” and “_” can be used to mark spans of text as “bold” or “italic”, basically.


Or compare the whole specification of CommonMark with the whole specification of, say, SGML.

The printed CommonMark specification would have about 74 pages (with a font size that yields more than 100 characters per line).

The first—informational, non-normative—Annex in ISO 8879:1986 starts on page 59.

So that means: The CommonMark specification is 15 pages longer than the specification of SGML. All of SGML. Look it up. That SGML, not some “Tiny SGML” dialect invented by the W3C.


By the way, 74 pages is about the size of clause 6 in ISO 9899:1990. What is defined in this clause of this standard? Well, the whole C programming language known as ANSI C. Except for some definitions, references, environmental limit stuff, the whole language definition of C90 is about as many printed pages as the CommonMark specification!


I find this worrisome. Not only because “precedence” is an unfortunate choice of word. I would have trouble explaining this to any outsider. Heck, I have trouble explaining this to myself!

But since no one seems to have noticed this before, and no one seems to care—maybe it’s just me …


Tables in pure Markdown
#2

Isn’t this partially because the spec is also the conformance tests? I don’t think the HTML spec can be used to validate implementations at the command line as the CommonMark spec can.


#3

The spec is big, both because of the conformance tests, and because Markdown was not designed for computers to parse. HTML5 and C are much simpler languages with simpler syntaxes.


#4

Obviously the dozens upon dozens of examples take up a lot of space.

Let me just say this on the topic of examples.

Roles of examples

I’d say there are two—absolutely legitimate—roles for examples in a specification like ours:

  1. To enlighten the reader about how the syntax rules are put into use, and
  2. to provide test cases for implementers.

It is my impression that both categories of examples are currently mixed up in the specification, and in fact that there’s a third category which I find rather problematic:

 3. Examples which fill in gaps, where the rules are unclear or incomplete.

I’d like to emphasize (sic!) that I have nothing against examples of kind (1.): having plenty of illustrative examples in the specification is a great idea, and the spec should keep them. But some examples, which show only “edge cases”, fall so squarely into category (2.) that they could perhaps be relegated to an appendix, or to a test suite (which exists anyway, and which is another good idea).

It is my opinion, however, that all the behavior shown in examples should be a consequence of the rules given in the specification; in other words, the specification would not change or lose substance and meaning if all examples were deleted. Thus examples of kind (3.) are not a good idea, but a symptom of a bad specification.

I don’t know if anyone has ever labored through the examples to check these categories—I’m not even sure whether it is intended that the examples only illustrate, but do not specify input syntax and parser behavior.


Specification bloat

Examples aside, I do think there are some areas in the specification which are bizarrely complex, and the 17 rules given in section 6.4 are a good example.

Compare this section 6.4 with the corresponding section “Emphasis” in Gruber’s description, which fits neatly into a single block quote, and even leaves room for me to drivel on:

Emphasis

Markdown treats asterisks (*) and underscores (_) as indicators of emphasis. Text wrapped with one * or _ will be wrapped with an HTML <em> tag; double *’s or _’s will be wrapped with an HTML <strong> tag. E.g., this input:

*single asterisks*
_single underscores_
**double asterisks**
__double underscores__

will produce:

<em>single asterisks</em>
<em>single underscores</em>
<strong>double asterisks</strong>
<strong>double underscores</strong>

You can use whichever style you prefer; the lone restriction is that the same character must be used to open and close an emphasis span.

Emphasis can be used in the middle of a word:

un*frigging*believable

But if you surround an * or _ with spaces, it’ll be treated as a literal asterisk or underscore.

To produce a literal asterisk or underscore at a position where it would otherwise be used as an emphasis delimiter, you can backslash escape it:

\*this text is surrounded by literal asterisks\*

Yes, I know: this description leaves several questions unanswered. And I’m not praising Gruber’s text as an exemplary specification; we all know it is obviously not. But I’m not convinced that it really does take 17 rules to describe the interaction between “*” and “_” in input text in a way that

  • is upwards-compatible with a (reasonable) interpretation of the cited description, and
  • is precise enough for CommonMark authors and CommonMark implementors to rely on.

[ Funny enough, some implementations manage to disregard even some of those syntax rules in Gruber’s description which are clear, unambiguous, easy to understand … ]

Questions about the specification’s ambitions

Regarding my use of the word “precise” above, it is intended that it

  • does mean “unambiguous”, but that it
  • does not mean “for every possible input there is one and only one output conforming to the specification”.

And I think it is important to be honest and clear about what the goals of the specification are, and what properties one wishes it would have. I’d say, for example, that “precise”, synonymous with “unambiguous”, is certainly one important property.


Now I’d like to pause (before this post again becomes a veritable wall of text) with two remarks:

  1. If the goal of the CommonMark specification effort is appropriately described by the latter meaning (second in the above list), then I’d say there are much bigger problems ahead (that’s another topic).

  2. If you feel that “unambiguous” and “for every possible input there is one and only one conforming output” is the same after all, and any presumed difference between these two possible meanings of “precise” (or “strongly defined”, as it was called elsewhere) is quibbling over adjectives, we’d have to resolve this misunderstanding first.


#5

Regarding the inclusion of conformance tests, see my comment below.

[…] because Markdown was not designed for computers to parse.

If I tell you that

  • I could probably explain to anyone who knows nothing about Markdown the basic rules for “*” and “_”, quoted in my post below (with the additional advice to use “\” when in doubt); but
  • that I have read the 17 rules in section 6.4, probably multiple times, yet can’t remember them, let alone explain them to anyone (and would simply fall back on the “basic rules” mentioned),

then—I assume—your answer would be akin to: “well, you don’t have to know the exact rules, as long as your input is interpreted in the way you intended it (without knowing), only implementers have to care.”

If so, aren’t you basically denying that it’s important for the user to be able to understand the rules by which an author’s text is parsed and processed? That would be a rather surprising point of view, at least for me.


HTML5 and C are much simpler languages with simpler syntaxes.

Come on—that’s cheap: yes, C has (maybe) a “simpler” syntax. It is listed on 6 pages in annex B.2. Would you also maintain that C has a “simpler semantics” than CommonMark? Because that’s obviously what the language specification is concerned with, and that’s what I compared with the CommonMark specification: in both cases, the specification comprises syntax and semantics.

Then: I did compare section 6.4 alone with the whole HTML5 syntax specification. Are you saying that the syntax for marking spans of text as emphasized or strongly emphasized is more complex than HTML5, and that this is intentional, because HTML was “designed for computers to parse”?

The comparison with SGML’s definition is even more relevant: the CommonMark specification is about the size of the SGML specification, and in fact SGML has pretty much no semantics either, as it is also a markup language. And SGML, too, was explicitly not “designed for computers to parse” — that would be XML, which is more or less precisely SGML without the features provided for the human author (like markup minimization). But it was certainly the intention of the designers that the author would be able to understand the rules (for example, the content model rules were intentionally simplified, mostly with human understanding in mind).

So the argument that the syntax “wasn’t [primarily] designed for parsers” is not that convincing, unless there is at least some explanation of what was gained (for the human author) in return for what price (in increased complexity).


And if CommonMark is (syntactically) in fact more complex than HTML or C90, and if this is intentional: why does it have to be? What advantages does this complexity buy? Certainly none for implementers; but how is this complexity helpful for a CommonMark user (ie author)?

I’m not being polemical here: I honestly would like to know the rationale for this design decision, that the syntax “was not designed for computers to parse”. That would mean it was designed only for CommonMark authors, right? How? Using what assumptions about these users? About their expectations? Their writing habits? Their cultural influences? Why is it unimportant that the author can understand the rules? Because the rules achieve what the user wants anyway?


But the heart of my question is not only the size of the specification text, but the goal of the specification itself. Case in point, and to stay with section 6.4, is the remark concerning Gruber’s (overly terse) description, right at the beginning of section 6.4:

This is enough for most users, but these rules leave much undecided, especially when it comes to nested emphasis.

Does that mean that section 6.4 is supposed to leave nothing undecided when it comes to inline “*” and “_”? I suppose it does mean it.

Generally, and I think you can answer that easily with yes or no:

Is it intended that according to the CommonMark specification (leaving “implementation-defined extensions” aside)

  • for each input text
  • there is one and only one output (that is: parsing result, AST, canonical-XML fragment, you name it)
  • which is conforming to the specification,

or in other words, a conforming parser can only produce one result for each input, and this result could be derived by interpreting the specification alone? (Technically: does or should the specification define a function from input texts to parsing results?)

Somehow this question never came up here, at least not that I know of, but I reckon it is a very fundamental question.

With an obvious influence on the specification’s size.


#6

I worry that you are distracting @jgm from the goal, which is shipping CommonMark 1.0. Is this discussion necessary? Does it get us toward 1.0?

In the past I have been rather protective of @jgm’s time as he is critical to this project’s success.

I suggest we shelve this discussion for now.


#7

We don’t need to discuss much about the merits and reasons or risks and harms of the size and style of the specification as it is right now.

But I’d still be very interested in a simple (yes or no) answer to my question above; and I don’t believe answering it would cost any more time than writing “yes” or “no”—after all, I assume, @jgm knows the answer already, because that question is one of the first to settle when embarking on writing a specification, right?

So my question is again, the meaning of “strongly defined”, or what the specification’s goal is, after all:


Is it intended that according to the CommonMark specification (leaving “implementation-defined extensions” aside)

  • for each input text
  • there is one and only one output (that is: parsing result, AST, canonical-XML fragment, you name it)
  • which is conforming to the specification,

or in other words, a conforming parser can only produce one result for each input, and this result could be derived by interpreting the specification alone? (Technically: does or should the specification define a function from input texts to parsing results?)


#8

I think you need to leave this alone for now, as it is a distraction from our immediate goals, and bring it up towards the end of 2016 as needed.

From my perspective, the goal is to validate conforming output, given a standard set of input. So if you, Joe Programmer, decide to write a CommonMark compatible implementation, you can run these tests against Joe’s CommonMark for COBOL and see if it works – that is, given the exact same text input as all the other CommonMark parsers, does it produce the correct text output?
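In code, that kind of black-box conformance check could look roughly like the following sketch. The `cases` list and the program name are made-up placeholders; the actual CommonMark test runner extracts its cases from the examples embedded in the spec text, but the principle is the same: same input in, expected output back, byte for byte.

```python
import subprocess

# Hypothetical (source, expected) pairs; the real suite derives these
# from the examples embedded in the specification text.
cases = [
    ("*hi*\n", "<p><em>hi</em></p>\n"),
]

def conforms(program, cases):
    """Feed each input to the implementation under test on stdin and
    compare its stdout with the expected output, byte for byte."""
    for source, expected in cases:
        got = subprocess.run([program], input=source,
                             capture_output=True, text=True).stdout
        if got != expected:
            return False
    return True

# e.g. conforms("./joes-cm", cases) -- "./joes-cm" is a placeholder name
```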

You can see a similar ad-hoc community effort here: https://github.com/michelf/mdtest/

MDTest is a Markdown test suite derived from the older MarkdownTest by John Gruber. MDTest is primarily used for the development of PHP Markdown but is structured in a way that can test and benchmark various implementations.

Perhaps direct your questions there?


#9

From my perspective, the goal is to validate conforming output, given a standard set of input.

Well, certainly a specification should say (should allow one to decide) whether that output, given this input, is allowed to be produced by a conforming implementation. And an implementation which always produces output that is “allowed” by the specification: that’s a conforming one.

If you can’t use the specification in this way, there’s no specification.

But that was not my question; my question is: are there “processor-defined”, or “processor-specified” behaviors? Or is the specification trying to “nail down” for each input text exactly one output that an implementation should produce?


[ Edit: Oops, we are cross-posting … ]

That’s certainly a useful test (suite), but a bunch of test cases with “correct” output is no specification either.

Oooh, wait: you write “[…] run these tests against Joe’s CommonMark for COBOL […]”! Would that mean

  1. There’s the reference implementation.

  2. An implementation is conforming to the specification iff, given the exact same text input, it does produce the correct text output (ie, the same output as the reference implementation).

Is that what you’re saying?

But a (reference or not) implementation is no specification either: in this case, the specification would simply be a long-winded explanation of the implementation’s behavior—and the latter would be the arbiter of conformance.

So what is the specification’s purpose and goal?


#10

Unless there are custom extensions in play, yes. Custom (or eventually, official…) extensions could redefine behaviors or add new ones.


#11

You can see a similar ad-hoc community effort here: https://github.com/michelf/mdtest/

Ah, thanks for the hint, that looks interesting. (I’m writing a little Markdown specification on my own. Or rather: a set of guidelines, describing a syntax or style that maximizes portability, ie the chance to be processed “as intended” by as many implementations as possible. BabelMark and this one come in handy for things like that.)

Perhaps direct your questions there?

How are the people there supposed to know what the CommonMark specification (effort) is trying to achieve?


#12

Okay, we leave out any “extension”.

So the envisioned way to ascertain conformity of Joe’s COBOL CommonMark implementation is to compare his output with the output of cmark, right? And if there are differences—that’s bad luck for Joe! (Or COBOL :wink: )

Leaving aside the possibility of the specification and the reference implementation contradicting each other, these are the rules:

  • The specification (text) is explanation, or annotation, to the actual reference implementation’s behavior, which decides what is or is not “conforming”;

  • Because this implementation will always (and repeatably) produce exactly one result for each particular given input, there is only one result a conforming implementation may produce for this input: the same result as the reference implementation produces.

Do I understand this correctly now?


#13

I agree. The intent isn’t to lay down law with examples, but to illustrate law already laid down. If there are exceptions to this, point them out and we can fix them.

On emphasis: Gruber’s description is clear enough for most end users, but there are too many things it doesn’t settle. (And yes, the ambition is to specify a unique parse tree for every possible input.) If there is a clearer, more concise way to capture the CommonMark emphasis syntax than in the spec, feel free to suggest it.

But please keep your suggestions constructive and concise.


#14

Good to know that we agree on this one; I’ll see if I can work my way through the examples with this perspective in mind.

And yes, the ambition is to specify a unique parse tree for every possible input.

Thanks for this answer—that’s what I wanted to know, but couldn’t find mentioned in the specification or elsewhere.

If there is a clearer, more concise way to capture the CommonMark emphasis syntax than in the spec, feel free to suggest it.

I’m not sure I can find a more concise description for the exact same behavior, but I’ll try. A more concise specification for which the given behavior (specification and/or reference implementation) is a refinement should certainly be possible.

But please keep your suggestions constructive and concise.

I honestly didn’t want to annoy you or steal your time, I’m sorry. But I still feel the question (which you answered above) was an important one—and I would hope you’d kind-of agree here.


#15

Ok, here we go: this is my first take on it.

The main difference is that I use a model in terms of the “*” and “_” characters that make up the delimiters for emphasized spans.

Depending on how you count, I have between three and eight rules (less than a printed page).

The wording probably needs some polishing (preferably by a native speaker).

I think the description below is equivalent to the current section 6.4; or at least that it can be modified/corrected to be so.

Picking and choosing (or simply copying the existing) examples can be done later.


6.4   Emphasis and strong emphasis

Delimiter strings

Strings consisting of adjacent “*” or “_” characters are recognized as left and/or right delimiters in these contexts:

  1. A maximal string of “*” characters, followed by a graphic character, is recognized as a left delimiter.

  2. A maximal string of “*” characters, preceded by a graphic character, is recognized as a right delimiter.

  3. A maximal string of “_” characters, followed, but not preceded, by a graphic character is recognized as a left delimiter.

  4. A maximal string of “_” characters, preceded, but not followed, by a graphic character is recognized as a right delimiter.

NOTE - A string of “*” or “_” characters which is not adjacent to a graphic character is never recognized as a delimiter.

NOTE - Delimiter strings are not recognized in a word like foo_bar_baz, but are recognized in foo*bar*baz.

EXAMPLES:

TBD

Associated delimiter characters

When a right delimiter is recognized, as many “*” (respectively “_”) characters of it as possible are associated with corresponding, still unassociated “*” (respectively “_”) characters in preceding left delimiters, in right-to-left order.

NOTE - Each “*” and “_” character in a delimiter gets associated at most once.

NOTE - Characters in a right delimiter can be associated with characters from different left delimiters, and vice versa.

EXAMPLES:

TBD

An association between a character in a right delimiter and a character in a preceding left delimiter is not introduced if this would create overlapping emphasized spans.

Emphasized spans

A single “*” (respectively “_”) character association between a left and a right delimiter introduces a <EM> span.

A double “*” (respectively “_”) character association between a left and a right delimiter introduces a <STRONG> span.
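For what it’s worth, delimiter rules 1–4 above can be sketched in a few lines of Python. This is my own sketch, not part of the proposal; `is_graphic` approximates “graphic character” as printable-minus-SPACE, in the C isgraph() convention.

```python
import re

def is_graphic(c):
    # "graphic character": printable, excluding SPACE (C isgraph() style)
    return c is not None and c.isprintable() and c != ' '

def classify(text):
    """Return (run, left?, right?) for each maximal run of '*' or '_',
    following draft rules 1-4."""
    out = []
    for m in re.finditer(r'\*+|_+', text):
        before = text[m.start() - 1] if m.start() > 0 else None
        after = text[m.end()] if m.end() < len(text) else None
        g_before, g_after = is_graphic(before), is_graphic(after)
        if m.group()[0] == '*':
            left, right = g_after, g_before          # rules 1 and 2
        else:
            left = g_after and not g_before          # rule 3
            right = g_before and not g_after         # rule 4
        out.append((m.group(), left, right))
    return out
```

This reproduces the second NOTE: in `foo_bar_baz` neither “_” run is recognized as a delimiter, while in `foo*bar*baz` both “*” runs are (each as both left and right).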


#16

If I read your description correctly, _*emem*_ would be double-emphasized, but *_em_* wouldn’t, for instance, and *_*noem*_* would have emphasized underscores whereas _*_emem_*_ would result in double-emphasized inner underscores and text.

Does this behavior for mixed “delimiters” intentionally differ from the current specification and reference implementation where these examples result in double or triple emphasis?


#17

The explicit classification of (parts of) mixed delimiter strings as opening and/or closing is simply not there yet (as is the fine print about punctuation, alphanumerics and so on).

It is however intended that in “_*emem*_”, as well as in “*_em_*”, a sequence of (1) opening, (2) opening, (3) closing, (4) closing delimiter strings (each consisting of one character) is recognized.

Given that, the specification as written would introduce spans <em><em>emem</em></em> in both cases, which is consistent with the CommonMark specification and the reference implementation.

If you reach a different conclusion, I’d be interested in knowing how and why—because then the description is probably not clear enough.


[…] and *_*noem*_* would have emphasized underscores whereas _*_emem_*_ would result in double-emphasized inner underscores and text.

The same remarks apply here: if delimiter strings are recognized as O, O, O, C, C, C (in both cases), as they should, then associating the “*” with “*” and “_” with “_” characters would introduce three times a single <em> span in both cases, yielding <em><em><em>noem</em></em></em> as the result—which is again consistent with the reference implementation.


The easiest way to specify how to deal with “mixed delimiter strings” would be to state that strings of “*” (respectively “_”) are recognized as opening and/or closing independently of each other.

In your example “_*emem*_” that would mean:

  • recognition and role of “*” is determined by ignoring “_”, ie by “looking at” the string “*emem*”: this gives (1) opening and (2) closing for the “*” sub-strings.

  • and similarly for “_”.

But that’s a first idea which still needs some refinement.
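That first idea can at least be made concrete in a one-liner (my sketch; the helper name is hypothetical): to determine the roles of the “*” runs, delete every “_” before classifying, and vice versa.

```python
def ignore_other(text, keep):
    """View of the input with the other delimiter character removed,
    so '*' and '_' runs can be classified independently of each other."""
    other = '_' if keep == '*' else '*'
    return text.replace(other, '')
```

For `_*emem*_` this yields `*emem*` as the “*” view (one opening, one closing run) and `_emem_` as the “_” view.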

Thanks for your comments!


[ Edit: Here is a slightly revised specification, in which punctuation, white space, and alphanumeric characters are taken into account. At least for your four example strings, I reach the correct results using this description. If you don’t, can you please let me know how and why? ]

Given input “*_em_*”, we proceed as follows:

  1. The first “*” is an opening delimiter: preceded by BOL, followed by a graphic character (the “_”). It is not closing, because it is not preceded by a graphic character, nor is it enclosed in alphanumerics.
  2. The first “_” is also an opening delimiter: preceded by punctuation ("*"), and followed by a graphic character (the “e”). It is not closing, because it is not followed by white space, a line break, or punctuation.
  3. The second “_” is a closing delimiter: preceded by a graphic character ("m"), followed by punctuation ("*").
  4. The second “*” is also a closing delimiter: preceded by a graphic character ("_"), followed by EOL.

So the first closing delimiter encountered is the second “_” character: it gets associated with the first “_”, introducing an <em> span around “em”.

The next closing delimiter we find is the second “*” character. It gets associated with the first “*” character, introducing another <em> span, enclosing the previously introduced one.

The string ends, and the result is “<em><em>em</em></em>”.
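The walkthrough above can be mechanized. The following sketch (mine, and deliberately simplified: single-character delimiter runs only, no <strong>, and no intraword-“*” rule) classifies each run per the revised recognition rules below and stack-matches same-character opening/closing pairs.

```python
import re

PUNCT = set(r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~""")

def graphic(c):
    # printable excluding SPACE
    return c is not None and c.isprintable() and c != ' '

def tokens(text):
    """Maximal runs of '*' or '_' with (opening?, closing?) roles,
    plus plain-text pieces; BOL/EOL count like white space."""
    toks = []
    for m in re.finditer(r'\*+|_+|[^*_]+', text):
        t = m.group()
        if t[0] in '*_':
            before = text[m.start() - 1] if m.start() > 0 else None
            after = text[m.end()] if m.end() < len(text) else None
            opening = ((before is None or before.isspace() or before in PUNCT)
                       and graphic(after))
            closing = ((after is None or after.isspace() or after in PUNCT)
                       and graphic(before))
            toks.append(('delim', t, opening, closing))
        else:
            toks.append(('text', t, False, False))
    return toks

def render(text):
    """Stack-match single-character delimiters into <em> spans."""
    out, stack = [], []          # stack holds (char, index into out)
    for kind, t, opening, closing in tokens(text):
        if kind == 'delim' and closing and stack and stack[-1][0] == t[0]:
            _, i = stack.pop()
            out[i] = '<em>'
            out.append('</em>')
        elif kind == 'delim' and opening:
            stack.append((t[0], len(out)))
            out.append(t)        # placeholder, replaced if matched
        else:
            out.append(t)
    return ''.join(out)
```

This reproduces the worked example: `render('*_em_*')` gives `<em><em>em</em></em>`, and `*_*noem*_*` comes out as the triple emphasis `<em><em><em>noem</em></em></em>`, consistent with the reference implementation.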


6.4   Emphasis and strong emphasis

The “*” and “_” characters are used to delimit emphasized spans of text.

Delimiter strings

Maximal strings of adjacent “*” or “_” are recognized in the following contexts as delimiters, which can take on an opening and/or closing role:

  1. A string of “*” or “_” characters is recognized as an opening (rsp closing) delimiter if

    • it is preceded (rsp followed) by white space, a line break, or punctuation, and

    • it is followed (rsp preceded) by a graphic character.

  2. A string of “*” characters is also recognized as both a closing and opening delimiter if it is enclosed in alphanumeric characters.

NOTE - A string of “*” or “_” characters which is not adjacent to a graphic character is never recognized as a delimiter.

NOTE - The second context condition implies that “*” characters can be used to emphasize parts of an (alphanumeric) word, while “_” characters are not recognized as delimiters there.

EXAMPLES:

TBD

Associated delimiter characters

When a closing delimiter is recognized, as many “*” (respectively “_”) characters of it as possible are associated with corresponding, still unassociated “*” (respectively “_”) characters in preceding opening delimiters, in right-to-left order.

NOTE - Each “*” and “_” character in a delimiter gets associated at most once.

NOTE - Characters in a closing delimiter can be associated with characters from different opening delimiters, and vice versa.

EXAMPLES:

TBD

An association between a character in a closing delimiter and a character in a preceding opening delimiter is not introduced if this would create overlapping emphasized spans.

Emphasized spans

A single “*” (respectively “_”) character association between an opening and a closing delimiter introduces an <em> span.

A double “*” (respectively “_”) character association between an opening and a closing delimiter introduces a <strong> span.


#18

Are * and _ “graphic characters”? Are they “punctuation”?

Is **** in ****test**** a “maximal string of adjacent *” and hence a single delimiter (one opening, one closing)?


#19

Yes. Both (according to the CommonMark definition of punctuation character).

The usual definition (ISO 2382-4:1999) is:

graphic character
A character, other than a control character, that has a visual representation and is normally produced by writing, printing, or displaying on a screen.

Whether or not SPACE is seen as a graphic character should be stated explicitly, for example,

  • in ISO 646 (“ASCII”), graphic characters are U+0021 … U+007E, and SPACE and DEL are treated separately,

  • while in ISO 8859-1 (“Latin 1”), the SPACE character is a graphic character whose visual representation consists of the absence of a graphic symbol.

  • This is also the classification in ISO 2382 (“Vocabulary”).

  • Then again, in the ISO 9899 (“C programming language”) <ctype.h> functions, there is a class of printing characters (isprint()) which comprises SPACE and the graphic characters (isgraph()). The isspace() class crosses the usual hierarchy and contains the “standard white space characters”: both the non-control character SPACE and control characters like HT. The isblank() class excludes control characters but only contains SPACE (in ISO 646), and corresponds to the term blank character in ISO 2382-4, which is in turn not included in ISO 10646 …

All in all: It’s messy. The classification used in ISO 9899 (and ISO 646), namely that:

  1. graphic characters exclude SPACE,
  2. printing characters include SPACE and the graphic characters,
  3. SPACE is a white space character together with HT, CR, LF

seems to reflect the CommonMark usage best and is the one I used (implicitly) here.
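As a quick sanity check, that classification can be written down directly (my sketch, using Python’s string predicates as stand-ins for the C <ctype.h> functions):

```python
def char_classes(c):
    """ISO 646 / C <ctype.h>-style buckets: graphic excludes SPACE,
    printing = graphic + SPACE, white space = SPACE plus controls
    like HT, CR, LF."""
    graphic = c.isprintable() and c != ' '
    printing = c.isprintable()
    white = c in ' \t\n\r\v\f'
    return graphic, printing, white
```

SPACE comes out as printing and white space but not graphic; “*” as graphic and printing; HT as white space only.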

The classification of graphic characters is the usual one, except that the CommonMark specification uses punctuation character for pretty much anything which is not alphanumeric; this is very similar to the standard term “special character” (ISO 2382-4:1999):

special character
A graphic character that is neither a letter, digit nor blank, and usually not an ideogram.
Examples: A punctuation mark, a percent sign, a mathematical symbol.

Here is the classification of “everything but control characters”:

  • Printing character
    • SPACE
    • Graphic character
      • Special character
        • Punctuation character
      • Alphanumeric character
        • Letter
          • Upper case letter
          • Lower case letter
        • Digit

Is “****” in “****test****” a “maximal string of adjacent “*” ” and hence a single delimiter (one opening, one closing)?

Yes. That is why there are two delimiter strings: one opening, one closing.

The “****” occurs twice as a substring in “****test****”. Both occurrences (one at the start of the string, one after the letter t) are maximal, in the sense that they are not part of any “greater” substring in “****test****” which also consists of “adjacent * characters”. That is, they are “maximal” in the partial order on strings given by the relation (_ is-substring-of _), hence the term.
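A greedy regex makes “maximal occurrence” concrete: a quantified pattern like r'\*+' cannot stop in the middle of a run, so each match is exactly one maximal string of adjacent “*” characters.

```python
import re

# Two maximal runs of '*' in "****test****": one at the start of the
# string, one after the letter t.
runs = re.findall(r'\*+', '****test****')
print(runs)   # -> ['****', '****']
```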

That should probably be defined explicitly, you’re right. And maybe saying “maximal occurrence” would be more accurate, but rather clumsy.