Delimiter run definitions need clarification

1. The core definition of delimiter run reads:

A delimiter run is either a sequence of one or more * characters that is not preceded or followed by a * character, or a sequence of one or more _ characters that is not preceded or followed by a _ character.

As written, this defines all asterisks and underscores, even singletons, as delimiter runs, but this cannot be. Markdown has never prohibited literal asterisks or underscores, or treated them as prima facie delimiters. Gruber:

But if you surround an * or _ with spaces, it’ll be treated as a literal asterisk or underscore.

The concept of delimiters wrapping text, as Gruber puts it, needs to surface in the definition. Delimiters are by their nature paired entities, else nothing is delimited, so defining a delimiter as a standalone sequence is probably not ideal here.

2. A left-flanking delimiter run is defined as:

A left-flanking delimiter run is a delimiter run that is (a) not followed by Unicode whitespace, and (b) either not followed by a punctuation character, or preceded by Unicode whitespace or a punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

This is hard to follow. I know this much:

  • It must not be followed by Unicode whitespace

  • And it must satisfy at least one other condition. Either:

    • It must not be followed by a punctuation character
    • Or, it must or must not be preceded by Unicode whitespace; it’s unclear whether the negation from the punctuation-character clause carries forward to this clause
    • Or, it must or must not be preceded by a punctuation character; again, whether the negation is still in force is hazy

It’s also unclear if the condition “preceded by Unicode whitespace or a punctuation character” imposes dual requirements or just one, but that’s due to the possible negation.

From the examples, I assume that the negation only refers to the first clause (“followed by a punctuation…”), and that the intent is to have whitespace before the left-flanking run, not to prohibit it, though I’m not sure why we would care. However, I’m stumped by what the intent is with the preceding punctuation clause.

In any case, the logic should be decomposed so that the negation is clearly bounded to its referent and each condition isolated as appropriate. An easy fix is to put the negation clause at the end of the sentence, instead of the beginning, but a bulleted structure might be best here. Note that having too many restrictions on where we can apply bold or italics might interfere with certain pedagogical use cases where we want to use pronunciation respelling, as is common in dictionaries and teaching materials, or similar intra-string mechanics.

The notion of a “delimiter run” is a technical notion, used to define what we really care about, which is emphasis and strong emphasis. It is harmless if a standalone string of *** with spaces on each side counts as a delimiter run, since it will be neither left-flanking nor right-flanking, so no emphasis will be triggered.

One could define “delimiter run” differently, so that only left-flanking and right-flanking strings of delimiters count as delimiter runs. But this would actually make the definitions more complicated, and because “delimiter run” is a technical notion, we can define it in the way that is most convenient for the overall definition of emphasis.

The intent here is:

(NOT followed by Unicode whitespace) AND (EITHER (NOT followed by punctuation) OR (preceded by whitespace or punctuation))
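To make the scoping concrete, here is a minimal Python sketch of that condition. The function name and its four boolean parameters are made up for illustration; they are not taken from the spec or from any implementation.

```python
def is_left_flanking(next_is_space: bool, next_is_punct: bool,
                     prev_is_space: bool, prev_is_punct: bool) -> bool:
    """Hypothetical helper: the Boolean condition as stated above.

    The negation applies only to the "followed by" tests; the
    "preceded by" tests are positive and joined by an inclusive OR.
    """
    return (not next_is_space) and (
        (not next_is_punct) or prev_is_space or prev_is_punct
    )


# A run between two letters (intra-word) is left-flanking.
print(is_left_flanking(False, False, False, False))
# A run followed by punctuation and preceded by a letter is not.
print(is_left_flanking(False, True, False, False))
```

Written this way, the inner OR has three alternatives, and satisfying any one of them is enough once the run is not followed by whitespace.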

I thought that was the natural reading, because with the “either” after (b), you can’t really get the negation to scope over both disjuncts. But I’m open to suggestions about how it might be made less confusing.

The basic idea is that a delimiter run becomes left-flanking when the character to its right ranks higher than the character to its left on the following scale:

non-space & non-punctuation > punctuation > space

The logical definition just spells this out. I think the “ranking” idea is more intuitive and had this in a previous version of the spec, if I recall. But some implementers complained that it would be easier just to give the simple Boolean condition. Maybe we should rethink that.
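For anyone who wants to compare the two formulations, here is a rough Python sketch that enumerates all nine (preceding, following) class pairs. The class names, the numeric ranks, and every function name are assumptions made for this illustration. Read literally with a strict “higher than”, the scale disagrees with the Boolean condition on two tie cases; allowing ties whenever the following character is not whitespace makes them match.

```python
from itertools import product

# Hypothetical numeric encoding of the scale: other > punct > space.
RANK = {"space": 0, "punct": 1, "other": 2}


def left_flanking_boolean(prev: str, nxt: str) -> bool:
    """The Boolean condition, expressed over character classes."""
    return nxt != "space" and (nxt != "punct" or prev in ("space", "punct"))


def left_flanking_strict_rank(prev: str, nxt: str) -> bool:
    """The scale read literally: right side ranks strictly higher."""
    return RANK[nxt] > RANK[prev]


def left_flanking_rank_with_ties(prev: str, nxt: str) -> bool:
    """The scale with ties allowed, except when followed by whitespace."""
    return nxt != "space" and RANK[nxt] >= RANK[prev]


def disagreements(candidate):
    """List the (prev, next) pairs where candidate differs from the Boolean form."""
    return [(p, n) for p, n in product(RANK, repeat=2)
            if candidate(p, n) != left_flanking_boolean(p, n)]


print(disagreements(left_flanking_strict_rank))
# → [('punct', 'punct'), ('other', 'other')]
print(disagreements(left_flanking_rank_with_ties))
# → []
```

The ('other', 'other') tie is the intra-word case, so a ranking-based wording would have to say explicitly how ties are handled.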

A lot of thought and experimentation has gone into this part of the spec. Simpler ideas yield too many intuitively wrong cases. Note that * emphasis can be used intra-word, but _ cannot. So *em*phasis, not em*phas*is works as you’d expect.


I thought that was the natural reading, because with the “either” after (b), you can’t really get the negation to scope over both disjuncts. But I’m open to suggestions about how it might be made less confusing.

Note that in the latest spec, version 0.28, the word either has been deleted from those clauses, so now it’s maximally confusing. For example:

A left-flanking delimiter run is a delimiter run that is (a) not followed by Unicode whitespace, and (b) not followed by a punctuation character, or preceded by Unicode whitespace or a punctuation character.

It’s not clear whether the “not” that opens clause (b) carries through to the two “or” clauses. For clarity, I suggest using bullets, with no ambiguity about whether something shall or shall not be the case.

Note also that the core definition of delimiters (before we get to left- or right-flanking varieties) is logically incoherent. Drawing from the asterisks portion alone, we have:

A delimiter run is either a sequence of one or more * characters that is not preceded or followed by a non-backslash-escaped * character…

Which logically reduces to “a delimiter run is a sequence of one or more asterisks that is not preceded or followed by any asterisks”

This makes only a complete, unbroken sequence of asterisks a delimiter run: we have no way to identify the first two asterisks in ********** as a delimiter run of their own, since they are followed by more asterisks.

I’m not sure why we’d define delimiter runs this way. One cost is that we’ve now ruled out ASCII art, which still appears in contemporary readme and release notes documents, many of which might be written in Markdown now and in the future.

Note that in the latest spec, version 0.28, the word either has been deleted from those clauses, so now it’s maximally confusing.

It was deleted because somebody thought that, with the “either”, it would be interpreted as exclusive rather than inclusive disjunction. We could use bullets, as you suggest, if you think that would be clearer. Go ahead and put up an issue on the tracker, or better a PR.

A delimiter run is either a sequence of one or more * characters that is not preceded or followed by a non-backslash-escaped * character…

Which logically reduces to “a delimiter run is a sequence of one or more asterisks that is not preceded or followed by any asterisks”

No, it doesn’t. In \*** you have a sequence of three asterisks that is not preceded or followed by any asterisks, but a delimiter run of only two asterisks, since the first is backslash-escaped.

But yes, ignoring backslash-escaping, any sequence of asterisks is a delimiter run. This doesn’t rule out ASCII art. Normally you’d put ASCII art in a code block anyway, to prevent interpretation as Markdown and to get a monospace font.
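A minimal Python sketch of that reading, for illustration only: the function name is hypothetical, and the scan deliberately ignores code spans, code blocks, and every other construct a real parser would handle before looking for emphasis. It also treats a backslash as escaping whatever character follows it, which is a simplification.

```python
def delimiter_runs(text: str) -> list[tuple[int, str]]:
    """Return (start_index, run) for each maximal run of * or _.

    Simplifying assumption: a backslash escapes the next character,
    so an escaped * or _ is never part of a run.
    """
    runs = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch in "*_":
            j = i
            while j < len(text) and text[j] == ch:
                j += 1  # extend to the end of the same-character run
            runs.append((i, text[i:j]))
            i = j
        else:
            i += 1
    return runs


print(delimiter_runs(r"\***"))        # → [(2, '**')]
print(delimiter_runs("**********"))   # → [(0, '**********')]
```

The first call shows the \*** case: three asterisks in the source text, but a run of only two.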

Why not just cap the number of asterisks that will be treated as delimiters? As of now, infinite asterisks are allowed, but why would we want to allow them? It needlessly complicates the spec. Asterisks beyond the first two in a string could be automatically escaped and treated as literals. Or is this an issue you’d rather deal with in your Beyond Markdown brainstorm?


I made pull request 534 to add words and clarify the definitions (additions in bold):

A left-flanking delimiter run is a delimiter run that is (a) not followed by Unicode whitespace, and either (b1) not followed by a punctuation character, or (b2) followed by a punctuation character and preceded by Unicode whitespace or a punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

A right-flanking delimiter run is a delimiter run that is (a) not preceded by Unicode whitespace, and either (b1) not preceded by a punctuation character, or (b2) preceded by a punctuation character and followed by Unicode whitespace or a punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.
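As a sanity check on the reworded definitions, here is a rough Python sketch that applies them to a run inside a single line of text. The helper names are hypothetical, and the two character-class tests are my approximation of the spec’s “Unicode whitespace” and “punctuation character” (category Zs plus tab, newline, form feed, and carriage return; ASCII punctuation plus the Unicode P categories), so treat them as assumptions.

```python
import string
import unicodedata


def is_unicode_whitespace(ch: str) -> bool:
    # Assumed approximation of the spec's "Unicode whitespace".
    return ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"


def is_punctuation(ch: str) -> bool:
    # Assumed approximation: ASCII punctuation or any Unicode P* category.
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")


def flanking(text: str, start: int, end: int) -> tuple[bool, bool]:
    """Return (left_flanking, right_flanking) for the run text[start:end]."""
    # The beginning and the end of the line count as whitespace.
    prev = text[start - 1] if start > 0 else " "
    nxt = text[end] if end < len(text) else " "
    prev_ws, next_ws = is_unicode_whitespace(prev), is_unicode_whitespace(nxt)
    prev_p, next_p = is_punctuation(prev), is_punctuation(nxt)
    # (a) and either (b1) or (b2), as worded in the pull request.
    left = (not next_ws) and ((not next_p) or (next_p and (prev_ws or prev_p)))
    right = (not prev_ws) and ((not prev_p) or (prev_p and (next_ws or next_p)))
    return left, right


print(flanking("a *** b", 2, 5))  # → (False, False)
print(flanking("*abc", 0, 1))     # → (True, False)
print(flanking("a*b", 1, 2))      # → (True, True)
```

The first call is the standalone *** case from earlier in the thread: it is a delimiter run, but it is neither left- nor right-flanking, so it cannot open or close emphasis.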
