Ambiguities in definition of blank lines

Four clauses quoted from Section 2.1

  1. A character is a unicode code point. This spec does not specify an encoding; it thinks of lines as composed of characters rather than bytes. A conforming parser may be limited to a certain encoding.

  2. A line is a sequence of zero or more characters followed by a line ending or by the end of file.

  3. A line ending is a newline (U+000A), carriage return (U+000D), or carriage return + newline.

  4. A line containing no characters, or a line containing only spaces (U+0020) or tabs (U+0009), is called a blank line.

JD: Point 4 contradicts Point 2. If a line must include a line ending by definition, as Point 2 implies, a line cannot contain no characters. It contains at least one character – an end of line or end of file character.

It’s possible that “character” is being used differently in the opening clause of Point 4, to mean a visible text character, excluding control characters or whitespace. However, this would contradict Point 1, which defines a character as a Unicode code point. A control or whitespace character is a Unicode code point.

Suggested edit:

Clarify Point 1 by making it explicit that a character can be displayable text, whitespace, or control characters. e.g. “A character is a Unicode code point, whether text, whitespace, or control characters – any Unicode code point is a character.”

Change Point 4 to “A line that contains no characters other than a line ending is a blank line. A line that contains no characters other than spaces (U+0020) and/or horizontal tabs (U+0009) before a line ending is also a blank line. Note that a blank line must have a line ending, else it is not a line.”

Notes on the suggested edits:

  1. Note that I capitalized Unicode, which is the proper form.
  2. Note that I specified a horizontal tab in Point 4, instead of just a tab. You do have the code point there, so it shouldn’t confuse anyone, but since there are other tabs, like vertical tab (U+000B), I like to spell it out.
  3. Note I inserted “and/or” in Point 4 in place of “or”. I assume you mean and/or there – that is, I assume that a line that contained both spaces and horizontal tabs would qualify as a blank line.
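
To make the suggested wording of Point 4 concrete, here is a minimal Python sketch of that stricter reading, in which the line ending is part of the line and a blank line must have one. The function name `is_blank_line_strict` and the idea of passing in a raw line with its terminator still attached are assumptions made for illustration; nothing here comes from the spec or from any existing parser.

```python
# Line endings per Point 3. CRLF is listed first so that it is matched
# as a single ending rather than as a stray CR followed by LF.
LINE_ENDINGS = ("\r\n", "\n", "\r")


def is_blank_line_strict(raw_line: str) -> bool:
    """Blank-line test under the suggested edit (hypothetical helper).

    `raw_line` is assumed to include its line ending, if it has one.
    Under this reading, text with no line ending is not a line at all,
    so it cannot be a blank line.
    """
    for ending in LINE_ENDINGS:
        if raw_line.endswith(ending):
            body = raw_line[: -len(ending)]
            # Only spaces (U+0020) and horizontal tabs (U+0009) may precede the ending.
            return all(ch in " \t" for ch in body)
    return False
```

Under this reading, `is_blank_line_strict(" \t\n")` is true, while two tabs followed directly by end of file (`"\t\t"`) is not a blank line, because it has no line ending.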

I’ve made some of these changes. But there’s a substantive issue. When I originally wrote the spec, I was not thinking of the line endings as parts of the lines. (Rather, they are line separators.) And I certainly don’t want to require that a blank line have a line ending – if you have two tabs followed by EOF, that’s a blank line.
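
As a rough illustration of the separator reading, here is a small Python sketch in which line endings are not part of any line. The names `split_lines` and `is_blank` are hypothetical, and the choice that a trailing line ending does not start an extra empty line is an assumption the spec text quoted above does not settle.

```python
import re

# A line ending is CRLF, a lone CR, or a lone LF. CRLF comes first in the
# alternation so that it is consumed as one separator, not two.
LINE_ENDING = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> list[str]:
    """Split text into lines, treating line endings as separators."""
    lines = LINE_ENDING.split(text)
    # Assumption: a line ending at the very end of the input terminates the
    # last line instead of opening a new empty one (so "" yields no lines).
    if lines[-1] == "":
        lines.pop()
    return lines


def is_blank(line: str) -> bool:
    """A line with no characters, or only spaces (U+0020) and tabs (U+0009)."""
    return all(ch in " \t" for ch in line)
```

With these definitions, `split_lines("\t\t")` gives the single line `"\t\t"`, and `is_blank` accepts it even though the input ends at EOF with no line ending.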

Maybe this should be revisited, and it certainly should be clarified. But if the line endings are going to be considered part of the line, there are probably many parts of the spec that would need revision.
