Mixed-marker bullet lists

Crissov · January 2, 2016, 11:16am

Although most current markdown implementations allow arbitrary mixes of the supported list markers -, * and + to generate a single uniform list in the output, the CommonMark specification now requires that each change in marker starts a new list:

A list is a sequence of one or more list items of the same type. (…)

Two list items are of the same type if they begin with a list marker of the same type.
Two list markers are of the same type if
(a) they are bullet list markers using the same character (-, +, or *) or
(b) they are ordered list numbers with the same delimiter (either . or )).
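
The quoted rule can be sketched in a few lines of Python. This is only an illustration, not the reference parser: the names `marker_type` and `split_lists` are made up here, and the sketch ignores indentation, nesting, empty items and blank lines.

```python
import re

# A bullet marker (-, + or *) or an ordered marker (digits followed by . or )),
# at the start of the line and followed by whitespace.
MARKER = re.compile(r"^(?:([-+*])|(\d{1,9})([.)]))\s+")


def marker_type(line):
    """Return a key describing the list-marker type, or None for non-items."""
    m = MARKER.match(line)
    if m is None:
        return None
    if m.group(1):
        return ("bullet", m.group(1))   # same type only if same bullet character
    return ("ordered", m.group(3))      # same type only if same delimiter


def split_lists(lines):
    """Group list items; every change of marker type starts a new list."""
    lists, current, current_type = [], [], None
    for line in lines:
        t = marker_type(line)
        if t is None:
            continue  # simplification: lines that are not list items are skipped
        if current and t != current_type:
            lists.append(current)
            current = []
        current_type = t
        current.append(line)
    if current:
        lists.append(current)
    return lists


print(split_lists(["* breakfast", "- lunch", "- supper"]))
# [['* breakfast'], ['- lunch', '- supper']]
```

Under this reading, any input that mixes markers yields several adjacent lists rather than one.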

(Please note that some implementations do weird things: Cebe treats a change in list item marker as a new nested list and Gambas uses plus + for ordered lists.)

I believe there are some valid, existing use cases of mixed-marker lists. Supporting these would require a slight change in philosophy: Lists that mix the bullet types are considered to associate special meaning with them.

If all three markers are being used, the list can be treated like a change log or diff.
Empty items are ignored except that they take part in determining the kind of list.
Plus + designates a new feature, minus - a removed or deprecated one and asterisk * is for general informational changes. It’s a matter of philosophy whether bug fixes belong in the - or * category.
These conventions are followed more or less by many plain-text change logs already, even if they do not use markdown explicitly.

+ New feature, addition
- Deprecated feature, removal, (bug fix)
* Other change, informational, (bug fix)

If only plus + and hyphen-minus - are mixed, the list is treated as a comparison of pros and cons.
As you can see below, the code highlighter used here recognizes that convention (because it auto-guesses the code to be a diff, not markdown):

+ Pro
- Con

If hyphen - is mixed with asterisk *, this makes an agenda, i.e. a to-do, shopping or task list.
The hyphen - introduces unfinished entries (“open”), the asterisk * ones that already have been ticked off (“closed”).
Most existing flavors that support something like this use different syntax with square brackets [+] or parentheses ( ) to mock checkboxes and radio buttons.

* breakfast
- lunch
- supper
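
The three conventions above could be expressed as a small classifier. This is a minimal sketch of the proposal, not existing behaviour of any implementation; the kind and role names (`changelog`, `pros_cons`, `agenda`, `added`, and so on) are invented for illustration, and the combination of + with * alone is left unspecified because the proposal does not cover it.

```python
# Hypothetical names for the three proposed list kinds and their item roles.
KINDS = {
    frozenset("+-*"): "changelog",
    frozenset("+-"): "pros_cons",
    frozenset("-*"): "agenda",
}
ROLES = {
    "changelog": {"+": "added", "-": "removed", "*": "changed"},
    "pros_cons": {"+": "pro", "-": "con"},
    "agenda": {"-": "open", "*": "closed"},
}


def classify(markers):
    """Derive the list kind from the set of bullet characters used."""
    used = frozenset(markers)
    if len(used) <= 1:
        return "plain"
    return KINDS.get(used, "unspecified")


def annotate(items):
    """items: (marker, text) pairs. Empty items count for the kind only."""
    kind = classify(marker for marker, _ in items)
    roles = ROLES.get(kind, {})
    labelled = [(roles.get(m), text) for m, text in items if text.strip()]
    return kind, labelled


print(annotate([("*", "breakfast"), ("-", "lunch"), ("-", "supper")]))
# ('agenda', [('closed', 'breakfast'), ('open', 'lunch'), ('open', 'supper')])
```

Note that this only works if the differently marked items end up in one list, which is exactly what the current spec wording prevents.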

If we anticipated an extension to support such advanced lists, would it be better to keep the current phrasing of the spec to make separate lists or would the universal existing practice of making it a single list be better suited?

tin-pot · January 2, 2016, 4:15pm

I do agree that the CommonMark rule, that is “splitting” the list whenever the marker character used in the input text varies, seems a bit arbitrary—and non-portable.

If two adjacent, unordered (or “bullet”) lists are really what is wanted, maybe an intervening comment declaration could be used to mark up the splitting point.

However, I’d even prefer to require an intervening paragraph, since adjacent lists feel somewhat “degenerate”, and I think they aren’t even possible in some document formats. (The “splitting” paragraph could be empty in any case, making the kludge character of the adjacent lists more visible, while increasing the chance that this can be mapped into a wider variety of target document formats.)

The CommonMark text for two such lists would then look like this:

* breakfast
- lunch
- supper

<p></p>

+ spring
- summer
* autumn

Alternatively, use a comment declaration like <!-- --> in place of the empty <p> element, or maybe even a <br> would do.

Using comment declarations in CommonMark text for purposes like this would be much nicer if an empty comment <!> were available¹⁾, but that’s not covered by the current CommonMark rules …

______

¹⁾ The syntax “<!>” is defined in SGML, but not in XML or HTML; which does not matter or is even fortunate, because <!> would be “used up” by the CommonMark processor and never occur in the parsed result.

Furthermore, to convey the “meaning” of using different marker characters in your proposal, the CommonMark <item> element (into which every list item is parsed) would need some attribute to hold the marker string, like this (currently <item> has no attributes at all):

<!ATTLIST item
          marker CDATA #IMPLIED
>

So the “+”, “-”, “*” characters in your example would be represented in this marker attribute, ready for an application to base decisions on it.

Crissov · January 2, 2016, 4:39pm

Two consecutive blank lines would split two adjacent lists already, so no empty comment or paragraph needed.

Dmitry · January 2, 2016, 5:06pm

Cancelling the two blanks rule has been suggested (on the grounds of different markers breaking a list, no less), and a “list break” syntax has been proposed, which would be consistent with hard line breaks:

Two lists:
1. item
2. item
\
1. item
2. item

(although those are ordered lists, the particular list type is unimportant).
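
As a rough sketch of how such a “list break” line could be handled, assuming (as in the example above) that the break is a line containing nothing but a backslash; the function name `split_on_list_break` is hypothetical and no existing parser is claimed to work this way:

```python
def split_on_list_break(lines):
    """Split list lines into separate lists at each lone-backslash line."""
    groups, current = [], []
    for line in lines:
        if line.strip() == "\\":
            # The break line itself is consumed and produces no output.
            if current:
                groups.append(current)
            current = []
        else:
            current.append(line)
    if current:
        groups.append(current)
    return groups


print(split_on_list_break(["1. item", "2. item", "\\", "1. item", "2. item"]))
# [['1. item', '2. item'], ['1. item', '2. item']]
```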

CommonMark.NET already holds these (as well as other) data as part of the syntax tree and its (non-XML ATM) AST format.

tin-pot · January 2, 2016, 5:20pm

One more reason to ditch the “changing marker character implies splitting the list” rule in CommonMark, I would say.

(I’m not too fond of the “two blank lines split the list” rule either, but so be it.)

tin-pot · January 2, 2016, 5:23pm

Good to know.

Do you know the “attribute name” (or whatever it is called in “non-XML ATM”) for the corresponding data element? Is it “marker” already? Does it hold a single character, or a Unicode string, or something else?

So that if this gets added to the CommonMark <item> element, it can at least be done consistently.

Dmitry · January 2, 2016, 6:05pm

This is currently part of the list element, not item element, so I see no way of it being forwards-compatible with “mixed-marker bullet lists”.

The format is text (although I suppose it could be formalized as SGML), and a “bullet list” node has the form

list (type=bullet tight={0} bullet_char={1})

where

{0} is replaced with either True or False and
{1} is replaced with a code point.

tin-pot · January 2, 2016, 6:16pm

Thank you for the information!

This looks indeed quite like the current definition of the CommonMark <list> element:

<!ELEMENT list (item)+>
<!ATTLIST list
      type (bullet|ordered) #REQUIRED
      start CDATA #IMPLIED
      tight (true|false) #REQUIRED
      delimiter (period|paren) #IMPLIED>

Regarding “mixed-marker bullet lists”, the only kind-of “forwards-compatible” way to have “mixed” markers transmitted I can come up with is again using an attribute per item, eg in the CommonMark <item> element, like my marker attribute above.

This could be declared #IMPLIED, and only explicitly output by the CommonMark processor if the marker character (or string) and hence the marker attribute value differs from the one in the first element in the list (which is, I suppose, what ends up being the bullet_char value?), in other words: if it differs from the value of bullet_char (rsp delimiter attribute) in the <list> element.
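
A minimal sketch of that output rule, using Python's standard xml.etree.ElementTree. Both the `marker` attribute on <item> and the `bullet_char` attribute on <list> are assumptions taken from this discussion, not part of the current CommonMark DTD, and the item text is stored directly in <item> as a simplification.

```python
import xml.etree.ElementTree as ET


def build_bullet_list(items):
    """Build a <list> from a non-empty sequence of (marker, text) pairs."""
    first_marker = items[0][0]
    # "bullet_char" is borrowed from the CommonMark.NET node format (assumption).
    root = ET.Element(
        "list", {"type": "bullet", "tight": "true", "bullet_char": first_marker}
    )
    for marker, text in items:
        item = ET.SubElement(root, "item")
        # Emit the hypothetical per-item attribute only when it differs
        # from the list-level value, i.e. treat it as #IMPLIED otherwise.
        if marker != first_marker:
            item.set("marker", marker)
        item.text = text  # simplification: no nested block elements
    return root


root = build_bullet_list([("+", "spring"), ("-", "summer"), ("*", "autumn")])
print(ET.tostring(root, encoding="unicode"))
```

A uniform list would come out without any `marker` attributes at all, so existing consumers would see no difference.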

Dmitry · January 2, 2016, 7:01pm

delimiter isn’t quite the same, since it can only be period or paren (so adding support for e.g. (1) lists would require adding a new value).

start and delimiter are both #IMPLIED since both are present iff type == ordered. It would be much better IMO to have two separate list elements (or three, if definition lists are to be accounted for).

Changing bullet_char to marker might be a good idea. A similar attribute could be present in a definition list, where : or ~ would serve as definition markers.

BTW should ~ definition2 after : definition1 start a new list?

tin-pot · January 2, 2016, 7:50pm

Yes and no: the delimiter (period | paren) #IMPLIED declaration makes (more or less) sense in the current DTD, as it is only used in “ordered” lists, and these can only use these two alternatives (at least in the current CommonMark syntax).

It would be much better IMO to have two separate list elements (or three, if definition lists are to be accounted for).

I’m not sure about that (we are still talking about the elements defined in the CommonMark DTD, aren’t we?). The “one <list>, various attributes” approach is in fact somewhat more flexible and easier to extend (putting the “marker string” into a marker attribute is an example of this!).

This “dichotomy” between elements and attributes is a very common design problem, and in the exact context of these two approaches to define list elements (ie multiple element types vs discerning attributes) it is discussed in some depth in ISO/IEC/TR 9573:1988 (see how old these questions are!).

Let me—assuming this is “fair use”—quote from the pertaining section [the “first approach” here means the one with multiple element types for lists, the second being the use of attributes]:

4.2.3.3 Discussion of the two approaches to lists

The first design approach has the advantage of fewer keystrokes, and may be appropriate in a very well-defined or bounded application. The second approach is more suited to general, or evolving, applications:
Treating the attributes of the lists as true SGML attributes gives flexibility. For example, if one were to provide the option of a glossary in which the items were numbered, then the markup:
 <list gloss ord>
[ this would be <list form="gloss" seq="ord"> in XML – tin-pot ] follows naturally from that already described. With the first approach one would need to define a new element, or add an “ordered” attribute to the GL element (which would be inconsistent with the use of “ordered” as part of the GI of the OL element).

Similarly, a “flowed” list (in which the list is flowed as part of a sentence or paragraph) can be supported by defining flow as an alternative to display for the form= attribute.
If a document is being edited without the aid of an SGML syntax-directed editing system […]
On the other hand this approach implies that in implementing an SGML syntax-directed editing system it is not sufficient to use the DTD to “prompt” and verify valid sub-elements, since certain combinations of attributes imply that only a subset of the sub-elements permitted by the DTD are actually valid. In a <list unord> element only the <li> sub-elements are valid whereas the DTD also permits <lt> [ for “list term” – tin-pot ] sub-elements to occur.

Without going into much detail here, I think it is obvious that CommonMark falls squarely into the “more suited to general, or evolving, applications” camp, and saving keystrokes is pretty irrelevant in parser-generated XML output …

The character (or string?) in the “bullet” role in an (item of an) “unordered” list would go into two places:

In an attribute of the <list> element (which—you’re right—should probably not be a redefined delimiter, not the least because a “+” marker is not a “delimiter”);
In an attribute of each <item> in the list.

It seems to me quite reasonable to use the same name, and the same declared value, for this attribute, like so:

<!ATTLIST (list | item) marker CDATA #IMPLIED >

So a “missing” marker attribute value specification in an <item> element could be “inherited” from the value given in the enclosing <list> element; and vice versa if the <item> element does have a value specified for marker, this would “override” the value given in the enclosing <list>.
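
On the consuming side, the inherit-or-override reading could look like the following sketch. It assumes the `marker` attribute proposed here on both <list> and <item> (neither exists in the current DTD) and uses Python's standard xml.etree.ElementTree; the helper name `effective_markers` is made up.

```python
import xml.etree.ElementTree as ET


def effective_markers(list_elem):
    """Return one marker per <item>: its own value, else the <list> value."""
    inherited = list_elem.get("marker")
    return [item.get("marker", inherited) for item in list_elem.findall("item")]


doc = ET.fromstring(
    '<list type="bullet" tight="true" marker="+">'
    "<item>spring</item>"
    '<item marker="-">summer</item>'
    '<item marker="*">autumn</item>'
    "</list>"
)
print(effective_markers(doc))  # ['+', '-', '*']
```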

Definition list? Did I miss something and CommonMark suddenly has definition lists—would be great!

Dmitry · January 2, 2016, 9:04pm

I’d suggest

<!ATTLIST (unordered_list | item) marker CDATA #REQUIRED>

Since the processing agent would be required to inspect the list marker anyway, marker should be required for unordered lists and for list items of both types (numerals normally serve as markers in ordered lists).

I believe OOD wasn’t very popular back in 1988. In 2016 an abstract list type along with ordered_list and unordered_list specializations seems more fitting IMO.

CommonMark doesn’t, but hopefully CommonMark.NET will soon. Had it had a DTD, that would need to be extended to

<!ATTLIST (unordered_list | item | definition) marker CDATA #REQUIRED>

tin-pot · January 2, 2016, 9:21pm

Well, there is no unordered_list in CommonMark, at least right now. And as I tried to convince you, there probably should not be one, but instead only one <list> element type.

You mean providing in the marker attribute of the <list> and/or <item> element

either the “bullet” character, ie “-”, “+”, “*”, etc (in “unordered” <list> and <item> elements);
or the “number style” or “picture” or “format” (resembling old HTML use), ie “1.”, “a.”, “1)”, “a)” in <list>, and
the same, but with 1 or a replaced by the appropriate numeral in the value specified for marker in <item> attributes.
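
To make the third point concrete, here is a small sketch of how an item's marker value might be derived from the list's “format”. It is purely illustrative: the function name `item_marker` is invented, only the formats listed above are handled, and the “a” style goes beyond what CommonMark currently parses.

```python
def item_marker(list_format, number):
    """Derive an item's marker from a list format such as "1." or "a)"."""
    style, delimiter = list_format[:-1], list_format[-1]
    if delimiter not in ".)":
        raise ValueError("unsupported delimiter: %r" % delimiter)
    if style == "1":
        numeral = str(number)
    elif style == "a":
        if not 1 <= number <= 26:
            raise ValueError("letter style only covers items 1 to 26 here")
        numeral = chr(ord("a") + number - 1)
    else:
        raise ValueError("unsupported numbering style: %r" % style)
    return numeral + delimiter


print(item_marker("1.", 12))  # 12.
print(item_marker("a)", 3))   # c)
```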

That’s a whole lot of changes here …

I believe OOD wasn’t very popular back in 1988.

I guess you weren’t around back then: it was the new and hot topic in programming languages, AFAIR.

In 2016 an abstract list type along with ordered_list and unordered_list specializations seems more fitting IMO.

Whatever jargon floats your boat (in “CommonMark.NET”)—but XML it ain’t. And certainly the current year is the last thing a specification is supposed to fit to, IMO.

[ And you do know that OO models with classes can well and cleanly be defined extensional rather than nominal, do you? That was even the case back then in the 80ies! ]

Dmitry · January 2, 2016, 9:52pm

1. I don’t own CommonMark.NET; I’m a mere contributor.
2. CommonMark.NET is not object-oriented in its current state (although I am working on changing that).
3. The first OOP language was Simula 67. Guess what 67 stood for.
4. “new and hot topic in programming languages” does not directly translate into specifications (quite the opposite, unfortunately).
5. This is an ordered list, which is a specialization of list, which is a specialization of block, which is a specialization of element.

tin-pot · January 2, 2016, 10:03pm

@Dmitry: Oh, come on … Is this really necessary?

1. I don’t own CommonMark.NET; I’m a mere contributor.

Yes, fine. I didn’t state or imply otherwise, as far as I can tell.

2. CommonMark.NET is not object-oriented in its current state (although I am working on changing that).

Okay, good for you (and probably CommonMark.NET!)

3. The first OOP language was Simula 67. Guess what 67 stood for.

Wow, I’d never known about Simula without your help, thank you so much! [Not, obviously.] And this has what to do with “mixed-marker bullet lists” again?

4. “new and hot topic in programming languages” does not directly translate into specifications (quite the opposite, unfortunately).

I agree, and I’d even say that nothing translates directly into specifications, for otherwise no specification would be needed. [ Kidding aside, I’m not sure what you mean by this assertion anyway. ]

5. This is an ordered list, which is a specialization of list, which is a specialization of block, which is a specialization of element.

Ehem: there is no block which is a specialization of element, at least not in CommonMark (not in the specification, and not in the DTD). There is no ordered list element either, but I think I get what you mean.

If you’d now be so kind and define what you mean by “specialization” (it must be a relation between element types from your use of the word, but which one?) in the given context, namely elements in a document content model? Liskov-substitutable maybe? A projection/embedding pair?

Dmitry · January 2, 2016, 10:26pm

No. I wish there weren’t a post length minimum.

tin-pot · January 2, 2016, 10:37pm

When I first answered your remark, it didn’t occur to me that you were probably trying to make fun of me quoting from a TR published in (gasp!) 1988, discussing a question concerning document model design in the context of (horrors!) SGML. If that’s the case: no offense taken, but let me be clear about my position in this matter:

When it comes to document description languages—and all of the discussion topics around here fall under this umbrella, or do you think otherwise?—when it comes to topics in this area there are some people, with names like Charles Goldfarb, James Clark, Norman Walsh, Tim Bray (to name a few), whose texts and comments I tend to take rather seriously; for the simple reason that they have much more experience and insight into this field than I have or I’m pretty sure any of us here has.

Now comes the twist: some of these people were actually alive “back then”, and had in the year 1988 already several years, if not decades, of experience with precisely the design decisions discussed in TR 9573 [ I do not know, but I would bet that Goldfarb had his hand in authoring it, btw ].

I don’t know about your age, or background and experience in the subject at hand. But let me just say that a comment like yours above does not reflect favorably on any of those; it rather makes you look as if you just barely learned a fashionable concept and now go around throwing it at every problem at hand. Which is not something to be ashamed of (we all did it some time or the other), but it is not a license or valid reason to feel automatically above all knowledge and insight accumulated by others.

Back to somewhat more concrete topics: throwing “.NET” and “OOD” jargon around is all nice and dandy, but the goal here is a specification (as you seem to realize in one of your enumerated remarks) on how some textual input is transformed into some “structured” output. The latter concept is, to my regret, only defined in a rather fuzzy way (as “the AST”, being more or less equivalent to a type-valid XML document instance conforming to the CommonMark DTD, as far as I can tell).

So if you have anything to contribute in this area, maybe a formal, simple, abstract, object-oriented meta-model for the element structure of said output, go ahead: I’d very much welcome it.

But don’t forget to specify the mapping into XML (or rather: XML Infoset).