I do not think that U+2022 should be added to the list of special characters for bulleted list interpretations.
U+2022 is a Unicode character. Markdown, however, is agnostic to character set: neither Gruber’s original specification nor CommonMark says anything about which character set Markdown content is to be authored in. Control characters are all drawn from the POSIX Portable Character Set, i.e., US-ASCII (without assuming particular scalar values for particular characters). Gruber’s Perl script also does not assume particular encodings: what you put in is what you get out. It should stay that way.
If I copy and paste a bulleted list from Microsoft Word into my plain text editor, I also get the Unicode bullets.
So, to summarize. The U+2022 Unicode bullet character:
is standardized by Unicode;
is used by some users to create a correct plain text list;
is easily typed by OS X users;
is pasted when a list is copied from Word (and possibly other text editors and browsers).
Advantages of adding the U+2022 Unicode bullet character support to Markdown:
Markdown no longer messes up seemingly correct lists that use this character.
Disadvantages:
Markdown has to add support for a character outside the POSIX Portable Character Set;
therefore implementations have to recognize Unicode;
support is currently not present in other variants.
In my opinion the disadvantages (that I could come up with) don’t outweigh the advantages. Adding a standardized character from Unicode (the most comprehensive, most standardized, most widely used character ‘set’) is not a downside to me. Users seeing their list getting messed up when copied from Word or typed with the standardized bullet character is a much bigger issue.
So, I support adding the U+2022 Unicode bullet character to CommonMark.
Those may not be the only ones. Implementations would need to:
know (or infer, but that may not be acceptable) the encoding of the input text, whereas they can currently just deal with ASCII chars and ignore all others, which works with a lot of very common encodings (e.g. UTF-8, ISO-8859-*…)
As for the rendering of a list bulleted with •, you can add two spaces at the end of the lines so that it would still get a line break. While I get that it is not what you want, you can get your “output ≥ input” with only a minor inconvenience.
Wouldn’t implementations need to understand the encoding to be able to parse a document at all? For example, UTF-16 is completely different from UTF-8. Even with single-byte character encodings I’m quite sure byte values below 128 (thus US-ASCII) are also used as part of byte sequences that are valid code points in other encodings. So, ignoring all encoding and treating all text as US-ASCII wouldn’t work.
As implementations have to determine which encoding they’re reading, you might as well expect them to support UTF-8 (of all possible encodings), and therefore Unicode.
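To make the point about byte values below 128 concrete, here is a minimal Python sketch. The helper name `ascii_bytes_in` is made up for illustration; the codecs are the standard ones shipped with Python. It lists which bytes of an encoded character would look like US-ASCII to a parser that ignores the encoding.

```python
def ascii_bytes_in(text, encoding):
    """Return the byte values below 128 in the encoded form of `text`.

    Hypothetical helper, for illustration only.
    """
    return [b for b in text.encode(encoding) if b < 0x80]


# UTF-8: every byte of a non-ASCII character is >= 0x80,
# so none of them can be mistaken for an ASCII marker.
print(ascii_bytes_in("\u2022", "utf-8"))      # []

# UTF-16-LE: U+2022 is stored as 0x22 0x20, which read as ASCII
# is a double quote followed by a space.
print(ascii_bytes_in("\u2022", "utf-16-le"))  # [34, 32]

# Shift_JIS: the second byte of KATAKANA LETTER SO (U+30BD)
# is 0x5C, the ASCII backslash.
print(ascii_bytes_in("\u30bd", "shift_jis"))  # [92]
```

So UTF-8 happens to be safe for an ASCII-only parser, while UTF-16 and Shift_JIS are not.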
I would say a parser that chooses not to support Unicode can simply ignore the Unicode characters.
I also think it is not wise for a parser not to support Unicode. All modern operating systems, libraries etc. support it. I can’t imagine why someone would want to write software that doesn’t support Unicode.
Not supporting Unicode means excluding all those who write in any language other than English. I don’t think a spec like this can call itself “standard” or “common” while excluding the majority of text in the world.
No. This is not a minor inconvenience. This is a major annoyance. One should not force the user to do anything at all just to prevent Markdown from messing the text up, as I have shown in my original example.
No, I mean that the parser can skip the non-ASCII characters, since it would be a given that those are not significant in the syntax. Those characters would still be outputted, of course.
I’m talking about the significant, parsed bits. If you use, say, regexes to parse them, then you need proper Unicode support in your regex engine. That might not be the case, even though your parser still reads your input correctly.
Not necessarily. A simple parser that would match significant markers on ASCII chars and blindly output all other bytes would still work fine with one-byte encodings like ISO-8859-* and one-byte-or-non-ASCII encodings like UTF-8.
Admittedly it won’t work with UTF-16, though. I’m just saying that including Unicode markers raises the bar for implementations.
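Here is a rough sketch of the kind of "simple parser" I mean, in Python. It is not any real implementation: `render_lines` and `ASCII_MARKERS` are made-up names, and it only handles flat bullet lists. It works on raw bytes, matches the markers as ASCII, and copies everything else through untouched.

```python
ASCII_MARKERS = (b"* ", b"+ ", b"- ")  # list markers, matched as raw bytes


def render_lines(data):
    """Turn ASCII-bulleted lines into <ul>/<li>, never decoding the input."""
    out = []
    in_list = False
    for line in data.split(b"\n"):
        if line.startswith(ASCII_MARKERS):
            if not in_list:
                out.append(b"<ul>")
                in_list = True
            # Bytes after the marker are passed through blindly.
            out.append(b"<li>" + line[2:] + b"</li>")
        else:
            if in_list:
                out.append(b"</ul>")
                in_list = False
            out.append(line)
    if in_list:
        out.append(b"</ul>")
    return b"\n".join(out)


source = "- caf\u00e9\nplain"
for encoding in ("utf-8", "iso-8859-1"):
    # The same code works for both encodings without knowing which one it got.
    print(render_lines(source.encode(encoding)).decode(encoding))

# With UTF-16 the marker is b"-\x00 \x00", so no list is recognized.
print(b"<ul>" in render_lines(source.encode("utf-16-le")))  # False
```

Recognizing • in this design would mean either decoding the input first or hard-coding the byte sequence of one specific encoding.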
Because text is not HTML, you have to make concessions, like requiring two spaces at the end of a line to get a line break, otherwise you cannot have a well-formatted text file OR you get line breaks every other sentence in the output. Markdown already works like this in almost all implementations, except for some that are not destined to be used with text files (e.g. embedded in an app or website).
So when you say “it doesn’t [work]”, you mean that, like most markdown implementations today, it doesn’t work for this quite specific case of both using the • bullet and not wanting to append those two spaces.
Consider the alternatives:
Lose the encoding-agnosticism: backward-incompatible in a major way, plus lots of drawbacks
always breaking lines: backward-incompatible in a major way, losing the ability to have clean text files
I think not supporting • as a list item marker is the lesser annoyance.
CommonMark (as I understand it) is supposed to smooth out differences between all of the implementations out there, and smooth out the ambiguities in Gruber’s writeup. Its purpose is not necessarily to add new features. Regardless of the character set issue, it looks and smells like a new feature.
No, I don’t agree, @cirosantilli. There are already so many ways to make lists:
* this is a list
+ this is a list
- this is a list
So why not the Unicode bullet? It seems a rather safe and minor change to me, and it would help real users I see doing this in the wild a fair bit. That’s the main thing I care about: the combination of “easy, minor” and “seen in the wild a lot.”
Casual writers who do not know how to insert • may be confused when editing someone else’s Unicode bullet list.
*, + and - are all easily accessible from a standard keyboard. Pressing Option+8 on the Mac to insert an additional • list item is not intuitive. If the writer does not know the key combination then they must use copy/paste, search for how to insert •, or rewrite all of the existing list markers.
No other Markdown syntax requires knowledge of key combinations (besides Shift of course).
I’m still on the fence for this change. So I’ll voice my thoughts for and against, and see if anyone can sufficiently shoot them down.
IIRC, all special characters in Markdown are in the ASCII range. Adding support for U+2022 would require that conforming implementations support unicode rather than ASCII, making the standard more restrictive.
Limited support from existing implementations. If they didn’t need it before, why do we need it now?
More divergence from the original markdown.
This particular change seems like it belongs in a Unicode extension, which could add all relevant Unicode bullet characters to the list of acceptable characters, including (but not limited to):
• (U+2022) bullet
‣ (U+2023) triangular bullet
⁃ (U+2043) hyphen bullet
⁌ (U+204C) black leftwards bullet
⁍ (U+204D) black rightwards bullet
∙ (U+2219) bullet operator
◦ (U+25E6) white bullet
Additionally, unicode numerals beyond 0-9 could be supported for numeric lists.
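As a rough idea of what such an extension could look like, here is a Python sketch. The function name `list_item_text` and the `unicode_extension` flag are hypothetical and not part of any spec or implementation, and it ignores CommonMark's indentation rules. It works on decoded text and simply widens the set of accepted markers with the bullets listed above.

```python
ASCII_BULLETS = {"*", "+", "-"}

# The bullet characters listed above, by code point.
UNICODE_BULLETS = {
    "\u2022",  # bullet
    "\u2023",  # triangular bullet
    "\u2043",  # hyphen bullet
    "\u204C",  # black leftwards bullet
    "\u204D",  # black rightwards bullet
    "\u2219",  # bullet operator
    "\u25E6",  # white bullet
}


def list_item_text(line, unicode_extension=False):
    """Return the item text if `line` is a bullet list item, else None."""
    markers = (ASCII_BULLETS | UNICODE_BULLETS) if unicode_extension else ASCII_BULLETS
    stripped = line.lstrip(" ")
    # A marker must be followed by a space to count as a list item.
    if len(stripped) >= 2 and stripped[0] in markers and stripped[1] == " ":
        return stripped[2:]
    return None


print(list_item_text("\u2022 item"))                          # None
print(list_item_text("\u2022 item", unicode_extension=True))  # item
print(list_item_text("- item"))                               # item
```

With the flag off, a line starting with • stays ordinary paragraph text, which matches how the ASCII-only rules behave today.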
On the other hand:
“making the standard more restrictive” isn’t a strong argument. Unicode support isn’t particularly difficult when the character sets are well defined.
Many existing implementations don’t have support for fenced code blocks either, but utility was favored over popularity to reduce ambiguity.
Divergence from the original markdown is almost unavoidable, as the original is practically abandonware. If the original were to use semantic versioning, it’d be 1.0, and this would be the 2.0 spec due to the known API incompatibility.
Why go through all the extra effort to define the unicode alternatives to the core characters and then leave it up to implementors to choose whether or not they should implement an optional extension? Adding the unicode alternatives to core could help to make markdown more portable between conforming implementations.