Unicode Character 'BULLET' (U+2022)

kagan · September 4, 2014, 11:46pm

Thanks to Unicode, we can write the following bulleted list in plain text, without translating it to HTML:

• Hello
• World!

Again, what you see above is NOT the following code, but plain text:

<ul>
  <li>Hello</li>
  <li>World!</li>
</ul>

If you forget to add two trailing spaces after “Hello”, the markdown would convert it to

<p>• Hello
• World!</p>

and the browser would render that all in one line, which would look like:

• Hello • World!

This obviously makes the original text worse.

Therefore, I propose to extend the list of list markers -, +, or * with the unicode bullet character (U+2022) in section 5.2.
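To make the proposal concrete, here is a minimal Python sketch of what "extending the list of markers" could mean for a parser. It is not taken from the spec or from any implementation: `bullet_item_pattern` and `parse_item` are hypothetical helpers, and the rule is heavily simplified (up to three spaces of indentation, one marker, at least one space, then the item text).

```python
import re

ASCII_MARKERS = "-+*"                        # markers the spec already allows
PROPOSED_MARKERS = ASCII_MARKERS + "\u2022"  # plus U+2022 BULLET (the proposal)


def bullet_item_pattern(markers):
    """Build a simplified bullet-list-item pattern for the given marker set."""
    # Up to three spaces of indentation, one marker, at least one space, text.
    return re.compile(r"^ {0,3}([" + re.escape(markers) + r"]) +(.*)$")


def parse_item(line, pattern):
    """Return (marker, text) if the line starts a list item, else None."""
    m = pattern.match(line)
    return (m.group(1), m.group(2)) if m else None


current = bullet_item_pattern(ASCII_MARKERS)
proposed = bullet_item_pattern(PROPOSED_MARKERS)

print(parse_item("\u2022 Hello", current))   # not a list item under the current set
print(parse_item("\u2022 Hello", proposed))  # a list item under the proposed set
print(parse_item("- Hello", proposed))       # existing markers keep working
```

In this sketch the change is one extra character in the marker set; lines that use -, + or * are parsed exactly as before.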

zzzzBov · September 5, 2014, 1:21am

I’m on the fence as to whether I agree with adding U+2022 to the list of special characters that a compliant markdown parser would be required to support for bullet notation.

Can you point to instances of the bullet character being used “in the wild” for plaintext documentation? It seems like it would be awkward to try to insert a bullet character instead of the standard -, +, and * characters.

roryokane · September 5, 2014, 1:37am

I personally use the bullet character whenever I have to write a list in plaintext. It is rare that I actually have to write in plain text and need a list – I usually write in Markdown – but it has happened, such as one long-form survey question I answered in a browser textarea.

On OS X, you can insert a bullet • with the key combination Option+8. I don’t mind typing real bullets on OS X because it is convenient. When I use Windows, which requires you to hold Alt and type a four-digit code to type a symbol, I don’t bother to use real bullets. So it seems that this feature mainly caters to OS X users.

kagan · September 5, 2014, 7:09am

I can’t. However I would very much like to use the •. I think it is natural to use it, because it is there. Just like using any other punctuation character in the ASCII Table.

The current line break rules prevent me from doing so. The question is, why do I need to convert a piece of text into something else, if it is already perfect in its raw form?

Thanks, I didn’t even know that. Now I will use that in future.

How easy or difficult it is to insert that character should not drive the standard. User interfaces change all the time. Who knows, Windows and Linux might make it easier to access the • some time soon too. New markdown editors might make it easier, or you might even see new keyboards with the • some day.

The point is: The • is standardized. The spirit of markdown is that a plain text document without conversion to HTML should still look good. The conversion to HTML must remain a bonus, not a necessity. The original markdown rules make a bulleted list using bullets look uglier than the input.

nagisa · September 5, 2014, 7:48am

Not really. People with a proper compose key setup can write most Unicode symbols and characters with relative ease.

arthur_peka · September 5, 2014, 8:24am

I personally like the idea - it is simpler than the original list syntax, but inserting Unicode symbols doesn’t seem convenient to me.
Maybe • could be substituted by + sign?
So you could write:
+Hello
+World

Or even by * sign:

*Hello
*World

Looks more readable to me.

kagan · September 5, 2014, 1:27pm

The current spec allows +, - and *. This works already with pretty much all implementations. My point was to add the bullet character to this list, not to replace.

In my opinion, it is not relevant for a standard, if it is inconvenient for some people. For many it is.

And even if it was inconvenient for all, that would be still irrelevant. More important is the following question:

What happens when this is not implemented?

Answer:

Markdown destroys a plain text “bulleted” (literally) list by placing the whole list in one line.

arthur_peka · September 5, 2014, 1:45pm

Never mind, my proposition was to remove the space between the marker (+, -, *, whatever) and the item, so that you write +Item, not + Item. I’m not so sure about that now, though. It’s a little easier to parse IMO, but maybe less readable.

roryokane · September 6, 2014, 6:32pm

Wikipedia – Compose key has more information, for others who didn’t know about it.

Jim_Balter · September 7, 2014, 10:48pm

Some people have 21st century keyboards, and they will become more prevalent.

The original point remains unaddressed; for those who do use a bullet, the CommonMark output mangles the input.

Even if so, what’s wrong with making CommonMark work well for them?

rwzy · September 8, 2014, 1:30am

The point of markdown is also to derive its syntax from common characters available on the vast majority of keyboards, making it ‘easy’ to do formatting and such. I understand that your keyboard supports it easily enough, but most people’s don’t…

So one reason for not having it in the spec as a list marker is that people who can’t easily input the bullet (which is a lot of people, at least right now) might receive a markdown document that uses bullets. To edit it, they’d either have to change all the bullets to a marker they can easily type, or input bullets with difficulty, which is inconvenient, hard, and therefore against the philosophy of markdown for them (which = most people). They’d need to do that because different list markers indicate a new list, so they can’t just mix your bullet with their preferred list marker, be it -, + or *, between different list items.
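A rough Python sketch of the "different markers start a new list" rule, in case it helps. `split_into_lists` is a made-up helper, and it assumes every input line has the form marker, one space, text; real parsers handle far more cases.

```python
from itertools import groupby


def split_into_lists(lines):
    """Group consecutive items that share a marker; a marker change starts a new list."""
    # Assumption: each line is "<marker><space><text>", e.g. "- item".
    items = [(line[0], line[2:]) for line in lines]
    return [
        [text for _, text in group]
        for _, group in groupby(items, key=lambda item: item[0])
    ]


# Someone appends a "-" item to a list that was written with bullets:
mixed = ["\u2022 one", "\u2022 two", "- three"]
print(split_into_lists(mixed))  # two separate lists instead of one
```

So an editor who adds a `-` item to a bulleted document ends up with a second list rather than a longer first one, unless they retype the existing markers.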

I thought the same with emacs and grid tables. Where emacs relates to your keyboard, and grid tables relate to the bullet character.

What’s a 21st century keyboard?

Jim_Balter · September 8, 2014, 2:48am

I didn’t say anything about my keyboard. The discussion is about a standard.

I doubt that I really have to explain what I meant by that.

I acknowledge that your accessibility/lowest common denominator argument is a good one that I hadn’t considered. I tried to come up with a rebuttal but couldn’t … kudos to you.

At the same time almost all new programming languages accept Unicode. More and more people will be receiving programs and other documents containing Unicode characters and will have to deal with them. This will eventually lead to everyone having easy ways to input them. When that happens, this issue can be revisited. But I think you give a compelling argument for why now is not the time.

rwzy · September 8, 2014, 6:00am

Sorry if it came off in the wrong tone, it was a genuine question! I was thinking it was some type of new keyboard which allowed easier unicode character input or something.

I had (wrongly) assumed your references to a 21st century keyboard were referring to your keyboard too. Sorry.

seantek · September 10, 2014, 11:40am

I do not think that U+2022 should be added to the list of special characters for bulleted list interpretations.

U+2022 is a Unicode character. Markdown, however, is agnostic to character set: neither Gruber’s original specification nor CommonMark says anything about which character set Markdown content is to be authored in. Control characters are all drawn from the POSIX Portable Character Set, i.e., US-ASCII (without assuming particular scalar values for particular characters). Gruber’s Perl script also does not assume particular encodings: what you put in is what you get out. It should stay that way.

kagan · September 10, 2014, 2:31pm

I wish that would work. But it doesn’t. When I enter a bulleted list using the unicode character, markdown turns it to

<p>• Hello
• World!</p>

Which is rendered by the browser to

• Hello • World!

That means input is not equal to output. Therefore, it cannot stay that way, as you also required, and the bullet should be added to the list of markers.

Additionally, adding U+2022 to the list of list markers in the CommonMark spec does not harm those who do not use any Unicode.

Virtlink · September 12, 2014, 3:10pm

If I copy and paste a bulleted list from Microsoft Word into my plain text editor, I also get the Unicode bullets.

So, to summarize. The U+2022 Unicode bullet character:

is standardized by Unicode;
is used by some users to create a correct plain text list;
is easily typed by OS X users;
is pasted when a list is copied from Word (and possibly other text editors and browsers).

Advantages of adding the U+2022 Unicode bullet character support to Markdown:

Markdown no longer messes up seemingly correct lists that use this character.

Disadvantages:

Markdown has to add support for a non-POSIX Portable Character;
therefore implementations have to recognize Unicode;
support is currently not present in other variants.

In my opinion the disadvantages (that I could come up with) don’t outweigh the advantages. Adding a standardized character from Unicode (the most comprehensive, most standardized, most widely used character ‘set’) is not a downside to me. Users seeing their list getting messed up when copied from Word or typed with the standardized bullet character is a much bigger issue.

So, I support adding the U+2022 Unicode bullet character to CommonMark.

instanceofme · September 12, 2014, 3:36pm

Those may not be the only ones. Implementations would need to:

support Unicode parsing, which, depending on the tools you use, can be cumbersome (e.g. javascript in the browser)
know (or infer, but that may not be acceptable) the encoding of the input text, whereas it can currently just deal with ASCII chars and ignore others with a lot of very common encodings (e.g. UTF-8, ISO-8859-*…)

As for the rendering of a list bulleted with • , you can add two spaces at the end of the lines so that it would still get a line break. While I get that it is not what you want, you can get your “output ≥ input” with only a minor inconvenience.
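The difference the two trailing spaces make can be sketched in a few lines of Python. This is only a toy model of paragraph rendering, not a real Markdown implementation: `render_paragraph` is a made-up helper that knows a single rule, namely that a non-final line ending in two or more spaces gets a hard break.

```python
import html


def render_paragraph(lines):
    """Toy paragraph renderer: soft breaks stay newlines, two trailing spaces give <br />."""
    out = []
    for i, line in enumerate(lines):
        is_last = i == len(lines) - 1
        hard_break = line.endswith("  ") and not is_last
        text = html.escape(line.rstrip(" "), quote=False)
        if hard_break:
            text += "<br />"
        out.append(text)
    return "<p>" + "\n".join(out) + "</p>"


soft = render_paragraph(["\u2022 Hello", "\u2022 World!"])    # no trailing spaces
hard = render_paragraph(["\u2022 Hello  ", "\u2022 World!"])  # two trailing spaces
print(soft)
print(hard)
```

Without the trailing spaces the newline survives only as whitespace inside the `<p>`, which a browser collapses; with them, the `<br />` keeps each bullet on its own line.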

Virtlink · September 12, 2014, 4:02pm

Wouldn’t implementations need to understand the encoding to be able to parse a document at all? For example, UTF-16 is completely different from UTF-8. Even with single-byte character encodings I’m quite sure byte values below 128 (thus US-ASCII) are also used as part of byte sequences that are valid code points in other encodings. So, ignoring all encoding and treating all text as US-ASCII wouldn’t work.

As implementations have to determine which encoding they’re reading, you might as well expect them to support UTF-8 (of all possible encodings), and therefore Unicode.
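A quick way to check this is to look at the actual bytes. The Python sketch below uses only the standard codecs; `ascii_range_bytes` is a helper name invented here. It lists which bytes of an encoded character fall below 128 and could therefore be mistaken for ASCII by a parser that ignores the encoding.

```python
def ascii_range_bytes(data):
    """Return the bytes of an encoded string that fall in the US-ASCII range (< 0x80)."""
    return [b for b in data if b < 0x80]


bullet = "\u2022"  # U+2022 BULLET

utf8 = bullet.encode("utf-8")        # three bytes, all >= 0x80
utf16 = bullet.encode("utf-16-le")   # two bytes: 0x22 ('"') and 0x20 (space)
sjis = "\u30bd".encode("shift_jis")  # KATAKANA LETTER SO: second byte is 0x5C ('\')

print(ascii_range_bytes(utf8))
print(ascii_range_bytes(utf16))
print(ascii_range_bytes(sjis))
```

UTF-8 never reuses byte values below 128 inside a multi-byte sequence, so scanning for ASCII markers happens to be safe there. UTF-16 and Shift_JIS do reuse them, so for those a parser has to know the encoding first.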

kagan · September 12, 2014, 4:03pm

Let’s compare this statement with this one:

I would say, a parser, which chooses not to support unicode, can simply ignore the unicode characters.

I also think it is not wise for a parser not to support unicode. All modern operating systems, libraries etc. support it. I can’t imagine why someone would want to write software that doesn’t support unicode.

Not supporting Unicode means excluding all those who write something in any language other than English. I don’t think a spec like this can call itself “standard” or “common” by excluding the majority of text in the world.

No. This is not a minor inconvenience. This is a major annoyance. One should not force the user to do anything at all just to prevent markdown from messing the text up as I have shown in my original example.

In addition, I agree to the below quote:

I see two options to ensure this:

Adding U+2022 to the list of list markers
Getting rid of the two spaces. See the discussion about line breaks.

instanceofme · September 12, 2014, 4:22pm

No, I mean that the parser can skip the non-ASCII characters, since it would be a given that those are not significant in the syntax. Those characters would still be outputted, of course.

I’m talking about the significant, parsed bits. If you use, say, regexes to parse them, then you need proper Unicode support in your regex engine. That might not be the case, even though your parser still reads your input correctly.
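As an illustration only, here is a Python sketch of a byte-oriented parser working on UTF-8 input. The patterns and `marker_of` are made up for this example. The ASCII-only pattern passes non-ASCII text through untouched; recognizing the bullet without decoding means spelling out its UTF-8 byte sequence by hand.

```python
import re

# Byte-level patterns: no decoding and no Unicode support in the regex engine.
ASCII_ITEM = re.compile(rb"^([-+*]) +(.*)$")
# U+2022 is three bytes in UTF-8, so it must be an alternation; putting the
# bytes in a character class would match any one of them individually.
BULLET_ITEM = re.compile(rb"^([-+*]|\xe2\x80\xa2) +(.*)$")


def marker_of(line, pattern):
    """Return the marker bytes if the line is a list item, else None."""
    m = pattern.match(line)
    return m.group(1) if m else None


line_ascii = "- na\u00efve caf\u00e9".encode("utf-8")  # ASCII marker, non-ASCII text
line_bullet = "\u2022 Hello".encode("utf-8")           # bullet marker

print(marker_of(line_ascii, ASCII_ITEM))    # marker found, text passed through as bytes
print(marker_of(line_bullet, ASCII_ITEM))   # not recognized as a list item
print(marker_of(line_bullet, BULLET_ITEM))  # recognized via the explicit byte sequence
```

Note that the hand-written byte sequence is only valid for UTF-8; any other input encoding would need its own sequence, or a decoding step before parsing.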