Treatment of Unicode BOM (U+FEFF)

Currently the spec says nothing about this. The BOM is used at the beginning of some UTF-8 and UTF-16 files and is not really part of the content. With the current spec, a word indented four spaces after a BOM at the beginning of the file will not be parsed as an indented code block, because the line then starts with U+FEFF rather than with a space, and that seems wrong.

We could add language to the spec that a BOM at the beginning of the source (or perhaps any BOM in the source) should simply be ignored, treated as if it isn’t there.
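To make the two options concrete, here is a minimal Python sketch. The helper names (`strip_leading_bom`, `strip_all_boms`, `looks_like_indented_code`) are made up for illustration and are not taken from any CommonMark implementation, and the indentation check is a deliberately naive stand-in for the real block-parsing rules.

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark as a decoded character


def strip_leading_bom(text: str) -> str:
    """Option 1: ignore a single BOM at the very start of the source."""
    return text[1:] if text.startswith(BOM) else text


def strip_all_boms(text: str) -> str:
    """Option 2: ignore every U+FEFF anywhere in the source."""
    return text.replace(BOM, "")


def looks_like_indented_code(line: str) -> bool:
    """Naive stand-in for the spec rule: four leading spaces."""
    return line.startswith("    ")


source = BOM + "    foo"

# Without stripping, the first character is U+FEFF, so the check fails.
print(looks_like_indented_code(source))                     # False
# After stripping the leading BOM, the line is seen as indented.
print(looks_like_indented_code(strip_leading_bom(source)))  # True
```

The two options only differ for input that contains U+FEFF somewhere other than the first position: option 1 leaves it in place, option 2 removes it.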

Thoughts? The Wikipedia article on BOM is informative. Apparently U+FEFF used to have a content role as a zero-width nonbreaking space, but in recent versions of Unicode that has been deprecated, so it only functions as a BOM.

IMHO BOM is legacy shit and can be safely stripped.

A BOM should only occur in files, but a lot of CM/MD parsing is not file-based (think <textarea>). If a parser does not enforce a certain output encoding it should probably keep the original BOM. For parsing purposes, it must always be treated as if it weren’t there.
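One way to read "keep the original BOM, but parse as if it weren't there" is sketched below in Python. This assumes the tool emits text in the same encoding it received; `split_bom` and `process` are hypothetical names, and `transform` stands in for whatever the parser or renderer actually does.

```python
from typing import Callable, Tuple

BOM = "\ufeff"  # U+FEFF


def split_bom(text: str) -> Tuple[str, str]:
    """Separate a leading BOM (if any) from the rest of the text."""
    if text.startswith(BOM):
        return BOM, text[1:]
    return "", text


def process(text: str, transform: Callable[[str], str]) -> str:
    """Run `transform` on the BOM-free body, then put the BOM back."""
    bom, body = split_bom(text)
    return bom + transform(body)
```

With this shape the parsing code never sees the BOM, and the output starts with one exactly when the input did.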

What makes you say they only apply to files? Any network stream, for example, can use a BOM.

What about the UTF-8 “signature” (BOM)? Explicitly disambiguating UTF-8, ASCII, Windows-1252, &c is pretty handy at times.

A BOM may appear in network streams (which I would consider file-like), but is there actually any browser that adds one, or can even be triggered to add one, upon form submission? That’s one major use case for CommonMark. However, my conclusions remain the same.

One approach would be to say that this is an implementation issue; implementations should strip out the BOM when appropriate, but the spec doesn’t need to say anything about it. The spec just operates on strings of characters, and abstracts from their source (file, network stream, web form).
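As an illustration of the "implementation issue" route, here is a minimal Python sketch in which the BOM is dropped at the point where bytes become a string, so the parser itself only ever sees characters. `decode_source` is a hypothetical name, and the sketch assumes the input is UTF-8; Python's built-in `utf-8-sig` codec decodes UTF-8 and skips a leading BOM if one is present.

```python
import codecs


def decode_source(raw: bytes) -> str:
    """Decode UTF-8 input, dropping a leading BOM if there is one."""
    return raw.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + b"    code"

# The plain "utf-8" codec keeps the BOM as a U+FEFF character.
print(with_bom.decode("utf-8").startswith("\ufeff"))  # True
# "utf-8-sig" removes it, so the four spaces come first.
print(decode_source(with_bom) == "    code")          # True
```

Input that arrives already decoded (for example from a web form) never passes through this step, which is where the question of whether the spec should say something comes back in.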

I’m not sure whether it’s better to go that route, or to put something in the spec.