Unicode Character 'BULLET' (U+2022)

Wouldn’t implementations need to understand the encoding to be able to parse a document at all? For example, UTF-16 is completely different from UTF-8. Even with single-byte character encodings I’m quite sure byte values below 128 (thus US-ASCII) are also used as part of byte sequences that are valid code points in other encodings. So, ignoring all encoding and treating all text as US-ASCII wouldn’t work.

As implementations have to determine which encoding they’re reading, you might as well expect them to support UTF-8 (of all possible encodings), and therefore Unicode.

1 Like