Invalid Unicode Code Points

jackdw · February 9, 2020, 2:07am

In the preface for example 312 (322 for the GFM), there is the line:

Invalid Unicode code points will be replaced by the REPLACEMENT CHARACTER (U+FFFD).

I have been looking for a couple of days now for a good definition of an invalid Unicode code point, with little success. Is there an easy reference that I am missing somewhere? Does it depend on something installed on the system itself, or is there a global check?

Please advise.

mity · February 9, 2020, 9:36am

I would read it as “anything invalid or ill-formed” with respect to the document encoding used or assumed by the implementation.

That would include, for example, any code point larger than U+10FFFF, or any ill-formed multi-byte UTF-8 sequence (when assuming UTF-8).

EDIT: For a more exhaustive description, see e.g. the Unicode Standard, version 12, especially section 3.9 on the Unicode encoding forms.
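
To illustrate the "ill-formed UTF-8" case, here is a minimal Python sketch. It assumes the input is meant to be UTF-8 and relies on Python's built-in decoder with `errors="replace"`, which substitutes U+FFFD for bytes it cannot decode. The helper name `decode_lossy` and the sample byte strings are made up for this example, and how many replacement characters are emitted per bad sequence can differ between implementations.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_lossy(data: bytes) -> str:
    """Decode bytes as UTF-8, replacing ill-formed sequences with U+FFFD."""
    return data.decode("utf-8", errors="replace")


# Hypothetical samples, each ill-formed for a different reason.
samples = {
    "stray byte 0xFF": b"abc\xffdef",
    "encodes a value above U+10FFFF": b"\xf4\x90\x80\x80",
    "encodes a surrogate (U+D800)": b"\xed\xa0\x80",
}

for label, raw in samples.items():
    text = decode_lossy(raw)
    print(f"{label}: contains U+FFFD = {REPLACEMENT in text}")
```

Well-formed input passes through unchanged, so the same function can be applied to a whole document before parsing.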

jgm · February 9, 2020, 4:33pm

Code points greater than 0x10FFFF are invalid. (Unicode standard.)
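
For a numeric character reference, that rule amounts to a range check. A minimal Python sketch follows; the function name `char_for_code_point` and the `reject_surrogates` flag are hypothetical. Treating the surrogate range U+D800 to U+DFFF as invalid is an assumption of this example, not something stated in this thread, so it is kept as an option.

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode code space
REPLACEMENT = "\ufffd"     # U+FFFD REPLACEMENT CHARACTER


def char_for_code_point(cp: int, reject_surrogates: bool = True) -> str:
    """Return the character for cp, or U+FFFD if cp is not usable."""
    if cp < 0 or cp > MAX_CODE_POINT:
        # Outside the code space: invalid per the Unicode Standard.
        return REPLACEMENT
    if reject_surrogates and 0xD800 <= cp <= 0xDFFF:
        # Assumption: surrogates cannot stand alone as characters.
        return REPLACEMENT
    return chr(cp)


print(char_for_code_point(0x41))              # a valid code point
print(repr(char_for_code_point(0x110000)))    # one past the maximum
```

Nothing here depends on what is installed on the system: the upper bound is fixed by the Unicode Standard itself.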