Spec section 2.3: Handling of NUL character

tin-pot · January 10, 2016, 12:47am

In [section 2.3][spec-23] (“Insecure characters”) the specification states:

For security reasons, the Unicode character U+0000 must be replaced with the replacement character (U+FFFD).

I think this should be replaced with a requirement to simply discard U+0000 NULL (NUL) from the input text.

For similar reasons, U+007F DELETE should probably be discarded too, but NUL is the more problematic case.

Rationale

The reason for this suggestion is that NUL is commonly used (1) as a string terminator, and (2) as padding in coded-character data elements such as string buffers and fixed-width fields. If every NUL in the text must be replaced by U+FFFD, then padding NULs must be replaced too, because a processor usually cannot tell them apart from NULs that belong to the text. This would insert or append unwanted and unneeded U+FFFD REPLACEMENT CHARACTERs in the processed text.
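To make the difference concrete, here is a minimal sketch of both policies applied to a NUL-padded field. It assumes the input is a UTF-8 byte buffer; the helper names `discard_nul` and `replace_nul` are made up for this illustration and are not taken from the spec or any implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes from src to dst, dropping every NUL.
   dst must have room for at least n bytes. Returns the bytes written. */
size_t discard_nul(const char *src, size_t n, char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (src[i] != '\0')
            dst[out++] = src[i];
    }
    return out;
}

/* Hypothetical helper: copy n bytes from src to dst, replacing every NUL
   with U+FFFD encoded as UTF-8 (EF BF BD). dst must have room for
   3 * n bytes. Returns the bytes written. */
size_t replace_nul(const char *src, size_t n, char *dst)
{
    static const char fffd[3] = { '\xEF', '\xBF', '\xBD' };
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (src[i] == '\0') {
            memcpy(dst + out, fffd, sizeof fffd);
            out += sizeof fffd;
        } else {
            dst[out++] = src[i];
        }
    }
    return out;
}
```

For an 8-byte field holding `abc` followed by five padding NULs, `discard_nul` yields the 3 bytes of `abc`, while `replace_nul` yields 18 bytes: `abc` followed by five U+FFFD characters that were never part of the text.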

Example: A Standard C library <stdio.h> binary stream may

[…] have an implementation-defined number of null characters appended to the end of the stream. – (ISO/IEC 9899:2011, clause 7.21.2)

An application reading such a stream has no way to distinguish trailing NUL characters that belong to the text from padding null characters.

May such an application truncate the input text at the first encountered NUL character? It seems the spec says “no”.

To avoid appending spurious U+FFFD characters, the application would have to read until EOF while counting or buffering NUL characters, then decide whether they were padding or text, replace only the text NULs with U+FFFD, and discard the trailing NULs from the text anyway. This seems like too much effort for very little gain.
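The bookkeeping described above might look roughly like the following sketch. It is only one possible reading: it assumes UTF-8 output, the function name `copy_replacing_interior_nul` is hypothetical, and it treats any run of NULs that reaches EOF as padding, which is a guess the application cannot verify.

```c
#include <stdio.h>

/* Hypothetical sketch: copy `in` to `out`. A run of NULs is treated as text
   (each NUL replaced by U+FFFD in UTF-8) only once a non-NUL byte follows it.
   A run that reaches EOF is assumed to be padding and is dropped.
   Returns the number of trailing NULs dropped. */
size_t copy_replacing_interior_nul(FILE *in, FILE *out)
{
    size_t pending = 0; /* NULs seen but not yet classified */
    int c;

    while ((c = getc(in)) != EOF) {
        if (c == '\0') {
            pending++;          /* defer the decision */
            continue;
        }
        /* A non-NUL byte follows, so the pending NULs were interior. */
        for (; pending > 0; pending--)
            fputs("\xEF\xBF\xBD", out);
        putc(c, out);
    }
    return pending;             /* never written: treated as padding */
}
```

Even this version silently drops genuine trailing NULs from the text, so it ends up discarding some NULs regardless, which is part of the argument for simply discarding all of them.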

[ Reading the input text from a text stream poses similar or worse problems. ]