How exactly should parsers handle U+0000?

Continuing the discussion from Specify when (if ever) parsers should give up:

The spec indeed states (under §2.1 Characters and lines):

For security reasons, a conforming parser must strip or replace the Unicode character U+0000.

“Strip” is clear.
“Replace” is ambiguous.

Surely a parser can’t replace U+0000 with just anything. Was a particular replacement character intended? If so, which one?

This also makes it sound like a conforming parser can independently decide which strategy to use, resulting in different output between implementations.

I suggest we keep it simple and specify that U+0000 must be stripped.

I can’t remember the source, but I recall a recommendation to replace broken Unicode byte sequences with 0xFFFE. That could be good for Markdown too.

Currently in markdown-it we strip zero characters, but I’d prefer to replace them if an exact replacement character is specified.

There’s a special Unicode “replacement character” (U+FFFD) – that’s what was intended, and that’s what cmark uses. I think that might be better than stripping it, because it leaves a trace that there was something funny about the input.
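
To make the difference between the two strategies concrete, here is a minimal Python sketch. The function names `strip_nul` and `replace_nul` are hypothetical and are not taken from cmark or markdown-it; the sketch only illustrates the behaviour discussed above.

```python
NUL = "\u0000"
REPLACEMENT = "\ufffd"  # Unicode REPLACEMENT CHARACTER


def strip_nul(text: str) -> str:
    """Remove every U+0000; nothing in the result hints that it was there."""
    return text.replace(NUL, "")


def replace_nul(text: str) -> str:
    """Swap every U+0000 for U+FFFD, so the oddity stays visible."""
    return text.replace(NUL, REPLACEMENT)


source = "foo\u0000bar"
stripped = strip_nul(source)    # "foobar"
replaced = replace_nul(source)  # "foo\ufffdbar"
```

With stripping, `foo\u0000bar` and `foobar` become indistinguishable after parsing; with replacement, the length is preserved and the U+FFFD marks where the unexpected character was.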


Replacing with the Unicode replacement character U+FFFD sounds good. I don’t mind, as long as it is clear what should happen; and preferably no “or” option for different implementations to choose from.

One small disadvantage of U+FFFD is that it is not an ASCII character, which matters if the user or their tools want to stick with ASCII.
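
The ASCII concern is easy to demonstrate. The Python snippet below is only an illustration, assuming the replacement has already been made; the numeric character reference at the end is one possible workaround for HTML output, not something the spec prescribes.

```python
replaced = "foo\ufffdbar"

# U+FFFD takes three bytes in UTF-8 (EF BF BD).
utf8_bytes = replaced.encode("utf-8")

# Strict ASCII encoding fails, because U+FFFD is outside the ASCII range.
try:
    replaced.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# For HTML output, a numeric character reference keeps the bytes pure ASCII.
ascii_html = replaced.encode("ascii", errors="xmlcharrefreplace")
```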

Alternative approach

Another approach is to remove this “must” requirement from CommonMark itself. This specification is really about defining the CommonMark syntax and its parsing to an abstract syntax tree – where U+0000 is not a security problem if it is implemented properly.

The handling of U+0000 seems better stated as a suggestion, a should or may, instead of a mandatory must.

The required handling of U+0000 then becomes an issue for the output component rather than the CommonMark parsing component. For example, “HTML must not contain U+0000”. Who does the stripping/replacing, the parser or the output generator, then becomes an implementation detail.

The only time this implementation detail makes a difference is if the implementation is an API, where the user accesses the abstract syntax tree. The should/may warning is sufficient. If an API really wants to preserve the U+0000 (e.g. perhaps so error messages can be shown to the user about which parts of the document contained unexpected content) it should be allowed to – the current “must” prevents that.
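
As a rough sketch of this split, assuming a toy parser and renderer (`TextNode`, `parse_text` and `render_html` are hypothetical names, not part of any CommonMark implementation), the parser could preserve U+0000 in the tree and record where it occurred, while the HTML output stage does the replacement:

```python
from dataclasses import dataclass, field
from html import escape
from typing import List

NUL = "\u0000"
REPLACEMENT = "\ufffd"


@dataclass
class TextNode:
    """Hypothetical AST leaf that keeps the source text untouched."""
    literal: str
    nul_offsets: List[int] = field(default_factory=list)


def parse_text(source: str) -> TextNode:
    """Stand-in for a parser: preserves U+0000 and notes where it occurs."""
    offsets = [i for i, ch in enumerate(source) if ch == NUL]
    return TextNode(literal=source, nul_offsets=offsets)


def render_html(node: TextNode) -> str:
    """Output stage: HTML must not contain U+0000, so replace it here."""
    return "<p>" + escape(node.literal.replace(NUL, REPLACEMENT)) + "</p>"


node = parse_text("a\u0000b & c")
html_out = render_html(node)
```

An API user inspecting `node.nul_offsets` can report exactly where the unexpected characters were, while the rendered HTML still never contains U+0000.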