Lexical ambiguity in Processing Instruction syntax

There are subtle differences in how HTML, XML, and CommonMark treat <?...> constructs. I thought I’d start a discussion about why the syntax difference might matter. The spec lawyering, with chapter and verse, is at the end.


  • In HTML: <?...> is a comment (a “bogus comment” that ends at the first >).
  • In XML: <?...?> is a processing instruction.
  • In CommonMark: <?...?> is an HTML block called a processing instruction.

Problem

The lexical ambiguity lets an attacker craft input whose token boundaries, as seen by the Markdown parser, are significantly different from those seen by a browser parsing the corresponding HTML.

Consider an HTML-producing implementation (a rough sketch of its filtering logic follows the list) that

  • correctly escapes all non-HTML-block content,
  • whitelists HTML blocks that are tags,
  • allows comments and processing instructions through unchanged,
  • strips all other HTML blocks, and
  • might include privileged content in the same origin as the rendered HTML.
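
A minimal sketch of such a filter in Python (the function name, whitelist, and classification are illustrative, not taken from any particular implementation):

    import re

    # Hypothetical whitelist of tags the renderer is willing to emit.
    TAG_WHITELIST = {"a", "b", "em", "i", "img", "p", "strong"}

    def filter_html_block(block: str) -> str:
        """Decide what to do with one raw-HTML block emitted by the
        Markdown parser; classification mirrors how the block starts."""
        if block.startswith("<!--") or block.startswith("<?"):
            # Comments and processing instructions pass through unchanged;
            # this is the behaviour the example below exploits.
            return block
        m = re.match(r"</?([a-zA-Z][a-zA-Z0-9-]*)", block)
        if m and m.group(1).lower() in TAG_WHITELIST:
            return block
        return ""  # strip all other HTML blocks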

This would seem to restrict the set of tags that a Markdown author could use. It does not, though. Consider this input (the prose lines in the middle are part of the input):

    <? ... >

    outside bogus comment as far as the HTML parser is concerned,
    but inside processing instruction as far as Markdown is concerned.

    <img src=bogus onerror=alert(1337)

    <!-- ?> <!-->
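
To CommonMark, everything from the opening <? through the line containing ?> is a single processing-instruction HTML block, so the whole thing is emitted as raw HTML and survives a filter like the one sketched above. A minimal way to check this, assuming the Python commonmark package (a port of the reference implementation) is installed:

    import commonmark

    md = """<? ... >

    outside bogus comment as far as the HTML parser is concerned,
    but inside processing instruction as far as Markdown is concerned.

    <img src=bogus onerror=alert(1337)

    <!-- ?> <!-->
    """

    print(commonmark.commonmark(md))
    # The entire input should come back verbatim as one raw HTML block,
    # because the end condition "?>" is not met until the last line.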

In HTML, by contrast, this document alerts, since it parses to

  1. A comment node from the bogus comment "<? ... >" (its content runs to the first >),
  2. A text node containing the two lines of prose,
  3. An img element with 4 attributes: src="bogus", onerror="alert(1337)", <!--="", and ?="", and
  4. An HTML comment from the trailing "<!-->".
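
This can be verified with any spec-compliant HTML5 parser; for example, in Python, assuming the html5lib package is installed:

    import html5lib
    from xml.dom import Node

    doc = """<? ... >

    outside bogus comment as far as the HTML parser is concerned,
    but inside processing instruction as far as Markdown is concerned.

    <img src=bogus onerror=alert(1337)

    <!-- ?> <!-->
    """

    def walk(node, depth=0):
        """Print a rough outline of the DOM produced by the HTML5 parser."""
        pad = "  " * depth
        if node.nodeType == Node.COMMENT_NODE:
            print(pad + "#comment " + repr(node.data))
        elif node.nodeType == Node.TEXT_NODE and node.data.strip():
            print(pad + "#text " + repr(node.data.strip()[:40]))
        elif node.nodeType == Node.ELEMENT_NODE:
            print(pad + "<" + node.nodeName + "> " + repr(dict(node.attributes.items())))
        for child in node.childNodes:
            walk(child, depth + 1)

    walk(html5lib.parse(doc, treebuilder="dom"))
    # The dump should show the comment node, the text node, an img element
    # whose attributes include onerror, and a trailing empty comment.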

An implementor could work around this by banning comments and processing instructions, but presumably CommonMark recognizes PIs because there is a reason to allow processing instructions in trusted inputs. There might be legitimate use cases for processing instructions in untrusted-but-trustworthy inputs.


Possible Fixes

If making it easier for implementors to preserve HTML blocks in untrusted inputs is a goal, CommonMark could do any of the following:

  1. Change the processing-instruction syntax to <?...?> where ... does not contain >, to match the intersection of HTML bogus comment and XML processing instruction.
  2. Change the syntax to <?...>, ending at the first >, to match the way HTML treats it (as a bogus comment).
  3. Specify that <?...> translates to <!--...--> (or is dropped if ... contains "--") so that it consistently contributes a comment node; a rough sketch of this translation appears below.
  4. Add a Security Considerations section for implementors (I couldn’t find one); it could note that processing instructions are lexically ambiguous and should be stripped from untrusted input.

(1) seems similar to the approach taken for “<!--...-->” where the XML syntax is preferred and valid HTML comments like “<!-->” are simply treated as non-HTML-block content.
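
For an implementor applying something like (3) or (4) today, a conservative post-processing step over processing-instruction blocks would do; a rough sketch (the function name is illustrative):

    def pi_block_to_comment(block: str) -> str:
        """Sketch of fix (3): rewrite a CommonMark processing-instruction
        block as an HTML comment, so that both parsers agree it contributes
        a comment node, or drop it when that cannot be done safely."""
        inner = block.strip()[2:-2]  # strip the leading "<?" and trailing "?>"
        # "--" in the body, or a body starting with ">" or "->", would let
        # an HTML parser close the comment early, so drop such blocks.
        if "--" in inner or inner.startswith((">", "->")):
            return ""
        return "<!--" + inner + "-->"

    print(pi_block_to_comment("<?php echo 1 ?>"))  # <!--php echo 1 -->
    print(pi_block_to_comment("<?x -- y?>"))       # dropped (empty string)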


Spec Lawyering

In CommonMark, the HTML blocks section says

    Start condition: line begins with the string <?.
    End condition: line contains the string ?>.

and a processing instruction is later defined thus:

    A processing instruction consists of the string <?, a string of characters not including the string ?>, and the string ?>.
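
Expressed as a regular expression, that production is roughly the following (a sketch; not necessarily the exact pattern any implementation uses):

    import re

    # "<?", then the shortest possible run of any characters, then "?>".
    PROCESSING_INSTRUCTION = re.compile(r"<\?[\s\S]*?\?>")

    print(PROCESSING_INSTRUCTION.match("<?php echo 1; ?>"))  # matches
    print(PROCESSING_INSTRUCTION.match("<? ... >"))          # None: no "?>"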

This syntax matches the XML spec.†

Per HTML5, “<?...>” produces a “bogus comment” node.
After a “<”, the HTML parser is in the tag-open state, and “?” leads to the bogus-comment state, which is closed by the first “>”, not by “?>”. Even in a foreign-content context, this contributes a DOM Comment node, not a ProcessingInstruction node.
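
The “closed by the first >” behaviour is easy to observe with a spec-compliant parser; for example, in Python, assuming html5lib is installed:

    import html5lib

    # Parse a fragment whose would-be processing instruction contains a ">".
    frag = html5lib.parseFragment("<?target data > trailing ?>",
                                  treebuilder="dom")
    for node in frag.childNodes:
        print(node.nodeType, repr(getattr(node, "data", node.nodeName)))
    # Expected: a comment node whose data is "?target data ", then a text
    # node " trailing ?>"; the "?>" closes nothing in HTML.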

† - modulo NUL ∈ character, which may be addressed by the note on Insecure characters, which does s/\0/\uFFFD/g.

Thanks for raising this issue, and for the suggested fixes. Mostly we don’t care too much about how HTML elements are classified; we just want to pass through raw HTML. And, as far as possible, we’d like to be neutral about the HTML version. So, we’d like <?php sections in HTML 4 to be passed through. Can these contain unescaped >? If so, it might be a problem for solution #2, which otherwise seems the simplest and best. I’ve never used PHP so I have no idea.

> So, we’d like <?php sections in HTML 4 to be passed through. Can these contain unescaped >? If so, it might be a problem for solution #2, which otherwise seems the simplest and best. I’ve never used PHP so I have no idea.

Yes, that is a problem for all of #1, #2 and #3.

I’m not that familiar with PHP either, but my understanding is that > is an operator and there is no angle-bracket-safe alternative like the gt of JSP’s expression language.
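
To make that concrete, the block below contains an unescaped > (PHP’s comparison operator), so any rule that ends or invalidates the processing instruction at the first > breaks it. A hypothetical illustration, assuming the Python commonmark package is installed:

    import commonmark

    md = "<?php if ($a > $b): ?>bigger<?php endif; ?>\n"
    print(commonmark.commonmark(md))
    # Under the current spec the whole line is one raw HTML block and is
    # passed through intact; under #1, #2 and #3 the ">" inside the PHP
    # condition would end (or invalidate) the processing instruction early.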

Should implementations that take untrusted inputs have the freedom to alter their output, dropping or rewriting tokens that are controversial (interpreted differently by different parsers), or is output based on an untrusted input itself untrusted, and therefore something that should be passed through a third-party sanitizer?