There are subtle differences in how HTML, XML, and CommonMark treat <?...> blocks. I thought I’d start a discussion about why the syntax difference might matter. The spec lawyering, with chapter and verse, is at the end.
- In HTML: <?...> is a comment.
- In XML: <?...?> is a processing instruction.
- In CommonMark: <?...?> is an HTML block called a processing instruction.
Problem
The lexical ambiguity allows an attacker to create token boundaries apparent to the Markdown parser that are significantly different from those apparent to a browser parsing the corresponding HTML.
Consider an HTML producing implementation that
- correctly escapes all non-HTML block content,
- whitelists HTML blocks that are tags,
- allows comments and processing instructions through unchanged,
- strips all other HTML blocks, and
- might include privileged content in the same origin as the rendered HTML.
This would seem to restrict the set of tags that a markdown author could use.
It does not though.
<? ... >
outside bogus comment as far as the HTML parser is concerned,
but inside processing instruction as far as Markdown is concerned.
<img src=bogus onerror=alert(1337)
<!-- ?> <!-->
In HTML, this document alerts since it parses to
- A comment node with content “...”
- A text node with the verbiage
- An img element with 4 attributes: src="bogus", onerror="alert(1337)", <!--="", and ?=""
- An HTML comment: <!-->
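The mismatch in token boundaries can be sketched in a few lines of Python. This is a rough model of the two rules, not a real parser: the function names are made up for illustration, the CommonMark side ignores the permitted leading indentation, and the HTML side only models "bogus comment ends at the first >".

```python
def commonmark_pi_block(lines, start):
    """Lines of the CommonMark HTML block that begins at lines[start].

    Models the start/end conditions only: the block opens on a line that
    begins with '<?' and closes on the first line that contains '?>'.
    """
    if not lines[start].startswith("<?"):
        raise ValueError("line does not begin with '<?'")
    for i in range(start, len(lines)):
        if "?>" in lines[i]:
            return lines[start:i + 1]
    return lines[start:]  # never closed: block runs to the end of the input


def html_bogus_comment(text, start):
    """Text an HTML tokenizer consumes as a bogus comment from text[start]."""
    if not text.startswith("<?", start):
        raise ValueError("text does not begin with '<?'")
    end = text.find(">", start)
    return text[start:] if end == -1 else text[start:end + 1]


DOC = (
    "<? ... >\n"
    "outside bogus comment as far as the HTML parser is concerned,\n"
    "but inside processing instruction as far as Markdown is concerned.\n"
    "<img src=bogus onerror=alert(1337)\n"
    "<!-- ?> <!-->\n"
)

md_block = commonmark_pi_block(DOC.splitlines(), 0)
html_comment = html_bogus_comment(DOC, 0)
live_html = DOC[len(html_comment):]

print(len(md_block))        # 5: Markdown sees one PI spanning every line
print(repr(html_comment))   # '<? ... >': HTML closes the comment on line 1
print("<img" in live_html)  # True: the img tag is live markup for the browser
```

Everything the Markdown side passes through "unchanged" as a single processing instruction is, past the first line, ordinary markup to the browser.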
An implementor could work around this by banning comments and processing instructions, but presumably CommonMark recognizes PIs because there is a reason to allow processing instructions in trusted inputs. There might be legitimate use-cases for processing instructions in untrusted-but-trustworthy inputs.
Possible Fixes
If making it easier for implementors to preserve HTML blocks in untrusted inputs is a goal, CommonMark could do any of the following:
1. Change the syntax of processing instruction to <?...?> where ... does not contain >, to match the intersection of HTML bogus comment and XML processing instruction.
2. Change the syntax to <?...> to match HTML processing instruction.
3. Specify that <?...> translates to <!--...-->, or is dropped if ... contains "--", to consistently contribute a comment node.
4. I couldn’t find any Security Considerations section for implementors, but such a section could note that processing instructions are lexically ambiguous and should be stripped.
(1) seems similar to the approach taken for “<!--...-->” where the XML syntax is preferred and valid HTML comments like “<!-->” are simply treated as non-HTML-block content.
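As a rough illustration of options (1) and (3), here is a minimal Python sketch. The helper names are hypothetical and this is not how any CommonMark implementation does it. One assumption goes beyond the proposal above: besides "--", the sketch also drops bodies that start with > or ->, because an HTML tokenizer closes a comment immediately on <!--> and <!--->.

```python
def pi_body(pi):
    """Return the text between '<?' and '?>' of a CommonMark-style PI."""
    if len(pi) < 4 or not (pi.startswith("<?") and pi.endswith("?>")):
        raise ValueError("not of the form <?...?>")
    return pi[2:-2]


def is_unambiguous_pi(pi):
    """Option 1: accept only PIs whose body contains no '>'.

    With no '>' in the body, the first '>' an HTML tokenizer sees is the
    one in the closing '?>', so both parsers agree on where the PI ends.
    """
    return ">" not in pi_body(pi)


def pi_to_comment(pi):
    """Option 3: rewrite the PI as an HTML comment, or drop it ('')."""
    body = pi_body(pi)
    # '--' is the case named in the proposal; the two prefix checks are an
    # added assumption covering comments that HTML would close early.
    if "--" in body or body.startswith(">") or body.startswith("->"):
        return ""
    return "<!--" + body + "-->"


attack = "<? ... >\n<img src=bogus onerror=alert(1337)\n<!-- ?>"
print(is_unambiguous_pi(attack))             # False
print(repr(pi_to_comment(attack)))           # '' (dropped)
print(pi_to_comment("<?xml-stylesheet a?>")) # <!--xml-stylesheet a-->
```

Either check rejects the attack document from the Problem section, since its PI body contains both > and <!--.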
Spec Lawyering
In CommonMark, HTML Blocks says
Start condition: line begins with the string <?.
End condition: line contains the string ?>.
and later this is defined thus
A processing instruction consists of the string <?, a string of characters not including the string ?>, and the string ?>.
This syntax matches the XML spec†.
Per HTML 5, “<?...>” is a “bogus” comment node.
After a ‘<’, the HTML parser is in the tag-open state and ‘?’ leads to the bogus-comment state which is closed by the first “>”, not “?>”. Even in a foreign content context, this contributes a DOM CommentNode, not a ProcessingInstructionNode.
† - modulo NUL ∈ character which may be addressed by the note on Insecure characters which s/\0/\uFFFD/g.