There are subtle differences in how HTML/XML/CommonMark treat <?...> blocks. I thought I’d start a discussion about why the syntax difference might matter. The spec lawyering, with chapter and verse, is at the end.
- In HTML: <?...> is a comment.
- In XML: <?...?> is a processing instruction.
- In CommonMark: <?...?> is an HTML block called a processing instruction.
The lexical ambiguity allows an attacker to create token boundaries apparent to the Markdown parser that are significantly different from those apparent to a browser parsing the corresponding HTML.
Consider an HTML-producing implementation that
- correctly escapes all non-HTML-block content,
- whitelists HTML blocks that are tags,
- allows comments and processing instructions through unchanged,
- strips all other HTML blocks, and
- might include privileged content in the same origin as the rendered HTML.
This would seem to restrict the set of tags that a Markdown author could use.
It does not, though. Take this input:
<? ... > outside bogus comment as far as the HTML parser is concerned, but inside processing instruction as far as Markdown is concerned. <img src=bogus onerror=alert(1337) <!-- ?> <!-->
In HTML, this document alerts, since it parses to
- A comment node with content “? ... ”
- A text node with the verbiage
- An img element with 4 attributes: src, onerror, <!--, and ?
- An HTML comment “<!-->”
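The mismatch in token boundaries can be checked mechanically. What follows is a minimal sketch, not a real parser: the two helper names are hypothetical, and each models only the closing rule in question (the first > for an HTML bogus comment, the first ?> for a CommonMark/XML processing instruction). It ignores CommonMark's line-based block handling and everything else a real tokenizer does.

```python
# The example input from above, joined onto one line.
PAYLOAD = (
    "<? ... > outside bogus comment as far as the HTML parser is concerned, "
    "but inside processing instruction as far as Markdown is concerned. "
    "<img src=bogus onerror=alert(1337) <!-- ?> <!-->"
)


def html_bogus_comment_end(text: str, start: int = 0) -> int:
    """Index just past the construct as HTML sees it: closed by the first '>'."""
    if not text.startswith("<?", start):
        raise ValueError("text does not start with '<?' at the given offset")
    close = text.find(">", start + 2)
    # An unterminated bogus comment runs to the end of the input.
    return len(text) if close == -1 else close + 1


def commonmark_pi_end(text: str, start: int = 0) -> int:
    """Index just past the construct as CommonMark/XML sees it: closed by '?>'."""
    if not text.startswith("<?", start):
        raise ValueError("text does not start with '<?' at the given offset")
    close = text.find("?>", start + 2)
    # Simplification (assumption): an unterminated PI swallows the rest.
    return len(text) if close == -1 else close + 2


html_end = html_bogus_comment_end(PAYLOAD)
cm_end = commonmark_pi_end(PAYLOAD)

print(repr(PAYLOAD[:html_end]))             # '<? ... >'
print("<img" in PAYLOAD[html_end:cm_end])   # True
```

The second line printed is the point: the img tag sits outside the span the browser treats as a comment, but inside the span a CommonMark processor would pass through as a single processing instruction.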
An implementor could work around this by banning comments and processing instructions, but presumably CommonMark recognizes PIs because there is a reason to allow processing instructions in trusted inputs. There might be legitimate use-cases for processing instructions in untrusted-but-trustworthy inputs.
If making it easier for implementors to preserve HTML blocks in untrusted inputs is a goal, CommonMark could do any of
- Change the syntax of a processing instruction so that the content between <? and ?> does not contain >, to match the intersection of HTML bogus comment and XML processing instruction.
- Change the syntax to <?...> to match HTML's bogus comment.
- Specify that processing instructions be emitted as <!--....-->, or be dropped if they contain "--", to consistently contribute a comment node.
- I couldn’t find any Security Considerations section for implementors, but such a section could note that processing instructions are lexically ambiguous and should be stripped.
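Until the spec changes, an implementor could approximate the first option on their own side. The check below is a sketch under stated assumptions: is_unambiguous_pi is a hypothetical name, and it expects to be handed exactly one candidate processing instruction as a string, with no trailing content after the closing ?>. It accepts a candidate only when the first > and the first ?> end at the same position, so the HTML parser and the Markdown parser agree on where the construct stops.

```python
def is_unambiguous_pi(candidate: str) -> bool:
    """Return True if both parsers would delimit `candidate` identically.

    Hypothetical helper: accepts only strings of the form <?...?> whose
    single '>' is the final character, so the first '>' (end of an HTML
    bogus comment) and the first '?>' (end of an XML/CommonMark
    processing instruction) coincide.
    """
    if len(candidate) < 4:
        return False  # too short to hold "<?" and "?>" without overlap
    if not (candidate.startswith("<?") and candidate.endswith("?>")):
        return False
    # Any '>' in the body would close the HTML bogus comment early.
    return ">" not in candidate[2:-2]


print(is_unambiguous_pi("<?php echo 1 ?>"))  # True
print(is_unambiguous_pi(
    "<? ... > <img src=bogus onerror=alert(1337) <!-- ?>"
))  # False
```

Blocks that fail the check could be stripped or escaped rather than passed through. Which of those is appropriate depends on the implementation.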
The first option seems similar to the approach taken for “<!--...-->”, where the XML syntax is preferred and valid HTML comments like “<!-->” are simply treated as non-HTML-block content.
In CommonMark, HTML Blocks says:
Start condition: line begins with the string <?.
End condition: line contains the string ?>.
and later this is defined thus:
A processing instruction consists of the string <?, a string of characters not including the string ?>, and the string ?>.
This syntax matches the XML spec†.
Per HTML 5, “<?...>” is a “bogus” comment node.
After a ‘<’, the HTML parser is in the tag-open state and ‘?’ leads to the bogus-comment state, which is closed by the first “>”, not “?>”. Even in a foreign content context, this contributes a DOM CommentNode, not a ProcessingInstructionNode.
† - modulo NUL ∈ character which may be addressed by the note on Insecure characters which