There are subtle differences in how HTML, XML, and CommonMark treat `<?...>` blocks. I thought I’d start a discussion about why the syntax difference might matter. The spec lawyering, with chapter and verse, is at the end.
- In HTML: `<?...>` is a comment.
- In XML: `<?...?>` is a processing instruction.
- In CommonMark: `<?...?>` is an HTML block called a processing instruction.
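To make the difference concrete, here is a minimal Python sketch of where each reading ends the construct. The function names are made up for illustration, and treating an unterminated construct as running to the end of the input is a simplifying assumption.

```python
def html_bogus_comment_end(text: str, start: int = 0) -> int:
    """Index just past the construct as an HTML parser sees it: the first '>'."""
    assert text.startswith("<?", start)
    end = text.find(">", start + 2)
    return len(text) if end == -1 else end + 1


def xml_pi_end(text: str, start: int = 0) -> int:
    """Index just past the construct as XML and CommonMark see it: the first '?>'."""
    assert text.startswith("<?", start)
    end = text.find("?>", start + 2)
    return len(text) if end == -1 else end + 2


sample = "<? a > b ?> c"
print(repr(sample[:html_bogus_comment_end(sample)]))  # '<? a >'
print(repr(sample[:xml_pi_end(sample)]))              # '<? a > b ?>'
```

Everything between the two end points (` b ?>` in the sample) is inside the construct for one parser and outside it for the other.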
Problem
The lexical ambiguity allows an attacker to create token boundaries apparent to the Markdown parser that are significantly different from those apparent to a browser parsing the corresponding HTML.
Consider an HTML-producing implementation that
- correctly escapes all non-HTML-block content,
- whitelists HTML blocks that are tags,
- allows comments and processing instructions through unchanged,
- strips all other HTML blocks, and
- might include privileged content in the same origin as the rendered HTML.
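A rough sketch of such an implementation, in Python, might look like the following. It is a toy, not a Markdown renderer: the `render` function, the `ALLOWED_TAGS` whitelist, and the three regular expressions are all assumptions made up for illustration, and anything that is not a recognized tag, comment, or processing instruction is simply escaped.

```python
import html
import re

# Hypothetical whitelist; a real deployment would pick its own.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}

# Token shapes as the Markdown side sees them (toy patterns, not the CommonMark grammar).
PI = re.compile(r"<\?.*?\?>", re.DOTALL)         # "<?" up to the first "?>"
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)   # "<!--" up to the first "-->"
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def render(source: str) -> str:
    out = []
    pos = 0
    while pos < len(source):
        for pattern in (PI, COMMENT, TAG):
            match = pattern.match(source, pos)
            if match:
                break
        else:
            # Not an HTML construct: escape the character and move on.
            out.append(html.escape(source[pos]))
            pos += 1
            continue
        if pattern is TAG and match.group(1).lower() not in ALLOWED_TAGS:
            pass  # strip tags that are not whitelisted
        else:
            out.append(match.group(0))  # whitelisted tag, comment, or PI: unchanged
        pos = match.end()
    return "".join(out)


print(render("<script>x</script> <b>ok</b> & <?php echo 1 ?>"))
# x <b>ok</b> &amp; <?php echo 1 ?>
```

The point of interest is the `PI` pattern: it decides where a processing instruction ends by looking for `?>`, and whatever it matches is emitted verbatim.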
This would seem to restrict the set of tags that a Markdown author could use.
It does not, though.
<? ... >
outside bogus comment as far as the HTML parser is concerned,
but inside processing instruction as far as Markdown is concerned.
<img src=bogus onerror=alert(1337)
<!-- ?> <!-->
In HTML, this document alerts since it parses to
- A comment node with content `...`
- A text node with the verbiage
- An `img` element with 4 attributes: `src="bogus"`, `onerror="alert(1337)"`, `<!--=""`, and `?=""`
- An HTML comment `<!-->`.
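The parse above can be reproduced with a small hand-rolled tokenizer. This is only a sketch of the relevant slice of the HTML tokenizer (comments, `<?` bogus comments, start tags with attributes, text); end tags, character references, attribute-name lowercasing, and duplicate-attribute handling are deliberately left out, and the `tokenize` function and token tuples are illustrative names, not any library's API.

```python
import re

TAG_NAME = re.compile(r"<([A-Za-z][^\s/>]*)")
SKIP = re.compile(r"[\s/]*")
# Attribute name, then an optional value: "double", 'single', or unquoted.
ATTR = re.compile(r"""([^\s/>][^\s/>=]*)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]*)))?""")


def tokenize(doc):
    tokens, pos = [], 0
    while pos < len(doc):
        tag = TAG_NAME.match(doc, pos)
        if doc.startswith("<!--", pos):
            if doc.startswith(">", pos + 4):
                data, pos = "", pos + 5           # "<!-->" closes abruptly: empty comment
            else:
                close = doc.find("-->", pos + 4)
                close = len(doc) if close == -1 else close
                data, pos = doc[pos + 4:close], min(close + 3, len(doc))
            tokens.append(("comment", data))
        elif doc.startswith("<?", pos):
            # Bogus comment: runs to the first ">", and its data includes the "?".
            close = doc.find(">", pos + 1)
            close = len(doc) if close == -1 else close
            tokens.append(("comment", doc[pos + 1:close]))
            pos = min(close + 1, len(doc))
        elif tag:
            pos, attrs = tag.end(), []
            while True:
                pos = SKIP.match(doc, pos).end()
                if pos >= len(doc) or doc[pos] == ">":
                    pos += 1
                    break
                attr = ATTR.match(doc, pos)
                value = next((g for g in attr.groups()[1:] if g is not None), "")
                attrs.append((attr.group(1), value))
                pos = attr.end()
            tokens.append(("start_tag", tag.group(1), attrs))
        else:
            nxt = doc.find("<", pos + 1)
            nxt = len(doc) if nxt == -1 else nxt
            tokens.append(("text", doc[pos:nxt]))
            pos = nxt
    return tokens


document = (
    "<? ... >\n"
    "outside bogus comment as far as the HTML parser is concerned,\n"
    "but inside processing instruction as far as Markdown is concerned.\n"
    "<img src=bogus onerror=alert(1337)\n"
    "<!-- ?> <!-->"
)
for token in tokenize(document):
    print(token)
```

Run on the example, this yields a comment, a text token, an `img` start tag with the four attributes listed above, a single-space text token, and an empty comment. Strictly, the first comment's data is `? ... ` because the tokenizer keeps the leading `?`.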
An implementor could work around this by banning comments and processing instructions, but presumably CommonMark recognizes PIs because there is a reason to allow processing instructions in trusted inputs. There might be legitimate use-cases for processing instructions in untrusted-but-trustworthy inputs.
Possible Fixes
If making it easier for implementors to preserve HTML blocks in untrusted inputs is a goal, CommonMark could do any of the following:
1. Change the syntax of processing instructions to `<?...?>` where `...` does not contain `>`, to match the intersection of HTML bogus comment and XML processing instruction.
2. Change the syntax to `<?...>` to match the HTML treatment of processing instructions.
3. Specify that `<?...>` translates to `<!--...-->`, or is dropped if `...` contains `--`, so that it consistently contributes a comment node.
4. I couldn’t find any Security Considerations section for implementors, but such a section could note that processing instructions are lexically ambiguous and should be stripped.
(1) seems similar to the approach taken for `<!--...-->`, where the XML syntax is preferred and valid HTML comments like `<!-->` are simply treated as non-HTML-block content.
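As a rough illustration of the first and third options, here is a Python sketch. The names `STRICT_PI`, `is_unambiguous_pi`, and `pi_to_comment` are hypothetical. The extra check for a body starting with `>` or `->` is my own assumption beyond what is proposed above: an HTML parser closes `<!-->` and `<!--->` immediately, so such a body would reopen the same hole.

```python
import re
from typing import Optional

# Option 1: "<?", a body with no ">" at all, then "?>".
STRICT_PI = re.compile(r"<\?[^>]*\?>")


def is_unambiguous_pi(text: str) -> bool:
    """True if HTML and XML/CommonMark would agree on where the construct ends."""
    return STRICT_PI.fullmatch(text) is not None


def pi_to_comment(pi: str) -> Optional[str]:
    """Option 3: rewrite '<?...?>' as '<!--...-->', or return None to drop it."""
    assert pi.startswith("<?") and pi.endswith("?>") and len(pi) >= 4
    body = pi[2:-2]
    if "--" in body:
        return None  # would end the comment early
    if body.startswith(">") or body.startswith("->"):
        return None  # assumption: "<!-->" and "<!--->" close the comment at once
    return "<!--" + body + "-->"


print(is_unambiguous_pi("<?php echo 1 ?>"))  # True
print(is_unambiguous_pi("<? a > b ?>"))      # False
print(pi_to_comment("<?php echo 1 ?>"))      # <!--php echo 1 -->
print(pi_to_comment("<?a--b?>"))             # None
```

Either way, the rendered output contains a construct whose end both parsers agree on, which is the property the naive pass-through lacks.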
Spec Lawyering
In CommonMark, HTML Blocks says
Start condition: line begins with the string `<?`.
End condition: line contains the string `?>`.
and later this is defined thus
A processing instruction consists of the string `<?`, a string of characters not including the string `?>`, and the string `?>`.
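Read literally, the start and end conditions amount to a line scan like the one sketched below in Python. The `pi_block_span` helper is a made-up name, and the sketch ignores details such as leading indentation and enclosing container blocks.

```python
from typing import List, Optional, Tuple


def pi_block_span(lines: List[str], start: int) -> Optional[Tuple[int, int]]:
    """Return (first, last) line indexes of the '<?' HTML block starting at `start`."""
    if not lines[start].startswith("<?"):
        return None  # start condition not met
    for index in range(start, len(lines)):
        if "?>" in lines[index]:
            return (start, index)  # end condition met on this line
    return (start, len(lines) - 1)  # never closed: runs to the end of the document


example = [
    "<? ... >",
    "outside bogus comment as far as the HTML parser is concerned,",
    "but inside processing instruction as far as Markdown is concerned.",
    "<img src=bogus onerror=alert(1337)",
    "<!-- ?> <!-->",
]
print(pi_block_span(example, 0))  # (0, 4)
```

All five lines of the earlier example fall inside one HTML block, because the first `?>` only appears on the last line.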
This syntax matches the XML spec†.
Per HTML 5, `<?...>` is a “bogus” comment node.
After a `<`, the HTML parser is in the tag-open state, and `?` leads to the bogus-comment state, which is closed by the first `>`, not `?>`. Even in a foreign content context, this contributes a DOM CommentNode, not a ProcessingInstructionNode.
† Modulo NUL ∈ character, which may be addressed by the note on Insecure characters, which applies `s/\0/\uFFFD/g`.