One more thing regarding “HTML comments”:
Both the HTML rules (by which I mean “real” HTML conforming to the W3C HTML 4.01 and/or ISO 15445 specifications, not the “look-alike” HTML5) and the XML rules restrict the comment syntax to start with <!-- and end with -->:
All comments in HTML document instances shall appear in comment declarations. There shall be exactly one comment per comment declaration.
Technically, there is also the “degenerate” form <!>: it is not mentioned in the CommonMark spec, and (consequently) is passed through as literal text by cmark, and browsers like Mozilla would not recognize it as a comment anyway (though they should).
But I think that <!> would be a perfectly fitting candidate to play the same role in CommonMark that \& has in groff (or nroff, or troff):
Insert a zero-width character, which is invisible. Its intended use is to stop interaction of a character with its surrounding.
Why would this be useful? In many cases it would provide an alternative to the “backslash escape” used to prevent parsing, e.g. instead of “hiding” the FULL STOP to prevent recognition of a list item like this:
1\. Lorem ipsum dolor sit amet.
one could write:
<!>1. Lorem ipsum dolor sit amet.
More importantly, and not possible using backslash (as far as I can tell), one could differentiate “inline” and “block” tag markup (or rather: element types) in the CommonMark text, and thus tell the CommonMark parser whether or not to wrap the content in a <P> (or <LI>) element.
The following input would produce the first paragraph wrapped in <P>, but the second paragraph would generate an instance of the “block-level” element type <block-elem>, following the <P> element instance:
Lorem ipsum dolor sit amet.
<block-elem>consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</block-elem>
On the other hand, the following input would produce two consecutive <P> elements, where the second has an <inline-elem> as its immediate (and only) child:
Lorem ipsum dolor sit amet.
<!><inline-elem>consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</inline-elem>
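To make the two examples above concrete, here is a minimal sketch of how a processor might treat the proposed marker. It is only an illustration under stated assumptions: the function name classify_block, the kind labels, and the simplified regular expressions are all hypothetical, and nothing here reflects how cmark or the CommonMark spec currently behaves.

```python
import re
from typing import Tuple

# Assumption: "<!>" is the proposed zero-width marker; CommonMark does not define it today.
ZERO_WIDTH_MARK = "<!>"

# Deliberately simplified patterns, for illustration only.
_TAG_START = re.compile(r"<[A-Za-z][A-Za-z0-9-]*[\s>]")   # e.g. "<block-elem>"
_LIST_START = re.compile(r"(?:[-+*]|\d{1,9}[.)])\s")      # e.g. "1. " or "- "


def classify_block(chunk: str) -> Tuple[str, str]:
    """Return (kind, text) for one blank-line-separated chunk of input."""
    if chunk.startswith(ZERO_WIDTH_MARK):
        # The marker itself produces no output; it only stops the following
        # characters from being read as block-level markup.
        return "paragraph", chunk[len(ZERO_WIDTH_MARK):]
    if _TAG_START.match(chunk):
        # A chunk that starts with a bare tag is taken as a block-level element.
        return "block-element", chunk
    if _LIST_START.match(chunk):
        return "list-item", chunk
    return "paragraph", chunk
```

With this sketch, "1. Lorem ipsum" is classified as a list item while "<!>1. Lorem ipsum" is a paragraph; likewise "<block-elem>…" is a block-level element while "<!><inline-elem>…" is a paragraph whose content starts with the inline element.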
The second use is of course the important one: it could fix what is IMO a serious flaw in the CommonMark specification, and at the same time simplify the rules for handling “HTML tags”. The goals would be to:
- make the CommonMark specification as general as possible,
- in particular to avoid biasing it towards HTML, or
- constraining the specification to the HTML syntax, or
- the HTML element type names (which variant of HTML anyway?), or
- which named character references are available in HTML etc.
The current specification contains all kinds of HTML-specific rules, in particular (and naturally) in section “4.6 HTML blocks”. On the other hand, section “6.2 Entities” explicitly says:
With the goal of making this standard as HTML-agnostic as possible, all valid HTML entities (except in code blocks and code spans) are recognized as such and converted into Unicode characters before they are stored in the AST. This means that renderers to formats other than HTML need not be HTML-entity aware.
Maybe it is just me, but is the requirement that
- every CommonMark processor has to know the complete HTML character entity set (again: of which HTML variant?) so that
- the CommonMark specification can assign fixed meanings defined by HTML to all entity references which look like an HTML named character reference, and
- require the processor to “silently” substitute the corresponding Unicode character for these references

not the direct and extreme opposite of “making this standard as HTML-agnostic as possible”?
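To illustrate what that requirement amounts to in practice, here is a minimal sketch of the substitution step. It assumes that Python's standard html.entities.html5 table is an acceptable stand-in for “all valid HTML entities” (which is itself a choice of HTML variant); the function name substitute_named_entities is hypothetical and not taken from any CommonMark implementation.

```python
import re
from html.entities import html5  # the HTML5 named character reference table

# A named reference: "&", a name, and a terminating ";".
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*;)")


def substitute_named_entities(text: str) -> str:
    """Replace known named references with their Unicode characters."""
    def repl(match):
        # Unknown names are left untouched as literal text.
        return html5.get(match.group(1), match.group(0))
    return _ENTITY.sub(repl, text)


# The processor cannot do this without shipping the whole table.
print(len(html5), "entries in the entity table")
print(substitute_named_entities("&copy; 2015 &foo; &amp; Co."))
```

Even this small function cannot work without the full, HTML-defined name table, which is exactly the dependency in question.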
[NOTE: Admittedly, this should be taken to another discussion topic.]