Inline HTML breaks when using indentation

The original Markdown was designed as a sort-of superset of HTML:

Markdown is not a replacement for HTML, or even close to it. Its syntax is very small, corresponding only to a very small subset of HTML tags. The idea is not to create a syntax that makes it easier to insert HTML tags. In my opinion, HTML tags are already easy to insert. The idea for Markdown is to make it easy to read, write, and edit prose.

For any markup that is not covered by Markdown’s syntax, you simply use HTML itself.

This makes easy things easy, and hard things possible.

However, Markdown.pl makes no attempt to parse Markdown inside of HTML, so if you want to create a mildly complicated container for some prose, you’re out of luck. But it doesn’t break the HTML.

To extend this, many markdown parsers, and the CommonMark spec, still treat the body of HTML blocks as markdown. But all of them break in some way when the HTML is indented, so none of these parsers are very useful as an HTML superset. Some examples:

<div class="foo">
    <div class="bar">Bar body</div>

    <div class="baz">Baz body</div>
</div>

The baz div is parsed as a code block:

<div class="foo">
    <div class="bar">Bar body</div>
<pre><code>&lt;div class=&quot;baz&quot;&gt;Baz body&lt;/div&gt;
</code></pre>

</div>

A case where we're actually trying to use markdown inside inline HTML:

<div class="foo">
    <div class="bar">
        <div class="baz">
            <input type="text" name="quux" length="20"
                required pattern="(?:\d\.\s)?\w+">

First para.

- List item
- List item

Second para with *emph*.
        </div>
    </div>
</div>

This is the only input I found that works on even a single markdown parser, md4c. It requires:

  1. Inner markdown body not indented at all
  2. Blank line between first para and HTML
  3. No blank line between second para and HTML

If there is a blank line between the second para and the closing tags, then those indented closing tags get parsed as an indented code block, which breaks the HTML. If there isn’t (as above), most CommonMark parsers still produce broken HTML:

<div class="foo">
    <div class="bar">
        <div class="baz">
            <input type="text" name="quux" length="20" required pattern="(?:\d\.\s)?\w+">
            <p>First para.</p>
            <ul>
                <li>List item</li>
                <li>List item</li>
            </ul>
            <p>Second para with <em>emph</em>.</div>
    </div>
    </p>
</div>

I gather from reading the spec why this behaviour occurs (4.6 end condition for HTML blocks 6-7 is a blank line; the closing tags are just parsed as inline HTML of the second para; the tags with no blank lines between them are just considered one big “HTML block” chunk, since there’s no blank line to end them), so I’m unclear what the intended way of actually writing inline HTML is in CommonMark, other than simply using no indentation at all. It seems to me like the answer is simply “turn off parsing markdown inside HTML”, which feels unsatisfactory.
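To make that reading of the spec concrete, here is a rough Python sketch of the behaviour as I understand it. It is not a CommonMark implementation: `classify_lines`, `BLOCK_TAGS`, and `HTML_START` are made-up names, the tag list is a small subset of the one in the spec, and only HTML blocks ended by a blank line plus indented code blocks are modelled.

```python
import re

# Simplified start condition for a blank-line-terminated HTML block: at most
# 3 spaces of indentation, then an opening or closing block-level tag.
# Assumption: the spec's tag list is much longer than this one.
BLOCK_TAGS = ("div", "p", "ul", "ol", "li", "table")
HTML_START = re.compile(
    r"^ {0,3}</?(?:%s)(?:[ \t>]|/>|$)" % "|".join(BLOCK_TAGS), re.IGNORECASE
)

def classify_lines(text):
    """Label each line as 'html', 'code', 'blank', or 'other'."""
    labels = []
    in_html = False
    for line in text.split("\n"):
        if not line.strip():
            in_html = False          # a blank line ends the HTML block
            labels.append("blank")
        elif in_html:
            labels.append("html")    # everything up to the blank line is raw HTML
        elif HTML_START.match(line):
            in_html = True
            labels.append("html")
        elif line.startswith("    ") and (not labels or labels[-1] != "other"):
            # 4+ spaces outside an HTML block; indented code cannot
            # interrupt a paragraph, hence the check on the previous label
            labels.append("code")
        else:
            labels.append("other")
    return labels

example = """<div class="foo">
    <div class="bar">Bar body</div>

    <div class="baz">Baz body</div>
</div>"""

for label, line in zip(classify_lines(example), example.split("\n")):
    print(f"{label:5} | {line}")
```

The blank line resets the state, and the four-space indent on the baz line then matches the code-block rule before anything looks at the tag.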

Is it possible for the spec to say “if you’re inside inline HTML, a line starting with spaces followed by an HTML tag is more inline HTML before it’s a code block”? It’s the only thing causing unavoidable breakage, and it’s still possible with this rule to use fenced code blocks for the case where one wants a code block inside inline HTML.

Cameron via CommonMark Discussion noreply@talk.commonmark.org
writes:

I gather from reading the spec why this behaviour occurs (4.6
end condition for HTML blocks 6-7 is a blank line; the closing
tags are just parsed as inline HTML of the second para; the
tags with no blank lines between them are just considered one
big “HTML block” chunk, since there’s no blank line to end
them), so I’m unclear what the intended way of actually writing
inline HTML is in CommonMark, other than simply using no
indentation at all.

Don’t use blank lines in the included HTML blocks, and things
should work fine.

That requires you to be a bit more reflective when you’re
including raw HTML in a commonmark document, but that’s the price
we pay for the added flexibility of being able to include
arbitrary commonmark inside HTML tags if we like.

Is it possible for the spec to say “if you’re inside inline
HTML, a line starting with spaces followed by an HTML tag is
more inline HTML before it’s a code block”? It’s the only thing
causing unavoidable breakage, and it’s still possible with this
rule to use fenced code blocks for the case where one wants a
code block inside inline HTML.

I take it the proposal is to disable indented code blocks
when parsing commonmark content within HTML tags.

In principle, yes, it’s probably possible, but it invites
a lot of additional complexity. For one thing, we’d need
to keep track of HTML tag nesting in a way that we currently
don’t. (And what about tags that have optional closers?)
One thing we definitely want to avoid is the need for unlimited
backtracking.

Don’t use blank lines in the included HTML blocks, and things should work fine.

Yeah, this works for the first case, even if it’s pretty fickle. A big annoyance here is text being generated by a templating language, which can leave blank lines around, but it’s workable.

I take it the proposal is to disable indented code blocks when parsing commonmark content within HTML tags.

Yep. By my reckoning that’d totally resolve this. The writer still needs to be particular about blank lines, but that’s okay; we just need it to be possible. In the second example of the OP, there’s no way to write it (with indentation) in a way that CommonMark doesn’t break it. Ditto for any indented HTML with a block of CommonMark inside. As a nice side benefit it should make example 1 in the OP work fine as well.

In principle, yes, it’s probably possible, but it invites a lot of additional complexity. For one thing, we’d need to keep track of HTML tag nesting in a way that we currently don’t. (And what about tags that have optional closers?) One thing we definitely want to avoid is the need for unlimited backtracking.

I don’t have much experience writing parsers, but I don’t think keeping a context counter of how many open tags deep we are needs any backtracking. It does require whitelisting void elements like hr, input, and source.
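Roughly what I have in mind, as a Python sketch rather than a proposal for real parser code. The names `VOID_ELEMENTS`, `TAG`, and `depths` are mine, and the tag regex is deliberately loose: it ignores comments, tags split across lines, and a `>` inside an attribute value.

```python
import re

# Void elements never take a closing tag, so they must not change the depth.
# This is the HTML void-element list, which includes hr, input, and source.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
}

# Loose matcher for an opening or closing tag that fits on one line.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9-]*)[^>]*>")

def depths(text):
    """Return the open-tag depth after each line, in a single forward pass."""
    depth = 0
    result = []
    for line in text.split("\n"):
        for closing, name in TAG.findall(line):
            if name.lower() in VOID_ELEMENTS:
                continue
            # Clamp at zero so a stray closing tag cannot go negative.
            depth = max(depth - 1, 0) if closing else depth + 1
        result.append(depth)
    return result

example = """<div class="foo">
    <div class="bar">Bar body</div>

    <div class="baz">Baz body</div>
</div>"""

print(depths(example))
```

Each line is looked at once and the counter only moves forward, so nothing here needs backtracking. A parser would check the depth before the baz line (still 1) to know it is inside HTML.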

Requiring optional closing tags to be properly closed seems reasonable. (Well, “requiring”, but failing to close them properly would only have the effect of disabling indented code blocks. Stealth feature?)

This would cause some serious confusion if someone doesn’t know about the rule and their indented code blocks stop working unexpectedly. But the principle you describe applies here as well: if you’re using inline HTML, you ought to know how it works. And that case is definitely going to pop up less often than people just trying to write ordinary, indented HTML.

You could have the spec specify how optional closing tags are handled, which I don’t think requires backtracking, but that’s certainly a lot more complicated. I think it’s already been stated a few times that you want to avoid as much as possible having the spec reference the intricate details of HTML syntax.

Personally I’d solve this by fully outdenting the innermost html tag: a bit ugly, but a lot less complex than trying to track nesting in the grammar.

At implementation level the solution (which I use) is just disabling indented code blocks altogether, since they’re redundant with fenced ones. That can’t be done on the spec level, because it breaks backwards compatibility.

I suggest this change to the spec because:

  1. Inline HTML has always been a notable feature of Markdown
  2. HTML is almost always and should be indented for readability
  3. It doesn’t break backwards compatibility except specifically for documents with an indented code block, inside HTML, that starts with an HTML tag—which surely appears many orders of magnitude less often than documents that are just ordinarily indented HTML with some blank lines in them.

That is, I think it suits the spirit of Markdown trying its best to “do the right thing” and makes it more useful overall.

To clarify the change suggested, it’s actually a less breaking change than:

I take it the proposal is to disable indented code blocks when parsing commonmark content within HTML tags.

It’s:

When parsing CommonMark within HTML, an indented line starting with an HTML tag is more HTML, not a code block.

I don’t know if this makes it harder to implement, but it would break fewer existing documents than disabling indented code blocks entirely, and it maximises expressiveness of the language, such that:

<div class="foo">

    print "This is code";

</div>

Will give you a code block, and:

<div class="foo">

    <div class="bar"></div>

</div>

…won’t.

My guess is that a decent number of documents of the former type exist, and very few of the latter (or rather, those of the latter were far more likely intended to be HTML and not code blocks—hence the suggestion).
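The suggested rule can be sketched as a small Python function. This only illustrates the idea and is not spec wording: `classify_indented` and `TAG_START` are hypothetical names, and `inside_html` is assumed to come from whatever tracks open tags.

```python
import re

# Leading whitespace followed by something that looks like the start of an
# opening or closing HTML tag.
TAG_START = re.compile(r"^\s*</?[A-Za-z][A-Za-z0-9-]*(?:\s|/?>|$)")

def classify_indented(line, inside_html):
    """Classify a line that follows a blank line."""
    indented = line.startswith("    ") or line.startswith("\t")
    if not indented:
        return "not indented"
    if inside_html and TAG_START.match(line):
        return "html"   # proposed rule: more HTML, not a code block
    return "code"       # current behaviour: indented code block

print(classify_indented('    print "This is code";', True))
print(classify_indented('    <div class="bar"></div>', True))
print(classify_indented('    <div class="bar"></div>', False))
```

Outside HTML nothing changes, and inside HTML only indented lines that begin with a tag are affected, which is why I expect few existing documents to break.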