Inline HTML breaks when using indentation

RogerDodger · January 5, 2020, 8:03am

The original Markdown was designed as a sort-of superset of HTML:

Markdown is not a replacement for HTML, or even close to it. Its syntax is very small, corresponding only to a very small subset of HTML tags. The idea is not to create a syntax that makes it easier to insert HTML tags. In my opinion, HTML tags are already easy to insert. The idea for Markdown is to make it easy to read, write, and edit prose.

For any markup that is not covered by Markdown’s syntax, you simply use HTML itself.

This makes easy things easy, and hard things possible.

However, Markdown.pl makes no attempt to parse Markdown inside of HTML, so if you want to create a mildly complicated container for some prose, you’re out of luck. But it doesn’t break the HTML.

To extend this, many markdown parsers, and the CommonMark spec, still treat the body of HTML blocks as markdown. But all of them break in some way when the HTML is indented, so none of these parsers are very useful as an HTML superset. Some examples:

<div class="foo">
    <div class="bar">Bar body</div>

    <div class="baz">Baz body</div>
</div>

The baz div is parsed as a code block:

<div class="foo">
    <div class="bar">Bar body</div>
<pre><code>&lt;div class=&quot;baz&quot;&gt;Baz body&lt;/div&gt;
</code></pre>

</div>
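The behaviour above can be reproduced with a deliberately simplified model of the two rules involved. This is a sketch, not how any real parser is implemented: the function name `classify_blocks` is made up for illustration, tabs and most block types are ignored, and every blank line is assumed to end the current block.

```python
import re

# A line that opens with an HTML tag after at most 3 spaces of indentation.
HTML_START = re.compile(r"^ {0,3}</?[A-Za-z]")

def classify_blocks(text):
    """Return (kind, lines) pairs under a simplified model of block parsing."""
    blocks = []
    current = None  # the (kind, lines) pair being built, or None between blocks
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends an HTML block (end condition for kinds 6 and 7);
            # in this simplified model it ends every other block too.
            current = None
            continue
        if current is not None and current[0] == "code" and not line.startswith("    "):
            current = None  # a less-indented line ends an indented code block
        if current is None:
            if line.startswith("    "):
                kind = "code"       # 4+ spaces at the start of a new block
            elif HTML_START.match(line):
                kind = "html"
            else:
                kind = "paragraph"
            current = (kind, [])
            blocks.append(current)
        current[1].append(line)
    return blocks

sample = """<div class="foo">
    <div class="bar">Bar body</div>

    <div class="baz">Baz body</div>
</div>"""

for kind, lines in classify_blocks(sample):
    print(kind, lines)
```

The indented `bar` line survives only because it continues an HTML block that is already open; once the blank line closes that block, the equally indented `baz` line starts a new block and is read as code.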

Case where we’re trying to actually use markdown inside inline HTML:

<div class="foo">
    <div class="bar">
        <div class="baz">
            <input type="text" name="quux" length="20"
                required pattern="(?:\d\.\s)?\w+">

First para.

- List item
- List item

Second para with *emph*.
        </div>
    </div>
</div>

This is the only input I found that works on even a single markdown parser, md4c. It requires:

1. Inner markdown body not indented at all
2. Blank line between first para and HTML
3. No blank line between second para and HTML

If there is a blank line between second para and the HTML, then the HTML gets parsed as an indented code block and breaks the HTML. If there isn’t (as above) then most CommonMark parsers still produce broken HTML:

<div class="foo">
    <div class="bar">
        <div class="baz">
            <input type="text" name="quux" length="20" required pattern="(?:\d\.\s)?\w+">
            <p>First para.</p>
            <ul>
                <li>List item</li>
                <li>List item</li>
            </ul>
            <p>Second para with <em>emph</em>.</div>
    </div>
    </p>
</div>

I gather from reading the spec why this behaviour occurs (4.6 end condition for HTML blocks 6-7 is a blank line; the closing tags are just parsed as inline HTML of the second para; the tags with no blank lines between them are just considered one big “HTML block” chunk, since there’s no blank line to end them), so I’m unclear what the intended way of actually writing inline HTML is in CommonMark, other than simply using no indentation at all. It seems to me like the answer is simply “turn off parsing markdown inside HTML”, which feels unsatisfactory.

Is it possible for the spec to say “if you’re inside inline HTML, a line starting with spaces followed by an HTML tag is more inline HTML before it’s a code block”? It’s the only thing causing unavoidable breakage, and it’s still possible with this rule to use fenced code blocks for the case where one wants a code block inside inline HTML.

jgm · January 5, 2020, 8:32pm

Cameron via CommonMark Discussion noreply@talk.commonmark.org
writes:

I gather from reading the spec why this behaviour occurs (4.6
end condition for HTML blocks 6-7 is a blank line; the closing
tags are just parsed as inline HTML of the second para; the
tags with no blank lines between them are just considered one
big “HTML block” chunk, since there’s no blank line to end
them), so I’m unclear what the intended way of actually writing
inline HTML is in CommonMark, other than simply using no
indentation at all.

Don’t use blank lines in the included HTML blocks, and things
should work fine.

That requires you to be a bit more reflective when you’re
including raw HTML in a commonmark document, but that’s the price
we pay for the added flexibility of being able to include
arbitrary commonmark inside HTML tags if we like.

Is it possible for the spec to say “if you’re inside inline
HTML, a line starting with spaces followed by an HTML tag is
more inline HTML before it’s a code block”? It’s the only thing
causing unavoidable breakage, and it’s still possible with this
rule to use fenced code blocks for the case where one wants a
code block inside inline HTML.

I take it the proposal is to disable indented code blocks
when parsing commonmark content within HTML tags.

In principle, yes, it’s probably possible, but it invites
a lot of additional complexity. For one thing, we’d need
to keep track of HTML tag nesting in a way that we currently
don’t. (And what about tags that have optional closers?)
One thing we definitely want to avoid is the need for unlimited
backtracking.

RogerDodger · January 5, 2020, 10:00pm

Don’t use blank lines in the included HTML blocks, and things should work fine.

Yeah, this works for the first case, even if it’s pretty fickle. A big annoyance here is text being generated by a templating language, which can leave blank lines around, but it’s workable.
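For the templating case, one possible workaround is to strip blank lines from a generated fragment before embedding it in the document. A minimal sketch, assuming the fragment is pure HTML with no Markdown inside it and no whitespace-sensitive elements such as `<pre>`; the helper name `strip_blank_lines` is hypothetical:

```python
def strip_blank_lines(html_fragment: str) -> str:
    """Drop empty and whitespace-only lines from a generated HTML fragment."""
    return "\n".join(line for line in html_fragment.splitlines() if line.strip())

generated = (
    '<div class="foo">\n'
    '    <div class="bar">Bar body</div>\n'
    '\n'
    '    <div class="baz">Baz body</div>\n'
    '</div>'
)
print(strip_blank_lines(generated))
```

This only helps with the first example in the OP. It cannot be applied to HTML that wraps Markdown, because there the blank lines are what separate the Markdown from the surrounding tags.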

I take it the proposal is to disable indented code blocks when parsing commonmark content within HTML tags.

Yep. By my reckoning that’d totally resolve this. The writer still needs to be particular about blank lines, but that’s okay; we just need it to be possible. In the second example of the OP, there’s no way to write it (with indentation) in a way that CommonMark doesn’t break it. Ditto for any indented HTML with a block of CommonMark inside. As a nice side benefit it should make example 1 in the OP work fine as well.

In principle, yes, it’s probably possible, but it invites a lot of additional complexity. For one thing, we’d need to keep track of HTML tag nesting in a way that we currently don’t. (And what about tags that have optional closers?) One thing we definitely want to avoid is the need for unlimited backtracking.

I don’t have much experience writing parsers, but I don’t think keeping a context counter of how many open tags deep we are needs any backtracking. It does require whitelisting void elements like hr, input, and source.
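A rough sketch of such a counter, to show that it needs only a single forward pass. Everything here is an assumption made for illustration: the names `VOID_ELEMENTS`, `TAG`, and `depth_after_each_line` are invented, tags are matched with a naive regex that misses tags split across lines or attribute values containing `<` or `>`, and elements with optional closers simply stay open.

```python
import re

# Void elements never take a closing tag, so they must not change the depth.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}

# Groups: optional "/" of a closing tag, tag name, optional "/" of a self-closing tag.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9-]*)\b[^<>]*?(/?)>")

def depth_after_each_line(lines):
    """Return the open-tag depth after each line, computed in one forward pass."""
    depth = 0
    depths = []
    for line in lines:
        for closing, name, self_closing in TAG.findall(line):
            if name.lower() in VOID_ELEMENTS or self_closing:
                continue
            depth += -1 if closing else 1
        depth = max(depth, 0)  # stray closing tags never push the depth below zero
        depths.append(depth)
    return depths

example = [
    '<div class="foo">',
    '    <input type="text">',
    '    <div class="bar">',
    '    </div>',
    '</div>',
]
print(depth_after_each_line(example))
```

Nothing already scanned is ever revisited, which is the sense in which no backtracking is involved. Whether a real parser could afford even this much HTML awareness is the open question.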

Requiring optional closing tags to be properly closed seems reasonable. (Well, “requiring”, but failing to close them properly would only have the effect of disabling indented code blocks. Stealth feature?)

This would cause some serious confusion if someone doesn’t know about the rule and their indented code blocks stop working unexpectedly. But the principle you stated applies here as well: if you’re using inline HTML, you ought to know how it works. And that case is definitely going to pop up less often than people just trying to write ordinary, indented HTML.

You could have the spec specify how tags with optional closers are handled, which I don’t think requires backtracking, but that’s certainly a lot more complicated. I think it’s already been stated a few times that you want to avoid as much as possible having the spec reference the intricate details of HTML syntax.

ohAitch · January 7, 2020, 3:34am

Personally I’d solve this by fully outdenting the innermost html tag: a bit ugly, but a lot less complex than trying to track nesting in the grammar.

RogerDodger · January 10, 2020, 1:05am

At implementation level the solution (which I use) is just disabling indented code blocks altogether, since they’re redundant with fenced ones. That can’t be done on the spec level, because it breaks backwards compatibility.

I suggest this change to the spec because:

- Inline HTML has always been a notable feature of Markdown
- HTML is almost always indented, and should be, for readability
- It doesn’t break backwards compatibility except specifically for documents with an indented code block, inside HTML, that starts with an HTML tag, which surely appear many orders of magnitude less often than documents that are just ordinarily indented HTML with some blank lines in them.

That is, I think it suits the spirit of Markdown trying its best to “do the right thing” and makes it more useful overall.

To clarify the change suggested, it’s actually a less breaking change than:

I take it the proposal is to disable indented code blocks when parsing commonmark content within HTML tags.

It’s:

When parsing CommonMark within HTML, an indented line starting with an HTML tag is more HTML, not a code block.

I don’t know if this makes it harder to implement, but it would break fewer existing documents than disabling indented code blocks entirely, and it maximises expressiveness of the language, such that:

<div class="foo">

    print "This is code";

</div>

Will give you a code block, and:

<div class="foo">

    <div class="bar"></div>

</div>

…won’t.

My guess is that a decent number of documents of the former type exist, and very few of the latter (or rather, those of the latter were far more likely intended to be HTML and not code blocks—hence the suggestion).
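As a sketch of how narrow the suggested rule is, the decision could look like the following. The function name `starts_indented_code` and the `html_depth` parameter (how many HTML tags are currently open) are hypothetical, and how a parser would obtain that depth is left out.

```python
import re

# An opening or closing HTML tag as the first non-whitespace content of a line.
TAG_AT_START = re.compile(r"^\s*</?[A-Za-z]")

def starts_indented_code(line: str, html_depth: int) -> bool:
    """Decide whether a line following a blank line opens an indented code block."""
    indented = line.startswith("    ") or line.startswith("\t")
    if not indented:
        return False
    if html_depth > 0 and TAG_AT_START.match(line):
        return False  # suggested rule: inside HTML, this is more HTML
    return True

print(starts_indented_code('    print "This is code";', html_depth=1))
print(starts_indented_code('    <div class="bar"></div>', html_depth=1))
print(starts_indented_code('    <div class="bar"></div>', html_depth=0))
```

Only the middle case changes relative to current behaviour; indented non-HTML text inside HTML, and indented HTML outside any HTML, would still be code blocks.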

rokejulianlockhart · October 18, 2024, 5:50pm

This is my biggest gripe with Markdown as a whole. I find indentation to be such a significant part of readability that it seems insane that the specification retains the ability to create <pre>s using indentation, now that fenced code blocks exist, considering that it makes authoring readable HTML impossible.

Consider the support provided to the undermentioned:

github.com/11ty/eleventy

Disable markdown indented code blocks by default

opened 10:12PM - 14 Jun 22 UTC

closed 04:59PM - 16 Jun 22 UTC

zachleat

enhancement breaking-change

In markdown there is a specific feature called Indented Code Blocks that causes …much confusion! We have a big warning on the docs about it. https://www.11ty.dev/docs/languages/markdown/#there-are-extra-and-in-my-output

Awhile back @drewm posted a lovely workaround to opt-out of this feature.

```js
eleventyConfig.setLibrary("md", markdownIt(options).disable('code'));
```

https://twitter.com/drewm/status/1167821259662663682

I’d like to change the default Eleventy behavior to do this as well. Maybe even in 2.0 👀

Related: #402 #180 #1971 (#1635 maybe) and part of https://twitter.com/brob/status/1530620544680337412 from @brob and probably a bunch of others

I’ve asked on Reddit whether it’s possible to disable them in VS Code, if anyone’s interested.

rokejulianlockhart · November 23, 2024, 8:34pm

@RogerDodger, I have officially requested this for VS Code at the undermentioned, if of use:

github.com/microsoft/vscode

Allow disabling indented code blocks in the Markdown previewer.

opened 08:32PM - 23 Nov 24 UTC

closed 04:16PM - 25 Nov 24 UTC

RokeJulianLockhart

markdown *extension-candidate

#### Desire

As [I've poorly aforedescribed at `reddit.com/r/vscode` (to 5 upvot…es)](https://www.reddit.com/r/Markdown/comments/1g8t2xt/can_markdowns_indented_code_blocks_be_disabled_in/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button), I want a way to disable the conversion of a single tab or 2 spaces to a `<pre><code>`. Instead, I want it these to merely be ignored (as they are in other markup languages, like HTML).

#### Rationale

I author most of my markup in HTML, because it provides significantly more versatile and semantic markup capabilities. However, it has an (ultimately non-inherent, but in-practice) significant failure — `<code>` tags are not automatically syntax-highlighted by any parsers. Markdown, being a superset of HTML, improves this perfectly. As an example, when I render the undermentioned in VS Code with the PowerShell extension installed, I see beautiful colours:

~~~MD
<tr>
<th>

Construction

</th>
<td>

```PS1
${'Status'} = [PSCustomObject]@{
    'Non-Chronological' = 'Indeterminate';
    'Chronological' = [PSCustomObject]@{
        1 = @( 'Uncommenced', 'Commenced' );
        2 = @( 'Completed', 'Cancelled' )
    }
}
${'Person'} = [PSCustomObject]@{
    3 = '';
    1 = '';
    2 = ''
}
```

</td>
</tr>
~~~

However, that's really difficult to read. It gets exponentially more difficult if, for example, you have nested tables with code blocks in each. At that stage, I basically have to re-indent and then de-indent each time I modify the markup. It's a dreadful workflow. Instead, it should be the undermentioned:

~~~MD
<tr>
    <th>

        Construction

    </th>
    <td>

        ```PS1
        ${'Status'} = [PSCustomObject]@{
            'Non-Chronological' = 'Indeterminate';
            'Chronological' = [PSCustomObject]@{
                1 = @( 'Uncommenced', 'Commenced' );
                2 = @( 'Completed', 'Cancelled' )
            }
        }
        ${'Person'} = [PSCustomObject]@{
            3 = '';
            1 = '';
            2 = ''
        }
        ```

    </td>
</tr>
~~~

However, all beneath the first `<td>` shall render in a `<pre><code>`.

This is, of course, a basic example, where the aforedescribed potential complexity is less evident. However, I can provide incredibly complex examples if necessary. Summarily, having this implemented would completely change how I write my Markdown documents. I would finally be able to write the HTML within them in a readable manner, and have syntax-highlighted code blocks, without needing to deal with the indentation havoc that is `<pre>`.

#### Feasibility

Per https://github.com/11ty/eleventy/issues/2438#issuecomment-2464912554, this should be possible in VS Code, since `markdown-it` appears to be the parser that VS Code uses, and it supports the ability to disable indented code blocks.

#### Corroberations

1. ["A switch that disables code blocks by indenting" at `forum.obsidian.md/t/21764`](https://forum.obsidian.md/t/a-switch-that-disables-code-blocks-by-indenting/21764/1?u=rokejulianlockhart)

   <blockquote>
   For years I’ve been using apps that “use Markdown” like Bear and Ulysses. So I would often tab underneath a header for typing because I visually like to indent different topics. Well, turns out that those apps were really using “Markdown flavors”. It was like Markdown lite. So now that I’m using Obsidian that uses the full Markdown specs, when I try and indent underneath a header, I get a code block! That technically is correct, but I hate it! I can’t write the way I normally like to write. I would love a simple switch that says something along the lines of “disable code blocks by indenting” or something to that effect. I would still be able to do a code block using the three ticks.
   </blockquote>

1. ["Inline HTML breaks when using indentation" at `talk.commonmark.org/t/3317`](https://talk.commonmark.org/t/inline-html-breaks-when-using-indentation/3317/1?u=rokejulianlockhart) [^1]

   <blockquote>
   The only input I found that works [...] requires:

   1. Inner markdown body not indented at all
   2. Blank line between first para and HTML
   3. No blank line between second para and HTML

   If there *is* a blank line between second para and the HTML, then the HTML gets parsed as an indented code block and breaks the HTML.
   </blockquote>

1. https://github.com/jgm/pandoc/issues/2120#issue-71270331

1. https://github.com/11ty/eleventy/issues/2438#issue-1271419451

1. ["Disable “indent -> code block”" at `forum.obsidian.md/t/19173`](https://forum.obsidian.md/t/disable-indent-code-block/19173/1?u=rokejulianlockhart)

   <blockquote>

   #### **Situation**

   I often indent lists and put an empty line in between for visual distrinction.

   #### **Problem**

   Obsidian recognizes that as code and colors the line red.

   #### **What I’m trying to do**

   Turn off code blocks alltogether, or turn off the indent → code block function.

   </blockquote>

1. ["Break Markdown: Option to change default tab / indent behavior / Do not create code block" at `forum.obsidian.md/t/8741/5`](https://forum.obsidian.md/t/break-markdown-option-to-change-default-tab-indent-behavior-do-not-create-code-block/8741/5?u=rokejulianlockhart)

   > If there’s an empty line in between, the indent will default to code. If anybody found a solution, please let me know.

1. ["This is how to use Markdown inside HTML blocks" at `forum.obsidian.md/t/74435/14`](https://forum.obsidian.md/t/this-is-how-to-use-markdown-inside-html-blocks/74435/14?u=rokejulianlockhart)

   > It’s sad we can’t combine HTML with Markdown, I really wanted to use the metabind plugin in an html table, but since HTML goes separately, that really limits the amount of customization we can have.

1. ["Change the code block button from inserting indentation to triple-backticks" at `meta.stackoverflow.com/revisions/414866/1`](https://meta.stackoverflow.com/revisions/414866/1#:~:text=Indentation%20is%20also%20used%20with%20lists%20(bullet%20and%20numbered)%2C%20making%20it%20more%20confusing%20how%20to%20properly%20indent%20a%20block%20inside%20them.%20Fenced%20blocks%20provide%20a%20separate%20syntax%2C%20simplifying%20the%20markup%20for%20indenting%20a%20block%20inside%20a%20list.):

   > Indentation is also used with lists (bullet and numbered), making it more confusing how to properly indent a block inside them. Fenced blocks provide a separate syntax, simplifying the markup for indenting a block inside a list.

#### Reposts

1. [x] https://github.com/searKing/preview-vscode/issues/96#issue-2686597767
1. [ ] https://github.com/shd101wyy/vscode-markdown-preview-enhanced/issues/new

#### Interested

1. [x] @JaredRichardWilliam
1. [ ] @YerEverLuvinUncleBert

[^1]: [`talk.commonmark.org/t/3317/6`](https://talk.commonmark.org/t/inline-html-breaks-when-using-indentation/3317/6?u=rokejulianlockhart)

Some demonstrations of support would be really useful.