Raw HTML blocks proposals -- comments wanted

That is the problem - since those spaces are not required.

>
>><div>foo</div>

works as well.

I don’t see the problem. In your case the nested bq “inner start” will be at 2, and <div> will be “at line start” again.

Maybe there is a misunderstanding caused by my English, sorry.

The problem is that the parser has to know what is considered the start of the line. But with blockquotes the optional space that is very commonly used will make this impossible (without saying that the space has to be omitted).

>> <div>foo</div>
>>
>><div>foo</div>

Why? It’s the next char after “>” if that is not a space, or (next+1) if “>” is followed by a space or multiple spaces.

PS. At least it works in markdown-it, where we “remap” every line in a bq to make the inner content look like “usual content”.
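To make the “inner start” idea concrete, here is a minimal sketch of stripping blockquote markers from one line. It is not markdown-it's actual code; the function name is made up for illustration, and it assumes that each “>” consumes at most one following space and that the line is not indented before the first “>”.

```python
def strip_blockquote_markers(line: str) -> tuple[int, str]:
    """Return (nesting depth, inner content) for a single line.

    Assumption: each '>' marker may be followed by ONE optional space,
    which is consumed together with the marker.
    """
    depth = 0
    pos = 0
    while pos < len(line) and line[pos] == ">":
        depth += 1
        pos += 1
        if pos < len(line) and line[pos] == " ":
            pos += 1  # the optional space belongs to the marker
    return depth, line[pos:]
```

With this rule, `>><div>foo</div>` and `>> <div>foo</div>` both yield inner content that starts with the tag, while `>>  <div>foo</div>` (two spaces) keeps one leading space in the inner content.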

The approach you specified implies that if the user wants the HTML content to be inline, it must be preceded by at least one space. But this will not work the same within blockquotes since, as you mention, we trim the initial space.

Ah! Now I understand the problem.

That will work if you use >_ for quoting. That will give 2 spaces in total, and only one will be trimmed.

Also in blockquotes: could the spec require a strict indent before/after ">"? I suggested making the trimming equal for all inner lines. That could reduce the problem for multiline blockquotes, when the user doesn’t use a space after >

But I agree, my solution is not ideal either. Stats from real docs are needed to decide. I don’t use HTML at all, and will be fine if your suggestion is accepted instead of mine. I like it and have no breaking examples.

It can be written as

</iframe ...>

or

<iframe width="560" 
        height="315"
        src="//www.youtube.com/embed/tjuBV4NbCng"
        frameborder="0"
        allowfullscreen>
</iframe>

I think we should look for a solution that is not only possible, but usable and natural. If copy-pasted HTML has to be split, that will guarantee reports about broken support to all parser maintainers.

It is a good point but quite conflicting. I’d rather have the current proposal and have people surround copy-pasted code with <div> than have more chance of false positives.

I agree that avoiding hardcoding tag names is a worthy goal.

Clarification required

An HTML block would start with a line containing a single, unindented tag (either opening, closing, or self-closing), and would end, as before, with a blank line or end of document.

Can you please clarify if “containing a single” is supposed to mean:

a. containing only a single complete unindented tag (and nothing else);
b. containing only a single complete unindented tag or a partial unindented tag, but not both;
c. containing at least a single complete unindented tag;
d. containing either only a single partial unindented tag or at least a single complete unindented tag; or both.

Version 0.15 seems to be (d), since <div></div> in Example 100 is treated as a valid start of an HTML block (even though it has more than just one tag), and incomplete tags are also a valid start of an HTML block in Example 107.

If the new proposal is (d) or (b) then there is no problem with the YouTube iframe example, since it is a valid partial tag. The iframe example is only a problem if partial tags cannot be a valid HTML block tag (i.e. (a) or (c)).

Examples

To make the above options clearer, here are some example lines.

Under (a):

  • <div> would be a valid beginning of a HTML block.
  • <div></div> would be invalid, since it is not a single tag (it is two tags)
  • <div class="foo" would be invalid, since it is not a complete tag.
  • <div><p class="bar" would be invalid, since there is more than a single tag.

Under (b):

  • <div> would be valid.
  • <div></div> would be invalid, since it is not a single tag (it is two tags).
  • <div class="foo" would be valid.
  • <div><p class="bar" would be invalid, since there is both a single complete tag and a partial unindented tag, but only one is allowed.

Under (c):

  • <div> would be valid.
  • <div></div> would be valid.
  • <div class="foo" would be invalid, since it is not a complete tag.
  • <div><p class="bar" would be valid.

Under (d):

  • <div> would be valid.
  • <div></div> would be valid.
  • <div class="foo" would be valid.
  • <div><p class="bar" would be valid.
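The four readings can also be written down as a small checker, which may make the differences easier to test. This is only a sketch: the two regular expressions are a deliberately simplified tag grammar (an assumption, much looser than anything a spec would use), and `is_block_start` is a hypothetical name.

```python
import re

# Simplified tag grammar (assumption): a tag name plus optional attribute text.
COMPLETE = r"</?[A-Za-z][A-Za-z0-9-]*(?:\s[^<>]*)?/?>"
PARTIAL = r"</?[A-Za-z][A-Za-z0-9-]*(?:\s[^<>]*)?"


def is_block_start(line: str, option: str) -> bool:
    """Check one unindented line against interpretation 'a', 'b', 'c' or 'd'."""
    line = line.rstrip()
    only_complete = re.fullmatch(COMPLETE, line) is not None   # one complete tag, nothing else
    only_partial = re.fullmatch(PARTIAL, line) is not None     # one unclosed tag, nothing else
    starts_complete = re.match(COMPLETE, line) is not None     # at least one complete tag at column 0
    return {
        "a": only_complete,
        "b": only_complete or only_partial,
        "c": starts_complete,
        "d": starts_complete or only_partial,
    }[option]


EXAMPLES = ['<div>', '<div></div>', '<div class="foo"', '<div><p class="bar"']

for option in "abcd":
    print(option, [is_block_start(line, option) for line in EXAMPLES])
```

For the four example lines this gives the same valid/invalid answers as the lists above.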

A preference

I prefer the options that allow for more than “only a single complete tag”, since there are situations where you want HTML tags to not contain any whitespace content. When you really want to produce <foo></foo> or <foo/> and don’t want <foo> </foo> or

<foo>
</foo>

A question

I don’t understand why, in example 1 of the 6th post, <del>foo</del> becomes <p><del>foo</del></p>?

If this is not being interpreted as an HTML block (because it contains more than a single tag) and so is treated as a paragraph, wouldn’t the special characters be treated as normal characters (since they cannot be interpreted as autolinks)? So, if it is transformed into HTML, it would become <p>&lt;del&gt;foo&lt;/del&gt;</p>.

+++ Hoylen Sue [Jan 02 15 14:17 ]:

I agree that avoiding hardcoding tag names is a worthy goal.

Clarification required

An HTML block would start with a line containing a single, unindented tag (either opening, closing, or self-closing), and would end, as before, with a blank line or end of document.

Can you please clarify if “containing a single” is supposed to mean:

a. containing only a single complete unindented tag (and nothing else);
b. containing only a single complete unindented tag or a partial unindented tag, but not both;
c. containing at least a single complete unindented tag;
d. containing either only a single partial unindented tag or at least a single complete unindented tag; or both.

I was thinking (b). Of course, if we adopted Knagis’s suggestion, we’d
allow any number of complete tags and up to one incomplete tag, together
with whitespace.

Version 0.15 seems to be (d).

Version 0.15 does not implement any form of the proposal under
discussion here.

I prefer the options that allow for more than “only a single complete tag”, since there are situations where you want HTML tags to not contain any whitespace content.

I think we need to allow incomplete tags, since a long tag with many
attributes may be wrapped over several lines.

I don’t understand why, in example 1 of the 6th post, <del>foo</del> becomes <p><del>foo</del></p>?

The proposal under discussion was that an HTML block starts with a line
containing a single tag (and no other non-whitespace content). This
line contains “foo” and another tag, so it doesn’t start an HTML block.
That means it is interpreted as a paragraph, and the tags in it are
read as inline HTML tags. See CommonMark Spec

I don’t like this idea. It changes the definition of HTML. Markdown should not legislate how embedded HTML has to be formatted, and it should not change the meaning of whitespace in HTML.

I think this approach is also incompatible in spirit and fact with the original Markdown definition and implementation.

I don’t think it’s so terrible to maintain a list of block and inline elements. That list doesn’t change that often. Web authors should or can know that list. For elements that can be both, an exception can be specified.

+++ Peter Eisentraut [Jan 03 15 02:48 ]:

I don’t think it’s so terrible to maintain a list of block and inline elements. That list doesn’t change that often. Web authors should or can know that list. For elements that can be both, an exception can be specified.

The problem is how to handle elements that can be both, like <del> or <a>.

BabelMark 2 shows three interesting variants among Markdown implementations:

Group 1 (the biggest group, including current commonmark) always renders <del>Hi</del> with <p> tags. Many of these implementations even insert <p> tags around a bare open <del> tag, which I don’t think is valid HTML.

Group 2 always renders <del>Hi</del> as a block.

Group 3 (PHP Markdown) seems to have an ad hoc rule for <del> that allows both variants. If the <del> and </del> tags are on lines by themselves, the whole thing is interpreted as a block; otherwise it’s interpreted as a paragraph with inline <del> tags.

I think Group 3 (PHP Markdown) has the best approach here, as it is the only implementation that gives authors the power to express both things.

But, once you see a need to distinguish between

<del>
Hi
</del>

and

<del>Hi</del>

based on the positioning of the tags (the fact that in the first case we just have a single tag on a line by itself), then the list of block- and inline- level elements becomes redundant. The possibility opens up of making the distinction solely on the basis of positioning, and not hard-coding a list of elements.
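A rough sketch of what “going by positioning” could look like, with no list of element names at all. The regular expression is a simplified tag pattern and `render` is a made-up helper, so treat both as assumptions rather than anything from the spec or from PHP Markdown.

```python
import re

# A single (possibly unclosed) tag alone on the line; simplified on purpose.
TAG_ALONE = re.compile(r"</?[A-Za-z][A-Za-z0-9-]*(?:\s[^<>]*)?/?>?\s*")


def render(chunk: str) -> str:
    """Render one blank-line-delimited chunk of input."""
    first_line = chunk.split("\n", 1)[0]
    if TAG_ALONE.fullmatch(first_line):
        return chunk                      # raw HTML block, passed through as-is
    return "<p>" + chunk + "</p>"         # paragraph; tags inside stay inline HTML


print(render("<del>\nHi\n</del>"))
print(render("<del>Hi</del>"))
```

The first call passes the three lines through unchanged, and the second wraps the line in <p> tags, so the author picks the behaviour purely by where the tags are placed.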

And there’s a pretty good reason, I think, for not hard-coding the lists: custom elements (Custom elements not supported · Issue #239 · commonmark/commonmark-spec · GitHub). There’s no way to know whether a custom element should be interpreted as a block or inline element. We could just declare all custom (unknown) elements as block-level, but that is a restriction on authors. If we just go by positioning, we don’t need to make that restriction.

The number of possible custom tags is not finite.

Could the rule put forward in the original syntax guide on Daring Fireball be used?

The only restrictions are that block-level HTML elements — e.g. <div>, <table>, <pre>, <p>, etc. — must be separated from surrounding content by blank lines, and the start and end tags of the block should not be indented with tabs or spaces. Markdown is smart enough not to add extra (unwanted) <p> tags around HTML block-level tags.

This would require the HTML block to be surrounded by blank lines. My understanding is that the blank lines rule was not required in the spec due to the Daring Fireball rule being restrictive. From CommonMark 0.15:

In some ways Gruber’s rule is more restrictive than the one given here:

It requires that an HTML block be preceded by a blank line.
It does not allow the start tag to be indented.
It requires a matching end tag, which it also does not allow to be indented.

Indeed, most Markdown implementations, including some of Gruber’s own perl implementations, do not impose these restrictions.

But these restrictions could also be considered features if the primary goal is readability. For example, requiring blank lines before and after the HTML block makes it easier on the eyes to separate from the rest of the page content. It could be considered a good practice worth enforcing by the spec.

Here’s a hybrid proposal, which is not as simple and elegant, but achieves some of the same goals: An HTML block starts with either (a) any HTML tag or partial tag on a line by itself, or (b) any HTML block tag, whether it is on a line by itself or not. (Dual-purpose tags like <del> would not be counted as block tag for purposes of (b).)

This proposal would allow all the blocks the previous proposal did, plus some more that it did not, like:

<div>foo</div>
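Sketched as code, the hybrid rule might look like the following. The `BLOCK_TAGS` set is just an illustrative subset that I picked, the tag patterns are simplified, and `starts_html_block` is a hypothetical name.

```python
import re

# Illustrative subset only; dual-purpose tags like <del> are left out on purpose.
BLOCK_TAGS = {"div", "table", "pre", "p", "ul", "ol", "blockquote"}

# (a) a single tag or partial tag alone on the line
TAG_ALONE = re.compile(r"</?[A-Za-z][A-Za-z0-9-]*(?:\s[^<>]*)?/?>?\s*")
# (b) the tag name at the start of the line, followed by space, '/', '>' or end of line
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9-]*)(?=[\s/>]|$)")


def starts_html_block(line: str) -> bool:
    if TAG_ALONE.fullmatch(line):
        return True                                   # rule (a)
    m = TAG_NAME.match(line)
    return m is not None and m.group(1).lower() in BLOCK_TAGS   # rule (b)
```

Under these assumptions `<div>foo</div>` opens a block via (b), `<del>` alone on a line opens one via (a), and `<del>foo</del>` does not open a block, so it would be handled as a paragraph.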

While we’re being practical, it might also make sense to special-case rules for verbatim block tags (<script>, <style>, <pre>), so that a block opened with one of these does not end until we hit the corresponding closing tag. (Normally we end the block at the first blank line, which allows users to include markdown content inside HTML tags if they want, but nobody would want markdown inside verbatim tags to be interpreted.)

What should we do if such a block is closed like this?

</style> foo bar

Or an even “nicer” combination:

styles...

</style> <script>

code...

Ping! Can we push it forward?

We need to do something about block comments. After the npm registry site switched to markdown-it, users reported edge cases:

I feel they are right: block comments should be handled more gently than they are now.

I agree about the comments. I think that comments, along with <script>, and <style> elements, should be special-cased. A raw HTML block starting with <!--, <script, or <style should be kept open until the first line containing -->, </script>, or </style> (respectively) is encountered. We don’t need to interpret text inside comments as Markdown!
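Here is a minimal sketch of that end condition, assuming the block start has already been detected. `TERMINATORS` and `html_block_end` are names invented for this example, and the prefix test is cruder than a real parser would want (it would also accept something like `<scripts`).

```python
# Opener prefix -> text that must appear on the closing line (assumed mapping).
TERMINATORS = {"<!--": "-->", "<script": "</script>", "<style": "</style>"}


def html_block_end(lines: list[str], start: int) -> int:
    """Return the index one past the last line of the HTML block opening at `start`."""
    opener = lines[start].lower()
    for prefix, closer in TERMINATORS.items():
        if opener.startswith(prefix):
            for i in range(start, len(lines)):
                text = lines[i].lower()
                if i == start:
                    text = text[len(prefix):]   # don't let the opener match its own closer
                if closer in text:
                    return i + 1                # the closing line is part of the block
            return len(lines)                   # never closed: runs to end of document
    # Ordinary HTML block: ends at the first blank line.
    for i in range(start, len(lines)):
        if lines[i].strip() == "":
            return i
    return len(lines)
```

Note that with “first line containing” as the rule, a line such as `</style> foo bar` simply ends the block and the trailing text stays inside it.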

I also agree that coming up with a better spec for HTML blocks is a high priority.

I’m thinking of something along the lines of the hybrid proposal above, with special-case rules for comments, script, and style.

It should also be made clear that a line with a partial tag like <div id="foo" can open a block (since the line may be hard-wrapped before it closes).

Note that on this proposal, some invalid HTML will be interpreted as HTML blocks. E.g.

<div id="foo"
that's all folks

or

<!-- HTML comments can't contain -- but this one does
-->

I think that’s okay: garbage in, garbage out.

This proposal would give a large degree of backwards compatibility with original Markdown, though there are a few cases where things would break, e.g.

<table>

    <tr>
        <td>
         etc.

Here the <tr> gets interpreted as indented code.

Some more discussion at https://github.com/jgm/CommonMark/issues/177
