Raw HTML blocks proposals -- comments wanted

jgm · January 2, 2015, 5:02pm

+++ Hoylen Sue [Jan 02 15 14:17 ]:

I agree that avoiding hardcoding tag names is a worthy goal.

Clarification required

An HTML block would start with a line containing a single, unindented tag (either opening, closing, or self-closing), and would end, as before, with a blank line or end of document.

Can you please clarify if “containing a single” is supposed to mean:

a. containing only a single complete unindented tag (and nothing else);
b. containing only a single complete unindented tag or a partial unindented tag, but not both;
c. containing at least a single complete unindented tag;
d. containing either only a single partial unindented tag or at least a single complete unindented tag; or both.

I was thinking (b). Of course, if we adopted Knagis’s suggestion, we’d
allow any number of complete tags and up to one incomplete tag, together
with whitespace.

Version 0.15 seems to be (d).

Version 0.15 does not implement any form of the proposal under
discussion here.

I prefer the options that allow for more than “only a single complete tag”, since there are situations where you want HTML tags to not contain any whitespace content.

I think we need to allow incomplete tags, since a long tag with many
attributes may be wrapped over several lines.

I don’t understand why, in example 1 of the 6th post, <del>foo</del> becomes <p><del>foo</del></p>?

The proposal under discussion was that an HTML block starts with a line
containing a single tag (and no other non-whitespace content). This
line contains “foo” and another tag, so it doesn’t start an HTML block.
That means it is interpreted as a paragraph, and the tags in it are
read as inline HTML tags. See CommonMark Spec
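
A minimal sketch of the rule described above, assuming reading (b): the line holds exactly one unindented tag, complete or partial, and nothing else except trailing whitespace. The name `starts_html_block` and both regexes are illustrative assumptions, not taken from any CommonMark implementation, and the attribute matching is deliberately simplified (it does not handle a `>` inside a quoted attribute value).

```python
import re

# A complete tag: <name ...>, </name>, or <name ... />.
COMPLETE_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9-]*(?:\s[^<>]*)?/?>")
# A partial tag: "<name ..." whose closing ">" has not appeared yet.
PARTIAL_TAG = re.compile(r"<[A-Za-z][A-Za-z0-9-]*(?:\s[^<>]*)?")


def starts_html_block(line: str) -> bool:
    """True if the line is a single unindented tag (complete or partial)."""
    if line[:1] != "<":  # indented, empty, or not a tag at all
        return False
    text = line.rstrip()  # trailing whitespace is allowed
    return bool(COMPLETE_TAG.fullmatch(text) or PARTIAL_TAG.fullmatch(text))
```

Under these assumptions, `<del>` and `<div id="foo"` would each open a block, while `<del>foo</del>` would not and so falls through to paragraph parsing.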

petere · January 3, 2015, 2:35am

I don’t like this idea. It changes the definition of HTML. Markdown should not legislate how embedded HTML has to be formatted, and it should not change the meaning of whitespace in HTML.

I think this approach is also incompatible in spirit and fact with the original Markdown definition and implementation.

I don’t think it’s so terrible to maintain a list of block and inline elements. That list doesn’t change that often. Web authors should or can know that list. For elements that can be both, an exception can be specified.

jgm · January 3, 2015, 5:30am

+++ Peter Eisentraut [Jan 03 15 02:48 ]:

I don’t think it’s so terrible to maintain a list of block and inline elements. That list doesn’t change that often. Web authors should or can know that list. For elements that can be both, an exception can be specified.

The problem is how to handle elements that can be both, like <del> or <a>.

BabelMark 2 shows three interesting variants among Markdown implementations:

Group 1 (the biggest group, including current commonmark) always renders <del>Hi</del> with <p> tags. Many of these implementations even insert <p> tags around a bare open <del> tag, which I don’t think is valid HTML.

Group 2 always renders <del>Hi</del> as a block.

Group 3 (PHP Markdown) seems to have an ad hoc rule for <del> that allows both variants. If the <del> and </del> tags are on lines by themselves, the whole thing is interpreted as a block; otherwise it’s interpreted as a paragraph with inline <del> tags.

I think Group 3 (PHP Markdown) has the best approach here, as it is the only implementation that gives authors the power to express both things.

But, once you see a need to distinguish between

<del>
Hi
</del>

and

<del>Hi</del>

based on the positioning of the tags (the fact that in the first case we just have a single tag on a line by itself), then the list of block- and inline- level elements becomes redundant. The possibility opens up of making the distinction solely on the basis of positioning, and not hard-coding a list of elements.

And there’s a pretty good reason, I think, for not hard-coding the lists: custom elements (Custom elements not supported · Issue #239 · commonmark/commonmark-spec · GitHub). There’s no way to know whether a custom element should be interpreted as a block or inline element. We could just declare all custom (unknown) elements as block-level, but that is a restriction on authors. If we just go by positioning, we don’t need to make that restriction.

vitaly · January 3, 2015, 9:02am

The number of possible custom tags is not finite.

chrisalley · January 3, 2015, 12:31pm

Could the rule put forward in the original syntax guide on Daring Fireball be used?

The only restrictions are that block-level HTML elements — e.g. <div>, <table>, <pre>, <p>, etc. — must be separated from surrounding content by blank lines, and the start and end tags of the block should not be indented with tabs or spaces. Markdown is smart enough not to add extra (unwanted) <p> tags around HTML block-level tags.

This would require the HTML block to be surrounded by blank lines. My understanding is that the blank lines rule was not required in the spec due to the Daring Fireball rule being restrictive. From CommonMark 0.15:

In some ways Gruber’s rule is more restrictive than the one given here:

It requires that an HTML block be preceded by a blank line.
It does not allow the start tag to be indented.
It requires a matching end tag, which it also does not allow to be indented.

Indeed, most Markdown implementations, including some of Gruber’s own perl implementations, do not impose these restrictions.

But these restrictions could also be considered features if the primary goal is readability. For example, requiring blank lines before and after the HTML block makes it easier on the eyes to separate from the rest of the page content. It could be considered a good practice worth enforcing by the spec.

jgm · January 6, 2015, 4:38am

Here’s a hybrid proposal, which is not as simple and elegant, but achieves some of the same goals: An HTML block starts with either (a) any HTML tag or partial tag on a line by itself, or (b) any HTML block tag, whether it is on a line by itself or not. (Dual-purpose tags like <del> would not be counted as block tag for purposes of (b).)

This proposal would allow all the blocks the previous proposal did, plus some more that it did not, like:

<div>foo</div>
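
A rough sketch of how the hybrid rule might be checked for one line. Everything here is an assumption for illustration: the function name, the regexes, and especially `BLOCK_TAGS`, which is a small sample rather than the list a spec would define. Dual-purpose tags such as `<del>` are left out of the sample on purpose, as rule (b) requires.

```python
import re

# Illustrative subset of block-level tag names (hypothetical, not a spec list).
BLOCK_TAGS = {"div", "table", "pre", "p", "ul", "ol", "blockquote"}

# Rule (a): one tag or partial tag alone on the line.
LONE_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9-]*(?:\s[^<>]*)?/?>?\s*")
# Rule (b): the line begins with a tag; capture the tag name.
LEADING_TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9-]*)(?=[\s/>]|$)")


def starts_html_block_hybrid(line: str) -> bool:
    """True if the line opens an HTML block under rule (a) or rule (b)."""
    if LONE_TAG.fullmatch(line):
        return True
    m = LEADING_TAG_NAME.match(line)
    return bool(m and m.group(1).lower() in BLOCK_TAGS)
```

With this sample list, `<div>foo</div>` opens a block through rule (b), a lone `<del>` opens one through rule (a), and `<del>Hi</del>` opens none.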

While we’re being practical, it might also make sense to special-case rules for verbatim block tags (<script>, <style>, <pre>), so that a block opened with one of these does not end until we hit the corresponding closing tag. (Normally we end the block at the first blank line, which allows users to include markdown content inside HTML tags if they want, but nobody would want markdown inside verbatim tags to be interpreted.)
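
The two ending rules can be sketched side by side. This assumes the block has already been recognised as starting at `lines[start]` (a non-blank line); the helper name `html_block_end` and the choice to end a verbatim block on the whole line that contains the closing tag are assumptions, not settled behaviour.

```python
import re

VERBATIM_TAGS = ("script", "style", "pre")
VERBATIM_OPEN = re.compile(
    r"<(" + "|".join(VERBATIM_TAGS) + r")(?=[\s>]|$)", re.IGNORECASE
)


def html_block_end(lines: list[str], start: int) -> int:
    """Return the index of the last line of the HTML block opened at lines[start]."""
    opener = VERBATIM_OPEN.match(lines[start])
    if opener:
        # Verbatim block: blank lines do not end it, only the closing tag does.
        closer = f"</{opener.group(1).lower()}>"
        for i in range(start, len(lines)):
            if closer in lines[i].lower():
                return i
        return len(lines) - 1  # never closed: runs to end of document
    # Ordinary block: ends just before the first blank line.
    for i in range(start, len(lines)):
        if not lines[i].strip():
            return i - 1
    return len(lines) - 1
```

In this sketch any text that follows the closing tag on the same line stays inside the block, which is only one of several possible choices.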

vitaly · January 6, 2015, 6:19am

What should happen if such a block is closed with

</style> foo bar

?

vitaly · January 6, 2015, 6:22am

Or even more “nice” combination:

styles...

</style> <script>

code...

vitaly · February 10, 2015, 5:23pm

Ping! Can we push it forward?

We need to do something about block comments. After the npm registry site switched to markdown-it, users reported edge cases:

I feel they are right: block comments should be handled more gently than they are now.

jgm · February 11, 2015, 1:27am

I agree about the comments. I think that comments, along with <script> and <style> elements, should be special-cased. A raw HTML block starting with <!--, <script>, or <style> should continue until -->, </script>, or </style> (respectively) is encountered. We don’t need to interpret text inside comments as Markdown!

I also agree that coming up with a better spec for HTML blocks is a high priority.

I’m thinking of something along the lines of the hybrid proposal above, with special-case rules for comments, script, and style.

It should also be made clear that a line with a partial tag like <div id="foo" can open a block (since the line may be hard-wrapped before it closes).

Note that on this proposal, some invalid HTML will be interpreted as HTML blocks. E.g.

<div id="foo"
that's all folks

or

<!-- HTML comments can't contain -- but this one does
-->

I think that’s okay: garbage in, garbage out.

This proposal would give a large degree of backwards compatibility with original Markdown, though there are a few cases where things would break, e.g.

<table>

    <tr>
        <td>
         etc.

Here the <tr> gets interpreted as indented code.

Some more discussion at https://github.com/jgm/CommonMark/issues/177

an3ss · February 27, 2015, 2:24pm

I’m wondering if the bit about comments is going to be revised in the spec soon. The following examples summarize how I feel the spec should define inline comments and block comments, independently of other HTML blocks:

<!-- inline comment --> Some **formatted** `text`.

<!-- inline comment
     split in two lines --> Some **formatted** `text`.

<!-- Block comment -->
Some **formatted** `text`.

<!-- Block comment 1 -->
paragraph 1
<!-- Block comment 2 -->
paragraph 2

<!-- 
  Multiline block comment.
-->
    Code block

So comments preceded or followed by text on the same line would be inline comments and the rest would be block comments (by themselves, not including subsequent lines).

With this definition, block comments would be treated like a ghost paragraph (one that acts like a paragraph from the parser’s standpoint but that does not generate a <p> element around the comment).
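
The inline/block distinction proposed here could be sketched roughly as follows. The function name `classify_comments` and the regex are hypothetical, and the sketch only labels comments; it does not show how a parser would then treat the surrounding text.

```python
import re

# Non-greedy so that each comment ends at the first "-->"; DOTALL lets it span lines.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def classify_comments(text: str) -> list[tuple[str, str]]:
    """Label each HTML comment as 'inline' or 'block' by what shares its lines."""
    results = []
    for m in COMMENT.finditer(text):
        # Text on the comment's first line, before the comment starts.
        line_start = text.rfind("\n", 0, m.start()) + 1
        before = text[line_start:m.start()]
        # Text on the comment's last line, after the comment ends.
        line_end = text.find("\n", m.end())
        if line_end == -1:
            line_end = len(text)
        after = text[m.end():line_end]
        kind = "inline" if before.strip() or after.strip() else "block"
        results.append((m.group(0), kind))
    return results
```

A comment that shares a line with other text is labelled inline, including a comment split over two lines whose closing `-->` is followed by text; a comment with nothing else on its first and last lines is labelled block.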

With this proposal, markdown formatting is always available after block comments, which I think is desirable (the current spec disables formatting until a blank line is found). Does this make sense?

Thank you and regards.

jgm · March 1, 2015, 12:00am

+++ an3ss [Feb 27 15 14:35 ]:

 With this proposal, markdown formatting is always available after block
comments, which I think is desirable (the current spec disables
formatting until a blank line is found). Does this make sense?

Here is why I like the original proposal better:

For someone who hard wraps, the difference between
```
<!-- comment -->
text text
```
and
```
<!-- comment --> text
text
```
might just be that in the former case the comment took
the full width of the line, while in the latter it
didn’t. Seems better to me to require a blank line
to end a block level comment.

There are cases where an HTML block might begin with a
comment (e.g. explaining what it is for), and in these
cases, you certainly wouldn’t want to interpret whatever
came right after the comment as Markdown.

an3ss · March 1, 2015, 1:08pm

Actually, I would like everything interpreted as Markdown unless it is an HTML tag, a comment, a <! declaration or a processing instruction. I don’t know if this can be done in general, and that’s why I’m only speaking about comments here (although this should be easy to do with declarations and processing instructions too).

So my point is:

I want this to be a **formatted** paragraph (it already is). <!-- comment -->

<!-- comment --> I'd like this to be a **formatted** paragraph too (it is not in the current spec).

<!-- comment -->
I don't care if the previous comment is a block or is inline, but I would like this to be a **formatted** paragraph too (it is not in the current spec).

In other words, I think an HTML comment should not start an “HTML block”, only an HTML tag could do that.

jgm · March 3, 2015, 4:31am

+++ an3ss [Mar 01 15 13:18 ]:

Actually, I would like everything interpreted as Markdown unless it is an HTML tag, a comment, a <! declaration or a processing instruction.

That would make it impossible to copy blocks of raw HTML into a Markdown document, without a change in meaning. We have tried to preserve this feature of original Markdown as much as possible, while still offering the possibility of having Markdown inside HTML tags. The current spec allows an author to choose either mode (leave a blank line for following Markdown content, or no blank line to interpret as HTML).

an3ss · March 3, 2015, 7:48am

I understand and I appreciate the effort. But what about HTML comments?
Why do they start an HTML block? Many times you will use them to comment pure markdown, not HTML markup. And other parsers seem to ignore them just as I would expect (see my previous examples).

codinghorror · March 3, 2015, 8:24am

I think that’s a pretty big drawback. Like… really big. As @vitaly noted even cut and paste HTML like

<iframe width="560" height="315" src="//www.youtube.com/embed/tjuBV4NbCng" frameborder="0" allowfullscreen></iframe>

… will stop working.

I want to go back to first principles on this: is HTML itself changing so fast that hard-coding the list of tags is really all that bad?

In this complete list of the new tags in HTML 5, I count 28 new tags, things like <article> and <progress>. (I can’t find a similar list for HTML 5.1, maybe because the spec is still evolving.)

I understand the problem with span vs block level, but that problem has always existed with Markdown, has it not?

I feel like unless we outline “yes, this is a tag we know, this is a valid HTML 5 tag” it just gets so weird, we are trading one known set of problems for another entirely unknown set of problems.

vitaly · March 3, 2015, 9:04am

HTML 5 allows ANY custom tags (with a dash in the name). They are widely used in page templates, frameworks, and static page generators. Limiting the allowed HTML block tags is not acceptable these days.
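
A simplified sketch of testing for a custom element name by its dash. The HTML standard requires a lowercase ASCII first letter and at least one hyphen, and reserves a handful of hyphenated SVG/MathML names; the regex below is an ASCII-only approximation (the standard also admits many non-ASCII characters), and the name `is_custom_element_name` is illustrative.

```python
import re

# ASCII-only approximation of the custom element name grammar.
CUSTOM_NAME = re.compile(r"[a-z][a-z0-9._]*-[a-z0-9._-]*")

# Hyphenated names reserved by the HTML standard, so never custom elements.
RESERVED = {
    "annotation-xml", "color-profile", "font-face", "font-face-src",
    "font-face-uri", "font-face-format", "font-face-name", "missing-glyph",
}


def is_custom_element_name(name: str) -> bool:
    """True if the name looks like a valid custom element name."""
    return bool(CUSTOM_NAME.fullmatch(name)) and name not in RESERVED
```

A check like this says that a tag is custom, but not whether the author meant it as block-level or inline.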

codinghorror · March 3, 2015, 9:35am

Seems like we could test for that?

Alternately we could expect the “classic” tags.

vitaly · March 3, 2015, 9:40am

I already do that in markdown-it. It helps in most cases. But we still can’t know whether a line starts with a block or an inline tag. The current spec uses whitelisting, which will not work with custom tags.

lu_zero · March 3, 2015, 10:30am

Assuming it is always HTML is quite annoying. I’d rather use the single-<tag>-starts, single-<tag>-ends rule and accept that some content needs to have a <div> surrounding it.