Raw HTML blocks proposals -- comments wanted

jgm · December 26, 2014, 9:19pm

The current spec for raw HTML blocks hard-codes a list of HTML “block-level” tags. This isn’t ideal, for several reasons:

New tags may be added to HTML in the future
If you use custom tags in your application, they’ll always be treated as span-level, which is often not what you want (see https://github.com/jgm/CommonMark/issues/239).
Some tags can be either span- or block-level (see https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Content_categories#Phrasing_content). The current system will always treat them one way or the other. For example, <del> tags are always treated as span-level, so there’s no way to use <del> tags to surround a markdown block-level construct (a list or paragraph, say).

Here’s a simple proposal that would avoid any hard-coding at all. An HTML block would start with a line containing a single, unindented tag (either opening, closing, or self-closing), and would end, as before, with a blank line or end of document.
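A minimal sketch of this rule, for illustration only. The tag regex is a deliberate simplification (it does not handle `>` inside attribute values, comments, or other block constructs), and the names `TAG`, `starts_html_block`, and `split_blocks` are hypothetical, not taken from the spec or any implementation:

```python
import re

# Simplified tag pattern (an assumption, not the spec's grammar): an opening,
# closing, or self-closing tag with optional attributes.
TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9-]*(?:\s+[^<>]*)?/?>")


def starts_html_block(line: str) -> bool:
    """True if the line consists of exactly one unindented tag."""
    return TAG.fullmatch(line.rstrip()) is not None


def split_blocks(text: str) -> list[tuple[str, str]]:
    """Split text on blank lines and label each chunk 'html' or 'paragraph'."""
    blocks = []
    chunk = []
    # The extra "" acts as a final blank line so the last chunk is flushed.
    for line in text.splitlines() + [""]:
        if line.strip():
            chunk.append(line)
        elif chunk:
            kind = "html" if starts_html_block(chunk[0]) else "paragraph"
            blocks.append((kind, "\n".join(chunk)))
            chunk = []
    return blocks
```

Under this sketch, a chunk whose first line is just `<del>` is labelled `html`, while a chunk whose first line is `<del>foo</del>` is labelled `paragraph`, because that line holds more than a single tag.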

The main drawback I can see (relative to the current spec) is that some existing Markdown documents may have HTML blocks that don’t start with a single tag on a line by itself, and these would be interpreted wrongly as paragraphs.

lu_zero · December 26, 2014, 10:58pm

Doesn’t sound bad to me. Obviously, since my use cases are non-HTML outputs, I’m not so focused on it, but it seems simple to use and easy to implement, and should always give a nice result.

vitaly · December 26, 2014, 11:15pm

I don’t use HTML, but I also thought about such a separation while writing a parser. If that’s acceptable in terms of compatibility, that would be good.

If we are talking about changes to the HTML logic, it is worth mentioning block comments as a special case. Should we avoid parsing their internals? Should we care about the case where a block comment contains a paragraph with an inline comment?

Knagis · December 27, 2014, 2:44pm

Maybe multiple tags can be allowed on the opening line, as long as it contains only HTML tags and whitespace?

For example:

<table><thead>
<tr>...
</thead></table>
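A rough sketch of this variant, assuming the same kind of simplified tag pattern as a real grammar would refine. `TAG`, `ONLY_TAGS`, and `starts_html_block_multi` are hypothetical names used only for illustration:

```python
import re

# Simplified tag pattern (an assumption): opening, closing, or self-closing.
TAG = r"</?[A-Za-z][A-Za-z0-9-]*(?:\s+[^<>]*)?/?>"

# One unindented tag, then any number of further tags separated by optional
# whitespace, then optional trailing whitespace. No other text is allowed.
ONLY_TAGS = re.compile(rf"{TAG}(?:\s*{TAG})*\s*")


def starts_html_block_multi(line: str) -> bool:
    """True if the line holds only tags and whitespace, starting at column 0."""
    return ONLY_TAGS.fullmatch(line) is not None
```

With this check, `<table><thead>` and `</thead></table>` both qualify as opening lines, while `<del>foo</del>` does not, because of the text between the tags.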

lu_zero · December 27, 2014, 6:40pm

It opens the road to more false positives, though.

jgm · December 28, 2014, 3:58am

Let me illustrate the proposal with three examples:

<del>foo</del>

becomes

<p><del>foo</del></p>

<del>
foo
</del>

becomes

<del>
foo
</del>

<del>

foo

</del>

becomes

<del>
<p>foo</p>
</del>

These examples nicely bring out the difficulties we’re facing: the <del> tag might be intended to surround content within a paragraph, a whole paragraph, or some raw HTML. The current proposal gives you the power to express all of these.

Note also that what goes for <del> goes for every tag, even tags like <div> and <em> that are always block or span level. This means you’d get some strange results if you did something like

<div>foo</div>

which would become

<p><div>foo</div></p>

But I think this is okay – even with the present system, we’re not guaranteeing well-formed HTML output when raw HTML is used in the input. It would be up to the author to know the (fairly simple) rules governing HTML tags. The nice thing about these rules is that authors wouldn’t need to remember a list of tags that are designated as block or inline.

lu_zero · December 28, 2014, 9:31am

As long as it is well defined, I like it.

vitaly · December 28, 2014, 1:17pm

Found an edge case for HTML. This is real code for embedding YouTube videos, as provided by Google:

<iframe width="560" height="315" src="//www.youtube.com/embed/tjuBV4NbCng" frameborder="0" allowfullscreen></iframe>

With the new algorithm it will be wrapped in a paragraph.

I’d suggest a slightly different split for block/inline tags:

HTML that starts exactly at the start of the line, with no leading spaces, will be block.
If one or more leading spaces exist, it will be inline.
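As a rough illustration of this alternative rule (a sketch only: the `HTML_START` pattern and the `classify_by_indent` name are assumptions, and "HTML" is approximated as text beginning with `<` followed by a letter or `/`):

```python
import re

# Assumption: "HTML" here means the content begins with '<' followed by a
# letter or '/', which is much looser than any real tag grammar.
HTML_START = re.compile(r"</?[A-Za-z]")


def classify_by_indent(line: str) -> str:
    """Return 'block', 'inline', or 'text' under the leading-space rule."""
    content = line.lstrip(" ")
    if not HTML_START.match(content):
        return "text"
    indent = len(line) - len(content)
    return "block" if indent == 0 else "inline"
```

Under this rule the one-line iframe snippet is a block when it starts at column 0, and the same snippet becomes inline as soon as it is preceded by a space.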

Or we can consider the strategy suggested by @Knagis

http://talk.commonmark.org/t/raw-html-blocks-proposals-comments-wanted/983/4?u=vitaly

It will not solve the case of <div>foo</div> being wrapped in a paragraph, but it will cover the most common situations.

Knagis · December 28, 2014, 5:42pm

This could prove problematic when the blocks are nested, for example, within a blockquote (example) because there the common practice is to indent everything by one space.

Another case, similar to what @vitaly mentioned: <script src="..."></script>.

vitaly · December 28, 2014, 6:21pm

In nested blocks, “start of line” is shifted appropriately and counted relative to the container’s inner content…

>
> > <div>foo</div>

For the div on the second line, the start of line is at 4.

Knagis · December 28, 2014, 6:41pm

That is the problem - since those spaces are not required.

>
>><div>foo</div>

works as well.

vitaly · December 28, 2014, 6:58pm

I don’t see the problem. In your case the nested blockquote’s “inner start” will be at 2, and <div> will be “at line start” again.

Maybe there is a misunderstanding caused by my English, sorry.

Knagis · December 28, 2014, 7:18pm

The problem is that the parser has to know what is considered the start of the line. But with blockquotes the optional space that is very commonly used will make this impossible (without saying that the space has to be omitted).

>> <div>foo</div>
>>
>><div>foo</div>

vitaly · December 28, 2014, 7:41pm

Why? It’s the next char after “>” if that is not a space, or (next+1) if “>” is followed by a space or multiple spaces.

PS. At least it works in markdown-it, where we “remap” every line in a blockquote to make the inner content look like “usual content”.
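The remapping described here can be sketched roughly as follows. This is not markdown-it’s actual code: `strip_blockquote_markers` is a hypothetical helper, and it assumes that each “>” marker swallows at most one following space:

```python
def strip_blockquote_markers(line: str) -> str:
    """Remove leading '>' markers, each with at most one following space."""
    rest = line
    while rest.startswith(">"):
        rest = rest[1:]          # drop the marker itself
        if rest.startswith(" "):
            rest = rest[1:]      # drop one optional space after it
    return rest
```

With this helper, `> > <div>foo</div>`, `>> <div>foo</div>`, and `>><div>foo</div>` all remap to the same inner line, `<div>foo</div>`, so the inner content cannot tell whether the author typed the optional space. A line written as `>>  <div>foo</div>` (two spaces) keeps one leading space after remapping.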

Knagis · December 28, 2014, 9:17pm

The approach you specified implies that if the user wants the HTML content to be inline, it must be preceded by at least one space. But this will not work the same way within blockquotes since, as you mention, we trim the initial space.

vitaly · December 28, 2014, 9:44pm

Ah! Now I understand the problem.

That will work if you use >_ for quoting. That will give 2 spaces in total, and only one will be trimmed.

Also, in blockquotes, could the spec require a strict indent before/after ">"? I suggested making the trimming equal for all inner lines. That could reduce the problem for multiline blockquotes when the user doesn’t use a space after >

But I agree, my solution is not ideal either. Stats from real docs are needed to decide. I don’t use HTML at all, and will be fine if your suggestion is accepted instead of mine. I like it and have no breaking examples.

lu_zero · January 1, 2015, 3:24pm

It can be written as

</iframe ...>

or

<iframe width="560" 
        height="315"
        src="//www.youtube.com/embed/tjuBV4NbCng"
        frameborder="0"
        allowfullscreen>
</iframe>

vitaly · January 1, 2015, 10:17pm

I think we should look for a solution that is not only possible, but usable and natural. If copy-pasted HTML has to be split, that will guarantee reports about broken support to all parser maintainers.

lu_zero · January 1, 2015, 10:52pm

It is a good point but quite conflicting. I’d rather have the current proposal and have people surround copy-pasted code with <div> than increase the chance of false positives.

Hoylen · January 2, 2015, 2:06pm

I agree that avoiding hardcoding tag names is a worthy goal.

Clarification required

An HTML block would start with a line containing a single, unindented tag (either opening, closing, or self-closing), and would end, as before, with a blank line or end of document.

Can you please clarify if “containing a single” is supposed to mean:

a. containing only a single complete unindented tag (and nothing else);
b. containing only a single complete unindented tag or a partial unindented tag, but not both;
c. containing at least a single complete unindented tag;
d. containing either only a single partial unindented tag or at least a single complete unindented tag; or both.

Version 0.15 seems to be (d), since <div></div> in Example 100 is treated as a valid start of an HTML block (even though it has more than just one tag) and incomplete tags are also a valid start of an HTML block in Example 107.

If the new proposal is (d) or (b) then there is no problem with the YouTube iframe example, since it is a valid partial tag. The iframe example is only a problem if partial tags cannot be a valid HTML block tag (i.e. (a) or (c)).

Examples

To make the above options more clear, here are some example lines.

Under (a):

<div> would be a valid beginning of an HTML block.
<div></div> would be invalid, since it is not a single tag (it is two tags).
<div class="foo" would be invalid, since it is not a complete tag.
<div><p class="bar" would be invalid, since there is more than a single tag.

Under (b):

<div> would be valid.
<div></div> would be invalid, since it is not a single tag (it is two tags).
<div class="foo" would be valid.
<div><p class="bar" would be invalid, since there is both a single complete tag and a partial unindented tag, but only one is allowed.

Under (c):

<div> would be valid.
<div></div> would be valid.
<div class="foo" would be invalid, since it is not a complete tag.
<div><p class="bar" would be valid.

Under (d):

<div> would be valid.
<div></div> would be valid.
<div class="foo" would be valid.
<div><p class="bar" would be valid.
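The four readings can be made concrete with a small checker. This is only a sketch under simplifying assumptions: the regexes approximate complete and partial tags (no `>` inside attribute values), and `COMPLETE`, `PARTIAL`, and `valid_start` are hypothetical names introduced for illustration:

```python
import re

# Simplified patterns (assumptions, not the spec's grammar).
# COMPLETE: a tag that is closed with '>' on this line.
COMPLETE = re.compile(r"</?[A-Za-z][A-Za-z0-9-]*(?:\s+[^<>]*)?/?>")
# PARTIAL: a tag that has been opened but has no closing '>' yet.
PARTIAL = re.compile(r"</?[A-Za-z][A-Za-z0-9-]*(?:\s+[^<>]*)?")


def valid_start(line: str, option: str) -> bool:
    """Check whether a line may start an HTML block under option a, b, c, or d."""
    line = line.rstrip()
    only_complete = COMPLETE.fullmatch(line) is not None   # one complete tag, nothing else
    only_partial = PARTIAL.fullmatch(line) is not None     # one partial tag, nothing else
    starts_complete = COMPLETE.match(line) is not None     # complete tag, anything after
    if option == "a":
        return only_complete
    if option == "b":
        return only_complete or only_partial
    if option == "c":
        return starts_complete
    if option == "d":
        return starts_complete or only_partial
    raise ValueError(f"unknown option: {option!r}")
```

Running the four sample lines through `valid_start` reproduces the valid/invalid results listed above for each option.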

A preference

I prefer the options that allow for more than “only a single complete tag”, since there are situations where you want HTML tags to not contain any whitespace content. When you really want to produce <foo></foo> or <foo/> and don’t want <foo> </foo> or

<foo>
</foo>

A question

I don’t understand why, in example 1 of the 6th post, <del>foo</del> becomes <p><del>foo</del></p>?

If this is not being interpreted as an HTML block (because it contains more than a single tag) and so is treated as a paragraph, wouldn’t the special characters be treated as normal characters (since they cannot be interpreted as autolinks)? So, if it is transformed into HTML, it would become <p>&lt;del&gt;foo&lt;/del&gt;</p>.