Newline and IMG tags


#1

Given that we still have no way of specifying image size in CommonMark, we are stuck using <img> tags in Discourse.

Trouble is, it is terribly hard to format multiple images on lines by themselves. This is so bad that I feel I need to diverge from the spec. @jgm / @codinghorror

In particular, stuff like this:

<img src="/uploads/default/original/2X/d/d2358ccd3904da10dd9f4c7af07cb2ccd88942bc.png" width="85" height="53">

<img src="/uploads/default/original/2X/d/d2358ccd3904da10dd9f4c7af07cb2ccd88942bc.png" width="85" height="53">

This renders with both images side by side on the same line, which is clearly not what the author of the post on Discourse intends.

The intention is for each image to be on a line by itself, and for the rendering to be akin to

![](/uploads/default/original/2X/d/d2358ccd3904da10dd9f4c7af07cb2ccd88942bc.png)

![](/uploads/default/original/2X/d/d2358ccd3904da10dd9f4c7af07cb2ccd88942bc.png)

However, since we cannot specify the size, we are stuck using <img> tags, and that leads to some very confusing behavior on the user’s end.


#2

Funnily enough, you can get the desired result by inserting a line break in the <IMG> tag, like this:

<img
src="/uploads/default/original/2X/d/d2358ccd3904da10dd9f4c7af07cb2ccd88942bc.png" width="85" height="53">

<img
src="/uploads/default/original/2X/d/d2358ccd3904da10dd9f4c7af07cb2ccd88942bc.png" width="85" height="53">

This will place each <IMG> element into its own <P> element, just like in your second example. The reference implementation renders this as:

<p><img 
src="/uploads/.../d2...bc.png" width="85" height="53"></p>
<p><img 
src="/uploads/.../d2...bc.png" width="85" height="53"></p>

The reason for this difference in behaviour is, as far as I understand it, that in the second case start condition number 7 in section 4.6 (“HTML blocks”) no longer applies: the tag is split across two source lines, and hence the line no longer “begins with a complete open tag or closing tag”. Thus the <IMG> element is not treated as an HTML block, but as a “normal” tag inside the (inline) content of a paragraph (in this case, as it happens, the only content of that paragraph).
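As a rough illustration of that rule, here is a minimal Python sketch that tests whether a single source line would satisfy a simplified version of start condition 7. The regular expressions are my own approximation of the spec's tag grammar, the function name `starts_html_block_type7` is made up for this example, and the image path is a placeholder. The real condition additionally excludes a few tag names and cannot interrupt a paragraph, both of which are ignored here.

```python
import re

# Simplified approximation of the CommonMark tag grammar (an assumption, not the spec's exact text).
ATTRIBUTE = r"""\s+[A-Za-z_:][A-Za-z0-9_.:-]*(?:\s*=\s*(?:[^\s"'=<>`]+|'[^']*'|"[^"]*"))?"""
OPEN_TAG = r"<[A-Za-z][A-Za-z0-9-]*(?:" + ATTRIBUTE + r")*\s*/?>"
CLOSING_TAG = r"</[A-Za-z][A-Za-z0-9-]*\s*>"

# The line must consist of one complete tag, optionally followed by spaces or tabs.
COMPLETE_TAG_LINE = re.compile(r"^ {0,3}(?:" + OPEN_TAG + r"|" + CLOSING_TAG + r")[ \t]*$")


def starts_html_block_type7(line: str) -> bool:
    """Return True if `line` on its own is a complete open or closing tag."""
    return COMPLETE_TAG_LINE.match(line) is not None


one_line = '<img src="/uploads/example.png" width="85" height="53">'
print(starts_html_block_type7(one_line))  # complete tag on a single line
print(starts_html_block_type7("<img"))    # first line of the split tag
```

Under this check the single-line tag qualifies as the start of an HTML block, while the first line of the split tag does not, so it falls through to ordinary paragraph parsing.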


As an aside, note that the first output (two <IMG> elements as direct children presumably of the document’s <BODY> element) is actually invalid according to HTML rules (version 4.01; ISO 15445:2000): the <IMG> element is not one of the “block-like elements” (collected in %block;) allowed there.


However, I consider your <IMG> problem not just a curiosity, but an example and a symptom of the somewhat shaky way in which descriptive markup (that is, start-tags, end-tags, empty-element-tags; HTML or otherwise) is processed in Markdown and CommonMark.

Conceptually, there are at least four issues/aspects to consider:

  1. What gets recognized as a tag, and where?

  2. How to find the end of an element which started with an explicit start-tag?

  3. If a start-tag is encountered outside of a paragraph, does it start a new paragraph (will the element, alongside following inline content, be “wrapped” inside a <P> element in the output)?

  4. Is the (mixed) content of an explicitly marked-up element parsed according to Markdown rules or just passed along?

In the current specification the treatment of descriptive markup depends on whether it is found inside a paragraph or outside:

  • Inside a paragraph, tags are just passed along: There is no need to find the element’s end, and content continues to be treated like regular Markdown text.

  • Outside a paragraph, tags may either start a paragraph or otherwise start, continue, or end an HTML block. Content of such an HTML block is not parsed as regular Markdown text but passed along literally.

While this approach avoids (as it IMO should) the need to actually parse the element structure (keeping track of open and/or omitted tags etc), it mixes the aspect of “content-parsing” with that of “paragraph-wrapping”.

For example, there is—as far as I can tell—no way to write in a Markdown text a “customized” paragraph with an explicit start-tag but otherwise the processing of a regular paragraph.

Consider a <NOTE> element with the same content model as an ordinary <P>—one would like to write in Markdown/CommonMark something similar to

After a regular paragraph:

<NOTE>Consider *this* example.</NOTE>

And so on.

and have it rendered as

<P>After a regular paragraph:</P>

<NOTE>Consider <EM>this</EM> example.</NOTE>

<P>And so on.</P>

I have not yet found a way to achieve this—not in CommonMark and not in (most) other Markdown variants: Either the output will have spurious <P> tags inserted, or the content will not get parsed (ignoring the Markdown-* for indicating emphasis <EM>).

Maybe it’s just me, but I really do feel that there should be an easier and cleaner way to indicate if an explicitly marked-up element should

  • get wrapped into a paragraph or not (that is, whether it is to be treated as inline or block level), and

  • have its content parsed “as usual”—that is, as Markdown-style inline text just like in a regular paragraph.

And to indicate these two aspects independently, if at all possible.
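To make the two aspects concrete, here is a toy Python sketch that treats wrapping and content parsing as two independent switches. Everything in it is hypothetical: `render_element`, its `wrap` and `parse` parameters, and the one-rule emphasis parser are inventions for illustration and do not correspond to any existing CommonMark syntax or implementation.

```python
import re


def parse_inline(text: str) -> str:
    """Toy inline parser: only turns *text* into <EM>text</EM>."""
    return re.sub(r"\*([^*]+)\*", r"<EM>\1</EM>", text)


def render_element(tag: str, content: str, wrap: bool, parse: bool) -> str:
    """Render an explicitly tagged element with the two aspects chosen independently."""
    body = parse_inline(content) if parse else content
    element = f"<{tag}>{body}</{tag}>"
    return f"<P>{element}</P>" if wrap else element


source = "Consider *this* example."

# The combination I am after: no wrapping, but parsed content.
print(render_element("NOTE", source, wrap=False, parse=True))
# The two outcomes I actually get: a spurious <P>, or unparsed content.
print(render_element("NOTE", source, wrap=True, parse=True))
print(render_element("NOTE", source, wrap=False, parse=False))
```

The point of the sketch is only that the two decisions are separable in principle; how an author would express them in the source text is left open.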


#3

<img src="x"> <img src="x">

I would expect that to render on the same line, if the images are small enough, yes.

However …

<img src="x">

<img src="x">

If the above rendered on the same line, that’d indeed be surprising to me.


#4

If the above rendered on the same line, that’d indeed be surprising to me.

I’m not sure why—in any case, a <BODY> element like this:

<BODY>
<IMG src="x">

<IMG src="x">
</BODY>

is actually rendered with the two images side-by-side (on the same line) in Firefox (Pale Moon, to be precise), but it is valid only in HTML5, where <IMG> is categorized as flow content and may thus appear in the content of a <BODY> element; otherwise (in proper “traditional” HTML) this is invalid (the <IMG> element is not allowed to appear there), so technically all bets are off when it comes to rendering expectations. It just turns out that this particular browser renders it the same way in all cases, just like it does with eg

<BODY>
<B>One</B>

<B>Two</B>
</BODY>

(where the same difference with regard to validity applies: the <B> element is also considered flow content in HTML5, but inline or character-level text otherwise, with the same results).


#5

WRT “valid only in HTML5” and “proper ‘traditional’ HTML”:

The HTML4 specification from 1999 is obsolete and has been superseded by the 2014 HTML5 specification, which was in turn replaced by the HTML5.1 specification in 2016.

As such, “valid only in HTML5” is misleading. You’re right that it historically was invalid, per the DTD; however, as that is no longer the case and hasn’t been for quite a number of years now*, it is incorrect to imply that the obsolete HTML4 specification is somehow “proper”, or that it should be favored in any way over the HTML5.1 specification.

* (HTML5 was actively in use by browsers well before the specification had reached Recommendation status)


#6

Because newlines are interpreted as newlines here, sorry, should have been more explicit.

This
Renders
Like
This
Here


#7

I don’t want to go into a discussion of the merits and demerits of HTML5 here, or the meaning of terms like “obsolete”, “superseded”, “interoperability” etc. (My use of the phrase “proper” was obviously tongue-in-cheek, expressing an opinion rather than claiming a fact.)

After all, this is beside the point and problem that the original poster indicated—at least as I understood it, namely:


How can the author of a CommonMark text control/influence/indicate whether an explicitly marked-up element (HTML5, DocBook, “custom” or otherwise) occurring at “block level” is or is not to be wrapped inside a <P> element in the processor’s output (ie, whether or not the start-tag will implicitly start a CommonMark paragraph)? And whether or not the element’s content should be parsed according to CommonMark rules?

For the special case of an <IMG> element in CommonMark, the answer is simple (disregarding the mentioned “new-line-hack”): Write an HTML block for a <P> if wrapping is intended:

This *HTML block* will produce a &ldquo;freestanding&rdquo; `<IMG>` element:

<IMG src="http://example.com/img.png" alt="freestanding">

This *HTML block* will produce an `<IMG>` inside a paragraph (`<P>`) element:

<P>
<IMG src="http://example.com/img.png" alt="wrapped into paragraph">
</P>

(This is different in CommonMark than in most other Markdown variants, which tend to treat <IMG> as inline and thus wrap the first example inside a <P>.)

And my point was that this is akin to a more general problem: You can, as far as I can see, either have “neither wrapping nor parsing”, like in an ordinary, Gruber-like HTML block:

<FOO>
Neither wrapping *nor* parsing.
</FOO>

or you can have “both wrapping and parsing”, like in an ordinary paragraph containing inline markup:

<FOO>Both wrapping *and* parsing.</FOO>

(Though existing Markdown implementations differ wildly in these two examples, as Babelmark shows …)

The combination “no wrapping but parsing” seems to be impossible to achieve—and my <NOTE> example attempted to show why this could be desirable.


#8

Hmm, I still don’t quite get what you mean:

If “here” means “in the immediate content of a <BODY> element” and “interpreted as newlines” means “lines are broken at preserved newline characters” (as for the pre, pre-wrap, or pre-line values of the white-space CSS property), then I can find no mention of this in the rendering expectations defined for HTML5. (Nor does my admittedly superficial testing confirm such rendering behaviour, as I mentioned.)

In any case, I think the focus should be on the output generated by CommonMark and not on the details of how this output will be rendered (under the default style sheet of more or less common browsers etc).


#9

We use a variant of CommonMark that renders

a
b
c

as

a
b
c

This is discussed in detail here: Default line break handling is inconvenient

My expectation is that if this “option” is enabled any sequence of inline HTML elements on lines by themselves would behave the same way as the text.
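Roughly, the text behaviour can be sketched like this in Python. This is not our actual implementation; `render_paragraph` and its `breaks` parameter are hypothetical names, and the function only handles a plain-text paragraph with no other inline markup.

```python
import html


def render_paragraph(text: str, breaks: bool) -> str:
    """Render a plain-text paragraph; newlines inside it are soft line breaks."""
    lines = [html.escape(line.strip(), quote=False) for line in text.split("\n")]
    # With the option on, each soft break becomes a hard break (<br />);
    # with it off, the newline is passed through unchanged.
    separator = "<br />\n" if breaks else "\n"
    return "<p>" + separator.join(lines) + "</p>"


print(render_paragraph("a\nb\nc", breaks=True))
print(render_paragraph("a\nb\nc", breaks=False))
```

The sketch only covers newlines inside a paragraph, which is exactly why a run of HTML tags on separate lines is the open question.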

@codinghorror for the record, I did “fix” this in Discourse. But I feel that we have to attach some special semantics to HTML inlines if you have the “newline”=<br> option enabled.


#10

So “interpreted as newlines” actually means “replaced by <BR> elements in the output” by this variant of CommonMark, implementing the remark

A renderer may also provide an option to render soft line breaks as hard line breaks.

in section 6.10. And the initial example input

<IMG src="..." alt="freestanding img 1">

<IMG src="..." alt="freestanding img 2">

would (or should, using this variant?) produce the output HTML (ie HTML5 or XHTML or “polyglot” or …?) fragment

<IMG src="..." alt="freestanding img 1" /><BR />
<BR />
<IMG src="..." alt="freestanding img 2" /><BR />

In this case I agree: I see no reason why this fragment should not render the two images on top of each other, and in fact it does in my browser.

So is the problem that this variant does (or did before the “fix”) not emit the <BR> elements, or that your browser still shows the images side-by-side?


#11

No, I do not see this as a bug in the markdown.it implementation of:

A renderer may also provide an option to render soft line breaks as hard line breaks.

There are no soft line breaks in the tree next to the images; each image is treated as an HTML block per other sections of the spec.

I feel there is a gap in the spec explaining how to deal with this.

cc @vitaly

Also

<p class='x'>
test
</p>

Should not turn into

<p class='x'><br>
test<br>
</p><br>

Inline vs. block HTML needs special handling.
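One conceivable way to draw that line, sketched in Python purely as an illustration: apply the break option to an HTML block only when every tag in it is an inline-level element. The `INLINE_TAGS` set and the function `breaks_apply_to_html_block` are assumptions of this sketch, not something taken from the spec, from markdown.it, or from the actual Discourse fix.

```python
import re

# Assumed, deliberately incomplete list of inline-level tag names.
INLINE_TAGS = {"a", "b", "em", "i", "img", "span", "strong"}

# Matches the name of an opening or closing tag; attributes are ignored.
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9-]*)")


def breaks_apply_to_html_block(html_block: str) -> bool:
    """Return True if the block contains tags and all of them are inline-level."""
    names = [match.group(1).lower() for match in TAG_NAME.finditer(html_block)]
    return bool(names) and all(name in INLINE_TAGS for name in names)


print(breaks_apply_to_html_block('<img src="x">'))
print(breaks_apply_to_html_block("<p class='x'>\ntest\n</p>"))
```

With a rule like this the lone image would get break handling and the explicit paragraph would be left alone, but whether tag names are the right criterion is exactly the gap I think the spec should address.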


#12

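I’d like to move away from the specifics of line breaks, newlines and rendering and point out that the very first remark in this thread actually hints right at the core of the problem:

Given that we still have no way of specifying image size in CommonMark, we are stuck using <img> tags in Discourse.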

This does apply, of course, to the vast majority of attributes of the vast majority of elements: Given that we still have no way of specifying arbitrary attributes in CommonMark, we are stuck using explicit tags.

For example (technically pretty much the same as my <NOTE> example above), consider the lang attribute to specify the natural language used in the content of some element, say a paragraph:

This paragraph is *solely* an example.

<P lang="de">Auch das ist *nur* ein Beispiel.</P>

I see no way to produce (from some variant of this CommonMark input) the desired output:

<P>This paragraph is <EM>solely</EM> an example.</P>

<P lang="de">Auch das ist <EM>nur</EM> ein Beispiel.</P>

That is, to write the second paragraph element in a way that

  1. it is not wrapped in yet another <P> element, but
  2. its content is still processed as “normal” inline CommonMark text, just as in a regular paragraph.

#13

See the breaks option. That’s for newlines inside paragraphs. Those can be converted to a space or <br>, according to the spec.

That’s not applied to HTML image “blocks”, because those are NOT inside a paragraph.

There were a lot of discussions about raw HTML (for example, copying YouTube HTML was not working right in earlier spec versions). In theory, it would be nice to solve your case too, but that should be forwarded to the CM forum. I think it may be considered a “missed case”, but I’m not sure.

The other way is to push for a method in the spec to define image size via Markdown. Then none of these kludges would be needed at all.


#14

Okay, I see: A softbreak element can—by definition—never occur in the AST as a direct child of document (because it is explicitly categorized as %inline, whereas document contains only %block elements).
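That constraint can be written down as a small Python check. The node names follow my reading of the CommonMark DTD and may be incomplete, and `valid_document_children` is a hypothetical helper, not part of any reference implementation.

```python
from typing import Sequence

# Node categories as I read them from the CommonMark DTD (assumption: may be incomplete).
BLOCK_NODES = {
    "block_quote", "list", "code_block", "paragraph",
    "heading", "thematic_break", "html_block", "custom_block",
}
INLINE_NODES = {
    "text", "softbreak", "linebreak", "code", "emph", "strong",
    "link", "image", "html_inline", "custom_inline",
}


def valid_document_children(children: Sequence[str]) -> bool:
    """A document may contain only block-level nodes as direct children."""
    return all(child in BLOCK_NODES for child in children)


print(valid_document_children(["html_block", "html_block"]))
print(valid_document_children(["html_block", "softbreak", "html_block"]))
```

So two freestanding images are two html_block siblings, and there is simply no slot between them where a softbreak could live.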

Indeed. But this strict separation (in CommonMark) between inline and block elements clashes head-on with the HTML5 element structure, where flow content (as in the content model for <BODY>) is a wild mixture of elements, from <EM> (clearly inline) to <BLOCKQUOTE> (clearly block). In previous versions of HTML, this problem existed too, but less prominently, eg for <DD>, <DIV> and <LI>—elements which Markdown “avoids” or already treats specially.