Markup in alt and title

pornelski · September 4, 2014, 12:10am

Examples in the spec seem to suggest inserting HTML markup into HTML’s alt and title attributes.

![oh *no*, tags!](img "_angle_ _bracket_ _attack_")

I think that’s a mistake. HTML defines these attributes to contain plain text, so a screen reader will announce ![*this*]() as some variation of:

less than em greater than this less than slash em greater than

And ![](img "*this*") will render as <em>this</em> in the tooltip (where tooltips are supported).

This gives readers text that is completely different from what the author wrote. Nobody would want that; it merely exposes an implementation quirk.

I think formatting should stay allowed in Markdown’s image alt/title syntax (so conversion straight to something more expressive than HTML can take advantage of that), but Markdown-to-HTML converters must be required to handle the limitations of HTML attributes gracefully, without throwing unparsed angle brackets at users.

The simplest rule may be: parse alt and title as Markdown, strip all HTML tags, HTML-escape the result where necessary:

 ![*foo*]() → <img alt="foo">
 ![\*foo\\*]() → <img alt="*foo\*">
 ![& &amp; "]() → <img alt="&amp; &amp; &quot;">
 ![<div>1 < 2</div>]() → <img alt="1 &lt; 2">
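
The last two steps of the rule above (strip tags, then escape) can be sketched in Python. This is only an illustration under stated assumptions: it starts from the HTML that an inline Markdown renderer has already produced for the alt or title text, and the names `_TextOnly` and `attr_text_from_inline_html` are hypothetical, not taken from any implementation.

```python
import html
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and ignores every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def attr_text_from_inline_html(inline_html):
    """Strip all tags from rendered inline HTML, then escape for an attribute."""
    parser = _TextOnly()
    parser.feed(inline_html)
    parser.close()  # flush any text still buffered by the parser
    return html.escape("".join(parser.chunks), quote=True)


if __name__ == "__main__":
    print(attr_text_from_inline_html("<em>foo</em>"))
    print(attr_text_from_inline_html("<div>1 &lt; 2</div>"))
```

Decoding entities before re-escaping keeps an already-escaped `&amp;` from being escaped a second time.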

vitaly · September 30, 2014, 9:51pm

I think the right behaviour would be not to parse markup in image alt at all. The current spec has 5 tests for this case. IMHO, such behaviour should not be pushed onto other parsers. It would be nice to remove those tests from the spec. Also, I don’t like the idea of doing parse + cleanup, because it overcomplicates things.

It looks like the reference implementation has peculiarities that were promoted to the spec without strong reasons. Here is our JS implementation to compare. It now passes all tests except these.

Tests:

.
![foo *bar*]

[foo *bar*]: train.jpg "train & tracks"
.
<p><img src="train.jpg" alt="foo &lt;em&gt;bar&lt;/em&gt;" title="train &amp; tracks" /></p>
.

could be:

<p><img src="train.jpg" alt="foo *bar*" title="train &amp; tracks" /></p>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.
![foo *bar*][]

[foo *bar*]: train.jpg "train & tracks"
.
<p><img src="train.jpg" alt="foo &lt;em&gt;bar&lt;/em&gt;" title="train &amp; tracks" /></p>
.

could be:

<p><img src="train.jpg" alt="foo *bar*" title="train &amp; tracks" /></p>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.
![foo *bar*][foobar]

[FOOBAR]: train.jpg "train & tracks"
.
<p><img src="train.jpg" alt="foo &lt;em&gt;bar&lt;/em&gt;" title="train &amp; tracks" /></p>
.

could be:

<p><img src="train.jpg" alt="foo *bar*" title="train &amp; tracks" /></p>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.
![*foo* bar][]

[*foo* bar]: /url "title"
.
<p><img src="/url" alt="&lt;em&gt;foo&lt;/em&gt; bar" title="title" /></p>
.

could be:

<p><img src="/url" alt="*foo* bar" title="title" /></p>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.
![*foo* bar]

[*foo* bar]: /url "title"
.
<p><img src="/url" alt="&lt;em&gt;foo&lt;/em&gt; bar" title="title" /></p>
.

could be:

<p><img src="/url" alt="*foo* bar" title="title" /></p>

Knagis · November 3, 2014, 12:40pm

Since this issue has been mentioned in other topics as well (including https://github.com/jgm/CommonMark/issues/145), I thought I would bump it up and add my vote for the scenario where the parser parses the inlines in alt/title but the HTML renderer outputs plain text.

The reasoning is that it seems very likely that people would want to create renderers that would output these attributes like this:

<h3>Image title</h3>
<img src=".." alt=".." title=".." />
<caption>Image description</caption>

Another argument is that the writer should not need to know that, for example, emphasis is not supported in the image title and that in this context *foo* will be output literally. If the parser can output warnings, it should do so; otherwise it should output the HTML according to the principle of least surprise.

vitaly · November 3, 2014, 12:57pm

That brings us back to the question of “what is Markdown’s soul”. Is it a minimalistic or a universal markup?

If anyone wants markup in a caption that much for a custom renderer, they can make a nested call on the title content. Let’s keep things simple. As far as I understand, the primary target for Markdown is HTML. IMHO it should be correct by default, without sanitizing kludges.

jgm · November 3, 2014, 10:58pm

It would be fairly straightforward to add to the renderer a function renderInlinesPlain, which renders the inlines without any tags or formatting. This would mean that ![*foo*](/url) would be rendered <img src="/url" alt="foo" />, not <img src="/url" alt="*foo*" />. This would give people the flexibility to write a custom renderer that renders the alt text as a formatted caption, while still outputting HTML that is guaranteed to be correct.
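
A rough Python sketch of what such a plain-text renderer could look like. It is not the code from either reference implementation: the `Inline` node type and the functions `render_inlines_plain` and `render_image` are hypothetical stand-ins for whatever inline AST and renderer a parser actually has.

```python
import html
from dataclasses import dataclass, field
from typing import List


@dataclass
class Inline:
    """Hypothetical inline AST node: a kind, optional text, optional children."""
    kind: str  # e.g. "text", "code", "emph", "strong", "softbreak"
    literal: str = ""
    children: List["Inline"] = field(default_factory=list)


def render_inlines_plain(nodes):
    """Concatenate the text content of inline nodes, dropping all formatting."""
    out = []
    for node in nodes:
        if node.kind in ("text", "code"):
            out.append(node.literal)
        elif node.kind in ("softbreak", "linebreak"):
            out.append(" ")  # assumption: line breaks collapse to one space
        else:
            # containers such as emph/strong contribute only their children
            out.append(render_inlines_plain(node.children))
    return "".join(out)


def render_image(src, alt_nodes):
    """Render an <img> whose alt attribute holds escaped plain text only."""
    alt = html.escape(render_inlines_plain(alt_nodes), quote=True)
    return '<img src="%s" alt="%s" />' % (html.escape(src, quote=True), alt)


if __name__ == "__main__":
    # AST for ![*foo*](/url)
    print(render_image("/url", [Inline("emph", children=[Inline("text", "foo")])]))
```

Because the alt text stays a parsed node list until render time, a different renderer remains free to walk the same nodes and emit formatted output instead.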

jgm · November 6, 2014, 7:25pm

I have made the change described above (to the spec and both implementations).

vitaly · November 8, 2014, 9:30am

Thanks. It’s much better now.

Is it possible to remove the requirement to “sanitize” md markup? It looks like an unnecessary overcomplication. Or it could be marked as implementation-dependent, because “nobody cares” about the result.

hobarrera · November 8, 2014, 10:28am

The idea of having a formal spec is that all parsers following it will provide the same output. Having something “implementation dependent” defeats that purpose.

People actually do care, because HTML inside ALTs is actually invalid, so we don’t want invalid HTML being generated.

vitaly · November 8, 2014, 11:44am

Your post is not related to my question.

md sanitize != html sanitize.

pornelski · November 8, 2014, 12:02pm

Thanks. The fix is great.

I think stripping tags in attributes is the correct solution, and allowing markup on the markdown side gives renderers flexibility (in case they want to output e.g. <figure> or longdesc).
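
To illustrate that flexibility, here is a hedged Python sketch in which one parsed alt text feeds both a plain `alt` attribute and a formatted `<figcaption>`. The `Inline` node type, the `TAGS` table, and `render_figure` are all hypothetical; no existing renderer is claimed to work this way.

```python
import html
from dataclasses import dataclass, field
from typing import List


@dataclass
class Inline:
    """Hypothetical inline AST node."""
    kind: str  # "text", "emph" or "strong" in this sketch
    literal: str = ""
    children: List["Inline"] = field(default_factory=list)


TAGS = {"emph": "em", "strong": "strong"}  # node kind -> HTML tag


def to_plain(nodes):
    """Text content only, for use inside an attribute."""
    return "".join(
        n.literal if n.kind == "text" else to_plain(n.children) for n in nodes
    )


def to_html(nodes):
    """Formatted HTML, for use in element content."""
    out = []
    for n in nodes:
        if n.kind == "text":
            out.append(html.escape(n.literal, quote=False))
        else:
            tag = TAGS.get(n.kind)
            inner = to_html(n.children)
            out.append(f"<{tag}>{inner}</{tag}>" if tag else inner)
    return "".join(out)


def render_figure(src, alt_nodes):
    """Plain text goes into alt; the same nodes, formatted, become the caption."""
    alt = html.escape(to_plain(alt_nodes), quote=True)
    return (
        f'<figure><img src="{html.escape(src, quote=True)}" alt="{alt}" />'
        f"<figcaption>{to_html(alt_nodes)}</figcaption></figure>"
    )


if __name__ == "__main__":
    # AST for ![*foo* bar](/url)
    nodes = [Inline("emph", children=[Inline("text", "foo")]), Inline("text", " bar")]
    print(render_figure("/url", nodes))
```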

vitaly · November 9, 2014, 3:42am

IMHO, that’s not very clear until the details are defined in the spec. I see many problems. First, an image is an inline tag, while the alternative examples are blocks. That’s a potential conflict. Second, such markup does not fit well into the AST provided by HTML parsers, because an image is not a container.

My suggestion is to keep things simple (leave all src intact) until the spec better defines all levels of alternate image use (markup attributes, AST, presentation). Freezing the img HTML output format now does not guarantee anything in the end. It just adds unnecessary resyncs for all other implementations.

This can be solved very easily: mark tests as internal / unstable / experimental:

.. <- two points
not mandatory src
..
html
..

This can be moved to mandatory later, when the use cases are finalized.