What is the point of limiting URI schemes in autolinks?

Except possibly for legacy documents that use tags like <svg:svg> to embed SVG directly in their XHTML. HTML5 has done away with this and introduced ‘foreign elements’, and seems to have forbidden all colons in tag names, but namespaces used to be a thing.

(This is really the only reason I can think of. The colon test is pretty solid otherwise.)

I think namespaces are still a thing when it comes to extensibility (see topic Namespacing with CURIEs for my thoughts on extensions).

But core CommonMark doesn’t have namespaces yet. So, limiting this topic to the spec, it seems to me that <tag:foo> and <foo:tag> should be treated consistently. As the spec is currently written, they come out quite differently: as <a href="tag:foo">tag:foo</a> and &lt;foo:tag&gt; respectively.

I hope this inconsistency can be resolved as soon as possible, before extensions are standardized. I think that limiting autolinks to absolute URIs (with the colon), rather than embedding a list of supported schemes, would be the preferable way to achieve such consistency. I’ll put together a pull request with formal changes, referencing this topic for further discussion.

Is CommonMark targeting HTML5 specifically, though? I was assuming that was the case, and that there therefore shouldn’t be any conflicts, since I believe HTML5 made it invalid. But yes, otherwise, namespaces do exist in XHTML.

We should target HTML5, since that is the lingua franca of the net now and for the near future.

But it shouldn’t really matter if we use an intermediate representation between sourceDocument and htmlDocument, via an Abstract Syntax Tree (serialized as either JSON or XML).

I think the list of recognized URI schemes is just the one from the IANA, at least it looks like a 1:1 copy from it.

RFC3986 specifies how a scheme is built (scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )), and I would appreciate it if this were used instead. Mark* is often used in internal environments with very specific setups, which commonly include non-standard URI schemes.
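
As a minimal sketch, the scheme production quoted above translates directly into a regular expression. The names SCHEME_RE and is_valid_scheme are illustrative only; they do not come from the spec or from any CommonMark implementation.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, section 3.1)
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(scheme: str) -> bool:
    """Return True if `scheme` matches the RFC 3986 scheme grammar."""
    return SCHEME_RE.fullmatch(scheme) is not None


if __name__ == "__main__":
    # "x-internal.app+v2" is a made-up example of a non-standard internal scheme.
    for candidate in ["http", "heck", "x-internal.app+v2", "3d", "my_scheme", ""]:
        print(f"{candidate!r}: {is_valid_scheme(candidate)}")
```

A check like this accepts any syntactically valid scheme, registered or not, which is what an internal setup with custom schemes would need.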

There may be some confusion here, I’m discussing the CommonMark input specification, not the target language (e.g. HTML 5 vs XHTML). My question is really independent of that.

Specifically, I asked about the 1.0 autolink syntax, which sort of looks like HTML due to the use of angle brackets, but autolinks are not HTML (or XML).

The confusion arises because the HTML block and raw HTML features also use angle-bracket notation. But these features, which do use HTML as input, are distinct, and in the current spec they trigger on tag name definitions that exclude colons.

Note: Within a CommonMark HTML block you can already use notation like <svg:svg> to your heart’s content. You just can’t start the block with <svg:svg>. Instead, just wrap it in one of the supported HTML block tags like <div> (or start it with an HTML comment), eliminate any blank lines, and you are set to go!

But that’s not my point.

The point is that at the beginning of, or inside, the text of a paragraph, input <svg:svg> wouldn’t be recognized as any of the above under the 1.0 specification. The reference implementation would translate it to HTML &lt;svg:svg&gt;. But unless I’ve missed something, that’s being done based on an unwritten rule.

I’m not unhappy that CommonMark 5<7 renders as HTML 5&lt;7 – that’s a good thing that deserves an explicit rule in the spec.

It’s the boundary conditions between the unwritten rule and the autolink syntax in the 1.0 spec that concern me. The current boundary is implied by the list of schemes, adding complexity to the spec and hurting some extension scenarios.

Yes, exactly my point. I’m drafting a proposed change to the URI autolink section that would use this regex:

/<([A-Za-z][-+.A-Za-z0-9]*):([^\s<>\x00-\x1A]*)>/

This follows the RFC3986 definition of scheme exactly, but is looser in enforcing all the rules about the trailing part. I chose the [^\s<>\x00-\x1A]* part carefully to allow internationalized resource identifiers (IRIs) in implementations that support characters beyond the ASCII range. As a result, this regex doesn’t try to enforce the semantic rules about using brackets only in IPv6 host names, but it should match all valid RFC3986 absolute URIs, as well as W3C Compact URIs (CURIEs).

One consequence is that this change would change output from one of the test cases:

Example 403, input

<heck://bing.bong>

Version 1.0 output:

<p>&lt;heck://bing.bong&gt;</p>

Proposed new output:

<p><a href="heck://bing.bong">heck://bing.bong</a></p>
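
To make the effect of the proposed regex concrete, here is a rough sketch in Python. The render_inline function is a hypothetical stand-in for inline rendering, not the reference implementation: it only turns matches of the regex into links and HTML-escapes everything else, and it skips details such as percent-encoding the href.

```python
import html
import re

# The regex from the proposal above, written as a Python pattern.
AUTOLINK_RE = re.compile(r"<([A-Za-z][-+.A-Za-z0-9]*):([^\s<>\x00-\x1A]*)>")


def render_inline(text: str) -> str:
    """Turn autolinks into <a> elements and escape all other text (simplified)."""
    out = []
    pos = 0
    for m in AUTOLINK_RE.finditer(text):
        # Escape the plain text that precedes this autolink.
        out.append(html.escape(text[pos:m.start()], quote=False))
        uri = html.escape(f"{m.group(1)}:{m.group(2)}", quote=True)
        out.append(f'<a href="{uri}">{uri}</a>')
        pos = m.end()
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)


if __name__ == "__main__":
    for source in ["<heck://bing.bong>", "<tag:foo>", "<foo:tag>",
                   "<http://foo.bar/baz bim>", "5<7"]:
        print(f"{source!r} -> {render_inline(source)}")
```

Under this sketch <tag:foo> and <foo:tag> are treated the same way, while input containing a space, or a bare 5<7, is still escaped as literal text.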

The protocol whitelist is disastrous in my opinion. If at all possible that particular part should be dropped.

My top priority for this topic: whatever comes out of the auto-linking discussion, there must be an explicit way to generate links to an arbitrary URI. Beyond the inherent absurdity of a text format being aware that spotify and secondlife exist, artificially limiting future protocols would be, in my opinion, very short-sighted.

Burt’s most recent proposal seems to address my concerns handily.

On the ‘missing scheme’ thread, @jgm said:

The heuristic I would propose is that if the contents of the angle-brackets contains a colon and contains no whitespace, it should be treated as a URI and a hyperlink generated, rather than treating it as a tag.

A namespace-qualified tag like <m:math> doesn’t have much meaning standing alone, and declaring the prefix requires spaces inside the angle brackets. This is reflected in MathML documentation; for example, the MathJax documentation says:

Also note that, unless you are using XHTML rather than HTML, you should not include a namespace prefix for your <math> tags; for example, you should not use <m:math> except in a file where you have tied the m namespace to the MathML DTD by adding the xmlns:m=“MathML Namespace” attribute to your file’s <html> tag.

So for <m:math> to mean something alone, it needs a declaration. The namespace-prefix declaration mechanism I suggested in the Compact URIs thread could be used to give a prefix like m an extra semantic kick that caused it to be passed through unaltered. E.g. a prefix declaration like this might be used in a CommonMark document:

<? prefix m: http://www.w3.org/1998/Math/MathML !verbatim ?>

+++ Burt Harris [Oct 23 14 00:05 ]:

The heuristic I would propose is that if the contents of the angle-brackets contains a colon and contains no whitespace, it should be treated as a URI and a hyperlink generated, rather than treating it as a tag.

A namespace-qualified tag like <m:math> doesn’t have much meaning standing alone, and declaring the prefix requires spaces inside the angle brackets. This is reflected in MathML documentation; for example, the MathJax documentation says:

Also note that, unless you are using XHTML rather than HTML, you should not include a namespace prefix for your <math> tags; for example, you should not use <m:math> except in a file where you have tied the m namespace to the MathML DTD by adding the xmlns:m=“MathML Namespace” attribute to your file’s <html> tag.

Yes, but if it’s added to the <html> tag, it needn’t be repeated on the <m:math> tags. That’s the problem I see. Of course, this could be addressed by a custom prefix declaration of the sort you describe, but it remains the case that regular HTML can have tags that would be wrongly treated as URLs by your heuristic.
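
The heuristic being debated (a colon and no whitespace between the angle brackets) is small enough to sketch directly. The function name looks_like_uri is made up for illustration; this is a reading of the heuristic as quoted, not code from any implementation.

```python
def looks_like_uri(inner: str) -> bool:
    """Apply the quoted heuristic to the text between '<' and '>'.

    The content is treated as a URI if it contains a colon and no whitespace.
    """
    return ":" in inner and not any(ch.isspace() for ch in inner)


if __name__ == "__main__":
    samples = [
        "mailto:me@example.com",
        "m:math",  # prefix declared elsewhere, e.g. on the <html> tag
        'm:math xmlns:m="http://www.w3.org/1998/Math/MathML"',
        "div",
    ]
    for inner in samples:
        print(f"<{inner}> -> {looks_like_uri(inner)}")
```

The second sample is the contested case: a bare <m:math> whose prefix was declared on an enclosing element passes the test and would be rendered as a link.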

My point is that <m:math> is not regular HTML; it is only valid in XHTML, which is a separate language from HTML per the W3C’s definitions.

The HTML 5 standard has changed the direction of HTML, heading away from the XHTML bent of its predecessor, and modern browsers now generally support custom tags without namespaces. So I suggest it may be quite uncommon to see XHTML used in the future, especially in contexts likely to be embedded in a CommonMark document. Perhaps we could adjust the heuristic somewhat inside HTML blocks, which might contain namespace declarations, but overall I don’t see support for XHTML tags as worth much additional complication.

yea, pretty much anything that is an official link would take the form of somethingHere://somethingThere with :// . I doubt m:math is going to be a trend with urls.

+++ mofosyne [Oct 24 14 02:49 ]:

yea, pretty much anything that is an official link would take the form of somethingHere://somethingThere with :// . I doubt m:math is going to be a trend with urls.

You don’t always have //. For example, mailto:me@example.com.

I would prefer to allow arbitrary URI schemes, too.

  • It is bound to happen that new, popular schemes will come up.
  • Custom URI schemes are useful when extending CommonMark in some scenarios.
  • Even the whitelist doesn’t guard reliably against mistaking an XML QName for a URI.

What if we allowed arbitrary schemes, provided they are at least two characters long, ASCII, and start with a letter?

That would capture all the existing schemes, while still allowing you to use one-letter XML namespaces. (That would suffice for the application I have in mind, mathml in epubs.)
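
A rough sketch of that rule, assuming the scheme otherwise follows the RFC 3986 character set and the rest of the autolink simply excludes whitespace and angle brackets. The pattern and the name is_autolink are hypothetical.

```python
import re

# Assumed reading of the proposal: the scheme is ASCII, starts with a letter,
# and is at least two characters long (note the "+" instead of "*").
AUTOLINK_MIN2_RE = re.compile(r"<([A-Za-z][A-Za-z0-9+.-]+):([^\s<>]*)>")


def is_autolink(span: str) -> bool:
    """Return True if the whole span, brackets included, is an autolink."""
    return AUTOLINK_MIN2_RE.fullmatch(span) is not None


if __name__ == "__main__":
    for span in ["<m:math>", "<mailto:me@example.com>", "<svg:svg>",
                 "<http://foo.bar/baz bim>"]:
        print(f"{span} -> {is_autolink(span)}")
```

One-letter prefixes such as m: stay out of the autolink rule, but multi-letter prefixes like <svg:svg> or <math:mrow> would still be read as links under this sketch.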

I think that, in practice, the current whitelist is not of much use, because it allows the unsafe javascript/vbscript schemes. Removing the checks will not make things worse, but will make things simpler and more flexible.

+++ Vitaly Puzrin [Jan 16 15 10:38 ]:

I think that, in practice, the current whitelist is not of much use, because it allows the unsafe javascript/vbscript schemes. Removing the checks will not make things worse, but will make things simpler and more flexible.

As you can see from the thread above, the intent of the whitelist was not to help with security, but to allow tags with XML qnames, like <math:mrow>. So that is the primary issue.

Hm… I can understand XML output somehow (for advanced structure validation), but input… isn’t HTML5 enough? I’ve seen only two related mentions: math and EPUB. Math does not need QNames in HTML5. I have no experience with EPUB, but it looks like EPUB3 is OK with HTML5, and will require additional cryptic converters anyway.

It seems HTML element QNames can be dropped safely.

Taking into account the “no SPACE in auto-link” requirement, that is: if this requirement still stands, and thus the example 552 input is not treated as an auto-link:

Spaces are not allowed in autolinks:

Example 552

<http://foo.bar/baz·bim>

Then it seems to me that there is no need for a hard-coded list of URI schemes to distinguish URIs from XML tags starting with a QName [1]: you can always add SPACE in any (start, end, or “empty-element”) tag that contains a GI and thus prevent interpretation of, say, <m:math> as an auto-link by writing <m:math⎵> instead. I think inserting this SPACE into “empty element” tags like <m:pi⎵/> is or was even recommended (to help user agents cope with XML).

______

  1. Actually, element type and attribute names containing COLON were already allowed in W3C HTML 4, but just not of much use there; so I would say that CommonMark should be able to handle “raw HTML” (and particularly XML markup) using such names in a more general way than just allowing one letter in front of the (first) “:”.

Hmm, well, per the title of the topic itself, the most relevant bit of this topic is only a “should” 1.0 issue, not a “must” 1.0 blocking issue –

Remove hard-coded list of protocols for autolinks? (SHOULD)

And I think the answer is, yes, we should remove the hard-coded list of protocols. See the last post by @jgm up above.