What is the point of limiting URI schemes in autolinks?

Burt_Harris · September 10, 2014, 11:36pm

In http://jgm.github.io/stmd/spec.html#uri-autolink, there is a lengthy list of URI schemes recognized by the parser inside angle brackets as an auto-link.

While I don’t see anything wrong with that list, could someone clarify the intent behind limiting autolink to these schemes?

rwzy · September 11, 2014, 12:56am

Good question, I’m guessing because otherwise the intent of the writer would be that it’s not a link, and/or that it might somehow conflict with writing normal html tags?

However, there is at least one protocol/scheme that I use in Markdown documents that is not listed there (from a quick look; I haven’t checked every scheme I use), and autolinking libraries probably will not recognise it, which is unfortunate. I suspect many others also have at least one unlisted scheme that they would want to use.

The only way I can see that being fixed is either having those schemes included in the list, which might be impractical, or, if stmd is supported by pandoc in the future, using a pandoc filter. (Or another pre/post-processor of some sort but pandoc filter is probably the cleanest and easiest because we can work with an AST directly.)

Burt_Harris · September 11, 2014, 1:12am

Since an autolink requires an absolute url including a colon, and normal HTML tag names don’t include colons, it doesn’t seem like there would be risk of conflict.

Even if there were reason for allowing namespace-qualified pseudo-HTML tags, it would be possible to disambiguate by including white space in the tag (since white space isn’t generally permitted in a URL.)

rwzy · September 11, 2014, 2:02am

Agreed.

If that were the case, which I doubt since it would be very strange, I don’t think requiring white space in such a tag to distinguish it from a link is acceptable. I think whitelisting would be a better approach in that case.

However again, I cannot see why it is not already possible to differentiate between a valid html tag, none of which allow a colon, and an absolute uri, which requires a colon. And, even if point 2 here was the case (which it isn’t currently), where the colon is not required, that could still be differentiated from a html tag by the required dot or hash symbol.

Zegnat · September 11, 2014, 2:05pm

Except possible legacy documents that use tags like <svg:svg> for embedding SVG straight in their XHTML. HTML5 has done away with this and introduced ‘foreign elements’, and seems to have forbidden all colons in tag names, but namespaces used to be a thing.

(This is really the only reason I can think of. The colon test is pretty solid otherwise.)

Burt_Harris · September 11, 2014, 5:58pm

I think namespaces are still a thing when it comes to extensibility (see topic Namespacing with CURIEs for my thoughts on extensions).

But core CommonMark doesn’t have namespaces yet, so, limiting this topic to the spec, it seems to me that <tag:foo> and <foo:tag> should be treated consistently. As the spec is currently written, they come out quite differently, as <a href="tag:foo">tag:foo</a> and &lt;foo:tag&gt; respectively.

I hope this inconsistency can be resolved as soon as possible, before extensions are standardized. I think limiting autolinks to absolute URIs (with the colon), rather than embedding a list of supported schemes, would be the preferable way to achieve such consistency. I’ll put together a pull request with formal changes, referencing this topic for further discussion.

rwzy · September 12, 2014, 2:27am

Is CommonMark targeting HTML5 specifically, though? I was assuming so, and that there therefore shouldn’t be any conflicts, since I believe HTML5 made namespaced tags invalid. But yes, otherwise namespaces do exist in XHTML.

mofosyne · September 12, 2014, 2:33am

We should target HTML5, since that is the lingua franca of the net now and for the near future.

But it shouldn’t really matter, if we use an intermediate representation between sourceDocument and htmlDocument, via an Abstract Syntax Tree (via either json or XML).

gnrp · September 12, 2014, 10:57am

I think the list of recognized URI schemes is just the one from the IANA, at least it looks like a 1:1 copy from it.

RFC3986 specifies how a scheme is built (scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )), and I would also appreciate if this would be used. Mark* is often used in internal environments where you have some very specific setups which commonly include non-standard URI schemes.
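
As a rough illustration of that grammar, here is a minimal Python sketch that checks a candidate scheme against the RFC 3986 rule quoted above. The names `SCHEME_RE` and `is_valid_scheme` are made up for this example and do not come from the spec or any implementation.

```python
import re

# RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# SCHEME_RE and is_valid_scheme are hypothetical names for this sketch.
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(scheme: str) -> bool:
    """Return True if the whole string matches the RFC 3986 scheme grammar."""
    return SCHEME_RE.fullmatch(scheme) is not None


if __name__ == "__main__":
    # A non-standard internal scheme passes; a scheme starting with a digit does not.
    for candidate in ["http", "x-internal+v2", "2fast", "my_scheme"]:
        print(candidate, is_valid_scheme(candidate))
```

Under this rule any internal scheme is accepted as long as it starts with a letter and uses only letters, digits, "+", "-" and ".".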

Burt_Harris · September 12, 2014, 10:53pm

There may be some confusion here, I’m discussing the CommonMark input specification, not the target language (e.g. HTML 5 vs XHTML). My question is really independent of that.

Specifically, I asked about the 1.0 autolink syntax, which sort of looks like HTML due to the use of angle brackets. But autolinks are not HTML (or XML).

The confusion arises because the HTML block and raw HTML features also use angle-bracket notation. But these features, which do take HTML as input, are distinct, and in the current spec they trigger on tag name definitions that exclude colons.

Note: Within a CommonMark HTML block you can already use notation like <svg:svg> to your heart’s content. You just can’t start the block with <svg:svg>. Instead, wrap it in one of the supported HTML block tags like <div> (or start it with an HTML comment), eliminate any blank lines, and you are set to go!

But that’s not my point.

The point is that at the beginning of, or inside, the text of a paragraph, the input <svg:svg> wouldn’t be recognized as any of the above under the 1.0 specification. The reference implementation would translate it to the HTML &lt;svg:svg&gt;. But unless I’ve missed something, that’s being done based on an unwritten rule.

I’m not unhappy that CommonMark 5<7 renders as HTML 5&lt;7; that’s a good thing, and it deserves an explicit rule in the spec.

It’s the boundary conditions between the unwritten rule and the autolink syntax in the 1.0 spec that concern me. The current boundary is implied by the list of schemes, which adds complexity to the spec and hurts some extension scenarios.

Burt_Harris · September 12, 2014, 11:06pm

Yes, exactly my point. I’m drafting a proposed change to the URI autolink section that would use this regex:

/<([A-Za-z][-+.A-Za-z0-9]*):([^\s<>\x00-\x1A]*)>/

This follows the RFC 3986 definition of scheme exactly, but is looser in enforcing the rules about the trailing part. I chose the [^\s<>\x00-\x1A]* part carefully to allow internationalized resource identifiers (IRIs) in implementations that support characters beyond the ASCII range. As a result, this regex doesn’t try to enforce the semantic rules about using brackets only in IPv6 host names, but it should match all valid RFC 3986 absolute URLs, as well as W3C Compact URIs (CURIEs).

One consequence is that this change would change output from one of the test cases:

Example 403, input

<heck://bing.bong>

Version 1.0 output:

<p>&lt;heck://bing.bong&gt;</p>

Proposed new output:

<p><a href="heck://bing.bong">heck://bing.bong</a></p>
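
For anyone who wants to try the proposed regex against inputs like Example 403, here is a small Python sketch. It is only an approximation of what an implementation might do: `AUTOLINK_RE` is the regex above transcribed into Python syntax, `autolink_to_html` is a hypothetical helper name, and the sketch skips the percent-encoding a real renderer would apply to the href.

```python
import re
from html import escape
from typing import Optional

# The proposed regex from this post, written as a Python raw string.
AUTOLINK_RE = re.compile(r"<([A-Za-z][-+.A-Za-z0-9]*):([^\s<>\x00-\x1A]*)>")


def autolink_to_html(text: str) -> Optional[str]:
    """Return an <a> element if text is a complete autolink, else None.

    Hypothetical helper: HTML-escapes the URI but does not
    percent-encode it, unlike a full renderer.
    """
    match = AUTOLINK_RE.fullmatch(text)
    if match is None:
        return None
    uri = f"{match.group(1)}:{match.group(2)}"
    return f'<a href="{escape(uri)}">{escape(uri)}</a>'


if __name__ == "__main__":
    print(autolink_to_html("<heck://bing.bong>"))  # scheme not in the 1.0 list
    print(autolink_to_html("<foo bar:baz>"))       # whitespace, so no match
    print(autolink_to_html("<div>"))               # no colon, so no match
```

Note that an input such as <m:math> also matches this regex, which is the namespace-prefix case discussed elsewhere in this topic.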

dotsConnected · September 26, 2014, 4:20pm

The protocol whitelist is disastrous in my opinion. If at all possible that particular part should be dropped.

My top priority for this topic: Whatever comes out of the auto-linking discussion, there must be an explicit way to generate links to an arbitrary URI. Beyond the inherent absurdity of a text format being aware that spotify and secondlife exist, artificially limiting future protocols would be, in my opinion, very short-sighted.

Burt’s most recent proposal seems to address my concerns handily.

Burt_Harris · October 22, 2014, 11:53pm

On the ‘missing scheme’ thread, @jgm said:

The heuristic I would propose is that if the contents of the angle-brackets contains a colon and contains no whitespace, it should be treated as a URI and a hyperlink generated, rather than treating it as a tag.

A namespace-qualified tag like <m:math> doesn’t have much meaning standing alone, and declaring the prefix requires spaces inside the angle brackets. This is reflected in MathML documentation; for example, the MathJax documentation says:

Also note that, unless you are using XHTML rather than HTML, you should not include a namespace prefix for your <math> tags; for example, you should not use <m:math> except in a file where you have tied the m namespace to the MathML DTD by adding the xmlns:m=“MathML Namespace” attribute to your file’s <html> tag.

So for <m:math> to mean something alone, it needs a declaration. The namespace-prefix declaration mechanism I suggested in the Compact URIs thread could be used to give a prefix like m an extra semantic kick that caused it to be passed through unaltered. E.g. a prefix declaration like this might be used in a CommonMark document:

<? prefix m: http://www.w3.org/1998/Math/MathML !verbatim ?>

jgm · October 23, 2014, 4:59pm

+++ Burt Harris [Oct 23 14 00:05 ]:

The heuristic I would propose is that if the contents of the angle-brackets contains a colon and contains no whitespace, it should be treated as a URI and a hyperlink generated, rather than treating it as a tag.

A namespace-qualified tag like <m:math> doesn’t have much meaning standing alone, and declaring the prefix requires spaces inside the angle brackets. This is reflected in MathML documentation; for example, the MathJax documentation says:

Also note that, unless you are using XHTML rather than HTML, you should not include a namespace prefix for your <math> tags; for example, you should not use <m:math> except in a file where you have tied the m namespace to the MathML DTD by adding the xmlns:m=“MathML Namespace” attribute to your file’s <html> tag.

Yes, but if it’s added to the <html> tag, it needn’t be repeated on the <m:math> tags. That’s the problem I see. Of course, this could be addressed by a custom prefix declaration of the sort you describe, but it remains the case that regular HTML can have tags that would be wrongly treated as URLs by your heuristic.
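
To make the concern concrete, here is a minimal Python sketch of the colon-and-no-whitespace heuristic as quoted above. The function name `looks_like_uri` is invented for this illustration, and it takes the text between the angle brackets as its argument.

```python
def looks_like_uri(inner: str) -> bool:
    """Heuristic from this thread: a colon and no whitespace means URI.

    `inner` is the text between "<" and ">"; the name is hypothetical.
    """
    has_colon = ":" in inner
    has_whitespace = any(ch.isspace() for ch in inner)
    return has_colon and not has_whitespace


if __name__ == "__main__":
    samples = [
        "mailto:me@example.com",                                  # a URI
        "m:math",                                                 # a prefixed tag
        'm:math xmlns:m="http://www.w3.org/1998/Math/MathML"',    # tag with declaration
    ]
    for sample in samples:
        print(repr(sample), looks_like_uri(sample))
```

A bare m:math is classified as a URI here, while the same tag carrying its own xmlns:m declaration is not, because the declaration introduces whitespace.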

Burt_Harris · October 24, 2014, 12:11am

My point is that <m:math> is not regular HTML; it is only valid in XHTML, which is a separate language from HTML per the W3C’s definitions.

The HTML 5 standard has changed the direction of HTML, heading away from the XHTML bent of its predecessor, and modern browsers now generally support custom tags without namespaces. So I suggest it may be quite uncommon to see XHTML used in the future, especially in contexts likely to be embedded in a CommonMark document. Perhaps we could adjust the heuristic somewhat inside HTML blocks, which might contain namespace declarations, but overall I don’t see support for XHTML tags as worth much additional complication.

mofosyne · October 24, 2014, 2:38am

yea, pretty much anything that is an official link would take the form of somethingHere://somethingThere with :// . I doubt m:math is going to be a trend with urls.

jgm · October 24, 2014, 4:18am

+++ mofosyne [Oct 24 14 02:49 ]:

yea, pretty much anything that is an official link would take the form of somethingHere://somethingThere with :// . I doubt m:math is going to be a trend with urls.

You don’t always have //. For example, mailto:me@example.com.

nwellnhof · January 12, 2015, 1:47am

I would prefer to allow arbitrary URI schemes, too.

It is bound to happen that new, popular schemes will come up.
Custom URI schemes are useful when extending CommonMark in some scenarios.
Even the whitelist doesn’t guard reliably against mistaking an XML QName for a URI.

jgm · January 15, 2015, 7:40am

What if we allowed arbitrary schemes, provided they are at least two characters long, ASCII, and start with a letter?

That would capture all the existing schemes, while still allowing you to use one-letter XML namespaces. (That would suffice for the application I have in mind, mathml in epubs.)
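
A rough Python sketch of that rule, for illustration only. It assumes the characters after the first letter follow the RFC 3986 scheme grammar and that the no-whitespace condition from earlier in the thread still applies; neither is stated in this post, and the names `TWO_CHAR_SCHEME_RE` and `is_autolink_candidate` are hypothetical.

```python
import re

# Assumption: RFC 3986 scheme characters, but at least two of them in total.
TWO_CHAR_SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]+")


def is_autolink_candidate(inner: str) -> bool:
    """Check the text between "<" and ">" against the two-character rule."""
    scheme, colon, _rest = inner.partition(":")
    if not colon:
        return False  # no colon, so no scheme at all
    if any(ch.isspace() for ch in inner):
        return False  # assumed: whitespace still rules out an autolink
    return TWO_CHAR_SCHEME_RE.fullmatch(scheme) is not None


if __name__ == "__main__":
    for sample in ["http://example.com", "mailto:me@example.com", "m:math", "svg:svg"]:
        print(sample, is_autolink_candidate(sample))
```

With this rule a one-letter prefix such as m:math is left alone, but a longer prefix such as svg:svg would still be read as a URI.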

vitaly · January 16, 2015, 10:24am

I think that, in practice, the current whitelist is not much use, because it allows unsafe javascript/vbscript schemes. Removing the checks will not make things worse, and will make things simpler and more flexible.