What is the point of limiting URI schemes in autolinks?

There may be some confusion here: I’m discussing the CommonMark input specification, not the target language (e.g. HTML 5 vs. XHTML). My question is really independent of that.

Specifically, I was asking about the 1.0 autolink syntax, which sort of looks like HTML due to its use of angle brackets, but autolinks are not HTML (or XML).

The confusion arises because the HTML blocks and raw HTML features also use angle-bracket notation. But those features, which do take HTML as input, are distinct, and in the current spec they trigger on tag-name definitions that exclude colons.

Note: Within a CommonMark HTML block you can already use notation like <svg:svg> to your heart’s content. You just can’t start the block with <svg:svg>. Instead, wrap it in one of the supported HTML block tags like <div> (or start it with an HTML comment), eliminate any blank lines, and you are set to go!
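For illustration, something like this passes through untouched because the block starts with <div> and contains no blank lines (the SVG content here is just a made-up placeholder):

    <div>
    <svg:svg xmlns:svg="http://www.w3.org/2000/svg" width="40" height="40">
    <svg:circle cx="20" cy="20" r="15"/>
    </svg:svg>
    </div>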

But that’s not my point.

The point is that at the beginning of, or inside, the text of a paragraph, the input <svg:svg> wouldn’t be recognized as any of the above under the 1.0 specification. The reference implementation would translate it to the HTML &lt;svg:svg&gt;. But unless I’ve missed something, that’s being done based on an unwritten rule.

I’m not unhappy that CommonMark 5<7 renders as HTML 5&lt;7 – that’s a good thing that deserves an explicit rule in the spec.

It’s the boundary conditions between the unwritten rule and the autolink syntax in the 1.0 spec that concern me. The current boundary is implied by the list of schemes, adding complexity to the spec and hurting some extension scenarios.

1 Like

Yes, exactly my point. I’m drafting a proposed change to the URI autolink section that would use this regex:

/<([A-Za-z][-+.A-Za-z0-9]*):([^\s<>\x00-\x1A]*)>/

This follows exactly the RFC 3986 definition of scheme, but is looser about enforcing the rules for the trailing part. I chose the [^\s<>\x00-\x1A]* part carefully to allow Internationalized Resource Identifiers (IRIs) in implementations that support characters beyond the ASCII range. As a result, this regex doesn’t try to enforce the semantic rules about using brackets only in IPv6 host names, but it should match all valid RFC 3986 absolute URIs, as well as W3C Compact URIs (CURIEs).
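To spell out what that buys, here is my reading of the pattern against a few invented inputs:

    <mailto:me@example.com>    matches (a scheme already on today’s list)
    <heck://bing.bong>         matches (arbitrary scheme, no whitelist lookup)
    <geo:48.2,16.4>            matches (no // required after the colon)
    <m:math>                   also matches (the QName question discussed below)
    <http://foo.bar/baz bim>   no match (whitespace in the trailing part)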

One consequence is that this change would alter the output of one of the test cases:

Example 403, input

<heck://bing.bong>

Version 1.0 output:

<p>&lt;heck://bing.bong&gt;</p>

Proposed new output:

<p><a href="heck://bing.bong">heck://bing.bong</a></p>
4 Likes

The protocol whitelist is disastrous in my opinion. If at all possible, that particular part should be dropped.

My top priority for this topic: Whatever comes out of the auto-linking discussion, there must be an explicit way to generate links to an arbitrary URI. Beyond the inherent absurdity of a text format being aware that spotify and secondlife exist, artificially limiting future protocols would be, in my opinion, very short-sighted.

Burt’s most recent proposal seems to address my concerns handily.

1 Like

On the ‘missing scheme’ thread, @jgm said:

The heuristic I would propose is that if the contents of the angle-brackets contains a colon and contains no whitespace, it should be treated as a URI and a hyperlink generated, rather than treating it as a tag.

A namespace-qualified tag like <m:math> doesn’t have much meaning standing alone, and declaring the prefix requires spaces inside the angle brackets. This is reflected in MathML documentation; for example, the MathJax documentation says:

Also note that, unless you are using XHTML rather than HTML, you should not include a namespace prefix for your <math> tags; for example, you should not use <m:math> except in a file where you have tied the m namespace to the MathML DTD by adding the xmlns:m=“MathML Namespace” attribute to your file’s <html> tag.

So for <m:math> to mean something on its own, it needs a declaration. The namespace-prefix declaration mechanism I suggested in the Compact URIs thread could be used to give a prefix like m an extra semantic kick that causes it to be passed through unaltered. For example, a prefix declaration like this might be used in a CommonMark document:

<? prefix m: http://www.w3.org/1998/Math/MathML !verbatim ?>
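Then, later in the same document, body text like this (the MathML fragment is just an invented placeholder) would, under this proposal, have its m:-prefixed tags passed through unaltered rather than turned into autolinks:

    The area is <m:math><m:mi>π</m:mi><m:msup><m:mi>r</m:mi><m:mn>2</m:mn></m:msup></m:math>.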
1 Like

+++ Burt Harris [Oct 23 14 00:05 ]:

The heuristic I would propose is that if the contents of the angle-brackets contains a colon and contains no whitespace, it should be treated as a URI and a hyperlink generated, rather than treating it as a tag.

A namespace-qualified tag like <m:math> doesn’t have much meaning standing alone, and declaring the prefix requires spaces inside the angle brackets. This is reflected in MathML documentation; for example, the MathJax documentation says:

Also note that, unless you are using XHTML rather than HTML, you should not include a namespace prefix for your <math> tags; for example, you should not use <m:math> except in a file where you have tied the m namespace to the MathML DTD by adding the xmlns:m=“MathML Namespace” attribute to your file’s <html> tag.

Yes, but if it’s added to the <html> tag, it needn’t be repeated on the <m:math> tags. That’s the problem I see. Of course, this could be addressed by a custom prefix declaration of the sort you describe, but it remains the case that regular HTML can have tags that would be wrongly treated as URLs by your heuristic.

My point is that <m:math> is not regular HTML; it is only valid in XHTML, which is a separate language from HTML per W3C’s definitions.

The HTML 5 standard has changed the direction of HTML, heading away from the XHTML bent of its predecessor, and modern browsers now generally support custom tags without namespaces. So I suggest it may be quite uncommon to see XHTML used in the future, especially in contexts likely to be embedded into a CommonMark document. Perhaps we could adjust the heuristic somewhat inside HTML blocks, which might contain namespace declarations, but overall I don’t see support for XHTML tags as worth too much additional complication.

2 Likes

Yeah, pretty much anything that is an official link would take the form somethingHere://somethingThere, with ://. I doubt m:math is going to be a trend with URLs.

+++ mofosyne [Oct 24 14 02:49 ]:

Yeah, pretty much anything that is an official link would take the form somethingHere://somethingThere, with ://. I doubt m:math is going to be a trend with URLs.

You don’t always have //. For example, mailto:me@example.com.

I would prefer to allow arbitrary URI schemes, too.

  • New, popular schemes are bound to come up.
  • Custom URI schemes are useful when extending CommonMark in some scenarios.
  • Even the whitelist doesn’t guard reliably against mistaking an XML QName for a URI.

What if we allowed arbitrary schemes, provided they are at least two characters long, ASCII, and start with a letter?

That would capture all the existing schemes, while still allowing you to use one-letter XML namespaces. (That would suffice for the application I have in mind, MathML in EPUBs.)
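To make the boundary concrete, here is how I read that rule against a few invented examples:

    <go:top>           scheme of two letters, ASCII, starts with a letter → autolink
    <spotify:track:1>  → autolink
    <m:math>           one-letter prefix → not an autolink, free for XML namespaces
    <2fa:code>         starts with a digit → not an autolink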

1 Like

I think that, in practice, the current whitelist is not of much use, because it allows the unsafe javascript/vbscript schemes. Removing the checks will not make things worse, but it will make them simpler and more flexible.

+++ Vitaly Puzrin [Jan 16 15 10:38 ]:

I think that, in practice, the current whitelist is not of much use, because it allows the unsafe javascript/vbscript schemes. Removing the checks will not make things worse, but it will make them simpler and more flexible.

As you can see from the thread above, the intent of the whitelist was not to help with security, but to allow tags with XML QNames, like <math:mrow>. So that is the primary issue.

Hm… I can understand XML output somehow (for advanced structural validation), but input… isn’t HTML5 enough? I’ve seen only two related mentions: math and EPUB. Math does not need QNames in HTML5. I have no experience with EPUB, but it looks like EPUB3 is fine with HTML5 and will require additional cryptic converters anyway.

It seems HTML element QNames can be dropped safely.

Taking into account the “no SPACE in auto-link” requirement (that is, assuming this requirement still stands, and thus the Example 552 input is not treated as an auto-link):

Spaces are not allowed in autolinks:

Example 552

<http://foo.bar/baz·bim>

Then it seems to me that there is no need for a hard-coded list of URI schemes to distinguish URIs from XML tags starting with a QName 1): you can always add a SPACE in any (start, end, or “empty-element”) tag that contains a GI, and thus prevent interpretation of, say, <m:math> as an auto-link by writing <m:math⎵> instead; and I think inserting this SPACE into “empty-element” tags, like this: <m:pi⎵/>, is or was even recommended (to help user agents cope with XML).
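Spelled out with plain characters (⎵ above stands for a literal SPACE):

    <m:math>    colon, no whitespace → would be taken as an auto-link candidate
    <m:math >   trailing SPACE → cannot be an auto-link, so it remains available as a tag
    <m:pi />    SPACE before the /> of an empty-element tag → likewise not an auto-link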

______

  1. Actually, element type and attribute names containing COLON were already allowed in W3C HTML 4, but just not of much use there; so I would say that CommonMark should be able to handle “raw HTML” (and particularly XML markup) using such names in a more general way than just allowing one letter in front of the (first) “:”.

Hmm, well, per the title of the topic itself, the most relevant bit of this topic is only a “should” 1.0 issue, not a “must” 1.0 blocking issue –

Remove hard-coded list of protocols for autolinks? (SHOULD)

And I think the answer is, yes, we should remove the hard-coded list of protocols. See the last post by @jgm up above.

Well, yes. So? Is anything but blocking issues off-topic now? [I’d reckon that this topic pertains equally to the blocking issue “Inconsistent handling of spaces in links? (MUST)” anyway.]

[…] I think the answer is, yes, we should remove the hard-coded list of protocols.

That’s what I think, too.

See the last post by @jgm up above.

Yes, I have seen that post—in fact, that was the reason for my commenting: a syntax rule change with the effect of only

allowing you to use one-letter XML namespaces. (That would suffice for the application I have in mind, […])

seems pretty harsh and, as I tried to explain, unneeded, and it would plausibly not suffice for applications other people have in mind. Thus I’d rather not see such a restriction introduced, regardless of whether it’s meant to be the resolution of a SHOULD issue or a MUST issue…

@tin-pot, thanks for the comment, that’s useful.
I think we should make the change I suggested above, or something like it, for 1.0.
It’s not a “must,” but I don’t see a strong argument against making it, and it’s not a difficult change.

1 Like

I’ve made the change and updated the spec and both reference implementations. Please have a look.

Of course, we now allow more things that aren’t valid absolute URIs, e.g. what-the-heck:is!this. I had to switch a test with <localhost:5001>, which previously didn’t count as an auto-link but now does (with localhost interpreted as the scheme!). But I think that’s okay. We were making no real attempt to weed out invalid URIs before (except for invalid schemes).
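Concretely (paraphrasing the revised test rather than quoting it exactly), the input

    <localhost:5001>

now renders as

    <p><a href="localhost:5001">localhost:5001</a></p>

with localhost taken as the scheme.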

1 Like

Hmm, in the dingus I get for this example:

1. This is [hopp](localhost:8080)

2. This is [hopp](localhost:80 80)

3. This is <foobar:dingbat class="invalid">.

4. This is <f:dingbat class="invalid"> 

5. This is <dingbat class="invalid">

for (1.) a link, for (2.) not (the SPACE!); no markup for (3.) or (4.); but (5.) gets passed through as markup (as html_inline in the AST).

Is this the new/correct/intended behavior?

We might want to think about whether we should allow colons in raw HTML tags, to allow 3 and 4.