What is the point of limiting URI schemes in autolinks?

+++ Vitaly Puzrin [Jan 16 15 10:38 ]:

I think that, in practice, the current whitelist is of little use, because it allows unsafe javascript/vbscript schemes. Removing the checks will not make things worse, but will make things simpler and more flexible.

As you can see from the thread above, the intent of the whitelist was not to help with security, but to allow tags with XML qnames, like <math:mrow>. So that is the primary issue.

Hm, I can understand XML output somehow (for advanced structure validation), but for input, isn’t HTML5 enough? I’ve seen only two related mentions: math and epub. Math does not need qnames in HTML5. I have no experience with epub, but it looks like epub3 is OK with HTML5, and it will require additional cryptic converters anyway.

It seems HTML element qnames can be dropped safely.

Taking into account the “no SPACE in auto-link” requirement, that is: if this requirement still stands, and thus the example 552 input is not treated as an auto-link:

Spaces are not allowed in autolinks:

Example 552

<http://foo.bar/baz·bim>

Then it seems to me that there is no need for a hard-coded list of URI schemes to distinguish URIs from XML tags starting with a QName [1]: you can always add a SPACE in any (start, end, or “empty-element”) tag that contains a GI and thus prevent interpretation of, say, <m:math> as an auto-link by writing <m:math⎔> instead. I think inserting this SPACE into “empty-element” tags like <m:pi⎔/> is or was even recommended (to help user agents cope with XML).

______

  1. Actually, element type and attribute names containing COLON were already allowed in W3C HTML 4, but just not of much use there; so I would say that CommonMark should be able to handle “raw HTML” (and particularly XML markup) using such names in a more general way than just allowing one letter in front of the (first) “:”.
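The SPACE argument above can be sketched in a few lines of Python. This is only an illustration of the proposal, not the rule in the spec or in either reference implementation: the function name `classify_angle_span` and its two return labels are made up for this example, and a real SPACE stands in for the ⎔ symbol.

```python
def classify_angle_span(span: str) -> str:
    """Classify a '<...>' span under a hypothetical rule with no scheme list."""
    if not (span.startswith("<") and span.endswith(">")):
        return "not-autolink"
    inner = span[1:-1]
    # Any whitespace rules out an autolink (compare Example 552).
    if any(ch.isspace() for ch in inner):
        return "not-autolink"
    prefix, colon, _rest = inner.partition(":")
    # No hard-coded schemes: any letter-initial ASCII prefix before ":" counts.
    if colon and prefix[:1].isalpha() and prefix.isascii():
        return "autolink"
    return "not-autolink"


print(classify_angle_span("<m:math>"))   # would be read as an autolink
print(classify_angle_span("<m:math >"))  # the added SPACE prevents that
```

Under this sketch, whatever is labelled "not-autolink" would still have to pass the raw HTML rules to be treated as a tag.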

Hmm, well, per the title of the topic itself, the most relevant bit of this topic is only a “should” 1.0 issue, not a “must” 1.0 blocking issue:

Remove hard-coded list of protocols for autolinks? (SHOULD)

And I think the answer is, yes, we should remove the hard-coded list of protocols. See the last post by @jgm up above.

Well, yes. So? Is anything but blocking issues off-topic now? [I’d reckon that this topic is equally pertaining to the blocking issue “Inconsistent handling of spaces in links? (MUST)” anyway.]

[…] I think the answer is, yes, we should remove the hard-coded list of protocols.

That’s what I think, too.

See the last post by @jgm up above.

Yes, I have seen that post, and in fact that was the reason for my comment: a syntax rule change with the effect of only

allowing you to use one-letter XML namespaces. (That would suffice for the application I have in mind, […])

seems pretty harsh and, as I tried to explain, unneeded, and would plausibly not suffice for applications other people have in mind. Thus I’d rather not see such a restriction introduced, regardless of whether it’s meant to resolve a SHOULD issue or a MUST issue.

@tin-pot, thanks for the comment, that’s useful.
I think we should make the change I suggested above, or something like it, for 1.0.
It’s not a “must,” but I don’t see a strong argument against making it, and it’s not a difficult change.

I’ve made the change, updated the spec and both reference implementations. Please have a look.

Of course, we now allow more things that aren’t valid absolute URIs, e.g. what-the-heck:is!this. I had to switch a test with <localhost:5001>, which previously didn’t count as an auto-link but now does (with localhost interpreted as the scheme!). But I think that’s okay. We were making no real attempt to weed out invalid URIs before (except for invalid schemes).
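To make the effect of the change concrete, here is a rough Python sketch of a scheme-agnostic autolink check. It assumes the rule as it reads in later published versions of the spec (a scheme is 2 to 32 characters, starting with an ASCII letter and continuing with letters, digits, "+", "." or "-", followed by ":" and then anything except ASCII control characters, space, "<" and ">"). The names `AUTOLINK_RE` and `parse_autolink` are invented for this example and do not come from the reference implementations.

```python
import re

# Assumed rule: <scheme:rest>, scheme = letter + 1..31 of [A-Za-z0-9+.-],
# rest = any characters except ASCII controls, space, "<" and ">".
AUTOLINK_RE = re.compile(
    r"<([A-Za-z][A-Za-z0-9+.\-]{1,31}):([^\x00-\x20\x7f<>]*)>"
)


def parse_autolink(text):
    """Return (scheme, rest) if the whole text is a URI autolink, else None."""
    m = AUTOLINK_RE.fullmatch(text)
    return (m.group(1), m.group(2)) if m else None


print(parse_autolink("<localhost:5001>"))         # localhost is taken as the scheme
print(parse_autolink("<what-the-heck:is!this>"))  # accepted, though not a real URI
print(parse_autolink("<m:math>"))                 # one-letter prefix: not an autolink
```

Note that with a two-character minimum, a multi-letter prefix such as the one in `<math:mrow>` is still read as a scheme, which is the restriction discussed earlier in the thread.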

Hmm, in the dingus I get for this example:

1. This is [hopp](localhost:8080)

2. This is [hopp](localhost:80 80)

3. This is <foobar:dingbat class="invalid">.

4. This is <f:dingbat class="invalid"> 

5. This is <dingbat class="invalid">

for (1.) a link, for (2.) not (the SPACE!); no markup for (3.) or (4.); but (5.) gets passed through as markup (as html_inline in the AST).

Is this the new/correct/intended behavior?

We might want to think about whether we should allow colons in raw HTML tags, to allow 3 and 4.
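As a rough idea of what allowing colons would mean, the following Python sketch compares a simplified open-tag pattern with and without an optional `prefix:` part in the tag name. It is deliberately reduced (only double-quoted attribute values, no closing tags, comments, or other raw HTML forms), and `open_tag_re` is a hypothetical helper, not code from the spec or the reference implementations.

```python
import re

# Simplified attribute: whitespace, a name, and an optional ="value".
ATTR = r'\s+[A-Za-z_:][A-Za-z0-9_.:-]*(?:="[^"]*")?'


def open_tag_re(allow_colon):
    """Build a simplified open-tag regex, optionally allowing 'prefix:name'."""
    name = r"[A-Za-z][A-Za-z0-9-]*"
    if allow_colon:
        name += r"(?::[A-Za-z][A-Za-z0-9-]*)?"
    return re.compile(r"<" + name + r"(?:" + ATTR + r")*\s*/?>")


strict = open_tag_re(False)
relaxed = open_tag_re(True)

for tag in ('<foobar:dingbat class="invalid">',
            '<f:dingbat class="invalid">',
            '<dingbat class="invalid">'):
    print(tag, bool(strict.fullmatch(tag)), bool(relaxed.fullmatch(tag)))
```

A tag such as `<foobar:dingbat>` with no attributes and no SPACE would still be claimed by the autolink rule first, so the relaxed pattern only helps for tags that contain whitespace.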