Backslash escapes inside link destinations

Hello.

Is there some rationale why CommonMark (and cmark) handles link destinations in normal links and autolinks differently with respect to the backslash escape sequences?

Consider this input:

<http:\*>

[http:\*](http:\*)

[http:\*](<http:\*>)

which results in:

<p><a href="http:%5C*">http:\*</a></p>
<p><a href="http:*">http:*</a></p>
<p><a href="http:*">http:*</a></p>
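To make the difference concrete, here is a minimal Python sketch of the two behaviours, not cmark's actual code. The function names, the `SAFE` set, and the use of `urllib.parse.quote` are assumptions made for illustration; only the three input/output pairs above come from cmark.

```python
import re
from urllib.parse import quote

# ASCII punctuation characters that a backslash may escape in CommonMark.
ASCII_PUNCT = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
ESCAPE_RE = re.compile(r"\\([" + re.escape(ASCII_PUNCT) + r"])")

# Characters left as-is when percent-encoding (an assumption for this sketch;
# "%" is included so existing percent-escapes are not encoded twice).
SAFE = "!#$%&'()*+,-./:;=?@_~"


def unescape_backslashes(text):
    """Drop the backslash from every backslash + ASCII punctuation pair."""
    return ESCAPE_RE.sub(r"\1", text)


def autolink_href(raw):
    """Autolink: no backslash escapes, so '\\' itself gets percent-encoded."""
    return quote(raw, safe=SAFE)


def inline_link_href(raw):
    """Inline link destination: unescape first, then percent-encode."""
    return quote(unescape_backslashes(raw), safe=SAFE)


print(autolink_href("http:\\*"))
print(inline_link_href("http:\\*"))
```

The only difference between the two functions is the unescaping step, which is exactly the inconsistency in question.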

My natural expectation would be that all those are equivalent to each other. So should not some unification be done here for future CommonMark revisions?

Neither a valid URI nor a valid IRI can ever contain \.

Nevertheless, per RFC 3987, section 3.1, a system converting IRIs to URIs MAY deal with \ (in an invalid IRI) by replacing them with %5C. If such a system encounters a \, but doesn’t convert it, then the conversion SHOULD fail. Quoting:

   Systems accepting IRIs MAY also deal with the printable characters in
   US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
   "{", "}", "|", "\", "^", and "`", in step 2 above.  If these
   characters are found but are not converted, then the conversion
   SHOULD fail.

The CommonMark reference implementation is a system that converts IRIs to URIs. E.g., it transforms [](https://тест.рф/) to <p><a href="https://%D1%82%D0%B5%D1%81%D1%82.%D1%80%D1%84/"></a></p>.

There should be no way for a CommonMark-conforming implementation to fail. So the only way to deal with the conversion properly is to convert \ to %5C, IMO.
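As a rough illustration of that conversion, the following Python sketch percent-encodes non-ASCII characters as UTF-8 and also converts the extra printable US-ASCII characters listed in the RFC quote, instead of failing on them. The name `iri_to_uri` and the constant `RFC3987_EXTRA` are hypothetical; this is not how cmark is implemented, and control characters are deliberately ignored.

```python
from urllib.parse import quote

# Printable US-ASCII characters that RFC 3987, section 3.1 says a converter
# MAY percent-encode: < > " space { } | \ ^ `
RFC3987_EXTRA = '<>" {}|\\^`'


def iri_to_uri(iri):
    """Percent-encode non-ASCII characters and the RFC 3987 'extra' set."""
    out = []
    for ch in iri:
        if ord(ch) > 0x7F or ch in RFC3987_EXTRA:
            # quote() encodes the character as UTF-8 and percent-escapes each byte.
            out.append(quote(ch, safe=""))
        else:
            out.append(ch)
    return "".join(out)


print(iri_to_uri("https://тест.рф/"))
print(iri_to_uri("http:\\*"))
```

With this rule the conversion never fails, and a backslash always ends up as %5C.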

One may argue that whatever is inside angle or round brackets is not a URI and only becomes one after unescaping all the backslashes. But that goes against the idea of Markdown being as close to plaintext as possible.

Square brackets are a different story, though. Whatever is inside them should be interpreted as close to “unbracketed” Markdown text as possible. That’s why I think that in square brackets backslashes can and should be used for escaping.

Spec section 6.1: “Backslash escapes do not work in code blocks, code spans, autolinks, or raw HTML.”

What’s the rationale for treating URLs in autolinks differently from URLs in regular links? I’m not sure I really remember.

It would seem reasonable to me to allow backslash escapes in autolinks, especially given that there’s no other legitimate reason for a backslash to be there.

Does anyone see a downside to that?

Spec section 6.1: “Backslash escapes do not work in code blocks, code spans, autolinks, or raw HTML.”

I am aware of it. I’m just wondering, what motivation (if any) is behind it.

It would seem reasonable to me to allow backslash escapes in autolinks, especially given that there’s no other legitimate reason for a backslash to be there.

Does anyone see a downside to that?

I am not sure whether it is better to allow or disallow the backslash escapes in both contexts, but unless there is a rationale for the difference, it should rather be unified in one or the other way.

Maybe also things like this should be taken into account:

(Note that filenames (and maybe also usernames) may start with a punctuation character, e.g. .)

If a user copies & pastes such an address into a Markdown document, it should preferably not break the address (more than it already is).

It would seem reasonable to me to allow backslash escapes in autolinks, especially given that there’s no other legitimate reason for a backslash to be there.

There is no legitimate reason for many other characters to be there too, yet the spec allows it.

Allowing " is especially interesting. Something like <http://example.com"onclick="console.log('pwned')> can easily XSS a naive Markdown implementation.
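A small Python sketch of what “naive” means here, assuming a renderer that builds the anchor tag by plain string concatenation. Both function names are hypothetical; `html.escape` is the standard-library helper.

```python
from html import escape


def naive_autolink(url):
    """Builds the tag by concatenation: a '"' in the URL ends the href early."""
    return '<a href="' + url + '">' + url + "</a>"


def escaped_autolink(url):
    """Escapes &, <, > and quotes before putting the URL into the attribute."""
    e = escape(url, quote=True)
    return '<a href="' + e + '">' + e + "</a>"


payload = "http://example.com\"onclick=\"console.log('pwned')"
print(naive_autolink(payload))
print(escaped_autolink(payload))
```

In the naive output the `onclick` attribute becomes a real attribute of the `<a>` element; in the escaped output it stays part of the `href` value.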

Does anyone see a downside to that?

If you allow backslash escapes inside autolinks (or any other links, for that matter), they stop being plaintext links and become backslash-escaped links. You can’t simply copy-paste them from your Markdown source into your browser and expect them to work.

IMO there are perfectly valid reasons why backslash escapes (and indeed all other CommonMark-significant markup like *, [, the ` backtick itself, or entity references) are not recognized in code blocks, code spans, and raw HTML; in other words, why these fragments are not interpreted in any way.

However, for these fragments of input text there are already rules in place to find their end (rather ingenious in the case of code spans and the “repeated backtick-trick”, or for the most part inherited from HTML in the case of raw HTML).

For autolinks, link titles, and link destinations the rules are a little bit more complicated. The terminating delimiter can be:

  • the first non-escaped QUOTATION MARK or APOSTROPHE, skipping white space (for link titles),
  • the first non-escaped RIGHT PARENTHESIS providing a balanced match, skipping white space, for the opening LEFT PARENTHESIS (for link titles),
  • the first non-escaped GREATER-THAN SIGN, without skipping white space (for link destinations),
  • the first white space character (for link destinations and autolinks),
  • the first GREATER-THAN SIGN, without skipping white space or LESS-THAN SIGN (for autolinks),
  • the first LESS-THAN SIGN, without skipping white space or GREATER-THAN SIGN (for autolinks),
  • the first non-escaped RIGHT PARENTHESIS providing a balanced match, without skipping white space (for link destinations in inline links).
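The “first non-escaped” cases in the list above all boil down to one scanning step. Here is a hedged Python sketch of it. `find_unescaped` is a hypothetical helper, and it simplifies by treating a backslash before any character as an escape, which is harmless only as long as the target is a punctuation character.

```python
def find_unescaped(text, target, start=0):
    """Return the index of the first `target` not preceded by a backslash, or -1."""
    i = start
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            # Skip the backslash together with the character it escapes.
            i += 2
            continue
        if text[i] == target:
            return i
        i += 1
    return -1


print(find_unescaped("a\\>b>c", ">"))
```

For autolinks the same scan would run without the backslash branch, which is why `<http:\*>` keeps its backslash.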

Although Gruber’s description is in fact silent about all of these, there is “prior art” for this use of \", \', (), \(, \) in link titles. This might be sufficient to support this use in the specification. (I personally tend to just write &quot; if needed in link titles enclosed in ", and to forget about all the other rules.)

For handling space or > or \ in link destinations and autolinks (i.e., in URIs), I think there’s much less “common” practice.


Apply the “repeated backtick-trick” here too?

So I wonder whether CommonMark could apply the “repeated backtick-trick” here too: open a link destination or autolink with two or more < characters, and the closing delimiter will be the same number of repeated > characters.

This would in one go

  • allow any text, including unescaped space, >, ) and < in link destinations and autolinks;

  • provide an obvious distinction between an autolink and an HTML tag (which was the reason for excluding space from the former, IIRC!)

While <svn:defs> looks a lot like <svg:defs>, and the latter should be written <svg:defs > in CommonMark to differentiate both, it is perfectly obvious that <<svn:defs>> or <<mailto:"2 < 4"@example.com>> are not HTML tags (but would be autolinks). Similarly,

[link text](<<http://example.com/with a ) very strange > url>> "How about that?")

is much easier to type, parse, and copy-and-paste for humans and machines than

[link text](<http://example.com/with%20a%20%29%20very%20strange%20%3E%20url> "How about that?")

or the “clearer” (because two less characters are percent-encoded :wink: )

[link text](<http://example.com/with%20a%20\)%20very%20strange%20\>%20url> "How about that?")

(I found no easier or clearer way to enter this URL in CommonMark.)
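To show that the proposed rule would be cheap to parse, here is a minimal Python sketch of such a scanner. This syntax is not part of CommonMark, and `scan_repeated_angle` is a hypothetical name. Like code spans, it assumes the closing run of > must have exactly the same length as the opening run of <.

```python
def scan_repeated_angle(text, start=0):
    """Scan '<<...>>' with n >= 2 brackets on each side.

    Returns (destination, index_after_closer), or None if there is no match.
    """
    n = 0
    while start + n < len(text) and text[start + n] == "<":
        n += 1
    if n < 2:
        return None  # a single '<' stays an ordinary autolink or HTML tag

    body_start = start + n
    i = body_start
    while i < len(text):
        if text[i] == ">":
            j = i
            while j < len(text) and text[j] == ">":
                j += 1
            if j - i == n:
                # A run of exactly n '>' closes the destination.
                return text[body_start:i], j
            i = j  # a shorter or longer run is part of the content
        else:
            i += 1
    return None


print(scan_repeated_angle("<<http://example.com/with a ) very strange > url>>"))
```

No escaping rules are needed in the body: a destination that itself contains >> would simply be wrapped in <<< and >>>.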

@mity I’m not positive which way to go on this, but I agree that

<http:\*>

should not behave differently as an autolink and in an inline link.
Can you open an issue on jgm/CommonMark with a pointer to this discussion?