Ah, didn’t realize that. Still. The example text I linked to hasn’t changed in master, so my question still stands unmodified.
The code you linked:
rules.link_open = function (tokens, idx /*, options*/) {
  var title = tokens[idx].title ? (' title="' + escapeHtml(replaceEntities(tokens[idx].title)) + '"') : '';
  return '<a href="' + escapeHtml(escapeUrl(unescapeUrl(replaceEntities(tokens[idx].href)))) + '"' + title + '>';
};
Does something significantly different from what the spec now says (at least as I interpret it). In particular, it:
- Interprets entities as normal (encoding them as UTF-8, I presume) but then forgets that something was an entity (so &euml; and ë in the source are treated the same after this).
- Interprets all urlencoded characters.
- Recodes all characters (or possibly bytes - I’m not entirely sure how javascript string handling works) that are not alphanumeric, or “, / ? : @ & = + $ #”.
to get at the final url (I’m ignoring the html escape, since that’s part of putting the final url in HTML, it does not influence the url itself).
I think this does not conform to the spec, even though the spec is vague on the subject. More importantly, I think this implementation is flawed. Take for example the markdown:
[foo](/foo%3Fbar?a=3%261%3D1)
Note: %3F is ?, %26 is & and %3D is =.
The url shown has a path of /foo?bar, and a single argument a with a value of 3&1=1. When processed by the above javascript code, the url gets transformed into:
/foo?bar?a=3&1=1
Which means something entirely different. The path is now /foo and there are two arguments: one with key bar?a and value 3, one with key 1 and value 1 (not 100% sure this interpretation is correct, but the url certainly has changed). With the current implementation, I cannot see a way for urls like these to be correctly represented.
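To make the change concrete, here is a minimal sketch that mimics the decode-then-encode step. It assumes that unescapeUrl behaves roughly like decodeURIComponent and escapeUrl roughly like encodeURI; the real helpers may differ. The names recode and parts, and the example.org base, are made up for this illustration.

```javascript
// Hypothetical stand-in for escapeUrl(unescapeUrl(href)): decode every
// percent-encoding, then re-encode with encodeURI. The real helpers may differ.
function recode(href) {
  return encodeURI(decodeURIComponent(href));
}

// Parse a relative url against a placeholder base so its parts can be inspected.
function parts(href) {
  const u = new URL(href, 'http://example.org');
  return { path: u.pathname, args: Array.from(u.searchParams.entries()) };
}

const source = '/foo%3Fbar?a=3%261%3D1';
const recoded = recode(source);

console.log(recoded);        // the url after the round trip
console.log(parts(source));  // one argument: a = "3&1=1"
console.log(parts(recoded)); // two arguments: "bar?a" = "3" and "1" = "1"
```

The path and the argument list both differ before and after the round trip, which matches the interpretation given above.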
The root cause of this problem seems to be the escaping of lone % characters. Since encodeURI encodes %, simply running it on a url will cause already-escaped characters to end up doubly escaped. The current code circumvents this by first decoding, but in the process throws away information.
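The double escaping is easy to reproduce with the standard encodeURI function. This is only a small illustration; the variable names are arbitrary.

```javascript
const escaped = '/foo%3Fbar';

// encodeURI treats the % itself as a character to escape, so %3F becomes %253F.
const twice = encodeURI(escaped);
console.log(twice);

// Decoding once now yields the literal text "%3F" rather than "?".
console.log(decodeURIComponent(twice));
```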
For reference, the relevant RFC here is probably RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax. However, I don’t think it completely covers the cases we need. In particular, it distinguishes two uses for percent encoding:
A percent-encoding mechanism is used to represent a data octet in a component when that octet’s corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.
More explicitly, escaping happens:
- Because using the non-escaped character would have a special meaning in the url. This escaping happens when building a URI from components. Since CommonMark only handles complete URIs and never composes one from components, it should never have to apply this kind of escaping.
- Because the non-escaped character is not valid for use in a URI (mostly characters >127 or ASCII control characters < 32).
Ideally, CommonMark should not have to deal with the second type of encoding either, and all URIs inside a CommonMark document would already be fully escaped and valid. The RFC also indicates this:
Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts.
In practice, this doesn’t really hold. This is due to two reasons:
- When displaying a URI, browsers display escaped characters from the second category unescaped.
- When writing a URI inside CommonMark, authors (at least myself) like to write unescaped special (e.g. accented) characters instead of having to write everything percent-encoded.
As an example, I can type the following URI in Firefox’s address bar:
example.org?foo=foo%26bar%C3%80
It loads the page and shows:
example.org/?foo=foo%26barÀ
It has decoded the %C3%80 into a LATIN CAPITAL LETTER A WITH GRAVE, but leaves the %26 (which is a &) encoded, so the meaning of the url doesn’t change to the user.
If I copy-paste the complete url out of Firefox, it does something useful (which I expect not all browsers do; Firefox didn’t do this in older versions). Firefox provides me with a completely encoded and thus valid url:
http://example.org/?foo=foo%26bar%C3%80
However, when I copy-paste only a part of the url, I get the decoded characters:
/?foo=foo%26barÀ
If something like this is pasted into CommonMark, it has to deal with escaping to produce valid URIs (or of course completely ignore the issue and apply no escaping, but that’s not very nice).
As for the second reason, when I write a link in my CommonMark document, I’d rather write:
[1]: http://ace.wikipedia.org/wiki/Wikipèdia
Instead of:
[1]: http://ace.wikipedia.org/wiki/Wikip%C3%A8dia
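The two forms are related through UTF-8: the percent-encoded version spells out the octets of the accented character. A small sketch using standard JavaScript functions (the variable names are arbitrary, and \u00E8 is just è written as an escape):

```javascript
// è is U+00E8, which UTF-8 encodes as the two octets 0xC3 0xA8.
const octets = Array.from(new TextEncoder().encode('\u00E8'));

// Write each octet as % followed by two uppercase hex digits.
const encoded = octets
  .map((b) => '%' + b.toString(16).toUpperCase().padStart(2, '0'))
  .join('');
console.log(encoded);

// encodeURI applies the same encoding to the non-ASCII character in the url.
const link = encodeURI('http://ace.wikipedia.org/wiki/Wikip\u00E8dia');
console.log(link);
```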
So, after writing all this, it seems the way forward becomes more clear (even more clear than when I started writing this post ;-p).
So, inside a URI:
- We cannot encode any reserved characters, since that would change the meaning of the URI. Note that the reserved character set does not include the percent sign.
- We might need to encode percent signs, see below.
- There is no need to encode any unreserved characters, since these can just appear in the URI as-is.
- We must encode all other characters, since they cannot appear in a valid URI (at least I think that this follows from the RFC, but you’d need to plow through all the EBNF to be sure).
So, what about percent signs? AFAICS, the RFC specifies that percent signs are only allowed as part of a percent-encoding. This means that all other percent signs (specifically, all percent signs that are not followed by two hex digits) must be escaped. It seems easy to escape all percent signs, but this removes the ability for percent-encoded characters to be present in the markdown source (which prevents things like using & as part of an argument from working, as illustrated above). The best approach here seems to be:
- Any percent signs that are not followed by two hex digits must be escaped.
This makes sure that something like this works to point at the wikipedia entry about the percent sign:
[1]: http://en.wikipedia.org/wiki/%
This still leaves a small corner case with urls that use a literal % in the url (for example, to mean the modulo operator):
http://www.example.org/calculator?calc=100%20
To keep the % in the url, this would need to be encoded into:
http://www.example.org/calculator?calc=100%2520
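The rules above could be sketched roughly as follows. This is only an illustration under some assumptions, not a reference implementation: normalizeUrl is a made-up name, and encodeURI is used as an approximation of "encode everything that is neither reserved nor unreserved". Since encodeURI escapes [ and ], which RFC 3986 lists as reserved, the sketch restores those two afterwards. Note that encodeURI throws on lone surrogates, which the sketch does not handle.

```javascript
// Hypothetical helper, not part of any existing implementation.
function normalizeUrl(url) {
  // Split the url into valid percent-encodings, lone percent signs, and runs
  // of everything else, and handle each kind of piece separately.
  return url.replace(/%[0-9A-Fa-f]{2}|%|[^%]+/g, function (part) {
    if (part === '%') return '%25';          // lone percent sign: escape it
    if (part.charAt(0) === '%') return part; // valid percent-encoding: keep as-is
    // Leave reserved and unreserved characters alone, encode the rest as UTF-8.
    return encodeURI(part).replace(/%5B/g, '[').replace(/%5D/g, ']');
  });
}

console.log(normalizeUrl('/foo%3Fbar?a=3%261%3D1'));          // already encoded, kept
console.log(normalizeUrl('http://en.wikipedia.org/wiki/%'));  // lone % is escaped
console.log(normalizeUrl('/?foo=foo%26bar\u00C0'));           // À is encoded, %26 kept
console.log(normalizeUrl('http://www.example.org/calculator?calc=100%20')); // corner case: unchanged
```

The last call shows the corner case: %20 looks like a valid percent-encoding, so it is kept and the author has to write %2520 by hand.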
One possible alternative would be to only keep percent-encoded reserved characters, but encode all other percent signs (e.g. even when they appear to be part of a percent-encoding). This would fix the above example, but AFAICS it would break URIs which are already properly and fully encoded, so this seems a lousy idea…
So, how does this sound?