Unclarities wrt urlescaping

While reading the spec, the part about urlescaping seemed rather thin
and ambiguous. It seems this could cause some discussion, so I’ll post
this in a separate topic.

Example 338 says “optionally URL-escaped when written as HTML”. Does this mean that when using a non-HTML format, no URL-escaping should happen?

Also, this says “optionally URL-escaped”. Does this mean that not all characters are escaped? Implementations can decide what they’d like to escape? What defines this optional-ness?


Technically, that means you need to unescape the whole value and then escape it back. That should not be a problem.

Hmm, sorry, not sure what you mean by that? Could you be more specific?

Ah, I didn’t understand your question; I thought you were asking about implementation. Escaping can still be used if required to properly match begin/end markup. Also, you can use URL encoding, partial or full. There are no mandatory requirements about escaping. Also, entity patterns are replaced with their values.

The output href is always URL-encoded.

The last spec update has a lot of fixes for this.

I suspect you’re talking about backslash-escaping here? I wasn’t - backslash-escaping seems sufficiently well-defined in the spec.

You mean use url encoding in the markdown source? Are you implying that a markdown parser should be aware of urlencoding in its input? I haven’t seen anything to indicate that in the spec.

Yes, this is also clear enough in the spec AFAICS.

Is this not the version at http://jgm.github.io/stmd/spec.html that I was reading yesterday?

can != should/must

No. Use the one from master: commonmark-spec/spec.txt at master · commonmark/commonmark-spec · GitHub. It’s fresher.

You can play with this code. Remove some parts and see how the tests start to fail. It has a minor bug, because it doesn’t pay attention to md-escaped entities, but that’s not fundamental.

Oops, I misread your question. Yes, the parser must understand URL-encoded links. At least, until I did that, some tests from the fresh spec failed. Also, it must replace HTML entities with their values.

Ah, didn’t realize that. Still. The example text I linked to hasn’t changed in master, so my question still stands unmodified.

The code you linked:

rules.link_open = function (tokens, idx /*, options*/) {
  var title = tokens[idx].title ? (' title="' + escapeHtml(replaceEntities(tokens[idx].title)) + '"') : '';
  return '<a href="' + escapeHtml(escapeUrl(unescapeUrl(replaceEntities(tokens[idx].href)))) + '"' + title + '>';
};

does something significantly different from what the spec now says (at least as I interpret it). In particular, this code:

  • Interprets entities as normal (encoding them as UTF-8, I presume) but then forgets something was an entity (so `&euml;` written as an entity and a literal ë in the source are treated the same after this).
  • It interprets all urlencoded characters
  • It recodes all characters (or possibly bytes - I’m not entirely sure how javascript string handling is) that are not alphanumeric, or “, / ? : @ & = + $ #”.

to get at the final url (I’m ignoring the html escape, since that’s part of putting the final url in HTML; it does not influence the url itself).

I think this does not conform to the spec, even though the spec is vague on the subject. More importantly, I think this implementation is flawed. Take for example the markdown:

[foo](/foo%3Fbar?a=3%261%3D1)

Note: %3F is ?, %26 is & and %3D is =.

The url shown has a path of /foo?bar, and a single argument a with a value of 3&1=1. When processed by the above javascript code, the url gets transformed into:

/foo?bar?a=3&1=1

Which means something entirely different. That path is now /foo and there are two arguments. One with key bar?a and value 3, one with key 1 and value 1 (not 100% sure this interpretation is correct, but the url certainly has changed). With the current implementation, I cannot see a way for urls like these to be correctly represented.

The root cause of this problem seems to be the escaping of lone % characters. Since encodeURI encodes %, simply running it on an url will cause already-escaped characters to end up doubly escaped. The current code circumvents this by first decoding, but in the process throws away information.
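The decode-then-encode round trip can be reproduced directly with the standard JavaScript built-ins (a minimal sketch; I am assuming the remarkable helpers behave roughly like `decodeURIComponent`/`encodeURI`, which may not match its actual code):

```javascript
// Demonstrating the information loss: decodeURIComponent undoes ALL
// percent-escapes, and encodeURI leaves ? & = unencoded, so the original
// escapes cannot be reconstructed afterwards.
var href = '/foo%3Fbar?a=3%261%3D1';
var decoded = decodeURIComponent(href); // '/foo?bar?a=3&1=1'
var recoded = encodeURI(decoded);       // still '/foo?bar?a=3&1=1'
```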

For reference, the relevant RFC here is probably RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax. However, I don’t think it completely covers the cases we need. In particular, it distinguishes two uses for percent encoding:

A percent-encoding mechanism is used to represent a data octet in a component when that octet’s corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.

More explicitly, escaping happens:

  1. Because using the non-escaped character would have a special meaning in the url. This escaping happens when building a URI from components. Since CommonMark only handles complete URIs and never composes one from components, it should never have to apply this kind of escaping.
  2. Because the non-escaped character is not valid for use in a URI (mostly characters > 127 or ASCII control characters < 32).

Ideally, CommonMark should not have to deal with the second type of encoding either and all URIs inside a CommonMark document are already fully escaped and valid. The RFC also indicates this:

Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts.

In practice, this doesn’t really hold. This is due to two reasons:

  • When displaying a URI, browsers display escaped characters from the second category unescaped.
  • When writing a URI inside CommonMark, authors (at least myself) like to write unescaped special (e.g. accented) characters instead of having to write everything percent-encoded.

As an example, I can type the following URI in Firefox’ address bar:

example.org?foo=foo%26bar%C3%80

It loads the page and shows:

example.org/?foo=foo%26barÀ

It has decoded the %C3%80 into a LATIN CAPITAL LETTER A WITH GRAVE, but leaves the %26 (which is a &) encoded, so the meaning of the url doesn’t change to the user.

If I copy-paste the complete url out of Firefox, it does something useful (which I expect not all browsers do, and which Firefox itself didn’t do in older versions). Firefox provides me with a completely encoded and thus valid url:

http://example.org/?foo=foo%26bar%C3%80

However, when I paste only a part of the url, I get the decoded characters:

/?foo=foo%26barÀ

If something like this is pasted into CommonMark, it has to deal with escaping to produce valid URIs (or of course completely ignore the issue and apply no escaping, but that’s not very nice).

As for the second reason, when I write a link in my CommonMark document, I’d rather write:

[1]: http://ace.wikipedia.org/wiki/Wikipèdia

Instead of:

[1]: http://ace.wikipedia.org/wiki/Wikip%C3%A8dia

So, after writing all this, it seems the way forward becomes more clear (even more clear than when I started writing this post ;-p).

So, inside a URI:

  • We cannot encode any reserved characters, since that would change the meaning of the URI. Note that the reserved character set does not include the percent sign.
  • We might need to encode percent signs, see below.
  • There is no need to encode any unreserved characters, since these can just appear in the URI as-is.
  • We must encode all other characters, since they cannot appear in a valid URI (at least I think that this follows from the RFC, but you’d need to plow through all the EBNF to be sure).
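Taken together, these rules could be sketched roughly like this (a hypothetical helper of my own, not part of any existing implementation; it keeps existing percent-escapes and the RFC 3986 reserved/unreserved characters as-is and encodes everything else):

```javascript
// Hypothetical sketch: percent-encode only characters outside the
// RFC 3986 reserved + unreserved sets, leaving existing %XX triplets
// untouched. (Ignores astral-plane surrogate pairs for brevity.)
function encodeUriLoosely(uri) {
  // ALPHA / DIGIT / "-" / "." / "_" / "~" plus gen-delims and sub-delims
  var allowed = /^[A-Za-z0-9\-._~:\/?#\[\]@!$&'()*+,;=]$/;
  var out = '';
  for (var i = 0; i < uri.length; i++) {
    var ch = uri[i];
    if (ch === '%' && /^[0-9A-Fa-f]{2}$/.test(uri.substr(i + 1, 2))) {
      out += uri.substr(i, 3); // keep existing percent-escape as-is
      i += 2;
    } else if (allowed.test(ch)) {
      out += ch;               // reserved/unreserved: leave alone
    } else {
      out += encodeURIComponent(ch); // encode as UTF-8 octets
    }
  }
  return out;
}

// encodeUriLoosely('http://ace.wikipedia.org/wiki/Wikipèdia')
//   → 'http://ace.wikipedia.org/wiki/Wikip%C3%A8dia'
// encodeUriLoosely('/foo%3Fbar?a=3%261%3D1') is returned unchanged
```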

So, what about percent signs? AFAICS, the RFC specifies that percent signs are only allowed as part of a percent-encoding. This means that all other percent signs (specifically, all percent signs that are not followed by two hex digits) must be escaped. It might seem easy to just escape all percent signs, but this removes the ability for percent-encoded characters to be present in the markdown source (which prevents things like using & as part of an argument from working, as illustrated above). The best approach here seems to be:

  • Any percent signs that are not followed by two hex digits must be escaped.

This makes sure that something like this works to point at the wikipedia entry about the percent sign:

 [1]: http://en.wikipedia.org/wiki/%
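A hypothetical one-liner for just this rule (my own sketch; the negative lookahead skips percent signs that already start a valid escape):

```javascript
// Escape only percent signs that are NOT followed by two hex digits.
function escapeLonePercents(url) {
  return url.replace(/%(?![0-9A-Fa-f]{2})/g, '%25');
}

// escapeLonePercents('http://en.wikipedia.org/wiki/%')
//   → 'http://en.wikipedia.org/wiki/%25'
// escapeLonePercents('/foo%3Fbar?a=3%261%3D1') is returned unchanged
```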

This still leaves a small corner case with urls that use a literal % in the url (for example, to mean the modulo operator):

http://www.example.org/calculator?calc=100%20

To keep the % in the url, this would need to be encoded into:

http://www.example.org/calculator?calc=100%2520

One possible alternative would be to only keep percent-encoded reserved characters, but encode all other percent signs (e.g. even when they appear to be part of a percent-encoding). This would fix the above example, but AFAICS it would break URIs which are already properly and fully encoded, so this seems a lousy idea…

So, how does this sound?


Yeah, I know this code isn’t perfect, and I said so directly. We just had no time to make it better. Getting all the rare URL edge cases correct is not a primary goal right now; we still have more serious issues on the remarkable pending list.

I just pointed out that it’s worth looking at the samples in a fresher source. There are two processes going on in parallel: spec development and code development. I intentionally don’t participate much in spec development, because my coding efforts are much more effective. There are enough clever guys here who can improve the spec better and faster than me. But if you ever have questions about implementation (remarkable or stmd.js), feel free to ask anytime.

Returning to URLs: the spec needs more examples. As soon as they appear, we will fix our implementation.

Yes, most of my previous post was meant as input for the specification, since it is currently too vague. Would anyone like to respond to my previous post from a spec perspective?