Need clarification for links escaping / unescaping

vitaly · January 7, 2015, 5:48am

URL-escaping should be left alone inside the destination, as all
URL-escaped characters are also valid URL characters. HTML entities in
the destination will be parsed into their UTF-8 codepoints, as usual,
and optionally URL-escaped when written as HTML.

It’s not clear from the spec what to do with partially escaped links. We could always unescape as much as possible and then escape again. That is the method used now, but the spec does not say so.

“URL-escaped characters are also valid URL characters” is not 100% true: an escape sequence is valid only if it decodes to a valid codepoint.
I don’t understand why we should leave non-canonical content in the destination.
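
A minimal sketch of the "unescape as much as possible, then escape again" idea, assuming plain JavaScript built-ins only. The name `normalizeEscapes` and both regular expressions are hypothetical and not taken from any CommonMark implementation. It decodes only runs of non-ASCII bytes, and only when they form valid UTF-8, so an escape such as `%F6` that does not decode to a valid codepoint is left as written:

```javascript
// Hypothetical helper for illustration; not the code of any real parser.
const ESCAPE = /(%[0-9A-Fa-f]{2})/;
const HIGH_BYTE_RUN = /(?:%[89A-Fa-f][0-9A-Fa-f])+/g;

function normalizeEscapes(destination) {
  // Step 1: unescape runs of non-ASCII bytes when they are valid UTF-8.
  const decoded = destination.replace(HIGH_BYTE_RUN, (run) => {
    try {
      return decodeURIComponent(run);
    } catch (err) {
      return run; // not valid UTF-8: keep the escapes as written
    }
  });
  // Step 2: escape again, leaving the remaining %XX sequences untouched.
  // split() with a capture group puts the escapes at odd indexes.
  return decoded
    .split(ESCAPE)
    .map((part, i) => (i % 2 === 1 ? part : encodeURI(part)))
    .join('');
}

console.log(normalizeEscapes('http://example.com/%C3%B6ö'));
// http://example.com/%C3%B6%C3%B6
console.log(normalizeEscapes('http://example.com/%F6ö'));
// http://example.com/%F6%C3%B6
```

ASCII escapes such as `%2F` or `%25` are deliberately not decoded here, because decoding them can change the meaning of the URL.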

The path and the query string need different escapers to get a correct result (encodeURI / encodeURIComponent).
It seems IDN domains should be encoded with Punycode, not percent-escaped.
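
To illustrate the difference between the two escapers, here is a short sketch using only the standard `encodeURI` and `encodeURIComponent` functions. The sample value and the example.com URL are made up for illustration:

```javascript
// A value containing characters that are meaningful in a query string.
const value = 'a&b=c/d ö';

// encodeURI keeps URL syntax characters such as & = / intact.
const asWhole = encodeURI(value);
// encodeURIComponent escapes them, so the value stays a single parameter.
const asComponent = encodeURIComponent(value);

console.log(asWhole);     // a&b=c/d%20%C3%B6
console.log(asComponent); // a%26b%3Dc%2Fd%20%C3%B6

// Assembling a URL from parts: path via encodeURI, query value via
// encodeURIComponent.
const built =
  'http://example.com/' + encodeURI('docs/öö') +
  '?q=' + encodeURIComponent('a&b');
console.log(built); // http://example.com/docs/%C3%B6%C3%B6?q=a%26b
```

Applying `encodeURI` to the query value would leave the `&` alone and silently split it into two parameters.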

Also it would be nice to define some kind of link destination normalization.

I’ve created a ticket in the tracker noting that the current JS implementation can be broken in many different ways, but it would be better to clarify the spec before making any fixes. Link destination processing is not as simple as it looks at first glance.

jgm · January 7, 2015, 6:23pm

I’m tempted to say this is an HTML renderer issue, not a parser (and hence spec) issue. What if the parser just stores the URL as written (after resolving backslash-escapes), and leaves the choice of how to escape it to the HTML renderer? (Actually, that is what the parser does, if I recall correctly.) So this issue would be outside the scope of the spec. (It would be easier for the spec not to mention it if the example outputs were XML, not HTML.)

vitaly · January 7, 2015, 7:37pm

Let me describe the problem from another angle. I look at the whole system, which receives CommonMark markup and gives back HTML. I think different implementations must produce the same link destinations in their output. Multiple correct results should not be possible.

Maybe it’s only a renderer problem, but knowing that does not help me understand how to generate correct HTML from user input. Converting the tests to XML would mask the problem instead of solving it. It would be simpler to remove the tests with entities in links.

Could you explain, without referring to the spec, what effect you want to get from those transformations in the renderer? To make copy-paste from HTML source work? If so, is it really needed?

jgm · January 7, 2015, 7:48pm

Could you explain, without referring to the spec, what effect you want to get from those transformations in the renderer? To make copy-paste from HTML source work? If so, is it really needed?

Sorry, which transformations do you have in mind?

I’m not against defining normalization in the spec – but how complicated is this going to get? What you’re doing is using an external library. Do we really want to make explicit all of the logic it uses in the CommonMark spec?

vitaly · January 7, 2015, 8:38pm

Replacing entities. I can somewhat understand HTML unescaping, but I have no idea why entities should be replaced.
Applying decodeURI/encodeURI to the full URL, although they are meant for the path only (not for the query string and anchor). The query string is processed with encodeURIComponent. Also, the domain name is usually encoded with Punycode.

http://путин.рф/ -> http://xn--h1akeme.xn--p1ai/

That’s not mandatory, and not very complex if you have a library to parse URLs. Parsing a URL correctly (splitting it into parts) is not easy without an external library. See, for example, lib/url.js in nodejs/node-v0.x-archive on GitHub.
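
As a sketch of what splitting into parts looks like with a ready-made parser, the following uses the WHATWG `URL` class that is built into current Node.js and browsers. This is an assumption for illustration: it is a different parser from the legacy `url.js` mentioned above, and `splitDestination` is a hypothetical name.

```javascript
// Hypothetical helper: split an absolute URL into separately encoded parts.
// The WHATWG URL parser converts IDN hosts to Punycode and percent-encodes
// non-ASCII characters in the path, query and fragment.
function splitDestination(destination) {
  const url = new URL(destination);
  return {
    host: url.hostname,
    path: url.pathname,
    query: url.search,
    fragment: url.hash,
  };
}

const parts = splitDestination('http://путин.рф/öö?ö=ö#ö');
console.log(parts.host);     // xn--h1akeme.xn--p1ai
console.log(parts.path);     // /%C3%B6%C3%B6
console.log(parts.query);    // ?%C3%B6=%C3%B6
console.log(parts.fragment); // #%C3%B6
```

`new URL` throws on relative destinations unless a base is supplied, so a real renderer would need a separate path for those.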

I brought up normalization because the current implementation looks like 50% normalization and 50% encoding, both incomplete. And I can’t understand the idea behind it.

If the spec contains no tests whose output could be normalized, the decision can be left to implementation authors, because all links will be correct anyway.

It’s not necessary to copy a detailed description. A reference is enough. The problem with the current implementation is that it looks like an experimental mix of different ideas, done with unknown goals.

This is only my assumption, but you tried to make two cases work:

1. Normal links, as copy-pasted from the browser address bar
2. Encoded links, as copy-pasted from HTML source

Probably, (2) is not needed.

If you can explain what you were trying to achieve, I can look for correct and well-specified improvements.

jgm · January 8, 2015, 4:44pm

+++ Vitaly Puzrin [Jan 07 15 20:51 ]:

I agree that the current spec needs improvement here!

This is only my assumption, but you tried to make two cases work:

1. Normal links, as copy-pasted from the browser address bar
2. Encoded links, as copy-pasted from HTML source

Probably, (2) is not needed.

I think both of these are needed. We certainly don’t want things to break if someone inserts a link like

[my link](http://example.com/%C3%B6%C3%B6)

But we also want it to work if they do

[my link](http://example.com/&ouml;&ouml;)

or

[my link](http://example.com/öö)
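
One way to see that the three spellings can converge on the same destination is the sketch below. It is an illustration under stated assumptions, not how cmark is implemented: `ENTITIES` is a deliberately tiny, hypothetical table (a real implementation would use the full HTML5 entity list), `toHref` is a made-up name, and the WHATWG `URL` class is used only as a convenient way to percent-encode an absolute URL while keeping existing escapes.

```javascript
// Tiny illustrative entity table; hypothetical and far from complete.
const ENTITIES = new Map([['ouml', 'ö'], ['amp', '&']]);

// Replace known named entities with their characters; leave the rest alone.
function resolveEntities(text) {
  return text.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) =>
    ENTITIES.has(name) ? ENTITIES.get(name) : match);
}

// Resolve entities, then let the URL parser percent-encode what is left.
function toHref(destination) {
  return new URL(resolveEntities(destination)).href;
}

for (const destination of [
  'http://example.com/%C3%B6%C3%B6',
  'http://example.com/&ouml;&ouml;',
  'http://example.com/öö',
]) {
  console.log(toHref(destination)); // http://example.com/%C3%B6%C3%B6
}
```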

vitaly · January 8, 2015, 11:03pm

[my link](http://example.com/%C3%B6%C3%B6)
[my link](http://example.com/öö)

I think that’s not a problem. The URL encoding process exists exactly for such cases.

[my link](http://example.com/&ouml;&ouml;)

Do you have an example of how such a link would end up in a destination? Note that if a user pastes this into the browser address bar, he will get a wrong result. Why should CommonMark care about things that don’t work when used directly in a browser?

I’d suggest following this principle for implementations (parse + HTML render):

Everything that works when pasted into the browser address bar should give the correct address after processing.

Everything else is unspecified.

The only exception is Markdown escaping (with \), but that is clear.

jgm · January 9, 2015, 9:18pm

+++ Vitaly Puzrin [Jan 08 15 23:13 ]:

[my link](http://example.com/%C3%B6%C3%B6)
[my link](http://example.com/öö)

I think that’s not a problem. The URL encoding process exists exactly for such cases.

[my link](http://example.com/&ouml;&ouml;)

Do you have an example of how such a link would end up in a destination? Note that if a user pastes this into the browser address bar, he will get a wrong result. Why should CommonMark care about things that don’t work when used directly in a browser?

In an HTML <a> element, &ouml; would always be
interpreted as a character reference, whether in the path
part or the query part of a URL.

BabelMark2
reveals two main groups (discounting CommonMark itself).

Markdown.pl, marked, cheapskate, PHP markdown, kramdown give us

<a href="http://example.com/&ouml;&ouml;">link</a>

CommonMark is a variant on this that percent-encodes the
ös. Pandoc is a variant that just uses the utf-8 character.

lunamark, discount, redcarpet, python-markdown, and others
give us

<a href="http://example.com/&amp;ouml;&amp;ouml;">

escaping the & for the user. Is that what you think the
behavior should be?

John

vitaly · January 10, 2015, 4:32am

Yes, I’d prefer this variant (and I understand that other authors have reasons to process entity sequences as markup). But as a compromise I suggest removing this test case, because developers can have different preferences.

The “right” URL value depends not on a standard, but on how it was received and what the user wants. We can’t guess with 100% certainty. It is like answering the question “should a line break become a hard break or not”: it depends on context.

IMHO, a user can put a URL into a text editor in several ways:

1. Type it manually: he types what he sees, without any encoding or replacements.
2. Copy-paste from the browser address bar.
3. Copy via the context menu (“copy this link”).
4. Select page content and copy-paste it as part of the text.
5. Copy-paste from the page’s HTML source.

(1-4) should not process entities, (5) should.

I don’t know who would copy-paste from HTML source nowadays. And I’m not sure anyone did it before.
I try to follow the KISS principle. When multiple choices are available (convert | don’t convert), I prefer the simplest one (do nothing, in our case).

The simplest solution is to remove the relevant tests. Or, as I suggested somewhere earlier, make such tests non-mandatory (status = recommendation instead of requirement).

There is also another interesting case. When I copy-paste a URL from the browser address bar into a text editor, I get the text in URL-encoded form:

http://example.com/öö?ö=ö#ö -> http://example.com/%C3%B6%C3%B6?%C3%B6=%C3%B6#%C3%B6

That’s common for Wikipedia links. I think it would be nice if CommonMark could convert such links to human-readable form when possible, for example in autolinks.
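
A minimal sketch of such a conversion for the visible link text only, assuming the href itself is left untouched. The function name `humanReadable` is hypothetical; `decodeURI` is the standard built-in, and it throws on escapes that are not valid UTF-8, in which case the text is kept as written:

```javascript
// Hypothetical helper: produce readable display text for an autolink.
function humanReadable(url) {
  try {
    // decodeURI leaves escapes of URL syntax characters (such as %2F or
    // %3F) alone and decodes the rest.
    return decodeURI(url);
  } catch (err) {
    return url; // malformed escape: show the URL unchanged
  }
}

console.log(humanReadable('http://example.com/%C3%B6%C3%B6?%C3%B6=%C3%B6#%C3%B6'));
// http://example.com/öö?ö=ö#ö
console.log(humanReadable('http://example.com/%F6'));
// http://example.com/%F6
```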

jgm · January 12, 2015, 5:37am

+++ Vitaly Puzrin [Jan 10 15 04:44 ]:

The “right” URL value depends not on a standard, but on how it was received and what the user wants. We can’t guess with 100% certainty. It is like answering the question “should a line break become a hard break or not”: it depends on context.

Well, I think the probability is near 0 that if they include a valid entity, they didn’t mean it as an entity. Is there even a possible interpretation of

/foo&ouml;bar

as a query string where & serves to join two separate query parameter assignments?

It seems pretty clear that if someone writes

[foo](/url/foo&ouml;bar)

they mean to write an o-umlaut, and if they write

[foo](/url/foo=5&ouml=bar)

where & can’t be interpreted as part of an entity, they mean to write a query with two parts. And cmark gets this exactly right, rendering the first as

<p><a href="/url/foo%C3%B6bar">foo</a></p>

and the second as

<p><a href="/url/foo=5&amp;ouml=bar">foo</a></p>

It seems a safe principle that if you have a valid entity, the & in it is not the & that connects separate query parameter assignments.
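
The principle can be sketched as a small classifier that decides, for each & in a destination, whether it starts a valid entity or is a literal ampersand. This is an illustration only, not cmark’s code: `KNOWN_ENTITIES` is a hypothetical, heavily abbreviated stand-in for the HTML5 entity list, `classifyAmpersands` is a made-up name, and numeric character references are ignored for brevity.

```javascript
// Abbreviated, hypothetical stand-in for the full HTML5 entity name list.
const KNOWN_ENTITIES = new Set(['ouml', 'amp', 'lt', 'gt', 'quot']);

// For every "&" in the destination, report whether it begins a known named
// entity ("entity") or should be treated as a plain ampersand ("literal").
function classifyAmpersands(destination) {
  const kinds = [];
  // Group 1 is set only when the "&" is followed by a name and a semicolon.
  const pattern = /&(?:([A-Za-z][A-Za-z0-9]*);)?/g;
  for (const match of destination.matchAll(pattern)) {
    const name = match[1];
    kinds.push(name !== undefined && KNOWN_ENTITIES.has(name) ? 'entity' : 'literal');
  }
  return kinds;
}

console.log(classifyAmpersands('/url/foo&ouml;bar'));   // [ 'entity' ]
console.log(classifyAmpersands('/url/foo=5&ouml=bar')); // [ 'literal' ]
```

A renderer built on this would replace the "entity" occurrences with their characters and write the "literal" ones as &amp;amp; in the HTML attribute.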

IMHO, a user can put a URL into a text editor in several ways:

1. Type it manually: he types what he sees, without any encoding or replacements.
2. Copy-paste from the browser address bar.
3. Copy via the context menu (“copy this link”).
4. Select page content and copy-paste it as part of the text.
5. Copy-paste from the page’s HTML source.

(1-4) should not process entities, (5) should.

All of these should work well with the current cmark. Suppose the browser bar says:

http://example.com/foo=bar&baz=3

The user can copy this into a CommonMark link, and it will be rendered in HTML as

http://example.com/foo=bar&amp;baz=3

just as needed. On the other hand, if they copy from HTML source, they’ll copy

http://example.com/foo=bar&amp;baz=3

and CommonMark will render this just as it is,

http://example.com/foo=bar&amp;baz=3

Again, that’s what is wanted.

The simplest solution is to remove the relevant tests. Or, as I suggested somewhere earlier, make such tests non-mandatory (status = recommendation instead of requirement).

If we want an unambiguous spec, we still need to specify how things are parsed. (We can leave some latitude in rendering.) That means saying something to determine when a & in the destination of a CommonMark link should be interpreted as part of an entity and when it should be interpreted as a literal &. We could leave out tests that reveal these decisions, but that would just mean we didn’t have good test coverage.

vitaly · January 12, 2015, 7:09am

Your examples are based on the assumption that the user, for some reason, intentionally types entities in a URL. My opinion is that this assumption is artificial. Also, I’ve seen some apps using AJAX where sub-queries were placed in a long line after # and contained &. Probably that was the result of a bad implementation (and I can’t recall an example link now).

I think the probability of getting a wrong result is the same with replacements and without them. So not doing replacements can be preferable, being simpler.

To prove to me that entity replacement can be needed, just say “yes, I know people who copy-paste links from HTML source and who manually type umlauts as entities in URLs”. That will be enough.

Split the testing goals. There are specification tests and application tests. You will not be able to get full code coverage with spec tests anyway. And I don’t feel I’m in a position to suggest final spec requirements for URLs.

I think the real situation with entities is “nobody cares”. If you don’t wish to maintain multiple test files, just use different markers for non-mandatory tests. That will also signal other devs to join the discussion.