Markup collision with underscores in link texts

vitaly · January 18, 2015, 8:28am

http://media.ccc.de/browse/congress/2014/31c3_-_6373_-_en_-_saal_6_-_201412291600_-_the_only_thing_we_know_about_cyberspace_is_that_its_640x480_-_olia_lialina.html

Is it possible to avoid emphasis in such a case without explicit autolink markup?

Formally, the spec works as expected, but such link texts occur quite often (spaces replaced with underscores).

Ticket in markdown-it https://github.com/markdown-it/markdown-it/issues/38.

Knagis · January 18, 2015, 1:24pm

Could a solution be that ASCII symbols (with the exception of _ and *) are added to the exception of ASCII alphanumeric characters that disallow _ or __ being considered as opener or closer? In your example it would be - that would prevent the emphasis.

vitaly · January 18, 2015, 2:00pm

Probably, - can be added to the alphanumeric exclusions, but I don’t know whether that is safe or not.

Adding [a-z] to any rule is a bad idea in general, because it will fail with other languages.

Knagis · January 18, 2015, 2:48pm

I think that the origin of the current [a-z0-9] rule is that _ is often used in member names from code. Almost always they will be ASCII, not Unicode. But since the rule is there, most people would assume that a_a_a and ā_ā_ā would work the same, so it could perhaps be changed to any non-whitespace…

Babelmark2 - actually CommonMark is in the minority in this aspect - most of the others result in either emphasis for both or none for both.
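To make the a_a_a versus ā_ā_ā point concrete, here is a minimal Python sketch of just the "preceded by an ASCII alphanumeric" check. The function name is hypothetical, and the left-flanking part of the rule is deliberately left out, so this is an illustration rather than any parser's actual code.

```python
def preceded_by_ascii_alnum(text: str, idx: int) -> bool:
    """Return True if the character before text[idx] is an ASCII letter or digit.

    Hypothetical helper: it models only the ASCII-alphanumeric condition,
    not the full emphasis rule.
    """
    if idx == 0:
        return False  # nothing precedes the first character
    prev = text[idx - 1]
    return prev.isascii() and prev.isalnum()


ascii_sample = "a_a_a"
unicode_sample = "\u0101_\u0101_\u0101"  # ā_ā_ā, written with escapes

# The first underscore sits at index 1 in both strings.
print(preceded_by_ascii_alnum(ascii_sample, 1))
print(preceded_by_ascii_alnum(unicode_sample, 1))
```

The check succeeds for the ASCII string and fails for the one with ā, so a rule built on it treats two visually similar inputs differently.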

vitaly · January 18, 2015, 3:27pm

[a-z0-9] also covers English words in text. Probably it would be simpler to remove it altogether and tune the negative condition, like [punctuation + scopes + whitespaces].

Looks like the goal was to define “word bounds” with cheats. That’s not a problem in normal regexp engines, but a pain in JS.

jgm · January 18, 2015, 4:23pm

@Knagis’s solution would work for this case, but there would be other bare URLs containing underscores for which this problem would arise again. [EDIT: A further problem is that the solution would be overly restrictive: I think we want to allow people to do things like _word_---blah blah and _foo_-bar.]

This is a tough one. It does show one problem with the idea that bare URLs should be linkified after parsing (not in the parser itself). Probably those who want to linkify bare URLs should insert the linkifying routine into the parser to avoid these problems, though there would, I imagine, be a significant performance cost. (That’s what cheapskate does, and you can see that it renders this nicely on BabelMark2.)

I don’t see a good solution; I’m tempted to say that users who want URLs with underscores will just have to put them in <...>.

jgm · January 18, 2015, 4:56pm

+++ Kārlis Gaņģis [Jan 18 15 15:01 ]:

I think that the origin of the current [a-z0-9] rule is that _ is often used in member names from code. Almost always they will be ASCII, not Unicode. But since the rule is there, most people would assume that a_a_a and ā_ā_ā would work the same, so it could perhaps be changed to any non-whitespace…

Yes, the intent was to avoid capturing underscores in code identifiers as emphasis, since that’s the case I was aware of that was prompting people to complain.

I hadn’t considered this URL case.

Given the new rules for emphasis, I wonder if it would make sense to simplify the rules for _ emphasis thus:

instead of

A single _ character can open emphasis iff it is part of a left-flanking delimiter run and is not preceded by an ASCII alphanumeric character.

we could have:

A single _ character can open emphasis iff it is part of a left-flanking delimiter run and not part of a right-flanking delimiter run.

and so on.

We’re already doing the checks for left- and right-flankingness, so this would not add any complexity to the code; it would even simplify it.
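As a rough illustration of the proposed wording, the sketch below computes left- and right-flankingness for a delimiter run and derives the opener test from it. The function names are hypothetical, the punctuation test is an approximation (ASCII punctuation plus Unicode "P" categories), and the closer test is assumed to be the mirror image of the opener rule, which is how "and so on" is read here.

```python
import string
import unicodedata
from typing import Tuple


def is_whitespace(ch: str) -> bool:
    # The empty string stands for the start or end of the text,
    # which is treated like whitespace.
    return ch == "" or ch.isspace()


def is_punctuation(ch: str) -> bool:
    # Approximation (assumption): ASCII punctuation or a Unicode "P" category.
    return ch != "" and (
        ch in string.punctuation or unicodedata.category(ch).startswith("P")
    )


def flanking(text: str, start: int, end: int) -> Tuple[bool, bool]:
    """Return (left_flanking, right_flanking) for the run text[start:end]."""
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""
    left = not is_whitespace(after) and (
        not is_punctuation(after) or is_whitespace(before) or is_punctuation(before)
    )
    right = not is_whitespace(before) and (
        not is_punctuation(before) or is_whitespace(after) or is_punctuation(after)
    )
    return left, right


def underscore_can_open(text: str, start: int, end: int) -> bool:
    left, right = flanking(text, start, end)
    return left and not right


def underscore_can_close(text: str, start: int, end: int) -> bool:
    # Assumed mirror image of the opener rule.
    left, right = flanking(text, start, end)
    return right and not left


url_part = "31c3_-_6373_-_en"
print(underscore_can_open("snake_case", 5, 6))   # intraword underscore
print(underscore_can_open(url_part, 6, 7))       # underscore after "-"
print(underscore_can_close(url_part, 11, 12))    # underscore before "-"
```

In this sketch the intraword underscore in snake_case cannot open emphasis, while the URL fragment still yields an opener (index 6) and a closer (index 11), so the URL case is not fixed by this rule alone.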

[EDIT: Just to clarify, this suggestion would not prevent emphasis in the original URL example. It just seemed independently preferable to singling out ASCII alphanumerics.]

[UPDATE: I have now made this change.]

vitaly · March 6, 2015, 5:34am

@jgm, need your advice. I’ve tried to detect links with schemes in the inline phase, but got another collision. Such links eat trailing markup from pairs:

_http://foo.com/_
*http://foo.com/*
other pairs from extensions (^, ~, ~~, ==, ++)

Any ideas how to solve such a conflict? Looks like it would be more effective to disable emphasis on this pattern inside of text: _-_
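The collision can be reproduced with a deliberately naive matcher. The pattern below is an assumption made for illustration and is not markdown-it's actual linkifier; it simply takes everything up to the next whitespace.

```python
import re

# Naive pattern (assumption): a scheme followed by any run of non-space characters.
NAIVE_URL = re.compile(r"https?://\S+")


def naive_links(text: str) -> list:
    """Return every substring the naive pattern treats as a URL."""
    return NAIVE_URL.findall(text)


for sample in ("_http://foo.com/_", "*http://foo.com/*"):
    print(sample, "->", naive_links(sample))
```

In both samples the match includes the closing delimiter, so the opening _ or * is left without a partner.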

jgm · March 6, 2015, 7:36am

@vitaly, I’m not quite sure I understand your question. But, as I’ve said before, the biggest problem with linkifying bare URLs is figuring out what final punctuation is part of the link and what is not. You can introduce heuristics for this, but you’ll never cover all cases. This is why I prefer not to linkify bare URLs.
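A small sketch of one such heuristic shows why it cannot cover all cases. The set of trimmed characters and the function name are assumptions chosen for illustration.

```python
# Characters assumed to be "probably not part of the link" when they trail a URL.
TRAILING = ".,;:!?*_~"


def trim_trailing_markup(url: str) -> str:
    """Strip trailing punctuation that is more likely markup than URL."""
    return url.rstrip(TRAILING)


print(trim_trailing_markup("http://foo.com/_"))             # closing delimiter removed
print(trim_trailing_markup("http://example.com/__init__"))  # real path damaged
```

The first call gives the intended http://foo.com/, but the second turns a legitimate path into http://example.com/__init, because the heuristic cannot tell markup from URL content.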

vitaly · March 6, 2015, 7:58am

Linkification logic is OK. The problem is with emphasis.

As you said, _ has additional restrictions to avoid false_positives_inside_of_words. The reason is clear. I think there can be one more exclusion here: skip _-_ inside of words, because this pattern can occur in human-friendly links generated from article titles:

http://media.ccc.de/browse/congress/2014/31c3_-_6373_-_en_-_saal_6_-_201412291600_-_the_only_thing_we_know_about_cyberspace_is_that_its_640x480_-_olia_lialina.html

That looks safer than giving high priority to the linkifier and cheating with link tails.

I don’t insist that this should be in the spec. I only ask your opinion: do you see problems with such logic? At first glance, I can’t find cases where a user needs emphasis for two words with a dash between them.

jgm · March 6, 2015, 6:09pm

Seems to me that the best approach would be to scan for links as a unit rather than hoping nothing inside the link will be interpreted as Markdown.
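Scanning for links as a unit could look roughly like the sketch below: URLs are cut out as opaque tokens first, and only the remaining text would be handed to emphasis parsing. The URL pattern and the token names are assumptions, and deciding where a URL ends is left to the simple pattern here.

```python
import re
from typing import List, Tuple

# Simplified URL pattern (assumption): scheme, then anything up to whitespace or "<".
URL = re.compile(r"https?://[^\s<]+")


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Split text into ("link", url) and ("text", other) tokens.

    Only "text" tokens would later be scanned for emphasis markers.
    """
    tokens = []
    pos = 0
    for match in URL.finditer(text):
        if match.start() > pos:
            tokens.append(("text", text[pos:match.start()]))
        tokens.append(("link", match.group(0)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


print(tokenize("see http://a.b/x_-_y now"))
```

Because the underscores end up inside a "link" token, they are never considered as emphasis delimiters.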

Anyway, I can imagine cases where you’d want underscore emphasis next to a hyphen. (Especially when we have multiple hyphens, an ASCII em or en dash, but even with single hyphens.)

vitaly · March 6, 2015, 6:49pm

I tried this approach here. The problem is that we don’t know where to stop, and the linkifier eats Markdown markup on the tail. The parser can have extensions, adding ++, == etc. to the existing _, __, *, **. Many more possibilities for mistakes.

Could you provide examples that would break if I disable _-_ inside of words?

jgm · March 6, 2015, 8:04pm

+++ vitaly [Mar 06 15 18:59 ]:

I tried this approach here. The problem is that we don’t know where to stop, and the linkifier eats Markdown markup on the tail. The parser can have extensions, adding ++, == etc. to the existing _, __, *, **. Many more possibilities for mistakes.

Given that URLs can contain all kinds of symbols that have meaning in Markdown, including square brackets, parentheses, backslashes, and asterisks, it just seems crazy to allow the inside of a URL to be parsed as Markdown and then linkify afterwards. Disabling _ emphasis next to a hyphen would be an ad hoc fix for just one possible issue of this kind.

Could you provide examples that would break if I disable _-_ inside of words?

Spanish has two words that translate the English word
_that_---_aquel_ and _eso_---so we need to see how they
differ.

I looked up every word from _A_-_Z_.

See Plato, _Timaeus_, 17_a_-_d_.

I can think of lots of cases where people might naturally do this.
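A quick check, sketched below, shows which of these samples a literal "_-_ inside of words" exclusion would touch. The regular expression is only one possible reading of that exclusion (a single hyphen between underscores, with word characters on both sides) and is an assumption.

```python
import re

# One possible reading of the proposed exclusion (assumption):
# "_-_" with a word character directly before and after it.
INTRAWORD_DASH = re.compile(r"(?<=\w)_-_(?=\w)")


def affected(text: str) -> bool:
    """Return True if the exclusion would disable emphasis somewhere in text."""
    return INTRAWORD_DASH.search(text) is not None


samples = [
    "_that_---_aquel_ and _eso_---so",
    "I looked up every word from _A_-_Z_.",
    "See Plato, _Timaeus_, 17_a_-_d_.",
    "31c3_-_6373_-_en",
]
for sample in samples:
    print(affected(sample), sample)
```

Under this reading the A-Z and Plato samples are caught together with the URL fragment, while the triple-hyphen sample is not, so the exclusion would change the rendering of ordinary prose as well as URLs.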

Wilfred · November 8, 2017, 11:19am

I’ve been bitten by this for links referencing Python source code, e.g.:

http://example.com/__init__.py

Several implementations produce the intended link here (babelmark), although http://example.com/__init__ seems to be harder to handle (babelmark).

Wilfred · November 9, 2017, 3:50pm

Interestingly, GitHub’s commonmark dialect does not seem to suffer from this.

Their formal specification briefly mentions underscores in its autolink section:

Autolinks can also be constructed without requiring the use of < and to > to delimit them, although they will be recognized under a smaller set of circumstances. All such recognized autolinks can only come at the beginning of a line, after whitespace, or any of the delimiting characters *, _, ~, and (.

This forbids things like `http://example.com` becoming a link.

After a valid domain, zero or more non-space non-< characters may follow:

This seems to imply _ is not treated specially.
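A rough reading of the two quoted rules can be sketched as a single pattern. This is not GitHub's implementation: the domain check is crudely approximated, any further handling the specification may apply to the end of a link is omitted, and the names are hypothetical.

```python
import re

AUTOLINK = re.compile(
    r"(?:^|(?<=[\s*_~(]))"      # line start, whitespace, or one of * _ ~ (
    r"https?://[A-Za-z0-9.-]+"  # scheme plus a crudely approximated domain (assumption)
    r"[^\s<]*",                 # zero or more non-space, non-< characters
    re.MULTILINE,
)


def find_autolinks(text: str) -> list:
    """Return the substrings this simplified pattern would linkify."""
    return AUTOLINK.findall(text)


print(find_autolinks("see http://example.com/__init__.py"))
print(find_autolinks("xhttp://example.com"))
```

With this pattern the underscores in the path are consumed as ordinary URL characters, and a URL glued to a preceding letter is not recognized at all.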

An extended email autolink will be recognised when an email address is recognised within any text node. Email addresses are recognised according to the following rules: […] ., -, and _ can occur on both sides of the @, but only . may occur at the end of the email address, in which case it will not be considered part of the address:

Email autolinks also get a specific mention regarding underscores.