Linkifier. Let's discuss and test

The problem was first discussed here: http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

This is not spec-related, but it is related to Markdown in general. I’ve written a JS library for link recognition. It’s in beta and needs feedback.

Online demo: http://markdown-it.github.io/linkify-it/

The principal difference from other JS libraries is that this project should work correctly with Unicode (including astral characters). It is also much more advanced than what the original article describes.

Current state:

  • recognition should be OK.
  • text & URL normalization is not complete yet.
  • speed is not optimized yet.

I’d like to know your opinion, especially about how it works with Chinese, Japanese and other languages.


Great! I have been looking for this.

Simple API, but it’s too aggressive for use in Markdown IMHO. We should have a way to disable some of the built-in schemes, such as “localhost” and “//”, which are better suited to short messages (like tweets) than to long content.

One of the main difficulties I’ve had with autolinking in pandoc is determining when final punctuation belongs with the link or not. We want to allow things like parentheses in URLs (they come up all the time on Wikipedia), but you don’t want to capture a final parenthesis if the whole URL is in parens. Similarly, a URL can contain a - character, but what if the user intends to follow the URL with a textual em-dash (---)? Those final - characters should not be parsed as part of the URL. So some complex heuristics are needed. Does your library attempt to address this problem?
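To make the kind of heuristic in question concrete, here is a minimal Python sketch. It is not pandoc's or linkify-it's actual logic; `trim_url`, the punctuation set, and the "two or more hyphens end the URL" rule are assumptions chosen for illustration.

```python
import re

# Hypothetical helper for illustration only; not taken from pandoc or linkify-it.
CLOSERS = {")": "(", "]": "[", "}": "{"}
TRAILING = ".,;:!?\"'"


def trim_url(candidate: str) -> str:
    """Strip trailing characters that most likely belong to the surrounding text."""
    # Assumption: a run of two or more hyphens is a textual dash and ends the URL.
    candidate = re.split(r"-{2,}", candidate, maxsplit=1)[0]
    while candidate:
        last = candidate[-1]
        if last in TRAILING:
            # Sentence punctuation at the very end is treated as text.
            candidate = candidate[:-1]
        elif last in CLOSERS and candidate.count(last) > candidate.count(CLOSERS[last]):
            # Unbalanced closing bracket: it closes something outside the URL.
            candidate = candidate[:-1]
        else:
            break
    return candidate
```

With this sketch, `trim_url("http://google.com).")` gives `http://google.com`, while a Wikipedia-style URL ending in a balanced `(computer_graphics)` is left untouched.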

Here are some of the test cases I used in pandoc, in case it’s helpful. Some of them were borrowed from the ruby rinku library.

bareLinkTests :: [(String, Inlines)]
bareLinkTests =
  [ ("http://google.com is a search engine.",
     autolink "http://google.com" <> " is a search engine.")
  , ("<a href=\"http://foo.bar.baz\">http://foo.bar.baz</a>",
     rawInline "html" "<a href=\"http://foo.bar.baz\">" <>
     "http://foo.bar.baz" <> rawInline "html" "</a>")
  , ("Try this query: http://google.com?search=fish&time=hour.",
     "Try this query: " <> autolink "http://google.com?search=fish&time=hour" <> ".")
  , ("HTTPS://GOOGLE.COM,",
      autolink "HTTPS://GOOGLE.COM" <> ",")
  , ("http://el.wikipedia.org/wiki/Τεχνολογία,",
      autolink "http://el.wikipedia.org/wiki/Τεχνολογία" <> ",")
  , ("doi:10.1000/182,",
      autolink "doi:10.1000/182" <> ",")
  , ("git://github.com/foo/bar.git,",
      autolink "git://github.com/foo/bar.git" <> ",")
  , ("file:///Users/joe/joe.txt, and",
      autolink "file:///Users/joe/joe.txt" <> ", and")
  , ("mailto:someone@somedomain.com.",
      autolink "mailto:someone@somedomain.com" <> ".")
  , ("Use http: this is not a link!",
      "Use http: this is not a link!")
  , ("(http://google.com).",
      "(" <> autolink "http://google.com" <> ").")
  , ("http://en.wikipedia.org/wiki/Sprite_(computer_graphics)",
      autolink "http://en.wikipedia.org/wiki/Sprite_(computer_graphics)")
  , ("http://en.wikipedia.org/wiki/Sprite_[computer_graphics]",
      autolink "http://en.wikipedia.org/wiki/Sprite_[computer_graphics]")
  , ("http://en.wikipedia.org/wiki/Sprite_{computer_graphics}",
      autolink "http://en.wikipedia.org/wiki/Sprite_{computer_graphics}")
  , ("http://example.com/Notification_Center-GitHub-20101108-140050.jpg",
      autolink "http://example.com/Notification_Center-GitHub-20101108-140050.jpg")
  , ("https://github.com/github/hubot/blob/master/scripts/cream.js#L20-20",
      autolink "https://github.com/github/hubot/blob/master/scripts/cream.js#L20-20")
  , ("http://www.rubyonrails.com",
      autolink "http://www.rubyonrails.com")
  , ("http://www.rubyonrails.com:80",
      autolink "http://www.rubyonrails.com:80")
  , ("http://www.rubyonrails.com/~minam",
      autolink "http://www.rubyonrails.com/~minam")
  , ("https://www.rubyonrails.com/~minam",
      autolink "https://www.rubyonrails.com/~minam")
  , ("http://www.rubyonrails.com/~minam/url%20with%20spaces",
      autolink "http://www.rubyonrails.com/~minam/url%20with%20spaces")
  , ("http://www.rubyonrails.com/foo.cgi?something=here",
      autolink "http://www.rubyonrails.com/foo.cgi?something=here")
  , ("http://www.rubyonrails.com/foo.cgi?something=here&and=here",
      autolink "http://www.rubyonrails.com/foo.cgi?something=here&and=here")
  , ("http://www.rubyonrails.com/contact;new",
      autolink "http://www.rubyonrails.com/contact;new")
  , ("http://www.rubyonrails.com/contact;new%20with%20spaces",
      autolink "http://www.rubyonrails.com/contact;new%20with%20spaces")
  , ("http://www.rubyonrails.com/contact;new?with=query&string=params",
      autolink "http://www.rubyonrails.com/contact;new?with=query&string=params")
  , ("http://www.rubyonrails.com/~minam/contact;new?with=query&string=params",
      autolink "http://www.rubyonrails.com/~minam/contact;new?with=query&string=params")
  , ("http://en.wikipedia.org/wiki/Wikipedia:Today%27s_featured_picture_%28animation%29/January_20%2C_2007",
      autolink "http://en.wikipedia.org/wiki/Wikipedia:Today%27s_featured_picture_%28animation%29/January_20%2C_2007")
  , ("http://www.mail-archive.com/rails@lists.rubyonrails.org/",
      autolink "http://www.mail-archive.com/rails@lists.rubyonrails.org/")
  , ("http://www.amazon.com/Testing-Equal-Sign-In-Path/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1198861734&sr=8-1",
      autolink "http://www.amazon.com/Testing-Equal-Sign-In-Path/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1198861734&sr=8-1")
  , ("http://en.wikipedia.org/wiki/Texas_hold%27em",
      autolink "http://en.wikipedia.org/wiki/Texas_hold%27em")
  , ("https://www.google.com/doku.php?id=gps:resource:scs:start",
      autolink "https://www.google.com/doku.php?id=gps:resource:scs:start")
  , ("http://www.rubyonrails.com",
      autolink "http://www.rubyonrails.com")
  , ("http://manuals.ruby-on-rails.com/read/chapter.need_a-period/103#page281",
      autolink "http://manuals.ruby-on-rails.com/read/chapter.need_a-period/103#page281")
  , ("http://foo.example.com/controller/action?parm=value&p2=v2#anchor123",
      autolink "http://foo.example.com/controller/action?parm=value&p2=v2#anchor123")
  , ("http://foo.example.com:3000/controller/action",
      autolink "http://foo.example.com:3000/controller/action")
  , ("http://foo.example.com:3000/controller/action+pack",
      autolink "http://foo.example.com:3000/controller/action+pack")
  , ("http://business.timesonline.co.uk/article/0,,9065-2473189,00.html",
      autolink "http://business.timesonline.co.uk/article/0,,9065-2473189,00.html")
  , ("http://www.mail-archive.com/ruby-talk@ruby-lang.org/",
      autolink "http://www.mail-archive.com/ruby-talk@ruby-lang.org/")
  ]

Yes, linkify-it covers many complex cases. See the demo and tests. Paste your examples into the linkify-it demo and you will see that all of them pass, except those using protocols that are not configured (an opt-in policy is used).

The logic is here: https://github.com/markdown-it/linkify-it/blob/master/lib/re_url_parts.js

PS. Added termination on ---. I didn’t know about that case.

localhost now works only with the http:// prefix (like any other local domain name). Default schemas can be disabled with .add(name, null).

FWIW, there’s a pretty comprehensive comparison of URL regexes here. The only entry that fulfilled all requirements was this one. Additionally, publicsuffix.org provides a list of valid domain suffixes that can be used in the tests.

Btw, are plain URLs without the protocol (like publicsuffix.org above) to be linked? I’d say the natural expectation is for them to be linked, but only if they are indeed valid (not necessarily existing) domain names.
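A rough Python sketch of that idea, under stated assumptions: `KNOWN_SUFFIXES` is a tiny hand-picked stand-in for the real list from publicsuffix.org, and `BARE_DOMAIN`, `has_known_suffix` and `find_bare_domains` are hypothetical names, not part of any library.

```python
import re

# Tiny illustrative subset; a real check would load the full list from publicsuffix.org.
KNOWN_SUFFIXES = {"com", "org", "net", "co.uk"}

# Candidate bare domains: dot-separated labels ending in an alphabetic label.
BARE_DOMAIN = re.compile(r"\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b", re.IGNORECASE)


def has_known_suffix(host: str) -> bool:
    """True if the host ends in a known suffix with at least one label before it."""
    labels = host.lower().split(".")
    return any(".".join(labels[i:]) in KNOWN_SUFFIXES for i in range(1, len(labels)))


def find_bare_domains(text: str) -> list:
    """Return protocol-less domain candidates whose suffix is in the known set."""
    return [m.group(1) for m in BARE_DOMAIN.finditer(text) if has_known_suffix(m.group(1))]
```

The point of the suffix check is that something like `readme.txt` looks like a domain syntactically but is rejected, while `publicsuffix.org` is accepted.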


@waldyrious Thanks for the link. At a quick glance, all examples should be OK.

You can open the online demo at http://markdown-it.github.io/linkify-it/ and check for yourself.

Oops… links in blockquotes without a space were detected incorrectly:

>example.com
>http://example.com

Released a fix.

I need help with Asian languages. It seems some punctuation characters should always be treated as link delimiters.

If someone can formalize this, it would be awesome.

I’m sorry. I forgot to reply…

Defined behavior

Unicode characters have to be percent-encoded in URLs (RFC 3986).
Chinese Wikipedia uses this approach, though it uses ASCII parentheses. If any Chinese punctuation characters are in the URL, they should already be escaped.
https://zh.wikipedia.org/wiki/%E5%A4%A7%E5%AD%A6_(%E6%B6%88%E6%AD%A7%E4%B9%89)

Therefore, a link directly surrounded by non-English text, as in [any non-English text]http://links[any non-English text], is still valid.
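As a small illustration of the percent-encoding described above, this Python sketch rebuilds the Wikipedia URL from its unescaped title using the standard library. Keeping the ASCII parentheses unescaped via `safe="()"` is a choice made here to match that URL, not something RFC 3986 requires.

```python
from urllib.parse import quote, unquote

# Page title as a reader would type it.
title = "大学_(消歧义)"

# Percent-encode the UTF-8 bytes of every non-ASCII character;
# ASCII parentheses are left as-is to match the Wikipedia URL.
encoded = quote(title, safe="()")
url = "https://zh.wikipedia.org/wiki/" + encoded

# Decoding restores the original title.
decoded = unquote(encoded)
```

Here `url` comes out identical to the Chinese Wikipedia link quoted above.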

Extra case

When a URL is copied and pasted from the browser address bar, it may not be escaped.
https://www.example.com/大学_(消歧义)?参数=parameter

Reported case

http://t.cn/RZwjG7U(分享自 @酷6网) is a share message, which is a common case. Punctuation characters are rarely used in URLs anyway, so it’s pretty safe to use them as delimiters.
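A minimal Python sketch of the delimiter idea, under assumptions: the share message is taken to use fullwidth parentheses (an ASCII `(` directly after the URL would need separate handling), the `CJK_DELIMS` set is a small hand-picked sample rather than a formal list, and `find_urls` is a hypothetical helper, not linkify-it's implementation.

```python
import re

# Assumed delimiter set: a few fullwidth/CJK punctuation marks; not exhaustive.
CJK_DELIMS = "（）【】「」『』《》，。、；：！？"

# A URL runs until whitespace or one of the CJK punctuation marks above.
# Ordinary CJK letters are still allowed inside the URL.
URL_RE = re.compile(r"https?://[^\s" + CJK_DELIMS + r"]+")


def find_urls(text: str) -> list:
    """Return all http(s) URLs, ending each one at CJK punctuation."""
    return URL_RE.findall(text)
```

With this, the short link is cut off before the fullwidth parenthesis, while an unescaped URL pasted from the address bar, such as the one under "Extra case", is still matched whole.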


LOL @vitaly it gets even better. In this wikipedia page:

link: https://zh.wikipedia.org/wiki/(

the link takes you to the page for the character （, which is called 括号 in Chinese. This is the double-width (fullwidth) form of (.

The current behavior of linkify seems to be okay. I find myself literally doing the information-theoretic exercise of trying to figure out “which case is more common”, and the Chinese Wikipedia tilted the balance in favor of allowing Chinese characters in link parsing right away.


Looking forward to seeing the fix!