Make CommonMark safe by default?


#1

Although the goal for CommonMark is not to change Markdown but to standardize it, the world has changed since 2004. Making CommonMark safe by default seems to me to be worth making an exception for.

###The problem:

The security problem with Markdown is that it is natural to naively assume that it is safe, but this is not the case. Actually making it safe often means delving into the complex world of HTML sanitization.

###Proposed solution:

I suggest that all potentially unsafe features of CommonMark are designated Optional. Among others this would include inline HTML and javascript hrefs in links.

This would mean that safe parsers could claim 100% CommonMark compliance and drop a lot of implementation complexity in the process. On the other hand services that need inline HTML like GFM and StackExchange can take the extra responsibility for making sure the optional and potentially unsafe features are implemented securely.


Standard keyname for features
#2

Move raw HTML to an extension? Given the security implications of raw HTML, I think this opt-in approach should be considered.

Alternatively, there’s a case to made for making CommonMark more modular, keeping the default set of modules close to the original Markdown, but allowing either a subset or superset (extensions) to be used by the implementation.


#3

Even if embedded HTML was optional, this still works:

[do something evil](javascript:evil())
[do something dangerous](data:text/javascript;dangerous())

I believe most of the responsibility to secure Javascript lies with browsers, not with HTML generators (based upon Markdown or whatever).


#4

As @Crissov notes, if security is your goal, it’s not enough to make raw HTML optional (though this might make sense). You’d have to sanitize all URL attributes, too. I did write an implementation once that did this (cheapskate), but I became persuaded that sanitization should really be a separate step that occurs after Markdown processing, using a dedicated and well-tested sanitization library. (Doing it right is surprisingly hard, and I’m not in favor of putting all the details of the sanitization procedure in the spec.)

That’s why CommonMark reference implementations do no sanitization, and include a strong warning about this, and advice to use a sanitizer, in the README.


#5

Note that the reference implementations create an abstract syntax tree. It’s relatively easy (using the iterator/tree walker interface) to walk the AST prior to rendering, removing all raw HTML nodes (or changing them to innocuous comments), and sanitizing URLs in links and images. If you wanted a safer-by-default implementation, it would be easy to achieve in this way.

The point I’m making is that this kind of safety doesn’t require making HTML parsing optional; you can deal with it in the rendering phase or by altering the AST prior to rendering.


#6

Yes, I mentioned links in the post:

I also think you are missing the point about where the responsibility lies: browsers can’t be expected to know the difference between ‘legitimate’ javascript (ie that which comes from the site, eg Stack Overflow) and that which comes from user posted content on that site. The responsibility for sanitizing that content has to be with the implementers of markdown on that site.


#7

That’s an excellent point and I admit I hadn’t thought of it. I’d personally prefer embedded HTML to be parsed as normal paragraph text by default (unless ‘opted in’ to an optional parsing feature), but sanitizing URLs in links/images in the AST is a much better option - here I think we should do it by default and allow implementations to opt-in to unsafe syntax if they must.

This is my main issue: the imbalance in difficulty between full-blown HTML sanitization that is done post markdown processing and the relatively simple and minor sanitization that could be done pre or during markdown processing. I can’t find the changelog for HTML Purifier but I’m betting there is a litany of security fixes in there — CommonMark could sidestep the whole can of worms and save implementers a lot of trouble and maintenance by being ‘safe by default’.


#8

How does doing the sanitization during Markdown processing make it any easier? Besides, we already have libraries for doing HTML sanitization, and these are well tested and well maintained, by people who focus on this topic. Why should we put a half-baked version of this in Markdown processors?


#9

@Jack_Douglas I suppose your point is that if we just remove raw HTML, or parse it as regular text, that reduces the work that needs to be done in sanitizing. You don’t need to whitelist tags if there are no tags. But we still have to deal with links; we can’t just remove all links. As I recall, it’s harder than it might seem to remove all attack vectors here.

At any rate, if you do want an option to remove raw HTML, it’s better to do this after the parsing stage, as I suggested. (It could also be escaped and converted to regular text, if you prefer that to removing it.) That guarantees that nothing else in the CommonMark parsing is affected, and it allows the spec to stay exactly as it is.


#10

Because we don’t have to understand HTML to make markdown safe - apart from inline HTML, there are only a very small number of entry points for injected code.

Yes, and that’s a good thing, don’t get me wrong, but because it is so difficult, even well tested and well maintained libraries are bound to have bugs. That’s fine if you have no option but to use them, but I think we have an easy option not to - at least for sites that aren’t concerned about supporting inline HTML or esoteric links.

This is the crucial issue - if it’s hard then my proposal is pointless and we might as well just advise on post processing sanitization as we do, however my take-away from your earlier comment is that links could be sanitized easily by walking the AST prior to rendering - in which case it should be relatively easy to add a third parse (or additional stage to link/image parsing) which does the same? Please forgive me if I’m misunderstanding how parsing works here.


#11

Just to clarify, the current libraries do this:

Markdown --(parse)-->  AST  --(render)--> HTML

What I suggested was to add a filtering step:

Markdown --(parse)--> AST --(filter/sanitize)--> new AST --(render)--> HTML

It’s easy to walk the AST and find the links, inspect their URLs, and modify them if needed.
What’s harder is determining whether a given URL needs to be modified. Probably you’d need to do the following: run the URL through a real URL parser, and check the protocol against a whitelist. (In Haskell’s xss-sanitize, the following list is used: “ed2k”, “ftp”, “http”, “https”, “irc”, “mailto”, “news”, “gopher”, “nntp”, “telnet”, “webcal”, “xmpp”, “callto”, “feed”, “urn”, “aim”, “rsync”, “tag”, “ssh”, “sftp”, “rtsp”, “afs”.) If you wanted to be extra safe, you might also check for non-ascii characters, since people can trick you by using a URL with similar-looking characters to one you want to visit. Anyway, it’s a non-trivial task, though admittedly easier than full sanitization. If I wanted to do this, I’d have a look at several existing sanitization libraries to see what they do.


#12

Trying to shove security into a markdown processor where it doesn’t belong violates the single responsibility principle. If you think a library that’s designed purely with security in mind is going to have bugs, what makes you think mixing this complexity into markdown is going to result in fewer bugs?

If you find that there are bugs in your HTML sanitizer, fix the bugs in the HTML sanitizer. Building a new sanitizer is a sure-fire way to introduce more buggy implementations.

Markdown is HTML. If you need to sanitize your HTML do it as a post-process step. If an author is so naive as to not understand that they need to sanitize their inputs, they deserve what’s coming to them.

Coddling to crappy programmers is not a good argument for adding an unnecessary feature.


#13

I get this perspective, I really do, but here’s another one: it’s a really surprising thing that a text format like markdown allows arbitrary javascript to be included in the output of generated links. I was bowled over when I discovered this.

And as I said, if it’s just as complex to sanitize the markdown as in the post-processed output (as you imply) then I agree this is a non-starter as a proposal, however I’m still of the opinion that it is easier by orders of magnitude to sanitize the markdown than the HTML output.


#14

Thanks for clarifying, that’s really useful. It certainly doesn’t sound trivial, but it does remind me that something similar goes on in the spec for autolinks where a whitelist of schemes already exists. Presumably this whitelist is also for security (ie javascript isn’t in the list)?


#15

Actually, the whitelist of schemes for autolinks was intended to help distinguish autolinks from HTML or XML tags. (Sometimes namespaces are used, e.g. <m:math>.) A lot of people on this forum suggested that it would be better to remove the whitelist and just recognize a pattern, and that change is still under consideration. I suppose security was what motivated me not to include javascript in that list, but again, I don’t think this is the right place to worry about such things; all links need to be checked at some further step of processing.


#16

This suggestion seems to stem from a very narrow view of Markdown: that it is a way for Internet users to enter content into web applications.

But Markdown is just a way to write structured text and CommonMark is just a standard way of turning that text into HTML. Some people use Markdown for writing personal notes, others use it as an HTML alternative (like HAML). A tool like Pandoc can even convert Markdown into completely different formats. In these contexts, “safe” Markdown isn’t just nonsensical, it can be counterproductive!

I was under the impression that CommonMark was intended to be a standardized Markdown that all applications could use as a baseline before adding their application-specific customizations. But if I’m wrong and its really intended to be a standardized Markdown for online text input, we should definitely make this change.


#17

Were you bowled over when you discovered that you could include arbitrary JavaScript in markdown?

<script src="malicious.js"></script> is perfectly valid markdown because Markdown is HTML (plus sugar).

Thought experiment:

Consider arbitrary HTML input piped through a sanitizer. This has a level of effort A.

Consider arbitrary markdown input piped through a markdown processor. This has a level of effort B.

Consider arbitrary markdown input piped through a markdown processor, and then piped through a sanitizer. This has a level of effort A + B.

Consider arbitrary markdown input (which I will remind you can produce any HTML input) that is piped through a markdown processor that also somehow sanitizes the markup. How would it be possible to have a level of effort less than A + B? All the complexity of A needs to happen. All the complexity of B needs to happen. For this case the level of effort is A + B + C, where C is the level of effort of unifying two completely separate functionalities into the same codebase.

The only way that A + B could be greater than A + B + C is if C is negative, so how is it possible that unifying two separate functionalities is easier than leaving them separate?


#18

According to the Daring Fireball site: “Markdown is a text-to-HTML conversion tool for web writers”. It is used in different ways, but the inline HTML (and unsafe URL) security issues are only relevant if you are outputting HTML. I’d say this is another reason to make them optional rather than baked into the core spec: it moves us away from thinking of markdown as a HTML shorthand.


#19

No that’s a bit more obvious. As has been discussed in this thread already, it’s relatively easy to transform inline HTML into comments, code, paragraphs or whatever if you want ‘safe by default’ — the URL issue is significantly thornier and more subtle.

We are talking about different kinds of effort. It is impossible to be sure that arbitrary HTML input piped through a sanitizer is safe because HTML sanitization is complex (the single-file download of HTML Purifier is roughly 22k lines of code). I’m counting that as ‘high effort’ because even getting the best results you can hope for will include keeping up with HTML Purifier updates etc.

On the other hand, it is definitely possible to guarantee 100% safety of Markdown generated HTML for those that want it — it’s a “non-trivial” task but relatively easy when stood next to the impossibility of perfect sanitization post-output. We can put the effort in now at this level to make CommonMark intrinsically safe by default and encourage the unsafe bits to be ‘opt-in’ only. Really I think there are a lot of use cases where HTML is allowed but it shouldn’t be. Often it’s allowed to paper over a relatively small number of ‘missing’ features in markdown itself, but with a good mechanism for optional features baked into the spec, it shouldn’t be necessary to poison the well like that.


#20

I think it would be preferable to just allow http and https by default. Maybe I’m really suggesting there should be a “CommonMark Core” spec and a “CommonMark Compatible” extended with optional syntax — this is the kind of decision where you would diverge: Core having the stripped down safest and simplest options versus Compatible which would have the options that suit most applications in the wild.