Make CommonMark safe by default?

This suggestion seems to stem from a very narrow view of Markdown: that it is a way for Internet users to enter content into web applications.

But Markdown is just a way to write structured text and CommonMark is just a standard way of turning that text into HTML. Some people use Markdown for writing personal notes, others use it as an HTML alternative (like HAML). A tool like Pandoc can even convert Markdown into completely different formats. In these contexts, “safe” Markdown isn’t just nonsensical, it can be counterproductive!

I was under the impression that CommonMark was intended to be a standardized Markdown that all applications could use as a baseline before adding their application-specific customizations. But if I’m wrong and its really intended to be a standardized Markdown for online text input, we should definitely make this change.

3 Likes

Were you bowled over when you discovered that you could include arbitrary JavaScript in markdown?

<script src="malicious.js"></script> is perfectly valid markdown because Markdown is HTML (plus sugar).

Thought experiment:

Consider arbitrary HTML input piped through a sanitizer. This has a level of effort A.

Consider arbitrary markdown input piped through a markdown processor. This has a level of effort B.

Consider arbitrary markdown input piped through a markdown processor, and then piped through a sanitizer. This has a level of effort A + B.

Consider arbitrary markdown input (which I will remind you can produce any HTML input) that is piped through a markdown processor that also somehow sanitizes the markup. How would it be possible to have a level of effort less than A + B? All the complexity of A needs to happen. All the complexity of B needs to happen. For this case the level of effort is A + B + C, where C is the level of effort of unifying two completely separate functionalities into the same codebase.

The only way that A + B could be greater than A + B + C is if C is negative, so how is it possible that unifying two separate functionalities is easier than leaving them separate?

2 Likes

According to the Daring Fireball site: “Markdown is a text-to-HTML conversion tool for web writers”. It is used in different ways, but the inline HTML (and unsafe URL) security issues are only relevant if you are outputting HTML. I’d say this is another reason to make them optional rather than baked into the core spec: it moves us away from thinking of markdown as a HTML shorthand.

No that’s a bit more obvious. As has been discussed in this thread already, it’s relatively easy to transform inline HTML into comments, code, paragraphs or whatever if you want ‘safe by default’ — the URL issue is significantly thornier and more subtle.

We are talking about different kinds of effort. It is impossible to be sure that arbitrary HTML input piped through a sanitizer is safe because HTML sanitization is complex (the single-file download of HTML Purifier is roughly 22k lines of code). I’m counting that as ‘high effort’ because even getting the best results you can hope for will include keeping up with HTML Purifier updates etc.

On the other hand, it is definitely possible to guarantee 100% safety of Markdown generated HTML for those that want it — it’s a “non-trivial” task but relatively easy when stood next to the impossibility of perfect sanitization post-output. We can put the effort in now at this level to make CommonMark intrinsically safe by default and encourage the unsafe bits to be ‘opt-in’ only. Really I think there are a lot of use cases where HTML is allowed but it shouldn’t be. Often it’s allowed to paper over a relatively small number of ‘missing’ features in markdown itself, but with a good mechanism for optional features baked into the spec, it shouldn’t be necessary to poison the well like that.

1 Like

I think it would be preferable to just allow http and https by default. Maybe I’m really suggesting there should be a “CommonMark Core” spec and a “CommonMark Compatible” extended with optional syntax — this is the kind of decision where you would diverge: Core having the stripped down safest and simplest options versus Compatible which would have the options that suit most applications in the wild.

I see this as an implementation issue, not a spec issue. It does not affect parsing, and the spec is about how the input is parsed.

An implementation may choose to provide a “safe mode” that renders raw HTML escaped, or as a comment, or as nothing, and suppresses links with URLs that don’t fit some condition. It may even choose to make this mode the default. As I’ve described, it’s easy to do this at the rendering stage, or by altering the AST before rendering.

Note that markdown-it does exactly this – it disables raw HTML and some links by default. I agree that this is probably a good default for implementations, and I’ll think about doing something similar for cmark and commonmark.js.

EDIT: I’ve added safety options for cmark. Currently they’re not the default. [EDIT: and also now in commonmark.js]

5 Likes

Finally :smile:

When you add syntax plugins, i will loose the last serious reason to develop markdown-it.

2 Likes

No, outputting to PDF often includes rendered HTML.

1 Like

In my project at work XSS (Javascript injection) exploitable code has been a huge problem. I’ve thought the best way to solve this would have been to not save html but use a language like markdown to prevent XSS from being saved. Looking at the spec for CommonMark I realized that allowing HTML does not appear to be optional, for many site’s this could allow exploits. I’ve also noticed that it can’t be used on this site.

I personally think that html input should be an extension and not part of the default behavior. If that isn’t acceptable maybe it can be spec-ed that a valid parser/renderer can disable this feature, or provide an option to enable only safe renderings (clean unsafe tags/attributes).

@xenoterracide safety is not spec issue, that’s managed on implementation level. For example, markdown-it is safe by default.

@vitaly I don’t really agree, if a spec doesn’t specify something as optional, disable-able, or an extension, then if an implementation doesn’t support it, it isn’t really conforming to that spec. I do agree that it’s managed at the implementation level, but I think that the spec level can define that it is something that is optional and that implementations should make it disable-able. If you were to allow only partial html support that’s also something that the spec would need to define what that means. Does the spec need to define exactly how to disable it in the implementation? no that is an implementation detail in this case.

1 Like

“Optional” in context of spec is something, that can be omitted. HTML can NOT be omitted if you implement markdown parser. IMHO it’s not good idea to add programming issues into spec. That make things complex without added value. If spec does not prevent to implement configurable parsers - it’s good enouth as is.

1 Like

Why are you assuming that sanitizing commonmark would be easier than sanitizing html? In either case, you need to follow the same procedure:

  1. Parse it into an AST.
  2. Remove anything from the AST that doesn’t match your whitelist.
  3. Dump it back out.

Just don’t write your own HTML parser, and definitely don’t try to do it with a regex. Just like you’d sanitize markdown.

1 Like

Parsing HTML into an AST is much harder than parsing Markdown into an AST — and even the best HTML sanitizers contain bugs and vulnerabilities precisely because of that fact.

You have made this claim before and have yet to provide any evidence to support your argument. HTML is a subset of markdown, unless you’re naively thinking that you can remove the ability for markdown to support raw HTML, at which point you’ve invalidated the majority of use cases that would reap any benefit from HTML sanitization.

I’ve refuted this point already:

Didn’t I make that 100% clear in the original post? I’ve no idea why you would call that “naive” — I’m not invalidating anything, the topic is about whether support is mandatory or optional in the spec. There are plenty of use cases for markdown where HTML support is not necessary or is only used to work around what is not supported in a particular flavour of markdown.

From the original post with extra emphasis:

Here’s the HTML syntax description:

https://html.spec.whatwg.org/multipage/syntax.html

Here’s the Markdown syntax description:

http://spec.commonmark.org/0.27/

The HTML5 syntax spec specifies everything in excrutiating detail, while CommonMark’s bulk is mostly examples. CommonMark is more readable, if you want to get a general idea of what Markdown’s syntax looks like, but HTML5 probably makes it easier for two people to build a compatible implementation without talking to each other.

Apparently not, as I obviously missed it or at least forgot about it in the last 18 months.

It is naive to think that you can make markdown “safe by default” especially where the most impacted use cases are made invalid. What you’d be doing is providing bad defaults because everyone who wanted to use HTML in markdown then has to enable it, at which point the safety is moot.

I don’t disagree that there are use cases where HTML support is not strictly necessary, but I consider those to be a minority and will continue to do so until you can provide evidence that proves otherwise.

I strongly disagree with the idea of making CommonMark safe by default, because I see it as misguided and a waste of time that’s better served by other existing tools.

If you would like to change my mind, you will need to provide evidence that:

  • disabling the features that you suggest could be optional will not affect 95% of existing markdown usage (I’m willing to be flexible on what constitutes this set, but it obviously can’t be cherry picked)
  • compelling evidence that markdown sanitization is easier and more effective than existing HTML sanitization libraries
  • other commonmark implementors are interested in and willing to support the burden of keeping up with changes to the commonmark spec, specifically with regard to the “safe by default” feature, which could require fast action to respond to known security threats.

Not sure about ‘will not affect’, but take a look at the relatively tiny “safe, whitelisted subset of HTML tags” allowed on SO: What HTML tags are allowed on Stack Exchange sites? - Meta Stack Exchange

If a standardized version of markdown existed that catered for most or all of those functions, SO could skip HTML (and HTML sanitization) entirely. If it also supported tables and footnotes then we might get them on dba.se (Please can we have markdown tables on dba.se - Database Administrators Meta Stack Exchange) and bh.se (Markdown footnotes? - Meta Stack Exchange) but I digress…

I don’t think this is a generally disputed point. jgm described HTML sanitization as “surprisingly hard”, and my suggestion is to parse inline HTML as normal text by default, which is simple. The only difficulty is with other injection points, eg links — where we could white-list schemes to prevent javascript injection (for example). A bit harder but not even close to the difficulties of parsing HTML, never mind sanitizing.

1 Like