Make CommonMark safe by default?

jgm · July 13, 2015, 3:59pm

I see this as an implementation issue, not a spec issue. It does not affect parsing, and the spec is about how the input is parsed.

An implementation may choose to provide a “safe mode” that renders raw HTML escaped, or as a comment, or as nothing, and suppresses links with URLs that don’t fit some condition. It may even choose to make this mode the default. As I’ve described, it’s easy to do this at the rendering stage, or by altering the AST before rendering.

Note that markdown-it does exactly this – it disables raw HTML and some links by default. I agree that this is probably a good default for implementations, and I’ll think about doing something similar for cmark and commonmark.js.

EDIT: I’ve added safety options for cmark. Currently they’re not the default. [EDIT: and also now in commonmark.js]

vitaly · July 13, 2015, 5:29pm

Finally

When you add syntax plugins, i will loose the last serious reason to develop markdown-it.

Brian_Lalonde · July 17, 2015, 1:27am

No, outputting to PDF often includes rendered HTML.

xenoterracide · February 26, 2017, 8:40pm

In my project at work XSS (Javascript injection) exploitable code has been a huge problem. I’ve thought the best way to solve this would have been to not save html but use a language like markdown to prevent XSS from being saved. Looking at the spec for CommonMark I realized that allowing HTML does not appear to be optional, for many site’s this could allow exploits. I’ve also noticed that it can’t be used on this site.

I personally think that html input should be an extension and not part of the default behavior. If that isn’t acceptable maybe it can be spec-ed that a valid parser/renderer can disable this feature, or provide an option to enable only safe renderings (clean unsafe tags/attributes).

vitaly · March 4, 2017, 10:56pm

@xenoterracide safety is not spec issue, that’s managed on implementation level. For example, markdown-it is safe by default.

xenoterracide · March 5, 2017, 3:23am

@vitaly I don’t really agree, if a spec doesn’t specify something as optional, disable-able, or an extension, then if an implementation doesn’t support it, it isn’t really conforming to that spec. I do agree that it’s managed at the implementation level, but I think that the spec level can define that it is something that is optional and that implementations should make it disable-able. If you were to allow only partial html support that’s also something that the spec would need to define what that means. Does the spec need to define exactly how to disable it in the implementation? no that is an implementation detail in this case.

vitaly · March 5, 2017, 4:06am

“Optional” in context of spec is something, that can be omitted. HTML can NOT be omitted if you implement markdown parser. IMHO it’s not good idea to add programming issues into spec. That make things complex without added value. If spec does not prevent to implement configurable parsers - it’s good enouth as is.

notriddle · March 14, 2017, 7:44pm

Why are you assuming that sanitizing commonmark would be easier than sanitizing html? In either case, you need to follow the same procedure:

Parse it into an AST.
Remove anything from the AST that doesn’t match your whitelist.
Dump it back out.

Just don’t write your own HTML parser, and definitely don’t try to do it with a regex. Just like you’d sanitize markdown.

Jack_Douglas · March 15, 2017, 8:10am

Parsing HTML into an AST is much harder than parsing Markdown into an AST — and even the best HTML sanitizers contain bugs and vulnerabilities precisely because of that fact.

zzzzBov · March 15, 2017, 1:26pm

You have made this claim before and have yet to provide any evidence to support your argument. HTML is a subset of markdown, unless you’re naively thinking that you can remove the ability for markdown to support raw HTML, at which point you’ve invalidated the majority of use cases that would reap any benefit from HTML sanitization.

I’ve refuted this point already:

Jack_Douglas · March 15, 2017, 4:17pm

Didn’t I make that 100% clear in the original post? I’ve no idea why you would call that “naive” — I’m not invalidating anything, the topic is about whether support is mandatory or optional in the spec. There are plenty of use cases for markdown where HTML support is not necessary or is only used to work around what is not supported in a particular flavour of markdown.

From the original post with extra emphasis:

notriddle · March 15, 2017, 7:34pm

Here’s the HTML syntax description:

https://html.spec.whatwg.org/multipage/syntax.html

Here’s the Markdown syntax description:

http://spec.commonmark.org/0.27/

The HTML5 syntax spec specifies everything in excrutiating detail, while CommonMark’s bulk is mostly examples. CommonMark is more readable, if you want to get a general idea of what Markdown’s syntax looks like, but HTML5 probably makes it easier for two people to build a compatible implementation without talking to each other.

zzzzBov · March 15, 2017, 7:46pm

Apparently not, as I obviously missed it or at least forgot about it in the last 18 months.

It is naive to think that you can make markdown “safe by default” especially where the most impacted use cases are made invalid. What you’d be doing is providing bad defaults because everyone who wanted to use HTML in markdown then has to enable it, at which point the safety is moot.

I don’t disagree that there are use cases where HTML support is not strictly necessary, but I consider those to be a minority and will continue to do so until you can provide evidence that proves otherwise.

I strongly disagree with the idea of making CommonMark safe by default, because I see it as misguided and a waste of time that’s better served by other existing tools.

If you would like to change my mind, you will need to provide evidence that:

disabling the features that you suggest could be optional will not affect 95% of existing markdown usage (I’m willing to be flexible on what constitutes this set, but it obviously can’t be cherry picked)
compelling evidence that markdown sanitization is easier and more effective than existing HTML sanitization libraries
other commonmark implementors are interested in and willing to support the burden of keeping up with changes to the commonmark spec, specifically with regard to the “safe by default” feature, which could require fast action to respond to known security threats.

Jack_Douglas · March 15, 2017, 10:10pm

Not sure about ‘will not affect’, but take a look at the relatively tiny “safe, whitelisted subset of HTML tags” allowed on SO: What HTML tags are allowed on Stack Exchange sites? - Meta Stack Exchange

If a standardized version of markdown existed that catered for most or all of those functions, SO could skip HTML (and HTML sanitization) entirely. If it also supported tables and footnotes then we might get them on dba.se (Please can we have markdown tables on dba.se - Database Administrators Meta Stack Exchange) and bh.se (Markdown footnotes? - Meta Stack Exchange) but I digress…

Jack_Douglas · March 15, 2017, 10:16pm

I don’t think this is a generally disputed point. jgm described HTML sanitization as “surprisingly hard”, and my suggestion is to parse inline HTML as normal text by default, which is simple. The only difficulty is with other injection points, eg links — where we could white-list schemes to prevent javascript injection (for example). A bit harder but not even close to the difficulties of parsing HTML, never mind sanitizing.

Jack_Douglas · March 15, 2017, 10:18pm

I can’t imagine there would be security threats that ever needed changes to the spec if inline HTML was treated as text, and we white-listed safe link schemes.

Evan_Carroll · March 15, 2017, 10:50pm

I endorse this suggestion. CommonMark should be safe by default. The underlying assumption should be that it runs in a browser and that it has no ability to modify execution context or implement an XSS attack. If you want features that require more lax-execution then those checks should be turned down. But, the default should be security in the context of a Javascript enabled browser.

xenoterracide · March 16, 2017, 4:35pm

Html should be an extension, because I really only see 2 use cases for CommonMark

Static sites where only trusted developers can edit things, thus full html is fine
Sites where markdown is used for formatting for many untrusted users, html is not fine

theoretically there may be a 3rd case where CommonMark is currently not sufficient for 2 but I think those are probably just features CommonMark needs to add additional formatting and features to the core. If there are cases where html is needed but not full html, that could be handled by the extension, in its spec.

URI’s of course will need to have safe as part of the core spec.

I would argue that any site that is currently “cleaning” html output and claims any kind of CommonMark compliance is currently in violation of the spec. The spec as written says that html is allowed, and includes no restrictions. So if this site doesn’t allow the following code to render if pasted into this form, then it’s not actually CommonMark compliant as the spec is written.

<iframe width="560" height="315" src="https://www.youtube.com/embed/dQw4w9WgXcQ?autoplay=1&rel=0" frameborder="0" allowfullscreen></iframe>

<script>alert('hello')</script>

notriddle · March 16, 2017, 5:24pm

Do you want to add a Markdown syntax for the <details> tag? For tables with colspan attributes? For <dl>? For defining anchor points (<a name="point">)? For comments? For defining abbreviations inline? For exponentiation? For nested tables? If the answer to any of these questions is “no”, then we’re going to need embedded HTML, because I’ve seen all of those things used in GitHub.

Markdown’s advantage is that, by supporting a small number of features, it can pretty much assign each feature it’s own special syntax, giving very good support for the 80% part of the 80/20 rule. But the more features Markdown grows, the harder it gets to parse, because all those different syntaxes interact in very ugly ways. HTML, because (almost) all of the tags are syntactically identical, can add new features with very little effect on the parser.

It’s the same reason Discourse’s quotebox syntax is done with BBCode. It’s easier to add a lot of features to a list of English words than to come up with a parseable, typeable ASCII art representation.

xenoterracide · March 16, 2017, 6:49pm

Why do you need those things on github? I would question if markdown is really the right tool for the arbitrary user input. But if it’s useful enough… I have seen useful extensions for tables that work well, do they support custom attributes? no but custom attributes also get people in trouble, did you remember to clean your style, onclick, and onmouseover attributes (i’m sure there’s other ones too)? if not I can write an exploit.

Having it as an extension means that people can allow it if they want, and add it, but that commonmark implementation without extension would be safe to have out of the box. Note: hugo, for example was ok with my embedded <script></script> but because it’s static html generation that’s fine… I wouldn’t want this on a site where arbitrary people can do it.

you can either have html and not have it be secure, define only a safe subset of html (hard), or not have html at all. Currently the spec defines the first option.