Make CommonMark safe by default?

I don’t think this is a generally disputed point. jgm described HTML sanitization as “surprisingly hard”, and my suggestion is to parse inline HTML as normal text by default, which is simple. The only difficulty is with other injection points, eg links — where we could white-list schemes to prevent javascript injection (for example). A bit harder but not even close to the difficulties of parsing HTML, never mind sanitizing.

1 Like

I can’t imagine there would be security threats that ever needed changes to the spec if inline HTML was treated as text, and we white-listed safe link schemes.

I endorse this suggestion. CommonMark should be safe by default. The underlying assumption should be that it runs in a browser and that it has no ability to modify execution context or implement an XSS attack. If you want features that require more lax-execution then those checks should be turned down. But, the default should be security in the context of a Javascript enabled browser.

1 Like

Html should be an extension, because I really only see 2 use cases for CommonMark

  1. Static sites where only trusted developers can edit things, thus full html is fine
  2. Sites where markdown is used for formatting for many untrusted users, html is not fine

theoretically there may be a 3rd case where CommonMark is currently not sufficient for 2 but I think those are probably just features CommonMark needs to add additional formatting and features to the core. If there are cases where html is needed but not full html, that could be handled by the extension, in its spec.

URI’s of course will need to have safe as part of the core spec.

I would argue that any site that is currently “cleaning” html output and claims any kind of CommonMark compliance is currently in violation of the spec. The spec as written says that html is allowed, and includes no restrictions. So if this site doesn’t allow the following code to render if pasted into this form, then it’s not actually CommonMark compliant as the spec is written.

<iframe width="560" height="315" src="https://www.youtube.com/embed/dQw4w9WgXcQ?autoplay=1&rel=0" frameborder="0" allowfullscreen></iframe>

<script>alert('hello')</script>
1 Like

Do you want to add a Markdown syntax for the <details> tag? For tables with colspan attributes? For <dl>? For defining anchor points (<a name="point">)? For comments? For defining abbreviations inline? For exponentiation? For nested tables? If the answer to any of these questions is “no”, then we’re going to need embedded HTML, because I’ve seen all of those things used in GitHub.

Markdown’s advantage is that, by supporting a small number of features, it can pretty much assign each feature it’s own special syntax, giving very good support for the 80% part of the 80/20 rule. But the more features Markdown grows, the harder it gets to parse, because all those different syntaxes interact in very ugly ways. HTML, because (almost) all of the tags are syntactically identical, can add new features with very little effect on the parser.

It’s the same reason Discourse’s quotebox syntax is done with BBCode. It’s easier to add a lot of features to a list of English words than to come up with a parseable, typeable ASCII art representation.

2 Likes

Why do you need those things on github? I would question if markdown is really the right tool for the arbitrary user input. But if it’s useful enough… I have seen useful extensions for tables that work well, do they support custom attributes? no but custom attributes also get people in trouble, did you remember to clean your style, onclick, and onmouseover attributes (i’m sure there’s other ones too)? if not I can write an exploit.

Having it as an extension means that people can allow it if they want, and add it, but that commonmark implementation without extension would be safe to have out of the box. Note: hugo, for example was ok with my embedded <script></script> but because it’s static html generation that’s fine… I wouldn’t want this on a site where arbitrary people can do it.

you can either have html and not have it be secure, define only a safe subset of html (hard), or not have html at all. Currently the spec defines the first option.

Issues are technical documents (so it’s no surprise they want tables), and pull requests are a primary user interface for a lot of third-party programs (greenkeeper, reviewable, gitcop, bors, bonnyci, dangerci, and a bunch of project-specific one-off stuff) that seem to love the <details> tag.

What you’re describing is a blacklist. Don’t use a blacklist. Use a whitelist.

3 Likes

If you disable HTML, markdown does not require sanitizing at all. Because it produce correct HTML with properly paired tags.

Sanitizer needed after you generated bad markup. Better approach is to not accept html tags and to not generate bad markup at all.

1 Like

Most of the time I don’t need <dl>, but when I need it, I feel terrible without it. Allowing HTML may be optional for a document, but should never be optional for a CommonMark library.

1 Like

A couple of related points to all of this. First, Markdown follows Perl’s motto of there’s more than one way to do it. So both the HTML (e.g. <ol><li>Example</li></ol>) and more concise Markdown syntax (e.g. 1. Example) are just as valid. If you want to lazily paste in the more verbose HTML list that is fine; by design, Markdown gives you the freedom to use the best syntax for the situation. Second, Markdown prioritises freedom over security, expecting the author to be responsible enough not to write dangerous markup. That’s not appropriate for most web applications, but for documents where the author controls the full solution it often is.

2 Likes

[clicking this will run JavaScript code in your browser](javascript:alert%28%22XSS%20attack%22%29)

Markdown does not, and should not, have its own syntax for every typographical trick that web authors want access to. Every feature can’t have its own snowflake syntax.

1 Like

All such issues in markdown-it were solved many time ago. It’s safe by default.

That’s very generic statement, good for ideal world or for spec development. In real world it’s more easy to add couple of extensions instead of enabling raw html.

2 Likes

In the real world, I need to write documentation that can be viewed in our app’s online documentation viewer and can be previewed in GitHub (because writers don’t want to have to decipher broken markdown diffs in the pull request UI). That means any extensions that we’re going to use have to be in GitHub and in the app.

Or just use HTML for those rare cases. I write <a class=next href=page2.html>next</a>, it gets the next-button styling in our app, but GitHub just strips the class attribute and renders it like a link. It’s because an HTML parser can parse a tag without recognizing it (and, in a sandboxed environment like GitHub, strip anything it doesn’t understand). Markdown syntax can’t do that.

Edit: to be clear, I don’t really care that it’s HTML per se. What really matters is that it’s a way to extend Markdown that doesn’t look like garbage in oblivious viewers. But HTML has the big advantage of already existing, already having tooling, and already being implemented by most major Markdown vendors (except Reddit, so annoying).

2 Likes

I think that’s out of this thread scope. It’s not about HTML, but about default safety. Topics about syntax extensions already exist at this forum.

Please explain how you plan to address extensions to markdown with this “safe by default” concept. Each new extension then needs to address this extra “safe by default” concept, which could cause significant divergent behavior and security bugs that would need to be addressed.

  • The proposal was to remove inline HTML from the core Markdown spec, along with a few other changes to make it safe to embed (disallow JavaScript links, for example).

  • People responded by pointing out places where inline HTML is used, including cases where inline HTML is used in embedded contexts (like GitHub). Many of these uses of HTML are impossible to achieve with plain Markdown.

  • The counterargument that these missing features are a deficiency of Markdown was brought up. In other words, it’s an assertion that there are viable alternatives to inline HTML.

  • And there are. If inline HTML is removed, the missing features can be added back by:

    1. Just adding normal, markdown-ish punctuation-based syntax for the missing features. The resulting language would look like WikiCode or Perl. pls no.
    2. Extending the language with keywords. This would look like what Discourse does with it’s BBCode quote boxes. If you like, you could make the syntax identical to HTML.
2 Likes

I think @notriddle’s point is that it’s useful to have HTML available for when extensions aren’t available. We could add class=next to a document with consistent attribute syntax using [next](page2.html){ class: next } or similar, but GitHub does not support this extension and may not ever, so the { class: next } part is going be included in the output which looks bad. With raw HTML, GitHub will strip away the class attribute. It’s a form of progressive enhancement. If HTML was disabled entirely, then a link with a class attribute couldn’t be added to the source document at all. Links could only be added with Markdown syntax, e.g. [next](page2.html) which cannot have a class attribute.

1 Like

See this Make CommonMark safe by default? and below. I pointed first, that security questions should be addressed to implementation instead of spec.

1 Like

While I agree with the desire to make it safe by default, I also see why you don’t want to do that.

What do you think of adding feature keynames to the spec, and then also defining a few common “modes” in the spec that define a whitelist or blacklist of features? I just wrote up a propose for this in Standard keyname for features

still want, I’m not saying html should be straight up banned… I just think allowing it should be an optional part of the spec, currently AFAIK stripping any part of it is against the spec, and thus sanitizing it in any way means your parser is not CommonMark compliant.