Make CommonMark safe by default?

notriddle · March 17, 2017, 12:29am

Issues are technical documents (so it’s no surprise they want tables), and pull requests are a primary user interface for a lot of third-party programs (greenkeeper, reviewable, gitcop, bors, bonnyci, dangerci, and a bunch of project-specific one-off stuff) that seem to love the <details> tag.

What you’re describing is a blacklist. Don’t use a blacklist. Use a whitelist.
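To make the difference concrete, here is a minimal sketch of the two policies applied to tag names. The tag lists and the function names are hypothetical, chosen only for illustration; a real sanitizer would work on parsed HTML and would check attributes too.

```javascript
// Hypothetical policies for illustration; neither list comes from a real sanitizer.
const BLOCKED_TAGS = new Set(['script', 'iframe', 'object']);
const ALLOWED_TAGS = new Set(['details', 'summary', 'table', 'tr', 'td', 'th', 'dl', 'dt', 'dd']);

// Blacklist: anything not explicitly forbidden gets through.
function passesBlacklist(tagName) {
  return !BLOCKED_TAGS.has(tagName.toLowerCase());
}

// Whitelist: anything not explicitly permitted is rejected.
function passesWhitelist(tagName) {
  return ALLOWED_TAGS.has(tagName.toLowerCase());
}

// A tag the policy author never thought about:
console.log(passesBlacklist('embed'));   // true  (slips through)
console.log(passesWhitelist('embed'));   // false (rejected by default)
console.log(passesWhitelist('DETAILS')); // true
```

The blacklist fails open when something new appears; the whitelist fails closed.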

vitaly · April 1, 2017, 7:35pm

If you disable HTML, markdown does not require sanitizing at all, because it produces correct HTML with properly paired tags.

A sanitizer is needed only after you have generated bad markup. A better approach is to not accept HTML tags and to not generate bad markup at all.

franklinyu · April 3, 2017, 1:32am

Most of the time I don’t need <dl>, but when I need it, I feel terrible without it. Allowing HTML may be optional for a document, but should never be optional for a CommonMark library.

chrisalley · April 3, 2017, 10:12am

A couple of related points to all of this. First, Markdown follows Perl’s motto of there’s more than one way to do it. So both the HTML (e.g. <ol><li>Example</li></ol>) and more concise Markdown syntax (e.g. 1. Example) are just as valid. If you want to lazily paste in the more verbose HTML list that is fine; by design, Markdown gives you the freedom to use the best syntax for the situation. Second, Markdown prioritises freedom over security, expecting the author to be responsible enough not to write dangerous markup. That’s not appropriate for most web applications, but for documents where the author controls the full solution it often is.

notriddle · April 3, 2017, 8:18pm

[clicking this will run JavaScript code in your browser](javascript:alert%28%22XSS%20attack%22%29)

Markdown does not, and should not, have its own syntax for every typographical trick that web authors want access to. Every feature can’t have its own snowflake syntax.

vitaly · April 3, 2017, 8:35pm

github.com/markdown-it/markdown-it/blob/8.3.1/lib/index.js#L35-L40

    function validateLink(url) {
      // url should be normalized at this point, and existing entities are decoded
      var str = url.trim().toLowerCase();

      return BAD_PROTO_RE.test(str) ? (GOOD_DATA_RE.test(str) ? true : false) : true;
    }

All such issues in markdown-it were solved a long time ago. It’s safe by default.
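For readers who want to run the check quoted above: the snippet depends on two regular expressions defined elsewhere in the file. The sketch below supplies stand-ins for them. The patterns are an assumption modeled on what the snippet appears to do, not a verified copy of markdown-it's source.

```javascript
// Assumed patterns: stand-ins for the two constants the quoted snippet references.
const BAD_PROTO_RE = /^(vbscript|javascript|file|data):/;
const GOOD_DATA_RE = /^data:image\/(gif|png|jpeg|webp);/;

function validateLink(url) {
  const str = url.trim().toLowerCase();
  // Reject dangerous schemes, except data: URLs that carry common image types.
  return BAD_PROTO_RE.test(str) ? GOOD_DATA_RE.test(str) : true;
}

// The link from the earlier post is rejected:
console.log(validateLink('javascript:alert%28%22XSS%20attack%22%29')); // false
console.log(validateLink('page2.html'));                               // true
console.log(validateLink('data:image/png;base64,AAAA'));               // true
console.log(validateLink('  JavaScript:alert(1)'));                    // false
```

Note that this particular check is a blacklist of schemes with one whitelisted exception, so it only covers the schemes it names.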

That’s a very generic statement, good for an ideal world or for spec development. In the real world it’s easier to add a couple of extensions than to enable raw HTML.

notriddle · April 4, 2017, 6:34pm

In the real world, I need to write documentation that can be viewed in our app’s online documentation viewer and can be previewed in GitHub (because writers don’t want to have to decipher broken markdown diffs in the pull request UI). That means any extensions that we’re going to use have to be in GitHub and in the app.

Or just use HTML for those rare cases. I write <a class=next href=page2.html>next</a>, it gets the next-button styling in our app, but GitHub just strips the class attribute and renders it like a link. It’s because an HTML parser can parse a tag without recognizing it (and, in a sandboxed environment like GitHub, strip anything it doesn’t understand). Markdown syntax can’t do that.
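The behavior described here, where an unrecognized attribute is dropped but the element itself survives, can be sketched as an attribute whitelist applied to an already-parsed element. The element shape, the `ALLOWED_ATTRS` table, and `sanitizeElement` are all hypothetical; this is not GitHub's actual sanitizer.

```javascript
// Hypothetical per-tag attribute whitelist.
const ALLOWED_ATTRS = new Map([
  ['a', new Set(['href', 'title'])],
]);

// Takes a parsed element { tag, attrs, text } and returns a copy that keeps
// only whitelisted attributes. Returns null for tags that are not whitelisted.
function sanitizeElement(el) {
  const allowed = ALLOWED_ATTRS.get(el.tag);
  if (!allowed) return null;
  const attrs = {};
  for (const [name, value] of Object.entries(el.attrs)) {
    if (allowed.has(name)) attrs[name] = value;
  }
  return { tag: el.tag, attrs, text: el.text };
}

// <a class=next href=page2.html>next</a>, as a parser might represent it:
const input = { tag: 'a', attrs: { class: 'next', href: 'page2.html' }, text: 'next' };
console.log(JSON.stringify(sanitizeElement(input)));
// {"tag":"a","attrs":{"href":"page2.html"},"text":"next"}
```

This only works because the parser can represent an attribute it does not understand, which is the point being made about HTML versus ad hoc Markdown syntax. A real sanitizer would also validate the `href` value.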

Edit: to be clear, I don’t really care that it’s HTML per se. What really matters is that it’s a way to extend Markdown that doesn’t look like garbage in oblivious viewers. But HTML has the big advantage of already existing, already having tooling, and already being implemented by most major Markdown vendors (except Reddit, so annoying).

vitaly · April 4, 2017, 7:23pm

I think that’s outside the scope of this thread. It’s not about HTML, but about default safety. Topics about syntax extensions already exist on this forum.

zzzzBov · April 4, 2017, 9:40pm

Please explain how you plan to address extensions to markdown with this “safe by default” concept. Each new extension then needs to address this extra “safe by default” concept, which could cause significant divergent behavior and security bugs that would need to be addressed.

notriddle · April 4, 2017, 9:47pm

The proposal was to remove inline HTML from the core Markdown spec, along with a few other changes to make it safe to embed (disallow JavaScript links, for example).
People responded by pointing out places where inline HTML is used, including cases where inline HTML is used in embedded contexts (like GitHub). Many of these uses of HTML are impossible to achieve with plain Markdown.
The counterargument that these missing features are a deficiency of Markdown was brought up. In other words, it’s an assertion that there are viable alternatives to inline HTML.
And there are. If inline HTML is removed, the missing features can be added back by:
1. Just adding normal, markdown-ish punctuation-based syntax for the missing features. The resulting language would look like WikiCode or Perl. pls no.
2. Extending the language with keywords. This would look like what Discourse does with its BBCode quote boxes. If you like, you could make the syntax identical to HTML.

chrisalley · April 4, 2017, 9:52pm

I think @notriddle’s point is that it’s useful to have HTML available for when extensions aren’t available. We could add class=next to a document with consistent attribute syntax using [next](page2.html){ class: next } or similar, but GitHub does not support this extension and may never do so, so the { class: next } part is going to be included in the output, which looks bad. With raw HTML, GitHub will strip away the class attribute. It’s a form of progressive enhancement. If HTML were disabled entirely, then a link with a class attribute couldn’t be added to the source document at all. Links could only be added with Markdown syntax, e.g. [next](page2.html), which cannot have a class attribute.

vitaly · April 4, 2017, 10:09pm

See this post in “Make CommonMark safe by default?” and the ones below it. I pointed out from the start that security questions should be addressed to implementations instead of the spec.

pickhardt · July 15, 2018, 7:36pm

While I agree with the desire to make it safe by default, I also see why you don’t want to do that.

What do you think of adding feature keynames to the spec, and then also defining a few common “modes” in the spec that define a whitelist or blacklist of features? I just wrote up a proposal for this in “Standard keyname for features”.

xenoterracide · November 19, 2020, 3:58pm

Still want this. I’m not saying HTML should be straight-up banned; I just think allowing it should be an optional part of the spec. Currently, AFAIK, stripping any part of it is against the spec, and thus sanitizing it in any way means your parser is not CommonMark compliant.