How Can I Safely Render User-Submitted Markdown Without Allowing Raw HTML?

Hello

I am working on a web application that allows users to submit content in Markdown using the CommonMark spec. One concern I have run into is handling raw HTML that users include in their submissions. :innocent:

While some users embed safe elements like <br> or <img>, others try injecting full HTML blocks, which poses security risks. :innocent: I would like to allow Markdown input only, without allowing raw HTML to be rendered in the output.

I am using the official CommonMark parser in JavaScript, but I noticed that by default it allows raw HTML to pass through. I came across the disableHtml() option in some implementations. Does this reliably strip all embedded HTML, or should I be applying an additional sanitization layer like DOMPurify? :thinking: I checked the CommonMark Spec for reference. I was also reading about what TensorFlow is and wondered if AI could help filter unsafe Markdown content.

Also, is there a difference in behavior across CommonMark implementations in different languages? :thinking:

I would really appreciate best practices from anyone who’s already implemented this safely in a production app. It’s important that I strike a good balance between usability and security, especially for user-generated content that will be public.

Thank you!! :slightly_smiling_face:

I think you should use a separate HTML sanitizer, like DOMPurify, and not rely on your commonmark implementation to produce safe markup. The resulting pipeline will be “commonmark → dompurify → renderer”. This is a hill I will die on: commonmark parsers that let you disable inline HTML are offering an antifeature, including the reference cmark implementation.

There are two reasons why it’s a bad idea to rely on your commonmark implementation as a security boundary:

  1. There’s more to safely embedding UGC than just disabling inline and block HTML tags. You also need to filter out javascript: and data: URIs, make sure the resulting HTML is properly nested, and, if you generate id attributes from user input, ensure they don’t collide with ids already used by the host page.

    markdown-it used to be vulnerable to the data: URL thing and pulldown-cmark used to produce invalid HTML in some cases, so these problems aren’t just theoretical.

  2. There are expressiveness holes in commonmark that can only be filled by using HTML. You can try to fix them, but there’s only so much punctuation on the keyboard. You either wind up reinventing HTML but worse, or you wind up with markup that can’t actually express everything.
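
To make the first point concrete, here is a minimal sketch of two checks that a "disable HTML" flag does not cover: a URL scheme allowlist and id namespacing. The names isSafeUrl and namespaceId, the list of allowed schemes, and the user-content- prefix are all illustrative assumptions, not part of any CommonMark library. In production a maintained sanitizer should still do this job; the sketch only shows what kind of work is involved.

```javascript
// Illustrative only: schemes this hypothetical app chooses to allow in links.
const ALLOWED_PROTOCOLS = new Set(["http:", "https:", "mailto:"]);

// Returns true when href is a relative URL or uses an allowed scheme.
// The URL parser lowercases the scheme and strips tabs/newlines, so
// obfuscated forms such as "JaVa\tScRiPt:" still parse as "javascript:".
function isSafeUrl(href) {
  let parsed;
  try {
    // Placeholder base so relative links parse; they inherit "https:".
    parsed = new URL(href, "https://placeholder.invalid/");
  } catch {
    return false; // unparseable input is rejected
  }
  return ALLOWED_PROTOCOLS.has(parsed.protocol);
}

// Prefix user-derived ids so they cannot collide with ids on the host page.
function namespaceId(id) {
  return `user-content-${id}`;
}
```

Parsing the URL rather than pattern-matching the raw string is the design choice here: it avoids having to enumerate every whitespace and casing trick by hand.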


Don’t think of CommonMark as a security barrier. It is a parser, not a cleaner.

You still have to worry about javascript: URLs, malformed nesting, attribute injection, and other things even if you turn off raw HTML. Different CommonMark implementations handle edge cases differently, so relying on one flag like disableHtml() is not safe.

The safe pattern in production is:

markdown → html → clean up (using DOMPurify or something similar) → display

Clean up after rendering, not before. That way, the sanitizer sees exactly the HTML that would otherwise reach the DOM, and nothing gets past it.
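
A rough sketch of that ordering, with the renderer and the sanitizer passed in as functions so the order is explicit. createSafeRenderer and renderUserMarkdown are hypothetical names made up for this example. The wiring in the trailing comment assumes commonmark.js and DOMPurify are already loaded in a page, so adjust it to your setup.

```javascript
// Hypothetical helper: composes a Markdown renderer with an HTML sanitizer.
// markdownToHtml: (string) => string   e.g. a CommonMark renderer
// sanitizeHtml:   (string) => string   e.g. a wrapper around an HTML sanitizer
function createSafeRenderer(markdownToHtml, sanitizeHtml) {
  return function renderUserMarkdown(source) {
    // Step 1: render the untrusted Markdown to (still untrusted) HTML.
    const untrustedHtml = markdownToHtml(String(source));
    // Step 2: sanitize the rendered HTML, which is what would hit the DOM.
    return sanitizeHtml(untrustedHtml);
  };
}

// Possible wiring in a browser (assumes commonmark.js and DOMPurify are loaded):
//   const parser = new commonmark.Parser();
//   const writer = new commonmark.HtmlRenderer();
//   const render = createSafeRenderer(
//     (md) => writer.render(parser.parse(md)),
//     (html) => DOMPurify.sanitize(html)
//   );
//   container.innerHTML = render(userInput);
```

The only way to get output from renderUserMarkdown is through the sanitizer, so a later code change can't accidentally skip that step.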

This isn’t a problem that can be solved with AI or TensorFlow. It isn’t a classification problem; it’s strict output sanitization. Use a battle-tested HTML sanitizer and make sure your configuration is tight (allowed tags, allowed attributes, and safe URI schemes).
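
As an example of what "tight" could look like, here is a sketch of a DOMPurify configuration object. ALLOWED_TAGS, ALLOWED_ATTR, and ALLOWED_URI_REGEXP are DOMPurify option names, but the specific tag list, attribute list, and regular expression below are my own assumptions for a links-and-text-only setup. Check them against the DOMPurify docs and your own requirements before using them.

```javascript
// Illustrative configuration: text formatting, lists, code, and links only.
const SANITIZE_CONFIG = {
  ALLOWED_TAGS: [
    "p", "br", "hr", "em", "strong", "code", "pre", "blockquote",
    "ul", "ol", "li", "a", "h1", "h2", "h3", "h4", "h5", "h6",
  ],
  ALLOWED_ATTR: ["href", "title"],
  // Accepts http:, https:, mailto:, and relative URLs; any other explicit
  // scheme (javascript:, data:, ftp:, ...) fails to match.
  ALLOWED_URI_REGEXP: /^(?:https?:|mailto:|[^a-z]|[a-z+.\-]+(?:[^a-z+.\-:]|$))/i,
};

// Usage (assumes DOMPurify is loaded):
//   const clean = DOMPurify.sanitize(renderedHtml, SANITIZE_CONFIG);
```

Starting from a short allowlist and adding tags only when users actually need them is easier to reason about than starting from the defaults and removing things.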

It’s possible to make things easy to use and safe at the same time; just don’t skip the sanitization layer.
