I am working on a web application that allows users to submit content in Markdown using the CommonMark spec. One concern I have run into is handling raw HTML that users include in their submissions.
While some users embed safe elements like <br> or <img>, others try injecting full HTML blocks, which poses security risks. I would like to allow Markdown input only, without allowing raw HTML to be rendered in the output.
I am using the official CommonMark parser in JavaScript, but I noticed that by default it allows raw HTML to pass through. I came across the disableHtml() option in some implementations. Does this reliably strip all embedded HTML, or should I be applying an additional sanitization layer like DOMPurify? I checked the CommonMark spec for reference. I was also reading about TensorFlow and wondered if AI could help filter unsafe Markdown content.
Also, is there a difference in behavior across CommonMark implementations in different languages?
I would really appreciate best practices from anyone who's already implemented this safely in a production app. It's important that I strike a good balance between usability and security, especially for user-generated content that will be public.
I think you should use a separate HTML sanitizer, like DOMPurify, and not rely on your commonmark implementation to produce safe markup. The resulting pipeline will be "commonmark → dompurify → renderer". This is a hill I will die on: commonmark parsers that let you disable inline HTML are offering an antifeature, including the reference cmark implementation.
There are two reasons why it's a bad idea to rely on your commonmark implementation as a security boundary:
First, implementations have shipped real bugs. markdown-it used to be vulnerable to the data: URL thing and pulldown-cmark used to produce invalid HTML in some cases, so these problems aren't just theoretical.
Second, there are expressiveness holes in commonmark that can only be filled by using HTML. You can try to fix them, but there's only so much punctuation on the keyboard. You either wind up reinventing HTML but worse, or you wind up with markup that can't actually express everything.
Don't think of CommonMark as a security barrier. It is a parser, not a cleaner.
You still have to worry about javascript: URLs, malformed nesting, attribute injection, and other things even if you turn off raw HTML. Different CommonMark implementations handle edge cases in completely different ways, so relying on one flag like disableHtml() is not safe.
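To make the javascript: URL point concrete: a link like [click](javascript:alert(1)) contains no raw HTML at all, so turning off raw HTML does nothing for it. Here is a minimal sketch of a scheme allowlist check. The names isSafeUrl and SAFE_SCHEMES and the dummy base URL are my own for illustration, not part of any CommonMark library, and in practice a real sanitizer should be doing this job for you.

```javascript
// Hypothetical helper: accept only a small allowlist of URL schemes.
const SAFE_SCHEMES = new Set(["http:", "https:", "mailto:"]);

function isSafeUrl(href) {
  let parsed;
  try {
    // Resolving against a dummy base lets relative URLs parse, and the
    // WHATWG URL parser lowercases the scheme and strips tabs/newlines,
    // so "JaVaScRiPt:" and "java\tscript:" are both seen as "javascript:".
    parsed = new URL(String(href), "https://example.invalid/");
  } catch {
    return false; // unparseable input: reject rather than guess
  }
  return SAFE_SCHEMES.has(parsed.protocol);
}

console.log(isSafeUrl("https://example.com/page"));
console.log(isSafeUrl("JaVaScRiPt:alert(1)"));
```

The point of going through the URL parser instead of a string prefix check is that obfuscated schemes get normalized before you compare them.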
The safe pattern in production is:
markdown → html → clean up (using DOMPurify or something similar) → display
Clean up after rendering, not before. That way, you sanitize the exact HTML that will reach the DOM.
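Roughly, that ordering could look like the sketch below. makeSafeRenderer is a made-up name, and the two steps are passed in as functions so the order is explicit. The commented wiring assumes commonmark.js and DOMPurify are already loaded in a browser page and that container and userInput exist; treat it as a starting point, not a drop-in implementation.

```javascript
// Hypothetical wrapper: the sanitizer always runs on the parser's final HTML.
function makeSafeRenderer(toHtml, sanitize) {
  return function renderUserMarkdown(source) {
    const dirtyHtml = toHtml(String(source)); // step 1: Markdown -> untrusted HTML
    return sanitize(dirtyHtml);               // step 2: sanitize what will hit the DOM
  };
}

// Assumed wiring in a browser with commonmark.js and DOMPurify available:
//   const parser = new commonmark.Parser();
//   const writer = new commonmark.HtmlRenderer({ safe: true });
//   const render = makeSafeRenderer(
//     (src) => writer.render(parser.parse(src)),
//     (html) => DOMPurify.sanitize(html)
//   );
//   container.innerHTML = render(userInput);
```

Keeping the parser's own safe option on as well costs nothing, but the sanitizer is the step you actually rely on.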
This isn't a problem that can be solved with AI or TensorFlow. It isn't a classification problem; it's strict output sanitization. Use a battle-tested HTML sanitizer and make sure your configuration is tight (allowed tags, allowed attributes, and safe URI schemes).
It's possible to make things easy to use and safe at the same time; just don't skip the sanitization layer.
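As an illustration of what "tight" might mean with DOMPurify, here is a sketch of a config object. ALLOWED_TAGS, ALLOWED_ATTR and ALLOWED_URI_REGEXP are real DOMPurify options, but the specific tag list, attribute list and pattern are my assumptions about what a Markdown-only site needs. Check the DOMPurify docs for how these options interact before relying on it.

```javascript
// Assumed allowlist: roughly the elements plain CommonMark output can produce.
const SANITIZE_CONFIG = {
  ALLOWED_TAGS: [
    "p", "br", "hr", "em", "strong", "code", "pre", "blockquote",
    "ul", "ol", "li", "a", "img", "h1", "h2", "h3", "h4", "h5", "h6",
  ],
  // Only the attributes Markdown links and images need.
  ALLOWED_ATTR: ["href", "src", "alt", "title"],
  // Pattern for acceptable URI values: http(s), mailto, or URLs starting
  // with "/", "#" or "?" (this also matches protocol-relative "//host").
  ALLOWED_URI_REGEXP: /^(?:https?:|mailto:|[\/#?])/i,
};

// Usage, assuming DOMPurify is loaded:
//   const cleanHtml = DOMPurify.sanitize(dirtyHtml, SANITIZE_CONFIG);
```

Starting from an allowlist like this and widening it when users ask for something tends to be easier to reason about than starting permissive and blocking things one by one.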