Although the goal for CommonMark is not to change Markdown but to standardize it, the world has changed since 2004. Making CommonMark safe by default seems to me to be worth making an exception for.
I suggest that all potentially unsafe features of CommonMark be designated optional. Among others, this would include inline HTML and javascript: hrefs in links.
This would mean that safe parsers could claim 100% CommonMark compliance and drop a lot of implementation complexity in the process. On the other hand, services that need inline HTML, like GFM and Stack Exchange, can take on the extra responsibility of making sure the optional and potentially unsafe features are implemented securely.
Move raw HTML to an extension? Given the security implications of raw HTML, I think this opt-in approach should be considered.
Alternatively, there's a case to be made for making CommonMark more modular, keeping the default set of modules close to the original Markdown, but allowing either a subset or a superset (extensions) to be used by the implementation.
As @Crissov notes, if security is your goal, it's not enough to make raw HTML optional (though this might make sense). You'd have to sanitize all URL attributes, too. I did write an implementation once that did this (cheapskate), but I became persuaded that sanitization should really be a separate step that occurs after Markdown processing, using a dedicated and well-tested sanitization library. (Doing it right is surprisingly hard, and I'm not in favor of putting all the details of the sanitization procedure in the spec.)
That's why the CommonMark reference implementations do no sanitization, and include a strong warning about this, and advice to use a sanitizer, in the README.
Note that the reference implementations create an abstract syntax tree. It's relatively easy (using the iterator/tree-walker interface) to walk the AST prior to rendering, removing all raw HTML nodes (or changing them to innocuous comments), and sanitizing URLs in links and images. If you wanted a safer-by-default implementation, it would be easy to achieve in this way.
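For concreteness, here is a minimal sketch of that kind of AST filter in Python. It assumes the commonmark.py port of the reference implementation; the method and attribute names used here (Parser, HtmlRenderer, walker(), node.t, node.destination, unlink()) follow the commonmark.js reference design, so verify them against your library's version. The is_safe_url helper is a hypothetical policy hook, not part of any library:

```python
import commonmark  # assumption: the commonmark.py port of the reference implementation


def is_safe_url(url):
    # Hypothetical policy hook; a real check would be more careful
    # (see the scheme-whitelist sketch further down the thread).
    return url.startswith(("http://", "https://"))


def render_safely(text):
    ast = commonmark.Parser().parse(text)

    raw_html = []
    walker = ast.walker()
    event = walker.nxt()
    while event is not None:
        node, entering = event["node"], event["entering"]
        if entering and node.t in ("html_block", "html_inline"):
            # Collect raw HTML nodes; unlinking mid-walk can confuse the walker.
            raw_html.append(node)
        elif entering and node.t in ("link", "image"):
            if not is_safe_url(node.destination or ""):
                node.destination = ""  # neutralize the URL (or unlink the node)
        event = walker.nxt()

    for node in raw_html:
        node.unlink()  # drop raw HTML from the tree before rendering

    return commonmark.HtmlRenderer().render(ast)


print(render_safely("<script>alert(1)</script>\n\n[click me](javascript:alert(1))"))
```

Nothing in the spec or the parser has to change for this; the filter runs entirely between parsing and rendering.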
The point I'm making is that this kind of safety doesn't require making HTML parsing optional; you can deal with it in the rendering phase or by altering the AST prior to rendering.
I also think you are missing the point about where the responsibility lies: browsers can't be expected to know the difference between "legitimate" javascript (i.e. that which comes from the site itself, e.g. Stack Overflow) and that which comes from user-posted content on that site. The responsibility for sanitizing that content has to lie with the implementers of markdown on that site.
That's an excellent point and I admit I hadn't thought of it. I'd personally prefer embedded HTML to be parsed as normal paragraph text by default (unless you opt in to an optional parsing feature), but sanitizing URLs in links/images in the AST is a much better option - here I think we should do it by default and allow implementations to opt in to unsafe syntax if they must.
This is my main issue: the imbalance in difficulty between full-blown HTML sanitization done after markdown processing and the relatively simple and minor sanitization that could be done before or during markdown processing. I can't find the changelog for HTML Purifier, but I'm betting there is a litany of security fixes in there; CommonMark could sidestep the whole can of worms and save implementers a lot of trouble and maintenance by being "safe by default".
How does doing the sanitization during Markdown processing make it any easier? Besides, we already have libraries for doing HTML sanitization, and these are well tested and well maintained, by people who focus on this topic. Why should we put a half-baked version of this in Markdown processors?
@Jack_Douglas I suppose your point is that if we just remove raw HTML, or parse it as regular text, that reduces the work that needs to be done in sanitizing. You don't need to whitelist tags if there are no tags. But we still have to deal with links; we can't just remove all links. As I recall, it's harder than it might seem to remove all attack vectors here.
At any rate, if you do want an option to remove raw HTML, it's better to do this after the parsing stage, as I suggested. (It could also be escaped and converted to regular text, if you prefer that to removing it.) That guarantees that nothing else in the CommonMark parsing is affected, and it allows the spec to stay exactly as it is.
Because we don't have to understand HTML to make markdown safe - apart from inline HTML, there are only a very small number of entry points for injected code.
Yes, and that's a good thing, don't get me wrong, but because it is so difficult, even well-tested and well-maintained libraries are bound to have bugs. That's fine if you have no option but to use them, but I think we have an easy option not to - at least for sites that aren't concerned about supporting inline HTML or esoteric links.
This is the crucial issue - if it's hard, then my proposal is pointless and we might as well just advise on post-processing sanitization as we do now. However, my takeaway from your earlier comment is that links could be sanitized easily by walking the AST prior to rendering - in which case it should be relatively easy to add a third pass (or an additional stage to link/image parsing) that does the same? Please forgive me if I'm misunderstanding how parsing works here.
Markdown --(parse)--> AST --(filter/sanitize)--> new AST --(render)--> HTML
It's easy to walk the AST and find the links, inspect their URLs, and modify them if needed.
What's harder is determining whether a given URL needs to be modified. Probably you'd need to do the following: run the URL through a real URL parser, and check the protocol against a whitelist. (In Haskell's xss-sanitize, the following list is used: "ed2k", "ftp", "http", "https", "irc", "mailto", "news", "gopher", "nntp", "telnet", "webcal", "xmpp", "callto", "feed", "urn", "aim", "rsync", "tag", "ssh", "sftp", "rtsp", "afs".) If you wanted to be extra safe, you might also check for non-ASCII characters, since people can trick you by using a URL with similar-looking characters to one you want to visit. Anyway, it's a non-trivial task, though admittedly easier than full sanitization. If I wanted to do this, I'd have a look at several existing sanitization libraries to see what they do.
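As a rough sketch of that check (nothing like a complete defense), you could run the URL through a real parser and compare the scheme against a whitelist. This uses Python's urllib.parse and a trimmed-down subset of the xss-sanitize list, purely for illustration:

```python
from urllib.parse import urlsplit

# Scheme whitelist modeled on xss-sanitize's list (trimmed here for brevity).
SAFE_SCHEMES = {"http", "https", "ftp", "mailto", "irc", "news", "xmpp", "feed", "tag"}


def is_safe_url(url):
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        return False
    # Relative URLs have no scheme; absolute ones must use a whitelisted scheme.
    # urlsplit already lowercases the scheme, so "JavaScript:" is caught too.
    return parts.scheme == "" or parts.scheme in SAFE_SCHEMES


print(is_safe_url("https://example.com/page"))  # True
print(is_safe_url("../relative/path"))          # True
print(is_safe_url("JavaScript:alert(1)"))       # False
```

A real implementation would also have to worry about things like encoded characters and the lookalike-character problem mentioned above, which is part of why looking at existing sanitization libraries first is worthwhile.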
Trying to shove security into a markdown processor where it doesn't belong violates the single responsibility principle. If you think a library that's designed purely with security in mind is going to have bugs, what makes you think mixing this complexity into markdown is going to result in fewer bugs?
If you find that there are bugs in your HTML sanitizer, fix the bugs in the HTML sanitizer. Building a new sanitizer is a sure-fire way to introduce more buggy implementations.
Markdown is HTML. If you need to sanitize your HTML, do it as a post-processing step. If an author is so naive as to not understand that they need to sanitize their inputs, they deserve what's coming to them.
Coddling crappy programmers is not a good argument for adding an unnecessary feature.
I get this perspective, I really do, but here's another one: it's a really surprising thing that a text format like markdown allows arbitrary javascript to be included in the output of generated links. I was bowled over when I discovered this.
And as I said, if it's just as complex to sanitize the markdown as it is to sanitize the post-processed output (as you imply), then I agree this proposal is a non-starter. However, I'm still of the opinion that it is easier by orders of magnitude to sanitize the markdown than the HTML output.
Thanks for clarifying, that's really useful. It certainly doesn't sound trivial, but it does remind me that something similar goes on in the spec for autolinks, where a whitelist of schemes already exists. Presumably this whitelist is also there for security (i.e. javascript isn't in the list)?
Actually, the whitelist of schemes for autolinks was intended to help distinguish autolinks from HTML or XML tags. (Sometimes namespaces are used, e.g. <m:math>.) A lot of people on this forum suggested that it would be better to remove the whitelist and just recognize a pattern, and that change is still under consideration. I suppose security was what motivated me not to include javascript in that list, but again, I don't think this is the right place to worry about such things; all links need to be checked at some further step of processing.
This suggestion seems to stem from a very narrow view of Markdown: that it is a way for Internet users to enter content into web applications.
But Markdown is just a way to write structured text, and CommonMark is just a standard way of turning that text into HTML. Some people use Markdown for writing personal notes, others use it as an HTML alternative (like HAML). A tool like Pandoc can even convert Markdown into completely different formats. In these contexts, "safe" Markdown isn't just nonsensical, it can be counterproductive!
I was under the impression that CommonMark was intended to be a standardized Markdown that all applications could use as a baseline before adding their application-specific customizations. But if I'm wrong and it's really intended to be a standardized Markdown for online text input, we should definitely make this change.
Were you bowled over when you discovered that you could include arbitrary JavaScript in markdown?
<script src="malicious.js"></script> is perfectly valid markdown because Markdown is HTML (plus sugar).
Thought experiment:
Consider arbitrary HTML input piped through a sanitizer. This has a level of effort A.
Consider arbitrary markdown input piped through a markdown processor. This has a level of effort B.
Consider arbitrary markdown input piped through a markdown processor, and then piped through a sanitizer. This has a level of effort A + B.
Consider arbitrary markdown input (which I will remind you can produce any HTML input) that is piped through a markdown processor that also somehow sanitizes the markup. How would it be possible to have a level of effort less than A + B? All the complexity of A needs to happen. All the complexity of B needs to happen. For this case the level of effort is A + B + C, where C is the level of effort of unifying two completely separate functionalities into the same codebase.
The only way that A + B could be greater than A + B + C is if C is negative, so how is it possible that unifying two separate functionalities is easier than leaving them separate?
According to the Daring Fireball site, "Markdown is a text-to-HTML conversion tool for web writers". It is used in different ways, but the inline HTML (and unsafe URL) security issues are only relevant if you are outputting HTML. I'd say this is another reason to make them optional rather than baked into the core spec: it moves us away from thinking of markdown as an HTML shorthand.
No, that's a bit more obvious. As has been discussed in this thread already, it's relatively easy to transform inline HTML into comments, code, paragraphs or whatever if you want "safe by default"; the URL issue is significantly thornier and more subtle.
We are talking about different kinds of effort. It is impossible to be sure that arbitrary HTML input piped through a sanitizer is safe, because HTML sanitization is complex (the single-file download of HTML Purifier is roughly 22k lines of code). I'm counting that as "high effort" because even getting the best results you can hope for will include keeping up with HTML Purifier updates, etc.
On the other hand, it is definitely possible to guarantee 100% safety of Markdown-generated HTML for those who want it; it's a "non-trivial" task, but relatively easy when set beside the impossibility of perfect sanitization post-output. We can put the effort in now, at this level, to make CommonMark intrinsically safe by default and encourage the unsafe bits to be opt-in only. Really, I think there are a lot of use cases where HTML is allowed but shouldn't be. Often it's allowed to paper over a relatively small number of "missing" features in markdown itself, but with a good mechanism for optional features baked into the spec, it shouldn't be necessary to poison the well like that.
I think it would be preferable to just allow http and https by default. Maybe I'm really suggesting there should be a "CommonMark Core" spec and a "CommonMark Compatible" spec extended with optional syntax; this is the kind of decision where the two would diverge: Core having the stripped-down, safest, and simplest options, versus Compatible having the options that suit most applications in the wild.