Remove HTML passthru/pass through


#1

I really like Markdown. My understanding of its goal is to take plaintext that a human wrote and intended for other humans to be able to read directly, and to translate markings indicating emphasis and the like into HTML. User-provided text would never naturally have any HTML markup in it. So, when something that looks like HTML does appear, the spec should render it as it appears in the plain text rather than pass it through as markup; that is, it should be escaped.

One scenario I am thinking of is handling of stuff that looks like it could be HTML. Markdown loves to swallow constructs like Someone <else at whatever dot com> in the name of HTML passthru. Instead, it should escape that < so that the end reader sees it as a literal angle bracket. Another example might be a very simple math assertion with no spaces: 0.01<p.
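
To make the desired behavior concrete, here is a minimal Python sketch of the escaping being asked for. It is not how any commonmark implementation works; the helper name `show_literally` is made up for illustration, and it only shows what the two examples above would have to become for the end reader to see them as typed.

```python
from html import escape

def show_literally(text: str) -> str:
    """Escape &, < and > so a browser displays the text exactly as typed."""
    # quote=False leaves " and ' alone; they only matter inside attribute values.
    return escape(text, quote=False)

print(show_literally("Someone <else at whatever dot com>"))
# Someone &lt;else at whatever dot com&gt;
print(show_literally("0.01<p"))
# 0.01&lt;p
```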

Instead of trying to make it so that commonmark can be used as a filter for any existing HTML document, it should be targeted at actual text. People here seem to be of the opinion that one could interlace plaintext and HTML blocks, run that through a commonmark system, and then have a document with rich text. I think this adds a lot of unnecessary complexity to commonmark. Supporting HTML passthru requires the < character to be escaped in many situations which the human writing plain text could not anticipate. I would propose that instead of running a commonmark processor on the entire HTML document, the commonmark processor be run on the textContent of nodes in the document that opt into markdown using some proprietary means. If a document is already HTML, why would you be wanting to run markdown/commonmark on it in the first place?

Another reason I’d like this is to get rid of the “garbage in, garbage out” idea. When I have a plaintext document and pass it through markdown, I want to get “structurally valid XHTML (or HTML)” (Why did it take me so long to realize that markdown’s promise was never meant to be kept in the first place?). If we can just rip out support for HTML passthru, there no longer is a reason for commonmark to output invalid XML fragments. Outputting structurally valid XML fragments would mean no more dealing with tracking down an unclosed tag, etc.

I don’t have the need to have commonmark passthru HTML. Most use cases I can imagine, such as this very post, user comments in various places, etc., have no need for HTML passthru and would end up sanitizing such stuff out anyway. Sure, disabling passthrough would not make it safe by default because users could still specify link and image URIs, but it would be safer than it is now. And it could be guaranteed to output valid XML (or even XHTML?) fragments—no more need to go parsing something so ambiguous and quirky as HTML/SGML…

Sorry for the rant post.


#2

I’m sympathetic. Raw HTML pass-through is not my favorite feature of Markdown, but it is a feature, and since our aim was to give a spec for Markdown, not invent something completely different, we have to deal with it.

Regarding your suggestion to run the Markdown parser on the contents of text nodes produced by an HTML tokenizer: this wouldn’t work, for a variety of reasons. First, the HTML5 tokenization algorithm will tokenize <else at whatever dot com> as [TagOpen "else" [("at",""),("whatever",""),("dot",""),("com","")]] (assuming Haskell’s tagsoup library implements it correctly, as it claims to). So this wouldn’t get recognized as a text node. Second, the HTML tokenizer won’t know to skip Markdown code blocks and code spans. Third, sometimes a Markdown delimiter that starts in one text node would only be matched by a closing delimiter in another text node, as in *hello <br> there*.
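
The three problems can be reproduced roughly with Python's standard `html.parser`. This is only a stand-in: it is not a conforming HTML5 tokenizer and it is not the tagsoup library mentioned above, and the `EventLog` class and `tokenize` helper are names invented for this sketch.

```python
from html.parser import HTMLParser

class EventLog(HTMLParser):
    """Record start tags and text nodes in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("tag", tag, attrs))

    def handle_data(self, data):
        self.events.append(("text", data))

def tokenize(source):
    log = EventLog()
    log.feed(source)
    log.close()  # flush any text the parser is still buffering
    return log.events

samples = [
    "Someone <else at whatever dot com>",  # 1: read as a tag, not as text
    "use `<b>` for bold",                  # 2: code span is not respected
    "*hello <br> there*",                  # 3: emphasis split across text nodes
]
for sample in samples:
    print(tokenize(sample))
```

In the first sample the bracketed part is reported as a start tag named `else` with four valueless attributes, so it never reaches a Markdown parser that only looks at text nodes. In the second, the `<b>` inside the code span is reported as a tag. In the third, the opening `*` and the closing `*` land in two separate text nodes.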


#3

HTML pass-through is the only way to create e.g. definition lists in CML. Are there other features where the only option is to use HTML pass-through? For such use cases, HTML pass-through would be sorely missed if it were removed.

On the other hand, a document that relies on such HTML workarounds, e.g. <dl>, is now tied to HTML-only output.

And in the above sentence, I first typed the literal <dl>, and not &lt;dl&gt;, proving the OP’s point… :wink:


#4

> the HTML tokenizer won’t know to skip Markdown code blocks and code spans.

You’d insert the markdown into the HTML the same way you’d put pure text in: by escaping it before stitching it together with other HTML, by using a DOM API to create a text node for it, or by using a construct like <![CDATA[ ]]> (taking care to escape any ]]> within it).

> Raw HTML pass-through is not my favorite feature of Markdown, but it is a feature, and since our aim was to give a spec for Markdown, not invent something completely different, we have to deal with it.

Right. For CommonMark to be what it is trying to be—a spec that clarifies but doesn’t redefine markdown—HTML passthru must be there.

> HTML pass-through is the only way to create e.g. definition lists in CML

If people have developed a common way of expressing terms and definitions in plain text, the result, if not wrapped in the appropriate semantic HTML, would at least be readable by humans. And a future markdown or commonmark could start recognizing such a syntax at some point. I see that there are other discussions in this forum hoping for commonmark to define a syntax for table creation. Yes, eliminating the ability to use the features of HTML which do not yet have an analogue in user-friendly commonmark syntax would be an issue. But if the user can still convey the data in a readable way in pure text form without the use of the HTML feature, it might be acceptable to people who do not want to expose content authors to HTML. And I think tables are a bigger deal than definition lists because the most natural way to display tabular data when authoring plain text would probably assume fixed width fonts which would mean abusing code blocks…

Currently I am using the following for my particular use case:

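using System.Text.RegularExpressions;
using CommonMark; // CommonMark.NET

public string ToHtml(string terms) => CommonMarkConverter.Convert(
    // Disable HTML passthru (because I don’t like it) by escaping every
    // “<” before parsing. The autolink opener then has to be restored by
    // hand so that things like “<http://blah.org>” still work. Of course
    // this breaks input that already contains “&lt;http:”, such as
    // “`&lt;http://`”.
    Regex.Replace(
        terms.Replace("<", "&lt;"),
        "&lt;(https?:)",
        "<$1"));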

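For anyone who wants to poke at this without a .NET toolchain, here is a rough Python port of just the preprocessing step; the call into the commonmark converter is left out, and the function name `escape_then_restore` is invented for this sketch. It shows both the intended effect and the failure mode noted in the comment above.

```python
import re

def escape_then_restore(terms: str) -> str:
    """Escape every '<', then undo the escape where it opens an http(s) autolink."""
    escaped = terms.replace("<", "&lt;")
    # Put the '<' back only in front of 'http:' or 'https:'.
    return re.sub(r"&lt;(https?:)", r"<\1", escaped)

print(escape_then_restore("see <http://blah.org>"))  # autolink survives
print(escape_then_restore("0.01<p"))                 # stray '<' is escaped
print(escape_then_restore("`&lt;http://`"))          # entity typed by the author gets unescaped
```

The last call is the fragile case: the author typed the entity on purpose inside a code span, and the restore step cannot tell that apart from an entity the first step produced.
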
I have seen some other discussion in this forum of trying to modularize commonmark so that features can be more easily mixed and matched. I think HTML passthru would be a great candidate for being optional. Website template authors might like the magical behavior of <div> followed by a blank line allowing them to interlace rich HTML and quickly-written markdown text. Users contributing content to a website appreciate that their naturally written plaintext magically gets a facelift when *However* renders as “However”, while these same users would find it confusing if <div mysteriously disappears along with the remainder of the line.

So, would a “user-friendly” variant of commonmark be something that commonmark should define, or should that be left up to individual website authors?


#5

Heh, and by the title of “Tables in pure Markdown”, how would one define “pure markdown”? Isn’t HTML passthru part of pure markdown? :wink:


#6

There are two ways one might disable raw HTML passthrough:

  1. Add an option that modifies the behavior of the parser, so that raw HTML blocks and raw HTML inline are not recognized; < is just parsed as a regular character (except in autolinks etc.).

  2. Add an option that modifies the behavior of the renderer, so that raw HTML blocks and raw HTML inline are output as escaped plain text (<b> is rendered as &lt;b&gt; for example).

(2) is fairly easy to achieve with current reference implementations; you just need to define a custom renderer. It doesn’t require any parts of the spec to be changed (or marked as optional), since it affects rendering only and not parsing. I’d be much more reluctant to add options that affect parsing, since that is getting away from the goal of avoiding fragmentation in Markdown parsing. So one has to ask whether (2) would not be enough for your needs.
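
Option (2) can be pictured with a toy renderer. The sketch below is not the API of any reference implementation: the `(kind, literal)` node pairs, the `render` function and its `escape_raw_html` flag are all made up for illustration, and a real AST is much richer. The point is only that the same parse result can be rendered either way.

```python
from html import escape

# Hypothetical, already-parsed inline nodes: (kind, literal) pairs.
nodes = [
    ("text", "This is "),
    ("html_inline", "<b>"),
    ("text", "bold"),
    ("html_inline", "</b>"),
    ("text", " & more"),
]

def render(nodes, escape_raw_html=False):
    """Render nodes to HTML; optionally emit raw HTML nodes as escaped text."""
    out = []
    for kind, literal in nodes:
        if kind == "text":
            # Ordinary text is always escaped.
            out.append(escape(literal, quote=False))
        elif kind == "html_inline":
            # Default: pass through untouched. With the flag: escape it.
            out.append(escape(literal, quote=False) if escape_raw_html else literal)
        else:
            raise ValueError(f"unknown node kind: {kind}")
    return "".join(out)

print(render(nodes))                        # This is <b>bold</b> &amp; more
print(render(nodes, escape_raw_html=True))  # This is &lt;b&gt;bold&lt;/b&gt; &amp; more
```

Because only the rendering step changes, the parser's decisions about where raw HTML starts and ends stay exactly as the spec defines them.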


#7

The main case where I think this approach would be insufficient is when the commonmark parser enters “no markdown” mode. For example, when I give the dingus <div>*hey*, man!, the rules for HTML block detection are in play. The parser treats *text* differently just because of the block preamble at the beginning of the line. I know this is a contrived example and unlikely to happen in the wild, so it’s only a weak case against method 2.

For my needs, it’d be fine, though I’m accomplishing what I want by preprocessing trusted input right now. Since that is a fully implementation-agnostic approach, it seems cleaner at the moment. However, I guess that relying on the API of a particular commonmark implementation is OK since commonmark’s goal is to make sure all implementations handle the same input similarly. So switching between conformant implementations to access features like the AST (which isn’t part of the spec) should be possible without sacrificing anything. And that solution would be less fragile than my current preprocessing solution.

I still think that commonmark/markdown is a great way to mark up HTML-free text and that it’d be great if the spec could eventually have first-class support for processing human-readable plain text input. Even if it can’t be done in the first stable release of the spec, I hope to be able to eventually disable the recognition of raw HTML along with the garbage-in/garbage-out behavior.


#8

Without HTML pass-through, you can’t use sectioning elements such as section, article, or even div.


#9

Sure, but how often would an author writing text be composing more than one article at a time? Maybe being able to demarcate sections would be useful, but it seems wrong that someone writing text should need to know the HTML syntax for <section/> and especially <div/>. That’d be like using the pstricks TeX package, which defeats the whole point of targeting the DeVice Independent (DVI) output format.

<section/> sounds like something that, maybe, should be implicitly created whenever headers are used. But that’d be a different talk topic ;-). <div/> and <article/> are elements that a web designer (not a document author) might be interested in. My point is that the spirit of markdown/commonmark seems to be to take what people would naturally write in plaintext files and give it a facelift. Allowing HTML passthru by default causes unexpected results, albeit rarely, when passing in certain plaintext that people have written in the past, whether they were targeting plaintext or markdown.


#10

Well, as pointed out earlier, Markdown is what it is, and changing it in fundamental ways is not the job here. Markdown is already widely used for all kinds of purposes. I do write web pages in Markdown.