AST reference — WRT 3rd party JS consumption

I’m looking to adopt commonmark.js to convert Markdown input into a format that can be consumed by my MVC DOM framework of choice, Mithril.

One of the amusing / frustrating things with any library that abstracts HTML / DOM is that there’s a tendency for people to reinvent JavaScript notation for DOM elements. The de facto JavaScript Markdown parser, pre-Common Markdown, is (AFAICT) markdown-js, which produces JsonML. JsonML is ancient and poorly documented (derelict, you might argue), but I can get a handle on it and reliably convert it to work with Mithril’s virtual DOM notation.

What should I use for reference in trying to get commonmark.js to plug in?

+++ barneycarroll [Apr 05 15 14:25 ]:

What should I use for reference in trying to get commonmark.js to plug in?

Can you say more about what you’re trying to do? The README.md for
commonmark.js gives documentation for the API. If you have more
specific questions, just ask.

Thanks! I want to output some kind of JavaScript object notation, as opposed to HTML strings, that I can manipulate to suit the needs of my framework. markdown-js can produce JsonML. Common Markdown makes references to an ‘AST’, but I’m not sure how to produce it or what it might look like.

+++ barneycarroll [Apr 05 15 21:31 ]:

Thanks! I want to output some kind of Javascript object notation — as opposed to HTML strings — that I can manipulate to suit the needs of my framework. markdown-js can produce JsonML. Common Markdown makes references to ‘AST’, but I’m not sure how to produce it or what it might look like.

If the documentation isn’t enough, try playing with it in the Node.js REPL:

node
> commonmark = require('commonmark');
> reader = new commonmark.Parser();
> doc = reader.parse("Hello *world*");
> doc.type;
> doc.firstChild.type;
> doc.firstChild.firstChild.type;
> doc.firstChild.firstChild.literal;
> doc.firstChild.firstChild.next.type;
> writer = new commonmark.HtmlRenderer();
> writer.render(doc);

And so on. (See the README, which has a complete list of the “public properties” defined for Node objects.)
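
To see the shape of the tree without typing each property by hand, a small helper can print an outline. This is only a sketch: `outline` is a hypothetical helper, not part of commonmark.js, and it relies only on the `type`, `literal`, `firstChild` and `next` properties used in the session above. The `sample` object is a hand-built stand-in so the snippet runs without the library; with commonmark installed you would pass the result of `reader.parse(...)` instead.

```javascript
// Hypothetical helper (not part of commonmark.js): returns one line per node,
// indented by depth, using only type, literal, firstChild and next.
function outline(node, depth) {
  depth = depth || 0;
  var indent = new Array(depth + 1).join('  ');
  var label = node.type;
  if (typeof node.literal === 'string') {
    label += ' ' + JSON.stringify(node.literal);
  }
  var lines = [indent + label];
  for (var child = node.firstChild; child; child = child.next) {
    lines = lines.concat(outline(child, depth + 1));
  }
  return lines;
}

// Hand-built stand-in shaped like the tree for "Hello *world*".
// Type names follow the capitalised style used in this thread.
var world = { type: 'Text', literal: 'world', firstChild: null, next: null };
var emph = { type: 'Emph', literal: null, firstChild: world, next: null };
var hello = { type: 'Text', literal: 'Hello ', firstChild: null, next: emph };
var para = { type: 'Paragraph', literal: null, firstChild: hello, next: null };
var sample = { type: 'Document', literal: null, firstChild: para, next: null };

console.log(outline(sample).join('\n'));
// Document
//   Paragraph
//     Text "Hello "
//     Emph
//       Text "world"
```

The outline makes it easier to see that the tree describes Markdown constructs (emphasis, paragraph) rather than HTML tags.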

The best way to interact with this kind of Node tree is to use the provided tree-walking function. For example, to change all text to uppercase:

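var commonmark = require('commonmark');
var parsed = new commonmark.Parser().parse('Hello *world*');
var walker = parsed.walker();
var event, node;

// walker.next() returns null once the whole tree has been visited.
while ((event = walker.next())) {
  node = event.node;
  // Compare case-insensitively: the type is 'Text' in the release current
  // at the time of this thread and, as far as I know, 'text' in later ones.
  if (event.entering && node.type.toLowerCase() === 'text') {
    node.literal = node.literal.toUpperCase();
  }
}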

If you want to transform the tree in some way prior to rendering it as HTML, then using a walker is usually the way to go.

So the parser consumes Markdown and produces an AST, something which resembles DOM but only acknowledges a few abstract node types. The renderer interprets the AST and writes static HTML, using internal definitions to map node types to tag names. I was lazily hoping I’d missed a trick and could use the lib to produce something which mapped to a DOM subset, but it seems this job is split between the 2 core methods.

Contrast this with npm’s markdown package, where the parser produces a representation of DOM and the renderer converts this to an HTML string. This is more useful in my case since I can take the parser’s output and go. Maybe I’m missing a trick, but it seems to me that there’s an opportunity for splitting commonmark.js’s renderer into 2: one method to interpret the abstract definitions and produce a DOM structure (like JsonML) and another for producing valid HTML from that (which is a solved problem).
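
For illustration, the second of those two stages might look roughly like the sketch below. It assumes the usual JsonML shape of `[tag, attrs?, ...children]` with plain strings for text; `jsonmlToHtml` and `escapeHtml` are hypothetical names, not functions from markdown-js or commonmark.js, and void elements such as `br` are not handled.

```javascript
// Escape the characters that are unsafe in HTML text and attribute values.
function escapeHtml(s) {
  return String(s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Render a JsonML-style value: a string, or [tag, attrs?, ...children].
function jsonmlToHtml(node) {
  if (typeof node === 'string') {
    return escapeHtml(node);
  }
  var tag = node[0];
  var rest = node.slice(1);
  var attrs = {};
  // An optional plain object in second position holds the attributes.
  if (rest.length && rest[0] !== null &&
      typeof rest[0] === 'object' && !Array.isArray(rest[0])) {
    attrs = rest.shift();
  }
  var attrText = Object.keys(attrs).map(function (name) {
    return ' ' + name + '="' + escapeHtml(attrs[name]) + '"';
  }).join('');
  return '<' + tag + attrText + '>' +
    rest.map(jsonmlToHtml).join('') +
    '</' + tag + '>';
}

console.log(jsonmlToHtml(['p', 'Hello ', ['em', 'world']]));
// <p>Hello <em>world</em></p>
```

The point is only that this half is mechanical once a DOM-like structure exists; the interesting half is getting from the abstract node types to that structure.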

Are there many tools that make use of the distinction between the 2 methods? Maybe exploring other people’s usage can help me rationalise the intended purpose.

+++ barneycarroll [Apr 06 15 10:50 ]:

Contrast this with NPM’s markdown where the parser produces a representation of DOM and the renderer converts this to an HTML string. This is more useful in my case since I can take the parser’s output and go. Maybe I’m missing a trick, but it seems to me that there’s an opportunity for splitting commonmark.js’s renderer into 2: one method to interpret the abstract definitions and produce a DOM structure (like JsonML) and another for producing valid HTML from that (which is a solved problem).

DOM is HTML-specific, and our AST is meant to be more format-independent than that. Admittedly the primary use for the JavaScript implementation is to produce HTML, but since this is a reference implementation, the structure of the AST mirrors that used in the C implementation, cmark, which already supports multiple output formats (so far HTML, an XML format, groff man, and CommonMark itself). It would be trivial to add more output formats to commonmark.js, and I’ll probably do that in time.

If you want an abstract representation of DOM, it’s not hard to build one up by walking the tree of Nodes (using the walker structure as described). If you implemented this, we might want to consider merging it in as a separate “renderer.”
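
A rough sketch of that idea follows. Several things here are assumptions rather than anything commonmark.js provides: `toVdom` and `childrenOf` are hypothetical names, the `{ tag, attrs, children }` output shape is my understanding of what Mithril’s virtual elements look like, and only a handful of node types are mapped. For brevity it recurses over `firstChild` / `next` instead of using the walker, and it lowercases type names so that both `Emph` and `emph` match. The `sample` tree is hand-built so the snippet runs without the library.

```javascript
// Sketch only: maps a few node types to tag names.
var TAGS = { paragraph: 'p', emph: 'em', strong: 'strong' };

// Convert child nodes by following firstChild / next links.
function childrenOf(node) {
  var out = [];
  for (var child = node.firstChild; child; child = child.next) {
    out.push(toVdom(child));
  }
  return out;
}

// Hypothetical converter: returns strings for text nodes and
// { tag, attrs, children } objects for everything else.
function toVdom(node) {
  var type = node.type.toLowerCase();
  if (type === 'text') {
    return node.literal;
  }
  if (type === 'code') {
    return { tag: 'code', attrs: {}, children: [node.literal] };
  }
  if (type === 'link') {
    return { tag: 'a', attrs: { href: node.destination }, children: childrenOf(node) };
  }
  if (type === 'header' || type === 'heading') {
    return { tag: 'h' + node.level, attrs: {}, children: childrenOf(node) };
  }
  // The document root and any unmapped type fall back to a div wrapper.
  return { tag: TAGS[type] || 'div', attrs: {}, children: childrenOf(node) };
}

// Hand-built stand-in shaped like the tree for "Hello *world*".
var world = { type: 'Text', literal: 'world', firstChild: null, next: null };
var emph = { type: 'Emph', literal: null, firstChild: world, next: null };
var hello = { type: 'Text', literal: 'Hello ', firstChild: null, next: emph };
var para = { type: 'Paragraph', literal: null, firstChild: hello, next: null };
var sample = { type: 'Document', literal: null, firstChild: para, next: null };

console.log(JSON.stringify(toVdom(sample)));
// {"tag":"div","attrs":{},"children":[{"tag":"p","attrs":{},"children":["Hello ",{"tag":"em","attrs":{},"children":["world"]}]}]}
```

A real converter would also need lists, block quotes, code blocks, images, line breaks and raw HTML, which is where most of the decisions about a “policy” would come in.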

Addendum: EmptyStar has a fork of commonmark.js that does something like what I just outlined, building up an abstract representation of HTML from the Node tree, and rendering this with a “policy.” See https://github.com/jgm/commonmark.js/issues/6

The idea that DOM is a subset of HTML is refreshing to say the least :wink: