First steps towards an AST

It would be great if we could create an AST for Markdown. That way, parsers could parse the Markdown into this AST and then generate the HTML from it. This separation of concerns (parsing Markdown vs. outputting HTML) should lead to better debugging, inspection, and testing of Markdown parsers.

So I decided to create an AST in XML. I spent about 4 hours on it.


XML may subjectively be the devil, but it has pretty great support for validation. That is why I chose to write an XML DTD that describes an XML document containing the AST, which should carry all the information necessary to generate the parser’s HTML output.

You should also know that I recognize that XML is incredibly verbose, and I do not actually expect parsers to make an XML file - just to have an internal representation that they can readily dump as this kind of XML file.

I submit this document under the CC-BY license, in addition to the CC-BY-NC-SA license implicit in content posted on the forum.

DTD: http://hastebin.com/zoruwaroja.xml

Here’s an example of use:

EDIT: All these links are dead. Didn’t notice that they get removed after 30 days.
test1.md: http://hastebin.com/zoxaqumazi.mel
test1.xml: http://hastebin.com/veyenuketo.xml <-- abstract syntax tree
test1.html: http://hastebin.com/xuworipoje.xml

(by the way, that test1.md is pretty funny on Babelmark)

test2.md: the empty string
test2.xml:

<document>
</document>

test2.html: the empty string
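
For a non-empty document, the AST might look roughly like the following. This is only an illustrative sketch, say for a level-2 heading followed by a short paragraph; the element and attribute names here are hypothetical, not quoted from the DTD:

<document>
  <header level="2">
    <text>Hello</text>
  </header>
  <paragraph>
    <text>A paragraph with </text>
    <emph>
      <text>emphasis</text>
    </emph>
    <text>.</text>
  </paragraph>
</document>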


Ideally, we could get a tool that would take this AST and render it as output HTML. It doesn’t seem like that should be too hard - link references would be the only part requiring anything other than straight load-in, dump-out.
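
As a rough sketch of how straightforward that rendering could be, here is a minimal XSLT stylesheet over the hypothetical element names used above. It ignores link references and everything else the real DTD covers:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- the document element just passes its children through -->
  <xsl:template match="document">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- a header with level="N" becomes an hN element -->
  <xsl:template match="header">
    <xsl:element name="h{@level}">
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="paragraph">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <xsl:template match="emph">
    <em><xsl:apply-templates/></em>
  </xsl:template>

  <!-- text nodes are dumped as-is; the serializer handles escaping -->
  <xsl:template match="text">
    <xsl:value-of select="."/>
  </xsl:template>
</xsl:stylesheet>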

The next step would be to create a tool that would take the AST and turn it into Markdown. Then we can compare the HTML output of the parsers and the AST->HTML tool and find bugs!


I wonder if JSON could be used instead as an AST container. Or YAML, etc…

An AST is meant to be abstract. @riking is including an XML format if you want it, but it’s not required:

You should also know that I recognize that XML is incredibly verbose,
and I do not actually expect parsers to make an XML file - just to have
an internal representation that they can readily dump as this kind of
XML file.

I would think of it as a “visualization” of the AST, or a proof that the parser is using the correct AST.

How about support for generic syntax extensions?
Generic directives/plugins syntax

Not relevant in this thread. This defines an ‘intermediate’ representation between the ‘source document’ and ‘html’, which would make supporting other outputs, like PDFs, easier to implement.

And therefore it is relevant. If someone wants to extend a parser to cover syntax extensions, they might need an IR to represent the parsed extensions, depending on what those extensions do. So it would be nice if the AST included guidance on what to do in this case, so that different parsers and transformers maintain some kind of consistency.

I don’t know exactly how this should look (and don’t have any experience writing DTDs), but it’s definitely relevant to this thread. For instance, you could define two more elements:

<!ELEMENT blockextension ANY>
<!ATTLIST blockextension name CDATA #REQUIRED>
<!ELEMENT inlineextension ANY>
<!ATTLIST inlineextension name CDATA #REQUIRED>

or maybe it’s possible to define them using an element name extracted from the extension’s @name.

<!ELEMENT block(%name) >
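
Either way, an instance of such an extension node in the AST could carry the extension name plus whatever it parsed, something like this (purely illustrative):

<blockextension name="admonition">
  <paragraph>
    <text>Watch out!</text>
  </paragraph>
</blockextension>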

Yes, the AST definition implies a model of documents that can be expressed with CommonMark. This is also implied by the current specification but not made explicit, so it’s difficult to refer to and to extend.

Generic syntax extensions would provide extensions to the document model and to the AST as well. One could even define different syntax extensions that map to the same AST, e.g. different table syntaxes.

An AST is not mandatory in a parser architecture. In an ideal world with ideal parsers, processing goes through several stages: source -> token stream -> AST -> result. A token stream has a flat structure with open/close commands, while an AST has a tree structure. For example, a one-line heading might become the flat sequence heading_open, text, heading_close, whereas an AST nests the text node inside a heading node.

No intermediate stage/representation is mandatory. An AST is more convenient for abstract manipulations, while a token stream is more convenient for writing high-speed code. You can take a look at remarkable. It’s not yet complete, but it’s a good example of a parser that can live without a classic AST. As a result, it’s faster than the reference JS implementation, even with the overhead of a pluggable architecture (to allow easy syntax extension).

I’m not against an AST, but it should not be a mandatory requirement. Both users and developers need to have a choice between universality, simplicity, and efficiency.

Yes, an AST should not be mandatory. But if implementations build on an AST and expose this AST (as an input/output format, for plugins/filters/etc.), then it should be a standard form of CommonMark AST.


Some suggested standard way to represent an AST, if one is used, would probably be a good idea to help standardize parser+renderer implementations. I don’t really see why XML, though; JSON seems like it would be perfectly suited for this.

Well, XML has the advantage of having very good validation and tooling support. You could then create a CommonMark DTD/XML-Schema/RELAX NG to validate the output of a parser or use XSLT or other processing tools to transform it into other formats.

Not that the AST necessarily has to be represented as XML, but if you want an output format, XML has some benefits.

Maybe there are similar tools for JSON.
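
To make the validation point concrete, a fragment of such a schema could, as a DTD, look something like this (hypothetical element names again). A validator would then reject, for example, a header without a level attribute:

<!ELEMENT document (header | paragraph)*>
<!ELEMENT header (text | emph)*>
<!ATTLIST header level (1|2|3|4|5|6) #REQUIRED>
<!ELEMENT paragraph (text | emph)*>
<!ELEMENT emph (text)*>
<!ELEMENT text (#PCDATA)>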


I’m looking forward to creating a library that produces PDF booklets/HTML/EPUB/… from CommonMark, and the AST looks like a perfect fit for my use case. So, has any standardization effort been made since last September? Or should we just expect the AST to remain the same?

On the issue of validation, maybe something like JSON Schema might be useful: http://json-schema.org

How about the output format of James Clark’s SP and NSGML parsers? It is extremely simple to create and process, and at the same time very compact.

It was designed for just this kind of AST parser output, as far as I understand, so it should be a very fitting choice IMO.

(If you don’t know the name James Clark, or his SP and NSGML parsers for SGML and XML, or his expat XML parser library – you should!) :wink:

Here is an example of what this kind of output looks like. Parsing a sample document like this:

CommonMark
==========

A sample doc.

through

cmark -t xml sample.md >sample.xml

and then the resulting XML document sample.xml through SGMLS

sgmls sample.xml >sample.out 2>sample.err

gives this parser output in sample.out:

?xml version="1.0" encoding="UTF-8"?
(COMMONMA
-">\n
(DOCUMENT
-  
ALEVEL TOKEN 1
(HEADER
-    
(TEXT
-CommonMark
)TEXT
-\n  
)HEADER
-\n  
(PARAGRAP
-    
(TEXT
-A sample doc.
)TEXT
-\n  
)PARAGRAP
-h>
)DOCUMENT
)COMMONMA

as a (hopefully) complete representation of the CommonMark AST. (To read this output: a line starting with “(” opens an element, “)” closes it, “-” introduces character data, and an “A” line such as ALEVEL TOKEN 1 records an attribute name, type, and value.)
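
For comparison, the intermediate sample.xml that cmark produced looks roughly like this (simplified; the XML prolog, DOCTYPE, and any other details cmark adds are omitted here):

<document>
  <header level="1">
    <text>CommonMark</text>
  </header>
  <paragraph>
    <text>A sample doc.</text>
  </paragraph>
</document>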

The upper-case element names, their truncation and the 77 errors while parsing the CommonMark.dtd would need some more background study at another time, but I think you get the gist …