Project ANN+RFC: Extending and using CommonMark in new ways

tin-pot · October 22, 2015, 4:30am

Project Announcement and
Request for Comments—
Using CommonMark in new ways

I intend to extend the use of CommonMark and the usefulness of cmark in a project of mine with two main goals:

1. Make “foreign” mark-up syntaxes available in CommonMark texts: Blocks and inline spans written in a “foreign” mark-up syntax can be used in CommonMark texts.

Blocks of this kind use either the existing fenced code block syntax and announce the type of mark-up they contain in an info string, or they use configurable start and end lines as delimiters.

Inline mark-up spans announce the type of mark-up they contain using a (preliminary) syntax that also includes an info string, which is interpreted in the same way as the info string on fenced code blocks.

2. Make CommonMark and (a modified) cmark usable in an XML/SGML environment: The “conventional” transformation of plain text files into structured documents (ie HTML/XML/XHTML) is of course retained, and can use the “foreign mark-up” extensions mentioned. In addition, it should be possible to feed XML/SGML documents into the mark-up processing which contain plain text fragments as the character data content of designated elements. The processing substitutes these “container elements” with elements generated from the contained plain text, so that the final output is again an XML/SGML document.

The motivation for the first goal probably needs no explanation. Achieving the second goal would allow CommonMark (and a processor for it) as well as “foreign” syntaxes (and processors for them) to be used in an XML/SGML authoring process, eg to produce DocBook documents.
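To make the second goal more concrete, here is a minimal sketch of the substitution step using Python's standard `xml.etree.ElementTree`. The container element name `mark-up`, the generated `para` elements, and the `convert_plain_text` stand-in are assumptions made for illustration only; in the real chain the conversion would be done by cmark or a “foreign” processor.

```python
import xml.etree.ElementTree as ET

def convert_plain_text(text):
    """Stand-in for a real mark-up processor (hypothetical):
    produce one <para> element per blank-line-separated block."""
    elements = []
    for block in text.split("\n\n"):
        if block.strip():
            para = ET.Element("para")
            para.text = " ".join(block.split())
            elements.append(para)
    return elements

def expand_containers(root, container_tag="mark-up"):
    """Replace each container element below root with the elements
    generated from its character data content."""
    for parent in list(root.iter()):
        new_children = []
        for child in list(parent):
            if child.tag == container_tag:
                generated = convert_plain_text(child.text or "")
                if generated:
                    # Keep the text that followed the container element.
                    generated[-1].tail = child.tail
                new_children.extend(generated)
            else:
                new_children.append(child)
        for child in list(parent):
            parent.remove(child)
        parent.extend(new_children)
    return root
```

Calling `expand_containers` on a parsed document leaves all other elements untouched, so the result is again a well-formed XML tree.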

Some expected properties of my solution are:

The CommonMark specification needs no change at all if the “foreign” syntax is only used in fenced code blocks;
The CommonMark syntax is processed by (a modified) cmark or a similar Markdown processor, while each of the “foreign” syntaxes is processed by its own, specific processor, in a robust but flexible manner.
Adding a new “foreign” syntax for use in
- fenced code blocks with info strings,
- “foreign mark-up blocks”, or
- code spans with info strings

would solely require adding one line to a configuration file, and no changes to the cmark processor.
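As an illustration of the “one line per syntax” idea, the following sketch maps the first word of an info string to a processor command. The configuration format (`name = command`), the function names, and the `my-math-filter` command are all hypothetical; the actual configuration file format is not specified here.

```python
import shlex

# Hypothetical configuration: one line per "foreign" syntax.
EXAMPLE_CONFIG = """\
# info string   processor command
dot   = dot -Tsvg
math  = my-math-filter --mathml
"""

def load_processors(config_text):
    """Parse 'name = command' lines into a dict of argv lists."""
    table = {}
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        name, command = line.split("=", 1)
        table[name.strip()] = shlex.split(command)
    return table

def processor_for(info_string, table):
    """Return the argv for a block's info string, or None if the
    block should stay an ordinary code block."""
    words = info_string.split()
    return table.get(words[0]) if words else None
```

With such a table, supporting another syntax means adding one more line to the configuration; the lookup code stays unchanged.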

Input to mark-up processing can be in a variety of formats:

Plain text files as ever,
well-formed XML documents (without using a DTD or XML Schema),
validated XML documents (using an XML parser to check the document against a DTD or XML Schema),
validated SGML/HTML documents (using an SGML parser to check the document against a DTD, and to parse it as a first stage of processing).

The mark-up processing tools used in this concept (cmark, processors for “foreign” syntaxes) would not need to parse or generate XML or SGML; they would read and write a simple text format instead.

The key idea is to compose cmark and other mark-up processors together with specialized tools into a chain of processors (typically in a U*IX-style pipeline, or controlled by a Makefile, ie each processor is also a process), so that this chain of processes transforms the plain text mark-up: it is a “plain text mark-up processing chain”. Most of the rest follows naturally from this idea, driven by some design decisions I made, under constraints and requirements I assumed.
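The chain idea can be sketched in a few lines: each stage is a separate process, and the output of one stage becomes the input of the next. This sketch runs the stages one after another (closer to the Makefile variant than to a concurrent shell pipeline), and the two stages shown are trivial stand-ins written for the example, not the real tools.

```python
import subprocess
import sys

def run_chain(stages, text):
    """Feed text through each command in turn; the stdout of one
    stage becomes the stdin of the next."""
    data = text.encode("utf-8")
    for argv in stages:
        result = subprocess.run(argv, input=data,
                                stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data.decode("utf-8")

# Stand-in stages (hypothetical): any filter that reads stdin and
# writes stdout could take their place.
UPPER = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
WRAP = [sys.executable, "-c",
        "import sys; sys.stdout.write('<p>' + sys.stdin.read().strip() + '</p>')"]
```

For example, `run_chain([UPPER, WRAP], "hello\n")` passes the text through both stand-in filters in order.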

More (or even all) of the details about the motivation, concept, and design of the planned implementation can be found in a very detailed article I wrote: A Plain Text Mark-Up Processing Chain.

Since the goals of the project are relevant for the greater community (as I would hope), and the implementation would be related to several topics discussed here recently, like

ways of using “foreign” mark-up syntax blocks in CommonMark, either in the form of fenced code blocks or alternatively or additionally in “foreign mark-up blocks”;
ways of using the same “foreign” mark-up syntax in code spans (with an extended syntax to attach an info string to such code spans);
modifying the cmark implementation by adding another “mode of operation” (ie a new value for the -t option), or alternatively implementing a new processor based on cmark and using its API;
generating “native” elements from mark-up, and thus using the CommonMark DTD for “production purposes”, not primarily for testing;

I would like to invite everyone who finds some (or all) of the project’s goals or topics attractive and worthwhile to comment, discuss and help to make this project a success.

— tin-pot

chrisalley · October 22, 2015, 9:57am

Just to clarify, since no changes are required to the CommonMark specification, it appears that the topic relates to cmark (and other) implementations rather than a (syntax) extension to the CommonMark specification. I think this topic should be moved from Extensions to Implementation.

tin-pot · October 22, 2015, 10:14am

Yes and no: You’re right that the main concern is modifying/extending the implementation. But to use “foreign syntax” by means other than the existing fenced code block with an info string, the CommonMark syntax would have to be extended—if only in a “soft”, ie configurable way.

The one CommonMark specification extension which could be generally useful (not only in this project) would be to

extend the syntax for code spans so that they can be marked with an (optional) info string; by default, the info string would have the same meaning as in fenced code blocks—which is no meaning as far as the CommonMark spec is concerned

@chrisalley: Thanks for your advice!

[EDIT: I realized too late that one can change the “category” of a topic after the fact. I’ve changed it here to “Implementation” now, and will try to get rid of the duplicate post where (Implementation) was appended to the title …

Sorry for the mess…]

tin-pot · October 22, 2015, 11:56pm

A quick update: I have cobbled together (aka: prototyped) two important aspects of the project:

A free-standing CommonMark processor that outputs the ESIS (Element Structure Information Set) of the XML document produced from CommonMark input. It uses the element set from the CommonMark DTD, and is in fact the -t xml mode with a different output format. See the source file cmesis.c in my repository. The program uses the cmark API for parsing, and renders the nodes of the parsed tree into an ESIS stream.
A free-standing filter to transform the ESIS representation back into XML. It also does some (table-driven) processing:

Character content from input elements is treated either as CDATA, meaning no transformation takes place, or as PCDATA, meaning that the usual substitution of entity references for < and the like takes place.
Elements can be renamed in the XML output (eg from the “native” CommonMark XML names to the (X)HTML element type names), and elements (or rather: their delimiting tags) can be omitted altogether. This is currently done for the html and html_block elements, and for text too, in order to get a resulting XML output closer to what proper XHTML would look like.
Empty elements are output using the XML < /> form, not with a separate end tag.
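To show what an ESIS stream looks like, here is a small sketch that renders a tree into ESIS lines. It assumes the line conventions known from nsgmls-style output (`(name` and `)name` for element start and end, `-data` for character data, `Aname CDATA value` for attributes placed before the start line); the exact output of cmesis.c may differ, and the tuple-based tree is only a stand-in for the nodes cmark produces.

```python
def escape_data(text):
    """Escape backslashes and line breaks so data stays on one line."""
    return text.replace("\\", "\\\\").replace("\n", "\\n")

def to_esis(node, out=None):
    """Render a (name, attributes, children) tuple tree as a list of
    ESIS lines; plain strings are character data."""
    if out is None:
        out = []
    if isinstance(node, str):
        out.append("-" + escape_data(node))
        return out
    name, attrs, children = node
    for key, value in attrs.items():
        # Attribute lines precede the start line of their element.
        out.append("A%s CDATA %s" % (key, escape_data(value)))
    out.append("(" + name)
    for child in children:
        to_esis(child, out)
    out.append(")" + name)
    return out
```

Because every event occupies exactly one line, later stages in the chain can process the stream with simple line-oriented code and need no XML parser.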

A true transformation into HTML would require some more effort, eg to select between UL and OL in the output depending on input attribute values.
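A sketch of what such an attribute-dependent choice could look like, assuming the list element carries a `type` attribute with the values `bullet` or `ordered` and an optional `start` attribute, as I understand the CommonMark DTD. The function name is hypothetical and not part of xmlout.

```python
def html_list_element(attrs):
    """Pick the HTML element type and attributes for a CommonMark
    list element from its attribute values."""
    tag = "ol" if attrs.get("type") == "ordered" else "ul"
    html_attrs = {}
    start = attrs.get("start")
    if tag == "ol" and start not in (None, "1"):
        # Only emit start when it differs from the HTML default.
        html_attrs["start"] = start
    return tag, html_attrs
```

A plain rename table cannot express this, which is why a true HTML transformation needs logic beyond renaming and omitting elements.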

The source for this xmlout tool is also in the repository.
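For readers who want the gist without opening the source, here is a minimal table-driven sketch of such a filter. It is not the xmlout implementation: the table contents, the function names, and the nsgmls-style line conventions (`(`, `)`, `-`, `A` prefixes, `\n` and `\\` escapes) are assumptions made for the example.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical tables; the real tool has its own.
RENAME = {"paragraph": "p", "emph": "em", "block_quote": "blockquote"}
OMIT = {"html", "html_block", "text"}   # tags dropped, content kept
CDATA = {"html", "html_block"}          # content passed through as-is

def unescape_data(data):
    """Undo the two escapes assumed here: \\n and \\\\."""
    out, i = [], 0
    while i < len(data):
        if data[i] == "\\" and i + 1 < len(data):
            out.append("\n" if data[i + 1] == "n" else data[i + 1])
            i += 2
        else:
            out.append(data[i])
            i += 1
    return "".join(out)

def esis_to_xml(lines):
    """Turn a list of ESIS lines back into an XML string."""
    out, attrs, stack = [], [], []
    for index, line in enumerate(lines):
        kind, rest = line[:1], line[1:]
        if kind == "A":
            name, _type, value = rest.split(" ", 2)
            attrs.append((name, unescape_data(value)))
        elif kind == "(":
            stack.append(rest)
            if rest not in OMIT:
                tag = RENAME.get(rest, rest)
                attr_text = "".join(" %s=%s" % (k, quoteattr(v))
                                    for k, v in attrs)
                # An element closed on the very next line is empty.
                is_empty = (index + 1 < len(lines)
                            and lines[index + 1] == ")" + rest)
                out.append("<%s%s%s>" % (tag, attr_text,
                                         "/" if is_empty else ""))
            attrs = []
        elif kind == ")":
            stack.pop()
            if rest not in OMIT and lines[index - 1] != "(" + rest:
                out.append("</%s>" % RENAME.get(rest, rest))
        elif kind == "-":
            text = unescape_data(rest)
            # CDATA content is copied; PCDATA gets entity references.
            out.append(text if stack and stack[-1] in CDATA
                       else escape(text))
    return "".join(out)
```

The three tables carry all the policy, so changing the output vocabulary means editing data rather than code.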

Please note that this software is at a very early stage and pretty much untested, so the usual disclaimers apply even more.
