Make CommonMark more compatible with other target languages


#1

Why?

I see Markdown as a great intermediate format: you can use it to author and
archive content that is then read in a different format.

Another such intermediate format is LaTeX. The main difference between the two
is that while Markdown is lightweight and easy to learn, LaTeX is highly
extensible and powerful. By design the two are quite orthogonal in their uses,
and there is no sense in trying to make one more like the other.

The output formats, such as PDF, XML (HTML, EPUB, Docx, OpenOffice etc.),
troff, RTF and PostScript, are usually less readable (and writable), so it is
very important that conversion is easy and straightforward.

Originally Markdown was intended to produce only HTML, but over time it turned
out to be useful in a variety of settings where the output is not HTML, which
brings me to the problem.

The problem

Right now CommonMark includes renderers to XML, troff and LaTeX. As described
above, the XML output is very good because Markdown was written with HTML in
mind. It would be nice if conversion to the other target languages were
facilitated in a similar manner.

Now since each target language has its idiosyncrasies, it is sometimes hard to
do something in Markdown and get the expected results after conversion.

The proposal

The proposal has two parts. First to extend the core syntax to include
definition lists which are needed for full compatibility with the groff man
macro package and second to allow embedding languages other than HTML.

To elaborate:

  1. Right now, writing man-pages in Markdown is almost impossible in CommonMark
    since it does not specify definition lists which would correspond to the
    .IP macro in man(7). This is for instance used when describing flags or
    files.
  2. In general it would be nice if an arbitrary language could be embedded into
    the Markdown source (troff, LaTeX, etc.). This would of course be terrible
    (mostly) to read and entirely against the philosophy of Markdown, but it is
    already done with HTML to enable special properties that would otherwise be
    lost in conversion, so it would only be natural to allow that for other
    languages too. In the example of man-pages, it could allow the use of other
    macro packages like mdoc (the BSD-endorsed way to write man-pages). Also
    in this example one could then generate the HTML from the man-pages
    themselves or use a preprocessor to substitute some parts of the Markdown
    with the raw troff. Either way, embedding parts of the target language
    directly into the Markdown document could be very beneficial for conversion.
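
To make the first point concrete, here is a minimal sketch of how a parsed definition list could be rendered with the .IP macro. It assumes the list has already been reduced to plain (term, description) string pairs; the function names `roff_escape` and `definition_list_to_man` are made up for illustration and are not part of cmark or any other tool.

```python
def roff_escape(text):
    """Escape characters that roff would otherwise interpret (illustrative, not exhaustive)."""
    text = text.replace("\\", "\\e")   # literal backslash
    text = text.replace("-", "\\-")    # keep option dashes as plain minus signs
    if text.startswith((".", "'")):
        text = "\\&" + text            # stop a leading dot or quote starting a macro
    return text


def definition_list_to_man(items):
    """Render (term, description) pairs as man(7) indented paragraphs with a tag."""
    lines = []
    for term, description in items:
        # The tag is passed as a quoted argument, so embedded quotes must be escaped.
        tag = roff_escape(term).replace('"', "\\(dq")
        lines.append('.IP "{}"'.format(tag))
        lines.append(roff_escape(description))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(definition_list_to_man([("-v", "Be verbose."), ("-o file", "Write to file.")]), end="")
```

This only covers single-paragraph descriptions without inline markup; a real renderer would have to walk the full document tree.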

#2

I think this should be tagged as an extension to the spec. It diverges so far from the original intent of Markdown that I can’t see it getting any significant amount of support unless it’s treated as an optional extension.


#3

Would the proposed Description List extension be sufficient for this?

There has been some discussion about adding “do not parse” sections where an arbitrary language or metadata could be specified, which the Markdown parser would skip over. See @jgm’s comment here. But as far as I know there are no plans to add this to the core spec.


#4

That depends on whether you use the man or mdoc macro package. On Linux and macOS, where man is primarily used, you can get away with what is currently in CommonMark + the description list extension. With that you can generate SH, SS, PP (alias of P and LP), IP (and the closely related TP), the useful font commands, RS-RE pairs (could be used for code) and PD.
HP (a paragraph with the first line not indented and the others indented) cannot be generated, but I never saw it used in practice.

Note that to produce a valid man page with man you also need to prepend a header line starting with .TH to the output of cmark, but this can be done with one line of shell script. .TH gives some information about the man page, such as its name and section number.

So on macOS and Linux you could generate almost all man-pages using the core syntax + the Description list extension. All you have to do is add one line at the top of each file.
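
The header step could look roughly like this. It is a sketch under the assumption that the title line is man(7)'s .TH macro, which takes the page title and section followed by optional fields such as a date; the helper name `add_title_header` is hypothetical.

```python
def add_title_header(man_body, name, section, date=""):
    """Prepend a man(7) title line to already rendered man source."""
    # Page titles are upper-case by convention; each argument is quoted.
    header = '.TH "{}" "{}" "{}"\n'.format(name.upper(), section, date)
    return header + man_body


if __name__ == "__main__":
    body = ".SH NAME\nfoo \\- frobnicate things\n"
    print(add_title_header(body, "foo", 1, "2017"), end="")
```

Here `man_body` stands for whatever the converter printed, for example the output of `cmark -t man`.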

I was a bit confused by the meaning of core vs. extension. As long as it is implemented in cmark, I don’t really care whether the part of the syntax is considered to be “core” or part of an “extension” spec.


#5

I counted the ratio of installed man-pages on my system. Because the developer tools are installed there are quite a few duplicates, but I guess those are present in the same ratio.

On my macOS system there are a total of 17710 pages, 14457 of which are written in man (82%) and 3103 are written in mdoc (18%).

On an example Linux system (Raspbian-minimal) there are a total of 2434 pages, of which 2380 are written in man (98%) while 58 are written in mdoc (2%).
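
For anyone who wants to repeat the count, one possible way is sketched below. This is not necessarily how the numbers above were obtained: it assumes a page can be classified by the first recognisable macro in its source (.TH or .SH for man; .Dd, .Dt, .Os or .Sh for mdoc), and the names `classify_source` and `count_pages` are made up for illustration.

```python
import gzip
from collections import Counter
from pathlib import Path

MDOC_MACROS = {".Dd", ".Dt", ".Os", ".Sh"}
MAN_MACROS = {".TH", ".SH"}


def classify_source(text):
    """Guess the macro package from the first recognisable macro in the page."""
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] in MDOC_MACROS:
            return "mdoc"
        if tokens[0] in MAN_MACROS:
            return "man"
    return "unknown"  # e.g. pages that only redirect to another page


def count_pages(root):
    """Count pages per macro package below a man directory."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # Installed pages are often gzip-compressed.
        opener = gzip.open if path.suffix == ".gz" else open
        try:
            with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
                counts[classify_source(handle.read())] += 1
        except (OSError, EOFError):
            counts["unreadable"] += 1
    return counts
```

Calling `count_pages` on a directory such as `/usr/share/man` (the location varies between systems) returns the per-package totals.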

After some research I found that .HP or an equivalent (.nf) is used to align arguments or function calls. 1352 of the pages written in man use .HP in one or more places (<10%).

So it produces:

sudo [-AbEHnPS] [-C num] [-g group] [-h host] [-p prompt] [-u user]
     [VAR=value] [-i | -s] [command]

int *XListDepths(Display *display, int screen_number, int
              count_return);

instead of

sudo [-AbEHnPS] [-C num] [-g group] [-h host] [-p prompt] [-u user] 
[VAR=value] [-i | -s] [command]

int *XListDepths(Display *display, int screen_number, int
count_return);

Honestly, I don’t know what that would correspond to in Markdown.


#6

For extensions, I assume they would be added to the reference implementation (perhaps enabled by some kind of flag) if they are formalised as official CommonMark extensions. This will be dealt with after the core spec reaches 1.0. But an implementation does not have to implement anything besides the core spec to be considered CommonMark compliant. So not all CommonMark parsers will necessarily include support for the extension, meaning that you could paste the same document into a different CommonMark-enabled app and see different results. You’d need the extension enabled in both apps.


#7

In pandoc we’re exploring adding a generalized way of including raw content for any output format: https://github.com/jgm/pandoc/issues/3537

For CommonMark, your best approach would be to use a custom AST filter. My lcmark allows you to write simple filters in Lua, and also provides a templating system. The filter could, for example, transform bullet lists, each item of which has two paragraphs, into groff man definition lists. (Or devise your own convention for signaling bullet lists that are to be treated as definition lists.)
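
The two-paragraph convention can be sketched independently of any particular tool. The snippet below is Python rather than Lua and does not use lcmark's API; it assumes a simplified, hypothetical AST made of dicts with `type`, `children` and `text` keys, and it emits .TP tagged paragraphs without escaping roff special characters.

```python
def bullet_list_to_man(node):
    """Return man(7) source for a bullet list whose items each hold exactly
    two paragraphs (term, then description), or None if the list does not
    follow that convention and should be rendered normally."""
    if node.get("type") != "bullet_list":
        return None
    lines = []
    for item in node["children"]:
        children = item["children"]
        if len(children) != 2 or any(c["type"] != "paragraph" for c in children):
            return None
        term, description = children[0]["text"], children[1]["text"]
        # .TP takes the tag from the following input line.
        lines += [".TP", term, description]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    flags = {
        "type": "bullet_list",
        "children": [
            {"type": "item", "children": [
                {"type": "paragraph", "text": "FILE"},
                {"type": "paragraph", "text": "The file to read."},
            ]},
        ],
    }
    print(bullet_list_to_man(flags), end="")
```

Returning None for lists that do not match leaves ordinary bullet lists untouched, which is the point of using a convention rather than new syntax.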

Of course, it would be better to have native definition list support, and this should happen at some point, but this is a way forward for now.


#8

The raw code inclusion discussed in pandoc looks like what I had in mind! I’m not in a position to comment on the syntax though (first because I don’t feel qualified to, and second because the syntax itself is not that important to me).

Now, as for lcmark this seems to provide a lot of flexibility, but for what I want to implement it is more a long-term thing and instead of finding a hacky way to implement it, I’d rather wait for the syntax to be official and then do something. So I’m happy to wait until after the core spec is finalized.