I think that very little “plumbing” is required from cmark, in both approaches:
- Currently I do pre-processing, so there is no CommonMark syntax involved (yet) in my zhtml processor. The pre-processor in my case is triggered by lines containing e.g. `%%Z` (as the sole content) and similar markers (and I use “$” to delimit in-line mark-up). It converts only these specially marked-up parts of the typescript (which are not marked up with CommonMark syntax!), and replaces them with HTML (either HTML blocks or inline HTML). The result is a conforming CommonMark document, which then gets processed by cmark or whatever Markdown processor you have (well, I know which one you have!). The whole contraption of course relies on, and stands or falls with, the Markdown rule that HTML mark-up is passed through by a Markdown processor!
Keeping the pre-processor from falling into “the code block trap” you mentioned is not that hard at all: I have three tools that each know just enough about Markdown to avoid code blocks (one doing this only when an option is set): two process “Z notation e-mail mark-up” into HTML and, respectively, various forms of “plain text”, and one is a general-purpose plain text formatter which I use for Markdown typescripts too. They work fine for me, but I cannot rule out that there could be problems with “edge cases” of Markdown syntax, i.e. some remaining lack of knowledge about Markdown syntax in these tools.
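To make the idea concrete, here is a minimal sketch of such a marker-triggered pre-processor pass. It is not my actual tool, just an illustration under simplifying assumptions: `convert` is a hypothetical callback that turns the Z-notation text into HTML, the fence and indented-code detection is deliberately rough (it ignores some CommonMark edge cases, which is exactly the residual risk mentioned above):

```python
import re

def preprocess(lines, convert):
    """Replace %%Z-delimited regions with HTML, but never inside code blocks.

    `convert` is a hypothetical callback turning Z-notation text into HTML.
    Code-block detection is approximate: it toggles on fence lines and skips
    four-space-indented lines, ignoring some CommonMark edge cases.
    """
    out = []
    zbuf = None          # collects lines between a pair of %%Z markers
    in_fence = False
    for line in lines:
        # A fence line (``` or ~~~, up to 3 spaces indent) toggles code mode.
        if zbuf is None and re.match(r'^ {0,3}(```|~~~)', line):
            in_fence = not in_fence
            out.append(line)
            continue
        # Inside fenced or indented code: pass the line through untouched.
        if in_fence or (zbuf is None and line.startswith('    ')):
            out.append(line)
            continue
        # A marker line opens or closes a Z-notation region.
        if line.strip() == '%%Z':
            if zbuf is None:
                zbuf = []
            else:
                out.append(convert('\n'.join(zbuf)))
                zbuf = None
            continue
        (zbuf if zbuf is not None else out).append(line)
    return out
```

The point of the sketch is only the shape of the scan: one linear pass, just enough Markdown knowledge to leave code blocks alone, and everything else passed through verbatim.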
This is item 1 in your list of customizations, and it already works pretty well with existing Markdown implementations, with absolutely no plumbing. It has, however, a bit of a feeling of a special solution, not a very general one (but I would argue about that!), and with only one pre-processor the issue of fragility didn’t come up in practice, but I believe it could turn out to be a problem with multiple pre-processors (I’m not sure about that one, either);
- What I propose for post-processing “labeled code blocks” (and labeled code spans, too, but there’s no label syntax yet) is not that the post-processor “sees” the CommonMark typescript, but the (XML or HTML or SGML) output of cmark. This is somewhere between items 2 and 3 in your list of places for customizations: the post-processor’s input is not the “regular” rendering of the CommonMark typescript by cmark, but is specifically augmented for post-processing by a modified cmark:
  - all the labeled code blocks’ raw text content is wrapped in (XML/HTML/SGML) elements (one instance for each code block), but for “known” labels only (for which there is a post-processor);
  - with a made-up element name (i.e. tag),
  - for the sole purpose that post-processors can find their input inside these “transport elements”,
  - and then replace these elements with their formatted (XML/HTML/SGML/whatever) output,
  - each step in the chain of post-processors completing the final output document one bit more,
  - until the output document contains no more such “transport elements”, but is finished and final, and hopefully conforms to the targeted document type (for HTML/XML/SGML/whatever).
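The wrapping step could be sketched like this. Note that `KNOWN_LABELS`, the `commonmark:special` element name and the `class` attribute are all illustrative choices taken from this discussion, not anything cmark actually does today:

```python
import html

# Labels for which a post-processor exists (an assumption for this sketch).
KNOWN_LABELS = {"Z", "PHP"}

def render_code_block(label, text):
    """Mimic the proposed cmark behaviour: wrap 'known' labeled code blocks
    in a made-up transport element; render all others as usual."""
    if label in KNOWN_LABELS:
        # The element name and class attribute are illustrative only.
        return '<commonmark:special class="%s">%s</commonmark:special>' % (
            label, html.escape(text))
    # Unknown labels get the regular cmark-style rendering.
    return '<pre><code class="language-%s">%s</code></pre>' % (
        label, html.escape(text))
```

So a ` ```PHP ` block would come out wrapped in `<commonmark:special class="PHP">…</commonmark:special>`, while a block with an unknown label is rendered exactly as before.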
So for post-processing no one needs to produce, or see, or parse, a complete AST of the cmark parse: for a post-processor it would be sufficient to simply do a text search for the start tags of elements which are of interest to this specific post-processor (distinguished by an attribute like class="C" or class="PHP", derived from the label of the source code block itself). Each post-processor can rely on the exact spelling of the “transport element’s” start tag, because they were placed in there by cmark just for the purpose of being recognizable by those post-processors in the first place!
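A text-search-based post-processor along these lines could be as small as the following sketch (the element and attribute names are again the made-up ones from this discussion, and `fmt` stands in for whatever formatter the post-processor implements):

```python
import html
import re

def postprocess(document, label, fmt):
    """A minimal post-processor: find 'its' transport elements by the exact
    spelling of the start tag (which the modified cmark emits on purpose)
    and replace each one with the formatted output of `fmt`."""
    start = '<commonmark:special class="%s">' % label
    pattern = re.compile(re.escape(start) + r'(.*?)</commonmark:special>',
                         re.DOTALL)
    # Unescape the transported raw text before handing it to the formatter.
    return pattern.sub(lambda m: fmt(html.unescape(m.group(1))), document)
```

No AST, no Markdown parsing: a plain text search for a fixed start tag is enough, precisely because cmark put that tag there to be found.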
Yes, this is indeed similar to your remark “you can find the &lt;pre&gt; elements generated by cmark”, but I see important advantages in not using an element type (i.e. tag) of the target (XML/HTML/SGML) document type like &lt;PRE&gt;, but instead avoiding any conflict through use of said “made-up” element names (a tag &lt;commonmark:special class="PHP"&gt; would never introduce conflicts in those target documents). This would protect from interference with any target format, while also separating the post-processors nicely, and would be the most generic approach I can think of right now.
That’s why some help from cmark is required: the contents of code blocks labeled with a “known” identifier are to be wrapped not in &lt;PRE&gt;, but in “special” elements, for the one purpose of shipping them to the post-processors. (There may be multiple post-processors chained together, each detecting “its” input by the class="..." attribute, and handing its own result to the next one in the chain.)
Furthermore, one could see it as a drawback that each post-processor would have to be adjusted for this kind of input using “transport elements”; but I’m certain that a “post-processor hosting process” could easily be implemented, which would separate the content of “transport elements” from the rest of the document, feed it to the post-processor’s standard input, receive the post-processor’s standard output, and piece together a final document from the various post-processors’ outputs and the document content outside of the “transport elements”, which the post-processors would never see in this mechanism: each post-processor would only see plain text in its “own” syntax arriving at its standard input, with no tags or entity references at all. (Unless they were put into the code block by the original typescript’s author, in order to arrive at the post-processor, that is!)
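Such a hosting process is itself only a few lines. In this sketch, `FILTERS` is a hypothetical mapping from code-block labels to stdin→stdout filter commands (the command names `zfmt` and `php-highlight` are invented for illustration); the host cuts each transport element out, pipes its unescaped content through the matching filter, and splices the filter’s output back in, so the filters never see any surrounding markup:

```python
import html
import re
import subprocess

TRANSPORT = re.compile(
    r'<commonmark:special class="([^"]+)">(.*?)</commonmark:special>',
    re.DOTALL)

# Hypothetical mapping from code-block label to a stdin -> stdout filter.
FILTERS = {"Z": ["zfmt"], "PHP": ["php-highlight"]}

def host(document):
    """'Post-processor hosting process': feed each transport element's raw
    content to the matching filter's stdin and splice its stdout back in."""
    def run_filter(match):
        label, content = match.group(1), match.group(2)
        cmd = FILTERS.get(label)
        if cmd is None:
            return match.group(0)   # no filter known: leave element untouched
        result = subprocess.run(cmd, input=html.unescape(content),
                                capture_output=True, text=True, check=True)
        return result.stdout        # assumed to be finished target markup
    return TRANSPORT.sub(run_filter, document)
```

With such a host, any existing command-line filter becomes a usable post-processor without knowing anything about transport elements at all.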
So I expect that with some more “plumbing” one could use existing “filters” as post-processors (each one transforming stdin to stdout), but this plumbing would all be completely outside of and independent of cmark! In each case, cmark would have to produce the exact same output: wrapping specific raw text parts into specific “transport elements”, pushing the result out of the door (stdout) as always, and that’s it. How these elements are processed further is not the job of cmark, but of a Makefile, or a command-line pipe, or of this “post-processing” super-process, or whatever you can think of.
Can you point out for me where you think fragility lurks in this approach? Or what would restrict this approach to just one kind of output? I’m not sure I completely understand your argument regarding these alleged properties of post-processing.
(And the more I write and think about this approach, the more urgent my itch gets to actually go and implement it, so we can see how it works out … I’m convinced it really would not take that much effort.)