Preserve bullet char in AST

I think it would be useful if the character used to mark a list item were preserved in the AST representation.

At the moment the reference implementations output only the list type (bullet/ordered) and tightness:

<list type="bullet" tight="true">

This feature might be helpful for rendering into non-html formats.

There are a number of similar issues that might be discussed. Should the AST represent whether a code block was indented or fenced? Should it represent which character was used for a list item? Should it represent the distinction between ATX and setext headers? Should it represent whether the ATX header had following ###s or just leading?

So far we’ve gone in the direction of having a simpler AST. If you render CommonMark, for example (and I’ve got a commonmark writer brewing in the commonmarkwriter branch of jgm/cmark), you’re never going to reproduce all the features of the input (e.g. indentation, laziness, etc.). Is it so bad, then, for the renderer to use a list marker other than the one used in the original document?

+++ zudov [Mar 19 15 09:09 ]:

That makes sense. At the end of the day it’s ‘abstract’ for a reason.

I gave the subject some more thought. A small inconsistency I noticed is that for ordered lists the AST does preserve whether the delimiter is a parenthesis or a period. Not a big problem, but it made me wonder about the reason.

And one use case for preserving the bullet character is that Markdown makes it tempting to use ‘+’ for pros and ‘-’ for cons in pros/cons lists. It makes no difference for HTML, but renderers for other formats (e.g. org) or a custom HTML renderer could make use of it.

+++ zudov [Mar 30 15 15:41 ]:

I gave the subject some more thought. A small inconsistency I noticed is that for ordered lists the AST does preserve whether the delimiter is a parenthesis or a period. Not a big problem, but it made me wonder about the reason.

The idea was to allow you to reproduce the delimiter in the output format, if the output format supports that. So, yes, you’re right – similar reasoning might suggest keeping track of the bullet.

And one use case for preserving the bullet character is that Markdown makes it tempting to use ‘+’ for pros and ‘-’ for cons in pros/cons lists. It makes no difference for HTML, but renderers for other formats (e.g. org) or a custom HTML renderer could make use of it.

Note that in CommonMark changing the bullet character starts a new list, so you couldn’t have one single list with + for pro and - for con.
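To make that rule concrete, here is a minimal Python sketch. It is not the reference parser: the function name `split_bullet_lists` and the returned dictionaries are made up for illustration, and it only looks at top-level items (no nesting, indentation, or laziness). It starts a new list whenever the bullet character changes and records which marker each list used.

```python
import re

# A top-level bullet item: one of -, + or *, at least one space, then the text.
BULLET_ITEM = re.compile(r"^([-+*])\s+(.*)$")


def split_bullet_lists(lines):
    """Group consecutive bullet items into lists.

    A change of bullet character starts a new list, mirroring the
    CommonMark rule. Lines that are not bullet items are skipped,
    which is a deliberate simplification.
    """
    lists = []
    for line in lines:
        match = BULLET_ITEM.match(line)
        if match is None:
            continue
        marker, text = match.groups()
        if lists and lists[-1]["marker"] == marker:
            # Same marker as the previous item: still the same list.
            lists[-1]["items"].append(text)
        else:
            # Different marker (or first item): open a new list.
            lists.append({"marker": marker, "items": [text]})
    return lists


source = ["+ fast", "+ simple", "- lossy round-trip"]
for bullet_list in split_bullet_lists(source):
    print(bullet_list["marker"], bullet_list["items"])
```

So a pros/cons block written with + and - comes out as two lists, each of which could carry its own marker in the AST.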

Note that in CommonMark changing the bullet character starts a new list, so you couldn’t have one single list with + for pro and - for con.

Indeed. I meant two separate lists with pros and cons.

The idea was to allow you to reproduce the delimiter in the output format, if the output format supports that.

Keeping track in the AST of the original document’s verbatim markup (as much as possible) is useful in use cases where isomorphic reproducibility is required.

With Pandoc you can have bidirectional conversion from one supported input document format to another supported output format, and vice versa, but the original markup will not be preserved. This is obviously also the case when you convert Markdown to Markdown: the original markup in the source file will get “normalized” in the output (varying furthermore with optional flags, like --no-wrap --smart --atx-headers).

Likely, the use case originally envisaged is that of a single author who edits a “master copy” of the document in his favorite lightweight markup syntax (e.g. AsciiDoc, reST, or Markdown) and then has the output formats (.docx, .html, .pdf) “updated” from the edited master copy in a build process; Pandoc would never touch the “master copy”. Cases in which multiple authors agree to all work on the same file (using Git, for example, to manage version control), in the same format (Pandoc Markdown), are actually no different.

However, suppose a scenario wherein there are multiple authors who each prefer a different syntax, and who would need to collaborate on the same document.

One could imagine many authors unwilling to give up LyX, while others would happily switch to editing in CommonMark syntax. When the document is edited (and the updated file committed/pushed), you would need to convert the .md file of the latter to .tex; the former could then continue in LyX, but would expect the update to reflect only the changes his collaborators made to the contents of the document, and he would be quite upset if his original LaTeX markup had not been preserved. Likewise, the co-author working in Markdown would be confused to find that the conversion had changed his original _emphasis_ into *emphasis*, ATX-style headers into setext, etc., even though the value of those nodes (the text) had not been edited by any collaborator.

Implementing such collaborative use case scenarios would of course require a lot more effort than just extending the AST’s model (caching and diffing versions of the AST, for instance). But as long as the AST does not keep track of and preserve the markers from the input document verbatim, true bidirectional conversion in collaborative scenarios will unfortunately remain infeasible.

Maybe this is all just too much out of scope still. Yet I believe that if we could have reliable bidirectional conversion, that would be a real game changer, especially with increasing interest in real-time collaborative editing apps. One way or another, you’ll need a parser on the client (a CommonMark implementation in JavaScript), of which the “raw” AST (instead of a formatted HTML string) would be reactively synced, both with other connected clients and with a WYSIWYG “rich text” editor in the user’s browser (using shadow DOM). In both cases, the AST then needs to provide the user’s original markup, instead of “normalizing” it as an unintended side effect.

+++ rhythmus [Apr 07 15 19:11 ]:

Maybe this is all just too much out of scope still. Yet I believe that
if we could have reliable bidirectional conversion, that would be a
real game changer, especially with increasing interest in real-time
collaborative editing apps. One way or another, you’ll need a parser on
the client (a CommonMark implementation in JavaScript), of which the
“raw” AST (instead of a formatted HTML string) would be reactively
synced, both with other connected clients and with a WYSIWYG “rich
text” editor in the user’s browser (using shadow DOM). In both cases,
the AST then needs to provide the user’s original markup, instead of
“normalizing” it as an unintended side effect.

I do see the point of this. But there are just so many features of the source text that would have to be stored. For example, these are all equivalent in CommonMark:

> Hi there
> boss.

>   Hi there
boss.

>  Hi there
>  boss.

I think a plausible argument could be made for retaining information about the emphasis delimiter and bullet character, but retaining all features of the source document and avoiding any renormalization on round-trip would require a ridiculous amount of detail to be stored in the AST.

retaining all features of the source document … a ridiculous amount of detail … in the AST

That would make no sense indeed, for it would basically require serializing all edge cases, or rather all equivalent notations, for each and every supported input format (in Pandoc’s case). That would effectively come down to storing a complete clone of the verbatim input along with its abstract, normalized representation.

If one wanted to implement a diffing mechanism to support collaborative editing scenarios (which, as suggested above, would sync between different formats/files of the same document), it would obviously be up to such an implementation to keep track of, cache, and invalidate/update the versions/states of the files involved; that would not be the responsibility of the CommonMark parser (or of Pandoc, for that matter).

Yet, when files in different formats are to be synced, git’s line/string-based diffing, for example, would not suffice, and one would need to be able to hook into the CommonMark parser (or Pandoc for formats other than CommonMark), or at least use its serialized output. Diffing would then have to be done first between the ASTs of both files, and only thereafter between the updated ASTs and their corresponding files’ verbatim contents, instead of simply overwriting the target files with the default (renormalized) output of the respective target format’s writers.

It’s still unclear to me how one would best implement diffing the target file’s verbatim string against the AST’s structured data, but I guess one could loop over all nodes in the tree of t(ype) Str or Space, join their c(ontents) values, use these reassembled strings to (fuzzily) find the corresponding text in the target file, and update where applicable.
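A rough Python sketch of that loop, under a few assumptions: the tree uses the `t`/`c` node shape of Pandoc's JSON output, the helper names `text_runs` and `locate` are invented here, and the lookup is an exact `str.find` rather than anything fuzzy.

```python
def text_runs(node):
    """Yield the text of each maximal run of adjacent Str/Space nodes."""
    if isinstance(node, dict):
        if node.get("t") == "Str":
            yield node["c"]
        else:
            # Any other node: descend into its contents, if it has any.
            yield from text_runs(node.get("c"))
    elif isinstance(node, list):
        run = []
        for child in node:
            kind = child.get("t") if isinstance(child, dict) else None
            if kind == "Str":
                run.append(child["c"])
            elif kind == "Space":
                run.append(" ")
            else:
                # The run is interrupted: emit it, then descend.
                if run:
                    yield "".join(run)
                    run = []
                yield from text_runs(child)
        if run:
            yield "".join(run)


def locate(runs, source):
    """Pair each run with its first offset in source (-1 if absent)."""
    return [(run, source.find(run)) for run in runs]


doc = [
    {"unMeta": {}},
    [{"t": "Header",
      "c": [2, ["hi-there", [], []],
            [{"t": "Str", "c": "Hi"},
             {"t": "Space", "c": []},
             {"t": "Str", "c": "there"}]]}],
]
print(locate(text_runs(doc), "## Hi there\n"))
```

A real implementation would need to cope with line wrapping and escaped characters in the target file, which is where the fuzzy matching would come in.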

If, instead of the content (string value), the node’s type needed to be updated, then it would be helpful, I guess, if the AST provided the preferred verbatim markup to be used. This would be unnecessary, I think, when syncing between different formats (e.g. .tex ↔ .md ↔ .docx), but not when the files to be synced use the same markup language/syntax with alternative notation (.md ↔ .markdown ↔ .text). For example, both are in valid CommonMark syntax, but one has line-wrapping applied and uses * for emphasis, + for unordered list items, and setext-style headers, while the other uses ATX-style headers, -, _, and no wrapping.

Maybe it could all boil down to “just” extending the AST with node properties for verbatim markup, but only for those cases wherein the CommonMark spec allows for alternative notation (header styles, list item markers, inline emphasis), line-wrapping perhaps excluded.

Pardon my confused wording; I probably don’t see all too clearly where I’m heading myself. But I do see value in preparing the AST, as far as is practically doable, for collaborative editing use cases, for which predictable, non-destructive syncing between files would be a crucial requirement.

+++ rhythmus [Apr 08 15 02:36 ]:

Maybe it could all boil down to “just” extending the AST with node properties for verbatim markup, but only for those cases wherein the CommonMark spec allows for alternative notation (header styles, list item markers, inline emphasis), line-wrapping perhaps excluded.

Even here there are lots of options: for example, these are all equivalent level-2 headers:

## Hi

## Hi ##

## Hi #

Hi
-

Hi
--

But it might be worth storing the distinction between setext and ATX style: this really stands or falls with the rest (emphasis markers, bullet markers).

Sure, but that need not imply that the combinatorics of all these different notations have to be implemented only to abstract them away again. Instead, one could simply store the verbatim delimiters before and after the node, along with its type and contents.

If some abstraction of the formatting style were to be considered, then it would probably be best to have those properties set only at the level of the document, along with its metadata.

[{
        "unMeta": {},
        "𝐟𝐨𝐫𝐦𝐚𝐭": {
          "𝐬𝐲𝐧𝐭𝐚𝐱":  "CommonMark",
          "𝐜𝐨𝐥𝐖𝐫𝐚𝐩": "none",
          "𝐡𝐞𝐚𝐝𝐞𝐫𝐬": "ATX"
        }
    },
    [{
        "t": "Header",
        "𝐚𝐧𝐭𝐞": "## ",
        "c": [2, ["hi", [],
                []
            ],
            [{
                "t": "Str",
                "c": "Hi"
            }]
        ],
        "𝐞𝐱𝐢𝐦": "\n"
    }, {
        "t": "Header",
        "𝐚𝐧𝐭𝐞": "## ",
        "c": [2, ["hi-1", [],
                []
            ],
            [{
                "t": "Str",
                "c": "Hi"
            }]
        ],
        "𝐞𝐱𝐢𝐦": " ##\n"
    }, {
        "t": "Header",
        "𝐚𝐧𝐭𝐞": "## ",
        "c": [2, ["hi-2", [],
                []
            ],
            [{
                "t": "Str",
                "c": "Hi"
            }]
        ],
        "𝐞𝐱𝐢𝐦": " #\n"
    }, {
        "t": "Header",
        "𝐚𝐧𝐭𝐞": null,
        "c": [2, ["hi-3", [],
                []
            ],
            [{
                "t": "Str",
                "c": "Hi"
            }]
        ],
        "𝐞𝐱𝐢𝐦": "\n-\n"
    }, {
        "t": "Header",
        "𝐚𝐧𝐭𝐞": null,
        "c": [2, ["hi-4", [],
                []
            ],
            [{
                "t": "Str",
                "c": "Hi"
            }]
        ],
        "𝐞𝐱𝐢𝐦": "\n--\n"
    }]
]
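With delimiters stored like this, a writer could put the original notation back without knowing anything about header styles. A small Python sketch, assuming the two proposed properties are spelled with plain ASCII keys `ante` and `exim` (they are a proposal in this thread, not part of any existing Pandoc or cmark output), and handling only Str and Space inlines:

```python
def render_header(node):
    """Rebuild a Header node's source text from its stored delimiters.

    `ante` and `exim` are the hypothetical before/after delimiter
    properties proposed above; a null (None) value means "no delimiter".
    """
    _level, _attr, inlines = node["c"]
    # Only Str and Space inlines are handled in this sketch.
    text = "".join(" " if inline["t"] == "Space" else inline["c"]
                   for inline in inlines)
    return (node.get("ante") or "") + text + (node.get("exim") or "")


atx_closed = {
    "t": "Header",
    "ante": "## ",
    "c": [2, ["hi-1", [], []], [{"t": "Str", "c": "Hi"}]],
    "exim": " ##\n",
}
setext = {
    "t": "Header",
    "ante": None,
    "c": [2, ["hi-4", [], []], [{"t": "Str", "c": "Hi"}]],
    "exim": "\n--\n",
}
print(repr(render_header(atx_closed)))
print(repr(render_header(setext)))
```

The level stored in `c` is ignored here on purpose: the delimiters already encode it, which also shows the redundancy this approach introduces.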

Apart from support for general-purpose collaborative editing applications, one use case I have in mind specifically is that of the GITenberg Project. The maintainers lean heavily towards AsciiDoc as the preferred format for the “master” copies of all Gutenberg e-books in their >60k repos, followed, maybe, by reST, but leaving out Markdown, because of its small feature-set as regards supported content element types.

That’s a shame, considering what Pandoc Markdown already is and what CommonMark is aspiring to become. But well, preferences may differ…

I’d suggest they instead maintain a format-agnostic policy for GITenberg input, and accept as many editable document formats as possible. On the backend they could then rely on Pandoc to convert between files/formats when repos are merged, using a diffing mechanism like the one described above. That way, contributors of all stripes could use whatever format they prefer, and have their edits/commits synced into whatever formats/files are present in the repo into which these are merged back.

The GITenberg Project is still in its infancy, but if the AST of the reference CommonMark implementation (and/or of Pandoc) provided the hooks required to enable syncing/diffing between different formats, that would definitely further the case for CommonMark as the de facto standard universal document format of the future.

Another application of extending the AST with the actual literals used in the input file’s verbatim markup concerns CommonMark’s own functionality (extensibility) and is similar to the one brought up by the OP of this thread.

If the AST stored the literal delimiters (such as the actual character used to denote an unordered list item), then an extension mechanism could use that readily available information to build custom functionality on top of standard CommonMark behavior. Semantically discriminating between + and - (pro and contra) could then be implemented more easily: a standard CommonMark implementation would tokenize both as unordered list items regardless, while the extension might add, for example, a class to further disambiguate. If the required information (the actual verbatim delimiter) were already available in the AST, then implementing such an extension might be reduced to using a simple walker, similar to Pandoc filters.
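Such a walker could be as small as the Python sketch below. The node shape is invented for illustration (plain dictionaries with `type`, `list_type`, `marker`, `classes`, and `children` keys); it is not the API of cmark, commonmark.js, or Pandoc, and the `marker` field is exactly the information this thread asks for.

```python
# Hypothetical mapping from bullet character to a semantic class.
MARKER_CLASSES = {"+": "pro", "-": "con"}


def annotate_lists(node):
    """Walk the tree and tag bullet lists with a class based on their marker."""
    if node.get("type") == "list" and node.get("list_type") == "bullet":
        css_class = MARKER_CLASSES.get(node.get("marker"))
        if css_class is not None:
            node.setdefault("classes", []).append(css_class)
    for child in node.get("children", []):
        annotate_lists(child)
    return node


document = {
    "type": "document",
    "children": [
        {"type": "list", "list_type": "bullet", "marker": "+", "children": []},
        {"type": "list", "list_type": "bullet", "marker": "-", "children": []},
        {"type": "list", "list_type": "bullet", "marker": "*", "children": []},
    ],
}
annotate_lists(document)
for child in document["children"]:
    print(child["marker"], child.get("classes", []))
```

Lists written with * are left untouched, so documents that do not use the convention render exactly as before.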

Is there any problem with implementing this for at least the bullet characters? In other words, ignoring the other cases, such as header syntaxes, for now.

At GitHub we’re trying to do some clever parsing of lists for the task lists feature. We are converting the source text to an AST to perform some manipulation, but before sending the text back we’ve found that the AST does not track bullet type. I’m more than happy to do a PR for this single change if it’s okay that it doesn’t take into consideration all other variations.

What if we don’t say anything about this in the spec? That allows conforming parsers to store this information (and do something with it) if they want to, but doesn’t require it.