retaining all features of the source document … a ridiculous amount of detail … in the AST
That would indeed make no sense, for it would basically require serializing all edge cases, or rather all equivalent notations, for each and every supported input format (in Pandoc’s case). That would effectively come down to storing a complete clone of the verbatim input alongside its abstract, normalized representation.
If one wanted to implement a diffing mechanism to support collaborative editing scenarios (which, as suggested above, would sync between different formats/files of the same document), it would obviously be up to that implementation to keep track of, cache, and invalidate/update the versions/states of the files involved; that is not the responsibility of the CommonMark parser (or of Pandoc, for that matter).
Yet, when files in different formats are to be synced, git’s line/string-based diffing, for example, would not suffice, and one would need to be able to hook into the CommonMark parser (or Pandoc for formats other than CommonMark), or at least use its serialized output. Diffing would then have to be done first between the ASTs of both files, and only thereafter between the updated ASTs and their corresponding files’ verbatim contents, instead of simply overwriting the target files with the default (renormalized) output of the respective target format’s writers.
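As a rough illustration of that first step (AST against AST), here is a minimal sketch. It assumes each AST has already been flattened into a list of `(type, text)` tuples, one per block; that flattening and the name `diff_blocks` are hypothetical and not part of any CommonMark or Pandoc API. Python’s standard `difflib` does the sequence comparison.

```python
import difflib

def diff_blocks(old, new):
    """Return the changed regions between two flattened ASTs.

    `old` and `new` are lists of (type, text) tuples: an assumed,
    simplified stand-in for real AST blocks.
    """
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only inserts, deletes and replacements
    ]

old_ast = [("Header", "Intro"), ("Para", "Hello world"), ("Para", "Bye")]
new_ast = [("Header", "Intro"), ("Para", "Hello brave world"), ("Para", "Bye")]
changes = diff_blocks(old_ast, new_ast)
```

Only the blocks reported in `changes` would then need to be patched into the target file’s verbatim text; everything else could be left untouched.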
It’s still unclear to me how one would best implement diffing the target file’s verbatim string against the AST’s structured data object, but I guess one could loop over all nodes in the tree of `t`(ype) `Str`(ing) or `Space`, join their `c`(ontents) values, use these re-assembled strings to (fuzzily) find the corresponding text in the target file, and update where applicable.
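A minimal sketch of that loop, assuming nodes shaped like `{"t": ..., "c": ...}` as in Pandoc’s JSON output. The function names `text_runs` and `locate` are made up for illustration, and the “fuzzy” part is limited here to tolerating different whitespace (e.g. line-wrapping) in the target file; real fuzzy matching would need more than this.

```python
import re

def text_runs(node):
    """Yield the joined text of each maximal run of Str/Space nodes."""
    if isinstance(node, list):
        run = []
        for child in node:
            t = child.get("t") if isinstance(child, dict) else None
            if t == "Str":
                run.append(child["c"])
            elif t == "Space":
                run.append(" ")
            else:
                if run:  # a different node type ends the current run
                    yield "".join(run)
                    run = []
                yield from text_runs(child)
        if run:
            yield "".join(run)
    elif isinstance(node, dict):
        yield from text_runs(node.get("c"))

def locate(run, target):
    """Find `run` in `target`, letting each space match any whitespace.

    Returns a (start, end) span, or None if nothing matches.
    """
    pattern = r"\s+".join(re.escape(word) for word in run.split(" "))
    match = re.search(pattern, target)
    return match.span() if match else None

doc = {"t": "Para", "c": [
    {"t": "Str", "c": "Hello"}, {"t": "Space"},
    {"t": "Emph", "c": [{"t": "Str", "c": "brave"}]},
    {"t": "Space"}, {"t": "Str", "c": "new"},
    {"t": "Space"}, {"t": "Str", "c": "world"},
]}
target = "Hello _brave_ new\nworld\n"
spans = [locate(run, target) for run in text_runs(doc)]
```

Each span marks where a run of AST text sits in the verbatim file, so a changed run could be written back at that position without touching the surrounding markup.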
If instead of the content (string value), the node’s type needed to be updated, then it would be helpful, I guess, if the AST would provide the preferred verbatim markup to be used. This would be, I think, unnecessary while syncing between different formats (e.g. .tex ↔ .md ↔ .docx), but not so when the files to be synced both use the same markup language/syntax with alternative notation (.md ↔ .markdown ↔ .text). E.g. both are valid CommonMark, but one has line-wrapping applied and uses `*` for emphasis, `+` for unordered list items, and setext-style headers, while the other uses `_`, `-`, atx-style headers, and no wrapping.
Maybe it could all boil down to “just” extending the AST with node properties for verbatim markup, but only for those cases wherein the CommonMark spec allows for alternative notation (header styles, list item markers, inline emphasis), line-wrapping perhaps excluded.
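To make that a bit more concrete, here is a minimal sketch of what such node properties might look like. The `Node` class, its `markup` field and the keys `style`, `marker` and `delimiter` are all hypothetical; nothing like this exists in the CommonMark spec or in Pandoc’s AST. The point is only that a writer could consult the stored notation and fall back to its defaults otherwise.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    type: str                                   # "text", "emph", "heading", "item"
    children: list = field(default_factory=list)
    text: str = ""
    level: int = 1                              # heading level
    markup: dict = field(default_factory=dict)  # hypothetical verbatim-notation hints

def render_inline(node):
    if node.type == "text":
        return node.text
    inner = "".join(render_inline(c) for c in node.children)
    if node.type == "emph":
        delim = node.markup.get("delimiter", "*")  # `*` or `_`
        return f"{delim}{inner}{delim}"
    return inner

def render_block(node):
    inner = "".join(render_inline(c) for c in node.children)
    if node.type == "heading":
        # setext notation only exists for levels 1 and 2
        if node.markup.get("style") == "setext" and node.level <= 2:
            underline = ("=" if node.level == 1 else "-") * len(inner)
            return f"{inner}\n{underline}\n"
        return f"{'#' * node.level} {inner}\n"
    if node.type == "item":
        return f"{node.markup.get('marker', '-')} {inner}\n"
    return inner + "\n"

heading = Node("heading", [Node("text", text="Title")], markup={"style": "setext"})
item = Node(
    "item",
    [Node("text", text="a "),
     Node("emph", [Node("text", text="b")], markup={"delimiter": "_"})],
    markup={"marker": "+"},
)
```

Nodes without hints render in the writer’s default notation, so the extra properties would only matter where the source actually deviates from it.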
Pardon my confused wording; I probably don’t see all too clearly where I’m heading myself. But I do see value in preparing the AST, as much as is practically doable, for collaborative editing use cases, for which predictable, non-destructive syncing between files would be a crucially important requirement.