Does the spec require loss of input data when parsing?

seii · April 8, 2021, 4:14am

I have been doing some work on a CommonMark parser lately, and ran into a problem. When I am converting from raw CommonMark input to anything else, there’s an intermediary step which I would think of as “the AST step” - input has been parsed into an object tree, but not rendered to other languages like HTML. From this step, I wrote my parser such that it preserves whitespace, newlines, etc. so you could convert back into exactly the same raw CommonMark input.

Comparing my work against the “AST” tab in the dingus, I now see that my parser wouldn’t match the official AST in a lot of places. This is because the CommonMark spec recommends changes to input (such as the raw contents of an ATX heading being stripped of leading and trailing whitespace, for example) which would prevent knowing the original CommonMark input once parsing is complete.

Can anyone help me figure out whether parsing CommonMark truly requires losing data from the original input? Am I missing something? (I would think of anything which irreversibly loses data from the original input as steps to be performed only before rendering the AST to another language, but the spec doesn’t seem to support that idea.)

mity · April 8, 2021, 4:58pm

Does the spec require loss of input data when parsing?

The specification says “only” how the Markdown input should be transformed to an output format (HTML; it’s left to people’s common sense to keep the behavior as closely as possible when converting to another format).

It does not specify anything about intermediate phases such as the AST representation. I believe, it’s a good thing because it’s actually possible to implement an (almost) SAX-like parser, which needs to keep around relatively little compact information about the block structure of the input. For example my MD4C works that way.

Also note there is no one-to-one mapping between the input and output. The same output can be encoded in many ways in Markdown input. I.e. the specification does not care at all about a possibility of re-creating the exact input from any parsed representation. If you need to achieve that, feel free to implement so, but you’re simply going further than the specification requires.

In your particular situation I can see two possible approaches how to deal with it:

Do the transformation of the source as required when creating the AST, but provide also additional data structures providing enough information for reconstructing the original source if needed; or
Split the “AST” into two phases: “input AST” and “output AST”, so all the transformation requiring any loss of information would take place in the transition from the former to the latter. So e.g. the bogus whitespace or removal of link reference definitions would be done in that transition.

seii · April 8, 2021, 7:26pm

I will start by saying thank you, because based on your response it looks like I’m not actually blocked by what I read in the spec and I can continue on my project.

Separately, I’d like to contrast your reply against the current About this document and ATX Headings sections of the spec as examples. I mean no offense to your reply, but I believe the spec as written disagrees with part of it.

The spec does not appear to agree:

This document attempts to specify Markdown syntax unambiguously. It contains many examples with side-by-side Markdown and HTML.
…
Since this document describes how Markdown is to be parsed into an abstract syntax tree, it would have made sense… [continued explanation on why HTML is used to represent an AST instead of some unique layout]

This appears to set up two things the document does - specify Markdown syntax unambiguously and also specify how Markdown is parsed into an AST (which merely happens to be represented by HTML). There’s also the entire Appendix A where “parsing” is shown to be “parsing to an AST” rather than “parsing to HTML”.

Further, the ATX Headings (as one example) contain the instruction

Leading and trailing whitespace is ignored in parsing inline content

It specifically says “in parsing”. The whitespace in this example (Example 37) is valid Markdown, but if these whitespace alterations are not performed as part of parsing to an AST then this line of the spec (as currently written) has been broken. To say it differently, if parsing to an AST does not contain these whitespace alterations then it has broken the spec and is not valid CommonMark.

If by “parsing” the spec intends to mean “transform CommonMark into another output” instead of “parsing raw Markdown into an AST”, these formatting instructions would indeed make sense… except, again, that both the “about” section and the appendix clearly describe “parsing” as transforming raw Markdown syntax into a CommonMark AST.

Hopefully this shows why I found things so confusing. Have I misunderstood anything in these examples?

mity · April 8, 2021, 11:54pm

I believe this sentence only means that for any given input, you should get an unambiguous expected output, based on the specification wording.

It certainly does not mean that two different inputs have to lead to two different (unique) outputs. Consider e.g. all the possible ways how to encode the same link in a Markdown document, or ATX versus setext headers etc.

And once again, the specification does not care what exactly you do during processing the input, as long as you generate the expected final output.

I.e. a “minimal” compliant implementation is in principle losing an information in the process. I’m just noting that nobody prevents you to do a “better” implementation which provides the output as expected by the specification and additionally it allows to retrieve any supplementary data allowing you to reconstruct the original input byte by byte if you wish.

vas · April 9, 2021, 9:37am

@seii,

I think part of your confusion stems from your misunderstanding of what an AST is. Your original idea that an AST should losslessly contain all the information of the source contradicts the notion of an AST. What you describe is closer to a concrete syntax tree or a parse tree, though even those usually do not retain extraneous whitespace from the source. The leading and trailing whitespace in an ATX heading are extraneous and have no bearing on the author’s intended heading. Even the non-extraneous required space between the # and the heading text would not be represented in an abstract syntax tree because it only has purpose within the concrete textual syntax.

Usually when there is a need to retain facts about the source, parsers will attach it to AST nodes as extra metadata. The reference implementation does this for source line numbers. In your implementation, if you want to produce a true AST but also retain the source as-is, you could attach a separate “raw source” attribute to each node. I know of parsers that do exactly this. OR you can do whatever you want, including what you have now (it’s an intermediate representation, but not a true AST), as long as the final output conforms.

This document attempts to specify Markdown syntax unambiguously.

Yes, “Markdown syntax”. The concrete textual syntax and its semantics, but not a specific abstract syntax tree, not an “official AST” as you put it. Cmark and commonmark.js are reference implementations only. The only mention of “abstract syntax tree” in the entire spec is that one paragraph you reference. The spec describes what you might call the “abstract abstract syntax tree” using formal English (there is no formal grammar). Thus I agree with @mity: any AST that conforms to this “AAST” is perfectly legit. The fact that we have many different CommonMark implementations that each use different ASTs bares this out.

seii · April 9, 2021, 2:49pm

What you said here made me do some research, and I think you’re right. My understanding has indeed been closer to a concrete tree than an abstract one, so that helps me see what I missed. Thank you!

Credit to both @mity and @vas for hearing me out on this. I think the simplistic answer to my original question is “yes”, but the full answer from you both is closer to, “yes parsing to an AST requires data loss… because ASTs are about the syntax, not the ‘words’. ASTs can also include metadata, extra nodes, etc. so use those to retain non-syntactic information.”

vas · April 9, 2021, 5:53pm

No problem. By the way, I meant “extraneous” not “spurious”. Sorry. That didn’t help your understanding. I fixed my reply.

I wouldn’t use the term “loss” or the phrase “data loss”. From the perspective of the semantics of the syntax, the lead and trailing whitespace are not data, just as the leading zeros of 0008.5920 are not data if the semantics of that string are floating point number. The trailing zero, on the other hand, would be data if and only if it represented actual precision of a measurement. It is not lossy at all if extraneous text is stripped when parsing to an AST. And to the extent it is a true AST, one would want to remove anything not pertinent to the abstraction that the AST represents.

But if your purpose were not an AST but a in-memory literal representation of the source text, e.g. if you were writing a text editor, or a concrete syntax tree to be used by a syntax highlighter, then of course it would be “data loss” if you stripped anything.