I give my vote to @vitaly: the AST should be lossless, but I would add that there is a lot more to it than that. The AST has to reflect the elements in the source, not only the ones needed for rendering.
You cannot address everything with a source map. What would the source map look like for a reference definition, which has no rendering element at all? It may not even be referenced in the document, but it is still needed for syntax highlighting, completion assistance and error annotations.
A lossy AST is not useful for syntax highlighting or for intelligent completion helpers such as those implemented in IntelliJ products, for example.
I also agree with @vitaly on not bothering with any of this now, since it would require including the AST as a testable part of the spec, and then it would be an endless discussion of the minutiae.
I am the author of a Markdown plugin for JetBrains IDEs with syntax highlighting, completions, annotations and formatting: Markdown Navigator. I learned the hard way that a lossless AST is not an option but a requirement. For my purposes I forked commonmark-java and made a version, flexmark-java, with a lossless AST that reflects the exact markdown elements in the source, and a boatload of options that can tweak the parsing rules. I also included the AST as a testable part of the spec. Lots of bugs in the plugin came from the AST not being what was expected, or from missing or incorrect source position information for some part of an element.
Without a lossless AST, whoever needs the source representation will have no choice but to at least partially re-parse the source to complete what is missing. Double parsing is a nightmare to implement, test and maintain. I did that when pegdown was the parser used by the plugin and would not want to relive the experience.
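To make the requirement concrete, here is a minimal sketch of what "lossless" means in this context. The types and names are hypothetical, not flexmark-java's actual API: every node, including a reference definition that renders nothing, carries its exact source span.

```python
# Hypothetical sketch of a lossless AST node: every element keeps its exact
# source span, even elements that produce no rendered output at all, such as
# a reference definition.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                 # e.g. "paragraph", "reference_definition"
    start: int                # offset of the element's first source character
    end: int                  # offset one past its last source character
    children: list = field(default_factory=list)

    def source_text(self, source: str) -> str:
        # Lossless: the node can always reproduce its exact source slice,
        # so an editor never has to re-parse to find what the user typed.
        return source[self.start:self.end]

source = "[label]: /url\n\nSee [label] for details.\n"

# The reference definition renders no HTML, yet it must still be in the AST
# so an editor can offer completion, validation and error annotations.
ref_def = Node("reference_definition", 0, 13)
assert ref_def.source_text(source) == "[label]: /url"
```

With only a rendering-oriented (lossy) AST, the definition node would simply be absent, and the editor would have to re-scan the source to recover it.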
The issues facing an editor that syntax-highlights or, worse, assists the user in completing markdown text based on context are very different from those of processing a static file or producing interim results for preview purposes. They are very counter-intuitive until you encounter them in an implementation.
The AST is used for context, so it will affect the behaviour of the editor. What is reasonable or harmless for static processing, like the commonmark setext header rules, becomes a problem when the interim results are used to assist the user based on context. Here is an example:
```
- accidental header
  -
```
In the above, the user typed in a list item and started a sub-item, only to realize that he wants to change `header` to `heading`. He goes up a line and starts editing, but according to commonmark rules what he is editing is a setext heading in a list item, not just list item text followed by an empty sub-item.
The plugin has an assist that equalizes the setext header marker to the length of the heading text as you type, so it complies by changing the sub-item, which it sees as the setext heading marker, to match. What the user now gets is:
```
- accidental heading
  ------------------
```
Not a happy user because the editor failed to properly anticipate what he wanted and mucked up his text.
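The assist itself is trivial; the problem is the context the AST hands it. A minimal sketch (a hypothetical helper, not the plugin's actual code) of the "equalize the setext marker as you type" behaviour:

```python
def equalize_setext_marker(heading_text: str, marker_line: str) -> str:
    """Pad or trim a setext marker line ('---' or '===') to match the
    length of the heading text, preserving the marker line's indent."""
    indent = marker_line[:len(marker_line) - len(marker_line.lstrip())]
    marker_char = marker_line.strip()[0]
    return indent + marker_char * len(heading_text.strip())

# Given the AST's claim that "  -" is the setext marker for the heading
# "accidental heading", the assist dutifully produces the unwanted underline:
assert equalize_setext_marker("accidental heading", "  -") == "  " + "-" * 18
```

The function does exactly what it should; it is the parser's interpretation of the half-typed sub-item as a setext marker that turns a correct assist into mangled text.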
Trying to catch all these “exceptions” in the editor becomes an exercise in catching fleas. It is much easier to change the parser rules to not assume too much, at least for the syntax highlighting and context AST. Rendering can do whatever it likes; it only affects the visual presentation and will not screw up the user’s text, something most users, including myself, don’t take kindly to.
The same goes for list items interrupting paragraphs. “Wrap on typing” functionality can result in a `-`, `*` or `+` inside a paragraph being wrapped to the first character of the line. Now we get a list where none was intended. Again, trying to catch all this in the editor would require adding a `\` to all such elements when they wrap to the beginning of a line. What about block quotes, atx headings, etc.? The list can be very long. Is the user going to be happy about all the extra `\` in his text? I doubt it. They also don’t make the file easier to read. It is much saner to require blank lines before lists. A blank line is easy to add and easy on the eyes, unlike all the extra `\` in the text. I am not the first to bring this up, but from what I saw in the discussion these issues were swept aside as minor edge cases. I wholeheartedly disagree.
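To illustrate the kind of patch-up an editor is forced into when lists may interrupt paragraphs, here is a deliberately simplified sketch (hypothetical, not any real editor's code) of wrap-on-typing that escapes a marker lookalike when it lands at the start of a wrapped line:

```python
import textwrap

LIST_MARKERS = {"-", "*", "+"}

def wrap_escaping_markers(text: str, width: int) -> str:
    """Wrap text and escape a bare -, * or + that lands at the start of a
    wrapped line, so it cannot be re-parsed as a list item marker."""
    out = []
    for line in textwrap.wrap(text, width):
        if line.split(" ", 1)[0] in LIST_MARKERS:
            line = "\\" + line
        out.append(line)
    return "\n".join(out)

# The "-" wraps to the start of the second line, where it would become a
# list marker, so it gets a backslash the user never typed:
wrapped = wrap_escaping_markers("alpha beta - gamma", 11)
assert wrapped.splitlines() == ["alpha beta", "\\- gamma"]
```

And this sketch handles only bullet markers: ordered-list markers, block quotes and atx headings would each need their own escape rule, which is exactly the flea-catching argued against above. Requiring a blank line before a list makes all of it unnecessary.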
Whether it is better to allow lazy formatting is a futile debate without nailing down your use cases. If you are using markdown for quick throw-away to-do lists or comments on GitHub, then of course such loose rules make sense.
If you are creating something that takes more text and effort, something you probably would not want to create in a bare-bones notepad text editor, then you will probably want a more stringent spec that does not create surprises when you add a word or change your wrapping margins. On the other hand, a few extra spaces or blank lines are not an issue, because they can be easily addressed by your IDE or editor.
A lot of discussions here focus on the ease of creating markdown documents as if we were still using early K&R unix with a teletype interface and wanted to keep the noise level down. I would think that we should expect a bit more assistance from our development environment when creating documents, and the language standard should reflect this reality. Instead, decisions appear to be made to minimize keystrokes, at the cost of introducing ambiguity and counter-intuitive interpretation.
Which is better: to force the user to insert a blank line between a paragraph and the first list item, or to pepper the paragraph text with `\-`, `\*` and `\+` just to make sure no surprises are introduced if the text is wrapped to different margins or editing causes the special character to wrap to the beginning of a line? The answer depends on whether the user is writing a document in an editor or typing a quick comment on Stack Overflow or GitHub.
I think that this dilemma exposes some ambiguity in the purpose of the commonmark language spec. Is its purpose to be a standard for creating moderately serious documents with moderate effort that will predictably render everywhere and be easy to read without rendering to HTML, or is it to be the drop-in replacement for the Stack Overflow or GFM comment processor?
I think the two purposes will produce different and conflicting solutions and should not be combined under one spec. Even GFM differs in its rules for comments and documents: comment GFM has hard-wraps, documents do not.
Where Markdown’s ambiguity was approximately right, the commonmark spec is choosing to be precisely wrong. Setext headers not requiring a minimum of three `-` characters, lists not requiring a blank line between a paragraph and the first list item, and list indentation rules that can result in a single list with items indented to infinity are just a few examples.
If this does not change, then commonmark will wind up being just another markdown flavour to support, not the markdown standard to be used for creating long-life documents. That would be a shame.
P.S. I don’t see much point in standardizing markdown rules for ephemeral comments or short documents. These are easy enough to create in any dialect, and all sites give you a preview so you can make quick corrections. On the other hand, creating long-format documents in markdown absolutely needs a standard.
Some supporting details follow:
Commonmark list rules make even less sense when you see that they can cause this:
```
* item 1
 * item 2
  * item 3
   * item 4
    * sub item 1
     * sub item 2
      * sub item 3
       * sub item 4
        * sub sub item 1
```
To be rendered as:
```html
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
<li>item 4</li>
<li>sub item 1</li>
<li>sub item 2</li>
<li>sub item 3</li>
<li>sub item 4</li>
<li>sub sub item 1</li>
</ul>
```
At least the 3-space maximum indentation for list items used by other parsers gives the more intuitive:
```html
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
<li>item 4
<ul>
<li>sub item 1</li>
<li>sub item 2</li>
<li>sub item 3</li>
<li>sub item 4
<ul>
<li>sub sub item 1</li>
</ul>
</li>
</ul>
</li>
</ul>
```
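The difference between the two interpretations can be modelled with a couple of toy functions. These are simplifications for illustration, not either parser's actual algorithm:

```python
def nesting_levels(indents, max_sibling_indent=3):
    """Toy model of the classic rule: up to 3 extra spaces of indentation
    keeps an item at the same level, 4 spaces nests one level deeper."""
    return [indent // (max_sibling_indent + 1) for indent in indents]

def commonmark_levels(indents, marker_width=2):
    """Toy model of CommonMark's relative rule: an item nests only if it is
    indented at least to the previous item's content column (marker + space),
    so each item indented one more space than the last stays a sibling."""
    levels = []
    stack = []  # content columns of currently open list items
    for indent in indents:
        while stack and indent < stack[-1]:
            stack.pop()                # not at the inner item's content column
        levels.append(len(stack))
        stack.append(indent + marker_width)
    return levels

# Nine items, each indented one more space than the previous (0..8 spaces):
assert nesting_levels(range(9)) == [0, 0, 0, 0, 1, 1, 1, 1, 2]  # intuitive
assert commonmark_levels(range(9)) == [0] * 9                   # one flat list
```

Under the relative rule the indentation can keep growing forever without ever creating a sub-list, which is the "items indented to infinity" complaint above.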
## Parsing and AST for a User-Assisting Editor
Not only do you want an element parsed and in the AST when it does not render; you also want illegal elements that would be legal with minor changes. For example, a link reference without its reference defined should be in the AST. It is not unusual to create the ref link first and define the reference afterwards. If the ref link is parsed as plain text until its reference is defined, then the editor/plugin cannot help the user with an error highlight for the missing reference, nor help him get the reference label right by offering the undefined ref link and ref image references as a list of suggestions for the reference label.
For example, for purposes of HTML rendering:

1. Inline link

   ```
   [text](/url)
   ```

2. Link reference

   ```
   [text]: /url
   [text]
   ```

3. Link reference with empty reference

   ```
   [text]: /url
   [text][]
   ```

are the same, since they all render to identical HTML. However, for the purposes of syntax highlighting, completions and error annotations they are all very different.
Case 1. is an inline link and can present the user with options to:

- convert to a reference link
- change addressing from site-relative to absolute `http://`. Why the distinction? If the page is on GitHub, the `/` resolves relative to the `http://github.com/user/repo/blob/master` url. If it is in a GitHub wiki, it resolves relative to the `http://github.com/user/repo/wiki` url.
- do completions in `[]` for the link text and in `()` for a reference to a project document or file
- validate links to files/documents within the project and highlight incorrect links as errors
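The site-relative resolution distinction can be sketched as follows (a hypothetical helper; the GitHub base URLs are the ones described above):

```python
def resolve_site_relative(url: str, repo: str, wiki: bool = False) -> str:
    """Resolve a site-relative link against the correct base: pages in the
    repo resolve under blob/master, wiki pages resolve under /wiki."""
    if not url.startswith("/"):
        return url  # already absolute or page-relative; leave it alone
    base = repo + ("/wiki" if wiki else "/blob/master")
    return base + url

repo = "http://github.com/user/repo"
assert resolve_site_relative("/README.md", repo) == \
    "http://github.com/user/repo/blob/master/README.md"
assert resolve_site_relative("/Home", repo, wiki=True) == \
    "http://github.com/user/repo/wiki/Home"
```

The same `/url` text in the source thus points at two different targets depending on where the document lives, which is why an editor validating links has to know the distinction.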
Case 2. is a reference link with a reference definition, and the user gets different options for the two elements:

- reference link
  - option to inline the link
  - validation that the reference is defined
  - validation that the reference is a valid type, as for image references to non-image files
  - completions between `[]` give a list of all the references defined in the document
  - if an inline link after this element is converted to a reference link, an empty dummy reference `[]` is inserted automatically after this element to prevent it from becoming the reference text of the new reference link
- reference definition
  - same options as for an inline link, with the addition of a highlight when a reference definition is not referenced anywhere in the document
  - completion between `[]` shows any reference links whose reference is not defined; in other words, if you create reference links first and then create the references, you are aided in getting the reference text right
  - duplicate reference definitions are highlighted with a warning
Case 3. is mostly the same as Case 2., except for the possible completions in the empty `[]` reference, which give a list of all defined references.
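All of the checks above reduce to comparing the sets of defined and used reference labels. Here is a self-contained sketch of that bookkeeping, with a regex scan standing in for a walk of a real lossless AST (a real implementation would use the AST nodes and their source positions):

```python
import re

def reference_report(source: str) -> dict:
    """Collect defined and used reference labels (case-folded, mirroring
    CommonMark's label matching) so an editor can flag undefined references,
    offer their labels as suggestions for new definitions, highlight unused
    definitions, and warn on duplicates."""
    def_re = re.compile(r"^\[([^\]]+)\]:", re.MULTILINE)
    defs = [m.group(1).strip().casefold() for m in def_re.finditer(source)]
    body = def_re.sub("", source)          # don't count definitions as uses
    uses = {m.group(1).strip().casefold()
            for m in re.finditer(r"\[([^\]]+)\]", body)}
    return {
        "undefined": sorted(uses - set(defs)),   # error highlight + suggestions
        "unused": sorted(set(defs) - uses),      # "definition never referenced"
        "duplicates": sorted({d for d in defs if defs.count(d) > 1}),
    }

doc = "See [spec] and [guide].\n\n[spec]: /spec\n[spec]: /old\n"
report = reference_report(doc)
assert report["undefined"] == ["guide"]      # [guide] has no definition
assert report["duplicates"] == ["spec"]      # [spec] is defined twice
```

Note that none of this is possible if undefined references and unreferenced definitions are dropped from, or never appear in, the AST: the whole report depends on the "useless for rendering" elements being preserved.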