Should there be additional information in the AST?

Continuing the discussion from Issues we SHOULD resolve before 1.0 release:

Currently we lose information about:

  • whether a link was an autolink
  • whether it was a reference link, what kind, and what label
  • what the bullet character was for a list item
  • what character was used for emphasis/strong emphasis
  • whether the 1. or 1) style of ordered list was used.
  • whether a code block was backtick or indented

Should some or all of these things be part of the AST, or is this too “concrete”? (This would mainly affect conversion back to CommonMark, though other renderers could decide to be sensitive to these things.)

I am all for these being available in the AST, as I would like to be able to write a CommonMark-compliant renderer.

I believe these have already been implemented in commonmark.js (see the sketch after this list):

  • what the bullet character was for a list item (using bulletChar)
  • whether the 1. or 1) style of ordered list was used. (using listDelimiter)
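
For example, a quick sketch using the commonmark.js walker and its public accessors (the exact output values are what I would expect from the versions under discussion):

var commonmark = require("commonmark");

var doc = new commonmark.Parser().parse("1) one\n2) two\n");
var walker = doc.walker();
var event;
while ((event = walker.next())) {
    if (event.entering && event.node.type === "list") {
        console.log(event.node.listType);      // "ordered"
        console.log(event.node.listDelimiter); // ")" rather than "."
    }
}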

In addition, as noted elsewhere:

  • the type of hard line break in use (backslash or double space)

I will add more as I come across them. With regard to the point made on GitHub about not knowing where in the file link definitions are made, I don’t think this is an issue. I suggest that the Document node have a links object representing link definitions, which Markdown renderers can append to the end of the file. While some information is lost, the document would function correctly when re-parsed, and it is my convention to always declare link definitions at the bottom of the document anyway.
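
As a purely hypothetical sketch (the definitions property and its shape are illustrative; neither exists in commonmark.js today), a markdown renderer could serialize that object at the end of its output:

// Hypothetical: doc.definitions maps a reference label to its
// destination and optional title.
function renderLinkDefinitions(doc) {
    var out = [];
    for (var label in doc.definitions) {
        var def = doc.definitions[label];
        out.push("[" + label + "]: " + def.destination +
                 (def.title ? ' "' + def.title + '"' : ""));
    }
    return out.join("\n");
}

Appending this block after the rendered body loses the definitions’ original positions, but the document parses back to the same AST.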

In my experience writing html2commonmark, I found these things are also hard to recover:

  • Whether an h1 or h2 header was an ATX header or a [setext header](http://spec.commonmark.org/0.24/#setext-headings). This one is important because there is a functional difference: a setext header can span multiple lines, while an ATX header cannot.
  • Whether a space was produced by an &nbsp; or by an actual space. This might even be a bug. See example 295

Markdown

&nbsp; &amp; &copy; &AElig; &Dcaron;
&frac34; &HilbertSpace; &DifferentialD;
&ClockwiseContourIntegral; &ngE;

AST

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CommonMark SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text>  &amp; © Æ Ď</text>
    <softbreak />
    <text>¾ ℋ ⅆ</text>
    <softbreak />
    <text>∲ ≧̸</text>
  </paragraph>
</document>
  • The newlines after a code block inside a loose list. This might even be a bug in the commonmark.js parser, I’m not sure.

See example 232

1.      indented code

   paragraph

       more code

This converts to this AST:

<document xmlns="http://commonmark.org/xml/1.0">
  <list type="ordered" start="1" tight="false" delimiter="period">
    <item>
      <code_block> indented code
</code_block>
      <paragraph>
        <text>paragraph</text>
      </paragraph>
      <code_block>more code
</code_block>
    </item>
  </list>
</document>

I’m missing the space before ‘more code’. I was expecting this AST:

<document xmlns="http://commonmark.org/xml/1.0">
  <list type="ordered" start="1" tight="false" delimiter="period">
    <item>
      <code_block> indented code
</code_block>
      <paragraph>
        <text>paragraph</text>
      </paragraph>
      <code_block> more code
</code_block>
    </item>
  </list>
</document>

If I find more, I’ll let you know.

A &nbsp; will appear in the AST as a Unicode nonbreaking space (U+00A0). That is a different character from a regular space.
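
A minimal check with commonmark.js (assuming the npm commonmark package) shows the entity already resolved in the text node’s literal:

var commonmark = require("commonmark");

var doc = new commonmark.Parser().parse("&nbsp;");
var text = doc.firstChild.firstChild; // document -> paragraph -> text
console.log(text.literal === "\u00a0"); // true: U+00A0
console.log(text.literal === " ");      // false: not a plain space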

Why were you expecting a space there? The ‘more code’ doesn’t start in the same column as the ‘indented code’ above: relative to the list item’s content column, ‘indented code’ is indented five spaces (one more than the four that start an indented code block), while ‘more code’ is indented exactly four.

Here’s a really weird edge case that shows why this would be useful, especially for turning markdown into an AST and then back into markdown:

Say you turned this bit of markdown into an AST:

_*Hello, world*_

You’d get this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CommonMark SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <emph>
      <emph>
        <text>Hello, world</text>
      </emph>
    </emph>
  </paragraph>
</document>

An AST -> markdown converter might spit this out:

**Hello, world**

Resulting in bold rather than italic.

This markup actually makes sense if a custom extension turns _ into HTML i and * into em, which can be useful in some scenarios, e.g. linguistics.

This is why the AST -> commonmark translation is tricky to get right. But I think cmark’s commonmark writer does a pretty good job, without storing any information about the original markers:

% echo "_*Hello, world*_" | cmark -t commonmark
*_Hello, world_*
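
The trick needs no stored state beyond nesting depth. A minimal sketch of the idea (not cmark’s actual code, and handling only text and emph children):

function renderEmph(node, depth) {
    // Alternate the delimiter by depth so directly nested emph nodes
    // don't merge into a strong-looking "**".
    var marker = depth % 2 === 0 ? "*" : "_";
    return marker + renderChildren(node, depth + 1) + marker;
}

function renderChildren(node, depth) {
    var out = "";
    for (var child = node.firstChild; child !== null; child = child.next) {
        out += child.type === "emph" ? renderEmph(child, depth)
                                     : child.literal;
    }
    return out;
}

// renderEmph applied to emph(emph("Hello, world")) at depth 0
// yields "*_Hello, world_*".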

Leaving aside the small details, some tasks require writing markdown back out without deviations:

  1. A Markdown editor with a highlighter. The user changes the highlighted text -> the original should be updated.
  2. Copy-paste from a webpage to an editor: the user selects a forum message with the mouse, and the corresponding piece of markdown source should be copied to the editor as a quote.

A proper implementation requires source map support (roughly like the source maps for JS/CSS in browsers) AND exact restoration of the original with patches applied.

IMHO, the AST should be lossless to be useful for complex transforms.

Supporting both of these features requires only accurate source maps, not additional information in the AST. Source maps are a separate issue, I think… clearly they are desirable, it’s just a matter of designing your parser to produce one. (I didn’t get this right in the current reference implementations, which don’t store source position for inline elements.)
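
To illustrate what is there today: commonmark.js already records block-level positions, e.g. (the output comments are roughly what I’d expect):

var commonmark = require("commonmark");

var doc = new commonmark.Parser().parse("para one\n\n- item\n");
for (var node = doc.firstChild; node !== null; node = node.next) {
    // sourcepos is [[startLine, startCol], [endLine, endCol]], 1-based
    console.log(node.type, JSON.stringify(node.sourcepos));
}
// paragraph [[1,1],[1,8]]
// list [[3,1],[3,6]]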

I can’t agree with this statement. The end goal (IMHO) is to have homeomorphic transforms between different data formats. A source map is just one of the components required to do the reverse transformation. A lossy AST rules out homeomorphic transforms.

My point is not about source maps, but about correct data flow in a black box. The AST can be designed independently, and (IMHO) it should be lossless.

Probably some kinds of tasks can be implemented with a lossy AST, but I’d prefer to have a mathematically correct foundation. That’s some kind of guarantee that no kludges will be needed in the future.

You gave two examples of features that supposedly require more information in the AST, and I pointed out that neither does: they both simply require source maps. If there are better examples, feel free to give them.

An AST is abstract – in general it represents only the essential structure of the program or document, abstracting from syntactic details in which this is presented. For example, some ASTs omit source code comments altogether. Some ASTs omit details like whether a string was represented with single quotes or double quotes, in languages where both are permitted. So it’s up to us to decide which features must be represented in an AST. It is generally not expected or required that an AST contain enough information to reproduce the source in every detail. If there is a reason to require this in this case, it needs to be articulated and defended.

ADDED: Note that, for full reversibility, we’d have to go far beyond the list in the message at the beginning of this thread; the amount of information the AST would have to contain would be staggering. Just consider inline links. What kind of delimiter was used for the title in a link, ', ", or (? How many spaces were there between the URL and the title? Were there any newlines that got collapsed to spaces? If it was a reference link, was it a shortcut, compact, or regular? Was there whitespace between the two components, and if so what? Were any of the characters in the title or link description escaped? I’m only getting started here…

I base my suggestions on experience with JavaScript ASTs and processing tools like ESLint. As far as I remember, there was a lot of trouble due to missing info about comments. And I also consider it normal to keep info about delimiter types and so on. A linter for markdown is probably also a useful task.

Anyway, before defining AST details, it would be nice to define its goal clearly. Right now it looks driven by “feelings” :slight_smile:. If we start to consider each node without a general approach, it will take ages of iterations to make something stable.

Maybe my suggestion looks too complex, but I haven’t seen any alternative that answers the question “why will this AST be good enough?”.

The original goal of the AST was to provide a representation of a structured document that could be rendered into a number of different formats, preserving semantics. The current AST meets that goal.

If you have a different goal (e.g., writing a linter that can make structure-aware changes to a document without unnecessary changes to the details of the concrete representation) then a different kind of AST may be necessary.

Of course, it’s not our intention to prevent you from using whatever kind of AST you find necessary for your purposes. The only question is what, if anything, the spec should say about it.

What I think now is that the spec shouldn’t say anything about it. The spec should describe the structure that a conforming parser needs to preserve. This structure won’t include, e.g., a distinction between + bulleted lists and - bulleted lists. But nothing stops you from preserving this structure if you want to.

My personal thought is that we are not ready to freeze any kind of data structure. I’m OK if these things are discussed in the scope of the reference implementation, without propagation to the spec. The current focus on document conversion is too narrow, IMHO.

I’d suggest postponing spec-related things until source maps are complete, because they will enable a lot of completely new use cases where the AST (structures?) can be involved.

While there is already a parsing strategy as an appendix, I would not mind having an appendix that describes what kind of information you could store in an AST + source map. But it is highly dependent on the use cases (you barely need the original semantics + source span for an HTML-only converter, but for a syntax highlighter you definitely need them), so I’m not sure it should be part of a 1.0 spec (just a nice-to-have)… This could be added later without any impact on the core spec.

Also a note about the separation between source maps and the AST: in modern parsers that provide infrastructure both for compilers and for intellisense/syntax highlighting, this strict separation is no longer true, and the AST is often able to carry lossless information, with attached pre/post tokens for each AST node (including all spaces, etc.).
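
A sketch of that node shape (names purely illustrative, in the style of Roslyn’s leading/trailing trivia; no markdown parser discussed here exposes this today):

// Every token around a node's content is kept, so concatenating
// trivia + content over the whole tree reproduces the source exactly.
var item = {
    type: "item",
    leadingTrivia:  [{ kind: "bullet", text: "*" },
                     { kind: "whitespace", text: " " }],
    children:       [ /* inline nodes, each with their own trivia */ ],
    trailingTrivia: [{ kind: "newline", text: "\n" }]
};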

Could the additional information be added in a post-1.0 release (of the reference implementation) on an as-needed basis, rather than holding up version 1.0 of the spec? Since an abstract syntax tree is optional for a conforming parser, and the structure of the AST is not currently defined in the spec, it seems like a separate problem from the completion of the spec.

I give my vote to @vitaly: the AST should be lossless, but I would add that there is a lot more to it than that. The AST would have to reflect the elements in the source, not only the ones needed for rendering.

You cannot address everything with a source map. What would the source map look like for a reference definition which has no rendering element at all? It may not even be referenced in the document but is needed for syntax highlighting, completion assistance and error annotations.

A lossy AST is not useful for syntax highlighting nor intelligent completion helpers as implemented in IntelliJ products, for example.

I also agree with @vitaly on not bothering with any of this now, since it would require including the AST as a testable part of the spec, and then it would be an endless discussion of the minutiae.

I am the author of a Markdown plugin for JetBrains IDEs with syntax highlighting, completions, annotations and formatting, Markdown Navigator. I learned the hard way that having a lossless AST is not an option but a requirement. For my purposes I forked commonmark-java and made a version, flexmark-java, with a lossless AST that reflects the exact markdown elements in the source, and a boatload of options that can tweak the parsing rules. I also included the AST as a testable part of the spec. Lots of bugs in the plugin came from the AST not being what was expected, or from missing or incorrect source position information for some part of an element.

Without a lossless AST, whoever needs the source representation will have no choice but to at least partially re-parse the source to complete what is missing. Double parsing is a nightmare to implement, test and maintain. I did that when pegdown was the parser used by the plugin and would not want to relive the experience.

The issues facing an editor that syntax-highlights or, worse, assists the user in completing markdown text based on context are very different from those of processing a static file or producing interim results for preview purposes. They are very counter-intuitive until you encounter them in an implementation.

The AST is used for context so it will affect the behaviour of the editor. What is reasonable or harmless for static processing, like the commonmark setext header rules, becomes a problem when the interim results are used to assist the user based on context:

Here is an example:

- accidental header
  - 

In the above, the user typed in a list item and started a sub item, only to realize that he wants to change “header” to “heading”. He goes up a line and starts editing, but according to commonmark rules, what he is editing is a heading in a list item, not just list item text followed by an empty sub item.

The plugin has assistance for equalizing the setext header marker to the length of the heading text as you type, so it complies by changing the sub item, which it sees as the setext heading marker, to match. What the user now gets is:

- accidental heading
  ------------------ 

Not a happy user because the editor failed to properly anticipate what he wanted and mucked up his text.

Trying to catch all these “exceptions” in the editor becomes an exercise in catching fleas. It is much easier to change the parser rules to not assume too much, at least for the syntax highlighting and context AST. Rendering can do whatever it likes; it only affects the visual presentation and will not muck up the user’s text, something most users, including myself, don’t take kindly to.

The same goes for list items interrupting paragraphs. “Wrap on typing” functionality can result in a mid-sentence -, *, or + being wrapped to the first character of the line. Now we get a list where none was intended. Again, trying to catch all this in the editor would require adding a \ to all such elements when they wrap to the beginning of a line. What about block quotes, ATX headings, etc.? The list can be very long. Is the user going to be happy about all the extra \ in his text? I doubt it. They also don’t make the file easier to read. It is much more sane to require blank lines before lists: easy to add and easy on the eyes, unlike all the extra \ in the text. I am not the first to bring this up, but from what I saw in the discussion these issues were swept aside as minor edge cases. I wholeheartedly disagree.

Whether it is better to allow lazy formatting or not is a futile debate without nailing down your use cases. If you are using markdown for making quick throwaway to-do lists or comments on GitHub, then of course such loose rules make sense.

If you are creating something that takes more text and effort, something you probably would not want to create in a bare-bones notepad-style editor, then you will probably want a more stringent spec that will not create surprises when you add a word or change your wrapping margins. On the other hand, a few extra spaces or blank lines are not an issue, because they can be easily addressed by your IDE or editor.

A lot of discussions here focus on the ease of creating markdown documents as if we were still using early K&R Unix with a teletype interface and wanted to keep the noise level down. I would think that we should expect a bit more assistance from our development environment in creating documents. The language standard should reflect this reality. Instead, decisions appear to be made to minimize keystrokes, introducing ambiguity or counter-intuitive interpretations.

Which is better: to force the user to insert a blank line between a paragraph and the first list item, or to pepper paragraph text with \-, \*, \+ just to make sure no surprises are introduced if the text is wrapped to different margins or editing causes the special character to wrap to the beginning of a line? The answer depends on whether the user is writing a document in an editor or typing a quick comment on Stack Overflow or GitHub.

I think this dilemma exposes some ambiguity in the purpose of the commonmark language spec. Is its purpose to be a standard that allows creating moderately serious documents with moderate effort, documents that will predictably render everywhere and be easy to read without rendering to HTML? Or is it to be the drop-in replacement for the Stack Overflow or GFM comment processor?

I think the two purposes will produce different and conflicting solutions and should not be combined under one spec. Even GFM differs in its rules for comments and documents: comment GFM has hard-wraps, documents do not.

Where Markdown’s ambiguity was approximately right, the commonmark spec is choosing to be precisely wrong. Setext headers not requiring a minimum of three -, lists not needing a blank line to separate them from a preceding paragraph, and list indentation rules that can result in a single list with items indented to infinity are just a few examples.

If this is not changed then commonmark will wind up being just another markdown flavour to support, not the markdown standard to be used for creating long-life documents. That would be a shame.

P.S. I don’t see much point in standardizing markdown rules for ephemeral comments or short documents. These are easy enough to create in any dialect, and all sites give you a preview so you can make quick corrections. On the other hand, creating long-format documents in markdown absolutely needs a standard.


Some supporting details follow:

Commonmark list rules make even less sense when you see that they can cause this:

* item 1
 * item 2
  * item 3
   * item 4
    * sub item 1
     * sub item 2
      * sub item 3
       * sub item 4
        * sub sub item 1

To be rendered as:

<ul>
    <li>item 1</li>
    <li>item 2</li>
    <li>item 3</li>
    <li>item 4</li>
    <li>sub item 1</li>
    <li>sub item 2</li>
    <li>sub item 3</li>
    <li>sub item 4</li>
    <li>sub sub item 1</li>
</ul>

At least the 3-space maximum indentation for list items in other parsers gives the more intuitive:

<ul>
    <li>item 1</li>
    <li>item 2</li>
    <li>item 3</li>
    <li>item 4
        <ul>
            <li>sub item 1</li>
            <li>sub item 2</li>
            <li>sub item 3</li>
            <li>sub item 4
                <ul>
                    <li>sub sub item 1</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

Parsing and AST for a User-Assisting Editor

Not only do you want an element parsed and in the AST when it does not render, you also want illegal elements that would be legal with minor changes. For example, a link reference without the reference being defined should be in the AST. It is not unusual to create the ref link first and define the reference afterwards. If the ref link is parsed as plain text until its reference is defined, then the editor/plugin cannot help the user with an error highlight on the missing reference, nor help him get the reference label right by offering the undefined ref link and ref image references as a list of suggestions for the reference label.
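
A hypothetical node shape for such an unresolved reference (illustrative only, not an existing API):

var unresolvedRef = {
    type: "link_ref",
    label: "text",            // what the user typed between []
    defined: false,           // no matching definition seen yet
    sourcepos: [[12, 1], [12, 6]]
};

// An editor can collect nodes with defined === false to drive error
// highlights and to suggest labels when the user types a definition.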

For example, for purposes of HTML rendering:

  1. Inline link

    [text](/url)
    
  2. Link reference

    [text]: /url
    [text]
    
  3. Link Reference with empty reference

    [text]: /url
    [text][]
    

are the same, since they all render to identical HTML. However, for the purposes of syntax highlighting, completions and error annotations, they are all very different.
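
For reference, the identical HTML output is:

<p><a href="/url">text</a></p>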

Case 1. is an inline link and can present a user with options to:

  • convert to reference link
  • change addressing from site-relative to absolute http://. Why the distinction? If the page is on GitHub, the / resolves relative to the http://github.com/user/repo/blob/master URL. If it is in a GitHub wiki, it will resolve relative to the http://github.com/user/repo/wiki URL.
  • completions can be done in [] for the link text and in () for the reference to a project document or file
  • validation is done on links to files/documents within a project and errors or incorrect links highlighted

Case 2. is a reference link with a reference definition, and the user gets different options for the two:

  • reference link

    • option to inline a link
    • validation that the reference is defined
    • validation that the reference is a valid type, as for image references to non-image files
    • completions between [] give a list with all the references defined in the document
    • If an inline link after this element is converted to a reference then an empty dummy reference [] will be inserted automatically after this element to prevent it from becoming the reference text of the new reference link.
  • reference definition

    • same options as for an inline link with the addition of highlighting when a reference definition is not being referenced in the document
    • completion between [] will show any reference links whose reference is not defined. In other words, if you create reference links first and then create the references, you are aided in getting the reference text right.
    • duplicate reference definitions are highlighted with a warning

Case 3. is a reference link with a reference definition, mostly the same as Case 2., except for possible completions in the empty [] reference, which give a list of all defined references.


Thanks for all this feedback. The spec and reference implementations were certainly designed with rendering in mind, not on-the-fly highlighting, and you’re right that the latter task imposes different requirements. For now, I don’t think the spec should say anything in particular about AST elements; we’ll leave this up to implementations.

Your comment mixes remarks about the AST (the topic of this thread) with several other comments, including some criticisms of the current list and setext header rules, and the suggestion that perhaps there should be two flavors, one for internet forums and another for long-form writing.

Let me just say that a lot of thought has gone into the existing rules. There are strong reasons for choosing the current list indentation rules over, say, a one-space indent or three-space indent or four-space indent rule (see section 5.2.1 in the spec for explanation). And there is a strong reason for not requiring blank lines before lists. (This is also articulated in the spec, see after Example 264.) We have recently mitigated the problem of unwanted lists from hard-wrapping by adding the rule that an ordered list item can interrupt a paragraph only when it starts with 1. You can still get unwanted lists if you hard-wrap a paragraph containing a -, *, or + with space on both sides, but these cases should be pretty rare. If we were designing something afresh, rather than trying to give a rational formulation to an existing markup, I would have preferred the reStructuredText approach, which requires blank lines before lists, including sublists, but I think going this way is ruled out by an interest in maintaining backwards compatibility.

If you want to comment further on these spec issues concerning lists (or setext headers), please open new threads (or search for existing threads on these subjects), since this discussion doesn’t really fit into the topic of the current thread.


@jgm, sorry for mixing topics. I will comment in the appropriate list item thread and heading threads.


I’d like to bump this topic with a somewhat different view. Some of us have already reached the point where a full (lossless) AST becomes mandatory. At least me and @vsch; I think @codinghorror too, even if he doesn’t know it yet :slight_smile:

Since designing & implementing such an AST requires a lot of work, I’m interested to know who could participate in any valuable form: sharing experience for an AST spec, coding in different languages, donations.

My personal plan, working standalone, is to get to such things in about a year. But if some collaboration is possible, I could revisit my priorities. I’m also interested in making the reference parser pluggable, because that’s the main reason we maintain a different implementation instead of committing to the mainstream one.
