Is the spec specifying a data structure for markdown?

jacksonkernion · April 30, 2017, 6:05am

I might be late to the game here, but I was wondering: why doesn’t the spec separate the markdown ‘data structure’ from acceptable markdown ‘syntax’?

For instance: the spec includes two different ways of syntactically specifying ‘headings’: ATX headings and Setext headings. But there isn’t a single category for ‘headings’; instead, ATX and Setext headings each appear as separate entries in the leaf block category, alongside code blocks, HTML blocks, link reference definitions, paragraphs, and blank lines.

I would have thought that the spec would detail the abstract data structure to which markdown documents must be convertible, and then separately specify the various ways of marking up text documents so that they are convertible to that universal markdown structure.

(A related question: why are HTML blocks included here at all? I would have thought that HTML blocks come into play only when converting markdown content to HTML. But any content in markdown should (or so I would have thought) abstract away from the way it’s output into a particular format. Furthermore, since HTML blocks can include multiple lines, wouldn’t they be container blocks instead of leaf blocks?)

I may be missing something, but here’s how I’d like to see the spec’s leaf block section structured:

4 Leaf Blocks
4.1 Thematic Breaks
4.2 Headings
4.2.1 ATX Headings
4.2.2 Setext Headings
4.3 Code Blocks
4.3.1 Indented Code Blocks
4.3.2 Fenced Code Blocks
4.4 Link reference definitions
4.5 Paragraphs
4.6 Blank lines

(I ask this because I’m working on a ‘metadata layer’ for markdown documents, and it’s important for that project to understand whether markdown-formatted text should just be the ‘content’ property of a larger data structure.)

jgm · April 30, 2017, 7:36am

The spec tries to remain neutral on what kind of data structure you might use to represent a document, or indeed on whether an intermediate data structure is needed at all. These are implementation issues. (In the reference implementations, both setext and ATX headers get parsed into a generic header node, but that’s just one way to do it.)
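
To illustrate the "generic header node" idea, here is a minimal Python sketch in which both syntaxes produce the same node. The names (`Heading`, `parse_atx`, `parse_setext`) are hypothetical, and the regular expressions are deliberately simplified: they are not the reference implementations' code and skip several edge cases the spec covers (for example, multi-line setext heading text).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Heading:
    """Generic heading node; it does not record which syntax produced it."""
    level: int
    text: str


# Simplified patterns (assumption: enough for illustration, not spec-complete).
ATX_RE = re.compile(r"^ {0,3}(#{1,6})(?:[ \t]+(.*?))?(?:[ \t]+#+)?[ \t]*$")
SETEXT_UNDERLINE_RE = re.compile(r"^ {0,3}(=+|-+)[ \t]*$")


def parse_atx(line: str) -> Optional[Heading]:
    """Parse a single-line ATX heading such as '## Title'."""
    m = ATX_RE.match(line)
    if m is None:
        return None
    return Heading(level=len(m.group(1)), text=(m.group(2) or "").strip())


def parse_setext(text_line: str, underline: str) -> Optional[Heading]:
    """Parse a two-line setext heading: a text line plus an '=' or '-' underline."""
    m = SETEXT_UNDERLINE_RE.match(underline)
    if m is None or not text_line.strip():
        return None
    level = 1 if m.group(1)[0] == "=" else 2
    return Heading(level=level, text=text_line.strip())


atx = parse_atx("## Title")
setext = parse_setext("Title", "-----")
print(atx == setext)  # both syntaxes yield the same Heading node
```

Whether to keep the original syntax on the node (say, to round-trip the source) is exactly the kind of implementation choice the spec leaves open.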

HTML blocks are included because Markdown has always allowed you to include raw HTML sections in a document; these are passed through unchanged to HTML output (and usually just ignored in other target formats). HTML blocks are not container blocks, because container blocks are blocks that contain other blocks (not just inline or textual content).
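
The leaf/container distinction can be sketched as a difference in what a node holds. In this hypothetical Python model (the class names and the `render` function are illustrative assumptions, not part of the spec or any reference implementation), an HTML block stores only raw text, however many lines it spans, while a block quote stores other blocks. The renderer passes raw HTML through for HTML output and drops it for a plain-text target. Escaping of paragraph text is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Paragraph:
    text: str  # leaf: holds textual content only


@dataclass
class HtmlBlock:
    raw: str  # leaf: may span many lines, but never contains other blocks


@dataclass
class BlockQuote:
    # container: its content is a list of other blocks
    children: List["Block"] = field(default_factory=list)


Block = Union[Paragraph, HtmlBlock, BlockQuote]


def render(block: Block, target: str = "html") -> str:
    """Render a block for 'html' or any other (plain-text) target."""
    if isinstance(block, HtmlBlock):
        # Raw HTML is passed through unchanged for HTML, ignored otherwise.
        return block.raw if target == "html" else ""
    if isinstance(block, Paragraph):
        return f"<p>{block.text}</p>" if target == "html" else block.text
    if isinstance(block, BlockQuote):
        parts = [render(child, target) for child in block.children]
        inner = "\n".join(p for p in parts if p)
        if target == "html":
            return f"<blockquote>\n{inner}\n</blockquote>"
        return inner
    raise TypeError(f"unknown block type: {type(block).__name__}")


doc = BlockQuote([Paragraph("hi"), HtmlBlock("<div>\nraw\n</div>")])
print(render(doc, "html"))
print(render(doc, "text"))
```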

jacksonkernion · April 30, 2017, 9:39am

Thanks! This clears up some confusion for me.