Beyond Markdown

jgm · April 17, 2018, 6:37am

In developing CommonMark, we have tried, as far as possible, to remain faithful to John Gruber’s original Markdown syntax description. We have diverged from it only occasionally, in the interest of removing ambiguity and increasing uniformity, and with the addition of a few syntax elements that are now virtually ubiquitous (like fenced code blocks and shortcut reference links).

There are very good reasons for being conservative in this way. But this respect for the past has made the CommonMark spec a very complicated beast. There are 17 principles governing emphasis, for example, and these rules still leave cases undecided. The rules for list items and HTML blocks are also very complex. All of these rules lead to unexpected results sometimes, and they make writing a parser for CommonMark a complex affair. I despair, at times, of getting to a spec that is worth calling 1.0.

What if we weren’t chained to the past? What if we tried to create a light markup syntax that keeps what is good about Markdown, while revising some of the features that have led to bloat and complexity in the CommonMark spec?

Let me be clear up front that I’m not suggesting any change in the goals of the CommonMark project. If these reflections lead to anything, it should probably be an entirely new project under a new name. And, being realistic, the burdens of maintaining backwards compatibility are light in comparison with the enormous practical costs of moving existing systems to a new light markup language. Still…I think it can be useful to daydream.

Six Markdown pain points

In what follows, I’ll go through the six features of Markdown that I think have created the most difficulties, and I’ll suggest how each pain point can be fixed.

Emphasis

In Markdown, emphasis is created by surrounding text with * or _ characters, *like this*. Strong emphasis is created by doubling these, **like this**. That all sounds very simple, and it’s visually clear which one is strong emphasis.

Unfortunately, these simple statements aren’t enough to pin down the syntax. Consider, for example,

**this* text**

Our simple rules are consistent with both of these readings:

<strong>this* text</strong>
<em><em>this</em> text</em>*

So, to fully specify emphasis parsing, we need additional rules. The 17 discouragingly complex rules in the CommonMark spec are intended to force the sorts of readings that humans will find most natural.

It seems to me that the use of doubled characters for strong emphasis, and the possibility of emphasizing even part of a word, as in fan*tas*tic, have made the problem of specifying emphasis parsing far worse, by vastly increasing the ambiguities the spec must resolve. Depending on context, a string of three *** in the middle of a word might be any of the following:

A * character followed by the beginning of strong emphasis.
The end of strong emphasis followed by a * character.
The end of normal emphasis, a * character, then the beginning of normal emphasis.
The end of strong emphasis followed by the beginning of normal emphasis.
The end of normal emphasis followed by the beginning of strong emphasis.
The end of normal emphasis followed by a literal **.
A literal ** followed by the beginning of normal emphasis.
Literal ***.

How to fix emphasis

To dramatically reduce ambiguities, we can remove the doubled character delimiters for strong emphasis. Instead, use a single _ for regular emphasis, and a single * for strong emphasis. Emphasis would now start with a left-flanking but not right-flanking delimiter and end with a right-flanking but not left-flanking delimiter of the same kind.

For intraword emphasis, we’d require a special syntax:

fan~_tas_~tic

Intraword emphasis is extremely rare, so it’s a good tradeoff to make it a little harder, in exchange for simplifying the rules (and conceptual model) for emphasis in general. The special character ~ here acts like a space for purposes of parsing emphasis (allowing the intraword _ to start and end emphasis), but isn’t rendered as a space. (It thus behaves like an escaped space does in reStructuredText.)
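
The opener/closer rule above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: "flanking" is reduced to "adjacent to a non-space character" (the punctuation cases in the CommonMark definitions are ignored), and the function name classify_delimiters is made up for this example.

```python
def classify_delimiters(text):
    """Label every '_' or '*' in text as 'open', 'close', or 'literal'.

    Simplified assumption: a delimiter is left-flanking when the next
    character is not a space, and right-flanking when the previous
    character is not a space. The '~' marker and the ends of the text
    count as spaces for this purpose.
    """
    def is_space(ch):
        return ch == "" or ch.isspace() or ch == "~"

    roles = []
    for i, ch in enumerate(text):
        if ch not in "_*":
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        left_flanking = not is_space(next_ch)
        right_flanking = not is_space(prev_ch)
        if left_flanking and not right_flanking:
            roles.append((i, ch, "open"))
        elif right_flanking and not left_flanking:
            roles.append((i, ch, "close"))
        else:
            # Flanking on both sides (intraword) or on neither: plain text.
            roles.append((i, ch, "literal"))
    return roles


print(classify_delimiters("fan~_tas_~tic"))  # [(4, '_', 'open'), (8, '_', 'close')]
print(classify_delimiters("fan_tas_tic"))    # both '_' come out as 'literal'
```

Each delimiter is classified by looking only at its two neighbours, with no rule needed to decide how a run like *** should be split.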

Reference links

The usual treatment of reference links makes it impossible to classify any syntax element until the whole document has been parsed. For example, consider

[foo][bar][baz]

[bar]: url

This is interpreted as

<p><a href="url">foo</a>[baz]</p>

But suppose we define a link for baz instead of bar:

[foo][bar][baz]

[baz]: url

Then we get:

<p>[foo]<a href="url">bar</a></p>

So, we can’t tell whether [foo] is literal bracketed text or a link with link description foo until we’ve parsed the entire document.

This makes syntax highlighting very difficult, and it also complicates writing parsers. For example, you can’t parse links, then resolve references in the AST after the document is parsed.

How to fix reference links

Make reference links recognizable by their shape alone, independent of what references are defined in the document. Thus,

[foo][bar][baz]

would be parsed as a link with link text foo to whatever URL is defined for reference bar (or to nothing, if none is defined), followed by literal text [baz].

Shortcut references like

[foo]

[foo]: url

would have to be disallowed (unless we were willing to force writers to escape all literal bracket characters). The compact form could be used instead:

[foo][]

[foo]: url

This is a bit more typing, but it makes it clear and unambiguous that there is a link.
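
To make the two-phase idea concrete, here is a rough Python sketch: links are recognized by shape alone, and references are resolved afterwards. The regular expression, the node tuples, and the function names are assumptions made for this example, and nested brackets and escapes are not handled.

```python
import re

# [text][label] or the compact form [text][]
LINK_RE = re.compile(r"\[([^\[\]]+)\]\[([^\[\]]*)\]")


def parse_inlines(text):
    """Split text into ('text', s) and ('link', text, label) nodes.

    No reference definitions are consulted here.
    """
    nodes, pos = [], 0
    for m in LINK_RE.finditer(text):
        if m.start() > pos:
            nodes.append(("text", text[pos:m.start()]))
        label = m.group(2) or m.group(1)  # [foo][] uses its own text as label
        nodes.append(("link", m.group(1), label))
        pos = m.end()
    if pos < len(text):
        nodes.append(("text", text[pos:]))
    return nodes


def resolve(nodes, definitions):
    """Second pass: replace each label with its URL (None if undefined)."""
    resolved = []
    for node in nodes:
        if node[0] == "link":
            _, link_text, label = node
            resolved.append(("link", link_text, definitions.get(label.lower())))
        else:
            resolved.append(node)
    return resolved


nodes = parse_inlines("[foo][bar][baz]")
print(nodes)                           # [('link', 'foo', 'bar'), ('text', '[baz]')]
print(resolve(nodes, {"bar": "url"}))  # [('link', 'foo', 'url'), ('text', '[baz]')]
```

The first pass gives the same node list whichever definitions exist, which is what a syntax highlighter needs.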

Indented code blocks and lists

Parsing indented code blocks is straightforward, but their presence complicates the rules for list items.

In specifying the syntax for list items, we need to say how far content must be indented in order to be considered part of the list item. The original Markdown syntax documents hinted at a “four-space rule,” requiring four spaces indentation, but implementations rarely followed that, and most people find it counterintuitive that

- a
  - b

wouldn’t be considered a nested list. So, in CommonMark, we surveyed a large number of possible rules, eventually ending up with a rule requiring the contents of the list item to be indented at least to the level of the first non-space content after the list marker:

  -  Item

     ^-- contents must be indented to here.

This is not a bad rule, but it adds complexity: one has to keep track not just of the position of the list marker, but of the position of the first non-space content that follows it. And then one needs special rules for cases like empty list items and list items that begin with indented code. Finally, many people still find it surprising that, for example, this isn’t a nested list:

- a
 - b

Thus one might ask: why not just require that the contents of a list item be indented at least one space past the list marker? That’s the obvious minimal rule. What blocks this is the presence of indented code blocks. If block-level content under a list item begins at one space indent after the list marker, then indented code would have to be indented five spaces past the list marker. Not only is that incompatible with the eight spaces indicated in the original Markdown syntax description, it leads to terrible results with longer list markers:

99.  Here's my list item.

     And this is indented code! Even though it
     lines up with the paragraph above!

To sum up: most of the complexity in the rules for list items is motivated by the need to deal with indented code blocks.

How to fix indented code blocks and lists

Fenced code blocks are now usually preferred to indented code blocks, because you can specify a syntax for highlighting and you needn’t indent/deindent when copying and pasting code. Since we have fenced code blocks, we don’t need indented code blocks. So, we can just get rid of them.

This frees up indentation to be used more flexibly to indicate list nesting, and we can embrace the simple, obvious rule that the contents of a list item must be indented at least one space relative to the list marker.

Another advantage of removing indented code blocks is that initial indentation can now be ignored in general, except insofar as it affects lists.
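
As a rough illustration of how little bookkeeping the one-space rule needs, the following Python sketch computes nesting depths from marker indentation alone. It assumes every input line is a bullet item; the helper names are invented for this example.

```python
def indent_of(line):
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def nesting_depths(lines):
    """Return the nesting depth of each bullet line.

    Rule sketched here: an item is nested inside the nearest open item
    whose marker sits strictly to its left. Only marker columns are
    tracked, not the column of the first non-space content.
    """
    stack = []   # marker indents of the currently open items
    depths = []
    for line in lines:
        indent = indent_of(line)
        # Close every open item that this marker is not indented past.
        while stack and indent <= stack[-1]:
            stack.pop()
        stack.append(indent)
        depths.append(len(stack))
    return depths


print(nesting_depths(["- a", " - b"]))                    # [1, 2]
print(nesting_depths(["- a", "  - b", "  - c", "- d"]))   # [1, 2, 2, 1]
```

Under this rule the one-space example that surprises people in CommonMark comes out as a nested list.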

Raw HTML

From the beginning, you could insert raw HTML into Markdown documents, and it would be passed through verbatim. The idea is that you could drop back to raw HTML for anything that can’t be expressed in plain text.

This sounds simpler than it is. From the beginning, Markdown.pl distinguished between inline and block level HTML. Inline HTML tags were passed through verbatim, but their contents could be interpreted as Markdown:

<em>**hi**</em>

would give you

<em><strong>hi</strong></em>

Block-level HTML content, it was stipulated, should be separated by blank lines, and the start and end tags should not be indented. In such HTML blocks, everything would be passed through verbatim, and not interpreted as Markdown. So,

<div>
*hello*
</div>

would just give you

<div>
*hello*
</div>

This raised several problems. First, how do we identify block-level content? Do we need to hard-code a list of HTML elements that may change as HTML evolves? What about elements like <del> that can occur in inline or block contexts?

Second, what about block-level HTML that is not properly separated and indented?

hi <div>
  hello</div>

Should parsers just treat it as inline HTML and generate invalid HTML?

Third, how do we identify the end of an HTML block? Given that tags can be nested, this requires nontrivial HTML parsing. The released version of Markdown.pl produced invalid HTML for a doubly-nested <div> element; a beta version designed to fix this problem had serious performance issues.

CommonMark’s spec for HTML blocks was designed to make it easy to parse raw HTML blocks (without indefinite lookahead or full implementation of HTML parsing), and also to make it possible for authors to include CommonMark content inside block-level HTML tags, if they wanted to. But the result is rather complex: seven distinct pairs of start and end conditions. The rules for inline HTML are also complex, with a large number of definitions.

In addition, as Markdown has become useful not just for creating HTML, but for creating documents in a number of different formats, the way HTML is singled out for raw pass-through has come to seem a bit arbitrary. Those who author in other formats would benefit from a way to pass through raw content, too.

How to fix raw HTML

Instead of passing through raw HTML, we should introduce a special syntax that allows passing through raw content of any format. For this we can overload our existing containers for raw strings, namely code spans and code blocks:

This is raw HTML: `<img src="myimage.jpg">`{=html}.

And here's an HTML block:

```{=html}
<div id="main">
 <div class="article">
```

But we can do LaTeX too:

```{=latex}
\begin{tikzpicture}
\node[inner sep=0pt] (russell) at (0,0)
    {\includegraphics[width=.25\textwidth]{bertrand_russell.jpg}};
\node[inner sep=0pt] (whitehead) at (5,-6)
    {\includegraphics[width=.25\textwidth]{alfred_north_whitehead.jpg}};
\draw[<->,thick] (russell.south east) -- (whitehead.north west)
    node[midway,fill=white] {Principia Mathematica};
\end{tikzpicture}
```

We could even pass through different raw content to different formats, for example including HTML and LaTeX versions of a complex figure.
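
A renderer for the inline form might work roughly as follows. This Python sketch makes several assumptions: the regular expression, the function name render_raw_inlines, and the choice to drop raw spans written for a different output format are illustrative and not part of the proposal. The block form and escaping of the surrounding text are left out.

```python
import re

# A code span immediately followed by {=format}
RAW_INLINE_RE = re.compile(r"`([^`]+)`\{=([A-Za-z0-9]+)\}")


def render_raw_inlines(text, target):
    """Pass raw spans through verbatim when their format matches target.

    Assumption: raw spans written for another format are omitted.
    """
    def substitute(match):
        content, fmt = match.group(1), match.group(2)
        return content if fmt == target else ""

    return RAW_INLINE_RE.sub(substitute, text)


source = 'This is raw HTML: `<img src="myimage.jpg">`{=html}.'
print(render_raw_inlines(source, "html"))   # This is raw HTML: <img src="myimage.jpg">.
print(render_raw_inlines(source, "latex"))  # This is raw HTML: .
```

Because the format is named explicitly, the same document can carry an HTML version and a LaTeX version of a figure side by side, and each writer picks up only its own.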

Lists and blank lines

Can a list interrupt a paragraph, like this?

Paragraph test.
- Item one
- Item two

The original Markdown syntax documentation does not settle this, but Markdown.pl and its test suite require a blank line between paragraph text and a following list. As the test suite indicates, this requirement was introduced in order to avoid accidental creation of lists by things like:

I think he weighed 200 pounds, maybe even
220.  But he was no more than five feet tall.

However, one exception was made: when the paragraph text is itself part of a list item, no blank line is required.

-   Paragraph one

    paragraph two
    - sublist item one
    - sublist item two

If this exception were not made, then we would not be able to recognize a nested list in this kind of case:

- a
  - b
  - c
- d

In thinking about the CommonMark spec for list items, we realized that the Markdown.pl behavior violates what we called the principle of uniformity, which says that the contents of a list item should have the same meaning they would have outside of the list item. This principle implies that if

a
- b
- c

does not contain a list, then

- a
  - b
  - c
- d

does not contain a sublist. We think that the principle of uniformity is important. Indeed, the way we specify list items and block quotes presupposes it. This means that we faced a choice: either require a blank line between paragraph text and a following list, or allow lists to interrupt paragraphs, and risk accidental interpretation of paragraph text as a list. We took the first option to be off the table, since it is very common in Markdown to have tight sublists without a preceding blank line. So we opted for the second option, mitigating the damage with an ugly heuristic (we only allow an ordered list to interrupt a paragraph when the list number is 1).

How to fix lists and blank lines

We should require a blank line between paragraph text and a list. Always. That means, even in sublists. So, to create a tight list with a sublist, you’d write:

- a

  - b
  - c

- d

We’ll say a list is tight if it contains at least one pair of items with no blank line between, so in the above example, the inner list is tight and the outer list is not. To get both lists tight:

- a

  - b
  - c
- d
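
The tightness rule can be checked mechanically. The sketch below is a simplified Python illustration that assumes "-" bullets only and treats "no blank line between two items" as "the line just before the second item's marker is not blank"; the function name is_tight is invented here.

```python
def is_tight(lines, marker_indent):
    """True if at least one pair of adjacent items at marker_indent
    has no blank line directly before the second item."""
    prefix = " " * marker_indent + "- "
    starts = [i for i, line in enumerate(lines) if line.startswith(prefix)]
    return any(lines[i - 1].strip() != "" for i in starts[1:])


first = ["- a", "", "  - b", "  - c", "", "- d"]
second = ["- a", "", "  - b", "  - c", "- d"]

print(is_tight(first, 0), is_tight(first, 2))    # False True
print(is_tight(second, 0), is_tight(second, 2))  # True True
```

Read literally, the rule also means that a list with a single item has no pair of items and so is not tight; a real specification would have to settle that case.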

Attributes

Markdown offers no general way to add attributes (such as classes or identifiers) to elements. This deprives it of a native way of creating internal links to sections of a document. (Many implementations have introduced subtly different ways of automatically generating identifiers from headers.) It also deprives it of a natural extension mechanism. Markdown has containers for inlines (e.g. emphasis), blocks (e.g. block quote), and raw inline content (code spans), and raw block content (code blocks). If arbitrary attributes could be attached to these, they could be manipulated by filters to produce very flexible output. For example, one could treat a block quote with the class “warning” as a warning admonition, or one could treat a code block with the class “dot” as a graphviz dot diagram, to be rendered as an image. Currently, though, the only way to attach attributes to an element is to drop down to raw HTML.

How to fix attributes

Introduce a syntax for an attribute specification. Following pandoc, use braces {} for this. An identifier is indicated with #. A bare word is treated as a class. Use = for an arbitrary key/value attribute.

Allow attributes to be added on the line before any block element and directly after any inline element:

{#myheader}
# The *Blue Title*{blue position=left}

Here the identifier myheader is added to the header block, and the class blue and key/value attribute position=left are added to the emphasized text Blue Title.

Attribute specifiers must fit on one line, but several may be used (and will then be combined):

{#mywarning}
{warning}
> Don't try this at home!
> It might be dangerous.

Perhaps it would be helpful to add a syntax for unadorned inline spans, and a fenced generic block container, as in pandoc. But we can use emphasis for an inline container and block quote for a block container, so this wouldn’t be absolutely necessary.
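
For illustration, an attribute specifier of this shape could be parsed along the following lines. This Python sketch is an assumption-laden outline rather than a specification: the Attributes class and the function names are hypothetical, tokens are split on whitespace, and quoted values containing spaces are not handled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Attributes:
    identifier: Optional[str] = None
    classes: List[str] = field(default_factory=list)
    pairs: Dict[str, str] = field(default_factory=dict)


def parse_attributes(spec):
    """Parse one specifier such as '{#id class key=value}'."""
    spec = spec.strip()
    if not (spec.startswith("{") and spec.endswith("}")):
        raise ValueError("not an attribute specifier: %r" % spec)
    attrs = Attributes()
    for token in spec[1:-1].split():
        if token.startswith("#"):
            attrs.identifier = token[1:]        # '#' marks an identifier
        elif "=" in token:
            key, value = token.split("=", 1)    # key=value attribute
            attrs.pairs[key] = value
        else:
            attrs.classes.append(token)         # a bare word is a class
    return attrs


def combine(specs):
    """Merge specifiers given on successive lines before a block."""
    merged = Attributes()
    for spec in specs:
        attrs = parse_attributes(spec)
        merged.identifier = attrs.identifier or merged.identifier
        merged.classes.extend(attrs.classes)
        merged.pairs.update(attrs.pairs)
    return merged


print(parse_attributes("{blue position=left}"))
print(combine(["{#mywarning}", "{warning}"]))
```

A filter could then look at the merged result, for example treating a block quote whose classes include warning as an admonition.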

Summary of recommendations

1. Emphasis

a. Use distinct characters for emphasis and strong emphasis.
b. Don’t use doubled-character delimiters.
c. Simplify emphasis rules.
d. Introduce special syntax for intraword emphasis, with ~ behaving like a space as far as parsing emphasis goes, but rendering as nothing.

2. Reference links

a. Don’t make parsing something as a link depend on whether a reference link definition exists elsewhere in the document.
b. Remove shortcut reference links.

3. Code

a. Remove indented code blocks. Use only fenced blocks for code.

4. Lists

a. Use a simple rule for determining what belongs under a list item: anything indented at all with respect to the list marker belongs in the item.
b. Require a blank line between paragraph content and a following list.
c. Revise rules for tight lists: a list is tight if any two items lack a blank line between them.

5. HTML

a. Remove automatic pass-through of raw HTML. Things like   will now be treated as regular text and escaped.
b. Introduce an explicit syntax for passing through raw content in an arbitrary format. In inline contexts, a code span followed by {=FORMAT}; in block contexts, a fenced code block with info string {=FORMAT}.

6. Attributes

a. Introduce a uniform attribute syntax, like this: {class #identifier key=value}.
b. Allow attributes on any block element. The attribute specifier must appear by itself on the line before the block element. Multiple attributes can be specified on successive lines; they will be combined.
c. Allow attributes on any inline element. The attribute specifier must appear immediately after (and adjacent to) the inline element to which it is to apply.

mity · April 17, 2018, 10:15am

From a technical point of view, I like it. Or at least most of it. And if this were how Markdown had looked from the very beginning, it would be great.

But I have strong doubts about its chances now, given the adoption of Markdown/CommonMark. And whether it wouldn’t make things actually worse by adding to the babel.

Crissov · April 17, 2018, 10:24am

I agree with a lot of those principles a partially backwards-incompatible Commonmark 2 could use, but I disagree with others, e.g. 1d, 2b, 4b and some details of 5 and 6. I suspect most people will not agree entirely with @jgm here. This probably makes it a non-starter. We do not even have all makers of markdown parsers on board for Commonmark support, then why would we need yet another slightly different language with the same scope?

Anyway, if you were to reinvent Markdown, I strongly suggest starting from general principles. If repeating marker characters, for instance, does nothing for emphasis markup, it should not be employed for determining the level of heading hierarchy and quotation nesting either.

jgm · April 17, 2018, 5:40pm

These elements go together; you can’t easily pick and choose a la carte:

If you reject 1d, then you have to either disallow intraword emphasis entirely (which I don’t think is a good idea), or you can’t have 1c (since much of the complexity of the emphasis rules can be avoided if you have a special syntax for intraword emphasis). Which do you prefer?
If you reject 2b, then you have to reject 2a, or treat all text within square brackets as links, and require escaping for normal square brackets. Which do you prefer?
If you reject 4b, then (as we’ve discussed elsewhere) you have to put in place some hacky heuristics in order to respect the principle of uniformity.

If most people did entirely agree with me, it would be the first time in my life that has happened!

chrisalley · April 18, 2018, 6:06am

English vs Esperanto. JavaScript vs CoffeeScript. RSS vs Atom. HTML vs XHTML. These are just four examples I could think of, but there will be many more. Languages that become and stay widespread aren’t necessarily the most simple or pure, and often contain baggage to maintain backward compatibility. People aren’t going to switch to a new lightweight markup language because it’s less complex, and if a simpler language is adopted, it likely won’t be because it’s simpler. It’s worth asking whether you want to invest a lot of time into a new language that is unlikely to replace Markdown.

There’s also an existing project for creating a successor language to Markdown: Markua. Markua targets formats other than HTML, supports attributes, and has the benefit of already being used in a real product (Leanpub). Have you thought about joining the Markua project, rather than creating more divergence?

That said, if there is going to be a project to create a successor to Markdown, I would suggest being conservative with the general syntax so that it’s easy for users to shift to the new language, while simplifying/cleaning up the details. Keep it as another Markdown flavour, so that users can use the same basic syntax for the most common elements such as links, headings, and emphasis; if a piece of software adopts the new language, users will be able to muddle through when making basic edits without having to relearn fundamentals.

In response to the points made:

(1) Single asterisks for emphasis are well established and quite intuitive. The general rule is simple - add more asterisks to add more emphasis. I would be cautious about abandoning this syntax. I’m not sure how common emphasis inside emphasis is, but this seems like an edge case that could be avoided by disallowing it and requiring writers to close one type of emphasis before opening the next, e.g.

**strong***normal***strong**

would become

<strong>strong</strong><em>normal</em><strong>strong</strong>

This seems natural, since when speaking you have to stop strongly emphasising something if you’re going to start regularly emphasising the next part.

(3) If removing indented code blocks is the only way to fix the list behaviour, this seems like a worthy sacrifice. It would be good to get some data on the number of people still using indented code blocks, but assuming it’s mainly programmers using this syntax, most should be familiar with the fenced style by now. If indented code blocks are removed, I suggest removing the other problematic significant white space syntax as well: the two space line break rule.

(4) Lists are an area where Markdown could be greatly improved, particularly with ordered lists. Making the list number significant so that users can write descending ordered lists or start an arbitrary number just makes sense and I’ve seen users confused when Markdown behaves otherwise. Furthermore, letter ordered lists are common enough to be a part of the language. And the indentation rules could be made simpler, like how you suggested.

(5) The main problem with not allowing raw HTML by default is the lack of alternative syntax for some elements. For example, highlighting a book title with <cite>. What about the use of <i> and <b>, which have a separate meaning from <em> and <strong>? The proposed addition of {=FORMAT} every time HTML is used makes using these elements much more verbose. And it’s not clear that we can create a new syntax that is more readable for every HTML element; some elements will probably be less readable.

I would be more in favour of a separate worlds approach, i.e. allow HTML to be added directly. When an HTML tag is started, it continues to be parsed as HTML until that HTML tag is closed, irrespective of any Markdown syntax added between it; Markdown and HTML syntax cannot be mixed together. Simplify the rules by removing the distinction between inline and block elements too.

(6) Could attributes be added as an extension to CommonMark? Is a new language/flavour required for this?

peterarmstrong · April 18, 2018, 6:24am

Markua is an open spec (https://leanpub.com/markua/read). I’d be thrilled if its attribute syntax (https://leanpub.com/markua/read#attributes) was used here, since I created it.

Now, Markua isn’t fully implemented in Leanpub still (see https://leanpub.com/markua/read#leanpub-authors for what’s unfinished), as we’re a bootstrapped startup and have a lot on our plate. But it’s pretty well thought through, and bug reports are welcome.

What I’d absolutely love to see, of course, is a Pandoc reader and writer for Markua. But anyone who would implement that at this point should be willing to tolerate a bit of uncertainty, since the Markua spec still may have bugs. Once it’s 100% implemented on Leanpub in the future, the number of bugs should presumably be lower…

Crissov · April 18, 2018, 7:06am

I wasn’t verbose enough. In 1d) I mostly just don’t like the alternative syntax you are proposing, but the general first is fine with me. Not adopting 2b) is indeed more important to me than 2a), and maybe there are other solutions to limit the unwanted effects. For 4a) and 4b) I believe that a viable compromise after 3a) would be to require either a blank line (i.e. 4b)) or indentation before a list nested inside a paragraph or another list. This could result from one of the more general principles I mentioned.

jgm · April 18, 2018, 4:44pm

That was in fact the aim here. I just thought about the features of Markdown (and commonmark) that have made the project of creating an unambiguous and not overly complex spec difficult, and proposed minimal changes to fix these. The result would be very much like commonmark and Markdown except in a few respects.

They’re established in Markdown. But before Markdown, of course, there were all sorts of conventions in plain text email, and one of the most common was to use different characters for *strong* and /regular/ emphasis. Many other light markup formats do the same. I think there’s a very good technical reason for avoiding the “doubled asterisk” syntax, as explained in my post. Anyway, it’s a suggestion I made from long experience fiddling with emphasis rules. It’s quite easy to think of rules that make sense for the cases you happen to have in mind—but then there are always other cases! I don’t think authors are going to want to give up the possibility of doing strong emphasis inside regular emphasis (or the reverse).

I think you’re aware of this already, but in case not: CommonMark already has both significant start numbers and letter-ordered lists.

With the addition of attributes, there’d be a cleaner solution to this problem. One could write [Book Title]{citation}, for example, and have the renderer turn this into something appropriate for the output format. I know some people think of Markdown as primarily a way to write HTML. That’s fine, but I prefer to think of light markup syntaxes as ways of writing documents that can be converted into many different formats. Given those interests, privileging HTML doesn’t make sense.

This is something to consider. Adding this attribute syntax (which I see now is very similar to markua’s, though my own inspiration was pandoc’s attribute syntax) would probably not change the interpretation of existing documents, since you’d be very unlikely to write one of these attribute specifiers for any other reason.

I’m trying to understand how the indentation idea would work. Presumably, the thought is that this would be a paragraph followed by a list:

foo
 - bar

But this would not be:

foo
- bar

Applying the principle of uniformity, then, here you’d have a sublist:

- foo
   - bar

But here you wouldn’t:

- foo
  - bar

Do I have it right? To me this seems a bit unnatural. It looks best to line up the markers with the content of the parent item in a sublist. Of course, I realize that on your proposal there’d still be the option of using a blank line, but I wonder whether it’s worth the extra complexity. After all, people will spontaneously create lists without the indentation; they’ll have to learn that this doesn’t work, and that they need to do something special. Why not just have them learn that they need a blank line?

westurner · April 18, 2018, 6:24pm

Attribute support could be extremely useful for RDFa and/or JSON-LD support. Currently, the only way to embed structured data in CommonMark is with HTML and RDFa (RDF-in-attributes).

Schema.org/CreativeWork #examples for document-level metadata would be a great start. And then how to point to a Schema.org/Dataset?

Support for Linked Data from attributes should be a priority.

Crissov · April 19, 2018, 6:43am

With indented code blocks gone, the rules get a lot simpler.

paragraph
- paragraph

paragraph
 - list

paragraph
    -    list

paragraph

- list

- list
 - nested list

- list
      -    nested list

- list

- list

- list

 - nested list?

aidantwoods · April 19, 2018, 6:20pm

You had me at removing doubled character delimiters from emphasis

I really like the changes to simplify emphasis. I’m not entirely sure I like the choice of the ~ delimiter but I think the key thing you’ve identified is that intraword emphasis should have some kind of special marking so that we ensure that it is intended. I wonder if we could make this idea a more general concept, a kind of “anti-escape” if you will.
Delimiter characters that depend on context (like emphasis) could either be “active”, “passive”, or “inactive”. “Active” ones would always render (the effect of your fan~_tas_~tic), “passive” ones would render if context allows (i.e. the left/right flanking rules are satisfied), and “inactive” ones would just be literal characters. By default all delimiter characters are passive, unless marked as active, or inactive (escaped).
The immediate problem I see with this is that context also defines whether a delimiter is an opener or a closer (since both use the same character), so this “active” marker would have to allow that to be specified (e.g. as you’ve done by placing the ~ on the left or right of the delimiter).
I really like the recommendations for reference links.
Indented code blocks aren’t very useful, and they do indeed complicate things. Getting rid of them is a good solution.
I like the moving of HTML to a proper block and inline syntax. `'s are markdown’s way of allowing odd text to be passed through untouched – allowing custom treatment of the untouched text (to HTML, or other) is a very natural extension.
I like the idea here, I think that a class should have a . prefix personally for uniformity with css, and there are probably some details to work out with 6b and 6c.

Perhaps there would even be a way of reconciling this {...} syntax with the {=...} proposed in 5. I know these will be distinguishable cases as defined, but is there perhaps more that can be done? e.g. suppose that {=...} was allowed before any block and was a general “argument passing syntax”. We could perhaps use {=(delimiter: roman, start: 6)} before a list to dictate the list counter for example. Not that we have to do it exactly like this, but perhaps something to think about with regard to unifying custom data inputs for behaviours with blocks/inlines. This is so that we are not just limited to doing this kind of thing in code blocks and code spans.

Crissov · April 20, 2018, 7:28pm

If intraword emphasis needs special syntax, why not use doubled markers only there?

foo _emph_ baz
foo *strong* baz
foo_bar_baz
foo*bar*baz
foo _bar_baz
foo *bar*baz
foo_bar_ baz
foo*bar* baz
foo__emph__baz
foo**strong**baz
foo __emph__baz
foo **strong**baz
foo__emph__ baz
foo**strong** baz

chrisalley · April 21, 2018, 2:29am

I’m having trouble seeing the use cases for emphasis within emphasis. The main example I found in the spec was for use in bibliographies, but is emphasis the right element to use for these, rather than text offset from normal prose? The original Markdown spec was released before HTML5 reclassified some of the older tags to differentiate emphasis and other alternative prose, but presumably a successor to Markdown would want to take the range of different text elements that are rendered alternatively into account. Previously I wrote that I thought the forward slash would be suitable for marking up alternative voice/mood.

The original Markdown syntax guide does not mention emphasis inside of emphasis being a requirement of Markdown. Perhaps we need to support this in CommonMark because Markdown implementations already support this but a successor can be freed from these constraints, particularly the behaviour of Markdown.pl.

Were letter-ordered lists added to CommonMark? I couldn’t find mention of them in the latest version of the spec.

The [Book Title]{citation} syntax you mentioned isn’t bad; the main concern I have is that it’s another syntax to learn for an author who already knows HTML. Since Markdown was originally designed as a lightweight syntax for “issues that can be conveyed in plain text”, there was always a way that web authors could fall back to heavier features without learning lots of extra syntax. If the language aims to be more general, it is less targeted at that specific audience. This was, I believe, one of the motivations for encouraging different flavours of Markdown, rather than having one general syntax for everyone.

To give a software analogy, we have Apple and Microsoft, who have vastly different strategies when it comes to user interfaces. Windows 10 has a very general interface that is designed for both touch screens and mouse/trackpad inputs. Apple, on the other hand, has two very distinct user interfaces with iOS and macOS, the former consisting of thicker icons, the latter featuring thinner UI elements that allow for very precise and subtle movements. Generally, the UI of Windows 10 attempts to reach some kind of middle ground, which makes it arguably not the best UI for either input type, but with the benefit of being more universal and compatible.

So if the successor language is aiming to move away from being a superset of HTML to something more general, that might be less appealing to someone using it for the very specific purpose of web authoring. HTML first, everything else second, already works well for those users. If the goal of the successor format is indeed to become more general and universal, that’s something that should probably be explicit in the goals of the project so that people can decide if it’s the right language for them.

If the goals of the two projects are close enough, it might be worth making this an official successor to CommonMark, a “CommonMark Strict” or “CommonMark Lite”, with regular CommonMark acting as a transitional spec for users coming from the various loosely specified Markdown specifications. But if the goals are indeed fundamentally different (from Markdown), a new language would make more sense.

jgm · April 21, 2018, 4:08am

My mistake. I thought they had been – I remember advocating for them in the early discussions we had – but I guess they weren’t.

vas · April 21, 2018, 8:52am

I think Markdown should stick to content including its semantic structure (This is a heading. This set of items belongs to a list, an ordered as opposed to unordered one.) and not style (Center the H1 heading and underline it. Show order using letters).

This separation of concerns is properly followed between HTML and CSS. One sets a list up with letters in CSS, via the list-style-type property (which supports many options; traditional Katakana iroha numbering, anyone?), not in HTML. It’s also why in HTML it’s strong and emphasis, not bold and italic.

mb21 · April 21, 2018, 12:15pm

Very interesting proposal! A few comments:

Emphasis

Couldn’t agree more. The current markdown rules are confusing to explain to new users as well.

Though before making a definite decision, I would love to know current emphasis usage statistics. Are there any usage numbers of markdown documents in the wild?

(Not so sure about fan~_tas_~tic though, but I guess why not?)

Indented code blocks and lists

To remain somewhat more backwards compatible with CommonMark, instead of getting rid of indented code blocks entirely, one could also simply disallow them within lists, blockquotes and the like, but keep allowing them when not nested in another element (which should account for 99% of existing uses).

Raw HTML

{=html} is great, but what about using Markdown inside HTML? Like:

<aside>
  my _great_ text
</aside>

Maybe if we had a generic block container (like the ::: in pandoc), Markdown inside HTML wouldn’t really be needed anymore.

Attributes

As you probably know, I’m all in on attributes. (Attribute discussion on this forum.) The specifics for different elements are a bit trickier to figure out (e.g. paragraphs, lists, list items), but I can see how placing the attribute before block elements, as you propose, might help in parsing.

several may be used (and will then be combined):

Not sure about that, isn’t the following simpler?

{warning #mywarning}

Dreaming?

Interestingly, most of the proposed changes are things only markdown-power-users would notice anyway. For example, most people just fiddle with lists until it’s right in the preview.

However, Emphasis and Raw HTML are two things almost everyone who has come across markdown somewhere is familiar with and would probably find annoying if it doesn’t work the way he/she expects anymore. Not sure what that means though… maybe if it weren’t for those two changes it could even pass through as “CommonMark v2”?

jgm · April 21, 2018, 6:48pm

Here you can just do something like

```{=html}
<aside>
```
my _great_ text
```{=html}
</aside>
```

I didn’t mean to exclude this. You can have several attributes in one attribute block. But, to limit the need for lookahead in parsing, it’s convenient to limit attributes to one line, so if you have a lot of attributes, it’s nice to be able to have several attribute blocks that will be combined.

mb21 · April 22, 2018, 8:42am

I know you’re a fan of hard-wrapping, but personally, I would rather have one overlong line with a single attribute block (that’s why we have attributes: to put technical junk in there which isn’t part of the text but still necessary sometimes), rather than several attribute blocks that I have to remember are actually a single one (just doesn’t feel natural to me, although I can see that it would be better for the parser). But either way, it’s probably somewhat of a detail…

jgm · April 22, 2018, 4:18pm

On the proposal, you could still just use one long line with all the attributes. I just want to leave the possibility of having several attribute blocks that are consolidated. That would be nice for people who like to hard-wrap to a certain width. Besides, the parser needs to do something if it encounters multiple attribute blocks in a row, and consolidating them this way seems the most natural choice.
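As a rough illustration of that consolidation, here is a minimal sketch in Python. Everything in it is an assumption made for the example rather than part of any spec: the function names `parse_attr_block` and `merge_attr_blocks`, the rule that a bare word such as `warning` counts as a class, the rule that a later id overrides an earlier one, and the lack of support for quoted values containing spaces. The point is only that each line can be classified on its own, so the parser never has to look ahead.

```python
import re

# Hypothetical one-line attribute block, e.g. {warning #mywarning key=value}.
ATTR_BLOCK = re.compile(r"^\{([^{}]*)\}\s*$")


def parse_attr_block(line):
    """Return a dict for one attribute line, or None if the line is not one."""
    match = ATTR_BLOCK.match(line.strip())
    if match is None:
        return None
    attrs = {"id": None, "classes": [], "pairs": {}}
    for token in match.group(1).split():
        if token.startswith("#"):
            attrs["id"] = token[1:]
        elif "=" in token:
            key, value = token.split("=", 1)
            attrs["pairs"][key] = value
        else:
            # Assumption: a bare word (or .word) is a class name.
            attrs["classes"].append(token.lstrip("."))
    return attrs


def merge_attr_blocks(lines):
    """Combine consecutive attribute lines; return (attrs, remaining lines)."""
    merged = {"id": None, "classes": [], "pairs": {}}
    index = 0
    while index < len(lines):
        attrs = parse_attr_block(lines[index])
        if attrs is None:
            break  # first non-attribute line: the block being annotated starts here
        if attrs["id"] is not None:
            merged["id"] = attrs["id"]  # assumption: a later id wins
        merged["classes"].extend(attrs["classes"])
        merged["pairs"].update(attrs["pairs"])
        index += 1
    return merged, lines[index:]
```

Under these assumptions, `{warning}` followed by `{#mywarning}` on the next line yields the same merged attributes as the single line `{warning #mywarning}`, so both writing styles discussed here give the same result.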

piro_or · May 1, 2018, 3:12am

I disagree with this idea, because I believe that Markdown was loved by many people who were totally tired of overly complex formats - rst, RD, MediaWiki, and more. I believe that any extension of the Markdown format itself must keep backward compatibility; otherwise, people who want to extend Markdown without compatibility should move to one of the other known complex formats designed for the purpose - for example, raw HTML. It is well designed and structured.

Simple format, less expressive, but enough for most popular cases - I believe that was the design strategy of Markdown. As a result, today non-programmers are also using Markdown - writers, designers. So I think it’s a bad idea to break compatibility for just a few demands - it would lose Markdown’s value. How many people really want such a complex combination of multiple "*"s?

Thus, for people who suffer from Markdown’s limited expressiveness, it is time to graduate from Markdown and migrate to one of the other known major formats. But please don’t drag other people into complex syntax hell…