Strict Markdown subset

Hi all – Do you think it would be helpful to define a strict, unambiguous subset of Markdown/CommonMark? What I mean is that Markdown by default allows for several different ways to do lots of things, and the CommonMark spec allows for all sorts of extreme artifacts and syntax.

For example, a strict subset would specify one way to do emphasis (e.g. asterisks only), one way to do headings (e.g. octothorpes, called ATX I think), one way to do unordered lists (e.g. hyphens), and so forth. It could also strictly limit the length of delimiter runs and lock down other syntactic artifacts.
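
To make the idea concrete, here is a rough sketch of what checking a document against such a subset might look like. The three rules (ATX headings only, hyphen bullets only, asterisk emphasis only) are just the examples from this post, and the function name and regular expressions are my own assumptions, not anything from the CommonMark spec. It is a line-based heuristic, so it will misjudge some inputs (for example, a `* * *` thematic break is reported as a bullet).

```python
import re

# Hypothetical rules for an illustrative "strict" subset; not part of any spec.
SETEXT_UNDERLINE = re.compile(r"^ {0,3}(=+|-+)\s*$")
NON_HYPHEN_BULLET = re.compile(r"^\s*[*+]\s+\S")
UNDERSCORE_EMPHASIS = re.compile(r"(?<!\w)_{1,2}[^_\s][^_]*_{1,2}(?!\w)")


def strict_violations(text):
    """Return (line_number, message) pairs for lines outside the assumed subset."""
    problems = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        prev = lines[i - 1] if i > 0 else ""
        # A run of '=' or '-' directly under a text line is a setext underline.
        if SETEXT_UNDERLINE.match(line) and prev.strip():
            problems.append((i + 1, "setext heading underline; use ATX (#)"))
        elif NON_HYPHEN_BULLET.match(line):
            problems.append((i + 1, "bullet marker must be a hyphen"))
        # Underscores inside words (snake_case) are deliberately not flagged.
        if UNDERSCORE_EMPHASIS.search(line):
            problems.append((i + 1, "emphasis must use asterisks"))
    return problems
```

Calling `strict_violations(document_text)` on a file returns an empty list when none of the three assumed rules is broken.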

I wonder if having a much simpler strict subset would help clarify the full CommonMark spec, or make it easier to write. Certainly writing the spec for the strict subset would be easier than the CommonMark spec. And writing parsers would be easier.

Is there some other way such a project or subproject would help CommonMark? I don’t have any strong opinions here. It’s fine if it’s orthogonal and needs to be its own project.

This might seem similar to a Markdown linter that forces one way of doing each thing, but a linter isn’t a spec, and probably wouldn’t spawn lean standalone parsers.

Those of you who have written parsers: how much easier do you think parsing would become?

(A good name would be Stark, abbreviating STrict mARKdown, and a nod to the true King in the North. :grinning:)

I think it might be useful as a recommendation for automatic Markdown writers or converters from other formats: their output would then look much alike, which would likely be a good thing.

As something enforced by parsers, imho not so much, if at all. Many apps would likely have to keep some backward compatibility with older documents anyway (and why keep two such similar parsers around?).

Also, I very much doubt that a wide consensus could be reached on which of the duplicates to keep and which to remove. Many authors have already chosen whether they prefer, e.g., setext headings over ATX headings or vice versa. If you remove either of the two, the affected people quite likely won’t migrate to such a strict parser at all, so it could actually add to the babel instead of solving anything.

I believe Markdown’s spirit and what makes it successful is the degree to which it is designed for humans as opposed to machines, that is to say the degree to which it is like a natural language and not like a programming language. Markdown is about getting the machine to parse what humans can read without a spec and write with little explanation. If it weren’t for machines we’d not need a spec at all. Humans can read each other’s ad hoc or idiosyncratic plain text styles effortlessly. If we had A.I. today, Markdown would be dead.

In some cases Markdown’s support for multiple styles is all about the above. For example, recognizing many ways to delimit lists and many styles of thematic break. In other cases Markdown is a compromise: Setext is what an untrained human would write, and is the most readable heading style, but it is limited to two levels. If you took away Setext, Markdown suddenly stops recognizing the most natural way humans do headings. Most good writing, for humans at least, never uses more than two levels of heading, so the limit is not a limit or is a beneficial limit. Not coincidentally and not without irony, the types of docs that do use 3 or more levels are specifications and legal docs.

Markdown’s complexity both in spec and parsing has little to do with supporting a variety of styles. If it were complexity in service of what I describe above, then so be it. That’s what machines are for. But nearly all of Markdown’s complexity stems from its support for lazy continuation and sloppy structure. The logic in support of these two metastasized throughout the spec in ways you wouldn’t realize until you try to write a parser and a spec. These are the sources of the “extreme artifacts and syntax” to which you allude, not ATX vs Setext. I think both were misguided attempts at being more human. They ended up the opposite. They are in my opinion Markdown’s biggest mistakes.

Just to give one example (I have many more), specifying the following behavior does no service to humans. Quite the opposite. See for yourself. Then write a spec to get all of these different interpretations of laziness and sloppiness in line!

>> everything below stems from the desire
> > to support *this* sloppiness
and *this* laziness as part of a single
block quote.


> at level 1.
>> at level 2.
  >
  > at level 1.
>   > at level 2.
  >   > continuing level 2.
lazily continuing level 2.
> still at level 2.
still at level 2.
>
> at level 1.
lazily at level 1.
>
>>>>> at level 5.
>>> lazily at level 5.
>
not lazy, at root level.

If we were to define a “strict Markdown”, lazy continuation and sloppy structure should be first on the chopping block.
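
As a rough illustration of how much simpler things get without those two features, here is a sketch of block-quote depth under an assumed strict rule. The rule itself is hypothetical (markers start at column 0, form one unbroken run of `>`, and are followed by a space or the end of the line), and the function name is mine. Under that rule the depth of a line depends only on the line itself, with no state carried over from previous lines.

```python
def strict_quote_depths(text):
    """Map each line to its block-quote depth under an assumed strict rule.

    Assumed rule (hypothetical): markers start at column 0 as an unbroken
    run of '>' followed by a space or end of line; depth is the run length.
    A line without markers is always at depth 0, so nothing is "lazy".
    """
    depths = []
    for line in text.splitlines():
        run = len(line) - len(line.lstrip(">"))
        rest = line[run:]
        if run and rest and not rest.startswith(" "):
            raise ValueError(f"marker run must be followed by a space: {line!r}")
        if rest.lstrip().startswith(">"):
            raise ValueError(f"markers must not be separated or indented: {line!r}")
        depths.append(run)
    return depths
```

With this rule, `>> text` is at depth 2, an unmarked line is at root level, and forms like `> > text` or an indented `>` are rejected outright instead of being interpreted.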

What are @jgm’s thoughts on this?

It’s true that disallowing lazy continuations would simplify creation of parsers. But we have parsers now that handle these things efficiently, and a spec that defines behavior even for crazy things like the above example. (I guarantee that if you eliminated laziness you’d have howls of protest.) Having just one bullet list marker or thematic break style would not simplify the spec or parsers significantly.

I have already put down my thoughts about how some tweaks to Markdown syntax would create a more rational language and simplify the spec and parsers:

and with slight modifications at

https://johnmacfarlane.net/beyond-markdown.html

@vas Thanks for the examples. What is the precise definition of lazy continuation?

The CommonMark specification allows a list item or block quote to continue on the next line even when the author was too lazy to type the `>` (for block quotes) and to indent the line contents properly (for both block quotes and list items).

For example:

> These two lines together form one
paragraph in a block quote even though there is no `>` at the 2nd one.

* Ditto for a long list item which
can also be broken into multiple lines in a similar way.
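
If it helps, the block-quote case can be sketched in a few lines of code. This is a deliberately simplified illustration, not how the spec or any real parser defines it: it only looks at top-level block quotes, it ignores lists, code blocks, and lines that would start a new block (such as headings), and the function name is made up.

```python
def lazy_continuation_lines(text):
    """Return 1-based numbers of lines that lazily continue a block-quote paragraph.

    Simplified on purpose: only top-level block quotes are considered, and
    lists, code blocks, and headings are ignored.
    """
    lazy = []
    in_quote_paragraph = False
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip(" ")
        if stripped.startswith(">"):
            # An explicit marker; a paragraph is open only if text follows it.
            in_quote_paragraph = bool(stripped.lstrip("> ").strip())
        elif stripped.strip() and in_quote_paragraph:
            lazy.append(number)  # text without a marker joins the quoted paragraph
        else:
            # A blank line, or text outside any quote, closes the paragraph.
            in_quote_paragraph = False
    return lazy
```

Run on the two-line block quote above, this reports the second line as lazy. A line that follows a bare `>` is not reported, because the bare marker has already ended the quoted paragraph.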

Some may see my approach as heresy, others as fundamentalism, but here it is: StrictMark.

I am not sure if this follows the theme of this thread exactly, but I have a suggestion.

I know that in my posts asking for clarifications, @jgm and others have responded with helpful hints and other suggestions. Would there be any benefit to having a list of the top X recommendations (where X is a low two-digit number, if that) suggested by the people on this list?

I think one good example of that is one of @jgm’s suggestions that (doing this from memory, so excuse me if I get this wrong) every list should be preceded by a blank line.

I see the benefit in such a list. Do others?

Regarding the point that Markdown is for humans: I agree that humans are good at understanding the implicit formatting when they see plain Markdown, but they don’t produce it naturally.

Just like with LaTeX, I have tried to get quite a few people to use it for their writing, arguing why it is so much better than using word processors.

The problem is this: When prompted to just write plain text, nobody writes “useful” Markdown. A few things are intuitive, like for instance bullet points or numbered lists. But nobody is ever going to come up with the syntax for inserting images or ATX headings on their own. I can’t even remember the syntax for inserting images (which type of paren goes where? What comes first, the URL or the alt text?). The worst thing is probably the double-newline-starts-a-paragraph-single-newline-is-ignored rule (GitHub and Discourse actually don’t fully apply that rule). No beginner gets this one right, because word processors have taught all of us very insistently that a single newline means you are starting a new paragraph.
Intuition is about what you are used to. People are used to word processors. They don’t just magically write Markdown given the opportunity. I have tried.

And this is getting worse as time passes. Fewer and fewer people use plain-text emails or typewriters.

To summarize, Markdown needs to be learned. Given that Markdown’s philosophy assumes the opposite, I guess this is just a limitation of Markdown.

That said, I do see the value of having a simple strict markup language whose syntax is as easy to read as Markdown’s.

The discussion of strict vs non-strict is somewhat akin to strong vs weak typing in programming languages. My personal opinion is that while weak typing is a bit nicer to the beginner (nobody likes being screamed at in red by a compiler), the beginner-unfriendliness of strongly typed languages is a problem that can be solved by better tooling. To me the benefits of strong typing outweigh its drawbacks by a lot. In essence, being strict allows catching misunderstandings between the computer and the human earlier and dealing with them in a better way. The same would go for this hypothetical language vs Markdown.

@alehed,

Strong vs weak typing in programming languages is the perfect analogy. There are equally intelligent and productive programmers on both sides of that “divide”. There are also programmers who like both, using each in different circumstances, or who prefer one of the numerous programming languages that exist somewhere in between the two endpoints of the strong-to-weak typing spectrum.

It really doesn’t make sense to argue that a programming language that was designed to be weakly typed should become strongly typed. It’s good we have choices. Even if one day we decide one is better than the other, it will be because we had all these choices and we learned from them. But for now, you may like strongly typed languages, but you can’t really argue with the success and popularity of Python. And it is used by far more than “beginners”.

Likewise authoring formats. We have Markdown, reStructuredText, Asciidoc, DocBook, HTML, HTMLBook, LaTeX, FrameMaker, the many word processors… Markdown is what it is. And has been for nearly two decades. While you may prefer a stricter format, as with Python you can’t really argue with Markdown’s popularity and success (your many references to “nobody” ignore this). And it really doesn’t make sense to break backwards compatibility with something so established.

Better to create a new language, but first check out one of the many that already exist. For example if what you’re after is strictness, check out RST. Try some malformed RST in the online demo and see what happens. It’s also richer than Markdown. See also reStructuredText vs Markdown for documentation. (The cartoon at the top happens to fit our discussion.)

But it might be precisely that strictness that resulted in Markdown being more popular than RST, even though RST came out a couple of years earlier. :man_shrugging:t5: