StrictMark: Markdown refactored

gritzko · January 29, 2021, 11:22am

Hi!

I was using cmark for my ~wiki (which is actually an HTTP interface to a CRDT oplog). I was a bit unhappy with that extra dependency. I have a ton of parsers in the codebase, all Ragel-based. So I thought, can I create a Ragel parser for CommonMark? It seemed like a very basic grammar. As it turned out, it is not that simple. CommonMark basically consists of a ton of ad-hoc rules, plus half a ton of exceptions to those rules.

Then I thought: can I make a rational subset of CommonMark that is simple, clear, and unambiguous? As long as it is backwards-compatible, I can leverage all the existing Markdown codebases. Meanwhile, the new code will only deal with the strict grammar.

Well, this is how far I advanced: StrictMark. Currently I use a Ragel-based parser, as intended. The formal grammar is at the end of the document.

I appreciate your thoughts on the subject!

jgm · January 29, 2021, 6:10pm

As you’ve probably gathered from my previous posts, I’m very sympathetic to the desire to have something simpler.

But I think you should clarify in what respect your syntax is a “subset” of commonmark. Does it mean

every string that is a valid StrictMark document is also a valid commonmark document?

Well, that’s not a very interesting claim, since every string is a valid commonmark document.

Does it mean

every string that is a valid StrictMark document is a valid commonmark document with the same meaning?

That would be interesting, but from what I can see it’s not true.

gritzko · January 29, 2021, 7:01pm

It is definitely not true; there are differences in the syntax. I do believe it is possible to make any given document read the same way in both, though. I mean, if it is written cautiously then yes, but otherwise it is not guaranteed.

vas · February 5, 2021, 4:53am

@gritzko

First, on the use of “subset” to describe StrictMark in relation to Markdown: I think it is rather misleading, especially when it is stated up front in the sales pitch:

StrictMark is a rational subset of Markdown that implements all the features with the shortest formal grammar possible. Hence, uniform syntax and no ambiguities. The idea is that StrictMark can reuse all the existing Markdown support, without sharing the weight of the legacy syntax and its incidental complexity.

It implies that what I write in StrictMark will be rendered properly / interpreted correctly by other Markdown tools. This you admit is not true.

You could say “StrictMark is a rationalized and stricter variant of Markdown that is mostly backwards compatible.” (I’m not even sure about the “mostly”). But this leads me to my next point…

vas · February 5, 2021, 5:40am

@gritzko

Second, a thought on the raison d’être (sorry, I just love this French phrase) as well as the reason for the success of Markdown: It is designed primarily for humans and secondarily for machines.

Markdown endeavors to let humans write plain text as they would naturally. It tries to codify these natural plain text “styles” just enough to enable machines to parse them. In the tradeoff between convenience for humans vs convenience for machines, it always favors humans.

To put it another way, as much as possible the onus is put on the machine (and the programmer of the machine) to understand the human writer, not the other way around. Writing Markdown does not and should not feel like writing code. That’s why Markdown is so flexible and forgiving and not strict. Markdown is not perfect in this regard, but making it strict actually takes it in the wrong direction. Eliminating underlined (Setext) headings takes it in the wrong direction. Having “1. item one” not be read as a list item because it is short a space takes it in the wrong direction.

You can do all of these things, as you do in StrictMark, but then I wouldn’t associate it so closely with Markdown, i.e. I wouldn’t even call it a variant of Markdown as I tentatively suggested above. StrictMark is just another markup syntax that overlaps Markdown just as reStructuredText, org-mode, Asciidoc and many others do.

chrisalley · February 5, 2021, 6:08am

Related discussion:

You could set up your wiki to automatically clean up the Markdown with something like Prettier or a variation based on your StrictMark syntax. This would allow the inputted text to be compatible with non-strict Markdown, while still giving you a uniform version to save in the database. As @vas mentioned, one of the nice things about Markdown is that it’s forgiving for humans.
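As a rough illustration of that kind of clean-up step, here is a minimal Python sketch that rewrites underlined (Setext) headings as `#`-style (ATX) headings before the text is stored. The function name `setext_to_atx` is made up for this example, and the logic is deliberately simplified: it ignores code blocks, lists, and multi-line heading text, so it is not a substitute for Prettier or a real CommonMark parser.

```python
import re

# A Setext underline: up to 3 spaces of indent, then only '=' or only '-'.
SETEXT_UNDERLINE = re.compile(r"^ {0,3}(=+|-+)\s*$")


def setext_to_atx(text: str) -> str:
    """Rewrite 'Title' + '=====' pairs as '# Title' (simplified sketch)."""
    lines = text.split("\n")
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        match = SETEXT_UNDERLINE.match(nxt) if nxt is not None else None
        # Only a non-blank line can be heading text; a '---' that follows
        # a blank line is left alone (it is a thematic break).
        if match and line.strip():
            level = 1 if match.group(1).startswith("=") else 2
            out.append("#" * level + " " + line.strip())
            i += 2  # skip the underline as well
        else:
            out.append(line)
            i += 1
    return "\n".join(out)


print(setext_to_atx("Title\n=====\n\nText\n\nSub\n---\n"))
```

Writers can keep typing either heading style; only the stored copy is uniform.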

codinghorror · February 7, 2021, 2:33am

And the direction and goal of CommonMark is to make Markdown CONSISTENT, neither more nor less strict, but as close to “as it was” as we can get.

(Although in practice some decisions did have to be made in order to get consistency… as @vas points out it is important to understand the philosophy of the project so everyone is aligned around the same goal.)

vas · February 7, 2021, 8:23am

Would it make sense to clarify the philosophy/goals/direction in a post pinned to the top of this forum, to both set expectations and help the forum focus? This would apply not just to the core standard but also to any possible extensions adopted.

There really are a lot of people who effectively believe that everything expressible in HTML needs to be expressible in Markdown/CommonMark even if it means Markdown starts looking more like code and less like prose.

gritzko · February 10, 2021, 10:11am

It is designed primarily for humans and secondarily for machines.

@vas I understand your reasoning. In theory, we might continue that line even further by training a neural network to convert Markdown-ish text into HTML. If you have a huge corpus of real-world Markdown, that might be the right way to go. Then, a text is valid Markdown if it feels like it and if the NN picks it up. Fits the corpus: good. If not: bad. That might work better than a precedent-based spec no regular human will actually read.

Because currently you are half-way between two well-established schools of thought with their well-developed methodologies. Formal languages on one side and pattern recognition on the other.

My rationale was quite simple: I want to be able to reason about syntax validity without re-checking the spec. If I can remember the rules, then good. If not, bad. That part I achieved. Then, the issue is to stay backward-compatible with as many implementations as possible. Because the code is deployed, people have habits and there is no clean slate. That part is challenging. I like what @chrisalley said and I will try to implement that sort of prettification.

Having “1. item one” not be read as a list item because it is short a space takes it in the wrong direction.

Regarding humans vs programming languages, Python has meaningful indents and it is considered a programming language for not-really-programmers, so… apparently it works somehow for all those people…

vas · February 10, 2021, 6:30pm

Yes, if you want to stick to Markdown (as opposed to going with the (I believe) stricter RST, Asciidoc or org-mode), then limiting your own Markdown via prettification or linting is the way to go, not trying to establish a new Markdown variant.
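To make the linting option concrete, here is a small Python sketch of a per-line checker. The three rules and their names are hypothetical house rules picked only for illustration; they are not taken from StrictMark, CommonMark, or any existing linter.

```python
import re

# Hypothetical house rules (assumptions for this sketch only).
RULES = [
    # Bullets written with '*' or '+'; the house style prefers '-'.
    ("bullet-marker", re.compile(r"^\s*[*+]\s")),
    # '#Title' with no space after the hashes is not a heading.
    ("heading-space", re.compile(r"^#{1,6}[^#\s]")),
    # Trailing spaces or tabs at the end of a line.
    ("trailing-space", re.compile(r"[ \t]+$")),
]


def lint(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) pairs for every rule a line breaks."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in RULES:
            if pattern.search(line):
                problems.append((number, name))
    return problems


for number, name in lint("# Ok\n* item\n#Bad\n- fine \n"):
    print(f"line {number}: {name}")
```

The input stays ordinary Markdown that any parser accepts; the checker only reports where a document strays from the stricter style its authors agreed on.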

You mention training a neural net. I’ve also used AI to explain why Markdown is the way it is. I strongly believe this to be the reason Markdown is so popular. Let’s not destroy that, especially since there already exist numerous Markdown-like “strict” code-like formats. In fact, I think we can take Markdown’s “pseudo AI” even further down the “try to interpret the plain text with human eyes” path. I’m even working on it. Though maybe a trained NN would be better than pseudo-AI heuristics.

I beg to differ, and most Python programmers would take your claim as not just wrong, but insulting. In any case, Markdown is in no way analogous to Python.

The thing is, the rules are only complicated for pretty extreme corner cases. Markdown more than meets the 80/20 rule for what it is attempting to do. Probably 95/5 or even 99/1. Most writers have no problem with it. It’s mainly parser programmers who complain. I believe the flaws Markdown/CommonMark has (like the rules for lazy continuation) exist because they don’t follow the original principle of readability for humans closely enough, and as I mentioned above, I would fix them in the opposite direction from the one you advocate.