Originally posted in Djot: A light markup language by @jgm.
Hello,
I have been following CommonMark and Beyond Markdown with interest for a couple of years now. I love that you unified the different markdown flavors by writing a specification and that you are constantly trying to improve it.
I used to write tons of documentation in markdown, CommonMark, and, finally, Pandoc’s Markdown. With time I needed more HTML-like elements and features, which is why I transitioned to pandoc, which has the largest feature set. While pandoc is an awesome tool, I found its markdown syntax “hacky” (for lack of a better word) and verbose, and thus harder to read.
There was another thing that was bothering me. As written in Beyond Markdown, because CommonMark tries to stay faithful to markdown, it has complex rules which make the language harder to understand for humans and computers alike.
For these reasons (and because I really wanted to write a parser), I made my own “Beyond Markdown” markdown, called Touch (GitHub - touchmarine/to: 📜 Touch Lightweight Markup Language; Familiar, Extendable, Auto-Formattable). I made it well over a year ago and have been using it daily since. As you are just starting to develop djot (at least publicly) and getting the ball rolling on Beyond Markdown, I would like to share some of my findings from developing my personal version of Beyond Markdown.
My Experiences Developing a Personal “Beyond Markdown” markdown
This is a brain dump; it is not necessarily written in a meaningful order.
I will use markdown as an all-encompassing term for markdown-like languages.
Feature Set and Progressive Enhancement
There are roughly 3 levels of markdown usage:
- plain text with annotations (em, strong)
- structured content (headings, code blocks)
- advanced markup (tables, table of contents, HTML elements, latex)
While CommonMark serves levels 1 and 2 well, things get complicated at level 3. If you switch to GFM, which is widely known and supported, you get tables, but other features are still out of reach.
You can go a long way with the features CommonMark/GFM offer you, but when you need more, you will have to drastically change your workflow. Either you switch to another, more powerful, and lesser-known markdown flavour or you switch to another (non-lightweight) language completely and use markdown inside it.
Goal
The language should be progressive and able to scale with you and your team.
My Solution
The language should come with a default set of common elements (like a protocol that everyone speaks and can interchange). But, it should also be extendable so you and your team can add the elements you need without having to change the whole workflow.
Instead of hardcoding elements, Touch knows how to parse “signatures” of elements. All the different types of element syntaxes are abstracted into “signature” types like indented and fenced.
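To make the idea of signatures more concrete, here is a minimal Python sketch. It is not Touch's actual implementation or configuration format: the `Element` class, the `REGISTRY` list, the `parse` function, and the delimiter characters are all hypothetical and chosen only for illustration. The point is that the parser only needs to understand the signature types (here, indented and fenced), while the elements themselves are plain data that a team can extend.

```python
from dataclasses import dataclass


# Hypothetical model for illustration only; not Touch's real API.
@dataclass(frozen=True)
class Element:
    name: str       # element name, e.g. "Blockquote"
    signature: str  # signature type: "indented" or "fenced"
    delimiter: str  # placeholder delimiter that opens the element


# A shared default set plus one team-specific element (all placeholders).
REGISTRY = [
    Element("Blockquote", "indented", ">"),
    Element("CodeBlock", "fenced", "`"),
    Element("Note", "indented", "%"),  # custom element added by a team
]


def parse(lines, registry=REGISTRY):
    """Parse one block starting at lines[0]; return (name, content lines)."""
    first = lines[0]
    for el in registry:
        if not first.startswith(el.delimiter):
            continue
        if el.signature == "indented":
            # Content is the rest of the line after the delimiter.
            return el.name, [first[len(el.delimiter):].strip()]
        if el.signature == "fenced":
            # Content runs until a line that repeats the opening fence.
            body = []
            for line in lines[1:]:
                if line.startswith(el.delimiter):
                    break
                body.append(line)
            return el.name, body
    # No registered delimiter matched: treat the line as plain text.
    return "Paragraph", [first]
```

Adding a new element in this sketch means adding one entry to the registry; the parsing code stays untouched.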
Explicit, Simple, and Obvious
I like Go’s syntax and was inspired by it; I have a preference for orthogonal building blocks that are easily composable, with as little magic as possible.
Goals
My goals for the syntax were/are:
- familiar (to markdown and all other commonly used markup languages)
- no complicated rules
- orthogonal (changing a does not change b)
- canonical (one true way to do things)
- don’t use characters with widely-known uses as delimiters (e.g. #, @, {{, {#)
My Solution
Straightforward rules
- UTF-8
- no SmartyPants or entity/numeric references—use Unicode which is readable in plain text form
- only obvious whitespace—no double space line breaks
- a list is nested if it is indented more than the delimiter of the parent list
- no indented code blocks
- no HTML (you can add custom elements instead)
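The list nesting rule above can be sketched in a few lines of Python. This is an illustration under assumptions, not Touch's parser: the function names are made up, indentation is counted in spaces only, and every input line is assumed to be a list item whose first non-space character is its delimiter.

```python
def indent(line):
    """Column of the first non-space character (the list delimiter)."""
    return len(line) - len(line.lstrip(" "))


def nesting_depths(lines):
    """Return the nesting depth of each list item line."""
    stack = []   # delimiter columns of the currently open lists
    depths = []
    for line in lines:
        col = indent(line)
        # Leave every list whose delimiter sits to the right of this item.
        while stack and col < stack[-1]:
            stack.pop()
        # Indented more than the parent's delimiter: open a nested list.
        if not stack or col > stack[-1]:
            stack.append(col)
        depths.append(len(stack) - 1)
    return depths
```

For the items `- a`, `  - b`, `    - c`, `  - d`, `- e` this yields the depths 0, 1, 2, 1, 0.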
Inline elements use 2 character delimiters
Pros:
- more explicit:
  - easier to differentiate from normal punctuation (`*` vs `**`)
  - easier to parse
  - improved scannability (and I believe unhurt readability)
- closing delimiter is not required (the element spans until the end of the current block):
  - closing delimiter is added automatically by the auto-formatter
  - when typing in an environment with live preview, `**a` can immediately bold `a`
Cons:
- more verbose
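The optional closing delimiter described above could be handled by a formatter roughly as follows. This is a simplified Python sketch, not Touch's formatter: the `DELIMS` tuple and the function name are assumptions, and only symmetric delimiters (the closer is the same as the opener) with proper nesting are handled.

```python
# Assumed set of symmetric two-character inline delimiters (illustrative).
DELIMS = ("**", "__")


def close_open_delimiters(block):
    """Append closing delimiters for inline elements left open in a block."""
    stack = []  # currently open delimiters, innermost last
    i = 0
    while i < len(block) - 1:
        pair = block[i:i + 2]
        if pair in DELIMS:
            if stack and stack[-1] == pair:
                stack.pop()         # matches the innermost open element
            else:
                stack.append(pair)  # opens a new element
            i += 2
        else:
            i += 1
    # Unclosed elements span to the end of the block; close innermost first.
    return block + "".join(reversed(stack))
```

With this sketch, `**a` becomes `**a**`, while text that is already closed is returned unchanged.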
Block elements use 1 character delimiters
We don’t normally use punctuation characters (which usually serve as delimiters) at the start of paragraphs. As such, a single character is enough to know that `*` in `* a` is a delimiter. Because we know that inline delimiters are 2 characters and blocks 1 character long, we can easily differentiate between the two even if they use the same delimiter character:
* // block element
** // inline element
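As a small illustration of that distinction, the Python sketch below classifies the start of a line for one delimiter character that is shared by a block and an inline element. The function name is hypothetical, and the sketch deliberately ignores block elements that repeat their character (such as multi-level headings).

```python
def classify(line, delimiter="*"):
    """Classify the start of a line for a shared delimiter character."""
    if line.startswith(delimiter * 2):
        return "inline"  # doubled character, e.g. "**"
    if line.startswith(delimiter):
        return "block"   # single character, e.g. "*"
    return "text"        # no delimiter at the start of the line
```

Checking the doubled form first is what keeps the two cases apart: `* a` is a block element, `**a** b` starts with an inline element.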
Orthogonal and Composable
Elements should do one job and that job only, no matter where they are placed. Following this principle makes for obvious syntax and enables composability. Touch utilises composability to enable the use of complex elements that are hard to express otherwise.
Sticky Elements
Sticky elements are elements that stick to other elements. They form a compound element with the element they stick to (called the target element). In this compound element, the sticky element acts as an auxiliary that provides additional info to the target element. Let’s look at a couple of sticky elements.
Named Link
Named Link is formed if a Link is placed after a Group.
[[a]] // Group
((b)) // Link
[[a]]((b)) // Named Link
Sticky Subtitle
Sticky Subtitle is formed if a Subtitle is placed after any other block element.
= Report // Title (or other block element)
_ Q2 // Subtitle
In HTML, the Sticky Subtitle in the example above can be represented in a <header>:
<header>
<h1>Report</h1>
<p>Q2</p>
</header>
Sticky Attributes
A Sticky Attributes element is formed if an Attributes element is placed before any other block element.
! id="heading2" class="display" // Attributes
== Heading 2 // Heading (or other block element)
Auto-Formattable
I like style guides and I love code formatters. They relieve me of an unnecessary burden and make collaboration easier. No more style-related pull requests and flame wars.
That is why Touch comes with a formatter that automatically formats your code into the normalized form. It also supports hard wrapping at a custom line length (which solves the pain point of manual line wrapping that @vas pointed out).
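To show what hard wrapping at a custom line length means in practice, here is a minimal Python sketch built on the standard `textwrap` module. It is not how Touch's formatter is implemented, and the `hard_wrap` name is made up. Existing line breaks are discarded first so that the result does not depend on how the paragraph was wrapped before.

```python
import textwrap


def hard_wrap(paragraph, width=80):
    """Re-flow a paragraph to at most `width` columns per line.

    A single word longer than `width` is kept whole on its own line.
    """
    # Collapse existing line breaks and runs of spaces first.
    text = " ".join(paragraph.split())
    return textwrap.fill(
        text,
        width=width,
        break_long_words=False,
        break_on_hyphens=False,
    )
```

Because the input is normalized before wrapping, running the function twice gives the same result as running it once, which is the property you want from a formatter.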
Post-Processing (Transformers and Aggregators)
Touch utilises a lot of post-processing, which has the following advantages:
- leaner and simpler parser
- enables greater level of composition
- modular design
Touch splits post-processors into transformers and aggregators.
Transformers
Transformers traverse the node tree and add new elements to it. Their job is to group elements by looking for simple patterns. They are used to add paragraphs, lists, and sticky elements.
The patterns are simple for humans and computers:
- paragraphs group leaf elements that have a sibling
- lists group contiguous sequences of the same sibling elements
- sticky elements group elements that have a sticky element placed before/after them
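Two of these patterns can be sketched in Python as follows. The `Node` class, the node kind names, and both functions are assumptions made for illustration and do not reflect Touch's actual node tree. Each transformer takes a flat list of sibling nodes and returns a new list with the grouped nodes added.

```python
from dataclasses import dataclass, field
from itertools import groupby


# Hypothetical node type for illustration only.
@dataclass
class Node:
    kind: str
    text: str = ""
    children: list = field(default_factory=list)


def group_lists(siblings):
    """Wrap each contiguous run of ListItem siblings in one List node."""
    out = []
    for kind, run in groupby(siblings, key=lambda n: n.kind):
        run = list(run)
        if kind == "ListItem":
            out.append(Node("List", children=run))
        else:
            out.extend(run)
    return out


def group_sticky_attributes(siblings):
    """Pair each Attributes node with the block element that follows it."""
    out = []
    i = 0
    while i < len(siblings):
        node = siblings[i]
        if node.kind == "Attributes" and i + 1 < len(siblings):
            # The compound element holds the sticky node and its target.
            out.append(Node("Sticky", children=[node, siblings[i + 1]]))
            i += 2
        else:
            out.append(node)
            i += 1
    return out
```

Since each transformer only looks at adjacent siblings, they can be chained, for example `group_sticky_attributes(group_lists(nodes))`.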
Aggregators
Aggregators aggregate (collect) data we are interested in, such as the headings needed to generate a table of contents. They traverse the node tree after the transformers.
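An aggregator for a table of contents could look roughly like the Python sketch below. The `Node` class, its `level` field, and the function names are hypothetical; the sketch only shows the idea of walking the finished tree and collecting the headings.

```python
from dataclasses import dataclass, field


# Hypothetical node type for illustration only.
@dataclass
class Node:
    kind: str
    text: str = ""
    level: int = 0
    children: list = field(default_factory=list)


def collect_headings(node):
    """Depth-first walk that yields (level, text) for every Heading node."""
    if node.kind == "Heading":
        yield node.level, node.text
    for child in node.children:
        yield from collect_headings(child)


def table_of_contents(root):
    """Render the collected headings as an indented outline."""
    return "\n".join(
        "  " * (level - 1) + text for level, text in collect_headings(root)
    )
```

The aggregator never modifies the tree; it only reads it, which is why it can safely run after all transformers.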
Tables
Tables are hard. You need another dimension to represent them which is almost impossible in plain text if you need readability. I have tried numerous notations for tables and none felt right. GFM tables and Djot pipe tables provide enough functionality, are easy to read and write, and are useful. Nonetheless, they make parsing and tooling more complex and we should minimize complexity if possible.
I think that tables are important and some basic functionality should be available in lightweight markup languages. Without them, we quickly get to the point of breaking our workflow and needing to change our tools, again. I currently use Unicode tables designed in online editors, but while the readability is great, editing is painful as you need another tool.
One of the major challenges with tables is what is allowed inside them. Can you place blockquotes in them? Can you write inline comments? Can you use backslash escapes? Tables have their own context and the rules inside them are not the same as outside. They break our orthogonality rule. There is no easy solution as they do not naturally fit into the linear format of documents.
The most pragmatic solution (to me, right now) seems to be Djot-like pipe tables with auto-formatting help. Or maybe, we could have a small purpose-built markup language for tables that would be integrated like code blocks? After all, would it not be more correct and obvious that tables would live in their own blocks?
Configuration
This section is specific to Touch and its configuration.
Touch’s configuration is pretty messy and difficult to use right now. You need to use my “Extended JSON” (which is just pre-processed JSON with raw multiline strings), which is rubbish, and the templates are hard to read and contain too much logic.
I am currently testing a new configuration which makes things much easier and templates basically logic-less. If you are evaluating whether you would use a configurable markup language, make sure to check out the new configuration. You can compare the default configurations to see the difference: