Djot: A light markup language by @jgm

I fixed some significant typos and added clarifications above. The most important being:

My obviously strong and likely controversial thoughts on manual line wrapping reflect a lot of time and work I’ve put into a project aimed at making plain text more expressive and flexible while making it more natural at the same time. I’ve examined many, many paths backwards and forwards, not just in my head but in countless use-cases and working code. In doing so, I discovered how much complexity and unnaturalness derive simply from the manual line wrapping fork in the road. This fork occurred so far back that I think most of us, myself included, don’t even realize it was a fork, that we could have made a different choice, and that there is a whole other realm of possibility on the other side of the mountain.

I’m not done with my exploration. I’ve only recently decided to backtrack all the way to that primordial fork and will be exhaustively examining the other world. Maybe I’ll discover I was wrong / rediscover the reason all(?) lightweight markup formats made the choice they did. I wasn’t going to write about this yet, but it was djot doubling down on manual line wrapping that spurred me to write. I’m hoping this leads to a fruitful discussion.

Thanks for these interesting comments. The issue of hard-wrapping was one we talked about in the discussions leading up to commonmark.

I have to say that I don’t really agree with your readability judgments; it seems to me that the extra whitespace is desirable (and certainly not undesirable) in these cases.

A couple additional points:

First, if your proposal is that a hard line break should be interpreted as a hard break (i.e. <br>), then there would still be an ambiguity with

She counted (all the way from 1 to
9) and then went in.

Should this be parsed as a single paragraph containing a hard line break and the literal 9), or a paragraph followed by an ordered list? Here you’d need something like commonmark’s current kludge (making a special exception for the number 1 for starting a list without a line break), or you’d need to forbid hard breaks before potential list markers (an expressive blind spot).

Second, genuine hard line breaks inside a paragraph are extremely rare in good typography; it seems a shame to assign this rare thing to the newline character. A somewhat different proposal would make newlines create paragraph breaks. But this isn’t good for readability; it will be too unclear where a paragraph ends and the next one begins.

Third, paragraphs consisting of extremely long lines can cause problems unrelated to readability. For example, lines of text in emails are not supposed to exceed 998 characters (RFC 5322 - Internet Message Format). Coding standards also sometimes impose a maximum line length; if the syntax is used in code comments, it would be affected by this. Diffability is another important consideration: if you have extremely long lines, it is harder to see where changes have occurred inside a paragraph.

Djot does not require hard-wrapping, but it allows it, and I think that’s still important.

Sure. I intended to write a short summary that would fit into here but it quickly got long.

New topic: My Experiences Developing a Personal “Beyond Markdown” markdown - Touch Lightweight Markup Language

Depends on use case, but not everything people write is destined for print with “good typography”. Most of it stays online, in casual discussions.
Now here is another “fork”: Some people, I have no idea how many, routinely break lines after sentences or short groups of sentences, with the intent of making the structure of their writing clearer to the reader. I do. We embrace the structure of plain text to have 2 levels of grouping: long lines that are effectively “sub-paragraphs”, and “real paragraphs” separated by blank lines.
(In markdown, I find myself adding double spaces after most line breaks. Rarely do I hard-wrap for myself without wanting the reader to also see the break.)

It’s rare in print and definitely violates writing style guides (unless you write poetry). Possibly because paper pages have a cost and a weight? But online I’ve already seen it in some people’s emails in the 90s (can’t speak to before my time :person_shrugging:) and AFAICT it’s not going away.

This consideration goes both ways…

  • All good diff tools like github/gitlab/meld/delta do 2-level diffing where the whole touched line is highlighted, but then the specific touched words are highlighted stronger within it. So writing each paragraph as one long line is not catastrophic to modern diffing. :neutral_face:
  • Automated hard-wrapping around a fixed width leads to the worst diffs! When words slide from one line to another, they add tons of noise to the diff that’s hard to filter out. :-1:
  • OTOH the best advice for minimizing diff noise is indeed “add newline after every sentence”, when working with formats like LaTeX, markdown etc. that’ll still render them as single paragraph. :+1:

It’s a good opportunity to mention the little-known text/plain; format=flowed RFC 3676: The Text/Plain Format and DelSp Parameters. A brilliant ASCII hack IMHO, and I regret it hasn’t caught on among unix terminal tools — or even among mail readers where it was intended :frowning_face:

Hard-wrapping is a lossy action — it’s hard to recognize which newlines are “original” semantic breaks and which were inserted “soft breaks” and safe to remove.
The idea of format=flowed was that by default line breaks still mean a hard break and you opt-out by leaving a single trailing space before the newline. Pairs of “Space Newline” are generally removable, or at least the Newline is (a 2nd param DelSp controlled whether the space is kept, which fits e.g. English, or dropped, which fits languages like Chinese that don’t use spaces between words).
That’s the opposite of markdown, where single source newlines by default are ignored as soft breaks, and 2 trailing spaces opt-in to a hard break.
This choice was made for compatibility with existing ambiguous plain text (and with email clients ignorant of format=flowed), where you can’t be sure which breaks were soft. It’s safer to preserve all breaks than munge all lines together.
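
As a rough illustration of that opt-out rule, here is a minimal Python sketch of "un-flowing" text. It is a simplification and not a full RFC 3676 implementation: it ignores quoting, space-stuffing and the signature separator, and the function name `unflow` and its `delsp` flag are made up for this example.

```python
def unflow(text: str, delsp: bool = False) -> list[str]:
    """Join soft-wrapped lines into logical lines (simplified format=flowed).

    A line ending in a space is treated as a soft break and is joined with
    the following line. Any other line ends a logical line (a hard break).
    """
    logical_lines = []
    current = ""
    for line in text.split("\n"):
        if line.endswith(" "):
            # Soft break: keep the trailing space, or drop it when delsp is
            # set (useful for languages written without spaces between words).
            current += line[:-1] if delsp else line
        else:
            logical_lines.append(current + line)
            current = ""
    if current:
        # The text ended on a soft break; keep what was collected so far.
        logical_lines.append(current)
    return logical_lines


print(unflow("Hello \nworld\nnext"))
```

With the default `delsp=False` the space before each removed newline is kept, so English words stay separated.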

=> The particular choice of “Space Newline” is debatable, but consider the idea of opt-in vs opt-out.
Whether a markup format defaults to ignoring newlines or treating them as semantic line breaks, either way it can allow hard-wrapping without requiring it!

@jgm, I wanted to hear other points of view before I chimed in again, but since it’s been a couple of weeks…

I spoke of manual line wrapping being a primordial fork in the road. The real primordial fork is this choice:

Should machines be made to understand humans or should humans be made to make themselves understandable to machines?

It is this question that drives my analysis.

machine                                          human
oriented                                        oriented
   <------------------------------------------------> 
       SGML/HTML      Asciidoc   Markdown

Where does djot fall? For some of its choices, a little to the right of Markdown. But in doubling down on manual line wrapping, to the left, which the examples in my above reply try to demonstrate.

The notion of “tight lists” tells us that extra whitespace can be undesirable. More importantly, requiring it has only one purpose: making humans make themselves understandable to the machine. We would not be allowed to write the following even though it is both unambiguous and natural for us:

Before leaving:
- eat all the perishables
- water the plants
- turn out the lights

And we would only know that we were not allowed to write it this way because we’d have to learn a set of rules designed to make ourselves understandable to machines.

I always thought CommonMark’s “kludge” was a step in the right direction (and defended it more than once on this forum): Make the machine figure out the human. I’d want to go further, adding a heuristic such as: Is there an 8) list item somewhere before this? Is there a 10) item that follows? If not, this is unlikely to be a list item.
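
A minimal Python sketch of that neighbour heuristic follows. The names `ordered_marker` and `plausible_list_item` are hypothetical, and always accepting the number 1 is an assumption borrowed from the CommonMark exception mentioned earlier in the thread, not part of any spec for this heuristic.

```python
import re

# An ordered-list marker: digits followed by "." or ")" and whitespace.
MARKER = re.compile(r"^\s*(\d+)[.)]\s")


def ordered_marker(line):
    """Return the list number a line appears to start with, or None."""
    match = MARKER.match(line)
    return int(match.group(1)) if match else None


def plausible_list_item(lines, i):
    """Guess whether lines[i] is a list item by looking for its neighbours."""
    n = ordered_marker(lines[i])
    if n is None:
        return False
    if n == 1:
        return True  # assumption: a list may always start at 1
    has_previous = any(ordered_marker(l) == n - 1 for l in lines[:i])
    has_next = any(ordered_marker(l) == n + 1 for l in lines[i + 1:])
    return has_previous or has_next


prose = ["She counted (all the way from 1 to", "9) and then went in."]
todo = ["8) pack", "9) leave", "10) lock up"]
print(plausible_list_item(prose, 1), plausible_list_item(todo, 1))
```

The sketch scans the whole document for neighbours; a real parser would probably limit the search to the surrounding block.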

If you eliminate manual line wrapping (treating newlines as soft breaks), then it gets considerably better. First, a large source of ambiguity, what @cben calls “Hard-wrapping is a lossy action”, goes away. There simply would be no line break before 9). We can come up with other examples where ambiguity occurs, but they become increasingly contrived. In other words, incredibly rare in the real world. Second, you could modify the heuristic: If unsure, leave it as-is. Even if it were a list item, it will still be seen as such by the human reader even without an explicit list rendering (in HTML or whatever), because it was seen as such in the original.

On your second point: Without any rules, humans will naturally delineate paragraphs with blank lines where necessary.

On your third: Thank you for making me aware of RFC 5322. At some point we have to break the bonds of outmoded limitations (limitations I assume exist because at the time computing resources were so limited choices were made in favor of machines at the expense of humans). Even the RFC says “Receiving implementations would do well to handle an arbitrarily large number of characters in a line for robustness sake.” But more importantly, who writes in plain text markup in an email client that won’t render that Markdown into HTML? Finally, say someone wants/needs to email raw plain text markup. The software would insert newlines where necessary. The occasional > 998 character paragraph (this one is 780 chars), would get an odd line break in the middle. It’s not a big deal.
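
A minimal sketch of such automatic folding, in Python. The function name `fold_long_lines` is hypothetical, the limit counts characters rather than encoded bytes, and breaking at the last space before the limit is just one possible policy.

```python
def fold_long_lines(text, limit=998):
    """Break any line longer than `limit` characters, preferably at a space."""
    folded = []
    for line in text.split("\n"):
        while len(line) > limit:
            # Look for the last space that keeps the first piece within the limit.
            cut = line.rfind(" ", 0, limit + 1)
            if cut > 0:
                folded.append(line[:cut])
                line = line[cut + 1:]  # drop the space that became a newline
            else:
                # No usable space: cut hard at the limit.
                folded.append(line[:limit])
                line = line[limit:]
        folded.append(line)
    return "\n".join(folded)


paragraph = "a" * 600 + " " + "b" * 600
print([len(piece) for piece in fold_long_lines(paragraph).split("\n")])
```

Paragraphs under the limit pass through untouched, so only the occasional very long paragraph picks up an extra break.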

I’m with @cben on the diffability question.

Djot doesn’t require hard-wrapping, but even users who don’t use it pay its costs.

[I sent a reply by email, but for some reason it didn’t come through. So, apologies if this is a duplicate.]

On Jul 26, 2022, at 11:13 AM, vas via CommonMark Discussion <noreply@talk.commonmark.org> wrote:

If you eliminate manual line wrapping (treating newlines as soft breaks), then it gets considerably better. First, a large source of ambiguity, what @cben calls “Hard-wrapping is a lossy action”, goes away. There simply would be no line break before 9).

I don’t understand your response to my point. Most likely we are understanding some key terms, like “soft break,” differently. As I was using the term here, a “soft break” is a newline in the markdown source that is interpreted semantically as a space. Djot does “treat newlines as soft breaks” in this sense, so I assume you mean something different.

I understood you to be proposing that newlines in paragraphs would produce a hard line break in the rendered output. Assuming that’s right, then you face an ambiguity in

She counted (all the way from 1 to
9) and then went in.

Is this a single paragraph with a hard break,

<p>She counted (all the way from 1 to<br>
9) and then went in.</p>

or is it a paragraph followed by a list?

<p>She counted (all the way from 1 to</p>
<ol start=9><li>and then went in</li></ol>

That’s the ambiguity. It’s just the same as the ambiguity we face now with commonmark, and which we resolve with the unprincipled restriction on start numbers. I guess you like that way of resolving it, but I’ve never been happy with it. In any case, it comes up whether or not you allow hard wrapping (i.e., treat newlines in paragraphs as equivalent to spaces).

As for your main point: certainly, djot falls a bit to the left of markdown. With markdown and commonmark, the aim was to magically guess what humans intended, as far as possible. My mantra was always “favor what is intuitive to humans, and make the parser more complex if necessary.” That is why the parsing rules are so complex. The problem is, even with all this complexity, there are many cases where we don’t get the results people intuitively intend. So, my choice with djot is to give up trying. Performing the task well would require a high degree of general intelligence. Maybe in the future, some successor of GPT-3 will be used to parse our plain text documents, but for now, I’d rather have a simple set of rules that we can keep in our heads, so the output is predictable.

To elaborate: in my example above, any human can tell that the 9) on the second line is the end of the parenthesized phrase in the preceding line, and not the start of a list, whereas in

A further point is
9) blah blah blah

we have a list. But in doing this we’re relying on our grasp of the meaning of what is written; it’s very hard to predict human intentions in such cases with a set of syntactic rules. Consider the minimal pair:

A more interesting number is
6. This is a "perfect" number.

vs

A more interesting point is
6. This is a "perfect" number.

In the second case a list is probably intended (depending on preceding content), while in the first a list is not intended. We can figure that out because we know the difference between “point” and “number” and we have a grip on what the writer is trying to achieve. Our markdown parsers can’t learn this by being given more and more complex rules. They’d need to have a psychological model of the humans writing the text, and an understanding of the meanings of the words.

The idea behind djot is to keep things simple, uniform and predictable while still achieving most of the aims of markdown.

John

On diffability: I think the best approach for diffability is one sentence per line. However, that speaks in favor of djot’s current approach, on which newlines inside paragraphs render as spaces rather than hard breaks.

The bottom line, though, is that I don’t want to impose any one style on the user. Some people like one sentence per line. Some people like hard wrapping to a fixed width. Some people like one big line per paragraph. All of these styles work fine with djot. I’d be reluctant to change djot in a way that requires one of these styles and excludes the others.

The most obvious criticism I have is that

  1. It should be possible to attach arbitrary attributes to any element.

is not a proper design goal. It’s a design decision made to achieve some design goal. A design goal is something like “It should be easy to implement the parser” or “It should be possible to add html blink tags.”

Now I’m very dubious on attaching arbitrary attributes as a design choice, but it would be a mistake for me to argue against it since there’s no actual design goal. I don’t know what you’re trying to accomplish. And I need to know to be able to suggest alternatives.

Part of what I’m thinking at the moment is that I don’t like the idea of sprinkling bits of markup (in the form of attribute specifications) all over the text, since this directly conflicts with markdown’s goal of being readable as text. An alternative might be to actually standardize the “Front Matter” convention you see in CMS’s that use markdown (e.g. Hugo), where you have

-----
title: My stupid clickbait blog post
date: 2022-8-6
tags:
  - seo
  - clickbait
-----

Then maybe you could define one or more attributes to contain queries (maybe css selectors) to locate block/inline elements and specify attributes to be applied to them. In this way you contain all the ugliness to the front matter. Now this is probably a terrible idea that you shouldn’t use, but without a proper design goal it’s hard to actually demonstrate this. For example, this approach would probably make the parser easier to implement (assuming parsing/evaluating the css selectors is handled by an existing library), and it satisfies the goal of allowing us to apply arbitrary attributes to elements.

Or to make this even more pathological, we specify a series of sed programs to be run on the finished html, so we don’t have to worry about having a css library. This appears to satisfy the design goals even better. I just threw up in my mouth a little.

Let’s switch gears and see if I can come up with some suggestions that are actually worthwhile. One of the changes you made was to cut out redundant syntax elements. Personally, I have a strong preference for the setext headings and would miss them if they were gone, but let’s leave that aside briefly. One could argue that if you’re going to start trimming down the language, there’s no reason to include the inline link definition ([link text](uri)) when the reference link definition ([link text]...[link text]: uri) is obviously so much better.

Assuming for the sake of argument that you agree with me about reference links, we could apply the same reasoning to attributes. Instead of writing

The word *atelier*{weight="600"} is French.

(Yuck!) We write e.g.

The word *atelier*{fw} is French.

{fw}: weight="600"

In both cases, we satisfy the goal of being able to add non-text stuff (links, arbitrary html attributes) to our text without substantially compromising the readability of the text. And by banning the obviously-inferior inline versions, we simplify the parser. Also, we only define fw once, so we can re-use the same 4-char reference several times in case we add more French words to our document.

The disadvantage of this is that we can’t straightforwardly compose the two definitions, as in

[Link text](url){title="Click me!"}

But we could still presumably compose the invocations and write

[Link text]{cmt}

{cmt}: title="Click Me"
[Link text]: url

In re the setext headings: I want the way I write the h1 elt to look like a big title, and the way I write the h2 elt to look like a substantially-document-dividing subtitle. I don’t think that the atx headings qualify. Ignoring post-title hashes does at least let me write

# Title ##################################

intro

## Subsection 0 #########

text

## Subsection 1 ##########

more text

Which isn’t bad, but I’m not a huge fan. This maybe isn’t a critical issue, but I still don’t get the same warm fuzzy feelings I get from markdown setext headings.

An alternative might be: If the heading contains a trailing #, the line immediately after the heading is skipped. Then I write

# Title #
=============================

intro

## Sub 0 #
------------------------------------------------

text

## Sub 1 #
-------------------------------------------------

text

Now this is almost as nice in plain text as markdown (the extra hashes aren’t a big deal), and we don’t actually have to make the parser understand the setext headings, since they’re basically just comments. In fact, I can use $$$$$$$$$$$$$$$$$$$$$$$$$$$ as my h3 separator if I want.
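
A minimal Python sketch of that skip rule, written as a preprocessing pass. This is only the proposal above, not anything djot does; the name `drop_heading_decorations` and the regular expression used to recognise a heading with a trailing # are assumptions for illustration.

```python
import re

# A heading line that starts with one or more "#", then whitespace,
# and ends with at least one "#".
HEADING_WITH_TRAILING_HASH = re.compile(r"^#+\s.*#\s*$")


def drop_heading_decorations(lines):
    """Skip the line immediately after any heading that ends in '#'."""
    kept = []
    skip_next = False
    for line in lines:
        if skip_next:
            skip_next = False  # whatever this line contains, it is ignored
            continue
        kept.append(line)
        if HEADING_WITH_TRAILING_HASH.match(line):
            skip_next = True
    return kept


source = ["# Title #", "=====", "", "intro", "## Sub 0 #", "$$$$", "", "text"]
print(drop_heading_decorations(source))
```

Taken literally, the rule skips the next line even when it is blank, so a heading with trailing hashes always needs a decoration line or loses the blank line after it.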

Finally, I agree with @jgm on the hard-wrap support thing, but I really don’t feel like saying anything substantial about it.

The primary goal is to provide a flexible way to make the markup extensible.

Suppose you are writing a document that needs to index certain terms. What you want is a filter: a program that reads the abstract syntax tree (AST) produced by the parser and enhances it by adding an index. (If you’re not familiar with this in practice, go look at some examples of pandoc filters.) But in order for this to work, the filter needs to know which terms are to be put into the index, so this information needs to be represented in the AST. If we can add attributes to bits of text, we can do that using attributes. e.g., [cat]{.index}, or [cat]{.index see="feline"}.

That’s just one application. Instead of marking individual words and phrases, you might want to mark groups of paragraphs. Then you attach the attributes to a fenced div. Or you might want to mark a code block as content that should be turned into a diagram via mermaid, using certain DPI settings or whatever. Again, you can do this with attributes.

So, attributes provide hooks through which external filters can interact with the parsed AST.

That’s the main reason I want them, anyway. They’re also useful for things like defining anchors for cross-references.
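
To make the filter idea concrete, here is a minimal Python sketch that walks a tree and collects the spans marked with the index class. The node shape (dicts with `text`, `children` and `attrs` keys) is invented for this example; it is not the actual djot or pandoc AST.

```python
def node_text(node):
    """Concatenate the plain text found under a node."""
    if "text" in node:
        return node["text"]
    return "".join(node_text(child) for child in node.get("children", []))


def collect_index(node, entries=None):
    """Return (term, see) pairs for every node carrying the 'index' class."""
    if entries is None:
        entries = []
    attrs = node.get("attrs", {})
    if "index" in attrs.get("classes", []):
        entries.append((node_text(node), attrs.get("see")))
    for child in node.get("children", []):
        collect_index(child, entries)
    return entries


# Roughly what a parser might produce for: The [cat]{.index see="feline"} sat.
doc = {
    "children": [
        {"text": "The "},
        {"attrs": {"classes": ["index"], "see": "feline"},
         "children": [{"text": "cat"}]},
        {"text": " sat."},
    ]
}
print(collect_index(doc))
```

The filter never looks at the markup itself, only at the attributes the parser attached, which is the sense in which attributes act as hooks.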

Okay, that makes sense.

Looking at your example makes me wonder if I want to augment my attribute-ref proposal to make the attribute specs composable, as in

[cat]{cat} and [dog]{dog}

{idx}: .index
{cat}: {idx} see="feline"
{dog}: {idx} see="canine"

Or whether that introduces too much complexity.
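
For a sense of how much complexity is involved, here is a minimal Python sketch that resolves such composed definitions. The `{name}: ...` definition syntax is only the proposal above, not existing djot syntax, and the helper names are hypothetical. The sketch treats every `{word}` in a definition body as a reference, so attribute values containing braces would need extra care.

```python
import re

DEFINITION = re.compile(r"^\{(\w+)\}:\s*(.*)$")
REFERENCE = re.compile(r"\{(\w+)\}")


def parse_definitions(lines):
    """Collect '{name}: body' lines into a dict of raw definition bodies."""
    definitions = {}
    for line in lines:
        match = DEFINITION.match(line)
        if match:
            definitions[match.group(1)] = match.group(2)
    return definitions


def expand(name, definitions, seen=()):
    """Replace nested {references} in a definition, rejecting cycles."""
    if name in seen:
        raise ValueError(f"circular attribute reference: {name}")
    body = definitions[name]
    return REFERENCE.sub(
        lambda m: expand(m.group(1), definitions, seen + (name,)), body
    )


defs = parse_definitions([
    '{idx}: .index',
    '{cat}: {idx} see="feline"',
    '{dog}: {idx} see="canine"',
])
print(expand("cat", defs))
```

The cycle check is the main extra cost: once definitions can refer to each other, the parser has to guard against a definition that eventually refers to itself.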

I discovered djot while writing (and failing twice) a simple commonmark parser, and I must say it’s a relief to find another project with the same kind of conclusions.

While I appreciate its strictness (especially on the tables) and some of the new inline syntax (ins/del/sup/sub) that adds value for a low complexity cost, it sometimes feels like it wants to fit every profile:

  • parser : with its simpler/safer syntax
  • publisher : with its math, smart punctuation …
  • developer : with its div/span and raw code injection

But I’m worried by the following aspects:

  • Having first-class LaTeX math support means that any djot renderer must come with a LaTeX math renderer library.

  • Is the ~2000-line emoji-to-unicode list part of the djot standard that all parsers must follow and bundle?

  • it sometimes adds (too) many choices that bring corner cases: an i. marker can be either <ol type=i> or <ol start=9 type=a>, while djot was supposed to eliminate such cases

  • having both ~~~ and ``` for code blocks contradicts Rationale 11

  • the span/div attribute syntax is a mess to parse with regex: sometimes it comes before its element (block) and sometimes after (inline). Needless to say, giving programming ability through those attributes or through the =html native code injection will prevent this language from being used on any user-given content (forum/chat/readme)

  • A developer writing a README will get upset by the smart punctuation substituting all their dashes and quotes, but a PhD or a blogger will probably be happy about it.

  • I’m not enough of a djot master to understand why backticks mess up the inline parsing so much: they interrupt even in the middle of an em or strong (I thought the first opener wins), so

    ``test
    

    is rendered the same as

    ``test``
    
  • optional fun fact: the \<newline> syntax does the opposite of what it does in all programming languages, where it creates a continuation line

I started writing a djot2dom JS renderer, and it feels simpler to parse, yet all those new corner cases and heavy features made me lose my motivation (again).

I think I’ll continue my journey in the search for a stricter GFM subset :slight_smile:

Math support is not a big issue. You can always pass the raw TeX math through verbatim (it is usually still fairly readable). In HTML, you just need this plus one line to pull in MathJax from a CDN.

Emojis: well, the standard is vague on what is required to conform for emojis, but I don’t know how important this is.

I’m still not sure about having both forms for code blocks; I take the point about Rationale 11.

Worries about code injection: the parser will recognize raw HTML and also attributes. How these are rendered is up to the renderer, which could decide to omit all raw HTML and potentially dangerous attributes, or sanitize them somehow. Alternatively, pass the result of a naive render through a standard sanitizing library. Nothing here prevents you from using djot with untrusted input.

Yes, unclosed backticks are implicitly closed by the end of the paragraph. This avoids the need for expensive backtracking, and in practice I don’t think it’s a problem.

By the way, if you want to write a parser, I think your best approach would be to convert the existing Lua parser to JavaScript. Mostly this should be straightforward, since Lua is quite similar to JavaScript.

Thanks, I think I got the idea: the spec is more about the parsing than about the rendering.

For the code block: I forgot to say that I loved the idea of a variable number of backticks to handle encapsulation.

For the parser: my goal is to keep it easily hackable (so in a <200LoC range) while taking advantage of the browser DOM API to directly build the Elements tree without the need for an IR/AST. So not really the same goal :slight_smile:

My post of djot on Hacker News made it to the front page and has sparked a growing discussion over there: Djot: A light markup language by the creator of Pandoc and CommonMark | Hacker News

I just noticed, in referencing your answer here in a GitHub comment, that I never answered you (Yes, we are using terms differently. No, my proposal would treat the example you give as a single paragraph with a hard break, not what either Markdown or djot does). I think at the time I figured you didn’t really need my response because your answer made clear your coherent philosophy for djot and why my thoughts on what I call “manual line breaks” didn’t make sense for it. Let me know if you actually want my explanation.

The reason I joined this forum is Djot! :smile:

I only joined this forum to voice some of my issues regarding markdown and was delighted to learn about djot which fixes some of them. However, I have to agree with @vas on his main criticisms, especially regarding line breaks. I wholeheartedly agree on the “human-friendly” requirement for markdown / a potential successor of it and for this, hard-wrapping is a bane.

Lists, nested ones in particular, are a pain in markdown, but I think they got considerably better in djot. What I especially like is the symmetry: if you want text before or after a list, you need an empty line to separate it. Still, due to hard-wrapping compatibility, the annoying empty line remains.
As I see it, the problem stems from a conceptual difference that I claim to exist in the meaning of the word “paragraph” from the standpoints of writing and layout:

  • From a writer’s perspective, it is “a self-contained unit of discourse in writing dealing with a particular point or idea.” (Taken from Wikipedia).
  • From a layout perspective, it is a chunk of text with a line break at the end and sometimes a bit of indentation and spacing here and there.

As a result, most writers don’t conceptualize lists as breaking a paragraph but rather as part of it. Breaking the paragraph, however, is exactly what HTML does, due to its nature as a layout instrument, which offers an obvious way to interpret and implement lists. This goes against the intuition of the average user (or maybe only me), and even if it didn’t, it shouldn’t be the model if djot or markdown seek to be used not only for HTML but for other targets such as *TeX as well.
Personally, I really like this idea of separating thoughts visually by surrounding them with empty lines. It would make sense to render

some idea
text before list
1. text
  more text
  1. nested
    1. even deeper
    back one level
    1. deeper again
  back two levels
2. finally a second item
text after list

next topic

basically as is. And yes, this is an example where we have a single thought span over multiple “paragraphs”…
This also relates to the “ambiguity” @jgm mentioned. It’s not an ambiguity if you decide on precedence. In my opinion, it should be treated as a paragraph followed by a list.
Here is a proposed simplified parsing rule set for producing HTML:

  • If the next line is empty (or white space only), the next action is probably to close a tag.
  • Otherwise check the indentation and close all tags on the stack that go deeper than this level.
    • If the thing in the line is compatible with the tag on the top of the stack add it. If a p tag is the top of the stack, add br in between.
    • Otherwise, close the tag on top of the stack and add a new tag corresponding to the added line. If the new object is normal text, encapsulate it using p.
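
One possible reading of these rules as a minimal Python sketch. Several details are assumptions rather than part of the rule set: only ordered lists are handled, a blank line closes every open tag, the text of a list item is wrapped in a p, a text line also closes a list that starts at its own indentation, and no HTML escaping is done. The name `render` is hypothetical.

```python
import re

ITEM = re.compile(r"^\d+[.)]\s+(.*)$")


def render(lines):
    out = []    # emitted HTML fragments
    stack = []  # open tags as (tag, indent) pairs, innermost last

    def open_tag(tag, indent):
        stack.append((tag, indent))
        out.append(f"<{tag}>")

    def close_top():
        tag, _ = stack.pop()
        out.append(f"</{tag}>")

    for raw in lines:
        text = raw.strip()
        if not text:
            while stack:            # assumption: a blank line closes everything
                close_top()
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        item = ITEM.match(text)
        if item:
            while stack and stack[-1][1] > indent:
                close_top()         # leave everything nested deeper
            if stack and stack[-1][0] == "p":
                close_top()         # a list item is not compatible with a p
            if stack and stack[-1] == ("li", indent):
                close_top()         # next item of the same list
            else:
                open_tag("ol", indent)
            open_tag("li", indent)
            open_tag("p", indent + 1)   # item text sits just inside the item
            out.append(item.group(1))
        else:
            # Text leaves deeper tags, and also a list at its own indentation.
            while stack and (stack[-1][1] > indent or
                             (stack[-1][0] != "p" and stack[-1][1] == indent)):
                close_top()
            if stack and stack[-1][0] == "p":
                out.append("<br>")  # compatible with the open p: add a br
            else:
                open_tag("p", indent)
            out.append(text)
    while stack:
        close_top()
    return "".join(out)


print(render(["intro", "1. a", "  more", "2. b", "after"]))
```

In this reading, a line such as `9) and then went in.` always starts a list, which matches the precedence suggested above (a paragraph followed by a list).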

Maybe GitHub markdown is already what we want in this regard. I have to admit that I didn’t check how they do it. Edit 2: Maybe it would actually be better to let new lines separate p tags instead of br tags inside a p.

In defense of hard-breaks: Sometimes they are indeed useful. For example when you have formulas in normal text. But please not as default behavior. A reversal of the situation using the \ character would work fine in my opinion and may also be familiar to people from the C pre-processor or Python, where it is used in precisely this meaning: This line break is not a line break.
Edit: I think that cpp adds the line breaks in the output, so my comment on the C pre-processor is not correct, but that doesn’t affect the point I made.

Another thing I never understood is the need for loose and tight lists.
Edit: In my opinion, it breaks with the concept of the empty line as a separation between things.
In the following snippet, if I want the new line between A and B, I get a mixture of tight and loose lists although the list starting with C is formatted in the same way, except that its second paragraph is another list rather than a piece of text.

text

1. A
  
  B

  1. C

    1. text
    2. text

    text
  2. text
2. text
3. text

text

I guess, that’s my 50 cents. Turned out longer than expected.