Designing a programming language so its source files are markdown files

I’m playing with some ideas for a domain-specific language. The programs written in it are likely to be small but fairly dense so I was hoping to make it easy to document things and to keep documentation in sync with code.

I thought that I might have the source files themselves be markdown so there wouldn’t be a syntax for comments per se. Instead, if my language were called myDSL I might write

# Hello, World!
Now once more with feeling, programmatically
```myDSL
print("Hello, World!")
```

The code block with the language tag would indicate to the lexer that it needs to stop ignoring comments and produce some tokens for the compiler.
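A rough sketch of that extraction step, written in Python rather than in any real myDSL toolchain. It assumes fences are plain triple backticks and that the info string is nothing but the language tag; `extract_blocks` is a hypothetical helper name. Each block is returned with the markdown line number where its code starts.

```python
FENCE = "`" * 3  # built this way so the example has no literal fence inside it


def extract_blocks(markdown, lang="myDSL"):
    """Return (start_line, code) pairs for fenced blocks tagged with `lang`."""
    blocks = []
    current = None  # lines of the open block, or None when outside any fence
    keep = False    # True when the open block carries the wanted tag
    start = 0
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if current is None:
            if stripped.startswith(FENCE):
                # Opening fence: everything after the backticks is the tag.
                keep = stripped[len(FENCE):].strip() == lang
                current = []
                start = number + 1
        elif stripped == FENCE:
            # Closing fence: emit the block only if it had the wanted tag.
            if keep:
                blocks.append((start, "\n".join(current)))
            current = None
        else:
            current.append(line)
    return blocks


sample = "\n".join([
    "# Hello, World!",
    "Now once more with feeling, programmatically",
    FENCE + "myDSL",
    'print("Hello, World!")',
    FENCE,
])

for start_line, code in extract_blocks(sample):
    print(start_line, code)
```

Blocks with any other tag, and all prose, are skipped, which is the "ignore comments" half of the lexer's job.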

Literate programming provides examples of how to extract code from documentation and stitch it into a coherent whole before annoying the compiler. I did a search for “literate markdown” and there is some prior art for this idea: ycomb, fadado/literate and others. Even after looking at those though, I had a few concerns that I’m still fuzzy on.

  • We know how to associate delimited comments with declarations; javadoc comments precede the declaration. Finding the text associated with a specific declaration so we can supply IDEs with tooltips might be harder with this approach.
  • Often, some files in a project need to be identified as main files where execution starts. This is typically done with a shebang like #!/bin/rundown when compilation isn’t mandatory. Is that compatible with markdown? It seems like it might conflict with the level-1 header.
  • Most libraries include some symbolic constants defined in code, and referred to from both code and documentation. It’s hard from markdown to refer to specific lines in a code block, or to identify a code block. Maybe there’s a markdown feature we could piggyback on to allow symbolic constants to be defined in the documentation and referred to from code:
    [my_symbolic_constant]: data:text/plain,"A constant value"
    
    could define data available from both documentation and code. The trend in recent languages seems to be away from strictly typed constants (see Go’s arbitrary-precision, untyped constants, and Java’s use of poly expressions for lambdas and method references). The lack of a nice way to declare a type for symbolic constants declared this way may not be a problem. Any way to associate an anchor with a ```JSON…``` block or table might allow code to refer to lookup tables and default configurations defined in documentation.
  • Many programming languages have a mechanism to make symbols available to code in another source file: import, include, load, require. Markdown documentation can link to other markdown files via [_](_) syntax. Piggybacking on that syntax would require changing human-visible documentation to conform to the needs of code. Named links might help again:
   [a_local_dependency]: ../foo/bar.my-dsl.md
   [a_remote_dependency]: https://github.com/releases/1.0.0/baz/boo.md

   ```myDSL
   use x from [a_local_dependency];
   use * from [a_remote_dependency];
   ```
  • Often the best way to document an API is with examples. Whenever I write usage examples I try to write tests to avoid embarrassing myself. I fail often enough that it’d be nice to extract usage examples and run them with tests a la updoc. So it seems that code in documentation might have at least two flavors: production and test/example code. Is there a convention I could piggyback on to make it obvious to even a reader who has not read the language spec which code snippets are test code and which are production code?
  • Sometimes one language will embed another. For example, C can embed inline assembly. If the lexer needs to route snippets of test and production code differently, and possibly extract snippets of code in 2 or more languages, should there be some unambiguous signal to the lexer that this is a code block to extract? Is there one that doesn’t break code pretty-printing when viewed as markdown?
  • Good documentation should explain how abstractions can be combined to solve a class of problems. Having all the gory details in the documentation may distract from that. C has a split (modulo pragmas like inline) between headers that declare what’s available and .c files that implement. Recent languages have almost uniformly moved away from that model for, I assume, good reasons. A linter that discourages large blocks of code in modules that are within a certain link depth from README might help teams structure code to be easily navigable. Alternatively, code blocks that are collapsed by default when viewed in markdown might help.
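To make the named-link idea from the list above concrete, here is a minimal sketch in Python. It assumes the hypothetical `use ... from [label];` statement shown earlier, link destinations without spaces, and case-sensitive labels (markdown itself matches labels case-insensitively). `link_definitions` and `resolve_imports` are made-up names, and for brevity the sketch scans the whole file rather than only the myDSL blocks.

```python
import re

# A link reference definition: up to three spaces, then [label]: destination
LINK_DEF = re.compile(r"^[ ]{0,3}\[([^\]]+)\]:[ \t]*(\S+)", re.MULTILINE)

# Hypothetical myDSL import statement: use <names> from [label];
USE_STMT = re.compile(r"use\s+(\S+)\s+from\s+\[([^\]]+)\];")

FENCE = "`" * 3  # avoids a literal fence inside this example


def link_definitions(markdown):
    """Map each link label to its destination."""
    return dict(LINK_DEF.findall(markdown))


def resolve_imports(markdown):
    """Return (imported_names, destination) for every use statement.

    Raises KeyError when a use statement names an undefined label.
    """
    targets = link_definitions(markdown)
    return [(names, targets[label])
            for names, label in USE_STMT.findall(markdown)]


doc = "\n".join([
    "[a_local_dependency]: ../foo/bar.my-dsl.md",
    "[a_remote_dependency]: https://github.com/releases/1.0.0/baz/boo.md",
    "",
    FENCE + "myDSL",
    "use x from [a_local_dependency];",
    "use * from [a_remote_dependency];",
    FENCE,
])

for names, target in resolve_imports(doc):
    print(names, "<-", target)
```

Because the definitions are ordinary markdown, a renderer hides them from readers while the compiler can still see them, so the visible prose does not have to bend to the needs of the code.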

I’m sure there’re many other issues that I haven’t considered. What do people think? Is embedding code in documentation a terrible idea? Has this been done on a scale that lessons have been learned? What are other issues to consider? Are there other features of markdown that are relevant?

If you’ve read this far, thanks,
mike


Out of interest, have you tried zyedidia/Literate? As far as I can tell, it offers what you’re after.

Thanks for the pointer.

I saw that or something very similar. It does a nice job tying together code blocks and I was unfamiliar with named ---...--- style code blocks.

I was asking more about how to design a language so that the code and doc sections interoperate well.
Designing a language separately from markdown and using a general tool like that would not address:

  • associating comments with declarations
  • shebang
  • referencing symbolic constants from the docs
  • module linking
  • distinguishing test and prod code

I have a quibble with that implementation. It extracts blocks instead of replacing non-code content with whitespace which means that line numbers in error messages index the extracted code, but not lines in the markdown. In the C case, that could be worked around via a #line before each extracted block:

#line 123 "foo.md"

but a language-agnostic solution is going to be harder.
Preserving diagnostic metadata is less of a concern if the language is designed around markdown.
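A minimal sketch of the blanking approach, in Python and under the same simplifying assumptions as before (plain triple-backtick fences, the info string is only the language tag, `blank_non_code` is a hypothetical name). Every line that is not inside a myDSL block, including the fence lines themselves, becomes an empty line, so line N of the output is line N of the markdown.

```python
FENCE = "`" * 3  # avoids a literal fence inside this example


def blank_non_code(markdown, lang="myDSL"):
    """Replace everything outside `lang` blocks with empty lines."""
    out = []
    inside = False  # inside any fenced block
    keep = False    # inside a block tagged with `lang`
    for line in markdown.splitlines():
        stripped = line.strip()
        if not inside and stripped.startswith(FENCE):
            inside = True
            keep = stripped[len(FENCE):].strip() == lang
            out.append("")  # the opening fence is not code
        elif inside and stripped == FENCE:
            inside = False
            keep = False
            out.append("")  # neither is the closing fence
        else:
            out.append(line if keep else "")
    return "\n".join(out)


doc = "\n".join([
    "# Title",
    "Some prose.",
    FENCE + "myDSL",
    "print(1)",
    FENCE,
    "More prose.",
])

for number, line in enumerate(blank_non_code(doc).split("\n"), start=1):
    print(number, repr(line))
```

A compiler fed this output reports errors against markdown line numbers directly, with no `#line`-style directive needed.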

Many of the original needs that Knuth solved arose from having Standard Pascal as the “assembly language”. They are not present in modern languages, and as GitHub has demonstrated, Markdown and friends can do a lot of the work for documentation. Hence I wondered what can be done just by switching back and forth between the programming language and Markdown, using comments and backticks carefully.

I did this for a rather complex but short Java demonstration program at https://github.com/ravn/dagger2-hello-world (where README.md is also a symlink to the Java file, so GitHub will render it directly). I found Markdown to be expressive enough for my needs.

I am considering writing it up as a blog article.

Mike Samuel,
It is a great idea and I thought you might be interested in an implementation of literate programming for the statistical programming language “R”.

Have you heard of R Markdown?
R Markdown is markdown with code blocks for the statistical programming language “R” and other programming languages.

The RStudio IDE allows users to create R Markdown (.Rmd) files. These are markdown that the R knitr package pre-processes into Pandoc-compatible markdown (.md), which can then be output to HTML, PDF or Microsoft Word.
https://rmarkdown.rstudio.com/lesson-2.html

I would like VS Code to do the same thing more generally for all VS Code supported languages
and added some information to an issue here: (I was learning as I posted, so the later posts are more informed).
