I have designed the implementations to be fast – e.g. avoiding backtracking whenever possible. I have optimized them to the best of my ability, but I’m not a js or C wizard, so I’m sure they can still be improved.
I can give you some data on performance.
Running make benchjs on my MacBook Air says:
std markdown → html x 226 ops/sec ±6.74% (88 runs sampled)
showdown.js markdown → html x 123 ops/sec ±1.22% (81 runs sampled)
marked.js markdown → html x 415 ops/sec ±0.76% (94 runs sampled)
So there may be room for improvement; but of course, marked isn’t as accurate a converter, so some of its speed may be due to shortcuts it takes.
My tests of the C implementation suggest that its performance is about the same as discount’s. For example, on my laptop it takes 0.03s (user) to convert a 179K manual. It seems to me that this should be fast enough. sundown is considerably faster, though, so again there may be room for improvement – though again, I worry that sundown achieves its performance by taking shortcuts that make proper parsing impossible. Anyway, I am sure C experts will be able to improve the performance quite a bit.
Just to update this thread: after many performance improvements by Vicent Marti, cmark is now about 6 times faster than discount and just a tad slower than sundown.
I’ve spent the last couple days optimizing commonmark.js, which is now just a little slower than marked.js:
commonmark.js markdown->html x 709 ops/sec ±1.28% (95 runs sampled)
showdown.js markdown->html x 248 ops/sec ±1.90% (87 runs sampled)
marked.js markdown->html x 729 ops/sec ±2.20% (94 runs sampled)
markdown-it markdown->html x 986 ops/sec ±1.15% (94 runs sampled)
Note that the benchmarks are highly dependent on the specific input used; I used a 10 MB Markdown text composed of twenty concatenated copies of the first edition of Pro Git. (make benchjs will run the benchmark above; make benchjs BENCHINP=foo.txt will use foo.txt for the source.)
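For anyone who wants to reproduce a comparison like this outside the repo, here is a minimal sketch using the benchmark.js library; the input file name, the BENCHINP handling, and the set of parsers are assumptions, not the contents of the actual make benchjs script:

```js
// Hypothetical benchmark.js comparison, not the project's actual bench script.
var Benchmark = require('benchmark');
var fs = require('fs');
var commonmark = require('commonmark');
var marked = require('marked');

// Assumed input handling, mirroring the BENCHINP idea described above.
var input = fs.readFileSync(process.env.BENCHINP || 'progit.md', 'utf8');

var reader = new commonmark.Parser();
var writer = new commonmark.HtmlRenderer();

new Benchmark.Suite()
  .add('commonmark.js markdown->html', function () {
    writer.render(reader.parse(input));
  })
  .add('marked.js markdown->html', function () {
    marked.parse(input);
  })
  .on('cycle', function (event) {
    // Prints lines like "commonmark.js markdown->html x 709 ops/sec ±1.28% (95 runs sampled)"
    console.log(String(event.target));
  })
  .run();
```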
My mistake! Well, I’m happy that commonmark.js does even better on the progit benchmark. It would probably be good to put in place a better benchmark suite that tests a variety of different inputs, as you have in markdown-it.
It seems you were right in your earlier measurements: even markdown-it is still 5x slower than the C implementation on big files, instead of the 2-3x I expected. I have no more ideas about what to optimize.
I even tried unifying the token classes at the first level of properties, but didn’t notice any speed gain.
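For context, here is a rough sketch of what unifying token shapes can mean in practice: every token gets the same set of top-level fields (so the JS engine can keep a single hidden class) regardless of token type. The field names below are illustrative, not markdown-it’s actual token layout:

```js
// Illustrative only: one constructor giving every token the same property
// shape, so the JS engine can treat them monomorphically.
function Token(type, tag, nesting) {
  this.type     = type;    // e.g. 'blockquote_open', 'inline'
  this.tag      = tag;     // e.g. 'blockquote', '' for inline content
  this.nesting  = nesting; // 1 = opening, 0 = self-contained, -1 = closing
  this.attrs    = null;    // always present, even when unused
  this.content  = '';
  this.children = null;
}

var tokens = [
  new Token('blockquote_open',  'blockquote',  1),
  new Token('inline',           '',            0),
  new Token('blockquote_close', 'blockquote', -1)
];
```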
By the way, I don’t think spec.txt is a good benchmark text. The reason is that it’s a very special kind of file – not Markdown but a combination of Markdown and some special conventions.
So, for example, all of the raw HTML in the right-hand panes of the examples will come through as raw HTML (since it’s not in a real code block in spec.txt), giving you a document that’s much heavier in raw HTML than in any other kind of construct – and quite atypical. The . delimiters for the example blocks will also prevent many of the Markdown examples from having their normal meanings, for example,
.
1. a list
2. not
.
will not be parsed as a list, but as a regular paragraph, because of the dots.
Ideally we’d want a benchmark text that (a) contains every kind of construct and (b) reflects ordinary documents in the relative frequencies of things.
You are right, the spec is not very good for this. It would be better to replace it with something else.
I don’t care much about absolute speed or comparisons with other parsers. What matters most to me is detecting anomalies and regressions. High markup density is preferable for my goals, but the spec is really not ideal because of the dots.
For markdown-it, the most expensive input will be very big nested blockquotes (they cause stacked remapping of line ranges). Inlines are roughly similar to the reference. It’s probably possible to rewrite the block level to be similar to the reference, but designing an optimal shared state machine is very boring - neither Alex nor I wish to do it again. Ideally, the reference parser would support markup extensions in some “linear” way (not via tons of pre/post hooks).
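To reproduce the effect described above, one could generate a worst-case input of deeply nested blockquotes; this is just an assumed helper script, not part of markdown-it’s benchmark suite:

```js
// Generate a deeply nested blockquote document (hypothetical stress input).
var fs = require('fs');

var depth = 50;
var lines = [];
for (var i = 1; i <= depth; i++) {
  // i levels of "> " followed by a line of text
  lines.push(new Array(i + 1).join('> ') + 'level ' + i);
}
fs.writeFileSync('nested-quotes.md', lines.join('\n') + '\n');
```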
An interesting aspect I noticed when testing the performance of various .NET implementations - some fail really hard (>30 sec) when given a huge HTML file such as the IMDB homepage. If you create a test suite for performance, such a test might be worth including.
I also used the ProGit book for benchmarking, but instead of using just the English version concatenated 10 times, I merged all languages together - so there is much more Unicode content involved, which might be useful to test.
> An interesting aspect I noticed when testing the performance of various .NET implementations - some fail really hard (>30 sec) when given a huge HTML file such as the IMDB homepage. If you create a test suite for performance, such a test might be worth including.

This must be something specific to those implementations. Both of the reference implementations (C and JS) parse that page in an instant.
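As a rough sketch of what such a stress test could look like (the input file name here is hypothetical, and this is not part of any existing suite), one can simply time a single parse of a large raw-HTML document with commonmark.js:

```js
// Time one parse+render pass over a large raw-HTML file with commonmark.js.
var fs = require('fs');
var commonmark = require('commonmark');

var html = fs.readFileSync('imdb-homepage.html', 'utf8'); // hypothetical input
var reader = new commonmark.Parser();
var writer = new commonmark.HtmlRenderer();

var start = Date.now();
writer.render(reader.parse(html));
console.log('parsed and rendered in ' + (Date.now() - start) + ' ms');
```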
In the upcoming 4.0 of markdown-it, rewriting the renderer to support attribute lists cost ~20% in performance. Much better than I expected. The new renderer uses a similar approach to the reference parser - attributes are stored in an array, to allow extending the output without changing the renderer functions.
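Roughly, the idea is that each token carries its attributes as an array of [name, value] pairs, so a plugin can push an extra attribute without replacing the renderer function. A minimal sketch, with illustrative names rather than the actual markdown-it 4.0 internals:

```js
// Sketch of attribute rendering from an attrs array (illustrative names).
function renderAttrs(token) {
  if (!token.attrs) { return ''; }
  return token.attrs.map(function (attr) {
    return ' ' + attr[0] + '="' + attr[1] + '"';
  }).join('');
}

var linkOpen = { tag: 'a', attrs: [['href', 'http://example.com/']] };

// A plugin can extend the output without touching renderAttrs:
linkOpen.attrs.push(['target', '_blank']);

console.log('<' + linkOpen.tag + renderAttrs(linkOpen) + '>');
// -> <a href="http://example.com/" target="_blank">
```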
I like these ideas, but when I am looking at benchmarks and such, I am concerned about different styles of writing Markdown documents. For example, I almost always use ATX headings, inline links, and no HTML. So I can use my own blog Markdown to test the performance of how I write… but not as much for other styles.
Would there be some use in trying to put together a series of “sample” files that serve as good benchmarks, as a kind of adjunct to the specification?
Good question; @jgm do you have a set of “reference” documents in Markdown (probably large ones, with lots of Markdown features) that could be used for benchmarking the speed of implementations? I looked up my old notes and found a reference to MDTest 1.1, but that was mostly about conformance, not performance.
Aha, reading back through the thread, it looks like this was the choice – the ProGit book, with all languages merged?
I’ll bet the GitHub folks have a corpus of docs that is a representative sample of the Markdown in all the GitHub repos they host, which they use to performance-test their implementation, since I’m sure performance is especially important for them.
@codinghorror wouldn’t Stack Exchange also have such a sample?
It was a one-time hack, but I can imagine it being expanded and turned into a more reusable form. The advantage is that it actually runs multiple mini-benchmarks, each trying to exercise a particular part of the parser implementation. So this approach can also give a hint about where your implementation lags behind.
There are also some small fine-grained benchmark sample files in the cmark repository (bench/samples). (Run with ‘make newbench’.) Most of these derive originally from the markdown-it project. You can use these to determine that, for example, parsing of nested block quotes is faster in X than in Y.
Cool suggestions all! I will start looking into each of them and report back here when I have some time.
While I don’t need everything to be perfect, I do want to make sure I put my best foot forward and test a decent variety of styles, to let me know: 1) that I didn’t miss any grammar-related issues, and 2) that I didn’t sacrifice performance in one area by not looking at it sufficiently.