Corpus of MarkDown documents for benchmarking?

mity · December 6, 2016, 1:45am

Hello,

So far, for profiling and benchmarking, I have mostly been using the Pro Git book (an amalgamation of many language versions), as generated by cmark’s make bench.

I have now tried to use it to find some bottlenecks and optimization opportunities (gprof & gcov rulez) in my implementation. But some of the numbers I got show that the book is probably not very good data for this kind of work, given these line classification statistics:

Indented code (start)    577100 (38.9%)
Blank                    348271 (23.5%)
Paragraph                318740 (21.4%)
Indented code (cont.)    151810 (10.2%)
ATX header                55850 ( 3.8%)
Html (start)              27460 ( 1.9%)
Html (cont.)               4920 ( 0.3%)
=======================================
Total lines             1484151

(The numbers were obtained by running gcov on MD4C’s md_analyze_line(), which serves as the core of (leaf) block analysis.)
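For anyone who wants a comparable table for their own documents without instrumenting a parser, here is a minimal sketch in Python. It is not MD4C's md_analyze_line(): the names classify_line, line_stats and format_stats are made up for illustration, the classification is context-free (it ignores lists, fences, HTML blocks and lazy continuation), and its numbers will therefore only roughly approximate what a real parser reports.

```python
import re
from collections import Counter

# ATX header: up to 3 spaces, 1-6 '#', then whitespace or end of line.
ATX_RE = re.compile(r"^ {0,3}#{1,6}(?:[ \t]|$)")


def classify_line(line: str) -> str:
    """Rough, context-free guess at the kind of a single line (hypothetical helper)."""
    if line.strip() == "":
        return "Blank"
    # Assumption: any line indented by 4+ spaces or a tab counts as code.
    # A real parser would also check what precedes the line.
    if line.startswith("    ") or line.startswith("\t"):
        return "Indented code"
    if ATX_RE.match(line):
        return "ATX header"
    return "Other text"


def line_stats(text: str) -> Counter:
    """Count how many lines fall into each rough category."""
    return Counter(classify_line(line) for line in text.splitlines())


def format_stats(stats: Counter) -> str:
    """Render the counts as a plain-text table, most frequent category first."""
    total = sum(stats.values())
    rows = []
    for name, count in stats.most_common():
        share = 100.0 * count / total if total else 0.0
        rows.append(f"{name:<24}{count:>8} ({share:4.1f}%)")
    rows.append("=" * 39)
    rows.append(f"{'Total lines':<24}{total:>8}")
    return "\n".join(rows)
```

Running print(format_stats(line_stats(text))) over the contents of a file gives a quick feel for how code-heavy it is before using it as benchmark input.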

If you look into the book source, you can see there are indeed a lot of code blocks (demonstrating git in action) and that the textual content mostly does not include any line breaks, which reduces the count of textual lines (Paragraphs + ATX headers) to a mere ~25% of lines.

I guess that such MarkDown usage is quite specific, so I would like to ask the community whether anybody here has some kind of larger corpus of MarkDown documents available, ideally compiled from various sources and/or authors, and which would be more representative and less prone to such strong biases.

jgm · December 6, 2016, 9:30am

See here for commonmark.js. These can be run using make bench-detailed from the root; this has a little awk script for creating a nice table of results for inclusion in the README.md here.

Most of these were taken from the markdown-it benchmark suite, so thanks to @vitaly for them.

It would be nice to set something like this up for the C implementations, too.

vitaly · December 6, 2016, 9:52am

I’d like to clarify: those test files are NOT “average samples”. They help to benchmark separate components, for development checks.

If anyone wishes to do a general comparison of parsers, other samples are required.

jgm · December 6, 2016, 9:59am

I’ve added a newbench Makefile target to cmark with the samples.

Perhaps we can also find a better “average document” for general benchmarking? I chose progit because it was long, “real”, freely available, and multilingual.

lwmr · December 15, 2016, 4:02am

I’d write a crawler that visits every GitHub repository accessible from https://awesome.re/, (optionally) checks the repo’s license and grabs every .md document available. Looks like a good Markdown-in-the-wild sample to me.
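Leaving the crawling and the license check aside, the last step (grabbing every .md document) could look roughly like the sketch below. It assumes the repositories have already been cloned under some local directory; collect_markdown is a hypothetical helper name, and matching on the .md extension alone is a simplification that misses files named .markdown.

```python
from pathlib import Path


def collect_markdown(root):
    """Return a sorted list of all .md files found anywhere under root.

    Assumes root is a local directory of already-cloned repositories;
    nothing here talks to the network.
    """
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".md"
    )
```

The resulting paths could then be concatenated into one large file, or fed to the parser one by one, depending on what the benchmark is supposed to measure.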