So far, for some profiling and benchmarking, I have mostly been using the Pro Git book (an amalgamation of many language versions), as generated by Cmark’s
I have now tried to use it to find some bottlenecks and optimization opportunities (gcov rulez) in my implementation. But some of the numbers I got suggest that the book is probably not very good data for this kind of work, given these line classification statistics:
Indented code (start)    577100 (38.9%)
Blank                    348271 (23.5%)
Paragraph                318740 (21.4%)
Indented code (cont.)    151810 (10.2%)
ATX header                55850 ( 3.8%)
Html (start)              27460 ( 1.9%)
Html (cont.)               4920 ( 0.3%)
=======================================
Total lines             1484151
(The numbers were obtained by running gcov on MD4C’s md_analyze_line(), which serves as the core of (leaf) block analysis.)
If you look into the book source, you can see that there are indeed a lot of code blocks (demonstrating git in action) and that the textual content mostly does not include any line breaks, which reduces the count of textual lines (paragraphs + ATX headers) to a mere ~25% of all lines.
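For anyone who wants to double-check the proportions, here is a small Python sketch that recomputes them from the counts in the table above. The dictionary keys simply mirror the table labels, and the `share()` helper is only for illustration; none of this is part of MD4C.

```python
# Line counts copied from the table above (gcov data for md_analyze_line()).
counts = {
    "Indented code (start)": 577100,
    "Blank": 348271,
    "Paragraph": 318740,
    "Indented code (cont.)": 151810,
    "ATX header": 55850,
    "Html (start)": 27460,
    "Html (cont.)": 4920,
}

total = sum(counts.values())


def share(*kinds):
    """Return the percentage of all lines that fall into the given classes."""
    return 100.0 * sum(counts[k] for k in kinds) / total


# "Textual" lines as used in the post: paragraphs plus ATX headers.
textual = share("Paragraph", "ATX header")
# Both kinds of indented code lines taken together.
code = share("Indented code (start)", "Indented code (cont.)")

print(f"total lines: {total}")
print(f"textual:     {textual:.1f}%")
print(f"code:        {code:.1f}%")
```

With these counts, indented code accounts for roughly twice as many lines as the textual content does.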
I guess that such Markdown usage is quite specific, so I would like to ask the community whether anybody here has a larger corpus of Markdown documents available, ideally compiled from various sources and/or authors, which would be more representative and less prone to such strong biases.