Hello,
So far, for profiling and benchmarking, I have mostly been using the Pro Git book (an amalgamation of many language versions), as generated by cmark’s make bench.
I have now tried to use it to find bottlenecks and optimization opportunities in my implementation (gprof & gcov rulez). But some of the numbers I got suggest that the book is probably not very good data for this kind of work, given these line classification statistics:
Indented code (start)    577100 (38.9%)
Blank                    348271 (23.5%)
Paragraph                318740 (21.4%)
Indented code (cont.)    151810 (10.2%)
ATX header                55850 ( 3.8%)
Html (start)              27460 ( 1.9%)
Html (cont.)               4920 ( 0.3%)
=======================================
Total lines             1484151
(The numbers were obtained by running gcov on MD4C’s md_analyze_line(), which serves as the core of (leaf) block analysis.)
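For anyone who wants to check their own corpus without instrumenting a parser, a rough tally along the same lines could be sketched as below. This is only an approximation under simplified assumptions: the names classify_lines() and report() are made up for illustration, the heuristics cover just blank lines, ATX headers, indented code and paragraphs, and none of it reflects how md_analyze_line() actually works.

```python
import re
from collections import Counter

# Up to 3 spaces of indentation, 1-6 '#' characters, then whitespace or end of line.
ATX_RE = re.compile(r"^ {0,3}#{1,6}(\s|$)")


def classify_lines(text):
    """Tally lines by a rough block type (simplified heuristics, not a real parser)."""
    counts = Counter()
    prev = "Blank"
    for line in text.splitlines():
        if not line.strip():
            kind = "Blank"
        elif line.startswith(("    ", "\t")) and prev != "Paragraph":
            # Indented code cannot interrupt a paragraph, so an indented
            # line directly after paragraph text is counted as paragraph.
            kind = "Indented code"
        elif ATX_RE.match(line):
            kind = "ATX header"
        else:
            kind = "Paragraph"
        counts[kind] += 1
        prev = kind
    return counts


def report(counts):
    """Format the tally as a table, most frequent line type first."""
    total = sum(counts.values())
    rows = [
        f"{kind:<20}{n:>10} ({100.0 * n / total:5.1f}%)"
        for kind, n in counts.most_common()
    ]
    rows.append(f"{'Total lines':<20}{total:>10}")
    return "\n".join(rows)
```

Calling report(classify_lines(text)) on the concatenated text of a corpus gives a table comparable in spirit to the one above, though the categories are coarser (no HTML blocks, no start/continuation split).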
If you look into the book source, you can see there are indeed a lot of code blocks (demonstrating git in action), and the textual content mostly does not include any line breaks, which reduces the count of textual lines (Paragraphs + ATX headers) to a mere ~25% of all lines.
I guess that such Markdown usage is quite specific, so I would like to ask the community whether anybody here has a larger corpus of Markdown documents available, ideally compiled from various sources and/or authors, which would be more representative and less prone to such strong biases.