Inspired by this project, I have started implementing a parser in C#. At the moment 103 tests pass.
I will be publishing on GitHub and NuGet, but before I do, I want to avoid any controversy with regard to naming.
I have tentatively called it CommonMarkSharp. Does anyone have an issue with that name?
The guideline is this: you are free to use the name CommonMark in any way you like, provided you pass all tests in the current version of the spec.
Thanks for the clarification. Passing all the tests is certainly the goal.
I have also started porting the existing C code to C#. I have reached a similar level of completeness, but creating two different implementations does not seem to be the best idea. Any suggestions on how we could merge the efforts?
You should look at this blog post:
how to call the JS CommonMark engine using the CodeFluent Script Engine
The main problem with this approach would be performance.
The next problem is cross-platform compatibility since nowadays .NET libraries like these must run on Mono and .NET Framework and on Android/iOS/Windows.
An update on my progress: only 6 tests are still failing, and 4 of those are just problems with the Perl test runner: it does not work properly with UTF on Windows (Strawberry Perl), and there is one issue with different newlines that a simple regex does not correct for some reason.
I finished the port to C#: https://github.com/Knagis/CommonMark.NET
It passes all the tests in the current specification.
The next steps are performance and memory profiling and some refactoring for the public interface (the syntax tree elements).
The code violates C# naming conventions and has some incomprehensible method names (e.g. cr). I understand that this is due to the C roots of the project. Will you accept pull requests fixing this?
Yes, the naming is this way only because it was ported directly from C code. That is also why there are so many ref parameters - those should go away as well.
I will of course accept pull requests to fix this although we should be careful to not change the structure too much and also note the original names in comments. The idea behind this is so that when the specification or tests change, we could implement the changes by looking at those made to the reference implementation, instead of reinventing the wheel.
Before the specification reaches version 1.0, I believe such a pragmatic approach would be better.
Fair enough, I’ll note methods’ original names in comments. Although usually it’s just removing the underscore and camel casing the name.
What about automated testing btw? I think that’s a rather high priority, to get some normal QA (perl script testing is so 90s).
Yes, I agree that unit tests within VS should be used instead of the Perl script (especially since the currently failing tests only fail because Perl on Windows and/or the Windows console does not properly handle Unicode symbols).
The one thing we should keep is that the source of the tests is the spec.txt file so that for new versions it is just a matter of replacing that one file.
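To make the "spec.txt is the single source of tests" idea concrete, here is a minimal sketch in Python (not the C# used by the project) of pulling examples out of the spec text. It assumes the dot-delimited layout, where a line containing only "." opens an example, a second one separates the Markdown from the expected HTML, and a third one closes it, and it assumes tabs are written as a visible arrow character. The names SpecExample and read_examples are made up for this illustration.

```python
import io
from typing import Iterator, NamedTuple


class SpecExample(NamedTuple):
    number: int      # 1-based position of the example in the file
    markdown: str    # input to hand to the parser
    html: str        # output the parser is expected to produce


def read_examples(spec_text: str) -> Iterator[SpecExample]:
    """Yield every example found in dot-delimited spec text (assumed layout)."""
    state = 0                        # 0 = prose, 1 = markdown, 2 = html
    markdown, html = [], []
    number = 0
    # StringIO splits on "\n" only, so unusual Unicode separators inside
    # an example are left untouched.
    for line in io.StringIO(spec_text):
        if line.rstrip("\r\n") == ".":
            state = (state + 1) % 3
            if state == 0:           # closing dot: the example is complete
                number += 1
                yield SpecExample(
                    number,
                    "".join(markdown).replace("→", "\t"),
                    "".join(html).replace("→", "\t"),
                )
                markdown, html = [], []
        elif state == 1:
            markdown.append(line)
        elif state == 2:
            html.append(line)
```

A unit test runner can then loop over read_examples(...) and compare the parser's output for each markdown value against html, so picking up a new spec version really is just a matter of replacing the file.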
The first beta version of the CommonMark.NET implementation has been published to NuGet: https://www.nuget.org/packages/CommonMark.NET/.
Edit: since then a new version (0.1.1) has been uploaded. It is no longer marked as pre-release. A very simple benchmark shows that when processing the spec.txt document, CommonMark.NET now outperforms MarkdownSharp by 50%. Now just a little more to beat MarkdownDeep…
Yet another update - 0.1.3 has been optimized to perform just as fast as MarkdownDeep, which is currently the fastest alternative on .NET (that I know of).
Library                  Time   vs Baseline
CommonMark.NET 0.1.3     7 ms    11%   (current release for this library)
CommonMark.NET 0.1.2    15 ms    23%
CommonMark.NET 0.1.1    27 ms    42%
CommonMark.NET 0.1.0    56 ms    84%   (first public release)
MarkdownSharp 1.13      55 ms    84%
MarkdownDeep 1.5         7 ms    11%
CommonMarkSharp 0.1.1   91 ms   140%
Baseline                65 ms   100%   (used to compare results on different machines)

(MarkdownSharp and MarkdownDeep might not conform to the CommonMark specification.)
It has always been interesting for me to compare modern JS JITs with statically typed languages.
Selected samples: (1 of 26)
Sample: spec.txt (109764 bytes)
> current x 77.77 ops/sec ±1.44% (68 runs sampled)
> marked-0.3.2 x 23.10 ops/sec ±0.66% (42 runs sampled)
> stmd x 39.92 ops/sec ±4.07% (51 runs sampled)
13ms, lol (MBP Retina). current = remarkable in strict commonmark mode.
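For anyone comparing these numbers with the millisecond figures in the .NET table: the "13ms" is just the ops/sec value turned around. A small sketch of the arithmetic, using the three results quoted above (the helper name ms_per_op is made up here):

```python
def ms_per_op(ops_per_sec: float) -> float:
    """Convert a throughput in operations/second to milliseconds/operation."""
    return 1000.0 / ops_per_sec


# Throughput figures copied from the benchmark output above.
results = [("current", 77.77), ("marked-0.3.2", 23.10), ("stmd", 39.92)]

for name, ops in results:
    print(f"{name:<14}{ms_per_op(ops):6.1f} ms")
```

Keep in mind that these runs were on a different machine, with a different spec.txt size, than the .NET measurements, so the two sets of numbers are only loosely comparable.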
In case there is someone who is interested but is not following on GitHub:
CommonMark.NET has had 5 more releases since Sep 14
- the performance is now even better (~4ms where the 0.1.3 release had ~7ms for the
- updated the implementation to version 2 of the specification (updates to entity handling, URL encoding and emphasis parsing)
@Knagis, what computer/CPU do you use for benchmarking?
Core i5-2500 @ 3.3 GHz, using spec.txt version 1 (114 782 bytes).
I found a sundown wrapper for .NET, MoonShine, and added it to the comparison.
Unfortunately it seems that most of the performance gain is lost, probably due to string interop, so it is actually slower (~2x) for very small inputs. But for parsing the 112KB spec.txt it performs just 17% faster than CommonMark.NET (on average 3ms vs 4ms).
A better comparison would be progit.md (I concatenated all languages together, resulting in a 10MB file), where sundown/MoonShine does it in 277ms while CommonMark.NET spends 534ms. Still a very good result in my opinion (if only .NET would give access to string internals…).
progit.md, 10.6 MB (3 iterations)

Library               Total (ms)   Each (ms)   vs Baseline
Baseline                   17329        5776          100%
CommonMark.NET              1601         534            9%
CommonMarkSharp             5960        1987           34%
MarkdownSharp              16271        5424           94%
MarkdownDeep                1080         360            6%
MoonShine (sundown)          830         277            5%
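For readers who want to reproduce this kind of table, here is a rough sketch of how the three columns relate: Total is the wall-clock time for all iterations, Each is Total divided by the iteration count, and the last column is Total as a share of the baseline's Total. This is Python for illustration only, not the actual benchmark code; the names measure and report_row are made up, and what the baseline workload does is not stated in the thread, so it is simply passed in as a number.

```python
import time
from typing import Callable


def measure(convert: Callable[[str], str], text: str, iterations: int = 3) -> float:
    """Return the total wall-clock time in milliseconds for `iterations` runs."""
    start = time.perf_counter()
    for _ in range(iterations):
        convert(text)
    return (time.perf_counter() - start) * 1000.0


def report_row(name: str, total_ms: float, baseline_ms: float, iterations: int = 3) -> str:
    """Format one table row: total, per-iteration time, share of the baseline."""
    each = total_ms / iterations
    share = round(100.0 * total_ms / baseline_ms)
    return f"{name:<22}{total_ms:>8.0f}{each:>8.0f}{share:>5d}%"


# Re-deriving two rows of the table above from their Total values.
print(report_row("CommonMark.NET", 1601, 17329))
print(report_row("MoonShine (sundown)", 830, 17329))
```

Dividing by the baseline is what makes results from different machines roughly comparable: the absolute times change with the hardware, but the ratio to a fixed reference workload should move much less.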