Encoding ambiguities in CommonMark


I’m developing a syntax plugin for Markdown (which I’m currently switching to CommonMark) that takes a very limited approach to parsing.

My major concern is that some terms are not really clarified.

  • In section 2, you say

    This spec does not specify an encoding

    Nonetheless, later in the document, you specify several things with respect to ASCII.

  • In section 2, you say

    Line endings are replaced by newline characters (LF).

    I have not seriously worked with Windows for years, but is this really clean? Shouldn’t the exact encoding rather be left to the implementation?!

  • In section 6.1, you say

    Any ASCII punctuation character may be backslash-escaped:

    What is an “ASCII punctuation character”? Is it what is noted below, in example 207, or are there more? And what about other punctuation characters from some other codespaces?

  • In section 6.7, you say

    … followed by zero or more characters other than ASCII whitespace and
    control characters, <, and >.

    What are control characters? ASCII 0-31?!
    Why not simply reference RFC 3986, which is the definition of a URI?

  • In section 6.7, you specify a range of schemes to be recognized. Wouldn’t it be easier and cleaner to just reference the IANA range which you obviously took anyway?
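To make the section 6.7 question concrete, here is a minimal Python sketch of one possible reading of that sentence. It assumes that "control characters" means U+0000 to U+001F plus U+007F, which the quoted text does not confirm, and the names `URI_TAIL` and `is_valid_uri_tail` are made up for illustration.

```python
import re

# ASSUMPTION: "control characters" = U+0000-U+001F and U+007F.
# ASCII whitespace (space, tab, LF, VT, FF, CR) is then covered by
# the range \x00-\x20, so one negated class expresses the whole rule.
URI_TAIL = re.compile(r"[^\x00-\x20\x7f<>]*")


def is_valid_uri_tail(text: str) -> bool:
    """True if text has no ASCII whitespace, control character, '<' or '>'."""
    return URI_TAIL.fullmatch(text) is not None
```

Under that reading, `is_valid_uri_tail("//example.com/a?b=c")` holds, while any string containing a space, a tab, `<` or `>` is rejected. A different definition of "control characters" would only change the character class.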

Don’t get me wrong, I don’t want to nitpick unimportant things and I appreciate a more formal definition of Mark* than before. But since I have to work completely with regular expressions, these ambiguities need to be clarified to make CommonMark really be future-compatible.

Furthermore, a formal definition would be far more useful than a purely verbal one: e.g., a regex that clarifies what you actually specify, or a grammar or the like (I don’t know, I’m no computer scientist). Section 6.4 is far too long for what you want to specify! This also partially coincides with the thread by roop.
Of course you cannot specify everything with a regex, but if the rest is unambiguous, then some regexes would really help in understanding the text, especially the ones for list items and blockquotes.
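As an illustration of the kind of regex meant here, the backslash-escape rule from section 6.1 could be sketched in Python as follows. This assumes "ASCII punctuation character" means the 32 non-alphanumeric printable ASCII characters (the same set as Python's `string.punctuation`); `ESCAPE` and `unescape` are hypothetical names, not part of any CommonMark implementation.

```python
import re

# ASSUMPTION: ASCII punctuation = the four ranges ! to /, : to @,
# [ to ` and { to ~ (32 characters in total).
ESCAPE = re.compile(r"\\([!-/:-@\[-`{-~])")


def unescape(text: str) -> str:
    """Drop the backslash in front of every escaped ASCII punctuation character."""
    return ESCAPE.sub(r"\1", text)
```

A backslash in front of any other character is left alone, so `unescape(r"\*not emphasis\*")` gives `*not emphasis*`, while `unescape(r"\a")` is returned unchanged.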

PS: This Discourse does not obey CommonMark. Newlines are treated as newlines, and paragraphing in list items does not work as intended, as can be seen in this text.


Line endings are not encoding. That Windows (stubbornly) uses a different line ending (CRLF) does not change the encoding. Both are valid in ASCII, UTF-8, etc.

If you google that, the first hit is “ASCII Punctuation and Number Characters”. That is the complete list.

It’s never claimed to.

Have you checked out the reference implementation? It has most of what you ask for: https://github.com/jgm/stmd/blob/master/js/stmd.js


Ok, then it’s not encoding. It’s still unnecessarily annoying to do this. No matter what Windows “stubbornly” does, it renders the output of the proposed CommonMark renderers unreadable for Windows users, who are still by far the majority.
Or are there any specific reasons why LF vs. CRLF must be in the standard which would make a difference except for the compliance checks?

So Kerry Redshaw from Brisbane, Queensland, Australia is the reference for what ASCII punctuation is? Maybe that should then be noted in the standard.

When you open a new text field, it proposes to use “Markdown or BBCode”.

Yes, I did. This is why I’m asking whether this couldn’t be part of the standard, which is not formally written anyway. Providing a few lines of regexes is different from pointing to 1.5k lines of code with regexes in it.



Of course there is. Anyone implementing markdown needs to know what to expect as output. Otherwise, markdown implementations are not interchangeable.

So if you make software catering to Windows users (or the web, for that matter), you know what to do before outputting the markdown to the user: namely, add some CRs.
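A minimal Python sketch of that round trip, assuming the renderer works on LF internally and the conversion happens only at the output boundary (`normalize_newlines` and `to_crlf` are illustrative names, not functions of any CommonMark implementation):

```python
def normalize_newlines(text: str) -> str:
    """Replace CRLF and lone CR line endings with LF."""
    # CRLF must be handled first, otherwise it would become two LFs.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def to_crlf(text: str) -> str:
    """Convert any mix of line endings to CRLF, e.g. for Windows users."""
    # Normalizing first avoids turning an existing CRLF into CR CR LF.
    return normalize_newlines(text).replace("\n", "\r\n")
```

With this split, the output that compliance tests compare stays LF-only, and `to_crlf` is applied just before the text is handed to the user.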

Well, it’s not like ASCII is some sort of obscure unheard-of standard that is not documented anywhere else.

Yeah, so it does not claim to adhere to CommonMark. As rwzy linked, there are good reasons for that.

No one is saying it couldn’t, and since this is an open source project, feel free to add that and submit a pull request.

RFC 5234 does not define “punctuation characters”, but I take this to mean the set difference VCHAR \ (ALPHA ∪ DIGIT). (Whitespace is not considered visible.)
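That set difference is easy to check in Python, using the RFC 5234 core rules VCHAR = %x21-7E, ALPHA = %x41-5A / %x61-7A and DIGIT = %x30-39. The variable names below simply mirror those rule names.

```python
import string

VCHAR = {chr(c) for c in range(0x21, 0x7F)}  # %x21-7E, visible ASCII
ALPHA = set(string.ascii_letters)            # A-Z and a-z
DIGIT = set(string.digits)                   # 0-9

# Punctuation under the reading above: visible, but neither letter nor digit.
PUNCT = VCHAR - (ALPHA | DIGIT)

print("".join(sorted(PUNCT)))
```

The result has 94 - 52 - 10 = 32 members and coincides with Python's `string.punctuation`.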