Extension terminology and rules


#1

Continuing the discussion from Guide for syntax extensions:

I wholeheartedly agree with @jroper’s post. Sadly, most replies last year talked about an expressive generic syntax instead, which hasn’t been implemented anywhere before and for the most part just mimics HTML.
I believe we should agree on a common nomenclature and some basic rules and identify innate extension points. Here are some thoughts and ideas of mine.

Terminology

Even now, hardly any Markdown implementation limits itself to the syntax and semantics as specified by CommonMark 1.0. Most parse supersets thereof, and their grammars are referred to as flavors, variants, dialects etc., or they are described as a set of extensions to the interoperable core. Sometimes, e.g. in the preparation of the CM spec, a preprocessor converts additional idiosyncratic syntax (often specific to a site or project) to standard CM/MD or HTML.

When a parser relaxes a rigid syntax, e.g. through laxer white-space handling, this is commonly known as (syntactic) sugar, but the term is also applied to minor, though often backward-incompatible, extensions like [Shortcut Links] without trailing brackets. It also covers invisible extras like the automatic generation of identifiers (IDs) for certain HTML elements, mostly headings, based upon their textual content.
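
To make the "invisible extras" concrete, here is a rough Python sketch of how an ID could be derived from heading text. The function name `heading_id`, the slug rules and the numeric suffix for duplicates are my own assumptions for illustration; existing implementations differ in the details.

```python
import re
import unicodedata
from typing import Dict, Optional


def heading_id(text: str, seen: Optional[Dict[str, int]] = None) -> str:
    """Derive an identifier from heading text (one possible scheme, not a standard)."""
    # Decompose accented characters and lowercase everything.
    slug = unicodedata.normalize("NFKD", text).lower()
    # Drop everything that is not a word character, whitespace or hyphen.
    slug = re.sub(r"[^\w\s-]", "", slug)
    # Collapse runs of whitespace, underscores and hyphens into one hyphen.
    slug = re.sub(r"[\s_-]+", "-", slug).strip("-")
    # Optionally disambiguate repeated headings with a numeric suffix.
    if seen is not None:
        count = seen.get(slug, 0)
        seen[slug] = count + 1
        if count:
            slug = f"{slug}-{count}"
    return slug
```

Passing the same `seen` dictionary for every heading of a document keeps the generated IDs unique within that document.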

A group of related extensions can form a module. Modules should be harmonized with each other, but special interest modules may be mutually incompatible. Profiles combine a number of required, optional and forbidden modules to suit a specific domain or use-case.
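
As a sketch of what a profile could look like as data, assuming modules are identified by plain strings: the class `Profile`, its `check` method and the module names used in the usage note are hypothetical and only serve to illustrate the required/optional/forbidden split.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Profile:
    """A named combination of required, optional and forbidden modules."""
    name: str
    required: FrozenSet[str] = frozenset()
    optional: FrozenSet[str] = frozenset()
    forbidden: FrozenSet[str] = frozenset()

    def check(self, enabled: Iterable[str]) -> List[str]:
        """Return the problems found for a set of enabled modules."""
        enabled = set(enabled)
        problems = []
        # Every required module has to be enabled.
        for module in sorted(self.required - enabled):
            problems.append(f"missing required module: {module}")
        # No forbidden module may be enabled.
        for module in sorted(self.forbidden & enabled):
            problems.append(f"forbidden module enabled: {module}")
        return problems
```

For example, `Profile("docs", required=frozenset({"tables"}), forbidden=frozenset({"raw-html"})).check({"raw-html"})` would report both the missing and the forbidden module, while optional modules never cause a complaint.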

Some extensions are already deployed and supported by default in several interoperable implementations; these are add-ons. Those implemented but off by default are options, and those incompatible among implementations are plug-ins. Others are just proposed drafts without noteworthy implementation.

Rules

  • Lines have optional indentation, followed by a single optional alphanumeric attribute to the optional and nestable line prefix. The prefix must be followed by whitespace before the line content, which may in turn be followed by whitespace and the prefix repeated as a line suffix. (Some existing prefixes may deviate from this.)
  • Phrasal affixes (i.e. prefix or postfix or both) never have a space between themselves and their content (inside), but they always require non-alphanumeric characters (or nothing) on the other side (outside).
  • Phrasal prefix and postfix have the same shape but may be mirrored or rotated images of each other (i.e. brackets), unless there is strong precedent (e.g. SGML entity references with ampersand start and semicolon end).
  • A phrasal postfix (or prefix) may be declared optional to apply to a single alphanumeric word, but there must never be a semantic difference to the double-affix variant.
  • New syntax should fall back gracefully:
    • Content must never disappear depending on parser used.
    • Markup characters must not alienate readers if displayed verbatim.
  • New standardized markup should follow established practice in plain-text media first and in existing implementations second.
  • A sequence of the same phrasal affix twice or more should be treated as …?…
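
The phrasal affix rule (no space on the inside, no alphanumeric character on the outside) can be approximated with a regular expression. This is only a minimal sketch: the helper name `phrasal_pattern` is made up, "alphanumeric" is taken to mean ASCII letters and digits, and escaping or nesting of affixes is not handled.

```python
import re
from typing import Optional


def phrasal_pattern(prefix: str, postfix: Optional[str] = None) -> "re.Pattern":
    """Build a regex for content wrapped in a phrasal prefix and postfix."""
    # By default the postfix has the same shape as the prefix.
    postfix = prefix if postfix is None else postfix
    p, q = re.escape(prefix), re.escape(postfix)
    return re.compile(
        rf"(?<![A-Za-z0-9]){p}"   # outside of the prefix: not alphanumeric
        rf"(?=\S)(.+?)(?<=\S)"    # inside: content neither starts nor ends with a space
        rf"{q}(?![A-Za-z0-9])"    # outside of the postfix: not alphanumeric
    )
```

With `star = phrasal_pattern("*")`, the text `a *b c* d` yields the content `b c`, whereas `2*3*4` and `a * b * c` do not match at all. Mirrored affixes work the same way, e.g. `phrasal_pattern("{", "}")`.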

Patterns open to extension

  • The link and embed syntax can be extended in several ways.

    • Currently: [link text](URL "optional title") or [link text][reference] with preceding ! for embedding.
    • The exclamation mark can be substituted by another punctuation mark. Implementations that do not understand its meaning should render a normal hyperlink instead (and they all do).
    • The title in parenthesized links may be followed by other optional attributes, e.g. image dimensions. (Degrades awfully.)
    • The alt text may contain additional markup or information in a predefined format.
    • There can be additional meaningful markup in the definition lines of reference links.
    • Invalid single-character URLs (e.g. :, ?, #, %) may be used as markers for special treatment. (Does not work well with embeds.)
    • Pseudo-protocols may be parsed for special effects. (Discount supports abbr: etc.)
  • Every visible (= printable) non-alphanumeric character from US-ASCII (= Basic Latin block in Unicode) should be considered potentially active, i.e. it can be either a line prefix or a phrasal affix.

  • Lines consisting of a single non-alphanumeric ASCII character repeated at least three times may have structural meaning.

    • With intervening whitespace there must be nothing else in the line and it is considered some kind of separator.
    • Otherwise the line may be a fence and parameters may follow that will not appear in output. These apply to the following block.
    • More heading levels may get a Setext-like underline.
  • Numbered list items could get more valid parameters, i.e. formats.
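
The distinction between separator lines and fence lines could be sketched as follows. The function name `classify_line` and the returned labels are assumptions of mine. A bare run without parameters is reported as a fence here, although a parser with more context might just as well treat it as a Setext-like underline.

```python
import string
from typing import Optional, Tuple


def classify_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (kind, character, parameters) for structural lines, else None."""
    stripped = line.strip()
    if not stripped:
        return None
    ch = stripped[0]
    # Only non-alphanumeric printable ASCII characters qualify.
    if ch not in string.punctuation:
        return None
    compact = "".join(stripped.split())
    if len(compact) >= 3 and set(compact) == {ch}:
        if compact != stripped:
            # Repeated character with intervening whitespace and nothing else.
            return ("separator", ch, "")
        # Contiguous run without parameters (fence, or possibly an underline).
        return ("fence", ch, "")
    # Contiguous run of three or more, followed by parameters.
    run = len(stripped) - len(stripped.lstrip(ch))
    if run >= 3:
        return ("fence", ch, stripped[run:].strip())
    return None
```

For instance, `* * *` is classified as a separator, `~~~ python` as a fence with the parameter `python`, and `** bold` is not structural at all.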


#2

@jgm Can we get a separate repository under https://github.com/commonmark/ that would contain

  1. a spec to define these terms and
  2. specs for actual extensions (and modules and flavors) with embedded test cases like the main spec?

Chronological overview of existing meta threads


#3

Just for the record, I’m keeping my documentation in the branch extensions in my fork of the CommonMark spec repository for now.