The following is my attempt to come up with a set of rules for CommonMark tables,
- defining the syntax of pertaining CommonMark “table blocks”, and
- describing the transformation of these blocks into a generic table model,
which can be mapped into the output document type after parsing.
EDIT 2015-11-08T13:15+01: I have put the proposed specification extension as a “free-standing” HTML document here. This is the current version as of 2015-11-08, ie the same text which follows below in this posting.
If needed, I will add and link to updated revisions of the spec there as well (the Subversion-generated time stamp right at the bottom of the document should make clear what revision you are looking at).
I think the rules are simple enough to understand and apply (at least I hope so; the wording could surely be improved a bit), as well as reasonably easy to implement, and they allow for both “nice” CommonMark text and more “terse” writing styles.
I think this “CommonMark tables” specification could provide a foundation for an actual extension of the CommonMark specification, and comments are highly appreciated so we can move forward to (finally) bring tables into the CommonMark spec.
The table specification here assumes that the table element in the target document type
- has a distinct, but optional, “table header” element; and
- encompasses the special case of a “degenerate” table consisting of a single cell only (ie no table header, only one row and column); and
- allows block content (paragraphs, lists, etc) inside the table data cells.
This is in fact the case for the W3C HTML 4.01 <table> content model:
<!ELEMENT (TH|TD) - O (%flow;)* -- table header cell, table data cell-->
<!ENTITY % flow "%block; | %inline;">
and for the ISO/IEC 15445:2000 HTML <table> element type:
<!ELEMENT (TH|TD) - O %table.content; >
<!ENTITY % table.content "(%block; | %text;)*" >
<!ELEMENT entry %a-whole-lot-of-stuff; >
And even the <tbl> element of the sample DTD for “general documents” in Annex E of ISO 8879:1986 could be used as a target, as each cell can contain paragraphs and lists etc:
<!ELEMENT c 0 0 %m.pseq; -- Cell in body row -->
<!ENTITY % m.pseq "(p, ((%s.p.d;)|(%ps.zz;))*)" -- Paragraph sequence -->
So there should be no problem mapping our (simplistic) CommonMark table into the desired target document type (the same should hold for LaTeX, RTF, etc).
A table is obviously a container block, bearing some similarity with block quotes, thus the language used here paraphrases the block quote description of the CommonMark specification in some places—and a natural place in the specification would be as a new sub-section 5.4, at the end of the section describing “container blocks”.
The rules are intended to be general enough to allow the “abuse” of tables for things like poetry verses (see examples below):
| And what I really want to know is this:
are things getting better
or are they getting worse?
would (or rather: should) be transformed into an HTML table (for example) like this:
<table><tbody>
<tr><td>And what I really want to know is this:<br>
are things getting better<br>
or are they getting worse?<br></td></tr>
</tbody></table>
This use (or misuse?) of tables was discussed here, and gave in fact the impetus for this proposal.
There is obvious room for enhancement: indicating the horizontal (and vertical?) alignment of columns (or rows, or even single cells?) is certainly useful, and PHP Markdown Extra uses colons in “table rules” for this:
|:-----|:------:|------:|
| left | center | right |
However, I feel it would be premature to include something like this here, as there are too many open questions: should we also allow a similar syntax to specify alignment of section headings?:
##: Centered heading text :##
If not, why not? How would such alignment prescriptions map into target documents? What about vertical alignment?
A more urgent extension IMO would be a syntax to specify column spans and row spans, ie cells which extend over multiple adjacent columns and/or rows in the output table.
As far as I can tell, the syntax rules given here also encompass the basic PHP Markdown Extra syntax for tables implemented for example in the discount parser by David Parsons (but there is no syntax for alignment and row spans or column spans defined in this CommonMark proposal yet); but the semantics are—intentionally—slightly different:
This block in PHP Markdown Extra
First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
Content Cell  | Content Cell
generates a table with a <thead> containing the first row given; the proposed CommonMark specification here would write
First Header  | Second Header
============= | =============
Content Cell  | Content Cell
Content Cell  | Content Cell
to achieve the same result. The above example using “-” places all rows into the <tbody> element, which is the same element structure one gets from
------------- | -------------
First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
------------- | -------------
Content Cell  | Content Cell
------------- | -------------
(only visually different—depending on the output document format and renderer) or even
| First Header | Second Header
Content Cell | Content Cell
Content Cell | Content Cell
The table marker “|” VERTICAL LINE U+007C is used to designate a block of lines as a table, and to delimit CommonMark text destined for different columns in this table. Each row of the table consists of a sequence of table data cells.
5.4.1 Table block
A block of lines without intervening blank lines where
- the first line starts with 0 to 3 spaces of initial indent, followed by the character “|”, or
- any line contains (other than white space) only “|”, and either
  - zero or more “-” HYPHEN-MINUS U+002D, or
  - zero or more “=” EQUALS SIGN U+003D,
is transformed into a table in the output document. [OR: “… is a table” / “specifies a table” ?]
A line having the second form is called a table rule.
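To make the two conditions concrete, here is a minimal Python sketch of how a parser might recognise table rules and table blocks. The function names `is_table_rule` and `is_table_block` are purely illustrative and not part of the proposal, and the sketch assumes that a table rule must contain at least one “|” (the wording above does not say this explicitly).

```python
def is_table_rule(line: str) -> bool:
    """Return True if the line holds only "|" plus either "-" or "=" (and white space)."""
    chars = set(line) - {" ", "\t"}
    if "|" not in chars:  # assumption: a rule needs at least one "|"
        return False
    # Either hyphens or equals signs may accompany the "|", but not both kinds.
    return chars <= {"|", "-"} or chars <= {"|", "="}


def is_table_block(lines: list[str]) -> bool:
    """Return True if a run of lines (no blank lines) qualifies as a table block."""
    if not lines or any(not ln.strip() for ln in lines):
        return False
    first = lines[0]
    indent = len(first) - len(first.lstrip(" "))
    # First form: 0 to 3 spaces of indent, then "|".
    starts_with_bar = indent <= 3 and first.lstrip(" ").startswith("|")
    # Second form: any line in the block is a table rule.
    return starts_with_bar or any(is_table_rule(ln) for ln in lines)
```

Under these assumptions a block such as `A1 | A2` / `---|----` / `B1 | B2` is accepted because of its rule line, while the same block without the rule line is not.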
5.4.2 Table columns
Content between “|” in lines which are not table rules is split and distributed into the output table columns from left to right; the leading and the trailing “|” characters are optional.
Leading and trailing white space between column content and the “|” is discarded in the output.
A table may have only one column.
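The column splitting described here can be sketched in a few lines of Python. The helper name `split_columns` is hypothetical, and the sketch makes two simplifying assumptions: it does not handle escaped “|” characters or “|” inside code spans, and it throws away the leading indentation that the row rules further below rely on.

```python
def split_columns(line: str) -> list[str]:
    """Split one non-rule line into its trimmed column contents."""
    text = line.strip()
    # The leading and the trailing "|" are optional, so drop them if present.
    if text.startswith("|"):
        text = text[1:]
    if text.endswith("|"):
        text = text[:-1]
    # White space between column content and "|" is discarded.
    return [cell.strip() for cell in text.split("|")]
```

For instance, `split_columns("| A1 | A2")` and `split_columns("A1 | A2")` both give the two column fragments `A1` and `A2`.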
5.4.3 Table rows
The CommonMark text for each column is split into table data cells by introducing row breaks across all columns, so that each column has the same number of cells:
A table rule introduces a row break across all columns
- in a multi-column table, or
- in a single-column table if it contains “-” or “=”,
except that leading and trailing table rules are ignored (they are only there to make the CommonMark typescript nicer).
A table rule containing “=” separates the table header row from the table body, if there is only one (non-rule) line above it; otherwise it is treated like a table rule with “-”.
A table rule containing “-” introduces a “visible” break between table rows in a multi-column table (depending on the output document type and renderer).
A table rule without “=” or “-” introduces a “normal” break between table rows in a multi-column table.
Other “row breaks” are introduced only if there is more than one column (in any one of the lines in the table block), and only if each of the plain text fragments in all columns allows it simultaneously:
A blank line in a column allows row breaks above and below it.
List items, code blocks, and block quotes inside a column are kept together in the output table data cell, and allow row breaks only above and below the whole block.
Compact lists (containing no blank lines) are kept together, including a paragraph that precedes the list (without a blank line in between); no row break inside such a list is allowed.
Paragraphs are kept together in a table data cell only if the second and following lines are indented by 0 to 3 spaces.
Consecutive “regular” lines in the plain text columns with the same indent do allow row breaks above or below each of the lines.
Content in table data cells emanating from multiple lines in the CommonMark source will be separated by “hard line breaks” in the output table.
The minimal table has only one cell:
| Hi!
Accordingly, it produces a table containing a single table data cell:
<table><tbody>
<tr><td>Hi!</td></tr>
</tbody></table>
Because the HTML <td> can contain both block and inline elements, the <td> has just character data content in this case.
Single-column tables are not split into cells automatically, so this example
| Hello,
| there!
would reproduce the line break, but result in only one table data cell, too:
<table><tbody>
<tr><td>Hello,<br>there!</td></tr>
</tbody></table>
Because line breaks are taken “literally”, and paragraph structure is preserved in single-column tables, this can be used to format, for example, verses and lyrics:
And what I really want to know is this:
are things getting better
or are they getting worse?
|
Can we start all over again?
Note that in this example, the line containing just the “|” suffices to mark this block of lines as a table: it is a table rule line, but lacking “=” or “-” it will not introduce a new row into a single-column table.
The table generated from this example has a single table data cell as well, but this time it contains two paragraphs (similar to a block quote containing a blank line):
<table><tbody>
<tr><td><p>And what I really want to know is this:<br>
are things getting better<br>
or are they getting worse?</p>
<p>Can we start all over again?</p></td></tr>
</tbody></table>
This could be written in a more elaborate style, but would produce the exact same result:
| And what I really want to know is this: |
| are things getting better               |
| or are they getting worse?              |
|                                         |
| Can we start all over again?            |
A single-column table can be broken into cells explicitly, using table rule lines containing “-” or “=”:
| One
|--
| Two
| and
|--
| Three
Now we get three table body rows, each with one table data cell, but still no table header row:
<table><tbody>
<tr><td>One</td></tr>
<tr><td>Two<br>and</td></tr>
<tr><td>Three</td></tr>
</tbody></table>
To produce a table heading row (a <thead> element in HTML), one has to use a table rule line with “=”:
| One
|=====
| Two
| and
|-----
| Three
Now the first line ends up as the content of the (single-cell) table heading row:
<table><thead>
<tr><td>One</td></tr></thead>
<tbody><tr><td>Two<br>and</td></tr>
<tr><td>Three</td></tr>
</tbody></table>
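The single-column behaviour walked through in these examples could be sketched roughly as follows. All names (`is_rule`, `cell_text`, `single_column_rows`) are made up for illustration. The sketch only groups source lines into a header cell and body cells; it ignores paragraph structure (a bare “|” rule is simply skipped) and does no rendering.

```python
def is_rule(line):
    """A table rule: only "|" with either "-" or "=" (assumes at least one "|")."""
    chars = set(line) - {" ", "\t"}
    return "|" in chars and (chars <= {"|", "-"} or chars <= {"|", "="})


def cell_text(line):
    """Strip the optional leading/trailing "|" and surrounding white space."""
    text = line.strip()
    if text.startswith("|"):
        text = text[1:]
    if text.endswith("|"):
        text = text[:-1]
    return text.strip()


def single_column_rows(lines):
    """Return (header, body_rows); each cell is a list of source lines."""
    header, rows, cell = None, [], []
    for line in lines:
        if not is_rule(line):
            cell.append(cell_text(line))
            continue
        if "=" not in line and "-" not in line:
            continue  # bare "|" rule: no row break in a single-column table
        if not cell:
            continue  # leading rule (nothing collected yet): ignored
        # "=" marks a header only if exactly one non-rule line precedes it.
        if "=" in line and header is None and not rows and len(cell) == 1:
            header = cell
        else:
            rows.append(cell)
        cell = []
    if cell:  # a trailing rule leaves no open cell, so it is ignored as well
        rows.append(cell)
    return header, rows
```

Applied to the last example, this yields the header cell `["One"]` and the body cells `["Two", "and"]` and `["Three"]`; a renderer would then join the lines of each cell with hard line breaks.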
Multi-column tables are usually split into cells line by line:
| A1 | A2
| B1 | B2
which can be written somewhat terser as:
| A1 | A2
B1 | B2
or in the equivalent syntax (using a table rule line):
---|----
A1 | A2
B1 | B2

A1 | A2
---|----
B1 | B2
They all produce the same element structure information:
<table><tbody>
<tr><td>A1</td><td>A2</td></tr>
<tr><td>B1</td><td>B2</td></tr>
</tbody></table>
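For this simplest multi-column case (only “regular” lines, one row per line), a rough Python sketch might look like the one below. The function name `simple_table_html` is hypothetical. The sketch deliberately leaves out header rows, lists, and joined lines, and padding short rows with empty cells is my own assumption, not something the rules above prescribe.

```python
from html import escape


def simple_table_html(lines):
    """Render a multi-column table block with one row per non-rule line."""
    rows = []
    for line in lines:
        chars = set(line) - {" ", "\t"}
        if "|" in chars and (chars <= {"|", "-"} or chars <= {"|", "="}):
            continue  # table rule: adds nothing here, rows already break per line
        text = line.strip()
        if text.startswith("|"):
            text = text[1:]
        if text.endswith("|"):
            text = text[:-1]
        rows.append([cell.strip() for cell in text.split("|")])
    width = max((len(r) for r in rows), default=0)
    body = "".join(
        "<tr>"
        # Assumption: rows with fewer cells are padded with empty cells.
        + "".join(f"<td>{escape(c)}</td>" for c in r + [""] * (width - len(r)))
        + "</tr>"
        for r in rows
    )
    return f"<table><tbody>{body}</tbody></table>"
```

All four input variants shown above go through this function to the same element structure, which is the point the examples are making.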
Lists and blockquotes will not be split into adjacent cells in different rows: for example
------|-----
- A1a | A2a
- A1b | A2b
- B1  | B2
will only produce one row, because the left column contains a single unordered list.
But the first column here
------|-----
- A1a | A2a
- A1b | A2b
B1    | B2
has only a two-item list, followed by a new paragraph: this allows a row break below - A1b and above B1, and below A2b anyway, and we get:
<table><tbody>
<tr><td><ul><li>A1a</li><li>A1b</li></ul></td>
<td>A2a<br>A2b</td></tr>
<tr><td>B1</td><td>B2</td></tr>
</tbody></table>
Note that the “line break” between A2a and A2b is reproduced again using a <br> in the upper-right table data cell.
Without the unordered list, we would get three rows out of the “regular” table
------|-----
A1a   | A2a
A1b   | A2b
B1    | B2
To keep “lines” in a “paragraph” (of the content fragments in a column) together, one can indent the following lines a bit (using 0 to 3 spaces, relative to the preceding “|” or line start). This “joins” the first line of a paragraph with the subsequent lines, and prevents row breaks from being inserted:
------|-----
A1a   | A2a
 A1b  | A2b
B1    | B2
will “join” the A1a and A1b in the left column, and similarly
------|-----
A1a   | A2a
A1b   |  A2b
B1    | B2
would “join” the A2a and A2b in the right column.
Both of these table blocks transform into the exact same element structure:
<table><tbody>
<tr><td>A1a<br>A1b</td>
<td>A2a<br>A2b</td></tr>
<tr><td>B1</td><td>B2</td></tr>
</tbody></table>
Here the line breaks in both cells in the upper row are reproduced, and the resulting table structure shows a close similarity to the one in the example above using an unordered list—as do the CommonMark input texts for both examples.