RFC Spec extension for *tables*: Syntax and transformation rules

#1

The following is my attempt to come up with a set of rules for CommonMark tables,

  1. defining the syntax of pertaining CommonMark “table blocks”, and
  2. describing the transformation of these blocks into a generic table model,

which can be mapped into the output document type after parsing.

EDIT 2015-11-08T13:15+01: I have put the proposed specification extension as a “free-standing” HTML document here. This is the current version as of 2015-11-08, ie the same text which follows below in this posting.

If needed, I will add and link to updated revisions of the spec there as well (the Subversion-generated time stamp right at the bottom of the document should make clear what revision you are looking at).

          —tin-pot

I think the rules are simple enough to understand and apply (at least I hope so; the wording could surely be improved a bit), as well as reasonably easy to implement, and they allow for both “nice” CommonMark text and more “terse” writing styles.

I think this “CommonMark tables” specification could provide a foundation for an actual extension of the CommonMark specification, and comments are highly appreciated so we can move forward to (finally) bring tables into the CommonMark spec.


Table model

The table specification here assumes that the table element in the target document type

  1. has a distinct, but optional, “table header” element; and

  2. encompasses the special case of a “degenerate” table consisting of a
    single cell only (ie no table header, only one row and column); and

  3. allows block content (paragraphs, lists, etc) inside the table data cells.

This is in fact the case for the W3C HTML 4.01 <table> content model:

<!ELEMENT (TH|TD)  - O (%flow;)*  -- table header cell, table data cell-->
<!ENTITY % flow "%block; | %inline;">

and for the ISO/IEC 15445:2000 HTML <table> element type:

<!ELEMENT (TH|TD)     - O  %table.content; >
<!ENTITY % table.content   "(%block; | %text;)*" >

and for the DocBook 3.1 CALS <Table> as well, where the table data cell is the <entry> element, whose content model is too complex to quote here:

<!ELEMENT entry %a-whole-lot-of-stuff; >

and in the DocBook 5.1 CALS table too (but DocBook 5 also has a HTML table, compatible with XHTML).

And even the <tbl> element of the sample DTD for “general documents” in Annex E of ISO 8879:1986 could be used as a target, as each cell can contain paragraphs and lists etc:

<!ELEMENT  c   0 0  %m.pseq; -- Cell in body row -->
<!ENTITY % m.pseq  "(p, ((%s.p.d;)|(%ps.zz;))*)" -- Paragraph sequence -->

So there should be no problem mapping our (simplistic) CommonMark table into the desired target document type (the same should hold for LaTeX, RTF, etc).
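
As a rough illustration only, the generic table model assumed above could be held in a structure like the following Python sketch. The class and field names (`Table`, `Row`, `Cell`, `blocks`) are hypothetical and not part of the proposal, and block content is represented as plain strings for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cell:
    # Block-level content of the cell (paragraphs, lists, ...), as strings here.
    blocks: List[str] = field(default_factory=list)


@dataclass
class Row:
    cells: List[Cell] = field(default_factory=list)


@dataclass
class Table:
    # The table header is distinct but optional.
    header: Optional[Row] = None
    body: List[Row] = field(default_factory=list)


# The "degenerate" case: no header, one row, one column.
degenerate = Table(body=[Row([Cell(["Hi!"])])])
```

A back end for HTML, DocBook, or LaTeX would then only need to walk this structure, independent of the input syntax.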

Overview

A table is obviously a container block, bearing some similarity with block quotes, thus the language used here paraphrases the block quote description of the CommonMark specification in some places—and a natural place in the specification would be as a new sub-section 5.4, at the end of the section describing “container blocks”.

The rules are intended to be general enough to allow the “abuse” of tables for things like poetry verses (see examples below):

|
And what I really want to know is this:
are things getting better
or are they getting worse?

would (or rather: should) be transformed into an HTML table (for example) like this:

<table><tbody>
<tr><td>And what I really want to know is this:<br>
are things getting better<br>
or are they getting worse?<br></td></tr>
</tbody></table>

This use (or misuse?) of tables was discussed here, and gave in fact the impetus for this proposal.

There is obvious room for enhancement: indicating the horizontal (and vertical?) alignment of columns (or rows, or even single cells?) is certainly useful, and PHP Markdown Extra uses colons in “table rules” for this:

|:-----|:------:|------:|
| left | center | right |

However, I feel it would be premature to include something like this here, as there are too many open questions: should we also allow a similar syntax to specify alignment of section headings?:

##: Centered heading text :##

If not, why not? How would such alignment prescriptions map into target documents? What about vertical alignment?

A more urgent extension IMO would be a syntax to specify column spans and row spans, ie cells which extend over multiple adjacent columns and/or rows in the output table.

Prior art

As far as I can tell, the syntax rules given here also encompass the basic PHP Markdown Extra syntax for tables implemented for example in the discount parser by David Parsons (but there is no syntax for alignment and row spans or column spans defined in this CommonMark proposal yet); but the semantics are—intentionally—slightly different:

This block in PHP Markdown Extra

First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
Content Cell  | Content Cell

generates a table with a <thead> containing the first row given; the proposed CommonMark specification here would write

First Header  | Second Header
============= | =============
Content Cell  | Content Cell
Content Cell  | Content Cell

to achieve the same result. The above example using “-” places all rows into the <tbody> element, which is the same element structure one gets from

------------- | -------------
First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
------------- | -------------
Content Cell  | Content Cell
------------- | -------------

(only visually different—depending on the output document format and renderer) or even

| First Header | Second Header
Content Cell | Content Cell
Content Cell | Content Cell

Proposed specification

5.4 Tables

The table marker “|” VERTICAL LINE U+007C is used to designate a block of lines as a table, and to delimit CommonMark text destined for different columns in this table. Each row of the table consists of a sequence of table data cells.

5.4.1 Table block

A block of lines without intervening blank lines where

  • the first line starts with 0 to 3 spaces of initial indent, followed
    by the character “|”; and/or

  • any line contains (other than white space) only “|”, and either
    zero or more “-” HYPHEN-MINUS U+002D, or zero or more “=” EQUALS
    SIGN U+003D;

is transformed into a table in the output document. [OR: “… is a table” / “specifies a table” ?]

A line having the second form is called a table rule.
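
A minimal sketch of these two recognition rules in Python follows. The function names `is_table_rule` and `is_table_block` are made up for illustration, and the sketch assumes that a table rule must contain at least one “|” and that the caller has already cut the input into blocks at blank lines.

```python
def is_table_rule(line: str) -> bool:
    """True if the line holds only "|" plus either "-" or "=" (never both)."""
    chars = set(line) - set(" \t")
    if "|" not in chars:
        return False  # assumption: a rule needs at least one "|"
    rest = chars - {"|"}
    return rest in (set(), {"-"}, {"="})


def is_table_block(lines: list[str]) -> bool:
    """True if a block of lines (no blank lines inside) is a table block."""
    if not lines or any(not ln.strip() for ln in lines):
        return False
    first = lines[0]
    indent = len(first) - len(first.lstrip(" "))
    # First form: 0 to 3 spaces of indent, then the table marker.
    starts_with_marker = indent <= 3 and first.lstrip(" ").startswith("|")
    # Second form: any line of the block is a table rule.
    return starts_with_marker or any(is_table_rule(ln) for ln in lines)
```

Note that the second form can only be checked once the whole block has been read, since the table rule may be its last line.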

5.4.2 Table columns

Content between “|” in lines which are not table rules is split and distributed into the output table columns from left to right; the leading and the trailing “|” characters are optional.

Leading and trailing white space between column content and the “|” is discarded in the output.

A table may have only one column.
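
The column split can be sketched in a few lines of Python. The name `split_cells` is hypothetical, and the sketch makes two simplifying assumptions: a “|” is never escaped or part of a code span, and the indentation of a fragment (which the row rules below rely on) is simply thrown away.

```python
def split_cells(line: str) -> list[str]:
    """Split one content line into column fragments, left to right."""
    text = line.strip()
    # The leading and the trailing "|" are both optional.
    if text.startswith("|"):
        text = text[1:]
    if text.endswith("|"):
        text = text[:-1]
    # White space between the column content and the "|" is discarded.
    return [cell.strip() for cell in text.split("|")]
```

For example, `split_cells("| left | center | right |")` and `split_cells("left | center | right")` give the same three fragments.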

5.4.3 Table rows

The CommonMark text for each column is split into table data cells by introducing row breaks across all columns, so that each column has the same number of cells:

  1. A table rule introduces a row break across all columns

    • in a multi-column table, or
    • in a single-column table if it contains “-” or “=”;

    except that leading and trailing table rules are ignored (they are only there to make the CommonMark typescript nicer).

    A table rule containing “=” separates the table header row from the table body, if there is only one (non-rule) line above it; otherwise it is treated like a table rule with “-”.

    A table rule containing “-” introduces a “visible” break between table rows in a multi-column table (depending on the output document type and renderer).

    A table rule without “=” or “-” introduces a “normal” break between table rows in a multi-column table.

  2. Other “row breaks” are introduced only if there is more than one column (in any one of the lines in the table block), and only if each of the plain text fragments in all columns allows it simultaneously:

    • A blank line in a column allows row breaks above and below it.

    • List items, code blocks, and block quotes inside a column are kept together in the output table data cell, and allow row breaks only above and below the whole block.

    • Compact lists (containing no blank lines) are kept together, including a paragraph that precedes the list (without a blank line in between); no row break inside such a list is allowed.

    • Paragraphs are kept together in a table data cell only if the second and following line are indented by 0 to 3 spaces.

    • Consecutive “regular” lines in the plain text columns with the same indent do allow row breaks above or below each of the lines.

  3. Content in table data cells emanating from multiple lines in the CommonMark source will be separated by “hard line breaks” in the output table.

5.4.4 Examples

The minimal table has only one cell:

|Hi!

Accordingly, it produces a table containing a single table data cell:

<table><tbody>
<tr><td>Hi!</td></tr>
</tbody></table>

Because the HTML <td> can contain both block and inline elements, the <td> has just character data content in this case.

Single-column tables are not split into cells automatically, so this example

|Hello,
|there!

would reproduce the line break, but result in only one table data cell, too:

<table><tbody>
<tr><td>Hello,<br>there!</td></tr>
</tbody></table>

Because line breaks are taken “literally”, and paragraph structure is preserved in single-column tables, this can be used to format, for example, verses and lyrics:

And what I really want to know is this:
are things getting better
or are they getting worse?
|
Can we start all over again?

Note that in this example, the line containing just the “|” suffices to mark this block of lines as a table: It is a table rule line, but lacking “=” or “-” it will not introduce a new row into a single-column table.

The table generated from this example has a single table data cell as well, but this time it contains two paragraphs (similar to a block quote containing a blank line):

<table><tbody>
<tr><td><p>And what I really want to know is this:<br>
are things getting better<br>
or are they getting worse?</p>
<p>Can we start all over again?</p></td></tr>
</tbody></table>

This could be written in a more elaborate style, but would produce the exact same result:

| And what I really want to know is this: |
| are things getting better               |
| or are they getting worse?              |
|                                         |
| Can we start all over again?            |

A single-column table can be broken into cells explicitly, using table rule lines containing “-” or “=”:

| One
|--
| Two
| and
|--
| Three

Now we get three table body rows, each with one table data cell, but still no table header row:

<table><tbody>
<tr><td>One</td></tr>
<tr><td>Two<br>and</td></tr>
<tr><td>Three</td></tr>
</tbody></table>

To produce a table heading row (a <thead> element in HTML), one has to use a table rule line with “=”:

| One
|=====
| Two
| and
|-----
| Three

Now the first line ends up as the content of the (single-cell) table heading row:

<table><thead>
<tr><td>One</td></tr></thead>
<tbody><tr><td>Two<br>and</td></tr>
<tr><td>Three</td></tr>
</tbody></table>
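
The single-column examples above can be reproduced with a short Python sketch. This is only one possible reading of the rules: the names `single_column_cells` and `to_html` are invented here, a bare “|” rule is simply skipped (so the paragraph handling of the verse example is left out), and the output is written on one line rather than formatted as above.

```python
from html import escape


def single_column_cells(lines):
    """Return (header, body); each cell is the list of its source lines."""
    header, body, current = None, [], []
    for line in lines:
        marks = set(line) - set(" \t|")
        if "|" in line and marks in (set(), {"-"}, {"="}):  # a table rule
            if not marks:
                continue  # bare "|": no row break in a single-column table
            if current:  # leading rules find nothing to close and are ignored
                only_line_above = header is None and not body and len(current) == 1
                if marks == {"="} and only_line_above:
                    header = current  # "=" rule after exactly one line
                else:
                    body.append(current)
                current = []
            continue
        current.append(line.strip().strip("|").strip())
    if current:
        body.append(current)
    return header, body


def to_html(header, body):
    def row(cell):
        # Lines within one cell are separated by hard line breaks.
        return "<tr><td>" + "<br>".join(escape(s) for s in cell) + "</td></tr>"

    out = "<table>"
    if header is not None:
        out += "<thead>" + row(header) + "</thead>"
    return out + "<tbody>" + "".join(row(c) for c in body) + "</tbody></table>"
```

Calling `to_html(*single_column_cells(lines))` on the last example yields a `<thead>` with “One” and two body rows.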

Multi-column tables are usually split into cells line by line:

| A1 | A2
| B1 | B2

which can be written somewhat terser as:

| A1 | A2
B1 | B2

or in the equivalent syntax (using a table rule line):

---|----
A1 | A2
B1 | B2

or (nicer?)

A1 | A2
---|----
B1 | B2

They all produce the same element structure information:

<table><tbody>
<tr><td>A1</td><td>A2</td></tr>
<tr><td>B1</td><td>B2</td></tr>
</tbody></table>

Lists and blockquotes will not be split into adjacent cells in different rows: for example

------|-----
- A1a | A2a
- A1b | A2b
- B1  | B2

will only produce one row, because the left column contains a single unordered list.

But the first column here

------|-----
- A1a | A2a
- A1b | A2b
B1    | B2

has only a two-item list, followed by a new paragraph: this allows a row break below - A1b and above B1, and below A2b anyway, and we get:

<table><tbody>
<tr><td><ul><li>A1a</li><li>A1b</li></ul></td>
    <td>A2a<br>A2b</td></tr>
<tr><td>B1</td><td>B2</td></tr>
</tbody></table>

Note that the “line break” between A2a and A2b is reproduced again using a <br> in the upper-right table data cell.

Without the unordered list, we would get three rows out of the “regular” table

------|-----
A1a   | A2a
A1b   | A2b
B1    | B2

To keep “lines” in a “paragraph” (of the content fragments in a column) together, one can indent the following lines a bit (using 0 to 3 spaces, relative to the preceding “|” or line start). This “joins” the first line of a paragraph with the subsequent lines, and prevents row breaks from being inserted:

------|-----
A1a   | A2a
 A1b  | A2b
B1    | B2

will “join” the A1a and A1b in the left column, and similarly

------|-----
A1a   | A2a
A1b   |  A2b
B1    | B2

would “join” the A2a and A2b in the right column.

Both of these table blocks transform into the exact same element structure:

<table><tbody>
<tr><td>A1a<br>A1b</td>
    <td>A2a<br>A2b</td></tr>
<tr><td>B1</td><td>B2</td></tr>
</tbody></table>

Here the line breaks in both cells in the upper row are reproduced, and the resulting table structure shows a close similarity to the one in the example above using an unordered list—as do the CommonMark input texts for both examples.
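
The multi-column examples in this section can be traced with the following Python sketch of the row-break rules. It is a simplification under several assumptions of mine: the helper names are invented; a line is “joined” to the current row if, in any column, its fragment is a list item or is indented more deeply than the fragment that opened the row; blank fragments, block quotes, and code blocks are not handled; and every line of a block is assumed to use the same leading-pipe style.

```python
import re


def is_rule(line):
    marks = set(line) - set(" \t|")
    return "|" in line and marks in (set(), {"-"}, {"="})


def fragments(line):
    """Split at "|" but keep each fragment's leading spaces."""
    text = line.rstrip()
    if text.lstrip().startswith("|"):
        text = text.lstrip()[1:]
    if text.endswith("|"):
        text = text[:-1]
    return text.split("|")


def indent(fragment):
    return len(fragment) - len(fragment.lstrip(" "))


def is_list_item(fragment):
    return re.match(r" *(?:[-+*]|\d+[.)]) ", fragment) is not None


def split_rows(lines):
    rows = []           # row -> list of cells -> list of text lines
    base = []           # per-column indent of the line that opened the row
    force_break = True  # a table rule always breaks the row
    for line in lines:
        if is_rule(line):
            force_break = True
            continue
        frags = fragments(line)
        joined = any(
            f.strip() and (is_list_item(f) or indent(f) > b)
            for f, b in zip(frags, base)
        )
        if force_break or not joined:
            rows.append([[] for _ in frags])
            base = [indent(f) for f in frags]
            force_break = False
        row = rows[-1]
        for i, f in enumerate(frags):
            if i >= len(row):
                row.append([])  # pad rows with fewer cells
            if f.strip():
                row[i].append(f.strip())
    return rows
```

A row break is thus taken only where no column objects, which is the “simultaneously” condition of rule 2 in section 5.4.3.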

#2

Might you also be interested in this other thread, on a proposal for a restricted subset of pipe syntax as an alternative to CSV?

http://talk.commonmark.org/t/side-thoughts-promoting-pipe-tables-as-a-potential-alternative-to-csv-format-e-g-psv/1862


Also talks about tables in

http://talk.commonmark.org/t/tables-in-pure-markdown/81/81


An interesting approach to multi-line cells, described in http://justatheory.com/computers/markup/markdown-table-rfc.html, is how PostgreSQL uses the “:” character (see also the talk thread “Madoko - microsoft research's markdown editor”):

This is how PostgreSQL renders tables, with multiline cells:

  id  |    name     |         description          | price  
------+-------------+------------------------------+--------
    1 | gizmo       | Takes care of the doohickies |   1.99
    2 | doodad      | Collects *gizmos*            |  23.80
   10 | dojigger    | Handles:                     | 102.98
      :             : * gizmos                     : 
      :             : * doodads                    : 
      :             : * thingamobobs               : 
 1024 | thingamabob | Self-explanatory, no?        |   0.99

Which might be a better and simpler way to approach this issue than what you are suggesting at the moment.


#3

@mofosyne : Thanks for the helpful links! Here are some comments on the examples discussed there:


CSV table format with “pipes”

Might you also be interested in this other thread, on a proposal for a restricted subset of pipe syntax as an alternative to CSV?

Notwithstanding the question if and how colon “:” should and could be used in a CommonMark “table block”, the “example spreadsheet” would be recognized and parsed according to the CommonMark table rules I proposed (with the exception of “:”, discussed later).

The resulting element structure should be (remember that I cobble together these HTML results by hand—so the usual disclaimers apply!):

<table><tbody>
  <tr><td>Tables</td><td>Are</td><td>Cool</td></tr>
  <tr><td>col 1 is</td><td>leftaligned</td><td>$1600</td></tr>
  <tr><td>col 1 is</td><td>centered</td><td>$12</td></tr>
  <tr><td>col 1 is</td><td>right-aligned</td><td>$1</td></tr>
</tbody></table>

Note that

  1. the “:” have been ignored here (they could obviously translate into an “appropriate” attribute in the <td> elements);

  2. because the “table rule line” uses only “-” (and not “=”), the first row in the CommonMark table will be placed into the table body, and the table heading (the <thead> element of HTML) is omitted (the default).

If the row containing “Tables”, “Are”, and “Cool” should form a table heading row (HTML <thead> element) in the output, the table rule line could be re-written as

|=---------|:-------------:|------:|

[NOTE: Well, my current wording says the table rule line should only contain either “=” or “-”, but that’s just a preliminary rule after all, right? :wink: ]

But CSV has no concept of table heading row anyway, IIRC …

To summarize: The difference between “CSV table” and “HTML table” (or “DocBook table” etc) boils down not to differences in CommonMark parsing or input syntax, but to differences in the generated output element structure and their rendering, as far as I can tell.


Tables in pure Markdown

Also talks about tables in

this discussion, which has various suggestions and examples of “CommonMark tables”.

The examples given by @nichtich would, as far as I understand my proposal and can “simulate” a parser, all be recognized by the proposed specification (in the current wording), again with the same two restrictions that

  1. column-alignment using “:” is (not yet) specified;

  2. to separate off a table heading row, a table rule line must have “=”, because using “-” alone does not suffice to mark up a table heading row (in the current wording).

If this is deemed important, the specification could easily be adapted to cover these two details too—but I’m not sure right now how the “column alignment” should be mapped into the output element structure.


On the other hand, your (@mofosyne’s) own table example given in the discussion over there:

A header           | Another header      |   Price
====================================================
Some text here     | Another bit of text |    34,10
Lorem ipsum        | In principo creavit |    624,45
----------------------------------------------------
||                                   Total:   658,55

and your comment

If there is a need to distinguish headers in tables. This could be an option, of specifying thicker lines.

both reflect exactly my motivation for the distinction between “=” and “-” in table rule lines, and also match my proposed syntax rules nearly exactly.

The one difference is that my rules would require at least one “|” (“pipe”, or under its “official” name, vertical line) character in table rule lines, because otherwise eg a block of lines containing just one line having only “=” as content would be detected as a table block by the syntax rules for tables:

A header           | Another header      |   Price
===================|=====================|==========
Some text here     | Another bit of text |    34,10
Lorem ipsum        | In principo creavit |    624,45
-------------------|---------------------|----------
||                                   Total:   658,55

or —"reduced to the max"— this one:

A header | Another header | Price
|=
Some text here | Another bit of text | 34,10
Lorem ipsum | In principo creavit | 624,45
|-
|| Total:   658,55

[NOTE: I’m not sure if one should require the “|” to be “free-standing” in table content lines, ie separated from cell content by white space. But I certainly think the CommonMark text looks better and is easier to recognize by a human reader this way! ]

Any proposals for that little detail?


Open questions

Questions remaining open with regard to the examples seen so far:

  1. If and how should “:” be used and interpreted in table rule lines? (IMO: Sooner or later: yes; but maybe later than right now.)

  2. Should mixing “=” and “-” in table rule lines be allowed? (IMO: Yes!)

  3. Is the proposed requirement to use “=” in a table rule line if one wants to produce a table heading row too simple-minded? (IMO: Probably yes, in the light of “current practice”.)

  4. Should it be a requirement that “|” in “table content lines” (ie in lines in a table block which are not “table rule lines”) be separated from cell content by white space? (IMO: Probably yes.)

Kind regards,

tin-pot


#4

@mofosyne: Oops, I nearly forgot your last example, the “PostgreSQL” syntax for (database?) tables in “plain text”. Here we go:


The idea there seems to be (correct me if I’m wrong) that:

  1. Each line where “|” is used to separate columns ends up in its own row; but

  2. lines where “:” is used to separate columns are “continuation lines”, and add merely more content to the currently “open” table row.
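
To make this reading concrete, here is a small Python sketch of the two rules as I understand them. The function name `parse_psql` is made up, and this is not how PostgreSQL itself handles its output; it only mirrors the layout of the example quoted above.

```python
def parse_psql(lines):
    """Rows start at lines with "|"; lines with only ":" continue the row."""
    rows = []
    for line in lines:
        if set(line.strip()) <= set("-+"):
            continue  # separator such as "------+------" (or an empty line)
        if "|" in line:
            # Rule 1: every line using "|" opens a new row.
            rows.append([cell.strip() for cell in line.split("|")])
        elif ":" in line and rows:
            # Rule 2: a ":" line adds content to the currently open row.
            for i, part in enumerate(line.split(":")):
                part = part.strip()
                if part and i < len(rows[-1]):
                    rows[-1][i] += "\n" + part
    return rows
```

A literal colon inside the text of a continuation line would be mistaken for a column delimiter by this sketch.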

Well, regarding

Which might be a better and simpler way to approach this issue than what you are suggesting at the moment.

I would certainly agree that these rules are simpler and—appropriately so for a database plain-text dump format—make it very clear where one table row ends and the next starts (namely at every line with “|”).


But for our (syntax use) case at hand, namely writing or authoring tables in a CommonMark text (which is obviously a different scenario than a machine dumping a database table IMO), I’m not convinced this would help us much, and doubt this approach would be “better” in any reasonable sense:

  • Using “:” as a delimiter in CommonMark lines containing authored content (ie lines that are not table rule lines) poses immediately the question how a “regular content” colon is supposed to be entered: one would have to type “\:” for every colon used in a table cell—and I expect that to occur not that rarely!

  • Starting a new table row (or in the jargon of my proposal: introducing a row break after each input line with “|”) would obviously render impossible the “abuse” (or “elegant side benefit”, depending on taste) of tables I sketched in the proposal:

    | One
    | Two
    

would certainly produce two rows using the “PostgreSQL” syntax rules; and this is not what I would like to see here, and accordingly not what the current proposal would produce.


At the heart of the matter, the difference is this:

  • The “PostgreSQL” syntax starts new table rows “by default”, and needs explicit mark up to indicate that input lines should “continue” to add input into the same output table row.

  • The (proposed) CommonMark syntax not only treats one-column tables slightly different than multi-column tables (for a reason), but primarily it strives to adhere to a simple principle, which could be paraphrased like this:

Keep CommonMark input blocks (if you look at just one column of the input table block) in one piece in a single output table data cell.


The rules I wrote are supposed to (implicitly) substantiate this principle (and are wrong if they fail to do so!). And they are a bit more complicated than one might wish for solely for one reason: an input table block like

A1 | A2
B1 | B2

should (obviously IMO, and according to “current practice” in other Markdown table extensions) produce a 2 × 2 output table, even given that each input column content:

A1
B1

and

A2
B2

is actually a single “paragraph” block according to CommonMark syntax rules (and in fact in every Markdown variant too).

That’s the reason behind the—admittedly ugly, but not too ugly I hope— rule that allows you (or forces you, if you will) to write

A1
  B1

or

A2
  B2

in the respective input column to make it explicit that you want to keep the single “paragraph” CommonMark block in this column together.

Because of the similarity in syntax (layout) and semantics to eg an unordered list:

A1
  - B1

which also would “join” the two input lines in the respective column together into a single table data cell in the output—because of this similarity I find the “special rule” for paragraph blocks inside table cells not too ugly, or hard to understand and apply.

Seen this way, I add a third point arguing that the PostgreSQL format is not better than the currently proposed syntax:

  • The PostgreSQL rules would force you to explicitly mark up “continuation lines” in your input table block, even when the content of your input table block makes it perfectly clear by itself (according mostly to existing CommonMark syntax rules) that you don’t want a new row, by assuming that you don’t want CommonMark blocks (like lists, blockquotes, etc) to get split and to end up in different rows of the output table.

And one final point (IMO a minor one, and less important than the others):

  • No other Markdown extension for tables I know of uses something like the PostgreSQL syntax rules.

tl;dr The “PostgreSQL” syntax rules make perfect sense for dumping a database table into an unambiguous plain-text format, but not so much for writing a table in a CommonMark plain-text document (or “typescript”, if you’re like me, or whatever term you prefer).

Best regards,

tin-pot


#5

Just some random, preliminary thoughts and observations:

  1. As defined, the syntax requires too much back-tracking.

  2. Leading pipe should only be optional …

    1. either if the table was started (and will be ended) by a table rule
    2. or after the maximum / fixed number of columns has been established unambiguously.

  3. Trailing pipe can always be optional, but should be omitted if the leading one was.

  4. The plus sign + should be an alias to pipe | in table rules that use the hyphen-minus - and
    the hash sign # should be an alias to pipe | in table rules that use the equals sign =:
    +----+----+ and #====#====#
    ASCII tables in email also often employed comma ,, period ., backtick ` and apostrophe ' for the corners.
    ,----+----. top and `----+----' bottom.

  5. Useful table headers frequently contain multiple rows, often with cells spanning several columns.

  6. There’s also table footers, which should work the inverse way to headers. Multiple table bodies (or row-groups), on the other hand, are not too common and probably don’t have to be supported, but could be with asterisk * or underscore _ table rules.

  7. HTML tables can cope with different number of cells in adjacent rows, but LaTeX tables (rather tabulars) are more strict and always require the right number of dividers as specified at the start.

  8. Table rules can either be inspired by Setext-style headings, which support equals sign = and hyphen-minus - as used in this proposal, or by horizontal rules, which currently work with hyphen-minus -, asterisk * and underscore _. Only the former has an established hierarchy, = > -. These related markups should be more coordinated, in my humble opinion.

  9. One could use line breaks to separate rows in tables with a hyphen table rule separating header and body, and in tables with a header separated by an equals table rule, rows would be separated by hyphen table rules, i.e. only explicitly. This would allow row-spanning quite simply.
    |----| continue cell above |----|

  10. Almost all current pipe table implementations (except Maruku) require at least one pipe in a table rule, usually it should match lines with content cells.

  11. One can think of pipes constituting (parallel) columns. Rows are either established by line breaks or by the known horizontal rules. If that was true, the following should work:
    |----|****|____| equals |----|----|----|

  12. There is currently no way to specify row header cells, which are quite common.

  13. Kramdown currently has the most aggressive table detection algorithm: Basically any line with a literal pipe character | in it constitutes a table.


#6

@Crissov : Wow, that’s an impressively detailed list of remarks—grant me some time to digest it, please …


#7

@Crissov : Okay, here we go (after wasting some time again wrestling with this site’s atrocious Markdown implementation, and with the even more brain-dead “[quote]” mechanism …).

Thank you again for the meticulous remarks, and for the mention of the

implementations of Markdown, which were both new to me (at least I can’t remember either one of them). I really appreciate your feedback!

I’ll refer to the two syntax descriptions linked above if needed.

Here are my answers to your remarks, or my remarks on your remarks on my proposal … :wink:


1. As defined, the syntax requires too much back-tracking.

I’m not sure that I understand exactly what you mean by that, but I guess that you refer to the fact that a parser implementation would have to do back-tracking, eg to see that this block:

One
two
three
|

is in fact a “table block”, and not a regular CommonMark paragraph (so the parser must first “see” the end of the block of lines before deciding what kind of block this is).

If that’s what you mean, my first reaction would be: “So what? – The syntax is supposed to ease the author’s job, not the parser implementor’s!”

But if you have a good proposal how to remedy this situation (for the parser), without imposing arbitrary restrictions on the author of a CommonMark table, I’d be glad to discuss this issue further.

[NOTE: I assume that what you don’t mean is that the human reader has to do “back-tracking”—which would amount to a curious way of saying “the syntax is too complicated”. Is this correct? ]


2. Leading pipe should only be optional …

  1. either if the table was started (and will be ended) by a table rule
  2. or after the maximum / fixed number of columns has been established unambiguously.

I take this to mean that you don’t want the line-leading “|” being optional in all the cases where it is now (pretty much always, that is).

Since you can always write the line-leading “|” character into the “table block”—on each and every line if you like—I don’t quite see the point in disallowing it being omitted.

Maybe it has something to do with your second remark, but you’d have to elaborate further what you mean by “establishing the maximum/fixed number of columns unambiguously”. And also: how this could be done using a (modified) table syntax. Right now I’d have to take rather wild guesses what you mean by this.


3. Trailing pipe can always be optional, but should be omitted if the leading one was.

Hmm: Trailing pipe is always optional right now, so it can always be omitted if one wants to. To decide where (in which specific instances) it should be omitted is thus in the hands of the author of the CommonMark text.

Or do you rather mean: the trailing “|” character should not be allowed to be omitted, but instead be required to be there if there was a leading one in the same line?

If so: why? What would be the advantage of this restriction of the author’s choices? Because there should be some gain in introducing such a rule, IMO.


4. The plus sign + should be an alias to pipe | in table rules that use the hyphen-minus - and the hash sign # should be an alias to pipe | in table rules that use the equals sign =:

+----+----+ and #====#====#

ASCII tables in email also often employed comma ,, period ., backtick ` and apostrophe ' for the corners.

,----+----. top and `----+----' bottom.

This seems to be a matter of taste: while I second your suggestion that “+” would make for “nice” ASCII art tables, I’m not sure about “#”: there is no “vertical double line” after all, so both “+” and “#” are “wrong” in one direction (horizontally or vertically) in any case.

I’d rather strongly oppose using “.” FULL STOP and “,” COMMA, or “'” APOSTROPHE to “decorate” table rule lines. And even more so suddenly changing the rules of CommonMark (and all Markdown variants) about “backtick” “`” GRAVE ACCENT. And still more so if it is for the sole reason of putting a redundant “backtick” where it looks nicer.

But questions of taste aside: allowing “.” and a host of other characters into table rule lines (just for the visual pleasure, if I understand you correctly) would not only complicate the syntax rules, and thus the parser implementation too, but would also increase the risk of accidentally “recognizing” a block of CommonMark lines as a “table block” (because a single “recognized” table rule line alone would have this effect) which was not meant to be one.

So my position is: I would probably support introducing “+” into table rule lines, because it does in fact look “right”, and I think it is indeed a common style for writing “ASCII art” tables. I would be quite sceptical regarding additional “decoration” characters beyond this one, and strictly oppose introducing a special exception for “backtick” without any good reason to do so or any benefit to be gained.


5. Useful table headers frequently contain multiple rows, often with cells spanning several columns.

That is what I had called “col[umn] spans” in my proposal, and the vertical analogue is accordingly called “row spans”.

There is no syntax I could propose for this yet, but I agree that one would be useful: as you say, “col spans” and “row spans” occur frequently in real-world tables, so this is certainly one item for a TO-DO list.


6. There’s also table footers, which should work the inverse way to headers. Multiple table bodies (or row-groups), on the other hand, are not too common and probably don’t have to be supported, but could be with asterisk * or underscore _ table rules.

The obvious syntax rule to “mark up” a “table footer row” in this sense would be symmetrical to the rule for table heading rows.

I’m not sure what you mean by “multiple table bodies (or row-groups)”: there is certainly no such thing in the HTML table model—which by the way can’t represent a “table footer row” either.

So in any case one would have to decide how exactly this should map into HTML, and also into the various other document types one would want to generate from CommonMark (like DocBook, or DITA, or LaTeX, or ISO 12083 etc).

The scant support for “table footer rows” (let alone for “multiple table bodies”) in popular document types does not justify this (or these) extensions, in my point of view.


7. HTML tables can cope with different number of cells in adjacent rows, but LaTeX tables (rather tabulars) are more strict and always require the right number of dividers as specified at the start.

It seems that LaTeX can do “col spans” just fine, using the \multicolumn{n} macro. (I had to look this one up too …)


8. Table rules can either be inspired by Setext-style headings, which support equals sign = and hyphen-minus - as used in this proposal, or by horizontal rules, which currently work with hyphen-minus -, asterisk * and underscore _. Only the former has an established hierarchy, = > -. These related markups should be more coordinated, in my humble opinion.

Nah! I don’t want to go there! – Look at the list of candidates of non-blank “characters to be used in table rule lines” accumulated thus far:

  1. “|” VERTICAL LINE: We probably all agree about that one.
  2. “-” HYPHEN-MINUS: And about that one too.
  3. “=” EQUALS SIGN: In my proposal, but I’m not so sure any more.
  4. “+” PLUS SIGN: Probably okay, just for the looks of it.
  5. “:” COLON: Will likely be used to signify horizontal alignment.
  6. “#” NUMBER SIGN: I see no point in this one.
  7. “.” FULL STOP: Or in this one.
  8. “,” COMMA: Or in this one.
  9. “'” APOSTROPHE: Or in this one, either.
  10. “`” GRAVE ACCENT: You know what I think about this one.
  11. “*” ASTERISK: Why, what should that buy us?
  12. “_” LOW LINE: Is redundant too (and vertically asymmetric without reason or meaning).

I can hardly come up with three different kinds of table rule lines (using SPACE alone, or HYPHEN-MINUS, or EQUALS SIGN), and reasons to have them!

The COLON is so much entrenched in “prior art” (for horizontal alignment of columns rsp cells) that it seems unavoidable: I think of it as a “reserved” character in this sense.

As I said, I’m kind-of indifferent about PLUS SIGN, as it looks like an innocuous addition with some aesthetic value as well as existing practice (in “ASCII art” tables, I don’t know and care less about Markdown).

But all the other seven (or eight, or even nine!) needless characters? No.


9. One could use line breaks to separate rows in tables with a hyphen table rule separating header and body, and in tables with a header separated by an equals table rule, rows would be separated by hyphen table rules, i.e. only explicitly. This would allow row-spanning quite simply.

 |----| continue cell above |----|

I have a difficult time trying to understand this remark:

  • One cannot use line breaks (in the sense of blank lines) in a table block; on the other hand there is obviously a line break after each line in the block: so which line breaks do you mean?

  • On using a table rule line with HYPHEN-MINUS in it to separate a table header row from the first table body row: yes, that’s already there (well, using EQUALS SIGN, but that’s a trivial change).

  • If I understand you correctly, you propose that a HYPHEN-MINUS table rule line would not produce a table heading row if (all? one? any?) other table rule line in the block also uses HYPHEN-MINUS—correct?

But if I understand you mentioning “row-spanning” right: the example
table rule line you give is supposed to be used like this? :

| A1 |         A2          | A3 |
|----| continue cell above |----|
| b1 |   b21    |    b22   | b3 |

(With some different “decoration” in place of the “continue cell above” text, obviously.) This is certainly an interesting idea for marking up “col spans”! How about actually using the PLUS SIGN for something meaningful:

| A1 |         A2          | A3 |
|----|----------+----------|----|
| b1 |   b21    |    b22   | b3 |

A parser could “see” by the VERTICAL LINEs in the first row that there are three cells to fill in the current row, and by the VERTICAL LINEs together with the PLUS SIGN(s) in the table rule line below this first row, that the second cell actually spans across two cells (the second and the third) in the next row.

Nice! I think we’re on our way towards a “col span” syntax!
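
To make the idea a bit more concrete, here is a rough sketch (in Python) of how a parser might read col spans off a content row and the table rule line below it. It assumes that the “|” and “+” characters in the rule line mark the column edges and that both lines are aligned character by character; the function name `column_spans` is made up for this illustration and is not part of the proposal.

```python
from typing import List, Tuple

def column_spans(content_line: str, rule_line: str) -> List[Tuple[str, int]]:
    """Return (cell text, col span) pairs for one content row."""
    # Assumption: every "|" or "+" in the rule line marks a column edge.
    grid = [i for i, ch in enumerate(rule_line) if ch in "|+"]
    # Cell edges of the content row: every "|" in the content line.
    edges = [i for i, ch in enumerate(content_line) if ch == "|"]
    cells = []
    for left, right in zip(edges, edges[1:]):
        text = content_line[left + 1:right].strip()
        # The span is the number of grid columns that start inside this cell.
        span = sum(1 for g in grid if left <= g < right)
        cells.append((text, span))
    return cells

content = "| A1 |         A2          | A3 |"
rule    = "|----|----------+----------|----|"
print(column_spans(content, rule))  # [('A1', 1), ('A2', 2), ('A3', 1)]
```

This only looks at the rule line below the row; a real syntax would also have to say what happens when the rule line above disagrees, or when the lines are not aligned.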


10. Almost all current pipe table implementations (except Maruku) require at least one pipe in a table rule, usually it should match lines with content cells.

As far as I can tell, Maruku uses the PHP Extra syntax for tables, and I think this requires at least one “|” in each line of the “table block”.

Note that the current wording of my proposal intends to require at least one “|” in each table rule line—if my text can be interpreted otherwise, I’d have to check and change that.
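
As an illustration of that intent, a recognizer for table rule lines could look roughly like the following Python sketch. It assumes that a rule line may contain only “|”, “-”, “=” and white space, and must contain at least one “|”; the name `rule_kind` and the three labels it returns are hypothetical, chosen only for this example.

```python
import re
from typing import Optional

# Assumption: only "|", "-", "=", SPACE and TAB may occur in a rule line,
# and at least one "|" is required.
RULE_LINE = re.compile(r"^[ \t|=-]*\|[ \t|=-]*$")

def rule_kind(line: str) -> Optional[str]:
    """Classify a line as a 'blank', 'hyphen' or 'equals' rule line, else None."""
    if not RULE_LINE.match(line):
        return None          # some other character, or no "|" at all
    if "=" in line:
        return "equals"
    if "-" in line:
        return "hyphen"
    return "blank"           # only "|" and white space
```

A line mixing “-” and “=” is reported as “equals” here, which is an arbitrary choice of this sketch; adding “+” or “:” later would only mean extending the character class.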


11. One can think of pipes constituting (parallel) columns. Rows are either established by line breaks or by the known horizontal rules. If that was true, the following should work:

|----|****|____| equals |----|----|----|

Yes, so far I was under the impression that there is no difference in meaning associated with the suggested plethora of “table rule characters”, so your two table rule lines would actually have the exact same “meaning” (ie be translated in the exact same way), as would the variations:

|====|*--*|--__|

|....|''''|,,,,|

|----|****|____|

|----#****|____#

#----|****#____`

'----|****|____|

|` `-|*--*|--__|

You see why I would find this bizarre?


12. There is currently no way to specify row header cells, which are quite common.

No, there is: the wording of the current proposal implies that a table rule line with (at least one) EQUALS SIGN in it can be used for this purpose—but this is probably too restrictive.

I expect that I will change that soon, so that any “non-blank” table rule line with HYPHEN-MINUS and/or EQUALS SIGN in it will produce (ie split off) a table heading row. (Provided this table rule line is placed below the first content row line in the block, of course!)
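
A small Python sketch of this changed rule, under the assumption that a table block is given as a list of lines and that rule lines contain only “|”, “-”, “=” and white space. The helper names (`is_rule`, `is_nonblank_rule`, `split_heading`) are invented for the example.

```python
from typing import List, Tuple

RULE_CHARS = set("|-= \t")

def is_rule(line: str) -> bool:
    # Assumption: a rule line has at least one "|" and nothing but rule characters.
    return "|" in line and set(line) <= RULE_CHARS

def is_nonblank_rule(line: str) -> bool:
    return is_rule(line) and ("-" in line or "=" in line)

def split_heading(lines: List[str]) -> Tuple[List[str], List[str]]:
    """Split a table block into (heading content lines, remaining lines)."""
    seen_content = False
    for i, line in enumerate(lines):
        if not is_rule(line):
            seen_content = True
        elif seen_content and is_nonblank_rule(line):
            # The first non-blank rule line below a content line splits off the heading.
            heading = [l for l in lines[:i] if not is_rule(l)]
            return heading, lines[i + 1:]
    return [], lines  # no heading row in this block
```

With this, `|---|---|` and `|===|===|` below the first content line both split off the same heading row; any later rule lines are left in the remaining lines untouched.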

Or did you mean something else by “specify row header cells”?


13. Kramdown currently has the most aggressive table detection algorithm: Basically any line with a literal pipe character | in it constitutes a table.

I glanced through the Kramdown syntax description and didn’t find anything particularly “aggressive” or disturbing, or even surprising: as far as I can see, the table syntax there pretty much is equivalent to my proposed syntax, where “any line with a literal “|” in it” (and only white space else) is a table rule line, and does constitute a “table block”.

So far, I fail to see a problem with this rule.


tl;dr:

  1. A syntax for “column spans” is needed (and for “row spans” probably too).

  2. The PLUS SIGN could be used in such a syntax.

  3. The current proposal’s wording requires a table rule line containing (at least one) EQUALS SIGN “=” to split off and produce a table heading row: this is too restrictive and will be changed.

  4. The COLON “:” will sooner or later be used to indicate horizontal alignment of table columns (or cells), adopting the use of COLON in existing Markdown table syntaxes.

  5. So far I can see no grave incompatibilities between my proposal and popular Markdown syntaxes.

  6. I find these two “meta-principles” important, will try to adhere to them, and will oppose throwing them out for no good reason (because IMO they distinguish the CommonMark and generally the Markdown approach from all the other “plain-text” syntaxes out there):

  • Syntax rules should suit the author (of CommonMark texts, not of the syntax rules, of course!), and not the parser (implementation and implementor).

  • Good syntax rules are (a) few in number, (b) easy to understand, and (c) flexible to apply.

But an explicit discussion about these “principles” and what they imply would be better suited for a separate item in the talk.commonmark list of topics.

Kind regards,

tin-pot


#8
  1. Back-tracking is hard for both human and computer parsers. The type of block should be determined within the first two lines.
  2. Making leading pipe mandatory in most cases is a direct consequence of issue 1.
  3. Trailing pipe is similar to trailing hash signs in headings. I just wanted to make clear that although the reasoning is different, they should still parallel leading pipes.
  4. I’m not a big fan of hash-equals, but it is commonly used in adhoc email tables, which was the major source for original Markdown. The note on corner symbols was rather informative, not a proposal to include them.
  5. Your proposal makes multi-row headers impossible. That’s bad.
  6. HTML does support multiple <tbody> elements per table, and the scope attribute for headers has a rowgroup value. Latex’s longtable even allows multiple headers and footers, i.e. one on each page for tables spanning multiple pages.
  7. I meant that in HTML you can have 3 explicit <td> in one row, 2 in the one before and 4 in the one after, but in LaTeX when you specify 3 columns you need to have 2 ampersands & and a line end \\. That’s just something to consider when deciding whether each line of a table must have the same number of pipes |.
  8. My point was, you’re using the heading characters = and -, but “rule” (as in <hr>) terminology. I’ve already raised the topic of harmonization in a thread of its own. It’s mostly a matter of consistency.
  9. This follows from issues 8 and 11. I’ll explain it further down. Your plus sign + for colspan idea seems worthwhile to explore further.
  10. See this example for what I meant, I’m not sure why Maruku sees just one column, though. When I started to write my unfinished CM table syntax proposal, I tried to reuse horizontal rules, thinking that existing implementations already supported that.
  11. No, you’re creating a straw man: comma, period, apostrophe, backtick and even hash sign simply don’t apply here at all. **** and ____ are equivalent to ---- in generating horizontal rules outside “column blocks”. Normal CM text is written in an implicit 1-column block. Pipes would constitute explicit column blocks. An explicit 1-column block could be treated in a special way (like Pandoc does). I was arguing that the result of “horizontal rules” would be slightly different in explicit column blocks, i.e. </tr><tr> instead of <hr> in HTML.
  12. Row header cell in HTML: <tbody>…<tr><th scope=row>row header<td>normal content</tr>
  13. Try foo | bar
  14. Syntax for a caption (i.e. table heading) is missing. (I forgot to mention that before.)

Code examples for 2. (optional leading pipe)

Table started and ended with table rule:

---|---
 A | B
---|---

Table with unambiguous number of columns from first row:

| A | B
  C | D

Code examples for 9. (possible different handling of = and - table rules)

The HTML output below only shows the contents of the <tbody>, the header is always the same:

<table><thead><tr><th> A <th> B </thead><tbody>
<!-- … -->
</tbody></table>

| A | B |
|---|---|
| C | D |
| E | F |

<tr><td> C <td> D
<tr><td> E <td> F

Test implementations

| A | B |
|===|===|
| C | D |
| E | F |

<tr><td> C E <td> D F

or maybe

<tr><td> C<br>E <td> D<br>F

Test implementations

| A | B |
|===|===|
| C | D |
|---|---|
| G | H |

<tr><td> C <td> D
<tr><td> G <td> H

Test implementations

| A | B |
|===|===|
| C | D |
|---| F |
| G | H |

<tr><td> C <td rowspan=2> D F H
<tr><td> G

or maybe

<tr><td> C <td rowspan=2> D<br>F<br>H
<tr><td> G

Test implementations


#9

1. Back-tracking is hard for both human and computer parsers. The type of block should be determined within the first two lines.

If this should be a problem, I would agree to a simple rule like

  • There must be a line containing a “|” character (ie either a table rule line or an explicit table content line) within the first N lines of a “table block”

for the block to be recognized as a table. Would that reduce back-tracking enough in your opinion (you seem to prefer N = 2 in this rule)?


There is nothing I have to add to (2.), (3.), and (4.) for now.


5. Your proposal makes multi-row headers impossible. That’s bad.

I have to admit that I didn’t aim at making every table structure possible in (say) HTML expressible in the CommonMark table syntax. However, since we already have three different kinds of table rule lines, it should be possible to come up with rules for them to “split off” multiple input rows for an output table header row (and for a table footer row too). I will cook up some syntax for it, if you don’t already have an idea.


6. HTML does support multiple <tbody> elements per table, and the scope attribute for headers has a rowgroup value. Latex’s longtable even allows multiple headers and footers, i.e. one on each page for tables spanning multiple pages.

The same answer as to (5.) applies here in a way: I’m not sure it is worthwhile to reflect every feature an HTML table exhibits in the CommonMark syntax. There are already a lot of HTML elements and constructions without a CommonMark syntax, and this will probably not change either (think of the “logical character styles” like <SAMP>, or the group of <FORM> related element types).

I’d rather have a basic table syntax soon (with an implementation to gain practical experience using that syntax) than accumulating syntax for rarely used table constructs (I for one have never used multiple table heading rows, or multiple table bodies, or even a single table footer row).

As long as we don’t preclude now that later on syntax extensions to cover more of these features can be added (for example, by nailing down the same meaning for too many syntax variants), I think this is the right approach.


7. I meant that in HTML you can have 3 explicit <td> in one row, 2 in the one before and 4 in the one after, but in LaTeX when you specify 3 columns you need to have 2 ampersands & and a line end \\. That’s just something to consider when deciding whether each line of a table must have the same number of pipes |

Hmm: so far there is no requirement that each line has the same number of “|”, but that alone doesn’t help—one has to have a means to “distribute” the “|” to the right “places” (so that they align with the rows above and below) so to say.

An ad-hoc syntax resolving this would be to treat either

  • adjacent “|” characters in a special way, or
  • use decimal numbers adjacent to the “|” characters

in order to let something like

| a | | c |

mean “three cells with “a”, “” (nothing), and “c” in them”, while

| a || c |

| a 2| c |

would both mean “a 2-column wide cell with “a” in it, and a one-column wide “regular” cell with “c” in it”. Whereas

| a | c ||

| a | c 2|

would have the two-column wide (2-col-span) cell on the right, containing “c”.

The syntax using digits (or numbers) is clearly a bit ugly. And I would
prefer the style using PLUS SIGN in a table rule line over both of these
ad-hoc ideas.
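
For comparison, here is how these two ad-hoc styles could be read by a parser, as a Python sketch. It assumes that a run of k adjacent “|” closes a cell spanning k columns, and that a decimal number directly before the closing “|” (and separated from the cell text by white space) gives the span instead. The function name `parse_row` is hypothetical.

```python
import re
from typing import List, Tuple

def parse_row(line: str) -> List[Tuple[str, int]]:
    """Return (cell text, col span) pairs for one content row."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]                 # the leading "|" only opens the row
    cells = []
    for raw, pipes in re.findall(r"([^|]*)(\|+)", line):
        span = len(pipes)               # "||" closes a 2-column cell
        m = re.fullmatch(r"(.*?)\s+(\d+)", raw)
        if m:                           # "a 2|": digits adjacent to the "|"
            raw, span = m.group(1), int(m.group(2))
        cells.append((raw.strip(), span))
    return cells

print(parse_row("| a | | c |"))  # [('a', 1), ('', 1), ('c', 1)]
print(parse_row("| a || c |"))   # [('a', 2), ('c', 1)]
print(parse_row("| a 2| c |"))   # [('a', 2), ('c', 1)]
```

The sketch also shows one reason why the digit style is ugly: a cell whose text happens to end in a number right before the “|” would be misread as a span.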


8. My point was, you’re using the heading characters = and -, but “rule” (as in <hr>) terminology. I’ve already raised the topic of harmonization in a thread of its own. It’s mostly a matter of consistency.

I hadn’t made the connection with “horizontal rules” in my mind like this. For example, you can (often, at least in older books) see a row of spaced-out asterisks or more ornate “asterisms” and “fleurons” as a separator (just like a horizontal rule) between parts of (for example) the same chapter. Wikipedia has more on this under “section (typography)”. I associate this with the use of “*” in CommonMark horizontal rules, and I see no such connection from asterisks to “horizontal rules” (or are these called “leaders”?) between table rows; therefore, at least for “*”, I see no inconsistency either.

[EDIT: Oh wait—the terminology is the problem? How do you call the thin, black, straight, horizontal and vertical lines across a table, between rows and columns and thus cells? “Rules”, “borders”, “lines”, “contours”? I’m not too fond of the term “table rule lines” either …

I’d like to emphasize that the use of “|”, “-”, “=”, and “+” too has nothing to do with CommonMark syntax for “section headings” or “horizontal rules”: people have been using these characters for ASCII box drawing (of tables, of underlined headings, and so on) probably long before John Gruber’s first day in school. ]


9. […] Your plus sign + for colspan idea seems worthwhile to explore further.

Yes, I think (or hope) so too.


10. […] When I started to write my unfinished CM table syntax proposal, I tried to reuse horizontal rules, thinking that existing implementations already supported that.

I think this example (using a “-”-only line to separate rows) looks kind of weird: I wouldn’t format an “ASCII art” table like that, because the horizontal rule “cuts” the table into two pieces in a very visually “dominant” way. Are there any reasons why this format should be supported?


11. No, you’re creating a straw man: comma, period, apostrophe, backtick and even hash sign simply don’t apply here at all. **** and ____ are equivalent to ---- in generating horizontal rules outside “column blocks”.

As I said, I don’t follow your transferring of the “horizontal rule” syntax variations to the table rule line syntax. In short: why would one want to use asterisks “*” between two table rules? Or low lines “_”? They would have the same meaning anyway, right? IMO the variety offered for the current CommonMark horizontal rules is a direct reflection of the typographical variety used in “section breaks”, as discussed above; and it doesn’t apply to the graphical means to make the border between table cells visible.


11. [cont.] Normal CM text is written in an implicit 1-column block. Pipes would constitute explicit column blocks. An explicit 1-column block could be treated in a special way (like Pandoc does). I was arguing that the result of “horizontal rules” would be slightly different in explicit column blocks, i.e. </tr><tr> instead of <hr> in HTML.

You mean “why can’t I write a horizontal rule in a 1-column block (aka “degenerate table”) like I can eg in a block quote”? Like this in a block quote:

> Para para para. Stand by for a short break:
>
> * * *
>
> And we're in the next para.

I could be wrong, but wouldn’t the current “table syntax” shebang imply
that the same happens for this:

| Para para para. Stand by for a short break:
|
| * * *
|
| And we're in the next para.

At least I’m with you that this should transform into

  • a paragraph,
  • a horizontal rule,
  • another paragraph

nested into a table (one cell, or three cells) or even not in a table at all. But it is important that the same line breaking rules apply as for a table: the two paragraphs in the “table” example would receive “hard line breaks” in the output (reflecting the line breaks in the input)—in contrast to the block quote rules.


12. […]

This is one more HTML table feature which could, and maybe should, be
supported in the future.


13. Try foo | bar

I did, and I would agree with Kramdown on this. Not because I love Kramdown so much, but because it is a “natural” consequence of the simple rules set up so far.

I’m not sure if this incompatibility with most other
Markdown syntaxes would pose a frequently occurring problem. If so, the
rules should raise the bar for a block to be recognized as a table, and I think this is one direction you’ve been aiming at multiple times.

So say these restrictions:

  1. A table block must be “recognizable” within the first N lines (N = 2 by default).

  2. There must be among these lines

    • a table rule line (with one or more “-” or “=” in it), and/or
    • a line with a leading “|” character on the left.

There are obviously a lot of ways to “fine-tune” the rules like this, which is one of the reasons why I’d like to have a prototype to experiment with rather sooner than after a nailed-down and finished syntax which encompasses all kinds of fancy table features.
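
As a starting point for such experiments, the two restrictions above could be sketched in Python like this. The sketch assumes a table rule line contains only “|”, “-”, “=” and white space, with at least one “|” and at least one “-” or “=”; the name `looks_like_table` and its parameter `n` are made up for the example.

```python
from typing import List

def looks_like_table(block: List[str], n: int = 2) -> bool:
    """Decide from the first n lines whether a block is a table block."""
    def is_rule_line(line: str) -> bool:
        # Assumption: only rule characters, one "|" and one "-" or "=" at least.
        return ("|" in line
                and set(line) <= set("|-= \t")
                and ("-" in line or "=" in line))

    def has_leading_pipe(line: str) -> bool:
        return line.lstrip().startswith("|")

    return any(is_rule_line(l) or has_leading_pipe(l) for l in block[:n])

print(looks_like_table(["foo | bar"]))           # False
print(looks_like_table(["| A | B", "  C | D"]))  # True
print(looks_like_table([" A | B ", "---|---"]))  # True
```

Under these rules a lone `foo | bar` line stays an ordinary paragraph, while the examples with a leading pipe or an early table rule line are still recognized. Raising `n` trades more back-tracking for more freedom in where the first rule line may appear.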


14. Syntax for a caption (i.e. table heading) is missing. (I forgot to mention that before.)

There’s also no syntax for the 11 element types (apart from <tgroup>) in the content model of a DocBook CALS table … ;-p

Kidding aside: this can wait (as can the CALS table model).


tl;dr

  1. A “parametrized rule” to limit back-tracking can easily be stated and implemented, and would probably not unduly restrict authors.

  2. There is a cornucopia of “table structure features” which could be
    desired in a CommonMark syntax (to be expressible). We can’t have them all, at least not from the start.

  3. The analogy between “horizontal rules” and table rule lines is not that close IMO, due to the different typographical background reflected (or to be mimicked) in both.

  4. The “degenerate” table case (a 1 × 1-sized table) should follow the same syntax rules, and should treat line breaks in the same way etc as a regular, “non-degenerate” table. Whether the degenerate table is actually mapped into a table element in the output is a different question.

  5. I think the two most important “table features” to accommodate next into the syntax are:

  • “column spans”, ie cells extending across multiple table columns (this would use/explore the use of “+” to mark this up);

  • “horizontal alignment” of cell content (either per cell, or for a whole column).

  6. Single row and multi-row table headers and footers should be specified and dealt with in one go.

  7. “row spans”, ie cells extending vertically across multiple rows, should be marked up using a syntax which is consistent with “col spans”. Obviously.

  8. It would be useful to have a “common table model” as the maximum feature set to express in a table syntax: this should encompass features eg from HTML, DocBook, CALS, LaTeX and whatnot. Then use this model to prioritize table features, and to avoid the premature introduction of conflicts with some feature needed in the future.
