Side Thoughts: Promoting pipe tables as a potential alternative to .CSV format (e.g. .PSV ?)


#1

One of the common gripes I have about CSV is its limited ability to distinguish between a “header row” and a “data row”.

If we ever come to an agreement on a simple enough commonmark table parser (or maybe even before), would it also be a good idea to promote a new “PSV” format, in which a file may look like, say, a github-style markdown table?

It doesn’t have to be a pipe format, but it should ideally be a format that is likely to be used by commonmark. The idea is that it would be included as a possible input format for Python, MATLAB, Excel, and any other spreadsheet software as a competing alternative to CSV.

# Spreadsheet Title
Spreadsheet comments

## Worksheet Name
Sheet comments

| Tables   | Are           | Cool  |  
|----------|---------------|-------| 
| col 1 is | ok            | $1600 | 
| col 2 is | cool          | $12   | 
| col 3 is | normal        | $1    | 

(The terminology “spreadsheet” and “worksheet” is based on Excel.)

Yea I know there is tab separation, but tabs are difficult to distinguish from spaces. Also, CSV files do not make headers visually obvious.

Plus, with smart design of the parser (proposed in halfbakery below), you could also have multiple sheets (along with sheet comments), so it becomes almost akin to an ASCII-readable spreadsheet. Of course, you do want to keep the base parser as basic as possible, to encourage adoption.

At its simplest, the most basic parser should be able to parse the table, locate the header, and gracefully ignore all other comments and tables.
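As a rough illustration, a minimal parser along these lines might look like this JavaScript sketch (function names and heuristics are illustrative, not part of any spec): find the first pipe table, treat the row before the divider as the header, and skip everything else.

```javascript
// Sketch of the "most basic parser": locate the first pipe table,
// use the row above the divider as the header, ignore all other text.
function extractFirstTable(text) {
  const lines = text.split("\n");
  const isRow = l => l.trim().startsWith("|");
  // a divider row contains only pipes, dashes, colons and whitespace
  const isDivider = l => /^\|[\s:|-]+\|?\s*$/.test(l.trim()) && l.includes("-");
  const cells = l => l.split("|").slice(1, -1).map(c => c.trim());
  let header = null;
  const rows = [];
  for (let i = 0; i < lines.length; i++) {
    if (isRow(lines[i]) && isDivider(lines[i + 1] || "")) {
      header = cells(lines[i]);
      for (let j = i + 2; j < lines.length && isRow(lines[j]); j++) {
        rows.push(cells(lines[j]));
      }
      break;
    }
  }
  return { header, rows };
}
```

Everything outside the table (titles, comments) simply falls through the `isRow` check and is ignored.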


If this is a good idea and there is enough support, we could spread the concept via a website (like how JSON did it), which would provide a simple visual explanation of constructing such a parser. And if it becomes popular enough that people write parsers for it, we could also include libs for popular languages like Python etc…


Header Handling:

Standard Single Header

| Tables   | Are           | Cool  |  
|----------|---------------|-------| 
| col 1 is | ok            | $1600 | 
| col 2 is | cool          | $12   | 

Multirow Header

|==========|===============|=======|
| Tables   | Are           | Cool  |  
| Tables   | Are           | Cool  |  
|==========|===============|=======|
| col 1 is | ok            | $1600 | 
| col 2 is | cool          | $12   | 

Multicell Header

|==========|===============|=======|
| Tables   | Are           | Cool  |  
: Tables   : Are           : Cool  :  
|==========|===============|=======|
| col 1 is | ok            | $1600 | 
| col 2 is | cool          | $12   | 

Multicell rows

| Tables   | Are           | Cool  |  
|----------|---------------|-------| 
| col 1 is | * apples      | $1600 | 
:          : * oranges     :       :
| col 2 is | cool          | $12   | 
:          : or hot        :       :
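A sketch of how a parser might merge such `:` continuation lines, assuming each row has already been split into cells with its leading marker (`|` or `:`) recorded; names and the newline-joining rule are my assumptions, not a spec:

```javascript
// Sketch: a row beginning with ':' appends its non-empty cells to the
// row begun on the previous '|' line, joining with a newline.
function mergeContinuations(rows) {
  // rows: array of { marker: "|" | ":", cells: string[] }
  const merged = [];
  for (const row of rows) {
    if (row.marker === ":" && merged.length > 0) {
      const prev = merged[merged.length - 1];
      row.cells.forEach((c, i) => {
        if (c) prev[i] = prev[i] ? prev[i] + "\n" + c : c;
      });
    } else {
      merged.push(row.cells.slice());
    }
  }
  return merged;
}
```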

Non data field?

Not sure how to deal with this, but maybe if = is the header divider, then - could mark a non-data field below?

|==========|===============|=======|
| Tables   | Are           | Cool  |  
|==========|===============|=======|
| col 1 is | ok            | $1600 | 
| col 2 is | cool          | $12   | 
|----------|---------------|-------| 
||                  TOTAL: | $1612 |

But then how do you declare the logic? e.g. SUM(A3:E3) in excel? Or should that be considered beyond the scope of PSV?

What if we also used !?

|==========|===============|=======|
| Tables   | Are           | Cool  |  
|==========|===============|=======|
| col 1 is | ok            | $1600 | 
| col 2 is | cool          | $12   | 
|----------|---------------|-------| 
||                  TOTAL: | $1612 |
!!              ! =SUM(col:"Cool") !

Hmmm… maybe out of scope of this. Seems rather ugly to try and shove in.

Value Type detection

Supports all datatypes that JSON can handle, but also records a type guess, e.g. $50 is seen as {"value":50,"type":"currency"}. The type can either be autodetected from the first data field, or declared in the header field. e.g.

|=======|===============|========|
| ID    | Comment       | Cool   |  
: (key) :               : ($USD) :  
|=======|===============|========|
| 1     | ok            | $1600  | 
| 2     | cool          | $12    | 

Other value declarations could be like (SI:cm^2), which says SI units of centimetres squared, etc…
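As an illustration, a type-guessing pass along these lines might look like the following (the currency and number patterns are my assumptions, not anything specified):

```javascript
// Sketch: guess a cell's type from its text,
// e.g. "$1600" -> { value: 1600, type: "currency" }.
function guessValue(text) {
  const t = text.trim();
  const currency = t.match(/^\$(-?\d+(\.\d+)?)$/);
  if (currency) return { value: Number(currency[1]), type: "currency" };
  if (/^-?\d+(\.\d+)?$/.test(t)) return { value: Number(t), type: "number" };
  if (t === "true" || t === "false") return { value: t === "true", type: "boolean" };
  return { value: t, type: "string" };
}
```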


Difference from typical markdown tables

  • The : alignment characters are ignored. This is a data format like CSV.
  • A : at the start of a line while in table mode means multicell continuation.
  • " can be used to safely include strings containing | or :.
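The quoting rule in the last point could be handled by a quote-aware cell splitter, e.g. (a hypothetical helper; escaped quotes inside strings are not handled in this sketch):

```javascript
// Sketch: split one pipe-table row into cells, keeping | and :
// safe inside double-quoted strings.
function splitRow(line) {
  const cells = [];
  let cur = "";
  let inQuotes = false;
  // strip one leading and one trailing pipe, if present
  let s = line.trim();
  if (s.startsWith("|")) s = s.slice(1);
  if (s.endsWith("|")) s = s.slice(0, -1);
  for (const ch of s) {
    if (ch === '"') { inQuotes = !inQuotes; cur += ch; }
    else if (ch === "|" && !inQuotes) { cells.push(cur.trim()); cur = ""; }
    else cur += ch;
  }
  cells.push(cur.trim());
  return cells;
}
```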

Originally made the proposal in halfbakery:

http://www.halfbakery.com/idea/PSV_20-_20Pipe_20Separated_20Value_20format_20(markdown_20table_20style!)#1441898267


edit: Updated with these considerations.


RFC Spec extension for *tables*: Syntax and transformation rules
#2

I like the idea. It would keep the discussion of syntax for the PSV format separate from markdown making everything more atomic.


#3

The proposal seems mostly good. I like the idea of reusing Markdown headers and table syntax, rather than inventing yet another syntax.

As mentioned in the tables topic, what concerns me about Github Flavored Markdown tables is the column alignment feature. There are other ways you could achieve the same result without mixing presentation with content (and without using classes):

  1. Use nth-of-type in CSS to float the content of particular <td> columns left or right.
  2. The application could look at the data in the column and automatically align it left or right based on the data type that is found.

I don’t recall explicitly defining column alignments in CSV. Does Excel use the second approach?

Another question: if a PSV document contains content outside of the table (e.g. paragraphs and lists) would these be ignored or added to a cell on their own?


#4

What to do about Alignment?

Well the other alternative approach is to simply say that you cannot use : to align in PSV (Or just get the parser to recognize but ignore it). After all, the PSV format would be mostly a data interchange format. Oh and btw CSV doesn’t do alignment.


How is external list and comments treated?

On external paragraphs and lists: if you’ve seen my example above, I think they would mostly be treated as comments. They really should not be rendered as cells, but rather as a “comment display” (think help file) next to a spreadsheet, which would look like:

# Dog Company
Below is all kinds of sheets related to dog company

## Product List
Here is a range of products
* ID : Product ID
* Name : Product Name
* Price : Product cost to consumer

    [DISPLAY SHEET]

## Payroll
Here is the payroll
* ID : Employee ID
* Name : Full name of employee
* Wage : How much to pay employee

    [DISPLAY SHEET]

## Expenses
Here is the Expenses
* ID : Expense ID
* Comment : Expense Comment
* Cost : How much did it cost?

    [DISPLAY SHEET]

Where clicking on “DISPLAY SHEET” would switch the currently displayed sheet to the selected display sheet. Alternatively switching worksheet would jump to the relevant comment. What I envision people would put into the comments is field descriptions.

But in most cases, the comments are ignored by the table parser (if we are only interested in the data in, say, the “payroll” worksheet).

But… maybe we could add extra capabilities like “where is the key field”. Does add more complexity tho.


#5

Using numeric position of columns in each table to externally map alignment (usually to semantically denote numeric columns) makes about as much sense as using word position in a paragraph to map emphasis. It adds synchronization overhead that will often fall out of sync, without really adding any value, since each table still needs to be managed separately.

Automatic data selection makes much more sense, and is more likely to support other cultures than specific alignment instructions.

Of course, right-aligning a numeric table column is much more work than it should be in HTML too, and if that were addressed (possibly by using numeric column detection), it could greatly simplify Markdown’s role.
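The automatic numeric-column detection mentioned here could be sketched roughly like this (the currency-stripping and number patterns are my assumptions; it assumes at least one data row):

```javascript
// Sketch: if every non-empty cell in a column parses as a number
// (after stripping a leading currency sign), right-align that column.
function columnAlignments(rows) {
  const cols = rows[0].length;
  const aligns = [];
  for (let c = 0; c < cols; c++) {
    const numeric = rows.every(r => {
      const t = r[c].trim().replace(/^\$/, "");
      return t === "" || /^-?\d+(\.\d+)?$/.test(t);
    });
    aligns.push(numeric ? "right" : "left");
  }
  return aligns;
}
```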


#6

You might want to have a look at the tabular-data / CSV-on-the-web W3C working group. There might be some references and considerations (metadata, etc.) worth looking at.


#7

The application that the data is imported into could handle the column styling, rather than a user manually writing the CSS and keeping it in sync. Excel keeps track of presentational enhancements made using the GUI after importing a CSV file and saves these to its own file format. An app that supported PSV importing could use CSS or some other style format.

It would indeed be unsustainable to manually keep the styling in sync without some kind of explicit link between each table column and the associated styling.

What are you intending this to be @mofosyne - a simple data exchange format like CSV or a full spreadsheet file format like XLS/XLSX that includes the styles as well?


#8

I would like to keep it as a data exchange format (think JSON for tables), whose tables are hopefully usable in a commonmark document (ergo probably sharing the same parser).

Well, as @zzzzBov mentioned, it would also be a good opportunity to experiment with a “table parser” as a human-readable tabular data exchange format, to inform the design of a table parser for commonmark. (Plus it could help with adoption of table syntax if it’s used as both a presentation and data exchange format.)

People style alignment and ruling to impart meaning to their table (e.g. header fields). So a parser should ideally be smart enough to identify extra CSS styles that should be added to allow for better targetability (e.g. this field is currency, this field is a header, this field is a database ID field).

tl;dr: This would be more of a data-only format, which is more like how markdown would do it. (But to make external styles useful, we need to identify why people would typically style their table and ensure that there is a CSS hook for it, e.g. key field, headers, grouped fields, etc… ergo, what is the semantic we need to express in an ASCII table, and can it be done cleanly?)

tl;dr: most of all, it needs to be easy to code a parser for, if it’s going to be used as an exchange format. So have a version with the most minimal parser possible too.


If people really want portable styling… I could imagine something like PSVz as a zipped version of PSV that contains both the table.PSV and a tableStyle.css in the same zip file. etc…


#9

Why do people align/style tables anyway? Is there a functional purpose?


#10

@chrisalley I suppose it depends on whether the data is external to the document. In that case, a mapping would be useful. That would need to be a Commonmark extension I think though.

I don’t think a full CALS table model is appropriate to the goals of markdown–that’s why embedded HTML is supported.


#11

Yea, you want to cover just the most common use cases, which I think are header rows and data rows. But are there other common models that people really need in a lightweight PSV parser?


#12

Just some extra thoughts:

  • Would it be safe to use the .psv extension for this, or should it adopt a separate extension, since it is a more strictly defined standard?
  • It seems though I would just hijack the .psv extension anyway, since that’s what commonmark seems to be doing anyhow with .md.
  • Strict mode (markdown tables with header auto-detect) vs compatibility mode (cannot assume a header, only one sheet, no starting | character for each line). This can perhaps be distinguished via the presence of a # in the first few lines, without encountering any |.
  • Thought it would be nice to eventually stick to strict mode. (Or maybe we make compatibility mode an optional thing?)
  • : can be safely used as a delimiter for subsequent multiline cells, as long as the last line used |

  • Since this is aimed at being a simple parser competitor to CSV, it should probably ignore all : alignment styling.

  • Perhaps we could have multiline header cells, by having a starting divider and detecting an ending divider.

  • How could we have non data fields like TOTAL?
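For what it’s worth, the strict vs compatibility auto-detection suggested above (a # heading appearing before any | row) might be sketched as follows (the line limit and names are my assumptions):

```javascript
// Sketch: scan the first few lines; a '#' heading seen before any
// '|' row suggests strict mode, otherwise fall back to compatibility.
function detectMode(text, maxLines = 10) {
  const lines = text.split("\n").slice(0, maxLines);
  for (const line of lines) {
    const t = line.trimStart();
    if (t.startsWith("#")) return "strict";
    if (t.includes("|")) return "compatibility";
  }
  return "compatibility";
}
```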


#13

Decided to start this group now: https://github.com/psv-format/psv/blob/master/README.md. I can add people to this organisation if anyone is interested.

I think for now we should be in the research phase of finding github projects related to dealing with pipe tables. Then we can start to define the table format to cover the most common use cases.

For the initial implementation, let’s make C and javascript implementations that convert from tables to JSON?


#14

Suggestions:

  1. Restrict heading format to underlined Setext style, i.e. just two levels: ==== (spreadsheet) and ---- (worksheet).
  2. Use the colon : (alternatively the broken bar ¦) and the vertical line (vulgo: pipe) | as horizontal cell separators, no plus +.
  3. Use equal sign = and hyphen-minus (or dash) - in separator lines.
  4. The double or broken marks, colon and equals, are used to delineate headers (and footers), the single or line marks, pipe and dash, for normal cells.
  5. Horizontal tabulators augment or replace pipes, line breaks or empty lines augment or replace dashes.
  6. Header cells must contain a label and may contain data type or formatting hints for their column or row data. Footer cells should contain a symbolic aggregation formula.
  7. A table consists of an optional head in both directions, at least one mandatory body and an optional foot in both directions.
  8. Simple cells cannot contain block content (paragraphs, lists, headings, quotations, fences). They can contain any Markdown inline formatting. Links should use collapsed or shortcut reference format.
  9. Cells can span multiple rows or columns or both.
  10. Individual cells are addressable in links with case-sensitive standard spreadsheet A1 notation preceded by a hash sign #, e.g. #A1. IDs based upon column and row labels, e.g. #[my_row]:[my_col], are not supported by default.
  11. (Internal) relative URLs work inside autolinks, e.g. <#A1>.
  12. Cell content can be embedded elsewhere with image link syntax, e.g. ![](#A1).
  13. Outer lines are usually optional.
  14. … (work in progress)
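As a rough sketch of the A1 addressing in item 10 (an illustrative helper, not part of the proposal):

```javascript
// Sketch: convert an A1-style reference such as "#A1" or "B3"
// to zero-based { col, row } indices.
function parseA1(ref) {
  const m = ref.match(/^#?([A-Z]+)(\d+)$/);
  if (!m) return null;
  let col = 0;
  for (const ch of m[1]) col = col * 26 + (ch.charCodeAt(0) - 64);
  return { col: col - 1, row: Number(m[2]) - 1 };
}
```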

#15

Oh btw, do you have a github account? I’ll add you in so we can start writing a draft spec at least (hopefully it could be the basis for the commonmark full table extension? Obviously the full commonmark one will do more than ours. That being said… is there any draft to steal from?)

  1. Could work. My initial idea is that it’s faster looking for # in the first char of each line. Will Setext increase complexity or slow processing? (Or is it better from a maintenance perspective, e.g. viewing on github?)
  2. I get how | works, e.g. |------|-----|-------|. But what do you mean by : as a horizontal separator? Also, I do not see any broken bar on my US keyboard. Is : a row span indicator?
  3. Should be okay to accept = and - for separators in parser input. Though if they both have the same meaning, then - should be the canonical representation for all psv output.
  4. I have not seen headers and footers in github flavoured markdown. Is this something that would be a common enough use case, against just using JSON? What would a footer look like?
  5. ?
  6. I like this idea. I wonder if the {} consistent attribute syntax could be applied to tables… I haven’t seen any ideas on how it can be done for this, especially table fields (e.g. {field1=integer}). Don’t yet know of a good approach for this.
  7. How do you make an optional field header? Is it assumed that the header is always the first cell row plus separator? (That should be clear in the spec.) Is that a bad idea, since it requires holding the first line in memory until it can be confirmed whether it’s a header or body?
  8. I think for *.psv, we don’t really care if the cell content is markdown or not, as it’s just treated as a string. We could add "" quotation around the value if it should be a string and not be autodetected as a number, boolean, or null. But it is a good thing to note as a recommendation for commonmark.
  9. Cell spanning rows and columns is going to be an interesting problem to approach. Most pipe table implementations ignore it.
  10. How is that going to be represented when a *.psv is output as JSON? Is this a recommendation for commonmark instead?
  11. Commonmark recommendation only?
  12. Commonmark recommendation only?
  13. Not sure what you are referring to?

#16

I’m preparing a document that describes my ideas for PSV in more detail.

Table bodies and footers

HTML supports multiple <tbody> elements per table and also an optional <tfoot>, while GFM pipe tables support a mandatory <thead> and zero or one <tbody>. These are row groups, of course, and the CSS display value is the same for all of them. I see 5 major variants to improve on this, most of which have two sub-variants. GFM is a subset of variant 1; I prefer 3b, which is a superset of 1.

               ------              ======                                  ====== 
 Head           Head                Head                Head                Head
------         ------              ------              ======              ====== 
 Body           Body                Body                Body                Body   
------         ------              ------              ======              ======  
 Foot           Foot                Foot                Foot                Foot 
               ------              ======                                  ======

^^^^^^    ^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^

          ------    ------    ======    ======                        ======    ======
 Head      Head      Head      Head      Head      Head      Head      Head      Head 
------    ------    ------    ------    ------    ======    ======    ======    ======
 Body      Body      Body      Body      Body      Body      Body      Body      Body 
------              ------              ------    ------    ======    ------    ======
 Body      Body      Body      Body      Body      Body      Body      Body      Body 
------    ------    ------    ------    ------    ======    ======    ======    ======  
 Foot      Foot      Foot      Foot      Foot      Foot      Foot      Foot      Foot 
          ------    ------    ======    ======                        ======    ======

  1         2a        2b        3a        3b        4a        4b        5a        5b

Block content, line spanning

I think I’ve cracked block contents and row spans for GFM pipe tables, but am not completely sure how to do column spans yet. The general idea for row spans is to fill all cells that do not continue from the row above with dashes.

|  head  |  head  |
| ------ | ------ |
| 1. one | cell   |
| 2. two | ------ |
| cell   | cell   |
| cell   | 1. one |
| ------ | 2. two |
| ------ | 3. end |
.
<table>
<thead>
<tr><th>head</th><th>head</th></tr>
</thead>
<tbody>
<tr><td><ol><li>one</li><li>two</li></ol></td><td>cell</td></tr>
<tr><td>cell</td><td>cell</td></tr>
<tr><td>cell</td><td><ol><li>one</li><li>two</li><li>end</li></ol></td></tr>
</tbody>
</table>

If you do not have any single-line cell in a row, you would need to add another bogus one. The simplest GFM-conformant solution is simply appending a dash at the end of all lines.

|  head  |  head  |-
| ------ | ------ |-
| 1. one | 1. one |-
| 2. two | 2. two |-
| ------ | 3. end |-
.
<table>
<thead>
<tr><th>head</th><th>head</th></tr>
</thead>
<tbody>
<tr>
<td><ol><li>one</li><li>two</li></ol></td>
<td><ol><li>one</li><li>two</li><li>end</li></ol></td>
</tr>
</tbody>
</table>

Without the empty column at the end, which should be ignored on output, the parsing would be different.

|  head  |  head  |
| ------ | ------ |
| 1. one | 1. one |
| 2. two | 2. two |
| ------ | 3. end |
.
<table>
<thead>
<tr><th>head</th><th>head</th></tr>
</thead>
<tbody>
<tr><td>1. one</td><td>1. one</td></tr>
<tr><td><ol><li>two</li></ol></td><td><ol><li>two</li><li>end</li></ol></td></tr>
</tbody>
</table>

The problem with this approach is that a parser may have to look at many rows in advance to determine whether there is a valid block context.


#17

Dev Branch for psv.js online at /psv-format/psv.js

@Crissov I have now created a development branch for the javascript implementation of psv at https://github.com/psv-format/psv.js/tree/development . How would you like to communicate? Shall we stick to this forum? I decided to lift the framework of commonmark.js so we don’t have to rethink too deeply about organising stuff like spec tests etc., and can hopefully just jump into sketching a proof of concept.

I noticed that commonmark js implementation is split into multiple modules

module.exports.Node = require('./node');
module.exports.Parser = require('./blocks');
module.exports.HtmlRenderer = require('./render/html');
module.exports.XmlRenderer = require('./render/xml');

I wonder how psv shall be organised, considering that I think we are focusing mostly on “psv” to “json” and “json” to “psv”, if I’m correct. Unless anyone can interject with other data formats that people would rather we support as well… so with that in mind, this is my general sketch of index.js in ./lib/

module.exports.ParseWorkBook = require('./workbook');
module.exports.ParseSheet = require('./sheet');
module.exports.ParseTable = require('./table/parse-table');
module.exports.PsvRender = require('./render/psv');
module.exports.JsonRender = require('./render/json');
// module.exports.XmlRenderer = require('./render/xml'); // TODO
// module.exports.YAMLRenderer = require('./render/yaml'); // TODO
// module.exports.CBORRenderer = require('./render/cbor'); // TODO

This is because in Excel, a workbook has multiple “sheets” in it. Each sheet can then have comments, metadata, schemas, and of course a table.


Complexity Consideration

Also, what are your thoughts about complexity? Should we aim to minimise complexity in such a way that we can explain how to make a PSV reader via a simple flow chart? If so, maybe we want to avoid the need to look forward multiple rows to determine block context?


Keeping psv in lockstep with commonmark

I wonder how we can keep psv in lockstep with commonmark, so that once commonmark gets pipe tables, a psv document can at least be mostly readable as a commonmark document as well.


Other table parsers and documents in other repos

It’s good to avoid redoing things; these are other efforts we can be inspired by.


#18

Okay… I’ve given a byte-by-byte parsing strategy a shot first, in a file named table.js in the dev branch https://github.com/psv-format/psv.js/blob/development/lib/table.js . You can test it by running this in the lib folder. Oh and btw, for now I’m only supporting JSON values; I’m just experimenting with the overall parsing structure.

table = new (require("./table"))();
table.parse(
`| "test 1" | "test 2" | "test 3" |
| ------- | ------- | ------- |
| "A1" | "A2" | "A3" |
| "B1" | "B2" | "B3" |
| 1 | 2 | -3 |
| true | false | null |
| C1 | C2 | C3 |`);

And it is interpreted as:

header:test 1,test 2,test 3
0 : { 'test 1': 'A1', 'test 2': 'A2', 'test 3': 'A3' }
1 : { 'test 1': 'B1', 'test 2': 'B2', 'test 3': 'B3' }
2 : { 'test 1': 1, 'test 2': 2, 'test 3': -3 }
3 : { 'test 1': true, 'test 2': false, 'test 3': null }
4 : { 'test 1': 'C1', 'test 2': 'C2', 'test 3': 'C3' }

So far, what I observed while coding are these concerns:
In JSON, it is pretty clear whether a value is a string or an integer, without much backtracking.
However, in most markdown tables this cannot be assumed, as strings are not required to be quoted; thus we need to backtrack if a cell does not contain valid JSON values, and interpret it as a plain string.

Also, how do we deal with an omitted row divider? What if there are multiple start headers? Also, marktable has table ids and table captions; is this something that should be used?

It would be interesting to hear your thoughts.


p.s. Part of the code was lifted from https://github.com/douglascrockford/JSON-js/blob/master/json_parse.js . It’s public domain.


#19

Interestingly, JSON accepts a single value. Thus, to save time, I tried pushing each cell string into a JSON parser. It did the job very well for recognised JSON values.

table = new (require("./table"))();
table.parse(
`| "test 1" | "test 2" | "test 3" |
|---------|---------|---------|
| "A1" | "A2" | "A3" |
| "B1" | "B2" | "B3" |
| 1 | 2 | -3 |
| true | false | null |
| [1,2] | [3,4] | [5,6] |
| C1 | C2 | C3 |

`);

console.log(table)

Outputs this object

Table {
  id: undefined,
  headers: [ 'test 1', 'test 2', 'test 3' ],
  data: 
   [ { 'test 1': 'A1', 'test 2': 'A2', 'test 3': 'A3' },
     { 'test 1': 'B1', 'test 2': 'B2', 'test 3': 'B3' },
     { 'test 1': 1, 'test 2': 2, 'test 3': -3 },
     { 'test 1': true, 'test 2': false, 'test 3': null },
     { 'test 1': [Array], 'test 2': [Array], 'test 3': [Array] },
     { 'test 1': 'C1', 'test 2': 'C2', 'test 3': 'C3' } ] }

Thus, by just using a JSON parser, we can already make psv deal with more complicated cell inputs.
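For reference, that JSON-with-fallback approach boils down to something like this sketch (not the actual psv.js code):

```javascript
// Sketch: interpret one PSV cell by handing it to a JSON parser;
// if it is not valid JSON, fall back to treating it as a plain string.
function parseCell(raw) {
  const text = raw.trim();
  try {
    return JSON.parse(text); // numbers, booleans, null, quoted strings, arrays
  } catch (e) {
    return text; // not valid JSON: keep it as an unquoted string
  }
}
```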

Is this good enough by itself?