Plain text outlining and the destructive tab → 4-space preprocess

The proposed tab → 4-space preprocess is lossy: it drops the tab-nesting information that plain text outliners such as @foldingtext rely on, and outlining seems likely to become an important use of MD.

The Gruber spec clearly doesn’t anticipate outlining, and creates an ambiguous frontier between code lines and tab-nested list or body lines.

Treating an MD document as an outline (any node with hash-defined or tab-indent-defined children can be collapsed, hoisted, moved with its children, etc. in @foldingtext) is clearly an advance, and one which depends on preserving the distinction between tabs and spaces.

The destructive tab to space preprocess seems retrogressive, and parts company with plain text outlining.

If this spec is going to become anything more than a MacFarlaneMarkDown™ niche, I think the conflation of tabs with spaces might be worth reviewing.

Rob

I’m not that familiar with FoldingText, but I don’t see what the issue is here. Are you saying that FoldingText relies on literal tab characters in the HTML? That seems like a design problem on their end.

I don’t get what the issue is with Markdown being “lossy” either. In a sense, Markdown is lossy by design: there are multiple ways of producing identical HTML output, so it’s impossible to recover the Markdown information from the HTML.

I agree, actually, and in pandoc I’ve set up the parser to handle input with tabs. (There’s an option to disable tab expanding.)

This is not an idiosyncratic preference of mine – we’re just following what just about every other Markdown implementation does, and what Gruber’s syntax document explicitly calls for.

I am thinking not of HTML generation (many of my MD outlines become diagrams, MS Word documents, etc. rather than HTML), but rather (to use your own formulation) of the use of MD as:

a plain text format for writing structured documents.

and of your proposal for:

a standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests to validate Markdown implementations against this standard.

I am very much in favour of Haskell-intelligible rigour, and, indeed, of standards, but I would resist any reductive definition which had the effect of limiting (in respect of tab-indentation and code-vs-indented-body) the range of documents correctly interpreted by compliant tools.

Gruber-canonicity seems less valuable to me than the full representation of nesting structure in plain text documents. (In fact I happen to see Gruber’s perfectly understandable failure to anticipate plain text outlining as something more in need of fixing than of unduly reverent fossilisation).

But as long as Pandoc and other tools don’t insist on reading my nested lines as code, I will be happy.

Rob

Then why does the preprocessor matter? As far as I understand it, the line

Tabs in lines are expanded to spaces, with a tab stop of 4 characters

is saying that a Markdown processor should first replace tabs with spaces before doing further processing, not that your plain Markdown files cannot have hard tabs in them.
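To make that step concrete, here is a minimal sketch of tab expansion with a tab stop of 4. The function name `expand_tabs` is my own and is not taken from any Markdown implementation; it only illustrates the rule quoted above, and shows that a tab-indented line and a space-indented line become indistinguishable once the step has run.

```python
def expand_tabs(line: str, tab_stop: int = 4) -> str:
    """Replace each tab with spaces up to the next multiple of tab_stop.

    Hypothetical helper for illustration; equivalent to
    line.expandtabs(tab_stop) for single-line input.
    """
    out = []
    col = 0
    for ch in line:
        if ch == "\t":
            width = tab_stop - (col % tab_stop)  # distance to next tab stop
            out.append(" " * width)
            col += width
        else:
            out.append(ch)
            col += 1
    return "".join(out)


tabbed = "\t\tSubordinate note"          # indent made of two tabs
spaced = " " * 8 + "Subordinate note"    # indent made of eight spaces

# After the preprocess the two inputs are identical strings.
same_after_expansion = expand_tabs(tabbed) == expand_tabs(spaced)
```

The source file itself is untouched by this; the question in this thread is only whether the parser may still see the difference.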

What is your use case that breaks due to the soft tab step?

To concretise, the document-structuring ambiguity arises at this point of Gruber’s first formulation:

To produce a code block in Markdown, simply indent every line of the block by at least 4 spaces or 1 tab.

(my emphasis on “or”)

The conjunction is unfortunate – it sacrifices (for no clear gain) a valuable distinction in the representation of document structure – between code blocks (indent terminating with four spaces) and nested body text (indent consisting only of tabs).

Thus, for example, the last line here:

# Macro point (all nesting below by tab)
- Supporting point
	- sub supporting
		- sub sub supporting

				Subordinate note

can be usefully parsed by FoldingText as subordinate (nested) body text if it is purely tab-indented:

┌ Tree
┠┰∅ [0, root]
┇┠┰# Macro point (all nesting below by tab) [1, heading]
┇┇┠┰- Supporting point [2, unordered]
┇┇┇┠┰▸- sub supporting [3, unordered]
┇┇┇┇┠┰▸▸- sub sub supporting [4, unordered]
┇┇┇┇┇┠ ∅ [5, empty]
┇┇┇┇┇┠ ▸▸▸▸Subordinate note [6, body]

Whereas if its indent terminates with (or consists of) a multiple of 4 spaces, the last line can be parsed as code:

┌ Tree
┠┰∅ [0, root]
┇┠┰# Macro point (all nesting below by tab) [1, heading]
┇┇┠┰- Supporting point [2, unordered]
┇┇┇┠┰▸- sub supporting [3, unordered]
┇┇┇┇┠┰▸▸- sub sub supporting [4, unordered]
┇┇┇┇┇┠ ∅ [5, empty]
┇┇┇┇┇┠ ▸▸▸    a line of code [6, codeblock]

A standardisation which locked out this simple scope for finer representation of document structure would, I think, be retrogressive.
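The distinction drawn in the two trees above can be sketched in a few lines. This is only an illustration of the rule as described in this post (tabs-only indent means nested body, indent ending in four spaces means code); `classify_line` is a hypothetical name and this is not FoldingText's actual parser.

```python
def classify_line(line: str) -> str:
    """Label a line using the tab-vs-space rule described above.

    Hypothetical sketch: it ignores ordered lists, fenced code, etc.
    """
    text = line.lstrip("\t ")
    indent = line[: len(line) - len(text)]
    if not text.strip():
        return "empty"
    if indent.endswith(" " * 4):
        return "codeblock"      # indent terminates with four spaces
    if text.startswith("#"):
        return "heading"
    if text.startswith("- "):
        return "unordered"
    return "body"               # unindented or purely tab-indented text


outline = [
    "# Macro point (all nesting below by tab)",
    "- Supporting point",
    "\t- sub supporting",
    "\t\t- sub sub supporting",
    "",
    "\t\t\t\tSubordinate note",
]
labels = [classify_line(line) for line in outline]
code_label = classify_line("\t\t\t    a line of code")
```

Note that the rule only works if the parser sees the raw indent: once tabs have been expanded to spaces, the last line of the first example ends in four spaces like any other.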

I disagree - your example, assuming first that we are not preprocessing tabs into spaces, should be parsed as a code block consisting of a tab character followed by “Subordinate note”. In other words:

	Subordinate note

It should be parsed like that because if you take the line, and remove the indentation given by the current list, you end up with two tab characters followed by “Subordinate note”. And according to that original specification, a code block is produced by tab characters, so the first tab character is consumed to produce the code block.

The only significant difference there is that the output code block has a tab character while the code block from the stmd definition would have 2 spaces.
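A rough sketch of the two readings, to show where the tab and the 2 spaces come from. The function names are hypothetical, and the second one assumes that the list item's content starts at the column right after its "- " marker; neither is taken from an actual implementation.

```python
TAB_STOP = 4

item_line = "\t\t- sub sub supporting"
note_line = "\t\t\t\tSubordinate note"


def literal_tab_reading(note: str, item: str) -> str:
    """Tabs kept literal: drop the list's tabs, then one tab for the code block."""
    list_tabs = len(item) - len(item.lstrip("\t"))  # 2 tabs of list nesting
    rest = note[list_tabs:]                         # two tabs + text remain
    return rest[1:]                                 # one tab consumed as code indent


def expanded_reading(note: str, item: str) -> str:
    """Tabs expanded first: drop the list's content column, then 4 spaces."""
    note_x = note.expandtabs(TAB_STOP)
    item_x = item.expandtabs(TAB_STOP)
    content_col = item_x.index("- ") + 2            # assumed content column
    rest = note_x[content_col:]
    return rest[4:]                                 # 4 spaces consumed as code indent


literal_content = literal_tab_reading(note_line, item_line)
expanded_content = expanded_reading(note_line, item_line)
```

Here `literal_content` is a tab followed by the text, and `expanded_content` is two spaces followed by the text. Both readings yield a code block; they differ only in the leading whitespace kept inside it.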

I don’t see how that adds anything.

should be parsed as a code block

Fortunately, FoldingText disagrees with you :-)

“should be” code if what you value is reverence for the foundational text, but is rather more usefully interpreted as nested body text if what you value is plain text outlining.

Spaces suffice for indicating code blocks. I see no valid formal argument for wasting the document-structuring potential of tabs.

(The huge pragmatic value of FoldingText and its parser seems persuasive enough to me – folding and hoisting any tab-nested or hash-subordinated child elements – you should try it :-)