Metadata in documents

jonschlinkert · November 19, 2014, 5:25am

I strongly oppose this feature and hope that we can see through the flaws in this idea enough to drop it. As a standard, markdown should stay simple and focused on ease of implementation and universal compatibility. By introducing a data format - of any kind and any level of complexity - we will be introducing a feature that complicates this medium and cripples how other libraries can work with it.

Markdown can stand on its own, but metadata cannot. It must have an ultimate purpose, like being passed to a template engine as context to be used in templates, or being passed to renderers/parsers as an options object, whatever is required for the use case. Given that, we need to allow implementors to use whatever solution makes the most sense for parsing metadata and use the markdown parser they want for parsing markdown.

By implementing metadata, markdown will now have “compatibility issues”. As it stands, markdown has a clear purpose in life, which makes it easy to see how it fits into any application. This will not be the case if data enters the picture. The problem is that, regardless of best intentions, this feature will never be able to satisfy the needs of every user, parser, renderer, template engine, or implementor who might need such data. This means that other solutions will still need to be implemented for parsing data, which not only complicates decisions and implementation strategies, but also virtually guarantees confusion among users who want to use both this solution and the implementor’s solution, or some combination of the two.

Markdown is not a data format, but it will be if this is implemented. We’ll need to decide which data format is correct, how much is “just enough”, who the consumers will be, etc. and this slippery slope will ultimately lead to religious battles over how much data is too much and: 1) why “my favorite data format isn’t supported”, 2) “can I use this data along with my jekyll front matter, or instead of it? because then I can’t use all of jekyll’s features”, etc. etc.

Data formats are use-case specific, and should not be related to “file type”: e.g. there are many document data and front matter parsers for many use cases, and none of them have any specific relationship to markdown. Why are we trying to create one? In other words, since front matter parsers will parse front matter from any file type (e.g. markdown, handlebars templates, HTML documents, whatever), if this feature is implemented, how should users format their data when both templates and markdown files are used? Should they ask the front matter parsing library to adopt the format you decide on here for handlebars templates (not going to happen)?

Parsing front matter is trivial. One can write a front-matter parser to extract data from a document in ~20 sloc, the result of which provides them with a nice, clean string of pure markdown, and an object of data that was created from whatever language the implementer preferred to use. By implementing this feature in markdown, you will greatly complicate this task by necessitating strategies for data conflict resolution and so on.
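To make the "~20 sloc" claim concrete, here is a minimal sketch in Python. It is not taken from any existing library: the names `split_front_matter` and `FENCE` are made up for illustration, and JSON stands in for "whatever language the implementer preferred" only because it ships with the standard library. It also assumes that a file with no closing fence is treated as plain markdown.

```python
import json
from typing import Any, Callable, Dict, Tuple

FENCE = "---"  # assumed delimiter, Jekyll-style


def split_front_matter(
    text: str,
    parse: Callable[[str], Dict[str, Any]] = json.loads,
) -> Tuple[Dict[str, Any], str]:
    """Return (data, body). Files without front matter yield ({}, text)."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != FENCE:
        return {}, text
    # Look for the closing fence after the opening one.
    for i in range(1, len(lines)):
        if lines[i].strip() == FENCE:
            raw = "".join(lines[1:i])
            body = "".join(lines[i + 1:])
            return parse(raw), body
    # Assumption: no closing fence means the whole file is markdown.
    return {}, text


doc = '---\n{"title": "Hello", "draft": false}\n---\n# Heading\n\nBody text.\n'
data, body = split_front_matter(doc)
print(data)
print(body)
```

The `parse` argument is the point of the design: the splitter knows nothing about the data format, so swapping in a YAML or TOML loader would not touch the markdown side at all.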

chrisalley · November 19, 2014, 6:21am

You raise some excellent points @jonschlinkert.

Is there a reason why meta data needs to be placed in the same file? A separate yaml file that points to the Markdown file could solve the Jekyll use case at least. Separation of concerns.

vitaly · November 19, 2014, 1:07pm

Yes. That’s convenient sometimes. For example, in blog posts (title).

But I’m not sure such posts should be parsed directly by a markdown parser, without a preprocessor.

mofosyne · November 19, 2014, 2:20pm

I think it’s much safer if it’s in a block generic directive, since displaying metadata between platforms (html, paper, etc…) is highly variable, and data (and thus metadata) in general are much more fragile than normally human typed text.

Maybe metadata should be thought of as “recommended” best practices when used as settings for various generic directives (basically restricted YAML syntax). As for metadata embedded within a document (rather than risking it getting lost by placing it in a separate file), it is best used with its own generic directive.

For those that need it maybe we can support these generic directives:

!metadata :~ Included in all parsers, but a stripped-down YAML, to cover the most obvious use cases. This should remain small and mostly unchanged throughout the life of CommonMark.

!YAML :~ optional extension (included in fatter parsers like pandoc) of full YAML

!json :~ optional extension of full json

So basically, keep the default metadata syntax as small as possible, and make full metadata support optional. Hopefully this addresses jonschlinkert’s concerns that this would make CommonMark too complex and unwieldy.

Knagis · November 19, 2014, 6:38pm

I think that the CommonMark spec should contain at least some very simple and minimal metadata format, so that applications that rely on a CommonMark parser would be able to use at least trivial key-value pairs out of the box, for example ^([0-9a-z]+):([^\n]*)$. If the app needs more, it can store a single value holding base64-encoded data, a JSON object, or anything else; for everyone else it will be just a string.
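A small sketch of what that could look like, using the pattern exactly as quoted. The function name `parse_pairs` is hypothetical, stripping whitespace around values is an assumption of this sketch, and note that the pattern as written only accepts lowercase alphanumeric keys.

```python
import json
import re
from typing import Dict

# The pattern quoted in the post; MULTILINE makes ^ and $ match per line.
KEY_VALUE = re.compile(r"^([0-9a-z]+):([^\n]*)$", re.MULTILINE)


def parse_pairs(block: str) -> Dict[str, str]:
    """Collect key/value pairs; lines that do not match are ignored."""
    # Assumption: surrounding whitespace in the value is not significant.
    return {key: value.strip() for key, value in KEY_VALUE.findall(block)}


meta = parse_pairs(
    "title: Metadata in documents\n"
    "author: knagis\n"
    'extra: {"tags": ["a", "b"]}\n'
    "Not A Pair\n"
)

# To the minimal format, "extra" is just a string. An application that
# wants structure decodes it itself.
tags = json.loads(meta["extra"])["tags"]
print(meta["title"], tags)
```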

Since metadata is in fact application-specific, an application that requires something very complex can be expected to implement that on its own but it would help other developers if the minimum is already in place.

Another nice thing the spec could do is to list common metadata keys, such as Author, Title, Description etc. - so that content management systems have a reference from where to read such attributes.

jgm · November 19, 2014, 11:15pm

I think I agree with all of @jonschlinkert’s comments here.

But since it will presumably be common to have “hybrid” documents with some metadata at the top, followed by CommonMark text, I wonder if it would make sense to have the spec define a recognizer for front-matter metadata, so that all CommonMark parsers would know to skip this and go right to the text.

For example: if the document starts with a line containing ---, skip to the next line containing just --- or ..., and start parsing CommonMark after that.

This would fall far short of specifying a metadata format. Between the opening and closing metadata signs, you could have anything you like – so, you could use YAML, or JSON, or XML, or lua tables, or a custom key-value store. Parsing this would be application-specific, but conforming CommonMark parsers would know to skip it.
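The recognizer described above can be sketched in a few lines. This is an illustration, not spec text: `skip_front_matter` is a made-up name, "a line containing ---" is read here as a line consisting only of the marker (trailing whitespace allowed), and an unterminated block is treated as ordinary CommonMark, which the proposal does not specify.

```python
from typing import List


def skip_front_matter(lines: List[str]) -> int:
    """Return the index of the first line a CommonMark parser should see."""
    if not lines or lines[0].rstrip() != "---":
        return 0
    for i in range(1, len(lines)):
        # Either --- or ... closes the block; its content is never inspected.
        if lines[i].rstrip() in ("---", "..."):
            return i + 1
    # Assumption: no closing marker means there is no front matter.
    return 0


source = "---\n<meta lang='xml'/>\n...\nHello *world*\n".splitlines()
start = skip_front_matter(source)
markdown_lines = source[start:]
print(markdown_lines)
```

The function never looks at what lies between the markers, which is what keeps the metadata format application-specific.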

The advantage is that, with this feature, you could run your hybrid metadata/CommonMark file through any CommonMark parser and get good results, not the garbage that would result if the metadata were parsed as CommonMark.

codinghorror · November 21, 2014, 6:20am

There are several topics here requesting an official “Do Not Even Attempt to Parse This Section” delimiter or block element.

That seems like the safer, saner choice.

mofosyne · November 22, 2014, 10:52am

Jekyll is the most popular program for static site generation according to https://www.staticgen.com/

As widely known here, it uses --- to --- for encasing YAML entries.

Any other programs that adopt the same format? The concept of treating --- to --- or ... as a ‘do not parse’ section for CommonMark makes sense, but it is pretty much a ‘do not parse’ command that could only be safely included at the top, due to the potential clash with the --- horizontal rule (unless I am mistaken). Should there be a more general delimiter or fencing character for a “do not parse” command?

Either way, a Jekyll-style “do not parse” section is a good approach for dealing with metadata in an impartial manner.

Extra: we could perhaps avoid the ‘clash with horizontal rule’ by simply disallowing empty lines between the --- markers.

jgm · November 22, 2014, 4:56pm

+++ mofosyne [Nov 22 14 11:04 ]:

Any other programs that adopt the same format?

Among those known to me: Pandoc, Hakyll, Gitit.

chrisalley · November 22, 2014, 8:54pm

Some other implementations were mentioned by @lu_zero in the document titles topic.

Also, Middleman.

mofosyne · November 23, 2014, 1:07am

Okay, well it seems pretty clear then. “Do not parse or show” :

http://talk.commonmark.org/t/jekyll-style-do-not-show-sections/918

This is not a “Do not parse, but show” island, but rather a “Do not parse or show” island. We need a different syntax for those who want “no parsing, but show in HTML”.

The “Do not parse or show” will be useful for comments, application specific metadata, etc…

It is recommended to have an implementer’s guide that records a general consensus on optional metadata interpretation, so that most simple documents can have readable metadata (e.g. stripped-down YAML). But the core of CommonMark will not include metadata interpretation (it could have a hook, though).

chrisalley · November 23, 2014, 2:50am

If you want to render the data as HTML, why not just use a description list?

mofosyne · November 23, 2014, 3:19am

That’s fine. What I meant was a way for those who just want text to go straight to html/doc/etc. without any parsing, but not hidden, so that it stays available for external parsing (e.g. a no-markdown island). Not visual metadata.

chrisalley · November 23, 2014, 4:19am

No-Markdown islands are an altogether different kind of data, closer to code blocks than to the type of meta data used by Jekyll. For reference, there’s already a topic about no-Markdown islands. I agree that (since these two features are quite different) they should have different syntax.

mofosyne · November 23, 2014, 4:22am

Okay:

For metadata:

Use an external metadata parser, with CommonMark ignoring anything between Jekyll-style fences:

http://talk.commonmark.org/t/jekyll-style-do-not-show-sections/

For no markdown island:

http://talk.commonmark.org/t/no-markdown-islands/

eugeneware · November 24, 2014, 11:04am

FWIW I’ve added a YAML metadata parser for the remarkable markdown parser here: https://github.com/eugeneware/remarkable-meta.

I’ve taken the approach of just using --- separators at the top of the file for now, though this could be configurable.

Rick · November 29, 2014, 11:06pm

Static website generators markers for YAML

Pelican uses --- and ---

Hugo uses --- and ...

RD

jonschlinkert · December 24, 2014, 11:48am

Based on my previous comment, and in case it’s useful or helps to distinguish between what is necessary in markdown and what can be handled by an “external” tool: I created a lib called gray-matter for parsing front matter from markdown files. YAML is the most popular front-matter language, but gray-matter can also parse coffee-front-matter and JSON front matter. It’s very stable, it’s the fastest implementation I’ve tested, and it’s used on hundreds of projects (including Assemble).

mofosyne · December 24, 2014, 3:49pm

I really liked how you allowed for different languages in gray-matter via option switches.

Just one suggestion. Can you allow for an optional descriptive field after the ‘language specifier’? This is most useful for repeated metadata within a document. E.g. slideshow apps might need to set a different stylesheet for each slide, so they need to be able to distinguish between different metadata boxes.

---yaml: slide01 ---
CSS: style.css
---
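For illustration only, here is one way an opening fence like the one above could be split into a language and a label. This is a sketch of the proposed syntax, not of anything gray-matter does: the pattern `OPENING` and the function `parse_opening_fence` are hypothetical, and the grammar (letters for the language, word characters for the label, an optional trailing ---) is assumed from the single example.

```python
import re
from typing import Optional, Tuple

# Assumed grammar: ---[language][: label][ ---]
OPENING = re.compile(r"^---\s*([A-Za-z]+)?\s*(?::\s*(\w+))?\s*(?:---)?\s*$")


def parse_opening_fence(line: str) -> Optional[Tuple[Optional[str], Optional[str]]]:
    """Return (language, label) for a fence line, or None if it is not one."""
    match = OPENING.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_opening_fence("---yaml: slide01 ---"))
print(parse_opening_fence("---"))
print(parse_opening_fence("CSS: style.css"))
```

A plain --- fence still parses, with both fields empty, so documents without labels would be unaffected.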

jonschlinkert · December 24, 2014, 3:55pm

Thanks!

Can you allow for optional descriptive field after the ‘language specifier’?

We could, but it depends on the specifics. I’ve thought about a need for something similar. Would you want to continue the discussion in a gray-matter feature request? It might be better to move it there.