Metadata in documents

I think I agree with all of @jonschlinkert’s comments here.

But since it will presumably be common to have “hybrid” documents with some metadata at the top, followed by CommonMark text, I wonder if it would make sense to have the spec define a recognizer for front-matter metadata, so that all CommonMark parsers would know to skip this and go right to the text.

For example: if the document starts with a line containing ---, skip to the next line containing just --- or ..., and start parsing CommonMark after that.
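A minimal sketch of the recognizer described above, in Python. The function name `split_front_matter` is made up for illustration, and the sketch assumes the delimiter lines must be exactly `---` or `...` apart from trailing whitespace; the proposal itself does not pin that down.

```python
def split_front_matter(text):
    """Return (metadata, body); metadata is None when no front matter is found."""
    lines = text.splitlines(keepends=True)
    # Assumption: front matter is recognized only when the very first line is '---'.
    if not lines or lines[0].rstrip() != "---":
        return None, text
    for i in range(1, len(lines)):
        # The block closes at the next line that is just '---' or '...'.
        if lines[i].rstrip() in ("---", "..."):
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    # No closing line was found: treat the whole input as ordinary CommonMark.
    return None, text
```

The metadata is returned as an uninterpreted string, so the caller can hand it to whatever YAML, JSON, or custom parser the application uses, and pass only the body to the CommonMark parser.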

This would fall far short of specifying a metadata format. Between the opening and closing metadata signs, you could have anything you like – so, you could use YAML, or JSON, or XML, or lua tables, or a custom key-value store. Parsing this would be application-specific, but conforming CommonMark parsers would know to skip it.

The advantage is that, with this feature, you could run your hybrid metadata/CommonMark file through any CommonMark parser and get good results, not the garbage that would result if the metadata were parsed as CommonMark.

There are several topics here requesting an official “Do Not Even Attempt to Parse This Section” delimiter or block element.

That seems like the safer, saner choice.

Jekyll is the most popular program for static site generation according to https://www.staticgen.com/

As is widely known here, it uses --- to --- for encasing YAML entries.

Do any other programs adopt the same format? The concept of treating --- to --- or ... as a ‘do not parse’ section for CommonMark makes sense, but it is pretty much a ‘do not parse’ command that can only safely be included at the top, due to the potential clash with the --- horizontal rule (unless I am mistaken). Should there be a more general delimiter or fencing character for a “do not parse” command?

Either way, a Jekyll-style “do not parse” section is a good approach for dealing with metadata in an impartial manner.


Extra: the clash with the horizontal rule could perhaps be avoided by simply disallowing empty lines between the --- delimiters.
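One way to read that rule, sketched in Python: an opening `---` counts as a metadata fence only if a closing `---` or `...` is reached before any blank line. The function name `front_matter_end` is hypothetical, and accepting `...` as a closer is an assumption carried over from the earlier proposal.

```python
def front_matter_end(lines):
    """Return the index of the closing fence, or None if no metadata block opens here."""
    if not lines or lines[0].rstrip() != "---":
        return None
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "":
            # A blank line before the closing fence: under this rule the
            # opening '---' is left to be parsed as a horizontal rule.
            return None
        if line.rstrip() in ("---", "..."):
            return i
    return None
```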

+++ mofosyne [Nov 22 14 11:04 ]:

Do any other programs adopt the same format?

Among those known to me: Pandoc, Hakyll, Gitit.

Some other implementations were mentioned by @lu_zero in the document titles topic.

Also, Middleman.

Okay, well it seems pretty clear then. “Do not parse or show” :

http://talk.commonmark.org/t/jekyll-style-do-not-show-sections/918

This is not quite a “Do not parse, but show” island; rather, it is a “Do not parse or show” island. We need a different syntax for those who want “no parsing, but show in HTML”.

The “Do not parse or show” will be useful for comments, application specific metadata, etc…

It is recommended to have an implementer’s guide giving a general consensus on optional metadata interpretation, so that most simple documents can have readable metadata (e.g. stripped-down YAML). But the core of CommonMark will not include metadata interpretation (it could have a hook, though).

If you want to render the data as HTML, why not just use a description list?

That’s fine. What I meant is a feature for those who just want text to go straight to HTML, doc, etc. without any parsing, but not hidden from external parsing (e.g. a no-Markdown island), rather than visual metadata.

No-Markdown islands are an altogether different kind of data, closer to code blocks than to the type of metadata used by Jekyll. For reference, there’s already a topic about no-Markdown islands. I agree that (since these two features are quite different) they should have different syntax.

Okay:

For metadata:

Use an external metadata parser, with CommonMark ignoring anything between Jekyll-style fences:

http://talk.commonmark.org/t/jekyll-style-do-not-show-sections/

For no markdown island:

http://talk.commonmark.org/t/no-markdown-islands/

FWIW I’ve added a YAML metadata parser for the remarkable markdown parser here: https://github.com/eugeneware/remarkable-meta.

I’ve taken the approach of just using --- separators at the top of the file for now, though this could be configurable.

Static website generators markers for YAML

Pelican uses --- and ---

Hugo uses --- and ...

RD

Based on my previous comment, in case it’s useful or helps to distinguish between what is necessary in Markdown and what can be handled by an “external” tool: I created a lib called gray-matter for parsing front matter from Markdown files. YAML is the most popular front-matter language, but gray-matter can also parse coffee-front-matter and JSON front matter. It’s very stable, it’s the fastest implementation I’ve tested, and it’s used on hundreds of projects (including Assemble).

I really liked how you allowed for different languages in gray-matter via option switches.

Just one suggestion. Can you allow for an optional descriptive field after the ‘language specifier’? This is most useful for repeated metadata within a document. E.g. slideshow apps might need to set a different stylesheet for each slide, so they need to be able to distinguish between different metadata boxes.

---yaml: slide01 ---
CSS: style.css
---
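To make the suggestion concrete, here is a rough Python sketch of how an opening fence with a language specifier and an optional label might be recognized. The grammar encoded in the regular expression is only a guess at the syntax shown above; it is not something gray-matter is known to support, and the name `parse_opening_fence` is hypothetical.

```python
import re

# Hypothetical opening-fence syntax, guessed from the example above:
#   ---<language>[: <label>] ---
OPENING = re.compile(r"^---\s*([A-Za-z]\w*)\s*(?::\s*(\S.*?))?\s*---\s*$")


def parse_opening_fence(line):
    """Return the language and optional label of an opening fence, or None."""
    m = OPENING.match(line)
    if m is None:
        return None
    return {"language": m.group(1), "label": m.group(2)}
```

A slideshow tool could then key each metadata box by its label and apply the stylesheet to the matching slide.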

Thanks!

Can you allow for an optional descriptive field after the ‘language specifier’?

We could, but it depends on the specifics. I’ve thought about a need for something similar. Would you want to continue the discussion in a gray-matter feature request? It might be better to move the discussion there.

I like the terseness and “calmness” of pandoc-style headers, or “Metadata Blocks”: Just (up to) three lines right in front of the Markdown typescript, each line beginning with a “%”. This looks like this:

% Document Title
% A. Uthor
% 2015-01-01
... Document content begins here ...
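For illustration, a header like the one above could be split off with a few lines of Python. This is a simplified sketch: it assumes at most three single-line fields in the fixed order title, author, date, and ignores the continuation lines and multiple authors that pandoc itself allows. The name `parse_title_block` is made up.

```python
def parse_title_block(text):
    """Split up to three leading '%' lines off the document; return (meta, body)."""
    fields = ("title", "author", "date")  # assumed fixed order
    lines = text.splitlines()
    meta = {}
    consumed = 0
    for name, line in zip(fields, lines):
        if not line.startswith("%"):
            break
        meta[name] = line[1:].strip()
        consumed += 1
    body = "\n".join(lines[consumed:])
    return meta, body
```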

I have implemented this kind of “Meta-Information” in my clone of cmark (in cm2html, to be exact), which puts the meta-information into Dublin Core <META> elements in the HTML <HEAD>, and uses the title for the <TITLE> element too (if not overridden by a command-line option specifying the <TITLE> to use this time).

The resulting HTML header is shown below. I particularly longed for this feature (copied from the discount parser I use too) because ISO HTML requires—among other things—each HTML document to have a <TITLE> to be valid, and there’s no good way to specify the title otherwise (well, maybe a parser could use the first section header text, but I wouldn’t find that a great work-around).

<!DOCTYPE HTML PUBLIC "ISO/IEC 15445:2000//DTD HTML//EN">
<HTML>
<HEAD>
  <META name="GENERATOR"
        content="cmark 0.22.0 (https://github.com/tin-pot/cmark.git d57b73fedd68)">
  <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <LINK rel="schema.DC"   href="http://purl.org/dc/elements/1.1/">
  <META name="DC.format"  scheme="DCTERMS.IMT"      content="text/html">
  <META name="DC.type"    scheme="DCTERMS.DCMIType" content="Text">
  <META name="DC.title"   content="Document Title">
  <META name="DC.creator" content="A. Uthor">
  <META name="DC.date"    content="2015-01-01">
  <LINK rel="stylesheet"  type="text/css"
        href="default.css">
  <TITLE>Document Title</TITLE>
</HEAD>
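The mapping from the three header fields to the Dublin Core elements shown above can be sketched in Python as follows. This is not the cm2html implementation, only an illustration; the function name `dublin_core_head` and the `meta` dictionary keys are assumptions.

```python
from html import escape


def dublin_core_head(meta):
    """Render title/author/date metadata as Dublin Core META lines plus a TITLE."""
    mapping = (("title", "DC.title"), ("author", "DC.creator"), ("date", "DC.date"))
    out = []
    for key, dc_name in mapping:
        if key in meta:
            # Escape the value so it is safe inside a double-quoted attribute.
            value = escape(meta[key], quote=True)
            out.append('<META name="%s" content="%s">' % (dc_name, value))
    if "title" in meta:
        out.append("<TITLE>%s</TITLE>" % escape(meta["title"]))
    return "\n".join(out)
```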

Thanks for your comments! They cover exactly what I’m struggling with right now.

Background.
I use markdown to store notes, clippings from articles, ideas… So I need something fast and uncomplicated. MD is perfect for this. And CommonMark is even better.

To distinguish reference data (URLs, the language of the note, who said that, etc.) I simply put a paragraph starting with ‘~’ somewhere in the text. This note refers to the enclosing header, or to the whole file if it appears before any header.
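A small Python sketch of that convention, assuming paragraphs are separated by blank lines and headers are ATX-style lines starting with `#`. The function name `collect_notes` is hypothetical, and a real tool would more likely walk the parser's node tree than split the raw text.

```python
def collect_notes(text):
    """Map each header (None for the whole file) to its '~' reference paragraphs."""
    notes = {}
    current = None  # None means "before any header", i.e. the whole file
    for para in text.split("\n\n"):
        para = para.strip()
        if para.startswith("#"):
            # Use only the first line, in case text follows the header directly.
            current = para.splitlines()[0].lstrip("#").strip()
        elif para.startswith("~"):
            notes.setdefault(current, []).append(para[1:].strip())
    return notes
```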

Anything more complicated (or with too much structure) will not be used for my peculiar use case, I’m sure. Also, visually intrusive markup, I think, distracts too much from the real content.

So I suggest the language could define a beginning-of-line marker to say “Ignore this” or “This is special” or “This is a comment”. The standard parsers (pandoc…) could ignore it, but libcmark could create a node for it marked as “CMARK_NODE_CUSTOM_BLOCK”. Then my database-loading code (for example) could use this information with no need to look at each paragraph’s first character to see if it is a ‘~’.

Thanks for the discussion!

Pandoc % block is too limited.
What if I want to mark a section with the language used? The pandoc block is OK for your use case, but then I need a different mechanism for this one.

% Title: multi-language
% language: fr
<text in French>

# Article
% language: en
<text in English>

# Articolo
% language: it
<text in Italian>

I propose that we overload the fenced code block info string to indicate to post-processors that a code block may be interpreted:

```yaml #!
---
foo: bar
---
``` 

Whilst it is a little more verbose than some of the alternatives, it has several benefits:

  • No updates to any parsers
  • When not executed via a post-processor will render as a code block
  • Transparently indicates to post-processors the language to interpret the code block as
  • Does not require many changes to the spec except documenting that #! in an info string is a post-processor directive
  • Trivial for post-processors to parse (yaml.safeLoadAll(node.literal))
  • Allows for interpreter hints #!js-yaml or #!yaml-lite etc
  • Can be embedded anywhere in the document

I think this addresses several of the concerns in this thread. I wonder what people think…

@tmpfs I prefer the Jekyll-style metadata blocks because “a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.” The list item points that you mention are good for parsers, but adding additional syntax around the metadata block makes the document less readable for humans.