Metadata in documents

+++ mofosyne [Nov 22 14 11:04 ]:

Any other programs that adopt the same format?

Among those known to me: Pandoc, Hakyll, Gitit.

Some other implementations were mentioned by @lu_zero in the document titles topic.

Also, Middleman.

Okay, well it seems pretty clear then. “Do not parse or show” :

This is not quite a “Do not parse, but show” island; rather, it is a “Do not parse or show” island. We need a different syntax for those who want “no parsing, but show in HTML”.

The “Do not parse or show” will be useful for comments, application specific metadata, etc…

I recommend having an implementers’ guide that builds a general consensus on optional metadata interpretation, so that most simple documents can have readable metadata (e.g. stripped-down YAML). But the core of CommonMark will not include metadata interpretation (though it could have a hook).

If you want to render the data as HTML, why not just use a description list?

That’s fine. What I meant is a syntax for those who just want text to go straight to HTML/doc/etc. without any parsing (but not hidden for external parsing), e.g. a no-Markdown island. Not visual metadata.

No-Markdown islands are altogether different kind of data, closer to code blocks than the type of meta data used by Jekyll. For reference, there’s already a topic about no-Markdown islands. I agree that (since these two features are quite different) they should have different syntax.


For metadata:

Use an external metadata parser by ignoring anything between Jekyll-style fences:

For a no-Markdown island:

FWIW I’ve added a YAML metadata parser for the remarkable markdown parser here:

I’ve taken the approach of just using --- separators at the top of the file for now, though this could be configurable.
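To illustrate the "separators at the top of the file" approach, here is a minimal sketch of a pre-processor that splits a leading metadata block from the Markdown body. It is not the remarkable plugin's code; the function name `split_front_matter` and its `delimiter` parameter are made up for this example, and the raw metadata is returned as text so that any YAML parser can be applied afterwards.

```python
from typing import Optional, Tuple


def split_front_matter(text: str, delimiter: str = "---") -> Tuple[Optional[str], str]:
    """Return (raw_metadata, body); raw_metadata is None when no block is found."""
    lines = text.splitlines(keepends=True)
    # The block must open on the very first line of the file.
    if not lines or lines[0].strip() != delimiter:
        return None, text
    for index in range(1, len(lines)):
        if lines[index].strip() == delimiter:
            metadata = "".join(lines[1:index])
            body = "".join(lines[index + 1:])
            return metadata, body
    # No closing delimiter: treat the whole file as ordinary Markdown.
    return None, text
```

Because the split happens before the Markdown parser sees the text, a later --- in the body keeps its usual meaning (thematic break or setext heading underline).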

Static website generators markers for YAML

Pelican uses --- and ---

Hugo uses --- and ...


Based on my previous comment, in case it’s useful or helps to distinguish between what is necessary in Markdown and what can be handled by an “external” tool: I created a lib called gray-matter for parsing front matter from Markdown files. YAML is the most popular front-matter language, but gray-matter can also parse coffee-front-matter and JSON front matter. It’s very stable, it’s the fastest implementation I’ve tested, and it’s used on hundreds of projects (including Assemble).

I really liked how you allowed for different languages in gray-matter via option switches.

Just one suggestion. Can you allow for an optional descriptive field after the ‘language specifier’? This is most useful for repeated metadata within a document. E.g. slideshow apps might need to set a different stylesheet for each slide, so they need to be able to distinguish between different metadata boxes.

---yaml: slide01 ---
CSS: style.css
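As a rough sketch of what a tool would have to do with such an opening line, the following pulls the language specifier and the optional descriptive field out of it. This syntax is only the suggestion made above, not something gray-matter is known to support, and `FenceHeader` and `parse_fence_header` are hypothetical names.

```python
from typing import NamedTuple, Optional


class FenceHeader(NamedTuple):
    language: Optional[str]  # e.g. "yaml"; None for a bare "---"
    label: Optional[str]     # e.g. "slide01"; None when no description is given


def parse_fence_header(line: str) -> Optional[FenceHeader]:
    """Parse an opening line such as '---yaml: slide01 ---'."""
    stripped = line.strip()
    if not stripped.startswith("---"):
        return None  # not a metadata fence at all
    rest = stripped[3:]
    # The trailing '---' in the proposed syntax is treated as optional here.
    if rest.endswith("---"):
        rest = rest[:-3]
    language, _, label = rest.partition(":")
    return FenceHeader(language.strip() or None, label.strip() or None)
```

A slideshow tool could then use `label` as the key for each slide's metadata box.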


Can you allow for an optional descriptive field after the ‘language specifier’?

We could, but it depends on the specifics. I’ve thought about a need for something similar. Would you want to continue the discussion in a gray-matter feature request? It might be better to move the discussion there.

I like the terseness and “calmness” of pandoc-style headers, or “Metadata Blocks”: Just (up to) three lines right in front of the Markdown typescript, each line beginning with a “%”. This looks like this:

% Document Title
% A. Uthor
% 2015-01-01
... Document content begins here ...

I have implemented this kind of “Meta-Information” in my clone of cmark (in cm2html, to be exact), which puts the meta-information into Dublin Core <META> elements in the HTML <HEAD>, and uses the title for the <TITLE> element too (if not overridden by a command-line option specifying the <TITLE> to use this time).

The resulting HTML header is shown below. I particularly longed for this feature (copied from the discount parser I use too) because ISO HTML requires—among other things—each HTML document to have a <TITLE> to be valid, and there’s no good way to specify the title otherwise (well, maybe a parser could use the first section header text, but I wouldn’t find that a great work-around).

        content="cmark 0.22.0 ( d57b73fedd68)">
  <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <LINK rel="schema.DC"   href="">
  <META name="DC.format"  scheme="DCTERMS.IMT"      content="text/html">
  <META name="DC.type"    scheme="DCTERMS.DCMIType" content="Text">
  <META name="DC.title"   content="Document Title">
  <META name="DC.creator" content="A. Uthor">
  <META name=""    content="2015-01-01">
  <LINK rel="stylesheet"  type="text/css"
  <TITLE>Document Title</TITLE>
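For readers who want to try the idea without cm2html, here is a small sketch of the same mapping: read up to three leading % lines and emit the corresponding header elements. It is not the cm2html implementation. The names `read_title_block` and `render_head` are made up, and "DC.date" for the third line is an assumption, since that attribute is empty in the header shown above.

```python
from html import escape
from typing import List, Tuple

# Field order follows the example above: title, author, date.
# "DC.date" is an assumed name; the header above leaves that attribute empty.
FIELDS = ("DC.title", "DC.creator", "DC.date")


def read_title_block(text: str) -> Tuple[List[str], str]:
    """Return (values, body) where values holds up to three '%' header lines."""
    lines = text.splitlines(keepends=True)
    values: List[str] = []
    for line in lines[:len(FIELDS)]:
        if not line.startswith("%"):
            break
        values.append(line[1:].strip())
    return values, "".join(lines[len(values):])


def render_head(values: List[str]) -> str:
    """Render META elements, plus a TITLE taken from the first header line."""
    out = [
        f'<META name="{name}" content="{escape(value)}">'
        for name, value in zip(FIELDS, values)
    ]
    if values:
        out.append(f"<TITLE>{escape(values[0])}</TITLE>")
    return "\n".join(out)
```

A command-line option overriding the title, as described above, would simply replace `values[0]` before rendering.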

Thanks for your comments! They cover exactly what I’m struggling with right now.

I use markdown to store notes, clippings from articles, ideas… So I need something fast and uncomplicated. MD is perfect for this. And CommonMark is even better.

To distinguish reference data (URLs, the language of the note, who said that, etc.) I simply put a paragraph starting with ‘~’ somewhere in the text. This note refers to the enclosing header, or to the whole file if it appears before any header.

Anything more complicated (or with too much structure) will not be used for my peculiar use case, I’m sure. Also, visually intrusive markup, I think, distracts too much from the real content.

So I suggest the language could define a beginning-of-line marker to say “Ignore this” or “This is special” or “This is a comment”. The standard parsers (pandoc…) could ignore it, but libcmark could create a node for it marked as “CMARK_NODE_CUSTOM_BLOCK”. Then it is my database loading code (for example) that uses this information, with no need to look at the first character of each paragraph to see if it is a ‘~’.
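For comparison, this is roughly what the loading code has to do today without parser support: a sketch that scans blank-line-separated blocks, remembers the most recent ATX header, and attaches each ‘~’ paragraph to it. The function name `collect_notes` is hypothetical, and the sketch assumes every header and note is set off by blank lines.

```python
import re
from typing import Dict, List, Optional


def collect_notes(text: str) -> Dict[Optional[str], List[str]]:
    """Map each header text to its '~' notes; the key None means the whole file."""
    notes: Dict[Optional[str], List[str]] = {}
    current_heading: Optional[str] = None
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.lstrip()
        if block.startswith("#"):
            # ATX header: keep only the text of its first line.
            current_heading = block.splitlines()[0].lstrip("# ").strip()
        elif block.startswith("~") and not block.startswith("~~~"):
            # '~~~' opens a fenced code block in CommonMark, so it is skipped.
            notes.setdefault(current_heading, []).append(block[1:].strip())
    return notes
```

With a dedicated node type from the parser, the `startswith` checks would be replaced by a test on the node type.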

Thanks for the discussion!

The Pandoc % block is too limited.
What if I want to mark a section with the language used? The pandoc block is OK for your use case, but then I need a different mechanism for this one.

% Title: multi-language
% language: fr
<text in French>

# Article
% language: en
<text in English>

# Articolo
% language: it
<text in Italian>

I propose that we overload the fenced code block info string to indicate to post-processors that a code block may be interpreted:

```yaml #!
foo: bar
```

Whilst it is a little more verbose than some of the alternatives, it has several benefits:

  • No updates to any parsers
  • When not executed via a post-processor, it will render as a code block
  • Transparently indicates to post-processors the language to interpret the code block as
  • Does not require many changes to the spec except documenting that #! in an info string is a post-processor directive
  • Trivial for post-processors to parse (yaml.safeLoadAll(node.literal))
  • Allows for interpreter hints #!js-yaml or #!yaml-lite etc
  • Can be embedded anywhere in the document
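To make the proposal concrete, here is a rough sketch of the post-processor side: find fenced blocks whose info string carries a #! directive and report the language, the optional interpreter hint, and the block's literal text. It scans raw text instead of using a real parser's nodes, handles only backtick fences, and `Directive` and `find_directives` are made-up names.

```python
from typing import List, NamedTuple, Optional

TICKS = "`" * 3  # the fence marker, built this way to keep this example readable


class Directive(NamedTuple):
    language: str         # first word of the info string, e.g. "yaml"
    hint: Optional[str]   # text after "#!", e.g. "js-yaml"; None for a bare "#!"
    literal: str          # block content, to be handed to the interpreter


def find_directives(markdown: str) -> List[Directive]:
    found: List[Directive] = []
    info: Optional[str] = None  # info string of the open block, None outside one
    body: List[str] = []
    for line in markdown.splitlines():
        if info is None:
            if line.startswith(TICKS):
                info, body = line[3:].strip(), []
        elif line.strip() == TICKS:
            words = info.split()
            marker = next((w for w in words[1:] if w.startswith("#!")), None)
            if marker is not None:
                found.append(Directive(words[0], marker[2:] or None, "\n".join(body)))
            info = None
        else:
            body.append(line)
    return found
```

Blocks without the directive are left alone, so they still render as ordinary code blocks.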

I think this addresses several of the concerns in this thread. I wonder what people think…


@tmpfs I prefer the Jekyll-style metadata blocks because “a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.” The list item points that you mention are good for parsers, but adding additional syntax around the metadata block makes the document less readable for humans.

I see your point of view but I have a few issues with the standard YAML frontmatter approach:

  • Restricts the meta data format to YAML
  • Can only be embedded at the beginning of the document
  • Confusing as we would have --- to mean YAML frontmatter, thematic break and level 2 setext heading

As noted elsewhere, with the YAML frontmatter approach there is no real need to specify anything, as it can be trivially parsed by a pre-processor.

However, I think there could be some value in defining a CommonMark extension that allows embedding arbitrary data in arbitrary formats anywhere in a document.

Just to throw it out there, to improve legibility rather than take inspiration from the shebang (I figured most people using this functionality would see the shebang as something that would be interpreted) how about a single period . as in source:

```json .
{"meta": "foo"}
```

I think this is more readable and also implies the code block would be interpreted.

I also believe it would be useful if an author could define structured data in multiple places in the document.

I think that there are many different uses for markdown and we shouldn’t be restricted by it always being accessible to a layperson.

If a technical person is creating a markdown blog post then some additional meta data is useful. If I want to share a recipe with my family then I would use plain markdown.

I think we should be able to support both use cases.