Metadata in documents

I like the terseness and “calmness” of pandoc-style headers, or “Metadata Blocks”: just (up to) three lines right in front of the Markdown text, each beginning with a “%”. It looks like this:

% Document Title
% A. Uthor
% 2015-01-01
... Document content begins here ...

I have implemented this kind of “Meta-Information” in my clone of cmark (in cm2html, to be exact), which puts the meta-information into Dublin Core <META> elements in the HTML <HEAD>, and uses the title for the <TITLE> element too (if not overridden by a command-line option specifying the <TITLE> to use this time).

The resulting HTML header is shown below. I particularly longed for this feature (copied from the discount parser I use too) because ISO HTML requires—among other things—each HTML document to have a <TITLE> to be valid, and there’s no good way to specify the title otherwise (well, maybe a parser could use the first section header text, but I wouldn’t find that a great work-around).

<!DOCTYPE HTML PUBLIC "ISO/IEC 15445:2000//DTD HTML//EN">
<HTML>
<HEAD>
  <META name="GENERATOR"
        content="cmark 0.22.0 (https://github.com/tin-pot/cmark.git d57b73fedd68)">
  <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <LINK rel="schema.DC"   href="http://purl.org/dc/elements/1.1/">
  <META name="DC.format"  scheme="DCTERMS.IMT"      content="text/html">
  <META name="DC.type"    scheme="DCTERMS.DCMIType" content="Text">
  <META name="DC.title"   content="Document Title">
  <META name="DC.creator" content="A. Uthor">
  <META name="DC.date"    content="2015-01-01">
  <LINK rel="stylesheet"  type="text/css"
        href="default.css">
  <TITLE>Document Title</TITLE>
</HEAD>
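
For readers who want to experiment with the idea, here is a minimal Python sketch of the same mapping. It is not the cm2html code; the function names `parse_title_block` and `dublin_core_head` are made up for illustration, and the sketch assumes each of the three fields fits on a single “%” line (no continuation lines).

```python
from html import escape

def parse_title_block(text):
    """Split up to three leading '%' lines (title, creator, date) from the body."""
    keys = ("title", "creator", "date")  # order taken from the example above
    lines = text.splitlines()
    meta = {}
    n = 0
    while n < len(keys) and n < len(lines) and lines[n].startswith("%"):
        meta[keys[n]] = lines[n][1:].strip()
        n += 1
    return meta, "\n".join(lines[n:])

def dublin_core_head(meta):
    """Render DC.* META elements plus a TITLE, in the style of the header above."""
    out = ['<META name="DC.%s" content="%s">' % (key, escape(value, quote=True))
           for key, value in meta.items()]
    if "title" in meta:
        out.append("<TITLE>%s</TITLE>" % escape(meta["title"]))
    return "\n".join(out)

meta, body = parse_title_block("% Document Title\n% A. Uthor\n% 2015-01-01\nBody text\n")
print(dublin_core_head(meta))
```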

Thanks for your comments! They cover exactly what I’m struggling with right now.

Background.
I use markdown to store notes, clippings from articles, ideas… So I need something fast and uncomplicated. MD is perfect for this. And CommonMark is even better.

To distinguish reference data (URLs, the language of the note, who said that, etc.) I simply put a paragraph starting with ‘~’ somewhere in the text. Such a note refers to the enclosing header, or to the whole file if it appears before any header.

Anything more complicated (or with too much structure) will not be used for my peculiar use case, I’m sure. Also, I think visually intrusive markup distracts too much from the real content.

So I suggest the language could define a beginning-of-line marker to say “Ignore this” or “This is special” or “This is a comment”. The standard parsers (pandoc…) could ignore it, but libcmark could create a node for it marked as “CMARK_NODE_CUSTOM_BLOCK”. Then it is my database loading code (for example) that uses this information, with no need to look at the first character of each paragraph to see if it is a ‘~’.
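
To make the current workaround concrete, here is a rough Python sketch of the per-paragraph scan described above. The function name `collect_notes` is hypothetical, and the sketch assumes ATX (“#”) headings and blank-line-separated paragraphs; it does not skip headings or “~” lines that sit inside code blocks.

```python
def collect_notes(text):
    """Map each heading (None = whole file) to the '~' note paragraphs under it."""
    notes = {}
    heading = None         # None until the first heading: note applies to the file
    at_block_start = True  # True when the next line would start a new paragraph
    current = None         # lines of the '~' paragraph being collected, if any
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            at_block_start, current = True, None
        elif stripped.startswith("#"):
            heading = stripped.lstrip("#").strip()
            at_block_start, current = True, None
        elif (at_block_start and stripped.startswith("~")
              and not stripped.startswith("~~~")):  # '~~~' opens a code fence
            current = [stripped[1:].strip()]
            notes.setdefault(heading, []).append(current)
            at_block_start = False
        else:
            if current is not None:
                current.append(stripped)  # continuation line of a note
            at_block_start = False
    return {h: [" ".join(p) for p in paras] for h, paras in notes.items()}
```

A dedicated node type in the parser would replace exactly this kind of first-character check.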

Thanks for the discussion!

Pandoc % block is too limited.
What if I want to mark a section with the language used? The pandoc block is OK for your use case, but then I need a different mechanism for this one.

% Title: multi-language
% language: fr
<text in French>

# Article
% language: en
<text in English>

# Articolo
% language: it
<text in Italian>

I propose that we overload the fenced code block info string to indicate to post-processors that a code block may be interpreted:

```yaml #!
---
foo: bar
---
``` 

Whilst it is a little more verbose than some of the alternatives it has several benefits:

  • No updates to any parsers
  • When not executed via a post-processor will render as a code block
  • Transparently indicates to post-processors the language to interpret the code block as
  • Does not require many changes to the spec except documenting that #! in an info string is a post-processor directive
  • Trivial for post-processors to parse (yaml.safeLoadAll(node.literal))
  • Allows for interpreter hints #!js-yaml or #!yaml-lite etc
  • Can be embedded anywhere in the document

I think this addresses several of the concerns in this thread, I wonder what people think…
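
As a rough illustration of how little a post-processor would need, here is a Python sketch that finds such blocks in the raw source. It is an assumption-laden stand-in for walking a real parser's node tree: the name `interpreted_blocks` is invented, only backtick fences are handled, and the literal is returned as text rather than being handed to a YAML library.

```python
import re

# Matches a backtick fence, capturing the info string and the block's literal text.
FENCE = re.compile(
    r"^```[ \t]*(?P<info>[^\n`]*)\n(?P<literal>.*?)^```[ \t]*$",
    re.MULTILINE | re.DOTALL,
)

def interpreted_blocks(markdown):
    """Yield (language, interpreter_hint, literal) for fences marked with '#!'."""
    for match in FENCE.finditer(markdown):
        words = match.group("info").split()
        directives = [w for w in words[1:] if w.startswith("#!")]
        if directives:
            # '#!' alone gives no hint; '#!js-yaml' gives the hint 'js-yaml'.
            yield words[0], directives[0][2:] or None, match.group("literal")

doc = "```yaml #!\n---\nfoo: bar\n---\n```\n\n```python\nprint(1)\n```\n"
for language, hint, literal in interpreted_blocks(doc):
    print(language, hint, repr(literal))
```

Blocks without the directive are left alone and render as ordinary code.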

@tmpfs I prefer the Jekyll-style metadata blocks because “a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.” The list item points that you mention are good for parsers, but adding additional syntax around the metadata block makes the document less readable for humans.

I see your point of view but I have a few issues with the standard YAML frontmatter approach:

  • Restricts the meta data format to YAML
  • Can only be embedded at the beginning of the document
  • Confusing as we would have --- to mean YAML frontmatter, thematic break and level 2 setext heading

As noted elsewhere with the YAML frontmatter approach there is no real need to specify anything as it can be trivially parsed by a pre-processor.

However I think there could be some value in defining a commonmark extension that allows embedding arbitrary data in arbitrary formats anywhere in a document.

Just to throw it out there, to improve legibility rather than take inspiration from the shebang (I figured most people using this functionality would see the shebang as something that would be interpreted) how about a single period . as in source:

```json .
{"meta": "foo"}
```

I think this is more readable and also implies the code block would be interpreted.

I also believe it would be useful if an author could define structured data in multiple places in the document.

I think that there are many different uses for markdown and we shouldn’t be restricted by it always being accessible to a layperson.

If a technical person is creating a markdown blog post then some additional meta data is useful. If I want to share a recipe with my family then I would use plain markdown.

I think we should be able to support both use cases.

I’m not sure that we can. You can get a technical person to understand a non-technical document, but a non-technical person is going to be confused or put off by seeing technical complexity mixed in with a regular Markdown document. So long as we allow complex syntax, some documents are inevitably going to include that syntax with the (wrong) assumption that a non-technical person will just ignore the complex syntax.

My point is that in the first use case, where no layperson will see the source document (i.e. it is published to HTML for public consumption), we should be able to specify this extension to fulfil that use case.

We know this use case exists as there are various markdown -> blog publishing tools.

The idea we should strive for is that “technical features” targeting technical users can and should be more complex, but at the same time clearly distinguished from the normal markdown syntax for normal markdown users, via some “start” and “stop” markers.

An example of this is # h1 heading { #idname .classname var=1 } and how it clearly distinguishes the much less intuitive anchor and styling class names from the normal intuitive markdown syntax via {}.
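
To show how cleanly the {} markers separate the two layers, here is a small Python sketch that splits the heading text from a trailing attribute block. The function name `split_heading_attributes` and the shape of the returned dictionary are my own choices, not part of any spec; the input is the heading text with the leading “#” already removed.

```python
import re

# A trailing '{ ... }' group with no nested braces.
ATTR_BLOCK = re.compile(r"\s*\{\s*(?P<attrs>[^{}]*?)\s*\}\s*$")

def split_heading_attributes(heading_text):
    """Split 'h1 heading { #id .class key=val }' into (text, attributes)."""
    match = ATTR_BLOCK.search(heading_text)
    if not match:
        return heading_text.strip(), {}
    attrs = {"id": None, "classes": [], "vars": {}}
    for token in match.group("attrs").split():
        if token.startswith("#"):
            attrs["id"] = token[1:]
        elif token.startswith("."):
            attrs["classes"].append(token[1:])
        elif "=" in token:
            key, value = token.split("=", 1)
            attrs["vars"][key] = value
    return heading_text[:match.start()].strip(), attrs

print(split_heading_attributes("h1 heading { #idname .classname var=1 }"))
```

A renderer that does not support attributes could keep the first element of the result and drop the second.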

What this allows for is for normal and expert users to share the same markdown vocabulary for the basics, while for more advanced features the more technical users can have a more expressive syntax (at the cost of intuitiveness). So we get the best of both without impacting writers of low or high technical ability.

This also allows for defining how to ignore (or strip) the more technical features, to provide a more compact markdown display that doesn’t implement the more advanced features.

In which case we could still overload the fenced code block just be more specific:

```yaml {meta}
---
data:
  - foo
---
```

What do you think about something like that?

@jgm I would like to “revive” this, now that we support CommonMark. At a minimum I want to allow you a “user option” for “traditional multiline” support, but need some metadata to accompany it.

Simplest solution in my mind would be to add this to the top of any documents you create.

<!-- softbreak: false -->

We could also go the whole hog here, but this chews up premium real estate in the editor.

---
softbreak: false
---

I guess the editor could be made smart enough just to hide the metadata and expose it via the options dialog.
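
For the comment variant, a sketch of how an application might read the option is below (Python). Everything here is an assumption for illustration: the name `document_options`, the default of `softbreak` being true, and the decision to treat every value as a boolean.

```python
import re

# An HTML comment at the very start of the document (leading whitespace allowed).
LEADING_COMMENT = re.compile(r"\A\s*<!--(?P<body>.*?)-->", re.DOTALL)

def document_options(markdown, defaults=None):
    """Read 'key: value' pairs from a leading HTML comment as boolean options."""
    options = dict(defaults or {"softbreak": True})  # assumed default
    match = LEADING_COMMENT.match(markdown)
    if match:
        for line in match.group("body").splitlines():
            key, sep, value = line.partition(":")
            if sep:
                falsy = ("false", "no", "off", "0")
                options[key.strip()] = value.strip().lower() not in falsy
    return options

print(document_options("<!-- softbreak: false -->\n\nSome text"))
```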

Not sure what to do here. What is your preference?

An implementation can always decide to treat some specially marked initial lines of the document as metadata, rather than as part of the CommonMark content. I don’t think there’s any need to standardize on a format for this, since different applications will have different needs. So I still think that metadata should not be part of the spec.

Also to prevent metadata from being rendered, HTML comments can already be used <!-- my metadata here -->. This is more likely to work everywhere, even in non-CommonMark Markdown implementations.

Yes, it certainly works, but it is not interoperable. Newline handling is already optional per the spec, but there is no way to have the document declare it.

I do understand, though, that there may be bigger fish to fry and that we can simply not have interoperability here.

The damper that @jonschlinkert placed on this idea, while understandable regarding data structure, fails to recognize that we’re just looking for “delimiter-ization” for a different portion of the page.

I would guess that nearly every markdown author, at one point or another needs some kind of invisible “meta” area.

Consider the idea that markdown advertises (and why everyone uses it):

Markdown = Simple characters delimit page parts.

In the same way that the # character delimits an H1, we’re just asking for an agreement on (and standardization of) which character(s) delimit:

  1. meta data
  2. no processing
  3. vcard

The problem with YAML is that the “---” is already used for multiple things. And other proposals don’t feel markdown-y enough.

No one complains about the “fencing” idea, so stick with it, and don’t start deviating from existing markdown paradigms.

I propose

  1. meta data - Enclosed by brackets. Allowing multiple metas – or peppering meta data throughout the document. Or it could also be leveraged as a hidden comment-like section.

    Example:

     {{{
     	title : My Page
     	sally: barfed
     	bob: is reversible
     }}}
    

    Example:

     {{{
     	{
     		"author" : "Patrik Star",
     		"uuid" : 1234566,
     		"whatever" : "you want",
     		"foo" : {
     					"bar" : [1.2, 4.5]
     				}
     	}
     }}}
    
  2. no processing - (Island) Fenced by exclamation points.

    Example:

     !!!"whatever" : "you want","foo" : "bar"!!!
    
  3. vcard - fenced by @ symbols, classic key=value pairs.

    Example:

     @@@
     name = bob
     phone = 555-1212
     email = bob@example.com
     @@@
    

Again, the idea here is just to allow CommonMark to establish the marks used to delimit different kinds of data.
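
A sketch of what a scanner for these three delimiters might look like, in Python. The names `extract_islands` and `parse_vcard` and the kind labels are invented for the example. One limitation to be aware of: the closing `}}}` is found by plain text search, so compact JSON that happens to end in three consecutive braces would be cut short.

```python
# Opening delimiter -> (closing delimiter, label). Labels are illustrative only.
FENCES = {"{{{": ("}}}", "meta"), "!!!": ("!!!", "raw"), "@@@": ("@@@", "vcard")}

def extract_islands(text):
    """Return (kind, content) for each delimited region, in document order."""
    islands, pos = [], 0
    while pos < len(text):
        # Find the earliest opening delimiter at or after pos.
        hits = [(text.find(opener, pos), opener) for opener in FENCES]
        hits = [(index, opener) for index, opener in hits if index != -1]
        if not hits:
            break
        start, opener = min(hits)
        closer, kind = FENCES[opener]
        end = text.find(closer, start + len(opener))
        if end == -1:
            break  # unterminated region: stop rather than guess
        islands.append((kind, text[start + len(opener):end]))
        pos = end + len(closer)
    return islands

def parse_vcard(content):
    """Parse 'key = value' lines into a dictionary."""
    pairs = {}
    for line in content.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            pairs[key.strip()] = value.strip()
    return pairs
```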

I believe that the solidifying of the spec could be sped up dramatically if we remember to KISS.

Note that many provided examples can be solved with the current standard by using JSON-LD:

<script type="application/ld+json">
{ "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "Blogging Like a Hacker",
  "description": "How to blog like a hacker and win at life",
  "datePublished": "2018-05-19T21:11:36Z",
  "keywords": "blogging, hackery" }
</script>

It is in widespread use, well-defined at schema.org, and includes many more fields than the <meta> element.

(It is what I do in my blog, and I use that metadata for automated generation of the HTML and RSS feed.)

For other elements, other script types are possible, such as text/yaml.
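
As an illustration of reading that metadata back out of a Markdown source file, here is a Python sketch using only the standard library. It is not the code behind my blog; the class and function names are made up, and it assumes the script element is written out in full, as in the example above.

```python
import json
from html.parser import HTMLParser

class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld, self._chunks = True, []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.records.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False

def extract_json_ld(markdown):
    """Return a list of JSON-LD objects embedded in the given Markdown text."""
    collector = JsonLdCollector()
    collector.feed(markdown)
    collector.close()
    return collector.records
```

Since CommonMark passes raw HTML blocks through untouched, the same element also ends up in the rendered page.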

If we added the following rule to the spec, would this not cover the vast majority of cases?

If the first line of the document begins with a thematic break, treat this as a “do not parse” starting delimiter instead and ignore the subsequent lines until the next thematic break which is treated as a closing delimiter for the “do not parse” section.

This allows custom front matter to be added via a post-processor without it being rendered in the output (of regular CommonMark implementations). Third party implementations can implement the post-processor however they want (and come up with common syntax extensions for their specific use cases).
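
A sketch of the rule in Python, to show how small it is. The name `split_front_matter` is hypothetical, and one detail is my own assumption because the rule does not cover it: if no closing thematic break is found, the whole text is treated as ordinary content.

```python
import re

# Thematic break: up to three spaces of indent, then three or more matching
# '-', '_' or '*' characters, optionally separated by spaces or tabs.
THEMATIC_BREAK = re.compile(r"^ {0,3}([-_*])(?:[ \t]*\1){2,}[ \t]*$")

def split_front_matter(text):
    """Return (front_matter, body); front_matter is None when there is none."""
    lines = text.split("\n")
    if not THEMATIC_BREAK.match(lines[0]):
        return None, text
    for i in range(1, len(lines)):
        if THEMATIC_BREAK.match(lines[i]):
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return None, text  # assumption: no closing delimiter means no front matter

front, body = split_front_matter("---\nsoftbreak: false\n---\n# Title\n")
print(repr(front), repr(body))
```

The front matter comes back as an uninterpreted string, so each application remains free to parse it as YAML, JSON, or anything else.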

Looks like the full CommonMark spec finally landed in VS Code as of roughly this week. The YAML front matter used to be supported according to @chrisalley’s last message, and now it is broken. Hmmmm…

Anyone else notice this?
