Metadata in documents

Okay:

For metadata:

Use external metadata parser via ignoring anything between jekyll style fencing:

http://talk.commonmark.org/t/jekyll-style-do-not-show-sections/

For no markdown island:

http://talk.commonmark.org/t/no-markdown-islands/

FWIW I've added a YAML metadata parser for the remarkable markdown parser here: https://github.com/eugeneware/remarkable-meta.

I've taken the approach of just using --- separators at the top of the file for now, though this could be configurable.
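
As a rough illustration of the "external parser" idea, here is a minimal Python sketch that splits a leading fenced block from the Markdown body before the Markdown parser ever sees it. The function name `split_front_matter` and its `opener`/`closers` parameters are made up for this example and are not taken from remarkable-meta; the returned metadata lines would still have to be handed to a YAML parser of your choice.

```python
def split_front_matter(text, opener="---", closers=("---", "...")):
    """Return (metadata_lines, body); metadata_lines is None if no block is found."""
    lines = text.splitlines()
    # The block only counts when the very first line is the opening marker.
    if not lines or lines[0].strip() != opener:
        return None, text
    for i in range(1, len(lines)):
        if lines[i].strip() in closers:
            return lines[1:i], "\n".join(lines[i + 1:])
    # Unterminated block: treat the whole input as ordinary Markdown.
    return None, text


doc = "---\ntitle: Hello\n---\n# Heading\n"
meta, body = split_front_matter(doc)
print(meta)   # metadata lines, still unparsed
print(body)   # the Markdown that goes to the real parser
```

Making the markers parameters is one way to get the configurability mentioned above without touching the Markdown parser itself.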

Static website generators markers for YAML

Pelican uses --- and ---

Hugo uses --- and ...

RD

Based on my previous comment, in case it's useful or helps to distinguish between what is necessary in markdown and what can be handled by an "external" tool: I created a lib called gray-matter for parsing front matter from markdown files. YAML is the most popular front-matter language, but gray-matter can also parse coffee-front-matter and JSON front matter. It's very stable, it's the fastest implementation I've tested, and it's used on hundreds of projects (including Assemble).

I really liked how you allowed for different languages in gray-matter via option switches.

Just one suggestion. Can you allow for an optional descriptive field after the 'language specifier'? This is most useful for repeated metadata within a document. E.g. slideshow apps might need to set a different stylesheet for each slide, so they need to be able to distinguish between different metadata boxes.

---yaml: slide01 ---
CSS: style.css
---
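
To make the request concrete, here is a small Python sketch of how an opening line such as `---yaml: slide01 ---` could be split into a language and a descriptive label. This syntax is only the suggestion from this post, not something gray-matter is known to support, and `OPENER` and `parse_opener` are hypothetical names.

```python
import re

# '---', a language name, then optionally ': label' and a trailing '---'.
OPENER = re.compile(
    r"^---\s*(?P<lang>[A-Za-z]\w*)(?:\s*:\s*(?P<label>.*?))?\s*(?:---)?\s*$"
)


def parse_opener(line):
    """Return (language, label) for a labelled opener, or None for any other line."""
    m = OPENER.match(line)
    if not m:
        return None
    return m.group("lang"), m.group("label") or None


print(parse_opener("---yaml: slide01 ---"))
```

A plain `---` line does not match, so ordinary front-matter fences and thematic breaks are left alone.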

Thanks!

Can you allow for optional descriptive field after the 'language specifier'?

We could, but it depends on the specifics. I've thought about a need for something similar. Would you want to continue the discussion in a gray-matter feature request? It might be better to move the discussion there.

I like the terseness and "calmness" of pandoc-style headers, or "Metadata Blocks": Just (up to) three lines right in front of the Markdown typescript, each line beginning with a "%". This looks like this:

% Document Title
% A. Uthor
% 2015-01-01
... Document content begins here ...
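
A minimal Python sketch of reading such a header, assuming the simplified rule described here (at most three leading lines starting with "%", taken as title, author and date in that order). Pandoc's real title blocks allow more, such as continuation lines and multiple authors; `parse_title_block` is a hypothetical name.

```python
def parse_title_block(text):
    """Split up to three leading '%' lines from the body; returns (fields, body)."""
    keys = ("title", "author", "date")
    lines = text.splitlines()
    fields = {}
    n = 0
    # Stop at the first line without '%' or once all three slots are filled.
    while n < len(keys) and n < len(lines) and lines[n].startswith("%"):
        fields[keys[n]] = lines[n][1:].strip()
        n += 1
    return fields, "\n".join(lines[n:])


fields, body = parse_title_block("% Document Title\n% A. Uthor\n% 2015-01-01\nBody text")
print(fields)
```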

I have implemented this kind of "Meta-Information" in my clone of cmark (in cm2html, to be exact), which puts the meta-information into Dublin Core <META> elements in the HTML <HEAD>, and uses the title for the <TITLE> element too (if not overridden by a command-line option specifying the <TITLE> to use this time).

The resulting HTML header is shown below. I particularly longed for this feature (copied from the discount parser I use too) because ISO HTML requires, among other things, each HTML document to have a <TITLE> to be valid, and there's no good way to specify the title otherwise (well, maybe a parser could use the first section header text, but I wouldn't find that a great work-around).

<!DOCTYPE HTML PUBLIC "ISO/IEC 15445:2000//DTD HTML//EN">
<HTML>
<HEAD>
  <META name="GENERATOR"
        content="cmark 0.22.0 (https://github.com/tin-pot/cmark.git d57b73fedd68)">
  <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <LINK rel="schema.DC"   href="http://purl.org/dc/elements/1.1/">
  <META name="DC.format"  scheme="DCTERMS.IMT"      content="text/html">
  <META name="DC.type"    scheme="DCTERMS.DCMIType" content="Text">
  <META name="DC.title"   content="Document Title">
  <META name="DC.creator" content="A. Uthor">
  <META name="DC.date"    content="2015-01-01">
  <LINK rel="stylesheet"  type="text/css"
        href="default.css">
  <TITLE>Document Title</TITLE>
</HEAD>
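
For illustration only, this Python sketch shows how title, author and date fields could be mapped onto the Dublin Core elements seen above. It is not the cm2html code (which is C); `dublin_core_head` and the shape of the `fields` dict are assumptions made for this example.

```python
from html import escape


def dublin_core_head(fields):
    """Render <META> lines (and a <TITLE>) for the fields that are present."""
    mapping = (("title", "DC.title"), ("author", "DC.creator"), ("date", "DC.date"))
    out = []
    for key, name in mapping:
        if key in fields:
            # Escape the value so quotes or '&' cannot break the attribute.
            out.append('<META name="%s" content="%s">' % (name, escape(fields[key], quote=True)))
    if "title" in fields:
        out.append("<TITLE>%s</TITLE>" % escape(fields["title"]))
    return "\n".join(out)


print(dublin_core_head({"title": "Document Title", "author": "A. Uthor", "date": "2015-01-01"}))
```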

Thanks for your comments! They cover exactly what I'm struggling with right now.

Background.
I use markdown to store notes, clippings from articles, ideas... So I need something fast and uncomplicated. MD is perfect for this. And CommonMark is even better.

To distinguish reference data (URLs, the language of the note, who said that, etc.) I simply put a paragraph starting with '~' somewhere in the text. This note refers to the enclosing header, or to the whole file if it appears before any header.
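
The convention just described can be sketched in Python as follows. This is a line-based approximation that assumes ATX ("#") headings and blank-line-separated paragraphs, and it skips lines starting with "~~" so that tilde code fences are not mistaken for notes; `collect_notes` is a hypothetical name, not part of any library.

```python
def collect_notes(text):
    """Map each heading (None = whole file) to the '~' paragraphs found under it."""
    notes = {}
    heading = None        # None until the first heading is seen
    current = None        # lines of the '~' paragraph being collected
    prev_blank = True     # a note must start at the beginning of a paragraph
    for line in text.splitlines() + [""]:
        stripped = line.strip()
        if current is not None:
            if stripped:
                current.append(stripped)
                continue
            # A blank line ends the note paragraph.
            notes.setdefault(heading, []).append(" ".join(current))
            current = None
        if stripped.startswith("#"):
            heading = stripped.lstrip("#").strip()
        elif stripped.startswith("~") and not stripped.startswith("~~") and prev_blank:
            current = [stripped[1:].strip()]
        prev_blank = not stripped or stripped.startswith("#")
    return notes


sample = "~ lang: en\n\n# Ideas\n\n~ url: http://example.com\n\nSome text.\n"
print(collect_notes(sample))
```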

Anything more complicated (or with too much structure) will not be used for my peculiar use case, I'm sure. Also, visually intrusive markup, I think, distracts too much from the real content.

So I suggest the language could define a beginning-of-line marker to say "Ignore this" or "This is special" or "This is a comment". The standard parsers (pandoc...) could ignore it, but libcmark could create a node for it marked as "CMARK_NODE_CUSTOM_BLOCK". Then it is my database loading code (for example) that uses this information, with no need to look at each paragraph's first character to see if it is a '~'.

Thanks for the discussion!

Pandoc % block is too limited.
What if I want to mark a section with the language used? The pandoc block is OK for your use case, but then I need a different mechanism for this one.

% Title: multi-language
% language: fr
<text in French>

# Article
% language: en
<text in English>

# Articolo
% language: it
<text in Italian>
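
To show what a tool would have to do with the per-section markers in this example, here is a small Python sketch that pairs each section with its "% language:" line. The marker is only the hypothetical syntax from this post, not pandoc behaviour, and `section_languages` is an invented name.

```python
def section_languages(text, default=None):
    """Return (heading, language) pairs; heading None stands for the preamble."""
    sections = [[None, default]]
    for line in text.splitlines():
        if line.startswith("#"):
            # A new heading opens a new section with no language yet.
            sections.append([line.lstrip("#").strip(), default])
        elif line.lower().startswith("% language:"):
            sections[-1][1] = line.split(":", 1)[1].strip()
    return [tuple(s) for s in sections]


sample = (
    "% Title: multi-language\n% language: fr\n<text in French>\n\n"
    "# Article\n% language: en\n<text in English>\n\n"
    "# Articolo\n% language: it\n<text in Italian>\n"
)
print(section_languages(sample))
```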

I propose that we overload the fenced code block info string to indicate to post-processors that a code block may be interpreted:

```yaml #!
---
foo: bar
---
``` 

Whilst it is a little more verbose than some of the alternatives it has several benefits:

  • No updates to any parsers
  • When not executed via a post-processor will render as a code block
  • Transparently indicates to post-processors the language to interpret the code block as
  • Does not require many changes to the spec except documenting that #! in an info string is a post-processor directive
  • Trivial for post-processors to parse (yaml.safeLoadAll(node.literal))
  • Allows for interpreter hints #!js-yaml or #!yaml-lite etc
  • Can be embedded anywhere in the document
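
As a rough sketch of the post-processor side of this proposal, the Python below scans a document for backtick-fenced blocks whose info string contains `#!` and returns the language, the optional interpreter hint and the literal content. It is a simplification (no indented or tilde fences) and works on raw text rather than a CommonMark AST; `directive_blocks` is a hypothetical name, and the literal would still need to be passed to a YAML or other parser.

```python
import re

# A run of three or more backticks, then the rest of the line (the info string).
FENCE = re.compile(r"^(`{3,})\s*(.*)$")


def directive_blocks(text):
    """Return (language, hint, literal) for fenced blocks marked with '#!'."""
    found = []
    fence = None  # the opening backtick run while inside a block
    for line in text.splitlines():
        m = FENCE.match(line)
        if fence is None:
            if m:
                fence, info, body = m.group(1), m.group(2).strip(), []
        elif m and len(m.group(1)) >= len(fence) and not m.group(2).strip():
            # Closing fence: keep the block only if it carries the directive.
            if "#!" in info:
                lang, _, hint = info.partition("#!")
                found.append((lang.strip(), hint.strip(), "\n".join(body)))
            fence = None
        else:
            body.append(line)
    return found


sample = "```yaml #!js-yaml\n---\nfoo: bar\n---\n```\n\n```js\nvar x;\n```\n"
print(directive_blocks(sample))
```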

I think this addresses several of the concerns in this thread. I wonder what people think...

@tmpfs I prefer the Jekyll-style metadata blocks because "a Markdown-formatted document should be publishable as-is, as plain text, without looking like it's been marked up with tags or formatting instructions." The list item points that you mention are good for parsers, but adding additional syntax around the metadata block makes the document less readable for humans.

I see your point of view but I have a few issues with the standard YAML frontmatter approach:

  • Restricts the meta data format to YAML
  • Can only be embedded at the beginning of the document
  • Confusing as we would have --- to mean YAML frontmatter, thematic break and level 2 setext heading

As noted elsewhere with the YAML frontmatter approach there is no real need to specify anything as it can be trivially parsed by a pre-processor.

However I think there could be some value in defining a commonmark extension that allows embedding arbitrary data in arbitrary formats anywhere in a document.

Just to throw it out there, to improve legibility rather than take inspiration from the shebang (I figured most people using this functionality would see the shebang as something that would be interpreted) how about a single period . as in source:

```json .
{"meta": "foo"}
```
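
A minimal Python sketch of how a post-processor might pick up blocks marked this way, assuming the info string is exactly `json .` and using only the standard-library `json` module. The `.` marker is just the suggestion from this post, `load_json_islands` is a hypothetical name, and only plain three-backtick fences are handled.

```python
import json


def load_json_islands(text):
    """Parse every fenced block whose info string is exactly 'json .'."""
    results, body = [], []
    inside = wanted = False
    for line in text.splitlines():
        stripped = line.strip()
        if not inside and stripped.startswith("```"):
            inside = True
            # Info string after the backticks, split into words.
            wanted = stripped.strip("`").split() == ["json", "."]
            body = []
        elif inside and stripped == "```":
            if wanted:
                results.append(json.loads("\n".join(body)))
            inside = False
        elif inside:
            body.append(line)
    return results


print(load_json_islands('```json .\n{"meta": "foo"}\n```\n'))
```

An ordinary `json` block without the marker is left alone and would still render as a code block.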

I think it is more readable and also implies the code block would be interpreted.

I also believe it would be useful if an author could define structured data in multiple places in the document.

I think that there are many different uses for markdown and we shouldn't be restricted by it always being accessible to a layperson.

If a technical person is creating a markdown blog post then some additional meta data is useful. If I want to share a recipe with my family then I would use plain markdown.

I think we should be able to support both use cases.

I'm not sure that we can. You can get a technical person to understand a non-technical document, but a non-technical person is going to be confused or put off by seeing technical complexity mixed in with a regular Markdown document. So long as we allow complex syntax, some documents are inevitably going to include that syntax with the (wrong) assumption that a non-technical person will just ignore the complex syntax.

My point is that in the first use case, if no layperson will see the document (i.e. it is published to HTML for public consumption), then we should be able to specify this extension to fulfil that use case.

We know this use case exists as there are various markdown -> blog publishing tools.

The idea we should strive for is that "technical features" targeting technical users can and should be more complex, but at the same time clearly distinguished from the normal markdown syntax for normal markdown users, via some "start" and "stop" markers.

An example of this is # h1 heading { #idname .classname var=1 } and how it clearly distinguishes the much less intuitive anchor and styling classnames from the normal intuitive markdown syntax via {}.
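
To illustrate how cleanly the {} markers separate the two layers, here is a small Python sketch that splits a heading line into its readable text and its attribute block. It assumes whitespace-separated tokens and does not handle quoted values containing spaces; `parse_heading_attrs` is a hypothetical name.

```python
import re

# A trailing '{ ... }' group at the end of the line.
ATTRS = re.compile(r"\{([^{}]*)\}\s*$")


def parse_heading_attrs(line):
    """Return (text, attrs) where attrs holds 'id', 'classes' and 'vars'."""
    attrs = {"id": None, "classes": [], "vars": {}}
    m = ATTRS.search(line)
    if not m:
        return line.strip(), attrs
    for token in m.group(1).split():
        if token.startswith("#"):
            attrs["id"] = token[1:]
        elif token.startswith("."):
            attrs["classes"].append(token[1:])
        elif "=" in token:
            key, value = token.split("=", 1)
            attrs["vars"][key] = value
    # Everything before the braces is the plain heading a casual reader sees.
    return line[:m.start()].strip(), attrs


print(parse_heading_attrs("# h1 heading { #idname .classname var=1 }"))
```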

What this allows for is for normal and expert users to share the same markdown vocab for the basics, but for more advanced features the more technical users can have a more expressive syntax (at the cost of intuitiveness). So we get the best of both without impacting on low or high technically abled writers.

This also allows for defining how to ignore the more technical features (or strip them), to provide a more compact markdown display that doesn't implement the more advanced features.

In which case we could still overload the fenced code block just be more specific:

```yaml {meta}
---
data:
  - foo
---
```
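
As a sketch of the "ignore or strip" side of this idea, the Python below removes fenced blocks whose info string contains `{meta}` and leaves every other block untouched. The `{meta}` marker is only the proposal from this post, `strip_meta_blocks` is a hypothetical name, and only plain three-backtick fences are handled.

```python
def strip_meta_blocks(text):
    """Drop fenced blocks tagged '{meta}'; keep all other lines as they are."""
    kept = []
    inside = skipping = False
    for line in text.splitlines():
        stripped = line.strip()
        if not inside and stripped.startswith("```"):
            inside = True
            skipping = "{meta}" in stripped
        elif inside and stripped == "```":
            inside = False
            if skipping:
                # Swallow the closing fence of a metadata block too.
                skipping = False
                continue
        if not skipping:
            kept.append(line)
    return "\n".join(kept)


sample = "Intro\n\n```yaml {meta}\n---\ndata:\n  - foo\n---\n```\n\nBody\n"
print(strip_meta_blocks(sample))
```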

What do you think about something like that?

@jgm I would like to "revive" this, now that we support CommonMark. At a minimum I want to allow you a "user option" for "traditional multiline" support, but need some metadata to accompany it.

Simplest solution in my mind would be to add this to the top of any documents you create.

<!-- softbreak: false -->
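
A minimal Python sketch of how an editor could read such an option, assuming one "key: value" pair per HTML comment and that the comments sit at the very top of the document. `read_leading_options` is a hypothetical name and values are returned as plain strings, so "false" would still need converting to a boolean.

```python
import re

# One whole-line HTML comment of the form <!-- key: value -->.
OPTION = re.compile(r"^\s*<!--\s*([\w-]+)\s*:\s*(.*?)\s*-->\s*$")


def read_leading_options(text):
    """Collect 'key: value' comments from the top; returns (options, rest)."""
    options = {}
    lines = text.splitlines()
    n = 0
    for line in lines:
        m = OPTION.match(line)
        if not m:
            break  # the first non-option line ends the metadata
        options[m.group(1)] = m.group(2)
        n += 1
    return options, "\n".join(lines[n:])


print(read_leading_options("<!-- softbreak: false -->\nHello\nworld"))
```

Because the option is an ordinary HTML comment, a renderer that knows nothing about it should simply pass it through as raw HTML.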

We could also go the whole hog here, but this chews up premium real estate in the editor.

---
softbreak: false
---

I guess the editor could be made smart enough just to hide the metadata and expose it via the options dialog.

Not sure what to do here. What is your preference?

An implementation can always decide to treat some specially marked initial lines of the document as metadata, rather than as part of the CommonMark content. I don't think there's any need to standardize on a format for this, since different applications will have different needs. So I still think that metadata should not be part of the spec.
