Metadata in documents

Who is Trans? I don’t know him as a YAML spec developer.

He is a poster on the yaml-core@lists.sourceforge.net mailing list. I don’t think he is an official YAML spec developer; I think he is just proposing an alternative.

Perhaps we can send Trans’ diet-YAML EBNF to an actual YAML spec developer and see if it makes sense to them. My biggest concern is whether a parser for it can ignore data entries it doesn’t recognise (since it’s a restricted subset of YAML).

+1 on restricting yaml to basic types + arrays.

Maybe you don’t know, but every hobbyist proposes alternatives :slight_smile: . JSON5, TOML, and so on. It is worth discussing serious things with the YAML spec authors. Everything else is a waste of time.

The best we can do on our side is to collect use cases first.

That’s difficult, and I don’t see practical reasons for it, except for text highlighting. There are many ways to “break” YAML with indentation, even without complex types:

a: b
c

Try it here: YAML parser for JavaScript - JS-YAML

+++ Vitaly Puzrin [Nov 12 14 10:27 ]:

I completely understand the worries about the complexity of full YAML.

At first glance, it can be enough if metadata contains stupid pairs:

---
foo: bar
baz: bad
---

One reason I ended up supporting both lists and objects in pandoc (which actually just uses a real YAML parser) is that both seem useful in document metadata, esp. for books:

For example, the following maps on to standard EPUB metadata:

+++ mofosyne [Nov 12 14 14:09 ]:

There is no official support for such a format at this time, but I have started an initial project to create such a standard, code-named “Diet YAML”.

GitHub - openbohemians/diet-yaml: A Low Calorie YAML Alternative

The idea is simply to take YAML as is and remove the “extraneous” features that are unnecessary for use as a basic configuration file format.


In that link, his EBNF (said to be work in progress) looks like

YAML ::= Start Data End
Start ::= ( "\n---" | "" )
End ::= ( "\n..." | "\n---" | "" )
Data ::= (Scalar | Sequence | Mapping )
Scalar ::= (Number | String | Date | Boolean | Nil)
Sequence ::= ( "[" Data ("," Data)* "]" | OptionalTab "-" Data ("\n" OptionalTab "-" Data)* )
Mapping ::= ( "{" Key ":" Data ("," Key ":" Data)* "}" | Tab Key ":" Data ("\n" Tab Key ":" Data)* )
OptionalTab ::= Space*
Tab ::= Space+
String ::= '"' .* '"' | [^-] .+
Number ::= ("+" | "-")? [0-9]* ("." [0-9]+)?
Date ::= [0-9][0-9][0-9][0-9] "-" [0-1][0-9] "-" [0-3][0-9] ( [0-2][0-9] ":" [0-5][0-9] ":" [0-5][0-9] )?
Boolean ::= "true" | "false"
Nil ::= "~"
Space ::= " "

Based on observation, this grammar supports the --- ... fencing. Each entry can be a scalar key: value pair, a sequence/list (via -), or a map (restricted to scalar entries). It supports only a limited set of core types (String, Number, Date, Boolean, Null).


This would not be hard to parse, but seems to lack what I think is a
crucial feature: the ability to specify multiline strings using |:

title: My Article
abstract: |
  This is the abstract of my
  article.  It can go on and on.

  It can even have two paragraphs,
  or a list:

  - one
  - two

Of course, multiline strings and quoting cannot be candidates for removal. That’s why I asked who the person proposing that notation is. I can prepare a better summary in a couple of days, if you wish. At first glance, these can be removed:

  • omap, pairs,
  • merge
  • anchors
  • custom types
  • binary type
  • binary and octal numbers, ‘infinity’
  • directives
  • writer
  • (? not sure) scientific floats (+1.2e5)
  • (? not sure) explicits (!!float 123, !!str true)

That does not touch the markup at all; it only removes features. And the result can still be parsed with full-weight implementations.

base64 binary could be useful for some - adding a preview image, thumbnail, cover, icon, avatar etc. to the document.

Thanks. I’m not familiar with the book preparation process.

It would be good to have a list of YAML capabilities to support natively. I can see binary data being useful in certain contexts.

Well, if you can get a list of core features that need to be kept, then that can be a start (and maybe we can forward that to either the YAML or the diet-YAML team).

A good yardstick could perhaps be whether it can convert to and from JSON/JSON5 at the least (if I remember correctly, JSON is not a very large specification).


Alternative proposed name for this stripped down YAML:

cYAML - Core YAML - C YAML: a small, fast, minimally specced YAML with a fast C reference parser. (I think that, at minimum, you should be able to convert between JSON and YAML.)

I want to make the point that there are two ways in which we might
think of a YAML metadata section fitting in to documents.

A. One is to think of YAML metadata blocks as part of CommonMark,
adding it to the spec, etc. With this method, metadata would be part of
the AST produced by a CommonMark parser. (That’s how it is in pandoc.)

B. Another is to think of a document like:

---
title: My title
author: Me
---

Starts here...

as really the combination of two separate documents – one, a YAML
document, which ends at the second ---, the other, a CommonMark
document.

A processor would then divide the two documents, parse the YAML document
with a proper YAML parser (or, if it likes, just skip it), and parse
the rest with a CommonMark parser.

The processor could use the values gleaned from the YAML document in
any way it likes – interpolating them into templates, for example,
and even perhaps running their string contents through a CommonMark
parser. The CommonMark spec wouldn’t need to know about this.
(This is how it is done in Jekyll, for example.)
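The division step in option B can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Jekyll or pandoc actually do it: the function name `split_front_matter` is hypothetical, the opening fence is assumed to be a `---` on the very first line, and either `---` or `...` is accepted as the closing fence. The YAML part is returned as an unparsed string, so the processor can hand it to any YAML parser or skip it.

```python
from typing import Optional, Tuple


def split_front_matter(text: str) -> Tuple[Optional[str], str]:
    """Split a document into (yaml_source, commonmark_source).

    Returns (None, text) when the first line is not a '---' fence
    or when no closing fence is found.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].rstrip() != "---":
        return None, text
    for i in range(1, len(lines)):
        # Assumption: '---' or '...' on its own line closes the YAML document.
        if lines[i].rstrip() in ("---", "..."):
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return None, text


doc = "---\ntitle: My title\nauthor: Me\n---\n\nStarts here...\n"
meta, body = split_front_matter(doc)
# meta holds the raw YAML text, body holds the CommonMark text.
```

Keeping the split separate from both parsers is the point of option B: the CommonMark parser only ever sees `body`.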

@jgm, this is already the second time there seems to be something missing from your message:

First:

Second:

Does that email-envelope (in the top right corner) mean that those replies come somehow through email, which might eat part of your message…

I have edited my post above. I did use email, and I note that I only indented the code block 3 spaces. For some reason, this caused Discourse to omit everything following. (I also see that if you click on the envelope icon, you can see the unedited contents of the original email.)

I guess your distinction boils down to whether CommonMark is “just” a markup language or also a file format. For example, HTML is clearly the latter, as it also specifies doctypes (<!DOCTYPE html>), and a head element containing metadata.

P.S.

Seems only the poster can do that, at least it doesn’t work for me.

Below is a reply from the mailing list, from Ingy dot Net ingy@ingy.net:


Thanks for bringing this up to the yaml-core mailing list. I’m not sure
even where to start. I’ll throw out some random points that come to mind:

  • YAML was designed to be a full, cross-language, data serialization
    language
    • It is just a current state of affairs that people use it mostly for
      trivial purposes like config files
    • There are minimal (not yaml.org approved) YAML implementations,
      that only exist in a particular language like Perl’s YAML Tiny
      https://metacpan.org/release/YAML-Tiny
  • I started the YAML2 discussions https://github.com/yaml/YAML2/wiki 3
    years ago to make YAML less complex without losing its powers
  • I’m working on a Pegex based YAML implementation that will generate
    parsers in all YAML languages from a single grammar
  • There are only 3 major differences between YAML and JSON (at the data
    model level):
    1. References
    2. Tags/types
    3. Non-string mapping keys
  • YAML implementations can be complete, full-stack, or minimal
    text→native

I think that the YAML spec documents cause implementor confusion because it
is unclear what needs to be implemented. These are my opinions on what
should be properly conveyed:

  • The YAML 1.2 syntax as specced is correct. (Though a 2.0 could make
    it simpler)
  • The default schema should only support JSON types: Str, Num, Bool,
    Null, Map, Seq
    • ie no Date, Set, OMap or any other should be made available by
      default
  • Only true/false/null (from JSON) should be implicitly recognized. Not
    the Yes/No/True/False/… options.
  • Merge key is something that should only be available as a plugin. This
    was just an idea we threw out, and for some legacy reasons some of the
    implementors implemented it and some did not.
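As an illustration of the implicit-typing rule in the list above, here is a minimal Python sketch of a resolver for plain (unquoted) scalars. It is an interpretation of the suggestion, not code from any YAML implementation: the name `resolve_plain_scalar` is hypothetical, and the number patterns follow the JSON number grammar as an assumption.

```python
import re
from typing import Union

Scalar = Union[None, bool, int, float, str]

# Assumption: numbers follow the JSON grammar (optional minus, no leading zeros).
_INT = re.compile(r"-?(0|[1-9][0-9]*)")
_FLOAT = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def resolve_plain_scalar(text: str) -> Scalar:
    """Map a plain scalar to a JSON type; anything unrecognized stays a string."""
    if text == "null":
        return None
    if text == "true":
        return True
    if text == "false":
        return False
    if _INT.fullmatch(text):
        return int(text)
    if _FLOAT.fullmatch(text):
        return float(text)
    # 'Yes', 'No', 'True', dates and everything else are left as strings.
    return text
```

Under this rule `Yes` and `2014-11-12` both load as plain strings, so a simple loader and a full one configured with the same schema would agree on the result.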

It seems we need a YAML implementors guide. I’m thinking that what you are
seeking could be part of this. I would encourage people not to fork YAML to
a simpler form, but to simply make weaker/simpler implementations according
to an agreed upon guide. Here are some basic thoughts on how this might
look:

  • Format is called YAML
    • .yaml and .yml extensions are used
  • Implementations can be called SimpleYaml or somesuch
  • Basic Loader restrictions:
    • Explicit Tags throw error on parse
    • Flow forms throw error on parse (except empty [] {} which have no
      block form)
    • JSON schema as above
    • Anchor/Alias throw error on parse
    • Non-string (plain/quoted) keys throw errors
    • No stack. Loader == Parser→Constructor
  • Dumper restrictions:
    • Dumpers must produce streams loadable by Loader above
    • Streams must be loadable by any more complex loader

In conclusion, there are ways to make YAML simpler on many levels without
forking it. I personally am interested in discussing them.

Consider joining #yaml on irc.perl.org to discuss further.

Cheers, Ingy

I strongly oppose this feature and hope that we can see through the flaws in this idea enough to drop it. As a standard, markdown should stay simple and focused on ease of implementation and universal compatibility. By introducing a data format - of any kind and any level of complexity - we will be introducing a feature that complicates this medium and cripples how other libraries can work with it.

Markdown can stand on its own, but metadata cannot. It must have an ultimate purpose, like being passed to a template engine as context to be used in templates, being passed to renderers/parsers as an options object, or whatever is required for the use case. Given that, we need to allow implementors to use whatever solution makes the most sense for parsing metadata and use the markdown parser they want for parsing markdown.

By implementing metadata, markdown will now have “compatibility issues”. As it stands, markdown has a clear purpose in life, which makes it easy to see how it fits into any application. This will not be the case if data enters the picture. The problem is that, regardless of best intentions, this feature will never be able to satisfy the needs of every user, parser, renderer, template engine, or implementor who might need such data. This means that other solutions will still need to be implemented for parsing data, which not only complicates decisions and implementation strategies, but will also virtually guarantee confusion among users who want to use both this solution and the implementor’s solution, or some combination of the two.

Markdown is not a data format, but it will be if this is implemented. We’ll need to decide which data format is correct, how much is “just enough”, who the consumers will be, etc. and this slippery slope will ultimately lead to religious battles over how much data is too much and: 1) why “my favorite data format isn’t supported”, 2) “can I use this data along with my jekyll front matter, or instead of it? because then I can’t use all of jekyll’s features”, etc. etc.

Data formats are use-case specific, and should not be related to “file type”: e.g. there are many document data and front matter parsers for many use cases, and none of them have any specific relationship to markdown. Why are we trying to create one? In other words, since front matter parsers will parse front matter from any file type (e.g. markdown, handlebars templates, HTML documents, whatever), if this feature is implemented, how should users format their data when both templates and markdown files are used? Should they ask the front matter parsing library to adopt the format you decide on here for handlebars templates (not going to happen)?

Parsing front matter is trivial. One can write a front-matter parser to extract data from a document in ~20 sloc, the result of which is a nice, clean string of pure markdown and an object of data that was created from whatever language the implementer preferred to use. By implementing this feature in markdown, you will greatly complicate this task by necessitating strategies for data conflict resolution and so on.

You raise some excellent points @jonschlinkert.

Is there a reason why meta data needs to be placed in the same file? A separate yaml file that points to the Markdown file could solve the Jekyll use case at least. Separation of concerns.

Yes. That’s convenient sometimes, for example in blog posts (title).

But I’m not sure such posts should be parsed directly by a markdown parser, without a preprocessor.

I think it’s much safer if it’s in a block generic directive, since displaying metadata between platforms (html, paper, etc…) is highly variable, and data (and thus metadata) in general are much more fragile than normally human typed text.

Maybe metadata should be thought of as “recommended” best practice when used as settings for various generic directives (basically restricted YAML syntax). As for embedding metadata within a document (rather than risking it getting lost by placing it in a separate file), it is best done with its own generic directive.

For those that need it maybe we can support these generic directives:

!metadata :~ Included in all parsers, but a stripped down YAML, to cover most obvious use cases. This should remain small and mostly unchanged throughout the life of CommonMark.

!YAML :~ optional extension (included in fatter parsers like pandoc) of full YAML

!json :~ optional extension of full json

So basically, keep the default metadata syntax as small as possible, and make full metadata support optional. Hopefully this addresses jonschlinkert’s concerns that this would make CommonMark too complex and unwieldy.

I think that the CommonMark spec should contain at least some very simple and minimal metadata format, so that applications that rely on a CommonMark parser would be able to use at least trivial key-value pairs out of the box, for example ^([0-9a-z]+):([^\n]*)$. If the app needs more, it can just store one value with base64-encoded data, a JSON object, or anything else; for everyone else it will be just a string.
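To show how little code that regex needs, here is a minimal Python sketch. The helper name `parse_pairs` is hypothetical, and stripping whitespace around each value is an added assumption that the pattern itself does not specify. Lines that do not match are simply ignored.

```python
import base64
import re
from typing import Dict

# The pattern from the post, applied line by line via MULTILINE.
PAIR = re.compile(r"^([0-9a-z]+):([^\n]*)$", re.MULTILINE)


def parse_pairs(block: str) -> Dict[str, str]:
    """Collect key: value lines from a metadata block; values stay strings."""
    # Assumption: surrounding whitespace in the value is not significant.
    return {key: value.strip() for key, value in PAIR.findall(block)}


meta = parse_pairs("title: My title\nauthor: Me\nNot A Pair\nicon: aGk=\n")
# An application that wants binary data can decode a value itself.
icon_bytes = base64.b64decode(meta["icon"])
```

The parser never interprets the values, so base64 or JSON payloads remain an application-level convention layered on plain strings.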

Since metadata is in fact application-specific, an application that requires something very complex can be expected to implement that on its own but it would help other developers if the minimum is already in place.

Another nice thing the spec could do is to list common metadata keys, such as Author, Title, Description etc. - so that content management systems have a reference from where to read such attributes.
