Embedded audio and video


I don’t understand why audio/video should be covered by the spec. This task can be completely automated by external preprocessor. For example, in nodeca:

  1. Inline links to youtube/vimeo/… are replaced with video title
  2. Links to youtube/vimeo/… on separate paragraph are replaced with video player

Rules can be extended to ANY url pattern, without spec change. So, i don’t know why audio/video files should be special case.

For node.js we did https://github.com/nodeca/embedza to automate things.

Discouse has similar ruby package, “onebox”, but AFAIK it does not supports inline links & flexible fallback (try block format if available, if not - try inline)


Does this link to the song or embed an audio player in the page? It is not clear what the author’s intention was. If we use an explicit syntax, ![](http://example.com/song.mp3) in the case of embeds and <http://example.com/song.mp3> in the case of links, both are possible to describe unambigiously.


@vitaly, how do I distinguish between this cases within the markdown-it plugin? This is the code I wrote some time before. It overrides “md.renderer.rules.link_open” renderer. How can I get if the link I parse is inlined or on a separate paragraph?


In nodeca we do such processing on separate stage, after markdown. We just load resulting html into AST (with jquery/cheerio), and check if link/image is the only content of paragraph or not.

I see no reasons to do everything at once in markdown parser, until you have very special requirements to speed.


Well, I already have working code, and it would be simpler to modify what I have and not rewrite everything from scratch, if it’s possible. Especially if it is faster :slight_smile:


So is it possible within the renderer?


No. Renderer operates with tokens and is not expected to analyze/modify combinations.

You can write plugin for core chain to scan for paragraphs, anylize if content is link_open + text + link_close and do replace. Or manually call parse + scan/replace + render.


Anything new in this ? Any standard way to do now?


Yes I agree.

To spec makers, they don’t need to create a new expression of insert audio/video url, that could go to a complex debate because it’s hard to find a simple and elegant way to do that.

To end users, they don’t need to think more and learn more, just copy/paste the code given by audio/video websites, like how we do now.

To developer, just make extensions to finish that work if they want. Once there is a better way to be created, we can decide whether or not put it into the spec.


Regarding this comment in the Issues we SHOULD resolve before 1.0 release topic:

Should we change the spec so that instead of “images” it talks of media more generally, and allow the image syntax to be used for audio and video? (Renderers could render appropriately for the media.)

My thoughts: for 1.0, rename this to a generic “embedded media” type, with image being the default media type that is rendered and used in the spec examples. The spec could mention that the embedded media syntax may be used to render other types of media based on file extension, the same way that the parser may render soft breaks as hard breaks (with soft breaks being the default). E.g. the spec could state:

A renderer may also provide an option to render the embedded media as a media type other than image, based on the file extension of the media file.

The whitelist of valid URL schemes for autolinks was removed in CommonMark 0.24. Omitting mention of particular file extensions/media types would be consistent with this. A formal list of file extensions is a moving target; keeping this list up to date falls quite far outside the scope of formalising Gruber’s Markdown, but perhaps a later extension could formalise how the media (based on file extension) should be rendered.


You will not be able to select proper wrapper without info about media type. And you can not trust extension in URL. Data from URL should be loaded and analyzed. There are 2 problems:

  • You can not define clear method to resolve media type
  • Download from remote is async by nature. That will be ass pain for parser architecture.

My suggestion is to avoid adding AI to spec.


@vitaly, that seems equally true for images – can we really “know” that a given URL represents an image without inspecting it? Right now when I use a CommonMark compliant parser and embed an MP4 URL, I get an <img> tag, even though the URL clearly provides useful information to the parser suggesting that this is not in fact what the user intended.

I also think it’s important to distinguish between YouTube/Vimeo style embeds (which rely on third party code) and browser-supported HTML5 video/audio playback (which is an open standard, just like <img> tags).

As a markdown user, I would find a system that inserts standards-compliant <video> or <audio> tags for appropriate embed URLs on the basis of extensions more than sufficient for most practical purposes. Having special syntax to force the embed type might be useful for special cases. But if I want to embed a locally uploaded MP4 video in my markdown blog post, that should “just work” on the basis of the URL.

I do agree that any advanced embedding of third party libraries doesn’t belong in the spec, but embedding a video or audio file in a standards compliant manner is, as far as I can tell, not meaningfully different from the already supported image use case. Am I missing something?


@vitaly, that seems equally true for images – can we really “know” that a given URL represents an image without inspecting it?

You missed the end goal - output markup. If it is <img> always, no need to analyze src, because it does not afftects output. But if output can be img/audio/video - some additional criteria required to choose right. Assumptions about file extensions in URL are too weak for spec IMHO.


I think a case could be made not to maintain a list of file extensions mapped against <img>, <video> and <audio> in the spec, just because what file types frequently updated browsers like Chrome do support natively can change pretty often.

The spec should IMO, however, make the recommendation to implementers to vary the output of what’s currently image-only markup based on the file extension of commonly natively supported file formats (with <img> being the default if the extension is not recognized, or if there is none), and provide implementers with sample output for <img>, <video> and <audio>. All three are first-class citizens in HTML5 and it makes no sense to me that markdown would not support them as such.

The behavior that markup like ![some text](https://someurl.com/some.mp3) produces an <img> embed of a URL that’s obviously not an image is counter-intuitive, and the whole point of using a language like markdown is IMO to provide the most intuitive result for the common cases, using markup that people can keep in their head. To the extent that the spec produces obviously counterintuitive results when implemented, I would argue that it needs to be revised not to do so.

I agree with previous commenters who have suggested extending the syntax to override that behavior on an as-needed basis, but if there’s no consensus what that should look like, I would still recommend changing the default behavior of ![x](y) type invocations in the way described above, i.e. make intelligent guesses based on commonly used extensions, and fall back to image markup if no intelligent guess can be made. I don’t think that would qualify as AI.

Although I disagree with you here, let me thank you for all your hard work on markdown-it; it’s a pleasure to use! And again - I may be missing something obvious and am mostly speaking from the perspective of someone using other people’s markdown parsers in a few different codebases.


Such recommendations as definitions with partial/incomplete coverage do not work well in programming. That quickly become “one more standard”, as it already happened with markdown.

I understand people who need audio/video with the same markup, but don’t see ways for correct implementation.

BTW, we use embedza for post-processing to beautify links (not images) an that works very well on practice. And more convenient on forums than use image tags.


Well, I would argue that leaving handling of even basic HTML5 features like <video> and <audio> tags entirely up to implementers is far worse from the perspective of implementation proliferation (and also anachronistic). After thinking about it a bit more, I do think it would be best to include a non-exhaustive list of extensions in the spec, with a note that implementers are free to add formats natively supported by widely used browsers to that list.

The list, AFAICT, would be:

  • <audio>: default for URLs that end with .wav, .mp3, .ogg (see below)
  • <video>: default for URLs that end with .mp4, .ogv, .webm
  • <img>: default for all other URLs

All matching would be case-insensitive.

.ogg is the most challenging since it is a container format associated with both video and audio in the wild, which has led to proliferation of the .ogv extension. This and similar edge cases would be a reason to offer an override in the markup, but in the most common cases, the above matching should work just fine, IMO.


I think all formats natively supported by widely used browsers should be included in the list to avoid ambiguity. As mentioned earlier, this is moving target, but we can attempt to keep the list up to date. As browsers add support for new formats, file extensions could be added to the list using a never remove, only add strategy.


We can discuss about it infinitely. What’s the final decision? We really want to merge @cmrd_senya PR in diaspora* core, waiting for the standard to be decided…


I think relying on the file name extension is much too brittle. Also, any list of file types/extensions will be quickly out-of-date.

I think it would be best to keep ![]() for images and introduce new syntax for “video” and “audio” (and possibly more in the future).

I think that images are used far more frequently, and it wouldn’t hurt if the others were a bit more verbose, e.g. !video[...](...) and !audio[...](...) (as was suggested before in this thread).


Literal keywords like video and audio are absolutely not the CM/MD way.