Speech synthesis markdown language? (Or conversational markup?)

tl;dr: The concept is a lightweight markup based on how people talk in instant messaging or IRC. This is unlike markdown’s source of influence, which is more based on emails and forum posts.

Just a thought:

There is already a speech synthesis markup language called SSML: http://www.w3.org/TR/speech-synthesis/

However, I do wonder whether it is worth considering what a markdownish speech synthesis markup language might look like. I don’t think you can use CommonMark/Markdown syntax as-is for speech synthesis. Take sarcasm, for example: you can’t really express it in pure text markup.

So far what I can think of:

  • ... – indicates pause in speech
  • * – can be used to emphasise words in speech
  • \<switch> – at the end of a line, indicates how the entire sentence should be delivered, e.g. You must reallly love to be a good guy /s
  • "scare quotes" – Or maybe sarcasm is better done as scare quotes " around the sarcastic expression?
  • :D – emoticons can perhaps be used to indicate that the preceding statement should be delivered in a certain tone, e.g. This really doesn't make me feel good D:
  • --> #id <-- – Allows the speaker to ‘point’ to a particular section in a page?
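
To make the list above more concrete, here is a minimal Python sketch of how a parser might pull these markers out of a single line. Everything in it is an assumption for illustration: the `TRAILING_TONES` table, the tone labels, the `Utterance` class and the `parse_line` function are hypothetical names, and only the pause, emphasis, trailing switch and emoticon ideas are covered.

```python
import re
from dataclasses import dataclass, field

# Hypothetical table: trailing switches and emoticons mapped to tone labels.
TRAILING_TONES = {"/s": "sarcastic", ":D": "happy", "D:": "dismayed"}


@dataclass
class Utterance:
    text: str                                        # line with markers stripped
    tone: str = "neutral"                            # tone from the trailing marker
    emphasized: list = field(default_factory=list)   # words wrapped in *...*
    pauses: int = 0                                  # number of "..." pauses


def parse_line(line: str) -> Utterance:
    words = line.strip().split()
    tone = "neutral"
    # A switch or emoticon only counts when it is the last token on the line.
    if words and words[-1] in TRAILING_TONES:
        tone = TRAILING_TONES[words.pop()]
    body = " ".join(words)
    pauses = body.count("...")
    emphasized = re.findall(r"\*([^*]+)\*", body)
    # Drop the asterisks but keep the emphasised words in the plain text.
    text = re.sub(r"\*([^*]+)\*", r"\1", body)
    return Utterance(text, tone, emphasized, pauses)
```

Treating the switch and the emoticon the same way (last token on the line) keeps the sketch small; whether both should really share one mechanism is an open question.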


Other applications

Since this is a more conversationally based markup language, it could also serve as a lightweight conversational markup that works well in instant messaging or IRC. So in general, this speech markup language should aim for speed of writing.

Copy of discussion at http://www.reddit.com/r/LightWeightMarkup/comments/2k6hfk/speech_synthesis_markdown_language/

Is this format in use? The draft is from ten years ago and I don’t see it listed in the W3C technical reports index.

Don’t know. But if not, then they should reconsider, especially since they brought in the HTML5 speech standard. It would kind of suck to always have monotonous text.

But either way, that old concept got me thinking that we should have a stripped-down and modified version of CommonMark, designed specifically for conversational use in IRC and instant messaging, with an emphasis on carrying emotional inflections in a parseable manner.

This would benefit:

  1. Emotional speech, which would help with understandability.
  2. Natural speaking for 3D avatars in VR, e.g. better detection of how an avatar should behave when you send a speech bubble (e.g. in MMOs).
  3. More fluid conversation in instant messaging, since it is designed for speed of typing.

If anybody is working on such a thing, do let me know. I would love to document it in a wiki.

Btw, do people always use emoticons at the end of sentences to indicate the tone of the previous sentence?

Sorry to be resurrecting such an old post but I’m just wondering if there is still any interest in this.
I’ve been working a lot with speech synthesis and found SSML rather tiresome myself. That’s why I created a Ruby gem that implements a kind of speech synthesis markdown (SSMD).

I’m open to suggestions and improvements to the specification.

It’s not yet fully implemented as I just started last week.
So far Text, Emphasis, Mark, Language, Phoneme and Prosody are supported.

Well, any thoughts about my initial suggestions and your implementation? It would be interesting to hear why you chose your approach.

My implementation is just a very basic mapping to SSML. Two things from your post are in it just like you suggested:

  • ... becomes <break>
  • *you* becomes <emphasis>you</emphasis>
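
The two mappings above can be sketched in a few lines of Python. This is not the gem's code, only an illustration under stated assumptions: `to_ssml` is a hypothetical name, the break is written as the self-closing `<break/>` so the output is well-formed XML, and the result is wrapped in SSML's `<speak>` root element.

```python
import re
from xml.sax.saxutils import escape


def to_ssml(text: str) -> str:
    # Escape &, < and > first so plain text cannot break the XML.
    out = escape(text)
    # *word* becomes <emphasis>word</emphasis>
    out = re.sub(r"\*([^*]+)\*", r"<emphasis>\1</emphasis>", out)
    # ... becomes an SSML break (self-closing form)
    out = out.replace("...", "<break/>")
    return f"<speak>{out}</speak>"


print(to_ssml("Well... *you* decide"))
```

Escaping before substituting matters here: doing it the other way round would also escape the tags that were just inserted.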

I wouldn’t even know how to do the rest, however. SSML (which is the standard used by all the speech synthesisers I know) doesn’t support complex stuff like sarcasm or emotions. It does allow for extensions, though.

For instance Amazon’s Polly has support for whispering (which is more than just turning down the volume).
Do you know of any speech synthesisers that support the things you suggested?
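
As a rough illustration of how a trailing switch could map onto such an extension, here is a small Python sketch that targets Polly's whispered effect (`<amazon:effect name="whispered">`). The `/w` switch and the `to_polly_ssml` function are assumptions made up for this example; neither comes from the original suggestions nor from the gem.

```python
from xml.sax.saxutils import escape


def to_polly_ssml(line: str) -> str:
    stripped = line.rstrip()
    # "/w" is a hypothetical switch invented for this sketch.
    if stripped.endswith(" /w"):
        inner = escape(stripped[:-3].rstrip())
        body = f'<amazon:effect name="whispered">{inner}</amazon:effect>'
    else:
        body = escape(stripped)
    return f"<speak>{body}</speak>"


print(to_polly_ssml("Don't tell anyone /w"))
```

A synthesiser without this extension would need a fallback, for example dropping the switch and speaking the line normally.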

This might not be useful if you prefer a “markdownish” solution.

We have come up with a lightweight wikitext-like markup language called WikiVoice. Instant messaging applications could integrate it and provide a select-n-mark user interface. It has been used by editors (typing in the markup manually) to edit raw scripts and fine-tune the derived voices in one go with the help of wiki2ssml.