Proper ruby text (<rb>) syntax support in Markdown

RSChiang · October 28, 2016, 7:19pm

It is fairly common for text in East Asian languages (mostly CJK characters) to carry ruby text annotations; not only do they provide phonetic guides, but the actual meaning of the text might even differ without them.

This technique is currently implemented in HTML as a set of <ruby> tags, as demonstrated below (refer to MDN for details).

<ruby>
躊<rp>（</rp><rt>ㄔㄡˊ</rt><rp>）</rp>
躇<rp>（</rp><rt>ㄔㄨˊ</rt><rp>）</rp>
</ruby>

<ruby>両人<rp>（</rp><rt>ふたり</rt><rp>）</rp></ruby>

In the wild, there are a few Markdown extensions that support generating ruby text, but their syntaxes are not consistent with one another.

The Python furigana_markdown package suggests the following syntax:

[図](-と)[書](-しょ)[館](-かん)

The Node.js showdown-kanji package goes a different way, but does not automatically generate <rp> fallback tags:

{漢}(かん){字}(じ)

The PHP parsedown-rubytext extension suggests quite a few ways for adding annotations. Either inline:

[図書館]^(としょかん)
[図書館]^（としょかん） // Full-width parentheses
[図書館]（としょかん）  // Full-width parentheses

Or by defining document-wide ruby text annotations:

**[図書館]: としょかん

It even allows space-separated ruby text to be split across the base characters:

 [図書館]^(と しょ かん)  // Will generate three <ruby> tags
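
A rough sketch of how such a split could work, assuming one space-separated reading per base character. This is not taken from parsedown-rubytext itself; the function name `pair_readings` and the fallback behaviour are hypothetical.

```python
from typing import List, Tuple


def pair_readings(base: str, ruby: str) -> List[Tuple[str, str]]:
    """Pair each base character with one space-separated reading.

    Assumption: the split only applies when the number of readings equals
    the number of base characters; otherwise the whole base keeps the
    whole annotation as a single pair.
    """
    readings = ruby.split()
    if len(readings) == len(base):
        return list(zip(base, readings))
    return [(base, ruby)]


print(pair_readings("図書館", "と しょ かん"))
print(pair_readings("図書館", "としょかん"))
```

Each resulting pair would then become its own <ruby> element.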

TL;DR: How could this syntax be properly implemented in the CommonMark spec?

The full-width parenthesis case might not fit well in the context of a language-independent spec, but the [base_text]^(ruby_text) syntax might be worth a try.
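
To make the [base_text]^(ruby_text) idea concrete, here is a minimal regex-based sketch in Python. It is only an illustration under simple assumptions (no nesting, no escaping, no interaction with link syntax); the names `RUBY_CARET` and `render_caret_ruby` are made up for this example and do not come from any of the packages above.

```python
import re
from html import escape

# Hypothetical pattern: [base]^(ruby), with no nested brackets or parentheses.
RUBY_CARET = re.compile(r"\[([^\[\]]+)\]\^\(([^()]+)\)")


def render_caret_ruby(text: str) -> str:
    """Replace every [base]^(ruby) span with a <ruby> element.

    The <rp> fallbacks use full-width parentheses, like the HTML sample
    at the top of this post. Text outside the matches is left untouched.
    """
    def to_html(match):
        base, ruby = match.group(1), match.group(2)
        return (
            f"<ruby>{escape(base)}"
            f"<rp>（</rp><rt>{escape(ruby)}</rt><rp>）</rp></ruby>"
        )

    return RUBY_CARET.sub(to_html, text)


print(render_caret_ruby("[図書館]^(としょかん)"))
```

A real CommonMark extension would need to hook into the inline parser rather than post-process text like this, so treat it as a way to discuss the syntax, not as a proposed implementation.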

DJTB · July 5, 2017, 8:10am

Would love support for this in CommonMark (and therefore in Discourse).
Anyone creating markdown with CJK characters would benefit enormously from this!

rfindley · July 10, 2017, 9:06pm

I’ve just added the comment below on meta.discourse.org [here]:

One other tag set worth whitelisting is ‘ruby’ tags, which are a standard part of html for Japanese language support.

Since there are thousands of kanji (e.g. 漢字) in the Japanese language, Japanese students are still learning them all the way through high school. So, it is common for publications to use “furigana” to mark the pronunciation above kanji (see snapshot below from NHK News website).

The W3C site has the following for a complete list of ruby tags:

<ruby> </ruby>
<rbc> </rbc>
<rtc> </rtc>
<rb> </rb>
<rt> </rt>
<rp> </rp>

I can’t think of any risks to whitelisting, since they’re pretty straightforward.

Here’s an example of html and result (as an image):

<ruby><rb>漢字</rb><rt>かんじ</rt></ruby>

[oops… I can’t post the image. New users can only add one image to a post]

(Looks like font size might need to be set to 1.2em, as I’ve done in the sample text above)

sam · July 10, 2017, 9:37pm

Is ruby considered a block level tag or an inline?

rfindley · July 10, 2017, 9:48pm

It’s an inline tag, pretty much like a specialized <span>.

obskyr · July 11, 2017, 7:57am

I’m very much for this. It could lead to much larger adoption of Markdown outside English-speaking communities, even.

I do have some thoughts about the syntax (as I’m sure many others will). Since Markdown is inspired by common plaintext practices, and is supposed to be minimal, syntax like [図](-と) is… well, a bit odd, to say the least.

StackExchange’s Japanese site has a fairly well-executed, site-wide implementation of ruby tags. It allows three different syntaxes: [漢字]{かんじ}, 漢字{かんじ}, and 漢字【かんじ】. Out of these three, the last one is the most similar to what people use in plain text. In fact, it is exactly what people use in plain text. The first syntax has the advantage that it’s unambiguous, which is required for putting ruby tags over anything other than an entire contiguous group of kanji.

Personally, I’d suggest the [漢字]{かんじ} (unambiguous; easy to type) and 漢字【かんじ】 (already widely understood; low-friction) syntaxes, perhaps with the addition of [漢字]【かんじ】 for consistency.

I should point out that the StackExchange implementation has a few bugs and oddities (there’s no escaping; ruby can’t be put above non-CJK characters), but as a reference point it’s very nice (and has already been put to good use).

Also, this makes me think… Wouldn’t it be beneficial for CommonMark to support full-width brackets (［］（）｛｝【】) in addition to normal ones? That way, it would be immensely more friendly toward CJK text, which most often uses full-width punctuation.

amclees · July 11, 2017, 8:27am

I wrote a Discourse plugin that adds this functionality. Here are a few things I noticed while writing it:

The single bracket support (漢字【かんじ】) is fairly dependent on characters outside of the CJK character range as terminators. It isn’t particularly suited for Chinese, but works fairly well with Japanese. For example, 猫【ねこ】は飛【と】んだ uses は to determine what the base text in the ruby tag is. Given that these brackets might be used without intending to place ruby text, they probably aren’t the best idea for Markdown.
It might be worth pattern matching the text in the rt with the base text for Japanese.
For example, [振り向く]{ふりむく}. Adding ruby text without this would be very tedious, because the full word must be input and then separately annotated.
However, this adds a lot of potentially unnecessary bloat to a fairly simple feature and doesn’t really apply outside Japanese.
On StackExchange there was demand for a format using ASCII (e.g. []{})
Escaping 【】 doesn’t seem to be standard
All browsers except Opera Mini have ruby support (even IE)
Certain email services (including gmail) do not support ruby tags in the email body
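
To illustrate the first point, here is a small Python sketch of the bare-bracket matching. It is a simplification written for this thread, not the plugin's code: `BARE_RUBY` and `find_bare_ruby` are hypothetical names, and the base text is assumed to be a run of characters in the CJK Unified Ideographs block.

```python
import re

# Assumption: the base is the longest run of CJK ideographs right before 【.
BARE_RUBY = re.compile(r"([\u4e00-\u9fff]+)【([^【】]+)】")


def find_bare_ruby(text):
    """Return (base, ruby) pairs found by the bare 漢字【かんじ】 syntax."""
    return BARE_RUBY.findall(text)


# Japanese: kana such as は end the ideograph run, so each base is one word.
print(find_bare_ruby("猫【ねこ】は飛【と】んだ"))

# Chinese (made-up sample sentence): nothing ends the run, so the base
# swallows the preceding character 我 as well.
print(find_bare_ruby("我躊躇【ㄔㄡˊㄔㄨˊ】"))
```

In the second call the annotation attaches to 我躊躇 rather than 躊躇, which is the kind of surprise that makes this form risky for all-hanzi text.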

obskyr · July 11, 2017, 8:36am

I thought about that, too. Since lenticular brackets are used in Chinese too, and Chinese is 100% hanzi, the 漢字【かんじ】 syntax could lead to surprises in Chinese text. That’s an argument against it. I don’t know how much CommonMark cares about entirely CJK text, but hey. Since images and links are also double-brackets in CommonMark, only supporting a []{} / []【】 syntax wouldn’t be too bad.

I’m also a bit apprehensive about this one. It’s sort of inconsistent. What if you wanted to write <ruby>振り向く<rt>ふりむく</rt></ruby>? Having that matching thing there would make that impossible. It’s also a kind of strange exception for the otherwise completely unambiguous bracket set syntax.

What do you mean by that?

rfindley · July 11, 2017, 8:43am

Whatever the markdown, the output html should follow the most widely supported tag format. I think that would be:

<ruby><rb>base_text</rb><rp>(</rp><rt>ruby_text</rt><rp>)</rp></ruby>

The <rp> tag marks a parenthesis that non-supporting browsers would display:

base_text(ruby_text)
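
As a sketch of that output format, the Python below builds the element and then strips all tags to imitate, very crudely, what a non-supporting renderer would show. The function names and the configurable fallback delimiters are assumptions for illustration only.

```python
import re
from html import escape


def render_ruby(base: str, ruby: str, open_rp: str = "(", close_rp: str = ")") -> str:
    """Build <ruby> markup with <rb>, <rt>, and <rp> fallback delimiters."""
    return (
        f"<ruby><rb>{escape(base)}</rb>"
        f"<rp>{escape(open_rp)}</rp><rt>{escape(ruby)}</rt>"
        f"<rp>{escape(close_rp)}</rp></ruby>"
    )


def strip_tags(html: str) -> str:
    """Crude stand-in for a renderer that ignores ruby markup entirely."""
    return re.sub(r"<[^>]+>", "", html)


html = render_ruby("base_text", "ruby_text")
print(html)
print(strip_tags(html))
```

Keeping the delimiters as parameters leaves the choice of fallback characters open, since that depends on the surrounding text.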

obskyr · July 11, 2017, 8:56am

Absolutely. I shortened my example for brevity’s sake - <rb> and <rp> tags should of course be used in practice.

That raises another thought, by the way - the <rp> tags should probably contain 【 and 】, unless there’s a common application for ruby tags where those aren’t appropriate.

Edit: Here’s some more info: <ruby> tags, along with <rt> and <rp>, are part of the living HTML standard. <rb> and <rtc>, however, are only part of the W3C HTML5 spec. This could mean that support for them is worse (although I don’t know). For the purposes of this, however, that doesn’t matter: <rtc> isn’t used at all, and <rb> doesn’t actually do anything (and will thus work in browsers that don’t support it) unless tags are out of order (e.g. <rb>1</rb> <rb>2</rb> <rb>3</rb> <rt>1</rt> <rt>2</rt> <rt>3</rt>, which this extension wouldn’t do).

rfindley · July 11, 2017, 9:17am

I’d be inclined to agree, though my only exposure to ruby text is for Japanese. If 【】 is common across other languages, and won’t lead to ambiguity with other uses of 【】, then yeah. The standard mentions how ( ) can be ambiguous in some situations, so the same logic applies to whatever delimiter is chosen.

obskyr · July 11, 2017, 9:27am

I think it’s safe to say that "(" and ")" are bad choices, in any case, as they would look ridiculous in most text. " (" and ") " are better (notice the space?), but not too friendly toward full-width text. Since the spec specifically mentions that ruby is “primarily used in East Asian typography as a guide for pronunciation or to include other annotations”, I think 【 and 】 are a fairly safe bet.

sam · July 11, 2017, 12:46pm

This looks awesome, but we’ve got to update it for the markdown-it engine, similar to how all the extensions in https://github.com/discourse/discourse/tree/master/app/assets/javascripts/pretty-text/engines/markdown-it work.

obskyr · July 11, 2017, 3:30pm

Went ahead and wrote a tangentially related proposal for full-width formatting characters. It would make furigana possible within entirely full-width Japanese text (e.g. いい［提案］【ていあん】ですね。).

amclees · July 11, 2017, 6:07pm

I can’t think of a case where I would want to display that. It is an exception, but it makes it much easier to add ruby text for compound words. Using the same example, it would be denoted
振【ふ】り向【む】く
or
[振]{ふ}り[向]{む}く
This requires typing out the compound to get it to appear in the IME, then backtracking to add the ruby text. It might be worth it as an optional feature.
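
For reference, the matching can be sketched in a few lines of Python. This is a simplified heuristic and not the plugin's code: the name `align_okurigana` is hypothetical, kanji are assumed to be characters in the CJK Unified Ideographs block (plus 々), and the lazy matching can misalign when the same kana appears both inside a kanji reading and as okurigana.

```python
import re
from typing import List, Optional, Tuple

KANJI_RUN = re.compile(r"[\u4e00-\u9fff々]+")


def align_okurigana(base: str, reading: str) -> List[Tuple[str, Optional[str]]]:
    """Split base into (text, reading) segments; kana segments get None.

    Falls back to annotating the whole base when the kana in the base
    cannot be lined up with the reading.
    """
    # Break the base into alternating kanji and non-kanji runs.
    parts = []  # list of (text, is_kanji)
    pos = 0
    for m in KANJI_RUN.finditer(base):
        if m.start() > pos:
            parts.append((base[pos:m.start()], False))
        parts.append((m.group(), True))
        pos = m.end()
    if pos < len(base):
        parts.append((base[pos:], False))

    # Kanji runs become capture groups; kana must appear literally.
    pattern = "".join(
        "(.+?)" if is_kanji else re.escape(text) for text, is_kanji in parts
    )
    match = re.fullmatch(pattern, reading)
    if match is None:
        return [(base, reading)]

    readings = iter(match.groups())
    return [(text, next(readings) if is_kanji else None) for text, is_kanji in parts]


print(align_okurigana("振り向く", "ふりむく"))
```

Only the segments with a reading would be wrapped in <ruby>; the kana segments pass through as plain text.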

In Discourse, you normally can escape brackets with a backslash (applies to {}, (), []). In my plugin I got rid of all backslashes before 【】 in the baked text and ignored the ones with backslashes, but it might be worth allowing them to be escaped like any other set of brackets. The other full-width brackets would also be good candidates.

I will update it to the markdown-it engine. It shouldn’t be difficult since it is a preprocessor on the whole text.

sam · July 11, 2017, 6:47pm

The new engine does not like this, for very good reason. When you add rules you need to find the right place to inject them; in this case it would be an inline rule, so you could probably just push it to the end of the stack.

amclees · July 13, 2017, 3:21am

I made the []{} version on the markdown-it engine as an inline rule on the top of the stack. It should run a lot better than large numbers of unreadably large regexes.
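
The general shape of a scanning rule, as opposed to whole-text regexes, looks roughly like the Python below. This is a language-neutral sketch and not the markdown-it API or the plugin's code; `scan_ruby` and `render_inline` are hypothetical names, and escapes inside the brackets themselves are not handled.

```python
from typing import Optional, Tuple


def scan_ruby(src: str, pos: int) -> Optional[Tuple[str, str, int]]:
    """Try to read [base]{ruby} at src[pos]; return (base, ruby, next_pos)."""
    if pos >= len(src) or src[pos] != "[":
        return None
    close = src.find("]", pos + 1)
    if close == -1 or not src.startswith("{", close + 1):
        return None
    end = src.find("}", close + 2)
    if end == -1:
        return None
    base, ruby = src[pos + 1:close], src[close + 2:end]
    if not base or not ruby:
        return None
    return base, ruby, end + 1


def render_inline(src: str) -> str:
    """Walk the text once, trying the ruby rule wherever a [ appears."""
    out = []
    pos = 0
    while pos < len(src):
        ch = src[pos]
        # A backslash before a bracket (or another backslash) escapes it.
        if ch == "\\" and pos + 1 < len(src) and src[pos + 1] in "[]{}\\":
            out.append(src[pos + 1])
            pos += 2
            continue
        found = scan_ruby(src, pos) if ch == "[" else None
        if found:
            base, ruby, pos = found
            out.append(f"<ruby>{base}<rt>{ruby}</rt></ruby>")
            continue
        out.append(ch)
        pos += 1
    return "".join(out)


print(render_inline("いい[提案]{ていあん}ですね。"))
print(render_inline("\\[漢字]{かんじ}"))
```

The second call shows the escaped form coming out as literal brackets instead of a ruby element.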

The inline rule will work with the character-separated syntax for multiple ruby tags ([図書館]^(としょかん)). It can also be switched over to the []^() syntax if necessary.

Supporting full-width brackets in an inline rule right now is problematic because markdown-it does not stop on them. However, it shouldn’t be an issue with []【】.

I can add the same type of pattern matching as the old version I wrote but it might be better as an optional feature rather than a part of the spec.

What syntax would be best for CommonMark? I think []{} is the easiest to type in most cases.

The 【】 syntax is probably too dependent on the text type for the spec. It uses non-CJK characters to determine where to place the ruby tag, and only saves a couple keystrokes. It could also have unintended consequences on existing plaintext documents (【】 are used in headings/titles).

obskyr · July 13, 2017, 7:33am

Well, not if you only implement the unambiguous version ([]【】 / ［］【】).

amclees · July 13, 2017, 8:19am

Yes. I was referring to the 漢字【かんじ】 syntax only. The unambiguous one would be completely fine.

Without full support for full-width brackets in CommonMark though, ［］【】 and other syntax with only full-width brackets would be a bit of an issue.

The implementation Discourse uses, markdown-it, skips over sections of text not in a specific set of characters. This set does not include any full-width characters since they are not used elsewhere in the spec.
There would be no way to escape it.
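
A toy model of the skipping behaviour, in Python. The terminator set here is an illustrative subset I picked for the example, not the parser's actual list, and `rule_start_positions` is a hypothetical helper.

```python
# Assumption: inline rules are only tried at positions holding one of these
# ASCII characters; everything else is consumed as plain text.
TERMINATORS = set("\\[]{}*_`<&!\n")


def rule_start_positions(src: str):
    """Return the indexes where an inline rule would get a chance to run."""
    return [i for i, ch in enumerate(src) if ch in TERMINATORS]


print(rule_start_positions("[提案]【ていあん】"))    # ASCII brackets are seen
print(rule_start_positions("［提案］【ていあん】"))  # full-width only: nothing
```

With only full-width brackets in the text, no position is ever offered to a ruby rule, and a backslash in front of ［ would not be treated as an escape for it either.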

My plugin uses markdown-it so adding a syntax with only full-width brackets will require some hacks to address the above.

The best solution for adding something like ［］【】 would be full-width support in the spec as you proposed.

codinghorror · July 21, 2017, 10:15pm

I still do not understand why simply specifying a monospace font in the editor is not a perfectly fine solution here, and it is vastly simpler.