Proper ruby text (<rb>) syntax support in Markdown

Whatever the Markdown syntax ends up being, the output HTML should follow the most widely supported tag format. I think that would be:

<ruby><rb>base_text</rb><rp>(</rp><rt>ruby_text</rt><rp>)</rp></ruby>

The <rp> tag marks a parenthesis that non-supporting browsers would display:

base_text(ruby_text)
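
For illustration, here is a minimal Python sketch of that output format and its fallback. The function names `ruby_html` and `fallback_text` are made up for this example and are not taken from any plugin.

```python
from html import escape


def ruby_html(base: str, reading: str, open_paren: str = "(", close_paren: str = ")") -> str:
    """Build <ruby> markup with <rb>, <rt> and <rp> fallback parentheses."""
    return (
        "<ruby>"
        f"<rb>{escape(base)}</rb>"
        f"<rp>{escape(open_paren)}</rp>"
        f"<rt>{escape(reading)}</rt>"
        f"<rp>{escape(close_paren)}</rp>"
        "</ruby>"
    )


def fallback_text(base: str, reading: str, open_paren: str = "(", close_paren: str = ")") -> str:
    """Approximate what a browser without ruby support shows:
    the tags are ignored, but the <rp> contents stay visible."""
    return f"{base}{open_paren}{reading}{close_paren}"


print(ruby_html("base_text", "ruby_text"))
print(fallback_text("base_text", "ruby_text"))
```

The parenthesis characters are parameters so that a different delimiter pair can be swapped in without touching the rest of the markup.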

Absolutely. I shortened my example for brevity’s sake - <rb> and <rp> tags should of course be used in practice.

That raises another thought, by the way - the <rp> tags should probably contain 【 and 】, unless there’s a common application for ruby tags where those aren’t appropriate.

Edit: Here’s some more info: <ruby> tags, along with <rt> and <rp>, are part of the living HTML standard. <rb> and <rtc>, however, are part of the HTML5 spec. This could mean that support for them is worse (although I don’t know). For the purposes of this, however, that doesn’t matter: <rtc> isn’t used at all, and <rb> doesn’t actually do anything (and will thus work in browsers that don’t support it) unless tags are out of order (e.g. <rb>1</rb> <rb>2</rb> <rb>3</rb> <rt>1</rt> <rt>2</rt> <rt>3</rt>, which this extension wouldn’t do).

I’d be inclined to agree, though my only exposure to ruby text is for Japanese. If 【】 is common across other languages, and won’t lead to ambiguity with other uses of 【】, then yeah. The standard mentions how ( ) can be ambiguous in some situations, so the same logic applies to whatever delimiter is chosen.

I think it’s safe to say that ( and ) are bad choices, in any case, as they would look ridiculous in most text.  ( and ) are better (notice the space?), but not too friendly toward full-width text. Since the spec specifically mentions that ruby is “primarily used in East Asian typography as a guide for pronunciation or to include other annotations”, I think 【 and 】 are a fairly safe bet.

This looks awesome, but we have to update it for the markdown-it engine, similar to how all the extensions in https://github.com/discourse/discourse/tree/master/app/assets/javascripts/pretty-text/engines/markdown-it work.

Went ahead and wrote a tangentially related proposal for full-width formatting characters. It would make furigana possible within entirely full-width Japanese text (e.g. いい[提案]【ていあん】ですね。).

I can’t think of a case where I would want to display that. It is an exception, but it makes it much easier to add ruby text for compound words. Using the same example, it would be denoted
振【ふ】り向【む】く
or
[振]{ふ}り[向]{む}く
This requires typing out the compound to get it to appear in the IME, then backtracking to add the ruby text. It might be worth it as an optional feature.
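
To make the bracketed variant concrete, here is a rough Python sketch that splits a string into plain segments and (base, reading) pairs. The name `split_segments` and the exact regex are assumptions for illustration only, not how the plugin does it.

```python
import re

# [base]{reading}: neither part may contain its own bracket characters.
PAIR = re.compile(r"\[([^\[\]]+)\]\{([^{}]+)\}")


def split_segments(text):
    """Return a list of (text, reading) tuples; reading is None for plain text."""
    segments = []
    pos = 0
    for match in PAIR.finditer(text):
        if match.start() > pos:
            # Plain text between two annotated pieces, e.g. okurigana.
            segments.append((text[pos:match.start()], None))
        segments.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(text):
        segments.append((text[pos:], None))
    return segments


print(split_segments("[振]{ふ}り[向]{む}く"))
```

Each kanji gets its own reading while the kana in between stay unannotated, which is the per-character behaviour described above.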

In Discourse, you normally can escape brackets with a backslash (applies to {}, (), []). In my plugin I got rid of all backslashes before 【】 in the baked text and ignored the ones with backslashes, but it might be worth allowing them to be escaped like any other set of brackets. The other full-width brackets would also be good candidates.
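
A small Python sketch of what backslash escaping could look like for the []【】 form. This is only an assumption about the behaviour, with a made-up `bake` function. It leaves `\[` alone on the theory that the Markdown engine already handles that escape, and only strips backslashes in front of the full-width brackets.

```python
import re

# Unescaped [base]【reading】; a backslash right before "[" disables the match,
# and a backslash between "]" and "【" breaks the required adjacency.
RUBY = re.compile(r"(?<!\\)\[([^\[\]]+)\]【([^【】]+)】")
ESCAPED_FULLWIDTH = re.compile(r"\\([【】])")


def bake(text):
    """Convert unescaped ruby syntax, then drop backslashes before 【 and 】."""
    text = RUBY.sub(r"<ruby><rb>\1</rb><rp>(</rp><rt>\2</rt><rp>)</rp></ruby>", text)
    return ESCAPED_FULLWIDTH.sub(r"\1", text)


print(bake("[漢字]【かんじ】"))
print(bake("[漢字]\\【かんじ】"))
```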

I will update it for the markdown-it engine. It shouldn’t be difficult since it is a preprocessor on the whole text.

The new engine does not like this, for very good reason. When you add rules you need to find the right place to inject them. In this case it would be an inline rule, so you could probably just push it to the end of the stack.

I implemented the []{} version on the markdown-it engine as an inline rule at the top of the stack. It should run a lot better than large numbers of unreadably large regexes.

The inline rule will work with the character-separated syntax for multiple ruby tags ([図書館]^(と しょ かん)). It can also be switched over to the []^() syntax if necessary.
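
As a sketch of how the separated readings could be paired up with base characters, in Python: the pairing rule (one reading per base character when the counts match, otherwise one annotation for the whole word) and the exact output markup are assumptions made for this example, not a description of the actual rule.

```python
import re

CARET_RUBY = re.compile(r"\[([^\[\]]+)\]\^\(([^()]+)\)")


def render(match):
    base = match.group(1)
    readings = match.group(2).split()  # split on any whitespace
    if len(readings) == len(base):
        # One reading per base character.
        pairs = list(zip(base, readings))
    else:
        # Counts differ: annotate the whole base as one unit.
        pairs = [(base, " ".join(readings))]
    body = "".join(f"<rb>{b}</rb><rt>{r}</rt>" for b, r in pairs)
    return f"<ruby>{body}</ruby>"


def convert(text):
    return CARET_RUBY.sub(render, text)


print(convert("[図書館]^(と しょ かん)"))
print(convert("[今日]^(きょう)"))
```

The second call shows the fallback case: 今日 has two characters but a single reading, so it is kept as one unit.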

Supporting full-width brackets in an inline rule right now is problematic because markdown-it does not stop on them. However, it shouldn’t be an issue with []【】.

I can add the same type of pattern matching as the old version I wrote but it might be better as an optional feature rather than a part of the spec.

What syntax would be best for CommonMark? I think []{} is the easiest to type in most cases.

The 【】 syntax is probably too dependent on the text type for the spec. It uses non-CJK characters to determine where to place the ruby tag, and only saves a couple keystrokes. It could also have unintended consequences on existing plaintext documents (【】 are used in headings/titles).
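
A quick Python sketch of why the bare 漢字【かんじ】 form depends on the text type. It assumes the base is simply the run of characters from the main CJK ideograph block directly before 【. That is a simplification chosen for this example, and `convert_bare` is a made-up name.

```python
import re

# Base = one or more CJK unified ideographs immediately before 【.
BARE_RUBY = re.compile(r"([\u4e00-\u9fff]+)【([^【】]+)】")


def convert_bare(text):
    return BARE_RUBY.sub(r"<ruby><rb>\1</rb><rp>(</rp><rt>\2</rt><rp>)</rp></ruby>", text)


# Intended use: kana before the kanji end the base automatically.
print(convert_bare("これは漢字【かんじ】です"))

# A heading-style bracket at the start of a line is left alone...
print(convert_bare("【お知らせ】明日は休み"))

# ...but the same bracket right after kanji is silently turned into ruby.
print(convert_bare("本日【重要】"))
```

The last case is the kind of unintended match on existing plaintext that the explicit [] form avoids.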

Well, not if you only implement the unambiguous version ([]【】 / []【】).

Yes. I was referring to the 漢字【かんじ】 syntax only. The unambiguous one would be completely fine.

Without full support for full-width brackets in CommonMark though, []【】 and other syntax with only full-width brackets would be a bit of an issue.

  1. The implementation Discourse uses, markdown-it, skips over sections of text not in a specific set of characters. This set does not include any full-width characters since they are not used elsewhere in the spec.
  2. There would be no way to escape it.
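
To illustrate point 1, here is a simplified Python model of the skipping behaviour. The character set below is my approximation of the ASCII punctuation markdown-it stops on and should be treated as an assumption. The real parser also jumps past whatever a matching rule consumes, which this sketch ignores.

```python
# Approximate set of characters at which inline rules get a chance to run.
# Assumption: modelled on markdown-it's text rule; it contains no full-width characters.
TERMINATORS = set("\n!#$%&*+-:<=>@[\\]^_`{}~")


def rule_positions(src):
    """Indexes where an inline rule could be tried; everything else is skipped as plain text."""
    return [i for i, ch in enumerate(src) if ch in TERMINATORS]


print(rule_positions("[振]{ふ}"))   # every bracket is a stop character
print(rule_positions("[振]【ふ】"))  # only the ASCII brackets are
print(rule_positions("振【ふ】"))    # nothing stops the scan at all
```

With only full-width brackets the scan never pauses, so a rule that wants to start at 【 is never given the chance.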

My plugin uses markdown-it so adding a syntax with only full-width brackets will require some hacks to address the above.

The best solution for adding something like []【】 would be full-width support in the spec as you proposed.

I still do not understand why simply specifying a monospace font in the editor is not a perfectly fine solution here, and it is vastly simpler.

The full-width issue relates to the list of stop chars for inlines. It’s an implementation detail that makes it annoying to build this plugin, but my new text post-processor helps a bit (though if you want bold or italic ruby text you would be stuck).

Be sure to read

Would you use ruby text for 爨? Are people able to handwrite that?

Whether you use furigana (the Japanese use case for ruby) or not is really dependent on text type and audience. If you were inclined to use furigana at all, though, 爨 is most definitely a kanji you would use it on, as it’s both hyōgai (non-standard) and very complicated. For example, you’d write [炊爨]【すいさん】 or [爨]【かし】ぐ to make things readable.

I am warming up to just adding this syntax to Discourse core, provided it is behind a site setting.

Can you clarify if you require formatting in the brackets, e.g. [*爨*]【*かし*】?

I’ve never seen anyone use any formatting within Ruby text, and both Japanese and Chinese traditionally lack both bold and italic, so it most definitely isn’t too important.

I don’t think there’s a reason to explicitly forbid it, though. Since formatting is available in regular CJK text (), it might as well be valid within the base text of a Ruby tag. I can also imagine a scenario where someone would want to emphasize a certain part of pronunciation ("She actually says [寂]【さ**み**】しい in this case"), in which case formatting in the actual Ruby text would be useful. None of those would be particularly common, I don’t think, but it could become a minor thing people would stumble over every now and then.

The one thing that might get weird is ending a formatting block within a Ruby block. For example, *Italic outside[italic base]【italic ruby* normal ruby】. Either formatting could be disallowed within Ruby tags, or this could result in something like:

<i>Italic outside</i>
<ruby>
    <rb>
        <i>italic base</i>
    </rb>
    <rt>
        <i>italic ruby</i> normal ruby
    </rt>
</ruby>

I suppose it could also introduce a new “context” where new formatting tags can be started but earlier ones can’t be completed… I’m sure you’re more experienced with that kind of thing than I am, though – it’s bound to have come up with other elements before. Stick to what CommonMark usually does, I guess.

In the Discourse context, the main reason is that I cannot make this an inline rule; it would have to be a post-process rule that walks through text nodes. ]【 and the like are skipped in inlines, so you would get no formatting in these tags.
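
A toy Python model of that post-process approach, to show why formatting inside the brackets falls out. Tokens are plain dicts here and `postprocess` is a made-up name. Real markdown-it tokens are richer objects, so treat this purely as a sketch under those assumptions.

```python
import re

RUBY = re.compile(r"\[([^\[\]]+)\]【([^【】]+)】")


def postprocess(tokens):
    """Walk the token list and split text tokens around [base]【reading】 matches."""
    out = []
    for tok in tokens:
        if tok["type"] != "text":
            out.append(tok)  # formatting tokens pass through untouched
            continue
        content, pos = tok["content"], 0
        for m in RUBY.finditer(content):
            if m.start() > pos:
                out.append({"type": "text", "content": content[pos:m.start()]})
            out.append({"type": "ruby", "base": m.group(1), "reading": m.group(2)})
            pos = m.end()
        if pos < len(content):
            out.append({"type": "text", "content": content[pos:]})
    return out


# Plain case: the whole pattern sits inside one text token and is converted.
print(postprocess([{"type": "text", "content": "[爨]【かし】ぐ"}]))

# With emphasis inside the brackets the pattern is spread over several
# tokens, so no single text token matches and nothing is converted.
emphasised = [
    {"type": "text", "content": "["},
    {"type": "em_open"},
    {"type": "text", "content": "爨"},
    {"type": "em_close"},
    {"type": "text", "content": "]【かし】ぐ"},
]
print(postprocess(emphasised) == emphasised)
```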

I see. In that case, it definitely isn’t a very big deal at all.

What about inclusion in CommonMark itself? It’d certainly be useful outside of Discourse too.

My vote is yes, it should be included, but I have no real say here at all. Deciding what is or is not included is up to @jgm.
