Explicit RTL indication in pure Markdown

Hi,

I strongly believe that RTL in Markdown should be explicit and have its own syntax within the Markdown document itself.

Why?
The current state of writing non-LTR documents with Markdown is simply awful… there’s no way to say that this element or document is RTL. Hence any RTL document will result in garbled text, especially if there are Latin words within the RTL text. Some think there’s no problem and that this is abstracted away by Unicode or something… but Unicode does NOT handle direction…

Is this implementation/render-specific?

Some folks say: man, just put dir='auto' in the resulting HTML, let the browser do the work and you’re done! Others suggest making the parser detect the language being parsed and its direction… but this won’t cover the majority of cases… it just won’t help…

Example:
First, to understand what I mean, compare the following two <code> blocks, the first without dir="rtl" and the second with it:

أسكن في SidiAmar, Annaba, Algeria.
أسكن في SidiAmar, Annaba, Algeria.

The first one is the current state. Some parsers or language detectors (like Facebook comments) detect the content’s language and decide its direction… but the majority of them (like Facebook) count the RTL words and the Latin words, and the higher count wins. In our example above, sum(Latin words) > sum(Arabic words), so the text is treated as LTR, but it is NOT.
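To make the word-counting heuristic concrete, here is a minimal sketch in Python. The function names are made up for illustration, and real detectors (Facebook's included) may well work differently; this only shows why a majority vote picks LTR for the sentence above.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes


def word_direction(word):
    """Return 'rtl' or 'ltr' from the first strong character, or None."""
    for ch in word:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return "rtl"
        if bidi == "L":
            return "ltr"
    return None


def majority_direction(text):
    """Hypothetical 'higher count wins' detector: ties fall back to LTR."""
    directions = [word_direction(w) for w in text.split()]
    rtl = directions.count("rtl")
    ltr = directions.count("ltr")
    return "rtl" if rtl > ltr else "ltr"


sample = "أسكن في SidiAmar, Annaba, Algeria."
print(majority_direction(sample))  # ltr (2 Arabic words vs 3 Latin words)
```

The sentence is Arabic with an embedded Latin address, yet the vote goes to LTR because the address happens to contain more words.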

dir="auto" is a bit smarter and it surely helps, but it is unreliable… here’s why:

Sidi Amar, Annaba, Algeria، هذا مقر سكني

Inspect the above <code> element. It has dir="auto". Did it help? Surely not, because it thinks the text is LTR as it starts with Latin words, but the text is intended to be RTL. Now change it to dir="rtl" to see what I mean.
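A rough Python approximation of what dir="auto" does with that line, assuming it boils down to "take the direction of the first strongly directional character" (browsers have extra rules about which child elements to skip, which are ignored here; `first_strong_direction` is a name invented for this sketch):

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Approximate dir="auto": the first strong character decides."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    return default  # no strong character at all


sample = "Sidi Amar, Annaba, Algeria، هذا مقر سكني"
print(first_strong_direction(sample))  # ltr, because the text starts with "S"
```

The author intended an RTL sentence, but nothing in the text itself can tell the heuristic that.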

Conclusion

RTL needs to have an explicit indication in Markdown regardless of the content language… the examples I gave are very narrow; there are other use cases that really need an explicit RTL indication… (think about editors as well: how can one write RTL in Markdown editors?)

If we want to spread Markdown in wikis, documentation, books, etc., this needs to be in the spec due to its importance…

Spec example:

<--rtl-- Foo
Bar

Would result in:

Foo Bar

<p dir="rtl">Foo
Bar</p>
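For what it's worth, here is a minimal Python sketch of how a renderer could handle such a marker. The `<--rtl--` syntax is only the proposal from this post, not part of any spec, and the `ltr` variant and the function name are my own assumptions.

```python
import html
import re

# Hypothetical paragraph marker from the proposal; "ltr" is an assumed extension.
MARKER = re.compile(r"^<--(rtl|ltr)--\s*")


def render_paragraph(block):
    """Render one paragraph block, honouring a leading direction marker."""
    match = MARKER.match(block)
    if match:
        body = html.escape(block[match.end():], quote=False)
        return f'<p dir="{match.group(1)}">{body}</p>'
    return f"<p>{html.escape(block, quote=False)}</p>"


print(render_paragraph("<--rtl-- Foo\nBar"))
```

Paragraphs without the marker are emitted unchanged, so existing documents would not be affected.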

Thanks.

I guess you could always resort to the right-to-left mark Unicode character (U+200F) and its HTML named entity &rlm;, although it’s not exactly nice.

Other than that, we’d need attributes on paragraphs, like:

{dir=rtl}
paragraph

or equally:

paragraph
{dir=rtl}
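A small Python sketch of how a parser might pick up such an attribute line, whether it comes before or after the paragraph. This is purely illustrative: the `{dir=...}` line is the hypothetical syntax suggested above, and `split_direction` is a made-up helper.

```python
import re

ATTR_LINE = re.compile(r"^\{dir=(rtl|ltr)\}$")


def split_direction(block):
    """Return (direction or None, paragraph text) for one paragraph block."""
    lines = block.strip().split("\n")
    for index in (0, -1):  # the attribute line may lead or trail the paragraph
        match = ATTR_LINE.match(lines[index].strip())
        if match and len(lines) > 1:
            rest = lines[1:] if index == 0 else lines[:-1]
            return match.group(1), "\n".join(rest)
    return None, "\n".join(lines)


print(split_direction("{dir=rtl}\nparagraph"))
print(split_direction("paragraph\n{dir=rtl}"))
```

Both placements yield the same direction and the same paragraph text, which is what "or equally" would require.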

Edit: just discovered that there’s already a topic on RTL.

Yes, we need a clean solution.

Aside from attributes on paragraphs, we need them on the whole document too (for the case where the whole document is RTL).

The topic you mentioned addresses implementation, but I’m addressing the spec. This needs to be resolved at the spec level first.

Hey, I went ahead and added the dir attributes to the code blocks in your post, because they’re allowed by the sanitizer here. Now you have concrete examples without requiring people to edit the HTML with their browser.

Note that this is NOT a Markdown solution, it is a pure HTML solution.

Thanks @riking! Much better.

@01walid @obeid Did you come up with an agreement?

I naively started outlining an RTL Markdown project, got parts of it translated, and then realized that GitHub doesn’t support RTL. #doh

Does anyone have recommendations on Markdown tools that render RTL?

Also, I see that Dariush Abbasi has a very simple version of RTL markdown.

Also, I’d love @riking @mb21 opinion on this.

StackEdit has a built-in config option for RTL direction, but that’s not very convenient unless all your docs are RTL.

I’ve been using <div dir=rtl markdown=1></div> around Hebrew docs.
Many tools support that on conversion to HTML (IIRC I used ReText and pandoc).
ReText is convenient for editing mixed-direction source (as long as you put opposite-direction parts on separate lines).

In any case, conversion to non-HTML formats is harder. In the current landscape I’d try writing pandoc filters that understand <div dir=rtl>.

BTW, GitHub does (currently) support dir=rtl in Markdown rendering, but does not support markdown=1.

However, it works fine in GitHub Pages processed by Jekyll: https://cben.github.io/sandbox/README.html

Okay, I’ve taken a look at the issue of bidirectional text again. For a very accessible introduction, see Unicode Bidirectional Algorithm basics. Quotes below are from Unicode Standard Annex #9: The Bidirectional Algorithm.

If I understand correctly, as of Unicode 6.3 and later, the preferred way to do bidi text is to use two mechanisms only:

  • Implicit Directional Formatting Characters: LRM (LEFT-TO-RIGHT MARK), RLM (RIGHT-TO-LEFT MARK) and ALM (ARABIC LETTER MARK) (ALM behaves the same as RLM, except around numbers). “Their effect on bidirectional ordering is exactly the same as a corresponding strong directional character; the only difference is that they do not appear in the display.”

  • Marking up the direction of ranges of text, using either “Explicit Directional Isolate Formatting Characters” or better yet, when using a markup language like HTML, use the dir attribute or similar. “On web pages, the explicit directional formatting characters […] should be replaced by using the dir attribute and the elements BDI and BDO. This does not apply to the implicit directional formatting characters.” BDI is only used when the directionality is not known, e.g. from user input saved in a db (thus doesn’t apply to markdown), and BDO is used to override the normal bidirectionality rules which seems like an edge-case that can be simulated by wrapping each character in an element with a dir attribute in the worst case (or do you think we’d really need a markdown BDO equivalent?)
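To see that the implicit marks from the first bullet really are just invisible characters with a strong direction, you can inspect them with Python's standard `unicodedata` module (a quick check, assuming a Python build whose Unicode database is 6.3 or later so that ALM is known):

```python
import unicodedata

# The three implicit directional formatting characters.
MARKS = {"LRM": "\u200e", "RLM": "\u200f", "ALM": "\u061c"}

for label, ch in MARKS.items():
    # Prints code point, official name, and bidi class (L, R, AL).
    print(label, f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
```

The bidi classes are L, R and AL respectively, the same classes that ordinary Latin, Hebrew and Arabic letters carry, which is why a Markdown processor can simply pass these characters through.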

Reading through Authoring HTML: Handling Right-to-left Scripts, I tried to reproduce similar ‘funny’ behavior with markdown as well:

[مشس هخصث خهس تخت تخهثز](#العربي)

Note that this is a valid internal link in Markdown. In any bidi-aware browser or editor, the ](# is just displayed mirrored (i.e. RTL) since the surrounding text is RTL as well. But come to think of it, this is probably fine and could even be considered a strength of the Markdown syntax (e.g. as opposed to HTML/XML). Similarly, Markdown numbered lists display naturally for RTL scripts:

1. مشس هخصث خهس تخت تخهثز
2. مشس هخصث خهس تخت تخهثز

So… I think we could get away with “just” supporting the dir attribute, so you would write e.g. The title is [مشس هخصث خهس تخت تخهثز!]{dir=rtl} in Arabic., where the ! would be part of the title. However, this leaves the exclamation mark on the wrong side of the text when editing Markdown, although it will be displayed correctly in a browser upon conversion to HTML. To display it correctly while editing Markdown as well, you’d have to insert an ALM (or RLM) after the ! (of course, you need a text editor that supports bidi for this to work).

Btw, can someone link to a good, up-to-date resource about bidirectional text in LaTeX and ConTeXt? I found this PDF, but it’s from 2001.

Basically the question is: if a markdown processor like Pandoc were to add native span and div syntax with the dir attribute and pass along unicode LRM, RLM and ALM chars:

  1. this would already suffice for bidirectional HTML output, right?
  2. what LaTeX and ConTeXt code would need to be emitted?

ConTeXt minimal sample:

\definefontfamily [mainface] [rm] [ALM Fixed] [features=arabic]
\setupbodyfont[mainface,12pt]
\setupdirections[bidi=on,method=two]
\starttext
The title is !مشس هخصث خهس تخت تخهثز in Arabic.
\stoptext

@ousia thanks! Is there some documentation on this? However, while the exclamation mark is visually on the correct side in your example, logically (order of characters in memory) it’s on the wrong side: the ! is supposed to be at the end of the title, which visually happens to be on the left when read from right to left. What is the equivalent in ConTeXt of <span dir="rtl">?

@mb21, you are welcome.

As far as I know, the command is \righttoleft.

Since it is a command switch, it should be enclosed in braces, such as in:

\definefontfamily [mainface] [rm] [ALM Fixed] [features=arabic]
\setupbodyfont[mainface,12pt]
\starttext
The title is {\righttoleft مشس هخصث خهس تخت تخهثز!} in Arabic.
\stoptext

Thanks, I first had to install the font (tlmgr install almfixed), but now it works. Btw, the global direction can be set with \setupalign[r2l].

@mb21, sorry, I didn’t know that TeX Live didn’t include the font (I use the ConTeXt Suite).

This sample may work with TeX Live without extra font installation:

\definefontfamily [mainface] [rm] [FreeSerif] [features=arabic]
\setupbodyfont[mainface,12pt]
\starttext
The title is {\righttoleft مشس هخصث خهس تخت تخهثز!} in Arabic.
\stoptext

BTW, I would avoid using \setupalign[r2l] unless text orientation is explicitly set in the document’s metadata.

As far as my research goes, using bidi is the best approach for handling RTL in almost any context, including Markdown.

For the editor, you just need to add dir="auto" to the textarea tag. The rest should be handled by the rendering engine, which would simply add a dir="auto" attribute to each top-level element while composing the HTML file.
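As a rough illustration of that rendering step, here is a Python sketch that post-processes generated HTML. It is an assumption-laden toy: it uses a regex rather than a real HTML parser, `BLOCK_TAGS` and `add_dir_auto` are invented names, and it patches the listed block tags wherever they occur instead of tracking nesting depth.

```python
import re

# Block-level tags that should receive dir="auto" (an illustrative selection).
BLOCK_TAGS = ("p", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "blockquote", "pre")
OPEN_TAG = re.compile(r"<(%s)(\s[^>]*)?>" % "|".join(BLOCK_TAGS))


def add_dir_auto(html_text):
    """Add dir="auto" to block opening tags that have no dir attribute yet."""
    def patch(match):
        tag, attrs = match.group(1), match.group(2) or ""
        if re.search(r"\sdir\s*=", attrs):
            return match.group(0)  # keep an explicit direction untouched
        return f'<{tag}{attrs} dir="auto">'
    return OPEN_TAG.sub(patch, html_text)


print(add_dir_auto('<h2 id="x">Title</h2><p>Text</p><p dir="rtl">نص</p>'))
```

Elements that already carry an explicit dir are left alone, so this can sit underneath any explicit markup rather than replace it.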

According to the W3C standard, “auto” should only be used as a last resort:

The heuristic used by this state is very crude (it just looks at the first character with a strong directionality, in a manner analogous to the Paragraph Level determination in the bidirectional algorithm). Authors are urged to only use this value as a last resort when the direction of the text is truly unknown and no better server-side heuristic can be applied.

There is no other server-side solution unless you want to add something extra to the syntax of Markdown, which is absolutely not necessary.

Of course one shouldn’t use this method if one is sure the direction will be RTL or LTR. But when the text is mixed, this is the right approach.

I have implemented this on FluxBB, and without any modification to the database or BB syntax, all forums using that software now render new and old texts smartly based on the context.

Can you tell me what is wrong with the implementation I propose?

@ahangarha if you have a text that mixes some ltr and some rtl text, auto is sometimes not good enough. See https://www.w3.org/International/articles/inline-bidi-markup/uba-basics#isolation

btw. this is actually implemented in pandoc now… see https://pandoc.org/MANUAL.html#language-variables

Markdown should remain markdown. It should remain simple. It is not supposed to support complex and rare text formatting.

My native language is Persian and I know the problem well. I don’t want to add some extra code to just make my text direction RTL or LTR.

To me, it is very clear that we need to leave the decision on the direction of the paragraph (any block of text) to the browser by adding dir="auto" to the block tag (like <p>). Then, if you need to do anything else for the rare cases, do it after applying this.

Let us be able to use Markdown for RTL. We will handle complex cases later or never.

In my experience and understanding, when we are dealing with mixed RTL and LTR text, dir="auto" solves the problem in 99 percent of cases. Those 99 percent are about determining paragraph direction. The remaining 1 percent covers rare cases. Don’t stall over a decision for that rare 1 percent!

As I understand it, any extra action would be an addition on top of dir="auto".