Okay, I’ve taken a look at the issue of bidirectional text again. For a very accessible introduction, see Unicode Bidirectional Algorithm basics. Quotes below are from Unicode Standards Annex #9: The Bidirectional Alogrithm.
If I understand correctly, as of Unicode 6.3 and later, the preferred way to do bidi text is to use two mechanisms only:
-
Implicit Directional Formatting Characters: LRM (LEFT-TO-RIGHT MARK), RLM (RIGHT-TO-LEFT MARK) and ALM (ARABIC LETTER MARK) (ALM behaves the same as RLM, except around numbers). “Their effect on bidirectional ordering is exactly the same as a corresponding strong directional character; the only difference is that they do not appear in the display.”
-
Marking up the direction of ranges of text, using either “Explicit Directional Isolate Formatting Characters” or better yet, when using a markup language like HTML, use the dir
attribute or similar. “On web pages, the explicit directional formatting characters […] should be replaced by using the dir attribute and the elements BDI and BDO. This does not apply to the implicit directional formatting characters.” BDI is only used when the directionality is not known, e.g. from user input saved in a db (thus doesn’t apply to markdown), and BDO is used to override the normal bidirectionality rules which seems like an edge-case that can be simulated by wrapping each character in an element with a dir attribute in the worst case (or do you think we’d really need a markdown BDO equivalent?)
Reading through Authoring HTML: Handling Right-to-left Scripts, I tried to reproduce similar ‘funny’ behavior with markdown as well:
[مشس هخصث خهس تخت تخهثز](#العربي)
Note that this is a valid internal link in markdown, in any bidi-aware browser and editor, the ](#
are just displayed mirrored (i.e. rtl) since the surrounding text is rtl as well. But come to think about it, this is probably fine and could even be considered a strength of the markdown syntax (e.g. as opposed to HTML/XML). Similarly, markdown numbered lists display naturally for RTL scripts:
1. مشس هخصث خهس تخت تخهثز
2. مشس هخصث خهس تخت تخهثز
So… I think we could get away with “just” supporting the dir attribute, so you would write e.g. The title is [مشس هخصث خهس تخت تخهثز!]{dir=rtl} in Arabic.
, where the !
would be part of the title. However, this leaves the exclamation mark on the wrong side of the text when editing markdown, although it will be displayed correctly in a browser upon convertion to HTML. To display it correctly while editing markdown as well, you’d have to insert a ALM (or RLM) behind the !
(of course, you need a text editors that support bidi for this to work).
btw, can someone link to a good up to date resource about bidirectional text in LaTeX and ConTeXt? If found this PDF, but it’s from 2001.
Basically the question is: if a markdown processor like Pandoc were to add native span and div syntax with the dir attribute and pass along unicode LRM, RLM and ALM chars:
- this would already suffice for bidirectional HTML output, right?
- what LaTeX and ConTeXt code would need to be emitted?