= Unicode bidirectional algorithm =

Infobox
- Title: Unicode Bidirectional Algorithm
- Status: Active
- Year Started: 1999
- Version: Unicode 17.0.0 (Revision 51, 13 August 2025)
- Organization: Unicode Consortium
- Editors: Manish Goregaokar, Robin Leroy
- Document: UAX #9

The Unicode Bidirectional Algorithm (UBA), formally defined in Unicode Standard Annex #9 (UAX #9), is a specification developed by the Unicode Consortium that determines how text containing a mixture of left-to-right and right-to-left scripts is displayed. It is a normative part of the Unicode Standard and is required for conformance wherever characters from right-to-left scripts such as Arabic or Hebrew are rendered.

== Background ==
Most writing systems display text from left to right, but several scripts—including Arabic, Hebrew, Thaana, and Syriac—are written from right to left. When text from both directions appears in the same document, the result is known as bidirectional text (or bidi text). Without a clear specification, ambiguities arise in determining the correct display order of characters.

The Unicode Standard prescribes a logical order for storing characters in memory, regardless of their visual direction. The UBA translates this logical order into a correct visual display order.

== Directional Formatting Characters ==
The UBA defines several categories of special control characters used to influence text direction:

=== Implicit Directional Marks ===
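Invisible, zero-width characters that behave like strong characters of the corresponding direction. They have no visible glyph of their own, but they can change how adjacent weak or neutral characters are resolved: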

| Abbreviation | Code Point | Name |
| LRM | U+200E | LEFT-TO-RIGHT MARK |
| RLM | U+200F | RIGHT-TO-LEFT MARK |
| ALM | U+061C | ARABIC LETTER MARK |

=== Explicit Directional Embeddings ===
Signal that a piece of text is to be treated as embedded in a given direction:

| Abbreviation | Code Point | Name |
| LRE | U+202A | LEFT-TO-RIGHT EMBEDDING |
| RLE | U+202B | RIGHT-TO-LEFT EMBEDDING |

=== Explicit Directional Overrides ===
Force characters to be treated as strongly directional, overriding their implicit types:

| Abbreviation | Code Point | Name |
| LRO | U+202D | LEFT-TO-RIGHT OVERRIDE |
| RLO | U+202E | RIGHT-TO-LEFT OVERRIDE |

=== Explicit Directional Isolates ===
Introduced in Unicode 6.3, isolates prevent the enclosed text from affecting the surrounding text's ordering:

| Abbreviation | Code Point | Name |
| LRI | U+2066 | LEFT-TO-RIGHT ISOLATE |
| RLI | U+2067 | RIGHT-TO-LEFT ISOLATE |
| FSI | U+2068 | FIRST STRONG ISOLATE |
| PDI | U+2069 | POP DIRECTIONAL ISOLATE |

=== Terminating Characters ===
| Abbreviation | Code Point | Name | Terminates |
| PDF | U+202C | POP DIRECTIONAL FORMATTING | LRE, RLE, LRO, RLO |
| PDI | U+2069 | POP DIRECTIONAL ISOLATE | LRI, RLI, FSI |
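
As an illustration of how an initiator is paired with its terminator, the following minimal sketch wraps a string in an isolate. The code points come from the tables above; the helper name `isolate` and its `direction` parameter are hypothetical and not part of any standard API.

```python
# Code points from the "Explicit Directional Isolates" table.
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in an isolate initiator and a closing PDI (hypothetical helper).

    direction is "ltr", "rtl", or "auto" (first-strong detection via FSI).
    """
    initiators = {"ltr": LRI, "rtl": RLI, "auto": FSI}
    if direction not in initiators:
        raise ValueError(f"unknown direction: {direction!r}")
    return f"{initiators[direction]}{text}{PDI}"


# Example: insert a user-supplied name of unknown direction into a sentence.
message = "Welcome, " + isolate("\u05D3\u05E0\u05D9") + "!"
```

Using FSI for content of unknown direction mirrors the `dir="auto"` behaviour of markup languages: the direction of the isolated run is chosen from its first strong character.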

== The Algorithm ==
The UBA processes text in four main phases:

=== 1. Paragraph Separation ===
Text is split into paragraphs at paragraph separator characters (type B). Each paragraph is processed independently.

=== 2. Initialization ===
Each character is assigned a bidirectional character type (e.g., L, R, AL, EN, AN) from the Unicode Character Database. An embedding level list is also initialized.
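
The type lookup can be tried out with Python's standard `unicodedata` module, which exposes the Bidi_Class property from the Unicode Character Database bundled with the interpreter. This is only a sketch of the lookup step; the function name `bidi_types` is an assumption made for illustration, and the exact Unicode version depends on the Python release in use.

```python
import unicodedata


def bidi_types(text: str) -> list[str]:
    """Return the bidirectional character type of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Latin letter, Hebrew letter, Arabic letter, ASCII digit, space.
sample = "a\u05D0\u06271 "
for ch, bidi_type in zip(sample, bidi_types(sample)):
    print(f"U+{ord(ch):04X} {bidi_type}")
```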

=== 3. Resolving Embedding Levels ===
A series of rules resolves the embedding level of each character:
- P1–P3: Determine the paragraph embedding level (0 for LTR, 1 for RTL).
- X1–X10: Assign explicit embedding levels based on directional formatting characters.
- W1–W7: Resolve weak types (e.g., European numbers, separators).
- N0–N2: Resolve neutral and isolate formatting types, including bracket pairs.
- I1–I2: Resolve implicit embedding levels.

The maximum embedding depth is 125 levels, a value guaranteed not to change in future versions of the standard.
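
The paragraph-level rules can be sketched in a few lines. The version below is a simplified illustration of rules P2 and P3, assuming the input is a single paragraph: it looks for the first strong character (L, R, or AL) while skipping anything between an isolate initiator and its matching PDI. The function name `paragraph_level` is hypothetical.

```python
import unicodedata

ISOLATE_INITIATORS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDI = "\u2069"


def paragraph_level(text: str) -> int:
    """Return 0 (LTR) or 1 (RTL) for a single paragraph, per rules P2 and P3."""
    depth = 0  # number of isolates currently being skipped
    for ch in text:
        if ch in ISOLATE_INITIATORS:
            depth += 1
        elif ch == PDI:
            if depth > 0:  # an unmatched PDI is simply ignored
                depth -= 1
        elif depth == 0:
            bidi_type = unicodedata.bidirectional(ch)
            if bidi_type == "L":
                return 0
            if bidi_type in ("R", "AL"):
                return 1
    return 0  # no strong character found: default to left-to-right


print(paragraph_level("hello \u05D0"))  # first strong character is Latin
print(paragraph_level("\u05D0 hello"))  # first strong character is Hebrew
```

An isolate initiator with no matching PDI leaves `depth` above zero for the rest of the paragraph, so everything after it is skipped, which matches the rule's treatment of unterminated isolates.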

=== 4. Reordering ===
Rules L1–L4 reorder characters on each line for display:
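- L1: Resets segment and paragraph separators, any whitespace preceding them, and any whitespace at the end of the line to the paragraph embedding level.
- L2: Working from the highest embedding level on the line down to the lowest odd level, reverses every contiguous sequence of characters that are at that level or higher.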
- L3: Reorders combining marks relative to their base characters.
- L4: Applies glyph mirroring to characters with the Bidi_Mirrored property when their resolved direction is right-to-left (e.g., "(" becomes ")").
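
Rule L2 is the core of the reordering step and can be sketched as follows. The code assumes the embedding levels have already been resolved and that the input is a single line; the function name `reorder_line` is hypothetical. Capital letters stand in for right-to-left letters, a convention also used in the examples of UAX #9.

```python
def reorder_line(chars: str, levels: list[int]) -> list[str]:
    """Return the characters of one line in visual order (rule L2 only)."""
    if len(chars) != len(levels):
        raise ValueError("chars and levels must have the same length")
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return list(chars)  # purely left-to-right: nothing to reverse

    order = list(range(len(chars)))  # indices into chars, in display order
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                # Find the end of the run at this level or higher.
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = order[i:j][::-1]
                i = j
            else:
                i += 1
    return [chars[k] for k in order]


# A right-to-left paragraph (level 1) containing an English word (level 2).
visual = reorder_line("car MEANS CAR.", [2, 2, 2] + [1] * 11)
print("".join(visual))  # .RAC SNAEM car
```

Reversing from the highest level downwards means that a nested left-to-right run is reversed twice, once on its own and once with its right-to-left surroundings, so it ends up reading left to right again.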

== Bidirectional Character Types ==
Characters are classified into the following categories:

| Category | Type | Description |
| Strong | L | Left-to-Right |
| Strong | R | Right-to-Left (e.g., Hebrew) |
| Strong | AL | Right-to-Left Arabic (e.g., Arabic, Syriac) |
| Weak | EN | European Number |
| Weak | ES | European Number Separator |
| Weak | ET | European Number Terminator |
| Weak | AN | Arabic Number |
| Weak | CS | Common Number Separator |
| Weak | NSM | Nonspacing Mark |
| Neutral | B | Paragraph Separator |
| Neutral | S | Segment Separator |
| Neutral | WS | Whitespace |
| Neutral | ON | Other Neutrals |

== Conformance ==
A conforming implementation must:
- Display all visible characters in the order described by the UBA (UAX9-C1).
- Only apply higher-level protocol overrides as defined in Section 4.3 of the specification (UAX9-C2).

=== Higher-Level Protocols ===
The UBA permits six higher-level protocol overrides (HL1–HL6), including:
- HL1: Override the paragraph embedding level.
- HL3: Emulate explicit directional formatting characters via markup (e.g., HTML dir attribute).
- HL4: Apply the UBA independently to segments of structured text (e.g., XML, source code).
- HL6: Apply additional glyph mirroring beyond the standard Bidi_Mirrored property.

== HTML and CSS Equivalents ==
On web pages, Unicode directional formatting characters can be replaced by HTML5 and CSS3 markup:

| Unicode | HTML | CSS |
| RLI...PDI | dir="rtl" | direction:rtl; unicode-bidi:isolate |
| LRI...PDI | dir="ltr" | direction:ltr; unicode-bidi:isolate |
| FSI...PDI | <bdi>, dir="auto" | unicode-bidi:plaintext |
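
The mapping in the table can be applied mechanically. The sketch below converts isolate characters in plain text to the corresponding HTML markup; it is an illustration only, assuming the text contains no embeddings or overrides, and the function name `isolates_to_html` is hypothetical.

```python
import html

# Isolate initiators mapped to (opening tag, closing tag), following the table.
ISOLATE_TAGS = {
    "\u2066": ('<span dir="ltr">', "</span>"),  # LRI
    "\u2067": ('<span dir="rtl">', "</span>"),  # RLI
    "\u2068": ("<bdi>", "</bdi>"),              # FSI
}
PDI = "\u2069"


def isolates_to_html(text: str) -> str:
    """Replace LRI/RLI/FSI ... PDI pairs with equivalent HTML elements."""
    out = []
    pending = []  # closing tags for isolates that are still open
    for ch in text:
        if ch in ISOLATE_TAGS:
            open_tag, close_tag = ISOLATE_TAGS[ch]
            out.append(open_tag)
            pending.append(close_tag)
        elif ch == PDI:
            if pending:  # an unmatched PDI is dropped
                out.append(pending.pop())
        else:
            out.append(html.escape(ch))
    # Isolates left open run to the end of the text, so close them here.
    while pending:
        out.append(pending.pop())
    return "".join(out)


print(isolates_to_html("Title: \u2068\u05D0\u05D1 & co\u2069 (draft)"))
```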

== Security Considerations ==
Bidirectional formatting characters can be misused to make malicious code or text appear benign, because the displayed order of characters can differ from their logical order. These risks are discussed in Unicode Technical Report #36 (UTR #36), Unicode Security Considerations. Directional overrides (LRO, RLO) are particularly dangerous and should be avoided where possible.
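
One simple mitigation is to scan untrusted text or source code for explicit directional formatting characters before displaying or accepting it. The following sketch reports each occurrence with its line and column; the function name `find_bidi_controls` is hypothetical, and the character set is limited to the embeddings, overrides, and isolates listed earlier in this article.

```python
# Explicit directional formatting characters and their abbreviations.
BIDI_CONTROLS = {
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, abbreviation) for each control found, 1-based."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            name = BIDI_CONTROLS.get(ch)
            if name is not None:
                findings.append((line_no, col, name))
    return findings


source = "ok = True\nx = 1 \u202E# note"
for line_no, col, name in find_bidi_controls(source):
    print(f"line {line_no}, column {col}: {name}")
```

Whether a finding should be rejected, escaped, or merely flagged depends on the application; legitimate right-to-left text may contain these characters.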

== History ==
- Unicode 1.0 (1991): Basic bidirectional support introduced.
- Unicode 6.3 (2013): Major revision introducing directional isolates (LRI, RLI, FSI, PDI) and bracket pair resolution (rule N0). These additions were made to address the overly strong effect of directional embeddings on surrounding text.
- Unicode 17.0 (2025): Current version (Revision 51).

== See Also ==
- Unicode
- Right-to-left
- Arabic script
- Hebrew alphabet
- Unicode security
- Internationalization and localization

== External Links ==
- UAX #9: Unicode Bidirectional Algorithm (latest version)
- UAX #9 Revision 51 (Unicode 17.0.0)
