IDN homograph attack
The internationalized domain name (IDN) homograph attack is a way a malicious party may deceive computer users about what remote system they are communicating with, by exploiting the fact that many different characters look alike (i.e., they are homographs, hence the term for the attack). For example, a person frequenting citibank.com may be lured to click a link in which the Latin C is replaced with the Cyrillic С.
This kind of spoofing attack is also known as script spoofing. Unicode incorporates numerous writing systems, and, for a number of reasons, similar-looking characters such as Greek Ο, Latin O, and Cyrillic О were not assigned the same code. Their incorrect or malicious usage is a possibility for security attacks.
The registration of homographic domain names is akin to typosquatting. The major difference is that in typosquatting the perpetrator relies on natural human typos, while in homograph spoofing the perpetrator intentionally deceives the web surfer with visually indistinguishable names. Indeed, it would be a rare accident for a web user to type, e.g., a Cyrillic letter within an otherwise English word such as "citibank". There are cases in which a registration can be both typosquatting and homograph spoofing: the characters in each of the pairs l/I, i/j, and 0/O are close together on keyboards and also bear a certain amount of resemblance to each other.
An early nuisance of this kind, pre-dating the Internet and even text terminals, was the confusion between "l" (lowercase letter "L") / "1" (the number "one") and "O" (capital letter for vowel "o") / "0" (the number "zero"). Some typewriters in the pre-computer era even conflated the ell and the one; users had to type a lowercase L when the number one was needed. The zero/oh confusion gave rise to the tradition of crossing zeros, so that a computer operator would type them correctly. Unicode may add greatly to this kind of confusion with its combining characters, accents, several types of hyphen look-alikes, etc., often because of inadequate rendering support, especially at smaller font sizes and across the wide variety of fonts.
Even earlier, handwriting provided rich opportunities for confusion. A notable example is the etymology of the word "zenith". The translation from the Arabic "samt" involved a scribe misreading "m" as "ni". This was common in medieval blackletter, which did not connect the vertical columns on the letters i, m, n, or u, making them difficult to distinguish when several were in a row. This kind of confusion, as well as "rn"/"m"/"rri" ("RN"/"M"/"RRI") confusion, is still possible for a human eye even with modern advanced computer technology.
Intentional look-alike character substitution with different alphabets has also been known in various contexts. For example, Faux Cyrillic has been used as an amusement or attention-grabber, and "Volapuk encoding", in which Cyrillic script is represented by similar Latin characters, was used in the early days of the Internet as a way to overcome the lack of support for the Cyrillic alphabet.
Homographs in ASCII
ASCII has several characters or pairs of characters that look alike and are known as homographs (or homoglyphs). Spoofing attacks based on these similarities are known as homograph spoofing attacks. Examples include 0 (the number zero) and O (the capital letter), and "l" (lowercase L) and "I" (uppercase "i").
In a typical example of a hypothetical attack, someone could register a domain name that appears almost identical to an existing domain but goes somewhere else. For example, the domain "rnicrosoft.com" contains "r" and "n", not "m". Another example is G00GLE.COM, which looks much like GOOGLE.COM in some fonts. Using a mix of uppercase and lowercase characters, googIe.com (capital i, not small L) looks much like google.com in some fonts. PayPal was a target of a phishing scam exploiting this, using the domain PayPaI.com. In certain narrow-spaced fonts such as Tahoma (the default in the address bar in Windows XP), placing a c in front of a j, l or i will produce homoglyphs such as cl, cj and ci (resembling d, g and a).
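The ASCII look-alikes above can be caught by folding each confusable sequence to a single representative and comparing the results. The following is a minimal sketch in Python, not an established algorithm: the table `ASCII_CONFUSABLES` and the function names `ascii_skeleton` and `is_ascii_spoof` are hypothetical, and the table covers only the pairs mentioned in this section.

```python
# Hypothetical folding table (an assumption for illustration): each
# look-alike sequence is replaced by one representative character.
ASCII_CONFUSABLES = [
    ("rn", "m"),  # "rn" can pass for "m"
    ("0", "o"),   # digit zero vs. letter o
    ("1", "l"),   # digit one vs. lowercase L
    ("I", "l"),   # capital i vs. lowercase L
]


def ascii_skeleton(name: str) -> str:
    """Fold look-alike sequences, then lowercase the result."""
    for lookalike, representative in ASCII_CONFUSABLES:
        name = name.replace(lookalike, representative)
    return name.lower()


def is_ascii_spoof(candidate: str, target: str) -> bool:
    """True if candidate is a different name that folds to the same skeleton."""
    if candidate.lower() == target.lower():
        return False  # same domain; DNS names are case-insensitive
    return ascii_skeleton(candidate) == ascii_skeleton(target)


for name, target in [
    ("rnicrosoft.com", "microsoft.com"),
    ("G00GLE.COM", "google.com"),
    ("googIe.com", "google.com"),
    ("PayPaI.com", "paypal.com"),
]:
    print(name, "->", ascii_skeleton(name), is_ascii_spoof(name, target))
```

Such folding is deliberately lossy: a legitimate name containing "rn" (for example "modern") folds to the same skeleton as one containing "m", so a match is a reason for closer inspection rather than proof of spoofing.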
Homographs in internationalized domain names
In multilingual computer systems, different logical characters may have identical appearances. For example, Unicode character U+0430, Cyrillic small letter a ("а"), can look identical to Unicode character U+0061, Latin small letter a ("a"), which is the lowercase "a" used in English. Hence wikipediа.org (ending in the Cyrillic letter) can be displayed instead of wikipedia.org (the all-Latin name) without the user noticing.
The problem arises from the different treatment of the characters in the user's mind and the computer's programming. From the viewpoint of the user, a Cyrillic "а" within a Latin string is a Latin "a"; there is no difference in the glyphs for these characters in most fonts. However, the computer treats them differently when processing the character string as an identifier. Thus, the user's assumption of a one-to-one correspondence between the visual appearance of a name and the named entity breaks down.
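The gap between what the user sees and what the computer compares can be made visible by printing the Unicode names of the characters involved. This is a small sketch using Python's standard `unicodedata` module; the helper name `differing_chars` is an assumption made for this example.

```python
import unicodedata

latin = "wikipedia.org"
spoof = "wikipedi\u0430.org"  # last "a" of the label is U+0430, Cyrillic small a


def differing_chars(a: str, b: str):
    """List (index, name in a, name in b) where two equal-length strings differ."""
    return [
        (i, unicodedata.name(x), unicodedata.name(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


print(latin == spoof)  # False: the strings differ as identifiers
print(differing_chars(latin, spoof))
# [(8, 'LATIN SMALL LETTER A', 'CYRILLIC SMALL LETTER A')]
```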
Internationalized domain names provide a backward-compatible way for domain names to use the full Unicode character set, and this standard is already widely supported. However, this system expanded the character repertoire from a few dozen characters in a single alphabet to many thousands of characters in many scripts, which greatly increased the scope for homograph attacks.
This opens a rich vein of opportunities for phishing and other varieties of fraud. An attacker could register a domain name that looks just like that of a legitimate website, but in which some of the letters have been replaced by homographs in another alphabet. The attacker could then send e-mail messages purporting to come from the original site, but directing people to the bogus site. The spoof site could then record information such as passwords or account details, while passing traffic through to the real site. The victims may never notice the difference, until suspicious or criminal activity occurs with their accounts.
In December 2001 Evgeniy Gabrilovich and Alex Gontmakher, both from Technion, Israel, published a paper titled "The Homograph Attack", which described an attack that used Unicode URLs to spoof a website URL. To prove the feasibility of this kind of attack, the researchers successfully registered a variant of the domain name microsoft.com which incorporated Cyrillic characters.
Problems of this kind were anticipated before IDN was introduced, and guidelines were issued to registries to try to avoid or reduce the problem. For example, it was advised that registries only accept characters from the Latin alphabet and that of their own country, not all Unicode characters, but this advice was neglected by major TLDs.
On February 7, 2005, Slashdot reported that this exploit was disclosed by 3ric Johanson at the hacker conference Shmoocon. Web browsers supporting IDNA appeared to direct the URL http://www.pаypal.com/, in which the first a character is replaced by a Cyrillic а, to the site of the well-known payment service PayPal, but actually led to a spoofed web site with different content.
The following alphabets have characters that can be used for spoofing attacks (please note, these are only the most obvious and common, given artistic license and how much risk the spoofer will take of getting caught; the possibilities are far more numerous than can be listed here):
Cyrillic is, by far, the most commonly used alphabet for homoglyphs, largely because it contains 11 lowercase glyphs that are identical or nearly identical to Latin counterparts.
The Russian letters а, с, е, о, р, х and у have optical counterparts in the basic Latin alphabet and look close or identical to a, c, e, o, p, x and y. Cyrillic З, Ч and б resemble the numerals 3, 4 and 6. Italic type generates more homoglyphs: дтпи (дтпи in standard type), resembling dmnu (in some fonts д can be used, since its italic form resembles a lowercase g; however, in most mainstream fonts, д instead resembles a partial differential sign, ∂).
If capital letters are counted, АВСЕНІЈКМОРЅТХ can substitute ABCEHIJKMOPSTX, in addition to the capitals for the lowercase Cyrillic homoglyphs. In the Serbian alphabet and handwriting-based fonts, Cyrillic Д and Latin D are homoglyphs.
While Komi De (ԁ), shha (һ), palochka (Ӏ) and izhitsa (ѵ) bear strong resemblance to Latin d, h, l and v, these letters are either rare or archaic and are not widely supported in most standard fonts (they are not included in the WGL-4). Attempting to use them could cause a ransom note effect.
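The seven lowercase Russian letters listed above can be folded back to their Latin counterparts to test whether a label imitates a protected name. The sketch below is illustrative only: the table `CYRILLIC_TO_LATIN` contains just those seven letters, and the function names `fold_cyrillic` and `imitates` are hypothetical.

```python
# Only the seven lowercase letters named in the text; a real confusables
# table would be far larger. Escapes are used so the script is unambiguous.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0430": "a",  # а
    "\u0441": "c",  # с
    "\u0435": "e",  # е
    "\u043e": "o",  # о
    "\u0440": "p",  # р
    "\u0445": "x",  # х
    "\u0443": "y",  # у
})


def fold_cyrillic(label: str) -> str:
    """Replace the listed Cyrillic letters with their Latin look-alikes."""
    return label.translate(CYRILLIC_TO_LATIN)


def imitates(label: str, protected: str) -> bool:
    """True if label differs from protected but folds to it."""
    return label != protected and fold_cyrillic(label) == protected


print(imitates("p\u0430ypal", "paypal"))  # True: Cyrillic а in second position
print(imitates("paypal", "paypal"))       # False: identical name
```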
From the Greek alphabet, only omicron ο and sometimes nu ν appear identical to a Latin alphabet letter in the lowercase used for URLs. Fonts that are in italic type will feature Greek alpha α looking like a Latin a.
This list increases if close matches are also allowed (such as Greek εικηρτυωχγ for eiknptuwxy). Using capital letters, the list expands greatly: Greek ΑΒΕΗΙΚΜΝΟΡΤΧΥΖ looks identical to Latin ABEHIKMNOPTXYZ. Greek ΑΓΒΕΗΚΜΟΠΡΤΦΧ looks similar to Cyrillic АГВЕНКМОПРТФХ (as do Cyrillic Л and Greek Λ in certain geometric sans-serif fonts), and Greek letters κ and о look similar to Cyrillic к and о. In addition, Greek τ and φ can be similar to Cyrillic т and ф in some fonts, Greek δ resembles Cyrillic б in the Serbian alphabet, and the Cyrillic а italicizes the same way as its Latin counterpart, making it possible to substitute it for alpha or vice versa.
If an IDN itself is being spoofed, Greek beta β can be a substitute for German eszett ß in some fonts (and in fact, code page 437 treats them as equivalent), as can Greek sigma ς for ç; accented Greek substitutes όίά can usually be used for óíá in many fonts, with the last of these (alpha) again only resembling a in italic type.
The Armenian alphabet can also contribute critical characters: several Armenian characters such as օ, ո, ս, as well as capital Տ and Լ, are often completely identical to Latin characters in modern fonts. Symbols like ա can resemble Cyrillic ш. Besides these, there are symbols that merely look alike: ցհոօզս, which look like ghnoqu; յ, which resembles j (albeit dotless); and ք, which can resemble either p or f depending on the font. However, the use of Armenian is problematic. Not all standard fonts feature the Armenian glyphs (whereas the Greek and Cyrillic scripts are in most standard fonts). Because of this, Windows prior to Windows 7 rendered Armenian in a distinct font, Sylfaen, which supports Armenian, and the mixing of Armenian with Latin would appear obviously different if using a font other than Sylfaen or a Unicode typeface. (This is known as a ransom note effect.) The current version of Tahoma, used in Windows 7, supports Armenian (previous versions did not). Furthermore, this font differentiates Latin g from Armenian ց.
Two letters in Armenian (Ձշ) also can resemble the number 2, Յ resembles 3, while another (վ) sometimes resembles the number 4.
Hebrew spoofing is generally rare. Only three letters from that alphabet can reliably be used: samekh (ס), which sometimes resembles o, vav with diacritic (וֹ), which resembles an i, and heth (ח), which resembles the letter n. Less accurate approximants for some other alphanumerics can also be found, but these are usually only accurate enough to use for the purposes of foreign branding and not for substitution. Furthermore, the Hebrew alphabet is written from right to left and trying to mix it with left-to-right glyphs may cause problems.
The Chinese language can be problematic for homographs as many characters exist as both traditional (regular script) and Simplified Chinese characters. In the .org domain, registering one variant renders the other unavailable to anyone; in .biz a single Chinese-language IDN registration delivers both variants as active domains (which must have the same domain name server and the same registrant). .hk (.香港) also adopts this policy.
Other Unicode scripts in which homographs can be found include Number Forms (Roman numerals), CJK Compatibility and Enclosed CJK Letters and Months (certain abbreviations), Latin (certain digraphs), Mathematical Alphanumeric Symbols, and Alphabetic Presentation Forms (typographic ligatures).
Two names which differ only in an accent on one character may look very similar. For instance, "wíkipedia.org" is not "wikipedia.org", as the dot on the first i has been replaced by an acute accent. In most top-level domain registries, wíkipedia.tld (xn--wkipedia-c2a.tld) and wikipedia.tld are two different names which may be held by different registrants. One exception is .ca, where reserving the plain-ASCII version of the domain prevents another registrant from claiming an accented version of the same name.
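The ASCII form quoted above can be reproduced with Python's built-in `idna` codec, which makes the difference between the two names explicit. This is a minimal sketch; the wrapper name `to_ascii` is an assumption, and the built-in codec follows the older IDNA 2003 rules, so its results may differ from what current registries and browsers apply for some characters.

```python
def to_ascii(hostname: str) -> str:
    """Convert each label of a hostname to its ASCII (Punycode) form."""
    return hostname.encode("idna").decode("ascii")


accented = "w\u00edkipedia.tld"  # first "i" replaced by U+00ED, i with acute
plain = "wikipedia.tld"

print(to_ascii(accented))  # xn--wkipedia-c2a.tld
print(to_ascii(plain))     # wikipedia.tld (ASCII labels pass through unchanged)
```

Comparing the ASCII forms rather than the displayed forms shows at once that the two names are distinct registrations.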
Unicode includes many characters which are not displayed by default, such as the zero-width space.
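Invisible characters of this kind can be flagged by inspecting each character's Unicode general category. The sketch below treats every character in the "Cf" (format) category as suspicious, which is a heuristic chosen for this example rather than a complete list of non-displayed characters; the function name `invisible_chars` is hypothetical.

```python
import unicodedata


def invisible_chars(text: str):
    """Return (index, name) for each format-category (Cf) character in text."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


hidden = "wiki\u200bpedia.org"  # zero-width space after "wiki"
print(invisible_chars(hidden))           # [(4, 'ZERO WIDTH SPACE')]
print(invisible_chars("wikipedia.org"))  # []
```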
Defending against the attack
The simplest defense is for web browsers not to support IDNA or other similar mechanisms, or for users to turn off whatever support their browsers have. That could mean blocking access to IDNA sites, but generally browsers permit access and just display IDNs in Punycode. Either way, this amounts to abandoning non-ASCII domain names.
Opera displays Punycode for IDNs unless the top-level domain (for example TLDs such as .museum) prevents homograph attacks by restricting which characters can be used in domain names. It also allows users to manually add TLDs to the allowed list.
Since version 22 (2013), Firefox displays IDNs if either the TLD prevents homograph attacks by restricting which characters can be used in domain names or labels do not mix scripts for different languages. Otherwise IDNs are displayed in Punycode.
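The mixed-script rule can be approximated in a few lines. The sketch below is not the algorithm of any particular browser: it guesses a character's script from the first word of its Unicode name (a rough stand-in for the real Unicode script property), ignores TLD allow-lists entirely, and uses the hypothetical function names `scripts_in_label` and `display_form`.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Approximate the scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():  # skip digits and hyphens
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def display_form(hostname: str) -> str:
    """Show Punycode if any label mixes scripts, else the Unicode form."""
    if any(len(scripts_in_label(label)) > 1 for label in hostname.split(".")):
        return hostname.encode("idna").decode("ascii")
    return hostname


print(display_form("p\u0430ypal.com"))  # xn--pypal-4ve.com (Latin + Cyrillic)
print(display_form("paypal.com"))       # paypal.com
```

A label written entirely in one script passes this check unchanged, which is why a whole-script spoof made only of Cyrillic look-alikes needs the additional registry-level restrictions described in this section.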
Internet Explorer 7 allows IDNs except for labels that mix scripts for different languages. Labels that mix scripts are displayed in Punycode. There are exceptions for locales where ASCII characters are commonly mixed with localized scripts.
Internet Explorer has been capable of using IDNs since version 7, but it imposes restrictions on displaying non-ASCII domain names based on a user-defined list of allowed languages and provides an anti-phishing filter that checks suspicious Web sites against a remote database of known phishing sites.
Some internationalized country code top-level domains are restricted in a way that hinders homographic attacks. For example, the Russian TLD .рф only accepts Cyrillic names, forbidding a mix with Latin or Greek characters. However, the problem in .com and other gTLDs remains open.
ICANN has implemented a policy prohibiting any potential internationalized TLD from choosing letters that could resemble an existing Latin TLD and thus be used for homograph attacks. Proposed IDN TLDs .бг (Bulgaria), .укр (Ukraine) and .ελ (Greece) were rejected or stalled because of their perceived resemblance to Latin letters. All three (and Serbian .срб and Mongolian .мон) were later accepted. Three-letter TLDs are considered safer than two-letter TLDs, since they are harder to confuse with the normal Latin ISO 3166 country domains, although they might match new generic domains.
These methods of defense only extend to within a browser. Homographic URLs that house malicious software can still be distributed, without being displayed as Punycode, through e-mail, social networking or other Web sites without being detected until the user actually clicks the link. While the fake link will show in Punycode when it is clicked, by this point the page has already begun loading into the browser and the malicious software may have already been downloaded onto the computer. Television station KBOI-TV raised these concerns when an unknown source (registering under the name "Completely Anonymous") registered a domain name homographic to its own to spread an April Fool's Day joke regarding the Governor of Idaho issuing a supposed ban on the sale of music by Justin Bieber.
- "Unicode Security Considerations", Technical Report #36, 2010-04-28
- Evgeniy Gabrilovich and Alex Gontmakher, The Homograph Attack, Communications of the ACM, 45(2):128, February 2002
- IDN hacking disclosure by shmoo.com
- There are various Punycode converters online, such as https://www.hkdnr.hk/idn_conv.jsp
- "Advisory: Internationalized domain names (IDN) can be used for spoofing.". Opera. 2005-02-25. Archived from the original on 2007-02-19. Retrieved 2007-02-24.
- "Opera's Settings File Explained: IDNA White List". Opera Software. 2006-12-18. Archived from the original on 2007-12-03. Retrieved 2007-02-24.
- "IDN Display Algorithm". Mozilla. Retrieved 2016-01-31.
- "Bug 722299". Bugzilla.mozilla.org. Retrieved 2016-01-31.
- Sharif, Tariq (2006-07-31). "Changes to IDN in IE7 to now allow mixing of scripts". IEBlog. Microsoft. Retrieved 2006-11-30.
- Sharif, Tariq (2005-09-09). "Phishing Filter in IE7". IEBlog. Microsoft. Retrieved 2006-11-30.
- "Firefox 2 Phishing Protection". Mozilla. 2006. Retrieved 2006-11-30.
- "Opera Fraud Protection". Opera Software. 2006-12-18. Retrieved 2007-02-24.
- "About Safari International Domain Name support". Retrieved 2008-08-07.
- IDN in Google Chrome - The Chromium Projects
- IDN ccTLD Fast Track String Evaluation Completion
- Fake website URL not from KBOI-TV. KBOI-TV. Retrieved 2011-04-01.
- Boise TV news website targeted with Justin Bieber prank. KTVB. Retrieved 2011-04-01.