An internationalized domain name (IDN) is an Internet domain name that contains one or more non-ASCII characters. Such domain names could contain letters with diacritics, as required by many languages, or characters from non-Latin scripts such as Arabic, Chinese, Cyrillic, Devanagari or Hebrew. However, the standard for domain names does not allow such characters, and much work has gone into finding a way to internationalize domain names into a standard ASCII format, thereby preserving the stability of the domain name system.
IDN was originally proposed in 1996 by M. Dürst and implemented in 1998 by Tan Juay Kwang and Leong Kok Yong under the guidance of T.W. Tan (James Seng was only recruited later after he joined the nascent company, i-dns.net). After much debate and many competing proposals, a system called Internationalizing Domain Names in Applications (IDNA) [1] was adopted as a standard, and has been implemented in several top level domains.
In IDNA, the term internationalized domain name means specifically any domain name consisting only of labels to which the IDNA ToASCII algorithm can be successfully applied. (For the meaning of 'label' and 'ToASCII', see the section ToASCII and ToUnicode below.) In March 2008, the IETF formed a new IDN working group to update[2] the current IDNA protocol.
Internationalizing Domain Names in Applications
Internationalizing Domain Names in Applications (IDNA) is a mechanism defined in 2003 for handling internationalized domain names containing non-ASCII characters. While much of the Domain Name System can technically support non-ASCII characters, applications such as e-mail and web browsers restrict domain names to what can be used as a hostname. Rather than redesigning the existing DNS infrastructure, it was decided that non-ASCII domain names should be converted to a suitable ASCII-based form by web browsers and other user applications; IDNA specifies how this conversion is to be done.
IDNA was designed for maximum backward compatibility with the existing DNS, which was designed for names that use only a subset of the ASCII character set.
An IDNA-enabled application is able to convert between the restricted-ASCII and non-ASCII representations of a domain, using the ASCII form in cases where it is needed (such as for DNS lookup), but being able to present the more readable non-ASCII form to users. Applications that do not support IDNA will not be able to handle domain names with non-ASCII characters, but will still be able to access such domains if given the (usually rather cryptic) ASCII equivalent.
ICANN issued guidelines for the use of IDNA in June 2003[4], and it was already possible to register .jp domains using this system in July 2003 and .info[3] domains in March 2004. Several other top-level domain registries started accepting registrations in 2004 and 2005. The guidelines were updated[5] in November 2005 to respond to phishing concerns. An ICANN working group focused on country code domain names at the top level was formed in November 2007[6] and promoted jointly by the country code supporting organization and the Governmental Advisory Committee.
Mozilla 1.4, Netscape 7.1, Opera 7.11 and Safari were among the first applications to support IDNA. A browser plugin is available for Internet Explorer 6 to provide IDN support. Internet Explorer 7.0[7][8] and Windows Vista's URL APIs provide native support for IDN.[9]
ToASCII and ToUnicode
The conversions between ASCII and non-ASCII forms of a domain name are accomplished by algorithms called ToASCII and ToUnicode. These algorithms are not applied to the domain name as a whole, but rather to individual labels. For example, if the domain name is www.example.com, then the labels are www, example, and com, and ToASCII or ToUnicode would be applied to each of these three separately.
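The per-label processing described above can be sketched in Python. This is only an illustration: the helper names split_labels and map_labels are hypothetical, and the sketch splits on the ASCII full stop only, which is a simplification.

```python
from typing import Callable, List


def split_labels(domain: str) -> List[str]:
    """Split a domain name into its labels (ASCII full stop only; a simplification)."""
    return domain.split(".")


def map_labels(domain: str, convert: Callable[[str], str]) -> str:
    """Apply a per-label conversion to each label and rejoin the results with dots."""
    return ".".join(convert(label) for label in split_labels(domain))


print(split_labels("www.example.com"))
print(map_labels("www.example.com", str.upper))
```

Any label-level conversion, such as a ToASCII or ToUnicode implementation, could be passed as the convert argument.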
The details of these two algorithms are complex, and are specified in the RFCs linked at the end of this article. The following gives an overview of their behaviour.
ToASCII leaves unchanged any ASCII label, but will fail if the label is unsuitable for DNS. If given a label containing at least one non-ASCII character, ToASCII will apply the Nameprep algorithm (which converts the label to lowercase and performs other normalization) and will then translate the result to ASCII using Punycode[10] before prepending the 4-character string "xn--". This 4-character string is called the ACE prefix, where ACE means ASCII Compatible Encoding, and is used to distinguish Punycode-encoded labels from ordinary ASCII labels. Note that the ToASCII algorithm can fail in a number of ways; for example, the final string could exceed the 63-character limit for the DNS. A label on which ToASCII fails cannot be used in an internationalized domain name.
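As a rough illustration of the steps just listed, the following Python sketch strings them together using the nameprep helper and the punycode codec from the standard library. The function name to_ascii_sketch is hypothetical, and the sketch checks only the length limit; the full algorithm in the RFCs performs further suitability checks that are omitted here.

```python
from encodings import idna  # standard-library module providing nameprep()

ACE_PREFIX = "xn--"
MAX_LABEL_LENGTH = 63  # DNS limit on the length of a single label


def to_ascii_sketch(label: str) -> str:
    """Simplified ToASCII: Nameprep, Punycode, ACE prefix, then a length check."""
    if label.isascii():
        # ASCII labels are left unchanged.
        result = label
    else:
        # Nameprep lowercases the label and performs other normalization.
        prepared = idna.nameprep(label)
        # Punycode turns the prepared label into an ASCII-only string.
        encoded = prepared.encode("punycode").decode("ascii")
        result = ACE_PREFIX + encoded
    if not 0 < len(result) <= MAX_LABEL_LENGTH:
        raise ValueError("label is empty or exceeds the 63-character limit")
    return result
```

A pure-ASCII label such as www passes through unchanged, while an over-long result raises an error, mirroring the failure case mentioned above.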
ToUnicode reverses the action of ToASCII, stripping off the ACE prefix and applying the Punycode decode algorithm. It does not reverse the Nameprep processing, since that is merely a normalization and is by nature irreversible. Unlike ToASCII, ToUnicode always succeeds, because it simply returns the original string if decoding would fail. In particular, this means that ToUnicode has no effect on a string that does not begin with the ACE prefix.
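The never-fails behaviour can be illustrated with a short Python sketch built on the standard punycode codec. The name to_unicode_sketch is hypothetical, and the sketch leaves out the verification steps that the full algorithm in the RFCs performs after decoding.

```python
ACE_PREFIX = "xn--"


def to_unicode_sketch(label: str) -> str:
    """Simplified ToUnicode: strip the ACE prefix and decode Punycode.

    Returns the input unchanged if it lacks the prefix or cannot be decoded.
    """
    if not label.lower().startswith(ACE_PREFIX):
        # No ACE prefix: the label is returned as-is.
        return label
    try:
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    except UnicodeError:
        # Decoding failed, so fall back to the original string.
        return label
```

Because every failure path returns the input, a caller can apply this function to any label without handling errors.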
Example of IDNA encoding
Main article: Punycode
As an example of how IDNA works, suppose the domain to be encoded is Bücher.ch (“Bücher” is German for “books”, and .ch is the country domain for Switzerland). This has two labels, Bücher and ch. The second label is pure ASCII, and so is left unchanged. The first label is processed by Nameprep to give bücher, and then by Punycode to give bcher-kva, and then has xn-- prepended to give xn--bcher-kva. The final domain suitable for use with the DNS is therefore xn--bcher-kva.ch.
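The same example can be reproduced with the "idna" codec in Python's standard library, which implements the 2003 version of IDNA. The helper names below are hypothetical and are used only for illustration.

```python
def encode_domain(domain: str) -> str:
    """Convert a whole domain name to its ASCII form with the built-in idna codec."""
    return domain.encode("idna").decode("ascii")


def decode_domain(ace_domain: str) -> str:
    """Convert an ASCII-form domain name back to its Unicode form."""
    return ace_domain.encode("ascii").decode("idna")


print(encode_domain("Bücher.ch"))         # xn--bcher-kva.ch
print(decode_domain("xn--bcher-kva.ch"))  # bücher.ch
```

The round trip returns the lowercase bücher.ch rather than the original Bücher.ch, because the Nameprep normalization is not reversed.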
ASCII Spoofing and squatting concerns
Main article: IDN homograph attack
Because IDN allows websites to use full Unicode names, it also makes it much easier to create a spoofed web site that looks exactly like another, including domain name and security certificate, but in fact is controlled by someone attempting to steal private information. These spoofing attacks potentially open users up to phishing attacks.
These attacks do not stem from technical deficiencies in either the Unicode or IDNA specifications; they arise because different characters in different languages can look the same, depending on the font used. For example, Unicode character U+0430, Cyrillic small letter a ("а"), can look identical to Unicode character U+0061, Latin small letter a ("a"), which is the lowercase "a" used in English. Characters that look alike in this way may be termed homonyms, homographs, or (less ambiguously) homoglyphs.
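The difference between the two characters is easy to confirm programmatically. The following Python sketch uses the standard unicodedata module; the helper name describe is hypothetical.

```python
import unicodedata

LATIN_A = "\u0061"     # Latin small letter a
CYRILLIC_A = "\u0430"  # Cyrillic small letter a


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(LATIN_A))      # U+0061 LATIN SMALL LETTER A
print(describe(CYRILLIC_A))   # U+0430 CYRILLIC SMALL LETTER A
print(LATIN_A == CYRILLIC_A)  # False
```

However similar the glyphs look on screen, the two strings compare as unequal.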
Although a computer may display visually identical or very similar glyphs for two different characters, these differences are still significant to the computer when locating web sites or validating certificates. The user assumes a one-to-one correspondence between the visual appearance of a name and the named entity, but when two names appear identical, this correspondence breaks down.
By contrast, with the old set of a to z, 0 to 9, and the hyphen, there is little in the way of homographs. Capital I (in a sans-serif font), lower-case l and number 1, and number 0 and capital O are the closest, and combinations, such as "r" with "n" and "v" with another "v" can look similar to "m" ("rn") and "w" ("vv"), in fonts that do not make a noticeable visible distinction between them. However, this provides a much smaller domain of collisions than all of Unicode.
In December 2001, two Israeli researchers, Evgeniy Gabrilovich and Alex Gontmakher, published a paper titled "The Homograph Attack",[11] describing an attack that used Unicode URLs to spoof a website URL. To prove the feasibility of this kind of attack, the researchers successfully registered a variant of the domain name "Microsoft.com" which incorporated Russian language characters.
In general, this kind of attack is known as a homograph spoofing attack. This problem was anticipated before IDN was introduced, and guidelines were issued to registries to try to avoid or reduce the problem, for example by recommending that registries accept only the Latin alphabet and that of their own country, not all of Unicode. Unfortunately, this advice was not followed by those in control of a number of major TLDs.
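A registry-side check in the spirit of this recommendation might look like the Python sketch below. It is a heuristic only: it takes the first word of each character's Unicode name as a stand-in for its script, and the function names are hypothetical rather than part of any registry's actual policy.

```python
import unicodedata
from typing import Set


def scripts_in(label: str) -> Set[str]:
    """Approximate the set of scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isascii() and (ch.isdigit() or ch == "-"):
            continue  # ASCII digits and the hyphen are treated as script-neutral
        # The first word of the name (e.g. LATIN, CYRILLIC) is used as a script proxy.
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """Flag labels that appear to mix characters from more than one script."""
    return len(scripts_in(label)) > 1


print(is_mixed_script("paypal"))       # False
print(is_mixed_script("p\u0430ypal"))  # True
```

A check of this kind would not catch a label written entirely in a single look-alike script, so it could only ever be one part of a registration policy.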
On February 7, 2005, Slashdot reported that this exploit had been disclosed at the hacker conference Shmoocon, with an example available at http://www.shmoo.com/idn/. On browsers supporting IDNA, the URL "http://www.pаypal.com/" (where the first a is replaced by a Cyrillic а) appeared to lead to paypal.com but instead led to a spoofed PayPal web site that said "Meeow."
Internet Explorer 7 imposes restrictions on displaying non-ASCII domain names based on a user-defined list of allowed languages and provides an anti-phishing filter that checks suspicious Web sites against a remote database of known phishing sites.
Since Internet Explorer prior to version 7 does not support IDNs, it is not vulnerable to this kind of attack. However, older versions of Internet Explorer can be made IDN-compatible by browser plug-ins, some of which are vulnerable to the spoofing attacks.
On February 17, 2005, Mozilla developers announced that the next versions of their software would ship with IDN support still enabled but would display the Punycode form of URLs instead. This thwarts attacks that exploit similarities between ASCII and non-ASCII letters while still allowing people to access websites on an IDN domain. It does not necessarily stop attacks that exploit similarities between, for example, Cyrillic and Greek letters, unless the user knows which Punycode URL corresponds to their chosen IDN URL. This was a change from the earlier plan to disable IDN entirely for the time being.[1]
Both Mozilla and Opera have since announced that they will use per-domain whitelists to selectively switch on IDN display for domains run by registries that take appropriate anti-spoofing precautions[2]. (See the article on homograph spoofing attacks for more details.) As of September 9, 2005, the most recent versions of Mozilla Firefox and Internet Explorer display the spoofed PayPal URL as "http://www.xn--pypal-4ve.com/", unsightly but clearly different from the actual paypal.com. By contrast, the (non-existent) "http://www.xn--pypal-4ve.org" displays in the Firefox address bar as http://www.pаypal.org, as this form of domain is prohibited from registration at the Afilias registry level and therefore does not pose the same risk.
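The whitelist approach can be sketched in Python as follows. The whitelist contents and the function name display_form are hypothetical, chosen only to mirror the .com and .org example above; real browsers maintain their own lists and apply further checks.

```python
# Hypothetical whitelist of TLDs whose registries are assumed to take
# anti-spoofing precautions; real browser lists differ.
TRUSTED_TLDS = {"org"}


def display_form(ace_domain: str) -> str:
    """Return the form of an ASCII-encoded domain name to show in the address bar."""
    tld = ace_domain.rsplit(".", 1)[-1].lower()
    if tld in TRUSTED_TLDS:
        # Trusted registry: show the readable Unicode form.
        return ace_domain.encode("ascii").decode("idna")
    # Otherwise show the raw Punycode form.
    return ace_domain


print(display_form("www.xn--pypal-4ve.com"))  # stays in Punycode
print(display_form("www.xn--pypal-4ve.org"))  # shown with the Cyrillic letter
```

Keeping the decision keyed on the registry rather than on individual names means the browser relies on each registry's registration policy to rule out confusable names.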
Safari's approach is to render problematic character sets as punycode. This can be changed by altering the settings in Mac OS X's system files[12].
Tuesday, October 20, 2009