Unicode in Microsoft Windows

Microsoft began implementing Unicode consistently in its products early on. Windows NT was the first operating system to use "wide characters" in system calls. It initially used the UCS-2 encoding scheme and was upgraded to UTF-16 starting with Windows 2000, which allows characters in the additional planes to be represented with surrogate pairs.
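The difference between UCS-2 and UTF-16 can be illustrated with a minimal Python sketch of the surrogate pair calculation. The helper name utf16_code_units is made up for this illustration and is not part of any Windows API.

```python
def utf16_code_units(ch):
    """Return the UTF-16 code units (as integers) for one code point."""
    cp = ord(ch)
    if cp < 0x10000:
        # Basic Multilingual Plane: one 16-bit unit, same as UCS-2.
        return [cp]
    # Additional planes: split the 20-bit offset into a surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 | (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 | (offset & 0x3FF)   # low (trail) surrogate
    return [high, low]


if __name__ == "__main__":
    for ch in ("A", "\u20ac", "\U0001F600"):
        print("U+%04X" % ord(ch), [hex(u) for u in utf16_code_units(ch)])
```

A code point above U+FFFF needs two 16-bit units, which is why software that assumes one unit per character (the UCS-2 view) miscounts such text.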

In various Windows families

Windows NT based systems

Windows XP and Windows Server 2003, and before them Windows NT 4 and Windows 2000, ship with system libraries that support two kinds of string encoding: UTF-16 (often called "Unicode" in Windows documentation) and an 8-bit encoding called the "code page" (also referred to, incorrectly, as the "ANSI code page"). 16-bit functions have names suffixed with -W (for "wide"), for example lstrlenW(). Code-page-oriented functions use the suffix -A (for "ANSI"), for example lstrlenA(). This split was necessary because many languages, including C, do not provide a clean way to pass both 8-bit and 16-bit strings to the same API or to put them in the same structure. Windows also provides the "M" API, which in some locales provides multi-byte encodings but in most locales is the same as "A". Most of these "A" and "M" functions are implemented as wrappers that translate the code page text to UTF-16 and call the corresponding "W" functions.
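The wrapper pattern described above can be modelled in a few lines of Python. This is only a sketch: set_title_w and set_title_a are hypothetical stand-ins rather than real Windows functions, and the default code page (Windows-1252) is an assumption.

```python
def set_title_w(title_utf16):
    """Hypothetical "W" entry point: accepts UTF-16LE bytes."""
    # The real work happens here, on UTF-16 data only.
    return title_utf16.decode("utf-16-le")


def set_title_a(title_bytes, code_page="cp1252"):
    """Hypothetical "A" entry point: translate from the code page, then delegate."""
    wide = title_bytes.decode(code_page).encode("utf-16-le")
    return set_title_w(wide)


if __name__ == "__main__":
    # The same byte means different characters under different code pages.
    print(set_title_a(b"\xe9", "cp1252"))
    print(set_title_a(b"\xe9", "cp1251"))
```

The "A" path can only ever pass along characters that exist in the active code page, whereas the "W" path accepts any UTF-16 text.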

The IsTextUnicode function applies a heuristic algorithm to a byte string to detect whether it represents UTF-16 text. For very short texts, this function, which is used by some applications such as Notepad, often gives incorrect results. This gave rise to legends about the existence of "Easter eggs" such as "Bush hid the facts".
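The following Python sketch shows why such a short text is ambiguous. It does not reproduce the heuristic used by IsTextUnicode; it only demonstrates that the 18 ASCII bytes of the phrase are also a well-formed UTF-16 string. The helper name misread_as_utf16 is made up for this example.

```python
def misread_as_utf16(data):
    """Decode bytes as little-endian UTF-16, as a misdetecting reader would."""
    return data.decode("utf-16-le")


if __name__ == "__main__":
    ascii_bytes = b"Bush hid the facts"
    text = misread_as_utf16(ascii_bytes)
    # Each pair of ASCII bytes becomes one 16-bit code unit.
    print(len(ascii_bytes), "bytes ->", len(text), "characters")
    print(["U+%04X" % ord(c) for c in text])
```

Every resulting code unit falls in the CJK Unified Ideographs range (U+4E00 to U+9FFF), so the bytes look like plausible Chinese text when read as UTF-16.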

Windows CE

In Windows CE, UTF-16 was used almost exclusively, with the "A" API mostly missing.

Windows 9x

In 2001, Microsoft released a special supplement for its older Windows 9x systems. It includes a dynamic link library, unicows.dll (only 240 KB), containing the 16-bit flavor (the functions whose names end in the letter W) of all the basic functions of the Windows API.

UTF-8

Although the locale can be set so that the "M" encodings handle some multi-byte encodings, it is not possible to set them to support UTF-8 (attempts to use the locale ID that MultiByteToWideChar accepts for UTF-8 are ignored). Because many libraries, including the standard C and C++ libraries, only allow access to files through the "M" API, it is not possible to open all Unicode-named files with them. Thus software that uses only a portable API does not get Unicode support on Windows.
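The limitation can be sketched in Python by checking whether a filename survives a round trip through a given code page. The helper representable_in_code_page is hypothetical, and Python's codec names (cp1252, cp932) stand in for Windows code pages here.

```python
def representable_in_code_page(name, code_page):
    """Return True if every character of name exists in the code page."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for name in ("caf\u00e9.txt", "\u65e5\u672c\u8a9e.txt"):
        for cp in ("cp1252", "cp932"):
            print(name, cp, representable_in_code_page(name, cp))
```

A file whose name contains characters outside the active code page has no byte string that an 8-bit API could use to refer to it, which is the situation the paragraph above describes.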

There are proposals to add APIs to portable libraries such as Boost to do the necessary conversion, by adding new functions for opening and renaming files. These functions would pass filenames through unchanged on Unix, but translate them to UTF-16 on Windows.[1]
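A minimal Python sketch of that idea follows. It is not the API of Boost.Nowide or of any other library: the function native_filename and its platform parameter are invented for illustration, and the platform is passed explicitly so the behaviour of both branches is visible.

```python
def native_filename(utf8_name, platform):
    """Hypothetical helper: bytes a portable wrapper would hand to the OS."""
    if platform == "windows":
        # Translate UTF-8 to UTF-16 before calling a "W" function.
        return utf8_name.decode("utf-8").encode("utf-16-le")
    # On Unix the filename is passed through unchanged.
    return utf8_name


if __name__ == "__main__":
    name = "na\u00efve.txt".encode("utf-8")
    print(native_filename(name, "unix"))
    print(native_filename(name, "windows"))
```

With a wrapper of this kind, application code can keep all filenames in UTF-8 and leave the conversion to a single place.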

Many applications inevitably have to support UTF-8 because it is the most widely used Unicode encoding scheme in various network protocols, including the Internet Protocol Suite. An application that has to pass UTF-8 to or from a "W" Windows API should call the functions MultiByteToWideChar and WideCharToMultiByte.[2] To get predictable handling of errors and surrogate halves, it is more common for software to implement its own versions of these functions.
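As an illustration of what "predictable handling" can mean, here is a minimal Python sketch of a strict conversion pair. It uses Python's built-in codecs rather than MultiByteToWideChar or WideCharToMultiByte, and the function names utf8_to_utf16 and utf16_to_utf8 are made up for this example.

```python
def utf8_to_utf16(data):
    """Convert UTF-8 bytes to UTF-16LE bytes; raise ValueError on malformed input."""
    return data.decode("utf-8", errors="strict").encode("utf-16-le")


def utf16_to_utf8(data):
    """Convert UTF-16LE bytes to UTF-8 bytes; raise ValueError on unpaired surrogates."""
    return data.decode("utf-16-le", errors="strict").encode("utf-8")


if __name__ == "__main__":
    wide = utf8_to_utf16("caf\u00e9".encode("utf-8"))
    print(wide, utf16_to_utf8(wide))
    try:
        # A high surrogate followed by "A" is not a valid pair.
        utf16_to_utf8(b"\x3d\xd8\x41\x00")
    except ValueError as exc:
        print("rejected:", exc)
```

Rejecting malformed input outright is one possible policy; replacing bad sequences with U+FFFD is another. The point is that the behaviour is chosen explicitly by the application.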

References

  1. "Boost.Nowide".
  2. "UTF-8 in Windows". Stack Overflow. Retrieved July 1, 2011.
