UTF-8 AND design All browsers
lu9821, UTF-8 is an encoding!
UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages,[1][2] and other places where characters are stored or streamed. UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single-octet encoding used only for the 128 US-ASCII characters. The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8.[3] The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8.

Unicode is a character set supported across many commonly used software applications and operating systems. For example, many popular web browsers, e-mail programs, and word processing applications support Unicode. Operating systems that support Unicode include Solaris Operating Environment, Linux, Microsoft Windows 2000, and Apple's Mac OS X. Applications that support Unicode are often capable of displaying multiple languages and scripts within the same document. In a multilingual office or business setting, Unicode's importance as a universal character set cannot be overlooked.

Unicode is the only practical character set option for applications that support multilingual documents. However, applications do have several options for how they encode Unicode. An encoding is the mapping of Unicode code points to a stream of storable code units or octets. The most common encodings are UTF-8, UTF-16, and UTF-32. Each encoding has advantages and drawbacks. However, one encoding in particular has gained widespread acceptance. That encoding is UTF-8. This article describes UTF-8, what it is, and why it is important.
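To see the "1 to 4 octets" claim and the ASCII compatibility in practice, here is a small Python sketch. The sample characters and the helper name utf8_length are my own picks for illustration, not something from the quoted texts; it only relies on the built-in str.encode.

```python
# Sample characters are illustrative assumptions, not taken from the quoted article.
SAMPLES = {
    "A": "U+0041, US-ASCII letter",
    "\u00e9": "U+00E9, Latin letter with a diacritic",
    "\u4e2d": "U+4E2D, CJK ideograph",
    "\U0001d11e": "U+1D11E, character outside the BMP",
}


def utf8_length(ch: str) -> int:
    """Return how many bytes the given text occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for ch, note in SAMPLES.items():
    encoded = ch.encode("utf-8")
    print(f"{note}: {utf8_length(ch)} byte(s), hex {encoded.hex()}")

# Pure ASCII text is byte-for-byte identical in ASCII and in UTF-8.
assert "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")
```

Running it should show the byte count growing from one for the ASCII letter to four for the character outside the BMP.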
Table 1 defines some terms that are used in this document.

Table 1 Common Definitions

Term                  Definition
Character Set         A repertoire of characters that have been collected together for some purpose.
Coded Character Set   An ordered character set in which each character has an assigned integer value.
Code Point            The integer value of a character within a coded character set.
Character Encoding    A mapping of code points to a series of bytes.
Code Unit             A single octet or byte of an encoded character.
Charset               A set of characters that has been encoded using a character encoding. Often used as a synonym for character encoding.

What is it?

Unicode 3.1 code points exist in the range U+0000 - U+10FFFF. Although each of the code points can be stored and manipulated as a 32-bit integer, convincing the world to use a 32-bit wide character encoding won't be immediately successful everywhere. This is especially true for Western European and non-Asian nations in general, which can encode their legacy character sets in as little as one byte per character.

UTF-8 is a multibyte encoding in which each character can be encoded in as little as one byte and as many as four bytes. Most Western European languages require fewer than two bytes per character on average. For example, characters from Latin-based scripts require only 1.1 bytes on average. Greek, Arabic, Hebrew, and Russian require an average of 1.7 bytes. Finally, Japanese, Korean, and Chinese typically require three bytes per character. [1]

The encoding algorithm is straightforward. Table 2 below shows how bits from a Unicode code point are arranged in the encoding for different character ranges.
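The encoding algorithm mentioned above can be sketched by hand in Python. This is only a minimal illustration under my own assumptions: the function name encode_code_point is hypothetical, and the bit layout follows the standard UTF-8 scheme (a lead byte carrying the high bits, then continuation bytes of the form 10xxxxxx). It is checked against Python's built-in encoder rather than meant to replace it.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (hypothetical helper, for illustration)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are ill-formed in UTF-8")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare the sketch with the built-in encoder on a few code points.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_code_point(cp) == chr(cp).encode("utf-8")
```

The surrogate check is there because code points U+D800 to U+DFFF are reserved for UTF-16 and have no well-formed UTF-8 form.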
Table 2 UTF-8 Bit Encoding of a Unicode Code Point

Character Range        1st Byte   2nd Byte   3rd Byte   4th Byte
U+0000..U+007F         00..7F
U+0080..U+07FF         C2..DF     80..BF
U+0800..U+0FFF         E0         A0..BF     80..BF
U+1000..U+CFFF         E1..EC     80..BF     80..BF
U+D000..U+D7FF         ED         80..9F     80..BF
U+D800..U+DFFF         ill-formed
U+E000..U+FFFF         EE..EF     80..BF     80..BF
U+10000..U+3FFFF       F0         90..BF     80..BF     80..BF
U+40000..U+FFFFF       F1..F3     80..BF     80..BF     80..BF
U+100000..U+10FFFF     F4         80..8F     80..BF     80..BF

As the above table shows, characters in the range U+0000 - U+007F can be encoded as a single byte. This means that the ASCII charset can be represented unchanged with a single byte of storage space.

The next range, U+0080 - U+07FF, contains the remaining characters for most of the world's scripts and includes characters with diacritics. This range requires two bytes of encoded storage.

The notable scripts in the range U+0800 - U+FFFF are Chinese, Korean, and Japanese. These scripts require three bytes of storage for each character.

Finally, the non-BMP range (U+10000 - U+10FFFF) contains characters that are represented as surrogate pairs in UTF-16. Most of the new characters in this range are Chinese ideographs. The newly defined characters in this range require four bytes in the UTF-8 encoding.

Yes, we can!!!
GEORGIAN ARMY