UTF-8 AND design All browsers


lu9821

Posts: 1

Reputation: 0

Message # 1 | 3:36 PM 2010-01-30

What is UTF-8? Also, how do I test a design in all browsers?
For example: Mozilla Firefox, Opera, Internet Explorer, and others.

Pretextat

Posts: 6

Reputation: 0

Message # 2 | 3:43 PM 2010-01-30

lu9821, UTF-8 is a character encoding!

Code

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages,[1][2] and other places where characters are stored or streamed.

UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single octet encoding used only for the 128 US-ASCII characters.
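To see the 1-to-4-octet rule in practice, here is a minimal Python sketch. The four sample characters are my own picks for illustration (they are not from the quoted text), written as escapes so the snippet does not depend on the file's own encoding.

```python
# Sample characters chosen for illustration (an assumption, not from the quote):
# U+0041 "A", U+00E9 e-acute, U+20AC euro sign, U+1F600 an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character to the number of bytes its UTF-8 encoding uses.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```

Only the ASCII letter fits in one octet; the other three need two, three, and four octets respectively.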

The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8.[3] The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8.

Code


Unicode is a character set supported across many commonly used software applications and operating systems. For example, many popular web browser, e-mail, and word processing applications support Unicode. Operating systems that support Unicode include Solaris Operating Environment, Linux, Microsoft Windows 2000, and Apple's Mac OS X. Applications that support Unicode are often capable of displaying multiple languages and scripts within the same document. In a multilingual office or business setting, Unicode's importance as a universal character set cannot be overlooked.

Unicode is the only practical character set option for applications that support multilingual documents. However, applications do have several options for how they encode Unicode. An encoding is the mapping of Unicode code points to a stream of storable code units or octets. The most common encodings include the following:
UTF-8
UTF-16
UTF-32
Each encoding has advantages and drawbacks. However, one encoding in particular has gained widespread acceptance. That encoding is UTF-8. This article describes UTF-8, what it is, and why it is important.
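A rough way to compare the three encodings is to encode the same string with each and count the bytes. This is only a sketch: the sample string is my own, and I use the explicit little-endian codec names so Python does not prepend a byte order mark to the UTF-16 and UTF-32 results.

```python
# "Hello, " (7 ASCII characters) followed by two CJK characters.
text = "Hello, \u4e16\u754c"

# Encoded size in bytes for each encoding form (no byte order mark).
sizes = {
    name: len(text.encode(name))
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes for {len(text)} code points")
```

For mostly-ASCII text like this, UTF-8 is the smallest of the three; the balance shifts for text that is mostly CJK, where UTF-16 uses two bytes per character and UTF-8 uses three.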

Table 1 defines some terms that are used in this document.

Table 1 Common Definitions

Character Set    A repertoire of characters that have been collected together for some purpose.
Coded Character Set    An ordered character set in which each character has an assigned integer value.
Code Point    The integer value of a character within a coded character set.
Character Encoding    A mapping of code points to a series of bytes.
Code Unit    A single octet or byte of an encoded character.
Charset    A set of characters that has been encoded using a character encoding. Often used as a synonym for character encoding.
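The terms in Table 1 map fairly directly onto Python built-ins. A small sketch, using one character I picked as an example: `ord` gives the code point, `str.encode` applies the character encoding, and the resulting bytes are the code units (octets, in the case of UTF-8).

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, chosen as an example

code_point = ord(ch)               # integer value in the coded character set
encoded = ch.encode("utf-8")       # character encoding: code point -> bytes
code_units = list(encoded)         # the individual octets
decoded = encoded.decode("utf-8")  # and back to the original character

print(f"code point U+{code_point:04X}")
print("code units:", [f"{b:02X}" for b in code_units])
```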

What is it?
Unicode 3.1 code points exist in the range U+0000 - U+10FFFF. Although each code point can be stored and manipulated as a 32-bit integer, a 32-bit-wide character encoding is unlikely to be adopted everywhere right away. This is especially true for Western European and non-Asian nations in general, which can encode their legacy character sets in as little as one byte per character.

UTF-8 is a multibyte encoding in which each character is encoded in at least one byte and at most four bytes. Most Western European languages require fewer than two bytes per character on average. For example, characters from Latin-based scripts require only 1.1 bytes on average. Greek, Arabic, Hebrew, and Russian require an average of 1.7 bytes. Finally, Japanese, Korean, and Chinese typically require three bytes per character. [1]
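The per-script averages can be checked on any text you have at hand. Below is a small sketch; `average_utf8_bytes` is a hypothetical helper name and the sample strings are my own. Real running text averages lower than pure-script samples, presumably because spaces, digits, and punctuation are single-byte ASCII.

```python
def average_utf8_bytes(text: str) -> float:
    """Return the mean number of UTF-8 bytes per code point in text."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


latin = "caf\u00e9"             # three ASCII letters plus one accented letter
greek = "\u03b1\u03b2\u03b3"    # alpha, beta, gamma
cjk = "\u65e5\u672c\u8a9e"      # three CJK ideographs

for label, sample in (("Latin", latin), ("Greek", greek), ("CJK", cjk)):
    print(f"{label}: {average_utf8_bytes(sample):.2f} bytes per character")
```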

The encoding algorithm is straightforward. Table 2 below shows how bits from a Unicode code point are arranged in the encoding for different character ranges.

Table 2 UTF-8 Bit Encoding of a Unicode Code Point

Character Range       1st Byte    2nd Byte    3rd Byte    4th Byte
U+0000..U+007F        00..7F
U+0080..U+07FF        C2..DF      80..BF
U+0800..U+0FFF        E0          A0..BF      80..BF
U+1000..U+CFFF        E1..EC      80..BF      80..BF
U+D000..U+D7FF        ED          80..9F      80..BF
U+D800..U+DFFF        ill-formed
U+E000..U+FFFF        EE..EF      80..BF      80..BF
U+10000..U+3FFFF      F0          90..BF      80..BF      80..BF
U+40000..U+FFFFF      F1..F3      80..BF      80..BF      80..BF
U+100000..U+10FFFF    F4          80..8F      80..BF      80..BF
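The byte ranges for UTF-8 come from a simple bit-packing rule: the lead byte carries a length prefix (0xxxxxxx, 110xxxxx, 1110xxxx, or 11110xxx) and every following byte has the form 10xxxxxx with six payload bits. Here is a hand-rolled sketch of that rule. `utf8_encode` is a name I made up for illustration; in real code you would simply call `str.encode("utf-8")`, which the loop at the end uses as a cross-check.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are ill-formed in UTF-8")
    if code_point <= 0x7F:          # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    return bytes([                  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


# Cross-check against Python's built-in encoder at the range boundaries.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

The restricted second-byte ranges after E0, ED, F0, and F4 fall out of the same rule: they exclude overlong forms, surrogates, and values above U+10FFFF.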

As the above table shows, characters in the range U+0000 - U+007F can be encoded as a single byte. This means that the ASCII charset can be represented unchanged with a single byte of storage space. The next range, U+0080 - U+07FF, contains the remaining characters for most of the world's scripts and includes characters with diacritics. This range requires two bytes of encoded storage. The notable scripts in the range U+0800 - U+FFFF are Chinese, Korean, and Japanese. These scripts require three bytes of storage for each character. Finally, the non-BMP range contains characters that can be represented as surrogate pairs in UTF-16. Most of the new characters in this range are Chinese ideographs. The newly defined characters in this range require four bytes in the UTF-8 encoding.
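Two of the points above are easy to verify yourself: ASCII text is byte-for-byte identical in UTF-8, and a non-BMP character is a surrogate pair in UTF-16 but a single four-byte sequence in UTF-8. A short sketch, with U+20000 (a CJK ideograph outside the BMP) picked by me as the example:

```python
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

ch = "\U00020000"  # example non-BMP code point (my choice)

# UTF-16 (big-endian, no byte order mark): two 16-bit code units.
utf16 = ch.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")   # high (lead) surrogate
low = int.from_bytes(utf16[2:], "big")    # low (trail) surrogate

# UTF-8: one four-byte sequence, no surrogates involved.
utf8 = ch.encode("utf-8")

print("ASCII unchanged in UTF-8:", same_bytes)
print(f"UTF-16 surrogate pair: {high:04X} {low:04X}")
print("UTF-8 bytes:", utf8.hex(" ").upper())
```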

FROM WIKIPEDIA

Yes,we can!!!
GEORGIAN ARMY
