On the encoding and implementation of Unicode in Java
Encoding and implementation of Unicode
Generally speaking, the Unicode standard can be divided into two levels: the encoding model (how characters are assigned numbers) and the implementation model (how those numbers are stored and transmitted).
Encoding model
A character is the smallest unit of abstract text. It has no fixed shape (that is the role of a glyph) and no inherent numeric value. "A" is a character, and so is a letter, ideograph, or symbol from any other script. A character set is a collection of characters; a coded character set is a character set that assigns a unique number to each of its characters.
Unicode was originally designed as a fixed-width, 16-bit character encoding; that is, each character occupied 2 bytes, so in theory a total of 2^16 (i.e., 65,536) characters could be represented. The characters covered by this 16-bit design form the Basic Multilingual Plane (BMP). A character in the Basic Multilingual Plane is written as U+hhhh, where each h is a hexadecimal digit.
Obviously, 65,536 characters encoded in 16 bits cannot represent all the characters in use (or once in use) around the world. As a result, the Unicode standard has been extended to allow up to 1,112,064 characters. Characters beyond the original 16-bit limit are called supplementary characters. Unicode 2.0 was the first version of the standard designed to accommodate supplementary characters, but the first supplementary characters were not actually assigned until version 3.1.
Mapping of Unicode character planes
At present, Unicode characters are arranged in 17 groups, each called a plane, and each plane has 65,536 (i.e., 2^16) code points. However, only a few planes are currently in use.
Supplementary characters are characters whose code points lie in the range U+10000 to U+10FFFF (planes 1 through 16), that is, characters that cannot be represented by the original 16-bit Unicode design. The character set from U+0000 to U+FFFF is referred to as the Basic Multilingual Plane (BMP). Every Unicode character is therefore either a BMP character or a supplementary character.
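As a small sketch of this distinction in Java (using only standard java.lang.Character methods; the two code points are arbitrary examples), the following checks whether a code point is a BMP or a supplementary character and how many 16-bit chars it needs:

```java
public class PlaneDemo {
    public static void main(String[] args) {
        int bmpCodePoint = 0x4E59;    // a BMP code point (U+4E59)
        int suppCodePoint = 0x1F600;  // a supplementary code point (U+1F600)

        // Character.isBmpCodePoint was added in Java 7
        System.out.println(Character.isBmpCodePoint(bmpCodePoint));            // true
        System.out.println(Character.isSupplementaryCodePoint(suppCodePoint)); // true

        // A BMP code point fits in one char; a supplementary one needs two
        System.out.println(Character.toChars(bmpCodePoint).length);   // 1
        System.out.println(Character.toChars(suppCodePoint).length);  // 2
    }
}
```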
Implementation model
UTF-32, UTF-16, and UTF-8 are concrete implementation schemes. Implementation is distinct from encoding: the Unicode code point of a character is fixed, but in actual storage and transmission, because different system platforms are not necessarily designed the same way, and in order to save space, the code point can be realized in different ways. An implementation of the Unicode encoding is called a Unicode Transformation Format (UTF).
For example, if a Unicode file contains only basic 7-bit ASCII characters, transmitting each character as the original 2-byte Unicode encoding means the high byte of every character is always 0, which is a great waste. In this case UTF-8, a variable-length encoding, can be used: it still represents the basic 7-bit ASCII characters in 7 bits, occupying a single byte whose leading bit is 0. When other Unicode characters are mixed in, they are converted according to a fixed algorithm, each BMP character being encoded in 1 to 3 bytes, with the leading bits of the first byte indicating how many bytes follow. In this way the encoded length of Western documents dominated by 7-bit ASCII characters is greatly reduced (see UTF-8 for the detailed scheme). Similarly, for the 4-byte supplementary-plane characters and any future UCS-4 extended characters, the 16-bit code units of UTF-16 also have to be converted by a fixed algorithm.
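A minimal sketch of this variable-length behavior, assuming a Java environment with java.nio.charset.StandardCharsets (the sample characters are illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    public static void main(String[] args) {
        // 1 byte per character for 7-bit ASCII text
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);       // 1
        // 2 bytes for U+00E9 (Latin small letter e with acute)
        System.out.println("\u00E9".getBytes(StandardCharsets.UTF_8).length);  // 2
        // 3 bytes for a BMP CJK character such as U+4E59
        System.out.println("\u4E59".getBytes(StandardCharsets.UTF_8).length);  // 3
        // 4 bytes for a supplementary character such as U+1F600
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length);     // 4
    }
}
```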
As another example, if UTF-16 (which for BMP characters coincides with the Unicode code point values) is used directly, each character occupies two bytes, and Macintosh (Mac) computers and PCs do not agree on byte order, so the same byte stream may be interpreted as different content. For example, consider a character with the hexadecimal code 4E59, stored as the two bytes 4E and 59. If a Mac reads it starting from the low byte, Mac OS interprets the code as 594E and finds the character "奎" (Kui); if Windows reads it starting from the high byte, it gets U+4E59, which is the character "乙". In other words, a character "乙" saved in UTF-16 under Windows will be displayed as "奎" when the file is opened under Mac OS. This shows that the byte order of UTF-16 can be misread if it is not explicitly specified, which is why the UTF-16 implementation uses the concepts of big-endian (abbreviated UTF-16 BE), little-endian (abbreviated UTF-16 LE), and an optional byte order mark (BOM). At present, Windows and Linux systems on PCs default to UTF-16 LE for UTF-16 encoding. (See UTF-16 for the detailed scheme.)
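To make the byte-order point concrete, here is a small sketch (using only the charsets guaranteed by java.nio.charset.StandardCharsets) that prints the same character U+4E59 under the three UTF-16 variants:

```java
import java.nio.charset.StandardCharsets;

public class ByteOrderDemo {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u4E59";  // the character U+4E59 from the example above

        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 4E 59
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 59 4E
        // "UTF-16" without an explicit byte order prepends a byte order mark
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 4E 59
    }
}
```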
In addition, Unicode implementations include UTF-7, Punycode, CESU-8, SCSU, UTF-32, GB18030, and others; some of these are only used in certain countries and regions, while others are intended for the future. The implementations in common use today are UTF-16 little-endian (LE), UTF-16 big-endian (BE), and UTF-8. In the Notepad application bundled with Microsoft Windows XP, the four encodings that can be selected in the Save As dialog box are ANSI (a non-Unicode encoding: ASCII for English systems, GB2312 or Big5 for Chinese systems), "Unicode" (corresponding to UTF-16 LE), "Unicode big endian" (corresponding to UTF-16 BE), and "UTF-8".
Code point and code position
In character-encoding terminology, a code point or code position is a numeric value that makes up the code space (or code page). For example, ASCII contains 128 code points, ranging from 0x00 to 0x7F; extended ASCII contains 256 code points, ranging from 0x00 to 0xFF; and Unicode contains 1,114,112 code points, ranging from 0x0 to 0x10FFFF. The Unicode code space is divided into 17 planes (the Basic Multilingual Plane plus 16 supplementary planes), and each plane has 65,536 (= 2^16) code points. The total Unicode code space is therefore 17 × 65,536 = 1,114,112 code points.
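As a quick check of these figures, a sketch using the code-point constants defined in java.lang.Character:

```java
public class CodeSpaceDemo {
    public static void main(String[] args) {
        System.out.println(Integer.toHexString(Character.MIN_CODE_POINT)); // 0
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff

        // 17 planes of 65,536 code points each
        int planes = 17;
        int pointsPerPlane = 65536;                       // 2^16
        System.out.println(planes * pointsPerPlane);      // 1114112
        System.out.println(Character.MAX_CODE_POINT + 1); // 1114112 as well
    }
}
```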
Code unit
A code unit is the shortest bit combination that serves as a unit in encoded text. For UTF-8 the code unit is 8 bits long; for UTF-16 it is 16 bits; for UTF-32 it is 32 bits. (The term "code value" is an outdated synonym.) With these two concepts understood, we can say that what UTF-n (n = 8, 16, 32) does is map the abstract code points of the Unicode character set onto sequences of n-bit integers (i.e., code units) for storage or transmission.
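A sketch of what "code unit" means in practice in Java, where a String is internally a sequence of UTF-16 code units (the sample text is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitDemo {
    public static void main(String[] args) {
        // "A" plus the supplementary character U+1F600
        String s = "A" + new String(Character.toChars(0x1F600));

        byte[] utf8Units  = s.getBytes(StandardCharsets.UTF_8); // 8-bit code units
        char[] utf16Units = s.toCharArray();                    // 16-bit code units

        System.out.println(utf8Units.length);                // 5 code units (1 + 4 bytes)
        System.out.println(utf16Units.length);               // 3 code units (1 char + a surrogate pair)
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
    }
}
```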
UTF-32 represents each Unicode code point as a 32-bit integer with the same value. It is obviously the most convenient form for internal processing, but as a general string representation it consumes more memory.
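Java has no 32-bit character type, but a rough equivalent of the UTF-32 view is one int per code point, for example via String.codePoints() (Java 8+); a sketch:

```java
public class Utf32Demo {
    public static void main(String[] args) {
        String s = "A" + new String(Character.toChars(0x1F600));

        // One 32-bit int per Unicode code point, with the same numeric value
        int[] utf32 = s.codePoints().toArray();
        for (int cp : utf32) {
            System.out.printf("U+%04X%n", cp);  // U+0041, U+1F600
        }

        // 2 code points stored in 8 bytes here, versus 5 bytes in UTF-8
        System.out.println(utf32.length * 4);
    }
}
```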
UTF-16 encodes Unicode code points using sequences of one or two unsigned 16-bit code units. The values U+0000 to U+FFFF are encoded as a single 16-bit unit with the same value. Supplementary characters are encoded as two code units, the first from the high-surrogate range (U+D800 to U+DBFF) and the second from the low-surrogate range (U+DC00 to U+DFFF). This may look conceptually like a multi-byte encoding, but there is an important difference: the values U+D800 to U+DFFF are reserved for UTF-16, and no characters are assigned to those code points. This means that for each individual code unit in a string, software can tell whether it represents a single-unit character or is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, in which the byte value 0x41 might represent either the letter "A" or the second byte of a double-byte character.
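A sketch of how this looks through the java.lang.Character surrogate API (the code point U+1F600 is an arbitrary supplementary example; highSurrogate/lowSurrogate require Java 7+):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x1F600;  // a supplementary code point

        // Split the code point into its two UTF-16 code units
        char high = Character.highSurrogate(codePoint);  // from U+D800..U+DBFF
        char low  = Character.lowSurrogate(codePoint);   // from U+DC00..U+DFFF
        System.out.printf("U+%04X U+%04X%n", (int) high, (int) low); // U+D83D U+DE00

        // Each unit is unambiguous on its own
        System.out.println(Character.isHighSurrogate(high)); // true
        System.out.println(Character.isLowSurrogate(low));   // true

        // Recombine the pair into the original code point
        System.out.println(Character.toCodePoint(high, low) == codePoint); // true
    }
}
```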
UTF-8 encodes Unicode code points using sequences of one to four bytes. U+0000 to U+007F use one byte, U+0080 to U+07FF use two bytes, U+0800 to U+FFFF use three bytes, and U+10000 to U+10FFFF use four bytes. UTF-8 is designed so that the byte values 0x00 to 0x7F always represent the code points U+0000 to U+007F (the Basic Latin subset, which corresponds to the ASCII character set) and never appear as part of the encoding of any other code point. This property makes it easy for software to assign special meaning to certain ASCII characters even when processing raw UTF-8 bytes.
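The practical consequence is that byte-oriented code can scan UTF-8 data for ASCII delimiters without decoding it first. A hedged sketch (the path string is just an example):

```java
import java.nio.charset.StandardCharsets;

public class AsciiTransparencyDemo {
    public static void main(String[] args) {
        // A path mixing ASCII delimiters with non-ASCII characters
        String path = "dir/\u4E59\u594E/file.txt";
        byte[] utf8 = path.getBytes(StandardCharsets.UTF_8);

        // Counting '/' directly in the raw bytes is safe, because 0x2F can
        // only ever appear in UTF-8 as the ASCII character '/', never as
        // part of a multi-byte sequence.
        int slashes = 0;
        for (byte b : utf8) {
            if (b == '/') slashes++;
        }
        System.out.println(slashes); // 2
    }
}
```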
The following table compares the representations of several characters under the different encoding forms:
Note: the numbers in the above encodings are expressed in hexadecimal.
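To reproduce such a comparison, the sketch below prints the code point, UTF-8 bytes, and UTF-16BE bytes of a few illustrative characters in hexadecimal (the chosen characters are arbitrary examples, not the ones from the original table):

```java
import java.nio.charset.StandardCharsets;

public class EncodingComparison {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] samples = { "A", "\u00E9", "\u4E59", new String(Character.toChars(0x1F600)) };
        for (String s : samples) {
            int cp = s.codePointAt(0);  // decodes a surrogate pair if present
            System.out.printf("U+%04X  UTF-8: %-12s  UTF-16BE: %s%n",
                    cp,
                    hex(s.getBytes(StandardCharsets.UTF_8)),
                    hex(s.getBytes(StandardCharsets.UTF_16BE)));
        }
    }
}
```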
Summary
The above is all of the content on the encoding and implementation of Unicode in Java; I hope it is helpful to you. Interested readers can also refer to other articles on this site, such as the code examples of Java programming for converting Chinese characters to Unicode codes and the Java source code analysis of the Object class. If anything is missing, please leave a comment to point it out. Thank you for your support!