The little secrets of string encoding in Java
Brief introduction
In this article, you will learn how Unicode relates to UTF-8, UTF-16 and UTF-32. You will also learn about variant UTF-8, and how UTF-8 and variant UTF-8 are used in Java.
Let's have a look.
The history of Unicode
A long time ago, a high-tech product called the computer appeared in the Western world.
The first generation of computers could only perform simple arithmetic and ran programs that were punched by hand. Over time, computers became smaller and more powerful, and hand-punched programs gave way to hand-written programming languages.
Everything was changing except one thing: computers and programming languages were used only in the West, where 26 letters and a limited set of punctuation marks were enough for everyday communication.
Early computer storage was very expensive, so a single byte (8 bits) was used to store every character that might be needed. Leaving the first bit aside, the remaining 7 bits give 128 values, enough for 26 lowercase letters, 26 uppercase letters, digits, punctuation marks and control characters.
This is the original ASCII code, the American Standard Code for Information Interchange.
Later, as computers spread around the world, people found that ASCII was not enough. For example, Chinese has more than 4,000 commonly used characters. What could be done?
The answer was to extend ASCII with localized encodings, loosely called ANSI encodings. If one byte is not enough, use two. Encodings exist to serve people, after all, so standards such as GB2312, BIG5 and JIS were produced. These encodings are compatible with ASCII, but they are not compatible with each other.
This seriously hindered internationalization. How could the dream of one world, one home be realized?
So international organizations stepped in and created the Unicode character set, which assigns a unique code to every character of every language. Unicode code points range from U+0000 to U+10FFFF.
So what is the relationship between Unicode and UTF-8, UTF-16 and UTF-32?
The Unicode character set ultimately has to be stored in files or in memory, and storing every code point at full width would take too much space. Should each character use a fixed 1 byte, a fixed 2 bytes, or a variable number of bytes? Different answers to that question give different encoding schemes: UTF-8, UTF-16, UTF-32 and others.
UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. UTF-16 uses 2 or 4 bytes. Since JDK 9, the internal representation of String uses one of two encodings: Latin-1 or UTF-16.
UTF-32 uses a fixed 4 bytes. Of the three, only UTF-8 is compatible with ASCII, which is one reason UTF-8 is the most widely used (after all, computer technology started in the West).
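ASCII compatibility is easy to check from Java. The following is a minimal sketch, not a definitive test: the class name AsciiCompatDemo and the helper sameAsAscii are made up for illustration, and only the standard StandardCharsets constants are used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiCompatDemo {
    // Returns true when the text encodes to identical bytes in US-ASCII and UTF-8.
    static boolean sameAsAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        String text = "Java";
        System.out.println(sameAsAscii(text));                                // true
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);     // 4
        System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length);  // 8
    }
}
```

Pure ASCII text produces the same bytes in both encodings, while UTF-16 spends two bytes on every one of those characters.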
Unicode in detail
Now that we know the history of Unicode, let's look in detail at how Unicode is encoded.
The Unicode standard has evolved from version 1.0, released in 1991, to version 13.0, released in March 2020 and the latest at the time of writing.
The range of code points that Unicode can represent is 0 to 0x10FFFF, written as U+0000 to U+10FFFF.
The code points from U+D800 to U+DFFF are reserved for UTF-16, so the actual number of Unicode characters is $2^{16} - 2^{11} + 2^{20} = 1112064$.
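The same arithmetic can be checked with the constants in java.lang.Character. This is a small sketch; the class name ScalarValueCount and its method are illustrative names, not part of the JDK.

```java
class ScalarValueCount {
    // Number of code points left once the UTF-16 surrogate range is excluded.
    static int scalarValueCount() {
        // 0x10FFFF + 1 = 0x110000 = 2^16 + 2^20 code points in total
        int allCodePoints = Character.MAX_CODE_POINT + 1;
        // 0xDFFF - 0xD800 + 1 = 0x800 = 2^11 surrogate code points
        int surrogates = Character.MAX_SURROGATE - Character.MIN_SURROGATE + 1;
        return allCodePoints - surrogates;
    }

    public static void main(String[] args) {
        System.out.println(scalarValueCount()); // prints 1112064
    }
}
```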
These code points are divided into 17 planes. The distribution of the planes is as follows:
Take plane 0, the Basic Multilingual Plane (BMP), as an example. It contains most commonly used characters. The following figure shows the characters represented in the BMP:
As mentioned above, U+D800 to U+DFFF are reserved for UTF-16. A high surrogate (U+D800 to U+DBFF) and a low surrogate (U+DC00 to U+DFFF) are used together as a pair of 16-bit units to encode non-BMP characters in UTF-16. A single surrogate on its own is meaningless.
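To tie planes and surrogates together, here is a short sketch that computes the plane of a code point and asks the JDK how many 16-bit units it needs. It assumes that a plane is simply the code point divided by 0x10000; PlaneDemo and planeOf are hypothetical names chosen for this example.

```java
class PlaneDemo {
    // Each plane holds 0x10000 code points, so the plane index is codePoint / 0x10000.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0x4E2D, 0x1F600, 0x10FFFF};
        for (int cp : samples) {
            // Character.charCount returns 1 for BMP code points and 2 for all others.
            System.out.printf("U+%04X plane=%d utf16Units=%d%n",
                    cp, planeOf(cp), Character.charCount(cp));
        }
        // The two halves of the surrogate pair for U+1F600.
        System.out.println(Character.isHighSurrogate((char) 0xD83D)); // true
        System.out.println(Character.isLowSurrogate((char) 0xDE00));  // true
    }
}
```

Code points in plane 0 fit in one 16-bit unit; everything in planes 1 to 16 needs a surrogate pair.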
UTF-8
UTF-8 represents all 1,112,064 Unicode characters in 1 to 4 bytes, so it is a variable-length encoding.
UTF-8 is currently the most common encoding on the web. Let's see how UTF-8 encodes Unicode:
The first 128 characters (U+0000 to U+007F) take one byte each and are encoded exactly as in ASCII, so UTF-8 is ASCII compatible.
The next 1,920 characters need two bytes. They cover almost all of the remaining Latin-script alphabets, as well as Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko, plus combining diacritical marks. The rest of the BMP needs three bytes and contains almost all characters in common use, including most Chinese, Japanese and Korean characters. Characters in the other planes need four bytes, including less common CJK characters, various historical scripts, mathematical symbols and emoji (pictographic symbols).
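The byte counts above follow directly from the range boundaries, and 1,920 is simply 0x800 - 0x80. A minimal sketch, assuming the input is a valid code point outside the surrogate range (Utf8Length and utf8Length are illustrative names):

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Number of UTF-8 bytes for a code point, by range.
    // Assumes a valid, non-surrogate code point; no validation is done here.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;      // U+0000 to U+007F (ASCII)
        if (codePoint < 0x800) return 2;     // U+0080 to U+07FF
        if (codePoint < 0x10000) return 3;   // rest of the BMP
        return 4;                            // planes 1 to 16
    }

    public static void main(String[] args) {
        System.out.println(0x800 - 0x80); // 1920 two-byte characters
        int[] samples = {0x41, 0x3A9, 0x4E2D, 0x1F600};
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            int jdk = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X expected=%d jdk=%d%n", cp, utf8Length(cp), jdk);
        }
    }
}
```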
The following is a concrete example of UTF-8 encoding:
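As an illustration, the sketch below encodes a few code points by hand and compares the result with the JDK's own encoder. The class Utf8Demo and its methods are hypothetical helpers written for this article; in real code, String.getBytes(StandardCharsets.UTF_8) does the work.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Demo {
    // Hand-written UTF-8 encoder for a single code point.
    // Bit layout per length:
    //   1 byte : 0xxxxxxx
    //   2 bytes: 110xxxxx 10xxxxxx
    //   3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    //   4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not encodable: " + cp);
        }
        if (cp < 0x80) {
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    // Formats bytes as space-separated uppercase hex, e.g. "E2 82 AC".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        int[] samples = {0x24, 0xA2, 0x20AC, 0x10348};
        for (int cp : samples) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %s (matches JDK: %b)%n",
                    cp, hex(mine), Arrays.equals(mine, jdk));
        }
    }
}
```

For the four samples this gives 24, C2 A2, E2 82 AC and F0 90 8D 88, one example for each encoded length.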
UTF-16
UTF-16 is also a variable-length encoding. It uses one or two 16-bit code units to represent each character.
UTF-16 is mainly used internally by Microsoft Windows, Java and JavaScript/ECMAScript.
However, UTF-16 is not widely used on the web.
Next, let's take a look at how UTF-16 encodes characters.
First, U+0000 to U+D7FF and U+E000 to U+FFFF: characters in this range are represented directly by a single 16-bit unit, which is very intuitive.
Then, U+010000 to U+10FFFF:
Subtract 0x10000 from the code point, which gives a 20-bit value in the range 0x00000 to 0xFFFFF.
Add 0xD800 to the high 10 bits (0x000 to 0x3FF), which gives the first 16-bit unit, in the range 0xD800 to 0xDBFF.
Add 0xDC00 to the low 10 bits (0x000 to 0x3FF), which gives the second 16-bit unit, in the range 0xDC00 to 0xDFFF.
U' = yyyyyyyyyyxxxxxxxxxx // U - 0x10000
W1 = 110110yyyyyyyyyy // 0xD800 + yyyyyyyyyy
W2 = 110111xxxxxxxxxx // 0xDC00 + xxxxxxxxxx
This is why Unicode reserves 0xD800 to 0xDFFF for UTF-16.
The following is an example of UTF-16 encoding:
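The steps above can be sketched in a few lines of Java and checked against Character.toChars, which performs the same conversion in the JDK. Utf16Demo and its encode method are illustrative names, not a library API.

```java
import java.util.Arrays;

class Utf16Demo {
    // Hand-written UTF-16 encoder for a single code point.
    static char[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not encodable: " + cp);
        }
        if (cp < 0x10000) {
            return new char[] {(char) cp};          // BMP: one 16-bit unit
        }
        int u = cp - 0x10000;                        // U' = 20-bit value
        char w1 = (char) (0xD800 + (u >> 10));       // high 10 bits -> high surrogate
        char w2 = (char) (0xDC00 + (u & 0x3FF));     // low 10 bits  -> low surrogate
        return new char[] {w1, w2};
    }

    // Formats 16-bit units as space-separated uppercase hex, e.g. "D801 DC37".
    static String hex(char[] units) {
        StringBuilder sb = new StringBuilder();
        for (char c : units) {
            sb.append(String.format("%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        int[] samples = {0x20AC, 0x10437, 0x24B62};
        for (int cp : samples) {
            char[] mine = encode(cp);
            System.out.printf("U+%04X -> %s (matches JDK: %b)%n",
                    cp, hex(mine), Arrays.equals(mine, Character.toChars(cp)));
        }
    }
}
```

For example, U+10437 minus 0x10000 is 0x00437; its high 10 bits are 0x001 and its low 10 bits are 0x037, so the pair is D801 DC37.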
UTF-32
UTF-32 is a fixed-length encoding in which every character is represented by one 32-bit unit.
Because 32 bits are enough to hold any code point, UTF-32 stores Unicode code points directly. The disadvantage is that UTF-32 takes up too much space, so few systems use it.
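In Java, the closest everyday equivalent of UTF-32 is an int array of code points, where each int holds the value that UTF-32 would store in 4 bytes. The sketch below uses String.codePoints(); the class name Utf32Demo and the helper toCodePoints are made up for this example.

```java
class Utf32Demo {
    // One int per code point: the same values UTF-32 stores, 4 bytes each.
    static int[] toCodePoints(String text) {
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "A" followed by U+1F600, which needs a surrogate pair in UTF-16.
        String text = "A" + new String(Character.toChars(0x1F600));
        int[] codePoints = toCodePoints(text);
        System.out.println(text.length());         // 3 UTF-16 code units
        System.out.println(codePoints.length);     // 2 code points
        System.out.println(codePoints.length * 4); // 8 bytes if stored as UTF-32
    }
}
```

The fixed width makes indexing by character trivial, at the cost of 4 bytes even for plain ASCII text.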
Null-terminated strings and variant UTF-8
In C, a string ends with the null character NUL ('\0').
Therefore, 0x00 cannot be stored in the middle of such a string. So what if we really want to store 0x00?
We can use variant UTF-8, which the Java documentation calls modified UTF-8.
In variant UTF-8, the null character (U+0000) is represented by two bytes: 11000000 10000000.
Therefore, variant UTF-8 can represent all Unicode characters, including the null character U+0000, without ever producing a 0x00 byte.
Generally speaking, in Java, InputStreamReader and OutputStreamWriter produce standard UTF-8 when UTF-8 is the charset in effect (it is the default charset since JDK 18), whereas object serialization, DataInput, DataOutput, JNI and the string constants in class files use variant UTF-8.
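The difference between the two encodings can be observed with DataOutputStream.writeUTF, which writes a 2-byte length followed by the variant (modified) UTF-8 bytes. The following is a sketch for illustration only; ModifiedUtf8Demo and its helpers are hypothetical names, and the 2-byte length prefix is stripped so that only the encoded characters are compared.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Encodes text with DataOutputStream.writeUTF and drops the 2-byte length prefix.
    static byte[] modifiedUtf8(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buffer.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    // Formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String withNul = "a\u0000b";
        System.out.println(hex(withNul.getBytes(StandardCharsets.UTF_8))); // 61 00 62
        System.out.println(hex(modifiedUtf8(withNul)));                    // 61 C0 80 62

        // A non-BMP character: writeUTF encodes each surrogate separately (3 bytes each).
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length); // 4
        System.out.println(modifiedUtf8(emoji).length);                    // 6
    }
}
```

Besides the two-byte form of U+0000, the output shows a second difference: characters outside the BMP take 6 bytes instead of 4, because each half of the surrogate pair is encoded on its own.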