Examples of ANSI, Unicode, BMP, UTF and other encoding concepts

1. Foreword

Ever since I began writing Java code, I have run into countless garbled-text and transcoding problems: garbled strings read from text files, garbled HTTP request parameters in servlets, garbled data returned by JDBC queries, and so on. These problems are so common that whenever I hit one I could search for a fix and move on, so I never developed a deep understanding.

That changed two days ago, when a classmate and I got talking about the encoding of a Java source file (the problem analyzed in the last example below). Starting from that question we pulled out a whole series of related ones, then dug through references and discussed them late into the night, until a blog post finally gave us the key clue that resolved every doubt and explained all the statements I had never understood before. So I decided to use this post to record what I learned about encoding, both the concepts and the experimental results.

Some of the concepts below are my own understanding combined with practice; if anything is wrong, please don't hesitate to point it out.

2. Concept summary

In the early days the Internet was undeveloped and computers mostly processed local data, so many countries and regions designed encoding schemes for their local languages. These region-specific encodings are collectively called ANSI encodings, because they are all extensions of the ANSI/ASCII code. But the designers never coordinated on compatibility; each went its own way, planting the seeds of encoding conflicts. For example, the GB2312 encoding used on the Chinese mainland conflicts with the Big5 encoding used in Taiwan: the same two bytes represent different characters in the two schemes. With the rise of the Internet, a document often contains several languages, and the computer has trouble displaying it because it cannot tell which encoding a given pair of bytes belongs to.

Such problems were common around the world, so calls grew louder for a single universal character set that numbers every character in the world.

Thus Unicode was born. It assigns a number to every character in the world; since each number uniquely identifies one character, fonts only need to be designed against Unicode. However, the Unicode standard defines a character set without mandating an encoding scheme: it specifies the abstract numbers (code points) and their corresponding characters, but not how a sequence of those numbers is stored. The schemes that actually specify storage are UTF-8, UTF-16, UTF-32 and so on, which is where the UTF-prefixed encodings come from. As the name suggests, UTF-8 uses 8-bit units as its basic building block; it is a variable-length encoding, originally designed to use 1 to 6 bytes per character (because of the limits of the Unicode range, at most 4 bytes in practice). UTF-16 uses 16-bit basic units and is also variable-length: a character takes either 2 or 4 bytes. UTF-32 is fixed-length: every Unicode number is stored in exactly 4 bytes.
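The size differences among these schemes are easy to check in Java. This is a sketch of mine, not from the original article: the class name and sample characters are my own, and I assume the JRE provides a UTF-32BE charset (OpenJDK does).

```java
import java.nio.charset.Charset;

public class EncodingLengths {
    public static void main(String[] args) {
        // "A" is ASCII, "好" (U+597D) is a CJK character inside the BMP,
        // "😀" (U+1F600) lies outside the BMP and needs a surrogate pair in UTF-16.
        String[] samples = { "A", "好", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d bytes, UTF-16BE=%d bytes, UTF-32BE=%d bytes%n",
                    s.codePointAt(0),
                    s.getBytes(Charset.forName("UTF-8")).length,
                    s.getBytes(Charset.forName("UTF-16BE")).length,
                    s.getBytes(Charset.forName("UTF-32BE")).length);
        }
    }
}
```

Note how UTF-8 grows from 1 to 4 bytes with the code point, UTF-16 is 2 or 4, and UTF-32 is always 4.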

In fact, I had long misunderstood Unicode: I was under the impression that a Unicode code could be at most 0xFFFF, i.e. that it could represent at most 2^16 characters. After reading Wikipedia carefully I learned that the early UCS-2 scheme really was like that. UCS-2 encodes every character in exactly two bytes, so it can only cover the BMP (the Basic Multilingual Plane, 0x0000-0xFFFF, which contains the most commonly used characters in the world). To encode characters whose Unicode value exceeds 0xFFFF, UCS-2 was extended into the variable-length UTF-16: within the BMP, UTF-16 is identical to UCS-2, while characters outside the BMP are stored in 4 bytes (a surrogate pair).

To make the rest of the discussion easier, let me first explain the notion of a code unit: the basic unit of an encoding is called its code unit. For example, UTF-8's code unit is 1 byte and UTF-16's code unit is 2 bytes. It is awkward to define formally but easy to grasp.

For compatibility with all languages and better cross-platform behavior, Java's String stores the Unicode codes of its characters. It originally used the UCS-2 scheme; later, when the BMP proved too small, Java did not move to UCS-4 (i.e. UTF-32, fixed 4-byte encoding) because of memory cost and compatibility, but adopted the UTF-16 described above, with char as its code unit. This causes some trouble. As long as every character is within the BMP, all is well; but once a character outside the BMP appears, one code unit no longer corresponds to one character. The length method returns the number of code units, not the number of characters; the value charAt returns is likewise a code unit rather than a character; and traversal becomes awkward. Java does provide newer code-point-based methods, but they are still inconvenient and do not allow random access.
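The code-unit trouble described above can be seen in a few lines. A sketch of mine (class name and sample characters are my own choices):

```java
public class CodeUnits {
    public static void main(String[] args) {
        String bmp = "好";             // U+597D, inside the BMP
        String emoji = "\uD83D\uDE00"; // U+1F600, outside the BMP: a surrogate pair

        System.out.println(bmp.length());    // 1 code unit, 1 character
        System.out.println(emoji.length());  // 2 code units, but...
        System.out.println(emoji.codePointCount(0, emoji.length())); // ...only 1 character
        System.out.printf("U+%X%n", emoji.codePointAt(0));           // U+1F600

        // Traversing by code point rather than by char (Java 8+):
        emoji.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
    }
}
```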

In addition, I found that the Java compiler does not process Unicode escapes greater than 0xFFFF in literals, so if you cannot type a non-BMP character but you know its Unicode code, you have to use a clumsy workaround to get it into a String: manually compute the character's UTF-16 encoding (four bytes), treat the first two bytes and the last two bytes each as their own Unicode number, and assign those to the String. The example code is as follows.
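The example code did not survive in this copy of the article, so here is a hedged reconstruction of the workaround, using U+1F600 as a stand-in for whatever code point the author had in mind:

```java
public class SurrogatePair {
    public static void main(String[] args) {
        int cp = 0x1F600; // a code point outside the BMP (illustrative value)

        // UTF-16 surrogate computation: subtract 0x10000, split the remaining
        // 20 bits into two 10-bit halves, and offset them into the surrogate ranges.
        int v = cp - 0x10000;
        char high = (char) (0xD800 + (v >> 10));   // high (lead) surrogate
        char low  = (char) (0xDC00 + (v & 0x3FF)); // low (trail) surrogate

        String s = new String(new char[] { high, low });
        System.out.printf("%04X %04X -> U+%X%n", (int) high, (int) low, s.codePointAt(0));

        // The library does the same computation for you:
        String t = new String(Character.toChars(cp));
        System.out.println(s.equals(t)); // true
    }
}
```

For U+1F600 this yields the pair D83D DE00, which is exactly the `"\uD83D\uDE00"` literal used earlier.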

The Notepad that ships with Windows can save files as "Unicode", which actually means UTF-16. As mentioned above, the characters in common use all lie within the BMP, and within the BMP each character's UTF-16 encoding equals its Unicode value, which is probably why Microsoft calls it Unicode. For instance, I typed "好a" in Notepad and saved it as Unicode big endian. Opening the file in WinHex shows the bytes in the figure below: the first two bytes are the byte order mark, where (FE FF) identifies the byte order as big-endian; then (59 7D) is the Unicode code of "好" and (00 61) is the Unicode code of "a".
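The same bytes can be reproduced from Java. A sketch of mine: in the JDK, the charset named "UTF-16" writes a byte order mark when encoding, while "UTF-16BE" writes raw big-endian units without one.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Dump {
    public static void main(String[] args) {
        String s = "好a"; // 好 is U+597D, a is U+0061

        byte[] withBom = s.getBytes(StandardCharsets.UTF_16);   // BOM + big-endian
        byte[] noBom   = s.getBytes(StandardCharsets.UTF_16BE); // big-endian only

        for (byte b : withBom) System.out.printf("%02X ", b);
        System.out.println(); // FE FF 59 7D 00 61 -- the same bytes Notepad writes
        for (byte b : noBom) System.out.printf("%02X ", b);
        System.out.println(); // 59 7D 00 61
    }
}
```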

Unicode alone does not immediately solve the problem. First, the world is full of data in non-Unicode encodings that we cannot simply discard. Second, Unicode encodings often take more space than ANSI encodings, so from a resource-saving point of view ANSI encodings remain necessary. A conversion mechanism is therefore needed, so that ANSI-encoded data can be converted to Unicode for uniform processing, and Unicode can be converted back to an ANSI encoding when the platform requires it.

The conversion method is straightforward to state. For encodings designed around Unicode, such as the UTF family or ISO-8859-1, the conversion can be done directly by computation on the Unicode value (in practice a lookup table may still be used). For the legacy ANSI encodings, conversion can only be done by table lookup. Microsoft calls such a mapping table a code page and numbers them by encoding: for example, cp936 is the code page for GBK and cp65001 is the code page for UTF-8. The figure below shows the GBK-to-Unicode mapping table found on Microsoft's official website (apparently incomplete). There should likewise be a reverse Unicode-to-GBK mapping table.

With code pages, any conversion becomes easy. To convert from GBK to UTF-8, for example: split the data into characters according to GBK's encoding rules, look each character's bytes up in the GBK code page to obtain its Unicode value, then look that Unicode value up in the UTF-8 code page (or compute the UTF-8 bytes directly). The reverse direction works the same way. Note that UTF-8 is a standard encoding of Unicode and its code page covers every Unicode value, so converting any encoding to UTF-8 and back loses nothing. The conclusion: the crux of any encoding conversion is getting to Unicode via the right code page, so choosing the correct character set (code page) is the key.
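The GBK-to-UTF-8 round trip described above can be sketched in Java, which consults its own internal mapping tables (the equivalent of the code pages) behind new String and getBytes. The class name and the sample string "编码" are my own; I assume the JRE provides the GBK charset, as OpenJDK does.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class GbkToUtf8 {
    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        Charset utf8 = Charset.forName("UTF-8");

        byte[] gbkBytes = "编码".getBytes(gbk); // original GBK data, 2 bytes per character

        // Step 1: GBK bytes -> Unicode (the String), via the GBK mapping.
        String unicode = new String(gbkBytes, gbk);
        // Step 2: Unicode -> UTF-8 bytes, 3 bytes per character here.
        byte[] utf8Bytes = unicode.getBytes(utf8);

        // Because UTF-8 covers all of Unicode, converting back loses nothing.
        byte[] roundTrip = new String(utf8Bytes, utf8).getBytes(gbk);
        System.out.println(Arrays.equals(gbkBytes, roundTrip)); // true
    }
}
```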

Having understood the nature of transcoding loss, I suddenly understood why the JSP framework decodes HTTP request parameters with ISO-8859-1, forcing us to write a statement like this when reading Chinese parameters:

String param = new String(s.getBytes("iso-8859-1"), "UTF-8");

The JSP framework receives the parameter as a raw encoded byte stream. It does not know (or care) what the encoding is, so it does not know which code page to use to convert to Unicode. It therefore picks a scheme that can never lose data: it assumes the bytes are ISO-8859-1 and looks them up in the ISO-8859-1 code page to get a Unicode sequence. Because ISO-8859-1 encodes byte by byte and, unlike ASCII, assigns a character to every value in the 0-255 range, any byte can be found in its code page, and converting the Unicode back yields the original byte stream without loss. European and American programmers who never deal with other languages can use the decoded string directly, and anyone who needs another language only has to convert back to the original bytes and decode them with the actual code page.
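The losslessness claim can be verified over every possible byte value. A small sketch of mine:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        // Every byte value 0x00-0xFF maps to some Unicode code point in
        // ISO-8859-1, so bytes -> String -> bytes is always lossless.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) all[i] = (byte) i;

        String s = new String(all, StandardCharsets.ISO_8859_1);
        byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(all, back)); // true
    }
}
```

The same experiment with ASCII or GBK would fail: bytes with no mapping would be replaced, and the original stream could not be recovered.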

Having covered the concepts around Unicode and character encoding, let's get a feel for them with some Java examples.

3. Case analysis

1. Converting to Unicode -- the String constructor

String's constructors convert data in various encodings into a Unicode sequence (stored internally as UTF-16). The test code below shows how these constructors are used. No non-BMP characters are involved in the examples, so the codePointAt family of methods is not needed.
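The original test code did not survive in this copy, so here is a hedged sketch of what such a demonstration typically looks like; the class name and the sample string "编码" are my own.

```java
import java.nio.charset.Charset;

public class StringConstruct {
    public static void main(String[] args) {
        String original = "编码";

        // The same characters encoded under different schemes...
        byte[] gbkBytes  = original.getBytes(Charset.forName("GBK"));
        byte[] utf8Bytes = original.getBytes(Charset.forName("UTF-8"));

        // ...all decode back to the same Unicode sequence, as long as the
        // constructor is told the right charset.
        System.out.println(new String(gbkBytes,  Charset.forName("GBK")));   // 编码
        System.out.println(new String(utf8Bytes, Charset.forName("UTF-8"))); // 编码

        // Decoding with the wrong charset consults the wrong code page -> mojibake.
        System.out.println(new String(utf8Bytes, Charset.forName("GBK")));
    }
}
```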

The operation results are shown in the figure below

It is clear that once a String holds the Unicode codes, converting to other encodings is easy!

2. Converting from Unicode -- the getBytes method

The getBytes method does the reverse: it encodes the String's internal Unicode sequence into bytes of the charset you specify.

3. Using Unicode as a bridge for encoding conversion

With the two parts above in place, encoding conversion is simple: just combine them. First use new String(...) to convert the original encoded data into a Unicode sequence, then call getBytes to encode it into the target charset.

For example, a very simple GBK-to-Big5 conversion looks like this:
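The conversion code itself is missing from this copy of the article; a minimal reconstruction might look like the following. The sample string "中文" is my own, and I assume the JRE provides the GBK and Big5 charsets (OpenJDK does).

```java
import java.nio.charset.Charset;

public class GbkToBig5 {
    public static void main(String[] args) {
        // Simulated input: "中文" as GBK-encoded bytes.
        byte[] gbkData = "中文".getBytes(Charset.forName("GBK"));

        // GBK bytes -> Unicode (the String) -> Big5 bytes.
        String unicode = new String(gbkData, Charset.forName("GBK"));
        byte[] big5Data = unicode.getBytes(Charset.forName("Big5"));

        // Reading the result back as Big5 shows the same characters.
        System.out.println(new String(big5Data, Charset.forName("Big5"))); // 中文
    }
}
```

Note this only works for characters that exist in both character sets; a character with no Big5 mapping would be replaced by "?" in getBytes.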

4. Encoding loss

As explained above, the JSP framework decodes with the ISO-8859-1 character set. Let's first simulate the restore process with an example; the code is as follows:
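The simulation code is missing from this copy; here is a hedged reconstruction, using "中文" sent as UTF-8 to stand in for the request parameter.

```java
import java.nio.charset.StandardCharsets;

public class ParamRestore {
    public static void main(String[] args) {
        // Simulate what the container sees: the browser sent "中文" as UTF-8 bytes.
        byte[] wire = "中文".getBytes(StandardCharsets.UTF_8);

        // The framework blindly decodes with ISO-8859-1: wrong code page,
        // wrong Unicode, one garbage character per byte.
        String wrong = new String(wire, StandardCharsets.ISO_8859_1);
        System.out.println(wrong); // mojibake, 6 characters

        // But ISO-8859-1 is lossless, so we can recover the original bytes
        // and decode them with the charset that was actually used.
        String restored = new String(wrong.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);
        System.out.println(restored); // 中文
    }
}
```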

The results are as follows. The first output is wrong because the decoding rules were wrong: the wrong code page was consulted and the wrong Unicode was obtained. The second output shows that the ISO-8859-1 code page can still perfectly restore the original data from that wrong Unicode.

That is not the key point, though. If you replace "中" with "中国", the compilation succeeds, and the results are shown in the figure below. Going further, the compilation fails whenever the number of Chinese characters is odd and passes when it is even. Why? Let's analyze it in detail.

Because Java's String uses Unicode internally, the compiler transcodes our string literals at compile time, converting them from the source file's encoding to Unicode (Wikipedia notes the class file actually uses a slightly modified UTF-8). We did not pass an encoding option when compiling, so the compiler decoded the source with the platform default, GBK. Anyone familiar with UTF-8 and GBK knows that a Chinese character generally takes 3 bytes in UTF-8 but only 2 in GBK, which explains why the parity of the character count matters: two characters occupy 6 UTF-8 bytes, which GBK happily decodes into 3 characters, while one character occupies 3 bytes, leaving one byte with no mapping, which is where the question mark in the figure comes from.

To be concrete, the UTF-8 encoding of "中国" in the source file is E4 B8 AD E5 9B BD. The compiler decodes these as GBK, pairing the six bytes into three two-byte sequences and looking them up in cp936 to get three Unicode values, 6D93, E15E and 6D57, which correspond to the three strange characters in the result figure. As the next figure shows, those three Unicode codes are what end up stored in the class file (in UTF-8 form). At run time the JVM holds them as Unicode; only at output time are they encoded again and passed to the terminal, using the encoding agreed with the system locale, so changing the terminal's encoding setting still produces garbage. Note that E15E has no character assigned in the Unicode standard, so it displays differently across platforms and fonts.
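This mojibake is easy to reproduce without a compiler. A sketch of mine; the exact resulting code points depend on the JRE's cp936-based GBK mapping, so the values in the comment follow the article's figures.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // "中国" in UTF-8 is E4 B8 AD E5 9B BD -- six bytes.
        byte[] utf8 = "中国".getBytes(StandardCharsets.UTF_8);

        // Decoding those six bytes as GBK pairs them up, (E4,B8)(AD,E5)(9B,BD),
        // and cp936 maps the pairs to three unrelated code points.
        String garbled = new String(utf8, Charset.forName("GBK"));
        for (int i = 0; i < garbled.length(); i++) {
            System.out.printf("U+%04X ", (int) garbled.charAt(i));
        }
        System.out.println(); // U+6D93 U+E15E U+6D57 per the article's figures
    }
}
```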

Conversely, you can imagine that if the source file were stored in GBK while the compiler were told it was UTF-8, compilation would almost always fail no matter how many Chinese characters you typed, because UTF-8 is highly structured: arbitrary byte combinations will almost never satisfy its encoding rules.
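This can be checked with a strict decoder. Java's lenient default replaces malformed input, but a CharsetDecoder configured to REPORT rejects it outright; the sketch below (class name mine) feeds GBK bytes to a strict UTF-8 decoder.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        // "中国" in GBK: two bytes per character (D6 D0 B9 FA).
        byte[] gbkBytes = "中国".getBytes(Charset.forName("GBK"));

        try {
            // REPORT instead of the default REPLACE: reject byte sequences
            // that do not follow UTF-8's lead/continuation-byte rules.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(gbkBytes));
            System.out.println("decoded cleanly");
        } catch (CharacterCodingException e) {
            System.out.println("malformed UTF-8: " + e);
        }
    }
}
```

Here D6 looks like a UTF-8 lead byte of a two-byte sequence, but the following D0 is not a valid continuation byte (continuation bytes must match 10xxxxxx), so the decoder throws.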

Of course, the most direct way to let the compiler convert to Unicode correctly is simply to tell it the source file's real encoding, e.g. with javac's -encoding option.

4. Summary

Through this round of research and experiments I learned many encoding-related concepts and became familiar with the concrete process of encoding conversion. The ideas carry over to other programming languages, where the implementation principles are much the same. I don't think this class of problem will catch me off guard again.

That is all for this article's examples of ANSI, Unicode, BMP, UTF and other encoding concepts. I hope it helps; interested readers can explore the related topics on this site, and if anything is lacking, please leave a comment. Thanks for your support!
