Detailed explanation of character encoding format in Java
1、 Foreword
While studying Comparable and Comparator, I looked at the compareTo method of the String class. Internally, String stores its contents in a char[] array (this holds up to JDK 8; since JDK 9 the backing store is a byte[], but the char-based view is the same), and compareTo compares the two strings character by character, each character being a char. That raised a question: can a char in Java hold a Chinese character? It turns out that it can, which led me to look into how Java encodes characters.
2、 Java storage format
In Java, the following code obtains the encoding of the character 'Zhang' (张, U+5F20) in various charsets.
Operation results:
Note: the results show that the GBK and GB2312 encodings of the character 'Zhang' are identical, as are the Unicode and UTF-16 encodings, while the ISO-8859-1, Unicode and UTF-8 encodings all differ from one another. So which encoding does the JVM use to store the character 'Zhang'? Let's find out.
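The comparison described in the note can be reproduced with a minimal sketch like the one below. It is not the original listing: the class name CharsetBytesDemo and the helpers hex and encode are made up for illustration, and the sketch assumes the character in question is U+5F20. GBK and GB2312 are optional charsets, so the code checks that the running JVM supports each name before using it.

```java
import java.nio.charset.Charset;

class CharsetBytesDemo {
    // Formats bytes as space-separated uppercase hex, e.g. "E5 BC A0".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes text in the named charset; returns null if this JVM lacks that charset.
    static String encode(String text, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String zhang = "\u5F20"; // the character 'Zhang' (assumed to be U+5F20)
        String[] names = {"GBK", "GB2312", "ISO-8859-1", "UTF-8", "UTF-16", "Unicode"};
        for (String name : names) {
            System.out.println(name + ": " + encode(zhang, name));
        }
    }
}
```

On a standard JDK this prints E5 BC A0 for UTF-8, FE FF 5F 20 for UTF-16 (a big-endian byte order mark followed by 5F 20), and 3F for ISO-8859-1, because that charset cannot represent the character and substitutes '?'.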
3、 Exploring ideas
1. Storage format of the class file constant pool
The test code is as follows
Run javap -verbose on the compiled Test.class; the constant pool looks as follows:
Then open the class file with WinHex; the character 'Zhang' is stored in the constant pool as follows:
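Note: both of the views above show that characters are stored in the class file in UTF-8 format (strictly speaking, the JVM's modified UTF-8, which is byte-for-byte the same as standard UTF-8 for this character).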
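One way to see this byte layout without a hex editor is DataOutputStream.writeUTF, which writes a two-byte length followed by modified UTF-8, the same layout that follows the tag byte of a CONSTANT_Utf8 entry in the constant pool. The sketch below is only an illustration: the class name ModifiedUtf8Demo and its helpers are hypothetical, and it assumes the character is U+5F20.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ModifiedUtf8Demo {
    // Returns what writeUTF emits: a 2-byte big-endian length, then modified UTF-8 bytes.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 00 03 is the length (three bytes); E5 BC A0 is the UTF-8 form of U+5F20.
        System.out.println(hex(writeUtf("\u5F20"))); // prints "00 03 E5 BC A0"
    }
}
```

In a class file the same five bytes would be preceded by the tag byte 01 that marks a CONSTANT_Utf8 entry.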
But is the character still in UTF-8 format at runtime? Let's keep exploring.
2. Checking at runtime
Use the following code
Operation results:
5F20
Note: the result shows that the JVM uses UTF-16 at runtime. UTF-16 normally uses 2 bytes per character; a character that cannot be represented in two bytes is represented with 4 bytes instead. A later article will cover this in more detail. The source code of the Character class also states that char values are UTF-16 encoded, so both approaches lead to the same answer.
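A small sketch can make both cases in the note visible: a character that fits in one 16-bit code unit, and one that needs two. This is not the original listing. The class name Utf16UnitsDemo and the helper codeUnits are hypothetical, and U+20000 is chosen here only as an example of a character outside the two-byte range.

```java
class Utf16UnitsDemo {
    // Lists the UTF-16 code units of a string as uppercase hex, e.g. "D840 DC00".
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String twoBytes = "\u5F20";                                // one code unit
        String fourBytes = new String(Character.toChars(0x20000)); // a surrogate pair

        System.out.println(codeUnits(twoBytes));   // prints "5F20"
        System.out.println(codeUnits(fourBytes));  // prints "D840 DC00"
        System.out.println(fourBytes.length());    // prints 2 (code units, not characters)
        System.out.println(fourBytes.codePointCount(0, fourBytes.length())); // prints 1
    }
}
```

The last two lines show why length() can overcount: it reports UTF-16 code units, while codePointCount reports actual characters.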
3. Can char type store Chinese?
From the exploration above, we know that characters are encoded in UTF-8 in Java class files and in UTF-16 while the JVM is running. In UTF-16 the character 'Zhang' takes two bytes, and a Java char is also two bytes, so a char can store it.
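The size argument above can be checked directly. The following is a minimal sketch, with the hypothetical class name CharCapacityDemo and helper fitsInOneChar, that tests whether a given code point fits in a single char. It assumes the character is U+5F20 and uses U+20000 as an example of a character that does not fit.

```java
class CharCapacityDemo {
    // True if the code point needs only one char, i.e. one 16-bit UTF-16 code unit.
    static boolean fitsInOneChar(int codePoint) {
        return Character.charCount(codePoint) == 1;
    }

    public static void main(String[] args) {
        char zhang = '\u5F20'; // compiles and holds the full character
        System.out.println(Character.BYTES);                 // prints 2
        System.out.println(Integer.toHexString(zhang));      // prints "5f20"
        System.out.println(fitsInOneChar(0x5F20));           // prints true
        System.out.println(fitsInOneChar(0x20000));          // prints false
    }
}
```

So the answer holds for characters in the two-byte range; a character such as U+20000 has to be kept in a String or an int code point rather than a single char.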
4、 Summary
Through the above analysis, we know that:
1. Characters are encoded in UTF-8 in the class file and in UTF-16 while the JVM is running.
2. The char type is two bytes wide, which is enough to store a Chinese character such as 'Zhang'.
While researching this, I read a lot of material about characters and encodings, learned a great deal, and found the topic particularly interesting, so I will share it next. The next article will take a closer look at encoding and decoding in Java. Coming soon.