Java character encoding principles (Power Node Java College)

2019-08-16 • Java

In Java development we often run into garbled text (mojibake). When it happens we get annoyed, and we are reluctant to admit that our own code is at fault. In fact, encoding problems are neither mysterious nor unpredictable: once the essentials of how Java handles encodings are understood, the cause becomes clear.

Let's look at a picture first:

The encoding problem has two sides: inside the JVM and outside the JVM.

1. Java source files are compiled into class files

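Java source files may be saved in a variety of encodings. The compiler does not detect the encoding by itself: it decodes each source file using the encoding given by javac's -encoding option, or the platform default when that option is absent, so it reads the source correctly only when that setting matches how the file was actually saved. In the generated class file, string constants are stored in a Unicode form (a modified UTF-8 in the constant pool), and at run time they become String objects that behave as sequences of UTF-16 code units.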

So suppose a string is defined in Java code, for example String s = "汉字"; (that is, "Chinese character").

Whatever encoding the .java file used before compilation, as long as the compiler read it correctly the result after compilation is the same: a Unicode representation of the string.

2. Encoding inside the JVM

When the JVM loads the class file, it reads the string constants back as Unicode, so the string defined earlier, String s = "汉字"; (that is, "Chinese character"), is represented in memory as Unicode (UTF-16 code units).
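As a small illustration of that in-memory form, the following sketch prints the UTF-16 code units of the string. The class and method names are made up for this example, and the string is written with Unicode escapes so the result does not depend on how the source file itself is saved.

```java
class Utf16CodeUnits {
    // Formats each UTF-16 code unit of s as four hex digits, separated by spaces.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two-character string from the article, spelled with Unicode escapes.
        String s = "\u6C49\u5B57";
        System.out.println(s.length());     // prints 2
        System.out.println(codeUnits(s));   // prints 6C49 5B57
    }
}
```

Whatever encoding the source file was saved in, a correctly compiled string ends up as these same two code units.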

Calling String.getBytes() with no arguments is where the trouble with garbled text starts, because this overload uses the platform default character set to produce the byte array for the string. On the Chinese edition of Windows XP the default is GBK (since JDK 18 the default is UTF-8 on every platform unless it is overridden). If you don't believe it, print the JRE version and the default character set; on such a machine the output is:

Current JRE: 1.8.0_16

Default character set for current JVM: GBK
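The program that produced this output is not preserved in this copy. A minimal sketch that prints the same two lines might look like the following; the class and method names are assumptions, and the values printed depend on the machine and JDK in use.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class ShowDefaultCharset {
    // Builds the two report lines: JRE version and the JVM's default charset.
    static String report() {
        return "Current JRE: " + System.getProperty("java.version")
                + System.lineSeparator()
                + "Default character set for current JVM: "
                + Charset.defaultCharset().name();
    }

    // True when the no-argument getBytes() yields the same bytes as an explicit
    // request for the default charset, i.e. it silently depends on the platform.
    static boolean noArgUsesDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        System.out.println(report());
        System.out.println(noArgUsesDefault("\u6C49\u5B57"));
    }
}
```

Running it on two machines with different defaults is a quick way to see why byte arrays from the no-argument getBytes() are not portable.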

When text passes through several systems and databases and is re-encoded several times, garbled text appears easily if the principle is not understood. Within one system, therefore, the encoding of strings must be unified. For example, method string parameters and IO streams in a Chinese-language system can all use one encoding such as GBK, GB18030, UTF-8, or UTF-16, but the character set chosen should be large enough that every character that may occur can be represented, so that garbled text is avoided. If, say, ASCII were used for all files, conversion in both directions would be impossible, because any character outside ASCII is lost on the way out.
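The point about bidirectional conversion can be checked directly. The sketch below (names are illustrative, not from the original article) encodes a string with a charset, decodes the bytes with the same charset, and reports whether the text survived.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripCheck {
    // True if encoding s with cs and decoding the bytes with cs returns s unchanged.
    // Characters the charset cannot represent are replaced during encoding,
    // so the round trip fails for them.
    static boolean survivesRoundTrip(String s, Charset cs) {
        byte[] bytes = s.getBytes(cs);
        return new String(bytes, cs).equals(s);
    }

    public static void main(String[] args) {
        String s = "\u6C49\u5B57"; // two Chinese characters
        System.out.println(survivesRoundTrip(s, StandardCharsets.US_ASCII)); // prints false
        System.out.println(survivesRoundTrip(s, StandardCharsets.UTF_8));    // prints true
        System.out.println(survivesRoundTrip(s, StandardCharsets.UTF_16));   // prints true
    }
}
```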

Two facts are worth keeping straight here. UTF-8 can encode every Unicode character, so it does accommodate all Chinese characters, and GB18030 likewise covers the whole of Unicode. GBK covers only a subset, so converting UTF-8 or GB18030 text to GBK can produce replaced or garbled characters. The practical trouble comes less from any one encoding than from mixing them: within a single system some people use GBK, some UTF-8, and others GB18030. Pick one encoding for all the source code and declare it to the build, so that the compiler does not report unrecognized characters when the Ant script runs.

For a Chinese-language system, GBK or GB18030 is a common choice (GBK is in effect a subset of GB18030), and GB18030 is the safer of the two because of its larger repertoire. UTF-8 works equally well. What avoids garbled text to the greatest extent is using one encoding consistently.
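What mixing encodings looks like in practice can be sketched as follows: one side writes bytes in one charset and the other side reads them in another. The helper name is hypothetical, and only charsets guaranteed by StandardCharsets are used so the example runs on any JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MixedEncodings {
    // Encodes s with the writer's charset, then decodes with the reader's charset,
    // as happens when two parts of a system disagree about the encoding.
    static String writeThenRead(String s, Charset writer, Charset reader) {
        byte[] bytes = s.getBytes(writer);
        return new String(bytes, reader);
    }

    public static void main(String[] args) {
        String original = "\u6C49\u5B57"; // two Chinese characters

        // Same charset on both sides: the text comes back intact.
        String agreed = writeThenRead(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(agreed.equals(original)); // prints true

        // Different charsets: six UTF-8 bytes are read as six Latin-1 characters.
        String garbled = writeThenRead(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());         // prints 6
        System.out.println(garbled.equals(original)); // prints false
    }
}
```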

3. Encoding of strings in memory

Strings in memory do not come only from class files. Many are read from text files, from databases, or from byte arrays, and that external data is usually not stored as UTF-16. The reason is simple: storage optimization.

We therefore have to deal with all kinds of encodings. Before processing, first establish the encoding of the source, then read the data into memory using exactly that encoding. If the data arrives as a method parameter, the encoding of that parameter must be stated explicitly, because it may have been passed in from, say, a Japanese-language system. Once the encoding is specified, the string can be processed correctly and garbled text is avoided.
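One way to read external bytes with a stated encoding is sketched below, assuming the caller knows the source charset. The readAll helper is invented for this example, and an in-memory stream stands in for a file or network source.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitDecoding {
    // Reads the whole stream, decoding bytes with the charset the caller names
    // rather than with the platform default.
    static String readAll(InputStream in, Charset sourceCharset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, sourceCharset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Pretend these bytes came from another system that writes UTF-16BE.
        byte[] data = "\u6C49\u5B57".getBytes(StandardCharsets.UTF_16BE);
        String text = readAll(new ByteArrayInputStream(data), StandardCharsets.UTF_16BE);
        System.out.println(text.equals("\u6C49\u5B57")); // prints true
    }
}
```

Passing the charset as a parameter makes the source encoding part of the method's contract, so it cannot be left to the platform default.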

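When encoding and decoding a string, call the overloads that take a character set name: String.getBytes(String charsetName) to encode, and new String(byte[] bytes, String charsetName) to decode.

Use these rather than the signatures without a character set name. With these two methods, characters in memory can be re-encoded.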
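Re-encoding with those two calls can be sketched like this. The example uses the overloads that take a Charset object, which exist alongside the charset-name overloads and do not throw UnsupportedEncodingException. The transcode helper is an assumption for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {
    // Decodes the input with the source charset, then encodes the resulting
    // string with the target charset.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from);
        return text.getBytes(to);
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u6C49\u5B57".getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = transcode(utf8, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE);
        System.out.println(utf8.length);  // prints 6
        System.out.println(utf16.length); // prints 4

        // GBK is not guaranteed on every Java runtime, so check before using it.
        if (Charset.isSupported("GBK")) {
            byte[] gbk = transcode(utf8, StandardCharsets.UTF_8, Charset.forName("GBK"));
            System.out.println(gbk.length);
        }
    }
}
```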

