Summary of experience with character encoding in Java

A sequence of bytes has no knowable meaning unless its encoding is specified. Keep this in mind whenever characters are converted to bytes or bytes to characters; otherwise garbled text is likely to follow. Garbled text is, at bottom, the result of encoding and decoding with different charsets. Once this is understood, most garbled-text problems can be solved. The common cases in Java are as follows:

1. The String class has a constructor that takes a byte[], String(byte[] bytes). To specify the encoding, String also provides two overloads:
(1) String(byte[] bytes, Charset charset)
(2) String(byte[] bytes, String charsetName)

2. The getBytes method of the String class, byte[] getBytes(), likewise has two overloads:
(1) byte[] getBytes(Charset charset)
(2) byte[] getBytes(String charsetName)
All variants that do not take an encoding use the platform's default charset, which can be obtained with System.getProperty("file.encoding") or Charset.defaultCharset().

3. PrintStream's print(String s) faces the same issue. Therefore, besides PrintStream(File file), the PrintStream constructors include PrintStream(File file, String csn), which names the charset; without it, the string's characters are converted into bytes according to the platform's default character encoding. DataOutputStream offers no way to specify an encoding at construction, but it provides writeUTF(String str), which always writes modified UTF-8.
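The APIs listed above can be tried side by side. The following is a minimal sketch, not code from the original article: the class name EncodingApiDemo, its helper methods, and the sample text "你好" (written with Unicode escapes so the source file's own encoding does not matter) are assumptions made for illustration. It uses the PrintStream(OutputStream, boolean, String) constructor with an in-memory stream rather than the File-based constructor, so that no file is created.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingApiDemo {
    // Decode bytes with an explicitly named charset (String(byte[], Charset)).
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Bytes produced by PrintStream.print when the charset is named explicitly.
    public static byte[] printWith(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charsetName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    // Bytes produced by DataOutputStream.writeUTF:
    // a 2-byte length prefix followed by modified UTF-8.
    public static byte[] writeUtf(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u4f60\u597d"; // hypothetical sample input

        // The default charset varies by platform and JVM settings.
        System.out.println("default charset: " + Charset.defaultCharset());

        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);                                      // 6
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text)); // true
        System.out.println(printWith(text, "UTF-8").length);                  // 6
        System.out.println(writeUtf(text).length);                            // 8
    }
}
```

Passing the charset at every call site keeps the result independent of whatever Charset.defaultCharset() happens to return on the machine running the code.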

To begin, here is an example that shows why the encoding must be specified. Suppose a web page declares its encoding as UTF-8 with <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />, and the page contains a form that is submitted to a servlet. The characters the user enters are then converted to bytes using the declared encoding. For example, if the user enters "hello" and the page is UTF-8, the message is as follows:

In this message, each Chinese character takes 3 bytes; see the UTF-8 encoding rules for the reason. If the page specifies GBK instead, the message is different:
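The difference between the two messages lies in the bytes of the submitted value. As a rough illustration, assuming a field value of "你好" (a hypothetical input, not necessarily the one used in the original example), the sketch below uses URLEncoder to show how the same two characters appear in form-encoded data under each charset. The class name FormEncodingDemo and the helper formEncode are invented for this sketch, and it assumes the JDK in use ships the GBK charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormEncodingDemo {
    // Percent-encode a form value using the named charset.
    public static String formEncode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String value = "\u4f60\u597d"; // hypothetical form input

        // UTF-8: 3 bytes per character here.
        System.out.println(formEncode(value, "UTF-8")); // %E4%BD%A0%E5%A5%BD

        // GBK: 2 bytes per character here.
        System.out.println(formEncode(value, "GBK"));   // %C4%E3%BA%C3
    }
}
```

The receiver sees only the bytes; nothing in them says which of the two charsets produced them.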

So on the servlet side, when request.getParameter is called, something like String s = new String(bytes, request.getCharacterEncoding()) must happen internally. If the request encoding has not been set, getCharacterEncoding() returns null and a default encoding is used instead (ISO-8859-1 according to the Servlet specification, or whatever default the container or platform applies, such as GBK), so Chinese text becomes garbled whenever that default differs from the page's encoding. Therefore, to avoid garbled text, JSP sites generally install a filter so that all pages and the server use one unified encoding, set through request.setCharacterEncoding and response.setCharacterEncoding.
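The effect of decoding request bytes with the wrong charset can be reproduced without a servlet container. The sketch below is only a simulation under stated assumptions: decodeParameter is a hypothetical stand-in for what a container does internally, ISO-8859-1 is assumed as the fallback charset, and "你好" is a made-up input. In a real servlet the usual remedy is to call request.setCharacterEncoding("UTF-8") before reading any parameter; the recover method shows a commonly used workaround for text that has already been decoded as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ParameterDecodingDemo {
    // Hypothetical stand-in for the container: turn raw body bytes into a
    // parameter string using the request's character encoding.
    public static String decodeParameter(byte[] raw, Charset requestEncoding) {
        return new String(raw, requestEncoding);
    }

    // Undo a wrong ISO-8859-1 decode of UTF-8 bytes. This works because
    // ISO-8859-1 maps every byte value to exactly one char and back.
    public static String recover(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String typed = "\u4f60\u597d"; // hypothetical user input
        byte[] raw = typed.getBytes(StandardCharsets.UTF_8); // browser side, UTF-8 page

        // Encoding set correctly on the server: the text survives.
        String good = decodeParameter(raw, StandardCharsets.UTF_8);
        System.out.println(good.equals(typed));  // true

        // Encoding left at the assumed ISO-8859-1 default: one char per byte.
        String bad = decodeParameter(raw, StandardCharsets.ISO_8859_1);
        System.out.println(bad.equals(typed));   // false
        System.out.println(bad.length());        // 6

        System.out.println(recover(bad).equals(typed)); // true
    }
}
```

The recovery trick only works when the wrong charset is lossless, as ISO-8859-1 is; setting the request encoding up front avoids the problem entirely.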

Internally, a Java String is a char[], where each char is a 16-bit UTF-16 code unit. Therefore, whenever characters and strings are converted to bytes and written to files or the network, or a byte stream read from files or the network is restored to meaningful characters, you need to know which encoding is involved.
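One consequence of the UTF-16 representation is that a String's length counts 16-bit code units, not characters, and not the bytes it will occupy once encoded. A small sketch, using a hypothetical helper named describe and two sample characters chosen for illustration (U+4F60 from the Basic Multilingual Plane and U+1D11E, which needs a surrogate pair):

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitDemo {
    // Summarize how one string measures in code units, code points and UTF-8 bytes.
    public static String describe(String s) {
        return String.format("units=%d codePoints=%d utf8Bytes=%d",
                s.length(),
                s.codePointCount(0, s.length()),
                s.getBytes(StandardCharsets.UTF_8).length);
    }

    public static void main(String[] args) {
        // U+4F60: a single UTF-16 code unit.
        System.out.println(describe("\u4f60"));       // units=1 codePoints=1 utf8Bytes=3

        // U+1D11E: stored as the surrogate pair D834 DD1E.
        System.out.println(describe("\uD834\uDD1E")); // units=2 codePoints=1 utf8Bytes=4
    }
}
```

The in-memory size and the encoded size are separate questions, which is why the target charset has to be named at the point of conversion.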

Some lessons learned:

1. The String class always stores its content as Unicode (UTF-16).
2. Be careful with String.getBytes(): without a charset argument it depends on the JVM's default charset, which is generally UTF-8 on Linux and GBK on Chinese-locale Windows. (To change the JVM's default charset, start the JVM with the option -Dfile.encoding=UTF-8.) To be safe, always pass the charset explicitly, for example: String s; s.getBytes("UTF-8").
3. The Charset class is very convenient.
(1) Charset.encode encodes: it converts a string to bytes in the charset you specify and returns them as a ByteBuffer.
(2) Charset.decode decodes: it converts a ByteBuffer of bytes in the charset you specify into characters, returned as a CharBuffer that can be turned into a String.

Examples are as follows:
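The original example code is not available here, so what follows is a stand-in sketch of Charset.encode and Charset.decode rather than the author's own listing. The class name CharsetDemo, its two helpers, and the sample text are assumptions made for illustration. Note that both methods work with NIO buffers, so the helpers convert to and from byte[] and String.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Encode a string with the given charset and copy the result into a byte[].
    public static byte[] encode(String text, Charset charset) {
        ByteBuffer buffer = charset.encode(text);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    // Decode a byte[] with the given charset into a String.
    public static String decode(byte[] bytes, Charset charset) {
        CharBuffer chars = charset.decode(ByteBuffer.wrap(bytes));
        return chars.toString();
    }

    public static void main(String[] args) {
        String text = "\u4f60\u597d"; // hypothetical sample text
        Charset utf8 = StandardCharsets.UTF_8;

        byte[] bytes = encode(text, utf8);
        System.out.println(bytes.length);                      // 6
        System.out.println(decode(bytes, utf8).equals(text));  // true
    }
}
```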
