Java character set encoding

1. General

This article covers the following topics: encoding basics, Java, system software, URLs, tool software, and so on.

The examples below use the word "中文" ("Chinese"). Looking it up in the code tables gives the GB2312 code "d6d0 cec4", the Unicode code "4e2d 6587", and the UTF-8 code "e4b8ad e69687". Note that these two characters have no iso8859-1 code, but they can still be "represented" through iso8859-1, as explained below.

2. Basic knowledge of coding

One of the earliest encodings is iso8859-1, an 8-bit extension of ASCII. To make it possible to represent more languages, many other standard encodings gradually emerged. The important ones are as follows.

2.1. iso8859-1

It is a single-byte encoding, so it can represent at most 256 code values (0-255). It is used for English and other Western European languages. For example, the code of the letter 'a' is 0x61 = 97.

Obviously, the character range of iso8859-1 is too narrow to represent Chinese characters. However, because it is a single-byte encoding, which matches the computer's most basic unit of representation, data is still often carried as iso8859-1, and many protocols use it by default. For example, although "中文" has no iso8859-1 code, its GB2312 code "d6d0 cec4" can be split into four bytes, "d6 d0 ce c4", each treated as one iso8859-1 character (the data is stored byte by byte anyway). With UTF-8 it is six bytes: "e4 b8 ad e6 96 87". Obviously, this kind of representation only makes sense on top of another encoding.

2.2. GB2312/GBK

These are the Chinese national standard encodings, designed specifically for Chinese characters. A Chinese character takes two bytes, while English letters take one byte with the same values as in iso8859-1 (more precisely, the ASCII range 0x00-0x7f is compatible; bytes with the high bit set are used for Chinese characters). GBK can represent both traditional and simplified characters, while GB2312 can only represent simplified characters. GBK is compatible with GB2312.

2.3. Unicode

This is the most unified encoding and can represent the characters of all languages. In the form discussed here it is a fixed-length encoding of two bytes per character (UCS-2) or four bytes (UCS-4), English letters included. It is therefore not byte-compatible with iso8859-1 or with any other encoding. However, for characters that iso8859-1 also covers, the Unicode code simply adds a 0 byte in front; for example, the letter 'a' is "00 61".

Fixed-length encoding is convenient for computer processing (note that GB2312/GBK is not fixed length), and Unicode can represent all characters. Therefore many programs use Unicode internally, Java among them (a Java char is a 16-bit UTF-16 code unit).

2.4. UTF

Unicode is incompatible with iso8859-1 and tends to take more space: even an English letter needs two bytes. This makes it inconvenient for transmission and storage, and the UTF encodings were created to address that. The one discussed here, UTF-8, is byte-compatible with ASCII (the 0x00-0x7f range of iso8859-1) and can represent the characters of all languages. It is a variable-length encoding: the original design allowed 1 to 6 bytes per character, and the current standard (RFC 3629) limits it to 1 to 4. Its byte patterns also provide a simple validity check. Generally, English letters take one byte and Chinese characters take three.

Note that UTF-8 saves space only in comparison with Unicode. If the text is known to be Chinese, GB2312/GBK is the most economical. On the other hand, even though UTF-8 uses three bytes per Chinese character, a Chinese web page is usually still smaller in UTF-8 than in Unicode, because web pages contain a lot of English characters (markup).
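The size comparison above can be checked with a small sketch. The class name EncodingSizes and the sample strings are made up for illustration, UTF-16BE stands in for the two-byte "Unicode" form, and the code assumes the JVM ships the GBK charset (typical for a full JDK, but not guaranteed).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    /** Number of bytes the text occupies in the given charset. */
    static int size(String text, Charset cs) {
        return text.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");    // assumption: GBK is available in this JVM
        String chinese = "\u4e2d\u6587";         // the two example characters
        String page = "<p>\u4e2d\u6587</p>";     // markup-heavy text, like a web page

        for (String s : new String[] {"a", chinese, page}) {
            System.out.println(s.length() + " chars: UTF-16BE=" + size(s, StandardCharsets.UTF_16BE)
                    + " UTF-8=" + size(s, StandardCharsets.UTF_8)
                    + " GBK=" + size(s, gbk));
        }
    }
}
```

For the pure Chinese string GBK and UTF-16BE both need 4 bytes and UTF-8 needs 6; once markup is mixed in, UTF-8 (13 bytes) beats UTF-16BE (18 bytes), and GBK (11 bytes) remains the smallest.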

3. Java character processing

In Java application software, character set coding is involved in many places. Some places need to be set correctly, and some places need to be processed to a certain extent.

3.1. getBytes(charset)

This is a standard Java string method. It encodes the characters of the string according to charset and returns the result as bytes. Note that strings are always stored as Unicode in Java memory. For example, "中文" is normally stored as "4e2d 6587". If charset is "GBK", it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned. If charset is "utf8", the result is "e4 b8 ad e6 96 87". If charset is "iso8859-1", the characters cannot be encoded, and "3f 3f" (two question marks) is returned.
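The three cases can be reproduced with a minimal sketch. GetBytesDemo and its hex helper are illustrative names, not part of any library, and the GBK case assumes that charset is available in the running JVM.

```java
import java.nio.charset.Charset;

class GetBytesDemo {
    /** Formats bytes as lowercase hex separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Encodes the text with the named charset and returns the bytes as hex. */
    static String encode(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587"; // stored in memory as 4e2d 6587
        System.out.println(encode(s, "GBK"));        // d6 d0 ce c4
        System.out.println(encode(s, "UTF-8"));      // e4 b8 ad e6 96 87
        System.out.println(encode(s, "ISO-8859-1")); // 3f 3f (unencodable, replaced by '?')
    }
}
```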

3.2. new String(bytes, charset)

This is another standard Java string operation. It is the reverse of the previous one: the byte array is interpreted according to charset and converted to Unicode for storage. Continuing the getBytes example above, "GBK" and "utf8" both recover the correct result "4e2d 6587", but the iso8859-1 bytes become "003f 003f" (two question marks).

Because utf8 can encode all characters, new String(str.getBytes("utf8"), "utf8").equals(str) holds, that is, the conversion is completely reversible.
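A short sketch of the round trip, with RoundTrip as a hypothetical helper class: UTF-8 gives the original string back, while iso8859-1 loses the characters for good.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    /** Encodes the string with the charset and decodes the bytes with the same charset. */
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587";
        System.out.println(roundTrip(s, StandardCharsets.UTF_8).equals(s));      // true
        System.out.println(roundTrip(s, StandardCharsets.ISO_8859_1).equals(s)); // false
        System.out.println(roundTrip(s, StandardCharsets.ISO_8859_1));           // ??
    }
}
```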

3.3. setCharacterEncoding()

This method sets the encoding of an HTTP request or response.

For a request, it specifies the encoding of the submitted content. Once it is set, getParameter() returns the correct string directly. If it is not set, iso8859-1 is used by default and further processing is needed; see "Form input" below. Note that setCharacterEncoding() must be called before any call to getParameter(). The Javadoc says: "This method must be called prior to reading request parameters or reading input using getReader()." Moreover, the setting only applies to the POST method, not to GET. The likely reason is that the first getParameter() call parses all submitted content using the encoding in effect at that moment, and later calls do not parse again, so a later setCharacterEncoding() has no effect. For a form submitted with GET, the content is in the URL, which has already been parsed with the default encoding at the start, so setCharacterEncoding() is naturally ineffective.

For a response, it specifies the encoding of the output content. The setting is also passed to the browser, telling it which encoding the output uses.
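The "parse once, then cache" explanation for request parameters can be modelled with a toy class. FakeRequest is a hypothetical stand-in written for this illustration; it is not the servlet API and not how any particular container is implemented. It assumes Java 10 or later for URLDecoder.decode(String, Charset) and a JVM that provides GBK.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of a request whose parameters are parsed lazily and cached. */
class FakeRequest {
    private final String body;                               // raw percent-encoded form body
    private Charset encoding = StandardCharsets.ISO_8859_1;  // default when nothing is set
    private Map<String, String> params;                      // filled on first getParameter()

    FakeRequest(String body) {
        this.body = body;
    }

    void setCharacterEncoding(String name) {
        encoding = Charset.forName(name);
    }

    String getParameter(String key) {
        if (params == null) { // the first call parses everything with the current encoding
            params = new HashMap<>();
            for (String pair : body.split("&")) {
                String[] kv = pair.split("=", 2);
                String value = kv.length > 1 ? URLDecoder.decode(kv[1], encoding) : "";
                params.put(URLDecoder.decode(kv[0], encoding), value);
            }
        }
        return params.get(key);
    }

    public static void main(String[] args) {
        FakeRequest early = new FakeRequest("a=%D6%D0%CE%C4");
        early.setCharacterEncoding("GBK");               // set before the first read
        System.out.println(early.getParameter("a").length()); // 2 characters, decoded correctly

        FakeRequest late = new FakeRequest("a=%D6%D0%CE%C4");
        late.getParameter("a");                          // parsed as ISO-8859-1 and cached
        late.setCharacterEncoding("GBK");                // too late, no effect
        System.out.println(late.getParameter("a").length());  // still 4 characters
    }
}
```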

3.4. Processing flow

Here are two representative examples to illustrate how Java handles coding related problems.

3.4.1. Form input

User input → *(gbk: d6d0 cec4) → browser → *(gbk: d6d0 cec4) → web server → iso8859-1 (00d6 00d0 00ce 00c4) → class. The class has to process the value: getBytes("iso8859-1") gives d6 d0 ce c4, new String(bytes, "GBK") interprets them as d6d0 cec4, and the Unicode in memory is 4e2d 6587.

- The encoding of the user's input depends on the encoding specified on the page and on the user's operating system, so it is uncertain. The example above uses GBK.

- From the browser to the web server, the form can specify the character set used for submission; otherwise the encoding specified on the page is used. If parameters are entered directly in the URL with "?", the encoding is often that of the operating system, because no page is involved. The example above still uses GBK.

- The web server receives a byte stream. By default, getParameter() decodes it as iso8859-1, which gives an incorrect result that needs further processing. However, if the encoding is set in advance (through request.setCharacterEncoding()), the correct result is obtained directly.

- It is a good habit to specify the encoding in the page; otherwise you lose control over which encoding is used.
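The repair step for a value that the server decoded as iso8859-1 can be sketched as follows. FormRepair and repair are illustrative names, and GBK is assumed to be the encoding the browser really used (and to be available in the JVM).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FormRepair {
    /**
     * Recovers a string that was decoded as ISO-8859-1 although the bytes
     * were really in another encoding. ISO-8859-1 maps every byte to one
     * character, so getBytes() returns the original bytes unchanged.
     */
    static String repair(String misdecoded, Charset actual) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        String fromServer = "\u00d6\u00d0\u00ce\u00c4"; // what getParameter() returned
        String fixed = repair(fromServer, Charset.forName("GBK"));
        System.out.println(fixed.equals("\u4e2d\u6587")); // true
    }
}
```

The trick only works because iso8859-1 is lossless for bytes; the same round trip through a multi-byte charset could destroy data.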

3.4.2. File compilation

Assume the file is saved in GBK. There are two likely encodings at compile time: GBK or iso8859-1. The former is the default on Chinese Windows and the latter is the default on Linux. The encoding can also be specified explicitly when compiling.

jsp → *(gbk: d6d0 cec4) → java file → *(gbk: d6d0 cec4) → compiler read → unicode (gbk: 4e2d 6587; iso8859-1: 00d6 00d0 00ce 00c4) → compiler write → utf (gbk: e4b8ad e69687; iso8859-1: *) → compiled file → unicode (gbk: 4e2d 6587; iso8859-1: 00d6 00d0 00ce 00c4) → class. Therefore, saving with GBK and compiling with iso8859-1 gives an incorrect result.

class → unicode (4e2d 6587) → System.out / jsp.out → gbk (d6d0 cec4) → OS console / browser.

- Files can be saved in a variety of encodings. On Chinese Windows, the default is ANSI/GBK.

- When the compiler reads a file, it needs to know the file's encoding. If none is specified, the system default is used. Ordinary Java source files are usually saved in the system default encoding, so compiling them causes no problem. JSP files, however, may be edited and saved on Chinese Windows and then deployed, compiled and run on English Linux, which does cause problems. Therefore, specify the encoding in the JSP file with pageEncoding.

- During compilation, Java source is converted to Unicode for processing, and finally written out in UTF encoding.

- When the system outputs characters, it encodes them with the specified encoding. On Chinese Windows, System.out uses GBK. For the response (browser), the contentType specified in the JSP file header is used, or the encoding set directly on the response; the browser is told the encoding of the page at the same time. If nothing is specified, iso8859-1 is used. For Chinese, the encoding of the output sent to the browser should always be specified.

- When the browser displays a page, it first uses the encoding specified in the response (the contentType in the JSP file header ends up in the response). If none is specified, it uses the content type given in the page's meta element.
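What happens when a GBK file is read with the wrong encoding can be simulated without running a compiler. This is only a sketch: CompileEncodingDemo is a made-up class, and plain UTF-8 is used for the "compiler write" step (class files actually use a modified UTF-8, which gives the same bytes for these characters).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CompileEncodingDemo {
    /** What the compiler sees after reading the source bytes with an assumed encoding. */
    static String readSource(byte[] fileBytes, Charset assumed) {
        return new String(fileBytes, assumed);
    }

    /** Bytes written for a string constant (plain UTF-8 in this sketch). */
    static byte[] writeConstant(String literal) {
        return literal.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The two example characters as saved in a GBK file.
        byte[] saved = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};

        String right = readSource(saved, Charset.forName("GBK")); // assumes GBK is available
        String wrong = readSource(saved, StandardCharsets.ISO_8859_1);

        System.out.println(right.length() + " chars, " + writeConstant(right).length + " bytes");
        System.out.println(wrong.length() + " chars, " + writeConstant(wrong).length + " bytes");
    }
}
```

Read as GBK, the constant has 2 characters and is written as 6 bytes; read as iso8859-1, it has 4 unrelated characters and is written as 8 bytes.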

3.5. Several settings

For web applications, the settings or functions related to coding are as follows.

3.5.1. JSP compilation

Specifies the encoding in which the file is stored. This setting should be placed at the beginning of the file. For example: <%@ page pageEncoding="GBK" %>. For ordinary Java source files, the encoding can be specified when compiling.

3.5.2. JSP output

Specifies the encoding used to send the page to the browser. This setting should also be placed at the beginning of the file. For example: <%@ page contentType="text/html; charset=GBK" %>. This setting is equivalent to response.setCharacterEncoding("GBK").

3.5.3. Meta settings

Specifies the encoding used by the web page. This is especially useful for static pages, which cannot use the JSP settings and cannot call response.setCharacterEncoding(). For example: <meta http-equiv="Content-Type" content="text/html; charset=GBK" />

If both the JSP output setting and the meta setting are used, the JSP setting takes precedence, because it is reflected directly in the response.

Note that Apache has a setting that assigns an encoding to pages that do not declare one. It works the same way as the JSP setting, so it overrides the meta setting in static pages. It is therefore advisable to turn this setting off.

3.5.4. Form settings

The encoding used when the browser submits a form can be specified. For example: <form accept-charset="GB2312">. Generally this setting is not needed; the browser uses the encoding of the page.

4. System software

Several related system software are discussed below.

4.1. MySQL database

To support multiple languages, the database encoding should be set to UTF or Unicode; UTF is more suitable for storage. However, if the Chinese data contains few English letters, Unicode is more suitable.

The database encoding can be set in the MySQL configuration file, for example default-character-set=utf8. It can also be set in the database connection URL, for example: useUnicode=true&characterEncoding=UTF-8. The two should be consistent. With newer MySQL versions the setting can be omitted from the connection URL, but if it is present it must not be wrong.

4.2. Apache

Apache's encoding-related configuration is in httpd.conf, for example AddDefaultCharset UTF-8. As mentioned earlier, this setting forces the encoding of all static pages to UTF-8. It is best to turn it off.

In addition, Apache has a separate module for handling response headers, which may also set the encoding.

4.3. Linux default encoding

The Linux default encoding mentioned here refers to the environment variables in effect at run time. The two important ones are LC_ALL and LANG. The default encoding affects the behavior of Java's URLEncoder, as described below.

It is recommended to set them to "zh_CN.UTF-8".
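To see which default a running JVM actually picked up, a quick check like the following can help. DefaultCharsetCheck is an illustrative name. Note as a caveat that on Java 18 and later the default charset is UTF-8 regardless of the locale variables (unless overridden with the file.encoding property), so the dependence on LC_ALL and LANG described here applies to older JVMs.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    /** Describes the default charset the JVM is using. */
    static String describe() {
        return "default charset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        // Run with different LANG / LC_ALL values to compare the output.
        System.out.println(describe());
    }
}
```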

4.4. Other

To support Chinese file names, Linux should specify the character set when mounting a disk, for example: mount /dev/hda5 /mnt/hda5/ -t ntfs -o iocharset=gb2312.

In addition, as mentioned earlier, data submitted with the GET method is not affected by request.setCharacterEncoding(), but its character set can be specified in Tomcat's configuration file, server.xml, for example: <Connector URIEncoding="GBK"/>. This setting applies to all requests uniformly rather than to specific pages, and it may not match the encoding the browser actually used, so the result is sometimes not what was expected.

5. URL address

Including Chinese characters in a URL is troublesome. Submitting a form with the GET method was described earlier; with GET, the parameters are part of the URL.

5.1. URL encoding

The browser automatically encodes certain special characters in a URL. Besides characters such as "/?&", these include non-ASCII Unicode characters such as Chinese characters, and the encoding in that case is peculiar.

IE has an option "Always send URLs as UTF-8". When this option is on, IE encodes special characters as UTF-8 and then URL-encodes them. When the option is off, the default encoding ("GBK") is used and no URL encoding is performed. However, the parameters after the "?" are never encoded this way; the UTF-8 option has no effect on them. For example, for "中文.html?a=中文", when the UTF-8 option is on the link "%e4%b8%ad%e6%96%87.html?a=\xd6\xd0\xce\xc4" is sent; when it is off, the link "\xd6\xd0\xce\xc4.html?a=\xd6\xd0\xce\xc4" is sent (the GBK bytes of "中文", sent raw). Note that "中文" in the path of the latter takes only 4 bytes, while in the former it takes 18 bytes, mainly because of URL encoding.

When the web server (Tomcat) receives the link, it URL-decodes it, that is, it removes the "%" escapes, and interprets the bytes as iso8859-1 (as described above, URIEncoding can be used to select another encoding). The results for the two examples are "\u00e4\u00b8\u00ad\u00e6\u0096\u0087.html?a=\u00d6\u00d0\u00ce\u00c4" and "\u00d6\u00d0\u00ce\u00c4.html?a=\u00d6\u00d0\u00ce\u00c4". Note that "中文" in the path of the former has been restored to six characters. "\u" here denotes a Unicode character.
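The server-side decoding step can be imitated with java.net.URLDecoder. This is a sketch, not Tomcat's actual code: UrlDecodeDemo is a hypothetical class, and URLDecoder.decode(String, Charset) requires Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UrlDecodeDemo {
    /** Removes the percent escapes and interprets the bytes with the given charset. */
    static String decode(String encoded, Charset cs) {
        return URLDecoder.decode(encoded, cs);
    }

    public static void main(String[] args) {
        String sent = "%E4%B8%AD%E6%96%87"; // 18 bytes on the wire for two characters

        String asLatin1 = decode(sent, StandardCharsets.ISO_8859_1);
        String asUtf8 = decode(sent, StandardCharsets.UTF_8);

        System.out.println(sent.length());      // 18
        System.out.println(asLatin1.length());  // 6: one character per byte
        System.out.println(asUtf8.length());    // 2: the original two characters
    }
}
```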

Therefore, depending on client settings, the same link produces different results on the server. Many people have run into this problem, and there is no really good solution. Some websites therefore advise users to turn off the UTF-8 option. A better approach is described below.

5.2. Rewrite

Those familiar with Apache know that it has a powerful rewrite module; its features are not described here. Note that the module automatically URL-decodes the request (removes the "%" escapes), which takes over part of the work of the web server (Tomcat) described above. Some documentation says the [NE] flag can turn this off, but my experiments did not succeed, possibly because of the version (I use Apache 2.0.54). In addition, when a parameter contains symbols such as "?" or "&", this decoding prevents the system from getting correct results.

Rewrite itself appears to work purely on bytes, regardless of string encoding, so it does not introduce encoding problems of its own.

5.3. URLEncoder.encode()

This is the URL encoding function provided by Java itself. It does much the same work as the browser does when the UTF-8 option above is on. Note that the form of this method that takes no encoding is deprecated; an encoding should always be passed.

When no encoding is specified, the method uses the system default encoding, which makes the program's results unpredictable. For example, for "中文", when the system default encoding is "GB2312" the result is "%D6%D0%CE%C4", while with the default "UTF-8" the result is "%E4%B8%AD%E6%96%87", which is hard for later processing to handle. The system default encoding here is determined by the environment variables LC_ALL and LANG in effect when Tomcat is started. I once saw garbled text appear after Tomcat was restarted, and it turned out that these two environment variables had been changed.

It is recommended to specify "UTF-8" everywhere; existing programs may need to be modified accordingly.
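Passing the encoding explicitly removes the dependence on the environment. In this sketch UrlEncodeDemo is an illustrative class, the Charset overload of URLEncoder.encode assumes Java 10 or later (older code would pass the charset name as a string), and GB2312 is assumed to be available in the JVM.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;

class UrlEncodeDemo {
    /** URL-encodes the text using an explicitly named charset. */
    static String encode(String text, String charsetName) {
        return URLEncoder.encode(text, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587";
        System.out.println(encode(s, "GB2312")); // %D6%D0%CE%C4
        System.out.println(encode(s, "UTF-8"));  // %E4%B8%AD%E6%96%87
    }
}
```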

5.4. A solution

As mentioned above, because of different browser settings the web server receives different content for the same link, and the software cannot tell the difference, so the protocol has a real gap here.

In practice, we should neither hope that every user has IE's UTF-8 option turned on, nor bluntly ask users to change their IE settings; users cannot remember the settings each web server wants. The solution is to make the program smarter: decide from the content whether the encoding is UTF-8.

Fortunately, UTF-8 is quite regular, so the program can check whether the transmitted link content is valid UTF-8. If so, it is processed as UTF-8; if not, the customer's default encoding (such as "GBK") is used. The following example checks whether a byte array is UTF-8. It is easy to follow once you know the encoding rules.

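public static boolean isValidUtf8(byte[] b, int aMaxCount) {
    int lLen = b.length, lCharCount = 0;
    // examine at most aMaxCount characters
    for (int i = 0; i < lLen && lCharCount < aMaxCount; ++lCharCount) {
        byte lByte = b[i++]; // advance now, so that i points at the first continuation byte
        if (lByte >= 0) continue; // >= 0 is plain ASCII
        // a lead byte must lie in 0xc0..0xfd (all negative as signed bytes, so the order is preserved)
        if (lByte < (byte) 0xc0 || lByte > (byte) 0xfd) return false;
        // number of continuation bytes that must follow this lead byte
        int lCount = lByte >= (byte) 0xfc ? 5 : lByte >= (byte) 0xf8 ? 4
                : lByte >= (byte) 0xf0 ? 3 : lByte >= (byte) 0xe0 ? 2 : 1;
        if (i + lCount > lLen) return false;
        // every continuation byte must lie in 0x80..0xbf
        for (int j = 0; j < lCount; ++j, ++i) if (b[i] >= (byte) 0xc0) return false;
    }
    return true;
}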

Accordingly, an example using the above method is as follows:

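// requires: import java.io.UnsupportedEncodingException;
public static String getUrlParam(String aStr, String aDefaultCharset)
        throws UnsupportedEncodingException {
    if (aStr == null) return null;
    // recover the raw bytes that the server decoded as ISO-8859-1
    byte[] lBytes = aStr.getBytes("ISO-8859-1");
    // decode as UTF-8 when the bytes look like UTF-8, otherwise use the default charset
    return new String(lBytes,
            StringUtil.isValidUtf8(lBytes, lBytes.length) ? "utf8" : aDefaultCharset);
}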

However, this method also has defects in the following two aspects:

- It does not identify the user's default encoding. That can be guessed from the language of the request, but the guess is not always right, because users sometimes enter Korean or other characters.

- Text may be misjudged as UTF-8. An example is the word "学习" ("learn"), whose GBK code is "\xd1\xa7\xcf\xb0". The isValidUtf8 method above returns true for it. Stricter checks are possible, but they are unlikely to help much.
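The limit of stricter checking can be demonstrated with the JDK's own strict UTF-8 decoder, used here as a swapped-in alternative to the hand-written isValidUtf8. StrictUtf8Check is an illustrative class name. Even this decoder accepts the GBK bytes d1 a7 cf b0, because they happen to form two well-formed two-byte UTF-8 sequences.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gbkChinese = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};
        byte[] gbkLearn = {(byte) 0xd1, (byte) 0xa7, (byte) 0xcf, (byte) 0xb0};
        System.out.println(isStrictUtf8(gbkChinese)); // false: correctly rejected
        System.out.println(isStrictUtf8(gbkLearn));   // true: misjudged as UTF-8
    }
}
```

Byte-level validation alone cannot separate the two encodings in such cases; only extra context (the page encoding, the user's language, a dictionary) could.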

One example suggests that Google ran into the same problem and adopted a similar approach. If you enter "http://www.google.com/search?hl=zh-CN&newwindow=1&q=学习" in the address bar, Google does not recognize the word correctly, while other Chinese words are generally recognized normally.

Finally, if you do not use rewrite rules, or if data is submitted through a form, you will not necessarily run into these problems, because the desired encoding can be specified at submission time. Chinese file names, however, do cause problems and should be used with caution.

6. Others

Some other coding related issues are described below.

6.1. SecureCRT

Besides the browser and the console, some client programs are also affected by encoding. For example, when using SecureCRT to connect to Linux, keep SecureCRT's display encoding (each session can have its own setting) consistent with the Linux encoding environment variables. Otherwise some of the help text you see may be garbled.

In addition, MySQL has its own encoding settings, which should also be consistent with SecureCRT's display encoding. Otherwise, when SQL statements are executed through SecureCRT, Chinese characters may not be handled correctly and query results will be garbled.

For UTF-8 files, many editors (such as Notepad) add three invisible marker bytes (the byte order mark, EF BB BF) at the beginning of the file. If the file is used as input to MySQL, these three bytes must be removed (saving the file with vi on Linux can remove them). An interesting phenomenon: on Chinese Windows, create a new txt file, open it with Notepad, type the two characters "联通", save, and open it again. The two characters are gone, leaving only a small black dot.
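Removing the three marker bytes programmatically is straightforward. The following is a minimal sketch; BomStripper and stripUtf8Bom are names chosen for this illustration.

```java
import java.util.Arrays;

class BomStripper {
    /** Returns the data without a leading UTF-8 byte order mark (EF BB BF), if present. */
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xff) == 0xef
                && (data[1] & 0xff) == 0xbb
                && (data[2] & 0xff) == 0xbf;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xef, (byte) 0xbb, (byte) 0xbf, 'S', 'Q', 'L'};
        System.out.println(stripUtf8Bom(withBom).length);                  // 3
        System.out.println(stripUtf8Bom(new byte[] {'S', 'Q', 'L'}).length); // 3
    }
}
```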

6.2. filter

If the encoding needs to be set uniformly, a filter is a good choice. In the filter class, the encoding can be set for all the requests or responses that need it. See setCharacterEncoding() above. Apache provides an example, SetCharacterEncodingFilter, that can be used directly.

6.3. Post and get

When information is submitted with POST, the URL is more readable, and setCharacterEncoding() makes it easy to deal with character set problems. On the other hand, a URL formed by the GET method expresses the actual content of the page more directly and can be bookmarked.

From the point of view of uniformity, the GET method is recommended. It requires special processing in the program to obtain parameters, and the convenience of setCharacterEncoding() is not available. If rewrite is not involved, IE's UTF-8 problem does not arise, and URIEncoding can be set to obtain the parameters in the URL.

6.4. Simplified and traditional encoding conversion

GBK contains both simplified and traditional characters, that is, the simplified and traditional forms of the same word are two different words under GBK because their codes differ. Sometimes, to get complete results (for example when searching), traditional and simplified Chinese should be unified. All traditional characters in UTF and GBK data can be converted to the corresponding simplified characters, and Big5 data should be converted likewise. The result is, of course, still stored in UTF encoding.

For example, "语言 語言" is "\xe8\xaf\xad\xe8\xa8\x80 \xe8\xaa\x9e\xe8\xa8\x80" in UTF. After simplified/traditional conversion, it should become two identical "\xe8\xaf\xad\xe8\xa8\x80".
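The idea of such a conversion can be sketched with a lookup table. This is a toy illustration only: TraditionalToSimplified is a made-up class, the table holds a single entry (U+8A9E to U+8BED, the first character of the example), and a real converter needs a full mapping and must deal with characters that do not map one to one. Map.of assumes Java 9 or later.

```java
import java.util.Map;

class TraditionalToSimplified {
    // Single illustrative entry; a real table has thousands of mappings.
    private static final Map<Character, Character> TABLE = Map.of('\u8a9e', '\u8bed');

    /** Replaces every character found in the table; all others are kept as they are. */
    static String toSimplified(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            char mapped = TABLE.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String traditional = "\u8a9e\u8a00";
        String simplified = "\u8bed\u8a00";
        System.out.println(toSimplified(traditional).equals(simplified)); // true
        System.out.println(toSimplified(simplified).equals(simplified));  // true
    }
}
```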

Manufacturer. Com Liu keyin

2006-3-8
