Notes on Java and related character set encodings

This article describes and discusses character encoding issues in Java, using the word "中文" ("Chinese") as a running example. From the relevant code tables, the GB2312 code of "中文" is "d6d0 cec4", the Unicode code is "4e2d 6587", and the UTF-8 code is "e4b8ad e69687". (Note that "中文" has no iso8859-1 code, but its bytes can still be "represented" in iso8859-1.)

1、 Coding Basics:

One of the earliest encodings is iso8859-1, an 8-bit extension of ASCII. To make it easier to represent other languages, many further standard encodings gradually emerged. The important ones are as follows:

1. iso8859-1

It is a single-byte encoding that can represent at most 256 characters (byte values 0-255). It is used for English and other Western European languages. For example, the code of the letter "a" is 0x61 = 97.

Obviously, the character range of iso8859-1 is too narrow to represent Chinese characters. However, because it is a single-byte encoding, which matches the most basic storage unit of the computer, data in other encodings is still often carried as iso8859-1, and many protocols use it by default. For example, although "中文" ("Chinese") has no iso8859-1 code, its GB2312 code is "d6d0 cec4". When treated as iso8859-1, this is broken down into four bytes, "d6 d0 ce c4" (in fact, it is also stored byte by byte). With UTF-8 encoding it is six bytes, "e4 b8 ad e6 96 87". Obviously, this kind of representation only makes sense when it is based on another encoding.
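The idea of "representing" bytes of another encoding in iso8859-1 can be sketched in Java as follows. This is only a minimal illustration: the class and method names are made up for this example, and it assumes the JDK provides the GBK charset (standard JDK builds do, through the extended charsets).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Carrier {
    // Decode raw bytes as ISO-8859-1: every byte becomes exactly one char
    // with the same numeric value, so no information is lost.
    static String carry(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Turn the carrier string back into the original bytes, then decode
    // those bytes with the encoding they were really written in.
    static String recover(String carrier, String realCharsetName) {
        byte[] raw = carrier.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(realCharsetName));
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // the two-character example word (4e2d 6587)
        String carrier = carry(text.getBytes(Charset.forName("GBK")));
        System.out.println(carrier.length());                    // one char per GBK byte
        System.out.println(recover(carrier, "GBK").equals(text)); // original text restored
    }
}
```

The carrier string is meaningless when displayed, but because iso8859-1 maps all 256 byte values, the original bytes can always be recovered from it.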

2. GB2312/GBK

These are the Chinese national standard encodings, designed specifically for Chinese characters. Chinese characters take two bytes, while English letters take one byte with the same values as in ASCII (the 0-127 range of iso8859-1). GBK can represent both traditional and simplified characters, while GB2312 can only represent simplified characters; GBK is compatible with GB2312.
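The difference between the two encodings can be checked with a short sketch like the one below. The class and helper names are invented for this example, and it assumes the JDK exposes charsets under the names "GBK" and "GB2312" (standard JDK builds do). The traditional character U+9AD4 and its simplified form U+4F53 are used only as sample characters.

```java
import java.nio.charset.Charset;

class GbkVsGb2312 {
    // True if the named charset has a code for the given character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    // Number of bytes the string occupies in the named charset.
    static int byteLength(String charsetName, String s) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        System.out.println(byteLength("GBK", "a"));       // English letter: one byte
        System.out.println(byteLength("GBK", "\u4e2d"));  // Chinese character: two bytes
        System.out.println(canEncode("GBK", '\u9ad4'));    // traditional character in GBK
        System.out.println(canEncode("GB2312", '\u9ad4')); // same character in GB2312
    }
}
```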

3. unicode

This is the most unified encoding and can represent characters of all languages. In the form discussed here (UCS-2/UTF-16), every character, including English letters, takes a fixed two bytes, or four bytes for characters outside the basic plane. It is therefore not byte-compatible with iso8859-1 or with any other encoding. However, for characters covered by iso8859-1, the Unicode code simply adds a 0 byte in front; for example, the letter "a" is "00 61".

Note that fixed-length encoding is convenient for computer processing (GB2312/GBK is not a fixed-length encoding), and Unicode can represent all characters. For these reasons, a lot of software, including Java, uses Unicode internally.
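The "00 61" example can be reproduced in Java by asking for the big-endian two-byte form, UTF-16BE. The class below is a small sketch written for this article, not part of any library.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    // Format bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode a string as UTF-16 big-endian (no byte order mark) and show the bytes.
    static String utf16Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        System.out.println(utf16Hex("a"));             // 00 61
        System.out.println(utf16Hex("\u4e2d\u6587"));  // 4e 2d 65 87
    }
}
```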

4. UTF

Unicode encoding is incompatible with iso8859-1 and tends to take more space: even English letters need two bytes. This makes it inconvenient for transmission and storage, which is why UTF encoding (here meaning UTF-8) was created. UTF-8 is compatible with ASCII (the 0-127 range of iso8859-1) and can also represent characters of all languages. However, it is a variable-length encoding: the original design allowed 1-6 bytes per character, and the current standard limits this to 1-4 bytes. UTF-8 also has a simple built-in verification property, because the leading bits of each byte show whether it starts a character or continues one. Generally speaking, English letters take one byte and Chinese characters take three bytes.

Note that UTF-8 saves space only in comparison with two-byte Unicode. If the text is known to be Chinese, GB2312/GBK is the most economical choice. On the other hand, even for Chinese web pages, UTF-8 usually takes less space than two-byte Unicode, despite using three bytes per Chinese character, because web pages contain a lot of English characters (markup, scripts and so on).
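The space comparison above can be checked by counting bytes. The following sketch uses an invented helper and a tiny made-up HTML fragment; it assumes the JDK provides the GBK charset, and it uses UTF-16BE as the two-byte Unicode form.

```java
import java.nio.charset.Charset;

class EncodedSize {
    // Number of bytes needed to store the string in the named charset.
    static int size(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587";          // two Chinese characters
        String page = "<p>" + chinese + "</p>";   // hypothetical page fragment

        for (String name : new String[] {"GBK", "UTF-8", "UTF-16BE"}) {
            System.out.println(name + ": text=" + size(chinese, name)
                    + " page=" + size(page, name));
        }
    }
}
```

For the bare text, GBK and the two-byte form need 4 bytes and UTF-8 needs 6. Once the ASCII markup is included, UTF-8 (13 bytes) is already smaller than the two-byte form (18 bytes), while GBK remains the smallest (11 bytes).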

2、 Java character processing

When writing Java applications, character set encoding comes up in many places. In some places the encoding must be set correctly, and in others the data needs some additional conversion.

1. getBytes(charset)

This is a standard method of the Java String class. It encodes the characters of the string according to charset and returns the result as bytes. Note that strings are always stored as Unicode in Java memory. For example, "中文" ("Chinese") is normally stored as "4e2d 6587". If charset is "GBK", it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned. If charset is "utf8", the result is "e4 b8 ad e6 96 87". If charset is "iso8859-1", the characters cannot be encoded, so the result is "3f 3f" (two question marks).
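The three cases can be reproduced with a few lines of Java. This is a sketch with invented class and helper names; it assumes the JDK provides the GBK charset.

```java
import java.nio.charset.Charset;

class GetBytesDemo {
    // Format bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode the string with the named charset and show the resulting bytes.
    static String encodeHex(String s, String charsetName) {
        return hex(s.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // stored in memory as 4e2d 6587
        System.out.println(encodeHex(text, "GBK"));        // d6 d0 ce c4
        System.out.println(encodeHex(text, "UTF-8"));      // e4 b8 ad e6 96 87
        System.out.println(encodeHex(text, "ISO-8859-1")); // 3f 3f
    }
}
```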

2. new String(bytes, charset)

This is another standard Java string operation. It is the reverse of the previous one: the byte array is grouped and interpreted according to charset, and the result is stored as Unicode. Continuing the getBytes example, decoding the "GBK" and "utf8" bytes with the same charset gives the correct result "4e2d 6587", but the iso8859-1 bytes "3f 3f" become "003f 003f" (two question marks).

Because utf8 can encode all characters, new String(str.getBytes("utf8"), "utf8").equals(str) holds for ordinary text; that is, the conversion is completely reversible.
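Decoding and the round trip can be sketched as follows. The class and method names are made up for this example, and the GBK charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;

class DecodeDemo {
    // Interpret the bytes according to the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Encode and immediately decode with the same charset.
    static String roundTrip(String s, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        byte[] gbkBytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        System.out.println(decode(gbkBytes, "GBK").equals(text));       // true
        System.out.println(roundTrip(text, "UTF-8").equals(text));      // true
        System.out.println(roundTrip(text, "ISO-8859-1").equals("??")); // true: text was lost
    }
}
```

The last line shows why the iso8859-1 round trip is not reversible: the information is already lost at the getBytes step, so no later decoding can bring it back.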

3. setCharacterEncoding()

This method sets the encoding of the HTTP request or response.

For a request, it specifies the encoding of the submitted content. Once it is set, getParameter() returns correctly decoded strings directly. If it is not set, iso8859-1 is used by default and the strings need further processing; see "form entry" below. Note that setCharacterEncoding() must be executed before any call to getParameter(). As the Javadoc puts it: "This method must be called prior to reading request parameters or reading input using getReader()."
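The effect of setting or not setting the request encoding can be imitated without a servlet container. The sketch below is only a simplified model: decodeParameter is a hypothetical helper written for this article, not a servlet API method, and it assumes the browser submitted the form content as UTF-8 bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RequestDecodingModel {
    // Hypothetical stand-in for the container's parameter decoding:
    // use the configured request encoding, or iso8859-1 if none was set.
    static String decodeParameter(byte[] submitted, String requestEncoding) {
        Charset cs = (requestEncoding == null)
                ? StandardCharsets.ISO_8859_1
                : Charset.forName(requestEncoding);
        return new String(submitted, cs);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        byte[] submitted = text.getBytes(StandardCharsets.UTF_8); // assumed form encoding

        // Encoding set before reading: the parameter is decoded correctly.
        System.out.println(decodeParameter(submitted, "UTF-8").equals(text));

        // Encoding never set: six meaningless iso8859-1 chars, one per byte.
        System.out.println(decodeParameter(submitted, null).length());
    }
}
```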
