Chinese character encoding in Java Web
Encoding problems with Chinese text come up often in Java Web development. Why is encoding needed at all? The smallest unit of storage in a computer is one byte, which can represent only 256 values, while human languages need far more symbols than that. Rules for converting between char and byte therefore have to be defined.
1. Common encodings
Computers offer many encodings, including ASCII, ISO-8859-1, GBK, GB2312, UTF-16, and UTF-8.
2. Encoding scenarios
Encoding is generally needed wherever I/O takes place, including disk I/O and network I/O. Reading and writing a text file on disk is a typical example.
When a program performs I/O, there is generally no problem as long as one charset is used consistently for encoding and decoding (or both happen on the same system with the default charset). If the charsets used for encoding and decoding differ, however, Chinese text becomes garbled. For example, if a file was written with a different charset (say UTF-8) and the InputStreamReader that reads it is set to GBK, the text comes out garbled.
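The disk I/O case can be sketched roughly as follows. This is a minimal illustration, not the original example: the class name DiskIoCharsetDemo and the helper roundTrip are hypothetical, and it assumes the JDK in use ships the GBK charset (standard full JDKs do).

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class DiskIoCharsetDemo {

    // Hypothetical helper: writes text to a temporary file with one charset
    // and reads it back with another.
    static String roundTrip(String text, Charset writeCharset, Charset readCharset) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                // char -> byte: the writer encodes with writeCharset.
                try (Writer out = new OutputStreamWriter(Files.newOutputStream(file), writeCharset)) {
                    out.write(text);
                }
                // byte -> char: the reader decodes with readCharset.
                StringBuilder sb = new StringBuilder();
                try (Reader in = new InputStreamReader(Files.newInputStream(file), readCharset)) {
                    char[] buf = new char[1024];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        sb.append(buf, 0, n);
                    }
                }
                return sb.toString();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK

        // Same charset on both sides: the text survives.
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8))); // true
        // Written as UTF-8, read as GBK: the text is garbled.
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_8, gbk))); // false
    }
}
```

Passing the charset explicitly on both sides, rather than relying on the platform default, is what keeps the result independent of the machine the code runs on.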
In Java, the String class also lets you specify the charset used when converting between bytes and characters, through getBytes(Charset) and the String(byte[], Charset) constructor.
3. Comparison of several encodings
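A small sketch of the String conversions mentioned above. The class StringCharsetDemo and the method recode are names made up for illustration; getBytes(Charset) and new String(byte[], Charset) are the standard JDK calls.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StringCharsetDemo {

    // Hypothetical helper: encodes text to bytes with one charset,
    // then decodes those bytes with another.
    static String recode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);   // char -> byte
        return new String(bytes, decodeWith);       // byte -> char
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters

        // Matching charsets: the original string comes back.
        String same = recode(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(text)); // true

        // Mismatched charsets: each of the 6 UTF-8 bytes becomes one Latin-1 character.
        String mixed = recode(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(mixed.length()); // 6
    }
}
```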
All four of GBK, GB2312, UTF-16, and UTF-8 can handle Chinese characters. GB2312 and GBK use similar encoding rules, but GBK covers a larger range and can represent far more Chinese characters, so between the two GBK should be chosen. UTF-16 and UTF-8 both encode Unicode, but with different rules. Relatively speaking, UTF-16 has the highest encoding efficiency: converting between characters and bytes is simpler and string operations are easier. It is suited to use between local disk and memory, where it allows fast switching between characters and bytes; Java's in-memory representation of strings, for example, uses UTF-16. It is less suited to network transmission, however, because a byte stream is easily damaged in transit, and once a UTF-16 byte stream is damaged it is hard to recover.
UTF-8 is better suited to network transmission. ASCII characters are stored in a single byte, and damage to one character does not affect the characters that follow. Its encoding efficiency lies between GBK and UTF-16, so UTF-8 balances encoding efficiency and encoding safety and is an ideal encoding for Chinese.
4. Analysis of common problems
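To make the comparison a little more concrete, the sketch below counts how many bytes each encoding uses for one Chinese character and one ASCII letter. It only illustrates storage size, not encoding speed. The class EncodedSizeDemo and the method encodedLength are hypothetical names, and the GBK charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSizeDemo {

    // Hypothetical helper: number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        // UTF_16BE is used instead of UTF_16 because Java's "UTF-16" encoder
        // prepends a 2-byte byte-order mark, which would distort the count.
        Charset utf16 = StandardCharsets.UTF_16BE;

        String hanzi = "\u4e2d"; // one Chinese character
        String ascii = "A";

        System.out.println(encodedLength(hanzi, gbk));                    // 2
        System.out.println(encodedLength(hanzi, StandardCharsets.UTF_8)); // 3
        System.out.println(encodedLength(hanzi, utf16));                  // 2

        System.out.println(encodedLength(ascii, gbk));                    // 1
        System.out.println(encodedLength(ascii, StandardCharsets.UTF_8)); // 1
        System.out.println(encodedLength(ascii, utf16));                  // 2
    }
}
```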
When garbled text appears, how should we deal with it? The only cause of garbled text is that the charset used for encoding and the one used for decoding differ somewhere in a char-to-byte or byte-to-char conversion. Because a single operation often involves several rounds of encoding and decoding, it can be hard to tell which step went wrong. Several common symptoms are analyzed below.
Chinese turns into unintelligible characters
For example, the string "Tao! I like it!" (written in Chinese) turns into a run of unrelated symbols containing characters such as "²" and "¶". The encoding process is shown in the figure below.
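One way this symptom can be reproduced is sketched below. The text does not name the charsets involved, so the pairing used here (encode as GBK, decode as ISO-8859-1) is an assumption, as are the names GarbledDemo, misdecode, and repair.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GarbledDemo {

    // Assumption: GBK is available in this JDK.
    static final Charset GBK = Charset.forName("GBK");

    // Encodes with GBK but decodes with ISO-8859-1 (the mismatch).
    static String misdecode(String text) {
        byte[] gbkBytes = text.getBytes(GBK);                       // 2 bytes per Chinese character
        return new String(gbkBytes, StandardCharsets.ISO_8859_1);   // 1 character per byte
    }

    // Reverses the mistake: ISO-8859-1 maps every byte to one character,
    // so the original bytes can be recovered and decoded correctly.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), GBK);
    }

    public static void main(String[] args) {
        String text = "\u4e2d"; // one Chinese character, GBK bytes D6 D0
        String garbled = misdecode(text);

        System.out.println(garbled.length());                 // 2
        System.out.println(garbled.equals("\u00d6\u00d0"));   // true (the two Latin-1 characters for D6 and D0)
        System.out.println(repair(garbled).equals(text));     // true
    }
}
```

Because no byte is lost in this kind of mismatch, the garbling is reversible, which is not the case once characters have been replaced by question marks.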
The charset used to decode the string differs from the one used to encode it, so the Chinese characters turn into unintelligible text, and each Chinese character becomes two garbled characters.
A Chinese character becomes a question mark
For example, the string "Tao! I like it!" turns into a run of "?" characters. The encoding process is shown in the figure below:
After Chinese characters and Chinese punctuation are encoded with ISO-8859-1, which does not support Chinese, every character becomes "?". When ISO-8859-1 encoding meets a character outside its code range, it uniformly substitutes the byte 0x3F, which is "?". This is commonly called the "black hole": every character unknown to ISO-8859-1 becomes "?".
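The "black hole" behaviour can be observed with a few lines of Java. This is a minimal sketch; the class QuestionMarkDemo and the method throughLatin1 are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class QuestionMarkDemo {

    // Encodes with ISO-8859-1 and decodes the result with the same charset.
    // String.getBytes replaces every character the charset cannot represent
    // with the charset's replacement byte, which is 0x3F ('?') here.
    static String throughLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters

        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(bytes.length);                   // 2
        System.out.println(Integer.toHexString(bytes[0]));  // 3f

        System.out.println(throughLatin1(text));            // ??
        System.out.println(throughLatin1("abc"));           // abc
    }
}
```

Unlike the previous symptom, the original bytes are gone after this step, so the text cannot be recovered by decoding again with a different charset.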
A Chinese character becomes two question marks
For example, the string "Tao! I like it!" turns into a run of "?" characters, two for each original character. The encoding process is shown in the figure below:
This case is more complicated. The Chinese text goes through several rounds of encoding and decoding, and if one of those steps uses the wrong charset, the Chinese characters still end up as "?". In this case, carefully check each encoding step along the way to find where the error occurs.
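One chain that produces two question marks per Chinese character is sketched below. The exact chain behind the example is not given in the text, so the steps here (encode as GBK, decode as ISO-8859-1, then encode as US-ASCII) are an assumption chosen only to illustrate the effect; DoubleQuestionMarkDemo and chain are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DoubleQuestionMarkDemo {

    // Assumption: GBK is available in this JDK.
    static final Charset GBK = Charset.forName("GBK");

    static String chain(String text) {
        // Step 1: correct encoding, 2 bytes per Chinese character.
        byte[] gbkBytes = text.getBytes(GBK);
        // Step 2: wrong decoding, each byte becomes one Latin-1 character.
        String wrong = new String(gbkBytes, StandardCharsets.ISO_8859_1);
        // Step 3: encoding with a charset that cannot represent those
        // characters, so each one is replaced by '?'.
        byte[] asciiBytes = wrong.getBytes(StandardCharsets.US_ASCII);
        return new String(asciiBytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(chain("\u4e2d"));       // ??
        System.out.println(chain("\u4e2d\u6587")); // ????
    }
}
```

When tracing such a problem, printing the byte values or the string length after each step, as in the comments above, usually shows which step doubled the character count and which one introduced the "?".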
References:
1. In-depth analysis of Chinese encoding in Java
2. In-depth analysis of Java Web technology