In-depth analysis of Chinese encoding and conversion in Java

<h2 id="major1">Several common encoding formats</h2>

<h3 id="minor1.1">Why encode?</h3>

Have you ever wondered why encoding is necessary at all? Can we do without it? To answer this question, we must go back to how a computer represents the symbols that we humans understand, namely the characters of our written languages. Humanity has many languages and far too many characters to fit into a byte, the basic storage unit of a computer, so characters must be split up or translated in some way before the computer can handle them. We can think of the language the computer "understands" natively as English: any other language must first be translated into it, and that translation process is encoding. It follows that any non-English-speaking country that wants to use computers must encode its language. This may seem overbearing, but it is the current reality. It is similar to the way China is now vigorously promoting Chinese in the hope that other countries will speak it; if, in the future, other languages were translated into Chinese, we could make the Chinese character the smallest unit of information stored in a computer, and the encoding problem would disappear.

In short, the reason we must encode comes down to this: the computer's basic storage unit cannot directly represent the enormous number of symbols used by human languages, so a translation rule between the two is required.

Once we understand that all these languages need to communicate with one another, translation is clearly necessary. How, then, do we translate? In computing there are many translation schemes, such as ASCII, ISO-8859-1, GB2312, GBK, UTF-8, and UTF-16. Each can be regarded as a dictionary: it stipulates the rules of conversion, and by following those rules the computer can represent our characters correctly. There are many encoding formats today; GB2312, GBK, UTF-8, and UTF-16, for example, can all represent a Chinese character. Which should we choose for storing Chinese text? The answer depends on other factors: is storage space more important, or encoding efficiency? Choose the format that fits those constraints. These formats are briefly introduced below.

ASCII. Anyone who has studied computing knows ASCII: 128 codes in total, represented by the lower 7 bits of a single byte. Codes 0 through 31 are control characters such as line feed, carriage return, and delete; codes 32 through 126 are printable characters that can be entered from the keyboard and displayed.

ISO-8859-1. 128 characters is clearly not enough, so the ISO organization formulated a series of standards on top of ASCII to extend it: ISO-8859-1 through ISO-8859-15. Among them, ISO-8859-1 covers most Western European language characters and is the most widely used. It is still a single-byte encoding and can represent a total of 256 characters.

GB2312. Its full name is the "Basic Set of the Chinese Coded Character Set for Information Interchange". It is a double-byte encoding. The overall code range is A1 through F7, where A1 through A9 is the symbol area, containing 682 symbols, and B0 through F7 is the Chinese character area, containing 6,763 Chinese characters.

GBK. Its full name is the "Chinese Internal Code Extension Specification".
It is a new Chinese internal code specification formulated by the State Bureau of Technical Supervision for Windows 95, created to extend GB2312 with more Chinese characters. Its code range is 8140 through FEFE (excluding xx7F), a total of 23,940 code positions, which can represent 21,003 Chinese characters. Its encoding is compatible with GB2312: Chinese characters encoded with GB2312 can be decoded with GBK without garbling.

GB18030. Its full name is the "Chinese Coded Character Set for Information Interchange", and it is a mandatory standard in China. A character may be encoded in one, two, or four bytes, and the encoding is compatible with GB2312. Although it is a national standard, it is not widely used in practical application systems.

UTF-16. When speaking of UTF we must first mention Unicode (Universal Code). ISO is attempting to create a new super-language dictionary through which all the languages of the world can be translated into one another; one can imagine how complex such a dictionary must be. For the detailed specification of Unicode, refer to the corresponding documents. Unicode is the foundation of Java and XML, and how Unicode is stored in a computer is described in detail below. UTF-16 concretely defines how Unicode characters are accessed in the computer. It represents the Unicode transformation format with two bytes. This is a fixed-length representation: any character (in the basic plane) can be represented with two bytes. Two bytes are 16 bits, hence the name UTF-16. Representing characters with UTF-16 is very convenient, since every two bytes is one character, which greatly simplifies string operations. This is a very important reason why Java uses UTF-16 as its in-memory character storage format.

UTF-8. UTF-16 uses two bytes for every character. Although this is simple and convenient in representation, it has a drawback: a large proportion of characters could be represented with a single byte.
Under UTF-16 they need two bytes each, doubling the storage space, and since network bandwidth is still quite limited today, this increases transmission traffic unnecessarily. UTF-8 instead adopts a variable-length technique in which each code area has a different code length; characters of different types occupy 1 to 6 bytes (in practice at most 4). UTF-8 has the following encoding rules: if the highest bit (bit 8) of a byte is 0, the byte is an ASCII character (00 through 7F), which means all ASCII text is already valid UTF-8. If a byte starts with 11, the number of consecutive 1s indicates how many bytes the character occupies; for example, 110xxxxx marks the first byte of a two-byte UTF-8 character. If a byte starts with 10, it is not a first byte, and you must look back to the preceding bytes to find the first byte of the current character.

The above describes several common encoding formats. The following introduces how encoding is supported in Java and where encoding is required. Encoding is generally involved when converting characters to bytes or bytes to characters, and the scenarios that need this conversion arise mainly in I/O. This includes both disk I/O and network I/O; the network I/O part will be introduced later, mainly using web applications as an example. The following figure shows the interfaces for handling I/O in Java: the Reader class is the parent class for reading characters in Java I/O, while InputStream is the parent class for reading bytes. InputStreamReader is the bridge that associates bytes with characters; it handles the conversion from bytes read to characters during I/O, and the actual decoding of bytes into characters is implemented by StreamDecoder. The Charset encoding format must be specified by the user when StreamDecoder decodes.
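As a small illustration of the reading path just described, the sketch below wires an InputStream to a Reader through InputStreamReader with an explicit charset; the file name test.txt is made up for this example:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadWithExplicitCharset {
    public static void main(String[] args) throws IOException {
        // InputStreamReader is the byte-to-char bridge; the decoding work is
        // delegated to an internal StreamDecoder using the charset given here.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("test.txt"),
                                      StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Passing StandardCharsets.UTF_8 (or any other explicit charset) avoids silently falling back to the platform default.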
It is worth noting that if you do not specify a Charset, the default character set of the local environment is used; for example, GBK will be used in a Chinese environment. Writing is similar: the parent class for characters is Writer and the parent class for bytes is OutputStream, and characters are converted to bytes through OutputStreamWriter, as shown in the following figure. Similarly, the StreamEncoder class is responsible for encoding characters into bytes, and the encoding format and default rules are consistent with decoding. For example, the following code implements reading and writing a file. When I/O operations are involved in our applications, as long as we take care to specify one unified Charset character set for encoding and decoding, there will generally be no garbled text. If an application does not specify the character encoding, it takes the operating system default; in a Chinese environment, if both encoding and decoding happen in that same environment, there is usually no problem. Even so, using the operating system's default encoding is strongly discouraged, because it binds your application's encoding format to the runtime environment, and garbled text is very likely to appear when the application moves across environments. In Java development, apart from the encoding involved in I/O, the most common case is the conversion between the char and byte data types in memory. Java uses String to represent strings, so the String class provides methods for converting to bytes and constructors for building a String from bytes, as in the following code example. There are also the obsolete ByteToCharConverter and CharToByteConverter classes, each providing a convertAll method for converting between byte[] and char[], as shown in the following code. These two classes have been replaced by the Charset class.
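A minimal sketch of the String-based conversions just mentioned; as long as the same charset is used in both directions, the round trip is lossless (the sample string and charset choice are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StringByteRoundTrip {
    public static void main(String[] args) {
        String s = "I am \u541B\u5C71";   // "I am 君山"

        // String -> byte[]: String encodes with the given charset
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        // byte[] -> String: the constructor decodes with the given charset
        String back = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(Arrays.toString(utf8));
        System.out.println(s.equals(back));   // prints "true": same charset both ways
    }
}
```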
Charset provides encode and decode methods, corresponding to char[]-to-byte[] encoding and byte[]-to-char[] decoding respectively, as shown in the following code. Both encoding and decoding are completed in one class, and setting the character set through forName makes it easier to keep the encoding format unified, which is more convenient than the ByteToCharConverter and CharToByteConverter classes. Java also has a ByteBuffer class, which provides a "soft" conversion between char and byte: the conversion requires no encoding or decoding, because a 16-bit char is simply split into two 8-bit bytes. The actual values are not modified; only the data type is converted. The code is as follows. The above covers the conversion between characters and bytes; generally there is no problem as long as the encoding and decoding formats are kept unified. Several common encoding formats were introduced earlier; here we use practical examples to show how encoding and decoding are implemented in Java. We take the string "I am 君山" as an example and show how Java encodes it in the ISO-8859-1, GB2312, GBK, UTF-16, and UTF-8 formats: the string is encoded in each format, converted into a byte array, and output in hexadecimal. Let us first look at how Java performs the encoding. The following is the class diagram required for encoding in Java. First, Charset.forName(charsetName) obtains the Charset class for the specified charsetName, a CharsetEncoder object is created from that Charset, and then CharsetEncoder.encode is called to encode the string. Each encoding type corresponds to its own class, and the actual encoding process is completed in those classes. The following is the sequence diagram of the String.getBytes(charsetName) encoding process, as can be seen from the figure above.
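Stepping back for a moment to the ByteBuffer "soft" split described above, a minimal sketch: the 16-bit char 君 (U+541B) is split into the bytes 0x54 and 0x1B without any charset being consulted.

```java
import java.nio.ByteBuffer;

public class SoftCharSplit {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(2);
        buf.putChar('\u541B');   // 君: the 16-bit value is split, not encoded

        // Default byte order is big-endian: high byte first
        System.out.printf("%02X %02X%n", buf.get(0), buf.get(1));   // prints "54 1B"
    }
}
```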
The process finds the Charset class according to the charsetName and then generates a CharsetEncoder from that character set. CharsetEncoder is the parent class of all character encoders; its subclasses define how encoding is implemented for each character set. Once the CharsetEncoder object exists, its encode method can be called to perform the encoding. This is how the String.getBytes encoding method works, and other paths such as StreamEncoder are similar. Now let us see how different character sets encode the string above into a byte array. The char array of the string "I am 君山" ("I am Junshan") is 49 20 61 6D 20 541B 5C71, and it is converted into the corresponding bytes according to each encoding format.

Encoding the string "I am 君山" with ISO-8859-1 gives the following result: as the figure above shows, the seven chars are transformed into seven bytes by ISO-8859-1 encoding. ISO-8859-1 is a single-byte encoding, and each of the Chinese characters 君山 is transformed into a byte with the value 3F, which is the "?" character. So when Chinese text turns into "?", it is most likely caused by mistaken use of ISO-8859-1. Chinese characters lose information after ISO-8859-1 encoding; this is usually called the "black hole", because it swallows characters it does not know. Since the default character set of most basic Java frameworks and systems is ISO-8859-1, garbled text arises easily; we will analyze later how the different forms of garbling appear.

Encoding the string "I am 君山" with GB2312 gives the following result: the Charset corresponding to GB2312 is sun.nio.cs.ext.EUC_CN, and the corresponding CharsetEncoder class is sun.nio.cs.ext.DoubleByte. The GB2312 character set has a code table from char to byte; encoding a character means looking up this code table to find the byte(s) corresponding to each char and assembling them into a byte array.
The table lookup rule is:

    bb = c2b[c2bIndex[char >> 8] + (char & 0xff)]

If the looked-up code point value is greater than 0xff, the character is double-byte; otherwise it is single-byte. For a double-byte character, the high 8 bits become the first byte and the low 8 bits the second byte, as the following code shows:

    if (bb > 0xff) {    // DoubleByte
        if (dl - dp < 2)
            return CoderResult.OVERFLOW;
        da[dp++] = (byte) (bb >> 8);
        da[dp++] = (byte) bb;
    } else {            // SingleByte
        if (dl - dp < 1)
            return CoderResult.OVERFLOW;
        da[dp++] = (byte) bb;
    }

As the figure above shows, the first five characters are still five bytes after encoding, while the Chinese characters are encoded as double bytes. As mentioned in Section 1, GB2312 supports only 6,763 Chinese characters, so not every Chinese character can be encoded with GB2312.
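The example section above can be reproduced with a short program that hex-dumps "I am 君山" under each charset; note the 3F bytes produced by the ISO-8859-1 "black hole" and the double-byte GB2312 output. Charset names are the ones registered with the JDK, and GB2312 support depends on the JDK's installed charset providers:

```java
import java.nio.charset.Charset;

public class EncodeDemo {
    public static void main(String[] args) {
        String s = "I am \u541B\u5C71";   // "I am 君山"
        for (String name : new String[] {"ISO-8859-1", "GB2312", "GBK", "UTF-16", "UTF-8"}) {
            StringBuilder hex = new StringBuilder();
            // Encode under this charset and print each byte in hexadecimal
            for (byte b : s.getBytes(Charset.forName(name))) {
                hex.append(String.format("%02X ", b));
            }
            System.out.printf("%-10s %s%n", name, hex);
        }
    }
}
```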

The content of this article was collected from the Internet and is provided for learning and reference only; the copyright belongs to the original author.