Encoding and Garbled Text (05): Conversion Between GBK and UTF-8 (Reprint)

Original address: http://www.blogjava.net/pengpenglin/archive/2010/02/22/313669.html

[GBK to UTF-8]

On many forums and websites, people often ask: "Why can I use new String(tmp.getBytes("ISO-8859-1"), "UTF-8") or new String(tmp.getBytes("ISO-8859-1"), "GBK") to get correct Chinese, but cannot use new String(tmp.getBytes("GBK"), "UTF-8") to convert GBK into UTF-8?"

The previous article explains the reason. If the client encodes in GBK or UTF-8, and the encoded bytes are carried through ISO-8859-1 and then decoded with the same encoding the client used, the process is a "lossless conversion": ISO-8859-1 maps every byte to exactly one character, so the bytes survive the detour unchanged, and the original and final encodings are the same. It is a different matter if the client encodes in GBK and the server tries to read those bytes as UTF-8, or the reverse. The bytes are still the same bytes, but the decoding rules have changed. Two Chinese characters occupy four bytes in GBK, yet UTF-8 expects three bytes for each of them, six in total. How could the result not be garbled?

So from now on, don't make this mistake again. new String(tmp.getBytes("GBK"), "UTF-8") only reinterprets bytes; the JVM will not automatically expand them to fit UTF-8. The correct method is to expand the bytes according to the UTF-8 encoding rules, that is, to turn each character's 2 bytes into 3 bytes and then write them out as the hexadecimal UTF-8 code. The first article in this series introduced the rule:

① Get the hexadecimal code of each character. Note that a Java char holds a UTF-16 code unit, so the 16-bit value used here is the character's Unicode code, not its raw GBK byte pair.
② Convert that hexadecimal code into a 16-bit binary string (2 bytes).
③ Insert 1110 at position 1 of the string, 10 at position 9, and 10 at position 17 to get three bytes of the form 1110xxxx 10xxxxxx 10xxxxxx.
④ Convert these three bytes into hexadecimal to get the final UTF-8 code.

The following code has been slightly modified.
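No listing follows in this copy, so as a stand-in here is a minimal sketch of the ideas above. It is not the original author's code: the class name GbkToUtf8Sketch and all method names are hypothetical, and it assumes the JVM provides the GBK charset (standard JDK builds normally do). expandToUtf8 follows steps ① to ④ for a single character in the range U+0800 to U+FFFF, which covers common Chinese characters; gbkToUtf8 shows the usual shortcut of decoding with GBK and re-encoding with UTF-8, which should produce the same bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GbkToUtf8Sketch {
    // Assumption: the GBK charset is available in this JVM.
    static final Charset GBK = Charset.forName("GBK");

    /** Lossless detour: bytes -> ISO-8859-1 String -> the same bytes. */
    static byte[] latin1RoundTrip(byte[] raw) {
        String tmp = new String(raw, StandardCharsets.ISO_8859_1);
        return tmp.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Steps 1-4 for one character in U+0800..U+FFFF (three-byte UTF-8 form). */
    static byte[] expandToUtf8(char c) {
        // Steps 1-2: 16-bit binary string of the char value, zero-padded on the left.
        String bits = Integer.toBinaryString(0x10000 | c).substring(1);
        // Step 3: insert the marker bits to get 1110xxxx 10xxxxxx 10xxxxxx.
        StringBuilder sb = new StringBuilder(bits);
        sb.insert(0, "1110");  // position 1 (index 0)
        sb.insert(8, "10");    // position 9 (index 8)
        sb.insert(16, "10");   // position 17 (index 16)
        // Step 4: read the 24 bits back as three bytes.
        byte[] out = new byte[3];
        for (int i = 0; i < 3; i++) {
            out[i] = (byte) Integer.parseInt(sb.substring(i * 8, i * 8 + 8), 2);
        }
        return out;
    }

    /** The usual shortcut: decode the bytes as GBK, then encode the text as UTF-8. */
    static byte[] gbkToUtf8(byte[] gbkBytes) {
        return new String(gbkBytes, GBK).getBytes(StandardCharsets.UTF_8);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "\u4E2D\u6587";            // the two characters of "Chinese"
        byte[] gbkBytes = text.getBytes(GBK);    // 2 bytes per character

        // The ISO-8859-1 detour keeps every byte intact.
        System.out.println(Arrays.equals(gbkBytes, latin1RoundTrip(gbkBytes)));

        // Reading GBK bytes as UTF-8 does not recover the text.
        String wrong = new String(gbkBytes, StandardCharsets.UTF_8);
        System.out.println(text.equals(wrong));

        // Manual expansion of the first character versus the library encoder.
        System.out.println(hex(expandToUtf8(text.charAt(0))));
        System.out.println(hex(gbkToUtf8(gbkBytes)));
    }
}
```

In everyday code the gbkToUtf8 route is likely the better choice, since the charset encoder also handles ASCII, two-byte, and supplementary characters that the fixed three-byte expansion in expandToUtf8 does not cover.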

The content of this article was collected from the internet by users and is provided as a learning reference. Copyright belongs to the original author.