Truncating strings that contain Chinese characters by byte count in Java (recommended)

The Oracle column used by the interface has a fixed length in bytes, and the strings sent to it are expected to be longer than the column allows. Each string therefore has to be truncated so that its byte length does not exceed the column's byte length.

I adapted an example found on the Internet and added a recursive call. The truncated string must never be longer in bytes than the database column, so if the cut lands in the middle of a Chinese character, that character cannot be kept in part and has to be dropped entirely.
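One possible reading of that recursive idea is sketched below. It is not the original author's code: the class name RecursiveByteTruncate and the method truncate are illustrative assumptions, and the sketch assumes the running JDK provides the GBK charset. It cuts the encoded bytes at the limit, decodes them again, and backs off one byte at a time whenever the cut has split a character.

```java
import java.nio.charset.Charset;

final class RecursiveByteTruncate {

    /**
     * Returns the longest prefix of s whose encoding in cs fits in maxBytes.
     * Backs off one byte per recursive call when the cut splits a character.
     */
    static String truncate(String s, int maxBytes, Charset cs) {
        if (maxBytes <= 0) {
            return "";
        }
        byte[] bytes = s.getBytes(cs);
        if (bytes.length <= maxBytes) {
            return s; // already fits, nothing to cut
        }
        String candidate = new String(bytes, 0, maxBytes, cs);
        // A half character decodes to the replacement character U+FFFD,
        // so the candidate is then no longer a prefix of the original.
        if (s.startsWith(candidate)) {
            return candidate;
        }
        return truncate(s, maxBytes - 1, cs);
    }

    public static void main(String[] args) {
        // Assumption: GBK is available; Charset.forName throws otherwise.
        Charset gbk = Charset.forName("GBK");
        System.out.println(truncate("我ABC汉DEF", 6, gbk)); // 我ABC
        System.out.println(truncate("我ABC", 4, gbk));      // 我AB
    }
}
```

Passing the charset as a parameter keeps the sketch usable with whatever encoding the database column actually uses.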

The Java interview question:

Write a function that truncates a string. The input is a string and a number of bytes, and the output is the string truncated to that many bytes. A Chinese character must never be cut in half. For example, "我ABC" with 4 should give "我AB", and "我ABC汉DEF" with 6 should give "我ABC", not "我ABC" plus half of the character "汉".

Many popular languages today, such as C# and Java, use UTF-16 internally (older texts call it UCS-2). In this encoding every character in the Basic Multilingual Plane, whether an English letter, a digit or a common Chinese character, is one 16-bit unit, that is, two bytes. Problems therefore arise when the string to be truncated by bytes mixes Chinese, English and digits, as in the following string:

String s = "a加b等于c，如果a等于1、b等于2，那么c等于3";

The string above contains not only Chinese characters but also English letters and digits. The first six bytes of it (in GBK) are "a加b等", but taking the first six characters with the substring method gives "a加b等于c". The reason is that substring counts UTF-16 char units, so a Chinese character that occupies two bytes in GBK counts as a single unit, exactly like a one-byte English letter.
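The gap between counting characters and counting bytes can be checked with a few lines. This is only a sketch: the class SubstringDemo, the helper bytesInFirstChars and the short sample string are illustrative assumptions, and the GBK charset is assumed to be available in the running JDK.

```java
import java.nio.charset.Charset;

final class SubstringDemo {

    /** Encoded size, in bytes, of the first n chars of s in charset cs. */
    static int bytesInFirstChars(String s, int n, Charset cs) {
        return s.substring(0, n).getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "a加b等于c"; // hypothetical sample: 3 letters, 3 Chinese characters
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available

        // substring counts chars, so this prefix has exactly 6 chars...
        System.out.println(s.substring(0, 6).length());   // 6
        // ...but it needs 3 * 1 + 3 * 2 bytes once encoded as GBK.
        System.out.println(bytesInFirstChars(s, 6, gbk)); // 9
    }
}
```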

English letters and Chinese characters occupy different numbers of bytes in different encodings. A small test program that encodes one English letter and one Chinese character in several common encodings shows how many bytes each occupies.

The results are as follows:
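A minimal sketch of such a test program follows. The class and method names are illustrative assumptions. It also assumes the running JDK provides the GB2312, GBK and GB18030 charsets, which are not among the charsets every Java runtime is required to support.

```java
import java.nio.charset.Charset;

final class EncodingByteCounts {

    static final String[] ENCODINGS = {
        "GB2312", "GBK", "GB18030", "ISO-8859-1",
        "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE"
    };

    /** Number of bytes s occupies when encoded with the named charset. */
    static int byteCount(String s, String encoding) {
        // Charset.forName throws UnsupportedCharsetException for unknown names.
        return s.getBytes(Charset.forName(encoding)).length;
    }

    /** Prints one line per encoding for the given sample text. */
    static void report(String label, String s) {
        System.out.println(label + ": " + s);
        for (String encoding : ENCODINGS) {
            System.out.println("Number of bytes: " + byteCount(s, encoding)
                    + "; encoding: " + encoding);
        }
    }

    public static void main(String[] args) {
        report("English letter", "a");
        report("Chinese character", "人");
    }
}
```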

English letter: a
Number of bytes: 1; encoding: GB2312
Number of bytes: 1; encoding: GBK
Number of bytes: 1; encoding: GB18030
Number of bytes: 1; encoding: ISO-8859-1
Number of bytes: 1; encoding: UTF-8
Number of bytes: 4; encoding: UTF-16
Number of bytes: 2; encoding: UTF-16BE
Number of bytes: 2; encoding: UTF-16LE

Chinese character: 人
Number of bytes: 2; encoding: GB2312
Number of bytes: 2; encoding: GBK
Number of bytes: 2; encoding: GB18030
Number of bytes: 1; encoding: ISO-8859-1
Number of bytes: 3; encoding: UTF-8
Number of bytes: 4; encoding: UTF-16
Number of bytes: 2; encoding: UTF-16BE
Number of bytes: 2; encoding: UTF-16LE

UTF-16BE and UTF-16LE are two members of the Unicode encoding family. The Unicode standard defines three encoding forms (UTF-8, UTF-16 and UTF-32) and seven encoding schemes (UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE). A Java string is a sequence of UTF-16 code units, and encoding it as "UTF-16" writes those units in big-endian order after a two-byte byte order mark, which is why UTF-16 reports 4 bytes in the results above while UTF-16BE reports 2. The results also show that GB2312, GBK and GB18030 all store an English letter in one byte and a Chinese character in two, so any of the three meets the requirements of the question. GBK is used as the example from here on.

We cannot use the substring(int beginIndex, int endIndex) method of the String class directly, because it truncates by character: '我' and 'Z' are each treated as one character of length 1. In fact, as long as we can tell Chinese characters from English letters, the problem is solved. Under GBK the difference is that a Chinese character takes two bytes and an English letter takes one.
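The one-byte versus two-byte distinction can be turned into a simple counting loop. The sketch below is an assumption-laden simplification rather than a definitive implementation: the names ByteBudgetTruncate, cost and truncate are illustrative, every ASCII char is charged one byte and every other char two bytes, and characters outside that model (for example four-byte GB18030 characters) are not handled.

```java
final class ByteBudgetTruncate {

    /** Assumed GBK-style cost: 1 byte for ASCII, 2 bytes for any other char. */
    static int cost(char c) {
        return c <= 0x7F ? 1 : 2;
    }

    /** Longest prefix of s whose assumed byte cost does not exceed maxBytes. */
    static String truncate(String s, int maxBytes) {
        int used = 0; // bytes consumed by the chars kept so far
        int end = 0;  // exclusive end index of the prefix to keep
        while (end < s.length()) {
            int next = used + cost(s.charAt(end));
            if (next > maxBytes) {
                break; // this char would not fit whole, so drop it
            }
            used = next;
            end++;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(truncate("我ABC", 4));      // 我AB
        System.out.println(truncate("我ABC汉DEF", 6)); // 我ABC
    }
}
```

Because the loop never encodes anything, it does not depend on which charsets the JDK ships, but its result is only correct for encodings that match the assumed costs.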

That is the recommended approach to truncating a string containing Chinese characters by bytes in Java. I hope it serves as a useful reference.

The content of this article was collected from the web by users and is provided for study and reference. Copyright belongs to the original author.