Handling Unicode surrogate values in Java strings

Consider the following code:

byte aBytes[] = { (byte)0xff,0x01,(byte)0xd9,(byte)0x65,(byte)0x03,(byte)0x04,(byte)0x05,(byte)0x06,(byte)0x07,(byte)0x17,(byte)0x33,(byte)0x74,(byte)0x6f,1,2,3,4,5,0 };
// Decode the raw bytes as UTF-16
String sCompressedBytes = new String(aBytes,"UTF-16");
// length() is a method; advance by code point so a surrogate pair is not visited twice
for (int i=0; i<sCompressedBytes.length(); ) {
    int cp = sCompressedBytes.codePointAt(i);
    System.out.println(Integer.toHexString(cp));
    i += Character.charCount(cp);
}

I get the following incorrect output:

ff01,fffd,506,717,3374,6f00,102,304,500.

However, if 0xd9 in the input data is changed to 0x9d, the following correct output can be obtained:

ff01,9d65,500.

I realize this happens because the byte 0xd9 marks a high-surrogate Unicode code unit.

Question: is there a way to feed, identify and extract surrogate values (0xd800 to 0xdfff) in Java Unicode strings? Thanks.

Solution

Just because nobody has mentioned it, I'll point out that the Character class includes methods for working with surrogate pairs, for example isHighSurrogate(char), codePointAt(CharSequence, int) and toChars(int). I realize this is beside the point of the stated problem.
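As a rough illustration of those Character methods, here is a minimal sketch that walks a string and labels each entry as a valid surrogate pair, a lone surrogate, or an ordinary BMP character. The class name SurrogateScan, the method describe and the label strings are made up for this example; only the Character calls come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;

public class SurrogateScan {
    // Hypothetical helper: returns one label per code point (or lone surrogate) in s.
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // A well-formed pair: combine both code units into one code point
                int cp = Character.toCodePoint(c, s.charAt(i + 1));
                out.add("pair:" + Integer.toHexString(cp));
                i += 2;
            } else if (Character.isSurrogate(c)) {
                // A high or low surrogate without its partner
                out.add("lone:" + Integer.toHexString(c));
                i++;
            } else {
                out.add("bmp:" + Integer.toHexString(c));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // U+1F600 built with toChars, then an unpaired high surrogate, then 'A'
        String s = new String(Character.toChars(0x1F600)) + (char) 0xD965 + "A";
        System.out.println(describe(s)); // prints [pair:1f600, lone:d965, bmp:41]
    }
}
```

Note that a string built directly from chars can hold a lone surrogate; it is only the decoding step from bytes that rejects it.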

new String(aBytes,"UTF-16");

This is a decoding operation that transforms the input data. The input does not start with a byte order mark (0xFE 0xFF or 0xFF 0xFE), so Java's UTF-16 decoder assumes big-endian, which matches the ff01 at the start of the output above. In addition, since UTF-16 is a variable-width encoding, not every possible byte sequence can be decoded correctly: an unpaired surrogate such as 0xd965 is malformed and comes out as the replacement character fffd.
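If the goal is to detect such input rather than have it silently replaced, one option is a decoder configured to report errors. The following is only a sketch under that assumption; the class name StrictUtf16 and the method isWellFormed are invented for illustration, while CharsetDecoder and CodingErrorAction are standard java.nio.charset types.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf16 {
    // Hypothetical helper: true if the bytes decode as UTF-16 without any malformed input.
    static boolean isWellFormed(byte[] bytes) {
        try {
            StandardCharsets.UTF_16.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for an unpaired surrogate, among other malformed sequences
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] lone = { (byte) 0xd9, 0x65, 0x00, 0x41 }; // high surrogate followed by 'A'
        byte[] ok   = { (byte) 0x9d, 0x65, 0x00, 0x41 }; // ordinary BMP character followed by 'A'
        System.out.println(isWellFormed(lone)); // prints false
        System.out.println(isWellFormed(ok));   // prints true

        // The String constructor is lenient: it substitutes U+FFFD instead of failing
        String lenient = new String(lone, StandardCharsets.UTF_16);
        System.out.println(Integer.toHexString(lenient.charAt(0))); // prints fffd
    }
}
```

The strict decoder only tells you that the input is not valid UTF-16; it does not make arbitrary bytes representable as text.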

If you want to round-trip arbitrary bytes to a string and back, you are better off using an 8-bit, single-byte encoding, because every byte value is a valid character:

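import java.nio.charset.Charset;
import java.util.Arrays;

Charset iso8859_15 = Charset.forName("ISO-8859-15");
// Fill the array with every possible byte value, -128 through 127
byte[] data = new byte[256];
for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
  data[i - Byte.MIN_VALUE] = (byte) i;
}
// Decode to a string, encode back, and check that the bytes survived the round trip
String asString = new String(data,iso8859_15);
byte[] encoded = asString.getBytes(iso8859_15);
System.out.println(Arrays.equals(data,encoded));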

Note: the number of characters will equal the number of bytes (doubling the size of the data, since a Java char is 16 bits wide); the resulting string isn't necessarily printable (it may contain a bunch of control characters).

I'm with Jon, though: putting arbitrary byte sequences into Java strings is almost always a bad idea.
