UTF-8 and UTF-16 in Java
I expected the following byte data to differ, but in fact the two outputs are the same. According to Wikipedia (http://en.wikipedia.org/wiki/UTF-8#Examples), the two encodings look different in bytes, so why does Java print them the same?
String a = "€";
byte[] utf16 = a.getBytes(); //Java default UTF-16
byte[] utf8 = null;
try {
    utf8 = a.getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
    throw new RuntimeException(e);
}
for (int i = 0; i < utf16.length; i++) {
    System.out.println("utf16 = " + utf16[i]);
}
for (int i = 0; i < utf8.length; i++) {
    System.out.println("utf8 = " + utf8[i]);
}
Solution
Although Java stores characters internally as UTF-16, when you convert a string to bytes with String.getBytes(), each character is converted using the platform's default encoding, which may be something like windows-1252. My result is:
utf16 = -30
utf16 = -126
utf16 = -84
utf8 = -30
utf8 = -126
utf8 = -84
This means that the default encoding on my system is UTF-8.
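If you want to confirm which encoding the no-argument getBytes() will use on your own machine, a minimal sketch like the following may help. The class and method names are hypothetical, chosen only for illustration; Charset.defaultCharset() is the standard call.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Name of the charset that String.getBytes() falls back to when no charset is given
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The printed value depends on the platform and JVM settings
        System.out.println("default charset = " + defaultCharsetName());
    }
}
```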
Also note that the documentation for String.getBytes() has the following comment: the behavior of this method when this string cannot be encoded in the default charset is unspecified.
In general, though, you will avoid confusion if you always specify an encoding, as in a.getBytes("UTF-8").
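As a rough sketch of what naming the encoding looks like, the example below encodes the euro sign both ways. It assumes Java 7 or later for StandardCharsets, which also avoids the checked UnsupportedEncodingException. The class and helper names are hypothetical. UTF_16BE is used rather than UTF_16 because Java's "UTF-16" encoder prepends a byte-order mark, which would add two extra bytes to the output.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitEncoding {
    // Encode with an explicitly named charset, independent of the platform default
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Big-endian UTF-16 without a byte-order mark
    static byte[] utf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String a = "\u20AC"; // the euro sign, U+20AC
        System.out.println("utf8  = " + Arrays.toString(utf8(a)));  // [-30, -126, -84]
        System.out.println("utf16 = " + Arrays.toString(utf16(a))); // [32, -84]
    }
}
```

With the charset spelled out, the two arrays differ as the Wikipedia table suggests: three bytes (E2 82 AC) for UTF-8 and two bytes (20 AC) for UTF-16.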
Another thing that can cause confusion is including Unicode characters directly in the source file: String a = "€";. The euro symbol must be encoded as one or more bytes when it is stored in the file. When Java compiles your program, it reads those bytes and decodes them back into the euro symbol, you hope. You must make sure that the software that saves the euro symbol to the file (Notepad, Eclipse, etc.) encodes it in the same way that Java expects. UTF-8 is becoming more and more popular, but it is not universal, and many editors will not write files in UTF-8.
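One way to sidestep the source-file encoding question is to write the character as a Unicode escape, so the file contains only ASCII. This is a sketch of that idea, not the only option; the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class EuroEscape {
    // The euro sign (U+20AC) written as an escape, so the source file stays plain ASCII
    static final String EURO = "\u20AC";

    public static void main(String[] args) {
        byte[] utf8 = EURO.getBytes(StandardCharsets.UTF_8);
        System.out.println(EURO.length()); // 1 (a single UTF-16 code unit)
        System.out.println(utf8.length);   // 3 (three bytes in UTF-8)
    }
}
```

If you would rather keep the literal character in the source, another option is to tell the compiler which encoding the file uses, for example javac -encoding UTF-8, provided your editor really saves the file in that encoding.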