Java bug? GB2312 file cannot be read directly using Scanner
I have a GB2312-encoded file (Chinese). I downloaded it from http://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=MO using wget under Windows and stored it under the file name ModernChineseCharacterFrequencyList.html.
The following code demonstrates that Java can read the file one way but fails the other way.
That is, if the scanner is created with scanner = new Scanner(src, "GB2312"), the code does not work. If the scanner is created with scanner = new Scanner(new FileInputStream(src), "GB2312"), it works normally.
The commented-out delimiter pattern lines only show another option.
    public static void main(String[] args) throws FileNotFoundException {
        File src = new File("ModernChineseCharacterFrequencyList.html");
        //Pattern frequencyDelimitingPattern = Pattern.compile("<br>|<pre>|</pre>");
        Scanner scanner;
        String line;
        //scanner = new Scanner(src, "GB2312"); // does NOT work
        scanner = new Scanner(new FileInputStream(src), "GB2312"); // does work
        //scanner.useDelimiter(frequencyDelimitingPattern);
        while (scanner.hasNext()) {
            line = scanner.next();
            System.out.println(line);
        }
    }
Is this a bug or behavior by design?
UPDATE
When the code works, it simply reads all tokens. When it does not work, it stops reading about halfway through the file, with no exception or error message.
Nothing peculiar can be found at the point where it stops. There is no "magic" number like 2^32 involved.
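Since the failing variant stops without any exception, one thing worth checking is whether Scanner swallowed an I/O error. Scanner does not propagate an IOException from its underlying source; it treats it as end of input and keeps the last one available through ioException(). The following is a minimal sketch of that check. The class name ScannerIoCheck and the helper drain are made up for illustration, and whether an error is actually recorded for this particular file is an assumption the question does not confirm.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;

class ScannerIoCheck {

    /**
     * Consumes every token from the scanner, then returns the last
     * IOException the scanner recorded from its source, or null if none.
     * (Hypothetical helper, not part of the original question.)
     */
    static IOException drain(Scanner scanner) {
        while (scanner.hasNext()) {
            scanner.next();
        }
        // Scanner reports source errors here instead of throwing them.
        return scanner.ioException();
    }

    public static void main(String[] args) throws FileNotFoundException {
        File src = new File("ModernChineseCharacterFrequencyList.html");
        if (!src.exists()) {
            System.out.println("File not found: " + src);
            return;
        }
        // The variant that stops early in the question.
        Scanner scanner = new Scanner(src, "GB2312");
        try {
            IOException swallowed = drain(scanner);
            System.out.println(swallowed == null
                    ? "No I/O error recorded"
                    : "Swallowed I/O error: " + swallowed);
        } finally {
            scanner.close();
        }
    }
}
```

If an exception is printed, its type should indicate why reading stopped where it did; if nothing is recorded, the cause has to be looked for elsewhere.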
UPDATE 2
The behavior was first observed on Windows using Sun's Java SE 1.6.
The same behavior also occurs with OpenJDK 1.6.0_23 on Ubuntu.
Solution
I can't test my answer right now, but the JDK 6 documentation lists different canonical names for an encoding depending on the API you use: java.io or java.nio.
JDK 6 Supported Encodings
Perhaps you should use "EUC_CN" instead of "GB2312"; it is the canonical name listed for the java.io API.
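Before switching names, it may help to check how the runtime actually resolves the two. The sketch below prints the canonical name and aliases that java.nio.charset.Charset reports for each name and whether both resolve to the same charset. The class name CharsetNameCheck and the helper sameCharset are hypothetical, and the result depends on the charsets the installed JDK ships, so no particular output is assumed here.

```java
import java.nio.charset.Charset;

class CharsetNameCheck {

    /**
     * Returns true only if both names are supported by this runtime
     * and resolve to the same charset. (Hypothetical helper.)
     */
    static boolean sameCharset(String first, String second) {
        return Charset.isSupported(first)
                && Charset.isSupported(second)
                && Charset.forName(first).equals(Charset.forName(second));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"GB2312", "EUC_CN"}) {
            if (Charset.isSupported(name)) {
                Charset cs = Charset.forName(name);
                // name() is the canonical java.nio name; aliases() lists the others.
                System.out.println(name + " -> " + cs.name() + " " + cs.aliases());
            } else {
                System.out.println(name + " is not supported by this runtime");
            }
        }
        System.out.println("Same charset: " + sameCharset("GB2312", "EUC_CN"));
    }
}
```

If both names turn out to resolve to the same charset, changing the name alone is unlikely to change the behavior, and the difference between the two Scanner constructors would be the next thing to examine.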