Java bug? A GB2312 file cannot be read with a Scanner created directly from a File

2020-08-02 • Java

I have a GB2312-encoded file (Chinese). It was downloaded from http://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=MO with wget under Windows and saved as ModernChineseCharacterFrequencyList.html.

The following code demonstrates that Java can read the file to the end when the Scanner is created one way, but not when it is created another way.

That is, if the scanner is created with scanner = new Scanner(src, "GB2312"), the code does not work. If the scanner is created with scanner = new Scanner(new FileInputStream(src), "GB2312"), it works normally.

The commented-out delimiter pattern lines only show another option; they are not needed to reproduce the problem.

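import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.Scanner;
//import java.util.regex.Pattern; // only needed if the delimiter lines are enabled

// The class name is arbitrary; the original snippet showed only main()
public class ReadGb2312 {

    public static void main(String[] args) throws FileNotFoundException {

        File src = new File("ModernChineseCharacterFrequencyList.html");
        //Pattern frequencyDelimitingPattern = Pattern.compile("<br>|<pre>|</pre>");

        Scanner scanner;
        String line;

        //scanner = new Scanner(src, "GB2312"); // does NOT work: stops partway through the file
        scanner = new Scanner(new FileInputStream(src), "GB2312"); // does work: reads to the end

        //scanner.useDelimiter(frequencyDelimitingPattern);

        // With the default delimiter, each token is a whitespace-separated word
        while (scanner.hasNext()) {
            line = scanner.next();
            System.out.println(line);
        }
        scanner.close();
    }
}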

Is this a bug or behavior by design?

UPDATE

When the code works, it reads all tokens. When it does not work, it stops reading roughly in the middle of the file, with no exception or error message.

Nothing unusual could be found at the position where reading stops. There is no "magic" number like 2^32 involved either.
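
One way to investigate the silent stop, offered as a sketch rather than a diagnosis: Scanner does not propagate an IOException thrown by its underlying source. It treats the failure as end of input, and the exception can be retrieved afterwards with scanner.ioException(). The example below uses a hypothetical FailingReader (not part of the original code) to simulate a source that fails partway through. Whether the File-based Scanner in the question stops for this reason is an assumption that would need to be checked.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;

public class ScannerSilentStop {

    /** Hypothetical reader that fails after delivering a fixed number of characters. */
    static class FailingReader extends Reader {
        private final Reader delegate;
        private int remaining;

        FailingReader(String text, int failAfter) {
            this.delegate = new StringReader(text);
            this.remaining = failAfter;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            if (remaining <= 0) {
                throw new IOException("simulated read failure");
            }
            int n = delegate.read(buf, off, Math.min(len, remaining));
            if (n > 0) {
                remaining -= n;
            }
            return n;
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }

    /** Reads all tokens, counts them, and returns the exception the scanner swallowed (or null). */
    static IOException drain(Scanner scanner, int[] tokenCount) {
        while (scanner.hasNext()) {
            scanner.next();
            tokenCount[0]++;
        }
        return scanner.ioException();
    }

    public static void main(String[] args) {
        int[] count = {0};
        // The text has six tokens, but the source fails after five characters
        Scanner scanner = new Scanner(new FailingReader("a b c d e f", 5));
        IOException swallowed = drain(scanner, count);
        System.out.println("tokens read: " + count[0]);
        System.out.println("swallowed exception: " + swallowed);
    }
}
```

In the code from the question, the equivalent check would be to print scanner.ioException() after the while loop in the variant that does not work.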

UPDATE 2

The behavior was first observed on Windows with Sun's Java SE 1.6.

The same behavior is also seen on Ubuntu with OpenJDK 1.6.0_23.

Solution

I can't test my answer right now, but the JDK 6 documentation lists different canonical names for an encoding depending on the API you use: java.io or java.nio.

JDK 6 Supported Encodings

Perhaps you should use "EUC_CN" instead of "GB2312"; it is the canonical name listed for the java.io API.
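
As a quick way to check how the two names are treated on a given JDK, the sketch below asks java.nio.charset.Charset for the canonical name and aliases of each. The class and method names (CharsetNameCheck, canonicalName) are made up for this example. It only shows how the names resolve; it does not by itself confirm that switching names changes the Scanner behavior.

```java
import java.nio.charset.Charset;

public class CharsetNameCheck {

    /** Returns the canonical java.nio name for a charset name or alias, or null if unsupported. */
    static String canonicalName(String name) {
        if (!Charset.isSupported(name)) {
            return null;
        }
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"GB2312", "EUC_CN"}) {
            String canonical = canonicalName(name);
            System.out.println(name + " -> " + canonical);
            if (canonical != null) {
                // Aliases are the other names this JDK accepts for the same charset
                System.out.println("  aliases: " + Charset.forName(name).aliases());
            }
        }
    }
}
```

If both names resolve to the same canonical name, they select the same charset, and the difference between the two Scanner constructors would have to come from somewhere else.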
