Junior Sister Learns Java IO: File Encoding and the Unicode Character Set

Introduction

On a whim, Junior Sister tried out a new skill she had never used before and ran straight into a problem she couldn't solve. How many steps does it take to put an elephant in the fridge? And how do you fix garbled text? Come and find out with Senior Brother F.

Reading files with Properties

That day Junior Sister was in a great mood, whistling and humming, her gaze fixed at a textbook 45-degree angle that made everyone around her uneasy.

Junior Sister, what are you so happy about? Tell your senior brother so he can be happy too.

Junior Sister: Senior Brother F, I recently found a new way to read files. It's really easy to use, just like a Map:

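// Assumes a logger field named log on the enclosing class,
// plus imports for java.io.IOException, java.io.InputStream and java.util.Properties.
public void usePropertiesFile() throws IOException {
    Properties configProp = new Properties();
    // try-with-resources closes the stream once loading is done
    try (InputStream in = this.getClass().getClassLoader().getResourceAsStream("www.flydean.com.properties")) {
        configProp.load(in);
    }
    log.info(configProp.getProperty("name"));
    // Properties can be updated like a map
    configProp.setProperty("name", "www.flydean.com");
    log.info(configProp.getProperty("name"));
}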

Senior Brother F, look: I used Properties to read the file. Its content is in key=value form, which makes it a great fit for configuration files. The properties configuration files in Spring projects got me thinking, and I found that Java has a class, Properties, dedicated to reading .properties files.
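To make the "just like a Map" comparison concrete, here is a minimal sketch that parses key=value text held in memory instead of a file. The class name PropertiesAsMap, the parse helper and the sample values are hypothetical and used only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesAsMap {
    // Parses key=value text into a Properties object.
    // The text is an in-memory stand-in for a .properties file.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse("site=www.flydean.com\nname=flydean\n");
        System.out.println(props.getProperty("site"));           // value read from the text
        System.out.println(props.getProperty("missing", "n/a")); // default for an absent key
        props.setProperty("name", "www.flydean.com");            // overwrite, much like Map.put
        System.out.println(props.getProperty("name"));
    }
}
```

Lookups that miss return null unless a default is passed as the second argument to getProperty.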

Junior Sister is beating me to the answers now. The student really is outshining the master.

Garbled text appears

Junior Sister, well done. If you keep drawing parallels like this, you'll have Java mastered in no time, and Scala, Go, JS and the rest won't be a problem after that. In a few years you'll be promoted to architect, and the company's technology will flourish under your leadership.

As a senior brother, my most important duty is to give Junior Sister encouragement and confidence and to paint her a bright future: becoming CEO, marrying a gao fu shuai (tall, rich and handsome), and so on. I hear there is a professional term for this process: "drawing a cake".

Junior Sister, a little sheepish: But Senior Brother F, I still have one small problem I can't solve. The Chinese text comes out garbled...

I nodded gravely: Mosaics are a stumbling block to human progress... oh wait, not mosaics, garbled text. To get to the bottom of this, we have to start with character sets and file encodings.

Character set and file encoding

A long time ago, before your senior brother was even born, a high-tech product called the computer appeared in the Western world.

The first generation of computers could only do simple arithmetic, and their programs had to be punched by hand. As time went on, computers became smaller and more powerful, punching disappeared, and people started writing programs by hand in programming languages.

Everything was changing except one thing: computers and programming languages were used only in the West, where 26 letters and a limited set of punctuation marks are enough for everyday communication.

Early computer storage was very expensive, so one byte, that is 8 bits, was used to store every character anyone might need. With the first bit left unused, the remaining 7 bits give 128 possible values, covering 26 lowercase letters, 26 uppercase letters, the digits, punctuation marks and control characters.

This is the original ASCII code, the American Standard Code for Information Interchange.
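A small sketch can show what "128 possible values" means in Java terms: every ASCII character has a numeric value below 128 and fits in one byte. The class name AsciiDemo and the isAscii helper are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

class AsciiDemo {
    // True when every char in s falls in the 7-bit ASCII range 0..127.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        System.out.println((int) 'A'); // 65
        System.out.println((int) 'a'); // 97
        // Each ASCII character encodes to exactly one byte.
        byte[] bytes = "Hi!".getBytes(StandardCharsets.US_ASCII);
        System.out.println(bytes.length); // 3
        System.out.println(isAscii("flydean")); // true
        // U+7A0B is a Chinese character, written as an escape to keep this source ASCII.
        System.out.println(isAscii("\u7a0b")); // false
    }
}
```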

Later, as computers spread around the world, people found that ASCII was no longer enough. Chinese alone has more than 4,000 commonly used characters. What to do?

No problem: localize ASCII. The resulting encodings are known as ANSI encodings. If one byte is not enough, use two. Roads are made by people walking, and encodings exist to serve people. That is how standards such as GB2312, BIG5 and JIS came about. Each of them is compatible with ASCII, but they are not compatible with one another.

This seriously held back internationalization. How could we ever realize the dream of one earth, one home?

So international organizations stepped in and created the Unicode character set, which assigns a unique code point to every character of every language. Unicode code points range from U+0000 to U+10FFFF.
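As a rough illustration of code points, the sketch below prints the U+XXXX value of a few characters and the upper limit that Java exposes as Character.MAX_CODE_POINT. The class name CodePointDemo and the firstCodePoint helper are hypothetical.

```java
class CodePointDemo {
    // Formats the first code point of s in the usual U+XXXX notation.
    static String firstCodePoint(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(firstCodePoint("A"));      // U+0041
        System.out.println(firstCodePoint("\u7a0b")); // U+7A0B, a Chinese character
        // An emoji outside the first 65536 code points; Java stores it as two chars.
        System.out.println(firstCodePoint("\uD83D\uDE00")); // U+1F600
        // The highest code point Unicode defines.
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff
    }
}
```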

Junior Sister: Senior Brother F, what does Unicode have to do with the UTF-8, UTF-16 and UTF-32 I keep hearing about?

I smiled and asked her: Junior Sister, how many steps does it take to put an elephant in the refrigerator?

Junior Sister: Senior Brother F, brain teasers don't work on me anymore. It takes three steps: first, open the refrigerator; second, put the elephant in; third, close the refrigerator. Done.

Junior Sister, as a cultured Chinese person who is meant to shoulder the great task of national rejuvenation and technological progress, you've got it all wrong. Slogans are not enough. You need a practical, workable plan. Otherwise, when will we ever build our Qinxin, Tangxin and Mingxin?

You're right, Senior Brother, but what does this have to do with Unicode?

Unicode characters ultimately have to be stored in a file or in memory, and that is where the practical plan comes in. How exactly do we store them? With a fixed 1 byte, a fixed 2 bytes, or a variable number of bytes? Depending on the answer, we get different encodings: UTF-8, UTF-16, UTF-32 and so on.

UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character (the original design allowed up to 6 bytes, but the current standard stops at 4). UTF-16 uses two or four bytes. Since JDK 9, String stores its contents internally in one of two encodings, Latin-1 or UTF-16.

UTF-32 always uses 4 bytes. Of the three, only UTF-8 is compatible with ASCII, which is why UTF-8 is the most widely used encoding in the world (after all, computer technology was created by Westerners).
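The byte counts can be checked with a short sketch that encodes three sample characters in each scheme. It uses the big-endian variants so that no byte order mark is added, and it assumes the runtime provides a charset named UTF-32BE (OpenJDK does, but it is not among the charsets every Java platform must support). The class name EncodedLengthDemo and the byteLength helper are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
    // Number of bytes needed to store s in the given encoding.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // Assumption: the runtime ships a UTF-32BE charset.
        Charset utf32 = Charset.forName("UTF-32BE");
        // A Latin letter, a Chinese character (U+7A0B) and an emoji (U+1F600).
        String[] samples = {"A", "\u7a0b", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(
                "UTF-8: " + byteLength(s, StandardCharsets.UTF_8)
                + ", UTF-16: " + byteLength(s, StandardCharsets.UTF_16BE)
                + ", UTF-32: " + byteLength(s, utf32));
        }
    }
}
```

For these samples the lengths are 1/2/4, 3/2/4 and 4/4/4 bytes respectively, which shows why only UTF-8 leaves plain ASCII text unchanged.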

Fixing garbled text in Properties

Junior Sister, the garbled text in your properties file is easy to fix. The load(InputStream) method you called decodes the file as ISO-8859-1, so the UTF-8 bytes of Chinese characters are misread. Readers generally take a charset parameter that specifies the encoding to read with, so let's just pass in UTF-8:

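// Additionally needs imports for java.io.InputStreamReader and java.nio.charset.StandardCharsets.
public void usePropertiesWithUTF8() throws IOException {
    Properties configProp = new Properties();
    // Wrap the byte stream in a Reader that decodes it as UTF-8;
    // try-with-resources closes both once loading is done
    try (InputStream in = this.getClass().getClassLoader().getResourceAsStream("www.flydean.com.properties");
         InputStreamReader inputStreamReader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
        configProp.load(inputStreamReader);
    }
    log.info(configProp.getProperty("name"));
    configProp.setProperty("name", "www.flydean.com");
    log.info(configProp.getProperty("name"));
}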

In the code above, we wrap the InputStream in an InputStreamReader, and the Chinese text is no longer garbled.
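The difference between the two load overloads can be reproduced without any file. The sketch below keeps the UTF-8 bytes of a one-line properties file in memory and loads them both ways. The class name LoadCharsetDemo and its fields and helpers are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class LoadCharsetDemo {
    // The five Chinese characters of the sample value, written as escapes
    // so that this source file itself stays ASCII.
    static final String NAME = "\u7a0b\u5e8f\u90a3\u4e9b\u4e8b";
    // UTF-8 bytes of the line name=<value>, standing in for the file on disk.
    static final byte[] UTF8_FILE = ("name=" + NAME + "\n").getBytes(StandardCharsets.UTF_8);

    // load(InputStream) decodes the bytes as ISO-8859-1, so each UTF-8 byte
    // becomes a separate, wrong character.
    static String loadAsStream() {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(UTF8_FILE));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("name");
    }

    // load(Reader) leaves decoding to the Reader, which is told to use UTF-8.
    static String loadAsUtf8Reader() {
        Properties props = new Properties();
        try {
            props.load(new InputStreamReader(
                    new ByteArrayInputStream(UTF8_FILE), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("name");
    }

    public static void main(String[] args) {
        System.out.println(loadAsStream().length());          // 15: one char per UTF-8 byte
        System.out.println(loadAsUtf8Reader().length());      // 5: the original characters
        System.out.println(NAME.equals(loadAsUtf8Reader()));  // true
    }
}
```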

The real ultimate solution

Junior Sister had another question: Senior Brother F, that works because we know the file is encoded in UTF-8. What if we don't know? Is it UTF-8, UTF-16 or UTF-32?

Junior Sister's questions are getting trickier and trickier. Fortunately, I came prepared for this one too.

Now for the ultimate solution. We convert the characters, whatever encoding they started in, into Unicode escapes and save those in the properties file. The file then contains only ASCII characters, so there is no encoding problem when we read it back, is there?

The conversion uses native2ascii, a tool that ships with the JDK (up to JDK 8; it was removed in JDK 9):

 native2ascii -encoding utf-8 file/src/main/resources/www.flydean.com.properties.utf8 file/src/main/resources/www.flydean.com.properties.cn

The command above converts the UTF-8 encoded file into an ASCII file in which non-ASCII characters are written as Unicode escapes.

Before conversion:

site=www.flydean.com
name=程序那些事

After conversion:

site=www.flydean.com
name=\u7a0b\u5e8f\u90a3\u4e9b\u4e8b
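For readers without the native2ascii tool, the core of the conversion can be approximated in a few lines of Java. This is only a sketch of the idea, not the tool's actual implementation: it replaces every character above the ASCII range with a backslash-u escape and leaves everything else alone. The class name UnicodeEscaper and the toAsciiEscapes helper are hypothetical.

```java
class UnicodeEscaper {
    // Replaces every char above 0x7F with a backslash-u escape of four hex digits.
    // ASCII characters pass through unchanged.
    static String toAsciiEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The sample line from the properties file, with the Chinese value
        // written as escapes so that this source file stays ASCII.
        String line = "name=\u7a0b\u5e8f\u90a3\u4e9b\u4e8b";
        System.out.println(toAsciiEscapes(line));
        System.out.println(toAsciiEscapes("site=www.flydean.com"));
    }
}
```

The first line printed matches the converted name entry shown above, and the site entry comes out unchanged because it was already ASCII.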

Run the following test code:

public void usePropertiesFileWithTransfer() throws IOException {
    Properties configProp = new Properties();
    // The converted file is pure ASCII, so the ISO-8859-1 decoding done by
    // load(InputStream) is safe; load() turns the escapes back into characters
    try (InputStream in = this.getClass().getClassLoader().getResourceAsStream("www.flydean.com.properties.cn")) {
        configProp.load(in);
    }
    log.info(configProp.getProperty("name"));
    configProp.setProperty("name", "www.flydean.com");
    log.info(configProp.getProperty("name"));
}

The output is correct.

If you need to support internationalization, you can handle your properties files the same way.
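As a cross-check of the escaped-file approach, the sketch below round-trips a value through Properties itself: store(OutputStream, ...) writes non-ASCII characters as Unicode escapes, and load(InputStream) reads them back. Everything stays in memory. The class name EscapedRoundTrip and its helpers are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class EscapedRoundTrip {
    // Writes props with store(OutputStream, ...), which emits escapes for
    // characters outside the printable ASCII range.
    static String storeToText(Properties props) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    // Loads ASCII-only properties text through the byte-stream overload
    // and returns the value stored under the key "name".
    static String loadName(String asciiText) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(asciiText.getBytes(StandardCharsets.US_ASCII)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("name");
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // The sample value, written as escapes so that this source file stays ASCII.
        props.setProperty("name", "\u7a0b\u5e8f\u90a3\u4e9b\u4e8b");
        String text = storeToText(props);
        System.out.println(text); // a date comment line followed by the escaped entry
        System.out.println(props.getProperty("name").equals(loadName(text))); // true
    }
}
```

Because the stored text contains only ASCII, it reads back the same no matter which ASCII-compatible encoding the file is later opened with.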
