The Past and Present of Character Encoding (Repost)

Original address: http://gitbook.cn/books/599d075614d1bc13375caeaf/index.html

Many programmers do not really understand character encoding. They may know terms like ASCII, UTF-8, GBK, and Unicode, yet they still run into strange encoding problems while writing code. The most common symptom in Java is garbled text, while in Python development it is encoding errors such as UnicodeDecodeError and UnicodeEncodeError. Almost every Python developer has hit these errors and been at a loss. Starting from the origin of character encoding, this article explains how to deal with encoding problems in programming. After reading it, you should be able to calmly locate, analyze, and solve problems related to character encoding. To talk about "character encoding", we first need to understand what encoding is and why it is needed.

Anyone who has studied computers knows that a computer can only process binary data composed of 0s and 1s. Any information a human sees or hears through a computer — text, video, audio, pictures — is stored and computed in binary form. Computers are good at handling binary data, but humans are not. To reduce the communication cost between people and computers, people decided to assign a number to each character. For example, the letter A is assigned the number 65, whose binary form is 01000001. When A is stored in the computer, it is stored as 01000001; when a file or web page is loaded for display, that binary number is converted back into the character A. This process involves conversion between data in different formats. Encoding is the process of converting data from one form to another according to a set of rules (an algorithm): converting the character A to 01000001 is an encoding process, and decoding is the inverse process. The character encoding we discuss today is the algorithm for converting between characters and binary data. Encryption and decryption in cryptography are sometimes also called encoding and decoding, but they are outside the scope of this article.
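The mapping between a character and its number can be observed directly in Python — a minimal sketch using the built-in `ord` and `chr` functions:

```python
# 'A' is assigned the number 65; its 8-bit binary form is 01000001.
number = ord('A')                 # character -> number
print(number)                     # 65
print(format(number, '08b'))      # 01000001
print(chr(65))                    # number -> character: 'A'
```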

A character set is the collection of all abstract characters supported by a system — a general name for the characters and symbols it covers. Common character sets include the ASCII character set, the GBK character set, and the Unicode character set. Different character sets cover different numbers of characters: the ASCII character set contains only Latin letters and related symbols, GBK adds Chinese characters, and the Unicode character set contains all text symbols in the world. You may ask: what is the relationship between a character set and a character encoding? Hold that question — read on.

The world's first computer was designed and built in 1945 by two professors at the University of Pennsylvania, Mauchly and Eckert. The Americans drafted the computer's first character set and encoding standard, called ASCII (American Standard Code for Information Interchange). It specifies 128 characters and their corresponding binary representations. The 128 characters include the 26 letters (upper and lower case), 10 digits, punctuation marks, and special control characters — the characters common in English and Western European languages. One byte is more than enough to represent 128 characters, since a byte can represent 256 values; only the low 7 bits of the byte are used, and the highest bit was reserved for parity. As shown in the figure below, lowercase a corresponds to 01100001 and capital A to 01000001. The ASCII character set is that set of 128 characters — letters, digits, punctuation marks, and control characters (carriage return, line feed, backspace, etc.); ASCII character encoding is the set of rules (the algorithm) for converting those 128 characters into binary data the computer recognizes. Now the earlier question can be answered: generally speaking, a character set defines a character encoding of the same name — ASCII, for example, defines both the character set and the encoding. But this is not absolute: Unicode defines only a character set, and the corresponding encodings are UTF-8, UTF-16, and others. ASCII was formulated by the American National Standards Institute and finalized in 1967. Originally an American national standard, it was later adopted by the International Organization for Standardization (ISO) as the international standard ISO 646, applicable to all Latin letters.

As computers spread, they came into use in Western Europe and other regions. Many characters in Western European languages are not in the ASCII character set, which greatly restricted computer use there — imagine being able to communicate in China only in English. So they found a way to extend the ASCII character set: since ASCII uses only the low 7 bits of a byte, using the 8th bit as well raises the number of representable characters to 256. This became EASCII (Extended ASCII, the extended American standard code for information interchange). The symbols EASCII adds over ASCII include table-drawing symbols, calculation symbols, Greek letters, and special Latin symbols. However, EASCII never became a unified standard: vendors in various countries each wrote their own characters into the high half of the byte, as on MS-DOS and the IBM PC. To end this confusion, the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) jointly formulated a series of 8-bit character set standards called ISO 8859 (in full, ISO/IEC 8859). It extends ASCII, and of the 128 extended codes only 0xA0 – 0xFF (decimal 160 – 255) are actually used. ISO 8859 is in fact the general name for a group of character sets: it contains 15 of them, ISO 8859-1 through ISO 8859-15. ISO 8859-1, also known as Latin-1, covers Western European languages; the others cover Central European, Southern European, Northern European, and other character sets.
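A character such as é falls in the Latin-1 extension range above 127, which Python can show directly — a minimal sketch:

```python
# 'é' is code 0xE9 (233) in ISO 8859-1 (Latin-1): it needs the 8th bit.
encoded = 'é'.encode('latin-1')
print(encoded)            # b'\xe9'
print(encoded[0] > 127)   # True: outside the 7-bit ASCII range
```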

Later, computers spread to China, and one of the problems faced was characters. Chinese writing is broad and profound: there are about 3,500 commonly used Chinese characters, far beyond the range the ASCII character set can represent — even EASCII is a drop in the bucket. In 1981 the national standardization administration published a character set called GB2312, in which each Chinese character is composed of two bytes. In theory this can represent 65,536 characters, but GB2312 contains only 7,445: 6,763 Chinese characters and 682 other symbols. It is also compatible with ASCII: characters defined in ASCII still occupy only one byte. The Chinese characters collected in GB2312 cover 99.75% of usage frequency in mainland China, but it still cannot handle rare characters and traditional characters. So a character encoding called GBK was created on the basis of GB2312. GBK not only records 27,484 characters but also collects Tibetan, Mongolian, Uyghur, and other major minority scripts. GBK was expanded using the unused code space in GB2312, so it is fully compatible with GB2312 and ASCII. GB18030 is the newest character set, compatible with GB2312-1980 and GBK; it contains 70,244 Chinese characters. It uses a multi-byte encoding in which each character may be 1, 2, or 4 bytes, and in a sense can accommodate 1.61 million characters, including traditional Chinese characters and the Chinese characters used in Japanese and Korean. Its single-byte range is compatible with ASCII and its double-byte range with GBK.
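Python ships a gbk codec, so the byte layout described above is easy to observe — a minimal sketch:

```python
# In GBK, an ASCII character takes one byte and a Chinese character two.
print('a'.encode('gbk'))         # b'a' (1 byte, same as ASCII)
print('禅'.encode('gbk'))        # b'\xec\xf8' (2 bytes)
print(len('a禅'.encode('gbk')))  # 3 bytes total
```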

Although China now had its own character set and encoding in GBK, many countries in the world have their own languages and scripts — Japan uses JIS and Taiwan uses Big5, for example. Communication between countries was difficult because there was no unified encoding standard: the same character might be stored in two bytes in country A and three bytes in country B, so encoding problems arose easily. Starting in 1991, the International Organization for Standardization's ISO/IEC 10646 project and the Unicode Consortium's Unicode project were therefore developed in parallel. Both aimed to unify all the characters in the world under one character set. The two sides soon realized that the world does not need two incompatible character sets, so they met amicably on the encoding question and decided to merge their work. Although the projects still exist independently and publish their own standards, they agree to remain compatible. Because the name Unicode is easier to remember, it is more widely used and has become the de facto unified encoding standard. That is a brief review of character set history; now let's focus on Unicode. Unicode is a character set containing all the characters in the world, and each character has a unique code point. Note: it is a character set, not a character encoding. Unicode characters can be encoded with UTF-8, UTF-16, or even GBK.
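The distinction between a code point and an encoding shows up clearly in Python 3: `ord` gives a character's Unicode code point, while different codecs turn the same character into different byte sequences — a minimal sketch:

```python
ch = '禅'
print(hex(ord(ch)))         # 0x7985: the Unicode code point
print(ch.encode('utf-8'))   # b'\xe7\xa6\x85': three bytes in UTF-8
print(ch.encode('gbk'))     # b'\xec\xf8': two bytes in GBK
```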

UTF (Unicode Transformation Format) and UCS (Universal Coded Character Set) encodings are the encoding schemes of the Unicode and ISO/IEC 10646 systems respectively. UCS comes in UCS-2 and UCS-4; the common UTF variants are UTF-8, UTF-16, and UTF-32. Because the Unicode and UCS character sets are kept compatible, these encoding formats have corresponding equivalences. UCS-2 uses a fixed two bytes to represent a character. UTF-16 also uses two bytes, but UTF-16 is variable-length (many statements on the Internet wrongly claim it is fixed-length): when two bytes are not enough, it uses four. UTF-16 can therefore be regarded as an extension of UCS-2. UTF-32 is completely equivalent to UCS-4 and uses four bytes per character — which obviously wastes space. UTF-8 takes a single byte as its code unit and represents a character with 1 to 4 bytes. You can tell from the first byte how many bytes a character's UTF-8 encoding occupies: if the first byte starts with 0, it is a single-byte encoding; with 110, a two-byte encoding; with 1110, a three-byte encoding, and so on. In a multi-byte UTF-8 sequence, every byte after the first starts with 10. The 1- to 4-byte UTF-8 encodings cover these Unicode ranges:

- single byte: \u0000 – \u007F (0 – 127)
- two bytes: \u0080 – \u07FF (128 – 2047)
- three bytes: \u0800 – \uFFFF (2048 – 65535)
- four bytes: \u10000 – \u1FFFFF (65536 – 2097151)

UTF-8 is ASCII-compatible and saves space in data transmission and storage, and it does not need to consider byte order — both points are disadvantages of UTF-16. On the other hand, a Chinese character needs 3 bytes in UTF-8 but only 2 in UTF-16.
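The first-byte rule above can be written as a tiny function (the helper name `utf8_len` is ours, chosen for illustration):

```python
def utf8_len(first_byte: int) -> int:
    """How many bytes a UTF-8 sequence occupies, judged from its first byte."""
    if first_byte < 0b10000000:   # 0xxxxxxx -> 1 byte (ASCII range)
        return 1
    if first_byte >= 0b11110000:  # 11110xxx -> 4 bytes
        return 4
    if first_byte >= 0b11100000:  # 1110xxxx -> 3 bytes
        return 3
    return 2                      # 110xxxxx -> 2 bytes

# The predicted length agrees with the actual encoded length.
for ch in ('a', 'é', '禅'):
    data = ch.encode('utf-8')
    print(ch, len(data), utf8_len(data[0]))
```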
The advantage of UTF-16 is that computing string length and performing index operations are very fast. Java uses UTF-16 internally; Python 3's default encoding is UTF-8. In the Internet world, UTF-8 is the more widely used. Look at the figure below, which shows the character encodings that can be selected when saving a file on Windows; you can specify which encoding the system uses to store the file. "ANSI" there refers to the local system code page, a superset of ASCII (e.g. Latin-1 or GBK, depending on locale). The encoding Windows calls "Unicode" is actually a historical misnomer that persists to this day: it really means UTF-16 — more precisely, UTF-16 little-endian. What are big-endian and little-endian?

Endianness refers to the order in which a value's bytes are stored in memory. In big-endian mode, the high-order byte comes first and is stored at the lower memory address — consistent with how humans read and write — while the low-order byte comes last, at the higher address. Little-endian mode is the opposite: the high-order byte is stored at the higher address and the low-order byte at the lower address. Consider, for example, how the hexadecimal value 0x12345678 is written in big-endian versus little-endian byte order. Why do both exist? For 16-bit or 32-bit processors, the register is wider than one byte, so multiple bytes must be arranged in some order, and different architectures order them differently: x86 and the common operating systems on it (Windows, FreeBSD, Linux) use little-endian, while classic Mac OS on PowerPC, for example, used big-endian. Neither mode is inherently better or worse. Why does UTF-8 not need to consider endianness? UTF-8's code unit is a single byte, so there is no byte order to decide. UTF-16 encodes Unicode characters in two-byte units, so byte order matters: it must be decided which of the two bytes comes first.
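The two layouts can be checked with the standard struct module, and the same applies to UTF-16's two byte orders — a minimal sketch:

```python
import struct

# 0x12345678 laid out in the two byte orders.
big = struct.pack('>I', 0x12345678)     # big-endian: high byte first
little = struct.pack('<I', 0x12345678)  # little-endian: low byte first
print(big == bytes([0x12, 0x34, 0x56, 0x78]))  # True
print(little == big[::-1])                     # True: reversed order

# UTF-16 must pick a byte order; the code point of '禅' is 0x7985.
print('禅'.encode('utf-16-be'))  # bytes 0x79 0x85
print('禅'.encode('utf-16-le'))  # bytes 0x85 0x79
```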

With the theory done, let's talk about encoding problems in Python — the issue every Python developer cares about and runs into most often. Python was born before Unicode became widespread, so from its first version through Python 2.7, Python's default encoding is ASCII:

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'

We introduced characters earlier; it is worth repeating the difference between characters and bytes. A character is a symbol — a Chinese character, a letter, a digit, a punctuation mark. A byte is 8 bits, and a byte sequence is the binary data that a character is converted into by encoding. For example, the character 'p' stored on a hard disk is the binary byte 01110000, occupying one byte. Bytes are convenient for storage and network transmission, while characters are for display and easy reading.
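In Python 3 the distinction is easy to observe: the length of a str counts characters, while the length of its encoded form counts bytes — a minimal sketch:

```python
text = 'p禅'
data = text.encode('utf-8')     # characters -> bytes
print(len(text))                # 2 characters
print(len(data))                # 4 bytes: 'p' is 1 byte, '禅' is 3 in UTF-8
print(format(ord('p'), '08b'))  # 01110000: the byte stored for 'p'
```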

In Python 2, the boundary between characters and bytes is subtle and blurred. Strings come in two types, unicode and str: a str is essentially a binary byte sequence, while a unicode string is characters. The following example shows that the str "禅" prints as the hex bytes \xec\xf8 (its GBK encoding), i.e. the binary byte sequence 11101100 11111000.

    >>> s = '禅'
    >>> s
    '\xec\xf8'
    >>> type(s)
    <type 'str'>

The unicode-type string u"禅" corresponds to the Unicode code point u'\u7985':

    >>> u = u'禅'
    >>> u
    u'\u7985'
    >>> type(u)
    <type 'unicode'>

If we want to save unicode characters to a file or transfer them over the network, we must first encode them into the binary str type. Python 2 strings therefore provide an encode method to convert from unicode to str, and a decode method for the reverse.

encode

    >>> u = u'禅'
    >>> u.encode('utf-8')
    '\xe7\xa6\x85'

decode

    >>> s = '\xe7\xa6\x85'
    >>> s.decode('utf-8')
    u'\u7985'

Many beginners cannot remember whether the conversion between str and unicode uses encode or decode. Remember that str is essentially a string of binary data, that unicode is characters (symbols), and that encoding is the process of turning characters into binary data: then unicode-to-str uses encode, and the reverse uses decode.

Knowing the conversion relationship between str and unicode, let's see when UnicodeEncodeError and UnicodeDecodeError occur.

UnicodeEncodeError occurs when a unicode string is converted into a str byte sequence. For example, saving a unicode string to a file:

    >>> f = open("output.txt", "w")
    >>> f.write(u"Python之禅")
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-7: ordinal not in range(128)

Why does UnicodeEncodeError occur? When the write method is called, the program must encode the characters into a binary byte sequence, which involves a unicode-to-str conversion. The program first checks what type the string is. If it is str, it is written to the file directly without encoding, because a str is itself a binary byte sequence. If the string is of type unicode, encode is first called to convert it into a binary str before saving to the file. In Python 2, encode uses ASCII by default, equivalent to:

    >>> u"Python之禅".encode("ascii")

But we know the ASCII character set contains only 128 Latin characters and no Chinese characters, hence the error 'ascii' codec can't encode characters. To use encode correctly, you must specify a character set that contains Chinese characters, such as UTF-8 or GBK:

    >>> u"Python之禅".encode("utf-8")
    'Python\xe4\xb9\x8b\xe7\xa6\x85'
    >>> u"Python之禅".encode("gbk")
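The same error class exists in Python 3; a minimal sketch that catches it and shows that a codec covering Chinese succeeds:

```python
# ASCII cannot represent '禅', so encoding raises UnicodeEncodeError.
try:
    '禅'.encode('ascii')
except UnicodeEncodeError as err:
    print(err.encoding)       # ascii
    print(err.reason)         # ordinal not in range(128)

print('禅'.encode('utf-8'))   # b'\xe7\xa6\x85': UTF-8 covers Chinese
```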

UnicodeDecodeError occurs when a str byte sequence is decoded into a unicode string. If no codec is specified, Python 2 again falls back to the ASCII default:

    >>> s = '\xe7\xa6\x85'      # the UTF-8 bytes of u'禅'
    >>> s.decode()
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
    >>> s.decode('utf-8')
    u'\u7985'
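In Python 3 the same mistake looks like this — a minimal sketch:

```python
# Decoding UTF-8 bytes with the wrong codec raises UnicodeDecodeError.
data = '禅'.encode('utf-8')   # b'\xe7\xa6\x85'
try:
    data.decode('ascii')
except UnicodeDecodeError as err:
    print(err.encoding)       # ascii

print(data.decode('utf-8'))   # the right codec recovers the character
```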

Python 3 completely reworked strings and character encoding in a way that is incompatible with Python 2, which has caused great trouble for projects trying to migrate to Python 3. Python 3 sets the system default encoding to UTF-8 and distinguishes characters from binary byte sequences much more clearly: they are represented by str and bytes respectively. All text is of type str, which can represent any character in the Unicode character set, while binary byte data uses the new bytes type. Python 2 also had a bytes type, but it was just an alias for str.

    >>> a = "禅"
    >>> a
    '禅'

In Python 3, prefixing a quoted literal with b explicitly marks it as an object of type bytes — data composed of a binary byte sequence. A bytes literal may contain ASCII-range characters or hexadecimal escapes, but not non-ASCII characters such as Chinese.

    >>> c = b'a'
    >>> c
    b'a'
    >>> type(c)
    <class 'bytes'>

    >>> d = b'\xe7\xa6\x85'
    >>> d
    b'\xe7\xa6\x85'
    >>> type(d)
    <class 'bytes'>

    >>> e = b'禅'
      File "<stdin>", line 1
    SyntaxError: bytes can only contain ASCII literal characters.

The bytes type supports the same operations as str — slicing, indexing, basic numeric operations, and so on. However, str and bytes cannot be concatenated with +, although that was possible in Python 2.

    >>> b"a" + b"c"
    b'ac'
    >>> b"a" + "b"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: can't concat bytes to str

The figure below compares characters and bytes in Python 2 and Python 3.

Character encoding is essentially the process of converting characters into bytes. Character sets evolved in this order: ASCII, EASCII, ISO 8859-x, GB2312, Unicode. Unicode is a character set; its corresponding encoding formats are UTF-8 and UTF-16, and UTF-16 byte sequences can be stored big-endian or little-endian. In Python 2, characters and bytes are represented by unicode and str respectively; in Python 3, by str and bytes respectively.


This article was collected from the web and is shared for learning and reference; the copyright belongs to the original author.