Java implementation of Huffman coding algorithm

Introduction to Huffman coding

Huffman coding deals with the encoding pairing of characters and binary corresponding to characters, which is divided into encoding and decoding, in order to compress the length of binary data corresponding to characters. We know that character storage and transmission are binary (the computer only knows 0 / 1), so there is the mapping relationship between character and binary. Characters belong to charset. Characters need to be encoded into binary for storage and transmission. When displaying, characters need to be decoded back. The relationship between character set and encoding method is one to many (Unicode can be encoded by UTF-8, utf-16, etc.). Understand the character set, encoding and decoding, and solve the problem of random code all over the sky. Take the English letter lowercase a as an example. In ASCII coding, the decimal system is 97 and the binary system is 0110001. Each character of ASCII is encoded with 8 bits (1byte). If 1000 characters need to be transmitted, 8000 bits need to be transmitted. The problem is that the frequency of the letter E in English is 12.702%, while that of Z is 0.074%. The former is more than 100 times that of the latter, but it does use binary with the same number of digits. We can do better. The method is variable length coding. The guiding principle is that those with high frequency are encoded with shorter bits and those with low frequency are encoded with longer bits. Huffman coding algorithm is to deal with such problems.

Java implementation of Huffman coding

The main data structures used in Huffman coding algorithm are full binary tree and priority queue. The latter uses Java util. PriorityQueue, the former is implemented by itself (both internal classes), and the code is as follows:

statistical data

Since you want to arrange the coding table according to frequency, you must first obtain the frequency statistics. I implemented a method to deal with such a problem. If there is already statistical information, it can be changed to map < character, integer >. If the information you get is a percentage, multiply by 100 or 1000, or 10000. Can always be converted to an integer. For example, 12.702% multiplied by 1000 is 12702. Huffman coding only cares about size. The statistical method is as follows:

Build tree

Constructing tree is the core step of Huffman coding algorithm. The idea is to hang all characters to the leaf node of a complete binary tree, and the frequency of the left node of any non page child node is no greater than that of the right node. The algorithm is to convert the statistical information into nodes and store it in a priority queue. Each time, two nodes with the lowest frequency pop up from the queue, and build a new parent node (non leaf node). The sum of the character contents of the two nodes whose character contents have just popped out, and the frequency is also their sum. The first popped out is the left child node and the latter is the right child node, And put the newly built parent node into the queue. Repeat the above actions n-1 times. N is the number of different characters (the number in each queue minus 1). After the above steps, there is one node left in the queue, and the root node of the tree will pop up. The code is as follows:

code

The corresponding code of a character is to search upward from the leaf node where the character is located. If the character node is the left node of the parent node, add 0 before the coded character; otherwise, if it is the right node, add 1 until the root node. As long as the mapping relationship between characters and binary codes is obtained, the coding is very simple. The code is as follows:

decode

Because Huffman coding algorithm can ensure that any binary code will not be the prefix of another code, decoding is very simple. Take out each bit of the binary in turn, search down from the tree root, 1 to the right, 0 to the left, reach the leaf node (HIT), return to the root node and continue to repeat the above actions. The code is as follows:

Testing and comparison

The following tests the correctness of Huffman encoding (encoding first and then decoding, including Chinese), and the comparison between Huffman encoding and common character encoded binary strings. The code is as follows:

The above is the whole content of this article. I hope it will be helpful to your study, and I hope you can support programming tips.

The content of this article comes from the network collection of netizens. It is used as a learning reference. The copyright belongs to the original author.
THE END
分享
二维码
< <上一篇
下一篇>>