Introduction to bitmap
1. BitMap
The basic idea of bit map is to mark the value corresponding to an element with a bit bit, and the key is the element. Since bit is used to store data, it can greatly save storage space. (PS: focus on saving storage space)
Suppose there is such a requirement: find out whether a certain number m exists among 2 billion random integers, and assume a 32-bit operating system and 4G memory
In Java, int occupies 4 bytes, 1 byte = 8 bits (1 byte = 8 bits)
If each number is stored with ints, it is 2 billion ints, so the occupied space is about (2000000000 * 4 / 1024 / 1024 / 1024) ≈ 7.45g
If the storage by bit is different, the number of 2 billion is 2 billion bits, and the occupied space is about (2000000000 / 8 / 1024 / 1024 / 1024) ≈ 0.233g
There is no need to say more
So, the question is, how to represent a number?
As I said just now, each bit represents a number, 0 represents nonexistence, and 1 represents existence, which is in line with binary
In this way, we can easily express {1,2,4,6}:
The smallest unit of computer memory allocation is bytes, that is, 8 bits. What if you want to represent {12,13,15}?
Of course, it means on another 8-bit:
In this case, it seems to become a two-dimensional array
If one int occupies 32 bits, we only need to apply for an int array with the length of int TMP [1 + n / 32], where n represents the maximum value of these numbers to be stored, so:
TMP [0]: can represent 0 ~ 31
TMP [1]: can represent 32 ~ 63
TMP [2]: can represent 64 ~ 95
。。。
In this way, given any integer m, M / 32 gets the subscript, and M% 32 knows where it is in this subscript
add to
Here's a question. How do we put a number in it? For example, if you want to put the number 5 in, what should you do?
First, 5 / 32 = 0, 5% 32 = 5, which means it should be in the fifth position of TMP [0]. Then we move 1 to the left by 5 bits, and then press bit or
It's binary
This is equivalent to 86 | 32 = 118
86 | (1<<5) = 118
b[0] = b[0] | (1<<5)
That is, to insert a number, shift 1 left to the bit representing the number, and then perform bitwise OR operation with the original number
Simplify it to 86 + (5 / 8) | (1 < < (5% 8))
Therefore, the formula can be summarized as: P + (I / 8) | (1 < < (I% 8)), where p represents the current value and I represents the number to be inserted
eliminate
The above is an addition. What should I do if I want to clear it?
In the above example, suppose we want to 6 remove it, what should we do?
From the figure, just set the position of the number to 0
1 shifts 6 bits to the left to reach the bit represented by the number 6, then reverses it by bit, and finally sums it by bit with the original number, so that the position is 0
b[0] = b[0] & (~(1<<6))
b[0] = b[0] & (~(1<<(i%8)))
lookup
As we said earlier, each bit represents a number, 1 means yes (or exists), and 0 means no (or does not exist). If you add and clear the number by setting it to 1 or 0, judging whether a number exists is to judge whether the bit of the number is 0 or 1
Assuming that we want to know whether 3 is present or not, we only need to judge B [0] & (1 < < 3). If this value is 0, it does not exist. If it is 1, it means it exists
2. What is the use of bitmap
Quick sorting, searching and de duplication of a large amount of data
Quick sort
Assuming that we want to sort the five elements (4,7,5,3) in 0-7 (assuming that these elements are not repeated here), we can use the bit map method to achieve the purpose of sorting.
To represent 8 numbers, we only need 8 bits (1bytes). First, we open up 1byte space, set all bits of these spaces to 0, and then set the corresponding position to 1.
Finally, traverse the bit area and output the number of the bit that is one (2, 3, 4, 5, 7), so as to achieve the purpose of sorting, with time complexity O (n).
advantage:
Disadvantages:
Fast weight removal
Find out the number of non repeated integers in the 2 billion integers, and the memory is not enough to accommodate the 2 billion integers.
First of all, according to "the memory space is not enough to accommodate these 0.5 billion integers", we can quickly associate bit map. The key question below is how to design our bit map to represent the state of these 2 billion numbers. In fact, this problem is very simple. There are only three states of a number: nonexistence, only one, and repetition. Therefore, we only need 2 bits to store the state of a number. Suppose we set a number as 00 if it does not exist, 01 if it exists once, and 11 if it exists twice or more. Then we need about 2G of storage space.
The next task is to put (store) these 2 billion numbers. If the corresponding status bit is 00, it will be changed to 01, indicating that it exists once; if the corresponding status bit is 01, it will be changed to 11, indicating that there has been one, that is, it has occurred many times; if it is 11, the corresponding status bit will remain unchanged, indicating that it still exists many times.
Finally, count the number of status bits 01 to get the number of non repeated numbers, and the time complexity is O (n).
Quick find
This is what we said earlier. An element in the int array is 4 bytes, accounting for 32 bits. Then divide by 32 to know the subscript of the element. Find the remainder of 32 (% 32) to know which bit it is. If the bit is 1, it means it exists.
Summary & Review
Bitmap is mainly used to quickly retrieve keyword status. It is usually required that the keyword is a continuous sequence (or the keyword is most of a continuous sequence). In the most basic case, 1bit is used to represent the status of a keyword (two states can be marked), but 2bit (four states) and 3bit (eight states) can also be used as needed.
Main applications of bitmap: it represents the status of continuous (or nearly continuous, that is, most of them will appear) keyword sequences (the smaller the number of states / keywords, the better).
On a 32-bit machine, an integer, such as int a = 1, occupies 32 bits in memory to facilitate computer operation. However, for some application scenarios, this is a huge waste, because we can store 0-31 decimal numbers with the corresponding 32bit bits, which is the basic idea of bit map. Bit map algorithm uses this idea to deal with the sorting, query and de duplication of a large amount of data.
Supplement 1
On the premise that the number does not overflow, for positive and negative numbers, moving one bit to the left is equivalent to multiplying 2 to the power of 1, moving n bits to the left is equivalent to multiplying 2 to the power of N, moving one bit to the right is equivalent to dividing 2, and moving n bits to the right is equivalent to dividing 2 to the power of n.
< < shift left is equivalent to multiplying by 2 to the nth power, for example: 1 < < 6 is equivalent to 1 × 64 = 64, 3 < < 4 is equivalent to 3 × 16=48
>>Shift right, which is equivalent to dividing by 2 to the nth power. For example, 64 > > 3 is equivalent to 64 ÷ 8 = 8
^XOR is equivalent to finding the remainder. For example, 48 ^ 32 is equivalent to 48% 32 = 16
Supplement 2
Exchange the values of two variables without using third-party variables
3. BitSet
BitSet implements a bit vector that can grow as needed. Each bit has a Boolean value. The bits of a BitSet can be indexed by non negative integers (PS: meaning that each bit can represent a non negative integer). A bit can be found, set and cleared. The contents of another BitSet can be modified through logical operators. By default, all bits have a default value of false.
@H_ 829_ 419@
As you can see, it's almost what we thought before
Use a long array to store. The initial length is 64. When setting the value, first shift 6 bits to the right (equivalent to dividing by 64) to calculate the position in the array, and then change the status bit
It doesn't matter if you don't understand anything else. It's enough to understand these two sentences:
4. Bloom Filters
Bloom filter is a data structure that can be used to judge whether an element is in the collection. It has the characteristics of fast operation and small memory consumption.
The cost of efficient insertion and query is that bloom filter is a probability based data structure: it can only tell us that an element is definitely not or may be in the set.
The basic data structure of Bloom filter is a bit vector (which can be understood as an array).
It is mainly used in scenarios that do not require accurate filtering under large-scale data, such as checking spam addresses, de duplication of crawler URL addresses, solving cache penetration problems, etc
If you want to judge whether an element is in a set, you generally think of saving all the elements in the set and determining it by comparison. Linked list, tree, hash table (hash table) and other data structures are all based on this idea, but with the increase of elements in the set, the storage space required is larger and larger; at the same time, the retrieval speed is slower and slower, and the retrieval time complexity is O (n), O (log n), O (1).
The principle of Bloom filter is that when an element is added to the set, This element is mapped into k points in a bit array through k hash functions, and they are set to 1. During retrieval, you can know whether the element is in the set by looking at whether these points are all 1; if any of these points has 0, the checked element must not be in; if they are all 1, the checked element is likely to be in (the reason for saying "possible") Is the existence of error).
Bloomfilter process
com. google. common. hash. BloomFilter
5. Documentation
http://llimllib.github.io/bloomfilter-tutorial/zh_CN/
https://www.cnblogs.com/geaozhang/p/11373241.html
https://www.cnblogs.com/huangxincheng/archive/2012/12/06/2804756.html
https://www.cnblogs.com/DarrenChan/p/9549435.html