Elimination of big data by bit mapping
The problem is: m (such as 1 billion) int integers, only n of which have been repeated, read them into memory and delete the repeated integers.
Problem analysis: we must first think of opening up m int integer data arrays in the computer memory, reading m int arrays by one, then comparing the values one by one, and finally removing the duplicate data. Of course, this is feasible in dealing with small-scale data.
We consider the case of big data: for example, in the Java language, 1 billion int type data are rearranged.
An int type in Java takes up 4 bytes in memory. Then a total of 1 billion int type data needs to open up a continuous memory space of 10 ^ 9th power * 4 bytes ≈ 4GB. Taking a 32-bit operating system computer as an example, the maximum supported memory is 4G, and the available memory is less than 4G. Therefore, the above methods do not work when dealing with big data.
Thinking Transformation: since we can't open up int type arrays for all int type data, we can take smaller data types to read cached int type data. Considering that the data processed inside the computer are bits of 01 sequence, can we use 1bit to represent an int type data.
Export of bit mapping: use smaller data types to refer to larger data types. For the above problem, we can use 1 bit
To correspond to an int integer. If the corresponding int type data exists, assign its corresponding bit to 1. Otherwise, Assign value to 0 (boolean type). The int range in Java is - 2 ^ 31 to 2 ^ 31. Then the length of all possible numeric components is 2 ^ 32. The corresponding bit length is also 2 ^ 32. After this processing, you only need to open up a memory space of 2 ^ 32 bit = 2 ^ 29 byte = 512M. Obviously, this processing can meet the requirements. Although the memory consumption is not too small, it is handled in this way for the time being All right.
Solution: first define the int - byte mapping relationship as shown in the figure below. Of course, the mapping relationship can be customized. But the premise is to ensure that your array superscript and subscript cannot cross the boundary.
However, the bit [] array defined above obviously does not exist in the computer, so we need to convert it into a basic data type storage in Java. Obviously, byte [] is the best choice.
Convert it to byte [] array scheme:
In the user-defined mapping table, each bit corresponds to an int value. I will correspond the maximum and minimum values of int to the maximum and minimum indexes of the array. As can be seen from the above figure, the difference between int value and bit index is 2 ^ 31 power. Of course, you can also define other mapping relationships, just be careful not to cross the array. Since the maximum value may be 2 ^ 32, use long to receive.
long bitIndex = num + (1l << 31);
When calculating the index converted into byte [] array, since the bitindex index defined above is a non negative number, there is no need to introduce bit operation to remove the symbol.
int index = (int) (bitIndex / 8);
Calculate the specific position of bitindex in byte [] array index.
int innerIndex = (int) (bitIndex % 8);
Bit operation is introduced to add the bits of byte [] array index by weight
dataBytes[index] = (byte) (dataBytes[index] | (1 << innerIndex));
This solves the problem of the whole big data reading and duplication.
So how to read it out? How to sort data?
Then you only need to carry out one-to-one correspondence according to the byte [] array to the int type data.
Take the ascending output of the following code as an example.
Traverse the array and restore the data by the inverse operation of the previous mapping relationship.
for (int i = 0; i < bytes.length; i++) {
for (int j = 0; j < 8; j++) {
if (!(((bytes[i]) & (1 << j)) == 0)) {
int number = (int) ((((long) i * 8 + j) - (1l << 31)));
}
}
}
}
Since the JVM memory set by default for compiling software is about 128-400m, it is obviously not enough to test this program, so you need to adjust the memory allocated to the JVM. Otherwise, no matter how you run it, exception in thread "main" Java will appear lang.OutOfMemoryError: Java heap space...
Eclipse: select Run - > run configuration - > arguments, and enter - xms256m - xmx1024m (- XMS represents the memory size allocated during JVM startup, and - Xmx represents the maximum memory that can be allocated)
IntelliJ idea: modify the installation directory / IntelliJ idea 7.0/bin exe. Vmoption file
-Xms256M -Xmx1024M
Source code:
Transferred from: http://blog.csdn.net/xqy_zyl/article/details/14456041