Java – packed numeric array
I have a large array (~ 400.000.000 entries) with integers of {0,1,..., 8}
At present, I use byte array and save 2 numbers in each entry
I want to know if there is a good way to compress this array I did a quick research and found algorithms like huffmann or LZW But these algorithms are used to compress data, send compressed data to someone and decompress it
I just want to have a table with less memory space, so I can load it into RAM A 200MB table is easy to fit, but I'm considering a bigger table
Importantly, I can still determine the value of some positions
Do you have a tip?
More information: I just did some experiments and it turns out that an average of 2.14 consecutive numbers have the same value There are 1 zero, 154, 10373, 385990 three points, 8146188, 85008968, 265638366 six, 70791576 seven and 80 eight So more than half of the number is 6S
I just need a quick getValue (IDX) function. SetValue (IDX, value) is not important
Solution
It depends on what your data looks like Are there duplicate entries, or are they just changing slowly, or what?
If so, you can try to compress the data block and decompress it if necessary The larger the block, the more memory can be saved and the worse the speed With all due respect, it's no good You can also compress and decompress the data into memory
Otherwise, if there is no regularity, each entry needs at least log (9) / log (2) = 3.17 bits, and nothing can improve it
You can get very close to this value by packing five numbers into a short number For example, 9 * * 5 = 59049 < 65536 = 2 * * 16, almost perfect You need 3.2 per person, no big win The packaging of five numbers is given by this formula
a + 9 * (b + 9 * (c + 9 * (d + 9 * e)))
And decompressing through precomputed tables is trivial
Update after issue update
In fact, on average, about 2.14 consecutive numbers are the same, which may lead to some compression, but in this case it doesn't say anything There are almost five and six, so it seems to imply that two equal consecutive numbers are encountered
In view of this fact, you can forget my above optimization Since you can handle a single zero individually, there are actually only eight values Therefore, each value needs only 3 bits, and zero needs an index
You can even create a HashMap for all values below 4 or above 7, storing 1 154 10373 385990 80 entries, using only 2 bits per value But this is far from ideal
Assuming there is no regularity, you need 1.44 bits for each value, because this is entropy You can browse all 5 tuples, calculate their histograms, and use 1 byte to encode 255 most frequent tuples All the remaining tuples will be mapped to the 256 th value, which tells you that you must find rare tuple values in HashMap
Some evaluation
I wonder if it works It takes 85996340 bytes to pack five numbers into one byte There are nearly 5 million tuples that don't fit, so my idea is to use hash mapping for them Assuming that it makes sense to start over instead of linking, it may be 50% full, so we need 10 million entries Assuming tintshorthashmap (mapping indexes to tuples), each entry occupies 6 bytes, resulting in 60 MB It's terrible.
It contains only 4 numbers in a byte, consumes 107495425 bytes, and leaves 159531 unsuitable tuples It looks better, but I'm sure the denser packing can be greatly improved
Results generated by this small program:
*** Packing 5 numbers in a byte. *** Normal packed size: 85996340. Number of tuples in need of special handling: 4813535. *** Packing 4 numbers in a byte. *** Normal packed size: 107495425. Number of tuples in need of special handling: 159531.