Java – a space-saving data structure to store and look up a large number of (evenly distributed) integers
I need to store and look up a million evenly distributed integers in memory
Background: the integer problem above is a simplification of the following problem: I have a million strings (my "dictionary"), and I want to know whether the dictionary contains a given string. The dictionary is too big to fit in memory, so I am willing to sacrifice a little precision to reduce memory consumption. I'll do this by switching to a dictionary containing the hash code (an int) of each string instead of the actual characters. I'm assuming that the probability of a collision for each string is only 1,000,000 / 2^32 (about 0.02%).
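For concreteness, here is a minimal sketch of that baseline (the class name and the use of a plain sorted array are my assumptions, not part of the question):

```java
import java.util.Arrays;

// Store only the hash codes of the dictionary strings in a sorted int[]
// (4 MB for a million entries) and test membership with binary search.
public class HashCodeDictionary {
    private final int[] hashes;

    public HashCodeDictionary(java.util.Collection<String> words) {
        hashes = words.stream().mapToInt(String::hashCode).toArray();
        Arrays.sort(hashes);
    }

    // May return a false positive on a hash collision, never a false negative.
    public boolean mightContain(String s) {
        return Arrays.binarySearch(hashes, s.hashCode()) >= 0;
    }
}
```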
Solution
Although Jon Skeet's answer gives considerable savings for a small investment, I think you can do better. Because your numbers are fairly evenly distributed, you can use an interpolation search for faster lookups (roughly O(log log n) instead of O(log n)). For a million items, expect about 4 comparisons instead of about 20.
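Here is a minimal sketch of an interpolation search over a sorted int[] (my own code, not from the answer); instead of always probing the middle, it probes where the key "should" sit if the values were perfectly uniform:

```java
// Interpolation search over a sorted int[]. Assumes values are roughly
// uniformly distributed; ~O(log log n) probes on such data.
static boolean interpolationSearch(int[] a, int key) {
    int lo = 0, hi = a.length - 1;
    while (lo <= hi && key >= a[lo] && key <= a[hi]) {
        if (a[lo] == a[hi]) return a[lo] == key; // avoid division by zero
        // Probe where the key would be if values rose linearly from
        // a[lo] to a[hi]; long arithmetic avoids 32-bit overflow.
        int mid = lo + (int) (((long) key - a[lo]) * (hi - lo)
                              / ((long) a[hi] - a[lo]));
        if (a[mid] == key) return true;
        if (a[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return false;
}
```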
If you want to go further and cut the memory in half again, you can build it as a two-level lookup table, basically a simplified version of a trie
You divide your 32-bit integers into two 16-bit halves. You use the upper 16 bits as an index into the first level of the lookup table. At this level you have 65536 pointers, one for each possible value of those 16 bits. That takes you to the second level of the table. For that part we do a binary or interpolation search between the selected pointer and the next one up – that is, over all the values in the second level that share the same upper 16 bits
By the time we look in the second-level table, however, we already know 16 bits of the value – so rather than storing all 32 bits, we only need to store the other 16 bits
This means that instead of the second level occupying 4 megabytes, it has been cut to 2 megabytes. We do need the first-level table on top of that, but it's only 65536 × 4 = 256K bytes
This will almost certainly be faster than a binary search of the entire data set, too. In the worst case (using a binary search for the second level) we make up to 17 comparisons (1 + log2 65536). The average will be much better than that though – since we have only a million items, each second-level "partition" holds an average of only 1,000,000 / 65536 ≈ 15 items, giving about 1 + log2(16) = 5 comparisons. Using an interpolation search in the second level might reduce that further, but when you're only doing 5 comparisons there isn't much room left for real improvement. With an average of only ~15 items in the second level, the kind of search you use won't matter much – even a linear search will be pretty fast
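To make the layout concrete, here is a minimal sketch of the two-level table (class and field names are mine, not from the answer): a 65537-entry offset array as the first level, and the low 16 bits of every value, stored as shorts, as the second level. Given the ~15-entry partitions, it just uses a linear scan:

```java
import java.util.Arrays;

// Two-level lookup table: the first level maps the top 16 bits of a value
// to the start of its partition; the second level stores only the low 16
// bits. Memory: 65537 * 4 bytes ≈ 256K for level one, 2 bytes per value
// for level two.
public class TwoLevelIntSet {
    private final int[] start;  // partition offsets, indexed by top 16 bits
    private final short[] low;  // low 16 bits of every stored value

    public TwoLevelIntSet(int[] values) {
        start = new int[65537];
        low = new short[values.length];
        // Count how many values fall under each 16-bit prefix...
        for (int v : values) start[(v >>> 16) + 1]++;
        // ...turn the counts into start offsets (prefix sums)...
        for (int i = 1; i <= 65536; i++) start[i] += start[i - 1];
        // ...and scatter the low 16 bits of each value into its partition.
        int[] next = Arrays.copyOf(start, 65536);
        for (int v : values) low[next[v >>> 16]++] = (short) v;
    }

    public boolean contains(int key) {
        int from = start[key >>> 16], to = start[(key >>> 16) + 1];
        short target = (short) key;
        // ~1,000,000 / 65536 ≈ 15 entries per partition on average, so a
        // linear scan is fine; a sorted partition plus binary search would
        // cost about 1 + log2(16) = 5 comparisons instead.
        for (int i = from; i < to; i++)
            if (low[i] == target) return true;
        return false;
    }
}
```

If you did want a binary or interpolation search inside the partitions instead, you'd sort each partition's shorts (comparing them as unsigned 16-bit values), but as noted above the partitions are small enough that it hardly matters.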
Of course, if you like you can go a step further and use a four-level table (one level per byte of the integer). Whether that would save you enough more to be worthwhile is open to question though, at least right off. My immediate guess is that you'd be doing quite a bit of extra work for a fairly small saving: storing only the last byte of a million integers obviously takes 1 megabyte, and the three levels of table leading up to it would consume a fair amount, so the net result might be saving something like half a megabyte over the two-level version. If you're in a situation where saving a little more would make a big difference, go ahead – but otherwise I doubt the return justifies the extra investment