Java – a space-saving data structure to store and look up a large number of (evenly distributed) integers
I need to store and look up a million evenly distributed integers in memory
Background: the integer problem above is a simplification of the following problem: I have a million strings (my "dictionary"), and I want to know whether the dictionary contains a given string. The dictionary is too big to fit in memory, so I am willing to sacrifice a little precision to reduce memory consumption. I'll do this by switching to a dictionary containing the hash code (an int) of each string instead of the actual characters. I'm assuming that the probability of a collision for each string is only 1,000,000 / 2^32 (about 0.02%).
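For concreteness, here is a minimal sketch of that baseline (the class name and the use of a plain sorted array are my assumptions, not part of the question):

```java
import java.util.Arrays;

// Store only the hash codes of the dictionary strings in a sorted int[]
// (4 MB for a million entries) and test membership with binary search.
public class HashCodeDictionary {
    private final int[] hashes;

    public HashCodeDictionary(java.util.Collection<String> words) {
        hashes = words.stream().mapToInt(String::hashCode).toArray();
        Arrays.sort(hashes);
    }

    // May return a false positive on a hash collision, never a false negative.
    public boolean mightContain(String s) {
        return Arrays.binarySearch(hashes, s.hashCode()) >= 0;
    }
}
```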
Solution
Although Jon Skeet's answer gives considerable savings for a small investment, I think you can do better. Because your numbers are fairly evenly distributed, you can use an interpolation search for faster lookups (roughly O(log log n) instead of O(log n)). For a million items, expect about 4 comparisons instead of about 20.
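Here is a minimal sketch of an interpolation search over a sorted int[] (my own code, not from the answer); instead of always probing the middle, it probes where the key "should" sit if the values were perfectly uniform:

```java
// Interpolation search over a sorted int[]. Assumes values are roughly
// uniformly distributed; ~O(log log n) probes on such data.
static boolean interpolationSearch(int[] a, int key) {
    int lo = 0, hi = a.length - 1;
    while (lo <= hi && key >= a[lo] && key <= a[hi]) {
        if (a[lo] == a[hi]) return a[lo] == key; // avoid division by zero
        // Probe where the key would be if values rose linearly from
        // a[lo] to a[hi]; long arithmetic avoids 32-bit overflow.
        int mid = lo + (int) (((long) key - a[lo]) * (hi - lo)
                              / ((long) a[hi] - a[lo]));
        if (a[mid] == key) return true;
        if (a[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return false;
}
```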
If you want to go further and cut the memory in half again, you can build it as a two-level lookup table, basically a simplified version of a trie
You divide your 32-bit integers into two 16-bit halves. You use the upper 16 bits as an index into the first level of the lookup table. At this level you have 65536 pointers, one for each possible value of those 16 bits. That takes you to the second level of the table. For that part we do a binary or interpolation search between the selected pointer and the next one up – that is, over all the values in the second level that share the same upper 16 bits
By the time we look in the second-level table, however, we already know 16 bits of the value – so rather than storing all 32 bits, we only need to store the other 16 bits
This means that instead of the second level occupying 4 megabytes, it has been cut to 2 megabytes. We do need the first-level table on top of that, but it's only 65536 × 4 = 256K bytes
This will almost certainly be faster than a binary search of the entire data set, too. In the worst case (using a binary search for the second level) we make up to 17 comparisons (1 + log2 65536). The average will be much better than that though – since we have only a million items, each second-level "partition" holds an average of only 1,000,000 / 65536 ≈ 15 items, giving about 1 + log2(16) = 5 comparisons. Using an interpolation search in the second level might reduce that further, but when you're only doing 5 comparisons there isn't much room left for real improvement. With an average of only ~15 items in the second level, the kind of search you use won't matter much – even a linear search will be pretty fast
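To make the layout concrete, here is a minimal sketch of the two-level table (class and field names are mine, not from the answer): a 65537-entry offset array as the first level, and the low 16 bits of every value, stored as shorts, as the second level. Given the ~15-entry partitions, it just uses a linear scan:

```java
import java.util.Arrays;

// Two-level lookup table: the first level maps the top 16 bits of a value
// to the start of its partition; the second level stores only the low 16
// bits. Memory: 65537 * 4 bytes ≈ 256K for level one, 2 bytes per value
// for level two.
public class TwoLevelIntSet {
    private final int[] start;  // partition offsets, indexed by top 16 bits
    private final short[] low;  // low 16 bits of every stored value

    public TwoLevelIntSet(int[] values) {
        start = new int[65537];
        low = new short[values.length];
        // Count how many values fall under each 16-bit prefix...
        for (int v : values) start[(v >>> 16) + 1]++;
        // ...turn the counts into start offsets (prefix sums)...
        for (int i = 1; i <= 65536; i++) start[i] += start[i - 1];
        // ...and scatter the low 16 bits of each value into its partition.
        int[] next = Arrays.copyOf(start, 65536);
        for (int v : values) low[next[v >>> 16]++] = (short) v;
    }

    public boolean contains(int key) {
        int from = start[key >>> 16], to = start[(key >>> 16) + 1];
        short target = (short) key;
        // ~1,000,000 / 65536 ≈ 15 entries per partition on average, so a
        // linear scan is fine; a sorted partition plus binary search would
        // cost about 1 + log2(16) = 5 comparisons instead.
        for (int i = from; i < to; i++)
            if (low[i] == target) return true;
        return false;
    }
}
```

If you did want a binary or interpolation search inside the partitions instead, you'd sort each partition's shorts (comparing them as unsigned 16-bit values), but as noted above the partitions are small enough that it hardly matters.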
Of course, if you like you can go a step further and use a four-level table (one level per byte of the integer). Whether that would save you enough more to be worthwhile is open to question though, at least right off. My immediate guess is that you'd be doing quite a bit of extra work for a fairly small saving: storing only the last byte of a million integers obviously takes 1 megabyte, and the three levels of table leading up to it would consume a fair amount, so the net result might be saving something like half a megabyte over the two-level version. If you're in a situation where saving a little more would make a big difference, go ahead – but otherwise I doubt the return justifies the extra investment