REST – possible multithreaded algorithm to list all keys in large S3 buckets?

For S3 buckets containing a very large number of keys, listing the keys through the REST API is a slow process:

> - You can only list 1000 keys at a time.
> - The only way to find the 5001st key (as far as I know) is to list the first 1000 keys, then list the next 1000 starting at the NextMarker from the response, and recurse until you reach 5001.
> - The latency of an S3 REST request is high; a request for 1000 keys usually takes a few seconds.
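For concreteness, here is a minimal sketch of that sequential baseline in Python with boto3 (my assumption; the question only says "REST API", and the bucket name is a placeholder). Each request must wait for the previous page's marker, so the round trips cannot overlap:

```python
import boto3

s3 = boto3.client("s3")

def list_all_keys(bucket):
    """Sequential baseline: one request per 1000 keys, each waiting on
    the marker derived from the previous page."""
    keys, marker = [], ""
    while True:
        resp = s3.list_objects(Bucket=bucket, Marker=marker, MaxKeys=1000)
        contents = resp.get("Contents", [])
        keys.extend(obj["Key"] for obj in contents)
        if not resp.get("IsTruncated") or not contents:
            return keys
        marker = contents[-1]["Key"]  # V1 ListObjects: resume after last key
```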

Since making 100 concurrent key-listing REST requests shouldn't slow down any single request, the process ought to be optimizable through parallelization. But if my algorithm is "dumb" and just divides the possible key space at predetermined markers (e.g. '', 'a', 'b', 'c', 'd', 'e', ...), it won't actually speed up listing the keys of a bucket where every key starts with "images/".
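To illustrate the problem, here is roughly what that dumb split looks like (the slice bounds and helper are my own sketch, continuing the boto3 assumption above). With every key under "images/", 26 of the 27 slices return a single empty page immediately, and the ('i', 'j') slice still pages through the whole bucket sequentially:

```python
import string

# Predefined slices of the key space, ignoring how keys are distributed:
# ("", "a"), ("a", "b"), ..., ("z", None) - one thread per slice.
SLICES = list(zip([""] + list(string.ascii_lowercase),
                  list(string.ascii_lowercase) + [None]))

def list_slice(s3, bucket, lo, hi):
    """List keys k with lo < k (and k < hi, if hi is set) via V1 markers."""
    keys, marker = [], lo
    while True:
        resp = s3.list_objects(Bucket=bucket, Marker=marker, MaxKeys=1000)
        page = [o["Key"] for o in resp.get("Contents", [])]
        keys.extend(k for k in page if hi is None or k < hi)
        if (not resp.get("IsTruncated") or not page
                or (hi is not None and page[-1] >= hi)):
            return keys
        marker = page[-1]
```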

So I'd like to know whether anyone with real-world S3 experience knows a better way to traverse a bucket's key space, or has tried an adaptive (i.e. "non-dumb") algorithm to improve concurrent key listing.

Solution

Perhaps some form of "binary search" algorithm would help? E.g. start searches at '' and 'm', then midway between those, and so on. I think you'd end up fetching each key at most about twice - and you stop asking for more once you already have the NextMarker covered.

How do you choose how many searches to start with? I think you could subdivide on each cycle: start one search at '', and when its results come back, if they indicate there are more keys, continue from the response's NextMarker and also start a new search halfway between that NextMarker and 'z'. Repeat. Use something like a hash set so each key is stored only once.
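Here is one way that subdivision could look in Python (a sketch under my own assumptions: boto3, printable-ASCII keys, a crude one-character midpoint; the answer doesn't prescribe an implementation). Each truncated page continues its own range and forks a second search halfway to the range's upper bound, and a set plus a lock deduplicates the keys:

```python
from queue import Queue
import threading
import boto3

NUM_WORKERS = 100  # the question's "100 concurrent requests"

def adaptive_list(bucket):
    s3 = boto3.client("s3")
    keys, lock = set(), threading.Lock()
    work = Queue()
    work.put(("", "~"))  # (marker, exclusive upper bound); assumes
                         # printable-ASCII keys, so "~" sorts after all

    def midpoint(lo, hi):
        # Crude one-character midpoint - enough to fork a range in two.
        return chr((ord((lo or " ")[0]) + ord(hi[0])) // 2)

    def worker():
        while True:
            marker, hi = work.get()
            try:
                resp = s3.list_objects(Bucket=bucket, Marker=marker,
                                       MaxKeys=1000)
                # Keys past hi are discarded here and re-fetched by the
                # sibling range - the "about twice" estimate from above.
                page = [o["Key"] for o in resp.get("Contents", [])
                        if o["Key"] < hi]
                with lock:
                    keys.update(page)  # the set stores each key only once
                if resp.get("IsTruncated") and page:
                    nxt, mid = page[-1], midpoint(page[-1], hi)
                    if nxt < mid < hi:
                        work.put((mid, hi))   # new search halfway up
                        work.put((nxt, mid))  # continue from the marker
                    else:
                        work.put((nxt, hi))   # range too narrow to split
            finally:
                work.task_done()

    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker, daemon=True).start()
    work.join()  # returns once every queued range is fully listed
    return keys
```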

Since the requests complete on different threads, you need a lock around adding the keys. Then you have the problem of keeping that lock from slowing everything down, which depends on the language you're using, etc.
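In Python, for instance (continuing the sketch above), one way to keep the lock cheap is to acquire it once per 1000-key page rather than once per key:

```python
# One acquisition per page - contention is ~1000x lower...
with lock:
    keys.update(page)

# ...than adding keys one at a time:
for k in page:
    with lock:
        keys.add(k)
```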

If your process runs on an EC2 instance in the same region as the S3 bucket, you can do this faster. Suppose the bucket is in US "Standard". Then you're in luck: you can use something like Ruby with IronWorker to go in and pull down all the keys. When it finishes, it can post the result to your server, or create a file on S3 that is the list of all the keys, or similar. For other regions or languages, you may need to spin up your own EC2 instance.

I've found that listing S3 keys from an EC2 instance is much faster, because there is a lot of bandwidth for each request (and you don't pay for bandwidth between EC2 and S3). S3 does not gzip its responses, which are very fluffy XML, so the bandwidth between you and S3 matters a great deal.
