Java – In Hadoop MapReduce, will any class see the entire list of keys after sorting and before partitioning?
I'm using Hadoop to analyze data with a very uneven distribution. Some keys have many thousands of values, but most have only one. For example, network traffic associated with IP addresses has many packets associated with a few talkative IPs and only a few packets with most IPs. Put another way, the Gini index is very high.
To process this efficiently, each reducer should get either a few high-volume keys or a lot of low-volume keys, so as to end up with a roughly even load. If I were writing the partitioning step myself, I would know how to do it: take the sorted list of keys produced by the mappers (including all duplicate keys) and the number N of reducers, and place the splits at
split[i] = keys[floor(i*len(keys)/N)]
Reducer i would get keys k such that split[i] <= k < split[i+1] for 0 <= i < N-1, and split[i] <= k for i == N-1. I'm willing to write my own partitioner in Java, but Partitioner<KEY, VALUE> appears to have access to only one key-value record at a time, not the whole list. I know that Hadoop sorts the records produced by the mappers, so the list must exist somewhere. It may be distributed across several partitioner nodes, in which case I would run the splitting procedure on one of the sub-lists and somehow communicate the result to all the other partitioner nodes. (Assuming the chosen partitioner node sees a randomized subset of the data, the result would still be approximately load-balanced.) Does anyone know where the sorted list of keys is stored, and how to access it?
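To make the split rule above concrete, here is a minimal sketch of how the splits could be computed and how a key would be mapped to a reducer, assuming the full sorted key list (with duplicates) were somehow available. The class and method names are hypothetical, not part of any Hadoop API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the split-placement rule above.
public class SplitSketch {

    // split[i] = keys[floor(i * len(keys) / N)] for i = 0 .. N-1
    static String[] computeSplits(List<String> sortedKeys, int numReducers) {
        String[] splits = new String[numReducers];
        for (int i = 0; i < numReducers; i++) {
            splits[i] = sortedKeys.get((int) ((long) i * sortedKeys.size() / numReducers));
        }
        return splits;
    }

    // Reducer i gets keys k with split[i] <= k < split[i+1];
    // the last reducer takes everything >= split[N-1].
    static int partitionFor(String key, String[] splits) {
        int i = Arrays.binarySearch(splits, key);
        if (i >= 0) return i;                 // key is exactly a split point
        int insertion = -i - 1;               // index of first split greater than key
        return Math.max(0, insertion - 1);    // bucket whose lower bound is <= key
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "a", "a", "a", "b", "c", "d", "d", "e", "f");
        String[] splits = computeSplits(keys, 3);        // ["a", "a", "d"] for this data
        System.out.println(Arrays.toString(splits));
        System.out.println(partitionFor("c", splits));   // -> 1
    }
}
```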
I don't want to write two MapReduce jobs, one to find the splits and another to actually use them, because that seems wasteful (the mappers would have to do the same work twice). And this doesn't seem like an unusual problem: uneven distributions are common.
Solution
I've thought this over. If someone forced me, here is the high-level approach I would take.
1. In addition to the mapper logic that solves the business problem, write logic to gather whatever statistics the partitioner will need to distribute key-value pairs in a balanced way. Of course, each mapper will only see some of the data.
2. Each mapper can find out its own task ID and use it to build a unique file name in a designated HDFS folder to hold the gathered statistics. Write this file out in the cleanup() method that runs at the end of the task. (A sketch of steps 1 and 2 follows this list.)
3. Use lazy initialization in the partitioner to read all files in the designated HDFS directory. That gives you all the statistics gathered during the mapper phase. From there, implement whatever partitioning logic you need to partition the data correctly.
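Here is a minimal sketch of steps 1 and 2, assuming Text keys, a tab-separated input format, and a hypothetical statistics directory (`/tmp/key-stats`); none of these names come from the original description, and the "business logic" in map() is a placeholder.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Counts how often each key is emitted and, in cleanup(), writes the counts
// to a per-task file in a designated HDFS directory.
public class StatsCollectingMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Hypothetical location where every mapper drops its statistics file.
    static final String STATS_DIR = "/tmp/key-stats";

    private final Map<String, Long> keyCounts = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Placeholder business logic: the first tab-separated field is the key.
        String key = line.toString().split("\t", 2)[0];
        keyCounts.merge(key, 1L, Long::sum);
        context.write(new Text(key), line);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Unique per-task file name derived from the task ID.
        Path statsFile = new Path(STATS_DIR,
                context.getTaskAttemptID().getTaskID().toString());
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (PrintWriter out = new PrintWriter(fs.create(statsFile, true))) {
            for (Map.Entry<String, Long> e : keyCounts.entrySet()) {
                out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```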
All of this assumes that the partitioner isn't invoked until all the mappers have finished, but that's the best I've come up with so far.
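Under that assumption, step 3 could look roughly like the sketch below. The balancing rule itself is only a placeholder (it falls back to hashing); the point is the lazy initialization that reads every statistics file the mappers wrote.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// On first use, reads every statistics file the mappers wrote and builds
// whatever structure the balancing logic needs.
public class StatsAwarePartitioner extends Partitioner<Text, Text> implements Configurable {

    private Configuration conf;
    private Map<String, Long> keyCounts;   // aggregated counts, loaded lazily

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        if (keyCounts == null) {
            keyCounts = loadStats();       // deferred initialization
        }
        // Placeholder rule: real logic would use keyCounts to spread heavy keys
        // and group light ones; here we simply fall back to hashing.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    private Map<String, Long> loadStats() {
        Map<String, Long> counts = new HashMap<>();
        try {
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path(StatsCollectingMapper.STATS_DIR))) {
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(status.getPath())))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] parts = line.split("\t");
                        counts.merge(parts[0], Long.parseLong(parts[1]), Long::sum);
                    }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("Could not read mapper statistics", e);
        }
        return counts;
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
}
```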