Java – use Hadoop to process a large number of small files

I am using the Hadoop example program WordCount to process a large number of small files / web pages (ca. 2-3 KB each). Since this is far from the optimal file size for Hadoop, the program is very slow. I believe this is because the cost of setting up and tearing down each task is much higher than the work itself. These small files also cause the file-name namespace on the NameNode to fill up.

I read that I should use HDFS archives (HAR) in this case, but I don't know how to modify the WordCount program to read from such an archive. Can the program keep working without modification, or does it need some changes?

Even if I pack a lot of files into an archive, does the problem still exist, or will this improve performance? I read that the files inside an archive are not processed by one mapper but by many, which in my case (I guess) would not improve performance.

If this question is too simple, please understand that I am a Hadoop novice and have little experience with it.

Solution

Using HDFS does not change the fact that you are causing Hadoop to handle a large number of small files. The best option in this case is probably to concatenate the files into a single large file (or a few large files).
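One common way to do that concatenation in practice is to pack the pages into a Hadoop SequenceFile, using the file name as the key and the file contents as the value. Below is a minimal sketch of that idea; the class name PackSmallFiles, the local input directory, and the output path are made up for illustration:

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path(args[1]);            // e.g. /user/me/pages.seq on HDFS

        // One SequenceFile holds all pages: key = file name, value = file contents.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, output, Text.class, Text.class);
        try {
            // Assumes args[0] is a local directory containing the small pages.
            for (File page : new File(args[0]).listFiles()) {
                String content = new String(Files.readAllBytes(page.toPath()), "UTF-8");
                writer.append(new Text(page.getName()), new Text(content));
            }
        } finally {
            writer.close();
        }
    }
}

A job reading this output would use SequenceFileInputFormat and a mapper whose input key is Text rather than LongWritable; the wordcount logic itself stays the same.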

If you are running on a distributed system, using HDFS can improve performance. If you are only running pseudo-distributed (one machine), HDFS will not improve performance; the limitation is the machine.

When you operate on a large number of small files, you need a large number of mappers and reducers. The setup/teardown can be comparable to the processing time of the file itself, causing a large overhead. Concatenating the files reduces the number of mappers Hadoop runs for the job, which should improve performance.
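If rewriting the input data is not an option, newer Hadoop releases also let you reduce the mapper count at the job level with CombineTextInputFormat, which packs many small files into each split (on older versions you would have to subclass CombineFileInputFormat yourself). A rough driver sketch, assuming the stock TokenizerMapper and IntSumReducer classes from the hadoop-examples WordCount and a 64 MB split cap (both choices are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedWordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-combined");
        job.setJarByClass(CombinedWordCount.class);

        // Pack many small input files into each split instead of one file per mapper.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at roughly one HDFS block (64 MB here).
        CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        CombineTextInputFormat.addInputPath(job, new Path(args[0]));

        // Reuse the example wordcount mapper/reducer unchanged.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}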

The benefit of using HDFS to store the files shows up in distributed mode, with multiple machines. The files are stored in blocks (64 MB by default) across machines, and each machine can process the blocks of data that reside on it. This reduces network bandwidth use, so it does not become a bottleneck in processing.

As for archiving the files: if Hadoop is going to unarchive them, that will just result in Hadoop still having a large number of small files.
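That said, to answer the HAR part of the question: an unmodified wordcount can usually read a Hadoop Archive simply by being given a har:// input path, since the har filesystem is resolved like any other filesystem. A small sketch with a made-up archive path and helper name:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HarInputExample {
    // Inside your existing wordcount driver, only the input path changes:
    // the har:// scheme lets the job read the files stored inside the archive.
    public static void addHarInput(Job job) throws java.io.IOException {
        FileInputFormat.addInputPath(job, new Path("har:///user/me/pages.har"));
        // Each small file inside the archive is still a separate input split,
        // so a HAR eases NameNode memory pressure but not per-mapper overhead.
    }
}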

I hope this helps your understanding.
