Java – streaming or custom jars in Hadoop

I run streaming jobs on Hadoop (Amazon's EMR) with the mapper and reducer written in Python. I want to know whether I would see a speed increase if I implemented the same mapper and reducer in Java (or used Pig).

In particular, I'm looking for people's experiences of moving from streaming to custom jar deployments and/or Pig, and for documents benchmarking these options. I found this question, but the answers are not specific enough for me. I'm not looking for a comparison of Java versus Python, but a comparison of custom jar deployment versus Python-based streaming in Hadoop.

My job reads NGram counts from the Google Books NGram dataset and computes aggregate measures. It looks like CPU utilization on the compute nodes is close to 100% (I'd also like to hear your opinion on the difference between CPU-bound and IO-bound jobs).

Thank you!

AMAC

Solution

Why consider deploying custom jars?

>Ability to use more powerful custom input formats. For streaming jobs, even if pluggable I/O is used, as mentioned here, you are limited to having the keys and values of your mapper/reducer be text/strings, and you will need to spend some CPU cycles converting them to your required types.
>I've also heard that Hadoop can be smart about reusing JVMs across multiple jobs, which is not possible when streaming (unconfirmed).
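To make the typed-key/value point concrete, here is a minimal sketch of what a custom-jar mapper and reducer might look like for the n-gram counting use case from the question. The class names and the assumed tab-separated field layout (ngram, year, match_count, ...) are illustrative assumptions, not taken from the original post.

```java
// Hypothetical sketch of a custom-jar mapper/reducer for aggregating n-gram counts.
// The input field layout is an assumption: ngram \t year \t match_count \t ...
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NGramAggregate {

    public static class NGramMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {

        private final Text ngram = new Text();
        private final LongWritable count = new LongWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) {
                return; // skip malformed lines
            }
            ngram.set(fields[0]);
            count.set(Long.parseLong(fields[2]));
            // From here on the key and value stay typed Writables; there is no
            // extra text round-trip as there would be at a streaming boundary.
            context.write(ngram, count);
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {

        private final LongWritable total = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }
}
```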

When to use Pig?

>Pig Latin is pretty cool, and it is a much higher-level data-flow language than Java, Python, or Perl. Your Pig scripts will tend to be much smaller than an equivalent task written in any of the other languages.

When not to use Pig?

>Even though Pig itself is fairly smart about figuring out how many maps/reduces to use, when to spawn a map or reduce, and countless other such things, if you know exactly how many maps/reduces you need, you have some very specific computation to perform inside your map/reduce functions, and you care a lot about performance, then you should consider deploying your own jars. This link shows that Pig can lag behind native Hadoop M/R in performance. You could also look at writing your own Pig UDFs, which isolate the compute-intensive functions (and you can even use JNI to call native C/C++ code from within a UDF).
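As an illustration of the UDF suggestion, here is a minimal sketch of a Java Pig UDF that isolates a compute-intensive per-record calculation. The class name and the "measure" it computes are placeholders; in a Pig script it would be registered with REGISTER and then called like a built-in function.

```java
// Minimal sketch of a Pig EvalFunc UDF that isolates a per-record calculation.
// The inputs and the computed metric are hypothetical placeholders.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class NGramScore extends EvalFunc<Double> {

    @Override
    public Double exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        // Assumed inputs: match_count and volume_count for an n-gram.
        long matchCount = (Long) input.get(0);
        long volumeCount = (Long) input.get(1);
        if (volumeCount == 0) {
            return null;
        }
        // Placeholder measure: average occurrences per volume.
        return (double) matchCount / volumeCount;
    }
}
```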

Notes on IO-bound and CPU-bound jobs:

>Technically speaking, the whole point of Hadoop and MapReduce is to parallelize compute-intensive functions, so I'd assume your map and reduce jobs are compute-intensive. The only time the Hadoop subsystem is busy doing IO is in between the map and reduce phases, when data is sent across the network. Spills to disk can also occur if you have a large amount of data and have manually configured too few maps and reduces (although too many tasks will result in too much time spent starting and stopping JVMs, and in too many small files). A streaming job would additionally have the overhead of starting a Python/Perl VM and copying data back and forth between the JVM and the scripting VM.
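To show what manually configuring the number of reduces and reusing JVMs could look like in a custom-jar deployment, here is a hypothetical driver for the mapper/reducer sketched earlier. The reducer count of 32 is an arbitrary example, and the mapred.job.reuse.jvm.num.tasks property only applies to the older MRv1 framework (YARN/MRv2 dropped task-JVM reuse).

```java
// Hypothetical driver for the NGramAggregate sketch above: sets the reducer
// count explicitly and enables JVM reuse (an MRv1-era property).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NGramDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Reuse each task JVM for an unlimited number of tasks
        // (MRv1 property; ignored by YARN/MRv2).
        conf.set("mapred.job.reuse.jvm.num.tasks", "-1");

        Job job = Job.getInstance(conf, "ngram aggregate");
        job.setJarByClass(NGramDriver.class);
        job.setMapperClass(NGramAggregate.NGramMapper.class);
        job.setCombinerClass(NGramAggregate.SumReducer.class);
        job.setReducerClass(NGramAggregate.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Pick the reducer count by hand rather than leaving the default:
        // too few large reduces can spill to disk, too many create tiny files.
        job.setNumReduceTasks(32);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```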
