Java – running standalone Hadoop applications on multiple CPU cores
My team built a Java application that uses the Hadoop library to convert a pile of input files into useful output.
When I run this application from the command line (or from Eclipse or NetBeans), I haven't been able to convince it to run more than one map and/or reduce thread at a time. Given that the tool is very CPU intensive, this single-threadedness is my current bottleneck.
When running it in the NetBeans profiler, I did see that the application starts several threads for various purposes, but only one map/reduce is running at any given moment.
The input data consists of several input files, so Hadoop should at least be able to run one thread per input file at the same time for the map phase.
What can I do to get at least 2 or even 4 active threads running (which should be possible for most of this application's processing time)?
I expect this to be something very stupid that I've overlooked.
I just found this: https://issues.apache.org/jira/browse/MAPREDUCE-1367. This implements the feature I was looking for in Hadoop 0.21: it introduces the mapreduce.local.map.tasks.maximum flag to control it.
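If that flag is available (Hadoop 0.21 or later), it can be set on the job's Configuration before submission. A minimal sketch; the class name, job name, and the value 4 are placeholders of my choosing, and the mapper/reducer wiring is elided:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LocalParallelDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // MAPREDUCE-1367 (Hadoop 0.21+): allow the LocalJobRunner to run
        // up to 4 map tasks in parallel instead of the default single task.
        conf.setInt("mapreduce.local.map.tasks.maximum", 4);

        Job job = new Job(conf, "local-parallel-job");
        // ... set mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```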
I have now also found the solution described here in this question.
Solution
I'm not sure if I'm right, but when you run tasks in local mode, you can't have multiple mappers/reducers.
In any case, to set the maximum number of concurrently running mappers and reducers, use the configuration options mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. By default, these options are set to 2, so I may be right.
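For illustration, a sketch of setting these two properties; the class name and the value 4 are examples of mine. Note that each TaskTracker daemon reads these at startup, so on a real cluster they belong in the tasktracker's mapred-site.xml rather than in the per-job configuration, where setting them has no effect:

```java
import org.apache.hadoop.conf.Configuration;

public class TaskSlotConfig {
    public static Configuration withTaskSlots() {
        Configuration conf = new Configuration();
        // Number of map/reduce tasks a single TaskTracker will run
        // concurrently; both default to 2. These must be in the daemon's
        // own configuration before it starts to have any effect.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 4);
        return conf;
    }
}
```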
Finally, if you want to prepare for a multi-node cluster, run it straight away in fully distributed mode, but have all the servers (namenode, datanode, tasktracker, jobtracker, ...) run on a single machine.
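From the application's point of view, the difference between local mode and such a single-machine cluster is just where the configuration points. A rough sketch of the client-side settings for the pre-YARN Hadoop of this era; the localhost addresses and ports are the customary defaults and an assumption on my part, not required values:

```java
import org.apache.hadoop.conf.Configuration;

public class SingleMachineClusterConf {
    public static Configuration create() {
        Configuration conf = new Configuration();
        // Point HDFS and the JobTracker at the daemons running locally;
        // use whatever addresses your daemons actually bind to.
        conf.set("fs.default.name", "hdfs://localhost:9000");
        conf.set("mapred.job.tracker", "localhost:9001");
        // (Setting mapred.job.tracker to "local" instead would select the
        // LocalJobRunner, i.e. the local mode discussed above.)
        return conf;
    }
}
```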