Detailed explanation of data locality on Hadoop

Data locality in Hadoop refers to the proximity of the data to the mapper tasks that process it.

1. Why is data locality important?

When a dataset is stored in HDFS, it is divided into blocks and stored across the datanodes in the Hadoop cluster. When a MapReduce job is executed against the dataset, each mapper processes one of these blocks (an input split). If a mapper cannot obtain its data from the node on which it executes, the data has to be copied over the network from the datanode that holds it to the datanode running the mapper task. Now suppose a MapReduce job has more than 1000 mappers and every one of them tries to copy data from another datanode in the cluster at the same time: this causes serious network congestion, because all mappers attempt the copy simultaneously, which is far from ideal. It is therefore always cheaper and more effective to move the computation closer to the data than to move the data closer to the computation.
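The scale of the problem is easy to see with a back-of-envelope calculation. The block size and mapper count below are illustrative assumptions (128 MB is a common HDFS block size, and one input split per block is the usual case), not figures from the original text:

```python
# Rough estimate of the data that crosses the network when NO mapper is
# data-local. All numbers are illustrative assumptions.
BLOCK_SIZE_MB = 128   # common HDFS block size; one input split per block
NUM_MAPPERS = 1000    # mappers that each must fetch their block remotely

total_transfer_gb = BLOCK_SIZE_MB * NUM_MAPPERS / 1024
print(f"Data copied over the network: {total_transfer_gb:.0f} GB")  # 125 GB
```

Moving 125 GB across the cluster at once, all before any useful map work starts, is exactly the congestion the paragraph above describes.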

2. How is data proximity defined?

When the JobTracker (MRv1) or ApplicationMaster (MRv2) receives a request to run a job, it checks which nodes in the cluster have sufficient resources to execute the job's mappers and reducers. At the same time, serious consideration is given to deciding on which nodes the individual mappers will execute, based on where the data for each mapper is located.
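A minimal sketch of the information the scheduler works with: for each input split, HDFS reports which hosts hold a replica of the underlying block, and those hosts are the preferred execution sites for the mapper. The split IDs and host names below are illustrative, not real Hadoop identifiers:

```python
# Simplified model of the scheduler's input: each input split maps to the
# hosts that hold a replica of its block (3 replicas by default in HDFS).
# Split IDs and host names are illustrative assumptions.
split_locations = {
    "split-0": ["node1", "node4", "node7"],
    "split-1": ["node2", "node5", "node8"],
}

def preferred_hosts(split_id):
    """Hosts where running the mapper for this split would be data-local."""
    return split_locations.get(split_id, [])

print(preferred_hosts("split-0"))  # ['node1', 'node4', 'node7']
```

In real Hadoop this information comes from the NameNode's block-location metadata; the scheduler then tries to match each mapper to one of its preferred hosts, subject to available resources.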

3. Data Local

When the node holding the data is the same node on which the mapper executes, we call it data-local. In this case the computation is as close to the data as possible. The JobTracker (MRv1) or ApplicationMaster (MRv2) prefers to execute a mapper on the node that holds the data it needs.

4. Rack Local

Although data-local is the ideal choice, it is not always possible to execute the mapper on the same node as the data, due to resource constraints on a busy cluster. In such instances it is preferred to run the mapper on a different node, but on the same rack as the node that holds the data.

5. Different Rack

On busy clusters, sometimes even rack-local placement is impossible. In that case a node on a different rack is selected to execute the mapper, and the data is copied over the network from a node holding it to the node on the other rack. This is the least desirable situation.
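The three preference levels above can be sketched as a single selection function. The rack topology, node names, and tie-breaking rule here are illustrative assumptions, not Hadoop's actual scheduler logic:

```python
# Sketch of the three-tier placement preference: data-local, then rack-local,
# then off-rack. Topology and replica locations are illustrative assumptions.
rack_of = {
    "node1": "rack1", "node2": "rack1",
    "node3": "rack2", "node4": "rack2",
}

def pick_node(replica_hosts, free_nodes):
    """Return (chosen_node, locality) following the preference order above."""
    # 1. Data-local: a free node that already holds a replica of the block.
    for node in free_nodes:
        if node in replica_hosts:
            return node, "data-local"
    # 2. Rack-local: a free node on the same rack as some replica.
    replica_racks = {rack_of[h] for h in replica_hosts}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Off-rack: any free node; the block must be copied across racks.
    return free_nodes[0], "off-rack"

print(pick_node(["node1"], ["node1", "node3"]))  # ('node1', 'data-local')
print(pick_node(["node1"], ["node2", "node3"]))  # ('node2', 'rack-local')
print(pick_node(["node1"], ["node3", "node4"]))  # ('node3', 'off-rack')
```

Each call degrades gracefully: the busier the cluster (the fewer free nodes overlap with the replica hosts or their racks), the further down the preference list the choice falls.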
