Java – What is the difference between mapPartitions and combining a broadcast variable with map in Apache Spark?

In Spark, we use broadcast variables to give each machine a read-only copy of a variable. We usually create a broadcast variable outside the closure (for example, a lookup table that the closure needs) to improve performance.

Spark also has a transformation called mapPartitions, which seems to achieve the same thing (using shared variables to improve performance). For example, in mapPartitions we can share one database connection across each partition.

So what is the difference between the two? Can they be used interchangeably to share variables?

Solution

A broadcast variable is used to send an object to every worker node. The object is shared among all partitions processed on that node, and its value is the same on every node in the cluster. The goal of broadcasting is to save network cost when the same data is used by many different tasks / partitions on a worker node.
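As a rough sketch of the broadcast pattern in Java: the lookup logic below is plain Java, so it runs without a cluster, and the comment shows how it might be wired into Spark through `JavaSparkContext.broadcast` and `Broadcast.value()`. The class name `CountryLookup`, the table contents, and the `"unknown"` fallback are illustrative assumptions, not something from the question.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical example class: resolves country codes against a small lookup table.
class CountryLookup {

    // Builds the read-only table that would be created once on the driver.
    static Map<String, String> buildTable() {
        Map<String, String> table = new HashMap<>();
        table.put("DE", "Germany");
        table.put("FR", "France");
        return table;
    }

    // Per-element function: it only reads the table and never modifies it.
    static String resolve(Map<String, String> table, String code) {
        return table.getOrDefault(code, "unknown");
    }

    // With Spark on the classpath, the wiring would look roughly like this
    // (sc is a JavaSparkContext, codes is a JavaRDD<String>):
    //
    //   Broadcast<Map<String, String>> bc = sc.broadcast(buildTable());
    //   JavaRDD<String> names = codes.map(c -> resolve(bc.value(), c));
    //
    // The table is then cached on each worker machine rather than
    // shipped along with every task.

    public static void main(String[] args) {
        Map<String, String> table = buildTable();
        List<String> names = Arrays.asList("DE", "FR", "XX").stream()
                .map(c -> resolve(table, c))
                .collect(Collectors.toList());
        System.out.println(names); // prints [Germany, France, unknown]
    }
}
```

The important property is that `resolve` treats the table as read-only, which is what makes it safe to share the same copy across all tasks on a machine.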

In contrast, mapPartitions is a method available on RDDs that works like map, except that it operates on a whole partition at a time rather than on individual elements. Yes, you can create new objects inside it, such as a JDBC connection, and each such object is then unique to its partition. However, you cannot share it between different partitions, let alone between different nodes.
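The per-partition behaviour can be sketched in plain Java as well. `FakeConnection` is a made-up stand-in for a JDBC connection, and `PartitionDemo` is a hypothetical class name. `processPartition` has the same shape (an `Iterator` in, an `Iterator` out) as the function that `JavaRDD.mapPartitions` accepts in Spark 2.x and later, so it could be passed to Spark as shown in the comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an expensive resource such as a JDBC connection.
class FakeConnection implements AutoCloseable {
    static int opened = 0; // counts how many connections were created

    FakeConnection() { opened++; }

    String lookup(String key) { return key.toUpperCase(); }

    @Override
    public void close() { /* a real connection would be released here */ }
}

class PartitionDemo {

    // Receives an iterator over one partition and returns an iterator of results.
    static Iterator<String> processPartition(Iterator<String> rows) {
        List<String> out = new ArrayList<>();
        // One connection per partition, reused for every row in it.
        try (FakeConnection conn = new FakeConnection()) {
            while (rows.hasNext()) {
                out.add(conn.lookup(rows.next()));
            }
        }
        return out.iterator();
    }

    // With Spark on the classpath this could be wired up roughly as:
    //   JavaRDD<String> result = rdd.mapPartitions(PartitionDemo::processPartition);

    public static void main(String[] args) {
        // Two lists stand in for two partitions of the same dataset.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"));
        for (List<String> p : partitions) {
            processPartition(p.iterator()).forEachRemaining(System.out::println);
        }
        System.out.println("connections opened: " + FakeConnection.opened);
    }
}
```

Five rows are processed with only two connections, one per partition, and neither connection is visible to the other partition. Note that this sketch buffers a partition's results in a list before closing the connection, which is simple but holds the whole partition's output in memory.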
