Java – What is the difference between using mapPartitions and combining broadcast variables with map in Apache Spark?
In Spark, we use broadcast variables to give each machine a read-only copy of a variable. We usually create the broadcast variable outside the closure (for example, a lookup table that the closure needs) to improve performance.
We also have a Spark transformation called mapPartitions, which seems to serve the same purpose (sharing variables to improve performance). For example, in mapPartitions we can share one database connection per partition.
So what is the difference between the two? Can they be used interchangeably to share variables?
Solution
A broadcast variable is used to send an object to every worker node. The object is shared by all partitions processed on that node, and its value is the same on every node in the cluster. The goal of broadcasting is to save network cost when the same data is used by many different tasks / partitions on a worker node.
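To make the network-cost argument concrete, here is a minimal sketch in plain Java that needs no Spark installation. It is only a counting model under simplifying assumptions, not Spark's actual transport, and real Spark applies further optimizations when it ships tasks. The class name `BroadcastModel`, the executor names, and the lookup table are all made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Spark-free counting model (hypothetical names): how many copies of a
// lookup table have to travel to the workers in each approach.
class BroadcastModel {

    // Model of a plain closure variable: every task carries its own copy,
    // so the number of copies equals the number of tasks.
    static int copiesShippedPerTask(List<String> executorOfTask) {
        return executorOfTask.size();
    }

    // Model of a broadcast variable: the first task on an executor fetches
    // the table, later tasks on that executor reuse the cached copy.
    static int copiesShippedWithBroadcast(List<String> executorOfTask,
                                          Map<String, Integer> table) {
        Map<String, Map<String, Integer>> cachePerExecutor = new HashMap<>();
        int copies = 0;
        for (String executor : executorOfTask) {
            if (!cachePerExecutor.containsKey(executor)) {
                // One read-only copy per executor.
                cachePerExecutor.put(executor,
                        Collections.unmodifiableMap(new HashMap<>(table)));
                copies++;
            }
        }
        return copies;
    }

    public static void main(String[] args) {
        // Six tasks (one per partition) spread over two executors.
        List<String> executorOfTask = Arrays.asList(
                "exec-A", "exec-A", "exec-A", "exec-B", "exec-B", "exec-B");
        Map<String, Integer> table = new HashMap<>();
        table.put("us", 1);
        table.put("de", 2);

        System.out.println(copiesShippedPerTask(executorOfTask));
        System.out.println(copiesShippedWithBroadcast(executorOfTask, table));
    }
}
```

With six tasks on two executors, the model ships six copies in the first case and two in the second, and every task on an executor sees the same read-only table.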
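In contrast, mapPartitions is a method available on RDDs that works like map, except that it operates on a whole partition at a time instead of on single elements. Yes, you can create new objects inside it, such as a JDBC connection, and each such object will be unique to its partition. However, you cannot share it between different partitions, let alone between different nodes. So the two are not interchangeable: a broadcast variable distributes the same read-only data to every node, while mapPartitions lets you set up a resource once per partition.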
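The per-partition behaviour can be sketched in plain Java without Spark. This is only a model of the calling pattern, not Spark's API: `MapPartitionsModel`, `FakeConnection` and its `lookup` method are hypothetical stand-ins (the fake connection plays the role of something expensive like a JDBC connection), and partitions are represented as simple lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Spark-free model (hypothetical names) of map-style versus
// mapPartitions-style processing with an expensive resource.
class MapPartitionsModel {

    // Stand-in for an expensive resource; counts how often it is opened.
    static class FakeConnection implements AutoCloseable {
        FakeConnection(AtomicInteger opened) { opened.incrementAndGet(); }
        String lookup(int id) { return "row-" + id; }
        @Override public void close() { /* nothing to release in the model */ }
    }

    // map-style: the function sees one element at a time, so a connection
    // created inside it is opened once per element.
    static List<String> mapEach(List<List<Integer>> partitions, AtomicInteger opened) {
        List<String> out = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            for (int id : partition) {
                try (FakeConnection conn = new FakeConnection(opened)) {
                    out.add(conn.lookup(id));
                }
            }
        }
        return out;
    }

    // mapPartitions-style: the function receives an iterator over the whole
    // partition, so one connection serves every element in that partition.
    static List<String> mapPartitions(List<List<Integer>> partitions, AtomicInteger opened) {
        List<String> out = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            Iterator<Integer> it = partition.iterator();
            try (FakeConnection conn = new FakeConnection(opened)) {
                while (it.hasNext()) {
                    out.add(conn.lookup(it.next()));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3), Arrays.asList(4, 5, 6));
        AtomicInteger perElement = new AtomicInteger();
        AtomicInteger perPartition = new AtomicInteger();

        List<String> a = mapEach(partitions, perElement);
        List<String> b = mapPartitions(partitions, perPartition);

        System.out.println(a.equals(b));
        System.out.println(perElement.get() + " vs " + perPartition.get());
    }
}
```

Both variants produce the same rows, but for two partitions of three elements each the first opens six connections and the second opens two. Note that each connection still lives and dies inside its own partition, which is exactly why it cannot be shared the way a broadcast value is.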