Spring Batch: differences between multithreading and partitioning
I can't understand the difference between multithreading and partitioning in Spring Batch. The implementation is of course different: for partitioning, you have to prepare the partitions first and then process them. I want to know the difference between the two, and which is more effective when the bottleneck is the item processor.
Solution
TL;DR
Spring Batch scalability overview. There are five options for scaling a Spring Batch job:
1. Multi-threaded step
2. Parallel steps
3. Partitioning
4. Remote chunking
5. AsyncItemProcessor/AsyncItemWriter
Each has its own pros and cons. Let's look at each:
Multi-threaded step: a multi-threaded step takes a single step and executes each chunk within that step on a separate thread. This means the same instances of each batch component (reader, writer, etc.) are shared across the threads. In most cases this improves performance by adding some parallelism to the step, but at the cost of restartability. You sacrifice restartability because, in most cases, the ability to restart depends on the state maintained within the reader/writer/etc. With multiple threads updating that state, it becomes invalid and useless for a restart. Because of this, you typically need to turn save-state off on the individual components and set the restartable flag to false on the job.
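As a rough sketch (assuming Spring Batch 4's Java configuration), the only real change from an ordinary chunk-oriented step is the taskExecutor; the counter-based reader, chunk size, and throttle limit below are illustrative placeholders:

```java
import java.util.concurrent.atomic.AtomicInteger;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class MultiThreadedStepConfig {

    @Bean
    public Step multiThreadedStep(StepBuilderFactory stepBuilderFactory) {
        // Thread-safe stand-in reader; a real job would use a thread-safe
        // reader (e.g. a paging reader) with saveState disabled.
        AtomicInteger counter = new AtomicInteger();
        ItemReader<String> reader = () -> {
            int next = counter.incrementAndGet();
            return next <= 1_000 ? "item-" + next : null; // null ends the step
        };

        return stepBuilderFactory.get("multiThreadedStep")
                .<String, String>chunk(100)
                .reader(reader)
                .writer(items -> items.forEach(System.out::println))
                .taskExecutor(new SimpleAsyncTaskExecutor("mt-step-")) // chunks run on separate threads
                .throttleLimit(4) // at most 4 chunks in flight at once
                .build();
    }
}
```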
Parallel steps: parallel steps are accomplished using a split. A split lets you execute multiple independent steps (flows) in parallel via threads. This does not sacrifice restartability, but it does not help improve the performance of an individual step or piece of business logic.
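A minimal split sketch (again assuming Java configuration); step1, step2, and step3 are assumed to be existing Step beans:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.job.builder.FlowBuilder;
import org.springframework.batch.core.job.flow.Flow;
import org.springframework.batch.core.job.flow.support.SimpleFlow;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class ParallelStepsConfig {

    @Bean
    public Job parallelStepsJob(JobBuilderFactory jobBuilderFactory,
                                Step step1, Step step2, Step step3) {
        // Each independent step is wrapped in its own flow.
        Flow flow1 = new FlowBuilder<SimpleFlow>("flow1").start(step1).build();
        Flow flow2 = new FlowBuilder<SimpleFlow>("flow2").start(step2).build();

        // The split runs both flows in parallel on separate threads.
        Flow splitFlow = new FlowBuilder<SimpleFlow>("splitFlow")
                .split(new SimpleAsyncTaskExecutor("split-"))
                .add(flow1, flow2)
                .build();

        return jobBuilderFactory.get("parallelStepsJob")
                .start(splitFlow)
                .next(step3) // runs only after both parallel flows finish
                .end()
                .build();
    }
}
```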
Partitioning: partitioning divides the data up front into smaller segments (called partitions) via a master step, and then has slaves work on those partitions independently. In Spring Batch, the master and each slave are independent steps, so you get the benefits of parallelism within a single step without sacrificing restartability. Partitioning also provides the ability to scale beyond a single JVM, because the slaves do not have to be local (you can use various communication mechanisms to communicate with remote slaves).
An important note about partitioning is that the only thing communicated between the master and the slaves is a description of the data, not the data itself. For example, the master may tell slave1 to process records 1-100 and slave2 to process records 101-200. The master does not send the actual data, only the information a slave needs to obtain the data it is supposed to process. Because of this, the data must be local to the slave processes, while the master can be located anywhere.
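A local-partitioning sketch (hypothetical 1000-record id range, Java configuration): the Partitioner only builds descriptions of the data, here min/max ids placed into each partition's ExecutionContext, and the assumed workerStep bean would read its own id slice, typically via a @StepScope reader bound to the step execution context:

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionedStepConfig {

    // Builds the data descriptions: each partition gets its own id range.
    @Bean
    public Partitioner rangePartitioner() {
        return gridSize -> {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            int rangeSize = 1_000 / gridSize; // hypothetical 1000 records total
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext context = new ExecutionContext();
                context.putInt("minId", i * rangeSize + 1);
                context.putInt("maxId", (i + 1) * rangeSize);
                partitions.put("partition" + i, context);
            }
            return partitions;
        };
    }

    // Master step: fans the partitions out to copies of the worker step.
    @Bean
    public Step masterStep(StepBuilderFactory stepBuilderFactory, Step workerStep) {
        return stepBuilderFactory.get("masterStep")
                .partitioner("workerStep", rangePartitioner())
                .step(workerStep) // each worker reads only its minId..maxId slice
                .gridSize(4)
                .taskExecutor(new SimpleAsyncTaskExecutor("partition-")) // local partitioning; remote variants swap this out
                .build();
    }
}
```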
Remote chunking: remote chunking allows you to scale the processing and, optionally, the writing logic across JVMs. In this use case, the master reads the data and sends it over the wire to the slaves, where it is processed and then either written locally by the slave or returned to the master to be written locally on the master.
The important difference between partitioning and remote chunking is that instead of a description going over the wire, remote chunking sends the actual data over the wire. So instead of a single message saying "process records 1-100", remote chunking sends the actual records 1-100. This can have a big impact on the I/O profile of the step, but if the ItemProcessor is enough of a bottleneck, it can be useful.
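An illustrative sketch of the master side only, not a complete setup, assuming spring-batch-integration's RemoteChunkingManagerStepBuilderFactory (recent versions; older ones name it ...MasterStepBuilderFactory): the step reads locally and pushes the actual items onto an outbound channel, and collects worker results from an inbound channel. The in-memory channels and the tiny reader here are placeholders; in a real deployment the channels would be bridged to middleware (e.g. RabbitMQ or Kafka) and a matching worker configuration is needed on the other JVM:

```java
import java.util.Arrays;

import org.springframework.batch.core.step.tasklet.TaskletStep;
import org.springframework.batch.integration.chunk.RemoteChunkingManagerStepBuilderFactory;
import org.springframework.batch.integration.config.annotation.EnableBatchIntegration;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.support.ListItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.channel.DirectChannel;
import org.springframework.integration.channel.QueueChannel;

@Configuration
@EnableBatchIntegration
public class RemoteChunkingManagerConfig {

    // Outbound channel: actual items (not just descriptions) are sent to workers.
    @Bean
    public DirectChannel requests() {
        return new DirectChannel();
    }

    // Inbound channel: workers send back the result of each chunk.
    @Bean
    public QueueChannel replies() {
        return new QueueChannel();
    }

    @Bean
    public TaskletStep managerStep(
            RemoteChunkingManagerStepBuilderFactory managerStepBuilderFactory) {
        ItemReader<String> reader = new ListItemReader<>(Arrays.asList("a", "b", "c"));
        return managerStepBuilderFactory.get("managerStep")
                .chunk(100)
                .reader(reader)            // reading stays on the master/manager
                .outputChannel(requests()) // chunks of items go over the wire
                .inputChannel(replies())   // results/acknowledgements come back
                .build();
    }
}
```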
AsyncItemProcessor/AsyncItemWriter: the final option for scaling Spring Batch processing is the AsyncItemProcessor/AsyncItemWriter combination. In this case, the AsyncItemProcessor wraps your ItemProcessor implementation and executes the call to your implementation on a separate thread. The AsyncItemProcessor then returns a Future, which is passed on to the AsyncItemWriter, where it is unwrapped and passed to the delegate ItemWriter implementation.
Because of the nature of how data flows through this option, certain listener scenarios are not supported (since the outcome of the ItemProcessor call is not known until the ItemWriter), but overall it can be a useful tool for parallelizing just the ItemProcessor logic within a single JVM without sacrificing restartability.
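A sketch of the combination (both classes live in spring-batch-integration); the delegate reader, processor, and writer beans plus the String item type are placeholders, and note how the step's output type becomes Future<String>:

```java
import java.util.concurrent.Future;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class AsyncProcessorConfig {

    @Bean
    public Step asyncStep(StepBuilderFactory stepBuilderFactory,
                          ItemReader<String> reader,
                          ItemProcessor<String, String> processor, // your existing processor
                          ItemWriter<String> writer) {             // your existing writer

        // Each ItemProcessor call runs on its own thread and returns a Future.
        AsyncItemProcessor<String, String> asyncProcessor = new AsyncItemProcessor<>();
        asyncProcessor.setDelegate(processor);
        asyncProcessor.setTaskExecutor(new SimpleAsyncTaskExecutor("async-"));

        // Unwraps each Future and hands the result to the delegate writer.
        AsyncItemWriter<String> asyncWriter = new AsyncItemWriter<>();
        asyncWriter.setDelegate(writer);

        return stepBuilderFactory.get("asyncStep")
                .<String, Future<String>>chunk(100)
                .reader(reader)
                .processor(asyncProcessor)
                .writer(asyncWriter)
                .build();
    }
}
```

Only the processor call is parallelized here, which is why this option is often the simplest fit when the ItemProcessor is the bottleneck, as in the question.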