Multithreading – Apache Spark standalone mode: number of cores
I am trying to understand the basics of Spark internals, in particular what the Spark documentation says about submitting applications in local mode through the spark-submit --master setting (local[K], local[*]).
Since all the data is stored on a single local machine, it does not benefit from distributed operations on RDDs.
So how does Spark benefit from using several logical cores, and what is happening internally when it does?
Solution
The system will allocate additional threads to process the data. Although limited to one machine, it can still take advantage of the high degree of parallelism available in modern servers.
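As a rough illustration of that idea, here is a toy model in plain Python (not Spark code). It splits a data set into partitions and hands one task per partition to a fixed pool of worker threads, which is roughly how local[N] divides the work. The names run_local, sum_partition, and the slicing scheme are made up for this example. Note that Python threads do not execute CPU-bound work in parallel because of the interpreter lock, so this only shows how the work is divided; Spark's JVM threads do run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def sum_partition(index, partition):
    """One 'task': process a single partition on whichever worker thread picks it up."""
    return index, threading.current_thread().name, sum(partition)


def run_local(data, num_partitions, num_threads):
    """Toy stand-in for local[num_threads]: one task per partition, fixed thread pool."""
    # Split the data into roughly equal slices (hypothetical partitioning scheme).
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_threads, thread_name_prefix="worker") as pool:
        futures = [pool.submit(sum_partition, i, p) for i, p in enumerate(partitions)]
        results = [f.result() for f in futures]
    # Combine the per-partition results, similar to a final reduce step.
    total = sum(partial for _, _, partial in results)
    return total, results


total, results = run_local(list(range(1000)), num_partitions=12, num_threads=4)
for index, worker, partial in results:
    print(f"partition {index:2d} handled by {worker}: partial sum {partial}")
print("total:", total)
```

With 12 partitions and 4 threads, each worker ends up handling several partitions in turn, and the final answer is the same no matter how many threads are used.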
If you have a reasonably large data set, say with a dozen partitions, you can measure the time taken with local[1] versus local[n] (where n is the number of cores in your machine). You can also see the difference in machine utilization. If you specify only one core, Spark will use only 100% of one core (plus a little extra for garbage collection). If you have 4 cores and specify local[4], it will use 400% of a core (all 4 cores), and the execution time can drop significantly (although usually not by a full factor of 4).
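The shape of that comparison can be mimicked without Spark. The toy model below (plain Python, not Spark code) runs a dozen simulated partitions first on 1 worker thread and then on 4, and times both runs. Here time.sleep stands in for the per-partition work, and process_partition, timed_run, and the 0.05 second cost are assumptions made purely for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 12   # "a dozen partitions"
WORK_SECONDS = 0.05   # assumed cost of processing one partition


def process_partition(index):
    """Simulated task: sleep in place of real work, then return a result."""
    time.sleep(WORK_SECONDS)
    return index * index


def timed_run(num_threads):
    """Run all partitions on a pool of num_threads workers and time it."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(process_partition, range(NUM_PARTITIONS)))
    return results, time.perf_counter() - start


results_1, seconds_1 = timed_run(1)   # analogous to local[1]
results_4, seconds_4 = timed_run(4)   # analogous to local[4]

print(f"1 thread : {seconds_1:.2f} s")
print(f"4 threads: {seconds_4:.2f} s")
print(f"speedup  : {seconds_1 / seconds_4:.1f}x")
```

This idealized model scales almost perfectly because the simulated tasks are identical and independent. Real Spark jobs also pay for scheduling, garbage collection, shuffles, and uneven partitions, which is why the observed speedup usually falls short of the core count. To repeat the measurement against Spark itself, submit the same application once with --master local[1] and once with --master local[4] and compare the wall-clock times.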