Multithreading – how do I finish the same calculations faster on a 4-core CPU: with 4 threads or with 50 threads?
Let's assume we have a fixed amount of computing work, with no blocking, sleeping, or I/O waiting. The work parallelizes well: it consists of 100 million small, independent computing tasks.
On a 4-core CPU, which is faster: running 4 threads or, say, 50? And why should the second variant be any worse?
I assume that when you run four heavy threads on a 4-core CPU with no other process or thread occupying the CPU, the scheduler does not move the threads between cores; it has no reason to in this situation. Core 0 (the main CPU) will still execute the hardware timer interrupt handler 250 times per second (the default Linux configuration) and other hardware interrupt handlers, but the other cores should be left largely undisturbed.
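If you want to take the scheduler out of the picture entirely, you can also pin the threads yourself. A minimal sketch, assuming Linux and the GNU extension pthread_setaffinity_np; the busy loop is just a stand-in workload and error handling is omitted:

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

// Pure CPU burn so each pinned thread keeps its core busy.
static void heavy_work(int core) {
    volatile double x = 0.0;
    for (long i = 0; i < 400'000'000L; ++i) x += 1e-9;
    std::printf("core %d done (x = %f)\n", core, static_cast<double>(x));
}

int main() {
    unsigned n = std::thread::hardware_concurrency();    // e.g. 4
    std::vector<std::thread> pool;
    for (unsigned core = 0; core < n; ++core) {
        pool.emplace_back(heavy_work, static_cast<int>(core));
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);                  // allow exactly one core
        pthread_setaffinity_np(pool.back().native_handle(),
                               sizeof(set), &set);
    }
    for (auto& t : pool) t.join();
}
```

With the threads pinned one-to-one like this, any remaining slowdown cannot be blamed on thread migration.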
What is the cost of a context switch? The time to store and restore the CPU registers for the different contexts? And what about the CPU's internal caches, the pipeline, and the various branch predictors? Can we say that every context switch damages the caches, the pipeline, and some of the decoding machinery in the CPU? So could more threads executing on a single core get less total work done than if they ran serially?
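You can ballpark the direct cost yourself. A rough sketch: two threads hand a token back and forth through a mutex and condition variable, so each round trip forces wake-ups and, when both threads are confined to one core, context switches. Note this only captures the direct cost, not the cache and predictor damage asked about above, which is usually the bigger part:

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int main() {
    const int kRounds = 200'000;
    std::mutex m;
    std::condition_variable cv;
    bool ping = true;                        // whose turn it is

    auto side = [&](bool mine) {
        for (int i = 0; i < kRounds; ++i) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return ping == mine; });
            ping = !mine;                    // hand the turn over
            cv.notify_one();
        }
    };

    auto t0 = std::chrono::steady_clock::now();
    std::thread a(side, true), b(side, false);
    a.join(); b.join();
    double s = std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();
    // roughly 2 switches per round trip when both threads share one core
    std::printf("~%.0f ns per switch (upper bound, includes futex cost)\n",
                s / (2.0 * kRounds) * 1e9);
}
```

Run it as `taskset -c 0 ./a.out` on Linux to force both threads onto a single core.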
The interaction of caching and other hardware optimizations with a multithreaded environment is the most interesting question for me right now.
Solution
As @Baile mentions in the comments, this is highly application-, system-, and environment-specific.
So I'm not going to take the hard-line approach of saying one thread per core (or 2 threads per core in the case of hyper-threading).
But as an experienced shared-memory programmer, I have seen from experience that the optimal number of threads (for a 4-core machine) can range anywhere from 1 to 64.
Now I will enumerate the situations that can cause this range:
Optimal threads < number of cores: In certain very fine-grained parallel tasks (such as a small FFT), the threading overhead is the dominant performance factor. In some cases it doesn't help to parallelize at all; in others you get a speedup with 2 threads but backwards scaling at 4. Another issue is resource contention: even if you have a highly parallelizable task that splits easily across 4 cores/threads, you may be bottlenecked by memory bandwidth and cache effects. Often you will find that 2 threads are just as fast as 4 (this seems to happen frequently with very large FFTs).

Optimal threads = number of cores: This is the ideal case and needs no explanation: one thread per core. Most embarrassingly parallel applications that are not memory- or I/O-bound fit right here.

Optimal threads > number of cores: This is where it gets interesting... very interesting. Have you heard of load imbalance? How about over-decomposition and work stealing? Many parallelizable applications are irregular, meaning the tasks do not split into sub-tasks of equal size. So if you end up splitting a large task into 4 unequal pieces, assigning them to 4 threads, and running them on 4 cores... the result? Poor parallel performance, because one thread happens to have 10x the work of the others. The common solution here is to over-decompose the task into many sub-tasks. You can either create a thread for each of them (so threads >> cores), or use a task scheduler with a fixed number of threads. Not every task is suited to both approaches, but quite often, over-decomposing a task to 8 or 16 threads on a 4-core machine gives the best results (a minimal sketch follows below).
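To make the over-decomposition idea concrete, here is a minimal sketch of the second approach: a fixed pool of 4 workers pulling 64 deliberately unequal chunks off a shared atomic counter (simple self-scheduling, not a full work-stealing scheduler; the chunk sizes and counts are illustrative assumptions):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for real work; chunk sizes are deliberately unequal to
// mimic an irregular workload.
static double process_chunk(int chunk) {
    long iters = 1'000'000L * (1 + chunk % 10);
    volatile double x = 0.0;
    for (long i = 0; i < iters; ++i) x += 1e-9;
    return x;
}

int main() {
    const int kChunks  = 64;                 // >> number of cores
    const int kThreads = 4;                  // one worker per core
    std::atomic<int> next{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < kThreads; ++t) {
        pool.emplace_back([&] {
            // Each worker grabs the next unprocessed chunk; fast workers
            // simply end up processing more chunks, balancing the load.
            for (int c; (c = next.fetch_add(1)) < kChunks; )
                process_chunk(c);
        });
    }
    for (auto& t : pool) t.join();
    std::printf("done: %d chunks on %d threads\n", kChunks, kThreads);
}
```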
Although spawning more threads can give better load balancing, the overhead grows with the count. So there is usually a sweet spot somewhere; I have seen as many as 64 threads pay off on 4 cores. But as mentioned above, it is highly application-specific. You need to experiment.
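Since experimenting is the only real answer, a minimal harness sketch: hold the total amount of work fixed and sweep the thread count. The busy loop is a stand-in for the 100 million small tasks, and the total iteration count is arbitrary:

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Fixed amount of pure CPU work, split evenly across the threads.
static void burn(long iters) {
    volatile double x = 0.0;
    for (long i = 0; i < iters; ++i) x += 1e-9;
}

int main() {
    const long kTotal = 1'600'000'000L;      // fixed total work
    for (int t : {1, 2, 4, 8, 16, 32, 64}) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int i = 0; i < t; ++i)
            pool.emplace_back(burn, kTotal / t);
        for (auto& th : pool) th.join();
        double s = std::chrono::duration<double>(
                       std::chrono::steady_clock::now() - t0).count();
        std::printf("%2d threads: %6.2f s\n", t, s);
    }
}
```

Whichever thread count wins on your hardware, with your real workload substituted for the busy loop, is your sweet spot.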
EDIT: Expanding the answer to address the questions more directly.
This is very environment-dependent, and somewhat difficult to measure directly. This might be a good read.
Short answer: Yes. When you context-switch out, you will likely flush the pipeline and mess up all the predictors. Same with the caches: the new thread is likely to replace the cached data with its own.
There is a catch, though. In some applications where the threads share the same data, it's possible for one thread to "warm up" the cache for another incoming thread, or for another thread on a core that shares the same cache. (Although rare, I've seen this happen before on one of my NUMA machines: a superlinear speedup of 17.6x across 16 cores!?!?!)
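For the curious, a rough sketch of the warming effect: one thread reads a buffer, then a second thread reads it again and, if the two threads' cores happen to share the last-level cache, finds it already warm. The buffer sizes are pure assumptions about your cache hierarchy, and results vary wildly across machines:

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

static double timed_sum(const std::vector<long>& v) {
    auto t0 = std::chrono::steady_clock::now();
    volatile long s = std::accumulate(v.begin(), v.end(), 0L);
    (void)s;
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - t0).count();
}

// Stream through a big scratch buffer to (probabilistically) push the
// data of interest out of the cache; 64 MB is an assumed LLC upper bound.
static void evict_caches() {
    static std::vector<long> junk(1 << 23, 1);
    volatile long s = std::accumulate(junk.begin(), junk.end(), 0L);
    (void)s;
}

int main() {
    std::vector<long> data(1 << 19);     // 4 MB, assumed to fit in the LLC
    std::iota(data.begin(), data.end(), 0L);

    double cold = 0, warm = 0;
    evict_caches();
    std::thread([&] { cold = timed_sum(data); }).join();  // cold read

    std::thread([&] { timed_sum(data); }).join();         // warm the cache
    std::thread([&] { warm = timed_sum(data); }).join();  // read warmed data

    std::printf("cold: %.4f s   warm: %.4f s\n", cold, warm);
}
```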
Hard to say, hard to say... Short of hyper-threading, there will definitely be overhead. But I've read a paper in which someone used a second thread to prefetch for the main thread... Yes, it's crazy.
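In the spirit of that idea, a toy sketch of helper-thread prefetching, using the GCC/Clang builtin __builtin_prefetch. The main thread does a cache-unfriendly random-access sum while a second thread runs ahead issuing prefetches for the addresses the main thread will need soon. The array sizes and lead distance are assumptions, and whether this helps at all is extremely hardware-dependent:

```cpp
#include <atomic>
#include <cstdio>
#include <random>
#include <thread>
#include <vector>

int main() {
    const int N = 1 << 23;                 // 8M random accesses
    std::vector<int>  idx(N);
    std::vector<long> table(N);
    std::mt19937 rng(42);
    for (int i = 0; i < N; ++i) { idx[i] = rng() % N; table[i] = i; }

    std::atomic<int>  pos{0};              // main thread's progress
    std::atomic<bool> done{false};

    // Helper: repeatedly prefetch the next ~512 table entries the
    // main thread is about to touch.
    std::thread helper([&] {
        while (!done.load(std::memory_order_relaxed)) {
            int p = pos.load(std::memory_order_relaxed);
            for (int i = p; i < p + 512 && i < N; ++i)
                __builtin_prefetch(&table[idx[i]]);
        }
    });

    long sum = 0;
    for (int i = 0; i < N; ++i) {
        sum += table[idx[i]];              // cache-unfriendly access
        if ((i & 1023) == 0) pos.store(i, std::memory_order_relaxed);
    }
    done.store(true);
    helper.join();
    std::printf("sum = %ld\n", sum);
}
```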