Multithreading – a memset in parallel with the threads bound to each physical core
I have been testing the code from the question "In an OpenMP parallel code, would there be any benefit for memset to be run in parallel?" and I'm observing something unexpected.
My system is a single-socket Xeon E5-1620. It is an Ivy Bridge processor with 4 physical cores and 8 hyper-threads. I use Ubuntu 14.04 LTS, Linux kernel 3.13, GCC 4.9.0, and EGLIBC 2.19. I compile with gcc -fopenmp -O3 mem.c.
When I run the code from the link, it defaults to eight threads and gives:
Touch: 11830.448 MB/s Rewrite: 18133.428 MB/s
However, when I bind the threads and set the number of threads to the number of physical cores, like this:
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
I get:
Touch: 22167.854 MB/s Rewrite: 18291.134 MB/s
The touch rate doubles! Over several runs with binding, touch is always faster than rewrite. I don't understand this. Why is touch faster than rewrite after binding the threads and setting their number to the number of physical cores? And why does the touch rate double?
This is the code I used. I took it from Hristo Iliev's answer and did not modify it.
#include <stdio.h>
#include <stdlib.h>   // malloc, free
#include <string.h>
#include <omp.h>

// Each thread zeroes its own contiguous slice of buf.
void zero(char *buf, size_t size)
{
    size_t my_start, my_size;

    if (omp_in_parallel())
    {
        int id = omp_get_thread_num();
        int num = omp_get_num_threads();

        my_start = (id*size)/num;
        my_size = ((id+1)*size)/num - my_start;
    }
    else
    {
        my_start = 0;
        my_size = size;
    }

    memset(buf + my_start, 0, my_size);
}

int main (void)
{
    char *buf;
    size_t size = 1L << 31; // 2 GiB
    double tmr;

    buf = malloc(size);

    // Touch: first write to the freshly allocated buffer
    tmr = -omp_get_wtime();
    #pragma omp parallel
    {
        zero(buf, size);
    }
    tmr += omp_get_wtime();
    printf("Touch: %.3f MB/s\n", size/(1.e+6*tmr));

    // Rewrite: second write to the same buffer
    tmr = -omp_get_wtime();
    #pragma omp parallel
    {
        zero(buf, size);
    }
    tmr += omp_get_wtime();
    printf("Rewrite: %.3f MB/s\n", size/(1.e+6*tmr));

    free(buf);

    return 0;
}
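As a sanity check on the slice arithmetic in zero(), here is a minimal sketch that reproduces the my_start/my_size computation outside of OpenMP and verifies that the slices are contiguous and cover the whole buffer. The helper names chunk_bounds and chunks_cover_buffer are made up for this illustration and are not part of the original code.

```c
#include <stddef.h>

/* Byte range [*start, *start + *len) handled by thread `id` out of `num`
   threads, using the same integer arithmetic as zero(). */
void chunk_bounds(int id, int num, size_t size, size_t *start, size_t *len)
{
    *start = ((size_t)id * size) / (size_t)num;
    *len   = ((size_t)(id + 1) * size) / (size_t)num - *start;
}

/* Returns 1 if the slices of all `num` threads are contiguous and together
   cover exactly `size` bytes, 0 otherwise. */
int chunks_cover_buffer(int num, size_t size)
{
    size_t expected_start = 0;
    for (int id = 0; id < num; id++) {
        size_t start, len;
        chunk_bounds(id, num, size, &start, &len);
        if (start != expected_start)
            return 0;               /* gap or overlap between slices */
        expected_start = start + len;
    }
    return expected_start == size;  /* last slice must end at the buffer end */
}
```

For example, with 3 threads and a 10-byte buffer the slices are 3, 3 and 4 bytes long, so sizes that are not a multiple of the thread count are still covered completely.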
Edit: Without thread binding, but using four threads, here are the results of eight runs:
Touch: 14723.115 MB/s, Rewrite: 16382.292 MB/s
Touch: 14433.322 MB/s, Rewrite: 16475.091 MB/s
Touch: 14354.741 MB/s, Rewrite: 16451.255 MB/s
Touch: 21681.973 MB/s, Rewrite: 18212.101 MB/s
Touch: 21004.233 MB/s, Rewrite: 17819.072 MB/s
Touch: 20889.179 MB/s, Rewrite: 18111.317 MB/s
Touch: 14528.656 MB/s, Rewrite: 16495.861 MB/s
Touch: 20958.696 MB/s, Rewrite: 18153.072 MB/s
Edit:
I tested this code on two other systems and could not reproduce the problem on them.
i5-4250U (Haswell) – 2 physical cores, 4 hyper-threads
4 threads unbound
Touch: 5959.721 MB/s, Rewrite: 9524.160 MB/s
2 threads bound to each physical core
Touch: 7263.175 MB/s, Rewrite: 9246.911 MB/s
Four-socket E7-4850 – 10 physical cores, 20 hyper-threads per socket
80 threads unbound
Touch: 10177.932 MB/s, Rewrite: 25883.520 MB/s
40 threads bound
Touch: 10254.678 MB/s, Rewrite: 30665.935 MB/s
This shows that binding the threads to the physical cores generally improves the touch and rewrite rates, but touch is slower than rewrite on both of these systems.
I also tested three different variants of memset: my_memset, my_memset_stream, and A_memset. The functions my_memset and my_memset_stream are defined below. The function A_memset is from Agner Fog's asmlib.
my_memset results:
Touch: 22463.186 MB/s Rewrite: 18797.297 MB/s
I think this shows that the problem is not in EGLIBC's memset function.
A_memset results:
Touch: 18235.732 MB/s Rewrite: 44848.717 MB/s
my_memset_stream results:
Touch: 18678.841 MB/s Rewrite: 44627.270 MB/s
Looking at the source code of asmlib, I saw that it uses non-temporal stores when writing large chunks of memory. That is why my_memset_stream gets about the same bandwidth as Agner Fog's asmlib. The maximum throughput of this system is 51.2 GB/s, so A_memset and my_memset_stream reach roughly 87% of that maximum (about 44.7 GB/s out of 51.2 GB/s).
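To make the comparison against the peak explicit, here is a small sketch of the arithmetic. It assumes, as in the benchmark's printf, that 1 MB is 10^6 bytes, and that the 51.2 GB/s peak uses 1 GB = 10^9 bytes. The function name fraction_of_peak is made up for this illustration.

```c
/* Fraction of the theoretical peak reached by a measured rate.
   measured_mb_s: rate in MB/s as printed by the benchmark (1 MB = 1e6 bytes)
   peak_gb_s:     theoretical peak in GB/s (1 GB = 1e9 bytes) */
double fraction_of_peak(double measured_mb_s, double peak_gb_s)
{
    return measured_mb_s / (peak_gb_s * 1000.0);
}
```

With the rewrite figures above, fraction_of_peak(44848.717, 51.2) is about 0.876 and fraction_of_peak(44627.270, 51.2) is about 0.872.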
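#include <stddef.h>     // size_t
#include <emmintrin.h>  // SSE2: __m128i, _mm_set1_epi32, _mm_stream_si128

// Plain store loop; n is the buffer size in bytes.
void my_memset(int *s, int c, size_t n) {
    int i;
    for (i = 0; i < n/4; i++) {
        s[i] = c;
    }
}

// Same fill with non-temporal (streaming) stores, 16 bytes per store.
// s must be 16-byte aligned.
void my_memset_stream(int *s, int c, size_t n) {
    int i;
    __m128i v = _mm_set1_epi32(c);
    for (i = 0; i < n/4; i += 4) {
        _mm_stream_si128((__m128i*)&s[i], v);
    }
}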
Solution
From your numbers, it appears that your four bound threads are running on two physical cores instead of the expected four. Can you confirm this? It would explain the doubling of the touch times. I'm not sure how to force threads onto physical cores when hyper-threading is enabled on your system. (I tried to post this as a question, but I don't have enough "reputation".)
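One way to check where the threads actually run is sketched below. It assumes Linux with glibc, where sched_getcpu() reports the logical CPU the calling thread is currently on. The helper names record_thread_cpus and print_thread_cpus are made up for this illustration, and the code falls back to a single thread when it is compiled without -fopenmp.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* makes glibc's <sched.h> declare sched_getcpu() */
#endif
#include <sched.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Stores the logical CPU of each thread in cpus[thread id] and returns the
   number of entries written (at most max_threads). */
int record_thread_cpus(int *cpus, int max_threads)
{
    int nthreads = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        if (id == 0)
            nthreads = omp_get_num_threads();
        if (id < max_threads)
            cpus[id] = sched_getcpu();
    }
#else
    if (max_threads > 0)
        cpus[0] = sched_getcpu();
#endif
    return nthreads < max_threads ? nthreads : max_threads;
}

/* Prints one line per thread with the logical CPU it was running on. */
void print_thread_cpus(void)
{
    int cpus[256];
    int n = record_thread_cpus(cpus, 256);
    for (int i = 0; i < n; i++)
        printf("thread %d on logical CPU %d\n", i, cpus[i]);
}
```

Calling print_thread_cpus() at the start of main, with the same OMP_NUM_THREADS and OMP_PROC_BIND settings as the benchmark, lists the logical CPU of each thread. These are logical CPU numbers, not physical cores; on Linux, /sys/devices/system/cpu/cpuN/topology/thread_siblings_list shows which logical CPUs share a core. If two threads turn out to sit on sibling logical CPUs, setting OMP_PLACES=cores (OpenMP 4.0) may be worth trying, although I have not verified it on your system.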