Multithreading – memset in parallel with threads bound to each physical core

In OpenMP parallel code, would there be any benefit to running memset in parallel? Testing this, I'm observing something unexpected.

My system is a single-socket Xeon E5-1620. It is an Ivy Bridge processor with 4 physical cores and 8 hyper-threads. I use Ubuntu 14.04 LTS, Linux kernel 3.13, GCC 4.9.0, and eglibc 2.19, and I compile with gcc -fopenmp -O3 mem.c.

When I run the code below, it defaults to eight threads and gives:

Touch:   11830.448 MB/s
Rewrite: 18133.428 MB/s

However, when I bind the threads and set the number of threads to the number of physical cores, like this:

export OMP_NUM_THREADS=4 
export OMP_PROC_BIND=true

then I get:

Touch:   22167.854 MB/s
Rewrite: 18291.134 MB/s

Double the touch rate! Running it several times, touch after binding is always faster than rewrite. I don't understand this: why is touch faster than rewrite after binding the threads and setting them to the number of physical cores? Why has the touch rate doubled?

Here is the code I used, taken from Hristo Iliev's answer without modification.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <omp.h>

void zero(char *buf,size_t size)
{
    size_t my_start,my_size;

    if (omp_in_parallel())
    {
        int id = omp_get_thread_num();
        int num = omp_get_num_threads();

        my_start = (id*size)/num;
        my_size = ((id+1)*size)/num - my_start;
    }
    else
    {
        my_start = 0;
        my_size = size;
    }

    memset(buf + my_start,0,my_size);
}

int main (void)
{
    char *buf;
    size_t size = 1L << 31; // 2 GiB
    double tmr;

    buf = malloc(size);

    // Touch
    tmr = -omp_get_wtime();
    #pragma omp parallel
    {
        zero(buf,size);
    }
    tmr += omp_get_wtime();
    printf("Touch:   %.3f MB/s\n",size/(1.e+6*tmr));

    // Rewrite
    tmr = -omp_get_wtime();
    #pragma omp parallel
    {
        zero(buf,size);
    }
    tmr += omp_get_wtime();
    printf("Rewrite: %.3f MB/s\n",size/(1.e+6*tmr));

    free(buf);

    return 0;
}

Edit: here are the results with no thread binding but with four threads, from eight runs:

Touch:   14723.115 MB/s,Rewrite: 16382.292 MB/s
Touch:   14433.322 MB/s,Rewrite: 16475.091 MB/s 
Touch:   14354.741 MB/s,Rewrite: 16451.255 MB/s  
Touch:   21681.973 MB/s,Rewrite: 18212.101 MB/s 
Touch:   21004.233 MB/s,Rewrite: 17819.072 MB/s 
Touch:   20889.179 MB/s,Rewrite: 18111.317 MB/s 
Touch:   14528.656 MB/s,Rewrite: 16495.861 MB/s
Touch:   20958.696 MB/s,Rewrite: 18153.072 MB/s

Edit:

I tested this code on two other systems and I can't reproduce the problem on them.

i5-4250U (Haswell) – 2 physical cores, 4 hyper-threads

4 threads unbound
    Touch:   5959.721 MB/s,Rewrite: 9524.160 MB/s
2 threads bound to each physical core
    Touch:   7263.175 MB/s,Rewrite: 9246.911 MB/s

Four-socket E7-4850 – 10 physical cores, 20 hyper-threads per socket

80 threads unbound
    Touch:   10177.932 MB/s,Rewrite: 25883.520 MB/s
40 threads bound
    Touch:   10254.678 MB/s,Rewrite: 30665.935 MB/s

This shows that binding the threads to the physical cores does improve both touch and rewrite, but touch is slower than rewrite on these two systems.

I also tested three different variants of memset: my_memset, my_memset_stream, and A_memset. The functions my_memset and my_memset_stream are defined below; the function A_memset comes from Agner Fog's asmlib.

my_memset result:

Touch:   22463.186 MB/s
Rewrite: 18797.297 MB/s

I think this shows that the problem is not in eglibc's memset function

A_memset result:

Touch:   18235.732 MB/s
Rewrite: 44848.717 MB/s

my_memset_stream result:

Touch:   18678.841 MB/s
Rewrite: 44627.270 MB/s

Looking at the asmlib source code, I see that large blocks of memory are written with non-temporal stores. That is why my_memset_stream gets about the same bandwidth as Agner Fog's asmlib. The maximum throughput of this system is 51.2 GB/s, so A_memset and my_memset_stream each achieve roughly 85% of the maximum throughput.

#include <emmintrin.h> // SSE2 intrinsics used by my_memset_stream

void my_memset(int *s,int c,size_t n) {
    size_t i;
    for(i=0; i<n/4; i++) {
        s[i] = c;
    }
}

void my_memset_stream(int *s,int c,size_t n) {
    size_t i;
    __m128i v = _mm_set1_epi32(c);

    // non-temporal stores bypass the cache; assumes s is 16-byte aligned
    for(i=0; i<n/4; i+=4) {
        _mm_stream_si128((__m128i*)&s[i],v);
    }
}
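
For reference, here is a minimal sketch of how such a variant can be hooked into the benchmark in place of memset. The wrapper zero_with is my own invention, not part of the original test, and the stream variant additionally assumes each thread's slice starts on a 16-byte boundary (which holds for these buffer sizes and thread counts):

void zero_with(void (*setter)(int *,int,size_t),char *buf,size_t size)
{
    size_t my_start = 0, my_size = size;

    if (omp_in_parallel())
    {
        int id = omp_get_thread_num();
        int num = omp_get_num_threads();

        my_start = (id*size)/num;
        my_size = ((id+1)*size)/num - my_start;
    }

    // e.g. setter = my_memset or my_memset_stream, fill value 0
    setter((int *)(buf + my_start),0,my_size);
}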

Solution

From your numbers, it looks like your four bound threads are running on two physical cores instead of the expected four physical cores. Can you confirm that? It would explain the doubling of the touch times. I'm not sure how to force the threads to the physical cores when hyper-threading is enabled on the system. (I tried to add this as a comment, but I don't have enough "reputation".)
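
One thing that might work (untested here) is to set each thread's affinity mask by hand. The sketch below is an assumption on my part, and it presumes the kernel numbers the four physical cores as logical CPUs 0-3 with their hyper-thread siblings as 4-7 (the usual enumeration, which lscpu or /proc/cpuinfo can confirm):

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>

void bind_one_thread_per_core(void)
{
    #pragma omp parallel
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        // assumes logical CPUs 0..3 are the four distinct physical cores
        CPU_SET(omp_get_thread_num(),&set);
        sched_setaffinity(0,sizeof(set),&set); // pid 0 = calling thread
    }
}

With libgomp, the same effect should be achievable without code changes via export GOMP_CPU_AFFINITY="0-3", again assuming that CPU numbering.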
