Why is the Java division of integers faster than the joy of hackers

I'm testing the throughput of divs10 function from hacker joy book, in my JDK 1.7 64bit version 21 and i7 Intel @ R_ 621_ 2419 @ coded in Java

I want to know why the default Java operator / is faster than the divs10 function in the hacker's joy book. The result shows that divs10 is three times slower than the "/" operator, which surprised me

Anyone can tell me, is there any fancy internal JVM that can be used?

The source code is as follows

public class div10 {

            public static final int divs10(int n) {
                   int q,r;

                   n = n + (n >> 31 & 9);
                   q = (n >> 1) + (n >> 2);
                   q += q >> 4;
                   q += q >> 8;
                   q += q >> 16;
                   q = q >> 3;
                   r = n - ((q << 3) + (q << 1));
                   return q + ((r + 6) >> 4);
            }

            public static void main(String[] args) {
                /*
                long count = 0;
                for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
                    if ( (i/10) != divs10(i) ) {
                        System.err.println("error dividing :" + i );
                    }
                    if ((i & 0xFFFFFFF ) == 0 ) {
                        System.out.println("Finished:" + Long.toHexString(count) + ":" + count + ":" + i);
                    }
                    count++;
                }

                System.out.println("Success:" + count);
                */

                long start = System.nanoTime();
                long count = 0L;
                int iter = 100_000;
                for (int j = 0; j < 10; j++) 
                    for (int i = -iter; i < iter; i++) {
                        count += (i/10);
                    }
                for (int j = 0; j < 10; j++) 
                    for (int i = -iter; i < iter; i++) {
                        count += divs10(i);
                    }
                System.out.println(count + " warm up done ") ;


                start = System.nanoTime();
                count = 0L;
                for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
                    count += i/10;
                }
                System.out.println(count + ",took:" + (System.nanoTime() - start) / 1000_000L + " ms," + (System.nanoTime() - start) / ((long)Integer.MAX_VALUE - (long)Integer.MIN_VALUE) + " ns per ops" ) ;

                start = System.nanoTime();
                count = 0L;
                for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
                    count += divs10(i);
                }
                System.out.println(count + "," + (System.nanoTime() - start) / ((long)Integer.MAX_VALUE - (long)Integer.MIN_VALUE) + " ns per ops" ) ;

           }
    }

Solution

Update: when looking at the newer Ivy bridge table (page 174), I see that all delays are at 1 This means that my previous explanation is incorrect

The attempt to calculate the instruction executed in the divs10 method is the 27 (no function call overhead) instruction You are in the middle of an operation. You are required to complete the previous operation before the next operation starts This means that you should consider the latency of the instruction According to Ivy bridge instruction table, all instructions involved have a delay of 1 clock cycle This gives you a total of 27 clock cycles

This is compared to a single IDIV (8-bit) instruction In the table, I found that this requires a delay of about 20 clock cycles

The original estimate will give: 27 cycles / 20 cycles = 1.35 times slower This is inconsistent with your results I am not an expert in this field, but I think this is because the division of IDIV instructions can run in parallel because they are independent The throughput of IDIV instructions is 8 clock cycles This allows the CPU to optimize instructions in such a way that it can run about 4 partitions every 52 cycles (this is an estimate)

Therefore, to perform division 4 times using the displacement algorithm, you need 108 cycles, while IDIV requires about 64 clock cycles This gives: 108 / 52 = 2.1 times slower

This is close to the ratio you measured I guess the extra time left is the cost of function calls Perhaps the CPU has been optimized more than I expected

The content of this article comes from the network collection of netizens. It is used as a learning reference. The copyright belongs to the original author.
THE END
分享
二维码
< <上一篇
下一篇>>