Why is Java's integer division faster than the version from Hacker's Delight?
I'm testing the throughput of the divs10 function from the book Hacker's Delight, coded in Java, on 64-bit JDK 1.7 (update 21) and an Intel i7 machine.
I want to know why Java's built-in / operator is faster than the divs10 function from Hacker's Delight. My results show that divs10 is about three times slower than the / operator, which surprised me.
Can anyone tell me whether there is some fancy JVM-internal optimization at work here?
The source code is as follows:
public class div10 {
    public static final int divs10(int n) {
        int q, r;
        n = n + (n >> 31 & 9);
        q = (n >> 1) + (n >> 2);
        q += q >> 4;
        q += q >> 8;
        q += q >> 16;
        q = q >> 3;
        r = n - ((q << 3) + (q << 1));
        return q + ((r + 6) >> 4);
    }

    public static void main(String[] args) {
        /*
        // Exhaustive correctness check over the full int range
        // (disabled for the timing runs).
        long count = 0;
        for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
            if ((i / 10) != divs10(i)) {
                System.err.println("error dividing :" + i);
            }
            if ((i & 0xFFFFFFF) == 0) {
                System.out.println("Finished:" + Long.toHexString(count) + ":" + count + ":" + i);
            }
            count++;
        }
        System.out.println("Success:" + count);
        */
        long start = System.nanoTime();
        long count = 0L;
        int iter = 100_000;
        // Warm up both code paths so the JIT compiles them.
        for (int j = 0; j < 10; j++)
            for (int i = -iter; i < iter; i++) {
                count += (i / 10);
            }
        for (int j = 0; j < 10; j++)
            for (int i = -iter; i < iter; i++) {
                count += divs10(i);
            }
        System.out.println(count + " warm up done ");

        // Time the built-in / operator over the full int range.
        start = System.nanoTime();
        count = 0L;
        for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
            count += i / 10;
        }
        System.out.println(count + ",took:" + (System.nanoTime() - start) / 1000_000L + " ms,"
                + (System.nanoTime() - start) / ((long) Integer.MAX_VALUE - (long) Integer.MIN_VALUE)
                + " ns per ops");

        // Time divs10 over the same range.
        start = System.nanoTime();
        count = 0L;
        for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
            count += divs10(i);
        }
        System.out.println(count + ","
                + (System.nanoTime() - start) / ((long) Integer.MAX_VALUE - (long) Integer.MIN_VALUE)
                + " ns per ops");
    }
}
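For anyone who wants to reproduce this: one possible way to compile and run the benchmark, assuming the class above is saved as div10.java. The second java invocation uses HotSpot's diagnostic flags to dump the JIT-compiled code, which requires the hsdis disassembler plugin to be installed:

javac div10.java
java div10
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly div10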
Solution
Update: looking at the newer Ivy Bridge tables (page 174), I see that all the latencies are 1 clock cycle. This means my previous explanation was incorrect.
An attempt at counting the instructions executed in the divs10 method gives 27 instructions (ignoring function-call overhead). Most operations depend on the result of the previous one, so the previous operation has to complete before the next can start; this means you have to consider instruction latency. According to the Ivy Bridge instruction tables, every instruction involved has a latency of 1 clock cycle, which gives a total of 27 clock cycles.
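To make that dependency chain concrete, here is the same divs10 body re-listed with a per-line operation count. The counts are my own annotation, not from the book, and assume the 1-cycle latency per shift/add/and/sub from the Ivy Bridge tables:

// Same method as in the question; comments count the arithmetic ops
// per line. Most lines consume the result of the previous line, so
// their latencies add up instead of overlapping.
public static final int divs10(int n) {
    int q, r;
    n = n + (n >> 31 & 9);           // shr, and, add      -> 3 ops
    q = (n >> 1) + (n >> 2);         // shr, shr, add      -> 3 ops
    q += q >> 4;                     // shr, add           -> 2 ops
    q += q >> 8;                     // shr, add           -> 2 ops
    q += q >> 16;                    // shr, add           -> 2 ops
    q = q >> 3;                      // shr                -> 1 op
    r = n - ((q << 3) + (q << 1));   // shl, shl, add, sub -> 4 ops
    return q + ((r + 6) >> 4);       // add, shr, add      -> 3 ops
}
// That is about 20 arithmetic ops in a mostly serial chain; with the
// register moves the JIT emits on top, it lands near the 27 counted above.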
Compare this to a single IDIV (8-bit) instruction, which according to the same tables has a latency of about 20 clock cycles.
A naive estimate would then give: 27 cycles / 20 cycles = 1.35 times slower, which does not match your results. I am no expert at this, but I think it is because the divisions done by the IDIV instruction can run in parallel, since they are independent. The IDIV instruction has a throughput of 8 clock cycles, which lets the CPU schedule the instructions so that it completes roughly 4 divisions every 52 cycles (this is an estimate).
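As a back-of-the-envelope check on that estimate (my own arithmetic, using the figures above): with a latency of L = 20 cycles and an issue interval (reciprocal throughput) of T = 8 cycles, k independent divisions finish in about

t_k \approx L + (k - 1)\,T, \qquad t_4 \approx 20 + 3 \times 8 = 44 \text{ cycles}

which, allowing for some scheduling slack, is in the same ballpark as the ~52-cycle figure.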
So, performing the 4 divisions with the shift algorithm takes 4 × 27 = 108 cycles, while IDIV needs roughly the 52 cycles estimated above. This gives: 108 / 52 ≈ 2.1 times slower.
That is close to the ratio you measured. I guess the remaining difference goes to function-call overhead, or perhaps the CPU optimizes more than I expected.