Skip to content

Latest commit

 

History

History
203 lines (152 loc) · 9.28 KB

README.md

File metadata and controls

203 lines (152 loc) · 9.28 KB

MicroBench

Enabling simpler microbenchmarks of Java8 and more traditional code.

Often, when developing performance sensitive code, it is convenient to benchmark that code in order to evaluate and improve its performance. UBench is a tool that can help.

##Maven

Maven dependency is available on Maven Central:

<dependency>
    <groupId>net.tuis.ubench</groupId>
    <artifactId>ubench</artifactId>
    <version>0.2.0</version>
</dependency>

##Example Use Case - Largest Prime Calculator

See the running code in the test section - net.tuis.ubench.PrimeSieve

Consider a function which returns the largest prime number less than a given input value. This is based off a simple "Sieve of Eratosthenes":

/**
 * Basic "Sieve of Eratosthenes" implementation, no optimizations 
 * @param limit only primes less than this limit will be calculated.
 * @return the largest prime less than the supplied limit.
 */
public static final int getMaxPrimeBefore(int limit) {
    boolean[] sieve = new boolean[limit];
    Arrays.fill(sieve, true);
    sieve[0] = false;
    sieve[1] = false;
    int largest = 0;
    for (int p = 2; p < limit; p++) {
        if (sieve[p]) {
            largest = p;
            for (int np = p * 2; np < limit; np += p) {
                sieve[np] = false;
            }
        }
    }
    return largest;
}

What can UBench do for us that is useful?

##Simple Performance

First up, how fast is it? What happens to the performance as the Java JIT compiler optimizes the code. What can we expect when the system is 'cold'?

Simply run the code in a UBench task:

    new UBench("Simple Performance")
           .addIntTask("Prime less than 4000", () -> getMaxPrimeBefore(4000), p -> p == 3989)
           .press(10000)
           .report();

This will produce (on my machine) the results:

Task Simple Performance -> Prime less than 4000: (Unit: MICROSECONDS)
  Count    :     10000      Average  :   17.4710
  Fastest  :   13.0260      Slowest  : 1107.2170
  95Pctile :   23.6840      99Pctile :   42.6310
  TimeBlock : 25.102 16.466 16.442 16.549 16.401 18.256 15.208 17.560 16.564 16.166
  Histogram :  9717   186    61    26     8     1     1

The above results can be interpreted in multiple ways. The TimeBlock and Histogram values are the ones which will be least obvious, though.

###TimeBlock

This is a break down of the run time in to 10 zones. In the above example, with 10,000 runs, the average of the first 1000 runs is 25.102 microseconds. The next 1000 runs averages at 16.466 microseconds, and so on. You can see that the performance is about constant after the first 1000 runs.

###Histogram

This value set counts the number of runs that fall within specific time buckets. Each bucket is twice as slow as the previous bucket. The first bucket is the reference, and is based on the fastest run in the results. In the exampe above, 9717 runs were between 1X and 2X as slow as the fastest run (13.0260 microseconds). 186 runs were between 2X and 4X as slow as the fastest run. 61 runs were between 4X and 8X slower than the fastest, and so on.

The number of buckets is directly related to the number of times slower the slowest run is compared to the fastest run. In the example above, the slowest run is 85 times slower than the fastest run, so it is in bucket 64-to-128 times slower.

##Comparative Algorithms

How about comparing a revised implementation? One which does not do the Arrays.fill(...) operation, but instead negates all values? The Arrays.fill is an operation that can be avoided if the logic using the boolean values in the sieve is negated. Consider code like:

public static final int getMaxPrimeBeforeNeg(int limit) {
    boolean[] sieve = new boolean[limit];
    // Arrays.fill(sieve, true);
    sieve[0] = true;
    sieve[1] = true;
    int largest = 0;
    for (int p = 2; p < limit; p++) {
        if (!sieve[p]) {
            largest = p;
            for (int np = p * 2; np < limit; np += p) {
                sieve[np] = true;
            }
        }
    }
    return largest;
}

Note that the code is the same as the initial example, but the method name is different, it does not do the Arrays.fill, and it has a negated check of the boolean value, and it sets values to true, instead of setting them to false. How much will the removal of the fill improve the performance?

    new UBench("Comparative Performance")
          .addIntTask("Primes Filled", () -> getMaxPrimeBefore(4000), p -> p == 3989)
          .addIntTask("Primes Negated", () -> getMaxPrimeBeforeNeg(4000), p -> p == 3989)
          .press(10000)
          .report("Effects of Arrays.fill()");

This produces the output:

Effects of Arrays.fill()
========================

Task Comparative Performance -> Primes Filled: (Unit: MICROSECONDS)
  Count    :     10000      Average  :   15.9820
  Fastest  :   14.2100      Slowest  : 1458.1320
  95Pctile :   18.5530      99Pctile :   25.2630
  TimeBlock : 15.529 17.707 15.695 15.539 15.944 17.046 15.855 15.546 15.447 15.518
  Histogram :  9931    53     9     4     1     1     1

Task Comparative Performance -> Primes Negated: (Unit: MICROSECONDS)
  Count    :     10000      Average  :   15.0480
  Fastest  :   13.4200      Slowest  : 1403.2650
  95Pctile :   16.9740      99Pctile :   24.4740
  TimeBlock : 14.417 15.140 14.637 14.483 14.823 15.335 16.488 14.954 15.041 15.168
  Histogram :  9938    45    11     4     1     0     1

As you can see, the negated process is consistently about 1 microsecond faster, or, 8% in this case.

##Scalability testing

How does the performance change with different values for the limit? Let's check it doubling input values. As we double the input, what happens to the performance?

    UBench bench = new UBench("Sieve Scalability");
    
    int[] limits = {250, 500, 1000, 2000, 4000, 8000};
    int[] primes = {241, 499, 997, 1999, 3989, 7993};
    
    for (int i = 0; i < limits.length; i++) {
        final int limit = limits[i];
        final int check = primes[i];
        bench.addIntTask("Primes " + limit, () -> getMaxPrimeBefore(limit), p -> p == check);
    }
    
    bench.press(UMode.SEQUENTIAL, 10000).report("Prime Scalability");

We run the code and get the results:

Prime Scalability
=================

Task Sieve Scalability -> Primes 250: (Unit: MICROSECONDS)
  Count    :     10000      Average  :    0.8840
  Fastest  :    0.3940      Slowest  : 1605.3660
  95Pctile :    0.7900      99Pctile :    0.7900
  TimeBlock : 2.409 0.753 0.699 0.701 0.703 0.844 0.702 0.699 0.690 0.639
  Histogram :  2261  7712    14     5     3     2     1     1     0     0     0     1

Task Sieve Scalability -> Primes 500: (Unit: MICROSECONDS)
  Count    :    10000      Average  :   1.3500
  Fastest  :   1.1840      Slowest  :  70.6570
  95Pctile :   1.5790      99Pctile :   1.5790
  TimeBlock : 1.341 1.373 1.416 1.335 1.335 1.332 1.334 1.332 1.371 1.333
  Histogram :  9979    18     0     0     2     1

Task Sieve Scalability -> Primes 1000: (Unit: MICROSECONDS)
  Count    :    10000      Average  :   3.1920
  Fastest  :   2.3680      Slowest  :  61.5780
  95Pctile :   5.1310      99Pctile :   5.1320
  TimeBlock : 2.888 2.834 2.871 3.207 3.396 2.969 2.857 2.875 3.085 4.946
  Histogram :  8691  1302     3     3     1

Task Sieve Scalability -> Primes 2000: (Unit: MICROSECONDS)
  Count    :     10000      Average  :    6.2780
  Fastest  :    5.5260      Slowest  : 1601.4190
  95Pctile :    6.7100      99Pctile :    8.6840
  TimeBlock : 6.067 6.086 6.372 6.035 6.445 6.038 6.041 7.616 6.038 6.048
  Histogram :  9984    14     1     0     0     0     0     0     1

Task Sieve Scalability -> Primes 4000: (Unit: MICROSECONDS)
  Count    :     10000      Average  :   13.3520
  Fastest  :   11.8410      Slowest  : 1562.7350
  95Pctile :   14.6050      99Pctile :   18.5520
  TimeBlock : 13.395 13.266 13.143 13.347 13.207 13.130 14.799 13.314 13.124 12.798
  Histogram :  9979    17     3     0     0     0     0     1

Task Sieve Scalability -> Primes 8000: (Unit: MICROSECONDS)
  Count    :      10000      Average  :    33.0100
  Fastest  :    26.0520      Slowest  : 35412.0070
  95Pctile :    41.0520      99Pctile :    42.6300
  TimeBlock : 28.020 28.138 28.887 27.991 28.095 29.253 71.221 30.652 28.668 29.184
  Histogram :  9951    35    10     0     2     1     0     0     0     0     1

The results show a slightly more than linear progression of the various times, which is what would be expected for an algorithm that's reported to be O( n log( log( n ) ) )

##Profiling

Perhaps you just want to profile the code, and get deeper insights. You can just set large values of iterations for the tests, or simply run the code for a given length of time:

bench.press("For Profiling", 1, TimeUnit.DAYS);

then kill the job when your profiling completes. Your profiles will show some overhead from the UBench itself, but the small footprint in terms of code and stack, makes UBench a convenient platform to use.