- ~0.18 - 0.90 cycles per byte
- 100% C (C++ compatible header) without inline assembly
- Both 32 and 64 bits supported
- Portable scalar functions faster than SIMD functions
- Up to 22 times faster than naive solution
- 🆕 (2022.01) more faster, beats even other very fast assembler functions
- Single thread
- Realistic and practical benchmark with large files.
- No PURE cache benchmark
- Uniform: enwik9
- Skewed: enwik9 bwt generated w. libdivsufsort
- 1GB zeros
- Accurate benchmarking with command "turbohist file -I15"
Uniform distribution - enwik9 Text file, size=1.000.0000.000
Function | MB/s | Cycle/Byte | Language | Package |
---|---|---|---|---|
1:hist_1_8 naiv 8 bits | 2761.01 | 1.3423 | C | TurboHist |
2:hist_4_8 4 bins/ 8 bits | 2725.92 | 1.3249 | C | TurboHist |
3:hist_8_8 8 bins/ 8 bits | 2850.05 | 1.2627 | C | TurboHist |
4:hist_4_32 4 bins/32 bits | 3691.02 | 0.9660 | C | TurboHist |
5:hist_8_32 8 bins/32 bits | 3867.26 | 0.9561 | C | TurboHist |
6:hist_4_64 4 bins/64 bits | 4040.55 | 0.9103 | C | TurboHist |
7:hist_8_64 8 bins/64 bits | 4053.37 | 0.9035 | C | TurboHist |
8:histr_4_64 4/64+run | 3915.85 | 0.9668 | C | TurboHist |
9:histr_8_64 8/64+run | 3916.51 | 0.9286 | C | TurboHist |
10:hist_4_128 4 bins/sse4.1 | 3643.20 | 1.0081 | C | TurboHist |
11:hist_8_128 8 bins/sse4.1 | 3607.06 | 0.9845 | C | TurboHist |
12:hist_4_256 4 bins/avx2 | 3522.27 | 1.0195 | C | TurboHist |
13:hist_8_256 8 bins/avx2 | 3542.25 | 1.0366 | C | TurboHist |
15:hist_8_64asm inline asm | 4161.87 | 0.8787 | inline asm | TurboHist |
18:count2x64 inline asm | 3963.91 | 0.9172 | inline asm | Countbench |
20:histo_ref | 2702.57 | 1.3567 | C | Ryg |
21:histo_cpp_1x | 1876.13 | 1.8236 | C | Ryg |
22:histo_cpp_2x | 2664.78 | 1.5935 | C | Ryg |
23:histo_cpp_4x | 2817.77 | 1.2944 | C | Ryg |
24:histo_asm_scalar4 | 3130.08 | 1.1609 | asm | Ryg |
25:histo_asm_scalar8 | 3353.08 | 1.0636 | asm | Ryg |
26:histo_asm_scalar8_var | 3704.88 | 0.9856 | asm | Ryg |
27:histo_asm_scalar8_var2 | 4085.48 | 0.8913 | asm | Ryg |
28:histo_asm_scalar8_var3 | 4132.54 | 0.8870 | asm | Ryg |
29:histo_asm_scalar8_var4 | 4083.92 | 0.8970 | asm | Ryg |
30:histo_asm_scalar8_var5 | 4002.21 | 0.9025 | asm | Ryg |
31:histo_asm_sse4 | 3153.01 | 1.1445 | asm | Ryg |
32:memcpy | 13724.29 | 0.2698 | C |
Skewed distribution - enwik9.bwt Text file, size=1.000.0000.000
Function | MB/s | Cycle/Byte | Language |
---|---|---|---|
1:hist_1_8 naiv 8 bits | 1170.89 | 3.0642 | C |
2:hist_4_8 4 bins/ 8 bits | 2707.74 | 1.3321 | C |
3:hist_8_8 8 bins/ 8 bits | 2804.08 | 1.3208 | C |
4:hist_4_32 4 bins/32 bits | 3118.54 | 1.1402 | C |
5:hist_8_32 8 bins/32 bits | 3780.16 | 0.9714 | C |
6:hist_4_64 4 bins/64 bits | 3646.25 | 0.9980 | C |
7:hist_8_64 8 bins/64 bits | 3941.96 | 0.9282 | C |
8:histr_4_64 4/64+run | 5061.62 | 0.7270 | C |
9:histr_8_64 8/64+run | 5135.29 | 0.7229 | C |
10:hist_4_128 4 bins/sse4.1 | 3535.36 | 1.0365 | C |
11:hist_8_128 8 bins/sse4.1 | 3654.41 | 0.9791 | C |
12:hist_4_256 4 bins/avx2 | 3329.87 | 1.1022 | C |
13:hist_8_256 8 bins/avx2 | 3540.36 | 1.0343 | C |
15:hist_8_64asm inline asm | 4047.74 | 0.9013 | inline asm |
18:count2x64 inline asm | 3969.92 | 0.9262 | inline asm |
20:histo_ref | 1182.61 | 3.0718 | C |
21:histo_cpp_1x | 1213.42 | 2.9748 | C |
22:histo_cpp_2x | 2115.60 | 1.7373 | C |
23:histo_cpp_4x | 1801.97 | 2.0024 | C |
24:histo_asm_scalar4 | 3092.87 | 1.1561 | asm |
25:histo_asm_scalar8 | 3203.95 | 1.1139 | asm |
26:histo_asm_scalar8_var | 3460.45 | 1.0422 | asm |
27:histo_asm_scalar8_var2 | 3659.61 | 0.9878 | asm |
28:histo_asm_scalar8_var3 | 3769.96 | 0.9569 | asm |
29:histo_asm_scalar8_var4 | 3996.75 | 0.8905 | asm |
30:histo_asm_scalar8_var5 | 4642.10 | 0.7719 | asm |
31:histo_asm_sse4 | 3091.36 | 1.1670 | asm |
32:memcpy | 15594.28 | 0.2412 | C |
All zeros: size=1.000.0000.000
Function | MB/s | Cycle/Byte | Language |
---|---|---|---|
1:hist_1_8 naiv 8 bits | 877.27 | 4.0805 | C |
2:hist_4_8 4 bins/ 8 bits | 2650.84 | 1.3485 | C |
3:hist_8_8 8 bins/ 8 bits | 2743.40 | 1.2994 | C |
4:hist_4_32 4 bins/32 bits | 2978.83 | 1.2006 | C |
5:hist_8_32 8 bins/32 bits | 3775.45 | 0.9555 | C |
6:hist_4_64 4 bins/64 bits | 3411.11 | 1.0530 | C |
7:hist_8_64 8 bins/64 bits | 3928.09 | 0.9342 | C |
8:histr_4_64 4/64+run | 18998.87 | 0.1868 | C |
9:histr_8_64 8/64+run | 19629.28 | 0.1869 | C |
10:hist_4_128 4 bins/sse4.1 | 3365.40 | 1.0717 | C |
11:hist_8_128 8 bins/sse4.1 | 3632.61 | 0.9950 | C |
12:hist_4_256 4 bins/avx2 | 3112.15 | 1.1576 | C |
13:hist_8_256 8 bins/avx2 | 3497.08 | 1.0205 | C |
15:hist_8_64asm inline asm | 4089.97 | 0.8817 | inline asm |
18:count2x64 inline asm | 3881.98 | 0.9158 | inline asm |
20:histo_ref | 882.93 | 4.1072 | C |
21:histo_cpp_1x | 873.20 | 4.1069 | C |
22:histo_cpp_2x | 1720.19 | 2.0961 | C |
23:histo_cpp_4x | 1866.99 | 2.0817 | C |
24:histo_asm_scalar4 | 2995.84 | 1.1942 | asm |
25:histo_asm_scalar8 | 3107.30 | 1.1618 | asm |
26:histo_asm_scalar8_var | 3288.67 | 1.1143 | asm |
27:histo_asm_scalar8_var2 | 3290.92 | 1.0957 | asm |
28:histo_asm_scalar8_var3 | 3707.41 | 0.9763 | asm |
29:histo_asm_scalar8_var4 | 3988.01 | 0.9019 | asm |
30:histo_asm_scalar8_var5 | 14076.09 | 0.2564 | asm |
31:histo_asm_sse4 | 3020.32 | 1.1975 | asm |
32:memcpy | 14057.53 | 0.2636 | C |
(bold = pareto) MB=1.000.000
make
or
make AVX2=1
turbohist [-e#] file [-I#] [-z]
options:
-e# # = function numbers separated by ,
-I# # = number of iteration
set to -I15 for accurate timings
-z set read buffer to zeros
./turbohist file
./turbohist -e1,7,9
- Windows: MinGW-w64 makefile
- Linux amd/intel: GNU GCC (>=4.6)
- Linux amd/intel: Clang (>=3.2)
- Linux arm: aarch64 ARMv8: gcc (>=6.3)
- MaxOS: XCode (>=9) + Apple M1
- PowerPC ppc64le: gcc (>=8.0)
Last update: 01 JAN 2022