|
1 | 1 | # x86-simd-sort
|
2 | 2 |
|
3 |
| -C++ header file library for SIMD based 16-bit, 32-bit and 64-bit data type |
4 |
| -sorting algorithms on x86 processors. Source header files are available in src |
5 |
| -directory. We currently only have AVX-512 based implementation of quicksort, |
6 |
| -argsort, quickselect, paritalsort and key-value sort. This repository also |
7 |
| -includes a test suite which can be built and run to test the sorting algorithms |
8 |
| -for correctness. It also has benchmarking code to compare its performance |
9 |
| -relative to std::sort. The following API's are currently supported: |
10 |
| - |
11 |
| -#### Quicksort |
12 |
| - |
13 |
| -```cpp |
14 |
| -void avx512_qsort<T>(T* arr, int64_t arrsize) |
15 |
| -``` |
16 |
| -Supported datatypes: `uint16_t`, `int16_t`, `_Float16`, `uint32_t`, `int32_t`, |
17 |
| -`float`, `uint64_t`, `int64_t` and `double`. |
18 |
| - |
19 |
| -For floating-point types, if `arr` contains NaNs, they are moved to the end and |
20 |
| -replaced with a quiet NaN. That is, the original, bit-exact NaNs in the input |
21 |
| -are not preserved. |
22 |
| - |
23 |
| -#### Argsort |
| 3 | +C++ template library of high-performance SIMD-based sorting routines for
| 4 | +16-bit, 32-bit and 64-bit data types. The sorting routines are accelerated
| 5 | +using AVX-512/AVX2 when available. The library automatically picks the best
| 6 | +version for the processor it runs on. If you are looking for the AVX-512 or
| 7 | +AVX2 specific implementations, please see the
| 8 | +[README](https://github.com/intel/x86-simd-sort/src/README.md) file under the
| 9 | +`src/` directory. The following routines are currently supported:
24 | 10 |
|
25 | 11 | ```cpp
|
26 |
| -std::vector<int64_t> arg = avx512_argsort<T>(T* arr, int64_t arrsize) |
27 |
| -void avx512_argsort<T>(T* arr, int64_t *arg, int64_t arrsize) |
| 12 | +x86simdsort::qsort(T* arr, size_t size, bool hasnan); |
| 13 | +x86simdsort::qselect(T* arr, size_t k, size_t size, bool hasnan); |
| 14 | +x86simdsort::partial_qsort(T* arr, size_t k, size_t size, bool hasnan); |
| 15 | +std::vector<size_t> arg = x86simdsort::argsort(T* arr, size_t size, bool hasnan); |
| 16 | +std::vector<size_t> arg = x86simdsort::argselect(T* arr, size_t k, size_t size, bool hasnan); |
28 | 17 | ```
|
29 |
| -Supported datatypes: `uint32_t`, `int32_t`, `float`, `uint64_t`, `int64_t` and |
30 |
| -`double`. |
31 | 18 |
|
32 |
| -The algorithm resorts to scalar `std::sort` if the array contains NaNs. |
| 19 | +### Build/Install |
33 | 20 |
|
34 |
| -#### Quickselect |
| 21 | +[meson](https://github.com/mesonbuild/meson) is the build system used. Commands
| 22 | +to build and install the library:
35 | 23 |
|
36 |
| -```cpp |
37 |
| -void avx512_qselect<T>(T* arr, int64_t arrsize) |
38 |
| -void avx512_qselect<T>(T* arr, int64_t arrsize, bool hasnan) |
39 |
| -``` |
40 |
| -Supported datatypes: `uint16_t`, `int16_t`, `_Float16`, `uint32_t`, `int32_t`, |
41 |
| -`float`, `uint64_t`, `int64_t` and `double`. |
42 |
| - |
43 |
| -For floating-point types, if `bool hasnan` is set, NaNs are moved to the end of |
44 |
| -the array, preserving the bit-exact NaNs in the input. If NaNs are present but |
45 |
| -`hasnan` is `false`, the behavior is undefined. |
46 |
| - |
47 |
| -#### Partialsort |
48 |
| - |
49 |
| -```cpp |
50 |
| -void avx512_partial_qsort<T>(T* arr, int64_t arrsize) |
51 |
| -void avx512_partial_qsort<T>(T* arr, int64_t arrsize, bool hasnan) |
52 | 24 | ```
|
53 |
| -Supported datatypes: `uint16_t`, `int16_t`, `_Float16`, `uint32_t`, `int32_t`, |
54 |
| -`float`, `uint64_t`, `int64_t` and `double`. |
55 |
| - |
56 |
| -For floating-point types, if `bool hasnan` is set, NaNs are moved to the end of |
57 |
| -the array, preserving the bit-exact NaNs in the input. If NaNs are present but |
58 |
| -`hasnan` is `false`, the behavior is undefined. |
59 |
| - |
60 |
| -#### Key-value sort |
61 |
| -```cpp |
62 |
| -void avx512_qsort_kv<T>(T* key, uint64_t* value , int64_t arrsize) |
| 25 | +meson setup --buildtype release builddir && cd builddir |
| 26 | +meson compile |
| 27 | +sudo meson install |
63 | 28 | ```
|
64 |
| -Supported datatypes: `uint64_t, int64_t and double` |
65 | 29 |
|
66 |
| -## Algorithm details |
| 30 | +Once installed, you can use `pkg-config --cflags --libs x86simdsortcpp` to |
| 31 | +populate the right cflags and ldflags to compile and link your C++ program. |
| 32 | +This repository also contains a test suite and benchmarking suite which are |
| 33 | +written using [googletest](https://github.com/google/googletest) and [google |
| 34 | +benchmark](https://github.com/google/benchmark) frameworks respectively. You |
| 35 | +can configure meson to build them both by using `-Dbuild_tests=true` and |
| 36 | +`-Dbuild_benchmarks=true`. |
67 | 37 |
|
68 |
| -The ideas and code are based on these two research papers [1] and [2]. On a |
69 |
| -high level, the idea is to vectorize quicksort partitioning using AVX-512 |
70 |
| -compressstore instructions. If the array size is < 128, then use Bitonic |
71 |
| -sorting network implemented on 512-bit registers. The precise network |
72 |
| -definitions depend on the size of the dtype and are defined in separate files: |
73 |
| -`avx512-16bit-qsort.hpp`, `avx512-32bit-qsort.hpp` and |
74 |
| -`avx512-64bit-qsort.hpp`. Article [4] is a good resource for bitonic sorting |
75 |
| -network. The core implementations of the vectorized qsort functions |
76 |
| -`avx512_qsort<T>(T*, int64_t)` are modified versions of avx2 quicksort |
77 |
| -presented in the paper [2] and source code associated with that paper [3]. |
78 |
| - |
79 |
| -## Example to include and build this in a C++ code |
80 |
| - |
81 |
| -### Sample code `main.cpp` |
| 38 | +### Example usage |
82 | 39 |
|
83 | 40 | ```cpp
|
84 |
| -#include "src/avx512-32bit-qsort.hpp" |
| 41 | +#include "x86simdsort.h" |
85 | 42 |
|
86 | 43 | int main() {
|
87 |
| - const int ARRSIZE = 1000; |
88 |
| - std::vector<float> arr; |
89 |
| - |
90 |
| - /* Initialize elements is reverse order */ |
91 |
| - for (int ii = 0; ii < ARRSIZE; ++ii) { |
92 |
| - arr.push_back(ARRSIZE - ii); |
93 |
| - } |
94 |
| - |
95 |
| - /* call avx512 quicksort */ |
96 |
| - avx512_qsort(arr.data(), ARRSIZE); |
| 44 | + std::vector<float> arr(1000); // 1000 zero-initialized elements
| 45 | + x86simdsort::qsort(arr.data(), arr.size(), true);
97 | 46 | return 0;
|
98 | 47 | }
|
99 |
| - |
100 |
| -``` |
101 |
| - |
102 |
| -### Build using gcc |
103 |
| - |
104 |
| -``` |
105 |
| -g++ main.cpp -mavx512f -mavx512dq -O3 |
106 | 48 | ```
|
107 | 49 |
|
108 |
| -This is a header file only library and we do not provide any compile time and |
109 |
| -run time checks which is recommended while including this your source code. A |
110 |
| -slightly modified version of this source code has been contributed to |
111 |
| -[NumPy](https://github.com/numpy/numpy) (see this [pull |
112 |
| -request](https://github.com/numpy/numpy/pull/22315) for details). This NumPy |
113 |
| -pull request is a good reference for how to include and build this library with |
114 |
| -your source code. |
115 |
| - |
116 |
| -## Build requirements |
117 |
| - |
118 |
| -None, its header files only. However you will need `make` or `meson` to build |
119 |
| -the unit tests and benchmarking suite. You will need a relatively modern |
120 |
| -compiler to build. |
121 |
| - |
122 |
| -``` |
123 |
| -gcc >= 8.x |
124 |
| -``` |
125 |
| - |
126 |
| -### Build using Meson |
127 |
| - |
128 |
| -meson is the recommended build system to build the test and benchmark suite. |
129 |
| - |
130 |
| -``` |
131 |
| -meson setup builddir && cd builddir && ninja |
132 |
| -``` |
133 |
| - |
134 |
| -It build two executables: |
135 |
| - |
136 |
| -- `testexe`: runs a bunch of tests written in ./tests directory. |
137 |
| -- `benchexe`: measures performance of these algorithms for various data types. |
138 |
| - |
139 |
| - |
140 |
| -### Build using Make |
141 |
| - |
142 |
| -Makefile uses `-march=sapphirerapids` as a global compile flag and hence it |
143 |
| -will require g++-12. `make` command builds two executables: |
144 |
| -- `testexe`: runs a bunch of tests written in ./tests directory. |
145 |
| -- `benchexe`: measures performance of these algorithms for various data types |
146 |
| - and compares them to std::sort. |
147 |
| - |
148 |
| -You can use `make test` and `make bench` to build just the `testexe` and |
149 |
| -`benchexe` respectively. |
150 |
| - |
151 |
| -## Requirements and dependencies |
152 |
| - |
153 |
| -The sorting routines relies only on the C++ Standard Library and requires a |
154 |
| -relatively modern compiler to build (gcc 8.x and above). Since they use the |
155 |
| -AVX-512 instruction set, they can only run on processors that have AVX-512. |
156 |
| -Specifically, the 32-bit and 64-bit require AVX-512F and AVX-512DQ instruction |
157 |
| -set. The 16-bit sorting requires the AVX-512F, AVX-512BW and AVX-512 VMBI2 |
158 |
| -instruction set. The test suite is written using the Google test framework. The |
159 |
| -benchmark is written using the google benchmark framework. |
160 |
| - |
161 |
| -## References |
162 |
| - |
163 |
| -* [1] Fast and Robust Vectorized In-Place Sorting of Primitive Types |
164 |
| - https://drops.dagstuhl.de/opus/volltexte/2021/13775/ |
165 |
| - |
166 |
| -* [2] A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel |
167 |
| -Skylake https://arxiv.org/pdf/1704.08579.pdf |
168 |
| - |
169 |
| -* [3] https://github.com/simd-sorting/fast-and-robust: SPDX-License-Identifier: MIT |
170 |
| - |
171 |
| -* [4] http://mitp-content-server.mit.edu:18180/books/content/sectbyfn?collid=books_pres_0&fn=Chapter%2027.pdf&id=8030 |
172 | 50 |
|
173 |
| -* [5] https://bertdobbelaere.github.io/sorting_networks.html |
| 51 | +### Details |
| 52 | + |
| 53 | +- `x86simdsort::qsort` is equivalent to `qsort` in |
| 54 | + [C](https://www.tutorialspoint.com/c_standard_library/c_function_qsort.htm) |
| 55 | + or `std::sort` in [C++](https://en.cppreference.com/w/cpp/algorithm/sort). |
| 56 | +- `x86simdsort::qselect` is equivalent to `std::nth_element` in |
| 57 | + [C++](https://en.cppreference.com/w/cpp/algorithm/nth_element) or |
| 58 | + `np.partition` in |
| 59 | + [NumPy](https://numpy.org/doc/stable/reference/generated/numpy.partition.html). |
| 60 | +- `x86simdsort::partial_qsort` is equivalent to `std::partial_sort` in |
| 61 | + [C++](https://en.cppreference.com/w/cpp/algorithm/partial_sort). |
| 62 | +- `x86simdsort::argsort` is equivalent to `np.argsort` in |
| 63 | + [NumPy](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html). |
| 64 | +- `x86simdsort::argselect` is equivalent to `np.argpartition` in |
| 65 | + [NumPy](https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html). |
| 66 | + |
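The equivalences listed above can be sketched with the C++ standard library alone. The block below is a scalar reference model, not the library's implementation: the `reference_*` helpers are hypothetical names introduced here for illustration, and they only show the result each routine is described as producing.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical scalar model of argsort: returns the indices that would
// sort arr in ascending order (the analogue of np.argsort).
std::vector<std::size_t> reference_argsort(const std::vector<float>& arr) {
    std::vector<std::size_t> idx(arr.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::sort(idx.begin(), idx.end(),
              [&arr](std::size_t a, std::size_t b) { return arr[a] < arr[b]; });
    return idx;
}

// Hypothetical scalar model of qselect: afterwards arr[k] holds the value
// it would hold in a fully sorted array (std::nth_element semantics).
void reference_qselect(std::vector<float>& arr, std::size_t k) {
    std::nth_element(arr.begin(), arr.begin() + k, arr.end());
}

// Hypothetical scalar model of partial_qsort: afterwards the first k
// elements are the k smallest, in ascending order (std::partial_sort).
void reference_partial_qsort(std::vector<float>& arr, std::size_t k) {
    std::partial_sort(arr.begin(), arr.begin() + k, arr.end());
}
```

A model like this can serve as a correctness oracle when comparing against the SIMD routines on the same input, provided the input has no NaNs and, for the index-returning routines, no duplicate values (ties may be ordered differently).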
| 67 | +Supported datatypes: `uint16_t, int16_t, _Float16, uint32_t, int32_t, float,
| 68 | +uint64_t, int64_t, double`. Note that `_Float16` requires building this
| 69 | +library with g++ >= 12.x. All the functions take an optional argument `bool
| 70 | +hasnan`, set to `false` by default (it is relevant to floating-point data
| 71 | +types only). If your array contains NaNs and `hasnan` is `false`, the
| 72 | +behaviour of the sorting routine is undefined. If `hasnan` is set to `true`,
| 73 | +NaNs are always sorted to the end of the array. In addition, `qsort` replaces
| 74 | +every NaN with `std::numeric_limits<T>::quiet_NaN()`, so the original
| 75 | +bit-exact NaNs in the input are not preserved. Also note that the arg methods
| 76 | +(`argsort` and `argselect`) will not use the SIMD-based algorithms if they
| 77 | +detect NaNs in the array. You can read details of all the implementations
| 78 | +[here](https://github.com/intel/x86-simd-sort/src/README.md).
| 79 | + |
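To make the NaN rules concrete, here is a minimal scalar sketch of the outcome described for `qsort` when `hasnan` is `true`: non-NaN values sorted in ascending order, followed by quiet NaNs. The function name `reference_qsort_with_nan` is hypothetical and the code is a standard-library model of the documented result, not how the library handles NaNs internally.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical scalar model of the documented hasnan = true behaviour.
void reference_qsort_with_nan(std::vector<float>& arr) {
    // Move every non-NaN value to the front; NaNs collect at the back.
    auto first_nan = std::partition(arr.begin(), arr.end(),
                                    [](float x) { return !std::isnan(x); });
    // Sort only the NaN-free prefix, where operator< is a valid ordering.
    std::sort(arr.begin(), first_nan);
    // Overwrite the tail with quiet NaNs: original NaN payloads are lost.
    std::fill(first_nan, arr.end(), std::numeric_limits<float>::quiet_NaN());
}
```

The partition step matters because `operator<` is not a strict weak ordering once NaNs are involved, which is also why calling `std::sort` directly on such an array is undefined behaviour.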
| 80 | +### Downstream projects using x86-simd-sort |
| 81 | + |
| 82 | +- NumPy uses this as a [submodule](https://github.com/numpy/numpy/pull/22315) to accelerate `np.sort, np.argsort, np.partition and np.argpartition`. |
| 83 | +- A slightly modified version of this library has been integrated into [openJDK](https://github.com/openjdk/jdk/pull/14227).
| 84 | +- [GRAPE](https://github.com/alibaba/libgrape-lite.git): C++ library for parallel graph processing. |
| 85 | +- AVX-512 version of the key-value sort has been submitted to [Oceanbase](https://github.com/oceanbase/oceanbase/pull/1325). |