|
1 | 1 | # x86-simd-sort
|
2 | 2 |
|
3 | 3 | C++ template library for high performance SIMD based sorting routines for
|
4 |
| -16-bit, 32-bit and 64-bit data types. The sorting routines are accelerated |
5 |
| -using AVX-512/AVX2 when available. The library auto picks the best version |
6 |
| -depending on the processor it is run on. If you are looking for the AVX-512 or |
7 |
| -AVX2 specific implementations, please see |
8 |
| -[README](https://github.com/intel/x86-simd-sort/blob/main/src/README.md) file under |
9 |
| -`src/` directory. The following routines are currently supported: |
| 4 | +built-in integers and floats (16-bit, 32-bit and 64-bit data types) and custom |
| 5 | +defined C++ objects. The sorting routines are accelerated using AVX-512/AVX2 |
| 6 | +when available. The library auto picks the best version depending on the |
| 7 | +processor it is run on. If you are looking for the AVX-512 or AVX2 specific |
| 8 | +implementations, please see |
| 9 | +[README](https://github.com/intel/x86-simd-sort/blob/main/src/README.md) file |
| 10 | +under `src/` directory. The following routines are currently supported: |
| 11 | + |
| 12 | +## Sort an array of custom defined class objects (uses `O(N)` space) |
| 13 | +``` cpp |
| 14 | +template <typename T, typename Func> |
| 15 | +void x86simdsort::object_qsort(T *arr, uint32_t arrsize, Func key_func) |
| 16 | +``` |
| 17 | +`T` is any user defined struct or class and `arr` is a pointer to the first |
| 18 | +element in the array of objects of type `T`. `Func` is a lambda function that |
| 19 | +computes the `key` value for each object which is the metric used to sort the |
| 20 | +objects. `Func` needs to have the following signature: |
10 | 21 |
|
| 22 | +```cpp |
| 23 | +[] (T obj) -> key_t { key_t key; /* compute key for obj */ return key; } |
| 24 | +``` |
11 | 25 |
|
12 |
| -### Sort routines on arrays |
| 26 | +Note that the return type of the key, `key_t`, must be one of the following: |
| 27 | +`[float, uint32_t, int32_t, double, uint64_t, int64_t]`. `object_qsort` has a |
| 28 | +space complexity of `O(N)`. Specifically, it requires `arrsize * |
| 29 | +sizeof(key_t)` bytes to store a vector with all the keys and an additional |
| 30 | +`arrsize * sizeof(uint32_t)` bytes to store the indexes of the object array. |
| 31 | +For performance reasons, we support `object_qsort` only when the array size is |
| 32 | +less than or equal to `UINT32_MAX`. An example usage of `object_qsort` is |
| 33 | +provided in the [examples](#Sort-an-array-of-Points-using-object_qsort) |
| 34 | +section. Refer to the [performance comparison](#performance-comparison-on-avx-512-object_qsort-vs-stdsort) section to get a sense of |
| 35 | +how fast this is relative to `std::sort`. |
| 36 | + |
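The space requirement described above (one key per object plus one `uint32_t` index per object) can be illustrated with a scalar sketch built only on the standard library. This is not the library's implementation: `object_sort_sketch` is a hypothetical name, `std::sort` stands in for the SIMD-accelerated step, and the temporary copy used for the final permutation is a simplification that takes more memory than the README states for `object_qsort`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of x86-simd-sort): scalar illustration of
// sorting objects through a separate key array and index array.
template <typename T, typename Func>
void object_sort_sketch(T *arr, uint32_t arrsize, Func key_func)
{
    assert(arr != nullptr || arrsize == 0);
    using KeyT = decltype(key_func(arr[0]));

    // arrsize * sizeof(KeyT) bytes: one key per object.
    std::vector<KeyT> keys(arrsize);
    for (uint32_t i = 0; i < arrsize; ++i) keys[i] = key_func(arr[i]);

    // arrsize * sizeof(uint32_t) bytes: indexes into the object array.
    std::vector<uint32_t> idx(arrsize);
    std::iota(idx.begin(), idx.end(), 0u);

    // Order the indexes by their keys (std::sort is a stand-in here).
    std::sort(idx.begin(), idx.end(),
              [&keys](uint32_t a, uint32_t b) { return keys[a] < keys[b]; });

    // Permute the objects into key order via a temporary copy (simplification).
    std::vector<T> sorted;
    sorted.reserve(arrsize);
    for (uint32_t i : idx) sorted.push_back(arr[i]);
    std::copy(sorted.begin(), sorted.end(), arr);
}
```

The key function runs exactly once per object, which is one plausible reason this scheme pays off when the key is expensive to compute.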
| 37 | +## Sort an array of built-in integers and floats |
13 | 38 | ```cpp
|
14 |
| -x86simdsort::qsort(T* arr, size_t size, bool hasnan); |
15 |
| -x86simdsort::qselect(T* arr, size_t k, size_t size, bool hasnan); |
16 |
| -x86simdsort::partial_qsort(T* arr, size_t k, size_t size, bool hasnan); |
| 39 | +void x86simdsort::qsort(T* arr, size_t size, bool hasnan); |
| 40 | +void x86simdsort::qselect(T* arr, size_t k, size_t size, bool hasnan); |
| 41 | +void x86simdsort::partial_qsort(T* arr, size_t k, size_t size, bool hasnan); |
17 | 42 | ```
|
18 | 43 | Supported datatypes: `T` $\in$ `[_Float16, uint16_t, int16_t, float, uint32_t,
|
19 | 44 | int32_t, double, uint64_t, int64_t]`
|
20 | 45 |
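When checking results from these routines, scalar reference versions can be handy. The sketch below assumes that `qsort` sorts in ascending order, that `qselect` behaves like `std::nth_element`, and that `partial_qsort` behaves like `std::partial_sort`; the excerpt above does not spell out these semantics, so treat them as assumptions. The `ref_*` names are hypothetical, and the `hasnan` parameter is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helpers built on the standard library only.

// Full ascending sort of arr[0..size).
template <typename T>
void ref_qsort(T *arr, std::size_t size)
{
    std::sort(arr, arr + size);
}

// Places the k-th smallest element at arr[k]; smaller-or-equal elements
// end up before it and larger-or-equal elements after it.
template <typename T>
void ref_qselect(T *arr, std::size_t k, std::size_t size)
{
    assert(k < size);
    std::nth_element(arr, arr + k, arr + size);
}

// Sorts the k smallest elements into arr[0..k); the rest is unordered.
template <typename T>
void ref_partial_qsort(T *arr, std::size_t k, std::size_t size)
{
    assert(k <= size);
    std::partial_sort(arr, arr + k, arr + size);
}
```

Comparing the output of the SIMD routines against helpers like these on random inputs is a simple way to validate an integration.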
|
21 |
| -### Key-value sort routines on pairs of arrays |
| 46 | +## Key-value sort routines on pairs of arrays |
22 | 47 | ```cpp
|
23 |
| -x86simdsort::keyvalue_qsort(T1* key, T2* val, size_t size, bool hasnan); |
| 48 | +void x86simdsort::keyvalue_qsort(T1* key, T2* val, size_t size, bool hasnan); |
24 | 49 | ```
|
25 | 50 | Supported datatypes: `T1`, `T2` $\in$ `[float, uint32_t, int32_t, double,
|
26 | 51 | uint64_t, int64_t]` Note that keyvalue sort is not yet supported for 16-bit
|
27 | 52 | data types.
|
28 | 53 |
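To make the intended result of a key-value sort concrete, here is a minimal scalar sketch: both arrays are reordered so that the keys are ascending and each value stays paired with its original key. `ref_keyvalue_sort` is a hypothetical name, and the sketch makes no claim about how the library orders values whose keys compare equal.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical reference helper (not part of x86-simd-sort).
template <typename T1, typename T2>
void ref_keyvalue_sort(T1 *key, T2 *val, std::size_t size)
{
    assert(size == 0 || (key != nullptr && val != nullptr));

    // Compute the permutation that sorts the keys.
    std::vector<std::size_t> order(size);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [key](std::size_t a, std::size_t b) { return key[a] < key[b]; });

    // Apply the same permutation to keys and values.
    std::vector<T1> sorted_keys(size);
    std::vector<T2> sorted_vals(size);
    for (std::size_t i = 0; i < size; ++i) {
        sorted_keys[i] = key[order[i]];
        sorted_vals[i] = val[order[i]];
    }
    std::copy(sorted_keys.begin(), sorted_keys.end(), key);
    std::copy(sorted_vals.begin(), sorted_vals.end(), val);
}
```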
|
29 |
| -### Arg sort routines on arrays |
| 54 | +## Arg sort routines on arrays |
30 | 55 | ```cpp
|
31 | 56 | std::vector<size_t> arg = x86simdsort::argsort(T* arr, size_t size, bool hasnan);
|
32 | 57 | std::vector<size_t> arg = x86simdsort::argselect(T* arr, size_t k, size_t size, bool hasnan);
|
@@ -55,16 +80,38 @@ can configure meson to build them both by using `-Dbuild_tests=true` and
|
55 | 80 |
|
56 | 81 | ## Example usage
|
57 | 82 |
|
| 83 | +#### Sort an array of floats |
| 84 | + |
58 | 85 | ```cpp
|
59 | 86 | #include "x86simdsort.h"
|
60 | 87 |
|
61 | 88 | int main() {
|
62 | 89 | std::vector<float> arr(1000);
|
63 |
| - x86simdsort::qsort(arr, 1000, true); |
| 90 | + x86simdsort::qsort(arr.data(), 1000, true); |
64 | 91 | return 0;
|
65 | 92 | }
|
66 | 93 | ```
|
67 | 94 |
|
| 95 | +#### Sort an array of Points using object_qsort |
| 96 | +```cpp |
| 97 | +#include "x86simdsort.h" |
| 98 | +#include <cmath> |
| 99 | + |
| 100 | +struct Point { |
| 101 | + double x, y, z; |
| 102 | +}; |
| 103 | + |
| 104 | +int main() { |
| 105 | + std::vector<Point> arr{1000}; |
| 106 | + // Sort an array of Points by its x value: |
| 107 | + x86simdsort::object_qsort(arr.data(), 1000, [](Point p) { return p.x; }); |
| 108 | + // Sort an array of Points by its distance from origin: |
| 109 | + x86simdsort::object_qsort(arr.data(), 1000, [](Point p) { |
| 110 | + return sqrt(p.x*p.x+p.y*p.y+p.z*p.z); |
| 111 | + }); |
| 112 | + return 0; |
| 113 | +} |
| 114 | +``` |
68 | 115 |
|
69 | 116 | ## Details
|
70 | 117 |
|
@@ -95,6 +142,33 @@ argselect) will not use the SIMD based algorithms if they detect NAN's in the
|
95 | 142 | array. You can read details of all the implementations
|
96 | 143 | [here](https://github.com/intel/x86-simd-sort/blob/main/src/README.md).
|
97 | 144 |
|
| 145 | +## Performance comparison on AVX-512: `object_qsort` vs. `std::sort` |
| 146 | +Performance of `object_qsort` can vary significantly depending on the definition |
| 147 | +of the custom class and we highly recommend benchmarking before using it. For |
| 148 | +the sake of illustration, we provide a few examples in |
| 149 | +[./benchmarks/bench-objsort.hpp](./benchmarks/bench-objsort.hpp) which measure |
| 150 | +performance of `object_qsort` relative to `std::sort` when sorting an array of |
| 151 | +3D points represented by the class: `struct Point {double x, y, z;}` and |
| 152 | +`struct Point {float x, y, z;}`. We sort these points based on several |
| 153 | +different metrics: |
| 154 | +
|
| 155 | ++ sort by coordinate `x` |
| 156 | ++ sort by Manhattan distance (relative to origin): `abs(x) + abs(y) + abs(z)` |
| 157 | ++ sort by Euclidean distance (relative to origin): `sqrt(x*x + y*y + z*z)` |
| 158 | ++ sort by Chebyshev distance (relative to origin): `max(abs(x), abs(y), abs(z))` |
| 159 | +
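The four metrics listed above can be written as plain key functions. This is a sketch for the `double` variant of `Point`; the `key_*` names are hypothetical and are not taken from the benchmark sources.

```cpp
#include <algorithm>
#include <cassert> // used only by the sanity check at the end
#include <cmath>

struct Point {
    double x, y, z;
};

// Sort by coordinate x.
inline double key_x(Point p) { return p.x; }

// Manhattan distance from the origin.
inline double key_manhattan(Point p)
{
    return std::abs(p.x) + std::abs(p.y) + std::abs(p.z);
}

// Euclidean distance from the origin.
inline double key_euclidean(Point p)
{
    return std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z);
}

// Chebyshev distance from the origin.
inline double key_chebyshev(Point p)
{
    return std::max({std::abs(p.x), std::abs(p.y), std::abs(p.z)});
}

// Small sanity check on a 3-4-5 triangle in the xy-plane.
inline void sanity_check()
{
    Point p{3.0, -4.0, 0.0};
    assert(key_euclidean(p) == 5.0);
}
```

Each function returns `double`, one of the key types accepted by `object_qsort`, so any of them could be wrapped in a lambda such as `[](Point p) { return key_manhattan(p); }` and passed as `key_func`.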
|
| 160 | +The performance data (shown in the plot below) can be collected by building the |
| 161 | +benchmark suite and running `./builddir/benchexe --benchmark_filter==*obj*`. |
| 162 | +The plot shown below was collected on a processor with AVX-512 because |
| 163 | +`object_qsort` is currently accelerated only on AVX-512 (we plan to add the |
| 164 | +AVX2 version soon). For the simplest of cases where we want to sort an array of |
| 165 | +structs by one of their members, `object_qsort` can be up to 5x faster for 32-bit |
| 166 | +data types and about 4x faster for 64-bit data types. It tends to do even better when |
| 167 | +the metric to sort by gets more complicated. Sorting by Euclidean distance can |
| 168 | +be up to 10x faster. |
| 169 | +
|
| 170 | + |
| 171 | +
|
98 | 172 | ## Downstream projects using x86-simd-sort
|
99 | 173 |
|
100 | 174 | - NumPy uses this as a [submodule](https://github.com/numpy/numpy/pull/22315) to accelerate `np.sort, np.argsort, np.partition and np.argpartition`.
|
|