Use 64-bit Morton indices by default in the BVH construction #637

aprokop · 2022-02-21T05:16:04Z

Original motivation

ArborX is unable to process the GeoLife (https://www.microsoft.com/en-us/research/publication/geolife-gps-trajectory-dataset-user-guide/) dataset in any reasonable time: it takes hundreds of seconds to find the closest neighbor for this 24M-point problem. The problem is that hundreds of thousands of points are assigned the same Morton index. As we just "randomly" combine such points during the hierarchy construction, this leads to dramatic overlap in the bounding volumes at the lowest levels.
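To illustrate why 32-bit codes collide so often, here is a minimal sketch of a 3D Morton code with a configurable number of bits per dimension (10 bits per dimension fit in a 32-bit code, 21 in a 64-bit one). The helper names `quantize` and `morton3D`, the normalization to the unit cube, and the bit-by-bit interleave loop are assumptions made for illustration; this is not the ArborX implementation, which would more likely use optimized bit tricks.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: map a coordinate in [0, 1] to a cell index on a grid
// with 2^bits cells per dimension. The upper boundary is clamped into the
// last cell.
inline std::uint64_t quantize(double x, int bits)
{
  double const cells = static_cast<double>(std::uint64_t{1} << bits);
  x = std::min(std::max(x * cells, 0.0), cells - 1.0);
  return static_cast<std::uint64_t>(x);
}

// Hypothetical helper: interleave the cell indices of a 3D point, using
// `bits` bits per dimension (10 -> 30-bit code, 21 -> 63-bit code).
inline std::uint64_t morton3D(double x, double y, double z, int bits)
{
  std::uint64_t const ix = quantize(x, bits);
  std::uint64_t const iy = quantize(y, bits);
  std::uint64_t const iz = quantize(z, bits);

  std::uint64_t code = 0;
  for (int b = 0; b < bits; ++b)
  {
    // Bit b of each coordinate lands in a group of three consecutive bits.
    code |= ((ix >> b) & 1u) << (3 * b + 2);
    code |= ((iy >> b) & 1u) << (3 * b + 1);
    code |= ((iz >> b) & 1u) << (3 * b);
  }
  return code;
}
```

With these definitions, two points that are 1e-4 apart in x, such as (0.5, 0.5, 0.5) and (0.5001, 0.5, 0.5), share a code at 10 bits per dimension but receive distinct codes at 21 bits per dimension.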

Statistics for some datasets

I have checked the number of duplicate codes and their distribution for two 3D datasets: HACC and GeoLife (those are the ones I had readily available).

"# M codes (>3 reps)": number of Morton codes shared by more than 3 points, i.e., the number of Morton grid cells with more than 3 points in them
"# points with duplicate M code": number of points that share their Morton code with another point
"max # duplicate": highest number of points with the same Morton code

Dataset	# 32-bit M codes (>3 reps)	# points with duplicate 32-bit M code	max # duplicate 32-bit	# 64-bit M codes (>3 reps)	# points with duplicate 64-bit M code	max # duplicate 64-bit
HACC 37M	1,311,912 [ 7.52%]	23,639,027 [64.06%]	3,569	0 [ 0.00%]	528 [ 0.00%]	2
GeoLife	9,863 [60.24%]	24,872,748 [99.98%]	5,706,009	1,002,529 [ 8.02%]	16,217,910 [65.19%]	8,009
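The three statistics defined above can be gathered with a single pass over the sorted codes. The following is a sketch only; the struct and function names are hypothetical and are not taken from the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the three statistics reported in the table.
struct DuplicateStats
{
  std::size_t codes_over_3_reps = 0;     // codes shared by more than 3 points
  std::size_t points_with_duplicate = 0; // points sharing a code with another point
  std::size_t max_duplicates = 0;        // largest number of points per code
};

// Takes the codes by value so that sorting does not modify the caller's data.
inline DuplicateStats countDuplicates(std::vector<std::uint64_t> codes)
{
  std::sort(codes.begin(), codes.end());

  DuplicateStats stats;
  for (std::size_t i = 0; i < codes.size();)
  {
    // Find the end of the run of equal codes starting at position i.
    std::size_t j = i;
    while (j < codes.size() && codes[j] == codes[i])
      ++j;

    std::size_t const run = j - i;
    if (run > 3)
      ++stats.codes_over_3_reps;
    if (run > 1)
      stats.points_with_duplicate += run;
    stats.max_duplicates = std::max(stats.max_duplicates, run);

    i = j;
  }
  return stats;
}
```

For example, the codes {5, 1, 5, 7, 7, 7, 7, 9} contain one code with more than 3 points (7), six points that share a code (two with 5, four with 7), and a maximum of 4 points per code.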

Number of hierarchy levels

Because of the way Karras' LBVH construction works, increasing resolution in Morton indices will typically result in a deeper hierarchy. This is particularly true for a hierarchy with multiple duplicate Morton codes corresponding to a single Morton box.

Dataset	# levels with 32-bit M codes	# levels with 64-bit M codes
HACC 37M	43	49
GeoLife	52	72

Despite this, as I will show, the queries actually run faster.

Results (master `707a5b3` vs branch `c5852e9`, `5407512` after rebase on master)

Standard Benchmark

morton64_results.zip

Summary: the expected penalty in construction is < 20% everywhere except HIP, where it is up to 50% (is HIP Thrust not optimized for long long?).

CUDA V100

10-20% penalty in construction. No difference in any of the searches.

-BM_construction<ArborX::BVH<Cuda>>/10000/0/manual_time_median                                              +0.1449         +0.1120           943          1080          1216          1352
-BM_construction<ArborX::BVH<Cuda>>/10000/1/manual_time_median                                              +0.1433         +0.1117           944          1079          1217          1353
+BM_construction<ArborX::BVH<Cuda>>/100000/0/manual_time_median                                             -0.1974         -0.1770          2499          2006          2781          2289
+BM_construction<ArborX::BVH<Cuda>>/100000/1/manual_time_median                                             -0.1748         -0.1899          2414          1992          2695          2184
-BM_construction<ArborX::BVH<Cuda>>/1000000/0/manual_time_median                                            +0.0956         +0.0916          5709          6255          6109          6669
-BM_construction<ArborX::BVH<Cuda>>/1000000/1/manual_time_median                                            +0.0965         +0.0909          5702          6253          6111          6667
-BM_construction<ArborX::BVH<Cuda>>/10000000/0/manual_time_median                                           +0.2307         +0.2033         27446         33777         28882         34755
-BM_construction<ArborX::BVH<Cuda>>/10000000/1/manual_time_median                                           +0.1841         +0.1684         27117         32108         28696         33528
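A note on reading these rows, under the assumption that they come from Google Benchmark's `compare.py` tool (the header row is not shown here): the columns would then be the relative change in time, the relative change in CPU time, old time, new time, old CPU time, and new CPU time. A leading `-` marks a slowdown and `+` a speedup, which matches the signs in the rows. The relative change is then simply the following (the function name is hypothetical):

```cpp
// Relative change between a baseline and a contender measurement;
// positive means the contender (64-bit branch) is slower.
inline double relativeChange(double old_value, double new_value)
{
  return (new_value - old_value) / old_value;
}
```

Applied to the rounded times of the first row (943 and 1080), this gives roughly 0.145, in line with the reported +0.1449, which was presumably computed from unrounded values.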

HIP MI100

Up to 50% penalty in construction (which is significantly higher than Cuda; possibly HIP Thrust is less optimized for sorting long long than Cuda Thrust). No difference in radius and knn (except for the smallest size).

-BM_construction<ArborX::BVH<HIP>>/10/0/manual_time_median                                                 +0.4003         +0.3977           638           893           640           895
-BM_construction<ArborX::BVH<HIP>>/10/1/manual_time_median                                                 +0.3885         +0.3893           644           894           645           896
-BM_construction<ArborX::BVH<HIP>>/100/0/manual_time_median                                                +0.4861         +0.4846           662           983           664           986
-BM_construction<ArborX::BVH<HIP>>/100/1/manual_time_median                                                +0.4772         +0.4748           666           984           669           987
-BM_construction<ArborX::BVH<HIP>>/1000/0/manual_time_median                                               +0.2786         +0.2778           764           977           766           979
-BM_construction<ArborX::BVH<HIP>>/1000/1/manual_time_median                                               +0.2779         +0.2778           768           982           770           984
-BM_construction<ArborX::BVH<HIP>>/10000/0/manual_time_median                                              +0.3155         +0.3144           813          1069           815          1072
-BM_construction<ArborX::BVH<HIP>>/10000/1/manual_time_median                                              +0.3080         +0.3067           820          1072           823          1075
-BM_construction<ArborX::BVH<HIP>>/100000/0/manual_time_median                                             +0.2224         +0.2173          1041          1272          1064          1295
-BM_construction<ArborX::BVH<HIP>>/100000/1/manual_time_median                                             +0.2176         +0.2119          1060          1291          1084          1314
-BM_construction<ArborX::BVH<HIP>>/1000000/0/manual_time_median                                            +0.3408         +0.3386          2418          3242          2432          3255
-BM_construction<ArborX::BVH<HIP>>/1000000/1/manual_time_median                                            +0.3300         +0.3290          2426          3227          2440          3242
-BM_construction<ArborX::BVH<HIP>>/10000000/0/manual_time_median                                           +0.4351         +0.4351         11948         17146         11961         17166
-BM_construction<ArborX::BVH<HIP>>/10000000/1/manual_time_median                                           +0.3939         +0.3941         11804         16455         11815         16471
+BM_radius_search<ArborX::BVH<HIP>>/10/10/10/1/0/0/2/manual_time_median                                    -0.1136         -0.1284           941           834          1017           887
+BM_radius_callback_search<ArborX::BVH<HIP>>/10/10/10/1/0/0/2/manual_time_median                           -0.0598         -0.0721           569           535           626           581
+BM_radius_search<ArborX::BVH<HIP>>/10/10/10/1/0/1/3/manual_time_median                                    -0.1175         -0.1322           935           825          1012           878
+BM_knn_search<ArborX::BVH<HIP>>/10/10/10/1/0/2/manual_time_median                                         -0.1728         -0.1825          1128           933          1206           986
+BM_knn_callback_search<ArborX::BVH<HIP>>/10/10/10/1/0/2/manual_time_median                                -0.1185         -0.1266           639           563           694           606

Serial (Power9)

Somehow, the construction is faster on Power9 for small sizes, with no difference for larger ones. No difference in radius and knn (except in knn for the smallest size).

+BM_construction<ArborX::BVH<Serial>>/10/0/manual_time_median                                              -0.5081         -0.5208           250           123           279           134
+BM_construction<ArborX::BVH<Serial>>/10/1/manual_time_median                                              -0.5050         -0.5175           249           123           278           134
+BM_construction<ArborX::BVH<Serial>>/100/0/manual_time_median                                             -0.4652         -0.4801           263           141           293           152
+BM_construction<ArborX::BVH<Serial>>/100/1/manual_time_median                                             -0.4670         -0.4828           263           140           293           151
+BM_construction<ArborX::BVH<Serial>>/1000/0/manual_time_median                                            -0.3131         -0.3374           377           259           406           269
+BM_construction<ArborX::BVH<Serial>>/1000/1/manual_time_median                                            -0.3156         -0.3399           375           257           404           267
+BM_knn_search<ArborX::BVH<Serial>>/10/10/10/1/0/2/manual_time_median                                      -0.2418         -0.2383            78            59            79            60
+BM_knn_callback_search<ArborX::BVH<Serial>>/10/10/10/1/0/2/manual_time_median                             -0.3538         -0.3446            55            36            57            37
+BM_knn_search<ArborX::BVH<Serial>>/10/10/10/1/1/3/manual_time_median                                      -0.2498         -0.2465            78            58            79            60
+BM_knn_callback_search<ArborX::BVH<Serial>>/10/10/10/1/1/3/manual_time_median                             -0.3547         -0.3453            56            36            57            37

Serial (AMD EPYC)

10% penalty in construction and 3 random slowdowns in radius search (all for very small sizes).

-BM_construction<ArborX::BVH<Serial>>/10000/0/manual_time_median                                           +0.0857         +0.0848           467           507           472           512
-BM_construction<ArborX::BVH<Serial>>/10000/1/manual_time_median                                           +0.0745         +0.0736           481           517           486           521
-BM_construction<ArborX::BVH<Serial>>/100000/0/manual_time_median                                          +0.0748         +0.0749          4302          4624          4307          4629
-BM_construction<ArborX::BVH<Serial>>/100000/1/manual_time_median                                          +0.0802         +0.0801          4380          4731          4384          4735
-BM_construction<ArborX::BVH<Serial>>/500000/0/manual_time_median                                          +0.0949         +0.0948         24763         27113         24771         27120
-BM_construction<ArborX::BVH<Serial>>/500000/1/manual_time_median                                          +0.0906         +0.0906         25003         27269         25011         27276
-BM_radius_callback_search<ArborX::BVH<Serial>>/100/100/10/1/0/0/2/manual_time_median                      +0.4387         +0.4297            44            64            45            65
-BM_radius_search<ArborX::BVH<Serial>>/100/100/10/1/0/1/3/manual_time_median                               +0.1194         +0.1179            66            74            67            75
-BM_radius_callback_search<ArborX::BVH<Serial>>/100/100/10/1/0/1/3/manual_time_median                      +0.0527         +0.0512            28            30            29            31

V100 (also using 64-bit for sorting queries)

Recording here why I opted out of using 64-bit for sorting queries (using eb43475 for the branch):

-BM_radius_search<ArborX::BVH<Cuda>>/10000/10000/1/1/0/0/2/manual_time_median                               +0.1556         +0.1382           823           950           920          1047
-BM_radius_callback_search<ArborX::BVH<Cuda>>/10000/10000/1/1/0/0/2/manual_time_median                      +0.3172         +0.2812           431           568           487           624
-BM_knn_search<ArborX::BVH<Cuda>>/10000/10000/1/1/0/2/manual_time_median                                    +0.1209         +0.1098          1058          1186          1155          1282
-BM_knn_callback_search<ArborX::BVH<Cuda>>/10000/10000/1/1/0/2/manual_time_median                           +0.2131         +0.1951           627           760           682           815
-BM_radius_search<ArborX::BVH<Cuda>>/10000/10000/1/1/0/1/3/manual_time_median                               +0.1614         +0.1435           797           925           893          1021
-BM_radius_callback_search<ArborX::BVH<Cuda>>/10000/10000/1/1/0/1/3/manual_time_median                      +0.3267         +0.2886           418           554           474           610
-BM_knn_search<ArborX::BVH<Cuda>>/10000/10000/1/1/1/3/manual_time_median                                    +0.1178         +0.1071          1089          1218          1186          1313
-BM_knn_callback_search<ArborX::BVH<Cuda>>/10000/10000/1/1/1/3/manual_time_median                           +0.2000         +0.1841           668           802           724           857
-BM_radius_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/0/2/manual_time_median                            +0.3959         +0.3174          2558          3571          2918          3844
-BM_radius_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/0/2/manual_time_median                   +0.6240         +0.5986          1370          2225          1433          2292
-BM_knn_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/2/manual_time_median                                 +0.1655         +0.1536          4421          5152          4769          5501
-BM_knn_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/2/manual_time_median                        +0.2005         +0.1962          3646          4377          3712          4440
-BM_radius_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/1/3/manual_time_median                            +0.9732         +0.7810          1465          2892          1819          3240
-BM_radius_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/1/3/manual_time_median                   +0.6575         +0.6244          1197          1983          1262          2049
-BM_knn_search<ArborX::BVH<Cuda>>/100000/100000/10/1/1/3/manual_time_median                                 +0.1462         +0.1319          5092          5836          5443          6161
-BM_knn_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/1/3/manual_time_median                        +0.1701         +0.1624          4479          5241          4540          5278
-BM_radius_callback_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/0/2/manual_time_median                 +0.0910         +0.0797          6210          6775          6922          7474
-BM_radius_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/1/3/manual_time_median                          +0.0871         +0.0801          6093          6624          6701          7238
-BM_radius_callback_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/1/3/manual_time_median                 +0.1712         +0.1382          3276          3837          3988          4539
-BM_radius_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/0/0/2/manual_time_median                         +0.1489         +0.1461         41810         48033         42602         48827
-BM_radius_callback_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/0/0/2/manual_time_median                +0.3277         +0.3034         18924         25126         20394         26581
-BM_knn_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/0/2/manual_time_median                              +0.1395         +0.1394         43987         50124         44657         50880
-BM_knn_callback_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/0/2/manual_time_median                     +0.2065         +0.1977         30102         36319         31547         37784
-BM_radius_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/0/1/3/manual_time_median                         +0.2700         +0.2613         22607         28711         23286         29371
-BM_radius_callback_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/0/1/3/manual_time_median                +0.5856         +0.5129         10407         16501         11867         17953
-BM_knn_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/1/3/manual_time_median                              +0.1060         +0.1039         55311         61175         56071         61897
-BM_knn_callback_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/1/3/manual_time_median                     +0.1365         +0.1316         42923         48780         44384         50224

Algorithms on actual datasets

Compared to the benchmark, where we are using uniform distributions (whether in volume or on the surface), real datasets exhibit different characteristics. The data is typically more localized, with density increased towards interesting features (clusters in HACC, roads in GPS tracking datasets like GeoLife, etc).

DBSCAN

Run with minpts = 2 to stress just a single traversal.

HACC

Run with ./ArborX_DBSCAN.exe --core_min_size 2 --eps [eps] --binary --filename [filename]

eps	V100 (32-bit)	V100 (64-bit)	MI100 (32-bit)	MI100 (64-bit)	A100 (32-bit)	A100 (64-bit)
0.001	0.169	0.190	0.190	0.188	0.058	0.049
0.01	0.219	0.200	0.226	0.203	0.141	0.069
0.042	0.676	0.505	0.661	0.505	0.433	0.225
0.1	1.925	1.663	1.933	1.662	1.198	0.794

GeoLife

I only ran with 5M samples (out of 24M total points), as master becomes unbelievably slow with increasing sample sizes (as the density increases).

eps	V100 (32-bit)	V100 (64-bit)	MI100 (32-bit)	MI100 (64-bit)	A100 (32-bit)	A100 (64-bit)
0.001	3.090	0.261	2.500	0.377	3.610	0.136
0.01	18.568	3.829	8.423	3.649	12.341	1.191
0.05	104.478	13.096	41.826	15.011	51.725	4.599

Nearest neighbor

Here we use the time for the first iteration of Boruvka in MST (with --core-min-size 1 --algorithm mst) as our time for the nearest neighbor. Note that it is an approximation, as we already set the radius to something pretty small before the traversal.

HACC

V100 (32-bit)	V100 (64-bit)	MI100 (32-bit)	MI100 (64-bit)	A100 (32-bit)	A100 (64-bit)
0.246	0.143	0.188	0.123	0.218	0.133

GeoLife

V100 (32-bit)	V100 (64-bit)	MI100 (32-bit)	MI100 (64-bit)	A100 (32-bit)	A100 (64-bit)
93.912	0.141	57.698	0.162	36.652	0.150

Further thoughts

While the situation seems to be fully resolved for the HACC dataset, GeoLife may still be problematic. It does seem that the overall approach is not really suitable for this kind of dataset, as it is very far from the ray tracing workloads that BVH was developed for. We may want to explore other approaches, like recursive partitioning of the data (e.g., kd-tree). Something like this paper. But it will be a whole new area to explore, and it's low priority for us given the lack of interested applications.

aprokop · 2022-02-21T06:56:53Z

Currently the tests fail because some of them have a point at scene_bounding_box.maxCorner(). The problem is the computation of delta() in the tree construction. LLONG_MAX is actually a valid Morton code for 64-bit codes, as they use 63 bits, which leads to a bug in the hierarchy construction.

There are two approaches to fix this.

We could switch to using 60 bits instead of 63 for the 64-bit Morton codes.
Alternatively, we can shift the values computed in delta() down by one, as deltas are only ever compared with each other:

--- a/src/details/ArborX_DetailsTreeConstruction.hpp
+++ b/src/details/ArborX_DetailsTreeConstruction.hpp
@@ -206,7 +206,15 @@ public:
     // Morton comparison. Thus, we add INT_MIN to it.
     // We also avoid if/else statement by doing a "x + !x*<blah>" trick.
     auto x = _sorted_morton_codes(i) ^ _sorted_morton_codes(i + 1);
-    return x + (!x) * (LLONG_MIN + (i ^ (i + 1)));
+    if (x != 0)
+    {
+      // When using 63 bits for Morton codes, the LLONG_MAX is actually a valid
+      // code. As we want the return statement above to return a value always
+      // greater than anything here, we downshift by 1.
+      return x - 1;
+    }
+
+    return LLONG_MIN + (i ^ (i + 1));
   }

   KOKKOS_FUNCTION Node *getNodePtr(int i) const

Because LLONG_MIN + UINT_MAX < -1, it works. The only cost is O(n) additions.
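The ordering argument can be checked with a standalone sketch of the patched function. It assumes 63-bit codes stored as `unsigned long long`; the free function `shiftedDelta` and its raw-pointer interface are hypothetical stand-ins for the member function in the diff.

```cpp
#include <climits>

// Standalone stand-in for the patched delta(): returns a value used only for
// comparisons between neighboring pairs of sorted Morton codes.
inline long long shiftedDelta(unsigned long long const *sorted_codes, int i)
{
  // With 63-bit codes the XOR always fits in a non-negative long long.
  long long const x =
      static_cast<long long>(sorted_codes[i] ^ sorted_codes[i + 1]);

  if (x != 0)
  {
    // Distinct codes: result lies in [0, LLONG_MAX - 1], so it can no
    // longer be equal to LLONG_MAX.
    return x - 1;
  }

  // Duplicate codes: fall back to the indices; the result is negative and
  // therefore smaller than any value produced for distinct codes.
  return LLONG_MIN + (i ^ (i + 1));
}
```

For the sorted codes {0, 0, LLONG_MAX}, the duplicate pair yields LLONG_MIN + 1, while the pair (0, LLONG_MAX) yields LLONG_MAX - 1 instead of LLONG_MAX.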

I don't think this fundamentally affects any of the posted results.

dalg24

Partial review. Haven't had a chance to look into enabling both the use of unsigned int and unsigned long long yet.

src/details/ArborX_DetailsMortonCode.hpp

test/tstDetailsTreeConstruction.cpp

src/details/ArborX_DetailsTreeConstruction.hpp

aprokop · 2022-02-23T16:55:25Z

> Partial review. Haven't had a chance to look into enabling both the use of unsigned int and unsigned long long yet.

@dalg24 It does not have to be a part of this PR. The default behavior will still be changed to use 64-bit.

aprokop · 2022-03-02T00:57:38Z

Force rebased on master due to conflicts with #642.

src/details/ArborX_DetailsTreeConstruction.hpp

aprokop added the performance label Feb 21, 2022

aprokop marked this pull request as draft February 21, 2022 06:32

aprokop marked this pull request as ready for review February 21, 2022 07:05

aprokop force-pushed the morton_64bit_default branch from 2b69f24 to 22c53c4 February 22, 2022 02:14

aprokop requested a review from dalg24 February 22, 2022 19:14

dalg24 reviewed Feb 22, 2022

aprokop force-pushed the morton_64bit_default branch from 22c53c4 to b2f84ee February 22, 2022 20:46

dalg24 mentioned this pull request Feb 23, 2022

Fix a warning for 64-bit 2D Morton codes #640

Merged

aprokop force-pushed the morton_64bit_default branch from b2f84ee to 82f9f5f February 23, 2022 16:56

aprokop added 3 commits March 1, 2022 19:56

Use 64bit Morton codes by default

c5852e9

Fix a bug in the hierarchy construction when using 63 bits for Morton…

3c194b2

… codes

Address review comments

31192a8

aprokop force-pushed the morton_64bit_default branch from 82f9f5f to 31192a8 March 2, 2022 00:56

dalg24 reviewed Mar 3, 2022

src/details/ArborX_DetailsTreeConstruction.hpp

dalg24 approved these changes Mar 3, 2022

dalg24 merged commit 6460d8c into arborx:master Mar 3, 2022

aprokop deleted the morton_64bit_default branch March 4, 2022 00:23

dalg24 mentioned this pull request Mar 4, 2022

Enable alternate projection onto space filling curve #646

Merged
