Use 64-bit Morton indices by default in the BVH construction #637

Merged
merged 3 commits into from
Mar 3, 2022

@aprokop aprokop commented Feb 21, 2022

Original motivation

ArborX is unable to process the GeoLife dataset (https://www.microsoft.com/en-us/research/publication/geolife-gps-trajectory-dataset-user-guide/) in any reasonable time (i.e., it takes hundreds of seconds to find the closest neighbor for this 24M-point problem). The problem is that hundreds of thousands of points are assigned the same Morton index. As we just "randomly" combine such points during the hierarchy construction, this leads to dramatic overlap in the bounding volumes at the lowest levels.
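To make the collision problem concrete, here is a minimal sketch of 3D Morton encoding at two resolutions. It assumes points are already normalized to the unit cube, and that the 32-bit variant uses 10 bits per dimension while the 64-bit variant uses 21 bits per dimension (63 bits total). The function names `quantize`, `interleave3`, and `morton3D` are hypothetical and not ArborX API, and the bit-by-bit loop is chosen for readability rather than speed.

```cpp
#include <cassert>
#include <cstdint>

// Map a coordinate in [0, 1] to a grid cell index in [0, 2^bits - 1].
inline std::uint64_t quantize(double x, int bits)
{
  assert(x >= 0.0 && x <= 1.0);
  std::uint64_t const cells = std::uint64_t{1} << bits;
  auto const cell = static_cast<std::uint64_t>(x * static_cast<double>(cells));
  return cell < cells ? cell : cells - 1; // clamp x == 1.0 into the last cell
}

// Interleave the lowest `bits` bits of three cell indices (x is most
// significant within each bit triple).
inline std::uint64_t interleave3(std::uint64_t ix, std::uint64_t iy,
                                 std::uint64_t iz, int bits)
{
  std::uint64_t code = 0;
  for (int b = 0; b < bits; ++b)
  {
    code |= ((ix >> b) & 1u) << (3 * b + 2);
    code |= ((iy >> b) & 1u) << (3 * b + 1);
    code |= ((iz >> b) & 1u) << (3 * b);
  }
  return code;
}

// Morton code of a point in the unit cube with `bits` bits per dimension
// (10 for the assumed 32-bit layout, 21 for the assumed 64-bit layout).
inline std::uint64_t morton3D(double x, double y, double z, int bits)
{
  return interleave3(quantize(x, bits), quantize(y, bits), quantize(z, bits),
                     bits);
}
```

With these assumptions, two points that differ by 1e-4 in x (for instance (0.5, 0.5, 0.5) and (0.5001, 0.5, 0.5)) fall into the same cell at 10 bits per dimension but into different cells at 21 bits per dimension. Note also that the point at the maximum corner maps to a code with all 63 bits set.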

Statistics for some datasets

I have checked the number of duplicate codes and their distribution for two 3D datasets: HACC and GeoLife (those are the ones I had readily available).

"# M codes (>3 reps)": number of Morton codes that have more than 3 duplicates, i.e., the number of Morton grid cells with more than 3 points in them
"# points with duplicate M code": number of points that share their Morton code with another point
"max # duplicate": highest number of points with the same Morton code

| Dataset | # 32-bit M codes (>3 reps) | # points with duplicate 32-bit M code | max # duplicate 32-bit | # 64-bit M codes (>3 reps) | # points with duplicate 64-bit M code | max # duplicate 64-bit |
|---|---|---|---|---|---|---|
| HACC 37M | 1,311,912 [7.52%] | 23,639,027 [64.06%] | 3,569 | 0 [0.00%] | 528 [0.00%] | 2 |
| GeoLife | 9,863 [60.24%] | 24,872,748 [99.98%] | 5,706,009 | 1,002,529 [8.02%] | 16,217,910 [65.19%] | 8,009 |
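For reference, the three statistics defined above can be computed in a single pass over the sorted Morton codes. The following is a sketch only: the `DuplicateStats` struct and `countDuplicates` function are hypothetical names, not the script used to produce the table, and "more than 3 reps" is interpreted as a group of more than 3 points sharing a code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DuplicateStats
{
  std::size_t codes_over_3_reps = 0;      // codes shared by more than 3 points
  std::size_t points_with_duplicate = 0;  // points sharing a code with another
  std::size_t max_duplicates = 0;         // size of the largest group
};

// Expects the codes to be sorted in ascending order.
inline DuplicateStats countDuplicates(std::vector<std::uint64_t> const &sorted)
{
  assert(std::is_sorted(sorted.begin(), sorted.end()));
  DuplicateStats stats;
  for (std::size_t begin = 0; begin < sorted.size();)
  {
    // Find the end of the run of equal codes starting at `begin`.
    std::size_t end = begin;
    while (end < sorted.size() && sorted[end] == sorted[begin])
      ++end;
    std::size_t const count = end - begin;

    if (count > 3)
      ++stats.codes_over_3_reps;
    if (count > 1)
      stats.points_with_duplicate += count;
    stats.max_duplicates = std::max(stats.max_duplicates, count);

    begin = end;
  }
  return stats;
}
```

For example, the codes {1, 2, 2, 5, 5, 5, 5, 9} contain one code with more than 3 reps, 6 points with a duplicate code, and a largest group of 4.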

Number of hierarchy levels

Because of the way Karras' LBVH construction works, increasing resolution in Morton indices will typically result in a deeper hierarchy. This is particularly true for a hierarchy with multiple duplicate Morton codes corresponding to a single Morton box.

| Dataset | # levels with 32-bit M codes | # levels with 64-bit M codes |
|---|---|---|
| HACC 37M | 43 | 49 |
| GeoLife | 52 | 72 |

Despite this, as I will show, the queries actually run faster.

Results (master 707a5b3 vs branch c5852e9; 5407512 after rebase on master)

Standard Benchmark

morton64_results.zip

Summary: expected penalty in the construction, which is < 20% everywhere except HIP where it is 50% (HIP Thrust not optimized for long long?).
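The construction penalty is tentatively attributed here to sorting wider keys. As a host-only illustration of that step, the sketch below computes the permutation that sorts the Morton codes, templated on the code type so the same routine can be timed with 32-bit and 64-bit keys. It uses `std::stable_sort` as a stand-in for the Thrust sort mentioned above; `sortPermutation` is a hypothetical name, not ArborX API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <vector>

// Return the permutation that orders `codes` ascending. Points with equal
// codes keep their original relative order.
template <typename Code>
std::vector<int> sortPermutation(std::vector<Code> const &codes)
{
  assert(codes.size() <=
         static_cast<std::size_t>(std::numeric_limits<int>::max()));
  std::vector<int> permutation(codes.size());
  std::iota(permutation.begin(), permutation.end(), 0);
  std::stable_sort(permutation.begin(), permutation.end(),
                   [&codes](int a, int b) { return codes[a] < codes[b]; });
  return permutation;
}
```

Calling it on the codes {30, 10, 20, 10} yields the permutation {1, 3, 2, 0}. Only the key width changes between the two configurations; the permutation indices stay 32-bit.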

CUDA V100

10-20% penalty in construction. No difference in any of the searches.

-BM_construction<ArborX::BVH<Cuda>>/10000/0/manual_time_median                                              +0.1449         +0.1120           943          1080          1216          1352
-BM_construction<ArborX::BVH<Cuda>>/10000/1/manual_time_median                                              +0.1433         +0.1117           944          1079          1217          1353
+BM_construction<ArborX::BVH<Cuda>>/100000/0/manual_time_median                                             -0.1974         -0.1770          2499          2006          2781          2289
+BM_construction<ArborX::BVH<Cuda>>/100000/1/manual_time_median                                             -0.1748         -0.1899          2414          1992          2695          2184
-BM_construction<ArborX::BVH<Cuda>>/1000000/0/manual_time_median                                            +0.0956         +0.0916          5709          6255          6109          6669
-BM_construction<ArborX::BVH<Cuda>>/1000000/1/manual_time_median                                            +0.0965         +0.0909          5702          6253          6111          6667
-BM_construction<ArborX::BVH<Cuda>>/10000000/0/manual_time_median                                           +0.2307         +0.2033         27446         33777         28882         34755
-BM_construction<ArborX::BVH<Cuda>>/10000000/1/manual_time_median                                           +0.1841         +0.1684         27117         32108         28696         33528

HIP MI100

Up to 50% penalty in construction (which is significantly higher than Cuda; possibly HIP Thrust is less optimized for sorting long long than Cuda Thrust). No difference in radius and knn (except for the smallest size).

-BM_construction<ArborX::BVH<HIP>>/10/0/manual_time_median                                                 +0.4003         +0.3977           638           893           640           895
-BM_construction<ArborX::BVH<HIP>>/10/1/manual_time_median                                                 +0.3885         +0.3893           644           894           645           896
-BM_construction<ArborX::BVH<HIP>>/100/0/manual_time_median                                                +0.4861         +0.4846           662           983           664           986
-BM_construction<ArborX::BVH<HIP>>/100/1/manual_time_median                                                +0.4772         +0.4748           666           984           669           987
-BM_construction<ArborX::BVH<HIP>>/1000/0/manual_time_median                                               +0.2786         +0.2778           764           977           766           979
-BM_construction<ArborX::BVH<HIP>>/1000/1/manual_time_median                                               +0.2779         +0.2778           768           982           770           984
-BM_construction<ArborX::BVH<HIP>>/10000/0/manual_time_median                                              +0.3155         +0.3144           813          1069           815          1072
-BM_construction<ArborX::BVH<HIP>>/10000/1/manual_time_median                                              +0.3080         +0.3067           820          1072           823          1075
-BM_construction<ArborX::BVH<HIP>>/100000/0/manual_time_median                                             +0.2224         +0.2173          1041          1272          1064          1295
-BM_construction<ArborX::BVH<HIP>>/100000/1/manual_time_median                                             +0.2176         +0.2119          1060          1291          1084          1314
-BM_construction<ArborX::BVH<HIP>>/1000000/0/manual_time_median                                            +0.3408         +0.3386          2418          3242          2432          3255
-BM_construction<ArborX::BVH<HIP>>/1000000/1/manual_time_median                                            +0.3300         +0.3290          2426          3227          2440          3242
-BM_construction<ArborX::BVH<HIP>>/10000000/0/manual_time_median                                           +0.4351         +0.4351         11948         17146         11961         17166
-BM_construction<ArborX::BVH<HIP>>/10000000/1/manual_time_median                                           +0.3939         +0.3941         11804         16455         11815         16471
+BM_radius_search<ArborX::BVH<HIP>>/10/10/10/1/0/0/2/manual_time_median                                    -0.1136         -0.1284           941           834          1017           887
+BM_radius_callback_search<ArborX::BVH<HIP>>/10/10/10/1/0/0/2/manual_time_median                           -0.0598         -0.0721           569           535           626           581
+BM_radius_search<ArborX::BVH<HIP>>/10/10/10/1/0/1/3/manual_time_median                                    -0.1175         -0.1322           935           825          1012           878
+BM_knn_search<ArborX::BVH<HIP>>/10/10/10/1/0/2/manual_time_median                                         -0.1728         -0.1825          1128           933          1206           986
+BM_knn_callback_search<ArborX::BVH<HIP>>/10/10/10/1/0/2/manual_time_median                                -0.1185         -0.1266           639           563           694           606

Serial (Power9)

Somehow, the construction is faster on Power9 for small sizes, with no difference for larger ones. No difference in radius and knn (except in knn for the smallest size).

+BM_construction<ArborX::BVH<Serial>>/10/0/manual_time_median                                              -0.5081         -0.5208           250           123           279           134
+BM_construction<ArborX::BVH<Serial>>/10/1/manual_time_median                                              -0.5050         -0.5175           249           123           278           134
+BM_construction<ArborX::BVH<Serial>>/100/0/manual_time_median                                             -0.4652         -0.4801           263           141           293           152
+BM_construction<ArborX::BVH<Serial>>/100/1/manual_time_median                                             -0.4670         -0.4828           263           140           293           151
+BM_construction<ArborX::BVH<Serial>>/1000/0/manual_time_median                                            -0.3131         -0.3374           377           259           406           269
+BM_construction<ArborX::BVH<Serial>>/1000/1/manual_time_median                                            -0.3156         -0.3399           375           257           404           267
+BM_knn_search<ArborX::BVH<Serial>>/10/10/10/1/0/2/manual_time_median                                      -0.2418         -0.2383            78            59            79            60
+BM_knn_callback_search<ArborX::BVH<Serial>>/10/10/10/1/0/2/manual_time_median                             -0.3538         -0.3446            55            36            57            37
+BM_knn_search<ArborX::BVH<Serial>>/10/10/10/1/1/3/manual_time_median                                      -0.2498         -0.2465            78            58            79            60
+BM_knn_callback_search<ArborX::BVH<Serial>>/10/10/10/1/1/3/manual_time_median                             -0.3547         -0.3453            56            36            57            37

Serial (AMD EPYC)

10% penalty in construction and three random slowdowns in radius search (all for very small sizes).

-BM_construction<ArborX::BVH<Serial>>/10000/0/manual_time_median                                           +0.0857         +0.0848           467           507           472           512
-BM_construction<ArborX::BVH<Serial>>/10000/1/manual_time_median                                           +0.0745         +0.0736           481           517           486           521
-BM_construction<ArborX::BVH<Serial>>/100000/0/manual_time_median                                          +0.0748         +0.0749          4302          4624          4307          4629
-BM_construction<ArborX::BVH<Serial>>/100000/1/manual_time_median                                          +0.0802         +0.0801          4380          4731          4384          4735
-BM_construction<ArborX::BVH<Serial>>/500000/0/manual_time_median                                          +0.0949         +0.0948         24763         27113         24771         27120
-BM_construction<ArborX::BVH<Serial>>/500000/1/manual_time_median                                          +0.0906         +0.0906         25003         27269         25011         27276
-BM_radius_callback_search<ArborX::BVH<Serial>>/100/100/10/1/0/0/2/manual_time_median                      +0.4387         +0.4297            44            64            45            65
-BM_radius_search<ArborX::BVH<Serial>>/100/100/10/1/0/1/3/manual_time_median                               +0.1194         +0.1179            66            74            67            75
-BM_radius_callback_search<ArborX::BVH<Serial>>/100/100/10/1/0/1/3/manual_time_median                      +0.0527         +0.0512            28            30            29            31

V100 (also using 64-bit for sorting queries)

This just records why I opted out of using 64-bit codes for sorting queries (using eb43475 for branch):

-BM_radius_search<ArborX::BVH<Cuda>>/10000/10000/1/1/0/0/2/manual_time_median                               +0.1556         +0.1382           823           950           920          1047
-BM_radius_callback_search<ArborX::BVH<Cuda>>/10000/10000/1/1/0/0/2/manual_time_median                      +0.3172         +0.2812           431           568           487           624
-BM_knn_search<ArborX::BVH<Cuda>>/10000/10000/1/1/0/2/manual_time_median                                    +0.1209         +0.1098          1058          1186          1155          1282
-BM_knn_callback_search<ArborX::BVH<Cuda>>/10000/10000/1/1/0/2/manual_time_median                           +0.2131         +0.1951           627           760           682           815
-BM_radius_search<ArborX::BVH<Cuda>>/10000/10000/1/1/0/1/3/manual_time_median                               +0.1614         +0.1435           797           925           893          1021
-BM_radius_callback_search<ArborX::BVH<Cuda>>/10000/10000/1/1/0/1/3/manual_time_median                      +0.3267         +0.2886           418           554           474           610
-BM_knn_search<ArborX::BVH<Cuda>>/10000/10000/1/1/1/3/manual_time_median                                    +0.1178         +0.1071          1089          1218          1186          1313
-BM_knn_callback_search<ArborX::BVH<Cuda>>/10000/10000/1/1/1/3/manual_time_median                           +0.2000         +0.1841           668           802           724           857
-BM_radius_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/0/2/manual_time_median                            +0.3959         +0.3174          2558          3571          2918          3844
-BM_radius_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/0/2/manual_time_median                   +0.6240         +0.5986          1370          2225          1433          2292
-BM_knn_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/2/manual_time_median                                 +0.1655         +0.1536          4421          5152          4769          5501
-BM_knn_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/2/manual_time_median                        +0.2005         +0.1962          3646          4377          3712          4440
-BM_radius_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/1/3/manual_time_median                            +0.9732         +0.7810          1465          2892          1819          3240
-BM_radius_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/1/3/manual_time_median                   +0.6575         +0.6244          1197          1983          1262          2049
-BM_knn_search<ArborX::BVH<Cuda>>/100000/100000/10/1/1/3/manual_time_median                                 +0.1462         +0.1319          5092          5836          5443          6161
-BM_knn_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/1/3/manual_time_median                        +0.1701         +0.1624          4479          5241          4540          5278
-BM_radius_callback_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/0/2/manual_time_median                 +0.0910         +0.0797          6210          6775          6922          7474
-BM_radius_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/1/3/manual_time_median                          +0.0871         +0.0801          6093          6624          6701          7238
-BM_radius_callback_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/1/3/manual_time_median                 +0.1712         +0.1382          3276          3837          3988          4539
-BM_radius_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/0/0/2/manual_time_median                         +0.1489         +0.1461         41810         48033         42602         48827
-BM_radius_callback_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/0/0/2/manual_time_median                +0.3277         +0.3034         18924         25126         20394         26581
-BM_knn_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/0/2/manual_time_median                              +0.1395         +0.1394         43987         50124         44657         50880
-BM_knn_callback_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/0/2/manual_time_median                     +0.2065         +0.1977         30102         36319         31547         37784
-BM_radius_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/0/1/3/manual_time_median                         +0.2700         +0.2613         22607         28711         23286         29371
-BM_radius_callback_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/0/1/3/manual_time_median                +0.5856         +0.5129         10407         16501         11867         17953
-BM_knn_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/1/3/manual_time_median                              +0.1060         +0.1039         55311         61175         56071         61897
-BM_knn_callback_search<ArborX::BVH<Cuda>>/10000000/10000000/1/1/1/3/manual_time_median                     +0.1365         +0.1316         42923         48780         44384         50224

Algorithms on actual datasets

Compared to the benchmark, where we are using uniform distributions (whether in volume or on the surface), real datasets exhibit different characteristics. The data is typically more localized, with density increased towards interesting features (clusters in HACC, roads in GPS tracking datasets like GeoLife, etc).

DBSCAN

Run with minpts = 2 to stress just a single traversal.

HACC

Run with ./ArborX_DBSCAN.exe --core_min_size 2 --eps [eps] --binary --filename [filename]

| eps | V100 (32-bit) | V100 (64-bit) | MI100 (32-bit) | MI100 (64-bit) | A100 (32-bit) | A100 (64-bit) |
|---|---|---|---|---|---|---|
| 0.001 | 0.169 | 0.190 | 0.190 | 0.188 | 0.058 | 0.049 |
| 0.01 | 0.219 | 0.200 | 0.226 | 0.203 | 0.141 | 0.069 |
| 0.042 | 0.676 | 0.505 | 0.661 | 0.505 | 0.433 | 0.225 |
| 0.1 | 1.925 | 1.663 | 1.933 | 1.662 | 1.198 | 0.794 |

GeoLife

I only ran with 5M samples (out of 24M total points), as master becomes unbelievably slow with increasing sample sizes (as the density increases).

| eps | V100 (32-bit) | V100 (64-bit) | MI100 (32-bit) | MI100 (64-bit) | A100 (32-bit) | A100 (64-bit) |
|---|---|---|---|---|---|---|
| 0.001 | 3.090 | 0.261 | 2.500 | 0.377 | 3.610 | 0.136 |
| 0.01 | 18.568 | 3.829 | 8.423 | 3.649 | 12.341 | 1.191 |
| 0.05 | 104.478 | 13.096 | 41.826 | 15.011 | 51.725 | 4.599 |

Nearest neighbor

Here we use the time of the first Boruvka iteration in MST (run with --core-min-size 1 --algorithm mst) as our nearest-neighbor time. Note that this is an approximation, as the radius is already set to some value before the traversal.

HACC

| V100 (32-bit) | V100 (64-bit) | MI100 (32-bit) | MI100 (64-bit) | A100 (32-bit) | A100 (64-bit) |
|---|---|---|---|---|---|
| 0.246 | 0.143 | 0.188 | 0.123 | 0.218 | 0.133 |

GeoLife

| V100 (32-bit) | V100 (64-bit) | MI100 (32-bit) | MI100 (64-bit) | A100 (32-bit) | A100 (64-bit) |
|---|---|---|---|---|---|
| 93.912 | 0.141 | 57.698 | 0.162 | 36.652 | 0.150 |

Further thoughts

While the situation seems to be fully resolved for the HACC dataset, GeoLife may still be problematic. It does seem that the overall approach is not really suitable for this kind of dataset, as it is very far from the ray tracing workloads that BVH was developed for. We may want to explore other approaches, like recursive partitioning of data (e.g., kd-tree). Something like this paper. But it would be a whole new area to explore, and it's low priority for us given the lack of interested applications.

@aprokop aprokop added the performance label ("Something is slower than it should be") Feb 21, 2022
@aprokop aprokop marked this pull request as draft February 21, 2022 06:32
aprokop commented Feb 21, 2022

Currently the tests fail because some of them have a point at scene_bounding_box.maxCorner(). The problem is the computation of delta() in the tree construction: LLONG_MAX is actually a valid 64-bit Morton code, as it uses 63 bits, which leads to a bug in the hierarchy construction.

There are two approaches to fix this.

  1. We could switch to use 60 bits instead of 63 for the 64-bit Morton codes.
  2. Alternatively, we can shift the values computed in delta() down by one, as deltas are only ever compared with each other:
--- a/src/details/ArborX_DetailsTreeConstruction.hpp
+++ b/src/details/ArborX_DetailsTreeConstruction.hpp
@@ -206,7 +206,15 @@ public:
     // Morton comparison. Thus, we add INT_MIN to it.
     // We also avoid if/else statement by doing a "x + !x*<blah>" trick.
     auto x = _sorted_morton_codes(i) ^ _sorted_morton_codes(i + 1);
-    return x + (!x) * (LLONG_MIN + (i ^ (i + 1)));
+    if (x != 0)
+    {
+      // When using 63 bits for Morton codes, the LLONG_MAX is actually a valid
+      // code. As we want the return statement above to return a value always
+      // greater than anything here, we downshift by 1.
+      return x - 1;
+    }
+
+    return LLONG_MIN + (i ^ (i + 1));
   }

   KOKKOS_FUNCTION Node *getNodePtr(int i) const

Because LLONG_MIN + UINT_MAX < -1, it works. The only cost is O(n) additions.
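The ordering argument can be checked with a standalone sketch of the patched function. This is an illustration under assumptions, not the ArborX source: `delta` is written here as a free function over a `std::vector`, and the out-of-range sentinel is assumed to be LLONG_MAX, as the comment in the diff suggests.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical standalone delta(): `sorted_codes` holds Morton codes that use
// at most 63 bits, in ascending order.
inline long long delta(std::vector<unsigned long long> const &sorted_codes,
                       int i)
{
  assert(std::is_sorted(sorted_codes.begin(), sorted_codes.end()));
  int const n = static_cast<int>(sorted_codes.size());

  // Assumed sentinel: must compare greater than any in-range value.
  if (i < 0 || i >= n - 1)
    return LLONG_MAX;

  unsigned long long const x = sorted_codes[i] ^ sorted_codes[i + 1];
  if (x != 0)
  {
    // x can be as large as LLONG_MAX, so shift down by one to stay strictly
    // below the sentinel. The result is in [0, LLONG_MAX - 1].
    return static_cast<long long>(x) - 1;
  }

  // Duplicate codes: fall back to comparing indices. The result is negative,
  // so it is smaller than any value produced by differing codes.
  return LLONG_MIN + (i ^ (i + 1));
}
```

For the sorted codes {0, 0, LLONG_MAX}, the duplicate pair gives a negative delta, the differing pair gives LLONG_MAX - 1, and both stay strictly below the LLONG_MAX sentinel returned for out-of-range indices.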

I don't think this fundamentally affects any of the posted results.

@aprokop aprokop marked this pull request as ready for review February 21, 2022 07:05
@aprokop aprokop force-pushed the morton_64bit_default branch from 2b69f24 to 22c53c4 Compare February 22, 2022 02:14
@aprokop aprokop requested a review from dalg24 February 22, 2022 19:14
@dalg24 dalg24 left a comment

Partial review. Haven't had a chance to look into enabling both the use of unsigned int and unsigned long long yet.

src/details/ArborX_DetailsMortonCode.hpp (outdated, resolved)
test/tstDetailsTreeConstruction.cpp (outdated, resolved)
test/tstDetailsTreeConstruction.cpp (outdated, resolved)
test/tstDetailsTreeConstruction.cpp (outdated, resolved)
src/details/ArborX_DetailsTreeConstruction.hpp (outdated, resolved)
aprokop commented Feb 23, 2022

> Partial review. Haven't had a chance to look into enabling both the use of unsigned int and unsigned long long yet.

@dalg24 It does not have to be a part of this PR. The default behavior will still be changed to use 64-bit.

@aprokop aprokop force-pushed the morton_64bit_default branch from b2f84ee to 82f9f5f Compare February 23, 2022 16:56
@aprokop aprokop force-pushed the morton_64bit_default branch from 82f9f5f to 31192a8 Compare March 2, 2022 00:56
aprokop commented Mar 2, 2022

Force rebased on master due to conflicts with #642.

@dalg24 dalg24 merged commit 6460d8c into arborx:master Mar 3, 2022
@aprokop aprokop deleted the morton_64bit_default branch March 4, 2022 00:23