Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Turn on array_cmath vectorization #119

Closed

Conversation

beomki-yeo
Copy link
Contributor

This PR turns on the CPU vectorization for array_cmath which will be followed by dedicated optimization in #116

Not sure if we need to do this commonly for any other plugins. maybe I also need to add cmake condition that checks if the CPU supoorts avx. And is there any potential risk or downside in enabling vectorization?

At the moment, it seems to be useful as detray CPU benchmark gets faster about 10% by enabling the vectorization.

cmake ../detray -DDETRAY_CUSTOM_SCALARTYPE=float -DCMAKE_BUILD_TYPE=Release -DDETRAY_EIGEN_PLUGIN=OFF -DDETRAY_VC_PLUGIN=OFF -DDETRAY_BUILD_CUDA=OFF
./bin/detray_benchmark_cpu_array

detray main

[beomki@pc-54 detray_build]$ ./bin/detray_benchmark_cpu_array 
2024-04-12T22:47:23+02:00
Running ./bin/detray_benchmark_cpu_array
Run on (12 X 4066.38 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 6.67, 6.82, 5.22
----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
BM_FIND_VOLUMES                               ERROR OCCURRED: 'Benchmark disabled for the toy geometry'
BM_GRID_REGULAR_BIN_CAP1/process_time               6.49 ms         6.35 ms          110
BM_GRID_REGULAR_BIN_CAP4/process_time               9.31 ms         9.30 ms           71
BM_GRID_REGULAR_BIN_CAP25/process_time              21.8 ms         21.8 ms           32
BM_GRID_REGULAR_BIN_CAP100/process_time             52.1 ms         52.0 ms           13
BM_GRID_REGULAR_NEIGHBOR_CAP1/process_time          47.0 ms         46.9 ms           15
BM_GRID_REGULAR_NEIGHBOR_CAP4/process_time           238 ms          237 ms            3
BM_GRID_IRREGULAR_BIN_CAP1/process_time             54.5 ms         54.4 ms           13
BM_GRID_IRREGULAR_NEIGHBOR_CAP1/process_time        93.8 ms         93.6 ms            7
BM_GRID2_REGULAR_BIN_CAP1/process_time              4.07 ms         4.07 ms          173
BM_GRID2_REGULAR_BIN_CAP4/process_time              3.89 ms         3.88 ms          179
BM_GRID2_REGULAR_BIN_CAP25/process_time             11.7 ms         11.6 ms           65
BM_GRID2_REGULAR_BIN_CAP100/process_time            42.2 ms         42.2 ms           17
BM_GRID2_REGULAR_NEIGHBOR_CAP1/process_time         89.1 ms         89.0 ms            8
BM_GRID2_REGULAR_NEIGHBOR_CAP4/process_time          843 ms          841 ms            1
BM_GRID2_IRREGULAR_BIN_CAP1/process_time            51.5 ms         51.4 ms           13
BM_GRID2_IRREGULAR_NEIGHBOR_CAP1/process_time        143 ms          142 ms            5
WARNING: No entries in volume finder
WARNING: No entries in volume finder
BM_INTERSECT_ALL                                     406 ms          405 ms            2
BM_INTERSECT_PLANES                                0.744 ms        0.743 ms          936
BM_INTERSECT_CYLINDERS                              3.49 ms         3.48 ms          201
BM_INTERSECT_PORTAL_CYLINDERS                       3.00 ms         2.99 ms          235
BM_INTERSECT_CONCETRIC_CYLINDERS                    4.22 ms         4.21 ms          167
BM_MASK_CUBOID_3D                                   2604 ms         2600 ms            1
BM_MASK_RECTANGLE_2D                                2272 ms         2268 ms            1
BM_MASK_TRAPEZOID_2D                                3018 ms         3013 ms            1
BM_MASK_DISC_2D                                    18307 ms        18279 ms            1
BM_MASK_RING_2D                                    18281 ms        18211 ms            1
BM_MASK_CYLINDER_3D                                20712 ms        20676 ms            1
BM_MASK_CYLINDER_2D                                18651 ms        18617 ms            1
BM_MASK_CONCENTRIC_CYLINDER_2D                     16388 ms        16358 ms            1
BM_MASK_ANNULUS_2D                                 23672 ms        23628 ms            1
BM_MASK_LINE_CIRCULAR                              21320 ms        21263 ms            1
BM_MASK_LINE_SQUARE                                36643 ms        36561 ms            1

detray with this PR

[beomki@pc-54 detray_build]$ ./bin/detray_benchmark_cpu_array 
2024-04-12T22:57:49+02:00
Running ./bin/detray_benchmark_cpu_array
Run on (12 X 4403.37 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 3.28, 4.77, 4.87
----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
BM_FIND_VOLUMES                               ERROR OCCURRED: 'Benchmark disabled for the toy geometry'
BM_GRID_REGULAR_BIN_CAP1/process_time               5.80 ms         5.80 ms          120
BM_GRID_REGULAR_BIN_CAP4/process_time               8.58 ms         8.57 ms           81
BM_GRID_REGULAR_BIN_CAP25/process_time              29.2 ms         29.1 ms           24
BM_GRID_REGULAR_BIN_CAP100/process_time             49.7 ms         49.6 ms           14
BM_GRID_REGULAR_NEIGHBOR_CAP1/process_time          46.3 ms         46.0 ms           16
BM_GRID_REGULAR_NEIGHBOR_CAP4/process_time           234 ms          234 ms            3
BM_GRID_IRREGULAR_BIN_CAP1/process_time             46.1 ms         46.1 ms           15
BM_GRID_IRREGULAR_NEIGHBOR_CAP1/process_time        78.8 ms         78.7 ms            9
BM_GRID2_REGULAR_BIN_CAP1/process_time              4.19 ms         4.18 ms          164
BM_GRID2_REGULAR_BIN_CAP4/process_time              3.71 ms         3.70 ms          192
BM_GRID2_REGULAR_BIN_CAP25/process_time             11.4 ms         11.4 ms           60
BM_GRID2_REGULAR_BIN_CAP100/process_time            38.2 ms         38.2 ms           18
BM_GRID2_REGULAR_NEIGHBOR_CAP1/process_time         87.2 ms         87.0 ms            8
BM_GRID2_REGULAR_NEIGHBOR_CAP4/process_time          828 ms          827 ms            1
BM_GRID2_IRREGULAR_BIN_CAP1/process_time            47.6 ms         47.6 ms           13
BM_GRID2_IRREGULAR_NEIGHBOR_CAP1/process_time        134 ms          134 ms            5
WARNING: No entries in volume finder
WARNING: No entries in volume finder
BM_INTERSECT_ALL                                     374 ms          374 ms            2
BM_INTERSECT_PLANES                                0.675 ms        0.674 ms         1015
BM_INTERSECT_CYLINDERS                              3.13 ms         3.12 ms          223
BM_INTERSECT_PORTAL_CYLINDERS                       2.81 ms         2.81 ms          247
BM_INTERSECT_CONCETRIC_CYLINDERS                    3.68 ms         3.67 ms          190
BM_MASK_CUBOID_3D                                   2253 ms         2250 ms            1
BM_MASK_RECTANGLE_2D                                1933 ms         1930 ms            1
BM_MASK_TRAPEZOID_2D                                2739 ms         2735 ms            1
BM_MASK_DISC_2D                                    16200 ms        16173 ms            1
BM_MASK_RING_2D                                    16199 ms        16169 ms            1
BM_MASK_CYLINDER_3D                                18466 ms        18437 ms            1
BM_MASK_CYLINDER_2D                                17062 ms        17022 ms            1
BM_MASK_CONCENTRIC_CYLINDER_2D                     15152 ms        15044 ms            1
BM_MASK_ANNULUS_2D                                 22031 ms        21964 ms            1
BM_MASK_LINE_CIRCULAR                              18937 ms        18910 ms            1
BM_MASK_LINE_SQUARE                                33487 ms        33444 ms            1

Copy link
Member

@stephenswat stephenswat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a good idea as it is for three reasons:

  1. This will break the code for any non-AVX2-enabled processor; at best, this should be -march=native so that the compiler can pick up the extensions from the native microachitecture.
  2. -ftree-vectorize is included in -O2, which is fairly common anyway.
  3. Most importantly, these kinds of options should never be propagated upwards by an interface library. The decision of which compiler options to use should be up to whoever declares the translation unit (with the exception of preprocessor macros); forcing them by adding them to an interface target can lead to very hard-to-debug problems.

My suggestion would be to add -march=native to the test and benchmark targets, as these are TUs defined by algebra plugins. The compiler flags for the header files should be left up to the clients.

@beomki-yeo
Copy link
Contributor Author

Ok then i will take the change to the detray benchmark side

Copy link
Member

@krasznaa krasznaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fully agree with Stephen, this is not quite the way to do it.

Note for instance how Vc goes about this. They provide some helper CMake code that allows the clients to use flags for their build that they deem appropriate. But the Vc::Vc library does not enforce any flags / options itself.

In the ATLAS offline code we turn on -march=x86-64-v2 support with:

https://gitlab.cern.ch/atlas/atlasexternals/-/blob/main/Build/AtlasCMake/modules/AtlasCompilerSettings.cmake?ref_type=heads#L103-114

Which I'd aim to replicate in some way in all of the appropriate options files that we have.

(Only with V3 of the standard... 🤔)

@krasznaa
Copy link
Member

Notice also how the ATLAS code skips on setting anything itself if an -march flag is already set. Let's say through the $CXXFLAGS environment variable. (Which is something that we do in some special nightlies of ours.) We may want to look out for such a thing even in the R&D projects. 🤔 (Since one could argue that CMake should never set such things at all. And it would be up to a layer above it, let's say in Spack, where we would request such optimizations. But since I'm not a fan of Spack myself, I'm still okay with CMake making some of these decisions itself...)

@beomki-yeo beomki-yeo closed this Apr 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants