Compile-time and runtime CPU feature (SIMD) detection and dispatch #115

mosra · 2021-03-17T14:43:43Z

Moved from mosra/magnum#306, as such low-level scaffolding is better to have here. Corrade's own algorithms for memory copies or shuffles could benefit from these as well.

Things to do (mostly moved from mosra/magnum#306):

Practical use cases inside Corrade to prove this

Needs to be a part of this PR, otherwise I'd likely merge a state that isn't practically usable.

LB-- · 2021-03-18T00:57:11Z

You may already know this, but on Windows at least with Microsoft's CRT they can't implement std::aligned_alloc because the C++ standard requires it to be able to be freed with std::free, whereas Microsoft's memory allocation routines use a separate free function for freeing aligned memory vs ordinary allocations. This means that you have to account for remembering which free function to call on Windows if you want an owning pointer to be able to refer to both aligned and unaligned memory.

mosra · 2021-03-18T08:33:37Z

> on Windows at least with Microsoft's CRT they can't implement std::aligned_alloc because the C++ standard requires it to be able to be freed with std::free

Thank you, yeah, AFAIK this is one of the reasons it took so long to get in the standard. In my case I'll be implementing this via custom Array deleters anyway, so a different deallocation function should be no problem here :)
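To illustrate the point about deleters, here is a minimal sketch of an owning pointer that remembers which deallocation function to call. It is not Corrade's Array API: `OwnedBytes`, `allocateAligned()`, `allocatePlain()` and `freeAligned()` are hypothetical names made up for this example, and the non-Windows branch assumes a POSIX system providing `posix_memalign()`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <stdlib.h> // posix_memalign() on POSIX systems
#if defined(_WIN32)
#include <malloc.h> // _aligned_malloc(), _aligned_free()
#endif

// Hypothetical owning pointer type: the deleter is stored next to the
// pointer, so aligned and plain allocations can share one type.
using ByteDeleter = void(*)(void*);
using OwnedBytes = std::unique_ptr<void, ByteDeleter>;

// Frees memory returned by allocateAligned() below
inline void freeAligned(void* p) {
#if defined(_WIN32)
    _aligned_free(p); // memory from _aligned_malloc() must not go to free()
#else
    std::free(p);     // posix_memalign() memory is released with free()
#endif
}

// Alignment is assumed to be a power of two and a multiple of sizeof(void*)
inline OwnedBytes allocateAligned(std::size_t alignment, std::size_t size) {
#if defined(_WIN32)
    void* p = _aligned_malloc(size, alignment);
#else
    void* p = nullptr;
    if(posix_memalign(&p, alignment, size) != 0) p = nullptr;
#endif
    return OwnedBytes{p, freeAligned};
}

// Ordinary allocation, paired with the ordinary free()
inline OwnedBytes allocatePlain(std::size_t size) {
    return OwnedBytes{std::malloc(size), [](void* p) { std::free(p); }};
}
```

Because the deleter travels with the pointer, code receiving an `OwnedBytes` does not need to know how the memory was allocated.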

codecov · 2021-03-22T17:27:02Z

Codecov Report

Merging #115 (86bb47f) into master (63a5f4b) will increase coverage by 0.02%.
The diff coverage is 98.40%.

@@            Coverage Diff             @@
##           master     #115      +/-   ##
==========================================
+ Coverage   97.89%   97.92%   +0.02%     
==========================================
  Files         130      132       +2     
  Lines       10402    10739     +337     
==========================================
+ Hits        10183    10516     +333     
- Misses        219      223       +4

Impacted Files	Coverage Δ
src/Corrade/Cpu.h	`100.00% <ø> (ø)`
src/Corrade/Utility/Memory.h	`100.00% <ø> (ø)`
src/Corrade/Containers/StringView.cpp	`98.02% <97.79%> (-0.52%)`	⬇️
src/Corrade/Containers/StringView.h	`93.22% <100.00%> (+3.60%)`	⬆️
src/Corrade/Cpu.cpp	`100.00% <100.00%> (ø)`
src/Corrade/TestSuite/Tester.h	`100.00% <0.00%> (ø)`


mosra · 2021-05-31T15:00:11Z

Cherry-picked the following features to master, as their design is finished and the API won't need any changes anymore:

CORRADE_LIKELY() and CORRADE_UNLIKELY() macros in 2ab85d1
Aligned allocation in c095551

The remaining SIMD-related changes still need a real-world case that proves their usefulness.
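For readers unfamiliar with branch hint macros of the kind cherry-picked above, this is a rough sketch of how such macros are commonly built on GCC and Clang. The `SKETCH_LIKELY()` and `SKETCH_UNLIKELY()` names are hypothetical and this is not necessarily how the Corrade macros are implemented; on other compilers the sketch simply evaluates the condition.

```cpp
#include <cstddef>

// Hypothetical macros for illustration only
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_LIKELY(x) __builtin_expect(!!(x), 1)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_LIKELY(x) (!!(x))
#define SKETCH_UNLIKELY(x) (!!(x))
#endif

// Sums the input, returning -1 as soon as a negative value is seen.
// The error path is marked as unlikely so the compiler can lay out the
// common path as the fall-through one.
inline long sumNonNegative(const int* data, std::size_t size) {
    long sum = 0;
    for(std::size_t i = 0; i != size; ++i) {
        if(SKETCH_UNLIKELY(data[i] < 0)) return -1;
        sum += data[i];
    }
    return sum;
}
```

The hint affects only code layout and optimization, never the result of the condition.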

See? This is what it takes to add a new variant and have it tested.

The 31 bytes case is significantly faster due to this falling back to 2 overlapping SSE2 vectors, but we're interested in the scalar fallback too.

Currently it unconditionally adds -msimd128 when building the test. I expect this to bring immense pain later.

Doesn't seem to make the usual case any slower, but makes the small case 1.5-1.8x faster on x86 and 25% on WASM. Which is quite significant. Couldn't make a common helper to cover both because then it stopped being fast on WASM. Not on x86 tho. Strange, hitting some inlined size limit??

I ... didn't expect to hit such massive instruction portability issues so soon. But even with that it's still running circles around the standard memchr().

There was quite a lot of renaming and shuffling around before then, so be sure to verify only the actually finalized variant and not something else.

While the earlier versions support a certain WIP variant of the intrinsics, it's not all, and some of these were renamed, opcodes renumbered etc. Doesn't really make sense to pretend they're usable in that case. And then emsdk DOES NOT MAKE IT ANY SIMPLER by bundling a prerelease of it, thus we *also* have to check emscripten version to be really sure that SIMD128 can be used. UGH

…est(). Since version 15 Node.js ships with V8 8.6, which also contains the remaining bitmask instructions, so from that version onwards there should be no harm from having SIMD enabled. Older versions have SIMD incomplete, meaning general code won't really run anyway, independently of the flag being set or not.

Need to use NodeJs_VERSION in order to figure out a default for a CMake option. The curse runs deep, yes?

I... can't even. This was supposed to be a simple one-line change, not an investigation spanning several days!!

It surely sounds like there's a lot more buts than actually working features.

Since 9.3 is the stock default on 20.04, it doesn't sound practical to just fail compilation there. I can't tell everyone to apt install g++-9 and then suffer through compiler switching in all their pipelines and CI scripts. Another option I considered was disabling (not defining) *all* CORRADE_ENABLE_* macros on this compiler, but again that would mean Ubuntu 20.04 users would silently have a much worse perf unless they switch compilers or use -march=native. Also far from ideal. Instead I'm just not using trailing return types anymore, which is actually the simplest way to get rid of this problem. Also adding a dedicated test for CORRADE_ENABLE_* and lambdas into CpuTest, since this was so far only tested in production, i.e. inside StringView code. Which is no good. This reverts commit 6615e44.

Comparing regular functions, function pointers, IFUNCs and all that also in an external (dynamic) library.

This was a *rewarding* learning experience.

mosra · 2022-08-02T15:45:20Z

For anyone subscribed to this PR, a post with a detailed overview of the new features has just been published: https://blog.magnum.graphics/backstage/cpu-feature-detection-dispatch/

mosra added this to the 2020.0b milestone Mar 17, 2021

mosra mentioned this pull request Mar 17, 2021

SIMD playground mosra/magnum#306

Draft

33 tasks

mosra force-pushed the simd branch from 4ac3be9 to ec62359 Compare March 17, 2021 14:50

mosra self-assigned this Mar 22, 2021

mosra force-pushed the simd branch 5 times, most recently from b0b6d1d to 28f6249 Compare March 22, 2021 16:18

mosra force-pushed the simd branch from 28f6249 to 10a0b72 Compare May 31, 2021 15:00

This was referenced Mar 31, 2022

String API additions and SIMD optimizations #129

Open

Math: add AABB and bounding sphere algorithms mosra/magnum#557

Closed

mosra force-pushed the simd branch 13 times, most recently from 536bf0b to 8916efb Compare July 10, 2022 00:07

mosra force-pushed the simd branch from f7eb320 to f150b6b Compare August 1, 2022 00:29

mosra added 16 commits August 1, 2022 17:45

Containers: SSE2+BMI1 implementation of StringView::find().

8dd4a33

Containers: AVX2+BMI1 variant of String::find().

4552812

See? This is what it takes to add a new variant and have it tested.

Containers: benchmark both 15 and 31 bytes in AVX2 StringView::find().

07b2027

The 31 bytes case is significantly faster due to this falling back to 2 overlapping SSE2 vectors, but we're interested in the scalar fallback too.
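The "two overlapping vectors" idea can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: `findUpTo32()` and `lowestSetBit()` are hypothetical helpers, the input is assumed to be between 16 and 32 bytes long, and a portable bit-scan loop stands in for the BMI1 instruction mentioned in the commit titles. On targets without SSE2 the sketch falls back to `std::memchr()`.

```cpp
#include <cstddef>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

// Index of the lowest set bit; the mask is assumed to be non-zero
inline int lowestSetBit(unsigned mask) {
    int i = 0;
    while(!(mask & 1u)) { mask >>= 1; ++i; }
    return i;
}

// Hypothetical helper: finds the first occurrence of `c` in `data`,
// assuming 16 <= size <= 32. Returns nullptr if not found.
inline const char* findUpTo32(const char* data, std::size_t size, char c) {
#ifdef SKETCH_HAS_SSE2
    const __m128i needle = _mm_set1_epi8(c);

    // First vector covers bytes [0, 16)
    const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data));
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(a, needle));
    if(mask) return data + lowestSetBit(unsigned(mask));

    // Second vector covers bytes [size - 16, size) and may overlap the
    // first one. Any match it reports lies at offset 16 or later, because
    // the first vector already ruled out everything before that.
    const char* tail = data + size - 16;
    const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(tail));
    mask = _mm_movemask_epi8(_mm_cmpeq_epi8(b, needle));
    if(mask) return tail + lowestSetBit(unsigned(mask));

    return nullptr;
#else
    return static_cast<const char*>(std::memchr(data, c, size));
#endif
}
```

With a 31-byte input the two loads overlap by one byte, so no scalar tail loop is needed for that size.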

Containers: WASM SIMD128 variant of String::find().

71fa0f1

Currently it unconditionally adds -msimd128 when building the test. I expect this to bring immense pain later.

Containers: further optimize unrolled String::find().

a373cef

Containers: add a perf TODO for WASM SIMD128.

d5d1fcf

I ... didn't expect to hit such massive instruction portability issues so soon. But even with that it's still running circles around the standard memchr().

Cpu: test a WASM SIMD128 instruction that's only in the finalized set.

7319e39

There was quite a lot of renaming and shuffling around before then, so be sure to verify only the actually finalized variant and not something else.

doc: work around Emscripten 2.0.20 being extra picky.

b826e5e

modules: retrieve Node.js version in FindNodeJs.

d15c2ea

CMake: add our module path earlier.

f7f6fac

Need to use NodeJs_VERSION in order to figure out a default for a CMake option. The curse runs deep, yes?

CMake: introduce CORRADE_BUILD_TESTS_FORCE_WASM_SIMD128.

d218e78

I... can't even. This was supposed to be a simple one-line change, not an investigation spanning several days!!

package/ci: I tried EMSDK and SIMD. Wasted whole day. Failed.

6dcf752

Cpu: mention an important caveat with CORRADE_ENABLE().

7a7a872

It surely sounds like there's a lot more buts than actually working features.

mosra force-pushed the simd branch from f150b6b to 8b10daa Compare August 1, 2022 15:58

mosra marked this pull request as ready for review August 1, 2022 16:29

mosra added 5 commits August 1, 2022 19:32

Containers: add a perf TODO.

b6552f9

Cpu: add a benchmark for dispatch overhead.

8efcc20

Comparing regular functions, function pointers, IFUNCs and all that also in an external (dynamic) library.
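As a rough illustration of one of the options being compared, the function pointer approach, here is a minimal sketch. It does not use the Cpu library API from this PR: `countBytes()`, `resolveCount()`, `hasSse2()` and both variants are hypothetical, the "unrolled" variant is plain scalar code standing in for a real SIMD implementation, and the feature check assumes the GCC/Clang `__builtin_cpu_supports()` builtin on x86.

```cpp
#include <cstddef>

using CountFn = std::size_t(*)(const char*, std::size_t, char);

// Baseline variant
std::size_t countScalar(const char* d, std::size_t n, char c) {
    std::size_t count = 0;
    for(std::size_t i = 0; i != n; ++i) count += d[i] == c;
    return count;
}

// Stand-in for a SIMD variant: processes four bytes per iteration
std::size_t countUnrolled(const char* d, std::size_t n, char c) {
    std::size_t count = 0, i = 0;
    for(; i + 4 <= n; i += 4)
        count += (d[i] == c) + (d[i + 1] == c) + (d[i + 2] == c) + (d[i + 3] == c);
    for(; i != n; ++i) count += d[i] == c;
    return count;
}

// Runtime feature check; always false where the builtin is unavailable
bool hasSse2() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse2");
#else
    return false;
#endif
}

// Picks a variant based on what the CPU reports
CountFn resolveCount() {
    return hasSse2() ? countUnrolled : countScalar;
}

// Public entry point: the variant is resolved once on first call, every
// call then pays for one indirect call through the stored pointer.
std::size_t countBytes(const char* d, std::size_t n, char c) {
    static const CountFn impl = resolveCount();
    return impl(d, n, c);
}
```

The indirect call and the guarded static initialization are the kind of overhead such a benchmark would measure against a direct call.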

Containers: ARM NEON variant of String::find().

f0f2569

This was a *rewarding* learning experience.

Containers: more perf TODOs.

86bb47f

mosra force-pushed the simd branch from 8b10daa to 86bb47f Compare August 1, 2022 18:47

mosra added the changelog mention added label Aug 1, 2022

mosra merged commit 86bb47f into master Aug 1, 2022

mosra deleted the simd branch August 1, 2022 21:31
