-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compile-time and runtime CPU feature (SIMD) detection and dispatch #115
Conversation
You may already know this, but on Windows at least with Microsoft's CRT they can't implement |
Thank you, yeah, AFAIK this is one of the reasons it took so long to get in the standard. In my case I'll be implementing this via custom |
b0b6d1d
to
28f6249
Compare
Codecov Report
@@ Coverage Diff @@
## master #115 +/- ##
==========================================
+ Coverage 97.89% 97.92% +0.02%
==========================================
Files 130 132 +2
Lines 10402 10739 +337
==========================================
+ Hits 10183 10516 +333
- Misses 219 223 +4
Help us with your feedback. Take ten seconds to tell us how you rate us. |
Cherry-picked the following features to The remainning SIMD-related changes still need a real-world case that proves their usefulness. |
536bf0b
to
8916efb
Compare
See? This is what it takes to add a new variant and have it tested.
The 31 bytes case is significantly faster due to this falling back to 2 overlapping SSE2 vectors, but we're interested in the scalar fallback too.
Currently it unconditionally adds -msimd128 when building the test. I expect this to bring immense pain later.
Doesn't seem to make the usual case any slower, but makes the small case 1.5-1.8x faster on x86 and 25% on WASM. Which is quite significant. Couldn't make a common helper to cover both because then it stopped being fast on WASM. Not on x86 tho. Strange, hitting some inlined size limit??
I ... didn't expect to hit such massive instruction portability issues so soon. But even with that it's still running circles around the standard memchr().
There was quite a lot of renaming and shuffling around before then, so be sure to verify only the actually finalized variant and not something else.
While the earlier versions support a certain WIP variant of the intrinsics, it's not all, and some of these were renamed, opcodes renumbered etc. Doesn't really make sense to pretend they're usable in that case. And then emsdk DOES NOT MAKE IT ANY SIMPLER by bundling a prerelease of it, thus we *also* have to check emscripten version to be really sure that SIMD128 can be used. UGH
…est(). Snce version 15 Node.js ships with V8 8.6, which contains also the remaining bitmask instructions, so from that version onwards there should be no harm from having SIMD enabled. Older versions have SIMD incomplete, meaning general code won't really run anyway, independently of the flag being set or not.
Need to use NodeJs_VERSION in order to figure out a default for a CMake option. The curse runs deep, yes?
I... can't even. This was supposed to be a simple one-line change, not an investigation spanning several days!!
It surely sounds like there's a lot more buts than actually working features.
Since 9.3 is the stock default on 20.04, it doesn't sound practical to just fail compilation there. I can't tell everyone to apt install g++-9 and then suffer through compiler switching in all their pipelines and CI scripts. Another option I considered was disabling (not defining) *all* CORRADE_ENABLE_* macros on this compiler, but again that would mean Ubuntu 20.04 users would silently have a much worse perf unless they switch compilers or use -march=native. Also far from ideal. Instead I'm just not using trailing return types anymore, which is actually the simplest way to get rid of this problem. Also adding a dedicated test for CORRADE_ENABLE_* and lambdas into CpuTest, since this was so far only tested in production, i.e. inside StringView code. Which is no good. This reverts commit 6615e44.
Comparing regular functions, function pointers, IFUNCs and all that also in an external (dynamic) library.
This was a *rewarning* learning experience.
For anyone subscribed to this PR, a post with a detailed overview of the new features has just been published: https://blog.magnum.graphics/backstage/cpu-feature-detection-dispatch/ |
Moved from mosra/magnum#306, as such low-level scaffolding is better to have here. Corrade's own algorithms for memory copies or shuffles could benefit from these as well.
Things to do (mostly moved from mosra/magnum#306):
Utility::System::alignedAllocation()
Utility::allocateAligned()
that uses posix_memalign (or the Windows equivalent), since aligned allocations are not available until C++17.Emscripten alternative?For other platforms go with the "dumb" way of allocatingalignment - 1
and then offseting. -- c095551CORRADE_LIKELY()
andCORRADE_UNLIKELY()
annotations including the C++20 attribute support -- 2ab85d1Addmeh, my use case involves calling large functions that don't suffer from common microbenchmark pitfallsDoNotOptimize
andClobberMemory
utils to TestSuite to make benchmarking more predictableNon-deprecated Release build on GCC 4.8.5 on the CI fails to detect any runtime features, investigateI'm stupid, usedCORRADE_ASSERT()
and notCORRADE_ASSERT_OUTPUT()
, good thing I have a build without assertions!should I name themor not inMath::SseSimdT
(SimdSseT
?) instead to avoid colon cancer?Math
at all,but rather in root,in CorradeSimd::SseT
, as there will be non-math things using those as well?should I put them into the Tags.h header to avoid having an excessive number of extremely small headers? probably not since the runtime dispatch thingy may be quite involved (including system headers etc)Corrade/Simd.h
foo({})
should not pickfoo(Simd::ScalarT)
and should also not result in ambiguous overloads)T{typename T::Init{}}
? any chance to run a test case directly astestFoo<Simd::Sse3>()
without having to specify the type as well (testFoo<Simd::Sse3T, Simd::Sse3>()
is ew) --Simd::tag<T>()
andSimd::feature<T>()
is itSimd::DefaultT
typedef that's one of these based on what the code was compiled forfor a runtime dispatch (and testing?)Simd::Runtime{}
orSimd::Detect{}
? (doesn't fit well if there would be runtime params to it --Simd::Detect{Simd::Avx2}
feels strange and confusing, butSimd::Runtime{Simd::Avx2}
correctly suggest that an AVX2 path was chosen at runtime)Simd::Features{}
addFma
,Popcnt
, ... tags not derived from anything (otherwise we'd have diamond inheritance and that's only solveable withvirtual
inheritance which has non-zero runtime impact)add implicit conversion operators to the tag types for pathological combinations likeuseless, as inheritance gets picked over conversion operators and soSimd::Avx2Fma
->Simd::AvxFma
, ifSimd::Avx*Fma
are types derived fromSimd::Avx*
andSimd::Fma
?Avx2Fma
->Scalar
has a higher priority thanAvx2Fma
->AvxFma
:(some other option? could be also e.g.foo(Simd::Avx2T, Simd::FmaT, ...)
but then there needs to be a template dispatcher that takes combined flags and expands them this waytagged functions have different signatures and while it makes dispatching easier, putting them into a runtime dispatch table is impossible unless wrapped in lambdassolved nowdocument the decision process on why the tags are ordered like this(maybe reuse the diagram from the blog post) -- no suspicious decision to document after all, but the diagram is still niceShould the particular tagged implementations be exposed as well? (if not, then the SIMD tags are useless to have publicly)nope, only the high-level dispatcher taking theSimd::Features
"enum", and only internallyThink about doc workflow -- combine their docs into a single documented function explaining what tags can be used instead of having it repeated 4+ times? or just useobsolete as only the high-level dispatcher is exposed@overload
and, given the tags are implicit, the user doesn't really need to know what all variants are there --> ideally i should fix the doc generator to collapse the overloads together alreadyinvestigate FMVnot portable, can't use, need to do our own dispatch insteadGCC is warning about unused functions in the multi-versioned case, not sure why? is that a bug?obsoleteinvestigate what subset of FMV is available on GCC 4.8 (is AVX2 there?)obsoleteany possibility of FMV-like functionality on MSVC or we need to select at compile timethere isn'tSimd::Features
is bundling libcpuid an option?i don't want any dependenciesor https://github.com/Mysticial/FeatureDetector? or reimplement it ourselves?i don't want any dependenciesis _may_i_use_cpu_feature() a thing on AMD? on all compilers? or should i just use cpuid insteadusing cpuid, everything higher-level is too fragile and buggySimd::Features
instancetarget
attribute, check if that works on Clang as well and what's the alternative on MSVCSimd.cpp
the obvious name conflicts withCORRADE_TARGET_*
CORRADE_ENABLE_AVX_FMA
, e.g., also defined only if under x86 and the compiler supports that so we can use it also as an#ifdef
around the functions (and it'll be defined but empty on MSVC, for example)name()
to it, similarly toMath::TypeTraits
in Magnum -- need this for labeling testsSimd::Avx2|Simd::Fma
should result in a new type and not a runtime Features flag to avoid the need forAvx2Fma
,AvxFma
,Avx2FmaF16c
... combined tags, in function signatures it's thendecltype(Simd::Avx2|Simd::Fma)
but hidden behind a macro because the reality is always more complex than just one decltypeprovide something likepostponed, possible if IFUNC is not enabled to get env accessGLIBC_TUNABLES
to override CPU feature detection at runtime? https://stackoverflow.com/a/61048314 would need env access in the low level code, which is rather heavyweightCpu
?? Since it's not really just SIMD anymore...which is nicewhich is the only way to implement this, use thatgetauxval()
... just disable this altogether :/x86GCC 4.8 works,probably need to inline the cpuid query??--coverage
was brokendoes it allow me to overwrite the function pointer later?-- it doesn't, need a fallback for testingCpuTest
alone worked fine, but whenCorradeUtility
got ifuncs as well, it started crashing on older GCC and on all Clang) -- have to inline it, same case for ARMwhat's the difference from ifunc, actually?-- ability to go w/o function pointer, one indirection lesspatch the TARGET_ defines and runtime-detected features to be really supersets as documented, to handle outlier Celeron or Atom not following the orderobsolete, i now handle the extras properly so AVX+FMA, AVX+F16C and AVX2 without FMA is all possible and correctly handled__attribute__((__target__(...)))
annotations work only on Clang, GCC takes just the last one -- make aCORRADE_ENABLE()
macroTZCNT
with-mbmi
on GCC 4.8 and then later use it in a function that has-mavx -mbmi
set, it'll fail to link -- disable all "extra"CORRADE_ENABLE_
macros there :(maybe look which commit fixed it for 4.9 and check if there could be a workaround?there's no waylook into optimizing the dispatcher to use equality comparison instead ofmeh, the disassembly is good enough and it's not called in a hot loop either&
so compilers can optimize betterPractical use cases inside Corrade to prove this
Needs to be a part of this PR, otherwise I'd likely merge a state that isn't practically usable.
CORRADE_BUILD_CPU_RUNTIME_DISPATCH
, enabled by default only on Linux / Android, other platforms have it experimental andOFF
) ...or have a possibility to disable runtime CPU detection as well?it will rely just on compile-time options and have the functions truly baked inCORRADE_BUILD_TESTS_FORCE_CPU_POINTER_DISPATCH
)String::find()
and(other variants postponed, not critical to verify the design)findLast()
memchr()
yetexploit LZCNT / TZCNT returning 32 for zero input to reduce branchingpostponed, although still requiring themAdd unified macros tonot really possible, it's all just aliases anywayVisibilityMacros.h
that unify IFUNC, pointer and compile-time dispatchvisibility.h
similarly to the export macros*TestLib
explicitly use function pointers, independently of how the main build is configuredhopefully the old Emscripten on CI knows it well enough alreadyimplicitly disabled for old Emscripten and ancient Node.js in EMSDK (thus we can't test on the CI at all)How to test that they actually get picked, apart from printing the function pointer and eyeballing it?based on the benchmark numbers, if they're not differing much then I'm doing something wrong, also coverage reportsmemrchr()
on LinuxSomething in Magnum to verify cross-SO usability (SIMD playground magnum#306)it didn't work even in Corrade, thus nothing to verifyCorradeUtility
andCorradeUtilityTestLib
have the dispatch function pointers duplicated (well, yes, but so are also all functions), solve this somehow (or just suppress?) -- link to CorradeTestSuiteTestLib (which would link to CorradeUtlityTestLib instead of CorradeUtility)