scx: Branchless implementation of highest_bit #208
Conversation
e24cf06 to 52514a5
Thanks for the thorough PR, but unfortunately this would break on 32-bit builds, because the highest bit would be lost in a 32-bit unsigned long.
Oh I see, what do you think about doing a type conversion from
Yeah, that should work. Can you verify it on 32-bit builds too? It will probably be slower, but as long as the slowdown isn't drastic, that should be okay.
Sorry, I don't really understand what you mean by 32-bit builds. Do you mean recompiling the kernel with a 32-bit config and testing correctness inside that 32-bit kernel?
You can just build the test binary with
The original implementation of highest_bit uses "fls()" to find the most significant set bit of the input parameter "flags". Normally we could return the mask with "1 << (fls(flags) - 1)", but "fls(flags)" returns 0 when "flags" is 0, which turns the expression into "1 << (-1)", and shifting by a negative count is undefined behavior in C. The original code therefore uses a branch to check whether "fls(flags)" returned 0. We can remove the branch by first shifting left by "fls(flags)" bits and then shifting right by 1 bit. When "fls(flags)" is 0, the expression simply evaluates to 0 without any error. For any nonzero input, the result is the same as "1 << (fls(flags) - 1)". This implementation avoids possible branch mispredictions, and plain shift operations are cheaper than the conditional branch of an if-else statement.
52514a5 to baff66e
I just used the proposed implementation to build with
I also ran the comparison test again; the results are shown below. By the way, this time I iterated through the whole range within
Branch version
Branch-less version
Improvements can be seen in
Summary
The original implementation of `highest_bit` uses `fls()` to find the most significant set bit of the input parameter `flags`. Normally we could return the mask with `1 << (fls(flags) - 1)`, but `fls(flags)` returns 0 when `flags` is 0, which turns the expression into `1 << (-1)`, and shifting by a negative count is undefined behavior in C. The original code therefore uses a branch to check whether `fls(flags)` returned 0.

We can remove the branch by first shifting left by `fls(flags)` bits and then shifting right by 1 bit. When `fls(flags)` is 0, the expression simply evaluates to 0 without any error. For any nonzero input, the result is the same as `1 << (fls(flags) - 1)`.

This implementation avoids possible branch mispredictions, and plain shift operations are cheaper than the conditional branch of an if-else statement.
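The two variants described above can be sketched side by side in userspace C. This is only an illustration, not the code from the PR: `fls_u32` is a hypothetical stand-in for the kernel's `fls()` built on the GCC/Clang builtin `__builtin_clz`, and the 64-bit intermediate in the branchless version is an assumption, chosen so that the left shift stays defined when `fls()` returns 32 (the case the 32-bit review comment is concerned about).

```c
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for the kernel's fls(): returns the
 * 1-based index of the most significant set bit, or 0 when x == 0.
 */
static inline int fls_u32(uint32_t x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

/* Branch version: test for fls() == 0 so we never shift by -1. */
static inline uint32_t highest_bit_branch(uint32_t flags)
{
	int bit = fls_u32(flags);

	return bit ? (uint32_t)1 << (bit - 1) : 0;
}

/*
 * Branchless version: shift left by fls(), then right by one.
 * The uint64_t intermediate is an assumption made for this sketch; it
 * keeps the shift well defined when bit == 32 (bit 31 of flags set).
 */
static inline uint32_t highest_bit_branchless(uint32_t flags)
{
	int bit = fls_u32(flags);

	return (uint32_t)(((uint64_t)1 << bit) >> 1);
}
```

For `flags == 0` the branchless expression is `(1 << 0) >> 1`, which is 0, and for `flags == 0x80000000` it is `(1 << 32) >> 1`, which only fits because the intermediate is 64 bits wide.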
Experiments
Correctness
To verify the correctness of my implementation, I used the following userspace program to test whether the two implementations produce exactly the same output for every number in the range of `u32`, which is usually `unsigned`.
Compile the program and execute it.
Test passed! The two implementations agree on every input, so correctness is verified.
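The author's test program is not reproduced here, so the following is only a sketch of how such an equivalence check might look. All names (`fls_u32`, `highest_bit_branch`, `highest_bit_branchless`, `count_mismatches`) are hypothetical, and the 64-bit intermediate in the branchless version is an assumption.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's fls(); 0 for x == 0. */
static inline int fls_u32(uint32_t x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

static inline uint32_t highest_bit_branch(uint32_t flags)
{
	int bit = fls_u32(flags);

	return bit ? (uint32_t)1 << (bit - 1) : 0;
}

/* Assumed 64-bit intermediate so a shift by 32 stays defined. */
static inline uint32_t highest_bit_branchless(uint32_t flags)
{
	int bit = fls_u32(flags);

	return (uint32_t)(((uint64_t)1 << bit) >> 1);
}

/* Count the inputs in [first, last] on which the two versions disagree. */
static inline uint64_t count_mismatches(uint32_t first, uint32_t last)
{
	uint64_t mismatches = 0;

	/* 64-bit counter so the loop terminates when last == UINT32_MAX. */
	for (uint64_t v = first; v <= last; v++) {
		uint32_t flags = (uint32_t)v;

		if (highest_bit_branch(flags) != highest_bit_branchless(flags))
			mismatches++;
	}
	return mismatches;
}
```

Calling `count_mismatches(0, UINT32_MAX)` from a `main()` walks the entire `u32` range; a return value of 0 means the two versions agree everywhere.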
Performance
I changed the userspace program a bit so that it only tests a subset of the `unsigned` range (so that a single execution doesn't take too long), and used `perf stat` to observe the performance of the two implementations.
Branch version
Compile and use `perf` to observe the performance.
Branchless version
Compile and use `perf` to observe the performance.
We can see significant improvements in cycles and stalled-cycles-frontend, and especially in branches and branch-misses.
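A benchmark loop for this kind of comparison might look like the sketch below. It is not the author's program: the function names are hypothetical, `fls_u32` stands in for the kernel's `fls()`, and the 64-bit intermediate is an assumption. Each loop accumulates the results into a sum so the compiler cannot discard the calls.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's fls(); 0 for x == 0. */
static inline int fls_u32(uint32_t x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

static inline uint32_t highest_bit_branch(uint32_t flags)
{
	int bit = fls_u32(flags);

	return bit ? (uint32_t)1 << (bit - 1) : 0;
}

/* Assumed 64-bit intermediate so a shift by 32 stays defined. */
static inline uint32_t highest_bit_branchless(uint32_t flags)
{
	int bit = fls_u32(flags);

	return (uint32_t)(((uint64_t)1 << bit) >> 1);
}

/* Sum the branch version over [first, last]; the sum keeps the work live. */
static inline uint64_t sum_highest_bits_branch(uint32_t first, uint32_t last)
{
	uint64_t sum = 0;

	for (uint64_t v = first; v <= last; v++)
		sum += highest_bit_branch((uint32_t)v);
	return sum;
}

/* Same loop using the branchless version. */
static inline uint64_t sum_highest_bits_branchless(uint32_t first, uint32_t last)
{
	uint64_t sum = 0;

	for (uint64_t v = first; v <= last; v++)
		sum += highest_bit_branchless((uint32_t)v);
	return sum;
}
```

Building one binary per variant, with a `main()` that prints the sum, and running each under something like `perf stat -e cycles,branches,branch-misses ./bench` gives counters comparable to those discussed above. The numbers depend heavily on the compiler and optimization level, since an optimizer may already turn the conditional into a conditional move.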
The test ran on an x86_64 AMD Ryzen 7 7700X 8-Core Processor; the operating system is Ubuntu 22.04.4 LTS.