Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gcd module to bevy_math #16421

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

bushrat011899
Copy link
Contributor

@bushrat011899 bushrat011899 commented Nov 18, 2024

Objective

Solution

  • Added gcd, gcd_by_table, and n_over_gcd_by_table to bevy_math.
  • Updated ElementLayout::new in bevy_render to use n_over_gcd_by_table

Testing

  • Added unit tests to confirm the new methods provide accurate results for small cases.
  • CI

Notes

The algorithm provided by @atlv24, while performant, has a bounds check that cannot be removed by the compiler due to the lookup table being a slice. To remove this redundant bounds check, and better document how the algorithm works, I've moved it into bevy_math with some documentation, and generalised the technique for any number N, not just 4.

For powers of 2, the algorithm is heavily optimised by the rust compiler, entirely omitting divisions and branches. For all other N, the algorithm performs a single mod (%) and a lookup, while still being branchless. For a comparison of this method, the original from @atlv24, and the naive, see GodBolt.

As a summary here, the case where N is 4 produces the following assembly (on x86 for demonstration purposes, other platforms will have similar output):

n_over_gcd_by_table_four:
        and     rdi, 3
        lea     rax, [rip + .L__unnamed_1]
        mov     rax, qword ptr [rax + 8*rdi]
        mov     qword ptr [rsp - 8], rax
        mov     rax, qword ptr [rsp - 8]
        ret

Compared to @atlv24's original:

n_over_gcd_by_table_four_atlv24:
        sub     rsp, 40
        mov     qword ptr [rsp + 8], 1
        mov     qword ptr [rsp + 16], 4
        mov     qword ptr [rsp + 24], 2
        mov     qword ptr [rsp + 32], 4
        and     rdi, 3
        mov     qword ptr [rsp], rdi
        cmp     rdi, 4
        jae     .LBB0_2
        mov     rax, qword ptr [rsp]
        mov     rax, qword ptr [rsp + 8*rax + 8]
        add     rsp, 40
        ret
.LBB0_2:
        mov     rdi, qword ptr [rsp]
        lea     rdx, [rip + .L__unnamed_1]
        mov     rax, qword ptr [rip + core::panicking::panic_bounds_check::hab02a8df06d3a143@GOTPCREL]
        mov     esi, 4
        call    rax

Note that while a more efficient algorithm is nice, I'm more concerned with having a generalised and documented implementation. That it produces cleaner assembly is purely a beneficial side effect.

@bushrat011899 bushrat011899 added C-Performance A change motivated by improving speed, memory usage or compile times A-Math Fundamental domain-agnostic mathematical operations D-Modest A "normal" level of difficulty; suitable for simple features or challenging fixes S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Nov 18, 2024
Also refactored the algorithm to look a little cleaner. Identical implementation.
…on 32-bit

Casting still required inside `gcd_x` methods, but they're only required to cast `usize` to `u64`, which will work on all platforms except 128-bit.
Copy link
Member

@BD103 BD103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the code quality itself looks awesome, I disagree with the use of #[inline(always)]. In my mind that is reserved for specific optimizations, and should not be applied to functions as broad as these. (Forcing inline could create several copies of the GCD table, for instance.)

Could you please switch to plain #[inline], which gives the compiler room to disagree in the case it isn't worth inlining?

@atlv24
Copy link
Contributor

atlv24 commented Nov 19, 2024

the instruction count comparison on godbolt isnt fair or accurate because you didnt pass optimization level 3 with -C opt-level=3.

after doing so, the counts are comparable, the only difference being yours ends up as an actual lookup table:

four_over_gcd_4:
        and     edi, 3
        lea     rax, [rip + .L__unnamed_1]
        mov     rax, qword ptr [rax + 8*rdi]
        ret

.L__unnamed_1:
        .asciz  "\001\000\000\000\000\000\000\000\004\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\004\000\000\000\000\000\000"

while mine becomes

four_over_gcd_4:
        mov     qword ptr [rsp - 32], 1
        mov     qword ptr [rsp - 24], 4
        mov     qword ptr [rsp - 16], 2
        mov     qword ptr [rsp - 8], 4
        and     edi, 3
        mov     rax, qword ptr [rsp + 8*rdi - 32]
        ret

perf was not the motivation of my change anyhow, this isnt in the hot path afaik. i mostly wanted to take shit out of bevy_render to improve compile time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Math Fundamental domain-agnostic mathematical operations C-Performance A change motivated by improving speed, memory usage or compile times D-Modest A "normal" level of difficulty; suitable for simple features or challenging fixes S-Needs-Review Needs reviewer attention (from anyone!) to move forward
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants