Add `gcd` module to `bevy_math` #16421

bushrat011899 · 2024-11-18T01:36:33Z

Objective

Add a gcd module to bevy_math
Generalize and document the 4 / gcd(4,size) algorithm developed by @atlv24
Improves upon "remove gcd impl from bevy_render" (#16419)

Solution

Added gcd, gcd_by_table, and n_over_gcd_by_table to bevy_math.
Updated ElementLayout::new in bevy_render to use n_over_gcd_by_table
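
As a rough illustration of the three functions named above, here is a minimal sketch. It is not the code from this PR: only the names `gcd`, `gcd_by_table`, and `n_over_gcd_by_table` come from the description, while the signatures (a const generic `N`, `u64` inputs) are assumptions. The sketch relies on the identity gcd(N, x) = gcd(N, x mod N), so a table with N entries covers every input.

```rust
/// Euclid's algorithm. Written as a `const fn` so it could also be used to
/// fill lookup tables at compile time.
pub const fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// Assumed signature: returns gcd(N, value) using an N-entry table.
/// `N` must be non-zero, otherwise the `%` below panics.
pub const fn gcd_by_table<const N: usize>(value: u64) -> u64 {
    // table[i] = gcd(N, i). Built inside the function for brevity; the
    // optimizer is expected (not guaranteed) to fold it into a constant.
    let mut table = [0u64; N];
    let mut i = 0;
    while i < N {
        table[i] = gcd(N as u64, i as u64);
        i += 1;
    }
    table[(value % N as u64) as usize]
}

/// Assumed signature: returns N / gcd(N, value) using an N-entry table.
/// `N` must be non-zero.
pub const fn n_over_gcd_by_table<const N: usize>(value: u64) -> u64 {
    // table[i] = N / gcd(N, i). Entry 0 is 1 because gcd(N, 0) = N.
    let mut table = [0u64; N];
    let mut i = 0;
    while i < N {
        table[i] = N as u64 / gcd(N as u64, i as u64);
        i += 1;
    }
    table[(value % N as u64) as usize]
}
```

For `N = 4` the second table works out to `[1, 4, 2, 4]`, which matches the constants visible in the assembly listings in the Notes section.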

Testing

Added unit tests to confirm the new methods provide accurate results for small cases.
CI

Notes

The algorithm provided by @atlv24, while performant, has a bounds check that cannot be removed by the compiler due to the lookup table being a slice. To remove this redundant bounds check, and better document how the algorithm works, I've moved it into bevy_math with some documentation, and generalised the technique for any number N, not just 4.
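
To make the bounds-check point concrete, the sketch below contrasts a lookup through a slice with a lookup through a fixed-size array. Both functions are hypothetical and are not taken from either PR. Whether the check is actually eliminated depends on the compiler version and optimization level.

```rust
/// Indexing a slice: the length is a run-time value, so the compiler must
/// either prove `index < table.len()` or emit a bounds check with a panic path.
pub fn lookup_in_slice(table: &[u64], size: u64) -> u64 {
    table[(size % 4) as usize]
}

/// Indexing a fixed-size array: the length (4) is part of the type, and
/// `size % 4` is always below it, so the check is easy to prove redundant.
pub fn lookup_in_array(size: u64) -> u64 {
    // 4 / gcd(4, i) for i in 0..4
    const TABLE: [u64; 4] = [1, 4, 2, 4];
    TABLE[(size % 4) as usize]
}
```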

For powers of 2, the algorithm is heavily optimised by the Rust compiler, entirely omitting divisions and branches. For all other N, the algorithm performs a single mod (%) and a lookup, while still being branchless. For a comparison of this method, the original from @atlv24, and the naive implementation, see Godbolt.

As a summary, the case where N is 4 produces the following assembly (on x86-64 for demonstration purposes; other platforms should produce similar output):

n_over_gcd_by_table_four:
        and     rdi, 3
        lea     rax, [rip + .L__unnamed_1]
        mov     rax, qword ptr [rax + 8*rdi]
        mov     qword ptr [rsp - 8], rax
        mov     rax, qword ptr [rsp - 8]
        ret

Compared to @atlv24's original:

n_over_gcd_by_table_four_atlv24:
        sub     rsp, 40
        mov     qword ptr [rsp + 8], 1
        mov     qword ptr [rsp + 16], 4
        mov     qword ptr [rsp + 24], 2
        mov     qword ptr [rsp + 32], 4
        and     rdi, 3
        mov     qword ptr [rsp], rdi
        cmp     rdi, 4
        jae     .LBB0_2
        mov     rax, qword ptr [rsp]
        mov     rax, qword ptr [rsp + 8*rax + 8]
        add     rsp, 40
        ret
.LBB0_2:
        mov     rdi, qword ptr [rsp]
        lea     rdx, [rip + .L__unnamed_1]
        mov     rax, qword ptr [rip + core::panicking::panic_bounds_check::hab02a8df06d3a143@GOTPCREL]
        mov     esi, 4
        call    rax

Note that while a more efficient algorithm is nice, I'm more concerned with having a generalised and documented implementation. That it produces cleaner assembly is purely a beneficial side effect.

BD103

While the code quality itself looks awesome, I disagree with the use of #[inline(always)]. In my mind that is reserved for specific optimizations, and should not be applied to functions as broad as these. (Forcing inline could create several copies of the GCD table, for instance.)

Could you please switch to plain #[inline], which gives the compiler room to disagree in the case it isn't worth inlining?
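
For readers unfamiliar with the two attributes, here is a minimal sketch of the difference. The function names are hypothetical and the bodies are plain Euclid, not the PR's code.

```rust
/// `#[inline]` is a hint: the compiler may inline this function, including
/// across crate boundaries, but remains free to decline when it judges the
/// call site is not worth it.
#[inline]
pub const fn gcd_hinted(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// `#[inline(always)]` is a much stronger request to inline at every call
/// site, which can duplicate the body (and any table it builds) many times.
#[inline(always)]
pub const fn gcd_forced(a: u64, b: u64) -> u64 {
    gcd_hinted(a, b)
}
```

Both functions return the same results. The attributes only influence code generation.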

atlv24 · 2024-11-19T02:10:23Z

the instruction count comparison on godbolt isn't fair or accurate because you didn't pass optimization level 3 with -C opt-level=3.

after doing so, the counts are comparable, the only difference being yours ends up as an actual lookup table:

four_over_gcd_4:
        and     edi, 3
        lea     rax, [rip + .L__unnamed_1]
        mov     rax, qword ptr [rax + 8*rdi]
        ret

.L__unnamed_1:
        .asciz  "\001\000\000\000\000\000\000\000\004\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\004\000\000\000\000\000\000"

while mine becomes

four_over_gcd_4:
        mov     qword ptr [rsp - 32], 1
        mov     qword ptr [rsp - 24], 4
        mov     qword ptr [rsp - 16], 2
        mov     qword ptr [rsp - 8], 4
        and     edi, 3
        mov     rax, qword ptr [rsp + 8*rdi - 32]
        ret

perf was not the motivation of my change anyhow, this isn't in the hot path afaik. i mostly wanted to take shit out of bevy_render to improve compile time

Add gcd module to bevy_math

a4338d9

bushrat011899 requested a review from atlv24 November 18, 2024 01:36

Formatting

25b1204

BenjaminBrienen approved these changes Nov 18, 2024

IQuick143 reviewed Nov 18, 2024

crates/bevy_math/src/gcd.rs

IQuick143 approved these changes Nov 18, 2024

Adding some symmetry tests to gcd

83452e3

Also refactored the algorithm to look a little cleaner. Identical implementation.

IQuick143 reviewed Nov 18, 2024

crates/bevy_render/src/mesh/allocator.rs (outdated)

bushrat011899 added 2 commits November 19, 2024 10:50

Switched to explicit u64 to avoid concerns around usize -> u64 on 32-bit

790078d

Casting still required inside `gcd_x` methods, but they're only required to cast `usize` to `u64`, which will work on all platforms except 128-bit.

Fix tests

dcec14f

BD103 suggested changes Nov 19, 2024
