Add `bounds_t` with pre-aligned `mins` and `maxs` #1482

illwieckz · 2024-12-30T19:22:40Z

This is not urgent at all. This is complementary to:

qmath: micro-optimize the BoxOnPlaneSide function #1142

It also improves over:

introduce plane_t struct with normal and dist members #1043

The BoxOnPlaneSide() function is a known hot spot, being called by multiple recursive functions. Right now we spend a lot of time in _mm_loadu_ps: we have to call sseLoadVec3Unsafe() explicitly because we can't know whether the data is aligned or not. This comes with multiple downsides:

_mm_loadu_ps is said to be slower than _mm_load_ps, and that fits my experience¹.
The compiler doesn't optimize _mm_loadu_ps away and will always emit it if we ask for it explicitly, even if the data is already aligned and, in my experience, even if no copy is needed.

So the idea is to introduce a bounds_t struct that uses aligned mins and maxs, and to do the same for cplane_t with an aligned normal. With that, we can write an explicit _mm_load_ps, which is faster than _mm_loadu_ps. In most cases, because the compiler knows the data is const and already aligned, it simply removes the _mm_load_ps and processes the data without any copy.
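To make the idea concrete, here is a minimal sketch, not the actual patch. It assumes a layout where both vectors are declared with alignas(16); the names IsAligned16 and ExtentX are hypothetical helpers used only for illustration, and the SSE path is guarded so the sketch still builds without SSE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

using vec_t = float;
using vec3_t = vec_t[3];

// Assumed layout: each vector starts on its own 16-byte boundary.
struct bounds_t
{
	alignas(16) vec3_t mins;
	alignas(16) vec3_t maxs;
};

static_assert(alignof(bounds_t) == 16, "bounds_t is 16-byte aligned");
static_assert(offsetof(bounds_t, maxs) == 16, "maxs starts the second 16-byte lane");
static_assert(sizeof(bounds_t) == 32, "two padded 16-byte lanes");

// Hypothetical helper: true when p can be used with an aligned 16-byte load.
inline bool IsAligned16(const void* p)
{
	return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: extent of the box along x.
inline float ExtentX(const bounds_t& b)
{
	assert(IsAligned16(b.mins) && IsAligned16(b.maxs));
#if defined(__SSE__)
	// Aligned loads are valid because of the alignas(16) above.
	// The fourth lane only holds padding and is ignored.
	const __m128 lo = _mm_load_ps(b.mins);
	const __m128 hi = _mm_load_ps(b.maxs);
	return _mm_cvtss_f32(_mm_sub_ps(hi, lo));
#else
	return b.maxs[0] - b.mins[0];
#endif
}
```

The static assertions are there so that a compiler laying the struct out differently would refuse to build instead of silently loading from the wrong address.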

See also:

Add bounds_t with pre-aligned mins and maxs Unvanquished/Unvanquished#3265

¹ Some time ago I tried to write optimized SSE code for some other functions, but the code was slower because of the explicit _mm_loadu_ps call. I even noticed that copying an unaligned vec3 to an aligned vec3 before calling _mm_load_ps could make the code faster, when the compiler notices the data is already aligned, skips the copy and calls _mm_load_ps (or even doesn't need to call it at all).

illwieckz · 2024-12-30T19:25:05Z

For now there is a bug somewhere I have not found yet. The bug can be seen by loading the plat23 map and looking to the left (some surfaces will disappear).

In the first iteration of the branch I made a stupid mistake by passing the bounds_t by value instead of by reference, meaning functions that were supposed to modify it only modified a copy. I assume this is now fixed, but I guess I left another mistake as stupid as that one somewhere.

illwieckz · 2024-12-30T19:38:01Z

Note: the percentage of time spent is not reliable in those screenshots because I use a viewpos on some acid tubes spamming acid gas and the amount of particles is very variable.

Before (C), a lot of time is spent in sseLoadVec3Unsafe():

After (C), the first loads are just completely skipped:

Before (Asm), we see three movups instructions:

After (Asm), we see only two movaps instructions:

illwieckz · 2024-12-30T19:44:31Z

So, as I said, this is not urgent. But I would appreciate people re-reading what I did, in the hope that someone finds the stupid mistake I may have made. The code change should be straightforward because it basically rewrites the same code with a different data layout; no algorithm change is expected. Unfortunately this data struct is used in many places, so the patch is massive.

There are many functions in CM not using the new structure yet that can be migrated, but that can be done later as the patch is already heavy as it is. Migrating other functions would be mostly code clean-up and only about code purity, it is not required to make it work. Though maybe it can also help the compiler to optimize in other places. For the same reasons, I only modified the game to be compatible with the new structs and shared functions, using some wrapping around the new struct, to keep the patch light on game side (porting the game to the new struct for the sake of purity can be done later as well).

illwieckz · 2024-12-30T19:46:09Z

I also added some constness to some function input, which may help the compiler to optimize a bit more some other places of the code. I doubt this extra-constness is the cause of the bug I'm facing because the compiler doesn't complain I'm writing to some const structs, so hopefully I only added some const keywords to structs that are only read and not written.

illwieckz · 2024-12-30T22:06:59Z

src/engine/renderer/tr_bsp.cpp

+		out->bounds.mins[ 1 ] = LittleLong( in->mins[ 1 ] );
+		out->bounds.mins[ 2 ] = LittleLong( in->mins[ 2 ] );
+		out->bounds.maxs[ 1 ] = LittleLong( in->maxs[ 1 ] );
+		out->bounds.maxs[ 2 ] = LittleLong( in->maxs[ 2 ] );


Here I forgot:

out->bounds.maxs[ 0 ] = LittleLong( in->maxs[ 0 ] );

That was the bug I was looking for.

illwieckz · 2024-12-30T22:29:47Z

So I reread everything and found the bug I was looking for. I missed the conversion of one line (I simply did not re-add the new line, which is why there was no compile-time error).

It works.

illwieckz · 2024-12-30T23:10:50Z

I found another bug affecting BSP movers like doors. One can test it in the Vega map. The doors disappear depending on the viewing angle / point of view.

DolceTriade · 2024-12-31T12:42:48Z

Any fps difference?

slipher · 2025-01-05T03:54:15Z

I think you can make a drop-in aligned replacement for vec3_t like this:

struct alignas(16) alignedVec3_t : public std::array<float, 3> {};

This can be used in arrays so then you don't need bounds_t with the ugly at function.

src/shared/client/cg_api.cpp

src/engine/renderer/tr_world.cpp

src/engine/renderer/tr_model_iqm.cpp

src/engine/qcommon/q_math.cpp

VReaperV · 2025-01-24T22:17:35Z

src/engine/qcommon/q_shared.h

@@ -1718,7 +1720,7 @@ void MatrixTransformBounds( const matrix_t m, const bounds_t &b, bounds_t &o );
 		vec_t dist;
 		byte   type; // for fast side tests: 0,1,2 = axial, 3 = nonaxial
 		byte   signbits; // signx + (signy<<1) + (signz<<2), used as lookup during collision
-		byte   pad[ 2 ];
+		byte pad[ 12 ];


This looks wrong. The previous variables in the struct add up to 6 bytes.

That was because of:

Add bounds_t with pre-aligned mins and maxs Unvanquished/Unvanquished#3265 (comment)

But maybe I'm wrong with the values.

Sure, but it should be byte pad[ 14 ], rather than byte pad[ 12 ].

(small correction to my previous comment: they add up to 18 bytes, not 6)
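The padding arithmetic can be checked at compile time. This is a minimal sketch, assuming the plane struct has a 16-byte aligned normal and should fill exactly two 16-byte lanes; plane_sketch_t, PaddingFor and CheckPlaneAlignment are hypothetical names, not the engine's.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using vec_t = float;
using vec3_t = vec_t[3];

// Bytes needed after `used` bytes to reach the next multiple of `alignment`.
constexpr std::size_t PaddingFor(std::size_t used, std::size_t alignment)
{
	return (alignment - used % alignment) % alignment;
}

// Hypothetical stand-in for the plane struct discussed above.
struct plane_sketch_t
{
	alignas(16) vec3_t normal; // bytes 0..11
	vec_t dist;                // bytes 12..15
	std::uint8_t type;         // byte 16
	std::uint8_t signbits;     // byte 17
	std::uint8_t pad[14];      // bytes 18..31
};

static_assert(offsetof(plane_sketch_t, pad) == 18, "members before pad add up to 18 bytes");
static_assert(PaddingFor(18, 4) == 2, "matches the old pad[ 2 ] for 4-byte alignment");
static_assert(PaddingFor(18, 16) == 14, "16-byte alignment needs pad[ 14 ]");
static_assert(sizeof(plane_sketch_t) == 32, "exactly two 16-byte lanes");

// Runtime check that an instance really starts on a 16-byte boundary.
inline void CheckPlaneAlignment(const plane_sketch_t& p)
{
	assert(reinterpret_cast<std::uintptr_t>(&p) % 16 == 0);
}
```

Under these assumptions, a pad[ 12 ] would probably still give a 32-byte struct, because the compiler would add two implicit trailing bytes, but the explicit padding would no longer describe the real layout.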

illwieckz · 2025-01-25T07:05:10Z

> I think you can make a drop-in aligned replacement for vec3_t like this:
> struct alignas(16) alignedVec3_t : public std::array<float, 3> {};
> This can be used in arrays so then you don't need bounds_t with the ugly at function.

Interesting, how can such a type be used, and where?

slipher · 2025-01-25T07:17:25Z

As a direct replacement for vec3_t, for example: alignedVec3_t localBounds[ 2 ]; This would automatically work with indexing operations and functions like VectorCopy. But you would need a .data() or similar to pass it to functions taking a float* (this includes functions with a vec3_t parameter).
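A minimal sketch of that usage, assuming nothing beyond the alignedVec3_t definition quoted above; SumComponents is a hypothetical stand-in for any existing function taking a float*, and DemoLocalBounds is only there to show the calls.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct alignas(16) alignedVec3_t : public std::array<float, 3> {};

static_assert(alignof(alignedVec3_t) == 16, "each vector is 16-byte aligned");
static_assert(sizeof(alignedVec3_t) == 16, "12 bytes of data padded to 16");

// Hypothetical stand-in for a function with a float* (or vec3_t) parameter.
inline float SumComponents(const float* v)
{
	return v[0] + v[1] + v[2];
}

inline float DemoLocalBounds()
{
	// Element 0 plays the role of mins, element 1 the role of maxs.
	alignedVec3_t localBounds[2] = {};

	// Indexing works as with a plain vec3_t.
	localBounds[1][0] = 1.0f;
	localBounds[1][1] = 2.0f;
	localBounds[1][2] = 3.0f;

	// Array elements keep the alignment, so both vectors can be loaded aligned.
	assert(reinterpret_cast<std::uintptr_t>(&localBounds[0]) % 16 == 0);
	assert(reinterpret_cast<std::uintptr_t>(&localBounds[1]) % 16 == 0);

	// .data() is needed where a float* is expected.
	return SumComponents(localBounds[1].data());
}
```

Indexing the array with 0 or 1 selects a vector without any at() function, at the cost of losing the explicit mins and maxs names.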

illwieckz · 2025-01-25T16:51:22Z

I like the idea of having an alignedVec3_t but I don't see the benefit of making bounds_t an array of it. There are only a few uses of at(), while using .data() everywhere would be much more verbose. Also I like the explicit mins and maxs naming. I guess we may be able to provide some mins() and maxs() accessors, and I tried to implement that with an alignedVec3_t, but that became much more complex.

I just rewrote at() in a way it is now guaranteed to be branchless.

slipher · 2025-01-25T16:56:07Z

> I like the idea of having an alignedVec3_t but I don't see the benefit of making bounds_t an array of it.

I'm suggesting that no bounds_t is needed. As I wrote in the previous comment, it could be used like alignedVec3_t localBounds[ 2 ];

illwieckz · 2025-01-25T16:58:57Z

But I like having an explicit bounds_t! Whether it is implemented the way I did it or as using bounds_t = alignedVec3_t[2], I want to have it…

slipher · 2025-01-25T17:04:38Z

cmake/DaemonFlags.cmake

@@ -68,10 +68,19 @@ macro(set_c_cxx_flag FLAG)
    set_c_flag(${FLAG} ${ARGN})
    set_cxx_flag(${FLAG} ${ARGN})
 endmacro()
+
+macro(set_kind_linker_flag KIND FLAG)


There are random cmake changes in here by mistake.

slipher · 2025-01-25T17:05:17Z

src/engine/qcommon/q_shared.h

+
+	vec3_t& at( bool index )
+	{
+		return ( &mins + ( index * 16 ) )[ 0 ];


That's undefined behavior, like the (&v.x)[i] thing with GLM.

But how can this be undefined if the struct layout is guaranteed? And if it isn't guaranteed, how can we use structs for defining file storage format or network packet formats if the layout of the data isn't guaranteed?

Is there a way in C/C++ to guarantee that, for two consecutive members declared as alignas(16) vec4_t, alignas(16) vec4_t, the second one has an offset of sizeof(vec4_t), i.e. 16?

Can this be solved with a static assertion (if the compiler doesn't do what is expected, don't let the code compile)?

It doesn't matter that the layout is guaranteed. It's illegal to use an index that goes outside the bounds of an array. (A non-array object is treated like an array of length 1.)

But this is not an index, this is a pointer. Pointer arithmetic isn't undefined.

OK I didn't explain it precisely. Let's cite people who can. https://en.cppreference.com/w/cpp/language/operator_arithmetic#Pointer_arithmetic

Or one would say that if you have float array[3], doing &array[0] + 2 isn't defined because that goes outside of the object (array[0]).

This is the case "Otherwise, if P points to the ith element of an array object x with n elements, given the value of J as j, P is added or subtracted as follows: [...]". J can be from 0 to 3.

If you do &array instead of &array[0], it's the case "Otherwise, if P points to a complete object, a base class subobject or a member subobject y, given the value of J as j, P is added or subtracted as follows: [...]". J can be 0 or 1. Note that if it is 1, the resulting pointer is a "pointer past the end" and is illegal to dereference, even if it happens to have the same numerical value as a pointer to a valid object.

So, is this defined?

vec_t* at( bool index ) { return (vec_t*) this + ( index * 16 ); }

The pointer doesn't go past the end of the object this.

Well, that doesn't work. Actually only the if (index) return maxs; return mins variant works:

https://godbolt.org/z/ETEqnG9Gv

What's curious is that the compilers (GCC, Clang) seem to produce the same code for all three functions, but produce garbage on the functions doing pointer arithmetic. Edit: In fact the computed offsets are wrong.

> Edit: In fact the computed offsets are wrong.

The wrong offsets are the ones in your at() functions, i.e. it should be index * 4 rather than index * 16.

I've seen claims that, essentially, doing this would not be UB:
return (vec_t*) ( (char*) &mins[0] + ( index ? offsetof( boundsA_t, maxs ) : offsetof( boundsA_t, mins ) ) );
However, don't quote me on that, and it produces the same assembly anyway.
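For reference, a minimal sketch of the variant the thread found to work: selecting the named member instead of doing pointer arithmetic. It assumes the same alignas(16) layout as discussed above and is not the code from the branch. The static assertions also show why a float-based offset would be 4 rather than 16.

```cpp
#include <cassert>
#include <cstddef>

using vec_t = float;
using vec3_t = vec_t[3];

// Assumed layout, as discussed above.
struct bounds_t
{
	alignas(16) vec3_t mins;
	alignas(16) vec3_t maxs;

	// Picking one of the two members never steps outside any object,
	// so no pointer arithmetic rule is involved.
	vec3_t& at(bool index) { return index ? maxs : mins; }
	const vec3_t& at(bool index) const { return index ? maxs : mins; }
};

// maxs sits 16 bytes after mins, which is 4 floats, not 16 floats.
static_assert(offsetof(bounds_t, maxs) == 16, "byte offset of maxs");
static_assert(offsetof(bounds_t, maxs) / sizeof(vec_t) == 4, "offset of maxs counted in floats");

// Small self-check: at() returns the members themselves, not copies.
inline void CheckAt(bounds_t& b)
{
	assert(&b.at(false) == &b.mins);
	assert(&b.at(true) == &b.maxs);
}
```

As noted in the thread, compilers seem to generate the same code for this form, so the well-defined version is unlikely to cost anything.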

illwieckz added 3 commits December 30, 2024 16:43

padding

c9ac2b9

cmake: make possible to enable LTO when compiling dll games

6962a41

bounds_t

06862cb

illwieckz mentioned this pull request Dec 30, 2024

Add bounds_t with pre-aligned mins and maxs Unvanquished/Unvanquished#3265

Draft

illwieckz marked this pull request as draft December 30, 2024 19:25

illwieckz force-pushed the illwieckz/bounds/sync branch from da96dc5 to b9fbbab Compare December 30, 2024 19:48

illwieckz added 2 commits December 30, 2024 23:10

fixup: fix mistake introduced

cad4024

aligned-sse-load

0b07c24

illwieckz force-pushed the illwieckz/bounds/sync branch from b9fbbab to 3176224 Compare December 30, 2024 22:28

illwieckz force-pushed the illwieckz/bounds/sync branch 2 times, most recently from 4bea319 to e99d1aa Compare December 30, 2024 22:44

illwieckz force-pushed the illwieckz/bounds/sync branch 4 times, most recently from 92be21c to d7d9d1e Compare December 31, 2024 02:18

illwieckz force-pushed the illwieckz/bounds/sync branch from d7d9d1e to 3239c6e Compare December 31, 2024 14:27

VReaperV reviewed Jan 24, 2025


VReaperV reviewed Jan 24, 2025

illwieckz added 3 commits January 25, 2025 07:54

static-const

838c164

fixup-boundscopy-bspmodel

22bd35f

fixup-bounding-mins

cb98f7a

illwieckz force-pushed the illwieckz/bounds/sync branch from 3239c6e to 881fb23 Compare January 25, 2025 07:01

padding

bf2896f

illwieckz force-pushed the illwieckz/bounds/sync branch from ce2df22 to b30fef6 Compare January 25, 2025 16:51

slipher reviewed Jan 25, 2025

illwieckz force-pushed the illwieckz/bounds/sync branch 3 times, most recently from 73a5d78 to bf2896f Compare January 30, 2025 00:42

illwieckz added T-Improvement Improvement for an existing feature A-Renderer labels Mar 12, 2025
