Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bounds_t with pre-aligned mins and maxs #1482

Draft
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

illwieckz
Copy link
Member

@illwieckz illwieckz commented Dec 30, 2024

This is not urgent at all. This is complementary to:

It also improves over:

The BoxOnPlaneSide() function is known to be a hot spot, being called by multiple recursive functions. Right now we spend a lot of time in _mm_loadu_ps, we have to call sseLoadVec3Unsafe() explicitly because we can't guess if the data is aligned or not. This comes with multiple downside:

  • _mm_loadu_ps is said to be slower than _mm_load_ps, and that fits my experience¹.
  • The compiler doesn't optimize _mm_loadu_ps and will always call it if we ask for it explicitely, even if the data is already aligned, and by experience, even if no copy is needed.

So the idea is to introduce some nice bounds_t struct that uses aligned mins and maxs, same for cplane_t with an aligned normal. When doing that, we can write an explicit _mm_load_ps that is faster than _mm_loadu_ps, and most of the cases, because the compiler knows the data is const and already aligned, the compiler just removes the _mm_load_ps and process the data without any copy.

See also:


¹ Some times ago I tried to write optimized SSE code for some other functions, but the code was slower because of the explicit _mm_loadu_ps call. I even noticed copying an unaligned vec3 to an aligned vec3 before calling _mm_load_ps could make the code faster when the compiler notices the data is already aligned and skips the copy and calls _mm_load_ps (or even doesn't need to call it at all).

@illwieckz
Copy link
Member Author

For now there is a bug somewhere I have not found yet. The bug can be seen by loading the plat23 map and looking to the left (some surfaces will disappear).

In my first iteration of the branch I made a stupid mistakes by passing the bounds_t by value instead of by reference, meaning functions modifying it would not modify it. I assume this is now fixed but I guess I left somewhere another mistake as stupid as that.

@illwieckz illwieckz marked this pull request as draft December 30, 2024 19:25
@illwieckz
Copy link
Member Author

Note: the percentage of time spent is not reliable in those screenshots because I use a viewpos on some acid tubes spamming acid gas and the amount of particles is very variable.

Before (C), a lot of time is spent in sseLoadVec3Unsafe():

20241230-201855-000 orbit-unvanquished-aligned-sse-load

After (C), the first loads are just completely skipped:

20241230-202108-000 orbit-unvanquished-aligned-sse-load

Before (Asm), we see three movups instructions:

20241230-201857-000 orbit-unvanquished-aligned-sse-load

After (Asm), se see only two movaps instructions:

20241230-202121-000 orbit-unvanquished-aligned-sse-load

@illwieckz
Copy link
Member Author

So, as I said, this is not urgent. But I would appreciate people re-reading what I did in hope someone finds the stupid mistake I may have done. The code change should be straightforward because it's basically rewriting with a different data layout, there is no algorithm change to be expected. Unfortunately this data struct is used in many places so the patch is massive.

There are many functions in CM not using the new structure yet that can be migrated, but that can be done later as the patch is already heavy as it is. Migrating other functions would be mostly code clean-up and only about code purity, it is not required to make it work. Though maybe it can also help the compiler to optimize in other places. For the same reasons, I only modified the game to be compatible with the new structs and shared functions, using some wrapping around the new struct, to keep the patch light on game side (porting the game to the new struct for the sake of purity can be done later as well).

@illwieckz
Copy link
Member Author

I also added some constness to some function input, which may help the compiler to optimize a bit more some other places of the code. I doubt this extra-constness is the cause of the bug I'm facing because the compiler doesn't complain I'm writing to some const structs, so hopefully I only added some const keywords to structs that are only read and not written.

@illwieckz illwieckz force-pushed the illwieckz/bounds/sync branch from da96dc5 to b9fbbab Compare December 30, 2024 19:48
out->bounds.mins[ 1 ] = LittleLong( in->mins[ 1 ] );
out->bounds.mins[ 2 ] = LittleLong( in->mins[ 2 ] );
out->bounds.maxs[ 1 ] = LittleLong( in->maxs[ 1 ] );
out->bounds.maxs[ 2 ] = LittleLong( in->maxs[ 2 ] );
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I forgot:

out->bounds.maxs[ 0 ] = LittleLong( in->maxs[ 0 ] );

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the bug I was looking for.

@illwieckz illwieckz force-pushed the illwieckz/bounds/sync branch from b9fbbab to 3176224 Compare December 30, 2024 22:28
@illwieckz
Copy link
Member Author

So I reread everything and found the bug I was looking for. I missed the conversion of one line (I just did not re-added the new one, that's why there was no compile type error).

It works.

@illwieckz illwieckz force-pushed the illwieckz/bounds/sync branch 2 times, most recently from 4bea319 to e99d1aa Compare December 30, 2024 22:44
@illwieckz
Copy link
Member Author

I found another bug affecting BSP movers like doors. One can test it in the Vega map. The doors disappear according to the viewing angle / point of view.

@illwieckz illwieckz force-pushed the illwieckz/bounds/sync branch 4 times, most recently from 92be21c to d7d9d1e Compare December 31, 2024 02:18
@DolceTriade
Copy link
Contributor

Any fps difference?

@illwieckz illwieckz force-pushed the illwieckz/bounds/sync branch from d7d9d1e to 3239c6e Compare December 31, 2024 14:27
@slipher
Copy link
Member

slipher commented Jan 5, 2025

I think you can make a drop-in aligned replacement for vec3_t like this:

struct alignas(16) alignedVec3_t : public std::array<float, 3> {};

This can be used in arrays so then you don't need bounds_t with the ugly at function.

src/shared/client/cg_api.cpp Outdated Show resolved Hide resolved
src/engine/renderer/tr_world.cpp Outdated Show resolved Hide resolved
src/engine/renderer/tr_model_iqm.cpp Show resolved Hide resolved
src/engine/qcommon/q_math.cpp Outdated Show resolved Hide resolved
@@ -1718,7 +1720,7 @@ void MatrixTransformBounds( const matrix_t m, const bounds_t &b, bounds_t &o );
vec_t dist;
byte type; // for fast side tests: 0,1,2 = axial, 3 = nonaxial
byte signbits; // signx + (signy<<1) + (signz<<2), used as lookup during collision
byte pad[ 2 ];
byte pad[ 12 ];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wrong. The previous variables in the struct add up to 6 bytes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was because of:

But maybe I'm wrong with the values.

Copy link
Contributor

@VReaperV VReaperV Jan 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but it should be byte pad[ 14 ], rather than byte pad[ 12 ].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(small correction to my previous comment: they add up to 18 bytes, not 6)

@illwieckz illwieckz force-pushed the illwieckz/bounds/sync branch from 3239c6e to 881fb23 Compare January 25, 2025 07:01
@illwieckz
Copy link
Member Author

I think you can make a drop-in aligned replacement for vec3_t like this:

struct alignas(16) alignedVec3_t : public std::array<float, 3> {};

This can be used in arrays so then you don't need bounds_t with the ugly at function.

Interesting, how such type can be used and where?

@slipher
Copy link
Member

slipher commented Jan 25, 2025

As a direct replacement for vec3_t. alignedVec3_t localBounds[ 2 ]; This would automatically work with indexing operations and functions like VectorCopy. But you would need a .data() or whatever to pass it to functions taking a float* (this includes functions with vec3_t parameter)

@illwieckz
Copy link
Member Author

I like the idea of having have an alignedVec3_t but I don't see the benefit of making bounds_t an array of it. There is only a few usage of at(), while using .data() everywhere would be much verbose. Also I like the idea of explicit mins and maxs naming, I guess we may be able to do some mins() and maxs() and I tried to implement that with an alignedVec3_t but that because much more complex.

I just rewrote at() in a way it is now guaranteed to be branchless.

@illwieckz illwieckz force-pushed the illwieckz/bounds/sync branch from ce2df22 to b30fef6 Compare January 25, 2025 16:51
@slipher
Copy link
Member

slipher commented Jan 25, 2025

I like the idea of having have an alignedVec3_t but I don't see the benefit of making bounds_t an array of it.

I'm suggesting that no bounds_t is needed. As I wrote in the previous comment, it could be used like alignedVec3_t localBounds[ 2 ];

@illwieckz
Copy link
Member Author

But I like having a explicit bounds_t! It being implemented the way I did it or doing it like using bounds_t = alignedVec3_t[2], I want to do it…

@@ -68,10 +68,19 @@ macro(set_c_cxx_flag FLAG)
set_c_flag(${FLAG} ${ARGN})
set_cxx_flag(${FLAG} ${ARGN})
endmacro()

macro(set_kind_linker_flag KIND FLAG)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are random cmake changes in here by mistake


vec3_t& at( bool index )
{
return ( &mins + ( index * 16 ) )[ 0 ];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's undefined behavior, like the (&v.x)[i] thing with GLM.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how can this be undefined if the struct layout is guaranteed? And if it isn't guaranteed, how can we use structs for defining file storage format or network packet formats if the layout of the data isn't guaranteed?

Is there a type in C/C++ that guarantees that alignas(16) vec4_t, alignas(16) vec4_t has an offset of sizeof(vec4_t) or 16?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be solved with static assertion (if the compiler doesn't do as expected, don't let the code be compiled?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter that the layout is guaranteed. It's illegal to use an index that goes outside the bounds of an array. (A non-array object is treated like an array of length 1.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is not an index, this is a pointer. Pointer arithmetic isn't undefined.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I didn't explain it precisely. Let's cite people who can. https://en.cppreference.com/w/cpp/language/operator_arithmetic#Pointer_arithmetic

Or one would say that if you have float array[3], doing &array[0] + 2 isn't defined because that goes outside of the object (array[0]).

This is the case Otherwise, if P points to the ith element of an array object x with n elements, given the value of J as j, P is added or subtracted as follows: . J can be from 0 to 3.

If you do &array instead of &a[0], it's the case Otherwise, if P points to a complete object, a base class subobject or a member subobject y, given the value of J as j, P is added or subtracted as follows: . J can be 0 or 1. Note that if it is 1, the resulting pointer is a "pointer past the end" and is illegal to dereference, even if it would happen to have the same numerical value as a valid object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, is this defined?

	vec_t* at( bool index )
	{
		return (vec_t*) this + ( index * 16 );
	}

The pointer doesn't go past the end of the object this.

Copy link
Member Author

@illwieckz illwieckz Jan 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that doesn't work, actually only the if (index) return maxs; return mins work:

https://godbolt.org/z/ETEqnG9Gv

What's curious is that the compilers (GCC, Clang) seem to produce the same code for all three functions, but produce garbage on the functions doing pointer arithmetic. Edit: In fact the computed offsets are wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: In fact the computed offsets are wrong.

The wrong offsets are the ones in your at() functions, i. e. it should be index * 4, rather than index * 16.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen claims that, essentially, doing this would not be UB:
return (vec_t*) ( (char*) &mins[0] + ( index ? offsetof( boundsA_t, maxs ) : offsetof( boundsA_t, mins ) ) );
However, don't quote me on that, & it produces the same assembly anyway.

@illwieckz illwieckz force-pushed the illwieckz/bounds/sync branch 3 times, most recently from 73a5d78 to bf2896f Compare January 30, 2025 00:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants