-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add bounds_t
with pre-aligned mins
and maxs
#1482
base: master
Are you sure you want to change the base?
Conversation
For now there is a bug somewhere I have not found yet. The bug can be seen by loading the In my first iteration of the branch I made a stupid mistakes by passing the |
Note: the percentage of time spent is not reliable in those screenshots because I use a Before (C), a lot of time is spent in After (C), the first loads are just completely skipped: Before (Asm), we see three After (Asm), se see only two |
So, as I said, this is not urgent. But I would appreciate people re-reading what I did in hope someone finds the stupid mistake I may have done. The code change should be straightforward because it's basically rewriting with a different data layout, there is no algorithm change to be expected. Unfortunately this data struct is used in many places so the patch is massive. There are many functions in |
I also added some constness to some function input, which may help the compiler to optimize a bit more some other places of the code. I doubt this extra-constness is the cause of the bug I'm facing because the compiler doesn't complain I'm writing to some |
da96dc5
to
b9fbbab
Compare
out->bounds.mins[ 1 ] = LittleLong( in->mins[ 1 ] ); | ||
out->bounds.mins[ 2 ] = LittleLong( in->mins[ 2 ] ); | ||
out->bounds.maxs[ 1 ] = LittleLong( in->maxs[ 1 ] ); | ||
out->bounds.maxs[ 2 ] = LittleLong( in->maxs[ 2 ] ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I forgot:
out->bounds.maxs[ 0 ] = LittleLong( in->maxs[ 0 ] );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the bug I was looking for.
b9fbbab
to
3176224
Compare
So I reread everything and found the bug I was looking for. I missed the conversion of one line (I just did not re-added the new one, that's why there was no compile type error). It works. |
4bea319
to
e99d1aa
Compare
I found another bug affecting BSP movers like doors. One can test it in the Vega map. The doors disappear according to the viewing angle / point of view. |
92be21c
to
d7d9d1e
Compare
Any fps difference? |
d7d9d1e
to
3239c6e
Compare
I think you can make a drop-in aligned replacement for vec3_t like this:
This can be used in arrays so then you don't need |
src/engine/qcommon/q_shared.h
Outdated
@@ -1718,7 +1720,7 @@ void MatrixTransformBounds( const matrix_t m, const bounds_t &b, bounds_t &o ); | |||
vec_t dist; | |||
byte type; // for fast side tests: 0,1,2 = axial, 3 = nonaxial | |||
byte signbits; // signx + (signy<<1) + (signz<<2), used as lookup during collision | |||
byte pad[ 2 ]; | |||
byte pad[ 12 ]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks wrong. The previous variables in the struct add up to 6 bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was because of:
But maybe I'm wrong with the values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but it should be byte pad[ 14 ]
, rather than byte pad[ 12 ]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(small correction to my previous comment: they add up to 18 bytes, not 6)
3239c6e
to
881fb23
Compare
Interesting, how such type can be used and where? |
As a direct replacement for vec3_t. |
I like the idea of having have an I just rewrote |
ce2df22
to
b30fef6
Compare
I'm suggesting that no |
But I like having a explicit |
@@ -68,10 +68,19 @@ macro(set_c_cxx_flag FLAG) | |||
set_c_flag(${FLAG} ${ARGN}) | |||
set_cxx_flag(${FLAG} ${ARGN}) | |||
endmacro() | |||
|
|||
macro(set_kind_linker_flag KIND FLAG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are random cmake changes in here by mistake
src/engine/qcommon/q_shared.h
Outdated
|
||
vec3_t& at( bool index ) | ||
{ | ||
return ( &mins + ( index * 16 ) )[ 0 ]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's undefined behavior, like the (&v.x)[i]
thing with GLM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But how can this be undefined if the struct layout is guaranteed? And if it isn't guaranteed, how can we use structs for defining file storage format or network packet formats if the layout of the data isn't guaranteed?
Is there a type in C/C++ that guarantees that alignas(16) vec4_t, alignas(16) vec4_t
has an offset of sizeof(vec4_t)
or 16
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be solved with static assertion (if the compiler doesn't do as expected, don't let the code be compiled?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't matter that the layout is guaranteed. It's illegal to use an index that goes outside the bounds of an array. (A non-array object is treated like an array of length 1.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is not an index, this is a pointer. Pointer arithmetic isn't undefined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I didn't explain it precisely. Let's cite people who can. https://en.cppreference.com/w/cpp/language/operator_arithmetic#Pointer_arithmetic
Or one would say that if you have float array[3], doing &array[0] + 2 isn't defined because that goes outside of the object (array[0]).
This is the case Otherwise, if P points to the ith element of an array object x with n elements, given the value of J as j, P is added or subtracted as follows:
. J can be from 0 to 3.
If you do &array
instead of &a[0]
, it's the case Otherwise, if P points to a complete object, a base class subobject or a member subobject y, given the value of J as j, P is added or subtracted as follows:
. J can be 0 or 1. Note that if it is 1, the resulting pointer is a "pointer past the end" and is illegal to dereference, even if it would happen to have the same numerical value as a valid object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, is this defined?
vec_t* at( bool index )
{
return (vec_t*) this + ( index * 16 );
}
The pointer doesn't go past the end of the object this
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that doesn't work, actually only the if (index) return maxs; return mins
work:
https://godbolt.org/z/ETEqnG9Gv
What's curious is that the compilers (GCC, Clang) seem to produce the same code for all three functions, but produce garbage on the functions doing pointer arithmetic. Edit: In fact the computed offsets are wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: In fact the computed offsets are wrong.
The wrong offsets are the ones in your at()
functions, i. e. it should be index * 4
, rather than index * 16
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen claims that, essentially, doing this would not be UB:
return (vec_t*) ( (char*) &mins[0] + ( index ? offsetof( boundsA_t, maxs ) : offsetof( boundsA_t, mins ) ) );
However, don't quote me on that, & it produces the same assembly anyway.
73a5d78
to
bf2896f
Compare
This is not urgent at all. This is complementary to:
BoxOnPlaneSide
function #1142It also improves over:
plane_t
struct withnormal
anddist
members #1043The
BoxOnPlaneSide()
function is known to be a hot spot, being called by multiple recursive functions. Right now we spend a lot of time in_mm_loadu_ps
, we have to callsseLoadVec3Unsafe()
explicitly because we can't guess if the data is aligned or not. This comes with multiple downside:_mm_loadu_ps
is said to be slower than_mm_load_ps
, and that fits my experience¹._mm_loadu_ps
and will always call it if we ask for it explicitely, even if the data is already aligned, and by experience, even if no copy is needed.So the idea is to introduce some nice
bounds_t
struct that uses alignedmins
andmaxs
, same forcplane_t
with an alignednormal
. When doing that, we can write an explicit_mm_load_ps
that is faster than_mm_loadu_ps
, and most of the cases, because the compiler knows the data isconst
and already aligned, the compiler just removes the_mm_load_ps
and process the data without any copy.See also:
bounds_t
with pre-alignedmins
andmaxs
Unvanquished/Unvanquished#3265¹ Some times ago I tried to write optimized SSE code for some other functions, but the code was slower because of the explicit
_mm_loadu_ps
call. I even noticed copying an unaligned vec3 to an aligned vec3 before calling_mm_load_ps
could make the code faster when the compiler notices the data is already aligned and skips the copy and calls_mm_load_ps
(or even doesn't need to call it at all).