feat: teach BitPackedEncoding canonical_into #2324
Conversation
Runtime for a 100k element array is reduced to ~60% of `into_canonical`. This runtime reduction only holds when the builder avoids all resizes (allocation + `extend_from_slice`)! The builder must be large enough to contain both the entire decoded array and also any scratch space needed by the decoder. For BitPacked, that means it needs 1023 empty element slots in case it has to decode a 1024-element chunk to retrieve just one element.

```
canonicalize_bench      fastest  │ slowest  │ median   │ mean     │ samples │ iters
├─ test                 43.24 µs │ 221.6 µs │ 47.62 µs │ 47.49 µs │ 100000  │ 100000
├─ canonical_into                │          │          │          │         │
│  ╰─ u32                        │          │          │          │         │
│     ╰─ (100000, 3)    24.12 µs │ 104.7 µs │ 24.29 µs │ 24.39 µs │ 100000  │ 100000
╰─ into_canonical                │          │          │          │         │
   ╰─ u32                        │          │          │          │         │
      ╰─ (100000, 3)    39.54 µs │ 249.6 µs │ 39.91 µs │ 40.95 µs │ 100000  │ 100000
```
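For reference, a minimal sketch of what a call site has to do to hit the fast path. It uses the `PrimitiveBuilder::with_capacity` constructor that appears in this PR's benchmarks; the `canonical_into` call shape is an assumption, not the exact signature:

```rust
// Sketch only: size the builder up front so BitPacked decoding never resizes.
// `+ 1024` leaves the scratch slots described above (1023 are strictly needed)
// so a whole 1024-element chunk can always be unpacked in place.
let mut builder = PrimitiveBuilder::<u32>::with_capacity(
    array.dtype().nullability(),
    array.len() + 1024,
);
array.canonical_into(&mut builder)?; // assumed call shape
```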
Is there a reason `into_canonical` doesn't also do this?
In this PR, `into_canonical` uses the same code path, so for a single BitPackedArray the runtime should be equal. I could make that change here if we wanted, but so far it seems like the `canonical_into` machinery is implemented & benchmarked but unused.
Is there any reason `into_canonical` doesn't just allocate an array builder and call this one in this PR?
This is great!
```rust
match_each_unsigned_integer_ptype!(indices.ptype(), |$P| {
    for (compressed_index, decompressed_index) in indices.into_primitive()?.as_slice::<$P>().into_iter().enumerate() {
        let decompressed_index = *decompressed_index as usize;
        let out_index = starting_at + decompressed_index - offset;
        if validity.value(compressed_index) {
            self.values[out_index] = values[compressed_index]
        } else {
            self.nulls.set_bit(out_index, false)
        }
    }
});

Ok(())
```
Can this live in its own fn?
Done.
```rust
        Ok(unpacked)
    }
}

builder.nulls.append_validity(array.validity(), length)?;
```
Do you think this should live in the builder as a method?
Right now, all the builder methods act on both the values and the nulls. I think I prefer the interface to be either: update values and nulls together, or use the nulls and the values directly and separately. I could add a `parts(&mut self) -> (&mut BufferMut<T>, &mut LazyNullBufferBuilder)` method.
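Roughly what that accessor might look like (a sketch only; it assumes the builder's fields are literally named `values` and `nulls`, as in the snippets quoted elsewhere in this thread):

```rust
impl<T: NativePType> PrimitiveBuilder<T> {
    /// Hand out the raw parts so callers can update values and nulls independently.
    pub fn parts(&mut self) -> (&mut BufferMut<T>, &mut LazyNullBufferBuilder) {
        (&mut self.values, &mut self.nulls)
    }
}
```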
```rust
(first_full_chunk..num_chunks).for_each(|i| {
    let chunk: &[T] = &packed[i * elems_per_chunk..][0..elems_per_chunk];
```
OOI, why did you move from a `for_each` to a `for` loop?
I needed a `?` and I couldn't quickly sort out how to make `for_each` take a Result-returning function.
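For what it's worth, `Iterator::try_for_each` does take a fallible closure, so something like the sketch below would also work; `unpack_chunk_into` is a hypothetical stand-in for the loop body, and the plain `for` loop reads just as well.

```rust
(first_full_chunk..num_chunks).try_for_each(|i| {
    let chunk = &packed[i * elems_per_chunk..][0..elems_per_chunk];
    // The closure's Result is propagated by try_for_each, so the first Err stops the loop.
    unpack_chunk_into(chunk, builder) // hypothetical helper returning VortexResult<()>
})?;
```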
```rust
for i in first_full_chunk..num_chunks {
    let chunk = &packed[i * elems_per_chunk..][0..elems_per_chunk];
```
I wonder, if there are lots of chunks (more than fit in L2 cache), whether we want to apply patches after each loop iteration?
Reasonable, but then I need to slice/chunk/partition the patches, which is not free.
True
```
@@ -80,11 +81,51 @@ impl LazyNullBufferBuilder {
            .append_buffer(&bool_buffer);
    }

    pub fn append_validity(&mut self, validity: Validity, length: usize) -> VortexResult<()> {
```
I had been doing this with a validity mask. I wonder if we should choose one of the two methods and use it consistently:
```rust
match array.validity_mask()?.boolean_buffer() {
    AllOr::All => {
        self.nulls.append_n_non_nulls(array.len());
    }
    AllOr::None => self.nulls.append_n_nulls(array.len()),
    AllOr::Some(validity) => self.nulls.append_buffer(validity.clone()),
}
```
I think we should use `validity_mask` here; it should perform better than creating a validity and then a boolean buffer below.
Sure, I went with the following and changed `append_buffer` to take a reference to a `BooleanBuffer` rather than an owned one.
```rust
pub fn append_validity_mask(&mut self, validity_mask: Mask) {
    match validity_mask {
        Mask::AllTrue(len) => self.append_n_non_nulls(len),
        Mask::AllFalse(len) => self.append_n_nulls(len),
        Mask::Values(is_valid) => self.append_buffer(is_valid.boolean_buffer()),
    }
}
```
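So the call site from earlier becomes, roughly:

```rust
builder.nulls.append_validity_mask(array.validity_mask()?);
```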
```rust
let mut primitive_builder = PrimitiveBuilder::<T>::with_capacity(
    arr.dtype().nullability(),
    arr_len * chunk_count + 1024,
);
```
This + 1024 values is really important; we need to make sure that it is actually applied in the ctor of the builders when needed, or maybe always?
Yeah, I'm not sure how best to approach this. Perhaps arrays should have a size_hint? Perhaps every builder should assume it needs a magic 1024 extra elements of space? 1024 is kind of a magic number in Vortex.
The benefit of this is when we have a ChunkedArray of BitPacked arrays, correct? So we cannot use that for into_canonical.
```rust
assert!(starting_at + array_len == self.len());
let indices = indices.into_primitive()?;
let values = values.into_primitive()?;
let validity = values.validity_mask()?;
```
If the validity is all true, then remove the check from the loop below.
Done.
```rust
if validity.value(compressed_index) {
    self.values[out_index] = values[compressed_index]
} else {
    self.nulls.set_bit(out_index, false)
}
```
Suggested change:

```diff
-if validity.value(compressed_index) {
-    self.values[out_index] = values[compressed_index]
-} else {
-    self.nulls.set_bit(out_index, false)
-}
+self.values[out_index] = values[compressed_index];
+self.nulls.set_bit(out_index, validity.value(compressed_index));
```

I think you can remove the branch if you ensure that `self.nulls` has the correct backing array below.
I did something different but indeed there are no branches.
I did this.
Can you also add benchmarks for nullable arrays and for bitpacked arrays with patches?
This creates patches:

```rust
let mut rng = StdRng::seed_from_u64(0);
let values = (0..len)
    .map(|_| {
        if rng.gen_bool(0.99) {
            T::from(rng.gen_range(0..6)).vortex_expect("valid value")
                + T::from(7).vortex_expect("sdf")
        } else {
            T::from(rng.gen_range(0..100)).vortex_expect("valid value")
                * T::from(100).expect("valid value")
        }
    })
    .collect::<BufferMut<T>>()
    .into_array()
    .into_primitive()
    .vortex_unwrap();
let bit = bitpack_encode(values, 12).vortex_unwrap();
```
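(Presumably the point of the ~1% branch: it produces values up to 9900, which don't fit in the 12-bit width passed to `bitpack_encode`, so those values end up as patches.)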
Something suspicious happens when the builder's capacity is ~10M. It's a bit fuzzy and changes from compilation to compilation, but I think something bad happens when the capacity is in the range [1 << 24, 1 << 25]. Profiles and tracing point at each call to […].

(The non-nullable and nullable profile details are omitted here.)
BitPacking's `into_canonical` does this now, but we should make this change for ChunkedArray too. I prefer to do it in a separate PR so we can benchmark it as an independent change. In particular, I've seen `canonical_into` be slower sometimes because of the need to grow the `BufferMut`. I want to make sure we catch that kind of change in the PR that makes ChunkedArray use `canonical_into`.
Does this help with the strange behaviour?
```rust
} else {
    self.nulls.set_bit(out_index, false)
}

if !matches!(validity, Mask::AllTrue(_)) {
```
Sorry to be annoying, but can we move this into either two matches or an inner fn? @0ax1 noticed that this match plus more inner code can affect perf optimizations from the compiler.
Also, do you think it's a good idea to implement this with two loops in the nullable case?
If code size is no concern, I think the best code is: switch on nullability and, in the nullable case, have one loop over cache-sized chunks of indices with two inner loops that update the nulls and the values.
I could benchmark these approaches. I mostly felt the effect wouldn't be that large and I'm not sure of a way to pick the right size of the chunks. Patches should usually be a pretty small fraction of the array.
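A sketch of the shape being described (illustrative only, not the PR's code; `CHUNK` is a guessed block size and the index arithmetic mirrors the snippet quoted earlier):

```rust
const CHUNK: usize = 1024; // guess at a cache-friendly block of patch indices

match validity {
    Mask::AllTrue(_) => {
        // Non-nullable: one tight loop, never touch the null buffer.
        for (i, &idx) in indices.iter().enumerate() {
            self.values[starting_at + idx as usize - offset] = values[i];
        }
    }
    _ => {
        // Nullable: walk cache-sized blocks of patches; within each block, one
        // loop writes values and a second loop writes validity bits.
        for (b, block) in indices.chunks(CHUNK).enumerate() {
            let base = b * CHUNK;
            for (j, &idx) in block.iter().enumerate() {
                self.values[starting_at + idx as usize - offset] = values[base + j];
            }
            for (j, &idx) in block.iter().enumerate() {
                let out = starting_at + idx as usize - offset;
                self.nulls.set_bit(out, validity.value(base + j));
            }
        }
    }
}
```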
I benchmarked two loops vs one loop for nullable, and your hunch was right that one loop is best.
This is on release, correct?
```rust
} else {
    self.nulls.set_bit(out_index, false)
}

if !matches!(validity, Mask::AllTrue(_)) {
```
I think you can only do this with nullability, not validity?
Mmm. I was thinking that the validity of a position which will be patched is true, but I think you're right that the unpatched values could indeed be invalid.

It reduces the numbers overall but doesn't change the relationship on the bad examples (numbers omitted).
Latest numbers, summarized by (rows_per_chunk, n_chunks, fraction_patches). The non-nullable tables (microseconds and milliseconds) and the nullable table (microseconds; all samples have 5% of positions randomly invalid) are omitted here.
The range [1 << 24, 1 << 25] is the number of elements, right? How big is that in bytes?
```rust
|x| unsafe { std::mem::transmute(x) },
|x| unsafe { std::mem::transmute(x) },
```
Some safety and explanatory comments would be good here?
Done.
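Roughly the shape of comment being asked for (an illustrative sketch, not the PR's actual text; it assumes the closures reinterpret a value as its same-width unsigned counterpart, e.g. `T` ↔ `UnsignedT`):

```rust
// SAFETY: `T` and `UnsignedT` have the same size and bit layout (a primitive
// integer and its unsigned counterpart), so reinterpreting the bits with
// `transmute` cannot produce an invalid value.
|x| unsafe { std::mem::transmute(x) },
```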
```rust
for i in first_full_chunk..num_chunks {
    let chunk = &packed[i * elems_per_chunk..][0..elems_per_chunk];
```
True
```rust
}

let packed = array.packed_slice::<UnsignedT>();
let builder_current_length = builder.len();
builder.values.reserve((length + 1023) / 1024 * 1024);
```
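(The `(length + 1023) / 1024 * 1024` expression rounds `length` up to the next multiple of 1024, presumably so whole chunks can be unpacked in place; for example, `length = 100_000` reserves 100_352 slots, while an exact multiple like 102_400 reserves exactly 102_400.)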
So do you think this is the part that reallocates and affects perf?
We know the length of the array, so could you instead allocate a 1024 * |T| scratch buffer, decode the last block into that, and then copy values from it into the builder (if there is not already enough space)?
In the benchmarks, we ensure the resulting builder is big enough to hold all the elements plus any scratch space needed by the decoder (in this case, 1024 empty slots, though only 1023 are needed). This operation shouldn't allocate in the benchmarks. I've verified this in the profile: there's no `BufferMut::reserve` or `BytesMut::extend_from_slice`, both of which I've seen when it allocates.

Yeah, what you're describing is a reasonable alternative to adding extra space to the builders. I'd be happy to do that, though at this point it feels like we should do it in a follow-up PR since this one is fairly large already.
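A sketch of that follow-up idea, under the assumption that the trailing partial chunk can be unpacked into a small fixed buffer and copied over (`unpack_chunk` is a hypothetical stand-in for the existing chunk decoder, and whether `BufferMut` exposes `extend_from_slice` like this is also an assumption):

```rust
// Sketch: decode the trailing partial chunk into fixed scratch space instead of
// requiring the builder to carry ~1024 spare slots.
let remaining = length % elems_per_chunk; // elements in the final, partial chunk
if remaining != 0 {
    // Assumes UnsignedT: Copy + Default, as for the primitive integer types used here.
    let mut scratch = [UnsignedT::default(); 1024];
    unpack_chunk(last_packed_chunk, &mut scratch); // hypothetical chunk decoder
    builder.values.extend_from_slice(&scratch[..remaining]);
}
```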
```rust
.bench_values(|chunked| {
    let mut primitive_builder = PrimitiveBuilder::<i32>::with_capacity(
        chunked.dtype().nullability(),
        chunk_len * chunk_count + 1024,
```
We cannot assume this constant since it's not added by the builder; or at least we need another bench without it, which should be the primary one.
I ran the benchmark without the extra 1024 elements, and canonical_into ended up slower in all cases. When the builder can’t hold the full result, it has to resize and copy—just like into_canonical.
I can also address your comment about using a temporary buffer for the final chunk and include that in a benchmark.
Looking ahead, I see two possible approaches:
- Never use truncate: canonical_into implementations would only push as many elements as they have (len), without truncating.
- Provide extra capacity: canonical_into implementations would reserve “garbage” space (perhaps via a minimum_builder_capacity()) so child structures can always write directly to their final location.
I’m inclined toward the second option because data goes exactly where it belongs, although it does waste space for smaller arrays. For example, a 500-element array might need space for 1024 elements. Ultimately, this is a key design decision for how builders should work.
Maybe worth a quick sync or a wider discussion? What do you think?
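If the second option wins, the hook could be as small as this sketch (hypothetical trait and method names, following the minimum_builder_capacity() idea above):

```rust
trait CanonicalInto {
    /// Extra "garbage" capacity the builder must reserve beyond the decoded
    /// length so this encoding can always write to final positions.
    fn minimum_builder_capacity(&self) -> usize {
        0
    }
}

// BitPacked may unpack a full 1024-element chunk to recover a single element.
impl CanonicalInto for BitPackedArray {
    fn minimum_builder_capacity(&self) -> usize {
        1023
    }
}
```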
Looking good, it's almost there.
Yeah, that's right. The elements are i32s. Hundreds of megabytes. I'd believe there are cache effects, but I don't see how that could cause a small window to be better than outside the window.
LG
```rust
fn main() {
    divan::main();
}

fn make_array<T: NativePType>(len: usize) -> Array {
```

```rust
#[divan::bench(
```
Mind extracting these into a single shared const? See `BENCH_ARGS`.