
feat: teach BitPackedEncoding canonical_into #2324

Merged (23 commits into develop, Feb 14, 2025)

Conversation

danking (Member) commented Feb 11, 2025

Runtime for a 100k element array is reduced to ~60% of `into_canonical`.

This runtime reduction only holds when the builder avoids all resizes (allocation + `extend_from_slice`)! The builder must be large enough to contain both the entire decoded array and any scratch space needed by the decoder. For BitPacked, that means it needs 1023 empty element slots in case it must decode a 1024-element chunk to retrieve just one element.

```
canonicalize_bench    fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ test               43.24 µs      │ 221.6 µs      │ 47.62 µs      │ 47.49 µs      │ 100000  │ 100000
├─ canonical_into                   │               │               │               │         │
│  ╰─ u32                           │               │               │               │         │
│     ╰─ (100000, 3)  24.12 µs      │ 104.7 µs      │ 24.29 µs      │ 24.39 µs      │ 100000  │ 100000
╰─ into_canonical                   │               │               │               │         │
   ╰─ u32                           │               │               │               │         │
      ╰─ (100000, 3)  39.54 µs      │ 249.6 µs      │ 39.91 µs      │ 40.95 µs      │ 100000  │ 100000
```
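
A minimal sketch of the capacity rule described above, assuming a 1024-element decode chunk (the function name and rounding rule are illustrative, not the Vortex API):

```rust
// Hypothetical helper illustrating the sizing rule: the builder must hold
// the full decoded length rounded up to a whole 1024-element chunk, so the
// decoder can always unpack a complete final chunk in place.
fn builder_capacity(decoded_len: usize) -> usize {
    const CHUNK: usize = 1024;
    decoded_len.div_ceil(CHUNK) * CHUNK // leaves up to 1023 slots of scratch
}

fn main() {
    assert_eq!(builder_capacity(100_000), 100_352); // 98 full chunks
    assert_eq!(builder_capacity(1), 1_024); // 1 element + 1023 scratch slots
}
```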
danking marked this pull request as ready for review February 11, 2025 21:20
gatesn (Contributor) commented Feb 11, 2025

Is there a reason into_canonical doesn't also do this?

danking (Member, Author) commented Feb 11, 2025

In this PR, into_canonical uses the same code path so for a single BitPackedArray the runtime should be equal. If we changed ChunkedArray.into_canonical to use canonical_into with a builder, then, yes, that would improve the runtime of ChunkedArray.into_canonical when containing BitPackedArray chunks.

I could make that change here, but so far it seems like the canonical_into stuff is implemented & benchmarked but unused.
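
As a rough illustration of that follow-up idea (stub types, not the real Vortex API): size one builder for every chunk up front, then let each chunk decode into it.

```rust
// Hypothetical sketch with stand-in types: ChunkedArray::into_canonical
// could allocate one builder for all chunks and decode each chunk into it.
struct Builder {
    values: Vec<u32>,
}

struct Chunk {
    decoded: Vec<u32>, // stand-in for a bit-packed chunk
}

impl Chunk {
    fn len(&self) -> usize {
        self.decoded.len()
    }

    // Stand-in for canonical_into: append this chunk's decoded values.
    fn canonical_into(&self, builder: &mut Builder) {
        builder.values.extend_from_slice(&self.decoded);
    }
}

fn chunked_into_canonical(chunks: &[Chunk]) -> Vec<u32> {
    let total: usize = chunks.iter().map(Chunk::len).sum();
    // One allocation up front, plus scratch headroom for the final chunk.
    let mut builder = Builder { values: Vec::with_capacity(total + 1024) };
    for chunk in chunks {
        chunk.canonical_into(&mut builder);
    }
    builder.values
}

fn main() {
    let chunks = [Chunk { decoded: vec![1, 2, 3] }, Chunk { decoded: vec![4, 5] }];
    assert_eq!(chunked_into_canonical(&chunks), vec![1, 2, 3, 4, 5]);
}
```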

gatesn (Contributor) commented Feb 11, 2025

Is there any reason into_canonical doesn't just allocate an array builder and call this one in this PR?

joseph-isaacs (Member) left a comment

This is great!

Comment on lines 57 to 69:

```rust
match_each_unsigned_integer_ptype!(indices.ptype(), |$P| {
    for (compressed_index, decompressed_index) in indices.into_primitive()?.as_slice::<$P>().into_iter().enumerate() {
        let decompressed_index = *decompressed_index as usize;
        let out_index = starting_at + decompressed_index - offset;
        if validity.value(compressed_index) {
            self.values[out_index] = values[compressed_index]
        } else {
            self.nulls.set_bit(out_index, false)
        }
    }
});

Ok(())
```
joseph-isaacs (Member):

Can this live in its own fn?

danking (Member, Author):

Done.

```rust
        Ok(unpacked)
    }
}
builder.nulls.append_validity(array.validity(), length)?;
```
joseph-isaacs (Member):

Do you think this should live in the builder as a method?

danking (Member, Author):

Right now, all the builder methods act on both the values and the nulls. I think I'd prefer the interface to be either: update values and nulls together, or use the nulls and the values directly and separately. I could add a `parts(&mut self) -> (&mut BufferMut<T>, &mut LazyNullBufferBuilder)` method.
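
A minimal sketch of what such a `parts` accessor could look like (types are stand-ins for the real builder internals): returning both halves from one call lets callers mutate values and nulls separately without fighting the borrow checker.

```rust
// Stand-in types, not the real Vortex definitions.
struct BufferMut<T>(Vec<T>);
struct LazyNullBufferBuilder(Vec<bool>);

struct PrimitiveBuilder<T> {
    values: BufferMut<T>,
    nulls: LazyNullBufferBuilder,
}

impl<T> PrimitiveBuilder<T> {
    // Splitting one `&mut self` into disjoint field borrows is allowed,
    // so both halves can be mutated simultaneously.
    fn parts(&mut self) -> (&mut BufferMut<T>, &mut LazyNullBufferBuilder) {
        (&mut self.values, &mut self.nulls)
    }
}

fn main() {
    let mut b = PrimitiveBuilder {
        values: BufferMut(vec![0u32]),
        nulls: LazyNullBufferBuilder(vec![true]),
    };
    let (values, nulls) = b.parts();
    values.0.push(7);
    nulls.0.push(false);
}
```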

Comment on lines -260 to -261:

```rust
(first_full_chunk..num_chunks).for_each(|i| {
    let chunk: &[T] = &packed[i * elems_per_chunk..][0..elems_per_chunk];
```
joseph-isaacs (Member):

OOI, why did you move from a `for_each` to a loop?

danking (Member, Author):

I needed a `?` and I couldn't quickly sort out how to make `for_each` take a `Result`-returning function.
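
For reference, a small sketch of both options (`unpack_one` is a hypothetical stand-in): `for_each` takes an `FnMut(T)` closure, so `?` can't propagate errors out of it, whereas a plain loop or `try_for_each` can.

```rust
fn unpack_one(_chunk_index: usize) -> Result<(), String> {
    Ok(())
}

fn unpack_chunks(num_chunks: usize) -> Result<(), String> {
    // Plain loop: `?` propagates the error directly, as done in the PR.
    for i in 0..num_chunks {
        unpack_one(i)?;
    }
    // Iterator equivalent: the closure itself returns a Result.
    (0..num_chunks).try_for_each(unpack_one)
}

fn main() {
    unpack_chunks(4).expect("unpacking failed");
}
```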

Comment on lines 275 to 276:

```rust
for i in first_full_chunk..num_chunks {
    let chunk = &packed[i * elems_per_chunk..][0..elems_per_chunk];
```
joseph-isaacs (Member):

I wonder, if there are lots of chunks (more than fit in L2 cache), whether we want to apply patches after each loop iteration?

danking (Member, Author):

Reasonable, but then I'd need to slice/chunk/partition the patches, which is not free.

joseph-isaacs (Member):

True

```
@@ -80,11 +81,51 @@ impl LazyNullBufferBuilder {
            .append_buffer(&bool_buffer);
    }

    pub fn append_validity(&mut self, validity: Validity, length: usize) -> VortexResult<()> {
```
joseph-isaacs (Member):

I had been doing this with a validity mask. I wonder if we should choose one of the two methods and use it consistently:

```rust
match array.validity_mask()?.boolean_buffer() {
    AllOr::All => {
        self.nulls.append_n_non_nulls(array.len());
    }
    AllOr::None => self.nulls.append_n_nulls(array.len()),
    AllOr::Some(validity) => self.nulls.append_buffer(validity.clone()),
}
```

joseph-isaacs (Member):

I think we should use `validity_mask` here; it should perform better than creating a validity and then a boolean buffer below.

danking (Member, Author):

Sure. I went with the following and changed `append_buffer` to take a reference to a `BooleanBuffer` rather than an owned one.

```rust
pub fn append_validity_mask(&mut self, validity_mask: Mask) {
    match validity_mask {
        Mask::AllTrue(len) => self.append_n_non_nulls(len),
        Mask::AllFalse(len) => self.append_n_nulls(len),
        Mask::Values(is_valid) => self.append_buffer(is_valid.boolean_buffer()),
    }
}
```

Comment on lines 108 to 111:

```rust
let mut primitive_builder = PrimitiveBuilder::<T>::with_capacity(
    arr.dtype().nullability(),
    arr_len * chunk_count + 1024,
);
```
joseph-isaacs (Member):

This `+ 1024` is really important; we need to make sure it is actually applied in the ctor of the builders when needed, or maybe always?

danking (Member, Author):

Yeah, I'm not sure how best to approach this. Perhaps arrays should have a size_hint? Perhaps every builder should assume it needs a magic 1024 extra elements of space? 1024 is kind of a magic number in Vortex.
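
One hypothetical shape for the `size_hint` idea, sketched as a trait with a scratch-space hint (all names invented for illustration, not the Vortex API):

```rust
trait CanonicalInto {
    /// Decoded length of this array.
    fn len(&self) -> usize;

    /// Extra capacity the decoder may touch past `len()`. BitPacked would
    /// report 1023: one full 1024-element chunk minus one element.
    fn scratch_hint(&self) -> usize {
        0
    }

    /// Total capacity a caller should reserve before decoding.
    fn builder_capacity_hint(&self) -> usize {
        self.len() + self.scratch_hint()
    }
}

struct BitPackedStub {
    len: usize,
}

impl CanonicalInto for BitPackedStub {
    fn len(&self) -> usize {
        self.len
    }

    fn scratch_hint(&self) -> usize {
        1023
    }
}
```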

joseph-isaacs (Member):

The benefit of this is when we have a chunked array of bitpacked chunks, correct? So we cannot use it for `into_canonical`?

```rust
assert!(starting_at + array_len == self.len());
let indices = indices.into_primitive()?;
let values = values.into_primitive()?;
let validity = values.validity_mask()?;
```
joseph-isaacs (Member):

If the validity is all true, then remove the check from the loop below.
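
A small sketch of that suggestion (stand-in types, not the Vortex API): match on the mask once, outside the loop, so the all-true case runs a check-free inner loop.

```rust
// Stand-in for the real Mask type.
enum Mask {
    AllTrue(usize),
    Values(Vec<bool>),
}

fn apply_patches(values_out: &mut [u32], patches: &[(usize, u32)], validity: &Mask) {
    match validity {
        // All patch values valid: no per-element validity check.
        Mask::AllTrue(_) => {
            for &(index, value) in patches {
                values_out[index] = value;
            }
        }
        Mask::Values(is_valid) => {
            for (i, &(index, value)) in patches.iter().enumerate() {
                if is_valid[i] {
                    values_out[index] = value;
                }
            }
        }
    }
}
```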

danking (Member, Author):

Done.

Comment on lines 61 to 65:

```rust
if validity.value(compressed_index) {
    self.values[out_index] = values[compressed_index]
} else {
    self.nulls.set_bit(out_index, false)
}
```
joseph-isaacs (Member):

Suggested change:

```diff
-if validity.value(compressed_index) {
-    self.values[out_index] = values[compressed_index]
-} else {
-    self.nulls.set_bit(out_index, false)
-}
+self.values[out_index] = values[compressed_index];
+self.nulls.set_bit(out_index, validity.value(compressed_index));
```

I think you can remove the branch if you ensure that self.nulls has the correct backing array below

danking (Member, Author) commented Feb 12, 2025:

I did something different but indeed there are no branches.

danking (Member, Author):

I did this.

joseph-isaacs (Member):

Can you also add benchmarks for nullable arrays and bitpacked arrays with patches?

joseph-isaacs (Member):

This creates patches:

```rust
let mut rng = StdRng::seed_from_u64(0);
let values = (0..len)
    .map(|_| {
        if rng.gen_bool(0.99) {
            T::from(rng.gen_range(0..6)).vortex_expect("valid value")
                + T::from(7).vortex_expect("sdf")
        } else {
            T::from(rng.gen_range(0..100)).vortex_expect("valid value")
                * T::from(100).expect("valid value")
        }
    })
    .collect::<BufferMut<T>>()
    .into_array()
    .into_primitive()
    .vortex_unwrap();

let bit = bitpack_encode(values, 12).vortex_unwrap();
```

danking (Member, Author) commented Feb 12, 2025

Something suspicious happens when the builder's capacity is ~10M. It's a bit fuzzy and changes from compilation to compilation, but I think something bad happens when the capacity is in the range of 1 << 24 to 1 << 25. Outside that capacity "Bermuda interval", `canonical_into` is the best choice.

Profiles and tracing point at each call to `unpack_into`. Both `canonical_into` and `into_canonical` use `unpack_into`, and each calls it once per chunk; however, `canonical_into` uses one big builder while `into_canonical` uses a smaller builder per chunk. Tracing shows that `unpack_into` takes ~50µs when given a big builder and ~15µs when given a small builder. `lto=fat` has no effect. `inline(always)` on `unpack_into` has no effect.

non-nullable

```
├─ non_nullable            │ c_into  │ into_c │ratio│
│  ├─ (100000, 1, 0.0)     │ 11.08 µs│14.99 µs│0.74 │
│  ├─ (100000, 1, 0.1)     │ 15.91 µs│28.7 µs │0.55 │
│  ├─ (100000, 1, 0.01)    │ 10.29 µs│24.47 µs│0.42 │
│  ├─ (100000, 10, 0.0)    │ 202.2 µs│234.4 µs│0.86 │
│  ├─ (100000, 10, 0.1)    │ 253.9 µs│287.8 µs│0.88 │
│  ├─ (100000, 10, 0.01)   │ 213.4 µs│243.2 µs│0.88 │
│  ├─ (100000, 100, 0.0)   │ 2.233 ms│1.466 ms│1.52 │
│  ├─ (100000, 100, 0.1)   │ 2.176 ms│2 ms    │1.09 │
│  ├─ (100000, 100, 0.01)  │ 1.728 ms│1.551 ms│1.11 │
│  ├─ (100000, 1000, 0.0)  │ 31.25 ms│24.22 ms│1.29 │
│  ├─ (1000000, 100, 0.0)  │ 30.94 ms│23.7 ms │1.31 │
│  ├─ (1000000, 1000, 0.0) │ 308.4 ms│362 ms  │0.85 │
│  ╰─ (10000000, 100, 0.0) │ 313.3 ms│445.9 ms│0.70 │
```

nullable

```
├─ nullable                │ c_into  │ into_c │ratio│
│  ├─ (100000, 1, 0.0)     │ 11.29 µs│19.87 µs│0.57 │
│  ├─ (100000, 1, 0.1)     │ 16.45 µs│24.91 µs│0.66 │
│  ├─ (100000, 10, 0.0)    │ 189.6 µs│278.6 µs│0.68 │
│  ├─ (100000, 10, 0.1)    │ 240.5 µs│330.9 µs│0.73 │
│  ├─ (100000, 100, 0.0)   │ 1.749 ms│1.907 ms│0.92 │
│  ╰─ (100000, 100, 0.1)   │ 2.059 ms│2.429 ms│0.85 │
```

danking (Member, Author) commented Feb 12, 2025

> Is there any reason into_canonical doesn't just allocate an array builder and call this one in this PR?

BitPacking's into_canonical does this now, but we should make this change for ChunkedArray too. I prefer to do it in a separate PR so we can benchmark it as an independent change. In particular, I've seen canonical_into be slower sometimes because of the need to grow the BufferMut. I want to make sure we catch that kind of change in the PR that makes ChunkedArray use canonical_into.

joseph-isaacs (Member):

Does this help with the strange behaviour?

```rust
} else {
    self.nulls.set_bit(out_index, false)
}
if !matches!(validity, Mask::AllTrue(_)) {
```
joseph-isaacs (Member):

Sorry to be annoying, but can we move this into either two matches or an inner fn? @0ax1 noticed that this match plus more inner code can inhibit compiler optimizations.

joseph-isaacs (Member):

Also, do you think it's a good idea to implement this with two loops in the nullable case?

danking (Member, Author):

If code size is no concern, I think the best code is: switch on nullability; in the nullable case, have one loop over cache-sized chunks of indices, with two inner loops that update the nulls and the values.

I could benchmark these approaches. I mostly felt the effect wouldn't be that large, and I'm not sure of a way to pick the right size for the chunks. Patches should usually be a pretty small fraction of the array.
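
A sketch of the structure described above (window size and all names are illustrative): one outer loop over cache-sized windows of patch indices, with two tight inner loops per window.

```rust
const WINDOW: usize = 4096; // assumed cache-friendly window size

fn patch_nullable(
    values_out: &mut [u32],
    nulls_out: &mut [bool],
    indices: &[usize],
    values: &[u32],
    is_valid: &[bool],
) {
    for start in (0..indices.len()).step_by(WINDOW) {
        let end = (start + WINDOW).min(indices.len());
        // Inner loop 1: write the patched values.
        for i in start..end {
            values_out[indices[i]] = values[i];
        }
        // Inner loop 2: write the corresponding validity bits.
        for i in start..end {
            nulls_out[indices[i]] = is_valid[i];
        }
    }
}
```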

danking (Member, Author):

I benchmarked two loops vs one loop for nullable, and your hunch was right that one loop is best.

joseph-isaacs (Member):

This is on release, correct?

```rust
} else {
    self.nulls.set_bit(out_index, false)
}
if !matches!(validity, Mask::AllTrue(_)) {
```
joseph-isaacs (Member):

I think you can only do this with nullability, not validity?

danking (Member, Author):

Mmm. I was thinking that the validity of a position which will be patched is true, but I think you're right that the unpatched values could indeed be invalid.

danking (Member, Author) commented Feb 13, 2025

@joseph-isaacs

> Does this help with the strange behaviour?

It reduces the numbers overall but doesn't change the relationship on the bad examples:

```
canonicalize_bench              fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ canonical_into_non_nullable                │               │               │               │         │
│  ├─ (100000, 100, 0.0)        1.535 ms      │ 3.37 ms       │ 2.11 ms       │ 2.051 ms      │ 1000    │ 1000
│  ├─ (100000, 100, 0.1)        2.044 ms      │ 2.873 ms      │ 2.065 ms      │ 2.069 ms      │ 1000    │ 1000
│  ├─ (100000, 100, 0.01)       1.61 ms       │ 2.087 ms      │ 1.642 ms      │ 1.645 ms      │ 1000    │ 1000
│  ╰─ (100000, 1000, 0.0)       29.27 ms      │ 43.56 ms      │ 29.51 ms      │ 29.66 ms      │ 1000    │ 1000
├─ into_canonical_non_nullable                │               │               │               │         │
│  ├─ (100000, 100, 0.0)        1.363 ms      │ 1.506 ms      │ 1.466 ms      │ 1.445 ms      │ 1000    │ 1000
│  ├─ (100000, 100, 0.1)        1.869 ms      │ 2.554 ms      │ 2.002 ms      │ 2.005 ms      │ 1000    │ 1000
│  ├─ (100000, 100, 0.01)       1.441 ms      │ 2.058 ms      │ 1.55 ms       │ 1.54 ms       │ 1000    │ 1000
│  ╰─ (100000, 1000, 0.0)       22.66 ms      │ 38.88 ms      │ 23 ms         │ 23.23 ms      │ 1000    │ 1000
```

danking (Member, Author) commented Feb 13, 2025

Latest numbers:

```
RUST_BACKTRACE=1 cargo bench -p vortex-fastlanes --features="test-harness" --bench canonicalize_bench -- --sample-count 1000
    Finished `bench` profile [optimized + debuginfo] target(s) in 0.10s
     Running benches/canonicalize_bench.rs (target/release/deps/canonicalize_bench-1c4b4971f3efc898)
Timer precision: 41 ns
canonicalize_bench              fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ canonical_into_non_nullable                │               │               │               │         │
│  ├─ (100000, 1, 0.0)          8.416 µs      │ 22.37 µs      │ 8.54 µs       │ 8.584 µs      │ 1000    │ 1000
│  ├─ (100000, 1, 0.1)          13.33 µs      │ 24.29 µs      │ 13.49 µs      │ 13.8 µs       │ 1000    │ 1000
│  ├─ (100000, 1, 0.01)         9.249 µs      │ 10.54 µs      │ 9.374 µs      │ 9.406 µs      │ 1000    │ 1000
│  ├─ (100000, 10, 0.0)         157.3 µs      │ 416.2 µs      │ 157.9 µs      │ 166 µs        │ 1000    │ 1000
│  ├─ (100000, 10, 0.1)         207.1 µs      │ 415.9 µs      │ 207.7 µs      │ 218 µs        │ 1000    │ 1000
│  ├─ (100000, 10, 0.01)        166.1 µs      │ 280.3 µs      │ 166.5 µs      │ 175.5 µs      │ 1000    │ 1000
│  ├─ (100000, 100, 0.0)        1.456 ms      │ 3.496 ms      │ 1.786 ms      │ 1.805 ms      │ 1000    │ 1000
│  ├─ (100000, 100, 0.1)        2.039 ms      │ 2.843 ms      │ 2.072 ms      │ 2.076 ms      │ 1000    │ 1000
│  ├─ (100000, 100, 0.01)       1.585 ms      │ 2.058 ms      │ 1.654 ms      │ 1.658 ms      │ 1000    │ 1000
│  ├─ (100000, 1000, 0.0)       29.29 ms      │ 52.83 ms      │ 30.01 ms      │ 30.83 ms      │ 1000    │ 1000
│  ├─ (1000000, 100, 0.0)       28.83 ms      │ 35.69 ms      │ 29.28 ms      │ 30.05 ms      │ 10      │ 10
│  ├─ (1000000, 1000, 0.0)      294.8 ms      │ 342 ms        │ 302.6 ms      │ 306.3 ms      │ 10      │ 10
│  ╰─ (10000000, 100, 0.0)      295.7 ms      │ 321.3 ms      │ 297.9 ms      │ 300.3 ms      │ 10      │ 10
├─ into_canonical_non_nullable                │               │               │               │         │
│  ├─ (100000, 1, 0.0)          14.83 µs      │ 37.58 µs      │ 14.99 µs      │ 15.26 µs      │ 1000    │ 1000
│  ├─ (100000, 1, 0.1)          20.08 µs      │ 37.99 µs      │ 20.24 µs      │ 20.33 µs      │ 1000    │ 1000
│  ├─ (100000, 1, 0.01)         15.79 µs      │ 25.08 µs      │ 16.29 µs      │ 16.29 µs      │ 1000    │ 1000
│  ├─ (100000, 10, 0.0)         220.2 µs      │ 388.8 µs      │ 235.3 µs      │ 242.9 µs      │ 1000    │ 1000
│  ├─ (100000, 10, 0.1)         288.4 µs      │ 420.4 µs      │ 289 µs        │ 297.8 µs      │ 1000    │ 1000
│  ├─ (100000, 10, 0.01)        228.7 µs      │ 349 µs        │ 244.3 µs      │ 252.1 µs      │ 1000    │ 1000
│  ├─ (100000, 100, 0.0)        1.461 ms      │ 1.558 ms      │ 1.468 ms      │ 1.47 ms       │ 1000    │ 1000
│  ├─ (100000, 100, 0.1)        1.889 ms      │ 2.731 ms      │ 2.006 ms      │ 2.025 ms      │ 1000    │ 1000
│  ├─ (100000, 100, 0.01)       1.466 ms      │ 1.962 ms      │ 1.544 ms      │ 1.556 ms      │ 1000    │ 1000
│  ├─ (100000, 1000, 0.0)       24.04 ms      │ 40.29 ms      │ 25.29 ms      │ 26.08 ms      │ 1000    │ 1000
│  ├─ (1000000, 100, 0.0)       24.22 ms      │ 36.12 ms      │ 25.43 ms      │ 26.58 ms      │ 10      │ 10
│  ├─ (1000000, 1000, 0.0)      336.1 ms      │ 372 ms        │ 342.1 ms      │ 344.2 ms      │ 10      │ 10
│  ╰─ (10000000, 100, 0.0)      438.7 ms      │ 479.2 ms      │ 453.9 ms      │ 457.5 ms      │ 10      │ 10
├─ canonical_into_nullable                    │               │               │               │         │
│  ├─ (100000, 1, 0.0)          11.16 µs      │ 20.62 µs      │ 11.29 µs      │ 11.42 µs      │ 1000    │ 1000
│  ├─ (100000, 1, 0.1)          31.24 µs      │ 110.5 µs      │ 33.95 µs      │ 34.52 µs      │ 1000    │ 1000
│  ├─ (100000, 10, 0.0)         176.4 µs      │ 413.7 µs      │ 189.9 µs      │ 202 µs        │ 1000    │ 1000
│  ├─ (100000, 10, 0.1)         402.6 µs      │ 789.5 µs      │ 424 µs        │ 434.2 µs      │ 1000    │ 1000
│  ├─ (100000, 100, 0.0)        1.558 ms      │ 2.564 ms      │ 1.73 ms       │ 1.755 ms      │ 1000    │ 1000
│  ╰─ (100000, 100, 0.1)        3.664 ms      │ 4.794 ms      │ 3.765 ms      │ 3.822 ms      │ 1000    │ 1000
╰─ into_canonical_nullable                    │               │               │               │         │
   ├─ (100000, 1, 0.0)          19.66 µs      │ 94.54 µs      │ 28.12 µs      │ 24.99 µs      │ 1000    │ 1000
   ├─ (100000, 1, 0.1)          42.2 µs       │ 118 µs        │ 50.66 µs      │ 48.04 µs      │ 1000    │ 1000
   ├─ (100000, 10, 0.0)         280.1 µs      │ 417.2 µs      │ 281.2 µs      │ 290.6 µs      │ 1000    │ 1000
   ├─ (100000, 10, 0.1)         498.8 µs      │ 1.374 ms      │ 521.1 µs      │ 529.2 µs      │ 1000    │ 1000
   ├─ (100000, 100, 0.0)        1.897 ms      │ 2.255 ms      │ 1.912 ms      │ 1.917 ms      │ 1000    │ 1000
   ╰─ (100000, 100, 0.1)        4.284 ms      │ 4.853 ms      │ 4.31 ms       │ 4.325 ms      │ 1000    │ 1000
```

danking enabled auto-merge (squash) February 13, 2025 15:21
danking (Member, Author) commented Feb 13, 2025

Latest summary

Benchmark parameters are (rows_per_chunk, n_chunks, fraction_patches).

non-nullable

microseconds:

```
                     c_into into_c   ratio
(100000, 1, 0.0)     8.54   14.99    0.57
(100000, 1, 0.1)     13.49  20.24    0.67
(100000, 1, 0.01)    9.374  16.29    0.58
(100000, 10, 0.0)    157.9  235.30   0.67
(100000, 10, 0.1)    207.7  289.00   0.72
(100000, 10, 0.01)   166.5  244.30   0.68
```

milliseconds:

```
(100000, 100, 0.0)   1.786  1.47     1.22
(100000, 100, 0.1)   2.072  2.01     1.03
(100000, 100, 0.01)  1.654  1.54     1.07
(100000, 1000, 0.0)  30.01  25.29    1.19
(1000000, 100, 0.0)  29.28  25.43    1.15
(1000000, 1000, 0.0) 302.6  342.10   0.88
(10000000, 100, 0.0) 297.9  453.90   0.66
```

nullable

All samples have 5% of positions randomly invalid.

microseconds:

```
                   c_into into_c ratio
(100000, 1, 0.0)   11.29  28.12  0.40
(100000, 1, 0.1)   33.95  50.66  0.67
(100000, 10, 0.0)  189.9  281.2  0.68
(100000, 10, 0.1)  424    521.1  0.81
(100000, 100, 0.0) 1730   1912   0.90
(100000, 100, 0.1) 3765   4310   0.87
```

danking (Member, Author) commented Feb 13, 2025

I think this is ready.

@gatesn, I've got an issue to assess ChunkedArray's into_canonical: #2356.

joseph-isaacs (Member):

The range [1 << 24, 1 << 25] is in number of elements, right? How big is it in bytes?

Comment on lines 274 to 275:

```rust
|x| unsafe { std::mem::transmute(x) },
|x| unsafe { std::mem::transmute(x) },
```
joseph-isaacs (Member):

Some safety and explanation comments would be good here?

danking (Member, Author):

Done.


```rust
}

let packed = array.packed_slice::<UnsignedT>();
let builder_current_length = builder.len();
builder.values.reserve((length + 1023) / 1024 * 1024);
```
joseph-isaacs (Member):

So do you think this is the part that reallocates and affects perf?

joseph-isaacs (Member):

We know the length of the array, so could you instead allocate a 1024 × |T| scratch buffer, decode the last block into that, and then copy values from it into the builder (if there is not already enough space)?

danking (Member, Author):

In the benchmarks, we ensure the builder is big enough to hold all the elements plus any scratch space needed by the decoder (in this case, 1024 empty slots, though only 1023 are needed). This operation shouldn't allocate in the benchmarks. I've verified this in the profile (there's no `BufferMut::reserve` or `BytesMut::extend_from_slice`, both of which I've seen when it allocates).

Yeah, what you're describing is a reasonable alternative to adding extra space to the builders. I'd be happy to do that, though at this point it feels like we should do it in a follow-up PR since this one is fairly large already.
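
A sketch of that alternative (hypothetical signature, not the Vortex API): decode the final partial chunk into a fixed scratch buffer, then copy only the live elements, so the builder needs no extra headroom.

```rust
// `decode_chunk` stands in for the bit-unpacking kernel, which always
// produces a full 1024-element chunk; `remaining` (<= 1024) is how many
// of those elements are real.
fn unpack_tail(
    builder: &mut Vec<u32>,
    remaining: usize,
    decode_chunk: impl Fn(&mut [u32; 1024]),
) {
    let mut scratch = [0u32; 1024];
    decode_chunk(&mut scratch);
    // Copy only the live elements; the rest of `scratch` is discarded.
    builder.extend_from_slice(&scratch[..remaining]);
}
```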

```rust
.bench_values(|chunked| {
    let mut primitive_builder = PrimitiveBuilder::<i32>::with_capacity(
        chunked.dtype().nullability(),
        chunk_len * chunk_count + 1024,
```
joseph-isaacs (Member):

We cannot assume this constant, since it's not added by the builder; or at least we need another bench without it, which would be the primary one.

danking (Member, Author):

I ran the benchmark without the extra 1024 elements, and canonical_into ended up slower in all cases. When the builder can't hold the full result, it has to resize and copy, just like into_canonical.

I can also address your comment about using a temporary buffer for the final chunk and include that in a benchmark.

Looking ahead, I see two possible approaches:

  1. Never use truncate: canonical_into implementations would only push as many elements as they have (len), without truncating.
  2. Provide extra capacity: canonical_into implementations would reserve “garbage” space (perhaps via a minimum_builder_capacity()) so child structures can always write directly to their final location.

I’m inclined toward the second option because data goes exactly where it belongs, although it does waste space for smaller arrays. For example, a 500-element array might need space for 1024 elements. Ultimately, this is a key design decision for how builders should work.

Maybe worth a quick sync or a wider discussion? What do you think?

joseph-isaacs (Member):

Looking good, it's almost there.

danking (Member, Author) commented Feb 14, 2025

@joseph-isaacs

> The range [1 << 24, 1 << 25] is in number of elements, right? How big is it in bytes?

Yeah, that's right. The elements are i32s. Hundreds of megabytes. I'd believe there are cache effects, but I don't see how that could cause a small window to be better than outside the window.

joseph-isaacs (Member) left a comment

LG


```rust
fn main() {
    divan::main();
}

fn make_array<T: NativePType>(len: usize) -> Array {
#[divan::bench(
```
joseph-isaacs (Member):

Mind extracting these into a single shared const? See BENCH_ARGS.

danking merged commit dbf8ba5 into develop Feb 14, 2025
22 checks passed
danking deleted the dk/bitpacking-builder branch February 14, 2025 19:01