feat: teach BitPackedEncoding canonical_into #2324
Conversation
Runtime for a 100k element array is reduced to ~60% of `into_canonical`. This runtime reduction only holds when the builder avoids all resizes (allocation + `extend_from_slice`)! The builder must be large enough to contain both the entire decoded array and also any scratch space needed by the decoder. For BitPacked, that means it needs 1023 empty element slots in case it has to decode a 1024-element chunk to retrieve just one element.

```
canonicalize_bench      fastest  │ slowest  │ median   │ mean     │ samples │ iters
├─ test                 43.24 µs │ 221.6 µs │ 47.62 µs │ 47.49 µs │ 100000  │ 100000
├─ canonical_into                │          │          │          │         │
│  ╰─ u32                        │          │          │          │         │
│     ╰─ (100000, 3)    24.12 µs │ 104.7 µs │ 24.29 µs │ 24.39 µs │ 100000  │ 100000
╰─ into_canonical                │          │          │          │         │
   ╰─ u32                        │          │          │          │         │
      ╰─ (100000, 3)    39.54 µs │ 249.6 µs │ 39.91 µs │ 40.95 µs │ 100000  │ 100000
```
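For reference, a minimal sketch of what a call site has to do to hit the fast path. It uses the `PrimitiveBuilder::with_capacity` constructor that appears in this PR's benchmarks; the `canonical_into` call shape is an assumption, not the exact signature:

```rust
// Sketch only: size the builder up front so BitPacked decoding never resizes.
// `+ 1024` leaves the scratch slots described above (1023 are strictly needed)
// so a whole 1024-element chunk can always be unpacked in place.
let mut builder = PrimitiveBuilder::<u32>::with_capacity(
    array.dtype().nullability(),
    array.len() + 1024,
);
array.canonical_into(&mut builder)?; // assumed call shape
```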
Is there a reason `into_canonical` doesn't also do this?
In this PR, `into_canonical` uses the same code path, so for a single BitPackedArray the runtime should be equal. I could make that change here if we wanted, but so far it seems like the `canonical_into` machinery is implemented & benchmarked but unused.
Is there any reason `into_canonical` doesn't just allocate an array builder and call this one in this PR?
This is great!
```rust
match_each_unsigned_integer_ptype!(indices.ptype(), |$P| {
    for (compressed_index, decompressed_index) in indices.into_primitive()?.as_slice::<$P>().into_iter().enumerate() {
        let decompressed_index = *decompressed_index as usize;
        let out_index = starting_at + decompressed_index - offset;
        if validity.value(compressed_index) {
            self.values[out_index] = values[compressed_index]
        } else {
            self.nulls.set_bit(out_index, false)
        }
    }
});

Ok(())
```
Can this live in its own fn?
Done.
```rust
        Ok(unpacked)
    }
}

builder.nulls.append_validity(array.validity(), length)?;
```
Do you think this should live in the builder as a method?
Right now, all the builder methods act on both the values and the nulls. I think I prefer the interface to be either: update values and nulls together, or use the nulls and the values directly and separately. I could add a `parts(&mut self) -> (&mut BufferMut<T>, &mut LazyNullBufferBuilder)` method.
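Roughly what that accessor might look like (a sketch only; it assumes the builder's fields are literally named `values` and `nulls`, as in the snippets quoted elsewhere in this thread):

```rust
impl<T: NativePType> PrimitiveBuilder<T> {
    /// Hand out the raw parts so callers can update values and nulls independently.
    pub fn parts(&mut self) -> (&mut BufferMut<T>, &mut LazyNullBufferBuilder) {
        (&mut self.values, &mut self.nulls)
    }
}
```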
```rust
(first_full_chunk..num_chunks).for_each(|i| {
    let chunk: &[T] = &packed[i * elems_per_chunk..][0..elems_per_chunk];
```
OOI, why did you move from a `for_each` to a `for` loop?
I needed a `?` and I couldn't quickly sort out how to make `for_each` take a Result-returning function.
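For what it's worth, `Iterator::try_for_each` does take a fallible closure, so something like the sketch below would also work; `unpack_chunk_into` is a hypothetical stand-in for the loop body, and the plain `for` loop reads just as well.

```rust
(first_full_chunk..num_chunks).try_for_each(|i| {
    let chunk = &packed[i * elems_per_chunk..][0..elems_per_chunk];
    // The closure's Result is propagated by try_for_each, so the first Err stops the loop.
    unpack_chunk_into(chunk, builder) // hypothetical helper returning VortexResult<()>
})?;
```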
```rust
for i in first_full_chunk..num_chunks {
    let chunk = &packed[i * elems_per_chunk..][0..elems_per_chunk];
```
I wonder, if there are lots of chunks (more than fit in L2 cache), whether we want to apply patches after each loop iteration?
Reasonable, but then I need to slice/chunk/partition the patches, which is not free.
True
```
@@ -80,11 +81,51 @@ impl LazyNullBufferBuilder {
            .append_buffer(&bool_buffer);
    }

    pub fn append_validity(&mut self, validity: Validity, length: usize) -> VortexResult<()> {
```
I had been doing this with a validity mask. I wonder if we should choose one of the two methods and use it consistently:
```rust
match array.validity_mask()?.boolean_buffer() {
    AllOr::All => {
        self.nulls.append_n_non_nulls(array.len());
    }
    AllOr::None => self.nulls.append_n_nulls(array.len()),
    AllOr::Some(validity) => self.nulls.append_buffer(validity.clone()),
}
```
I think we should use `validity_mask` here; it should perform better than creating a validity and then a boolean buffer below.
Sure, I went with the following and changed `append_buffer` to take a reference to a `BooleanBuffer` rather than an owned one.
```rust
pub fn append_validity_mask(&mut self, validity_mask: Mask) {
    match validity_mask {
        Mask::AllTrue(len) => self.append_n_non_nulls(len),
        Mask::AllFalse(len) => self.append_n_nulls(len),
        Mask::Values(is_valid) => self.append_buffer(is_valid.boolean_buffer()),
    }
}
```
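So the call site from earlier becomes, roughly:

```rust
builder.nulls.append_validity_mask(array.validity_mask()?);
```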
```rust
let mut primitive_builder = PrimitiveBuilder::<T>::with_capacity(
    arr.dtype().nullability(),
    arr_len * chunk_count + 1024,
);
```
This + 1024 values is really important; we need to make sure that it is actually applied in the ctor of the builders when needed, or maybe always?
Yeah, I'm not sure how best to approach this. Perhaps arrays should have a size_hint? Perhaps every builder should assume it needs a magic 1024 extra elements of space? 1024 is kind of a magic number in Vortex.
The benefit of this is when we have a ChunkedArray of BitPacked arrays, correct? So we cannot use that for into_canonical.
```rust
assert!(starting_at + array_len == self.len());
let indices = indices.into_primitive()?;
let values = values.into_primitive()?;
let validity = values.validity_mask()?;
```
If the validity is all true, then remove the check from the loop below.
Done.
```rust
if validity.value(compressed_index) {
    self.values[out_index] = values[compressed_index]
} else {
    self.nulls.set_bit(out_index, false)
}
```
Suggested change:

```diff
-if validity.value(compressed_index) {
-    self.values[out_index] = values[compressed_index]
-} else {
-    self.nulls.set_bit(out_index, false)
-}
+self.values[out_index] = values[compressed_index];
+self.nulls.set_bit(out_index, validity.value(compressed_index));
```

I think you can remove the branch if you ensure that `self.nulls` has the correct backing array below.
I did something different but indeed there are no branches.
I did this.
Can you also add benchmarks for nullable arrays and for bitpacked arrays with patches?
This creates patches:

```rust
let mut rng = StdRng::seed_from_u64(0);
let values = (0..len)
    .map(|_| {
        if rng.gen_bool(0.99) {
            T::from(rng.gen_range(0..6)).vortex_expect("valid value")
                + T::from(7).vortex_expect("sdf")
        } else {
            T::from(rng.gen_range(0..100)).vortex_expect("valid value")
                * T::from(100).expect("valid value")
        }
    })
    .collect::<BufferMut<T>>()
    .into_array()
    .into_primitive()
    .vortex_unwrap();
let bit = bitpack_encode(values, 12).vortex_unwrap();
```
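(Presumably the point of the ~1% branch: it produces values up to 9900, which don't fit in the 12-bit width passed to `bitpack_encode`, so those values end up as patches.)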
Something suspicious happens when the builder's capacity is ~10M. It's a bit fuzzy and changes from compilation to compilation, but I think something bad happens when the capacity is in the range [1 << 24, 1 << 25]. Profiles and tracing point at each call to […].

(The non-nullable and nullable profile details are omitted here.)
BitPacking's `into_canonical` does this now, but we should make this change for ChunkedArray too. I prefer to do it in a separate PR so we can benchmark it as an independent change. In particular, I've seen `canonical_into` be slower sometimes because of the need to grow the `BufferMut`. I want to make sure we catch that kind of change in the PR that makes ChunkedArray use `canonical_into`.
Does this help with the strange behaviour?
```rust
} else {
    self.nulls.set_bit(out_index, false)
}

if !matches!(validity, Mask::AllTrue(_)) {
```
Sorry to be annoying, but can we move this into either two matches or an inner fn? @0ax1 noticed that this match plus more inner code can affect perf optimizations from the compiler.
Also, do you think it's a good idea to implement this with two loops in the nullable case?
If code size is no concern, I think the best code is: switch on nullability and, in the nullable case, have one loop over cache-sized chunks of indices with two inner loops that update the nulls and the values.
I could benchmark these approaches. I mostly felt the effect wouldn't be that large and I'm not sure of a way to pick the right size of the chunks. Patches should usually be a pretty small fraction of the array.
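A sketch of the shape being described (illustrative only, not the PR's code; `CHUNK` is a guessed block size and the index arithmetic mirrors the snippet quoted earlier):

```rust
const CHUNK: usize = 1024; // guess at a cache-friendly block of patch indices

match validity {
    Mask::AllTrue(_) => {
        // Non-nullable: one tight loop, never touch the null buffer.
        for (i, &idx) in indices.iter().enumerate() {
            self.values[starting_at + idx as usize - offset] = values[i];
        }
    }
    _ => {
        // Nullable: walk cache-sized blocks of patches; within each block, one
        // loop writes values and a second loop writes validity bits.
        for (b, block) in indices.chunks(CHUNK).enumerate() {
            let base = b * CHUNK;
            for (j, &idx) in block.iter().enumerate() {
                self.values[starting_at + idx as usize - offset] = values[base + j];
            }
            for (j, &idx) in block.iter().enumerate() {
                let out = starting_at + idx as usize - offset;
                self.nulls.set_bit(out, validity.value(base + j));
            }
        }
    }
}
```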
I benchmarked two loops vs one loop for nullable, and your hunch was right that one loop is best.
This is on release, correct?
```rust
} else {
    self.nulls.set_bit(out_index, false)
}

if !matches!(validity, Mask::AllTrue(_)) {
```
I think you can only do this with nullability, not validity?
Mmm. I was thinking that the validity of a position which will be patched is true, but I think you're right that the unpatched values could indeed be invalid.

It reduces the numbers overall but doesn't change the relationship on the bad examples (numbers omitted).
Latest numbers, summarized by (rows_per_chunk, n_chunks, fraction_patches). The non-nullable tables (microseconds and milliseconds) and the nullable table (microseconds; all samples have 5% of positions randomly invalid) are omitted here.
The range [1 << 24, 1 << 25] is the number of elements, right? How big is that in bytes?
```rust
|x| unsafe { std::mem::transmute(x) },
|x| unsafe { std::mem::transmute(x) },
```
Some safety and explanatory comments would be good here?
Done.
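Roughly the shape of comment being asked for (an illustrative sketch, not the PR's actual text; it assumes the closures reinterpret a value as its same-width unsigned counterpart, e.g. `T` ↔ `UnsignedT`):

```rust
// SAFETY: `T` and `UnsignedT` have the same size and bit layout (a primitive
// integer and its unsigned counterpart), so reinterpreting the bits with
// `transmute` cannot produce an invalid value.
|x| unsafe { std::mem::transmute(x) },
```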
```rust
for i in first_full_chunk..num_chunks {
    let chunk = &packed[i * elems_per_chunk..][0..elems_per_chunk];
```
True
```rust
}

let packed = array.packed_slice::<UnsignedT>();
let builder_current_length = builder.len();
builder.values.reserve((length + 1023) / 1024 * 1024);
```
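(The `(length + 1023) / 1024 * 1024` expression rounds `length` up to the next multiple of 1024, presumably so whole chunks can be unpacked in place; for example, `length = 100_000` reserves 100_352 slots, while an exact multiple like 102_400 reserves exactly 102_400.)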
So do you think this is the part that reallocates and affects perf?
We know the length of the array, so could you instead allocate a 1024 * |T| scratch buffer, decode the last block into that, and then copy values from it into the builder (if there is not already enough space)?
In the benchmarks, we ensure the resulting builder is big enough to hold all the elements plus any scratch space needed by the decoder (in this case, 1024 empty slots, though only 1023 are needed). This operation shouldn't allocate in the benchmarks. I've verified this in the profile: there's no `BufferMut::reserve` or `BytesMut::extend_from_slice`, both of which I've seen when it allocates.

Yeah, what you're describing is a reasonable alternative to adding extra space to the builders. I'd be happy to do that, though at this point it feels like we should do it in a follow-up PR since this one is fairly large already.
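A sketch of that follow-up idea, under the assumption that the trailing partial chunk can be unpacked into a small fixed buffer and copied over (`unpack_chunk` is a hypothetical stand-in for the existing chunk decoder, and whether `BufferMut` exposes `extend_from_slice` like this is also an assumption):

```rust
// Sketch: decode the trailing partial chunk into fixed scratch space instead of
// requiring the builder to carry ~1024 spare slots.
let remaining = length % elems_per_chunk; // elements in the final, partial chunk
if remaining != 0 {
    // Assumes UnsignedT: Copy + Default, as for the primitive integer types used here.
    let mut scratch = [UnsignedT::default(); 1024];
    unpack_chunk(last_packed_chunk, &mut scratch); // hypothetical chunk decoder
    builder.values.extend_from_slice(&scratch[..remaining]);
}
```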
```rust
.bench_values(|chunked| {
    let mut primitive_builder = PrimitiveBuilder::<i32>::with_capacity(
        chunked.dtype().nullability(),
        chunk_len * chunk_count + 1024,
```
We cannot assume this constant since it's not added by the builder; or at least we need another bench without it, which should be the primary one.
I ran the benchmark without the extra 1024 elements, and canonical_into ended up slower in all cases. When the builder can’t hold the full result, it has to resize and copy—just like into_canonical.
I can also address your comment about using a temporary buffer for the final chunk and include that in a benchmark.
Looking ahead, I see two possible approaches:
- Never use truncate: canonical_into implementations would only push as many elements as they have (len), without truncating.
- Provide extra capacity: canonical_into implementations would reserve “garbage” space (perhaps via a minimum_builder_capacity()) so child structures can always write directly to their final location.
I’m inclined toward the second option because data goes exactly where it belongs, although it does waste space for smaller arrays. For example, a 500-element array might need space for 1024 elements. Ultimately, this is a key design decision for how builders should work.
Maybe worth a quick sync or a wider discussion? What do you think?
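If the second option wins, the hook could be as small as this sketch (hypothetical trait and method names, following the minimum_builder_capacity() idea above):

```rust
trait CanonicalInto {
    /// Extra "garbage" capacity the builder must reserve beyond the decoded
    /// length so this encoding can always write to final positions.
    fn minimum_builder_capacity(&self) -> usize {
        0
    }
}

// BitPacked may unpack a full 1024-element chunk to recover a single element.
impl CanonicalInto for BitPackedArray {
    fn minimum_builder_capacity(&self) -> usize {
        1023
    }
}
```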
Looking good, it's almost there.
Yeah, that's right. The elements are i32s. Hundreds of megabytes. I'd believe there are cache effects, but I don't see how that could cause a small window to be better than outside the window.
LG
```rust
fn main() {
    divan::main();
}

fn make_array<T: NativePType>(len: usize) -> Array {
```

```rust
#[divan::bench(
```
Mind extracting these into a single shared const? See `BENCH_ARGS`.