Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: RoaringBitmap::from_lsb0_bytes #288

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
14 changes: 14 additions & 0 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,20 @@ jobs:
command: test
args: --features serde

- name: miri setup
if: matrix.rust == 'nightly'
uses: actions-rs/cargo@v1
with:
command: miri
args: setup

- name: Test bit endian
if: matrix.rust == 'nightly'
uses: actions-rs/cargo@v1
with:
command: miri
args: test --target s390x-unknown-linux-gnu --package roaring --lib -- bitmap::serialization::test::test_from_lsb0_bytes

- name: Test no default features
uses: actions-rs/cargo@v1
with:
Expand Down
21 changes: 21 additions & 0 deletions benchmarks/benches/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,27 @@ fn creation(c: &mut Criterion) {

group.throughput(Throughput::Elements(dataset.bitmaps.iter().map(|rb| rb.len()).sum()));

group.bench_function(BenchmarkId::new("from_lsb0_bytes", &dataset.name), |b| {
let bitmap_bytes = dataset_numbers
.iter()
.map(|bitmap_numbers| {
let max_number = *bitmap_numbers.iter().max().unwrap() as usize;
let mut buf = vec![0u8; max_number / 8 + 1];
for n in bitmap_numbers {
let byte = (n / 8) as usize;
let bit = n % 8;
buf[byte] |= 1 << bit;
}
buf
})
.collect::<Vec<_>>();
b.iter(|| {
for bitmap_bytes in &bitmap_bytes {
black_box(RoaringBitmap::from_lsb0_bytes(0, bitmap_bytes));
}
})
});

group.bench_function(BenchmarkId::new("from_sorted_iter", &dataset.name), |b| {
b.iter(|| {
for bitmap_numbers in &dataset_numbers {
Expand Down
4 changes: 4 additions & 0 deletions roaring/src/bitmap/container.rs
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,10 @@ impl Container {
pub fn full(key: u16) -> Container {
Container { key, store: Store::full() }
}

pub fn from_lsb0_bytes(key: u16, bytes: &[u8], byte_offset: usize) -> Option<Self> {
Some(Container { key, store: Store::from_lsb0_bytes(bytes, byte_offset)? })
}
}

impl Container {
Expand Down
179 changes: 174 additions & 5 deletions roaring/src/bitmap/serialization.rs
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
use crate::bitmap::container::{Container, ARRAY_LIMIT};
use crate::bitmap::store::{ArrayStore, BitmapStore, Store, BITMAP_LENGTH};
use crate::RoaringBitmap;
use bytemuck::cast_slice_mut;
use byteorder::{LittleEndian, ReadBytesExt, WriteBytesExt};
use core::convert::Infallible;
use core::mem::size_of;
use core::ops::RangeInclusive;
use std::error::Error;
use std::io;

use crate::bitmap::container::{Container, ARRAY_LIMIT};
use crate::bitmap::store::{ArrayStore, BitmapStore, Store, BITMAP_LENGTH};
use crate::RoaringBitmap;

pub const SERIAL_COOKIE_NO_RUNCONTAINER: u32 = 12346;
pub const SERIAL_COOKIE: u16 = 12347;
pub const NO_OFFSET_THRESHOLD: usize = 4;
Expand Down Expand Up @@ -47,6 +47,113 @@ impl RoaringBitmap {
8 + container_sizes
}

/// Creates a `RoaringBitmap` from a byte slice, interpreting the bytes as a bitmap with a specified offset.
///
/// # Arguments
///
/// - `offset: u32` - The starting position in the bitmap where the byte slice will be applied, specified in bits.
/// This means that if `offset` is `n`, the first byte in the slice will correspond to the `n`th bit(0-indexed) in the bitmap.
/// Must be a multiple of 8.
/// - `bytes: &[u8]` - The byte slice containing the bitmap data. The bytes are interpreted in "Least-Significant-First" bit order.
///
/// # Interpretation of `bytes`
///
/// The `bytes` slice is interpreted in "Least-Significant-First" bit order. Each byte is read from least significant bit (LSB) to most significant bit (MSB).
/// For example, the byte `0b00000101` represents the bits `1, 0, 1, 0, 0, 0, 0, 0` in that order (see Examples section).
///
///
/// # Panics
///
/// This function will panic if `offset` is not a multiple of 8, or if `bytes.len() + offset` is greater than 2^32.
///
///
/// # Examples
///
/// ```rust
/// use roaring::RoaringBitmap;
///
/// let bytes = [0b00000101, 0b00000010, 0b00000000, 0b10000000];
/// // ^^^^^^^^ ^^^^^^^^ ^^^^^^^^ ^^^^^^^^
/// // 76543210 98
/// let rb = RoaringBitmap::from_lsb0_bytes(0, &bytes);
/// assert!(rb.contains(0));
/// assert!(!rb.contains(1));
/// assert!(rb.contains(2));
/// assert!(rb.contains(9));
/// assert!(rb.contains(31));
///
/// let rb = RoaringBitmap::from_lsb0_bytes(8, &bytes);
/// assert!(rb.contains(8));
/// assert!(!rb.contains(9));
/// assert!(rb.contains(10));
/// assert!(rb.contains(17));
/// assert!(rb.contains(39));
/// ```
pub fn from_lsb0_bytes(offset: u32, mut bytes: &[u8]) -> RoaringBitmap {
assert_eq!(offset % 8, 0, "offset must be a multiple of 8");
Comment on lines +92 to +93
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if we can't just switch this offset in bits to an offset in bytes as the precision can only be expressed 8 bits at a time. This way we avoid this assert and reduce the possible errors.

If I understand correctly the offset is like a helper to increment the bitmap numbers by 8 at a time? Why is it necessary that it must be 8 at a time? Do we want to keep this? Why users can't just shift numbers themselves?

And if we want to keep it can we make it more clear (like the description I did above)? Your example helped me understand the high intent of this parameter.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review and sorry for the late responce.

The reason why the offset must be 8 multiples is if the function accepts not 8 multiples offset, the copying will not so easy. If the container is Array, the implementation would be almost the same. In case Bitmap store, we need bit shifting for every bitmap byte copy.

We can switch the implementation based on the offset. In terms of the peformance, the users who use 8 multiples offset might not be impacted by this API change.
And the peformance of users with not 8 multiples offset is unknown.

I could allow any offset for this method with a little additional implementation and the document on peformances.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @lemolatoon, for the explanation. It would be great if we could accept for than just 8 multipliers, indeed.


if bytes.is_empty() {
return RoaringBitmap::new();
}

// Using inclusive range avoids overflow: the max exclusive value is 2^32 (u32::MAX + 1).
let end_bit_inc = u32::try_from(bytes.len())
.ok()
.and_then(|len_bytes| len_bytes.checked_mul(8))
// `bytes` is non-empty, so len_bits is > 0
.and_then(|len_bits| offset.checked_add(len_bits - 1))
.expect("offset + bytes.len() must be <= 2^32");

// offsets are in bytes
let (mut start_container, start_offset) =
(offset as usize >> 16, (offset as usize % 0x1_0000) / 8);
let (end_container_inc, end_offset) =
(end_bit_inc as usize >> 16, (end_bit_inc as usize % 0x1_0000 + 1) / 8);

let n_containers_needed = end_container_inc + 1 - start_container;
let mut containers = Vec::with_capacity(n_containers_needed);

// Handle a partial first container
if start_offset != 0 {
let end_byte = if end_container_inc == start_container {
end_offset
} else {
BITMAP_LENGTH * size_of::<u64>()
};

let (src, rest) = bytes.split_at(end_byte - start_offset);
bytes = rest;

if let Some(container) =
Container::from_lsb0_bytes(start_container as u16, src, start_offset)
{
containers.push(container);
}

start_container += 1;
}

// Handle all full containers
for full_container_key in start_container..end_container_inc {
let (src, rest) = bytes.split_at(BITMAP_LENGTH * size_of::<u64>());
bytes = rest;

if let Some(container) = Container::from_lsb0_bytes(full_container_key as u16, src, 0) {
containers.push(container);
}
}

// Handle a last container
if !bytes.is_empty() {
if let Some(container) = Container::from_lsb0_bytes(end_container_inc as u16, bytes, 0)
{
containers.push(container);
}
}

RoaringBitmap { containers }
}

/// Serialize this bitmap into [the standard Roaring on-disk format][format].
/// This is compatible with the official C/C++, Java and Go implementations.
///
Expand Down Expand Up @@ -256,7 +363,7 @@ impl RoaringBitmap {

#[cfg(test)]
mod test {
use crate::RoaringBitmap;
use crate::{bitmap::store::BITMAP_LENGTH, RoaringBitmap};
use proptest::prelude::*;

proptest! {
Expand All @@ -270,6 +377,68 @@ mod test {
}
}

#[test]
fn test_from_lsb0_bytes() {
const CONTAINER_OFFSET: u32 = u64::BITS * BITMAP_LENGTH as u32;
const CONTAINER_OFFSET_IN_BYTES: u32 = CONTAINER_OFFSET / 8;
let mut bytes = vec![0xff; CONTAINER_OFFSET_IN_BYTES as usize];
bytes.extend(&[0x00; CONTAINER_OFFSET_IN_BYTES as usize]);
bytes.extend(&[0b00000001, 0b00000010, 0b00000011, 0b00000100]);

let offset = 32;
let rb = RoaringBitmap::from_lsb0_bytes(offset, &bytes);
for i in 0..offset {
assert!(!rb.contains(i), "{i} should not be in the bitmap");
}
for i in offset..offset + CONTAINER_OFFSET {
assert!(rb.contains(i), "{i} should be in the bitmap");
}
for i in offset + CONTAINER_OFFSET..offset + CONTAINER_OFFSET * 2 {
assert!(!rb.contains(i), "{i} should not be in the bitmap");
}
for bit in [0, 9, 16, 17, 26] {
let i = bit + offset + CONTAINER_OFFSET * 2;
assert!(rb.contains(i), "{i} should be in the bitmap");
}

assert_eq!(rb.len(), CONTAINER_OFFSET as u64 + 5);

// Ensure the empty container is not created
let mut bytes = vec![0x00u8; CONTAINER_OFFSET_IN_BYTES as usize];
bytes.extend(&[0xff]);
let rb = RoaringBitmap::from_lsb0_bytes(0, &bytes);
assert_eq!(rb.min(), Some(CONTAINER_OFFSET));

let rb = RoaringBitmap::from_lsb0_bytes(8, &bytes);
assert_eq!(rb.min(), Some(CONTAINER_OFFSET + 8));

// Ensure we can set the last byte in an array container
let bytes = [0x80];
let rb = RoaringBitmap::from_lsb0_bytes(0xFFFFFFF8, &bytes);
assert_eq!(rb.len(), 1);
assert!(rb.contains(u32::MAX));

// Ensure we can set the last byte in a bitmap container
let bytes = vec![0xFF; 0x1_0000 / 8];
let rb = RoaringBitmap::from_lsb0_bytes(0xFFFF0000, &bytes);
assert_eq!(rb.len(), 0x1_0000);
assert!(rb.contains(u32::MAX));
}

#[test]
#[should_panic(expected = "multiple of 8")]
fn test_from_lsb0_bytes_invalid_offset() {
let bytes = [0x01];
RoaringBitmap::from_lsb0_bytes(1, &bytes);
}

#[test]
#[should_panic(expected = "<= 2^32")]
fn test_from_lsb0_bytes_overflow() {
let bytes = [0x01, 0x01];
RoaringBitmap::from_lsb0_bytes(u32::MAX - 7, &bytes);
}

#[test]
fn test_deserialize_overflow_s_plus_len() {
let data = vec![59, 48, 0, 0, 255, 130, 254, 59, 48, 2, 0, 41, 255, 255, 166, 197, 4, 0, 2];
Expand Down
28 changes: 28 additions & 0 deletions roaring/src/bitmap/store/array_store/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ use crate::bitmap::store::array_store::visitor::{CardinalityCounter, VecWriter};
use core::cmp::Ordering;
use core::cmp::Ordering::*;
use core::fmt::{Display, Formatter};
use core::mem::size_of;
use core::ops::{BitAnd, BitAndAssign, BitOr, BitXor, RangeInclusive, Sub, SubAssign};

#[cfg(not(feature = "std"))]
Expand Down Expand Up @@ -47,6 +48,33 @@ impl ArrayStore {
}
}

pub fn from_lsb0_bytes(bytes: &[u8], byte_offset: usize, bits_set: u64) -> Self {
type Word = u64;

let mut vec = Vec::with_capacity(bits_set as usize);

let chunks = bytes.chunks_exact(size_of::<Word>());
let remainder = chunks.remainder();
for (index, chunk) in chunks.enumerate() {
let bit_index = (byte_offset + index * size_of::<Word>()) * 8;
let mut word = Word::from_le_bytes(chunk.try_into().unwrap());

while word != 0 {
vec.push((word.trailing_zeros() + bit_index as u32) as u16);
word &= word - 1;
}
}
for (index, mut byte) in remainder.iter().copied().enumerate() {
let bit_index = (byte_offset + (bytes.len() - remainder.len()) + index) * 8;
while byte != 0 {
vec.push((byte.trailing_zeros() + bit_index as u32) as u16);
byte &= byte - 1;
}
}

Self::from_vec_unchecked(vec)
}

pub fn insert(&mut self, index: u16) -> bool {
self.vec.binary_search(&index).map_err(|loc| self.vec.insert(loc, index)).is_err()
}
Expand Down
42 changes: 42 additions & 0 deletions roaring/src/bitmap/store/bitmap_store.rs
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
use core::borrow::Borrow;
use core::fmt::{Display, Formatter};
use core::mem::size_of;
use core::ops::{BitAndAssign, BitOrAssign, BitXorAssign, RangeInclusive, SubAssign};

use super::ArrayStore;
Expand Down Expand Up @@ -35,6 +36,47 @@ impl BitmapStore {
}
}

pub fn from_lsb0_bytes_unchecked(bytes: &[u8], byte_offset: usize, bits_set: u64) -> Self {
const BITMAP_BYTES: usize = BITMAP_LENGTH * size_of::<u64>();
assert!(byte_offset.checked_add(bytes.len()).map_or(false, |sum| sum <= BITMAP_BYTES));

// If we know we're writing the full bitmap, we can avoid the initial memset to 0
let mut bits = if bytes.len() == BITMAP_BYTES {
debug_assert_eq!(byte_offset, 0); // Must be true from the above assert

// Safety: We've checked that the length is correct, and we use an unaligned load in case
// the bytes are not 8 byte aligned.
// The optimizer can see through this, and avoid the double copy to copy directly into
// the allocated box from bytes with memcpy
let bytes_as_words =
unsafe { bytes.as_ptr().cast::<[u64; BITMAP_LENGTH]>().read_unaligned() };
Box::new(bytes_as_words)
} else {
let mut bits = Box::new([0u64; BITMAP_LENGTH]);
// Safety: It's safe to reinterpret u64s as u8s because u8 has less alignment requirements,
// and has no padding/uninitialized data.
let dst = unsafe {
std::slice::from_raw_parts_mut(bits.as_mut_ptr().cast::<u8>(), BITMAP_BYTES)
};
let dst = &mut dst[byte_offset..][..bytes.len()];
dst.copy_from_slice(bytes);
bits
};

if !cfg!(target_endian = "little") {
// Convert all words we touched (even partially) to little-endian
let start_word = byte_offset / size_of::<u64>();
let end_word = (byte_offset + bytes.len() + (size_of::<u64>() - 1)) / size_of::<u64>();

// The 0th byte is the least significant byte, so we've written the bytes in little-endian
for word in &mut bits[start_word..end_word] {
*word = u64::from_le(*word);
}
}

Self::from_unchecked(bits_set, bits)
}

///
/// Create a new BitmapStore from a given len and bits array
/// It is up to the caller to ensure len == cardinality of bits
Expand Down
Loading