Chaining hash table implementation #37

Status: Open. amari wants to merge 11 commits into master from chained-hash-table.

Conversation

@amari (Contributor) commented Sep 1, 2019

Intrusive hash table implementation that supports:

  • collision resolution via chaining (Adapter<Link = LinkedListLink>)
  • run-time configurable load factor
  • bucket usage bitmap
  • Entry API
  • zero-alloc insertions
  • allocating/resizing insertions

There are a few utility traits:

  • Array: the storage trait for the bucket index and the metadata.
  • DynamicArray: an Array that additionally supports dynamic allocation / resizing.
  • Bits
  • BitsArray

Dynamic allocation is entirely optional and can be implemented via the DynamicArray trait.
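
For illustration only, here is a minimal sketch of the kind of element and adapter such a table could consume. intrusive_adapter!, LinkedListLink, and KeyAdapter already exist in the crate; the hash table construction itself is omitted because that is the API this PR introduces.

use intrusive_collections::{intrusive_adapter, KeyAdapter, LinkedListLink};

struct Entry {
    link: LinkedListLink, // the chaining link within a bucket
    key: u32,
    value: u64,
}

// Adapter describing how Entry is linked through `link`.
intrusive_adapter!(EntryAdapter = Box<Entry>: Entry { link: LinkedListLink });

// Expose the key so the table can hash and compare entries.
impl<'a> KeyAdapter<'a> for EntryAdapter {
    type Key = u32;
    fn get_key(&self, entry: &'a Entry) -> u32 {
        entry.key
    }
}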

@Amanieu (Owner) commented Sep 3, 2019

Nice work! I'm quite busy this week and won't have time to perform a full review, but I will try to find some time for it next week.

@Amanieu (Owner) left a review:

Here's an initial review with some comments. I will do a further review later.

pub use crate::hash_table::ChainedLoadFactor;
use crate::key_adapter::KeyAdapter;
use crate::linked_list::Link as LinkedListLink;
use crate::linked_list::LinkedList;
@Amanieu (Owner) commented:

You really should be using singly-linked lists here. The memory savings are significant and nobody really cares about iterating through hash tables backwards.

@amari (Author) replied:

I had originally planned for O(1) inserts and removals. If I use singly-linked lists, the constant time remove operation is lost.

I had also wanted a general purpose implementation. Currently the max load factor is not constrained; the user is free to create hash tables where the number of buckets is small relative to the number of items. Linear time removals could affect this kind of use case.

I have a use case where link: LinkedListLink is reused between a double-ended-queue and a hash table to save memory. I'll need an extra pointer if I use singly-linked lists, unless a LinkedListLink can be used as a SinglyLinkedListLink.

I think the solution is two chaining hash table implementations, one per list type.

Which names do you like?

  • singly-linked list
    • SinglyLinkedChainedHashTable
    • SinglyChainedHashTable
    • SinglyLinkedHashTable
  • linked list
    • LinkedChainedHashTable
    • ChainedHashTable
    • LinkedHashTable

@Amanieu (Owner) replied:

O(1) removal can be obtained by having cursors point to the previous element in the linked list.

Sharing a link between a linked list and a hash table can be solved by #27 which would allow you to implement your own Link types and make them work for different collections.
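
To illustrate the idea with a toy sketch (not crate code): if the cursor stores a pointer to the predecessor of the element of interest, removal from a singly-linked list needs no traversal.

struct Node<T> {
    value: T,
    next: Option<Box<Node<T>>>,
}

// Removes the node *after* `prev` in O(1): `prev` is already the predecessor,
// so the list can be re-spliced without walking it.
fn remove_after<T>(prev: &mut Node<T>) -> Option<T> {
    prev.next.take().map(|mut removed| {
        prev.next = removed.next.take();
        removed.value
    })
}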

buckets: B,
buckets_bits: C,
buckets_len: usize,
hash_builder: S,
@Amanieu (Owner) commented:

As I mentioned above, it is probably better to leave the choice of hasher to the adapter. As a bonus this eliminates our dependency on std.

@Amanieu (Owner) commented Apr 5, 2020

@amari Are you planning on continuing work on this? One of the major concerns (doubly vs singly linked list) is now resolved thanks to #46.

@Amanieu (Owner) commented Apr 5, 2020

I would like to make one major change regarding the Array trait. I think we should only have a single Array trait, with these methods:

trait Array<T>: DerefMut<Target = [T]> {
    /// Hint to the array that the hash table is about to insert additional elements.
    /// This function can return a new array to tell the hash table to rehash into the new array.
    fn reserve(&mut self, len: usize) -> Option<Self>;

    /// Hint to shrink the capacity of the array if possible.
    /// This function can return a new array to tell the hash table to rehash into the new array.
    fn shrink_to_fit(&mut self, len: usize) -> Option<Self>;
}

The key change here is that the Array is now responsible for tracking the load factor of the table and choosing when to rehash.

The implementations of Array for [T; N] will have no-op implementations for both of these methods: a fixed-size array will never re-hash and doesn't care about the load factor.

For dynamic arrays which track the load factor we will provide a DynamicArray<T> type (we can't just use Vec<T> or Box<[T]> because they don't have a load factor). This type will control the load factor and resize as needed.
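
A self-contained sketch of the fixed-size case under this proposal (the DerefMut supertrait is left out so the snippet stands alone, and const generics are used only for brevity; names follow the comment above, not a final API):

trait Array<T>: Sized {
    /// Hint that the table is about to hold `len` elements; returning a new
    /// array tells the hash table to rehash into it.
    fn reserve(&mut self, len: usize) -> Option<Self>;

    /// Hint to shrink the storage down to `len` elements if possible.
    fn shrink_to_fit(&mut self, len: usize) -> Option<Self>;
}

// A fixed-size array never rehashes and ignores the load factor entirely,
// so both hints are no-ops.
impl<T, const N: usize> Array<T> for [T; N] {
    fn reserve(&mut self, _len: usize) -> Option<Self> {
        None
    }

    fn shrink_to_fit(&mut self, _len: usize) -> Option<Self> {
        None
    }
}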

@amari (Author) commented Apr 6, 2020

Yes, I plan on finishing this later this week.

@amari force-pushed the chained-hash-table branch from f5d36dc to ba60b09 on April 26, 2020 at 08:26.
@amari (Author) commented Apr 26, 2020

I've been ruminating about the design and came to the conclusion that it is much cleaner to treat HashTable as an "N+1 degree" data structure, whereas SinglyLinkedList would be an "N degree" data structure.

This means that HashTableAdapter does not implement Adapter because a HashTable never deals with link operations directly, and instead delegates to the bucket type.

I wanted to get the current design reviewed. Also, test coverage is 93.2%, so check the tests for example usage.

Outstanding issues

  • Create and document examples.
  • HashOps implementations must be generic over their value argument, otherwise the compilation fails (the compiler claims that HashOps<A::KeyOps::Key> is not implemented for IntegerIdentityHashOps when A::KeyOps::Key = u32). This is due to the HRTB A::HashOps: for<'b> HashOps<<A::KeyOps as KeyOps<'b, <A::PointerOps as PointerOps>::Value>>::Key>. The solution is to remove the HRTB, but I am debating whether to add a lifetime parameter to the impl or to remove the lifetime from KeyOps.
  • A way to customize the "load factor target" through Array or HashTable. The array would still decide when to grow/shrink, if at all, but a use-case may require optimized time (more buckets, fewer elements per bucket, faster search within bucket) or optimized space.
  • Extending BucketOps to support custom bucket emptiness tracking.
  • Consolidate the hash_table module into a single file.

Some notes

There are now two kinds of cursors: BucketCursor and Cursor.

Originally there was a single Cursor where move_next would advance to the next element within the bucket or the first element of the next non-empty bucket.

However, the effect of move_next depended on the internal state of the hash table, making the meaning of this code ambiguous:

// ...
let mut cursor = hash_table.cursor_mut();
cursor.move_next();
cursor.insert_after(value);

Into which bucket is the value inserted? There's no way to tell without exposing the current bucket index, and even that wouldn't give the same degree of control offered by the cursor APIs of the other intrusive collections. Adding move_next_bucket and move_prev_bucket would move the cursor to a hybrid "null" state where get returns None, but then there is no longer a single "null object"; there is one per bucket. What would is_null return if the cursor were pointing to the null object of a bucket (e.g. when inserting a value into an empty bucket)?

KeyOps and HashOps

Since HashTableAdapter does not implement Adapter, we cannot implement KeyAdapter.

use core::hash::{BuildHasher, Hash, Hasher};

/// Key operations.
pub trait KeyOps<'a, T: ?Sized> {
    /// The key type.
    type Key;

    /// Returns the key of `value`.
    fn key(&self, value: &'a T) -> Self::Key;
}

/// Trait for hash operations.
pub trait HashOps<T: ?Sized> {
    /// Returns the hash value of `value`.
    fn hash(&self, value: &T) -> u64;
}

/// The default implementation of `HashOps`.
#[derive(Default)]
pub struct DefaultHashOps<S: BuildHasher> {
    hasher_builder: S,
}

impl<S: BuildHasher> DefaultHashOps<S> {
    /// Returns a new hash ops instance.
    #[inline]
    pub fn from_hasher_builder(hasher_builder: S) -> DefaultHashOps<S> {
        DefaultHashOps { hasher_builder }
    }
}

impl<S: BuildHasher, T: Hash> HashOps<T> for DefaultHashOps<S> {
    #[inline(never)]
    fn hash(&self, value: &T) -> u64 {
        let mut hasher = self.hasher_builder.build_hasher();
        value.hash(&mut hasher);
        hasher.finish()
    }
}
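
A hypothetical usage of the snippet above (RandomState is the standard library's default BuildHasher; HashOps and DefaultHashOps come from this PR, not the released crate):

use std::collections::hash_map::RandomState;

fn demo() {
    let hash_ops = DefaultHashOps::from_hasher_builder(RandomState::new());
    // The same key hashed through the same ops instance always yields the same value.
    assert_eq!(hash_ops.hash(&42u32), hash_ops.hash(&42u32));
}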

I need cursor_mut_from_ptr(ptr).remove() to have constant time complexity.

However, a singly linked list is sufficient for many use cases.

I've added a BucketOps trait for the following reasons:

  • Clean separation between the Adapter implementation of the bucket vs HashTableAdapter
  • User-specified time complexity (SinglyLinkedList vs LinkedList)
  • Custom emptiness tracking (e.g. bitset of non-empty buckets) (Planned but not yet implemented.)

Also, BucketOps::Cursor is not the same as a collection's Cursor. It is usually Option<NonNull<A::PointerOps::Value>>.

pub unsafe trait BucketOps {
    /// Collection-specific pointer conversions which allow an object to
    /// be inserted in an intrusive collection.
    type PointerOps: PointerOps;

    /// The hash table bucket type.
    type Bucket;

    /// The cursor over elements within a bucket.
    type Cursor: Clone;

    /// Returns `true` if the bucket is empty.
    unsafe fn is_empty(&self, bucket: &Self::Bucket) -> bool;

    /// Removes all elements from the bucket.
    ///
    /// This will unlink all objects currently in the bucket, which requires iterating through
    /// all elements in the bucket. Each element is converted back into an owned pointer and then dropped.
    unsafe fn clear(&mut self, bucket: &mut Self::Bucket);

    /// Empties the bucket without unlinking or freeing the objects within it.
    ///
    /// Since this does not unlink any objects, any attempts to link these objects into another intrusive collection will fail,
    /// but will not cause any memory unsafety.
    /// To unlink those objects manually, you must call the `force_unlink` function on them.
    unsafe fn fast_clear(&mut self, bucket: &mut Self::Bucket);

    /// Returns `true` if the cursor is pointing to the null object.
    unsafe fn is_null(&self, bucket: &Self::Bucket, cursor: &Self::Cursor) -> bool;

    /// Returns a reference to the object that the cursor is currently pointing to.
    ///
    /// This returns `None` if the cursor is pointing to the null object.
    unsafe fn get(
        &self,
        bucket: &Self::Bucket,
        cursor: &Self::Cursor,
    ) -> Option<&<Self::PointerOps as PointerOps>::Value>;

    /// Returns a null cursor for this bucket.
    unsafe fn cursor(&self, bucket: &Self::Bucket) -> Self::Cursor;

    /// Creates a cursor from a pointer to an element.
    ///
    /// # Safety
    /// `ptr` must be a pointer to an object that is a member of this bucket.
    unsafe fn cursor_from_ptr(
        &self,
        bucket: &Self::Bucket,
        ptr: *const <Self::PointerOps as PointerOps>::Value,
    ) -> Self::Cursor;

    /// Searches the bucket for the first element where `is_match` returns `true`.
    ///
    /// Returns `None` if no match was found, or the bucket was empty.
    unsafe fn find_prev<F>(&self, bucket: &Self::Bucket, is_match: F) -> Option<Self::Cursor>
    where
        for<'b> F: FnMut(&'b <Self::PointerOps as PointerOps>::Value) -> bool;

    /// Returns a cursor that points to the next element of the bucket.
    ///
    /// If `prev` points to the null object,
    /// then it will return a cursor that points to the first element of the bucket.
    ///
    /// If `prev` points to the last element of the bucket,
    /// then it will return a cursor that points to the null object.
    unsafe fn next(&self, bucket: &Self::Bucket, prev: &Self::Cursor) -> Self::Cursor;

    /// Inserts a new element into the bucket after the current one.
    ///
    /// If the cursor currently points to the null object,
    /// then the new element is inserted at the front of the bucket.
    ///
    /// # Panics
    /// Panics if the new element is already linked to a different intrusive collection.
    unsafe fn insert_after(
        &mut self,
        bucket: &mut Self::Bucket,
        prev: &Self::Cursor,
        value: <Self::PointerOps as PointerOps>::Pointer,
    );

    /// Removes the next element from the bucket.
    ///
    /// Returns a pointer to the element that was removed.
    /// The cursor is not moved.
    ///
    /// If the cursor currently points to the last element of the bucket, then no element is removed and `None` is returned.
    unsafe fn remove_next(
        &mut self,
        bucket: &mut Self::Bucket,
        prev: &Self::Cursor,
    ) -> Option<<Self::PointerOps as PointerOps>::Pointer>;

    /// Removes the current element from the bucket.
    ///
    /// Returns a pointer to the element that was removed.
    /// The cursor is advanced to the next element of the bucket.
    ///
    /// If the cursor currently points to the null object, then no element is removed and `None` is returned.
    unsafe fn remove(
        &mut self,
        bucket: &mut Self::Bucket,
        current: &mut Self::Cursor,
    ) -> Option<<Self::PointerOps as PointerOps>::Pointer>;

    /// Removes the next element from the bucket, and inserts another object in its place.
    ///
    /// Returns a pointer to the element that was removed.
    /// The cursor is not moved.
    ///
    /// If the cursor currently points to the last element of the bucket, then `Err(value)` is returned.
    ///
    /// # Panics
    /// Panics if the new element is already linked to a different intrusive collection.
    unsafe fn replace_next_with(
        &mut self,
        bucket: &mut Self::Bucket,
        prev: &Self::Cursor,
        value: <Self::PointerOps as PointerOps>::Pointer,
    ) -> Result<<Self::PointerOps as PointerOps>::Pointer, <Self::PointerOps as PointerOps>::Pointer>;

    /// Removes the current element from the bucket, and inserts another object in its place.
    ///
    /// Returns a pointer to the element that was removed.
    /// The cursor points to the newly added element.
    ///
    /// If the cursor currently points to the null object, then `Err(value)` is returned.
    ///
    /// # Panics
    /// Panics if the new element is already linked to a different intrusive collection.
    unsafe fn replace_with(
        &mut self,
        bucket: &mut Self::Bucket,
        current: &mut Self::Cursor,
        value: <Self::PointerOps as PointerOps>::Pointer,
    ) -> Result<<Self::PointerOps as PointerOps>::Pointer, <Self::PointerOps as PointerOps>::Pointer>;
}

@Amanieu (Owner) left a review:

Initial review, I haven't finished reading through all the code yet.

/// This function can return a new array to tell the hash table to rehash into the new array.
fn shrink_to(&mut self, min_capacity: usize) -> Option<Self>;

/// Returns the capacity.
@Amanieu (Owner) commented:

Suggested change:

-    /// Returns the capacity.
+    /// Returns the number of elements the map can hold without reallocating.
+    ///
+    /// This number is a lower bound; the hash table might be able to hold
+    /// more, but is guaranteed to be able to hold at least this many.

}

impl<T> Array<T> for &'_ mut [T] {
#[inline(never)]
@Amanieu (Owner) commented:

Why is this #[inline(never)]?

inner: Vec<T>,
capacity: usize,
load_factor: LoadFactor,
len: usize,
@Amanieu (Owner) commented:

This field is never used?

impl<T: Default> Array<T> for DynamicArray<T> {
#[inline(never)]
fn reserve(&mut self, new_capacity: usize) -> Option<Self> {
let new_len = new_capacity;
@Amanieu (Owner) commented:

Just rename the parameter to new_len?

10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32
}
@Amanieu (Owner) commented:

I would prefer restricting array lengths to only powers of two. This has several advantages:

  • Bit masking is much faster than modulo.
  • We only need to provide array impls for powers of two.
  • Growing is easy: just double the size.

You can use the capacity_to_buckets and bucket_mask_to_capacity methods from hashbrown as a reference.
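
For example (a minimal sketch, not crate code), with a power-of-two bucket count the index computation reduces to a single mask:

fn bucket_index(hash: u64, bucket_count: usize) -> usize {
    debug_assert!(bucket_count.is_power_of_two());
    // For a power-of-two count, hash % bucket_count == hash & (bucket_count - 1),
    // and the mask compiles down to a single AND instruction.
    (hash as usize) & (bucket_count - 1)
}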

#![allow(clippy::declare_interior_mutable_const, clippy::collapsible_if)]

#[cfg(feature = "alloc")]
extern crate alloc;

#[cfg(test)]
#[macro_use]
@Amanieu (Owner) commented:

You should import macros using use std::<macro name>; rather than using #[macro_use].

@@ -267,13 +267,14 @@
#![warn(missing_docs)]
#![warn(rust_2018_idioms)]
#![no_std]
-#![cfg_attr(feature = "nightly", feature(const_fn))]
+#![cfg_attr(feature = "nightly", feature(const_fn, const_generics))]
@Amanieu (Owner) commented:

Const generics aren't needed if we are only providing array impls for power-of-two sizes.

/// |`remove`|`O(n)`|`O(1)`|
/// |`replace_next_with`|`O(1)`|`O(1)`|
/// |`replace_with`|`O(n)`|`O(1)`|
pub unsafe trait BucketOps {
@Amanieu (Owner) commented:

We can simplify this: just implement BucketOps directly on LinkedList/SinglyLinkedList instead of using DefaultBucketOps. This functionality isn't really something we need to let the user customize.

Remove type Bucket and type PointerOps. Add type Pointer and type Value.
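
A rough sketch of that shape (illustrative names only, not a final API): the collection itself is the bucket, so the separate Bucket and PointerOps associated types disappear.

pub unsafe trait BucketOps {
    /// Owned pointer type stored in the bucket (e.g. `Box<Self::Value>`).
    type Pointer;
    /// The value type the pointers point to.
    type Value: ?Sized;

    /// Returns `true` if the bucket contains no elements.
    fn is_empty(&self) -> bool;

    // ...the remaining cursor, insert, and remove methods would mirror the
    // original trait, but take `&self`/`&mut self` as the bucket itself.
}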

/// It is also possible to create stateful adapters.
/// This allows links and containers to be separated and avoids the need for objects to be modified to
/// contain a link.
pub trait HashTableAdapter {
@Amanieu (Owner) commented:

HashTableAdapter can be simplified. Remove PointerOps and BucketOps, the adapter only needs to handle key equality and key hashing. Everything else is handled by the BucketOps from the B array parameter to HashTable.
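
Sketched out (again with assumed names, not a final API), the adapter would shrink to something like:

pub trait HashTableAdapter {
    /// The key type used for lookups.
    type Key: ?Sized;

    /// Hashes a key.
    fn hash_key(&self, key: &Self::Key) -> u64;

    /// Returns `true` if two keys are equal.
    fn keys_eq(&self, a: &Self::Key, b: &Self::Key) -> bool;
}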
