Chaining hash table implementation #37

Status: Open. amari wants to merge 11 commits into master from chained-hash-table.

Conversation

@amari (Contributor) commented Sep 1, 2019

Intrusive hash table implementation that supports:

  • collision resolution via chaining (Adapter<Link = LinkedListLink>)
  • run-time configurable load factor
  • bucket usage bitmap
  • Entry API
  • zero-alloc insertions
  • allocating/resizing insertions

There are a few utility traits:

  • Array: the storage trait for the bucket index and the metadata.
  • DynamicArray: an Array that additionally supports dynamic allocation / resizing.
  • Bits
  • BitsArray

Dynamic allocation is entirely optional and can be implemented via the DynamicArray trait.
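
For illustration only, here is a minimal sketch of the kind of element and adapter such a table could consume. intrusive_adapter!, LinkedListLink, and KeyAdapter already exist in the crate; the hash table construction itself is omitted because that is the API this PR introduces.

use intrusive_collections::{intrusive_adapter, KeyAdapter, LinkedListLink};

struct Entry {
    link: LinkedListLink, // the chaining link within a bucket
    key: u32,
    value: u64,
}

// Adapter describing how Entry is linked through `link`.
intrusive_adapter!(EntryAdapter = Box<Entry>: Entry { link: LinkedListLink });

// Expose the key so the table can hash and compare entries.
impl<'a> KeyAdapter<'a> for EntryAdapter {
    type Key = u32;
    fn get_key(&self, entry: &'a Entry) -> u32 {
        entry.key
    }
}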

@Amanieu (Owner) commented Sep 3, 2019

Nice work! I'm quite busy this week and won't have time to perform a full review, but I will try to find some time for it next week.

@Amanieu (Owner) left a review:

Here's an initial review with some comments. I will do a further review later.

pub use crate::hash_table::ChainedLoadFactor;
use crate::key_adapter::KeyAdapter;
use crate::linked_list::Link as LinkedListLink;
use crate::linked_list::LinkedList;
@Amanieu (Owner) commented:

You really should be using singly-linked lists here. The memory savings are significant and nobody really cares about iterating through hash tables backwards.

@amari (Author) replied:

I had originally planned for O(1) inserts and removals. If I use singly-linked lists, the constant time remove operation is lost.

I had also wanted a general purpose implementation. Currently the max load factor is not constrained; the user is free to create hash tables where the number of buckets is small relative to the number of items. Linear time removals could affect this kind of use case.

I have a use case where link: LinkedListLink is reused between a double-ended-queue and a hash table to save memory. I'll need an extra pointer if I use singly-linked lists, unless a LinkedListLink can be used as a SinglyLinkedListLink.

I think the solution is two chaining hash table implementations, one per list type.

Which names do you like?

  • singly-linked list
    • SinglyLinkedChainedHashTable
    • SinglyChainedHashTable
    • SinglyLinkedHashTable
  • linked list
    • LinkedChainedHashTable
    • ChainedHashTable
    • LinkedHashTable

@Amanieu (Owner) replied:

O(1) removal can be obtained by having cursors point to the previous element in the linked list.

Sharing a link between a linked list and a hash table can be solved by #27 which would allow you to implement your own Link types and make them work for different collections.
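
To illustrate the idea with a toy sketch (not crate code): if the cursor stores a pointer to the predecessor of the element of interest, removal from a singly-linked list needs no traversal.

struct Node<T> {
    value: T,
    next: Option<Box<Node<T>>>,
}

// Removes the node *after* `prev` in O(1): `prev` is already the predecessor,
// so the list can be re-spliced without walking it.
fn remove_after<T>(prev: &mut Node<T>) -> Option<T> {
    prev.next.take().map(|mut removed| {
        prev.next = removed.next.take();
        removed.value
    })
}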

buckets: B,
buckets_bits: C,
buckets_len: usize,
hash_builder: S,
@Amanieu (Owner) commented:

As I mentioned above, it is probably better to leave the choice of hasher to the adapter. As a bonus this eliminates our dependency on std.

@Amanieu (Owner) commented Apr 5, 2020

@amari Are you planning on continuing work on this? One of the major concerns (doubly vs singly linked list) is now resolved thanks to #46.

@Amanieu (Owner) commented Apr 5, 2020

I would like to make one major change regarding the Array trait. I think we should only have a single Array trait, with these methods:

trait Array<T>: DerefMut<Target = [T]> {
    /// Hint to the array that the hash table is about to insert additional elements.
    /// This function can return a new array to tell the hash table to rehash into the new array.
    fn reserve(&mut self, len: usize) -> Option<Self>;

    /// Hint to shrink the capacity of the array if possible.
    /// This function can return a new array to tell the hash table to rehash into the new array.
    fn shrink_to_fit(&mut self, len: usize) -> Option<Self>;
}

The key change here is that the Array is now responsible for tracking the load factor of the table and choosing when to rehash.

The implementations of Array for [T; N] will have no-op implementations for both of these methods: a fixed-size array will never re-hash and doesn't care about the load factor.

For dynamic arrays which track the load factor we will provide a DynamicArray<T> type (we can't just use Vec<T> or Box<[T]> because they don't have a load factor). This type will control the load factor and resize as needed.
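
A self-contained sketch of the fixed-size case under this proposal (the DerefMut supertrait is left out so the snippet stands alone, and const generics are used only for brevity; names follow the comment above, not a final API):

trait Array<T>: Sized {
    /// Hint that the table is about to hold `len` elements; returning a new
    /// array tells the hash table to rehash into it.
    fn reserve(&mut self, len: usize) -> Option<Self>;

    /// Hint to shrink the storage down to `len` elements if possible.
    fn shrink_to_fit(&mut self, len: usize) -> Option<Self>;
}

// A fixed-size array never rehashes and ignores the load factor entirely,
// so both hints are no-ops.
impl<T, const N: usize> Array<T> for [T; N] {
    fn reserve(&mut self, _len: usize) -> Option<Self> {
        None
    }

    fn shrink_to_fit(&mut self, _len: usize) -> Option<Self> {
        None
    }
}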

@amari (Author) commented Apr 6, 2020

Yes, I plan on finishing this later this week.

@amari force-pushed the chained-hash-table branch from f5d36dc to ba60b09 on April 26, 2020 at 08:26.
@amari (Author) commented Apr 26, 2020

I've been ruminating about the design and came to the conclusion that it is much cleaner to treat HashTable as an "N+1 degree" data structure, whereas SinglyLinkedList would be an "N degree" data structure.

This means that HashTableAdapter does not implement Adapter because a HashTable never deals with link operations directly, and instead delegates to the bucket type.

I wanted to get the current design reviewed. Also, test coverage is 93.2%, so check the tests for example usage.

Outstanding issues

  • Create and document examples.
  • HashOps implementations must be generic over their value argument, otherwise the compilation fails (the compiler claims that HashOps<A::KeyOps::Key> is not implemented for IntegerIdentityHashOps when A::KeyOps::Key = u32). This is due to the HRTB A::HashOps: for<'b> HashOps<<A::KeyOps as KeyOps<'b, <A::PointerOps as PointerOps>::Value>>::Key>. The solution is to remove the HRTB, but I am debating whether to add a lifetime parameter to the impl or to remove the lifetime from KeyOps.
  • A way to customize the "load factor target" through Array or HashTable. The array would still decide when to grow/shrink, if at all, but a use-case may require optimized time (more buckets, fewer elements per bucket, faster search within bucket) or optimized space.
  • Extending BucketOps to support custom bucket emptiness tracking.
  • Consolidate the hash_table module into a single file.

Some notes

There are now two kinds of cursors: BucketCursor and Cursor.

Originally there was a single Cursor where move_next would advance to the next element within the bucket or the first element of the next non-empty bucket.

However, the effect of move_next depended on the internal state of the hash table, making the meaning of this code ambiguous:

// ...
let mut cursor = hash_table.cursor_mut();
cursor.move_next();
cursor.insert_after(value);

Into which bucket is the value inserted? There's no way to tell without exposing the current bucket index, and even that wouldn't give the same degree of control offered by the cursor APIs of the other intrusive collections. Adding move_next_bucket and move_prev_bucket would move the cursor to a hybrid "null" state where get returns None, but then there is no longer a single "null object"; there is one per bucket. What would is_null return if the cursor were pointing to the null object of a bucket (e.g. when inserting a value into an empty bucket)?

KeyOps and HashOps

Since HashTableAdapter does not implement Adapter, we cannot implement KeyAdapter.

use core::hash::{BuildHasher, Hash, Hasher};

/// Key operations.
pub trait KeyOps<'a, T: ?Sized> {
    /// The key type.
    type Key;

    /// Returns the key of `value`.
    fn key(&self, value: &'a T) -> Self::Key;
}

/// Trait for hash operations.
pub trait HashOps<T: ?Sized> {
    /// Returns the hash value of `value`.
    fn hash(&self, value: &T) -> u64;
}

/// The default implementation of `HashOps`.
#[derive(Default)]
pub struct DefaultHashOps<S: BuildHasher> {
    hasher_builder: S,
}

impl<S: BuildHasher> DefaultHashOps<S> {
    /// Returns a new hash ops instance.
    #[inline]
    pub fn from_hasher_builder(hasher_builder: S) -> DefaultHashOps<S> {
        DefaultHashOps { hasher_builder }
    }
}

impl<S: BuildHasher, T: Hash> HashOps<T> for DefaultHashOps<S> {
    #[inline(never)]
    fn hash(&self, value: &T) -> u64 {
        let mut hasher = self.hasher_builder.build_hasher();
        value.hash(&mut hasher);
        hasher.finish()
    }
}
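
A hypothetical usage of the snippet above (RandomState is the standard library's default BuildHasher; HashOps and DefaultHashOps come from this PR, not the released crate):

use std::collections::hash_map::RandomState;

fn demo() {
    let hash_ops = DefaultHashOps::from_hasher_builder(RandomState::new());
    // The same key hashed through the same ops instance always yields the same value.
    assert_eq!(hash_ops.hash(&42u32), hash_ops.hash(&42u32));
}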

I need cursor_mut_from_ptr(ptr).remove() to have constant time complexity.

However, a singly linked list is sufficient for many use cases.

I've added a BucketOps trait for the following reasons:

  • Clean separation between the Adapter implementation of the bucket vs HashTableAdapter
  • User-specified time complexity (SinglyLinkedList vs LinkedList)
  • Custom emptiness tracking (e.g. bitset of non-empty buckets) (Planned but not yet implemented.)

Also, BucketOps::Cursor is not the same as a collection's Cursor. It is usually Option<NonNull<A::PointerOps::Value>>.

pub unsafe trait BucketOps {
    /// Collection-specific pointer conversions which allow an object to
    /// be inserted in an intrusive collection.
    type PointerOps: PointerOps;

    /// The hash table bucket type.
    type Bucket;

    /// The cursor over elements within a bucket.
    type Cursor: Clone;

    /// Returns `true` if the bucket is empty.
    unsafe fn is_empty(&self, bucket: &Self::Bucket) -> bool;

    /// Removes all elements from the bucket.
    ///
    /// This will unlink all objects currently in the bucket, which requires iterating through
    /// all elements in the bucket. Each element is converted back into an owned pointer and then dropped.
    unsafe fn clear(&mut self, bucket: &mut Self::Bucket);

    /// Empties the bucket without unlinking or freeing the objects within it.
    ///
    /// Since this does not unlink any objects, any attempts to link these objects into another intrusive collection will fail,
    /// but will not cause any memory unsafety.
    /// To unlink those objects manually, you must call the `force_unlink` function on them.
    unsafe fn fast_clear(&mut self, bucket: &mut Self::Bucket);

    /// Returns `true` if the cursor is pointing to the null object.
    unsafe fn is_null(&self, bucket: &Self::Bucket, cursor: &Self::Cursor) -> bool;

    /// Returns a reference to the object that the cursor is currently pointing to.
    ///
    /// This returns `None` if the cursor is pointing to the null object.
    unsafe fn get(
        &self,
        bucket: &Self::Bucket,
        cursor: &Self::Cursor,
    ) -> Option<&<Self::PointerOps as PointerOps>::Value>;

    /// Returns a null cursor for this bucket.
    unsafe fn cursor(&self, bucket: &Self::Bucket) -> Self::Cursor;

    /// Creates a cursor from a pointer to an element.
    ///
    /// # Safety
    /// `ptr` must be a pointer to an object that is a member of this bucket.
    unsafe fn cursor_from_ptr(
        &self,
        bucket: &Self::Bucket,
        ptr: *const <Self::PointerOps as PointerOps>::Value,
    ) -> Self::Cursor;

    /// Searches the bucket for the first element where `is_match` returns `true`.
    ///
    /// Returns `None` if no match was found, or the bucket was empty.
    unsafe fn find_prev<F>(&self, bucket: &Self::Bucket, is_match: F) -> Option<Self::Cursor>
    where
        for<'b> F: FnMut(&'b <Self::PointerOps as PointerOps>::Value) -> bool;

    /// Returns a cursor that points to the next element of the bucket.
    ///
    /// If `prev` points to the null object,
    /// then it will return a cursor that points to the first element of the bucket.
    ///
    /// If `prev` points to the last element of the bucket,
    /// then it will return a cursor that points to the null object.
    unsafe fn next(&self, bucket: &Self::Bucket, prev: &Self::Cursor) -> Self::Cursor;

    /// Inserts a new element into the bucket after the current one.
    ///
    /// If the cursor currently points to the null object,
    /// then the new element is inserted at the front of the bucket.
    ///
    /// # Panics
    /// Panics if the new element is already linked to a different intrusive collection.
    unsafe fn insert_after(
        &mut self,
        bucket: &mut Self::Bucket,
        prev: &Self::Cursor,
        value: <Self::PointerOps as PointerOps>::Pointer,
    );

    /// Removes the next element from the bucket.
    ///
    /// Returns a pointer to the element that was removed.
    /// The cursor is not moved.
    ///
    /// If the cursor currently points to the last element of the bucket, then no element is removed and `None` is returned.
    unsafe fn remove_next(
        &mut self,
        bucket: &mut Self::Bucket,
        prev: &Self::Cursor,
    ) -> Option<<Self::PointerOps as PointerOps>::Pointer>;

    /// Removes the current element from the bucket.
    ///
    /// Returns a pointer to the element that was removed.
    /// The cursor is advanced to the next element of the bucket.
    ///
    /// If the cursor currently points to the null object, then no element is removed and `None` is returned.
    unsafe fn remove(
        &mut self,
        bucket: &mut Self::Bucket,
        current: &mut Self::Cursor,
    ) -> Option<<Self::PointerOps as PointerOps>::Pointer>;

    /// Removes the next element from the bucket, and inserts another object in its place.
    ///
    /// Returns a pointer to the element that was removed.
    /// The cursor is not moved.
    ///
    /// If the cursor currently points to the last element of the bucket, then `Err(value)` is returned.
    ///
    /// # Panics
    /// Panics if the new element is already linked to a different intrusive collection.
    unsafe fn replace_next_with(
        &mut self,
        bucket: &mut Self::Bucket,
        prev: &Self::Cursor,
        value: <Self::PointerOps as PointerOps>::Pointer,
    ) -> Result<<Self::PointerOps as PointerOps>::Pointer, <Self::PointerOps as PointerOps>::Pointer>;

    /// Removes the current element from the bucket, and inserts another object in its place.
    ///
    /// Returns a pointer to the element that was removed.
    /// The cursor points to the newly added element.
    ///
    /// If the cursor currently points to the null object, then `Err(value)` is returned.
    ///
    /// # Panics
    /// Panics if the new element is already linked to a different intrusive collection.
    unsafe fn replace_with(
        &mut self,
        bucket: &mut Self::Bucket,
        current: &mut Self::Cursor,
        value: <Self::PointerOps as PointerOps>::Pointer,
    ) -> Result<<Self::PointerOps as PointerOps>::Pointer, <Self::PointerOps as PointerOps>::Pointer>;
}

@Amanieu (Owner) left a review:

Initial review, I haven't finished reading through all the code yet.

/// This function can return a new array to tell the hash table to rehash into the new array.
fn shrink_to(&mut self, min_capacity: usize) -> Option<Self>;

/// Returns the capacity.
@Amanieu (Owner) commented:

Suggested change:

-    /// Returns the capacity.
+    /// Returns the number of elements the map can hold without reallocating.
+    ///
+    /// This number is a lower bound; the hash table might be able to hold
+    /// more, but is guaranteed to be able to hold at least this many.

}

impl<T> Array<T> for &'_ mut [T] {
#[inline(never)]
@Amanieu (Owner) commented:

Why is this #[inline(never)]?

inner: Vec<T>,
capacity: usize,
load_factor: LoadFactor,
len: usize,
@Amanieu (Owner) commented:

This field is never used?

impl<T: Default> Array<T> for DynamicArray<T> {
#[inline(never)]
fn reserve(&mut self, new_capacity: usize) -> Option<Self> {
let new_len = new_capacity;
@Amanieu (Owner) commented:

Just rename the parameter to new_len?

10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32
}
@Amanieu (Owner) commented:

I would prefer restricting array lengths to only powers of two. This has several advantages:

  • Bit masking is much faster than modulo.
  • We only need to provide array impls for powers of two.
  • Growing is easy: just double the size.

You can use the capacity_to_buckets and bucket_mask_to_capacity methods from hashbrown as a reference.
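
For example (a minimal sketch, not crate code), with a power-of-two bucket count the index computation reduces to a single mask:

fn bucket_index(hash: u64, bucket_count: usize) -> usize {
    debug_assert!(bucket_count.is_power_of_two());
    // For a power-of-two count, hash % bucket_count == hash & (bucket_count - 1),
    // and the mask compiles down to a single AND instruction.
    (hash as usize) & (bucket_count - 1)
}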

#![allow(clippy::declare_interior_mutable_const, clippy::collapsible_if)]

#[cfg(feature = "alloc")]
extern crate alloc;

#[cfg(test)]
#[macro_use]
@Amanieu (Owner) commented:

You should import macros using use std::<macro name>; rather than using #[macro_use].

@@ -267,13 +267,14 @@
#![warn(missing_docs)]
#![warn(rust_2018_idioms)]
#![no_std]
-#![cfg_attr(feature = "nightly", feature(const_fn))]
+#![cfg_attr(feature = "nightly", feature(const_fn, const_generics))]
@Amanieu (Owner) commented:

Const generics aren't needed if we are only providing array impls for power-of-two sizes.

/// |`remove`|`O(n)`|`O(1)`|
/// |`replace_next_with`|`O(1)`|`O(1)`|
/// |`replace_with`|`O(n)`|`O(1)`|
pub unsafe trait BucketOps {
@Amanieu (Owner) commented:

We can simplify this: just implement BucketOps directly on LinkedList/SinglyLinkedList instead of using DefaultBucketOps. This functionality isn't really something we need to let the user customize.

Remove type Bucket and type PointerOps. Add type Pointer and type Value.
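
A rough sketch of that shape (illustrative names only, not a final API): the collection itself is the bucket, so the separate Bucket and PointerOps associated types disappear.

pub unsafe trait BucketOps {
    /// Owned pointer type stored in the bucket (e.g. `Box<Self::Value>`).
    type Pointer;
    /// The value type the pointers point to.
    type Value: ?Sized;

    /// Returns `true` if the bucket contains no elements.
    fn is_empty(&self) -> bool;

    // ...the remaining cursor, insert, and remove methods would mirror the
    // original trait, but take `&self`/`&mut self` as the bucket itself.
}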

/// It is also possible to create stateful adapters.
/// This allows links and containers to be separated and avoids the need for objects to be modified to
/// contain a link.
pub trait HashTableAdapter {
@Amanieu (Owner) commented:

HashTableAdapter can be simplified. Remove PointerOps and BucketOps, the adapter only needs to handle key equality and key hashing. Everything else is handled by the BucketOps from the B array parameter to HashTable.
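
Sketched out (again with assumed names, not a final API), the adapter would shrink to something like:

pub trait HashTableAdapter {
    /// The key type used for lookups.
    type Key: ?Sized;

    /// Hashes a key.
    fn hash_key(&self, key: &Self::Key) -> u64;

    /// Returns `true` if two keys are equal.
    fn keys_eq(&self, a: &Self::Key, b: &Self::Key) -> bool;
}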
