Adaptive Radix Tree: high-performance memtable #273

Open
wants to merge 19 commits into
base: 6.4.tikv


Little-Wallace

@Little-Wallace Little-Wallace commented Apr 4, 2022

Background

See https://db.in.tum.de/~leis/papers/ART.pdf for more details on the algorithm.
The Adaptive Radix Tree (ART) is a kind of trie that is memory-efficient while still keeping high performance.
However, the original paper does not explain how to make it work with concurrent reads and writes.
Here I port a high-performance memtable based on this paper. It only supports single-threaded writes, which means allow_concurrent_memtable_write must be set to false. But ART is about 8 times faster than the skiplist, so even a single writer thread can deliver high throughput.
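
To illustrate what "adaptive" means here, the following is a minimal sketch of the node-growing idea from the paper, which uses Node4, Node16, Node48 and Node256. It is not the code in this PR: the names SmallNode, LeafNode and grow are hypothetical, only two node sizes are shown, and keys are kept unsorted for brevity (the paper keeps them sorted).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical base type and leaf, for illustration only.
struct Node {
  virtual ~Node() = default;
};
struct LeafNode : Node {
  int value = 0;
};

// A small inner node: up to Capacity (key byte, child) pairs, searched linearly.
template <std::size_t Capacity>
struct SmallNode : Node {
  std::uint8_t count = 0;
  std::array<std::uint8_t, Capacity> keys{};
  std::array<Node*, Capacity> children{};

  bool full() const { return count == Capacity; }

  // Returns the child stored under key byte c, or nullptr if there is none.
  Node* find_child(std::uint8_t c) const {
    for (std::uint8_t i = 0; i < count; ++i) {
      if (keys[i] == c) return children[i];
    }
    return nullptr;
  }

  // Appends a child; the caller must grow the node first if it is full.
  void add_child(std::uint8_t c, Node* child) {
    assert(!full());
    keys[count] = c;
    children[count] = child;
    ++count;
  }
};

using Node4 = SmallNode<4>;
using Node16 = SmallNode<16>;

// When a Node4 is full, copy its entries into a larger Node16.
inline std::unique_ptr<Node16> grow(const Node4& small) {
  auto big = std::make_unique<Node16>();
  for (std::uint8_t i = 0; i < small.count; ++i) {
    big->add_child(small.keys[i], small.children[i]);
  }
  return big;
}
```

A node only pays for a larger child array once it actually has that many children, which is where the memory efficiency comes from.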

Performance comparison

LD_PRELOAD=/opt/homebrew/lib/libgflags.dylib ./db_bench --db=./data --disable_wal=true --enable_pipelined_write=true --key_size=35 --value_size=100 --write_buffer_size=4096000000 --benchmarks=fillrandom --batch_size=512 --num=1000000 --threads=1 --compression_type=none --allow_concurrent_memtable_write=false

fillrandom : 1.446 micros/op 691349 ops/sec; 89.0 MB/s

LD_PRELOAD=/opt/homebrew/lib/libgflags.dylib ./db_bench --db=./data --disable_wal=true --enable_pipelined_write=true --memtablerep=art --key_size=35 --value_size=100 --write_buffer_size=4096000000 --benchmarks=fillrandom --batch_size=512 --num=1000000 --threads=1 --compression_type=none --allow_concurrent_memtable_write=false

fillrandom : 0.283 micros/op 3527534 ops/sec; 454.2 MB/s
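
For this particular run, the ratio between the two fillrandom results can be checked with a small helper. The function name speedup is hypothetical; the inputs are the ops/sec figures printed by db_bench above, which give a ratio of roughly 5.1.

```cpp
#include <cassert>

// Ratio of candidate throughput to baseline throughput (both in ops/sec).
inline double speedup(double baseline_ops_per_sec, double candidate_ops_per_sec) {
  assert(baseline_ops_per_sec > 0.0);
  return candidate_ops_per_sec / baseline_ops_per_sec;
}

// Example with the figures from the two runs above:
//   speedup(691349.0, 3527534.0)  // default memtable vs --memtablerep=art
```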

TODO

  • Support snapshot isolation (do not replace the value directly).
  • In the future we will port optimistic locking to support multi-threaded writes.
  • Node256 allocates a node that uses 2048 bytes of memory, which is too large. Maybe we could create a compressed Node256 that uses less memory and owns a sub-allocator for the nodes of its sub-tree.
  • A similar idea applies to a CompressedNode16, as in the previous item.
  • Merge Node and InnerNode to save memory, because we do not need to store prefix and prefix_len for a leaf node.
class CompressedNode256 : public InnerNode {
public:
     // Look up the 16-bit slot for key byte c, then resolve it through the node's own arena.
     Node* find_child(uint8_t c) const override {
          uint16_t index = children_index[c].load(std::memory_order_relaxed);
          return arena.get_node(index);
     }

private:
     NodeArena arena;
     std::atomic<uint16_t> children_index[256];
};
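
The CompressedNode256 idea above can be fleshed out into a compilable sketch. Everything beyond the names already used above is an assumption: the NodeArena interface (add_node, get_node), the set_child method, the LeafNode type, and the convention that index 0 means "no child" are all hypothetical. For brevity the arena here stores pointers in a std::vector, so it shows the indexing scheme but not the real memory saving; an actual sub-allocator would presumably hand out offsets into its own buffer. The sketch is also single-threaded only, because the vector may reallocate while growing.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

struct Node {
  virtual ~Node() = default;
};
struct InnerNode : Node {
  virtual Node* find_child(std::uint8_t c) const = 0;
};
struct LeafNode : Node {
  int value = 0;
};

// 256 two-byte indices instead of 256 eight-byte pointers (2048 bytes on 64-bit).
static_assert(256 * sizeof(std::uint16_t) == 512, "index table is 512 bytes");

// Hypothetical sub-allocator: maps 16-bit indices to nodes. Index 0 is reserved
// to mean "no child". NOT safe for concurrent readers (the vector may reallocate).
class NodeArena {
 public:
  NodeArena() : nodes_(1, nullptr) {}

  std::uint16_t add_node(Node* node) {
    assert(nodes_.size() < 65536);
    nodes_.push_back(node);
    return static_cast<std::uint16_t>(nodes_.size() - 1);
  }

  Node* get_node(std::uint16_t index) const {
    assert(index < nodes_.size());
    return nodes_[index];
  }

 private:
  std::vector<Node*> nodes_;
};

class CompressedNode256 : public InnerNode {
 public:
  CompressedNode256() {
    // Every slot starts at 0, i.e. "no child".
    for (auto& slot : children_index) slot.store(0, std::memory_order_relaxed);
  }

  Node* find_child(std::uint8_t c) const override {
    std::uint16_t index = children_index[c].load(std::memory_order_relaxed);
    return arena.get_node(index);
  }

  // Hypothetical writer-side method: register the child, then publish its index.
  void set_child(std::uint8_t c, Node* child) {
    children_index[c].store(arena.add_node(child), std::memory_order_release);
  }

 private:
  NodeArena arena;
  std::atomic<std::uint16_t> children_index[256];
};
```

A 16-bit index limits each arena to 65535 nodes, which is one reason the arena would be scoped to a sub-tree rather than the whole tree.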

Signed-off-by: Little-Wallace <[email protected]>
@Little-Wallace Little-Wallace requested a review from sticnarf April 5, 2022 14:48
@Little-Wallace
Author

Little-Wallace commented Apr 5, 2022

Memory Usage

I tested the adaptive radix tree and InlineSkipList with an 8-byte key + 16-byte value and with an 8-byte key + 136-byte value (the value includes an 8-byte sequence).

For a 16-byte value:
AdaptiveRadixTree takes up 13.1MB while InlineSkipList takes up 6.6MB

For a 136-byte value:
AdaptiveRadixTree takes up 36.5MB while InlineSkipList takes up 30.4MB

I think the extra memory cost is worth it.

  // Test 1: AdaptiveRadixTree, 8-byte key + 16-byte value per entry.
  const int N = 200000;
  Arena arena;
  AdaptiveRadixTree list(&arena);
  for (int i = 0; i < N; i++) {
    Key key = i;
    char* buf = arena.AllocateAligned(sizeof(Key) + 16);
    const char* d = Encode(key);
    memcpy(buf, d, sizeof(Key));
    list.Insert(buf, sizeof(key), buf);
  }
  // ApproximateMemoryUsage() returns size_t, so print it with %zu.
  printf("cost memory: %zu\n", arena.ApproximateMemoryUsage());

  // Test 2 (run separately, same N): InlineSkipList with the same entry layout.
  ConcurrentArena arena;
  TestComparator cmp;
  InlineSkipList<TestComparator> list(cmp, &arena);
  for (int i = 0; i < N; i++) {
    Key key = i;
    char* buf = list.AllocateKey(sizeof(Key) + 16);
    memcpy(buf, &key, sizeof(Key));
    list.Insert(buf);
  }
  printf("cost memory: %zu\n", arena.ApproximateMemoryUsage());
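
To put the totals in per-entry terms, a small helper like the one below can be used. It is only a sketch under assumptions: the name bytes_per_entry is hypothetical, it assumes the reported figures come from N = 200000 entries as in the test code, and it assumes MB means 10^6 bytes (pass 1048576.0 as the third argument if the figures are in MiB).

```cpp
#include <cassert>

// Average bytes per entry for a reported total size.
// bytes_per_mb is an assumption about the unit of the reported figure.
inline double bytes_per_entry(double total_mb, long entries,
                              double bytes_per_mb = 1e6) {
  assert(entries > 0);
  return total_mb * bytes_per_mb / static_cast<double>(entries);
}

// Example with the 16-byte-value figures above (each entry carries 24 bytes
// of key + value, so anything beyond that is structure overhead):
//   bytes_per_entry(13.1, 200000)  // AdaptiveRadixTree
//   bytes_per_entry(6.6, 200000)   // InlineSkipList
```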

@Little-Wallace Little-Wallace changed the title [WIP] Adaptive Radix Tree: a high-performance memtable [WIP] Adaptive Radix Tree: high-performance memtable Apr 6, 2022
@Little-Wallace Little-Wallace changed the title [WIP] Adaptive Radix Tree: high-performance memtable Adaptive Radix Tree: high-performance memtable Apr 6, 2022
@WenyXu

WenyXu commented May 23, 2022

Excellent job; I'm interested in your work. Can I work with you on this? I recently worked with ART and ran some micro-benchmarks comparing it with B-trees, which showed that range-scan performance is poor (similar to the results in the paper). To improve range-scan performance, the first idea that comes to mind is to add a doubly linked list between the parent nodes of the leaves, and I'm going to do more research on this idea. For synchronization, we may look at this paper.

** sequential set **
artTree:    set-seq        1,000,000 ops in 102ms, 9,780,250/sec, 102 ns/op, 86.9 MB, 91 bytes/op
google:     set-seq        1,000,000 ops in 219ms, 4,557,655/sec, 219 ns/op, 54.2 MB, 56 bytes/op
tidwall:    set-seq        1,000,000 ops in 154ms, 6,483,031/sec, 154 ns/op, 38.8 MB, 40 bytes/op
tidwall(G): set-seq        1,000,000 ops in 129ms, 7,740,272/sec, 129 ns/op, 23.6 MB, 24 bytes/op
tidwall:    set-seq-hint   1,000,000 ops in 81ms, 12,298,654/sec, 81 ns/op, 38.8 MB, 40 bytes/op
tidwall(G): set-seq-hint   1,000,000 ops in 61ms, 16,473,638/sec, 60 ns/op, 23.6 MB, 24 bytes/op
tidwall:    load-seq       1,000,000 ops in 42ms, 23,674,685/sec, 42 ns/op, 38.8 MB, 40 bytes/op
tidwall(G): load-seq       1,000,000 ops in 34ms, 29,754,119/sec, 33 ns/op, 23.6 MB, 24 bytes/op
go-arr:     append         1,000,000 ops in 24ms, 40,864,488/sec, 24 ns/op, 26.5 MB, 27 bytes/op

** sequential get **
artTree:    get-seq        1,000,000 ops in 20ms, 49,690,984/sec, 20 ns/op
google:     get-seq        1,000,000 ops in 207ms, 4,831,567/sec, 206 ns/op
tidwall:    get-seq        1,000,000 ops in 151ms, 6,629,358/sec, 150 ns/op
tidwall(G): get-seq        1,000,000 ops in 117ms, 8,519,272/sec, 117 ns/op
tidwall:    get-seq-hint   1,000,000 ops in 68ms, 14,612,574/sec, 68 ns/op
tidwall(G): get-seq-hint   1,000,000 ops in 38ms, 26,476,360/sec, 37 ns/op

** random set **
artTree:    set-rand       1,000,000 ops in 146ms, 6,865,334/sec, 145 ns/op, 86.9 MB, 91 bytes/op
google:     set-rand       1,000,000 ops in 1435ms, 696,714/sec, 1435 ns/op, 44.9 MB, 47 bytes/op
tidwall:    set-rand       1,000,000 ops in 938ms, 1,066,533/sec, 937 ns/op, 44.9 MB, 47 bytes/op
tidwall(G): set-rand       1,000,000 ops in 709ms, 1,409,824/sec, 709 ns/op, 32.9 MB, 34 bytes/op
tidwall:    set-rand-hint  1,000,000 ops in 931ms, 1,073,607/sec, 931 ns/op, 44.9 MB, 47 bytes/op
tidwall(G): set-rand-hint  1,000,000 ops in 628ms, 1,592,353/sec, 628 ns/op, 32.9 MB, 34 bytes/op
tidwall:    set-after-copy 1,000,000 ops in 1015ms, 984,766/sec, 1015 ns/op, 344 bytes, 0 bytes/op
tidwall(G): set-after-copy 1,000,000 ops in 596ms, 1,678,383/sec, 595 ns/op, 344 bytes, 0 bytes/op
tidwall:    load-rand      1,000,000 ops in 907ms, 1,102,022/sec, 907 ns/op, 44.9 MB, 47 bytes/op
tidwall(G): load-rand      1,000,000 ops in 582ms, 1,718,565/sec, 581 ns/op, 32.9 MB, 34 bytes/op

** random get **
artTree:    get-rand       1,000,000 ops in 266ms, 3,763,685/sec, 265 ns/op
google:     get-rand       1,000,000 ops in 1908ms, 524,022/sec, 1908 ns/op
tidwall:    get-rand       1,000,000 ops in 1145ms, 873,394/sec, 1144 ns/op
tidwall(G): get-rand       1,000,000 ops in 599ms, 1,668,960/sec, 599 ns/op
tidwall:    get-rand-hint  1,000,000 ops in 1416ms, 706,306/sec, 1415 ns/op
tidwall(G): get-rand-hint  1,000,000 ops in 719ms, 1,391,549/sec, 718 ns/op

** range **
artTree:    traverse      1,000,000 ops in 11ms, 88,034,949/sec, 11 ns/op
artTree:    iter          1,000,000 ops in 30ms, 33,882,559/sec, 29 ns/op
google:     ascend        1,000,000 ops in 6ms, 180,193,719/sec, 5 ns/op
tidwall:    ascend        1,000,000 ops in 5ms, 208,342,361/sec, 4 ns/op
tidwall(G): iter          1,000,000 ops in 6ms, 153,921,148/sec, 6 ns/op
tidwall(G): scan          1,000,000 ops in 4ms, 222,692,350/sec, 4 ns/op
tidwall(G): walk          1,000,000 ops in 2ms, 401,949,454/sec, 2 ns/op
go-arr:     for-loop      1,000,000 ops in 2ms, 614,266,461/sec, 1 ns/op
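
In the range results above, ART iteration costs 29 ns/op against roughly 2 to 6 ns/op for the B-tree variants. The linked-list idea mentioned earlier in this comment could look roughly like the sketch below. It is a simplification under stated assumptions: it links the leaves themselves rather than their parent nodes, a std::map stands in for the radix tree as the point-lookup index, and the names Leaf and LinkedLeafIndex are hypothetical. The point is only that a scan does one index lookup and then follows next pointers instead of walking back up the tree.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical leaf record with sibling links kept in key order.
struct Leaf {
  std::string key;
  std::string value;
  Leaf* prev = nullptr;
  Leaf* next = nullptr;
};

class LinkedLeafIndex {
 public:
  void Insert(const std::string& key, const std::string& value) {
    auto it = index_.lower_bound(key);
    if (it != index_.end() && it->first == key) {
      it->second->value = value;  // existing key: overwrite in place
      return;
    }
    auto leaf = std::make_unique<Leaf>();
    leaf->key = key;
    leaf->value = value;
    Leaf* raw = leaf.get();

    // Splice the new leaf between its in-order predecessor and successor.
    Leaf* succ = (it == index_.end()) ? nullptr : it->second;
    Leaf* pred = succ ? succ->prev : tail_;
    raw->prev = pred;
    raw->next = succ;
    if (pred) pred->next = raw; else head_ = raw;
    if (succ) succ->prev = raw; else tail_ = raw;

    index_.emplace_hint(it, key, raw);
    storage_.push_back(std::move(leaf));
  }

  // Returns up to limit keys >= start: one index lookup, then pointer chasing.
  std::vector<std::string> Scan(const std::string& start, std::size_t limit) const {
    std::vector<std::string> out;
    auto it = index_.lower_bound(start);
    const Leaf* leaf = (it == index_.end()) ? nullptr : it->second;
    for (; leaf != nullptr && out.size() < limit; leaf = leaf->next) {
      out.push_back(leaf->key);
    }
    return out;
  }

 private:
  std::map<std::string, Leaf*> index_;           // stand-in for the radix tree
  std::vector<std::unique_ptr<Leaf>> storage_;   // owns the leaves
  Leaf* head_ = nullptr;
  Leaf* tail_ = nullptr;
};
```

The cost is two extra pointers per leaf and extra work on every insert, so whether it pays off depends on how scan-heavy the workload is.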

see also casbin-mesh/neo#10
