
Intel Multi Tier CacheLib


Introduction

With the increasing memory requirements of AI/ML/CDN/Web services, memory plays a significant role in the power, performance, and TCO of these services. CXL provides new capabilities for scaling the performance of memory- and I/O-bound systems with cost-effective additional byte-addressable memory. Our work to enable memory tiering in CacheLib demonstrates how to leverage the additional byte-addressable memory offered by CXL devices to improve the performance of hybrid caching environments.

Typically, a hybrid cache system is a DRAM cache backed by a larger NVM block cache. CacheLib uses the NVM device to expand cache capacity and increase the cache hit ratio. The hit ratio plays a critical role in the overall end-to-end performance of an application. Consider the following example: a cache with a 98% hit ratio, an average cache latency of 100us, and a database access time of 10ms has an expected end-to-end application latency of 298us (0.98 x 100us + 0.02 x 10000us). Increasing the hit ratio to 99% gives an expected end-to-end latency of 199us, reducing latency by over 33%.
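
The arithmetic is simple enough to sanity-check in a few lines. The sketch below is purely illustrative; the function name and structure are ours, not part of CacheLib.

```cpp
#include <cstdio>

// Expected end-to-end latency for a cache in front of a database:
// hits are served at cache latency, misses pay the database access time.
double expectedLatencyUs(double hitRatio, double cacheUs, double dbUs) {
  return hitRatio * cacheUs + (1.0 - hitRatio) * dbUs;
}

int main() {
  std::printf("98%% hits: %.0fus\n", expectedLatencyUs(0.98, 100.0, 10000.0)); // 298us
  std::printf("99%% hits: %.0fus\n", expectedLatencyUs(0.99, 100.0, 10000.0)); // 199us
}
```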

CacheLib - multi-tier memory value proposition with CXL

[Figure: CacheLib multi-tier memory value proposition with CXL]

CXL memory expansion

With the presence of additional byte-addressable memory, the pressure on the NVM device is reduced significantly. In our evaluation, we found that increasing the byte-addressable memory from 12GB to 20GB results in over a 1.5x improvement in throughput. Furthermore, we show that we can reduce DRAM memory by 3x with no degradation in overall system throughput.

CXL bandwidth expansion

In some workloads, particularly those with large objects, memory bandwidth demand increases as we scale up the number of threads. In our experiments we have measured over 220GB/s of memory bandwidth with 48 threads for a synthetic CDN workload. Because CXL memory traffic goes through PCIe channels rather than DIMM channels, CXL bandwidth expansion can be used to relieve this pressure and improve CacheLib's throughput.

CacheLib Background and Internals

CacheLib is a general purpose caching library in C++ developed and released as open-source software by Meta. CacheLib provides caching capabilities inside Meta for more than 100 diverse services across Web, Social Graph, CDN, and storage.

CacheLib uses the slab allocator design popularized by memcached (a popular open-source key-value cache). Each object in the cache resides in a class corresponding to the object’s size. Classes are allocated slabs and maintain their own independent eviction containers. A pointer to the object’s data residing on a slab is stored in a global hash table that is indexed by the object’s key for fast lookups.

The diagram below introduces our multi-tier extension to the single tier allocator developed by Meta. We duplicate the allocator structure for each tier and allow data to move among the tiers using the existing data movement mechanisms.
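
To make the layout concrete, here is a deliberately simplified sketch of the multi-tier structure: each tier duplicates the per-class slab allocator and LRU containers, while a single hash table indexes items across tiers. The type names are hypothetical and omit most of what the real allocator tracks.

```cpp
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified view of the multi-tier layout described above.
struct Item {
  std::string key;
  uint32_t sizeClass;  // class chosen from the object's size
  uint8_t tier;        // 0 = DRAM, 1 = CXL/PMEM, ...
  // ... payload lives in a slab owned by (tier, sizeClass)
};

struct SizeClass {
  std::list<Item*> lru;               // per-class eviction container
  std::vector<void*> freeAllocations; // free slots carved from slabs
};

struct Tier {
  std::vector<SizeClass> classes;     // one entry per allocation size
};

struct MultiTierCache {
  std::vector<Tier> tiers;                          // tier 0 is the fastest
  std::unordered_map<std::string, Item*> hashTable; // key -> item, tier-agnostic
};
```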

[Figure: multi-tier extension of the CacheLib slab allocator]

Design of a multi-tier software cache

Multi-tier memory systems have several key design considerations that single-tier systems do not, because data movement among the tiers incurs extra overhead (unlinking from the current tier, copying the data, linking into the next tier). In our multi-tier CacheLib we are actively exploring several choices:

  • Cache inclusiveness - should the items in the first tier be inclusive with respect to the second tier, i.e., should the first tier's items be a subset of the second tier's items? In an inclusive design, effective cache space is reduced, but we only need to write back dirty data to the next tier, reducing the amount of data copied to the next tier.
  • Initial allocation policy - when a new item is inserted into the cache, we can allocate from either 1) the first tier, since the item is likely to be reused again, or 2) any free allocation slot, regardless of which tier it belongs to (see the sketch after this list). In (1), we get the benefit of the item already being in the fastest tier if it is reused before being evicted, but we may have to evict an item from the first tier to the next tier in order to make space for the new incoming item. In (2), we can allocate immediately without going through the eviction path, but if the item is reused quickly, we may pay the extra access cost of a slower medium compared to tier 1.
  • Promotion - when an item in a slower tier is accessed, we may want to promote it to the first tier so that future accesses are served from the fastest medium. However, promotion introduces additional overhead and may not always be beneficial. We are currently experimenting with different promotion policies and how they affect overall performance as the first tier hit ratio increases due to promoted items being reused.
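
A minimal sketch of the two initial-allocation policies, assuming only a per-class count of free slots in each tier; the names and structure below are ours for illustration, not CacheLib's actual API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the two initial-allocation policies discussed above.
// freeSlots[t] is assumed to hold the number of free allocation slots the
// relevant size class has in tier t (tier 0 is the fastest).
enum class AllocPolicy { FirstTierOnly, AnyFreeTier };

// Returns the tier to allocate the new item from. A negative result means the
// first tier is full and an eviction into the next tier is required first.
int chooseAllocationTier(AllocPolicy policy, const std::vector<size_t>& freeSlots) {
  if (policy == AllocPolicy::AnyFreeTier) {
    // Policy (2): take a free slot in any tier, skipping the eviction path.
    for (size_t t = 0; t < freeSlots.size(); ++t) {
      if (freeSlots[t] > 0) return static_cast<int>(t);
    }
  }
  // Policy (1), or no free slot anywhere: allocate from the first tier,
  // evicting an item into the next tier if it has no free slot.
  return (freeSlots.empty() || freeSlots[0] == 0) ? -1 : 0;
}
```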

Reducing the cost of moving data among the tiers

As a first step, we focus on optimizations to reduce data movement overhead:

  • Background threads for data movement (eviction and promotion)
  • Reducing the locking overhead in eviction

Background data movement

To reduce data movement overhead, we added background threads to CacheLib to perform eviction and copy data among tiers. By preemptively moving items among tiers, user allocation requests (SET key value) can immediately allocate an item without first evicting an old one, since the threads keep allocation slots open. The background threads can also promote popular items from lower tiers to higher tiers, keeping the most popular data in the tier with the fastest device latency.

We use a watermark threshold strategy inspired by the kernel tiering design. There are two thresholds: 1) a low threshold, which triggers the background eviction thread when the free memory in a class falls below a given percentage, and 2) a high threshold, which tells the background thread to stop evicting once the class has reached a given percentage of free memory.

For example, if class A has a low threshold of 2% and a high threshold of 4%, then background eviction starts when the class has 2% of its memory space remaining and ends once 4% of its memory is free.

Below is an illustration of the background eviction process and design. Our design allows us to move items among the tiers in batches, which reduces contention on the LRU lists. The promotion process and design are similar, with the exception that data moves in the opposite direction (from tier 2 to tier 1).
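
As a rough illustration of the watermark logic, assuming per-class free and total byte counts, a background worker could size its eviction batch as below. The helper is ours for illustration, not the actual background-worker code.

```cpp
#include <cstddef>

// Hypothetical sketch of the watermark logic used by a background eviction
// worker: evict a batch of items from a class when its free memory falls
// below lowWatermarkPct, and free enough space to reach highWatermarkPct.
struct ClassStats {
  size_t freeBytes;
  size_t totalBytes;
};

size_t itemsToEvict(const ClassStats& stats,
                    double lowWatermarkPct,   // e.g. 2.0
                    double highWatermarkPct,  // e.g. 4.0
                    size_t avgItemSize) {     // assumed > 0
  double freePct = 100.0 * stats.freeBytes / stats.totalBytes;
  if (freePct >= lowWatermarkPct) {
    return 0;  // enough free space in this class, nothing to do
  }
  // Evict enough items in one batch to reach the high watermark; batching
  // the moves reduces contention on the LRU lists.
  double targetFreeBytes = stats.totalBytes * highWatermarkPct / 100.0;
  double bytesToFree = targetFreeBytes - stats.freeBytes;
  return static_cast<size_t>(bytesToFree / avgItemSize) + 1;
}
```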

[Figure: background eviction process and design]

Reducing the critical section length in the eviction path

The eviction process in a multi-tier environment is significantly longer than in a single tier. In a single-tier system an object is 1) unlinked from the eviction queue, 2) removed from the hash table, and 3) recycled to the slab allocator. In comparison, multi-tier eviction is a six step process:

  1. Unlink item from eviction queue
  2. Allocate new item in next tier (can cause recursive eviction in next tier)
  3. Copy data to next tier
  4. Link new item in eviction queue
  5. Update new item in hash table
  6. Recycle item back to allocator

Each link/unlink requires locking the eviction queue for that tier and class. Updating an item in the hash table also requires locking the hash table entry. Step 2, perhaps the most expensive, can trigger all of these steps again in the next tier (when evicting from the last tier, only the three steps from the single-tier design apply).
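
To show where the locks fall, here is a heavily simplified sketch of the six-step path; the types are stand-ins rather than CacheLib's real containers, and a naive implementation would hold the eviction-queue lock across the whole sequence.

```cpp
#include <mutex>

// Simplified stand-ins; real items carry keys, payloads, and metadata, and
// the LRU lock is per tier and per class.
struct Item { /* key, payload, metadata */ };
struct TierClass { std::mutex lruLock; };

void evictToNextTier(TierClass& cur, TierClass& next, Item& victim) {
  { std::lock_guard<std::mutex> g(cur.lruLock);  /* 1. unlink from LRU        */ }
  // 2. allocate a new item in the next tier (may recursively evict there)
  // 3. copy the victim's data into the new allocation
  { std::lock_guard<std::mutex> g(next.lruLock); /* 4. link into next tier's LRU */ }
  // 5. update the hash table entry to point at the new item (bucket lock)
  // 6. recycle the old allocation back to the slab allocator
  (void)victim;
}
```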

Our design avoids holding the eviction queue lock across steps 1-6 and only locks the appropriate containers when necessary. We use control bits in the object's metadata to mark an object as being evicted, and fine-grained object locks during the data copy. This allows concurrent evictions in the same object class to proceed and improves multi-tier performance significantly; the figure below shows the impact of locking the LRU lists, measured using Intel VTune.
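
The control-bit idea can be sketched as an atomic flag packed into the item metadata; the bit layout and names below are illustrative assumptions, not the actual CacheLib metadata format.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of the control-bit technique: instead of holding the
// LRU lock across the whole eviction, a thread atomically marks the item as
// "being evicted"; other threads skip or fail on marked items.
struct ItemMetadata {
  static constexpr uint32_t kExclusive = 1u << 31;  // "being evicted" bit
  std::atomic<uint32_t> flagsAndRefCount{0};        // remaining bits: ref count

  // Returns true if this thread won the right to evict the item. Fails if the
  // item is already being evicted or still has outstanding references.
  bool markExclusive() {
    uint32_t cur = flagsAndRefCount.load(std::memory_order_relaxed);
    while ((cur & kExclusive) == 0 && (cur & ~kExclusive) == 0) {
      if (flagsAndRefCount.compare_exchange_weak(cur, cur | kExclusive,
                                                 std::memory_order_acq_rel)) {
        return true;
      }
    }
    return false;
  }

  void unmarkExclusive() {
    flagsAndRefCount.fetch_and(~kExclusive, std::memory_order_acq_rel);
  }
};
```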

[Figure: impact of LRU list locking, measured with Intel VTune]

Performance Analysis With PMEM as a substitute for CXL

CacheLib ships with its own benchmarking tool, CacheBench, which provides several open-source synthetic workloads inspired by Meta's own services. When evaluating the performance of our multi-tier design we typically look at the following metrics: hit ratio, throughput, p50 to p99.9 latency, and CPU utilization. We have found that several of CacheLib's internal configuration options have a significant impact on performance, such as hash table size (a larger size means fewer collisions and less chaining).
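
For reference, the hash table size is tuned through the allocator's access config. The snippet below is a rough sketch based on CacheLib's public configuration API; the specific values (cache size, bucket and lock powers) are illustrative rather than the settings used in our experiments.

```cpp
#include "cachelib/allocator/CacheAllocator.h"

using Cache = facebook::cachelib::LruAllocator;

int main() {
  Cache::Config config;
  config
      .setCacheName("multi-tier-example")
      // 12GB of byte-addressable cache, as in the smallest setup we test.
      .setCacheSize(12UL * 1024 * 1024 * 1024)
      // 2^26 hash table buckets guarded by 2^20 locks; a larger bucketsPower
      // means fewer collisions and shorter chains per lookup.
      .setAccessConfig({26 /* bucketsPower */, 20 /* locksPower */})
      .validate();
  Cache cache(config);
  return 0;
}
```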

In our evaluation here we focus on the hit ratio (leader and follower) workloads for Meta’s Graph Cache system. We scale these workloads to use 3.2B keys (leader) and 1.3B keys (follower) with a backing 256GB SSD cache and vary the byte-addressable cache size. We use PMEM in these experiments in place of CXL. Later, in our collaboration with SK Hynix we show results with CXL memory.

In the charts below we show that an additional 8GB of byte-addressable memory improves throughput by over 1.5x, comparing the 20GB cache sizes to the 12GB cache sizes. This is because cache throughput is bound by the I/O done by the SSD cache. Since more data now resides in byte-addressable memory, as shown by the improved hit ratio (mem), a larger portion of requests is served from byte-addressable memory rather than the SSD.

[Figure: hit ratio and throughput for the leader and follower workloads]

We also show that performance does not degrade when using multiple tiers (4GB DRAM + 16GB PMEM and 4GB DRAM + 8GB PMEM). The tiered version of CacheLib provides sufficient performance that the overhead of moving data among the tiers is not the bottleneck. In fact, compared to the upstream version of CacheLib (20GB), our multi-tier design slightly beats the single tier due to the productivity of the background workers: by keeping allocation slots free, incoming allocations take significantly less time since there is no need to perform any eviction.

The latency chart shows the drop in P99 latency compared to the single-tier 12GB system when using background eviction in multi-tier CacheLib. We significantly decrease the GET P99 latency due to reduced time spent waiting on spin locks for the LRU lists, as illustrated by the lower CPU utilization in the right pane. SET P99 latency is also reduced, since incoming allocation requests do not always need to perform an eviction because the background threads will have kept an allocation slot open. This is the same reason why throughput improves slightly over the single-tier version without background eviction.

Performance Analysis with CXL memory from SK Hynix

CXL memory expansion with Multi-Tier CacheLib

For our CXL Memory Expander configuration, we experimented on an Intel Xeon Platinum 8480. We use the CDN synthetic workload from CacheBench with the number of keys set to 8M, the number of operations per thread set to 5M, and the number of threads set to 24. For OS interleaving (1:3), SK Hynix's HMSDK (https://github.com/skhynix/hmsdk) was used to allocate memory according to the DRAM to CXL memory capacity ratio. For CacheLib multi-tier, 10 background promoters and evictors were configured. For CXL memory capacity expansion, the DRAM:CXL ratio was set to 1:3. The cache sizes and configuration are detailed in the table below:

[Table: cache sizes and configuration for the CXL memory expansion experiments]

This experiment shows the benefits of expanding the memory capacity of a single server with CXL memory. We show that multi-tier CacheLib obtains additional performance gains compared to OS interleaving with HMSDK from SK Hynix. With the additional CXL memory, GET throughput is improved by 1.26x~1.70x compared to the DRAM-only configuration.

[Figures: GET throughput and latency for the DRAM-only, OS interleaving, and multi-tier CacheLib configurations]

CXL bandwidth expansion with CacheLib

For our CXL Memory Expander configuration, we again experimented on an Intel Xeon Platinum 8480. The maximum memory bandwidth of a single server was increased by 50% with the CXL expander. SK Hynix's HMSDK was used to allocate memory according to the bandwidth ratio of DRAM and CXL. We use the CDN synthetic workload from CacheBench with the number of keys set to 8M, the number of operations per thread set to 5M, and the number of threads set to 200. The cache size was varied from 8GB to 224GB.

CXL memory can expand system memory without consuming DIMM channels; in other words, it can expand both the capacity and the bandwidth of a single server. This experiment shows the bandwidth expansion effect of CXL for workloads with high memory bandwidth requirements. When CXL memory was added, compared to the server configured only with DRAM, the memory bandwidth was expanded and GET throughput improved by 1.25x~1.41x, while p99 latency decreased by 17%~41%. We show the results in the figures below:

[Figures: GET throughput and p99 latency with CXL bandwidth expansion]

Conclusion

CXL provides new opportunities to re-architect the memory systems used by datacenter applications, as we have demonstrated with CacheLib. Enabling new architectures such as tiering can improve application performance for systems traditionally bound by I/O.

We have implemented optimizations to reduce data movement overhead, and we are currently researching additional opportunities to enable more performance in the tiering environment. Our fork of CacheLib can be found at https://github.com/intel/CacheLib, and our develop branch contains our latest working version.