The Facebook Content Distribution Network (FBCDN) delivers photos and videos to the people who use Facebook. The stack for the FBCDN includes several layers of caches and multiple backend storage systems (e.g., Haystack, f4). Today, we are sharing the story of how the caching layers of that stack have evolved over the last few years. It’s also a story about how research and production are tightly coupled within Facebook.
Caching within the FBCDN is beneficial for several reasons: it delivers photos and videos to people faster, it reduces Facebook’s backbone traffic, and it decreases the load on the backend storage systems. To those ends we want to maximize the caches’ hit ratio, i.e., the fraction of requests served with cached data. Yet, as we’ll see throughout this post, there is a tension between the hit ratio and the characteristics of the hardware used in the caches.
Facebook’s caches store photos and videos on NAND-flash-based solid state drives (SSDs). We use SSDs because they provide much higher capacity than DRAM (i.e., they can cache many more photos) and much higher random read rates than traditional hard drives (i.e., they can serve many more requests each second). But, due to their physical characteristics, SSDs suffer from write amplification, where a single logical write results in multiple physical writes to the device; roughly speaking, write amplification is the ratio of bytes physically written to the NAND to bytes logically written by the host. Higher write amplification is undesirable because it decreases both the throughput and the lifespan of the SSD.
Figure 1: Hit-ratio vs. write amplification of three generations of SSD caching for FBCDN.
As shown in Figure 1, this post tells the story of how we evolved the caching layers along these two dimensions, from McDipper to BlockCache and now RIPQ, trying to improve the hit ratio as much as possible while keeping the write amplification as low as possible. The rest of the post is divided into four parts. Part I describes a research study that identified the benefits of using advanced caching algorithms to increase the hit ratio and motivated the rest of the project. McDipper was the production cache at the time. Using FIFO (First-In-First-Out), it had the lowest hit ratio as well as the lowest write amplification. (McDipper also included an approximate LRU [Least-Recently-Used] mode, but it was not used in production because its marginal hit-ratio improvement did not justify the additional overhead.) Part II describes BlockCache, our first effort to implement advanced caching algorithms in practice. BlockCache supported SLRU (Segmented LRU), which greatly improved the hit ratio in production, but it had high write amplification and did not achieve the full benefits of advanced caching algorithms. Part III describes another research project, RIPQ, that addresses advanced caching on flash in a principled way. RIPQ provides a flexible priority queue interface that makes it simple to implement advanced caching algorithms, such as SLRU and GDSF (Greedy-Dual-Size-Frequency), with high fidelity, thus achieving the highest hit ratio. At the same time, it keeps write amplification low, which increases the throughput and the lifespan of the SSD. Finally, Part IV describes lessons learned from deploying RIPQ in production.
At SOSP 2013 and in a February 2014 blog post we presented a study of the FBCDN to characterize how the stack worked and to identify new directions that could improve it. Among our discoveries, one important finding was that advanced caching algorithms, such as Quadruply Segmented LRU (S4LRU), could significantly increase the hit ratio of the caches over simpler algorithms, including FIFO, which was used in production at the time.
Figure 2. Hit ratio of Clairvoyant/S4LRU/LRU/FIFO caching algorithms for varying sizes of the Origin cache. The origin cache is colocated with the storage backend to reduce accesses to the disk-based storage, and 1x was the cache size used in production at the time of the study. The clairvoyant algorithm relies on future information and is unattainable in practice, but provides an upper bound on achievable hit ratios.
The FIFO caching algorithm was originally picked over other common choices like Least-Recently-Used (LRU) and Least-Frequently-Used (LFU) primarily for two reasons: (1) FIFO naturally maps to a sequential write workload that ensures low write amplification on the SSDs used to cache photos, and (2) LRU and LFU provided only marginal hit ratio improvements over FIFO while being much harder to map to an SSD-friendly workload. LRU, LFU, and most other caching algorithms (excluding FIFO) naturally generate workloads with many small random writes, because they move items around the cache space on almost every access. Small random writes result in high write amplification on SSDs because they induce frequent garbage collection in the Flash Translation Layer (FTL).
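To make the first point concrete, here is a minimal sketch (our own illustration, not McDipper's actual code; the device interface and names are assumptions) of a FIFO flash cache laid out as a circular log: every write appends at a single, sequentially advancing offset, and eviction simply reclaims the oldest region, so the FTL only ever sees large sequential writes.

```python
# Minimal sketch (not McDipper's implementation) of a FIFO flash cache as a
# circular log. All writes land sequentially at the log head; eviction reclaims
# the oldest region, so the FTL never sees small random writes.

class FifoFlashCache:
    def __init__(self, device, capacity_bytes):
        self.device = device        # assumed to expose write(offset, data) and read(offset, size)
        self.capacity = capacity_bytes
        self.head = 0               # next write offset on the device
        self.index = {}             # key -> (offset, size), kept in RAM

    def insert(self, key, blob):
        # Wrap around when the log reaches the end of the device; whatever was
        # stored in the overwritten region is implicitly evicted (a real system
        # would also drop those keys from the in-memory index).
        if self.head + len(blob) > self.capacity:
            self.head = 0
        self.device.write(self.head, blob)
        self.index[key] = (self.head, len(blob))
        self.head += len(blob)

    def get(self, key):
        loc = self.index.get(key)
        if loc is None:
            return None             # cache miss
        offset, size = loc
        return self.device.read(offset, size)
```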
In short, FIFO was initially chosen over LRU and LFU because the potential improvement in hit ratio was small while the increase in write amplification was large. Our study changed this equation by showing that more advanced caching algorithms could increase the absolute hit ratio by 8 to 21%. This motivated us to start exploring how to implement these advanced caching algorithms while keeping write amplification as low as possible.
Based on the results of our initial study, the Facebook CDN team implemented an initial solution, named BlockCache, in late 2014, which applies the Segmented LRU logic to large fixed-size blocks of items. In a nutshell, BlockCache restricts its writes to fixed-size blocks that are aligned in the SSD’s logical address space. These logical blocks were configured to be 64MiB and contain hundreds of photo and video blobs of varying sizes. An in-memory index locates each item by its block ID and its offset within the block. The Segmented LRU data structure is also maintained in memory, tracking the relative positions of all flash blocks in an array of linked lists, each representing a segment. When an item in the cache gets a hit, the entire block it belongs to is promoted to a segment closer to the cache head (making the block less likely to be evicted), reflecting its increased cache priority. To handle cache misses, a few 64MiB RAM buffers are pre-allocated to accept newly inserted items. Once a buffer becomes full, it is flushed into an empty block on the flash device. When there is not enough space to accommodate flushed memory buffers, BlockCache evicts and reuses the least prioritized block from the queue tail.
Figure 3: How BlockCache simulates Segmented LRU at the block level to handle cache hits/misses/evictions. When an item gets a hit in BlockCache, the whole block is promoted even though the rest of the items in it might be rarely accessed, so the priorities of items within a block diverge. Later, when a block is evicted from the queue tail, many items in that block may still belong to higher segments under the exact Segmented LRU algorithm, decreasing the approximation fidelity.
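The following is a compact sketch of the block-level Segmented LRU described above (our own simplification, not BlockCache's production code; the flash interface and helper names are assumptions). It shows the in-memory index, the RAM insertion buffer, whole-block promotion on a hit, and eviction of the least prioritized block:

```python
from collections import OrderedDict

BLOCK_SIZE = 64 * 1024 * 1024  # 64MiB logical blocks, as described above

class BlockCache:
    """Simplified sketch of block-level Segmented LRU (not the production code)."""

    def __init__(self, num_segments, num_blocks, flash):
        self.flash = flash                       # assumed: write_block(block_id, data), read(block_id, off, size)
        # Each segment is an ordered set of block IDs; segment 0 is the head.
        self.segments = [OrderedDict() for _ in range(num_segments)]
        self.free_blocks = list(range(num_blocks))
        self.index = {}                          # key -> (block_id, offset, size)
        self.block_items = {}                    # block_id -> keys stored in that block
        self.buffer = bytearray()                # RAM buffer accepting new inserts
        self.buffer_items = []                   # (key, offset, size) staged in the buffer

    def get(self, key):
        loc = self.index.get(key)
        if loc is None:
            return None                          # miss; caller will insert()
        block_id, off, size = loc
        self._promote(block_id)                  # the whole block moves up one segment
        return self.flash.read(block_id, off, size)

    def insert(self, key, blob):
        self.buffer_items.append((key, len(self.buffer), len(blob)))
        self.buffer += blob
        if len(self.buffer) >= BLOCK_SIZE:
            self._flush_buffer()

    def _promote(self, block_id):
        # Promote the entire block one segment toward the head.
        for i, seg in enumerate(self.segments):
            if block_id in seg:
                del seg[block_id]
                self.segments[max(i - 1, 0)][block_id] = True
                return

    def _flush_buffer(self):
        if not self.free_blocks:
            self._evict_tail_block()
        block_id = self.free_blocks.pop()
        self.flash.write_block(block_id, bytes(self.buffer))   # one large aligned write
        self.segments[-1][block_id] = True                     # new blocks enter the lowest segment
        self.block_items[block_id] = [k for k, _, _ in self.buffer_items]
        for k, off, size in self.buffer_items:
            self.index[k] = (block_id, off, size)
        self.buffer = bytearray()
        self.buffer_items = []

    def _evict_tail_block(self):
        # Evict the least prioritized block: the oldest block in the lowest non-empty segment.
        for seg in reversed(self.segments):
            if seg:
                block_id, _ = seg.popitem(last=False)
                for k in self.block_items.pop(block_id):
                    self.index.pop(k, None)
                self.free_blocks.append(block_id)
                return
```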
The size of the blocks in BlockCache controls a trade-off between fidelity to the caching algorithm and write amplification. In the degenerate case where a block contains a single item, BlockCache is similar to a direct item-based implementation of Segmented LRU. (There are still small differences because fixed-size blocks require padding but are better aligned than variable-sized items.) As blocks grow, more and more items are packed into each one. Packing more items into a block decreases FTL write amplification but also decreases algorithm fidelity. Write amplification decreases because the resulting random writes are larger. Algorithm fidelity decreases because each time the algorithm moves one item, the other items in the same block, which the Segmented LRU logic did not intend to move, are moved in the block queue as well. Blocks of 64MiB were used because they provided the best trade-off between these two competing factors.
After being launched in production, BlockCache improved the hit ratio over the FIFO baseline by 10% (in this article we use absolute hit ratio improvements, so a 10% improvement means that the new hit ratio is the old hit ratio plus 10%). At the same time, the write amplification increased from 1x to around 3.9x, because the 64MiB block size was still small compared to the large erase block size and therefore generated an undesirable small random write workload on the flash devices. Although the hit ratio improvement was substantial, two major drawbacks of BlockCache still drove us to seek a better solution:
1. The write amplification of around 3.9x was still high, which reduced the usable throughput of the flash devices and shortened their lifespan.
2. The block-level approximation was coarse and tied the caching policy to the underlying block implementation, so the full hit-ratio benefits of advanced caching algorithms remained out of reach.
Based on these two observations, we sought to design a flash caching solution that could both reduce the negative effects of write amplification across Facebook’s workloads and decouple the caching policy from the underlying block-level implementation. Next we discuss our solution, RIPQ, which is now running in production at Facebook.
To solve the challenges we faced with flash caching, we conducted a study and came up with a new design called Restricted Insertion Priority Queue (RIPQ). RIPQ is a novel caching framework that decouples flash I/O management from data caching policies, supporting many different caching algorithms efficiently on modern flash devices.
PRIORITY QUEUE: AN INTERFACE TO ENABLE ALL CACHING ALGORITHMS
RIPQ provides a relative priority queue interface, which can be used to implement a large class of caching algorithms including LRU, Segmented LRU, and Greedy-Dual-Size-Frequency. A relative priority queue assigns each item in the queue a priority value based on its relative position: the item at the queue head has the highest priority, close to 1.0, and the item at the tail has the lowest priority, close to 0.0.
The RIPQ interface provides three basic operations: insert(x, p) inserts a new item x with priority p (used on a cache miss); increase(x, p') increases the priority of an existing item x to p' (used on a cache hit); and delete-min() removes the lowest-priority item at the queue tail (called implicitly for cache eviction).
The priority value can be thought of as the distance from the tail and is implicitly changed when a new item is inserted or an existing item’s priority is increased. When an item is inserted or its priority increased, all items with lower priorities are pushed closer to the tail and their priorities implicitly decrease. When the cache is full, the total number of items in the queue is roughly stable (i.e., each insertion comes with one eviction), and items with higher priorities (i.e., closer to the head) retain approximately the same priority values.
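To make these semantics concrete, here is an exact in-memory reference implementation of the interface (our own illustration; the real RIPQ approximates these operations with large block writes on flash, as described below). An item's relative priority is simply its rank from the tail, normalized to (0, 1]:

```python
class RelativePriorityQueue:
    """Exact in-memory reference for RIPQ's interface (illustrative only; RIPQ
    itself approximates these operations with large block writes on flash)."""

    def __init__(self):
        self._items = []          # ordered tail -> head; index 0 is the eviction candidate

    def _position(self, p):
        # Map a priority in (0, 1] to a position: p is the distance from the tail.
        return min(int(p * (len(self._items) + 1)), len(self._items))

    def insert(self, x, p):
        """Admit item x with priority p (called on a cache miss)."""
        self._items.insert(self._position(p), x)

    def increase(self, x, p):
        """Raise item x's priority to p (called on a cache hit)."""
        if x in self._items:
            self._items.remove(x)
            self._items.insert(self._position(p), x)

    def delete_min(self):
        """Evict the lowest-priority item at the tail (called when the cache is full)."""
        return self._items.pop(0) if self._items else None
```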
Figure 4. Using the relative priority queue interface of RIPQ to implement a caching algorithm: insert(x,p) on cache miss, increase(x,p’) on cache hit, and delete-min() is called implicitly for cache eviction. The animation also demonstrates how these interfaces can be used to implement an advanced algorithm such as Segmented LRU.
To demonstrate how RIPQ’s interface can be used to implement a caching algorithm, here are two simple examples, LRU and Segmented LRU with K segments:
LRU: on a cache miss, admit the item at the queue head with insert(x, 1.0); on a cache hit, move it back to the head with increase(x, 1.0).
Segmented LRU with K segments: on a cache miss, admit the item at the head of the lowest segment with insert(x, 1/K); on a cache hit to an item currently in the i-th segment (counting from the tail), promote it to the head of the next segment with increase(x, min(i+1, K)/K).
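Both examples boil down to a handful of calls against that interface. The sketch below (our own code, following the descriptions above; the per-item segment bookkeeping is a simplification we added so the policy knows which segment an item currently occupies) works against any object exposing insert(x, p) and increase(x, p), such as the reference queue above:

```python
class LRUPolicy:
    """LRU on top of a relative priority queue: everything goes to the head."""

    def __init__(self, q):
        self.q = q                       # any object with insert(x, p) / increase(x, p)

    def on_miss(self, x):
        self.q.insert(x, 1.0)            # admit at the queue head

    def on_hit(self, x):
        self.q.increase(x, 1.0)          # move back to the queue head


class SegmentedLRUPolicy:
    """Segmented LRU with K segments: admit into the lowest segment and promote
    one segment per hit. Segment i (1..K, counting from the tail) spans
    priorities ((i-1)/K, i/K]; the head of segment i is priority i/K."""

    def __init__(self, q, k):
        self.q = q
        self.k = k
        self.segment = {}                # item -> segment it currently occupies (simplification)

    def on_miss(self, x):
        self.segment[x] = 1
        self.q.insert(x, 1.0 / self.k)   # head of the lowest segment

    def on_hit(self, x):
        i = min(self.segment.get(x, 1) + 1, self.k)
        self.segment[x] = i
        self.q.increase(x, i / self.k)   # head of the next segment up

    def on_evict(self, x):
        self.segment.pop(x, None)        # called when the queue evicts x
```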
Certain advanced caching algorithms, such as GDSF, require an absolute priority queue interface, meaning each item needs to be assigned an absolute priority value that does not change with the position of the item in the queue. To support them, RIPQ provides an in-memory data structure that maps absolute priorities to relative priorities. In practice, we find that RIPQ’s interface allows for quick cache policy development and testing, supporting many caching algorithms with only a few lines of code. As a result, the interface allows cache policy designers to quickly explore the design space and come up with policies optimized for specific workloads, for example, increasing the priority of a class of items whose accesses to the storage backend consume more resources.
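The post does not detail that in-memory mapping structure, so the following is only a plausible sketch under our own assumptions: keep the absolute priorities currently in the cache in sorted order and translate each value to its normalized rank, which is a relative priority in (0, 1]. A production implementation would use a balanced tree or similar structure rather than a sorted Python list:

```python
import bisect

class AbsoluteToRelative:
    """Illustrative sketch only: translate absolute priorities (e.g., GDSF values
    of the form inflation + frequency * cost / size) into relative priorities in
    (0, 1] by normalized rank. A real implementation would use a balanced
    structure instead of a sorted Python list."""

    def __init__(self):
        self._sorted = []                # absolute priorities of items currently cached

    def add(self, abs_priority):
        bisect.insort(self._sorted, abs_priority)

    def remove(self, abs_priority):
        i = bisect.bisect_left(self._sorted, abs_priority)
        if i < len(self._sorted) and self._sorted[i] == abs_priority:
            self._sorted.pop(i)

    def to_relative(self, abs_priority):
        # Relative priority = fraction of cached items with a lower or equal value.
        if not self._sorted:
            return 1.0
        rank = bisect.bisect_right(self._sorted, abs_priority)
        return max(rank / len(self._sorted), 1e-9)
```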
DESIGN OF RIPQ: A FLASH-FRIENDLY APPROXIMATION OF PRIORITY QUEUE
Similar to BlockCache, RIPQ uses large blocks to hold photos on the flash device and memory buffers to accept new insertions. However, its key ideas are (1) insertions into priority-aware memory buffers, (2) lazy priority increases via virtual blocks, and (3) carrying out the deferred data movement through reinsertions during cache evictions (a code sketch tying the three together follows the summary below):
1. Priority-aware memory buffers to support insertions with approximate priorities. An insertion into RIPQ comes with a priority indicating the importance of the item. To put the item in roughly the right location, a tunable number of memory buffers are distributed evenly across the priority order. When inserting a new item with priority p, RIPQ appends it to the memory buffer nearest to the insertion point. The number of memory buffers therefore controls the trade-off between insertion accuracy and memory usage. In our experiments with 8 memory buffers, the approximation reduces the cache hit ratio by no more than 0.5% compared to the exact algorithms.
Figure 5. RIPQ simulating Segmented LRU. When inserting an item with a given priority, we append it to the memory buffer with the nearest priority. When that buffer becomes full, it is flushed to flash and becomes a flash block.
2. Virtual blocks to perform lazy priority increases. A naive way to perform a priority increase is to immediately copy the data on flash to a new memory buffer, but this would duplicate data in the cache and result in many writes to the flash. To solve this issue, RIPQ uses a new structure, the virtual block, to track the increased priority value when an item receives a cache hit. A virtual block is a small in-memory object that sits in the priority queue just like memory buffers and flash blocks, and it serves as a placeholder for items with updated priorities. On a priority increase, we only lazily update the item’s virtual block ID, with no immediate flash I/O needed.
Figure 6. We use virtual blocks as placeholders in the priority queue to track updated priorities. When increasing the priority of an item, we modify its virtual block ID to the virtual block nearest to its new priority.
3. Carrying out priority increases via reinsertions during evictions. The actual data movement associated with priority increases is carried out during cache evictions. The item’s data gets reinserted into the memory buffer nearest to its virtual block (if its virtual block field is set), reflecting its updated priority value. Because the old block is evicted from the cache at the same time, only one copy of the data ever exists on flash. In addition, we save flash I/O because multiple priority increases to the same item can be merged into a single data-copy operation.
Figure 7. The actual priority change is carried out during evictions. We copy the data of an item to the memory buffer closest to its virtual block, reflecting its new priority.
In summary, RIPQ tries to align item priorities with their locations in the block queue on flash. The alignment is violated when the priority of an item is updated (usually on a cache hit), but it is corrected later through reinsertion during cache eviction. This alignment allows us to approximate the priority queue operations with only large block writes to flash, at a relatively low overhead.
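To tie the three mechanisms together, here is a heavily simplified sketch (our own code with invented names; it uses in-memory lists in place of real flash I/O and elides concurrency, exact space accounting, and the on-flash index) of priority-aware insertion buffers, lazy virtual-block updates, and reinsertion on eviction:

```python
class RIPQSketch:
    """Greatly simplified sketch of RIPQ's mechanics (invented names, in-memory
    stand-ins for flash I/O). K insertion buffers sit at evenly spaced priorities;
    a priority increase only retags an item with a virtual block; the data
    movement happens by reinsertion when the tail flash block is evicted."""

    def __init__(self, k_buffers, max_flash_blocks, block_size):
        self.k = k_buffers
        self.block_size = block_size
        self.max_flash_blocks = max_flash_blocks
        self.buffers = [[] for _ in range(k_buffers)]   # section -> list of (key, blob); section 0 = tail
        self.flash_queue = []                           # ordered tail -> head: (section, [(key, blob), ...])
        self.cached = set()                             # keys currently in the cache
        self.virtual = {}                               # key -> section of its virtual block (lazy increase)

    def _section(self, p):
        # Map a priority in (0, 1] to the nearest insertion point.
        return min(int(p * self.k), self.k - 1)

    def insert(self, key, blob, p):
        """Admit an item on a miss: append to the priority-aware buffer nearest to p."""
        s = self._section(p)
        self.buffers[s].append((key, blob))
        self.cached.add(key)
        if sum(len(b) for _, b in self.buffers[s]) >= self.block_size:
            self._flush(s)

    def increase(self, key, p):
        """Lazy priority increase: remember the virtual block, do no flash I/O."""
        if key in self.cached:
            self.virtual[key] = self._section(p)

    def _flush(self, s):
        """Seal buffer s into a 'flash' block placed at its section in the queue."""
        items, self.buffers[s] = self.buffers[s], []
        pos = next((i for i, (sec, _) in enumerate(self.flash_queue) if sec > s),
                   len(self.flash_queue))
        self.flash_queue.insert(pos, (s, items))
        while len(self.flash_queue) > self.max_flash_blocks:
            self._evict_tail_block()

    def _evict_tail_block(self):
        """Evict the lowest-priority block; reinsert items whose priority was raised."""
        _, items = self.flash_queue.pop(0)
        for key, blob in items:
            target = self.virtual.pop(key, None)
            if target is None:
                self.cached.discard(key)               # cold item: evicted for good
            else:
                # Reinsertion materializes the deferred priority increase. We append
                # without flushing here to keep the sketch free of recursion; the
                # buffer is sealed on the next regular insert into this section.
                self.buffers[target].append((key, blob))
```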
PERFORMANCE RESULTS OF RIPQ
After prototyping RIPQ and publishing our initial design and experimental results in a paper at FAST ’15, we integrated RIPQ into Facebook’s production caching stack. Here we present the first analysis of the performance improvements, using the same Quadruply Segmented LRU algorithm on the same traffic. Compared to BlockCache, RIPQ improves the hit ratio by 3%-5% at the Origin cache colocated with the storage backend. Because both RIPQ and BlockCache approximate Quadruply Segmented LRU on flash, the improvement comes from RIPQ’s more faithful approximation at the item granularity, whereas BlockCache updates the priorities of all items within a block at once.
Figure 8. Comparing the caching performance of two clusters with similar workloads, one running BlockCache and the other running RIPQ. RIPQ increases the hit ratio by 3-5% thanks to its more faithful implementation of S4LRU, and correspondingly reduces reads to the backend by around 7%.
Another benefit we have seen is a large drop in write amplification. Because modern flash cards typically have large erase blocks on the order of hundreds of megabytes, the 64MiB block writes of BlockCache still cause significant write amplification (around 3-6x) from FTL garbage collection. Although increasing the block size could potentially decrease the write amplification, it would hurt the hit ratio of BlockCache, as each block would be more likely to be moved around due to a single popular photo. RIPQ uses the same block size as BlockCache, but it enjoys better data locality because (1) most of the writes come from misses and go to the same insertion point, corresponding to the lowest segment in SLRU, and (2) it does not move blocks around in the priority space. As a result, consecutive blocks in the queue are likely to be flushed to flash together and later evicted and overwritten together. The improved data locality results in a much smaller write amplification of around 1.2. This reduction frees up to 3x of the available flash write bandwidth, or extends the lifespan of the device by up to 3x under the same workload.
Figure 9. After switching from BlockCache to RIPQ in mid October (around October 18), the average flash write amplification dropped from over 3 to around 1.2 in a cache cluster.
After testing RIPQ in a production environment, aside from the write amplification benefit brought by better data locality, we also found a side-effect: popular items are often clustered in a few consecutive blocks, forming “hot blocks.” When evicting these hot blocks, RIPQ was performing a lot of reinsertions into the cache because a large amount of data in these blocks is hot enough to stay in the cache. This large number of reinsertions saturated the bandwidth of the SSD and caused latency spikes for flash reads. We have devised several heuristics to address this issue.
First, when evicting a hot block, if its reinsertion ratio is too high (>0.6), we move it directly to the queue head and process the next block instead. The eviction of the hot block is deferred until it becomes colder and its reinsertion ratio, and correspondingly the flash I/O it consumes, drops. We have observed that some very hot blocks can travel from the queue tail to the queue head, and then reach the queue tail again, multiple times before they are evicted; with this heuristic, the reinsertions for all but the final eviction are avoided.
Second, we maintain an exponential moving average (with coefficient 0.5) of the reinsertion ratio, and if it rises above 0.2, we instead find a cold block (with a reinsertion ratio below 0.1) near the queue tail to evict. In this way the reinsertion ratio over short time scales is less spiky, and more flash I/O bandwidth is available for cache admission and cache-hit reads.
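Putting the two heuristics together, the eviction path looks roughly like the sketch below (our own reconstruction with invented helper names such as tail_block() and find_block_near_tail(); the thresholds and the EMA coefficient are the values quoted above):

```python
class EvictionPolicy:
    """Sketch of the eviction-path heuristics described above (invented helper
    names; thresholds and the EMA coefficient are the values quoted in the text)."""

    HOT_BLOCK_THRESHOLD = 0.6    # defer evicting a block that would reinsert >60% of its bytes
    AVG_THRESHOLD = 0.2          # if the moving average climbs this high, prefer a cold block
    COLD_BLOCK_THRESHOLD = 0.1   # a "cold" block would reinsert <10% of its bytes
    EMA_COEFFICIENT = 0.5

    def __init__(self, queue):
        self.queue = queue                     # assumed block-queue object providing the helpers used below
        self.avg_reinsertion_ratio = 0.0

    def evict_one_block(self):
        block = self.queue.tail_block()
        ratio = block.reinsertion_ratio()      # fraction of the block that is still hot

        # Heuristic 1: a very hot tail block is moved to the queue head instead of
        # being evicted, deferring its eviction (and its reinsertions) until it cools.
        if ratio > self.HOT_BLOCK_THRESHOLD:
            self.queue.move_to_head(block)
            block = self.queue.tail_block()    # process the next block instead
            ratio = block.reinsertion_ratio()

        # Heuristic 2: if recent evictions have been reinsertion-heavy on average,
        # pick a cold block near the tail to smooth out the flash I/O.
        if self.avg_reinsertion_ratio > self.AVG_THRESHOLD:
            cold = self.queue.find_block_near_tail(max_ratio=self.COLD_BLOCK_THRESHOLD)
            if cold is not None:
                block, ratio = cold, cold.reinsertion_ratio()

        self.avg_reinsertion_ratio = (self.EMA_COEFFICIENT * ratio
                                      + (1 - self.EMA_COEFFICIENT) * self.avg_reinsertion_ratio)
        self.queue.evict(block)                # reinserts the hot items, drops the rest
```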
With these techniques combined, we have largely alleviated the flash read latency spikes during eviction. In addition, we found that large flash block reads and writes can saturate the flash bandwidth and interfere with cache-hit reads. By reading and writing each flash block in multiple smaller chunks, we reduce the interference and further cut down cache-hit latency.
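For example, instead of issuing a single 64MiB read per block, the block can be streamed in smaller requests so that latency-sensitive cache-hit reads can be scheduled in between. A minimal sketch (the chunk size here is our own illustrative choice, not the production value):

```python
import os

CHUNK_SIZE = 1 * 1024 * 1024   # illustrative chunk size, not the production setting

def read_block_in_chunks(fd, block_offset, block_size):
    """Read one large flash block as a series of smaller requests so that no
    single I/O monopolizes the device while cache-hit reads are waiting."""
    data = bytearray()
    for off in range(0, block_size, CHUNK_SIZE):
        length = min(CHUNK_SIZE, block_size - off)
        data += os.pread(fd, length, block_offset + off)
    return bytes(data)
```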
Figure 10: The above graph shows the 95th percentile (p95) cache-hit latency (over one-minute intervals) before and after switching from BlockCache to RIPQ around October 18. RIPQ with the eviction heuristics reduces the latency by over 60% compared to BlockCache.
We have now rolled out RIPQ to all cache clusters inside the FBCDN, and we are glad to see its benefits over its predecessor, BlockCache. The evolution of advanced caching at Facebook is also a great example of how a research project motivated by real-world problems finds its way back into production systems. We are especially grateful to the researchers and engineers at Facebook who have helped make this happen so quickly. Looking forward, we are excited to leverage RIPQ to design better caching policies for the FBCDN, extend it to the full memory/storage hierarchy (e.g., RAM and disk), and find synergy with other flash-based systems at Facebook.