Facebook serves billions of people each day. To support this scale, we distribute our workloads at many different levels. Our users’ traffic is routed to one of several worldwide data centers to improve scalability, fault tolerance, and latency. As a single host has limited storage and compute resources, our storage systems shard data items over multiple hosts, and our batch jobs execute over clusters of thousands of workers to scale and speed up the computation.
At the heart of these systems is a fundamental decision: how to assign a set of elements (requests, data items, or jobs) to one of several groups (data centers, hosts, or workers). There is an exponential number of ways to make such an assignment, and the choice can lead to drastically different response times and service quality. The first aspect to consider is workload balance: if one of the groups is overloaded, it won’t be able to meet the required response time or service level. For a given workload distribution, there is further optimization potential, since certain items perform better when they are co-located. For example, placing two data items that are frequently accessed together on the same storage host can improve the performance of the queries that rely on them.
Balanced graph partitioning provides a useful method to address the underlying task assignment problem. Here, nodes in the graph, representing items, are assigned to one of several buckets, representing the groups, in a balanced manner, while optimizing for some quantity such as within-group item similarity. We’ve talked in the past about our use of balanced graph partitioning with the edge-cut objective to improve systems’ performance. Our VLDB’17 paper, Social Hash Partitioner: A Scalable Distributed Hypergraph Partitioner, describes a new method for partitioning bipartite graphs while minimizing fan-out. This approach helps when optimizing many of our workload distribution tasks. We describe the resulting framework as a Social Hash Partitioner (SHP) because it can be used as the hypergraph partitioning component of the Social Hash framework introduced in our earlier NSDI’16 paper.
In this post we’ll explain the highlights of the Social Hash partitioner.
Our primary motivation for studying fan-out reduction comes from the problem of storage sharding, common in distributed databases. Consider a scenario with a large dataset whose data records are distributed across several storage servers. A query to the database may consume several data records. If those records are located on multiple servers, the query is answered by sending a request to each of those servers. Hence, the assignment of data records to servers determines the number of requests needed to process a query; this number is often called the fan-out of the query. Queries with low fan-out can be answered more quickly, as there is less chance of contacting a slow server, and they impose less load on the system by lowering the communication overhead. Thus, a common optimization is to choose an assignment of data records that co-locates records frequently accessed together by the same queries.
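To make the fan-out of a query concrete, here is a tiny Python sketch; the record-to-server assignment and the queries are made up for illustration:

```python
# Hypothetical assignment of data records to storage servers.
assignment = {"r1": "A", "r2": "A", "r3": "B", "r4": "C"}

def fan_out(query_records, assignment):
    """Number of distinct servers that must receive a request to answer the query."""
    return len({assignment[r] for r in query_records})

print(fan_out(["r1", "r2"], assignment))        # 1 -- both records live on server A
print(fan_out(["r1", "r3", "r4"], assignment))  # 3 -- one request per server
```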
Figure 1 demonstrates this issue with two measurements, one done with a synthetic setup and the other measured on a live system:
Here, t represents the average latency in the system. We can see how high fan-out queries suffer from higher latencies. Specifically, one can almost halve the average latency by reducing fan-out from 40 to 10.
Storage sharding can be modeled as a bipartite graph partitioning problem. In Figure 2, we represent the data items and the queries that fetch them as vertices in a bipartite graph. We would like to partition the data items into k “buckets”, such that a similar amount of data is assigned to each bucket and each query is connected to as few distinct buckets as possible, on average.
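In code, the quantity we would like to minimize can be sketched as follows; the toy graph, the bucket assignment, and the helper names are illustrative only:

```python
from collections import Counter

# Toy bipartite graph: each query lists the data items it fetches,
# and each data item is assigned to one of k = 2 buckets.
queries = {"q1": ["d1", "d2"], "q2": ["d2", "d3", "d4"], "q3": ["d4"]}
buckets = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}

def average_fan_out(queries, buckets):
    """Average number of distinct buckets a query touches -- the objective to minimize."""
    return sum(len({buckets[d] for d in items}) for items in queries.values()) / len(queries)

def bucket_sizes(buckets):
    """Balance constraint: the buckets should hold similar amounts of data."""
    return Counter(buckets.values())

print(average_fan_out(queries, buckets))  # (1 + 2 + 1) / 3 ≈ 1.33
print(bucket_sizes(buckets))              # Counter({0: 2, 1: 2}) -- balanced
```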
Finding the optimal partition of a graph is often computationally hard. A typical heuristic is to start with some initial balanced partition and iteratively improve its quality by making local changes to the assignment of specific vertices: during an iteration, each vertex estimates its preference for belonging to each of the other buckets, which we refer to as a ‘gain’. If the gain of moving to another bucket is higher than that of the current assignment, we attempt to move the vertex to that bucket while retaining overall balance.
This technique is often too weak for fan-out optimization. Figure 3 demonstrates this issue, where the dotted V1 and V2 sets indicate the two buckets:
Each query (q1, q2, q3) accesses exactly 2 buckets, and so the average fan-out is 2. This is not optimal, since swapping the assignment of vertices 3 and 4 with vertices 5 and 6 will reduce the fan-out of q1 and q2 to 1, resulting in an average fan-out of 1⅓. However, from the point of view of each of the vertices, there is no “gain” in moving to a different bucket: the fan-out of the queries relying on that vertex will stay the same.
In our work, we addressed this problem by ‘smoothing’ the optimization goal: instead of assuming that a query vertex fans out to all of its data items, we assume it accesses each data item independently with some probability p. This allows expressing the gain of a vertex v from moving from bucket i to bucket j as the expected reduction in fan-out under this model:

$$\mathrm{gain}(v, i \rightarrow j) = \sum_{q \in N(v)} p\left[(1-p)^{\,n_i(q)-1} - (1-p)^{\,n_j(q)}\right],$$

where N(v) is the set of queries accessing v, and n_i(q) is the number of data items query q accesses in bucket i. With this form of gain estimation, a vertex tends to favor buckets that host other vertices with common queries, even if the actual fan-out does not change following such a movement.
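As a concrete illustration, a small Python sketch of this gain computation could look like the following; the helper name move_gain and the data structures are hypothetical, and the production system computes the same quantity inside a distributed Giraph job:

```python
def move_gain(v, i, j, queries_of, bucket_counts, p):
    """Smoothed gain of moving data-item vertex v from bucket i to bucket j.

    queries_of[v]       -> the queries that access data item v, i.e., N(v)
    bucket_counts[q][b] -> n_b(q): number of query q's data items currently in bucket b
    p                   -> assumed probability that a query accesses each of its items
    """
    gain = 0.0
    for q in queries_of[v]:
        n_i = bucket_counts[q][i]  # v itself is counted here
        n_j = bucket_counts[q][j]
        # Expected reduction in the number of buckets q touches if v moves from i to j.
        gain += p * ((1 - p) ** (n_i - 1) - (1 - p) ** n_j)
    return gain
```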
We can now roughly express our algorithm in the following manner:
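At a high level, each iteration recomputes the bucket counts, lets every vertex propose its best move, and then performs balanced swaps. Below is a minimal single-machine sketch that reuses the move_gain helper above; the data structures and iteration count are illustrative, and the production version executes these steps as a distributed Giraph computation:

```python
from collections import defaultdict

def refine(vertices, queries_of, assignment, k, p, iterations=20):
    """Single-machine sketch of the balanced refinement loop.

    vertices    -> the data-item vertices
    queries_of  -> queries_of[v]: the queries that access data item v
    assignment  -> assignment[v]: current bucket of v (0..k-1), initially balanced
    """
    for _ in range(iterations):
        # Recompute n_b(q): how many of query q's data items sit in each bucket.
        bucket_counts = defaultdict(lambda: defaultdict(int))
        for v in vertices:
            for q in queries_of[v]:
                bucket_counts[q][assignment[v]] += 1

        # Each vertex proposes its single best move, scored with the smoothed gain.
        proposals = defaultdict(list)  # (from_bucket, to_bucket) -> [(gain, vertex), ...]
        for v in vertices:
            i = assignment[v]
            best_j, best_gain = None, 0.0
            for j in range(k):
                if j == i:
                    continue
                g = move_gain(v, i, j, queries_of, bucket_counts, p)
                if g > best_gain:
                    best_j, best_gain = j, g
            if best_j is not None:
                proposals[(i, best_j)].append((best_gain, v))

        # Balanced swaps: for every bucket pair, pair up the highest-gain candidates
        # moving in opposite directions, so bucket sizes never change.
        for i in range(k):
            for j in range(i + 1, k):
                fwd = sorted(proposals[(i, j)], key=lambda t: t[0], reverse=True)
                bwd = sorted(proposals[(j, i)], key=lambda t: t[0], reverse=True)
                for (_, a), (_, b) in zip(fwd, bwd):
                    assignment[a], assignment[b] = j, i
    return assignment
```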
The problem of fan-out minimization is equivalent to balanced hypergraph partitioning. There are several available hypergraph partitioning frameworks, but we found that they are unable to handle the scale of our graphs.
Our solution is built on top of Apache Giraph and is carefully designed to scale with the size of the graph and the desired number of buckets: vertex movement evaluation can be done in a distributed manner after vertices communicate their assigned positions to their neighbors. Swap operations can be done in a distributed manner as well, by distributing (i, j) bucket pairs to different hosts, or by approximating the decision using a probabilistic choice similar to the one we described in our previous post. In practice, we found that the distributed implementation of the algorithm can handle graphs 100x bigger than the largest instances the existing approaches can handle, without sacrificing optimization quality.
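As a rough illustration of the probabilistic variant (an assumption-laden sketch rather than our exact implementation): each candidate accepts its move with a probability chosen so that the expected number of moves in the two directions of a bucket pair is equal, which preserves balance in expectation without pairing candidates through a central coordinator. The proposals structure below is the same as in the sketch above.

```python
import random

def probabilistic_swap(proposals, assignment):
    """Coordination-free approximation of the swap step.

    proposals[(i, j)] lists (gain, vertex) pairs that want to move from bucket i
    to bucket j. Each candidate moves independently with probability
    min(|S_ij|, |S_ji|) / |S_ij|, so the expected number of moves in each
    direction matches and bucket sizes stay balanced in expectation.
    """
    for (i, j), candidates in proposals.items():
        reverse = proposals.get((j, i), [])
        if not candidates or not reverse:
            continue
        accept_prob = min(len(candidates), len(reverse)) / len(candidates)
        for _, v in candidates:
            if random.random() < accept_prob:
                assignment[v] = j
    return assignment
```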
Figure 5 shows the run time of our algorithm (in two variants: SHP-2 and SHP-k) in comparison to other available partitioning frameworks, over three different graphs with varying numbers of edges (10M to 5B) and different bucket counts. On smaller instances, SHP is often the fastest partitioner; on larger instances, SHP is the only partitioner that manages to compute any solution in a reasonable amount of time.
Figure 6 shows the quality of the SHP partitioner in comparison to other frameworks. As we don’t know the optimal average fan-out for the networks we use, we measure the percentage increase in fan-out compared to the result of the best algorithm. For the larger instances, the SHP partitioners are often the best; for the smaller instances, they are at most 12% worse than the best approach.
The fan-out reduction model is applicable to many infrastructure optimization problems at Facebook, like data sharding, query routing and index compression.
We use the Social Hash Partitioner regularly for reducing fan-out in graphs with billions of vertices and trillions of edges. Our experiments demonstrate that using these outputs for data sharding in a distributed system can reduce CPU consumption by half. The VLDB’17 paper provides more details.
The partitioner is open-sourced as a Giraph application and is available for use and evaluation. An in-depth description of how the partitioner is used to optimize systems’ performance at Facebook can be found in our NSDI’16 paper.