This project is a collaboration among the Facebook Core Data Science team, the Experimentation Platform team, and the Messenger team.
Experimentation is ubiquitous in online services such as Facebook, where the effects of product changes are explicitly tested and analyzed in randomized trials. Interference, sometimes referred to as network effects in the context of online social networks, threatens the validity of these randomized trials because it violates the stable unit treatment value assumption (SUTVA) on which their analysis relies. Colloquially, interference means that an experimental unit’s response to an intervention depends not just on its own treatment, but also on other units’ treatments. For example, consider a food delivery marketplace that tests a treatment that causes users to order deliveries faster. This could reduce the supply of delivery drivers available to users in the control group, depressing control-group orders and leading the experimenter to overstate the effect of the treatment.
Figure 1. An illustrative cartoon showing potential interference between test and control units and how cluster randomization accounts for the within-cluster interference.
In our paper we propose a network experimentation framework that accounts for partial interference between experimental units through cluster randomization (Fig. 1). The framework has been deployed at scale at Facebook, is as easy to use as a conventional A/B test, and has been used by many product teams to measure the effects of product changes. On the design side, we find that imbalanced clusters often offer a better bias-variance trade-off than the balanced clusters commonly used in past research. On the analysis side, we introduce a cluster-based regression adjustment that substantially improves the precision of treatment effect estimates, and our estimation procedure includes a test for interference. In addition, we show how logging which units actually receive the treatment, so-called trigger logging, can be leveraged for further variance reduction.
While interference is a widely acknowledged issue with online field experiments, there is less evidence from real-world experiments demonstrating interference in online settings. By running many network experiments, we have found a number of experiments with apparent and substantive SUTVA violations. In our paper, two experiments, a Stories experiment using social graph clustering and a Commuting Zones experiment based on geographic clustering, are described in detail, showing significant network effects and demonstrating the value of this experimentation framework.
The design of network experimentation has two primary components: treatment assignment and clustering of experimental units. Treatment assignment is depicted in Figure 2, which should be read from left to right. A clustering of experimental units, represented by larger circles encompassing colored dots for units, is taken as input. A given clustering and its associated units form a universe, the population under consideration. The clusters are deterministically hashed into universe segments based on the universe name, and the segments are then allocated to experiments. Universe segments allow a universe to contain multiple mutually exclusive experiments at any given time, a requirement for a production system used by engineering teams. After allocation to an experiment, segments are split, via a deterministic hash based on the experiment name, into unit-randomized and/or cluster-randomized segments. The final condition allocation deterministically hashes units or clusters into treatment conditions, depending on whether the segment has been allocated to unit or cluster randomization. This final hash produces the treatment vector used for the experiment.
Figure 2. Visualization of the network experiment randomization process.
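The three hashing steps of the randomization process can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the production implementation: the helper names (`stable_bucket`, `assign`), the use of SHA-256, the 100-segment universe, the 50 percent cluster share, and the `":condition"` salt are all hypothetical choices made for the example.

```python
import hashlib
from typing import Optional, Set, Tuple


def stable_bucket(key: str, salt: str, n_buckets: int) -> int:
    """Deterministically map (salt, key) to an integer in [0, n_buckets)."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def assign(
    unit_id: str,
    cluster_id: str,
    universe: str,
    experiment: str,
    experiment_segments: Set[int],
    n_segments: int = 100,          # hypothetical number of universe segments
    cluster_share_pct: int = 50,    # hypothetical share of cluster-randomized segments
    conditions: Tuple[str, ...] = ("control", "test"),
) -> Optional[Tuple[int, str, str]]:
    """Return (segment, randomization mode, condition), or None if not in the experiment."""
    # Step 1: hash the cluster into a universe segment, salted by the universe name.
    segment = stable_bucket(cluster_id, universe, n_segments)
    if segment not in experiment_segments:
        return None  # this segment is allocated to a different experiment (or none)

    # Step 2: hash the segment into unit or cluster randomization,
    # salted by the experiment name.
    is_cluster = stable_bucket(str(segment), experiment, 100) < cluster_share_pct
    mode = "cluster" if is_cluster else "unit"

    # Step 3: hash the cluster (cluster randomization) or the unit
    # (unit randomization) into a treatment condition.
    key = cluster_id if is_cluster else unit_id
    index = stable_bucket(key, f"{experiment}:condition", len(conditions))
    return segment, mode, conditions[index]


if __name__ == "__main__":
    segments = set(range(100))  # toy case: the experiment owns every segment
    for unit, cluster in [("u1", "c7"), ("u2", "c7"), ("u3", "c9")]:
        print(unit, cluster, assign(unit, cluster, "social_graph_v1", "stories_test", segments))
```

Because every step is a pure function of names and identifiers, the same unit always lands in the same condition, and all units of a cluster share a segment. In a cluster-randomized segment they also share a condition, which is what keeps within-cluster interference inside a single treatment arm.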
The other main component of network experimentation is clustering of experimental units. An ideal clustering will include all interference within clusters so that there is no interference between clusters, which removes the bias in our estimators. A naive approach that captures all interference is grouping all units into a giant single cluster. This is unacceptable, though, since a cluster-randomized experiment should also have enough statistical power to detect treatment effects. A single cluster including all units has no power, and a clustering that puts every unit in its own cluster, equivalent to unit randomization, leads to good power but captures no interference. This is essentially a bias-variance trade-off: More captured interference leads to less bias, while more statistical power requires smaller clusters. In our paper, we consider two prototypical clustering algorithms due to their scalable implementation: Louvain community detection and recursive balanced partitioning. We find that imbalanced graph clusters generated by Louvain are typically superior in terms of the bias-variance trade-off for graph-cluster randomization.
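To make the trade-off concrete, the toy sketch below compares three clusterings of a small graph. It uses the share of edges that fall within clusters as a rough proxy for captured interference, and the number of clusters as a rough proxy for statistical power. Both proxies, the graph, and the function name `within_cluster_edge_share` are our own illustrative assumptions, not the cluster evaluation procedure from the paper.

```python
from typing import Dict, Hashable, Iterable, Tuple


def within_cluster_edge_share(
    edges: Iterable[Tuple[Hashable, Hashable]],
    cluster_of: Dict[Hashable, Hashable],
) -> float:
    """Fraction of edges whose two endpoints are assigned to the same cluster."""
    edge_list = list(edges)
    inside = sum(1 for u, v in edge_list if cluster_of[u] == cluster_of[v])
    return inside / len(edge_list)


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
nodes = range(6)

clusterings = {
    "one giant cluster": {n: 0 for n in nodes},
    "two communities": {n: 0 if n < 3 else 1 for n in nodes},
    "singletons (unit randomization)": {n: n for n in nodes},
}

if __name__ == "__main__":
    for name, cluster_of in clusterings.items():
        share = within_cluster_edge_share(edges, cluster_of)
        n_clusters = len(set(cluster_of.values()))
        print(f"{name}: {n_clusters} clusters, {share:.2f} of edges within clusters")
```

The giant cluster keeps every edge inside a cluster but leaves only one randomization unit, while singletons give six randomization units and keep no edges. The two-community clustering sits in between, keeping six of the seven edges with two randomization units.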
We are mainly interested in the average treatment effect (ATE) of an intervention (a product change or a new feature), that is, the average effect when the intervention is applied to all users. Many methods exist for estimating the ATE in cluster-randomized trials, from methods based on cluster-level summaries, to mixed effect models, to generalized estimating equations. For ease of implementation at scale and for explainability, our framework uses the difference-in-means estimator, i.e., test_mean - control_mean. The details of the estimands and estimators can be found in our paper. Here we briefly present our two methodological innovations for variance reduction: agnostic regression adjustment and trigger logging (logging the units that receive the intervention). Variance reduction is essential because cluster-randomized experiments typically have less power than unit-randomized ones. In our framework, we use the contrast across conditions of pretreatment metrics as covariates to perform regression adjustment. We show that the adjusted estimator is asymptotically unbiased and has a much smaller variance. Additionally, trigger logging allows us to estimate the ATE using only the units actually exposed in the experiment. Under mild assumptions, we show that the ATE on the exposed units is equivalent to the ATE on all units that are assigned to the experiment. Fig. 3 shows, for seven metrics in a Stories experiment, how point estimates and confidence intervals (CIs) change when we perform an intent-to-treat (ITT) analysis on triggered clusters instead of triggered users, and when we omit regression adjustment. The variance reduction from regression adjustment and trigger logging is significant.
Figure 3. Comparison of ATE estimates with scaled 95 percent confidence intervals computed on triggered users and triggered clusters (ITT), with and without regression adjustment (RA) for cluster test and control in a Stories experiment.
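As a rough illustration of why adjusting for pretreatment metrics helps, the sketch below computes a plain difference in means and a single-covariate regression-adjusted difference on toy cluster-level data. The adjustment shown is a standard textbook ANCOVA-style correction, which subtracts the pooled within-arm slope times the pretreatment contrast. It is a simplified stand-in and not the cluster-based estimator from the paper. The function names and the data are hypothetical.

```python
from statistics import mean
from typing import Sequence


def diff_in_means(test: Sequence[float], control: Sequence[float]) -> float:
    """Plain difference-in-means estimate: test_mean - control_mean."""
    return mean(test) - mean(control)


def adjusted_diff_in_means(
    y_test: Sequence[float],
    y_control: Sequence[float],
    x_test: Sequence[float],
    x_control: Sequence[float],
) -> float:
    """Difference in means of y, corrected for the contrast in a pretreatment metric x.

    The slope is the pooled within-arm OLS slope of y on x, so the result equals
    the treatment coefficient of a regression of y on treatment and x.
    """
    sxy = 0.0
    sxx = 0.0
    for ys, xs in ((y_test, x_test), (y_control, x_control)):
        y_bar, x_bar = mean(ys), mean(xs)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        sxx += sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return diff_in_means(y_test, y_control) - slope * diff_in_means(x_test, x_control)


# Toy cluster-level data (hypothetical): the outcome is 2 * pretreatment metric,
# plus a true effect of 0.5 in the test arm. The test arm happens to contain
# a cluster with a large pretreatment value, which inflates the raw contrast.
x_test = [1.0, 2.0, 3.0, 6.0]
x_control = [1.0, 2.0, 3.0, 2.0]
y_test = [2 * x + 0.5 for x in x_test]
y_control = [2 * x for x in x_control]

if __name__ == "__main__":
    print("unadjusted:", diff_in_means(y_test, y_control))
    print("adjusted:  ", adjusted_diff_in_means(y_test, y_control, x_test, x_control))
```

In this constructed example the raw contrast is driven mostly by the chance imbalance in the pretreatment metric, and the adjusted estimate recovers the effect of 0.5 that was built into the data. With real metrics the outcome is never an exact linear function of the covariate, so the adjustment reduces variance without removing it entirely.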
In this post, we describe a Commuting Zones experiment as an illustrative example. Commuting Zones, shown in Fig. 4, are a Facebook Data for Good product and can be used as a geographic clustering for network experiments at Facebook. For products like Jobs on Facebook (JoF), geographic clusters may be especially appropriate because individuals are likely to interact with employers close to their own physical location. To demonstrate the value of network experimentation, we conducted a mixed experiment, running unit-randomized and cluster-randomized experiments side by side, for a JoF product change that up-ranks jobs with few previous applications.
Figure 4. Facebook Commuting Zones in North America
Table 1. Commuting Zone experiment results
Table 1 summarizes the results of this experiment. In the user-randomized test, applications to jobs with no previous applications increased by 71.8 percent. The cluster-randomized conditions, however, showed that these estimates were upwardly biased, and we saw a 49.7 percent increase instead. This comparison benefited substantially from regression adjustment, which can reduce the confidence interval size in Commuting Zone experiments by over 30 percent.
By randomizing this experiment at the Commuting Zone level, the team also confirmed that changes to the user experience that increase this metric can cause employers to post more jobs on the platform (the probability that an employer posted another job increased 17 percent). Understanding the interactions between applicants and employers in a two-sided marketplace is important for the health of such a marketplace, and through network experiments we can better understand these interactions.
Experimentation with interference has been researched for many years due to its practical importance across different industries. Our paper introduces a practical framework for designing, implementing, and analyzing network experiments at scale. This framework allows us to better predict what will happen when we launch a product or ship a product change to Facebook apps.
Our implementation of network experimentation accommodates mixed experiments, cluster updates, and the need to support multiple concurrent experiments. The simple analysis procedure we present yields substantial variance reduction by leveraging trigger logging as well as our novel cluster-based regression-adjusted estimator. We also introduce a procedure for evaluating clusters, which indicates that the bias-variance trade-off favors imbalanced clusters and allows researchers to evaluate this trade-off for any clustering method they would like to explore. We hope that experimenters and practitioners find this framework useful in their applications and that insights from the paper will foster future research in the design and analysis of experiments under interference.