Kevin Liou is a Research Scientist within Core Data Science, a research and development team focused on improving Facebook’s processes, infrastructure, and products.
Companies routinely turn to A/B testing when evaluating the effectiveness of their product changes. Also known as a randomized field experiment, A/B testing has been used extensively over the past decade to measure the causal impact of product changes or service variants, and it has proved to be an important factor in sound business decision-making.
With increased adoption of A/B testing, proper analysis of experimental data is crucial to decision quality. Successful A/B tests must exhibit sensitivity — they must be capable of detecting effects that product changes generate. From a hypothesis-testing perspective, experimenters aim to have high statistical power, or the likelihood that the experiment will detect a nonzero effect when such an effect exists.
In our paper, “Variance-weighted estimators to improve sensitivity in online experiments,” we focus on increasing the sensitivity of A/B tests by attempting to understand the inherent uncertainty introduced by individual experimental units. To leverage this information, we propose directly estimating the pre-experiment individual variance for each unit. For example, if our target metric is “time spent by someone on the site per day,” we may want to give more weight to those who previously exhibited lower variance for this metric through their more consistent usage of the product. We can estimate the variance of a person’s daily time spent during the month before the experiment and assign weights that are higher for people with less noisy behaviors.
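As a rough illustration of the idea (a minimal sketch, not code from our experimentation platform, using hypothetical column names such as user_id, time_spent, treated, and y), one could estimate each user's pre-experiment variance and use its inverse as a weight when comparing test and control, so that people with less noisy behavior count for more:

```python
import numpy as np
import pandas as pd


def estimate_user_variance(pre_logs: pd.DataFrame) -> pd.Series:
    """Sample variance of daily time spent per user over the pre-experiment month.

    Assumes one row per user-day with columns `user_id` and `time_spent`
    (hypothetical schema).
    """
    return pre_logs.groupby("user_id")["time_spent"].var(ddof=1)


def variance_weighted_effect(outcomes: pd.DataFrame, user_var: pd.Series) -> float:
    """Inverse-variance-weighted difference in means between test and control.

    Assumes `outcomes` has columns `user_id`, `treated` (0/1), and `y`
    (the in-experiment metric value for each user).
    """
    df = outcomes.copy()
    df["pre_var"] = df["user_id"].map(user_var)
    df["w"] = 1.0 / df["pre_var"].clip(lower=1e-8)  # guard against zero variance

    def weighted_mean(group: pd.DataFrame) -> float:
        return float(np.average(group["y"], weights=group["w"]))

    return weighted_mean(df[df["treated"] == 1]) - weighted_mean(df[df["treated"] == 0])
```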
Applying our approach of using variance-weighted estimators to a corpus of real A/B tests at Facebook, we find substantial opportunity for variance reduction with minimal impact on the bias of treatment effect estimates. Specifically, our results show an average variance reduction of 17 percent, while bias is bounded within 2 percent. In addition, we show that these estimators can achieve further variance reduction when combined with other standard approaches, such as regression adjustment (also known as CUPED, a commonly used approach at Facebook), demonstrating that this method complements existing work. Our approach has been adopted in several experimentation platforms within Facebook.
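For readers unfamiliar with regression adjustment, the sketch below shows the standard CUPED adjustment (not our production implementation); the adjusted outcome can then be passed to a variance-weighted estimator, such as the one sketched above, in place of the raw metric. Here x is assumed to be the pre-experiment value of the same metric.

```python
import numpy as np
import pandas as pd


def cuped_adjust(df: pd.DataFrame) -> pd.Series:
    """Standard CUPED adjustment: y_adj = y - theta * (x - mean(x)).

    `x` is the pre-experiment value of the metric, `y` is its in-experiment
    value, and theta = Cov(x, y) / Var(x) is estimated on the pooled data.
    """
    theta = np.cov(df["x"], df["y"])[0, 1] / df["x"].var(ddof=1)
    return df["y"] - theta * (df["x"] - df["x"].mean())
```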
There are several ways in which one can estimate the variance for each unit, and this is still an active area of research. We studied unpooled estimators that use the pre-experiment user-level sampling variance, machine learning models that predict out-of-sample variance from features, and empirical Bayes estimators that pool information across users of our platform.
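As an example of the pooled flavor, the sketch below shrinks each user's noisy sample variance toward the population average, with users who have few pre-experiment observations pulled more strongly toward the grand mean; the prior_strength pseudo-count is an illustrative tuning knob, not a value from the paper.

```python
import pandas as pd


def pooled_user_variance(pre_logs: pd.DataFrame, prior_strength: float = 10.0) -> pd.Series:
    """Precision-weighted blend of per-user sample variance and the grand mean.

    Uses the same hypothetical `user_id` / `time_spent` schema as above.
    """
    stats = pre_logs.groupby("user_id")["time_spent"].agg(n="count", s2="var")
    grand_mean = stats["s2"].mean()
    weight = stats["n"] / (stats["n"] + prior_strength)
    return weight * stats["s2"].fillna(grand_mean) + (1.0 - weight) * grand_mean
```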
Statistically, we prove that the amount of variance reduction one can achieve when weighting by variance is a function of the coefficient of variation of the experimental users' variances, or, roughly, how variable people are in their variability. Details of this proof can be found in our paper.
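Concretely, the quantity in question is just the standard deviation of the user-level variances divided by their mean; computing it from the pre-experiment variance estimates is a quick way to gauge how much room variance weighting has to help:

```python
import numpy as np


def variance_cv(user_variances) -> float:
    """Coefficient of variation of the user-level variances.

    A larger value means users differ more in how noisy they are, which is the
    regime where variance weighting can reduce variance the most.
    """
    v = np.asarray(user_variances, dtype=float)
    return float(v.std(ddof=1) / v.mean())
```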
We tested these approaches on Facebook data and experiments. Figure 1, below, shows how better estimates of in-experiment unit-level variance provide much larger variance reductions. Poorer models of user-level variance can actually increase the variance of the estimator, so good estimation is important. To demonstrate that variance-weighted estimators are likely to be useful in practical settings, we collected 12 popular metrics used in A/B tests at Facebook (such as likes, comments, and posts shared) to estimate the predictability of each metric's variance and its coefficient of variation. The results, shown in Figure 2, indicate that the variance of most of the metrics is highly predictable (as measured by R^2). In addition, the coefficients of variation of the variances are large enough for a variance-weighted estimator to be effective.
We took a sample of 100 Facebook A/B tests that aimed to increase time spent, with an average sample size of around 500,000 users per test. Before analyzing the results of each test, we assembled the daily time spent for each user in the month prior to the experiment and estimated each user's variance. To assess how accurate these estimates were, we measured how well the pre-experiment variance correlated with the post-experiment variance. The results showed an R^2 of 0.696 and a Pearson correlation of 0.754, indicating that pre-experiment variances, when calculated over an extended period of time, are reasonable estimates of post-experiment variance.
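A simple version of this diagnostic can be computed directly from per-user variance estimates; the helper below is hypothetical, and the R^2 quoted above comes from the paper's own variance model rather than from a one-variable fit like this one.

```python
import numpy as np


def variance_predictability(pre_var, post_var):
    """Correlate pre-experiment variance with in-experiment variance per user.

    Returns the Pearson correlation and the R^2 of a simple linear fit (for a
    one-variable fit, R^2 is just the squared correlation).
    """
    pre = np.asarray(pre_var, dtype=float)
    post = np.asarray(post_var, dtype=float)
    r = float(np.corrcoef(pre, post)[0, 1])
    return r, r ** 2
```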
Next, for each experiment, we ranked all users by their estimated variance and applied stratification, as in Section 4.1 of our paper. To do this, we divided users into quantiles based on pre-experiment estimated variance and then calculated the sample variance of the experiment for various numbers of quantiles. Across all experiments, we found an average decrease in variance of 17 percent, with less than 2 percent bias. We also found that our approach works well with other popular variance reduction approaches, such as CUPED. Table 1, below, shows that we can achieve close to 50 percent variance reduction when both approaches are used together.
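A minimal sketch of the stratified analysis described above, assuming a DataFrame with hypothetical columns pre_var, treated, and y, might look as follows; it approximates the Section 4.1 procedure rather than reproducing the exact estimator from the paper.

```python
import numpy as np
import pandas as pd


def stratified_effect(df: pd.DataFrame, n_strata: int = 10) -> float:
    """Quantile-stratified, inverse-variance-weighted treatment effect estimate.

    Users are binned into quantiles of their estimated pre-experiment variance,
    each stratum's test-minus-control difference is computed, and the strata are
    combined with inverse-variance weights.
    """
    df = df.copy()
    df["stratum"] = pd.qcut(df["pre_var"], q=n_strata, labels=False, duplicates="drop")

    deltas, weights = [], []
    for _, g in df.groupby("stratum"):
        t = g.loc[g["treated"] == 1, "y"]
        c = g.loc[g["treated"] == 0, "y"]
        if len(t) < 2 or len(c) < 2:
            continue  # skip strata without enough units in both arms
        deltas.append(t.mean() - c.mean())
        weights.append(1.0 / (t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c)))
    return float(np.average(deltas, weights=weights))
```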
There are several opportunities to explore in future work. In particular, there may be significant gains from devising conditional variance models that estimate variance more accurately. Figure 1 showed in simulations how higher-quality variance estimates improve variance reduction, suggesting that very large gains are possible with more precise estimation. Moreover, we would like to understand how variance-weighted estimators may improve the variance reduction obtained from other approaches (such as machine learning-based methods), as well as analytically understand the interactions when multiple variance reduction approaches are used at once.