Experimentation is a central part of data-driven product development, yet in practice the results from experiments may be too imprecise to be of much help in improving decision-making. One possible response is to reduce statistical noise by simply running larger experiments. However, this is not always desirable, or even feasible. This raises the question of how we can make better use of the data we have and get sharper, more precise experimental estimates without having to enroll more people in the test.
In a collaboration between Meta’s Core Data Science and Experimentation Platform teams, we developed a new methodology for making progress on this problem, one that has formal statistical guarantees and is scalable enough to implement in practice. The work, described in detail in our NeurIPS paper, allows general machine learning (ML) techniques to be used in conjunction with experimental data to substantially increase the precision of experimental estimates relative to existing methods.
Our algorithm, MLRATE (machine learning regression-adjusted treatment effects), involves two main steps. First, we train a model that predicts the experimental outcome of interest from a set of pre-experiment covariates. Second, we use these predictions as a control variable in a linear regression. The coefficient on the treatment indicator in this regression is our variance-reduced average treatment effect estimator.
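Concretely, the second step can be written as a single estimating equation. As a sketch (the precise specification is in the paper), with $T_i$ the treatment indicator, $g(X_i)$ the cross-fitted prediction for unit $i$, and $\bar{g}$ its sample mean:

$$Y_i = \alpha_0 + \alpha_1 T_i + \alpha_2\big(g(X_i) - \bar{g}\big) + \alpha_3\, T_i\big(g(X_i) - \bar{g}\big) + \varepsilon_i$$

The estimate $\hat{\alpha}_1$ is the MLRATE average treatment effect estimator.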
In the first step, we use sample splitting, so that the predicted outcome for each observation is generated by a model trained on data that excludes that observation. This allows us to use a broad class of ML methods in the first step and gives us the flexibility to choose whichever model does the best job of predicting outcomes. The ML method may even be asymptotically biased, with predictions that do not converge to the truth in large samples, without affecting the validity of our estimator.
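To make the sample splitting concrete, here is a minimal sketch in Python using scikit-learn, where cross_val_predict fits on K-1 folds and predicts on the held-out fold. The data and names (X, y, t, g_hat) are illustrative, not from our production pipeline:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))                 # pre-experiment covariates
t = rng.integers(0, 2, size=n)              # randomized treatment indicator
y = X[:, 0] + 0.1 * t + rng.normal(size=n)  # synthetic outcome

# Each unit's prediction comes from a model trained on folds that
# exclude that unit, so g_hat never uses a unit's own outcome.
g_hat = cross_val_predict(GradientBoostingRegressor(), X, y, cv=5)
```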
In the second step, we treat the predictions from the first step as a control variable in a linear regression. This form of linear regression adjustment is relatively common in the analysis of experimental data (e.g., Lin [2013], Deng et al. [2013]). The contribution of our paper is to show how this methodology can be generalized to accommodate control variables that are themselves the output of a potentially complex ML algorithm.
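Continuing the sketch above, the second step is ordinary least squares with the demeaned predictions as a control, plus a treatment-by-prediction interaction in the spirit of Lin [2013], and heteroskedasticity-robust standard errors. This is an illustration under the assumed setup above, not the production implementation:

```python
import numpy as np
import statsmodels.api as sm

g_tilde = g_hat - g_hat.mean()               # demeaned ML predictions
design = sm.add_constant(np.column_stack([t, g_tilde, t * g_tilde]))
fit = sm.OLS(y, design).fit(cov_type="HC1")  # robust standard errors

ate_hat = fit.params[1]                      # coefficient on the treatment indicator
ci_lower, ci_upper = fit.conf_int()[1]       # 95% confidence interval for the ATE
```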
To quantify the variance reduction gains one might expect from MLRATE in practice, we implemented it in A/A tests for a set of 48 outcome metrics commonly monitored in Meta experiments. Using either gradient-boosted decision trees or elastic net regression for the ML prediction step, we find that MLRATE has, on average, over 70 percent lower variance than the simple difference-in-means estimator for these metrics, and about 19 percent lower variance than the common univariate procedure, which adjusts only for pre-experiment values of the outcome.
Alternatively, to achieve the same precision as MLRATE, the conventional difference-in-means estimator would require sample sizes over five times as large on average across metrics, and the univariate linear regression procedure would require sample sizes about 1.6 times as large. The figure above displays the metric-level distribution of confidence interval widths relative to the univariate adjustment case. There is substantial heterogeneity in performance across metrics: For some, ML regression adjustment delivers only modest gains relative to univariate adjustment; for others, it drastically shrinks confidence intervals. This is natural given the variety of metrics in the analysis: Some metrics, especially binary or discrete outcomes, benefit substantially from more sophisticated predictive modeling, whereas for others, simple linear models may already perform well.
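The mapping from variance reduction to sample size follows from the usual 1/n scaling of estimator variance: cutting variance by a fraction vr is equivalent to multiplying the sample size by 1/(1 - vr). (The averages quoted above are taken over per-metric multipliers, which is why they can exceed the multiplier implied by the average variance reduction.) A quick illustration:

```python
# For estimators whose variance scales as 1/n, matching a variance
# reduced by fraction vr requires n / (1 - vr) observations.
def equivalent_sample_multiplier(vr: float) -> float:
    return 1.0 / (1.0 - vr)

for vr in (0.5, 0.7, 0.9):
    print(f"{vr:.0%} lower variance -> "
          f"{equivalent_sample_multiplier(vr):.1f}x sample size")
# 50% lower variance -> 2.0x sample size
# 70% lower variance -> 3.3x sample size
# 90% lower variance -> 10.0x sample size
```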
Several features of this methodology make it relatively straightforward to implement in practice. First, the formulas for calculating treatment effect estimators and confidence intervals are no more complex than they are in the case of conventional linear regression adjustment. Second, most common off-the-shelf ML methods can be used for the prediction stage, as long as the covariates used are measured pre-experiment. Finally, MLRATE does not require an investment in ML modeling for each individual experiment to work well. Once predictive models have been trained for an outcome of interest, they can be used across many experiments, so the cost of ML training does not scale with the number of experiments.
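For instance, once the regression above is estimated, a 95 percent confidence interval for the treatment effect takes the familiar form

$$\hat{\alpha}_1 \pm z_{0.975}\,\widehat{\mathrm{se}}\big(\hat{\alpha}_1\big)$$

where $\widehat{\mathrm{se}}(\hat{\alpha}_1)$ is a heteroskedasticity-robust standard error, exactly the computation one would perform for any regression-adjusted estimator.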
If you’re dealing with the problem of excessive noise in your experiments and you can construct good predictors of the outcome of interest, MLRATE may be a helpful new tool for variance reduction. Depending on the metric, it may even be the difference between experimentation being feasible or not. For more details, check out our NeurIPS paper.