The preferred approach in online experimentation is usually audience-level testing, where we divide people into a test group and a control group. The two groups receive different experiences, and we then compare outcomes across them. The test group might, for example, see a differently designed page layout, or have access to a newly introduced feature.
This kind of experiment design, while desirable, is often tricky to pull off online. Most websites use cookies, small pieces of data stored in the web browser, as a stand-in for the person accessing the site. Cookies, rather than people, are then assigned to test and control groups for the experiment.
This can cause problems for understanding the impact of the experiment. If people access a website from different devices or browsers, they might have some cookies assigned to the test group and others assigned to the control group. For example, they may see the test experience on their phone, but the control experience on their laptop. Moreover, we have no way of mapping cookie-level outcomes back to people. Instead of comparing outcomes across treated and untreated people, we’re comparing outcomes across test and control cookies, and cookies in each group belong to people who may have cookies in the other group too!
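The contamination described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the method used in the paper. Every parameter (`n_people`, `max_cookies`, `base_rate`, `lift`) is hypothetical, and the outcome model is assumed for illustration. A person's lift is taken to scale with the share of their cookies in the test group, and each conversion is logged on one of their cookies chosen at random.

```python
import random


def simulate(n_people=20000, max_cookies=4, base_rate=0.10, lift=0.05, seed=0):
    """Toy comparison of person-level and cookie-level randomization.

    All parameters are hypothetical. Returns the measured difference in
    conversion rate (test minus control) under each design.
    """
    rng = random.Random(seed)
    # Each cell holds [conversions, units] for one group of one design.
    person = {"test": [0, 0], "control": [0, 0]}
    cookie = {"test": [0, 0], "control": [0, 0]}

    for _ in range(n_people):
        # Person-level design: the whole person is in exactly one group.
        group = rng.choice(("test", "control"))
        p_convert = base_rate + (lift if group == "test" else 0.0)
        person[group][0] += int(rng.random() < p_convert)
        person[group][1] += 1

        # Cookie-level design: each of the person's cookies is randomized
        # independently, so exposure can be mixed.
        n_cookies = rng.randint(1, max_cookies)
        groups = [rng.choice(("test", "control")) for _ in range(n_cookies)]
        share_treated = groups.count("test") / n_cookies
        # Assumption: the lift scales with the share of treated cookies.
        converted = rng.random() < base_rate + lift * share_treated
        # Assumption: a conversion is logged on one cookie chosen at random.
        converting_cookie = rng.randrange(n_cookies)
        for i, g in enumerate(groups):
            cookie[g][0] += int(converted and i == converting_cookie)
            cookie[g][1] += 1

    def effect(cells):
        test, control = cells["test"], cells["control"]
        return test[0] / test[1] - control[0] / control[1]

    return effect(person), effect(cookie)


person_effect, cookie_effect = simulate()
print(f"person-level effect: {person_effect:.4f}")
print(f"cookie-level effect: {cookie_effect:.4f}")
```

Under these assumptions the cookie-level difference comes out well below the person-level one, because test cookies often belong to people who were only partly treated and control cookies often belong to people who were partly treated too. The size of the gap depends entirely on the made-up parameters, so it should not be read as an estimate of the real-world bias.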
How problematic is this in practice? In a recent paper we show that cookie-level tests underestimate the true audience-level effects by a factor of about three. To attain the same statistical power with a cookie-level test as with a person-level test, advertisers would need two to three times as many people in their test.
Our simulations are based on actual advertising tests run on the Facebook platform. Facebook observes the same user across multiple devices, so advertisers and businesses on the Facebook platform can run audience-level tests without resorting to cookies. We also observe cookie assignments with Facebook’s Atlas technology, allowing us to simulate the bias from cookie-level experiments.
The more cookies each user has, the more pronounced the problem becomes. The histogram of cookies per person for July 2015 is below:
Slightly over half of users have more than one cookie over this period, which means they could be randomized into different treatments in a cookie-level test. The single-cookie group only contributes about 15% of the total cookies.
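To get a feel for how quickly the risk of mixed assignment grows with the number of cookies, here is a small sketch. It assumes each cookie is randomized independently with the same probability of landing in the test group; the function name `prob_split` and the default 50/50 split are illustrative choices, not details from the tests described here.

```python
def prob_split(n_cookies, test_share=0.5):
    """Probability that a person's cookies end up in both groups.

    Assumes each cookie is assigned independently, with probability
    `test_share` of landing in the test group.
    """
    all_test = test_share ** n_cookies
    all_control = (1 - test_share) ** n_cookies
    return 1 - all_test - all_control


for k in range(1, 5):
    print(k, prob_split(k))
# 1 0.0
# 2 0.5
# 3 0.75
# 4 0.875
```

With a 50/50 split, a person with two cookies has an even chance of seeing both experiences, and the chance rises quickly with each additional cookie.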
So what’s the solution? If you can run true, audience-level tests, you should do so. You’ll avoid the attenuation bias that cookie-level tests introduce, and get more precise estimates. Conversion Lift is Facebook’s solution for running audience-level tests. Such tests are not always possible, and cookie-level tests are certainly better than nothing. Be aware, however, that a failure to find a significant effect may be due to the experimental design rather than the absence of any underlying effect.