February 10, 2021

Validating symptom responses from the COVID-19 Survey with COVID-19 outcomes

By: Meta Research

In collaboration with Carnegie Mellon University (CMU) and the University of Maryland (UMD), Facebook has helped facilitate a large-scale, privacy-focused daily survey to monitor the spread and impact of the COVID-19 pandemic in the United States and around the world. The COVID-19 Survey is ongoing and is taken by about 45,000 people in the U.S. each day. Respondents provide information about COVID-related symptoms, vaccine acceptance, contacts and behaviors, risk factors, and demographics, allowing researchers to examine regional trends throughout the world. To date, the survey has collected more than 50 million responses worldwide.

In addition to visualizing this data on the Facebook Data for Good website, researchers can find publicly available aggregate data through the COVIDcast API and UMD API, and downloadable CSVs (USA, world). The analyses shown here are all based on publicly available data from CMU and other public data sources (e.g., the U.S. Census Bureau and the Institute for Health Metrics and Evaluation). Microdata is also available upon request to academic and nonprofit researchers under data license agreements.

Now that the survey has run for several months, the aggregated, publicly available data sets can be analyzed to determine important properties of the COVID-related symptom signals obtained through the survey. Here, we first investigate whether survey responses provide leading indicators of COVID-19 outbreaks. We find that survey signals related to symptoms can lead COVID-19-related deaths and even cases by many days, although the strength of the correlation can depend on population size and the height of the peak of the pandemic.

Following this observation, we analyzed the conditions under which these leading indicators are detectable. We find that small-sample-size statistics and the presence of a small but significant “confuser” signal can add noise and an offset to the COVID-19 Survey signals, obscuring actual changes in them.

Survey responses can provide leading indicators for COVID-19 outbreaks

For the following analyses, we used publicly available aggregate data from the CMU downloadable CSV that has been smoothed and weighted. We focus on illness indicators (symptoms) that surveyed individuals reported having personally or knowing about in their local community between May 1, 2020, and January 4, 2021. To determine whether symptom signals from the survey act as leading indicators of new COVID-19 cases or deaths, we take data at the U.S. state level, lag symptom signals in time, and note the correlation with COVID-19 outcomes (e.g., new daily cases or new daily deaths).
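The lag-and-correlate step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function names are hypothetical, the two series are synthetic bell curves rather than survey data, and a positive lag is assumed to mean that the symptom signal on day t is paired with the outcome on day t + lag.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def lagged_correlation(symptom, outcome, lag_days):
    """Correlate symptom[t] with outcome[t + lag_days] over overlapping days."""
    pairs = [
        (symptom[t], outcome[t + lag_days])
        for t in range(len(symptom))
        if 0 <= t + lag_days < len(outcome)
    ]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    return pearson_r(xs, ys)

# Synthetic daily series (assumption): the outcome peaks 12 days after the symptom signal.
days = range(120)
symptom = [math.exp(-((d - 50) / 15) ** 2) for d in days]
outcome = [math.exp(-((d - 62) / 15) ** 2) for d in days]

r_0 = lagged_correlation(symptom, outcome, 0)
r_12 = lagged_correlation(symptom, outcome, 12)
print(f"lag 0: r = {r_0:.2f}, lag 12: r = {r_12:.2f}")
```

In this toy case the correlation is highest when the lag matches the 12-day offset built into the synthetic data; with real state-level data the peak is less sharp.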

In the figure below, we show how Community CLI (COVID-like illness in the local community) from the survey is a leading signal of new daily deaths in Texas, as tabulated by the Institute for Health Metrics and Evaluation (IHME). In the upper row, we compare the estimated percentage of survey respondents who know people in their local community with CLI symptoms (fever along with cough, shortness of breath, or difficulty breathing) with new daily deaths over time when lagging the symptom signal by 0, 12, or 24 days. In the lower row, we plot the time-lagged Community CLI against new daily deaths and determine the Pearson correlation coefficient (Pearson’s r: 0.57, 0.86, 0.98, respectively).

We can use this approach with the various illness indicators captured in the survey and multiple lag times to determine how “leading” the signal is to COVID-19 outcomes, as in the figure below. In the upper row, we show time series plots of symptom signals in the survey (% CLI, % Community CLI, % CLI + Anosmia, and % Anosmia), new daily cases, and new daily deaths in Texas and Arizona from May 2020 through December 2020. In the lower row, we plot the Pearson correlation coefficient of symptoms and new daily deaths when lagging the symptom signal between -10 and 40 days.

For each U.S. state, we can approximate how leading a symptom signal is by determining the optimal time lag, or the time lag that gives the highest Pearson’s r for that symptom. However, this method will not find an optimal time lag when a region 1) has poor outcome ascertainment (e.g., insufficient testing), 2) is less populated and has too few survey samples (see below), or 3) has data only for one side of an outcome peak (e.g., cases constantly falling or constantly rising), as the optimal lag is ambiguous. In the figure below, we show the optimal time lag (days, mean ± 95 percent c.i.) for four symptom signals (CLI, Community CLI, CLI + Anosmia, and Anosmia) in 39 U.S. states with large populations that experienced large COVID-19 outbreaks.
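As a rough sketch of how an optimal time lag could be located, the snippet below scans lags from -10 to 40 days and keeps the one with the highest Pearson's r. The helper names, the `min_overlap` threshold, and the synthetic series are all assumptions made for illustration; the source does not describe its implementation.

```python
import math

def pearson_r(x, y):
    """Pearson's r, or None when either series has zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

def correlation_by_lag(symptom, outcome, lags, min_overlap=30):
    """Map each lag to Pearson's r between symptom[t] and outcome[t + lag]."""
    result = {}
    for lag in lags:
        pairs = [
            (symptom[t], outcome[t + lag])
            for t in range(len(symptom))
            if 0 <= t + lag < len(outcome)
        ]
        if len(pairs) < min_overlap:
            continue  # too few overlapping days to trust the estimate
        r = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])
        if r is not None:
            result[lag] = r
    return result

def optimal_lag(symptom, outcome, lags=range(-10, 41)):
    """Return (lag with the highest r, full lag-to-r mapping)."""
    by_lag = correlation_by_lag(symptom, outcome, lags)
    best = max(by_lag, key=by_lag.get)
    return best, by_lag

# Synthetic example (assumption): the outcome peaks 21 days after the symptom signal.
days = range(200)
symptom = [math.exp(-((d - 80) / 15) ** 2) for d in days]
outcome = [math.exp(-((d - 101) / 15) ** 2) for d in days]

best, by_lag = optimal_lag(symptom, outcome)
print("optimal lag (days):", best)
```

Note that a series covering only one side of a peak, such as one that rises throughout, tends to produce a nearly flat lag-to-r curve, so the maximum is poorly defined. This matches the third caveat listed above.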

While all four symptom signals lead new deaths by many days, the symptom signal CLI appears to lead new deaths by more time than new cases do (left, CLI: 21.3±3.0 days, new cases: 17.7±2.3 days). This is confirmed by running the same analysis for all four symptom signals with new daily cases as the outcome (right, CLI leads new cases by 8.2±4.0 days).
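For readers who want to produce a "mean ± 95 percent c.i." summary from their own per-state optimal lags, one possible approach is sketched below. It assumes a normal approximation (mean ± 1.96 standard errors); the source does not say how its intervals were computed, and the lag values shown are made up for illustration.

```python
import math
import statistics

def mean_with_ci95(values):
    """Return (mean, half-width of a normal-approximation 95% CI)."""
    mean = statistics.mean(values)
    # Standard error of the mean from the sample standard deviation.
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    return mean, 1.96 * std_err

# Hypothetical optimal lags (days) for a handful of states; not real results.
example_lags = [18, 25, 21, 16, 27, 22, 19, 24]
mean, half_width = mean_with_ci95(example_lags)
print(f"{mean:.1f} ± {half_width:.1f} days")
```

With only a few dozen states, a Student's t critical value would give a slightly wider interval than the 1.96 used here.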

Detectability of COVID-19-related signals

In regions with relatively large outbreaks and reliable COVID prevalence data, the strength of symptom-outcome correlations depends on population size and the height of the peak of the pandemic. The plot below shows that states with larger populations or that experienced a high pandemic peak (maximum number of COVID-19 cases per million people) show better correlation between CLI and new cases than smaller states or states that avoided a large outbreak.

We observed two major influences that reduced the survey signal quality for these states with poor correlations: 1) statistical noise arising from a limited number of survey responses, and 2) the presence of a confuser signal in the data.

The statistical noise in the COVID-19 Survey originates from the fact that surveys are conducted on small samples of a population (read more about our sampling and weighting methodology here). That is, if you were to ask a random person on a random day about their health, it is unlikely that they would be experiencing COVID-19 symptoms at that time. Further, if not enough people are sampled, the survey will be unlikely to identify even one person with COVID-19 symptoms. This means that survey signals for rare symptoms like CLI will have higher relative variance than Community CLI, since the probability of a person knowing another with COVID-like symptoms is typically higher than the probability of the respondent’s having symptoms.
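The effect of sample size on rare signals can be made concrete with a simple binomial model. The sketch below assumes independent, unweighted responses, which is a simplification of the actual weighted survey, and the prevalence values and sample size are hypothetical numbers chosen only to contrast a rare signal with a common one.

```python
import math

def prob_no_positive(p, n):
    """Probability that none of n independent respondents reports the symptom."""
    return (1.0 - p) ** n

def relative_standard_error(p, n):
    """Standard error of the estimated proportion divided by the proportion itself."""
    return math.sqrt((1.0 - p) / (n * p))

# Hypothetical prevalences (assumptions): 0.5% for CLI, 20% for Community CLI.
n_responses = 500
for name, p in (("CLI", 0.005), ("Community CLI", 0.20)):
    print(
        f"{name}: P(no positive) = {prob_no_positive(p, n_responses):.3f}, "
        f"relative SE = {relative_standard_error(p, n_responses):.2f}"
    )
```

Under these assumptions the rarer signal has a relative standard error several times larger than the more common one at the same sample size, which is consistent with the higher relative variance described above.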

Turning to the second point, our analysis revealed that even in the absence of a COVID-19 outbreak, there exists a persistent baseline in symptom signals like CLI and Community CLI. Irrespective of the origin of this confuser signal (one explanation being survey respondents who happen to have COVID-like symptoms but not COVID-19), it can obscure actual outbreaks even in situations with a large number of survey responses and low statistical noise.
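One way to see why a baseline makes outbreaks harder to spot is to compare an outbreak-driven excess with both the baseline and the sampling noise. The sketch below is a toy model under stated assumptions: the baseline is constant, responses are independent and unweighted, and the baseline and excess values are illustrative rather than estimates from the survey.

```python
import math

def relative_rise(baseline, excess):
    """Size of the outbreak-driven excess relative to the baseline level."""
    return excess / baseline

def excess_z_score(baseline, excess, n_responses):
    """Excess divided by the binomial standard error of the observed proportion."""
    p = baseline + excess  # what the survey actually observes
    std_err = math.sqrt(p * (1.0 - p) / n_responses)
    return excess / std_err

# Illustrative values (assumptions): 0.25% baseline, 0.05 percentage-point excess.
baseline, excess = 0.0025, 0.0005
print(f"relative rise: {relative_rise(baseline, excess):.0%}")
for n in (1_000, 10_000, 100_000):
    print(f"n = {n}: z = {excess_z_score(baseline, excess, n):.2f}")
```

In this model the baseline contributes to the sampling noise but not to the excess, so the same outbreak yields a smaller relative change and needs more responses to stand out.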

Take the state of Washington, for example, from April 2020 to December 2020. In the left panel, % CLI (green) shows high relative variance and never falls below the confuser baseline of ~0.25 percent, obscuring the summer 2020 COVID-19 outbreak and rendering the fall outbreak barely visible. On the other hand, % Community CLI (orange) has lower relative variance, and both the summer and fall outbreaks are clearly visible. The right panel affirms this, showing that the Community CLI survey signal with approximately 7 days’ lag correlates very well with new deaths, while CLI does not.


In our preliminary exploration of the illness indicators available in the public COVID-19 Survey data sets, we find that symptom signals like COVID-like illness (CLI) and Community CLI correlate with COVID-19 outcomes, sometimes leading new COVID-19 cases and deaths by weeks. Additionally, we observe two main components of these signals that are not COVID-related and that can ultimately obscure the real effects of an outbreak in the data. More work will be needed to quantify this in more detail, and to expand this analysis globally.

Because the COVID-19 Survey runs daily and worldwide and is not subject to the kinds of reporting delays associated with, for example, COVID test results, survey responses may represent the current pandemic situation better than official case counts. In agreement with past work showing that COVID-19 Survey signals can improve short-term forecasts, our analysis here demonstrates the potential of the survey to power COVID-19 hotspot detection algorithms or improve pandemic forecasting.

Facebook and our partners encourage researchers, public health officials, and the public to make use of the COVID-19 Survey data (available through the COVIDcast API and UMD API) and other data sets (such as Facebook Data for Good’s population density maps and disease prevention maps) for new analyses and insights. Microdata from the survey is also available upon request to academic and nonprofit researchers under data license agreements.