Ground truth data is the foundation upon which we build models, generate inferences, and make decisions. What is ground truth data? We define it as a dataset that contains the values we want to infer for a particular population of interest (the data could be human labels, survey data, behavioral data, etc.). Whether it is modeling user characteristics to ensure appropriate and personalized user experiences, detecting and removing harmful misinformation and hate speech, or executing other data-driven tasks, the underlying machine learning processes rely on models trained and validated on some ground truth data.
The quality and trustworthiness of these models depend on the quality of the ground truth used to develop and validate them. For instance, model performance cannot be properly assessed without a ground truth evaluation set that is accurate and representative of the population of interest, and noisy ground truth data can degrade model performance (Gupta et al., 2019). Therefore, it is crucial to understand how good these ground truth datasets are and to ensure that they are of sufficiently high quality to support reliable models and inferences.
We developed the ground truth maturity framework (GTMF) as a unified and comprehensive framework to evaluate the quality of ground truth data at Meta.
This blog post will benefit anyone who relies on ground truth data for modeling and decision-making and is interested in understanding and improving its quality, and thereby better assessing and improving their models and decisions.
Before jumping into a more detailed introduction to the GTMF, let’s define some ground truth–related terminology.
GTMF consists of a step-by-step guide to assess ground truth maturity and methodologies to improve ground truth maturity. By maturity, we mean how well you understand, measure, and minimize errors in your ground truth data.
GTMF operates as a seven-step framework, with each step addressing a distinct question about the ground truth data: its item error, representativeness, accuracy, reliability, precision, timeliness, and cost efficiency.
In each step, GTMF provides guidelines to assign one of four maturity levels that indicates to what extent the potential errors in a ground truth dataset are understood, measured, and minimized.
The GTMF assessment leads to a “report card” that provides us with a comprehensive overview of how mature our ground truth dataset is across the seven GTMF steps presented above. Based on this report card, we can decide where to prioritize and set goals for our ground truth maturity. Repeated assessments can then help track and communicate progress towards these goals.
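As a rough illustration (not part of any official GTMF tooling), a report card can be as simple as a mapping from each of the seven steps to its assessed maturity level. The step names and the 1–4 scale below are assumptions made for this sketch.

```python
# A minimal sketch of a GTMF "report card": each of the seven steps gets one of
# four maturity levels (1 = lowest, 4 = highest). Step names and levels here are
# illustrative assumptions, not output of any official GTMF tool.
report_card = {
    "item_error": 2,
    "representativeness": 3,
    "accuracy": 2,
    "reliability": 1,
    "precision": 3,
    "timeliness": 2,
    "cost_efficiency": 3,
}

# Prioritize the least mature steps first when setting goals.
for step, level in sorted(report_card.items(), key=lambda kv: kv[1]):
    print(f"{step:20s} maturity level {level}/4")
```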
After the assessment, GTMF provides guidance to mature our ground truth, which we introduce in more detail in the next section. Each step in GTMF has a suite of methodologies and approaches for assessment and improvement, which we do not cover in this post. For each step, we also explain how it could be used to evaluate a ground truth dataset for training a teen/adult classifier on Instagram.
Here are some prerequisites to take care of before starting the GTMF assessment process. These prerequisites are the foundations for many other steps. You need clarity about what you are intending to measure with your ground truth data and what you're intending to predict with your machine learning model.
Completing these prerequisites first is also an opportunity for team members involved in the evaluation to align on these core definitions and to resolve any differences in understanding before undertaking the rest of the evaluation.
This step aims to identify whether there is anything obviously wrong with the dataset or its collection process. The types of error considered in this step include errors from nonresponse, mismeasurement, and processing, but exclude errors in how the data were sampled. To think about sources of item error, we recommend considering what an ideal dataset measuring your ground truth variable would look like, and how your preferred method of gathering ground truth data might deviate from it. This is an opportunity to take stock of your goals while understanding potential shortcomings that may not be fully captured in the later steps.
Example: Age heaping happens when people are asked their age and overreport ages that end in 0 or 5 [Pardeshi, 2010; Lyons-Amos and Stones, 2017]. If you collect age labels through a survey that asks people how old they are, your labels may exhibit age heaping. A better approach is to ask for their date of birth, though people aren’t always accurate (or honest), and we’ve seen in practice that misrepresenting age is a common problem across the industry [Erica et al., 2022].
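As a quick, hedged sketch of how one might screen survey-reported ages for heaping: compare the observed share of ages ending in 0 or 5 with the roughly 20 percent you would expect if terminal digits were uniform. The `ages` data below are simulated purely for illustration.

```python
import random

# Hypothetical survey-reported ages; real labels would come from your dataset.
# The extra block of round ages simulates heaping.
random.seed(0)
ages = [random.randint(13, 80) for _ in range(1000)] + [20, 25, 30, 35, 40] * 40

# Share of ages ending in 0 or 5; ~20% is expected if terminal digits are uniform.
heaped = sum(1 for a in ages if a % 5 == 0) / len(ages)
print(f"Share of ages ending in 0 or 5: {heaped:.1%} (expect ~20% without heaping)")
```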
This step aims to assess how representative the ground truth data is of the target population. This ensures that any inferences derived from the ground truth data can be generalized to the target population, and that your model can be evaluated and calibrated on ground truth that represents that population. To assess representativeness, we need to specify the target populations we wish to make inferences about and assess the degree to which our data diverge from the target population along key characteristics that are correlated with the content or coverage of those labels.
Example: Suppose we want to know the teen prevalence across IG users. Using age labels collected only in the United States may not give us a proper estimate for the world as a whole, since teen/adult distributions can differ across countries. Using birthday posts as a source of labels could bias the estimator if younger users are more likely to write “happy birthday” posts on our services.
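One lightweight way to quantify this kind of divergence is to compare the distribution of a key characteristic (for example, country) in the labeled data against the target population, as in the sketch below. The country shares are invented for illustration.

```python
# Hypothetical country shares; in practice these would be measured from the
# labeled dataset and from the target population (e.g., all IG users).
population_share = {"US": 0.10, "IN": 0.25, "BR": 0.10, "other": 0.55}
labeled_share    = {"US": 0.60, "IN": 0.10, "BR": 0.05, "other": 0.25}

# Total variation distance: 0 means identical distributions, 1 means disjoint.
tvd = 0.5 * sum(abs(labeled_share[c] - population_share[c]) for c in population_share)
print(f"Total variation distance between labels and population: {tvd:.2f}")
```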
Here we want to consider how far our ground truth data deviate from the true values for our target population. Stated differently, are we actually measuring what we think we are measuring with our ground truth? Or are we measuring something only related to it? Refer to your definition in the Prerequisite step to remind yourself what you are intending to measure.
In assessing ground truth accuracy, whether we have a golden set (see “Key terminology,” above) makes a big difference in how we might proceed. In general, evaluating ground truth accuracy against a golden set is straightforward, with standard methodologies and metrics such as F scores for categorical labels and RMSE for continuous labels. Without a golden set, you’ll likely need to assess the construct validity (or something similar) of your ground truth labels.
Example: Suppose we want to assess the accuracy of our survey-based age ground truth. If we had access to government IDs uploaded by users [see this 2022 blog post], we could merge that data with our ground truth and check the RMSE of our age labels against it. Without a golden set, we could check our survey age measure against age measured another way, such as birthday posts (i.e., convergent validity).
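A minimal sketch of the golden-set comparison, assuming we can join survey-based ages to ID-verified ages per user (the paired values below are made up): RMSE for the continuous age labels, and F1 for the derived binary teen/adult label.

```python
import math

# Hypothetical paired labels: survey-based age ground truth vs. a "golden" age
# from a verified source (e.g., government ID), matched per user.
survey_age = [17, 22, 15, 34, 19, 16, 41]
golden_age = [17, 23, 16, 34, 17, 16, 40]

# RMSE of the continuous age labels against the golden set.
rmse = math.sqrt(sum((s - g) ** 2 for s, g in zip(survey_age, golden_age)) / len(survey_age))

# F1 for the derived binary label (teen = under 18 is the positive class).
survey_teen = [a < 18 for a in survey_age]
golden_teen = [a < 18 for a in golden_age]
tp = sum(s and g for s, g in zip(survey_teen, golden_teen))
fp = sum(s and not g for s, g in zip(survey_teen, golden_teen))
fn = sum(not s and g for s, g in zip(survey_teen, golden_teen))
f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
print(f"RMSE = {rmse:.2f} years, teen/adult F1 = {f1:.2f}")
```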
In this step, we ask you to consider how reliable your labels are, that is, how much your labels vary if you measure them repeatedly in the same way. Unlike accuracy (step 3), which measures how close a measurement is to the true value, reliability measures the similarity between repeated measurements. For cases where repeated measurement is not applicable or is difficult to obtain (such as with conversion data), you can consider ways to measure label variance across similar users, or skip this step while keeping the risk in mind. In that case, however, it is valuable to scope future improvements to your ground truth data–gathering process that would let you measure label reliability through repeated measurements.
Example: Suppose we had labelers review accounts for birthday images and label users’ ages. If we asked two labelers to review the same account, how often would they agree or disagree? If instead we were using panel survey data, suppose we sampled the same users 12 months apart; what fraction of users report an age that is exactly one year older in the second wave of the survey?
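A rough sketch of how agreement could be summarized for the two-labeler scenario: raw percent agreement plus Cohen’s kappa, which corrects for agreement expected by chance. The label pairs below are made up for illustration.

```python
from collections import Counter

# Hypothetical teen/adult labels from two labelers reviewing the same accounts.
labeler_a = ["teen", "adult", "teen", "adult", "adult", "teen", "adult", "adult"]
labeler_b = ["teen", "adult", "adult", "adult", "adult", "teen", "adult", "teen"]

n = len(labeler_a)
observed = sum(a == b for a, b in zip(labeler_a, labeler_b)) / n

# Chance agreement: product of each labeler's marginal label frequencies.
freq_a, freq_b = Counter(labeler_a), Counter(labeler_b)
expected = sum(freq_a[k] / n * freq_b[k] / n for k in set(labeler_a) | set(labeler_b))

kappa = (observed - expected) / (1 - expected)
print(f"Percent agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```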
This step considers whether the ground truth data is large enough to generate sufficiently precise estimates of the metrics that rely on aggregating ground truth data (e.g., survey mean estimates, F1 scores computed over your ground truth dataset), and whether their confidence intervals are narrow enough to facilitate decision-making.
Example: Assume we have N=5,000 binary teen/adult labels and want to infer the teen prevalence and the uncertainty around it. We construct our estimate by taking the proportion of teens in all 5,000 labels (i.e., the sample mean of the labels, with teen as the positive label). We then decide to run B=2,000 nonparametric bootstrap iterations. On each iteration, we sample 5,000 observations with replacement from our labeled dataset and calculate a mean estimate from that bootstrap sample. The variation of our B=2,000 bootstrap means gives us an estimate of the sampling-based uncertainty and can inform whether we have collected enough data to accurately calibrate or evaluate an associated model.
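The bootstrap above translates directly into a few lines of code. In the sketch below, simulated labels stand in for the real N=5,000 teen/adult dataset, and the 95% interval is read off the sorted bootstrap means.

```python
import random

random.seed(42)
N, B = 5000, 2000

# Simulated stand-in for the real binary labels (1 = teen, 0 = adult).
labels = [1 if random.random() < 0.15 else 0 for _ in range(N)]

# Nonparametric bootstrap: resample N labels with replacement B times and
# recompute the teen prevalence on each resample.
boot_means = [sum(random.choices(labels, k=N)) / N for _ in range(B)]
boot_means.sort()
lo, hi = boot_means[int(0.025 * B)], boot_means[int(0.975 * B)]
print(f"Teen prevalence = {sum(labels) / N:.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```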
This step considers whether the ground truth data is up-to-date enough to reflect the current reality, whether the label collection is at an appropriate cadence given resource constraints, and whether the label collection cadence captures expected changes in the target population over time.
Example: Some ground truth data are measurements of things that are expected to remain constant over time (e.g., date of birth), while others are measurements of things that naturally vary over time (e.g., city of residence, or subjective measures such as interests). The more the ground truth is expected to vary over time, the more frequently these data should be collected and refreshed. It is important to understand the gap between how much your ground truth is expected to vary over time and how much it actually varies in your target population, so that you can set the optimal cadence of ground truth data collection.
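One way to inform the refresh cadence, sketched below with invented data, is to measure how often a label actually changes between two collection waves for the same users.

```python
# Hypothetical city-of-residence labels for the same users at two collection
# waves, keyed by user id; real data would come from repeated collection.
wave_1 = {"u1": "Paris", "u2": "Austin", "u3": "Lagos", "u4": "Austin", "u5": "Tokyo"}
wave_2 = {"u1": "Paris", "u2": "Denver", "u3": "Lagos", "u4": "Austin", "u5": "Osaka"}

common = wave_1.keys() & wave_2.keys()
changed = sum(wave_1[u] != wave_2[u] for u in common) / len(common)
print(f"Share of labels that changed between waves: {changed:.0%}")
# A high change rate suggests the labels go stale quickly and need frequent refreshes.
```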
This step asks you to consider whether the ground truth data is collected in a sustainable and efficient manner, given potential time and financial constraints. By evaluating the value generated by the ground truth data against its cost, we measure resource demand at different aggregation levels, such as the individual label level, the usable label level, and the use case level. Based on this evaluation, we identify opportunities to improve the efficiency of the ground truth collection strategy.
Example: To identify users who misrepresent their age, we have content reviewers who are trained to flag reported accounts that appear to be used by people who are underage [Pravni et al., 2021]. Some reviewer decisions could be nonconvertible (e.g., reviewers are unable to make an age determination), which leads to unusable labels. This conversion rate should be factored into the cost efficiency measurement.
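A hedged sketch of the kind of cost accounting this step calls for: the per-review cost and conversion rate below are made-up numbers, but they show how the effective cost per usable label grows as the conversion rate drops.

```python
# Hypothetical numbers: cost of one human review and the fraction of reviews
# that convert into a usable age label (the rest are inconclusive).
cost_per_review = 1.50      # dollars per reviewed account (assumed)
reviews = 10_000
conversion_rate = 0.70      # share of reviews yielding a usable label (assumed)

usable_labels = int(reviews * conversion_rate)
cost_per_usable_label = cost_per_review * reviews / usable_labels
print(f"{usable_labels} usable labels at ${cost_per_usable_label:.2f} each "
      f"(vs. ${cost_per_review:.2f} per review)")
```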
GTMF has been applied and tested across various applications at Meta. Across these applications, GTMF has given teams a comprehensive understanding of their ground truth maturity across many dimensions, pointed to directions for further improvement, and standardized methodologies that drove more mature ground truth and thus higher confidence in models and inferences. Beyond Meta, GTMF has the potential to provide a generalizable and production-tested ground truth toolset that could benefit anyone who relies on ground truth data for machine learning or decision-making.
—
Special thanks to CDS Graph Science and Statistics, Demography and Survey Science, and People Data Ground Truth.