Crowdsourcing is a well-studied domain in which tasks are assigned to humans for a variety of applications, including training machine learning models, removing abusive content from social platforms, and measuring the prevalence of labels in a population. A key challenge is that humans have been found to be noisy decision makers; thus, the obtained labels may be incorrect. A common mitigation strategy is to collect labels from several labelers and aggregate them while taking into account the accuracy of the individual labelers.

Our work is focused on crowdsourcing for prevalence estimation, specifically of content that violates community standards of online platforms __such as Meta__. In our paper __Crowdsourcing with contextual uncertainty__, we propose Theodon, a Bayesian non-parametric model, developed and deployed at Meta, that learns the prevalence of label categories and the accuracy of labelers as functions of a given context. Our model leverages Gaussian Processes (GPs) as flexible priors to model the prevalence, sensitivity, and specificity functions.

Figure 1 illustrates this setup. We start with the entire population (content, accounts, or other entities on the platform) over which we want to measure the prevalence of violations. Due to the large volume, it is impossible to label the entire population, and due to the low prevalence of violations (often below 0.1%), sampling uniformly from the population would result in few labeled violations. Thus, we upsample likely violations, which are sent to one or more labelers. This upsampling is done using a classifier that predicts the likelihood of a violation for each entity in the population. We note that this is different from enforcement classifiers that remove violating content with high certainty.
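This upsample-then-reweight measurement step can be sketched with a toy inverse-probability-weighted (Horvitz-Thompson) prevalence estimate. The score distribution, prevalence rate, and inclusion probabilities below are all illustrative stand-ins, not production values:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Toy population: classifier scores and true labels whose rate grows
# with the score; overall prevalence is low (illustrative numbers).
scores = rng.beta(0.5, 10.0, size=N)
true_labels = rng.random(N) < 0.001 + 0.1 * scores

# Non-uniform sampling: the inclusion probability grows with the score,
# so the labeled sample contains enough positives to be informative.
pi = 0.02 + 0.8 * scores
chosen = rng.random(N) < pi

# Horvitz-Thompson estimate: reweighting by 1/pi undoes the upsampling.
ht_estimate = np.sum(true_labels[chosen] / pi[chosen]) / N
true_prev = true_labels.mean()
```

Reweighting each labeled item by its inverse inclusion probability keeps the prevalence estimate unbiased even though likely violations were heavily oversampled.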

The upsampling classifier, which is trained on content features, provides a natural context for learning both prevalence and labelers' performance. The labels, together with the classifier scores, are passed to Theodon, which infers labelers' performance and the aggregated labels. For our experiments, we used data generated from integrity applications at Meta as well as public datasets. We showed that Theodon (1) obtains a 1–4% improvement in AUC-PR when predicting items' true labels compared to state-of-the-art baselines on public datasets, (2) is effective as a calibration method, and (3) provides detailed insights into labelers' performances.

Following a rich body of research on crowdsourcing models, we take a Bayesian probabilistic approach to define different latent variables and the generative process of the observed data (Figure 2). Similar to the Dawid and Skene model and many of its extensions, we define three main latent variables: prevalence, sensitivity, and specificity. The key modeling novelty is that Theodon assumes all three quantities depend on a classifier score *s*, and are captured by a prevalence function, a sensitivity function, and a specificity function, respectively. As in previous models, we also assume that each item has a true but latent label.
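The generative process can be illustrated with a short sketch. The specific functional forms below are simple stand-ins for the score-dependent prevalence, sensitivity, and specificity functions (which carry Gaussian Process priors in the actual model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simple illustrative stand-ins for the three latent functions of s.
def prevalence(s):  return 0.05 + 0.90 * s   # P(y = 1 | s)
def sensitivity(s): return 0.70 + 0.25 * s   # P(vote = 1 | y = 1, s)
def specificity(s): return 0.95 - 0.15 * s   # P(vote = 0 | y = 0, s)

def generate(scores, n_labelers=3):
    """Sample true-but-latent labels, then noisy labeler votes."""
    y = rng.random(scores.size) < prevalence(scores)
    p_vote_1 = np.where(y, sensitivity(scores), 1.0 - specificity(scores))
    votes = rng.random((n_labelers, scores.size)) < p_vote_1
    return y, votes

scores = rng.random(1_000)
y, votes = generate(scores)
```

Inference then runs in the opposite direction: given the votes and scores, the model recovers the three functions and the posterior over each item's latent label.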

Figure 3 illustrates an overview of how Theodon was deployed at scale in production at Meta. The content is non-uniformly sampled from the population and sent to a centralized labeling platform for human review. The human labels and the sampling weights are stored in a Hadoop-based distributed file system and then used to fit Theodon models. The posterior inference is done using Stan on the FBLearner platform. The output estimates from Theodon are stored back to the distributed file system and then used for monitoring and analytical purposes, including (1) measuring and monitoring labeler performances using the sensitivity and specificity functions, (2) measuring and monitoring the prevalence of the positive class by aggregating the prevalence function, and (3) calibrating the input classifier using the prevalence function.
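Assuming a hypothetical learned prevalence function `prev_fn` (the shape and numbers below are illustrative, not the model's actual output), the prevalence-aggregation and calibration read-outs reduce to a few lines:

```python
import numpy as np

# Hypothetical posterior-mean prevalence function learned by the model.
def prev_fn(s):
    return 0.02 + 0.9 * s ** 2

# Population prevalence: average prev_fn over the labeled items' scores,
# weighted by inverse inclusion probabilities to undo non-uniform sampling.
scores = np.array([0.1, 0.4, 0.8])    # toy labeled items' scores
inv_pi = np.array([50.0, 10.0, 2.0])  # toy inverse sampling probabilities
pop_prevalence = np.sum(prev_fn(scores) * inv_pi) / np.sum(inv_pi)

# Calibration: prev_fn maps a raw classifier score to P(y = 1 | score),
# so it serves directly as a calibration map for the input classifier.
calibrated_score = prev_fn(0.7)
```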

For our experiments, we used data generated from integrity applications at Meta as well as public datasets. In this post, we’ll focus on the former, but detailed results on this data and on public datasets are available in __the full paper__.

At Meta, there are numerous applications of crowdsourcing; here, we focus on prevalence measurement for a range of integrity problems, with the setup shown in Figure 1. Since we want to study the performance of Theodon under different scenarios, we derive distributions from the logged data and use them to explore various operating parameters. Specifically, we extract the empirical distribution of the classifier scores and fit a Beta distribution to generate a score distribution. Similarly, we obtain the true prevalence function by fitting a third-degree polynomial to the empirical prevalence function. Finally, to simulate different types of labeling errors, we generate three global sensitivity and specificity functions: a linear function, representing a simple correlation between scores and accuracy, and two non-linear functions, one concave, indicating high accuracy, and one convex, indicating lower accuracy.
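A minimal sketch of this simulation setup, with hypothetical coefficients standing in for the values fitted to the logged data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Beta fit to the empirical score distribution (parameters illustrative).
scores = rng.beta(0.6, 8.0, size=5_000)

# Third-degree polynomial fit to the empirical prevalence function,
# highest-degree coefficient first (coefficients illustrative).
prev_coef = [0.29, 0.50, 0.20, 0.01]
prevalence = np.polyval(prev_coef, scores)

# Three global shapes for sensitivity/specificity as functions of the score.
def linear(s):  return 0.60 + 0.35 * s             # simple score-accuracy link
def concave(s): return 1.00 - 0.40 * (1 - s) ** 2  # high accuracy
def convex(s):  return 0.55 + 0.40 * s ** 2        # lower accuracy

grid = np.linspace(0.0, 1.0, 101)
```

Sampling scores from the fitted Beta and labels from these functions yields synthetic datasets whose error patterns mirror the logged data while letting us vary the operating parameters.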

Figure 4 shows the results of our evaluation, with comparisons against several state-of-the-art baselines. Additional details about the generative process of each baseline can be found in the supplementary material of the full paper. Overall, Theodon and the LR-LR baseline outperform the other methods for this dataset across all metrics, as they both capture the dependency of sensitivity and specificity on the scores. On the coverage rate, however, Theodon significantly outperforms the LR-LR baseline.

Crowdsourcing is commonly leveraged for a range of applications, but the inherent uncertainty in human-generated labels leads to noisy outcomes. This work leverages contextual information in a novel way: a Bayesian non-parametric approach built on Gaussian Processes that captures the dependency of prevalence and labeler accuracy on the given context.

We showed through extensive empirical studies based on real applications at Meta as well as publicly available data that our model is effective on a range of tasks. In addition to obtaining good per-item aggregated labels, our system calibrates the base classifier (context) under the presence of labeling errors and provides useful insights into how the labeler performances change with respect to the input classifier score.

__https://research.facebook.com/publications/crowdsourcing-with-contextual-uncertainty/__