Core Data Science (CDS) is a central science organization that drives impact for Meta and the world through use-inspired advancements to our fundamental understanding of the internet, society, and technology, and through the development of novel methodology and services. We partner with teams throughout the company. The work we do enhances the Meta family of apps, which enable more than 2.8 billion people to communicate with each other every day.
Core Data Science is interdisciplinary, with expertise in computer science, statistics, machine learning, economics, political science, operations research, and sociology, among many other fields. This diversity of perspectives enriches our research and expands the scope and scale of projects we can address. We deliver value through collaborative projects with other groups at Meta and with the academic community. In addition, we build and open-source technical products aligned with our areas of expertise.
Engaging with the academic community is of key importance to our research group. We publish findings, host PhD students through our internship program, collaborate with professors and PhD students, and highlight open areas of interest through our requests for research proposals.
Learn more about Core Data Science on our webpage.
Our unique blend of deep scientific expertise, engineering, data science, and research skills enables us to solve some of the hardest and most important problems Meta faces in the following areas. We focused on these areas in 2021 and will continue to do so in 2022.
- Experimentation: For any model or product to be effective, one needs to learn key parameters and optimize over a decision space, trading off various objectives. This direction investigates advanced experimentation methods that can do this effectively, with applications to product development, machine learning systems, and value function tuning, among others.
- Economic modeling, computation, and optimization: Reality can be complex to capture, and modeling decisions are crucial for representing it faithfully and providing informative insights, particularly when there are incentive and dimensionality issues. We contribute methods for ads and commerce marketplaces and propose solutions for large infrastructure problems under capacity or other side constraints.
- Statistics and artificial intelligence: Learning models from data and understanding causality enable data-driven decision making. We work on a variety of problems in this space, ranging from building tools that can provide advice on model fit to sample weighting for surveys, temporal embeddings, and crowdsourcing decisions when labels are uncertain.
- Privacy: Data needs to be carefully handled to ensure privacy. We contribute to the development of privacy methods, including risk assessment, differential privacy, and synthetic data generation, with applications in experimentation and analytics.
- Computational social science: Understanding how people interact with each other and with the world through our platforms allows us to build features and tools to improve the user experience, with a mind toward well-being, equity, and societal impact.
- Network science: Because our products enable connection and interaction, understanding and building from a network perspective provides better solutions for supporting everything from experimentation to conversation to community.
Core Data Science supports experimentation in many ways, including advancing statistical methods for analyzing experiments, improving methods and building open source tools for sequential experimentation and optimization, and developing alternative experimental designs. This research enables better decision-making and product development.
Machine learning for variance reduction in online experiments
Experimentation is a central part of data-driven product development, yet in practice, the results from experiments may be too imprecise to be of much help in improving decision-making. This raises the question of how we can make better use of the data we have and get sharper, more precise experimental estimates without having to enroll more people in the test. We developed a new methodology for making progress on this problem, which both has formal statistical guarantees and is scalable enough to implement in practice. The work allows for general machine learning (ML) techniques to be used in conjunction with experimental data to substantially increase the precision of experimental estimates, relative to other existing methods.
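As a rough illustration of the general idea (not the published methodology), the sketch below compares a plain difference in means with an estimate computed on outcomes after subtracting a cross-fitted prediction from pre-treatment covariates. The function names, the simulated data, and the use of least squares as a stand-in for a general ML model are all assumptions made for this example.

```python
import numpy as np

def diff_in_means(y, t):
    """Plain experimental estimate: treated mean minus control mean."""
    return y[t == 1].mean() - y[t == 0].mean()

def ml_adjusted_estimate(x, y, t, n_folds=2, rng=None):
    """Difference in means on residuals y - g(x), with g fit out-of-fold.

    g never sees the treatment indicator and is always evaluated on units it
    was not trained on, so subtracting it removes predictable noise without
    shifting the treatment effect under random assignment.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    folds = rng.permutation(n) % n_folds  # balanced random fold labels
    residual = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Least squares is a stand-in here; any predictive model could be used.
        a_train = np.column_stack([np.ones(train.sum()), x[train]])
        coef, *_ = np.linalg.lstsq(a_train, y[train], rcond=None)
        a_test = np.column_stack([np.ones(test.sum()), x[test]])
        residual[test] = y[test] - a_test @ coef
    return diff_in_means(residual, t)

def simulate(n, tau, rng):
    """Hypothetical experiment: outcome driven mostly by a pre-treatment covariate."""
    x = rng.normal(size=n)
    t = rng.integers(0, 2, size=n)
    y = 5.0 * x + tau * t + rng.normal(size=n)
    return x, y, t
```

Repeating `simulate` many times and comparing the spread of the two estimators shows the precision gain: both target the same effect, but the adjusted estimate varies far less when the covariate is predictive of the outcome.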
Network experimentation at scale
Network interference can make experimentation more difficult. Colloquially, interference means that an experimental unit’s response to an intervention depends not just on its own treatment, but also on other units’ treatments. We proposed a network experimentation framework, which accounts for interference between experimental units through cluster randomization.
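A minimal sketch of cluster randomization, assuming units have already been grouped into clusters by some graph-clustering step that is not shown here. The function names and the toy data are hypothetical and are not taken from the framework itself.

```python
import random

def cluster_randomize(unit_to_cluster, p_treat=0.5, seed=0):
    """Assign treatment per cluster so every unit in a cluster shares an arm."""
    rng = random.Random(seed)
    clusters = sorted(set(unit_to_cluster.values()))
    arm_of_cluster = {c: int(rng.random() < p_treat) for c in clusters}
    return {u: arm_of_cluster[c] for u, c in unit_to_cluster.items()}

def same_arm_edge_fraction(edges, assignment):
    """Share of connections whose two endpoints received the same treatment."""
    same = sum(assignment[u] == assignment[v] for u, v in edges)
    return same / len(edges)

# Toy example: two tight friend groups and one isolated user.
unit_to_cluster = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
edges = [("a", "b"), ("c", "d")]
assignment = cluster_randomize(unit_to_cluster)
```

Because connected units tend to fall in the same cluster, most of a unit's neighbors receive the same treatment as the unit itself, which is what limits interference between arms.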
High-dimensional Bayesian optimization with sparse axis-aligned subspaces
Bayesian optimization is a popular approach to black-box optimization, with machine learning hyperparameter tuning being a common application. Sparse axis-aligned subspace Bayesian optimization is a new sample-efficient method for expensive-to-evaluate black-box optimization.
Economic modeling, computation, and optimization
2021 Operations Research Workshop
The Economics, Algorithms, and Optimization team within CDS conducted the team's first Operations Research workshop. The workshop aimed to highlight the team's work in the areas of algorithms and optimization to a group of external researchers and exchange ideas about the latest academic work in the areas of applied causal inference, optimization, stochastic modeling, and mechanism design. The discussion groups focused on the themes of Optimization under Uncertainty, Algorithmic Game Theory, and Combinatorial Optimization.
The parity ray regularizer for pacing in auction markets
Budget management systems are one of the key components of modern auction markets. Internet advertising platforms typically offer advertisers the possibility to pace the rate at which their budget is depleted, through budget-pacing mechanisms. We focus on multiplicative pacing mechanisms in an online setting in which a bidder is repeatedly confronted with a series of advertising opportunities.
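To make the setting concrete, here is a sketch of a generic multiplicative pacing rule with a dual-style multiplier update in repeated second-price auctions. It is not the parity ray regularizer; the function name, the step size, and the assumption that the bidder observes the competing price after each auction are choices made for this example.

```python
def paced_bidding(values, prices, budget, step=0.05):
    """Run multiplicative pacing over a sequence of opportunities.

    values[i] is the bidder's value for opportunity i and prices[i] is the
    highest competing bid, which is what the bidder pays on a win.
    Returns the total spend and the final pacing multiplier.
    """
    target = budget / len(values)  # ideal spend per opportunity
    mu = 0.0                       # pacing multiplier; 0 means bid full value
    remaining = budget
    for value, price in zip(values, prices):
        # Shade the bid by 1 / (1 + mu) and never bid more than what is left.
        bid = min(value / (1.0 + mu), remaining)
        spend = price if bid > price else 0.0
        remaining -= spend
        # Overspending relative to target raises mu; underspending lowers it.
        mu = max(0.0, mu - step * (target - spend))
    return budget - remaining, mu
```

For example, `paced_bidding([1.0] * 100, [0.5] * 100, budget=10.0)` starts by bidding full value, then shades bids as the multiplier grows so that the budget is not exhausted early.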
Speaking engagement at INFORMS Annual Meeting 2021
Algorithmic Code Optimization
Algorithmic Code Optimization is the process of improving how binaries are compiled through algorithmic techniques. These improvements are important to ensure we optimize computational efficiency and other attributes that we may care about (e.g., binary size). Our recent work involves how we can post-process profiling data to improve profile-guided optimization and a theoretical study of the Ext-TSP problem, which is used for code block layout optimization.
QUEST: Queue simulation for content moderation at scale
Simulations are an important tool within queueing theory to analyze and improve the performance of queueing networks. In this paper, we describe how we can think of Meta's human content moderation system as a queueing network, and some operational problems we solve through a simulation platform that our team has built.
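As a toy illustration of queue simulation (far simpler than the platform described in the paper), the sketch below simulates a single first-come, first-served review queue with several reviewers and reports the mean waiting time. Exponential arrival and review times, and all names and parameters, are assumptions made for this example.

```python
import heapq
import random

def simulate_mean_wait(n_jobs, arrival_rate, service_rate, n_reviewers, seed=0):
    """Mean time a job waits before a reviewer picks it up."""
    rng = random.Random(seed)
    # Draw all randomness up front so that runs with different reviewer
    # counts see exactly the same arrivals and review durations.
    clock = 0.0
    jobs = []
    for _ in range(n_jobs):
        clock += rng.expovariate(arrival_rate)
        jobs.append((clock, rng.expovariate(service_rate)))

    free_at = [0.0] * n_reviewers  # min-heap of times each reviewer frees up
    total_wait = 0.0
    for arrival, service in jobs:
        earliest = heapq.heappop(free_at)
        start = max(arrival, earliest)  # wait only if every reviewer is busy
        total_wait += start - arrival
        heapq.heappush(free_at, start + service)
    return total_wait / n_jobs
```

Calling the function with the same seed but different values of `n_reviewers` gives a quick what-if comparison of staffing levels on identical workloads.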
Multi-armed bandits with cost subsidy
Many businesses, including Meta, send SMS messages to their users for a variety of reasons including phone number verification, two-factor authentication, and notifications. In order to deliver these SMS messages to their users, companies generally leverage aggregators (e.g., Twilio) that have deals with operators throughout the world. Choosing the best aggregator in real time is a difficult problem since it involves balancing costs with the quality of an aggregator in a non-stationary environment. Our team has developed a multi-armed bandit based solution that helps solve these challenges.
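The cost and quality trade-off can be sketched with a Thompson-sampling-style rule that picks the cheapest option whose sampled quality is within a tolerance of the best sampled quality. This is an illustration in the spirit of the problem, not the team's production solution. The aggregator qualities, costs, and tolerance `alpha` below are made up, and the sketch assumes a stationary environment for simplicity.

```python
import random

def choose_arm(successes, failures, costs, alpha, rng):
    """Cheapest arm whose sampled quality is within (1 - alpha) of the best."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    threshold = (1.0 - alpha) * max(samples)
    feasible = [i for i, q in enumerate(samples) if q >= threshold]
    return min(feasible, key=lambda i: costs[i])

def run(true_quality, costs, alpha=0.2, rounds=3000, seed=0):
    """Simulate message sends and return how often each arm was chosen."""
    rng = random.Random(seed)
    k = len(costs)
    successes, failures, pulls = [0] * k, [0] * k, [0] * k
    for _ in range(rounds):
        arm = choose_arm(successes, failures, costs, alpha, rng)
        delivered = rng.random() < true_quality[arm]  # simulated delivery outcome
        successes[arm] += int(delivered)
        failures[arm] += int(not delivered)
        pulls[arm] += 1
    return pulls

# Hypothetical aggregators: the best one is expensive, the middle one is
# nearly as good and much cheaper, and the cheapest one is unreliable.
pulls = run(true_quality=[0.9, 0.85, 0.3], costs=[3.0, 1.0, 0.5])
```

With these made-up numbers, the rule should settle mostly on the middle aggregator, which is within the tolerated quality gap of the best one at a third of the cost.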
Our Economics Modeling team works on a broad array of questions about understanding the behavior of users and advertisers on our complex multi-sided platform. We build models that inform both experimental and non-experimental settings. When we have experimental evidence, economic models are key for translating results into learnings about behavior. For questions that are difficult to answer with experiments, we are even more reliant on economic modeling to guide us. Some of our work is highly empirical, like the estimation of observational models. Other work is more theoretical, like modeling how to transform an experimental estimate into a key elasticity.
Economic Opportunity for Digital Platforms request for proposals
The goal of this RFP was to enable research into how (and whether) the digital economy and online platforms create opportunity and encourage social mobility, as well as identify and address inequalities in opportunity. Another area of focus was how tools and programs could help disadvantaged groups (including those who lack digital skills) transition into careers that require more digital skills or create and grow their businesses.
GRAph Science and Statistics
We leverage structure and behavioral modeling in order to enhance various downstream applications in integrity as well as commerce. One of the team’s active projects has been temporal interaction embeddings, which focus on incorporating sequence modeling in various applications. More recently, we’ve expanded on graph modeling and its applications in commerce, user-interest learning, self-supervised sequence modeling, and interpretability.
We work on state-of-the-art methodologies to improve best practices for various data problems that are both commonly encountered across the company and critical to a trustworthy use of data. For example, we have developed the Ground Truth Maturity Framework, a step-by-step guide to help teams understand, measure, and minimize errors in their ground truth data sets and ensure that those data sets are helping us make better decisions.
How Do We Know Someone Is Old Enough to Use Our Apps?
Creating an experience on Instagram that’s safe and private for young people, but also fun, comes with competing challenges. We want them to easily make new friends and keep up with their interests, but we don’t want them to deal with unwanted DMs or comments from strangers. We’ve developed AI technology that allows us to estimate people’s ages, such as whether someone is below or above 18.
Additionally, we work on delivering a more transparent, understandable, and controllable recommendation experience across the Meta family of apps. In particular, we are developing graph matching and representation learning algorithms to develop and optimize the taxonomy of canonical descriptions of content and user interests.
Popularity prediction for content moderation
Popularity prediction is important for many applications, including detection of harmful viral content to enable timely content moderation. We developed a highly scalable system for popularity prediction based on a self-exciting Hawkes point process. This system leverages both dynamic and static features of content without incurring runtime complexity proportional to popularity, as leading alternatives do. We provide predictions for any future time horizon, which enables a wide range of applications in the content moderation space.
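For readers unfamiliar with Hawkes processes, the sketch below shows a textbook version with an exponential kernel: each past event temporarily raises the rate of future events. It is not the system described above (it uses no content features, and this naive sum scales with the number of events), and the parameter names `mu`, `alpha`, and `beta` are conventional symbols chosen for this example.

```python
import math

def intensity(event_times, t, mu, alpha, beta):
    """Event rate at time t: base rate plus decaying boosts from past events."""
    excitation = sum(
        alpha * math.exp(-beta * (t - s)) for s in event_times if s <= t
    )
    return mu + excitation

def expected_future_events(event_times, t, alpha, beta):
    """Expected number of events after t, assuming no base rate (mu = 0).

    Past events directly trigger intensity / beta further events on average,
    and each of those starts its own cascade with mean size 1 / (1 - n),
    where n = alpha / beta is the branching ratio.
    """
    branching_ratio = alpha / beta
    if branching_ratio >= 1.0:
        raise ValueError("alpha / beta must be below 1 for a finite cascade")
    direct = intensity(event_times, t, 0.0, alpha, beta) / beta
    return direct / (1.0 - branching_ratio)
```

For a single share at time 0 with `alpha=1.0` and `beta=2.0`, the branching ratio is 0.5, so the expected number of further events is 1, for an expected cascade size of 2 in total.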
We work on designing scalable tools for integrity teams to understand and mitigate the effects of engagement with problematic content through our recommender systems before they are deployed in production.
In a paper published at KDD 2021, we proposed a theoretical framework for when preference amplification occurs in recommender systems and suggested various mitigation strategies to reduce problematic outcomes from preference amplification.
Statistics for Improving Insights, Models, and Decisions request for proposals
This RFP is a continuation of the 2019 and 2020 RFPs in applied statistics. Through this series of RFPs, the Facebook Core Data Science team, Infrastructure Data Science team, and Statistics and Privacy team aim to foster further innovation and deepen their collaboration with academia in applied statistics.
Core Data Science performs research in data privacy, aiming to enable privacy-safe access to Meta data through both risk assessment and reduction. To do so, we develop and apply techniques from differential privacy, statistical disclosure limitation, and synthetic data generation. Recent highlights from this research include:
Computational social science and network science
Data for Good
In 2021, CDS continued to release several data sets externally to both humanitarian organizations and academic researchers. These data sets are part of our Data for Good program, where we aim to empower those responding to disease outbreaks, natural disasters, and more. Our 2021 releases include the following:
- Relative Wealth Index: A data set on micro-estimates of wealth for 100+ low- and middle-income countries.
- Business Activity Trends: A data set that measures how businesses are affected by crisis events through their rate of posting on Facebook. In particular, we have released how COVID-19 has affected business activity throughout the world.
- Bias Correction for Maps for Good: We have developed a new methodology to reduce the biases in our displacement data set and make the trends more representative of the population on the ground.
- Commuting Zones: A data set of geographic areas where people live and work which can be used to understand local economies across the world.
We co-organized the 2nd KDD Humanitarian Mapping workshop, which focused on scientific and community-based solutions to address pressing humanitarian challenges, such as climate change-induced threats, the COVID-19 pandemic, natural disasters, economic inequalities, racial and gender violence, and human conflicts. Additionally, we have supported research projects from Oxford University, University of California Berkeley, and Direct Relief, which have been using our Data for Good data sets for cutting-edge research on disaster effects and crisis response.
Since February 2022, we have provided key insights to partners supporting Ukrainian refugees through our Data for Good program. In particular, we have shared aggregated insights and data sets with trusted partners, including aggregated estimates of displaced populations across Europe, information on social connections between Ukraine and the rest of Europe, and real-time mobility data for countries bordering Ukraine. Organizations such as UNHCR, International Organization for Migration, the World Bank, UNICEF, Médecins Sans Frontières, and Crisis Ready are actively using the data to better support communities in need.
Climate change and the COVID-19 pandemic
There are concerns that attention to climate change is waning as competing global threats intensify. The fluctuating pattern of attention suggests that new climate communication strategies, focused on “systemic sustainability,” are necessary in an age of competing global crises.
What Does Perception Bias on Social Networks Tell Us about Friend Count Satisfaction?
Social network platforms have enabled large-scale measurement of user-to-user networks such as friendships. Less studied is user sentiment about their networks, such as a user’s satisfaction with their number of friends. We surveyed over 85,000 Facebook users about how satisfied they were with their number of friends on Facebook, connecting these responses to their on-platform activity and social network signals. This work is forthcoming in The Web Conference 2022 and was joint work with former intern Shen Yan (University of Southern California).
Understanding Conflicts in Online Conversations
With the rise of social media, users from across the world are able to connect and converse with each other online. While these connections have facilitated a growth in knowledge, online discussions may also end in conflict. Previous computational studies have focused on creating online conflict detection models from inferred labels and do not examine how the conflict emerges. Instead, we aim to interpret and understand how conflicts arise in online personal conversations, using ground truth labels. To collect conflict discussions and paired non-conflict discussions from the same post, we make use of an existing Facebook tool that allows group members to report conflict comments. This work is forthcoming in The Web Conference 2022 and was joint work with former intern Sharon Levy (University of California, Santa Barbara).
Interested in learning more about some of the publications and research? Check out the publications section on the CDS team page and the list of references to all CDS articles here.