January 11, 2022

Introducing the Researcher Platform: Empowering independent research analyzing large-scale data from Meta

By: Kiran Jagadeesh, Runchao Jiang, Da Li, Robert Pyke, Lauren Wagner

Facebook Open Research and Transparency (FORT) is dedicated to sharing privacy-protected data with independent researchers so they can study Meta’s impact on society. We are balancing the complexities of sharing billions of data points with academics, while maintaining the privacy of the people who use our platforms. We must also provide the computing resources and data infrastructure required to analyze large-scale data offerings. The Researcher Platform was developed to meet these needs: a scalable platform that grants access to sensitive information in a controlled environment.

Background

Today, FORT shares Meta platform data with researchers around the world in a privacy-protective environment. Our focus has been supporting social scientists who apply computational approaches to large-scale human behavioral data in order to study and explain social phenomena. Computational social science is an emerging field that has exploded in popularity in the past decade, in part because of newly available observational data, experimental data, and large-scale simulations collected via digital tools. However, research can be impeded when academics are unable to analyze large-scale data in a way that maintains user privacy through access controls and monitoring.

From Jiang, Li, and Pyke (2021): “In the early years of CSS, researchers would use their personal computers to analyze collected data or conduct simulations. Nowadays, CSS research has become increasingly dependent on computing infrastructure availability, including data storage and computational resources...a lack of programming expertise and data infrastructure resources have become barriers to conducting interdisciplinary research in computational social science.”

To help address this, many researchers rely on Jupyter Notebook, an open source tool, to write code and display inline results from a web-based interface that can be shared with other users. But to analyze large data sets, researchers need to access a cluster and submit jobs from the Notebook so that the cluster can process offloaded computation. Historically, this may have been difficult for many small research labs or universities due to resource and budget constraints.

In order to democratize the data infrastructure needed to conduct computational social science research, our team created the Researcher Platform, a cloud-based scalable architecture that is cost-efficient, flexible, and maintains data security. In the future, we hope to help establish public standards that allow institutions to create data-sharing infrastructures for their own policy and legal needs, making the architecture widely available.

Design Principles

We prioritized the following principles when developing the Researcher Platform:

  • Security first: Social science researchers often need access to sensitive user data. We take necessary measures to ensure the system meets commensurate security standards. This includes, but is not limited to, protecting the data and only granting access to vetted researchers, as well as meeting predetermined security, privacy, and governance compliance objectives dictated by the FTC.
  • Provide free data and compute: We support thousands of researchers collaborating on terabyte- to petabyte-scale data sets. The Researcher Platform provides free compute to qualifying academics so that they can analyze web-scale data sets at no cost.
  • Incorporate widely used products: The Researcher Platform is built using JupyterLab, which is already an industry standard in computational social science. Academics working with large-scale data sets have found it useful for combining code with computational output, explanatory text, and multimedia in one document. As a result, new users onboarded to the product shouldn’t face a steep learning curve, since they are already familiar with the UI and functionality. As we expand our product suite, we plan to incorporate existing tools that researchers already use as much as possible.
  • Leverage cloud-agnostic design with standard components: We designed the architecture to be cloud-agnostic with standard components so that it can be adopted on other cloud providers. With zero or minimal modification, it can be used with common public cloud providers (Amazon Web Services, Google Cloud Platform, etc.) or on-premise data infrastructure.

How It Works

Once a researcher applies and is approved to use the Researcher Platform, they gain access to a virtual environment where data can be analyzed, and in some instances joined, under defined guidelines and restrictions that keep the data secure. Researchers open a JupyterLab instance and can manipulate available data using Python, including custom Python libraries our team has built in partnership with independent academics to facilitate their work. These pre-installed libraries allow researchers to perform common tasks such as data processing, statistical analysis, machine learning, and data visualization with free compute in a private research environment.

Security Features

Access to the Researcher Platform is controlled through a Virtual Private Network and data access control policies. The environment permits the implementation of various security mechanisms that can be applied depending on the privacy and security requirements. For example, it is equipped with network security and researcher data security features, like preventing one researcher from seeing another’s work.
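To make the idea of keeping one researcher's work hidden from another more concrete, here is a minimal sketch of a per-researcher path check. It is not the Researcher Platform's actual mechanism, which this post does not describe in detail. The workspace root, the `can_access` function, and the researcher IDs are all hypothetical names chosen for illustration.

```python
import posixpath

# Hypothetical layout: each researcher gets one directory under this root.
WORKSPACE_ROOT = "/workspaces"


def can_access(researcher_id: str, requested_path: str) -> bool:
    """Return True only if requested_path stays inside the researcher's own workspace.

    Illustrative assumption: isolation is expressed as a path-prefix rule.
    """
    # Reject IDs that could themselves escape the workspace root.
    if not researcher_id or "/" in researcher_id or researcher_id in (".", ".."):
        return False

    own_root = posixpath.join(WORKSPACE_ROOT, researcher_id)
    # Resolve relative paths against the researcher's root and collapse
    # any ".." segments before comparing prefixes. An absolute
    # requested_path replaces own_root entirely, so it is checked as-is.
    normalized = posixpath.normpath(posixpath.join(own_root, requested_path))
    return normalized == own_root or normalized.startswith(own_root + "/")


if __name__ == "__main__":
    print(can_access("alice", "notebooks/analysis.ipynb"))
    print(can_access("alice", "../bob/results.csv"))
```

Normalizing the path before the prefix comparison matters: without it, a request such as `../bob/results.csv` would pass a naive string check while pointing at another researcher's directory.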

Auditability

The Researcher Platform addresses auditability in two ways: First, it enables auditing of activities conducted in the environment. Second, it allows researchers to verify select data that Meta has released. On the former, the Researcher Platform incorporates the native logging and auditing support of its infrastructure components (e.g., S3 Access Logs and CloudTrail Events) to provide oversight of user-level actions. Furthermore, by routing data access through a proxy service and leveraging the logging frameworks of the supporting services, we can provide comprehensive audit capabilities for all user actions. In terms of researchers validating Meta’s work, Meta releases reports that make company activities more transparent. In select cases, these reports can be verified by researchers who analyze data in the Researcher Platform and cross-reference it with Meta’s transparency reports, such as the Widely Viewed Content Report.
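As a rough illustration of what auditing user-level actions from such logs could look like, the sketch below summarizes a handful of log records per user. It is an assumption-laden example, not the platform's audit pipeline. The records are invented sample data shaped like CloudTrail event records (using the `eventTime`, `eventName`, and `userIdentity` fields), and `summarize_actions` and `flag_actions` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Invented sample records, shaped like CloudTrail event records.
# The account ID and user names are placeholders, not real identities.
SAMPLE_EVENTS = [
    {"eventTime": "2022-01-10T14:02:11Z", "eventName": "GetObject",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:user/researcher-a"}},
    {"eventTime": "2022-01-10T14:05:40Z", "eventName": "GetObject",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:user/researcher-a"}},
    {"eventTime": "2022-01-10T14:07:02Z", "eventName": "PutObject",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:user/researcher-b"}},
    {"eventTime": "2022-01-10T14:09:30Z", "eventName": "DeleteObject",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:user/researcher-b"}},
]


def summarize_actions(events):
    """Count how many times each user performed each action."""
    summary = defaultdict(Counter)
    for event in events:
        user = event.get("userIdentity", {}).get("arn", "unknown")
        summary[user][event.get("eventName", "unknown")] += 1
    return {user: dict(counts) for user, counts in summary.items()}


def flag_actions(events, allowed_actions):
    """Return (time, user, action) for every action outside the allowed set."""
    return [
        (e["eventTime"], e["userIdentity"]["arn"], e["eventName"])
        for e in events
        if e["eventName"] not in allowed_actions
    ]


if __name__ == "__main__":
    print(summarize_actions(SAMPLE_EVENTS))
    # Hypothetical policy: researchers may read and write, but not delete.
    print(flag_actions(SAMPLE_EVENTS, {"GetObject", "PutObject"}))
```

A per-user summary like this supports routine oversight, while the flagged list narrows review to the actions a given policy does not expect.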

Case Study

We recently launched an early-access version of the Researcher API, which provides billions of historical and near-real-time data points from U.S. and EU public Pages, Groups, Events, and Post-level Facebook App data. It equips researchers to study a range of societal issues, such as the spread of misinformation, public health (COVID-19, vaccinations), climate change, and elections, as well as other emerging topics of interest. The Researcher API contains user data, so it is only accessible via the Researcher Platform to protect people’s information and facilitate the type of computational analyses that researchers require. By offering this data through the Researcher Platform, we can provide the most robust view of Facebook App activity to researchers while upholding user privacy.

Interested in applying for access to the Researcher Platform? Sign up here.

Read the full working paper to learn more about the Researcher Platform: “A Scalable Cloud-based Architecture to Deploy JupyterHub for Computational Social Science Research,” presented at the Practice & Experience in Advanced Research Computing (PEARC) Conference Series in July 2021.