Publications - Meta Research

October 26, 2021

Sangmin Lee, Zhenhua (Gerald) Guo, Omer Sunercan, Jun Ying, Thawan Kooburat, Suryadeep Biswal, Jun Chen, Kun Huang, Yatpang Cheung, Yiding Zhou, Kaushik Veeraraghavan, Biren Damani, Pol Mauri Ruiz, Vikas Mehta, Chunqiang Tang

Paper

Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications

Sharding is widely used to scale an application. Despite a decade of effort to build generic sharding frameworks that can...

Areas

Systems & Infrastructure

November 4, 2020

Scott Pruett, Kevin Doherty, Jinyu Han, Dmitri Petrov, Jim Carrig, John Hugg, Nathan Bronson

Paper

FlightTracker: Consistency across Read-Optimized Online Stores at Facebook

This paper introduces FlightTracker, a family of APIs and systems that now manage consistency for online access to Facebook’s graph. FlightTracker implicitly provides read-your-writes (RYW) consistency and can be explicitly used to provide alternative consistency guarantees for special use cases; it enables flexible communication patterns between caches, which we have found important as the number of datacenters increases; it extends the same consistency guarantees to cross-shard indexes and materialized views, allowing us to transparently optimize queries; and it provides a uniform primitive for clients to obtain desired consistency guarantees across a variety of data stores.

Areas

Databases, Systems & Infrastructure

November 4, 2020

Mahesh Balakrishnan, Jason Flinn, Chen Shen, Mihir Dharamshi, Ahmed Jafri, Santosh Ghosh, Hazem Hassan, Aaryaman Sagar, Rhed Shi, Jingming Liu, Filip Gruszczynski, Xianan Zhang, Huy Hoang, Ahmed Yossef, Francois Richard, Yee Jiun Song

Paper

Virtual Consensus in Delos

We propose virtualizing consensus by virtualizing the shared log API, allowing services to change consensus protocols without downtime. Virtualization splits the logic of...

Areas

Systems & Infrastructure

November 4, 2020

Chunqiang (CQ) Tang, Kenny Yu, Kaushik Veeraraghavan, Jonathan Kaldor, Scott Michelson, Thawan Kooburat, Aravind Anbudurai, Matthew Clark, Kabir Gogia, Long Cheng, Ben Christensen, Alex Gartrell, Maxim Khutornenko, Sachin Kulkarni, Marcin Pawlowski, Tuomas Pelkonen, Andre Rodrigues, Rounak Tibrewal, Vaishnavi Venkatesan, Peter Zhang

Paper

Twine: A Unified Cluster Management System for Shared Infrastructure

We present Twine, Facebook’s cluster management system, which has been running in production for the past decade. Twine has helped convert our infrastructure from a collection of siloed pools of customized machines dedicated to individual workloads into a large-scale shared infrastructure with fungible hardware.

Areas

Systems & Infrastructure

October 17, 2020

Sulav Malla, Qingyuan Deng, Zoh Ebrahimzadeh, Joe Gasperetti, Sajal Jain, Parimala Kondety, Thiara Ortiz, Debra Vieira

Paper

Coordinated Priority-aware Charging of Distributed Batteries in Oversubscribed Data Centers

The problem caused by simultaneous recharging of batteries in a data center has not been extensively studied, and no real-world solutions have been proposed in the literature. In this paper, we identify the problem due to battery recharging with case studies from Facebook’s data centers. We describe the solutions we have developed to coordinate charging of batteries without exceeding the circuit breaker power limit.

Areas

Systems & Infrastructure

October 29, 2019

David Chou, Tianyin Xu, Kaushik Veeraraghavan, Andrew Newell, Sonia Margulis, Lin Xiao, Pol Mauri Ruiz, Justin Meza, Kiryong Ha, Shruti Padmanabha, Kevin Cole, Dmitri Perelman

Paper

Taiji: Managing Global User Traffic for Large-Scale Internet Services at the Edge

We present Taiji, a new system for managing user traffic for large-scale Internet services that accomplishes two goals: 1) balancing the utilization of data centers and 2) minimizing network latency of user requests.

Areas

Systems & Infrastructure

October 31, 2018

Justin Meza, Tianyin Xu, Kaushik Veeraraghavan, Onur Mutlu

Paper

A Large Scale Study of Data Center Network Reliability

This paper fills the gap by presenting a large-scale, longitudinal study of data center network reliability based on operational data collected from the production network infrastructure at Facebook, one of the largest web service providers in the world.

Areas

Systems & Infrastructure

October 9, 2018

Kaushik Veeraraghavan, Justin Meza, Scott Michelson, Sankaralingam Panneerselvam, Alex Gyori, David Chou, Sonia Margulis, Daniel Obenshain, Ashish Shah, Yee Jiun Song, Tianyin Xu

Paper

Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently

We present Maelstrom, a new system for mitigating and recovering from datacenter-level disasters. Maelstrom provides a traffic management framework with modular, reusable primitives that can be composed to safely and efficiently drain the traffic of interdependent services from one or more failing datacenters to the healthy ones.

Areas

Systems & Infrastructure

March 28, 2018

Chang-Hong Hsu, Qingyuan Deng, Jason Mars, Lingjia Tang

Paper

SmoothOperator: Reducing Power Fragmentation and Improving Power Utilization in Large-scale Datacenters

With the ever-growing popularity of cloud computing and web services, Internet companies need more computing capacity to serve the demand. However, power has become a major factor limiting growth in the industry: it is often the case that no more servers can be added to datacenters without surpassing the capacity of the existing power infrastructure.

Areas

Systems & Infrastructure

October 28, 2017

Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Joe O’Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, Yee Jiun Song

Paper

Canopy: An End-to-End Performance Tracing and Analysis System

This paper presents Canopy, Facebook’s end-to-end performance tracing infrastructure. Using Canopy, Facebook engineers can query and analyze performance data in real time.

Areas

Systems & Infrastructure
