This paper presents Facebook’s risk-driven backbone management strategy to ensure high service performance throughout the COVID-19 pandemic.
This paper presents Facebook’s risk-driven backbone management strategy to ensure high service performance throughout the COVID-19 pandemic.
In this paper, we present Facebook’s BGP-based data center routing design and how it marries data center’s stringent requirements with BGP’s functionality. We present the design’s significant artifacts, including the BGP Autonomous System Number (ASN) allocation, route summarization, and our sophisticated BGP policy set.
In this paper we provide a perspective of the scale of Internet traffic growth and how well the Internet coped with the increased demand as seen from Facebook’s edge network.
In this paper, we leverage different components of the end-to-end networking infrastructure to prevent or mask any disruptions in face of releases. Zero Downtime Release is a collection of mechanisms used at Facebook to shield the end-users from any disruptions, preserve the cluster capacity and robustness of the infrastructure when updates are released globally.
In this paper, we execute a large-scale, 10-day measurement study of metrics at the TCP and HTTP layers for production user traffic at all of Facebook’s PoPs worldwide, collecting performance measurements for hundreds of trillions of sampled HTTP sessions.
We present FBOSS, our own data center switch software, that is designed with the basis on our switch-as-a-server and deploy-early-and-iterate principles.
This paper presents a system that uses a set of paths computed using Räcke’s oblivious routing algorithm, as well as a centralized controller to dynamically adapt sending rates. Although oblivious routing and centralized TE have been studied previously in isolation, their combination is novel and powerful.
In this paper we propose Janus, a system which makes two major contributions to network policy abstractions. First, we extend the prior policy graph abstraction model to represent complex QoS and dynamic stateful/temporal policies. Second, we convert the policy configuration problem into an optimization problem with the goal of maximizing the number of satisfied and configured policies, and minimizing the number of path changes under dynamic environments.
In this study, we explore the fine-grained behaviors of a large production data center using extremely highresolution measurements (10s to 100s of microsecond) of rack-level traffic. Our results show that characterizing network events like congestion and synchronized behavior in data centers does indeed require the use of such measurements.
In this article, we present our passive hybrid approach that combines network path information with end-host-based statistics to rapidly detect and pinpoint the location of datacenter network faults inside a production Facebook datacenter.