To learn more about this new RFP, we reached out to Engineering Director Sriram Sankar. In this Q&A, Sankar discusses what research on silent data corruptions at scale looks like at Meta, the goal of this RFP, the inspiration behind it, and how it fits into the bigger picture of this research at Meta.
Q: What is your role at Meta, and what does your team do?
Sriram Sankar: Meta applications like Facebook, WhatsApp, and Instagram are hosted on a large-scale server infrastructure situated across global data centers. Collectively, the teams that I support are responsible for the hardware availability of all the servers running in Meta’s fleet. We have teams that work on hardware, firmware, data science, tooling, and large-scale labs for new hardware experimentation. We see some of the most challenging server issues at scale, from early-stage hardware to hard-to-diagnose issues in servers running in live production. People on the team come from different backgrounds and domains, with expertise across hardware design, data centers, distributed systems, software development, data science, operations research, and more.
Q: What does research on silent data corruptions at scale look like at Meta?
SS: We have a team of engineers who are super-detectives, and their role is to take some of the toughest problems from our large-scale infrastructure and solve them from the initial clues. Silent data corruption research at Meta started out similarly. When we first came across bitflips and data corruptions, they were dismissed as a rare, one-in-a-million occurrence. However, from our fleet statistics, we observed that silent data corruptions are a one-in-a-thousand occurrence, which raised serious concern. Everyone in the industry believed this to be a rare occurrence, and we had to challenge existing beliefs and show how these errors were impacting application behavior. We looked at our full stack, all the way from applications down to assembly-level reproducers, and ran detection mechanisms across our entire fleet (equivalent to finding a needle in a haystack, but doing so reliably while the needle moves around at every turn).
We worked across the industry with our partners and shared the results of over three years of work in our infrastructure in this paper and blog. I am extremely proud of the team for challenging existing notions of hardware computational accuracy, partnering with industry, and pushing the boundaries of research in this emerging domain.
Q: What’s the goal of this RFP?
SS: The goal of this RFP is to tackle the challenge that silent data corruption poses to both hardware and software domains, by joining forces with academia and industry. The importance of AI and customized computing in today’s applications means that hardware design is going to be critical for several companies. We want to stress the importance of accuracy in the early stages of silicon development and design architecture.
More critically, large-scale applications are experiencing silent data corruptions at a scale that is not well understood in the industry. We have a wealth of data and real-world experience to share that we hope will inspire and challenge curious researchers to delve deeper into this domain.
We would like to learn from academia about novel approaches to counter and mitigate the impact of silent data corruptions. We believe that computational accuracy from silicon to applications should be the cornerstone of large-scale computing, and industry and academia should come together to solve this challenge.
Q: What inspired this RFP?
SS: To detect silent data corruptions at our scale, we had to adopt several approaches across the large number of servers in our fleet. Even with such extensive reliability data, it was an uphill task for us to challenge common industry notions. In academia, it can be challenging to access this kind of data set and to identify which aspects of real-world problems are most critical. We began this RFP as a way to bridge this gap and enable collaboration with academia.
We also do not believe that silent data corruption is only a hyperscale problem. The rate of occurrence means that most companies might face this issue. However, typical approaches to fault tolerance and redundancy can be cost-prohibitive for several companies. We believe that innovative approaches from different domains in academia can be powerful in solving this emerging problem across the larger industry.
Q: How does this RFP fit into the bigger picture of silent data corruptions at scale research at Meta?
SS: As new hardware and software approaches evolve, we believe that silent data corruptions are going to become critical as a first-order computational accuracy problem. Hardware designs will consider them at the initial stages and build solutions to counter them from the silicon level. Software libraries and architectures will account for the probability of silent data corruption and provide different methods that services can adopt to tackle this issue.
Our end goal is to prevent silent data corruptions from impacting customers who use large-scale services. While we have invested a lot in building hardware, system, and software approaches to counter silent data corruptions within Meta, we believe that the industry can evolve to make this a common expectation across hardware and software alike. This RFP will enable academia to partner with us and to define new areas of research that will be impactful in this domain.
Q: Where can people stay updated and learn more?
SS: Visit the RFP page to stay updated on the progress of the RFP. To receive email notifications about our new research awards and proposal deadlines, subscribe to our email newsletter.