In February 2022, Meta launched the Silent Data Corruptions at Scale request for proposals (RFP). Today, we are announcing the winners of these awards.
Within this novel research domain, we identify a wide range of research opportunities, such as architectural solutions to data corruption; fleetwide testing strategies and distributed computing resiliency models; software and library resiliency; and silicon-level design, simulation, and manufacturing approaches.
The RFP attracted 62 proposals from 54 universities and institutions around the world. Thank you to everyone who took the time to submit a proposal, and congratulations to the winners.
Principal investigators are listed first unless otherwise noted.
Hardware failures root causing: Harnessing microarchitectural modeling
Dimitris Gizopoulos (National and Kapodistrian University of Athens)
Lightweight in-production SDC detection tools inspired by coding theory
Rashmi Vinayak (Carnegie Mellon University)
Quarantine and vaccination framework for SDC mitigation at-scale
Devesh Tiwari (Northeastern University)
Software-hardware strategies for enhancing ML application resilience
Prashant Nair (University of British Columbia), Karthik Pattabiraman (University of British Columbia), Sathish Gopalakrishnan (University of British Columbia)
Testing for corrupt execution errors
Caroline Trippel (Stanford University)