Today, Meta launched its request for proposals (RFP) in networking for AI at NSDI 2022. With this RFP, which closes on May 23, 2022, at 5:00 p.m. AoE, the Network AI team at Meta aims to award proposals that focus on all aspects of networking for AI/machine learning (ML) applications, including improving the performance, efficiency, resiliency, and operations of ML/high-performance computing (HPC) clusters. More information about RFP timing, eligibility, and proposal requirements is available at the link below.
To learn more about this new RFP, we reached out to the Engineering Director supporting Meta’s Network AI team, Shashi Gandham. In this Q&A, Gandham discusses what networking for AI looks like at Meta, the types of research that Meta is interested in, the goal of and inspiration for this RFP, and the larger context of data center networking at Meta.
Q: What is your role at Meta, and what does your team do?
Shashi Gandham: ML and AI applications are some of the fastest-growing and most demanding services that run on Meta’s infrastructure. I support the Network AI team, which is dedicated to designing, building, and operating the network for these AI applications and is involved at every layer of the clusters. We have people focusing on the networking fabric, and we also have people working on the host. On the host side, we look at the transport and NIC parts of the pipeline and work closely with the application stack, such as the communication libraries/collectives and PyTorch. Up and down the stack, we do a lot of analysis, looking for optimization opportunities and ways to improve operations. Finally, we also look ahead to future cluster designs, working with Meta’s broader hardware teams and the software stack.
Q: What sort of research in networking for AI is Meta interested in?
SG: There is a wide range of research domains that we’re interested in seeing, especially as we cover the entire stack of the HPC clusters as well as their whole life cycle. Here’s a quick overview:
Q: What’s the goal of this RFP and what inspired it?
SG: We wanted to kick off our engagement with academia and research institutions in the space of networking for AI. Our networking RFPs to date have largely focused on networking for general services and for Meta’s family of applications (e.g., Facebook, Instagram, WhatsApp). Because networking plays such a critical role in designing high-performance clusters for machine learning, we wanted to dedicate an RFP to this. We’re excited because there are so many different research areas to pursue.
Q: How does this RFP relate to the broader Networking and Communications research agenda at Meta?
SG: Meta’s networking teams have engaged in many collaborations with professors and research groups over the years. These are definitely mutually beneficial, yielding important insights into Meta’s network through close partnerships with professors and students. We’ve published research together on the core of our data center networks (trends, protocols, and triage), ways to make a more resilient backbone, understanding and improving performance from the edge for the billions of people using Meta’s applications, and much more, including transport and NIC areas.
The networking for AI RFP is situated in the midst of this broader networking work at Meta, and it is especially aligned with the data center network research and AI research. Collaborators in networking for AI will be involved with our dedicated Network AI team as well as other teams in data center networking and general AI infrastructure.
Q: Where can people stay updated and learn more?
SG: You can stay updated by subscribing to our RFP newsletter. Any updates will be reflected on the RFP page, and winners will be announced on our blog and the RFP page as well.