Today, Meta launched its request for proposals (RFP) in networking for AI at NSDI 2022. With this RFP, which closes on May 23, 2022, at 5:00 p.m. AoE (Anywhere on Earth), the Network AI team at Meta aims to fund proposals that focus on all aspects of networking for AI/machine learning (ML) applications, including improving the performance, efficiency, resiliency, and operations of ML/high-performance computing (HPC) clusters. More information about RFP timing, eligibility, and proposal requirements is available at the link below.
To learn more about this new RFP, we reached out to the Engineering Director supporting Meta’s Network AI team, Shashi Gandham. In this Q&A, Gandham discusses what networking for AI looks like at Meta, the types of research that Meta is interested in, the goal of and inspiration for this RFP, and the larger context of data center networking at Meta.
Q: What is your role at Meta, and what does your team do?
Shashi Gandham: ML and AI applications are some of the fastest-growing and most demanding services that run on Meta’s infrastructure. I support the Network AI team, which is dedicated to designing, building, and operating the network for these AI applications and is involved at every layer of the clusters. We have people focusing on the networking fabric, and we also have people working on the host. On the host side, we look at the transport and NIC parts of the pipeline and work closely with the application stack, such as the communication libraries/collectives and PyTorch. Up and down the stack, we do a lot of analysis, looking for optimization opportunities and ways to improve operations. Finally, we also look ahead to future cluster designs, working with Meta’s broader hardware teams and the software stack.
Q: What sort of research in networking for AI is Meta interested in?
SG: There is a wide range of research domains that we’re interested in seeing, especially as we cover the entire stack of the HPC clusters as well as their whole life cycle. Here’s a quick overview:
- New network interconnect architectures. This could include any new data center topologies or interconnects that address the scalability and very high bandwidth requirements introduced by AI workloads. Bandwidth requirements could range from hundreds of gigabits to dozens of terabits per second per accelerator, and proposals could cover topology designs as well as the low-power, high-bandwidth interconnect technologies themselves. A systems perspective would be useful, looking at codesign of hardware and software to ensure that the benefits of these interconnects are realizable end to end.
- Hardware computational offloading for AI workloads. This could include offloading and accelerating AI compute and inference through programmable switches, SmartNICs, and other novel hardware/software codesign techniques at the network layer.
- Novel end-to-end transport designs for distributed AI training. These could include tackling transport layer challenges for computational fabrics that use very high bandwidth, low-latency interconnects. The techniques could include but would not be limited to HPC fabric technologies such as Omni-Path, InfiniBand, and RoCE, as well as other optimizations on top of these transport solutions (or comparable alternatives).
- Scheduling, resource allocation, communication collectives, and network joint optimization. These could include joint optimization of AI workloads and the network for resource allocation and dynamic scheduling, as well as better integration between network control and the collective layer.
- And more! If you have research proposals that improve the performance, efficiency, resiliency, or operations of ML/HPC clusters but don’t fall into the categories above, we’re interested in hearing from you.
Q: What’s the goal of this RFP and what inspired it?
SG: We wanted to kick off our engagement with academia and research institutions in the networking for AI space. Most of our networking RFPs to date have focused on networking for general services and for Meta’s family of applications (e.g., Facebook, Instagram, WhatsApp). Because networking plays such a critical role in designing high-performance clusters for machine learning, we wanted to dedicate an RFP to this. We’re excited because there are so many different research areas to pursue.
Q: How does this RFP relate to the broader Networking and Communications research agenda at Meta?
SG: Meta’s networking teams have engaged in many collaborations with professors and research groups over the years. These are definitely mutually beneficial, yielding important insights into Meta’s network through close partnerships with professors and students. We’ve published research together on the core of our data center networks (trends, protocols, and triage), ways to make a more resilient backbone, understanding and improving performance from the edge for the billions of people using Meta’s applications, and much more, including transport and NIC areas.
The networking for AI RFP sits within this broader networking work at Meta, and it is especially aligned with our data center network research and AI research. Collaborators in networking for AI will work with our dedicated Network AI team as well as other teams in data center networking and general AI infrastructure.
Q: Where can people stay updated and learn more?
SG: You can stay updated by subscribing to our RFP newsletter. Any updates will be reflected on the RFP page, and winners will be announced on our blog and the RFP page as well.