October 22, 2020

From academia to industry: How Facebook Engineer Jason Flinn started his journey in Core Systems

By: Meta Research

Partnering with university faculty helps us drive impactful, innovative solutions to real-world technology challenges. From collaborations to funding research through requests for proposals, working with academia is important to our mission of giving people the power to build community and bringing the world closer together.

Many members of our Facebook research community come from long and accomplished careers in academia. One example is Jason Flinn, a former professor at the University of Michigan. After an extensive academic career in software systems, including work that recently earned him a prestigious Test of Time award, Flinn became a Software Engineer on Facebook’s Core Systems, a team that performs forward-looking research in distributed systems and applies key systems architecture techniques at Facebook’s scale.

Flinn’s first industry collaboration with Facebook was with one of his PhD students, Mike Chow, who was a PhD intern at the time. This experience gave Flinn a preview of what it would be like to work in industry as a researcher. “I do my best work when I build systems that have real-world use,” he explains. “In my early career in mobile computing, I was the person using the system, and I learned the right research questions to ask from examining my own experiences. Today, with Core Systems, I have thousands of engineers using the systems that I am building, and I am learning the right research questions to ask from deploying these systems at scale.”

We sat down with Flinn to learn more about how he came to work at Facebook after a career in academia, the differences between industry and academia for someone in Core Systems, his current research projects, advice for those looking to follow a similar path, and more.

Q: Tell us about your experience in academia before joining Facebook.

Jason Flinn: Prior to joining Facebook, I was a professor at the University of Michigan for over two decades. My research interests over the years have been really varied. I’ve always enjoyed the opportunity afforded by computer science research to explore new topics and branch out into related subfields. My PhD dissertation looked at software power management and developed ways to extend the battery lifetime of mobile computers. I’ve returned to mobile computing throughout my career, developing distributed storage systems and investigating ways to improve wireless networking through better power management and strategic use of multiple networks. I was also fortunate to get involved in some of the earliest work in edge computing and vehicle-to-infrastructure networking. For another large part of my career, I studied topics in storage systems and distributed computing, including distributed file systems and software applications of speculative execution and deterministic replay.

Since joining Facebook, I have run into so many former students who now work for the company and who took classes with me at Michigan. This has been a great reminder that one of our primary contributions as academics is the impact we have on the students we teach.

Q: What has your journey with Facebook been like so far?

JF: When I was at the University of Michigan, I participated in a couple of joint research projects with Facebook engineers. In both cases, the collaborations were kicked off by discussions with my former PhD students who had joined Facebook as full-time engineers. One of my then-PhD students, Mike Chow, joined Facebook for an extended nine-month internship, and we jointly developed a tool called the Mystery Machine with Dan Peek (another former student), David Meisner, and Thomas Wenisch. The key insight in this paper was that we could use data at massive scale to learn the relationships and dependencies between execution points in software systems, without needing to annotate or fully instrument those systems by hand.
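To make that insight concrete, here is a minimal Go sketch of this style of causal inference. It illustrates the general elimination idea only, not the actual Mystery Machine implementation: hypothesize that every ordered pair of execution points may be causally related, then discard any hypothesis that some observed trace contradicts. The seeding heuristic and all names are illustrative assumptions.

```go
package main

import "fmt"

// pair is a hypothesized happens-before relation: First -> Second.
type pair struct{ First, Second string }

// inferHappensBefore seeds hypotheses from the orderings seen in the first
// trace, then eliminates every hypothesis that some trace contradicts
// (i.e., the supposed "later" point was observed first).
func inferHappensBefore(traces [][]string) map[pair]bool {
	hypotheses := make(map[pair]bool)
	if len(traces) > 0 {
		for i, a := range traces[0] {
			for _, b := range traces[0][i+1:] {
				hypotheses[pair{a, b}] = true
			}
		}
	}
	for _, t := range traces {
		pos := make(map[string]int) // execution point -> first position in trace
		for i, name := range t {
			if _, ok := pos[name]; !ok {
				pos[name] = i
			}
		}
		for h := range hypotheses { // deleting while ranging is safe in Go
			i, okA := pos[h.First]
			j, okB := pos[h.Second]
			if okA && okB && j < i {
				delete(hypotheses, h) // counterexample found: drop hypothesis
			}
		}
	}
	return hypotheses
}

func main() {
	traces := [][]string{
		{"recv", "parse", "render"},
		{"recv", "render", "parse"}, // parse and render are concurrent
	}
	for h := range inferHappensBefore(traces) {
		fmt.Printf("%s -> %s\n", h.First, h.Second)
	}
}
```

With enough traces, almost any spurious ordering eventually meets a counterexample, which is how data at massive scale can substitute for hand annotation.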

This paper received a lot of visibility when it was published at OSDI, and it has proved quite influential in showing the community the potential of applying machine learning and data at scale to software tracing and debugging. The collaboration was so successful that Mike did a subsequent internship with another of my former PhD students, Kaushik Veeraraghavan, resulting in the DQBarge paper at OSDI 2016.

In 2018, I was eligible for a sabbatical and looking for a change of pace. I wound up talking to Mahesh Balakrishnan about the Delos project he had recently started at Facebook, built around the idea of virtualizing consensus through a reconfigurable, distributed shared log. Delos offered me the chance to dive right in and design new, cutting-edge protocols, so I quickly jumped into the project. We were originally a small team of only four people, but within my first few months on the project, we were deploying our code at the heart of the Facebook control plane. After about nine months, I decided to join Facebook as a full-time employee.

Q: What are you currently working on?

JF: I’ve worked on two major projects at Facebook. The first is the Delos project mentioned above. Our team built a strongly consistent metadata store for Facebook control plane services, such as the container scheduler and resource allocation systems. Systems like this are notoriously complex and perilous to develop and deploy, often because they are a low-level building block on which all higher levels of the software stack depend.

One of the most fun parts of this project for me was the first time we swapped a new consensus protocol into production. We executed a single command, and the Delos virtualized architecture moved an entire data center to the new protocol with zero downtime and no fuss. I don’t think anything like this had ever been done before, so it felt like quite an achievement to see it happen. The team has leveraged virtualized consensus in lots of different ways since then: for example, deploying a point-in-time database restore capability, swapping protocols for Delos’s own internal metadata storage, and swapping between disaggregated and aggregated logs to help mitigate production issues.
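For readers curious what virtualized consensus looks like mechanically, here is a loose Go sketch of the idea as suggested by the public descriptions of Delos. The interface, method names, and toy in-memory loglet are illustrative assumptions, not Facebook’s actual API. The key move: the virtual log seals the old protocol’s log at a clean boundary and routes new appends to the replacement, which is what makes a zero-downtime swap possible.

```go
package main

import "fmt"

// Loglet is one pluggable shared-log implementation, standing in for a
// consensus protocol. Seal makes it permanently read-only so a successor
// can take over at a clean boundary. (Names are illustrative.)
type Loglet interface {
	Append(entry string) error
	Entries() []string
	Seal()
}

// memLoglet is a toy, single-threaded Loglet standing in for a real protocol.
type memLoglet struct {
	name   string
	sealed bool
	log    []string
}

func (l *memLoglet) Append(entry string) error {
	if l.sealed {
		return fmt.Errorf("loglet %s is sealed", l.name)
	}
	l.log = append(l.log, entry)
	return nil
}

func (l *memLoglet) Entries() []string { return l.log }
func (l *memLoglet) Seal()             { l.sealed = true }

// VirtualLog is the single log that clients see. Reconfigure seals the
// current loglet and installs a new one; readers still see one continuous log.
type VirtualLog struct {
	current Loglet
	history []Loglet // sealed predecessors, in order
}

func (v *VirtualLog) Append(entry string) error { return v.current.Append(entry) }

func (v *VirtualLog) Reconfigure(next Loglet) {
	v.current.Seal() // no more appends under the old protocol
	v.history = append(v.history, v.current)
	v.current = next // new appends use the new protocol
}

func (v *VirtualLog) ReadAll() []string {
	var out []string
	for _, l := range v.history {
		out = append(out, l.Entries()...)
	}
	return append(out, v.current.Entries()...)
}

func main() {
	v := &VirtualLog{current: &memLoglet{name: "protocol-A"}}
	_ = v.Append("op1")
	v.Reconfigure(&memLoglet{name: "protocol-B"}) // the "single command"
	_ = v.Append("op2")
	fmt.Println(v.ReadAll()) // [op1 op2], spanning both protocols
}
```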

My second project is called Hedwig. It unifies the delivery of large, hot data content to very large numbers of consumer machines distributed around the world. In both academic research and industry, there has been a lot of prior work on highly decentralized systems for delivering such content (BitTorrent is one example). Yet with the deployment of public and private clouds, there is an opportunity to reexamine this space and optimize such systems for a managed data center environment: one in which the system has greater visibility into network topology and resource availability, and in which we can also leverage highly reliable, well-maintained centralized services.

Hedwig aims to achieve the best of both worlds by pairing a simple, decentralized peer-to-peer data plane with a well-managed, centralized control plane. Hedwig is also designed to be highly customizable through flexible policy modules in its centralized control plane. These modules let Hedwig apply different caching, routing, and failure-handling policies to different use cases, which in turn lets us easily optimize Hedwig for different workloads and services.
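As a rough illustration of that design, the Go sketch below shows how per-use-case policy modules might plug into a centralized planner. The interface, method names, and policies are hypothetical stand-ins, not Hedwig’s actual code.

```go
package main

import "fmt"

// DistributionPolicy bundles the per-use-case decisions described above:
// how requests are routed, how peers cache, and how failures are handled.
// (Interface and names are illustrative, not Hedwig's actual API.)
type DistributionPolicy interface {
	PickPeers(object string, candidates []string) []string // routing
	CacheTTLSeconds() int                                   // caching
	RetriesOnFailure() int                                  // failure handling
}

// bulkPolicy favors throughput: fan out widely, cache for a long time.
type bulkPolicy struct{}

func (bulkPolicy) PickPeers(_ string, c []string) []string { return c }
func (bulkPolicy) CacheTTLSeconds() int                    { return 3600 }
func (bulkPolicy) RetriesOnFailure() int                   { return 1 }

// latencyPolicy favors fast delivery: few nearby peers, aggressive retries.
type latencyPolicy struct{}

func (latencyPolicy) PickPeers(_ string, c []string) []string {
	if len(c) > 2 {
		return c[:2] // nearest two peers only
	}
	return c
}
func (latencyPolicy) CacheTTLSeconds() int  { return 60 }
func (latencyPolicy) RetriesOnFailure() int { return 3 }

// ControlPlane is the centralized piece: it knows topology and chooses a
// policy per use case; the peer-to-peer data plane just follows its answers.
type ControlPlane struct {
	policies map[string]DistributionPolicy // use case -> policy
}

func (cp *ControlPlane) Plan(useCase, object string, peers []string) []string {
	p, ok := cp.policies[useCase]
	if !ok {
		p = bulkPolicy{} // conservative default
	}
	return p.PickPeers(object, peers)
}

func main() {
	cp := &ControlPlane{policies: map[string]DistributionPolicy{
		"model-weights":  bulkPolicy{},
		"config-updates": latencyPolicy{},
	}}
	peers := []string{"rack1", "rack2", "rack3"}
	fmt.Println(cp.Plan("model-weights", "weights.bin", peers)) // [rack1 rack2 rack3]
	fmt.Println(cp.Plan("config-updates", "app.conf", peers))   // [rack1 rack2]
}
```

The design choice worth noting is that the data plane stays simple and uniform; the use-case-specific intelligence lives in swappable policy objects on the control plane.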

Q: What’s the difference between working in systems at Facebook versus in academia?

JF: I have always admired the industry papers from early SOSP conferences that described experiences building production software systems (Unix, AFS, etc.). What makes these papers great is that they not only contain big ideas, but also combine those ideas with practical lessons and observations that come from deploying and using the systems. Reading the papers, I can feel how deploying the systems helped the authors understand what was most important and innovative about their work (for example, the simplicity of the Unix interface, or the concept of scalability in the AFS paper that was decades ahead of its time).

Working in Core Systems gives me the opportunity to replicate some of the ingredients that helped make these papers so great. In academia, my focus was on writing papers and working with my students. My students and I built systems to validate our ideas, and together we might write several papers about a particular system as we were developing the ideas. At Facebook Core Systems, my focus has been first on building the systems, deploying them at scale, and learning from them. I can let the systems bake and evolve over time before writing a paper that describes what we did. This process leads to fewer papers, but I hope it also leads to stronger papers like the early industry papers I admire.

We followed this path with our Delos paper that’s appearing at OSDI this year, and I hope to take a similar approach to describing my current work on Hedwig.

Q: You recently earned a Test of Time award for your work on energy-aware adaptation in mobile applications. What influenced this research?

JF: It’s often said that asking the right questions is the hardest part of research, and I think that was especially true here. It was all about being in the right place at the right time.

I was really fortunate to attend grad school at Carnegie Mellon when they had just deployed the first campus-wide wireless network. This gave me the opportunity to take my laptop outside and work with an actual internet connection. (Although hard to imagine today, this was incredibly novel at the time.) One of the first things I noticed was that my laptop battery would quickly die. That was the “aha!” moment: reducing energy usage was going to be vital for any type of mobile computer. This led to all sorts of interesting questions: Can we measure energy usage and attribute that energy to the software running on the computer? What types of strategies can software employ to extend the battery lifetime of the computer? Can the operating system adapt the behavior of the software to optimize for energy savings or quality/performance?

Q: For someone in academia curious about collaborating with or working at Facebook, where would you recommend they start?

JF: My best collaborations (both with Facebook and elsewhere) have involved sending a student to work directly with industry teams for a period of time (i.e., an internship) or working directly on the project myself (e.g., on sabbatical or for a few hours every week). My Facebook collaborations started out with long conversations with Facebook engineers at conferences where we would kick a bunch of ideas around. The final project wound up in the same general area as these conversations, but it was really the process of embedding with a Facebook team that generated the best research directions.

Working with Facebook, there is a tremendous opportunity to collect real-world systems measurements at scale to validate ideas. It’s important to make the most of this opportunity during any collaboration.

I also learned to budget some time after any internship or sabbatical to work on the idea back in academia, where one can build a smaller-scale replica, and tweak and measure the system in ways that are not possible in a production system. Combining these two styles of research can result in really strong work.