In this monthly interview series, we turn the spotlight on members of the academic community and the important research they do — as thought partners, collaborators, and independent contributors.
For May, we nominated Tianyin Xu, a visiting scientist from the University of Illinois at Urbana-Champaign (UIUC). Before starting his professorship at UIUC, Xu joined Facebook’s Core Systems Disaster Recovery team in order to explore real-world systems applications. Visiting scientists are short-term employees (STEs) sponsored by research teams, and the positions are posted on the Facebook Careers page.
In this Q&A, Xu shares his experience as a visiting scientist at Facebook, discusses the research projects he’s worked on so far, and offers advice for academics thinking about spending some time in industry.
Q: Tell us about your role at UIUC and the type of research you and your department specialize in.
Tianyin Xu: I’m an assistant professor in the computer science department at UIUC. My research interests are broadly in computer systems, with a focus on software and system reliability. I’m particularly interested in computer systems being operated at the cloud and data center scale.
UIUC has a very strong, active computer science department, with more than a hundred faculty members. With such a big department, we have a strong presence in pretty much every field of computer science.
Q: What inspired you to spend some time at Facebook Core Systems at the beginning of your professorship?
TX: Taking a 6–12 month stint (a so-called prebbatical) is a common practice for new assistant professors of computer science nowadays. I also liked the idea — it would help me take a break to be physically and mentally ready, and, more important, would allow me to spend time thinking about the type of research I would like to do for my faculty job.
As a PhD graduate with a faculty job lined up, I was looking for an environment drastically different from that of an ivory tower. Particularly, I was seeking opportunities that allowed me to step into real-world large-scale systems and to understand the important problems that truly matter. I believed such experiences would be invaluable for my growth as a systems researcher. For example, a key question I always seek answers for is “Why do existing systems still fail in practice, despite the rigorous software engineering process and the wide adoption of reliability techniques?” Answers to such questions open doors for me to think clearly and to make relevant technical contributions; however, it is challenging to accurately and comprehensively answer such questions in a purely academic environment.
Facebook Core Systems provides a fantastic environment, where I can have firsthand experience with large-scale production systems and develop a deep, comprehensive understanding of real-world challenges. The open culture lets me access almost all the resources and encourages me to connect with researchers and engineers with diverse expertise and experiences. One really special thing I find is the incredibly flat organization: everyone sits in the same open space and is close to one another, no matter whether they’re a VP, a director, or a level-3 engineer. I used to constantly look folks up, walk to their desks, ask them questions, and have great conversations.
Q: What is it like being a visiting scientist at Facebook?
TX: The position provides the luxury to understand large-scale distributed systems from the inside out, while thinking about fundamental research problems. Very few jobs provide both at the same time. I had a wonderful experience: I learned a huge amount (much of which can never be learned in an ivory tower), did really interesting research, had a lot of fun, built strong connections, made very close friends, and ate too much gourmet (and free!) food.
Q: What research projects have you worked on?
TX: I worked on two infrastructure systems, Maelstrom and Taiji. Maelstrom is a system for mitigating data center–level disasters by draining interdependent traffic safely and efficiently, and Taiji is a system for managing global user traffic for large-scale internet services at the edge. We later published the two systems at premier computer systems conferences, with Maelstrom published at OSDI 2018 and Taiji at SOSP 2019.
One question I frequently received is why I didn’t choose to work on configuration management systems. Configuration management was my PhD thesis topic, and it was what connected me to Facebook researchers (I met CQ Tang and the Configerator team at SOSP 2015, where they published the paper “Holistic configuration management at Facebook”). In fact, I always thought I would join the Configerator team.
When I finally showed up in Menlo Park in October 2017, CQ suggested that I meet a few teams at Core Systems to explore more potential collaboration opportunities. In one of those meetings, I talked to Kaushik Veeraraghavan and Justin Meza from the Disaster Recovery (DR) team. Kaushik threw me an incredibly intriguing research problem: What can we do when an entire data center is failing (for example, due to fiber cuts)? I had no answer, as all the reliability techniques I had in mind could not handle such widespread failures at that scale. That was the problem Maelstrom tried to address.
When I joined the DR team, my initial plan was to switch to a new team after six months (so I could see different systems and research problems). However, I ended up spending my entire prebbatical on the DR team because I enjoyed the work and my colleagues so much.
Q: What is the impact of your STE experience on your research and teaching?
TX: This is also a common question I get! There are too many impacts to list; they would overflow this interview. So, let me give some examples.
Doing a PhD is more about depth. I worked on one research problem (misconfiguration detection and prevention) and dug deep into this topic to earn a PhD. However, upon graduating, I found myself being very narrowly focused, as I only knew my thesis topic. I asked myself, “How can I be a professor who is supposed to have broad knowledge?” Yes, I did read papers on many other systems topics, but I often found it hard to get to the bottom of the problems from reading papers.
The STE experience helped me develop a direct, holistic understanding of many real-world problems and answered many of my questions/doubts. Furthermore, my work on the Disaster Ready team pushed me to understand various types of production systems and how each system fits in place (our mission was to make every system at Facebook disaster-ready). Based on my understanding and experience, I later created a new course at UIUC entitled “Reliability of Cloud-Scale Systems.” It was a success. The course was ranked as “excellent,” and one major praise was the relevance and importance of the materials.
The STE experience also greatly benefits my research. In particular, it helps me think much deeper about the practicality of my work, which I care deeply about. For example, I took some time to rethink my PhD work on configuration management based on the configuration-related failures at Facebook. The rethinking and reflection led to my recent project, configuration testing (also known as ctest), which is a more practical technique to defend against misconfigurations and prevent production failures. The work was published at OSDI 2020 and is supported by the Facebook Distributed Systems research award.
Q: What advice would you give to university researchers looking to become visiting scientists at Facebook?
TX: I’ve internally shared my experience transitioning from a PhD student to a Facebook engineer because I learned a lot. It was not easy in the beginning. At the time, I suddenly found that I was no longer good at my work and lacked many skills. I later changed the way I worked at Facebook, became effective, and started enjoying myself. Here is what I learned:
Q: Where can people learn more about your research?
TX: People can find more information on my website. If anyone ever wants to discuss anything about my research, they can always feel free to reach out.