July 22, 2021

Automating root cause analysis for infrastructure systems

By: Zhichao Wang, Chengjun Zhu, Sourav Chatterjee, Jeffrey Handler, Weijie Yuan, Jun Gao, Zhihui Xie, Dawei Li, Hechao Sun, Alex Kalinin, Xin Fu

What the research is:

Facebook products run on a highly complex infrastructure system that consists of servers, network, back-end services, and client-facing software. Operating such systems at a high level of performance, reliability, and efficiency requires real-time monitoring, proactive failure detection, and prompt diagnosis of production issues. While a number of research efforts and applications have addressed the need for monitoring through state-of-the-art anomaly detection, the diagnosis of root causes remains a largely manual and time-consuming process. Modern software systems can be so complex that unit/integration testing and error logs alone are not humanly tractable for root causing. Triaging an alert, for instance, would require manually examining a mixture of structured data (e.g., telemetry logging) and unstructured data (e.g., code changes, error messages).

The Infrastructure Data Science team at Facebook is developing a unified framework of algorithms, as a Python library, to tackle such challenges (see Figure 1). In this blog post, we illustrate applications of root cause analysis (RCA) drawn from large-scale infrastructure systems, and discuss opportunities for applying statistics and data science to introduce new automation in this domain.

Figure 1. RCA methodologies and applications to infrastructure problems

How it works:

I. Attributing ML performance degradation to data set shift

Machine learning is an important part of Facebook products: It helps recommend content, connect new friends, and flag integrity violations. Feature shifts caused by corrupted training/inference data are a typical root cause of model performance degradations. We are investigating how to attribute a sudden change of model accuracy to the shifting data distributions. Machine learning models usually consume complex features, such as images, text, and high-dimensional embeddings as inputs. We apply statistical methods to perform changepoint detection on these high-dimensional features, and build black-box attribution models, agnostic of the original deep learning models, to attribute model performance degradation to feature and label shifts. See Figure 2 for an example of exposing shifted high-dimensional embedding features between two model training data sets. The methodology is also applicable to explaining accuracy degradations of an older model whose training data distribution differs from the inference data set.

Figure 2. An example of a sudden drastic data set shift in high-dimensional embedding features. Two-dimensional projections of the embeddings (using t-SNE) before and after the shift are visualized. This example, shown as an illustration using synthetic data, is similar to shifts observed in production settings.
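To make the idea of detecting a shift in embedding features concrete, here is a minimal sketch of a two-sample permutation test on the distance between embedding centroids. This is a simple stand-in technique, not the changepoint detection or attribution models described above, and the function names, parameters, and synthetic data are all assumptions made for illustration.

```python
import math
import random


def centroid(rows):
    """Return the per-dimension mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def centroid_distance(a, b):
    """Euclidean distance between the centroids of two samples."""
    return math.dist(centroid(a), centroid(b))


def permutation_shift_test(before, after, n_permutations=200, seed=0):
    """Permutation test for a mean shift between two embedding samples.

    Returns (observed_distance, p_value). A small p-value suggests the
    two samples were not drawn from the same distribution.
    """
    rng = random.Random(seed)
    observed = centroid_distance(before, after)
    pooled = list(before) + list(after)
    n_before = len(before)
    exceed = 0
    for _ in range(n_permutations):
        # Randomly reassign rows to the two groups and recompute the distance.
        rng.shuffle(pooled)
        if centroid_distance(pooled[:n_before], pooled[n_before:]) >= observed:
            exceed += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_permutations + 1)


def make_embeddings(n, dim, mean, rng):
    """Hypothetical synthetic embeddings: independent Gaussian coordinates."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


data_rng = random.Random(42)
before = make_embeddings(200, 8, 0.0, data_rng)
unshifted = make_embeddings(200, 8, 0.0, data_rng)
shifted = make_embeddings(200, 8, 1.0, data_rng)

dist_same, p_same = permutation_shift_test(before, unshifted)
dist_shift, p_shift = permutation_shift_test(before, shifted)
print(f"unshifted: distance={dist_same:.2f}, p={p_same:.3f}")
print(f"shifted:   distance={dist_shift:.2f}, p={p_shift:.3f}")
```

A centroid-based statistic only catches shifts in the mean. Shifts in variance or cluster structure would need a different statistic, and attributing an accuracy drop to a specific feature is a separate step that this sketch does not attempt.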

II. Automatic diagnosis of key performance metric degradation

Infrastructure systems are monitored in real time, which generates a large amount of telemetry data. Diagnostic workflows usually start with drill-down data analysis, e.g., running analytical data queries to find which country, app, or device type shows the largest week-over-week reliability drop. Such insights can point the on-call engineer in the right direction for further investigation. We experiment with dynamic programming algorithms that can automatically traverse the space of these subdimensions. We also try to fit a predictive model on the data set of metrics and dimensions, and identify interesting dimensions by looking at feature importance. With the help of such tools, the time spent on repetitive analytical tasks is reduced.
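The drill-down step can be illustrated with a short sketch that scans every single-dimension slice and ranks the slices by week-over-week drop in success rate. This is a brute-force scan over one dimension at a time, not the dynamic programming traversal mentioned above, and the row schema (`week`, `successes`, `requests`), the function name, and the sample numbers are assumptions made for illustration.

```python
from collections import defaultdict


def rank_dimension_slices(rows, dimensions):
    """Rank (dimension, value) slices by week-over-week success-rate drop.

    Each row is a dict holding the dimension columns plus 'week'
    ('prev' or 'curr'), 'successes', and 'requests'.
    Returns [((dimension, value), drop), ...], largest drop first.
    """
    totals = defaultdict(lambda: {"prev": [0, 0], "curr": [0, 0]})
    for row in rows:
        for dim in dimensions:
            counts = totals[(dim, row[dim])][row["week"]]
            counts[0] += row["successes"]
            counts[1] += row["requests"]

    ranked = []
    for key, weeks in totals.items():
        prev_ok, prev_n = weeks["prev"]
        curr_ok, curr_n = weeks["curr"]
        if prev_n == 0 or curr_n == 0:
            continue  # Slice missing in one of the weeks; nothing to compare.
        ranked.append((key, prev_ok / prev_n - curr_ok / curr_n))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


def make_row(week, country, device, successes, requests=1000):
    return {"week": week, "country": country, "device": device,
            "successes": successes, "requests": requests}


# Hypothetical telemetry: reliability drops mostly for BR, worst on android.
rows = [
    make_row("prev", "US", "ios", 990), make_row("prev", "US", "android", 990),
    make_row("prev", "BR", "ios", 990), make_row("prev", "BR", "android", 990),
    make_row("curr", "US", "ios", 990), make_row("curr", "US", "android", 990),
    make_row("curr", "BR", "ios", 980), make_row("curr", "BR", "android", 900),
]

ranked_slices = rank_dimension_slices(rows, ["country", "device"])
for (dim, value), drop in ranked_slices:
    print(f"{dim}={value}: drop={drop:.3f}")
```

A single-dimension scan misses regressions that only show up in a combination of dimensions (for example, one country on one device type), which is why a traversal over subdimensions is attractive in practice.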

Another diagnostic task is to examine which correlated telemetry metrics may have caused the key performance metric degradation. For instance, when latency of a service spikes, its owner may manually browse through the telemetry metrics of (sometimes a large number of) dependent services. Simple automations such as setting up anomaly detection for every metric can lead to noisy, false-positive discoveries. A better approach, shown in Figure 3, is to learn from historical data about the temporal correlations between suspect metrics and the key performance metric, and tease out real root causes from spuriously correlated anomalies.

Figure 3. Methodology for evaluating and rank-ordering potential root-causing factors.
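One ingredient of such a ranking, temporal correlation between a suspect metric and the key performance metric, can be sketched with a lagged Pearson correlation. This is only an illustration under simple assumptions (equal-length, evenly sampled series; the suspect leads the key metric by a fixed lag), and it is not the methodology in Figure 3. All names and series values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lagged_correlation(suspect, kpi, max_lag):
    """Return (lag, r) maximizing |r| between suspect[t] and kpi[t + lag]."""
    best_lag, best_r = 0, 0.0
    for lag in range(max_lag + 1):
        r = pearson(suspect[:len(suspect) - lag], kpi[lag:])
        if abs(r) > abs(best_r):
            best_lag, best_r = lag, r
    return best_lag, best_r


def rank_suspects(suspects, kpi, max_lag=3):
    """Rank suspect metrics by their strongest lagged correlation with kpi."""
    scored = [(name, *best_lagged_correlation(series, kpi, max_lag))
              for name, series in suspects.items()]
    return sorted(scored, key=lambda item: abs(item[2]), reverse=True)


# Hypothetical series: latency follows upstream errors two steps later.
latency = [10, 10, 10, 10, 10, 15, 10, 10, 10, 10, 16, 10]
suspects = {
    "cpu_utilization": [3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4],
    "upstream_errors": [0, 0, 0, 5, 0, 0, 0, 0, 6, 0, 0, 0],
}

ranked_suspects = rank_suspects(suspects, latency)
for name, lag, r in ranked_suspects:
    print(f"{name}: lag={lag}, r={r:.2f}")
```

High correlation alone does not establish causation. Separating real root causes from spuriously correlated anomalies needs additional evidence, such as how often the same relationship held across past incidents.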

III. Event ranking and isolation

Many production issues are caused by internal changes to the software/infrastructure systems. Examples include code changes, configuration changes, and launching A/B tests for new features that affect a subset of users.

An ongoing research effort is to develop a model to isolate the changes that are potential root causes. As a first step, we use heuristic rules such as ranking based on time between code change and production issue. There is an opportunity to adopt more signals such as team, author, and code content to further reduce false positives and missed cases compared with the simple heuristic. A machine learning–based ranking model can effectively leverage such inputs. The limited amount of labeled data is a roadblock to automatically learning such rules. A possible solution is to explore a human-in-the-loop framework that iteratively collects subject-matter-expert feedback and adaptively updates the ranking model (see Figure 4).

Figure 4. A human-in-the-loop framework for blaming bad code changes.
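The time-based heuristic mentioned above can be sketched in a few lines: keep the changes that landed shortly before the issue and rank the most recent first. The 24-hour lookback window, the function name, and the change identifiers are assumptions made for illustration, not details of the production rules.

```python
from datetime import datetime, timedelta


def rank_changes_by_recency(changes, issue_time, lookback=timedelta(hours=24)):
    """Rank changes that landed within `lookback` before the issue.

    `changes` is an iterable of (change_id, landed_at) pairs. Changes that
    landed after the issue or before the lookback window are dropped.
    Returns change IDs, closest-in-time first.
    """
    candidates = [
        (issue_time - landed_at, change_id)
        for change_id, landed_at in changes
        if timedelta(0) <= issue_time - landed_at <= lookback
    ]
    return [change_id for _, change_id in sorted(candidates)]


# Hypothetical change log and issue time.
issue_time = datetime(2021, 7, 1, 12, 0)
changes = [
    ("change-b", datetime(2021, 7, 1, 9, 0)),    # 3 hours before the issue
    ("change-c", datetime(2021, 6, 29, 10, 0)),  # outside the lookback window
    ("change-a", datetime(2021, 7, 1, 11, 30)),  # 30 minutes before the issue
    ("change-d", datetime(2021, 7, 1, 12, 30)),  # landed after the issue
]

ranked_changes = rank_changes_by_recency(changes, issue_time)
print(ranked_changes)  # ['change-a', 'change-b']
```

Recency is a weak signal on its own: a slow rollout or a delayed alert can put the true cause well down the list, which is the motivation for adding signals such as team, author, and code content.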

At Facebook scale, there are numerous code/configuration/experimentation changes per day. Simply trying to rank order all of them cannot work. The ranking algorithm needs “prior” knowledge about the systems so as to narrow down the pool of suspect root-causing changes. For example, all the back-end services can be represented as a graph with edges representing how likely the degradation of one node is to cause production issues in its neighbors. One example algorithm to build such a graph is to apply a deep neural network framework that represents the dynamic dependencies among a large number of time series. Another possible direction is to apply causal graph inference models to discover the degree of dependencies among vertices. With the help of such prior knowledge, the isolation of bad changes can be achieved more effectively.
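As a rough sketch of how such a graph could narrow the pool of suspects, the code below walks the dependency edges backward from the affected service and keeps only services whose best-path likelihood stays above a threshold. Multiplying edge likelihoods along a path is a simplifying assumption, and the edge weights, service names, and threshold are hypothetical. How the graph itself is learned is outside the scope of this example.

```python
import heapq


def suspect_services(edges, affected, min_score=0.2):
    """Score services that could plausibly have caused an issue in `affected`.

    `edges` is a list of (cause, effect, likelihood) triples with likelihood
    in [0, 1]. A service's score is the largest product of likelihoods along
    any path from it to `affected`. Returns [(service, score), ...] sorted by
    score, including `affected` itself with score 1.0.
    """
    upstream = {}
    for cause, effect, likelihood in edges:
        upstream.setdefault(effect, []).append((cause, likelihood))

    best = {affected: 1.0}
    heap = [(-1.0, affected)]  # Max-heap via negated scores.
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if score < best.get(node, 0.0):
            continue  # Stale entry; a better path was already found.
        for cause, likelihood in upstream.get(node, []):
            candidate = score * likelihood
            if candidate >= min_score and candidate > best.get(cause, 0.0):
                best[cause] = candidate
                heapq.heappush(heap, (-candidate, cause))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical dependency graph: (cause, effect, likelihood).
edges = [
    ("storage", "api", 0.8),
    ("cache", "api", 0.5),
    ("auth", "storage", 0.5),
    ("logging", "cache", 0.1),
    ("ads", "feed", 0.9),  # Unrelated to the affected service.
]

suspects = suspect_services(edges, "api")
suspect_names = [name for name, _ in suspects]
print(suspect_names)  # ['api', 'storage', 'cache', 'auth']

# Only changes touching a suspect service go on to the ranking step.
changes = [("change-a", "auth"), ("change-b", "ads"), ("change-c", "logging")]
shortlisted = [cid for cid, service in changes if service in suspect_names]
print(shortlisted)  # ['change-a']
```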

Why it matters:

Operating an efficient and reliable infrastructure is important to the success of Facebook products. While production issues will inevitably happen, quickly identifying root causes using data can expedite remediation and minimize the damage of such events. The proposed framework of algorithms will enable automated diagnosis using a mix of structured data (e.g., telemetry) and unstructured data (e.g., traces, code change events). The methodologies are developed so that they are generically applicable across different types of infrastructure systems. The algorithms, written as a Python library, can also be useful to the external data science and software engineering community. Root cause analysis is an emerging space in data science at the intersection of existing areas such as data mining, supervised learning, and time series analysis.