December 3, 2021

Improving the performance of large-scale applications via basic block reordering

By: Julián Mestre, Sergey Pupyrev

What the research is:

At Meta, we develop compilers to optimize the performance of our large-scale applications running in data center environments. Profile-guided optimization (PGO) is an important step in modern compilers for improving the performance of large-scale applications based on their runtime behavior. The technique leverages program execution profiles, such as the execution frequencies of binary functions or individual instructions, to guide compilers to optimize critical paths of a program more selectively and effectively.

Basic block reordering is among the most impactful PGO techniques. As part of the compilation process, a binary is broken up into smaller blocks. Many of these basic blocks are executed one after the other, but some contain conditional branches (control-flow instructions, like if-then-else, while, and switch statements) where the execution can jump to two or more blocks. Depending on the relative frequency of these jumps, some orderings of basic blocks can lead to fewer CPU instruction cache misses and therefore faster executions.

The source code on the left is transformed into the control-flow graph in the middle. The code blocks that make up this graph are laid out in memory on the right.

Profiling is used to collect information about the typical execution of an application. It allows us to learn how many times each basic block has been executed and how many times each branch has been taken. Given this information, a compiler’s job is to produce the most “CPU-friendly” ordering of basic blocks that leads to the best binary performance.

Traditional compiler approaches for basic block reordering optimize a specific dimension of CPU, such as instruction cache line utilization or branch prediction. However, we found that such orderings may impose suboptimal results. To overcome these shortcomings, we proposed a new model for block reordering that combines multiple effects and does a better job at predicting the performance of an application. Our model is based on a new optimization problem that we call the extended traveling salesman problem, or Ext-TSP for short.

How it works:

Given a collection of cities and distances between every pair of cities, the classical traveling salesman problem (TSP) involves finding an order in which to visit the cities to minimize the total distance traveled. There are many variations of this problem, like MAX-TSP, where we want to maximize the total distance traveled. Ext-TSP is a generalization of the latter problem, where we want to maximize the distance of not only two adjacent cities but also cities that are close enough together in the order — say, no more than a fixed number of positions away.

In the context of basic block reordering, the blocks play the role of the cities, and the jump counts play the role of the distances between two cities. The ordering corresponds to how the basic blocks are laid out in memory. If two basic blocks are laid out close together, then there is a good chance that a jump from one to another will not incur an instruction cache miss. In a sense, the Ext-TSP objective aims to optimize the utilization of the instruction cache and thus the performance of the application.

Our paper “Improved basic block reordering” introduces the new optimization problem. It shows that finding the best ordering is NP-hard, but also that there is a greedy efficient heuristic that produces good solutions for the instances typically arising from real-world binaries. In addition, for the aforementioned optimization problem, we describe a mixed integer programming formulation that is capable of finding optimal solutions on small functions. Our experiments with the exact method demonstrate that the new suggested heuristic finds an optimal ordering of basic blocks in over 98 percent of real-world instances. From the practical point of view, the new basic block reordering has been implemented in BOLT, an open source binary optimization and layout tool developed at Meta. An extensive evaluation on a variety of real-world data center applications indicate that the new method outperforms existing block reordering techniques, improving the resulting performance of applications with large code size.

This image shows a control-flow graph and two block layouts, one maximizing the number of fall-through jumps and another maximizing the Ext-TSP objective.

Theory behind extended TSP

Given the notoriety of the original TPS problem, we wanted to understand how much more difficult Ext-TSP was compared to the classical TSP. In our paper “On the extended TSP problem,” we study Ext-TSP from a mathematical perspective and prove both negative and positive results about the problem.

On the negative side, it turns out that Ext-TSP is much harder than classical TSP. We prove that conditional on the exponential time hypothesis, it is unlikely that there exists an efficient algorithm for solving the problem optimally, even for very simple treelike instances, like those arising from simple programs without loops. This is very surprising, as most optimization problems (including the classical TSP) admit such efficient algorithms on trees.

On the positive side, we design so-called approximation algorithms that are efficient and return a solution that is guaranteed to be at most a given fixed factor worse than the optimal solution. Given the previous impossibility of having efficient optimal algorithms, such approximation algorithms are the best we can hope for.

Why it matters:

Developing new compiler technology to optimize the performance of our servers is an impactful area of research at Meta. Faster applications mean less computer power is needed to serve our users, which directly translates into less electricity usage in our data centers and a small environmental footprint of our operations.