High-dimensional Bayesian optimization with sparsity-inducing priors

What the research is:

Sparse axis-aligned subspace Bayesian optimization (SAASBO) is a new sample-efficient method for expensive-to-evaluate black-box optimization. Bayesian optimization (BO) is a popular approach to black-box optimization, with machine learning (ML) hyperparameter tuning being a popular application. While they’ve had great success on low-dimensional problems with no more than 20 tunable parameters, most BO methods perform poorly on problems with hundreds of tunable parameters when a small evaluation budget is available.

In this work, we propose a new sample-efficient BO method that shows compelling performance on black-box optimization problems with as many as 388 tunable parameters. In particular, SAASBO has shown to perform well on challenging real-world problems where other BO methods struggle. Our main contribution is a new Gaussian process (GP) model that is more suitable for high-dimensional search spaces. We propose a sparsity-inducing function prior that results in a GP model that quickly identifies the most important tunable parameters. We find that our SAAS model avoids overfitting in high-dimensional spaces and enables sample-efficient high-dimensional BO.

SAASBO has already had several use cases across Facebook. For example, we used it for multiobjective Bayesian optimization for neural architecture search where we are interested in exploring the trade-off between model accuracy and on-device latency. As we show in another blog post, the SAAS model achieves much better model fits than a standard GP model for both the accuracy and on-device latency objectives. In addition, the SAAS model has also shown encouraging results for modeling the outcomes of online A/B tests, where standard GP models sometimes have difficulties to achieve good fits.

How it works:

BO with hundreds of tunable parameters presents several challenges, many of which can be traced to the complexity of high-dimensional spaces. A common behavior of standard BO algorithms in high-dimensional spaces is that they tend to prefer highly uncertain points near the domain boundary. As this is usually where the GP model is the most uncertain, this is often a poor choice that leads to overexploration and results in poor optimization performance. Our SAAS model places sparse priors on the inverse lengthscales of the GP model combined with a global shrinkage that controls the overall model sparsity. This prior results in a model where the majority of dimensions are “turned off,” which avoids overfitting.

An appealing property of the SAAS priors is that they are adaptive. As we collect more data, we may gather evidence that additional parameters are important, which allows us to effectively “turn on” more dimensions. This is in contrast to a standard GP model with maximum likelihood estimation (MLE), which will generally exhibit non-negligible inverse lengthscales for most dimensions—since there is no mechanism regularizing the lengthscales—often resulting in drastic overfitting in high-dimensional settings. We rely on the No-Turn-U-Sampler (NUTS) for inference in the SAAS model as we found it to clearly outperform maximum a posteriori (MAP) estimation. In Figure 1, we compare the model fit using 50 training points and 100 test points for three different GP models on a 388 dimensional SVM problem. We see that the SAAS model provides well-calibrated out-of-sample predictions while a GP with MLE/NUTS with weak priors overfit and show poor out-of-sample performance.

Figure 1: We compare a GP fit with MLE (left), a GP with weak priors fit with NUTS (middle), and a GP with a SAAS prior fit with NUTS (right). In each figure, mean predictions are depicted with dots, while bars correspond to 95 percent confidence intervals.

Many approaches to high-dimensional BO try to reduce the effective dimensionality of the problem. For example, random projection methods like ALEBO and HeSBO work directly in a low-dimensional space, while a method like TuRBO constrains the domain over which the acquisition function is optimized. SAASBO works directly in the high-dimensional space and instead relies on a sparsity-inducing function prior to mitigate the curse of dimensionality. This provides several distinct advantages over existing methods. First, it preserves—and therefore can exploit—structure in the input domain, in contrast to methods that rely on random embeddings, which risk scrambling it. Second, it is adaptive and exhibits little sensitivity to its hyperparameters. Third, it can naturally accommodate both input and output constraints, in contrast with methods that rely on random embeddings, where input constraints are particularly challenging.

Why it matters:

Sample-efficient high-dimensional black-box optimization is an important problem, with ML hyperparameter tuning being a common application. In our recently published blog post on Bayesian multiobjective neural architecture search, we optimized a total of 24 hyperparameters, and the use of the SAAS model was crucial to achieve good performance. Because of the high cost involved with training large-scale ML models, we want to try as few hyperparameters as possible, which requires sample-efficiency of the black-box optimization method.

We find that SAASBO performs well on challenging real-world applications and outperforms state-of-the-art methods from Bayesian optimization. In Figure 2, we see that SAASBO outperforms other methods on a 100-dimensional rover trajectory planning problems, a 388-dimensional SVM ML hyperparameter tuning problem, and a 124-dimensional vehicle design problem.

Figure 2: For each method, we depict the mean value of the best value found at a given iteration (top row). For each method, we show the distribution over the final approximate minimum as a violin plot, with horizontal bars corresponding to 5 percent, 50 percent, and 95 percent quantiles (bottom row).

February 8, 2023

Research

What the research is:

How it works:

Why it matters:

Read the full paper:

View tutorial:

Featured News

Introducing LLaMA: A foundational, 65-billion-parameter large language model

Token Merging: Your ViT but faster

How Meta uses AI to better understand people’s ages on our platforms