We propose a method for sample-efficient optimization of the trade-offs between model accuracy and on-device prediction latency in deep neural networks.
Neural architecture search (NAS) aims to provide an automated framework that identifies the optimal architecture for a deep neural network given an evaluation criterion such as model accuracy. The continuing trend toward deploying models on end-user devices such as mobile phones has led to increased interest in optimizing multiple competing objectives, balancing predictive performance against computational complexity (e.g., total number of FLOPs), memory footprint, and latency of the model.
Existing NAS methods that rely on reinforcement learning and/or evolutionary strategies can incur prohibitively high computational costs because they require training and evaluating a large number of architectures. Many other approaches require integrating the optimization framework into the training and evaluation workflows, making it difficult to generalize to different production use cases. In our work, we bridge these gaps by providing a NAS methodology that requires zero code changes to a user’s training flow and can thus easily leverage existing large-scale training infrastructure while providing highly sample-efficient optimization of multiple competing objectives.
We leverage recent advances in multi-objective and high-dimensional Bayesian optimization (BO), a popular method for black-box optimization of computationally expensive functions. We demonstrate the utility of our method by optimizing the architecture and hyperparameters of a real-world natural language understanding model used at Facebook.
We focus on the specific problem of tuning the architecture and hyperparameters of an on-device natural language understanding (NLU) model that is commonly used by conversational agents found in most mobile devices and smart speakers. The primary objective of the NLU model is to understand the user’s semantic expression and to convert it into a structured decoupled representation that can be understood by downstream programs. The NLU model shown in Figure 1 is an encoder-decoder non-autoregressive (NAR) architecture based on the state-of-the-art span pointer formulation.
Figure 1: Non-autoregressive architecture of the NLU semantic parsing model
The NLU model serves as the first stage in conversational assistants and high accuracy is crucial for a positive user experience. Conversational assistants operate over the user’s language, potentially in privacy-sensitive situations such as when sending a message. For this reason, they generally run “on-device,” which comes at the cost of limited computational resources. Moreover, it is important that the model also achieves short on-device inference time (latency) to ensure a responsive user experience. While we generally expect a complex NLU model with a large number of parameters to achieve better accuracy, complex models tend to have high latency. Hence, we are interested in exploring the trade-offs between accuracy and latency by optimizing a total of 24 hyperparameters so we can pick a model that offers an overall positive user experience by balancing quality and latency. Specifically, we optimize the 99th percentile of latency across repeated measurements and the accuracy on a held-out data set.
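To make the latency objective concrete, here is a minimal sketch of a 99th-percentile computation over repeated measurements. The nearest-rank definition, the function name, and the latency values are illustrative assumptions; the post does not say which percentile definition or measurement harness was used.

```python
import math


def percentile_nearest_rank(samples, q):
    """Return the q-th percentile (0 < q <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one measurement")
    ordered = sorted(samples)
    # 1-based rank of the smallest value with at least q% of samples at or below it.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latency measurements in milliseconds: 98 fast runs and 2 slow ones.
latencies_ms = [20.0] * 98 + [35.0, 50.0]

p99 = percentile_nearest_rank(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)

print(f"mean latency: {mean:.2f} ms")
print(f"p99 latency:  {p99:.2f} ms")
```

A tail percentile reacts to the occasional slow run that a mean would largely hide, which is why it is a natural target when the goal is a consistently responsive experience.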
BO is typically most effective on search spaces with fewer than 10 to 15 dimensions. To scale to the 24-dimensional search space in this work, we leverage recent work on high-dimensional BO (Eriksson and Jankowiak, 2021). Figure 2 shows that the model proposed by Eriksson and Jankowiak, which uses a sparse axis-aligned subspace (SAAS) prior and fully Bayesian inference, is crucial to achieving good model fits and outperforms a standard Gaussian process (GP) model with maximum a posteriori (MAP) inference on both the accuracy and latency objectives.
Figure 2: We illustrate the leave-one-out cross-validation performance for the accuracy and latency objectives. We observe that the SAAS model fits better than a standard GP using MAP.
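For readers unfamiliar with the SAAS prior, the following is a sketch of its core structure as we understand it from the Eriksson and Jankowiak paper; the notation is ours, and the paper remains the authoritative source for details such as the priors on the remaining hyperparameters. Here d is the number of tuned parameters (24 in this work), \rho_i is the inverse squared lengthscale of parameter i, and \mathcal{HC} denotes a half-Cauchy distribution with a small fixed scale \alpha at the top level.

```latex
\begin{aligned}
k(x, y) &= \sigma_k^2 \exp\!\Big(-\tfrac{1}{2} \sum_{i=1}^{d} \rho_i \,(x_i - y_i)^2\Big), \\
\tau &\sim \mathcal{HC}(\alpha), \\
\rho_i &\sim \mathcal{HC}(\tau), \qquad i = 1, \dots, d.
\end{aligned}
```

The half-Cauchy priors place most of their mass near zero, so most \rho_i stay small and the corresponding parameters are effectively switched off, while the heavy tails still allow a few parameters to matter strongly. Fully Bayesian inference averages predictions over posterior samples of these hyperparameters instead of committing to a single MAP estimate.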
To efficiently explore the trade-offs between multiple objectives, we use the parallel noisy expected hypervolume improvement (qNEHVI) acquisition function (Daulton et al., 2021), which enables evaluating many architectures in parallel (we use a batch size of 16 in this work) and naturally handles the observation noise present in both metrics: prediction latency is subject to measurement error, and accuracy is subject to randomness in neural network training because the parameters are optimized with stochastic gradient methods.
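The quantity underlying qNEHVI is hypervolume improvement: how much a new point enlarges the region dominated by the current Pareto frontier, measured from a reference point. The sketch below computes it exactly for two noise-free objectives that are both maximized (latency would be negated first). It is only the deterministic building block, not qNEHVI itself, which additionally takes an expectation over the GP posterior, accounts for noisy observations, and scores batches of candidates. All function names and numbers are illustrative assumptions.

```python
def pareto_front(points):
    """Keep points not dominated by any other point (both objectives maximized)."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by the reference point `ref`."""
    # Only points strictly better than the reference in both objectives contribute.
    valid = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    area, prev_y = 0.0, ref[1]
    # Sweep from the largest first objective down; along the front the second
    # objective increases, so each point adds one new horizontal strip.
    for x, y in sorted(pareto_front(valid), key=lambda p: p[0], reverse=True):
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area


def hypervolume_improvement(candidate, points, ref):
    """Increase in dominated area if `candidate` were added to `points`."""
    return hypervolume_2d(points + [candidate], ref) - hypervolume_2d(points, ref)


# Hypothetical objective values (higher is better for both).
ref_point = (0.0, 0.0)
observed = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]

hv_before = hypervolume_2d(observed, ref_point)
hvi_good = hypervolume_improvement((2.5, 2.5), observed, ref_point)
hvi_dominated = hypervolume_improvement((1.5, 1.5), observed, ref_point)

print(hv_before, hvi_good, hvi_dominated)
```

A candidate that is dominated by an existing point contributes no improvement, so an acquisition function built on this quantity is drawn toward configurations that extend the frontier.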
We compare the optimization performance of BO to Sobol (quasi-random) search. Figure 3 shows the results, where the objectives are normalized with respect to the production model, making the reference point equal to (1, 1). Using 240 evaluations, Sobol was only able to find two configurations that outperformed the reference point. On the other hand, our BO method was able to explore the trade-offs between the objectives and improve latency by more than 25% while at the same time improving model accuracy.
Figure 3: On the left, we see that Sobol (quasi-random) search is an inefficient approach that only finds two configurations that are better than the reference point (1,1). On the right, our BO method is much more sample-efficient and is able to explore the trade-offs between accuracy and latency.
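As a small illustration of the normalization used in this comparison, the sketch below expresses each configuration's metrics as ratios to the production model, so that the production model sits at (1, 1). It assumes that "outperforming the reference point" means higher accuracy and lower latency at the same time; the configuration names and metric values are made up for the example.

```python
def normalize(config, production):
    """Express a configuration's metrics as ratios to the production model's."""
    return (
        config["accuracy"] / production["accuracy"],
        config["latency_ms"] / production["latency_ms"],
    )


def outperforms_reference(normalized):
    """True if accuracy is higher and latency is lower than the reference (1, 1)."""
    accuracy_ratio, latency_ratio = normalized
    return accuracy_ratio > 1.0 and latency_ratio < 1.0


# Hypothetical production model and candidate configurations.
production = {"accuracy": 0.80, "latency_ms": 100.0}
candidates = {
    "A": {"accuracy": 0.82, "latency_ms": 70.0},   # better on both objectives
    "B": {"accuracy": 0.85, "latency_ms": 120.0},  # more accurate but slower
    "C": {"accuracy": 0.78, "latency_ms": 60.0},   # faster but less accurate
    "D": {"accuracy": 0.81, "latency_ms": 95.0},   # slightly better on both
}

winners = [
    name
    for name, config in candidates.items()
    if outperforms_reference(normalize(config, production))
]
print(winners)
```

Configurations such as B and C are still useful for mapping the trade-off curve, but only those in the better-on-both region count as improvements over the production model.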
This new method has unlocked on-device deployment for this natural language understanding model as well as several other models at Facebook. Our method requires zero code changes to the existing training and evaluation workflows, making it easily generalizable to different architecture search use cases. We hope that machine learning researchers, practitioners, and engineers find this method useful in their applications and foundational for future research on NAS.
 Eriksson, David, and Martin Jankowiak. “High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces.” Conference on Uncertainty in Artificial Intelligence (UAI), 2021.
 Daulton, Samuel, Maximilian Balandat, and Eytan Bakshy. “Parallel Bayesian Optimization of Multiple Noisy Objectives with Expected Hypervolume Improvement.” arXiv preprint arXiv:2105.08195, 2021.
Check out our tutorial in Ax showing how to use the open-source implementation of integrated qNEHVI with GPs with SAAS priors to optimize two synthetic objectives.