February 14, 2018

Announcing Tensor Comprehensions

By: Nicolas Vasilache, Oleksandr Zinenko - Inria & DI ENS, Theodoros Theodoridis - ETH Zürich, Zachary DeVito, William S. Moses - MIT CSAIL, Sven Verdoolaege, Andrew Adams, Albert Cohen - Inria & DI ENS & FAIR

Today, Facebook AI Research (FAIR) is announcing the release of Tensor Comprehensions, a C++ library and mathematical language that helps bridge the gap between researchers, who communicate in terms of mathematical operations, and engineers focusing on the practical needs of running large-scale models on various hardware backends. The main differentiating feature of Tensor Comprehensions is its unique take on Just-In-Time compilation, which produces the high-performance code that the machine learning community needs, automatically and on demand.

Order of magnitude productivity gains

The typical workflow for creating new high-performance machine learning (ML) layers can span days or weeks of engineering work through a two-phase process:

  1. A researcher writes a new layer at a NumPy-level abstraction, chaining existing operations in a deep learning library like PyTorch, and tests it in small-scale experiments. To run large-scale experiments, the code implementing the validated idea then needs to be accelerated by an order of magnitude.
  2. An engineer takes the layer and writes efficient code for GPUs and CPUs:

a. The engineer needs to be a high-performance computing expert, and such talent is in limited supply

b. The engineer needs to acquire context, map out a strategy, write and debug code

c. Moving the code to the backend involves mundane tasks, such as verbose argument checking and adding boilerplate integration code

As a consequence, over the last few years the deep learning community has grown to rely on high-performance libraries such as cuBLAS, MKL, and cuDNN to get fast code on GPUs and CPUs. Experimenting with ideas that deviate from the primitives provided in these libraries involves a level and magnitude of engineering that can be intimidating to researchers.

We anticipate great practical value in open-sourcing a package that shortens this process from days or weeks to minutes. With Tensor Comprehensions, our vision is for researchers to write an idea out in mathematical notation, have our system automatically compile and tune that notation, and obtain specialized code with good performance as the result.

In this release, we provide:

  • a mathematical notation to express a broad family of ML ideas in a simple syntax (see the example after this list)
  • a C++ frontend for this mathematical notation based on Halide IR
  • a polyhedral Just-in-Time (JIT) compiler based on the Integer Set Library (ISL)
  • a multi-threaded, multi-GPU autotuner based on evolutionary search
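
For example, a matrix multiplication can be written as the following Tensor Comprehension; this is a minimal sketch, and the tensor and index names are only illustrative:

    def matmul(float(M,K) A, float(K,N) B) -> (C) {
        C(m, n) +=! A(m, r_k) * B(r_k, n)
    }

Indices that appear only on the right-hand side, such as r_k, are reduction indices; the +=! operator initializes the output (here to zero) and then accumulates into it. No loop bounds are written anywhere: they are inferred from the symbolic sizes M, K, and N of the arguments.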

Related earlier work

A recent language that has become popular in the adjacent field of high-performance image processing is Halide. Halide uses similar high-level functional syntax to describe an image processing pipeline, and then, in a separate block of code, explicitly schedules it onto the hardware, specifying in detail how operations are tiled, vectorized, parallelized, and fused. This makes it a very productive language for people with architectural expertise, but it is difficult to use for most ML practitioners. Automatic scheduling of Halide is an active research area, but there is no good solution yet for ML code running on a GPU.

Tensor Comprehensions uses the Halide compiler as a library. We build on Halide’s intermediate representation (IR) and analysis tools and pair them with polyhedral compilation techniques, so that you can write layers using similar high-level syntax but without the need to say explicitly how the code is going to run. We also found ways to make our language even more concise, eliminating the need to specify loop bounds for reductions.
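
As an illustrative sketch of that conciseness (the operator and the names here are ours, chosen for illustration), a 2D convolution can be expressed without a single explicit loop bound:

    def conv2d(float(N,C,H,W) I, float(M,C,KH,KW) W1) -> (O) {
        O(n, m, h, w) +=! I(n, c, h + kh, w + kw) * W1(m, c, kh, kw)
    }

The reduction indices c, kh, and kw never receive explicit bounds; their ranges, together with the extent of the output O, are inferred from the shapes of I and W1.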

The details

Tensor Comprehensions uses Halide and Polyhedral Compilation techniques to automatically synthesize CUDA kernels with delegated memory management and synchronization. This translation performs optimizations such as general operator fusion, use of fast local memory, fast reductions, and JIT specialization for specific sizes. Since we do not try to own or optimize memory management, our flow is easily and efficiently integrated into any ML framework and any language that allows calling C++ functions.

In contrast to classical compiler technologies and library approaches, Polyhedral Compilation allows Tensor Comprehensions to schedule the computation of individual tensor elements on demand for each new network.

At the CUDA level, it combines affine loop transformations, fusion/fission and automatic parallelization while ensuring data is correctly moved through the memory hierarchy.

The numbers in the figure show the order in which tensor elements were initially computed, and the arrows represent dependencies between them. In this example, the rotation of the figure corresponds to loop interchange, which enables deep operator fusion.
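
To make the fusion point concrete, here is a sketch (the particular operator is chosen for illustration) of a fully connected layer followed by a ReLU, expressed as a single comprehension that the polyhedral scheduler is free to map to one kernel:

    def fcrelu(float(B,M) I, float(N,M) W1, float(N) B1) -> (O1) {
        O1(b, n) = B1(n)
        O1(b, n) += I(b, m) * W1(n, m)
        O1(b, n) = fmax(O1(b, n), 0)
    }

The three statements initialize the output with the bias, accumulate the matrix product, and apply the ReLU elementwise; rather than emitting three separate kernels, the compiler can schedule them together and keep intermediate values in fast local memory.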

To drive the search procedure, we also provide an integrated multi-threaded, multi-GPU autotuning library which uses evolutionary search to generate and evaluate thousands of implementation alternatives and select the best-performing ones. Just call the tune function on your Tensor Comprehension and watch the performance improve, live; stop when you are satisfied. The best strategy is serialized via protobuf and can be reused immediately or in later offline scenarios.

On the performance side, while we still have many improvements in the works, Tensor Comprehensions can already, in favorable cases, match or beat the performance of current ML frameworks integrated with hand-tuned libraries. This is mainly achieved by the ability to adapt code generation strategies to specific problem sizes. The following bar chart illustrates the performance gains we observed when comparing kernels produced automatically by Tensor Comprehensions against existing alternatives in Caffe2 and ATen (which use vendor library implementations such as cuDNN). For more details about the performance one may currently expect from Tensor Comprehensions, please see our accompanying paper.

As we extend our contribution to more hardware backends, Tensor Comprehensions will complement fast libraries written by hardware manufacturers such as NVIDIA and Intel, and will be used in conjunction with libraries such as cuDNN, MKL, or NNPACK.

What to expect next

This release will allow researchers and programmers to write layers in a notation similar to the math they use in their papers and to communicate the intent of their programs concisely. They will also be able to translate that notation into a fast implementation in a matter of minutes rather than days. As the toolchain grows, we expect usability and performance to increase and benefit the whole community.

We will release PyTorch integration for Tensor Comprehensions at a later date.

We are grateful for frequent exchanges with and feedback from the frameworks teams and are looking forward to bringing this exciting new technology to your favorite ML framework.

FAIR is committed to open science and working with the machine learning community to push AI research further. Tensor Comprehensions is already a collaboration between Facebook, Inria, ETH Zurich and MIT. Our work is in the early stages and we’re excited to share it early and look forward to improving it with feedback from the community.

Get started

Tensor Comprehensions is available under the Apache 2.0 license.

  • Documentation
  • Paper on arXiv
  • Slack channel
  • Email: tensorcomp@fb.com