Meta AI is sharing new research that reduces the latency of existing Vision Transformer (ViT) models without additional training. Our approach, called Token Merging (ToMe), combines similar tokens to reduce computation without losing information. The algorithm is lightweight and can merge tokens in each layer with negligible overhead. We evaluated ToMe on major datasets across several domains, including ImageNet-1K (images), Kinetics-400 (videos), and AudioSet-2M (audio), and found that it increases inference throughput by 2x to 3x with minimal accuracy loss. To enable the research community to reproduce and build on these advances, we have released the ToMe code and benchmarks.
How it works
ViT converts image patches into “tokens,” then applies an attention mechanism in each layer that allows these tokens to collect information from one another, proportional to their similarity. To improve the speed of ViT while maintaining its accuracy, ToMe is based on two observations: 1) computation speed and memory use are heavily tied to the number of tokens in the transformer, and 2) these tokens are often redundant. ToMe takes redundant tokens and merges them based on similarity, reducing the number of tokens without losing information. This is in contrast to prior work in token pruning, which deletes tokens outright, potentially removing important information.
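The core merging step can be pictured as a bipartite matching over token features. The sketch below is a minimal illustration, not the released implementation: the function name, the use of cosine similarity on raw token vectors, the plain averaging, and the per-call budget `r` are all simplifying assumptions.

```python
import numpy as np

def bipartite_soft_matching(tokens, r):
    """Merge the r most similar token pairs (simplified sketch of ToMe-style
    bipartite matching; details here are assumptions, not the real code).

    tokens: (N, D) array of token features.
    r: number of tokens to remove by merging.
    Returns an array of N - r tokens.
    """
    # Split tokens into two alternating sets, A and B.
    a, b = tokens[0::2], tokens[1::2]

    # Cosine similarity between every token in A and every token in B.
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    scores = a_n @ b_n.T                     # shape (|A|, |B|)

    # For each token in A, find its most similar partner in B.
    best_b = scores.argmax(axis=1)
    best_score = scores.max(axis=1)

    # Merge the r tokens in A whose best match is strongest; keep the rest.
    order = np.argsort(-best_score)
    merge_idx, keep_idx = order[:r], order[r:]

    merged_b = b.copy()
    for i in merge_idx:
        # Average each merged pair (a plain mean keeps the sketch short).
        merged_b[best_b[i]] = (merged_b[best_b[i]] + a[i]) / 2

    return np.concatenate([a[keep_idx], merged_b], axis=0)

# Toy usage: 8 tokens with 4-dim features, merging away 2 of them.
tokens = np.arange(1.0, 33.0).reshape(8, 4)
merged = bipartite_soft_matching(tokens, r=2)  # 8 tokens -> 6
```

In the paper, matching is done on the attention keys rather than raw token vectors, and each merged token's size is tracked so that repeated merges use a weighted average; the plain mean above trades that fidelity for brevity.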
ToMe is simple and can easily slot into existing transformers. At its core, ToMe uses a fast and lightweight matching function to group similar tokens together. This function can be inserted into the middle of any standard transformer block without much overhead.
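As a rough picture of how the matching function slots into the middle of a block, here is a toy transformer block with stand-in components. Everything here is hypothetical scaffolding (the function names, the zero-output `attn`/`mlp` stand-ins, and the naive merge); it only illustrates the placement of merging between the attention and MLP sub-blocks.

```python
import numpy as np

def block_with_tome(x, attn, mlp, merge, r):
    """A standard residual transformer block with a merging step inserted
    in the middle (placement sketch; attn/mlp/merge are stand-ins)."""
    x = x + attn(x)   # self-attention: tokens exchange information
    x = merge(x, r)   # ToMe: reduce the token count by r via merging
    x = x + mlp(x)    # the MLP now runs on fewer tokens
    return x

def naive_merge(x, r):
    """Hypothetical stand-in merge: average the last r + 1 tokens into one."""
    head, tail = x[:-(r + 1)], x[-(r + 1):]
    return np.concatenate([head, tail.mean(axis=0, keepdims=True)], axis=0)

# Toy usage: 10 tokens in, 8 tokens out of the block (r = 2 merged away).
x = np.ones((10, 4))
out = block_with_tome(x,
                      attn=lambda t: np.zeros_like(t),
                      mlp=lambda t: np.zeros_like(t),
                      merge=naive_merge, r=2)
```

Because the reduction happens inside the block, every subsequent layer sees fewer tokens, which is where the speedup compounds.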
During inference, ToMe gradually lowers the number of tokens over the course of the network, significantly reducing overall inference time. Because ToMe leaves the rest of the model untouched, it can be combined with other tools for speeding up transformers, such as xFormers or half-precision inference.
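The gradual reduction can be illustrated with a constant per-layer merging budget. The numbers below are illustrative assumptions (a ViT-L/16 at 224px resolution with r = 8 merged tokens per layer), and the linear token-count sum is only a rough proxy for cost, since attention actually scales quadratically in the token count.

```python
def tome_token_schedule(num_tokens, num_layers, r):
    """Token count entering each layer when r tokens are merged per layer
    (a sketch of a constant merging schedule)."""
    counts, n = [], num_tokens
    for _ in range(num_layers):
        counts.append(n)
        n = max(n - r, 1)  # never merge below a single token
    return counts

# Illustration: ViT-L/16 at 224px sees 197 tokens (196 patches + [CLS])
# across 24 layers; merging r = 8 per layer leaves 13 in the final layer.
schedule = tome_token_schedule(197, 24, 8)
fraction = sum(schedule) / (197 * 24)  # fraction of baseline token-layers
```

Under this toy accounting, the network processes roughly half the token-layers of the unmerged baseline, which lines up with the 2x-range throughput gains reported above.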
What it can do
We find ToMe has several exciting properties:
ToMe shows strong performance across several input modalities. We applied ToMe to ViT models trained on images, video, and audio. ToMe doubled inference speed across all modalities and had a negligible impact on accuracy, in many cases without additional training.
ToMe is especially effective for large models and large inputs. We applied ToMe to ViT models with different numbers of parameters and image sizes without additional training. The performance drop consistently decreases as the model and image sizes increase. This characteristic is important for deploying large-scale transformer models.
ToMe accelerates Stable Diffusion (text-to-image model) by 1.7x and reduces memory usage by 63 percent without significant loss of detail.
ToMe can be applied out of the box to most networks built from standard transformer blocks. While our paper focuses on ViT models, ToMe also speeds up and reduces the memory usage of popular architectures such as Stable Diffusion, with minimal loss of visual quality.
Why it matters
Since their introduction, ViTs have rapidly advanced the field of computer vision, demonstrating scalability, generalization across different domains, and the potential for powerful unsupervised learning. However, running massive models has become an issue for tasks with time constraints or hardware limitations. As a result, convolutional models are still used in practice, despite their lower accuracy. ToMe can easily cut the inference time of ViT models in half, and we expect it will unlock the use of large-scale ViT models in real-world applications.