Design and analysis of hardware friendly pruning algorithms to accelerate deep neural networks at the edge

tinyML Research Symposium (tinyML)

Abstract

Unstructured pruning is widely used to achieve state-of-the-art compression of convolutional neural network weights. While unstructured pruning preserves model accuracy well, it is difficult for hardware architectures to exploit unstructured sparsity for speedups and power savings during model inference. Structured pruning, on the other hand, readily translates to inference speedups and power reduction because of its hardware-friendly nature, but typically yields lower model accuracy than unstructured pruning. Model pruning removes network weights according to some criterion, and with many structured and unstructured pruning criteria proposed in the literature, it is often unclear which criterion offers the best performance in hardware. In this work, we generate accuracy-sparsity-latency Pareto curves for several state-of-the-art filter pruning criteria and derive insights based on an edge-based DNN accelerator. We also propose combining pruning at multiple granularities and evaluate the benefits. These insights are useful for pruning deep learning workloads for inference at the edge subject to a given accuracy and compute budget.
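To illustrate the two pruning granularities contrasted in the abstract, the sketch below builds magnitude-based masks at per-weight (unstructured) and per-filter (structured) granularity for a convolution weight tensor. It is a minimal illustration, not the paper's method: the tensor shape, the 50% sparsity target, and the L1-norm filter criterion are assumptions chosen for clarity.

```python
import torch


def unstructured_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the individual weights with the smallest magnitudes (unstructured)."""
    k = int(sparsity * weight.numel())
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()


def filter_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out whole output filters with the smallest L1 norms (structured)."""
    # weight shape: (out_channels, in_channels, kH, kW)
    norms = weight.abs().sum(dim=(1, 2, 3))
    k = int(sparsity * norms.numel())
    threshold = norms.kthvalue(k).values
    keep = (norms > threshold).float()
    return keep.view(-1, 1, 1, 1).expand_as(weight)


# Illustrative 3x3 conv layer with 64 filters and 32 input channels, pruned to 50% sparsity.
w = torch.randn(64, 32, 3, 3)
w_unstructured = w * unstructured_mask(w, 0.5)  # fine-grained zeros, hard for hardware to exploit
w_structured = w * filter_mask(w, 0.5)          # whole filters removed, maps directly to smaller compute
```

The structured mask removes entire output channels, so the pruned layer can be physically shrunk and executed faster on an edge accelerator, whereas the unstructured mask scatters zeros that most hardware cannot skip.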
