Gradient-based Adversarial Attacks against Text Transformers

Conference on Empirical Methods in Natural Language Processing (EMNLP)


We propose the first general-purpose gradient-based adversarial attack against transformer models. Instead of searching for a single adversarial example, we search for a distribution of adversarial examples parameterized by a continuous-valued matrix, hence enabling gradient-based optimization. We empirically demonstrate that our white-box attack attains state-of-the-art attack performance on a variety of natural language tasks, outperforming prior work in terms of adversarial success rate with matching imperceptibility as per automated and human evaluation. Furthermore, we show that a powerful black-box transfer attack, enabled by sampling from the adversarial distribution, matches or exceeds existing methods, while only requiring hard-label outputs.

Latest Publications

Log-structured Protocols in Delos

Mahesh Balakrishnan, Mihir Dharamshi, David Geraghty, Santosh Ghosh, Filip Gruszczynski, Jun Li, Jingming Liu, Suyog Mapara, Rajeev Nagar, Ivailo Nedelchev, Francois Richard, Chen Shen, Yee Jiun Song, Rounak Tibrewal, Vidhya Venkat, Ahmed Yossef, Ali Zaveri