Abstract
Adversarial attacks in the text domain pose significant challenges to the integrity of Natural Language Processing (NLP) systems. To address this, our study introduces "TextGuard," a technique that applies the Local Outlier Factor (LOF) algorithm to detect adversarial examples in NLP. The study empirically validates the effectiveness of TextGuard on several real-world datasets and evaluates its performance across standard NLP classifiers such as Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks (CNN), and transformer-based models. TextGuard demonstrates superior detection capability, with detection F1 scores reaching up to 94.8%, outperforming recent state-of-the-art methods such as Discriminative Perturbations (DISP) and Frequency-Guided Word Substitutions (FGWS). To our knowledge, this is the first application of the LOF technique to adversarial example detection in the text domain, setting a new benchmark in the field.
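To make the core idea of LOF-based detection concrete, the following is a minimal sketch, not the authors' TextGuard implementation: it assumes input texts have already been mapped to fixed-size embeddings (here simulated with random vectors) and uses scikit-learn's LocalOutlierFactor in novelty mode to flag inputs whose local density deviates from the clean reference set.

```python
# Illustrative sketch of LOF-based adversarial detection (not the authors' code).
# Assumption: texts are represented as fixed-size embeddings; random vectors
# stand in for encoder outputs (e.g., LSTM, CNN, or transformer hidden states).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Reference set of clean-input embeddings.
clean_embeddings = rng.normal(loc=0.0, scale=1.0, size=(500, 64))

# Test set: some clean-looking inputs plus some shifted (outlying) inputs
# that play the role of adversarially perturbed examples in this toy setup.
test_embeddings = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(20, 64)),
    rng.normal(loc=3.0, scale=1.0, size=(20, 64)),
])

# novelty=True allows scoring unseen points against the fitted clean set.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(clean_embeddings)

# predict() returns +1 for inliers (treated as clean) and -1 for outliers
# (flagged as candidate adversarial examples).
flags = lof.predict(test_embeddings)
print("flagged as adversarial:", int((flags == -1).sum()), "of", len(flags))
```

The choice of embedding model, n_neighbors, and the decision threshold are all assumptions here; in the paper these would be determined by the evaluated datasets and classifiers.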