
TextGuard: Identifying and neutralizing adversarial threats in textual data

By: Marwan Omar and  Luay Albtosh  
Open Access
|Dec 2025


1
Introduction

Machine Learning (ML) and NLP models are increasingly exposed to adversarial examples: intentionally modified inputs designed to deceive them and impair their performance [1]. These adversarial attacks have markedly reduced the effectiveness of NLP models in essential tasks like sentiment analysis and question answering [2,3,4,5,6,7]. Adversarial attacks were originally identified in the image domain [1], and the need for robust defense and detection techniques against such attacks in NLP is now urgent and critical [8,9,10,11,12,13]. Recognizing the vulnerability of NLP models to adversarial attacks in real-world applications, we introduce TextGuard, a novel detection approach based on the Local Outlier Factor (LOF) algorithm. This technique distinguishes between normal and adversarial inputs, offering a proactive defense against potential security breaches in NLP models [14, 15]. Drawing inspiration from methods in the vision domain, TextGuard employs anomaly detection, a technique with a broad range of applications, including fraud detection and medical diagnosis [16,17,18,19,20].

Advancements in the NLP field have led to a surge in exploring defense and detection strategies against adversarial attacks. A significant focus has been on developing robust defense techniques, primarily through adversarial training, which involves generating adversarial examples and retraining models on them to enhance their resilience [21,22,23]. The field of adversarial example detection in NLP, however, remains relatively underexplored, although recent research efforts have begun shifting in this direction. Zhou et al. introduced DISP (learning to discriminate perturbations), a technique for detecting adversarial NLP attacks that leverages pre-trained contextualized word representations for word-level perturbation detection without retraining the attacked model [23]. Li et al. focused on detecting a specific type of attack but did not extend their approach to a broader range of attack frameworks [24]. Pruthi et al.'s study on adversarial misspelling attacks did not fully address semantic and grammaticality constraints [25]. Wang et al.'s TextFirewall, a framework for identifying new adversarial attacks in text, evaluates inconsistencies between model outputs and the impact value of important words [26]; despite its detailed analysis of sentiment analysis tasks, its applicability to other datasets and tasks remains unclear. Another study by Wang et al. introduced the fast gradient projection method for efficient adversarial attacks, together with its defense counterpart, adversarial training for language models, which improved model robustness and prevented the transferability of adversarial examples [27]. However, that study lacked ablation studies on hyper-parameters, which are crucial for assessing method robustness.

In detecting character-level attacks, Sakaguchi et al.’s approach exploited grammatical inconsistencies but fell short in addressing complex attacks that mimic normal samples’ semantics and syntax [28]. While adversarial training is a prevalent defense technique, it often leads to performance degradation on clean samples [3, 22]. Research on detection techniques, such as Mozes et al.’s use of FGWS, has shown promise in identifying adversarial attacks through word frequency attributes [5]. However, their evaluation was limited to the F1 score, lacking other performance metrics and generalizability discussions for different linguistic tasks.

The rest of the paper is structured as follows: Section 2 presents the methodology. Section 3 describes the application. Section 4 reports baseline model performance. Section 5 reports defense performance. Section 6 presents the results and discussion. Section 7 discusses insights and open challenges. Finally, Section 8 presents the conclusion and some closing remarks.

2
Methodology

In this section, we outline our methodology, beginning with an overview of the key NLP tasks of interest: sentiment analysis and sentence classification. We then delve into the definitions and notations of adversarial training and adversarial examples, followed by a detailed explanation of the LOF algorithm and the metrics used to gauge the efficacy of our proposed technique for detecting adversarial examples.

2.1
Sentiment analysis and sentence classification

Recognizing the broad applicability of adversarial example detection across various NLP tasks, we focus on sentiment analysis and sentence classification due to their widespread usage.

a)
Formal definition

Consider an input text instance $x \in X$, where $X$ denotes the input text space, and $y$ represents the target label for the task (e.g., in sentiment classification, positive vs. negative, so $y$ could be 0 or 1). The task of detecting adversarial examples, which we approach as a classification problem, is formally described as a function $f: x \rightarrow y$. This detection can be implemented using various learning algorithms, including advanced neural networks like BERT and RoBERTa.

2.2
Formal definition of local outliers

LOF is a density-based algorithm that assesses the outlier degree of data points based on local densities. It assigns an outlierness score by comparing the density of a point with its neighbors [29,30,31]. In the NLP context, LOF is instrumental in identifying anomalies within text datasets.

For a data point A, the k-distance(A) is the distance from A to its k-th nearest neighbour. The neighbourhood $N_k(A)$ includes all objects within this distance. The reachability distance between A and any point B in $N_k(A)$ is defined as:

(1) $\text{reachability-distance}_{k}(A, B) = \max\{\text{k-distance}(B),\, d(A, B)\}$,

where $d(A, B)$ is the distance between points A and B.

The local reachability density (LRD) of a point A is calculated as the inverse of the mean reachability distance from its k-nearest neighbours:

(2) $\text{lrd}_{k}(A) = \left[\frac{\sum_{B \in N_{k}(A)} \text{reach-dist}_{k}(A, B)}{|N_{k}(A)|}\right]^{-1}$.

The LOF of a data point A is then given by the average of the LRD ratios between A and its neighbours:

(3) $\text{LOF}_{k}(A) = \frac{\sum_{B \in N_{k}(A)} \frac{\text{lrd}_{k}(B)}{\text{lrd}_{k}(A)}}{|N_{k}(A)|}$.

This ratio determines whether a data point is an outlier relative to its local neighborhood [32].
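To make Eqs. (1)-(3) concrete, the following minimal NumPy sketch computes LOF scores exactly as defined above. It is illustrative only: the data matrix X and the neighbourhood size k are assumed inputs, and this is not the implementation used in our experiments.

```python
import numpy as np
from scipy.spatial.distance import cdist

def lof_scores(X, k=3):
    """Compute the LOF score of each row of X following Eqs. (1)-(3)."""
    n = X.shape[0]
    D = cdist(X, X)                          # pairwise distances d(A, B)
    order = np.argsort(D, axis=1)            # index 0 of each row is the point itself
    knn_idx = order[:, 1:k + 1]              # indices of the k nearest neighbours
    k_dist = D[np.arange(n), order[:, k]]    # k-distance of each point

    # Eq. (1) and Eq. (2): reachability distances and local reachability density
    lrd = np.empty(n)
    for a in range(n):
        reach = np.maximum(k_dist[knn_idx[a]], D[a, knn_idx[a]])
        lrd[a] = 1.0 / reach.mean()

    # Eq. (3): average ratio of neighbours' lrd to the point's own lrd
    return np.array([lrd[knn_idx[a]].mean() / lrd[a] for a in range(n)])
```

Scores close to 1 indicate a point whose density matches its neighbourhood, while markedly larger scores flag local outliers.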

a)
Algorithm: LOF-based outlier detection

The LOF algorithm for outlier detection in our context proceeds as follows:

Input: A dataset $D = \{x_1, x_2, \ldots, x_n\}$ and a threshold $r$ ($r > 0.1$).

Output: Outliers in D.

Method:

  • For each data point X in D, calculate $D_k(X)$ (the distance to its k-th nearest neighbour) and define $L_k(X)$ (the set of points within $D_k(X)$).

  • Compute the reachability distance $R_k(X, Y) = \max(\text{dist}(X, Y), D_k(Y))$ for each pair of points X and Y in D.

  • Calculate the average reachability distance $AR_k(X)$ for each point X.

  • Determine the LOF score of each point X as $LOF_{k}(X) = \text{MEAN}_{Y \in L_{k}(X)} \left( \frac{AR_{k}(X)}{AR_{k}(Y)} \right)$.

  • Identify data points with high LOF values as outliers.
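In practice, this procedure is also available off the shelf. The short sketch below uses Scikit-learn's LocalOutlierFactor on a matrix X of reduced text embeddings (an assumed input), with k = 20 as reported later in the hyper-parameter analysis; the contamination value is an illustrative choice rather than a setting prescribed by the algorithm above.

```python
from sklearn.neighbors import LocalOutlierFactor

# X: (n_samples, n_features) matrix of reduced sentence embeddings (assumed input).
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
labels = lof.fit_predict(X)                 # -1 = outlier (suspected adversarial), 1 = inlier
scores = -lof.negative_outlier_factor_      # larger score = more anomalous
suspected_adversarial = labels == -1
```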

b)
Dimensionality reduction with kPCA

Kernel Principal Component Analysis (KPCA) [33] is a popular nonlinear dimensionality reduction technique in the text domain, well suited to selecting meaningful features and discarding irrelevant data points. We chose KPCA for its efficiency in modeling data using lower-dimensional manifolds. KPCA works by mapping data into a higher-dimensional feature space and performing principal component analysis there, turning datasets that are not linearly separable in the original space into linearly separable ones in the new space. Thanks to the kernel trick, no calculations are carried out explicitly in the high-dimensional space; new data points are implicitly projected into it and their lower-dimensional representations are recovered from the kernel matrix.

c)
Formal definition

KPCA nonlinearly maps the original data into a feature space F:

(4) $\phi: \mathbb{R}^{N} \rightarrow F$.

After the mapping, PCA is applied in F, implicitly defining nonlinear principal components in the original space. For certain mappings $\phi$, PCA can still be performed efficiently in F using kernel functions.
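As a rough illustration of this step, the sketch below applies Scikit-learn's KernelPCA to a matrix of high-dimensional sentence embeddings; the kernel, gamma, number of components, and the variable names X_embed and X_embed_new are assumptions for the example, not the settings reported in this paper.

```python
from sklearn.decomposition import KernelPCA

# X_embed: high-dimensional sentence embeddings (assumed input).
kpca = KernelPCA(n_components=50, kernel="rbf", gamma=1e-3)
X_reduced = kpca.fit_transform(X_embed)   # nonlinear projection via the kernel trick
X_new = kpca.transform(X_embed_new)       # project unseen points into the same subspace
```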

d)
Analysis of hyperparameters

To ensure computational efficiency and effectiveness in detecting adversarial examples with our LOF method, we conducted an ablation study. We used Scikit-learn's grid search to find the optimal hyperparameters and observed that the settings r = 0.5 and k = 20 yield the best F1 scores and detection rates. A higher k value improves detection but increases training time. Our goal was to tune LOF's hyperparameters to best distinguish between adversarial and normal examples.
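A hedged sketch of such a sweep is shown below. Because LocalOutlierFactor is unsupervised, the example simply loops over a small grid and scores each setting by the F1 of flagging known adversarial validation samples; X_val, y_val, and the grid values are illustrative assumptions rather than our exact protocol.

```python
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor

# X_val: reduced embeddings; y_val: 1 = adversarial, 0 = clean (assumed inputs).
best = None
for k in (5, 10, 20, 40):                   # candidate neighbourhood sizes
    for r in (0.05, 0.1, 0.2, 0.5):         # candidate contamination / threshold values
        lof = LocalOutlierFactor(n_neighbors=k, contamination=r)
        pred = (lof.fit_predict(X_val) == -1).astype(int)   # 1 = flagged as adversarial
        score = f1_score(y_val, pred)
        if best is None or score > best[0]:
            best = (score, k, r)
print("best F1 = %.3f with k = %d, r = %.2f" % best)
```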

e)
Detecting adversarial examples with LOF

We used the LOF algorithm to detect outliers (adversarial examples) in data vectors processed through dimensionality reduction. Comparative tests with other outlier detection methods, such as K-means and Isolation Forest, showed that LOF was the most accurate and robust in our experiments. To the best of our knowledge, our approach is the first to apply the LOF algorithm in the NLP domain to identify adversarial examples based on their outlierness within a dataset. Figure 1 illustrates the idea: comparing the local density of a data point A with that of its k nearest neighbours shows that the neighbours lie in a denser region, which marks A as an outlier.
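The comparison mentioned above could look roughly like the following sketch, which scores LOF, Isolation Forest, and a simple K-means distance-to-centroid heuristic on the same labelled data. The thresholds, settings, and the variables X and y are assumptions for illustration, not our exact experimental protocol.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor

def kmeans_outliers(X, n_clusters=2, quantile=0.9):
    """Flag points far from their cluster centre as outliers."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    return dist > np.quantile(dist, quantile)

detectors = {
    "LOF": lambda X: LocalOutlierFactor(n_neighbors=20).fit_predict(X) == -1,
    "IsolationForest": lambda X: IsolationForest(random_state=0).fit_predict(X) == -1,
    "KMeans": kmeans_outliers,
}
# X: reduced embeddings; y: 1 = adversarial, 0 = clean (assumed inputs).
for name, detect in detectors.items():
    pred = detect(X).astype(int)
    print(name, "F1 = %.3f" % f1_score(y, pred))
```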

Fig. 1

Illustration of LOF.

2.3
Datasets, models and evaluation metrics

Datasets: We used three benchmark datasets for our experiments: YELP, MR, and AG NEWS.

Models: We employed three deep-learning models known for their effectiveness in sentiment analysis and sentence classification: BERT [34], WordCNN [35], and LSTM [36].

Evaluation Metrics: Our evaluation metrics for the classifiers included:

  • Accuracy: Correct classification rate.

  • Precision: Proportion of examples flagged as adversarial that are truly adversarial.

  • Recall: Proportion of adversarial examples that are correctly detected.

  • F1 Score: Harmonic mean of precision and recall, calculated as:

(5) $F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times TP}{2 \times TP + FP + FN}$.

  • AUC: Area under the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR):

(6) $TPR = \frac{TP}{TP + FN}$,

(7) $\text{Specificity} = \frac{TN}{FP + TN}$,

(8) $FPR = 1 - \text{Specificity}$.

The overall classification accuracy is computed as:

(9) $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$.

  • Attack Success Rate (ASR): Proportion of successful adversarial attacks:

(10) $ASR = \frac{\text{number of successful attacks}}{\text{total number of attacks}}$.
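For reference, the metrics in Eqs. (5)-(10) can be computed from raw confusion-matrix counts as in the short sketch below; the function and argument names are ours, introduced only for illustration.

```python
def detection_metrics(tp, fp, tn, fn, successful_attacks, total_attacks):
    """Compute the evaluation metrics of Eqs. (5)-(10) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                       # also the TPR of Eq. (6)
    f1 = 2 * tp / (2 * tp + fp + fn)              # Eq. (5)
    specificity = tn / (fp + tn)                  # Eq. (7)
    fpr = 1 - specificity                         # Eq. (8)
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # Eq. (9)
    asr = successful_attacks / total_attacks      # Eq. (10)
    return {"precision": precision, "recall": recall, "f1": f1,
            "specificity": specificity, "fpr": fpr,
            "accuracy": accuracy, "asr": asr}
```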

3
Application

This section details our experimental findings across various datasets and model architectures as described in the Methodology section.

3.1
Adversarial examples generation

We generated adversarial examples using the model proposed by [37], which includes a decoder and encoder. To ensure linguistic fidelity and semantic preservation, we trained these components using a large text corpus, following [38] for semantic consistency and applying grammar checks for grammaticality. The adversarial example generation process, depicted in Figure 2, involves three key steps [39]:

  • A search method to discover effective perturbations.

  • A transformation method to modify text input from x to x′ (e.g., character substitutions).

  • Linguistic constraints to ensure the modified input x′ maintains the semantic and fluency integrity of the original input x.

Fig. 2

Pipeline for generating adversarial examples.
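For concreteness, a hedged sketch of this kind of pipeline using the TextAttack framework [39] and the Deepwordbug recipe [46] is shown below; the pretrained model name, dataset, and number of examples are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import transformers
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import DeepWordBugGao2018
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper

# Illustrative victim model and dataset (assumed, not the paper's exact setup).
name = "textattack/bert-base-uncased-rotten-tomatoes"
model = transformers.AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = transformers.AutoTokenizer.from_pretrained(name)
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

attack = DeepWordBugGao2018.build(model_wrapper)      # search + transformation + constraints
dataset = HuggingFaceDataset("rotten_tomatoes", split="test")
attacker = Attacker(attack, dataset, AttackArgs(num_examples=1000))
results = attacker.attack_dataset()                   # perturbed inputs x' and attack outcomes
```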

3.2
Implementation details and model architecture

We employed a pre-trained BERT model from the Hugging Face Transformers library, following [5]. Input sequence lengths were set to 512, 256, and 128 for the YELP, MR, and AG NEWS datasets, respectively. The BERT model has 125 million parameters and was trained over 10 epochs with a batch size of 16 and a learning rate of 1e-6. Best-performing checkpoints were selected based on validation set performance.
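A minimal fine-tuning sketch consistent with these settings is shown below, assuming tokenised Hugging Face datasets train_ds and val_ds have already been prepared; the checkpoint name, dataset variables, and output directory are ours for illustration, and validation-based checkpoint selection is assumed to be done separately.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# "bert-base-uncased" stands in for the exact checkpoint used in the paper.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="textguard-bert",      # illustrative output directory
    num_train_epochs=10,              # settings reported above
    per_device_train_batch_size=16,
    learning_rate=1e-6,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
metrics = trainer.evaluate()          # validation-set scores used for checkpoint selection
```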

Table 1 lists the three benchmark datasets used for our sentiment analysis and sentence classification tasks: YELP, MR, and AG NEWS. The YELP and MR datasets have ratings binarized into positive and negative labels and are split into training, validation, and test sets.

Table 1

Datasets.

Dataset Name | Dataset Description | Attributes

YELP [40] | Large Yelp Review Dataset | 560,000 training samples and 38,000 test samples
MR [41] | Movie Review Dataset | 5,331 training samples and 5,331 test samples
AG NEWS [42] | News Topic Classification | 12,000 training samples and 7,600 test samples

The CNN architecture comprised three convolutional layers with kernel sizes of 2, 3, and 4, and 100 feature maps. The LSTM model had 128 hidden states, was initialized with GloVe [43] embeddings, and used a dropout rate of 0.5. Both the CNN and LSTM models were trained using the Adam optimizer [44] for 8 epochs with a batch size of 12 and a learning rate of 1e-4. A subset of 1,000 training samples was used for each experiment. Our LOF algorithm was implemented using Scikit-learn [45].
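The two architectures can be sketched in PyTorch roughly as below. This is an assumed reconstruction from the description above: the embedding dimension, max-pooling choice, and classifier heads are illustrative, and vocab_size and the GloVe weight matrix are assumed to be prepared elsewhere.

```python
import torch
import torch.nn as nn

class WordCNN(nn.Module):
    """Kernel sizes 2/3/4 with 100 feature maps each, as described above."""
    def __init__(self, vocab_size, embed_dim=300, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, 100, kernel_size=k) for k in (2, 3, 4)])
        self.fc = nn.Linear(3 * 100, num_classes)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)       # (batch, embed_dim, seq_len)
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(feats, dim=1))

class LSTMClassifier(nn.Module):
    """128 hidden units, GloVe-initialised embeddings, dropout 0.5."""
    def __init__(self, glove_weights, hidden=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.lstm = nn.LSTM(glove_weights.size(1), hidden, batch_first=True)
        self.drop = nn.Dropout(0.5)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, tokens):
        _, (h, _) = self.lstm(self.embed(tokens))
        return self.fc(self.drop(h[-1]))             # last hidden state -> class logits
```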

4
Baseline: Model performance under adversarial attacks before LOF implementation

We assess the performance of our NLP classifiers under adversarial attacks using the TextAttack framework [39] and the Deepwordbug attack recipe [46]. Table 2 shows that classification accuracy drops significantly under adversarial attacks. Notably, the LSTM model's accuracy drops to 7.88% and the WordCNN model's to 9.64% on the YELP dataset, while BERT performs relatively better, retaining 21.09% accuracy on the AG NEWS dataset.

Table 2

Performance of the classifiers against adversarial attacks before the implementation of the LOF technique.

Dataset | Model | Accuracy (%)

AG NEWS | BERT | 21.09
AG NEWS | WordCNN | 13.68
AG NEWS | LSTM | 11.56
MR | BERT | 12.97
MR | WordCNN | 20.59
MR | LSTM | 19.29
Yelp | BERT | 9.98
Yelp | WordCNN | 9.64
Yelp | LSTM | 7.88

Explanation of the data: Before LOF implementation (Table 2), accuracy is low across all models and datasets due to the impact of adversarial attacks. After LOF implementation (Table 3), accuracy improves significantly, demonstrating the effectiveness of the LOF technique in detecting and mitigating adversarial attacks. The BERT model shows the greatest improvement, particularly on the Yelp dataset.

Table 3

Performance of the classifiers against adversarial attacks after the implementation of the LOF technique.

Dataset | Model | Accuracy (%)

AG NEWS | BERT | 85.12
AG NEWS | WordCNN | 72.47
AG NEWS | LSTM | 65.78
MR | BERT | 88.39
MR | WordCNN | 74.83
MR | LSTM | 68.55
Yelp | BERT | 92.59
Yelp | WordCNN | 81.34
Yelp | LSTM | 78.45


5
Defense performance evaluation

We evaluated our LOF-based technique's effectiveness in detecting and rejecting adversarial examples using BERT, WordCNN, and LSTM models across three benchmark datasets (Yelp, MR, and AG NEWS). This assessment was performed against a controlled Deepwordbug attack. The results, shown in Table 3, indicate a significant improvement in classification accuracy across all models upon implementing the LOF technique. Notably, BERT achieved an accuracy of up to 92.59%.

6
Results and discussion

In this section, we present some important remarks on the defense performance of the LOF technique and compare it with prior work.

6.1
Defense performance evaluation

We evaluated our LOF-based technique's effectiveness in detecting and rejecting adversarial examples using the BERT, WordCNN, and LSTM models across three benchmark datasets (YELP, MR, and AG NEWS). This assessment was performed against a controlled Deepwordbug attack. Table 4 restates the classifiers' performance under this attack before applying LOF; comparing it with Table 3 shows a significant improvement in classification accuracy across all models once the LOF technique is implemented. Notably, BERT achieved an accuracy of up to 92.59% on the YELP dataset, while LSTM scored lower at 65.78% on AG NEWS.

Table 4

Performance of our three classifiers against the Deepwordbug attack technique prior to the implementation of the LOF technique.

Dataset | Model | Accuracy (%)

AG NEWS | BERT | 21.09
AG NEWS | WordCNN | 13.68
AG NEWS | LSTM | 11.56
MR | BERT | 12.97
MR | WordCNN | 20.59
MR | LSTM | 19.29
Yelp | BERT | 9.98
Yelp | WordCNN | 9.64
Yelp | LSTM | 7.88

Observation: Further experiments were conducted to assess the LOF technique's capacity to differentiate between normal and adversarial samples. Consistent with the post-LOF accuracies in Table 3, the ASR dropped markedly across all models and datasets. BERT consistently outperformed WordCNN and LSTM in blocking adversarial examples, with LSTM showing the lowest detection rates. This decrease in ASR indicates the LOF method's success in identifying and rejecting adversarial examples, leading to fewer successful attacks.

6.2
Comparing and contrasting LOF with prior works

This study’s core goal was to introduce a cutting-edge method for detecting and mitigating adversarial examples in NLP. To align with existing research, our expanded experiments included three attack methods (Deepwordbug, Genetic Attack, and Textbugger) and two datasets (YELP and MR) to evaluate our three classifiers (BERT, WordCNN, and LSTM). We compared our LOF technique’s performance with DISP and FGWS from existing literature, using AUC, F1 score, and TPR for a comprehensive analysis.

DISP

DISP focuses on identifying and adjusting malicious perturbations in text classification models. Our comparative analysis (Table 4) shows that LOF consistently outperforms DISP across all metrics. Notably, LOF achieved an F1 score of 92.4 on the YELP dataset against BERT attacked with Textbugger, indicating high effectiveness in identifying adversarial examples generated by this method. However, LOF’s performance was less impressive against LSTM under the Deepwordbug attack, with an F1 score of 49.8, suggesting limitations in certain scenarios.

FGWS

FGWS relies on frequency differences between original words and their substitutions to detect adversarial attacks. In our comparison (Table 4), LOF generally surpasses FGWS. For instance, against BERT on YELP with Textbugger, LOF achieved an F1 score of 92.4, compared to FGWS’s 89.1. Similarly, on the MR dataset with BERT and Deepwordbug, LOF scored 77.6 in F1, while FGWS scored 73.8. However, FGWS showed superior performance on MR with WordCNN under the Genetic attack, scoring 84.9 in F1 versus LOF’s 74.8.

The plots of Figure 3 display the attack technique (row) and dataset (column), with the x-axis representing FPR and the y-axis TPR. The legend shows the AUC for each detection method.

Fig. 3

ROC curves for BERT under Deepwordbug (DWB) and Textbugger (TB) attacks.

The results indicate that while LOF is highly effective in many scenarios, its performance can vary based on the classifier and attack method. This emphasizes the need for adaptable and versatile detection techniques in NLP adversarial defense strategies.

6.3
Limitations

Our study revealed certain sensitivities in the performance of our LOF technique. Factors like the number of adversarial examples, the attack recipe used (e.g., Deepwordbug vs Textfooler), and the threshold value significantly influence its effectiveness. The lack of a standard LOF threshold value means outlier identification depends heavily on the specific context and domain of the problem. While our results demonstrate robustness across various adversarial attacks, datasets, and models, the generalizability of our technique beyond NLP tasks, such as in the vision domain, remains uncertain and untested.

7
Insights and open challenges

Dataset security: The increasing reliance on large-scale training data in NLP models brings significant security risks. Outsourcing or automating dataset creation and curation exposes businesses to vulnerabilities, including potential manipulation or control of training data by adversaries. This can degrade model performance or result in incorrect predictions. There’s a notable gap in research addressing security flaws in NLP datasets, highlighting an urgent need for studies on dataset vulnerabilities and defensive strategies.

Fair comparison of detection techniques: Due to varying experimental settings in different studies, establishing a benchmark for comparing defence and detection techniques in a standardized context is a promising direction for future research.

Broader goals: The development of detection techniques that can identify all adversarial examples within a specific input class remains largely unexplored. Pursuing this line of research could yield significant advancements in the field.

8
Conclusion

The susceptibility of NLP models to adversarial attacks poses significant safety risks. Our study introduces and validates a new LOF-based technique for detecting adversarial examples in NLP. We tested its effectiveness using real-world datasets and various model architectures, including transformer-based models, WordCNN, and LSTM. Our findings demonstrate the capability of our LOF method to detect adversarial examples with up to 92.59% accuracy on the YELP dataset for sentiment analysis tasks. Evaluating our technique against three different attack recipes and comparing it with existing solutions, we found that it surpasses them, achieving F1 detection scores as high as 94.8%. Future research directions include exploring the effectiveness of our LOF technique in other domains, particularly in malware anomaly detection within cybersecurity.

The main contributions of this paper are as follows:

  • The introduction of an LOF-based technique for the detection of adversarial examples in NLP.

  • An extensive evaluation of LOF on 1,000 adversarial examples across various datasets and model architectures (Bidirectional Encoder Representations from Transformers (BERT), Word Convolutional Neural Networks (WordCNN), and Long Short-Term Memory (LSTM)), focusing on sentiment analysis and news classification tasks.

  • A comparison of TextGuard’s performance with existing literature, demonstrating its superiority in detecting adversarial attacks in NLP.

Language: English
Submitted on: Nov 26, 2023
Accepted on: Sep 1, 2024
Published on: Dec 14, 2025
Published by: Harran University
In partnership with: Paradigm Publishing Services
Publication frequency: 2 issues per year

© 2025 Marwan Omar, Luay Albtosh, published by Harran University
This work is licensed under the Creative Commons Attribution 4.0 License.
