The Optimization of n-Gram Feature Extraction Based on Term Occurrence for Cyberbullying Classification

Open Access | May 2024

Figures & Tables

Figure 1

Research workflow.

Table 1

Number of unigram, bigram, and trigram terms obtained during training for each Min-DF and Max-DF setting.

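| MIN-DF | MAX-DF | NUMBER OF TERMS: UNIGRAM | NUMBER OF TERMS: BIGRAM | NUMBER OF TERMS: TRIGRAM |
|---|---|---|---|---|
| 2 | 0.6 | 2,709 | 1,093 | 323 |
| 2 | 0.7 | 2,729 | 1,047 | 284 |
| 2 | 0.8 | 2,776 | 1,091 | 306 |
| 5 | 0.6 | 889 | 144 | 51 |
| 5 | 0.7 | 880 | 148 | 51 |
| 5 | 0.8 | 878 | 154 | 50 |
| 10 | 0.6 | 326 | 37 | 10 |
| 10 | 0.7 | 341 | 38 | 5 |
| 10 | 0.8 | 334 | 38 | 4 |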
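The way Min-DF and Max-DF shrink an n-gram vocabulary can be illustrated with a minimal sketch. It assumes that Min-DF is an absolute document count and Max-DF a proportion of the corpus, which is what the values in Table 1 suggest and is the convention used by scikit-learn's vectorizers. The helper names `ngrams` and `vocabulary` and the toy corpus are hypothetical and are not taken from the article.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary(docs, n, min_df=1, max_df=1.0):
    """Keep n-grams whose document frequency lies between min_df and max_df.

    Assumption: min_df is an absolute number of documents and max_df is a
    fraction of the corpus size.
    """
    df = Counter()
    for doc in docs:
        # set() ensures a term is counted at most once per document
        df.update(set(ngrams(doc.lower().split(), n)))
    upper = max_df * len(docs)
    return sorted(term for term, count in df.items() if min_df <= count <= upper)


# Hypothetical toy corpus, only for illustration
docs = [
    "you are so smart sarcasm",
    "you are so funny",
    "you are late again",
    "nice job sarcasm",
]

for n in (1, 2, 3):
    print(n, vocabulary(docs, n, min_df=2, max_df=0.8))
```

On this toy corpus the vocabulary shrinks from four unigrams to two bigrams to one trigram, the same direction of change that Table 1 reports as n grows.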
Table 2

TF-IDF feature extraction corpus with the n-Gram technique.

| n-GRAM | CORPUS |
|---|---|
| Unigram | [.., ‘andy’, ‘angry’, ‘animal’, ‘animals’, ‘announced’, ‘announces’, ‘annoy’, ‘anonymous’, ‘another’, ‘answer’, ‘anxiety’, ‘anymore’, ‘anyone’, ‘anything’, ‘anyway’, ‘anyways’, ‘anywhere’, ‘ap’, ‘apologize’, ‘app’, ‘apparently’, ‘appeal’, ‘appears’, ‘apple’, ‘apply’, ‘appointed’, ‘apprentice’, ‘apps’, ‘april’, ‘apt’, ‘area’, ‘areas’, ‘argue’, ‘arguing’, ‘arizona’, ‘arm’, ‘around’, ‘arrest’, ‘arrested’, ‘arrived’, ‘arsen’, ‘art’, ‘article’, ‘articles’, ‘artwork’, ‘ashley’, ‘asian’, ‘ask’, ‘asked’, ‘asking’, ‘ass’, ‘assholes’, ..] |
| Bigram | [.., ‘iran nuclear’, ‘iran polit’, ‘iraq syria’, ‘it time’, ‘itunes aus’, ‘jeb republican’, ‘jeremy corbyn’, ‘job ever’, ‘job sarcasm’, ‘joe moore’, ‘john allen’, ‘john boehner’, ‘join us’, ‘joke funni’, ‘jokeoftheday humor’, ‘jokes humor’, ‘joy sarcasm’, ‘kanye west’, ‘keep perfect’, ‘kentucky clerk’, ‘kid sarcasm’, ‘kids families’, ‘kids http’, ‘kim davis’, ‘kindle education’, ‘kindness compassion’, ‘knew right’, ‘knew sarcasm’, ‘know going’, ‘know late’, ‘know man’, ‘know nt’, ‘know really’, ‘know sarcasm’, ‘know sarcastic’, ‘know super’, ‘knows make’, ‘lake erie’, ‘last night’, ..] |
| Trigram | [.., ‘retweet follow followback’, ‘romance ebook kindle’, ‘rss education design’, ‘rss implemetation http’, ‘sarcasm lol funny’, ‘sarcasm lowest form’, ‘sarcastic apt music’, ‘saw one coming’, ‘says ky clerk’, ‘see coming sarcasm’, ‘seed compounds may’, ‘sep et http’, ‘sep market growth’, ‘sesame seed compounds’, ‘special teams sarcasm’, ‘specialty line drugs’, ‘stellar photo algae’, ‘study reveals http’, ‘sunday night sarcasm’, ‘sundries wholesale sep’, ‘tcoefxcglyvk tsemtulku humanrights’, ‘tcojfdzmyky tinder fails’, ‘tcoyhjfohphp reuters health’, ‘tech longr http’, ‘tech longrea http’, ..] |
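Each term in a corpus such as the one in Table 2 receives a TF-IDF weight per document. The sketch below uses one textbook variant, relative term frequency multiplied by log(N / df). This is an assumption: the article may use a different weighting or normalisation. The function names and the two-document corpus are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=1):
    """Return one {term: weight} dict per document.

    Assumed weighting: (count / terms in document) * log(N / document frequency).
    """
    term_lists = [ngrams(doc.lower().split(), n) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for terms in term_lists for term in set(terms))
    weights = []
    for terms in term_lists:
        counts = Counter(terms)
        weights.append({
            term: (count / len(terms)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical two-document corpus
weights = tfidf(["great job sarcasm", "great job team"], n=2)
print(weights[0])
```

With this variant, a term that occurs in every document (here the bigram "great job") gets weight 0, while a term unique to one document gets a positive weight.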
Table 3

Effect of the minimum document frequency (Min-DF) on TF-IDF n-Gram classification performance.

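| n-GRAM | MIN-DF | ACCURACY (%) | PRECISION (%) | RECALL (%) | F1-SCORE (%) |
|---|---|---|---|---|---|
| Unigram | 2 | 0.9947 | 0.9946 | 0.9973 | 0.9960 |
| Unigram | 5 | 0.9967 | 0.9966 | 0.9981 | 0.9973 |
| Unigram | 10 | 0.9936 | 0.9937 | 0.9970 | 0.9956 |
| Bigram | 2 | 0.8449 | 0.8292 | 0.8817 | 0.8531 |
| Bigram | 5 | 0.8359 | 0.8217 | 0.8662 | 0.8383 |
| Bigram | 10 | 0.8300 | 0.8100 | 0.8684 | 0.8370 |
| Trigram | 2 | 0.664 | 0.6445 | 0.9102 | 0.7096 |
| Trigram | 5 | 0.6422 | 0.6246 | 0.9162 | 0.6893 |
| Trigram | 10 | 0.6117 | 0.5967 | 0.9159 | 0.6751 |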
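The four metrics reported in these tables have standard definitions in terms of a binary confusion matrix. The sketch below computes them from raw counts. The function name and the example counts are hypothetical and are not the article's data. They were chosen only to show how recall can stay high while precision and accuracy drop, the pattern seen in the trigram rows.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. Undefined ratios (zero denominator) are returned as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: many false positives, few false negatives
metrics = classification_metrics(tp=90, fp=50, fn=10, tn=50)
print(metrics)
```

For these counts the accuracy is 0.70, the recall is 0.90, the precision is about 0.64, and the F1-score is 0.75.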
Figure 2

The Impact of the TF-IDF n-Gram classification on variations in the minimum values of frequency data.

Table 4

Effect of the maximum document frequency (Max-DF) on TF-IDF n-Gram classification performance.

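| n-GRAM | MAX-DF | ACCURACY (%) | PRECISION (%) | RECALL (%) | F1-SCORE (%) |
|---|---|---|---|---|---|
| Unigram | 0.6 | 0.9934 | 0.9931 | 0.9973 | 0.9950 |
| Unigram | 0.7 | 0.9933 | 0.9932 | 0.9974 | 0.9956 |
| Unigram | 0.8 | 0.9982 | 0.9986 | 0.9976 | 0.9983 |
| Bigram | 0.6 | 0.8374 | 0.8244 | 0.8680 | 0.8397 |
| Bigram | 0.7 | 0.8378 | 0.8190 | 0.8740 | 0.8441 |
| Bigram | 0.8 | 0.8355 | 0.8175 | 0.8743 | 0.8445 |
| Trigram | 0.6 | 0.6380 | 0.6256 | 0.9148 | 0.6890 |
| Trigram | 0.7 | 0.6423 | 0.6129 | 0.9144 | 0.6926 |
| Trigram | 0.8 | 0.6378 | 0.6273 | 0.9131 | 0.6924 |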
Figure 3

The impact of the TF-IDF n-Gram classification on variations in the maximum values of frequency data.

Figure 4

Graph showing the distribution of term groups determined by the length of the term.

Table 5

Data distribution groups without outliers.

| DATA GROUP | INSTANCE | TERM RANGE |
|---|---|---|
| 1 | 986 | 4–12 |
| 2 | 1,815 | 13–19 |
| 3 | 1,154 | 20–32 |
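Assigning a text to one of the three data groups can be sketched as a simple range lookup. The ranges below are copied from Table 5. The sketch assumes that the term length of a text is its number of whitespace-separated tokens, which the page does not state explicitly. The names `TERM_RANGES` and `term_length_group` are hypothetical.

```python
# Term-length ranges per data group, taken from Table 5 (inclusive bounds)
TERM_RANGES = {1: (4, 12), 2: (13, 19), 3: (20, 32)}


def term_length_group(text):
    """Return the data group (1, 2 or 3) for a text, or None for an outlier.

    Assumption: term length is the number of whitespace-separated tokens.
    """
    length = len(text.split())
    for group, (low, high) in TERM_RANGES.items():
        if low <= length <= high:
            return group
    return None  # shorter than 4 or longer than 32 terms


print(term_length_group("this is a short post"))   # 5 terms, group 1
print(term_length_group("too short"))              # 2 terms, outlier (None)
```

Texts whose length falls outside all three ranges are returned as `None`, corresponding to the outliers that Table 5 excludes.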
Table 6

TF-IDF n-Gram classification test results based on term length.

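| n-GRAM | TERM-LENGTH GROUP | ACCURACY (%) | PRECISION (%) | RECALL (%) | F1-SCORE (%) |
|---|---|---|---|---|---|
| Unigram | 1 | 0.9526 | 0.9481 | 0.9922 | 0.9689 |
| Unigram | 2 | 0.9967 | 0.9985 | 0.9963 | 0.9978 |
| Unigram | 3 | 0.9937 | 0.9959 | 0.9844 | 0.9915 |
| Bigram | 1 | 0.8504 | 0.8622 | 0.9574 | 0.9070 |
| Bigram | 2 | 0.8304 | 0.8356 | 0.8685 | 0.8526 |
| Bigram | 3 | 0.8222 | 0.7133 | 0.6974 | 0.6889 |
| Trigram | 1 | 0.7637 | 0.7622 | 1.0000 | 0.8648 |
| Trigram | 2 | 0.6189 | 0.5981 | 0.9978 | 0.7485 |
| Trigram | 3 | 0.7222 | 0.8596 | 0.0615 | 0.1115 |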
Figure 5

Graph of TF-IDF n-Gram classification testing against term length.

Figure 6

Classification performance results from TF-IDF feature extraction using the n-Gram approach on term length and minimum frequency data.

Figure 7

Classification performance results from TF-IDF feature extraction using the n-Gram approach on term length and maximum frequency data.

Table 7

Classification from TF-IDF feature extraction results using the n-Gram approach.

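| n-GRAM | SVM ACCURACY | SVM F1-SCORE | K-NN ACCURACY | K-NN F1-SCORE | LINEAR REGRESSION ACCURACY | LINEAR REGRESSION F1-SCORE | NAÏVE BAYES ACCURACY | NAÏVE BAYES F1-SCORE | RANDOM FOREST ACCURACY | RANDOM FOREST F1-SCORE |
|---|---|---|---|---|---|---|---|---|---|---|
| Data Group 1 | | | | | | | | | | |
| Unigram | 0.9526 | 0.9689 | 0.9644 | 0.9756 | 0.9600 | 0.9733 | 0.9311 | 0.9578 | 0.8467 | 0.9000 |
| Bigram | 0.8504 | 0.9070 | 0.8700 | 0.9200 | 0.8700 | 0.9200 | 0.8567 | 0.9067 | 0.9856 | 0.9911 |
| Trigram | 0.7637 | 0.8648 | 0.7733 | 0.8700 | 0.7733 | 0.8700 | 0.7733 | 0.8700 | 0.7844 | 0.8700 |
| Data Group 2 | | | | | | | | | | |
| Unigram | 0.9967 | 0.9978 | 0.9867 | 0.9867 | 0.9867 | 0.9867 | 0.8867 | 0.9167 | 0.8067 | 0.8400 |
| Bigram | 0.8304 | 0.8526 | 0.8200 | 0.8567 | 0.8233 | 0.8633 | 0.7967 | 0.8333 | 0.9867 | 0.9900 |
| Trigram | 0.6189 | 0.7485 | 0.6433 | 0.7667 | 0.6433 | 0.7667 | 0.6567 | 0.7667 | 0.6433 | 0.7667 |
| Data Group 3 | | | | | | | | | | |
| Unigram | 0.9937 | 0.9915 | 0.9700 | 0.9478 | 0.9878 | 0.9789 | 0.8867 | 0.7700 | 0.8333 | 0.6733 |
| Bigram | 0.8222 | 0.6889 | 0.7878 | 0.5922 | 0.8133 | 0.5678 | 0.7567 | 0.2667 | 0.9856 | 0.9733 |
| Trigram | 0.7222 | 0.1115 | 0.7467 | 0.0867 | 0.7578 | 0.0944 | 0.7500 | 0.0800 | 0.7533 | 0.0833 |
| Maximum | 0.9967 | 0.9978 | 0.9867 | 0.9867 | 0.9878 | 0.9867 | 0.9311 | 0.9578 | 0.9867 | 0.9911 |
| Minimum | 0.6189 | 0.1115 | 0.6433 | 0.0867 | 0.6433 | 0.0944 | 0.6567 | 0.0800 | 0.6433 | 0.0833 |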
Table 8

Optimum pattern of TF-IDF n-Gram feature extraction based on forming parameters.

| PARAMETER | UNIGRAM | BIGRAM | TRIGRAM |
|---|---|---|---|
| Min-DF | Insignificant effect | Low | Low |
| Max-DF | High | Insignificant effect | High |
| Length of terms | Small term length | Small/medium term length | Medium/high term length |
Language: English
Submitted on: Jun 8, 2023
Accepted on: May 6, 2024
Published on: May 23, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Yudi Setiawan, Nur Ulfa Maulidevi, Kridanto Surendro, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.