
Figure 1
Research workflow.
Table 1
Number of unigram, bigram, and trigram terms retained during training for each combination of Min-DF and Max-DF.
| MIN-DF | MAX-DF | NUMBER OF TERMS (UNIGRAM) | NUMBER OF TERMS (BIGRAM) | NUMBER OF TERMS (TRIGRAM) |
|---|---|---|---|---|
| 2 | 0.6 | 2,709 | 1,093 | 323 |
| 2 | 0.7 | 2,729 | 1,047 | 284 |
| 2 | 0.8 | 2,776 | 1,091 | 306 |
| 5 | 0.6 | 889 | 144 | 51 |
| 5 | 0.7 | 880 | 148 | 51 |
| 5 | 0.8 | 878 | 154 | 50 |
| 10 | 0.6 | 326 | 37 | 10 |
| 10 | 0.7 | 341 | 38 | 5 |
| 10 | 0.8 | 334 | 38 | 4 |
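The way Min-DF and Max-DF shrink the vocabulary can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names `ngrams` and `vocabulary` and the toy corpus are hypothetical, and the sketch assumes that Min-DF is an absolute document count while Max-DF is a fraction of the documents, as the integer and decimal values in Table 1 suggest.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary(docs, n, min_df=1, max_df=1.0):
    """Keep the n-grams whose document frequency lies in [min_df, max_df * N].

    Assumption: min_df is an absolute number of documents and max_df is a
    fraction of the N documents in the corpus.
    """
    df = Counter()
    for tokens in docs:
        df.update(set(ngrams(tokens, n)))  # count each n-gram once per document
    limit = max_df * len(docs)
    return sorted(term for term, count in df.items() if min_df <= count <= limit)


# Toy corpus invented for illustration; it is not the paper's data.
docs = [
    "great job sarcasm".split(),
    "great job everyone".split(),
    "last night was great".split(),
    "last night sarcasm".split(),
]

for min_df in (1, 2):
    sizes = {n: len(vocabulary(docs, n, min_df=min_df, max_df=0.6)) for n in (1, 2, 3)}
    print(f"min_df={min_df}: {sizes}")
```

On this toy corpus, raising `min_df` removes proportionally more trigrams than unigrams, which is the same direction of change that Table 1 reports.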
Table 2
TF-IDF feature extraction corpus with the n-Gram technique.
| n-GRAM | CORPUS |
|---|---|
| Unigram | [.., ‘andy’, ‘angry’, ‘animal’, ‘animals’, ‘announced’, ‘announces’, ‘annoy’, ‘anonymous’, ‘another’, ‘answer’, ‘anxiety’, ‘anymore’, ‘anyone’, ‘anything’, ‘anyway’, ‘anyways’, ‘anywhere’, ‘ap’, ‘apologize’, ‘app’, ‘apparently’, ‘appeal’, ‘appears’, ‘apple’, ‘apply’, ‘appointed’, ‘apprentice’, ‘apps’, ‘april’, ‘apt’, ‘area’, ‘areas’, ‘argue’, ‘arguing’, ‘arizona’, ‘arm’, ‘around’, ‘arrest’, ‘arrested’, ‘arrived’, ‘arsen’, ‘art’, ‘article’, ‘articles’, ‘artwork’, ‘ashley’, ‘asian’, ‘ask’, ‘asked’, ‘asking’, ‘ass’, ‘assholes’, ..] |
| Bigram | [.., ‘iran nuclear’, ‘iran polit’, ‘iraq syria’, ‘it time’, ‘itunes aus’, ‘jeb republican’, ‘jeremy corbyn’, ‘job ever’, ‘job sarcasm’, ‘joe moore’, ‘john allen’, ‘john boehner’, ‘join us’, ‘joke funni’, ‘jokeoftheday humor’, ‘jokes humor’, ‘joy sarcasm’, ‘kanye west’, ‘keep perfect’, ‘kentucky clerk’, ‘kid sarcasm’, ‘kids families’, ‘kids http’, ‘kim davis’, ‘kindle education’, ‘kindness compassion’, ‘knew right’, ‘knew sarcasm’, ‘know going’, ‘know late’, ‘know man’, ‘know nt’, ‘know really’, ‘know sarcasm’, ‘know sarcastic’, ‘know super’, ‘knows make’, ‘lake erie’, ‘last night’, ..] |
| Trigram | [.., ‘retweet follow followback’, ‘romance ebook kindle’, ‘rss education design’, ‘rss implemetation http’, ‘sarcasm lol funny’, ‘sarcasm lowest form’, ‘sarcastic apt music’, ‘saw one coming’, ‘says ky clerk’, ‘see coming sarcasm’, ‘seed compounds may’, ‘sep et http’, ‘sep market growth’, ‘sesame seed compounds’, ‘special teams sarcasm’, ‘specialty line drugs’, ‘stellar photo algae’, ‘study reveals http’, ‘sunday night sarcasm’, ‘sundries wholesale sep’, ‘tcoefxcglyvk tsemtulku humanrights’, ‘tcojfdzmyky tinder fails’, ‘tcoyhjfohphp reuters health’, ‘tech longr http’, ‘tech longrea http’, ..] |
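As a rough illustration of how n-gram terms such as those in Table 2 receive TF-IDF weights, the sketch below computes the weights in plain Python. The exact TF-IDF variant used in the study is not stated here, so the smoothed formula idf(t) = ln((1 + N) / (1 + df(t))) + 1 (the default in scikit-learn's `TfidfVectorizer`) is an assumption, as are the helper names `ngrams` and `tfidf` and the example documents. Length normalization is omitted for brevity.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n):
    """Return one {term: weight} dict per document.

    Assumed weighting: raw term count times a smoothed inverse document
    frequency, idf(t) = ln((1 + N) / (1 + df(t))) + 1, with no normalization.
    """
    grams = [ngrams(tokens, n) for tokens in docs]
    df = Counter()
    for g in grams:
        df.update(set(g))  # document frequency: one count per document
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(g).items()} for g in grams]


# Example documents invented for illustration; they are not the paper's data.
docs = [
    "know sarcasm when i see it".split(),
    "last night was great sarcasm".split(),
    "last night was long".split(),
]

weights = tfidf(docs, 2)
for term in ("know sarcasm", "last night"):
    for i, w in enumerate(weights):
        if term in w:
            print(f"doc {i}: {term!r} -> {w[term]:.4f}")
```

A bigram that occurs in a single document ("know sarcasm") ends up with a higher weight than one shared by two documents ("last night"), which is the behaviour TF-IDF is designed to produce.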
Table 3
Effect of the minimum document frequency (Min-DF) on TF-IDF n-Gram classification performance.
| n-GRAM | MIN-DF | ACCURACY | PRECISION | RECALL | F1-SCORE |
|---|---|---|---|---|---|
| Unigram | 2 | 0.9947 | 0.9946 | 0.9973 | 0.9960 |
| Unigram | 5 | 0.9967 | 0.9966 | 0.9981 | 0.9973 |
| Unigram | 10 | 0.9936 | 0.9937 | 0.9970 | 0.9956 |
| Bigram | 2 | 0.8449 | 0.8292 | 0.8817 | 0.8531 |
| Bigram | 5 | 0.8359 | 0.8217 | 0.8662 | 0.8383 |
| Bigram | 10 | 0.8300 | 0.8100 | 0.8684 | 0.8370 |
| Trigram | 2 | 0.664 | 0.6445 | 0.9102 | 0.7096 |
| Trigram | 5 | 0.6422 | 0.6246 | 0.9162 | 0.6893 |
| Trigram | 10 | 0.6117 | 0.5967 | 0.9159 | 0.6751 |

Figure 2
Impact of the minimum document frequency (Min-DF) on TF-IDF n-Gram classification performance.
Table 4
Effect of the maximum document frequency (Max-DF) on TF-IDF n-Gram classification performance.
| n-GRAM | MAX-DF | ACCURACY | PRECISION | RECALL | F1-SCORE |
|---|---|---|---|---|---|
| Unigram | 0.6 | 0.9934 | 0.9931 | 0.9973 | 0.9950 |
| Unigram | 0.7 | 0.9933 | 0.9932 | 0.9974 | 0.9956 |
| Unigram | 0.8 | 0.9982 | 0.9986 | 0.9976 | 0.9983 |
| Bigram | 0.6 | 0.8374 | 0.8244 | 0.8680 | 0.8397 |
| Bigram | 0.7 | 0.8378 | 0.8190 | 0.8740 | 0.8441 |
| Bigram | 0.8 | 0.8355 | 0.8175 | 0.8743 | 0.8445 |
| Trigram | 0.6 | 0.6380 | 0.6256 | 0.9148 | 0.6890 |
| Trigram | 0.7 | 0.6423 | 0.6129 | 0.9144 | 0.6926 |
| Trigram | 0.8 | 0.6378 | 0.6273 | 0.9131 | 0.6924 |

Figure 3
Impact of the maximum document frequency (Max-DF) on TF-IDF n-Gram classification performance.

Figure 4
Distribution of the data groups defined by term length.
Table 5
Data distribution groups without outliers.
| DATA GROUP | INSTANCES | TERM RANGE |
|---|---|---|
| 1 | 986 | 4–12 |
| 2 | 1,815 | 13–19 |
| 3 | 1,154 | 20–32 |
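The grouping in Table 5 can be expressed as a small lookup. The sketch below is illustrative only: it assumes that "term range" means the number of tokens in an instance and that lengths outside 4 to 32 are the excluded outliers. The names `GROUPS` and `term_length_group` are hypothetical.

```python
# Inclusive term-length ranges taken from Table 5.
GROUPS = {1: (4, 12), 2: (13, 19), 3: (20, 32)}


def term_length_group(tokens):
    """Return the Table 5 data group for a tokenized instance.

    Assumption: the group is decided by the token count, and any length
    outside every range is treated as an outlier (None).
    """
    length = len(tokens)
    for group, (low, high) in GROUPS.items():
        if low <= length <= high:
            return group
    return None


example = "oh great another monday morning meeting".split()
print(len(example), term_length_group(example))
```

Keeping the ranges in one dictionary makes it easy to check that they do not overlap and that every boundary value falls in exactly one group.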
Table 6
TF-IDF n-Gram classification test results based on term length.
| n-GRAM | TERM-LENGTH GROUP | ACCURACY | PRECISION | RECALL | F1-SCORE |
|---|---|---|---|---|---|
| Unigram | 1 | 0.9526 | 0.9481 | 0.9922 | 0.9689 |
| Unigram | 2 | 0.9967 | 0.9985 | 0.9963 | 0.9978 |
| Unigram | 3 | 0.9937 | 0.9959 | 0.9844 | 0.9915 |
| Bigram | 1 | 0.8504 | 0.8622 | 0.9574 | 0.9070 |
| Bigram | 2 | 0.8304 | 0.8356 | 0.8685 | 0.8526 |
| Bigram | 3 | 0.8222 | 0.7133 | 0.6974 | 0.6889 |
| Trigram | 1 | 0.7637 | 0.7622 | 1.0000 | 0.8648 |
| Trigram | 2 | 0.6189 | 0.5981 | 0.9978 | 0.7485 |
| Trigram | 3 | 0.7222 | 0.8596 | 0.0615 | 0.1115 |
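The four metrics in the table follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the counts are invented for illustration and are not the study's data. The reported values may have been averaged over several runs or folds, so they need not satisfy these identities exactly.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: the classifier rarely predicts the positive class,
# so precision stays high while recall collapses.
m = binary_metrics(tp=6, fp=1, fn=94, tn=199)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy remains moderate and precision is high, yet F1 falls close to recall. This is the same pattern as the trigram row for group 3, where accuracy alone would hide the fact that almost no positive instances are recovered.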

Figure 5
TF-IDF n-Gram classification performance by term length.

Figure 6
Classification performance results from TF-IDF feature extraction using the n-Gram approach on term length and minimum frequency data.

Figure 7
Classification performance results from TF-IDF feature extraction using the n-Gram approach on term length and maximum frequency data.
Table 7
Classification results for each classifier on TF-IDF features extracted with the n-Gram approach.
| n-GRAM | SVM ACCURACY | SVM F1-SCORE | K-NN ACCURACY | K-NN F1-SCORE | LINEAR REGRESSION ACCURACY | LINEAR REGRESSION F1-SCORE | NAÏVE BAYES ACCURACY | NAÏVE BAYES F1-SCORE | RANDOM FOREST ACCURACY | RANDOM FOREST F1-SCORE |
|---|---|---|---|---|---|---|---|---|---|---|
| Data Group 1 | | | | | | | | | | |
| Unigram | 0.9526 | 0.9689 | 0.9644 | 0.9756 | 0.9600 | 0.9733 | 0.9311 | 0.9578 | 0.8467 | 0.9000 |
| Bigram | 0.8504 | 0.9070 | 0.8700 | 0.9200 | 0.8700 | 0.9200 | 0.8567 | 0.9067 | 0.9856 | 0.9911 |
| Trigram | 0.7637 | 0.8648 | 0.7733 | 0.8700 | 0.7733 | 0.8700 | 0.7733 | 0.8700 | 0.7844 | 0.8700 |
| Data Group 2 | | | | | | | | | | |
| Unigram | 0.9967 | 0.9978 | 0.9867 | 0.9867 | 0.9867 | 0.9867 | 0.8867 | 0.9167 | 0.8067 | 0.8400 |
| Bigram | 0.8304 | 0.8526 | 0.8200 | 0.8567 | 0.8233 | 0.8633 | 0.7967 | 0.8333 | 0.9867 | 0.9900 |
| Trigram | 0.6189 | 0.7485 | 0.6433 | 0.7667 | 0.6433 | 0.7667 | 0.6567 | 0.7667 | 0.6433 | 0.7667 |
| Data Group 3 | | | | | | | | | | |
| Unigram | 0.9937 | 0.9915 | 0.9700 | 0.9478 | 0.9878 | 0.9789 | 0.8867 | 0.7700 | 0.8333 | 0.6733 |
| Bigram | 0.8222 | 0.6889 | 0.7878 | 0.5922 | 0.8133 | 0.5678 | 0.7567 | 0.2667 | 0.9856 | 0.9733 |
| Trigram | 0.7222 | 0.1115 | 0.7467 | 0.0867 | 0.7578 | 0.0944 | 0.7500 | 0.0800 | 0.7533 | 0.0833 |
| Maximum | 0.9967 | 0.9978 | 0.9867 | 0.9867 | 0.9878 | 0.9867 | 0.9311 | 0.9578 | 0.9867 | 0.9911 |
| Minimum | 0.6189 | 0.1115 | 0.6433 | 0.0867 | 0.6433 | 0.0944 | 0.6567 | 0.0800 | 0.6433 | 0.0833 |
