Tackling the Problem of Class Imbalance in Multi-class Sentiment Classification: An Experimental Study

Lango, Mateusz

doi:10.2478/fcds-2019-0009

References

[1] Abbasi, A., France, S., Zhang, Z., Chen, H.: Selecting Attributes for Sentiment Classification Using Feature Relation Networks. IEEE Transactions on Knowledge and Data Engineering, 23 (3), 447–462 (2011). doi:10.1109/TKDE.2010.110
[2] Baccianella, S., Esuli, A., Sebastiani, F.: Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proc. of the Int. Conference on Language Resources and Evaluation (2010).
[3] Blagus, R., Lusa, L.: SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics, 14 (1), 106 (2013). doi:10.1186/1471-2105-14-106
[4] Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL-2007), 440–447 (2007).
[5] Błaszczyński, J., Stefanowski, J.: Neighbourhood sampling in bagging for imbalanced data. Neurocomputing, 150 A, 184–203 (2015). doi:10.1016/j.neucom.2014.07.064
[6] Błaszczyński, J., Stefanowski, J.: Local data characteristics in learning classifiers from imbalanced data. In Advances in Data Analysis with Computational Intelligence Methods, 51–85, Springer (2018). doi:10.1007/978-3-319-67946-4_2
[7] Brzezinski, D., Stefanowski, J.: Stream Classification. Encyclopedia of Machine Learning and Data Mining, Springer (2017). doi:10.1007/978-1-4899-7687-1_908
[8] Burns N., Bi Y., Wang H., Anderson T.: Sentiment Analysis of Customer Reviews: Balanced versus Unbalanced Datasets. In: König A., Dengel A., Hinkelmann K., Kise K., Howlett R.J., Jain L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems, LNCS, 6881, 161–170 (2011). doi:10.1007/978-3-642-23851-2_17
[9] Chawla, N.: Data mining for imbalanced datasets: An overview. In Maimon O., Rokach L. (eds): The Data Mining and Knowledge Discovery Handbook, Springer, 853–867 (2005). doi:10.1007/0-387-25465-X_40
[10] Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: Synthetic Minority Over-sampling Technique. J. of Artificial Intelligence Research, 16, 321–357 (2002). doi:10.1613/jair.953
[11] Das, S. R., Chen, M. Y.: Yahoo! for Amazon: Sentiment extraction from small talk on the web. Management Science, 53 (9), 1375–1388 (2007). doi:10.1287/mnsc.1070.0704
[12] Fernández A., García S., Galar M., Prati R., Krawczyk B., Herrera F.: Learning from Imbalanced Data Sets. Springer (2018). doi:10.1007/978-3-319-98074-4
[13] Fernández, A., Garcia, S., Herrera, F., Chawla, N.V.: SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research, 61, 863–905 (2018). doi:10.1613/jair.1.11192
[14] Fernandez, A., Lopez, V., Galar, M., del Jesus, M.J., Herrera, F.: Analysing the classification of imbalanced data sets with multiple classes, binarization techniques and ad-hoc approaches. Knowledge-Based Systems, 42, 97–110 (2013). doi:10.1016/j.knosys.2013.01.018
[15] Fernández-Navarro, F., Hervás-Martínez, C., Gutiérrez, P.A.: A dynamic oversampling procedure based on sensitivity for multi-class problems. Pattern Recognition, 44, 1821–1833 (2011). doi:10.1016/j.patcog.2011.02.019
[16] Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42 (4), 463–484 (2012). doi:10.1109/TSMCC.2011.2161285
[17] Ganu, G., Elhadad, N., Marian, A.: Beyond the stars: improving rating predictions using review text content. In Proc. of 12th Int. Workshop on the Web and Databases, 9, 1–6 (2009).
[18] Garcia, V., Sanchez, J.S., Mollineda, R.A.: An empirical study of the behaviour of classifiers on imbalanced and overlapped data sets. In Proc. of Progress in Pattern Recognition, Image Analysis and Applications, LNCS, 4756, 397–406 (2007). doi:10.1007/978-3-540-76725-1_42
[19] Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Advances in intelligent computing, 878–887 (2005). doi:10.1007/11538059_91
[20] He, H., Yang, B., Garcia, E.A., Li, S.: ADASYN: Adaptive synthetic sampling approach for imbalanced learning. IEEE Int. Joint Conference on Neural Networks, 1322–1328 (2008). doi:10.1109/IJCNN.2008.4633969
[21] He H., Garcia E.: Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21 (9), 1263–1284 (2009). doi:10.1109/TKDE.2008.239
[22] He, H., Ma, Y.: Imbalanced learning: foundations, algorithms, and applications. Wiley (2013). doi:10.1002/9781118646106
[23] Hido, S., Kashima, H.: Roughly balanced bagging for imbalanced data. Statistical Analysis and Data Mining, 2 (5-6), 412–426 (2009). doi:10.1002/sam.10061
[24] Hu, M., Liu, B.: Mining and summarizing customer reviews. In Proc. of the 10th ACM SIGKDD Int. Conference on Knowledge Discovery and Data Mining, 168–177 (2004). doi:10.1145/1014052.1014073
[25] Japkowicz, N., Stephen, S.: Class imbalance problem: a systematic study. Intelligent Data Analysis Journal, 6 (5), 429–450 (2002). doi:10.3233/IDA-2002-6504
[26] Jo, T., Japkowicz, N.: Class Imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6 (1), 40–49 (2004). doi:10.1145/1007730.1007737
[27] Kiritchenko, S., Zhu, X., Mohammad, S.M.: Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, 50, 723–762 (2014). doi:10.1613/jair.4272
[28] Koppel, M., Schler, J.: The Importance of Neutral Examples for Learning Sentiment. Computational Intelligence, 22, 100–109 (2006). doi:10.1111/j.1467-8640.2006.00276.x
[29] Krawczyk B., McInnes B.T., Cano A.: Sentiment Classification from Multi-class Imbalanced Twitter Data Using Binarization. In: Martínez de Pisón F., Urraca R., Quintián H., Corchado E. (eds) Hybrid Artificial Intelligent Systems, LNCS, 10334, 26–37 (2017). doi:10.1007/978-3-319-59650-1_3
[30] Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In Proc. of the 14th Int. Conf. on Machine Learning ICML-97, 179–186 (1997).
[31] Kuncheva, L. I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley (2004). doi:10.1002/0471660264
[32] Lango M., Brzeziński D., Firlik S., Stefanowski J.: Discovering Minority Subclusters and Local Difficulty Factors from Imbalanced Data. In Proc. of the 20th Int. Conference on Discovery Science (2017). doi:10.1007/978-3-319-67786-6_23
[33] Lango M., Brzeziński D., Stefanowski J.: PUT at SemEval-2016 Task 4: The ABC of Twitter Sentiment Analysis. In Proc. of the 10th Int. Workshop on Semantic Evaluation (2016). doi:10.18653/v1/S16-1018
[34] Lango, M., Napierala, K., Stefanowski, J.: Evaluating Difficulty of Multi-class Imbalanced Data. In Proc. of 23rd Int. Symposium on Methodologies for Intelligent Systems, 312–322 (2017). doi:10.1007/978-3-319-60438-1_31
[35] Lango M., Stefanowski J.: Multi-class and Feature Selection Extensions of Roughly Balanced Bagging for Imbalanced Data. Journal of Intelligent Information Systems (2018). doi:10.1007/s10844-017-0446-7
[36] Lemaître G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research, 18 (17), 1–5 (2017).
[37] Li, S., Ju, S., Zhou, G., Li, X.: Active learning for imbalanced sentiment classification. In Proc. of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 139–148 (2012).
[38] Li, S., Wang, Z., Zhou, G., Lee, S. Y. M.: Semi-supervised learning for imbalanced sentiment classification. In Proc. of Int. Joint Conference on Artificial Intelligence, 22 (3), 1826–1831 (2011).
[39] Li, T., Zhang, Y., Sindhwani, V.: A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge. In Proc. of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th Int. Joint Conference on Natural Language Processing of the AFNLP, 1, 244–252 (2009). doi:10.3115/1687878.1687914
[40] Li, S., Zhou, G., Wang, Z., Lee, S. Y. M., Wang, R.: Imbalanced sentiment classification. In Proc. of the 20th ACM Int. Conference on Information and Knowledge Management, 2469–2472 (2011). doi:10.1145/2063576.2063994
[41] Liu, B.: Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies, Morgan & Claypool (2012). doi:10.2200/S00416ED1V01Y201204HLT016
[42] Loper, E., Bird, S.: NLTK: The natural language toolkit. In Proc. of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 1, 63–70 (2002). doi:10.3115/1118108.1118117
[43] Mathioudakis, M., Koudas, N.: Twitter-monitor: Trend detection over the twitter stream. In Proc. of the 2010 ACM SIGMOD Int. Conference on Management of Data, 1155–1158 (2010). doi:10.1145/1807167.1807306
[44] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed Representations of Words and Phrases and their Compositionality. In Proc. of Neural Information Processing Systems (2013).
[45] Mohammad, S., Turney, P.D.: Crowd-sourcing a word-emotion association lexicon. Computational Intelligence, 29 (3), 436–465 (2013). doi:10.1111/j.1467-8640.2012.00460.x
[46] Mountassir, A., Benbrahim, H., Berrada, I.: An empirical study to address the problem of Unbalanced Data Sets in sentiment classification. IEEE Int. Conference on Systems, Man, and Cybernetics (SMC), 3298–3303 (2012). doi:10.1109/ICSMC.2012.6378300
[47] Nakov, P., Ritter, A., Rosenthal, S., Stoyanov, V., Sebastiani, F.: SemEval-2016 task 4: Sentiment analysis in Twitter. In Proc. of the 10th Int. Workshop on Semantic Evaluation (2016). doi:10.18653/v1/S16-1001
[48] Napierala, K., Stefanowski, J.: The influence of minority class distribution on learning from imbalance data. In Proc. of the 7th Int. Conference on Hybrid Artificial Intelligent Systems, LNAI, 7209, 139–150 (2012). doi:10.1007/978-3-642-28931-6_14
[49] Napierala, K., Stefanowski, J.: BRACID: a comprehensive approach to learning rules from imbalanced data. Journal of Intelligent Information Systems, 39 (2), 335–373 (2012). doi:10.1007/s10844-011-0193-0
[50] Napierala, K., Stefanowski, J.: Types of minority class examples and their influence on learning classifiers from imbalanced data. Journal of Intelligent Information Systems, 46 (3), 563–597 (2016). doi:10.1007/s10844-015-0368-1
[51] Jakob, N., Weber, S.H., Müller, M.C., Gurevych, I.: Beyond the stars: exploiting free-text user reviews to improve the accuracy of movie recommendations. In Proc. of the 1st Int. Workshop on Topic-sentiment analysis for mass opinion (2009).
[52] Ohana, B., Tierney, B., Delany, S. J.: Domain independent sentiment classification with many lexicons. In 4th Int. Symposium on Mining and Web at 25th Int. Conference on Advanced Information Networking and Applications (AINA), 632–637 (2011). doi:10.1109/WAINA.2011.103
[53] Pang, B., Lee, L.: A Sentimental Education: Sentiment Analysis using subjectivity summarization based on minimum cuts. In: 42nd Annual Meeting on Association for Computational Linguistics, 271–278 (2004). doi:10.3115/1218955.1218990
[54] Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment Classification using Machine Learning Techniques. In: Conference on Empirical Methods in Natural Language Processing, 10, 79–86 (2002). doi:10.3115/1118693.1118704
[55] Pedregosa et al.: Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830 (2011).
[56] Prati, R., Batista, G., Monard, M.: Class imbalance versus class overlapping: an analysis of a learning system behavior. In Proc. of 3rd Mexican Int. Conf. on Artificial Intelligence, 312–321 (2004). doi:10.1007/978-3-540-24694-7_32
[57] Remus, R.: Modeling and representing negation in data-driven machine learning-based sentiment analysis. In Proc. of 1st Int. Workshop on Emotion and Sentiment in Social and Expressive Media (ESSEM 2013), 22–33 (2013).
[58] Schütze, H., Manning, C.D.: Foundations of Statistical Natural Language Processing. MIT Press (1999).
[59] Song, K., Feng, S., Gao, W., Wang, D., Yu, G., Wong, K. F.: Personalized Sentiment Classification Based on Latent Individuality of Microblog Users. In Proc. of Int. Joint Conferences on Artificial Intelligence, 2277–2283 (2015).
[60] Stefanowski, J.: Overlapping, rare examples and class decomposition in learning classifiers from imbalanced data. In Ramanna, L.C.J.S. and Howlett, R.J. (eds), Emerging Paradigms in Machine Learning, 277–306 (2013). doi:10.1007/978-3-642-28699-5_11
[61] Stefanowski, J.: Dealing with Data Difficulty Factors while Learning from Imbalanced Data. In S. Matwin and J. Mielniczuk (eds), Challenges in Computational Statistics and Data Mining, Studies in Computational Intelligence, 605, 333–363 (2016). doi:10.1007/978-3-319-18781-5_17
[62] Stefanowski, J., Wilk, S.: Selective pre-processing of imbalanced data for improving classification performance. In Song, I.-Y., Eder, J., Nguyen, T.M. (eds) Data Warehousing and Knowledge Discovery, LNCS, 5182, 283–292 (2008). doi:10.1007/978-3-540-85836-2_27
[63] Tomek, I.: Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, 6, 769–772 (1976). doi:10.1109/TSMC.1976.4309452
[64] Turney, P.D.: Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL-2002) (2002). doi:10.3115/1073083.1073153
[65] Wallace, B.C., Small, K., Brodley, C.E., Trikalinos, T.A.: Class Imbalance, Redux. In Proc. of IEEE 11th Int. Conference on Data Mining, 754–763 (2011). doi:10.1109/ICDM.2011.33
[66] Wang, S., Yao, X.: Multiclass imbalance problems: analysis and potential solutions. IEEE Trans. System Man Cybern., Part B, 42 (4), 1119–1130 (2012). doi:10.1109/TSMCB.2012.2187280
[67] Wang, S., Yao, X.: Diversity analysis on imbalanced data sets by using ensemble models. IEEE Symp. Comput. Intell. Data Mining, 324–331 (2009). doi:10.1109/CIDM.2009.4938667
[68] Wang, H., Lu, Y., Zhai, C.: Latent aspect rating analysis on review text data: a rating regression approach. In Proc. of the 16th ACM SIGKDD Int. Conference on Knowledge Discovery and Data Mining, 783–792 (2010). doi:10.1145/1835804.1835903
[69] Wiebe, J., Wilson, T., Cardie, C.: Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39 (2-3), 165–210 (2005). doi:10.1007/s10579-005-7880-9
[70] Wilson, D.: Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics, 2 (3), 408–421 (1972). doi:10.1109/TSMC.1972.4309137
[71] Wilson D.R., Martinez T.R.: Improved heterogeneous distance functions. J. Artificial Intelligence Research, 6, 1–34 (1997). doi:10.1613/jair.346
[72] Wojciechowski, S., Wilk, S., Stefanowski, J.: An algorithm for selective preprocessing of multi-class imbalanced data. In Proc. of Int. Conference on Computer Recognition Systems, CORES 2017, 238–247 (2017). doi:10.1007/978-3-319-59162-9_25
[73] Wojciechowski, S., Wilk, S.: Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data. Foundations of Computing and Decision Sciences, 42 (2), 149–176 (2017). doi:10.1515/fcds-2017-0007
[74] Xu, R., Chen, T., Xia, Y., Lu, Q., Liu, B., Wang, X.: Word Embedding Composition for Data Imbalances in Sentiment and Emotion Classification. Cognitive Computation, 7, 226 (2015). doi:10.1007/s12559-015-9319-y
[75] Zhou, Z. H., Liu, X.Y.: On multi-class cost sensitive learning. Computational Intelligence, 26 (3), 232–257 (2010). doi:10.1111/j.1467-8640.2010.00358.x