Tackling the Problem of Class Imbalance in Multi-class Sentiment Classification: An Experimental Study

By: Mateusz Lango  
Open Access | Jun 2019

DOI: https://doi.org/10.2478/fcds-2019-0009 | Journal eISSN: 2300-3405 | Journal ISSN: 0867-6356
Language: English
Page range: 151 - 178
Submitted on: Jan 18, 2019
Accepted on: Feb 24, 2019
Published on: Jun 6, 2019
Published by: Poznan University of Technology
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2019 Mateusz Lango, published by Poznan University of Technology
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.