A Case Study in Text Mining of Discussion Forum Posts: Classification with Bag of Words and Global Vectors

By: Paweł Cichosz  
Open Access | Jan 2019

DOI: https://doi.org/10.2478/amcs-2018-0060 | Journal eISSN: 2083-8492 | Journal ISSN: 1641-876X
Language: English
Page range: 787 - 801
Submitted on: Sep 22, 2017
Accepted on: Jun 2, 2018
Published on: Jan 11, 2019
Published by: University of Zielona Góra
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2019 Paweł Cichosz, published by University of Zielona Góra
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.