Aggarwal, C.C. and Zhai, C.-X. (Eds) (2012). Mining Text Data, Springer, New York.
Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S. and Vollgraf, R. (2019). FLAIR: An easy-to-use framework for state-of-the-art NLP, Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Stroudsburg, USA, pp. 54–59.
Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S. and Vollgraf, R. (2021). Flair: A Very Simple Framework for State-of-the-Art NLP, Version 0.10, https://github.com/flairNLP/flair.
Akbik, A., Blythe, D. and Vollgraf, R. (2018). Contextual string embeddings for sequence labeling, Proceedings of the 27th International Conference on Computational Linguistics, COLING-2018, Santa Fe, USA, pp. 1638–1649.
Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T. (2020). fastText: Library for Efficient Text Classification and Representation Learning, Version 0.9.2, https://fasttext.cc.
Borisov, V., Leemann, T., Seßler, K., Haug, J., Pawelczyk, M. and Kasneci, G. (2022). Deep neural networks and tabular data: A survey, arXiv: 2110.01889.
Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P. (2002). SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16: 321–357.
Cichosz, P. (2018). A case study in text mining of discussion forum posts: Classification with bag of words and global vectors, International Journal of Applied Mathematics and Computer Science 28(4): 787–801, DOI: 10.2478/amcs-2018-0060.
Cohen, A.M., Hersh, W.R., Peterson, K. and Yen, P.-Y. (2006). Reducing workload in systematic review preparation using automated citation classification, Journal of the American Medical Informatics Association 13(2): 206–219.
Dařena, F. and Žižka, J. (2017). Ensembles of classifiers for parallel categorization of large number of text documents expressing opinions, Journal of Applied Economic Sciences 12(1): 25–35.
Deb, S. and Chanda, A.K. (2022). Comparative analysis of contextual and context-free embeddings in disaster prediction from Twitter data, Machine Learning with Applications 7: 100253.
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT-2019, Minneapolis, USA, pp. 4171–4186.
Dumais, S.T., Platt, J.C., Heckerman, D. and Sahami, M. (1998). Inductive learning algorithms and representations for text categorization, Proceedings of the 7th International Conference on Information and Knowledge Management, CIKM-98, Bethesda, USA, pp. 148–155.
Forman, G. (2003). An extensive empirical study of feature selection measures for text classification, Journal of Machine Learning Research 3: 1289–1305.
Forman, G. and Scholz, M. (2010). Apples-to-apples in cross-validation studies: Pitfalls in classifier performance measurement, ACM SIGKDD Explorations Newsletter 12(1): 49–57.
García Adeva, J.J., Pikatza Atxa, J.M., Ubeda Carrillo, M. and Ansuategi Zengotitabengoa, E. (2014). Automatic text classification to support systematic reviews in medicine, Expert Systems with Applications 41(4): 1498–1508.
Hassan, S., Mihalcea, R. and Banea, C. (2007). Random-walk term weighting for improved text classification, Proceedings of the 1st IEEE International Conference on Semantic Computing, ICSC-2007, Irvine, USA, pp. 53–60.
Helaskar, M.N. and Sonawane, S.S. (2019). Text classification using word embeddings, Proceedings of the 5th International Conference on Computing, Communication, Control, and Automation, ICCUBEA-2019, New York, USA, pp. 1–4.
Ji, X., Ritter, A. and Yen, P.-Y. (2017). Using ontology-based semantic similarity to facilitate the article screening process for systematic reviews, Journal of Biomedical Informatics 69: 33–42.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features, Proceedings of the 10th European Conference on Machine Learning, ECML-98, Chemnitz, Germany, pp. 137–142.
Jonnalagadda, S. and Petitti, D. (2013). A new iterative method to reduce workload in systematic review process, International Journal of Computational Biology and Drug Design 6(1–2): 5–17.
Kaibi, I., Nfaoui, E.H. and Satori, H. (2019). A comparative evaluation of word embeddings techniques for Twitter sentiment analysis, Proceedings of the 2019 International Conference on Wireless Technologies, Embedded and Intelligent Systems, WITS-2019, Fez, Morocco, pp. 1–4.
Khabsa, M., Elmagarmid, A., Ilyas, I., Hammady, H. and Ouzzani, M. (2016). Learning to identify relevant studies for systematic reviews using random forest and external information, Machine Learning 102(3): 465–482.
Koprinska, I., Poon, J., Clark, J. and Chan, J. (2007). Learning to classify e-mail, Information Sciences: An International Journal 177(10): 2167–2187.
Koziarski, M. and Woźniak, M. (2017). CCR: A combined cleaning and resampling algorithm for imbalanced data classification, International Journal of Applied Mathematics and Computer Science 27(4): 727–736, DOI: 10.1515/amcs-2017-0050.
Le, Q.V. and Mikolov, T. (2014). Distributed representations of sentences and documents, Proceedings of the 31st International Conference on Machine Learning, ICML-2014, Beijing, China, pp. 1188–1196.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H. and Kang, J. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36(4): 1234–1240.
Lewis, D.D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval, Proceedings of the 10th European Conference on Machine Learning, ECML-98, Chemnitz, Germany, pp. 4–15.
Matwin, S., Kouznetsov, A., Inkpen, D., Frunza, O. and O’Blenis, P. (2010). A new algorithm for reducing the workload of experts in performing systematic reviews, Journal of the American Medical Informatics Association 17(4): 446–453.
McCallum, A. and Nigam, K. (1998). A comparison of event models for naive Bayes text classification, Proceedings of the AAAI/ICML-98 Workshop on Learning for Text Categorization, Madison, USA, pp. 41–48.
Menardi, G. and Torelli, N. (2014). Training and assessing classification rules with imbalanced data, Data Mining and Knowledge Discovery 28(1): 92–122.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12: 2825–2830.
Pennington, J., Socher, R. and Manning, C.D. (2014). GloVe: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP-2014, Doha, Qatar, pp. 1532–1543.
Platt, J.C. (1998). Fast training of support vector machines using sequential minimal optimization, in B. Schölkopf et al. (Eds), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, pp. 185–208.
Řehůřek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora, Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45–50.
Rios, G. and Zha, H. (2004). Exploring support vector machines and random forests for spam detection, Proceedings of the 1st Conference on Email and Anti-Spam, CEAS-2004, Mountain View, USA, pp. 284–292.
van den Bulk, L.M., Bouzembrak, Y., Gavai, A., Liu, N., van den Heuvel, L.J. and Marvin, H.J.P. (2022). Automatic classification of literature in systematic reviews on food safety using machine learning, Current Research in Food Science 5: 84–95.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention is all you need, Advances in Neural Information Processing Systems, NIPS-2017, Long Beach, USA, pp. 6000–6010.
Wang, C., Nulty, P. and Lillis, D. (2020). A comparative study on word embeddings in deep learning for text classification, Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval, NLPIR-2020, Seoul, Korea, pp. 37–46.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q. and Rush, A.M. (2020). Transformers: State-of-the-art natural language processing, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, (online), pp. 38–45.
Xue, D. and Li, F. (2015). Research of text categorization model based on random forests, Proceedings of the 2015 IEEE International Conference on Computational Intelligence and Communication Technology, CICT-2015, Ghaziabad, India, pp. 173–176.
Yang, Y. and Pedersen, J. (1997). A comparative study on feature selection in text categorization, Proceedings of the 14th International Conference on Machine Learning, ICML-97, Nashville, USA, pp. 412–420.
Yessenalina, A. and Cardie, C. (2011). Compositional matrix-space models for sentiment analysis, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP-2011, Edinburgh, UK, pp. 172–182.
Zymkowski, T., Szymański, J., Sobecki, A., Drozda, P., Szalapak, K., Komar-Komarowski, K. and Scherer, R. (2022). Short texts representations for legal domain classification, Proceedings of the 21st International Conference on Artificial Intelligence and Soft Computing, ICAISC-2022, Zakopane, Poland, pp. 105–114.