Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering

Open Access | Jun 2021

References

  1. Arora, S., Liang, Y.Y., & Ma, T.Y. (2017). A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the International Conference on Learning Representations, Toulon, France, April 24–26, 2017.
  2. Astrakhantsev, N. (2015). Methods and software for terminology extraction from domain-specific text collection (Unpublished doctoral dissertation). Institute for System Programming of the Russian Academy of Sciences.
  3. Awan, M.N., & Beg, M.O. (2020). Top-rank: A topicalpostionrank for extraction and classification of keyphrases in text. Computer Speech & Language, 65, 101116.
  4. Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.
  5. Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.
  6. Cagliero, L., & La Quatra, M. (2020). Extracting highlights of scientific articles: A supervised summarization approach. Expert Systems with Applications, 160, 113659.
  7. Curiskis, S.A., Drake, B., Osborn, T.R., & Kennedy, P.J. (2020). An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit. Information Processing & Management, 57(2), 102034.
  8. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
  9. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  10. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD (Vol. 96, pp. 226–231).
  11. Harris, Z.S. (1954). Distributional structure. Word, 10(2–3), 146–162.
  12. Hou, J.H., Yang, X.C., & Chen, C.M. (2018). Emerging trends and new developments in information science: A document co-citation analysis (2009–2016). Scientometrics, 115(2), 869–892.
  13. Jelodar, H., Wang, Y.L., Yuan, C., Feng, X., Jiang, X.H., Li, Y.C., & Zhao, L. (2019). Latent Dirichlet allocation (LDA) and topic modeling: Models, applications, a survey. Multimedia Tools and Applications, 78(11), 15169–15211.
  14. Jones, K.S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28, 11–21.
  15. Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
  16. Kenter, T., Borisov, A., & De Rijke, M. (2016). Siamese CBOW: Optimizing word embeddings for sentence representations. arXiv preprint arXiv:1606.04640.
  17. Kim, J., Yoon, J., Park, E., & Choi, S. (2020). Patent document clustering with deep embeddings. Scientometrics, 1–15.
  18. Krenn, M., & Zeilinger, A. (2020). Predicting research trends with semantic and neural networks with an application in quantum physics. Proceedings of the National Academy of Sciences, 117(4), 1910–1916.
  19. Kuhn, T., Perc, M., & Helbing, D. (2014). Inheritance patterns in citation networks reveal scientific memes. Physical Review X, 4(4), 041036.
  20. Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning (pp. 1188–1196).
  21. Li, J.Z., Fan, Q.N., Zhang, K., et al. (2007). Keyword extraction based on TF/IDF for Chinese news document. Wuhan University Journal of Natural Sciences, 12(5), 917–921.
  22. Liu, H.W., Kou, H.Z., Yan, C., & Qi, L.Y. (2019). Link prediction in paper citation network to construct paper correlation graph. EURASIP Journal on Wireless Communications and Networking, 2019(1), 1–12.
  23. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111–3119).
  24. Miller, G.A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41.
  25. Peganova, I., Rebrova, A., & Nedumov, Y. (2019). Labelling hierarchical clusters of scientific articles. In 2019 Ivannikov Memorial Workshop (IVMEM) (pp. 26–32).
  26. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  27. Radu, R.-G., Rădulescu, I.-M., Truică, C.-O., Apostol, E.-S., & Mocanu, M. (2020). Clustering documents using the document to vector model for dimensionality reduction. In 2020 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR) (pp. 1–6).
  28. Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text mining: Applications and theory, 1, 1–20.
  29. Rousseeuw, P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
  30. Steinley, D. (2004). Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9(3), 386.
  31. Vahidnia, S., Abbasi, A., & Abbass, H.A. (2020). Document clustering and labeling for research trend extraction and evolution mapping. In C. Zhang, P. Mayr, W. Lu, & Y. Zhang (Eds.), Proceedings of the 1st Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents co-located with the ACM/IEEE Joint Conference on Digital Libraries in 2020, EEKE@JCDL 2020, Virtual Event, China, August 1st, 2020 (Vol. 2658, pp. 54–62). Retrieved from http://ceur-ws.org/Vol-2658/paper7.pdf
  32. Ward Jr, J.H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), 236–244.
  33. Weber, T., Kranzlmüller, D., Fromm, M., & Tavares de Sousa, N. (2020). Using supervised learning to classify metadata of research data by field of study. Quantitative Science Studies, 1–26.
  34. Xie, J., Girshick, R., & Farhadi, A. (2016). Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning (pp. 478–487).
  35. Xu, H.Y., Winnink, J., Yue, Z.H., Liu, Z.Q., & Yuan, G.T. (2020). Topic-linked innovation paths in science and technology. Journal of Informetrics, 14(2), 101014.
  36. Xu, S., Hao, L.Y., An, X., Yang, G.C., & Wang, F.F. (2019). Emerging research topics detection with multiple machine learning models. Journal of Informetrics, 13(4), 100983.
  37. Xu, S., Zhai, D.S., Wang, F.F., An, X., Pang, H.S., & Sun, Y.R. (2019). A novel method for topic linkages between scientific publications and patents. Journal of the Association for Information Science and Technology, 70(9), 1026–1042.
  38. Zeng, A., Shen, Z.S., Zhou, J.L., Wu, J.S., Fan, Y., Wang, Y.G., & Stanley, H.E. (2017). The science of science: From the perspective of complex systems. Physics Reports, 714–715, 1–73. doi: 10.1016/j.physrep.2017.10.001
  39. Zhang, Q.R., Li, Y., Liu, J.S., Chen, Y.D., & Chai, L.H. (2017). A dynamic co-word network-related approach on the evolution of China's urbanization research. Scientometrics, 111(3), 1623–1642. doi: 10.1007/s11192-017-2314-1
  40. Zhang, Y., Chen, H.S., Lu, J., & Zhang, G.Q. (2017). Detecting and predicting the topic change of knowledge-based systems: A topic-based bibliometric analysis from 1991 to 2016. Knowledge-Based Systems, 133, 255–268. doi: 10.1016/j.knosys.2017.07.011
  41. Zhang, Y., Lu, J., Liu, F., Liu, Q., Porter, A., Chen, H.S., & Zhang, G.Q. (2018). Does deep learning help topic extraction? A kernel k-means clustering method with word embedding. Journal of Informetrics, 12(4), 1099–1117.
  42. Zhang, Y., Zhang, G.Q., Zhu, D.H., & Lu, J. (2017). Scientific evolutionary pathways: Identifying and visualizing relationships for scientific topics. Journal of the Association for Information Science and Technology, 68(8), 1925–1939. doi: 10.1002/asi.23814
  43. Zhou, Y., Lin, H., Liu, Y.F., & Ding, W. (2019). A novel method to identify emerging technologies using a semi-supervised topic clustering model: A case of 3D printing industry. Scientometrics, 120(1), 167–185.
DOI: https://doi.org/10.2478/jdis-2021-0024 | Journal eISSN: 2543-683X | Journal ISSN: 2096-157X
Language: English
Page range: 99 - 122
Submitted on: Nov 30, 2020
Accepted on: Apr 26, 2021
Published on: Jun 18, 2021
Published by: Chinese Academy of Sciences, National Science Library
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2021 Sahand Vahidnia, Alireza Abbasi, Hussein A. Abbass, published by Chinese Academy of Sciences, National Science Library
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.