Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering

Vahidnia, Sahand; Abbasi, Alireza; Abbass, Hussein A.

doi:10.2478/jdis-2021-0024

References

Arora, S., Liang, Y.Y., & Ma, T.Y. (2017). A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the International Conference on Learning Representations, Toulon, France, April 24–26, 2017.
Astrakhantsev, N. (2015). Methods and software for terminology extraction from domain-specific text collection (Unpublished doctoral dissertation). Institute for System Programming of Russian Academy of Sciences.
Awan, M.N., & Beg, M.O. (2020). Top-rank: A topicalpostionrank for extraction and classification of keyphrases in text. Computer Speech & Language, 65, 101116.
Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.
Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.
Cagliero, L., & La Quatra, M. (2020). Extracting highlights of scientific articles: A supervised summarization approach. Expert Systems with Applications, 160, 113659.
Curiskis, S.A., Drake, B., Osborn, T.R., & Kennedy, P.J. (2020). An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit. Information Processing & Management, 57(2), 102034.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD (Vol. 96, pp. 226–231).
Harris, Z.S. (1954). Distributional structure. Word, 10(2–3), 146–162.
Hou, J.H., Yang, X.C., & Chen, C.M. (2018). Emerging trends and new developments in information science: A document co-citation analysis (2009–2016). Scientometrics, 115(2), 869–892.
Jelodar, H., Wang, Y.L., Yuan, C., Feng, X., Jiang, X.H., Li, Y.C., & Zhao, L. (2019). Latent Dirichlet allocation (LDA) and topic modeling: Models, applications, a survey. Multimedia Tools and Applications, 78(11), 15169–15211.
Jones, K.S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28, 11–21.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
Kenter, T., Borisov, A., & De Rijke, M. (2016). Siamese CBOW: Optimizing word embeddings for sentence representations. arXiv preprint arXiv:1606.04640.
Kim, J., Yoon, J., Park, E., & Choi, S. (2020). Patent document clustering with deep embeddings. Scientometrics, 1–15.
Krenn, M., & Zeilinger, A. (2020). Predicting research trends with semantic and neural networks with an application in quantum physics. Proceedings of the National Academy of Sciences, 117(4), 1910–1916.
Kuhn, T., Perc, M., & Helbing, D. (2014). Inheritance patterns in citation networks reveal scientific memes. Physical Review X, 4(4), 041036.
Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning (pp. 1188–1196).
Li, J.Z., Fan, Q.N., Zhang, K., et al. (2007). Keyword extraction based on TF/IDF for Chinese news document. Wuhan University Journal of Natural Sciences, 12(5), 917–921.
Liu, H.W., Kou, H.Z., Yan, C., & Qi, L.Y. (2019). Link prediction in paper citation network to construct paper correlation graph. EURASIP Journal on Wireless Communications and Networking, 2019(1), 1–12.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111–3119).
Miller, G.A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41.
Peganova, I., Rebrova, A., & Nedumov, Y. (2019). Labelling hierarchical clusters of scientific articles. In 2019 Ivannikov Memorial Workshop (IVMEM) (pp. 26–32).
Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
Radu, R.-G., Rădulescu, I.-M., Truică, C.-O., Apostol, E.-S., & Mocanu, M. (2020). Clustering documents using the document to vector model for dimensionality reduction. In 2020 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR) (pp. 1–6).
Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text mining: Applications and theory, 1, 1–20.
Rousseeuw, P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
Steinley, D. (2004). Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9(3), 386.
Vahidnia, S., Abbasi, A., & Abbass, H.A. (2020). Document clustering and labeling for research trend extraction and evolution mapping. In C. Zhang, P. Mayr, W. Lu, & Y. Zhang (Eds.), Proceedings of the 1st Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents co-located with the ACM/IEEE Joint Conference on Digital Libraries in 2020, EEKE@JCDL 2020, Virtual Event, China, August 1st, 2020 (Vol. 2658, pp. 54–62). Retrieved from http://ceur-ws.org/Vol-2658/paper7.pdf
Ward Jr, J.H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), 236–244.
Weber, T., Kranzlmüller, D., Fromm, M., & Tavares de Sousa, N. (2020). Using supervised learning to classify metadata of research data by field of study. Quantitative Science Studies, 1–26.
Xie, J., Girshick, R., & Farhadi, A. (2016). Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning (pp. 478–487).
Xu, H.Y., Winnink, J., Yue, Z.H., Liu, Z.Q., & Yuan, G.T. (2020). Topic-linked innovation paths in science and technology. Journal of Informetrics, 14(2), 101014.
Xu, S., Hao, L.Y., An, X., Yang, G.C., & Wang, F.F. (2019). Emerging research topics detection with multiple machine learning models. Journal of Informetrics, 13(4), 100983.
Xu, S., Zhai, D.S., Wang, F.F., An, X., Pang, H.S., & Sun, Y.R. (2019). A novel method for topic linkages between scientific publications and patents. Journal of the Association for Information Science and Technology, 70(9), 1026–1042.
Zeng, A., Shen, Z.S., Zhou, J.L., Wu, J.S., Fan, Y., Wang, Y.G., & Stanley, H.E. (2017). The science of science: From the perspective of complex systems. Physics Reports, 714–715, 1–73. Retrieved from https://doi.org/10.1016/j.physrep.2017.10.001 doi: 10.1016/j.physrep.2017.10.001
Zhang, Q.R., Li, Y., Liu, J.S., Chen, Y.D., & Chai, L.H. (2017). A dynamic co-word network-related approach on the evolution of China's urbanization research. Scientometrics, 111(3), 1623–1642. doi: 10.1007/s11192-017-2314-1
Zhang, Y., Chen, H.S., Lu, J., & Zhang, G.Q. (2017). Detecting and predicting the topic change of knowledge-based systems: A topic-based bibliometric analysis from 1991 to 2016. Knowledge-Based Systems, 133, 255–268. Retrieved from http://dx.doi.org/10.1016/j.knosys.2017.07.011 doi: 10.1016/j.knosys.2017.07.011
Zhang, Y., Lu, J., Liu, F., Liu, Q., Porter, A., Chen, H.S., & Zhang, G.Q. (2018). Does deep learning help topic extraction? A kernel k-means clustering method with word embedding. Journal of Informetrics, 12(4), 1099–1117.
Zhang, Y., Zhang, G.Q., Zhu, D.H., & Lu, J. (2017). Scientific evolutionary pathways: Identifying and visualizing relationships for scientific topics. Journal of the Association for Information Science and Technology, 68(8), 1925–1939. Retrieved from http://doi.wiley.com/10.1002/asi.23814 doi: 10.1002/asi.23814
Zhou, Y., Lin, H., Liu, Y.F., & Ding, W. (2019). A novel method to identify emerging technologies using a semi-supervised topic clustering model: A case of 3D printing industry. Scientometrics, 120(1), 167–185.
