A Comparison of Topic Modeling Approaches Using Networked Discussion Forum Posts From the City-data.com Corpus

By: Ryan M. Omizo  
Open Access | Feb 2024

References

  1. Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms (pp. 1027-1035).
  2. Advameg, Inc. (2011). Philadelphia 2035 (Houston: Foreclosure, neighborhoods, wage)—Pennsylvania (PA)—City-Data Forum. City-Data.Com. https://www.city-data.com/forum/philadelphia/1304227-philadelphia-2035-a.html
  3. Advameg, Inc. (2012a). Official Philadelphia Metro Crime Thread (York, Chester: Apartment complexes, houses, unemployment)—Pennsylvania (PA)—Page 10—City-Data Forum [Forum]. City-Data.Com. http://www.city-data.com/forum/philadelphia/1470248-official-philadelphia-metro-crime-thread-10.html
  4. Advameg, Inc. (2012b). Retail coming to Philadelphia (Economy, Penn: 2013, tenant, shop)—Pennsylvania (PA)—Page 3—City-Data Forum [Forum]. City-Data.Com. https://www.city-data.com/forum/philadelphia/1740992-retail-coming-philadelphia-3.html
  5. Advameg, Inc. (2013). Official Greater Philadelphia Area Crime Thread (York, Mars: Leasing, condominium, place to live)—Pennsylvania (PA)—Page 267—City-Data Forum [Forum]. City-Data.Com. https://www.city-data.com/forum/philadelphia/1839911-official-greater-philadelphia-area-crime-thread-267.html
  6. Advameg, Inc. (2020). How’s everyone doing amongst the Coronavirus shut down? (Philadelphia, York: Restaurants, bus)—Pennsylvania (PA)—Page 37—City-Data Forum [Forum]. City-Data.Com. https://www.city-data.com/forum/philadelphia/3137059-hows-everyone-doing-amongst-coronavirus-shut-37.html
  7. Advameg, Inc. (n.d.a). City-Data.Com—Stats about all US cities—Real estate, relocation info, crime, house prices, cost of living, races, home value estimator, recent sales, income, photos, schools, maps, weather, neighborhoods, and more. Retrieved January 26, 2024, from https://www.city-data.com/
  8. Advameg, Inc. (n.d.b). City-data.com Forum: Relocation, Moving, General and Local City Discussions. Retrieved January 26, 2024, from https://www.city-data.com/forum/
  9. Advameg, Inc. (n.d.c). Terms of Service—City-Data Forum. Retrieved October 22, 2023, from https://www.city-data.com/forumtos.html
  10. Aharoni, R., & Goldberg, Y. (2020). Unsupervised Domain Clusters in Pretrained Language Models (arXiv:2004.02105), Cornell University, arXiv. http://arxiv.org/abs/2004.02105. DOI: 10.18653/v1/2020.acl-main.692
  11. Angelov, D. (2020). Top2Vec: Distributed Representations of Topics.
  12. Bhatia, S., Lau, J. H., & Baldwin, T. (2016). Automatic Labeling of Topics with Neural Embeddings. DOI: 10.48550/arXiv.1612.05340
  13. Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2020a). Cross-lingual Contextualized Topic Models with Zero-shot Learning (arXiv:2004.07737). arXiv. http://arxiv.org/abs/2004.07737. DOI: 10.18653/v1/2021.eacl-main.143
  14. Bianchi, F., Terragni, S., & Hovy, D. (2020b). Pre-training is a hot topic: Contextualized document embeddings improve topic coherence. arXiv preprint arXiv:2004.03974. DOI: 10.18653/v1/2021.acl-short.96
  15. Blei, D. M., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
  16. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching Word Vectors with Subword Information. arXiv Preprint arXiv:1607.04606. DOI: 10.1162/tacl_a_00051
  17. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American society for information science, 41(6), 391-407. https://www.cs.csustan.edu/~mmartin/LDS/Deerwester-et-al.pdf. DOI: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
  18. Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic Modeling in Embedding Spaces. Transactions of the Association for Computational Linguistics, 8, 439-453. DOI: 10.1162/tacl_a_00325
  19. Duan, Z., Xu, Y., Chen, B., Wang, D., Wang, C., & Zhou, M. (2021). TopicNet: Semantic Graph-Guided Topic Discovery (arXiv:2110.14286). arXiv. DOI: 10.48550/arXiv.2110.14286
  20. El-Assady, M., Kehlbeck, R., Collins, C., Keim, D., & Deussen, O. (2019). Semantic Concept Spaces: Guided Topic Model Refinement using Word-Embedding Projections (arXiv:1908.00475). arXiv. http://arxiv.org/abs/1908.00475. DOI: 10.1109/TVCG.2019.2934654
  21. Gerlach, M., Peixoto, T. P., & Altmann, E. G. (2018). A network approach to topic models. Science Advances, 4(7), eaaq1360. DOI: 10.1126/sciadv.aaq1360
  22. Gourru, A., Velcin, J., Roche, M., Gravier, C., & Poncelet, P. (2018). United We Stand: Using Multiple Strategies for Topic Labeling. In M. Silberztein, F. Atigui, E. Kornyshova, E. Métais, & F. Meziane (Eds.), Natural Language Processing and Information Systems (Vol. 10859, pp. 352-363). Springer International Publishing. DOI: 10.1007/978-3-319-91947-8_37
  23. Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv Preprint arXiv:2203.05794. DOI: 10.48550/arXiv.2203.05794
  24. Hinneburg, A., Rosner, F., Pessler, S., & Oberländer, C. (2014, November). Exploring document collections with topic frames. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (pp. 2084-2086). DOI: 10.1145/2661829.2661857
  25. Hoffman, M., Bach, F., & Blei, D. (2010). Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 23. URL: https://papers.nips.cc/paper_files/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf
  26. Jagarlamudi, J., Daumé III, H., & Udupa, R. (2012). Incorporating Lexical Priors into Topic Models. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 204-213). Association for Computational Linguistics. URL: https://aclanthology.org/E12-1021
  27. Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., & Mikolov, T. (2016a). FastText.zip: Compressing text classification models. arXiv Preprint arXiv:1612.03651. DOI: 10.48550/arXiv.1612.03651
  28. Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016b). Bag of Tricks for Efficient Text Classification. arXiv Preprint arXiv:1607.01759. DOI: 10.48550/arXiv.1607.01759; 10.18653/v1/E17-2068
  29. Li, C., Chen, S., Xing, J., Sun, A., & Ma, Z. (2018). Seed-Guided Topic Model for Document Filtering and Classification. ACM Transactions on Information Systems, 37(1), 9:1-9:37. DOI: 10.1145/3238250
  30. Limwattana, S., & Prom-on, S. (2021). Topic Modeling Enhancement using Word Embeddings. 18th International Joint Conference on Computer Science and Software Engineering (JCSSE), 1-5. DOI: 10.1109/JCSSE53117.2021.9493816
  31. McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical density based clustering. J. Open Source Softw., 2(11), 205. DOI: 10.21105/joss.00205
  32. Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing Semantic Coherence in Topic Models. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272). Association for Computational Linguistics. URL: https://aclanthology.org/D11-1024
  33. Newman, M. E. (2009). The first-mover advantage in scientific publication. Europhysics Letters, 86(6), 68001. DOI: 10.1209/0295-5075/86/68001
  34. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830. DOI: 10.48550/arXiv.1201.0490
  35. Philadelphia City Planning Commission. (2023). About | Philadelphia2035. https://www.phila2035.org/
  36. Popa, C., & Rebedea, T. (2021). BART-TL: Weakly-Supervised Topic Label Generation. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 1418-1425. DOI: 10.18653/v1/2021.eacl-main.121
  37. Reddit Inc. (2023). Homepage—Reddit. https://www.redditinc.com/
  38. Řehůřek, R., & Sojka, P. (2011). Gensim—statistical semantics in python. Retrieved from gensim.org. URL: https://www.fi.muni.cz/usr/sojka/posters/rehurek-sojka-scipy2011.pdf
  39. Richardson, L. (2007). Beautiful soup documentation.
  40. Ridolfo, J., & Hart-Davidson, W. (Eds.). (2015). Rhetoric and the digital humanities. University of Chicago Press. DOI: 10.7208/chicago/9780226176727.001.0001
  41. Röder, M., Both, A., & Hinneburg, A. (2015a). Exploring the Space of Topic Coherence Measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 399-408. DOI: 10.1145/2684822.2685324
  42. Röder, M., Both, A., & Hinneburg, A. (2015b). Exploring the Space of Topic Coherence Measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 399-408. DOI: 10.1145/2684822.2685324
  43. Sobkowicz, P., & Sobkowicz, A. (2010). Dynamics of hate based networks. The European Physical Journal B, 73(4), 633-643. DOI: 10.1140/epjb/e2010-00039-0
  44. Stevens, K., Kegelmeyer, P., Andrzejewski, D., & Buttler, D. (2012). Exploring Topic Coherence over Many Models and Many Topics. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics. URL: https://aclanthology.org/D12-1087
  45. Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, & Walter Kintsch (Eds.), Handbook of latent semantic analysis, 427(7), (pp. 424-440).
  46. Terragni, S. (2023). A collection of Topic Diversity measures for topic modeling. [Python]. https://github.com/silviatti/topic-model-diversity (Original work published 2020).
  47. Terragni, S., Fersini, E., & Messina, E. (2021, June). Word embedding-based topic similarity measures. In International Conference on Applications of Natural Language to Information Systems (pp. 33-45). Cham: Springer International Publishing. DOI: 10.1007/978-3-030-80599-9_4
  48. Tran, N. K., Zerr, S., Bischoff, K., Niederée, C., & Krestel, R. (2013). Topic Cropping: Leveraging Latent Topics for the Analysis of Small Corpora. In T. Aalberg, C. Papatheodorou, M. Dobreva, G. Tsakonas, & C. J. Farrugia (Eds.), Research and Advanced Technology for Digital Libraries (pp. 297-308). Springer. DOI: 10.1007/978-3-642-40501-3_30
  49. Vayansky, I., & Kumar, S. A. (2020). A review of topic modeling methods. Information Systems, 94. DOI: 10.1016/j.is.2020.101582
  50. Yang, W., Boyd-Graber, J., & Resnik, P. (2016). A Discriminative Topic Model using Document Network Structure. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 686-696. DOI: 10.18653/v1/P16-1065
  51. Zhang, Z., Fang, M., Chen, L., & Namazi-Rad, M.-R. (2022). Is Neural Topic modeling Better than Clustering? An Empirical Study on Clustering with Contextual Embeddings for Topics (arXiv:2204.09874). arXiv. DOI: 10.48550/arXiv.2204.09874; 10.18653/v1/2022.naacl-main.285
DOI: https://doi.org/10.5334/johd.182 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 10, 2023
Accepted on: Jan 10, 2024
Published on: Feb 7, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Ryan M. Omizo, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.