An HPC-Ready, Wikidata-Based Workflow for Exploratory Geocoding of Unstructured Textual Corpora

Annie K. Lamar

doi:10.5334/johd.401

Abstract

Geocoding, the task of linking place names in text to geographic coordinates, is a cornerstone of spatial humanities research, yet many existing tools assume structured data, contemporary toponyms, or commercial geocoding services that limit reuse. Humanities corpora, by contrast, are often unstructured, multilingual, and historically variable. This discussion paper presents a scalable, first-pass workflow that applies Wikidata-based geocoding directly to plain-text files by combining Stanford CoreNLP with Python-based Wikidata lookups. The pipeline includes complete shell and SLURM configurations for use on both local machines and high-performance computing (HPC) clusters. The paper details the pipeline’s design, explains its behavior on multilingual and ambiguous toponyms, and situates it in relation to existing gazetteers such as Pleiades, the World Historical Gazetteer, and GeoNames. Limitations, including minimal disambiguation and uneven language coverage, are discussed openly to guide appropriate reuse. The workflow aims to lower the barrier to Wikidata-based geocoding in the humanities by providing a transparent, extensible, and HPC-ready approach for working with unstructured text.
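To make the lookup step concrete, the following is a minimal sketch of how a place name that has already been extracted from a text (for example, by a named-entity recognizer such as CoreNLP) might be resolved to coordinates through the public Wikidata API. It is not the paper's own code: the function names (`search_url`, `entity_url`, `extract_coordinates`, `geocode`) and the User-Agent string are hypothetical, and plain `urllib` is assumed in place of whatever client library the pipeline actually uses. Like the workflow described above, the sketch performs no real disambiguation; it simply takes the first search hit.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata API endpoint.
API = "https://www.wikidata.org/w/api.php"


def search_url(name, language="en"):
    """Build a wbsearchentities URL that looks up an item by label."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": 1,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)


def entity_url(qid):
    """Build a wbgetentities URL that fetches the claims of one item."""
    params = {"action": "wbgetentities", "ids": qid, "props": "claims", "format": "json"}
    return API + "?" + urllib.parse.urlencode(params)


def extract_coordinates(entity):
    """Return (latitude, longitude) from the first valued P625 claim, or None."""
    for claim in entity.get("claims", {}).get("P625", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "novalue" / "somevalue" statements
        value = snak["datavalue"]["value"]
        return value["latitude"], value["longitude"]
    return None


def fetch_json(url):
    """GET a URL and decode the JSON body (hypothetical User-Agent string)."""
    request = urllib.request.Request(url, headers={"User-Agent": "geocode-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def geocode(name, language="en"):
    """Resolve a place name to (QID, latitude, longitude) using the first hit only."""
    hits = fetch_json(search_url(name, language)).get("search", [])
    if not hits:
        return None
    qid = hits[0]["id"]
    entity = fetch_json(entity_url(qid))["entities"][qid]
    coords = extract_coordinates(entity)
    return (qid, *coords) if coords else None
```

Calling `geocode("Athens")` requires network access and returns whichever item Wikidata ranks first for that label, which is exactly the behavior that makes a first-pass approach fast but ambiguous for toponyms shared by several places.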

References

Athens. (n.d.). Retrieved September 23, 2025, from https://www.wikidata.org/wiki/Q1524
Bagnall, R. (Ed.). (2016). Pleiades: A Gazetteer of Past Places. Retrieved September 30, 2025, from https://pleiades.stoa.org
Bai, X., Jiao, X., Sakai, T., & Xu, H. (2024). Mapping the past with historical geographic information systems: Layered characteristics of the historic urban landscape of Nanjing, China, since the Ming Dynasty (1368–2024). Heritage Science, 12(1), 283. https://doi.org/10.1186/s40494-024-01400-4
Bamman, D., & Smith, N. A. (2014). Unsupervised Discovery of Biographical Structure from Text. Transactions of the Association for Computational Linguistics, 2, 363–376. https://doi.org/10.1162/tacl_a_00189
Bodenhamer, D. J., Corrigan, J., & Harris, T. M. (Eds.). (2010). The spatial humanities: GIS and the future of humanities scholarship. Indiana University Press. https://doi.org/10.2979/5864.0s
Bushell, S. (2020). Reading and mapping fiction: Spatialising the literary text. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108766876
Bushell, S., & Hutcheon, R. L. (2025). New approaches for digital literary mapping: Chronotopic cartography. Cambridge University Press. https://doi.org/10.1017/9781009353632
Devinney, H., Eklund, A., Ryazanov, I., & Cai, J. (2023). Developing a Multilingual Corpus of Wikipedia Biographies. In R. Mitkov & G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing (pp. 285–294). INCOMA Ltd., Shoumen, Bulgaria. https://aclanthology.org/2023.ranlp-1.32/
ESRI. (2024). ArcGIS World Geocoding [Computer software]. Retrieved December 8, 2025, from https://www.arcgis.com/home/item.html?id=305f2e55e67f4389bef269669fc2e284
Fischer, F., Börner, I., Göbel, M., Hechtl, A., Kittel, C., Milling, C., & Trilcke, P. (2019). Programmable Corpora: Introducing DraCor, an Infrastructure for the Research on European Drama. Proceedings of DH2019. Utrecht University. https://doi.org/10.5281/ZENODO.4284001
Getty Research Institute. (2017). Getty Thesaurus of Geographic Names Online (TGN) [Dataset]. Retrieved September 30, 2025, from https://www.getty.edu/research/tools/vocabularies/tgn
Google. (n.d.). Google Maps [Computer software]. Retrieved September 30, 2025, from https://maps.google.com
Gregory, I. N., & Geddes, A. (Eds.). (2014). Toward spatial humanities: Historical GIS and spatial history. Indiana University Press. https://doi.org/10.2979/6100.0
Hyvönen, E., & Rantala, H. (2021). Knowledge-based relational search in cultural heritage linked data. Digital Scholarship in the Humanities, 36(Supplement_2), ii155–ii164. https://doi.org/10.1093/llc/fqab042
Khatib, R. E., & Schaeben, M. (2020). Why Map Literature? Geospatial Prototyping for Literary Studies and Digital Humanities. Digital Studies/Le Champ Numérique, 10(1). https://doi.org/10.16995/dscn.381
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In K. Bontcheva & J. Zhu (Eds.), Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55–60). Baltimore, Maryland: Association for Computational Linguistics. https://doi.org/10.3115/v1/P14-5010
Murrieta-Flores, P., & Martins, B. (2019). The geospatial humanities: Past, present and future. International Journal of Geographical Information Science, 33(12), 2424–2429. https://doi.org/10.1080/13658816.2019.1645336
Page, B., & Ross, E. (2015). Envisioning the Urban Past: GIS Reconstruction of a Lost Denver District. Frontiers in Digital Humanities, 2. https://doi.org/10.3389/fdigh.2015.00003
Pattuelli, M. C., Weller, C., & Szablya, G. (2011, September). Linked Jazz: An Exploratory Pilot. International Conference on Dublin Core and Metadata Applications 2011 (pp. 158–164). The Hague, The Netherlands.
Pywikibot (Version 1.31). (2003). [Computer software]. Retrieved September 30, 2025. https://github.com/wikimedia/pywikibot
Ratinov, L., Roth, D., Downey, D., & Anderson, M. (2011). Local and Global Algorithms for Disambiguation to Wikipedia. In D. Lin, Y. Matsumoto, & R. Mihalcea (Eds.), Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 1375–1384). Portland, Oregon: Association for Computational Linguistics. https://aclanthology.org/P11-1138/
Sil, A., & Florian, R. (2016). One for All: Towards Language Independent Named Entity Linking. In K. Erk & N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2255–2264). Berlin: Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1213
Uhl, J. H., Leyk, S., Chiang, Y.-Y., & Knoblock, C. A. (2022). Towards the automated large-scale reconstruction of past road networks from historical maps. Computers, Environment and Urban Systems, 94, 101794. https://doi.org/10.1016/j.compenvurbsys.2022.101794
Wick, M. (2005). GeoNames [Dataset]. https://www.geonames.org/
World Historical Gazetteer. (2017). [Computer software]. Retrieved September 30, 2025, from https://whgazetteer.org/
