An HPC-Ready, Wikidata-Based Workflow for Exploratory Geocoding of Unstructured Textual Corpora

By: Annie K. Lamar  
Open Access | Dec 2025

Abstract

Geocoding, the task of linking place names in text to geographic coordinates, is a cornerstone of spatial humanities research, yet many existing tools assume structured data, contemporary toponyms, or commercial geocoding services that limit reuse. Humanities corpora, by contrast, are often unstructured, multilingual, and historically variable. This discussion paper presents a scalable, first-pass workflow that applies Wikidata-based geocoding directly to plain-text files by combining Stanford CoreNLP with Python-based Wikidata lookups. The pipeline is distributed with complete shell and SLURM configurations for use on both local machines and high-performance computing (HPC) clusters. The paper details the pipeline’s design, explains its behavior on multilingual and ambiguous toponyms, and situates it in relation to existing gazetteers such as Pleiades, the World Historical Gazetteer, and GeoNames. Limitations, including minimal disambiguation and uneven language coverage, are discussed openly to guide appropriate reuse. The workflow aims to lower the barrier to Wikidata-based geocoding in the humanities by providing a transparent, extensible, and HPC-ready approach for working with unstructured text.
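To make the lookup step concrete, the following is a minimal sketch of what a Python-based Wikidata lookup for a single place name might involve. It is not the paper's code: the function names `build_search_params` and `extract_coordinates` are hypothetical, and the sample entity is invented for illustration. The sketch assumes the public Wikidata API endpoint, its `wbsearchentities` action, and the coordinate location property `P625`. No network request is made here; the functions only build the query parameters and parse an entity record that has already been fetched.

```python
from typing import Optional, Tuple

# Public Wikidata API endpoint and the "coordinate location" property.
WIKIDATA_API = "https://www.wikidata.org/w/api.php"
COORDINATE_PROPERTY = "P625"


def build_search_params(place_name: str, language: str = "en") -> dict:
    """Build query parameters for a wbsearchentities request (hypothetical helper)."""
    return {
        "action": "wbsearchentities",
        "search": place_name,
        "language": language,
        "type": "item",
        "format": "json",
    }


def extract_coordinates(entity: dict) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) from the first usable P625 claim, else None."""
    claims = entity.get("claims", {}).get(COORDINATE_PROPERTY, [])
    for claim in claims:
        # Claims without a concrete value carry no "datavalue" and are skipped.
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "latitude" in value and "longitude" in value:
            return (value["latitude"], value["longitude"])
    return None


# Invented entity record shaped like the "claims" section of a Wikidata item.
sample_entity = {
    "claims": {
        "P625": [
            {
                "mainsnak": {
                    "datavalue": {
                        "value": {"latitude": 37.98, "longitude": 23.73}
                    }
                }
            }
        ]
    }
}

params = build_search_params("Athens", language="en")
coords = extract_coordinates(sample_entity)
```

Taking the first search hit and its first coordinate claim mirrors the kind of minimal disambiguation the abstract lists as a limitation: a name such as "Athens" can match several items, so results from a sketch like this are best treated as exploratory.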

DOI: https://doi.org/10.5334/johd.401 | Journal eISSN: 2059-481X
Language: English
Submitted on: Oct 3, 2025 | Accepted on: Nov 22, 2025 | Published on: Dec 23, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Annie K. Lamar, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.