Abstract
Geocoding, the task of linking place names in text to geographic coordinates, is a cornerstone of spatial humanities research, yet many existing tools assume structured data or contemporary toponyms, or depend on commercial geocoding services that limit reuse. Humanities corpora, by contrast, are often unstructured, multilingual, and historically variable. This discussion paper presents a scalable, first-pass workflow that applies Wikidata-based geocoding directly to plain-text files by combining Stanford CoreNLP with Python-based Wikidata lookups. The paper provides complete shell and SLURM configurations for running the pipeline on both local machines and high-performance computing (HPC) clusters. It details the pipeline's design, explains its behavior on multilingual and ambiguous toponyms, and situates the workflow in relation to existing gazetteers such as Pleiades, the World Historical Gazetteer, and GeoNames. Limitations, including minimal disambiguation and uneven language coverage, are discussed openly to guide appropriate reuse. The workflow aims to lower the barrier to Wikidata-based geocoding in the humanities by offering a transparent, extensible, and HPC-ready approach to unstructured text.
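
To make the lookup step concrete, the sketch below shows one way a Python-based Wikidata coordinate lookup of the kind described above might work, using the public Wikidata API (wbsearchentities and wbgetentities) and coordinate property P625. The function name and the naive first-hit selection are illustrative assumptions, not the paper's actual implementation; the real pipeline would apply such lookups to toponyms extracted by Stanford CoreNLP.

    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"

    def geocode_toponym(name, language="en"):
        """Return (latitude, longitude) for a place name via Wikidata, or None."""
        # Step 1: search Wikidata for items matching the place name.
        search = requests.get(WIKIDATA_API, params={
            "action": "wbsearchentities",
            "search": name,
            "language": language,
            "type": "item",
            "format": "json",
        }).json()
        hits = search.get("search", [])
        if not hits:
            return None
        qid = hits[0]["id"]  # first-hit choice; minimal disambiguation, as noted above

        # Step 2: fetch the item's coordinate claim (property P625).
        entity = requests.get(WIKIDATA_API, params={
            "action": "wbgetentities",
            "ids": qid,
            "props": "claims",
            "format": "json",
        }).json()
        claims = entity["entities"][qid].get("claims", {})
        if "P625" not in claims:
            return None  # item exists but has no coordinate location
        value = claims["P625"][0]["mainsnak"]["datavalue"]["value"]
        return (value["latitude"], value["longitude"])

    # Ambiguous toponyms illustrate the first-pass behavior: only the top match is returned.
    print(geocode_toponym("Alexandria"))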
