Introduction
Geocoding is the computational process of translating place names, or toponyms, into precise geographic coordinates. At its core, it involves recognizing location references in a text or dataset and linking them to latitude-longitude pairs drawn from a structured resource such as a gazetteer. This process makes it possible to treat textual or archival references as geographic data points.
Geocoding enables places mentioned in historical documents, literary works, or oral histories to be integrated into digital maps, queried in geographic information systems (GIS), and analyzed alongside other forms of spatial data.
As a result, geocoding has broadened the methodological repertoire of projects across literary studies, archaeology, history, and cultural heritage, providing the infrastructure for digital mapping, spatial network analysis, and other forms of geospatial scholarship. The following examples, while not exhaustive, illustrate the diversity of approaches enabled by geocoding. Bai et al. (2024) use historical geographic information systems to map centuries of change in Nanjing’s urban landscape, showing how the circulation of people, buildings, and land use can be tracked over time. Page and Ross (2015) reconstruct a lost district of Denver through digitized fire insurance maps, demonstrating how geocoding and GIS can recover historical landscapes that have otherwise vanished. In the realm of infrastructure history, Uhl et al. (2022) develop automated techniques to rebuild past road networks from archival maps, allowing scholars to study large-scale patterns of mobility and interaction. Literary studies, too, have turned to geocoding: Khatib and Schaeben (2020) prototype geospatial mappings of Milton’s Paradise Lost, revealing how narrative space itself can be analyzed computationally. These projects also contribute to a broader conversation in spatial humanities (Bodenhamer et al., 2010; Gregory & Geddes, 2014; Murrieta-Flores & Martins, 2019) and literary geography (Bushell, 2020; Bushell & Hutcheon, 2025) about how spatial methods enrich interpretation, foster interdisciplinary analysis, and allow for the identification of relationships that conventional close reading or historical narrative alone cannot capture.
Traditionally, geocoding pipelines rely on a gazetteer, a structured reference work that provides standardized information about geographic entities. At minimum, a gazetteer records place names linked to geographic coordinates, but most also include variant names, hierarchical relationships (e.g., city within province, province within country), and sometimes historical or linguistic metadata. Well-known examples include the Getty Thesaurus of Geographic Names (TGN), which catalogs over two million places with hierarchical classifications for use in art and cultural heritage contexts (Getty Research Institute, 2017); GeoNames, a freely available database that contains over twelve million modern toponyms (Wick, 2005); and domain-specific historical gazetteers such as Pleiades and the World Historical Gazetteer (Bagnall, 2016; World Historical Gazetteer, 2017), which offer excellent coverage for particular regions and periods. Commercial providers such as Google Maps or ESRI extend these resources with comprehensive coverage, but their terms of use often restrict data extraction, bulk queries, and redistribution (ESRI, 2024; Google, n.d.).
Alongside these established gazetteers, Wikidata has emerged as an increasingly central node in the Linked Open Data ecosystem. Several large-scale infrastructures (e.g. OpenStreetMap, Mapbox, and the reconciliation services of the World Historical Gazetteer) already incorporate Wikidata identifiers or labels in their geocoding pipelines. Smaller domain-specific projects similarly use Wikidata to enrich or reconcile historical place names. There are also standalone libraries, such as wikigeocode, that perform Wikidata lookups for location strings.
Wikidata stores structured information about millions of places, including multilingual labels, historical variants, and geographic coordinates. Wikidata can be queried programmatically and integrated into named entity recognition (NER) pipelines, making it possible to “wikify” textual data, that is, to link textual references to Wikidata entities. Unlike Wikipedia, which provides unstructured narrative descriptions, Wikidata is designed for machine readability and is well-suited to computational pipelines. With careful configuration, wikification allows humanities researchers to bypass costly proprietary gazetteers and instead leverage a collaborative, global resource for spatial analysis. This paper introduces a first-pass, or exploratory, geocoding pipeline that combines off-the-shelf NER tools with Wikidata lookups, implemented in Python and preconfigured for use on high-performance computing systems.
The contribution of this paper is not to introduce Wikidata as a geocoding resource, as several existing infrastructures already do this, but to provide a unified, humanities-focused workflow that integrates NER-based entity extraction with Wikidata reconciliation and coordinate retrieval. The pipeline consolidates these practices into a single, reproducible codebase that is transparent, openly licensed, and designed for corpora that lack structured metadata, including historical, multilingual, and fragmentary texts. Importantly, the workflow is also explicitly configured for high-performance computing environments. By providing shell and SLURM configurations, this workflow enables researchers to perform scalable Wikidata-based geocoding without needing to write custom code, lowering a major barrier to HPC use in the humanities.
Wikification and Wikidata-Based Entity Linking
Wikification is the process of linking terms in a text to structured entities in the Wikidata knowledge base. Instead of treating a place name, person, or concept as a simple string of characters, wikification associates it with a unique, machine-readable identifier (a Q-number) that carries multilingual labels, semantic relationships, and metadata. Each Wikidata entity includes formal statements, standardized identifiers, and links to external resources, all accessible through an open API. For geospatial research, this is particularly valuable: Wikidata records not only geographic coordinates but also variant names, historical labels, administrative hierarchies, and connections to other gazetteers such as GeoNames and the Getty Thesaurus of Geographic Names (TGN).
The entry for Athens (Q1524), for instance, contains precise latitude and longitude coordinates and labels in dozens of languages (“Αθήνα” in Greek, “Atenas” in Spanish, “雅典” in Chinese, etc.). The entry also stores pronunciations and audio files, nicknames, information about its official language, elevation, and population history, as well as references to Athens’s government leadership. The entry links to lists of people born in Athens, films shot there, and equivalences to external identifiers across encyclopedias and databases, and much more (Athens, n.d.). This combination of geographic, linguistic, and relational data illustrates how Wikidata provides a multidimensional representation of place that can support historical, multilingual, and cross-disciplinary research.
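Because this record is exposed through Wikidata’s public API, it can be retrieved in a few lines of Python. The snippet below is an illustrative sketch only, using the requests library and the wbgetentities endpoint rather than the pipeline described later, but it shows how coordinates and multilingual labels arrive in machine-readable form.

import requests

# Illustrative sketch: fetch the structured Wikidata record for Athens (Q1524)
# directly from the public API (not part of the pipeline described below).
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbgetentities", "ids": "Q1524",
            "props": "labels|claims", "languages": "en|el|es|zh",
            "format": "json"},
    timeout=30,
)
entity = resp.json()["entities"]["Q1524"]
print(entity["labels"]["el"]["value"])            # multilingual label: Αθήνα
coord = entity["claims"]["P625"][0]["mainsnak"]["datavalue"]["value"]
print(coord["latitude"], coord["longitude"])      # coordinate location (P625)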
Wikification is a long-practiced and widely used technique in the humanities, applied across numerous domains beyond geocoding. For example, linguistic research has employed wikification to disambiguate named entities in multilingual corpora (Ratinov et al., 2011; Sil & Florian, 2016), while cultural heritage projects have used it to align personal names and bibliographic data across archives (Devinney et al., 2023; Hyvönen & Rantala, 2021). The DraCor project has integrated Wikidata links into its digital corpus of European drama, connecting plays and characters to external authority records and enabling comparative network analysis across national traditions (Fischer et al., 2019). Projects like LinkedJazz similarly use Wikidata for name disambiguation in cultural datasets, demonstrating the value of structured identifiers for exploring networks of people, texts, and performances (Pattuelli et al., 2011). In literary and biographical analysis, Bamman and Smith (2014) propose an unsupervised, latent-variable model that detects biographical event classes (e.g., “Born”, “Graduates High School”, “Becomes Citizen”) from timestamped texts, which they use to perform large-scale analysis of biographical structure. The next section describes how this technique can be operationalized for geocoding through a reproducible workflow suited to humanities corpora.
Method: Wikification for Geocoding Pipeline
This project implements a reproducible pipeline for geocoding textual corpora by combining named entity recognition (NER) with Wikidata lookups. The key contribution of this paper is an openly available codebase that can be applied to any collection of plain-text (.txt) files, whether small or large, historical or modern. The workflow is lightweight enough to run on a personal computer but includes configuration for high-performance computing (HPC) environments, making it adaptable to projects of widely varying scale. The following paragraphs explain the geocoding pipeline conceptually; discuss its inputs, outputs, error handling, and disambiguation strategy; and briefly summarize the codebase files, which are described in more detail in the Codebase Description section below.
Geocoding Pipeline
The geocoding workflow is implemented as a two-stage process combining Stanford CoreNLP for named entity recognition (NER) and wikification with Pywikibot for coordinate extraction from Wikidata (Manning et al., 2014; Pywikibot, 2003). The shell script automates the setup and execution of this pipeline in a high-performance computing environment and the Python script processes the extracted entities and retrieves coordinates.
The shell script sets up the Stanford CoreNLP software and adds the extra language models needed for wikification. Once this environment is prepared, the script scans the input directory and creates a list of all the text files to be processed. CoreNLP is then run on these files, automatically marking words and phrases that it recognizes as named entities (such as people, places, or organizations) and, when possible, linking them to their corresponding Wikipedia entries. From these annotated files, the script extracts only the location entities that have Wikipedia identifiers and compiles them into a single list saved as entities.txt.
In the second stage, the geocode.py script takes entities.txt as input and queries Wikidata for geographic coordinates. For each location name, the script attempts to retrieve the corresponding English Wikipedia page, resolve its linked Wikidata item, and extract latitude and longitude values (property P625). All results are collected into a pandas DataFrame with four fields (location, latitude, longitude, and source_file) and are automatically saved as a CSV file with the suffix _geotagged.csv.
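The core of this second stage can be sketched with Pywikibot and pandas. The code below is a simplified illustration of the logic just described, not the exact contents of geocode.py: it assumes Pywikibot has been configured for English Wikipedia, records a single source file name for brevity, and defers the redirect and placeholder handling to the Disambiguation and Error Handling subsection below.

import pandas as pd
import pywikibot
from pywikibot.exceptions import NoPageError

site = pywikibot.Site("en", "wikipedia")

def lookup(name):
    """Return (lat, lon) for a place name via its Wikidata item, or None."""
    page = pywikibot.Page(site, name)
    if not page.exists():
        return None
    try:
        item = pywikibot.ItemPage.fromPage(page)   # linked Wikidata item
        item.get()
    except NoPageError:                            # page has no Wikidata item
        return None
    if "P625" not in item.claims:                  # P625 = coordinate location
        return None
    coord = item.claims["P625"][0].getTarget()     # a pywikibot Coordinate
    return coord.lat, coord.lon

rows = []
with open("entities.txt", encoding="utf-8") as fh:
    for name in (line.strip() for line in fh if line.strip()):
        latlon = lookup(name) or (-1000, -1000)    # placeholder; see error handling below
        rows.append({"location": name, "latitude": latlon[0],
                     "longitude": latlon[1], "source_file": "entities.txt"})

pd.DataFrame(rows).to_csv("entities_geotagged.csv", index=False)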
This workflow allows researchers to point the pipeline at any directory of plain-text (.txt) files, obtain a consolidated list of geotagged locations, and generate structured outputs that are immediately usable in GIS software or further computational analysis. The combination of shell and Python scripts ensures portability: the shell script handles large-scale batch processing on HPC systems, while the Python script alone can be run locally for smaller projects.
Input and Corpus Flexibility
The pipeline is designed to accept any set of plain-text (.txt) files as input. No specialized markup or metadata are required: the system tokenizes the files internally and applies NER directly. This design makes the workflow adaptable to corpora as varied as literary works, oral histories, archival transcriptions, or classroom assignments. Researchers can simply point the pipeline at a directory of .txt files and obtain a structured output of detected locations with geographic coordinates.
Computational Environment
The scripts are written in Python (tested with version 3.10) and depend on widely used libraries such as requests, json, and pywikibot. The workflow also requires Stanford CoreNLP, which can be run as a local server or integrated via the Python wrapper. To simplify setup, the provided shell script automatically installs CoreNLP and downloads all the models needed for wikification, so users do not have to perform any additional installation steps. For larger corpora, the included SLURM script enables deployment on HPC clusters, while the shell script provides a streamlined interface for local execution. This dual configuration ensures that the same codebase scales from exploratory testing on a laptop to batch processing of millions of words on HPC infrastructure.
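For readers who prefer to experiment interactively before using the shell script, the following sketch shows the server-based route mentioned above. It assumes a CoreNLP server is already listening on localhost:9000 (the default port in CoreNLP’s documentation) and, for simplicity, filters tokens by their NER tag rather than reproducing the shell script’s entitylink-based extraction.

import json
import requests

# Sketch: annotate one file against a locally running CoreNLP server.
props = {"annotators": "tokenize,ssplit,pos,lemma,ner,entitylink",
         "outputFormat": "json"}
with open("example.txt", encoding="utf-8") as fh:
    text = fh.read()

resp = requests.post("http://localhost:9000",
                     params={"properties": json.dumps(props)},
                     data=text.encode("utf-8"), timeout=120)
annotation = resp.json()

# Collect tokens tagged as locations; the full pipeline instead keeps entities
# that the entitylink annotator ties to Wikipedia pages.
location_tags = {"LOCATION", "CITY", "COUNTRY", "STATE_OR_PROVINCE"}
locations = sorted({tok["word"]
                    for sent in annotation["sentences"]
                    for tok in sent["tokens"]
                    if tok.get("ner") in location_tags})
print(locations)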
Disambiguation and Error Handling
The current version of the pipeline implements a basic error-handling workflow rather than a full disambiguation system. For each location name, the script attempts to retrieve the corresponding English Wikipedia page and extract its linked Wikidata item. If the page exists and contains geographic coordinates, those coordinates are returned. If no coordinates are present, the pipeline records a placeholder value (–1000, –1000). If no page exists under the given name, the script attempts to follow a redirect (for example, mapping “Peking” to “Beijing”) and then extract coordinates from the redirected page. If this also fails, a separate placeholder (–2000, –2000) is returned to indicate that no match was found. This approach ensures that redirects and missing data are explicitly logged, preventing silent failures.
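A minimal sketch of this fallback logic, written against Pywikibot and using the same placeholder conventions, might look as follows; this is an illustration of the behavior described above rather than a verbatim excerpt of geocode.py.

import pywikibot
from pywikibot.exceptions import NoPageError

site = pywikibot.Site("en", "wikipedia")

def coords_from_page(page):
    """Return (lat, lon) from a page's Wikidata item, or (-1000, -1000)."""
    try:
        item = pywikibot.ItemPage.fromPage(page)
        item.get()
    except NoPageError:                       # page has no linked Wikidata item
        return (-1000, -1000)
    statements = item.claims.get("P625")
    if not statements:                        # item exists but has no coordinates
        return (-1000, -1000)
    target = statements[0].getTarget()
    return (target.lat, target.lon)

def resolve(name):
    """Resolve a location string, following a redirect if one exists."""
    page = pywikibot.Page(site, name)
    if not page.exists():
        return (-2000, -2000)                 # no page found under this name
    if page.isRedirectPage():                 # e.g. "Peking" -> "Beijing"
        page = page.getRedirectTarget()
    return coords_from_page(page)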
The current workflow is intended for first-pass, exploratory geocoding of unstructured corpora and therefore implements only minimal disambiguation. Locations with multiple referents—for example, the dozens of global Springfields, or ancient versus modern Thebes—are matched to the first corresponding Wikipedia page. In Wikidata specifically, multiple Q-items often represent historically or conceptually distinct places (e.g., Classical Athens, the modern municipality, and the archaeological site). The pipeline does not attempt to distinguish among these and therefore should not be used as a definitive historical place-resolution system. Instead, it provides a reproducible baseline that can be extended with contextual heuristics, co-occurrence data, or filters derived from Pleiades, WHG, or other domain-specific gazetteers.
Working with Multilingual or Non-English Texts
Because Wikidata stores labels for places in dozens of languages, the pipeline can also process non-English and multilingual texts. When a place name appears in another language (such as “Aθήνα” in Greek, “Atenas” in Spanish, or “雅典” in Chinese), Wikidata recognizes it as an alias of the same underlying entity (Q1524, Athens). The code is able to take advantage of this multilingual metadata, so long as the place name is represented in Wikidata. In addition, the script’s handling of redirects (for example, from “Peking” to “Beijing”) ensures that older or variant spellings can still be linked to the correct entry.
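This alias coverage can be checked directly against Wikidata’s label index. The sketch below uses the public wbsearchentities endpoint to look up the Spanish form “Atenas”; it is a supplementary illustration of Wikidata’s multilingual labels, not the route the pipeline itself takes (which goes through English Wikipedia titles and redirects).

import requests

# Search Wikidata labels and aliases in Spanish for "Atenas".
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbsearchentities", "search": "Atenas",
            "language": "es", "type": "item", "format": "json"},
    timeout=30,
)
for hit in resp.json().get("search", []):
    # Each hit carries the Q-identifier plus a label and short description.
    print(hit["id"], hit.get("label"), "-", hit.get("description", ""))
# Q1524 (Athens) is expected among the top results and can then be passed to
# the coordinate-extraction step described earlier.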
By default, the pipeline relies on Stanford CoreNLP’s English-language models for named entity recognition, which means that entity detection itself will work best on English texts. However, once location names are identified—whether by CoreNLP or by another tool better suited to a given language—the Wikidata lookup can still resolve them correctly, provided the names exist in Wikidata’s multilingual label set. This makes the workflow adaptable to non-English corpora and highlights the potential of Wikidata as a bridge across languages and historical naming conventions.
Stanford CoreNLP provides pretrained named entity recognition (NER) models not only for English but also for a range of other languages, including Spanish, German, French, and Chinese. Each model is distributed with its own configuration file (a .properties file) that specifies which tokenizers, part-of-speech taggers, and NER classifiers should be used for that language. When running CoreNLP, researchers can load these models simply by passing the relevant properties file (e.g., -props StanfordCoreNLP-spanish.properties) or, if needed, by directly specifying the path to the language-specific model jars. This design makes it possible to process non-English texts by changing only a single configuration line.
I will note two important caveats. First, the entitylink annotator used in the English workflow is not multilingual; it is designed to link recognized entities to English Wikipedia pages. As a result, in non-English workflows CoreNLP will still identify location names but will not automatically provide Wikipedia identifiers. In practice, this means that non-English pipelines need to rely on the list of detected location strings, which can then be passed to the Python script for resolution against Wikidata. Second, the quality and coverage of the non-English models vary significantly. For widely used languages such as Spanish and Chinese, the NER models are reasonably robust, but for others (especially those with smaller training datasets) accuracy may be lower, and important toponyms may be missed. In such cases, researchers may consider substituting an alternative NER system better suited to the target language (e.g., Stanza or spaCy) and feeding its output into the Wikidata lookup stage of the pipeline, or running the workflow multiple times with different NER models and merging the detected entities before Wikidata reconciliation.
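As one example of such a substitution, the sketch below uses spaCy’s Spanish model to detect location entities and writes them to entities.txt so that the existing Wikidata stage can pick them up unchanged. The model name and example sentence are illustrative, and the entity labels (LOC, GPE) vary between models.

import spacy

# Assumes the Spanish model has been installed, e.g.:
#   python -m spacy download es_core_news_sm
nlp = spacy.load("es_core_news_sm")
doc = nlp("El barco zarpó de Atenas rumbo a Esmirna.")

# Keep only location-type entities and hand them to the Wikidata lookup stage.
places = sorted({ent.text for ent in doc.ents if ent.label_ in {"LOC", "GPE"}})
with open("entities.txt", "w", encoding="utf-8") as fh:
    fh.write("\n".join(places))
print(places)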
Output and Reproducibility
The pipeline produces a tabular output (.csv file) listing each input location alongside its geographic coordinates. By default, the script creates a pandas DataFrame with four columns. The first records the place name exactly as it appeared in the input file, while the second and third contain the latitude and longitude in decimal degrees. If no geographic coordinates are available, these fields are filled with placeholder values. The fourth column notes the name of the source file from which the entity was extracted, ensuring that each geotagged entry can be traced back to its original context. If the csv option is enabled (default), this DataFrame is saved as a .csv file with the suffix _geotagged.csv. Missing or unmatched entries are not dropped from the output; instead, they are encoded with placeholder values. This explicit error encoding ensures that every input entity is accounted for, including failed matches, and allows researchers to review, filter, or manually correct unresolved cases. Because the code is open source and logs its queries to Wikidata, the workflow is transparent and reproducible, and can be adapted or extended for other corpora.
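Because the placeholder codes fall well outside the valid coordinate range, unresolved entries are easy to separate from genuine matches in downstream analysis. The short example below assumes an output file named entities_geotagged.csv for illustration.

import pandas as pd

df = pd.read_csv("entities_geotagged.csv")       # output of the pipeline
resolved = df[df["latitude"].between(-90, 90)]   # genuine coordinates
no_coords = df[df["latitude"] == -1000]          # page found, no coordinates
unmatched = df[df["latitude"] == -2000]          # no matching page at all
print(len(resolved), "resolved,", len(no_coords), "without coordinates,",
      len(unmatched), "unmatched")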
Re-Use Potential
This workflow is designed for digital humanities researchers who work with unstructured textual corpora and need an open, reproducible method for extracting and geocoding place names. Its primary audience includes scholars in literary studies, history, classics, archaeology, and cultural heritage who must identify toponyms embedded in plain-text documents but may not have access to curated Text Encoding Initiative (TEI) metadata or domain-specific gazetteers. Because the pipeline requires only basic command-line familiarity, it is well suited to researchers who can run scripts but may not have the time or expertise to construct a multi-stage geocoding system on their own. It is particularly useful for projects that involve large or heterogeneous corpora, since the inclusion of ready-made shell and SLURM scripts allows users to run the workflow locally for exploratory testing or deploy it directly on university HPC clusters for large-scale batch processing.
The pipeline also benefits projects working with multilingual or historically variant place names. Wikidata’s extensive labels and aliases allow the workflow to match transliterated forms, non-English spellings, and historical variants when they are represented in the knowledge base. Because the workflow is fully open-source, it is also well-suited for researchers who require unrestricted downstream analysis and reproducibility, including those working under open-science or long-term preservation mandates.
For scholars seeking a rapid spatial overview of a corpus (e.g., identifying narrative settings, detecting clusters of historical activity, or establishing geographic baselines), the pipeline offers a fast and replicable method for generating coordinate data without the need to construct a bespoke system. At the same time, it is not intended as a comprehensive disambiguation tool: projects that require fine-grained historical place modeling, ontological alignment, or expert-curated gazetteer reconciliation may wish to replace or combine this workflow with resources such as Pleiades or the World Historical Gazetteer. Rather than serving as a definitive geospatial authority file, the workflow provides an extendable foundation for humanities researchers who need an openly licensed, scalable approach to geocoding unstructured text.
Codebase Description
This dataset is archived in the JOHD Dataverse repository, with active development mirrored on GitHub. It contains a small suite of scripts designed to support geocoding workflows for humanities datasets, integrating named entity recognition (NER), entity linking, and coordinate retrieval from Wikidata.
Repository location
Dataverse: https://doi.org/10.7910/DVN/NNGFJC
Repository name
The dataset is available in the JOHD Dataverse repository and mirrored on GitHub for active development and version control.
Object name
geocode.sh: Shell script for setting up and executing Stanford CoreNLP with the required language models and entitylink annotator. Automates preprocessing, named entity recognition (NER), and wikification across a directory of plain-text (.txt) files. Configured for both local execution and high-performance computing (HPC) environments.
geocode.py: Python script that processes the list of extracted location entities (entities.txt) and retrieves latitude/longitude coordinates from Wikidata using Pywikibot. Handles redirects, missing pages, and missing coordinate values, returning standardized placeholder codes where necessary. Outputs results as a CSV file with columns for place name, latitude, longitude, and source file.
geocode.sbatch: Optional SLURM submission script for running geocode.sh on HPC clusters. Includes configurable resource requests for scalable processing of large corpora.
README.md: Detailed README file including a line-by-line explanation of the geocode.sh file.
Format names and versions
The repository includes code and documentation files in the following formats: .py (Python scripts), .sh (shell scripts), .sbatch (SLURM batch submission scripts), and .md (Markdown documentation).
Creation dates
Originally published on GitHub: 2024-04-11
Dataverse Publication: 2025-10-07
Dataset creators
Annie K. Lamar
Assistant Professor of Computational Classics
Director, Low-Resource Language (LOREL) Lab
University of California, Santa Barbara
Language
The codebase uses English for variable names and for documentation in the README file, with the implementation written in Python and Bash (including SLURM batch directives).
License
The dataset is published on Dataverse under a CC0 1.0 license, granting universal public domain dedication.
Publication date
The codebase was originally released on GitHub on 2024-04-11 and subsequently deposited in the Dataverse repository on 2025-10-07.
Acknowledgements
I would like to thank Brad Rittenhouse, Research Data Facilitator at the Stanford Research Computing Center, for his guidance and support during the original development of this project.
Competing Interests
The author has no competing interests to declare.
Author Contributions
This is a single-authored piece.
