(1) Overview
Repository location
The dataset can be found at https://doi.org/10.17605/OSF.IO/DSVKR
Context
This data paper investigates the semantic properties of so-called geonyms, i.e. the classifying words in place names (e.g., Square in Leicester Square; Piazza in Italian Piazza Esedra ‘Esedra square’ [Blair & Tent, 2021; Morabito, 2020; Samo & Ursini, 2023]). A more detailed companion paper, currently under review, illustrates the procedure described here and the literature landscape for a small but informative set of Romance languages. That paper proposes a cross-linguistic, context-sensitive analysis of target geonyms expressing the concepts ‘square’, ‘street’ and ‘alley’, initially for five Romance languages with official/national status: French, Italian, Portuguese, Romanian and Spanish. For each language, the authors selected three geonyms that express these concepts in toponyms (e.g., Italian Piazza for ‘square’; French rue for ‘street’; Portuguese beco for ‘alley’). The goal of that paper was thus to provide a model for the cross-linguistic, context-sensitive analysis of geonyms and their senses/meanings. However, the empirical richness of Wikidata also provides the opportunity to work cross-linguistically across official languages as well as local varieties to detect dimensions of variation (cf. Samo & Ursini, 2023; Ursini & Samo, 2025).
We used three distinct data sources acting as contexts of use (general/conversational, specialised, and technical term contexts [ten Hacken, 2018]), extracting three data types from each source/context. The first source/context was SketchEngine, to access large corpora and general/conversational uses (Kilgarriff et al., 2013). The second was Wikipedia, to access specialised articles on places and place names (Syed et al., 2021). The third was Wikidata as a multilingual dictionary, to access (technical) term definitions for the target geonyms (Vrandečić, 2013; Turki et al., 2017). The goal of the present paper is therefore to explain in more detail the Wikidata data extraction and analysis procedures.
(2) Method
The method worked as follows and is summarized by the pipeline in Figure 1.

Figure 1
Overview of the data processing pipeline used in the study.
Steps
First, we used Wikidata as a multilingual dictionary and accessed the entries for the three geonyms expressing the target concepts in each language (again, ‘square’, ‘street’ and ‘alley’). We started by analysing the definitions offered in English, our meta-language (i.e. Alley: https://www.wikidata.org/wiki/Q1251403; Street: https://www.wikidata.org/wiki/Q79007; Square: https://www.wikidata.org/wiki/Q174782. Last accessed September 5, 2025). We then verified whether the selected geonyms were listed as possible translations of these entries (e.g., whether Piazza was listed as a translation of square). Had a definition been absent from Wikidata, we would have consulted a monolingual dictionary for verification; this safety measure turned out not to be necessary at this step.
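Since the entries above are ordinary Wikidata items, this lookup can also be scripted. The fragment below is a minimal sketch (not the released code) of how one might check, via the public wbgetentities endpoint, whether a candidate geonym appears among the labels or aliases of the relevant item; the function and dictionary names are ours and purely illustrative.

import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

# Q-IDs cited above; the mapping itself is only for illustration.
CONCEPTS = {"alley": "Q1251403", "street": "Q79007", "square": "Q174782"}

def is_listed_as_translation(qid: str, geonym: str, lang: str) -> bool:
    """Check whether `geonym` matches the item's label or one of its aliases in `lang`."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|aliases",
        "languages": lang,
        "format": "json",
    }
    entity = requests.get(WIKIDATA_API, params=params, timeout=30).json()["entities"][qid]
    label = entity.get("labels", {}).get(lang, {}).get("value", "")
    aliases = [a["value"] for a in entity.get("aliases", {}).get(lang, [])]
    return geonym.casefold() in {label.casefold(), *(a.casefold() for a in aliases)}

# e.g. is_listed_as_translation(CONCEPTS["square"], "piazza", "it")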
Second, we used OpenStreetMap (OSM), an Open Access online gazetteer (Samo & Ursini, 2023; Ursini & Samo, 2025), to perform a geographical verification task. We first extracted toponyms including the 15 target geonyms, and then verified their geographical distribution on the national territories associated with each language. For instance, we verified that the French toponyms including the geonyms place ‘square’, rue ‘street’ and allée ‘alley’ were attested on French territory, and thus did not give us data from other regional varieties (e.g., Belgian French). We observed local forms of geo-dialectal variation for the Italian data (e.g., calle as a variant for vicolo ‘alley’ in Venice). We included these forms of variation in the analysis after assessing whether they expressed the target concepts (cf. Samo & Ursini, 2023, for discussion).
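This verification task can equally be scripted against the Overpass API. The sketch below is an assumption on our part, not the released osm-data-finder.py: it restricts the search for toponyms starting with a given geonym to a country identified by its ISO 3166-1 code, which is how the regional filtering described above can be approximated.

import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

def sample_toponyms(geonym: str, country_iso: str, limit: int = 50) -> list[str]:
    """Return up to `limit` OSM street names starting with `geonym` inside the given country."""
    query = f"""
    [out:json][timeout:120];
    area["ISO3166-1"="{country_iso}"][admin_level=2]->.a;
    way["highway"]["name"~"^{geonym} ",i](area.a);
    out tags {limit};
    """
    response = requests.post(OVERPASS_URL, data={"data": query}, timeout=180)
    response.raise_for_status()
    return [el["tags"]["name"] for el in response.json()["elements"]]

# e.g. sample_toponyms("rue", "FR") should only return toponyms attested on French territory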
Third, we selected the definitions of the senses/meanings assigned to each target geonym and extracted those involving the geographical, architectural, urban planning and legal domains. We chose these definitions because official dictionaries in geographic disciplines (e.g., Klaus-Jürgen et al., 2010) usually lack legal status, and also lack linguistic information about geonyms and toponyms (e.g., etymology, naming functions). The Wikidata definitions/entries operate as (broadly) official definitions and thus cover these and other linguistic aspects (e.g., grammatical rules of use) in detail. Hence, they provided us with information about the use of geonyms in official/technical contexts, such as the coinage of new place names including these geonyms (e.g., piatia for new Romanian place names).
Fourth, we performed a manual analysis of these definitions. Since the authors have a good degree of fluency in each language, they mapped the components (semantic features [Bullinaria & Levy, 2007; Samo & Ursini, 2024]) forming the senses/meanings of each geonym into English as the meta-language of the study. For instance, Italian piazza has a core definition in Wikidata that includes the phrase luogo pubblico di circolazione ‘public place for circulation’. The semantic method of analysis permitted the authors to extract ‘place’, ‘public’ and ‘circulation’ as the features/components forming the sense of the geonym piazza ‘square’. From these four steps, we performed the semantic analysis outlined in the companion paper under review. Readers interested in the methodology and in an automated version of the semantic analysis can also consult Ursini & Samo (2025). Here we observe that our methodology allowed us to perform a fast, efficient and replicable extraction of the data.
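For transparency, the mapping from definition components to English features can be recorded explicitly. The snippet below is an illustrative sketch and not part of the released scripts; only the piazza entry mirrors the features discussed above, while the other entries are hypothetical placeholders.

# Feature sets per (language, geonym); only ("it", "piazza") reflects the example above.
FEATURES = {
    ("it", "piazza"): {"place", "public", "circulation"},
    ("fr", "place"): {"place", "public", "open"},    # hypothetical placeholder
    ("pt", "beco"): {"way", "narrow", "passage"},    # hypothetical placeholder
}

def shared_features(key_a: tuple, key_b: tuple) -> set:
    """Features common to two geonyms, as a first pass at cross-linguistic comparison."""
    return FEATURES[key_a] & FEATURES[key_b]

# e.g. shared_features(("it", "piazza"), ("fr", "place")) -> {"place", "public"}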
Sampling strategy
Sampling was restricted to the target geonyms of the languages under investigation.
Quality control
We interacted with native speakers of, and experts in, each target language to evaluate each definition and the resulting feature extraction.
(3) Dataset Description
Repository name
gen_geonyms
Object name
wikidata_definitions.py (the folder contains a _square version as a practical demo example) and its output wikidata_definitions_square.json; osm-data-finder.py and its output geonyms_squaredemo.csv. The folder also contains osm-data-finder.py and its output for the Romance dataset. The files can easily be modified for different research questions.
Format names and versions
Python, CSV, JSON.
Creation dates
2025-08-10.
Dataset creators
Giuseppe Samo (main creator) and Francesco-Alessio Ursini (curator).
Language
Meta-language: English. Object languages (original paper): French, Italian, Portuguese, Romanian, and Spanish. Object languages presented here for reuse potential: Mandarin Chinese and Uzbek. Script/programming language: Python.
License
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Publication date
2025-10-22.
(4) Reuse Potential
The data offer potential for re-analysis in the form of large datasets involving three different data types, together with methodologies for data extraction and linguistic comparison in both automated and manual form. The study also provides an ample set of bibliographic references outlining the background to the study, and detailed information on the validation method. Furthermore, the authors believe that the full study could provide a springboard for the teaching of research methods for linguistic data involving multi-source data extraction and analysis. The authors also envision possible future collaborations involving other authors applying these methods to languages outside the Romance family, other data sources (e.g., social media), and other applications extending the methodology across research networks.
As an illustrative example, we show how the features of terms in different language families can be evaluated. We report below an example for the concept ‘square’ in Italian (piazza), Uzbek (maydon) and Chinese (guǎng chǎng). The main script, wikidata_definitions.py, was developed to automate the retrieval and translation of lexical definitions from Wikidata by querying the public Wikidata API to extract the labels and descriptions associated with a given item (in this case, Q174782, corresponding to the target ‘square’ concept). The user can specify the target languages and their ISO codes (in our example: it = Italian, zh-cn = Chinese, and uz = Uzbek), which are then processed to produce a structured JSON file containing, for each language, the original term, its native-language description, and an English translation of that description. To ensure translation robustness, the script employs a two-tiered fallback system: it first attempts translation using the googletrans library and, if this is unavailable or unsuccessful, it defaults to the MyMemory Translated.net public API. The translations aim to offer an overview of the Wikidata definitions even in languages not mastered by the researchers; if necessary, they can also be verified with native speakers. The output is as given in Figure 2: each term has a label, the language, the definition in the original language and an automated English translation (a simplified sketch of this step is given after Figure 2).

Figure 2
Output of the Python script. The JSON file contains the relevant information for the semantic analysis and also allows easy extraction of search keys for data retrieval from OSM or other gazetteers.
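As an indication of how such output can be produced, the following sketch retrieves labels and descriptions for Q174782 in the three demo languages and translates the descriptions into English. It is a simplified stand-in for wikidata_definitions.py rather than the released code: only the MyMemory fallback is shown (the released script tries googletrans first), and the output file name is hypothetical.

import json
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
MYMEMORY_API = "https://api.mymemory.translated.net/get"

LANGS = ["it", "zh-cn", "uz"]  # target language codes, as in the demo

def translate_to_english(text: str, source_lang: str) -> str:
    """Translate via the public MyMemory API; codes may need mapping to RFC 3066 (e.g. zh-CN)."""
    params = {"q": text, "langpair": f"{source_lang}|en"}
    data = requests.get(MYMEMORY_API, params=params, timeout=30).json()
    return data["responseData"]["translatedText"]

def definitions_for(qid: str) -> dict:
    """Collect label, native-language description and English translation per target language."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|descriptions",
        "languages": "|".join(LANGS),
        "format": "json",
    }
    entity = requests.get(WIKIDATA_API, params=params, timeout=30).json()["entities"][qid]
    result = {}
    for lang in LANGS:
        # Note: some items carry labels/descriptions only under "zh" rather than "zh-cn".
        label = entity.get("labels", {}).get(lang, {}).get("value")
        description = entity.get("descriptions", {}).get(lang, {}).get("value")
        result[lang] = {
            "label": label,
            "description": description,
            "description_en": translate_to_english(description, lang) if description else None,
        }
    return result

if __name__ == "__main__":
    # Hypothetical output name, modelled on wikidata_definitions_square.json.
    with open("square_definitions_sketch.json", "w", encoding="utf-8") as fh:
        json.dump(definitions_for("Q174782"), fh, ensure_ascii=False, indent=2)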
A second script (osm-data-finder.py) operationalizes the multilingual labels obtained from Wikidata by integrating them into a large-scale geospatial data extraction process. This script reads the JSON file generated by the first script and uses the labels as linguistic search keys to query the OpenStreetMap (OSM) database through the Overpass API. Researchers can also divide territories into geographic tiles, according to the level of fine-grained analysis required. For example, in the available demo file, researchers may find bounding boxes corresponding to the city of Siena in Italy, the urban space of Samarqand in Uzbekistan and the city of Xi’an in China. Limiting the analysis to these smaller territorial extents allows for faster data retrieval, but different research questions may require data retrieval at different scales. Figure 3 plots the longitude and latitude values from the resulting CSV file for the three geonyms, together with a finer-grained distribution for piazza in Siena (a sketch of this extraction step follows Figure 3).

Figure 3
Distributions in terms of longitude and latitude of the geonyms piazza, maydon and guang chang in the selected geographical tiles in OSM (left panel), and the finer-grained distribution of piazza in the Siena tile (right panel).
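A minimal sketch of this extraction step is given below, under the assumption that the JSON file follows the layout of the previous sketch; the bounding boxes are approximate and purely illustrative, and the whole listing is our reconstruction rather than the released osm-data-finder.py.

import csv
import json
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

# Approximate tiles as (south, west, north, east); coordinates are illustrative only.
TILES = {
    "it": (43.29, 11.30, 43.35, 11.36),       # Siena
    "uz": (39.62, 66.90, 39.70, 67.02),       # Samarqand
    "zh-cn": (34.20, 108.87, 34.31, 109.02),  # Xi'an
}

def query_tile(label: str, bbox: tuple) -> list:
    """Return OSM elements inside `bbox` whose name contains `label`."""
    south, west, north, east = bbox
    query = f"""
    [out:json][timeout:120];
    nwr["name"~"{label}",i]({south},{west},{north},{east});
    out tags center 500;
    """
    response = requests.post(OVERPASS_URL, data={"data": query}, timeout=180)
    response.raise_for_status()
    return response.json()["elements"]

with open("square_definitions_sketch.json", encoding="utf-8") as fh:
    definitions = json.load(fh)  # output of the previous sketch

with open("geonyms_square_sketch.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["lang", "geonym", "name", "lat", "lon"])
    for lang, bbox in TILES.items():
        label = definitions[lang]["label"]
        if not label:
            continue  # skip languages without a label in the JSON
        for element in query_tile(label, bbox):
            point = element.get("center", element)  # nodes carry lat/lon directly
            writer.writerow([lang, label, element["tags"].get("name", ""),
                             point.get("lat"), point.get("lon")])

Plotting the lat/lon columns of the resulting CSV with any standard plotting library reproduces the kind of overview shown in Figure 3.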
By coupling Wikidata’s multilingual conceptual framework with OSM’s volunteered geographic information, this workflow enables cross-cultural comparative analyses of how urban spaces are linguistically represented in public cartographic data. We believe that this approach provides a reproducible and autonomous pipeline for the semantic normalization of geolinguistic concepts drawn from Wikidata and for the mapping of their spatial distribution, thus supporting cross-linguistic and cross-cultural analyses of urban terminology.
Acknowledgements
We would like to thank the anonymous reviewers for their useful comments during the review process.
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Giuseppe Samo: Data curation, Resources, Software, Writing – original draft, Writing – review and editing
Francesco-Alessio Ursini: Conceptualization, Formal Analysis, Writing – original draft, Writing – review & editing
