(1) Overview
The data introduced in this paper are a collection of 15 type-based word embedding models coupled with a web application for their exploration. The models are trained on two corpora of Early Modern Latin texts of special relevance for the history of Early Modern science:
NOSCEMUS – a corpus of digitized Early Modern scientific texts in Latin (Akopyan et al., 2023; Zathammer, 2025)1
EMLAP – a corpus of digitized Early Modern Latin Alchemical Prints (Hedesan et al., 2025)
In addition, for reference, we also include two other Latin word embedding models from other historical periods: (1) a model trained on Opera Latina (LASLA), a representative corpus of 158 Latin texts by 20 different Classical authors (Denooz, 2004; Longree et al., 2023), and (2) a model based on Opera Maiora, a corpus of philosophical, theological and religious treatises by Thomas Aquinas (Passarotti, 2010, 2015). These additional models were produced within the LiLa project;2 their development was thoroughly documented and evaluated (Sprugnoli et al., 2019, 2020), they are publicly available online,3 and they are fully compatible with our own models, as we inherit their parametrization for training ours.
In total, we offer four temporal models based on NOSCEMUS, eight discipline-specific models derived from NOSCEMUS, one model trained on the EMLAP corpus, and two pretrained models from the LiLa project (see Figure 1):
NOSCEMUS – 1501–1550
NOSCEMUS – 1551–1600
NOSCEMUS – 1601–1650
NOSCEMUS – 1651–1700
NOSCEMUS – Alchemy/Chemistry
NOSCEMUS – Astronomy/Astrology/Cosmography
NOSCEMUS – Biology
NOSCEMUS – Geography/Cartography
NOSCEMUS – Mathematics
NOSCEMUS – Medicine
NOSCEMUS – Meteorology/Earth sciences
NOSCEMUS – Physics
EMLAP
LASLA
Opera Maiora

Figure 1
Tokens per subcorpus.
The discipline categories used for the eight discipline-specific models are directly inherited from the NOSCEMUS wiki metadata. While the applicability of these modern disciplinary names can be questioned as anachronistic, we still consider them a useful heuristic for navigating the data, as no authoritative alternative exists at the moment. The creators of NOSCEMUS are domain experts, and their task was far from straightforward: classifying scholarly texts from the first four centuries of modern science by discipline necessarily involves a degree of simplification, especially given that some disciplines disappeared during this period while others only began to emerge. It should also be noted that these discipline categories are inclusive: in NOSCEMUS, a single work was often assigned more than one discipline label. Such works were accordingly used for training more than one discipline-specific submodel.
Repository location
Both the models and the source code for the web app are published via Zenodo:
WEEMS: https://doi.org/10.5281/zenodo.17186167 (models, supplemented by the code used for their training)
iWEEMS: https://doi.org/10.5281/zenodo.16923112 (source code for the web application)
The web app is currently deployed at https://ccs-lab.zcu.cz/iweems/ and https://huggingface.co/spaces/vojkas/iweems.
Context
The models and the web app were produced as a part of the TOME (The Origins of Modern Encyclopaedism: Launching Evolutionary Metaphorology) research project, http://tome.flu.cas.cz, which aims to investigate the role of the evolution of metaphors in the emergence of modern encyclopaedism.
(2) Method
The work presented in this paper draws on the distributional semantics approach to word meaning, which hypothesizes a correlation between distributional similarity and semantic similarity (Harris, 1954; Lenci, 2018; Sahlgren, 2008). By quantifying the distributional contexts in which words appear within a language corpus, we can approximate their semantic relatedness. Once we obtain multiple models trained on different discipline-specific subcorpora or temporal slices, we can also compare how a word’s meaning shifts across contexts.
Despite the rise of token-based (contextual) embeddings — where each token receives its own vector depending on its surrounding context — trained on large-scale corpora using Transformer-based architectures (such as BERT; Devlin et al., 2018), it has been argued that type-based (static) embeddings — where each vocabulary entry is represented by a single vector summarizing its distributional behavior (e.g., word2vec; Mikolov et al., 2013) — still hold value in certain research contexts. This is particularly true in digital humanities, where the size of available corpora and limited computing resources make it impractical to train contextual models from scratch (Ehrmanntraut et al., 2021; Lenci et al., 2022). While relying on pretrained token-based models may be suitable when optimizing for general NLP benchmarks, such models may reflect semantic patterns not inherent to the corpus under study. In contrast, static type-based embeddings have been shown to outperform BERT in many out-of-context semantic similarity tasks (Ehrmanntraut et al., 2021; Lenci et al., 2022; Pražák et al., 2020; Schlechtweg et al., 2020).
We adopted the static type-based embedding approach described above. Its limitation is well known: each vocabulary item is represented by a single vector, which prevents the model from capturing homonymy and polysemy. Yet this constraint can itself become a useful feature, as the extent to which embeddings reveal dominant or blended meanings may offer insights into how concepts were employed and contested within Early Modern Latin scientific discourse.
Steps
We trained the models on textual data which were automatically cleaned and morphologically annotated.4 For this we relied primarily on LatinCy (Burns, 2023), a set of natural language processing pipelines for Latin built on the spaCy Python library (Montani et al., 2023). The training input consisted of hybrid sentence data from each subcorpus, where every sentence was represented as a list of lemmata restricted to nouns (NOUN), verbs (VERB), adjectives (ADJ), and proper names (PROPN). As a consequence, and as a known limitation of our approach, we did not split homographs (e.g., liber ‘book/free’, or LASLA’s lemma variants) into distinct vectors: each collapses into a single type-vector, the intentional trade-off of type-based embeddings noted above. Polysemous or homographic lemmas may therefore appear as “blend” vectors; we flag this as a limitation and a direction for future work (e.g., lemma+POS keys or sense-tagged variants).
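The restriction of each sentence to content-word lemmata can be sketched as follows. This is an illustrative reconstruction, not the project's actual pipeline code: the hard-coded (lemma, UPOS) pairs stand in for what LatinCy/spaCy would produce.

```python
# Keep only lemmata of the four content-word classes used for training.
KEEP = {"NOUN", "VERB", "ADJ", "PROPN"}

def to_training_sentence(annotated_tokens):
    """Turn (lemma, upos) pairs into a training sentence of lemmata."""
    return [lemma for lemma, upos in annotated_tokens if upos in KEEP]

# Toy annotated sentence; in the real pipeline this comes from LatinCy.
annotated = [
    ("scientia", "NOUN"), ("de", "ADP"), ("natura", "NOUN"),
    ("res", "NOUN"), ("tracto", "VERB"), ("et", "CCONJ"),
]
sentence = to_training_sentence(annotated)
print(sentence)  # ['scientia', 'natura', 'res', 'tracto']
```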
For each subcorpus, vectors were trained only for words included in a predefined vocabulary. This vocabulary was derived from raw lemma frequencies across all subcorpora: we extracted the 5,000 most frequent lemmata from each subcorpus, yielding a combined vocabulary of 11,044 unique items (see Supplementary Material, Section 1, Table 1). During training, items with fewer than 10 occurrences in a given subcorpus were excluded. This pipeline ensures that each subcorpus model remains representative of its domain while also preserving a substantial overlap in vocabulary across models, which enables meaningful cross-corpus comparison.
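The vocabulary-construction step can be sketched roughly like this: a minimal reconstruction with toy counts and shrunken thresholds (the real pipeline uses the top 5,000 lemmata per subcorpus and a minimum count of 10).

```python
from collections import Counter

TOP_N = 3      # 5,000 in the real pipeline
MIN_COUNT = 2  # 10 in the real pipeline

# Toy raw lemma frequencies for two subcorpora.
freqs = {
    "EMLAP": Counter({"mercurius": 50, "sal": 40, "aqua": 30, "ignis": 5}),
    "LASLA": Counter({"bellum": 60, "aqua": 45, "rex": 20, "sal": 1}),
}

# Combined vocabulary: union of each subcorpus's TOP_N most frequent lemmata.
vocabulary = set()
for counter in freqs.values():
    vocabulary.update(w for w, _ in counter.most_common(TOP_N))

# During training, vocabulary items below MIN_COUNT in a subcorpus are dropped.
trainable = {name: sorted(w for w in vocabulary if counter[w] >= MIN_COUNT)
             for name, counter in freqs.items()}
print(sorted(vocabulary))
print(trainable)
```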
Subsequently, the models were trained using the skip-gram variant of the FastText algorithm (Bojanowski et al., 2017), as implemented in the Gensim library for Python (Řehůřek & Sojka, 2010). Drawing on Sprugnoli et al. (2020), we used the following training parameters:
Vector dimensions: 100
Context window size: 10
Number of negative samples per positive example (used in negative sampling): 25
Training iterations: 15
This parametrization, inherited from the LiLa project, makes our vectors directly comparable to the models based on LASLA and Opera Maiora.
We evaluate each WEEMS submodel on a standard synonym-selection task modeled on TOEFL and adapted to Latin (Sprugnoli et al., 2020). For each benchmark item, we compute cosine similarities between the target lemma and four candidates (one highly semantically related or a synonym + three decoys) and mark a prediction correct if the highly related term has the highest similarity (ties count as correct). Items with missing vectors are excluded from accuracy computations but are counted in coverage statistics. Coverage is reported under two settings: min-1 (target, related, and ≥1 decoy in-vocab) and all (all five lemmas in-vocab) (see Supplementary Material, Section 2, Table 2).
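The scoring logic can be sketched as follows: a hypothetical reimplementation over a plain dictionary of vectors, not the project's evaluation script. Ties count as correct via the `>=` comparison, and items with any missing vector are skipped.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def synonym_accuracy(vectors, items):
    """vectors: lemma -> np.ndarray; items: (target, related, decoys) tuples.
    Items with any missing vector are excluded from the accuracy computation;
    ties count as correct."""
    hits = covered = 0
    for target, related, decoys in items:
        if any(w not in vectors for w in (target, related, *decoys)):
            continue
        covered += 1
        sims = {w: cosine(vectors[target], vectors[w]) for w in (related, *decoys)}
        if sims[related] >= max(sims.values()):
            hits += 1
    return hits / covered if covered else float("nan")

vecs = {
    "gladius": np.array([1.0, 0.1]), "ensis": np.array([0.9, 0.2]),
    "aqua": np.array([0.0, 1.0]), "mensa": np.array([-0.5, 0.8]),
}
items = [("gladius", "ensis", ["aqua", "mensa", "absent"]),   # excluded
         ("gladius", "ensis", ["aqua", "mensa", "aqua"])]     # covered, correct
print(synonym_accuracy(vecs, items))
```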
The Latin synonym benchmark is grounded in Classical lexicography and was originally designed around the Opera Latina (LASLA) corpus (Classical authors, ~1.7M tokens). As a result, LASLA shares far more lemmas with the benchmark than the other WEEMS subcorpora, which reflect Early Modern scientific, alchemical, or scholastic domains. The fact that all our models nevertheless achieve consistently high accuracy on covered items demonstrates that the embeddings capture synonymy relations robustly across very different corpora.
(3) Dataset Description
Repository name
Zenodo
Object name
In the case of WEEMS, for convenience, the models are serialized into one pickle file: /data/vectors_dict_comp.pkl.
In the case of iWEEMS, the source code for the web app consists of four Python files located in the main directory of the repository: iweems-streamlit.py, explorer.py, crosscorpora_comparisons.py, and overview.py.
Format names and versions
All the WEEMS vector data are stored within a single pickle file, /data/vectors_dict_comp.pkl, containing a Python dictionary object, with the labels of individual models as dictionary keys and their corresponding vector data in the form of Gensim Python library keyed vectors as values. Once the repository is downloaded or cloned, the vector data can be loaded directly using the following Python code snippet:
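A minimal sketch of that loading step follows; the toy dictionary here stands in for the real file (whose values are Gensim keyed vectors), so the snippet runs anywhere. With the repository cloned, load data/vectors_dict_comp.pkl directly, as in the commented lines.

```python
import pickle

# With the repository cloned, use the real path:
#     with open("data/vectors_dict_comp.pkl", "rb") as f:
#         vectors_dict = pickle.load(f)
# For a self-contained demonstration we round-trip a toy dictionary instead.
toy = {"EMLAP": "keyed-vectors-placeholder", "LASLA": "keyed-vectors-placeholder"}
with open("vectors_dict_demo.pkl", "wb") as f:
    pickle.dump(toy, f)

with open("vectors_dict_demo.pkl", "rb") as f:
    vectors_dict = pickle.load(f)

print(sorted(vectors_dict))  # ['EMLAP', 'LASLA'] for the toy dictionary
```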

Subsequently, for example, a list of words with the most similar vector representation to the Latin term scientia in the EMLAP submodel can be obtained by running the following command:

The iWEEMS web app allows users to explore the WEEMS data in an interactive manner with several visualizations and tabular views.
First of all, iWEEMS offers interactive three-dimensional projections of the WEEMS vector data. For each submodel, we projected all vectors into a three-dimensional space using tSNE. To choose the most suitable algorithm for the projections, we experimented with 16 different parametrizations of tSNE (van der Maaten & Hinton, 2008), manipulating the perplexity parameter, and 30 different parametrizations of UMAP (McInnes et al., 2018), especially manipulating the n_neighbors parameter (see weems/data/projections_experiment.csv). We ran these experiments with models for two subcorpora – the smallest one (EMLAP) and the largest one (NOSCEMUS – Medicine). To assess the performance of different parametrizations, we measured the trustworthiness of the projections, i.e. how well the local neighbourhood structure of the original high-dimensional data is preserved in the lower-dimensional representation (Venna & Kaski, 2001).
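This kind of measurement can be sketched with scikit-learn, which implements both t-SNE and the trustworthiness score of Venna & Kaski. This is a minimal sketch on random stand-in data, not the project's experiment script, and the parameter values here are illustrative only.

```python
import numpy as np
from sklearn.manifold import TSNE, trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))  # stand-in for 100-dimensional word vectors

# Project to 3D with one illustrative perplexity setting.
proj = TSNE(n_components=3, perplexity=30, random_state=0).fit_transform(X)

# Trustworthiness: how well local neighbourhoods survive the projection.
score = trustworthiness(X, proj, n_neighbors=10)
print(round(score, 3))  # 1.0 = local neighbourhoods perfectly preserved
```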
Instead of visualizing the complete vector space of each model, the tool allows the user to (1) select one of the subcorpora from a list of options, (2) choose a target word using a free text input, and (3) set the number of nearest neighbours to visualize.
While the nearest neighbours are determined as words with the highest cosine similarity within the full vector space of the respective subcorpus, their positions in the plot are derived from the tSNE projections described above, which form one of the key data inputs for the web app. This distinction is important to note, especially for polysemous words: even a word with a relatively high cosine similarity to the target can be plotted relatively far from it, being dragged toward other clusters within the projection. For each word visualized within the interactive 3D plot, a hover text shows the word's frequency in the respective subcorpus, its machine translation into English, and its cosine similarity score with the target.
In addition to the interactive 3D visualization, the app also offers a tabular view of the nearest neighbours including the cosine similarity score. Further, there is a similarity matrix covering mutual cosine similarities among all nearest neighbours.
Another important feature of iWEEMS is the cross-corpora comparison (see the Crosscorpora Comparison tool within the app). Even though vectors from different subcorpora live in separate spaces, we bypass explicit alignment by comparing similarity profiles of the same term. We first compute a cosine similarity matrix for each subcorpus’s model based on the shared vocabulary of 3,055 words appearing across all subcorpora. For a chosen term, we extract its row from each subcorpus’s cosine similarity matrix (removing the self-similarity entry), which represents how that term relates to every other shared word within the subcorpus. We then compute pairwise Pearson correlations between these profiles, producing a correlation matrix that captures how consistently the term’s semantic neighborhood is preserved across subcorpora. High correlations suggest stability of meaning, while lower or negative correlations may point to semantic shifts or discipline-specific usages.
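The profile-correlation idea can be sketched as follows: a hypothetical reimplementation over random toy vectors, with the shared vocabulary reduced to a handful of words.

```python
import numpy as np

def similarity_profile(vectors, term, shared_vocab):
    """Row of the cosine similarity matrix for `term`, self-entry removed."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return np.array([cos(vectors[term], vectors[w])
                     for w in shared_vocab if w != term])

rng = np.random.default_rng(1)
shared = ["scientia", "natura", "ars", "ratio", "mercurius"]

# Two toy subcorpus models over the same shared vocabulary.
model_a = {w: rng.normal(size=8) for w in shared}
model_b = {w: rng.normal(size=8) for w in shared}

p_a = similarity_profile(model_a, "scientia", shared)
p_b = similarity_profile(model_b, "scientia", shared)

# Pearson correlation between the two similarity profiles.
r = np.corrcoef(p_a, p_b)[0, 1]
print(round(r, 3))  # high r would suggest a stable meaning across subcorpora
```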
The iWEEMS web app is implemented using Streamlit. To deploy the app locally on your machine, (1) clone or download the repository, (2) change into the repository directory, (3) create or activate a dedicated Python environment, (4) install the required Python libraries from REQUIREMENTS.txt, and (5) run the following shell command:
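With Streamlit, the launch command in step (5) is presumably the standard entry point applied to the main script named in Section 3:

```shell
streamlit run iweems-streamlit.py
```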

Creation dates
WEEMS v0.3.9: 2025-06-04 (first public version released on 2025-01-10)
iWEEMS v0.7.1: 2025-06-04
Dataset creators
Vojtěch Kaše (Institute of Philosophy, Czech Academy of Sciences, Prague, Czech Republic): Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
Jana Švadlenková (Institute of Philosophy, Czech Academy of Sciences, Prague, Czech Republic & Department of Philosophy, University of West Bohemia, Pilsen, Czech Republic): Data curation, Project administration, Writing – review & editing
Jan Tvrz (Institute of Philosophy, Czech Academy of Sciences, Prague, Czech Republic & Department of Philosophy, University of West Bohemia, Pilsen, Czech Republic): Data curation, Software, Visualization, Writing – review & editing
Georgiana Hedesan (History Faculty, University of Oxford, Oxford, United Kingdom): Methodology, Resources, Validation, Writing – review & editing
Petr Pavlas (Institute of Philosophy, Czech Academy of Sciences, Prague, Czech Republic): Conceptualization, Funding acquisition, Methodology, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing
Language
The language of the source texts used for training the models is Latin. The documentation of the web app is in English.
License
CC-BY-SA-4.0 license
Publication date
2025-06-13
(4) Reuse Potential
The WEEMS and iWEEMS resources can be reused in two complementary ways. First, anyone can explore the deployed interactive web app, which provides visualizations (3-D/2-D projections, similarity heatmaps, and cross-corpus comparison tools) without the need for coding or local setup. This makes the models accessible for teaching, exploratory research, and collaborative work with scholars outside computational linguistics. Second, the vector data themselves are openly available via Zenodo, enabling researchers to download and incorporate them into their own workflows for aggregation, further analysis, validation, or integration with other Latin resources. See Supplementary Material, Section 3 for an illustrative case on the term mercurius. The example elaborated there demonstrates how WEEMS and iWEEMS can serve as both an exploratory tool and a reliable research dataset for the data-driven history of ideas. The models invite reuse not only in specialized studies of Early Modern science, but also in comparative projects across Latin corpora and in teaching modules that introduce students to semantic change. They also support broader digital humanities workflows, where embeddings can be aligned, aggregated, or extended with new materials.
Additional File
The additional file for this article can be found as follows:
Notes
[1] https://transkribus.eu/r/noscemus (last accessed 10 October 2025).
[2] https://lila-erc.eu (last accessed 10 October 2025).
[3] https://embeddings.lila-erc.eu (last accessed 10 October 2025).
[4] For preprocessing of the textual datasets, see https://github.com/CCS-ZCU/noscemus_ETF and https://github.com/CCS-ZCU/EMLAP_ETL respectively.
Acknowledgements
We are grateful to the team of data curators of the EMLAP corpus led by Georgiana Hedesan, namely Alexander Huber, Ondřej Kříž, Jindra Kubíčková, and Jana Ředinová (Institute of Philosophy of the Czech Academy of Sciences), for their careful work on data extraction and cleaning of the corpus. Furthermore, we are grateful to Martin Korenjak and Stefan Zathammer (University of Innsbruck) for providing us with access to the data of the NOSCEMUS corpus.
Competing interests
The authors have no competing interests to declare.
Author Contributions
Vojtěch Kaše: conceptualization, data curation, formal analysis, methodology, software, validation, visualization, writing – original draft, writing – review & editing
Jana Švadlenková: data curation, project administration, visualization, validation, writing – original draft, writing – review & editing
Jan Tvrz: data curation, software, visualization, writing – review & editing
Georgiana Hedesan: methodology, resources, validation, writing – review & editing
Petr Pavlas: conceptualization, funding acquisition, methodology, resources, supervision, validation, writing – original draft, writing – review & editing
