Towards a Data-Driven History of Lexicography: Two Alchemical Dictionaries in TEI-XML

By: Sarah Lang  
Open Access | Mar 2025

(1) Overview

Repository location

Two Alchemical Dictionaries in TEI-XML [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14638445; related GitHub repository: https://github.com/sarahalang/alchemical-dictionaries.

Context

The groundwork for this dataset was laid during a seminar at the University of Graz by Ines Lesjak and Rosa-Maria Mayer and was subsequently refined by the author.

(2) Method

Steps

The TEI-XML files adhere to the TEI dictionary module, presenting structured versions of the dictionaries in which entries can be addressed separately, irrespective of their layout on the page. Front matter and back matter were not encoded in detail, as the focus was on making the dictionary entries addressable. The encoding is based on high-quality text data generated using the Transkribus-based NOSCEMUS GM4 HTR model (Fröstl et al., 2023). Developed as part of the NOSCEMUS ERC project, this HTR model specializes in early modern scientific texts, including alchemical works, and achieves a character error rate of 0.80%, i.e. roughly 2–10 character errors per early modern page (including minor issues such as missing whitespace). Given the length and density of the pages, this is an exceptionally low error rate for HTR output of a historical text with an unusual layout. It is comparable to the error rate of human transcription without multiple rounds of proofreading, which suggests that the dataset’s quality is suitable for various applications.
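To make the error-rate figure concrete, the following minimal Python sketch shows how a character error rate (CER) is conventionally computed, namely as the edit distance between a reference transcription and the HTR output divided by the reference length. It is an illustration only and not the evaluation code used for the GM4 model; the function names and the sample strings are assumptions made for this example. By simple arithmetic, a CER of 0.80% and 2–10 errors per page correspond to roughly 250–1,250 characters per page.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def expected_errors(page_length: int, cer: float) -> float:
    """Expected number of character errors on a page of the given length."""
    return page_length * cer


# Invented sample strings: the hypothesis is missing one whitespace character.
reference = "Aqua fortis, id est"
hypothesis = "Aqua fortis,id est"
print(levenshtein(reference, hypothesis))
print(character_error_rate(reference, hypothesis))

# Illustrative page lengths (assumptions, not measurements from the dataset).
for page_length in (250, 1250):
    print(page_length, expected_errors(page_length, 0.008))
```

Note that a missing whitespace character counts as one edit, which is why the minor issues mentioned above contribute to the CER in the same way as misread letters.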

Quality control

Comprehensive proofreading to achieve edition-quality text was not carried out. The dataset is nonetheless highly usable and a significant resource for scholars.

(3) Dataset Description

Repository name: Zenodo

Object name: Two Alchemical Dictionaries in TEI-XML

Format names and versions: TEI-XML

Creation dates: 2022-07-01 to 2025-01-13

Dataset creators: Rosa-Maria Mayer (former student of University of Graz), Ines Lesjak (former student of University of Graz), Sarah Lang (University of Graz), Stefan Zathammer (University of Innsbruck)

Language: Latin and German in the historical sources, descriptions in English.

License: CC BY

Publication date: Published on GitHub on 2025-01-09; on Zenodo on 2025-01-13.

(4) Reuse potential

High-quality data is essential for a data-driven intellectual history that traces the origins, development, and spread of ideas using computational methods. This paper presents two alchemical dictionaries containing approximately 20,000 entries, encoded using the TEI dictionary module, to make the data accessible for digital humanities research and computational analysis.[1] Structured data enables automated analysis of individual entries, regardless of their placement on the page, and allows for automatic comparisons of the same lemma across multiple dictionaries. While traditional historians may still prefer to consult scanned pages manually, such structured datasets enable data-driven research into the history of ideas and the development of early modern terminology. Alchemical language offers a unique lens for this study, as it is a specialist language that underwent significant evolution in the early modern period and possesses characteristics that make it particularly interesting for such an analysis.
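As an illustration of what an automatic comparison of the same lemma across dictionaries might look like, the sketch below matches headwords from two lemma-to-definition mappings after a simple normalisation step. The normalisation rules (case folding and collapsing u/v and i/j) are an assumption chosen for this example, not a rule taken from the dataset, and the entries are invented placeholders rather than text from Ruland or Sommerhoff.

```python
from typing import Dict, Tuple


def normalise(lemma: str) -> str:
    """Fold case and collapse u/v and i/j spelling variation (an assumption)."""
    folded = lemma.strip().casefold()
    return folded.replace("v", "u").replace("j", "i")


def shared_lemmas(
    first: Dict[str, str], second: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Pair the definitions of every lemma that occurs in both dictionaries."""
    index_first = {normalise(lemma): text for lemma, text in first.items()}
    index_second = {normalise(lemma): text for lemma, text in second.items()}
    common = sorted(index_first.keys() & index_second.keys())
    return {key: (index_first[key], index_second[key]) for key in common}


# Invented placeholder data standing in for two parsed dictionaries.
dictionary_a = {"Vitriolum": "definition A", "Aqua fortis": "definition B"}
dictionary_b = {"vitriolum": "definition C", "Mercurius": "definition D"}

for key, (text_a, text_b) in shared_lemmas(dictionary_a, dictionary_b).items():
    print(key, "|", text_a, "|", text_b)
```

Exact matching after normalisation is the simplest possible strategy. For historical spelling variation, a fuzzy comparison would probably be needed in practice.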

Ongoing digitization efforts provide unprecedented opportunities to enhance the historiography of alchemy through computational methods. These methods are not intended to replace traditional approaches but to complement them. Digital analyses advance our understanding at both the micro- and macro-historical levels, as they allow us to contribute to larger-scale narratives than would be possible using traditional methods. Lexicography, a task difficult to investigate manually due to the sheer volume of entries in the extensive alchemical dictionaries, is an area where computational methods can be particularly valuable. Comparisons across multiple dictionaries using traditional methods are typically limited to focusing on specific terms of interest, rather than examining the broader dynamics of how these dictionaries are compiled. This is especially relevant when considering the knowledge organization practices of the early modern period (Blair 2010), where information came to be compiled and curated by an emerging class of professional writers, as reflected in these terminological dictionaries. Thus, data-driven approaches can offer valuable insights into chymical knowledge organization practices.

Alchemical Decknamen (literally “cover names”) differ from typical technical terms, as they are context-dependent and often metaphorical or neologistic (Newman, 1996; Lang, 2022). However, some are relatively stable in their meaning. Traditional lexicographical methods, while valuable, tend to fall short in decoding these terms due to their inherent contextual variability. Because Decknamen and their usage vary across alchemical authors, traditions, and languages, much work remains to achieve a systematic decoding of alchemical Decknamen. Yet, significant work since the 1990s has dispelled the Jungian misconception that historical alchemical language referred solely to spiritual matters rather than actual chemical operations (Newman 1996). Historians analyzed elusive passages of alchemical texts through a chemical lens and recreated them in laboratories, leading to major discoveries about the chemical meanings behind some Decknamen.[2] Nonetheless, considerable historiographical debate remains on how exactly Decknamen should be characterized (Lang 2022), which is beyond the scope of this paper.

Early modern practitioners sought to make these terms easier to understand by compiling dictionaries. As in other sciences that developed significantly during the early modern period, lexicography was a key component in the professionalization of chymistry. The professionalization of alchemy and the natural sciences during the early modern period, particularly with the 16th-century print revolution, marked a turning point in the standardization of alchemical knowledge through handbooks and dictionaries. Although terms were not used uniformly across all alchemical texts, their inclusion in dictionaries established reference points, contributing to the gradual standardization of terminology or, at the very least, offering insight into prevalent terms. This process mirrors developments in other early scientific disciplines, where standardizing practices played a key role in professionalization (Korenjak 2024).

The two dictionaries encoded in TEI and presented in this paper are part of this effort, reflecting attempts to codify and unify alchemical terminology. Likely intended as tools for practitioners, they offer valuable insights into how alchemical knowledge was curated, transmitted, and disseminated over time. They also provide a means to examine how early modern lexicographers sought to impose order on the linguistic complexity of alchemical language. Fortunately, numerous historical dictionaries of this kind have been digitized, including Martin Ruland’s Lexicon Alchemiae (Ruland, 2021) and Sommerhoff’s German-Latin dictionary (Sommerhoff, 2021). Both contain numerous headwords with definitions alternating between German and Latin, reflecting early efforts to standardize alchemical terminology in a manner similar to developments in other early sciences.

The Lexicon Alchemiae (1612) is a chymical dictionary comprising approximately 500 pages and ca. 3200 entries, although many are exceedingly long. It is dedicated to Heinrich Julius, Duke of Braunschweig and Lüneburg. The Lexicon Alchemiae is a cornerstone of the alchemical tradition, reflecting the state of the alchemical lexicographical endeavour in the early 17th century. The letter of dedication outlines the author’s intention to impose order on the “Babylonian confusion” of alchemical terminology and promote the study of alchemy.

The entries vary significantly in scope, ranging from simple explanations to comprehensive lists of synonyms, detailed lexicon entries, and even short scholarly treatises. Many entries are subdivided for structure. While Latin is the dominant language, the dictionary includes numerous German translations and explanations. The book was likely designed to meet a widespread need for terminological clarification in the field of alchemy. The imperial privilege prohibiting unauthorized reprints for ten years suggests it was expected to achieve significant commercial success.[3]

The bilingual Lexicon pharmaceutico-chymicum Latino-Germanicum et Germanico-Latinum was authored by the German pharmacist Johann Christoph Sommerhoff and first published in Nuremberg in 1701. It includes Latin-German and German-Latin sections, with approximately 12,000 entries in the former (~400 pages) and 5,500 shorter entries in the latter (~100 pages). Its comprehensive approach and bilingual nature make it a valuable resource for understanding the historical development of alchemical terminology and its intersections with the wider worlds of medicine, metallurgy and natural philosophy. Sommerhoff’s dictionary also points to the growing role of apothecaries and the professionalization of pharmaceutical work, where such terminology appeared as part of pharmacopoeias.

It is a rare and early example of a scientific lexicon with a near-symmetrical relationship between Latin and German. While the Latin-German section is more extensive, both lexica are alphabetized independently. Vernacular German material may have been included to support users with limited Latin proficiency, such as pharmaceutical apprentices lacking technical Latin knowledge, as the frontmatter claims. The Latin-German section often cites sources, indicating that much of the compiled knowledge was drawn from earlier works. Rather than creating entirely original definitions, early modern authors of compendia synthesized existing knowledge to provide what they considered the most useful resource (Blair 2010).

Sommerhoff’s dictionary covers pharmaceutical, zoological, botanical, mineralogical, (al)chymical, and medical terminology, and its terms are supplemented with a rich array of alchemical symbols, often serving as abbreviations. These symbols play a prominent role in the text, complementing linguistic terms or sometimes replacing them entirely. Across the first 22 pages, 618 symbols were encoded in detail, including approximately 280 unique symbols, many of which were mapped to the Unicode blocks “Alchemical Symbols” (U+1F700–U+1F77F) and “Miscellaneous Symbols” (U+2600–U+26FF). Although this symbol encoding is limited to 22 pages, the dataset still contains several hundred symbols, which could serve as a valuable resource for training datasets or potential machine learning applications.[4]
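The two Unicode blocks named above can be checked programmatically. The following sketch classifies characters by block and counts them in a string; the function names and the sample string are assumptions for illustration, and the code does not reproduce the mapping actually used in the dataset.

```python
import unicodedata
from collections import Counter
from typing import Optional

# Code point ranges of the two blocks; range() excludes its upper bound.
ALCHEMICAL_SYMBOLS = range(0x1F700, 0x1F780)   # U+1F700 to U+1F77F
MISCELLANEOUS_SYMBOLS = range(0x2600, 0x2700)  # U+2600 to U+26FF


def symbol_block(char: str) -> Optional[str]:
    """Return the name of the block a character belongs to, or None."""
    code_point = ord(char)
    if code_point in ALCHEMICAL_SYMBOLS:
        return "Alchemical Symbols"
    if code_point in MISCELLANEOUS_SYMBOLS:
        return "Miscellaneous Symbols"
    return None


def count_by_block(text: str) -> Counter:
    """Count the characters of a text that fall into either block."""
    blocks = (symbol_block(char) for char in text)
    return Counter(block for block in blocks if block is not None)


# Invented sample string containing three symbols and ordinary letters.
sample = "\u263F cum \U0001F70D et \u2609"
for char in sample:
    block = symbol_block(char)
    if block is not None:
        print(f"U+{ord(char):04X}", unicodedata.name(char, "unnamed"), "|", block)
print(count_by_block(sample))
```

Symbols that are not in Unicode at all would need a different representation and are deliberately out of scope for this sketch.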

This paper has introduced TEI-encoded editions of the Ruland and Sommerhoff dictionaries. These resources may be reused in other projects, such as linking them to digital scholarly editions to provide contemporary definitions for historical terms.[5] They may also support computational approaches to disambiguate, interpret, and contextualize specialist terminology. Numerous historical dictionaries remain outside this dataset. Expanding the dataset to include them would enable a comparative analysis of terminology, its evolution, and the transition of alchemical knowledge from personalized usage to a more standardized expert vocabulary, reflecting this historical shift in cultures of knowledge production.

Notes

[1] This data paper refers to the version of the dataset on Zenodo, not primarily to the GitHub repository, which contains more materials and information beyond the dataset discussed here.

[2] This scholarly trend, termed “an alchemical revolution” (Reardon 2011) or “The New Historiography of Alchemy” (Martinón-Torres 2011), has encouraged scholars to refer to early modern practices as chymistry, the term commonly used by practitioners of the period and thus the more authentic designation (Principe/Newman 2001).

[3] The Lexicon Alchemiae was encoded in TEI-XML following the dictionary module in the context of a project seminar at the University of Graz. The significant variation in entry length and structure required a flexible approach to encoding, ranging from automated markup using regular expressions to manual correction. To structure the entries, textual characteristics such as the presence of “id est” (indicating a definition) were used to identify the beginning of new entries. However, not all entries follow a consistent structure. OCR junk (e.g., “digitized by Google”) was removed, and some OCR errors were corrected, though the text has not been fully proofread. German text insertions (whether single-word synonyms, general text, or definitions in German) were encoded using <cit> and <quote>. Landscape tables and diagrams on some pages were encoded as <list>. The regex search-and-replace workflow was relatively effective for short entries but became error-prone with longer ones, which had to be corrected manually later.
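The “id est” heuristic described in note [3] can be sketched in a few lines of Python. This is a hypothetical reconstruction for illustration, not the regular expression used in the project: the pattern, the function name, and the sample lines (invented placeholder text, not quotations from Ruland) are all assumptions.

```python
import re
from typing import Dict, List

# A line counts as the start of an entry if it begins with a capitalised
# headword that is followed, optionally after a comma, by "id est".
ENTRY_START = re.compile(r"^(?P<lemma>[A-Z][\w ]*?),?\s+id est\b")


def split_entries(lines: List[str]) -> List[Dict[str, str]]:
    """Group lines into entries; unmatched lines continue the previous entry."""
    entries: List[Dict[str, str]] = []
    for line in lines:
        match = ENTRY_START.match(line)
        if match:
            entries.append({"lemma": match.group("lemma"), "text": line})
        elif entries:
            # Continuation line: append it to the entry currently being built.
            entries[-1]["text"] += " " + line
        # Lines before the first detected entry are ignored in this sketch.
    return entries


# Invented placeholder lines imitating the layout of dictionary text.
sample_lines = [
    "Alpha, id est definitio prima.",
    "Beta gamma id est definitio secunda,",
    "quae in duas lineas dividitur.",
    "Delta, id est definitio tertia.",
]

for entry in split_entries(sample_lines):
    print(entry["lemma"], "->", entry["text"])
```

The sketch also shows why such a heuristic degrades on long entries: any continuation line that happens to begin with a capitalised word followed by “id est” would be split off as a new entry and need manual correction.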

[4] Given that some of these symbols are not included in Unicode, this dataset may equally serve as a contribution to data modeling. Although the book includes front matter such as title pages, an author portrait, bilingual prefaces, and laudatory poems, as well as errata at the end, the encoding focused solely on the dictionary entries. The front and back matter have not been encoded in detail using TEI. The rest of the dictionary was encoded in a simple implementation of the TEI dictionary guidelines: <entry> elements encapsulate each lexicon entry. <form> with @type="lemma" was used to mark headwords. Variants were marked with @type="variant". <sense> and <def> elements were used for prose definitions, while <cit> elements captured translations.
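The element structure described in note [4] can be read with Python’s standard library. The sample entry below is a hypothetical reconstruction based only on the element and attribute names mentioned above: its content, its xml:id value, and the exact nesting of <cit> and <quote> are assumptions, and the actual files may be organised differently.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Union

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical entry using the elements named in note [4]; not from the dataset.
SAMPLE_ENTRY = """
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="example-entry-1">
  <form type="lemma">Lemma</form>
  <form type="variant">Lemmma</form>
  <sense>
    <def>Definitio exempli gratia.</def>
    <cit><quote xml:lang="de">Beispiel</quote></cit>
  </sense>
</entry>
"""


def text_of(element: ET.Element) -> str:
    """Concatenate all text inside an element, including nested markup."""
    return "".join(element.itertext()).strip()


def read_entry(entry: ET.Element) -> Dict[str, Union[Optional[str], List[str]]]:
    """Extract headword, variants, definitions, and translations."""
    lemma = entry.find("tei:form[@type='lemma']", TEI_NS)
    variants = entry.findall("tei:form[@type='variant']", TEI_NS)
    definitions = entry.findall(".//tei:def", TEI_NS)
    translations = entry.findall(".//tei:cit/tei:quote", TEI_NS)
    return {
        "lemma": text_of(lemma) if lemma is not None else None,
        "variants": [text_of(item) for item in variants],
        "definitions": [text_of(item) for item in definitions],
        "translations": [text_of(item) for item in translations],
    }


record = read_entry(ET.fromstring(SAMPLE_ENTRY))
print(record)
```

Searching with `.//` keeps the sketch independent of how deeply <def> and <cit> are nested inside an entry.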

[5] Those less familiar with digital humanities may wonder why scholars would link to this dataset rather than simply refer to a page in a digitized source or OCR text. Unlike linking to a scanned page or raw OCR output, linking to the TEI encoding allows direct referencing of individual dictionary entries as semantically contextualized text strings. In a Digital Scholarly Edition (DSE), this would, for instance, make it possible to implement a tooltip feature that uses the TEI files to retrieve contemporary dictionary definitions for specific terms automatically. Machine-translated English versions, clearly indicated as such, could support users unfamiliar with German or Latin by providing an accessible version for quick reference. In contrast, if a DSE only links to a scanned page, the user would need to locate the relevant entry manually in an external source and would need to understand the language in question. The dataset already contains structures such as IDs, which allow automatic referencing and integration into digital tools. For scholars who do not possess the technical expertise to integrate TEI into their digital scholarly editions, linking to a file containing structured TEI data in a GitHub repository is admittedly not particularly useful, but this does not diminish the dataset’s value for the DH community.
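A tooltip lookup of the kind described in note [5] could be sketched as follows, assuming that each <entry> carries an xml:id. The ID values, the sample markup, and the function names are hypothetical and serve only to show the principle of resolving an entry by its identifier.

```python
import xml.etree.ElementTree as ET
from typing import Dict

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI fragment with invented IDs and placeholder content.
SAMPLE_TEI = f"""
<body xmlns="{TEI}">
  <entry xml:id="demo-001">
    <form type="lemma">Lemma primum</form>
    <sense><def>Definitio prima.</def></sense>
  </entry>
  <entry xml:id="demo-002">
    <form type="lemma">Lemma secundum</form>
    <sense><def>Definitio secunda.</def></sense>
  </entry>
</body>
"""


def build_index(root: ET.Element) -> Dict[str, Dict[str, str]]:
    """Map each entry's xml:id to its headword and first definition."""
    index: Dict[str, Dict[str, str]] = {}
    for entry in root.iter(f"{{{TEI}}}entry"):
        entry_id = entry.get(XML_ID)
        if entry_id is None:
            continue  # entries without an ID cannot be referenced
        lemma = entry.find(f"{{{TEI}}}form[@type='lemma']")
        definition = entry.find(f".//{{{TEI}}}def")
        index[entry_id] = {
            "lemma": "".join(lemma.itertext()).strip() if lemma is not None else "",
            "definition": (
                "".join(definition.itertext()).strip()
                if definition is not None else ""
            ),
        }
    return index


def tooltip(index: Dict[str, Dict[str, str]], entry_id: str) -> str:
    """Return the text a tooltip could display for the given entry ID."""
    record = index.get(entry_id)
    if record is None:
        return f"No entry found for '{entry_id}'"
    return f"{record['lemma']}: {record['definition']}"


entry_index = build_index(ET.fromstring(SAMPLE_TEI))
print(tooltip(entry_index, "demo-001"))
print(tooltip(entry_index, "demo-999"))
```

Building the index once and resolving IDs from a dictionary keeps each lookup cheap, which matters if an edition resolves many terms per page.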

Acknowledgements

Rosa-Maria Mayer (2022) and Ines Lesjak (2023), participants in the University of Graz project seminar (which is part of the Digital Humanities master’s program), were instrumental in creating the first version of this dataset.

Furthermore, creating this dataset would not have been possible without the high-quality HTR transcription created by Stefan Zathammer in the context of the NOSCEMUS ERC project at the University of Innsbruck.

Competing Interests

The author has no competing interests to declare.

Author Contributions

Sarah Lang

Conceptualization

Data curation

Investigation

Methodology

Project administration

Supervision

Validation

Writing – original draft

Writing – review & editing

DOI: https://doi.org/10.5334/johd.303 | Journal eISSN: 2059-481X
Language: English
Submitted on: Jan 9, 2025
Accepted on: Feb 21, 2025
Published on: Mar 10, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Sarah Lang, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.