(1) Overview
Repository location
Context
The dataset was created as a part of the Digital Repository for South Eastern Europe in the 18th-19th centuries (DigitalSEE). DigitalSEE is a comprehensive virtual repository of historical data and their visual output aimed at various scholarly audiences and the general public (Baramova et al., 2024b). The DigitalSEE platform uses a dataset of geo-annotated excerpts from historical and archaeological documents encoded in custom-made, non-proprietary XML (Extensible Markup Language) and stored as JSON (JavaScript Object Notation) and XML files. The types of documents serving as sources for the initial encoding of data include excerpts from historical travelogues, documentation of archaeological sites, geolocated objects described in situ, epigraphic monuments, and data from historical maps. The volume as well as the scope of the source data will expand as the project activities progress.
In addition to the publicly available Python Flask application used to generate these files from a web input form (Simeonov, 2024a) and the HuggingFace space for their in-browser visualization (Simeonov, 2024b), the dataset is stored in a findable and accessible FigShare repository (Baramova et al., 2024a). In this way, not only are the research outcomes of the project represented on the user-facing end of the DigitalSEE platform (DigitalSEE, 2025), but the auxiliary tools and the raw data created in the project workflow are also at the disposal of other researchers with similar goals and interests. However, the XML code generated for the original dataset depends heavily on the structure of the web input form and reproduces its fields in a machine-readable format. As a result, the original dataset is not in complete accordance with the FAIR (Findable, Accessible, Interoperable, Reusable) principles of data storage and management (Wilkinson et al., 2016). To address this issue, a derivative dataset has been created from the same raw files, containing interoperable and reusable XML code compliant with the TEI (Text Encoding Initiative) P5 Guidelines (TEI Consortium, 2023). For this purpose, we aligned our approach with the resources (guidelines and schemas) provided by the EpiDoc initiative, which develops a subset of TEI for the encoding of epigraphic and other historical documents in XML (Elliott et al., 2006–2022). This is the dataset presented in the current paper (Figure 1).

Figure 1
A high-level overview of the DigitalSEE (Digital South-Eastern Europe) workflow, showing how the Flask-based data management system mediates between the original XML files, their derivative TEI P5-compliant counterparts, web visualization and external repositories (HuggingFace, FigShare) to ensure collaborative, FAIR-aligned data curation.
(2) Method
The three main methods used for the creation of the dataset can be summarized as a) data analysis, b) transformation, and c) visualization. The implementation of the workflow includes the following steps.
Steps
Analyzing the data structure and the hierarchy of the source XML files (Baramova et al., 2024a), their parent, child, and sibling elements, attributes, and values.
Selecting suitable TEI-compliant analogues to the respective elements, attributes, and values.
Creating a new TEI-compliant template, valid against the EpiDoc RNG (RELAX NG, REgular LAnguage for XML Next Generation) schema for encoding historical documents and consistent with the tagset of the source XML code.
Developing a Python xml.etree.ElementTree (Lundh, 1999–2008) algorithm for the automated transformation of the source XML code into its TEI-compliant counterpart, following the generated template and the EpiDoc RNG schema; a minimal sketch of such a transformation is given after this list. The Python script generates TEI-compliant XML in a manner analogous to the XSLT (Extensible Stylesheet Language Transformations) scenarios in the workflows of other projects, while allowing for more flexibility and faster processing of the source files.
Inspecting and manually correcting the output XML code after the transformation when needed.
Creating a GitHub repository to store the output XML files in their final versions validated against the EpiDoc RNG schema.
Developing a visualization tool for the dataset repository using a Streamlit low-code solution (Simeonov, 2025).
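The following is a minimal sketch of the transformation step referred to above, using only the Python standard library. The source element names (<title>, <excerpt>) and the reduced TEI skeleton are hypothetical simplifications of the actual DigitalSEE tagset and omit most of the metadata mappings.

# A minimal, hypothetical sketch of the source-to-TEI transformation.
# The source tags <title> and <excerpt> and the reduced TEI skeleton are
# illustrative simplifications, not the actual DigitalSEE mapping.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI elements without a prefix


def tei(tag: str) -> str:
    """Return the namespace-qualified TEI tag name."""
    return f"{{{TEI_NS}}}{tag}"


def transform(source_path: str, target_path: str) -> None:
    src = ET.parse(source_path).getroot()

    root = ET.Element(tei("TEI"))
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    # Hypothetical mapping: the form-derived <title> becomes the TEI <title>.
    ET.SubElement(title_stmt, tei("title")).text = src.findtext("title", default="")

    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))
    edition = ET.SubElement(body, tei("div"), {"type": "edition"})
    # Hypothetical mapping: the form-derived <excerpt> becomes the edition text.
    ET.SubElement(edition, tei("p")).text = src.findtext("excerpt", default="")

    ET.ElementTree(root).write(target_path, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    transform("source.xml", "tei_output.xml")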
The resulting TEI-compliant XML files have the following structure, validated against the EpiDoc schema. All the meta- and paradata are contained in the <teiHeader> element, while the sources are given in the <text> element. The latter contains several <div> child elements with @type values such as edition, commentary, and bibliography, containing textual information from, and references to, different levels of sources (primary, secondary, paratexts, etc.). The metadata in the <teiHeader> element, all contained in its <fileDesc> child element, are subdivided into the title and authorship of the XML file (<titleStmt>), information about the project and the copyright (<publicationStmt>), and the metadata of the source (<sourceDesc>, with <msDesc> as its main child element). Within the latter, the richest information is contained in the <history> child element, where the source’s origin, as well as its current or previous locations (if any), are described according to the EpiDoc standards: <origin>, <provenance type="found">, <provenance type="observed">, and their respective sub-elements containing links to gazetteers such as Pleiades1 or GeoNames2. Further details about how the metadata are encoded in such EpiDoc-adherent files (a format also used by the project team in similar previous initiatives) can be found in Iliev (2024).
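To illustrate how this provenance information can be read back out of the files, the sketch below again uses xml.etree.ElementTree; the <placeName> sub-elements with gazetteer @ref values are an assumption about the encoding, as is the example file name.

# A sketch for reading the provenance metadata described above back out of a
# file in the dataset; the <placeName> sub-elements with gazetteer @ref values
# are an assumption, as is the example file name.
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def provenance_places(path: str) -> list[tuple[str, str, str]]:
    """Return (provenance type, place name, gazetteer reference) triples."""
    root = ET.parse(path).getroot()
    triples = []
    for provenance in root.findall(".//tei:history/tei:provenance", NS):
        for place in provenance.findall(".//tei:placeName", NS):
            triples.append(
                (provenance.get("type", ""), (place.text or "").strip(), place.get("ref", ""))
            )
    return triples


if __name__ == "__main__":
    for prov_type, name, ref in provenance_places("tei_output.xml"):
        print(f"{prov_type}: {name} -> {ref}")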
Sampling strategy
Random Selection
A random subset was picked out to reflect the overall diversity in the types of documents used to produce the dataset, as well as their temporal coverage and structural complexity.
Selected inclusion
Specific documents with more complex formatting or structure were selected to inform the design of the TEI-compliant structure, as well as to illustrate the algorithm’s capacity to handle challenging cases.
This strategy helped ensure that our examples represent both the uniform application of the algorithm and the varied characteristics of the original data.
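Assuming the source files reside in a single folder and that a handful of structurally complex documents are known by name, the mixed sample described above could be drawn along the following lines (folder and file names are purely illustrative).

# A sketch of the mixed sampling strategy: a random subset plus hand-picked,
# structurally complex files. The folder and file names are illustrative.
import random
from pathlib import Path

SOURCE_DIR = Path("source_xml")
HAND_PICKED = {"complex_travelogue.xml", "multi_provenance_site.xml"}  # illustrative

all_files = {p.name for p in SOURCE_DIR.glob("*.xml")}
random_subset = set(random.sample(sorted(all_files), k=min(10, len(all_files))))
sample = sorted(random_subset | (HAND_PICKED & all_files))
print(sample)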
Quality control
Automated Schema Validation
All output files are validated against the EpiDoc RNG schema to ensure compliance with the TEI guidelines (Elliott et al., 2006–2022). This step ensures the structural and syntactic validity of the files in the dataset.
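A minimal sketch of this validation step is given below, assuming a local copy of the EpiDoc RELAX NG schema and the third-party lxml library; the project’s own validation tooling is not prescribed here.

# A minimal validation sketch using the third-party lxml library and a local
# copy of the EpiDoc RELAX NG schema (the file name is an assumption).
from lxml import etree

schema = etree.RelaxNG(etree.parse("tei-epidoc.rng"))
document = etree.parse("tei_output.xml")

if schema.validate(document):
    print("Valid against the EpiDoc schema.")
else:
    for error in schema.error_log:
        print(f"line {error.line}: {error.message}")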
Manual Inspection and Correction
After the automated transformation, the files are manually reviewed for anomalies missed by the automated checks.
Version Control and Traceability
The final versions of the TEI-compliant files are stored in a GitHub repository, ensuring that any changes or corrections are documented and can be tracked (Chen et al., 2024; Ram, 2013). To keep the FigShare dataset current with these updates, we use FigShare’s automated integration with GitHub, which synchronizes the dataset with new releases of the repository. This process automatically generates a new version of the dataset under the same DOI (e.g., v1, v2, etc.).
(3) Dataset Description
Repository name
DigitalSEE (Digital South-Eastern Europe): TEI Collection
Object name
Bestroi150-digitalsee-tei-collection-1.1.0, containing a subfolder titled /collection.
Format names and versions
The data is stored as a ZIP file containing a subfolder with TEI-compliant XML files.
Creation dates
2025-01-23
Dataset creators
Dimitar Iliev (researcher), Kristiyan Simeonov (researcher), and Ivan Valchev (researcher); all researchers are affiliated with Sofia University “St. Kliment Ohridski”.
Language
The dataset primarily uses English. The @xml:lang attribute indicates the language of the content of the element that contains an excerpt from the source, specifically <div type="edition">. The languages found in this element are English, German, French, and Latin. To support these languages, we utilise BCP 47 (Internet Engineering Task Force, 2009), with a specific focus on the ISO 639-2 standard.
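Assuming, as described above, that the @xml:lang attribute sits directly on the <div type="edition"> element, the declared languages across the collection can be tallied with a short script (the folder name is illustrative).

# A sketch for tallying the @xml:lang values declared on the edition divisions;
# the folder name is illustrative.
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # fully qualified xml:lang

languages = Counter()
for path in Path("collection").glob("*.xml"):
    root = ET.parse(path).getroot()
    for div in root.findall(".//tei:div[@type='edition']", NS):
        languages[div.get(XML_LANG, "und")] += 1

print(languages)  # e.g. counts for 'en', 'de', 'fr', 'la'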
License
CC BY 4.0
Publication date
2025-02-04
(4) Reuse Potential
The dataset is openly available to scholars in the fields of History, Archaeology, Regional Studies, Balkan Studies, and Geospatial Studies. Through the visualization tool, it can be used to yield research results that are not necessarily included in the output of the DigitalSEE website and platform, nor initially envisaged by the project team. The data can be complemented with additional XML files using the same template as the ones contained in the set and enriched with new sources according to the interests and needs of various researchers. The TEI XML data can potentially be adapted to other visualization and front-end tools. The raw code can be processed with regular expressions3 via specialized text editors, integrated development environments (IDEs), and existing online tools, or within programming languages (Fitzgerald, 2012), e.g., in a Jupyter notebook or the Streamlit tool, following the instructions in the connected GitHub repository (Simeonov, 2025); a short illustrative sketch is given below. The dataset documents material related to objects and places across the Balkans, with a particular emphasis on Bulgaria. It compiles historical travelogues by authors such as Hans Jakob Ammann, Gerhard Cornelius von Driesch, and John Newberie, providing a rich resource for research into historical narratives, regional cultural studies, and comparative analyses of travel literature. One notable case study, concerning a place featured in the dataset, is Trajan’s Gate Pass (Valchev, 2024).
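As one illustration of the regular-expression route mentioned above, the following sketch pulls gazetteer and other @ref URIs out of the raw XML without a full XML parse; the pattern and the folder name are only illustrative, not part of the published tooling.

# An illustrative regular-expression pass over the raw XML: extracting @ref
# URIs (e.g. gazetteer links) without a full XML parse. The pattern and folder
# name are only illustrative.
import re
from pathlib import Path

REF_PATTERN = re.compile(r'ref="(https?://[^"]+)"')

for path in Path("collection").glob("*.xml"):
    refs = set(REF_PATTERN.findall(path.read_text(encoding="utf-8")))
    if refs:
        print(path.name, *sorted(refs), sep="\n  ")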
A limitation of the dataset is its modest size compared to larger corpora. However, our focus has been on the depth of the information rather than on quantity. The dataset is part of an ongoing research project and is constantly updated. The DigitalSEE initiative represents a close-reading approach, involving experts who are deeply familiar with the subject matter. This differs from distant-reading techniques, which may yield larger-scale analyses, and thus our approach might be perceived as less comprehensive from a quantitative standpoint. Nevertheless, the goal of this paper is to showcase an exemplary FAIR dataset that can be used in historical data representation. It can be further expanded within the framework of the current project or, hopefully, of future ones.
Notes
[1] Pleiades gazetteer of ancient places: https://pleiades.stoa.org/.
[2] GeoNames geographical database: https://www.geonames.org/.
[3] Interactive resource for Regular Expressions: https://regexone.com/ (last accessed: 05.02.2025).
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Dimitar Iliev: Conceptualization; Supervision.
Kristiyan Simeonov: Data curation; Formal analysis; Visualization.
