(1) Overview
Repository location
The CORSMAXIX dataset, the accompanying datasheets, and the Python scripts used are available on Zenodo: 10.5281/zenodo.16789180.
Context
The data was produced as part of a second book project (2023–2029) undertaken at the Romanisches Seminar, University of Tübingen, focused on tracing the impact of health humanities policies in the Spanish cultural field at the end of the nineteenth century. This project is also linked to past and current research on the digitization and computational analysis of Spanish-language cultural magazines of the nineteenth and twentieth centuries in Spain and Latin America, conducted under the Chair of Ibero-Romanic Literature at the Romanisches Seminar. Examples include the project Literary Modernization Processes and Transnational Network Formation in the Medium of the Cultural Magazine: From “Modernismo” to Avant-Garde, and its consequent platform Revistas Culturales 2.0 (Revistas Culturales, 2025), which also engaged with debates on the challenges and possibilities of digitizing periodicals (Bingham 2010).
Emerging at a crucial moment in Spain’s history (1898, the year of defeat in the Spanish-American War and the loss of the country’s status as a colonial empire), these magazines are important not only as a crucible for national reflection on the crisis and the country’s future, but also because they bring together some of the most significant literary figures of nineteenth-century Spain (Benito Pérez Galdós, Leopoldo Alas “Clarín”) alongside representatives of the cultural renewal that would unfold in the early twentieth century (Miguel de Unamuno, Pío Baroja, Vicente Blasco Ibáñez). Despite their importance and potential for computational analysis, these magazines had not previously been standardized and converted into plain text.
All 48 issues of Vida nueva (1898–1900), La vida literaria (1899), and La vida galante (1898–1905) were downloaded in PDF format from the “Digital Periodicals Library” of the National Library of Spain (BNE), a component of the “Biblioteca Digital Hispánica” project, which was launched in 2007 and by 2025 had expanded to include 4,100 titles (Biblioteca Nacional de España, 2025). Alongside the full corpus of text files, the complementary datasheets comprise: three indexes giving each contribution’s title, contributor’s name, magazine, issue, and publication year; a fourth file listing all authors’ names, their unified IDs based on authority control sources, country of origin, and birth and death dates; a fifth datasheet pairing authors’ names as they originally appeared in the magazines with their normalized names; and a list of PIDs by magazine issue number, based on BNE identifiers.
The corpus encompasses 789 texts by 308 contributors, many of whom consistently published in the same magazines, while others contributed simultaneously to several periodicals, showing a higher level of cross-publication (Figure 1). The majority of contributors are local—157 identified Spanish authors—followed by contributors from France (21), Italy (6), the United Kingdom (4), and Germany (4) (Figure 2). A distinction emerged between active contributors—those who submitted original work to these magazines—and foreign authors whose works were translated, as European countries were mainly represented by translated authors. In contrast, Latin Americans (17), after Spaniards, constituted the second-largest group of active contributors. Of the 231 authors whose gender could be identified, only 3 are women, although female authorship may be hidden in unsigned articles or in those signed with pseudonyms or initials.

Figure 1
Distribution of authors’ publications in the three cultural magazines (1898–1899). The colors dark blue, orange, and light blue represent each magazine and the number of works published. Contributors such as Eduardo Zamacois, Enrique Gómez Carrillo, and Pedro Corominas stand out, as they wrote regular columns. Others, like Eusebio Blasco and Jacinto Octavio Picón, diversified their presence across all three magazines.

Figure 2
Number of contributors by country of origin, where Spain accounts for the majority of contributors. European authors—such as Zola, Baudelaire, Leopardi, and D’Annunzio—are primarily represented through translations, whereas Latin Americans form the second-largest group of active contributors submitting original work.
(2) Method
Steps
Text extraction, cleaning and standardization
Although OCR had been applied during the BNE digitization, text recognition accuracy varied by magazine and issue (Figure 3) (San Juan 2022; García-Villaraco & Rubiales Zabarte 2023). To extract plain text from these issues and to obtain metadata that was not available through the search tools of the BNE repository, the first step involved processing the text with the efficient proprietary toolset ABBYY FineReader (Jain, Taneja, & Taneja 2021), employing both its integrated Spanish dictionary and the “Verification Text” tool, which allows users to check and correct characters recognized with low confidence. Challenges arose in recognizing words absent from standard Spanish dictionaries (such as colloquial expressions, e.g., in the section “Aires murcianos” in La vida literaria), in identifying proper names, and in correctly recognizing nineteenth-century vocabulary now in disuse. ABBYY also struggled with uppercase letters, particularly when they appeared in different fonts at the beginning of texts. Linguistically, additional challenges included the lack of standardized Spanish orthography (e.g., interchangeable uses of “g” and “j” or “s” and “c”, or inconsistent use of accent marks on monosyllabic words). Text segments in other languages (Catalan, French, and Portuguese) posed further difficulties (Figure 4).

Figure 3
First page of issue 4 of Vida nueva, the magazine with the lowest OCR quality.

Figure 4
The “Verification Text” tool in ABBYY FineReader, applied to issue 7 of La vida galante. It allows users to manually review and confirm suggested corrections.
Regarding the somewhat problematic relationship between text layout and image (Rißler-Pipka 2017), it was impossible to process advertising pages automatically, as well as the correspondence sections of La vida literaria and Vida nueva. The small typography, minimal spacing between columns, and overall lower image quality made it necessary to use the “Draw Text Area” tool in Vida nueva to manually align the columns in the magazine with the extracted text.
After each issue was processed, 48 .txt files were obtained and then homogenized. This involved manually applying a structured format following this scheme:
_Title
_Author_
Text
One major challenge was repositioning author names, as there was no uniform structure—even within the same magazine issue—regarding their placement. Texts without an identified author were labeled as “Anonymous”, while untitled works were marked as “Untitled”. For texts affected by discontinuous text flow (such as those in Vida nueva), fragmented texts were merged and placed at their first occurrence to maintain readability.
To further refine author identification, authority control was applied. Using name authority sources such as the Virtual International Authority File (VIAF), Autoridades (BNE), and the Gemeinsame Normdatei (GND), names were verified and standardized, and full names were assigned to authors who signed only with their surnames.
Creation of Indexes and Supplementary Metadata Data Sheets
A Python script was designed to extract the contributors’ names (author_pattern: r"(.*?)_") and the title of the text (title_pattern: r"^_.*[^_]$") using regular expressions, following the previously defined schema (Figure 5). Although page indexing was not considered for this project, this functionality has been included for potential future use.

Figure 5
Code snippet of the Python script using regular expressions to extract authors and titles from the magazine issues.
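The extraction step described above can be illustrated with a minimal sketch. It is not the published Python1_Index.py script: the two patterns are assumed variants that anchor on the schema's underscores (a title line opens with an underscore, an author line is enclosed in underscores), and the function name build_index and the sample text are hypothetical.

```python
import re

# Assumed patterns derived from the schema, not copied from the published script.
TITLE_RE = re.compile(r"^_(.*[^_])$")   # "_Title": leading underscore only
AUTHOR_RE = re.compile(r"^_(.+?)_$")    # "_Author_": underscore on both sides


def build_index(issue_text):
    """Return (title, author) pairs in order of appearance in one issue."""
    rows = []
    pending_title = None
    for raw_line in issue_text.splitlines():
        line = raw_line.strip()
        author_match = AUTHOR_RE.match(line)
        title_match = TITLE_RE.match(line)
        if author_match and pending_title is not None:
            # An author line closes the title that precedes it.
            rows.append((pending_title, author_match.group(1)))
            pending_title = None
        elif title_match:
            pending_title = title_match.group(1)
    return rows


if __name__ == "__main__":
    # Invented sample text that follows the schema.
    sample = "_Título de ejemplo\n_Autor Ejemplo_\nTexto.\n"
    for title, author in build_index(sample):
        print(title, "|", author)
```

The sketch assumes that body text never begins with an underscore; a line that does would be read as a title and would need manual review.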
Information regarding the magazine, its issue, and the year of publication was extracted from the file names, which had been previously formatted according to the following structure: <Magazine(Initials)>_<Issue>_<Year>.txt. Consequently, the first three data sheets were created in the form of indexes, encompassing the contributors’ names (as they appear in the magazines), their corresponding work, the name of the magazine, its issue, and the year of publication.
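As an illustration of how such file names can be decomposed, the following sketch parses the three fields with a regular expression. The function name, the example file name VN_4_1898.txt, and the assumption that issue numbers and years are purely numeric are hypothetical and may not match the actual corpus files.

```python
import re
from pathlib import Path

# Assumed shape: <Initials>_<Issue>_<Year>.txt, with numeric issue and year.
FILENAME_RE = re.compile(
    r"^(?P<magazine>[^_]+)_(?P<issue>\d+)_(?P<year>\d{4})\.txt$"
)


def parse_issue_filename(path):
    """Return magazine initials, issue number, and year from a file name."""
    match = FILENAME_RE.match(Path(path).name)
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    return {
        "magazine": match["magazine"],
        "issue": int(match["issue"]),
        "year": int(match["year"]),
    }


if __name__ == "__main__":
    # Hypothetical file name used only for demonstration.
    print(parse_issue_filename("VN_4_1898.txt"))
```

Raising an error on unexpected names, rather than skipping them silently, makes misnamed files visible before the index is built.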
Additionally, a fourth data sheet was created to include standardized contributor names based on authority control and their author IDs, along with biographical information such as birth and death dates, gender (female or male), and country of origin. This process also required manually restructuring the way authors’ names were originally formatted in the magazines (from “First Name Last Name” to “Last Name, First Name”).
This also involved handling pseudonyms and onomastic variants of the same author, which were documented in a fifth data sheet (Contributors_Original – Contributors_Documented). In this case, the first column contained the various ways authors’ names appeared in the magazines or their pseudonyms, while the second included their unified ID based on authority control sources. In cases where a pseudonym had a documented real identity (e.g., Elleide → Lluria Despau, Enrique), the standardized name was used. However, when a pseudonym’s attribution was uncertain or debated (as in the case of “Dr. Pedro Recio de Tirteafuera” (San Juan 2016)) or when only initials were provided, they were preserved as they appeared in their original form.
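The lookup logic described above (documented pseudonyms resolved, uncertain ones preserved) might be sketched as follows. This is not the published script: the column headers are assumed to be spelled exactly Contributors_Original and Contributors_Documented, the function names are hypothetical, and the in-memory CSV stands in for the real datasheet.

```python
import csv
import io


def load_name_map(csv_file):
    """Map each original name form to its documented form (assumed headers)."""
    reader = csv.DictReader(csv_file)
    return {
        row["Contributors_Original"].strip(): row["Contributors_Documented"].strip()
        for row in reader
    }


def standardize(name, name_map):
    """Return the documented form, or the original form when none is recorded."""
    cleaned = name.strip()
    return name_map.get(cleaned, cleaned)


if __name__ == "__main__":
    # One row taken from the example in the text; the file layout is assumed.
    sample_csv = io.StringIO(
        "Contributors_Original,Contributors_Documented\n"
        'Elleide,"Lluria Despau, Enrique"\n'
    )
    name_map = load_name_map(sample_csv)
    print(standardize("Elleide", name_map))
    print(standardize("Dr. Pedro Recio de Tirteafuera", name_map))
```

Falling back to the original form keeps uncertain pseudonyms and initials untouched, which mirrors the rule stated in the text.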
A second Python script was used to replace the extracted author names in the original indexes with their corresponding standardized forms, resulting in updated versions of the first three indexes. Finally, the fourth data sheet was generated by merging the standardized author list with the extensive datasheet provided by the Revistas Culturales 2.0 project, where a considerable number of the authors were already listed (Ehrlicher 2016). For authors not found in that database, missing information was manually retrieved from authority control platforms (VIAF, BNE or DNB) whenever possible. A final Python script was used to split the contributions within each issue, using the previously applied regular expression pattern.
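The splitting step can be illustrated with a short sketch that starts a new chunk at every title line. It reuses the title pattern quoted in the text, but the function name and the sample text are hypothetical, and the published Python4_Splitted_Contributions.py may proceed differently.

```python
import re

# Title pattern as quoted in the text: leading underscore, no trailing one.
TITLE_RE = re.compile(r"^_.*[^_]$")


def split_contributions(issue_text):
    """Split one issue into contributions, each starting at a title line."""
    chunks = []
    current = []
    for line in issue_text.splitlines():
        if TITLE_RE.match(line.strip()) and current:
            # A new title closes the contribution collected so far.
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Any text before the first title line ends up as its own leading chunk.
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    # Invented sample text that follows the schema.
    sample = (
        "_Primer título\n_Autor Uno_\nTexto uno.\n"
        "_Segundo título\n_Autor Dos_\nTexto dos.\n"
    )
    for number, chunk in enumerate(split_contributions(sample), start=1):
        print(number, chunk.splitlines()[0])
```

Comparing the number of chunks per issue with the number of rows in the corresponding index is one simple way to spot a missed or spurious split.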
The graphs were also generated by Python scripts using the libraries matplotlib, pandas, and seaborn.
Quality control
In addition to the checks made with the “Verification Text” tool provided by ABBYY FineReader, quality control was performed during the standardization of issue schemas for index creation. After the contributions of each issue were split, a third, manual quality check verified text quality and accurate contribution separation.
(3) Dataset Description
Repository name
Zenodo
Object names
CORSMAXIX_Plain_Text.zip (1.48 MB);
CORSMAXIX_contributors_ID_gender_country_birth_death.xlsx (31.32 KB);
CORSMAXIX_contributors_original_and_contributors_documented.csv (14.28 KB);
CORSMAXIX_BNE_PID.xlsx (9.65 KB);
CORSMAXIX_vida_nueva_index.csv (20.20 KB);
CORSMAXIX_la_vida_galante_index.csv (8.11 KB);
CORSMAXIX_la_vida_literaria_index.csv (25.19 KB);
Python1_Index.py (2.85 KB);
Python2_Author_Documentation.py (902 Bytes);
Python3_Author_Documentation_RevCult.py (1.14 KB);
Python4_Splitted_Contributions.py (2.71 KB);
Python5_contributors_magazine.py (6.79 KB);
Python6_contributors_country.py (5.52 KB);
Format names and versions
txt, xlsx, csv, py
Version v3
Creation dates
2024-01-09 to 2025-05-01
Dataset creators
Adriana Rodríguez-Alfonso, Conceptualization, Data Curation, Programming, Validation and Writing-review and editing, University of Tübingen
Luis Giraldo González-Ricardo, Data Curation, Validation, University Hospital of Tübingen.
Language
The CORSMAXIX .txt files are in Spanish; metadata labels in the datasheets are in English.
License
CC0 1.0
Publication date
2025-07-08
(4) Reuse Potential
The CORSMAXIX dataset can be reused for both close and distant reading research in periodical studies, material culture, and nineteenth-century studies. The indexes and supplementary author datasheets also support computational analyses such as social network analysis and natural language processing, and the standardized author data facilitates linking with other periodical datasets held by libraries. Potential hidden female authorship could be explored in the future through computational stylometric analysis aimed at attributing anonymous or pseudonymous works. The Python scripts developed to extract metadata and split texts within each issue, as well as those used to generate the graphs, can also be reused by scholars in digital humanities, literary studies, and historical research.
Notes
Acknowledgements
The author would like to express her gratitude to Prof. Dr. Hanno Ehrlicher and the contributors of Revistas Culturales 2.0 for providing the standardized list of author names.
Competing Interests
The author has no competing interests to declare.
Author Contributions
Conceptualization
Data curation
Software
Writing – original draft
Writing – review & editing
Validation
