Allacci Digitale: A Historical Dataset for Early Modern Italian Drama

Luca Giovannini; Giorgia Gallucci

doi:10.5334/johd.250

(1) Overview

Repository location

https://doi.org/10.5281/zenodo.10972669.

Context

The Drammaturgia, compiled by Roman scholar Leone Allacci and published by Mascardi in 1666, represents a fundamental document in the history of early modern Italian drama. One of the first extensive bibliographies of theatrical works in the Italian language, it included not only a vast list of plays in alphabetical order, but also various indexes on the writers’ names and surnames, their provenance, the works’ topics, and other information.

One century later, the lack of reprints and the difficulty in finding copies of this essential catalogue motivated a group of Venetian intellectuals – and most notably poet and librettist Apostolo Zeno – to prepare a new updated edition, published by Pasquali in 1755 (Pensa, 1983; Zanandrea, 1991). In this version, the original indexes were merged into the main list, now richer in details and containing even more pieces of information provided by the new compilers.

This later edition of Drammaturgia is characterised by a coherent and repetitive structure. In most entries (see Figure 1), the title is followed by a full stop and then the genre, often accompanied by metric indications in parentheses; two hyphens precede the publishing specifics (city, printer, year, and format), while references to the author and his origins complete the entry.

Figure 1

A sample page from Allacci’s Drammaturgia.

Due to its standardized structure, the Drammaturgia is inherently well-suited for conversion into a digital database; in this way, the academic community can be provided with convenient access to a valuable asset for investigating early modern Italian drama. Complementing the general overview of the project presented in Giovannini and Gallucci (2024), this data paper offers a technical report on this attempt at ‘low-resources database prototyping’.

(2) Method

Steps

The pipeline begins with two digitised copies of Allacci’s Drammaturgia from the library holdings of the University of Turin¹ and the University of Western Ontario.² The scans of the pages, available as PDF files, were transcribed automatically using the Transkribus software (Kahle et al., 2017) and its default Print M1 model for printed texts.³ Despite the overall good quality of the images, several manual adjustments were needed to properly define the page areas to be recognised. The result of the OCR process (a large TXT file) was subsequently reviewed through multiple rounds of manual corrections to fix common transcription mistakes. Since the goal was not to deliver a diplomatic transcription of the Drammaturgia, but rather to prepare it for database conversion, we did not strictly follow philological practices (e.g., in some instances, we normalised spellings to modern Italian).

In the next phase, we retrieved the most relevant information from each entry of the Drammaturgia. Given the text’s near-tabular format and its limited variability, we decided against more advanced NLP techniques in favour of a lightly-supervised approach (LSA), i.e., we limited ourselves to defining and continually refining extraction rules tailored to the specific needs of the project (Bonora & Pompilio, 2021, 198).

We therefore implemented a series of regex searches via Python⁴ to capture several key fields in each entry: title, subtitle, author, genre, meter (prose/verse), place of publication, performance location, publisher, year, and typographical format. Furthermore, we flagged translated plays and texts with a musical component (libretti, melodrammi, etc.). The regexes were iteratively modified and improved to prevent wrong extractions. This involved, for instance, removing expressions that confused the parser (e.g. honorifics before names) or directly mapping some specific wordings to a certain output (e.g. senza Luogo to location = None). Table 1 offers an overview of the extraction patterns, while Figure 2 shows the results of such extraction on a sample entry.

Table 1

Extraction patterns.

FIELD	DESCRIPTION	REGEX LOGIC	VALUE TYPE
Entry	Full text of the entry	None	string
Title	Main title	All text until the first full stop.	string
Subtitle	Subtitle or alternative title	Text between an expression indicating a potential subtitle (o vero, o sia) and the first full stop	string
Author	Author, writer, librettist	A dash, di, Poesia di + following two words (≈name/surname)	string
Genre	Dramatic genre (as indicated in the entry)	Text between the first full stop and the second full stop or parenthesis	string
City	Place of publication (city or town)	in + following word	string
Location	Physical location of first recorded performance (usually, a theatre)	First two/three words after Teatro di	string
Publisher	Publisher, printer, or typographer	pointer⁷ + following two words (≈name/surname)	string
Year	Year of publication or performance	in + yyyy, else the first yyyy found	integer
Format	Typographical format (quarto, octavo, etc.)	in + one/two-digit number	integer
Mode	Poetry or prose	Find prosa for prose; versi/ottava rima for verse	string
Translation	Only direct translations are considered, not adaptations	If the entry contains translation-related language (tradot-, traduz-), mark as True	boolean
Libretto	Indication of the ‘musical’ nature of the work	If the entry contains music-related language (per/in M/musica-), mark as True	boolean
Composer	For libretti: author of the score	Musica di/del + following two words (≈name/surname)	string

Figure 2

A sample entry; extracted fields are marked.
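A few of the patterns in Table 1 can be sketched in Python as follows. This is a minimal illustration written for this description, not the code from the project notebook: the sample entry is invented to mimic the structure described above (it is not a quotation from the Drammaturgia), the helper names `first` and `parse_entry` are hypothetical, and only three of the publisher pointers listed in note 7 are included.

```python
import re

# Invented entry mimicking the structure described in the text;
# it is NOT a quotation from the Drammaturgia.
SAMPLE = (
    "Il Finto Pastore. Favola pastorale (in versi). -- in Venezia, "
    "per Mario Rossi 1620. in 12. - di Carlo Bianchi Fiorentino."
)

# Two capitalised words, used as a rough stand-in for name + surname.
NAME = r"([A-Z]\w+\s+[A-Z]\w+)"


def first(pattern, text, cast=str):
    """Return the first captured group of `pattern` in `text`, or None."""
    match = re.search(pattern, text)
    return cast(match.group(1)) if match else None


def parse_entry(entry):
    """Extract a subset of the Table 1 fields from one entry (hypothetical helper)."""
    # 'senza Luogo' (no place given) is mapped directly to None.
    no_place = re.search(r"senza\s+luogo", entry, re.IGNORECASE)

    if re.search(r"\bprosa\b", entry, re.IGNORECASE):
        mode = "prose"
    elif re.search(r"\bversi\b|\bottava rima\b", entry, re.IGNORECASE):
        mode = "verse"
    else:
        mode = None

    # Year: 'in' + yyyy, else the first yyyy found.
    year = first(r"\bin\s+(\d{4})\b", entry, int)
    if year is None:
        year = first(r"\b(\d{4})\b", entry, int)

    return {
        "title": first(r"^(.*?)\.", entry),
        "genre": first(r"^.*?\.\s*(.*?)\s*[.(]", entry),
        "city": None if no_place else first(r"\bin\s+([A-Z]\w+)", entry),
        "publisher": first(r"\b(?:per|presso|appresso)\s+" + NAME, entry),
        "year": year,
        "format": first(r"\bin\s+(\d{1,2})\b", entry, int),
        "author": first(r"-\s*(?:Poesia\s+)?di\s+" + NAME, entry),
        "mode": mode,
        "translation": bool(re.search(r"tradot|traduz", entry, re.IGNORECASE)),
        "libretto": bool(re.search(r"\b(?:per|in)\s+musica", entry, re.IGNORECASE)),
    }


result = parse_entry(SAMPLE)
for field, value in result.items():
    print(f"{field}: {value}")
```

Requiring a capital letter after "in" is one simple way to keep "in versi" from being read as a city; real entries are likely to need further special cases of the kind mentioned above.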

The results of the extraction pipeline were then exported in CSV format and loaded into OpenRefine,⁵ a state-of-the-art tool for data wrangling. There, they underwent a long and thorough cleaning process, which was meant to correct extraction errors, disambiguate between items with similar spellings, and transform data – when appropriate – into Linked Open Data. Using OpenRefine’s reconciliation function, we were able to match a sizable number of items across five categories (author, city, location, publisher, composer) with the corresponding Wikidata items.⁶
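To give an idea of how items with similar spellings can be grouped before manual review, the sketch below implements a simplified key-collision step, loosely modelled on the "fingerprint" clustering method that OpenRefine offers. The text does not state which clustering method was actually used, so this is an assumption made for illustration; the function names and the spelling variants are likewise hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalised key: ASCII, lowercase, no punctuation, sorted unique tokens."""
    ascii_value = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = re.sub(r"[^\w\s]", "", ascii_value.lower())
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a fingerprint; keep only groups with several spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Illustrative spelling variants, not taken from the dataset.
variants = [
    "Apostolo Zeno",
    "Zeno, Apostolo",
    "apostolo  zeno.",
    "Leone Allacci",
    "Allacci, Leone",
    "Pasquali",
]

clusters = cluster(variants)
for group in clusters:
    print(group)
```

Each resulting group is only a candidate for merging; deciding whether the variants really denote the same person, place, or printer remains a manual step.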

The final version of the dataset contains 6024 plays by 2033 different authors (plus 248 composers), covering a timeframe of roughly 300 years, from 1449 to 1755. From a geographical point of view, the pieces indexed in the Allacci Digitale were printed in at least 130 different Italian and European cities and performed in at least 109 different historical theatres.

Sampling strategy

Not applicable. The whole text of the Drammaturgia, including the ‘Aggiunte e correzioni’ (‘Additions and corrections’) appendix, has been converted into the database.

Quality control

As described above, the quality of the database was improved at every step (initial TXT, intermediate CSV/JSON, final CSV/JSON) through rounds of manual curation.

(3) Dataset Description

Repository name

Zenodo.

Object name

database.csv and database.json.

Format names and versions

CSV and JSON files.

Creation dates

The project began in early 2024. The current version of the dataset was released on 2024/05/28.

Dataset creators

Luca Giovannini (Potsdam/Padova): conceptualisation, data curation; Giorgia Gallucci (Padova): conceptualisation, data curation.

Language

English and Italian.

License

Creative Commons Attribution 4.0 International.

Publication date

First release: 2024/05/28.

(4) Reuse Potential

The Allacci Digitale database is aimed at two different audiences. On the one hand, it is meant to help traditional scholars of Italian literature in exploratory bibliographical research. To this end, a simple bilingual website (https://allacci-digitale.github.io) has been built, featuring a search mask that allows users to perform complex searches, visualise the results, and download them as a CSV table.

On the other hand, these data can be used by computational literary studies (CLS) scholars to begin reconstructing a quantitative history of early modern Italian drama. The wealth of information available can be used to track the popularity of genres and authors over time, but also to explore which publishing centres and companies were most relevant to the dissemination of texts.
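As a minimal sketch of the kind of quantitative query described here, the following Python snippet counts plays per genre and decade. The column names `genre` and `year` are assumptions about the layout of database.csv, and the rows are placeholder data rather than records from the dataset.

```python
import csv
import io
from collections import Counter

# Placeholder rows; the column names are assumed, not taken from database.csv.
SAMPLE_CSV = """title,genre,year
Play A,Commedia,1612
Play B,Tragedia,1618
Play C,Commedia,1655
Play D,Commedia,
Play E,Tragedia,1659
"""


def genre_counts_by_decade(rows):
    """Count rows per (decade, genre), skipping rows without a numeric year."""
    counts = Counter()
    for row in rows:
        year = (row.get("year") or "").strip()
        if not year.isdigit():
            continue
        decade = int(year) // 10 * 10
        counts[(decade, row["genre"].strip())] += 1
    return counts


rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
counts = genre_counts_by_decade(rows)
for (decade, genre), n in sorted(counts.items()):
    print(f"{decade}s  {genre:<10} {n}")
```

To run the same count on the published file, the `io.StringIO` object could be replaced with `open("database.csv", newline="", encoding="utf-8")`, after adjusting the column names to those actually used in the dataset.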

Future developments might include the extraction of less standardised pieces of information, such as the dedicatee of a work, the author’s nationality, or the number of reprints. Given the complexity of such a task, we would set aside our rule-based approach and instead employ large language models, whose effectiveness in information extraction tasks has been increasingly demonstrated (Xu et al., 2024). We also plan to link items to further bibliographic sources on early modern theatre, such as the well-known Corago database (Pompilio, 2020).

Notes

[1] https://dl.unito.it/it (last accessed: 2024/11/18).

[2] https://archive.org/details/drammaturgia00alla (last accessed: 2024/11/18).

[3] https://readcoop.eu/model/transkribus-print-multi-language-dutch-german-english-finnish-french-swedish-etc (last accessed: 2024/11/18).

[4] See the Jupyter notebook: https://github.com/allacci-digitale/allacci-digitale.github.io/blob/main/allacci_modelling_notebook.ipynb (last accessed: 2024/11/18).

[5] https://openrefine.org (last accessed: 2024/11/18).

[6] We also tested reconciliation against the Virtual International Authority File (VIAF), but results were less satisfying than with Wikidata.

[7] Expressions used: “per”, “pel”, “presso”, “appr./appresso”, “per il/gli”, “presso il/gli”, “nella Stamperia (del)”, “nella Stampa”, “ad istanza di/del”, “all’Insegna”.

Acknowledgements

The authors would like to thank Lara Piva (READ Coop/University of Padova) for her suggestions during the OCR phase.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Luca Giovannini: conceptualization, data curation, investigation, methodology, project administration, software, writing – original draft; Giorgia Gallucci: conceptualization, data curation, investigation, writing – review and editing.