(1) Overview
Repository location
Context
The Drammaturgia, compiled by Roman scholar Leone Allacci and published by Mascardi in 1666, represents a fundamental document in the history of early modern Italian drama. One of the first extensive bibliographies of theatrical works in the Italian language, it included not only a vast list of plays in alphabetical order, but also various indexes on the writers’ names and surnames, their provenance, the works’ topics, and other information.
One century later, the lack of reprints and the difficulty in finding copies of this essential catalogue motivated a group of Venetian intellectuals – and most notably poet and librettist Apostolo Zeno – to prepare a new updated edition, published by Pasquali in 1755 (Pensa, 1983; Zanandrea, 1991). In this version, the original indexes were merged into the main list, now richer in details and containing even more pieces of information provided by the new compilers.
This later edition of Drammaturgia is characterised by a coherent and repetitive structure. In most entries (see Figure 1), the title is followed by a full stop and then the genre, often accompanied by metric indications in parentheses; two hyphens precede the publishing specifics (city, printer, year, and format), while references to the author and his origins complete the entry.

Figure 1
A sample page from Allacci’s Drammaturgia.
Thanks to its standardised structure, the Drammaturgia is well suited to conversion into a digital database, which gives the academic community convenient access to a valuable resource for investigating early modern Italian drama. Complementing the general overview of the project presented in Giovannini and Gallucci (2024), this data paper offers a technical report on this attempt at ‘low-resources database prototyping’.
(2) Method
Steps
The pipeline begins with two digitised copies of Allacci’s Drammaturgia from the library holdings of the University of Turin[1] and the University of Western Ontario.[2] The scans of the pages, available as PDF files, were transcribed automatically using the Transkribus software (Kahle et al., 2017) and its default Print M1 model for printed texts.[3] Despite the overall good quality of the images, several manual adjustments were needed to properly define the page areas to be recognised. The result of the OCR process (a large TXT file) was subsequently reviewed through multiple rounds of manual corrections to fix common transcription mistakes. Since the goal was not to deliver a diplomatic transcription of the Drammaturgia, but rather to prepare it for database conversion, we did not strictly follow philological practices (e.g., in some instances, we normalised spellings to modern Italian).
In the next phase, we retrieved the most relevant information from each entry of the Drammaturgia. Given the text’s near-tabular format and its limited variability, we decided against more advanced NLP techniques in favour of a lightly-supervised approach (LSA), i.e., we limited ourselves to the definition and ongoing refinement of extraction rules tailored to the specific project needs (Bonora & Pompilio, 2021, p. 198).
We therefore implemented a series of regex searches in Python[4] to capture several key fields in each entry: title, subtitle, author, genre, meter (prose/verse), place of publication, performance location, publisher, year, and typographical format. Furthermore, we flagged translated plays and texts with a musical component (libretti, melodrammi, etc.). The regexes were iteratively modified and improved to reduce extraction errors. This involved, for instance, removing expressions that confused the parser (e.g. honorifics before names) or directly mapping some specific wordings to a certain output (e.g. senza Luogo to location = None). Table 1 offers an overview of the extraction patterns, while Figure 2 shows the results of such extraction on a sample entry.
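The two kinds of adjustment just mentioned (dropping honorifics and mapping fixed wordings to a fixed output) might look roughly like the sketch below. It is an illustration only, not the code of the project notebook: the honorific list, the function names, and the sample strings are assumptions made for the example.

```python
import re

# Hypothetical honorifics; the project's actual list is not given in the text.
HONORIFICS = ["Signor", "Sig.", "Conte", "Abate", "Dottor", "Cavalier"]

# One alternation matching any honorific followed by whitespace.
HONORIFIC_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(h) for h in HONORIFICS) + r")\s+"
)

# Fixed wordings mapped directly to an output value (keys are lower-cased).
PLACE_OVERRIDES = {"senza luogo": None}


def strip_honorifics(entry):
    """Remove honorifics so that name rules see 'di Name Surname'."""
    return HONORIFIC_RE.sub("", entry)


def map_place(raw):
    """Return the cleaned place name, or None for 'senza Luogo'."""
    key = raw.strip().lower()
    if key in PLACE_OVERRIDES:
        return PLACE_OVERRIDES[key]
    return raw.strip()


# Invented sample strings, for illustration only.
cleaned = strip_honorifics("di Sig. Mario Rossi")
place = map_place("senza Luogo")
```

Keeping the overrides in a dictionary means that further special wordings can be added without touching the extraction rules themselves.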
Table 1
Extraction patterns.
| FIELD | DESCRIPTION | REGEX LOGIC | VALUE TYPE |
|---|---|---|---|
| Entry | Full text of the entry | None | string |
| Title | Main title | All text until the first full stop. | string |
| Subtitle | Subtitle or alternative title | Text between one of a set of expressions indicating a potential subtitle (o vero, o sia) and the first full stop | string |
| Author | Author, writer, librettist | A dash, di, Poesia di + following two words (≈name/surname) | string |
| Genre | Dramatic genre (as indicated in the entry) | Text between the first full stop and the second full stop or parenthesis | string |
| City | Place of publication (city or town) | in + following word | string |
| Location | Physical location of first recorded performance (usually, a theatre) | First two/three words after Teatro di | string |
| Publisher | Publisher, printer, or typographer | pointer[7] + following two words (≈name/surname) | string |
| Year | Year of publication or performance | in + yyyy, else the first yyyy found | integer |
| Format | Typographical format (quarto, octavo, etc.) | in + one/two-digit number | integer |
| Mode | Poetry or prose | Find prosa for prose; versi/ottava rima for verse | string |
| Translation | Only direct translations are considered, not adaptations | If the entry contains translation-related language (tradot-, traduz-), mark as True | boolean |
| Libretto | Indication of the ‘musical’ nature of the work | If the entry contains music-related language (per/in M/musica-), mark as True | boolean |
| Composer | For libretti: author of the score | Musica di/del + following two words (≈name/surname) | string |

Figure 2
A sample entry; extracted fields are marked.
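Some of the rules in Table 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual regexes (for those, see the notebook cited in the notes): the function name `extract_fields`, the exact patterns, and the sample entry are invented for the example. The sample entry only mimics the structure described in the Context section. Publisher, location, subtitle, and composer are left out for brevity.

```python
import re


def _first_group(match):
    """Return the stripped first capture group, or None if nothing matched."""
    return match.group(1).strip() if match else None


def extract_fields(entry):
    """Apply simplified versions of some Table 1 rules to one entry."""
    # Title: all text until the first full stop.
    title = re.match(r"(.*?)\.", entry)
    # Genre: text between the first full stop and the next full stop or parenthesis.
    genre = re.match(r"[^.]*\.\s*([^.(]+)", entry)
    # City: 'in' + following word (assumed capitalised, to skip 'in prosa' or 'in 12').
    city = re.search(r"\bin\s+([A-Z]\w+)", entry)
    # Year: 'in' + yyyy, else the first yyyy found.
    year = re.search(r"\bin\s+(\d{4})\b", entry) or re.search(r"\b(\d{4})\b", entry)
    # Format: 'in' + one/two-digit number.
    fmt = re.search(r"\bin\s+(\d{1,2})\b", entry)
    # Author: 'di' + following two (assumed capitalised) words.
    author = re.search(r"\bdi\s+([A-Z]\w+\s+[A-Z]\w+)", entry)

    if re.search(r"\bprosa\b", entry, re.IGNORECASE):
        mode = "prose"
    elif re.search(r"\bversi\b|\bottava rima\b", entry, re.IGNORECASE):
        mode = "verse"
    else:
        mode = None

    return {
        "title": _first_group(title),
        "genre": _first_group(genre),
        "author": _first_group(author),
        "city": _first_group(city),
        "year": int(year.group(1)) if year else None,
        "format": int(fmt.group(1)) if fmt else None,
        "mode": mode,
        "translation": bool(re.search(r"tradot|traduz", entry, re.IGNORECASE)),
        "libretto": bool(re.search(r"\b(?:per|in)\s+[Mm]usica", entry)),
    }


# Invented entry that follows the structure described for the 1755 edition.
sample = ("Il Finto Pastore. Commedia (in prosa). -- in Venezia, "
          "per Giovanni Rossi, 1620, in 12. -- di Paolo Bianchi, Veneziano.")
fields = extract_fields(sample)
```

Rules this simple inevitably misfire on irregular entries, which is why the iterative refinement and manual cleaning described in this section remain necessary.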
The results of the extraction pipeline were then exported in CSV format and loaded into OpenRefine,[5] a state-of-the-art tool for data wrangling. There, they underwent a long and thorough cleaning process, which was meant to correct extraction errors, disambiguate between items with similar spellings, and, where appropriate, transform data into Linked Open Data. Using OpenRefine’s reconciliation function, we were able to match a sizable number of items across five categories (author, city, location, publisher, composer) with the corresponding Wikidata items.[6]
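To give an idea of what disambiguating similar spellings involves, the sketch below groups variant strings by string similarity. It does not reproduce OpenRefine's own clustering methods. It uses `difflib.SequenceMatcher` from Python's standard library instead, and the function name, the threshold of 0.85, and the sample values are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def cluster_spellings(names, threshold=0.85):
    """Greedily group names whose similarity to a cluster's first member
    reaches the threshold (case-insensitive comparison)."""
    clusters = []
    for name in sorted(set(names)):
        for cluster in clusters:
            ratio = SequenceMatcher(None, name.lower(), cluster[0].lower()).ratio()
            if ratio >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append([name])
    return clusters


# Illustrative values; candidates for merging still need a human decision.
clusters = cluster_spellings(["Venezia", "Venetia", "Roma", "Venezia"])
```

Such clusters are only candidates: whether two spellings really denote the same person, place, or printer remains a curatorial judgement.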
The final version of the dataset contains 6024 plays by 2033 different authors (plus 248 composers), covering a timeframe of roughly 300 years, from 1449 to 1755. Geographically, the pieces indexed in the Allacci Digitale were printed in at least 130 different Italian and European cities and performed in at least 109 different historical theatres.
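Summary figures of this kind could be recomputed from the records along the following lines. The sketch assumes that each record is a dictionary with `author` and `year` keys. These key names, like the three sample records, are hypothetical and may not match the released files.

```python
def summarise(records):
    """Count plays and distinct authors, and find the earliest and latest year."""
    years = [r["year"] for r in records if r.get("year") is not None]
    authors = {r["author"] for r in records if r.get("author")}
    return {
        "plays": len(records),
        "authors": len(authors),
        "first_year": min(years) if years else None,
        "last_year": max(years) if years else None,
    }


# Invented records, for illustration only.
sample_records = [
    {"title": "A", "author": "Paolo Bianchi", "year": 1620},
    {"title": "B", "author": "Paolo Bianchi", "year": 1631},
    {"title": "C", "author": None, "year": None},
]
summary = summarise(sample_records)
```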
Sampling strategy
Not applicable. The whole text of the Drammaturgia, including the ‘Aggiunte e correzioni’ (‘Additions and corrections’) appendix, has been converted into the database.
Quality control
As described above, the quality of the database was improved at every step (initial TXT, intermediate CSV/JSON, final CSV/JSON) through rounds of manual curation.
(3) Dataset Description
Repository name
Zenodo.
Object name
database.csv and database.json.
Format names and versions
CSV and JSON files.
Creation dates
The project began in early 2024. The current version of the dataset was released on 2024/05/28.
Dataset creators
Luca Giovannini (Potsdam/Padova): conceptualisation, data curation; Giorgia Gallucci (Padova): conceptualisation, data curation.
Language
English and Italian.
License
Creative Commons Attribution 4.0 International.
Publication date
First release: 2024/05/28.
(4) Reuse Potential
The Allacci Digitale database is aimed at two different audiences. On the one hand, it is meant to help traditional Italian literature scholars in exploratory bibliographical research. To this end, a simple bilingual website (https://allacci-digitale.github.io) has been built, featuring a search mask that allows users to perform complex searches, visualise the results, and download them as a CSV table.
On the other hand, these data can be used by computational literary studies (CLS) scholars to begin reconstructing a quantitative history of early modern Italian drama. The wealth of information available can be used to track the popularity of genres and authors over time, but also to explore which publishing centres and companies were most relevant in text dissemination.
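As a hedged illustration of such quantitative reuse, the sketch below counts genres per decade from CSV rows. The column names `genre` and `year` are assumptions and should be checked against the header of database.csv. The inline sample data are invented.

```python
import csv
import io
from collections import Counter


def genre_counts_by_decade(rows):
    """Count (decade, genre) pairs, skipping rows without a year or genre."""
    counts = Counter()
    for row in rows:
        year, genre = row.get("year"), row.get("genre")
        if not year or not genre:
            continue
        decade = int(year) // 10 * 10  # e.g. 1608 -> 1600
        counts[(decade, genre.strip())] += 1
    return counts


# Invented sample standing in for the real file.
sample_csv = """title,genre,year
A,Commedia,1601
B,Commedia,1608
C,Tragedia,1605
D,Commedia,1612
E,Tragedia,
"""
counts = genre_counts_by_decade(csv.DictReader(io.StringIO(sample_csv)))
```

For the released data, the same function could be fed `csv.DictReader(open("database.csv", newline="", encoding="utf-8"))`, provided the column names match and the year values are plain integers.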
Future developments might include the extraction of less standardised pieces of information, such as the dedicatee of a work, the author’s nationality, or the number of reprints. Given the complexity of such a task, we would set aside our rule-based approach and instead employ large language models, whose effectiveness in information extraction tasks has been increasingly demonstrated (Xu et al., 2024). We also plan to link items to further bibliographic sources on early modern theatre, such as the well-known Corago database (Pompilio, 2020).
Notes
[1] https://dl.unito.it/it (last accessed: 2024/11/18).
[2] https://archive.org/details/drammaturgia00alla (last accessed: 2024/11/18).
[3] https://readcoop.eu/model/transkribus-print-multi-language-dutch-german-english-finnish-french-swedish-etc (last accessed: 2024/11/18).
[4] See the Jupyter notebook: https://github.com/allacci-digitale/allacci-digitale.github.io/blob/main/allacci_modelling_notebook.ipynb (last accessed: 2024/11/18).
[5] https://openrefine.org (last accessed: 2024/11/18).
Acknowledgements
The authors would like to thank Lara Piva (READ Coop/University of Padova) for her suggestions during the OCR phase.
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Luca Giovannini: conceptualization, data curation, investigation, methodology, project administration, software, writing – original draft; Giorgia Gallucci: conceptualization, data curation, investigation, writing – review and editing.
