(1) Overview
Repository location
Context
“Il Dizionario degli scrittori italiani contemporanei pseudonimi in Wikidata. Metodologia e risultati (dati)” is a dataset containing the files and queries produced during the MA thesis of the same name (De Monaco, 2025). During this project a corpus of pseudonymous Italian authors of the 19th and 20th centuries was structured and curated in Wikidata, using information from a biographical dictionary (Frattarolo, 1975) as a starting point.
(2) Method
The dataset contains files related only to the authors found in the biographical dictionary.
Steps
After digitizing the dictionary,[1] the writers were manually listed in a .xlsx file, which was uploaded to OpenRefine in order to reconcile[2] the names with existing Wikidata items. For the authors successfully reconciled, the information about the pseudonym and the dictionary reference was added to Wikidata through OpenRefine; items for the remaining authors were created with a minimal level of detail using OpenRefine and QuickStatements. All the items corresponding to the people in the dictionary were then manually revised and curated in order to confirm and correct data about at least birth, death, occupation, and external identifiers. In particular, for some identifiers corresponding to authority files, error reports were sent: the dataset includes a .xlsx file recording the reports for SBN,[3] while the others (GND,[4] IdRef[5]) are available on the Wikidata user subpage of the thesis.[6]
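The reconciliation step was carried out interactively in OpenRefine, but the underlying service follows the W3C Reconciliation API, which accepts batches of named queries. As an illustration only, the following sketch builds such a batch payload; the helper name, the `limit` value, and the endpoint constant are assumptions, not part of the project's workflow.

```python
import json

# Assumption: this endpoint form is typical for wikidata.reconci.link,
# but the project used OpenRefine's UI rather than direct API calls.
RECONCILE_ENDPOINT = "https://wikidata.reconci.link/en/api"

def build_reconcile_payload(names, type_qid="Q5"):
    """Build a batch reconciliation payload, one query per name.

    type_qid="Q5" restricts candidate matches to humans.
    """
    return {
        f"q{i}": {"query": name, "type": type_qid, "limit": 5}
        for i, name in enumerate(names)
    }

payload = build_reconcile_payload(["Alberto Moravia", "Elsa Morante"])
# The payload would be sent as form data: {"queries": json.dumps(payload)}
print(json.dumps(payload, indent=2))
```

Restricting candidates by type (here, humans) is what makes batch reconciliation of a name list practical: without it, common pseudonyms would match works, places, and other entity types.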
Once the work on the Wikidata items was complete, a series of queries was written and executed: their results are included in the dataset along with links to the queries themselves.
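The exact queries are linked from the index file in the archive; purely as an illustration, a query of the kind run against the Wikidata Query Service might look like the sketch below, which retrieves all items described by the dictionary (Q130597718, per note [1]) together with their pseudonyms. The SELECT variables and overall shape are assumptions, not the thesis queries.

```python
# Q130597718 is the Wikidata item for the dictionary (stated in note [1]).
DICTIONARY_QID = "Q130597718"

def build_query(dictionary_qid):
    """Assemble a SPARQL query listing people described by the dictionary."""
    return f"""
    SELECT ?person ?personLabel ?pseudonym WHERE {{
      ?person wdt:P1343 wd:{dictionary_qid} .        # described by source
      OPTIONAL {{ ?person wdt:P742 ?pseudonym . }}   # pseudonym
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "it,en". }}
    }}
    """

query = build_query(DICTIONARY_QID)
# To execute against the live endpoint (requires network access):
# import requests
# r = requests.get("https://query.wikidata.org/sparql",
#                  params={"query": query, "format": "json"})
print(query)
```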
Quality control
During the project, biographical and occupational information found in the dictionary was added to Wikidata. In case of conflict with data already present in Wikidata, additional sources (such as national authority files or reliable external databases) were consulted to establish the most reliable information, which was then marked as preferred using Wikidata's rank mechanism.[7] To reach a uniform minimum level of description, the same sources were also used where data was missing: the dictionary was printed in 1975 and does not report complete biographical information for all the authors.
(3) Dataset Description
Repository name
Zenodo
Object name
“FRDCP risultati query tsv – 20250520.zip”, a compressed archive containing 48 TSV files with the results of the queries written during the project. The archive also contains an index file that links each result file to the query that generated it, with a comment describing the query's purpose. The queries deal mainly with general information about the writers, the quality of the dictionary as a source, and the presence of the corpus members in external databases and authority files. These results were obtained by running the queries one final time on 2025-05-20.
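The TSV result files can be loaded with any tabular tool; a minimal standard-library sketch is shown below. The column names (`item`, `itemLabel`) and the sample row are assumptions modelled on typical Wikidata Query Service exports, not the actual files.

```python
import csv
import io

# Hypothetical sample mimicking one row of a WDQS TSV export.
sample_tsv = "item\titemLabel\nhttp://www.wikidata.org/entity/Q1\tExample\n"

def read_tsv(text):
    """Parse TSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

rows = read_tsv(sample_tsv)
print(rows[0]["itemLabel"])  # → Example
```

For the real files, replace `io.StringIO(text)` with `open(path, encoding="utf-8", newline="")`.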
“Lista voci FRDCP per OpenRefine.xlsx”, a spreadsheet file listing all the people found in the dictionary. For every member of the corpus the following data were recorded: the pseudonym (with name and surname in separate columns where possible); the real name; the starting and ending pages of the author's record in the printed version of the dictionary; a column with the pseudonym, constructed from the first two columns, to be used in Wikidata as the value of the property “pseudonym” (P742); a column constructed as “pseudonym name pseudonym surname (real name)”, to be used in Wikidata as the value of the qualifier “subject named as” (P1810); a column with the page interval, constructed from the columns containing the starting and ending pages; and five additional columns for the authors that had more than one pseudonym in the dictionary. The additional sheets in the file contain information about the people that had more than five pseudonyms in the dictionary.
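The constructed columns described above follow simple composition rules. The sketch below shows how such values could be derived from the base columns; the function names and example values are hypothetical, and the spreadsheet itself was built in Excel/OpenRefine rather than by a script.

```python
def subject_named_as(pseud_name, pseud_surname, real_name):
    """Value for the 'subject named as' (P1810) qualifier:
    'pseudonym name pseudonym surname (real name)'."""
    pseudonym = " ".join(p for p in (pseud_name, pseud_surname) if p)
    return f"{pseudonym} ({real_name})"

def page_interval(start, end):
    """Page interval from the starting and ending page columns."""
    return str(start) if start == end else f"{start}-{end}"

# Hypothetical example values:
print(subject_named_as("Mario", "Rossi", "Giuseppe Bianchi"))
print(page_interval(12, 14))
```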
“Risultati query per grafici.zip”, a compressed archive containing a Jupyter notebook and the CSV files used in it. These files are the results of some of the queries mentioned above and were used to conduct a descriptive analysis of the corpus. The analysis was conducted in Python (with the pandas, numpy, matplotlib, and seaborn libraries) and focused mainly on generating descriptive statistics (for example means, standard deviations, and percentiles) and visualizations, many of which were included in the thesis.
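As a toy recreation of that kind of descriptive analysis, using only the standard library (the notebook itself uses pandas and numpy), one could compute the same summary measures over a numeric column. The birth years below are invented, not taken from the corpus.

```python
import statistics

# Hypothetical birth years standing in for one column of a query result.
birth_years = [1910, 1920, 1925, 1930, 1950]

mean = statistics.mean(birth_years)
stdev = statistics.stdev(birth_years)  # sample standard deviation
# With n=4, quantiles() returns the 25th/50th/75th percentile cut points.
q1, median, q3 = statistics.quantiles(birth_years, n=4)

print(f"mean={mean}, median={median}, stdev={stdev:.2f}")
```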
“Tracciamento errori riscontrati in SBN.xlsx”, a spreadsheet file with the reports of 259 errors found in the Italian authority file. For every report the following data were listed: the record ID, a brief comment on the problem to be solved, the report date, the solution date, and a brief comment on the solution. The reports are split across two sheets: the first collects those dealing with errors found in authority records, the second those found in bibliographic records.
“wdump-4826-20250505.nt.gz”, a compressed archive containing the dump[8] of the 493 Wikidata items about the people in the biographical dictionary, generated through WDumper[9] on 2025-05-05. The archive contains a single NT file where the items' RDF triples are stored.
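The NT (N-Triples) format stores one triple per line, which makes the dump easy to stream. The sketch below is a minimal parser for the simple case of URI subjects/predicates with URI or plain-literal objects; for real use a library such as rdflib is preferable, and the sample triple is illustrative rather than taken from the dump.

```python
import re

# Matches '<subject> <predicate> object .' — simple N-Triples lines only;
# this deliberately ignores language tags, datatypes, and blank nodes.
NT_LINE = re.compile(r'<([^>]*)> <([^>]*)> (.+) \.\s*$')

def parse_nt_line(line):
    """Return (subject, predicate, object) or None for unparseable lines."""
    m = NT_LINE.match(line)
    if not m:
        return None
    subj, pred, obj = m.groups()
    if obj.startswith("<") and obj.endswith(">"):
        obj = obj[1:-1]  # strip angle brackets from URI objects
    return subj, pred, obj

# To stream the actual dump: gzip.open("wdump-4826-20250505.nt.gz", "rt")
triple = parse_nt_line(
    '<http://www.wikidata.org/entity/Q1> '
    '<http://www.wikidata.org/prop/direct/P742> "Pseudonimo" .'
)
print(triple)
```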
Format names and versions
The dataset contains TSV, XLSX, and NT files, and a Jupyter notebook.
Creation dates
The files in the dataset were created between 2024-10-09 and 2025-05-20.
Dataset creators
Sara De Monaco, curator; Camillo Carlo Pellizzari di San Girolamo, supervisor.
Language
Italian (comments, reports etc.), English (variable names).
License
CC0.
Publication date
2025-05-20.
(4) Reuse Potential
The dataset has reuse potential from both a temporal and a methodological point of view. The Jupyter notebook provides a sample of descriptive analysis that can serve as a model for similar projects or as a starting point for further analysis (using statistical or data-mining methods). The Wikidata dump contains substantial additional information that was not curated and checked during this project, but that could be further explored and analysed.
The spreadsheet file with the list of the authors could serve as a reference for other projects reconciling a biographical dictionary with Wikidata.
The reports file offers a basis for comparison with other projects that check the Italian authority records corresponding to the authors of a corpus: the number and type of errors could be compared.
Finally, one could examine the items in this dataset at different moments in time: the queries written during the thesis could be run again and the new results compared with those already available; the same could be done with a new Wikidata dump of the dictionary's items. Both approaches make it possible to study how the data of the corpus items changes over time.
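A longitudinal comparison of this kind reduces, at its simplest, to set operations over the item identifiers returned by two runs of the same query. The sketch below uses invented QIDs to show the idea; real snapshots would come from the TSV results in the archive and from a fresh query run.

```python
# Hypothetical item IDs from two runs of the same query at different dates.
results_2025 = {"Q101", "Q102", "Q103"}
results_later = {"Q102", "Q103", "Q104"}

added = results_later - results_2025      # items newly matching the query
removed = results_2025 - results_later    # items that no longer match
unchanged = results_2025 & results_later  # stable core of the corpus

print(f"added={sorted(added)}, removed={sorted(removed)}")
```

Beyond membership, the same pairing of snapshots could be extended to compare statement values item by item, which is what a diff of two WDumper dumps would support.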
Notes
[1] A physical copy of the dictionary was digitized with a scanner and the resulting .pdf file was made searchable using the OCR feature of the NAPS2 software (https://www.naps2.com/; no manual correction was required). The file was only consulted during the project, since the work is still under copyright: in Wikidata, the authors' items all have the property “described by source” (P1343) with value Q130597718, the item ID of the dictionary.
[2] Using the Wikidata reconcile service, https://wikidata.reconci.link/.
[3] Servizio Bibliotecario Nazionale, https://opac.sbn.it/.
[4] Gemeinsame Normdatei, https://www.dnb.de/EN/Professionell/Standardisierung/GND/gnd_node.html.
[5] Identifiants et Référentiels pour l’enseignement supérieur et la recherche, https://www.idref.fr/.
[6] More details about the properties and information added, as well as the error reports are available at https://www.wikidata.org/wiki/User:SaraDeMonaco/Tesi_magistrale.
[7] https://www.wikidata.org/wiki/Help:Ranking, mechanism used to annotate “multiple values for a statement”.
Acknowledgements
I would like to thank Professor Vittore Casarosa, Professor Enrica Salvatori and Camillo Carlo Pellizzari di San Girolamo for assisting with and supervising the thesis project related to this dataset.
Competing interests
The author has no competing interests to declare.
