(1) Overview
Since its foundation in 1945, the United Nations Educational, Scientific and Cultural Organization (UNESCO) has played a central role in global actions on literacy, communications and heritage, and been a crucial forum for global debates on a wide range of issues. These debates have played out above all in the plenary sessions of UNESCO’s General Conference (GC), which normally meets every other year. Since 1947, UNESCO has published the “verbatim record” of what delegates said at these sessions in the Proceedings volumes of the series Records of the General Conference.
This body of text is of interest to a wide range of scholars in the humanities and social sciences, as demonstrated by the recent boom in publications on UNESCO in history, anthropology, literature, and heritage studies (Duedahl, 2016; Garner, 2016; McDonald, 2017; Meskell, 2018; Brouillette, 2019; Betts, 2020; Morgan, 2024), as well as by the growing interest in global approaches in the humanities and social sciences (Darian-Smith & McCarty, 2017; Amirell, 2023). The Proceedings volumes are available in PDF format through UNESCO’s digital library (unesdoc.unesco.org), allowing readers to consult selected volumes online. But since optical character recognition (OCR) has not been applied to many of the PDFs, these files do not allow for full-text searching, let alone more sophisticated methods of computational text analysis. As part of the research project “International Ideas at UNESCO” (INIDUN, inidun.github.io), we have sought to address this issue by compiling the texts of the “verbatim record” section from every issue of Proceedings between the 1946 and 2017 sessions, in English and/or French, generating a comprehensive, machine-readable text corpus.
The Proceedings volumes were published in parallel English and French editions from the first General Conference session of 1946 (published in 1947) until the 1960 session (published in 1962). From the 1962 session onward, Proceedings has appeared in a single multilingual volume reproducing each speaker’s intervention in whichever of UNESCO’s six official languages it was delivered. Interventions in four of these languages (Arabic, Chinese, Russian and Spanish) are translated into either English or French (in alternating years). To create a single corpus that includes all interventions and is amenable to digital exploration, we deployed a language-recognition algorithm that isolates and extracts the text sections in English and French.
We present this corpus as part of a package that includes: (1) the corpus, of ca. 21 million words; (2) code written to curate the corpus; (3) metadata; and (4) documentation and quality control files. This paper presents the first version of the Proceedings corpus.
Repository location
The corpus is available on Zenodo: https://zenodo.org/records/14786689, DOI: 10.5281/zenodo.14786688. The corpus and additional materials are on the project’s GitHub repository: https://github.com/inidun/proceedings_curation/releases.
Context
The corpus was prepared in the INIDUN project, led by Benjamin G. Martin at Uppsala University.
(2) Method
Inspired by dynamic models of corpus creation (Voormann & Gut, 2008; Hurtado Bodell et al., 2022), we prepared this corpus through five main steps: 1) OCR reprocessing and quality control; 2) curating corpus metadata; 3) consolidating a metadata index; 4) extracting meetings to individual text files; and 5) extracting language-specific text from these files.[1]
OCR reprocessing and quality control
Of the 42 UNESCO General Conference Proceedings documents (1946–2017), all but two were available for download in PDF format from UNESCO’s open web archive (unesdoc.unesco.org). (UNESCO’s GC met annually from 1946 to 1952, and then every other year: in even-numbered years 1952–1980, and in odd-numbered years since 1983. The organization also held “extraordinary sessions” in 1953, 1973 and 1982.) After we had the remaining two volumes (1952 and 1976) scanned into PDF format, we had assembled 37,862 PDF pages from 42 sessions. Those PDFs that lacked OCR, or whose OCR was unsatisfactory, were run through the OCR engine Tesseract 5.[2]
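The exact re-OCR setup is not detailed here; as a minimal sketch, a PDF without a usable text layer could be rasterized and passed to Tesseract, for instance via the pdf2image and pytesseract libraries (an assumption about tooling; the project’s actual pipeline may differ, and file names are illustrative):

# Minimal sketch of re-OCR'ing a PDF with Tesseract (assumed tooling:
# pdf2image + pytesseract; the project's actual pipeline may differ).
import pytesseract                       # Python wrapper around the Tesseract engine
from pdf2image import convert_from_path  # renders PDF pages as PIL images


def ocr_pdf(pdf_path: str, out_path: str, languages: str = "eng+fra") -> None:
    """Render each page of `pdf_path` and append its OCR text to `out_path`."""
    pages = convert_from_path(pdf_path, dpi=300)
    with open(out_path, "w", encoding="utf-8") as out:
        for number, page in enumerate(pages, start=1):
            text = pytesseract.image_to_string(page, lang=languages)
            out.write(f"# page {number}\n{text}\n")


if __name__ == "__main__":
    ocr_pdf("proceedings_1952.pdf", "proceedings_1952.txt")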
We checked the OCR quality of the digitized Proceedings by selecting a random sample of five paragraphs, ranging from 60 to 420 words, from each document. To calculate the word error rate, we compared the PDF image of each paragraph to the OCR output for the same paragraph. Eighteen documents had a word error rate below 1%; 9 had a rate between 1% and 5%, 5 between 5% and 10%, and 2 above 10% (one at 11% and one at 31%). Rescanning the volumes with the worst OCR quality (1968 and 1993) would likely yield higher OCR accuracy.
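The comparison does not presuppose any particular tool; one simple way to compute a word error rate between a hand-transcribed reference paragraph and the corresponding OCR output is a word-level edit distance, as in this illustrative sketch:

# Illustrative word error rate (WER) check: word-level Levenshtein distance
# between a hand-transcribed reference paragraph and the OCR output.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)


if __name__ == "__main__":
    # One substitution out of four words gives a WER of 0.25.
    print(word_error_rate("the general conference met", "the generol conference met"))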
Curating and consolidating metadata
Each biennial “session” of UNESCO’s GC is divided into several “meetings”. We created a metadata file (https://github.com/inidun/proceedings_curation/tree/main/data) that catalogs all meetings. Each PDF document was manually reviewed, and details such as meeting dates, titles, meeting presidents, languages and page numbers were systematically recorded in this index. We then designed a script (create_metadata_index.py) that creates a new metadata index by merging and updating the index of Proceedings volumes (provided to us by UNESCO’s archive) with our own metadata file. The script loads the Proceedings index and metadata from Excel files, processes and validates the data, and merges them into a single DataFrame. It also performs various checks and transformations, such as stripping and lowercasing column names, handling date and page information, and generating unique meeting IDs. The resulting metadata index, saved as an Excel or CSV file, underpins the following steps.
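The full logic lives in create_metadata_index.py; the sketch below only illustrates the general pandas pattern of loading, normalizing and merging the two Excel sources (the column and file names here are assumptions, not the script’s actual ones):

# Sketch of the metadata-consolidation step (column and file names are
# illustrative; see create_metadata_index.py for the actual implementation).
import pandas as pd


def build_metadata_index(proceedings_index: str, meetings_metadata: str) -> pd.DataFrame:
    index_df = pd.read_excel(proceedings_index)
    meetings_df = pd.read_excel(meetings_metadata)

    # Normalize column names before merging.
    for df in (index_df, meetings_df):
        df.columns = df.columns.str.strip().str.lower()

    # Each meeting row is matched to exactly one Proceedings volume.
    merged = meetings_df.merge(index_df, on="record_number", how="left", validate="m:1")

    # Derive a unique meeting ID from session and meeting numbers.
    merged["meeting_id"] = (
        merged["session"].astype(str) + "_" + merged["meeting"].astype(str).str.zfill(3)
    )
    return merged


if __name__ == "__main__":
    build_metadata_index("proceedings_index.xlsx", "meetings_metadata.xlsx").to_csv(
        "metadata_index.csv", index=False
    )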
Extracting text, by meeting and by language
We designed a script (extract_meetings.py) that extracts the text of each meeting from the PDF corpus into separate text files. This script reads the metadata index, checks for the presence of source files, and processes each file by extracting its text (via either PDFBoxExtractorMod or TesseractExtractorMod) and saving it to an output directory. It logs the extraction process and handles options such as extracting page numbers, specifying page separators, and forcing overwrites of existing files. Finally, to isolate the text in English and French, we prepared a script (extract_language_subset.py) that processes the text files in a specified input folder by tokenizing paragraphs, detecting their languages, and filtering them based on the specified languages. It uses a tokenizer and a language detector (langdetect), both of which can be customized.
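As a minimal sketch of the language-filtering idea behind extract_language_subset.py, assuming langdetect and a simple blank-line paragraph tokenizer (the real script supports pluggable tokenizers and detectors):

# Minimal sketch of paragraph-level language filtering with langdetect
# (simplified: paragraphs are split on blank lines).
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect deterministic across runs


def filter_paragraphs(text: str, keep=("en", "fr")) -> str:
    """Keep only paragraphs whose detected language is in `keep`."""
    kept = []
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue
        try:
            if detect(paragraph) in keep:
                kept.append(paragraph)
        except LangDetectException:
            # OCR noise or very short fragments may not be detectable; drop them.
            continue
    return "\n\n".join(kept)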
Applying this method yields two corpora, one in English (ca. 15.7 million tokens, 83,934 kB) and one in French (ca. 8.6 million tokens, 48,257 kB). Combining these gives a complete, machine-readable record of what was said at the meetings of UNESCO’s General Conference between 1946 and 2017.
(3) Dataset Description
Repository name
Zenodo, GitHub
Object name
UNESCO’s Proceedings, 1945–2017: A Bilingual Digital Text Corpus
Format names and versions
Text corpora: .txt
Code: Python
Metadata: CSV
Creation dates
2021-02-01 to 2024-12-10
Dataset creators
Andreas Marklund (developer, Humlab, Umeå University), Benjamin Martin (PI, Uppsala University), Oriane Mathilde Martin (assistant data curator, Uppsala University), Fredrik Mohammadi Norén (researcher, Malmö University), and Roger Mähler (developer, Humlab, Umeå University).
Language
English and French
License
Creative Commons Attribution 4.0 International (CC BY 4.0 Deed).
Publication date
2025-02-01
(4) Reuse Potential
Our curated corpus of UNESCO’s Proceedings has a high degree of reuse potential. Most simply, it enables the use of digital methods for humanistic and social scientific analysis of what delegates from around the world said in the meetings of UNESCO’s General Conference between 1946 and 2017. The range of analyses that such digital text analysis could support (from word-trend analysis to topic modeling to word embeddings) is far too broad to catalogue here.
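As one minimal illustration, a word-trend analysis could be as simple as counting a term’s occurrences per extracted meeting file and grouping the counts by session year (the directory layout and file-name pattern below are assumptions, not part of the corpus specification):

# Illustrative word-trend count over the extracted text files
# (directory layout and file-name pattern are assumptions).
import re
from collections import Counter
from pathlib import Path


def term_frequency_by_year(corpus_dir: str, term: str) -> Counter:
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        # Assume the four-digit session year appears somewhere in the file name.
        match = re.search(r"(19|20)\d{2}", path.name)
        if not match:
            continue
        year = match.group(0)
        text = path.read_text(encoding="utf-8").lower()
        counts[year] += len(re.findall(rf"\b{re.escape(term)}\b", text))
    return counts


if __name__ == "__main__":
    for year, n in sorted(term_frequency_by_year("corpus_en", "literacy").items()):
        print(year, n)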
The range of uses could be expanded by improving the corpus and metadata. The volumes with the highest error rates should be rescanned.[3] One could also raise the OCR quality of selected files, use OCR in the extraction workflow, and apply OCR to all languages in the source files. (Currently, languages not covered by the OCR settings produce strings of random characters in the extracted text files, which makes language detection harder.) Efforts to re-OCR the files could make use of the information on document languages contained in the metadata file. This would improve the quality of the automated segmentation, since it would be clearer where passages in different languages begin and end. Machine learning approaches to language detection and segmentation could also be implemented. Applying named entity recognition to the material could permit a geographical study of spatial representations concerning named places in the speeches.
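A geographical study of this kind could start from any off-the-shelf NER pipeline; the following sketch uses spaCy’s small English model as an example (an assumption, not a component of the curation code):

# Sketch of extracting place names with spaCy NER. The en_core_web_sm model
# is an assumption and must be installed separately
# (python -m spacy download en_core_web_sm).
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")


def place_mentions(text: str) -> Counter:
    """Count GPE (geopolitical) and LOC (location) entities in a speech."""
    doc = nlp(text)
    return Counter(ent.text for ent in doc.ents if ent.label_ in {"GPE", "LOC"})


if __name__ == "__main__":
    print(place_mentions("The delegate of India thanked the city of Florence.").most_common(5))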
The metadata could be improved by adding information on each meeting (duration, number of speakers, topics debated, etc.), by creating a database of all speakers at the conferences with a persistent ID that can be used to link to other databases, such as Wikidata, and by linking each Proceedings speech to its corresponding speaker. One way to work with the corpus in the future would be to set up a public Jupyter Notebook environment, or an equivalent user-friendly interface, in which to explore corpus statistics through various NLP tools.
Notes
[1] Source code and documentation can be found at: github.com/inidun/proceedings_curation.
[2] Re-OCR’ed volumes are listed at: https://demo.humlab.umu.se/inidun/proceedings/pdfs/.
[3] A new scan of the 1968 volume is available at: https://gupea.ub.gu.se/handle/2077/78653.
Acknowledgements
Thanks to the UNESCO Library and Archives, and especially archivist Eng Sengsavang, for facilitating access to metadata and documents.
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Benjamin G. Martin: Conceptualization; Funding acquisition; Investigation; Methodology; Project administration; Resources; Supervision; Validation; Writing – original draft; Writing – review & editing.
Fredrik Mohammadi Norén: Investigation; Methodology; Validation; Writing – original draft; Writing – review & editing.
Roger Mähler: Data curation; Methodology; Formal analysis; Supervision; Validation; Writing – original draft.
Andreas Marklund: Data curation; Methodology; Formal analysis; Validation; Writing – original draft; Writing – review & editing.
Oriane Martin: Data curation; Investigation; Formal analysis; Validation; Writing – original draft; Writing – review & editing.
