(1) Overview
Repository location
Context
The author is the Digital Curator for Automatic Text Recognition at the British Library, UK, and is responsible for designing the ATR workflow and implementing it operationally. Her team, Heritage Made Digital, manages digitisation workflows at the Library and develops related standards.
The data was produced as part of research to redesign the Automatic Text Recognition workflow at the British Library. During the initial research phase, it became clear that information needed to be gathered from multiple institutions on how ATR was integrated into digitisation workflows, which tools and standards were used, and whether institutions considered ethics, environmental sustainability and copyright. Furthermore, preliminary informal conversations with other institutions showed that new AI tools were changing the landscape, that institutions were combining old and new solutions in their workflows, and that there was general interest in collaborating on common challenges already identified within the British Library (e.g. copyright; ethics). Therefore, a survey (see Table 1 for examples of the included questions and possible answers) was designed to gather information about the overall situation and to evaluate the potential for creating a working group on Automatic Text Recognition in the Galleries, Libraries, Archives and Museums (GLAM) sector.
Table 1
Sample survey questions and resulting variables.
| SAMPLE SURVEY QUESTIONS | SAMPLE RESULTING VARIABLES |
|---|---|
| Q7 – Why did you decide to use ATR for your collections? Please select all that apply | Create datasets for future use; Implement content search for users; Research; Other |
| Q8 – Which tools do you use for performing Automatic Text Recognition? Please select all that apply | Transkribus; eScriptorium; ABBYY; Adobe Acrobat; Document AI by Google; Azure by Microsoft; Textract by Amazon; Tesseract OCR; Local LLM with Computer Vision; Online LLM with Computer Vision; Other |
| Q19 – What standard(s) of metadata do you use for documenting the ATR process? Please select all that apply | ALTO; METS; MIX; PREMIS; IIIF; Not sure; Other |
| Q23 – Have you evaluated the environmental sustainability of your ATR workflow? | Yes, we have fully evaluated it; Yes, we have done some evaluations; No, but we would like to; No, we are not interested; Not sure |
| Q24 – Have you evaluated the ethical considerations about the accuracy of ATR on different collection areas? | Yes, we have evaluated it; No, but we would like to; No, we are not interested; Not sure |
| Q25 – Are copyright assessments and copyright law considerations part of your regular ATR workflow? | Yes, it is part of our workflow; No, but we would like to implement it; No; Not sure |
A call for participation was circulated via a blog post describing the reasons for the research. The survey received internal approval, including approval for sharing the anonymised results, which have been made available in the institutional repository.
(2) Method
To create the survey and the dataset, the following steps were followed:
Steps
1. Survey Design. A draft of the survey questions was created and shared internally with colleagues from the Heritage Made Digital and Digital Research teams for comments.
2. Approval processes. The draft was approved following the Library's internal ethics processes, including a Data Protection review. The methods for sharing the survey and its results were also approved.
3. Platform. The survey was published on the British Library survey platform (Snap) and included both multiple-choice and free-text questions. None of the questions were mandatory.
4. Dissemination. The survey was publicised through a post on the Digital Scholarship blog (Vavassori, 2025a), which was shared on the author's LinkedIn profile, on mailing lists (e.g. Museum Computer Group, Multilingual DH, Code4Lib, GLAMLabs, LIS-Rarebooks, IFLA Mailing List) and in Slack groups (e.g. IIIF, AI4LAM). The mailing lists and Slack groups were selected by the author, in consultation with colleagues, as representative of Digital Humanities, GLAM and library communities and as likely pools of respondents: people in these groups were expected to be familiar with ATR projects within GLAM, to have an interest in the application of AI to cultural heritage, and to know how ATR outputs can be presented to audiences. The survey was also publicised internally, with staff asked to re-share it with relevant contacts, and the author shared it with her own contacts at other institutions.
5. Duration. The survey was available between 2025-03-17 and 2025-04-07.
Quality control
6. The author performed an initial anonymisation of the data, removing the email addresses provided for coordinating the working group. A second review identified data that could have made answers personally identifiable when combined with other knowledge (e.g. unique software developed by a single individual, or organisations where a single person is responsible for ATR). The proposed anonymisation was then checked internally by an expert, Digital Curator Dr Adi Keinan-Schoonbaert, and by the Data Protection Team. The anonymisation ensured that respondents were not personally identifiable. Where answers pointed to organisations but could reasonably have been provided by multiple people, the information was retained: such cases were considered low risk for identification and offered a richer picture of the field. This approach was agreed with the Data Protection Team. Where redactions were made, square brackets were inserted to document and explain them. This approach reflects the recognition that data cleaning and normalisation are interpretative acts that must be explained and made visible (Rawson and Muñoz, 2019), and acknowledges that datasets are situated and contextually produced (D'Ignazio and Klein, 2020; Drucker, 2020). It also ensures that the dataset can be reused without compromising privacy while still providing information on ATR workflows, supporting institutions developing their own workflows.
7. The dataset was checked for potential double entries. Two entries from the same institution were merged for the analysis; the dataset in the repository retains the original duplicate entries at lines 55–56.
8. Several questions had an option listed as “Other” with the possibility of adding free text. These responses were grouped and normalised whenever possible: for example, if two answers mentioned cataloguing as the main motivation for ATR, they were grouped together in the analysis, although the original answers were also conserved for qualitative analysis. The same approach was adopted for open-ended questions, such as the languages and scripts present within the collections (Q4 and Q5). For the analysis, clusters of languages or scripts were kept as separate answers (e.g. “European Languages” was left as given, since it can have multiple interpretations). The normalised data is not included in the repository: the pre-normalised, uncleaned data was shared instead to favour future reuse by others. The dataset in the repository includes additional, automatically generated columns: the date of each survey response (ID.date), the start and end times of the response (ID.start and ID.end), the completion date (ID.endDate) and the total time spent on the survey (ID.time). These timestamps can be useful to researchers and practitioners, providing insights into respondent behaviour and data quality, such as identifying patterns and detecting potential issues like rushed responses (see the first sketch after this list).
9. The repository dataset includes the survey answers, a .txt file mapping column headings to survey questions, and another .txt file detailing the codes used for single-choice questions. For example, in the dataset countries are represented by numerical values, with the file providing the key to match each value to the corresponding country (e.g. “76” corresponds to India); the second sketch after this list shows how such codes might be decoded.
10. The data were analysed for frequencies and co-occurrences (the third sketch after this list illustrates one way to compute these). A brief analysis of the data was published on the Digital Scholarship blog (Vavassori, 2025b).
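As noted in step 8, the timestamp columns can support basic data-quality checks. The following is a minimal sketch of how a reuser might flag potentially rushed responses; the CSV file name is hypothetical, and the sketch assumes ID.time is stored in seconds, so both should be verified against the actual repository files.

```python
import pandas as pd

# Hypothetical file name; substitute the actual CSV from the repository.
df = pd.read_csv("atr_survey_responses.csv")

# Assumption: ID.time holds the total completion time in seconds. If it is
# stored as a duration string (e.g. "HH:MM:SS"), convert it first with
# pd.to_timedelta(df["ID.time"]).dt.total_seconds().
times = pd.to_numeric(df["ID.time"], errors="coerce")

# Flag responses completed in under a quarter of the median time as
# potentially rushed; the threshold is an illustrative choice, not a rule.
threshold = times.median() * 0.25
rushed = df[times < threshold]
print(f"{len(rushed)} potentially rushed responses out of {len(df)}")
```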
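The exact layout of the codes file described in step 9 is not reproduced here, so the following sketch assumes one code–label pair per line separated by a tab; the file name and delimiter are assumptions to verify against the repository.

```python
# Build a lookup from the codes file, then decode a coded value.
code_to_label = {}
with open("single_choice_codes.txt") as f:  # hypothetical file name
    for line in f:
        code, _, label = line.strip().partition("\t")  # assumed delimiter
        if code and label:
            code_to_label[code] = label

print(code_to_label.get("76"))  # "India", per the example in step 9
```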
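For the frequency and co-occurrence analysis mentioned in step 10, a reuser might proceed roughly as follows. The column name Q8 and the semicolon separator are assumptions; multi-select answers may be stored differently in the actual CSV.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

df = pd.read_csv("atr_survey_responses.csv")  # hypothetical file name

# Assumption: multi-select answers (here Q8, the ATR tools) are stored as
# semicolon-separated strings; adjust the column name and separator.
tools = (
    df["Q8"]
    .dropna()
    .str.split(";")
    .apply(lambda xs: sorted({x.strip() for x in xs}))
)

# Count how often each tool appears, and how often pairs of tools are
# selected together by the same respondent.
frequencies = Counter(tool for row in tools for tool in row)
cooccurrences = Counter(pair for row in tools for pair in combinations(row, 2))

print(frequencies.most_common(5))
print(cooccurrences.most_common(5))
```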
(3) Dataset Description
Repository name
British Library Research Repository
Object name
Automatic Text Recognition (ATR) in Cultural Heritage Institutions Survey
Format names and versions
PDF, .csv, .txt, .xlsx; version 1.2
Creation dates
2025-03-17 to 2025-04-07
Language
English
License
CC BY 4.0 (Attribution)
Publication date
2025-09-18
(4) Reuse Potential
The data can be reused by researchers to examine which tools and workflows cultural heritage institutions are adopting, as well as which scripts and languages they hold in their collections (see, for example, the author’s analysis, Vavassori 2025b). By combining answers from Q4 and Q5 with the country or cluster of countries (available under Q2), researchers can explore the languages and scripts present in cultural heritage collections in different parts of the world, as sketched below.
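A minimal sketch of such a combination, assuming hypothetical file and column names (Q2 decoded to a country label, Q4 as semicolon-separated languages); the actual dataset layout should be checked first.

```python
import pandas as pd

df = pd.read_csv("atr_survey_responses.csv")  # hypothetical file name

# Assumption: Q2 holds the (decoded) country or country cluster and Q4 a
# semicolon-separated list of languages; adjust names and separator.
pairs = (
    df[["Q2", "Q4"]]
    .dropna()
    .assign(language=lambda d: d["Q4"].str.split(";"))
    .explode("language")
    .assign(language=lambda d: d["language"].str.strip())
)

# Cross-tabulate the languages reported per country (or country cluster).
print(pd.crosstab(pairs["Q2"], pairs["language"]))
```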
Another possible analysis could look into which ATR tools or viewers cultural heritage institutions are using (Q8 and Q16) and in which countries they are based.
The dataset may also support research on copyright, digital sustainability, ethics and AI implementation (including Automatic Speech Recognition). For example, a researcher may use this dataset to augment information on the use of AI in cultural heritage, including which models were selected and whether they were implemented locally or in the cloud (Q8, Q9, Q10, Q11).
Researchers will be able to perform further analyses using digital methods (e.g. data visualisation) or by comparing the data with other relevant datasets on digitisation in GLAM. Finally, this dataset provides examples of metadata for ATR and of standards currently in use within the GLAM sector, information that could help metadata specialists and researchers investigating metadata for AI processes. For example, researchers may want to identify which fields are commonly adopted for metadata on ATR and which standards are currently in use (Q19, Q20, Q21), and to compare these with future datasets providing later snapshots of the field. A simple visualisation along these lines is sketched below.
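As one example of the data visualisation mentioned above, the following sketch plots how often each metadata standard is reported under Q19; the file name, column name and separator are assumptions to verify against the repository.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("atr_survey_responses.csv")  # hypothetical file name

# Assumption: Q19 stores the metadata standards as semicolon-separated
# strings; adjust the column name and separator to the actual data.
standards = df["Q19"].dropna().str.split(";").explode().str.strip()
counts = standards.value_counts()

counts.plot(kind="bar")
plt.title("Metadata standards used for documenting ATR (Q19)")
plt.ylabel("Number of responses")
plt.tight_layout()
plt.show()
```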
However, it is important to recognise the following limitations of the dataset:
- The survey answers mainly represent cultural heritage institutions from the United States and Europe, so the dataset's view is skewed towards a Western-centric pool of answers.
- The survey was conducted in English; the language may therefore have influenced the answers received and hindered additional answers from other countries. Future surveys may benefit from being translated into multiple languages.
- As the data were collected by a British institution, the answers may have been influenced by existing institutional and international relationships.
- The call was shared in mailing lists and groups specialising in Digital Humanities and Library Science, which may have skewed the answers towards particular communities and subjects.
- The relatively short duration of the survey (three weeks) may also have limited the number of responses.
- Finally, although leaving an email address was optional, some potential respondents may not have felt comfortable completing the survey for privacy reasons.
The data are shared under a CC BY 4.0 (Attribution) license, which allows all forms of reuse, including commercial use, provided the author is credited.
Acknowledgements
I would like to thank Andrew Longworth for supporting this project; the Data Protection Team, the marketing team and all the people involved in the survey approval; Dr Adi Keinan-Schoonbaert for checking the data and the questions and for her helpful comments on this paper; and the Heritage Made Digital and Digital Research teams for their comments and suggestions on the draft survey and the data analysis.
Competing Interests
The author declares no competing interests. For transparency, the author notes that the British Library is a member of the READ-COOP social cooperative.
Author Contributions
Conceptualisation, Investigation, Data Curation, Formal Analysis.
