1. Context and motivation
1.1 Introduction
Web archives, like the UK Web Archive (UKWA), exist in a paradoxical space. They are expected to function like traditional archives, yet they react to the ‘digital heap’ by structuring decentralised resources that were not meant to be structured (cf. Milligan 2019). This role confusion can cause mismatched expectations. While archivists perceive web archives as organised collections, researchers may anticipate their functionality to resemble that of open data, and the general public might regard them as search engines. To effectively address the diverse needs of these users—archivists, researchers, and the public—web archives must be reconceptualised beyond the confines of traditional collection practices.
Furthermore, the UKWA is a legal deposit web archive, which brings built-in usage limitations. In the UK, materials received through legal deposit at national libraries are traditionally available for reference only, meaning that readers cannot access the majority of these works outside the physical reading rooms of the six UK legal deposit libraries. While electronic resources have the potential for remote access – and users now expect to access them from anywhere – UK regulations still apply the same restrictions to digital content. As a result, UKWA resources were underused even before the cyber attack on the British Library led to the temporary shutdown of the collection. As Gooding, Terras, and Berube pointed out, this also creates barriers to text and data mining and contrasts with current Digital Humanities collaboration trends (Gooding et al. 2018; Milligan 2015). Ogden and Maemura formulated similar criticisms when conceptualising the challenges of web archive research. Beyond the legal conditions, they also highlighted the effort required to research web archives, including the gap in understanding of what the curated material can answer (Ogden & Maemura 2021).
To reconcile these conflicting roles and help mitigate these built-in shortcomings, we propose that web archives should embrace a multi-level access model: one that begins with highlighting the uses of collection metadata (which is not itself subject to legal deposit restrictions), moves through pre-processing by both archivists and researchers, and ends with a user-friendly interface that can engage a broad audience. The three levels of pre-processing are the following:
Archivist level: This is the primary level of selecting, crawling, and cataloguing websites, and organising metadata. As the first filter of information, this type of access is offered to users who are either GLAM (Galleries, Libraries, Archives and Museums) professionals or who are familiar with archival research and have a concrete research question only the web archive can answer.
Researcher level: Researchers pre-process the collection metadata, for example using distant reading methodologies (such as natural language processing and machine learning), converting it into a format that can be further explored. This middle layer translates raw archive material into analysable data for digital researchers, but it is not directly usable by the wider public.
Public level: The researcher becomes a mediator, creating a more accessible and digestible version of the data. Here, the broader public can engage with the information, answering historical or societal questions in the traditional library setting or via interfaces based on the aggregated information, much as research communication distils complex projects into accessible narratives.
The limitations of the UKWA as data have already begun to be explored (see Ogden 2018), and our model aims to offer a perspective for overcoming them. The model offers different access levels that respond to user needs, iterating upon the archive's role as a service for audiences. This concept is explored further in the Archive of Tomorrow project, which demonstrates the different stages of data processing and dissemination, with a particular focus on working with web archive metadata.
1.2 Case study: The Archive of Tomorrow
The Archive of Tomorrow was a multi-institutional project funded by the Wellcome Trust and led by the National Library of Scotland. The primary goal of the project was to investigate and preserve online health resources. In total, over 3,000 web sources were collected, culminating in the creation of the UKWA collection titled ‘Talking About Health’. The collection aimed to provide a comprehensive digital record of the diverse ways in which health information has been discussed, disseminated, and understood online (UK Legal Deposit Libraries 2023).
The scope of the collection is notably broad, covering a wide array of topics related to health. A key focus of the project was to gather information from both official and unofficial sources. Official sources included materials published by news outlets, government websites, and public health organisations, while unofficial sources ranged from social media platforms to personal blogs. For instance, the collection features forum discussions on DIY dentistry during the COVID-19 lockdowns, government-issued health guidelines, and commercial websites selling products purportedly designed to prevent 5G radiation exposure. The project team made the decision not to classify the collected information as either misinformation or valid information. This decision stemmed from the recognition that many sources occupy a grey area, where their accuracy may fluctuate over time. For example, clinical trials that initially showed promise may later be discredited, or official government advice may become outdated as the situation evolves. The project’s aim, therefore, was not to curate definitive truth but to cast a wider net and capture the broader health-related discourse occurring in both formal and informal spaces online (Austin & Talboom 2024).
The project also grappled with several challenges, particularly in terms of the sensitive nature of some of the material and the legal framework governing web archiving in the UK. Under current legislation, the UKWA can collect online content within the UK domain, but public access to this material is generally restricted to the reading rooms of legal deposit libraries (Department for Culture, Media and Sport 2013). This restriction can only be bypassed when licences are signed by the website owner, content creator and/or copyright holders. However, the project benefited from the expertise of a rights officer who successfully negotiated permissions, making a significant portion of the collection openly accessible. The curated collection of the project, titled ‘Talking about Health’, is available on the UKWA website (UK Web Archive 2023).
One of the central aims of the Archive of Tomorrow was to enhance accessibility, making this material available to a wider and more diverse audience. While some of this was achieved through legal channels, such as securing licences for specific content, the project also explored ways to make the collection accessible at scale. This aspect of the project is of particular interest, as it addresses the growing demand for digital collections to be used in large-scale data analysis and computational research.
The notion of making digital archives accessible for computational methods is not new. Several other projects within the GLAM community have championed the concept of ‘Collections as Data’, which encourages cultural and academic institutions to rethink their archival holdings as potential datasets. Notable examples include Collections as Data (Padilla et al. 2018), the Archives Unleashed Project (Ruest et al. 2020) and the work of the GLAM Workbench, which has collaborated with the British Library to facilitate the use of archived material in computational research (Sherratt 2020; 2021). These initiatives aim to push GLAM institutions toward adopting new data-centric approaches to their collections, thereby making them more useful for researchers who rely on data analytics.
In a practical effort to test the accessibility of the Archive of Tomorrow collection, two fellowships were made available in collaboration with Cambridge Digital Humanities. This provided valuable insights into how a web archive dataset needed to be prepared to be utilised by scholars working with large datasets and computational tools. This paper will further outline how the dataset was prepared and will include the research that was done by one of the fellows when using this dataset.
2. Dataset description
2.1 Data as a mediator for access: Preparing the dataset and the support needed for researchers
As mentioned above, the Archive of Tomorrow was limited in how the material could be made accessible under the non-print legal deposit legislation (The Legal Deposit Libraries (Non-Print Works) Regulations 2013). For users who wanted to browse individual websites, great improvements were made with the help of the rights officer on the project. By signing a large number of licences throughout the project, more material was made available outside of the reading rooms. However, making the material available as data posed a greater challenge. The legislation prohibits the direct use of website content at scale, primarily to protect publishers’ rights. This restriction ensures that publishers are not unfairly disadvantaged by large-scale computational research, such as distant reading, which could otherwise extract significant value or insights without equitable compensation or acknowledgment. This legislative framework, while protecting publishers, presents a considerable hurdle for researchers seeking to utilise the archived material for computational analyses.
As a solution, the project shared a JSON dataset containing metadata rather than the actual website crawls. This metadata included details like crawl dates, website tags, titles, and descriptions. While this allowed researchers to access curated data about the websites, they still faced limitations. For example, they could only work with metadata and could not directly access the websites, meaning that if certain websites were taken offline, the researchers had no access to the content itself.
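Even without the underlying website content, a metadata-only dataset of this kind supports useful first-pass analysis. The sketch below parses a small invented sample in the spirit of the shared JSON file and groups target URLs by curatorial tag; the field names used here (`title`, `url`, `crawl_start_date`, `tags`, `description`) are assumptions for illustration, not the actual schema of the published dataset.

```python
import json

# Invented sample records mimicking the shape of the shared metadata file.
# Field names are illustrative assumptions, not the dataset's real schema.
sample = """
[
  {"title": "NHS advice page", "url": "https://example.org/advice",
   "crawl_start_date": "2022-03-01", "tags": ["official", "government"],
   "description": "Public health guidance."},
  {"title": "Health forum thread", "url": "https://example.org/forum",
   "crawl_start_date": "2022-05-12", "tags": ["forum", "unofficial"],
   "description": "Community discussion of home remedies."}
]
"""

records = json.loads(sample)

# Group target URLs by tag: the kind of first-pass overview a researcher
# might want before any scraping or text analysis.
by_tag = {}
for record in records:
    for tag in record.get("tags", []):
        by_tag.setdefault(tag, []).append(record["url"])

print(sorted(by_tag))          # tags present in the sample
print(len(records), "records")
```

Because the records carry crawl dates and tags rather than page text, this kind of grouping and counting is possible entirely outside the reading rooms, which is precisely the access compromise the metadata dataset was designed to offer.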
The JSON file was created by the British Library, which oversees the UKWA’s infrastructure (Lelkes-Rarugal 2024). However, providing this file with only basic explanations proved insufficient for the effective use of the data. The complexities of web archives, such as how they are captured and how different ‘targets’ (websites or sections) are selected, were unfamiliar to the researchers, highlighting the need for more contextual guidance. This led to the introduction of a ‘Datasheet for Datasets’. The datasheet approach was first introduced in computer science as a way of documenting datasets effectively (Gebru et al. 2018) but has been picked up by the web archiving community in recent years as a useful way of documenting datasets created in the field (Maemura et al. 2018). At the time of the project, only the general version introduced by Gebru was used to document the dataset, but a more tailored version is now available (Maemura & Byrne 2024). The British Library will be publishing more of these datasheets alongside their datasets in the future.
For the Archive of Tomorrow project, substantial discussion arose around the process of completing the datasheet. Key questions in the datasheet focused on foundational details such as the creation dates, dataset size, relationships between dataset instances, and considerations of rights and legislation. For the project team, this raised questions not initially viewed as particularly relevant, especially when discussing methodology with the fellows. For instance, understanding how web archives function and the variations in content selection became prominent issues. Depending on the capture context, a web archive may include an entire website, a specific section, or even just a single page. Initially, this contextual information was not provided to researchers, as it was considered intrinsic to web archives. However, upon reflection, understanding how a web archive is structured proved essential for researchers, particularly for those who may be unfamiliar with web archives, as it directly influences the scope and approach of their research.
The dataset and the datasheet can now be found on the Data Foundry, along with a link to a number of Notebooks demonstrating some basic analyses that help users understand the dataset better.
Repository location
Repository name
Data Foundry, National Library of Scotland
Object name
Talking about Health dataset
Format names and versions
The dataset is available in JSON and plain TXT formats
Creation dates
2022-02-01–2023-04-30
Dataset creators
Archive of Tomorrow team, National Library of Scotland
Language
English
Licence
CC-BY 4.0
Publication date
2023-04-24
Example Usage of the JSON Dataset
An illustrative application of the JSON dataset involved using it as a secondary-level pre-processing tool to explore the concepts of misinformation and filter bubbles during the pandemic. The workflow began with parsing the metadata of the “Talking about Health” collection to identify working links. URLs were then scraped for keywords and website summaries using natural language processing tools. Articles referenced in the metadata were extracted into a pandas DataFrame using the Newspaper3k package, which facilitated the analysis of fields such as keywords and summaries. To focus on pandemic-related articles, a combination of automated filtering and manual checks was implemented. Text searches employed the ten most frequent COVID-related terms, with results manually refined in OpenRefine to remove irrelevant entries before parsing the full text of the remaining articles. Additionally, articles were manually added based on fact-checking framework guidelines. Finally, LDA topic modelling was run on each credibility category (credible, questionable, non-credible) to uncover the most discussed themes within each group on the credibility scale. A best-fit topic modelling configuration was predicted and applied for each category using Python libraries such as nltk, spacy, and gensim, and the results were then visualised and plotted. This approach allowed the identification of dominant themes and the exploration of content within each category.
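As a minimal sketch of the automated filtering step described above, the function below performs a case-insensitive whole-word search against a list of COVID-related terms. The term list and record fields here are invented for illustration; the project derived its list from the ten most frequent COVID-related terms and refined the results manually in OpenRefine.

```python
# Illustrative term list; the project used the ten most frequent
# COVID-related terms found in the collection, not this exact set.
COVID_TERMS = {"covid", "coronavirus", "pandemic", "lockdown", "vaccine"}

def matches_covid_terms(text, terms=COVID_TERMS):
    """Case-insensitive whole-word match against the term list."""
    words = {w.strip(".,;:!?()\"'") for w in text.lower().split()}
    return not terms.isdisjoint(words)

# Invented article records standing in for Newspaper3k output.
articles = [
    {"title": "Vaccine rollout begins", "summary": "COVID vaccine news."},
    {"title": "Gardening tips", "summary": "Growing tomatoes at home."},
]

# Keep only articles whose title or summary mentions a pandemic term;
# in the project, this shortlist was then refined by hand in OpenRefine.
pandemic_related = [
    a for a in articles
    if matches_covid_terms(a["title"] + " " + a["summary"])
]

print([a["title"] for a in pandemic_related])
```

The whole-word matching (rather than substring search) avoids false positives such as ‘pandemics’ matching inside unrelated words, though any real filtering pass would still need the manual refinement stage the project describes.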
The results showed that credible sources withheld information due to the evolving nature of science and policy; non-credible sources cherry-picked experts and information, shaping alternative narratives; and questionable sources speculated and searched for answers in the absence of complete scientific information. This pilot research showed the potential of distant reading the information in web archive collections with the help of enhanced metadata, while also finding stories and smaller clusters in the vast collection for public engagement uses. Further information on these methods and the Notion board used during this project can be found on the Data Foundry page provided with the dataset.
This example research is one of the bases for facilitating the last iteration in our multi-level access model, public access. This approach emphasises participatory design principles to co-create tools that enable more awareness of web archives and seamless user interaction with their content. While this stage is still ongoing as part of the National Librarian’s Research Fellowship in Digital Scholarship 2024–25 at the National Library of Scotland (NLS), several key activities are underway to advance its implementation. The enhanced metadata and example research are being introduced at workshops hosted at the NLS and proposed at conferences on related topics. These workshops are designed to raise awareness of the resource and its potential applications, including demonstrations of the Jupyter Notebooks available on the Data Foundry. The Notebooks contain pre-configured Python scripts with tools to check links and verify whether websites in the archive are still available. They also provide tools for scraping titles, keywords, and summaries from the URLs available in the JSON files and for performing topic modelling with related visualisations. These features help researchers discover the potential content of the resources, offering valuable insights into their structure and themes as a starting point for their interaction with the archive. By showcasing these capabilities during workshops, participants are equipped with both conceptual and practical knowledge to explore the web archive effectively.
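The link-checking step in the Notebooks can be sketched roughly as follows. This is an illustration under stated assumptions rather than the Notebooks’ actual code: the `fetch` parameter is injectable so the status logic can be exercised without network access, and by default a HEAD request is issued with the standard library.

```python
from urllib.request import Request, urlopen
from urllib.error import URLError

def check_link(url, fetch=None):
    """Return True if the URL still responds with a success/redirect status.

    `fetch` takes a URL and returns an HTTP status code. It is injectable
    so the logic can be tested offline; the default issues a HEAD request.
    """
    if fetch is None:
        def fetch(u):
            req = Request(u, method="HEAD")
            with urlopen(req, timeout=10) as resp:
                return resp.status
    try:
        return 200 <= fetch(url) < 400
    except (URLError, OSError):
        return False

# A stubbed fetcher standing in for live websites, for demonstration.
stub_status = {"https://example.org/alive": 200,
               "https://example.org/gone": 404}
alive = [u for u in stub_status
         if check_link(u, fetch=lambda u: stub_status[u])]
print(alive)
```

In practice, a script of this shape would run over the URL field of the JSON metadata to separate still-live targets from those now only preserved in the archive, which is the triage step the workflow above begins with.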
Besides the Talking about Health collection, to further facilitate public interaction, the enhanced metadata from selected web archive collections is shared in higher educational settings with data visualisation students. These students contribute to the creative display of information available in the collections, presenting new ways to explore and understand the archive’s content. Additionally, the datasets are shared with data visualisation artist Dorsey Kaufmann, who is collaborating on an exhibition to present the archive in a more engaging and visually compelling format and to encourage storytelling with the archives.
Based on feedback collected during these interactions – whether from workshops, educational settings, or the exhibition – a new interface is being prototyped. This interface combines data visualisation and storytelling techniques to make the metadata dataset more accessible and engaging. By displaying the extensive web archive collection in a more visual and focused way, users are better able to grasp its content and perform more efficient catalogue searches tailored to their own needs. The enhanced metadata in CSV, the visualisations, and the interface are being gradually uploaded to the Data Foundry, where they will be available together with the corresponding JSON files. The groundwork laid in the Archiving and Research phases directly informs the design of these public-facing elements. This progression ensures that each step builds cohesively on the last, advancing from technical refinement to broader accessibility.
3. Conclusion
This paper suggests that by implementing a series of iterative steps – collecting data at scale, creating curated open datasets, extracting smaller subsets, performing close reading, and developing a public interface – the project has successfully enhanced the accessibility and engagement potential of the web archive collection for three key audiences: data users, general readers, and the digitally curious (Talboom & Underdown 2019).
For data users, the focus has been on developing the Datasheets for Datasets documentation, which facilitates the quantitative research of web archive collections by offering structured, detailed descriptions of the datasets. For general readers, the project has enhanced access by working closely with a copyright officer, ensuring that materials can be more effectively used for close reading within legal and ethical boundaries. For the digitally curious, the work undertaken by the Cambridge Digital Humanities fellows and the Jupyter Notebooks available on the Data Foundry page provide valuable tools and resources for exploring the metadata. Further work will be undertaken for the digitally curious in due course, and more resources will be added to the Data Foundry page as they are created.
This paper demonstrates that, although expanding access to datasets offers significant benefits, the process of preparing data for a data-driven audience can be complex and time-consuming. Nonetheless, the development of this dataset has provided the British Library with a valuable case study for advancing its data-sharing initiatives. Furthermore, ongoing efforts, such as the Datasheets for Datasets project, exemplify meaningful progress in this area.
Acknowledgements
We would like to thank the Archive of Tomorrow project team, the British Library, the National Library of Scotland, and Cambridge Digital Humanities for making the fellows’ participation possible.
Funding Information
The Archive of Tomorrow was funded by the Wellcome Trust. The fellows on the project were funded by Cambridge Digital Humanities. Dr Andrea Kocsis’s work on UK Web Archive access is funded by the National Librarian’s Research Fellowship in Digital Scholarship 2024–25.
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Leontien Talboom – data curation, writing – draft, writing – review & editing
Andrea Kocsis – formal analysis, writing – draft, writing – review & editing
