Providing Web Archive News Articles as Corpus Data

Open Access | Jan 2025

Abstract

While the large data repositories of web archives hold great potential for knowledge production in academia, researchers have described significant challenges in accessing and using web archives in research. This article describes the creation of a “Web News Collection”, in which content from the National Library of Norway’s web archive has been made available for computational text analysis in a manner that facilitates access for research and beyond, aligning with FAIR principles while also accounting for copyright restrictions. We developed the warc2corpus pipeline and detail its processes for extracting natural language from WARC files, curating content, and enhancing metadata for analytical purposes. This structured collection, consisting of 1.5 million news articles accessible via a REST API, enables distant reading of news from the web, with tools for building corpora and computing word frequencies and collocations. To support usage, both programming interfaces and user-friendly web apps are offered, representing a significant step forward in making web archives usable and valuable for digital scholars.

DOI: https://doi.org/10.5334/johd.281 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 12, 2024
Accepted on: Dec 13, 2024
Published on: Jan 23, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Jon Carlstedt Tønnessen, Magnus Breder Birkenes, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.