Providing Web Archive News Articles as Corpus Data

Jon Carlstedt Tønnessen; Magnus Breder Birkenes

doi:10.5334/johd.281

Abstract

While the vast data repositories of web archives hold great potential for knowledge production in academia, researchers have described significant challenges when trying to access and use them in research. This article describes the creation of a “Web News Collection”, in which content from the National Library of Norway’s web archive has been made available for computational text analysis in a manner that facilitates access for research and beyond, aligning with the FAIR principles while also accounting for copyright restrictions. Developing the warc2corpus pipeline, we detail the processes for extracting natural language from WARC files, curating content, and enhancing metadata for analytical purposes. This structured collection, consisting of 1.5 million news articles accessible via a REST API, enables distant reading of news from the web, with tools for building corpora and analysing word frequencies and collocations. To support usage, both programming interfaces and user-friendly web apps are offered, representing a significant step forward in making web archives usable and valuable for digital scholars.
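As a rough illustration of the kind of distant reading the abstract mentions, the sketch below counts word frequencies and window-based collocations over a tiny in-memory corpus. It is not the collection's REST API or the warc2corpus pipeline: the function names (`tokenize`, `word_frequencies`, `collocations`), the naive tokenizer, and the two sample sentences are all assumptions made for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately naive)."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(docs):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def collocations(docs, keyword, window=2):
    """Count tokens that occur within `window` positions of `keyword`."""
    keyword = keyword.lower()
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts


# Hypothetical sample texts standing in for extracted news articles.
docs = [
    "The web archive holds news from the web.",
    "News articles from the archive support research.",
]

frequencies = word_frequencies(docs)
neighbours = collocations(docs, "archive", window=1)

print(frequencies.most_common(3))
print(neighbours)
```

A real analysis over 1.5 million articles would need a language-aware tokenizer and a statistical association measure rather than raw co-occurrence counts; the sketch only shows the basic counting step.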

