Providing Web Archive News Articles as Corpus Data

Open Access | Jan 2025

References

  1. Barbaresi, A. (2020). htmldate: A Python package to extract publication dates from web pages. Journal of Open Source Software, 5(51), 2439. DOI: 10.21105/joss.02439
  2. Beelen, K., Lawrence, J., Wilson, D. C. S., & Beavan, D. (2023). Bias and representativeness in digitized newspaper collections: Introducing the environmental scan. Digital Scholarship in the Humanities, 38(1), 1–22.
  3. Bingham, N. J., & Byrne, H. (2021). Archival strategies for contemporary collecting in a world of big data: Challenges and opportunities with curating the UK web archive. Big Data & Society, 8(1), 2053951721990409. DOI: 10.1177/2053951721990409
  4. Birkenes, M. B., Johnsen, L., & Kåsen, A. (2023, October 16–18). NB DH-LAB: A corpus infrastructure for social sciences and humanities computing. CLARIN Annual Conference Proceedings (pp. 30–34). Leuven, Belgium.
  5. Brügger, N. (2021). The need for research infrastructures for the study of web archives. In: D. Gomes, E. Demidova, J. Winters, & T. Risse (Eds.), The past web: Exploring web archives (pp. 217–224). Springer International Publishing. DOI: 10.1007/978-3-030-63291-5_17
  6. Brügger, N., & Schroeder, R. (2017). Web as History: Using Web Archives to Understand the Past and the Present. UCL Press. DOI: 10.2307/j.ctt1mtz55k
  7. Cadavid, J. A. P. (2014). Copyright challenges of legal deposit and web archiving in the National Library of Singapore. Alexandria: The Journal of National and International Library and Information Issues, 25(1–2), 1–19. DOI: 10.7227/ALX.0017
  8. Candela, G., Chambers, S., & Sherratt, T. (2023a). An approach to assess the quality of Jupyter projects published by GLAM institutions. Journal of the Association for Information Science and Technology, 74(13), 1550–1564. DOI: 10.1002/asi.24835
  9. Candela, G., Gabriëls, N., Chambers, S., Dobreva, M., Ames, S., Ferriter, M., Fitzgerald, N., Harbo, V., Hofmann, K., & Holownia, O. (2023b). A checklist to publish collections as data in GLAM institutions. Global Knowledge, Memory and Communication. DOI: 10.1108/GKMC-06-2023-0195
  10. Carroll, S. R., Garba, I., Figueroa-Rodríguez, O. L., Holbrook, J., Lovett, R., Materechera, S., Parsons, M., Raseroka, K., Rodriguez-Lonebear, D., Rowe, R., Sara, R., Walker, J. D., Anderson, J., & Hudson, M. (2020). The CARE Principles for Indigenous Data Governance. Data Science Journal, 19(43), 1–12. DOI: 10.5334/dsj-2020-043
  11. Cavnar, W. B., & Trenkle, J. M. (1994). N-Gram-Based Text Categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (pp. 161–175).
  12. DH-lab. (2024). Apper fra DH-lab. National Library of Norway. https://www.nb.no/dh-lab/apper/ (last accessed: 2024.10.04).
  13. Familie- og kulturkomiteen. (2015). Innstilling fra familie- og kulturkomiteen om endringer i lov om avleveringsplikt for allment tilgjengelige dokumenter (innsamling av digitale dokumenter m.m.) [Innst. 286 L – 2014-2015].
  14. International Internet Preservation Consortium. (2022). WARC Specifications – The WARC Format 1.1. GitHub. https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/ (last accessed: 2024.10.04).
  15. International Organisation for Standardisation. (2017). Information and documentation — WARC file format (ISO Standard No. 28500:2017). https://www.iso.org/standard/68004.html (last accessed: 2024.10.04).
  16. Maemura, E. (2023). All WARC and no playback: The materialities of data-centred web archives research. Big Data & Society, 10(1), 1–14. DOI: 10.1177/20539517231163172
  17. Manning, C., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press. DOI: 10.1017/CBO9780511809071
  18. Milligan, I. (2016). Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives. International Journal of Humanities and Arts Computing, 10(1), 78–94. DOI: 10.3366/ijhac.2016.0161
  19. National Library of Norway. (2024a). Web news collection. Norwegian Ministry of Culture, National Library of Norway. https://www.nb.no/en/collection/web-archive/research/web-news-corpus/ (last accessed: 2024.10.04).
  20. National Library of Norway. (2024b). Digital Preservation. Archived on 2024.09.06, 13:19. https://nettarkivet.nb.no/search/20240906131954/https://www.nb.no/en/digital-preservation/
  21. NOU. (2022: 9). En åpen og opplyst offentlig samtale: Ytringsfrihetskommisjonens utredning. The Norwegian Government Security and Service Organisation. https://urn.nb.no/URN:NBN:no-nb_digibok_2024022748039
  22. Padilla, T. (2017). On a Collections as Data Imperative. UC Santa Barbara. https://escholarship.org/uc/item/9881c8sv
  23. Pomikálek, J. (2011). Removing Boilerplate and Duplicate Content from Web Corpora. Dissertation, Masaryk University, Brno. https://is.muni.cz/th/45523/fi_d/phdthesis.pdf
  24. Ruest, N., Fritz, S., & Milligan, I. (2022). Creating order from the mess: Web archive derivative datasets and notebooks. Archives and Records, 43(3), 316–331. DOI: 10.1080/23257962.2022.2100336
  25. Sanderson, M., Christopher, D., & Manning, H. (2010). Introduction to information retrieval. Natural Language Engineering, 16(1), 100. DOI: 10.1017/S1351324909005129
  26. Schäfer, R., & Bildhauer, F. (2013). Web Corpus Construction. Morgan & Claypool. DOI: 10.1007/978-3-031-02152-7
  27. Schafer, V., & Winters, J. (2021). The values of web archives. International Journal of Digital Humanities, 2(1), 129–144. DOI: 10.1007/s42803-021-00037-0
  28. Tønnessen, J. C. (2022). Register of Norwegian News Websites (2005–21) [Data set]. GitHub. https://github.com/nlnwa/map-norwebnews/blob/main/docs/norwebnews/README.md (last accessed: 2024.10.04).
  29. Tønnessen, J. C. (2024a). Navigating the scale of web archives: Leveraging data for research [Manuscript submitted for publication].
  30. Tønnessen, J. C. (2024b). autoCat2 [Software]. GitHub. https://github.com/nlnwa/autoCat-notebooks (last accessed: 2024.10.04).
  31. Tønnessen, J. C. (2024c). nlnwa-notebooks [Software]. GitHub. https://github.com/nlnwa/nlnwa-notebooks/blob/main/notebooks/corpus/nettavis-tekstanalyse.ipynb (last accessed: 2024.10.04).
  32. Tønnessen, J. C., & Langvann, T. (2024). Towards multi-layered access with automatic classification [Paper presentation]. Web Archives in Context: IIPC Web Archiving Conference 2024, Paris, France. https://www.youtube.com/watch?v=paubSvWqfC4
  33. Vlassenroot, E., Chambers, S., Di Pretoro, E., Geeraert, F., Haesendonck, G., Michel, A., & Mechant, P. (2019). Web archives as a data resource for digital scholars. International Journal of Digital Humanities, 1(1), 85–111. DOI: 10.1007/s42803-019-00007-7
  34. Vogels, T., Ganea, O.-E., & Eickhoff, C. (2018). Web2text: Deep structured boilerplate removal (pp. 167–179). DOI: 10.1007/978-3-319-76941-7_13
DOI: https://doi.org/10.5334/johd.281 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 12, 2024
Accepted on: Dec 13, 2024
Published on: Jan 23, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Jon Carlstedt Tønnessen, Magnus Breder Birkenes, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.