References
- Barbaresi, A. (2020). htmldate: A Python package to extract publication dates from web pages. Journal of Open Source Software, 5(51), 2439. DOI: 10.21105/joss.02439
- Beelen, K., Lawrence, J., Wilson, D. C. S., & Beavan, D. (2023). Bias and representativeness in digitized newspaper collections: Introducing the environmental scan. Digital Scholarship in the Humanities, 38(1), 1–22.
- Bingham, N. J., & Byrne, H. (2021). Archival strategies for contemporary collecting in a world of big data: Challenges and opportunities with curating the UK web archive. Big Data & Society, 8(1), 2053951721990409. DOI: 10.1177/2053951721990409
- Birkenes, M. B., Johnsen, L., & Kåsen, A. (2023, October 16–18). NB DH-LAB: a corpus infrastructure for social sciences and humanities computing. CLARIN Annual Conference Proceedings (pp. 30–34). Leuven, Belgium.
- Brügger, N. (2021). The need for research infrastructures for the study of web archives. In: D. Gomes, E. Demidova, J. Winters, & T. Risse (Eds.), The past web: Exploring web archives (pp. 217–224). Springer International Publishing. DOI: 10.1007/978-3-030-63291-5_17
- Brügger, N., & Schroeder, R. (2017). Web as History: Using Web Archives to Understand the Past and the Present. UCL Press. DOI: 10.2307/j.ctt1mtz55k
- Cadavid, J. A. P. (2014). Copyright challenges of legal deposit and web archiving in the National Library of Singapore. Alexandria Journal of National and International Library and Information Issues, 25(1–2), 1–19. DOI: 10.7227/ALX.0017
- Candela, G., Chambers, S., & Sherratt, T. (2023a). An approach to assess the quality of Jupyter projects published by GLAM institutions. Journal of the Association for Information Science and Technology, 74(13), 1550–1564. DOI: 10.1002/asi.24835
- Candela, G., Gabriëls, N., Chambers, S., Dobreva, M., Ames, S., Ferriter, M., Fitzgerald, N., Harbo, V., Hofmann, K., & Holownia, O. (2023b). A checklist to publish collections as data in GLAM institutions. Global Knowledge, Memory and Communication. DOI: 10.1108/GKMC-06-2023-0195
- Carroll, S. R., Garba, I., Figueroa-Rodríguez, O. L., Holbrook, J., Lovett, R., Materechera, S., Parsons, M., Raseroka, K., Rodriguez-Lonebear, D., Rowe, R., Sara, R., Walker, J. D., Anderson, J., & Hudson, M. (2020). The CARE Principles for Indigenous Data Governance. Data Science Journal, 19(43), 1–12. DOI: 10.5334/dsj-2020-043
- Cavnar, W. B., & Trenkle, J. M. (1994). N-Gram-Based Text Categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (pp. 161–175).
- DH-lab. (2024). Apper fra DH-lab. National Library of Norway. https://www.nb.no/dh-lab/apper/ (last accessed: 2024.10.04).
- Familie- og kulturkomiteen. (2015). Innstilling fra familie- og kulturkomiteen om endringer i lov om avleveringsplikt for allment tilgjengelige dokumenter (innsamling av digitale dokumenter m.m.) [Innst. 286 L – 2014-2015].
- International Internet Preservation Consortium. (2022). WARC Specifications – The WARC Format 1.1. GitHub. https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/ (last accessed: 2024.10.04).
- International Organisation for Standardisation. (2017). Information and documentation — WARC file format (ISO Standard No. 28500:2017). https://www.iso.org/standard/68004.html (last accessed: 2024.10.04).
- Maemura, E. (2023). All WARC and no playback: The materialities of data-centred web archives research. Big Data & Society, 10(1), 1–14. DOI: 10.1177/20539517231163172
- Manning, C., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press. DOI: 10.1017/CBO9780511809071
- Milligan, I. (2016). Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives. International Journal of Humanities and Arts Computing, 10(1), 78–94. DOI: 10.3366/ijhac.2016.0161
- National Library of Norway. (2024a). Web news collection. Norwegian Ministry of Culture, National Library of Norway. https://www.nb.no/en/collection/web-archive/research/web-news-corpus/ (last accessed: 2024.10.04).
- National Library of Norway. (2024b). Digital Preservation. Archived on 2024.09.06, 13:19. https://nettarkivet.nb.no/search/20240906131954/https://www.nb.no/en/digital-preservation/
- NOU. (2022: 9). En åpen og opplyst offentlig samtale: Ytringsfrihetskommisjonens utredning. The Norwegian Government Security and Service Organisation. https://urn.nb.no/URN:NBN:no-nb_digibok_2024022748039
- Padilla, T. (2017). On a Collections as Data Imperative. UC Santa Barbara. https://escholarship.org/uc/item/9881c8sv
- Pomikálek, J. (2011). Removing Boilerplate and Duplicate Content from Web Corpora. Dissertation, Masaryk University, Brno. https://is.muni.cz/th/45523/fi_d/phdthesis.pdf
- Ruest, N., Fritz, S., & Milligan, I. (2022). Creating order from the mess: Web archive derivative datasets and notebooks. Archives and Records, 43(3), 316–331. DOI: 10.1080/23257962.2022.2100336
- Sanderson, M., Christopher, D., & Manning, H. (2010). Introduction to information retrieval. Natural Language Engineering, 16(1), 100. DOI: 10.1017/S1351324909005129
- Schäfer, R., & Bildhauer, F. (2013). Web Corpus Construction. Morgan & Claypool. DOI: 10.1007/978-3-031-02152-7
- Schafer, V., & Winters, J. (2021). The values of web archives. International Journal of Digital Humanities, 2(1), 129–144. DOI: 10.1007/s42803-021-00037-0
- Tønnessen, J. C. (2022). Register of Norwegian News Websites (2005–21) [Data set]. GitHub. https://github.com/nlnwa/map-norwebnews/blob/main/docs/norwebnews/README.md (last accessed: 2024.10.04).
- Tønnessen, J. C. (2024a). Navigating the scale of web archives: Leveraging data for research [Manuscript submitted for publication].
- Tønnessen, J. C. (2024b). autoCat2 [Software]. GitHub. https://github.com/nlnwa/autoCat-notebooks (last accessed: 2024.10.04).
- Tønnessen, J. C. (2024c). nlnwa-notebooks [Software]. GitHub. https://github.com/nlnwa/nlnwa-notebooks/blob/main/notebooks/corpus/nettavis-tekstanalyse.ipynb (last accessed: 2024.10.04).
- Tønnessen, J. C., & Langvann, T. (2024). Towards multi-layered access with automatic classification [Paper presentation]. Web Archives in Context: IIPC Web Archiving Conference 2024, Paris, France. https://www.youtube.com/watch?v=paubSvWqfC4
- Vlassenroot, E., Chambers, S., Di Pretoro, E., Geeraert, F., Haesendonck, G., Michel, A., & Mechant, P. (2019). Web archives as a data resource for digital scholars. International Journal of Digital Humanities, 1(1), 85–111. DOI: 10.1007/s42803-019-00007-7
- Vogels, T., Ganea, O.-E., & Eickhoff, C. (2018). Web2text: Deep structured boilerplate removal (pp. 167–179). DOI: 10.1007/978-3-319-76941-7_13
