Providing Web Archive News Articles as Corpus Data

Open Access | Jan 2025

Figures & Tables

Figure 1

The key steps of the warc2corpus pipeline.

Figure 2

Comparison of text before and after tokenisation. Text within square brackets [] indicates separate tokens.
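
The bracket convention in this caption can be illustrated with a minimal sketch. It assumes a simple regular-expression tokeniser that separates words from punctuation; this is not the tokeniser used in the warc2corpus pipeline, and the function names `tokenise` and `bracketed` are hypothetical.

```python
import re


def tokenise(text):
    """Split text into word tokens and single punctuation tokens.

    Assumption: a plain regex split, used only for illustration.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def bracketed(tokens):
    """Render each token inside square brackets, separated by spaces."""
    return " ".join(f"[{token}]" for token in tokens)


sample = "Hei, verden!"
print(sample)                       # text before tokenisation
print(bracketed(tokenise(sample)))  # [Hei] [,] [verden] [!]
```

Punctuation marks become tokens of their own, which is why the bracketed form contains more units than the whitespace-separated original.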

Norwegian Bokmål: 1 437 768 documents
Norwegian Nynorsk: 111 892 documents
Northern Sámi: 11 416 documents
Kven: 302 documents
Southern Sámi: 101 documents
Lule Sámi: 78 documents

NRK: 130 162
VG: 66 800
Forskning.no: 65 469
TV2: 55 367
Dagens næringsliv: 50 005
Dagbladet: 46 333
Finansavisen: 38 514
Adresseavisen: 33 640
Aftenposten: 31 075
Khrono: 29 794

Figure 3

Displaying the dhlab data frame in a notebook.

Figure 4

Visualising a corpus’ distribution of titles, harvesting dates and language.

Figure 5

Interface of the web application for collocation analysis. On the left, one can upload an Excel file with a corpus definition and adjust parameters. In the centre, one enters a keyword to retrieve its most frequent collocates, which are then output as a table and a word cloud.
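
The idea behind retrieving the most frequent collocates of a keyword can be sketched as follows. This is a simplified illustration that counts raw co-occurrences within a fixed window of tokens; it is not the web application's implementation, and the function name `collocates` and the `window` parameter are assumptions.

```python
from collections import Counter


def collocates(tokens, keyword, window=2):
    """Count tokens that occur within `window` positions of `keyword`.

    Assumption: raw co-occurrence counts, with no statistical weighting.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the keyword occurrence itself
                counts[tokens[j]] += 1
    return counts


tokens = "the bank raised the rate and the bank cut the rate".split()
for word, count in collocates(tokens, "bank", window=1).most_common():
    print(word, count)
```

A frequency table of this kind is what a table or word cloud view would then display; real collocation tools usually also weight counts against each word's overall frequency.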

DOI: https://doi.org/10.5334/johd.281 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 12, 2024
Accepted on: Dec 13, 2024
Published on: Jan 23, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Jon Carlstedt Tønnessen, Magnus Breder Birkenes, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.