Providing Web Archive News Articles as Corpus Data

Journal of Open Humanities Data

Volume 11 (2025): Issue 1

By: Jon Carlstedt Tønnessen and Magnus Breder Birkenes

Open Access

Jan 2025

Figures & Tables

The key steps of the warc2corpus pipeline.

Comparison of text before and after tokenisation. Text within square brackets [] indicates separate tokens.

Norwegian Bokmål:	1 437 768 documents
Norwegian Nynorsk:	111 892 documents
Northern Sámi:	11 416 documents
Kven:	302 documents
Southern Sámi:	101 documents
Lule Sámi:	78 documents

NRK:	130 162
VG:	66 800
Forskning.no:	65 469
TV2:	55 367
Dagens Næringsliv:	50 005
Dagbladet:	46 333
Finansavisen:	38 514
Adresseavisen:	33 640
Aftenposten:	31 075
Khrono:	29 794

Displaying the dhlab data frame in a notebook.

Visualising a corpus’ distribution of titles, harvesting dates and language.

Interface of the web application for collocation analysis. On the left, one can upload an Excel file with a corpus definition and adjust parameters. In the centre, one enters a keyword to retrieve its most frequent collocates, which are then output as a table and a word cloud visualisation.

DOI: https://doi.org/10.5334/johd.281 | Journal eISSN: 2059-481X

Language: English

Submitted on: Nov 12, 2024

Accepted on: Dec 13, 2024

Published on: Jan 23, 2025

Published by: Ubiquity Press

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords: digital text analysis, metadata enhancement

© 2025 Jon Carlstedt Tønnessen, Magnus Breder Birkenes, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.
