
Journal Digital Corpus: Swedish Newsreel Transcriptions

Open Access | Aug 2025


(1) Overview

Repository location

https://doi.org/10.5281/zenodo.15596191

Context

This paper introduces the Journal Digital Corpus (JDC), the first published transcriptions of the Journal Digital collection: 5,214 videos spanning half a century, gathered from multiple Swedish video archives, including the most important newsreel archive in Sweden, the so-called SF archive. JDC was created as part of the Modern Times 1936 research project, which asks the question: “What is it that software sees, hears and perceives when technologies for pattern recognition are applied to media historical sources?”

The SF archive was bought by Swedish Radio (SR) in 1964 from Swedish Film Industry (SF). It was preserved, duplicated, catalogued and re-used in numerous television programs. In the late 1990s the SF archive was digitized, resulting in the Journal Digital collection. Most of the collection consists of the newsreel SF Veckorevy, screened weekly at cinemas all over Sweden from the early 1910s to the early 1960s. These recordings provide unique documentation of Swedish society, capturing a wide range of topics and events. In the early 1930s, the introduction of sound technology marked a pivotal evolution, with all SF newsreels subsequently incorporating voice-over commentary (Snickars, 2024). Much metadata about the collection is available through the Swedish Media Database (SMDB). However, the quality and detail of this metadata are uneven, and there are no available manuscripts or transcripts of the videos themselves.

JDC comprises transcriptions from 2,553 newsreels (2,229,854 words), and intertitles from 4,333 videos (302,342 words). The corpus is the product of two transcription libraries that we developed for this purpose: SweScribe (Aspenskog & Johansson, 2025) for speech (leveraging Automatic Speech Recognition), and stum (Johansson, 2025) for intertitles (leveraging Optical Character Recognition).

(2) Method

Steps

Creating JDC involved developing SweScribe and stum as Python libraries to transcribe speech and intertitles from Swedish 20th-century newsreels. Two different libraries were necessary because only half the videos contain speech; the other half is either completely silent or accompanied by music. Both libraries combine existing technologies into transcription pipelines suited to the Journal Digital collection. Both produce .srt files, a common plain-text subtitle format, making the two outputs compatible, searchable and interoperable.
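The shared .srt output format is easy to consume programmatically. Below is a minimal sketch of a parser for the format; the function name and sample text are illustrative and not part of SweScribe or stum:

```python
import re

def parse_srt(text):
    """Parse SRT subtitle text into (start, end, text) entries."""
    entries = []
    # SRT blocks are separated by blank lines: index, timestamp line, text lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        m = re.match(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})", lines[1])
        if not m:
            continue
        entries.append((m.group(1), m.group(2), " ".join(lines[2:])))
    return entries

sample = """1
00:00:01,000 --> 00:00:03,500
Sirenerna tjuter över Göteborg.

2
00:00:04,000 --> 00:00:06,000
Stor luftskyddsövning.
"""
print(parse_srt(sample))
```

Because both speech and intertitle files share this structure, a single parser like this can load either output for downstream search or analysis.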

SweScribe extracts the audio from the 2,553 newsreels with speech, transcribes it, and aligns the text to the audio (creating timestamps). Transcription is carried out in WhisperX (Bain et al. 2023) using a Whisper model (from OpenAI), with a wav2vec2 model (Malmsten et al. 2022) to realign the transcriptions with the audio. Both models are fine-tuned for Swedish. Our early tests showed how capable WhisperX is, but also helped us pinpoint some of its weaknesses. For example, it has difficulty transcribing speech in noisy environments or with overlapping speakers, and it tends to hallucinate in both predictable and unpredictable ways: when speech is unintelligible, it often fills in the gaps with plausible, yet incorrect, information. It also produces artefacts like “Textning Stina Hedin www.btistudios.com.” We suspect that this is most likely an instance of the model ‘leaking’ from its training data: the model was probably trained on publicly available subtitles, such as those accompanying downloadable films and TV series. Additional errors corrected by the cleaning process include isolated lines containing seemingly random words, for instance “Men.” (“But.”) and “För.” (“For.”), as well as repetitive sequences like “Musik”, “Musik Musik”, and “Musik Musik Musik” in segments featuring music.
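A cleaning step like the one described can be sketched as a rule-based filter. The patterns below are modelled on the artefacts mentioned above, not taken from SweScribe's actual code:

```python
import re

# Hypothetical patterns modelled on the artefacts described in the text:
# subtitle-credit leaks, isolated filler words, and repeated "Musik" lines.
ARTEFACT_PATTERNS = [
    re.compile(r"^Textning .*$", re.IGNORECASE),  # subtitle credits leaked from training data
    re.compile(r"^(Men|För)\.$"),                 # isolated filler words
    re.compile(r"^(Musik\s*)+$", re.IGNORECASE),  # repeated "Musik" in music segments
]

def clean_lines(lines):
    """Drop transcript lines matching known artefact patterns."""
    return [l for l in lines if not any(p.match(l.strip()) for p in ARTEFACT_PATTERNS)]

raw = [
    "Kungen besöker Göteborg.",
    "Textning Stina Hedin www.btistudios.com",
    "Musik Musik Musik",
    "Men.",
]
print(clean_lines(raw))  # only the first line survives
```

Each new artefact type discovered during ground-truth correction can be added as one more pattern, which is what makes such a filter easy to extend and test.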

We used stum to process all videos, creating individual .srt files for the 4,333 videos with intertitle texts, totalling 302,342 words. Notably, intertitles have a lower capacity for encoding text than a voice narrator but remain a vital source of information about the content of the videos. Moreover, they are the only way to cover the earlier (silent) years. The library relies on OpenCV2 (Bradski, 2000) and TesseractOCR (Smith, 2007) to detect and extract intertitle texts. OCR quality for all the tested intertitles is high: the highest Character Error Rate we have encountered is below 7%, part of which stems from the recurring ‘SF’ logo. The main problem to solve was intertitle detection. The first obstacle is that some intertitles are a single frame long (see Image 1), meaning that we must extract every single frame. stum groups consecutive frames that show little pixel-level change; similar frames are treated as a single segment, from which we use the middle frame for intertitle detection. Intertitles in this collection typically have a white background, but a significant portion have a dark background, and a further portion show the text on top of the running video. We focus on the first two types and use a combination of calculating the relative size of the largest contour (black or white) and an EAST model (Zhou et al. 2017) for detecting characters. Frames deemed relevant by both of these (somewhat greedy) approaches are then passed through the OCR engine both in their stored form and as a mirror image, since many intertitles are mirrored. The resulting text output is combined with the timestamp calculated from the frames of the intertitle’s segment.
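The frame-grouping idea can be illustrated in a few lines. This is a toy sketch with an invented threshold, not stum's implementation:

```python
import numpy as np

def group_frames(frames, threshold=5.0):
    """Group consecutive frames whose mean absolute pixel difference is low,
    returning (start, end) index pairs for each segment."""
    segments = []
    start = 0
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i].astype(float) - frames[i - 1].astype(float)))
        if diff > threshold:  # large change: a cut, so close the segment
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(frames) - 1))
    return segments

def middle_frames(frames, segments):
    """Pick the middle frame of each segment for intertitle detection."""
    return [frames[(a + b) // 2] for a, b in segments]

# Toy example: three dark frames, then three bright frames (a "cut").
frames = [np.zeros((4, 4), np.uint8)] * 3 + [np.full((4, 4), 200, np.uint8)] * 3
print(group_frames(frames))  # → [(0, 2), (3, 5)]
```

Running detection on one middle frame per segment, rather than on every frame, is what keeps per-frame extraction tractable even for single-frame intertitles.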

Image 1

Three consecutive frames from SF604A.1, out of which only one shows the full (mirrored) intertitle.

Quality control

During the development of SweScribe, we used CI (Continuous Integration) to run quality controls in the form of WER (Word Error Rate) calculations, letting us track changes in performance every time we changed the codebase and scaled up the number of videos. In this check we artificially cut out segments of the videos with unintelligible speech or speech in a foreign language. The first batch of Ground Truth (GT) transcriptions consisted of 27 manual transcriptions from a previous experiment (Lagerlöf, 2022).
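WER itself is a standard metric: the word-level edit distance between the hypothesis and the ground truth, divided by the number of reference words. A minimal implementation (illustrative, not the CI code):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic programming over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("sirenerna tjuter över göteborg", "sirenerna tjuter över göteborg"))  # 0.0
print(wer("sirenerna tjuter över göteborg", "sirenerna juter över borg"))       # 0.5
```

Tracking this number in CI after every change to the cleaning pipeline makes performance regressions visible immediately.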

We expanded the GT set in three steps, each time drawing a random sample of 27 videos, processing them with SweScribe and correcting the output. This is both significantly quicker than transcribing entirely manually and a way to discover what types of errors the pipeline tends to produce. New types of errors that we could deal with algorithmically were added to the cleaning steps and tested before drawing a new sample. We drew the first sample from the entire corpus, the second from 1936 (a year in which we have a particular interest), and the final one from the 1930s. After filtering out the videos without speech, this increased our GT set by 18, 23, and 21 transcriptions respectively, resulting in 89 transcriptions totalling 72,812 words. The WER for the initial GT transcripts was 17%. Following the development and implementation of the cleaning step, the WER was reduced to 10.6%. The WER for the final dataset of 89 GT transcripts was 7%.

For the WER evaluation, only sequences in Swedish are used. Foreign segments—predominantly in English and German—were excluded to align with the project’s research scope. Furthermore, WhisperX demonstrated unpredictable handling of foreign-language sequences: some segments get transcribed in their original language, others in Swedish. The transcription quality in either case varies greatly.

A considerable share of the remaining transcription errors involves proper names. While the model generally performs well with toponyms, it struggles with other proper names that deviate from conventional, modern spellings—for instance, transcribing “Silverhjälm” in place of “Silfverhielm”. Due to the numerous spelling variations of names, we do not attempt to clean these. Consequently, despite the phonetic accuracy of such transcriptions, they contribute to the overall WER.

Similarly, for stum, we relied on continuous testing during development. However, since the testing procedure necessitates access to a large number of images, and the occasional video, it is not suitable for running in CI; we therefore relied on a combination of pre-commit and PyTest (Krekel et al., 2004) to run the test suite on every commit. As new types of problems were discovered, representative frames were added to the testing data. The testing data consists of frames and transcriptions of intertitles with white (15) and black (14) backgrounds and non-intertitle frames (54).
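A minimal sketch of the kind of PyTest checks described, using a hypothetical stand-in predicate for the detector (the real suite runs on stored frames and transcriptions):

```python
# Hypothetical predicate standing in for stum's intertitle detector.
def is_intertitle(frame_mean_brightness, has_text):
    """Toy stand-in: bright frames with detected text count as intertitles."""
    return frame_mean_brightness > 128 and has_text

# pytest collects functions named test_*; each assert is one check,
# and pre-commit runs the suite on every commit.
def test_white_intertitle_detected():
    assert is_intertitle(200, True)

def test_dark_non_intertitle_rejected():
    assert not is_intertitle(40, False)
```

Adding a newly discovered problem frame to the test data then amounts to writing one more such function against the stored frame.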

(3) Dataset Description

Repository name

Zenodo

Object name

At the time of writing, the Journal Digital Corpus consists of 6,886 ‘*.srt’ files, each named after the corresponding video file1 that it transcribes (with separate files for speech and intertitles).

Format names and versions

SRT

Creation dates

2024-09-01–2025-06-04

Dataset creators

Mathias Johansson, Systems developer, Lund University

Robert Aspenskog, Research assistant, Lund University

Language

Swedish

License

CC-BY-NC 4.0

Publication date

2025-06-04

(4) Reuse Potential

Journal Digital Corpus covers a period of immense social change and offers an unparalleled window into 20th-century Swedish society, resulting in considerable reuse potential for scholars across several fields. Researchers can analyse shifts in public discourse, media framing, linguistic variation, and national identity, among other themes, on a large scale. We can now employ a variety of distant reading (Moretti, 2000) methods, such as topic modelling, Retrieval Augmented Generation, Regular Expressions or word frequencies.
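The simplest of these methods, word frequencies, takes only a few lines over the transcript texts (the sample strings below are illustrative):

```python
import re
from collections import Counter

def word_frequencies(texts, top=5):
    """Count word frequencies across a list of transcript strings."""
    counter = Counter()
    for text in texts:
        # \w+ is Unicode-aware in Python 3, so Swedish å/ä/ö are kept.
        counter.update(re.findall(r"\w+", text.lower()))
    return counter.most_common(top)

transcripts = [
    "Sirenerna tjuter över Göteborg",
    "Stor luftskyddsövning i Göteborg",
]
print(word_frequencies(transcripts, top=2))
```

The same loop, pointed at the full set of .srt files, scales directly to corpus-wide frequency analysis.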

Through a simple text search, we traced the quote “Sirenerna tjuter över Göteborg” from an unspecified Veckorevyn newsreel, cited by Johansson (2022, p. 27), to a segment entitled “Stor luftskyddsövning i Göteborg” in SF949A.1 (see Image 2).2

Image 2

Intertitle of the newsreel segment and the quoted sentence from SF949A.1.

Passing the texts through a chain of existing tools,3 we can quickly identify a concentration of place mentions along the northern coast (Image 3), particularly in cities (Table 1). The related transcripts portray the north as untamed and the modern railway infrastructure as an economic,4 cultural and civilising force.5

Image 3

Heatmap of locations, showing concentration of attention on Sweden’s northern coast.

Table 1

Frequencies of the mentioned cities from northern Sweden.

LOCALITY | MENTIONS
Umeå | 82
Sundsvall | 28
Örnsköldsvik | 25
Skellefteå | 24
Kramfors | 21
Luleå | 20
Boliden | 19
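The shape of Table 1 can be reproduced by tallying the place names emitted by an NER step; the mention list below is invented for illustration (the real pipeline uses KBLab's NER model and a geocoder):

```python
from collections import Counter

# Hypothetical output of a Named Entity Recognition step:
# one place name per detected mention in the transcripts.
mentions = ["Umeå", "Sundsvall", "Umeå", "Luleå", "Umeå", "Sundsvall"]

# Tallying mentions per locality reproduces the shape of Table 1.
for place, count in Counter(mentions).most_common():
    print(f"{place}\t{count}")
```

Feeding the tallied localities to a geocoder then yields the coordinates behind a heatmap like Image 3.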

The corpus includes distribution years for nearly all videos, which can easily be combined with data generated by methods such as those exemplified above, unlocking many avenues for temporal analysis.
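Such a temporal aggregation can be sketched as follows, with invented (year, count) records standing in for real pipeline output:

```python
from collections import defaultdict

# Hypothetical records: (distribution_year, mention_count) pairs, e.g. from
# combining the corpus's year metadata with per-video NER counts.
records = [(1936, 4), (1936, 2), (1945, 1), (1952, 3), (1952, 2)]

def mentions_per_year(records):
    """Aggregate mention counts by distribution year for temporal analysis."""
    totals = defaultdict(int)
    for year, count in records:
        totals[year] += count
    return dict(sorted(totals.items()))

print(mentions_per_year(records))  # {1936: 6, 1945: 1, 1952: 5}
```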

Journal Digital Corpus,6 SweScribe7 and stum8 are built with open-source libraries, and their sources are published under a CC-BY-NC 4.0 license on both PyPI and GitHub so that users can suggest improvements. We consider this the first version of the corpus.

Notes

[1] By simply adding the ‘.srt’ suffix to the full filename we can differentiate between the archiving formats.

[2] Audio and video quality affect transcription quality.

[3] Named Entity Recognition model (KBLab, 2022), a geocoder (Nominatim developer community, 2024) and visualiser (Filipe et al. 2022).

[4] Connected to the booming forest and mining industries. See, for instance, Coates & Holroyd (2021).

Acknowledgements

We gratefully acknowledge Ester Lagerlöf for her transcriptions of the first batch of ground truth files during her involvement with the project in 2022. We also thank our colleagues within Modern Times 1936, Fredrik Mohammadi Norén and Emil Stjernholm, who contributed with their valuable ideas, advice and feedback.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Robert Aspenskog: Conceptualization, Data Curation, Methodology, Software

Mathias Johansson: Conceptualization, Data Curation, Methodology, Supervision, Software

Pelle Snickars: Conceptualization, Supervision, Principal Investigator

DOI: https://doi.org/10.5334/johd.344 | Journal eISSN: 2059-481X
Language: English
Submitted on: Jun 5, 2025
Accepted on: Jul 16, 2025
Published on: Aug 4, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Robert Aspenskog, Mathias Johansson, Pelle Snickars, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.