The Spoken Corpus of the Southern Dutch Dialects (Gesproken Corpus van de zuidelijk-Nederlandse Dialecten)

Anne Breitbarth; Melissa Farasyn; Anne-Sophie Ghyselen; Lien Hellebaut; Frederic Lamsens; Katrien Depuydt; Jesse de Does; Jan Niestadt; Koen Mertens

doi:10.5334/johd.536

1 Overview

Repository location

Freely accessible material is hosted at the OSF repository (https://doi.org/10.17605/OSF.IO/RHU3N). The corpus itself is accessible through CLARIN (https://hdl.handle.net/10032/tm-a2-z8) under CC-BY-SA for non-commercial research. Researchers who wish to use the data but do not have an institutional login can still gain access to the application through a facility that CLARIN provides for unaffiliated users.

Context

The Spoken Corpus of the Southern Dutch Dialects (Gesproken Corpus van de zuidelijk-Nederlandse Dialecten; GCND) is a linguistically annotated (POS-tagged and parsed) audio-aligned corpus of 653 transcribed recordings of spoken Southern Dutch dialects. Of these recordings, 550 were made between 1963 and 1976 at Ghent University and are part of the “Voices from the Past” collection (Stemmen uit het Verleden; Van Keymeulen et al. 2019).¹ These recordings come from places in the Southern Dutch dialect areas in France and Belgium: French Flanders, West Flanders, East Flanders, Zeelandic Flanders (partly in The Netherlands), Flemish Brabant and (Belgian) Limburg. Another 73 recordings, made in roughly the same period, come from the Dutch dialect database (Nederlandse Dialectenbank) hosted at the Meertens Institute.² They cover the places in the territory of the Netherlands that, in terms of relatedness, still belong to the Southern Dutch dialects: Zeelandic Flanders, North Brabant, and (Dutch) Limburg. These 73 places were selected because they are also sampling places for the Syntactic Atlas of the Dutch Dialects (Syntactische Atlas van de Nederlandse Dialecten, SAND, Barbiers et al. 2005, 2008). Finally, 30 new recordings were made between 2020 and 2024 in 27 SAND sampling places in Brussels, Flemish Brabant and Limburg for which the Voices from the Past collection did not yet contain recordings. Figure 1 shows the distribution of all sampling places in the GCND.

Figure 1: Distribution of the sampling places in the GCND.

The GCND was produced with the help of a medium-scale research infrastructure grant from the Flemish Research Foundation (Fonds voor Wetenschappelijk Onderzoek–Vlaanderen; FWO).

2 Method

Steps

Audio recording. The GCND combines historical recordings from the Voices from the Past collection and the Meertens Institute with newly collected material. The 550 Ghent University recordings and the 73 recordings from the Dutch Dialect Database were originally made between 1963 and 1976 using reel-to-reel recorders (Telefunken M25 or Revox F-36) on BASF LGS 35 tapes at 19 cm/s (Van Keymeulen et al., 2019).³ The 30 new recordings (2020–2024) were made digitally using a Marantz PMD recorder and a Rode HS microphone.
Transcription. All recordings were transcribed in ELAN (Max Planck Institute for Psycholinguistics Nijmegen, The Language Archive, n.d.) using the two-tier orthographic protocol described in Ghyselen, Breitbarth, et al. (2020); Ghyselen, Van Keymeulen, et al. (2020). Regular phonetic variation was normalized, while dialect-specific lexical items were retained. The first tier preserves dialect function words in normalized spelling (e.g. me, wunder ‘we’), and clitic clusters are transcribed together, with #-marks to separate their parts (e.g. maa#bè#ja#m ‘but of course’, lit. ‘but interjection yes we’). The second tier standardizes function words and separates clitics into individual tokens while maintaining identical token order and full time alignment to the audio (e.g. maar bè ja wij ‘but interjection yes we’). For a more detailed discussion and motivation of these transcription choices, see Ghyselen, Van Keymeulen, et al. (2020); Ghyselen, Breitbarth, et al. (2020).
Annotation. The second tier was further linguistically annotated. POS tags, lemmata, and syntactic parses were produced with the Alpino parser (Van der Beek et al., 2002; van Noord, 2006), following a preprocessing step in which dialectal structures problematic for Alpino were masked.⁴ In post-processing, these masked elements were correctly reintegrated into the parse using the TrEd editor (Pajas & Stepanek, 2008). Parsed sentences remain aligned with the corresponding audio segments through ELAN timestamps.
Publication. The corpus was published in October 2024 via the Dutch Language Institute (INT). Users can search transcription tiers, metadata, POS tags, and syntactic structures through a dual platform consisting of BlackLab (for token-based search)⁵ and GrETEL (for treebank queries) (Augustinus et al., 2017). A “Help” page is included in the search interface to assist new users in composing queries.
Expansion. Annotation refinement continues. Through an ongoing FWO infrastructure project (I002124N; 2024–2028), the GCND is being expanded to also include northern Dutch dialects. As a result, it will evolve from a Spoken Corpus of the Southern Dutch Dialects into a broader Spoken Corpus of Dutch Dialects (GCND+).
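The relation between the two transcription tiers described under Transcription can be illustrated with a minimal Python sketch. It assumes that a first-tier clitic cluster is a single string whose parts are separated by #, and that the second tier supplies exactly one standardized token per part, in the same order. The function name align_tiers and the pairing logic are illustrative assumptions, not the project's actual tooling.

```python
def align_tiers(tier1_tokens, tier2_tokens):
    """Pair every first-tier part with its second-tier token.

    Assumption (hypothetical helper, not GCND code): each '#'-separated
    part of a first-tier token corresponds to exactly one second-tier
    token, and both tiers keep the same order.
    """
    # Split clitic clusters such as 'maa#bè#ja#m' into their parts.
    parts = [part for token in tier1_tokens for part in token.split("#")]
    if len(parts) != len(tier2_tokens):
        raise ValueError(
            f"tier mismatch: {len(parts)} parts vs {len(tier2_tokens)} tokens"
        )
    return list(zip(parts, tier2_tokens))


# Example taken from the transcription step above.
pairs = align_tiers(["maa#bè#ja#m"], ["maar", "bè", "ja", "wij"])
for dialect_form, standard_form in pairs:
    print(f"{dialect_form} -> {standard_form}")
```

A check of this kind makes the "identical token order" requirement testable: any segment in which the number of first-tier parts differs from the number of second-tier tokens is flagged rather than silently misaligned.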

Quality control

Most transcriptions were produced by trained student assistants from Ghent University. All were selected on the basis of a transcription test and received instruction in ELAN and the GCND transcription protocol. The French Flemish recordings were transcribed by Melissa Farasyn and two experienced volunteers. Because student assistants did not always have sufficient dialect competence, owing to dialect loss (Ghyselen & Van Keymeulen, 2014) or to unfamiliarity with topics discussed in the recordings, such as traditional farming practices or disappeared trades and technologies, all transcriptions were subsequently checked by (older) volunteers with strong dialect competence. Corrections for 344 of the 653 recordings were incorporated into the first released version of the corpus. A targeted error analysis of 13 fully corrected transcriptions, conducted by trained linguists who were native speakers of the relevant dialects, showed an average Word Error Rate of only 1.3%, indicating high transcription accuracy (Ghyselen et al., 2025). Of the corrected transcriptions, 289 have now been semi-automatically annotated. For 70 of these, the Alpino parser output was manually reviewed by student assistants and project team members. All POS tags are available in the online interface, with filters to exclude unverified tags. In contrast, annotations of syntactic structure are published only after manual verification. This verification process is ongoing. Future work aims to improve efficiency by training a new (native UD) parser on GCND data to reduce the need for manual correction. This effort forms part of the current FWO infrastructure project (2024–2028).
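For readers unfamiliar with the Word Error Rate reported above, the following Python sketch shows the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference and a hypothesis, divided by the number of reference words. The sketch assumes that the linguists' corrected version serves as the reference; how the published figure was computed in detail is not stated here, and the example strings are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: the checked version contains four words,
# the original transcription misses one of them.
wer = word_error_rate("maar bè ja wij", "maar ja wij")
print(f"WER: {wer:.1%}")
```

In this invented example there is one deletion against four reference words, so the rate is 25%. A corpus-level figure such as the reported average can be obtained by averaging per-recording rates, although pooling all errors over all reference words is an equally common convention.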

3 Dataset Description

Repository name

Dutch Language Institute (Instituut voor de Nederlandse Taal; INT).

Object name

Gesproken Corpus van de zuidelijk-Nederlandse Dialecten/Spoken Corpus of the Southern Dutch Dialects (GCND)

Format names and versions

The current version is 1.0. The audio data are in .mp3 and .wav format. The transcriptions were produced using ELAN (.eaf format). Preprocessing files are in .txt format; the link between audio and text is provided by .json files. The output of the parser is in the Alpino XML format.⁶ The distribution format, which combines transcription, alignment with the audio data, and linguistic annotation, is FoLiA (Format for Linguistic Annotation) XML, the standard used for various corpora in the Netherlands and Flanders (van Gompel & Reynaert, 2013).⁷
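As an illustration of how time-aligned segments can be read from an ELAN .eaf file, the sketch below parses a small inline sample with Python's standard library. The element and attribute names (TIME_SLOT, TIER, ALIGNABLE_ANNOTATION, ANNOTATION_VALUE) follow the general EAF format; the tier identifiers tier1 and tier2, the time values, and the sample content are assumptions made for this example and need not match the identifiers used in the GCND files.

```python
import xml.etree.ElementTree as ET

# Minimal hand-written EAF fragment; tier IDs and times are hypothetical.
SAMPLE_EAF = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="tier1" LINGUISTIC_TYPE_REF="default">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>maa#bè#ja#m</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="tier2" LINGUISTIC_TYPE_REF="default">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>maar bè ja wij</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_tier(root, tier_id):
    """Return (start_ms, end_ms, text) tuples for one tier."""
    # Map time slot IDs to their values in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    segments = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            segments.append((start, end, text))
    return segments


root = ET.fromstring(SAMPLE_EAF.strip())
for tier_id in ("tier1", "tier2"):
    print(tier_id, read_tier(root, tier_id))
```

For a file on disk, ET.parse(path).getroot() can replace the ET.fromstring call; the rest of the sketch stays the same.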

Creation dates

The bulk of the recordings were made between 1963 and 1976; 30 recordings were added between 2020 and 2024. The transcriptions were produced between 2018 and 2024 and linguistically annotated between 2022 and 2024. The data were integrated into the INT corpus application, which runs on a BlackLab backend, in 2024.

Dataset creators

Anne Breitbarth (PI, conceptualization, transcription protocol, correction of transcriptions and parsing, documentation); Melissa Farasyn (co-PI, conceptualization, transcription protocol, transcription, correction of transcriptions, parsing, documentation); Anne-Sophie Ghyselen (co-PI, conceptualization, transcription protocol, transcription, correction of transcriptions, parsing, documentation); Lien Hellebaut (project coordinator, correction of transcriptions); Frederic Lamsens (database design, development); Katrien Depuydt (co-PI; backend design, testing); Jesse de Does (backend design and programming, testing); Jan Niestadt (backend programming, testing); Koen Mertens (backend programming, testing).

Language

Dutch; Southern Dutch dialects (French Flemish, West Flemish, East Flemish, Zeelandic Flemish, Brabantic, Limburgish).

License

The corpus is accessible via CLARIN at https://gcnd.ivdnt.org. It is not entirely open access, as the recordings contain human voices and, in some cases, sensitive content. The data are available for non-commercial purposes under a CC-BY-SA license upon request; requests should be directed to servicedesk@ivdnt.org.

Publication date

The first version was launched on 2024-10-24.

4 Reuse Potential

The GCND offers substantial reuse potential for research on language variation, speech technology, and digital humanities. The combination of audio-aligned transcriptions, dual-tier annotation, and syntactic parsing was meant from the outset to facilitate research into morphosyntactic and lexical variation across Southern Dutch dialects. Because the speakers were born from 1871 onwards, the GCND also provides a basis for historical and diachronic research. The aligned audio further enables phonetic and prosodic analysis. The corpus also provides high-quality training material for dialect-sensitive automatic speech recognition (ASR) and related computational tasks, including acoustic modelling, forced alignment, and syntactic parsing of non-standard varieties. Ongoing work with KU Leuven has already demonstrated the suitability of the GCND data for developing and evaluating ASR systems for Southern Dutch dialects (Mehralian et al., 2026a, b). Beyond linguistics and NLP, the recordings and metadata offer opportunities for cultural-historical and oral-history research, including studies of traditional ways of living, cultural practices, and regional heritage. Because the dataset contains human voices and sensitive content, reuse is limited to non-commercial research and requires a data transfer agreement, ensuring responsible and privacy-conscious use.

Notes

[1] This collection can be accessed through https://www.dialectloket.be/geluid/stemmen-uit-het-verleden/.

[2] https://ndb.meertens.knaw.nl.

[3] The entire Voices from the Past collection contains 783 recordings from 550 different locations; 113 from French Flanders, 188 from West Flanders, 285 from East Flanders, 85 from Antwerp, 47 from Flemish Brabant, 29 from Limburg, 31 from Zeeland, and 5 from Hainault. For the GCND, one recording per place was chosen.

[4] For this step, we availed ourselves of the option of marking parts of the text with [ @skip … ] brackets, as described in the Alpino annotation guide https://www.let.rug.nl/~vannoord/DCOI/AnnotationGuide.html.

[5] https://blacklab.ivdnt.org/.

[6] https://www.let.rug.nl/vannoord/Lassy/sa-man_lassy.pdf.

[7] https://proycon.github.io/folia/.

Acknowledgements

We acknowledge the contributions of Sally Chambers (data management); Timothy Colleman (conceptualization; co-PI); Gertjan van Noord (consultancy on the Alpino parser; co-PI), as well as Francine Verstraete for transcribing and correcting many recordings. We are furthermore indebted to the student assistants transcribing the recordings, pre-processing the transcriptions for parsing, and post-processing the output of the parser, Zenon Andries, Terence Aspeslagh, Marieke Avermaete, Ineke Barbaix, Roeland Bataillie, Maxime Bervoets, David Beunk, Laurens Biesmans, Rémi Bruggeman, Floor Bruinen, Ember Brulez, Lennert Camp, Celine Capelle, Kimberly Casier, Phara Claerbout, Cisse Clemminck, Robin Coolen, Jolien Cornilly, Fien Croux, Mathias D’Hulst, Janne David, Fien De Brie, Lennert De Clercq, Joshua De Donder, Zoë De Feyter, Myrdhin De Fruyt, Hanna De Haes, Debby De Jaeger, Eva De Koker, Lies De Middeleer, Kobe De Neve, Palmyra De Nil, Elsbet De Pauw, Ben De Slagmulder, Laurie De Smet, Lukas De Spiegeleer, Frouke Debunne, Chelsey Declerck, Merel Decrock, Lieselot Degraeve, Yumi Demeyere, Anne-Li Demonie, Casper Deroose, Frouke Eeckman, Lore Fonteyn, Loes Francken, Ilse Franssen, Charlotte Fuertes, Stan Geernaert, Ine Geldermans, Maja Gmur, Claire Govaerts, Ashley Iacobs, Jaron Jacobus, Marie Jalon, Amber Kempynck, Kat Kenis, Fien Kerkhove, Lander Kesteloot, Annelies Kolijn, Joni Kruijsbergen, Simon Lambrechts, Marthe Laureys, Emma Maes, Marieke Martens, Fran Meulebrouck, Gaëlle Mignon, Jan Raeymaekers, Heleen Reuners, Flor Roesbeke, Caitlin Rolier, Lotte Ruysschaert, Gaëlle Rycx, Yasmina Saret, Benjamin Schepens, Margot Schots, Shauny Seynhaeve, Hanne Smet, Louise Snauwaert, Hannah Soenens, Jaan Teelen, Bram Van Beurden, Melanie Van Beurden, Farah Van Compernolle, Yoni Van Dam, Lisa van de Loo, Eva van der Steen, Maite Van Ginderdeuren, Ella Van Lint, Tijmen Van Loo, Niels Van Pamel, Nicholas Vandenbussche, Miriel Vandeperre, Summer Vanhaverbeke, Maite Vanthournout, Cato Vanzieleghem, Pauline Verhelst, Elke Verlinden, Bette Vermet, Daan Verschueren, Merel Verzelen, Frauke Vervaeke, Jürgen Voet, Nathalie Wittevrongel and Ruben Wolsing. For the correction of the transcriptions produced by the student assistants, we finally thank our tireless volunteers, Lobke Aelbrecht, Clement Agten, Hedwig Belien, Karel Blomme, Chris Blondeel, Annemie Bogaert, Wivina Briers, André Cambier, Mady Colson, Monique Cox, Guus de Block, Cyriel De Bruyne, Anny De Cock, Louis De Cock, Paul De Man, Liliane De Mey, Rita De Spiegeleire, Veronique De Tier, Gunther De Vogelaer, Sigrid Debruyne, Karen Declercq, Philippe Deheegher, Jo Devos, Luk Draye, Marc Flamee, André Fraats, Derek Giroulle, Christa Goethuys, Theo Kees, Marijke Kesteloot, Claude Lammens, Myriam Lammertijn, Marie-Charlotte Lauwereins, Luut Leroy, Pieter Loyens, Stefan Lycops, Vic Mennen, Mattie Nuytinck-Bogte, Dirk Pijpops, Marjan Pittevils, Lode Pletinckx, Dirk Raes, Mira Ryon, Inga Scharley, Jean-Marie Schepens, Stefaan Schoonheere, Liesbet Triest, Jacques Van Keymeulen, André Van Laer, Reinhild Vandekerckhove, Robert Vandenberghe, Roxane Vandenberghe, Michel Vander Vennet, Nele Vanfleteren, Rik Vanhoucke, Walter Vercammen, Gert Verdonck, Leslie Verpoorte and Francine Verstraete.

Author Contributions

Anne Breitbarth: conceptualization, writing–original draft; Melissa Farasyn: conceptualization, writing–review & editing; Anne-Sophie Ghyselen: conceptualization, writing–review & editing; Lien Hellebaut: project administration, writing–review & editing; Frederic Lamsens: software, validation; Katrien Depuydt: conceptualization; Jesse de Does: conceptualization, software; Jan Niestadt: software, validation; Koen Mertens: software, validation.