
The Spoken Corpus of the Southern Dutch Dialects (Gesproken Corpus van de zuidelijk-Nederlandse Dialecten)
Abstract
The Spoken Corpus of the Southern Dutch Dialects (GCND) is a linguistically annotated, audio-aligned corpus of 653 recordings from 639 locations across Belgium, northern France, and the southern Netherlands. Most recordings (623) date from 1963–1976 and involve speakers born around 1900 (the earliest in 1871); 30 new recordings (2020–2024) fill geographic gaps. All recordings were transcribed using a two-tier orthographic protocol and annotated with POS tags, lemmas, and syntactic parses via the Alpino parser, supplemented by manual pre- and post-processing. The corpus is searchable through the Dutch Language Institute (INT) using the BlackLab and GrETEL interfaces. GCND is an unprecedented resource for research on Dutch dialect variation, spoken corpus annotation, and dialect-sensitive speech technology. Words, lemmas, metadata, and syntactic structures are accessible for non-commercial research use.
© 2026 Anne Breitbarth, Melissa Farasyn, Anne-Sophie Ghyselen, Lien Hellebaut, Frederic Lamsens, Katrien Depuydt, Jesse de Does, Jan Niestadt, Koen Mertens, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.