
Introducing the First Module of the Multimedia Corpus of Spoken Kazakh Language
Abstract
The first module of the Multimedia Corpus of Spoken Kazakh Language is a dataset documenting contemporary Kazakh as spoken in Kazakhstan and Xinjiang (China). It includes 33 audio recordings (ca. 12 hours) and time-aligned transcriptions collected from 78 participants. The recordings feature naturally occurring conversation among native Kazakh speakers. The corpus is anonymized and published under a CC BY 4.0 license. The dataset is intended as a linguistic resource for the empirical analysis of Kazakh, and it is suitable for reuse in a wide range of linguistics-adjacent disciplines concerned with the analysis of naturally occurring language in use.
© 2026 Giorgia Troiani, Andrey Filchenko, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.