Introducing the First Module of the Multimedia Corpus of Spoken Kazakh Language

Journal of Open Humanities Data

Volume 12 (2026): Issue 1

By: Giorgia Troiani and Andrey Filchenko

Open Access

May 2026

Abstract

The first module of the Multimedia Corpus of Spoken Kazakh Language is a dataset documenting contemporary Kazakh as spoken in Kazakhstan and Xinjiang (China). It includes 33 audio recordings (ca. 12 hours) and time-aligned transcriptions collected from 78 participants. The recordings feature naturally occurring conversation among native Kazakh speakers. The corpus is anonymized and published under a CC BY 4.0 license. The dataset is intended as a linguistic resource for the empirical analysis of Kazakh, and it is suitable for reuse in a wide range of linguistics-adjacent disciplines concerned with the analysis of naturally occurring language in use.

DOI: https://doi.org/10.5334/johd.529 | Journal eISSN: 2059-481X

Language: English

Page range: 67 - 67

Submitted on: Feb 26, 2026

Accepted on: Apr 24, 2026

Published on: May 25, 2026

Published by: Ubiquity Press

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords: corpus linguistics, Turkic languages, multilingualism

© 2026 Giorgia Troiani, Andrey Filchenko, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.
