Introducing the First Module of the Multimedia Corpus of Spoken Kazakh Language

Giorgia Troiani; Andrey Filchenko

doi:10.5334/johd.529

1 Overview

Repository location

Context

The Multimedia Corpus of Spoken Kazakh Language (MCSKL) is the first large-scale initiative dedicated to the documentation of naturally occurring spoken Kazakh. Until now, spoken data have been available as subsections of corpora otherwise primarily concerned with written language, such as the Almaty Corpus of Kazakh (Madiyeva et al., 2016) and the National Corpus of Kazakh Language (Zhanabekova, 2012).¹ These datasets leave a gap in resources dedicated to spoken interaction, which are essential for documenting the contemporary structural and pragmatic innovations driven by a multilingual landscape. The MCSKL differs from previous initiatives in three key respects: i) it features naturally occurring speech events (mainly conversation); ii) it is transcribed in intonation units; iii) it is available as open access.

The design of the MCSKL follows that of the Santa Barbara Corpus of Spoken American English (Du Bois et al., 2000), particularly concerning the criteria for inclusion of speech events and the use of intonation units. Comparable corpora exist for well-documented languages, such as the KiParla corpus for Italian (Mauri et al., 2019) and the NCCU Corpus of Spoken Chinese (Chui & Lai, 2008). These resources usually focus on languages used within national borders and limit the representation of language diversity to regional variation. In contrast, the MCSKL extends data collection beyond Kazakhstan’s borders and has to accommodate extensive multilingual usage. In this respect, the MCSKL resembles corpora built in the framework of documentary linguistics, which engage with the multilingual competence of speakers of minoritized languages (Dobrushina & Moroz, 2021), as in the case of the languages of Northern Eurasia (Arkhipov & Däbritz, 2018).

Central Asia presents a highly multilingual landscape (Bahry, 2016; Koptleuova et al., 2023). Regional multilingualism is articulated around three directions of contact: historical local migratory paths and commerce networks, Russian/Soviet colonization, and globalization processes. Each of these paths brought Kazakh (Turkic) into contact with different languages, mainly Russian (in Kazakhstan), Chinese and Uyghur (in Xinjiang), and Mongolian (in Bayan Olgii). The multilingual competence of Kazakh speakers and sustained contact encouraged the development of structural, lexical, and pragmatic innovations in Kazakh. These complex developments are documented in the corpus through the decision to retain occurrences of multilingual language use in the recordings (Chernyavskaya & Zharkynbekova, 2024).

The MCSKL features naturally occurring speech events. The recording phases took place over four summers (2021–2025). The corpus mainly represents conversation, though other genres are included, such as task-oriented talk (food preparation, ritual slaughtering of an animal for Kurban Ayt), interviews (Zoom calls), and extended narratives. Whenever possible, the recordings were carried out by the participants themselves after receiving training from the research team. This strategy ensured that the speech events would not be disrupted by the presence of a researcher and that participants could withdraw from collecting and depositing data at any time. Some of the recordings made during the COVID-19 quarantine (2021–2022) were collected with devices available to the participants, such as mobile phones. Other recordings were collected with dedicated devices, such as Zoom H4n recorders.

Transcription of the corpus follows the Discourse Functional Transcription (DFT) framework (Du Bois et al., 1993), which is suited to representing prosodic contour, pauses, disfluencies, vocalisms, laughter, and other typical elements of interactional speech. Discourse is segmented into intonation units, stretches of speech uttered under a cohesive intonational contour (Chafe, 1994). Intonation units can be identified on prosodic grounds alone, and even non-native-speaking transcribers identify them consistently and reliably (Himmelmann et al., 2018; Troiani, 2023), which makes them advantageous as a transcription tool.

The MCSKL archival collection includes 475 hours of recordings, collected across Kazakhstan and in the adjacent regions of Mongolia, China, and Uzbekistan. Publication is organized in modules, which are released as transcription is finalized. At the moment, one module is available to the public. Module-1 comprises 12 hours of audio material across 33 speech events collected in different regions of Kazakhstan and in Xinjiang (China), in both rural and urban settings. A total of 73 participants took part in these events. Participants come from different regions of Kazakhstan and Xinjiang and self-identified as native speakers of Kazakh. The majority of participants display multilingual competence in Russian, Chinese, or English. Metadata for speech events and participants are available. Some of the events in Module-1 have been annotated with interlinear glosses. Module-1 sets the standards for the representation of language data and metadata, the annotation conventions, and the general structure of the corpus.

2 Method

Steps

The corpus aims to represent language use among Kazakh speakers, a community of several million speakers distributed across different territories. To achieve this goal, we selected speech events based on language-extrinsic features: we included only speech events that would have taken place even if we had not recorded them and that are consequential to the lives of participants (Troiani et al., 2024). Consent for audio and/or video recording was collected for all the recordings. Whenever possible, we collected consent a day in advance of the session to minimize disruption at the moment of recording. After consent was collected, researchers set up the recording device (or instructed a participant on how to set it up) and left the room. If the absence of a researcher would have resulted in an artificial event, researchers took part in the event being recorded (e.g., gatherings of the researchers’ families). If a participant expected to interact with the researcher, we complied with this expectation. Participants were not given any instructions on how to interact in the recording. After the session, participants were given the opportunity to delete or redact the recording.

Sampling strategy

Participants were recruited among friends, relatives, and acquaintances of the members of the research team. Additional participants were recruited through social media campaigns. The team recruited participants with different levels of fluency and multilingual competence, as long as participants self-identified as Kazakh speakers. Prior to the recording, researchers collected metadata about the participants, including age, gender, region of origin, language(s) spoken at home, region of current residence, and preferred language of everyday interaction. Module-1 features 73 participants, distributed as displayed in Table 1. The majority of these participants resided in cities at the time of recording and indicated Kazakh as the preferred language of interaction.

Table 1

Demographic composition of participants in recordings of the MCSKL Module-1.

Category	Participants	Percentage
Age
Under 25	33	45.20%
26–35	5	6.85%
36–50	14	19.18%
50–65	10	13.70%
Over 65	6	8.22%
Unknown	5	6.85%
Gender
Women	56	76.71%
Men	17	23.29%
Region of origin
East Kazakhstan	25	34.25%
West Kazakhstan	9	12.33%
South Kazakhstan	4	5.48%
North Kazakhstan	5	6.85%
Central Kazakhstan	4	5.48%
Xinjiang (China)	8	10.96%
Uzbekistan	2	2.74%
Unknown	16	21.91%
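
Each percentage in Table 1 is a category's share of the 73 Module-1 participants. As a quick consistency check, the following sketch recomputes the shares from the counts and confirms that every block sums to 73. The dictionary layout and the function name are illustrative and not part of the released dataset; recomputed values may differ from the printed ones in the last digit, presumably because of rounding adjustments.

```python
# Counts copied from Table 1; the data structure itself is an illustrative assumption.
TABLE_1 = {
    "Age": {"Under 25": 33, "26–35": 5, "36–50": 14,
            "50–65": 10, "Over 65": 6, "Unknown": 5},
    "Gender": {"Women": 56, "Men": 17},
    "Region of origin": {"East Kazakhstan": 25, "West Kazakhstan": 9,
                         "South Kazakhstan": 4, "North Kazakhstan": 5,
                         "Central Kazakhstan": 4, "Xinjiang (China)": 8,
                         "Uzbekistan": 2, "Unknown": 16},
}
TOTAL_PARTICIPANTS = 73


def shares(counts, total=TOTAL_PARTICIPANTS):
    """Return each category's percentage of `total`, rounded to two decimals."""
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


for block, counts in TABLE_1.items():
    # Every block of the table should account for all participants.
    assert sum(counts.values()) == TOTAL_PARTICIPANTS, block
    print(block, shares(counts))
```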

Transcription and annotation

Audio recordings were manually transcribed and annotated by native speakers using the transcription tool ELAN (Wittenburg et al., 2006) following DFT conventions. Segmentation into intonation units was checked for internal consistency and manually corrected and reconciled by Giorgia Troiani, who trained the annotators.² Manual translations into English have been provided. Personal names and addresses were systematically replaced with placeholders to ensure participant anonymity. Speaker pseudonyms were consistently applied across all transcription tiers. Personal information was also removed from audio recordings.
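
Placeholder replacement of the kind described above can be sketched as follows. This is a minimal illustration, not the procedure used for the MCSKL: the placeholder string `<NAME>`, the sample name, and the function are hypothetical, and the conventions actually applied in the corpus may differ.

```python
import re


def anonymize(text, names, placeholder="<NAME>"):
    """Replace whole-word occurrences of the listed names with a placeholder.

    `names` and `placeholder` are hypothetical inputs chosen for illustration.
    """
    if not names:
        return text
    # Longest names first, so multi-word names win over their parts.
    alternatives = sorted((re.escape(n) for n in names), key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(alternatives) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)


print(anonymize("Aigerim called yesterday.", ["Aigerim"]))
```

Because the match is restricted to whole words, name forms carrying suffixes are left untouched and would need separate handling, for example manual review.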

Quality control

Quality control procedures for the MCSKL Module-1 focused on ensuring segmentation consistency (see above), as well as the accuracy of transcription and translations, which were checked by two Kazakh-speaking senior members of the research team. Consistency in the application of DFT conventions was validated with a series of Python scripts that standardized the annotations and flagged discrepancies for manual correction. The module was then manually reviewed to ensure correct anonymization and the alignment of transcription content with speaker metadata.
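
The validation scripts themselves are not described here, so the following is only a sketch of what a convention check might look like. It assumes, for illustration, that every intonation unit ends in one of four unit-final markers (`.`, `,`, `?`, `--`) and that overlap brackets are balanced within a unit; both the rule set and the function names are assumptions rather than the project's actual checks.

```python
# Assumed set of unit-final markers; the project's actual inventory may differ.
FINAL_MARKERS = (".", ",", "?", "--")


def check_unit(unit):
    """Return a list of problems found in one transcribed intonation unit."""
    text = unit.strip()
    if not text:
        return ["empty unit"]
    problems = []
    if not text.endswith(FINAL_MARKERS):
        problems.append("missing unit-final marker")
    if text.count("[") != text.count("]"):
        problems.append("unbalanced overlap brackets")
    return problems


def report(units):
    """Map 1-based unit numbers to their problems, skipping clean units."""
    found = {}
    for number, unit in enumerate(units, start=1):
        problems = check_unit(unit)
        if problems:
            found[number] = problems
    return found


print(report(["so we went there,", "and then [she said", "okay."]))
```

A report of this kind only flags candidate discrepancies; the correction itself remains a manual step, as in the workflow described above.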

3 Dataset Description

Repository name

OSF

Object name

Multimedia Corpus of Spoken Kazakh Language module 1

Format names and versions

WAV, EAF, TXT, TSV, CSV

Creation dates

Data collected between 2021 and 2023

Dataset creators

Principal investigator: Andrey Filchenko (Nazarbayev University)
Collection and transcription manager, training manager, data manager: Giorgia Troiani (Nazarbayev University and University of Bologna)
Training manager: Gulnar Sarseke (Eurasian National University)
Data collectors, transcribers, and validators: Akyl Akanov (Nazarbayev University and Nanyang Technological University) and Moldir Bizhanova (Nazarbayev University)
Data managers: Nikolay Mikhailov (Nazarbayev University) and Ludovica Pannitto (University of Bologna)
Data collectors and transcribers: Zhansaya Turaliyeva, Saodat Kurmanayeva, Wulaer Nurlan, Tansulu Temirbekova, Dana Akhmerova, Alyona Krapivina, Tomiris Nurgaliyeva, Ilya Razorenov, Madina Kelessova

Language

Kazakh; Russian; Chinese.

License

CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International)

Publication date

2026-02-19

4 Reuse Potential

The MCSKL Module-1 constitutes the first collection of naturally occurring Kazakh conversation released in open access. This resource provides an empirical base for the analysis of contemporary language use in a low-resource language. First and foremost, the corpus is designed to support different levels of analysis of language in use, and it has been used to investigate grammaticalization and contact-induced change (Troiani, 2023; Troiani & Mukanova, in press), conversational practices, and discourse markers (Akanov, 2025) in Kazakh. The emphasis on naturally occurring discourse led our team to record rare culture-specific events that are of interest to anthropologists and ethnographers (e.g., ritual slaughters, weddings, traditional recitals). Educators and language instructors may also find the corpus a useful resource to integrate into instructional curricula in light of recent government-backed and grassroots-led efforts to stimulate the use of Kazakh among youth. The audio/video recordings and time-aligned transcriptions have potential uses in the training of LLMs and in the development of technological applications for the Kazakh language. Finally, the methodological steps used to assemble the dataset lay the groundwork for the creation of conversational corpora of other languages of Central Asia, where corpus designers are bound to encounter conditions similar to the ones discussed here.
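
As a starting point for such reuse, the time-aligned annotations can be read directly from the EAF files, which are XML. The sketch below uses only the Python standard library and relies on the general EAF layout, in which alignable annotations refer to time slots by identifier. The function names are illustrative, and no assumption is made about the tier names used in Module-1.

```python
import xml.etree.ElementTree as ET


def iter_units(root):
    """Yield (tier_id, start_ms, end_ms, text) for each time-aligned annotation."""
    # Map time-slot identifiers to their values in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            yield tier.get("TIER_ID"), start, end, text


def read_eaf(path):
    """Parse an EAF file and return its time-aligned annotations as a list."""
    return list(iter_units(ET.parse(path).getroot()))
```

Calling `read_eaf` on a downloaded EAF file returns one tuple per annotation, which can then be filtered by tier or joined with the participant metadata. Annotations on dependent tiers that are not time-aligned themselves (for example, translations linked to a parent annotation) are not covered by this sketch.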

Notes

[1] NLP resources are also available, such as the Kazakh Language Corpus (Makhambetov et al., 2013) and the Kazakh Speech Corpus (Khassanov et al., 2021). Since the MCSKL is not comparable in scope and research goals with these datasets, we do not discuss them here.

[2] We did not run scripts to check for inter-rater agreement but confirmed in an experimental setting that expert transcribers displayed high rates of agreement with each other and with Giorgia Troiani (Troiani, 2023).

Ethics and Consent

The project design and the data collection protocol were approved by the Institutional Research Ethics Committee of Nazarbayev University (ethics approval reference: [881/20032024, 682/22022023, 500/29112021]).

Acknowledgements

We thank all the research assistants from Nazarbayev University and Eurasian National University who contributed to this project by providing recordings that will be published in future releases.

Author Contributions

Giorgia Troiani: conceptualization, data curation, methodology, supervision, validation, writing – original draft.

Andrey Filchenko: conceptualization, funding acquisition, methodology, project administration, supervision, writing – review and editing.