(1) Overview
Repository location
Zenodo – DOI: https://doi.org/10.5281/zenodo.15465769
Context
The Kölner Korpus des Kiezdeutschen/Cologne Corpus of Kiezdeutsch was developed to document and analyze the multiethnic youth variety known as Kiezdeutsch as spoken in Cologne, one of Germany’s largest urban centers. Kiezdeutsch has been extensively studied; see, e.g., the Kiezdeutsch-Korpus (Wiese et al. 2010) and numerous publications by Wiese (2009, 2011, 2012, 2013) and colleagues (e.g., Wiese, Freywald & Mayr 2009). However, most research has focused on Berlin and other northern metropolitan areas (for an exception, see Behringer 2020). This dataset addresses a significant empirical gap by capturing a regional instantiation of this multilingual contact variety in a western German context, where it remains underrepresented in corpus-based research.
Kiezdeutsch is often described as a multilingual urban vernacular, shaped by sustained contact between German and a range of heritage and migrant languages in urban neighborhoods. Its structure reflects contact-induced innovations, lexical borrowing, and discourse-pragmatic strategies, among other structural and functional features associated with youth peer interaction and colloquial speech.
Unlike traditional regional dialects, Kiezdeutsch has emerged in multilingual urban neighborhoods and reflects peer-group identity more than geographic origin. Its development is shaped by everyday multilingualism, contact-induced change, and youth-driven norms, making it a key case for studying contemporary language change “from below”. The corpus captures this through naturally occurring, informal speech in peer settings – crucial for understanding how such varieties evolve and spread.
The corpus was designed as a resource for both qualitative and quantitative analysis, with broad applicability across multiple subfields of linguistics. It enables researchers to investigate structural properties of Kiezdeutsch, including morphosyntactic variation, lexical innovation and borrowing, and discourse-pragmatic/interactional phenomena characteristic of peer-group communication (e.g., ritual insults, hyperbole, wordplay). In addition, the spontaneous nature of the speech and its informal register facilitate the analysis of turn-taking, repair strategies, stance marking and peer alignment among speakers with diverse linguistic repertoires.
Methodologically, the corpus adheres to rigorous standards of linguistic data collection. Recordings were made during naturally occurring break-time conversations without researcher intervention, thereby maximizing ecological validity. Thirteen male speakers aged 17–20 were recruited from an IT-focused two-year vocational program at a Berufskolleg (vocational college) in Cologne. The final participant sample shares a common educational background but includes individuals both with and without migration backgrounds, allowing for the exploration of intra-group linguistic variation within a multilingual youth demographic.
In this context, we use the term “multilingual” to refer to speakers who regularly use German alongside at least one other language in their everyday lives (e.g., Turkish, Arabic, Kurdish). The multilingual speakers in our sample have migration backgrounds and are second-generation members of multilingual households. In contrast, the monolingual group comprises speakers who self-identify as using only German as their first language, both in domestic and public settings.
Transcriptions follow the Gesprächsanalytisches Transkriptionssystem 2 (GAT 2) (Selting et al. 2009), the most widely adopted standard in German-speaking conversation analysis and interactional linguistics. This ensures fine-grained representation of paralinguistic features such as prosodic contour, pauses, overlaps, and laughter – elements that are essential for pragmatic and discourse-oriented analysis. The corpus is further enriched by metadata concerning speaker group composition (monolingual, multilingual, mixed).
In its current form, this compact corpus includes approximately 180 minutes of audio material (about 60 minutes per group), yielding just over 33,000 tokens and 3,721 annotated speaker turns. Despite its compact size, it provides a sufficiently rich and diverse dataset to serve as an empirical foundation for corpus-based research on urban vernaculars. Potential applications span a range of subfields, including variationist syntax, youth language practices and sociophonetics. In addition, the corpus lends itself well to pedagogical contexts, particularly for instruction in transcription techniques, corpus design and empirical methods within German linguistics.
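The totals reported above can be turned into rough density measures, which may help when comparing this corpus with other spoken-language resources. The following is a minimal sketch in Python, assuming the rounded figures from the text (the token count is given only as "just over 33,000", so the derived values are approximate). The function name is illustrative and not part of the dataset.

```python
# Figures reported in the text; the token count is rounded ("just over 33,000").
AUDIO_MINUTES = 180
TOKENS = 33_000
SPEAKER_TURNS = 3_721


def corpus_rates(minutes: int, tokens: int, turns: int) -> dict[str, float]:
    """Return simple density measures derived from the corpus totals."""
    return {
        "tokens_per_minute": tokens / minutes,
        "turns_per_minute": turns / minutes,
        "tokens_per_turn": tokens / turns,
    }


rates = corpus_rates(AUDIO_MINUTES, TOKENS, SPEAKER_TURNS)
for name, value in rates.items():
    print(f"{name}: {value:.1f}")
```

On these assumptions, the corpus averages roughly 183 tokens and 21 speaker turns per minute, or about 9 tokens per turn, which is in keeping with fast-paced, informal multi-party talk.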
The dataset is openly accessible via Zenodo under a CC BY 4.0 license and is intended as a sustainable, long-term resource for the linguistic research community. It complements ongoing initiatives to document non-standard and multilingual varieties of German and contributes to the broader objective of making spoken language data from underrepresented urban contexts publicly available. In alignment with the FAIR data principles, the corpus is findable through a persistent DOI, accessible without restrictions, interoperable through standard formats and metadata, and reusable thanks to comprehensive documentation and open licensing. It is thus designed to support reproducible research and facilitate wide-ranging scholarly use.
The corpus can be cited as follows:
Neubauer, Antonia Marie & Catasso, Nicholas (2025). Kölner Korpus des Kiezdeutschen/Cologne Corpus of Kiezdeutsch [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.15465769
(2) Method
Steps
The corpus was constructed with a focus on ecological validity and reproducibility. Data collection took place in 2023 at a vocational secondary school specializing in technical and media-oriented educational programs, providing a relevant sociolinguistic environment for accessing speakers of urban youth varieties. Recordings were conducted without the presence of researchers or teachers during unsupervised break periods, in order to ensure natural interaction and minimize the observer’s paradox (Labov 1972). Participants were instructed only in general terms to speak as they normally would among peers and were simply invited to choose from among topics they found engaging. The resulting conversations typically revolved around school, food, cars, religion, football and future plans. No scripts, interview protocols or experimental tasks were employed. Three distinct groups of speakers were recorded, each group comprising four to six male students:
Group 1 (G1): multilingual speakers only (four speakers)
Group 2 (G2): monolingual speakers only (four speakers)
Group 3 (G3): mixed group (mono- and multilingual speakers) (six speakers)
Although a total of 13 individuals were recruited, the overall number of participants across the three groups is 14, as one speaker was included in two groups (G3 and one of the other groups). This overlap was intentional to ensure a balanced representation of monolingual and multilingual speakers within G3.
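The relation between the 13 recruited individuals and the 14 group participations can be checked with a short sketch. The speaker labels and their assignment to groups below are hypothetical: the text does not state which speakers make up G3 or which speaker was recorded twice, so the sets only illustrate the counting.

```python
from collections import Counter

# Hypothetical speaker labels: the assignment of speakers to G3 and the
# identity of the speaker recorded twice are assumptions for illustration.
groups = {
    "G1": {"ME1", "ME2", "ME3", "ME4"},                # multilingual only
    "G2": {"MO1", "MO2", "MO3", "MO4"},                # monolingual only
    "G3": {"ME1", "ME5", "ME6", "MO5", "MO6", "MO7"},  # mixed
}

# Total number of seats across all three recordings.
participations = sum(len(members) for members in groups.values())

# Count how many groups each speaker appears in.
appearances = Counter(s for members in groups.values() for s in members)
unique_speakers = len(appearances)
repeated = sorted(s for s, n in appearances.items() if n > 1)

print(participations, unique_speakers, repeated)
```

With group sizes of four, four and six, the sum is 14 participations; one speaker appearing in two groups reduces the number of distinct individuals to 13.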
Recordings were conducted using readily available mobile devices with integrated microphones, allowing for the effective capture of overlapping speech and ambient interactional cues in naturalistic settings.
Sampling strategy
Participants were selected using a purposive stratified sampling approach, drawing on existing cohorts from the vocational school. Stratification was guided by observable linguistic and social characteristics, with the goal of forming peer groups marked by coherent internal dynamics.
While the overall number of participants is limited, the three-group design allows for targeted contrastive analysis of peer interaction across speakers with different linguistic repertoires.
All groups consisted of male speakers who use features of Kiezdeutsch. The design enables analysis of linguistic variation not only across differing language profiles, but also in relation to group-internal structures and dynamics. The three-group setup, in turn, supports comparative analysis of interactional accommodation (i.e., speakers adjusting their speech to align with others), peer alignment (the display of agreement or shared stance through verbal or non-verbal cues) and pragmatic convergence (a gradual harmonization of discourse styles and strategies within interactions).
Transcription and annotation
All transcription was performed manually and subsequently revised for internal consistency and interpretive clarity. Paralinguistic and selected prosodic features relevant to conversational structure – such as pauses, lengthening, cut-offs, and overlap – were annotated following GAT 2 conventions, but rendered using English labels rather than the original German notation. No normalization was applied to slang and non-standard vocabulary and pronunciation.
The original transcription and annotations by Neubauer were systematically reviewed by a second transcriber (Catasso) with the primary aim of eliminating inaccuracies and harmonizing transcription strategies, particularly in the representation of paralinguistic signals and prosodic structuring.
All transcripts underwent rigorous pseudonymization: personal names, addresses and institutional affiliations were either removed or systematically replaced with neutral placeholders to ensure participant anonymity. Speaker identifiers were consistently applied across all transcription tiers, employing a structured pseudonymization scheme in which labels such as ME (for mehrsprachig, ‘multilingual’) and MO (for monolingual, ‘monolingual’) precede speaker numbers.
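For readers who want to process the speaker identifiers programmatically (for instance, after extracting text from the transcript PDFs), the labeling scheme can be decoded along the following lines. This is a minimal sketch: the exact surface form of the labels (e.g., "ME1" versus "ME 1" or "ME-1") is an assumption, and the names `Speaker`, `LABEL_PATTERN` and `parse_speaker_label` are illustrative, not part of the dataset.

```python
import re
from typing import NamedTuple


class Speaker(NamedTuple):
    profile: str  # "multilingual" or "monolingual"
    number: int   # speaker number following the prefix


# Assumed label shape: prefix ME or MO, an optional separator, then digits.
LABEL_PATTERN = re.compile(r"^(ME|MO)[\s_-]?(\d+)$")
PROFILES = {"ME": "multilingual", "MO": "monolingual"}


def parse_speaker_label(label: str) -> Speaker:
    """Split a pseudonymized speaker label into language profile and number."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized speaker label: {label!r}")
    prefix, digits = match.groups()
    return Speaker(PROFILES[prefix], int(digits))


print(parse_speaker_label("ME3"))
```

Rejecting unknown prefixes with an error, rather than silently skipping them, makes it easier to spot labels that deviate from the assumed pattern.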
Quality control
Quality control procedures focused on ensuring transcription accuracy and methodological coherence. A subset of approximately 15% of the data was independently transcribed by the second annotator to assess consistency in the application of GAT 2 conventions, particularly for prosodic and interactional features. Discrepancies were addressed through collaborative review and alignment with the established transcription framework.
The finalized corpus was reviewed for overall coherence, including the alignment of transcription content with speaker metadata.
(3) Dataset Description
Repository name: Zenodo
Object name: Kölner Korpus des Kiezdeutschen/Cologne Corpus of Kiezdeutsch
Format names and versions: mp3, PDF
Creation dates: 2023-06-28
Dataset creators:
Catasso, Nicholas (data manager), Bergische Universität Wuppertal
Neubauer, Antonia Marie (data collector), Bergische Universität Wuppertal
Language: German (transcriptions); English (annotation conventions)
License: CC BY 4.0
Publication date: 2025-05-11
(4) Reuse potential
In addition to its empirical and pedagogical utility, the corpus contributes to the broader documentation of spoken German in socially diverse settings. It captures informal, peer-group interaction among multilingual youth, offering insights into everyday linguistic practices shaped by mobility, social networks and institutional context. While compact in size, the dataset enriches the available empirical base for studying contemporary language use beyond standardized norms and supports ongoing efforts to make underrepresented varieties more accessible to linguistic research. The corpus can also serve as a tool in university courses on corpus linguistics, transcription methodology, multilingualism and variational linguistics. Because of its modular group-based structure and clear metadata, it is well suited for student projects and coursework at the BA/MA level.
In addition, the dataset invites comparative studies across regional variants of Kiezdeutsch and other European urban vernaculars (e.g., Multicultural London English, Citétaal in Belgium, maranza language in Italy), contributing to broader inquiries in language contact, urban multilingualism and language and identity in youth culture.
Although the dataset is demographically focused (male speakers from a single school context), it offers a highly replicable design that can be extended to other speaker groups or urban centers. As such, it lays empirical groundwork for future longitudinal, gender-comparative and multi-site studies of contemporary spoken German in (directly or indirectly) migration-influenced contexts.
The dataset supports a wide range of research questions beyond those mentioned above. Among other things, it enables the study of borrowed discourse particles (e.g., inshallah, vallah, ga), youth-language Anglicisms, vulgar expressions, pseudo-vocatives (e.g., Bruder, Digga), determiner omission (e.g., G1, 711: bruder denkt ich hab pfeife zu hause), preposition-determiner contraction or omission (e.g., G3, 2290: der geht einfach baumarkt), non-canonical case marking (e.g., G3, 1756: ich geh zu eine studie) and syntactic patterns such as Verb Third (e.g., heutzutage leute ersetzen viel zu schnell …).
(5) Ethical considerations
All participants provided informed consent prior to the study. For those who were under the age of 18 at the time of data collection, consent was obtained from their legal guardians. Additionally, the participants – who, by the time of publication in 2025, were all of legal age – provided explicit consent for the release of the anonymized data in the Zenodo repository. These procedures were conducted in accordance with institutional ethics guidelines and applicable data protection laws, including the General Data Protection Regulation (GDPR) and relevant German data protection legislation (e.g., the Bundesdatenschutzgesetz, BDSG). The dataset is pseudonymized and does not contain any personally identifiable information. As such, it is suitable for public sharing and reuse under the open license provided.
Acknowledgements
We thank the participating students for their cooperation.
Competing Interests
The authors have no competing interests to declare.
Author Contributions
– Nicholas Catasso: Conceptualization; Data management; Supervision; Validation; Writing – original draft; Writing – review & editing
– Antonia Marie Neubauer: Data collection and transcription; Writing – original draft; Writing – review & editing
