Tell Me What Happened: Validation of a Measurement Tool to Assess Kindergarten Children’s Oral Narrative Ability

Judith Schönberger; Fabio Sticca; Dieter Isler

doi:10.5334/spo.79

Introduction

Classroom participation requires diverse oral text abilities, including explaining knowledge, negotiating points of view, and crucially, relating narratives (Feagans & Applebaum, 1986; Westerveld & Vidler, 2016). Narratives hold particular significance in academic settings (Boudreau, 2008). Narratives are typically monologic, meaning one person speaks at length, for example, a child recounting a personal experience or retelling a story (Engel, 1995; Spencer & Pierce, 2023). A key characteristic of narratives is their organized structure, which encompasses both temporal and causal connections (Berman & Slobin, 1994). Temporal structure refers to how events are sequenced in time, often signaled by words such as “first” or “then”. Causal structure involves linking events by explaining why they happened, often using connecting words such as “because”. Oral narrative ability can therefore be regarded as the ability to construct such structured accounts and convey them coherently to a listener. The construction of a coherent narrative is a cognitively demanding task that relies on executive functions (Janssen et al., 2024; Schönberger et al., 2025). Specifically, it requires working memory to hold multiple story elements, inhibitory control to maintain chronological sequences by suppressing irrelevant details, and cognitive flexibility to manage different character perspectives and their internal states (McCabe & Peterson, 1991; Roth, 2009; Trabasso et al., 1992). While all components are integral, cognitive flexibility is particularly relevant for the complex demands of structuring a story, which requires a child to switch between character perspectives and flexibly organize plot points (Janssen et al., 2024; Schönberger et al., 2025).

As oral narrative ability develops during the preschool years, it relies less on shared knowledge and evolves into a decontextualized form, encompassing past or fictional events that are removed from the immediate context (Roth, 2009; Snow & Dickinson, 1991). This means children’s narratives rely less on the listener’s support, gestures, or immediately present objects, so their narratives can stand alone, conveying meaning about past events (e.g., “Yesterday I saw a big dog”) or fictional scenarios (e.g., “Once upon a time, there was a prince…”). This transition from the immediate context to a distant context is essential in written texts prominent in academic settings (Snow & Dickinson, 1991). Longitudinal studies link early oral narrative ability to subsequent writing and reading skills (Babayiğit et al., 2021; Suggate et al., 2018). Moreover, interventions improving oral narrative ability have been shown to improve reading comprehension (Petersen et al., 2020) and writing (Petersen et al., 2022). Thus, assessing oral narrative ability in kindergarten-aged children can predict subsequent writing and reading skills.

To accurately track such developmental trajectories or evaluate intervention effectiveness over time, researchers require assessment instruments with robust longitudinal psychometric properties. Such instruments must consistently measure the same construct across time points and exhibit sensitivity towards developmental changes. The MuTex (the German acronym for Mündliche Textfähigkeiten, i.e., Oral Narrative Abilities) is an instrument designed to assess oral narrative ability and was cross-sectionally validated in a pilot study involving 109 kindergarten children (Isler et al., 2018). However, its psychometric suitability for longitudinal application has not been investigated.

Therefore, the primary aim of this study was to conduct a longitudinal validation of the MuTex in kindergarten children over a span of 18 months. Establishing these longitudinal measurement properties is essential to determine if MuTex can serve as a reliable and valid instrument for future research investigating the development of oral narrative ability and for evaluating the efficacy of interventions targeting this ability. Such a validated tool is crucial, given the significant development in oral narrative ability during the preschool years (Roth, 2009) and the current scarcity of widely recognised longitudinal assessments for measuring oral narrative ability in this age group.

Measurement of the Oral Narrative Ability

MuTex assesses children’s ability to retell a speech-free cartoon immediately after watching it. This story retelling elicitation method was chosen because having both the child and the tester share knowledge of the story content increases the validity of assessing oral narrative ability (Reese et al., 2011). Speech-free cartoons were preferred over picture books to better capture children’s independent ability to verbally express their understanding of events. Since the cartoon is speech-free, children cannot simply repeat phrases they have just heard, a key difference compared to other narrative assessment instruments (Bowles et al., 2020; Petersen & Spencer, 2016). Furthermore, this approach was chosen to enhance the instrument’s suitability for linguistically diverse settings. This approach separates verbal comprehension from narrative production. The visual stimulus reduces the verbal comprehension barrier inherent in traditional story-retelling tasks, thus mitigating a key disadvantage for children from different language backgrounds as they tell the story in the language of instruction. Three different cartoons (lamb, crocodile, octopus) were selected, enabling the same children to be tested repeatedly over a period of time. These different versions help maintain consistent administration procedures, crucial for establishing longitudinal measurement invariance. Slightly longer than three minutes, each cartoon adhered to a classic story grammar (Thorndyke, 1977) and had a different animal protagonist, problem and setting. The characters and settings were first introduced. A problem then arose and was solved by the protagonists after several attempts.

Narrative ability can be differentiated into specific macrostructure and microstructure features (Justice et al., 2010; Westerveld & Vidler, 2016). The macrostructure is defined as the global characteristics of a narrative. This includes adherence to a typical story grammar, such as clearly outlining the setting (e.g., “in a living room…”), introducing characters (e.g., “there was a crocodile…”), and presenting a problem (e.g., “who couldn’t eat a salt stick…”). It also includes overall coherence, meaning that the story flows logically. Microstructure, on the other hand, concerns the linguistic details within sentences, such as the use of complex sentence structures (e.g., using conjunctions like “while” or “although”) or varied vocabulary (Justice et al., 2010). While MuTex provides a comprehensive assessment of macrostructure (e.g., story grammar, coherence) and incorporates key functional aspects of microstructure (such as the use of cohesive devices), it also uniquely includes dedicated dimensions for interactional and referential competence, moving beyond a simple macro/micro dichotomy. It was developed based on the observations and analyses from a qualitative video study (Isler & Ineichen, 2016) and grounded on theoretical models of interaction (Hausendorf & Quasthoff, 1996), representation (Meng et al., 2007), text (Feilke, 2014), and genre (Thorndyke, 1977).

MuTex assesses narrative competence across four key dimensions: Soloistic Production, Representation of Distant Content, Textual Organisation, and Genre-Specific Patterns. Firstly, Soloistic Production assesses the child’s capacity to produce an extended, independent narrative. For example, it looks at whether the child can tell the story with minimal prompting from the adult, effectively holding the floor as the primary storyteller (Hausendorf & Quasthoff, 1996). Secondly, Representation of Distant Content measures the child’s ability to use language to describe characters, settings, and events that are not physically present during the telling, a key aspect of decontextualized language. This includes recounting what happened in the speech-free cartoon, such as specific actions (“The crocodile tried to eat the salt stick”). Crucially, it also involves representing non-perceivable elements such as thoughts or feelings of the protagonists (e.g. “The crocodile was angry, because he couldn’t eat the salt stick”) (Meng et al., 2007; Milosky, 1987). Thirdly, Textual Organisation concerns the logical sequencing of story events and the use of linguistic devices to create a cohesive and coherent account. This means events are told in chronological order and linked using cohesive devices such as “then” or more complex causal connectors such as “because” (Bachmann & Feilke, 2014). Finally, Genre-Specific Patterns assess the child’s inclusion of elements characteristic of the narrative genre, often aligned with story grammar. Additionally, this dimension includes the use of expressive narrative markers that engage the listener, such as imitating a character’s voice, or using storytelling conventions such as “Once upon a time…”. Each of the four dimensions consists of a combination of criteria based on theory. Therefore, the various criteria used to assess the four dimensions cannot be treated as homogeneous, equivalent items. It is only through their specific operationalisation and combination that the measurement of a dimension as a whole becomes valid and reliable.

In their pilot study, Isler et al. (2018) found support for the four-factor structure and observed high reliability (α = 0.82). Following the pilot study, an intervention study referred to as EmTiK, a German acronym that means Promotion of Oral Text Abilities, was conducted from 2018 to 2021. The data obtained during this intervention were used in this longitudinal study to examine the psychometric properties of MuTex, namely its reliability and its ability to consistently assess the same construct over time (Meredith, 1993). Additionally, concurrent validity, a facet of criterion-related validity, was determined by evaluating the correlation between MuTex and various criterion variables (Standards for Educational & Psychological Testing, 2014). These correlations were subsequently compared to expected outcomes. The criterion variables encompassed parental education level, executive functions, age, and the use of the official school language at home. Previous research has linked narrative ability to parental education level and children’s executive functions (Mozzanica et al., 2016; Scionti et al., 2023). Furthermore, as MuTex is designed to measure the development of oral narrative ability in kindergarten-aged children, it should be positively correlated with age. MuTex scores should also be associated with the use of the official school language at home.

Aim and Hypotheses

The study’s aim was to validate the MuTex scales longitudinally using a larger sample size than in the aforementioned pilot study. Specific goals were to examine a) its reliability, b) its factorial validity, c) its longitudinal measurement invariance, and d) its criterion validity. We expected to find support for the four-factor structure and a high reliability since prior research has shown this (Isler et al., 2018). Considering the nature of the theoretical models that MuTex is based on, it seems justified to assume that even as children significantly improve their oral narrative ability, the four dimensions and therefore also the underlying construct remain relatively stable during preschool years. Thus, MuTex was expected to measure the same construct over time.

Criterion validity was examined to assess the convergence of MuTex with various criterion variables assessed at the first time point. In previous research, oral narrative ability demonstrated a moderate positive association with parental education levels when comparing parents with low and high education levels. However, when comparing parents with low and intermediate education levels, no significant correlation was observed (Mozzanica et al., 2016). Consequently, we anticipated finding a small positive correlation between MuTex and parental education level. As established in the introduction, oral narrative ability is cognitively demanding and relies on executive functions; therefore, a moderate positive correlation was expected between MuTex scores and the measure of executive functions (Scionti et al., 2023). A positive correlation with age was also expected, since MuTex is designed to measure the development of oral narrative ability. Finally, a positive, but not strong, association was expected with the use of the official school language at home, as the instrument’s design with a silent cartoon is intended to mitigate disadvantages for children from diverse language backgrounds.

Method

Participants and Procedure

The EmTiK study dataset comprised 292 kindergarten children from 65 Swiss kindergartens in the cantons of Zurich and Thurgau. The EmTiK study aimed to improve teachers’ scaffolding practices and included both an intervention and a control group. To address the potential for the intervention to influence the present validation analyses, a series of t-tests was performed. These tests verified that the children in the intervention and control groups did not have differing scores at any of the three time points. For oral narrative ability, no significant differences emerged at T0 (M_CG = 9.16, SD_CG = 3.72; M_IG = 8.85, SD_IG = 3.43; t = 0.73, df = 265.96, p = 0.46), T1 (M_CG = 12.35, SD_CG = 3.64; M_IG = 12.10, SD_IG = 3.42; t = 0.55, df = 234.82, p = 0.59), or T2 (M_CG = 14.03, SD_CG = 3.42; M_IG = 13.63, SD_IG = 3.33; t = 0.91, df = 237.94, p = 0.37). The same pattern held for executive functions at T0 (M_CG = 45.72, SD_CG = 11.53; M_IG = 45.13, SD_IG = 12.09; t = 0.42, df = 273.39, p = 0.68), T1 (M_CG = 65.11, SD_CG = 14.40; M_IG = 65.28, SD_IG = 13.51; t = –0.09, df = 228.43, p = 0.93), and T2 (M_CG = 69.58, SD_CG = 12.32; M_IG = 69.76, SD_IG = 11.35; t = –0.12, df = 233.24, p = 0.91). Furthermore, there was no significant difference between the groups in the grand means across the three waves for either oral narrative ability (M_CG = 11.68, SD_CG = 3.02; M_IG = 11.21, SD_IG = 2.95; t = –1.34, df = 283.90, p = 0.18) or executive functions (M_CG = 58.60, SD_CG = 11.91; M_IG = 58.00, SD_IG = 11.74; t = –0.43, df = 285.01, p = 0.67). Given these findings, the data from both groups were combined for all subsequent analyses.
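The group comparisons above can be retraced from the reported summary statistics. The sketch below is illustrative only: the fractional degrees of freedom suggest an unequal-variance (Welch) t-test, which is an inference rather than something the text states, and the group sizes (131 and 148) are assumptions derived from the 47% control-group share of the 279 children at T0 in Table 1. The function name `welch_t` is hypothetical.

```python
import math

def welch_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    computed from group means, standard deviations, and sizes."""
    var_term_1 = sd_1 ** 2 / n_1
    var_term_2 = sd_2 ** 2 / n_2
    t = (mean_1 - mean_2) / math.sqrt(var_term_1 + var_term_2)
    df = (var_term_1 + var_term_2) ** 2 / (
        var_term_1 ** 2 / (n_1 - 1) + var_term_2 ** 2 / (n_2 - 1)
    )
    return t, df

# Oral narrative ability at T0; the group sizes are assumed, not reported.
t, df = welch_t(9.16, 3.72, 131, 8.85, 3.43, 148)
print(round(t, 2), round(df, 1))
```

Under these assumed group sizes the result lies close to the reported t = 0.73 and df = 265.96; small deviations are expected because the published means and standard deviations are rounded.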

The three time points were intended to align with the design of an intervention study including pre-intervention (T0), post-intervention (T1), and a follow-up assessment (T2). The planned intervals were 12 months (T0 to T1) and 6 months (T1 to T2), for a total planned follow-up of 18 months. However, the T1 assessments were delayed due to the school closures during the COVID-19 pandemic. This resulted in a longer interval between T0 and T1 (~13 months) and a consequently shorter interval between T1 and T2 (~5 months).

Regarding sample size, the initial aim was to recruit approximately 320 children. This was to be achieved by first recruiting 80 kindergarten teachers. Four to five children were randomly selected from each teacher’s class. Teacher recruitment was based on a randomized selection from a complete list of kindergarten teachers in the respective cantons. After three large random draws, totaling approximately 1,200 invitations, the teacher acceptance rate remained low (5.4%). At this point, teacher recruitment was halted. This decision was made to preserve the integrity of the study’s random sampling design, as continuing with further draws or switching to non-random methods to reach the initial teacher target would have introduced selection bias. Although the final sample of 292 children was smaller than initially planned, it was deemed sufficient for the planned analyses. The study was adequately powered to detect the expected small to moderate correlations for the criterion validity analysis. More importantly, the sample size proved sufficient for the complex longitudinal Confirmatory Factor Analyses (CFA), which achieved good model fit, indicating that the data were stable and robust enough to support the investigation of the instrument’s factorial structure and measurement invariance over time.

The composition of the participating children per time point can be seen in Table 1. Regarding their social background, 82% of the children spoke the official school language at home. In 35% of the cases, at least one parent held a university degree, and 9% of the children had parents without any qualified degree. Four trained researchers, including the first and third author, visited the kindergartens to conduct the assessments.

Table 1

Participant Demographics at Each of the Three Time Points.

	T0 0 MONTHS (BEGINNING OF THE FIRST YEAR OF KINDERGARTEN)	T1 13 MONTHS (BEGINNING OF THE SECOND YEAR OF KINDERGARTEN)	T2 18 MONTHS (TOWARDS THE END OF KINDERGARTEN)
Participants (n)	279 (95.55%)	245 (83.90%)	242 (82.88%)
Age (months)	M = 58.41; SD = 4.47	M = 72.66; SD = 4.38	M = 77.42; SD = 4.35
Gender (%)	46% female	46% female	47% female
Group Allocation (%)	47% control group	47% control group	48% control group

[i] Note. 76% of the kindergarten children participated at all three time points. After T0, 36 of the participants exited the study due to relocation. After T0, 13 participants were added to the pool to help offset this loss.

The study was approved by the school authorities of the cantons Zurich and Thurgau. School principals were informed about the study. Parental consent was obtained, and parents were asked to fill out a short questionnaire. Children with diagnosed intellectual disabilities were excluded from the study. However, children from diverse language backgrounds or those receiving therapy, such as speech therapy or psychomotor therapy, were included. Potential ethical concerns related to the study were discussed with the ethics commission of eastern Switzerland, who assured us that there were no ethical concerns that needed to be considered.

Measures

Oral Narrative Ability (MuTex)

Elicitation and Transcription: Children’s oral narrative ability was assessed using the MuTex instrument. Following a brief warm-up, each child individually viewed a speech-free cartoon approximately three minutes in length. Appendix A provides key still images from all cartoons, as well as the full instructions. The presentation order of these cartoons was randomized for each child across the three time points, ensuring that each child viewed a different cartoon at each time point. Furthermore, to control for potential diffusion effects within the classroom, a systematic rotation of the three cartoons (e.g., A-B-C-A-B) was employed. By ensuring that identical stimuli were never presented to consecutive children, the temporal gap between repetitions was maximized. This rotation, combined with the fact that children were tested individually in a separate room, mitigated the risk of children sharing story content with peers who had not yet been assessed with that specific stimulus (as predetermined on the assessment run sheet). An examiner then provided a narrative prompt: “Tell me, what happened in the cartoon”. If the child did not start by him- or herself or did not explicitly finish the story, the examiner prompted the child to start or continue up to three times. The entire interaction was audio-recorded. For analysis, a clean narrative excerpt was transcribed by removing the examiner’s turns, word repetitions, and pause-filling utterances.

Cartoon Equivalence: Differences between the three cartoons were tested with regression analyses using dummy variables. These analyses revealed no significant differences in the MuTex scores between the cartoons at any of the three time points. At T1 and T2, children scored slightly better with the cartoons crocodile and octopus compared to the cartoon lamb. However, these differences were not statistically significant (T0: Crocodile β = 0.08, p = 0.548; Octopus β = 0.02, p = 0.901. T1: Crocodile β = 0.20, p = 0.141; Octopus β = 0.24, p = 0.065. T2: Crocodile β = 0.21, p = 0.116; Octopus β = 0.22, p = 0.099).
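To make the dummy-variable approach concrete, the following sketch codes the cartoon as two 0/1 indicators with lamb as the reference category and estimates an ordinary least squares regression. It is a minimal illustration and not the authors' analysis: the six scores are invented toy values, the function names are hypothetical, and neither standardisation of the coefficients nor significance testing is covered.

```python
def dummy_code(cartoons, levels=("crocodile", "octopus")):
    """Intercept column plus one 0/1 column per non-reference cartoon.
    Lamb is the implicit reference category."""
    return [[1.0] + [1.0 if c == level else 0.0 for level in levels]
            for c in cartoons]

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry into place
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [v - factor * p for v, p in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Toy data (hypothetical): two children per cartoon
cartoons = ["lamb", "lamb", "crocodile", "crocodile", "octopus", "octopus"]
scores = [8.0, 10.0, 9.0, 11.0, 10.0, 12.0]

intercept, b_crocodile, b_octopus = ols(dummy_code(cartoons), scores)
print(intercept, b_crocodile, b_octopus)
```

With a single dummy-coded factor, the intercept equals the mean of the reference group and each coefficient equals the difference between that cartoon's mean and the reference mean, which is why a non-significant coefficient can be read as no detectable difference from the lamb cartoon.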

The scoring procedure combines objective and inferential criteria. The first dimension, Soloistic Production, is captured using countable characteristics such as text length and number of turns. The other three dimensions are captured through rater judgments based on observable characteristics. These dimensions are scored using two types of criteria: Basic Criteria and Additional Criteria, which are detailed in Appendix B. Basic Criteria are closely tailored to the core requirements of the task and enable a fundamental assessment of each dimension. In contrast, Additional Criteria make it possible to account for special achievements, such as unique elaborations, that are not necessarily elicited by the task. For Soloistic Production, only Basic Criteria are applied. For the three more inferential dimensions (Representation of Distant Content, Textual Organisation, and Genre-Specific Patterns), both Basic and Additional Criteria are combined to form the dimension score.

Coding and Scoring Framework: The scoring procedure for the four dimensions of oral narrative ability is detailed below. Each dimension’s final score, ranging from 1 to 5, is derived by scoring individual criteria, summing the points, and converting this total based on established benchmarks. These benchmarks are grounded in prior qualitative video study analyses by the last author. The four final dimension scores are then summed to yield a total oral narrative score ranging from 4 to 20. The following narrative, produced by a child, will be used to illustrate the scoring process:

“Two crocodiles tried to eat, but they had to hold the food with their tail and then the tail had to lift it up and the ‘ugh’ couldn’t eat it because, you see, he is a crocodile. Crocodiles can’t eat, after all. They have to swim under water.

And then they were next to the water and then they fell into the water.

Then a shark ate them, the crocodiles.”

The first dimension, Soloistic Production, assesses the child’s ability to act as the primary speaker. In the example, the child’s 3 Interactional Turns earn 3 points, the Longest Turn of 50 words earns 2 points, and the Average Turn Length of 24.3 words earns 1 point. As shown in Table 2, these points are summed for a total of 6, which falls within the 6-to-8 point range and yields a Final Score of 2.

Table 2

Scoring of Soloistic Production.

CRITERIA	POINTS
CRITERIA	1	2	3	4	5
1. Number of Interactional Turns	≥5	4	3	2	1
2. Longest Turn in Words	<40	40–64	65–88	89–113	>113
3. Average Turn Length	<29	29–49	50–70	71–91	>91
Final Score	<6	6–8	9–11	12–14	15
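The conversion in Table 2 can be expressed as a small lookup. The sketch below is an illustration under two assumptions: the first band for the longest turn is read as fewer than 40 words, and a band is entered once its printed lower bound is reached, so fractional averages that fall between two printed bands go to the lower band. The function names are hypothetical.

```python
from bisect import bisect_right

def band(value, lower_bounds):
    """One point, plus one more for every lower bound the value has reached."""
    return 1 + bisect_right(lower_bounds, value)

def soloistic_production(turns, longest_turn, average_turn_length):
    """Return (summed points, final score) following Table 2."""
    turn_points = max(1, 6 - turns)          # 1 turn -> 5 points, 5 or more turns -> 1 point
    longest_points = band(longest_turn, [40, 65, 89, 114])
    average_points = band(average_turn_length, [29, 50, 71, 92])
    total = turn_points + longest_points + average_points
    final = band(total, [6, 9, 12, 15])      # <6 -> 1, 6-8 -> 2, ..., 15 -> 5
    return total, final

# Worked example from the text: 3 turns, longest turn 50 words, average 24.3 words
print(soloistic_production(3, 50, 24.3))
```

For the example narrative this reproduces the 6 summed points and the Final Score of 2 described in the text.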

The second dimension, Representation of Distant Content, evaluates the child’s ability to describe elements from the cartoon. In the example, the narrative’s inclusion of only 6 distinct Perceivable Elements (e.g., “two crocodiles,” “tried to eat”) earns 1 point. The child also earns another 1 point for including Additional Elements, such as the character explanation (“couldn’t eat it because he is a crocodile…”) and a continuation of the plot (“Then a shark ate them…”). As detailed in Table 3, summing these scores gives a total of 2 points, which corresponds to a Final Score of 2.

Table 3

Scoring of Representation of Distant Content.

CRITERIA	POINTS
CRITERIA	1	2	3	4	5
1. Number of Perceivable Elements	<9	9–12	13–16	17–20	>20
2. Occurrence of Additional Elements	≥1	–	–	–	–
Final Score	1	2	3	4	≥5
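A minimal sketch of the Table 3 conversion follows. It assumes that the first band for Perceivable Elements covers fewer than 9 elements, which is the reading consistent with the worked example (6 elements earn 1 point). The function name is hypothetical.

```python
from bisect import bisect_right

def representation_of_distant_content(perceivable_elements, additional_elements):
    """Return (summed points, final score) following Table 3."""
    # Assumed bands: <9 -> 1, 9-12 -> 2, 13-16 -> 3, 17-20 -> 4, >20 -> 5
    element_points = 1 + bisect_right([9, 13, 17, 21], perceivable_elements)
    # One extra point if at least one Additional Element occurs
    additional_points = 1 if additional_elements >= 1 else 0
    total = element_points + additional_points
    final = min(total, 5)                    # totals of 5 or more map to 5
    return total, final

# Worked example from the text: 6 perceivable elements, 2 additional elements
print(representation_of_distant_content(6, 2))
```

The example narrative yields 2 summed points and a Final Score of 2, matching the description above.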

The third dimension, Textual Organisation, assesses the narrative’s structural quality. In the example, the oral narrative earns 1 point for being partially chronological, as it begins with two crocodiles whereas the cartoon starts with one. It also earns 1 point for its frequent but unvaried Use of Simple Cohesive Devices. The inclusion of the Complex Cohesive Devices “but” and “because” earns another 1 point, while the absence of Signs of Text Organisation results in 0 points. As shown in Table 4, summing these points gives a total of 3, which corresponds to a Final Score of 2.

Table 4

Scoring of Textual Organisation.

CRITERIA	POINTS
CRITERIA	1	2	3	4	5
1. Chronological Order of Propositions	Partially	Always	–	–	–
2. Use of Simple Cohesive Devices	Often or seldom and varied	Often and varied	–	–	–
3. Use of Complex Cohesive Devices	Yes	–	–	–	–
4. Signs of Text Organisation	Yes	–	–	–	–
Final Score	<3	3	4	5	≥6
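The Table 4 rules can be sketched as follows. Two points are assumptions rather than statements from the table: a narrative with no chronological order (labelled "never" here) earns 0 points on the first criterion, and simple cohesive devices that are both seldom and unvaried earn 0 points. The function and argument names are hypothetical.

```python
def textual_organisation(chronology, simple_often, simple_varied,
                         complex_devices, organisation_signs):
    """Return (summed points, final score) following Table 4."""
    chronology_points = {"never": 0, "partially": 1, "always": 2}[chronology]
    if simple_often and simple_varied:
        simple_points = 2                    # often and varied
    elif simple_often or simple_varied:
        simple_points = 1                    # often, or seldom and varied
    else:
        simple_points = 0                    # assumption: seldom and unvaried
    total = (chronology_points + simple_points
             + (1 if complex_devices else 0)
             + (1 if organisation_signs else 0))
    final = max(1, min(total - 1, 5))        # <3 -> 1, 3 -> 2, 4 -> 3, 5 -> 4, >=6 -> 5
    return total, final

# Worked example from the text: partially chronological, frequent but unvaried
# simple devices, complex devices present, no signs of text organisation
print(textual_organisation("partially", True, False, True, False))
```

For the example narrative this gives 3 summed points and a Final Score of 2, as stated in the text.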

The fourth dimension, Genre-Specific Patterns, evaluates the use of narrative structures and markers. For the criterion Completion of Narrative Tasks, a raw score of 3 was awarded: the description of the problem was complete, earning 2 raw points, and only one solution attempt was described, earning another raw point. The raw score of 3 for this criterion converts to 1 point. For the criterion Occurrence of Narrative Markers, the child’s use of three markers (“ugh,” “you see,” “after all”) within a total of 73 words earns 2 points for frequency. Finally, for the criterion Different Types of Narrative Markers, 1 point was awarded since different types of markers (onomatopoeia and collaboration) were used. As detailed in Table 5, summing all components gives a total of 4 points, corresponding to a Final Score of 4.

Table 5

Scoring of Genre-Specific Patterns.

CRITERIA	POINTS
CRITERIA	1	2	3	4	5
1. Completion of Narrative Tasks	<4	4–5	6–7	8–9	≥10
2. Occurrence of Narrative Markers	≤1 marker per 54 words	>1 marker per 54 words	–	–	–
3. Different Types of Narrative Markers	Yes	–	–	–	–
Final Score	1	2	3	4	≥5
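The Table 5 conversion can be sketched in the same way. The sketch assumes that the first band for Completion of Narrative Tasks covers raw scores below 4, which matches the worked example (a raw score of 3 converts to 1 point). How narratives without any narrative markers are scored is not specified in this section; the code applies the table literally, so they receive 1 point on the second criterion. The function name is hypothetical.

```python
from bisect import bisect_right

def genre_specific_patterns(task_raw_score, markers, words, varied_marker_types):
    """Return (summed points, final score) following Table 5."""
    # Assumed bands: <4 -> 1, 4-5 -> 2, 6-7 -> 3, 8-9 -> 4, >=10 -> 5
    task_points = 1 + bisect_right([4, 6, 8, 10], task_raw_score)
    # More than one marker per 54 words -> 2 points, otherwise 1 point
    marker_points = 2 if markers * 54 > words else 1
    type_points = 1 if varied_marker_types else 0
    total = task_points + marker_points + type_points
    final = min(total, 5)                    # totals of 5 or more map to 5
    return total, final

# Worked example from the text: raw score 3, three markers in 73 words, varied types
print(genre_specific_patterns(3, 3, 73, True))
```

The example narrative yields 4 summed points and a Final Score of 4. Together with the Final Scores of 2 reported for each of the other three dimensions, the total oral narrative score for this child is 2 + 2 + 2 + 4 = 10 on the scale from 4 to 20.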

Three members of the research team rated the excerpts. The raters were trained and used a detailed manual that was continuously improved during the training process. The narrative excerpts were anonymized and the raters were blind as to whether the child was in the intervention or control group. Raters were also instructed to code independently and not discuss their ratings with each other. Twenty percent of all ratings were double rated.

Executive Functions (MEFS)

Executive functions were measured using the app-based Minnesota Executive Function Scale (“MEFS”; Carlson & Zelazo, 2014). During the test, the children were asked to sort virtual cards presented by the MEFS™ app into two boxes according to an increasingly complex set of sorting rules. The results were transmitted online to the MEFS providers, and the total score was computed using an algorithm that combines accuracy and response time. The MEFS was normed using over 7,000 test results from children and was found to have adequate test-retest reliability (r_tt = 0.86). Furthermore, it was shown that the test captures the well-known age trends in executive functions and is highly correlated with the NIH Toolbox DCCS, a commonly used research measurement of executive functions (Carlson, 2017).

Parental Educational Level and use of the official school language at home. This information was collected through a short questionnaire included in the parental consent form. The parents were asked to select their level of education and that of their partner from seven categories. The options ranged from no degree to a university degree. The parents were also asked to indicate the family languages. Parental educational level was computed as an ordinal variable with the following four categories: primary, secondary, tertiary A, and tertiary B. The educational level of the parents was coded as “primary” if both parents or the single parent did not graduate at all, only completed compulsory school, or only held a certificate for assistant jobs. If at least one of the parents had completed a higher secondary degree (baccalaureate) or a qualified professional apprenticeship, the level was coded as “secondary”. If at least one parent held a degree of higher professional education, the level was coded as “tertiary A”, and if at least one parent held a university degree (including universities of applied sciences), the level was coded as “tertiary B”. The use of the official school language at home was dummy-coded with a value of 0 if German (the official school language) was not spoken at home and a value of 1 if it was spoken at home.
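The coding rules for the two questionnaire variables can be sketched as follows. The sketch assumes that each parent's answer has already been mapped from the seven questionnaire categories onto the four levels, and that the household level is the highest level held by any parent; the latter is an interpretation of the "at least one parent" wording. All names are hypothetical.

```python
# Ordered from lowest to highest level
LEVELS = ["primary", "secondary", "tertiary A", "tertiary B"]

def family_education_level(parent_levels):
    """Household level = highest level among the parents (assumed rule)."""
    if not parent_levels:
        raise ValueError("at least one parent level is required")
    return max(parent_levels, key=LEVELS.index)

def speaks_school_language(family_languages, school_language="German"):
    """Dummy code: 1 if the school language is spoken at home, else 0."""
    return 1 if school_language in family_languages else 0

print(family_education_level(["secondary", "tertiary B"]))
print(speaks_school_language(["Albanian", "German"]))
```

A single-parent household is handled by passing a one-element list, which corresponds to the "both parents or the single parent" wording for the primary category.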

Analysis Strategy

All analyses were conducted using the lavaan package (Rosseel, 2012) for the R Software Version 4.1.2 (R Core Team, 2021). The interrater reliability analyses were conducted calculating the intraclass-correlation (ICC) estimates and their 95% confidence interval using the psych package (Revelle, 2021) based on a single-rating, absolute-agreement, 2-way mixed-effects model. A confirmatory factor analysis (CFA) across all time points was performed using the maximum likelihood with robust standard error estimator (MLR). The full information maximum likelihood (FIML) procedure was employed to address missing data. Oral narrative ability and executive functions were tested at each time point. The missing values for oral narrative ability ranged from 4.5% to 17.1% (T0: 4.5%, T1: 16.1%, T2: 17.1%) and for executive functions they ranged from 5.1% to 18.5% (T0: 5.1%, T1: 18.5%, T2: 17.5%). Percentages of missing data for age, parental education level, and use of the official school language at home ranged from 0.0% to 0.3%. The primary reasons for missing data at T1 and T2 were a child’s absence due to illness on the day of assessment or the family having moved between time points. To test for systematic attrition, t-tests were conducted to determine if children who participated at all three time points differed on their initial (T0) scores from those who missed at least one time point. The results showed no significant baseline differences for either oral narrative ability (completers: M = 8.94, SD = 3.62; non-completers: M = 9.21, SD = 3.38; t = 0.53, p = .60) or executive functions (completers: M = 45.66, SD = 12.41; non-completers: M = 44.49, SD = 9.41; t = –0.79, p = .43). Because the missingness was not related to the initial levels of the key variables, the data can be assumed to be missing at random (MAR). Therefore, the use of the FIML procedure is an appropriate method for handling the missing data in this study.
The effect coding method was used to identify the latent variables (Little, 2013), as it allows estimating the latent parameters in a nonarbitrary metric that reflects the metric of the measured indicators. This is achieved by constraining the average of the intercepts to 0 and the average of the loadings for a given construct to 1. To test for reliability, McDonald’s omega (ω) was computed for the factor oral narrative ability (McDonald, 1999). Configural, metric and scalar measurement invariance (Meredith, 1993) across all time points were assessed. Model structure, factor loadings and item intercepts were sequentially constrained to be longitudinally equal and differences in fit indices were checked. Loading invariance was considered tenable if CFI decreased by no more than 0.005 (ΔCFI ≥ –0.005), RMSEA increased by no more than 0.010 (ΔRMSEA ≤ 0.010), and SRMR increased by no more than 0.025 (ΔSRMR ≤ 0.025). The cutoff criteria for testing intercept invariance were the same except for ΔSRMR ≤ 0.005 (Chen, 2007). With regard to concurrent validity, Pearson correlations were computed at T0 to assess the relationships between age, executive functions, parental education level, use of the official school language at home, and MuTex. No outliers had to be removed from the analyses.
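Two quantities from this paragraph lend themselves to a short illustration: McDonald's omega for a single factor with uncorrelated residuals, and the decision rule for the invariance tests. The sketch below is not the authors' lavaan code. The loadings and residual variances are hypothetical, the cutoffs are read as "CFI may drop by at most 0.005, RMSEA and SRMR may rise by at most the stated amounts", and treating all three criteria as joint requirements is an assumption.

```python
def mcdonalds_omega(loadings, residual_variances, factor_variance=1.0):
    """Omega for one factor with uncorrelated residuals:
    common variance / (common variance + summed residual variance)."""
    common = factor_variance * sum(loadings) ** 2
    return common / (common + sum(residual_variances))

def invariance_holds(delta_cfi, delta_rmsea, delta_srmr, step):
    """Compare changes in fit (constrained minus less constrained model)
    against the cutoffs; step is 'loadings' or 'intercepts'."""
    srmr_cutoff = {"loadings": 0.025, "intercepts": 0.005}[step]
    return (delta_cfi >= -0.005
            and delta_rmsea <= 0.010
            and delta_srmr <= srmr_cutoff)

# Hypothetical standardised loadings for four indicators of one factor
loadings = [0.8, 0.8, 0.6, 0.8]
residuals = [1 - value ** 2 for value in loadings]
print(round(mcdonalds_omega(loadings, residuals), 2))

# Hypothetical fit changes when constraining loadings, then intercepts
print(invariance_holds(-0.003, 0.004, 0.010, "loadings"))
print(invariance_holds(-0.003, 0.004, 0.010, "intercepts"))
```

The `factor_variance` argument is kept explicit because, under effect coding, the factor variance is estimated rather than fixed to 1, so it enters the common-variance term.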

Results

Interrater Reliability

The interrater reliability was assessed by calculating intraclass correlation (ICC) estimates based on a single-rating, absolute-agreement, 2-way mixed-effects model (Revelle, 2021). According to established guidelines, ICC values between 0.75 and 0.90 are considered good, and values above 0.90 indicate excellent reliability (Koo & Li, 2016). The interrater reliabilities for all three time points and dimensions are shown in Table 6. The interrater reliabilities for Soloistic Production showed perfect agreement because it was based on objective and quantifiable criteria. The dimension Genre-Specific Patterns showed the lowest interrater reliabilities but still remained in an acceptable range. The total score of MuTex achieved excellent interrater reliability, ranging from 0.92 to 0.94.

Table 6

Interrater Reliability of MuTex.

	SOLOISTIC PRODUCTION	REPRESENTATION OF DISTANT CONTENT	TEXTUAL ORGANISATION	GENRE-SPECIFIC PATTERNS	TOTAL SCORE
T0	1	0.91	0.85	0.80	0.94
T1	1	0.87	0.72	0.67	0.93
T2	1	0.86	0.87	0.64	0.92
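For readers who want to see how a single-rating, absolute-agreement ICC is obtained from a two-way layout, the following sketch computes it from the usual mean squares. It is a plain-Python illustration with invented ratings from two hypothetical raters, not the computation performed for Table 6; the function name `icc_a1` is an assumption.

```python
def icc_a1(ratings):
    """Single-rating, absolute-agreement ICC for a two-way layout.

    ratings: list of rows, one row per child, one column per rater.
    """
    n = len(ratings)       # number of children
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-children mean square
    ms_cols = ss_cols / (k - 1)                # between-raters mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical ratings: identical raters versus a constant one-point offset.
identical = icc_a1([[1, 1], [2, 2], [4, 4], [3, 3]])
offset = icc_a1([[1, 2], [2, 3], [4, 5], [3, 4]])
print(identical, offset)
```

The second example shows why absolute agreement is the stricter definition: two raters who rank the children identically but differ by a constant amount no longer reach an ICC of 1.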

Correlations and Descriptive Statistics

The correlation coefficients and descriptive statistics are presented in Table 7. The four dimensions of MuTex correlated positively with one another at all three time points. Within each time point, the dimensions were closely associated with each other, except for the dimension Textual Organisation at T1 and T2, which was comparatively weakly associated with the other three dimensions (r = 0.23 to 0.35).

Table 7

Descriptive Statistics and Correlations.

VARIABLE	M	SD	1	2	3	4	5	6	7	8	9	10	11	12	13	14
1. SPt0	2.27	1.04
2. RDt0	1.91	0.90	0.63**
3. TOt0	2.63	1.22	0.51**	0.50**
4. GPt0	2.18	1.23	0.55**	0.59**	0.51**
5. TSt0	8.99	3.57	0.81**	0.81**	0.79**	0.83**
6. SPt1	3.03	1.02	0.35**	0.32**	0.39**	0.21**	0.39**
7. RDt1	2.72	1.18	0.30**	0.32**	0.39**	0.24**	0.38**	0.62**
8. TOt1	3.40	0.98	0.18**	0.27**	0.29**	0.22**	0.30**	0.27**	0.23**
9. GPt1	3.07	1.40	0.30**	0.27**	0.30**	0.24**	0.33**	0.51**	0.62**	0.35**
10. TSt1	12.22	3.52	0.37**	0.38**	0.44**	0.30**	0.46**	0.78**	0.83**	0.57**	0.85**
11. SPt2	3.49	1.02	0.19**	0.22**	0.19**	0.18**	0.24**	0.34**	0.25**	0.18**	0.24**	0.33**
12. RDt2	3.20	1.19	0.30**	0.31**	0.27**	0.25**	0.35**	0.39**	0.39**	0.24**	0.30**	0.43**	0.62**
13. TOt2	3.69	0.95	0.22**	0.29**	0.29**	0.32**	0.35**	0.23**	0.19**	0.40**	0.19**	0.32**	0.32**	0.33**
14. GPt2	3.44	1.30	0.36**	0.35**	0.30**	0.26**	0.39**	0.34**	0.30**	0.27**	0.24**	0.37**	0.43**	0.55**	0.25**
15. TSt2	13.82	3.38	0.36**	0.39**	0.35**	0.34**	0.44**	0.44**	0.39**	0.35**	0.33**	0.48**	0.78**	0.85**	0.59**	0.78**

Note. SP = Soloistic Production. RD = Representation of Distant Content. TO = Textual Organisation. GP = Genre-Specific Patterns. TS = Total score for oral narrative ability. The suffixes t0, t1 and t2 denote the time points T0, T1 and T2.

**p < .01.

Confirmatory Factor Analysis

Confirmatory factor analysis across all time points revealed good fit for the model in which the four proposed dimensions load on one oral narrative ability factor at each time point (χ² = 79.48, df = 43, p < .001, RMSEA = 0.05, CFI = 0.97, TLI = 0.95). Factor loadings and reliabilities for all time points are presented in Figure 1. Most factor loadings were high, except for those of Textual Organisation. The reliabilities (McDonald’s ω) ranged from 0.77 to 0.82.
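As a reminder of how McDonald’s ω relates to the factor loadings, the following sketch computes it for a single factor with uncorrelated residuals. The loadings are hypothetical standardised values chosen for illustration; they are not the estimates shown in Figure 1, and the function name is an assumption.

```python
def mcdonald_omega(loadings, residual_variances, factor_variance=1.0):
    """Omega for one factor: common variance of the sum score
    divided by its total variance (residuals assumed uncorrelated)."""
    common = factor_variance * sum(loadings) ** 2
    return common / (common + sum(residual_variances))


# Hypothetical standardised loadings for four indicators (assumption).
loadings = [0.8, 0.8, 0.5, 0.7]
# With standardised indicators, each residual variance is 1 - loading^2.
residuals = [1 - lam ** 2 for lam in loadings]

omega = mcdonald_omega(loadings, residuals)
print(round(omega, 3))
```

Because the summed loadings enter the numerator squared, a single weak indicator lowers ω less than it would lower the average loading, which is one reason a scale can remain reliable even when one dimension loads weakly.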

Measurement Invariance over Time

To test measurement invariance over time, a longitudinal CFA was computed for the three time points (Putnick & Bornstein, 2016). The configuration of the latent variables was invariant across time, but full metric invariance did not hold, as indicated by a decrease in CFI far beyond the cutoff (ΔCFI = –0.05). After examination of the modification indices and the item loadings, the equality constraints on the loadings of the dimension Textual Organisation were released. This step did not improve the model sufficiently (ΔCFI = –0.02). After further examination, the constraint on the loading of the dimension Representation of Distant Content was also released. The resulting partial metric invariance model fulfilled all cutoff criteria, as shown in Table 8. Scalar invariance was tested next. The constraints applied in the partial metric invariance model were retained, and partial scalar invariance across time was reached (Steenkamp & Baumgartner, 1998).

Table 8

Model Fit Indices MuTex T0, T1, T2 (N = 292).

MODEL	χ²	df	CFI	ΔCFI	RMSEA (90% CI)	ΔRMSEA	SRMR	ΔSRMR
Configural	74.91	39	0.967		0.056 (0.036–0.075)		0.056
Metric^a	76.95	41	0.967	0.000	0.055 (0.035–0.074)	–0.001	0.057	0.001
Scalar^b	79.48	43	0.966	–0.001	0.054 (0.035–0.072)	–0.001	0.058	0.001

Note. χ² = Chi-square; df = Degrees of freedom; CFI = Comparative fit index; RMSEA = Root mean square error of approximation; SRMR = Standardised root mean square residual; CI = Confidence interval.

^a Factor loadings (partly) constrained to be equal across time points.

^b Factor loadings and item intercepts (partly) constrained to be equal across time points.

This result indicates that the MuTex instrument measures a stable underlying construct of oral narrative ability across the three time points, allowing for valid longitudinal comparisons of latent means even as the contribution of individual dimensions evolves.
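The model comparisons in Table 8 can be retraced with a short sketch that applies the cutoff values from the data analysis section to the reported fit indices. It takes a conservative reading in which a step is accepted only if all three changes stay within their cutoffs, and it treats a change exactly at a cutoff as a violation; both choices are assumptions of this illustration, as are the function and variable names.

```python
# Fit indices as reported in Table 8.
fits = {
    "configural": {"cfi": 0.967, "rmsea": 0.056, "srmr": 0.056},
    "metric": {"cfi": 0.967, "rmsea": 0.055, "srmr": 0.057},
    "scalar": {"cfi": 0.966, "rmsea": 0.054, "srmr": 0.058},
}


def within_cutoffs(base, constrained, srmr_cutoff):
    """Return True if the more constrained model stays within all cutoffs."""
    d_cfi = constrained["cfi"] - base["cfi"]
    d_rmsea = constrained["rmsea"] - base["rmsea"]
    d_srmr = constrained["srmr"] - base["srmr"]
    return d_cfi > -0.005 and d_rmsea < 0.010 and d_srmr < srmr_cutoff


# Loadings step: partial metric model against the configural model.
metric_ok = within_cutoffs(fits["configural"], fits["metric"], srmr_cutoff=0.025)
# Intercepts step: partial scalar model against the partial metric model.
scalar_ok = within_cutoffs(fits["metric"], fits["scalar"], srmr_cutoff=0.005)
print(metric_ok, scalar_ok)
```

Each step is compared with the immediately preceding, less constrained model, which matches the sequential testing order described above.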

Criterion Validity

Correlations between relevant variables and oral narrative ability as assessed by MuTex at T0 were calculated and are reported in Table 9. Parental education level showed a small positive correlation with MuTex (r = 0.14). Executive functions were moderately positively correlated with MuTex (r = 0.35). Age (r = 0.22) and speaking the official school language at home (r = 0.28) were also positively correlated with MuTex.

Table 9

Descriptive Statistics and Correlations for Criterion Variables at T0.

VARIABLE	M	SD	RANGE	1	2	3	4
1. Oral Narrative Ability (MuTex)	8.99	3.57	4–19
2. Age	58.45	4.44	50–75	0.22**
3. Parental Education Level	1.84	1.01	0–3	0.14*	–0.03
4. Executive Functions	45.40	11.81	8–92	0.35**	0.10	0.21**
5. German spoken at home (0 = No, 1 = Yes)	0.82	0.39	0–1	0.28**	0.01	0.31**	0.28**

Note. Oral Narrative Ability is the total score for the MuTex instrument. Age is reported in months. Parental Education Level was coded on a 4-point ordinal scale: 0 = compulsory schooling or less, 1 = higher secondary degree or professional apprenticeship, 2 = higher professional education, and 3 = university degree. Executive Functions is the total score from the MEFS, ranging from 0 to 100. German spoken at home was dummy-coded (0 = German is not spoken at home, 1 = German is spoken at home). The mean (M) for this variable represents the proportion of the sample speaking German at home. Correlations involving this variable are point-biserial correlations. *p < .05; **p < .01.
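The table note states that correlations involving the dummy-coded language variable are point-biserial correlations. The sketch below illustrates, with invented scores, that a point-biserial correlation is simply a Pearson correlation in which one variable is coded 0/1. The data and function names are hypothetical and serve only to show the equivalence.

```python
from math import sqrt
from statistics import fmean, pstdev


def pearson(x, y):
    """Pearson correlation using population (n-denominator) moments."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))


def point_biserial(scores, group):
    """Point-biserial correlation from group means and the group proportion."""
    ones = [s for s, g in zip(scores, group) if g == 1]
    zeros = [s for s, g in zip(scores, group) if g == 0]
    p = len(ones) / len(scores)  # proportion coded 1
    return (fmean(ones) - fmean(zeros)) / pstdev(scores) * sqrt(p * (1 - p))


# Hypothetical total scores and a 0/1 indicator (assumption, not study data).
scores = [5, 7, 9, 12, 8, 14, 10, 6]
group = [0, 0, 1, 1, 0, 1, 1, 1]

r_pearson = pearson(scores, group)
r_pb = point_biserial(scores, group)
print(r_pearson, r_pb)
```

Both functions return the same value, so the coefficients in the last row of the table can be read like the other correlations, with the sign indicating which group scores higher on average.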

Discussion

The aim of this study was to investigate the reliability and validity of the MuTex instrument for longitudinal assessment in kindergarten children. The results provided substantial support for its psychometric properties. The instrument demonstrated high reliability and the confirmatory factor analysis supported the original structure proposed by Isler et al. (2018). Furthermore, the analysis established partial scalar invariance across the three time points, indicating that the instrument provides a stable measure of the underlying construct of oral narrative ability over time. Finally, the evaluation of criterion validity was successful, showing that MuTex scores correlated in the expected directions with age, executive functions, parental education level, and use of the official school language at home.

The finding of partial, rather than full, scalar invariance warrants a more detailed discussion. While full invariance represents the ideal psychometric standard, partial invariance is a robust and widely accepted result in longitudinal research (Steenkamp & Baumgartner, 1998). It indicates that the underlying construct of oral narrative ability measured by MuTex is stable enough over time to allow for valid comparisons of latent means, as the model satisfies the criterion of having at least two invariant indicators. In addition, simulation studies have demonstrated that employing a partial invariance model results in valid estimates and inferences of the latent parameters (Guenole & Brown, 2014; Hsiao & Lai, 2018). Importantly, the way in which the dimensions Textual Organisation and Representation of Distant Content were found to be non-invariant offers valuable interpretive insights that align with developmental theory. For Textual Organisation, the varying factor loading suggests that its importance as an indicator of high ability diminishes as children get older. This developmental shift is also reflected in the low correlations observed between Textual Organisation and the other narrative dimensions, particularly at T1 and T2. While this may suggest a ceiling effect, the variance for Textual Organisation was not substantially lower than for the other dimensions. A more likely explanation lies in the operationalisation of the Textual Organisation criteria. For instance, the use of complex cohesive devices is scored dichotomously (0 for no, 1 for yes), which fails to capture the increasing frequency and variety of these devices in older children’s narratives. As children grow older and this skill becomes more widespread, a simple presence/absence score is no longer a sensitive measure of ability. This limitation in capturing the nuances of development is a plausible reason for the decreasing factor loadings and weaker correlations at later time points.

In contrast, the factor loading for Representation of Distant Content was non-invariant because it increased over time. This indicates that the ability to represent distant content became an even stronger, more central indicator of a child’s overall narrative competence as they grew older. Consequently, the lack of full measurement invariance does not appear to stem from an unstable construct, but rather from the metric limitations inherent in the instrument’s operationalisation. MuTex therefore functions as a valid instrument for longitudinal assessment, anchored by its stable dimensions while also exposing the weaknesses of its less granular components.

The evaluation of the criterion validity demonstrated associations in line with previous research. As hypothesized based on the findings of Mozzanica et al. (2016), parental education level was found to have a small but significant association with oral narrative ability. Executive functions exhibited a moderate association with oral narrative ability, aligning with previous research (Scionti et al., 2023). Furthermore, age and oral narrative ability were positively correlated, and the mean MuTex total score increased substantially across the time points (M = 8.99 at T0, 12.22 at T1, and 13.82 at T2; see Table 7). These findings suggest that MuTex captures oral narrative ability in an age-sensitive manner. Given that T0 took place shortly after the children started kindergarten, we expected the measured oral narrative ability to be positively associated with the children’s use of the official school language at home. The results, as anticipated, revealed a small to moderate positive correlation between use of the official school language at home and MuTex. Because this correlation is small to moderate rather than strong, it suggests that MuTex is applicable for children with different language backgrounds and reduces their disadvantage compared to children speaking the language of schooling in their families. In summary, these findings support the criterion validity of MuTex.

Beyond its psychometric properties, the successful longitudinal validation of MuTex has significant implications for both theory and practice. The introduction highlighted that early oral narrative ability is a key predictor for later literacy and academic success. By establishing MuTex as a reliable and partially invariant instrument across time, this study provides researchers with a robust tool to empirically investigate these developmental trajectories. It allows for the tracking of genuine growth in narrative competence over the crucial kindergarten years and provides a valid outcome measure for evaluating the effectiveness of future language interventions.

Furthermore, the findings regarding criterion validity offer practical value. The instrument’s age-sensitivity confirms its utility in capturing the development of oral narrative ability during this period. Critically, the finding that MuTex scores are only weakly to moderately correlated with the use of the official school language at home supports its use in diverse educational settings. As classrooms become more linguistically heterogeneous, there is a pressing need for assessment tools that are equitable in terms of language demand. By reducing reliance on language comprehension, MuTex offers educators and clinicians a fairer method for assessing the core oral narrative abilities of all children, so that those with low language proficiency are less disadvantaged. This provides a more accurate foundation for identifying children who may need support and for fostering the decontextualized language skills that are essential for academic readiness.

Despite the strong support for its psychometric properties, some limitations and avenues for future research should be noted. A key practical limitation of MuTex is its reliance on transcription. Because the assessment is based on countable text characteristics, a full transcript is indispensable, which impedes its use as a rapid screening instrument. However, for its intended research context, the process from elicitation to rating remains efficient.

A further limitation concerns the generalizability of MuTex to children with other native languages and from different cultural backgrounds. The instrument’s core dimensions are based on foundational principles of narrative construction (e.g., cohesion, story grammar) that are largely universal across languages, suggesting broad potential for applicability. This must be balanced with careful consideration for cultural background, as research highlights that narrative style can vary significantly, shaped by the storytelling traditions within a child’s family and community (Gardner-Neblett et al., 2012; Gardner-Neblett & Iruka, 2015). For example, some children may produce linear, “topic-centered” narratives, while others may use a “topic-associating” style that consists of a series of implicitly linked anecdotes, a style that is equally complex but structurally different (Gardner-Neblett et al., 2012). This stylistic diversity has direct implications for the scoring of several MuTex dimensions, most notably Textual Organisation and Genre-Specific Patterns. A rater unfamiliar with the topic-associating style might unfairly penalize its non-linear structure under the Textual Organisation criteria. Similarly, the Genre-Specific Patterns dimension, which assesses story grammar and expressive markers, might not capture culturally specific elements such as the “performative narratives” or interactive “call-response” patterns valued in some oral traditions. Therefore, a crucial direction for future research is the cross-cultural validation of MuTex. This process would need to involve adapting the scoring manual, particularly for the Textual Organisation and Genre-Specific Patterns dimensions, to ensure its sensitivity to the rich variety of narrative styles used by children from different backgrounds.

The validation in this study was limited by the absence of a direct comparison with another established instrument for measuring narrative ability, which would be necessary to fully establish convergent validity. An assessment of the children’s broader language production is planned, which will help address this in the future. It is also acknowledged that the assessment intervals were unequal (T0–T1 ≈ 13 months; T1–T2 ≈ 5 months). While this does not compromise the primary aim of establishing longitudinal validity, the unequal intervals preclude a direct comparison of the pace of developmental growth across the two time spans. Furthermore, while the current study establishes the instrument’s validity, future work could focus on refining the operationalisation of certain criteria. For example, as noted in the discussion of measurement invariance, the dichotomous scoring for the use of complex cohesive devices within the Textual Organisation dimension could be expanded to a graded scale to more sensitively capture growth in older children. Such refinements would build upon the foundation established in this study.

Acknowledging these limitations, the study’s results nonetheless provide substantial support for the reliability and validity of the developed instrument. MuTex provides a promising method for investigating the development and promotion of oral narrative ability.

Transparency Statement

We report how we determined the sample size and the stopping criterion. We report all experimental conditions and variables. We report all data exclusion criteria and whether these were determined before or during the data analysis. We report all outlier criteria and whether these were determined before or during the data analysis.

Preregistration

No part of the study procedures was pre-registered prior to the research being conducted. No part of the study analyses was pre-registered prior to the research being conducted.

Additional File

The additional file for this article can be found as follows:

Appendices

Appendix A to C. DOI: https://doi.org/10.5334/spo.79.s1

Data Accessibility Statement

Anonymized, non-aggregated raw data will be available from SWISSUbase (https://doi.org/10.48573/hgf9-b827) from May 1, 2024. Interested researchers should click the Download button in the upper-right corner of the browser window and request data access on the following screen. If they do not yet have a SWISS-EDU-ID, they will be prompted to register (at no cost) before submitting their request.

Acknowledgements

The study has been financed by the Swiss National Science Foundation (SNSF).

Author Contributions

Author 1: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – Original Draft, Writing – Review & Editing.

Author 2: Data Curation, Formal Analysis, Methodology, Software, Validation, Writing – Review & Editing.

Author 3: Funding Acquisition, Investigation, Methodology, Project Administration, Resources, Supervision, Writing – Review & Editing.