Fine-Tuning South Tyrolean Dialect-to-Standard German ASR with AlpiLinK

Greta H. Franzini; Luca Ducceschi

doi:10.5334/johd.533

1 Context and motivation

This article reports on ongoing work that builds on the reuse of the AlpiLinK Corpus within a broader effort to adapt Automatic Speech Recognition (ASR) technology to a highly specific linguistic challenge: transcribing South Tyrolean dialectal speech into Standard German. Originally collected for dialectological research, AlpiLinK is here repurposed to support applied ASR development addressing real-world needs across media production, public administration, cultural heritage and accessibility. In these domains, accurate transcription is essential for practical workflows such as subtitling, archival documentation and administrative record-keeping.

South Tyrolean dialect refers to a cluster of Upper German dialects, primarily Bavarian (Wiesinger, 1983), spoken in the autonomous province of South Tyrol in northeastern Italy (Colletti & Lombardo, 2025, p. 73). As with many regional and non-standard varieties, these dialects have traditionally been documented for linguistic research rather than collected with technological reuse in mind. While Standard German serves as the official written language in the region, everyday spoken communication frequently takes place in dialect, whose phonological, morphological and lexical divergence from the standard (Stöckle & Vergeiner, 2025) poses substantial challenges for applications such as ASR. Furthermore, South Tyrolean lacks a standardised orthography; written dialect remains largely informal and inconsistent, which further complicates the development of textual corpora for language modelling. The dialect is spoken by most of the German-speaking population in South Tyrol (Volgger et al., 2024), a region characterised by its bilingual context (German and Italian) and strong cultural identity.

ASR systems trained on standard language varieties generally struggle when applied to dialectal speech, due to pronounced phonological, morphological and lexical divergence. Addressing these differences typically requires the adaptation of existing architectures through fine-tuning on variety-specific data and careful control over data preparation steps such as audio segmentation, alignment and metadata annotation.

While this work includes the fine-tuning and evaluation of an ASR model, its primary contribution lies in the systematic analysis of how an existing dialectological dataset can be repurposed for ASR training, and which technical, linguistic and organisational constraints emerge in this process. The ASR experiments presented in this paper therefore serve primarily as a means to assess the feasibility, limitations and reuse potential of the underlying data rather than as an end in themselves.

2 Related work

ASR for German dialects has gained increased attention in recent years, driven both by methodological advances in multilingual neural architectures and by the growing availability of dialectal speech resources. Nevertheless, dialect varieties remain underrepresented compared to standard language data, and the scarcity of large-scale, consistently annotated corpora continues to limit robust model development. For this reason, recent work has followed two complementary directions: the creation of ASR-oriented dialectal datasets and the adaptation or reuse of existing linguistic resources for speech-technology applications.

For Swiss German, several large-scale speech corpora have enabled systematic ASR research. These include STT4SG-350 (Plüss et al., 2023), SDS-200 (Plüss et al., 2022) and SwissDial (Dogan-Schönberger et al., 2021), which provide parallel or near-parallel speech–text resources across multiple dialect regions. Building on these datasets, multiple studies demonstrate that fine-tuning multilingual models such as wav2vec 2.0 or Whisper on relatively small amounts of dialect-specific data substantially improves performance over zero-shot systems, particularly for dialect-to-standard transcription tasks (Plüss et al., 2023; Sicard et al., 2023). Qualitative analyses further show that residual errors often stem from systematic morphosyntactic and lexical differences rather than from signal degradation alone.

Closely related work has emerged for Luxembourgish (Lëtzebuergesch), a West Central Germanic variety. Like South Tyrolean dialects, Luxembourgish is characterised by strong multilingual interaction, particularly with German and French, and by long-standing scarcity of digital resources. A growing body of work has shown that state-of-the-art ASR models can be successfully adapted to Luxembourgish through transfer learning and fine-tuning on comparatively small but carefully curated speech corpora (Adda-Decker et al., 2014; Gilles et al., 2023). More recent systems based on wav2vec 2.0 and Whisper achieve competitive word error rates despite limited training data, while explicitly addressing challenges related to multilingual lexicons, code-switching and orthographic stabilisation (Gilles et al., 2023; Hosseini-Kivanani et al., 2025). Importantly, several Luxembourgish ASR efforts rely on the reuse or gradual enrichment of linguistic, institutional or public-sector speech resources, rather than on speech corpora originally collected for machine-learning purposes. This makes Luxembourgish a particularly relevant point of comparison for the present study, illustrating how existing linguistic resources can be transformed into effective training data for modern ASR in low-resource, multilingual settings.

Research on Austrian German dialects similarly highlights the limitations of standard-language ASR systems when confronted with regional phonological and lexical variation. Prior work documents the impact of pronunciation variation, speaking style and utterance length on ASR performance, particularly in conversational speech (Kerle et al., 2023; Linke et al., 2025). These findings resonate with recent multi-dialectal evaluations for southern German varieties, including Franconian and Bavarian, which show that ASR systems frequently normalise dialectal constructions in inconsistent ways when trained without sufficient variety-specific data (Blaschke et al., 2025).

3 Dataset description

3.1 AlpiLinK – Alpine Languages in Contact

The AlpiLinK – Alpine Languages in Contact project, now completed, aimed to document German, Romance and Slavic non-standard and minority languages spoken in alpine Italy, focussing on language contact and change in phonology, morphology and syntax (Kruijt & Rabanus, 2025).¹ Data collection was crowdsourced through questionnaires covering tasks such as translation, tense and word-class transformation, and image description. Although AlpiLinK was not originally designed for machine-learning applications, its audio-text alignment and rich metadata make it a promising candidate for reuse in ASR fine-tuning scenarios. Indeed, at the time of writing, AlpiLinK remains the only publicly available dataset suitable for this particular task.

Repository location

10.5281/zenodo.8360169.

Repository name

Zenodo.

Object name

tir.zip.

Format names and versions

FLAC, CSV, HTML, Markdown.

Creation dates

First version (v1.0.0) created in 2023-09; latest version (v1.2.0) created in 2025-05.

Dataset creators

Authors: Stefan Rabanus (University of Verona), Anne Kruijt (University of Verona), Birgit Alber (Free University of Bozen-Bolzano), Ermenegildo Bidese (University of Trento), Livio Gaeta (University of Turin), Gianmario Raimondi (University of Aosta Valley). Collaborators: Paolo Benedetto Mas (University of Aosta Valley), Sabrina Bertollo (University of Verona), Serena Bissolo (University of Trento), Angelica Bonelli (Free University of Bozen-Bolzano), Dario Capelli (University of Turin), Jan Casalicchio (University of Siena), Raffaele Cioffi (University of Turin), Patrizia Cordin (University of Trento), Silvia Dal Negro (Free University of Bozen-Bolzano), Ilaria Driussi (University of Verona), Sara Erriu (University of Aosta Valley), Alexander Glück (Free University of Bozen-Bolzano), Joachim Kokkelmans (Free University of Bozen-Bolzano), Adriano Murelli (University of Turin), Andrea Padovan (University of Verona), Aline Pons (University of Aosta Valley), Matteo Rivoira (University of Turin), Marta Tagliani (University of Verona), Caterina Saracco (University of Turin), Alessandra Tomaselli (University of Verona), Ruth Videsott (Free University of Bozen-Bolzano), Alessandro Vietti (Free University of Bozen-Bolzano) and Barbara Vogt (University of L’Aquila).

Language

Tyrolean (tir).

License

CC BY-NC-SA 4.0.

Publication date

First version (v1.0.0) published on 2023-10-02; latest version (v1.2.0) published on 2025-05-26.

3.2 ASR training dataset

For ASR fine-tuning, we exclusively use the AlpiLinK translation task (files prefixed with “S”), which required participants to produce spoken translations of thirty German stimuli or sentences. In our study, we used version 1.2.0 of this dataset, comprising 248 distinct speakers. All other AlpiLinK tasks are excluded from model training and from the statistics reported below.

AlpiLinK currently offers four hours of translation data. While it constitutes a significant portion of our training data, it is highly repetitive: it comprises only 30 unique sentences or stimuli, each repeated by many different speakers. It therefore cannot serve as the sole basis for a robust ASR model, as it is insufficient for broad linguistic generalisation. Fine-tuning on such a dataset involves a clear trade-off. On the one hand, extensive speaker variation on the same utterances can improve speaker generalisation: the model learns to focus on content rather than speaker-specific traits, making it more robust to new voices and accents. Training sets with many speakers per sentence tend to encourage speaker-invariant acoustic modelling, which benefits the recognition of unseen speakers. On the other hand, the extremely limited linguistic diversity poses a risk of overfitting. With only 30 distinct sentences, the model may essentially memorise their vocabulary and patterns, which can bias it toward transcribing those familiar sentences and lead to poor generalisation to novel or more complex utterances. ASR systems trained on narrow transcript domains often achieve high in-domain accuracy but struggle on broader language, since common words outside the training sentences may be absent from their experience. In our case, we consequently anticipated improved performance on the specific repeated prompts, but restricted generalisation capabilities unless the dataset was complemented with more varied linguistic material (Solberg et al., 2023).

We are therefore actively expanding and diversifying the training data by integrating material from multiple sources beyond AlpiLinK. These sources combine scripted, semi-scripted and spontaneous speech and span both publicly available and proprietary material, reflecting the limited availability of openly licensed South Tyrolean dialect data (Franzini & Ducceschi, to appear). Scripted read speech is drawn from a publicly available learner textbook for South Tyrolean dialect, which provides dialectal audio prompts accompanied by Standard German translations. While this resource offers clean recordings and controlled lexical content, the correspondence between audio and text is occasionally non-literal. Spontaneous speech is extracted from subtitled promotional videos published by a regional public institution. In this case, reference texts are obtained through optical character recognition of embedded subtitles followed by manual correction, and audio–text alignment is performed manually; as the subtitles were not designed as verbatim transcripts, the resulting pairs often reflect loose semantic correspondence rather than literal translation.

Additional training data originates from purpose-recorded material and audiovisual archives contributed by institutional partners. This includes in-house recordings and research interviews conducted at Eurac Research, in which speakers produce spontaneous dialectal translations or engage in conversation, as well as audiovisual archives held by partner cultural institutions. The latter comprise historical news broadcasts from the 1980s and 1990s and more recent audiovisual productions created for documentary purposes, typically featuring spontaneous dialectal speech in formal and semi-formal settings. Both sources are integrated through a shared iterative data-exchange cycle: audio material provided by partners is automatically transcribed using the current fine-tuned ASR model, manually corrected by the partners themselves and subsequently returned to us to be incorporated into later training iterations. This workflow establishes a virtuous cycle in which model outputs directly support data production, while corrected transcriptions contribute to ongoing model improvement.

Taken together, these non-AlpiLinK sources substantially increase speaker, situational and stylistic diversity in the training data, while also introducing varying degrees of noise and alignment uncertainty. Their inclusion therefore supports both model robustness and a realistic assessment of the challenges involved in repurposing heterogeneous linguistic resources for dialect-to-standard ASR.

Table 1 summarises the data used for ASR training in this study and reflects the current state of the training corpus (v1.3), including the AlpiLinK translation task and additional non-AlpiLinK material. While all associated metadata —covering source information, speech types and granular speaker demographics— is publicly available to ensure transparency and reproducibility, licensing restrictions prevent the full distribution of the corresponding audio data through the public GitLab repository.²

Table 1

Current overview of ASR training data used in this study (v1.3).

SOURCE	TYPE	HOURS	SPEAKERS	AGE GROUP	PROVENANCE
Learner textbook	scripted	47 m	3	20–49	public
AlpiLinK	scripted	4 h 47 m	180	10–89	public
In-house recordings	scripted, spontaneous	3 h 42 m	3	20–49	in-house, contributed
Audiovisual archives	scripted, spontaneous	1 h 53 m	9	0–99	contributed
Promotional videos	spontaneous	1 h 03 m	86	20–79	public
TOTAL		13 h 16 m

In Sections 4.2 and 5, we refer to successive versions of the training dataset (v1.0–v1.3), which reflect incremental extensions of a corpus that is still under active development and release, rather than finalised public editions. These versions progressively integrate the data sources described above. Version 1.0 is based primarily on the AlpiLinK data, complemented by scripted learner-textbook recordings, in-house and purpose-recorded interviews, and subtitled promotional video material. Version 1.1 expands this base with additional in-house interview data; version 1.2 further incorporates audiovisual material produced for film and documentary purposes by a partner cultural agency; and version 1.3 adds a limited number of historical news broadcasts from a partner audiovisual archive, resulting in only marginal additional data growth. Taken together, these successive versions enable an analysis of how increasing data volume and, in particular, increasing source heterogeneity influence ASR performance.

4 Method

4.1 Dataset

4.1.1 Acoustic features

To assess the technical integrity and linguistic diversity of the training data, we analysed its acoustic properties with DisVoice (Vásquez-Correa et al., 2018). Our analysis focussed on three primary sets of features: phonation, prosody and articulation.

AlpiLinK

Phonation features (Kent & Read, 2002; Titze, 1994), specifically Jitter (representing the cycle-to-cycle variation in fundamental frequency) and Shimmer (indicating the cycle-to-cycle variation in amplitude), serve as key markers of signal stability. In our corpus, we observed a mean Jitter of 0.42% and a mean Shimmer of 1.85%, reflecting a clean and stable vocal signal with minimal glottal noise. Prosodic analysis (Dehak et al., 2007) centred on the Fundamental Frequency (F0) or perceived pitch, which averaged 240 Hz with a standard deviation of 45 Hz. This broad range confirms a significant diversity of speakers across different age groups and genders. Finally, articulation (Fant, 1960; Vásquez-Correa et al., 2018) was evaluated through the speech rate, measured as the number of syllables per second, which averaged 3.65. Combined with well-distributed formant frequencies, these metrics indicate a robust and healthy dataset where performance gains are primarily driven by linguistic learning rather than signal artifacts or recording quality.
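The two phonation measures defined above can be illustrated with a short sketch. It uses the common "local" definition (mean absolute difference between consecutive cycles, divided by the mean, expressed as a percentage) and assumes that glottal cycle periods and per-cycle peak amplitudes have already been extracted. The function names are hypothetical, and DisVoice's own implementation may differ in detail.

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive values,
    relative to the mean of all values, in percent."""
    if len(values) < 2:
        raise ValueError("at least two cycles are needed")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value


def jitter_percent(periods_s):
    """Cycle-to-cycle variation of the fundamental period (seconds)."""
    return relative_perturbation(periods_s)


def shimmer_percent(peak_amplitudes):
    """Cycle-to-cycle variation of the per-cycle peak amplitude."""
    return relative_perturbation(peak_amplitudes)


if __name__ == "__main__":
    # Made-up values for four consecutive glottal cycles.
    print(jitter_percent([0.0100, 0.0101, 0.0099, 0.0100]))
    print(shimmer_percent([100, 102, 98, 100]))
```

Lower values indicate a steadier voice signal, which is why both measures are read here as markers of signal stability.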

Complete dataset (v1.3)

Phonation metrics indicate high signal stability, with a mean Jitter of 3.65% and Shimmer of 8.56%, suggesting a clean vocal signal suitable for acoustic modeling. Prosodic analysis shows an average Fundamental Frequency (F0) of 177.84 Hz (SD ≈ 44 Hz), confirming a balanced representation of male and female voices. Articulatory diversity is evidenced by mean formant values (F1: 746 Hz, F2: 1854 Hz) that cover a broad vowel space across different dialect variants. While a small subset of 26 AlpiLinK files exhibits low Signal-to-Noise Ratios (SNR < 10 dB), the overall corpus maintains a high-quality baseline (average SNR of 40 dB), ensuring that performance gains are driven by linguistic learning rather than signal quality.

4.2 Fine-tuned models

As previously mentioned, the only AlpiLinK material we use to fine-tune our ASR models is the translation task data (i.e. “S”-prefixed files). In all experiments, the target output is written Standard German, reflecting both the structure of the training data and real-world application scenarios such as subtitling and administrative transcription.

In our initial experiment, we chose to work with the multilingual OpenAI Whisper model (Ducceschi & Franzini, 2025; Radford et al., 2023). Whisper requires training audio to be segmented into chunks of no more than 30 seconds, a constraint that the AlpiLinK data already satisfied. Specifically, we downloaded the tir.zip AlpiLinK archive from Zenodo and converted the audio files from FLAC to WAV format using FFmpeg (FFmpeg Developers, 2016). Files shorter than two seconds were considered unusable and removed. Further data preparation steps included downsampling the files to a consistent sample rate of 16 kHz, verifying that the audio was monophonic, and checking that no reference text exceeded the maximum decoder sequence length of 448 tokens, all in accordance with Whisper training specifications.³
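The audio-side checks described above can be sketched with Python's standard library alone. This is a minimal illustration rather than the pipeline actually used: the function name and messages are hypothetical, and the 448-token check is left out because it requires the Whisper tokenizer.

```python
import wave

TARGET_RATE = 16_000   # Hz, the sample rate stated in the text
MIN_SECONDS = 2.0      # shorter files were discarded
MAX_SECONDS = 30.0     # Whisper's maximum chunk length

# Assumption: a conversion step along the lines of
#   ffmpeg -i input.flac -ar 16000 -ac 1 output.wav
# has already produced the WAV files being checked.


def check_wav(path):
    """Return a list of problems; an empty list means the file passes."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate

    problems = []
    if rate != TARGET_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE}")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if duration < MIN_SECONDS:
        problems.append(f"too short ({duration:.2f} s)")
    if duration > MAX_SECONDS:
        problems.append(f"too long ({duration:.2f} s)")
    return problems
```

Returning a list of problems instead of a boolean makes it easy to log why a file was excluded.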

The fine-tuning process was executed on the high performance computing cluster provided by our regional scientific network and the training data was exported to a CSV file comprising two columns: the first indicating the path to each audio file stored on the cluster and the second containing the file’s corresponding textual translation.
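A two-column file of this kind might be written as follows. The column names `audio_path` and `text` and the example path are assumptions made for illustration; the source only specifies that the first column holds the audio path and the second the translation.

```python
import csv


def write_manifest(pairs, handle):
    """Write (audio_path, text) pairs as a two-column CSV.

    csv.writer quotes fields that contain the delimiter, so
    sentences with commas do not break the column layout.
    """
    writer = csv.writer(handle)
    writer.writerow(["audio_path", "text"])  # hypothetical header
    for audio_path, text in pairs:
        writer.writerow([audio_path, text])


if __name__ == "__main__":
    import sys

    # Hypothetical path and sentence, for illustration only.
    write_manifest(
        [("/data/alpilink/S01_0001.wav", "Ja, das stimmt.")],
        sys.stdout,
    )
```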

5 Results and discussion

5.1 ASR performance

To establish a baseline for our experiment, we evaluated the OpenAI implementation of Whisper large-v3 without any fine-tuning (“baseline”). As our task lies at the intersection of transcription and translation, we report both WER (Word Error Rate), capturing transcription accuracy, and BLEU (Bilingual Evaluation Understudy), reflecting translation quality. For WER, a lower score indicates better performance, whereas for BLEU, a higher score is preferable.
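For readers less familiar with the metric, WER is the word-level edit distance between reference and hypothesis (substitutions, deletions and insertions), divided by the number of reference words. The sketch below is a plain illustration of that definition for a single sentence pair, not the evaluation code used in this study. The example sentences are invented around the war/wäre confusion discussed in Section 5.1.

```python
def wer(reference, hypothesis):
    """Word error rate of one hypothesis against one reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(wer("das wäre schön", "das war schön"))
```

Corpus-level WER is usually obtained by summing the edit operations over all utterances and dividing by the total number of reference words, rather than by averaging per-sentence scores.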

As Table 2 shows, the performance trajectory of our model versions —all obtained by fine-tuning on successive versions of the training dataset (v1.0–v1.3)— illustrates the impact of progressively integrating targeted dialectal data. For comparability, all models, including the unfine-tuned baseline, are evaluated on the same held-out test set derived from version v1.3 of the training data (2h 50m, 24,442 tokens), which is kept fixed across all experiments. Notably, even the model trained on dataset version v1.0, despite its limited size and heavy reliance on the initial four hours of AlpiLinK data, achieves encouraging results, reaching a WER of 37%. This finding aligns with results reported for Swiss German by Hollenstein and Aepli (2014) and Samardžić et al. (2015), who show that modest amounts of variety-specific data can outperform substantially larger, out-of-domain Standard German corpora in the training of language processing tools. The transition from v1.0 to v1.1, which increased training data from roughly 6.5 to 9 hours, yielded the single largest performance gain, reducing WER from 37% to 27% (a 27% relative improvement). This jump coincides with the introduction of archival interviews and additional purpose-recorded data, which substantially diversified the speaker pool and exposed the model to more varied spontaneous speech. While AlpiLinK provided the foundational structured utterances necessary for initial training, this phase showed that transitioning toward more spontaneous material is key to broader model robustness. Performance continued to improve with v1.2, which achieved a WER of 24% as the inclusion of broadcast data and further subtitled content expanded the model’s domain coverage. Results between v1.2 and v1.3 remained largely stable, with the global WER holding at 24% and the BLEU score showing a marginal shift from 69.13 to 68.73. 
This plateau was anticipated, as v1.3 introduced only 38 new training samples, representing less than 10 minutes of additional audio. Rather than indicating a regression, these near-identical scores suggest that the model has effectively saturated on the currently available material, a phenomenon consistent with the logarithmic relationship between training volume and ASR accuracy documented in large-scale studies (Radford et al., 2023; Zhang et al., 2023). These findings signal that further substantial gains will require significantly larger data increments rather than minor additions.
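The relative improvements discussed above follow directly from the WER values in Table 2. The short sketch below reproduces the arithmetic; the helper name is hypothetical.

```python
# WER values as reported in Table 2.
WER_BY_MODEL = {
    "baseline": 0.46,
    "v1.0": 0.37,
    "v1.1": 0.27,
    "v1.2": 0.24,
    "v1.3": 0.24,
}


def relative_reduction(before, after):
    """Share of the earlier error rate that was removed."""
    return (before - after) / before


if __name__ == "__main__":
    models = list(WER_BY_MODEL)
    for prev, curr in zip(models, models[1:]):
        change = relative_reduction(WER_BY_MODEL[prev], WER_BY_MODEL[curr])
        print(f"{prev} -> {curr}: {change:.0%} relative WER reduction")
```

For the step from v1.0 to v1.1 this gives (0.37 − 0.27) / 0.37 ≈ 0.27, the 27% relative improvement cited in the text.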

Table 2

Training data evolution and ASR performance computed on the same held-out test set derived from version v1.3 of the training data. Model v1.2 yields the best overall performance.

MODEL	TRAINING TOKENS	TRAINING DURATION	WER ↓ (V1.3 TEST SET)	BLEU ↑ (V1.3 TEST SET)
baseline	n/a (no fine-tuning)		0.46	44.58
v1.0	51,474	6 h 39 m	0.37	0.52
v1.1	75,203	9 h 07 m	0.27	65.65
v1.2	87,298	10 h 16 m	0.24	69.13
v1.3	88,810	10 h 26 m	0.24	68.73

Qualitative feedback from users indicates that our model provides a robust foundation for transcription, significantly accelerating manual workflows despite occasional hallucinations. However, error analyses reveal recurrent difficulties with inflectional and lexical distinctions. In several cases, the model outputs common indicative or finite verb forms such as war (‘was’), gab (‘there was/were’) and sind (‘are’) instead of the grammatically required subjunctive or infinitive forms, namely wäre (‘would be’), gäbe (‘there would be’) and sein (‘to be’). Although such contrasts are linguistically subtle, their repeated occurrence points to systematic challenges in distinguishing between closely related verbal paradigms. More generally, the model sometimes fails to disambiguate lexical items that occur both in Standard German and in South Tyrolean dialect but differ in meaning. In such cases, it tends to favour acoustically plausible interpretations even when they are contextually inappropriate. For instance, the dialectal form wert, intended as a realisation of wird (‘will’), was transcribed as wert (‘worth’), resulting in a semantically incoherent output. Characteristic South Tyrolean forms such as sem (‘there/at that time’), homo (‘we have’), hon (‘I have’), kimmp (‘comes’) and man (‘mine’) are not always rendered accurately, though their overall recognition is generally satisfactory. The multilingual environment of South Tyrol is also reflected in model behaviour: foreign lexical items may be retained, translated or inconsistently transliterated. The model further exhibits weaknesses in formatting, particularly regarding punctuation consistency, an issue noted as common across various ASR platforms. Finally, the system shows difficulties with named entity recognition, specifically regarding local historical figures and regional geography.

5.2 Assessing AlpiLinK for ASR

This section constitutes the conceptual core of the paper, examining the transformation of a dialectological dataset into a machine learning training resource and the implications of this shift for data design and reuse.

This repurposing revealed several critical considerations. Firstly, because the audio was crowdsourced from the general public, the input quality varies significantly based on recording devices and environments. While this variability is a challenge for traditional linguistic analysis, for ASR fine-tuning it serves as a form of natural augmentation, forcing the model to learn features that are robust to real-world noise. However, extreme signal degradation in a number of files remains a hurdle, hindering the model’s ability to capture subtle dialectal phonemes.

Secondly, the questionnaire format naturally produces very short utterances. For Whisper and similar architectures, which thrive on 30-second context windows, these excerpts are suboptimal. The challenge for researchers lies in bridging this gap, for instance by concatenating related segments, to provide the richer contextual information necessary for higher translation accuracy.
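One simple way to plan such a concatenation is a greedy pass that merges consecutive short segments until the 30-second limit would be exceeded. The sketch below only plans the grouping from durations and reference texts; joining the audio itself is not shown. It assumes that the input segments are already "related" (for example, recorded by the same speaker), which is an assumption on our part rather than a procedure described in this paper.

```python
def pack_segments(segments, max_seconds=30.0):
    """Greedily merge consecutive (duration_s, text) segments
    into chunks whose total duration stays within max_seconds."""
    chunks = []
    texts, total = [], 0.0
    for duration, text in segments:
        if duration > max_seconds:
            raise ValueError(f"segment of {duration} s exceeds the limit")
        # Close the current chunk if this segment would overflow it.
        if texts and total + duration > max_seconds:
            chunks.append((total, " ".join(texts)))
            texts, total = [], 0.0
        texts.append(text)
        total += duration
    if texts:
        chunks.append((total, " ".join(texts)))
    return chunks


if __name__ == "__main__":
    # Hypothetical durations and placeholder texts.
    segments = [(12.0, "a"), (10.0, "b"), (9.0, "c"), (5.0, "d")]
    for duration, text in pack_segments(segments):
        print(duration, text)
```

Because segments are merged in order, the reference text of each chunk stays aligned with the order of the concatenated audio.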

Thirdly, the dataset’s non-commercial license further restricts its downstream reuse. This is particularly problematic given the current demand across both public and private sectors in South Tyrol for ASR technology capable of translating South Tyrolean dialect into Standard German.

Another practical challenge arose from the metadata changes introduced across versions; at one point, the delimiter symbol used in comma-separated value metadata files was modified, temporarily breaking our automated preprocessing pipeline. While seemingly minor, such structural shifts emphasise the need for metadata stability in large-scale machine learning applications.
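A preprocessing step can be made more tolerant of this kind of change by trying several candidate delimiters and accepting only the one under which the expected columns appear in the header. The sketch below illustrates the idea; the column names `file` and `speaker` are hypothetical and do not reflect the actual AlpiLinK metadata schema.

```python
import csv
import io


def read_metadata(text, required_columns, candidates=(",", ";", "\t")):
    """Parse delimited metadata, choosing the first candidate delimiter
    under which all required columns appear in the header row."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("metadata file is empty")
    for delimiter in candidates:
        header = next(csv.reader([lines[0]], delimiter=delimiter))
        if set(required_columns) <= set(header):
            reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
            return list(reader)
    raise ValueError(
        f"none of the delimiters {candidates!r} yields columns "
        f"{sorted(required_columns)}"
    )


if __name__ == "__main__":
    old_release = "file,speaker\nS01.flac,sp1\n"
    new_release = "file;speaker\nS01.flac;sp1\n"
    print(read_metadata(old_release, ["file", "speaker"]))
    print(read_metadata(new_release, ["file", "speaker"]))
```

Failing loudly when no delimiter matches turns a silent misparse into an explicit error at load time.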

Despite these issues, AlpiLinK’s detailed metadata and transparent changelogs make it an invaluable foundation. By moving the dataset from the domain of dialectology into the pipeline of machine learning, we have demonstrated how archival linguistic resources can be activated to meet contemporary technological demands.

6 Implications and recommendations

The following recommendations are derived from our experience repurposing an existing linguistic dataset for ASR rather than from the design of a dataset originally intended for machine learning. They target researchers creating comparable low-resource or dialectal datasets and aim to support their effective reuse in downstream applications.

Standardise and automate audio quality verification. Crowdsourced data often suffers from high acoustic variability. Beyond providing recording instructions, researchers should implement automated checks (e.g., SNR or clipping detection) during the collection phase. Flagging or excluding files with severe distortion ensures a more robust baseline for machine learning without necessitating manual review.
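Checks of this kind can be quite lightweight. The sketch below works on 16-bit PCM samples held in a plain list and flags clipping and low SNR. The SNR estimate is a deliberately crude heuristic (loudest versus quietest frames) that assumes each recording contains some silence. The 10 dB threshold mirrors the low-SNR criterion mentioned in Section 4.1.1, while the clipping threshold and all function names are assumptions made for illustration.

```python
import math


def clipping_ratio(samples, full_scale=32767, tolerance=1):
    """Fraction of samples at or within `tolerance` of the 16-bit limits."""
    if not samples:
        return 0.0
    limit = full_scale - tolerance
    return sum(1 for s in samples if abs(s) >= limit) / len(samples)


def crude_snr_db(samples, frame_len=400):
    """Rough SNR in dB: mean power of the loudest 10% of frames
    divided by the mean power of the quietest 10% of frames."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    powers = sorted(
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in starts
    )
    if len(powers) < 10:
        raise ValueError("recording too short for a frame-based estimate")
    k = max(1, len(powers) // 10)
    noise = sum(powers[:k]) / k
    signal = sum(powers[-k:]) / k
    if noise == 0:
        return math.inf
    return 10 * math.log10(signal / noise)


def quality_flags(samples, min_snr_db=10.0, max_clipping=0.001):
    """Return the list of quality problems detected in one recording."""
    flags = []
    if clipping_ratio(samples) > max_clipping:
        flags.append("clipping")
    if crude_snr_db(samples) < min_snr_db:
        flags.append("low_snr")
    return flags
```

Running such checks at upload time would let contributors re-record immediately, instead of problems surfacing only during model training.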

Optimise segment duration for modern architectures. While shorter utterances are common in linguistic questionnaires, modern ASR models benefit significantly from longer auditory context. Where feasible, researchers should encourage or deliberately elicit recordings of at least 10–15 seconds to provide the richer acoustic-linguistic information required for high-accuracy translation and transcription.

Align licensing with broader community needs. Non-commercial licenses can unintentionally hinder the adoption of regional language technologies. In many low-resource contexts, the demand for ASR solutions spans both public administration and private innovation. We recommend adopting more permissive open licenses (e.g., CC BY) or offering a dual-licensing model that accommodates both academic research and the development of practical tools for the local community.

Enforce metadata stability and versioning. Reproducible machine learning pipelines rely on consistent data structures. Small format changes, such as modifying delimiters or field names, can disrupt automated workflows. Researchers should treat metadata as a versioned schema, providing migration notes or advance warnings when structural changes are necessary to maintain compatibility across dataset releases.

Prioritise comprehensive metadata and transparent changelogs. Detailed documentation of speaker demographics, regional variants and recording conditions is a major asset for data reuse. This transparency allows researchers to track performance across specific sub-groups and adds significant value to the dataset as a living resource.

Secure explicit consent for artificial intelligence and machine learning reuse. Ethical and legally compliant data reuse today requires informed consent that explicitly covers artificial intelligence and machine learning applications. Clear language explaining that recordings may be used to train automated systems ensures transparency and allows researchers to employ the data confidently in downstream technological developments.

7 Conclusion

This paper has demonstrated how an existing dialectological resource can be systematically repurposed into a foundational dataset for ASR research, using South Tyrolean dialect speech as a case study. Our work highlights that while such resources require careful technical preparation, including precise segmentation, translation alignment and metadata stabilisation, they provide an invaluable baseline for low-resource dialectal ASR. The reuse of AlpiLinK shows that archival data can be repurposed to meet contemporary demands for automated transcription and translation. Ultimately, this research underscores the need for continued investment in high-quality, reusable dialectal datasets that can bridge the gap between traditional linguistics and machine learning.

Notes

[1] https://alpilink.it/ (last accessed: 6 May 2026).

[2] https://gitlab.inf.unibz.it/commul/speech-to-text/augusta_data (last accessed: 6 May 2026).

[3] https://huggingface.co/docs/transformers/v4.25.1/model_doc/whisper#transformers.WhisperConfig.max_target_positions (last accessed: 6 May 2026).

Acknowledgements

The authors would like to express their gratitude to Professor Stefan Rabanus (University of Verona), Principal Investigator of AlpiLinK, for his support of this work. Special thanks are also due to Simone Baratella for his technical assistance in extracting the acoustic features of the AlpiLinK dataset and to the anonymous reviewers for their valuable suggestions.

Author Contributions

Greta H. Franzini: conceptualization, data curation, funding acquisition, investigation, methodology, project administration, software, supervision, validation, writing – original draft.

Luca Ducceschi: conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, resources, software, supervision, validation.