Fine-Tuning South Tyrolean Dialect-to-Standard German ASR with AlpiLinK

SOURCE	TYPE	DURATION	SPEAKERS	AGE RANGE (YEARS)	PROVENANCE
Learner textbook	scripted	47 m	3	20–49	public
AlpiLinK	scripted	4 h 47 m	180	10–89	public
In-house recordings	scripted, spontaneous	3 h 42 m	3	20–49	in-house, contributed
Audiovisual archives	scripted, spontaneous	1 h 53 m	9	0–99	contributed
Promotional videos	spontaneous	1 h 03 m	86	20–79	public
TOTAL		13 h 16 m

Table 2

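Evolution of the training data and the resulting ASR performance. All models are evaluated on the same held-out test set, derived from version v1.3 of the training data. Model v1.2 yields the best overall performance: it matches v1.3 on WER (0.24) and reaches the highest BLEU (69.13).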

MODEL	TRAINING TOKENS	TRAINING DURATION	WER ↓ (V1.3 TEST SET)	BLEU ↑ (V1.3 TEST SET)
baseline	n/a (no fine-tuning)		0.46	44.58
v1.0	51,474	6 h 39 m	0.37	0.52
v1.1	75,203	9 h 07 m	0.27	65.65
v1.2	87,298	10 h 16 m	0.24	69.13
v1.3	88,810	10 h 26 m	0.24	68.73
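
To make the WER column of Table 2 concrete, the following is a minimal sketch of how a word error rate can be computed: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are illustrative, not taken from the source. The source also does not state which toolkit, text normalization, or aggregation (per utterance or over the whole corpus) was used, so whitespace tokenization on a single sentence pair is an assumption here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are whitespace-separated and compared verbatim
    (no lowercasing or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement of an error rate, as a fraction of `before`."""
    return (before - after) / before


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(word_error_rate("das ist ein Test", "das ist Test"))
    # Baseline WER vs. the best fine-tuned WER reported in Table 2.
    print(round(relative_reduction(0.46, 0.24), 3))
```

With the values from Table 2, `relative_reduction(0.46, 0.24)` is about 0.48, so models v1.2 and v1.3 reduce the baseline WER by roughly 48% in relative terms.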