Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance

Feb 7, 2025 · Reihaneh Amooie, Wietse De Vries, Yun Hao, Jelske Dijkstra, Matt Coler, Martijn Wieling
Abstract
Automatic Speech Recognition (ASR) performance for low-resource languages is still far behind that of higher-resource languages such as English, due to a lack of sufficient labeled data. State-of-the-art methods deploy self-supervised transfer learning, where a model pre-trained on large amounts of data is fine-tuned using little labeled data in a target low-resource language. In this paper, we present and examine a method for fine-tuning a self-supervised learning (SSL) based model in order to improve performance for Frisian and its regional dialects (Clay Frisian, Wood Frisian, and South Frisian). We show that Frisian ASR performance can be improved by using multilingual (Frisian, Dutch, English and German) fine-tuning data and an auxiliary language identification task. In addition, our findings show that performance on dialectal speech suffers substantially and, importantly, that this effect is moderated by the elicitation approach used to collect the dialectal data. Our findings also suggest that relying solely on standard language data for ASR evaluation may overestimate real-world performance, particularly in languages with substantial dialectal variation.
Type
Publication
arXiv:2502.04883 [cs.CL]

This paper addresses the significant challenge of developing effective automatic speech recognition (ASR) systems for low-resource languages, specifically focusing on Frisian and its regional dialects. We demonstrate that combining multilingual fine-tuning with an auxiliary language identification task substantially improves ASR performance for Frisian.
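
This page does not say how the transcription objective and the auxiliary language identification (LID) objective are combined. One common pattern in multi-task fine-tuning is a weighted sum of the two losses, and the minimal sketch below illustrates only that pattern. The function names, the `lid_weight` hyperparameter, and its default value are assumptions made for illustration and are not taken from the paper.

```python
import math

# Languages named in the abstract as fine-tuning data.
LANGUAGES = ["Frisian", "Dutch", "English", "German"]


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def joint_loss(asr_loss, lid_logits, language, lid_weight=0.1):
    """Weighted sum of an ASR loss and an auxiliary LID loss (hypothetical helper).

    asr_loss:   loss from the transcription head, computed elsewhere.
    lid_logits: one score per entry in LANGUAGES for this utterance.
    lid_weight: assumed trade-off hyperparameter, not a value from the paper.
    """
    lid_loss = cross_entropy(lid_logits, LANGUAGES.index(language))
    return asr_loss + lid_weight * lid_loss


# Usage: a Dutch utterance whose LID head is undecided between the four languages.
total = joint_loss(asr_loss=2.0, lid_logits=[0.0, 0.0, 0.0, 0.0], language="Dutch")
```

With `lid_weight` set to zero the objective reduces to plain ASR fine-tuning, so the weight controls how strongly the language label shapes the shared representation.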

Our research reveals important insights about dialectal speech recognition:

  1. ASR performance is substantially worse on dialectal speech than on the standard variety
  2. The method of data elicitation (scripted vs. spontaneous) has a substantial impact on recognition accuracy
  3. Evaluations based solely on standard language data likely overestimate real-world ASR performance in linguistically diverse settings
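
The three points above imply reporting an error rate per test subset rather than one pooled figure. The sketch below shows one way to do that, assuming word error rate (WER) as the metric, which this page does not name. The subset labels and the example sentences are made-up placeholders, not data or results from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,         # insertion: extra word in hypothesis
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer_by_subset(samples):
    """Corpus-level WER per subset.

    samples: iterable of (subset, reference, hypothesis) string triples.
    """
    errors, words = {}, {}
    for subset, ref, hyp in samples:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors[subset] = errors.get(subset, 0) + edit_distance(ref_tokens, hyp_tokens)
        words[subset] = words.get(subset, 0) + len(ref_tokens)
    return {s: errors[s] / words[s] for s in errors if words[s] > 0}


# Placeholder strings for illustration only; hypothetical subset names.
samples = [
    ("standard", "the cat sat down", "the cat sat down"),
    ("dialect_scripted", "the cat sat down", "the cat sat"),
    ("dialect_spontaneous", "the cat sat down", "a cat sit"),
]
scores = wer_by_subset(samples)
```

Keeping the subsets separate makes the gap between standard and dialectal speech, and between elicitation methods, visible instead of averaging it away.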

These findings have important implications for the development of inclusive speech technology that can serve all speakers of a language, regardless of dialectal background. Our approach offers a promising direction for improving ASR for other low-resource languages with significant dialectal variation.