Today, businesses increasingly use Automatic Speech Recognition (ASR) technology to improve efficiency and productivity across many business functions. The rise of online meetings in remote working and learning environments after the COVID-19 pandemic has further increased the use of speech recognition systems, underscoring their significance. While languages such as English, Spanish, and French have abundant labeled data, very little labeled data exists for Turkish, which directly degrades the accuracy of Turkish ASR systems. This study therefore exploits unlabeled audio data by learning general data representations through self-supervised, end-to-end modeling. A transformer-based machine learning model, with performance improved through transfer learning, was employed to convert speech recordings to text. The model adopted in this study is the Wav2Vec 2.0 architecture, which masks portions of the audio input and learns to solve the resulting prediction task. The XLSR-Wav2Vec 2.0 model, pre-trained on speech data in 53 languages, was fine-tuned on the Mozilla Common Voice Turkish data set. According to the empirical results obtained within the scope of the study, a word error rate of 0.23 was reached on the test set of the same data set.
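The word error rate (WER) reported above is the standard ASR evaluation metric: the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words. A minimal sketch of its computation is shown below, assuming whitespace-tokenized transcripts; the function name `wer` and the example sentences are illustrative, not taken from the study's data.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to transform ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions only
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("bugün hava çok güzel", "bugün hava güzel")` yields 0.25 (one deletion over four reference words); a WER of 0.23 means roughly one word in four or five is inserted, deleted, or substituted relative to the reference.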
Keywords: Wav2vec2, automatic speech recognition, speech-to-text transcription, natural language processing, transformer architecture
| Primary Language | English |
|---|---|
| Subjects | Software Engineering (Other) |
| Section | Research Article |
| Authors | |
| Publication Date | June 28, 2024 |
| Submission Date | August 6, 2023 |
| Published in Issue | Year 2024, Volume: 8, Issue: 1 |