THE EFFECT OF LANGUAGE MISMATCH ON TURKISH SPEAKER VERIFICATION

In this study, speaker recognition performance for Turkish speech is investigated when there is a mismatch between the background data and the evaluation data in terms of the spoken language. A Gaussian mixture model with universal background model classifier is employed, and mel-frequency cepstral coefficients are chosen as the speaker-specific features. Experiments on a Turkish database of 47 male and 26 female speakers show that speaker verification performance degrades dramatically when the language of the speech used to train the background model differs from the language used in the speaker verification experiments. For example, for male speakers an equal error rate of 1.73% is obtained when the background model is trained with Turkish speech data, whereas an equal error rate of 12.34% is obtained when it is trained with English speech.


INTRODUCTION
Speaker verification is the task of automatically authenticating a speaker's claimed identity using his/her voice sample (Hansen and Hasan, 2015). In recent years, automatic speaker verification systems have found their way into commercial use in real-time applications such as online banking and smart cars. However, speaker verification systems still have important challenges to address. Speech signals from a speaker carry information related to the transmission channel and to the speaker's emotion, age, accent and spoken language. Any mismatch in these dimensions between the training and test stages of a speaker verification system results in considerable performance degradation. In recent studies, research has mostly focused on compensating for the mismatch induced by transmission channels, and great improvements have been obtained with the sophisticated i-vector approach. However, variability or mismatch in the spoken language has been studied much less.
Spoken language mismatch can be considered a less important problem for text-dependent speaker verification, because in text-dependent verification a fixed phrase is chosen by the user and it can be in any language (Benesty et al., 1997). Similarly, in text-prompted applications, a phrase is prompted to the user and the prompted phrase can be in any language. However, in text-independent speaker verification, variability in the spoken language is an important problem and requires more attention.
In (Ma and Meng, 2004), a bilingual text-independent speaker recognition task was studied in which each speaker is trained using English data and tested with Chinese data. It was reported that the language mismatch between training and test data yields significant degradation, and to alleviate this degradation the authors proposed to model each speaker using both languages (Ma and Meng, 2004). Another solution for bilingual speaker recognition is to train two separate models for each target speaker, one with Spanish data and the other with English data (Akbacak and Hansen, 2007). During the recognition phase, a language detector is first used to detect the language of the test utterance in order to choose the correct speaker model (Akbacak and Hansen, 2007). However, both of these solutions require knowledge about the languages of the training and test utterances. In (Ma et al., 2007), the effects of device, language and environmental mismatches between the training and test data of a speaker recognition system are studied, and it was found that language mismatch (training each speaker on Chinese data and testing with English speech) causes a 288% performance degradation (the EER increases from 1.65% to 6.42%), whereas environmental mismatch yields a 162% degradation. A feature-level solution, combining standard mel-frequency cepstral coefficients (MFCCs) with prosodic features, was proposed in (Luengo et al., 2008) for multilingual speaker recognition, with Spanish and Basque used in the experiments. In a more recent study (Misra and Hansen, 2014), the performance of a state-of-the-art i-vector speaker recognition system was analyzed, and it was found that language mismatch significantly reduces the i-vector system's performance.
One of the fundamental problems in analyzing the effect of language mismatch on speaker recognition is the lack of speaker recognition databases containing utterances in different languages from the same target speaker. In addition, most speaker recognition studies carry out their investigations on the English language, owing to the existence of large English databases from NIST and the LDC: the annual NIST Speaker Recognition Evaluation provides large databases to researchers, so results and analyses are mostly reported on the NIST corpora. Although there are a few speaker recognition studies on the Turkish language (Büyük and Aslan, 2012a; Büyük and Aslan, 2012b), motivated by the lack of speech databases available for Turkish and the scarcity of studies reporting findings for Turkish, in this paper we analyze the effect of language and environmental mismatch on Turkish speaker verification; these are the preliminary results of an ongoing project. To this end, we propose an experimental setup using speakers of the Turkish language. The Gaussian mixture model with universal background model (GMM-UBM) method is used as the classifier for speaker verification, and the UBM is trained using English and Turkish data in order to investigate language mismatch in Turkish speaker verification. Although more sophisticated algorithms exist for speaker verification (e.g. GMM supervectors, joint factor analysis and i-vectors), we use the simple but efficient GMM-UBM method in the experiments because its performance on a Turkish speech database is unknown and it requires less data to train hyperparameters than the other methods. Another reason for selecting the GMM-UBM method is that most state-of-the-art techniques require a UBM trained in advance, yet the effect of the UBM training data is unknown for Turkish speaker verification; since the UBM is a prerequisite for those techniques and has a considerable impact on their performance, we study the Turkish speaker verification system using the GMM-UBM method. Our study differs from previous studies on the Turkish language in several respects. First, (Büyük and Aslan, 2012a; Büyük and Aslan, 2012b) consider text-dependent speaker recognition using Turkish speech data, whereas we study text-independent speaker verification. Second, to the best of our knowledge, this is the first study investigating the effect of database/language and recording condition variability on Turkish speaker recognition.
The remainder of the paper is organized as follows: in Section 2 we briefly explain the speaker verification task using the GMM-UBM method; the details of our experimental setup are given in Section 3; in Section 4 the results of our speaker verification experiments are provided; and finally, in Section 5, we discuss future work and state our conclusions.

SPEAKER VERIFICATION USING GMM-UBM

In the GMM-UBM approach, a universal background model $\lambda_{\text{UBM}}$ represents speech in general, and each target speaker model $\lambda_{\text{speaker}}$ is derived from it. Given the sequence of feature vectors $X = \{x_1, x_2, \ldots, x_T\}$ extracted from a test utterance $S$, the logarithmic likelihood ratio score is given by

$$\Lambda(X) = \log p(X \mid \lambda_{\text{speaker}}) - \log p(X \mid \lambda_{\text{UBM}}),$$

and the claimed identity is accepted if $\Lambda(X)$ exceeds a decision threshold.
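For completeness, the likelihood terms above have the standard GMM form. Since most of the original Section 2 text has not survived, the following is a reconstruction of the usual formulation rather than the authors' exact notation:

```latex
% Log-likelihood of utterance X = {x_1, ..., x_T} under a GMM
% \lambda = \{w_c, \mu_c, \Sigma_c\}_{c=1}^{C} with mixture weights w_c,
% means \mu_c and (typically diagonal) covariances \Sigma_c.
% In practice the score is often normalized by the number of frames T.
\log p(X \mid \lambda) = \sum_{t=1}^{T} \log \sum_{c=1}^{C}
    w_c \, \mathcal{N}(x_t ; \mu_c, \Sigma_c)
```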

EXPERIMENTAL SETUP
Speaker verification experiments are conducted on the TURTEL (TURkish TELephony) speech database, which consists of 57 male and 36 female speakers. Each speaker reads the same 15 phonetically balanced Turkish sentences, each sampled at 16 kHz and approximately 3 seconds long. After eliminating the non-speech portions of the speech signal with voice activity detection (VAD), the duration of each utterance is reduced to approximately 1.5 seconds. Histograms of the durations of the speech signals before and after VAD are shown in Figure 2. Each speaker model is trained using 5 randomly selected utterances of that speaker, and the remaining 10 utterances are used for verification. Mel-frequency cepstral coefficient (MFCC) features are extracted from Hamming-windowed speech frames of 20 ms with 10 ms overlap. The discrete Fourier transform (DFT) of each windowed speech frame is computed to obtain the power spectrum. The power spectrum is then processed through a mel filterbank consisting of 27 triangular filters on the mel scale. The logarithmic filterbank outputs are converted into MFCCs by taking the discrete cosine transform (DCT).
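A minimal sketch of this feature extraction pipeline is given below. The paper does not specify its VAD algorithm or implementation toolkit, so the simple energy-based VAD, the use of librosa/scipy, and the helper names energy_vad and extract_mfcc are our own illustrative choices:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def energy_vad(signal, frame_len=320, hop=160, threshold_db=30.0):
    """Crude energy-based VAD (assumption; the paper does not state its VAD):
    keep frames whose log energy is within `threshold_db` of the loudest frame."""
    frames = librosa.util.frame(signal, frame_length=frame_len, hop_length=hop).T
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    keep = energy_db > energy_db.max() - threshold_db
    return frames[keep]

def extract_mfcc(signal, sr=16000, n_mels=27, n_mfcc=20):
    """MFCC extraction following the steps in the text: 20 ms Hamming windows with
    a 10 ms shift -> DFT power spectrum -> 27-filter mel bank -> log -> DCT.
    The first coefficient corresponds to c0, which the paper evaluates separately."""
    n_fft = int(0.020 * sr)                      # 320 samples at 16 kHz
    hop = int(0.010 * sr)                        # 160 samples at 16 kHz
    frames = energy_vad(signal, frame_len=n_fft, hop=hop) * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(np.maximum(power @ mel_fb.T, 1e-10))
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```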

Figure 2: Duration of the speech signals before and after VAD.

Speaker verification performance is evaluated using a Gaussian mixture model with universal background model (GMM-UBM) classifier. In order to investigate the effect of language and recording condition mismatch between the UBM and the evaluation data, two different UBMs trained on different speech data are used in the experiments:

- TIMIT-UBM: The English microphone speech database TIMIT is used to train the UBM, which introduces both language and recording condition mismatch between the TURTEL database and the speech data used to train the UBM.
- Oracle-UBM: The UBM is trained using the speech signals of 10 male and 10 female speakers randomly selected from the TURTEL database. Since the language and the recording conditions of this setup exactly match the data used in the speaker recognition experiments, comparing the TIMIT-UBM results with the Oracle-UBM results helps to quantify the performance differences under language and recording condition mismatch.

Since 10 male and 10 female speakers are excluded from the TURTEL database to train the Oracle-UBM, the remaining 47 male and 26 female speakers are used in the speaker recognition experiments. In both cases, TIMIT-UBM and Oracle-UBM, gender-independent UBMs with different model orders (numbers of Gaussian components) are trained using 20 EM iterations. Target speaker models are created from the five training utterances of each speaker by maximum a-posteriori (MAP) adaptation of the UBM with a relevance factor of 8.
With the two UBM training cases above, we aim to compare the effects of language and recording condition mismatch on the performance. Since the TIMIT database consists of clean microphone speech collected from American speakers uttering English sentences, the TIMIT-UBM introduces both language and recording condition mismatch between the UBM and the evaluation data. The Oracle-UBM, in turn, exactly matches the evaluation data in terms of both the recording conditions and the language.
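A minimal sketch of this GMM-UBM recipe (diagonal-covariance UBM trained with 20 EM iterations, mean-only MAP adaptation with relevance factor 8, and log-likelihood-ratio scoring) is given below. The use of scikit-learn's GaussianMixture, the restriction to mean-only adaptation, and the helper names are our own assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=512):
    """Gender-independent UBM with diagonal covariances, trained with 20 EM iterations."""
    ubm = GaussianMixture(n_components=n_components, covariance_type='diag',
                          max_iter=20, random_state=0)
    ubm.fit(background_features)
    return ubm

def map_adapt_means(ubm, enrol_features, relevance_factor=8.0):
    """Classical mean-only MAP adaptation of the UBM towards a target speaker."""
    post = ubm.predict_proba(enrol_features)              # responsibilities, shape (T, C)
    n_c = post.sum(axis=0)                                 # soft counts per component
    ml_means = (post.T @ enrol_features) / np.maximum(n_c[:, None], 1e-10)
    alpha = (n_c / (n_c + relevance_factor))[:, None]      # adaptation coefficients
    return alpha * ml_means + (1.0 - alpha) * ubm.means_

def llr_score(ubm, adapted_means, test_features):
    """Average-frame log-likelihood ratio between the adapted speaker model and the UBM.
    Only the means differ, so we temporarily swap them inside the UBM object."""
    ubm_ll = ubm.score(test_features)
    original_means = ubm.means_
    ubm.means_ = adapted_means
    speaker_ll = ubm.score(test_features)
    ubm.means_ = original_means
    return speaker_ll - ubm_ll
```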
We use the equal error rate (EER) as the performance criterion in the speaker verification experiments. The EER is the operating point where the false alarm rate (P_fa) and the false rejection rate (P_miss) are equal. The EERs reported in Section 4 are computed using the Bosaris toolkit, which uses the convex hull of the receiver operating characteristic curve (ROCCH) (Bosaris Toolkit, 2010).
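As a rough illustration (and not the ROCCH-based computation of the Bosaris toolkit used in the paper), the EER can be estimated from lists of target and impostor trial scores as follows; the function name compute_eer and the use of scikit-learn's roc_curve are our own choices:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(target_scores, impostor_scores):
    """Estimate the EER: the operating point where P_fa and P_miss are equal.
    This simple threshold sweep only approximates the ROCCH-based EER."""
    scores = np.concatenate([target_scores, impostor_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(impostor_scores))])
    p_fa, tpr, _ = roc_curve(labels, scores)   # false-alarm rate over all thresholds
    p_miss = 1.0 - tpr                         # false-rejection (miss) rate
    idx = np.argmin(np.abs(p_miss - p_fa))     # point closest to P_fa = P_miss
    return 0.5 * (p_miss[idx] + p_fa[idx])

# Toy example: in practice the score lists come from all verification trials.
print(compute_eer(np.array([2.1, 1.7, 0.9, 2.5]), np.array([0.2, -0.5, 1.0, 0.1])))
```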

RESULTS
This section presents the results of the above experiments. All results are reported in terms of EER (%).

Effect of Number of Features
In the experiments, we first study the effect of the number of MFCC features on the performance. The EERs (%) for different numbers of MFCC features, for both male and female speakers and using the TIMIT-UBM and the Oracle-UBM, are shown in Figure 3. As expected, the Oracle-UBM yields smaller EERs than the TIMIT-UBM, because both the language and the recording conditions of the speech data used to train the Oracle-UBM exactly match the evaluation data. The TIMIT-UBM, in contrast, dramatically degrades the performance because of the mismatch between the UBM and the evaluation data. Using 20 MFCCs yields the smallest EER for both male and female speakers in the Oracle-UBM case, whereas the smallest EER is obtained with 18 MFCCs in the TIMIT-UBM case. Therefore, 20 MFCCs and 18 MFCCs are used in the remaining experiments for the Oracle-UBM and the TIMIT-UBM, respectively.

Effect of UBM Size
We next analyze the effect of the UBM size, i.e. the number of Gaussian components in the UBM. To this end, we vary the UBM size from 8 to 2048 components and search for the number of Gaussian components that minimizes the EER.

Effect of Feature Post-Processing
Next we compare the effect of feature post-processing on the speaker recognition performance.
To be more precise, we study the effect of the 0th MFCC coefficient (c0) and of the dynamic features (∆ and ∆∆), i.e. the first- and second-order derivatives of the MFCCs, in addition to the static MFCCs. The results are summarized in Table 3. We can see that including c0 in the feature set considerably improves the performance for male speakers in the TIMIT-UBM case (the EER drops from 12.34% to 8.82%). However, for female speakers the raw MFCCs yield the best performance. Appending the dynamic features not only fails to improve the performance but also increases the EERs. For the Oracle-UBM case, the raw features without any additions yield the best performance. This is probably because dynamic features generally help when there is session variability across the speech recordings, and the recordings in the TURTEL database do not exhibit this kind of variability.
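For reference, a minimal way to append the dynamic features evaluated above to a static MFCC matrix is sketched below. The paper does not specify how its deltas are computed; HTK/Kaldi-style regression windows are the more common choice, and the function name is our own:

```python
import numpy as np

def append_dynamic_features(mfcc):
    """Append delta and delta-delta features to a (n_frames, n_coeffs) MFCC matrix.
    np.gradient approximates the time derivative with central differences along
    the frame axis; a regression window over several frames is more standard."""
    delta = np.gradient(mfcc, axis=0)
    delta_delta = np.gradient(delta, axis=0)
    return np.hstack([mfcc, delta, delta_delta])
```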

CONCLUSION
In this paper, we studied the effect of language and recording condition mismatch on speaker verification using a Turkish speech database. There are studies addressing this challenge in the literature, but most of them report their findings on English speech corpora. The experiments carried out in this study showed that speaker verification performance dramatically degrades when there is a language and recording condition mismatch between the UBM and the evaluation data. It was also found that appending dynamic features does not improve the speaker recognition performance, regardless of the UBM training data. As future work, it would be interesting to analyze the effect of such mismatches using more recent state-of-the-art speaker modeling techniques, such as i-vectors with probabilistic linear discriminant analysis (PLDA).