INCREASING ROBUSTNESS OF I-VECTORS VIA MASKING: A CASE STUDY IN SYNTHETIC SPEECH DETECTION

Ensuring security in speaker recognition systems is crucial. In recent years, it has been demonstrated that spoofing attacks can fool these systems. To deal with this issue, spoofed speech detection systems have been developed. While these systems perform well, their effectiveness tends to degrade under noise. Traditional speech enhancement methods do not improve performance and can even make it worse. In this paper, the performance of a noise mask obtained via a convolutional neural network for reducing noise effects is investigated. The mask is used to suppress noisy regions of spectrograms in order to extract robust i-vectors. The proposed system is tested on the ASVspoof 2015 database with three different noise types and achieves superior performance compared to traditional systems. However, performance drops for noise types that are not encountered during the training phase.


INTRODUCTION
Speaker recognition refers to identifying individuals based on their voices by utilizing the physical differences in vocal production organs. In addition to these physical differences, each speaker has a unique speaking style, including a certain accent, rhythm, intonation style, pronunciation pattern, word choice, etc. (Kinnunen et al., 2010). Because the voice is unique to the individual, speaker recognition systems are used in various fields, including telephone banking (HSBC, 2017), e-commerce (Find Biometrics, 2018), and forensic science (Find Biometrics, 2018). As the range of applications grows, it becomes crucial to prevent attacks that malicious people could carry out on these systems.
Possible attack types on a speaker recognition system include synthesizing the speaker's voice, using software to transform the attacker's voice into the target person's voice, imitating the target speaker's voice, and replaying a pre-recorded voice of the target speaker ((Find Biometrics, 2018), (Hanilçi et al., 2016), (Gomez-Alanis et al., 2019)).
In recent years, initiatives such as ASVspoof have increased awareness and research in the field of spoofed speech detection ((Evans et al., 2013), (Alegre et al., 2013), (Sizov et al., 2015), (Wu et al., 2014), (Wu et al., 2015), (Dutoit et al., 2007)). In particular, systems that utilize deep learning algorithms can achieve highly successful results (Wang et al., 2021), (Jung et al., 2022). On the other hand, additive noise, which is one of the biggest problems in speech-related systems, reduces the success rate in spoofed speech detection (Hanilçi et al., 2016). There are limited studies on robust spoofed speech detection ((Hanilçi et al., 2016), (Gomez-Alanis et al., 2019), (Gomez-Alanis et al., 2018)). Specifically, the low performance of traditional speech enhancement methods (Wiener filter, spectral subtraction, etc.) makes the problem even more challenging (Hanilçi et al., 2016). However, much more successful results can be achieved with complex deep learning systems (Gomez-Alanis et al., 2019). These systems use different methods for feature extraction and classification than those used in traditional speaker recognition systems. Therefore, they are focused solely on the problem of noise and spoofed speech detection.
The Equal Error Rate (EER) is a typical statistic for measuring the performance of spoofed speech detection systems. It is defined as the error rate at the operating point where the false acceptance rate (FAR) equals the false rejection rate (FRR). This criterion shows the degree to which systems are able to discriminate between synthetic and real speech.
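The EER definition above can be illustrated with a short sketch. It assumes that higher scores mean "more likely genuine" and approximates the EER by the threshold at which FAR and FRR are closest; the function name `compute_eer` and the toy scores are hypothetical and not taken from the study.

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Approximate the EER; higher scores are assumed to mean 'genuine'."""
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([genuine, spoof]))
    # FAR: fraction of spoofed trials accepted (score >= threshold).
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # FRR: fraction of genuine trials rejected (score < threshold).
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Toy example: one genuine score falls below one spoofed score.
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

With only a handful of trials the FAR and FRR curves are coarse steps, so practical evaluations often interpolate between thresholds; that refinement is omitted here.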
Studies have indicated that the emergence of diverse speech synthesis (SS) and voice conversion (VC) methods has made speech-based biometric systems highly susceptible to spoofing attacks. According to (Diyopsi et al., 2017), this may result in a rise in false acceptance rates, making countermeasures against spoofing attacks necessary. EER is important in assessing the effectiveness of these measures, since a low EER value shows that a system can successfully distinguish between spoofed and genuine speech. (Hassan et al., 2021) suggests combining spectral features such as MFCC, GTCC, Spectral Flux, and Spectral Centroid to create a synthetic speech detector. The fused feature set, which is used to train a biLSTM to categorise the speech, attempts to capture differences between real and synthetic signals. Using the ASVspoof 2019 LA dataset, the system demonstrated efficacy in identifying voice conversion and synthetic speech attacks.
(Dipjyoti et al., 2015) presents new speech features for spoofing attack detection that are based on alternate frequency-warping methods. When tested on the ASVspoof 2015 corpus, the features, which are computed using formant-specific block transformation, perform better than previous methods in differentiating between natural and synthetic speech, achieving 0% equal error rates on a variety of spoofing attack tasks. (Nugroho et al., 2022) applies a Deep Neural Network (DNN) approach, achieving a model accuracy of 96.5%, precision of 97.3%, recall of 96.5%, and an F1 measure of 96.7%. The study underscores the robustness of DNNs in fake speech detection when processing extensive data.
A one-class learning anti-spoofing system is introduced by (Zhang et al., 2020) to identify unknown synthetic voice spoofing attacks. By compacting the bona fide speech representation and using an angular margin to separate spoofing attacks in the embedding space, the system achieves an EER of 2.19% on the ASVspoof 2019 Challenge, outperforming previous approaches.
The effectiveness of high-dimensional magnitude and phase-based features for detecting spoofed speech is examined in (Xiao et al., 2015). Advances in text-to-speech (TTS) and voice conversion (VC) technologies pose a serious threat to automatic speaker verification (ASV) systems. Using two magnitude-based and five phase-based features in combination with a multilayer perceptron, the study detected known spoofing attacks in the ASVspoof 2015 challenge with a low EER of 0.29%. With an EER of 5.23%, the detection performance for unknown spoofing types was less successful, underscoring the need for additional study to increase the method's generalizability to novel spoofing approaches.
The ASSERT system is revisited in "LARIHS ASSERT Reassessment for Logical Access ASVspoof 2021 Challenge" by (Benhafid et al., 2021), with an emphasis on improving the detection of logical access spoofing attacks. The improvements include thinner SENet backbones with new activation functions and the use of advanced features and loss functions. The benefit of these changes was demonstrated by the reevaluated system's 60% improvement in min-tDCF on the ASVspoof 2019 evaluation set, a considerable gain over the original system.
In (Dişken, 2023), a novel differential convolutional neural network generates finer noise masks based on directional changes of activations. These masks, combined with linear filterbank magnitudes, are fed into various spoof detection systems, including PLDA with x-vectors, ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in TDNN), and LCNN with LSTM layers. Experiments on the ASVspoof 2015 dataset show that the LCNN-LSTM network with noise masks achieves the best performance, with an average EER of 2.67% for known noise types and 3.10% for unknown noise types. The EER is 0.83% on clean ASVspoof 2015 data and 2.6% on ASVspoof 2019 data under logical access conditions.
The main motivation of this study is that, while traditional methods use two separate systems with different features for spoof detection and speaker verification, both tasks can be performed with a single system based on i-vectors (Dehak et al., 2011). Thus, the same i-vectors can serve both speaker recognition and spoofed speech detection. A mask obtained using a convolutional neural network (CNN) is utilized to reduce the effect of noise. This mask is applied to the noisy spectrogram, and then the conventional i-vector extraction steps are followed. To test the proposed system, the ASVspoof 2015 database and three different noise types (babble, white, car) are used, following previous studies in the literature ((Hanilçi et al., 2016), (Gomez-Alanis et al., 2018), (Gomez-Alanis et al., 2019)). The results show that the masking process increases the robustness of i-vector features and that, unlike traditional enhancement methods, it improves performance. There are few studies on this subject in the literature, and successful results have mostly been obtained with deep learning models that combine more than one system. The goal of this study is to enable the use of a single system instead of two different systems.

ROBUSTNESS VIA MASKING
The goal of masking is to improve robustness against noise by distinguishing the less reliable regions of the speech spectrum (more corrupted by noise) from the more reliable regions (less affected by noise). Previous studies have shown that applying classic SNR-based masks for spoofed speech detection yields the best results in noisy scenarios (Gomez-Alanis et al., 2019). In the proposed system, the CNN structure used in (Gomez-Alanis et al., 2019) is preferred due to its high performance. The mask creation network is shown in Figure 1. Noisy spectrogram segments of 31 frames, consisting of the central frame together with 15 frames to its left and 15 frames to its right, are used as inputs to the CNN structure. The output of the system (the last linear layer) estimates the signal-to-noise ratio (SNR) for the central frame. The sigmoid function is applied to these values to obtain mask values in the range 0-1. Here, 0 represents pure noise, and 1 represents pure speech information.
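A minimal sketch of how the 31-frame inputs and the final sigmoid could be prepared is given below. The handling of utterance boundaries is not described in the text, so the edge padding used here is an assumption, and the names `context_windows` and `sigmoid` are hypothetical.

```python
import numpy as np

def context_windows(spec, left=15, right=15):
    """Cut a (n_freq, n_frames) spectrogram into one window per frame.

    Returns an array of shape (n_frames, n_freq, left + 1 + right).
    Boundary frames are handled by repeating the first/last frame,
    which is an assumption made for this sketch.
    """
    padded = np.pad(spec, ((0, 0), (left, right)), mode="edge")
    width = left + 1 + right
    return np.stack([padded[:, t:t + width] for t in range(spec.shape[1])])

def sigmoid(x):
    """Map SNR-like network outputs to mask values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Example: a toy spectrogram with 4 frequency bins and 50 frames.
toy_spec = np.arange(200, dtype=float).reshape(4, 50)
windows = context_windows(toy_spec)   # shape (50, 4, 31)
```

Column 15 of each window is the central frame whose mask values the network would predict.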
The average noise shown in Figure 1 is calculated by averaging the first 10 frames of the corresponding noisy speech data. Typically, it is assumed that the first and last few frames of the speech signal contain only noise. Therefore, the trained CNN structure has an explicit noise reference instead of relying only on the spectrogram. During the training phase, the instantaneous SNR target presented to the CNN for each frame is calculated as follows:

\[ \mathrm{SNR}(t, f) = 20 \log_{10} \frac{|X(t, f)|}{|N(t, f)|} \qquad (1) \]

The notation (t, f) represents time-frequency partitions. X and N are the spectrograms of the clean speech and noise, respectively (generated using the short-time Fourier transform (STFT)). The values of the target masks to be used in the training phase are obtained using the adjustable sigmoid function given in Equation 2:

\[ M(t, f) = \frac{1}{1 + e^{-\alpha \left( \mathrm{SNR}(t, f) - \beta \right)}} \qquad (2) \]

Here, α controls the slope of the sigmoid, and β corresponds to the threshold commonly used to define Ideal Binary Masks (IBMs) (Wang et al., 2009). Combining these two equations yields Equation 3, which calculates the target mask values directly from the spectrograms:

\[ M(t, f) = \frac{|X(t, f)|^{\gamma}}{|X(t, f)|^{\gamma} + e^{\alpha \beta} \, |N(t, f)|^{\gamma}}, \qquad \gamma = \frac{20 \alpha}{\log(10)} \qquad (3) \]

where log denotes the natural logarithm.
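The target mask computation can be sketched numerically as follows. The sketch assumes the instantaneous SNR is 20·log10(|X|/|N|) on magnitude spectrograms and that the mask is a sigmoid with slope α and threshold β, which is the form implied by γ = 20α/log(10); the default values of `alpha` and `beta`, the `eps` safeguard, and all function names are hypothetical.

```python
import numpy as np

def instantaneous_snr_db(clean_mag, noise_mag, eps=1e-10):
    """Per-bin SNR in dB from clean and noise magnitude spectrograms."""
    return 20.0 * np.log10((clean_mag + eps) / (noise_mag + eps))

def target_mask_sigmoid(clean_mag, noise_mag, alpha=0.2, beta=0.0):
    """Target mask as an adjustable sigmoid of the instantaneous SNR."""
    snr = instantaneous_snr_db(clean_mag, noise_mag)
    return 1.0 / (1.0 + np.exp(-alpha * (snr - beta)))

def target_mask_closed_form(clean_mag, noise_mag, alpha=0.2, beta=0.0,
                            eps=1e-10):
    """Same mask written directly in terms of the spectrograms."""
    gamma = 20.0 * alpha / np.log(10.0)   # natural logarithm
    x = (clean_mag + eps) ** gamma
    n = (noise_mag + eps) ** gamma
    return x / (x + np.exp(alpha * beta) * n)

# Both forms agree on random magnitude spectrograms.
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 2.0, size=(8, 20))
N = rng.uniform(0.1, 2.0, size=(8, 20))
masks_agree = np.allclose(target_mask_sigmoid(X, N),
                          target_mask_closed_form(X, N))
```

When β = 0 and the clean and noise magnitudes are equal, both forms give a mask value of 0.5, which matches the reading of β as the SNR threshold of an ideal binary mask.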

I-VECTORS
I-vectors are low and fixed dimensional representations of variable length audio data. This property allows the application of various normalization techniques in a low-dimensional space (Varga et al., 1993). Following the traditional i-vector extraction steps given in (Sizov et al., 2015), the universal background model (UBM) and the total variability matrix T are trained. A speaker- and channel-dependent Gaussian mixture model (GMM) supervector can be defined as follows:

\[ \mathbf{M} = \mathbf{m} + \mathbf{T} \boldsymbol{\omega} \]

Here, m is the mean supervector taken from the UBM and ω is a random vector with a standard normal distribution. The i-vector is obtained by maximizing the posterior of ω for each audio file. Once the i-vector is extracted, various compensation and dimensionality reduction techniques such as within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and length normalization can be applied ((Dehak et al., 2011), (Delgado et al., 2018)). The features used for UBM and T matrix training are extracted from the audio data as Mel-frequency cepstral coefficients (MFCC) or constant-Q cepstral coefficients (CQCC). MFCC extraction typically involves filtering the magnitude spectrogram obtained with the STFT using triangular filters placed linearly on the Mel scale, taking the logarithm, and applying the discrete cosine transform.
CQCC is based on the constant-Q transform (CQT), which gives a variable resolution, providing greater frequency resolution at lower frequencies and greater temporal resolution at higher frequencies. CQCC usually performs better than MFCC for spoofed speech detection.
In the proposed study, robust i-vectors are created by extracting i-vectors from masked spectrograms in which the effect of noise is reduced. Apart from the masking process, all steps (MFCC extraction, UBM and T training) follow traditional methods. The obtained i-vectors are scored using classifiers such as cosine distance and probabilistic LDA (PLDA) to measure performance.
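As an illustration of cosine scoring, the sketch below compares a test i-vector against one model vector per class, for example the mean of the genuine and of the spoofed training i-vectors. The text does not specify this exact scoring rule, so it is an assumption, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_spoof_score(test_ivec, genuine_model, spoof_model):
    """Positive scores favour 'genuine', negative scores favour 'spoofed'.

    genuine_model and spoof_model are assumed to be class-level
    i-vectors, e.g. averages over the training i-vectors of each class.
    """
    return (cosine_similarity(test_ivec, genuine_model)
            - cosine_similarity(test_ivec, spoof_model))

# Toy 2-dimensional example with orthogonal class models.
genuine_model = np.array([1.0, 0.0])
spoof_model = np.array([0.0, 1.0])
score = cosine_spoof_score(np.array([2.0, 0.0]), genuine_model, spoof_model)
```

Because the cosine ignores vector length, this score depends only on the direction of the test i-vector, which is one reason length normalization pairs naturally with it.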

ASVSPOOF 2015
The proposed system was tested using the ASVspoof 2015 dataset (Wu et al., 2015). This dataset consists of non-overlapping training, development, and test subsets. The dataset includes spoofed speech generated by 10 different attack algorithms. Only five of them (S1-S5) are available in the training set, while the remaining five (S6-S10) appear only in the test set.

CNN PARAMETERS
The CNN structure used to obtain the mask was trained with a learning rate of 3e-4, and binary cross entropy was chosen as the learning criterion. Only the noisy versions of the genuine speech recordings (3750 speech samples) were used as training data. Each recording was corrupted with a noise type selected at random from white, babble, and cafe noise, at a random SNR value between 0 dB and 20 dB. Thus, multi-condition training was performed to prevent the system from focusing on a single noise type and level.
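The corruption step can be sketched as below. The noise is scaled so that the ratio of speech power to noise power matches the requested SNR; tiling short noise recordings and drawing the SNR uniformly are assumptions of this sketch, and `mix_at_snr`, `corrupt_randomly`, and the toy noise bank are hypothetical names and data.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at the requested SNR (in dB)."""
    # Repeat and crop the noise so that it matches the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that speech_power / scaled_noise_power = 10^(snr/10).
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

def corrupt_randomly(speech, noise_bank, rng, snr_range=(0.0, 20.0)):
    """Pick a random noise type and a random SNR within snr_range."""
    name = str(rng.choice(sorted(noise_bank)))
    snr_db = float(rng.uniform(*snr_range))
    return mix_at_snr(speech, noise_bank[name], snr_db), name, snr_db

# Toy example with synthetic signals standing in for real recordings.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(1000) / 16000)
noise_bank = {"white": rng.standard_normal(300),
              "babble": rng.standard_normal(300),
              "cafe": rng.standard_normal(300)}
noisy, noise_name, snr_db = corrupt_randomly(speech, noise_bank, rng)
```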
The depicted CNN architecture processes a noisy spectrogram through multiple layers to enhance speech. Initially, the input spectrogram, which spans a series of frames, is fed into the first convolutional layer, which extracts basic features such as edges and patterns indicative of noise or speech characteristics. Subsequently, a pooling layer reduces the dimensionality of the feature map, emphasizing the most salient features and making the network less sensitive to the exact position of features within the frames. A second convolutional layer then detects more complex features by combining the simpler patterns identified earlier. This is again followed by a pooling layer, which further condenses the data and prepares it for the final layers.
The last part of the network consists of fully connected layers culminating in a linear layer that computes the SNR values for the central frame. These SNR values are passed through a sigmoid function, resulting in a mask with values between 0 (noise) and 1 (speech). An average noise reference, derived from the first, noise-dominated frames, informs the network what the noise looks like. This reference improves the network's ability to differentiate between noise and speech. The output mask from the CNN is then used to clean up the noisy input, and a speech activity detection system removes any remaining silent or noise-heavy frames, ensuring a clear speech output. The overall structure is shown in Figure 3.
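The post-network steps described above can be sketched as follows: averaging the first frames to form the noise reference, multiplying the mask with the noisy spectrogram, and discarding low-energy frames. The speech activity detector is not specified in detail, so the energy rule with a 30 dB margin below the loudest frame is an assumption, and all function names are hypothetical.

```python
import numpy as np

def noise_reference(noisy_spec, n_frames=10):
    """Average the first frames, assumed to contain only noise."""
    return noisy_spec[:, :n_frames].mean(axis=1)

def apply_mask(noisy_spec, mask):
    """Element-wise product of the mask and the noisy spectrogram."""
    return noisy_spec * mask

def energy_sad(spec, margin_db=30.0, eps=1e-10):
    """Keep frames whose energy is within margin_db of the loudest frame."""
    frame_energy_db = 10.0 * np.log10((spec ** 2).sum(axis=0) + eps)
    return frame_energy_db > frame_energy_db.max() - margin_db

# Toy spectrogram (n_freq=4, n_frames=6): two quiet frames, four loud ones.
toy_spec = np.full((4, 6), 1.0)
toy_spec[:, :2] = 0.001
toy_mask = np.ones_like(toy_spec)          # stand-in for the CNN output
masked = apply_mask(toy_spec, toy_mask)
keep = energy_sad(masked)                  # boolean flag per frame
speech_frames = masked[:, keep]
```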

I-VECTOR PARAMETERS
The first step in i-vector extraction is obtaining MFCC features. For this, the audio signal is divided into frames of 25 ms length with a frame step of 10 ms. The windowed frames are transformed with a 512-point FFT. The filter bank consists of 32 triangular filters. After the discrete cosine transform, 32 coefficients are used. In addition, delta and delta-delta features are appended to obtain 96-dimensional features per frame.
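The MFCC configuration above can be sketched with NumPy alone. The 16 kHz sampling rate, the Hamming window, the Mel formula, and the gradient-based delta computation are assumptions of this sketch rather than details given in the text, and the function names are hypothetical.

```python
import numpy as np

def mel_filterbank(n_filters=32, n_fft=512, sr=16000):
    """Triangular filters spaced linearly on the Mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, mid):                      # rising slope
            fbank[i - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):                      # falling slope
            fbank[i - 1, k] = (hi - k) / max(hi - mid, 1)
    return fbank

def mfcc_with_deltas(signal, sr=16000, frame_ms=25, step_ms=10,
                     n_fft=512, n_filters=32, n_ceps=32):
    """Return an (n_frames, 3 * n_ceps) matrix: MFCC, delta, delta-delta."""
    frame_len = int(sr * frame_ms / 1000)             # 400 samples at 16 kHz
    step = int(sr * step_ms / 1000)                   # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // step
    idx = np.arange(frame_len)[None, :] + step * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Magnitude spectrogram; a noise mask would be applied at this point.
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    log_mel = np.log(mag @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # DCT-II basis built by hand to avoid extra dependencies.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None]
                 * (2 * n[None, :] + 1) / (2 * n_filters))
    ceps = log_mel @ dct.T
    delta = np.gradient(ceps, axis=0)                 # simple time derivative
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([ceps, delta, delta2])

# One second of noise stands in for a speech recording.
features = mfcc_with_deltas(np.random.default_rng(0).standard_normal(16000))
```

With these settings, each frame yields 32 static coefficients plus 32 delta and 32 delta-delta values, giving the 96 dimensions stated above.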
For CQCC features, the CQT is applied with a maximum frequency of Fmax = 8 kHz. The number of octaves is 9. The number of bins per octave B is set to 96. Re-sampling is applied with a sampling period of 16 bins in the first octave. The resulting feature vectors are of dimension 19, excluding the 0th coefficient (29 coefficients plus the 0th coefficient for the original system).
The UBM consists of 512 Gaussian components and is trained only on real speakers in the training data. The 600-dimensional T matrix is trained using the entire training data (Hanilçi, 2018). After obtaining i-vectors, whitening, WCCN, and length normalization are applied. The process is shown in Figure 2.
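The three post-processing steps can be sketched in NumPy as follows. The eigen-decomposition whitening, the small regularization terms, and the Cholesky-based WCCN projection are common choices assumed here, not details confirmed by the text; the function names and the toy data are hypothetical.

```python
import numpy as np

def fit_whitening(train):
    """Return (mean, W) so that (x - mean) @ W has identity covariance."""
    mean = train.mean(axis=0)
    cov = np.cov(train - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    return mean, eigvec / np.sqrt(eigval + 1e-10)

def fit_wccn(train, labels):
    """Return B with B @ B.T equal to the inverse within-class covariance."""
    dim = train.shape[1]
    within = np.zeros((dim, dim))
    classes = np.unique(labels)
    for c in classes:
        xc = train[labels == c]
        xc = xc - xc.mean(axis=0)
        within += xc.T @ xc / len(xc)
    within /= len(classes)
    # A small ridge keeps the inverse stable (an assumption of this sketch).
    return np.linalg.cholesky(np.linalg.inv(within + 1e-6 * np.eye(dim)))

def length_normalize(x):
    """Scale every row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy data: 4 classes, 50 five-dimensional "i-vectors" each.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 50)
mixing = np.eye(5) + 0.5 * np.ones((5, 5))
ivectors = rng.standard_normal((200, 5)) @ mixing + labels[:, None]

mean, W = fit_whitening(ivectors)
whitened = (ivectors - mean) @ W
B = fit_wccn(whitened, labels)
processed = length_normalize(whitened @ B)
```

In practice the whitening and WCCN parameters would be estimated on training i-vectors and then applied unchanged to development and test i-vectors.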

RESULTS
Table 2 shows the performance of the proposed methods on the development data. The EER value represents the average of the EER values for the five different attacks in the development set. For comparison, the results of the MFCC-based i-vector system without a mask (Hanilçi et al., 2016) are also included in the table. As can be seen, i-vectors enhanced with masks provide significantly better performance than classical i-vectors.
As seen in Table 2, among the noise types encountered during training, the highest relative improvement for the MFCC-based i-vector system, over 50%, was achieved with the cosine distance classifier for babble noise at 20 dB. The lowest improvement for the MFCC-based i-vector system, 22%, was observed with the PLDA classifier for car noise at 20 dB. The results indicate that masking contributes to robustness compared to the same system without a mask. Similar observations can be made for the CQCC-based system.

Table 2: EERs (%) for the development data
Table 3 and Table 4 show the results for the test data using the MFCC-based i-vector system. The noticeable issue here is the low performance for the car noise type. Generally, this type of noise is considered to be the least disruptive ((Hanilçi et al., 2016), (Gomez-Alanis et al., 2019)). However, the masking performance is affected by this noise type because it was not used in the training stage of the mask. As evidence, examples are given on the logarithmic scale in Figure 4. The noisy spectrograms on the top belong to signals corrupted with babble noise at 0 dB and car noise at 20 dB.

Table 3: EERs (%) for known attacks of evaluation data (Proposed System/MFCC based i-vector with mask)

Table 5 shows the performance of the proposed systems on the test data, together with that of the GMM method from (Hanilçi et al., 2016) for comparison. (Hanilçi et al., 2016) did not analyze the performance of i-vectors, since GMM outperformed them. The average EER values are calculated for known attacks (S1-S5) and unknown attacks (S6-S9). Attack type S10 is difficult to detect and reduces the system performance; therefore, it is not included in the average calculation, as in (Hanilçi et …).

The results indicate that the proposed approach is more effective at lower SNR levels. Also, the proposed system delivered its worst results for the unseen noise type (car). This result emphasizes the importance of creating balanced training data. More noise types are necessary to increase the generalization capacity of the network, as the noise type may not be known a priori in a practical spoof detection system.

CONCLUSION
This study proposed noise-mask-based robust i-vector extraction and examined its performance for noisy spoofed speech detection tasks. The results showed that the mask structure is successful in reducing noise effects. This finding reveals that mask structures can be useful in an area where traditional speech enhancement methods decrease performance (Hanilçi et al., 2016). The CNN-based mask, on the other hand, failed against a noise type that was not encountered in the training phase. This suggests that the diversity of noise in the training data needs to be increased to prevent memorization.
The i-vector method was chosen due to its high performance in speaker verification. With the proposed method, the same i-vectors can be used for both speaker verification and spoof detection in a robust manner. However, compared to the state-of-the-art systems for detecting synthetic speech under noise ((Gomez-Alanis et al., 2019), (Wang et al., 2021)), the proposed system was found to be far behind. One reason for this is the low performance of i-vectors on short audio recordings (Hanilçi, 2018); the average recording length in the ASVspoof 2015 dataset is 3.5 seconds. Another reason is the complexity of the other systems. For example, the system designed in (Gomez-Alanis et al., 2019) consists of two different feature types, a deep learning model for each feature, and an external classifier, in addition to the CNN-based noise mask. Therefore, even though the masking approach performs better than the traditional methods, advanced architectures are necessary for achieving state-of-the-art results.

In Figure 1, an example spectrogram at the input of the network and the corresponding mask at the output can be seen. The generated mask is multiplied with the noisy spectrogram to obtain a spectrogram with reduced noise. After this stage, a simple speech activity detection system (Kinnunen et al., 2010) is used to discard frames containing silence or noise.

Figure 4: Spectrograms of speech data corrupted with babble noise at 0 dB (upper-left) and car noise at 20 dB (upper-right), and their masks below

Table 1: ASVspoof 2015 data distribution
'Car', 'white', and 'babble' noises from the Noisex-92 database (Varga et al., 1993) and 'cafe' noise from the QUT-Noise database (Dean et al., 2015) were used, based on similar studies in the literature ((Hanilçi et al., 2016), (Gomez-Alanis et al., 2019)). White, babble, and cafe noises were used for mask training. Car noise was only used in the tests to analyze the system's performance under noise types that it had not previously encountered.
S2: A basic voice conversion method that adjusts only the first mel-cepstral coefficient to align the source spectrum slope with the target.
S3: A speech synthesis algorithm using a hidden Markov model-based system (HTS) with speaker adaptation techniques, requiring only 20 adaptation utterances.
S10: A speech synthesis algorithm executed with the MARY Text-To-Speech system (MaryTTS), an open-source platform for generating speech.