Speech Denoising using Common Vector Analysis in Frequency Domain

Signal denoising approaches for data of any dimension largely rely on the assumption that the signal and noise components are somewhat uncorrelated. However, any denoising process that depends heavily on this assumption falters when the signal component is adversely affected by the operation. Therefore, several proposed algorithms separate the data into two or more parts with varying noise levels, so that the denoising process can be applied to each part with different parameters and constraints. In this paper, the proposed method separates the speech data into magnitude and phase, where the magnitude part is further separated into common and difference parts using common vector analysis. It is assumed that the noise largely resides in the difference part, which is therefore denoised by a known algorithm. The speech data is reconstructed by combining the common, difference and phase parts. Using the Linear Minimum Mean Square Error Estimation algorithm on the difference part, excellent denoising results are obtained. Results are compared with those of the state of the art using well-known speech quality measures.


Introduction
For decades, as cognitive science penetrated automation systems, voice-automated applications such as voice-directed banking, voice signatures, intelligent homes and voice-recognizing mobile-phone apps have become both possible and popular. All voice applications require high-quality voice signals, mostly in digital form, which calls for voice denoising algorithms. All naturally collected signals carry some noise energy, whether from electronics and transmission or from unrelated background signals. Voice denoising aims to improve signal quality, voice intelligibility, or both, with minimal loss in signal energy. Although voice denoising methods can be divided into single-channel and multiple-channel algorithms, researchers have mostly focused on the single-channel case because it is encountered far more often. Spectral subtraction, noise estimation, Wiener deconvolution/filtering, statistical and subspace-based methods are considered mainly single-channel, notwithstanding the fact that they can be and are also employed in multichannel systems. In this paper, a method based on common vector analysis (CVA) is proposed; therefore, a brief methodology of other well-known subspace methods is given for comparison. Subspace methods rely on the expectation that the noisy data can be separated into two or more components within which noise can be handled more efficiently. A Singular Value Decomposition (SVD) based approach, proposed by Dendrinos et al. [1], uses the expectation that, after factorization of the data, noise energy concentrates in the vectors corresponding to the smaller singular values. In the simplest denoising approach these are zeroed and the voice data is recomposed. This technique was improved by Jensen et al. [2] for colored noise, which the former method had somewhat failed to reduce. On the other hand, their method, with its high computational complexity, had several constraints for controlling residual noise. Ephraim et al.
[3] aimed to optimize the estimator that minimizes the distortion caused by residual noise. The noisy signal is separated into noise and signal subspaces using the Karhunen-Loeve Transform (KLT), whereby the components in the noise subspace are zeroed and the signal subspace is restructured using a gain function. Components in the subspaces are then recombined through the inverse KLT to obtain the denoised signal. Mittal et al. [4] and Rezayee et al. [5] extended this work to colored noise. They obtained better results by employing different KLT matrices and by converging the covariance matrix of the noise vectors to a diagonal matrix, respectively. The Common Vector Approach (CVA) is a subspace method used in recognition applications. In CVA, the training data representing each subject to be discriminated are used to form its own class. In a speech recognition application, ambient noise and the ages and genders of speakers result in differences within a class [6]. CVA is based on the common component of the class members, obtained by removing these within-class differences. This component is called the common vector. It has been employed in speaker recognition [7], speech recognition [8]-[10], face recognition [11], fault detection in electrical motors [12] and spam e-mail detection [13]. CVA has also been used in image denoising [14].

Common Vector Approach
When the differences between the feature vectors in a class are removed, the remaining vector, which consists of the features invariant within the class, is called the common vector. A feature vector is then presumed to be the sum of a common and a difference component. If the number of feature vectors (m) is greater than the dimension (n) of the vectors, this is called the sufficient case (m > n).
In the sufficient case, the common vector is the mean vector. The insufficient case occurs when the vector dimension is larger than or equal to the number of vectors (n >= m), which is the case in most practical applications where, for example, few image blocks with many pixels each exist. In general, it covers the setups where the number of samples is less than the sample dimension. The projection matrix P is calculated using the eigenvectors u_j that correspond to zero-valued eigenvalues (these eigenvectors span the indifference subspace). The subspace methods other than CVA require the inverse of the covariance matrix [15]. However, n >= m inhibits the calculation of this inverse, whereas CVA does not have this problem.
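As a sketch of the insufficient-case decomposition described above (the helper name `cva_decompose` and the SVD-based construction are ours; the paper's exact formulation may differ), the common vector can be obtained by projecting any class vector onto the orthogonal complement of the difference subspace:

```python
import numpy as np

def cva_decompose(A):
    """Split class vectors into common + difference components (insufficient case).

    A: (n, m) matrix whose columns are the m class vectors, n >= m.
    Returns (a_com, D) where a_com is the common vector of the class and
    column i of D is the difference component of A[:, i].
    """
    mean = A.mean(axis=1, keepdims=True)
    B = A - mean                                 # mean-removed class vectors
    # Orthonormal basis of the difference subspace via thin SVD.
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    rank = int(np.sum(s > 1e-10 * s[0])) if s[0] > 0 else 0
    U = U[:, :rank]
    # Remove the projection onto the difference subspace: the remainder
    # is the component invariant within the class (the common vector).
    a_com = mean[:, 0] - U @ (U.T @ mean[:, 0])
    D = A - a_com[:, None]                       # per-vector difference parts
    return a_com, D
```

Projecting any single class vector instead of the mean yields the same common vector, which is the defining property of CVA.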
It is expected that the noise mainly resides within the difference components when it is uncorrelated between the class vectors. Therefore, it is imperative to construct classes as correctly as possible. When class information of the vectors is not available, classes should be constructed by collecting similar vectors into a data set matrix for each evaluated vector whose common vector is to be found. When the input data is a stream, or can be handled as a stream with bounds, classes can for example be constructed by searching for similar vectors within a reasonable time range. Since the raw vectors in speech data are selected to be sample frames of length n, the words vector and frame are used interchangeably within this paper.

Proposed Algorithm
The denoising algorithm proposed in this paper relies on the intuition that the spectral content of speech does not change abruptly, and that abrupt changes are mostly noise related. The approach is similar to the time averaging of Fast Fourier Transform (FFT) data in digital spectrum analyzers; in the proposed algorithm, however, the overlap ratio is the highest possible. As illustrated in Fig. 1, frames are picked from the original speech data stream by a sliding Hamming window of width n, w(k) = 0.54 - 0.46 cos(2*pi*k/(n-1)), which slides 1 sample for each subsequent frame. Although not required, it is logical to select n to correspond to approximately 4 ms of speech data, since the spectral characteristics of speech may change greatly over longer intervals. We conducted tests for determining the optimal frame length, as explained in the following section, and concluded that 4 ms is adequate.
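The maximally overlapped framing described above can be sketched as follows (a minimal illustration; the function name `sliding_frames` is ours):

```python
import numpy as np

def sliding_frames(x, n):
    """Length-n Hamming-windowed frames with a 1-sample hop (maximum overlap)."""
    w = np.hamming(n)  # w[k] = 0.54 - 0.46*cos(2*pi*k/(n-1))
    starts = np.arange(len(x) - n + 1)
    idx = starts[:, None] + np.arange(n)[None, :]
    return x[idx] * w  # shape: (num_frames, n)
```

At an 8 kHz sampling rate, n = 32 samples corresponds to 4 ms of speech.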
For each frame to be denoised, a class is constructed by picking the m most spectrally similar Hamming-windowed frames within its neighbourhood. Obviously, the entire data stream could be used for selecting the frames and constructing the class. In our experiments, it was determined that a neighbourhood containing 2n+1 frames, from which m frames of dimension n are selected, leads to an insufficient-case CVA with the largest possible number of vectors. We noticed that the removal of a few class outliers did not have a considerable effect on the results; therefore, while keeping that option open, an outlier removal operation is not performed, as it would increase complexity.
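Class construction by spectral similarity might look like the sketch below (assuming the magnitude spectra of the neighbourhood frames are precomputed; the name `build_class` is illustrative):

```python
import numpy as np

def build_class(mags, cur, m):
    """Pick the m magnitude frames nearest (Euclidean) to frame `cur`.

    mags: (num_frames, n) magnitude spectra of the neighbourhood frames.
    Returns the (n, m) class matrix A; column 0 is the current frame a_1,
    since the current frame has zero distance to itself.
    """
    d = np.linalg.norm(mags - mags[cur], axis=1)  # d_k for every candidate
    order = np.argsort(d, kind="stable")          # d[cur] == 0 sorts first
    return mags[order[:m]].T
```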
A CVA operation is performed on the set A via (a_com, a_1,dif, a_avg) = CVA(A, a_1) (7) as described in the previous section. After the difference frame is denoised to obtain the denoised difference frame ã_1,dif, the current denoised magnitude frame is reconstructed via ã_1 = a_com + ã_1,dif. The time domain speech frame is reconstructed by adding the phase information and applying the inverse FFT. After applying the described algorithm to each frame, the denoised time frames are combined to build the denoised speech data. Since the frames are overlapping, there are several options at the recombining stage. Simply adding them at the appropriate time locations is one of them; alternatively, a weighting window can be used to increase the weight of the center of each frame. In our experiments, we noted that simply adding the frames (a flat window) is sufficient and has the least complexity. It should be noted that there are several parameters that could be optimized for the best performance on the particular speech data to be denoised: m (the number of frames in a class), n (the frame size), the PCA parameters and the recombining options. However, since we intended an algorithm that requires no data-dependent optimization parameters, these optimizations were performed on a large training speech data set and the best overall parameter set was kept for all data.
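The reconstruction and recombination steps above can be sketched as follows. The flat-window recombination here averages each sample by its coverage count, which is one reasonable reading of "simply adding the frames" (the normalization choice is our assumption):

```python
import numpy as np

def reconstruct_frame(mag_denoised, phase):
    """Recombine a denoised magnitude spectrum with the untouched phase
    and return the time-domain frame via the inverse FFT."""
    return np.fft.irfft(mag_denoised * np.exp(1j * phase))

def recombine(frames):
    """Flat-window recombination of 1-sample-hop frames: each output
    sample is the average of all denoised frames that cover it."""
    num, n = frames.shape
    out = np.zeros(num + n - 1)
    cnt = np.zeros(num + n - 1)
    for i in range(num):
        out[i:i + n] += frames[i]
        cnt[i:i + n] += 1.0
    return out / cnt  # every sample is covered by at least one frame
```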

Experiments
For the experimental work on the proposed algorithm, the NOIZEUS (Hu and Loizou, 2007) speech database is used. NOIZEUS is composed of 30 English sentences spoken by 3 male and 3 female speakers. The recordings are sampled at 8 kHz with 16 bits and are approximately 2 seconds in length. 8 different noise types (airport, crowd, car, exhibition hall, restaurant, train station, street and train) are added to each speech recording to obtain 4 SNR levels (0 dB, 5 dB, 10 dB, 15 dB). The noise data itself is taken from the AURORA database. In addition, the database is extended by adding 4 levels of white noise to the data. Initial experiments were conducted to determine the best or reasonable parameter values for CVA; these are the frame size, the overlap ratio and the size of the neighbourhood from which the class member candidates are picked. Fig. 2 shows the performance graphs for various frame sizes and input noise levels. From these tests, it was determined that a frame size of 40 samples (corresponding to 4 ms) and the highest overlap ratio are adequate for both performance and complexity. Speech data with 9 different background noises added at 4 levels each was denoised using the proposed CVA algorithm with the previously determined parameters. The results are compared against 5 state-of-the-art methods found in the literature: 1. Perceptually motivated subspace algorithm [16], called sub from now on.
4. A variant of the minimum controlled recursive averaging algorithm [19], named rec.
5. Continuous spectral tracking [20], named spec in the following sections. The performance measures used in the comparisons are the Perceptual Evaluation of Speech Quality (PESQ), the Log-Likelihood Ratio (LLR) and the Euclidean Distance in the Cepstral Domain (CEP), the most used measures in the literature.
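The noisy test conditions (0, 5, 10 and 15 dB) can be generated by scaling a noise recording to a target SNR before mixing; a minimal sketch (function name ours):

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture clean + noise has the requested SNR in dB."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```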
1) Perceptual Evaluation of Speech Quality (PESQ)
The best PESQ values are marked with boldface characters. Each number in the table is the average of the PESQ values for 180 recordings (30 sentences spoken by 6 individuals).
2) Log-Likelihood Ratio (LLR)
In LLR, the distorted and denoised data are compared statistically [22]; it is defined as d_LLR = log( (a_d^T R_x a_d) / (a_x^T R_x a_x) ), where a_x and a_d are the LPC coefficient vectors of the distorted and denoised speech data, respectively.
R_x is the autocorrelation matrix of the distorted speech signal [23]. Lower LLR values mean a higher quality speech signal. In Table II, the LLR values for CVA and the other 5 methods are compared, with boldface indicating the best (lowest) LLR value for each test input. It is notable that CVA is superior to the compared methods, since it generated the lowest LLR for all background noise tests. However, for the white noise cases CVA failed to be the best, even though its scores are close. CVA performed best among all methods in all PESQ tests except 5, which indicates about 83% success. In the white noise cases, although not the best, CVA performed close to the best.
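Under the definitions above, LLR can be computed from LPC vectors. The sketch below derives the LPCs with a textbook autocorrelation/Levinson-Durbin recursion (our implementation, not the paper's code):

```python
import numpy as np

def lpc(x, order):
    """LPC vector [1, a1, ..., ap] via autocorrelation + Levinson-Durbin."""
    r = np.correlate(x, x, "full")[len(x) - 1:len(x) + order]
    a = np.array([1.0])
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:] @ r[i - 1:0:-1]  # prediction of lag-i correlation
        k = -acc / err                      # reflection coefficient
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]                 # order-update of the LPC vector
        err *= 1.0 - k * k
    return a

def llr(a_x, a_d, r_x):
    """log( a_d' R_x a_d / a_x' R_x a_x ), R_x Toeplitz built from lags r_x."""
    idx = np.arange(len(a_x))
    R = r_x[np.abs(idx[:, None] - idx[None, :])]
    return float(np.log((a_d @ R @ a_d) / (a_x @ R @ a_x)))
```

Because the LPC vector minimizes the quadratic form a^T R_x a subject to a[0] = 1, the LLR of a frame against its own LPCs is zero and any other LPC vector yields a positive value.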

3) Cepstral Distance Measure (CEP)
CEP too is based on LPCs and is defined as the distance between the LPC-derived cepstra of the original and enhanced speech frames, d_CEP = (10/ln 10) * sqrt( 2 * sum_{k=1}^{N} (c_x(k) - c_d(k))^2 ), where c_x and c_d are the cepstral coefficients of the original and enhanced frames and N is the dimension of the LPCs. Lower CEP values indicate higher speech quality [24]. As shown in Table III, the proposed CVA method is superior to all other denoising methods for all background noise types and levels applied in the tests, except for the white noise cases.
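The cepstral distance can be evaluated from LPC vectors using the standard LPC-to-cepstrum recursion; a sketch under the formula above (helper names ours):

```python
import numpy as np

def lpc_to_cepstrum(a, n_cep):
    """Cepstral coefficients c[1..n_cep] of 1/A(z) for a = [1, a1, ..., ap]."""
    p = len(a) - 1
    c = np.zeros(n_cep + 1)
    for n in range(1, n_cep + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]

def cep_distance(a_x, a_d, n_cep=10):
    """(10/ln 10) * sqrt(2 * sum((c_x - c_d)^2)) over n_cep coefficients."""
    diff = lpc_to_cepstrum(a_x, n_cep) - lpc_to_cepstrum(a_d, n_cep)
    return float(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(diff ** 2)))
```

The measure is symmetric in its two arguments and zero only when the LPC vectors describe the same spectral envelope.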

Conclusions
Tests conducted on 30 sentences spoken by 6 individuals, with 8 structured background noise recordings added at 4 levels (a total of 5760 recordings per method per quality measure), let us safely conclude that the proposed CVA method is superior to the other 5 methods. In the additional tests using white noise, on the other hand, CVA failed to be the best (a total of 720 recordings per method per quality measure). However, in most of the tests in which CVA was not the best, its scores were close to the best.
The eigenvectors with zero-valued eigenvalues span the indifference subspace B⊥, while the remaining eigenvectors span the difference subspace B, where B and B⊥ are orthogonal. The common vector of the class can be found by projecting any feature vector onto the indifference subspace. A neighbourhood of 2n+1 frames around the current frame x_i, including the current frame to be denoised, is sufficient both for the required number of class vectors and for reasonable computational complexity. The Fast Fourier Transform (FFT) is applied to these 2n+1 frames and their magnitude and phase components are separated. The m magnitude frames that are most similar to the magnitude frame of the current frame (the one to be denoised) are picked and the class is constructed with a total of m frames. The distances to the current magnitude frame are calculated using the Euclidean distance, d_k = sqrt( sum_i (b_cur,i - b_k,i)^2 ), where b_cur,i and b_k,i are the i-th dimension components of the current and k-th magnitude frames. The m frames with the smallest d_k are selected into the class member set A. Since b_cur has zero distance to itself, it is assigned index 1 and called a_1, as indicated in Fig. 1.
The average component a_avg should be added back after the denoising. The noise largely resides in the difference component (a_1,dif); therefore, the common component (a_com) is kept and the difference component of the current frame is denoised using a denoising algorithm that involves Principal Component Analysis (PCA) [15]. In fact, any denoising algorithm can be effectively used on a_1,dif, since a large portion of the signal energy is still in the common component, which is considered almost noise-free.
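The paper denoises a_1,dif with a PCA-involving algorithm [15] (LMMSE, per the abstract). As a stand-in illustration only, and not the cited estimator, a simple rank-truncation PCA shrink of stacked difference components might look like:

```python
import numpy as np

def pca_denoise(D, keep):
    """Keep only the `keep` leading principal components of D.

    D: (n, m) matrix of stacked difference components (one per column).
    A simple illustrative shrinkage, not the estimator cited in the text.
    """
    mean = D.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(D - mean, full_matrices=False)
    s[keep:] = 0.0  # discard the trailing (noise-dominated) components
    return mean + U @ (s[:, None] * Vt)
```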

Figure 1. Flow of the proposed algorithm.

Figure 2. Initial test results; input/output SNRs for different frame sizes.
The neighbourhood size tests, on the other hand, were inconclusive for widths greater than three frame sizes; it is seen that the algorithm becomes data dependent for larger search areas. In the following tests, we used W_N = 3N as the neighbourhood size, where N is the frame size in samples.

In the PESQ measure, the coefficient a_2 = -0.0309 is among the constants optimized for speech processed through telephone networks.