Compressing English Speech Data with Hybrid Methods without Data Loss

Understanding the mechanism of speech formation is of great importance for coding the speech signal successfully. Speech coding is also used in various applications, from authenticating audio files to linking a speech recording to the data acquisition device (e.g. a microphone). It is of vital importance in the acquisition, analysis and evaluation of sound, and in the investigation of criminal events in forensics. For the collection, processing, analysis, extraction and evaluation of speech or sounds recorded as audio files, which play an important role in crime detection, the audio must be compressed without data loss. Because much voice-changing software is available today, the number of recorded speech files and their correct interpretation play an important role in determining originality. Forensic work applies techniques such as signal processing, noise extraction and filtering to an incomprehensible speech recording in order to enhance the speech and make it comprehensible, and to determine whether the recording has been manipulated, whether it is original, and whether material has been added or removed. Where the sounds have been coded, the code must be decoded and the decoded sounds transcribed. In this study, we first describe what sound coding is, its purposes and areas of use, and the classification of sound coding according to various features and techniques. We then apply speech coding to English audio data. The dataset is a real dataset and consists of approximately 100000 voice recordings. Speech coding was performed with waveform, vocoder and hybrid methods, and the success of every method was measured on the system we built. Hybrid models gave more successful results than the others. The results obtained will set an example for our future work.


Introduction
Voice coders enable voice transmission in telecommunications and many other applications. However, voice coders reduce the bit rate and cause some important voice information to be lost. One study examined the effect of voice coders on two voice recognition systems in a reserved-word identification task [2]. In addition, the performance of each coder in speech recognition was measured with the signal-to-noise ratio (SNR). That study found that ADPCM (Adaptive Differential Pulse Code Modulation) voice coders performed worse than coders such as GSM and CELP (Code Excited Linear Predictive Coding).
Cox and Richard used low bit rate audio codecs for multimedia visual communication [3]. The aim was to provide good audio quality over the network. The codecs were originally designed for wireless applications but have also been used for multimedia communications. Three different audio coders recommended by the International Telecommunication Union (ITU), among them G.723.1 and G.729, have been used in different applications according to their low bit rate requirements (bit rate, delay, complexity and performance) [4]. When various networks are combined, transcoding between different audio coding formats is required. For this reason, Cox and Kim used two encoders to improve the transcoding ability of vocoders and in this way obtained the bit stream [5]. Lamane et al. worked on improving the performance of low bit rate vocoders with a 1.2 kb/s coder, using a multiband (MMBE) linear predictive coding (LPC) algorithm in their study [6]. Lefebure et al. designed voice coders for packet networks that improved the coding of the packets and increased the quality on the channels [7].
Nikolic and Peric developed an adaptive technique for speech coding algorithms by taking certain speech samples and applying the Lloyd-Max algorithm to the sampled voice signals [13]. The study achieved good performance for voice coding. Another study developed a new method based on the CELP coding system to provide good-quality sound. Adaptive codebooks were compared with random codebooks and the speech was characterized. Quantization and search procedures were developed for real-time operation [14].

Speech Coding
Audio has to be converted from analogue to digital form in order to be processed, and audio encoders accomplish this. Most encoding methods are developed for real-time transmission of audio over digital lines, so that the delay between the audio entering the encoder and the audio leaving the decoder is kept small. Two types of sounds are produced: voiced and unvoiced. The model of sound production is shown in Figure 1. For voiced sounds a periodic pulse train, and for unvoiced sounds a noise signal, is applied to the input of the filters. To code an audio signal, the bandwidth and bit rate of the signal must be determined. Complexity covers the computational load of the encoder, the amount of memory required, the power consumption, the price and the implementation environment. Quality, on the other hand, covers the intelligibility of the voice, the noise performance, the recognizability of the speaker and the communication efficiency.
Audio coding enables audio signals to be acquired, stored and transmitted efficiently over band-limited wired and wireless channels [1]. Audio coders now have an important place in telecommunications and multimedia. Commercial systems such as cellular communication, voice over Internet protocol (VoIP), electronic toys, video conferencing, computer games, multimedia applications, digital simultaneous voice and data, and archiving depend on audio coders. Audio coding reduces the size of the digital speech signal effectively and efficiently. However, a lot of important information can be lost during speech sampling and quantization.
An audio encoder usually converts digitized audio signals into encoded frames. An audio decoder receives the encoded frames and reconstructs the sound. In both the encoder and the decoder, the input and output audio signals are usually handled in a similar way. Speech coders differ in the perceptual quality of the synthesized audio, in delay, in bit rate and in complexity, for example the number of operations performed per second. Narrowband coding covers a bandwidth below 4 kHz and is the more widely used because of telephone channels (300-3600 Hz). Wideband coding covers a 7 kHz bandwidth and is commonly used in applications such as video conferencing.
The vocoder algorithms are shown in Table 1. Waveform coders encode the exact shape of the waveform of the speech signal. These coders have a high bit rate. Linear prediction coders (LPC) model the audio signal as the output of a linear system whose parameters are held constant over short time intervals. LPC vocoders provide extra information. LPC-based analysis-by-synthesis coders (LPC-AS) operate at bit rates between 4.8 and 16 kbps. Subband coders split speech into different frequency bands in the frequency domain according to their spectral characteristics; the signals are not modeled and the coding can be scaled. Moreover, these coders are also widely used in high-fidelity audio coding. The most important purpose of speech coding is to present the sound in digital form, represented by bits, with the quality and intelligibility that an application requires. Speech coding has an important place in voice over Internet protocol, digital cellular communication and public switched telephone networks (PSTN). Digital communication benefits from minimizing the bit rate in speech coding. However, reducing the bit rate while preserving sound intelligibility and quality leads to performance problems such as complexity, delay, bit errors and packet losses in many applications.
Audio coding techniques are classified in Table 1 according to their various features.

Waveform Coding
In waveform coding techniques, speech is reproduced in the time or frequency domain. In analysis-by-synthesis methods, linear prediction models are used and perceptual distortion measures are calculated in order to reproduce the characteristic features of the input speech. In subband coding, the speech is divided into frequency bands called subbands, each band is coded by a waveform coder or an analysis-by-synthesis method, and the bands are then recombined. In transform coding, a transform that increases the frequency resolution is applied to each frame of the input sound and the transform coefficients are quantized; the signal is reconstructed by the inverse transform. There are also speech coding structures such as scalable speech coding, bandwidth-scalable coding and diversity-based source coding.
Time-domain waveform coding methods include logarithmic pulse code modulation (log-PCM) and adaptive differential pulse code modulation (ADPCM). Both are widely used in applications. Log-PCM is used for coding in long-distance telephone networks at a rate of 64 kbps. It is a simple coding method and achieves the quality that other narrowband speech coders aim to provide. ADPCM operates at 32 kbps or less. It provides performance comparable to log-PCM because it uses a linear predictor to remove short-term redundancy from the speech signal before quantization. The most common form of ADPCM uses backward adaptation of the predictor and quantizer, so that the quantizer follows the waveform more closely. Backward adaptation means that, in both the encoder and the decoder, the predictor and quantizer are adapted from the reconstructed signal, so no quantizer or predictor parameters are sent with the quantized waveform values. Subtracting the predicted value from each input value reduces the range of the signal to be quantized, so the signal is reproduced well with fewer bits. Adding a longer-term predictor to the ADPCM structure gives the coder also known as the Adaptive Predictive Coder (APC). An important role of ADPCM is as a precursor of analysis-by-synthesis schemes. In analysis-by-synthesis methods, noise spectral shaping is carried out with perceptual weighting filters.
Waveform coders try to encode the exact shape of the waveform of audio signals without considering the details of human voice production and sound perception. These coders are widely used because they code both speech and non-speech signals successfully. In public switched telephone networks (PSTN), for example, they carry fax and modem transmissions. The most widely used waveform coding algorithms are 16-bit PCM, 8-bit PCM and ADPCM.
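The bit rates quoted above for log-PCM and ADPCM can be checked with a short calculation, and the logarithmic companding behind log-PCM can be sketched in a few lines. The following is a minimal illustration, not the coder used in this study. It assumes the standard telephone sampling rate of 8 kHz and the µ-law constant µ = 255, neither of which is stated in the text, and the function names are hypothetical.

```python
import math

MU = 255  # assumed companding constant (mu-law); not given in the text


def mu_compress(x, mu=MU):
    """Map a sample in [-1, 1] onto a logarithmic scale in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_expand(y, mu=MU):
    """Inverse of mu_compress: recover the linear-scale sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def bit_rate_kbps(sample_rate_hz, bits_per_sample):
    """Bit rate of a coder that sends a fixed number of bits per sample."""
    return sample_rate_hz * bits_per_sample / 1000
```

With these assumptions, `bit_rate_kbps(8000, 8)` gives the 64 kbps of log-PCM and `bit_rate_kbps(8000, 4)` gives the 32 kbps of ADPCM. Because `mu_compress` stretches small amplitudes, a uniform quantizer applied after it spends more of its levels on quiet speech.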

A. Pulse Code Modulation (PCM)
PCM is a waveform audio coding model used to transmit an analog signal. It is shown in Figure 2. It is performed in three steps:
• Sampling
• Quantizing
• Encoding

Figure 2. PCM structure
The output of PCM has two states, indicated by 1 and 0. All forms of analog data such as video, audio and music can be digitized in this way. PCM is easily applied to complex communication systems such as telephone networks. To derive PCM from an analog waveform, the signal is sampled at regular intervals and the amplitude of each sample is quantized to a discrete level. There are some limitations, for example the signal cannot be measured between samples. Quantization error is the difference between the original signal and the converted signal. The quantized samples are then converted into binary code words [18].
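The three PCM steps (sampling, quantizing, encoding) can be sketched as follows. This is a minimal illustration with a uniform quantizer and samples normalized to [-1, 1). The function names, the test tone and the word length are assumptions made for the example and do not come from the study.

```python
import math


def sample_sine(freq_hz, sample_rate_hz, count, amplitude=0.9):
    """Sampling: read a sine wave at regular intervals."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(count)]


def pcm_encode(samples, bits):
    """Quantizing and encoding: map each sample to a binary code word."""
    levels = 2 ** bits
    step = 2.0 / levels                      # width of one quantization interval
    codes = []
    for s in samples:
        index = int(math.floor((s + 1.0) / step))
        index = min(max(index, 0), levels - 1)   # clip out-of-range samples
        codes.append(format(index, "0{}b".format(bits)))
    return codes


def pcm_decode(codes, bits):
    """Rebuild each sample as the midpoint of its quantization interval."""
    step = 2.0 / (2 ** bits)
    return [-1.0 + (int(c, 2) + 0.5) * step for c in codes]
```

For samples that are not clipped, the quantization error of this sketch is at most half a step, so each extra bit per sample halves the largest error.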

B. Differential PCM (DPCM)
DPCM is successful on speech samples with high correlation. Output signals are obtained from the input signals according to different µ-functions. The correlation coefficient between neighbouring samples is 0.9. The difference between a sample and its predicted value is d(n), and d(n) is the difference (prediction error) signal that is passed to the quantizer. There are two different DPCM coders. The first is CVSD (continuously varying slope delta modulation). It performs poorly in quiet environments. However, in noisy environments it provides higher performance than LPC-based coders. The other DPCM coder is G.726, which operates at 16, 24, 32 or 40 kbps. G.726 is often used in fixed telephony at 32 kbps. It includes an adaptive second-order IIR predictor and an adaptive sixth-order FIR predictor. The filter coefficients are calculated with a gradient descent algorithm.
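The role of the difference signal d(n) can be made concrete with a small sketch. The code below is an illustrative first-order DPCM loop, not G.726 and not the coder evaluated in this study. The predictor coefficient 0.9 is borrowed from the correlation coefficient mentioned above as an assumption, and the fixed step size and function names are hypothetical.

```python
def quantize(value, step):
    """Uniform quantizer: integer level index nearest to value / step."""
    return int(round(value / step))


def dpcm_encode(samples, step, a=0.9):
    """Quantize the difference between each sample and its prediction."""
    codes = []
    prev = 0.0                           # previous reconstructed sample
    for s in samples:
        prediction = a * prev            # first-order linear prediction
        d = s - prediction               # difference signal d(n)
        code = quantize(d, step)
        codes.append(code)
        prev = prediction + code * step  # same value the decoder will rebuild
    return codes


def dpcm_decode(codes, step, a=0.9):
    """Rebuild the samples from the quantized differences."""
    out = []
    prev = 0.0
    for code in codes:
        prev = a * prev + code * step
        out.append(prev)
    return out
```

The encoder predicts from the reconstructed sample rather than from the original one, so the encoder and decoder stay in step and the quantization error does not accumulate.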

C. Adaptive Differential PCM (ADPCM)
ADPCM is a basic technique for compressing the audio signal. It is a variant of DPCM. Its main difference from DPCM is that the quantization step size is not fixed: ADPCM varies the step size to provide better performance, and it is therefore more complex. The ITU's ADPCM speech coding standards are G.721, G.723, G.726 and G.727. The difference between the standards is the bit rate.
ADPCM operates at 32 kbps and lower rates. It provides better performance than log-PCM because it uses a linear predictor to remove short-term redundancy from the speech signal before quantization. Subtracting the predicted value from each input value reduces the signal to be quantized, so fewer bits are needed to represent it. The most important advantage of ADPCM is that it can be used together with analysis-by-synthesis methods. It converts the analog signal by taking samples in the time domain and converts the sampled values into PCM, which is represented in binary form [19].
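The idea of a varying quantization step can be sketched with a simple backward-adaptive rule. The code below is an illustration only and does not follow the adaptation tables of G.726 or of any other standard. The multipliers 1.6 and 0.9, the step limits, the predictor coefficient and the function names are all assumptions chosen for the example.

```python
def adapt_step(step, code, max_code, min_step=1e-4, max_step=1.0):
    """Grow the step after a saturated code, shrink it otherwise."""
    factor = 1.6 if abs(code) == max_code else 0.9
    return min(max(step * factor, min_step), max_step)


def adpcm_encode(samples, bits=4, a=0.9, initial_step=0.01):
    """DPCM loop whose step size is adapted from the codes already sent."""
    max_code = 2 ** (bits - 1) - 1
    codes, prev, step = [], 0.0, initial_step
    for s in samples:
        prediction = a * prev
        code = int(round((s - prediction) / step))
        code = max(-max_code, min(max_code, code))   # limit to the word length
        codes.append(code)
        prev = prediction + code * step
        step = adapt_step(step, code, max_code)
    return codes


def adpcm_decode(codes, bits=4, a=0.9, initial_step=0.01):
    """Mirror of the encoder: the step size is recovered from the codes."""
    max_code = 2 ** (bits - 1) - 1
    out, prev, step = [], 0.0, initial_step
    for code in codes:
        prev = a * prev + code * step
        out.append(prev)
        step = adapt_step(step, code, max_code)
    return out
```

Since the step size is updated only from codes that the decoder also receives, no step-size information has to be transmitted, which is the backward adaptation described for ADPCM.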

D. Delta Modulation (DM)
DM is a subclass of the DPCM coding technique. It is a simplified version of DPCM with a 1-bit quantizer and is used in telephony applications. If its output is 0 the waveform decreases, and if it is 1 the waveform increases, so each bit shows the direction in which the signal changes. DM encodes the sign of the difference in signal amplitude. The input is compared with an approximation built up from delta steps, and the difference is quantized to produce the output. A positive difference produces a positive impulse and a negative difference produces a negative impulse [21].
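The one-bit behaviour of delta modulation can be sketched as follows. This is a minimal illustration with a fixed step size. The step value and function names are assumptions, and the adaptive-slope variant (CVSD) is not covered.

```python
def dm_encode(samples, delta):
    """Send 1 when the input is at or above the running approximation, else 0."""
    bits, approx = [], 0.0
    for s in samples:
        bit = 1 if s >= approx else 0
        approx += delta if bit else -delta   # staircase follows the input
        bits.append(bit)
    return bits


def dm_decode(bits, delta):
    """Rebuild the staircase approximation from the bit stream."""
    out, approx = [], 0.0
    for bit in bits:
        approx += delta if bit else -delta
        out.append(approx)
    return out
```

As long as the input changes by less than `delta` per sample, the staircase stays within one step of the input. A steeper input outruns the staircase, which is the slope overload that adaptive variants address.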

E. Subband Coding
These methods separate the input speech signal into subbands with bandpass filters. The sampling rate is reduced for the signal in each band so that the total number of coded samples stays minimal. If quantization noise is ignored, quadrature mirror filters (QMF) allow aliasing during filtering and subsampling, and this aliasing is cancelled in the decoder. In each band, a coder based on PCM, ADPCM or even an analysis-by-synthesis method is used. The advantage of subband coding is that each band can be coded differently and the coding error in each band can be controlled according to the perceptual characteristics of human hearing.
Transform methods were initially applied to images and then to audio. The basic principle is to process blocks of speech samples with a discrete unitary transform and to quantize and encode the resulting transform coefficients for transmission to the receiver. Low bit rates and good performance are achieved because a well-designed transform concentrates the signal in a few significant coefficients; the less significant coefficients do not need to be coded and are discarded. Although classical transform coding does not have a great effect on narrowband speech coding, subband coding, filtering and transform methods have a significant effect on high-quality coding. In subband coding, an analysis filter bank first divides the signal into frequency bands, then bits are allocated to each band according to certain characteristics. Since it is very difficult to obtain high-quality sound at low bit rates, these techniques are widely used for wideband audio coding in high bit rate coders. G.722 is a standard ADPCM coder that uses two subbands and codes a 7 kHz bandwidth at 64 kbps or less. For audio coding, 30 subbands are recommended. Subband coding does not use a model of sound production, so it is robust to noise and non-speech signals, and it provides high-quality compression. In particular, Tang provides a powerful framework for high-fidelity scalable and embedded audio coding [17]. Figure 3 shows the basic structure of the encoder.
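The band splitting and the cancellation of aliasing in the decoder can be illustrated with the shortest possible quadrature mirror filter pair, the two-tap Haar filters. This is a sketch under that assumption. Practical subband coders use longer filters and more bands, and the function names are hypothetical.

```python
import math


def analysis(x):
    """Split an even-length signal into low and high bands at half the rate."""
    r = math.sqrt(0.5)
    low = [(x[i] + x[i + 1]) * r for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) * r for i in range(0, len(x), 2)]
    return low, high


def synthesis(low, high):
    """Recombine the two bands; without quantization the input is recovered."""
    r = math.sqrt(0.5)
    x = []
    for l, h in zip(low, high):
        x.append((l + h) * r)
        x.append((l - h) * r)
    return x
```

Each band holds half as many samples as the input, so the total number of samples is unchanged. A coder would quantize `low` and `high` separately, for example giving more bits to the band that carries more perceptually important energy.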
Dynamic bit allocation and quantization are used to optimize the perceptual nature of the bit sequence. A subband spectral analysis technique has been developed to reduce the complexity in calculating the model.
The embedded bit stream serves a wide variety of services and resources, from high bit rate, high quality to low bit rate, low quality. Although subband coders are not widely used nowadays, they form the basis of new standards for wideband coding. Hybrid techniques have also been built on subband coding, because a subband coder scales more easily than the standard CELP technique. This scalability is important for wireless communication channels and for voice transmission over the Internet, where network congestion occurs.

Vocoders
These audio coding algorithms are derived from a single model. At the time of encoding, the parameters of the model are estimated from the input signal and transmitted as an encoded bit stream. Unlike waveform coders, vocoders do not preserve the shape of the original signal, and they perform poorly on non-speech signals. They represent the speech production mechanism with time-varying filter coefficients. They perform well at low bit rates, but the quality does not improve accordingly when the bit rate is increased, for example from 2 to 5 kbps.
A. Linear Predictive Coders (LPC)
Complex excitation models are mostly used to model the sound production organs. Once the spectral transfer function H(z) of the sound production organs has been expressed, the most important problem is to decide on the simplest excitation signal that will provide high-quality sound reproduction (Figure 4). LPC constitutes a standard model and is one of the most powerful speech analysis techniques. It provides good sound quality at low bit rates. It gives accurate estimates of the speech parameters and is computationally efficient. It predicts the value of the next sample as a linear combination of previous samples, and it minimizes the bit rate as much as possible. It analyzes the audio signal by predicting it, and it estimates the fundamental frequency of the sound by inverse filtering, which removes the formants [22].
B. Formant and Channel Vocoders
These vocoders separate the signal into its components in the frequency domain. The channel vocoder operates at low bit rates and uses a filter bank. It determines the different components of the signal by finding the pitch period of the speech and the speech excitation. A transfer model is applied to produce the vector of excitation parameters. In this way it is determined whether the speech is voiced or unvoiced. The best example is LPC audio coding performed in the frequency domain.
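The prediction of a sample from a linear combination of previous samples can be sketched with the autocorrelation method and the Levinson-Durbin recursion, a common way of estimating LPC coefficients. The text does not say which estimation method was used in this study, so the choice of method, the absence of windowing and the function names are assumptions.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of one frame of samples."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor coefficients.

    Returns (a, err), where the prediction of x[n] is
    a[0]*x[n-1] + a[1]*x[n-2] + ... and err is the residual energy.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a[1:], err


def lpc(frame, order):
    """Estimate LPC coefficients of the given order for one frame."""
    return levinson_durbin(autocorr(frame, order), order)
```

For a frame generated by a first-order decay such as `[0.9 ** n for n in range(200)]`, `lpc(frame, 2)` returns a first coefficient close to 0.9 and a second close to zero, because one previous sample already predicts the next one.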

Hybrid Coders
Hybrid coders combine waveform coders and vocoders. The parameters of the model are determined at the time of encoding, and the additional parameters of the model are optimized against the original waveform by measuring the weighted error of the signal. Hybrid coders operate at medium bit rates. They behave like waveform coders at high bit rates and like vocoders at low bit rates. At low bit rates, hybrid coders discard all detailed information about the excitation signal. The excitation signal for the sound production model is represented and this signal is quantized.

A. Code Excited LPC (CELP)
The codebook contains the excitation vectors. CELP uses both long-term and short-term prediction of the audio signal and provides a high degree of voice quality and intelligibility. The codebook is one of the most important factors in generating the excitation signal. For each segment, the encoder searches for the excitation vector whose synthesized signal best matches the speech segment, and the index of the best-matching codebook entry is sent to the receiver. CELP has given very successful quality results in conferencing without video. Long-term prediction removes redundancy from the speech signal by finding the fundamental pitch period. Successive excitation signals are found in the codebook. Short-term prediction predicts the next sample from previous samples and is carried out with LPC filters. The codebook search is performed in a closed loop with a weighted error, and vector quantization is applied [23].
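The closed-loop codebook search can be sketched as follows. This is a simplified illustration that omits the long-term (pitch) predictor and the perceptual weighting filter, so the error it minimizes is the plain squared error. The codebook contents, the filter coefficients and the function names are assumptions made for the example.

```python
def synthesize(excitation, a):
    """Pass an excitation vector through the all-pole LPC synthesis filter."""
    out = []
    for n, e in enumerate(excitation):
        y = e
        for k, coeff in enumerate(a, start=1):
            if n - k >= 0:
                y += coeff * out[n - k]
        out.append(y)
    return out


def search_codebook(target, codebook, a):
    """Return (index, gain) of the entry whose synthesized output fits best."""
    best_index, best_gain, best_error = None, 0.0, float("inf")
    for index, vector in enumerate(codebook):
        y = synthesize(vector, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue                          # an all-zero entry cannot match
        gain = sum(t * v for t, v in zip(target, y)) / energy  # optimal gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if error < best_error:
            best_index, best_gain, best_error = index, gain, error
    return best_index, best_gain
```

Only the index and the gain need to be transmitted, because the decoder holds the same codebook and synthesis filter and can regenerate the segment from them.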

B. Multipulse LPC (MPLPC)
MPLPC consists of a synthesis filter and an excitation generator. The synthesis filter models the short-term spectral envelope of speech, and its parameters are estimated from the original signal. The output of the excitation generator is applied to the synthesis filter, and the excitation is chosen to minimize the weighted error between the original signal and the synthesized signal. In the MPLPC algorithm, the shape vectors are impulses. The excitation U is usually represented as a weighted sum of 4 to 8 impulses per subframe. Jointly optimizing all impulses is usually infeasible, so the impulses of an MPLPC coder are optimized as follows. The response of H(z) to each candidate impulse is calculated. From these responses, the location of the first impulse is chosen as the one that best matches the target vector. Additional impulses are selected in the same way. Finally, all impulses are reoptimized.

C. Mixed-Excitation Linear Prediction (MELP)
The MELP coder was created by adding some features to LPC coders. The MELP model is shown in Figure 5. The LP coefficients are converted to line spectral frequencies (LSF), and a multistage vector quantizer (MSVQ) is used to quantize the LSF vectors. Each audio segment is coded with a total of 54 bits, which carry the LSF parameters, Fourier magnitudes, gain, pitch, bandpass voicing, a synchronization bit and an aperiodic flag bit. The Fourier magnitudes are encoded with an 8-bit VQ, and the corresponding codebook is searched with a weighted Euclidean distance. For unvoiced segments, the Fourier magnitudes, bandpass voicing and aperiodic flag bit are not sent. Instead, 13 bits of FEC (forward error correction) are sent. MELP operates at 2.4 kbps [25].
The purpose of audio coding is to present the sound in digital form with bits that provide quality and intelligibility for some applications. In audio coding, the aim is to obtain the highest quality sound while minimizing the bit rate. Errors, delays, complexity and packet losses that occur in Internet applications degrade performance. PCM, subband coding, CELP, LPC and DPCM are important audio coding algorithms.

Experimental Study
The flow chart of the proposed model is given in Figure 5. The aim of this study is to design a sound coding system based on the form and features of English speech. Approximately 100000 sound samples, consisting of English words and sentences of different lengths, were taken from 50 men and 50 women. Some samples contain a single word and others contain more than one word. It is a real dataset and is unique in the literature. Table 2 shows the success of the waveform speech coding methods. When the results were compared, the DPCM technique gave more successful results than the other waveform techniques. Table 3 shows the success of the vocoder speech coding methods. When the results were compared, the formant vocoder technique gave more successful results than the other vocoder techniques. Table 4 shows the success of the hybrid speech coding methods. When the results were compared, the CELP technique gave more successful results than the other hybrid techniques.
Hybrid models are more successful than other audio coding models. In addition, all techniques in hybrid models gave better results than other techniques.

Conclusion
The aim of this study is to improve the quality of the audio data obtained. It will thus set an example for studies that cover the basic issues of forensic sound informatics, such as speech enhancement, examination, identification from the voice, speaker recognition and determination of the spoken text. For this reason, a sample application that identifies the speaker has been built as an example for forensic sound informatics.
In the future, we aim to further increase the performance of our work with genetic and heuristic algorithms.