Gender Determination Using Voice Data

With the rapid advancement of technology, systems increasingly rely on voice-based functions such as speaker recognition and speech recognition to make interaction easier for their users. Organizations that deploy such systems need less manpower and can serve users faster. Decision making based on sound features is nevertheless very challenging. Gender recognition, one step in this process, makes it possible to address the user by gender. This study aims to determine gender from voice, both for forensic informatics and for fast and accurate processing. A dataset of 3168 male and female voice samples was used. The samples were first subjected to acoustic analysis in R using the seewave and tuneR packages. Artificial neural networks were used in the classification stage. To obtain a reliable estimate of classification accuracy, the dataset was divided into 10 parts, and each part in turn was held out from training and used for testing. The average classification success was found by taking the arithmetic mean of the results. With artificial neural networks, male and female voices could be distinguished from each other with a success rate of 97.9%.


Introduction
Speech recognition, which plays an important role in human-machine interaction, has recently come into frequent use. Factors such as environmental conditions, accent and diction can affect the success of voice recognition systems. The sound samples in a dataset, however, cannot all be collected from people with the same environmental conditions, accents and diction, which increases the burden on the voice recognition system. To overcome this problem, large datasets should be used and the number of attributes extracted from the sound samples should be kept at an optimum level [1]. Recognizing accent and gender from a voice is easy for humans, but identifying gender by computer is not. Many studies in the literature address this issue. Studies have been conducted on effective feature extraction and high-accuracy classification architectures to determine speaker gender from voice signals [2]. Voice-based gender recognition can be used in health information systems and education to serve people better [3]. In a voice recognition study on telephone applications, the vibration, noise ratio, sparkle and frequency properties of the sound were used, and recognition was performed with different techniques such as Bayesian networks [4]. In another study, artificial neural networks were used to estimate speaker age from sound data, but the attempt was unsuccessful because the sound samples showed similar characteristics [5]. With modifications to the Random Forest algorithm, a classification success rate of 96.7% was achieved in gender recognition from voice [6].
Gender recognition has been studied not only from voice but also from movements on the screens of touch-screen phones, where a success rate of 93.65% was achieved [7]. In a study on gender and age estimation with fully connected and convolutional neural networks using voice data collected from German speakers, the age recognition rate was 57.53% and the gender recognition rate was 88.8% [8]. Estimating the emotional state of a speaker is very challenging because it is influenced by many factors such as thought, mood, behavior and personality. Determining gender is a factor that increases prediction success in emotion recognition. There are also studies focusing on gender recognition with gradient boosting machines and a variant of the Random Forest algorithm [9].
In a study that combined classification and regression algorithms for gender recognition from voice, an ensemble method was built from the Support Vector Machine (SVM), Neural Network and Random Forest methods. This ensemble was more successful than any of the methods used individually in that study [10]. The methods used to extract features from sound data, and the selection of effective features among those extracted, are among the factors that affect classification success. Principal component analysis (PCA) is a method that reduces the number of features and produces new, effective features by making use of the similarities between features. In a study using PCA and the SVM algorithm, gender was estimated from voice data with 98.42% success [11]. Deep neural networks have recently attracted considerable attention as a method that increases classification success by detecting hidden features in data. A success rate of 96.74% was achieved in a gender recognition study using the deep neural network method [12].
In this study, a dataset containing 20 features and 1 label obtained from audio data was used. Classification was carried out with an artificial neural network. The second part of the article presents the material and method, the third part presents the experimental results, and the fourth part gives the discussion and conclusions.

Dataset
The dataset used in the study [13] had already undergone acoustic analysis, through which 20 sound features and 1 classification label were added to the dataset [14]. The dataset consists of 3168 rows and 21 columns, with 1584 male and 1584 female sound samples. The sound features obtained from preprocessing and their descriptions are shown in Table 1. The features may differ in how much they contribute to classification success, and success may increase when some of them are removed [15].

Confusion matrix
The evaluation of a classifier model is not based on success rate alone; other parameters are also required. The table needed to calculate these parameters is called a confusion matrix. It shows the category into which each sample in the dataset was classified. Various parameters are calculated from the values in this table, and they give information about the performance of the classifier. Table 2 describes the confusion matrix and the values it contains. From these values, the accuracy, precision, recall and F-1 score can be obtained. The purpose and formulation of these values are shown in Table 3.
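As a minimal illustration of how these values follow from a confusion matrix, the Python sketch below applies the standard definitions of accuracy, precision, recall and F-1 score. The function name `classification_metrics` and the returned dictionary layout are illustrative assumptions, not part of the study; the counts in the usage lines are those reported in the experimental results of this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard metrics from confusion-matrix counts.

    Hypothetical helper for illustration; assumes all denominators are non-zero.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total          # share of all samples classified correctly
    precision = tp / (tp + fp)            # share of predicted positives that are correct
    recall = tp / (tp + fn)               # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts reported in the experimental results of this study.
metrics = classification_metrics(tp=1554, tn=1547, fp=37, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With the reported counts, this gives approximately 0.979 accuracy, 0.977 precision, 0.981 recall and 0.979 F-1 score.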

Artificial Neural Network
Artificial neural networks (ANN) are systems that learn to make inferences from input and output data by imitating the cells and working principle of the human brain. Different approaches may be preferred for the ANN classification model depending on the data needed to make predictions in voice recognition systems. In an ANN, the training process is carried out first. During training, the neurons communicate with each other, and every neuron has weights that enable the network to learn. Determining these weights during training takes time. In the second stage, the test stage, however, the output for a given input is obtained much faster [16]. The model was trained with the ReLU (Rectified Linear Unit) activation function, the Adam optimizer, a learning rate of 0.0001 and 200 iterations. The artificial neural network model, which consists of 3 layers with 20 input, 100 hidden and 2 output neurons, is shown in Fig. 1.
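The structure described above can be sketched as a forward pass in plain Python. This is a minimal sketch under stated assumptions, not the implementation used in the study: the softmax output, the random weight initialization and all function names are assumptions introduced for illustration, and training with Adam is not shown. Only the layer sizes and the ReLU activation come from the text.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 20, 100, 2  # layer sizes stated in the text


def init_layer(n_in, n_out, rng):
    """Small random weights (n_out x n_in) and zero biases; the scheme is an assumption."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, weights, biases):
    """Weighted sum of the inputs plus bias for every neuron in the layer."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values):
    """Rectified Linear Unit: negative activations are set to zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn output scores into probabilities (output activation is an assumption)."""
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden, output):
    """Forward pass: 20 features -> 100 ReLU neurons -> 2 class probabilities."""
    h = relu(dense(x, *hidden))
    return softmax(dense(h, *output))


rng = random.Random(0)
hidden_layer = init_layer(N_INPUT, N_HIDDEN, rng)
output_layer = init_layer(N_HIDDEN, N_OUTPUT, rng)

# Weights and biases of both layers.
n_params = (N_INPUT * N_HIDDEN + N_HIDDEN) + (N_HIDDEN * N_OUTPUT + N_OUTPUT)
sample = [rng.random() for _ in range(N_INPUT)]  # placeholder vector, not real data
probs = forward(sample, hidden_layer, output_layer)
print(n_params, probs)
```

A network of this shape has 2302 trainable parameters (2100 in the hidden layer and 202 in the output layer), which is what the training stage has to fit.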

Experimental results
In this study, artificial neural networks were used as the classifier for gender recognition from the properties of voice signals. The ReLU (Rectified Linear Unit) activation function was used, and the number of iterations was set to 200. Of the 3168 voice samples, 1584 belong to male and 1584 to female speakers. Cross-validation was used so that the classification result could be considered reliable. In this technique, the dataset is divided into k parts. In each run, k-1 parts of the dataset are used for training and the remaining part is used for testing. This continues until each of the k parts has been used for testing; in other words, training and testing are each performed k times. In this study, k was set to 10. The average success rate of the classification based on sound features was 97.9%. The confusion matrix obtained from the classification is shown in Table 4. There were 1554 true positives, 1547 true negatives, 30 false negatives and 37 false positives. The classification performance values calculated from these counts are shown in Table 5. The values in Table 5 are very close to each other because the numbers of correctly and incorrectly classified samples are very similar for the two classes. In addition, the high rates show that the classifier is successful in both training and testing.
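The k-fold procedure described above can be sketched in Python as follows. This is a minimal sketch, not the code used in the study: the function names `k_fold_indices` and `cross_validate`, the shuffling step and the fixed seed are assumptions made for illustration, and `evaluate_fold` stands for any hypothetical routine that trains on one index set and returns the accuracy on the other.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs so that every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption, not stated in the text
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(n_samples, k, evaluate_fold):
    """Arithmetic mean of the per-fold scores returned by evaluate_fold(train_idx, test_idx)."""
    scores = [evaluate_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(3168, 10))
print([len(test) for _, test in folds])
```

Because 3168 is not divisible by 10, this split produces eight test folds of 317 samples and two of 316, with the remaining samples used for training in each run.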
Other studies in the literature have used the same dataset. A comparison with these studies is given in Table 6.

Conclusions
In this study, artificial neural networks, a traditional method, were used to classify voice data. It was observed that emotion recognition, gender recognition and age prediction can be performed using voice data. The artificial neural network was able to predict gender with 97.9% success. The recall, precision and F-1 score values, which characterize classifier performance, were 98%, 97.7% and 97.9%, respectively. It is thought that these values can be increased with different classifiers and hybrid classifiers. In addition, higher classification success may be achieved by selecting the effective features among the 20 features and removing those that do not contribute to classification or that affect it negatively. For this reason, studies on different classification methods and on feature selection are planned as future work.