Lung disease classification using machine learning algorithms

ABSTRACT


Introduction
Millions of people around the world suffer from pulmonary disease. The most common of these diseases are chronic obstructive pulmonary disease (COPD), asthma, pneumonia, lung cancer and tuberculosis [1]. To diagnose patients with respiratory system disorders, physicians need a medical history and a physical examination [2], but this information is not available for computer processing [2]. Classifying respiratory diseases by inspecting plotted curves of respiratory impedance or derived parameters is a difficult task for untrained physicians because it depends on the experience and ability of the physician [3]. Inductive learning systems have been used in many medical fields, such as oncology, liver pathology, prognosis of survival in hepatitis, urology, diagnosis of thyroid diseases, rheumatology, diagnosis of craniostenosis syndrome, dermatoglyphic diagnosis, cardiology, neuropsychology, gynecology, and perinatology. Automatically generated diagnostic rules have increased the diagnostic accuracy of specialist doctors [2].
To diagnose or classify anything, patterns have to be identified. However, if the data set is too large, these patterns are hard to find. In addition, traditional methods cannot be used to find patterns or create mathematical models because the gathered data are generally nonlinear [4].
Several successful machine learning algorithms have been developed in recent years, and error rates have become very small with deep learning algorithms [4]. In computer vision and speech recognition in particular, machine learning now approaches human-level perception [4]. Although expert systems are used in clinical practice, machine learning systems are still used mostly experimentally today [5].
Support vector machines (SVMs) are among the most commonly used supervised machine learning models for classification and regression analysis [6], [7]. The SVM training algorithm builds a model that assigns new samples to one of two categories, making it a non-probabilistic binary linear classifier [8]. In the SVM model, samples are represented as points in space, mapped so that the samples of the separate categories are divided by a gap [9]. New samples are mapped into the same space and categorized according to the side of the gap on which they fall [9].
SVMs can also perform nonlinear classification efficiently. Because SVM classification is a supervised learning method, the data must be labeled. When data are not labeled, an unsupervised learning approach is required that finds natural clusters in the data and then maps new data to these clusters. A clustering algorithm based on SVMs provides this extension [10].
Because it is such a powerful algorithm, SVM has been widely used in the biological and other sciences [11]-[13].
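The geometric picture of an SVM given above (two categories separated by a gap, with new samples assigned by the side they fall on) can be illustrated with a minimal sketch. It assumes a linear SVM trained by full-batch subgradient descent on the regularized hinge loss, which is one common textbook formulation and not necessarily the solver used in this study. The function names, hyperparameters and toy data are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit weights w and bias b of a linear SVM (labels must be +1 or -1)
    by full-batch subgradient descent on the regularized hinge loss."""
    n = len(X)
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample lies inside the gap or on the wrong side
                for j in range(n_features):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Assign a sample to +1 or -1 according to its side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data (not taken from the study).
X = [[1.0, 1.0], [1.5, 0.5], [0.5, 1.5], [4.0, 4.0], [4.5, 3.5], [3.5, 4.5]]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

Nonlinear classification would replace the dot product with a kernel function; that step is omitted here to keep the sketch short.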
In previous studies, diagnosis classification was performed using manually selected text data or audio data. For classification, these studies used traditional machine learning algorithms such as multi-layer perceptron (MLP), multilayer neural network (MLNN), k-nearest neighbor (k-NN) and probabilistic neural network (PNN). Their results can be seen in Table 3.
In this study, our aim is to classify respiratory diseases based on the collected data. These data consist of patient demographic information, preliminary questions, symptoms, lung function test results, blood test results, X-ray results, final diagnosis and audio recordings of lung sounds made by chest physicians. Since our experiments combine text and audio data, our results may also reveal connections between seemingly unrelated data and conditions, which may provide the field of medicine with new insights.

Materials and Methods
Since we needed a device to record breathing sounds, we first examined the electronic stethoscopes on the market and found that two types are in common use today: the Littmann 2100 Electronic Stethoscope and the Thinklabs One Electronic Stethoscope. In these devices, a microphone and a series of electronic circuits convert the analog signals coming from the head of the stethoscope into digital signals. The digital signal is then transmitted to the computer via a 3.5 mm microphone jack, which is common in computers and mobile devices. The main difference between the two is that the Littmann 2100 requires proprietary software, whereas the Thinklabs One transmits the audio signal to any device running any software [14]. Since these devices did not suit our needs, we built a custom electronic stethoscope.
The first prototype was a large device with audio out for headphones and a microphone input for the stethoscope with microphone. However, this device captured too much environmental noise which suppressed the respiratory sounds. It was also too big to carry around in a hospital environment.
The second prototype was a smaller version of the first one, with two inputs: one for the stethoscope microphone signal and one for recording environmental noise. It also had an audio output for headphones. The device recorded stereo audio, with one channel for respiratory sounds and the other for environmental noise. The idea was to record both signals and subtract the noise from the respiratory signal. However, we found that the noise in the respiratory signal was not equivalent to the noise signal coming from the second channel. When the noise was subtracted, a large part of the signal was lost because of the low-frequency nature of respiratory audio signals. We therefore decided not to use the second prototype either.
We found that the environmental noise included electronic noise from the components of the device, so the more complex the device became, the more electronic noise appeared in the final signal. We therefore removed the signal-enhancing hardware. The final device consisted of a small directional microphone fitted inside the head of the stethoscope and connected through a 3.5 mm microphone jack.
However, there was still noise in the recorded audio, for two reasons:
• Hospital environments naturally have a variety of noises such as people talking, phones, noisy medical devices, and ambulance and police sirens.
• A scratching noise occurs when the diaphragm of the stethoscope comes into contact with skin and body hair during recording.
There was not much we could do about the first problem because it is impossible to provide perfect silence in hospital rooms. However, we solved the second problem simply by lubricating the contact area.

Software for Data Acquisition
We developed an application that creates patient records and records, plays and modifies audio. It has 8 main sections:
• Patient information: First name, last name, age, gender, smoking habits, sport habits (Figure 1).
• Preliminary questions: Shortness of breath, cough, color of mucus, coughing of blood, chest pains (Figure 2).
• Symptoms: High fever, weight loss, swelling in legs, night sweating, palpitation (Figure 3).
• Audio recording: Audio recordings from 11 areas of the patient's chest (Figure 4).
• Lung function test results: Forced vital capacity (FVC), forced expiratory volume in 1st second (FEV1), and FEV1 / FVC (Figure 5).

Data Acquisition
Three hospitals agreed to host our research in their respiratory diseases departments: Ankara University, Yıldırım Beyazıt University and Yıldırım Beyazıt Education and Research Hospital. We used a Lenovo ThinkPad E550 laptop to record respiratory audio and patient data. We recorded patient information, preliminary questions, symptoms, lung function test results, blood test results, X-ray results, final diagnosis and respiratory audio from 1630 subjects, as can be seen in Table 1. Audio was recorded at 11 chest positions for each subject, giving 1630 × 11 = 17930 audio clips, each 10 seconds long.

Experiments
We used two kinds of features. The first kind is the manually selected features: age, gender, smoking habits, sport habits, shortness of breath, cough, color of mucus, coughing of blood, chest pains, high fever, weight loss, swelling in legs, night sweating, palpitation, FVC, FEV1, FEV1 / FVC, white blood cell count, C-reactive protein count, neutrophil count and X-ray results from 6 regions of the lungs. The second kind is Mel Frequency Cepstral Coefficient (MFCC) features, which we extracted from the audio because they are widely used in audio detection systems.
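For readers unfamiliar with MFCC features, the usual computation for a single audio frame can be sketched as follows: window the frame, take its power spectrum, pool the spectrum through triangular filters spaced on the mel scale, take logarithms, and apply a discrete cosine transform. This is a minimal sketch of the standard technique using only the Python standard library. The frame length, sample rate, number of filters and number of coefficients are assumptions chosen for illustration, since the extraction settings used in this study are not stated here.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum of one Hamming-windowed frame (naive DFT, bins 0..N/2)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    mels = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = []
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):   # rising slope
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):  # falling slope
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    """MFCCs of one frame: log mel-filterbank energies followed by a DCT-II."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # A small constant avoids log(0) for filters that capture no energy.
    log_energies = [math.log(sum(f * p for f, p in zip(filt, spectrum)) + 1e-10)
                    for filt in bank]
    return [sum(e * math.cos(math.pi * k * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energies))
            for k in range(n_coeffs)]


# Hypothetical test signal: a 440 Hz tone sampled at 8 kHz, one 256-sample frame.
sample_rate = 8000
frame = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate) for t in range(256)]
coeffs = mfcc(frame, sample_rate)
```

In practice a full recording is split into overlapping frames and the per-frame coefficients are aggregated; an optimized FFT-based library would normally replace the naive DFT used here.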
SVMs can help solve problems such as text and hypertext classification and can improve image classification. SVMs can achieve higher search accuracy than traditional query refinement schemes after only three or four rounds of relevance feedback [11].
One of the most frequently used statistical classification algorithms is k-NN. It classifies objects based on the nearest training instances in the feature space [4].
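The k-NN rule described above can be sketched in a few lines: compute the distance from the query to every training instance, keep the k nearest, and take a majority vote. This is a minimal illustration that assumes Euclidean distance and k = 3; the function name and toy data are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature toy data, already scaled to the range [0, 1].
train_X = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.15],
           [0.90, 0.80], [0.80, 0.90], [0.85, 0.85]]
train_y = ["healthy", "healthy", "healthy", "ill", "ill", "ill"]

label_a = knn_predict(train_X, train_y, [0.2, 0.2])
label_b = knn_predict(train_X, train_y, [0.7, 0.9])
```

Because the vote depends on raw distances, features measured on different scales (for example age and FEV1 / FVC) would normally be normalized before applying k-NN.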
The GB is a probabilistic model. It assumes that all data points are generated from a mixture of a small number of Gaussian distributions with unknown parameters. Mixture models can be viewed as a generalization of k-means clustering that incorporates information about the covariance structure of the data as well as the centers of the latent Gaussians [4].
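The exact GB implementation is not spelled out in the text, so the following is only a sketch under an explicit assumption: each class is modeled by a single Gaussian with a diagonal covariance, and a sample is assigned to the class with the highest posterior. This is one simple way to turn a Gaussian model with unknown parameters into a classifier; the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussians(X, y, var_floor=1e-6):
    """Estimate a log prior, per-feature means and variances for each class."""
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)
    model = {}
    for label, rows in groups.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The variance floor guards against division by zero for constant features.
        variances = [max(sum((v - m) ** 2 for v in col) / n, var_floor)
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian(model, x):
    """Return the class whose Gaussian gives x the highest log posterior."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2.0 * math.pi * v) - (xj - m) ** 2 / (2.0 * v)
            for xj, m, v in zip(x, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))


# Hypothetical toy data (not taken from the study).
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = ["healthy", "healthy", "healthy", "ill", "ill", "ill"]
model = fit_gaussians(X, y)
pred_low = predict_gaussian(model, [1.1, 2.0])
pred_high = predict_gaussian(model, [5.1, 6.1])
```

A full mixture model would instead fit several Gaussians per class with the expectation-maximization algorithm, which is beyond the scope of this sketch.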
Because of the advantages of these models, we used the SVM, k-NN and GB algorithms to process the following datasets, which were built from 1630 subjects:
• Dataset to predict whether the subject is ill or healthy, using data collected manually by physicians
• Dataset to predict whether the subject is ill or healthy, using MFCC features extracted from the combined audio data from the 11 locations on each subject's chest
• Dataset to predict whether the subject is ill or healthy, combining the data collected manually by physicians with the MFCC features extracted from the combined audio data from the 11 locations on each subject's chest
• Dataset for 12-class diagnosis classification, using data collected manually by physicians
• Dataset for 12-class diagnosis classification, using MFCC features extracted from the combined audio data from the 11 locations on each subject's chest
• Dataset for 12-class diagnosis classification, combining the data collected manually by physicians with the MFCC features extracted from the combined audio data from the 11 locations on each subject's chest
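The six datasets listed above are the cross product of three feature sets (text, audio, combined) and two label sets (ill versus healthy, 12-class diagnosis). A minimal sketch of that construction follows. The record layout, key names and values are hypothetical, and the "mfcc" entry is assumed to already hold the features combined over the 11 chest locations.

```python
def build_datasets(subjects):
    """Build the six (features, labels) pairs from per-subject records.

    Each record is a dict with the hypothetical keys 'clinical', 'mfcc',
    'is_ill' and 'diagnosis'.
    """
    feature_sets = {
        "text": lambda s: list(s["clinical"]),
        "audio": lambda s: list(s["mfcc"]),
        "combined": lambda s: list(s["clinical"]) + list(s["mfcc"]),
    }
    label_sets = {
        "binary": lambda s: s["is_ill"],
        "12class": lambda s: s["diagnosis"],
    }
    datasets = {}
    for label_name, get_label in label_sets.items():
        for feature_name, get_features in feature_sets.items():
            datasets[(label_name, feature_name)] = (
                [get_features(s) for s in subjects],
                [get_label(s) for s in subjects],
            )
    return datasets


# Two hypothetical subjects with made-up values.
subjects = [
    {"clinical": [63, 1, 0.7], "mfcc": [0.1, -0.2], "is_ill": 1, "diagnosis": "class_3"},
    {"clinical": [35, 0, 0.9], "mfcc": [0.4, 0.3], "is_ill": 0, "diagnosis": "class_0"},
]
datasets = build_datasets(subjects)
```

Each of the six entries can then be passed to the same classifier, which keeps the comparison between text, audio and combined features consistent.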

Results and Discussion
Our results are in Table 2. Several studies have demonstrated the benefit of computerized lung disease analysis [11], [16], [17]. However, only a small number of studies are available on the diagnosis of lung diseases, as shown in Table 3.
As shown in Table 3, the studies in the literature had limited or pre-recorded datasets [3], [18]-[30]. Because of the small number of samples and the very few or highly distinctive features, their results were not consistent: they were either very high or very low. Moreover, pre-recorded datasets provide clean samples, which may not reflect real-life conditions and can therefore give misleading results. To overcome this issue, we collected 17930 audio recordings from 1630 healthy and sick subjects.
Previous studies performed diagnosis classification with 2 classes, and one study used 3 classes [25], [29]. Most studies classified subjects as healthy or ill, while some classified a subject group with two different illnesses. The problem with using a small number of classes is that it does not really measure the performance and effectiveness of a given machine learning algorithm. In our study we classified 1630 patients into 12 disease classes, as can be seen from Table 1. In the literature, diagnosis classification was made either with manually selected text data [3], [20], [23]-[27], [31], [32] or with audio data [19], [31], as can be seen in Table 3. In our study we ran our experiments using text data, audio data, and text and audio data combined. This provided insight into which features are more important and whether results could be improved by combining text and audio data.
Previous studies used traditional machine learning algorithms such as MLP, MLNN, k-NN and PNN. In our study, we used MFCC features of the audio data with the SVM algorithm for classification.
Our study has three advantages over the state-of-the-art studies:
• Our data set (1630 subjects and 17930 audio clips) is much bigger than those of other studies in this field.
• The respiratory audio clips in the data set are not amplified, modified, cleaned or pre-recorded by a third party, which is not the case with many of the studies we looked into.
• We tested our algorithms on 6 datasets and obtained consistent results across the board, which has not been done in any state-of-the-art study so far.

Conclusions
In this study, our first goal was to build an electronic stethoscope along with a software system that can record respiratory sounds and patient information to a computer. Audio and text datasets created by this system were used in SVM, k-NN and GB machine learning algorithms for purposes of automated analysis and diagnosis.
In these experiments with 1630 subjects, the best results were obtained in healthy versus sick classification. The reason is that our dataset does not have an equal number of samples for each disease; some classes are represented by just a few samples. Therefore, the classification accuracy drops as the number of classes increases. The total number of samples also affects the classification results: we have enough samples to classify 2 classes, but more accurate classification into more classes requires more samples.
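The effect of unequal class sizes can be made concrete by reporting recall per class alongside overall accuracy. The sketch below uses hypothetical labels and predictions, not results from this study: a classifier that always predicts the majority class still reaches high accuracy while never recognizing the rare class.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical imbalanced example: 8 samples of one class, 2 of another.
y_true = ["class_A"] * 8 + ["class_B"] * 2
y_pred = ["class_A"] * 10  # a classifier that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
```

Here the accuracy is 0.8 even though the recall for class_B is 0.0, which is why per-class measures are informative when some classes have only a few samples.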
In 12-class classification of lung diseases, the most accurate algorithm was SVM with text data. In classification with audio data, k-NN was the most accurate. Using both audio and text data, SVM was the most accurate.
However, when we classified healthy versus sick with text, audio and combined data, GB was always the most accurate, with very high accuracy, closely followed by k-NN.
We can infer from this that when we have a large number of features but a limited number of samples, SVM and k-NN are best at classifying the dataset into more than two classes, whereas GB is best at classifying it into two classes.
We can also see from the results that, for disease diagnosis, text and combined data produce better results than audio data alone. This is also largely true for deciding whether the patient is healthy or sick. However, audio data alone can also be used for that decision, as we found it to be highly accurate as well.
In addition, this system makes it possible to record and store patient information, especially audio data, so that it can be shared with other physicians and compared with data recorded later to follow the prognosis of the patient. We believe that our method of diagnosis classification using patient data and respiratory sounds can lead the way to even more advanced computerized analysis techniques in the future.

Author's Note
An abstract version of this paper was presented at the 9th International Conference on Advanced Technologies (ICAT'20), 10-12 August 2020, Istanbul, Turkey, with the title "Lung Disease Classification using Machine Learning Algorithms".