Sequential Feature Maps with LSTM Recurrent Neural Networks for Robust Tumor Classification

In biomedicine, applications for identifying biomarkers require a robust gene selection mechanism. To identify the characteristic marker of an observed event, the selection of attributes becomes important. The robustness of gene selection methods affects the detection of biologically meaningful genes in tumor diagnosis. For mapping, a sequential feature long short-term memory (LSTM) network was used with artificial immune recognition systems (AIRS) to remember gene sequences and effectively recall learned sequential patterns. An attempt was made to improve AIRS with LSTM, a type of recurrent neural network (RNN), to produce discriminative gene subsets for finding biologically meaningful genes in tumor diagnosis. The algorithms were evaluated using six common cancer microarray datasets. By converging to the intrinsic information of the microarray datasets, specific groups, such as functions of the co-regulated groups, were observed. The results showed that the LSTM-based AIRS model could successfully identify biologically significant genes from the microarray datasets. Furthermore, the predictive genes for biological sequences are important in gene expression microarrays. This study confirmed that different genes could be found in the same pathways. It was also found that the gene subsets selected by the algorithms were involved in important biological pathways. In this manuscript, we applied an LSTM network to our learning problem, as we expected recurrent neural networks to be a suitable architecture for making predictions. The results showed that the optimal gene subsets were obtained with the suggested framework, so they should have rich biomedical interpretability.

Microarray technology is used to identify disease-related genes by comparing gene expression in diseased and normal cells. Obtaining marker genes from high-throughput experiments instead of creating models provides advantages for biomarker discovery. It can be used in gene expression profiling, the diagnosis of diseases, and pharmacogenetics. Some of the genes in gene expression data provide important information about disease diagnosis. Feature selection methods generally influence the performance of biomarker discovery, and marker discovery needs a robust feature selection method for microarray datasets. The groups formed by associated features are generally referred to as intrinsic specific groups, such as functions, and are present in high-dimensional datasets. The current research aimed to develop and assess a method of converging to co-regulated feature groups in microarray datasets, thus addressing the problem of obtaining robust feature groups with high-accuracy classification. Recent advancements in deep learning introduce a strong alternative to high-throughput experiments. The methodology in this research is immune-based feature selection, which is utilized to discover optimal feature sets and enhance robust tumor classification. Recent research efforts involving feature selection span deep learning, feature selection, and the classification of cancer microarray datasets. An in-depth investigation of feature selection to enhance the diagnosis of diseases will provide a significant contribution to the literature in the biomarker discovery domain. This study presented and tested a novel method of leveraging feature selection that resulted in improved tumor classification.
In the study conducted by [1], a new feature selection framework based on recurrent neural networks (RNNs) was suggested to select a subset of features. The suggested model was applied to select features from microarray data for cell classification. Feature selection models were implemented based on the gated recurrent unit (GRU), long short-term memory (LSTM), RNN, and bidirectional LSTM for microarray datasets.
In the study carried out by [2], a deep neural network model was improved by feature selection algorithms in predicting various biomedical phenotypes. Five binary classification methylome datasets were selected to compute the prediction performance of CNN/DBN/RNN models using the features selected by eleven feature selection algorithms. The results showed that the Deep Belief Network (DBN) model using the features selected by SVM-RFE usually had the best prediction accuracy on the five methylome datasets. In the research conducted by [3], a novel approach based on clustering-centered feature selection was established for the classification of gene expression datasets. According to the experiments, the suggested feature clustering support vector machine (FCSVM) achieved efficient performance on gene expression datasets. In the research performed by [4], the most recent studies using deep learning to establish models for cancer prognosis prediction were reviewed. That review revealed that deep learning performed as well as or better than the current approaches in cancer prognosis. In the study conducted by [5], a convolutional neural network (CNN) deep learning algorithm was investigated for the classification of microarray datasets. The promising results indicated that CNN had advantages in terms of accuracy and in minimizing the number of genes used to classify cancer. In the research conducted by [6], the performance of deep neural networks for the classification of gene expression microarrays was analyzed. The experimental results suggested that deep learning needs high-throughput datasets to achieve the best performance. In the study carried out by [7], deep learning-based algorithms were developed to make a tumor diagnosis and to reveal biomarkers, genetic changes, and pathological features. The overview showed that deep learning-based approaches for pathology gave promising results for big data.
Recent developments in robust biomarker techniques with deep learning introduce a strong alternative for tumor diagnosis. The principal contributions of the current manuscript are presented below:
1. The current study presents an extensive framework in which the learning and combination of features are carried out in a novel way, by establishing a biomarker discovery model.
2. The Artificial Immune Recognition System (AIRS) algorithm was trained by deep learning-based approaches to sequentially learn biologically meaningful genes in order to predict a tumor diagnosis.
3. We leveraged the LSTM deep learning technique to capture the long context correlations in a cancer microarray dataset.
4. We optimized deep-learned features by utilizing LSTM recurrent neural networks (RNNs) to detect co-regulated specific groups, such as functions, in high-dimensional data.
5. We examined the possibility of utilizing deep neural network (DNN) models with recurrent architectures to learn disease-related genes, and then we used them for the prediction of important biological pathways.
In this study, the LSTM-based AIRS version 1, LSTM-based AIRS parallel version 1, LSTM-based AIRS version 2, and LSTM-based AIRS parallel version 2 algorithms were developed to discover optimal biological gene sequences. The suggested algorithms were compared with the traditional genetic algorithm and genetic algorithm-based artificial immune systems. All the experiments in this study used microarray datasets.
The present research is structured as follows. Section II summarizes the feature subset group. Section III briefly explains the systems. Section IV describes the methodology and framework. Sections V and VI focus on the results and performance analysis, and the conclusions, respectively.

II. FEATURE SUBSET GROUP
Robust tumor classification was constructed with an ensemble gene selection framework. The framework uses feature subset groups comprised of the associated attributes. Feature groups are created using group formation algorithms that run separately on sub-samples of the training dataset. The bootstrap method was used to ensure the stability of training samples in the presence of variations. The associated feature groups were created with filter-based feature selection methods.
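The sub-sampling step described above can be illustrated with a minimal sketch. It assumes plain bootstrap resampling (drawing sample indices with replacement); the function name `bootstrap_indices` and its parameters are hypothetical and are not taken from the original study.

```python
import random

def bootstrap_indices(n_samples, n_rounds, seed=0):
    """Return n_rounds index lists, each drawn with replacement from range(n_samples)."""
    rng = random.Random(seed)
    return [[rng.randrange(n_samples) for _ in range(n_samples)]
            for _ in range(n_rounds)]

# Hypothetical usage: five bootstrap sub-samples of a 10-sample training set.
subsamples = bootstrap_indices(n_samples=10, n_rounds=5)
```

A group formation algorithm would then be run once per index list, and the resulting feature groups compared across sub-samples to judge their stability.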
Density-based feature groups were created by kernel density estimation, which was calculated using equation (1). The kernel function is determined by the Cj+1 formula to identify the consecutive locations of the kernel function. Kernel density estimates were made to locate dense feature groups, and then the most relevant groups were selected.
In Eq. (1), h is the kernel bandwidth, k is the nearest neighbor number, fi is any attribute, and K is the kernel function; the equation also involves the number of attributes in the dataset.
The usefulness of attribute subsets in the CFG was identified by Eq. (2). The merit of a subset S was scored by a heuristic evaluation function.
Variables k, rcf, and rff represent the number of attributes, the mean attribute-class correlation, and the mean attribute-attribute correlation, respectively.
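Eq. (2) itself is not reproduced in this text. Assuming it is the standard correlation-based feature selection (CFS) merit, which uses exactly the three quantities k, rcf, and rff, the heuristic can be sketched as follows. The function name `cfs_merit` is hypothetical.

```python
import math

def cfs_merit(k, r_cf, r_ff):
    """Standard CFS heuristic (assumed form of Eq. 2):
    merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# Four attributes, each correlated 0.5 with the class.
uncorrelated = cfs_merit(4, 0.5, 0.0)  # attributes carry independent information
redundant = cfs_merit(4, 0.5, 1.0)     # attributes are copies of one another
```

Under this assumed form, the merit rises when attributes correlate with the class and falls when they correlate with each other, so a fully redundant subset scores no better than a single attribute.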
In Eq. (3), the information gain function identifies the significance of a given attribute in the full feature set. The entropy criteria were used to determine feature knowledge.
Parameters ft and M represent any attribute and the number of data samples, respectively.
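Eq. (3) is not reproduced in this text. As a minimal sketch, assuming the usual entropy-based definition (class entropy minus the class entropy conditioned on the attribute) and assuming the expression values have already been discretized, information gain can be computed as follows. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Class entropy minus the class entropy conditioned on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

labels = [0, 0, 1, 1]
gain_informative = information_gain(["low", "low", "high", "high"], labels)
gain_uninformative = information_gain(["low", "high", "low", "high"], labels)
```

An attribute that separates the classes perfectly recovers the full class entropy, while one that is independent of the class yields a gain of zero.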

A. Long Short-Term Memory (LSTM)
LSTM is a variant of the RNN architecture and one of the most effective solutions to sequence prediction problems because it recognizes patterns in data sequences. Since LSTM possesses a form of memory, it can selectively remember patterns for a long time, which makes it a reasonable approach when the interval between important events has an unknown duration. Fig. 1 demonstrates the architecture of the LSTM recurrent neural network. It comprises a self-recurrent connection and three gates (input, forget, and output), which are responsible for remembering information and manipulating the memory cell. Interactions between the memory cell and its environment are modulated by these gates. The input gate adds information to the cell state. The forget gate allows the cell to remember or forget its previous state. The output gate selects useful information from the current cell state and exposes it as output. Each LSTM block has input and output gates that learn to activate or deactivate in order to obtain new information, change the cell state, and affect other cells and network outputs. X(t) is the input for the antigenic pattern at time t. At each time step, an LSTM block computes a new candidate cell state Ĉ(t) through a tanh layer, and the old cell state C(t-1) is then updated to C(t). The modulation and output gates are represented by g(t) and O(t), respectively.

B. Artificial Immune Recognition System (AIRS)
The artificial immune recognition system (AIRS) represents an intelligent system that is inspired by the natural immune system.
AIRS consists of four stages: initialization, memory cell recognition, resource competition, and the selection of memory cells. At the initialization stage, the dataset is normalized to the range [0, 1]. Then, the affinity threshold is computed using Eq. (4). The affinity threshold represents the average affinity between antigens in the training set. In Eq. (4), variables n, agi, and agj refer to the number of antigens in the dataset, any antigen, and the next antigen in the dataset, respectively. Antigens are trained in the artificial recognition ball (ARB) pool during the resource competition stage. A stimulation value is assigned to every ARB so that ARBs compete for limited resources. Memory cells are selected at the end of the resource competition stage. The evolved memory cell pool indicates the quality of the classification process. The affinity between two antigens in the training set is calculated using Eq. (5), and Eq. (6) calculates stimulation. When a memory cell and an antigen have the same class label, the stimulation value is derived from their affinity; when they have different class labels, it is derived from their Euclidean distance. In Eq. (6), mc represents the memory cell.
The first version of the artificial immune recognition system (AIRS1) utilizes the ARB pool as a permanent resource, and the mutation rate is determined by the user. The second version, AIRS2, utilizes the ARB pool as a temporary resource. Therefore, AIRS2 is less complex, and it uses somatic hypermutation, which means that the mutation rate is proportional to affinity. The parallel versions of AIRS are PAIRS1 and PAIRS2. The training dataset is partitioned across np processes. The AIRS algorithm is run separately on each process, and the resulting np memory pools are merged.
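Eqs. (4) to (6) are not reproduced in this text. The sketch below illustrates the initialization and stimulation computations under assumptions drawn from the commonly described AIRS formulation: affinity is the Euclidean distance scaled into [0, 1], the affinity threshold is the mean pairwise affinity, and stimulation is one minus the affinity for matching classes and the affinity itself otherwise. All function names are hypothetical.

```python
import math
from itertools import combinations

def normalize(rows):
    """Min-max scale every column of a list of feature vectors to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)] for row in rows]

def affinity(a, b):
    """Euclidean distance divided by sqrt(d), so it lies in [0, 1] for normalized inputs."""
    return math.dist(a, b) / math.sqrt(len(a))

def affinity_threshold(antigens):
    """Mean affinity over all unordered antigen pairs (assumed form of Eq. 4)."""
    pairs = list(combinations(antigens, 2))
    return sum(affinity(a, b) for a, b in pairs) / len(pairs)

def stimulation(cell, antigen, same_class):
    """Assumed form of Eq. 6: reward closeness within a class, distance across classes."""
    aff = affinity(cell, antigen)
    return 1.0 - aff if same_class else aff

antigens = normalize([[2.0, 10.0], [4.0, 30.0], [2.0, 30.0]])
threshold = affinity_threshold(antigens)
```

With this convention, a memory cell identical to an antigen of its own class receives the maximum stimulation of 1, and the same cell receives 0 against an identical antigen of another class.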

C. Genetic Algorithm
The genetic algorithm (GA) is a stochastic search and optimization technique that mimics natural evolutionary mechanisms. GA is a population-based algorithm that evolves solutions on the basis of the principles of Darwinism. Each candidate solution is represented by a chromosome and has a fitness value indicating the quality of the solution to a problem. GA starts by generating a random population. Fitness-based selection determines which parent chromosomes enter the mating pool for recombination. Through crossover and mutation operators, offspring are produced for the next generation. The evolution of successive generations continues until the stopping criterion is met. At the final stage, the best solution to the problem is determined.
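The GA loop described above can be sketched as follows. This is a minimal illustration, not the configuration used in the study: it assumes binary chromosomes (for example, one bit per candidate gene), tournament selection as the fitness-based selection, one-point crossover, bit-flip mutation, and a fixed number of generations as the stopping criterion. All names and default parameter values are hypothetical.

```python
import random

def genetic_algorithm(fitness, n_genes, pop_size=20, generations=50,
                      crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Evolve binary chromosomes (n_genes >= 2) and return the best one seen."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def tournament(population):
        # Fitness-based selection: the fitter of two random individuals wins.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):  # stopping criterion: fixed generation budget
        next_pop = []
        while len(next_pop) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_genes)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation applied independently to every gene.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
        best = max(pop + [best], key=fitness)
    return best

# Toy fitness: the number of ones in the chromosome.
best = genetic_algorithm(sum, n_genes=12)
```

For gene selection, the toy fitness would be replaced by a score such as the classification accuracy obtained with the genes whose bits are set.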

D. Artificial Neural Network with the Genetic Algorithm (ANN+GA)
The artificial neural network (ANN) is a computational structure that models the neural structure of the human brain. Artificial neurons are the basic units, which are connected by weighted links (synapses). The structure comprises input, hidden, and output layers. The input layer provides data to the ANN. The hidden layer consists of units that transform the input into a form the output layer can use. The feature subsets of the ANN were created with the DGF, CFG, and IGFG feature subset groups. The genetic algorithm (GA) was employed to estimate the best input parameters of the ANN for training the networks.

IV. METHODOLOGY AND SUGGESTED FRAMEWORK
The methodology was inspired by a theoretical model of the natural immune system, which describes the functioning and behavior of the immune system and has been an inspiration for a new artificial immune system (AIS).
The theoretical model hypothesizes that "a kind of internal restimulation keeps immune memory preserved for a long time" [9]. To model such an internal restimulation mechanism, this study used an LSTM recurrent neural network. The proposed framework was designed for the robust computation of long-sequence learning. The adaptive immune system is capable of remembering the same antigenic patterns over different periods. An associative immune memory was developed to remember gene sequences as robust patterns. This study developed a mechanism for sequence modeling in which biologically significant gene sequences could be effectively memorized. The immune memory of AIRS was developed based on this methodology to understand the "remember" behavior of the artificial immune system response. The underlying principle of the LSTM-based AIRS is to allow for the preservation of the subpopulation of surviving ARBs as long-lived unit cells. LSTM systems can remember values over intervals of arbitrary duration, even when the delays between significant events are unknown. The evolution of each ARB was performed with the LSTM block for long time series. All recognition cells were remodeled with LSTM gates during the training of the system and then treated with the metadynamics of AIRS. LSTM evolves sub-populations of memory cells and treats them as network inputs. The proposed framework formulates long-sequence learning problems with LSTM memory blocks, as shown below.

The output of the network, h(t), is computed by the following formula:

$$h_t^j = o_t^j \tanh(c_t^j) \quad (7)$$

Here, $c_t^j$ refers to the memory amount of the j-th LSTM unit at time t, and $o_t^j$ denotes the output gate, which manages the exposure of the memory content. The output gate is expressed by the following equation:

$$o_t^j = \sigma(W_o X_t + U_o h_{t-1} + V_o c_t)^j \quad (8)$$

Here, $\sigma$ represents the standard sigmoid function, while $V_o$ represents a diagonal matrix. $\hat{C}_t^j$ denotes the new memory content of the memory unit. The memory cell is updated by partially forgetting the current memory and adding the new memory content:

$$c_t^j = f_t^j c_{t-1}^j + i_t^j \hat{C}_t^j \quad (9)$$

The new memory content is given by:

$$\hat{C}_t^j = \tanh(W_c X_t + U_c h_{t-1})^j \quad (10)$$

The forgetting of the current memory is modulated by the forget gate f(t). The input gate i(t) modulates the degree to which new memory content is added to the memory cell. $V_f$ and $V_i$ are diagonal matrices [10].
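The update in Eqs. (7) to (10) can be traced with a short sketch of a single LSTM time step. The text does not give explicit equations for the forget and input gates, so the sketch assumes they take the same sigmoid form as the output gate, with the diagonal matrices Vf and Vi acting on the previous cell state. Diagonal matrices are stored as plain vectors, and the parameter dictionary layout is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; p maps names such as 'Wf', 'Uf', 'Vf' to weights."""
    def pre(gate):
        # Shared term W_gate x_t + U_gate h_{t-1}.
        wx = matvec(p["W" + gate], x)
        uh = matvec(p["U" + gate], h_prev)
        return [a + b for a, b in zip(wx, uh)]

    # Forget and input gates (assumed form, diagonal V applied to c_{t-1}).
    f = [sigmoid(a + v * c) for a, v, c in zip(pre("f"), p["Vf"], c_prev)]
    i = [sigmoid(a + v * c) for a, v, c in zip(pre("i"), p["Vi"], c_prev)]
    c_hat = [math.tanh(a) for a in pre("c")]                                # Eq. (10)
    c = [fj * cp + ij * ch for fj, cp, ij, ch in zip(f, c_prev, i, c_hat)]  # Eq. (9)
    o = [sigmoid(a + v * cj) for a, v, cj in zip(pre("o"), p["Vo"], c)]     # Eq. (8)
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]                        # Eq. (7)
    return h, c

# Toy check with all-zero weights: every gate equals 0.5 and the candidate is 0.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name + gate: zeros for name in ("W", "U") for gate in "fico"}
params.update({"Vf": [0.0, 0.0], "Vi": [0.0, 0.0], "Vo": [0.0, 0.0]})
h, c = lstm_step([1.0, 2.0], [0.0, 0.0], [1.0, -1.0], params)
```

In the toy check the cell state is simply halved, which shows how the forget gate scales the old memory while the input gate scales the new candidate content.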

State_id)
Step 9: FOR (Abi ∈ Clonenum)
Step 10: ClonesAntibodies ← (CloneandHypermutated (Abi))
Step 11: CloneAffinities ← {calcAffinities (Abi)}
Step 12: Fitness ← Accuracy (Feature set (Ω), Abi)
Step 13: {UpdateAntibodyPool} ← (CloneAffinities)
Step 14: ĈLSTM ← UpdateLSTMMemory (AntibodyDNN, t, State_id)
Step 15: State_id ← State_id+1
Step 16: CLSTM ← UpdateLSTMMemory (ĈLSTM, t, State_id)
Step 17: END FOR
Step 18: bestAffinity ← {getBestAffinity (CloneAffinities)}
Step 19: Ω* ← NewFeatureSet (CLSTM, bestAffinity)
Step 20: END FOR

LSTM is a variation of the RNN cell that is easier to train because it mitigates the vanishing gradient problem. The vanishing gradient problem emerges during the training of RNNs on long sequential time series data, when the gradient of the error with respect to the model parameters at early time steps approaches zero. As a result, it becomes more challenging for the model to learn long-term dependencies in the input time series. At each time step, the inputs are propagated through the recurrent neural network using the newly calculated memory cells.

The characteristics of gene expression profile (GEP) datasets may cause over-fitting and selection bias problems if small gene sets are selected from high-dimensional features. Small gene subsets drawn from high-dimensional datasets may yield high classification performance purely by chance. Therefore, robust gene selection algorithms are required for GEP datasets [11]. Some predictor genes are reported in this part. The DAVID and REACTOME programs were used for the biological knowledge discovery of the selected genes. The biological information was extracted from the UNIPROT and NCBI ENTREZ databases. The k-NN and SVM classifiers were used directly to measure the classification accuracy of the optimal gene subsets. Ten-fold cross-validation was utilized to evaluate the classification model. The aim was to obtain reliable accuracy estimates on the training set and the test set separately.
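The evaluation protocol can be illustrated with a minimal sketch of k-fold cross-validation around a k-NN classifier. It is not the study's implementation: the fold assignment is a simple interleaving rather than a shuffled or stratified split, the data are a synthetic two-cluster toy set, and all function names are hypothetical. It requires Python 3.8 or later for `math.dist`.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training samples closest to x."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(X, y, n_folds=10, k=3):
    """Mean held-out accuracy over n_folds interleaved folds (needs len(X) >= n_folds)."""
    folds = [list(range(start, len(X), n_folds)) for start in range(n_folds)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i]
                      for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

# Synthetic toy data: two well-separated clusters of ten samples each.
X = [[i * 0.01, 0.0] for i in range(10)] + [[5.0 + i * 0.01, 5.0] for i in range(10)]
y = [0] * 10 + [1] * 10
accuracy = cross_val_accuracy(X, y)
```

To limit the selection bias mentioned above, gene selection would normally be repeated inside each training fold rather than performed once on the full dataset.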
The frequency measure was used as a measure of the significance of gene subsets in long sequences. In all tables, the genes jointly selected based on DGF, CFG, and IGFG for the tested methods are marked in bold. The selected gene sequences were analyzed based on gene function and pathway analysis. For the colon dataset, the LSTM_PAIRS2 algorithm exhibited the best performance based on DGF, with a training accuracy of 89% and a prediction accuracy of 92.3% using the SVM classifier, and a training accuracy of 85.3% and a prediction accuracy of 91.2% using the k-NN classifier. {R28608, T94993, L19437 (TALDO1), M82919 (GABRB3), T55780} is the gene subset selected in the colon dataset based on DGF. The results in Table 1 … cancer cells. Activating transcription factor 4 (ATF4) is effective in colorectal cancer [13].

TABLE II. TUMOR-RELATED GENES FOR SRBCT, LYMPHOMA, AND LEUKEMIA DATASETS
TABLE III. EXPERIMENT RESULTS FOR MICROARRAY DATASETS
For the lung dataset, the LSTM_PAIRS1 algorithm exhibited the best performance, with a training accuracy of 98.4% and a prediction accuracy of 98.6% using the SVM classifier, and a training accuracy of 97.2% and a prediction accuracy of 97.9% using the k-NN classifier, with selection based on DGF. The USP32P1, CD44, HCRTR2, TNFSF4, NUP98, CCNO, NCF2, and TCEB3-AS1 genes, which are commonly involved in the metabolism of RNA and the class I MHC mediated antigen processing and presentation pathways [14], are expressed in lung cancer. In the prostate dataset, the highest classification performance was achieved through the LSTM_PAIRS2 algorithm with the …
For the lymphoma dataset, the results show that the LSTM_AIRS2 algorithm exhibited the best performance, with a training accuracy of 91.8% and a prediction accuracy of 92.3% using the SVM classifier, and a training accuracy of 93.8% and a prediction accuracy of 94.6% using the k-NN classifier. The LSTM_PAIRS2 algorithm achieved a training accuracy of 94.5% and a prediction accuracy of 95.1% using the SVM classifier, and a training accuracy of 90.8% and a prediction accuracy of 92.3% using the k-NN classifier, while selecting the {GENE595X, CARP cardiac ankyrin repeat protein, GENE585X, TNNT1 troponin T1, skeletal, slow, GENE771X, Homo sapiens mRNA; cDNA } gene subset in the lymphoma dataset based on DGF. Type II transmembrane protein contains C-lectin domains and is related to DC-SIGN [16].
For the leukemia dataset, the {ATP6V0C; ATPase, H+ transporting, lysosomal 16kDa, V0 subunit c, CTSD; cathepsin D lysosomal aspartyl peptidase, AKT1; v-akt murine thymoma viral oncogene homolog 1, CSRP1; cysteine and glycine-rich protein 1, TGFBI; transforming growth factor, beta-induced, 68kDa, CCND3; cyclin D3, SERPINB1; serpin peptidase inhibitor, clade B ovalbumin, member 1} gene subset was selected based on DGF by the LSTM_PAIRS2 algorithm, with a training accuracy of 96.3% and a prediction accuracy of 86.6% using the SVM classifier, and a training accuracy of 98.2% and a prediction accuracy of 85.2% using the k-NN classifier. Table 3 shows the experimental performance results of the Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) classifiers using Deeplearning4j (DL4J). The results show that the GRU and LSTM classifiers performed best, with accuracies of 89.7 and 89.6 on the Lymphoma and Lung datasets, respectively, based on the DGF feature group. The GRU classifier also reached an accuracy of 88.2 on the Leukemia dataset based on the DGF feature group. According to Table 3, the worst classification accuracy, 68.4, was obtained by the MLP classifier on the SRBCT dataset based on the IGFG feature group.

The framework showed that predictive genes for biological sequences were important in gene expression microarrays. It was also found that the gene subsets selected by the algorithms were involved in important biological pathways. LSTM was used to learn sequences over time. The suggested framework was proposed for converting immune memory into an intelligent network system. The analysis of gene sequences was performed, and informative genes from each dataset were detected. This study confirmed that different genes could be found in the same pathways. Optimal gene subsets were obtained from six commonly used microarray datasets.
In future research, our aim is to investigate different deep neural network models (e.g., BiLSTM and CNN networks) to improve the performance of the proposed model. Furthermore, it is crucial to evaluate the proposed methodology on other datasets. For future studies, we also aim to assess biomarker detection with the different types of techniques presented for the detection of Coronavirus (COVID-19).

Banu DİRİ is a professor at Yildiz Technical University, and she works on natural language processing. She has authored more than 150 publications in this field. Her research interests include speech recognition, natural language processing, and machine learning.