Comparison of the performance of the regression models in GPS-total electron content prediction

In this study, four regression models are compared for the prediction of GPS-TEC data. It is observed that the Exponential Gauss Process Regression model is a very successful and high-performance model for the prediction of TEC.


INTRODUCTION
In Earth-space and satellite-to-satellite communication systems, the ionosphere acts as a medium that enables long-distance transmission. At the same time, the ionosphere has negative effects on this transmission. Through refraction, absorption, polarization, propagation time delay and Doppler frequency shift, the ionosphere can affect satellite signals in different ways. Some of these effects cause range error, which is one of the most important parameters in determining the accuracy of the receiving system [1]. The range error arises from the time delay caused by the refraction of radio waves and from variations in the signal velocity of the wave. In addition, ionospheric anomalies and disturbances influence diffraction in the ionosphere and thus alter the electron density distribution of the medium, the range errors of signals propagating through the ionosphere and the ripples in signal strength. Monitoring the electron density distribution that characterizes the ionosphere therefore plays an important role in correcting the range errors caused by the time delay of signals propagating through it. Because the ionosphere is an inhomogeneous and dispersive medium, it introduces time delays in the propagation of radio signals [2,3]. The most important characteristic of the ionosphere is its electron density, which changes at different scales depending on geographic location, season, time of day, and solar, geomagnetic and seismic activity. Many ionospheric parameters are derived as functions of electron density. If the ionosphere is considered as a single cylinder over its entire height, the Total Electron Content (TEC) can be defined as the total number of electrons in this cylinder, which has a cross section of one square meter and extends between a satellite and a receiver.
TEC is measured in TEC units (TECU), where 1 TECU equals $10^{16}$ electrons per square meter [4]. The Global Positioning System (GPS) is often used for estimating TEC. TEC can be estimated from the carrier phase delays of the radio signals recorded by a GNSS (Global Navigation Satellite Systems) receiver as follows [5,6]:

$$\mathrm{TEC} = \int_{T_x}^{R_x} N_e(l)\, dl$$

where $N_e(l)$ is the electron density per cubic meter along the path $l$ between the transmitter ($T_x$) and the receiver ($R_x$). Like the electron density, TEC varies with geographic location, season, time of day, and solar, geomagnetic and seismic activity. Empirical TEC models are mathematical models based on long-term ionospheric variations. These models rely on ionospheric maps or on TEC data obtained from GNSS. In [7], a nonlinear least squares estimation technique is used for single-station TEC modelling. Several studies in the literature have focused on understanding the storm-time behavior of the ionosphere in order to reduce the influence of ionospheric anomalies and irregularities on global positioning services and to improve the performance of ionospheric models during major geomagnetic storms [8,9,10,11]. In addition, disturbances and irregularities in TEC due to seismic activity, solar flares and solar activity degrade the precision of satellite navigation and positioning systems [12,13,14]. Recently, the prediction of TEC during periods of geomagnetic and solar activity has been intensively studied in order to improve positioning and radio communication systems. Two types of approaches, parametric and non-parametric, are used to construct prediction models. Ordinary kriging is one of the methods used to predict unknown values from observed TEC data [15,16]. Deep learning methods have also been used intensively to predict the temporal variation of TEC [17,18].
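The definition of TEC as the electron density integrated along the signal path can be illustrated numerically. The sketch below is not part of the original study: it assumes a vertical path and an illustrative Chapman-layer density profile whose parameters (`nm_f2`, `hm_f2`, `scale_km`) are hypothetical values chosen only for demonstration, and it integrates the profile with the trapezoidal rule before converting to TECU.

```python
import math

TECU = 1e16  # electrons per square meter in one TEC unit


def chapman_density(h_km, nm_f2=1.0e12, hm_f2=300.0, scale_km=60.0):
    """Electron density (el/m^3) of an illustrative Chapman layer.

    All three parameters are hypothetical demonstration values.
    """
    z = (h_km - hm_f2) / scale_km
    return nm_f2 * math.exp(0.5 * (1.0 - z - math.exp(-z)))


def tec_tecu(heights_km, densities):
    """Trapezoidal integral of density over height, returned in TECU."""
    total = 0.0
    for i in range(1, len(heights_km)):
        dh_m = (heights_km[i] - heights_km[i - 1]) * 1000.0  # km -> m
        total += 0.5 * (densities[i] + densities[i - 1]) * dh_m
    return total / TECU


# Vertical path from 100 km to 1000 km in 5 km steps (assumed limits)
heights = [100.0 + 5.0 * i for i in range(181)]
profile = [chapman_density(h) for h in heights]
vertical_tec = tec_tecu(heights, profile)
print(f"Vertical TEC of the sketch profile: {vertical_tec:.1f} TECU")
```

As a sanity check of the unit conversion, a constant density of $10^{12}$ electrons per cubic meter over a 100 km column gives $10^{17}$ electrons per square meter, that is, 10 TECU.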
In [19], the Support Vector Machine (SVM), a machine learning technique, is applied to GPS-TEC data for the detection of earthquake precursors. In [20] and [21], multiple regression models are used to investigate the regional and global trends of TEC. In this study, the unstable response of the ionosphere is observed continuously for 11 days, and the performance of four regression models, namely Gaussian Process Regression (GPR), Regression Trees (RT), Linear Regression (LR) and Support Vector Machines (SVM), is compared on these data. The 11-day TEC data are analyzed with these regression methods, and the most suitable and best-performing method is determined for the forecasting model. The R-squared ($R^2$), Mean Square Error (MSE), Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are computed to compare the regression models. The data and the regression models are described in Section 2, and the results are given in Section 3.

THE DATA AND THE REGRESSION MODELS
In this study, GPS-TEC data are estimated using the Regularized Estimation (Reg-Est) algorithm. The Reg-Est algorithm provides GPS-TEC for a single station in the local zenith direction [5,22]. With the Reg-Est algorithm, TEC data are estimated as IONOLAB-TEC for the desired coordinates and days with a time resolution of 2.5 minutes [23,24].
The Golcuk earthquake is chosen for the purpose of the study. The earthquake occurred on August 17, 1999 at 00:01 UTC at the coordinates 40. The IONOLAB-TEC vector for any station $u$ and day $d$ can be defined as follows:

$$\mathbf{x}_{u;d} = [x_{u;d}(1) \; \ldots \; x_{u;d}(n) \; \ldots \; x_{u;d}(N)]^T$$

Here, $N$ is the total number of samples, $T$ is the transpose operator and $1 \le n \le N$.
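The length of one daily IONOLAB-TEC vector follows directly from the 2.5-minute time resolution. The sketch below works this out; it assumes, purely for illustration, that the first sample of a day is taken at 00:00:00 UTC, and the function names are hypothetical.

```python
SAMPLE_MINUTES = 2.5  # IONOLAB-TEC time resolution stated in the text


def samples_per_day(resolution_min=SAMPLE_MINUTES):
    """Number of samples N in one daily vector."""
    return int(round(24 * 60 / resolution_min))


def sample_time_utc(n, resolution_min=SAMPLE_MINUTES):
    """(hour, minute, second) of the n-th sample, with 1 <= n <= N.

    Assumes the first sample falls at 00:00:00 UTC (an assumption,
    not stated in the source).
    """
    seconds = int(round((n - 1) * resolution_min * 60))
    return seconds // 3600, (seconds % 3600) // 60, seconds % 60


N = samples_per_day()
print(N, sample_time_utc(1), sample_time_utc(N))
```

Under these assumptions each daily vector holds 576 samples, so an 11-day record for one station contains 6336 values.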

Linear Regression
The Linear Regression (LR) technique is used to evaluate the relations between two or more variables. Regression analysis is used in forecasting. Care is taken to ensure that the selected model is suitable for making predictions from the data set and for correcting an error in a system or process [27]. Linear Regression estimates the dependent variable using the coefficients of one or more independent variables. For station $u$ and sample index $n$, the linear relation between the $k$ independent variables $x_{u;d_1}(n), x_{u;d_2}(n), \ldots, x_{u;d_k}(n)$ and the dependent variable $y_{u;d}$ is expressed as follows [28]:

$$y_{u;d} = \beta_0 + \beta_1 x_{u;d_1}(n) + \beta_2 x_{u;d_2}(n) + \ldots + \beta_k x_{u;d_k}(n) + \varepsilon$$

where $d_1$ and $d_k$ represent the initial and the final day of the data set. The $\beta$ values in the equation are the coefficients of the model, and $\beta_0$ indicates the point where the fitted line intersects the $y_{u;d}$ axis. $\varepsilon$ is the error term [28].
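A linear model of this kind, with an intercept, one coefficient per independent variable and an error term, can be fitted by ordinary least squares. The following is a minimal sketch that solves the normal equations with plain Python; it is not the implementation used in the study, the data are synthetic, and it omits the interaction terms of the Interactions Linear Regression variant compared later.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_regression(features, target):
    """Return [intercept, coef_1, ..., coef_k] from the normal equations."""
    design = [[1.0] + list(row) for row in features]  # column of ones = intercept
    k = len(design[0])
    ata = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve_linear_system(ata, aty)


def predict_linear(beta, row):
    """Evaluate the fitted linear model for one feature row."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Synthetic example: two "previous day" values per sample, noise-free target
features = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]]
target = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in features]
beta = fit_linear_regression(features, target)
print([round(b, 6) for b in beta])
```

Because the synthetic target is an exact linear function of the features, the fit recovers the generating intercept and coefficients up to rounding error.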

Regression Trees
In the Regression Tree (RT) model, $y_{u;d}$ is the target variable to be classified or generalized, and $\mathbf{x}_{u;d}$ is the vector used for classification or generalization, which consists of features such as $x_{u;d_1}(n), x_{u;d_2}(n), \ldots, x_{u;d_k}(n)$.
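The core step of a regression tree is choosing a threshold on one feature that splits the samples into two groups whose target values are each as homogeneous as possible. The sketch below shows a single such split (a depth-one tree) that minimizes the total squared error; it is a simplified illustration with hypothetical function names, not the Fine Tree configuration used in the study.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimizing total squared error."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between equal feature values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            threshold = 0.5 * (pairs[i][0] + pairs[i - 1][0])
            best = (cost, threshold, mean(left), mean(right))
    if best is None:
        raise ValueError("all feature values are identical; no split exists")
    return best[1:]


def predict_stump(split, value):
    """Predict with the leaf mean on the matching side of the threshold."""
    threshold, left_mean, right_mean = split
    return left_mean if value < threshold else right_mean


split = best_split([1, 2, 3, 10, 11, 12], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
print(split, predict_stump(split, 2.5))
```

A full tree applies the same search recursively to each side of the split until a stopping rule, such as a minimum leaf size, is reached.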

Support Vector Machines
The performance of the Support Vector Machine (SVM) model depends on the analyzed data, and different data yield different accuracies [30]. In practical SVM applications, kernel functions are generally chosen according to the data and the parameters, and the SVM uses the outputs of these kernel functions. The Gaussian SVM uses a Gaussian kernel in which $r$ is the width of the Gaussian [30]. Different Gaussian SVMs yield different classification accuracies. For the medium Gaussian SVM, $r = \sqrt{P}$, where $P$ is the number of features [30].
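The role of the Gaussian width can be made concrete with a small kernel function. The sketch below is an illustration only: the normalization $\exp(-\lVert a - b \rVert^2 / (2r^2))$ is an assumed convention that may differ from the form used in [30], and the function names are hypothetical. The width for the medium Gaussian SVM is set to the square root of the number of features, as stated in the text.

```python
import math


def gaussian_kernel(a, b, r):
    """Gaussian kernel exp(-||a - b||^2 / (2 r^2)).

    The factor 2 in the denominator is an assumed convention.
    """
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * r ** 2))


def medium_gaussian_width(num_features):
    """Width of the medium Gaussian SVM: square root of the feature count."""
    return math.sqrt(num_features)


r = medium_gaussian_width(10)  # e.g. ten previous days used as features
a = [1.0] * 10
b = [1.5] * 10
print(r, gaussian_kernel(a, b, r))
```

A larger width makes the kernel value decay more slowly with distance, so distant samples influence each other more and the fitted function becomes smoother.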

Gaussian Process Regression
Gaussian Process Regression (GPR) is used in various applications such as experimental design, multivariate regression, model approximation and prediction. GPR operates within a probabilistic regression framework and takes a training data set as input. For the input vector $\mathbf{x}_{u;d}$, the GPR output can be defined as follows [31]:

$$y_{u;d} = f(\mathbf{x}_{u;d}) + \varepsilon_{u;d}$$

where $\varepsilon_{u;d}$ is the error. The function values are constrained to be correlated with one another according to a Gaussian distribution [31].
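To show how a GPR prediction is formed from a training set, the sketch below computes the posterior mean of a zero-mean Gaussian process with an exponential kernel, $k(x, x') = \sigma_f^2 \exp(-|x - x'| / \ell)$, on a one-dimensional input. It is an illustration under stated assumptions, not the study's implementation: the zero prior mean, the hyperparameter values (`sigma_f`, `length`, `noise`) and the toy data are all hypothetical, and no hyperparameter optimization is performed.

```python
import math


def exponential_kernel(x1, x2, sigma_f=1.0, length=1.0):
    """Exponential covariance between two scalar inputs."""
    return sigma_f ** 2 * math.exp(-abs(x1 - x2) / length)


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def gpr_predict(train_x, train_y, test_x, noise=1e-6, sigma_f=1.0, length=1.0):
    """Posterior mean k_*^T (K + noise * I)^(-1) y of a zero-mean GP."""
    n = len(train_x)
    k = [[exponential_kernel(train_x[i], train_x[j], sigma_f, length)
          + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    alpha = solve_linear_system(k, list(train_y))
    return [sum(exponential_kernel(x, train_x[i], sigma_f, length) * alpha[i]
                for i in range(n)) for x in test_x]


train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 2.0, 1.5, 0.5]
print(gpr_predict(train_x, train_y, [0.5, 1.0, 2.5]))
```

With a very small noise term the posterior mean passes almost exactly through the training points, and far from the data it falls back to the assumed prior mean of zero.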

The Errors
The R-squared ($R^2$) is a statistical measure of how close the samples are to the curve fitted to the data. $R^2$ is also known as the coefficient of determination or, for multiple regression, the coefficient of multiple determination [34].
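The Mean Absolute Error (MAE) is defined as a measure of the difference between two continuous variables [35]:

$$\mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} \left| y(n) - \hat{y}(n) \right|$$

where $y(n)$ and $\hat{y}(n)$ are the actual and the predicted values and $N$ is the number of samples. The MAE is a linear score that measures the average magnitude of the errors in a set of forecasts without considering their direction, with all individual errors weighted equally in the average.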
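The four error measures used in this study can be computed together from a set of actual and predicted values. The sketch below uses the standard textbook definitions ($R^2$ as one minus the ratio of the residual to the total sum of squares, and RMSE as the square root of MSE); the function name and the example values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return R2, MSE, RMSE and MAE for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

For these example values the errors are -0.5, 0, 0.5 and 0, so MSE is 0.125, MAE is 0.25 and $R^2$ is 0.9. A good model has $R^2$ close to 1 and small MSE, RMSE and MAE.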

RESULTS AND DISCUSSION
In this study, four regression models, namely Interactions Linear Regression (ILR), Fine Tree (FT), Medium Gaussian SVM (MGSVM) and Exponential Gauss Process Regression (EGPR), are compared using GPS-TEC data. The GPS-TEC data are estimated as IONOLAB-TEC as described in Section 2. IONOLAB-TEC is obtained for the three IGS stations ankr, sofi and tubi for the 11-day period between 07 August 1999 and 17 August 1999. The data set includes the total solar eclipse of 11 August 1999 and the Golcuk earthquake of 17 August 1999. In the first step, the cross-validation technique is used for all models. Cross-validation sets aside segments of the training data as validation data. After a model is fitted to the data, its performance is evaluated against each new validation set, which gives a better estimate of how the model will perform when predicting new observations. In the second step, the regression model is selected and trained. For each station, the eleventh day is predicted from the ten preceding days with each of the four models. Figures 1, 2 and 3 show the actual and the predicted values. When the four regression models are compared over these three figures, it is observed that the model with the greatest overlap between actual and predicted values is ILR for all three stations. The ILR model gives the closest prediction of the IONOLAB-TEC, and the difference between the actual and the predicted values is quite small for this model. The predicted values for the tubi station, which is 38 km from the epicenter, are very close to the actual values. For the sofi station, which is 573 km from the epicenter, the difference between the predicted and the actual values is slightly larger than for the tubi station. After a regression model is trained, the response plot displays the predicted response against the record number. Since cross-validation is used in this study, these predictions are predictions for the held-out (validation) observations.
Each prediction is obtained using a model trained without the corresponding observation. For the 11-day IONOLAB-TEC data set, the model with the best prediction accuracy among the applied regression models is ILR for all three stations. The model whose performance is closest to that of ILR is the EGPR model.
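The cross-validation procedure described above, in which each prediction comes from a model trained without the corresponding observation, can be sketched as follows. The number of folds is not stated in the text, so the default of five is an assumption, and the training-mean "model" is only a hypothetical placeholder for any of the four regression models.

```python
def kfold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous validation folds."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for f in range(n_folds):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_predict(x, y, fit, predict, n_folds=5):
    """Predict every sample with a model trained on the other folds only."""
    predictions = [None] * len(y)
    for fold in kfold_indices(len(y), n_folds):
        held_out = set(fold)
        train_x = [x[i] for i in range(len(y)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        model = fit(train_x, train_y)
        for i in fold:
            predictions[i] = predict(model, x[i])
    return predictions


# Placeholder model: always predicts the mean of its training targets
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x_value):
    return model


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0, 4.0]
held_out_predictions = cross_val_predict(x, y, fit_mean, predict_mean, n_folds=4)
print(held_out_predictions)
```

With four folds on four samples this reduces to leave-one-out validation, so the first prediction is the mean of the remaining three targets. The error measures are then computed on these held-out predictions rather than on values the model has already seen.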
In the second step of the study, the performances of the models are compared by computing the R-squared ($R^2$), MSE, RMSE and MAE values. In this study, different regression models are applied to the IONOLAB-TEC data sets obtained from the IGS stations tubi, ankr and sofi during the 11-day period between 07 August 1999 and 17 August 1999. Among the ILR, FT, MGSVM and EGPR models, the models whose predictions are closest to the actual values are ILR and EGPR. As a result, considering the error rates, it is concluded that the EGPR model is a very successful and high-performance model for the prediction of TEC.

CONCLUSION
In this study, four regression models, namely Interactions Linear Regression (ILR), Fine Tree (FT), Medium Gaussian SVM (MGSVM) and Exponential Gauss Process Regression (EGPR), are compared using GPS-TEC data. The GPS-TEC data are estimated as IONOLAB-TEC using the Regularized Estimation (Reg-Est) algorithm. The IONOLAB-TEC is estimated for the three IGS stations ankr, sofi and tubi during the 11-day period between August 07 and 17, 1999. The four regression models are applied to predict the 11th-day IONOLAB-TEC data obtained from the three stations. Four performance metrics, the R-squared ($R^2$), Mean Square Error (MSE), Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), are computed to measure the margin of error between the actual and the predicted values of the IONOLAB-TEC. Among the applied regression models, the ILR and EGPR models make the closest predictions. These two models have the smallest RMSE, MSE and MAE values and an $R^2$ value of almost 1. Consequently, it is observed that the ILR and EGPR models are very successful and high-performance models for the prediction of GPS-TEC.