Reference evapotranspiration estimation with k-Nearest Neighbour and Artificial Neural Network models using different climate input variables in a semi-arid environment

Accurate estimation of reference evapotranspiration (ETo) is important for agricultural water management and irrigation planning. This study investigated the performance of k-Nearest Neighbour (kNN) and Artificial Neural Network (ANN) models in estimating daily ETo using four combinations of climatic data. Daily meteorological data from 1996-2015 in the Middle Anatolia region were used for model training and testing. The ETo estimates of the kNN and ANN models were compared with those of the FAO Penman-Monteith equation. The kNN model performed better than the ANN model in all combinations. In the testing subset, the statistical indicators of the kNN model ranged from 0.541 to 0.031 mm² day⁻² for MSE, 0.735 to 0.175 mm day⁻¹ for RMSE, 0.547 to 0.124 mm day⁻¹ for MAE and 0.900 to 0.994 for R². Therefore, the kNN model can be recommended for the estimation of reference evapotranspiration with full and limited climatic data.


Introduction
Evapotranspiration (ET) can be described as water loss into the atmosphere via plant transpiration and soil evaporation (Landeras et al 2008; Fan et al 2018). Water resources are significantly reduced in semi-arid and arid environments as a consequence of climate change. In these regions, where water shortage is a major problem, it is essential to estimate water loss by ET. Precise prediction of ET is therefore an imperative step in water management, especially in areas facing water scarcity.
Numerous methods have been recommended for estimating ET, each with its own benefits and limitations. Methods that depend on direct measurement are costly and difficult to use. A more economical and practical alternative is therefore to develop tools based on mathematical models that use climate variables measured at meteorological stations.
The Penman-Monteith equation is frequently applied because the Food and Agriculture Organization of the United Nations recommends it as the standard method (FAO PM) for reference evapotranspiration (ETo) estimation. In the literature (Lopez-Urrea et al 2006; Ali & Shui 2009; Pereira et al 2015), the method has been evaluated under different time steps and environmental conditions. To calculate ETo, the method requires many climatic input parameters (Feng et al 2017), which is a major disadvantage. Moreover, ET is a complicated process that depends on a large number of good-quality climatic parameters, so it is difficult to represent in an empirical model. Meteorological data are very limited, especially in developing countries, which is a further obstacle to using the FAO PM method. Simplified empirical methods with fewer climatic input variables have therefore attracted interest for ETo estimation (Hargreaves & Samani 1985; Trabert 1896; Priestley & Taylor 1972). However, these methods give less accurate results for daily ETo estimation than for weekly and monthly estimation (Torres et al 2011).
Interest in machine learning methods for ETo estimation has increased over the last two decades (Kisi & Cimen 2009; Feng et al 2016; Tangune & Escobedo 2018) because these non-parametric methods can work without specific knowledge about the variables used in the models (Kişi 2015; Yamaç & Todorovic 2020). One of the most common machine learning methods for ETo prediction is the artificial neural network (ANN) model. Ferreira et al (2019) investigated the ANN and the support vector machine (SVM) for predicting ETo in Brazil using different climatic variables. Their findings showed that the ANN gave the best results for the temperature and relative humidity-based models.
Antonopoulos & Antonopoulos (2017) compared the ANN model with empirical equations for the prediction of ETo in Greece. They pointed out that the ANN model performed better than the empirical equations. Landeras et al (2008) studied the prediction of ETo using empirical equations and the ANN in Spain, and also found that the ANN performed better than the empirical equations. Traore et al (2010) used the ANN to predict daily ETo in the Sudano-Sahelian zone. Citakoglu et al (2014) evaluated the estimation of monthly ETo using adaptive network based fuzzy inference system (ANFIS) and ANN models in Turkey. The ANFIS performed slightly better than the ANN. Kisi (2016) investigated M5 Model Tree (M5Tree), multivariate adaptive regression splines (MARS) and least square support vector regression (LSSVR) methods in Turkey. The overall results indicated that the LSSVR gave the best results with local output and input variables, while the MARS model performed best in estimating ETo when local output and input data were lacking.
The goal of this study is to compare kNN and ANN models against the standard FAO PM method using four combinations of meteorological data for the prediction of ETo. The paper thus aims to assess modelling accuracy for ETo prediction in the semi-arid environment of Turkey by comparing a recognized and widely used model (ANN) with a more recently used model (kNN), from the first to the fourth combination, that is, from fewer to more meteorological inputs.

FAO Penman-Monteith
The FAO PM equation was used for prediction of daily ETo:

$$ET_o = \frac{0.408\,\Delta\,(R_n - G) + \gamma\,\dfrac{900}{T + 273}\,U_2\,(e_s - e_a)}{\Delta + \gamma\,(1 + 0.34\,U_2)}$$

where ETo is the reference evapotranspiration (mm day⁻¹), Rn is the net solar radiation (MJ m⁻² day⁻¹), G is the soil heat flux density (MJ m⁻² day⁻¹), T is the mean daily air temperature (°C), Δ is the slope of the saturated vapour pressure curve (kPa °C⁻¹), γ is the psychrometric constant (0.066 kPa °C⁻¹), es is the saturation vapour pressure (kPa), ea is the actual vapour pressure (kPa) and U2 is the mean daily wind speed (m s⁻¹). T and U2 were measured at 2 m height.
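The es was estimated as:

$$e_s = \frac{e^0(T_{max}) + e^0(T_{min})}{2}$$

where e⁰(T) is the saturation vapour pressure (kPa) at air temperature T, and Tmin and Tmax are the minimum and maximum daily air temperature (°C), respectively. The e⁰(T) was calculated as:

$$e^0(T) = 0.6108\,\exp\!\left(\frac{17.27\,T}{T + 237.3}\right)$$

The ea was calculated as:

$$e_a = \frac{RH_{mean}}{100}\left[\frac{e^0(T_{max}) + e^0(T_{min})}{2}\right]$$

where RHmean is the mean daily relative humidity (%).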
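The calculation chain described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the slope Δ is computed with the standard FAO-56 expression (not given in the text), the mean temperature is taken as the average of Tmin and Tmax, and G is set to zero by default, as is common for daily time steps.

```python
import math

GAMMA = 0.066  # psychrometric constant (kPa/°C), value stated in the text


def sat_vapour_pressure(t):
    """e0(T): saturation vapour pressure (kPa) at air temperature t (°C)."""
    return 0.6108 * math.exp(17.27 * t / (t + 237.3))


def slope_vapour_curve(t):
    """Delta (kPa/°C); assumed FAO-56 form, not spelled out in the text."""
    return 4098.0 * sat_vapour_pressure(t) / (t + 237.3) ** 2


def fao_pm_eto(tmin, tmax, rh_mean, u2, rn, g=0.0):
    """Daily ETo (mm/day) from the FAO PM equation.

    tmin, tmax in °C, rh_mean in %, u2 in m/s, rn and g in MJ m-2 day-1.
    """
    t = (tmin + tmax) / 2.0  # assumption: mean of daily extremes
    es = (sat_vapour_pressure(tmax) + sat_vapour_pressure(tmin)) / 2.0
    ea = rh_mean / 100.0 * es  # actual vapour pressure from RHmean
    delta = slope_vapour_curve(t)
    numerator = (0.408 * delta * (rn - g)
                 + GAMMA * 900.0 / (t + 273.0) * u2 * (es - ea))
    denominator = delta + GAMMA * (1.0 + 0.34 * u2)
    return numerator / denominator


# Hypothetical input values for one day, chosen only for illustration.
example_eto = fao_pm_eto(tmin=12.0, tmax=28.0, rh_mean=50.0, u2=2.0, rn=15.0)
print(round(example_eto, 2))
```

The values passed in the last lines are made up for the example and do not come from the study's dataset.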

k-Nearest Neighbour
The kNN is a simple classification method, presented by Cover & Hart (1967), and is one of the most widely used machine learning methods (Gocić et al 2008). It is non-parametric, easy to implement, and obtains efficient and competitive results. These advantages make the method more attractive than many other machine learning methods. Figure 2 shows a schematic illustration of the kNN for two classes with k=1 and k=3. In Figure 2a, the single known sample (-) nearest to sample X is used to categorize X; in Figure 2b, the three nearest samples to X (+) are used for categorization. The present study applied the Euclidean distance (Equation 2), which can be written as:

$$x = \sqrt{\sum_{n=1}^{N} (a_n - b_n)^2}$$

where x is the Euclidean distance, a and b are data points in N dimensions, and n is the dimension index.
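To make the use of the distance concrete for a continuous target such as ETo, the following minimal Python sketch predicts a value as the unweighted mean of the k nearest training targets. The averaging rule and the function names are assumptions made for illustration; the text does not state how the neighbours' values were combined.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((an - bn) ** 2 for an, bn in zip(a, b)))


def knn_predict(train_x, train_y, query, k=5):
    """Predict a target as the mean of the k nearest training targets.

    Assumption: unweighted averaging of the neighbours' targets.
    """
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    nearest = [y for _, y in ranked[:k]]
    return sum(nearest) / len(nearest)


# Toy one-dimensional data, invented for the example.
toy_x = [[0.0], [1.0], [2.0], [10.0]]
toy_y = [0.0, 1.0, 2.0, 10.0]
print(knn_predict(toy_x, toy_y, [1.0], k=3))
```

With standardized inputs, each climatic variable contributes on a comparable scale to the distance, which is one reason scaling usually precedes a kNN model.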

Artificial Neural Network
The ANN is a numerical model that was designed to emulate the behaviour of a biological neural system. The structure of ANN models is similar to that of a biological brain, with numerous layers of connected neurons (Landeras et al 2008). In recent decades, the ANN has been applied in hydrological and agricultural studies (Kumar et al 2011). The general architecture of the ANN is shown in Figure 3. The model is able to learn, memorize and create relationships between weighted neurons from a training dataset. When the testing data are fed into the system, the model applies the learned relationships between neurons and assigns the data to the appropriate class. The well-known structure of an ANN model consists of an input layer, where the data are entered; one or more hidden layers, where the data are processed; and an output layer, which gives the results (Yamaç et al 2020).

Model development and performance evaluation
The kNN and ANN models were developed to simulate and estimate daily ETo in a semi-arid environment. To establish the kNN and ANN models, six climatic variables (wind speed, solar radiation, minimum and maximum relative humidity, and minimum and maximum air temperature) were employed as inputs, while ETo was employed as the output variable. Correlations between these climatic variables and ETo are shown in Table 2. The correlation matrix was developed to identify which climatic variables are most strongly related to ETo. Based on the correlation matrix, the variable with the next strongest correlation was added to form each successive combination. Table 3 shows the different input combinations for the models.
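The idea of ordering candidate inputs by their correlation with ETo can be sketched as follows. This is an illustrative sketch only: the use of the Pearson coefficient, ranking by its absolute value, the function names and the toy numbers are all assumptions, since the text does not specify these details.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def rank_inputs(variables, eto):
    """Variable names sorted by descending absolute correlation with ETo."""
    scores = {name: pearson(values, eto) for name, values in variables.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)


# Invented values for five days, used only to exercise the functions.
toy_eto = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_variables = {
    "U2": [2.0, 1.0, 3.0, 2.0, 2.0],
    "Tmax": [10.0, 12.0, 14.0, 16.0, 18.0],
    "RHmin": [80.0, 70.0, 65.0, 50.0, 45.0],
}
print(rank_inputs(toy_variables, toy_eto))
```

Ranking by absolute value treats a strong negative relation (for example, humidity) as just as informative as a strong positive one.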
Before the models were run, all variables were standardized to a common scale. The standardization equation is defined as:

$$z = \frac{x - \mu}{\sigma}$$

where z is the standardized value, σ is the standard deviation, µ is the mean value and x is the original data.
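A standardization based on the mean µ and the standard deviation can be sketched as below. The function name is hypothetical, and the use of the population standard deviation (dividing by the number of values) is an assumption, since the text does not say which form was used.

```python
import math


def standardize(values):
    """Return (x - mu) / sigma for every x in a list of numbers.

    Assumption: sigma is the population standard deviation.
    """
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))
    return [(x - mu) / sigma for x in values]


# Invented temperatures (°C), used only for illustration.
z_scores = standardize([2.0, 4.0, 6.0])
print([round(z, 3) for z in z_scores])
```

In a train/test workflow, µ and the standard deviation would typically be computed on the training subset and reused for the testing subset, so that no information from the test data leaks into the model.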
The performance of the kNN and ANN models was appraised using the coefficient of determination (R²), Nash-Sutcliffe model efficiency coefficient (NSE), mean absolute error (MAE), root mean square error (RMSE) and mean squared error (MSE) in the training and testing subsets. A model performs well when its MAE, RMSE and MSE values are small and its NSE and R² values are high.
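The five indicators can be computed from paired observed and predicted values as in the following sketch. The function name is hypothetical, and R² is taken here as the squared Pearson correlation between observations and predictions, which is an assumption because the text does not give the formulas.

```python
import math


def evaluate(observed, predicted):
    """Return MSE, RMSE, MAE, NSE and R2 for paired series."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    sse = sum(e ** 2 for e in errors)           # sum of squared errors
    mse = sse / n
    mae = sum(abs(e) for e in errors) / n
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    ss_obs = sum((o - mean_o) ** 2 for o in observed)
    ss_pred = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted))
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "NSE": 1.0 - sse / ss_obs,               # Nash-Sutcliffe efficiency
        "R2": cov ** 2 / (ss_obs * ss_pred),     # assumed: squared Pearson r
    }


# Invented observed/predicted ETo values (mm/day), for illustration only.
scores = evaluate([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(scores)
```

Because RMSE is the square root of MSE, the two always rank models identically; RMSE is simply expressed in the same units as ETo.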

Results and discussion
The kNN and ANN models with four combinations of climatic input data were evaluated for the training and testing subsets. The findings showed that the kNN and ANN models were able to describe the nonlinear relationships between the meteorological variables and to estimate daily ETo values adequately. The performance metrics of the models, including MSE, RMSE, MAE and R², are presented in Tables 4 and 5 for the prediction of daily ETo. As can be seen in Tables 4 and 5, all the applied kNN and ANN models gave accurate daily ETo estimates in the training and testing subsets. Overall, the kNN4 model performed best, whereas the ANN1 model had the lowest performance in the testing subset.
The best accuracy of the kNN in estimating daily ETo under the four input combinations, over both training and testing subsets, was observed when k was set to 5. For the ANN model, 5 was identified as the number of neurons in the hidden layer, and the best performance was obtained with a 2(3,4,6)-5-1 structure for daily ETo estimation. That is, the model consists of 2, 3, 4 and 6 neurons in the input layer for the first, second, third and fourth combinations, respectively, 5 neurons in the hidden layer and 1 neuron in the output layer. The rectified linear unit function was employed as the activation function in this study (Figure 4).
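The n-5-1 structure with a rectified linear unit in the hidden layer can be illustrated with a plain forward pass. This sketch only shows the shape of the computation: the weights are random and untrained, the linear output neuron is an assumption that is common for regression, and all function names are hypothetical.

```python
import random


def relu(z):
    """Rectified linear unit: passes positive values, clips negatives to 0."""
    return max(0.0, z)


def init_network(n_inputs, n_hidden=5, seed=0):
    """Random (untrained) weights; index 0 of each weight list is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(network, x):
    """One forward pass: inputs -> 5 ReLU neurons -> 1 linear output."""
    hidden, output = network
    activations = [relu(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
                   for w in hidden]
    return output[0] + sum(wo * a for wo, a in zip(output[1:], activations))


# One network per input combination: 2, 3, 4 and 6 input neurons.
for n_inputs in (2, 3, 4, 6):
    net = init_network(n_inputs)
    print(n_inputs, forward(net, [0.1] * n_inputs))
```

Training would adjust the weights so that the output approaches the FAO PM ETo values; that step is omitted here.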
In general, the statistical indicators demonstrated that the fourth combination, with complete meteorological data, provided by far the best performance for the kNN and ANN models, while the poorest performance was obtained with the first combination, fed only with maximum and minimum temperature. These findings agree with the literature (Torres et al 2011; Tabari et al 2012), which concludes that more climatic input variables commonly increase modelling accuracy. The result is also in accordance with Fan et al (2018), who indicated that machine learning models with temperature, relative humidity, wind speed and solar radiation inputs perform better than those with fewer meteorological variables in a semi-arid environment. Moreover, the kNN and ANN models with maximum/minimum temperature combined with solar radiation (second combination) performed better than the models with maximum and minimum temperature alone. For the kNN2 model in the testing subset, R² was 0.957, NSE was 0.961, MSE was 0.232, RMSE was 0.458 and MAE was 0.349. For the ANN2 model in the testing subset, R² was 0.941, NSE was 0.923, MSE was 0.322, RMSE was 0.567 and MAE was 0.421. These results demonstrate that, when added to maximum/minimum temperature, solar radiation was a more influential input than wind speed and relative humidity in a semi-arid region. According to the statistical indicators, the kNN and ANN models based on solar radiation and maximum/minimum temperature (kNN2 and ANN2) can also produce satisfactory ETo estimates in the semi-arid environment of Turkey where other meteorological variables are not easily accessible.
Previous studies indicated that employing all meteorological input variables provided the best performance for predicting ETo. Feng et al (2017) predicted daily ETo with random forests (RF) and generalized regression neural network (GRNN) models using different meteorological variables and concluded that models with complete meteorological data are preferable to combinations with fewer meteorological variables. A similar result was reported by Traore et al (2010) when the ANN was used to predict daily ETo in the Sudano-Sahelian zone.
The kNN model performed better than the ANN model in all combinations. This could be explained by the fact that the kNN model concentrates on the characteristics of the nearest neighbours, which resembles the behaviour of the applied climatic variables and their correlation with ETo.
Compared with the present results, larger RMSE and MAE values were reported by Feng & Tian (2019) using the kNN model. From this comparative analysis, it may be concluded that the kNN model is suitable for estimating ETo in the semi-arid environment of Turkey.

Conclusion
This paper presented an application of the kNN and ANN models for the accurate estimation of daily ETo with full and limited meteorological data in a semi-arid environment of Turkey. To identify the optimal approach for estimating daily ETo in this semi-arid region, kNN and ANN models with four different combinations of meteorological input variables were proposed. The recently used kNN model was implemented to estimate daily ETo, to analyse the performance metrics of the different combinations of climatic input data and to compare them with those of a well-known ANN model. The ANN has been applied in many previous studies; therefore, it was used as a comparison model to evaluate the performance of the kNN model in this study.
The statistical performance in the testing and training subsets improved as climatic parameters were added from the first to the fourth combination, showing that accuracy increased with the number of input variables in the kNN and ANN models. In all combinations, the kNN model offered better predictive accuracy and stability than the well-known ANN model. Therefore, the results suggest that the kNN has high potential for ETo prediction in the semi-arid region of Turkey, and possibly in other regions of the world with similar environments. In addition, the overall results showed that combinations with fewer meteorological inputs may be a suitable alternative where full meteorological data sets are not available. This finding is especially important for agricultural lands in developing countries, where the meteorological data needed to estimate ETo are missing.
(ETo: reference evapotranspiration, Tmin: minimum air temperature, Tmax: maximum air temperature, Rn: solar radiation, RHmin: minimum air relative humidity, RHmax: maximum air relative humidity, U2: wind speed).