Detection and Diagnostic Methods of Multiple Influential Points in Binary Logistic Regression Model in Animal Breeding

Multiple influential points adversely affect parameter estimates in binary logistic regression models and lead to misinterpretation of the results. An influential point is a data point that does not follow the general slope of the rest of the data and has an extreme x value. Because the presence of about 10% or more influential points in a dataset affects the parameter estimates, the detection and diagnosis of these points is very important. Graphical methods (such as scatter plots and box plots) and analytical methods are used to detect and diagnose multiple influential points. The most commonly used diagnostic methods include Pearson Residuals, Studentized Residuals, the Hat Matrix, Cook's Distance, DFFITS, and DFBETA. However, when multiple influential points are present, these methods run into masking problems and fail in diagnosis. To cope with this problem, many statisticians have developed and proposed new methods such as the Generalized Standardized Pearson Residual (GSPR) and Generalized Weights (GW). In this study, a dataset containing multiple influential points (15%) on weaning weight (WW), yearling weight (YW), fleece weight (FW), and fertility rate (FR) obtained from Romney sheep was used, and the effect of WW, YW, and FW on FR was modeled with a binary logistic regression model. The aim of the study is to detect multiple influential points by graphical methods and to examine the performance of the commonly used and the newly developed methods in diagnosing these data points. The study found that the commonly used methods masked the multiple influential points, whereas the newly proposed methods diagnosed these points successfully.


Introduction
The Binary Logistic Regression (BLR) model has long been used in animal breeding to analyze the functional relationship between an outcome variable and one or more predictor variables, and it has been studied by many researchers in recent years (Eyduran et al., 2005; Gaskins et al., 2005; Korkmaz et al., 2012; Aktaş and Doğan, 2014; Yakubu et al., 2014; Aktaş et al., 2015; Takma et al., 2016; Erdinç et al., 2017; Baeza-Rodriguez et al., 2018; Gebre et al., 2018). The most important difference between BLR and the general linear regression model is that the outcome variable is binary, coded as 0 or 1. Therefore, the error variance is nonconstant and the error term follows a logistic distribution. BLR assumes that the sample size is adequate, that the predictor variables are not highly correlated, and that the dataset contains no outliers and/or influential points (Hilbe, 2009). The unknown parameters in BLR are estimated by maximum likelihood (ML), but it is well known that ML can be severely affected by the presence of outliers. Outliers are named differently according to their position on the X and Y axes. Both outliers and influential points are measurements that do not fit the trend shown by the rest of the data, but the two concepts should not be mistaken for each other. Specifically, an outlier is an unusual observation whose outcome y does not follow the general slope of the rest of the data, whereas an influential point is a data point that does not follow the general slope of the rest of the data and also has an extreme predictor x value. Parameter estimates obtained in the presence of influential points, in particular, will cause misinterpretation of the results. Moreover, the binary outcomes are likely to be misclassified. Hampel et al. 
(1986) stated that outliers making up about 1-10% of a dataset are normal and can be removed from the dataset; however, if outliers make up more than 10%, a robust estimator is recommended instead of the ML estimator (Midi and Ariffin, 2013). Outliers and influential points often cause problems in the analysis of data in animal and plant breeding. Some researchers have reported that the accuracy of genomic prediction methods used in genomic selection studies is adversely affected by outliers (Via et al., 2012; Heslot et al., 2013; Estaghvirou et al., 2014). Therefore, the detection of outliers or influential points is crucial and must be performed before the analysis. A diagnostic is a quantity computed from the data to identify influential points so that they can be eliminated or corrected. Thus, such observations need to be described and their effects on the model and subsequent analysis should be investigated (Nurunnabi et al., 2010). In recent years, diagnostics and detection have become an almost indispensable part of BLR, and many statisticians have studied diagnostic and detection methods for outliers and/or influential points. Before Imon and Hadi (2008), the diagnostic methods always relied on the detection of outliers. However, subsequent studies have shown that the observations causing significant deviation in parameter estimates are influential points. Since an influential point is also an outlier, the diagnostic methods proposed before Imon and Hadi (2008) remain valid for influential points. However, the general objective of all the new diagnostic methods, including Imon and Hadi (2008) and the subsequent ones, is to detect multiple influential points. Residual-based diagnosis of outliers and/or influential points is known as the most common approach in BLR (Pregibon, 1981; Jennings, 1986; Copas, 1988). 
The most commonly employed diagnostic methods for the identification of outliers in BLR are Pearson Residuals (PR), Standardized Pearson Residuals (SPR), Cook Distance (CD), the Hat matrix, Difference of Fits (DFFITS), and Difference in Beta (DFBETA). However, these methods are only able to identify single outliers. If the dataset contains multiple outliers/influential points, these methods fail to identify them because of the masking and swamping problems (Imon and Hadi, 2008; Habshah et al., 2009; Sanizah et al., 2011). Diagnostic methods have been developed by many statisticians to overcome these problems (Cook, 1977; Pregibon, 1981; Jennings, 1986; Copas, 1988; Hadi and Simonoff, 1993; Imon, 2006; Imon and Hadi, 2008; Habshah et al., 2009; Nurunnabi et al., 2010; Sarkar et al., 2011). The new approaches based on a deleted group are the Generalized Standardized Pearson Residual (GSPR), Generalized Weights (GW), Generalized Difference of Fits (GDFFITS), and Generalized Square Difference in Beta (GSDFBETAS). Studies have shown that these methods successfully cope with masking and swamping problems in datasets with multiple influential points. The prediction of genetic parameters and the accuracy of breeding values are of great importance for animal breeding and animal improvement programs. In addition, estimating the parameters of risk factors that affect economically important traits, such as fertility rate, birth type, and stillbirth rate, plays a critical role in the care and management of livestock. Influential points adversely affect the parameter estimates of such traits. However, to date, their detection in animal breeding has not been evaluated. Therefore, it has become necessary to identify influential points in the datasets in order to obtain accurate parameter estimates. 
Accordingly, the aim of this study is to contribute to these scholarly efforts by introducing various existing diagnostic and detection methods adopted to identify multiple influential points in a dataset analyzed by using BLR in animal breeding.

Material
The animal material of this study consisted of 100 Romney ewes raised in New Zealand. Since the aim of the study was to compare diagnostic methods for multiple influential points, the dataset was arranged to contain 15% influential points. Of the 100 records, 85 were selected from 300 records by random sampling, while 15 were influential points already present in the dataset. Thus, a dataset with 15% influential points was created, and the study was conducted on this dataset.

Method
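A Binary Logistic Regression model was used to determine the influence of weaning weight (WW), yearling weight (YW), and fleece weight (FW) of the ewes on fertility rate (FR). The binary outcome for FR was coded as 1 (lambed) or 0 (unlambed). The mathematical model of BLR was as follows:

$$Y = \pi(X) + \varepsilon, \qquad \pi(X) = \frac{\exp(X\beta)}{1 + \exp(X\beta)}$$

where $Y$ is an $n \times 1$ vector of the outcome variable (FR), which takes the value $y = 1$ or $0$ with probabilities $\pi$ and $1 - \pi$, respectively, $X$ is the $n \times k$ matrix of predictor variables (including the constant), and $\beta$ is the $k \times 1$ vector of parameters. $\varepsilon$ is an $n \times 1$ vector of error terms:

$$\varepsilon_i = \begin{cases} 1 - \pi_i & \text{with probability } \pi_i \\ -\pi_i & \text{with probability } 1 - \pi_i \end{cases}$$

which follows a distribution with mean zero and variance $\pi_i (1 - \pi_i)$. Because the model is nonlinear in the parameters, the logit link function is used to transform it into a linear form:

$$\ln\left(\frac{\pi(X)}{1 - \pi(X)}\right) = X\beta$$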
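The ML fit of the logistic model can be sketched as follows. This is a minimal illustration, not the code used in the study (which was run in R): it assumes a Newton-Raphson (IRLS) solver written with NumPy, and the function name `fit_logistic`, the synthetic data, and the coefficient values in `true_beta` are all hypothetical.

```python
import numpy as np

def fit_logistic(Xd, y, max_iter=50, tol=1e-10):
    """Maximum-likelihood fit by Newton-Raphson (equivalently IRLS).

    Xd is the n x k design matrix (first column of ones), y holds 0/1 outcomes.
    """
    beta = np.zeros(Xd.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(Xd @ beta)))   # fitted probabilities
        v = pi * (1.0 - pi)                        # binomial variances pi(1 - pi)
        score = Xd.T @ (y - pi)                    # gradient of the log-likelihood
        info = Xd.T @ (Xd * v[:, None])            # information matrix X'VX
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic stand-in data (NOT the Romney data): three predictors, binary outcome.
rng = np.random.default_rng(0)
n = 200
Xd = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
true_beta = np.array([0.5, 1.0, -1.0, 0.0])        # hypothetical coefficients
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(Xd @ true_beta)))).astype(float)

beta_hat = fit_logistic(Xd, y)
logit_hat = Xd @ beta_hat                          # linear on the logit scale
pi_hat = 1.0 / (1.0 + np.exp(-logit_hat))          # back-transformed probabilities
print(np.round(beta_hat, 3))
```

At convergence the score vector `Xd.T @ (y - pi_hat)` is numerically zero, which is the ML condition for this model.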
The literature contains many methods for the detection and diagnosis of influential points. All of these methods were first developed for general linear regression and were then adapted to BLR by Pregibon (1981). They can be divided into two groups: graphical and analytical methods. The best-known graphical methods are the scatter, box, and residual plots. However, since the graphical methods fail to provide reliable information, the analytical methods are preferred, especially when the number of predictor variables is high. Many analytical methods have been proposed in the related literature. In this study, the most commonly used analytical methods, which were thought to be most useful for researchers in animal breeding, were adopted. The analytical methods are statistics computed from the dataset that can be used to identify the presence of influential points. Although the main tools of the analytical methods are residuals, the methods developed in recent years are based on the deletion of suspected observations. In BLR, the primary building blocks of the analytical methods used to identify influential points are the residual vector and the projection (leverage) matrix (Pregibon, 1981). Following an approach similar to that of linear regression (Copas, 1988), the $i$th residual is defined in BLR as follows:

$$\hat{\varepsilon}_i = y_i - \hat{\pi}_i, \qquad i = 1, 2, \ldots, n \tag{3}$$

Although the residual, also known as the raw residual, is very important in detecting ill-fitting observations, the residuals defined in equation (3) are unscaled and are therefore not directly applicable to the diagnosis of influential points. Two scaled versions of the residual are commonly used in BLR to eliminate this problem: Pearson Residuals (PR) and Standardized Pearson Residuals (SPR). PR can be defined as:

$$r_i = \frac{y_i - \hat{\pi}_i}{\sqrt{\hat{v}_i}}, \qquad \hat{v}_i = \hat{\pi}_i (1 - \hat{\pi}_i)$$

An observation is considered a residual outlier if the absolute value of its Pearson residual is greater than 3 (Ahmad et al., 2011). 
The Standardized Pearson Residual is obtained by dividing the raw residual by its standard error, $\sqrt{\hat{v}_i (1 - h_{ii})}$, where $\hat{v}_i = \hat{\pi}_i (1 - \hat{\pi}_i)$ and $h_{ii}$ is the $i$th diagonal element of the $n \times n$ matrix known as the hat matrix, $H = V^{1/2} X (X^T V X)^{-1} X^T V^{1/2}$. If a leverage value $h_{ii}$ is unusually large, this may indicate the presence of influential points (Friendly and Meyer, 2015). $V$ is a diagonal matrix with diagonal elements $\hat{v}_i$ (Pregibon, 1981). Hence, the SPR for BLR can be defined as:

$$r_{si} = \frac{y_i - \hat{\pi}_i}{\sqrt{\hat{v}_i (1 - h_{ii})}}$$

In BLR, observations with SPRs less than -3 or greater than +3 are considered outliers (Midi and Ariffin, 2013). Apart from the residual-based methods, other methods identify influential points/outliers by deleting suspect observations. The most common diagnostic statistics adopting the observation deletion approach are Cook Distance (CD), the Hat matrix (Lev), Difference of Fits (DFFITS), and Difference in Beta (DFBETA) (Cook, 1977; Belsley et al., 1980; Nurunnabi et al., 2010). Pregibon (1981) extended CD from linear regression models to BLR as follows:

$$CD_i = \frac{1}{k} \, r_{si}^2 \, \frac{h_{ii}}{1 - h_{ii}}$$

where $k$ is the number of model parameters. An observation with $CD_i > 1$ is regarded as an influential point. Another measure for determining influential points, similar to CD, is the DFFITS value suggested by Welsch (1982). DFFITS is defined in terms of the SPR and Lev values in BLR as follows:

$$DFFITS_i = r_{si} \sqrt{\frac{h_{ii}}{1 - h_{ii}}}$$

An influential point has $|DFFITS_i| > 2\sqrt{k/(n-k)}$ or $3\sqrt{k/(n-k)}$. Although the abovementioned methods often provide effective results in determining influential points, they are effective only if there is a single influential point in the dataset. If there are multiple influential points in the dataset, these methods are ineffective because they suffer from masking and swamping problems (Imon and Hadi, 2008). Therefore, new approaches are needed to prevent these problems from occurring. 
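The single-case diagnostics described above can be computed together once the model has been fitted. The sketch below is illustrative only: it assumes CD is scaled by $1/k$ and that the DFFITS cut-off is $3\sqrt{k/(n-k)}$, and the names `fit_logistic` and `classical_diagnostics` as well as the synthetic data are hypothetical.

```python
import numpy as np

def fit_logistic(Xd, y, n_iter=30):
    """Plain Newton-Raphson ML fit; Xd already contains the column of ones."""
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(Xd @ beta)))
        v = pi * (1.0 - pi)
        beta = beta + np.linalg.solve(Xd.T @ (Xd * v[:, None]), Xd.T @ (y - pi))
    return beta

def classical_diagnostics(Xd, y):
    """Return leverage, PR, SPR, CD and DFFITS for every observation."""
    n, k = Xd.shape
    beta = fit_logistic(Xd, y)
    pi = 1.0 / (1.0 + np.exp(-(Xd @ beta)))
    v = pi * (1.0 - pi)
    xtvx_inv = np.linalg.inv(Xd.T @ (Xd * v[:, None]))
    # h_ii = v_i * x_i' (X'VX)^{-1} x_i, the diagonal of the hat matrix
    h = v * np.einsum("ij,jk,ik->i", Xd, xtvx_inv, Xd)
    pr = (y - pi) / np.sqrt(v)                  # Pearson residuals
    spr = pr / np.sqrt(1.0 - h)                 # standardized Pearson residuals
    cd = spr**2 * h / (k * (1.0 - h))           # Cook distance (assumed 1/k scaling)
    dffits = spr * np.sqrt(h / (1.0 - h))
    cut = 3.0 * np.sqrt(k / (n - k))            # assumed DFFITS cut-off
    flags = {
        "PR": np.flatnonzero(np.abs(pr) > 3),
        "SPR": np.flatnonzero(np.abs(spr) > 3),
        "CD": np.flatnonzero(cd > 1),
        "DFFITS": np.flatnonzero(np.abs(dffits) > cut),
    }
    return {"h": h, "pr": pr, "spr": spr, "cd": cd, "dffits": dffits, "flags": flags}

# Synthetic stand-in data (NOT the Romney data).
rng = np.random.default_rng(1)
n = 100
Xd = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
true_beta = np.array([0.3, 0.8, -0.5, 0.2])     # hypothetical coefficients
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(Xd @ true_beta)))).astype(float)

diag = classical_diagnostics(Xd, y)
print({name: idx.tolist() for name, idx in diag["flags"].items()})
```

A useful sanity check on the leverage values is that they sum to the number of parameters $k$, since the hat matrix is a projection of rank $k$.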
The proposed approaches based on a deleted group in BLR are the Generalized Standardized Pearson Residual (GSPR), Generalized Difference of Fits (GDFFITS), and Generalized Square Difference in Beta (GSDFBETAS) (Imon and Hadi, 2008; Nurunnabi et al., 2010; Nurunnabi and Nasser, 2011). These methods have been obtained by generalizing the existing methods and are based on the deletion of the suspected group from the dataset (Hadi and Simonoff, 1993). Before using these methods, the dataset is examined using scatter plots and possible influential points are identified. Then, the $d$ observations considered to be influential points among the $n$ observations in the dataset are deleted before the model is fitted. $R$ and $D$ denote the sets of "remaining" and "deleted" observations, respectively. The parameters of the model are estimated from the remaining set by ML, and the statistics of the proposed methods are computed with these estimated parameters. Thus, the fitted probabilities based on the $R$ set are defined as:

$$\hat{\pi}_i^{(R)} = \frac{\exp(x_i^T \hat{\beta}^{(R)})}{1 + \exp(x_i^T \hat{\beta}^{(R)})}, \qquad i = 1, 2, \ldots, n \tag{9}$$

After the suspected group is deleted, the residual of the $i$th observation is defined as follows:

$$\hat{\varepsilon}_i^{(R)} = y_i - \hat{\pi}_i^{(R)} \tag{10}$$

The variance and leverage values of the observations are computed by the following equations:

$$v_i^{(R)} = \hat{\pi}_i^{(R)} (1 - \hat{\pi}_i^{(R)}) \tag{11}$$

$$h_{ii}^{(R)} = \hat{\pi}_i^{(R)} (1 - \hat{\pi}_i^{(R)}) \, x_i^T (X_R^T V_R X_R)^{-1} x_i \tag{12}$$

where $X_R$ and $V_R$ are the predictor matrix and the diagonal variance matrix of the remaining set. The proposed GSPR value is obtained from equations (9), (10), (11), and (12) as follows (Nurunnabi and West, 2012):

$$GSPR_i = \begin{cases} \dfrac{y_i - \hat{\pi}_i^{(R)}}{\sqrt{v_i^{(R)} (1 - h_{ii}^{(R)})}} & \text{for } i \in R \\[2ex] \dfrac{y_i - \hat{\pi}_i^{(R)}}{\sqrt{v_i^{(R)} (1 + h_{ii}^{(R)})}} & \text{for } i \in D \end{cases} \tag{13}$$
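The group-deletion idea can be illustrated with a short sketch. It assumes the GSPR form in which the residual from the fit on the remaining set $R$ is scaled by $\sqrt{v(1 - h)}$ for observations in $R$ and by $\sqrt{v(1 + h)}$ for observations in $D$, as in Imon and Hadi (2008). The function names, the synthetic data, and the way the 15 contaminated points are generated are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def fit_logistic(Xd, y, n_iter=30):
    """Plain Newton-Raphson ML fit; Xd already contains the column of ones."""
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(Xd @ beta)))
        v = pi * (1.0 - pi)
        beta = beta + np.linalg.solve(Xd.T @ (Xd * v[:, None]), Xd.T @ (y - pi))
    return beta

def gspr(Xd, y, deleted):
    """Generalized standardized Pearson residuals for a suspected group `deleted`."""
    in_d = np.zeros(len(y), dtype=bool)
    in_d[deleted] = True
    beta_r = fit_logistic(Xd[~in_d], y[~in_d])       # ML fit on the remaining set R
    pi = 1.0 / (1.0 + np.exp(-(Xd @ beta_r)))        # probabilities for ALL n cases
    v = pi * (1.0 - pi)
    x_r, v_r = Xd[~in_d], v[~in_d]
    a_inv = np.linalg.inv(x_r.T @ (x_r * v_r[:, None]))   # (X_R' V_R X_R)^{-1}
    h = v * np.einsum("ij,jk,ik->i", Xd, a_inv, Xd)       # group-deleted leverages
    scale = np.where(in_d, 1.0 + h, 1.0 - h)              # 1 + h for D, 1 - h for R
    return (y - pi) / np.sqrt(v * scale)

# Synthetic stand-in data (NOT the Romney data): 85 clean cases plus 15 cases
# with an extreme first predictor and an outcome that contradicts the trend.
rng = np.random.default_rng(2)
n_clean, n_bad = 85, 15
X_clean = np.column_stack([np.ones(n_clean), rng.normal(size=(n_clean, 3))])
true_beta = np.array([0.0, 1.5, 0.0, 0.0])           # hypothetical coefficients
y_clean = (rng.random(n_clean) < 1.0 / (1.0 + np.exp(-(X_clean @ true_beta)))).astype(float)
X_bad = np.column_stack([np.ones(n_bad), 6.0 + 0.2 * rng.normal(size=n_bad),
                         np.zeros(n_bad), np.zeros(n_bad)])
y_bad = np.zeros(n_bad)
Xd = np.vstack([X_clean, X_bad])
y = np.concatenate([y_clean, y_bad])

deleted = np.arange(85, 100)                          # suspected group from a plot
g = gspr(Xd, y, deleted)
print(np.flatnonzero(np.abs(g) > 3))                  # indices flagged by |GSPR| > 3
```

Because the model is fitted without the suspected group, the contaminated cases cannot pull the fit toward themselves, which is why their residuals are not masked.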
An observation is declared an influential point when the absolute value of its GSPR exceeds 3. The GDFFITS method suggested by Nurunnabi et al. (2010) is defined in equation (14) in terms of the GSPR in equation (13). If the absolute GDFFITS value corresponding to the $i$th observation exceeds $2\sqrt{k/(n-d)}$ or $3\sqrt{k/(n-d)}$, the observation is regarded as an influential point (Nurunnabi et al., 2010). Another proposed diagnostic for influential points is the GSDFBETA method, suggested and defined by Nurunnabi and Nasser (2011). To detect influential points, the dataset was first analyzed with Maximum Likelihood (ML), and the diagnostic and detection methods were then applied. R version 3.5.1 (R Development Core Team, 2018) was used for both the analysis and the detection.

Results
Descriptive statistics and histograms of the predictor variables used in the study are given in Figure 1. The histograms in Figure 1 show that the distributions of the predictor variables are skewed and that outliers affect the dataset. The most common graphical methods used to determine whether there are outliers in the dataset before analysis are the scatter plot and the box plot. The scatter plots of FR against WW, YW, and FW are shown in Figures 2 and 3. The plots indicate the presence of suspicious observations (between Observations 85 and 100) that can be regarded as multiple influential points. The plots of FR against WW, YW, and FW clearly show that the observations between 85 and 100 may severely distort the covariate pattern. However, scatter plots and box plots alone are not sufficient for diagnosing suspicious observations. For this reason, analytical methods are needed to determine the extent of the influence of the suspicious observations identified by graphical methods. Until the development of the new approaches, the diagnostic methods (CD, PR, SPR, and DFFITS) worked well in the presence of a single outlier, whereas they were inadequate when multiple influential points were present. Table 1 shows the CD, PR, SPR, and DFFITS results for the suspicious observations detected graphically in Figures 2, 3, and 4. Table 1 reveals that, except for the 99th observation in PR and SPR, the values for the suspected observations are below the cut-off limits, so these commonly used diagnostics fail to identify the influential points. CD (using 1 as the cut-off value) and DFFITS (using 0.613 as the cut-off value) fail to identify any influential points in the dataset, whereas PR (using 3 as the cut-off value) and SPR (using 3 as the cut-off value) correctly identify only the 99th observation as an influential point. 
The index plots in Figure 5 show that, unlike CD and DFFITS, PR and SPR clearly identify one influential point, but none of the four methods identifies the remaining suspected observations. This is due to the masking problem that affects these methods when there are multiple influential points. The use of these methods may mislead researchers, and continuing the analysis without removing the suspicious observations may lead to misinterpretation of the parameter estimates. The results of the GSPR, GSDFFITS, and GSDFBETA methods proposed as new approaches are given in Table 2. It is clear from the table that the GSPR, GSDFFITS, and GSDFBETA values for the suspected observations are much larger than those of the other observations and all exceed the cut-off values of 3.00, 0.651, and 0.474, respectively. The advantage of these methods over the other methods is that they are robust to the masking problem. Similar conclusions may be drawn from the index plots of GSPR, GSDFFITS, and GSDFBETA presented in Figure 6. All 15 suspected observations are separated from the other data and correctly identified as influential points.
Figure 6. Index plots of GSPR, GSDFFITS, and GSDFBETA for the dataset.
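As a check on two of the reported cut-off values, the arithmetic below assumes cut-offs of the form $3\sqrt{k/(n-k)}$ for DFFITS and $3\sqrt{k/(n-d)}$ for its group-deletion counterpart, with $k = 4$ parameters (the intercept plus WW, YW, and FW), $n = 100$ observations, and $d = 15$ deleted observations. Both the value of $k$ and the exact cut-off forms are assumptions inferred from the reported numbers.

```latex
\begin{align*}
3\sqrt{\frac{k}{n-k}} &= 3\sqrt{\frac{4}{96}} \approx 0.612 \\
3\sqrt{\frac{k}{n-d}} &= 3\sqrt{\frac{4}{85}} \approx 0.651
\end{align*}
```

Under these assumptions, the second value matches the reported cut-off of 0.651, and the first agrees with the reported 0.613 up to a difference in the third decimal place.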
It is crucial for researchers to be able to identify influential points when analyzing data. The results of the ML analysis of the dataset with and without influential points are presented in Table 3. For both datasets, Table 3 shows that the FW variable had no statistically significant contribution to FR (p>0.10), whereas the WW and YW variables contributed significantly to FR (p<0.05). Furthermore, the coefficients estimated from the dataset without influential points differ from those estimated from the dataset with influential points. As a result, researchers can remove the observations that they detect, both graphically and with the suggested methods, from the dataset, taking into account the size of the dataset and the percentage of influential points it contains.

Discussion and Conclusion
The aim of this study was to compare the performance of the detection and diagnostic methods (graphical and analytical) used in the presence of multiple influential points in a dataset where the effect of the WW, YW, and FW variables on FR is modeled. Outliers/influential points occur in almost all research studies, and observations of this type are a problem in statistical analysis. Therefore, their detection and diagnosis is a crucial issue that needs to be addressed before further analysis is performed. Analyzing data without determining the location and number of these observations adversely affects the parameter estimates. The results of a breeding program and management strategy based on such parameter estimates may differ from the expected outcome, which in turn directly affects the economic situation. Therefore, it is necessary to identify influential points in the datasets in order to obtain accurate parameter estimates. In this study, four of the most commonly used methods and three novel methods for the diagnosis of multiple influential points in BLR were introduced and their performances were compared. The comparison shows that the proposed methods (GSPR, GSDFFITS, and GSDFBETA) are highly effective at identifying multiple influential points in cases where the commonly used diagnostic methods (CD, PR, SPR, and DFFITS) fail.