Vulnerability of the Tukey M Robust Regression Method Against Multicollinearity

Abstract: In this study, we investigate whether the Tukey M robust regression method provides a solution for data sets suffering from multicollinearity. It is observed that high values of the variance inflation factors (VIF), which signal a linear relationship among the explanatory variables, cannot be controlled by robust methods that work through the residuals. The reason is that multicollinearity, and the high VIF values it produces, does not generate extreme residuals; consequently, the robust methods cannot remedy the high-VIF problem. This fact is demonstrated by an extensive simulation study in which the explanatory variables were drawn from a trivariate normal distribution at three different correlation levels. We also used two real-life data examples, whose results support the findings of the simulation study. For all these reasons, we conclude that specialized methods should be utilized in the presence of multicollinearity.


Introduction
Multicollinearity can be defined as a high linear relationship among two or more explanatory variables. It is crucial to understand the causes and the extent of multicollinearity. Thus, it should be determined whether multicollinearity stems from the nature of the variables or is a consequence of the data collection method, which can be helpful in finding remedies for the problem [1].
In a multiple regression analysis, it should first be detected whether multicollinearity exists, because it has many adverse effects on the analysis. In the presence of multicollinearity, the regression coefficients, the extra sums of squares, the variability of the estimated regression coefficients, the fitted values, the predictions and the simultaneous tests of the coefficients can all be negatively affected [2]. Additionally, even if a definite statistical relationship exists between the dependent variable and the set of predictor variables, many of the estimated regression coefficients may individually be statistically insignificant [3].
The measurement of the marginal effect of the explanatory variables is not easy, since the marginal contribution of a predictor variable in reducing the error sum of squares can be affected by the variables already in the regression model. This is because, under multicollinearity, the explanatory variables already included in the model carry almost the same information [3]. The best-known effect of multicollinearity is its capability to inflate the variances of the estimators of the regression coefficients, which also constitutes a barrier to establishing the regression model correctly [2].
There are many tools to detect multicollinearity. Checking the scatter plots and the pairwise correlations between the explanatory variables can be useful, but we should keep in mind that correlation and multicollinearity are not the same thing: there can still be multicollinearity even when all the pairwise correlations are low. A similar diagnostic is to examine the whole correlation matrix of the explanatory variables, which shows all the correlations at once but, as noted, is still not sufficient to establish the existence of multicollinearity. Fortunately, several dedicated multicollinearity detection methods have been developed [3]. Let us sort the eigenvalues of the variance-covariance matrix Σ of the p explanatory variables in descending order as λ₁ ≥ λ₂ ≥ ⋯ ≥ λₚ (see Chatterjee and Hadi [4] for details). If at least one of the eigenvalues is close to zero, there is serious multicollinearity [5].
Many other symptoms of multicollinearity can be observed, including a small determinant of the correlation matrix, implausible signs or sizes of the estimated regression coefficients, unexpected magnitudes of their standard errors and large confidence intervals for the regression coefficients [2,5]. The sum of the reciprocals of the λₖ, k = 1, 2, …, p, is also used as a multicollinearity diagnostic: if

∑ₖ₌₁ᵖ 1/λₖ > 5p,    (1)

then multicollinearity is present. Another measure of multicollinearity is the condition index; the kth condition index is

κₖ = √(λ₁ / λₖ), k = 1, 2, …, p.    (2)

The greater the condition index, the higher the multicollinearity. A condition index between 10 and 30 suggests moderate multicollinearity, while a condition index greater than 30 indicates high multicollinearity [4,6,7].
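As an illustrative sketch of the eigenvalue-based diagnostics above, the following snippet (written in Python/NumPy for illustration, rather than the Matlab used in this study) computes the reciprocal-eigenvalue sum and the condition indices for a hypothetical, strongly collinear design matrix:

```python
import numpy as np

# Hypothetical strongly collinear design: three predictors sharing one factor.
rng = np.random.default_rng(0)
z = rng.standard_normal((50, 1))
X = 0.98 * z + 0.02 * rng.standard_normal((50, 3))

# Eigenvalues of the predictors' correlation matrix, in descending order.
lam = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
p = X.shape[1]

# Rule of thumb: a reciprocal-eigenvalue sum above 5p signals multicollinearity.
print(np.sum(1.0 / lam) > 5 * p)          # True for this design

# Condition indices sqrt(lam_1 / lam_k); a value above 30 indicates
# high multicollinearity.
cond_idx = np.sqrt(lam[0] / lam)
print(cond_idx.max() > 30)                # True for this design
```

Because the three columns are driven by one common factor, the smallest eigenvalue is near zero and both rules flag severe multicollinearity.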
There is another detection method, suggested by Marquardt [5], called the variance inflation factors (VIF). The VIFs are the diagonal elements of the inverse of the variance-covariance matrix of the explanatory variables after the correlation transformation. The use of VIFs is widely recognized for detecting the presence of multicollinearity. A VIF measures how much the variance of a regression coefficient is inflated when the explanatory variables are linearly related, compared to the case when they are linearly independent [3].
Mathematically, the VIFs can be expressed as

VIFₖ = [(X*′X*)⁻¹]ₖₖ,    (3)

where X* is the matrix of the explanatory variables after the correlation transformation; equivalently,

VIFₖ = 1 / (1 − Rₖ²),    (4)

where Rₖ² is the coefficient of multiple determination of Xₖ regressed on the remaining explanatory variables. The larger the VIF, the more the variances of the estimators of the regression coefficients are inflated, and so the higher the severity of the multicollinearity [3].
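The VIF computation takes only a few lines; this Python/NumPy illustration (the study itself used Matlab) exploits the equivalence between 1/(1 − Rₖ²) and the diagonal of the inverse correlation matrix, on made-up data with two near-duplicate predictors:

```python
import numpy as np

def vif(X):
    # VIF_k = 1 / (1 - R_k^2), equivalently the k-th diagonal element of the
    # inverse of the predictors' correlation matrix.
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

rng = np.random.default_rng(1)
x1 = rng.standard_normal(100)
x2 = x1 + 0.05 * rng.standard_normal(100)   # almost a copy of x1
x3 = rng.standard_normal(100)               # unrelated predictor

print(vif(np.column_stack([x1, x2, x3])))   # first two VIFs large, third near 1
```

The near-duplicate pair produces two very large VIFs, while the unrelated predictor stays close to the no-collinearity value of 1.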
In order to handle the multicollinearity problem, there are many suggestions in the literature. One approach is to collect more information or additional data, but this may not be possible in most situations, and even when it is, it may not solve the problem if the additional data suffer from the same defect. Removing one or more explanatory variables from the model, defining new predictors or respecifying the model are other remedies [8]. Another remedy is to use alternative estimation methods that are not influenced as unfavorably by multicollinearity as Least Squares (LS). One is the ridge regression method, which was proposed by Hoerl and Kennard [9] as an alternative to the LS method. Another is the principal component approach, which is based on the eigenvalues and eigenvectors of the correlation matrix of the explanatory variables [4]. Some studies have also focused on robust ridge regression methods, but in this study we investigate whether the Tukey M robust estimation of the regression coefficients provides a simpler solution for the adverse effects of multicollinearity in regression analysis [10,11]. To do so, we conducted a simulation study including the LS method and the Tukey M robust estimation method and examined the effects of multicollinearity on the regression analyses based on them. As a classical robust estimation method, we used the Tukey M estimators via the Matlab robustfit module. Basically, we compared the variances of the regression coefficient estimators produced by these methods. More detailed information about the methods is given in Section 2. Section 3 presents the simulation results and the related comments. Two real-life data examples are given in Section 4 for illustration. The final section contains the discussion and some concluding remarks.

Material and Method
The general linear regression (GLR) model can be given as

Y = Xβ + ε,    (5)

where Y is the n×1 vector of the response variable, X is the n×(p+1) matrix of the explanatory variables, β is the (p+1)×1 vector of the regression parameters, ε is the n×1 vector of the error terms, n is the sample size and p is the number of slope parameters. The assumptions related to Eq. (5) are E(ε) = 0 and Var(ε) = σ²Iₙ, where Iₙ is the n×n identity matrix. Many estimators of the regression parameters have been suggested in the literature. In this study, we include two of them: the LS estimators and one of the most commonly used robust estimators, the Tukey M estimators [12]. It is also reported by Yu and Yao [12] that the Tukey M estimators achieve both robustness and high efficiency for regression models. Here, we intend to observe the differences, if any, between the classical estimators and the Tukey M robust estimators of the regression parameters.
The philosophy of the LS method is to obtain the estimators by minimizing the sum of the squared errors. Theoretically, the LS method can be defined as

min over β of ∑ᵢ₌₁ⁿ εᵢ².    (6)

Since ε = Y − Xβ from Eq. (5), we can also express Eq. (6) in matrix form as

ε′ε = (Y − Xβ)′(Y − Xβ).    (7)

Taking the derivative of Eq. (7) with respect to β and equating it to zero gives the LS estimator of β:

β̂ = (X′X)⁻¹X′Y.
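The closed-form estimator above can be checked numerically; this is a minimal Python/NumPy sketch (the paper's own computations were done in Matlab) on arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])  # intercept + 3 predictors
beta_true = np.array([0.0, 1.0, 1.0, 1.0])
y = X @ beta_true + rng.standard_normal(n)

# beta_hat = (X'X)^{-1} X'y, computed via solve() rather than an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to [0, 1, 1, 1]
```

Solving the normal equations directly is numerically safer than forming (X′X)⁻¹ explicitly, which matters precisely when the columns of X are nearly collinear.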
The variance-covariance matrix of the LS estimator of β is

Var(β̂) = σ²(X′X)⁻¹.

The M estimators were introduced by Huber [13]. The principle of M estimation is the minimization of the sum of a selected ρ function of the errors instead of the sum of their squares. More specifically, the M estimators are found by minimizing

∑ᵢ₌₁ⁿ ρ(εᵢ)

with respect to β. The M estimate for a given sample can be obtained by solving the estimating equation

∑ᵢ₌₁ⁿ ψ(uᵢ) xᵢ = 0,

where ψ = ρ′, the xᵢ are the rows of X and the uᵢ are the standardized residuals. We used the following bisquare function ρ in this study:

ρ(uᵢ) = (1/6)[1 − (1 − uᵢ²)³] if |uᵢ| ≤ 1, and ρ(uᵢ) = 1/6 if |uᵢ| > 1.

There are many proposals for standardizing the residuals; a robust estimator of scale is needed to do so, the most popular being the re-scaled median absolute deviation (MAD). The procedure used in this study, which is the default option of the robustfit module of Matlab, is

uᵢ = rᵢ / (k s √(1 − hᵢ)), with s = MAD/0.6745,

where k is the tuning constant (k = 4.685 for the bisquare), the rᵢ are the raw residuals and the hᵢ are the leverage values. The constant 0.6745 makes the scale estimate unbiased under the normal distribution [13,14].
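The whole procedure can be sketched as an iteratively reweighted least squares (IRLS) loop. The following Python/NumPy code is our own illustration of a Tukey bisquare M estimator with the scaling described above, not MATLAB's robustfit itself; the function name tukey_m and the demo data are assumptions made for the example:

```python
import numpy as np

def tukey_m(X, y, k=4.685, n_iter=50):
    # IRLS sketch: bisquare weights on residuals standardized by
    # k * s * sqrt(1 - h_i), with s = MAD / 0.6745.
    h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))     # leverage values
    beta = np.linalg.solve(X.T @ X, X.T @ y)           # LS starting point
    for _ in range(n_iter):
        r = y - X @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745   # re-scaled MAD
        u = r / (k * s * np.sqrt(1.0 - h))                 # standardized residuals
        w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)  # bisquare weights
        XtW = X.T * w                                      # X'W without forming W
        beta = np.linalg.solve(XtW @ X, XtW @ y)           # weighted LS step
    return beta

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
beta_true = np.array([0.0, 1.0, 1.0])
y = X @ beta_true + rng.standard_normal(n)
y[:5] += 15.0                                   # a few gross outliers
print(tukey_m(X, y))                            # close to [0, 1, 1] despite outliers
```

Note that the weights act on the residuals only: observations with large standardized residuals are downweighted, which is exactly why this scheme is insensitive to collinearity among the columns of X.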

Simulation Results
In order to compare the estimators mentioned in this paper, a simulation study was conducted with two sample sizes and several correlation levels for the explanatory variables. All the programs were written in Matlab for the GLR model given in Eq. (5), but for simplicity the simulations were conducted for the model

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + ε.

In this model, β₀ is the intercept, β₁, β₂ and β₃ are the slope parameters and ε is the error term. We took β₀ = 0 and β₁ = β₂ = β₃ = 1 without loss of generality. Simulations were conducted for nn = 10000/n Monte Carlo runs, for the sample sizes n = 50 and 100, with the correlation levels ρ = 0, 0.95 and 0.98. Since we observed that high VIF values require a correlation of at least about 0.95, we did not conduct simulations for correlations between 0 and 0.95. We simulated the samples with independent and identically distributed error terms from the standard normal distribution and with explanatory variables following a trivariate standard normal distribution with the correlation levels specified above. The simulation results are given in Tables 1 and 2. We used two sample sizes in order to observe the possible effect of increasing the sample size.
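A scaled-down sketch of this simulation design, written in Python/NumPy rather than the Matlab programs used in the study (and with a reduced run count for speed), shows the variance inflation of the LS slope estimator as the correlation grows:

```python
import numpy as np

rng = np.random.default_rng(4)
n, runs = 50, 200
beta = np.array([0.0, 1.0, 1.0, 1.0])   # intercept 0, slopes 1 as in the study

def slope_variance(rho):
    # Empirical variance of the LS estimator of beta_1 over Monte Carlo runs,
    # with trivariate standard normal predictors of common correlation rho.
    cov = np.full((3, 3), rho)
    np.fill_diagonal(cov, 1.0)
    L = np.linalg.cholesky(cov)
    est = []
    for _ in range(runs):
        X = rng.standard_normal((n, 3)) @ L.T
        Xd = np.column_stack([np.ones(n), X])
        y = Xd @ beta + rng.standard_normal(n)
        est.append(np.linalg.solve(Xd.T @ Xd, Xd.T @ y)[1])
    return np.var(est)

v0, v95 = slope_variance(0.0), slope_variance(0.95)
print(v0, v95)   # the variance at rho = 0.95 is many times the rho = 0 value
```

For three equicorrelated predictors with ρ = 0.95, the theoretical VIF is 1/(1 − 2ρ²/(1 + ρ)) ≈ 13, so the slope variance should inflate by roughly that factor relative to ρ = 0, which is what the empirical variances show.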
Based on the simulation results, we first note that, other than the natural effect of the sample size on the variances of the estimators, we did not observe any difference between the results for the two sample sizes. Second, as expected, we did not observe any bias in the estimators in any situation. Regarding the correlation levels, as the correlation increases, the variances of the estimators of the slope parameters (β₁, β₂, β₃) tend to increase for both the LS and the robust estimators. For β₀, we do not observe any difference across the correlation levels. When we compare the LS and robust estimators in terms of their variances, the best performance is shown by the LS estimators even at high correlation levels. This shows that the Tukey M robust method cannot be a remedy for the multicollinearity problem. The reason is that the Tukey M robust estimators are based on the residuals and, based on our observations, multicollinearity does not affect the residuals.
Table 1. The simulated values for n=50 with three levels of correlation

Applications
In this section, we give two real-life data examples for the illustration and comparison between the LS and Tukey M robust estimators.

Body Fat Data
The data set, which is based on body fat, was investigated in detail by Kutner et al. [3]. It contains three explanatory variables (triceps skinfold thickness (X₁), thigh circumference (X₂) and midarm circumference (X₃), all in cm) with a sample size of 20. The dependent variable is the body fat percentage (Y). We obtained the maximum condition index as 53.33 and the VIF values as 708.84, 564.34 and 104.61. We also examined the scatter plots and the correlation matrix of the explanatory variables. All of this information indicates that multicollinearity exists in this data set. The regression coefficient estimates and the standard errors of the regression estimators for the LS and robust methods are given in Tables 3 and 4, respectively. In Table 3 we do not see much difference between the LS and robustfit estimates. Table 4 shows that the standard errors of the LS estimators are smaller than those of their robust counterparts. This result is consistent with the simulations, in which the LS estimators performed better than the Tukey M robust estimators in terms of their variances.

Longley Data
The second real-life data example is the Longley data set. The steps followed in the previous example were repeated here. Tables 5 and 6 show the regression coefficient estimates and the standard errors of the regression estimators for the LS and robust methods, respectively. According to Table 6, the results are consistent with the simulation results and the previous real-life data example: the standard errors of the Tukey M estimators are larger than those of the LS estimators.

Discussion and Conclusion
The main focus of this study is to investigate whether the Tukey M robust estimation method enables us to handle regression analysis in the presence of multicollinearity. First, the simulations show that a correlation of at least about 0.95 between the explanatory variables is needed to obtain high VIF values. The most important result is that classical robust estimators such as the Tukey M estimators cannot be a remedy for the multicollinearity problem; the real-life data examples also support this finding. The reason is that the Tukey M robust regression method focuses on the residuals, but we observed that multicollinearity does not increase the residuals in magnitude, so the Tukey M robust regression estimators cannot cope with this problem. This shows that specialized methods should be utilized for data sets possessing multicollinearity.

Table 2. The simulated values for n=100 with three levels of correlation

Table 3: The regression estimates for the body fat data (LS and robust)

Table 4: The standard errors of the regression estimators for the body fat data (LS and robust)

Table 5: The regression estimates for the Longley data (LS and robust)

Table 6: The standard errors of the regression estimators for the Longley data (LS and robust)