PARAMETER ESTIMATION IN MULTIPLE LINEAR REGRESSION MODELS USING RANKED SET SAMPLING

In statistical surveys, when measuring the variable of interest on the sampling units is expensive but the units can be ranked cheaply with respect to that variable, Ranked Set Sampling (RSS) is a more efficient sampling method than Simple Random Sampling (SRS) for estimating the population mean. In this study, the effects of using RSS in multiple linear regression analysis are considered in terms of the estimation of the model parameters. First, the estimates of the multiple regression model parameters are obtained under RSS and SRS; then the variances of the estimators are compared in a Monte Carlo simulation study based on the Relative Efficiency (RE) measure. It is shown that the estimators based on RSS are more efficient than the estimators based on SRS when the sample size is small.


Introduction
In recent years, especially in ecology, agriculture and medicine, it has become common for the measurement of the variable of interest to be too costly or too difficult in terms of time and labor. In such areas, a sampling method is needed that represents the population as well as possible with the smallest possible sample size. A sampling method with this aim was first suggested by McIntyre [7] under the name Ranked Set Sampling (RSS). McIntyre indicated that RSS is a more efficient sampling method than Simple Random Sampling (SRS) for estimating the population mean.
In RSS, sample selection consists of two stages. In the first stage, $m$ simple random samples of size $m$ are selected from an infinite population; each sample is called a set. Equivalently, $m^2$ units may be drawn from the population and randomly partitioned into $m$ sets of equal size. Then, each

YAPRAK ARZU ÖZDEMİR, A. ALPTEKİN ESİN
of the observations is ranked from smallest to largest within its set according to the variable of interest, say $Y$. This ranking is a low-level measurement that costs little; it can be based on previous experience, on visual inspection, or on a concomitant variable. In the second stage, the first ranked unit from the first set, the second ranked unit from the second set, and so on up to the $m$-th ranked unit from the $m$-th set are taken and measured on $Y$ with a high-level measurement of the desired precision. To provide enough measurements for inference, the entire process (or cycle) can be replicated $r$ times, yielding $n = mr$ measured units out of $m^2 r$ selected units. Under the assumption that there is no ranking error, these $n$ measured units constitute the ranked set sample, denoted by $Y_{(i)j}$, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, r$. The unbiased estimator of the population mean is then defined as [7]

$$\bar{Y}_{RSS} = \frac{1}{mr} \sum_{j=1}^{r} \sum_{i=1}^{m} Y_{(i)j}.$$

Dell and Clutter [6] showed that, even if there are errors in ranking, $\bar{Y}_{RSS}$ is an unbiased estimator of the population mean and, furthermore, that

$$\mathrm{Var}(\bar{Y}_{RSS}) \le \mathrm{Var}(\bar{Y}_{SRS}),$$

where $\bar{Y}_{SRS}$ is the SRS sample mean of size $n$, and equality occurs only if the ranking is so poor as to yield a random sample.

RSS is preferred for obtaining efficient parameter estimates in many statistical analyses. The use of RSS in the linear regression model was first introduced by Stokes [12], who estimated the population mean by modeling the relation between the concomitant variable and the variable of interest. Muttlak [8,9] considered RSS for simple and multiple regression models. In both cases he adopted the unequal variance assumption for the regression errors, but carried out the parameter estimation with the Least Squares Estimation (LSE) method. As a result, under his model assumptions there is no improvement in estimating the model parameters. Barnett and Barreto [1] considered the simple linear regression model with replicated observations obtained from RSS. Optimal L-estimators of the model parameters are obtained under the assumption that the dependent variable has a normal distribution and the ranking is perfect. They showed that the RE gains can be high for the simple linear regression coefficients. Samawi and Ababneh [11] examined RSS in the simple linear regression model under the assumption that the variables in the model, $(X, Y)$, have a bivariate normal distribution. Parameter estimation is again done with the LSE method, under the assumption that the error term of the model has constant variance, as under SRS. They concluded that RSS provides a more efficient way to do regression analysis. Chen and Wang [5] considered optimal sampling schemes with different optimality criteria in the application of RSS for efficient regression analysis.
In this study, we build on the work of Samawi and Ababneh [11] and consider the multiple linear regression model when units are selected by RSS with ranking according to one of the independent variables. The estimators of the model parameters (regression coefficients) and their variances are examined under the assumption that the error terms have constant variance, based on the Lemma given by Bhattacharya [2]. To compare the efficiency of the parameter estimates obtained with RSS against those obtained with SRS, we use the RE measure. A Monte Carlo simulation study is also carried out to obtain RE values for different sample sizes and correlation coefficients, under the assumption that the dependent variable and the independent variables have a multivariate normal distribution.
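The two-stage RSS procedure for the population mean described in this section can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes perfect ranking, standard normal data, and uses hypothetical function names (`rss_cycle`, `rss_sample`, `relative_efficiency_of_mean`). It estimates the ratio of the variance of the SRS mean to the variance of the RSS mean for the same number of measured units $n = mr$.

```python
import random
import statistics


def rss_cycle(draw, m, rng):
    """One RSS cycle with perfect ranking: returns m measured values."""
    measured = []
    for i in range(m):
        # Draw a fresh set of m units and rank it (the cheap step).
        ranked_set = sorted(draw(rng) for _ in range(m))
        # Measure only the unit holding rank i + 1 in set i + 1.
        measured.append(ranked_set[i])
    return measured


def rss_sample(draw, m, r, rng):
    """Ranked set sample of size n = m * r built from r independent cycles."""
    sample = []
    for _ in range(r):
        sample.extend(rss_cycle(draw, m, rng))
    return sample


def relative_efficiency_of_mean(m=5, r=1, reps=20000, seed=1):
    """Monte Carlo estimate of Var(SRS mean) / Var(RSS mean) for N(0, 1) data."""
    rng = random.Random(seed)

    def draw(g):
        return g.gauss(0.0, 1.0)

    n = m * r
    rss_means = [statistics.fmean(rss_sample(draw, m, r, rng)) for _ in range(reps)]
    srs_means = [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]
    return statistics.pvariance(srs_means) / statistics.pvariance(rss_means)
```

A ratio above 1 indicates that, under these assumptions, the RSS mean has the smaller variance, in line with the inequality of Dell and Clutter [6].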

Estimation of Model Parameters Using RSS in a Multiple Regression Model
In regression analysis, the variation of the dependent variable is explained with the help of the independent variables. In applications, measuring the dependent variable can be quite costly or difficult in terms of time and labor. In such cases RSS can be used to obtain the sample units; however, the method used to rank the units with respect to the dependent variable is important. Chen [3] suggested the adaptive RSS method for deriving the regression estimator of the population mean by ranking with multiple concomitant variables. However, that method needs a ranking criterion function in order to use all independent variables in the ranking. The units can also be ranked directly on the dependent variable, for instance by visual ranking, but then the model assumptions can be violated and the regression model may have to be expressed in terms of order statistics. On the other hand, the most appropriate ranking with respect to the dependent variable can be obtained by using one of the independent variables. For example, the ages of animals need to be determined in animal growth studies, but aging an animal is usually time consuming and costly, whereas variables describing the physical size of an animal, which are closely related to age, can be measured easily and cheaply [5]. The independent variable used in the ranking is therefore also treated as a concomitant variable. If precise measurement of this independent variable $X$ is cheap and easy, the units are ranked on $X$ using precise measurements; if not, low-level measurement techniques that do not introduce ranking error are used. Here, the primary aim is to obtain the most useful information about the dependent variable, whose measurement is the most difficult and expensive, with the help of the independent variable whose measurement is the easiest and cheapest among the independent variables in the regression model, and thereby to obtain more efficient estimators of the model parameters.
The regression model between the dependent variable $Y$ and the independent variables $(X_1, X_2, \ldots, X_p)$ can be written as

$$Y = \beta_0 + \mathbf{B}'\mathbf{X} + \varepsilon \qquad (2.1)$$

where $\mathbf{B} = (\beta_1, \ldots, \beta_p)'$ and $\mathbf{X} = (X_1, \ldots, X_p)'$ are the $p$-dimensional vectors of parameters and of random independent variables, respectively. If sample selection for model 2.1 is done with RSS, let $X_k$ be the independent variable used as the concomitant variable for ranking. In the first stage of sample selection, a random sample of $m^2$ units $(Y, X_1, X_2, \ldots, X_k, \ldots, X_p)$ is selected from the population and the units are randomly partitioned into $m$ sets. The sample units in each set are then ranked according to $X_k$. Thus the dependent variable $Y$, which has not yet been measured precisely, and the remaining independent variables $(X_1, X_2, \ldots, X_{k-1}, X_{k+1}, \ldots, X_p)$ are ordered according to $X_k$. In the second stage, the unit with the smallest $X_k$ value from the first set, the unit with the second smallest $X_k$ value from the second set, and so on up to the unit with the $m$-th smallest (that is, the largest) $X_k$ value from the $m$-th set are selected and measured precisely on both the independent variables in the model and the dependent variable $Y$. This cycle can be replicated $r$ times until the desired sample size is obtained. The multiple linear regression model for a ranked set sample of size $n = mr$ is then defined as

$$\mathbf{Y}_R = \mathbf{X}_R \boldsymbol{\beta} + \boldsymbol{\varepsilon}_R.$$

In this model, $\mathbf{X}_R$ is the $n \times (p+1)$ matrix of independent variables whose rows are $(1, X_{1[i]j}, \ldots, X_{k(i)j}, \ldots, X_{p[i]j})$, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, r$. Here $X_{k(i)j}$ denotes the $i$-th order statistic of $X_k$ in the $i$-th set and $j$-th cycle, and $X_{p[i]j}$ is the $X_k$-induced $i$-th order statistic of $X_p$ in the $i$-th set and $j$-th cycle. $\mathbf{Y}_R$ is the $n \times 1$ vector of the dependent variable with elements $Y_{[i]j}$, where $Y_{[i]j}$ is the value of the dependent variable associated with the independent variables $(1, X_{1[i]j}, \ldots, X_{k(i)j}, \ldots, X_{p[i]j})$ in the $i$-th set and $j$-th cycle. $\boldsymbol{\beta}$ is the $(p+1) \times 1$ parameter vector, and $\boldsymbol{\varepsilon}_R$ is the $n \times 1$ random error vector with $E(\boldsymbol{\varepsilon}_R) = \mathbf{0}$, $\mathrm{Var}(\boldsymbol{\varepsilon}_R) = \sigma^2 \mathbf{I}$ and $\mathrm{Cov}(\boldsymbol{\varepsilon}_R, \mathbf{X}_R) = \mathbf{0}$. The constant variance assumption for the error vector follows from the Lemma on induced order statistics given by Bhattacharya [2]. Let $f_{Y|X}(y \mid x)$ denote the conditional density of $Y$ given $X$. If the $(X_i, Y_i)$ are independent and identically distributed random variables, then the conditional probability density function of $Y_{[i]}$ given $X_{(i)} = x$ is

$$f_{Y_{[i]} \mid X_{(i)}}(y \mid x) = f_{Y|X}(y \mid x),$$

where $X_{(i)}$ is the $i$-th order statistic and $Y_{[i]}$ is the $X$-induced $i$-th order statistic of $Y$. This lemma can also be extended to the multivariate case [4]. In this study, this structure is used in the estimation of the model parameters.
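Under these model assumptions, the parameter vector is estimated under RSS with the LSE method as

$$\hat{\boldsymbol{\beta}}_{RSS} = (\mathbf{X}_R'\mathbf{X}_R)^{-1}\mathbf{X}_R'\mathbf{Y}_R,$$

with variance $\sigma^2 (\mathbf{X}_R'\mathbf{X}_R)^{-1}$ for given $\mathbf{X}_R$. The unbiased estimator of $\sigma^2$ is written as

$$\hat{\sigma}^2_{RSS} = \frac{(\mathbf{Y}_R - \mathbf{X}_R\hat{\boldsymbol{\beta}}_{RSS})'(\mathbf{Y}_R - \mathbf{X}_R\hat{\boldsymbol{\beta}}_{RSS})}{n - p - 1}.$$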
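The selection scheme of this section, in which units are ranked on the concomitant variable $X_k$ and the other variables are carried along as induced order statistics, can be sketched as follows. This is an illustrative sketch under stated assumptions: each unit is a tuple laid out as $(Y, X_1, \ldots, X_p)$, the $m^2$ units are already in random order so that consecutive blocks of $m$ form the sets, and the function name `ranked_set_sample` and the column index `k` are hypothetical.

```python
def ranked_set_sample(units, m, k):
    """Select one cycle of m units from m * m units by ranking on column k.

    units[0:m] is treated as set 1, units[m:2m] as set 2, and so on
    (the units are assumed to be in random order already). From set i the
    unit with the i-th smallest value in column k is kept; its remaining
    columns come along as induced order statistics.
    """
    if len(units) != m * m:
        raise ValueError("one cycle needs exactly m * m units")
    selected = []
    for i in range(m):
        one_set = sorted(units[i * m:(i + 1) * m], key=lambda u: u[k])
        selected.append(one_set[i])
    return selected
```

Repeating the call on $r$ independent batches of $m^2$ units and concatenating the results gives the rows of $\mathbf{Y}_R$ and $\mathbf{X}_R$ for a sample of size $n = mr$.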

Relative Efficiencies of Estimators
In regression analysis, sample selection is generally done with SRS. In this study, it is assumed that the units are selected by SRS without fixing the values of the independent variables in advance, and that the dependent and independent variables are measured on each selected unit. For a simple random sample of size $n$ from a multivariate normal distribution of $(Y, X_1, X_2, \ldots, X_p)$, model 2.1 can be written in matrix notation as

$$\mathbf{Y}_S = \mathbf{X}_S \boldsymbol{\beta} + \boldsymbol{\varepsilon}_S \qquad (3.1)$$

In this model, $\mathbf{Y}_S$ is the $n \times 1$ vector of the dependent variable obtained with SRS, $\mathbf{X}_S$ is the $n \times (p+1)$ matrix of independent variables obtained with SRS, $\boldsymbol{\beta}$ is the $(p+1) \times 1$ parameter vector, and $\boldsymbol{\varepsilon}_S$ is the $n \times 1$ random error vector with $E(\boldsymbol{\varepsilon}_S) = \mathbf{0}$, $\mathrm{Var}(\boldsymbol{\varepsilon}_S) = \sigma^2 \mathbf{I}$ and $\mathrm{Cov}(\boldsymbol{\varepsilon}_S, \mathbf{X}_S) = \mathbf{0}$.
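When sample selection is done with SRS, the parameter vector of model 3.1 is estimated with the LSE method as

$$\hat{\boldsymbol{\beta}}_{SRS} = (\mathbf{X}_S'\mathbf{X}_S)^{-1}\mathbf{X}_S'\mathbf{Y}_S.$$

The variance of this estimator is written as

$$\mathrm{Var}(\hat{\boldsymbol{\beta}}_{SRS}) = \sigma^2 (\mathbf{X}_S'\mathbf{X}_S)^{-1}.$$

The unbiased estimator of $\sigma^2$ according to SRS is

$$\hat{\sigma}^2_{SRS} = \frac{(\mathbf{Y}_S - \mathbf{X}_S\hat{\boldsymbol{\beta}}_{SRS})'(\mathbf{Y}_S - \mathbf{X}_S\hat{\boldsymbol{\beta}}_{SRS})}{n - p - 1}.$$

As can be seen from equations 2.2 and 3.2, the variances of the estimators depend on the expressions $(\mathbf{X}_R'\mathbf{X}_R)^{-1}$ and $(\mathbf{X}_S'\mathbf{X}_S)^{-1}$. To compare RSS and SRS on the basis of the RE measure, it is therefore necessary to treat the matrices of independent variables $\mathbf{X}_R$ and $\mathbf{X}_S$ as random matrices. Since all possible values of the independent variables are taken into account in this way, the effect of RSS on parameter estimation can be explained better. In this setting, the variances of the estimators under SRS and RSS can be written, respectively, as

$$\mathrm{Var}(\hat{\boldsymbol{\beta}}_{SRS}) = \sigma^2 E\left[(\mathbf{X}_S'\mathbf{X}_S)^{-1}\right], \qquad \mathrm{Var}(\hat{\boldsymbol{\beta}}_{RSS}) = \sigma^2 E\left[(\mathbf{X}_R'\mathbf{X}_R)^{-1}\right],$$

so the expected values $E[(\mathbf{X}_S'\mathbf{X}_S)^{-1}]$ and $E[(\mathbf{X}_R'\mathbf{X}_R)^{-1}]$ would have to be obtained theoretically. However, deriving these expected values seems to be possible only with asymptotic theory, and it is not appropriate to draw conclusions from asymptotic theory when the sample sizes are small. For this reason, the RE values of the regression coefficients in a multiple linear regression model with two independent variables are examined by means of a simulation study.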
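The step from a fixed to a random design matrix can be made explicit with the law of total variance. The derivation below is a sketch under the assumption that $E(\boldsymbol{\varepsilon} \mid \mathbf{X}) = \mathbf{0}$ and $\mathrm{Var}(\boldsymbol{\varepsilon} \mid \mathbf{X}) = \sigma^2\mathbf{I}$ hold for both sampling schemes. The RE ratio is written here with SRS in the numerator, which is an assumed convention chosen so that values above 1 favor RSS, as in the discussion of the results.

```latex
\begin{align*}
\mathrm{Var}(\hat{\boldsymbol{\beta}})
  &= E\!\left[\mathrm{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X})\right]
   + \mathrm{Var}\!\left(E[\hat{\boldsymbol{\beta}} \mid \mathbf{X}]\right) \\
  &= E\!\left[\sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\right]
   + \mathrm{Var}(\boldsymbol{\beta})
   = \sigma^2 E\!\left[(\mathbf{X}'\mathbf{X})^{-1}\right], \\[6pt]
RE(\hat{\beta}_i)
  &= \frac{\mathrm{Var}(\hat{\beta}_{i,SRS})}{\mathrm{Var}(\hat{\beta}_{i,RSS})}
   = \frac{\left\{E\!\left[(\mathbf{X}_S'\mathbf{X}_S)^{-1}\right]\right\}_{ii}}
          {\left\{E\!\left[(\mathbf{X}_R'\mathbf{X}_R)^{-1}\right]\right\}_{ii}},
  \qquad i = 0, 1, \ldots, p.
\end{align*}
```

Because $\sigma^2$ cancels in the ratio, the RE depends only on how the two sampling schemes distribute the values of the independent variables.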

Monte Carlo Simulation Study to Estimate the Relative Efficiency Values
The simulation study was carried out in Matlab with the algorithm given below.

Sample selection for RSS:
1. Data for $(Y, X_1, X_2)$ are generated from a multivariate normal distribution for different values of the coefficient of determination $R^2$ and for different correlation coefficients $\rho_{x_1y}$, $\rho_{x_2y}$ and $\rho_{x_1x_2}$ that satisfy these $R^2$ values. The values are given in Table 1.
2. A sample of $m^2$ units is randomly selected from the generated data.
3. The sample is randomly partitioned into $m$ sets of $m$ units each.
4. The units in each set are ranked according to $X_1$, which plays the role of the concomitant variable in practice. The unit with the smallest $X_1$ value is taken from the first set, the unit with the second smallest $X_1$ value from the second set, and so on until the unit with the largest $X_1$ value is taken from the $m$-th set.
5. When the number of cycles is $r > 1$, Steps 2, 3 and 4 are replicated $r$ times.
6. At the end of $r$ cycles, a ranked set sample of size $n = mr$ is obtained.

Sample selection for SRS:
7. A sample of size $n = mr$ is selected randomly from the population generated in Step 1.

12. In Step 4, the ranking is done according to $X_2$ and the remaining steps are repeated.
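The simulation procedure can be sketched in Python as follows. This is a simplified illustration, not the authors' Matlab program. It assumes units are drawn directly from the trivariate normal distribution rather than from a stored generated population, and it uses unit variances, zero means and a small default replication count. All function names are hypothetical, and RE is taken as the SRS variance divided by the RSS variance.

```python
import math
import random
import statistics


def cholesky3(c):
    """Lower-triangular Cholesky factor of a 3x3 correlation matrix."""
    l = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(i + 1):
            s = c[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l


def draw_unit(l, rng):
    """One (Y, X1, X2) draw from a trivariate normal with correlation L L'."""
    z = [rng.gauss(0.0, 1.0) for _ in range(3)]
    return tuple(sum(l[i][k] * z[k] for k in range(i + 1)) for i in range(3))


def rss(l, m, r, k, rng):
    """Ranked set sample of size m * r, ranking each set on column k."""
    out = []
    for _ in range(r):
        for i in range(m):
            one_set = sorted((draw_unit(l, rng) for _ in range(m)),
                             key=lambda u: u[k])
            out.append(one_set[i])
    return out


def ols(sample):
    """Least squares fit of Y on (1, X1, X2) via the normal equations."""
    rows = [(1.0, x1, x2) for _, x1, x2 in sample]
    ys = [y for y, _, _ in sample]
    # Augmented matrix [X'X | X'Y].
    a = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
         + [sum(row[i] * y for row, y in zip(rows, ys))] for i in range(3)]
    for c in range(3):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, 3), key=lambda i: abs(a[i][c]))
        a[c], a[p] = a[p], a[c]
        for i in range(3):
            if i != c:
                f = a[i][c] / a[c][c]
                a[i] = [u - f * v for u, v in zip(a[i], a[c])]
    return [a[i][3] / a[i][i] for i in range(3)]


def relative_efficiencies(r1y, r2y, r12, m, r, k=1, reps=2000, seed=0):
    """Estimate RE for (b0, b1, b2) as Var_SRS / Var_RSS over `reps` runs.

    k = 1 ranks on X1 and k = 2 ranks on X2 (columns of the (Y, X1, X2) tuple).
    """
    rng = random.Random(seed)
    l = cholesky3([[1.0, r1y, r2y], [r1y, 1.0, r12], [r2y, r12, 1.0]])
    n = m * r
    b_rss = [ols(rss(l, m, r, k, rng)) for _ in range(reps)]
    b_srs = [ols([draw_unit(l, rng) for _ in range(n)]) for _ in range(reps)]
    return [statistics.pvariance([b[j] for b in b_srs])
            / statistics.pvariance([b[j] for b in b_rss]) for j in range(3)]
```

For very small $n$ the coefficient estimates have heavy-tailed distributions, so many replications are needed before the estimated ratios stabilize. The study itself uses 300,000.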
In this simulation study, all possible $(m, r)$ pairs satisfying the sample sizes $n = 5, 6, 10, 16$ are considered. The correlation values used in the simulation study are defined according to the coefficient of determination $R^2$. Different combinations of $\rho_{x_1y}$, $\rho_{x_2y}$ and $\rho_{x_1x_2}$ that give the same value of $R^2$ are used to explore the source of the main effect on RE. $R^2$ is defined as

$$R^2 = \frac{\rho_{x_1y}^2 + \rho_{x_2y}^2 - 2\rho_{x_1y}\rho_{x_2y}\rho_{x_1x_2}}{1 - \rho_{x_1x_2}^2}.$$

Values of $\rho_{x_1y}$, $\rho_{x_2y}$ and $\rho_{x_1x_2}$ satisfying 3 different $R^2$ values are taken in 15 different correlation cases. These cases are given in Table 1. The RE values obtained from the simulation study are given in Figures 1-6. For a fixed sample size $n$, as the set size $m$ increases and the number of cycles $r$ decreases, the RE values of the estimated regression coefficient of the independent variable used in ranking and $RE(\hat{\beta}_0)$ increase (see Figures 1, 2, 4, 6), but the RE values of the estimated regression coefficient of the independent variable not used in ranking approach 1 (see Figures 3, 5). For fixed $(m, r)$ values, the REs are not affected by the coefficient of determination $R^2$ or by the correlation coefficients $\rho_{x_1y}$ and $\rho_{x_2y}$. However, when $\rho_{x_1x_2} > 0.25$ and the sample size is small, the $RE(\hat{\beta}_1)$ and $RE(\hat{\beta}_2)$ values decrease when the ranking is done according to $X_1$ and $X_2$, respectively (see Figures 2, 6). $RE(\hat{\beta}_0)$ is not affected by whether the ranking is done according to $X_1$ or $X_2$ and takes similar values for fixed $(m, r)$ across the correlation cases considered (see Figures 1, 4). For fixed $(m, r)$ and all correlation cases, the $RE(\hat{\beta}_1)$ values obtained when ranking on $X_1$ and the $RE(\hat{\beta}_2)$ values obtained when ranking on $X_2$ are similar (see Figures 2, 6).
As can be seen in Table 1, only positive values of the correlation coefficients are examined in this simulation study. The reason is that simulations carried out for negative values give the same results as those for positive values. For more detail, see Özdemir [10].
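When building correlation cases that share a given $R^2$, it can help to compute $R^2$ from the three correlations and to check that the combination is admissible. The sketch below uses the standard two-predictor formula. The function names and the example values in the usage note are hypothetical and are not taken from Table 1.

```python
def r_squared(r1y, r2y, r12):
    """Coefficient of determination of Y on (X1, X2) from pairwise correlations."""
    return (r1y ** 2 + r2y ** 2 - 2.0 * r1y * r2y * r12) / (1.0 - r12 ** 2)


def is_valid_case(r1y, r2y, r12):
    """True if the 3x3 correlation matrix of (Y, X1, X2) is positive definite."""
    # Determinant of the correlation matrix; together with |r12| < 1 this
    # is Sylvester's criterion for the ordering (X1, X2, Y).
    det = 1.0 + 2.0 * r1y * r2y * r12 - r1y ** 2 - r2y ** 2 - r12 ** 2
    return abs(r12) < 1.0 and det > 0.0
```

For instance, `r_squared(0.5, 0.5, 0.0)` evaluates to 0.5 and `is_valid_case(0.5, 0.5, 0.0)` is true, whereas a combination such as `(0.9, 0.9, -0.9)` is rejected because no trivariate distribution can have those correlations.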

Conclusion
In this study, the regression coefficients of the multiple linear regression model are estimated using RSS, and the RE values of these estimates relative to SRS are investigated. In the simulation study, a multiple linear regression model with two independent variables is considered, and the ranking in RSS is done according to each independent variable in turn. Different correlation cases for specified $R^2$ values are also considered, so that the effects of the ranking and of the correlation structure on parameter estimation can be examined on the basis of the RE measure. According to the simulation results, when the sample size is small, RSS is more effective than SRS for estimating the model parameters $\beta_0$ and $\beta_1$ when the ranking is done according to $X_1$, and for estimating $\beta_0$ and $\beta_2$ when the ranking is done according to $X_2$. However, as the sample size increases, the RE values approach 1, so the difference between RSS and SRS in terms of RE is expected to disappear. RE takes its highest values when $\rho_{x_1x_2} < 0.25$ and the sample size $n$ is small. In regression analysis, high values of $\rho_{x_1x_2}$ may give rise to a collinearity problem. In general, therefore, it is desirable to include in the model only one of a pair of highly correlated independent variables. In this case, ranking can be done with the help of the remaining independent variable. In a regression model in which $\rho_{x_1x_2}$ is not high, if precise measurement of both independent variables is cheap and easy, the independent variable with the higher correlation with the dependent variable may be preferred for ranking. In this way, the RE of the estimated regression coefficient of the independent variable used in ranking increases. The significance of that coefficient is thereby supported, since the variance of the estimated coefficient of the variable that contributes most to explaining the model decreases.
When a small sample size is required because of budget limitations, the estimates of the model parameters in regression analysis obtained with RSS are more efficient than those obtained with SRS. Moreover, the efficiencies of the parameter estimates take their highest values when the largest set size $m$ and the smallest number of cycles $r$ permitted by the ranking method are chosen to reach the sample size $n$ needed for the application.

As seen from Figures 1-6, $RE(\hat{\beta}_0)$, $RE(\hat{\beta}_1)$ and $RE(\hat{\beta}_2)$ get close to 1 as $n$ increases, and the highest RE values are obtained when $m = 5$ and $r = 1$, that is, for the sample size $n = 5$.

Figure 1. Plots of $RE(\hat{\beta}_0)$ values for all considered correlation cases and all possible $(m, r)$ values satisfying $n = 5, 6, 10, 16$ when the ranking is done according to $X_1$.

Figure 4. Plots of $RE(\hat{\beta}_0)$ values for all considered correlation cases and all possible $(m, r)$ values satisfying $n = 5, 6, 10, 16$ when the ranking is done according to $X_2$.

The sample selected in Step 7 constitutes the simple random sample of size $n = mr$.
9. The $\hat{\boldsymbol{\beta}}_{SRS}$ vector is obtained from the simple random sample of size $n$ drawn in Step 7.
10. All steps from the 1st to the 10th are repeated 300,000 times; after the 300,000 replications, the means and variances of the estimates obtained in Steps 8 and 9 are calculated.

Table 1
Correlation coefficient values used in the Monte Carlo simulation study for the multiple regression model with two independent variables.