The Impact of Missing Data on the Performances of DIF Detection Methods

This study analyzed the impact of missing data techniques on the performance of two differential item functioning (DIF) detection methods, Mantel Haenszel (MH) and Multiple Indicator and Multiple Causes (MIMIC), under the missing completely at random (MCAR) mechanism. The percentage of missing data was set at 5% and 15%. Zero imputation, listwise deletion and fractional hot-deck imputation were used to handle missing data. The data set consisted of 17 items in the S12 item cluster of the Programme for International Student Assessment (PISA) 2015 science test. Results showed that fractional hot-deck imputation produced the best results in identifying DIF items in all conditions and also yielded the DIF values closest to those obtained from the complete data set. The MIMIC method was also found to be more adversely affected by the presence of missing data than the MH method.


Introduction
Missing data is a frequently encountered problem in quantitative research studies. Since standard statistical methods were designed for complete data sets, missing values create a significant problem for researchers. Generally, researchers use various ad hoc methods to handle missing data before the analysis. One example is discarding the cases with missing data (i.e., listwise deletion). Replacing missing values with the variable mean is another. Yet these traditional methods can lead to significant bias in sample statistics (Peugh & Enders, 2004).
The rate of missing data, the missing data mechanism and the patterns of missing data should be considered when deciding how to handle missing data. The rate of missing data is directly associated with the quality of statistical inferences. There is no established criterion in the literature for a missing data rate that still permits valid statistical inferences (Dong & Peng, 2013). However, the rate of missing data has mostly varied between 0% and 30% in previous studies (Banks & Walker, 2006; Finch, 2011a; Finch, 2011b; Robitzsch & Rupp, 2009; Rousseau et al., 2004).
As previously stated, another aspect of handling missing data is to take the missing data mechanism into account. Rubin (1976) classified missing data mechanisms into three types: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). Within the context of item responses, MCAR indicates that some examinees leave the item blank in a completely random way, without any systematic mechanism related to the missingness. Data are MAR when the probability that an observation is missing is related to a measured variable. A higher probability of leaving an item blank among male students than among female students would illustrate the MAR mechanism. The MNAR mechanism refers to the case in which the probability of being missing is related to the value of the variable itself. In this case, an examinee might leave the item unanswered because they do not know the answer (Finch, 2011b).

Missing data can affect quantitative research severely and may cause bias in parameter estimates, reduced statistical power, inflated standard errors and information loss (Dong & Peng, 2013). Therefore, it is essential for researchers to investigate the impact of missing data on statistical techniques. Of particular concern in this research is the effect of missing data on the detection, with different methods, of differential item functioning (DIF), which causes systematic errors and reduces validity. What follows is a brief overview of DIF and the methods used in this study.
DIF has received considerable attention as a result of the increased reliance on standardized achievement testing for evaluating progress in education. The need to provide accurate assessment for all examinees comes with great responsibility for psychometricians. Items intended to measure reading skills, for instance, must be suitable for use with students from various groups (e.g., gender, ethnicity etc.) to get meaningful score interpretations (Finch & French, 2007). If an item functions differently in a focal group compared with a reference group after controlling for differences in levels of performance on a latent trait (e.g., ability) of interest, the item shows DIF (Holland & Wainer, 1993; Scheuneman, 1979). DIF can be categorized into two broad types: uniform and nonuniform. Uniform DIF is present when one of two groups has a uniformly greater probability of answering an item correctly across all ability levels (Finch, 2005). Nonuniform DIF occurs when members of one group have a greater probability of responding to an item correctly at some levels of the ability being measured, while they have a lower probability at the other levels (Camilli & Shepard, 1994).
DIF detection methods can broadly be examined under two headings: (1) Classical Test Theory (CTT) and (2) Item Response Theory (IRT). However, Camilli and Shepard (1994) highlighted that Confirmatory Factor Analysis (CFA) methods can be used to identify DIF as well. Previous studies of DIF in the presence of missing data have mostly focused on CTT and IRT methods rather than CFA (Banks & Walker, 2006; Finch, 2011a; Finch, 2011b; Robitzsch & Rupp, 2009; Rousseau et al., 2004). In this respect, we decided to use Mantel Haenszel, a widely accepted CTT-based method in the literature, and multiple indicator and multiple causes, a CFA method that has recently become popular.

MIMIC
The multiple indicator and multiple causes (MIMIC) method is based on CFA and has received growing attention in DIF detection. The fundamental technique underlying DIF assessment with MIMIC models includes estimation of both direct and indirect effects for a grouping variable. The indirect effect shows whether there is a difference in the mean of the latent trait across the groups, and thereby explains the group differences on the latent trait. The direct effect shows whether response probabilities differ across the groups. In the DIF framework, the MIMIC model is specified as in Finch (2005).
Previous simulation studies investigating DIF with the MIMIC method have shown that under most circumstances it performed as efficiently as or better than the other methods (SIBTEST, MH, LR etc.) with regard to type I error rate and power (e.g., Finch, 2005; Uğurlu & Atar, 2020; Woods, 2009). Missing data is a significant factor in the performance of statistical methods. Therefore, the impact of missing data on DIF detection with the MIMIC model is an important issue to be considered.
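For orientation, a MIMIC DIF model of the kind described above is commonly written in the following form. This is a sketch in generic notation that is assumed here rather than taken from the source, so the symbols may differ from those used by Finch (2005).

```latex
\begin{aligned}
y_i^{*} &= \lambda_i \eta + \beta_i z + \varepsilon_i, \\
\eta    &= \gamma z + \zeta
\end{aligned}
```

Under these assumed symbols, $y_i^{*}$ is the latent response variable underlying item $i$, $\lambda_i$ is its factor loading, $\eta$ is the latent trait, $z$ is the group indicator, and $\varepsilon_i$ and $\zeta$ are error terms. The coefficient $\gamma$ captures the group difference in the latent trait mean (the basis of the indirect effect), while a nonzero $\beta_i$ is the direct effect that signals DIF in item $i$.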

Table 1 Data Organization in MH Method
The MH statistic gives the odds ratio (α), the ratio of the odds that the reference group will respond to the studied item correctly to those for the focal group (Clauser & Mazor, 1998). The odds ratio is given in Equation (2).
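As an illustration of the odds ratio described above, the usual MH common odds ratio pools the 2x2 tables across score levels: α = Σ_j (A_j D_j / T_j) / Σ_j (B_j C_j / T_j). The cell labels are assumed here, since the table layout is not reproduced in this text: A_j and B_j are the reference group's correct and incorrect counts at score level j, C_j and D_j are the focal group's, and T_j is the level total. The sketch also applies the -2.35 log rescaling to ∆MH; the function names and the counts are hypothetical.

```python
import math


def mh_odds_ratio(tables):
    """Pooled Mantel Haenszel odds ratio across matched score levels.

    Each table is (A, B, C, D): reference correct, reference incorrect,
    focal correct, focal incorrect at one score level (assumed layout).
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        if t == 0:
            continue  # an empty score level carries no information
        num += a * d / t
        den += b * c / t
    if den == 0:
        raise ValueError("odds ratio undefined: denominator is zero")
    return num / den


def delta_mh(alpha):
    """Rescale the odds ratio: take the natural log, multiply by -2.35."""
    return -2.35 * math.log(alpha)


# Hypothetical counts for three score levels (not data from the study).
tables = [(20, 10, 15, 15), (30, 5, 25, 10), (10, 20, 8, 22)]
alpha = mh_odds_ratio(tables)
delta = delta_mh(alpha)
```

With these made-up counts the reference group has the higher odds of success at every level, so α exceeds 1 and ∆MH is negative. Two groups with identical tables give α = 1 and ∆MH = 0.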
Holland and Thayer (1988) recommended a logistic transformation to make interpretation of the odds ratio easier. First, the log of α is taken so that the scale is symmetric around zero. Then, the resulting value is multiplied by -2.35, which produces ∆MH (Clauser & Mazor, 1998). Zieky (1993) described how ∆MH values are used to classify items into A, B and C levels of DIF.
Returning briefly to missing data, the presence of missing data is clearly an important issue with regard to DIF detection. However, commonly used DIF detection methods such as MH, SIBTEST and Logistic Regression (LR) are not capable of handling missing data. Hence, the missing data handling methods used for the analysis might cause bias. The choice of missing data method may create DIF when there is no DIF in the item or eliminate DIF when it is actually present (Banks, 2015). When the choice of missing data handling method is inappropriate, erroneous decisions can be made based on DIF results, which may prevent meaningful test score interpretations.
Researchers have attempted to assess the impact of missing data on DIF detection via simulation studies (Banks & Walker, 2006; Finch, 2011a; Finch, 2011b; Garrett, 2009; Robitzsch & Rupp, 2009) or studies with real data (Rousseau et al., 2004; Tamcı, 2018). Most of these studies have focused on widely used DIF detection methods such as SIBTEST, MH or LR. Emenogu et al. (2010) used both real and simulated data to investigate the impact of zero imputation (ZI), listwise deletion (LD) and analysis wise deletion on the MH method. They reported that ZI produced false DIF regardless of the matching criterion used in the study and that LD led to a significant decrease in sample size and in the power of the MH method.
Finch (2011b) also included IRT-LR in his study along with crossing SIBTEST and LR. This study assessed the efficacy of ZI, LD, multiple imputation (MI) and stochastic regression imputation (SRI) for DIF detection. LD was recommended as a traditional missing data handling method for each DIF method, and MI was the imputation method recommended in the study.

In recent years, there has been a growing amount of literature on DIF detection with MIMIC, a CFA-based DIF detection method (Finch, 2005; Jin & Chen, 2020; Montoya & Jeon, 2020; Shih & Wang, 2009; Uğurlu & Atar, 2020; Woods, 2009). Missing data can affect any type of analysis, including CFA (Harrington, 2009). Therefore, this study uses the MIMIC method along with MH, which is a broadly accepted method in the literature.
Zero imputation, listwise deletion and fractional hot-deck imputation (FHDI) were chosen as the missing data handling methods in the current study because the first two were widely used in prior research and far too little attention has been paid to the last one. For ZI, all missing responses were replaced with 0. For LD, all individuals who had incomplete responses were deleted. In FHDI, proposed by Kalton and Kish (1984) and investigated by Kim and Fuller (2004), M imputed values are created for each missing value; however, after fractional imputation a single data set is obtained as the output. Fractional weights are assigned to the imputed values. The purpose of FHDI is to perform hot deck imputation efficiently (Im et al., 2015). FHDI was extended by Im et al. (2015) in two ways. First, in this new version of FHDI, imputation cells are not required to be made in advance. Second, the proposed FHDI method can be applied to multivariate missing data with arbitrary missing patterns. In this paper, we used the extension of FHDI proposed by Im et al. (2015), which is available in R.
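The two traditional treatments described above are simple enough to sketch directly. The following is a minimal illustration, assuming responses are stored as rows of 0/1 scores with `None` marking a missing response; the function names and the tiny data set are hypothetical, and FHDI is deliberately not reproduced here because it requires the full procedure of Im et al. (2015).

```python
def zero_impute(data):
    """ZI: replace every missing response (None) with 0."""
    return [[0 if x is None else x for x in row] for row in data]


def listwise_delete(data):
    """LD: keep only examinees with no missing response."""
    return [row for row in data if all(x is not None for x in row)]


# Hypothetical responses of four examinees to three items.
responses = [
    [1, 0, 1],
    [1, None, 0],
    [None, 1, 1],
    [0, 0, 1],
]
zi = zero_impute(responses)      # same number of rows, no None left
ld = listwise_delete(responses)  # only the two complete rows remain
```

The example makes the trade-off visible: ZI keeps every examinee but treats each omitted response as incorrect, while LD leaves the observed responses untouched but shrinks the sample.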

Purpose of the Study
DIF detection is an increasingly important area in test development and in the validity of standardized achievement tests, which contribute to the development of educational policies (Zumbo, 2007). PISA (the Programme for International Student Assessment), which enables comparison of students' achievement across different countries and languages and informs the educational policies of these countries, is one of the important international standardized tests. Missing data can also be a problem in PISA, as with many other tests (e.g., Emenogu et al., 2010; Tamcı, 2018).
As already stated, traditional DIF detection methods cannot handle missing data. However, it is natural to have missing data in many educational or psychological tests. In this case, solving the missing data problem before DIF analysis becomes essential. Several studies investigating missing data and DIF detection demonstrated that the choice of missing data treatment method or the type of missing data can influence the performance of DIF detection methods (Finch, 2011a; Robitzsch & Rupp, 2009). This study therefore set out to assess the performance of DIF detection methods in PISA in the presence of missing data. The leading research question in this investigation was as follows: What is the impact of (a) different missing data handling methods under (b) the MCAR missing data mechanism and (c) different missing data percentages on the performance of the MH and MIMIC DIF detection methods?

Methods
This study aims to determine the impact of three missing data techniques on the performance of DIF detection methods under the MCAR missing data mechanism. In this respect, it is a descriptive study, as it describes the existing situation as precisely as possible (Fraenkel et al., 2012).

Data Set
The data set consists of 17 items in the S12 item cluster of the PISA 2015 science test. The sample consisted of 1,099 students from Finland who responded to all the items in the test. Gender DIF studies are commonly carried out in international tests. In this study, however, gender DIF was not examined in order to make inferences about gender. Unequal focal and reference group sizes might be another variable affecting the performance of missing data handling methods. For this reason, the Finland data set (1,362 students) was chosen, because the reference (female) and focal (male) groups were almost equal in size after discarding missing data.

Data Analysis
The data set includes 16 binary scored items and one partial credit item (CS637Q02S). This item was recoded as 1-0 by the researchers (full and partial credit were coded as 1 and all other responses as 0), and analyses were carried out on 17 items. A complete data set of 1,099 students (550 female and 549 male) was obtained by discarding cases with missing data. After gender-based DIF analyses on the complete data set were conducted with the MH and MIMIC methods, the results were recorded to be used as a reference. Following the DIF analyses, missing responses were created on the complete data set by deleting data under MCAR. Missing responses under the MCAR mechanism were created by selecting responses randomly from all items and all responses (0-1) for both groups. As the percentage of missing data mostly ranged between 0% and 30% in prior research (Banks & Walker, 2006; Finch, 2011a; Finch, 2011b; Robitzsch & Rupp, 2009; Tamcı, 2018), the percentage of missing data in the current research was set at 5% and 15%. Missing data were then handled with the ZI, LD and FHDI methods, and DIF analyses were performed on the resulting data sets. Finally, the reference DIF results were compared with the results obtained from the data sets treated with the missing data handling methods. We investigated whether the numbers, levels or directions of DIF items in the complete data set had changed.
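The MCAR deletion step described above can be sketched as follows. This is a minimal illustration and not the code used in the study, which was adapted from Doğanay Erdoğan (2012): it assumes that a fixed share of all cells is blanked, chosen uniformly at random regardless of item, group or response value. The function name, the `None` marker and the toy data are hypothetical.

```python
import random


def make_mcar(data, rate, seed=0):
    """Return a copy of data with round(rate * n_cells) cells set to None.

    Cells are sampled uniformly without replacement, so missingness does
    not depend on the item, the examinee or the response itself.
    """
    rng = random.Random(seed)
    n_rows = len(data)
    n_cols = len(data[0])
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    n_to_blank = round(rate * len(cells))
    out = [list(row) for row in data]  # leave the input untouched
    for r, c in rng.sample(cells, n_to_blank):
        out[r][c] = None
    return out


# Toy complete data: 100 examinees by 17 items (not the PISA responses).
complete = [[(r + c) % 2 for c in range(17)] for r in range(100)]
with_missing = make_mcar(complete, 0.05, seed=1)
n_missing = sum(x is None for row in with_missing for x in row)
```

Fixing the seed makes a given missing data condition reproducible, which matters when the same incomplete data set has to be passed to several handling methods.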
Pearson correlations of MH and MIMIC DIF statistics in all conditions were also examined. The "MplusAutomation" (Hallquist & Wiley, 2018) and "difR" (Magis et al., 2010) packages were used for DIF analyses with the MIMIC and MH methods respectively. Missing responses were generated in R by adapting the missing data code written by Doğanay Erdoğan (2012). Imputation with the FHDI method was conducted with the "FHDI" (Im et al., 2018) package.
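The correlation check mentioned above compares item-level DIF statistics from the complete data set with those from a treated data set. The sketch below shows the computation in plain Python; the function name is hypothetical and the values are made up for illustration, not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical DIF statistics for five items (not values from the study).
complete_stats = [-1.2, 0.3, 0.8, -0.1, 1.5]
treated_stats = [-1.0, 0.4, 0.6, 0.0, 1.7]
r = pearson(complete_stats, treated_stats)
```

A coefficient close to 1 indicates that the treated data set reproduces the ordering and relative size of the reference DIF statistics, although it does not by itself show that individual items are classified the same way.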

Results
Reference DIF results obtained from the complete data set appear in Table 2. Those results were compared with the DIF results of all combinations included in the study. We examined whether the numbers, levels or directions of DIF items in the complete data set changed.
Table 3 shows DIF items with the MH method in all combinations. DIF results were not reported for the 15% missing condition with LD, as it reduced the sample size dramatically (70 students in total). The sample size for the 5% condition with LD was 457 students (232 in the reference group and 225 in the focal group). As shown in Table 3, the directions of DIF items in the complete data set did not change in any condition. However, there were differences in the number of DIF items and in DIF magnitude.
The three missing data methods produced the following results for the 5% condition. DIF items remained the same with ZI, but the DIF magnitude of one item (item10) decreased. When LD was used, two DIF items (item10 and item17) were still detected, although with higher DIF values than their actual values. Item14 displayed DIF with LD although it was not among the DIF items in the complete data set. Four DIF items were identified correctly with FHDI, yet three of them were overestimated. Item14 showed false DIF in favor of the focal group.
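The LD sample sizes reported above (457 at 5% and 70 at 15%) are roughly what one would expect if each response were deleted independently with probability p, since an examinee answering k items then stays complete with probability (1 - p)^k. The sketch below works through this arithmetic. Independence across cells is an assumption made here for illustration, and the function name is hypothetical.

```python
def expected_complete_cases(n, k, p):
    """Expected number of examinees with no missing response.

    n: number of examinees, k: number of items,
    p: probability that any single response is missing (assumed independent).
    """
    return n * (1 - p) ** k


# 1,099 examinees and 17 items, as in the study's complete data set.
at_5_percent = expected_complete_cases(1099, 17, 0.05)   # about 459.5
at_15_percent = expected_complete_cases(1099, 17, 0.15)  # about 69.4
```

Both values are close to the reported 457 and 70, which shows why LD becomes impractical so quickly: with 17 items, even a 15% cell-level missing rate leaves only about 6% of examinees with complete data.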
When the missing data percentage was 15%, ZI and FHDI both produced false DIF. ZI identified three of the four DIF items in the complete data set while FHDI identified them all. The DIF magnitudes of two items (item2 and item17) were underestimated with ZI. FHDI overestimated the DIF magnitude of item11 while it underestimated that of item10.
Table 4 presents DIF items with the MIMIC method in all combinations. As with the MH method, the directions of DIF items in the complete data set did not change in any condition for the MIMIC method. The three missing data methods produced the following results for the 5% condition. ZI failed to identify only one of the items that showed DIF in the complete data set. LD produced false DIF for item9, and only two of the six items showing DIF in the complete data set were identified as DIF items with LD. FHDI identified all DIF items accurately; however, it produced false DIF in favor of the focal group in two items. In the 15% condition, neither ZI nor FHDI was able to correctly identify all items showing DIF in the complete data set. Moreover, FHDI produced false DIF in this condition while ZI did not.
Table 5 shows the percentage of correctly identified DIF items and DIF free items by missing data handling method with MH and MIMIC. For the 5% condition with the MH method, ZI and FHDI identified all DIF items in the complete data set correctly. On the other hand, the percentage of DIF items correctly identified by LD was 50%. The DIF free items were the same with ZI as in the complete data set. For this condition, 92% of DIF free items did not display DIF with the LD and FHDI methods. FHDI identified all DIF items accurately for the 15% missing case, whereas the percentage of DIF items identified with ZI was 75%. The percentage of DIF free items correctly identified was 92% and 85% for the ZI and FHDI methods respectively.
When the MIMIC method was used, FHDI identified all DIF items in the complete data set accurately for the 5% condition. The percentage of DIF items correctly identified was 33% and 83% for LD and ZI respectively. Items that did not show DIF in the complete data set were all identified correctly with ZI. However, the percentages of DIF free items correctly identified by LD and FHDI were 90% and 81%.
FHDI correctly identified 67% of DIF items for the 15% missing case. The result was 50% for ZI in the same condition. ZI was better than FHDI in detecting DIF free items: ZI identified all DIF free items in the complete data set correctly, whereas the percentage of DIF free items correctly identified with FHDI was 81%.
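The two percentages reported above can be computed by comparing the set of items flagged in a treated data set with the reference set from the complete data. The sketch below is illustrative only: the function name is hypothetical, and the item sets are assumptions chosen to resemble the MH 5% FHDI condition (four reference DIF items detected, plus one false DIF on item14), not values read from the tables.

```python
def identification_rates(reference_dif, flagged, all_items):
    """Percent of reference DIF items flagged, and of DIF free items not flagged."""
    reference_dif = set(reference_dif)
    flagged = set(flagged)
    dif_free = set(all_items) - reference_dif
    pct_dif = 100 * len(reference_dif & flagged) / len(reference_dif)
    pct_free = 100 * len(dif_free - flagged) / len(dif_free)
    return pct_dif, pct_free


items = range(1, 18)              # 17 items
reference = {2, 10, 11, 17}       # assumed reference DIF items
flagged = {2, 10, 11, 14, 17}     # assumed result after imputation
pct_dif, pct_free = identification_rates(reference, flagged, items)
```

Under these assumptions all four reference items are recovered and 12 of the 13 DIF free items stay unflagged, giving 100% and about 92%. Reporting both rates separately keeps missed DIF and false DIF from masking each other.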

Discussion
This study was designed to examine the impact of missing data techniques (ZI, LD and FHDI) on the performance of the MH and MIMIC DIF detection methods under the MCAR missing data mechanism. The missing data percentage was set at 5% and 15%. The current study found that the percentage of DIF items identified with LD was quite low for both DIF detection methods. LD also produced the lowest correlations with reference DIF values regardless of the DIF detection method used. When the missing data percentage increased, LD reduced the sample size considerably, so that no clear DIF results were obtained and none could be reported. This limitation was also reported by Emenogu et al. (2010), who could not calculate all DIF statistics with LD in their research.
Another important finding was that, for both DIF detection methods, FHDI identified the highest percentage of DIF items in all conditions, while ZI was more successful than the other two methods in finding DIF free items. In terms of the correlations between the DIF statistics obtained from the complete data set and the other conditions, FHDI had the highest correlations, meaning that its DIF values were the closest to those obtained from the complete data. ZI produced slightly lower correlations than FHDI. With regard to DIF detection methods, the results indicated that the correlations are slightly higher for the MH method than for MIMIC in all conditions, which suggests that the MIMIC method was more adversely affected than MH by the presence of missing data.
In the present study, the percentage of DIF items correctly identified with ZI was lower for the cases with a higher missing data percentage regardless of the DIF detection method. Finch (2011b) reported that power rates for ZI decreased as the percentage of missing data increased in a study investigating the impact of missing data on nonuniform DIF detection. The most obvious finding of the current study was that LD was the least optimal method for identifying both DIF items and DIF free items in the complete data set. FHDI performed well in correctly identifying DIF items, whereas ZI performed better than the other two methods in identifying DIF free items in the complete data set. The choice of missing data method should therefore be based upon whether it is more important to detect items that truly show DIF or to avoid falsely flagging DIF free items.
In this investigation, we aimed to work only with real data, which was a limitation of the research. Only one sample size was used in the study, as we could not reach larger samples appropriate for our research. The relatively small sample size did not allow us to vary the missing data rate further; however, missing data may exceed 15% in real life situations. Research is also needed to determine the performance of missing data handling methods (especially FHDI, as it was the best of all) with larger samples and higher missing data rates.
As mentioned before, there has been increasing attention to DIF detection with the MIMIC method.
ISSN: 1309-6575 Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi, Journal of Measurement and Evaluation in Education and Psychology

The MIMIC method identified six DIF items. Four items (Item2, Item5, Item10 and Item11) favored the reference group and two items (Item8 and Item17) favored the focal group.

Akcan, R., Atalay-Kabasakal, K. / The impact of missing data on the performances of DIF detection methods
Mantel Haenszel
The Mantel Haenszel (MH) method, proposed by Holland and Thayer (1988), might be the most commonly used among contingency table methods. With this method, the probability of success on the item is compared for the members of two groups that are matched on the ability being measured. First, respondents are divided into levels depending on ability. Total test score is generally used for matching the respondents. For each score level, a 2x2 table is then created as in Table 1 (Clauser & Mazor, 1998).


Table 2
DIF Results for MH and MIMIC Methods in Complete Data Set

As can be seen from Table 2, two items displayed B level DIF and one item displayed A level DIF favoring the reference group with the MH method. One B level DIF item favoring the focal group was also detected.

Table 3
DIF Results for MH Under MCAR Mechanism
*: Item showing A level DIF; **: Item showing B level DIF; ***: Item showing C level DIF. Significance level: 0.05

Table 4
DIF Results for MIMIC Under MCAR Mechanism

Table 5
Percentage of Correctly Identified DIF Items and DIF Free Items by Missing Data Handling Methods

Table 6 provides correlations of MH and MIMIC DIF statistics in all conditions. As Table 6 shows, all coefficients are positive and significant at p<.01. FHDI has the highest correlations for the 5% and 15% missing cases with both DIF methods. This result indicates that FHDI produces the DIF values closest to those obtained from the complete data set. LD has the lowest correlation with both DIF methods. The correlations are slightly higher for the MH method than for MIMIC in all conditions.
However, most studies in the literature have not dealt in detail with DIF detection using CFA-based methods when missing data is present. The aim of this study was to contribute to the literature on DIF detection with missing data by comparing two different methods based on CTT and CFA respectively. Since the study was limited to the MH and MIMIC methods, it was not possible to see the performance of other methods based on CTT or IRT. Further work needs to be done to examine the performance of the MIMIC method with missing data. Researchers might explore the effect of sample size, DIF magnitude and other missing data treatment methods on DIF detection with MIMIC and compare those results with DIF detection methods other than MH.

Rabia Akcan - Conceptualization, investigation, methodology, analysis, writing & editing.
Kübra Atalay Kabasakal - Conceptualization, investigation, methodology, analysis, writing & editing, supervision.