Performances of MIMIC and Logistic Regression Procedures in Detecting DIF

In this study, the differential item functioning (DIF) detection performances of the multiple indicators, multiple causes (MIMIC) and logistic regression (LR) methods for dichotomous data were investigated. The two methods were compared by calculating Type I error rates and power for each simulation condition. The conditions varied in the study were sample size (2000 and 4000 respondents), ability distribution of the focal group [N(0, 1) and N(-0.5, 1)], and percentage of items with DIF (10% and 20%). The ability distribution of the reference group [N(0, 1)], the ratio of focal group to reference group (1:1), test length (30 items), and the difference in difficulty parameters between groups for the items that contain DIF (0.6) were held constant. When the two methods were compared according to their Type I error rates, the change in sample size had a greater effect on the MIMIC method, whereas the change in the percentage of items with DIF had a greater effect on the LR method. When the two methods were compared according to their power, sample size was the most influential variable for both methods.


INTRODUCTION
Test items may be biased because they may measure undesired constructs along with the desired ones. An item may also be related to one or more factors other than the one of interest. Factors that are irrelevant to the construct being measured may affect the performances of individuals. This issue is known as test bias. While test bias concerns test scores and the fairness of a test as a whole, item bias concerns the relationship between answering an item correctly and group membership; hence, item bias is tied to a specific item. Differential item functioning (DIF), which is a statistical method used in item bias analysis, has been the subject of many recent studies (Zumbo, 1999).
DIF occurs when respondents who are at the same ability level but from different groups have different item response probabilities on a specific item (Crane, Belle & Larson, 2004; Mazor, Kanjee & Clauser, 1995). In other words, an item with DIF displays different statistical properties in different groups for individuals who are at the same ability level (Holland & Wainer, 1993). Many methods have been developed for detecting test items with DIF. Some DIF detection methods used for dichotomously scored items are: the chi-square test based on item response theory (Lord, 1980), standardization (Dorans & Kulick, 1986), Mantel-Haenszel (MH) (Holland & Thayer, 1988), the item response theory likelihood ratio test (IRT-LRT) (Thissen, Steinberg & Wainer, 1988), logistic regression (LR) (Swaminathan & Rogers, 1990), the simultaneous item bias test (SIBTEST) (Shealy & Stout, 1993), and the multiple indicators, multiple causes (MIMIC) model (Finch, 2005; Oort, 1998). Fleishman, Spector, and Altman (2002) noted that testing DIF in an IRT framework becomes very complicated when there are more than two groups. They also noted that the MIMIC model has the advantage of including multiple exogenous variables in the analysis simultaneously. Because it allows a simultaneous analysis of several groups in a single framework, the MIMIC model seems to be very useful (Muthen, 1988). These advantages have made the method an interesting subject for DIF research. The MIMIC method is quite new relative to the other methods mentioned above, and there are few studies in the literature involving it, especially for dichotomous data (see Finch, 2005). Some recent studies on this method were conducted by Fleishman et al. (2002), Woods (2009), Wang, Shih, and Yang (2009), Woods, Oltmanns, and Turkheimer (2009), and Wang and Shih (2010).
Considering these studies, it is reasonable to investigate the circumstances under which the MIMIC method is more effective in DIF detection. The aim of the current study is to compare the performance of the MIMIC method with that of the LR method, a commonly used method, in detecting items with DIF and to interpret the results of the two methods. The DIF detection methods used in this study are explained in detail in the following sections.

Logistic Regression DIF Detection Method
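As specified by Swaminathan and Rogers (1990), the LR model for detecting differential item functioning between the two groups of interest can be expressed as:

$$P(u_{ij} = 1 \mid \theta_{ij}) = \frac{\exp(\beta_{0j} + \beta_{1j}\theta_{ij})}{1 + \exp(\beta_{0j} + \beta_{1j}\theta_{ij})} \tag{1}$$

where $u_{ij}$ is the response of the ith individual in the jth group to the item, $\beta_{0j}$ is the intercept parameter for the jth group, $\beta_{1j}$ is the slope parameter for the jth group, and $\theta_{ij}$ is the ability of the ith individual in the jth group.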
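A test of uniform DIF under this model can be sketched in a few lines of code. The sketch below is illustrative only and is not the SAS procedure used in this study. It assumes a common reparameterization in which the group-specific intercepts are replaced by a single intercept plus a group term, and it tests that term with a one-degree-of-freedom likelihood ratio statistic. The function names, the simulated item, and the use of the true ability as the matching variable (in practice an observed total score is typical) are all assumptions made for the example.

```python
import math
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-8):
    """Fit a logistic regression by Newton-Raphson; return (coefficients, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)                 # score vector
        hess = X.T @ (X * w[:, None])        # observed information matrix
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, loglik

def lr_uniform_dif(item, matching, group):
    """Likelihood ratio test (1 df) for a group main effect on one item."""
    ones = np.ones(len(item))
    X0 = np.column_stack([ones, matching])           # compact model: ability only
    X1 = np.column_stack([ones, matching, group])    # augmented model: ability + group
    _, ll0 = fit_logistic(X0, item)
    beta1, ll1 = fit_logistic(X1, item)
    g2 = max(2.0 * (ll1 - ll0), 0.0)
    p_value = math.erfc(math.sqrt(g2 / 2.0))         # chi-square survival function, df = 1
    return beta1[2], g2, p_value

# Hypothetical single item that is 0.6 logits harder for the focal group.
rng = np.random.default_rng(0)
n = 2000
theta = rng.normal(size=2 * n)
group = np.repeat([0.0, 1.0], n)                     # 0 = reference, 1 = focal
logit = theta - 0.6 * group
item = (rng.random(2 * n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

coef, g2, p = lr_uniform_dif(item, theta, group)
print(f"group effect = {coef:.2f}, G2 = {g2:.1f}, p = {p:.4g}")
```

An item would be flagged for uniform DIF when the p-value falls below the chosen alpha level; testing nonuniform DIF would additionally require an ability-by-group interaction term, which this sketch omits.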

MIMIC DIF Detection Method
The MIMIC method, which is newer than LR, is based on confirmatory factor analysis (CFA) (Finch, 2005). As outlined by Finch (2005), in the DIF context the MIMIC model is given by Equation 2:

$$y_i^{*} = \lambda_i \eta + \beta_i z_k + \varepsilon_i \tag{2}$$

where $y_i^{*}$ is the latent response variable for the ith item (when $y_i^{*} > \tau_i$, $y_i$ is equal to 1, otherwise $y_i$ is equal to 0; $\tau_i$ is the threshold parameter and is related to item difficulty for the ith item), $\eta$ is the latent trait variable that the test aims to measure, $\lambda_i$ is the factor loading, $\varepsilon_i$ is random error, $z_k$ is the grouping variable that indicates group membership, and $\beta_i$ is the slope that relates $z_k$ to $y_i^{*}$ (Finch, 2005; Wang et al., 2009).
MIMIC is a method that allows DIF analyses to be conducted with multiple grouping variables, and the z symbol in Figure 1 is defined as a vector of these grouping variables. The z vector may have continuous or categorical values. Thus, the MIMIC method is more flexible than traditional DIF detection methods (MH, SIBTEST, IRT-LRT, etc.) that use only one categorical grouping variable (Wang et al., 2009). DIF detection with the MIMIC method involves evaluating both the direct and the indirect effects of a grouping variable. The indirect effect of the grouping variable (z) on item responses through the latent trait (η) indicates whether the mean of this latent variable differs across the groups; it therefore accounts for group differences on the latent trait. The direct effect of the grouping variable (z) on the item response (y_i), i.e., β_i ≠ 0, indicates whether response probabilities differ across the groups. Because this direct effect is evaluated after controlling for group differences in the mean of the latent trait, it constitutes the test of uniform DIF (Finch, 2005).
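The direct and indirect paths described above can be written out as a pair of equations. This is a sketch in the notation of Equation 2; the structural coefficient $\gamma$ and the disturbance $\zeta$ are symbols introduced here for illustration and are not taken from the original text.

```latex
\begin{aligned}
\eta    &= \gamma z_k + \zeta
        && \text{(indirect path: group difference in the latent mean)} \\
y_i^{*} &= \lambda_i \eta + \beta_i z_k + \varepsilon_i
        && \text{(direct path: uniform DIF if } \beta_i \neq 0\text{)} \\
y_i^{*} &= (\lambda_i \gamma + \beta_i)\, z_k + \lambda_i \zeta + \varepsilon_i
        && \text{(substituting the first line into the second)}
\end{aligned}
```

Under these assumptions the total group effect on item i splits into $\lambda_i \gamma$, which reflects a true difference in the latent trait, and $\beta_i$, which remains after that difference is controlled and is therefore the quantity tested for uniform DIF.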

Journal of Measurement and Evaluation in Education and Psychology
DIF detection models to be used in bias studies must be appropriate for the test used and for the properties of the groups to which the test is applied. This study used different conditions for dichotomous data to investigate the circumstances under which the MIMIC method produces more accurate results in DIF detection. The conditions used in the current study differ from previous studies in terms of the levels of these three conditions: sample size, ability distribution across groups, and percentage of items with DIF. It is an important question whether the MIMIC method works similarly in cases with different sample sizes (Wang & Shih, 2010). Therefore, different sample sizes in the study were compared. The data used in the study were produced according to the three-parameter logistic model (3PLM), and the test length was taken as 30 items to show similarity with actual applications. In addition, the focus of this study was on the assessment of uniform DIF.
In this study, the MIMIC method was compared to the LR method, which is a relatively more traditional method. Specifically, the study compared how the Type I error rates and power of the MIMIC and LR DIF detection methods changed according to sample size, ability distributions of the groups, and percentage of items with DIF when detecting DIF items on dichotomous tests. The research questions were as follows:
1. How do Type I error rates and power of MIMIC and LR DIF detection methods differ according to sample size?
2. How do Type I error rates and power of MIMIC and LR DIF detection methods differ according to ability distributions of the groups?
3. How do Type I error rates and power of MIMIC and LR DIF detection methods differ according to percentage of items with DIF?

Simulation Conditions and Data Generation
This is a simulation study of DIF detection with the MIMIC and logistic regression methods for dichotomous data. The simulation conditions investigated differ from those of previous studies in which the MIMIC model was used.

The conditions that were kept constant throughout the study
For all conditions, the ability parameters of the individuals in the reference group were generated from the standard normal distribution, N(0, 1). Furthermore, 30 dichotomously scored (0 or 1) responses were produced for each individual. For the items with DIF, the difference in item difficulty parameters between the groups was fixed at 0.6 units against the focal group to form medium DIF. The ratio of the focal group to the reference group (1:1) was also kept constant.
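The 0.6-unit shift can be made concrete by writing the item response function for each group. The display below is a sketch that assumes the usual three-parameter logistic form with discrimination $a_i$, difficulty $b_i$, and pseudo-guessing $c_i$; these symbols are introduced here for illustration, and the text does not say whether a scaling constant such as $D = 1.7$ was included, so none is shown.

```latex
\begin{aligned}
P_i^{R}(\theta) &= c_i + \frac{1 - c_i}{1 + \exp\!\left[-a_i(\theta - b_i)\right]} \\
P_i^{F}(\theta) &= c_i + \frac{1 - c_i}{1 + \exp\!\left[-a_i\bigl(\theta - (b_i + 0.6)\bigr)\right]}
\end{aligned}
```

For a DIF item, $P_i^{F}(\theta) < P_i^{R}(\theta)$ at every ability level, which is what makes the DIF uniform and directed against the focal group; for non-DIF items the two functions coincide.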
The conditions that were varied throughout the study
One of the conditions varied in this study was sample size. Two large sample sizes were used: 2000 (R: 1000, F: 1000) and 4000 (R: 2000, F: 2000). Finch (2005) found that the MIMIC method produces Type I error rates higher than the nominal alpha level of .05 for a shorter test (i.e., 20 items) answered by a sample of 1000 (R: 500, F: 500) individuals under the 3PL model. Based on these findings, larger sample sizes were used for the 30-item test under the 3PL model considered in this study. The ability distribution of the focal group was also varied, with two levels: N(0, 1) and N(-0.5, 1). At the first level, the reference group and the focal group had the same ability distribution. At the second level, the mean ability of the focal group was lower than that of the reference group. The third condition varied in this study was the percentage of items with DIF, with two levels: 10% (3 items) and 20% (6 items). Items with DIF were kept the same throughout the test. In the 10% condition, DIF was introduced for items 4, 15, and 27; in the 20% condition, it was introduced for items 1, 4, 15, 18, 26, and 27. Crossing the levels of each condition yielded a total of 8 simulation conditions.
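The fully crossed design can be enumerated mechanically. The following sketch lists the 8 conditions from the levels stated above; the dictionary keys and variable names are hypothetical and chosen only for readability.

```python
from itertools import product

sample_sizes = [2000, 4000]          # total N, split 1:1 between reference and focal groups
focal_ability_means = [0.0, -0.5]    # focal group ability is N(mean, 1)
dif_items = {                        # percentage of DIF items -> 1-based item numbers
    10: [4, 15, 27],
    20: [1, 4, 15, 18, 26, 27],
}

conditions = [
    {
        "n_total": n,
        "n_per_group": n // 2,
        "focal_mean": mu,
        "pct_dif": pct,
        "dif_items": dif_items[pct],
    }
    for n, mu, pct in product(sample_sizes, focal_ability_means, dif_items)
]

for cond in conditions:
    print(cond)
```

Each dictionary describes one cell of the 2 x 2 x 2 design, which would then be replicated as described in the data generation step.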
For each simulation condition, the data were generated for dichotomously scored (0/1) items using a 3PLM in R 3.0.2 (R Core Team, 2013). The data generation was replicated 100 times for each condition. The item parameters used in this study were selected randomly from the item parameters used in Finch's (2005) study. The selected parameters are shown in Table 1.
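The data generation step can be illustrated with a short sketch. The study generated its data in R; the Python version below is only an analogous illustration. The function name `simulate_3pl`, the random item parameters (the study used the values in Table 1, which are not reproduced here), and the absence of a scaling constant in the logistic function are all assumptions.

```python
import numpy as np

def simulate_3pl(n_per_group, focal_mean, dif_items, a, b, c, shift=0.6, seed=None):
    """Generate 0/1 responses for a reference and a focal group under a 3PL model.

    dif_items holds 1-based item numbers whose difficulty is raised by `shift`
    for the focal group (uniform DIF against the focal group).
    """
    rng = np.random.default_rng(seed)
    a, b, c = (np.asarray(x, dtype=float) for x in (a, b, c))
    theta = np.concatenate([
        rng.normal(0.0, 1.0, n_per_group),          # reference group: N(0, 1)
        rng.normal(focal_mean, 1.0, n_per_group),   # focal group: N(focal_mean, 1)
    ])
    group = np.repeat([0, 1], n_per_group)          # 0 = reference, 1 = focal
    b_focal = b.copy()
    b_focal[np.array(dif_items, dtype=int) - 1] += shift
    # Each respondent sees the difficulty vector of his or her own group.
    b_used = np.where(group[:, None] == 1, b_focal, b)
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta[:, None] - b_used)))
    responses = (rng.random(p.shape) < p).astype(int)
    return responses, group

# Hypothetical item parameters for a 30-item test (not the Table 1 values).
param_rng = np.random.default_rng(1)
n_items = 30
a = param_rng.uniform(0.8, 2.0, n_items)
b = param_rng.normal(0.0, 1.0, n_items)
c = np.full(n_items, 0.2)

responses, group = simulate_3pl(1000, -0.5, [4, 15, 27], a, b, c, seed=42)
print(responses.shape)
print(responses[group == 0].mean(), responses[group == 1].mean())
```

Calling the function 100 times with different seeds for each of the 8 conditions would mirror the replication scheme described above.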

Data Analysis Procedures and Evaluation Criteria
In the DIF analyses of the data, Mplus 6.12 (Muthén & Muthén, 1998) was used for the MIMIC method and SAS 9.1.3 (SAS Institute, 2007) was used for the logistic regression method. The DIF analyses were conducted using a pairwise approach in which the groups are compared with each other (i.e., the focal group is compared with the reference group) (Sari & Huggins, 2014).
In the study, the effects of sample size, ability distribution of the focal group, and percentage of items with DIF on Type I error rates and power were investigated. The level of significance (α level) was set at .05 in detecting items with DIF. A Type I error is the misclassification of an item without DIF as an item with DIF. In the 10% DIF condition there were 27 non-DIF items, whereas in the 20% DIF condition there were 24 non-DIF items. The Type I error rate was calculated as the percentage of non-DIF items that were falsely detected as DIF items. Power, on the other hand, is the correct classification of an item with DIF as an item with DIF. In the 10% DIF condition there were 3 DIF items, whereas in the 20% DIF condition there were 6 DIF items. Power was calculated as the percentage of DIF items that were correctly detected as DIF items. Type I error and power are equally important in DIF research (Vaughn & Wang, 2010). According to Cohen and Cohen (1983), when investigators need to set the power, it is reasonable for them to choose a value in the .70-.90 range. In the current study, the desired value for power was .70 and above.
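The two evaluation criteria reduce to simple proportions over items and replications. The sketch below shows one way to compute them from a table of flagging decisions; the function name and the two-replication example are hypothetical, and pooling over replications (rather than averaging per-replication rates) is an assumption that gives the same value when every replication has the same number of items.

```python
def error_rate_and_power(flags, dif_items):
    """Compute (Type I error rate, power) as proportions.

    flags: list of replications, each a list of booleans with one entry per item
           (True = item flagged as DIF at the chosen alpha level).
    dif_items: 1-based numbers of the items simulated with DIF.
    """
    dif = set(dif_items)
    false_pos = non_dif_total = hits = dif_total = 0
    for rep in flags:
        for item_no, flagged in enumerate(rep, start=1):
            if item_no in dif:
                dif_total += 1
                hits += bool(flagged)        # DIF item correctly detected
            else:
                non_dif_total += 1
                false_pos += bool(flagged)   # non-DIF item falsely detected
    return false_pos / non_dif_total, hits / dif_total

# Two hypothetical replications of a 30-item test with DIF in items 4, 15, and 27.
dif_item_numbers = [4, 15, 27]
rep1 = [i in (2, 4, 15, 27) for i in range(1, 31)]   # all DIF items found, item 2 falsely flagged
rep2 = [i in (4, 15) for i in range(1, 31)]          # item 27 missed, no false flags

type1, power = error_rate_and_power([rep1, rep2], dif_item_numbers)
print(type1, power)
```

In this example 1 of the 54 non-DIF decisions is a false positive and 5 of the 6 DIF decisions are hits.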

Type I Error Rate
Type I error rates were calculated for each combination of sample size, ability distribution of the focal group, and percentage of items with DIF, and are given in Table 2. The main finding of this study was that sample size was an important factor in DIF analyses conducted with the MIMIC and LR methods. As the sample size increased from 2000 to 4000, the Type I error rates decreased for the MIMIC method but increased for the LR method when the other conditions of the study were equal. For the MIMIC method, the lowest rate was obtained when the sample size was 4000, the percentage of items with DIF was 10%, and the ability distribution of both groups was standard normal, N(0, 1); the highest rate was obtained when the sample size was 2000, the percentage of items with DIF was 20%, and the ability distribution of both groups was standard normal, N(0, 1). For the LR method, on the other hand, the lowest rate was obtained when the sample size was 2000, the percentage of items with DIF was 10%, and the ability distribution of the focal group was N(-0.5, 1); the highest rate was obtained when the sample size was 4000, the percentage of items with DIF was 20%, and the ability distribution of both groups was standard normal, N(0, 1).
The second important finding was that the percentage of DIF items was an important factor that affected the Type I error rates. As the percentage of DIF items increased from 10% to 20%, Type I error rates remained very similar for the MIMIC method but increased for the LR method when the other conditions of the study were equal. According to the study results, the percentage of DIF items had a greater effect on the Type I error rates of the LR method.
The third finding was that the change in the ability distribution of the focal group did not have an important effect on Type I error rates for either method.

Power
Table 3 presents the power values for the two DIF detection methods for all conditions included in the study. The acceptable power rate for this study was .70 and above. In general, both methods had power rates above this level for all conditions. The power rate of the MIMIC method was quite high for conditions with a sample size of 4000 respondents. The power rate of the LR method, on the other hand, was quite high for conditions in which the sample size was large and the ability distribution of both groups was standard normal, N(0, 1). The standard definition of power at a specified level of alpha is not meaningful in cases where Type I error rates are high (Finch, 2005). However, all power results were included in this study for comparison purposes. The power rates are shown in italics for cases where the Type I error rate was higher than .10. Across all conditions, both methods had sufficiently high power, and power increased further as the sample size increased. The MIMIC method came closest to perfect power when the sample size was 4000 respondents, the ability distributions of the reference and focal groups were both standard normal, and the percentage of items with DIF was 20%. The power of the MIMIC method was higher than that of the LR method in every condition except one: the condition with 2000 respondents, standard normal ability distributions for both the reference and focal groups, and 10% of items with DIF. The change in the ability distribution of the focal group affected the power of the LR method more than the power of the MIMIC method for almost all conditions. In addition, the change in the percentage of items with DIF did not substantially change the power of either method.

DISCUSSION and CONCLUSION
In this study, the performances of the MIMIC and LR methods were compared according to their Type I error rates and power. The MIMIC method produced lower Type I error rates than the LR method in conditions where the sample size was larger (4000 respondents); the LR method produced lower Type I error rates than the MIMIC method in conditions where the percentage of items with DIF was lower (10%) and the sample size was smaller (2000 respondents). In general, the Type I error rates of the MIMIC method were observed to be lower than those of the LR method. However, for both methods, Type I error rates exceeded the acceptable alpha level (α = .05) in all conditions. Specifically, while the increase in sample size substantially reduced the Type I error rate of the MIMIC method for all conditions, its effect on the Type I error rate of the LR method depended on the percentage of items with DIF. The change in sample size had a very small effect on the Type I error rate of the LR method in the 10% DIF conditions, but it caused a substantial increase in the Type I error rate of this method in the 20% DIF conditions. In the study conducted by Finch and French (2007), Type I error rates of the LR and CFA methods in detecting items with nonuniform DIF were not substantially affected by the increase in sample size. The current study yielded a similar result for the LR method only in the 10% DIF conditions. The current results thus suggest that the LR method is more sensitive to sample size when the percentage of items with DIF is high, whereas the MIMIC method is affected by sample size in the same manner in all conditions. The difference between the CFA-based results of the current study and those of Finch and French (2007) can be attributed to the type of DIF: they focused on nonuniform DIF and questioned the usefulness of the CFA method for identifying this type of DIF. The MIMIC method is also based on CFA, and it is capable of detecting uniform DIF, as also stated by Woods (2009) and Woods et al. (2009).
On the other hand, in the current study the increase in the percentage of items with DIF did not substantially affect the Type I error rate of the MIMIC method but increased that of the LR method. Finch's (2005) results show that, for the MIMIC method, the effect of the percentage of items with DIF was reduced in the longer test condition for both sample sizes, 600 and 1000 respondents. In the current study, the effect of the percentage of items with DIF was already quite low for both sample sizes (2000 and 4000 examinees), but the Type I error rates were still not as small as desired. Combining the results of these two studies, it can be concluded that the MIMIC method needs either large samples or relatively small samples with longer tests to reduce the effect of the percentage of items with DIF.
Another result obtained from this study is that the difference in the ability distribution of the focal group did not substantially affect the Type I error rates of either method. In conclusion, when the two methods were compared in terms of Type I error rates, the change in sample size had a greater effect on the MIMIC method, while the change in the percentage of items with DIF had a greater effect on the LR method.
When the results were examined in general, the power of both methods for all conditions was above the acceptable level (.70). For conditions where the sample size was higher, the power results of the MIMIC method were quite high. The power of the LR method, on the other hand, was quite high for conditions where the sample size was large and the ability distribution of both groups showed a standard normal distribution. The power results of the MIMIC method were higher than those of the LR method, except for a single condition. This condition was the one in which the sample comprised 2000 respondents, the ability distributions of the reference and focal groups showed a standard normal distribution, and the percentage of items with DIF was 10%.
The increase in sample size increased the power of both methods. A focal group ability distribution that differed from that of the reference group decreased the power of both methods, and this reduction was larger for the LR method in almost every condition. The increase in the percentage of items with DIF increased the power of both methods to a small extent. Overall, in terms of power, sample size was the most influential variable for both methods.
Specifically, the change in sample size had a strong effect on the power of the MIMIC method, which increased as the sample size increased. Finch (2005) concluded that the power of the MIMIC method for the 2PLM was generally as high as that of the classical methods, and in some conditions even higher than that of the SIBTEST and MH methods. Similar results were obtained in this study for the 3PLM: the power of the MIMIC method was higher than that of the LR method for almost all conditions. In the study conducted by Finch and French (2007), the power of the LR and CFA methods in detecting items with nonuniform DIF was below .70 for all conditions. In the current study, power was over .70 for both methods in all conditions. Finch and French (2007) reported that the power of the LR method increased as the sample size increased, but that the power of the CFA method decreased or stayed the same. In the current study, the power of both the LR and MIMIC methods increased as the sample size increased. The two studies therefore agree on the increase in the power of the LR method with sample size. However, the results differ with respect to the change in the power of the MIMIC method, which is based on CFA. As mentioned before, this difference between the two studies can be attributed to the type of DIF (uniform or nonuniform) examined.
In this study, three main conditions and eight sub-conditions were considered, with two different sample sizes, two different ability distributions for the focal group, and two different percentages of items with DIF. The number of items in the test was kept constant for all conditions. In future studies, the number of items in the test can be increased to see how the results are affected in long tests. As the comparison of the current and previous research suggests, test length may have an important effect on the MIMIC method.
How the MIMIC method performs in DIF detection at different sample sizes is an important issue. Two different sample sizes, 2000 and 4000 individuals, were used in the study. However, the desired Type I error rates could not be achieved even with a sample size of 4000 individuals. Future studies can therefore be conducted with larger samples to investigate the ideal sample size for the MIMIC method.
In the study, the ratio between the reference and focal group sizes was taken as 1:1. However, during the actual examinations, there can be different situations regarding the proportions of sample size of these two groups. Therefore, studies can be done using different ratios. Furthermore, the study was conducted with 3PL model-based data. Similar work can be conducted with 2PL model-based data, and comparisons can be made between these studies.
It is thought that this study will be a reference to the studies on DIF detection through the MIMIC method and that it will make it easy for researchers to decide the appropriate DIF detection method according to sample size and ability distributions in the analysis of the actual test results.
The aim of this study is to provide a reliable source to researchers in selecting DIF detection techniques that are appropriate for the test to be used and the properties of the test group. Thus, with the help of more reliable DIF detection techniques, tests can be made fairer.
Based on the results obtained from this research, the LR method can be suggested for DIF analyses performed on smaller samples (e.g., 2000 respondents) with a small proportion of DIF items (e.g., 10% of test items), and the MIMIC method for DIF analyses performed on samples of approximately 4000 respondents or more. After items with DIF have been detected using these methods, it is advisable to seek expert opinion to determine whether these items are biased.