A Comparison of the efficacies of differential item functioning detection methods

Münevver Başman

doi:10.21449/ijate.1135368

Abstract

One way to support the validity of a test is to check that all items function similarly across different groups of individuals. Differential item functioning (DIF) occurs when individuals with equal ability levels from different groups perform differently on the same test item. Several methods based on Item Response Theory and Classical Test Theory, each with its own advantages and limitations, are available to identify items that show DIF. This study compares the performance of five DIF detection methods: Mantel-Haenszel (MH), Logistic Regression (LR), Crossing simultaneous item bias test (CSIBTEST), Lord's chi-square (LORD), and Raju's area measure (RAJU). Their efficacy is examined under varying conditions of sample size, DIF ratio, and test length. Power and Type I error rates are evaluated in a simulation study with 100 replications per condition. The results show that LR and MH have the lowest Type I error rates and the highest power in detecting uniform DIF, and that CSIBTEST has power similar to that of MH and LR. Under DIF conditions, sample size, DIF ratio, test length, and their interactions affect Type I error and power rates.
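As a rough illustration of how the two outcome measures in such a simulation are usually tallied, the sketch below computes Type I error and power from per-replication detection flags. The abstract does not say how the rates were aggregated, so the pooled definition used here is an assumption, and the function name `error_and_power` and its arguments are hypothetical.

```python
from typing import Sequence, Set, Tuple


def error_and_power(
    flags: Sequence[Sequence[bool]],
    dif_items: Set[int],
) -> Tuple[float, float]:
    """Return (Type I error rate, power) pooled over replications.

    flags[r][i] is True when item i was flagged as DIF in replication r.
    dif_items holds the indices of items that were simulated with DIF.
    Assumption: rates are pooled over all items and replications.
    """
    false_pos = true_pos = n_null = n_dif = 0
    for rep in flags:
        for item, flagged in enumerate(rep):
            if item in dif_items:
                # Item truly has DIF: a flag counts as a correct detection.
                n_dif += 1
                true_pos += bool(flagged)
            else:
                # Item has no DIF: a flag counts as a false alarm.
                n_null += 1
                false_pos += bool(flagged)
    type1 = false_pos / n_null if n_null else float("nan")
    power = true_pos / n_dif if n_dif else float("nan")
    return type1, power


# Toy input: 2 replications, 4 items, item 0 simulated with DIF.
toy_flags = [
    [True, False, False, True],
    [False, False, False, False],
]
toy_type1, toy_power = error_and_power(toy_flags, dif_items={0})
```

In a full study the same tally would be repeated for each detection method and each combination of sample size, DIF ratio, and test length, with 100 rows of flags per condition.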

References

Apinyapibal, S., Lawthong, N., & Kanjanawasee, S. (2015). A comparative analysis of the efficacy of differential item functioning detection for dichotomously scored items among logistic regression, SIBTEST and raschtree methods. Procedia-Social and Behavioral Sciences, 191, 21-25. https://doi.org/10.1016/j.sbspro.2015.04.664
Atalay Kabasakal, K., Arsan, N., Gök, B., & Kelecioğlu, H. (2014). Comparing performances (type I error and power) of IRT likelihood ratio SIBTEST and Mantel-Haenszel methods in the determination of differential item functioning. Educational Sciences: Theory and Practice, 14(6), 2175-2193. https://doi.org/10.12738/estp.2014.6.2165
Atar, B. (2007). Differential item functioning analyses for mixed response data using IRT likelihood-ratio test, logistic regression, and GLLAMM procedures [Unpublished doctoral dissertation]. Florida State University.
Ayva Yörü, F.G., & Atar, H.Y. (2019). Determination of differential item functioning (DIF) according to SIBTEST, Lord's [Chi-squared], Raju's area measurement and Breslow-Day Methods. Journal of Pedagogical Research, 3(3), 139-150. https://doi.org/10.33902/jpr.v3i3.137
Camilli, G., & Shepard, L.A. (1994). Methods for identifying biased test items. Sage Publications.
Camilli, G. (2006). Test fairness. In R.L. Brennan (Ed.), Educational measurement (4th ed., pp. 221–257). Rowman & Littlefield.
De Ayala, R.J. (2009). The theory and practice of item response theory. The Guilford Press.
DeMars, C.E. (2009). Modification of the Mantel-Haenszel and logistic regression DIF procedures to incorporate the SIBTEST regression correction. Journal of Educational and Behavioral Statistics, 34, 149-170. https://doi.org/10.3102/1076998607313923

DeMars, C.E., & Lau, A. (2011). Differential item functioning detection with latent classes: how accurately can we detect who is responding differentially? Educational and Psychological Measurement, 71(4), 597-616. https://doi.org/10.1177/0013164411404221
Dorans, N.J., & Holland, P.W. (1992). DIF detection and description: Mantel‐Haenszel and standardization. ETS Research Report Series, 1992(1), i-40. https://doi.org/10.1002/j.2333-8504.1992.tb01440.x
Embretson, S.E., & Reise, S.P. (2000). Item response theory for psychologists. Lawrence Erlbaum Associates.
Erdem Keklik, D. (2014). Değişen madde fonksiyonunu belirlemede Mantel-Haenszel ve lojistik regresyon tekniklerinin karşılaştırılması [Comparison of Mantel-Haenszel and logistic regression techniques in detecting differential item functioning]. Journal of Measurement and Evaluation in Education and Psychology, 5(2), 12-25. https://doi.org/10.21031/epod.71099
Fidalgo, A.M., Mellenbergh, G.J., & Muñiz, J. (2000). Effects of amount of DIF, test length, and purification type on robustness and power of Mantel-Haenszel procedures. Methods of Psychological Research Online, 5(3), 43-53.
Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29, 278-295. https://doi.org/10.1177/0146621605275728
Gao, X. (2019). A comparison of six DIF detection methods [Unpublished master’s thesis]. University of Connecticut.
Gierl, M.J., Jodoin, M.G., & Ackerman, T.A. (2000, April 24-27). Performance of Mantel-Haenszel, simultaneous item bias test, and logistic regression when the proportion of DIF items is large [Paper presentation] In Annual Meeting of the American Educational Research Association (AERA), New Orleans, LA, United States.
Glas, C.A., & Meijer, R.R. (2003). A Bayesian approach to person fit analysis in item response theory models. Applied Psychological Measurement, 27(3), 217-233. https://doi.org/10.1177/0146621603027003003
Guilera, G., Gomez-Benito, J., Hidalgo, M.D., & Sanchez-Meca, J. (2013). Type I error and statistical power of the Mantel-Haenszel procedure for detecting DIF: A meta-analysis. Psychological Methods, 18(4), 553-571. https://doi.org/10.1037/a0034306
Güler, N., & Penfield, R.D. (2009). A comparison of the logistic regression and contingency table methods for simultaneous detection of uniform and nonuniform DIF. Journal of Educational Measurement, 46(3), 314-329. https://doi.org/10.1111/j.1745-3984.2009.00083.x
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory. Sage.
Hambleton, R.K., Clauser, B.E., Mazor, K.M., & Jones, R.W. (1993). Advances in the detection of differentially functioning test items. European Journal of Psychological Assessment, 9(1), 1-18.
Han, K.T., & Hambleton, R.K. (2014). User's manual for WinGen3: Windows software that generates IRT model parameters and item responses (Center for Educational Assessment Report No. 642). Amherst, MA: University of Massachusetts, Center for Educational Assessment.
Herrera, A., & Gomez, J. (2008). Influence of equal or unequal comparison group sample sizes on the detection of differential item functioning using the Mantel-Haenszel and logistic regression techniques. Quality & Quantity, 42(6), 739-755. https://doi.org/10.1007/s11135-006-9065-z
Hidalgo, M.D., López-Martínez, M.D., Gómez-Benito, J., & Guilera, G. (2016). A comparison of discriminant logistic regression and Item Response Theory Likelihood-Ratio Tests for Differential Item Functioning (IRTLRDIF) in polytomous short tests. Psicothema, 28(1), 83-88. https://doi.org/10.7334/psicothema2015.142
Holland, P.W., & Thayer, D.T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H.I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Erlbaum.
Holmes Finch, W., & French, B.F. (2007). Detection of crossing differential item functioning: A comparison of four methods. Educational and Psychological Measurement, 67(4), 565-582. https://doi.org/10.1177/0013164406296975
Jodoin, M.G., & Gierl, M.J. (2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329-349. https://doi.org/10.1207/S15324818AME1404_2
Kane, M.T. (2006). Validation. In R.L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Rowman & Littlefield.
Karasar, N. (2021). Bilimsel araştırma yöntemleri [Scientific research methods]. Nobel Yayınları.
Kaya, Y., Leite, W., & Miller, M.D. (2015). A comparison of logistic regression models for DIF detection in polytomous items: the effect of small sample sizes and non-normality of ability distributions. International Journal of Assessment Tools in Education, 2(1), 22-39. https://doi.org/10.21449/ijate.239563
Kelecioğlu, H., Karabay, B., & Karabay, E. (2014). Seviye belirleme sınavı’nın madde yanlılığı açısından incelenmesi [Investigation of placement test in terms of item biasness]. Elementary Education Online, 13(3), 934-953.
Kim, J. (2010). Controlling Type I error rate in evaluating differential item functioning for four DIF methods: Use of three procedures for adjustment of multiple item testing. Dissertation, Georgia State University.
Li, Y., Brooks, G.P., & Johanson, G.A. (2012). Item discrimination and Type I error in the detection of differential item functioning. Educational and Psychological Measurement, 72(5), 847-861. https://doi.org/10.1177/0013164411432333
Li, H.H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61(4), 647-677. https://doi.org/10.1007/BF02294041
Linn, R. L., & Gronlund, N. E. (2000). Measurement and assessment in teaching (8th Ed.). Upper Saddle River.
Lopez, G.E. (2012). Detection and classification of DIF types using parametric and nonparametric methods: A comparison of the IRT-likelihood ratio test, crossing-SIBTEST, and logistic regression procedures [Unpublished doctoral dissertation]. University of South Florida.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Routledge.
Magis, D., Beland, S., & Raiche, G. (2022). Collection of methods to detect dichotomous differential item functioning (DIF). Package ‘difR’.
Marañón, P.P., Garcia, M.I.B., & Costas, C.S.L. (1997). Identification of nonuniform differential item functioning: A comparison of Mantel-Haenszel and item response theory analysis procedures. Educational and Psychological Measurement, 57(4), 559-568. https://doi.org/10.1177/0013164497057004002
Mellenbergh, G.J. (1983). Conditional item bias methods. In S.H. Irvine & J.W. Berry (Eds.), Human assessment and cultural factors (pp. 293-302). Springer.
Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (pp. 13-103). MacMillan.
Narayanan, P., & Swaminathan, H. (1994). Performance of the Mantel-Haenszel and simultaneous item bias procedures for detecting differential item functioning. Applied Psychological Measurement, 18(4), 315-328. https://doi.org/10.1177/014662169401800403
Narayanan, P., & Swaminathan, H. (1996). Identification of items that show non-uniform DIF. Applied Psychological Measurement, 20(3), 257-274. https://doi.org/10.1177/014662169602000306
Oshima, T.C., & Morris, S.B. (2008). Raju's differential functioning of items and tests (DFIT). Educational Measurement: Issues and Practice, 27(3), 43-50. https://doi.org/10.1111/j.1745-3992.2008.00127.x
Osterlind, S.J., & Everson, H.T. (2009). Differential Item Functioning. Sage.
R Core Team. (2022). R: A language and environment for statistical computing [Computer software manual]. http://www.R-project.org/
Raju, N.S. (1988). The area between two item characteristic curves. Psychometrika, 53(4), 495-502. https://doi.org/10.1007/BF02294403
Reise, S.P., & Waller, N.G. (2002). Item response theory for dichotomous assessment data. In F. Drasgow & N. Schmitt (Eds.), Measuring and analyzing behavior in organizations: Advances in measurement and data analysis (pp. 88–122). Jossey-Bass.
Rockoff, D. (2018). A randomization test for the detection of differential item functioning [Unpublished doctoral dissertation]. The University of Arizona.
Rogers, H.J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105–116. https://doi.org/10.1177/014662169301700201
Roussos, L.A., & Stout, W.F. (1996). Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel Type I error performance. Journal of Educational Measurement, 33, 215-230. https://doi.org/10.1111/j.1745-3984.1996.tb00490.x
Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159-194. https://doi.org/10.1007/BF02294572
Swaminathan, H., & Rogers, H.J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361-370. https://doi.org/10.1111/j.1745-3984.1990.tb00754.x
Uttaro, T., & Millsap, R.E. (1994). Factors influencing the Mantel-Haenszel procedure in the detection of differential item functioning. Applied Psychological Measurement, 18(1), 15–25. https://doi.org/10.1177/014662169401800102
Uyar, Ş. (2015). Gözlenen gruplara ve örtük sınıflara göre belirlenen değişen madde fonksiyonunun karşılaştırılması [Comparing differential item functioning based on manifest groups and latent classes] [Unpublished doctoral dissertation]. Hacettepe University.
Uysal, İ., Ertuna, L., Ertaş, F.G., & Kelecioğlu, H. (2019). Performances based on ability estimation of the methods of detecting differential item functioning: A simulation study. Journal of Measurement and Evaluation in Education and Psychology, 10(2), 133-148. https://doi.org/10.21031/epod.534312
Vaughn, B.K., & Wang, Q. (2010). DIF trees: using classifications trees to detect differential item functioning. Educational and Psychological Measurement, 70(6), 941–952. https://doi.org/10.1177/0013164410379326
Zumbo, B.D. (1999). A handbook on the theory and methods of differential item functioning: Logistic regression modeling as a unitary framework for binary and Likert-type item scores. Ottawa.

Details

Primary Language

English

Subjects

Other Fields of Education

Journal Section

Research Article

Authors

Münevver Başman*
0000-0003-3572-7982
Türkiye

Publication Date

March 20, 2023

Submission Date

June 24, 2022

Acceptance Date

March 2, 2023

Published in Issue

Year 2023 Volume: 10 Number: 1

DOI

https://doi.org/10.21449/ijate.1135368

IZ

https://izlik.org/JA88JK99BF

Cite

APA

Başman, M. (2023). A Comparison of the efficacies of differential item functioning detection methods. International Journal of Assessment Tools in Education, 10(1), 145-159. https://doi.org/10.21449/ijate.1135368

Cited By

Type I error and power rates: A comparative analysis of techniques in differential item functioning

International Journal of Assessment Tools in Education

https://doi.org/10.21449/ijate.1368341