Research Article

Detecting Differential Item Functioning: Item Response Theory Methods Versus the Mantel-Haenszel Procedure

Volume: 8 Number: 2 June 10, 2021

Abstract

This Monte Carlo study assessed Type I error in differential item functioning (DIF) analyses using Lord's chi-square (LC), the likelihood ratio test (LRT), and the Mantel-Haenszel (MH) procedure. Two research interests were investigated: item response theory (IRT) model specification in LC and the LRT, and the continuity correction in the MH procedure. This study extends the literature by investigating LC and the LRT under both correct and incorrect model-data fit and comparing those results to the MH procedure. There were three fixed factors (number of test items, IRT parameter estimation method, and item parameter equating) and four varied factors (IRT model used to generate and fit the data, sample size, and impact). The findings suggested that, based on Type I error rate, the MH procedure without the continuity correction performs best.
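The MH comparison described above hinges on whether a 0.5 continuity correction is subtracted before squaring the deviation of observed from expected counts. As a minimal sketch (not the authors' implementation; the function name and toy counts below are illustrative), the MH chi-square over K matched score levels can be computed as:

```python
import numpy as np

def mantel_haenszel_chi2(a, b, c, d, continuity_correction=True):
    """Mantel-Haenszel chi-square across K 2x2 tables, one per matched
    score level. Per level k: a_k = reference correct, b_k = reference
    incorrect, c_k = focal correct, d_k = focal incorrect."""
    a, b, c, d = (np.asarray(x, dtype=float) for x in (a, b, c, d))
    n_r = a + b          # reference-group size at each level
    n_f = c + d          # focal-group size at each level
    m1 = a + c           # number answering correctly at each level
    m0 = b + d           # number answering incorrectly at each level
    t = n_r + n_f        # total examinees at each level
    expected = n_r * m1 / t
    variance = n_r * n_f * m1 * m0 / (t**2 * (t - 1.0))
    diff = abs(a.sum() - expected.sum())
    if continuity_correction:
        diff = max(diff - 0.5, 0.0)   # Yates-style continuity correction
    return diff**2 / variance.sum()
```

With any nonzero deviation larger than 0.25, the corrected statistic is smaller than the uncorrected one, which is why removing the correction yields a less conservative (higher Type I error, but here better calibrated) test.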

Keywords


Details

Primary Language

English

Subjects

Studies on Education

Journal Section

Research Article

Authors

Gordon Brooks
0000-0002-2704-2505
United States

George Johanson
0000-0002-4253-1841
United States

Publication Date

June 10, 2021

Submission Date

April 30, 2020

Acceptance Date

April 4, 2021

Published in Issue

Year 2021 Volume: 8 Number: 2

APA
Diaz, E., Brooks, G., & Johanson, G. (2021). Detecting Differential Item Functioning: Item Response Theory Methods Versus the Mantel-Haenszel Procedure. International Journal of Assessment Tools in Education, 8(2), 376-393. https://doi.org/10.21449/ijate.730141
