Year 2021, Volume 8, Issue 2, Pages 376-393, 2021-06-10

Detecting Differential Item Functioning: Item Response Theory Methods Versus the Mantel-Haenszel Procedure

Emily DİAZ [1], Gordon BROOKS [2], George JOHANSON [3]

This Monte Carlo study assessed Type I error rates in differential item functioning (DIF) analyses using Lord’s chi-square (LC), the likelihood ratio test (LRT), and the Mantel-Haenszel (MH) procedure. Two research questions were investigated: the effect of item response theory (IRT) model specification on LC and the LRT, and the effect of the continuity correction on the MH procedure. The study extends the literature by examining LC and the LRT under both correct and incorrect model-data fit and comparing those results with the MH procedure. There were three fixed factors (number of test items, IRT parameter estimation method, and item parameter equating) and four varied factors (the IRT model used to generate the data, the IRT model used to fit the data, sample size, and impact). The findings suggested that the MH procedure without the continuity correction performed best in terms of Type I error rate.
Differential item functioning, Monte Carlo simulation, Type I error rate, Mantel-Haenszel
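For readers unfamiliar with the continuity correction mentioned in the abstract, the following is a minimal sketch of the textbook MH chi-square statistic computed with and without it. It is not the authors' simulation code; the function name `mh_chi_square`, the `(a, b, c, d)` table layout, and the example counts are assumptions made for illustration only.

```python
import math


def mh_chi_square(tables, continuity=True):
    """Mantel-Haenszel chi-square for one studied item (illustrative sketch).

    tables: iterable of (a, b, c, d) counts, one 2x2 table per matched
            score level, where
              a = reference group, correct    b = reference group, incorrect
              c = focal group, correct        d = focal group, incorrect
    continuity: if True, subtract 0.5 from |sum(a) - sum(E[a])| before squaring.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        if t < 2:
            continue  # a level with fewer than 2 examinees carries no information
        n_ref, n_foc = a + b, c + d      # group sizes at this score level
        m1, m0 = a + c, b + d            # numbers correct / incorrect
        sum_a += a
        sum_e += n_ref * m1 / t          # expected a under the no-DIF hypothesis
        sum_v += n_ref * n_foc * m1 * m0 / (t * t * (t - 1))
    if sum_v == 0:
        raise ValueError("no score level contributes any variance")
    correction = 0.5 if continuity else 0.0
    # Assumption: the corrected difference is floored at zero, a common convention.
    diff = max(abs(sum_a - sum_e) - correction, 0.0)
    stat = diff ** 2 / sum_v
    # Upper-tail probability of a chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts at three matched score levels.
example_tables = [(20, 10, 15, 15), (30, 5, 25, 10), (10, 20, 8, 22)]
stat_cc, p_cc = mh_chi_square(example_tables, continuity=True)
stat_nc, p_nc = mh_chi_square(example_tables, continuity=False)
```

Because the correction shrinks the numerator, the corrected statistic can never exceed the uncorrected one, so the corrected test rejects the null hypothesis no more often than the uncorrected test. That relationship is what makes the choice of correction relevant to Type I error rates.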
Primary Language en
Subjects Education, Scientific Disciplines
Published Date June
Journal Section Articles

Orcid: 0000-0001-9460-8647
Author: Emily DİAZ (Primary Author)
Institution: Westat
Country: United States

Orcid: 0000-0002-2704-2505
Author: Gordon BROOKS
Institution: Ohio University
Country: United States

Orcid: 0000-0002-4253-1841
Author: George JOHANSON
Institution: Ohio University
Country: United States


Publication Date: June 10, 2021

APA Diaz, E., Brooks, G., & Johanson, G. (2021). Detecting Differential Item Functioning: Item Response Theory Methods Versus the Mantel-Haenszel Procedure. International Journal of Assessment Tools in Education, 8(2), 376-393. DOI: 10.21449/ijate.730141