Research Article

Comparison of Models for Simultaneous Estimation of Overall Score and Subscores: Estimation Accuracy, Reliability, and Classification Accuracy

Volume: 16 Number: 4 December 31, 2025

Abstract

This study compares the performance of the Multidimensional Item Response Theory (MIRT), Higher-Order IRT (HO-IRT), and Bifactor models for the simultaneous estimation of total and subscale scores in multidimensional tests. Using both simulated data and real data from an English proficiency exam, model performance was evaluated in terms of accuracy (RMSE), reliability, and classification accuracy. The simulation included 5,000 respondents, 120 items, and a four-dimensional structure, manipulating item format, test difficulty, and inter-dimensional correlation. Results indicated that MIRT consistently outperformed the other models, yielding the lowest RMSE and highest reliability and classification accuracy across conditions. HO-IRT also showed strong performance, while the Bifactor model underperformed, particularly in subscore estimation. Model performance was sensitive to test characteristics and dimensional relationships. Findings from the real data analysis supported the simulation results, underscoring the value of multidimensional modeling for diagnostic feedback and informed decision-making.
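The abstract's three evaluation criteria can be made concrete with a small sketch. The snippet below is purely illustrative and is not the authors' actual pipeline: it fabricates "true" and "estimated" abilities for 5,000 respondents on four correlated subscales (echoing the simulation design), then computes RMSE per subscale and pass/fail classification accuracy against an arbitrary cut score of θ = 0. All numbers (error SD, correlation, cut score) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 5,000 respondents, four correlated subscales,
# mirroring the simulation design described in the abstract.
n_respondents, n_dims = 5000, 4
theta_true = rng.multivariate_normal(
    mean=np.zeros(n_dims),
    # Unit variances with inter-dimensional correlation of 0.6 (assumed).
    cov=0.6 * np.ones((n_dims, n_dims)) + 0.4 * np.eye(n_dims),
    size=n_respondents,
)
# Stand-in "estimates": true abilities plus estimation error
# (a real study would obtain these from a fitted IRT model).
theta_hat = theta_true + rng.normal(scale=0.3, size=theta_true.shape)

# RMSE per subscale: root mean squared difference between
# estimated and true abilities (lower = more accurate).
rmse = np.sqrt(np.mean((theta_hat - theta_true) ** 2, axis=0))

# Classification accuracy per subscale: proportion of respondents
# placed on the same side of a cut score by both true and
# estimated abilities.
cut = 0.0
accuracy = np.mean((theta_hat >= cut) == (theta_true >= cut), axis=0)

print("RMSE per subscale:", np.round(rmse, 3))
print("Classification accuracy per subscale:", np.round(accuracy, 3))
```

In a study like this one, `theta_hat` would come from MIRT, HO-IRT, or bifactor model estimates, and the same two statistics would be compared across models and simulation conditions.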

Details

Primary Language

English

Subjects

Item Response Theory, Testing, Assessment and Psychometrics (Other)

Journal Section

Research Article

Publication Date

December 31, 2025

Submission Date

July 24, 2025

Acceptance Date

November 30, 2025

Published in Issue

Year 2025 Volume: 16 Number: 4

APA
Erdemir, A., & Atar, H. (2025). Comparison of Models for Simultaneous Estimation of Overall Score and Subscores: Estimation Accuracy, Reliability, and Classification Accuracy. Journal of Measurement and Evaluation in Education and Psychology, 16(4), 264-282. https://doi.org/10.21031/epod.1748835