Comparison of Models for Simultaneous Estimation of Overall Score and Subscores: Estimation Accuracy, Reliability, and Classification Accuracy
Abstract
This study compares the performance of the multidimensional item response theory (MIRT), higher-order IRT (HO-IRT), and bifactor models for the simultaneous estimation of total and subscale scores in multidimensional tests. Model performance was evaluated on both simulated data and real data from an English proficiency exam in terms of estimation accuracy (RMSE), reliability, and classification accuracy. The simulation included 5,000 respondents, 120 items, and a four-dimensional structure, and manipulated item format, test difficulty, and inter-dimensional correlation. Results indicated that MIRT consistently outperformed the other models, yielding the lowest RMSE and the highest reliability and classification accuracy across conditions. HO-IRT also performed strongly, while the bifactor model underperformed, particularly in subscore estimation. Model performance was sensitive to test characteristics and dimensional relationships. The real-data analysis supported the simulation results, underscoring the value of multidimensional modeling for diagnostic feedback and informed decision-making.
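To make the abstract's primary evaluation criterion concrete, the sketch below computes per-dimension RMSE between true and estimated abilities under the stated simulation design (5,000 respondents, four dimensions). This is not the authors' code; the estimated abilities are mocked as true values plus noise purely to illustrate how the metric is calculated.

```python
# Minimal sketch (assumption: thetas are standard-normal and estimates are
# true values plus Gaussian error) of the RMSE criterion used to compare models.
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_dims = 5000, 4

theta_true = rng.standard_normal((n_persons, n_dims))              # true abilities
theta_est = theta_true + rng.normal(0, 0.3, (n_persons, n_dims))   # mock model estimates

# RMSE per dimension: square root of the mean squared estimation error,
# averaged over respondents; lower values indicate more accurate subscores.
rmse = np.sqrt(np.mean((theta_est - theta_true) ** 2, axis=0))
for d, r in enumerate(rmse, start=1):
    print(f"Dimension {d}: RMSE = {r:.3f}")
```

In a real comparison, `theta_est` would come from fitting each model (MIRT, HO-IRT, bifactor) to the simulated responses, and the same RMSE formula would be applied to each model's estimates.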
Details
Primary Language: English
Subjects: Item Response Theory, Testing, Assessment and Psychometrics (Other)
Journal Section: Research Article
Publication Date: December 31, 2025
Submission Date: July 24, 2025
Acceptance Date: November 30, 2025
Published in Issue: Year 2025, Volume 16, Number 4
Citation (APA)
Erdemir, A., & Atar, H. (2025). Comparison of Models for Simultaneous Estimation of Overall Score and Subscores: Estimation Accuracy, Reliability, and Classification Accuracy. Journal of Measurement and Evaluation in Education and Psychology, 16(4), 264-282. https://doi.org/10.21031/epod.1748835