Comparison of Models for Simultaneous Estimation of Overall Score and Subscores: Estimation Accuracy, Reliability, and Classification Accuracy
Abstract
This study compares the performance of the multidimensional item response theory (MIRT), higher-order IRT (HO-IRT), and bifactor models for the simultaneous estimation of total and subscale scores in multidimensional tests. Model performance was evaluated on both simulated data and real data from an English proficiency exam in terms of estimation accuracy (RMSE), reliability, and classification accuracy. The simulation included 5,000 respondents, 120 items, and a four-dimensional structure, and manipulated item format, test difficulty, and inter-dimensional correlation. Results indicated that MIRT consistently outperformed the other models, yielding the lowest RMSE and the highest reliability and classification accuracy across conditions. HO-IRT also performed strongly, while the bifactor model underperformed, particularly in subscore estimation. Model performance was sensitive to test characteristics and dimensional relationships. The real-data analysis supported the simulation results, underscoring the value of multidimensional modeling for diagnostic feedback and informed decision-making.
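To make the abstract's primary evaluation criterion concrete, the sketch below computes per-dimension RMSE between true and estimated abilities under the stated simulation design (5,000 respondents, four dimensions). This is not the authors' code; the estimated abilities are mocked as true values plus noise purely to illustrate how the metric is calculated.

```python
# Minimal sketch (assumption: thetas are standard-normal and estimates are
# true values plus Gaussian error) of the RMSE criterion used to compare models.
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_dims = 5000, 4

theta_true = rng.standard_normal((n_persons, n_dims))              # true abilities
theta_est = theta_true + rng.normal(0, 0.3, (n_persons, n_dims))   # mock model estimates

# RMSE per dimension: square root of the mean squared estimation error,
# averaged over respondents; lower values indicate more accurate subscores.
rmse = np.sqrt(np.mean((theta_est - theta_true) ** 2, axis=0))
for d, r in enumerate(rmse, start=1):
    print(f"Dimension {d}: RMSE = {r:.3f}")
```

In a real comparison, `theta_est` would come from fitting each model (MIRT, HO-IRT, bifactor) to the simulated responses, and the same RMSE formula would be applied to each model's estimates.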
Details
Primary Language: English
Subjects: Item Response Theory, Testing, Assessment and Psychometrics (Other)
Journal Section: Research Article
Publication Date: December 31, 2025
Submission Date: July 24, 2025
Acceptance Date: November 30, 2025
Published in Issue: Year 2025, Volume 16, Number 4
Citation (APA)
Erdemir, A., & Atar, H. (2025). Comparison of Models for Simultaneous Estimation of Overall Score and Subscores: Estimation Accuracy, Reliability, and Classification Accuracy. Journal of Measurement and Evaluation in Education and Psychology, 16(4), 264-282. https://doi.org/10.21031/epod.1748835