Research Article

Comparison of Models for Simultaneous Estimation of Overall Score and Subscores: Estimation Accuracy, Reliability, and Classification Accuracy

Year 2025, Volume 16, Issue 4, pp. 264–282, 31.12.2025

Abstract

This study compares the performance of the Multidimensional Item Response Theory (MIRT), Higher-Order IRT (HO-IRT), and bifactor models for the simultaneous estimation of total and subscale scores in multidimensional tests. Using both simulated data and real data from an English proficiency exam, model performance was evaluated in terms of estimation accuracy (RMSE), reliability, and classification accuracy. The simulation included 5,000 respondents, 120 items, and a four-dimensional structure, with item format, test difficulty, and inter-dimensional correlation manipulated as design factors. Results indicated that MIRT consistently outperformed the other models, yielding the lowest RMSE and the highest reliability and classification accuracy across conditions. HO-IRT also performed strongly, whereas the bifactor model underperformed, particularly in subscore estimation. Model performance was sensitive to item format, test difficulty, and the strength of the correlations among dimensions. Findings from the real-data analysis supported the simulation results, underscoring the value of multidimensional modeling for diagnostic feedback and informed decision-making.
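For orientation, the display below sketches the conventional forms of the three models and the accuracy criterion in standard notation. This is a minimal sketch drawn from the usual formulations in the literature (the HO-IRT decomposition follows de la Torre & Song, 2009), not the article's own parameterization; the loadings \lambda_d, slope vectors \mathbf{a}_i, and intercepts d_i are illustrative symbols.

% Compensatory MIRT (2PL form): respondent j with ability vector \boldsymbol{\theta}_j answers item i
P(X_{ij} = 1 \mid \boldsymbol{\theta}_j) = \frac{1}{1 + \exp\!\left[-\left(\mathbf{a}_i^{\top}\boldsymbol{\theta}_j + d_i\right)\right]}

% HO-IRT: each domain ability is a linear function of a single overall ability \theta_j^{(g)}
\theta_{jd} = \lambda_d\,\theta_j^{(g)} + \varepsilon_{jd}, \qquad \varepsilon_{jd} \sim N\!\left(0,\; 1 - \lambda_d^{2}\right)

% Bifactor: each item loads on the general factor and exactly one specific factor s(i)
P(X_{ij} = 1 \mid \theta_{j0}, \theta_{js(i)}) = \frac{1}{1 + \exp\!\left[-\left(a_{i0}\theta_{j0} + a_{is}\theta_{js(i)} + d_i\right)\right]}

% RMSE for domain d over N respondents (the overall score is handled analogously)
\mathrm{RMSE}\big(\hat{\theta}_d\big) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\big(\hat{\theta}_{jd} - \theta_{jd}\big)^{2}}

Under this decomposition, HO-IRT estimates overall and domain abilities simultaneously, while the bifactor model splits each item's information between a general and a specific factor; when the general factor absorbs most of that information, the specific-factor (subscore) estimates carry less of it, which is consistent with the weaker subscore estimation reported above.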


Details

Primary Language English
Subjects Item Response Theory, Testing, Assessment and Psychometrics (Other)
Journal Section Research Article
Authors

Ayşenur Erdemir (ORCID: 0000-0001-9656-0878)

Hakan Atar (ORCID: 0000-0001-5372-1926)

Submission Date July 24, 2025
Acceptance Date November 30, 2025
Publication Date December 31, 2025
Published in Issue Year 2025 Volume: 16 Issue: 4

Cite

APA Erdemir, A., & Atar, H. (2025). Comparison of models for simultaneous estimation of overall score and subscores: Estimation accuracy, reliability, and classification accuracy. Journal of Measurement and Evaluation in Education and Psychology, 16(4), 264–282. https://doi.org/10.21031/epod.1748835