Research Article

Simultaneous Estimation of Overall Score and Subscores Using MIRT, HO-IRT and Bi-factor Model on TIMSS Data

Year 2020, Volume: 11 Issue: 1, 61 - 75, 24.03.2020
https://doi.org/10.21031/epod.645478

Abstract

In educational testing, there is increasing interest in the simultaneous estimation of overall scores and subscores. This study compares the reliability and precision of simultaneous estimates of overall scores and subscores obtained with MIRT, HO-IRT, and bi-factor models. TIMSS 2015 mathematics scores were used as the data set. The TIMSS 2015 mathematics test consists of 35 items, four of which are polytomously scored (0-1-2); the remaining items are dichotomously scored (0-1). The four content domains are number (14 items), algebra (9 items), geometry (6 items), and data and chance (6 items). Ability parameters were estimated with the BMIRT software. The results showed that the MIRT and HO-IRT methods performed similarly in the precision and reliability of subscore estimates. The MIRT maximum-information method yielded the smallest standard error of measurement for the overall score estimates, and all three methods performed similarly in overall score reliability. Among the three methods compared, HO-IRT therefore appears to be the better choice for the simultaneous estimation of the overall score and subscores from the TIMSS 2015 data. Recommendations for testing practice and future research are provided.
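The higher-order structure compared in the study can be illustrated with a small simulation. The sketch below is purely illustrative and is not the authors' BMIRT estimation code: it generates four domain abilities as regressions on a single overall ability (the HO-IRT idea), mirrors the 35-item, four-domain TIMSS design, and for simplicity treats all items as dichotomous 2PL items (the four polytomous items are not modeled). All loadings and item parameters are invented values, not those estimated in the study.

```python
import numpy as np

rng = np.random.default_rng(7)
n_examinees = 2000

# Higher-order structure: each domain ability is a regression on one overall
# ability (loadings are illustrative, not estimated from TIMSS data).
lambdas = np.array([0.9, 0.8, 0.85, 0.75])            # domain loadings on overall theta
theta_overall = rng.standard_normal(n_examinees)
eps = rng.standard_normal((n_examinees, 4)) * np.sqrt(1 - lambdas**2)
theta_domain = lambdas * theta_overall[:, None] + eps  # each domain has unit variance

# Test design mirroring TIMSS 2015 mathematics: 35 items over 4 content domains
items_per_domain = [14, 9, 6, 6]                       # number, algebra, geometry, data & chance
domain_of_item = np.repeat(np.arange(4), items_per_domain)
a = rng.uniform(0.8, 2.0, 35)                          # discriminations (illustrative)
b = rng.normal(0.0, 1.0, 35)                           # difficulties (illustrative)

# 2PL response probabilities; each item loads only on its own domain ability
logits = a * (theta_domain[:, domain_of_item] - b)
responses = (rng.random((n_examinees, 35)) < 1 / (1 + np.exp(-logits))).astype(int)

# Raw subscores per domain and their correlation with the generating overall ability
subscores = np.stack(
    [responses[:, domain_of_item == d].sum(axis=1) for d in range(4)], axis=1
)
corrs = np.corrcoef(np.column_stack([theta_overall, subscores]).T)[0, 1:]
for d, r in enumerate(corrs):
    print(f"domain {d}: corr(subscore, overall theta) = {r:.2f}")
```

Because the domain abilities are all driven by one overall factor, each raw subscore correlates substantially with the overall ability — the property that HO-IRT exploits when estimating overall and domain scores simultaneously.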

References

  • Ackerman, T. A., Gierl, M. J., & Walker, C. M. (2003). Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measurement: Issues and Practice, 22, 37-51. https://doi.org/10.1111/j.1745-3992.2003.tb00136.x
  • Adams, R., Wilson, M., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23. https://doi.org/10.1177/0146621697211001
  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: AERA.
  • Brown, A. & Croudace, T. (2015). Scoring and estimating score precision using multidimensional IRT. In Reise, S. P. & Revicki, D. A. (Eds.). Handbook of Item Response Theory Modeling: Applications to Typical Performance Assessment (a volume in the Multivariate Applications Series). New York: Routledge/Taylor & Francis Group.
  • Cai, L., Yang, J. S., & Hansen, M. (2011). Generalized full-information item bifactor analysis. Psychological Methods, 16(3), 221–248. http://doi.org/10.1037/a0023350
  • de la Torre, J., & Patz, R. J. (2005). Making the most of what we have: A practical application of multidimensional item response theory in test scoring. Journal of Educational and Behavioral Statistics, 30(3), 295–311. https://doi.org/10.3102/10769986030003295
  • de la Torre, J., & Hong, Y. (2010). Parameter estimation with small sample size: A higher-order IRT model approach. Applied Psychological Measurement, 34, 267-285. https://doi.org/10.1177/0146621608329501
  • de la Torre, J., & Song, H. (2009). Simultaneous estimation of overall and domain abilities: A higher-order IRT model approach. Applied Psychological Measurement, 33(8), 620–639. https://doi.org/10.1177/0146621608326423
  • de la Torre, J., Song, H., & Hong, Y. (2011). A comparison of four methods of IRT subscoring. Applied Psychological Measurement, 35(4), 296–316. http://doi.org/10.1177/0146621610378653
  • DeMars, C. E. (2005, August). Scoring subscales using multidimensional item response theory models. Poster presented at the annual meeting of the American Psychological Association. Retrieved from https://eric.ed.gov/?id=ED496242
  • DeMars, C. E. (2013). A tutorial on interpreting bifactor model scores. International Journal of Testing, 13, 354-378. https://doi.org/10.1080/15305058.2013.799067
  • Fan, F. (2016). Subscore reliability and classification consistency: A comparison of five methods (Doctoral dissertation, University of Massachusetts Amherst). Retrieved from https://scholarworks.umass.edu/dissertations_2/857/
  • Gibbons, R. D., & Hedeker, D. R. (1992). Full-information item bi-factor analysis. Psychometrika, 57(3), 423–436. http://doi.org/10.1007/BF02295430
  • Haberman, S. J. (2008). When can subscale scores have value? Journal of Educational and Behavioral Statistics, 33, 204–229. https://doi.org/10.3102/1076998607302636
  • Haberman, S. J., & Sinharay, S. (2010). Reporting of subscores using multidimensional item response theory. Psychometrika, 75, 209–227. https://doi.org/10.1007/s11336-010-9158-4
  • Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1-55. https://doi.org/10.1080/10705519909540118
  • Jang, E. E., & Roussos, L. (2007). An investigation into the dimensionality of TOEFL using conditional covariance-based nonparametric approach. Journal of Educational Measurement, 44, 1-21. https://doi.org/10.1111/j.1745-3984.2007.00024.x
  • Lane, S. (2005, April). Status and future directions for performance assessments in education. Paper presented at the annual meeting of the American Educational Research Association, Montreal.
  • Liu, Y., Li, Z., & Liu, H. (2018). Reporting valid and reliable overall scores and domain scores using bi-factor model. Applied Psychological Measurement, 43(7), 562–576. https://doi.org/10.1177/0146621618813093
  • Liu, Y., & Liu, H. (2017). Reporting overall scores and domain scores of bi-factor models. Acta Psychologica Sinica, 49(9), 1234. http://doi.org/10.3724/SP.J.1041.2017.01234
  • Longabach, T. (2015). A comparison of subscore reporting methods for a state assessment of English language proficiency (Doctoral dissertation, University of Kansas). Retrieved from https://kuscholarworks.ku.edu/handle/1808/19517
  • Md Desa, Z. N. D. (2012). Bi-factor multidimensional item response theory modeling for subscore estimation, reliability, and classification (Doctoral dissertation, University of Kansas). Retrieved from http://kuscholarworks.ku.edu/dspace/handle/1808/10126
  • Monaghan, W. (2006). The facts about subscale scores (ETS R&D Connections No. 4). Princeton, NJ: Educational Testing Service. Retrieved from https://www.ets.org/Media/Research/pdf/RD_Connections4.pdf
  • Reckase, M. D. (1985). The difficulty of items that measure more than one ability. Applied Psychological Measurement, 9, 401-412. https://doi.org/10.1177/014662168500900409
  • Reckase, M. D. (1997). A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271–286). NY: Springer.
  • Reckase, M. D. (2009). Multidimensional item response theory (statistics for social and behavioral sciences). New York: Springer.
  • Reckase, M. D., & Xu, J.-R. (2015). The evidence for a subscore structure in a test of English language competency for English language learners. Educational and Psychological Measurement, 75(5), 805–825. https://doi.org/10.1177/0013164414554416
  • Robitzsch, A. (2019). sirt: Supplementary item response theory models (R package manual). Retrieved from https://cran.r-project.org/web/packages/sirt/sirt.pdf
  • Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47(2), 150–174. https://doi.org/10.1111/j.1745-3984.2010.00106.x
  • Sinharay, S., Haberman, S. J., & Wainer, H. (2011). Do adjusted subscores lack validity? Don't blame the messenger. Educational and Psychological Measurement, 71(5), 789–797. https://doi.org/10.1177/0013164410391782
  • Soysal, S., & Kelecioğlu, H. (2018). Toplam test ve alt test puanlarının kestiriminin hiyerarşik madde tepki kuramı modelleri ile karşılaştırılması [Comparison of the estimation of total test and subtest scores with hierarchical item response theory models]. Journal of Measurement and Evaluation in Education and Psychology, 9(2), 178-201. https://doi.org/10.21031/epod.404089
  • Stone, C. A., Ye, F., Zhu, X., & Lane, S. (2010). Providing subscale scores for diagnostic information: A case study when the test is essentially unidimensional. Applied Measurement in Education, 23(1), 63–86. https://doi.org/10.1080/08957340903423651
  • Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). MA: Allyn & Bacon.
  • Wainer, H., & Thissen, D. (1996). How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measurement: Issues and Practice, 15, 22–29. https://doi.org/10.1111/j.1745-3992.1996.tb00803.x
  • Wang, W.-C., Chen, P.-H., & Cheng, Y.-Y. (2004). Improving measurement precision of test batteries using multidimensional item response models. Psychological Methods, 9(1), 116–136. https://doi.org/10.1037/1082-989X.9.1.116
  • Wedman, J., & Lyrén, P.-E. (2015). Methods for examining the psychometric quality of subscores: A review and application. Practical Assessment, Research and Evaluation, 20(21). Retrieved from https://pareonline.net/getvn.asp?v=20&n=21
  • Yao, L. (2003). BMIRT: Bayesian multivariate item response theory [Computer software]. Monterey, CA: Defense Manpower Data Center. Retrieved from https://www.bmirt.com/6271.html
  • Yao, L. (2010). Reporting valid and reliable overall scores and domain scores. Journal of Educational Measurement, 47(3), 339–360. https://doi.org/10.1111/j.1745-3984.2010.00117.x
  • Yao, L. (2013). The BMIRT toolkit. Monterey. Retrieved from https://www.bmirt.com/8220.html
  • Yao, L. (2014). Multidimensional item response theory for score reporting. In Y. Cheng, & H.‐H. Chang (Eds.) Advances in modern international testing: Transition from summative to formative assessment. Charlotte, NC: Information Age.
  • Yao, L., & Boughton, K. A. (2007). A Multidimensional item response modeling approach for improving subscale proficiency estimation and classification. Applied Psychological Measurement, 31(2), 83–105. http://doi.org/10.1177/0146621606291559
  • Yao, L., Lewis, D., & Zhang, L. (2008, April). An introduction to the application of BMIRT: Bayesian multivariate item response theory software. Training session presented at the meeting of the National Council on Measurement in Education, New York, NY.
  • Yao, L., & Schwarz, R. (2006). A multidimensional partial credit model with associated item and test statistics: An application to mixed format tests. Applied Psychological Measurement, 30, 469-492. https://doi.org/10.1177/0146621605284537
  • Zhang, J. (2007). Conditional covariance theory and DETECT for polytomous items. Psychometrika, 72, 69-91. https://doi.org/10.1007/s11336-004-1257-7
  • Zhang, J., & Stout, W. (1999a). Conditional covariance structure of generalized compensatory multidimensional items. Psychometrika, 64, 129-152. https://doi.org/10.1007/BF02294532
  • Zhang, J., & Stout, W. (1999b). The theoretical DETECT index of dimensionality and its application to approximate simple structure. Psychometrika, 64, 213-249. https://doi.org/10.1007/BF02294536
  • Zhang, M. (2016). Exploring dimensionality of scores for mixed-format tests (Doctoral Dissertation, University of Iowa). Retrieved from https://ir.uiowa.edu/etd/2171/

Details

Primary Language English
Journal Section Articles
Authors

Ayşenur Erdemir

Hakan Yavuz Atar (ORCID: 0000-0001-5372-1926)

Publication Date March 24, 2020
Acceptance Date February 25, 2020
Published in Issue Year 2020 Volume: 11 Issue: 1

Cite

APA Erdemir, A., & Atar, H. Y. (2020). Simultaneous Estimation of Overall Score and Subscores Using MIRT, HO-IRT and Bi-factor Model on TIMSS Data. Journal of Measurement and Evaluation in Education and Psychology, 11(1), 61-75. https://doi.org/10.21031/epod.645478