Research Article

A practical guide to item bank calibration with multiple matrix sampling

Year 2024, Volume: 11 Issue: 4, 647 - 659
https://doi.org/10.21449/ijate.1440316

Abstract

When the item parameters of a large item bank must be estimated, a Multiple Matrix Sampling (MMS) design offers an efficient approach that minimizes the testing burden on students. This study explains, with a worked example, how to calibrate a large item pool under an MMS design for purposes such as developing a computerized adaptive testing (CAT) administration. Two functions of the mirt package, mirt() and multipleGroup(), were compared using real data. The results showed that the standard mirt() function is more practical and yields more precise estimates than the multipleGroup() function.
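To make the data layout behind an MMS design concrete, the following is a minimal sketch in Python and not the authors' procedure. It assumes a hypothetical bank of 12 items split into 4 blocks, with each booklet made of two neighbouring blocks so that adjacent booklets share a linking block. The function names, the block sizes, and the random 0/1 responses are illustrative assumptions only. Items a student did not receive are stored as None, which corresponds to a missing value (NA) in an R data frame.

```python
import random


def build_booklets(n_items, n_blocks):
    """Split items into equal blocks; booklet b = block b + block (b + 1), cyclically."""
    if n_items % n_blocks != 0:
        raise ValueError("n_items must be divisible by n_blocks")
    size = n_items // n_blocks
    blocks = [list(range(b * size, (b + 1) * size)) for b in range(n_blocks)]
    # Neighbouring booklets share one block, which links all booklets together.
    return [blocks[b] + blocks[(b + 1) % n_blocks] for b in range(n_blocks)]


def simulate_sparse_responses(n_students, booklets, n_items, seed=1):
    """Build a students x items matrix with planned missingness."""
    rng = random.Random(seed)
    data = []
    for s in range(n_students):
        booklet = booklets[s % len(booklets)]  # rotate booklets across students
        row = [None] * n_items  # None = item not administered to this student
        for item in booklet:
            # Placeholder 0/1 answer (assumption); not an IRT-based simulation.
            row[item] = rng.randint(0, 1)
        data.append(row)
    return data


booklets = build_booklets(n_items=12, n_blocks=4)
data = simulate_sparse_responses(n_students=8, booklets=booklets, n_items=12)

# How many students answered each item, and how many items each student saw.
answered_per_item = [sum(row[i] is not None for row in data) for i in range(12)]
items_per_student = [sum(v is not None for v in row) for row in data]
```

In this sketch every student sees half of the bank, while every item is still answered by the same number of students, which is the trade-off an MMS design is meant to achieve.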

Supporting Institution

Bogazici University

Project Number

BAP-SUP 17002

References

  • Blötner, C. (2024). Package ‘diffcor’. https://cran.r-project.org/web/packages/diffcor/diffcor.pdf
  • Bock, R.D., & Zimowski, M.F. (1997). Multiple group IRT. In W.J. van der Linden & R.K. Hambleton (Eds.), Handbook of Modern Item Response Theory (pp. 433–448). Springer. https://doi.org/10.1007/978-1-4757-2691-6_25
  • Chalmers, R.P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06
  • Chalmers, R.P. (2023). Package “mirt”. https://cran.r-project.org/web/packages/mirt/mirt.pdf
  • Chen, W.-H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265–289. https://doi.org/10.2307/1165285
  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Routledge. https://doi.org/10.4324/9780203771587
  • Gonzalez, E., & Rutkowski, L. (2010). Principles of Multiple Matrix Booklet Designs and Parameter Recovery in Large-Scale Assessments (pp. 125–156). IERI.
  • Gressard, R.P., & Loyd, B.H. (1991). A comparison of item sampling plans in the application of multiple matrix sampling. Journal of Educational Measurement, 28(2), 119–130.
  • Kaplan, D., & Su, D. (2016). On matrix sampling and imputation of context questionnaires with implications for the generation of plausible values in large-scale assessments. Journal of Educational and Behavioral Statistics, 41(1), 57–80.
  • Lord, F.M. (1962). Estimating norms by item-sampling. Educational and Psychological Measurement, 22(2), 259–267. https://doi.org/10.1177/001316446202200202
  • Lord, F.M. (1965). Item sampling in test theory and in research design. ETS Research Bulletin Series, 1965(2), i–39. https://doi.org/10.1002/j.2333-8504.1965.tb00968.x
  • Macdonald, P., & Paunonen, S.V. (2002). A Monte Carlo comparison of item and person statistics based on item response theory versus classical test theory. Educational and Psychological Measurement, 62(6), 921–943. https://doi.org/10.1177/0013164402238082
  • Munger, G.F., & Loyd, B.H. (1988). The use of multiple matrix sampling for survey research. The Journal of Experimental Education, 56(4), 187–191.
  • OECD. (2020). PISA 2018 Technical Report-PISA. OECD Publishing, Paris. Retrieved from https://www.oecd.org/pisa/data/pisa2018technicalreport/
  • OECD. (2023). PISA 2022 Technical Report-PISA. OECD Publishing, Paris. Retrieved from https://www.oecd.org/pisa/data/pisa2022technicalreport/
  • O’Neill, T.R., Gregg, J.L., & Peabody, M.R. (2020). Effect of sample size on common item equating using the dichotomous Rasch model. Applied Measurement in Education, 33(1), 10–23. https://doi.org/10.1080/08957347.2019.1674309
  • Rubin, D.B. (2009). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.
  • R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Rutkowski, L. (2014). Sensitivity of achievement estimation to conditioning model misclassification. Applied Measurement in Education, 27(2), 115–132. https://doi.org/10.1080/08957347.2014.880440
  • Rutkowski, L., Gonzalez, E., von Davier, M., & Zhou, Y. (2013). Assessment design for international large-scale assessments. In Handbook of International Large-Scale Assessment. Chapman and Hall/CRC.
  • Rutkowski, L., Gonzalez, E., Joncas, M., & von Davier, M. (2010). International large-scale assessment data: Issues in Secondary Analysis and Reporting. Educational Researcher, 39(2), 142–151. https://doi.org/10.3102/0013189X10363170
  • Shoemaker, D.M. (1973). Principles and Procedures of Multiple Matrix Sampling. Ballinger Publishing Company.
  • Thissen, D., & Wainer, H. (1982). Some standard errors in item response theory. Psychometrika, 47(4).
  • Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science. O’Reilly Media, Inc.
  • Yin, L., & Foy, P. (2023). TIMSS 2023 Assessment Design. In I.V.S. Mullis, M.O. Martin, & M. von Davier (Eds.), TIMSS 2023 Assessment Frameworks. Boston College, TIMSS & PIRLS International Study Center.
  • Zhou, Y. (2021). Improving Multiple Matrix Sampling Design for Questionnaires. Indiana University.

There are 26 citations in total.

Details

Primary Language English
Subjects Measurement Theories and Applications in Education and Psychology
Journal Section Articles
Authors

Eren Can Aybek 0000-0003-3040-2337

Serkan Arıkan 0000-0001-9610-5496

Güneş Ertaş 0000-0001-8785-7768

Project Number BAP-SUP 17002
Early Pub Date October 21, 2024
Publication Date
Submission Date February 20, 2024
Acceptance Date August 12, 2024
Published in Issue Year 2024 Volume: 11 Issue: 4

Cite

APA Aybek, E. C., Arıkan, S., & Ertaş, G. (2024). A practical guide to item bank calibration with multiple matrix sampling. International Journal of Assessment Tools in Education, 11(4), 647-659. https://doi.org/10.21449/ijate.1440316
