A practical guide to item bank calibration with multiple matrix sampling
Year 2024, pp. 647-659, 15.11.2024
Eren Can Aybek, Serkan Arıkan, Güneş Ertaş
Abstract
When item parameters must be estimated for a large item bank, a Multiple Matrix Sampling (MMS) design offers an efficient approach that minimizes the test burden on students. This study explains, with a worked example, how to calibrate a large item pool using an MMS design for purposes such as developing a computerized adaptive testing (CAT) administration. Two functions of the mirt package, mirt() and multipleGroup(), were compared using real data. The results showed that the standard mirt() function is more practical and yields more precise estimates than the multipleGroup() function.
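To make the idea of an MMS design concrete, the following is a minimal sketch in Python of how a planned-missingness response matrix might be laid out. It is not the authors' design: the function names (`build_booklets`, `mms_response_matrix`), the block and booklet counts, the cyclic pairing of blocks, and the random 0/1 responses are all illustrative assumptions. Each student receives one booklet made of two item blocks, and items outside that booklet are recorded as missing (`None`).

```python
import random


def build_booklets(n_blocks):
    """Pair each block with the next one (cyclically), so that
    neighbouring booklets share a block and all items are linked.
    This pairing rule is an assumption made for illustration."""
    return [(b, (b + 1) % n_blocks) for b in range(n_blocks)]


def mms_response_matrix(n_students, n_items, n_blocks, seed=0):
    """Return (matrix, assignment) for a hypothetical MMS design.

    matrix[s][i] is 0/1 if item i was administered to student s,
    and None if the item was not in that student's booklet.
    Responses are random placeholders, not real data.
    """
    if n_items % n_blocks != 0:
        raise ValueError("n_items must be divisible by n_blocks")

    rng = random.Random(seed)
    size = n_items // n_blocks
    # Split the item indices into equally sized, non-overlapping blocks.
    blocks = [list(range(b * size, (b + 1) * size)) for b in range(n_blocks)]
    booklets = build_booklets(n_blocks)

    matrix, assignment = [], []
    for s in range(n_students):
        k = s % len(booklets)  # rotate booklets across students
        first, second = booklets[k]
        administered = set(blocks[first]) | set(blocks[second])
        row = [rng.randint(0, 1) if i in administered else None
               for i in range(n_items)]
        matrix.append(row)
        assignment.append(k)
    return matrix, assignment


if __name__ == "__main__":
    matrix, assignment = mms_response_matrix(n_students=12, n_items=12, n_blocks=4)
    for booklet, row in zip(assignment, matrix):
        print(booklet, ["." if x is None else x for x in row])
```

In this sketch every student answers only half of the items, yet every item is answered by some students and every booklet overlaps with its neighbours, which is what allows all items to be placed on a common scale. In R, the analogous matrix would code the unadministered items as `NA`.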
Supporting Institution
Bogazici University
Project Number
BAP-SUP 17002