The Unit Testlet Dilemma: PISA Sample

Cansu Ayan; Fulya Barış Pekmezci

doi:10.21449/ijate.948734

Research Article

Year 2021, Volume: 8 Issue: 3, 613 - 632, 05.09.2021

Cited By: 1

Abstract

Testlets have widely accepted advantages in the literature, such as making it possible to measure higher-order thinking skills and saving testing time. For this reason, they are often preferred in implementations ranging from in-class assessments to large-scale assessments. As the use of testlets has increased, the following questions have become matters of debate: “Is sharing a common stem enough for a set of items to be treated as a testlet?” “Which estimation method should be preferred in implementations containing this type of item?” “Is there an alternative estimation method for PISA, which consists of this type of item?” In addition, because such items violate the local independence assumption, the choice of statistical model for estimation with these items has become a popular topic of discussion. In light of these discussions, this study aimed to clarify the unit-testlet ambiguity by applying various item response theory (IRT) models to the science and mathematics tests of PISA 2018, in which testlets consist of mixed item types (dichotomous and polytomous). The findings showed that the bifactor model fit the data best, while the unidimensional model fit almost as well as the bifactor model for both data sets (science and mathematics). The multidimensional IRT model, on the other hand, had the weakest fit for both tests. In line with these findings, the methods used to identify testlet items are discussed, and estimation suggestions are offered for implementations that use testlets, especially PISA.

Keywords

PISA, Testlet items, Local Dependence, Marginal item parameters

References

Ackerman, T. A. (1987, April). The robustness of LOGIST and BILOG IRT estimation programs to violations of local independence. Paper presented at the annual meeting of the American Educational Research Association. Washington, DC.
Ackerman, T. A., Gierl, M. J., & Walker, C. M. (2003). Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measurement: Issues and Practice, 22(3), 37-51. https://doi.org/10.1111/j.1745-3992.2003.tb00136.x
Akoğlu, H. (2018). User's guide to correlation coefficients. Turkish Journal of Emergency Medicine, 18(3), 91-93. https://doi.org/10.1016/j.tjem.2018.08.001
Baldonado, A. A., Svetina, D., & Gorin, J. (2015). Using necessary information to identify item dependence in passage-based reading comprehension tests. Applied Measurement in Education, 28(3), 202-218. https://doi.org/10.1080/08957347.2015.1042154
Bao, H. (2007). Investigating differential item function amplification and cancellation in application of item response testlet models [Doctoral dissertation, University of Maryland]. ProQuest Dissertations and Theses Global.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168.
Cai, L. (2010). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35(3), 307-335. https://doi.org/10.3102/1076998609353115
Cai, L., du Toit, S. H. C., & Thissen, D. (2015). IRTPRO: Flexible professional item response theory modeling for patient reported outcomes (version 3.1) [computer software]. SSIInternational.
Cai, L., & Hansen, M. (2013). Limited-information goodness-of-fit testing of hierarchical item factor models. British Journal of Mathematical and Statistical Psychology, 66(2), 245-276. https://doi.org/10.1111/j.2044-8317.2012.02050.x
Cai, L., & Monroe, S. (2014). A new statistic for evaluating item response theory models for ordinal data. (CRESST Report 839). National Center for Research on Evaluation, Standards, and Student Testing (CRESST).
Canivez, G. L. (2016). Bifactor modeling in construct validation of multifactored tests: Implications for understanding multidimensional constructs and test interpretation. In K. Schweizer & C. DiStefano (Eds.). Principles and methods of test construction: Standards and recent advancements (pp. 247-271). Hogrefe Publishers.
Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265–289. https://doi.org/10.3102/10769986022003265
Chon, K. H., Lee, W., & Ansley, T. N. (2007). Assessing IRT model-data fit for mixed format tests. (CASMA Research Report 26). Center for Advanced Studies in Measurement and Assessment.
DeMars, C. E. (2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43, 145–168. https://doi.org/10.1111/j.1745-3984.2006.00010.x
DeMars, C. E. (2012). Confirming testlet effects. Applied Psychological Measurement, 36, 104–121. https://doi.org/10.1177/0146621612437403
Embretson, S., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah: Lawrence Erlbaum Associates Inc.
Fukuhara, H., & Kamata, A. (2011). A bifactor multidimensional item response theory model for differential item functioning analysis on testlet-based items. Applied Psychological Measurement, 35(8), 604–622. https://doi.org/10.1177/0146621611428447
Gibbons, R. D., & Hedeker, D. R. (1992). Full-information bi-factor analysis. Psychometrika, 57, 423–436.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. SAGE Publications.
Holzinger, K. J., & Swineford, F. (1937). The Bi-factor method. Psychometrika, 2, 41–54. https://doi.org/10.1007/BF02287965
Houts, C. R., & Cai, L. (2013). Flexible multilevel multidimensional item analysis and test scoring [flexMIRT® user’s manual version 3.52]. Vector Psychometric Group.
Ip, E. H. (2010). Interpretation of the three-parameter testlet response model and information function. Applied Psychological Measurement, 34(7), 467–482. https://doi.org/10.1177/0146621610364975
Lee, G., Dunbar, S. B., & Frisbie, D. A. (2001). The relative appropriateness of eight measurement models for analyzing scores from tests composed of testlets. Educational and Psychological Measurement, 61, 958–975. https://doi.org/10.1177/00131640121971590
Li, Y., Bolt, D. M., & Fu, J. (2005). A test characteristic curve linking method for the testlet model. Applied Psychological Measurement, 29(5), 340–356. https://doi.org/10.1177/0146621605276678
Marais, I. D., & Andrich, D. (2008). Effects of varying magnitude and patterns of local dependence in the unidimensional Rasch model. Journal of Applied Measurement, 9, 105–124.
Maydeu-Olivares, A., & Joe, H. (2005). Limited and full information estimation and testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association. https://doi.org/10.1198/016214504000002069
McDonald, R. P. (2000). A basis for multidimensional item response theory. Applied Psychological Measurement, 24(2), 99–114. https://doi.org/10.1177/01466210022031552
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992(1), i–30. https://doi.org/10.1002/j.2333-8504.1992.tb01436.x
OECD (2019a). “PISA 2018 Mathematics Framework”. in PISA 2018 Assessment and Analytical Framework. OECD Publishing. https://doi.org/10.1787/13c8a22c-en
OECD (2019b). “PISA 2018 Science Framework”. in PISA 2018 Assessment and Analytical Framework. OECD Publishing. https://doi.org/10.1787/f30da688-en
OECD (2019c). “Scaling PISA data”. in PISA 2018 Technical Report. OECD Publishing. https://www.oecd.org/pisa/data/pisa2018technicalreport/Ch.09-Scaling-PISA-Data.pdf
Reise, S. P., Moore, T. M., & Haviland, M. G. (2010). Bifactor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores. Journal of Personality Assessment, 92(6), 544-559. https://doi.org/10.1080/00223891.2010.496477
Revelle, W., & Revelle, M. W. (2015). Package ‘psych’. The comprehensive R archive network, 337, 338.
Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237–247. https://doi.org/10.1111/j.1745-3984.1991.tb00356.x
Stucky, B. D., & Edelen, M. O. (2014). Using hierarchical IRT models to create unidimensional measures from multidimensional data. In S. P. Reise & D. A. Revicki (Eds.) Handbook of item response theory modelling. (pp. 201-224). Routledge.
Stucky, B. D., Thissen, D., & Orlando Edelen, M. (2013). Using logistic approximations of marginal trace lines to develop short assessments. Applied Psychological Measurement, 37(1), 41-57. https://doi.org/10.1177/0146621612462759
Toland, M. D., Sulis, I., Giambona, F., Porcu, M., & Campbell, J. M. (2017). Introduction to bifactor polytomous item response theory analysis. Journal of School Psychology, 60, 41-63. https://doi.org/10.1016/j.jsp.2016.11.001
Tuerlinckx, F., & De Boeck, P. (2001). The effect of ignoring item interactions on the estimated discrimination parameters in item response theory. Psychological Methods, 6(2), 181–195. https://doi.org/10.1037/1082-989X.6.2.181
Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.). Computerized adaptive testing: Theory and practice (pp. 245–269). Springer, Dordrecht.
Wainer, H., & Lewis, C. (1990). Toward a psychometrics for testlets. Journal of Educational Measurement, 27(1), 1–14. https://doi.org/10.1111/j.1745-3984.1990.tb00730.x
Wainer, H., & Wang, X. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37(3), 203–220. https://doi.org/10.1111/j.1745-3984.2000.tb01083.x
Wang, W. C., & Wilson, M. (2005). The Rasch testlet model. Applied Psychological Measurement, 29(2), 126-149. https://doi.org/10.1177/0146621604271053
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213. https://doi.org/10.1111/j.1745-3984.1993.tb00423.x
Yılmaz Kogar, E. (2016). Madde takımları içeren testlerde farklı modellerden elde edilen madde ve yetenek parametrelerinin karşılaştırılması [Comparison of item and ability parameters obtained from different models on tests composed of testlets] [Doctoral dissertation, Hacettepe University]. Hacettepe University Libraries, https://avesis.hacettepe.edu.tr/yonetilen-tez/c2ade6a0-6a2d-4147-beb0-8a3feb0642c5/madde-takimlari-iceren-testlerde-farkli-modellerden-elde-edilen-madde-ve-yetenek-parametrelerinin-karsilastirilmasi

There are 44 citations in total.

Details

Primary Language	English
Subjects	Studies on Education
Journal Section	Articles
Authors	Cansu Ayan (ORCID 0000-0002-0773-5486); Fulya Barış Pekmezci (ORCID 0000-0001-6989-512X)
Publication Date	September 5, 2021
Submission Date	September 24, 2020
Published in Issue	Year 2021 Volume: 8 Issue: 3

Cite

APA	Ayan, C., & Barış Pekmezci, F. (2021). The Unit Testlet Dilemma: PISA Sample. International Journal of Assessment Tools in Education, 8(3), 613-632. https://doi.org/10.21449/ijate.948734

Cited By

Rescaling of Cognitive Flexibility Inventory by Criticism of Turkish Adaptation Form

International Journal of Cognitive Therapy

https://doi.org/10.1007/s41811-023-00188-8
