Comparison of Item Response Theory Scaling Methods with ROC Analysis

Meltem Yurtçu; Cem Güzeller

doi:10.21031/epod.892079

Research Article

Year 2022, Volume: 13 Issue: 1, 15 - 22, 29.03.2022

Thanks

My esteemed professors, thank you in advance for your efforts throughout this process.

Abstract

This study evaluated unidimensional item response theory (IRT) models under different scaling methods. The equating errors and the area under the curve (AUC) were examined for four scaling methods (Stocking-Lord, Haebara, Mean-Sigma, and Mean-Mean) and for the one-, two-, and three-parameter logistic models (1PL, 2PL, and 3PL) in the non-equivalent groups with anchor test (NEAT) design. The equating errors of the scaling methods were then compared with the results obtained from ROC analysis. Qatar's and Australia's PISA 2012 mathematical literacy test data were used in the study. The minimum error was obtained from the Mean-Mean method with the 1PL model, and the maximum error was obtained from the Mean-Mean method with the 3PL model. The equating errors and the ROC results were similar across all comparisons and supported each other. It is concluded that ROC analysis can be used to compare different conditions, methods, and models.
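As a rough illustration of two of the scaling methods named above, the sketch below computes the linear linking constants A and B for the Mean-Sigma and Mean-Mean methods from anchor-item parameters, using the standard textbook definitions (slope from the ratio of difficulty standard deviations, or from the ratio of mean discriminations). It is not the authors' code: the function names and the anchor-item values are hypothetical, and the values were constructed so that both methods recover the same constants.

```python
from statistics import mean, pstdev


def mean_sigma(b_from, b_to):
    """Mean-Sigma constants from anchor-item difficulties on both scales."""
    A = pstdev(b_to) / pstdev(b_from)
    B = mean(b_to) - A * mean(b_from)
    return A, B


def mean_mean(a_from, a_to, b_from, b_to):
    """Mean-Mean constants from anchor-item discriminations and difficulties."""
    A = mean(a_from) / mean(a_to)
    B = mean(b_to) - A * mean(b_from)
    return A, B


def rescale(a, b, A, B):
    """Place item parameters on the target scale: a* = a / A, b* = A*b + B."""
    return [ai / A for ai in a], [A * bi + B for bi in b]


# Hypothetical anchor-item parameters (not taken from the study).
a_x = [1.0, 1.2, 0.8]      # discriminations on the source scale
b_x = [-1.0, 0.0, 1.0]     # difficulties on the source scale
a_y = [0.5, 0.6, 0.4]      # same items on the target scale
b_y = [-1.5, 0.5, 2.5]

A_ms, B_ms = mean_sigma(b_x, b_y)
A_mm, B_mm = mean_mean(a_x, a_y, b_x, b_y)
a_star, b_star = rescale(a_x, b_x, A_ms, B_ms)
print(A_ms, B_ms, A_mm, B_mm)
```

With real data the two methods generally give different constants, because anchor-item estimates on the two scales are not an exact linear transform of each other. The Stocking-Lord and Haebara methods instead choose A and B by minimizing differences between characteristic curves, which requires numerical optimization and is not shown here.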

Keywords

test equating, ROC analysis, scaling methods, AUC

Details

Primary Language	English
Journal Section	Articles
Authors	Meltem Yurtçu (ORCID: 0000-0003-3303-5093); Cem Güzeller (ORCID: 0000-0002-2700-3565)
Publication Date	March 29, 2022
Acceptance Date	March 14, 2022
Published in Issue	Year 2022 Volume: 13 Issue: 1

Cite

APA	Yurtçu, M., & Güzeller, C. (2022). Comparison of Item Response Theory Scaling Methods with ROC Analysis. Journal of Measurement and Evaluation in Education and Psychology, 13(1), 15-22. https://doi.org/10.21031/epod.892079
