Comparison of Classification Accuracy and Consistency Indices Under the Item Response Theory
Year 2025, Volume: 6, Issue: 1, 90-105
Nurşah Yakut, Emine Önen
Abstract
In educational settings, individual diagnostic and placement decisions are made on the basis of various measures, and classification accuracy indicates how accurate these decisions are. In this study, the effectiveness of Lee's, Guo's, and Rudner's methods for assessing classification accuracy and consistency was examined under dichotomous IRT models across different sample sizes and test lengths. The data were generated using the 'irtoys' package in RStudio. Classification accuracy and consistency indices, and the bias values associated with these indices, were calculated using the 'cacIRT' package. As the number of items increased, the classification accuracy and consistency indices differed noticeably: higher bias values were observed for kappa values calculated with Lee's method and for false positive (FP) and false negative (FN) rates calculated with Guo's method. The Rudner indices showed lower absolute bias values than the other methods. With respect to classification decisions, Rudner's method is therefore expected to perform better with large sample sizes.
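The core computations behind the compared index families are compact. The sketch below illustrates them in Python under stated assumptions: the study itself used the R packages 'irtoys' and 'cacIRT', so this is an illustration of the underlying ideas, not that implementation, and the `(theta_hat, se)` pairs are hypothetical values. It shows Rudner's normal-approximation accuracy and consistency for a single cut score, and the Lord-Wingersky recursion over number-correct scores that Lee's summed-score approach builds on.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rudner_accuracy(theta_hat, se, cut):
    """Rudner-style expected accuracy for one examinee against a single
    cut score: the normal probability mass (mean theta_hat, sd se) that
    falls on the same side of the cut as the point estimate."""
    p_below = phi((cut - theta_hat) / se)
    return p_below if theta_hat < cut else 1.0 - p_below

def rudner_consistency(theta_hat, se, cut):
    """Probability of the same pass/fail decision on two parallel
    administrations: p^2 + (1 - p)^2, with p = P(estimate below cut)."""
    p = phi((cut - theta_hat) / se)
    return p * p + (1.0 - p) * (1.0 - p)

def summed_score_dist(correct_probs):
    """Lord-Wingersky recursion used in Lee's summed-score approach:
    the distribution of the number-correct score given each item's
    probability of a correct response at a fixed ability level."""
    dist = [1.0]
    for p in correct_probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)   # item answered incorrectly
            new[score + 1] += mass * p       # item answered correctly
        dist = new
    return dist

# Hypothetical (theta_hat, se) pairs; marginal indices average over examinees.
estimates = [(-1.2, 0.30), (0.1, 0.25), (0.9, 0.28)]
cut = 0.0
marginal_accuracy = sum(rudner_accuracy(t, s, cut) for t, s in estimates) / len(estimates)
marginal_consistency = sum(rudner_consistency(t, s, cut) for t, s in estimates) / len(estimates)
```

In simulation studies such as this one, the estimated marginal indices are compared against values computed from the known generating parameters, and the difference is reported as bias.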
Ethical Statement
Ethical approval for the study was obtained from the Gazi University Ethics Committee (document number and date: E-77082166-302.08.01-357512, 11.05.2022).
References
- Chalmers, P. (2020). Package ‘mirt’ [Computer software]. https://cran.r-project.org/web/packages/mirt/mirt.pdf
- Chau, L. H. (2018). Evaluating the correctness of IRT-based methods in computing classification consistency and accuracy indices in model misspecification [Doctoral dissertation, University of British Columbia]. http://hdl.handle.net/2429/66984
- Chen, J., de la Torre, J., & Zhang, Z. (2013). Relative and absolute fit evaluation in cognitive diagnosis modeling. Journal of Educational Measurement, 50(2), 123-140. https://doi.org/10.1111/j.1745-3984.2012.00185.x
- Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Sage Publications.
- Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104
- Diao, H., & Sireci, S. G. (2018). Item response theory-based methods for estimating classification accuracy and consistency. Journal of Applied Testing Technology, 19(1), 20-25.
- Guo, F. (2006). Expected classification accuracy using the latent distribution. Practical Assessment, Research, and Evaluation, 11(1), 6. https://doi.org/10.7275/bxba-7466
- Hanson, B. A., & Brennan, R. L. (1990). An investigation of classification consistency indexes estimated under alternative strong true score models. Journal of Educational Measurement, 27(4), 345-359. https://doi.org/10.1111/j.1745-3984.1990.tb00753.x
- Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13(4), 253–264. https://doi.org/10.1111/j.1745-3984.1976.tb00016.x
- Kingsbury, G. G., & Weiss, D. J. (1980). A comparison of adaptive, sequential, and conventional testing strategies for mastery decisions (ADA094478). https://apps.dtic.mil/sti/pdfs/ADA094478.pdf
- Lathrop, Q. N. (2020). Package ‘cacIRT’ [Computer software]. https://cran.r-project.org/web/packages/cacIRT/cacIRT.pdf
- Lathrop, Q. N., & Cheng, Y. (2013). Two approaches to the estimation of classification accuracy rate under item response theory. Applied Psychological Measurement, 37(3), 226-241. https://doi.org/10.1177/0146621612471888
- Lathrop, Q. N., & Cheng, Y. (2014). A nonparametric approach to estimate classification accuracy and consistency. Journal of Educational Measurement, 51(3), 318-334. https://doi.org/10.1111/jedm.12048
- Lee, W. C. (2010). Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47(1), 1-17. https://doi.org/10.1111/j.1745-3984.2009.00096.x
- Lee, W. C., Brennan, R. L., & Wan, L. (2009). Classification consistency and accuracy for complex assessments under the compound multinomial model. Applied Psychological Measurement, 33(5), 374-390. https://doi.org/10.1177/0146621608321759
- Lee, W. C., Hanson, B. A., & Brennan, R. L. (2000). Procedures for computing classification consistency and accuracy indices with multiple categories. https://www.act.org/content/dam/act/unsecured/documents/ACT_RR2000-10.pdf
- Lee, W. C., Hanson, B. A., & Brennan, R. L. (2002). Estimating consistency and accuracy indices for multiple classifications. Applied Psychological Measurement, 26(4), 412-432. https://doi.org/10.1177/014662102237797
- Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on test scores. Journal of Educational Measurement, 32(2), 179-197. https://doi.org/10.1111/j.1745-3984.1995.tb00462.x
- Martineau, J. A. (2007). An expansion and practical evaluation of expected classification accuracy. Applied Psychological Measurement, 31(3), 181-194. https://doi.org/10.1177/0146621606291557
- Md Desa, Z. N. D. (2012). Bi-factor multidimensional item response theory modeling for subscores estimation, reliability, and classification [Doctoral dissertation, University of Kansas]. https://kuscholarworks.ku.edu/handle/1808/10126
- Minchen, N., & de la Torre, J. (2018). A general cognitive diagnosis model for continuous-response data. Measurement: Interdisciplinary Research and Perspectives, 16(1), 30-44. https://doi.org/10.1080/15366367.2018.1436817
- Partchev, I. (2017). Package ‘irtoys’ [Computer software]. https://cran.r-project.org/web/packages/irtoys/irtoys.pdf
- Revelle, W. (2015). Package ‘psych’ [Computer software]. https://cran.r-project.org/web/packages/psych/psych.pdf
- Robitzsch, A. (2020). Package ‘sirt’ [Computer software]. https://cran.r-project.org/web/packages/sirt/sirt.pdf
- Rudner, L. M. (2001). Computing the expected proportions of misclassified examinees. Practical Assessment, 7(14), 1-5. https://doi.org/10.7275/an9m-2035
- Rudner, L. M. (2005). Expected classification accuracy. Practical Assessment, 10(13), 1-4. https://doi.org/10.7275/56a5-6b14
- Sen, S., & Cohen, A. S. (2020). The impact of test and sample characteristics on model selection and classification accuracy in the multilevel mixture IRT model. Frontiers in Psychology, 11, 197. https://doi.org/10.3389/fpsyg.2020.00197
- Subkoviak, M. J. (1976). Estimating reliability from a single administration of a criterion-referenced test. Journal of Educational Measurement, 13(4), 265-276. https://doi.org/10.1111/j.1745-3984.1976.tb00017.x
- Terzi, R., & de la Torre, J. (2018). An iterative method for empirically-based Q-matrix validation. International Journal of Assessment Tools in Education, 5(2), 248-262. https://doi.org/10.21449/ijate.407193
- Thompson, N. A. (2009). Item selection in computerized classification testing. Educational and Psychological Measurement, 69(5), 778-793. https://doi.org/10.1177/0013164408324460
- Wang, S., & Wang, T. (2001). Precision of Warm’s weighted likelihood estimates for a polytomous model in computerized adaptive testing. Applied Psychological Measurement, 25(4), 317-331. https://doi.org/10.1177/01466210122032163
- Wyse, A. E., & Hao, S. (2012). An evaluation of item response theory classification accuracy and consistency indices. Applied Psychological Measurement, 36(7), 602-624. https://doi.org/10.1177/0146621612451522
- Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125-145. https://doi.org/10.1177/014662168400800201