Comparison of Classification Accuracy and Consistency Indices Under the Item Response Theory
Year 2025, Volume: 6, Issue: 1, 90-105
Nurşah Yakut, Emine Önen
Abstract
In educational settings, individual diagnostic and placement decisions are made on the basis of various measures, and classification accuracy indicates how accurate these decisions are. In this study, the effectiveness of Lee's, Guo's, and Rudner's methods for assessing classification accuracy and consistency was examined under dichotomous IRT models across different sample sizes and test lengths. The data were generated using the 'irtoys' package in RStudio. Classification accuracy and consistency indices, and the bias values associated with these indices, were calculated using the 'cacIRT' package. As the number of items increased, the classification accuracy and consistency indices differed noticeably: higher bias values were observed for kappa values calculated with Lee's method and for false positive (FP) and false negative (FN) rates calculated with Guo's method. The Rudner indices showed lower absolute bias values than the other methods. With respect to classification decisions, Rudner's method is therefore expected to perform better with large sample sizes.
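The core computations behind the compared index families are compact. The sketch below illustrates them in Python under stated assumptions: the study itself used the R packages 'irtoys' and 'cacIRT', so this is an illustration of the underlying ideas, not that implementation, and the `(theta_hat, se)` pairs are hypothetical values. It shows Rudner's normal-approximation accuracy and consistency for a single cut score, and the Lord-Wingersky recursion over number-correct scores that Lee's summed-score approach builds on.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rudner_accuracy(theta_hat, se, cut):
    """Rudner-style expected accuracy for one examinee against a single
    cut score: the normal probability mass (mean theta_hat, sd se) that
    falls on the same side of the cut as the point estimate."""
    p_below = phi((cut - theta_hat) / se)
    return p_below if theta_hat < cut else 1.0 - p_below

def rudner_consistency(theta_hat, se, cut):
    """Probability of the same pass/fail decision on two parallel
    administrations: p^2 + (1 - p)^2, with p = P(estimate below cut)."""
    p = phi((cut - theta_hat) / se)
    return p * p + (1.0 - p) * (1.0 - p)

def summed_score_dist(correct_probs):
    """Lord-Wingersky recursion used in Lee's summed-score approach:
    the distribution of the number-correct score given each item's
    probability of a correct response at a fixed ability level."""
    dist = [1.0]
    for p in correct_probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)   # item answered incorrectly
            new[score + 1] += mass * p       # item answered correctly
        dist = new
    return dist

# Hypothetical (theta_hat, se) pairs; marginal indices average over examinees.
estimates = [(-1.2, 0.30), (0.1, 0.25), (0.9, 0.28)]
cut = 0.0
marginal_accuracy = sum(rudner_accuracy(t, s, cut) for t, s in estimates) / len(estimates)
marginal_consistency = sum(rudner_consistency(t, s, cut) for t, s in estimates) / len(estimates)
```

In simulation studies such as this one, the estimated marginal indices are compared against values computed from the known generating parameters, and the difference is reported as bias.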
Ethical Statement
Ethical approval for the study was obtained from the Gazi University Ethics Committee (document number and date: E-77082166-302.08.01-357512, 11.05.2022).
References
- Chalmers, P. (2020). Package ‘mirt’ [Computer software]. https://cran.r-project.org/web/packages/mirt/mirt.pdf
- Chau, L. H. (2018). Evaluating the correctness of IRT-based methods in computing classification consistency and accuracy indices in model misspecification [Doctoral dissertation, University of British Columbia]. http://hdl.handle.net/2429/66984
- Chen, J., de la Torre, J., & Zhang, Z. (2013). Relative and absolute fit evaluation in cognitive diagnosis modeling. Journal of Educational Measurement, 50(2), 123-140. https://doi.org/10.1111/j.1745-3984.2012.00185.x
- Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Sage Publications.
- Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104
- Diao, H., & Sireci, S. G. (2018). Item response theory-based methods for estimating classification accuracy and consistency. Journal of Applied Testing Technology, 19(1), 20-25.
- Guo, F. (2006). Expected classification accuracy using the latent distribution. Practical Assessment, Research, and Evaluation, 11(1), 6. https://doi.org/10.7275/bxba-7466
- Hanson, B. A., & Brennan, R. L. (1990). An investigation of classification consistency indexes estimated under alternative strong true score models. Journal of Educational Measurement, 27(4), 345-359. https://doi.org/10.1111/j.1745-3984.1990.tb00753.x
- Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13(4), 253–264. https://doi.org/10.1111/j.1745-3984.1976.tb00016.x
- Kingsbury, G. G., & Weiss, D. J. (1980). A comparison of adaptive, sequential, and conventional testing strategies for mastery decisions (ADA094478). https://apps.dtic.mil/sti/pdfs/ADA094478.pdf
- Lathrop, Q. N. (2020). Package ‘cacIRT’ [Computer software]. https://cran.r-project.org/web/packages/cacIRT/cacIRT.pdf
- Lathrop, Q. N., & Cheng, Y. (2013). Two approaches to the estimation of classification accuracy rate under item response theory. Applied Psychological Measurement, 37(3), 226-241. https://doi.org/10.1177/0146621612471888
- Lathrop, Q. N., & Cheng, Y. (2014). A nonparametric approach to estimate classification accuracy and consistency. Journal of Educational Measurement, 51(3), 318-334. https://doi.org/10.1111/jedm.12048
- Lee, W. C. (2010). Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47(1), 1-17. https://doi.org/10.1111/j.1745-3984.2009.00096.x
- Lee, W. C., Brennan, R. L., & Wan, L. (2009). Classification consistency and accuracy for complex assessments under the compound multinomial model. Applied Psychological Measurement, 33(5), 374-390. https://doi.org/10.1177/0146621608321759
- Lee, W. C., Hanson, B. A., & Brennan, R. L. (2000). Procedures for computing classification consistency and accuracy indices with multiple categories. https://www.act.org/content/dam/act/unsecured/documents/ACT_RR2000-10.pdf
- Lee, W. C., Hanson, B. A., & Brennan, R. L. (2002). Estimating consistency and accuracy indices for multiple classifications. Applied Psychological Measurement, 26(4), 412-432. https://doi.org/10.1177/014662102237797
- Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on test scores. Journal of Educational Measurement, 32(2), 179-197. https://doi.org/10.1111/j.1745-3984.1995.tb00462.x
- Martineau, J. A. (2007). An expansion and practical evaluation of expected classification accuracy. Applied Psychological Measurement, 31(3), 181-194. https://doi.org/10.1177/0146621606291557
- Md Desa, Z. N. D. (2012). Bi-factor multidimensional item response theory modeling for subscores estimation, reliability, and classification [Doctoral dissertation, University of Kansas]. https://kuscholarworks.ku.edu/handle/1808/10126
- Minchen, N., & de la Torre, J. (2018). A general cognitive diagnosis model for continuous-response data. Measurement: Interdisciplinary Research and Perspectives, 16(1), 30-44. https://doi.org/10.1080/15366367.2018.1436817
- Partchev, I. (2017). Package ‘irtoys’ [Computer software]. https://cran.r-project.org/web/packages/irtoys/irtoys.pdf
- Revelle, W. (2015). Package ‘psych’ [Computer software]. https://cran.r-project.org/web/packages/psych/psych.pdf
- Robitzsch, A. (2020). Package ‘sirt’ [Computer software]. https://cran.r-project.org/web/packages/sirt/sirt.pdf
- Rudner, L. M. (2001). Computing the expected proportions of misclassified examinees. Practical Assessment, 7(14), 1-5. https://doi.org/10.7275/an9m-2035
- Rudner, L. M. (2005). Expected classification accuracy. Practical Assessment, 10(13), 1-4. https://doi.org/10.7275/56a5-6b14
- Sen, S., & Cohen, A. S. (2020). The impact of test and sample characteristics on model selection and classification accuracy in the multilevel mixture IRT model. Frontiers in Psychology, 11, 197. https://doi.org/10.3389/fpsyg.2020.00197
- Subkoviak, M. J. (1976). Estimating reliability from a single administration of a criterion-referenced test. Journal of Educational Measurement, 13(4), 265-276. https://doi.org/10.1111/j.1745-3984.1976.tb00017.x
- Terzi, R., & de la Torre, J. (2018). An iterative method for empirically-based Q-matrix validation. International Journal of Assessment Tools in Education, 5(2), 248-262. https://doi.org/10.21449/ijate.407193
- Thompson, N. A. (2009). Item selection in computerized classification testing. Educational and Psychological Measurement, 69(5), 778-793. https://doi.org/10.1177/0013164408324460
- Wang, S., & Wang, T. (2001). Precision of Warm’s weighted likelihood estimates for a polytomous model in computerized adaptive testing. Applied Psychological Measurement, 25(4), 317-331. https://doi.org/10.1177/01466210122032163
- Wyse, A. E., & Hao, S. (2012). An evaluation of item response theory classification accuracy and consistency indices. Applied Psychological Measurement, 36(7), 602-624. https://doi.org/10.1177/0146621612451522
- Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125-145. https://doi.org/10.1177/014662168400800201