Research Article
Year 2025, Volume: 6 Issue: 1, 90 - 105
https://doi.org/10.5281/zenodo.15075438


Comparison of Classification Accuracy and Consistency Indices Under the Item Response Theory


Abstract

In educational settings, individual diagnostic and placement decisions are made on the basis of several measures, and classification accuracy indicates how accurate these decisions are. In this study, the effectiveness of Lee's, Guo's, and Rudner's methods for assessing classification accuracy and consistency was examined under dichotomous IRT models across different sample sizes and test lengths. The data were generated using the 'irtoys' package in RStudio. Classification accuracy and consistency indices, together with the bias values of these indices, were calculated using the 'cacIRT' package. As the number of items increased, the classification accuracy and consistency indices differed markedly: higher bias values were observed for the kappa values calculated with Lee's method and for the false positive (FP) and false negative (FN) rates calculated with Guo's method. The Rudner indices showed lower absolute bias values than the other methods. With respect to classification decisions, Rudner's method is therefore expected to work better when applied to large sample sizes.
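The logic behind the accuracy and consistency indices compared in the abstract can be illustrated with a minimal sketch. The study itself used the R packages 'irtoys' and 'cacIRT'; the Python sketch below only mirrors the core idea of Rudner's (2001, 2005) approach, in which each examinee's true ability is treated as normally distributed around the ability estimate with its conditional standard error. The ability estimates, standard errors, and cut score here are hypothetical illustration values, not data from the study.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rudner_accuracy(theta_hat, se, cut):
    """Rudner-style expected classification accuracy for one cut score.

    Assuming true theta ~ N(theta_hat, se^2) for each examinee, the
    probability that the true ability falls on the same side of the cut
    as the estimate is averaged over examinees.
    """
    probs = []
    for t, s in zip(theta_hat, se):
        p_below = norm_cdf((cut - t) / s)
        # probability the true theta matches the observed classification
        probs.append(p_below if t < cut else 1.0 - p_below)
    return sum(probs) / len(probs)

def rudner_consistency(theta_hat, se, cut):
    """Expected consistency: chance that two independent
    administrations would classify the examinee the same way."""
    probs = []
    for t, s in zip(theta_hat, se):
        p_below = norm_cdf((cut - t) / s)
        probs.append(p_below ** 2 + (1.0 - p_below) ** 2)
    return sum(probs) / len(probs)

# Hypothetical ability estimates and standard errors for five examinees
theta_hat = [-1.2, -0.4, 0.1, 0.8, 1.5]
se = [0.35, 0.30, 0.28, 0.30, 0.40]
cut = 0.0
print(round(rudner_accuracy(theta_hat, se, cut), 3))
print(round(rudner_consistency(theta_hat, se, cut), 3))
```

The sketch makes the abstract's sample-size finding intuitive: accuracy and consistency depend only on how far each estimate sits from the cut relative to its standard error, so estimates near the cut score (like the third examinee above) pull both indices down.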

Ethical Statement

Ethical approval for the study was obtained from the Gazi University Ethics Committee (document number: E-77082166-302.08.01-357512, dated 11.05.2022).

References

  • Chalmers, P. (2020). Package ‘mirt’ [Computer software]. https://cran.r-project.org/web/packages/mirt/mirt.pdf
  • Chau, L. H. (2018). Evaluating the correctness of IRT-based methods in computing classification consistency and accuracy indices in model misspecification [Doctoral dissertation, University of British Columbia]. http://hdl.handle.net/2429/66984
  • Chen, J., de la Torre, J., & Zhang, Z. (2013). Relative and absolute fit evaluation in cognitive diagnosis modeling. Journal of Educational Measurement, 50(2), 123–140. https://doi.org/10.1111/j.1745-3984.2012.00185.x
  • Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Sage Publications.
  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104
  • Diao, H., & Sireci, S. G. (2018). Item response theory-based methods for estimating classification accuracy and consistency. Journal of Applied Testing Technology, 19(1), 20–25.
  • Guo, F. (2006). Expected classification accuracy using the latent distribution. Practical Assessment, Research, and Evaluation, 11(1), 6. https://doi.org/10.7275/bxba-7466
  • Hanson, B. A., & Brennan, R. L. (1990). An investigation of classification consistency indexes estimated under alternative strong true score models. Journal of Educational Measurement, 27(4), 345–359. https://doi.org/10.1111/j.1745-3984.1990.tb00753.x
  • Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13(4), 253–264. https://doi.org/10.1111/j.1745-3984.1976.tb00016.x
  • Kingsbury, G. G., & Weiss, D. J. (1980). A comparison of adaptive, sequential, and conventional testing strategies for mastery decisions (ADA094478). https://apps.dtic.mil/sti/pdfs/ADA094478.pdf
  • Lathrop, Q. N. (2020). Package ‘cacIRT’ [Computer software]. https://cran.r-project.org/web/packages/cacIRT/cacIRT.pdf
  • Lathrop, Q. N., & Cheng, Y. (2013). Two approaches to the estimation of classification accuracy rate under item response theory. Applied Psychological Measurement, 37(3), 226–241. https://doi.org/10.1177/0146621612471888
  • Lathrop, Q. N., & Cheng, Y. (2014). A nonparametric approach to estimate classification accuracy and consistency. Journal of Educational Measurement, 51(3), 318–334. https://doi.org/10.1111/jedm.12048
  • Lee, W. C. (2010). Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47(1), 1–17. https://doi.org/10.1111/j.1745-3984.2009.00096.x
  • Lee, W. C., Brennan, R. L., & Wan, L. (2009). Classification consistency and accuracy for complex assessments under the compound multinomial model. Applied Psychological Measurement, 33(5), 374–390. https://doi.org/10.1177/0146621608321759
  • Lee, W. C., Hanson, B. A., & Brennan, R. L. (2000). Procedures for computing classification consistency and accuracy indices with multiple categories. https://www.act.org/content/dam/act/unsecured/documents/ACT_RR2000-10.pdf
  • Lee, W. C., Hanson, B. A., & Brennan, R. L. (2002). Estimating consistency and accuracy indices for multiple classifications. Applied Psychological Measurement, 26(4), 412–432. https://doi.org/10.1177/014662102237797
  • Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on test scores. Journal of Educational Measurement, 32(2), 179–197. https://doi.org/10.1111/j.1745-3984.1995.tb00462.x
  • Martineau, J. A. (2007). An expansion and practical evaluation of expected classification accuracy. Applied Psychological Measurement, 31(3), 181–194. https://doi.org/10.1177/0146621606291557
  • Md Desa, Z. N. D. (2012). Bi-factor multidimensional item response theory modeling for subscores estimation, reliability, and classification [Doctoral dissertation, University of Kansas]. https://kuscholarworks.ku.edu/handle/1808/10126
  • Minchen, N., & de la Torre, J. (2018). A general cognitive diagnosis model for continuous-response data. Measurement: Interdisciplinary Research and Perspectives, 16(1), 30–44. https://doi.org/10.1080/15366367.2018.1436817
  • Partchev, I. (2017). Package ‘irtoys’ [Computer software]. https://cran.r-project.org/web/packages/irtoys/irtoys.pdf
  • Revelle, W. (2015). Package ‘psych’ [Computer software]. https://cran.r-project.org/web/packages/psych/psych.pdf
  • Robitzsch, A. (2020). Package ‘sirt’ [Computer software]. https://cran.r-project.org/web/packages/sirt/sirt.pdf
  • Rudner, L. M. (2001). Computing the expected proportions of misclassified examinees. Practical Assessment, Research, and Evaluation, 7(14), 1–5. https://doi.org/10.7275/an9m-2035
  • Rudner, L. M. (2005). Expected classification accuracy. Practical Assessment, Research, and Evaluation, 10(13), 1–4. https://doi.org/10.7275/56a5-6b14
  • Sen, S., & Cohen, A. S. (2020). The impact of test and sample characteristics on model selection and classification accuracy in the multilevel mixture IRT model. Frontiers in Psychology, 11, 197. https://doi.org/10.3389/fpsyg.2020.00197
  • Subkoviak, M. J. (1976). Estimating reliability from a single administration of a criterion-referenced test. Journal of Educational Measurement, 13(4), 265–276. https://doi.org/10.1111/j.1745-3984.1976.tb00017.x
  • Terzi, R., & de la Torre, J. (2018). An iterative method for empirically-based Q-matrix validation. International Journal of Assessment Tools in Education, 5(2), 248–262. https://doi.org/10.21449/ijate.407193
  • Thompson, N. A. (2009). Item selection in computerized classification testing. Educational and Psychological Measurement, 69(5), 778–793. https://doi.org/10.1177/0013164408324460
  • Wang, S., & Wang, T. (2001). Precision of Warm’s weighted likelihood estimates for a polytomous model in computerized adaptive testing. Applied Psychological Measurement, 25(4), 317–331. https://doi.org/10.1177/01466210122032163
  • Wyse, A. E., & Hao, S. (2012). An evaluation of item response theory classification accuracy and consistency indices. Applied Psychological Measurement, 36(7), 602–624. https://doi.org/10.1177/0146621612451522
  • Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125–145. https://doi.org/10.1177/014662168400800201

Details

Primary Language English
Subjects Measurement and Evaluation in Education (Other)
Journal Section Articles
Authors

Nurşah Yakut 0000-0002-2983-0329

Emine Önen 0000-0002-0398-3191

Early Pub Date March 25, 2025
Publication Date
Submission Date December 4, 2024
Acceptance Date March 20, 2025
Published in Issue Year 2025 Volume: 6 Issue: 1

Cite

APA Yakut, N., & Önen, E. (2025). Comparison of Classification Accuracy and Consistency Indices Under the Item Response Theory. International Journal of Educational Studies and Policy, 6(1), 90-105. https://doi.org/10.5281/zenodo.15075438

Creative Commons License

All content published in the International Journal of Educational Studies and Policy (IJESP) is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This license permits unrestricted use, distribution, reproduction, and adaptation of articles, datasets, graphics, and supplementary materials in any medium, including data mining applications, search engines, websites, and blogs—provided that appropriate credit is given to the original source. IJESP adopts an open access policy that promotes free and global availability of academic knowledge. Open access facilitates interdisciplinary communication and fosters collaboration across diverse academic fields. In line with this philosophy, IJESP contributes to the advancement of educational sciences by ensuring wide dissemination of its content and maintaining a transparent editorial and peer review process.