Research Article
Year 2021, Volume: 12 Issue: 1, 28 - 53, 31.03.2021
https://doi.org/10.21031/epod.817396

How Reliable Is It to Automatically Score Open-Ended Items? An Application in the Turkish Language

Abstract

The use of open-ended items, especially in large-scale tests, creates difficulties in scoring. This problem can, however, be overcome with an approach based on automated scoring of open-ended items. The aim of this study was to examine the reliability of the data obtained by scoring open-ended items automatically. One objective was to compare different machine-learning algorithms used in automated scoring (support vector machines, logistic regression, multinomial Naive Bayes, long short-term memory, and bidirectional long short-term memory). The other objective was to investigate how the reliability of automated scoring changes with the proportion of data used to test the system (33%, 20%, and 10%). While examining the reliability of automated scoring, a comparison was made with the reliability of the scores obtained from human raters. In this study, the first attempt at automated scoring of open-ended items in the Turkish language, data from the Turkish language test of the Academic Skills Monitoring and Evaluation (ABİDE) program administered by the Ministry of National Education were used. Cross-validation was used to test the system. Three agreement coefficients were used as indicators of reliability: the percentage of agreement; the quadratic-weighted Kappa, which is frequently used in automated scoring studies; and Gwet's AC1, which is not affected by the prevalence problem in the distribution of data across categories. The results showed that automated scoring algorithms can be utilized for open-ended items. The best-performing algorithm was bidirectional long short-term memory. The long short-term memory and multinomial Naive Bayes algorithms performed worse than support vector machines, logistic regression, and bidirectional long short-term memory. The agreement coefficients at the 33% test data rate were slightly lower than those at the 10% and 20% test data rates but remained within the desired range.
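For readers who want to reproduce the general setup, the sketch below illustrates, under stated assumptions, how one of the classical algorithms compared in the study (a support vector machine over TF-IDF features) could be trained and evaluated at the three test-data rates. It is not the authors' code: the file name, column names, and feature settings are hypothetical, and a single stratified split is shown for brevity, whereas the study reports cross-validation.

```python
# Minimal sketch (not the authors' pipeline): TF-IDF features over Turkish open-ended
# responses, a linear SVM as one of the compared classifiers, and the three test-data
# rates mentioned in the abstract. "abide_turkish_items.csv", "response", and
# "human_score" are assumed names for illustration only.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

data = pd.read_csv("abide_turkish_items.csv")      # hypothetical file: responses + human scores
X, y = data["response"], data["human_score"]

for test_rate in (0.33, 0.20, 0.10):               # test-data rates compared in the study
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_rate, stratify=y, random_state=0
    )
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # word and bigram features (assumed)
        LinearSVC()                                      # support vector machine scorer
    )
    model.fit(X_tr, y_tr)
    machine_scores = model.predict(X_te)
    # machine_scores would then be compared with y_te using the agreement
    # coefficients shown in the next sketch.
```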

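The three agreement coefficients named above can be computed as in the following sketch. Quadratic-weighted Kappa is taken from scikit-learn; the percentage of agreement and Gwet's AC1 (two-rater, nominal case) are implemented directly from their standard formulas (Gwet, 2008). The score categories and toy data are illustrative only, not values from the study.

```python
# Minimal sketch of the three agreement coefficients used to compare human and machine scores.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def percent_agreement(r1, r2):
    """Proportion of cases on which the two raters give the same score."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    return np.mean(r1 == r2)

def gwet_ac1(r1, r2):
    """Gwet's AC1 for two raters and nominal categories (Gwet, 2008)."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    cats = np.union1d(r1, r2)
    n, q = len(r1), len(cats)
    pa = np.mean(r1 == r2)                              # observed agreement
    # pi_k: average proportion of ratings in category k across the two raters
    pi = np.array([(np.sum(r1 == k) + np.sum(r2 == k)) / (2 * n) for k in cats])
    pe = np.sum(pi * (1 - pi)) / (q - 1)                # chance agreement under AC1
    return (pa - pe) / (1 - pe)

human = [2, 1, 0, 2, 1, 1, 0, 2]                        # toy scores for illustration
machine = [2, 1, 0, 1, 1, 1, 0, 2]

print(percent_agreement(human, machine))                             # percentage of agreement
print(cohen_kappa_score(human, machine, weights="quadratic"))        # quadratic-weighted Kappa
print(gwet_ac1(human, machine))                                      # Gwet's AC1
```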
References

  • Adesiji, K. M., Agbonifo, O. C., Adesuyi, A. T., & Olabode, O. (2016). Development of an automated descriptive text-based scoring system. British Journal of Mathematics & Computer Science, 19(4), 1-14. doi: 10.9734/BJMCS/2016/27558
  • Altman, D. G. (1991). Practical statistics for medical research. Boca Raton: CRC.
  • Araujo, J., & Born, D. G. (1985). Calculating percentage agreement correctly but writing its formula incorrectly. The Behavior Analyst, 8(2), 207-208. doi: 10.1007/BF03393152
  • Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® v.2. Journal of Technology, Learning, and Assessment, 4(3). Retrieved from http://www.jtla.org.
  • Berg, P-C., & Gopinathan, M. (2017). A deep learning ensemble approach to gender identification of tweet authors (Master's thesis, Norwegian University of Science and Technology). Retrieved from https://brage.bibsys.no/xmlui/handle/11250/2458477
  • Brenner, H., & Kliebsch, U. (1996). Dependence of weighted Kappa coefficients on the number of categories. Epidemiology, 7(2), 199-202. https://doi.org/10.1097/00001648-199603000-00016
  • Byrt, T., Bishop, J., & Carlin, J. B. (1993). Bias, prevalence and Kappa. Journal of Clinical Epidemiology, 46(5), 423-429. doi: 10.1016/0895-4356(93)90018-V
  • Chen, H., Xu, J., & He, B. (2014). Automated essay scoring by capturing relative writing quality. The Computer Journal, 57(9), 1318-1330. doi:10.1093/comjnl/bxt117
  • Cohen, Y., Ben-Simon, A., & Hovav, M. (2003, October). The effect of specific language features on the complexity of systems for automated essay scoring. Paper presented at the conference of the International Association for Educational Assessment, Manchester.
  • Cohen, Y., Levi, E., & Ben-Simon, A. (2018). Validating human and automated scoring of essays against "True" scores. Applied Measurement in Education, 31(3), 241-250. https://doi.org/10.1080/08957347.2018.1464450
  • Creswell, J. W. (2012). Educational research: Planning, conducting and evaluating quantitative and qualitative research (4th ed.). Boston: Pearson.
  • Downing, S. M. (2009). Written tests: Constructed-response and selected-response formats. In S. M. Downing & R. Yudkowsky (Eds.), Assessment in health professions education (pp. 149-184). New York, NY: Routledge.
  • Ebel, R. L., & Frisbie, D. A. (1991). Essentials of educational measurement (5th ed.). Englewood Cliffs, NJ: Prentice-Hall.
  • Eugenio, B. D., & Glass, M. (2004). The Kappa statistic: A second look. Computational Linguistics, 30(1), 95-101. https://doi.org/10.1162/089120104773633402
  • Gamer, M., Lemon, I., Fellows, J., & Singh, P. (2010). irr: Various coefficients of interrater reliability and agreement (Version 0.83) [Computer software]. https://CRAN.R-project.org/package=irr
  • Geisinger, K. F., & Usher-Tate, B. J. (2016). A brief history of educational testing and psychometrics. In C. S. Wells & M. Faulkner-Bond (Eds.), Educational measurement: From foundations to future (pp. 3-20). New York: The Guilford Press.
  • Gierl, M. J., Latifi, S., Lai, H., Boulais, A. P., & Champlain, A. D. (2014). Automated essay scoring and the future of educational assessment in medical education. Medical Education, 48, 950-962. doi: 10.1111/medu.12517
  • Goodwin, L. D. (2001). Interrater agreement and reliability. Measurement in Physical Education and Exercise Science, 5(1), 13-34. https://doi.org/10.1207/S15327841MPEE0501_2
  • Graham, M., Milanowski, A., & Miller, J. (2012). Measuring and promoting inter-rater agreement of teacher and principal performance ratings. Report of the Center for Educator Compensation Reform. Retrieved from https://files.eric.ed.gov/fulltext/ED532068.pdf
  • Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29-48. doi: 10.1348/000711006X126600
  • Gwet, K. L. (2016). Testing the difference of correlated agreement coefficients for statistical significance. Educational and Psychological Measurement, 76(4), 609-637. doi: 10.1177/0013164415596420
  • Haley, D. T. (2007). Using a new inter-rater reliability statistic (Report No. 2007/16). UK: The Open University.
  • Hamner, B., & Frasco, M. (2018). Metrics: Evaluation metrics for machine learning (Version 0.1.4) [Computer Software]. https://CRAN.R-project.org/package=Metrics
  • Hartmann, D. P. (1977). Considerations in the choice of interobserver reliability estimates. Journal of Applied Behavior Analysis, 10(1), 103-116.
  • Hoek, J., & Scholman, M. C. J. (2017). Evaluating discourse annotation: Some recent insights and new approaches. In H. Bunt (Ed.), ACL Workshop on Interoperable Semantic Annotation (pp. 1-13). https://www.aclweb.org/anthology/W17-7401
  • Ishioka, T., & Kameda, M. (2006). Automated Japanese essay scoring system based on articles written by experts. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, Sydney, 44, 233-240. doi: 10.3115/1220175.1220205
  • Jang, E-S., Kang, S-S., Noh, E-H., Kim, M-H., Sung, K-H., & Seong, T-J. (2014). KASS: Korean automatic scoring system for short-answer questions. Proceedings of the 6th International Conference on Computer Supported Education, Barcelona, 2, 226-230. doi: 10.5220/0004864302260230
  • Kumar, C. S., & Rama Sree, R. J. (2014). An attempt to improve classification accuracy through implementation of bootstrap aggregation with sequential minimal optimization during automated evaluation of descriptive answers. Indian Journal of Science and Technology, 7(9), 1369-1375.
  • Lacy, S., Watson, B. R., Riffe, D., & Lovejoy, J. (2015). Issues and best practices in content analysis. Journalism and Mass Communication Quarterly, 92(4), 1-21. doi: 10.1177/1077699015607338
  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.
  • Lilja, M. (2018). Automatic essay scoring of Swedish essays using neural networks (Doctoral dissertation, Uppsala University). Retrieved from http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1213688&dswid=9250
  • LoMartire, R. (2017). rel: Reliability coefficients (version 1.3.1) [Computer software]. https://CRAN.R-project.org/package=rel
  • Messick, S. (1993). Trait equivalence as construct validity of score interpretation across multiple methods of measurement. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 61-73). New Jersey: Lawrence Erlbaum Associates, Inc.
  • Meyer, G. J. (1999). Simple procedures to estimate chance agreement and Kappa for the interrater reliability of response segments using the Rorschach Comprehensive System. Journal of Personality Assessment, 72(2), 230-255. doi: 10.1207/S15327752JP720209
  • Ministry of National Education (MoNE). (2017a). Akademik becerilerin izlenmesi ve değerlendirilmesi (ABİDE) 2016 8. sınıflar raporu [Monitoring and evaluation of academic skills (ABİDE) 2016 report for 8th grades]. Retrieved from https://odsgm.meb.gov.tr/meb_iys_dosyalar/2017_11/30114819_iY-web-v6.pdf
  • Ministry of National Education (MoNE). (2017b). İzleme değerlendirme raporu 2016 [Monitoring and evaluation report 2016]. Retrieved from http://odsgm.meb.gov.tr/meb_iys_dosyalar/2017_06/23161120_2016_izleme_degYerlendirme_raporu.pdf
  • Page, E. B. (1966). The imminence of grading essays by computers. Phi Delta Kappan, 47(5), 238–243. Retrieved from http://www.jstor.org/stable/20371545
  • Powers, D. E., Escoffery, D. S., & Duchnowski, M. P. (2015). Validating automated essay scoring: A (modest) refinement of the "gold standard". Applied Measurement in Education, 28(2), 130-142. doi: 10.1080/08957347.2014.1002920
  • Preston, D., & Goodman, D. (2012). Automated essay scoring and the repair of electronics. Retrieved from https://www.semanticscholar.org/
  • R Core Team. (2018). R: A language and environment for statistical computing (version 3.5.2) [Computer software]. Vienna, Austria: R Foundation for Statistical Computing.
  • Ramineni, C., & Williamson, D. M. (2013). Automated essay scoring: Psychometric guidelines and practices. Assessing Writing, 18(1), 25-39. https://doi.org/10.1016/j.asw.2012.10.004
  • Senay, A., Delisle, J., Raynauld, J. P., Morin, S. N., & Fernandes, J. C. (2015). Agreement between physicians' and nurses' clinical decisions for the management of the fracture liaison service (4iFLS): The Lucky Bone™ program. Osteoporosis International, 27(4), 1569-1576. doi: 10.1007/s00198-015-3413-6
  • Shankar, V., & Bangdiwala, S. I. (2014). Observer agreement paradoxes in 2x2 tables: Comparison of agreement measures. BMC Medical Research Methodology, 14(100). Advance online publication. https://doi.org/10.1186/1471-2288-14-100
  • Shermis, M. D. (2010). Automated essay scoring in a high stakes testing environment. In V. J. Shute & B. J. Becker (Eds.), Innovative assessment for the 21st century (pp. 167-185). New York: Springer.
  • Shermis, M. D., & Burstein, J. (2003). Automated essay scoring. Mahwah, New Jersey: Lawrence Erlbaum Associates.
  • Sim, J., & Wright, C. C. (2005). The Kappa statistic in reliability studies: Use, interpretation, and sample size requirements. Physical Therapy, 85(3), 257-268. https://doi.org/10.1093/ptj/85.3.257
  • Siriwardhana, D. D., Walters, K., Rait, G., Bazo-Alvarez, J. C., & Weerasinghe, M. C. (2018). Cross-cultural adaptation and psychometric evaluation of the Sinhala version of Lawton Instrumental Activities of Daily Living Scale. Plos One, 13(6), 1-20. https://doi.org/10.1371/journal.pone.0199820
  • Taghipour, K., & Ng, H. T. (2016). A neural approach to automated essay scoring. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, 1882-1891. doi: 10.18653/v1/D16-1193
  • Vanbelle, S. (2016). A new interpretation of the weighted Kappa coefficients. Psychometrika, 81(2), 399-410. https://doi.org/10.1007/s11336-014-9439-4
  • Wang, J., & Brown, M. S. (2007). Automated essay scoring versus human scoring: A comparative study. Journal of Technology, Learning, and Assessment, 6(2). Retrieved from http://www.jtla.org.
  • Wang, Y., Wei, Z., Zhou, Y., & Huang, X. (2018, November). Automatic essay scoring incorporating rating schema via reinforcement learning. In E. Riloff, D. Chiang, J. Hockenmaier, & J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 791-797). Brussels, Belgium: Association for Computational Linguistics.
  • Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2-13. https://doi.org/10.1111/j.1745-3992.2011.00223.x
  • Wongpakaran, N., Wongpakaran, T., Wedding, D., & Gwet, K. L. (2013). A comparison of Cohen's Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: A study conducted with personality disorder samples. BMC Medical Research Methodology, 13(61), 1-9.
There are 53 citations in total.

Details

Primary Language English
Journal Section Articles
Authors

İbrahim Uysal 0000-0002-6767-0362

Nuri Doğan 0000-0001-6274-2016

Publication Date March 31, 2021
Acceptance Date February 14, 2021
Published in Issue Year 2021 Volume: 12 Issue: 1

Cite

APA Uysal, İ., & Doğan, N. (2021). How Reliable Is It to Automatically Score Open-Ended Items? An Application in the Turkish Language. Journal of Measurement and Evaluation in Education and Psychology, 12(1), 28-53. https://doi.org/10.21031/epod.817396