Research Article
How Reliable Is It to Automatically Score Open-Ended Items? An Application in the Turkish Language

Year 2021, Volume 12, Issue 1, 28-53, 31.03.2021
https://doi.org/10.21031/epod.817396

Abstract

The use of open-ended items, especially in large-scale tests, creates difficulties in scoring. This problem can, however, be overcome with an approach based on automated scoring of open-ended items. The aim of this study was to examine the reliability of the data obtained by scoring open-ended items automatically. One objective was to compare different machine-learning algorithms for automated scoring (support vector machines, logistic regression, multinomial Naive Bayes, long short-term memory, and bidirectional long short-term memory). The other objective was to investigate how the reliability of automated scoring changed as the proportion of data used for testing the system varied (33%, 20%, and 10%). While examining the reliability of automated scoring, a comparison was made with the reliability of the data obtained from human raters. In this study, which is the first attempt at automated scoring of open-ended items in the Turkish language, Turkish test data from the Academic Skills Monitoring and Evaluation (ABIDE) program administered by the Ministry of National Education were used. Cross-validation was used to test the system. As coefficients of agreement indicating reliability, the percentage of agreement, the quadratic-weighted Kappa, which is frequently used in automated scoring studies, and Gwet's AC1, which is not affected by the prevalence problem in the distribution of data across categories, were used. The results showed that automated scoring algorithms could be utilized. The best-performing algorithm for automated scoring was the bidirectional long short-term memory. The long short-term memory and multinomial Naive Bayes algorithms performed worse than support vector machines, logistic regression, and the bidirectional long short-term memory.
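The three agreement coefficients named above can be sketched directly from their standard formulas. The snippet below is an illustrative pure-Python implementation for two raters on an ordinal scale, not the study's own code (the cited R packages irr, rel, and Metrics provide tested versions):

```python
def agreement_stats(rater1, rater2, categories):
    """Percentage of agreement, quadratic-weighted kappa, and Gwet's AC1
    for two raters scoring the same responses on an ordinal scale."""
    n, k = len(rater1), len(categories)
    idx = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions and marginal proportions
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        obs[idx[a]][idx[b]] += 1.0 / n
    row = [sum(r) for r in obs]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    # Percentage of agreement: proportion of exact matches
    pa = sum(obs[i][i] for i in range(k))

    # Quadratic weights penalize disagreements by squared category distance
    w = lambda i, j: (i - j) ** 2 / (k - 1) ** 2
    d_obs = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    d_exp = sum(w(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    qwk = 1.0 - d_obs / d_exp

    # Gwet's AC1: chance agreement built from average marginals, which keeps
    # the coefficient stable when category prevalence is skewed
    pi = [(row[q] + col[q]) / 2.0 for q in range(k)]
    pe = sum(p * (1.0 - p) for p in pi) / (k - 1)
    ac1 = (pa - pe) / (1.0 - pe)
    return pa, qwk, ac1
```

With identical ratings all three coefficients equal 1.0; they diverge as disagreements appear, AC1 typically staying higher than kappa when one category dominates.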
In automated scoring, the coefficients of agreement at the 33% test data rate were slightly lower than at the 10% and 20% rates, but remained within the desired range.
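Varying the test data rate amounts to holding out a different share of the human-scored responses for evaluating the automated scorer. The sketch below is a toy illustration on hypothetical data (not the ABIDE responses or the study's pipeline):

```python
import random

def holdout_split(pairs, test_rate, seed=42):
    """Shuffle scored (response, score) pairs and hold out `test_rate`
    of them as the test set for evaluating the automated scorer."""
    rng = random.Random(seed)           # fixed seed for a reproducible split
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_rate)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

# Hypothetical open-ended responses with human scores on a 0-2 scale
data = [(f"response {i}", i % 3) for i in range(100)]
for rate in (0.10, 0.20, 0.33):
    train, test = holdout_split(data, rate)
    print(f"test rate {rate:.0%}: {len(train)} train / {len(test)} test")
```

A larger test rate leaves less data for training the scoring model, which is consistent with the slightly lower agreement observed at the 33% rate.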

References

  • Adesiji, K. M., Agbonifo, O. C., Adesuyi, A. T., & Olabode, O. (2016). Development of an automated descriptive text-based scoring system. British Journal of Mathematics & Computer Science, 19(4), 1-14. doi: 10.9734/BJMCS/2016/27558
  • Altman, D. G. (1991). Practical statistics for medical research. Boca Raton: CRC.
  • Araujo, J., & Born, D. G. (1985). Calculating percentage agreement correctly but writing its formula incorrectly. The Behavior Analyst, 8(2), 207-208. doi: 10.1007/BF03393152
  • Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® v.2. Journal of Technology, Learning, and Assessment, 4(3). Retrieved from http://www.jtla.org.
  • Berg, P-C., & Gopinathan, M. (2017). A deep learning ensemble approach to gender identification of tweet authors (Master's thesis, Norwegian University of Science and Technology). Retrieved from https://brage.bibsys.no/xmlui/handle/11250/2458477
  • Brenner, H., & Kliebsch, U. (1996). Dependence of weighted Kappa coefficients on the number of categories. Epidemiology, 7(2), 199-202. https://doi.org/10.1097/00001648-199603000-00016
  • Byrt, T., Bishop, J., & Carlin, J. B. (1993). Bias, prevalence and Kappa. Journal of Clinical Epidemiology, 46(5), 423-429. doi: 10.1016/0895-4356(93)90018-V
  • Chen, H., Xu, J., & He, B. (2014). Automated essay scoring by capturing relative writing quality. The Computer Journal, 57(9), 1318-1330. doi:10.1093/comjnl/bxt117
  • Cohen, Y., Ben-Simon, A., & Hovav, M. (2003, October). The effect of specific language features on the complexity of systems for automated essay scoring. Paper presented at the International Association of Educational Administration, Manchester.
  • Cohen, Y., Levi, E., & Ben-Simon, A. (2018). Validating human and automated scoring of essays against "True" scores. Applied Measurement in Education, 31(3), 241-250. https://doi.org/10.1080/08957347.2018.1464450
  • Creswell, J. W. (2012). Educational research: Planning, conducting and evaluating quantitative and qualitative research (4th ed.). Boston: Pearson.
  • Downing, S. M. (2009). Written tests: Constructed-response and selected-response formats. In S. M. Downing & R. Yudkowsky (Eds.), Assessment in health professions education (pp. 149-184). New York, NY: Routledge.
  • Ebel, R. L., & Frisbie, D. A. (1991). Essentials of educational measurement (5th ed.). Englewood Cliffs, NJ: Prentice-Hall.
  • Eugenio, B. D., & Glass, M. (2004). The Kappa statistic: A second look. Computational Linguistics, 30(1), 95-101. https://doi.org/10.1162/089120104773633402
  • Gamer, M., Lemon, J., Fellows, I., & Singh, P. (2010). irr: Various coefficients of interrater reliability and agreement (Version 0.83) [Computer software]. https://CRAN.R-project.org/package=irr
  • Geisinger, K. F., & Usher-Tate, B. J. (2016). A brief history of educational testing and psychometrics. In C. S. Wells & M. Faulkner-Bond (Eds.), Educational measurement from foundations to future (pp. 3-20). New York: The Guilford Press.
  • Gierl, M. J., Latifi, S., Lai, H., Boulais, A. P., & Champlain, A. D. (2014). Automated essay scoring and the future of educational assessment in medical education. Medical Education, 48, 950-962. doi: 10.1111/medu.12517
  • Goodwin, L. D. (2001). Interrater agreement and reliability. Measurement in Physical Education and Exercise Science, 5(1), 13-34. https://doi.org/10.1207/S15327841MPEE0501_2
  • Graham, M., Milanowski, A., & Miller, J. (2012). Measuring and promoting inter-rater agreement of teacher and principal performance ratings. Report of the Center for Educator Compensation Reform. Retrieved from https://files.eric.ed.gov/fulltext/ED532068.pdf
  • Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29-48. doi: 10.1348/000711006X126600
  • Gwet, K. L. (2016). Testing the difference of correlated agreement coefficients for statistical significance. Educational and Psychological Measurement, 76(4), 609-637. doi: 10.1177/0013164415596420
  • Haley, D. T. (2007). Using a new inter-rater reliability statistic (Report No. 2017/16). UK: The Open University.
  • Hamner, B., & Frasco, M. (2018). Metrics: Evaluation metrics for machine learning (Version 0.1.4) [Computer Software]. https://CRAN.R-project.org/package=Metrics
  • Hartmann, D. P. (1977). Considerations in the choice of interobserver reliability estimates. Journal of Applied Behavior Analysis, 10(1), 103-116.
  • Hoek, J., & Scholman, M. C. J. (2017). Evaluating discourse annotation: Some recent insights and new approaches. In H. Bunt (Ed.), ACL Workshop on Interoperable Semantic Annotation (pp. 1-13). https://www.aclweb.org/anthology/W17-7401
  • Ishioka, T., & Kameda, M. (2006). Automated Japanese essay scoring system based on articles written by experts. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, Sydney, 44, 233-240. doi: 10.3115/1220175.1220205
  • Jang, E-S., Kang, S-S., Noh, E-H., Kim, M-H., Sung, K-H., & Seong, T-J. (2014). KASS: Korean automatic scoring system for short-answer questions. Proceedings of the 6th International Conference on Computer Supported Education, Barcelona, 2, 226-230. doi: 10.5220/0004864302260230
  • Kumar, C. S., & Rama Sree, R. J. (2014). An attempt to improve classification accuracy through implementation of bootstrap aggregation with sequential minimal optimization during automated evaluation of descriptive answers. Indian Journal of Science and Technology, 7(9), 1369-1375.
  • Lacy, S., Watson, B. R., Riffe, D., & Lovejoy, J. (2015). Issues and best practices in content analysis. Journalism and Mass Communication Quarterly, 92(4), 1-21. doi: 10.1177/1077699015607338
  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.
  • Lilja, M. (2018). Automatic essay scoring of Swedish essays using neural networks (Doctoral dissertation, Uppsala University). Retrieved from http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1213688&dswid=9250
  • LoMartire, R. (2017). rel: Reliability coefficients (version 1.3.1) [Computer software]. https://CRAN.R-project.org/package=rel
  • Messick, S. (1993). Trait equivalence as construct validity of score interpretation across multiple methods of measurement. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 61-73). New Jersey: Lawrence Erlbaum Associates, Inc.
  • Meyer, G. J. (1999). Simple procedures to estimate chance agreement and Kappa for the interrater reliability of response segments using the Rorschach Comprehensive System. Journal of Personality Assessment, 72(2), 230-255. doi: 10.1207/S15327752JP720209
  • Ministry of National Education (MoNE). (2017a). Akademik becerilerin izlenmesi ve değerlendirilmesi (ABİDE) 2016 8. sınıflar raporu [Monitoring and evaluation of academic skills (ABİDE) 2016 report for 8th grades]. Retrieved from https://odsgm.meb.gov.tr/meb_iys_dosyalar/2017_11/30114819_iY-web-v6.pdf
  • Ministry of National Education (MoNE). (2017b). İzleme değerlendirme raporu 2016 [Monitoring and evaluation report 2016]. Retrieved from http://odsgm.meb.gov.tr/meb_iys_dosyalar/2017_06/23161120_2016_izleme_degYerlendirme_raporu.pdf
  • Page, E. B. (1966). The imminence of grading essays by computers. Phi Delta Kappan, 47(5), 238–243. Retrieved from http://www.jstor.org/stable/20371545
  • Powers, D. E., Escoffery, D. S., & Duchnowski, M. P. (2015). Validating automated essay scoring: A (modest) refinement of the "gold standard". Applied Measurement in Education, 28(2), 130-142. doi: 10.1080/08957347.2014.1002920
  • Preston, D., & Goodman, D. (2012). Automated essay scoring and the repair of electronics. Retrieved from https://www.semanticscholar.org/
  • R Core Team. (2018). R: A language and environment for statistical computing (version 3.5.2) [Computer software]. Vienna, Austria: R Foundation for Statistical Computing.
  • Ramineni, C., & Williamson, D. M. (2013). Automated essay scoring: Psychometric guidelines and practices. Assessing Writing, 18(1), 25-39. https://doi.org/10.1016/j.asw.2012.10.004
  • Senay, A., Delisle, J., Raynauld, J. P., Morin, S. N., & Fernandes, J. C. (2015). Agreement between physicians' and nurses' clinical decisions for the management of the fracture liaison service (4iFLS): The Lucky Bone™ program. Osteoporosis International, 27(4), 1569-1576. doi: 10.1007/s00198-015-3413-6
  • Shankar, V., & Bangdiwala, S. I. (2014). Observer agreement paradoxes in 2x2 tables: Comparison of agreement measures. BMC Medical Research Methodology, 14(100). Advance online publication. https://doi.org/10.1186/1471-2288-14-100
  • Shermis, M. D. (2010). Automated essay scoring in a high stakes testing environment. In V. J. Shute & B. J. Becker (Eds.), Innovative assessment for the 21st century (pp. 167-185). New York: Springer.
  • Shermis, M. D., & Burstein, J. (2003). Automated essay scoring. Mahwah, New Jersey: Lawrence Erlbaum Associates.
  • Sim, J., & Wright, C. C. (2005). The Kappa statistic in reliability studies: Use, interpretation, and sample size requirements. Physical Therapy, 85(3), 257-268. https://doi.org/10.1093/ptj/85.3.257
  • Siriwardhana, D. D., Walters, K., Rait, G., Bazo-Alvarez, J. C., & Weerasinghe, M. C. (2018). Cross-cultural adaptation and psychometric evaluation of the Sinhala version of Lawton Instrumental Activities of Daily Living Scale. PLOS ONE, 13(6), 1-20. https://doi.org/10.1371/journal.pone.0199820
  • Taghipour, K., & Ng, H. T. (2016). A neural approach to automated essay scoring. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, 1882-1891. doi: 10.18653/v1/D16-1193
  • Vanbelle, S. (2016). A new interpretation of the weighted Kappa coefficients. Psychometrika, 81(2), 399-410. https://doi.org/10.1007/s11336-014-9439-4
  • Wang, J., & Brown, M. S. (2007). Automated essay scoring versus human scoring: A comparative study. Journal of Technology, Learning, and Assessment, 6(2). Retrieved from http://www.jtla.org.
  • Wang, Y., Wei, Z., Zhou, Y., & Huang, X. (2018, November). Automatic essay scoring incorporating rating schema via reinforcement learning. In E. Riloff, D. Chiang, J. Hockenmaier, & J. Tsujii (Eds.), Empirical methods in natural language processing (pp. 791-797). Brussels, Belgium: Association for Computational Linguistics.
  • Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2-13. https://doi.org/10.1111/j.1745-3992.2011.00223.x
  • Wongpakaran, N., Wongpakaran, T., Wedding, D., & Gwet, K. L. (2013). A comparison of Cohen's Kappa and Gwet's AC1 when calculating inter-rater reliability coefficients: A study conducted with personality disorder samples. BMC Medical Research Methodology, 13(61), 1-9.

Details

Primary Language: English
Subjects: Social
Publication Season: Spring
Section: Articles
Authors

İbrahim UYSAL (Corresponding Author)
BOLU ABANT İZZET BAYSAL ÜNİVERSİTESİ
0000-0002-6767-0362
Turkey


Nuri DOĞAN
HACETTEPE ÜNİVERSİTESİ
0000-0001-6274-2016
Turkey

Publication Date: March 31, 2021
Published Issue: Year 2021, Volume 12, Issue 1

Cite

Uysal, İ., & Doğan, N. (2021). How reliable is it to automatically score open-ended items? An application in the Turkish language. Journal of Measurement and Evaluation in Education and Psychology, 12(1), 28-53. https://doi.org/10.21031/epod.817396

Journal of Measurement and Evaluation in Education and Psychology (ISSN/e-ISSN: 1309-6575), published by Eğitimde ve Psikolojide Ölçme ve Değerlendirme Derneği.