Research Article

Answer-based and reference-based BERT models for automatic scoring of Turkish short answers: The decisive role of task complexity

Volume: 12 Number: 4 December 5, 2025


Abstract

In measurement and evaluation processes, natural language responses are often avoided due to time, workload, and reliability concerns. However, the growing body of work on automatic short-answer grading means such responses can now be scored more quickly and reliably. This study aims to build models for predicting automatic short-answer scores using the pre-trained BERT deep learning language model and to evaluate their effectiveness. For this purpose, two score prediction models were created: an answer-based approach that aligns student answers with expert judgements, and a reference-based approach that matches student answers with reference answers. The dataset comprises answers from 246 Physics department students to four physics questions representing varying levels of cognitive complexity. The performance of the models was evaluated using Cohen's Kappa for statistical comparison of agreement with expert scores. Our findings reveal a clear interaction between model architecture and task complexity. The answer-based model was unequivocally superior for the most complex, multi-class task, effectively capturing diverse, nuanced responses. Conversely, the reference-based model demonstrated a statistically significant advantage for a well-defined, medium-complexity binary task. This study concludes that the optimal model for ASAG in Turkish is contingent on the cognitive demands of the assessment task, suggesting that a one-size-fits-all solution may not be the most effective approach. This provides a critical framework for practitioners, demonstrating not only that effective models are feasible for complex languages, but that their selection must be guided by task complexity.
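The abstract's agreement measure, Cohen's Kappa, corrects raw percent agreement for the agreement expected by chance. The following is a minimal illustrative sketch (not the authors' code) of how a model's predicted scores could be compared against expert scores on a binary task; the score lists are hypothetical examples.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters score identically.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters scored independently,
    # using each rater's marginal category frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    pe = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical expert scores vs. model predictions (binary task)
expert = [1, 1, 0, 1, 0, 0, 1, 0]
model  = [1, 1, 0, 1, 0, 1, 1, 0]
print(cohens_kappa(expert, model))  # prints 0.75
```

Here raw agreement is 7/8 = 0.875, but chance agreement is 0.5, so kappa drops to 0.75; this chance correction is why kappa is preferred over simple accuracy when comparing scoring models with expert judgements.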

Keywords

Supporting Institution

Ataturk University, Institute of Educational Sciences

Ethical Statement

An existing, openly shared dataset was used in this research. Since the research does not contain any elements that violate ethical principles, it falls within the scope of research that does not require ethics committee approval.

Thanks

We thank Cinar et al. (2020) for sharing the physics dataset as an open data source.

References

  1. Abdul-Mageed, M., Elmadany, A., & Nagoudi, E.M.B. (2020). ARBERT & MARBERT: Deep bidirectional transformers for Arabic. arXiv. https://doi.org/10.48550/arXiv.2101.01785
  2. Abdul-Salam, M., El-Fatah, M.A., & Hassan, N.F. (2022). Automatic grading for Arabic short answer questions using optimized deep learning model. PLoS ONE, 17(8), Article e0272269. https://doi.org/10.1371/journal.pone.0272269
  3. Akila Devi, T.R., Javubar Sathick, K., Abdul Azeez Khan, A., & Arun Raj, L. (2023). Novel framework for improving the correctness of reference answers to enhance results of ASAG systems. SN Computer Science, 4(4), Article 415. https://doi.org/10.1007/s42979-023-01682-8
  4. Amur, Z.H., Hooi, Y.K., & Soomro, G.M. (2022). Automatic short answer grading (ASAG) using attention-based deep learning model. In 2022 International Conference on Digital Transformation and Intelligence (pp. 1-7). Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/ICDI57181.2022.10007187
  5. Badry, R.M., Ali, M., Rslan, E., & Kaseb, M.R. (2023). Automatic Arabic grading system for short answer questions. IEEE Access, 11, 39457-39465. https://doi.org/10.1109/ACCESS.2023.3267407
  6. Benli, I., & İsmailova, R. (2018). Use of open-ended questions in measurement and evaluation methods in distance education. International Technology and Education Journal, 2(1), 1-8.
  7. Burrows, S., Gurevych, I., & Stein, B. (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25, 60-117. https://doi.org/10.1007/s40593-014-0026-8
  8. Chan, S., Sathyamurthy, M., Inoue, C., Bax, M., Jones, J., & Oyekan, J. (2024). Integrating metadiscourse analysis with transformer-based models for enhancing construct representation and discourse competence assessment in L2 writing: A systemic multidisciplinary approach. Journal of Measurement and Evaluation in Education and Psychology, 15(Special Issue), 318-347. https://doi.org/10.21031/epod.1531269

Details

Primary Language

English

Subjects

Classroom Measurement Practices

Journal Section

Research Article

Early Pub Date

October 1, 2025

Publication Date

December 5, 2025

Submission Date

April 30, 2025

Acceptance Date

August 28, 2025

Published in Issue

Year 2025 Volume: 12 Number: 4

APA
Kara, A., Avinç Kara, Z., & Yıldırım, S. (2025). Answer-based and reference-based BERT models for automatic scoring of Turkish short answers: The decisive role of task complexity. International Journal of Assessment Tools in Education, 12(4), 905-925. https://doi.org/10.21449/ijate.1687429
