Research Article

Answer-based and reference-based BERT models for automatic scoring of Turkish short answers: The decisive role of task complexity

Year 2025, Volume: 12 Issue: 4, 905 - 925

Abstract

In measurement and evaluation processes, natural language responses are often avoided due to concerns about time, workload, and reliability. However, the growing body of work on automatic short-answer grading (ASAG) means such responses can now be scored more quickly and reliably. This study aims to build models for predicting short-answer scores using the pre-trained BERT deep learning language model and to evaluate their effectiveness. To this end, two score prediction models were created: an answer-based approach that aligns student answers with expert judgements, and a reference-based approach that matches student answers with reference answers. The dataset comprises answers from 246 Physics department students to four physics questions representing varying levels of cognitive complexity. Model performance was compared statistically using Cohen's Kappa agreement with expert scores. Our findings reveal a clear interaction between model architecture and task complexity. The answer-based model was unequivocally superior for the most complex, multi-class task, effectively capturing diverse, nuanced responses. Conversely, the reference-based model demonstrated a statistically significant advantage for a well-defined, medium-complexity binary task. This study concludes that the optimal ASAG model for Turkish is contingent on the cognitive demands of the assessment task, suggesting that a one-size-fits-all solution may not be the most effective approach. This provides a critical framework for practitioners, demonstrating not only that effective models are feasible for complex languages, but that model selection must be guided by task complexity.
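The abstract's model comparison rests on Cohen's Kappa, which corrects raw agreement between the model's scores and the expert's scores for agreement expected by chance. As a minimal illustration of the metric (this is not the authors' code; the function and toy data below are hypothetical), it can be computed as:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two sets of categorical scores."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters scored independently.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    if p_e == 1.0:  # degenerate case: both raters constant on one label
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Toy example: model scores vs. expert scores on a binary task
expert = [1, 1, 0, 0, 1, 0, 1, 1]
model = [1, 1, 0, 1, 1, 0, 1, 0]
print(round(cohen_kappa(expert, model), 3))  # → 0.467
```

A kappa of 1.0 indicates perfect agreement with the expert, while 0 indicates agreement no better than chance, which is why it is a stricter comparison criterion than raw accuracy for the imbalanced score distributions typical of short-answer data.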

Ethical Statement

The research used a ready-made dataset shared as an open-source resource. Since the study contains no elements that violate ethical principles, it falls within the scope of research that does not require ethics committee approval.

Supporting Institution

Ataturk University, Institute of Educational Sciences

Thanks

We thank Çınar et al. (2020) for sharing the physics dataset as an open data source.

References

  • Abdul-Mageed, M., Elmadany, A., & Nagoudi, E.M.B. (2020). ARBERT & MARBERT: Deep bidirectional transformers for Arabic. arXiv. https://doi.org/10.48550/arXiv.2101.01785
  • Abdul-Salam, M., El-Fatah, M.A., & Hassan, N.F. (2022). Automatic grading for Arabic short answer questions using optimized deep learning model. PLoS ONE, 17(8), Article e0272269. https://doi.org/10.1371/journal.pone.0272269
  • Akila Devi, T.R., Javubar Sathick, K., Abdul Azeez Khan, A., & Arun Raj, L. (2023). Novel framework for improving the correctness of reference answers to enhance results of ASAG systems. SN Computer Science, 4(4), Article 415. https://doi.org/10.1007/s42979-023-01682-8
  • Amur, Z.H., Hooi, Y.K., & Soomro, G.M. (2022). Automatic short answer grading (ASAG) using attention-based deep learning model. In 2022 International Conference on Digital Transformation and Intelligence (pp. 1-7). Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/ICDI57181.2022.10007187
  • Badry, R.M., Ali, M., Rslan, E., & Kaseb, M.R. (2023). Automatic Arabic grading system for short answer questions. IEEE Access, 11, 39457-39465. https://doi.org/10.1109/ACCESS.2023.3267407
  • Benli, I., & İsmailova, R. (2018). Use of open-ended questions in measurement and evaluation methods in distance education. International Technology and Education Journal, 2(1), 1-8.
  • Burrows, S., Gurevych, I., & Stein, B. (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25, 60-117. https://doi.org/10.1007/s40593-014-0026-8
  • Chan, S., Sathyamurthy, M., Inoue, C., Bax, M., Jones, J., & Oyekan, J. (2024). Integrating metadiscourse analysis with transformer-based models for enhancing construct representation and discourse competence assessment in L2 writing: A systemic multidisciplinary approach. Journal of Measurement and Evaluation in Education and Psychology, 15(Special Issue), 318-347. https://doi.org/10.21031/epod.1531269
  • Chaudhari, R., & Patel, M. (2024). Deep learning in automatic short answer grading: A comprehensive review. ITM Web of Conferences, 65, Article 03003. https://doi.org/10.1051/itmconf/20246503003
  • Chen, X., Zhou, Z., & Prado, M. (2025). ChatGPT-3.5 as an automatic scoring system and feedback provider in IELTS exams. International Journal of Assessment Tools in Education, 12(1), 62-77. https://doi.org/10.21449/ijate.1496193
  • Chen, Y., Luo, J., Zhu, X., Wu, H., & Yuan, S. (2023). A cross-lingual hybrid neural network with interaction enhancement for grading short-answer texts. IEEE Access, 11, 37508-37514. https://doi.org/10.1109/ACCESS.2023.3260840
  • Çınar, A., İnce, E., Gezer, M., & Yılmaz, O. (2020). Machine learning algorithm for grading open-ended physics questions in Turkish. Education and Information Technologies, 25(5), 3821-3844. https://doi.org/10.1007/s10639-020-10128-0
  • Devlin, J., Chang, M.W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv. https://doi.org/10.48550/arXiv.1810.04805
  • Dönmez, M. (2024). AI-based feedback tools in education: a comprehensive bibliometric analysis study. International Journal of Assessment Tools in Education, 11(4), 622-646. https://doi.org/10.21449/ijate.1467476
  • Filighera, A., Ochs, S., Steuer, T., & Tregel, T. (2023). Cheating automatic short answer grading with the adversarial usage of adjectives and adverbs. International Journal of Artificial Intelligence in Education, 34, 616-646. https://doi.org/10.1007/s40593-023-00361-2
  • Garg, J., Papreja, J., Apurva, K., & Jain, G. (2022). Domain-specific hybrid BERT based system for automatic short answer grading. In Proceedings of 2nd International Conference on Intelligent Technologies (pp. 1-6). The Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/CONIT55038.2022.9847754
  • Ghavidel, H.A., Zouaq, A., & Desmarais, M.C. (2020). Using BERT and XLNET for the automatic short answer grading task. In H.C. Lane, S. Zvacek, & J. Uhomoibhi (Eds.), Proceedings of the 12th International Conference on Computer Supported Education - (Volume 1) (pp. 58-67). SciTePress. https://doi.org/10.5220/0009422400580067
  • Gomaa, W.H., Nagib, A.E., Saeed, M.M., Algarni, A., & Nabil, E. (2023). Empowering short answer grading: integrating transformer-based embeddings and BI-LSTM network. Big Data and Cognitive Computing, 7(3), Article 122. https://doi.org/10.3390/bdcc7030122
  • Haller, S., Aldea, A., Seifert, C., & Strisciuglio, N. (2022). Survey on automatic short answer grading with deep learning: From word embeddings to transformers. arXiv. https://doi.org/10.48550/arXiv.2204.03503
  • Hasanah, U., Permanasari, A.E., Kusumawardani, S.S., & Pribadi, F.S. (2016, August). A review of an information extraction technique approach for automatic short answer grading. In Proceedings of 2016 1st International Conference on Information Technology, Information Systems and Electrical Engineering (pp. 192-196). Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/ICITISEE.2016.7803072
  • Jadidinejad, A.H., & Mahmoudi, F. (2014). Unsupervised short answer grading using spreading activation over an associative network of concepts. Canadian Journal of Information and Library Science, 38(4), 287-303.
  • Katsaris, I., & Vidakis, N. (2021). Adaptive e-learning systems through learning styles: a review of the literature. Advances in Mobile Learning Educational Research, 1(2), 124-145. https://doi.org/10.25082/AMLER.2021.02.007
  • Kurbanoğlu, N.I., & Olcaytürk, M. (2023). Investigation of the exam question types attitude scale for secondary school students: development, validity, and reliability. Sakarya University Journal of Education, 13(2), 191-206. https://doi.org/10.19126/suje.1187470
  • Leacock, C., & Chodorow, M. (2003). C-rater: Automatic scoring of short-answer questions. Computers and the Humanities, 37, 389-405. https://doi.org/10.1023/A:1025779619903
  • Li, X., Li, X., Chen, S., Ma, S., & Xie, F. (2022). Neural-based automatic scoring model for Chinese-English interpretation with a multi-indicator assessment. Connection Science, 34(1), 1638-1653. https://doi.org/10.1080/09540091.2022.2078279
  • Lubis, F.F., Putri, A., Waskita, D., Sulistyaningtyas, T., Arman, A.A., & Rosmansyah, Y. (2021). Automatic short-answer grading using semantic similarity based on word embedding. International Journal of Technology, 12(3), 571-581. https://doi.org/10.14716/ijtech.v12i3.4651
  • Mardini, G.I.D., Quintero, M.C.G., Viloria, N.C.A., Percybrooks, B.W.S., Robles, N.H.S., & Villalba, R.K. (2024). A deep-learning-based grading system (ASAG) for reading comprehension assessment by using aphorisms as open-answer-questions. Education and Information Technologies, 29(4), 4565-4590. https://doi.org/10.1007/s10639-023-11890-7
  • Meurers, D., Ziai, R., Ott, N., & Kopp, J. (2011). Evaluating answers to reading comprehension questions in context: Results for German and the role of information structure. In S. Padó, & S. Thater (Eds.), Proceedings of the TextInfer 2011 workshop on textual entailment (pp. 1–9). Association for Computational Linguistics. https://aclanthology.org/W11-2401/
  • Mohler, M., & Mihalcea, R. (2009). Text-to-text semantic similarity for automatic short answer grading. In A. Lascarides, C. Gardent, & J. Nivre (Eds.), Proceedings of the 12th Conference of the European chapter of the association for computational linguistics (pp. 567-575). Association for Computational Linguistics. https://aclanthology.org/E09-1065.pdf
  • Nael, O., ElManyalawy, Y., & Sharaf, N. (2022). AraScore: a deep learning-based system for Arabic short answer scoring. Array, 13, Article 100109. https://doi.org/10.1016/j.array.2021.100109
  • Nath, S., Parsaeifard, B., & Werlen, E. (2023, August 22-26). Automatic short answer grading using BERT on German datasets [Paper presentation]. 20th Biennial EARLI Conference, Thessaloniki, Greece.
  • Noyes, K., McKay, R.L., Neumann, M., Haudek, K.C., & Cooper, M.M. (2020). Developing computer resources to automate analysis of students’ explanations of London dispersion forces. Journal of Chemical Education, 97(11), 3923-3936. https://doi.org/10.1021/acs.jchemed.0c00445
  • Padó, U. (2016). Get semantic with me! the usefulness of different feature types for short-answer grading. In Y. Matsumoto, & R. Prasad (Eds.), Proceedings of COLING 2016, the 26th international conference on computational linguistics: Technical Papers (pp. 2186-2195). The COLING 2016 Organizing Committee. https://aclanthology.org/C16-1206/
  • Page, E.B. (1967). Grading essays by computer: Progress report. Proceedings of the Invitational Conference on Testing Problems, 87-100.
  • Ramineni, C., & Williamson, D.M. (2013). Automatic essay scoring: psychometric guidelines and practices. Assessing Writing, 18(1), 25-39. https://doi.org/10.1016/j.asw.2012.10.004
  • Riyanto, S., Imas, S.S., Djatna, T., & Atikah, T.D. (2023). Comparative analysis using various performance metrics in imbalanced data for multi-class text classification. International Journal of Advanced Computer Science and Applications, 14(6). https://doi.org/10.14569/IJACSA.2023.01406116
  • Salim, H.R., De, C., Pratamaputra, N.D., & Suhartono, D. (2022). Indonesian automatic short answer grading system. Bulletin of Electrical Engineering and Informatics, 11(3), 1586-1603. https://doi.org/10.11591/eei.v11i3.3531
  • Saunders, D.R., Bex, P.J., Rose, D.J., & Woods, R.L. (2014). Measuring information acquisition from sensory input using automatic scoring of natural-language descriptions. PLoS ONE, 9(4), Article e93251. https://doi.org/10.1371/journal.pone.0093251
  • Sawatzki, J., Schlippe, T., & Benner-Wickner, M. (2021). Deep learning techniques for automatic short answer grading: predicting scores for English and German answers. In E. Cheng, R.B. Koul, T. Wang, & X. Yu (Eds.), Proceedings of 2021 2nd international conference on artificial intelligence in education technology (pp. 65-75). Springer. https://doi.org/10.1007/978-981-16-7527-0_5
  • Sayeed, M.A., & Gupta, D. (2022). Automate descriptive answer grading using reference based models. In Proceedings of 2022 OITS international conference on information technology (pp. 262-267). The Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/OCIT56763.2022.00057
  • Schleifer, A.G., Klebanov, B.B., Ariely, M., & Alexandron, G. (2023). Transformer-based Hebrew NLP models for short answer scoring in biology. In Proceedings of the 18th workshop on innovative use of NLP for building educational applications (pp. 550-555). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.bea-1.46
  • Seker, A., Bandel, E., Bareket, D., Brusilovsky, I., Greenfeld, R., & Tsarfaty, R. (2022). AlephBERT: Language model pre-training and evaluation from sub-word to sentence level. In Proceedings of the 60th annual meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 46-56). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.4
  • Siddiqi, R., Harrison, C.J., & Siddiqi, R. (2010). Improving teaching and learning through automatic short-answer marking. IEEE Transactions on Learning Technologies, 3(3), 237-249. https://doi.org/10.1109/TLT.2010.4
  • Sung, C., Dhamecha, T., Saha, S., Ma, T., Reddy, V., & Arora, R. (2019, November). Pre-training BERT on domain resources for short answer grading. In K. Inui, J. Jiang, V. Ng, & X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (pp. 6071-6075). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1628
  • Şenel, S., & Şenel, H.C. (2021). Remote assessment in higher education during COVID-19 pandemic. International Journal of Assessment Tools in Education, 8(2), 181-199. https://doi.org/10.21449/ijate.820140
  • Tulu, C.N., Özkaya, O., & Orhan, U. (2021). Automatic short answer grading with SemSpace sense vectors and MaLSTM. IEEE Access, 9, 19270-19280. https://doi.org/10.1109/ACCESS.2021.3054346
  • Uto, M., & Uchida, Y. (2020). Automatic short-answer grading using deep neural networks and item response theory. In I.I. Bittencourt, M. Cukurova, K. Muldner, R. Luckin, & E. Millan (Eds.), Proceedings of the 21st International Conference on Artificial Intelligence in Education (pp. 334-339). Springer International Publishing. https://doi.org/10.1007/978-3-030-52240-7_61
  • Uyar, A.C., & Büyükahıska, D. (2025). Artificial intelligence as an automated essay scoring tool: a focus on ChatGPT. International Journal of Assessment Tools in Education, 12(1), 20-32. https://doi.org/10.21449/ijate.1517994
  • Uysal, I., & Doğan, N. (2021). How reliable is it to automatically score open-ended items? An application in the Turkish language. Journal of Measurement and Evaluation in Education and Psychology, 12(1), 28-53.
  • Westera, W., Dascalu, M., Kurvers, H., Ruseti, S., & Trausan-Matu, S. (2018). Automatic essay scoring in applied games: Reducing the teacher bandwidth problem in online training. Computers & Education, 123, 212-224. https://doi.org/10.1016/j.compedu.2018.05.010
  • Xie, M.K., Xiao, J., Liu, H.Z., Niu, G., Sugiyama, M., & Huang, S.J. (2023). Class-distribution-aware pseudo-labeling for semi-supervised multi-label learning. Advances in Neural Information Processing Systems, 36, 25731-25747. https://doi.org/10.48550/arXiv.2305.02795
  • Yang, X., Huang, J.Y., Zhou, W., & Chen, M. (2022). Parameter-efficient tuning with special token adaptation. arXiv. https://doi.org/10.48550/arXiv.2210.04382
  • Yıldırım, O., & Demir, S.B. (2022). Inside the black box: Do teachers practice assessment as learning?. International Journal of Assessment Tools in Education, 9(Special Issue), 46-71. https://doi.org/10.21449/ijate.1132923
  • Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76(2), 280-303. https://doi.org/10.1177/0013164415590022
  • Zesch, T., Horbach, A., & Zehner, F. (2023). To score or not to score: Factors influencing performance and feasibility of automatic content scoring of text responses. Educational Measurement: Issues and Practice, 42(1), 44-58. https://doi.org/10.1111/emip.12544
  • Zhang, L., & Copus, B. (2023). A Study of Compressed Language Models in Social Media Domain. The International FLAIRS Conference Proceedings, 36(1). https://doi.org/10.32473/flairs.36.133056
  • Zhang, M., Johnson, M., & Ruan, C. (2024). Investigating sampling impacts on an LLM-based AI scoring approach: Prediction accuracy and fairness. Journal of Measurement and Evaluation in Education and Psychology, 15(Special Issue), 348-360. https://doi.org/10.21031/epod.1561580
  • Zhu, X., Wu, H., & Zhang, L. (2022). Automatic short-answer grading via BERT-based deep neural networks. IEEE Transactions on Learning Technologies, 15(3), 364-375. https://doi.org/10.1109/tlt.2022.3175537
  • Zimmerman, W.A., Kang, H.B., Kim, K., Gao, M., Johnson, G., Clariana, R., & Zhang, F. (2018). Computer-automatic approach for scoring short essays in an introductory statistics course. Journal of Statistics Education, 26(1), 40-47. https://doi.org/10.1080/10691898.2018.1443047

Details

Primary Language English
Subjects Classroom Measurement Practices
Journal Section Articles
Authors

Abdulkadir Kara 0000-0003-3255-1408

Zeynep Avinç Kara 0000-0002-8309-3876

Serkan Yıldırım 0000-0002-8277-5963

Early Pub Date October 1, 2025
Publication Date October 4, 2025
Submission Date April 30, 2025
Acceptance Date August 28, 2025
Published in Issue Year 2025 Volume: 12 Issue: 4

Cite

APA Kara, A., Avinç Kara, Z., & Yıldırım, S. (2025). Answer-based and reference-based BERT models for automatic scoring of Turkish short answers: The decisive role of task complexity. International Journal of Assessment Tools in Education, 12(4), 905-925.
