Research Article

Answer-based and reference-based BERT models for automatic scoring of Turkish short answers: The decisive role of task complexity

Volume: 12 Number: 4 December 5, 2025


Abstract

In measurement and evaluation processes, natural language responses are often avoided due to time, workload, and reliability concerns. However, the growing body of work on automatic short-answer grading means such responses can now be scored more quickly and reliably. This study aims to build models for predicting automatic short-answer scores using the pre-trained BERT deep learning language model and to evaluate their effectiveness. For this purpose, two score prediction models were created: an answer-based approach that aligns student answers with expert judgements, and a reference-based approach that matches student answers with reference answers. The dataset comprises answers from 246 Physics department students to four physics questions representing varying levels of cognitive complexity. The performance of the models was evaluated using Cohen's Kappa for statistical comparison of agreement with expert scores. Our findings reveal a clear interaction between model architecture and task complexity. The answer-based model was unequivocally superior for the most complex, multi-class task, effectively capturing diverse, nuanced responses. Conversely, the reference-based model demonstrated a statistically significant advantage for a well-defined, medium-complexity binary task. This study concludes that the optimal model for ASAG in Turkish is contingent on the cognitive demands of the assessment task, suggesting that a one-size-fits-all solution may not be the most effective approach. This provides a critical framework for practitioners, demonstrating not only that effective models are feasible for complex languages, but that their selection must be guided by task complexity.
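The abstract's agreement measure, Cohen's Kappa, corrects raw percent agreement for the agreement expected by chance. The following is a minimal illustrative sketch (not the authors' code) of how a model's predicted scores could be compared against expert scores on a binary task; the score lists are hypothetical examples.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters score identically.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters scored independently,
    # using each rater's marginal category frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    pe = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical expert scores vs. model predictions (binary task)
expert = [1, 1, 0, 1, 0, 0, 1, 0]
model  = [1, 1, 0, 1, 0, 1, 1, 0]
print(cohens_kappa(expert, model))  # prints 0.75
```

Here raw agreement is 7/8 = 0.875, but chance agreement is 0.5, so kappa drops to 0.75; this chance correction is why kappa is preferred over simple accuracy when comparing scoring models with expert judgements.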

Keywords

Supporting Institution

Ataturk University, Institute of Educational Sciences

Ethical Statement

An existing, openly shared dataset was used in this research. Since the research does not contain any elements that violate ethical principles, it falls within the scope of research that does not require ethics committee approval.

Thanks

We thank Cinar et al. (2020) for sharing the physics dataset as an open data source.

References

  1. Abdul-Mageed, M., Elmadany, A., & Nagoudi, E.M.B. (2020). ARBERT & MARBERT: Deep bidirectional transformers for Arabic. arXiv. https://doi.org/10.48550/arXiv.2101.01785
  2. Abdul-Salam, M., El-Fatah, M.A., & Hassan, N.F. (2022). Automatic grading for Arabic short answer questions using optimized deep learning model. PLoS ONE, 17(8), Article e0272269. https://doi.org/10.1371/journal.pone.0272269
  3. Akila Devi, T.R., Javubar Sathick, K., Abdul Azeez Khan, A., & Arun Raj, L. (2023). Novel framework for improving the correctness of reference answers to enhance results of ASAG systems. SN Computer Science, 4(4), Article 415. https://doi.org/10.1007/s42979-023-01682-8
  4. Amur, Z.H., Hooi, Y.K., & Soomro, G.M. (2022). Automatic short answer grading (ASAG) using attention-based deep learning model. In 2022 International Conference on Digital Transformation and Intelligence (pp. 1-7). Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/ICDI57181.2022.10007187
  5. Badry, R.M., Ali, M., Rslan, E., & Kaseb, M.R. (2023). Automatic Arabic grading system for short answer questions. IEEE Access, 11, 39457-39465. https://doi.org/10.1109/ACCESS.2023.3267407
  6. Benli, I., & İsmailova, R. (2018). Use of open-ended questions in measurement and evaluation methods in distance education. International Technology and Education Journal, 2(1), 1-8.
  7. Burrows, S., Gurevych, I., & Stein, B. (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25, 60-117. https://doi.org/10.1007/s40593-014-0026-8
  8. Chan, S., Sathyamurthy, M., Inoue, C., Bax, M., Jones, J., & Oyekan, J. (2024). Integrating metadiscourse analysis with transformer-based models for enhancing construct representation and discourse competence assessment in L2 writing: A systemic multidisciplinary approach. Journal of Measurement and Evaluation in Education and Psychology, 15(Special Issue), 318-347. https://doi.org/10.21031/epod.1531269

Details

Primary Language

English

Subjects

Classroom Measurement Practices

Journal Section

Research Article

Early Pub Date

October 1, 2025

Publication Date

December 5, 2025

Submission Date

April 30, 2025

Acceptance Date

August 28, 2025

Published in Issue

Year 2025 Volume: 12 Number: 4

APA
Kara, A., Avinç Kara, Z., & Yıldırım, S. (2025). Answer-based and reference-based BERT models for automatic scoring of Turkish short answers: The decisive role of task complexity. International Journal of Assessment Tools in Education, 12(4), 905-925. https://doi.org/10.21449/ijate.1687429
