TY - JOUR
T1 - Smart Graders? Untersuchung des Potenzials von Sprachmodellen in der Fremdsprachenevaluation
TT - Smart Graders? Exploring the Potential of Language Models in Foreign Language Evaluation
AU - Başaran, Bora
AU - Sarkiler, Yaşar Ali
PY - 2025
DA - November
Y2 - 2025
DO - 10.37583/diyalog.1824385
JF - Diyalog Interkulturelle Zeitschrift Für Germanistik
JO - DİYALOG
PB - Germanistler Derneği
WT - DergiPark
SN - 2148-1482
SP - 501
EP - 525
VL - 0
IS - Sonderausgabe: Germanistik im 21. Jahrhundert- Band I
LA - de
AB - Bewertungen sind ein integraler Bestandteil des Bildungssystems und erfordern ihrer Natur nach häufig einen hohen Zeitaufwand, da Genauigkeit und Konsistenz erwartet werden. Diese Studie untersucht, inwieweit große Sprachmodelle (LLMs) die Leistungsbewertung im Bereich des Fremdsprachenunterrichts unterstützen können. Grundlage sind mehrere Deutsch-Prüfungen, die sowohl von Lehrkräften als auch von LLMs bewertet wurden. Ziel ist es, KI-gestützte Bewertungen mit traditionellen Bewertungen qualitativ zu vergleichen. Die Analyse konzentriert sich auf Aspekte wie Genauigkeit, Effizienz und Konsistenz und berücksichtigt zudem die Komplexität der Aufgaben sowie die Art der Antworten. Darüber hinaus bietet die Studie eine differenzierte Betrachtung darüber, in welchen Bereichen KI-Leistungen die Arbeitsbelastung von Lehrkräften verringern kann, ohne die pädagogische Qualität der Bewertung zu beeinträchtigen. Abschließend werden praxisnahe Empfehlungen gegeben, wie KI sinnvoll und nachhaltig in den Unterricht integriert werden kann. Durch den Vergleich von KI-durchgeführten Bewertungen mit Menschlichen, identifiziert die Studie zentrale Bereiche, in denen große Sprachmodelle (LLMs) entweder erfolgreich sind oder nicht. Die technischen und ethischen Grenzen des Einsatzes von KI als eigenständiges Bewertungssystem werden auch thematisiert.
Durch die vielsichtige Darstellung sowohl des revolutionären Potenzials von KI als auch der damit verbundenen Risiken leistet diese Studie einen Beitrag zur zunehmend kontrovers geführten Debatte über die Integration von LLMs in die pädagogische Praxis.
KW - Benotungsautomatisierung
KW - Bewertung
KW - Deutsch als Fremdsprache
KW - KI in der Lehre
KW - Sprachmodelle
N2 - Assessments function as part of the fabric of education, and by their very nature, are often time-intensive because of the expectation of accuracy and consistency. This study aims to explore how large language models (LLMs) can mediate assessment in the space of a foreign language based on several German exam papers that were graded and assessed by both LLMs and teachers, while ultimately comparing AI assessments to traditional assessments using a qualitative approach. The analyses focused on aspects of accuracy, efficiency and consistency, while also noting the 'complexity' of the tasks and response types. In addition, the study provides a detailed overview of how AI could help reduce teacher workload without compromising the pedagogical quality of assessment and offers practical suggestions for the meaningful and sustainable integration of AI into the classroom. By comparing AI-output to human judgment, the research determines principal areas of LLM failure or success. The technical and moral boundaries of using AI as a standalone assessor are also covered, especially where subtle or linguistically advanced judgments are required. By adding a balanced viewpoint that emphasizes both the potentially revolutionary ability of AI and the wariness in its application, this study adds to the increasingly heated debate regarding the incorporation of LLMs into pedagogic practice.
CR - Adiguzel, Tufan / Kaya, Mehmet Haldun / Cansu, Fatih Kürşat (2023): Revolutionizing education with AI: Exploring the transformative potential of ChatGPT. Contemporary Educational Technology, 15(3), ep429.
https://doi.org/10.30935/cedtech/13152.
CR - Ahmad, Sayed Fayaz / Rahmat, Mohd. Khairil / Mubarik, Muhammad Shujaat (2021): Artificial intelligence and its role in education. Sustainability, 13(22), 12902. https://doi.org/10.3390/su132212902.
CR - Aldosari, Share Aiyed M. (2020): The future of higher education in the light of artificial intelligence transformations. International Journal of Higher Education, 9(3), 145-151. https://doi.org/10.5430/ijhe.v9n3p145.
CR - Bachman, Lyle F. (1990): Fundamental considerations in language testing. Oxford University Press.
CR - Başaran, Bora (2025): The cultural dance of words: The transforming value of language teaching in the age of AI. Çevikkilıç, Deniz Beste (Hg.): International studies in educational sciences (Chapter 1). Serüven Yayınevi.
CR - Boud, David (2000): Sustainable assessment: Rethinking assessment for the learning society. Studies in Continuing Education, 22(2), 151–167. https://doi.org/10.1080/713695728.
CR - Cerf, Vinton G. (2023): Large Language Models. Communications of the ACM, 66, 7 - 7. https://doi.org/10.1145/3606337.
CR - Chen, Lijia / Chen, Pingping / Lin, Zhijian (2020): Artificial intelligence in education: A review. IEEE Access, 8, 75264–75278. https://doi.org/10.1109/ACCESS.2020.2988510.
CR - Chen, Yong / Chen, Hongpeng / Su, Songzhi (2023): Fine-tuning large language models in education. 2023 13th International Conference on Information Technology in Medicine and Education (ITME). IEEE, 718–723. https://doi.org/10.1109/itme60234.2023.00148.
CR - Cohen, Jacob (1960): A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104.
CR - Dunbar, Stephen B. / Koretz, Daniel M. / Hoover, H.D. (1991): Quality control in the development and use of performance assessments. Applied Measurement in Education, 4(4), 289–303. https://doi.org/10.1207/S15324818AME0404_3.
CR - Durall, Eva / Kapros, Evangelos (2020): Co-design for a competency self-assessment chatbot and survey in science education. International conference on human-computer interaction. Springer, 13-24. https://doi.org/10.1007/978-3-030-50506-6_2.
CR - Efremova, Nadezhda / Shvedova, Svetlana / Huseynova, Anastasia (2019): The influence of assessment on learning motivation. SHS Web of Conferences, 70, 04003. https://doi.org/10.1051/shsconf/20197004003.
CR - Endres, Christoph / Ibisch, Andrea (2025): Why one size doesn’t fit all – Differenzierte Absicherung von LLMs. Datenschutz Datensich 49, 220–224. https://doi.org/10.1007/s11623-025-2075-6.
CR - Evenddy, Sutrisno Sadji (2024): Investigating AI's automated feedback in English language learning. FLIP: Foreign Language Instruction Probe, 3(1), 76–87. https://doi.org/10.54213/flip.v3i1.401.
CR - Felder, Ekkehard / Kückelhaus, Marcel (2025): Das definierende Sprachmodell (LLM): Anthropomorphisierung in der Mensch-Maschine-Interaktion. Zeitschrift für Literaturwissenschaft und Linguistik, 431-448. https://doi.org/10.1007/s41244-025-00380-7.
CR - Göçer, Ali (2024): Öğrencilerin dinleme ve konuşma becerilerinin uygulamalı sınavlarla ölçülüp değerlendirilmesine yönelik Türkçe öğretmenlerinin görüşleri. International Journal of Language Academy, 12(3), 120–144.
CR - Gunawan, Kadek Dwi Hendratma / Liliasari, Liliasari / Kaniawati, Ida / Setiawan, Wawan (2021): Implementation of competency enhancement program for science teachers assisted by artificial intelligence in designing HOTS-based integrated science learning. Journal Penelitian dan Pembelajaran IPA, 7(1), 55-65. https://doi.org/10.30870/jppi.v7i1.8655.
CR - Hao, Jiangang / Davier, Alina A. / Yaneva, Victoria / Lottridge, Susan / Davier, Matthias / Harris, Deborah J. (2024): Transforming assessment: The impacts and implications of large language models and generative AI. Educational Measurement: Issues and Practice, 16-29. https://doi.org/10.1111/emip.12602.
CR - Hu, Jingjing (2021): Teaching evaluation system by use of machine learning and artificial intelligence Methods. International Journal of Emerging Technologies in Learning, 16(5), 87-101. https://doi.org/10.3991/ijet.v16i05.20299.
CR - Igaki, Takahiro / Kitaguchi, Daichi / Matsuzaki, Hiroki / Nakajima, Kei / Kojima, Shigehiro / Hasegawa, Hiro / Takeshita, Nobuyoshi / Kinugasa, Yusuke / Ito, Masaaki (2023): Automatic surgical skill assessment system based on concordance of standardized surgical field development using artificial intelligence. JAMA Surgery, e231131. https://doi.org/10.1001/jamasurg.2023.1131.
CR - IU Internationale Hochschule (2024): Lernreport 2024: Was treibt Menschen in Deutschland zum Lernen an? https://www.iu.de/forschung/studien/lernreport-2024/ (Zugriff am 05.05.2025).
CR - Jaiswal, Akanksha / Arun, C. Joe (2021): Potential of artificial intelligence for transformation of the education system in India. International Journal of Education and Development Using Information and Communication Technology, 17(1), 142-158.
CR - Kafadar, Tuğba (2022): Oyunlaştırmanın eğitimdeki yeri. Kafadar, Tuğba / Can, Asena Ayvaz (Hg.), Eğitimde oyunlaştırma. Nobel Akademik Yayıncılık, 1–16.
CR - Kankanamge, Dinesha / Wijiweera, C. / Ong, Z. / Preda, T. / Carney, T. / Wilson, M. / Preda, V. (2025): Artificial intelligence based assessment of minimally invasive surgical skills using standardised objective metrics – A narrative review. The American Journal of Surgery, 241, 116074. https://doi.org/10.1016/j.amjsurg.2024.116074.
CR - Lucena Sangreman Aldeman, Nayze / Sá Urtiga Aita, Keylla Maria de / Ponte Machado, Vinícius / Demes da Mata Sousa, Luiz Claudio / Gilberto Borges Coelho, Antonio / Socorro da Silva, Adalberto / Silva Mendes, Ana Paula da / Oliveira Neres, Francisco Jair de / Jamil Hadad do Monte, Semíramis (2021): Smartpath (k): A platform for teaching glomerulopathies using machine learning. BMC Medical Education, 21(1), 248.
https://doi.org/10.1186/s12909-021-02680-1.
CR - Luckin, Rose / Holmes, Wayne (2016): Intelligence unleashed: An argument for AI in education. Pearson.
CR - Maghsudi, Setareh / Lan, Andrew / Xu, Jie / Schaar, Michaela (2021): Personalized education in the artificial intelligence era: What to expect next. IEEE Signal Processing Magazine, 38(2), 37–50. https://doi.org/10.1109/MSP.2021.3055032.
CR - Makridakis, Spyros / Petropoulos, Fotios / Kang, Yanfei (2023): Large Language Models: Their Success and Impact. Forecasting, 5(3), 536-549. https://doi.org/10.3390/forecast5030030.
CR - Mede, Enisa / Atay, Derin (2017): English language teachers’ assessment literacy: The Turkish context. Dil Dergisi, 168(1), 43–60.
CR - Millî Eğitim Bakanlığı (2023): Yazılı ve Uygulamalı Sınavlar Yönergesi. https://odsgm.meb.gov.tr/meb_iys_dosyalar/2023_10/12115933_MEB_yazili_ve_uygulamali_sinavlar_yonergesi.pdf (Zugriff am 05.05.2025).
CR - Minaee, Shervin / Mikolov, Tomas / Nikzad, Narjes / Chenaghlu, Meysam / Socher, Richard / Amatriain, Xavier / Gao, Jianfeng (2024): Large language models: A survey. https://doi.org/10.48550/arXiv.2402.06196.
CR - Mohan, G. Bharathi / Kumar, R. Prasanna / Krishh, P. Vishal / Keerthinathan, A. / Lavanya, G. / Meghana, Meka Kavya Uma / Sulthana, Sheba / Doss, Srinath (2024): An analysis of large language models: their impact and potential applications. Knowl. Inf. Syst., 66, 5047-5070. https://doi.org/10.1007/s10115-024-02120-8.
CR - Norcini, John / Anderson, Brownell / Bollela, Valdes / Burch, Vanessa / Costa, Manuel João / Duvivier, Robbert / Galbraith, Robert / Hays, Richard / Kent, Athol / Perrott, Vanessa / Roberts, Trudie (2011): Criteria for good assessment: Consensus statement and recommendations from the Ottawa 2010 Conference. Medical Teacher, 33, 206–214. https://doi.org/10.3109/0142159X.2011.551559.
CR - Popenici, Stefan A. D. / Kerr, Sharon (2017): Exploring the impact of artificial intelligence on teaching and learning in higher education.
Research and Practice in Technology Enhanced Learning, 12(1): 22. Epub 2017 Nov 23. PMID: 30595727; PMCID: PMC6294271. https://doi.org/10.1186/s41039-017-0062-8.
CR - Sarker, Ikbal H. (2022): AI-based modeling: Techniques, applications and research issues towards automation, intelligent and smart systems. SN Computer Science, 3(2), 158. https://doi.org/10.1007/s42979-022-01043-x.
CR - Shao, Yueyang / Liu, Qimeng / Dong, Yaoyao / Liu, Jian (2024): Perceived formative assessment practices in homework and creativity competence: The mediating effects of self-confidence in learning and intrinsic motivation. Studies in Educational Evaluation, 80, 101376. https://doi.org/10.1016/j.stueduc.2024.101376.
CR - Soliman, Hassan / Kravcik, Milos / Neumann, Alexander Tobias / Yin, Yue / Pengel, Norbert / Haag, Maike / Wollersheim, Heinz-Werner (2024): Generative KI zur Lernenbegleitung in den Bildungswissenschaften: Implementierung eines LLM-basierten Chatbots im Lehramtsstudium. Proceedings of DELFI 2024. Gesellschaft für Informatik e.V. 171-177. https://doi.org/10.18420/delfi2024_15.
CR - Sullivan, Gail M. (2011): A primer on the validity of assessment instruments. Journal of Graduate Medical Education, 3(2), 119–120. https://doi.org/10.4300/JGME-D-11-00075.1.
CR - Tanır, Ahmet (2023): YouTube-assisted listening instruction (YALI): A study of listening comprehension and listening anxiety of university students of German as a foreign language. Research on Education and Psychology (REP), 7(Special Issue 2), 270-299.
CR - Tanrıkulu, Lokman / Üstün, Bilal (2020): Almanca öğretmenliği yüksek lisans öğrencilerinin lisansüstü eğitim yapma nedenlerine ilişkin nitel bir çalışma. International Journal of Language Academy, 8(5), 104–114. https://doi.org/10.29228/ijla.47061.
CR - Thirunavukarasu, Arun James u.a. (2023): Large language models in medicine. Nature Medicine, 29(9), 1930–1940. https://doi.org/10.1038/s41591-023-02448-8.
CR - Tsagari, Dina (2011): Investigating the ‘assessment literacy’ of EFL state school teachers in Greece. Tsagari, Dina & Csépes, Ildikó (Hg.), Classroom-based language assessment. Peter Lang, 169–190.
CR - Üstün, Ebru (2025): Kursplanung, Materialentwicklung und Kompetenzaufbau bei angehenden Fremdsprachenlehrkräften in der Türkei: Eine qualitative Fallstudie. Diyalog Interkulturelle Zeitschrift Für Germanistik, 13(1), 193-215. https://doi.org/10.37583/diyalog.1714784.
CR - Üstün, Ebru / Üstün, Bilal / Karataş, Fatih (2024): K.I.-Literacy von Studierenden im Grundstudium. RumeliDE Dil ve Edebiyat Araştırmaları Dergisi, (42), 404-415. https://doi.org/10.5281/zenodo.13980839.
CR - Wang, Shan / Wang, Fang / Zhu, Zhen / Wang, Jingxuan / Tran, Tam / Du, Zhao (2024): Artificial intelligence in education: A systematic literature review. Expert Systems with Applications, 252, 124167. https://doi.org/10.1016/j.eswa.2024.124167.
CR - Wiliam, Dylan (2011): What is assessment for learning? Studies in Educational Evaluation, 37(1), 3–14. https://doi.org/10.1016/j.stueduc.2011.03.001.
CR - Winston, Patrick Henry (1992): Artificial intelligence (3rd ed.). Addison-Wesley Longman Publishing Co., Inc.
CR - Yamtinah, Sri / Wiyarsi, Antuni / Widarti, Hayuni Retno / Shidiq, Ari Syahidul / Ramadhani, Dimas Gilang (2025): Fine-tuning AI models for enhanced consistency and precision in chemistry educational assessments. Computers and Education: Artificial Intelligence, 8, 100399. https://doi.org/10.1016/j.caeai.2025.100399.
CR - Yin, Shukang / Fu, Chaoyou / Zhao, Sirui / Li, Ke / Sun, Xing / Xu, Tong / Chen, Enhong (2023): A survey on multimodal large language models. National Science Review, 11. https://doi.org/10.1093/nsr/nwae403.
CR - Zafari, Mostafa / Safari Bazargani, Jalal / Sadeghi-Niaraki, Abolghasem / Choi, Soo-Mi (2022): Artificial intelligence applications in K-12 education: A systematic literature review. IEEE Access, PP, 1–1.
https://doi.org/10.1109/ACCESS.2022.3179356.
UR - https://doi.org/10.37583/diyalog.1824385
L1 - https://dergipark.org.tr/tr/download/article-file/5423592
ER -