A systematic testbed for evaluating emotion classification in large language models

Seda Nur Altun; Murat Dörterler

Abstract

Large language models (LLMs) have opened new opportunities in natural language processing (NLP) for complex tasks such as emotion classification. However, effective emotion analysis with LLMs requires more than choosing a ready-made model: it also depends on purpose-built prompt structures, a tokeniser that matches the model, careful formatting of input and output data, and controlled management of the generation process. This paper presents a technically detailed, reproducible framework for zero-shot and few-shot emotion classification with generative LLMs. The aim is not to benchmark a particular model but to give researchers a comprehensive guide to the components needed to build an LLM-based emotion recognition system from first principles. Using the Meta-LLaMA3 8B Instruct model and the DailyDialog dataset, the study demonstrates that task-specific prompt engineering, vocabulary-compatible tokenisation strategies, logit-level output constraint mechanisms and structured output normalisation can enable accurate and interpretable emotion classification, even in settings with few or no labelled examples. The paper is intended as a practical, adaptable resource for building LLM infrastructures that are context-sensitive, robust to class imbalance and suited to flexible, task-oriented applications.
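As an illustration only, the three mechanisms named in the abstract (prompt construction, logit-level output constraint and output normalisation) can be sketched in a few lines of Python. This is not the authors' implementation. The label set is assumed to be the seven DailyDialog emotion categories, and the alias map, function names and toy logits are hypothetical. For simplicity each label is treated as a single token, whereas a real tokeniser may split a label into several subword tokens.

```python
import math
import re

# Assumed target label set: the seven DailyDialog emotion categories.
LABELS = ["neutral", "anger", "disgust", "fear", "happiness", "sadness", "surprise"]

# Hypothetical synonym map, used only for this illustration.
ALIASES = {"no emotion": "neutral", "happy": "happiness", "joy": "happiness",
           "sad": "sadness", "angry": "anger", "surprised": "surprise"}


def build_prompt(utterance, examples=()):
    """Build a zero-shot prompt, or a few-shot one when examples are given."""
    lines = ["Classify the emotion of the utterance.",
             "Answer with exactly one of: " + ", ".join(LABELS) + "."]
    for text, label in examples:
        lines.append(f"Utterance: {text}\nEmotion: {label}")
    lines.append(f"Utterance: {utterance}\nEmotion:")
    return "\n".join(lines)


def constrain_logits(logits, allowed):
    """Set the score of every token outside `allowed` to -inf."""
    return {tok: (score if tok in allowed else -math.inf)
            for tok, score in logits.items()}


def normalise_label(raw, fallback="neutral"):
    """Map free-form model output onto one canonical label."""
    cleaned = re.sub(r"[^a-z ]", "", raw.lower()).strip()
    cleaned = ALIASES.get(cleaned, cleaned)
    return cleaned if cleaned in LABELS else fallback


# Toy next-token scores standing in for a real model's output logits.
toy_logits = {"The": 3.1, "happiness": 2.4, "sadness": 0.7, "I": 2.9}
masked = constrain_logits(toy_logits, set(LABELS))
best = max(masked, key=masked.get)  # highest-scoring token among the labels
print(best)
print(normalise_label(" Happy!\n"))
```

Without the mask, the highest-scoring token in the toy example would be a non-label word. Masking restricts the choice to the label vocabulary, and the normalisation step acts as a safety net for any free-form output that still slips through.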

Project Number

This study was not supported by any specific research project.

Details

Primary Language

English

Subjects

Natural Language Processing

Journal Section

Research Article

Authors

Seda Nur Altun*
0000-0001-9717-0759
Türkiye

Murat Dörterler
0000-0003-1127-515X
Türkiye

Publication Date

August 31, 2025

Submission Date

May 20, 2025

Acceptance Date

June 23, 2025

Published in Issue

Year: 2025, Number: 013

IZ

https://izlik.org/JA98MA47ZB

Cite

APA

Altun, S. N., & Dörterler, M. (2025). A systematic testbed for evaluating emotion classification in large language models. Journal of Scientific Reports-B, 013, 1-19. https://izlik.org/JA98MA47ZB
