A systematic testbed for evaluating emotion classification in large language models

Seda Nur Altun; Murat Dörterler

Research Article

Year 2025, Issue: 013, 1 - 19, 31.08.2025

Abstract

The advent of large language models (LLMs) in natural language processing (NLP) has opened new opportunities for tackling intricate tasks such as emotion classification. However, effective emotion analysis with LLMs requires more than choosing a ready-made model: it also requires specially designed prompt structures, alignment of the model with its tokeniser, careful formatting of both input and output data, and controlled management of the generation process. This paper sets out a technically detailed, reproducible framework for zero-shot and few-shot emotion classification using generative LLMs. The objective is not to assess the efficacy of a given model, but to give researchers a comprehensive guide to the components needed to build an LLM-based emotion recognition system from the ground up. Using the Meta-LLaMA3 8B Instruct model and the DailyDialog dataset, the study demonstrates that purpose-built prompt engineering, vocabulary-compatible tokenisation strategies, logit-level output constraint mechanisms and structured output normalisation can enable accurate and interpretable emotion classification, even in settings with few or no labels. The paper is intended as a practical and adaptable resource for constructing LLM infrastructures that are context-sensitive, resilient to class imbalance and suitable for flexible task-oriented applications.
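
To make two of the components named above more concrete, the following is a minimal sketch of logit-level output constraint and structured output normalisation. It is not the authors' implementation: the toy vocabulary, the logit values, the `ALIASES` map and the function names are all hypothetical, and each label is treated as a single token, which a real tokeniser does not guarantee. Only the seven label names are taken from the DailyDialog dataset.

```python
import math

# The seven emotion labels used in DailyDialog.
LABELS = ["no emotion", "anger", "disgust", "fear",
          "happiness", "sadness", "surprise"]

# Hypothetical synonym map for mapping free-form output onto the label set.
ALIASES = {"neutral": "no emotion", "none": "no emotion",
           "joy": "happiness", "happy": "happiness",
           "sad": "sadness", "angry": "anger"}


def constrain_logits(logits, allowed_ids):
    """Mask every position outside allowed_ids with -inf so that only
    label tokens can be selected at the next decoding step."""
    return [x if i in allowed_ids else -math.inf
            for i, x in enumerate(logits)]


def normalise_label(text):
    """Map raw generated text to a canonical label, or None if unknown."""
    cleaned = text.strip().lower().strip(".!\"' ")
    cleaned = ALIASES.get(cleaned, cleaned)
    return cleaned if cleaned in LABELS else None


# Toy next-token distribution over a five-word vocabulary (assumed values).
vocab = ["the", "anger", "happiness", "sadness", "maybe"]
logits = [3.0, 0.5, 1.2, -0.3, 2.5]
allowed = {i for i, tok in enumerate(vocab) if tok in LABELS}

masked = constrain_logits(logits, allowed)
best = max(range(len(masked)), key=masked.__getitem__)
print(vocab[best])                 # the highest-scoring label token
print(normalise_label(" Joy. "))   # free-form output mapped to a label
```

Without the mask, greedy decoding on these toy logits would pick the non-label token "the"; with it, the choice is restricted to the label set. The normalisation step then handles cases where generation is unconstrained and the model answers with a synonym or stray punctuation.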

Keywords

emotion classification, zero-shot learning, few-shot learning, large language models

Project Number

This study was not supported by any specific research project.

References

[1] A. Uçan, “TÜRKÇE HİS ANALİZİNDE OPTİMİZASYON VE ÖN-EĞİTİMLİ MODELLERİN KULLANIMI,” Hacettepe University, Ankara, Turkey, 2020.
[2] E. Akçapınar Sezer et al., “Türkçe bilgisayarlı dil bilimi çalışmalarında his analizi,” tday, no. 70, pp. 193–210, Dec. 2020, doi: 10.32925/tday.2020.48.
[3] B. Pang et al., “Thumbs up?: sentiment classification using machine learning techniques,” in Proceedings of the ACL-02 conference on Empirical methods in natural language processing - EMNLP ’02, Association for Computational Linguistics, 2002, pp. 79–86. doi: 10.3115/1118693.1118704.
[4] J. Wiebe et al., “Annotating Expressions of Opinions and Emotions in Language,” Language Res Eval, vol. 39, no. 2–3, pp. 165–210, May 2005, doi: 10.1007/s10579-005-7880-9.
[5] S. Aman et al., “Identifying Expressions of Emotion in Text,” in Text, Speech and Dialogue, V. Matoušek and P. Mautner, Eds., in Lecture Notes in Computer Science, vol. 4629. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 196–205. doi: 10.1007/978-3-540-74628-7_27.
[6] W. Medhat et al., “Sentiment analysis algorithms and applications: A survey,” Ain Shams Engineering Journal, vol. 5, no. 4, pp. 1093–1113, Dec. 2014, doi: 10.1016/j.asej.2014.04.011.
[7] R. Li et al., “EmoMix: Building an Emotion Lexicon for Compound Emotion Analysis,” in Computational Science – ICCS 2019, J. M. F. Rodrigues, P. J. S. Cardoso, J. Monteiro, R. Lam, V. V. Krzhizhanovskaya, M. H. Lees, J. J. Dongarra, and P. M. A. Sloot, Eds., in Lecture Notes in Computer Science, vol. 11536. Cham: Springer International Publishing, 2019, pp. 353–368. doi: 10.1007/978-3-030-22734-0_26.
[8] M. A. Tocoglu et al., “Emotion Analysis From Turkish Tweets Using Deep Neural Networks,” IEEE Access, vol. 7, pp. 183061–183069, 2019, doi: 10.1109/ACCESS.2019.2960113.
[9] E. Batbaatar et al., “Semantic-Emotion Neural Network for Emotion Recognition From Text,” IEEE Access, vol. 7, pp. 111866–111878, 2019, doi: 10.1109/ACCESS.2019.2934529.
[10] W. Jiao et al., “HiGRU: Hierarchical Gated Recurrent Units for Utterance-level Emotion Recognition,” presented at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2019), Minneapolis, Minnesota: Association for Computational Linguistics, Apr. 2019, pp. 397–406. doi: 10.18653/v1/N19-1037.
[11] Z. Gou et al., “TG-ERC: Utilizing three generation models to handle emotion recognition in conversation tasks,” Expert Systems with Applications, vol. 268, p. 126269, Apr. 2025, doi: 10.1016/j.eswa.2024.126269.
[12] D. Hu et al., “DialogueCRN: Contextual Reasoning Networks for Emotion Recognition in Conversations,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online: Association for Computational Linguistics, 2021, pp. 7042–7052. doi: 10.18653/v1/2021.acl-long.547.
[13] D. Hu et al., “Supervised Adversarial Contrastive Learning for Emotion Recognition in Conversations,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada: Association for Computational Linguistics, 2023, pp. 10835–10852. doi: 10.18653/v1/2023.acl-long.606.
[14] Y. Liu et al., “EmotionIC: emotional inertia and contagion-driven dependency modeling for emotion recognition in conversation,” Sci. China Inf. Sci., vol. 67, no. 8, p. 182103, Aug. 2024, doi: 10.1007/s11432-023-3908-6.
[15] D. Ghosal et al., “DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation,” presented at the Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China: Association for Computational Linguistics, Aug. 2019, pp. 154–164. doi: 10.18653/v1/D19-1015.
[16] W. Shen et al., “Directed Acyclic Graph Network for Conversational Emotion Recognition,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online: Association for Computational Linguistics, 2021, pp. 1551–1560. doi: 10.18653/v1/2021.acl-long.123.
[17] X. Song et al., “Supervised Prototypical Contrastive Learning for Emotion Recognition in Conversation,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, 2022, pp. 5197–5206. doi: 10.18653/v1/2022.emnlp-main.347.
[18] P. Zhong et al., “Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations,” presented at the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China: Association for Computational Linguistics, Oct. 2019, pp. 165–176. doi: 10.18653/v1/D19-1016.
[19] S. Li et al., “Contrast and Generation Make BART a Good Dialogue Emotion Recognizer,” AAAI, vol. 36, no. 10, pp. 11002–11010, Jun. 2022, doi: 10.1609/aaai.v36i10.21348.
[20] L. Zhu et al., “Topic-Driven and Knowledge-Aware Transformer for Dialogue Emotion Detection,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online: Association for Computational Linguistics, 2021, pp. 1571–1582. doi: 10.18653/v1/2021.acl-long.125.
[21] Y. Li et al., “DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset,” presented at the Proceedings of the Eighth International Joint Conference on Natural Language Processing, Taipei, Taiwan: Asian Federation of Natural Language Processing, 2017, pp. 986–995.
[22] V. Kalra et al., “Importance of Text Data Preprocessing & Implementation in RapidMiner,” presented at the First International Conference on Information Technology and Knowledge Management, Jan. 2018, pp. 71–75. doi: 10.15439/2017KM46.
[23] P. Ekman et al., “Universals and cultural differences in the judgments of facial expressions of emotion,” Journal of Personality and Social Psychology, vol. 53, no. 4, pp. 712–717, 1987, doi: 10.1037/0022-3514.53.4.712.
[24] A. Grattafiori et al., “The Llama 3 Herd of Models,” Nov. 23, 2024, arXiv: arXiv:2407.21783. doi: 10.48550/arXiv.2407.21783.

There are 24 citations in total.

Details

Primary Language	English
Subjects	Natural Language Processing
Journal Section	Research Articles
Authors	Seda Nur Altun (ORCID: 0000-0001-9717-0759), Murat Dörterler (ORCID: 0000-0003-1127-515X)
Project Number	This study was not supported by any specific research project.
Publication Date	August 31, 2025
Submission Date	May 20, 2025
Acceptance Date	June 23, 2025
Published in Issue	Year 2025 Issue: 013

Cite

APA	Altun, S. N., & Dörterler, M. (2025). A systematic testbed for evaluating emotion classification in large language models. Journal of Scientific Reports-B(013), 1-19.
AMA	Altun SN, Dörterler M. A systematic testbed for evaluating emotion classification in large language models. Journal of Scientific Reports-B. August 2025;(013):1-19.
Chicago	Altun, Seda Nur, and Murat Dörterler. “A Systematic Testbed for Evaluating Emotion Classification in Large Language Models”. Journal of Scientific Reports-B, no. 013 (August 2025): 1-19.
EndNote	Altun SN, Dörterler M (August 1, 2025) A systematic testbed for evaluating emotion classification in large language models. Journal of Scientific Reports-B 013 1–19.
IEEE	S. N. Altun and M. Dörterler, “A systematic testbed for evaluating emotion classification in large language models”, Journal of Scientific Reports-B, no. 013, pp. 1–19, August 2025.
ISNAD	Altun, Seda Nur - Dörterler, Murat. “A Systematic Testbed for Evaluating Emotion Classification in Large Language Models”. Journal of Scientific Reports-B 013 (August 2025), 1-19.
JAMA	Altun SN, Dörterler M. A systematic testbed for evaluating emotion classification in large language models. Journal of Scientific Reports-B. 2025;(013):1–19.
MLA	Altun, Seda Nur and Murat Dörterler. “A Systematic Testbed for Evaluating Emotion Classification in Large Language Models”. Journal of Scientific Reports-B, no. 013, 2025, pp. 1-19.
Vancouver	Altun SN, Dörterler M. A systematic testbed for evaluating emotion classification in large language models. Journal of Scientific Reports-B. 2025(013):1-19.
