<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.4 20241031//EN"
        "https://jats.nlm.nih.gov/publishing/1.4/JATS-journalpublishing1-4.dtd">
<article article-type="research-article" dtd-version="1.4">
    <front>
        <journal-meta>
            <journal-id>konjes</journal-id>
            <journal-title-group>
                <journal-title>Konya Journal of Engineering Sciences</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2667-8055</issn>
            <publisher>
                <publisher-name>Konya Technical University</publisher-name>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.36306/konjes.1574874</article-id>
            <article-categories>
                <subj-group xml:lang="en">
                    <subject>Electrical Engineering (Other)</subject>
                </subj-group>
                <subj-group xml:lang="tr">
                    <subject>Elektrik Mühendisliği (Diğer)</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>VOICE AND IMAGE BASED EMOTION RECOGNITION WITH DEEP LEARNING</article-title>
            </title-group>
            
            <contrib-group content-type="authors">
                <contrib contrib-type="author">
                    <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-1651-7568</contrib-id>
                    <name>
                        <surname>Karakan</surname>
                        <given-names>Abdil</given-names>
                    </name>
                    <aff>Afyon Kocatepe University</aff>
                </contrib>
            </contrib-group>
                        
            <pub-date pub-type="pub" iso-8601-date="20260301">
                <day>01</day>
                <month>03</month>
                <year>2026</year>
            </pub-date>
            <volume>14</volume>
            <issue>1</issue>
            <fpage>97</fpage>
            <lpage>112</lpage>
                        
            <history>
                <date date-type="received" iso-8601-date="20241028">
                    <day>28</day>
                    <month>10</month>
                    <year>2024</year>
                </date>
                <date date-type="accepted" iso-8601-date="20250916">
                    <day>16</day>
                    <month>09</month>
                    <year>2025</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright © 2026, Konya Journal of Engineering Sciences</copyright-statement>
                <copyright-year>2026</copyright-year>
                <copyright-holder>Konya Journal of Engineering Sciences</copyright-holder>
            </permissions>
            
            <abstract><p>Emotion is a phenomenon that accompanies every moment of an individual's life, and the way an emotional state is expressed can be complex and differs from person to person. Facial expressions and changes in the voice are two principal channels through which emotions are conveyed. In this study, a voice- and image-based emotion recognition system was implemented. Since no Turkish dataset was available for voice-based emotion detection, an original dataset named TR-EmotionSpeech was prepared for this study; likewise, a facial expression dataset named TRFace-40 was developed to recognize visual emotional cues. TR-EmotionSpeech consists of samples taken from 40 different Turkish-speaking people and contains 2000 audio files covering 6 different emotions. TRFace-40 consists of face images of 40 different people captured from different angles. Because the system is intended to perform detection in real time, distortions that a camera may introduce were added to the samples, producing an augmented dataset of 40,000 images; these modifications contributed significantly to the overall recognition accuracy. The audio files were first pre-processed and subjected to feature extraction, and were then classified with Long Short-Term Memory (LSTM) networks, yielding an emotion recognition accuracy of 75.18%. For image-based recognition, the YOLOv5, YOLOv6, YOLOv7 and YOLOv8 architectures were compared, with YOLOv8 achieving the highest accuracy of 97.82%.</p></abstract>
                                                            
            
            <kwd-group>
                <kwd>Deep learning</kwd>
                <kwd>Face recognition</kwd>
                <kwd>Long Short-Term Memory Network</kwd>
                <kwd>Voice recognition</kwd>
                <kwd>YOLO architectures</kwd>
            </kwd-group>
                            
        </article-meta>
    </front>
    <back>
                            <ref-list>
                                    <ref id="ref1">
                        <label>1</label>
                        <mixed-citation publication-type="journal">V.V. Narasimha, R. Saravanakumar, N. Yusuf, R. Pradhan, H. Hamdi, K. A. Saravanan, V. S. Rao, and M. A. Askar, &quot;Enhancing emotion prediction using deep learning and distributed federated systems with SMOTE oversampling technique&quot;, Alexandria Engineering Journal, 108, 498–508, 2024. https://doi.org/10.1016/j.aej.2024.07.081.</mixed-citation>
                    </ref>
                                    <ref id="ref2">
                        <label>2</label>
                        <mixed-citation publication-type="journal">F. G. Eris¸ and E. Akbal, &quot;Enhancing speech emotion recognition through deep learning and handcrafted feature fusion&quot;, Applied Acoustics, 222, 110070, 2024. https://doi.org/10.1016/j.apacoust.2024.110070</mixed-citation>
                    </ref>
                                    <ref id="ref3">
                        <label>3</label>
                        <mixed-citation publication-type="journal">D. Weber, and B. Kostek, &quot;Bimodal deep learning model for subjectively enhanced emotion classification in films&quot;, Information Sciences, 678, 121049, 2024. https://doi.org/10.1016/j.ins.2024.121049</mixed-citation>
                    </ref>
                                    <ref id="ref4">
                        <label>4</label>
                        <mixed-citation publication-type="journal">R. K. Gupta, and R. Sinha, &quot;Deep multi-task learning based detection of correlated mental disorders using audio modality&quot;, Computer Speech &amp; Language, 89, 101710, 2025. https://doi.org/10.1016/j.csl.2024.101710</mixed-citation>
                    </ref>
                                    <ref id="ref5">
                        <label>5</label>
                        <mixed-citation publication-type="journal">A. I. Middya, B. Nag, and S. Roy, &quot;Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities&quot;, Knowledge-Based Systems, 244, 108580, 2022. https://doi.org/10.1016/j.knosys.2022.108580</mixed-citation>
                    </ref>
                                    <ref id="ref6">
                        <label>6</label>
                        <mixed-citation publication-type="journal">L. Zheng, Q. Li, H. Ban and S. Liu, &quot;Speech emotion recognition based on convolution neural network combined with random forest.&quot; 2018 Chinese Control And Decision Conference (CCDC), Shenyang, pp. 4143-4147, 2018. https://doi.org/10.1109/CCDC.2018.8407844</mixed-citation>
                    </ref>
                                    <ref id="ref7">
                        <label>7</label>
                        <mixed-citation publication-type="journal">D. Bitouk, R. Verma, and A. Nenkova &quot;Class-level spectral features for emotion recognition&quot;, Speech Communication, 52, 613-625, 2010. https://doi.org/10.1016/j.specom.2010.02.010</mixed-citation>
                    </ref>
                                    <ref id="ref8">
                        <label>8</label>
                        <mixed-citation publication-type="journal">Z. T. Liu, M. Wu, W. H. Cao, J. W. Mao, J. P. Xu, and G. Z. Tan, &quot;Speech emotion recognition based on feature selection and extreme learning machine decision tree&quot;, Neurocomputing, 273, 271-280, 2018. https://doi.org/10.1016/j.neucom.2017.07.050</mixed-citation>
                    </ref>
                                    <ref id="ref9">
                        <label>9</label>
                        <mixed-citation publication-type="journal">S. Actis, A. Denner, L. Hofer, J. N. Lang A. Scharf, and S. Uccirati, &quot;RECOLA-Recursive Computation of One-Loop Amplitudes&quot;, Computer Physics Communications, 214, 140-173, 2017. https://doi.org/10.1016/j.cpc.2017.01.004</mixed-citation>
                    </ref>
                                    <ref id="ref10">
                        <label>10</label>
                        <mixed-citation publication-type="journal">G. Trigeorgis, &quot;End-to-end speech emotion recognition using a deep convolutional recurrent network&quot;, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, pp. 5200-5204, 2016. https://doi.org/10.1109/ICASSP.2016.7472669.</mixed-citation>
                    </ref>
                                    <ref id="ref11">
                        <label>11</label>
                        <mixed-citation publication-type="journal">K. Wang, N. An, B. N. Li, Y. Zhang, and L. Li, &quot;Speech Emotion Recognition Using Fourier Parameters,&quot; IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 69-75, 1 Jan.-March 2015, https://doi.org/10.1109/TAFFC.2015.2392101</mixed-citation>
                    </ref>
                                    <ref id="ref12">
                        <label>12</label>
                        <mixed-citation publication-type="journal">I. Dias, M. Demirci, M. Fatih and A. Yazıcı,  &quot;Speech emotion recognition with deep convolutional neural networks&quot;, Biomedical Signal Processing and Control, 59, 101894, 2020. https://doi.org/10.1016/j.bspc.2020.101894Get rights and content</mixed-citation>
                    </ref>
                                    <ref id="ref13">
                        <label>13</label>
                        <mixed-citation publication-type="journal">J. Cai, &quot;Feature-Level and Model-Level Audiovisual Fusion for Emotion Recognition in the Wild&quot;, 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, pp. 443-448, 2019. https://doi.org/10.1109/MIPR.2019.00089.</mixed-citation>
                    </ref>
                                    <ref id="ref14">
                        <label>14</label>
                        <mixed-citation publication-type="journal">S. Langari, H. Marvi, and M. Zahedi, M. &quot;Efficient speech emotion recognition using modified feature extraction, &quot; Informatics in Medicine 20, 100424, 2020. https://doi.org/10.1016/j.imu.2020.100424</mixed-citation>
                    </ref>
                                    <ref id="ref15">
                        <label>15</label>
                        <mixed-citation publication-type="journal">J. Zhao, X. Mao, and L. Chen, &quot;Speech emotion recognition using deep 1D &amp; 2D CNN LSTM networks&quot;, Biomedical Signal Processing and Control, 47, 312-323, 2019. https://doi.org/10.1016/j.bspc.2018.08.035</mixed-citation>
                    </ref>
                                    <ref id="ref16">
                        <label>16</label>
                        <mixed-citation publication-type="journal">P. P. Dahake, K. Shaw and P. Malathi, &quot;Speaker dependent speech emotion recognition using MFCC and Support Vector Machine,&quot; 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT), Pune, pp. 1080-1084, 2016. https://doi.org/10.1109/ICACDOT.2016.7877753.</mixed-citation>
                    </ref>
                                    <ref id="ref17">
                        <label>17</label>
                        <mixed-citation publication-type="journal">L. Kerkeni, Y. Serrestou, K. Raoof, M. Mbarki, M. A. &quot;Mahjoub, and C. Cleder, &quot;Automatic speech emotion recognition using an optimal combination of features based on EMD-TKEO&quot;, Speech Communication, 114, 22-35, 2019. https://doi.org/10.1016/j.specom.2019.09.00</mixed-citation>
                    </ref>
                                    <ref id="ref18">
                        <label>18</label>
                        <mixed-citation publication-type="journal">V. Zue, S. Seneff, and J. Glass, &quot;Speech database development at MIT: Timit and beyond&quot;, JSpeech Communication, 9(4), 351–356, 2023. https://doi.org/10.1016/0167-6393(90)90010-7</mixed-citation>
                    </ref>
                                    <ref id="ref19">
                        <label>19</label>
                        <mixed-citation publication-type="journal">C. Liu, T. L. Tang, and M. Wang, &quot;Multi-feature based emotion recognition for video clips&quot;, Proceedings of the 20th ACM International Conference on Multimodal Interaction, pp. 630–634, 2018. https://doi.org/10.1145/ 3242969.3264989.</mixed-citation>
                    </ref>
                                    <ref id="ref20">
                        <label>20</label>
                        <mixed-citation publication-type="journal">J. Wei, X. Yang, and Y. Dong, &quot;User-generated video emotion recognition based on key frames&quot;, Multimedia Tools and Applications, 80(9), 14343–14361, 2021. https://doi.org/ 10.1007/s11042-020-10203-1.</mixed-citation>
                    </ref>
                                    <ref id="ref21">
                        <label>21</label>
                        <mixed-citation publication-type="journal">T. L. B. Khanh, S. Kim, G. Lee, H. J. Yang, and E. T. Baek, E.-T. &quot;Korean video dataset for emotion recognition in the wild&quot;, Multimedia Tools and Applications, 80(6), 9479–9492, 2021. https://doi.org/10.1007/s11042-020-10106-1</mixed-citation>
                    </ref>
                                    <ref id="ref22">
                        <label>22</label>
                        <mixed-citation publication-type="journal">Guo, X., Polanía, L. F., &amp; Barner, K. E.  &quot;Toward end-to-end deception detection in videos&quot;, 2018 IEEE International Conference on Big Data, pp. 1278–1283, 2018. https://doi.org/10.1109/BigData.2018.8621909.</mixed-citation>
                    </ref>
                                    <ref id="ref23">
                        <label>23</label>
                        <mixed-citation publication-type="journal">R. Guetari, A. Chetouani, H. Tabia, and N. Khlifa, N. &quot;Real time emotion recognition in video stream, using B-CNN and F-CNN&quot;, 2020 5th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), pp. 1–6, 2020. https://doi.org/10.1109/ATSIP49331.2020.9231902.</mixed-citation>
                    </ref>
                                    <ref id="ref24">
                        <label>24</label>
                        <mixed-citation publication-type="journal">H. Zhou, D. Meng, Y. Zhang, X. Peng, J. Du, and K. Wang, &quot;Exploring emotion features and fusion strategies for audio-video emotion recognition&quot;, 2019 International Conference on Multimodal Interaction, pp. 562–566, 2019. https://doi.org/10.1145/3340555.3355713.</mixed-citation>
                    </ref>
                                    <ref id="ref25">
                        <label>25</label>
                        <mixed-citation publication-type="journal">S. E. Kahou, &quot;EmoNets: Multimodal deep learning approaches for emotion recognition in video&quot;, Journal On Multimodal User Interfaces, 10(2), 99–111, 2016. https:// doi.org/10.1007/s12193-015-0195-2.</mixed-citation>
                    </ref>
                                    <ref id="ref26">
                        <label>26</label>
                        <mixed-citation publication-type="journal">T. S. Gunawan, A. Ashraf, B. S. Riza, E. V. Haryanto, R. Rosnelly, M. Kartiwi, and Z. Janin, Z. &quot;Development of video-based emotion recognition using deep learning with Google Colab&quot;, TELKOMNIKA Telecommunication Computing Electronics And Control, 18(5), 2463–2471,2020. https://doi.org/10.12928/telkomnika.v18i5.16717</mixed-citation>
                    </ref>
                                    <ref id="ref27">
                        <label>27</label>
                        <mixed-citation publication-type="journal">H. V. Manalu, and A. P. Rifai, &quot;Detection of human emotions through facial expressions using hybrid convolutional neural network-recurrent neural network algorithm&quot;, Intelligent Systems with Applications, 21, 200339, 2024. https://doi.org/10.1016/j.iswa.2024.200339</mixed-citation>
                    </ref>
                                    <ref id="ref28">
                        <label>28</label>
                        <mixed-citation publication-type="journal">R. Memisevic, S. E. Kahou, V. Michalski, K. Konda, and C. Pal, &quot;Recurrent neural networks for emotion recognition in video&quot;, Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 467–474, 2015. https://doi.org/10.1145/2818346.2830596.</mixed-citation>
                    </ref>
                                    <ref id="ref29">
                        <label>29</label>
                        <mixed-citation publication-type="journal">L. H. Sun, J. Chen, and T. Gu, &quot;Deep and shallow features fusion based on deep convolutional neural network for speech emotion recognition&quot;, Int. J. Speech Technol, 21(4):931–40, 2018. https://doi.org/10.1007/s10772-018-9551-4</mixed-citation>
                    </ref>
                                    <ref id="ref30">
                        <label>30</label>
                        <mixed-citation publication-type="journal">Y. M. Huang, &quot;Feature fusion methods research based on deep belief networks for speech emotion recognition under noise condition&quot;, J. Ambient Intell Hum Comput, 10(5):1787–98, 2019. https://doi.org/10.1016/j.engappai.2024.108293</mixed-citation>
                    </ref>
                                    <ref id="ref31">
                        <label>31</label>
                        <mixed-citation publication-type="journal">M. Xu, F. Zhang and S. U. Khan, &quot;Improve accuracy of speech emotion recognition with attention head fusion&quot;, 2020 10th annual computing and communication workshop and conference (CCWC). pp. 12-5-18, 2020. https://doi.org/10.1109/CCWC47524.2020.9031207</mixed-citation>
                    </ref>
                                    <ref id="ref32">
                        <label>32</label>
                        <mixed-citation publication-type="journal">W. Jiang, Z. Wang, J. S. Jin, X. Han, and C. Li &quot;Speech emotion recognition with heterogeneous feature unification of deep neural network&quot;, Sensors (Basel), 19(12), 2730, 2019. https://doi.org/10.3390/s19122730.</mixed-citation>
                    </ref>
                                    <ref id="ref33">
                        <label>33</label>
                        <mixed-citation publication-type="journal">Z. W. Tu, B. Lui, W. Zhao, R. Yan and Y. Zou &quot;A feature fusion model with data augmentation for speech emotion recognition&quot;, Appl Sci-Basel, 13(7), 4124, 2023.  https://doi.org/10.3390/app13074124.</mixed-citation>
                    </ref>
                                    <ref id="ref34">
                        <label>34</label>
                        <mixed-citation publication-type="journal">I. Shahin, O. S. Alamori, A. B. Nassif, I. Afyouni, I. A. Hashem, A. Elnagar &quot;An efficient feature selection method for arabic and english speech emotion recognition using Grey Wolf Optimizer&quot;, Appl Acoust, 205, 109279, 2023. https://doi.org/10.1016/j.apacoust.2023.109279</mixed-citation>
                    </ref>
                                    <ref id="ref35">
                        <label>35</label>
                        <mixed-citation publication-type="journal">Y. Liu, &quot;Combined CNN LSTM with attention for speech emotion recognition based on feature-level fusion&quot;, Multimed Tools Appl, pp.1–21, 2024. https://doi.org/10.1007/s11042-023-17829-x</mixed-citation>
                    </ref>
                                    <ref id="ref36">
                        <label>36</label>
                        <mixed-citation publication-type="journal">Z. Liu, X. Kang. And F. Ren, &quot;Improving speech emotion recognition by fusing pre-trained and acoustic features using transformer and BiLSTM&quot;, International Conference on Intelligent Information Processing. pp 68-79, 2022. https://doi.org/10.1007/978-3-031-03948-5_28.</mixed-citation>
                    </ref>
                            </ref-list>
                    </back>
    </article>
