Facial Expression Recognition (FER) has been widely studied in the literature because of its many applications. The rapid development of deep learning algorithms for computer vision, especially transformer-based classification models, makes it hard to select the most appropriate model. A complex model may improve accuracy but also increases inference time, which is crucial in near real-time applications. On the other hand, small models may not give the desired results. In this study, we examine the performance of five relatively small transformer-based image classification models on FER tasks. We selected vanilla ViT, PiT, Swin, DeiT, and CrossViT based on their trainable parameter counts and architectures. Each model has 20-30M trainable parameters, which is relatively small, and each has a different architecture. For example, CrossViT processes the image using multi-scale patches, while PiT adds convolution layers and pooling to the vanilla ViT. We report results on two widely used FER datasets, CK+ and KDEF. The PiT model achieves the best accuracy, 0.9513 on CK+ and 0.9090 on KDEF.
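To make the 20-30M parameter range concrete, the sketch below counts the trainable parameters of a vanilla ViT from its hyperparameters. The abstract does not state which configuration was used, so the values here (embedding size 384, 12 blocks, MLP ratio 4, 16x16 patches, 224x224 input, 7 expression classes) are assumptions chosen to resemble a commonly used "small" ViT. The function name `vit_param_count` is hypothetical and is not taken from the paper.

```python
def vit_param_count(embed_dim, depth, mlp_ratio, patch_size,
                    image_size, num_classes, in_channels=3):
    """Count trainable parameters of a plain ViT classifier (weights + biases)."""
    num_patches = (image_size // patch_size) ** 2

    # Patch embedding: one linear projection per flattened patch.
    patch_embed = in_channels * patch_size * patch_size * embed_dim + embed_dim
    cls_token = embed_dim
    pos_embed = (num_patches + 1) * embed_dim  # +1 for the class token

    # One encoder block: two LayerNorms, self-attention, and an MLP.
    layer_norm = 2 * embed_dim  # scale and shift
    attention = (embed_dim * 3 * embed_dim + 3 * embed_dim) \
        + (embed_dim * embed_dim + embed_dim)  # QKV projection + output projection
    hidden = int(embed_dim * mlp_ratio)
    mlp = (embed_dim * hidden + hidden) + (hidden * embed_dim + embed_dim)
    block = 2 * layer_norm + attention + mlp

    # Final LayerNorm and linear classification head.
    head = embed_dim * num_classes + num_classes
    return patch_embed + cls_token + pos_embed + depth * block + layer_norm + head


# Assumed configuration (not specified in the abstract).
total = vit_param_count(embed_dim=384, depth=12, mlp_ratio=4,
                        patch_size=16, image_size=224, num_classes=7)
print(f"{total / 1e6:.2f}M trainable parameters")
```

Under these assumptions the count falls inside the 20-30M range the abstract describes. The number of attention heads does not appear in the function because splitting the projection across heads leaves the parameter count unchanged.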
Primary Language | English |
---|---|
Subjects | Software Engineering (Other), Electrical Engineering (Other) |
Section | Research Article |
Authors | |
Early View Date | October 24, 2024 |
Publication Date | September 30, 2024 |
Submission Date | May 18, 2024 |
Acceptance Date | August 20, 2024 |
Published in Issue | Year 2024 Volume: 12 Issue: 3 |
All articles published by BAJECE are licensed under the Creative Commons Attribution 4.0 International License. This permits anyone to copy, redistribute, remix, transmit, and adapt the work, provided the original work and source are appropriately cited.