TY - JOUR
T1 - A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51
AU - Seven, Engin
AU - Yücel Demirel, Eylem
PY - 2025
DA - December
Y2 - 2025
DO - 10.38088/jise.1703936
JF - Journal of Innovative Science and Engineering
JO - JISE
PB - Bursa Technical University
WT - DergiPark
SN - 2602-4217
SP - 327
EP - 342
VL - 9
IS - 2
LA - en
AB - Video-Based Human Action Recognition (HAR) remains challenging due to inter-class similarity, background noise, and the need to capture long-term temporal dependencies. This study proposes a hybrid deep learning model that integrates 3D Convolutional Neural Networks (3D CNNs) with Transformer-based attention mechanisms to jointly capture spatio-temporal features and long-range motion context. The architecture was optimized for parameter efficiency and trained on the UCF101 and HMDB51 benchmark datasets using standardized preprocessing and training strategies. Experimental results indicate that the proposed model reaches 97% accuracy and a 96.8% mean F1-score on UCF101, and 85% accuracy and an 83.8% F1-score on HMDB51, showing consistent improvements over standalone 3D CNN and Transformer variants under identical settings. Ablation studies confirm that the combination of convolutional and attention layers significantly improves recognition performance while maintaining competitive computational cost (3.78M parameters, 17.75 GFLOPs/video, ~7 ms GPU latency). These findings highlight the effectiveness of the hybrid design for accurate and efficient HAR. Future work will address class imbalance using focal loss or weighted training, explore multimodal data integration, and develop more lightweight Transformer modules for real-time deployment on resource-constrained devices.
KW - Human Activity Recognition
KW - Video-based Action Recognition
KW - 3D Convolutional Neural Networks
KW - Attention Mechanism
KW - Deep Learning in Computer Vision
CR - [1] Herath, S., Harandi, M., & Porikli, F. (2017). Going deeper into action recognition: A survey. Image and Vision Computing, 60, 4–21. https://doi.org/10.1016/j.imavis.2017.01.010
CR - [2] Waghchaware, S., & Joshi, R. (2024). Machine learning and deep learning models for human activity recognition in security and surveillance: A review. Knowledge and Information Systems, 66(8), 4405–4436.
CR - [3] Andreu-Perez, J., Poon, C. C. Y., Merrifield, R. D., Wong, S. T. C., & Yang, G. Z. (2015). Big data for health. IEEE Journal of Biomedical and Health Informatics, 19(4), 1193–1208. https://doi.org/10.1109/JBHI.2015.2450362
CR - [4] Liu, R., Ramli, A. A., Zhang, H., Henricson, E., & Liu, X. (2022). An overview of human activity recognition using wearable sensors: Healthcare and artificial intelligence. Lecture Notes in Computer Science, 12993 LNCS, 1–14. https://doi.org/10.1007/978-3-030-96068-1_1
CR - [5] Das, D., Nishimura, Y., Vivek, R. P., Takeda, N., Fish, S. T., Plötz, T., & Chernova, S. (2023). Explainable activity recognition for smart home systems. ACM Transactions on Interactive Intelligent Systems, 13(2). https://doi.org/10.1145/3561533
CR - [6] Alzubaidi, A., & Kalita, J. (2016). Authentication of smartphone users using behavioral biometrics. IEEE Communications Surveys and Tutorials, 18(3), 1998–2026. https://doi.org/10.1109/COMST.2016.2537748
CR - [7] Chen, W.-H., & Cho, P.-C. (2021). A GAN-based data augmentation approach for sensor-based human activity recognition. International Journal of Computer and Communication Engineering, 10(4), 75–84. https://doi.org/10.17706/ijcce.2021.10.4.75-84
CR - [8] Liu, M., Geißler, D., Bian, S., Zhou, B., & Lukowicz, P. (2025). Assessing the impact of sampling irregularity in time series data: Human activity recognition as a case study. https://arxiv.org/pdf/2501.15330
CR - [9] Hao, Y., Wang, B., & Zheng, R. (2023). VALERIAN: Invariant feature learning for IMU sensor-based human activity recognition in the wild. ACM International Conference Proceeding Series, 66–78. https://doi.org/10.1145/3576842.3582390
CR - [10] Chen, J., Xu, X., Wang, T., Jeon, G., & Camacho, D. (2024). An AIoT framework with multi-modal frequency fusion for WiFi-based coarse and fine activity recognition. IEEE Internet of Things Journal. https://doi.org/10.1109/JIOT.2024.3400773
CR - [11] Ullah, H. A., Letchmunan, S., Zia, M. S., Butt, U. M., & Hassan, F. H. (2021). Analysis of deep neural networks for human activity recognition in videos: A systematic literature review. IEEE Access, 9, 126366–126387. https://doi.org/10.1109/ACCESS.2021.3110610
CR - [12] Wang, C., & Mohamed, A. S. A. (2023). Group activity recognition in computer vision: A comprehensive review, challenges, and future perspectives. https://arxiv.org/pdf/2307.13541
CR - [13] Ahn, D., Kim, S., Hong, H., & Ko, B. C. (2023). STAR-Transformer: A spatio-temporal cross attention transformer for human action recognition. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 3330–3339.
CR - [14] Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). SlowFast networks for video recognition. Proceedings of the IEEE International Conference on Computer Vision, 6201–6210. https://doi.org/10.1109/ICCV.2019.00630
CR - [15] Zeng, M., Nguyen, L. T., Yu, B., Mengshoel, O. J., Zhu, J., Wu, P., & Zhang, J. (2015). Convolutional neural networks for human activity recognition using mobile sensors. Proceedings of the 2014 6th International Conference on Mobile Computing, Applications and Services (MobiCASE 2014), 197–205. https://doi.org/10.4108/icst.mobicase.2014.257786
CR - [16] Lara, Ó. D., & Labrador, M. A. (2013). A survey on human activity recognition using wearable sensors. IEEE Communications Surveys and Tutorials, 15(3), 1192–1209. https://doi.org/10.1109/SURV.2012.110112.00192
CR - [17] Bulling, A., Blanke, U., & Schiele, B. (2014). A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys, 46(3). https://doi.org/10.1145/2499621
CR - [18] Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems, 568–576. http://arxiv.org/abs/1406.2199
CR - [19] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
CR - [20] Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. 2015 IEEE International Conference on Computer Vision (ICCV), 4489–4497.
CR - [21] Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4724–4733. https://doi.org/10.1109/CVPR.2017.502
CR - [22] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. Lecture Notes in Computer Science, 9912 LNCS, 20–36.
CR - [23] Wang, X., Girshick, R., Gupta, A., & He, K. (2017). Non-local neural networks. arXiv preprint arXiv:1711.07971. https://doi.org/10.48550/arXiv.1711.07971
CR - [24] Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
CR - [25] Girdhar, R., Carreira, J., Doersch, C., & Zisserman, A. (2019). Video action transformer network. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. http://arxiv.org/abs/1812.02707
CR - [26] Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? Proceedings of the 38th International Conference on Machine Learning, PMLR 139.
CR - [27] Tong, Z., Song, Y., Wang, J., & Wang, L. (2022). VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems, 35, 10078–10093.
CR - [28] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., ... & Qiao, Y. (2023). VideoMAE V2: Scaling video masked autoencoders with dual masking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14549–14560.
CR - [29] Mehta, S., & Rastegari, M. (2021). MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. ICLR 2022 - 10th International Conference on Learning Representations. https://arxiv.org/pdf/2110.02178
CR - [30] Yamato, J., Ohya, J., & Ishii, K. (1992, June). Recognizing human action in time-sequential images using hidden Markov model. In CVPR (Vol. 92, pp. 379–385).
CR - [31] Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1), 60–79.
CR - [32] Zhang, R., Li, S., Xue, J., Lin, F., Zhang, Q., Ma, X., & Yan, X. (2024). Hierarchical action recognition: A contrastive video-language approach with hierarchical interactions. https://arxiv.org/pdf/2405.17729
CR - [33] Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. https://arxiv.org/pdf/1212.0402
CR - [34] Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. Proceedings of the IEEE International Conference on Computer Vision, 2556–2563. https://doi.org/10.1109/ICCV.2011.6126543
CR - [35] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., & Zisserman, A. (2017). The Kinetics human action video dataset. https://arxiv.org/pdf/1705.06950
CR - [36] Goyal, R., Michalski, V., Materzyńska, J., Westphal, S., Kim, H., Haenel, V., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., & Memisevic, R. (2017). The "something something" video database for learning and evaluating visual common sense. Proceedings of the IEEE International Conference on Computer Vision, 5842–5850.
CR - [37] Kingma, D. P., & Ba, J. L. (2015). Adam: A method for stochastic optimization. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings. https://arxiv.org/pdf/1412.6980
CR - [38] Keskar, N. S., Nocedal, J., Tang, P. T. P., Mudigere, D., & Smelyanskiy, M. (2017). On large-batch training for deep learning: Generalization gap and sharp minima. 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, 1–16.
UR - https://doi.org/10.38088/jise.1703936
L1 - https://dergipark.org.tr/en/download/article-file/4891014
ER -