<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.4 20241031//EN"
        "https://jats.nlm.nih.gov/publishing/1.4/JATS-journalpublishing1-4.dtd">
<article article-type="research-article" dtd-version="1.4">
            <front>

                <journal-meta>
                                                                <journal-id>jise</journal-id>
            <journal-title-group>
                                                                                    <journal-title>Journal of Innovative Science and Engineering</journal-title>
            </journal-title-group>
                                        <issn pub-type="epub">2602-4217</issn>
                                                                                            <publisher>
                    <publisher-name>Bursa Technical University</publisher-name>
                </publisher>
                    </journal-meta>
                <article-meta>
                                        <article-id pub-id-type="doi">10.38088/jise.1703936</article-id>
                                                                <article-categories>
                                            <subj-group  xml:lang="en">
                                                            <subject>Image Processing</subject>
                                                    </subj-group>
                                            <subj-group  xml:lang="tr">
                                                            <subject>Görüntü İşleme</subject>
                                                    </subj-group>
                                    </article-categories>
                                                                                                                                                        <title-group>
                                                                                                                        <article-title>A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51</article-title>
                                                                                                    </title-group>
            
                                                    <contrib-group content-type="authors">
                                                                        <contrib contrib-type="author">
                                                                    <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-7994-2679</contrib-id>
                                                                <name>
                                    <surname>Seven</surname>
                                    <given-names>Engin</given-names>
                                </name>
                                                                    <aff>Istanbul University-Cerrahpaşa</aff>
                                                            </contrib>
                                                    <contrib contrib-type="author">
                                                                    <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-1979-8860</contrib-id>
                                                                <name>
                                    <surname>Yücel Demirel</surname>
                                    <given-names>Eylem</given-names>
                                </name>
                                                                    <aff>Istanbul University-Cerrahpaşa, Faculty of Engineering, Department of Computer Engineering, Computer Engineering Pr.</aff>
                                                            </contrib>
                                                                                </contrib-group>
                        
                                        <pub-date pub-type="pub" iso-8601-date="20251215">
                    <day>12</day>
                    <month>15</month>
                    <year>2025</year>
                </pub-date>
                                        <volume>9</volume>
                                        <issue>2</issue>
                                        <fpage>327</fpage>
                                        <lpage>342</lpage>
                        
                        <history>
                                    <date date-type="received" iso-8601-date="20250521">
                        <day>05</day>
                        <month>21</month>
                        <year>2025</year>
                    </date>
                                                    <date date-type="accepted" iso-8601-date="20250929">
                        <day>09</day>
                        <month>29</month>
                        <year>2025</year>
                    </date>
                            </history>
                                        <permissions>
                    <copyright-statement>Copyright © 2025, Journal of Innovative Science and Engineering</copyright-statement>
                    <copyright-year>2025</copyright-year>
                    <copyright-holder>Journal of Innovative Science and Engineering</copyright-holder>
                </permissions>
            
                                                                                                <abstract><p>Video-Based Human Action Recognition (HAR) remains challenging due to inter-class similarity, background noise, and the need to capture long-term temporal dependencies. This study proposes a hybrid deep learning model that integrates 3D Convolutional Neural Networks (3D CNNs) with Transformer-based attention mechanisms to jointly capture spatio-temporal features and long-range motion context. The architecture was optimized for parameter efficiency and trained on the UCF101 and HMDB51 benchmark datasets using standardized preprocessing and training strategies. Experimental results indicate that the proposed model reaches 97% accuracy and a 96.8% mean F1-score on UCF101, and 85% accuracy and an 83.8% F1-score on HMDB51, showing consistent improvements over the standalone 3D CNN and Transformer variants under identical settings. Ablation studies confirm that combining convolutional and attention layers significantly improves recognition performance while maintaining competitive computational cost (3.78M parameters, 17.75 GFLOPs per video, ~7 ms GPU latency). These findings highlight the effectiveness of the hybrid design for accurate and efficient HAR. Future work will address class imbalance using focal loss or weighted training, explore multimodal data integration, and develop more lightweight Transformer modules for real-time deployment on resource-constrained devices.</p></abstract>
                                                            
            
                                                            <kwd-group>
                                                    <kwd>Human Activity Recognition</kwd>
                                                    <kwd>Video-based Action Recognition</kwd>
                                                    <kwd>3D Convolutional Neural Networks</kwd>
                                                    <kwd>Attention Mechanism</kwd>
                                                    <kwd>Deep Learning in Computer Vision</kwd>
                                            </kwd-group>
                            
                                                                                                                        </article-meta>
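        <notes>
            <title>Illustrative architecture sketch</title>
            <p>The abstract describes a hybrid design in which a 3D CNN front end extracts local spatio-temporal features and a Transformer encoder models long-range temporal context before classification. The following minimal PyTorch sketch illustrates that general pattern only; the module layout, layer sizes, clip length, and hyperparameters are hypothetical assumptions and are not taken from the authors' implementation.</p>
            <code language="python"><![CDATA[
# Minimal sketch of a hybrid 3D-CNN + Transformer video classifier (illustrative only;
# sizes and hyperparameters are hypothetical, not the configuration reported in the paper).
import torch
import torch.nn as nn

class Hybrid3DCNNTransformer(nn.Module):
    def __init__(self, num_classes=101, embed_dim=256, num_heads=4, num_layers=2):
        super().__init__()
        # 3D convolutional stem: local spatio-temporal feature extraction
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.BatchNorm3d(64), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, embed_dim, kernel_size=3, padding=1), nn.BatchNorm3d(embed_dim), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),      # keep the time axis, pool space away
        )
        # Transformer encoder: attention over the temporal token sequence
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                                   batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, clip):                         # clip: (B, 3, T, H, W)
        feats = self.cnn3d(clip)                     # (B, C, T, 1, 1)
        tokens = feats.flatten(2).transpose(1, 2)    # (B, T, C): one token per frame position
        tokens = self.temporal_encoder(tokens)       # long-range temporal attention
        return self.classifier(tokens.mean(dim=1))   # average over time, then classify

# Example: a batch of two 16-frame RGB clips at 112x112 resolution
logits = Hybrid3DCNNTransformer()(torch.randn(2, 3, 16, 112, 112))
]]></code>
        </notes>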
    </front>
    <back>
                            <ref-list>
                                    <ref id="ref1">
                        <label>1</label>
                        <mixed-citation publication-type="journal">[1] Herath, S., Harandi, M., &amp; Porikli, F. (2017). Going deeper into action recognition: A survey. Image and Vision Computing, 60, 4–21. https://doi.org/10.1016/J.IMAVIS.2017.01.010</mixed-citation>
                    </ref>
                                    <ref id="ref2">
                        <label>2</label>
                        <mixed-citation publication-type="journal">[2] Waghchaware, S., &amp; Joshi, R. (2024). Machine learning and deep learning models for human activity recognition in security and surveillance: a review. Knowledge and Information Systems, 66(8), 4405–4436.</mixed-citation>
                    </ref>
                                    <ref id="ref3">
                        <label>3</label>
                        <mixed-citation publication-type="journal">[3] Andreu-Perez, J., Poon, C. C. Y., Merrifield, R. D., Wong, S. T. C., &amp; Yang, G. Z. (2015). Big Data for Health. IEEE Journal of Biomedical and Health Informatics, 19(4), 1193–1208. https://doi.org/10.1109/JBHI.2015.2450362</mixed-citation>
                    </ref>
                                    <ref id="ref4">
                        <label>4</label>
                        <mixed-citation publication-type="journal">[4] Liu, R., Ramli, A. A., Zhang, H., Henricson, E., &amp; Liu, X. (2022). An Overview of Human Activity Recognition Using Wearable Sensors: Healthcare and Artificial Intelligence. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 12993 LNCS, 1–14. https://doi.org/10.1007/978-3-030-96068-1_1</mixed-citation>
                    </ref>
                                    <ref id="ref5">
                        <label>5</label>
                        <mixed-citation publication-type="journal">[5] Das, D., Nishimura, Y., Vivek, R. P., Takeda, N., Fish, S. T., Plötz, T., &amp; Chernova, S. (2023). Explainable Activity Recognition for Smart Home Systems. ACM Transactions on Interactive Intelligent Systems, 13(2). https://doi.org/10.1145/3561533</mixed-citation>
                    </ref>
                                    <ref id="ref6">
                        <label>6</label>
                        <mixed-citation publication-type="journal">[6] Alzubaidi, A., &amp; Kalita, J. (2016). Authentication of smartphone users using behavioral biometrics. IEEE Communications Surveys and Tutorials, 18(3), 1998–2026. https://doi.org/10.1109/COMST.2016.2537748</mixed-citation>
                    </ref>
                                    <ref id="ref7">
                        <label>7</label>
                        <mixed-citation publication-type="journal">[7] Chen, W.-H., &amp; Cho, P.-C. (2021). A GAN-Based Data Augmentation Approach for Sensor-Based Human Activity Recognition. International Journal of Computer and Communication Engineering, 10(4), 75–84. https://doi.org/10.17706/IJCCE.2021.10.4.75-84</mixed-citation>
                    </ref>
                                    <ref id="ref8">
                        <label>8</label>
                        <mixed-citation publication-type="journal">[8] Liu, M., Geißler, D., Bian, S., Zhou, B., &amp; Lukowicz, P. (2025). Assessing the Impact of Sampling Irregularity in Time Series Data: Human Activity Recognition as a Case Study. https://arxiv.org/pdf/2501.15330</mixed-citation>
                    </ref>
                                    <ref id="ref9">
                        <label>9</label>
                        <mixed-citation publication-type="journal">[9] Hao, Y., Wang, B., &amp; Zheng, R. (2023). VALERIAN: Invariant Feature Learning for IMU Sensor-based Human Activity Recognition in the Wild. ACM International Conference Proceeding Series, 66–78. https://doi.org/10.1145/3576842.3582390</mixed-citation>
                    </ref>
                                    <ref id="ref10">
                        <label>10</label>
                        <mixed-citation publication-type="journal">[10] Chen, J., Xu, X., Wang, T., Jeon, G., &amp; Camacho, D. (2024). An AIoT Framework With Multi-modal Frequency Fusion for WiFi-Based Coarse and Fine Activity Recognition. IEEE Internet of Things Journal. https://doi.org/10.1109/JIOT.2024.3400773</mixed-citation>
                    </ref>
                                    <ref id="ref11">
                        <label>11</label>
                        <mixed-citation publication-type="journal">[11] Ullah, H. A., Letchmunan, S., Zia, M. S., Butt, U. M., &amp; Hassan, F. H. (2021). Analysis of Deep Neural Networks for Human Activity Recognition in Videos - A Systematic Literature Review. IEEE Access, 9, 126366–126387. https://doi.org/10.1109/ACCESS.2021.3110610</mixed-citation>
                    </ref>
                                    <ref id="ref12">
                        <label>12</label>
                        <mixed-citation publication-type="journal">[12] Wang, C., &amp; Mohamed, A. S. A. (2023). Group Activity Recognition in Computer Vision: A Comprehensive Review, Challenges, and Future Perspectives. https://arxiv.org/pdf/2307.13541</mixed-citation>
                    </ref>
                                    <ref id="ref13">
                        <label>13</label>
                        <mixed-citation publication-type="journal">[13] Ahn, D., Kim, S., Hong, H., &amp; Ko, B. C. (2023). STAR-Transformer: A Spatio-Temporal Cross Attention Transformer for Human Action Recognition. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 3330–3339.</mixed-citation>
                    </ref>
                                    <ref id="ref14">
                        <label>14</label>
                        <mixed-citation publication-type="journal">[14] Feichtenhofer, C., Fan, H., Malik, J., &amp; He, K. (2019). SlowFast networks for video recognition. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 6201–6210. https://doi.org/10.1109/ICCV.2019.00630</mixed-citation>
                    </ref>
                                    <ref id="ref15">
                        <label>15</label>
                        <mixed-citation publication-type="journal">[15] Zeng, M., Nguyen, L. T., Yu, B., Mengshoel, O. J., Zhu, J., Wu, P., &amp; Zhang, J. (2015). Convolutional Neural Networks for human activity recognition using mobile sensors. Proceedings of the 2014 6th International Conference on Mobile Computing, Applications and Services, MobiCASE 2014, 197–205. https://doi.org/10.4108/ICST.MOBICASE.2014.257786</mixed-citation>
                    </ref>
                                    <ref id="ref16">
                        <label>16</label>
                        <mixed-citation publication-type="journal">[16] Lara, Ó. D., &amp; Labrador, M. A. (2013). A survey on human activity recognition using wearable sensors. IEEE Communications Surveys and Tutorials, 15(3), 1192–1209. https://doi.org/10.1109/SURV.2012.110112.00192</mixed-citation>
                    </ref>
                                    <ref id="ref17">
                        <label>17</label>
                        <mixed-citation publication-type="journal">[17] Bulling, A., Blanke, U., &amp; Schiele, B. (2014). A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys, 46(3). https://doi.org/10.1145/2499621</mixed-citation>
                    </ref>
                                    <ref id="ref18">
                        <label>18</label>
                        <mixed-citation publication-type="journal">[18] Simonyan, K., &amp; Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems, 1(January), 568–576. http://arxiv.org/abs/1406.2199</mixed-citation>
                    </ref>
                                    <ref id="ref19">
                        <label>19</label>
                        <mixed-citation publication-type="journal">[19] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... &amp; Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.</mixed-citation>
                    </ref>
                                    <ref id="ref20">
                        <label>20</label>
                        <mixed-citation publication-type="journal">[20] Tran, D., Bourdev, L., Fergus, R., Torresani, L., &amp; Paluri, M. (2015). Learning Spatiotemporal Features with 3D Convolutional Networks. 2015 IEEE International Conference on Computer Vision (ICCV), 4489–4497.</mixed-citation>
                    </ref>
                                    <ref id="ref21">
                        <label>21</label>
                        <mixed-citation publication-type="journal">[21] Carreira, J., &amp; Zisserman, A. (2017). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4724–4733. https://doi.org/10.1109/CVPR.2017.502</mixed-citation>
                    </ref>
                                    <ref id="ref22">
                        <label>22</label>
                        <mixed-citation publication-type="journal">[22] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., &amp; van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9912 LNCS, 20–36.</mixed-citation>
                    </ref>
                                    <ref id="ref23">
                        <label>23</label>
                        <mixed-citation publication-type="journal">[23] Wang, X., Girshick, R., Gupta, A., &amp; He, K. (2017). Non-local Neural Networks. arXiv:1711.07971. https://doi.org/10.48550/ARXIV.1711.07971</mixed-citation>
                    </ref>
                                    <ref id="ref24">
                        <label>24</label>
                        <mixed-citation publication-type="journal">[24] Yan, S., Xiong, Y., &amp; Lin, D. (2018). Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).</mixed-citation>
                    </ref>
                                    <ref id="ref25">
                        <label>25</label>
                        <mixed-citation publication-type="journal">[25] Girdhar, R., Carreira, J., Doersch, C., &amp; Zisserman, A. (2019). Video Action Transformer Network. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. http://arxiv.org/abs/1812.02707</mixed-citation>
                    </ref>
                                    <ref id="ref26">
                        <label>26</label>
                        <mixed-citation publication-type="journal">[26] Bertasius, G., Wang, H., &amp; Torresani, L. (2021). Is Space-Time Attention All You Need for Video Understanding? Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139.</mixed-citation>
                    </ref>
                                    <ref id="ref27">
                        <label>27</label>
                        <mixed-citation publication-type="journal">[27] Tong, Z., Song, Y., Wang, J., &amp; Wang, L. (2022). VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems, 35, 10078–10093.</mixed-citation>
                    </ref>
                                    <ref id="ref28">
                        <label>28</label>
                        <mixed-citation publication-type="journal">[28] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., ... &amp; Qiao, Y. (2023). VideoMAE V2: Scaling video masked autoencoders with dual masking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14549–14560.</mixed-citation>
                    </ref>
                                    <ref id="ref29">
                        <label>29</label>
                        <mixed-citation publication-type="journal">[29] Mehta, S., &amp; Rastegari, M. (2021). MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. ICLR 2022 - 10th International Conference on Learning Representations. https://arxiv.org/pdf/2110.02178</mixed-citation>
                    </ref>
                                    <ref id="ref30">
                        <label>30</label>
                        <mixed-citation publication-type="journal">[30] Yamato, J., Ohya, J., &amp; Ishii, K. (1992, June). Recognizing human action in time-sequential images using hidden Markov model. In CVPR (Vol. 92, pp. 379–385).</mixed-citation>
                    </ref>
                                    <ref id="ref31">
                        <label>31</label>
                        <mixed-citation publication-type="journal">[31] Wang, H., Kläser, A., Schmid, C., &amp; Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1), 60–79.</mixed-citation>
                    </ref>
                                    <ref id="ref32">
                        <label>32</label>
                        <mixed-citation publication-type="journal">[32] Zhang, R., Li, S., Xue, J., Lin, F., Zhang, Q., Ma, X., &amp; Yan, X. (2024). Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions. https://arxiv.org/pdf/2405.17729</mixed-citation>
                    </ref>
                                    <ref id="ref33">
                        <label>33</label>
                        <mixed-citation publication-type="journal">[33] Soomro, K., Zamir, A. R., &amp; Shah, M. (2012). UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. https://arxiv.org/pdf/1212.0402</mixed-citation>
                    </ref>
                                    <ref id="ref34">
                        <label>34</label>
                        <mixed-citation publication-type="journal">[34] Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., &amp; Serre, T. (2011). HMDB: A large video database for human motion recognition. Proceedings of the IEEE International Conference on Computer Vision, 2556–2563. https://doi.org/10.1109/ICCV.2011.6126543</mixed-citation>
                    </ref>
                                    <ref id="ref35">
                        <label>35</label>
                        <mixed-citation publication-type="journal">[35] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., &amp; Zisserman, A. (2017). The Kinetics Human Action Video Dataset. https://arxiv.org/pdf/1705.06950</mixed-citation>
                    </ref>
                                    <ref id="ref36">
                        <label>36</label>
                        <mixed-citation publication-type="journal">[36] Goyal, R., Michalski, V., Materzyńska, J., Westphal, S., Kim, H., Haenel, V., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., &amp; Memisevic, R. (2017). The “something something” video database for learning and evaluating visual common sense. Proceedings of the IEEE International Conference on Computer Vision, 5842–5850.</mixed-citation>
                    </ref>
                                    <ref id="ref37">
                        <label>37</label>
                        <mixed-citation publication-type="journal">[37] Kingma, D. P., &amp; Ba, J. L. (2014). Adam: A Method for Stochastic Optimization. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings. https://arxiv.org/pdf/1412.6980</mixed-citation>
                    </ref>
                                    <ref id="ref38">
                        <label>38</label>
                        <mixed-citation publication-type="journal">[38] Keskar, N. S., Nocedal, J., Tang, P. T. P., Mudigere, D., &amp; Smelyanskiy, M. (2017). On large-batch training for deep learning: Generalization gap and sharp minima. 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, 1–16.</mixed-citation>
                    </ref>
                            </ref-list>
                    </back>
    </article>
