Research Article

Scene Classification via Attention-Guided Integration of Visual and Auditory Data Streams

Volume: 15 Number: 2 July 1, 2026

Abstract

This study proposes the Gated Cross-Modal Fusion Transformer (GCM-FT), a multi-source deep learning architecture designed to integrate the complementary visual and auditory information sources more effectively in scene classification. The framework extracts deep representations from the visual stream with an EfficientNetV2 backbone, and for the auditory stream it processes the MFCC-based time–frequency features provided with the dataset. The representation vectors from the two streams are fused dynamically through a gated attention mechanism. A multi-head loss function, auxiliary stream outputs, and an attention-based fusion block allow the model to learn the contributions of visual and auditory information in a stable and balanced manner. In extensive cross-validation experiments, GCM-FT achieves higher accuracy, lower variance, and more consistent class-wise performance than single-stream models and existing fusion approaches. These findings indicate that attention-guided fusion is an effective and generalizable integration strategy for visual–auditory scene classification tasks.
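
The abstract does not give the fusion equations, so the following is only a minimal sketch of one common way to gate two stream embeddings and to combine a main loss with auxiliary stream losses. It should not be read as the GCM-FT implementation. The function names, the element-wise sigmoid gate, and the `aux_weight` value are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, audio, weights, bias):
    """Fuse two equal-length embeddings with an element-wise gate (assumed form).

    gate_i  = sigmoid(weights[i] . [visual; audio] + bias[i])
    fused_i = gate_i * visual_i + (1 - gate_i) * audio_i
    """
    if len(visual) != len(audio):
        raise ValueError("visual and audio embeddings must have the same length")
    joint = list(visual) + list(audio)  # concatenated input to the gate
    fused, gates = [], []
    for i, (v, a) in enumerate(zip(visual, audio)):
        z = sum(w * x for w, x in zip(weights[i], joint)) + bias[i]
        g = sigmoid(z)
        gates.append(g)
        fused.append(g * v + (1.0 - g) * a)
    return fused, gates


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]


def multi_head_loss(fused_logits, visual_logits, audio_logits, target,
                    aux_weight=0.3):
    """Main loss on the fused head plus weighted auxiliary stream losses.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    main = cross_entropy(fused_logits, target)
    aux = (cross_entropy(visual_logits, target)
           + cross_entropy(audio_logits, target))
    return main + aux_weight * aux


if __name__ == "__main__":
    dim = 2
    zero_weights = [[0.0] * (2 * dim) for _ in range(dim)]
    fused, gates = gated_fusion([1.0, 3.0], [3.0, 1.0], zero_weights, [0.0, 0.0])
    print(fused, gates)
```

With zero gate weights and zero bias, every gate equals 0.5 and the fused vector is the plain average of the two streams. A trained gate would instead shift weight toward whichever stream is more informative for a given input, and the auxiliary terms keep each stream's own classifier useful on its own.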

Details

Primary Language

English

Subjects

Information Systems (Other)

Journal Section

Research Article

Publication Date

July 1, 2026

Submission Date

November 22, 2025

Acceptance Date

June 1, 2026

Published in Issue

Year 2026 Volume: 15 Number: 2

APA
Çelik, Y. (2026). Scene Classification via Attention-Guided Integration of Visual and Auditory Data Streams. Turkish Journal of Nature and Science, 15(2), 207-214. https://doi.org/10.46810/tdfd.1828359
AMA
1. Çelik Y. Scene Classification via Attention-Guided Integration of Visual and Auditory Data Streams. TJNS. 2026;15(2):207-214. doi:10.46810/tdfd.1828359
Chicago
Çelik, Yusuf. 2026. “Scene Classification via Attention-Guided Integration of Visual and Auditory Data Streams”. Turkish Journal of Nature and Science 15 (2): 207-14. https://doi.org/10.46810/tdfd.1828359.
EndNote
Çelik Y (July 1, 2026) Scene Classification via Attention-Guided Integration of Visual and Auditory Data Streams. Turkish Journal of Nature and Science 15(2), 207–214.
IEEE
[1] Y. Çelik, “Scene Classification via Attention-Guided Integration of Visual and Auditory Data Streams”, TJNS, vol. 15, no. 2, pp. 207–214, July 2026, doi: 10.46810/tdfd.1828359.
ISNAD
Çelik, Yusuf. “Scene Classification via Attention-Guided Integration of Visual and Auditory Data Streams”. Turkish Journal of Nature and Science 15/2 (July 1, 2026): 207-214. https://doi.org/10.46810/tdfd.1828359.
JAMA
1. Çelik Y. Scene Classification via Attention-Guided Integration of Visual and Auditory Data Streams. TJNS. 2026;15:207–214.
MLA
Çelik, Yusuf. “Scene Classification via Attention-Guided Integration of Visual and Auditory Data Streams”. Turkish Journal of Nature and Science, vol. 15, no. 2, July 2026, pp. 207-14, doi:10.46810/tdfd.1828359.
Vancouver
1. Yusuf Çelik. Scene Classification via Attention-Guided Integration of Visual and Auditory Data Streams. TJNS. 2026 Jul. 1;15(2):207-14. doi:10.46810/tdfd.1828359