Research Article

Exploring Bengali Image Descriptions through the combination of diverse CNN Architectures and Transformer Decoders

Year 2025, Volume: 9 Issue: 1, 64 - 78, 20.01.2025
https://doi.org/10.31127/tuje.1507442

Abstract

In recent years, image captioning, the task of generating one or more descriptions of an image that closely resemble those written by humans, has attracted growing research interest. Most existing studies in this area focus on the English language, using CNN and RNN variants as encoder and decoder models, often enhanced by attention mechanisms. Despite Bengali being the fifth most-spoken native language and the seventh most widely spoken language overall, it has received far less attention than resource-rich languages such as English. This study aims to bridge that gap by introducing a novel approach to image captioning in Bengali. By leveraging state-of-the-art Convolutional Neural Networks such as EfficientNetV2S, ConvNeXtSmall, and InceptionResNetV2 along with an improvised Transformer decoder, the proposed system achieves both computational efficiency and the generation of accurate, contextually relevant captions. Additionally, Bengali text-to-speech synthesis is incorporated into the framework to help visually impaired Bengali speakers understand their environment and visual content more effectively. The model is evaluated on a chimeric dataset that pairs Bengali descriptions from the BAN-Cap dataset with the corresponding images from the Flickr8k dataset. With the EfficientNetV2S encoder, the proposed model attains METEOR, CIDEr, and ROUGE scores of 0.34, 0.30, and 0.40, while BLEU scores for unigram, bigram, trigram, and four-gram matching are 0.66, 0.59, 0.44, and 0.26, respectively. The study demonstrates that the proposed approach produces precise image descriptions, outperforming other state-of-the-art models at generating Bengali descriptions.
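For readers who want a concrete picture of the pipeline the abstract describes, the following is a minimal, hedged sketch of a CNN-encoder plus Transformer-decoder captioning model in TensorFlow/Keras, chosen because the named backbones (EfficientNetV2S, ConvNeXtSmall, InceptionResNetV2) ship with keras.applications. The vocabulary size, caption length, embedding width, single decoder block, and optimizer settings below are illustrative assumptions, not the configuration reported in the paper.

```python
# Hedged sketch: CNN encoder + single-block Transformer decoder for captioning.
# All hyperparameters below are illustrative assumptions, not the paper's values.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 8000   # assumed Bengali word-level vocabulary
SEQ_LEN = 25        # assumed maximum caption length (tokens)
EMBED_DIM = 512
NUM_HEADS = 8


class TokenAndPositionEmbedding(layers.Layer):
    """Sums a learned token embedding and a learned position embedding."""

    def __init__(self, seq_len, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_emb = layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = layers.Embedding(seq_len, embed_dim)

    def call(self, tokens):
        positions = tf.range(start=0, limit=tf.shape(tokens)[-1], delta=1)
        return self.token_emb(tokens) + self.pos_emb(positions)


# Encoder: frozen EfficientNetV2S backbone; its spatial feature map is flattened
# into a sequence of visual tokens and projected to the decoder width.
backbone = keras.applications.EfficientNetV2S(
    include_top=False, weights="imagenet", input_shape=(300, 300, 3))
backbone.trainable = False

image_in = keras.Input(shape=(300, 300, 3), name="image")
feat = backbone(image_in)                                  # (h, w, channels)
feat = layers.Reshape((-1, feat.shape[-1]))(feat)          # (h*w, channels)
feat = layers.Dense(EMBED_DIM, activation="gelu")(feat)    # visual tokens

# Decoder: masked self-attention over the partial caption, then cross-attention
# over the visual tokens, then a position-wise feed-forward block.
caption_in = keras.Input(shape=(SEQ_LEN,), dtype="int32", name="caption")
x = TokenAndPositionEmbedding(SEQ_LEN, VOCAB_SIZE, EMBED_DIM)(caption_in)

self_attn = layers.MultiHeadAttention(NUM_HEADS, key_dim=EMBED_DIM // NUM_HEADS)
x = layers.LayerNormalization()(x + self_attn(x, x, use_causal_mask=True))

cross_attn = layers.MultiHeadAttention(NUM_HEADS, key_dim=EMBED_DIM // NUM_HEADS)
x = layers.LayerNormalization()(x + cross_attn(x, feat))

ffn = keras.Sequential([layers.Dense(2 * EMBED_DIM, activation="gelu"),
                        layers.Dense(EMBED_DIM)])
x = layers.LayerNormalization()(x + ffn(x))

# Per-position distribution over the Bengali vocabulary (next-word prediction).
logits = layers.Dense(VOCAB_SIZE)(x)

model = keras.Model([image_in, caption_in], logits)
model.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=1e-4),  # TF >= 2.11 assumed
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
```

The paper's own decoder and training setup will differ in depth and detail; the sketch only illustrates where the CNN feature map enters the decoder through cross-attention, with GELU activations and AdamW echoing works the article cites (Hendrycks & Gimpel, 2016; Loshchilov & Hutter, 2017).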

Ethical Statement

The authors declare no conflict of interest.

Project Number

NA

References

  • Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3156-3164).
  • Gülgün, O. D., & Erol, H. (2020). Classification performance comparisons of deep learning models in pneumonia diagnosis using chest x-ray images. Turkish Journal of Engineering, 4(3), 129-141. https://doi.org/10.31127/tuje.652358
  • Atalay Aydın, V. (2024). Comparison of CNN-based methods for yoga pose classification. Turkish Journal of Engineering, 8(1), 65-75. https://doi.org/10.31127/tuje.1275826
  • Pajaziti, A., Basholli, F., & Zhaveli, Y. (2023). Identification and classification of fruits through robotic system by using artificial intelligence. Engineering Applications, 2(2), 154-163.
  • Meghraoui, K., Sebari, I., Bensiali, S., & El Kadi, K. A. (2022). On behalf of an intelligent approach based on 3D CNN and multimodal remote sensing data for precise crop yield estimation: Case study of wheat in Morocco. Advanced Engineering Science, 2, 118-126.
  • Singh, S., Kumar, K., & Kumar, B. (2024). Analysis of feature extraction techniques for sentiment analysis of tweets. Turkish Journal of Engineering, 8(4), 741-753. https://doi.org/10.31127/tuje.1477502
  • Othman, M. M. (2023). Modeling of daily groundwater level using deep learning neural networks. Turkish Journal of Engineering, 7(4), 331-337. https://doi.org/10.31127/tuje.1169908
  • Sen, O., Fuad, M., Islam, M. N., Rabbi, J., Masud, M., Hasan, M. K., ... & Iftee, M. A. R. (2022). Bangla natural language processing: A comprehensive analysis of classical, machine learning, and deep learning-based methods. IEEE Access, 10, 38999-39044.
  • Patra, B., & Kisku, D. R. (2023, December). Precise and Faster Image Description Generation with Limited Resources Using an Improved Hybrid Deep Model. In International Conference on Pattern Recognition and Machine Intelligence (pp. 166-175). Cham: Springer Nature Switzerland.
  • Yu, J., Li, J., Yu, Z., & Huang, Q. (2019). Multimodal transformer with multi-view visual representation for image captioning. IEEE Transactions on Circuits and Systems for Video Technology, 30(12), 4467-4480.
  • Parvin, H., Naghsh-Nilchi, A. R., & Mohammadi, H. M. (2023). Image captioning using transformer-based double attention network. Engineering Applications of Artificial Intelligence, 125, 106545.
  • Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Miller, A., Fisch, A., Dodge, J., Karimi, A. H., Bordes, A., & Weston, J. (2016). Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126.
  • Li, Z., Li, Y., & Lu, H. (2019). Improve image captioning by self-attention. In Neural Information Processing: 26th International Conference, ICONIP 2019, Sydney, NSW, Australia, December 12–15, 2019, Proceedings, Part V 26 (pp. 91-98). Springer International Publishing.
  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., ... & Bengio, Y. (2015, June). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048-2057). PMLR.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
  • Cornia, M., Stefanini, M., Baraldi, L., & Cucchiara, R. (2020). Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10578-10587).
  • Vasireddy, I., HimaBindu, G., & Ratnamala, B. (2023). Transformative Fusion: Vision Transformers and GPT-2 Unleashing New Frontiers in Image Captioning within Image Processing. International Journal of Innovative Research in Engineering & Management, 10(6), 55-59.
  • Mishra, S., Seth, S., Jain, S., Pant, V., Parikh, J., Jain, R., & Islam, S. M. (2024, May). Image Caption Generation using Vision Transformer and GPT Architecture. In 2024 2nd International Conference on Advancement in Computation & Computer Technologies (InCACCT) (pp. 1-6). IEEE.
  • Hu, Y., Hua, H., Yang, Z., Shi, W., Smith, N. A., & Luo, J. (2023). Promptcap: Prompt-guided image captioning for vqa with gpt-3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2963-2975).
  • Kurlekar, S., Deshpande, O., Kamble, A., Omanna, A., & Patil, D. (2020). Reading Device for Blind People using Python OCR and GTTS. International Journal of Science and Engineering Applications, 9(4), 049-052.
  • Granquist, C., Sun, S. Y., Montezuma, S. R., Tran, T. M., Gage, R., & Legge, G. E. (2021). Evaluation and Comparison of Artificial Intelligence Vision Aids: Orcam MyEye 1 and Seeing AI. Journal of Visual Impairment & Blindness, 115(4), 277-285. https://doi.org/10.1177/0145482X211027492
  • Coughlan, J. M., & Miele, J. (2017, October). AR4VI: AR as an accessibility tool for people with visual impairments. In 2017 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct) (pp. 288-292). IEEE.
  • Wu, S., Wieland, J., Farivar, O., & Schiller, J. (2017, February). Automatic alt-text: Computer-generated image descriptions for blind users on a social network service. In proceedings of the 2017 ACM conference on computer supported cooperative work and social computing (pp. 1180-1192).
  • Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., ... & Bigham, J. P. (2018). Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3608-3617).
  • Doore, S. A., Istrati, D., Xu, C., Qiu, Y., Sarrazin, A., & Giudice, N. A. (2024). Images, Words, and Imagination: Accessible Descriptions to Support Blind and Low Vision Art Exploration and Engagement. Journal of Imaging, 10(1), 26.
  • Shrestha, R. (2022, February). A transformer-based deep learning model for evaluation of accessibility of image descriptions. In Proceedings of the 2022 14th International Conference on Machine Learning and Computing (pp. 28-33).
  • Mishra, S. K., Harshit, Saha, S., & Bhattacharyya, P. (2022). An Object Localization-based Dense Image Captioning Framework in Hindi. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(2), 1-15.
  • Mishra, S. K., Dhir, R., Saha, S., Bhattacharyya, P., & Singh, A. K. (2021). Image captioning in Hindi language using transformer networks. Computers & Electrical Engineering, 92, 107114.
  • Rahman, M., Mohammed, N., Mansoor, N., & Momen, S. (2019). Chittron: An automatic bangla image captioning system. Procedia Computer Science, 154, 636-642.
  • Deb, T., Ali, M. Z. A., Bhowmik, S., Firoze, A., Ahmed, S. S., Tahmeed, M. A., ... & Rahman, R. M. (2019). Oboyob: A sequential-semantic bengali image captioning engine. Journal of Intelligent & Fuzzy Systems, 37(6), 7427-7439.
  • Al Faraby, H., Azad, M. M., Fedous, M. R., & Morol, M. K. (2020, December). Image to Bengali caption generation using deep CNN and bidirectional gated recurrent unit. In 2020 23rd international conference on computer and information technology (ICCIT) (pp. 1-6). IEEE.
  • Humaira, M., Shimul, P., Jim, M. A. R. K., Ami, A. S., & Shah, F. M. (2021). A hybridized deep learning method for bengali image captioning. International Journal of Advanced Computer Science and Applications, 12(2).
  • Ami, A. S., Humaira, M., Jim, M. A. R. K., Paul, S., & Shah, F. M. (2020, December). Bengali image captioning with visual attention. In 2020 23rd International Conference on Computer and Information Technology (ICCIT) (pp. 1-5). IEEE.
  • Hossain, M. A., Hasan, M. A. R., Hossen, E., Asraful, M., Faruk, M. O., Abadin, A. Z., & Ali, M. S. Automatic Bangla Image Captioning Based on Transformer Model in Deep Learning.
  • Tan, M., & Le, Q. (2021, July). Efficientnetv2: Smaller models and faster training. In International conference on machine learning (pp. 10096-10106). PMLR.
  • Liu, Z., Mao, H., Wu, C. Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11976-11986).
  • Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. (2017, February). Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI conference on artificial intelligence (Vol. 31, No. 1).
  • Khan, M. F., Shifath, S. M., & Islam, M. S. (2022). BAN-cap: a multi-purpose English-Bangla image descriptions dataset. arXiv preprint arXiv:2205.14462.
  • Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
  • Chen, X., Kang, B., Wang, D., Li, D., & Lu, H. (2022, October). Efficient visual tracking via hierarchical cross-attention transformer. In European Conference on Computer Vision (pp. 461-477). Cham: Springer Nature Switzerland.
  • Arafat, M. Y., Fahrin, S., Islam, M. J., Siddiquee, M. A., Khan, A., Kotwal, M. R. A., & Huda, M. N. (2014, December). Speech synthesis for bangla text to speech conversion. In The 8th International Conference on Software, Knowledge, Information Management and Applications (SKIMA 2014) (pp. 1-6). IEEE.
  • Terven, J., Cordova-Esparza, D. M., Ramirez-Pedraza, A., & Chavez-Urbiola, E. A. (2023). Loss functions and metrics in deep learning. A review. arXiv preprint arXiv:2307.02694.
  • Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311-318).
  • Denkowski, M., & Lavie, A. (2014, June). Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation (pp. 376-380).
  • Lin, C. Y. (2004, July). Rouge: A package for automatic evaluation of summaries. In Text summarization branches out (pp. 74-81).
  • Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4566-4575).
There are 48 citations in total.

Details

Primary Language English
Subjects Information Systems (Other)
Journal Section Articles
Authors

Biswajit Patra 0009-0001-3001-1769

Dakshina Ranjan Kisku 0000-0003-1116-2972

Project Number NA
Early Pub Date January 17, 2025
Publication Date January 20, 2025
Submission Date June 30, 2024
Acceptance Date August 22, 2024
Published in Issue Year 2025 Volume: 9 Issue: 1

Cite

APA Patra, B., & Kisku, D. R. (2025). Exploring Bengali Image Descriptions through the combination of diverse CNN Architectures and Transformer Decoders. Turkish Journal of Engineering, 9(1), 64-78. https://doi.org/10.31127/tuje.1507442
AMA Patra B, Kisku DR. Exploring Bengali Image Descriptions through the combination of diverse CNN Architectures and Transformer Decoders. TUJE. January 2025;9(1):64-78. doi:10.31127/tuje.1507442
Chicago Patra, Biswajit, and Dakshina Ranjan Kisku. “Exploring Bengali Image Descriptions through the Combination of Diverse CNN Architectures and Transformer Decoders”. Turkish Journal of Engineering 9, no. 1 (January 2025): 64-78. https://doi.org/10.31127/tuje.1507442.
EndNote Patra B, Kisku DR (January 1, 2025) Exploring Bengali Image Descriptions through the combination of diverse CNN Architectures and Transformer Decoders. Turkish Journal of Engineering 9 1 64–78.
IEEE B. Patra and D. R. Kisku, “Exploring Bengali Image Descriptions through the combination of diverse CNN Architectures and Transformer Decoders”, TUJE, vol. 9, no. 1, pp. 64–78, 2025, doi: 10.31127/tuje.1507442.
ISNAD Patra, Biswajit - Kisku, Dakshina Ranjan. “Exploring Bengali Image Descriptions through the Combination of Diverse CNN Architectures and Transformer Decoders”. Turkish Journal of Engineering 9/1 (January 2025), 64-78. https://doi.org/10.31127/tuje.1507442.
JAMA Patra B, Kisku DR. Exploring Bengali Image Descriptions through the combination of diverse CNN Architectures and Transformer Decoders. TUJE. 2025;9:64–78.
MLA Patra, Biswajit and Dakshina Ranjan Kisku. “Exploring Bengali Image Descriptions through the Combination of Diverse CNN Architectures and Transformer Decoders”. Turkish Journal of Engineering, vol. 9, no. 1, 2025, pp. 64-78, doi:10.31127/tuje.1507442.
Vancouver Patra B, Kisku DR. Exploring Bengali Image Descriptions through the combination of diverse CNN Architectures and Transformer Decoders. TUJE. 2025;9(1):64-78.