Conference Paper

Merge Model Based Image Captioning for Smartphones

Year 2021, Issue 26, 191-196, 31.07.2021
https://doi.org/10.31590/ejosat.950924

Abstract

Image captioning is the task of generating a textual description of an image using both natural language processing and computer vision. Describing the visual content of an image to a machine has attracted increasing attention in recent years due to its potential applications. In this study, an image captioning system based on an encoder-decoder merge model, applicable to smartphones, is proposed. In the proposed merge model, a VGG16 convolutional neural network is used as the encoder for image features, and a long short-term memory (LSTM) network is used to encode the word features. The encoded forms of the image features and the generated word features are then merged in the proposed model. This combination of the two encoded inputs is passed to a very simple decoder model that generates the next word in the sequence, successfully producing natural-language captions for images. The proposed system was tested with the BLEU-n metric on the Flickr8k and Flickr30k datasets, and its superiority was shown by comparison with studies in the literature. Unlike similar studies, the proposed system was also deployed in ImCap, our Android application developed to generate image captions without an internet connection. With this application, we aim to bring image captioning to more users.
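The encoder-decoder merge architecture summarized above can be sketched in a few lines of Keras. The snippet below is a minimal illustration under stated assumptions, not the authors' exact configuration: it assumes 4096-dimensional VGG16 fc2 features are extracted beforehand and captions are already tokenized and padded, and the vocabulary size, maximum caption length, and 256-unit layer widths are placeholder values.

```python
# Minimal merge-model sketch (Keras). vocab_size, max_len and the 256-unit
# layer widths are illustrative placeholders, not the paper's settings.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 7579   # assumed tokenizer vocabulary size
max_len = 34        # assumed maximum caption length in tokens

# Image branch: re-encode the pre-extracted VGG16 fc2 feature vector.
img_in = Input(shape=(4096,), name="vgg16_features")
img_enc = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Word branch: embed the partial caption and encode it with an LSTM.
seq_in = Input(shape=(max_len,), name="caption_prefix")
seq_emb = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
seq_enc = LSTM(256)(Dropout(0.5)(seq_emb))

# Merge the two encodings; a simple decoder predicts the next word.
merged = add([img_enc, seq_enc])
hidden = Dense(256, activation="relu")(merged)
next_word = Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[img_in, seq_in], outputs=next_word)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At inference time the caption is built word by word: the image vector and the current prefix are fed to the model, the most probable word is appended, and generation stops at an end token or after max_len steps. For offline use on Android, a trained model of this kind would typically be converted to a mobile format such as TensorFlow Lite, though the deployment details of ImCap are not specified here. Scoring generated captions against reference captions with BLEU-n can likewise be sketched with NLTK; the weights below correspond to BLEU-1 and BLEU-4, and the tokens are toy data rather than Flickr8k captions.

```python
# Hedged BLEU-n example with NLTK on toy data.
from nltk.translate.bleu_score import corpus_bleu

references = [[["a", "dog", "runs", "on", "the", "grass"]]]  # reference captions per image
candidates = [["a", "dog", "is", "running", "on", "grass"]]  # generated captions
print("BLEU-1:", corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
print("BLEU-4:", corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25)))
```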

References

  • Bigham, J. P., Jayant, C., Ji, H., Little, G., Miller, A., Miller, R. C., . . . White, S. (2010). Vizwiz: nearly real-time answers to visual questions. Paper presented at the Proceedings of the 23rd annual ACM symposium on User interface software and technology.
  • Brownlee, J. (2019). A gentle introduction to pooling layers for convolutional neural networks. Machine Learning Mastery, 22.
  • Çaylı, Ö., Makav, B., Kılıç, V., & Onan, A. (2020). Mobile Application Based Automatic Caption Generation for Visually Impaired. Paper presented at the International Conference on Intelligent and Fuzzy Systems.
  • Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015). Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  • Chen, X., & Zitnick, C. L. (2014). Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654.
  • Elliott, D., & Keller, F. (2013). Image description using visual dependency representations. Paper presented at the Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.
  • Flair, D. (2019). Python based Project – Learn to Build Image Caption Generator with CNN and LSTM.
  • Hendricks, L. A., Venugopalan, S., Rohrbach, M., Mooney, R., Saenko, K., & Darrell, T. (2016). Deep compositional captioning: Describing novel object categories without paired training data. Paper presented at the Proceedings of the IEEE conference on computer vision and pattern recognition.
  • Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47, 853-899.
  • Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A comprehensive survey of deep learning for image captioning. ACM Computing Surveys, 51(6), 1-36.
  • Kiros, R., Salakhutdinov, R., & Zemel, R. S. (2014). Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint.
  • Kuznetsova, P., Ordonez, V., Berg, T. L., & Choi, Y. (2014). Treetalk: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2, 351-362.
  • Leon, V., Mouselinos, S., Koliogeorgi, K., Xydis, S., Soudris, D., & Pekmestzi, K. (2020). A TensorFlow extension framework for optimized generation of hardware CNN inference engines. Technologies, 8(1), 6.
  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., . . . Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. Paper presented at the European Conference on Computer Vision.
  • Makav, B., & Kılıç, V. (2019a). A new image captioning approach for visually impaired people. Paper presented at the 2019 11th International Conference on Electrical and Electronics Engineering (ELECO).
  • Makav, B., & Kılıç, V. (2019b). Smartphone-based image captioning for visually and hearing impaired. Paper presented at the 2019 11th International Conference on Electrical and Electronics Engineering (ELECO).
  • Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., & Yuille, A. (2014). Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint.
  • Mason, R., & Charniak, E. (2014). Nonparametric method for data-driven image captioning. Paper presented at the Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.
  • Mathews, A., Xie, L., & He, X. (2018). Semstyle: Learning to generate stylised image captions using unaligned text. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. Paper presented at the Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.
  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Socher, R., Karpathy, A., Le, Q. V., Manning, C. D., & Ng, A. Y. (2014). Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2, 207-218.
  • Tanti, M., Gatt, A., & Camilleri, K. P. (2018). Where to put the image in an image caption generator. Natural Language Engineering, 24(3), 467-489.
  • Wang, H., Wang, H., & Xu, K. (2020). Evolutionary Recurrent Neural Network for Image Captioning. Neurocomputing.
  • Zhang, L., Wang, S., & Liu, B. (2018). Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), e1253.

Details

Primary Language: Turkish
Subjects: Engineering
Journal Section: Articles
Authors

Muharrem Baran (ORCID: 0000-0001-7394-3649)

Özge Taylan Moral (ORCID: 0000-0003-0482-267X)

Volkan Kılıç (ORCID: 0000-0002-3164-1981)

Publication Date: July 31, 2021
Published in Issue: Year 2021, Issue 26

Cite

APA Baran, M., Moral, Ö. T., & Kılıç, V. (2021). Akıllı Telefonlar için Birleştirme Modeli Tabanlı Görüntü Altyazılama. Avrupa Bilim ve Teknoloji Dergisi, (26), 191-196. https://doi.org/10.31590/ejosat.950924