<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.4 20241031//EN"
        "https://jats.nlm.nih.gov/publishing/1.4/JATS-journalpublishing1-4.dtd">
<article article-type="research-article" dtd-version="1.4">
            <front>

                <journal-meta>
                                                                <journal-id>jnse</journal-id>
            <journal-title-group>
                                                                                    <journal-title>Journal of Naval Sciences and Engineering</journal-title>
            </journal-title-group>
                            <issn pub-type="ppub">1304-2025</issn>
                                                                                                        <publisher>
                    <publisher-name>Millî Savunma Üniversitesi</publisher-name>
                </publisher>
                    </journal-meta>
                <article-meta>
                                        <article-id pub-id-type="doi">10.56850/jnse.1828189</article-id>
                                                                <article-categories>
                                            <subj-group  xml:lang="en">
                                                            <subject>Computer Vision</subject>
                                                            <subject>Natural Language Processing</subject>
                                                    </subj-group>
                                            <subj-group  xml:lang="tr">
                                                            <subject>Bilgisayar Görüşü</subject>
                                                            <subject>Doğal Dil İşleme</subject>
                                                    </subj-group>
                                    </article-categories>
                                                                                                                                                        <title-group>
                                                                                                                        <trans-title-group xml:lang="tr">
                                    <trans-title>Mobil Uygulama İle Derin Öğrenme Tabanlı Nesne Tespiti ve Büyük Dil Modeli İle İfade Üretme</trans-title>
                                </trans-title-group>
                                                                                                                                                                                                <article-title>DEEP LEARNING-BASED OBJECT DETECTION WITH MOBILE APPLICATION AND EXPRESSION GENERATION USING A LARGE LANGUAGE MODEL</article-title>
                                                                                                    </title-group>
            
                                                    <contrib-group content-type="authors">
                            <contrib contrib-type="author">
                                <contrib-id contrib-id-type="orcid">https://orcid.org/0009-0009-6072-6990</contrib-id>
                                <name>
                                    <surname>Dere</surname>
                                    <given-names>Nurcihan</given-names>
                                </name>
                                <aff>Architecht Information Systems and Marketing Trade</aff>
                            </contrib>
                            <contrib contrib-type="author">
                                <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-6999-1410</contrib-id>
                                <name>
                                    <surname>Yıldız</surname>
                                    <given-names>Kazım</given-names>
                                </name>
                                <aff>Marmara University</aff>
                            </contrib>
                            <contrib contrib-type="author">
                                <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-4540-663X</contrib-id>
                                <name>
                                    <surname>Demir</surname>
                                    <given-names>Önder</given-names>
                                </name>
                                <aff>Marmara University</aff>
                            </contrib>
                                                                                </contrib-group>
                        
                                                                <issue>Advanced Online Publication</issue>
                                        <fpage>69</fpage>
                                        <lpage>93</lpage>
                        
                        <history>
                                    <date date-type="received" iso-8601-date="20251121">
                        <day>11</day>
                        <month>21</month>
                        <year>2025</year>
                    </date>
                                                    <date date-type="accepted" iso-8601-date="20251216">
                        <day>12</day>
                        <month>16</month>
                        <year>2025</year>
                    </date>
                            </history>
                                        <permissions>
                    <copyright-statement>Copyright © 2003, Journal of Naval Sciences and Engineering</copyright-statement>
                    <copyright-year>2003</copyright-year>
                    <copyright-holder>Journal of Naval Sciences and Engineering</copyright-holder>
                </permissions>
            
                                                                                                <trans-abstract xml:lang="tr">
                            <p>Bu çalışma, kullanıcıların çevrelerindeki nesneleri algılamalarını, bu nesnelerin uzaklıklarını ölçmelerini ve nesneler arasındaki konumsal ilişkileri anlamalarını sağlayan bütünleşik bir mobil çözüm sunmaktadır. Sistem, YOLOv11 tabanlı gerçek zamanlı nesne tespiti, LiDAR destekli mesafe ölçümü ve GPT-4o’nun ifade üretimini bir araya getirerek kullanıcının istediği nesneyi bulmasını ve nesnenin çevresindeki diğer nesneleri de öğrenmesini sağlamaktadır. Bu sayede kullanıcı yalnızca nesnelerin varlığını değil, aynı zamanda konumlarını ve birbirleriyle olan konumsal düzenlerini de öğrenebilmektedir. Çalışmada, nesne tespiti sırasında görüntüler mobil uygulama ile yakalanarak nesnenin her zaman görsel çerçeve içerisinde yer alması sağlanır. Bu, görme engelli kullanıcıların oluşturduğu fotoğraflarda sıklıkla karşılaşılan bulanıklık ve yanlış çerçeveleme gibi sorunların önüne geçer. Deneysel sonuçlar, YOLOv11 modelinin 0.77 F1 puanı ve 0.806 mAP değeri ile etkili bir performans ortaya koyduğunu göstermektedir. Ayrıca ince ayar gerçekleştirilen GPT-4o modeli, görüntülerdeki nesne konumlarını doğru biçimde belirleyerek nesneyi ve etrafındaki diğer nesneleri içeren ifadeler üretmektedir. Bu çalışma, nesne tespiti, LiDAR tabanlı mesafe ölçümü ve büyük bir dil modelinin ifade üretimini birleştiren bir sistem önermektedir. Gelecekte daha gelişmiş çözümlerin uygulanması için bir referans oluşturmaktadır.</p></trans-abstract>
                                                                                                                                    <abstract><p>This work presents an integrated mobile solution that allows users to detect objects in their environment, measure their distances, and understand the spatial relationships between them. The system combines YOLOv11-based real-time object detection, LiDAR-assisted distance measurement, and GPT-4o expression generation, enabling users to locate a desired object and learn about the objects around it. In this way, the user learns not only that objects are present but also where they are located and how they are arranged relative to one another. In this study, images are captured with the mobile application during object detection, ensuring that the object always remains within the visual frame. This prevents problems such as blurring and incorrect framing, which are frequently encountered in photographs taken by visually impaired users. Experimental results show that the YOLOv11 model achieves effective performance, with an F1 score of 0.77 and a mAP value of 0.806. Furthermore, the fine-tuned GPT-4o model correctly identifies object locations in images and generates expressions that describe the target object and the other objects surrounding it. The proposed system thus integrates object detection, LiDAR-based distance measurement, and expression generation by a large language model, and provides a reference for the implementation of more advanced solutions in the future.</p></abstract>
                                                            
            
                                                                                        <kwd-group>
                                                    <kwd>Object Detection</kwd>
                                                    <kwd>YOLOv11</kwd>
                                                    <kwd>Deep Learning</kwd>
                                                    <kwd>GPT-4o</kwd>
                                                    <kwd>Mobile Application</kwd>
                                            </kwd-group>
                            
                                                <kwd-group xml:lang="tr">
                                                    <kwd>Nesne Tespiti</kwd>
                                                    <kwd>YOLOv11</kwd>
                                                    <kwd>Derin Öğrenme</kwd>
                                                    <kwd>GPT-4o</kwd>
                                                    <kwd>Mobil Uygulama</kwd>
                                            </kwd-group>
                                                                                                                                        </article-meta>
    </front>
    <back>
                            <ref-list>
                                    <ref id="ref1">
                        <label>1</label>
                        <mixed-citation publication-type="journal">Abed, A. A., Al-Ibadi, A., &amp; Abed, I. A. (2023). Real-time multiple face mask and fever detection using YOLOv3 and TensorFlow lite platforms. Bulletin of Electrical Engineering and Informatics, 12(2), 922-929.</mixed-citation>
                    </ref>
                                    <ref id="ref2">
                        <label>2</label>
                        <mixed-citation publication-type="journal">Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., &amp; Anadkat, S. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.</mixed-citation>
                    </ref>
                                    <ref id="ref3">
                        <label>3</label>
                        <mixed-citation publication-type="journal">Alamsyah, D. P., Ramdhani, Y., Syam, A. T., &amp; Setiadi, A. (2022). Augmented Reality English Education Based iOS with MobileNetV2 Image Recognition Model. 2022 Seventh International Conference on Informatics and Computing (ICIC),</mixed-citation>
                    </ref>
                                    <ref id="ref4">
                        <label>4</label>
                        <mixed-citation publication-type="journal">Alemdar, K. D., Kayacı Çodur, M., Codur, M. Y., &amp; Uysal, F. (2023). Environmental Effects of Driver Distraction at Traffic Lights: Mobile Phone Use. Sustainability, 15(20), 15056.</mixed-citation>
                    </ref>
                                    <ref id="ref5">
                        <label>5</label>
                        <mixed-citation publication-type="journal">Boyar, T., &amp; Yıldız, K. (2022). Powdery mildew detection in hazelnut with deep learning. Hittite Journal of Science and Engineering, 9(3), 159-166.</mixed-citation>
                    </ref>
                                    <ref id="ref6">
                        <label>6</label>
                        <mixed-citation publication-type="journal">Chen, C., Anjum, S., &amp; Gurari, D. (2022). Grounding answers for visual questions asked by visually impaired people. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
 
Chen, C., Tseng, Y.-Y., Li, Z., Venkatesh, A., &amp; Gurari, D. (2025). Acknowledging Focus Ambiguity in Visual Questions. arXiv preprint arXiv:2501.02201.</mixed-citation>
                    </ref>
                                    <ref id="ref7">
                        <label>7</label>
                        <mixed-citation publication-type="journal">Chen, J., &amp; Zhu, Z. (2023). Real-time 3D object detection, recognition and presentation using a mobile device for assistive navigation. SN Computer Science, 4(5), 543.
 
Furniture Computer Vision Dataset. (2022).  Retrieved 19.11.2025 from https://universe.roboflow.com/objectdetection-uzld5/furniture-ngpea-h6zxi/</mixed-citation>
                    </ref>
                                    <ref id="ref8">
                        <label>8</label>
                        <mixed-citation publication-type="journal">Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., &amp; Bigham, J. P. (2018). Vizwiz grand challenge: Answering visual questions from blind people. Proceedings of the IEEE conference on computer vision and pattern recognition,
 
Han, X., Zhang, Z., Ding, N., Gu, Y., Liu, X., Huo, Y., Qiu, J., Yao, Y., Zhang, A., &amp; Zhang, L. (2021). Pre-trained models: Past, present and future. AI Open, 2, 225-250.
 
He, L., Zhou, Y., Liu, L., Zhang, Y., &amp; Ma, J. (2025). Application of the YOLOv11-seg algorithm for AI-based landslide detection and recognition. Scientific Reports, 15(1), 12421.</mixed-citation>
                    </ref>
                                    <ref id="ref9">
                        <label>9</label>
                        <mixed-citation publication-type="journal">HomeObjects. (2025).  Retrieved 19.11.2025 from https://app.roboflow.com/objectdetection-uzld5/homeobjects/4</mixed-citation>
                    </ref>
                                    <ref id="ref10">
                        <label>10</label>
                        <mixed-citation publication-type="journal">Huh, M., Xu, F., Peng, Y.-H., Chen, C., Gurari, D., Choi, E., &amp; Pavel, A. (2024). Long-form answers to visual questions from blind and low vision people. Workshop on Demographic Diversity in Computer Vision@ CVPR 2025,</mixed-citation>
                    </ref>
                                    <ref id="ref11">
                        <label>11</label>
                        <mixed-citation publication-type="journal">Khoshsirat, S., &amp; Kambhamettu, C. (2023). Embedding attention blocks for the vizwiz answer grounding challenge. VizWiz Grand Challenge Workshop,</mixed-citation>
                    </ref>
                                    <ref id="ref12">
                        <label>12</label>
                        <mixed-citation publication-type="journal">Kotthapalli, M., Ravipati, D., &amp; Bhatia, R. (2025). YOLOv1 to YOLOv11: A comprehensive survey of real-time object detection innovations and challenges. arXiv preprint arXiv:2508.02067.</mixed-citation>
                    </ref>
                                    <ref id="ref13">
                        <label>13</label>
                        <mixed-citation publication-type="journal">Kumar, S., Ratan, R., &amp; Desai, J. (2022). Cotton disease detection using tensorflow machine learning technique. Advances in Multimedia, 2022.</mixed-citation>
                    </ref>
                                    <ref id="ref14">
                        <label>14</label>
                        <mixed-citation publication-type="journal">Liao, Y., Li, L., Xiao, H., Xu, F., Shan, B., &amp; Yin, H. (2025). YOLO-MECD: citrus detection algorithm based on YOLOv11. Agronomy, 15(3), 687.</mixed-citation>
                    </ref>
                                    <ref id="ref15">
                        <label>15</label>
                        <mixed-citation publication-type="journal">Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., &amp; Zitnick, C. L. (2014). Microsoft coco: Common objects in context. Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13,</mixed-citation>
                    </ref>
                                    <ref id="ref16">
                        <label>16</label>
                        <mixed-citation publication-type="journal">Mahi, A. B. S., Eshita, F. S., &amp; Helaly, T. (2023). An automated system for wrong-way vehicle detection using yolo and deepsort. 2023 5th International Conference on Sustainable Technologies for Industry 5.0 (STI).</mixed-citation>
                    </ref>
                                    <ref id="ref17">
                        <label>17</label>
                        <mixed-citation publication-type="journal">Massiceti, D., Zintgraf, L., Bronskill, J., Theodorou, L., Harris, M. T., Cutrell, E., Morrison, C., Hofmann, K., &amp; Stumpf, S. (2021). Orbit: A real-world few-shot dataset for teachable object recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision.</mixed-citation>
                    </ref>
                                    <ref id="ref18">
                        <label>18</label>
                        <mixed-citation publication-type="journal">Moreira, F. W. R., Hermes, G., &amp; de Lima, J. M. M. (2024). Development of a Cross Platform Mobile Application Using Gemini to Assist Visually Impaired Individuals. 2024 9th International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS).
 
Morishita, M., Fukuda, H., Yamaguchi, S., Muraoka, K., Nakamura, T., Hayashi, M., Yoshioka, I., Ono, K., &amp; Awano, S. (2024). An exploratory assessment of GPT-4o and GPT-4 performance on the Japanese National Dental Examination. The Saudi Dental Journal, 36(12), 1577-1581.</mixed-citation>
                    </ref>
                                    <ref id="ref19">
                        <label>19</label>
                        <mixed-citation publication-type="journal">Open Neural Network Exchange.  Retrieved 10.12.2025 from https://onnx.ai Prechelt, L. (2002). Early stopping-but when? In Neural Networks: Tricks of the trade (pp. 55-69). Springer.</mixed-citation>
                    </ref>
                                    <ref id="ref20">
                        <label>20</label>
                        <mixed-citation publication-type="journal">Pudari, R., Bhutada, S., &amp; Mudavath, S. P. (2020). Real Time Face Recognition Using Convoluted Neural Networks. arXiv preprint arXiv:2010.04517.
 
Sujaini, H., Ramadhan, E. Y., &amp; Novriando, H. (2021). Comparing the performance of linear regression versus deep learning on detecting melanoma skin cancer using apple core ML. Bulletin of Electrical Engineering and Informatics, 10(6), 3110-3120.</mixed-citation>
                    </ref>
                                    <ref id="ref21">
                        <label>21</label>
                        <mixed-citation publication-type="journal">Tautkute, I., Możejko, A., Stokowiec, W., Trzciński, T., Brocki, Ł., &amp; Marasek, K. (2017). What looks good with my sofa: Multimodal search engine for interior design. 2017 Federated Conference on Computer Science and Information Systems (FedCSIS).</mixed-citation>
                    </ref>
                                    <ref id="ref22">
                        <label>22</label>
                        <mixed-citation publication-type="journal">Tinn, R., Cheng, H., Gu, Y., Usuyama, N., Liu, X., Naumann, T., Gao, J., &amp; Poon, H. (2023). Fine-tuning large neural language models for biomedical natural language processing. Patterns, 4(4).</mixed-citation>
                    </ref>
                                    <ref id="ref23">
                        <label>23</label>
                        <mixed-citation publication-type="journal">Wang, Z., Li, C., Xu, H., Zhu, X., &amp; Li, H. (2025). Mamba YOLO: A Simple Baseline for Object Detection with State Space Model. Proceedings of the AAAI Conference on Artificial Intelligence.</mixed-citation>
                    </ref>
                                    <ref id="ref24">
                        <label>24</label>
                        <mixed-citation publication-type="journal">Wehr, A., &amp; Lohr, U. (1999). Airborne laser scanning—an introduction and overview. ISPRS Journal of photogrammetry and remote sensing, 54(2-3), 68-82.</mixed-citation>
                    </ref>
                            </ref-list>
                    </back>
    </article>
