Research Article

Design of Audio Description System Using Cloud Based Computer Vision

Year 2020, Volume: 4 Issue: 1, 74 - 85, 13.03.2020
https://doi.org/10.31200/makuubd.651261

Abstract

Developments and changes in multimedia tools are actively used in many areas of life and add considerable value to them. Artificial intelligence is now highly developed, and hundreds of applications and methods exist to support living standards, especially for people with disabilities. The system developed in this study automatically describes, by means of computer vision, the scenes of video media such as movies and documentaries watched by visually impaired people, and conveys the results to users as speech. HTML5 and CSS are used for the system's presentation layer, PHP and JavaScript for its programming, and MySQL is preferred as its database. Computer vision, text-to-speech conversion, and translation from one language to another are the main instruments of this study. The cloud-based Microsoft Azure Computer Vision API is used for computer vision, the JavaScript Responce.js library for text-to-speech conversion, and the Google Cloud Translation and Microsoft Azure Translator Text APIs for translation from one language to another.

References

  • ADI AD Guidelines Committee (2003). Guidelines for Audio Description. Retrieved June 23, 2019, from http://www.acb.org/adp/guidelines.html
  • Aslan, E. (2018). Otomatik Çeviri Araçlarının Yabancı Dil Öğretiminde Kullanımı: Google Çeviri Örneği, Selçuk Üniversitesi Edebiyat Fakültesi Dergisi, 0(39), 87-104.
  • Aydemir, E. (2018). Weka ile Yapay Zeka, Ankara: Seçkin Yayıncılık.
  • Benecke, B. (2004). Audio-Description, Meta, 49 (1), 78–80.
  • Carvalho, P., Trancoso, I.M., & Oliveira, L.C. (1998). Automatic Segment Alignment for Concatenative Speech Synthesis in Portuguese, Proc. of the 10th Portuguese Conference on Pattern Recognition, RECPAD'98, Lisbon.
  • Dawson-Howe, K. (2014). A Practical Introduction to Computer Vision with OpenCV, John Wiley & Sons.
  • Delgado, H., Matamala, A. & Serrano, J. (2015). Speaker diarization and speech recognition in the semi-automatization of audio description: An exploratory study on future possibilities?, Cadernos de Tradução, 35(2), 308-324.
  • Gagnon, L., Chapdelaine, C., Byrns, D., Foucher, S., Heritier, M. & Gupt, V. (2010). A computer-vision-assisted system for Videodescription scripting. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, San Francisco, CA, USA, 1-8.
  • Google Cloud (n.d.). Cloud Translation. Retrieved May 15, 2019, from https://cloud.google.com/translate/
  • Jang, I., Ahn, C. & Jang, Y. (2014). Semi-automatic DVS Authoring Method. Computers Helping People with Special Needs: 14th International Conference, ICCHP 2014. Springer International Publishing, Switzerland.
  • Klancnik, S., Ficko, M., Balic, J. & Pahole, I. (2015). Computer Vision-Based Approach to End Mill Tool Monitoring, International Journal of Simulation Modelling, 14(4), 571–583.
  • Krishna, R. (2017). Computer Vision: Foundations and Applications, Stanford: Stanford University.
  • Lakritz, J. & Salway, A. (2006). The semi-automatic generation of audio description from screenplays, Dept. of Computing Technical Report, University of Surrey, UK.
  • Microsoft Azure (n.d.). Translator Text API. Retrieved June 26, 2019, from https://azure.microsoft.com/en-gb/services/cognitive-services/translator-text-api
  • Nabiyev, V.V. (2016). Yapay Zeka (5th ed.), Ankara: Seçkin Yayıncılık.
  • Netflix (n.d.). Netflix Audio Description Style Guide v2.1. Retrieved November 12, 2019, from https://partnerhelp.netflixstudios.com/hc/en-us/articles/215510667-Audio-Description-Style-Guide-v2-1
  • O'Malley, M. H. (1990). Text-to-speech conversion technology, Computer, 23(8), 17-23.
  • Pagani, M. (2005). Encyclopedia of multimedia technology and networking, Hershey PA, USA: Idea Group Inc.
  • Remael, A., Reviers, N. & Vercauteren, G. (n.d.). ADLAB Audio Description guideline, Retrieved June 24, 2019, http://www.adlabproject.eu/Docs/adlab%20book/index.html.
  • Rohrbach, A., Torabi, A., Rohrbach, M., Tandon, N., Pal, C., Larochelle, H., Courville, A. & Schiele, B. (2017). Movie Description, International Journal of Computer Vision, 123(1), 94–120.
  • Whitehead, J. (2015). What is audio description. International Congress Series, 1282, 960-963.

Bulut Tabanlı Bilgisayarlı Görü Kullanılarak Sesli Betimleme Sistem Tasarımı

There are 21 citations in total.

Details

Primary Language English
Journal Section Articles
Authors

Emre Karagöz 0000-0002-4887-8168

Kutan Koruyan 0000-0002-3115-5676

Publication Date March 13, 2020
Acceptance Date February 1, 2020
Published in Issue Year 2020 Volume: 4 Issue: 1

Cite

APA Karagöz, E., & Koruyan, K. (2020). Design of Audio Description System Using Cloud Based Computer Vision. Mehmet Akif Ersoy Üniversitesi Uygulamalı Bilimler Dergisi, 4(1), 74-85. https://doi.org/10.31200/makuubd.651261

