Temporal Convolutional Networks for Efficient Voice Activity Detection with Event Camera

Original Turkish title: Olay Kamerası ile Verimli Konuşma Sesi Tespiti için Zamansal Evrişimsel Ağlar

Arman Savran

https://doi.org/10.38016/jista.1400047

Research Article

Year 2024, Volume: 7 Issue: 2, 102 - 115, 26.09.2024

Abstract

Voice activity detection (VAD) is an essential and widely used pre-processing step for human-computer interfaces. Complex acoustic background noise calls for large deep neural networks, at the cost of a heavy computational load. Visual VAD is an attractive alternative because it is not affected by acoustic background noise, and the video channel is the only option when audio data cannot be accessed. However, visual VAD is generally expected to run continuously for long periods, so the demands of video camera hardware and video data processing lead to significant energy consumption. This study examines the event camera, which is considerably more efficient than a conventional video camera thanks to neuromorphic technology, for vision-based VAD. Because the event camera senses at high temporal resolution, the spatial dimension is reduced away completely, and extremely lightweight yet successful models are designed that learn patterns only along the time dimension. The designs combine different types of convolution dilation, down-sampling methods, and convolution separation techniques while taking temporal receptive field sizes into account. The experiments measure the robustness of VAD against various facial actions. The results show that down-sampling is necessary for high performance and efficiency, and that max-pooling outperforms down-sampling with strided convolution. The resulting high-performance standard design requires 1.57 million floating-point operations (MFLOPs). Performing dilated convolution with a constant factor and combining it with down-sampling reduces the processing requirement by more than half at similar performance. Additionally applying depthwise separation reduces the processing requirement to 0.30 MFLOPs, less than one-fifth of that of the standard model.
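The interplay between dilation, down-sampling, and separable convolution described in the abstract can be illustrated with a small calculation. The sketch below is not the architecture from the article: the helper names (`Layer1d`, `receptive_field`, `conv_macs`), the kernel sizes, the layer counts, and the channel widths are all assumptions chosen for illustration. It computes the temporal receptive field of a stack of 1-D layers using the standard recurrence (each layer adds (kernel - 1) x dilation x cumulative stride), and compares the multiply-accumulate count of a standard 1-D convolution with a depthwise separable one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer1d:
    """One temporal layer: a convolution or a pooling step (hypothetical helper)."""
    kernel: int
    dilation: int = 1
    stride: int = 1


def receptive_field(layers: List[Layer1d]) -> int:
    """Number of input time steps seen by one output of the stacked layers."""
    rf, jump = 1, 1  # jump: spacing of the current layer's inputs, in input samples
    for layer in layers:
        rf += (layer.kernel - 1) * layer.dilation * jump
        jump *= layer.stride
    return rf


def conv_macs(kernel: int, c_in: int, c_out: int, separable: bool = False) -> int:
    """Multiply-accumulates per output time step of a 1-D convolution."""
    if separable:
        # One depthwise filter per input channel, then a 1x1 pointwise mix.
        return kernel * c_in + c_in * c_out
    return kernel * c_in * c_out


# Illustrative stacks only; sizes, depths and channel counts are assumptions.
growing_dilation = [Layer1d(3, dilation=d) for d in (1, 2, 4, 8)]
constant_dilation_pooled = [
    Layer1d(3, dilation=2), Layer1d(2, stride=2),  # convolution, then max-pool
    Layer1d(3, dilation=2), Layer1d(2, stride=2),
    Layer1d(3, dilation=2),
]

if __name__ == "__main__":
    print(receptive_field(growing_dilation))          # 31
    print(receptive_field(constant_dilation_pooled))  # 32
    print(conv_macs(3, 16, 16))                       # 768
    print(conv_macs(3, 16, 16, separable=True))       # 304
```

In this toy setting, a constant dilation factor combined with pooling reaches a similar receptive field with three convolutions instead of four, and the later convolutions run at a reduced temporal rate, which is where operations are saved. The numbers are only meant to show the direction of the effect, not to reproduce the figures reported in the abstract.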

Keywords

Voice Activity Detection, Event Camera, Efficient, Visual Speech, Dilated Convolution, Separable Convolution

Project Number

BAP112

Supporting Institution

Yaşar University

Acknowledgements

This work was supported within the scope of project no. BAP112, titled "Dynamic Face Analysis with Neuromorphic Camera" ("Nöromorfik Kamera ile Dinamik Yüz Analizi"), which was approved by the Yaşar University Project Evaluation Commission (PDK).

The article lists 51 references in total.

Details

Primary Language	Turkish
Subjects	Computer Vision, Image Processing, Pattern Recognition, Video Processing, Deep Learning, Machine Vision
Section	Research Article
Authors	Arman Savran 0000-0001-5142-6384
Project Number	BAP112
Publication Date	September 26, 2024
Submission Date	December 4, 2023
Acceptance Date	April 18, 2024
Published in Issue	Year 2024 Volume: 7 Issue: 2

Cite

APA	Savran, A. (2024). Olay Kamerası ile Verimli Konuşma Sesi Tespiti için Zamansal Evrişimsel Ağlar. Journal of Intelligent Systems: Theory and Applications, 7(2), 102-115. https://doi.org/10.38016/jista.1400047
AMA	Savran A. Olay Kamerası ile Verimli Konuşma Sesi Tespiti için Zamansal Evrişimsel Ağlar. jista. September 2024;7(2):102-115. doi:10.38016/jista.1400047
Chicago	Savran, Arman. “Olay Kamerası ile Verimli Konuşma Sesi Tespiti için Zamansal Evrişimsel Ağlar”. Journal of Intelligent Systems: Theory and Applications 7, no. 2 (September 2024): 102-15. https://doi.org/10.38016/jista.1400047.
EndNote	Savran A (01 September 2024) Olay Kamerası ile Verimli Konuşma Sesi Tespiti için Zamansal Evrişimsel Ağlar. Journal of Intelligent Systems: Theory and Applications 7 2 102–115.
IEEE	A. Savran, “Olay Kamerası ile Verimli Konuşma Sesi Tespiti için Zamansal Evrişimsel Ağlar”, jista, vol. 7, no. 2, pp. 102–115, 2024, doi: 10.38016/jista.1400047.
ISNAD	Savran, Arman. “Olay Kamerası ile Verimli Konuşma Sesi Tespiti için Zamansal Evrişimsel Ağlar”. Journal of Intelligent Systems: Theory and Applications 7/2 (September 2024), 102-115. https://doi.org/10.38016/jista.1400047.
JAMA	Savran A. Olay Kamerası ile Verimli Konuşma Sesi Tespiti için Zamansal Evrişimsel Ağlar. jista. 2024;7:102–115.
MLA	Savran, Arman. “Olay Kamerası ile Verimli Konuşma Sesi Tespiti için Zamansal Evrişimsel Ağlar”. Journal of Intelligent Systems: Theory and Applications, vol. 7, no. 2, 2024, pp. 102-15, doi:10.38016/jista.1400047.
Vancouver	Savran A. Olay Kamerası ile Verimli Konuşma Sesi Tespiti için Zamansal Evrişimsel Ağlar. jista. 2024;7(2):102-15.

Journal of Intelligent Systems: Theory and Applications (Zeki Sistemler Teori ve Uygulamaları Dergisi)