<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.4 20241031//EN"
        "https://jats.nlm.nih.gov/publishing/1.4/JATS-journalpublishing1-4.dtd">
<article  article-type="research-article"        dtd-version="1.4">
            <front>

                <journal-meta>
                                                                <journal-id>dubited</journal-id>
            <journal-title-group>
                                                                                    <journal-title>Duzce University Journal of Science and Technology</journal-title>
            </journal-title-group>
                                        <issn pub-type="epub">2148-2446</issn>
                                                                                            <publisher>
                    <publisher-name>Duzce University</publisher-name>
                </publisher>
                    </journal-meta>
                <article-meta>
                                        <article-id pub-id-type="doi">10.29130/dubited.1793166</article-id>
                                                                <article-categories>
                                            <subj-group  xml:lang="en">
                                                            <subject>Machine Learning Algorithms</subject>
                                                    </subj-group>
                                            <subj-group  xml:lang="tr">
                                                            <subject>Makine Öğrenmesi Algoritmaları</subject>
                                                    </subj-group>
                                    </article-categories>
                                                                                                                                                        <title-group>
                                                                                                                        <trans-title-group xml:lang="tr">
                                    <trans-title>Yapay Sinir Ağlarının Donanım Hataları Altındaki Hata Toleransı ve Zafiyetinin Değerlendirilmesi</trans-title>
                                </trans-title-group>
                                                                                                                                                                                                <article-title>Evaluating the Fault Tolerance and Vulnerability of Artificial Neural Networks Under Hardware Errors</article-title>
                                                                                                    </title-group>
            
                                                    <contrib-group content-type="authors">
                                                                        <contrib contrib-type="author">
                                                                    <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-1104-9307</contrib-id>
                                                                <name>
                                    <surname>Aktaş Aydın</surname>
                                    <given-names>Hatice</given-names>
                                </name>
                                                                    <aff>SİVAS BİLİM VE TEKNOLOJİ ÜNİVERSİTESİ</aff>
                                                            </contrib>
                                                    <contrib contrib-type="author">
                                                                    <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-1138-0577</contrib-id>
                                                                <name>
                                    <surname>Kahira</surname>
                                    <given-names>Albert Njoroge</given-names>
                                </name>
                                                                    <aff>AstraZeneca</aff>
                                                            </contrib>
                                                    <contrib contrib-type="author">
                                                                    <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-3929-8126</contrib-id>
                                                                <name>
                                    <surname>Yalçın</surname>
                                    <given-names>Gülay</given-names>
                                </name>
                                                                    <aff>ABDULLAH GUL UNIVERSITY</aff>
                                                            </contrib>
                                                    <contrib contrib-type="author">
                                                                    <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-0544-9697</contrib-id>
                                                                <name>
                                    <surname>Ünsal</surname>
                                    <given-names>Osman</given-names>
                                </name>
                                                                    <aff>Barcelona Supercomputing Center</aff>
                                                            </contrib>
                                                                                </contrib-group>
                        
                                        <pub-date pub-type="pub" iso-8601-date="2026-04-19">
                    <day>19</day>
                    <month>04</month>
                    <year>2026</year>
                </pub-date>
                                        <volume>14</volume>
                                        <issue>2</issue>
                                        <fpage>537</fpage>
                                        <lpage>550</lpage>
                        
                        <history>
                                    <date date-type="received" iso-8601-date="2025-09-29">
                        <day>29</day>
                        <month>09</month>
                        <year>2025</year>
                    </date>
                                                    <date date-type="accepted" iso-8601-date="2026-02-24">
                        <day>24</day>
                        <month>02</month>
                        <year>2026</year>
                    </date>
                            </history>
                                        <permissions>
                    <copyright-statement>Copyright © 2013, Duzce University Journal of Science and Technology</copyright-statement>
                    <copyright-year>2013</copyright-year>
                    <copyright-holder>Duzce University Journal of Science and Technology</copyright-holder>
                </permissions>
            
                                                                                                <trans-abstract xml:lang="tr">
                            <p>Yapay Sinir Ağları (YSA), yapay zekaya olan ilgi ve gelişmelerin artması, Yüksek Performanslı Hesaplama (YBH) sistemlerinin sunduğu hesaplama gücünün artması nedeniyle tekrar popülerlik kazanmıştır. Sinir ağı uygulamaları büyük veri merkezlerinde ve YBH sistemlerinde kullanıldığından, bu sistemlerde yaygın olan kayıtlarda ve bellek yapılarında bit kayması gibi benzer güvenilirlik sorunlarıyla karşı karşıyadırlar. Bu nedenle, sistem maliyetini önemli ölçüde artırabilen özel sağlamlık ve koruma mekanizmaları gerektirirler. Ancak, donanım arızalarının YSA uygulamalarının farklı bileşenleri üzerindeki etkisini anlamak, hangi parçaların daha savunmasız olduğunu ve daha yüksek güvenilirlik gerektirdiğini belirlemeye yardımcı olabilir. Bu çalışmada, YBH sistemlerinde ve büyük ölçekli veri merkezlerinde çalıştırıldığında donanım arızalarının YSA uygulamaları üzerindeki etkileri değerlendirilmiş ve böylece güvenilirlik maliyetlerinin düşürülmesi hedeflenmiştir. Geleneksel tekniklerle gerçekleştirilen hata enjeksiyon deneyleri YSA uygulamaları için oldukça zaman alıcı olabilir. Bu nedenle, bu tür uygulamalarda hata enjeksiyon süresini azaltmak için bir yöntem sunulmuştur. CPU tabanlı (Intel Xeon) ve GPU tabanlı (NVIDIA V100) yüksek performanslı bilgi işlem (HPC) sistemlerinde çalışan Yapay Sinir Ağı (YSA) uygulamaları üzerinde donanım arızalarının etkilerini değerlendirdiğimizde, sonuçlarımız YSA&#039;ların bazı donanım arızalarına, özellikle belirli katmanlarda ve mimari kayıtlarda oluşan arızalara karşı savunmasız olduğunu göstermektedir.</p></trans-abstract>
                                                                                                                                    <abstract><p>Artificial Neural Networks (ANN) have gained popularity again due to the increasing interest and developments in artificial intelligence, as well as the increased computational power offered by High Performance Computing (HPC) systems. Since neural network applications are used in large data centers and HPC systems, they face similar reliability issues such as bit slippage in registers and memory structures that are common in these systems. Therefore, they require special robustness and protection mechanisms that can significantly increase the system cost. However, understanding the impact of hardware failures on different components of ANN applications can help determine which parts are more vulnerable and require higher reliability. In this study, the effects of hardware faults on ANN applications when they are run in HPC systems and large-scale data centers are evaluated, and thus, the reliability costs are aimed to be reduced. Fault injection experiments performed with traditional techniques can be quite time-consuming for ANN applications. Therefore, a method is presented to reduce the fault injection time in such applications. When we evaluate the effects of hardware faults on Artificial Neural Network (ANN) applications running on CPU-based (Intel Xeon) and GPU-based (NVIDIA V100) high-performance computing (HPC) systems, our results show that ANNs are vulnerable to some hardware faults, especially those occurring in certain layers and architectural registers.</p></abstract>
                                                            
            
                                                                                        <kwd-group>
                                                    <kwd>Fault tolerance</kwd>
                                                    <kwd>Reliability</kwd>
                                                    <kwd>Machine learning</kwd>
                                                    <kwd>Artificial neural networks</kwd>
                                                    <kwd>Artificial intelligence</kwd>
                                            </kwd-group>
                            
                                                <kwd-group xml:lang="tr">
                                                    <kwd>Hata Toleransı</kwd>
                                                    <kwd>güvenilirlik</kwd>
                                                    <kwd>makine öğrenmesi</kwd>
                                                    <kwd>yapay sinir ağları</kwd>
                                                    <kwd>yapay zeka</kwd>
                                            </kwd-group>
                                                                                                                                    <funding-group specific-use="FundRef">
                    <award-group>
                                                    <funding-source>
                                <named-content content-type="funder_name">This research received no external funding.</named-content>
                            </funding-source>
                                                                    </award-group>
                </funding-group>
                                </article-meta>
    </front>
    <back>
                            <ref-list>
                                    <ref id="ref1">
                        <label>1</label>
                        <mixed-citation publication-type="journal">Alobaid, A., Bonny, T., &amp; Alrahhal, M. (2025). Disruptive attacks on artificial neural networks: A systematic review of attack techniques, detection methods, and protection strategies. Intelligent Systems with Applications, 26(1), 200529. https://doi.org/10.1016/j.iswa.2025.200529</mixed-citation>
                    </ref>
                                    <ref id="ref2">
                        <label>2</label>
                        <mixed-citation publication-type="journal">Bautista Gomez, L. A. B., &amp; Cappello, F. (2015). Detecting and correcting data corruption in stencil applications through multivariate interpolation. In Proceedings of the IEEE International Conference on Cluster Computing (pp. 595–602). IEEE. https://doi.org/10.1109/CLUSTER.2015.108</mixed-citation>
                    </ref>
                                    <ref id="ref3">
                        <label>3</label>
                        <mixed-citation publication-type="journal">Bengio, Y., Lecun, Y., &amp; Hinton, G. (2021). Deep learning for AI. Communications of the ACM, 64(7), 58–65. https://doi.org/10.1145/3448250</mixed-citation>
                    </ref>
                                    <ref id="ref4">
                        <label>4</label>
                        <mixed-citation publication-type="journal">Borkar, S., &amp; Chien, A. A. (2011). The future of microprocessors. Communications of the ACM, 54(5), 67–77. https://doi.org/10.1145/1941487.1941507</mixed-citation>
                    </ref>
                                    <ref id="ref5">
                        <label>5</label>
                        <mixed-citation publication-type="journal">Cappello, F., Geist, A., Gropp, B., Kale, L., Kramer, B., &amp; Snir, M. (2009). Toward exascale resilience. The International Journal of High Performance Computing Applications, 23(4), 374–388. https://doi.org/10.1177/1094342009347767</mixed-citation>
                    </ref>
                                    <ref id="ref6">
                        <label>6</label>
                        <mixed-citation publication-type="journal">Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., &amp; Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255). IEEE. https://doi.org/10.1109/CVPR.2009.5206848</mixed-citation>
                    </ref>
                                    <ref id="ref7">
                        <label>7</label>
                        <mixed-citation publication-type="journal">Di Martino, C., Kramer, W., Kalbarczyk, Z., &amp; Iyer, R. (2015). Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs. In Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (pp. 25–36). IEEE. https://doi.org/10.1109/DSN.2015.50</mixed-citation>
                    </ref>
                                    <ref id="ref8">
                        <label>8</label>
                        <mixed-citation publication-type="journal">Henning, J. L. (2006). SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4), 1–17. https://doi.org/10.1145/1186736.1186737</mixed-citation>
                    </ref>
                                    <ref id="ref9">
                        <label>9</label>
                        <mixed-citation publication-type="journal">Kulakov, A., Zwolinski, M., &amp; Reeve, J. (2015). Fault tolerance in distributed neural computing [Preprint]. https://doi.org/10.13140/RG.2.1.1387.0800</mixed-citation>
                    </ref>
                                    <ref id="ref10">
                        <label>10</label>
                        <mixed-citation publication-type="journal">Luk, C. K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V. J., &amp; Hazelwood, K. (2005). Pin: Building customized program analysis tools with dynamic instrumentation. ACM SIGPLAN Notices, 40(6), 190–200. https://doi.org/10.1145/1064978.1065034</mixed-citation>
                    </ref>
                                    <ref id="ref11">
                        <label>11</label>
                        <mixed-citation publication-type="journal">LeCun, Y., Cortes, C., &amp; Burges, C. J. C. (n.d.). MNIST handwritten digit database. Retrieved March 18, 2026, from https://yann.lecun.com/exdb/mnist/index.html</mixed-citation>
                    </ref>
                                    <ref id="ref12">
                        <label>12</label>
                        <mixed-citation publication-type="journal">Nazari, N., Makrani, H. M., Fang, C., Sayadi, H., Rafatirad, S., Khasawneh, K. N., &amp; Homayoun, H. (2024). Forget and rewire: Enhancing the resilience of transformer-based models against bit-flip attacks. In Proceedings of the 33rd USENIX Security Symposium (pp. 1348-1366). https://www.usenix.org/conference/usenixsecurity24/presentation/nazari</mixed-citation>
                    </ref>
                                    <ref id="ref13">
                        <label>13</label>
                        <mixed-citation publication-type="journal">Oh, N., Shirvani, P. P., &amp; McCluskey, E. J. (2002). Error detection by duplicated instructions in super-scalar processors. IEEE Transactions on Reliability, 51(1), 63–75. https://doi.org/10.1109/24.994913</mixed-citation>
                    </ref>
                                    <ref id="ref14">
                        <label>14</label>
                        <mixed-citation publication-type="journal">Piuri, V. (2001). Analysis of fault tolerance in artificial neural networks. Journal of Parallel and Distributed Computing, 61(1), 18–48. https://doi.org/10.1006/jpdc.2000.1663</mixed-citation>
                    </ref>
                                    <ref id="ref15">
                        <label>15</label>
                        <mixed-citation publication-type="journal">Rajagede, R. A., Santriaji, M. H., Fikriansyah, M. A., Nuha, H. H., Fu, Y., &amp; Solihin, Y. (2025). NAPER: Fault protection for real-time resource-constrained deep neural networks. In Proceedings of the IEEE 31st International Symposium on On-Line Testing and Robust System Design (IOLTS). IEEE. https://doi.org/10.1109/IOLTS65288.2025.11116827</mixed-citation>
                    </ref>
                                    <ref id="ref16">
                        <label>16</label>
                        <mixed-citation publication-type="journal">Ruospo, A., Gavarini, G., de Sio, C., Guerrero, J., Sterpone, L., Reorda, M. S., Sanchez, E., Mariani, R., Aribido, J., &amp; Athavale, J. (2023). Assessing convolutional neural networks reliability through statistical fault injections. In Proceedings of the Design, Automation and Test in Europe Conference &amp; Exhibition (DATE) (pp. 1-6). IEEE. https://doi.org/10.23919/DATE56975.2023.10136998</mixed-citation>
                    </ref>
                                    <ref id="ref17">
                        <label>17</label>
                        <mixed-citation publication-type="journal">Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., &amp; Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y</mixed-citation>
                    </ref>
                                    <ref id="ref18">
                        <label>18</label>
                        <mixed-citation publication-type="journal">Simonyan, K., &amp; Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. [arXiv preprint]. https://arxiv.org/pdf/1409.1556</mixed-citation>
                    </ref>
                                    <ref id="ref19">
                        <label>19</label>
                        <mixed-citation publication-type="journal">Su, F., Yuan, P., Wang, Y., &amp; Zhang, C. (2016). The superior fault tolerance of artificial neural network training with a fault/noise injection-based genetic algorithm. Protein &amp; Cell, 7(10), 735–748. https://doi.org/10.1007/s13238-016-0302-5</mixed-citation>
                    </ref>
                                    <ref id="ref20">
                        <label>20</label>
                        <mixed-citation publication-type="journal">Tchernev, E. B., Mulvaney, R. G., &amp; Phatak, D. S. (2005). Investigating the fault tolerance of neural networks. Neural Computation, 17(7), 1646–1664. https://doi.org/10.1162/0899766053723096</mixed-citation>
                    </ref>
                                    <ref id="ref21">
                        <label>21</label>
                        <mixed-citation publication-type="journal">Tiwari, D., Gupta, S., Rogers, J., Maxwell, D., Rech, P., Vazhkudai, S., Oliveira, D., Londo, D., Debardeleben, N., Navaux, P., Carro, L., &amp; Bland, A. (2015). Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA) (pp. 331–342). IEEE. https://doi.org/10.1109/HPCA.2015.7056044</mixed-citation>
                    </ref>
                                    <ref id="ref22">
                        <label>22</label>
                        <mixed-citation publication-type="journal">Tsai, T., Hari, S. K. S., Sullivan, M., Villa, O., &amp; Keckler, S. W. (2021). NVBitFI: Dynamic Fault Injection for GPUs. In Proceedings of the 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (pp. 284–291). IEEE. https://doi.org/10.1109/DSN48987.2021.00041</mixed-citation>
                    </ref>
                                    <ref id="ref23">
                        <label>23</label>
                        <mixed-citation publication-type="journal">Villa, O., Stephenson, M., Nellans, D., &amp; Keckler, S. W. (2019). NVBit: A dynamic binary instrumentation framework for NVIDIA GPUs. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (pp. 372–383). https://doi.org/10.1145/3352460.3358307</mixed-citation>
                    </ref>
                                    <ref id="ref24">
                        <label>24</label>
                        <mixed-citation publication-type="journal">Vinck, T., Jonckers, N., Dekkers, G., Prinzie, J., &amp; Karsmakers, P. (2025). Mitigating multiple single-event upsets during deep neural network inference using fault-aware training. Journal of Instrumentation, 20(02), Article C02044. https://doi.org/10.1088/1748-0221/20/02/C02044</mixed-citation>
                    </ref>
                                    <ref id="ref25">
                        <label>25</label>
                        <mixed-citation publication-type="journal">Wang, C., Zhao, P., Wang, S., &amp; Lin, X. (2024). Detection and recovery against deep neural network fault injection attacks based on contrastive learning [arXiv preprint]. http://arxiv.org/abs/2401.16766</mixed-citation>
                    </ref>
                            </ref-list>
                    </back>
    </article>
