Evaluating the Fault Tolerance and Vulnerability of Artificial Neural Networks Under Hardware Errors
Year 2026, Volume: 14, Issue: 2, pp. 537–550, 19.04.2026
Hatice Aktaş Aydın, Albert Njoroge Kahira, Gülay Yalçın, Osman Ünsal
Abstract
Artificial Neural Networks (ANNs) have regained popularity owing to growing interest and progress in artificial intelligence, as well as the increased computational power offered by High Performance Computing (HPC) systems. Because neural network applications run in large data centers and on HPC systems, they face the reliability issues common to these systems, such as bit flips in registers and memory structures. They therefore require dedicated robustness and protection mechanisms, which can significantly increase system cost. Understanding the impact of hardware faults on the different components of ANN applications, however, can help identify which parts are more vulnerable and require higher reliability. In this study, the effects of hardware faults on ANN applications running in HPC systems and large-scale data centers are evaluated, with the aim of reducing reliability costs. Fault injection experiments performed with traditional techniques can be very time-consuming for ANN applications, so a method is presented to reduce the fault injection time for such applications. Evaluating the effects of hardware faults on ANN applications running on a CPU-based (Intel Xeon) and a GPU-based (NVIDIA V100) HPC system, our results show that ANNs are vulnerable to certain hardware faults, especially those occurring in particular layers and architectural registers.
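The abstract describes single-bit hardware faults (bit flips) injected into ANN applications. As a minimal, language-level illustration of what one such fault does to a network's output, a bit flip in a float32 weight can be modeled in Python with NumPy. This is a sketch only: the study itself uses binary instrumentation tooling, not this code, and the toy network, seed, and chosen bit position below are arbitrary assumptions.

```python
import numpy as np

def flip_bit(value: np.float32, bit: int) -> np.float32:
    """Flip one bit of a 32-bit float, mimicking a single-event upset."""
    as_int = np.float32(value).view(np.uint32)      # reinterpret bits as uint32
    flipped = as_int ^ np.uint32(1 << bit)          # XOR toggles the chosen bit
    return flipped.view(np.float32)                 # reinterpret back as float

# Toy 2-layer perceptron (arbitrary sizes and seed, for illustration only).
rng = np.random.default_rng(0)
w1 = rng.standard_normal((4, 8)).astype(np.float32)
w2 = rng.standard_normal((8, 3)).astype(np.float32)
x = rng.standard_normal(4).astype(np.float32)

def forward(w1, w2, x):
    h = np.maximum(x @ w1, 0.0)   # ReLU hidden layer
    return h @ w2

clean = forward(w1, w2, x)

# Inject one fault: flip a high exponent bit of a randomly chosen weight.
w1_faulty = w1.copy()
i, j = rng.integers(4), rng.integers(8)
w1_faulty[i, j] = flip_bit(w1_faulty[i, j], bit=30)

faulty = forward(w1_faulty, w2, x)
print("max output deviation:", np.abs(faulty - clean).max())
```

Flipping a high-order exponent bit typically perturbs the value by orders of magnitude, while low-order mantissa flips are often masked; repeating such injections over many weights, layers, and bit positions is the basic loop behind the vulnerability measurements the abstract refers to.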
Ethical Statement
This study does not involve human or animal participants. All procedures followed scientific and ethical principles, and all referenced studies are appropriately cited.
Supporting Institution
This research received no external funding.
Thanks
The authors declare that there are no acknowledgements.
References
- Alobaid, A., Bonny, T., & Alrahhal, M. (2025). Disruptive attacks on artificial neural networks: A systematic review of attack techniques, detection methods, and protection strategies. Intelligent Systems with Applications, 26(1), 200529. https://doi.org/10.1016/j.iswa.2025.200529
- Bautista Gomez, L. A. B., & Cappello, F. (2015). Detecting and correcting data corruption in stencil applications through multivariate interpolation. In Proceedings of the IEEE International Conference on Cluster Computing (pp. 595–602). IEEE. https://doi.org/10.1109/CLUSTER.2015.108
- Bengio, Y., Lecun, Y., & Hinton, G. (2021). Deep learning for AI. Communications of the ACM, 64(7), 58–65. https://doi.org/10.1145/3448250
- Borkar, S., & Chien, A. A. (2011). The future of microprocessors. Communications of the ACM, 54(5), 67–77. https://doi.org/10.1145/1941487.1941507
- Cappello, F., Geist, A., Gropp, B., Kale, L., Kramer, B., & Snir, M. (2009). Toward exascale resilience. The International Journal of High Performance Computing Applications, 23(4), 374–388. https://doi.org/10.1177/1094342009347767
- Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255). IEEE. https://doi.org/10.1109/CVPR.2009.5206848
- Di Martino, C., Kramer, W., Kalbarczyk, Z., & Iyer, R. (2015). Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs. In Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (pp. 25–36). IEEE. https://doi.org/10.1109/DSN.2015.50
- Henning, J. L. (2006). SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4), 1–17. https://doi.org/10.1145/1186736.1186737
- Kulakov, A., Zwolinski, M., & Reeve, J. (2015). Fault tolerance in distributed neural computing [Preprint]. https://doi.org/10.13140/RG.2.1.1387.0800
- Luk, C. K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V. J., & Hazelwood, K. (2005). Pin: Building customized program analysis tools with dynamic instrumentation. ACM SIGPLAN Notices, 40(6), 190–200. https://doi.org/10.1145/1064978.1065034
- LeCun, Y., Cortes, C., & Burges, C. J. C. (n.d.). MNIST handwritten digit database. Retrieved March 18, 2026, from https://yann.lecun.org/exdb/mnist/index.html
- Nazari, N., Makrani, H. M., Fang, C., Sayadi, H., Rafatirad, S., Khasawneh, K. N., & Homayoun, H. (2024). Forget and rewire: Enhancing the resilience of transformer-based models against bit-flip attacks. In Proceedings of the 33rd USENIX Security Symposium (pp. 1348–1366). https://www.usenix.org/conference/usenixsecurity24/presentation/nazari
- Oh, N., Shirvani, P. P., & McCluskey, E. J. (2002). Error detection by duplicated instructions in super-scalar processors. IEEE Transactions on Reliability, 51(1), 63–75. https://doi.org/10.1109/24.994913
- Piuri, V. (2001). Analysis of fault tolerance in artificial neural networks. Journal of Parallel and Distributed Computing, 61(1), 18–48. https://doi.org/10.1006/jpdc.2000.1663
- Rajagede, R. A., Santriaji, M. H., Fikriansyah, M. A., Nuha, H. H., Fu, Y., & Solihin, Y. (2025). NAPER: Fault protection for real-time resource-constrained deep neural networks. In Proceedings of the IEEE 31st International Symposium on On-Line Testing and Robust System Design (IOLTS). IEEE. https://doi.org/10.1109/IOLTS65288.2025.11116827
- Ruospo, A., Gavarini, G., de Sio, C., Guerrero, J., Sterpone, L., Reorda, M. S., Sanchez, E., Mariani, R., Aribido, J., & Athavale, J. (2023). Assessing convolutional neural networks reliability through statistical fault injections. In Proceedings of the Design, Automation and Test in Europe Conference & Exhibition (DATE) (pp. 1–6). IEEE. https://doi.org/10.23919/DATE56975.2023.10136998
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y
- Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition [arXiv preprint]. https://arxiv.org/pdf/1409.1556
- Su, F., Yuan, P., Wang, Y., & Zhang, C. (2016). The superior fault tolerance of artificial neural network training with a fault/noise injection-based genetic algorithm. Protein & Cell, 7(10), 735–748. https://doi.org/10.1007/s13238-016-0302-5
- Tchernev, E. B., Mulvaney, R. G., & Phatak, D. S. (2005). Investigating the fault tolerance of neural networks. Neural Computation, 17(7), 1646–1664. https://doi.org/10.1162/0899766053723096
- Tiwari, D., Gupta, S., Rogers, J., Maxwell, D., Rech, P., Vazhkudai, S., Oliveira, D., Londo, D., Debardeleben, N., Navaux, P., Carro, L., & Bland, A. (2015). Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA) (pp. 331–342). IEEE. https://doi.org/10.1109/HPCA.2015.7056044
- Tsai, T., Hari, S. K. S., Sullivan, M., Villa, O., & Keckler, S. W. (2021). NVBitFI: Dynamic fault injection for GPUs. In Proceedings of the 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (pp. 284–291). IEEE. https://doi.org/10.1109/DSN48987.2021.00041
- Villa, O., Stephenson, M., Nellans, D., & Keckler, S. W. (2019). NVBit: A dynamic binary instrumentation framework for NVIDIA GPUs. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (pp. 372–383). https://doi.org/10.1145/3352460.3358307
- Vinck, T., Jonckers, N., Dekkers, G., Prinzie, J., & Karsmakers, P. (2025). Mitigating multiple single-event upsets during deep neural network inference using fault-aware training. Journal of Instrumentation, 20(02), Article C02044. https://doi.org/10.1088/1748-0221/20/02/C02044
- Wang, C., Zhao, P., Wang, S., & Lin, X. (2024). Detection and recovery against deep neural network fault injection attacks based on contrastive learning [arXiv preprint]. http://arxiv.org/abs/2401.16766