Evaluating the Fault Tolerance and Vulnerability of Artificial Neural Networks Under Hardware Errors
Year 2026, Volume: 14, Issue: 2, pp. 537–550, 19.04.2026
Hatice Aktaş Aydın, Albert Njoroge Kahira, Gülay Yalçın, Osman Ünsal
Abstract
Artificial Neural Networks (ANNs) have regained popularity owing to growing interest and progress in artificial intelligence, as well as the increased computational power offered by High Performance Computing (HPC) systems. Because neural network applications run in large data centers and on HPC systems, they face the reliability issues common to these systems, such as bit flips in registers and memory structures. They therefore require dedicated robustness and protection mechanisms, which can significantly increase system cost. Understanding the impact of hardware faults on the different components of ANN applications, however, can help identify which parts are more vulnerable and require higher reliability. In this study, the effects of hardware faults on ANN applications running in HPC systems and large-scale data centers are evaluated, with the aim of reducing reliability costs. Fault injection experiments performed with traditional techniques can be very time-consuming for ANN applications, so a method is presented to reduce the fault injection time for such applications. Evaluating the effects of hardware faults on ANN applications running on a CPU-based (Intel Xeon) and a GPU-based (NVIDIA V100) HPC system, our results show that ANNs are vulnerable to certain hardware faults, especially those occurring in particular layers and architectural registers.
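The abstract describes single-bit hardware faults (bit flips) injected into ANN applications. As a minimal, language-level illustration of what one such fault does to a network's output, a bit flip in a float32 weight can be modeled in Python with NumPy. This is a sketch only: the study itself uses binary instrumentation tooling, not this code, and the toy network, seed, and chosen bit position below are arbitrary assumptions.

```python
import numpy as np

def flip_bit(value: np.float32, bit: int) -> np.float32:
    """Flip one bit of a 32-bit float, mimicking a single-event upset."""
    as_int = np.float32(value).view(np.uint32)      # reinterpret bits as uint32
    flipped = as_int ^ np.uint32(1 << bit)          # XOR toggles the chosen bit
    return flipped.view(np.float32)                 # reinterpret back as float

# Toy 2-layer perceptron (arbitrary sizes and seed, for illustration only).
rng = np.random.default_rng(0)
w1 = rng.standard_normal((4, 8)).astype(np.float32)
w2 = rng.standard_normal((8, 3)).astype(np.float32)
x = rng.standard_normal(4).astype(np.float32)

def forward(w1, w2, x):
    h = np.maximum(x @ w1, 0.0)   # ReLU hidden layer
    return h @ w2

clean = forward(w1, w2, x)

# Inject one fault: flip a high exponent bit of a randomly chosen weight.
w1_faulty = w1.copy()
i, j = rng.integers(4), rng.integers(8)
w1_faulty[i, j] = flip_bit(w1_faulty[i, j], bit=30)

faulty = forward(w1_faulty, w2, x)
print("max output deviation:", np.abs(faulty - clean).max())
```

Flipping a high-order exponent bit typically perturbs the value by orders of magnitude, while low-order mantissa flips are often masked; repeating such injections over many weights, layers, and bit positions is the basic loop behind the vulnerability measurements the abstract refers to.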
Ethical Statement
This study does not involve human or animal participants. All procedures followed scientific and ethical principles, and all referenced studies are appropriately cited.
Supporting Institution
This research received no external funding.
Thanks
The authors declare that there are no acknowledgements.
References
- Alobaid, A., Bonny, T., & Alrahhal, M. (2025). Disruptive attacks on artificial neural networks: A systematic review of attack techniques, detection methods, and protection strategies. Intelligent Systems with Applications, 26(1), 200529. https://doi.org/10.1016/j.iswa.2025.200529
- Bautista Gomez, L. A. B., & Cappello, F. (2015). Detecting and correcting data corruption in stencil applications through multivariate interpolation. In Proceedings of the IEEE International Conference on Cluster Computing (pp. 595–602). IEEE. https://doi.org/10.1109/CLUSTER.2015.108
- Bengio, Y., Lecun, Y., & Hinton, G. (2021). Deep learning for AI. Communications of the ACM, 64(7), 58–65. https://doi.org/10.1145/3448250
- Borkar, S., & Chien, A. A. (2011). The future of microprocessors. Communications of the ACM, 54(5), 67–77. https://doi.org/10.1145/1941487.1941507
- Cappello, F., Geist, A., Gropp, B., Kale, L., Kramer, B., & Snir, M. (2009). Toward exascale resilience. The International Journal of High Performance Computing Applications, 23(4), 374–388. https://doi.org/10.1177/1094342009347767
- Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255). IEEE. https://doi.org/10.1109/CVPR.2009.5206848
- Di Martino, C., Kramer, W., Kalbarczyk, Z., & Iyer, R. (2015). Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs. In Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (pp. 25–36). IEEE. https://doi.org/10.1109/DSN.2015.50
- Henning, J. L. (2006). SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4), 1–17. https://doi.org/10.1145/1186736.1186737
- Kulakov, A., Zwolinski, M., & Reeve, J. (2015). Fault tolerance in distributed neural computing [Preprint]. https://doi.org/10.13140/RG.2.1.1387.0800
- Luk, C. K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V. J., & Hazelwood, K. (2005). Pin: Building customized program analysis tools with dynamic instrumentation. ACM SIGPLAN Notices, 40(6), 190–200. https://doi.org/10.1145/1064978.1065034
- LeCun, Y., Cortes, C., & Burges, C. J. C. (n.d.). MNIST handwritten digit database. Retrieved March 18, 2026, from https://yann.lecun.org/exdb/mnist/index.html
- Nazari, N., Makrani, H. M., Fang, C., Sayadi, H., Rafatirad, S., Khasawneh, K. N., & Homayoun, H. (2024). Forget and rewire: Enhancing the resilience of transformer-based models against bit-flip attacks. In Proceedings of the 33rd USENIX Security Symposium (pp. 1348–1366). https://www.usenix.org/conference/usenixsecurity24/presentation/nazari
- Oh, N., Shirvani, P. P., & McCluskey, E. J. (2002). Error detection by duplicated instructions in super-scalar processors. IEEE Transactions on Reliability, 51(1), 63–75. https://doi.org/10.1109/24.994913
- Piuri, V. (2001). Analysis of fault tolerance in artificial neural networks. Journal of Parallel and Distributed Computing, 61(1), 18–48. https://doi.org/10.1006/jpdc.2000.1663
- Rajagede, R. A., Santriaji, M. H., Fikriansyah, M. A., Nuha, H. H., Fu, Y., & Solihin, Y. (2025). NAPER: Fault protection for real-time resource-constrained deep neural networks. In Proceedings of the IEEE 31st International Symposium on On-Line Testing and Robust System Design (IOLTS). IEEE. https://doi.org/10.1109/IOLTS65288.2025.11116827
- Ruospo, A., Gavarini, G., de Sio, C., Guerrero, J., Sterpone, L., Reorda, M. S., Sanchez, E., Mariani, R., Aribido, J., & Athavale, J. (2023). Assessing convolutional neural networks reliability through statistical fault injections. In Proceedings of the Design, Automation and Test in Europe Conference & Exhibition (DATE) (pp. 1–6). IEEE. https://doi.org/10.23919/DATE56975.2023.10136998
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y
- Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition [arXiv preprint]. https://arxiv.org/pdf/1409.1556
- Su, F., Yuan, P., Wang, Y., & Zhang, C. (2016). The superior fault tolerance of artificial neural network training with a fault/noise injection-based genetic algorithm. Protein & Cell, 7(10), 735–748. https://doi.org/10.1007/s13238-016-0302-5
- Tchernev, E. B., Mulvaney, R. G., & Phatak, D. S. (2005). Investigating the fault tolerance of neural networks. Neural Computation, 17(7), 1646–1664. https://doi.org/10.1162/0899766053723096
- Tiwari, D., Gupta, S., Rogers, J., Maxwell, D., Rech, P., Vazhkudai, S., Oliveira, D., Londo, D., Debardeleben, N., Navaux, P., Carro, L., & Bland, A. (2015). Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA) (pp. 331–342). IEEE. https://doi.org/10.1109/HPCA.2015.7056044
- Tsai, T., Hari, S. K. S., Sullivan, M., Villa, O., & Keckler, S. W. (2021). NVBitFI: Dynamic fault injection for GPUs. In Proceedings of the 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (pp. 284–291). IEEE. https://doi.org/10.1109/DSN48987.2021.00041
- Villa, O., Stephenson, M., Nellans, D., & Keckler, S. W. (2019). NVBit: A dynamic binary instrumentation framework for NVIDIA GPUs. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (pp. 372–383). https://doi.org/10.1145/3352460.3358307
- Vinck, T., Jonckers, N., Dekkers, G., Prinzie, J., & Karsmakers, P. (2025). Mitigating multiple single-event upsets during deep neural network inference using fault-aware training. Journal of Instrumentation, 20(02), Article C02044. https://doi.org/10.1088/1748-0221/20/02/C02044
- Wang, C., Zhao, P., Wang, S., & Lin, X. (2024). Detection and recovery against deep neural network fault injection attacks based on contrastive learning [arXiv preprint]. http://arxiv.org/abs/2401.16766