Silent Data Errors in GPUs: Challenges and Mitigation in Modern Silicon

Authors

  • Sameeksha Gupta

Keywords:

silent data errors, GPU reliability, cosmic radiation sensitivity, architectural vulnerability, workload resilience, error mitigation strategies

Abstract

Silent data errors (SDEs) in graphics processing units represent a critical challenge for modern computational systems that rely on these accelerators in high-performance computing, artificial intelligence, and data center operations. These errors propagate through calculations without triggering detection mechanisms, potentially compromising results in critical applications from autonomous vehicles to medical diagnosis. Quantitative analysis reveals disturbing error rates: 8.15 × 10^-3 FIT per device at sea level (one error per 14,000 device-hours), with error rates increasing 17-32% when running at full computational capacity in data centers. The physical causes of SDEs include cosmic radiation (causing 61.7% of faults to propagate undetected in streaming multiprocessors), manufacturing variations (contributing to 4.3% of silent computational failures), thermal stress cycles, voltage fluctuations, and aging effects that impact semiconductor reliability. Architectural vulnerability varies significantly: register files exhibit 36% silent data corruption rates, versus 23% for shared memory and 11% for global memory, while instruction vulnerability ranges from 6.1% for integer operations to 42.7% for atomic operations. Workload characteristics dramatically affect error sensitivity, with machine learning inference showing up to 19.3% accuracy reduction from moderate error rates in transformer models, versus 8.6% in convolutional networks. Mitigation strategies span hardware (ECC reducing corruption by 78.5%), firmware, and software domains, with recent selective redundancy techniques achieving 91% error coverage with only 32% performance overhead. Cross-layer resilience approaches demonstrated in recent research can reduce critical data integrity errors by up to 93.4% compared to default protection methods. Understanding these complex interactions and implementing targeted protection systems is essential for developing resilient GPU computing platforms that maintain both performance at scale and reliability.
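To make the notion of a silent error and of redundancy-based detection concrete, the following is a minimal CPU-side sketch in Python. It is illustrative only and is not the method evaluated in the article: the helper names (flip_bit, dot, dot_with_dmr), the float32 bit-flip fault model, and the choice to inject the fault into only one of two redundant runs are all assumptions made for this example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit of its IEEE-754 float32 encoding inverted."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def dot(xs, ys, fault_at=None, bit=0):
    """Dot product; optionally corrupt one partial product (hypothetical fault model)."""
    total = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        product = x * y
        if i == fault_at:
            # Models a single-bit upset; no exception or flag is raised.
            product = flip_bit(product, bit)
        total += product
    return total


def dot_with_dmr(xs, ys, fault_at=None, bit=0):
    """Run the computation twice and compare the results.

    The fault is injected into the first run only, on the assumption that
    a transient upset rarely hits both redundant copies identically.
    Returns (first_result, results_agree).
    """
    first = dot(xs, ys, fault_at, bit)
    second = dot(xs, ys)
    return first, first == second


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
    print(dot_with_dmr(xs, ys))                      # fault-free run
    print(dot_with_dmr(xs, ys, fault_at=1, bit=30))  # exponent bit flipped
```

Without the second run, the corrupted result is returned like any other value, which is what makes the error silent. Full duplication roughly doubles the work; selective redundancy, as mentioned in the abstract, would apply such checks only to the parts of a computation judged most vulnerable.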

How to Cite

Silent Data Errors in GPUs: Challenges and Mitigation in Modern Silicon. (1970). Global Journal of Computer Science and Technology, 25(A1), 9-16. https://doi.org/10.34257/GJCSTAVOL25IS1PG9

Published

1970-10-14
