Paper: Optimizing Software-Directed Instruction Replication for GPU Error Detection
Event Type
Paper
Registration Categories
TP
Tags
Algorithms
Architectures
GPUs
Linear Algebra
Networks
Resiliency
Time
Thursday, November 15th, 1:30pm - 2pm
Location
C141/143/149
Description
Application execution on safety-critical and high-performance computer systems must be resilient to transient errors. As GPUs become more pervasive in such systems, they must supplement ECC/parity for major storage structures with reliability techniques that cover more of the GPU hardware logic. Instruction duplication has been explored for CPU resilience; however, it has never been studied in the context of GPUs, and it is unclear whether the performance and design choices it presents make it a feasible GPU solution. This paper describes a practical methodology for employing instruction duplication on GPUs and identifies implementation challenges that can incur high overheads (69% on average). It explores GPU-specific software optimizations that trade fine-grained recoverability for performance. It also proposes simple ISA extensions, with limited hardware changes and area costs, that further improve performance, cutting the runtime overheads by more than half to an average of 30%.
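To make the general idea of instruction duplication concrete, here is a minimal Python sketch, not the paper's implementation. It runs an operation twice, once as the "original" and once as a "shadow" copy, and compares the two results to detect a transient error. The names `checked_add` and `run_demo` and the `fault_mask` fault-injection parameter are hypothetical, introduced only for illustration. A real scheme would be applied by the compiler to GPU instructions rather than written by hand.

```python
def checked_add(a, b, fault_mask=0):
    """Duplicate an integer add and compare the two results.

    fault_mask is a hypothetical fault-injection knob: its bits are XORed
    into the original result to imitate a transient bit flip.
    """
    original = (a + b) ^ fault_mask  # original instruction, possibly corrupted
    shadow = a + b                   # duplicated (shadow) instruction
    if original != shadow:           # verification step: results must agree
        raise RuntimeError("transient error detected")
    return original


def run_demo():
    """Return the fault-free result and whether an injected fault was caught."""
    clean = checked_add(2, 3)
    try:
        checked_add(2, 3, fault_mask=1 << 4)  # flip bit 4 of the original result
        detected = False
    except RuntimeError:
        detected = True
    return clean, detected
```

The comparison after every duplicated operation is what produces runtime overhead in a scheme like this. Checking less often would reduce that cost but would localize an error less precisely, which is one way to read the trade of fine-grained recoverability for performance mentioned in the description.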