Error Analysis in HPC Applications Using Algorithmic Differentiation
Time: Sunday, November 11th, 2:30pm - 2:33pm
Description: Computer applications running on supercomputers are used to solve critical problems. These systems are expected to perform tasks not just quickly, but also correctly. Factors that can affect the correctness of a program include faults, reduced precision, lossy data reduction, iteration, and truncation. In the presence of these errors, how do we know whether our program is producing correct results? I have developed a method to understand the impact of these errors on a computer program. The method employs algorithmic differentiation (AD) to analyze the sensitivity of the simulation output to errors in program variables. A tool that we developed based on this method evaluates a given computer program and identifies vulnerable regions that need to be protected from errors. We use this to selectively protect variables against silent data corruptions (SDCs). We also use this method to study the floating-point sensitivity of the code and to develop mixed-precision configurations that improve performance without affecting accuracy. Using this tool, we can ensure that computer simulation applications give correct results in the presence of these errors, so that scientists and policy makers relying on these results can make accurate predictions that can have lasting impact.
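The idea of using AD to measure how sensitive an output is to errors in program variables can be illustrated with a minimal reverse-mode sketch in pure Python. This is not the tool described in the abstract: the `Var` class, the `backward` and `rank_by_sensitivity` functions, and the toy computation are all hypothetical names chosen for illustration. The sketch assumes that a small error δ in a variable changes the output by roughly (adjoint × δ), which is only a first-order estimate.

```python
class Var:
    """Scalar that records its parents so adjoints can be propagated back."""

    def __init__(self, value, parents=(), name=None):
        self.value = float(value)
        self.parents = parents   # tuple of (parent Var, local partial derivative)
        self.name = name
        self.adjoint = 0.0       # d(output)/d(this variable), set by backward()

    def __add__(self, other):
        other = _lift(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__


def _lift(x):
    """Wrap plain numbers so they can take part in recorded operations."""
    return x if isinstance(x, Var) else Var(x)


def backward(output):
    """Reverse sweep: fill in .adjoint for every variable feeding output."""
    order, seen = [], set()

    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for parent, _ in v.parents:
            visit(parent)
        order.append(v)          # parents are appended before their children

    visit(output)
    for v in order:
        v.adjoint = 0.0          # reset so repeated calls do not accumulate
    output.adjoint = 1.0
    for v in reversed(order):    # children first, then their parents
        for parent, local in v.parents:
            parent.adjoint += v.adjoint * local


def rank_by_sensitivity(output, variables):
    """Order variables by |d(output)/d(variable)|, most sensitive first."""
    backward(output)
    return sorted(variables, key=lambda v: abs(v.adjoint), reverse=True)


# Toy "simulation" (hypothetical): y = (a * b) * a + c
a = Var(2.0, name="a")
b = Var(3.0, name="b")
c = Var(0.01, name="c")
t = a * b                        # intermediate program variable
t.name = "t"
y = t * a + c

ranking = rank_by_sensitivity(y, [a, b, c, t])
for v in ranking:
    print(v.name, v.adjoint)
```

In this toy case the variables with the largest adjoints would be the first candidates for protection against SDCs, while those with the smallest adjoints would be candidates for reduced precision. A real analysis would also need to account for the magnitude of the expected error in each variable, not only the derivative.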