Fault-Tolerance for High Performance and Distributed Computing: Theory and Practice
Time: Sunday, November 11th, 8:30am - 5pm
Description: Reliability is one of the major concerns when envisioning future exascale platforms. The International Exascale Software Project forecasts an increase in node performance and concurrency of one or two orders of magnitude, which translates, even under the most optimistic projections, into a mechanical decrease of the mean time to interruption of at least one order of magnitude. Because of this trend, platform providers, software implementors, and high-performance application users who target capability runs on such machines cannot regard an interruption due to a failure as a rare, dramatic event; they must consider faults inevitable and therefore design and develop software components that have some form of fault tolerance integrated at their core.
In this tutorial, we present a comprehensive survey of the techniques proposed to deal with failures in high-performance and distributed systems. At the end of the tutorial, each attendee will have a better understanding of the premises and constraints of fault tolerance, will know some of the available techniques, and will be able to select, integrate, and adapt the technique that best suits their applications. In addition, participants will learn how to employ existing fault-tolerant infrastructure software to support more productive application development and deployment.