The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems
Authors: Sudharshan S. Vazhkudai (Oak Ridge National Laboratory), Bronis R. de Supinski (Lawrence Livermore National Laboratory), Arthur S. Bland (Oak Ridge National Laboratory), Al Geist (Oak Ridge National Laboratory), James Sexton (IBM), Jim Kahle (IBM), Christopher J. Zimmer (Oak Ridge National Laboratory), Scott Atchley (Oak Ridge National Laboratory), Sarp H. Oral (Oak Ridge National Laboratory), Don E. Maxwell (Oak Ridge National Laboratory), Veronica G. Vergara Larrea (Oak Ridge National Laboratory), Adam Bertsch (Lawrence Livermore National Laboratory), Robin Goldstone (Lawrence Livermore National Laboratory), Wayne Joubert (Oak Ridge National Laboratory), Chris Chambreau (Lawrence Livermore National Laboratory), David Appelhans (IBM), Robert Blackmore (IBM), Ben Casses (Lawrence Livermore National Laboratory), George Chochia (IBM), Gene Davison (IBM), Matthew A. Ezell (Oak Ridge National Laboratory), Tom Gooding (IBM), Elsa Gonsiorowski (Lawrence Livermore National Laboratory), Leopold Grinberg (IBM), Bill Hanson (IBM), Bill Hartner (IBM), Ian Karlin (Lawrence Livermore National Laboratory), Matthew L. Leininger (Lawrence Livermore National Laboratory), Dustin Leverman (Oak Ridge National Laboratory), Chris Marroquin (IBM), Adam Moody (Lawrence Livermore National Laboratory), Martin Ohmacht (IBM), Ramesh Pankajakshan (Lawrence Livermore National Laboratory), Fernando Pizzano (IBM), James H. Rogers (Oak Ridge National Laboratory), Bryan Rosenburg (IBM), Drew Schmidt (Oak Ridge National Laboratory), Mallikarjun Shankar (Oak Ridge National Laboratory), Feiyi Wang (Oak Ridge National Laboratory), Py Watson (Lawrence Livermore National Laboratory), Bob Walkup (IBM), Lance D. Weems (Lawrence Livermore National Laboratory), Junqi Yin (Oak Ridge National Laboratory)
Abstract: CORAL, the Collaboration of Oak Ridge, Argonne and Livermore, is fielding two similar IBM systems, Summit and Sierra, with NVIDIA GPUs that will replace the existing Titan and Sequoia systems. Summit and Sierra are currently ranked No. 1 and No. 3, respectively, on the Top500 list. We discuss the design and key differences of the systems. Our evaluation of the systems highlights the following. Applications that fit in HBM see the most benefit and may prefer more GPUs; however, for some applications, the CPU-GPU bandwidth is more important than the number of GPUs. The node-local burst buffer scales linearly and can achieve a 4X improvement over the parallel file system (PFS) for large jobs; smaller jobs, however, may benefit from writing directly to the PFS. Finally, several CPU-, network-, and memory-bound analytics codes and GPU-bound deep learning codes achieve per-node speedups over Titan of up to 11X and 79X, respectively.