RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management

SC18 Proceedings

Authors: Maxime Martinasso (Swiss National Supercomputing Centre), Miguel Gila (Swiss National Supercomputing Centre), Mauro Bianco (Swiss National Supercomputing Centre), Sadaf R. Alam (Swiss National Supercomputing Centre), Colin McMurtrie (Swiss National Supercomputing Centre), Thomas C. Schulthess (Swiss National Supercomputing Centre)

Abstract: Leading hybrid and heterogeneous supercomputing systems process hundreds of thousands of jobs using complex scheduling algorithms and parameters. The centers operating these systems aim to achieve higher levels of resource utilization while complying with policy constraints. There is a critical need for a high-fidelity, high-performance tool with familiar interfaces that not only allows tuning and optimization of the operational job scheduler but also enables exploration of new resource management algorithms. We propose a new methodology and a tool called RM-Replay, which is not a simulator but a fast replay engine for production workloads. Slurm is used as a platform to demonstrate the capabilities of our replay engine.

We discuss the tool's accuracy, and our investigation shows that scheduling metric values vary when better job runtime estimates are provided or topology-aware allocation is used. The presented methodology for creating fast replay engines can be extended to other complex systems.
