RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management
Resource Management
State of the Practice
System Software
Time: Tuesday, November 13th, 3:30pm - 4pm
Description: Leading hybrid and heterogeneous supercomputing systems process hundreds of thousands of jobs using complex scheduling algorithms and parameters. The centers operating these systems aim to achieve higher levels of resource utilization while complying with policy constraints. There is a critical need for a high-fidelity, high-performance tool with familiar interfaces that not only allows tuning and optimization of the operational job scheduler but also enables exploration of new resource management algorithms. We propose a new methodology and a tool called RM-Replay, which is not a simulator but a fast replay engine for production workloads. Slurm is used as a platform to demonstrate the capabilities of our replay engine.

We discuss the tool's accuracy, and our investigation shows that scheduling metric values vary when better job runtime estimates are provided or topology-aware allocation is used. The presented methodology for creating fast replay engines can be extended to other complex systems.