SC18 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Mitigating Inter-Job Interference Using Adaptive Flow-Aware Routing


Authors: Staci A. Smith (University of Arizona), Clara E. Cromey (University of Arizona), David K. Lowenthal (University of Arizona), Jens Domke (Tokyo Institute of Technology), Nikhil Jain (Lawrence Livermore National Laboratory), Jayaraman J. Thiagarajan (Lawrence Livermore National Laboratory), Abhinav Bhatele (Lawrence Livermore National Laboratory)

Abstract: On most high performance computing platforms, applications share network resources with other jobs running concurrently on the system. Inter-job network interference can have a significant impact on the performance of communication-intensive applications, and no satisfactory solutions yet exist for mitigating this degradation.

In this paper, we analyze network congestion caused by multi-job workloads on two production systems that use popular network topologies: fat-tree and dragonfly. For each system, we establish a regression model to relate network hotspots to application performance degradation, showing that current routing strategies are insufficient to load-balance network traffic and mitigate interference on production systems. We then propose an alternative type of adaptive routing strategy, which we call adaptive flow-aware routing. We implement a prototype of our strategy, and tests on the fat-tree system show up to a 46% improvement in job run time when compared to the default routing.




