Description
On most high-performance computing platforms, applications share network resources with other jobs running concurrently on the system. Inter-job network interference can have a significant impact on the performance of communication-intensive applications, and no satisfactory solutions yet exist for mitigating this degradation.
In this paper, we analyze network congestion caused by multi-job workloads on two production systems that use popular network topologies: fat-tree and dragonfly. For each system, we establish a regression model to relate network hotspots to application performance degradation, showing that current routing strategies are insufficient to load-balance network traffic and mitigate interference on production systems. We then propose an alternative type of adaptive routing strategy, which we call adaptive flow-aware routing. We implement a prototype of our strategy, and tests on the fat-tree system show up to a 46% improvement in job run time when compared to the default routing.
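
As a rough illustration of relating a hotspot measure to performance degradation, the sketch below fits a one-variable ordinary least squares line. The abstract does not state the form of the regression model or which hotspot metric is used, so the single predictor, the names `fit_line`, `hotspot_load`, and `slowdown`, and the numbers themselves are all assumptions made up for illustration; they are not measurements from either system.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of ys ~ intercept + slope * xs.

    Returns (slope, intercept). This is a generic textbook fit, not the
    model used in the paper.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of the predictor.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor values are all identical")
    # Sum of cross deviations between predictor and response.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Synthetic, hypothetical values: a normalized hotspot load per run and the
# corresponding run time relative to an interference-free baseline.
hotspot_load = [0.0, 0.25, 0.5, 0.75, 1.0]
slowdown = [1.0, 1.5, 2.0, 2.5, 3.0]

slope, intercept = fit_line(hotspot_load, slowdown)
print(f"slowdown ~ {intercept:.2f} + {slope:.2f} * hotspot_load")
```

In this toy setup, a larger slope would mean that run time is more sensitive to the hotspot measure. A real analysis would likely use several predictors and report goodness of fit.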