Study of Performance Variability on Dragonfly Systems
Time: Sunday, November 11th, 2:15pm - 2:18pm
Description: Dragonfly networks are being widely adopted in high-performance computing systems. On these networks, however, interference caused by resource sharing can lead to significant network congestion and performance variability. On a shared network, different job placement policies lead to different traffic distributions. A contiguous job placement policy achieves localized communication by assigning adjacent compute nodes to the same job. A random job placement policy, on the other hand, achieves balanced network traffic by placing application processes sparsely across the network so that the message load is distributed uniformly. The two approaches have opposite advantages and drawbacks: localizing communication reduces the number of hops for message transfers at the cost of potential local network congestion, while balancing network traffic reduces potential local congestion at the cost of more hops per message transfer.
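The two placement policies can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not part of the study: the function names, the 128-node system, the 16 nodes per Dragonfly group, and the simplification that consecutive node indices are physically adjacent. It counts how many groups a 32-node job touches under each policy, as a rough proxy for how localized its traffic is.

```python
import random

def contiguous_placement(free_nodes, job_size):
    """Hypothetical contiguous policy: take the lowest-indexed free nodes,
    assuming consecutive indices are physically adjacent."""
    return sorted(free_nodes)[:job_size]

def random_placement(free_nodes, job_size, seed=0):
    """Hypothetical random policy: sample free nodes uniformly without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(free_nodes), job_size))

def groups_spanned(nodes, nodes_per_group):
    """Number of distinct groups touched, assuming node i sits in group i // nodes_per_group."""
    return len({n // nodes_per_group for n in nodes})

NODES_PER_GROUP = 16          # assumed group size, for illustration only
free = list(range(128))       # assumed idle 128-node system

contig = contiguous_placement(free, 32)
rand = random_placement(free, 32)

print("contiguous job spans", groups_spanned(contig, NODES_PER_GROUP), "groups")
print("random job spans", groups_spanned(rand, NODES_PER_GROUP), "groups")
```

Under these assumptions the contiguous job stays within two groups, while the random job typically spreads over most of the system, which mirrors the trade-off between fewer hops and more evenly distributed load.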
In this study, we first present a comparative analysis of the trade-off between localizing communication and balancing network traffic using trace-based simulations. We demonstrate the effect of external network interference by introducing background traffic, and show that localized communication can help reduce the application performance variation caused by network sharing. We then introduce an online simulation framework that improves performance and scalability, and discuss the validation of the simulation observations against a production Dragonfly system with respect to performance variability.