Performance, Power, and Scalability Analysis of the Horovod Implementation of the CANDLE NT3 Benchmark on the Cray XC40 Theta
Abstract: Training scientific deep learning models requires the large amounts of computing power provided by HPC systems. In this paper, we use the distributed deep learning framework Horovod to parallelize NT3, a Python benchmark from the exploratory research project CANDLE (Cancer Distributed Learning Environment). We analyze the scalability, performance, and power characteristics of NT3 with different batch sizes and learning rates under two memory modes, cache and flat, on the DOE pre-exascale production system Cray XC40 Theta at Argonne National Laboratory. Our experimental results indicate that the power profiles for the node, CPU, and memory are useful for showing how the Horovod NT3 benchmark behaves on the underlying system. Using the communication timeline of the benchmark, we find that although Horovod scales well, its communication overhead in NT3 increases significantly with the number of nodes.
The benchmark achieves a shorter runtime and lower node and CPU power consumption under the cache mode than under the flat mode. Furthermore, increasing the batch size shortens the runtime and only slightly affects the power. Increasing the learning rate slightly decreases the runtime and node power while increasing accuracy. Several issues raised by the Horovod NT3 benchmark results are discussed, and suggestions are proposed for further work.
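The communication overhead discussed above comes from the per-batch gradient averaging that data-parallel training performs across workers. The following is a minimal pure-Python sketch of that pattern, not the benchmark's actual code: `allreduce_average` mimics the result of a ring allreduce (every worker receives the same averaged gradients), and `scaled_lr` shows the linear learning-rate scaling rule commonly paired with Horovod as the global batch size grows; both function names are illustrative assumptions.

```python
def allreduce_average(worker_grads):
    """Return to every worker the element-wise average of all workers'
    gradients, as an allreduce would. `worker_grads` is a list of
    equal-length gradient lists, one per worker (illustrative only)."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    avg = [sum(g[i] for g in worker_grads) / n for i in range(length)]
    # Each worker gets its own copy of the identical averaged result.
    return [avg[:] for _ in range(n)]

def scaled_lr(base_lr, num_workers):
    """Linear scaling rule often used in data-parallel training: with
    num_workers workers each processing a full local batch, the global
    batch grows by num_workers, so the learning rate is scaled to match."""
    return base_lr * num_workers
```

With more nodes, each allreduce must combine gradients from more workers, which is one way to see why the communication cost grows with node count even when the framework itself scales.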
Back to 8th Workshop on Python for High-Performance and Scientific Computing Archive Listing