HPL and DGEMM Performance Variability on the Xeon Platinum 8160 Processor
Time: Tuesday, November 13th, 2:30pm - 3pm
Description: During initial testing of a large cluster equipped with Xeon Platinum 8160 processors, we observed infrequent, but significant, performance drops in HPL benchmark results. The variability was seen in both single-node and multi-node runs, with approximately 0.4% of results more than 10% slower than the median. We were able to reproduce this behavior with a single-socket (24-core) DGEMM benchmark. Performance counter analysis of several thousand DGEMM runs showed that increased DRAM read traffic is the primary driver of increased execution time. Increased DRAM traffic in this benchmark is primarily generated by dramatically elevated snoop filter evictions, which arise due to the interaction of high-order (physical) address bits with the hash used to map addresses across the 24 coherence agents on the processor. These conflicts (and the associated performance variability) were effectively eliminated (for both DGEMM and HPL) by using 1 GiB large pages.
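The "more than 10% slower than the median" criterion above can be illustrated with a minimal sketch. It assumes each run is summarized by its execution time in seconds, so that "slower" means a time exceeding 1.10 times the median. The function name `slow_run_fraction`, the `threshold` parameter, and the sample timings are hypothetical and are not taken from the study.

```python
from statistics import median


def slow_run_fraction(times_s, threshold=0.10):
    """Return (fraction, indices) of runs slower than the median by more than `threshold`.

    Assumption: `times_s` holds execution times, so a larger value is a slower run.
    """
    if not times_s:
        raise ValueError("need at least one timing")
    med = median(times_s)
    cutoff = med * (1.0 + threshold)  # e.g. 1.10 x median for a 10% threshold
    slow = [i for i, t in enumerate(times_s) if t > cutoff]
    return len(slow) / len(times_s), slow


# Hypothetical timings: nine typical runs and one slow outlier.
times = [10.0] * 9 + [12.0]
fraction, slow = slow_run_fraction(times)
print(f"{fraction:.1%} of runs exceed the cutoff; slow run indices: {slow}")
```

If results are recorded as throughput (for example GFLOP/s) rather than time, the comparison would need to be inverted, flagging runs that fall below a fraction of the median instead.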