SC18 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Machine Learning in HPC Environments

Large-Scale Clustering Using MPI-Based Canopy

Authors: Thomas Heinis (Imperial College, London)

Abstract: Analyzing massive amounts of data and extracting value from it has become key across different disciplines. Many approaches have been developed to extract insight from the plethora of data available. As the amount of data grows rapidly, however, current approaches for analysis struggle to scale. This is particularly true for clustering algorithms, which try to find patterns in the data.

A wide range of clustering approaches has been developed in recent years. What they all share is that they require parameters (number of clusters, size of clusters, etc.) to be set a priori. Typically these parameters are determined through trial and error over several iterations, or through pre-clustering algorithms. Several pre-clustering algorithms have been developed, but, like clustering algorithms themselves, they do not scale well to rapidly growing amounts of data.
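Canopy, the pre-clustering algorithm the paper builds on, uses two distance thresholds T1 > T2: a random point becomes a canopy center, all points within the loose threshold T1 join its canopy, and points within the tight threshold T2 are removed as future center candidates. A minimal sequential sketch (function and parameter names are illustrative, not from the paper):

```python
import random

def canopy(points, t1, t2, dist):
    """Canopy pre-clustering with loose/tight thresholds t1 > t2.

    Returns a list of (center, members) pairs; canopies may overlap,
    and their count serves as an estimate of the number of clusters.
    """
    assert t1 > t2, "loose threshold must exceed tight threshold"
    candidates = list(points)
    canopies = []
    while candidates:
        # Pick a random remaining point as the next canopy center.
        center = candidates.pop(random.randrange(len(candidates)))
        members = [center]
        remaining = []
        for p in candidates:
            d = dist(center, p)
            if d < t1:
                members.append(p)      # loosely assigned to this canopy
            if d >= t2:
                remaining.append(p)    # still eligible as a future center
        candidates = remaining
        canopies.append((center, members))
    return canopies
```

On well-separated data the number of canopies is stable regardless of which random centers are drawn, which is what makes the pass useful for seeding a downstream clusterer such as k-means.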

In this paper, we therefore take one such pre-clustering algorithm, Canopy, and develop a parallel version based on MPI. As we show, doing so is not straightforward: without optimization, a considerable amount of time is spent waiting for synchronization, severely limiting scalability. We therefore optimize our approach to minimize idle cores and time spent at synchronization barriers. As our experiments show, our approach scales near-linearly with increasing dataset size.
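The abstract does not detail the paper's specific optimizations, but a common starting point for parallelizing Canopy is to partition the data across ranks, run Canopy locally with no communication, then gather the local centers at the root and merge them with a second pass. The sketch below simulates this pattern in plain Python rather than real MPI calls; all names (`canopy_centers`, `parallel_canopy`) are illustrative assumptions, not the paper's implementation:

```python
import random

def canopy_centers(points, t2, dist):
    # Single Canopy pass that returns only the centers: each drawn
    # center removes all candidates within the tight threshold t2.
    candidates = list(points)
    centers = []
    while candidates:
        c = candidates.pop(random.randrange(len(candidates)))
        centers.append(c)
        candidates = [p for p in candidates if dist(c, p) >= t2]
    return centers

def parallel_canopy(points, n_ranks, t2, dist):
    # "Scatter": partition the dataset across simulated ranks.
    chunks = [points[r::n_ranks] for r in range(n_ranks)]
    # Each rank clusters its partition independently (no synchronization
    # during the local pass -- the embarrassingly parallel phase).
    local = [c for chunk in chunks for c in canopy_centers(chunk, t2, dist)]
    # "Gather" + merge: a second pass over all local centers at the root
    # collapses near-duplicate centers found independently by ranks.
    return canopy_centers(local, t2, dist)
```

In a real MPI implementation the two commented phases would map onto collectives such as `MPI_Scatterv` and `MPI_Gatherv`, and the merge step is exactly where naive versions serialize behind a barrier, which is the bottleneck the paper's optimizations target.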
