Large-Scale Clustering Using MPI-Based Canopy
Time: Monday, November 12th, 11am - 11:30am
Description: Analyzing massive amounts of data and extracting value from them has become key across disciplines. Many approaches have been developed to extract insight from the plethora of data available. As the amount of data grows rapidly, however, current analysis approaches struggle to scale. This is particularly true for clustering algorithms, which try to find patterns in the data.
A wide range of clustering approaches has been developed in recent years. What they all share is that they require parameters (number of clusters, size of clusters, etc.) to be set a priori. Typically, these parameters are determined through several iterations of trial and error or through pre-clustering algorithms. Several pre-clustering algorithms have been developed but, like clustering algorithms, they do not scale well to rapidly growing amounts of data.
In this paper, we thus take one such pre-clustering algorithm, Canopy, and develop a parallel version based on MPI. As we show, doing so is not straightforward: without optimization, a considerable amount of time is spent waiting for synchronization, severely limiting scalability. We thus optimize our approach to minimize the time cores spend idle or waiting at synchronization barriers. As our experiments show, our approach scales near-linearly with increasing dataset size.
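For readers unfamiliar with Canopy, the following is a minimal sequential sketch of the procedure as it is commonly described, not the MPI-based version presented in the paper. The function name `canopy`, the threshold names `t1` (loose) and `t2` (tight), and the use of Euclidean distance are assumptions made for illustration; the abstract does not specify them. In this variant, canopy members are drawn only from points that have not yet been removed.

```python
import math


def canopy(points, t1, t2):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose threshold, t2 the tight one (t1 > t2 > 0).
    Returns a list of (center_index, member_indices) pairs.
    """
    if not (t1 > t2 > 0):
        raise ValueError("thresholds must satisfy t1 > t2 > 0")

    remaining = list(range(len(points)))  # indices still eligible as centers
    canopies = []

    while remaining:
        center = remaining[0]  # any remaining point may serve as a center
        members = []
        still_remaining = []
        for i in remaining:
            d = math.dist(points[center], points[i])
            if d < t1:
                members.append(i)  # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(i)  # not tightly close: stays a candidate
        canopies.append((center, members))
        remaining = still_remaining  # the center itself (d == 0) is always dropped

    return canopies


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (10.0, 0.0), (10.4, 0.0)]
    for center, members in canopy(data, t1=2.0, t2=1.0):
        print(center, members)
```

The number of canopies returned can serve as a candidate for the number of clusters that a subsequent clustering algorithm requires a priori. Each pass of the loop depends on the set left over by the previous pass, which hints at why a parallel version needs synchronization between processes.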