PARALLEL ALGORITHMS FOR LARGE-SCALE GRAPH CLUSTERING ON DISTRIBUTED MEMORY ARCHITECTURES
MetadataShow full item record
Graph algorithms on parallel architectures present an interesting case study for irregular applications. We address one such irregular application -- one of clustering real world graphs constructed out of biological data and open-source communities data using parallel computers. While theoretical formulations of the clustering operation are either intractable or computationally prohibitive, efficient heuristics exist to tackle the problem in practice. Yet, implementing these heuristics under a parallel setting becomes a significant challenge owing to a combination of factors including: irregular data access and movement patterns, dependence of computational workload on the input, and a general need to maintain auxiliary pointer-based data structures. We present the design and evaluation of several parallel implementations of a popular serial graph clustering heuristic called the Shingling heuristic, which was originally developed by Gibson et al. Our MapReduce implementation, targets distributed memory clusters running Hadoop and MPI. We also extend the original algorithm to handle weighed graphs. Operating on an input graph that can be represented as a list of edges or adjacency list, our algorithm uses a combination of shuffling and sorting operations, and pipelined MapReduce stages to implement the various phases of the algorithm. As a concrete case for application, we apply the methods developed on large-scale biological graphs obtained from a metagenomic community. Experimental results show both qualitative and performance improvements over previous executions of a baseline version of the clustering method. We also compare our results against other popular generic tools designed for community detection. As another applied case study of our research, we design and evaluate a cluster-based approach for socio-technical coordination in open-source community networks. The research experience in both these domains serve to demonstrate the high utility of cluster-based approaches in scientific domains.