Parallel and Scalable Combinatorial String and Graph Algorithms on Distributed Memory Systems
TimeWednesday, November 14th8:30am - 5pm
DescriptionMethods for processing and analyzing DNA and genomic data are built upon combinatorial graph and string algorithms. The advent of high-throughput DNA sequencing is enabling the generation of billions of reads per experiment. Classical and sequential algorithms can no longer deal with these growing data sizes, which for the last 10 years have greatly out-paced advances in processor speeds. To process and analyze state-of-the-art genomic data sets require the design of scalable and efficient parallel algorithms and the use of large computing clusters. Here, we present our distributed-memory parallel algorithms for indexing large genomic datasets, including algorithms for construction of suffix- and LCP arrays, solving the All-Nearest-Smaller-Values problem and its application to the construction of suffix trees. Our parallel algorithms exhibit superior runtime complexity and practical performance compared to the state-of-the-art. Furthermore, we present distributed-memory algorithms for clustering de-Bruijn graphs and its application to solving a grand challenge metagenomic dataset.