Scalable Methods For Genome Assembly
MetadataShow full item record
De novo genome assembly is a fundamental problem in the field of computational biology. The goal is to reconstruct an unknown genome from short DNA fragments (called “reads”) obtained from it. Over the last decade, with the advent of numerous next-generation sequencing (NGS) platforms (e.g., Illumina, 454 Roche), billions of reads can be generated in a matter of hours, leading to vast amounts of data accumulation per day. This has necessitated efficient parallelization of the assembly process to meet the growing data demands. While multiple parallel solutions to the problem have been proposed in the past, there still exists a gap in terms of the processing power between massively parallel NGS technologies and the ability of current state-of-the-art assemblers to analyze and assemble large and complex genomes. Conducting genome assembly at scale remains a challenge owing to the intense computational and memory requirements of the problem, coupled with inherent complexities in existing parallel tools associated with data movement, use of complex data structures, unstructured memory accesses and repeated I/O operations. In this dissertation, we address the challenges of conducting genome assembly at scale and develop new methods for conducting extreme-scale genome assembly for microbial and complex eukaryotic genomes. Our approach to the problem is two-fold, wherein we make the following contributions: i) FastEtch- a new method targeting fast and space-efficient assemblies, using probabilistic data structures (Count-Min sketch) that executes efficiently on shared-memory platforms with a minimal computational footprint (both memory and time). ii) PaKman- a fully distributed method that tackles assembly of large genomes through the combination of a novel data-structure (PaK-Graph) and algorithmic strategies to simplify the communication and I/O footprint during the assembly process. We present an extensive performance and qualitative evaluation of both our algorithms including comparisons to other state-of-the-art methods. Our results demonstrate that FastEtch can yield one of the best time-memory-quality trade-offs, when compared against many state-of-the-art genome assemblers. PaKman has shown the ability to achieve near-linear speedups on up to 8K cores; outperform state-of-the-art distributed and shared memory tools in performance while delivering comparable (if not better) quality; and reduce time to solution significantly.