dc.contributor.advisor: Kalyanaraman, Ananth
dc.creator: Wu, Changjun
dc.date.accessioned: 2011-08-19T21:58:24Z
dc.date.available: 2011-08-19T21:58:24Z
dc.date.issued: 2011
dc.identifier.uri: http://hdl.handle.net/2376/2889
dc.description: Thesis (Ph.D.), Department of Electrical Engineering and Computer Science, Washington State University [en_US]
dc.description.abstract: Developing high-performance computing solutions for modern-day biological problems presents a unique set of challenges. The field is experiencing a data revolution due to the rapid introduction of several disruptive experimental technologies. Consequently, computational methods that analyze biological data are currently being put to the test in their capability to scale to massive data sizes. Added to this data-intensiveness is a brand of computation that is quite different in flavor from that of other, perhaps more traditional, scientific computing fields. The problems are dominated by integer arithmetic, string matching, combinatorial space exploration, and graph-theoretic formulations that introduce irregularity in computation and communication patterns. In this thesis, we report on our efforts to bridge the gap between biological data processing and high-performance computing solutions. Specifically, we focus on the problem of clustering very large collections of protein sequences on distributed-memory supercomputers. Given a set of amino acid sequences, we reduce the problem to one of constructing a sequence homology graph and subsequently detecting arbitrarily sized dense subgraphs. Our approach efficiently parallelizes this task on a distributed-memory machine through a combination of divide-and-conquer and combinatorial pattern matching heuristic techniques. Preliminary tests on an arbitrary collection of 2 million protein sequences from the Global Ocean Sampling project database reveal that our new approach is able to improve sensitivity and recruit more sequences while considerably reducing the time to solution and memory requirement. The algorithmic techniques developed as part of this research have wider applicability to other applications in computational biology wherever large-scale sequence analysis is the primary bottleneck. [en_US]
dc.description.sponsorship: Department of Computer Science, Washington State University [en_US]
dc.language.iso: English
dc.rights: In copyright
dc.rights: Publicly accessible
dc.rights: openAccess
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.rights.uri: http://www.ndltd.org/standards/metadata
dc.rights.uri: http://purl.org/eprint/accessRights/OpenAccess
dc.subject: Computer Science [en_US]
dc.subject: Bioinformatics [en_US]
dc.subject: Computational Biology [en_US]
dc.subject: Graph Algorithms [en_US]
dc.subject: Graph Construction [en_US]
dc.subject: High Performance Computing [en_US]
dc.subject: Sequence Clustering [en_US]
dc.title: Parallel Algorithms for Large-scale Computational Metagenomics
dc.type: Electronic Thesis or Dissertation

