Supporting the Computation Needs of Structural Genomics

Zach Miller Computer Sciences Department University of Wisconsin-Madison [email protected] http://www.cs.wisc.edu/condor Supporting the Computation Needs of Structural Genomics


Post on 22-Jan-2016



Page 1: Supporting the Computation Needs of Structural Genomics

Zach Miller
Computer Sciences Department
University of Wisconsin-Madison

[email protected]
http://www.cs.wisc.edu/condor

Supporting the Computation Needs of Structural Genomics


Overview

› What is structural genomics?

› Problems we are trying to solve

› Applications we use and how they interface with Condor

› Future work

› Conclusion


What is structural genomics?

› It is the branch of genomics that attempts to determine the three-dimensional structure of proteins.

› This often requires high-throughput computing.


Problems we are trying to solve

› Target selection – which protein sequences are interesting and worth spending time calculating structures of?
    BLAST

› Protein structure determination – what is the 3D shape of a given protein sequence?
    CNS
    CYANA


BLAST

› BLAST is developed and supported by NCBI, part of the NIH.

› The NCBI BLAST home page is http://www.ncbi.nlm.nih.gov/

› BLAST is a search tool with special allowances for incomplete data and partial matches.


BLAST target selection

› By comparing sets of whole or partial sequences against databases of known sequences, you can determine whether the sequence you are investigating is already part of another database.

› In this way you can determine the interesting sequences to work on.


BLAST and Condor

› Large BLAST searches are easily split into smaller chunks that can be executed in parallel.

› There are two basic approaches:
    Split the input query into smaller chunks (our approach)
    Split the database into smaller chunks (mpiBLAST approach)
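The query-splitting approach can be sketched in a few lines. This is only an illustration, not the framework's actual code: the function names and the fixed records-per-chunk policy are assumptions, and the input is assumed to be plain FASTA text in which each record starts with a ">" header line.

```python
from typing import Iterator, List


def read_fasta_records(text: str) -> Iterator[str]:
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record: List[str] = []
    for line in text.splitlines():
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield "\n".join(record) + "\n"
            record = []
        if line.strip():  # skip blank lines
            record.append(line)
    if record:
        yield "\n".join(record) + "\n"


def split_queries(text: str, records_per_chunk: int) -> List[str]:
    """Group FASTA records into chunks of at most records_per_chunk records."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    records = list(read_fasta_records(text))
    return [
        "".join(records[i:i + records_per_chunk])
        for i in range(0, len(records), records_per_chunk)
    ]


if __name__ == "__main__":
    # Hypothetical three-sequence query file, split two records per chunk.
    demo = ">seq1\nMKV\n>seq2\nGAT\nTTC\n>seq3\nPLQ\n"
    for n, chunk in enumerate(split_queries(demo, 2)):
        print(f"chunk {n}:\n{chunk}")
```

Each chunk is itself a valid FASTA file, so every chunk can be handed to an independent job and searched against the full database.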


BLAST and Condor

› Doing thousands of queries against multiple databases is easy using the Condor/BLAST framework.

› Features of the framework:
    Input queries can come from a file, ftp, or http
    Input queries can be in FASTA or XML format


BLAST and Condor

› More features of the framework:
    Databases can be local files or fetched automatically via ftp or http, in either FASTA or XML format
    Database indexes can be automatically built using formatdb
    Multiple input files are joined or split as appropriate to fine-tune throughput
    Output can be delivered via ftp
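To make the idea of one Condor job per query chunk concrete, here is a rough sketch that writes a submit description for a set of chunks. It is not the framework described in these slides. The chunk file names, the output names, and the assumption that the formatted database is already reachable on the execute machines are all hypothetical; `blastall` with `-p`, `-d`, `-i`, and `-o` is the legacy NCBI command line.

```python
def blast_submit_description(num_chunks: int, database: str,
                             program: str = "blastp") -> str:
    """Return Condor submit-description text that runs one BLAST job per chunk.

    Assumes (hypothetically) that the chunks are named chunk_0.fasta,
    chunk_1.fasta, ... and that the database was already indexed with formatdb.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    lines = [
        "universe = vanilla",
        # Assumed: a blastall binary sits in the submit directory.
        "executable = blastall",
        # $(Process) expands to 0, 1, 2, ... for each queued job.
        f"arguments = -p {program} -d {database} "
        "-i chunk_$(Process).fasta -o chunk_$(Process).out",
        "transfer_input_files = chunk_$(Process).fasta",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "output = blast_$(Process).stdout",
        "error = blast_$(Process).stderr",
        "log = blast.log",
        f"queue {num_chunks}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(blast_submit_description(4, "nr"))
```

A single `queue N` statement keeps the description short no matter how many chunks there are, because Condor substitutes the job index into each file name.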


Some statistics

› The BMRB here at the UW is using this framework to compare over 100,000 input sequences against five different databases:
    nr (2,726,333 sequences)
    pdb (50,137 sequences)
    pdboh (1,122 sequences)
    sg (53,986 sequences)
    bmrb (2,736 sequences)


Some statistics

› All in all, the BMRB is doing over 8 billion sequence comparisons for their weekly run.

› Condor completes this in roughly eight hours of wall-clock time.

› This is now a weekly routine which is fully automated, very reliable, and requires almost no “babysitting”.


Structure Calculation

› CNS
    Available from http://cns.csb.yale.edu/

› CYANA
    Available from http://www.guentert.com/

› Both do structure calculations but use different methods


CNS and Condor

› CNS can take a relatively long time to compute a given entry (protein sequence), depending on the number of possible intermediate structures.

› Each structure takes about 5 – 30 minutes, depending on the length of the sequence.

› At 200 structures per entry, this ends up being between 16 and 100 hours.
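The per-entry range follows directly from the per-structure times. A small worked check, using only the figures on this slide (the function name is just for illustration):

```python
def cns_hours_per_entry(minutes_per_structure: float,
                        structures_per_entry: int = 200) -> float:
    """Total CNS compute time for one entry, in hours."""
    return minutes_per_structure * structures_per_entry / 60.0


if __name__ == "__main__":
    low = cns_hours_per_entry(5)    # 200 structures x 5 minutes
    high = cns_hours_per_entry(30)  # 200 structures x 30 minutes
    print(f"{low:.1f} to {high:.1f} hours per entry")  # prints "16.7 to 100.0 hours per entry"
```

Because the 200 structures for one entry are independent calculations, this is compute time that can be spread across many machines rather than a fixed wall-clock cost.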


CYANA

› CYANA takes only about 2 – 16 hours per entry, depending on the sequence length.

› The CYANA results are post-processed with CNS to refine them, which takes an additional 4 – 20 hours per entry.


CNS, CYANA, and Condor

› Until now, each different group doing structure calculations would process their own entries using different programs or input parameters, making comparisons between different groups difficult.

› By processing large numbers of entries in exactly the same way, it is possible to then compare apples to apples.


CNS, CYANA, and Condor

› Working with the BMRB, I created a framework which allows you to easily process multiple entries at once with both CNS and CYANA.

› Using this framework, Condor calculated structures for 600 entries (about 50,000 hours) in just 10 days.
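As a back-of-the-envelope reading of those numbers, the sketch below works out what they imply on average. It assumes the 50,000 hours are total compute hours and that the load was spread evenly over the 10 days. Neither assumption is stated on the slide, so treat the results as rough averages only.

```python
def average_concurrent_cpus(cpu_hours: float, wall_clock_days: float) -> float:
    """Average number of CPUs busy at once, assuming an even load."""
    return cpu_hours / (wall_clock_days * 24.0)


def average_hours_per_entry(cpu_hours: float, entries: int) -> float:
    """Average compute hours spent on each entry."""
    return cpu_hours / entries


if __name__ == "__main__":
    cpus = average_concurrent_cpus(50_000, 10)
    per_entry = average_hours_per_entry(50_000, 600)
    print(f"about {cpus:.0f} CPUs busy on average")      # prints "about 208 CPUs busy on average"
    print(f"about {per_entry:.0f} hours per entry")      # prints "about 83 hours per entry"
```

On a single machine the same 50,000 hours would take more than five years, which is the practical case for running it as a high-throughput workload.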


CNS, CYANA, and Condor

› The structure calculation framework is also very reliable and requires very little human time to do a fairly massive amount of computing.

› This process can now be easily automated and done on a routine basis.


Challenges

› Creating a job flow that doesn’t need babysitting requires that the framework be able to handle a variety of problems.

› To this end, it employs some other Condor technologies:
    Many things are wrapped in ftsh.
    Condor watches for “misbehaving” jobs and kills them using the PERIODIC_REMOVE feature.
    DAGMan oversees the whole run and retries failed jobs.
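A minimal sketch of how the last two pieces might fit together is shown below. It generates a submit description with a `periodic_remove` expression and a DAGMan input file with `RETRY` lines. This is not the framework's actual configuration: the runtime limit, the file names, the entry identifiers, and the definition of "misbehaving" as "running longer than a time limit" are all assumptions made for illustration.

```python
from typing import List


def structure_submit_description(executable: str, max_runtime_seconds: int) -> str:
    """Submit description whose job is removed if it runs past a time limit."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        # $(entry_id) is supplied per node by the VARS line in the DAG file.
        "arguments = $(entry_id)",
        "log = structure.log",
        # JobStatus == 2 means the job is running; remove it once it has
        # been in that state longer than the (assumed) limit.
        "periodic_remove = (JobStatus == 2) && "
        f"((CurrentTime - EnteredCurrentStatus) > {max_runtime_seconds})",
        "queue",
    ]
    return "\n".join(lines) + "\n"


def dag_description(entry_ids: List[str], submit_file: str, retries: int) -> str:
    """DAGMan input file with one node per entry, each retried on failure."""
    lines: List[str] = []
    for entry in entry_ids:
        lines.append(f"JOB {entry} {submit_file}")
        lines.append(f'VARS {entry} entry_id="{entry}"')
        lines.append(f"RETRY {entry} {retries}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical script name, limit, and entry identifiers.
    print(structure_submit_description("run_structure.sh", 7200))
    print(dag_description(["entry_a", "entry_b"], "structure.sub", 3))
```

The two mechanisms complement each other: a job removed by `periodic_remove` counts as a failed node, so DAGMan resubmits it until the retry count is used up.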


Future Work

› BLAST
    Use STORK for data transfer, which will improve reliability of all file transfers and instantly add support for many more methods of transferring input and output.
    Create a wrapper around the framework which behaves just like NCBI’s BLAST but uses Condor behind the scenes.
    Include this framework with the Condor distribution so it is BLAST-ready “out of the box”.


Future Work

› CNS & CYANA
    Use sequence length to better estimate runtime for fine-tuning throughput.
    Use STORK for file transfer.


Conclusion

› I have created tools which allow users to run coordinated BLAST, CNS, and CYANA runs on very large scales.

› This makes it easy to process not only your data but other groups’ too, and end up with results that were all computed with the same protocols and inputs.

› This will enable better collaboration by providing more consistency between the results of different groups.