lecture6.pptx
![Page 1: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/1.jpg)
Lecture 6: Parallel computing, cloud computing and working on Amazon Web Services
Greg Caporaso
![Page 2: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/2.jpg)
Some last thoughts on regular expressions
![Page 3: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/3.jpg)
Robust searches
• Sometimes your queries will fail
  – Won’t produce output (good)
  – Will produce incorrect output (bad)
• Fail loudly! Produce a (useful) error message on failure.
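A minimal sketch of failing loudly in Python, assuming the query is a regular expression applied to a string. The function name `find_first` and the example pattern are illustrative and not from the lecture.

```python
import re


def find_first(pattern, text):
    """Return the first match of pattern in text, or raise a useful error."""
    match = re.search(pattern, text)
    if match is None:
        # Fail loudly: report what was searched for and show part of the input,
        # rather than silently returning nothing.
        raise ValueError(
            f"No match for pattern {pattern!r} in text starting with {text[:30]!r}"
        )
    return match.group(0)
```

Calling `find_first(r"PC\.\d+", ">PC.634_1 FLP3FBN01ELBSX")` returns the sample ID, while a pattern that matches nothing stops the program with a message naming the pattern.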
![Page 4: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/4.jpg)
Designing robust searches
• Make assumptions explicit
  – If you’re assuming that your records start with ‘>’, search for ‘^>’ to avoid matching ‘>’ characters that show up in other places
• Match full lines by including ^ and $ in your search query
• Check the number of matches that were made: is it reasonable?
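The three points above can be sketched together in Python. This is one possible approach, assuming FASTA-style text held in a string; `find_record_headers` and its `expected_count` parameter are hypothetical names introduced for illustration.

```python
import re


def find_record_headers(fasta_text, expected_count=None):
    """Return header lines, matching only full lines that start with '>'."""
    # re.MULTILINE makes ^ and $ match at the start and end of every line,
    # so a '>' in the middle of a line is not treated as a record start.
    headers = re.findall(r"^>.+$", fasta_text, flags=re.MULTILINE)
    # Sanity check: is the number of matches what we expected?
    if expected_count is not None and len(headers) != expected_count:
        raise ValueError(
            f"Expected {expected_count} records, found {len(headers)}"
        )
    return headers
```

Passing `expected_count` turns the "is it reasonable?" check into an automatic one when the number of records is known in advance.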
![Page 5: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/5.jpg)
Testing of software
• Start thinking about what positive and negative controls for these terms might look like. Software testing is something we’ll be discussing regularly throughout the semester.
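As a hedged illustration of what controls might look like for a search term: a positive control is input that must match, and a negative control is input that must not. The pattern below (a header is '>' followed by a non-space ID) is an assumption made for this sketch, and `is_header` is a hypothetical helper.

```python
import re

# Assumption for this sketch: a header line is '>' followed by a non-space ID.
HEADER_PATTERN = re.compile(r"^>\S+")


def is_header(line):
    """Return True if the line looks like a sequence header."""
    return HEADER_PATTERN.match(line) is not None


# Positive control: a line that must be recognized as a header.
assert is_header(">PC.634_1 FLP3FBN01ELBSX")

# Negative controls: lines that must not be recognized as headers.
assert not is_header("CTGGGCCGTGTCTCAG")  # sequence data
assert not is_header("score>10")          # '>' appears mid-line
assert not is_header(">")                 # '>' with no ID
```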
![Page 6: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/6.jpg)
Why is parallel computing important in bioinformatics?
![Page 7: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/7.jpg)
Cluster computing
• Many computers connected to one another to serve as a larger compute resource.
• Compute-intensive jobs can be split over many systems and run in parallel.
• Similar to desktop compute hardware, but different casing, no (or only few) displays/keyboards directly connected.
• Owned and maintained “in-house”.
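The idea of splitting a job and running the pieces in parallel can be sketched on a single machine with Python's standard library. This is only a small-scale analogy to a cluster, not a cluster scheduler; the job (counting G and C bases) and all function names are chosen for illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def gc_count(sequences):
    """Count G and C bases in one chunk of sequences (the per-worker job)."""
    return sum(seq.count("G") + seq.count("C") for seq in sequences)


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous chunks of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_gc_count(sequences, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Split the input, process each chunk in its own worker, combine results."""
    chunks = split_into_chunks(sequences, n_workers)
    # executor_cls is a parameter so another executor from concurrent.futures
    # (e.g. ThreadPoolExecutor) can be substituted when trying the code out.
    with executor_cls(max_workers=n_workers) as executor:
        return sum(executor.map(gc_count, chunks))
```

When using the default process pool from a script, call `parallel_gc_count` inside an `if __name__ == "__main__":` block so that worker processes can import the module safely.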
![Page 8: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/8.jpg)
Why is parallel computing important in bioinformatics?

| Platform | Sanger | 454 (Titanium) | Illumina Genome Analyzer II | Illumina HiSeq2000 | Illumina MiSeq |
|---|---|---|---|---|---|
| Read length (bases) | ~1000 | ~400 | 150 (single end) | 100 (single end) | 150 (single end); 250 soon |
| Number of reads | 96 or 384 | ~1,000,000 | ~100,000,000 | ~1,600,000,000 | ~10,000,000 |
| Maximum number of samples per run | n/a | 1000 | 12,000 (barcode-limited) | 24,000 (barcode-limited) | 2500 (barcode-limited) |
| Sequences per $1 (sequencing costs only) | 0.44 | 100 | 5000 | 200,000 | 12,500 |
![Page 9: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/9.jpg)
The “benchtop” sequencer
![Page 10: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/10.jpg)
OTU picking: example of a compute intensive process
What taxa are represented in each sample?
>PC.634_1 FLP3FBN01ELBSX
CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTCAGGCCGGCTACGCATCATCGCCTTGGTGGGCCGTTACCTCACCAACTAGCTAATGCGCCGCAGGTCCATCCATGTTCACGCCTTGATGGGCGCTTTAATATACTGAGCATGCGCTCTGTATACCTATCCGGTTTTAGCTACCGTTTCCAGCAGTTATCCCGGACACATGGGCTAGG
>PC.634_2 FLP3FBN01EG8AX
TTGGACCGTGTCTCAGTTCCAATGTGGGGGCCTTCCTCTCAGAACCCCTATCCATCGAAGGCTTGGTGGGCCGTTACCCCGCCAACAACCTAATGGAACGCATCCCCATCGATGACCGAAGTTCTTTAATAGTTCTACCATGCGGAAGAACTATGCCATCGGGTATTAATCTTTCTTTCGAAAGGCTATCCCCGAGTCATCGGCAGGTTGGATACGTGTTACTCACCCGTGCGCCGGT
>PC.354_3 FLP3FBN01EEWKD
TTGGGCCGTGTCTCAGTCCCAATGTGGCCGATCAGTCTCTTAACTCGGCTATGCATCATTGCCTTGGTAAGCCGTTACCTTACCAACTAGCTAATGCACCGCAGGTCCATCCAAGAGTGATAGCAGAACCATCTTTCAAACTCTAGACATGCGTCTAGTGTTGTTATCCGGTATTAGCATCTGTTTCCAGGTGTTATCCCAGTCTCTTGGG
...
Reference tree of non-redundant full-length sequences
BLAST against reference tree
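Before any reads can be compared against a reference, the input has to be read and grouped by sample. The sketch below covers only that first step, not OTU picking or BLAST. It assumes read IDs have the form `<sample>_<read number>`, as in the records on this slide; `parse_fasta` and `reads_per_sample` are illustrative names.

```python
from collections import Counter


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            # Fail loudly on input that does not start with a header line.
            raise ValueError(f"Sequence data before first header: {line[:30]!r}")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reads_per_sample(records):
    """Count reads per sample, assuming IDs look like '<sample>_<read number>'."""
    counts = Counter()
    for header, _sequence in records:
        read_id = header.split()[0]            # e.g. 'PC.634_1'
        sample_id = read_id.rsplit("_", 1)[0]  # e.g. 'PC.634'
        counts[sample_id] += 1
    return counts
```

With a file on disk, `reads_per_sample(parse_fasta(open(path)))` would give the number of reads per sample, where `path` is whatever FASTA file is being analyzed.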
![Page 11: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/11.jpg)
OTU picking: example of a compute intensive process
What taxa are represented in each sample?
Clusters of “Operational Taxonomic Units” (OTUs); per-sample hits on reference tree; taxonomic assignments
BLAST against reference tree
Reference tree of non-redundant full-length sequences
![Page 14: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/14.jpg)
OTU Picking
• For 1 billion sequence reads, the initial step ran for ~116 hours on 110 processors, requiring 4 GB of RAM per worker job and 64 GB of RAM for the master job.
• So, on a single-processor desktop with 64 GB of RAM: 116 hours × 110 processors = 12,760 hours, or about 532 days (assuming the work scales linearly)!
• One HiSeq2000 generates this data in a week!
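The single-processor estimate is simple arithmetic on the figures from the slide. The only assumption is that run time scales linearly with the number of processors.

```python
# Figures from the slide.
wall_clock_hours = 116   # run time on the cluster
processors = 110         # processors used in parallel

# Assuming linear scaling, one processor needs the total CPU time.
cpu_hours = wall_clock_hours * processors
single_processor_days = cpu_hours / 24

print(f"{cpu_hours} hours, or about {single_processor_days:.0f} days")
```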
![Page 15: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/15.jpg)
Cloud computing
• Implemented on a cluster (or grid), but compute power is rented as a service to support arbitrary applications.
![Page 16: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/16.jpg)
Maintaining hardware is expensive
• Temperature (redundant cooling systems)
• Redundant network connections
• Hardware maintenance (e.g., replacing hard drives)
• Fire suppression
• Back-up power
• System administrator ($$)
![Page 17: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/17.jpg)
Pay-as-you-go compute power
• Public clouds (e.g., Amazon) rent compute resources
• Log in, boot virtual machine image, run analyses, and terminate instance.
• Cheaper for many tasks than buying, maintaining, and supporting a compute cluster.
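Whether renting is cheaper depends on how many compute hours are needed. The sketch below shows one way to frame the comparison. The cost model is deliberately simplified and every input is supplied by the caller, so no prices here come from the lecture or from any provider.

```python
def rental_cost(hours, hourly_rate):
    """Cost of renting compute for a given number of hours."""
    return hours * hourly_rate


def ownership_cost(purchase_price, yearly_upkeep, years):
    """Simplified cost of owning hardware: purchase plus yearly upkeep."""
    return purchase_price + yearly_upkeep * years


def break_even_hours(purchase_price, yearly_upkeep, years, hourly_rate):
    """Rented hours at which renting costs the same as owning."""
    return ownership_cost(purchase_price, yearly_upkeep, years) / hourly_rate
```

Below the break-even number of hours, renting is the cheaper option under this model; sporadic users typically fall well below it.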
![Page 18: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/18.jpg)
Types of cloud offerings
• Applications/SaaS (e.g., Google Docs, Gmail, Dropbox, iCloud)
• Computing platform/PaaS (e.g., Google App Engine)
• Raw compute resources/IaaS (e.g., Amazon Elastic Compute Cloud (EC2))
![Page 19: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/19.jpg)
Cloud computing options
• Amazon Elastic Compute Cloud (EC2)
• Magellan – Argonne's DOE Cloud Computing
• Data Intensive Academic Grid (DIAG) – Institute for Genome Sciences (IGS), University of Maryland School of Medicine (UMSOM)
![Page 20: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/20.jpg)
Interacting with the Amazon Cloud
• Boot virtual machine image via web interface (or a third-party tool like StarCluster).
• Log in and work via terminal (or via web interface with IPython Notebook).
• Move data back and forth via sftp/scp or a graphical sftp client (e.g., Cyberduck [free/cross-platform])
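Moving a file to a running instance with scp can be scripted. The sketch below builds the scp command and runs it with `subprocess`. The user name, host name, key file and paths in the usage note are placeholders, not values from the lecture.

```python
import subprocess


def build_scp_upload(local_path, user, host, remote_path, key_file=None):
    """Build the argument list for copying a local file to a remote host."""
    command = ["scp"]
    if key_file is not None:
        # -i selects the private key (identity file) used to authenticate.
        command += ["-i", key_file]
    command += [local_path, f"{user}@{host}:{remote_path}"]
    return command


def upload(local_path, user, host, remote_path, key_file=None):
    """Run scp; raises subprocess.CalledProcessError if the copy fails."""
    command = build_scp_upload(local_path, user, host, remote_path, key_file)
    subprocess.run(command, check=True)
```

For example, `upload("reads.fna", "ubuntu", "ec2-host.example.com", "/home/ubuntu/", key_file="my-key.pem")` would run `scp -i my-key.pem reads.fna ubuntu@ec2-host.example.com:/home/ubuntu/`, with all of those values replaced by the details of your own instance.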
![Page 21: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/21.jpg)
Virtual machines
• A “guest” operating system running within a “host” operating system
• A software implementation of a computer that operates like a physical computer.
• A developer can create a virtual machine image which contains their tools pre-installed. Users can then instantiate that image to work with those tools.
Browse this page: http://en.wikipedia.org/wiki/Virtual_machine
![Page 22: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/22.jpg)
Benefits that virtual machines offer bioinformatics
• Reproducibility: can publish protocols with a virtual machine instance ID.
• Updates are the burden of the developer, not the user.
• Coupled with cloud computing, it’s the perfect model for users with sporadic compute needs.
![Page 23: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/23.jpg)
EC2 costs: www.ec2instances.info
![Page 24: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/24.jpg)
QIIME virtual machine
• The QIIME package distributes an EC2 virtual machine with QIIME and its (many) dependencies pre-installed.
• Dependencies include commonly used tools like BLAST, muscle, FastTree, uclust, IPython, and a lot more. A partial list is available here: http://qiime.org/install/install.html
• Latest machine identifier can always be found at: http://qiime.org/home_static/dataFiles.html
![Page 25: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/25.jpg)
I think there is a world market for maybe five computers. - Thomas Watson, IBM Founder, 1943
![Page 26: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/26.jpg)
I think there is a world market for maybe five computers. - Thomas Watson, IBM Founder, 1943
Units sold by year (all figures are in units of 1000). http://jeremyreimer.com/postman/node/329
![Page 27: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/27.jpg)
The democratization of DNA sequencing
Affordable sequencing + Cloud computing + Open-source software
![Page 28: Lecture6.pptx](https://reader033.vdocument.in/reader033/viewer/2022060115/557d1535d8b42a4a498b4825/html5/thumbnails/28.jpg)
This work is licensed under the Creative Commons Attribution 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA. Feel free to use or modify these slides, but please credit me by placing the following attribution information where you feel that it makes sense: Greg Caporaso, www.caporaso.us.