cloud computing - si.biostat.washington.edu · cloud computing stephanie gogarten adapted from...
Cloud Computing
Stephanie Gogarten
Adapted from material by David Levine and Roy Kuraisa
Not powerful enough for WGS data
• Memory
• CPU
• Disk space
Single server OK for smaller data sets
Large WGS data sets belong on a cluster
What is a Cluster?
• Hardware
  – Many computers (instances), each with
    • Multiple processors (cores)
    • Own shared memory
  – Shared file system
  – Network connectivity
• Software
  – Linux OS
  – Queuing system (SGE)
  – Jobs execute independently
• Pros: Many cores and lots of memory
• Cons: Responsible for managing parallelism
* Standard distributed-memory
Where to get a cluster?
• Owning is expensive, so rent (Cloud)
• Pros
  – No/low infrastructure costs
  – Pay-per-use model
  – Scalable with increasing data set sizes
  – Variety of computers (RAM, CPU, disk, GPU)
  – Minimal management
  – Automatic software updates
  – Reliability and disaster recovery
Where to get a cluster?
• Owning is expensive, so rent (Cloud)
• Cons
  – Ongoing monthly costs
  – Pay for debug runs, failed runs, instances left running
  – You are your own IT person (or still need one)
  – Manage much of your own security
  – Extra effort to minimize costs
  – Cloud vendor lock-in
Unless you use a managed genomics platform
Where to get a cluster?
• Cloud-based genomics platforms
• Pro: ease of use
• Con: apps/workflows may be platform-specific
Managing Pipeline Parallelism
• Dependencies
• Synchronization
• Heterogeneity
• Autoscaling
• Retry
Cloud environments add cost complexity
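[Figure: pipeline job graph. Job 1 is followed by Job 2, which runs once per segment (chr1 seg 1 ... chr1 seg n, through chr23 seg 1 ... chr23 seg n). Job 3 combines the segments of each chromosome (chr1 combine ... chr23 combine). Job 4 combines all chromosomes, and Job 5 outputs the results.]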
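The job graph above can be sketched in Python as a dependency map that is released in waves: a job becomes runnable once every job it depends on has finished. This is only an illustration of dependencies and synchronization, not the pipeline's own scheduler. The job names, the two-chromosome/two-segment layout, and the helpers `build_jobs` and `run_order` are hypothetical. It uses `graphlib` from the Python 3.9+ standard library.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_jobs(chromosomes, n_segments):
    """Return {job: set of jobs it depends on} for a fan-out/fan-in pipeline."""
    deps = {"job1": set()}
    for c in chromosomes:
        segs = [f"job2_chr{c}_seg{s}" for s in range(1, n_segments + 1)]
        for seg in segs:
            deps[seg] = {"job1"}                  # every segment job waits for job 1
        deps[f"job3_chr{c}_combine"] = set(segs)  # combine waits for all of its segments
    deps["job4_combine_all"] = {f"job3_chr{c}_combine" for c in chromosomes}
    deps["job5_output"] = {"job4_combine_all"}
    return deps


def run_order(deps):
    """Group jobs into waves; jobs in the same wave could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # synchronization point: the whole wave has finished
    return waves


waves = run_order(build_jobs(chromosomes=[1, 2], n_segments=2))
for i, wave in enumerate(waves, start=1):
    print(i, wave)
```

In practice a queuing system such as SGE or a service such as AWS Batch tracks these dependencies itself; the sketch only shows why the segment jobs can run concurrently while the combine steps must wait.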
Managing Pipeline Parallelism
• Explicit management (command line tools)
  – Python, JSON
  – AWS Batch
• Embedded in a genomics application (GUI)
  – Seven Bridges, DNAnexus, Galaxy, Terra
  – Mitigate complexity
  – Centralize data access
WGS major computational need
• Run one time
  – VCF to GDS file conversion
• Run a few times
  – Relatedness analysis
• Run many times
  – Association testing
What influences cloud costs
• Number of samples
• Number of variants & filtering
• Number of variants per aggregation unit
• Algorithm: Single variant, Aggregate
• Implementation: sparse matrices, fastSKAT
• Cloud hardware used (cores, RAM, disk)
UW-GAC analysis pipeline
• https://github.com/UW-GAC/analysis_pipeline
• TopmedPipeline R package
• R scripts for various analysis tasks
• Python scripts submit R scripts to a cluster or cloud environment
• TopmedPipeline.py defines cluster environments
Cluster class definitions
[Figure: class hierarchy diagram with Cluster at the top, SGE_Cluster and AWS_Batch below it, and UW_Cluster and AWS_Cluster on the bottom row.]
• All Cluster objects have a submitJob method
• Cluster defaults set in JSON file
  – Users can create custom JSON files to override default parameters
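A minimal sketch of this pattern, assuming nothing about the real TopmedPipeline.py beyond what the slide states: a base class holds defaults that a user-supplied JSON document can override, and each subclass provides its own `submitJob`. The default keys and values, the queue names, and the script name are hypothetical. JSON is passed as a string here only to keep the example self-contained, and the `qsub` command is built but not run.

```python
import json


class Cluster:
    """Base class: holds configuration and defines the common interface."""

    # Hypothetical defaults; the pipeline keeps its real defaults in a JSON file.
    DEFAULTS = {"queue": "default.q", "memory_gb": 8, "cores": 1}

    def __init__(self, user_cfg_json=None):
        self.cfg = dict(self.DEFAULTS)
        if user_cfg_json:
            # Keys present in the user's JSON replace the corresponding defaults.
            self.cfg.update(json.loads(user_cfg_json))

    def submitJob(self, job_name, cmd):
        raise NotImplementedError("subclasses decide how a job is submitted")


class SGE_Cluster(Cluster):
    def submitJob(self, job_name, cmd):
        # Build (but do not execute) an SGE qsub command line.
        return ["qsub", "-N", job_name, "-q", self.cfg["queue"], cmd]


cluster = SGE_Cluster('{"queue": "long.q"}')
print(cluster.cfg)
print(cluster.submitJob("assoc_chr1", "assoc.sh"))
```

Because every environment exposes the same `submitJob` method, an analysis script can stay unchanged when the cluster class is swapped.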
Analysis configuration
• Every python script requires a configuration file (space-delimited plain text)
• Parameters include input and output file names, job-specific arguments
• Python scripts create intermediate config files to pass to each R script
• Examples in testdata directory (e.g., testdata/assoc_window_burden.config):
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
null_model_file "testdata/null_model.RData"
null_model_params "testdata/null_model.params"
variant_include_file "testdata/variant_include_chr .RData"
alt_freq_max "0.1"
test "burden"
test_type "score"
genome_build "hg19"
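As an illustration of the file format, a space-delimited config of this kind can be read into a dictionary with a few lines of Python. This is a sketch, not the pipeline's own reader: the function name `read_config` is hypothetical, and treating lines that start with `#` as comments is an assumption. `shlex.split` is used so that quoted values containing a space stay in one piece.

```python
import shlex

# A few lines copied from the example config above.
EXAMPLE = '''\
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
alt_freq_max "0.1"
test "burden"
'''


def read_config(text):
    """Parse 'key value' lines into a dict; quoted values may contain spaces."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments (comment syntax is an assumption)
        key, value = shlex.split(line)  # the quotes are removed, inner spaces are kept
        config[key] = value
    return config


config = read_config(EXAMPLE)
print(config["gds_file"])
```

All values are kept as strings, matching the quoted form in the file; a caller would convert numeric parameters such as `alt_freq_max` where needed.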
Analysis Pipeline General Architecture
[Figure: architecture diagram. On the local computer (e.g., head node), Analysis.py takes an Analysis Cfg File and a Cluster Cfg File and uses a Cluster Environment Class (.py) for Local Cluster, AWS Batch, or Other Env to obtain a Cluster Environment Object, which provides job references and control. The cluster environment (e.g., SGE) contains the job scheduler, job queues, and the cluster of computers. Network shared data (e.g., NFS) holds the input data, config data, and output data.]
Parallelization
• By chromosome
• By segment
  – The genome is divided into segments based on length or number of requested segments
  – Default segment length is 10 Mb
  – Each chromosome spawns a job per segment
  – Segments are combined into one file per chromosome
• Multithreading
  – Some jobs allow multithreading, where the user can request that the job be divided among N cores
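Splitting by segment length can be sketched as follows. This is an assumption-laden illustration rather than the pipeline's define_segments step: the function below simply cuts each chromosome into consecutive pieces of at most 10 Mb, and the chromosome lengths are made up to keep the output short. The real pipeline may place segment boundaries differently.

```python
def define_segments(chrom_lengths, seg_length=10_000_000):
    """Split each chromosome into consecutive segments of at most seg_length bp."""
    segments = []
    for chrom, length in chrom_lengths.items():
        start = 1
        while start <= length:
            end = min(start + seg_length - 1, length)  # last segment may be shorter
            segments.append((chrom, start, end))
            start = end + 1
    return segments


# Hypothetical chromosome lengths, chosen only for illustration.
segments = define_segments({"1": 25_000_000, "2": 10_000_000})
for seg in segments:
    print(seg)
```

Each tuple would correspond to one job; the per-chromosome combine step then waits for all segments that share the same chromosome.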
Available scripts
• Conversion to GDS
  – vcf2gds.py
• Relatedness and population structure
  – grm.py
  – ld_pruning.py
  – king.py
  – pcair.py
  – pcrelate.py
• Association tests
  – null_model.py
  – assoc.py
  – locuszoom.py
Flow chart: pcair.py
[Figure: job flow chart with nodes ld_pruning_chr1, ld_pruning_chr2, ..., ld_pruning_chr22, combine_variants, find_unrelated, pca_byrel, pca_plots, pca_corr, pca_corr_plots.]
Flow chart: assoc.py
[Figure: job flow chart with nodes null_model, define_segments, assoc_chr1_seg1 ... assoc_chr1_segN through assoc_chr23_seg1 ... assoc_chr23_segN, assoc_combine_chr1 ... assoc_combine_chr23, assoc_plots, assoc_report.]
Managing software dependencies
• R compiled with Intel MKL
• Bioconductor packages
  – SeqArray
  – SeqVarTools
  – SNPRelate
  – GENESIS
• CRAN packages
  – argparser (argument parsing for R scripts)
  – dplyr, tidyr (data frame manipulation)
  – ggplot2, GGally (plotting)
• Python 2.7
• Command-line software
  – bcftools
  – plink
  – king
What is Docker?
• Platform for developing, deploying and running applications or systems
• A Docker image is:
  – built containing all software necessary to run the application
    • Usually built from a base image (e.g., ubuntu)
    • Includes all additional software to support an application or system (e.g., GNU C/C++, Python)
    • Typically composed of multiple layers (e.g., ubuntu layer, development tools layer, R layer)
  – a read-only template used to create a Docker container
What is Docker?
• A Docker container is:
  – a runnable instance of an image on a local or host computer (e.g., Windows 10, macOS, Ubuntu)
  – what the image becomes in memory when executed
  – runs natively on Linux
  – runs inside a virtual machine on macOS and Windows
  – considered stateless: when the container stops, all changes to code and data are discarded (except for data on the local host that is mapped to the container)
What is Docker?
• What about accessing data on local host?
  – Data is typically not included in the Docker image
  – Data accessible on the local host can be mapped¹ (or bind mounted) to the Docker container
  – Any changes to data that is mapped to the local host persist when the Docker container stops
¹ On macOS, file sharing is specified in the Docker Preferences
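Mapping a host directory into a container is done with the `-v` option of `docker run`. The sketch below only builds and prints such a command; it does not start a container. The image name and host path are hypothetical placeholders, and the helper `docker_run_command` is introduced here for illustration.

```python
import shlex


def docker_run_command(image, host_dir, container_dir, command):
    """Build a `docker run` argument list that bind mounts host_dir."""
    return [
        "docker", "run",
        "--rm",                               # remove the container when it exits
        "-v", f"{host_dir}:{container_dir}",  # bind mount: host path -> container path
        "-w", container_dir,                  # start in the mounted directory
        image,
    ] + list(command)


args = docker_run_command(
    image="example/analysis-image",   # hypothetical image name
    host_dir="/home/user/project",    # hypothetical host path
    container_dir="/data",
    command=["ls", "-l"],
)
print(shlex.join(args))
```

Files written under `/data` inside the container land in the host directory and remain after the container stops; anything written elsewhere in the container is discarded with it.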
Docker images
• Docker Image: Ubuntu-hpc
  – Ubuntu 16.04
  – Develop Tools (e.g., C++)
  – HPC (e.g., MKL)
• Docker Image: r-3.5.1
  – Ubuntu-hpc
  – R 3.5.1
• Docker Image: topmed
  – r-3.5.1
  – Analysis Pipeline, R Packages
https://hub.docker.com/u/uwgac (images)
https://github.com/UW-GAC/docker (Dockerfiles to build images)
Docker container
[Figure: a docker container (r-3.5.1 plus Analysis Pipeline and R Packages) on a Linux, macOS or Windows computer, with a Data Volume.]
Docker on the cloud
[Figure: AWS Cloud. The Batch API leads to the BATCH Services (Queue Definitions, Job Definitions, Compute Environments), which auto scale and launch computer instances, each running a Docker image/container.]
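To make the roles of the queue and job definitions concrete, the sketch below assembles the keyword arguments for the AWS Batch SubmitJob call, including a dependency on earlier jobs. It is not taken from the pipeline. The queue name, job definition name, job IDs, and command are hypothetical, and the helper `batch_job_request` is introduced only for illustration. Nothing is sent to AWS; the commented lines show how the request could be submitted with boto3.

```python
def batch_job_request(name, queue, definition, command, depends_on=()):
    """Build keyword arguments for the AWS Batch SubmitJob API call."""
    request = {
        "jobName": name,
        "jobQueue": queue,            # queue that receives the job
        "jobDefinition": definition,  # names the Docker image and default resources
        "containerOverrides": {"command": list(command)},
    }
    if depends_on:
        # The job is held until the listed jobs have completed successfully.
        request["dependsOn"] = [{"jobId": job_id} for job_id in depends_on]
    return request


request = batch_job_request(
    name="assoc_combine_chr1",
    queue="example-queue",                                  # hypothetical queue name
    definition="example-job-def",                           # hypothetical job definition
    command=["Rscript", "assoc_combine.R", "chr1.config"],  # hypothetical command
    depends_on=["id-of-seg1-job", "id-of-seg2-job"],        # hypothetical job IDs
)
print(request)

# With boto3 installed and AWS credentials configured, the request could be sent as:
#   import boto3
#   response = boto3.client("batch").submit_job(**request)
#   job_id = response["jobId"]
```

The compute environment attached to the queue then decides which instances to launch, so the submitting script never addresses individual machines.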