New AWS Services for Bioinformatics
![Page 1: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/1.jpg)
Lynn Langit
New AWS Services for Bioinformatics Pipelines
Feb 2017
![Page 2: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/2.jpg)
New AWS Services
• Useful for scaling bioinformatics pipelines
• Announced at re:Invent (Nov 2016)
• Athena
• Step Functions
• Batch
• Glue
• QuickSight
![Page 3: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/3.jpg)
Starting Point for CSIRO
![Page 4: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/4.jpg)
Serverless AWS Lambda Application
![Page 5: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/5.jpg)
Public Genomic Datasets
![Page 6: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/6.jpg)
About AWS Athena
Serverless SQL queries on S3 data
![Page 7: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/7.jpg)
![Page 8: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/8.jpg)
AWS Athena Information
• Add table (structure) to database via DDL from input file(s)
• Write and execute SQL query
• Optionally save query
• Optionally review query history
• View results
• Optionally download result set to .csv
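As a rough illustration of the first two steps, the sketch below builds a Hive-style DDL statement and a SQL query as plain strings. The database, table, and column names and the S3 path are all hypothetical; the slides do not show the actual schema used in the demo.

```python
# Hypothetical schema for a tab-delimited variants file; not taken from the slides.
COLUMNS = [
    ("chrom", "string"),
    ("pos", "bigint"),
    ("ref", "string"),
    ("alt", "string"),
]


def create_table_ddl(database, table, columns, s3_location):
    """Return DDL that maps a table definition onto files already stored in S3."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        f"LOCATION '{s3_location}'"
    )


def count_by_chromosome_sql(database, table):
    """Return a query that counts variants per chromosome."""
    return (
        f"SELECT chrom, COUNT(*) AS n_variants\n"
        f"FROM {database}.{table}\n"
        f"GROUP BY chrom\n"
        f"ORDER BY n_variants DESC"
    )


ddl = create_table_ddl("genomics", "variants", COLUMNS, "s3://example-bucket/variants/")
query = count_by_chromosome_sql("genomics", "variants")
print(ddl)
print(query)
```

Both strings could be pasted into the Athena query editor; the table is only metadata, so the data itself stays in S3.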
![Page 9: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/9.jpg)
Athena - Demo
![Page 10: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/10.jpg)
Athena Genomics Query Example
![Page 11: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/11.jpg)
About AWS Step Functions
Serverless visual workflows for Lambdas
![Page 12: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/12.jpg)
AWS Step Functions
1. Define steps and services (activities or lambdas)
2. Verify step execution(s)
3. Monitor and scale
“Your application as a state machine.”
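To make the state machine idea concrete, here is a minimal sketch of a two-step definition in the JSON-based Amazon States Language, written as a Python dict. The state names, account ID, and Lambda function names are placeholders and do not come from the demo.

```python
import json

# Lambda ARNs are placeholders; the account ID and function names are hypothetical.
definition = {
    "Comment": "Two-step pipeline sketch",
    "StartAt": "AlignReads",
    "States": {
        "AlignReads": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:align-reads",
            "Next": "CallVariants",
        },
        "CallVariants": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:call-variants",
            "End": True,
        },
    },
}


def walk(defn):
    """Follow Next links from StartAt until a state marked End is reached."""
    order, name = [], defn["StartAt"]
    while True:
        state = defn["States"][name]
        order.append(name)
        if state.get("End"):
            return order
        name = state["Next"]


steps = walk(definition)
definition_json = json.dumps(definition, indent=2)
print(steps)
print(definition_json)
```

The `walk` helper only handles a linear chain of Task states. Real definitions can also branch, retry, and run states in parallel, which this sketch ignores.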
![Page 13: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/13.jpg)
AWS Step Functions – 1. Define Steps/Services
![Page 14: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/14.jpg)
AWS Step Functions – 2. Verify step execution
![Page 15: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/15.jpg)
Step Functions - Demo
![Page 16: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/16.jpg)
![Page 17: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/17.jpg)
About AWS Batch
Fully managed batch processing at scale
![Page 18: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/18.jpg)
What is batch computing?
Run jobs asynchronously and automatically across one or more computers.
Jobs may have dependencies, making the sequencing and scheduling of multiple jobs complex and challenging.
![Page 19: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/19.jpg)
What is AWS Batch?
Fully Managed
No software to install or servers to manage.
Integrated with AWS
Batch jobs can easily and securely interact with services such as Amazon S3, DynamoDB, and Rekognition.
Cost-optimized Provisioning
Automatically provisions compute resources tailored to the job's needs using EC2 and EC2 Spot.
![Page 20: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/20.jpg)
AWS Batch Concepts
1. Jobs
   1. Job Definitions
   2. Job Queues
   3. Job States
2. Compute Environments
3. Scheduler
Short Video -- here
![Page 21: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/21.jpg)
Jobs
Jobs are the unit of work executed by AWS Batch as containerized applications running on Amazon EC2.
Containerized jobs can reference a container image, command, and parameters, or users can simply provide a .zip containing their application and AWS Batch will run it on a default Amazon Linux container.
$ aws batch submit-job --job-name variant-calling --job-definition gatk --job-queue genomics
![Page 22: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/22.jpg)
Massively parallel jobs
• Now: users can submit a large number of independent “simple jobs.”
• Soon: AWS will add support for “array jobs” that run many copies of an application against an array of elements.
Array jobs are an efficient way to run:
• Parametric sweeps
• Monte Carlo simulations
• Processing a large collection of objects
NOTE: These use cases are possible today; simply submit more jobs.
![Page 23: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/23.jpg)
Example Genomics Workflow
![Page 24: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/24.jpg)
Workflows, Pipelines, and Job Dependencies
Jobs can express a dependency on the successful completion of other jobs or specific elements of an array job.
Use your preferred workflow engine and language to submit jobs. Flow-based systems simply submit jobs serially, while DAG-based systems submit many jobs at once, identifying inter-job dependencies.
$ aws batch submit-job --depends-on jobId=606b3ad1-aa31-48d8-92ec-f154bfc8215f ...
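The DAG-based approach can be sketched as follows: order the jobs so that every dependency is submitted first, then pass the IDs of the jobs it depends on with each submission. The workflow and the `fake_submit` stand-in are hypothetical, and no AWS call is made here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
WORKFLOW = {
    "align": set(),
    "call-variants": {"align"},
    "annotate": {"call-variants"},
    "qc-report": {"align"},
}


def fake_submit(name, depends_on):
    """Stand-in for a real submit call; returns a made-up job ID."""
    return f"id-{name}"


def submit_workflow(workflow, submit=fake_submit):
    """Submit jobs so that every dependency already has a job ID."""
    job_ids, submitted = {}, []
    for name in TopologicalSorter(workflow).static_order():
        deps = [{"jobId": job_ids[d]} for d in sorted(workflow[name])]
        job_ids[name] = submit(name, deps)
        submitted.append((name, deps))
    return submitted


plan = submit_workflow(WORKFLOW)
for name, deps in plan:
    print(name, deps)
```

In a real pipeline, `submit` would wrap the actual job submission and return the ID that the service assigns. All jobs can then be submitted up front, leaving the waiting to the scheduler.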
![Page 25: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/25.jpg)
Job Definitions
Batch Job Definitions specify how jobs are to be run. While each job must reference a job definition, many parameters can be overridden.
Some of the attributes specified in a job definition:
• IAM role associated with the job
• vCPU and memory requirements
• Mount points
• Container properties
• Environment variables
$ aws batch register-job-definition --job-definition-name gatk --container-properties ...
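A sketch of what the container properties might look like, written as a Python dict and serialized to JSON. The field names follow the Batch `ContainerProperties` data type as I understand it; the image, role ARN, paths, and sizes are all hypothetical, and newer API versions may prefer other fields for CPU and memory.

```python
import json

# All values are illustrative; the image, role ARN, and paths are hypothetical.
container_properties = {
    "image": "example/gatk:latest",
    "vcpus": 4,
    "memory": 16384,  # MiB
    "command": ["echo", "variant calling placeholder"],
    "jobRoleArn": "arn:aws:iam::123456789012:role/example-batch-job-role",
    "environment": [{"name": "REFERENCE", "value": "s3://example-bucket/ref/"}],
    "volumes": [{"name": "scratch", "host": {"sourcePath": "/mnt/scratch"}}],
    "mountPoints": [
        {"sourceVolume": "scratch", "containerPath": "/scratch", "readOnly": False}
    ],
}


def check_mounts(props):
    """Return mount points whose sourceVolume has no matching volume entry."""
    names = {v["name"] for v in props.get("volumes", [])}
    return [m for m in props.get("mountPoints", []) if m["sourceVolume"] not in names]


dangling = check_mounts(container_properties)
payload = json.dumps(container_properties)
print(dangling)
print(payload)
```

The resulting JSON string is the kind of value that would follow `--container-properties`. The `check_mounts` helper is just a local sanity check and is not part of any AWS tooling.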
![Page 26: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/26.jpg)
Job Queues
Jobs are submitted to a Job Queue, where they reside until they are able to be scheduled to a compute resource. Information related to completed jobs persists in the queue for 24 hours.
$ aws batch create-job-queue --job-queue-name genomics --priority 500 --compute-environment-order ...
![Page 27: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/27.jpg)
Compute Environments
Mapped from job queues to run containerized batch jobs.
• Managed CEs - you describe your requirements (instance types, min/max/desired vCPUs, and EC2 Spot bid as a % of On-Demand), and AWS launches and scales resources for you. Pick specific instance types or instance families, or simply choose “optimal”.
• Unmanaged CEs - you can launch and manage your own resources. Your instances need to include the ECS agent and run supported versions of Linux and Docker. AWS Batch will then create an Amazon ECS cluster which can accept the instances you launch. Jobs can be scheduled to your Compute Environment as soon as your instances are healthy and register with the ECS Agent.
$ aws batch create-compute-environment --compute-environment-name unmanagedce --type UNMANAGED ...
![Page 28: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/28.jpg)
AWS Batch Scheduler
The Scheduler evaluates when, where, and how to run jobs that have been submitted to a job queue.
Jobs run in approximately the order in which they are submitted as long as all dependencies on other jobs have been met.
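A toy model of that ordering rule: scan the queue in submission order and start the first job whose dependencies have all finished. This is only an illustration of the rule stated above, not the actual AWS Batch scheduler; it runs one job at a time and treats each job as finishing as soon as it starts.

```python
def run_order(jobs):
    """Return job names in the order a FIFO-with-dependencies policy starts them.

    `jobs` is a list of (name, dependencies) tuples in submission order.
    """
    pending = list(jobs)
    done, order = set(), []
    while pending:
        for i, (name, deps) in enumerate(pending):
            if set(deps) <= done:
                order.append(name)
                done.add(name)  # toy model: the job completes immediately
                del pending[i]
                break
        else:
            # No pending job can start, so some dependency will never be met.
            raise ValueError("dependencies can never be satisfied")
    return order


# Hypothetical submissions; "annotate" is submitted first but must wait.
submitted = [
    ("annotate", ["call"]),
    ("align", []),
    ("call", ["align"]),
    ("qc", []),
]
order = run_order(submitted)
print(order)
```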
![Page 29: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/29.jpg)
Queued Job States
• SUBMITTED: Accepted into the queue, but not yet evaluated for execution
• PENDING: Your job has dependencies on other jobs which have not yet completed
• RUNNABLE: Your job has been evaluated by the scheduler and is ready to run
• STARTING: Your job is in the process of being scheduled to a compute resource
• RUNNING: Your job is currently running
• SUCCEEDED: Your job has finished with exit code 0
• FAILED: Your job finished with a non-zero exit code or was cancelled or terminated.
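The states above can be read as a small state machine. The transition table below is my reading of that list, not an official specification: a job moves forward through the queue, and any state that is not final can drop to FAILED if the job is cancelled or terminated.

```python
from enum import Enum


class JobState(Enum):
    SUBMITTED = "SUBMITTED"
    PENDING = "PENDING"
    RUNNABLE = "RUNNABLE"
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


S = JobState
# Assumed transitions, inferred from the state descriptions above.
TRANSITIONS = {
    S.SUBMITTED: {S.PENDING, S.RUNNABLE, S.FAILED},
    S.PENDING: {S.RUNNABLE, S.FAILED},
    S.RUNNABLE: {S.STARTING, S.FAILED},
    S.STARTING: {S.RUNNING, S.FAILED},
    S.RUNNING: {S.SUCCEEDED, S.FAILED},
    S.SUCCEEDED: set(),
    S.FAILED: set(),
}


def advance(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"{current.name} -> {target.name} is not allowed")
    return target


def finish(exit_code):
    """Map a container exit code to the final job state."""
    return S.SUCCEEDED if exit_code == 0 else S.FAILED


# Walk a job with unmet dependencies through to completion.
state = S.SUBMITTED
path = [state]
for nxt in (S.PENDING, S.RUNNABLE, S.STARTING, S.RUNNING, finish(0)):
    state = advance(state, nxt)
    path.append(state)
print([s.name for s in path])
```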
![Page 30: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/30.jpg)
AWS Batch Actions
• CancelJob: Marks jobs that are not yet STARTING as FAILED.
• TerminateJob: Cancels jobs that are currently waiting in the queue. Stops jobs that are in a STARTING or RUNNING state and transitions them to FAILED.
NOTE: Requires a “reason” which is viewable via DescribeJobs
$ aws batch cancel-job --reason "Submitted to wrong queue" --job-id 8a767ac8-e28a-4c97-875b-e5c0bcf49eb8
![Page 31: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/31.jpg)
AWS Batch Data Types
• ComputeEnvironmentDetail
• ComputeEnvironmentOrder
• ComputeResource
• ContainerProperties
• ContainerPropertiesResource
• CounterProperties
• Host
• Job
• JobDefinition
• JobQueueDetail
• MountPoint
• Parameter
• Ulimit
• Volume
![Page 32: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/32.jpg)
Batch - Demo
![Page 33: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/33.jpg)
AWS Batch Pricing and Functionality
There is no charge for AWS Batch; you only pay for the underlying resources that you consume!
NOTE: Support for Array Jobs, retries, and jobs executed as AWS Lambda functions coming soon!
![Page 34: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/34.jpg)
Use the Right Tool for the Job
Not all batch workloads are the same…
• ETL and Big Data processing/analytics? Consider EMR, Data Pipeline, Redshift, and related services.
• Lots of small Cron jobs? AWS Batch is a great way to execute these jobs, but you will likely want a workflow or job-scheduling system to orchestrate job submissions.
• Efficiently run lots of big and small compute jobs on heterogeneous compute resources? Use AWS Batch.
![Page 35: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/35.jpg)
Example: DNA Sequencing
![Page 36: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/36.jpg)
Example: Genomics on Unmanaged Compute Environments
![Page 37: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/37.jpg)
AWS Batch summarized
• Fully Managed
• Integrated with AWS
• Cost-optimized Resource Provisioning
![Page 38: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/38.jpg)
About AWS Glue
Serverless managed, scalable ETL
![Page 39: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/39.jpg)
AWS Glue
1. Build a data catalog
   1. Discover and use your datasets via a Hive-compatible metastore
   2. Store versions, connection and credential info
   3. Use crawlers to auto-generate schema from S3 data & partitions
2. Generate and edit transforms using PySpark
3. Schedule and run your jobs
   1. On schedule, event or lambda
NOTE: Glue has been announced, but there is no beta yet. Video from re:Invent -- here
![Page 40: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/40.jpg)
![Page 41: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/41.jpg)
![Page 42: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/42.jpg)
![Page 43: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/43.jpg)
An aside…EC2 Elastic GPUs
![Page 44: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/44.jpg)
![Page 45: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/45.jpg)
About AWS QuickSight
Quick and easy data dashboards
![Page 46: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/46.jpg)
![Page 47: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/47.jpg)
![Page 48: New AWS Services for Bioinformatics](https://reader034.vdocument.in/reader034/viewer/2022042722/58a3710c1a28abaa488b458f/html5/thumbnails/48.jpg)
Resources for new AWS Services
• Athena (SQL query on S3) -- here
• Batch (Optimized, chained EC2 batches) -- here
• Glue (Scaled ETL) -- here
• Step Functions (Lambda workflows) -- here
• QuickSight (Data Dashboards) -- here
• Full list of AWS services announced at re:Invent 2016 -- here