Windows Azure for Research
Roger Barga, Architect, Cloud Computing Futures, MSR
IEEE e-Science 2010 Conference
7–10 December 2010
The Million Server Datacenter
HPC and Clouds – Select Comparisons
• Node and system architectures
• Communication fabric
• Storage systems and analytics
• Physical plant and operations
• Programming models (rest of tutorial)

HPC Node Architecture
Moore's "Law" favored commodity systems:
• Specialized processors and systems faltered
• "Killer micros" and industry-standard blades led
• Inexpensive clusters now dominate
www.top500.org
HPC Interconnects
• Ethernet for the low end (cost sensitive)
• High-end expectations:
  • Nearly flat networks and very large switches
  • Operating system bypass for low latency (microseconds)
www.top500.org
Modern Data Center Network
[Diagram: the Internet connects through CRs to ARs at Layer 3; below that, Layer 2 switches and load balancers fan out to 20-server racks]
Key:
• CR (L3 Border Router)
• AR (L3 Access Router)
• S (L2 Switch)
• LB (Load Balancer)
• A (20-Server Rack/TOR)
Links: GigE within racks, 10 GigE uplinks
HPC Storage Systems
• Local disk
  • Scratch or non-existent
• Secondary storage
  • SAN and parallel file systems
  • Hundreds of TBs (at most)
• Tertiary storage
  • Tape robot(s)
  • 3–5 GB/s bandwidth
www.nersc.gov
~60 PB capacity
HPC and Clouds – Select Comparisons
• Node and system architectures
• Communication fabric
• Storage systems and analytics
• Physical plant and operations
• Programming models (rest of tutorial)
A Tour Around Windows Azure
References:
• Azure in Action, Manning Press
• Programming Windows Azure, O'Reilly Press
• Bing: Channel 9 Windows Azure
• Bing: Windows Azure Platform Training Kit – November 2010 Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Application Model Comparison
Ad Hoc Application Model:
• Machines running IIS/ASP.NET
• Machines running Windows Services
• Machines running SQL Server
Application Model Comparison
Ad Hoc Application Model:
• Machines running IIS/ASP.NET
• Machines running Windows Services
• Machines running SQL Server
Windows Azure Application Model:
• Web Role instances
• Worker Role instances
• Azure Storage (Blob, Queue, Table)
• SQL Azure
Key Components
Fabric Controller:
• Manages hardware and virtual machines for the service
Compute:
• Web Roles – web application front end
• Worker Roles – utility compute
• VM Roles – custom compute role; you own and customize the VM
Storage:
• Blobs – binary objects
• Tables – entity storage
• Queues – role coordination
• SQL Azure – SQL in the cloud
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
  • Role types
  • Role VM sizes
  • External and internal endpoints
  • Local storage
• The configuration settings configure a service:
  • Instance count
  • Storage keys
  • Application-specific settings
Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers, switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee:
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end:
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute on Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services:
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints
Suggested Application Model: Using Queues for Reliable Messaging
Scalable, Fault-Tolerant Applications
Queues are the application glue:
• Decouple parts of the application so they are easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
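The decoupling described above is the work-ticket pattern: the web role enqueues a small "ticket" referencing the work, and a worker role dequeues, processes, and deletes it. A minimal sketch, using Python's in-process `queue.Queue` as a stand-in for an Azure queue (the function names and ticket fields are illustrative, not the Azure SDK):

```python
import queue

work_queue = queue.Queue()

def web_role_submit(blob_name):
    # Enqueue a ticket that points at the real payload (e.g. a blob)
    # rather than the payload itself -- queue messages are size-limited.
    work_queue.put({"blob": blob_name, "action": "convert"})

def worker_role_poll(results):
    # Dequeue one ticket and process it; in Azure the message would stay
    # invisible for a timeout and be explicitly deleted on success.
    ticket = work_queue.get()
    results.append(f"processed {ticket['blob']}")
    work_queue.task_done()  # analogous to RemoveMessage

results = []
web_role_submit("movies/barga.mpg")
worker_role_poll(results)
```

Because the queue sits between the roles, either side can crash or scale independently without breaking the other.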
Key Components – Compute: VM Roles
• Customized role: you own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using:
    • Base OS
    • Differences VHD
Application Hosting
'Grokking' the service model:
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef (service definition)
  • ServiceConfiguration.cscfg (service configuration)
GUI
Double-click on the Role Name in the Azure Project
Deploying to the Cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage
Replicated, Highly Available, Load Balanced
Durable Storage, At Massive Scale
• Blob – massive files, e.g. videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely-coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites an existing blob
  • GetBlob – get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • e.g. http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private – the default; requires the account key for access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
Block blob:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob
Page blob:
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'
  • Each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
[Diagram: Big.mpg split into blocks 1–8, uploaded out of order, then committed as Big.mpg]
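A sketch of that commit flow: split data into blocks, give each a fixed-length Base64 ID, then build the commit list. Operation names (Put Block, Put Block List) follow the Azure Storage REST API; no network calls are made here, the code only constructs what those calls would send.

```python
import base64

def make_block_ids(data, block_size):
    # Split into blocks; Block IDs must be Base64 strings and, within a
    # blob, must all be the same length -- hence the zero-padded counter.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    ids = [base64.b64encode(f"block-{n:06d}".encode()).decode()
           for n in range(len(blocks))]
    return blocks, ids

def block_list_body(ids):
    # The XML body for Put Block List; <Latest> means "use the most
    # recently uploaded block with this ID".
    inner = "".join(f"<Latest>{bid}</Latest>" for bid in ids)
    return f'<?xml version="1.0" encoding="utf-8"?><BlockList>{inner}</BlockList>'

blocks, ids = make_block_ids(b"x" * 10, 4)   # 4 + 4 + 2 bytes -> 3 blocks
body = block_list_body(ids)
```

Because the commit is a separate step, blocks can be uploaded in parallel and in any order, matching the out-of-order diagram above.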
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the maximum size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
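Since leasing is REST-only, here is a sketch of the request a lease operation would build. The `comp=lease` parameter and `x-ms-lease-action`/`x-ms-lease-id` headers follow the Azure Storage REST API; the account, container, and blob names are placeholders, and nothing is actually sent (authentication headers are omitted).

```python
def lease_request(account, container, blob, action, lease_id=None):
    # Build the Lease Blob request: PUT against the blob with comp=lease
    # and the desired action (acquire | renew | release | break).
    url = (f"http://{account}.blob.core.windows.net/"
           f"{container}/{blob}?comp=lease")
    headers = {"x-ms-lease-action": action}
    if lease_id:
        # Renew/release/break must present the lease ID returned by acquire.
        headers["x-ms-lease-id"] = lease_id
    return "PUT", url, headers

method, url, headers = lease_request("myacct", "movies", "barga.mpg", "acquire")
```

A successful acquire returns a lease ID, which the holder then passes on every renew, release, or break, and on writes to the locked blob.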
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table name: Movies – entities: Star Wars, Star Trek, Fan Boys
• Table name: Customers – entities: Brian H. Prince, Jason Argonaut, Bill Gates
Hierarchy: Account → Table → Entity
Tables store entities. Entity schema can vary within the same table.
Windows Azure Tables
• Provides structured storage
  • Massively scalable tables: billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
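A minimal in-memory sketch of that entity model: every entity carries PartitionKey, RowKey, and Timestamp, the (PartitionKey, RowKey) pair is the unique key, and entities in the same table may carry different extra properties. The `Table` class is illustrative, not the Azure SDK.

```python
import time

class Table:
    def __init__(self):
        self.rows = {}

    def insert(self, entity):
        # Enforce the required properties named on the slide.
        for required in ("PartitionKey", "RowKey"):
            if required not in entity:
                raise ValueError(f"missing {required}")
        entity.setdefault("Timestamp", time.time())
        # (PartitionKey, RowKey) uniquely identifies an entity.
        self.rows[(entity["PartitionKey"], entity["RowKey"])] = entity

movies = Table()
# Schema can vary per entity within the same table:
movies.insert({"PartitionKey": "Action", "RowKey": "Fast & Furious",
               "ReleaseDate": 2009})
movies.insert({"PartitionKey": "Comedy", "RowKey": "Office Space",
               "Rating": "R"})
```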
Windows Azure Queues
• Queues are performance efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
• Different for each data type (blobs, entities, queues)
Every data object has a partition key:
• A partition can be served by a single server
• The system load balances partitions based on traffic patterns
• Controls entity locality
The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
On "Server Busy":
• Use exponential backoff
• The system load balances to meet your traffic needs
• Single-partition limits may have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition:

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order – 1             |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name. Every blob and its snapshots are in a single partition:

  Container Name | Blob Name
  image          | annarbor/bighouse.jpg
  image          | foxborough/gillette.jpg
  video          | annarbor/bighouse.jpg

Messages – Queue name. All messages for a single queue belong to the same partition:

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
[Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3]
Scalability Targets
Storage account:
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single queue/table partition:
• Up to 500 transactions per second
Single blob partition:
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
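A sketch of the exponential backoff recommended above for '503 Server Busy': retry with a delay that doubles each attempt, capped, with a little jitter. The operation, cap, and jitter factor are illustrative; delays are recorded rather than slept so the flow is visible.

```python
import random

def with_backoff(operation, max_attempts=5, base_delay=0.5, cap=30.0):
    delays = []                      # recorded instead of time.sleep()
    for attempt in range(max_attempts):
        status = operation()
        if status != 503:            # success (or a non-retryable status)
            return status, delays
        # Double the delay each retry, cap it, and add up to 10% jitter
        # so many clients do not retry in lockstep.
        delay = min(cap, base_delay * (2 ** attempt))
        delays.append(delay + random.uniform(0, delay * 0.1))
    return 503, delays

# Simulated service: busy twice, then succeeds.
responses = iter([503, 503, 200])
status, delays = with_backoff(lambda: next(responses))
```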
Partitions and Partition Ranges
Example Movies table (PartitionKey = Category, RowKey = Title):

  PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
  Action                  | Fast & Furious            | …         | 2009
  Action                  | The Bourne Ultimatum      | …         | 2007
  Animation               | Open Season 2             | …         | 2009
  Animation               | The Ant Bully             | …         | 2006
  Comedy                  | Office Space              | …         | 1999
  SciFi                   | X-Men Origins: Wolverine  | …         | 2009
  War                     | Defiance                  | …         | 2008

Initially one server holds the whole key range:
• Server A: Table = Movies, [Min – Max]
As traffic grows, the system splits the range across servers:
• Server A: Table = Movies, [Min – Comedy)
• Server B: Table = Movies, [Comedy – Max]
Key Selection: Things to Consider
Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions:
• Transactions across a single partition
• Transaction semantics; reduce round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
You get a continuation token at:
• A maximum of 1000 rows in a response
• The end of a partition range boundary
• A maximum of 5 seconds to execute the query
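A sketch of the required handling: keep issuing the query until the service stops returning a token. `query_page` here simulates a service that returns at most `page_size` rows plus a token (in the real Table service the token travels in `x-ms-continuation-*` headers).

```python
def query_page(rows, start, page_size=1000):
    # Simulated service call: return one page and, if more rows remain,
    # a continuation token (here just the next start offset).
    page = rows[start:start + page_size]
    token = start + page_size if start + page_size < len(rows) else None
    return page, token

def query_all(rows, page_size=1000):
    # Loop until no continuation token is returned -- skipping this loop
    # silently drops everything after the first 1000 rows.
    results, token = [], 0
    while token is not None:
        page, token = query_page(rows, token, page_size)
        results.extend(page)
    return results

data = list(range(2500))     # would need three round trips at 1000/page
fetched = query_all(data)
```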
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Guidance:
• Select PartitionKey and RowKey that help scale
  • Distribute by using a hash, etc., as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • Server busy: partitions are load balanced to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
[Diagram: a web role calls PutMessage to add messages (Msg 1–4) to a queue; worker roles call GetMessage (with a visibility timeout) to retrieve messages, and RemoveMessage to delete them after processing]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a backoff polling approach:
• Each empty poll increases the interval by 2x, up to a maximum (e.g. 60 seconds)
• A successful poll sets the interval back to 1
[Diagram: consumers C1 and C2 polling a queue, with intervals growing toward the 60-second cap during idle periods]
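The policy above can be stated in a few lines: double the polling interval on every empty poll up to a cap, reset to 1 on a hit. Values are in seconds; the cap of 60 matches the slide's diagram.

```python
def next_interval(current, got_message, cap=60):
    # Truncated exponential backoff for queue polling: reset on success,
    # double (up to the cap) on an empty poll.
    if got_message:
        return 1
    return min(cap, current * 2)

# Trace the interval across six empty polls and then one hit:
intervals, interval = [], 1
for got in [False, False, False, False, False, False, True]:
    interval = next_interval(interval, got)
    intervals.append(interval)
```

This keeps latency low when the queue is busy while cutting transaction costs (each GetMessage call is billed) when it is idle.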
Removing Poison Messages
Scenario (producers P1, P2; consumers C1, C2):
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2
13. C1: Delete(Q, msg 1) – msg 1 is removed as a poison message
Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use DequeueCount to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
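The poison-message rule from the recap and the scenario above can be sketched as a loop: a message dequeued more than a threshold number of times without being deleted is dropped instead of retried. The in-memory deque stands in for an Azure queue; `dequeue_count` mirrors the service's DequeueCount.

```python
import collections

Message = collections.namedtuple("Message", "body dequeue_count")

def process_queue(bodies, handler, max_dequeue=3):
    queue = collections.deque(Message(b, 0) for b in bodies)
    poison, done = [], []
    while queue:
        msg = queue.popleft()
        msg = msg._replace(dequeue_count=msg.dequeue_count + 1)
        if msg.dequeue_count > max_dequeue:
            poison.append(msg.body)          # give up: poison message
            continue
        try:
            handler(msg.body)
            done.append(msg.body)            # DeleteMessage on success
        except Exception:
            queue.append(msg)                # becomes visible again later
    return done, poison

# "bad" always crashes the handler and is eventually discarded.
def handler(body):
    if body == "bad":
        raise RuntimeError("crash while processing")

done, poison = process_queue(["ok", "bad"], handler)
```

Without the threshold, "bad" would cycle forever, which is exactly the failure mode the DequeueCount check prevents.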
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
• http://blogs.msdn.com/windowsazurestorage
• http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
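The deck points to the .NET Task Parallel Library; as an analogous data-parallel sketch (in Python, for illustration), a pool of workers processes independent items concurrently, keeping one role instance busy instead of splitting underutilized work across extra roles:

```python
from concurrent.futures import ThreadPoolExecutor

def process(item):
    # Stand-in for one independent unit of work (CPU- or IO-bound).
    return item * item

def process_all(items, workers=4):
    # Fan the items out across a worker pool; map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, items))

results = process_all(range(8))
```

The same shape applies whether the unit of concurrency is a thread, a process, or (in TPL terms) a task.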
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
  • Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
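A quick illustration of point 1: gzip-compressing repetitive markup shrinks it dramatically, at the cost of some CPU. The sample markup is made up for the demo.

```python
import gzip

# Repetitive HTML table rows -- the kind of output that compresses well.
html = b"<tr><td>row</td></tr>" * 500

compressed = gzip.compress(html)
restored = gzip.decompress(compressed)     # browsers do this on the fly
ratio = len(compressed) / len(html)        # far below 1.0 for this input
```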
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications

NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
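The query-segmentation pattern above can be sketched in a few lines: split the input sequences into fixed-size partitions, fan the partitions out as independent tasks, and merge the per-partition results. The BLAST invocation itself is replaced by a placeholder; partition size and sequence names are illustrative.

```python
def split_queries(sequences, partition_size):
    # Split the input sequences into independent partitions.
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def run_blast(partition):
    # Placeholder for running NCBI BLAST over one partition of queries;
    # in AzureBLAST each partition becomes a task pulled by a worker role.
    return [f"hits({seq})" for seq in partition]

def merge(results_per_partition):
    # The merging task: concatenate per-partition results.
    return [hit for part in results_per_partition for hit in part]

sequences = [f"seq{i}" for i in range(10)]
partitions = split_queries(sequences, 4)        # 3 partitions: 4 + 4 + 2
merged = merge(run_blast(p) for p in partitions)
```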
AzureBLAST Task Flow
A simple split/join pattern: a splitting task fans out many BLAST tasks, and a merging task combines their results.
Leverage the multiple cores of one instance:
• Argument "-a" of NCBI BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity:
• Large partition: load imbalance
• Small partition: unnecessary overheads
  • NCBI BLAST overhead
  • Data transfer overhead
• Best practice: test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of an instance failure
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability
Task size/instance size vs. cost:
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource
AzureBLAST Architecture
[Diagram]
• Web Role: web portal and web service for job registration
• Job Management Role: job scheduler and scaling engine, with a job registry kept in an Azure Table
• Worker roles: pull BLAST tasks from a global dispatch queue
• Azure Blob storage: NCBI databases, BLAST databases, temporary data, etc.
• Database-updating role: keeps the NCBI databases current
• Task flow: a splitting task fans out BLAST tasks; a merging task combines results
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
The accepted job is stored into the job registry table:
• Fault tolerance: avoid in-memory state
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• A total of 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually
[Diagram: instance counts per deployment, e.g. 50 and 62 VMs per datacenter]
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, real working instance time should be 6–8 days
  • Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should look like:

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins
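The analysis the slide describes — pairing "Executing the task N" lines with their "Execution of task N is done" lines and flagging tasks that never completed — can be sketched like this. The log format follows the slide; timestamps are ignored for simplicity.

```python
import re

START = re.compile(r"(\S+) Executing the task (\d+)")
DONE = re.compile(r"(\S+) Execution of task (\d+) is done")

def find_lost_tasks(lines):
    # Collect (node, task) pairs that started and that finished;
    # anything started but never finished is a lost task.
    started, finished = set(), set()
    for line in lines:
        if m := DONE.search(line):
            finished.add((m.group(1), m.group(2)))
        elif m := START.search(line):
            started.add((m.group(1), m.group(2)))
    return started - finished

log = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
lost = find_lost_tasks(log)   # task 251774 never completed
```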
Surviving System Upgrades
North Europe Data Center: a total of 34,256 tasks processed
• All 62 compute nodes lost tasks and then came back in groups: this is an update domain
• ~30 mins per group, ~6 nodes in one group

Surviving Storage Failures
West Europe Datacenter: 30,976 tasks were completed, and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

  ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)
Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
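A direct transcription of the Penman-Monteith equation above, with illustrative (not measured) input values; variable names mirror the symbols on the slide.

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s, gamma, lambda_v):
    # ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Plausible mid-latitude daytime values (assumed, for illustration only):
et = penman_monteith(delta=145.0,      # Pa/K
                     r_n=150.0,        # W/m^2
                     rho_a=1.2,        # kg/m^3
                     c_p=1005.0,       # J/(kg K)
                     dq=1000.0,        # Pa
                     g_a=0.02,         # m/s
                     g_s=0.01,         # m/s
                     gamma=66.0,       # Pa/K
                     lambda_v=2450.0)  # J/g
```

As expected from the formula, ET grows with net radiation Rn and with the vapor pressure deficit δq.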
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate result sinusoidal tiles
• Simple nearest neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
Download Queue
Scientists
Science results
Analysis Reduction Stage
Derivation Reduction Stage
Reprojection Stage
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables
<PipelineStage> Request
…
<PipelineStage> JobStatus
Persist
<PipelineStage> Job Queue
MODISAzure Service (Web Role)
Service Monitor (Worker Role)
Parse & Persist <PipelineStage> TaskStatus
…
Dispatch
<PipelineStage> Task Queue
MODISAzure Architectural Big Picture (2/2)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse & Persist <PipelineStage> TaskStatus
Generic Worker (Worker Role)
…
Dispatch
<PipelineStage> Task Queue
…
<Input> Data Storage
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
Example Pipeline Stage: Reprojection Service
Reprojection Request …
Service Monitor (Worker Role)
Reprojection JobStatus: Persist
Parse & Persist Reprojection TaskStatus
Generic Worker (Worker Role)
…
Job Queue
…
Dispatch
Task Queue
Points to
…
ScanTimeList
SwathGranuleMeta
Reprojection Data Storage
• Each entity specifies a single reprojection job request
• Each entity specifies a single reprojection task (i.e., a single tile)
• Query this table to get geo-metadata (e.g., boundaries) for each swath tile
• Query this table to get the list of satellite scan times that cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
Download Queue
Scientists
Analysis Reduction Stage
Derivation Reduction Stage
Reprojection Stage
400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers
$50 upload, $450 storage
400 GB, 45K files, 3500 hours, 20-100 workers
5-7 GB, 55K files, 1800 hours, 20-100 workers
<10 GB, ~1K files, 1800 hours, 20-100 workers
$420 CPU, $60 download
$216 CPU, $1 download, $6 storage
$216 CPU, $2 download, $9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Provide valuable fault tolerance and scalability abstractions
• Clouds as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
12
Application Model Comparison
Machines Running IIS / ASP.NET
Machines Running Windows Services
Machines Running SQL Server
Ad Hoc Application Model
Web Role Instances
Worker Role Instances
Azure Storage: Blob, Queue, Table
SQL Azure
Windows Azure Application Model
Key Components
Fabric Controller
• Manages hardware and virtual machines for service
Compute
• Web Roles
  • Web application front end
• Worker Roles
  • Utility compute
• VM Roles
  • Custom compute role; you own and customize the VM
Storage
• Blobs
  • Binary objects
• Tables
  • Entity storage
• Queues
  • Role coordination
• SQL Azure
  • SQL in the cloud
Key Components: Fabric Controller
• Think of it as an automated IT department
• "Cloud Layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V, called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
Key Components: Fabric Controller
• Think of it as an automated IT department
• "Cloud Layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V, called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
  • The configuration definition describes the shape of a service:
    • Role types
    • Role VM sizes
    • External and internal endpoints
    • Local storage
  • The configuration settings configure a service:
    • Instance count
    • Storage keys
    • Application-specific settings
Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers / switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions
  • Current State
  • Goal State
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee:
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other Web services
• Can expose external and internal endpoints
Suggested Application Model: Using Queues for Reliable Messaging
Scalable, Fault-Tolerant Applications
Queues are the application glue
• Decouple parts of the application; easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role
  • You own the box
• How it works:
  • Download "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using:
    • Base OS
    • Differences VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action. Perhaps even relocate your app
• At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files
  • Encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
  • (which is better than six months)
Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage, At Massive Scale
Blob – massive files, e.g. videos, logs
Drive – use standard file system APIs
Tables – non-relational, but with few scale limits; use SQL Azure for relational data
Queues – facilitate loosely-coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
  • PutBlob: inserts a new blob, overwrites the existing blob
  • GetBlob: get whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private – default; will require the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
Block Blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
• Size limit: 200 GB per blob
Page Blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
• You can upload a file in 'blocks'
  • Each block has an id
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
Blocks
Diagram: blocks of Big.mpg uploaded out of order (1 6 8 3 5 4 7 2), then committed as Big.mpg
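A toy sketch of the block-commit idea (this is not the real storage client API; the class and method names are invented): blocks may be staged in any order, and a final block list fixes their order.

```python
class ToyBlockBlob:
    """Minimal model of a block blob: staged blocks stay uncommitted
    until a Put Block List names them in their final order."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, in any arrival order
        self.committed = b""

    def put_block(self, block_id, data):
        # Blocks can arrive out of order, even in parallel.
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: concatenate the named blocks in the given order.
        self.committed = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()

blob = ToyBlockBlob()
for block_id, chunk in [("03", b"cc"), ("01", b"aa"), ("02", b"bb")]:
    blob.put_block(block_id, chunk)       # out-of-order upload
blob.put_block_list(["01", "02", "03"])   # ordered commit
```

The separation of staging from commit is what lets the real service upload blocks concurrently and retry individual blocks on failure.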
Pages
• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob
• Call Put Blob, set max size; then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • Drive made durable through standard Page Blob replication
  • Drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letter and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table Name: Movies
  • Star Wars, Star Trek, Fan Boys
• Table Name: Customers
  • Brian H. Prince, Jason Argonaut, Bill Gates
Account → Table → Entity
Tables store entities. Entity schema can vary in the same table.
Windows Azure Tables
• Provides structured storage
  • Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
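Since a table entity is just a property bag, the three required properties are easy to check; this sketch (the helper and sample values are invented) shows an entity whose remaining schema is entirely its own:

```python
REQUIRED_PROPERTIES = {"PartitionKey", "RowKey", "Timestamp"}

def is_valid_entity(entity):
    """Every entity must carry the three required properties;
    any other properties form its per-entity schema."""
    return REQUIRED_PROPERTIES.issubset(entity)

# Hypothetical movie entity; ReleaseDate is application-specific schema.
movie = {"PartitionKey": "Action",
         "RowKey": "The Bourne Ultimatum",
         "Timestamp": "2010-12-07T00:00:00Z",
         "ReleaseDate": 2007}
```

Two entities in the same table could carry completely different extra properties; only the three required ones are fixed.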
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
• Every data object has a partition key
  • Different for each data type (blobs, entities, queues)
• Partition key is the unit of scale
  • A partition can be served by a single server
  • System load balances partitions based on traffic pattern
  • Controls entity locality
• System load balances
  • Load balancing can take a few minutes to kick in
  • Can take a couple of seconds for a partition to become available on a different server
• "Server Busy"
  • Use exponential backoff on "Server Busy"
  • Our system load balances to meet your traffic needs
  • Single partition limits have been reached
Partition Keys In Each Abstraction
• Entities – TableName + PartitionKey
  • Entities with the same PartitionKey value are served from the same partition
• Blobs – Container name + Blob name
  • Every blob and its snapshots are in a single partition
• Messages – Queue name
  • All messages for a single queue belong to the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Container Name | Blob Name
image | annarborbighouse.jpg
image | foxboroughgillette.jpg
video | annarborbighouse.jpg

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has written to all three replicas
• Reads are only load balanced to replicas in sync
Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3
Scalability Targets
Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
Partitions and Partition Ranges

Server A: Table = Movies [Min – Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

After the partition range splits:

Server A: Table = Movies [Min – Comedy)
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B: Table = Movies [Comedy – Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select PartitionKey and RowKey that help scale
Avoid "append only" patterns
• Distribute by using a hash etc. as a prefix
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• Server busy: load balance partitions to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
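The "distribute by using a hash as a prefix" tip above can be sketched in a few lines (the bucket count, format, and helper name are arbitrary choices for this illustration):

```python
import hashlib

def prefixed_partition_key(natural_key, buckets=16):
    """Prepend a stable hash bucket to a naturally append-only key
    (e.g. a timestamp) so that writes spread across partitions
    instead of always hitting the last one."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"
```

Range queries then need to fan out across all buckets, which is the usual trade-off for avoiding a hot tail partition.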
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout) / RemoveMessage
Msg 2, Msg 1
Worker Role
Msg 2
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
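The lifecycle above can be modeled in a few lines of plain Python (a toy, not the real queue service or its client library; time is passed in explicitly so the visibility behavior is easy to see without sleeping):

```python
import uuid

class ToyQueue:
    """Toy model of the message lifecycle: GetMessage hides a message for
    `timeout` seconds and hands back a pop receipt; the delete (RemoveMessage)
    requires that receipt. Undeleted messages reappear after the timeout."""
    def __init__(self):
        self._messages = []

    def put_message(self, body):
        self._messages.append({"id": str(uuid.uuid4()), "body": body,
                               "visible_at": 0.0, "pop_receipt": None,
                               "dequeue_count": 0})

    def get_message(self, now, timeout=30):
        for m in self._messages:
            if m["visible_at"] <= now:           # only visible messages
                m["visible_at"] = now + timeout  # hide until timeout passes
                m["pop_receipt"] = str(uuid.uuid4())
                m["dequeue_count"] += 1
                return m
        return None

    def delete_message(self, msg_id, pop_receipt):
        self._messages = [m for m in self._messages
                          if not (m["id"] == msg_id
                                  and m["pop_receipt"] == pop_receipt)]
```

If a worker crashes between get and delete, the message simply becomes visible again, which is exactly the at-least-once guarantee the slides describe.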
Truncated Exponential Back-Off Polling
• Consider a back-off polling approach
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1
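A minimal version of this polling policy (the floor and ceiling values are illustrative):

```python
def next_interval(current, got_message, floor=1.0, ceiling=60.0):
    """Truncated exponential back-off: an empty poll doubles the wait,
    up to a ceiling; a successful poll resets it to the floor."""
    if got_message:
        return floor
    return min(current * 2.0, ceiling)
```

Each worker keeps its own interval, so idle workers quickly back off to the ceiling while busy workers poll at the floor.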
Removing Poison Messages
Producers: P1, P2; Consumers: C1, C2
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (2)
Producers: P1, P2; Consumers: C1, C2
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after Dequeue
7. GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (3)
Producers: P1, P2; Consumers: C1, C2
1. Dequeue(Q, 30 sec) → msg 1
2. Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after Dequeue
7. Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after Dequeue
10. C1 restarted
11. Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
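The DequeueCount check that drives the final delete can be expressed as a small guard in the consumer (the threshold and message shape here are illustrative):

```python
POISON_THRESHOLD = 2

def classify(message):
    """Decide what a consumer should do with a dequeued message.
    A message seen more than POISON_THRESHOLD times is presumed to be
    crashing its consumers, and is deleted instead of re-processed."""
    if message["dequeue_count"] > POISON_THRESHOLD:
        return "delete-as-poison"
    return "process"
```

In practice the poison message would typically be logged or moved to a side queue before deletion, so the failure can be diagnosed.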
Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: batch messages; use a blob to store message data, with a reference in the message; garbage collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the costs of having idling VMs
Performance vs. Cost
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often leads to savings in other places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
Gzip | Minify JavaScript | Minify CSS | Minify Images
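A quick sketch of the first tip, using Python's standard-library gzip module in place of a server-side compression filter (the sample markup is invented):

```python
import gzip

# Repetitive HTML, typical of generated pages.
html = b"<html>" + b"<div>hello azure</div>" * 500 + b"</html>"

compressed = gzip.compress(html)

# Gzipped output is far smaller for repetitive markup,
# cutting both bandwidth and storage costs.
ratio = len(compressed) / len(html)

# Browsers decompress on the fly; the round trip must be lossless.
assert gzip.decompress(compressed) == html
```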
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700-1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (ScienceCloud 2010), Association for Computing Machinery, 21 June 2010.
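The query-segmentation pattern above can be sketched in a few lines. This is a simulation, not AzureBLAST itself: `blast_partition` is a hypothetical stand-in for invoking NCBI-BLAST, and a thread pool stands in for worker-role instances:

```python
from concurrent.futures import ThreadPoolExecutor

def split_sequences(sequences, partition_size):
    """Splitting task: cut the input sequence list into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition):
    """Stand-in for running NCBI-BLAST on one partition (a real worker
    would shell out to the BLAST binary against the database)."""
    return [f"hit:{seq}" for seq in partition]

def run_query_segmented(sequences, partition_size=100):
    """Split, query partitions in parallel, then merge: the AzureBLAST pattern."""
    partitions = split_sequences(sequences, partition_size)
    with ThreadPoolExecutor() as pool:      # each "worker role" takes a partition
        results = pool.map(blast_partition, partitions)
    merged = []                             # merging task joins partial results
    for partial in results:
        merged.extend(partial)
    return merged
```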
AzureBLAST Task-Flow
A simple Split/Join pattern
Leverage the multi-core capability of each instance
• The "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of instance failure
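The visibilityTimeout tradeoff can be made concrete with a small simulation (this is not the Azure queue SDK; the function and names are invented for illustration):

```python
import time

def process_with_visibility_timeout(task, run_task, visibility_timeout):
    """Illustrates the visibilityTimeout tradeoff (a simulation).

    The timeout should be a good estimate of the task's run time:
    - too small: the message becomes visible again while a worker is
      still busy, so another worker repeats the computation;
    - too large: if the instance fails, the message stays invisible
      needlessly long before another worker can pick it up.
    """
    start = time.monotonic()
    result = run_task(task)
    elapsed = time.monotonic() - start
    # Would this message have reappeared on the queue mid-run?
    repeated_computation_risk = elapsed > visibility_timeout
    return result, repeated_computation_risk
```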
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
Worker | Worker
Worker | Worker
Worker | Worker
Global dispatch queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
…
Scaling Engine
(BLAST databases, temporary data, etc.)
Job Registry | NCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment
Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically 100 billion sequence comparisons
Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists
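The desktop estimate is easy to sanity-check with back-of-the-envelope arithmetic:

```python
# Estimated single-desktop run time from the sampling runs.
minutes = 3_216_731
minutes_per_year = 60 * 24 * 365
years = minutes / minutes_per_year
# Works out to roughly six years of nonstop single-desktop compute,
# which is why thousands of cloud cores finish in days instead.
```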
Our Approach
• Allocated a total of ~4,000 cores
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
[Deployment map: eight deployments of 50 or 62 VMs each across the four datacenters]
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, real working instance time should be 6-8 days
  • Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
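This kind of log analysis is straightforward to automate. A minimal sketch (log lines follow the slide's format; the parsing helper is invented):

```python
import re

LOG = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

def unfinished_tasks(log_text):
    """Return task ids that were started but never reported done."""
    started, finished = set(), set()
    for line in log_text.splitlines():
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return started - finished

# Task 251774 started at 8:22 but has no matching "done" record:
# something went wrong on that node (e.g., a failure or restart).
```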
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in a group: this is an update domain
~30 mins
~6 nodes in one group
35 nodes experienced blob-writing failures at the same time
Surviving Storage Failures
West Europe datacenter: 30,976 tasks completed, and the job was killed
A reasonable guess: the fault domain mechanism was at work
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" (Irish proverb)
Computing Evapotranspiration (ET)
ET = water volume evapotranspired (m3 s-1 m-2); Δ = rate of change of saturation specific humidity with air temperature (Pa K-1); λv = latent heat of vaporization (J g-1); Rn = net radiation (W m-2); cp = specific heat capacity of air (J kg-1 K-1); ρa = dry air density (kg m-3); δq = vapor pressure deficit (Pa); ga = conductivity of air (inverse of ra) (m s-1); gs = conductivity of plant stoma, air (inverse of rs) (m s-1); γ = psychrometric constant (γ ≈ 66 Pa K-1)
Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
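A direct transcription of the formula above, using the legend's symbols; the default constants follow the slide (γ ≈ 66 Pa/K) and a typical latent-heat value, and all sample inputs are made up purely for illustration:

```python
def penman_monteith_et(delta, rn, rho_a, cp, dq, ga, gs,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET per the slide's formulation:

        ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

    gamma defaults to the slide's ~66 Pa/K; lambda_v ~2450 J/g is a
    typical latent heat of vaporization. Inputs are illustrative only.
    """
    numerator = delta * rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lambda_v
    return numerator / denominator

# Illustrative (made-up) values for a single grid cell:
et = penman_monteith_et(delta=145.0, rn=400.0, rho_a=1.2, cp=1005.0,
                        dq=800.0, ga=0.02, gs=0.01)
```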
ET Synthesizes Imagery, Sensors, Models, and Field Data
NASA MODIS imagery source archives: 5 TB (600K files)
FLUXNET curated sensor dataset: 30 GB (960 files)
FLUXNET curated field dataset: 2 KB (1 file)
NCEP/NCAR: ~100 MB (4K files)
Vegetative clumping: ~5 MB (1 file)
Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
Download Queue
Scientists
Science results
Analysis Reduction Stage | Derivation Reduction Stage | Reprojection Stage
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables
<PipelineStage> Request
…
<PipelineStage>JobStatus
Persist
<PipelineStage>Job Queue
MODISAzure Service (Web Role)
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
…
Dispatch <PipelineStage>Task Queue
MODISAzure Architectural Big Picture (2/2)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
GenericWorker (Worker Role)
…
Dispatch <PipelineStage>Task Queue
…
<Input>Data Storage
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
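The GenericWorker's dequeue-and-retry behavior can be sketched as follows. This is not the actual MODISAzure code: a deque and a dict stand in for the Azure task queue and status table, and the retry limit follows the slide:

```python
from collections import deque

MAX_RETRIES = 3

def run_generic_worker(task_queue, execute, task_status):
    """Sketch of the GenericWorker loop: dequeue a task, execute it,
    retry failures up to 3 times, and keep task status up to date."""
    while task_queue:
        task = task_queue.popleft()
        status = task_status.setdefault(task, {"attempts": 0, "state": "queued"})
        try:
            execute(task)
            status["state"] = "done"
        except Exception:
            status["attempts"] += 1
            if status["attempts"] < MAX_RETRIES:
                task_queue.append(task)       # requeue for another attempt
            else:
                status["state"] = "failed"    # give up after 3 attempts
```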
Example Pipeline Stage Reprojection Service
Reprojection Request…
Service Monitor (Worker Role)
ReprojectionJobStatus Persist
Parse & Persist ReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMeta
Reprojection Data Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (i.e., a single tile)
Query this table to get geo-metadata (e.g., boundaries) for each swath tile
Query this table to get the list of satellite scan times that cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
Download Queue
Scientists
Analysis Reduction Stage | Derivation Reduction Stage | Reprojection Stage
400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers
$50 upload + $450 storage
400 GB, 45K files, 3500 hours, 20-100 workers
5-7 GB, 55K files, 1800 hours, 20-100 workers
<10 GB, ~1K files, 1800 hours, 20-100 workers
$420 CPU + $60 download
$216 CPU + $1 download + $6 storage
$216 CPU + $2 download + $9 storage
AzureMODIS Service Web Role Portal
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Application Model Comparison
Machines Running IIS / ASP.NET
Machines Running Windows Services
Machines Running SQL Server
Ad Hoc Application Model
Web Role Instances | Worker Role Instances
Azure Storage: Blob / Queue / Table
SQL Azure
Windows Azure Application Model
Key Components
Fabric Controller
• Manages hardware and virtual machines for services
Compute
• Web Roles
  • Web application front end
• Worker Roles
  • Utility compute
• VM Roles
  • Custom compute role; you own and customize the VM
Storage
• Blobs
  • Binary objects
• Tables
  • Entity storage
• Queues
  • Role coordination
• SQL Azure
  • SQL in the cloud
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
  • The configuration definition describes the shape of a service
    • Role types
    • Role VM sizes
    • External and internal endpoints
    • Local storage
  • The configuration settings configure a service
    • Instance count
    • Storage keys
    • Application-specific settings
Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers and switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints
Suggested Application Model
Using queues for reliable messaging
Scalable, Fault-Tolerant Applications
Queues are the application glue
• Decouple parts of the application; easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
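A toy version of this model, with an in-process queue standing in for an Azure queue (the role functions and job strings are invented):

```python
import queue

# The queue is the glue: the web role enqueues work tickets and returns
# quickly; worker roles process them in the background. Either side can
# scale (or fail and restart) independently of the other.
work_queue = queue.Queue()

def web_role_enqueue(request):
    """Web role front end: accept a request, drop a ticket, respond fast."""
    work_queue.put({"job": request})

def worker_role_drain(results):
    """Worker role: pull tickets off the queue and do the heavy lifting."""
    while not work_queue.empty():
        ticket = work_queue.get()
        results.append(ticket["job"].upper())   # stand-in for real work
        work_queue.task_done()
```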
Key Components – Compute: VM Roles
• Customized role
  • You own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differencing VHD
  • Azure runs your VM role using:
    • Base OS
    • Differencing VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you
  • Find hardware homes
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
  • At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg
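As an illustrative sketch of how the two files divide responsibilities (element names approximate the 2010-era schema; the service name, role name, and setting shown are invented, not from a real project):

```xml
<!-- ServiceDefinition.csdef: the service's "shape" -->
<ServiceDefinition name="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="WebFrontEnd" vmsize="Small">
    <InputEndpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </InputEndpoints>
    <ConfigurationSettings>
      <Setting name="DataConnectionString" />
    </ConfigurationSettings>
  </WebRole>
</ServiceDefinition>

<!-- ServiceConfiguration.cscfg: the values that configure that shape -->
<ServiceConfiguration serviceName="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebFrontEnd">
    <Instances count="2" />
    <ConfigurationSettings>
      <Setting name="DataConnectionString" value="UseDevelopmentStorage=true" />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>
```

The definition declares what settings exist; the configuration supplies their values and the instance count, which is why instance count can change without redeploying the package.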
Service Definition
Service Configuration
GUI
Double-click the role name in the Azure project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • Encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
  • (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure
1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage at Massive Scale
Blob
- Massive files, e.g. videos, logs
Drive
- Use standard file system APIs
Tables
- Non-relational, but with few scale limits
- Use SQL Azure for relational data
Queues
- Facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
  • PutBlob
    • Inserts a new blob, overwrites the existing blob
  • GetBlob
    • Get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs
Each container has an access level:
- Private
  - Default; will require the account key to access
- Full public read
- Public read only
Two Types of Blobs Under the Hood
Block blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
• Size limit: 200 GB per blob
Page blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
• You can upload a file in 'blocks'
  • Each block has an id
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
Blocks
Big.mpg: 1 6 8 3 5 4 7 2
Big.mpg
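The block-commit semantics can be modeled in a few lines. This is a toy model, not the Azure storage API: blocks may be uploaded in any order (e.g. in parallel), and the committed block list defines the blob's final byte order:

```python
class BlockBlobSketch:
    """Toy model of block-blob semantics: stage blocks in any order,
    then commit a block list that fixes the final order."""
    def __init__(self):
        self.uncommitted = {}   # staged blocks, keyed by block id
        self.blob = b""         # committed blob contents

    def put_block(self, block_id, data):
        """Put Block: stage one block."""
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        """Put Block List: commit staged blocks in the order given."""
        self.blob = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()

blob = BlockBlobSketch()
# Blocks can arrive out of order (e.g. parallel upload of big.mpg chunks)...
blob.put_block("002", b"world")
blob.put_block("001", b"hello ")
# ...but committing ["001", "002"] assembles them in logical order.
blob.put_block_list(["001", "002"])
```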
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a BLOB
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
    • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob
    • Drive made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in the .cscfg
• Do not share keys; wrap with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
  Table Name: Movies
    Star Wars | Star Trek | Fan Boys
  Table Name: Customers
    Brian H Prince | Jason Argonaut | Bill Gates
Account → Table → Entity
Tables store entities. Entity schema can vary in the same table.
Windows Azure Tables
• Provides structured storage
  • Massively scalable tables
    • Billions of entities (rows) and TBs of data
    • Can use thousands of servers as traffic grows
  • Highly available & durable
    • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational
Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
The partition key is the unit of scale
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
System load balancing
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
Server Busy
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single partition limits have been reached
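A sketch of the truncated exponential back-off the slide recommends for "Server Busy" responses (the base delay and cap values below are arbitrary choices, and jitter is left out for clarity):

```python
def backoff_delays(attempts, base=0.5, cap=30.0):
    """Truncated exponential back-off: double the delay on each retry,
    but never exceed `cap` seconds."""
    delays = []
    for attempt in range(attempts):
        delays.append(min(cap, base * (2 ** attempt)))
    return delays

# On "Server Busy" (HTTP 503), sleep delays[n] before the nth retry;
# this gives the storage system time to load balance the hot partition.
```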
Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition
PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition
Container Name | Blob Name
image | annarborbighouse.jpg
image | foxboroughgillette.jpg
video | annarborbighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition
Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has written to all three replicas
• Reads are only load balanced to replicas in sync
[Diagram: Server 1, Server 2, Server 3, each holding replicas of partitions P1, P2, …, Pn]
Scalability Targets
Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Partitions and Partition Ranges
Server B: Table = Movies [Comedy - Max]
Server A: Table = Movies [Min - Comedy)
Server A: Table = Movies [Min - Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A query may return a continuation token because of:
• A maximum of 1,000 rows in a response
• The end of a partition range boundary
• A maximum of 5 seconds to execute the query
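The safe pattern is to keep issuing the query until no token comes back. A minimal sketch of that loop, with a toy paging server standing in for the Table service (not SDK code):

```python
def query_all(query_fn, ):
    """Drain a query by following continuation tokens.
    `query_fn(token)` returns (rows, next_token); next_token is None
    when the server has nothing more. Mirrors the Azure Table rule
    that a single response carries at most 1,000 rows."""
    rows, token = [], None
    while True:
        page, token = query_fn(token)
        rows.extend(page)
        if token is None:      # no continuation token: we are done
            return rows

# A toy server that pages a list of 2,500 "entities", 1,000 at a time.
DATA = list(range(2500))

def fake_query(token):
    start = token or 0
    page = DATA[start:start + 1000]
    next_token = start + 1000 if start + 1000 < len(DATA) else None
    return page, next_token
```

Note that a response with zero rows can still carry a token (e.g. at a partition range boundary), which is why the loop tests the token, not the page size.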
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select a PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• 'Server busy' means load on a single partition has exceeded the limits; the service will load balance partitions to meet traffic needs
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
• Tight coupling leads to brittleness
• Loose coupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
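The work ticket pattern mentioned above keeps the payload out of the 8 KB message: the real data goes to blob storage, and only a small reference travels through the queue. A sketch with in-memory stand-ins for the blob and queue services (the names and structures here are illustrative, not the Azure API):

```python
import uuid
from collections import deque

blobs = {}        # stand-in for blob storage
queue = deque()   # stand-in for an Azure queue

def submit_job(payload):
    """Work ticket pattern: store the (possibly large) payload in
    blob storage and enqueue only a small reference to it."""
    blob_name = str(uuid.uuid4())
    blobs[blob_name] = payload
    queue.append(blob_name)       # the 'work ticket' is well under 8 KB
    return blob_name

def process_next():
    """Worker: dequeue a ticket, fetch the payload, do the work,
    then clean up so no orphaned blob is left behind."""
    ticket = queue.popleft()
    payload = blobs.pop(ticket)   # garbage-collect the blob when done
    return len(payload)           # stand-in for real processing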
Queue Terminology
Message Lifecycle
(Diagram: a Web Role calls PutMessage to add Msg 1–Msg 4 to the queue; a Worker Role calls GetMessage with a visibility timeout, processes the message, and then calls RemoveMessage.)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a backoff polling approach: each empty poll increases the polling interval by 2x, up to a maximum (hence "truncated"); a successful poll sets the interval back to 1.
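The policy above fits in a few lines. A sketch of the interval schedule (the base of 1 s and cap of 60 s are illustrative choices, not mandated values):

```python
def backoff_intervals(polls, base=1.0, cap=60.0):
    """Truncated exponential back-off polling: each empty poll doubles
    the wait (capped at `cap` seconds); a successful poll resets it to
    `base`. `polls` is a sequence of booleans (True = message found)."""
    interval, waits = base, []
    for got_message in polls:
        waits.append(interval)
        interval = base if got_message else min(interval * 2, cap)
    return waits
```

A worker would sleep for each interval between GetMessage calls, which keeps transaction costs low on an idle queue while staying responsive once traffic returns.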
Removing Poison Messages
(Diagram: producers P1 and P2 feed the queue; consumers C1 and C2 dequeue messages. Each message carries a DequeueCount.)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1 (DequeueCount = 2)
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1 (DequeueCount = 3)
12. DequeueCount > 2, so msg 1 is treated as a poison message
13. C1: DeleteMessage(Q, msg 1)
Queues Recap
• Make message processing idempotent – no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use DequeueCount to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use message count to scale – dynamically increase/reduce workers
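The DequeueCount threshold from the recap can be expressed as a small guard in the worker loop. A sketch (the threshold of 3 and the dict-shaped message are illustrative, though Azure does expose a DequeueCount on each message):

```python
MAX_DEQUEUE = 3   # illustrative threshold: give a message a few tries

def handle(queue_message, process, poison_queue):
    """Enforce a dequeue-count threshold: a message that keeps
    reappearing after crashes is 'poison' and is parked aside
    instead of being retried forever. `queue_message` is a dict
    with 'body' and 'dequeue_count'."""
    if queue_message["dequeue_count"] > MAX_DEQUEUE:
        poison_queue.append(queue_message)   # park it for inspection
        return "poisoned"
    process(queue_message["body"])           # must be idempotent
    return "done"
```

Parking poison messages in a side queue (or table) rather than deleting them outright preserves the evidence for debugging the crash they caused.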
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts, measure, and find what is ideal for you
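The size choice above is ultimately arithmetic. A toy model (throughput = cores^efficiency is an assumption for illustration, as is the hypothetical pricing where an N-core VM costs N times the 1-core hourly rate) makes the break-even visible:

```python
def throughput(cores, efficiency):
    """Aggregate throughput of one VM with `cores` cores;
    efficiency < 1 models sub-linear scaling across cores,
    efficiency > 1 super-linear (e.g. a cache/memory benefit)."""
    return cores ** efficiency

def cost_per_unit(cores, efficiency, price_per_core_hour=0.12):
    """Hourly price divided by throughput: lower is better."""
    return (cores * price_per_core_hour) / throughput(cores, efficiency)
```

Under this model, with perfect linear scaling the sizes tie; with sub-linear scaling many small instances win; with super-linear scaling (as AzureBLAST later observes for extra-large instances) one big VM is cheaper per unit of work.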
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network IO intensive, storage IO intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage IO-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure and poor user experience from not having excess capacity against the cost of idling VMs
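One concrete way to automate the trade-off above is to size the worker pool from the queue backlog (the "use message count to scale" tip from the queues recap). Every number below is a made-up illustration, not a recommended setting:

```python
import math

def target_instances(queue_length, msgs_per_instance_per_min=100,
                     min_instances=2, max_instances=20):
    """Pick a worker-instance count from the current queue backlog.
    A floor keeps spare capacity for spikes (VMs take minutes to
    start); a ceiling bounds cost if the queue runs away."""
    needed = math.ceil(queue_length / msgs_per_instance_per_min)
    return max(min_instances, min(needed, max_instances))
```

Running this periodically against the queue's approximate message count, and only acting when the target changes for several samples in a row, avoids thrashing instances up and down.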
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web app's performance and user experience.
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
(Diagram: uncompressed vs. compressed content – Gzip/minify JavaScript, minify CSS, minify images.)
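Point 1 above is a one-liner in most stacks. A sketch using Python's standard gzip module (the HTML snippet is an arbitrary example); in a real service the framework sets `Content-Encoding: gzip` when the client advertises `Accept-Encoding: gzip`:

```python
import gzip

def gzip_body(body, level=6):
    """Gzip an HTTP response body: trades a little CPU for less
    bandwidth, which is billed separately from compute."""
    return gzip.compress(body, compresslevel=level)

# Repetitive markup (typical of HTML) compresses extremely well.
page = b"<html>" + b"<li>row</li>" * 1000 + b"</html>"
smaller = gzip_body(page)
```

Compression level 6 is the usual balance; level 9 buys little extra at noticeably more CPU, which matters when the CPU is also metered.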
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST); needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.
AzureBLAST Task-Flow: a simple Split/Join pattern
Leverage the multiple cores of one instance
• argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Too large a partition → load imbalance
• Too small a partition → unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long waiting in case of an instance failure
(Diagram: a Splitting task fans out into BLAST tasks, which feed a Merging task.)
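The Split/Join pattern above reduces to three small functions. A sketch with a stand-in for the BLAST run itself (the sequence names and hit format are invented for illustration):

```python
def split(sequences, partition_size=100):
    """Query segmentation: slice the input sequences into partitions
    (the micro-benchmarks found ~100 sequences per partition worked
    well for AzureBLAST)."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    """Stand-in for running NCBI BLAST on one partition; real tasks
    are independent, so they go onto the dispatch queue."""
    return [(seq, "hit-for-" + seq) for seq in partition]

def merge(results):
    """Join step: concatenate per-partition results in order."""
    merged = []
    for part in results:
        merged.extend(part)
    return merged

sequences = ["seq%d" % i for i in range(250)]
partitions = split(sequences)
merged = merge(blast_task(p) for p in partitions)
```

In the real system each partition becomes a queue message, the visibility timeout approximates the task run time, and the merging task runs once all partitions report done.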
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability
Task size / instance size vs. cost
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource
AzureBLAST Architecture
(Diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, backed by a job registry in Azure Tables; worker roles pull tasks from a global dispatch queue; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a Database Updating Role keeps the databases current. A Splitting task fans the input out into BLAST tasks, and a Merging task combines their results.)
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs
Authentication/authorization based on Live ID
The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory states
(Diagram: the Web Portal and Web Service handle job registration; the Job Portal feeds the Job Scheduler, Scaling Engine, and Job Registry.)
Demonstration
R. palustris as a platform for H2 production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (42 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4,000 cores
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
(Diagram: deployments of 50-62 nodes each across the datacenters.)
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., a task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in a group – this is an update domain (~30 mins, ~6 nodes in one group)
Surviving Storage Failures
West Europe datacenter: 30,976 tasks completed, and the job was killed
35 nodes experienced blob writing failures at the same time
A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish proverb
Computing Evapotranspiration (ET)
Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
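As a sketch, the Penman-Monteith formula above is a direct one-liner in code. The default γ ≈ 66 Pa/K follows the slide's definitions; the default λv ≈ 2450 J/g and the sample inputs in the test are assumptions for illustration only:

```python
def penman_monteith(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2.45e3):
    """ET = (delta*Rn + rho_a*c_p*dq*g_a) /
            ((delta + gamma*(1 + g_a/g_s)) * lambda_v)
    delta, gamma, dq in Pa-based units per the slide's definitions;
    lambda_v in J/g (~2450 J/g near 20 C is an assumed default)."""
    return (delta * Rn + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)
```

The interesting part, as the slide notes, is not the formula but estimating ga and gs across a whole catchment, which is what the imagery pipeline feeds.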
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors
(Diagram: the AzureMODIS Service Web Role Portal takes requests from scientists into a Request Queue; the Data Collection Stage pulls source imagery from download sites via a Download Queue; a Reprojection Queue feeds the Reprojection Stage; Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages; scientific results are available for download.)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables
(Diagram: <PipelineStage> Requests are persisted as <PipelineStage>JobStatus by the MODISAzure Service (Web Role) into the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
(Diagram: Generic Workers (Worker Roles) pull from the <PipelineStage> Task Queue and read <Input>Data Storage; the Service Monitor parses and persists <PipelineStage>TaskStatus.)
Example Pipeline Stage: Reprojection Service
(Diagram: Reprojection Requests enter the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus – each entity specifies a single reprojection job request – then parses and persists ReprojectionTaskStatus – each entity specifies a single reprojection task, i.e., a single tile – and dispatches to the Task Queue for Generic Workers. Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile; query the ScanTimeList table to get the list of satellite scan times that cover a target tile. Reprojection Data Storage and Swath Source Data Storage hold the tiles.)
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Per-stage figures:
• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds can act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best Practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Web Role Instances Worker RoleInstances
Azure StorageBlob Queue Table
SQL Azure
Windows Azure Application Model
Key ComponentsFabric Controller
bull Manages hardware and virtual machines for service
Computebull Web Roles
bull Web application front end
bull Worker Rolesbull Utility compute
bull VM Rolesbull Custom compute rolebull You own and customize the VM
Storagebull Blobs
bull Binary objects
bull Tablesbull Entity storage
bull Queuesbull Role coordination
bull SQL Azurebull SQL in the cloud
Key ComponentsFabric Controller
bull Think of it as an automated IT departmentbull ldquoCloud Layerrdquo on top ofbull Windows Server 2008bull A custom version of Hyper-V called the Windows Azure Hypervisor
bull Allows for automated management of virtual machines
Key ComponentsFabric Controller
bull Think of it as an automated IT departmentbull ldquoCloud Layerrdquo on top ofbull Windows Server 2008bull A custom version of Hyper-V called the Windows Azure Hypervisor
bull Allows for automated management of virtual machines
bull Itrsquos job is to provision deploy monitor and maintain applications in data centers
bull Applications have a ldquoshaperdquo and a ldquoconfigurationrdquobull The configuration definition describes the shape of a service
bull Role typesbull Role VM sizesbull External and internal endpointsbull Local storage
bull The configuration settings configures a servicebull Instance countbull Storage keysbull Application-specific settings
Key ComponentsFabric Controller
bull Manages ldquonodesrdquo and ldquoedgesrdquo in the ldquofabricrdquo (the hardware)bull Power-on automation devicesbull Routers Switchesbull Hardware load balancersbull Physical serversbull Virtual servers
bull State transitionsbull Current Statebull Goal Statebull Does what is needed to reach and maintain the goal state
bull Itrsquos a perfect IT employeebull Never sleepsbull Doesnrsquot ever ask for raisebull Always does what you tell it to do in configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components ndash ComputeWeb Roles
Web Front Endbull Cloud web serverbull Web pagesbull Web services
You can create the following typesbull ASPNET web rolesbull ASPNET MVC 2 web rolesbull WCF service web rolesbull Worker rolesbull CGI-based web roles
Key Components ndash ComputeWorker Roles
bull Utility computebull Windows Server 2008bull Background processingbull Each role can define an amount of local storagebull Protected space on the local drive considered volatile
storage bull May communicate with outside servicesbull Azure Storagebull SQL Azurebull Other Web services
bull Can expose external and internal endpoints
Suggested Application ModelUsing queues for reliable messaging
Scalable Fault Tolerant Applications
Queues are the application gluebull Decouple parts of application easier to scale independentlybull Resource allocation different priority queues and backend
serversbull Mask faults in worker roles (reliable messaging)
Key Components ndash ComputeVM Roles
bull Customized Rolebull You own the box
bull How it worksbull Download ldquoGuest OSrdquo to Server 2008 Hyper-Vbull Customize the OS as you need tobull Upload the differences VHDbull Azure runs your VM role usingbull Base OSbull Differences VHD
Application Hosting
lsquoGrokkingrsquo the service modelbull Imagine white-boarding out your service architecture with boxes for
nodes and arrows describing how they communicate
bull The service model is the same diagram written down in a declarative format
bull You give the Fabric the service model and the binaries that go with each of those nodes
bull The Fabric can provision deploy and manage that diagram for you
bull Find hardware home
bull Copy and launch your app binaries
bull Monitor your app and the hardware
bull In case of failure take action Perhaps even relocate your app
bull At all times the lsquodiagramrsquo stays whole
Automated Service ManagementProvide code + service modelbull Platform identifies and allocates resources deploys the service
manages service healthbull Configuration is handled by two files
ServiceDefinitioncsdefServiceConfigurationcscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
bull We can deploy from the portal or from scriptbull VS builds two filesbull Encrypted package of your codebull Your config file
bull You must create an Azure account then a service and then you deploy your code
bull Can take up to 20 minutes bull (which is better than six months)
Service Management API
bullREST based API to manage your servicesbullX509-certs for authenticationbullLets you create delete change upgrade swaphellipbullLots of community and MSFT-built tools around the API- Easy to roll your own
The Secret Sauce ndash The Fabric The Fabric is the lsquobrainrsquo behind Windows Azure
1Process service model1 Determine resource requirements
2 Create role images
2Allocate resources
3Prepare nodes1 Place role images on nodes
2 Configure settings
3 Start roles
4Configure load balancers
5Maintain service health1 If role fails restart the role based on policy
2 If node fails migrate the role based on policy
StorageReplicated Highly Available Load Balanced
Durable Storage At Massive Scale
Blob- Massive files eg videos logs
Drive- Use standard file system APIs
Tables- Non-relational but with few scale limits- Use SQL Azure for relational data
Queues- Facilitate loosely-coupled reliable systems
Blob Features and Functionsbull Store Large Objects (up to 1TB
in size)
bull You can have as many containers and Blobs as you want
bull Standard REST Interfacebull PutBlob
bull Inserts a new blob overwrites the existing blob
bull GetBlobbull Get whole blob or a specific range
bull DeleteBlobbull CopyBlobbull SnapshotBlobbull LeaseBlob
bull Each Blob has an addressbull httpltstorageaccountgtblobcorewindowsnetltContainergtltBlobNamegtbull httpmovieconversionblobcorewindowsnetoriginalsbargampg
Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs
• Each container has an access level:
  • Private (default): requires the account key to access
  • Full public read
  • Public read only
Two Types of Blobs Under the Hood
• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob

Block blobs in more detail:
• You can upload a file in 'blocks'; each block has an id
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
Blocks
(Diagram: Big.mpg is uploaded as blocks, e.g. in the order 1 6 8 3 5 4 7 2, then committed in sequence into the final blob Big.mpg.)
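The Put Block / Put Block List flow above can be sketched in a few lines. This is an illustrative in-memory stand-in for the behavior described, not the real REST API; the class and block-id naming are made up.

```python
class BlockBlobSim:
    """In-memory sketch of the block-blob upload flow: stage blocks, then commit a list."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes (GC'd after ~1 week if never committed)
        self.committed = b""

    def put_block(self, block_id, data):
        # Blocks may arrive in any order, possibly uploaded in parallel
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # The commit order chosen here, not the upload order, defines the blob
        self.committed = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()

blob = BlockBlobSim()
for i, chunk in enumerate([b"AAAA", b"BBBB", b"CCCC"]):
    blob.put_block(f"block-{i:05d}", chunk)
blob.put_block_list(["block-00000", "block-00001", "block-00002"])
assert blob.committed == b"AAAABBBBCCCC"
```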
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
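The 512-byte alignment rule can be checked with a couple of small helpers; `aligned_range` and `pad_to_page` are hypothetical names for illustration, not part of any SDK.

```python
PAGE = 512  # page-blob writes must start and end on 512-byte boundaries

def aligned_range(offset: int, length: int) -> bool:
    """Return True if [offset, offset+length) is a valid page-blob write range."""
    return offset % PAGE == 0 and (offset + length) % PAGE == 0

def pad_to_page(data: bytes) -> bytes:
    """Zero-pad a payload so it can be written as whole 512-byte pages."""
    rem = len(data) % PAGE
    return data if rem == 0 else data + b"\x00" * (PAGE - rem)

assert aligned_range(0, 512)          # one whole page
assert not aligned_range(100, 512)    # misaligned start is rejected
assert len(pad_to_page(b"x" * 700)) == 1024
```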
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence: call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Storage hierarchy: Account → Table → Entity
• Account: MovieData
  • Table "Movies" – entities: Star Wars, Star Trek, Fan Boys
  • Table "Customers" – entities: Brian H. Prince, Jason Argonaut, Bill Gates
Tables store entities. Entity schema can vary in the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
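The schema-free entity model above can be sketched as plain dictionaries carrying the three required properties; the sample entities and the `Director` field are made up for illustration.

```python
# Entities in one table need not share a schema; only three properties are required.
REQUIRED = {"PartitionKey", "RowKey", "Timestamp"}

def validate(entity: dict) -> bool:
    return REQUIRED <= entity.keys()

movies = [
    {"PartitionKey": "Action", "RowKey": "Fast & Furious",
     "Timestamp": "2009-04-01T00:00:00Z", "ReleaseDate": 2009},
    # A second entity in the same table with an extra, differing property
    {"PartitionKey": "Comedy", "RowKey": "Office Space",
     "Timestamp": "1999-02-19T00:00:00Z", "ReleaseDate": 1999, "Director": "Mike Judge"},
]
assert all(validate(e) for e in movies)

# Entities are addressed by the (PartitionKey, RowKey) pair, which must be unique
index = {(e["PartitionKey"], e["RowKey"]): e for e in movies}
assert index[("Comedy", "Office Space")]["ReleaseDate"] == 1999
```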
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
• Different for each data type (blobs, entities, queues); every data object has a partition key
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Partitioning controls entity locality
The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
"Server Busy" responses mean either:
• The system is load balancing to meet your traffic needs, or
• Single-partition limits have been reached
Use exponential backoff on "Server Busy"
Partition Keys In Each Abstraction
• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order-1               |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order-3               |              |                     | $10.00

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image          | annarbor/bighouse.jpg
  image          | foxborough/gillette.jpg
  video          | annarbor/bighouse.jpg

• Messages – Queue name: all messages for a single queue belong to the same partition

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
(Diagram: Server 1, Server 2, and Server 3 each hold a replica of partitions P1, P2, …, Pn.)
Scalability Targets
Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
Partitions and Partition Ranges

A single server can serve the whole table:
Server A: Table = Movies [Min – Max]

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  …                       | …                        | …         | …
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006
  …                       | …                        | …         | …
  Comedy                  | Office Space             | …         | 1999
  …                       | …                        | …         | …
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  …                       | …                        | …         | …
  War                     | Defiance                 | …         | 2008

Under load, the system splits the table into partition ranges across servers:
Server A: Table = Movies [Min – Comedy) – the Action and Animation partitions
Server B: Table = Movies [Comedy – Max] – the Comedy, SciFi, and War partitions
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity Group Transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A query returns a continuation token:
• At a maximum of 1,000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds of query execution
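The rule of thumb is to loop until the service stops handing back a token. A toy paged query makes the loop shape clear; here the token is just an integer offset, unlike the real service's opaque tokens.

```python
# Hypothetical paged query: each call returns up to `page_size` rows plus a
# continuation token (None once the data is exhausted).
DATA = [f"row-{i}" for i in range(2500)]

def query(continuation=0, page_size=1000):
    page = DATA[continuation:continuation + page_size]
    nxt = continuation + page_size
    return page, (nxt if nxt < len(DATA) else None)

# Always loop on the token; a single call would silently miss 1,500 rows here.
rows, token = [], 0
while token is not None:
    page, token = query(token)
    rows.extend(page)
assert len(rows) == 2500
```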
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Guidelines:
• Select a PartitionKey and RowKey that help scale; avoid "append only" patterns – distribute by using a hash etc. as a prefix
• Always handle continuation tokens – expect them for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "Server Busy" – the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
• Tight coupling leads to brittleness
• Loose coupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML and are limited to 8 KB in size
• Commonly use the work-ticket pattern
• Why not simply use a table?
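The work-ticket pattern mentioned above keeps the large payload in blob storage and puts only a small reference in the queue message, which stays well under the 8 KB limit. A sketch with in-memory stand-ins (the names and layout are illustrative):

```python
blobs = {}       # stand-in for blob storage
queue = []       # stand-in for an Azure queue

def submit_job(job_id: str, payload: bytes):
    blobs[f"jobs/{job_id}"] = payload   # store the big data in a blob
    queue.append(job_id)                # enqueue a tiny ticket referencing it

def worker():
    job_id = queue.pop(0)               # GetMessage
    payload = blobs[f"jobs/{job_id}"]   # dereference the ticket
    return job_id, len(payload)         # ...process, then DeleteMessage

submit_job("job-1", b"x" * 100_000)     # far larger than the 8 KB message limit
assert worker() == ("job-1", 100_000)
```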
Queue Terminology
Message Lifecycle
(Diagram: a Web Role calls PutMessage to add Msg 1–4 to the queue; Worker Roles call GetMessage, with a visibility timeout, to dequeue messages, process them, and then call RemoveMessage to delete them.)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a back-off polling approach:
• Each empty poll increases the interval by 2x, up to a cap
• A successful poll sets the interval back to 1
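A minimal sketch of the truncated back-off policy, assuming a 1-second floor and a 60-second cap (both values illustrative):

```python
MIN_INTERVAL, MAX_INTERVAL = 1, 60  # seconds; the cap is what makes it "truncated"

def next_interval(current: int, got_message: bool) -> int:
    if got_message:
        return MIN_INTERVAL       # reset on success
    return min(current * 2, MAX_INTERVAL)  # double on empty poll, up to the cap

interval, history = MIN_INTERVAL, []
for got in [False, False, False, True, False]:
    interval = next_interval(interval, got)
    history.append(interval)
assert history == [2, 4, 8, 1, 2]   # doubles, then resets on the successful poll
```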
Removing Poison Messages
(Scenario: producers P1, P2 enqueue messages; consumers C1, C2 dequeue with a 30-second visibility timeout.)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2, so treat it as a poison message
13. C1: Delete(Q, msg 1)
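The dequeue-count check in the scenario above can be sketched as follows; the threshold of 3 and the in-memory queue are illustrative stand-ins.

```python
MAX_DEQUEUES = 3  # beyond this, a message is considered poison

class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0

def process(queue, handler):
    msg = queue.pop(0)                  # GetMessage
    msg.dequeue_count += 1
    if msg.dequeue_count > MAX_DEQUEUES:
        return "deleted-as-poison"      # delete (or dead-letter) instead of retrying forever
    try:
        handler(msg.body)
        return "done"                   # DeleteMessage on success
    except Exception:
        queue.append(msg)               # simulates the message becoming visible again
        return "requeued"

q = [Message("bad payload")]
def handler(body):
    raise ValueError("cannot parse")    # always fails: a poison message

results = [process(q, handler) for _ in range(4)]
assert results == ["requeued", "requeued", "requeued", "deleted-as-poison"]
```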
Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use DequeueCount to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
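The slide recommends .NET 4's Task Parallel Library; the same core idea, sizing a worker pool to the core count so active tasks do not exceed cores, can be sketched in Python:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def cpu_bound(n: int) -> int:
    return sum(i * i for i in range(n))

# Data parallelism: one pool sized to the core count, mapped over the workload.
# (In CPython, use ProcessPoolExecutor for truly CPU-bound work; threads shown
# here to keep the sketch self-contained.)
workloads = [10_000] * 8
with ThreadPoolExecutor(max_workers=os.cpu_count() or 4) as pool:
    results = list(pool.map(cpu_bound, workloads))

assert results == [cpu_bound(10_000)] * 8
```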
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of having idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often leads to savings in other places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
(Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
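The compute-for-bandwidth trade is easy to see with the standard gzip module; the repetitive sample HTML below is made up for illustration.

```python
import gzip

# Web output is typically highly repetitive markup, so it compresses very well.
html = b"<div class='row'>hello</div>" * 500
packed = gzip.compress(html)

assert len(packed) < len(html) // 10          # a >10x reduction on this input
assert gzip.decompress(packed) == html        # lossless round-trip
```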
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST) – needs special result-reduction processing
Large-volume data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.
AzureBLAST Task Flow: a simple split/join pattern
• Leverage the multiple cores of one instance: the "-a" argument of NCBI-BLAST, set to 1/2/4/8 for small, medium, large, and extra-large instance sizes
• Task granularity:
  • Large partitions → load imbalance
  • Small partitions → unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
  • Best practice: run test jobs to profile, and set the size to mitigate the overhead
• Value of the visibility timeout for each BLAST task: essentially an estimate of the task run time
  • Too small → repeated computation
  • Too large → unnecessarily long waits in case of instance failure
(Diagram: splitting task → BLAST tasks in parallel → merging task)
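The split/join pattern above can be sketched as follows; `split`, `blast_task`, and `merge` are illustrative stand-ins, not AzureBLAST code.

```python
def split(sequences, partition_size):
    """Split the input sequences into fixed-size partitions (the 'map' side)."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    # Stand-in for running NCBI-BLAST on one partition (done in parallel by workers)
    return [f"hit:{seq}" for seq in partition]

def merge(results):
    """Join the per-partition results back into one list (the 'reduce' side)."""
    return [hit for part in results for hit in part]

sequences = [f"seq{i}" for i in range(250)]
partitions = split(sequences, 100)   # the benchmarks below suggest ~100 per partition
assert [len(p) for p in partitions] == [100, 100, 50]
hits = merge(blast_task(p) for p in partitions)
assert len(hits) == 250
```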
Micro-Benchmarks Inform Design
• Task size vs. performance
  • Benefit of the warm-cache effect
  • 100 sequences per partition is the best choice
• Instance size vs. performance
  • Super-linear speedup with larger worker instances
  • Primarily due to the memory capacity
• Task size / instance size vs. cost
  • The extra-large instance generated the best and most economical throughput
  • Fully utilize the resource
AzureBLAST architecture (diagram):
• Web Role: web portal and web service for job registration
• Job Management Role: job scheduler and scaling engine, with a job registry in Azure Tables
• Worker roles: pull BLAST tasks from a global dispatch queue
• Database-updating role: keeps the NCBI databases current
• Azure Blob storage: NCBI databases, BLAST databases, temporary data, etc.
AzureBLAST Job Portal
An ASP.NET program hosted by a web-role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory states
Demonstration
R. palustris as a platform for H2 production
(Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
(Diagram: instance counts per deployment: 50, 62, 62, 62, 62, 62, 50, 62.)
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should look like:
3/31/2010 6:14  RD00155D3611B0  Executing the task 251523
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. a task failed to complete):
3/31/2010 8:22  RD00155D3611B0  Executing the task 251774
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in a group – this is an update domain
• ~6 nodes in each group
• Each group was offline for ~30 mins
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the Fault Domain is at work
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

  ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stoma (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
1. Data collection (map) stage
   • Downloads requested input tiles from NASA FTP sites
   • Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
2. Reprojection (map) stage
   • Converts source tile(s) to intermediate-result sinusoidal tiles
   • Simple nearest-neighbor or spline algorithms
3. Derivation reduction stage
   • First stage visible to scientists
   • Computes ET in our initial use
4. Analysis reduction stage
   • Optional second stage visible to scientists
   • Enables production of science analysis artifacts such as maps, tables, and virtual sensors
(Diagram: scientists submit requests through the AzureMODIS Service web-role portal; requests flow through the Request, Download, Reprojection, Reduction 1, and Reduction 2 queues across the data collection, reprojection, derivation reduction, and analysis reduction stages; source imagery comes from the download sites via source metadata, and scientific results are available for download.)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
(Diagram: a <PipelineStage> request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses it, persists <PipelineStage>TaskStatus, and dispatches to the <PipelineStage> Task Queue.)
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role (GenericWorker):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
(Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker instances dequeue the tasks and read from <Input>Data Storage.)
Example Pipeline Stage: Reprojection Service
(Diagram: a reprojection request enters the Job Queue; the Service Monitor persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches tasks to the Task Queue; GenericWorker instances process the tasks against Reprojection Data Storage and Swath Source Data Storage.)
• Each job-queue entity specifies a single reprojection job request
• Each task-queue entity specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by the data scale and the need to run the reduction multiple times
• Storage costs driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Per stage (via the AzureMODIS Service web-role portal):
• Data collection: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20–100 workers – $420 CPU, $60 download
• Derivation reduction: 5–7 GB, 55K files, 1800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Application Model Comparison
Machines Running IIS / ASP.NET
Machines Running Windows Services
Machines Running SQL Server
Ad Hoc Application Model
Web Role Instances | Worker Role Instances
Azure Storage (Blob, Queue, Table)
SQL Azure
Windows Azure Application Model
Key Components
Fabric Controller
• Manages hardware and virtual machines for service
Compute
• Web Roles
  • Web application front end
• Worker Roles
  • Utility compute
• VM Roles
  • Custom compute role; you own and customize the VM
Storage
• Blobs
  • Binary objects
• Tables
  • Entity storage
• Queues
  • Role coordination
• SQL Azure
  • SQL in the cloud
Key Components – Fabric Controller
• Think of it as an automated IT department
• "Cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V, called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
  • Role types
  • Role VM sizes
  • External and internal endpoints
  • Local storage
• The configuration settings configure a service:
  • Instance count
  • Storage keys
  • Application-specific settings
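As a rough sketch of how the two files divide the work (the role names, sizes and setting values here are illustrative, not from the deck):

```xml
<!-- ServiceDefinition.csdef: the "shape" of the service -->
<ServiceDefinition name="MyService">
  <WebRole name="WebFrontEnd" vmsize="Small">
    <Endpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </Endpoints>
  </WebRole>
  <WorkerRole name="BackEnd" vmsize="Medium" />
</ServiceDefinition>

<!-- ServiceConfiguration.cscfg: the settings for one deployment -->
<ServiceConfiguration serviceName="MyService">
  <Role name="WebFrontEnd">
    <Instances count="2" />
    <ConfigurationSettings>
      <Setting name="DataConnectionString" value="..." />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>
```

Because the instance count and settings live in the .cscfg, they can be changed without rebuilding or redeploying the application package.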
Key Components – Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware):
  • Power-on automation devices
  • Routers and switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions:
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee:
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end:
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services:
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints
Suggested Application Model: Using Queues for Reliable Messaging
Scalable, fault-tolerant applications
Queues are the application glue:
• Decouple parts of the application, so they are easier to scale independently
• Resource allocation: different priority queues and back-end servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role
  • You own the box
• How it works:
  • Download "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differencing VHD
  • Azure runs your VM role using:
    • Base OS
    • Differencing VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy and manage that diagram for you:
  • Find hardware homes
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • Encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
  • (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   • Determine resource requirements
   • Create role images
2. Allocate resources
3. Prepare nodes
   • Place role images on nodes
   • Configure settings
   • Start roles
4. Configure load balancers
5. Maintain service health
   • If a role fails, restart the role based on policy
   • If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage At Massive Scale
Blob – massive files, e.g. videos, logs
Drive – use standard file system APIs
Tables – non-relational, but with few scale limits; use SQL Azure for relational data
Queues – facilitate loosely-coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob
    • Inserts a new blob, overwrites the existing blob
  • GetBlob
    • Get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
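The address scheme is simple enough to build by hand; a small helper (illustrative, not part of any SDK) that reproduces the slide's example:

```python
def blob_url(account: str, container: str, blob_name: str) -> str:
    """Build a blob address using the
    http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName> scheme."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob_name}"

# The example from the slide:
url = blob_url("movieconversion", "originals", "barga.mpg")
```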
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private
  • Default; requires the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
    • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
    • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'
  • Each block has an ID
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
[Figure: blocks 1 6 8 3 5 4 7 2 of Big.mpg are uploaded, then committed in order into Big.mpg]
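A sketch of the block mechanics as a local simulation (not real Put Block / Put Block List calls): blocks carry base64-encoded IDs of equal length, and the commit list, not the upload order, decides the final byte sequence.

```python
import base64

def split_into_blocks(data: bytes, block_size: int):
    """Split a payload into (block_id, chunk) pairs; block IDs are
    base64-encoded and, within one blob, all the same length."""
    blocks = []
    for i in range(0, len(data), block_size):
        block_id = base64.b64encode(f"{i // block_size:08d}".encode()).decode()
        blocks.append((block_id, data[i:i + block_size]))
    return blocks

def commit_block_list(blocks, order):
    """Simulate committing a block list: assemble the blob from block IDs
    in the order given, which need not be the upload order."""
    by_id = dict(blocks)
    return b"".join(by_id[bid] for bid in order)

blocks = split_into_blocks(b"abcdefghij", 4)     # three blocks: abcd, efgh, ij
ids = [bid for bid, _ in blocks]
blob = commit_block_list(blocks, ids)            # commit in upload order
```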
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
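Since page writes must start and end on 512-byte boundaries, callers typically pad their buffers before writing; a minimal helper sketching that:

```python
PAGE = 512  # page-blob writes must align to 512-byte page boundaries

def pad_to_page(data: bytes) -> bytes:
    """Zero-pad a buffer so its length is a multiple of the 512-byte page size."""
    remainder = len(data) % PAGE
    return data if remainder == 0 else data + b"\x00" * (PAGE - remainder)
```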
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
    • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob
  • Drive made durable through standard Page Blob replication
  • Drive persists, as a Page Blob, even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letters and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
[Figure: a storage Account (MovieData) holds Tables; the table "Movies" holds entities such as Star Wars, Star Trek, Fan Boys; the table "Customers" holds Brian H. Prince, Jason Argonaut, Bill Gates]
Account → Table → Entity
Tables store entities. Entity schema can vary in the same table.
Windows Azure Tables
• Provides structured storage
  • Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
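In other words, an entity is just a property bag; only the three required properties are fixed, and entities in the same table need not share a schema. A small sketch (entity values are illustrative) of how PartitionKey groups entities:

```python
from collections import defaultdict

# Entities in one table need not share a schema; only PartitionKey,
# RowKey and Timestamp are required (Timestamp omitted here for brevity).
entities = [
    {"PartitionKey": "Action",    "RowKey": "Fast & Furious",       "ReleaseDate": 2009},
    {"PartitionKey": "Action",    "RowKey": "The Bourne Ultimatum", "ReleaseDate": 2007},
    {"PartitionKey": "Animation", "RowKey": "Open Season 2",        "Rating": "PG"},
]

def group_by_partition(entities):
    """Entities with the same PartitionKey value are served from the same partition."""
    partitions = defaultdict(list)
    for e in entities:
        partitions[e["PartitionKey"]].append(e["RowKey"])
    return dict(partitions)
```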
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
• Different for each data type (blobs, entities, queues)
Every data object has a partition key:
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
On "Server Busy":
• Use exponential backoff
• The system load balances to meet your traffic needs
• Single-partition limits may have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
[Figure: Server 1, Server 2 and Server 3 each hold the partitions P1, P2, …, Pn]
Scalability Targets
Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single queue/table partition:
• Up to 500 transactions per second
Single blob partition:
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
Partitions and Partition Ranges

Server A – Table = Movies [Min – Max]:

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

When the partition range splits across servers:

Server A – Table = Movies [Min – Comedy):

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B – Table = Movies [Comedy – Max]:

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Key Selection: Things to Consider
Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency & speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions:
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously
A query response may stop early:
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
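Because any of these limits can cut a response short, query code should always loop until no continuation token is returned. A sketch of that loop, where `execute_query` is a stand-in for whatever segmented query API you use (it takes a token, and returns the next rows plus the next token, or `None` when done):

```python
def query_all(execute_query):
    """Drain a segmented query by following continuation tokens."""
    rows, token = execute_query(None)
    results = list(rows)
    while token is not None:          # keep going until no token comes back
        rows, token = execute_query(token)
        results.extend(rows)
    return results

# Fake paged source for illustration: segments of at most 2 rows.
data = list(range(5))
def fake_query(token):
    start = token or 0
    segment = data[start:start + 2]
    next_token = start + 2 if start + 2 < len(data) else None
    return segment, next_token
```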
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
• Select a PartitionKey and RowKey that help scale
  • Distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • "Server busy": partitions are load balanced to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
[Figure: a Web Role calls PutMessage to add messages to the queue; Worker Roles call GetMessage (with a timeout), and RemoveMessage once the work is done]

PutMessage:
POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

RemoveMessage:
DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
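The lifecycle semantics can be sketched with a toy in-memory model (illustrative only, no real queue service involved): GetMessage hides a message until its visibility timeout expires, and if the consumer never deletes it, the message reappears — which is exactly what gives at-least-once processing.

```python
import itertools

class ToyQueue:
    """In-memory model of the Get/Delete message lifecycle."""
    def __init__(self):
        self._messages = []               # each entry: [id, body, visible_at]
        self._ids = itertools.count()

    def put(self, body, now=0):
        self._messages.append([next(self._ids), body, now])

    def get(self, now, visibility_timeout=30):
        """Return the first visible message and hide it for the timeout."""
        for m in self._messages:
            if m[2] <= now:               # visible at this moment?
                m[2] = now + visibility_timeout
                return m[0], m[1]
        return None

    def delete(self, msg_id):
        """RemoveMessage: only an explicit delete ends the lifecycle."""
        self._messages = [m for m in self._messages if m[0] != msg_id]
```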
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll doubles the polling interval, truncated at a maximum; a successful poll sets the interval back to 1.
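The policy above fits in a few lines; this sketch uses a floor of 1 second and a ceiling of 60 seconds (the ceiling is a tunable choice, not a service requirement):

```python
def next_interval(current, empty_poll, floor=1, ceiling=60):
    """Truncated exponential back-off for queue polling: double the sleep
    on an empty poll (capped at `ceiling` seconds), reset to `floor`
    as soon as a message is found."""
    if empty_poll:
        return min(current * 2, ceiling)
    return floor
```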
Removing Poison Messages
[Figure: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue them]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
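The poison-message rule in the recap can be sketched as a guard around the processing step (threshold and dead-letter handling are policy choices, named here for illustration):

```python
MAX_DEQUEUE_COUNT = 3  # the threshold is a policy choice

def handle(message, dequeue_count, process, dead_letter):
    """A message that keeps reappearing has likely crashed its consumers;
    past the threshold, set it aside instead of processing it again."""
    if dequeue_count > MAX_DEQUEUE_COUNT:
        dead_letter(message)   # e.g. log it, or park it in a blob/table
        return "discarded"
    process(message)
    return "processed"
```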
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
• http://blogs.msdn.com/windowsazurestorage
• http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately (Performance vs. Cost)
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of having idling VMs
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • The service choice can make a big cost difference, based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile:
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
[Figure: uncompressed content becomes compressed content via Gzip, minified JavaScript, minified CSS and minified images]
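A quick illustration of the gzip trade-off, using Python's standard library (the HTML payload is made up; repetitive markup compresses dramatically):

```python
import gzip

html = b"<html>" + b"<p>hello world</p>" * 100 + b"</html>"

# Spend a little CPU to shrink what goes over the wire and into storage.
compressed = gzip.compress(html)
ratio = len(compressed) / len(html)   # well under 1 for repetitive markup
```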
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
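Query segmentation is the simplest of these to sketch: split the input sequences into fixed-size partitions that can be queried in parallel and merged afterwards (the 100-sequences-per-partition figure echoes the micro-benchmark result later in the deck):

```python
def partition_queries(sequences, per_partition=100):
    """Split input sequences into fixed-size partitions; each partition
    can be BLASTed independently and the results merged when all finish."""
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]
```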
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out BLAST tasks, and a merging task combines their results.
Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for the small, medium, large and extra-large instance sizes
Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting time in case of an instance failure
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability
Task size/instance size vs. cost:
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource
AzureBLAST
[Architecture: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine against a job registry kept in Azure Tables; worker roles pull work from a global dispatch queue; Azure Blob storage holds the NCBI databases, BLAST databases and temporary data; a dedicated role handles database updating]
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID
The accepted job is stored into the job registry table:
• Fault tolerance: avoid in-memory states
[Figure: the web portal and web service hand jobs to job registration; the job scheduler and scaling engine work from the job registry]
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on a sampling run on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually
[Figure: instances per deployment: 50, 62, 62, 62, 62, 62, 50, 62]

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, the real working instance time should be 6–8 days
  • Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise, something is wrong (e.g. the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
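Finding the anomalies amounts to pairing each "Executing the task N" record with its "Execution of task N is done" record; unpaired tasks never completed. A sketch of that pass over the logs (the sample lines below mirror the records above):

```python
import re

def unfinished_tasks(log_lines):
    """Return the IDs of tasks that were started but never reported done."""
    started, finished = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return started - finished

log = [
    "3/31/2010 6:14 RD00155D3611B0 Executing the task 251523",
    "3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins",
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
]
```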
Surviving System Upgrades
North Europe Data Center: in total, 34,256 tasks processed
[Figure: all 62 compute nodes lost tasks and then came back in groups of ~6 nodes over ~30 minutes; this is an update domain]

Surviving Storage Failures
West Europe Data Center: 30,976 tasks were completed, and the job was killed
[Figure: 35 nodes experienced blob-writing failures at the same time]
A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes.

Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stoma (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
ET Synthesizes Imagery, Sensors, Models and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
Download Queue
Scientists
Science results
Analysis Reduction Stage / Derivation Reduction Stage / Reprojection Stage
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables
<PipelineStage> Request
…
<PipelineStage> JobStatus Persist
<PipelineStage> Job Queue
MODISAzure Service (Web Role)
Service Monitor (Worker Role)
Parse & Persist <PipelineStage> TaskStatus
…
Dispatch <PipelineStage> Task Queue
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
Service Monitor (Worker Role)
Parse & Persist <PipelineStage> TaskStatus
GenericWorker (Worker Role)
…
…
Dispatch <PipelineStage> Task Queue
…
<Input> Data Storage
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
Example Pipeline Stage: Reprojection Service
Reprojection Request …
Service Monitor (Worker Role)
Reprojection JobStatus Persist
Parse & Persist Reprojection TaskStatus
GenericWorker (Worker Role)
…
Job Queue
…
Dispatch
Task Queue
Points to
…
ScanTimeList
SwathGranuleMeta
Reprojection Data Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (i.e., a single tile)
Query this table to get geo-metadata (e.g., boundaries) for each swath tile
Query this table to get the list of satellite scan times that cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
Download Queue
Scientists
Analysis Reduction Stage / Derivation Reduction Stage / Reprojection Stage
400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers
$50 upload, $450 storage
400 GB, 45K files, 3500 hours, 20–100 workers
5–7 GB, 55K files, 1800 hours, 20–100 workers
<10 GB, ~1K files, 1800 hours, 20–100 workers
$420 CPU, $60 download
$216 CPU, $1 download, $6 storage
$216 CPU, $2 download, $9 storage
AzureMODIS Service Web Role Portal
Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
12
Application Model Comparison
Machines Running IIS / ASP.NET
Machines Running Windows Services
Machines Running SQL Server
Ad Hoc Application Model
Web Role Instances
Worker Role Instances
Azure Storage (Blob, Queue, Table)
SQL Azure
Windows Azure Application Model
Key Components
Fabric Controller
• Manages hardware and virtual machines for the service
Compute
• Web Roles
  • Web application front end
• Worker Roles
  • Utility compute
• VM Roles
  • Custom compute role; you own and customize the VM
Storage
• Blobs
  • Binary objects
• Tables
  • Entity storage
• Queues
  • Role coordination
• SQL Azure
  • SQL in the cloud
Key Components: Fabric Controller
• Think of it as an automated IT department
• "Cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
Key Components: Fabric Controller
• Think of it as an automated IT department
• "Cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
  • Role types
  • Role VM sizes
  • External and internal endpoints
  • Local storage
• The configuration settings configure a service:
  • Instance count
  • Storage keys
  • Application-specific settings
Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers / switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's the perfect IT employee
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints
Suggested Application Model: Using Queues for Reliable Messaging
Scalable, Fault-Tolerant Applications
Queues are the application glue:
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
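The web-role/worker-role pattern above can be sketched in a few lines. This is an in-process stand-in (the `queue` module here replaces an Azure Queue, and the function names are illustrative): the front end only enqueues a small work ticket, and a worker polls and processes it independently.

```python
import queue

# Stand-in for an Azure Queue; in a real app this would be REST calls
# against the queue service.
work_queue = queue.Queue()

def web_role_enqueue(job_id):
    # The web role records a work ticket, not the work itself,
    # so it stays fast and the worker tier can scale independently.
    work_queue.put({"job": job_id})

def worker_role_poll():
    # A worker role dequeues one ticket and processes it;
    # returns None when the queue is empty.
    try:
        msg = work_queue.get_nowait()
    except queue.Empty:
        return None
    return "processed:{}".format(msg["job"])
```

The decoupling is the point: the producer never waits on the consumer, and either tier can be scaled or restarted without the other noticing.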
Key Components – Compute: VM Roles
• Customized role
  • You own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differencing VHD
  • Azure runs your VM role using:
    • Base OS
    • Differencing VHD
Application Hosting
'Grokking' the Service Model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action – perhaps even relocate your app
  • At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)
Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API – easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage At Massive Scale
Blob – massive files, e.g., videos, logs
Drive – use standard file system APIs
Tables – non-relational, but with few scale limits; use SQL Azure for relational data
Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private – default; requires the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
• You can upload a file in 'blocks'
  • Each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
Blocks
(diagram: Big.mpg uploaded as blocks 1 6 8 3 5 4 7 2, then committed in order into Big.mpg)
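The block-blob semantics described above can be captured in a toy model: blocks are uploaded in any order, and a final block list fixes the order that gets committed. The class and method names below mirror the service's Put Block / Put Block List operations but are purely illustrative; this is not the Azure client library.

```python
class BlockBlob:
    """Toy model of block-blob commit semantics."""

    def __init__(self):
        self._uncommitted = {}   # block_id -> bytes, upload order irrelevant
        self._committed = b""

    def put_block(self, block_id, data):
        # Blocks can arrive in any order, even in parallel.
        self._uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # The commit fixes the final order; uncommitted leftovers
        # are eventually garbage collected by the real service.
        self._committed = b"".join(self._uncommitted[b] for b in block_ids)
        self._uncommitted = {}

    def content(self):
        return self._committed
```

The design point this illustrates: because ordering is decided only at commit time, many workers can upload disjoint blocks concurrently and a single Put Block List assembles the result.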
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
    • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob
  • Drive made durable through standard Page Blob replication
  • Drive persists even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letter and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
Table name: Movies (entities: Star Wars, Star Trek, Fan Boys)
Table name: Customers (entities: Brian H. Prince, Jason Argonaut, Bill Gates)
Hierarchy: Account → Table → Entity
Tables store entities; entity schema can vary in the same table
Windows Azure Tables
• Provides structured storage
  • Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational
Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
The partition key is the unit of scale
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
System load balancing
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
Server Busy
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Or the single-partition limits have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name         | CreditCardNumber    | OrderTotal
1                         | Customer         | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1        |              |                     | $35.12
2                         | Customer         | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3        |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1
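The partitioning rules above can be condensed into one small helper. This is a sketch of the rules, not the storage service's internals; the function name and the dictionary shape are illustrative.

```python
def partition_key(kind, **names):
    """Return the tuple that identifies an object's partition,
    following the slide's rules for each storage abstraction."""
    if kind == "entity":
        # Entities: table name + PartitionKey value
        return (names["table"], names["partition_key"])
    if kind == "blob":
        # Every blob (and its snapshots) lives in its own partition
        return (names["container"], names["blob"])
    if kind == "message":
        # All messages of one queue share a single partition
        return (names["queue"],)
    raise ValueError("unknown abstraction: %s" % kind)
```

This makes the scaling consequence concrete: two entities in the same table with the same PartitionKey can never be split across servers, while two blobs always can.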
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
(diagram: Server 1, Server 2, Server 3, each holding replicas of partitions P1, P2, …, Pn)
Scalability Targets
Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
Partitions and Partition Ranges
Initially one server holds the whole table:
Server A: Table = Movies [Min – Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

After load balancing, the partition range splits across servers:
Server A: Table = Movies [Min – Comedy) – the Action and Animation partitions
Server B: Table = Movies [Comedy – Max] – the Comedy, SciFi, and War partitions
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A continuation token is returned when:
• The response reaches the maximum of 1000 rows
• The query reaches the end of a partition range boundary
• The query reaches the maximum of 5 seconds to execute
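The client-side consequence of the rules above is a loop that keeps issuing the query until no token comes back. The sketch below simulates the server's 1000-row page limit with illustrative function names; a real client would pass the token back through the REST/OData query.

```python
PAGE_LIMIT = 1000  # the per-response row cap from the slide

def query_page(rows, token=0):
    """Simulated server: return one page and a continuation token,
    or None for the token when no rows remain."""
    page = rows[token:token + PAGE_LIMIT]
    next_token = token + PAGE_LIMIT if token + PAGE_LIMIT < len(rows) else None
    return page, next_token

def query_all(rows):
    """The client pattern: always loop until the token is exhausted."""
    results, token = [], 0
    while token is not None:
        page, token = query_page(rows, token)
        results.extend(page)
    return results
```

A client that ignores the token silently sees only the first 1000 rows, which is exactly the bug the slide's "Seriously" is warning about.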
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
• Select a PartitionKey and RowKey that help scale
• Avoid "append only" patterns – distribute by using a hash etc. as a prefix
• Always handle continuation tokens – expect them for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries – "Server Busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
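The "distribute by using a hash as prefix" advice can be shown concretely: prefixing an append-only key (such as a timestamp) with a small hash bucket spreads sequential writes across partitions. The bucket count and key format below are illustrative choices, not anything the platform prescribes.

```python
import hashlib

BUCKETS = 16  # illustrative; pick to match how far you want writes spread

def prefixed_key(natural_key):
    """Prepend a stable hash bucket so sequential keys land in
    different partitions instead of hammering the last one."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % BUCKETS
    return "{:02d}-{}".format(bucket, natural_key)
```

The trade-off is that range queries over the natural key now require one query per bucket, so this suits write-heavy, point-read workloads.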
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout) / RemoveMessage
Msg 2, Msg 1
Worker Role
Msg 2
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
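The GetMessage/Delete lifecycle shown above can be modeled as a toy queue: a dequeued message becomes invisible for the timeout and reappears if it is not deleted in time. The class is illustrative (the real service works over REST, as in the trace above); the clock is passed in so the behavior is deterministic.

```python
class ToyQueue:
    """Toy model of the visibility-timeout message lifecycle."""

    def __init__(self):
        self._msgs = {}   # message id -> (body, visible_at)
        self._next = 0

    def put(self, body):
        self._msgs[self._next] = (body, 0.0)
        self._next += 1

    def get(self, now, timeout=30.0):
        # Return the first visible message and hide it for `timeout`
        # seconds, mimicking GetMessage; None if nothing is visible.
        for mid, (body, visible_at) in sorted(self._msgs.items()):
            if now >= visible_at:
                self._msgs[mid] = (body, now + timeout)
                return mid, body
        return None

    def delete(self, mid):
        # RemoveMessage: only an explicit delete ends the lifecycle.
        self._msgs.pop(mid, None)
```

The key property to notice: a consumer that crashes mid-processing simply lets the timeout expire, and the message becomes visible to another consumer, which is why processing must be idempotent.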
Truncated Exponential Back-Off Polling
Consider a back-off polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1
(diagram: consumers C1 and C2 polling the queue, intervals growing 1, 2, … up to 60)
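The polling strategy above fits in one function. The minimum and cap values are illustrative defaults matching the diagram's 1…60 range.

```python
def next_interval(current, got_message, minimum=1.0, maximum=60.0):
    """Truncated exponential back-off for queue polling:
    double the wait on an empty poll (up to a cap), reset on success."""
    if got_message:
        return minimum
    return min(current * 2.0, maximum)
```

In a worker loop, the interval is threaded through each poll: `interval = next_interval(interval, queue_poll_succeeded)` followed by a sleep of `interval` seconds when the poll was empty.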
Removing Poison Messages
(diagram: producers P1 and P2 feeding queue Q; consumers C1 and C2)
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (2)
(diagram: producers P1 and P2 feeding queue Q; consumers C1 and C2)
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (3)
(diagram: producers P1 and P2 feeding queue Q; consumers C1 and C2)
1. Dequeue(Q, 30 sec) → msg 1
2. Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
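The rule applied at steps 12–13 above is simple to express: check the message's dequeue count before processing and drop messages that keep reappearing. The threshold and function shape are illustrative; a real consumer would read DequeueCount from the queue message.

```python
POISON_THRESHOLD = 2  # matches "DequeueCount > 2" in the sequence above

def handle(message, dequeue_count, process):
    """Returns 'deleted', 'processed', or 'requeued'."""
    if dequeue_count > POISON_THRESHOLD:
        # Poison message: delete instead of retrying forever
        # (or move it to a dead-letter store for inspection).
        return "deleted"
    try:
        process(message)
        return "processed"
    except Exception:
        # Do nothing: the message becomes visible again after
        # the visibility timeout and will be retried.
        return "requeued"
```

Without this check, a message whose processing always crashes the consumer would cycle through the queue indefinitely, taking a worker down on every retry.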
Queues Recap
• No need to deal with failures – make message processing idempotent
• Invisible messages result in out-of-order delivery – do not rely on order
• Enforce a threshold on a message's dequeue count – use the dequeue count to remove poison messages
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Dynamically increase/reduce workers – use the message count to scale
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
bull Having the correct VM size can make a big difference in costs
bull Fundamental choice ndash larger fewer VMs vs many smaller instances
bull If you scale better than linear across cores larger VMs could save you money
bull Pretty rare to see linear scaling across 8 cores
bull More instances may provide better uptime and reliability (more failures needed to take your service down)
bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
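The slide's data-parallelism advice (Task Parallel Library in .NET 4) has the same shape in Python's standard library. A minimal sketch, with illustrative names; worker count should be tuned to the instance's core count, as the bullets above caution.

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(items, worker, max_workers=4):
    """Apply `worker` to every item using a thread pool;
    results come back in the original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))
```

For example, `process_all(tile_ids, download_tile)` keeps a worker role's cores busy on I/O-bound tasks instead of idling while one request completes at a time.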
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure / poor user experience due to not having excess capacity, and the costs of having idling VMs
(trade-off: performance vs. cost)
Storage Costs
bullUnderstand an applicationrsquos storage profile and how storage billing works
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
  • Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
(diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model
• Web Role + Queue + Worker
• With three special considerations
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (ScienceCloud 2010), Association for Computing Machinery, 21 June 2010
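The query-segmentation pattern reduces to split / parallel query / merge. A minimal in-process sketch in Python (the `run_blast` stand-in and partition size are ours; the real engine shells out to NCBI-BLAST from worker roles and coordinates through Azure queues):

```python
def split_queries(fasta_lines, per_partition):
    """Split FASTA input into partitions of at most `per_partition` sequences."""
    partitions, current, n_seqs = [], [], 0
    for line in fasta_lines:
        if line.startswith(">"):          # each ">" header starts a new sequence
            if n_seqs == per_partition:
                partitions.append(current)
                current, n_seqs = [], 0
            n_seqs += 1
        current.append(line)
    if current:
        partitions.append(current)
    return partitions

def run_blast(partition):
    # Stand-in for one worker-role task; the real task invokes NCBI-BLAST.
    return [line[1:] + ":hit" for line in partition if line.startswith(">")]

def merge_results(per_partition_results):
    # The merging task concatenates per-partition outputs in partition order.
    return [hit for result in per_partition_results for hit in result]

fasta = []
for i in range(7):
    fasta += [">seq%d" % i, "ACGT"]

parts = split_queries(fasta, per_partition=3)       # partitions of 3, 3, 1 sequences
hits = merge_results(run_blast(p) for p in parts)
```

In the service, each partition becomes a queue message consumed by a worker; the merge runs once all partition results are in blob storage.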
AzureBLAST Task Flow: a simple split/join pattern
Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads
• NCBI-BLAST overhead
• Data-transfer overhead
Best practice: test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waits in case of instance failure
[Diagram: a splitting task fans out to parallel BLAST tasks, which feed a merging task]
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
[Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine, persisting the job registry in Azure Tables; tasks go onto a global dispatch queue consumed by worker instances; a database updating role refreshes the NCBI databases; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.]
AzureBLAST Job Portal
ASP.NET program hosted by a web-role instance
• Submit jobs
• Track job status and logs
Authentication/authorization based on Live ID
The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory state
Demonstration
R. palustris as a platform for H2 production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences in total to be queried
• Theoretically 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists
Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When the load becomes imbalanced, redistribute it manually
[Map: per-deployment instance counts of 50 or 62 nodes across the four datacenters]
End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
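The failure pattern (a task that logs "Executing" but never "is done") can be detected mechanically. A minimal Python sketch over records paraphrased from the log excerpt above; the parsing relies only on the two fixed phrases:

```python
def find_incomplete_tasks(log_lines):
    """Return ids of tasks that logged 'Executing' but never 'is done'."""
    started, finished = set(), set()
    for line in log_lines:
        if "Executing the task" in line:
            started.add(line.split("Executing the task", 1)[1].split()[0])
        elif "Execution of task" in line:
            finished.add(line.split("Execution of task", 1)[1].split()[0])
    return started - finished

log = [
    "3/31/2010 6:14 RD00155D3611B0 Executing the task 251523",
    "3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins",
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",   # never completes
]
missing = find_incomplete_tasks(log)
```

Running this over all node logs is how the lost-task groupings in the next slides were surfaced.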
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in groups; this is an update domain
• ~30 mins
• ~6 nodes in one group
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed
35 nodes experienced blob-writing failures at the same time
A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" (Irish proverb)
Computing Evapotranspiration (ET)
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)
Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) from plants
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Pipeline diagram: scientists submit requests through the AzureMODIS Service web-role portal and request queue; the download queue feeds the data collection stage (source imagery download sites, source metadata); the reprojection queue feeds the reprojection stage; reduction 1 and reduction 2 queues feed the derivation reduction and analysis reduction stages; science results are available for scientific results download]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The MODISAzure Service is the Web Role front door
• Receives all user requests
• Queues requests to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks, recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables
[Diagram: a <PipelineStage> request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and places it on the <PipelineStage> job queue; the Service Monitor (Worker Role) parses it, persists <PipelineStage>TaskStatus, and dispatches to the <PipelineStage> task queue]
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches the <PipelineStage> task queue to GenericWorker (Worker Role) instances, which read <Input> data storage]
Example Pipeline Stage: Reprojection Service
[Diagram: reprojection requests enter the job queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus (each entity specifies a single reprojection job request), parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e., a single tile), and dispatches to the task queue consumed by GenericWorker (Worker Role) instances; workers query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile and the ScanTimeList table for the satellite scan times that cover a target tile, reading swath source data storage and writing reprojection data storage]
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage data volumes and costs (as annotated on the pipeline diagram):
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3,500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1,800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1,800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage
Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Application Model Comparison
Ad hoc application model: machines running IIS/ASP.NET, machines running Windows Services, machines running SQL Server
Windows Azure application model: Web Role instances, Worker Role instances, Azure Storage (Blob, Queue, Table), SQL Azure
Key Components
Fabric Controller
• Manages hardware and virtual machines for service
Compute
• Web Roles
• Web application front end
• Worker Roles
• Utility compute
• VM Roles
• Custom compute role; you own and customize the VM
Storage
• Blobs
• Binary objects
• Tables
• Entity storage
• Queues
• Role coordination
• SQL Azure
• SQL in the cloud
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
• Windows Server 2008
• A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
Key Components: Fabric Controller
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service
• Role types
• Role VM sizes
• External and internal endpoints
• Local storage
• The configuration settings configure a service
• Instance count
• Storage keys
• Application-specific settings
Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware)
• Power-on automation devices
• Routers, switches
• Hardware load balancers
• Physical servers
• Virtual servers
• State transitions
• Current state
• Goal state
• Does what is needed to reach and maintain the goal state
• It's a perfect IT employee
• Never sleeps
• Doesn't ever ask for a raise
• Always does what you tell it to do in configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
• Protected space on the local drive, considered volatile storage
• May communicate with outside services
• Azure Storage
• SQL Azure
• Other web services
• Can expose external and internal endpoints
Suggested Application Model: Using Queues for Reliable Messaging
Scalable, Fault-Tolerant Applications
Queues are the application glue
• Decouple parts of the application; easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role
• You own the box
• How it works:
• Download "Guest OS" to Server 2008 Hyper-V
• Customize the OS as you need to
• Upload the differencing VHD
• Azure runs your VM role using
• Base OS
• Differencing VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you
• Find hardware home
• Copy and launch your app binaries
• Monitor your app and the hardware
• In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
• ServiceDefinition.csdef
• ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
• Encrypted package of your code
• Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
• (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure
1. Process service model
• Determine resource requirements
• Create role images
2. Allocate resources
3. Prepare nodes
• Place role images on nodes
• Configure settings
• Start roles
4. Configure load balancers
5. Maintain service health
• If a role fails, restart the role based on policy
• If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable storage, at massive scale
• Blob: massive files, e.g., videos, logs
• Drive: use standard file-system APIs
• Tables: non-relational, but with few scale limits; use SQL Azure for relational data
• Queues: facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
• PutBlob
• Inserts a new blob, overwrites the existing blob
• GetBlob
• Get whole blob or a specific range
• DeleteBlob
• CopyBlob
• SnapshotBlob
• LeaseBlob
• Each blob has an address
• http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
• http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private
• Default; will require the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
Block blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob
Page blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'
• Each block has an id
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
[Diagram: Big.mpg uploaded as blocks (in the order 1, 6, 8, 3, 5, 4, 7, 2) and committed as Big.mpg]
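The block semantics can be modeled in a few lines. This is a toy in-memory model (ours, not the storage client library or REST API) showing that commit order, not upload order, defines the blob:

```python
import base64

class BlockBlobModel:
    """Toy in-memory model of block-blob semantics: upload blocks in any
    order, then commit an ordered block list to form the readable blob."""
    def __init__(self):
        self.uncommitted = {}      # block id -> data
        self.committed = b""
    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data
    def put_block_list(self, block_ids):
        # The committed blob is the concatenation in block-list order.
        self.committed = b"".join(self.uncommitted[b] for b in block_ids)

def block_id(n):
    # Block ids are opaque base64 strings of equal length.
    return base64.b64encode(b"%06d" % n).decode()

blob = BlockBlobModel()
for n, chunk in [(2, b"-part3"), (0, b"part1"), (1, b"-part2")]:   # out of order
    blob.put_block(block_id(n), chunk)
blob.put_block_list([block_id(n) for n in (0, 1, 2)])
```

Against the real service the same flow is Put Block per chunk followed by Put Block List to commit.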
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
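Since leasing is REST-only, a client has to build the request itself. A hedged sketch of the URL and headers involved (the signed Authorization header and version header are omitted; the helper name is ours):

```python
def lease_request(account, container, blob, action, lease_id=None):
    """Build the URL and headers for a REST lease call
    (action is one of: acquire, renew, release, break)."""
    url = ("https://%s.blob.core.windows.net/%s/%s?comp=lease"
           % (account, container, blob))
    headers = {"x-ms-lease-action": action}
    if lease_id is not None:
        headers["x-ms-lease-id"] = lease_id   # required for renew/release/break
    return "PUT", url, headers

method, url, headers = lease_request("movieconversion", "originals",
                                     "barga.mpg", "acquire")
```

The acquire response returns the lease id that must accompany every subsequent write while the lease is held.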
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
• Example: mount a Page Blob as X:
• http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• Drive made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table name: Movies (entities: Star Wars, Star Trek, Fan Boys)
• Table name: Customers (entities: Brian H. Prince, Jason Argonaut, Bill Gates)
Hierarchy: Account → Table → Entity
Tables store entities; entity schema can vary in the same table
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
• Billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available & durable
• Data is replicated several times
• Familiar and easy-to-use API
• WCF Data Services and OData
• .NET classes and LINQ
• REST, with any platform or language
Is not relational
Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
The partition key is the unit of scale
• A partition can be served by a single server
• The system load-balances partitions based on traffic pattern
• Controls entity locality
System load balancing
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
Server busy
• Use exponential backoff on "Server Busy"
• The system load-balances to meet your traffic needs
• Or single-partition limits have been reached
Partition Keys In Each Abstraction

Entities: TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order - 1 | | | $35.12
2 | Customer | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order - 3 | | | $10.00

Blobs: Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarborbighouse.jpg
image | foxboroughgillette.jpg
video | annarborbighouse.jpg

Messages: Queue Name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas in sync
[Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3]
Scalability Targets
Storage account
• Capacity: up to 100 TBs
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second
Single queue/table partition
• Up to 500 transactions per second
Single blob partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
Partitions and Partition Ranges

Server A: Table = Movies [Min - Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

After the system splits the partition range across servers:

Server A: Table = Movies [Min - Comedy)
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B: Table = Movies [Comedy - Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability
Query efficiency & speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A query may return a continuation token instead of the full result set:
• Maximum of 1,000 rows in a response
• At the end of a partition-range boundary
• Maximum of 5 seconds to execute the query
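Handling continuation tokens means looping until the service stops returning one, never assuming a single round trip. A sketch with a stand-in paged query (the fake 2-row page cap plays the role of the 1,000-row limit; the real tokens arrive as response headers):

```python
def query_all(query_page):
    """Drain a paged query: `query_page(token)` returns (rows, next_token);
    keep going until no continuation token comes back."""
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:
            return rows

# Stand-in service: at most 2 rows per response.
data = ["row%d" % i for i in range(5)]
def fake_page(token):
    start = token or 0
    nxt = start + 2 if start + 2 < len(data) else None
    return data[start:start + 2], nxt

all_rows = query_all(fake_page)
```

Note the token can appear even for tiny result sets (e.g., at a partition-range boundary), so the loop is mandatory, not an optimization.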
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select a PartitionKey and RowKey that help scale:
• Avoid "append only" patterns: distribute by using a hash, etc., as a prefix
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "server busy" means partitions are being load-balanced to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• This can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work-ticket pattern
• Why not simply use a table?
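The 8 KB limit is one reason for the work-ticket pattern: the queue message carries only a pointer, and the worker fetches the real payload from blob storage. An in-memory sketch (the queue and blob store here are stand-ins, not Azure APIs):

```python
blob_store = {}   # stand-in for Azure Blob storage
queue = []        # stand-in for an Azure queue

def submit_job(blob_name, payload):
    """Producer: park the large payload in blob storage, enqueue a small ticket."""
    blob_store[blob_name] = payload
    ticket = blob_name.encode()
    assert len(ticket) <= 8 * 1024      # the ticket fits the 8 KB message limit
    queue.append(ticket)

def process_one():
    """Worker: dequeue a ticket and fetch the real payload it points to."""
    ticket = queue.pop(0)
    return blob_store[ticket.decode()]

submit_job("videos/big.mpg", b"x" * 100_000)    # 100 KB payload, tiny ticket
result = process_one()
```

The queue stays small and fast while arbitrarily large work items ride along in blob storage.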
Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout)
RemoveMessage
Msg 2Msg 1
Worker Role
Msg 2
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
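The Get/Delete lifecycle shown in the REST exchange above — a retrieved message is hidden for a visibility timeout and only the Delete (with the pop receipt) removes it — can be modeled with a toy in-memory queue. This is a sketch of the semantics, not the Azure client library:

```python
import uuid

class ToyQueue:
    """Toy model of the two-phase message lifecycle: get() hides a
    message for `timeout` seconds; only delete() (with the pop receipt)
    removes it, so a crashed consumer's message reappears."""
    def __init__(self):
        self.msgs = {}                    # id -> (body, visible_at)

    def put(self, body):
        mid = str(uuid.uuid4())
        self.msgs[mid] = (body, 0.0)
        return mid

    def get(self, timeout, now):
        for mid, (body, visible_at) in self.msgs.items():
            if visible_at <= now:
                self.msgs[mid] = (body, now + timeout)
                return mid, body          # mid doubles as the pop receipt here
        return None

    def delete(self, mid):
        self.msgs.pop(mid, None)
```

Usage: after `get(30, now=0)` the message is invisible at `now=1` but visible again at `now=31` — this is what gives at-least-once delivery.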
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll
increases the interval by 2x;
a successful poll resets the interval back to 1.
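The back-off rule above fits in a few lines; the base interval and cap are illustrative assumptions:

```python
def next_interval(current, got_message, base=1.0, cap=60.0):
    """Truncated exponential back-off: double the polling interval on an
    empty poll (up to `cap`); reset to `base` on a successful one."""
    if got_message:
        return base
    return min(current * 2, cap)
```

The truncation (the cap) is what keeps a long-idle worker from backing off so far that it reacts sluggishly when work finally arrives.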
60
21
11
C1
C2
Removing Poison Messages
11
21
340
Producers Consumers
P2
P1
30
2 GetMessage(Q 30 s) msg 2
1 GetMessage(Q 30 s) msg 1
11
21
10
20
61
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
11
21
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
7. GetMessage(Q, 30 s) → msg 1
1. GetMessage(Q, 30 s) → msg 1
5. C1 crashed
11
21
6. msg 1 visible 30 s after Dequeue
12
11
12
62
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
12
2. Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
7. Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
1. Dequeue(Q, 30 sec) → msg 1
5. C1 crashed
10. C1 restarted
11. Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
2
6. msg 1 visible 30 s after Dequeue
9. msg 1 visible 30 s after Dequeue
30
13
12
13
Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
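The poison-message rule from the recap — enforce a threshold on a message's dequeue count — can be sketched as a small dispatch function. The threshold, the `process` callback, and the `dead_letter` sink are all illustrative assumptions:

```python
MAX_DEQUEUE = 3  # assumed threshold; tune to your workload

def handle(msg, dequeue_count, process, dead_letter):
    """If a message keeps crashing its consumer, its dequeue count keeps
    rising; past the threshold, divert it instead of retrying forever."""
    if dequeue_count > MAX_DEQUEUE:
        dead_letter(msg)              # park it for offline inspection
        return "poisoned"
    try:
        process(msg)
        return "done"                 # caller then deletes the message
    except Exception:
        return "retry"                # leave it invisible; it reappears after the timeout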
Windows Azure Storage Takeaways
Data abstractions to build your applications:
Blobs – Files and large objects
Drives – NTFS APIs for migrating applications
Tables – Massively scalable structured storage
Queues – Reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using much CPU
• Balance between using up CPU and having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
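The slide's data-parallelism advice (the .NET 4 Task Parallel Library) can be sketched in any language with a worker pool; here is an analogous stdlib sketch in Python, with a made-up `checksum` workload standing in for real per-item work:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def checksum(chunk):
    """Toy per-item workload (an assumption for illustration)."""
    return sum(chunk) % 65521

def parallel_checksums(chunks, workers=None):
    """Data parallelism: fan one function out over a pool of workers."""
    workers = workers or (os.cpu_count() or 4)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(checksum, chunks))
```

As the slide warns, sizing the pool near the core count matters: far more active workers than cores just adds scheduling overhead.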
Finding Good Code Neighbors
• Typically code falls into one or more of these categories:
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Memory Intensive
CPU Intensive
Network I/O Intensive
Storage I/O Intensive
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the costs of having idling VMs
Performance Cost
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often leads to savings in other places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
Gzip / Minify JavaScript
Minify CSS
Minify Images
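The gzip point is easy to demonstrate: text-like output (HTML, JSON, logs) is highly repetitive, so a little CPU buys a large cut in stored and transferred bytes. A minimal sketch with an assumed repetitive payload:

```python
import gzip

# Assumed payload: repetitive markup, the best case for gzip.
html = b"<div class='row'>hello cloud</div>\n" * 500

packed = gzip.compress(html)
ratio = len(packed) / len(html)

# Compression must round-trip and shrink the payload substantially.
assert gzip.decompress(packed) == html
assert len(packed) < len(html)
```

Real pages compress less dramatically than this synthetic one, but 60-80% savings on HTML/CSS/JS is common, which hits both bandwidth and storage line items.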
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
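The query-segmentation approach described here is a plain split/join: partition the input sequences, run each partition independently, concatenate the results. A minimal sketch (the partition size is an assumption to be tuned, as the task-granularity slide notes):

```python
def split(sequences, partition_size):
    """Query segmentation: each partition can be BLASTed independently."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge(partial_results):
    """Join step: concatenate per-partition hit lists in input order."""
    return [hit for part in partial_results for hit in part]
```

Database segmentation (the mpiBLAST route) is the harder variant precisely because its join step is not a simple concatenation — hits from different database shards must be re-ranked together.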
AzureBLAST
bull Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow: a simple Split/Join pattern
Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition → load imbalance
• Small partition → unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
Best practice: use test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size / instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
(BLAST databases, temporary data, etc.)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID
The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)
BLASTed ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
AzureBLAST significantly saved computing timehellip
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• The database is also the input query
• The protein database is large (42 GB in size)
• 9,865,668 sequences to be queried in total
• Theoretically 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
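The "something is wrong" detection above is mechanical: pair each "Executing" line with its "done" line and flag the unpaired starts. A sketch against a sample of the log format (the excerpt below is abbreviated from the slide):

```python
import re

LOG = """3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
"""

def unfinished_tasks(log):
    """Pair each 'Executing' line with its 'done' line; anything left
    unpaired points at a failed or interrupted task."""
    started = set(re.findall(r"Executing the task (\d+)", log))
    done = set(re.findall(r"Execution of task (\d+) is done", log))
    return started - done
```

Run over the full North Europe log, exactly this kind of pairing is what exposed the update-domain restarts and blob-write failures on the next slides.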
Surviving System Upgrades
North Europe Data Center: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in groups — this is an update domain
~30 mins
~6 nodes in one group
Surviving Storage Failures
West Europe Datacenter: 30,976 tasks were completed, and the job was killed
35 nodes experienced blob-writing failures at the same time
A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" — Irish proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = Rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = Latent heat of vaporization (J g⁻¹)
Rn = Net radiation (W m⁻²)
cp = Specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = Dry air density (kg m⁻³)
δq = Vapor pressure deficit (Pa)
ga = Conductivity of air (inverse of ra) (m s⁻¹)
gs = Conductivity of plant stoma air (inverse of rs) (m s⁻¹)
γ = Psychrometric constant (γ ≈ 66 Pa K⁻¹)
Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs, big data reduction
• Some of the inputs are not so simple
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
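The Penman-Monteith form above evaluates directly once the inputs are in hand; the sketch below just transcribes the equation, and the numeric inputs in the test are made-up, plausibly scaled assumptions rather than values from the slides:

```python
def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0, lam_v=2.45e6):
    """ET = (Delta*Rn + rho_a*cp*(dq)*ga) / ((Delta + gamma*(1 + ga/gs)) * lam_v)

    gamma defaults to the slide's ~66 Pa/K; lam_v to a typical latent
    heat of vaporization (an assumption, expressed here in J/kg)."""
    return (delta * Rn + rho_a * cp * dq * ga) / \
           ((delta + gamma * (1.0 + ga / gs)) * lam_v)
```

The hard part, as the slide says, is not the formula but estimating ga and gs across a whole catchment — which is what the imagery/sensor pipeline on the following slides exists to feed.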
ET Synthesizes Imagery, Sensors, Models, and Field Data
NASA MODIS imagery source archives: 5 TB (600K files)
FLUXNET curated sensor dataset (30 GB, 960 files)
FLUXNET curated field dataset: 2 KB (1 file)
NCEP/NCAR: ~100 MB (4K files)
Vegetative clumping: ~5 MB (1 file)
Climate classification: ~1 MB (1 file)
20 US year = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (12)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
<PipelineStage> Request
…
<PipelineStage>JobStatus Persist
<PipelineStage> Job Queue
MODISAzure Service (Web Role)
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
…
Dispatch <PipelineStage> Task Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
GenericWorker (Worker Role)
hellip
hellip
Dispatch <PipelineStage> Task Queue
hellip
<Input> Data Storage
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatus Persist
Parse & Persist ReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (i.e., a single tile)
Query this table to get geo-metadata (e.g., boundaries) for each swath tile
Query this table to get the list of satellite scan times that cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers
$50 upload, $450 storage
400 GB, 45K files, 3500 hours, 20–100 workers
5–7 GB, 55K files, 1800 hours, 20–100 workers
<10 GB, ~1K files, 1800 hours, 20–100 workers
$420 CPU, $60 download
$216 CPU, $1 download, $6 storage
$216 CPU, $2 download, $9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns," but tightly coupled low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
12
Application Model Comparison
Machines RunningIIS ASPNET
Machines RunningWindows Services
Machines RunningSQL Server
Ad Hoc Application Model
Web Role Instances
Worker Role Instances
Azure Storage (Blob, Queue, Table)
SQL Azure
Windows Azure Application Model
Key Components
Fabric Controller
• Manages hardware and virtual machines for the service
Compute
• Web Roles
  • Web application front end
• Worker Roles
  • Utility compute
• VM Roles
  • Custom compute role; you own and customize the VM
Storage
• Blobs
  • Binary objects
• Tables
  • Entity storage
• Queues
  • Role coordination
• SQL Azure
  • SQL in the cloud
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
  • Role types
  • Role VM sizes
  • External and internal endpoints
  • Local storage
• The configuration settings configure a service:
  • Instance count
  • Storage keys
  • Application-specific settings
Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers, switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's the perfect IT employee:
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints
Suggested Application Model: Using queues for reliable messaging
Scalable, Fault-Tolerant Applications
Queues are the application glue
• Decouple parts of the application; easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role
  • You own the box
• How it works:
  • Download "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using:
    • Base OS
    • Differences VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find the hardware a home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action — perhaps even relocate your app
  • At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
  • (which is better than six months)
Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
  1. Determine resource requirements
  2. Create role images
2. Allocate resources
3. Prepare nodes
  1. Place role images on nodes
  2. Configure settings
  3. Start roles
4. Configure load balancers
5. Maintain service health
  1. If a role fails, restart the role based on policy
  2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage, At Massive Scale
Blob – Massive files, e.g. videos, logs
Drive – Use standard file system APIs
Tables – Non-relational, but with few scale limits; use SQL Azure for relational data
Queues – Facilitate loosely-coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private – the default; will require the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
• You can upload a file in 'blocks'
  • Each block has an id
  • Then commit those blocks, in any order, into a blob
  • The final blob is limited to 1 TB and up to 50,000 blocks
  • You can modify a blob by inserting, updating, and removing blocks
  • Blocks live for a week before being GC'd if not committed to a blob
  • Optimized for streaming
Blocks
Bigmpg1 6 8 3 5 4 7 2
Bigmpg
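The two-phase block-blob protocol above — upload blocks under IDs in any order, then commit an ordered block list that defines the blob's content — can be modeled with a toy class. This is a sketch of the semantics, not the Azure storage client:

```python
class ToyBlockBlob:
    """Toy model of block-blob semantics: put_block stages data under an
    id; put_block_list commits an ordered list of ids as the blob."""
    def __init__(self):
        self.uncommitted = {}             # block id -> bytes
        self.content = b""

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data # staged; not yet part of the blob

    def put_block_list(self, ordered_ids):
        # The commit defines the blob as the listed blocks, in list order,
        # regardless of the order they were uploaded in.
        self.content = b"".join(self.uncommitted[i] for i in ordered_ids)
        self.uncommitted.clear()
```

This is why the "Big.mpg" diagram can show blocks arriving out of order: only the committed block list, not upload order, determines the final bytes.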
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a one-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letters and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table "Movies" – entities: Star Wars, Star Trek, Fan Boys
• Table "Customers" – entities: Brian H. Prince, Jason Argonaut, Bill Gates
Hierarchy: Account → Table → Entity
Tables store entities. Entity schema can vary within the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
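To make the required-property rule concrete, here is a toy sketch of building an entity as a plain dictionary (`make_entity` is a hypothetical helper, not the real storage client; in practice the service assigns Timestamp itself):

```python
from datetime import datetime, timezone

def make_entity(partition_key, row_key, **properties):
    """Build a minimal table entity as a dict.

    PartitionKey + RowKey together form the unique key; Timestamp is
    maintained by the service (simulated locally here). Any other
    properties can vary from entity to entity in the same table.
    """
    entity = {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        "Timestamp": datetime.now(timezone.utc),
    }
    entity.update(properties)  # schema can vary per entity
    return entity

movie = make_entity("Action", "Fast & Furious", ReleaseDate=2009)
```

Note that two entities in the same table can carry entirely different property sets, as long as the three required properties are present.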
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
Partition key is the unit of scale
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality
System load balances
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
"Server Busy"
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits have been reached
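The exponential-backoff advice above can be sketched as a small schedule generator. This is an illustrative pattern, not a real client API; the base, cap, and retry count are assumptions you would tune:

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0):
    """Truncated exponential backoff schedule for 'Server Busy' retries.

    Delays grow as base * 2^attempt, capped at `cap`, with random
    jitter so that many clients do not retry in lockstep.
    """
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        delays.append(delay * random.uniform(0.5, 1.0))  # add jitter
    return delays
```

A caller would sleep for each delay in turn between retries, and give up (or alert) once the schedule is exhausted.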
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name         | CreditCardNumber    | OrderTotal
1                         | Customer         | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1        |              |                     | $35.12
2                         | Customer         | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3        |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync
(Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3)
Scalability Targets
Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
Partitions and Partition Ranges
Initially one server holds the whole table:
Server A: Table = Movies [Min – Max]

PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
Action                  | Fast & Furious            | …         | 2009
Action                  | The Bourne Ultimatum      | …         | 2007
…                       | …                         | …         | …
Animation               | Open Season 2             | …         | 2009
Animation               | The Ant Bully             | …         | 2006
…                       | …                         | …         | …
Comedy                  | Office Space              | …         | 1999
…                       | …                         | …         | …
SciFi                   | X-Men Origins: Wolverine  | …         | 2009
…                       | …                         | …         | …
War                     | Defiance                  | …         | 2008

Under load, the system splits the partition range across servers:
Server A: Table = Movies [Min – Comedy)
• Action … Animation rows
Server B: Table = Movies [Comedy – Max]
• Comedy … War rows
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
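Because any of those conditions can end a response early, query code must always loop until no token comes back. A sketch of that drain loop against a simulated paged source (the `fetch_page` callback and its token shape are illustrative, not the real client API):

```python
def query_all(fetch_page):
    """Drain a query that may return continuation tokens.

    `fetch_page(token)` stands in for one table query round trip; it
    returns (rows, next_token), with next_token=None when finished.
    Responses are capped (e.g. 1000 rows), so one call is never enough.
    """
    rows, token = [], None
    while True:
        page, token = fetch_page(token)
        rows.extend(page)
        if token is None:
            return rows

# Simulated paged source: 2500 rows served 1000 at a time.
data = list(range(2500))

def fetch_page(token):
    start = token or 0
    page = data[start:start + 1000]
    nxt = start + 1000 if start + 1000 < len(data) else None
    return page, nxt

result = query_all(fetch_page)  # drains all three pages
```

The common bug this guards against is treating the first page as the full result set, which silently truncates queries once a table grows past the response cap.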
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as a prefix
Avoid "append only" patterns
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• Server busy
• Load balance partitions to meet traffic needs
• Load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
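The work ticket pattern mentioned above keeps large payloads out of the size-limited queue: the data goes to blob storage and only a small reference (the "ticket") travels through the queue. A toy sketch with in-memory stand-ins for the blob store and queue (names and structure are illustrative assumptions):

```python
import uuid

# Simulated stores; in Azure these would be blob storage and a queue.
blob_store = {}
queue = []

def enqueue_work(payload: bytes, limit=8 * 1024):
    """Work ticket pattern: payloads over the queue message limit are
    written to blob storage, and only a reference is enqueued."""
    if len(payload) <= limit:
        queue.append({"inline": payload})
    else:
        blob_name = str(uuid.uuid4())     # ticket: blob name as reference
        blob_store[blob_name] = payload
        queue.append({"blob_ref": blob_name})

enqueue_work(b"small job")
enqueue_work(b"x" * 100_000)  # too big for a queue message
```

A consumer dereferences `blob_ref` to fetch the payload, and deletes the blob once the message is processed (otherwise orphaned blobs accumulate and must be garbage collected).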
Queue Terminology
Message Lifecycle
A Web Role calls PutMessage to add messages (Msg 1, Msg 2, Msg 3, Msg 4) to the Queue. A Worker Role calls GetMessage (with a visibility timeout) to retrieve a message, processes it, and then calls RemoveMessage to delete it.
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
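The lifecycle behind those two REST calls — a message is hidden for a visibility timeout on retrieval, and deleting it requires the pop receipt — can be modeled with a toy in-memory queue. This is a teaching sketch, not the real service:

```python
import time
import uuid

class ToyQueue:
    """Toy model of the queue message lifecycle: get() hides a message
    for a visibility timeout and hands back a pop receipt; delete()
    requires that receipt. If the consumer crashes and never deletes,
    the message simply becomes visible again after the timeout."""
    def __init__(self):
        self.messages = []  # each item: [text, invisible_until, pop_receipt]

    def put(self, text):
        self.messages.append([text, 0.0, None])

    def get(self, timeout=30.0):
        now = time.monotonic()
        for m in self.messages:
            if m[1] <= now:               # currently visible
                m[1] = now + timeout      # hide it for the timeout
                m[2] = str(uuid.uuid4())  # fresh pop receipt
                return m[0], m[2]
        return None

    def delete(self, text, receipt):
        self.messages = [m for m in self.messages
                         if not (m[0] == text and m[2] == receipt)]

q = ToyQueue()
q.put("msg 1")
q.put("msg 2")
text, receipt = q.get()   # msg 1 becomes invisible
q.delete(text, receipt)   # processed: removed for good
```

This is what gives the at-least-once guarantee: a message only disappears permanently when a consumer explicitly deletes it with a valid receipt.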
Truncated Exponential Back Off Polling
Consider a backoff polling approach: each empty poll increases the interval by 2x, up to a maximum; a successful poll sets the interval back to 1.
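The polling rule above fits in a few lines; the floor and ceiling values here are illustrative assumptions:

```python
def next_poll_interval(current, got_message, floor=1.0, ceiling=60.0):
    """Truncated exponential backoff for queue polling.

    Each empty poll doubles the interval up to a ceiling; a successful
    poll resets it to the floor so a busy queue is drained quickly.
    """
    if got_message:
        return floor
    return min(ceiling, current * 2)

interval = 1.0
interval = next_poll_interval(interval, got_message=False)  # doubles to 2.0
interval = next_poll_interval(interval, got_message=False)  # doubles to 4.0
interval = next_poll_interval(interval, got_message=True)   # resets to 1.0
```

The ceiling matters because every poll is a billed transaction: an idle queue polled at the ceiling costs a bounded, predictable amount.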
Removing Poison Messages
Producers P1 and P2 put messages; consumers C1 and C2 process them from queue Q:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. DeleteMessage(Q, msg 1)
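The sequence above shows why a dequeue-count threshold is needed: a message that crashes every consumer would otherwise cycle forever. A sketch of the guard with an in-memory queue (the threshold and dead-letter list are illustrative; the real service exposes DequeueCount on each message):

```python
def process_queue(messages, handler, max_dequeue=3):
    """Poison-message guard: track each message's dequeue count and
    divert it to a dead-letter list once it exceeds the threshold,
    instead of retrying it forever."""
    dead_letters, results = [], []
    pending = [{"body": m, "dequeue_count": 0} for m in messages]
    while pending:
        msg = pending.pop(0)
        msg["dequeue_count"] += 1
        if msg["dequeue_count"] > max_dequeue:
            dead_letters.append(msg["body"])   # poison: stop retrying
            continue
        try:
            results.append(handler(msg["body"]))
        except Exception:
            pending.append(msg)                # becomes visible again

    return results, dead_letters

def handler(m):
    """Stand-in worker that always fails on one message."""
    if m == "bad":
        raise ValueError("cannot process")
    return m.upper()

ok, poison = process_queue(["good", "bad"], handler)
# "good" is processed once; "bad" fails 3 times and is dead-lettered
```

Dead-lettered messages should be logged or parked somewhere a human can inspect them, since they usually indicate a bug or malformed input rather than transient failure.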
Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use DequeueCount to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use message count to scale: dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using much CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
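The data-parallelism idea above — apply one operation across a collection, sized to the hardware you actually have — looks like this in Python (in .NET 4 the Task Parallel Library plays this role; this is a stand-in sketch, and `fetch` is a hypothetical task):

```python
from concurrent.futures import ThreadPoolExecutor
import os

def fetch(item):
    """Stand-in for an I/O-bound task (e.g. a storage call)."""
    return item * item

# Data parallelism: apply the same operation across a collection,
# with the pool sized to the cores available on the instance.
with ThreadPoolExecutor(max_workers=os.cpu_count() or 4) as pool:
    results = list(pool.map(fetch, range(10)))
```

For CPU-bound work a process pool (or, in .NET, TPL tasks) avoids contention on a single core; threads suit I/O-bound work where the CPU would otherwise idle.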
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of having idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile.
Sending fewer things over the wire often means getting fewer things from storage.
Saving bandwidth costs often leads to savings in other places: sending fewer things means your VM has time to do other tasks.
All of these tips have the side benefit of improving your web app's performance and user experience.
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
(Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
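As a quick illustration of the gzip payoff on typical markup (the sample HTML is made up; real pages with repetitive tags compress similarly well):

```python
import gzip

# Repetitive markup, as real HTML tends to be.
html = b"<html><body>" + b"<p>hello cloud</p>" * 500 + b"</body></html>"

compressed = gzip.compress(html)

# Bandwidth is billed on what actually crosses the wire, so the
# compressed size is what matters; the round trip is lossless.
print(len(html), "->", len(compressed))
```

The compute spent compressing is usually far cheaper than the bandwidth saved, which is the trade-off point 2 above is making.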
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
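The query-segmentation approach — split the input, run partitions independently, merge the results — is a plain split/join and can be sketched in a few functions (the `blast_partition` body is a stand-in; actually running NCBI-BLAST is out of scope here):

```python
def split(sequences, partition_size):
    """Query segmentation: chop the input sequence list into
    partitions that independent workers can process in parallel."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition):
    """Stand-in for running NCBI-BLAST on one partition; here we
    just return a (sequence, hit) pair per input sequence."""
    return [(seq, f"hit-for-{seq}") for seq in partition]

def merge(partial_results):
    """Join step: concatenate per-partition results in order."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged

seqs = [f"seq{i}" for i in range(10)]
partitions = split(seqs, 3)                       # 4 partitions
results = merge(blast_partition(p) for p in partitions)
```

Because partitions share nothing, each `blast_partition` call maps naturally onto one queue task consumed by one worker role, which is exactly the structure AzureBLAST uses below.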
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern: splitting task → BLAST tasks (in parallel) → merging task
Leverage the multi-core capacity of one instance
• Argument "–a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition → load imbalance
• Small partition → unnecessary overheads
  • NCBI-BLAST overhead
  • Data transfer overhead
• Best practice: use test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST architecture
• Web Role: web portal and web service for job registration
• Job Management Role: job scheduler and scaling engine, with a job registry in Azure Tables
• Worker roles pull BLAST tasks from a global dispatch queue
• Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.
• A database-updating role keeps the NCBI databases current
Task flow: splitting task → BLAST tasks (in parallel) → merging task
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state
Demonstration
R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment
Discovering homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (42 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., a task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
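Detecting those anomalies can be automated by pairing "Executing" records with "done" records. A small sketch (the regex assumes log lines shaped like the samples above; sample data here is abbreviated):

```python
import re

LOG_RE = re.compile(r"Execution of task (\d+) is done, it took ([\d.]+) ?mins")

def completed_tasks(lines):
    """Pull (task id -> minutes) out of worker logs.

    Any 'Executing the task' line with no matching 'done' record
    indicates a lost or failed task worth investigating.
    """
    done = {}
    for line in lines:
        m = LOG_RE.search(line)
        if m:
            done[m.group(1)] = float(m.group(2))
    return done

log = [
    "3/31/2010 6:14 RD00155D3611B0 Executing the task 251523",
    "3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins",
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",  # never finished
]
print(completed_tasks(log))
```

Cross-referencing the set of started task IDs against the completed ones is how the update-domain and fault-domain incidents on the next slides were spotted.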
Surviving System Upgrades
North Europe data center: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in a group — this is an update domain
• ~30 mins
• ~6 nodes in one group
Surviving Storage Failures
West Europe datacenter: 30,976 tasks completed, and the job was killed
35 nodes experienced blob writing failures at the same time
A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" — Irish proverb
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.
Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)
Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
Pipeline flow: Source Imagery Download Sites → Data Collection Stage (Download Queue) → Reprojection Stage (Reprojection Queue) → Derivation Reduction Stage (Reduction 1 Queue) → Analysis Reduction Stage (Reduction 2 Queue) → Scientific Results Download. Scientists submit requests through the AzureMODIS Service Web Role Portal (Request Queue), backed by Source Metadata.
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables
Flow: <PipelineStage> Request → MODISAzure Service (Web Role) persists <PipelineStage>JobStatus and queues to the <PipelineStage> Job Queue → Service Monitor (Worker Role) parses & persists <PipelineStage>TaskStatus → dispatches to the <PipelineStage> Task Queue
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role (GenericWorker)
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
Flow: Service Monitor parses & persists <PipelineStage>TaskStatus → dispatches to the <PipelineStage> Task Queue → GenericWorker (Worker Role) processes tasks against <Input>Data Storage
Example Pipeline Stage: Reprojection Service
• A reprojection request arrives at the Job Queue; each entity specifies a single reprojection job request
• The Service Monitor (Worker Role) persists ReprojectionJobStatus, then parses & persists ReprojectionTaskStatus; each task entity specifies a single reprojection task (i.e., a single tile)
• Tasks are dispatched to the Task Queue, where GenericWorker (Worker Role) instances pick them up
• SwathGranuleMeta – query this table to get geo-metadata (e.g., boundaries) for each swath tile
• ScanTimeList – query this table to get the list of satellite scan times that cover a target tile
• Workers read from Swath Source Data Storage and write to Reprojection Data Storage
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates
Per stage (scientists submit via the AzureMODIS Service Web Role Portal / Request Queue):
• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Windows Azure Application Model
Web Role instances | Worker Role instances | Azure Storage (Blob, Queue, Table) | SQL Azure
Key Components
Fabric Controller
• Manages hardware and virtual machines for the service
Compute
• Web Roles – web application front end
• Worker Roles – utility compute
• VM Roles – custom compute role; you own and customize the VM
Storage
• Blobs – binary objects
• Tables – entity storage
• Queues – role coordination
• SQL Azure – SQL in the cloud
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
  • The configuration definition describes the shape of a service:
    • Role types
    • Role VM sizes
    • External and internal endpoints
    • Local storage
  • The configuration settings configure a service:
    • Instance count
    • Storage keys
    • Application-specific settings
Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers, switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee:
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute on Windows Server 2008
• Background processing
• Each role can define an amount of local storage: protected space on the local drive, considered volatile storage
• May communicate with outside services:
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints
Suggested Application Model: Using Queues for Reliable Messaging
Scalable, Fault-Tolerant Applications
Queues are the application glue:
• Decouple parts of the application, so each is easier to scale independently
• Allocate resources differently, e.g. different priority queues and back-end servers
• Mask faults in worker roles (reliable messaging)
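The decoupling the bullets above describe can be sketched with Python's standard-library queue module; the in-process queue stands in for an Azure queue, and the role names are illustrative only.

```python
import queue
import threading

# The queue is the only thing the two roles share -- they never call each other.
work_queue = queue.Queue()

def web_role(orders):
    """Front end: accepts requests and enqueues work tickets."""
    for order in orders:
        work_queue.put(order)

def worker_role(results):
    """Back end: drains the queue at its own pace."""
    while True:
        try:
            order = work_queue.get(timeout=0.1)
        except queue.Empty:
            break  # queue drained; a real worker would keep polling
        results.append(order.upper())  # stand-in for real processing
        work_queue.task_done()

results = []
web_role(["order-1", "order-2", "order-3"])
t = threading.Thread(target=worker_role, args=(results,))
t.start()
t.join()
```

Because neither side holds a reference to the other, you can scale the worker side (more threads, more VMs) without touching the front end.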
Key Components – Compute: VM Roles
• Customized role: you own the box
• How it works:
  • Download the "Guest OS" image to your own Server 2008 Hyper-V host
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using:
    • The base OS
    • The differences VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action – perhaps even relocate your app
• At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API – easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage at Massive Scale
Blobs – massive files, e.g. videos, logs
Drives – use standard file-system APIs
Tables – non-relational, but with few scale limits; use SQL Azure for relational data
Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • e.g. http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private – the default; requires the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
• Block blob:
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a block ID
  • Size limit: 200 GB per blob
• Page blob:
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks in any order into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Uncommitted blocks live for a week before being garbage-collected
• Optimized for streaming
Blocks
(Diagram: Big.mpg is uploaded as blocks, e.g. in the order 1 6 8 3 5 4 7 2, and then committed into the final Big.mpg blob.)
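A minimal in-memory sketch of the block-blob protocol described above: upload blocks in any order, then commit a block list that fixes their final order. The two functions stand in for the REST operations Put Block and Put Block List; the dictionaries are stand-ins for the storage service.

```python
# In-memory simulation of block-blob semantics (not the real Azure API).
uncommitted = {}   # block_id -> bytes; GC'd after a week if never committed
blobs = {}         # blob_name -> bytes

def put_block(block_id, data):
    """Stage one block; upload order does not matter."""
    uncommitted[block_id] = data

def put_block_list(blob_name, block_ids):
    """Commit: the blob is assembled from blocks in the order given here,
    regardless of the order in which they were uploaded."""
    blobs[blob_name] = b"".join(uncommitted[i] for i in block_ids)

# Upload out of order (e.g. from parallel uploaders)...
put_block("b2", b"world")
put_block("b1", b"hello ")
# ...then commit in the intended order.
put_block_list("big.mpg", ["b1", "b2"])
```

This is why block blobs suit streaming uploads: many workers can push blocks concurrently, and a single cheap commit stitches the file together.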
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB
BLOB Leases
• Creates a one-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a page blob
  • Example: mount a page blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the page blob
• The drive is made durable through standard page blob replication
• The drive persists as a page blob even when not mounted
Windows Azure Drive API
• Create Drive – creates a page blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted page blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and page blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (page blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (page blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
  Table Name: Movies
    Star Wars, Star Trek, Fan Boys
  Table Name: Customers
    Brian H. Prince, Jason Argonaut, Bill Gates
Hierarchy: Account → Table → Entity
Tables store entities; entity schema can vary in the same table
Windows Azure Tables
• Provides structured storage
• Massively scalable tables:
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable:
  • Data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Tables are not relational. You cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• Use server-side aggregates – no server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
• Partitioning is different for each data type (blobs, entities, queues)
Every data object has a partition key:
• A partition can be served by a single server
• The system load-balances partitions based on traffic pattern
• The partition key controls entity locality
The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
On "Server Busy":
• Use exponential backoff
• The system load-balances to meet your traffic needs
• "Server Busy" may also mean single-partition limits have been reached
Partition Keys In Each Abstraction
• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

• Messages – Queue name: all messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync
(Diagram: Server 1, Server 2, and Server 3 each hold a replica of partitions P1, P2, …, Pn)
Scalability Targets
Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single queue/table partition:
• Up to 500 transactions per second
Single blob partition:
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
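The exponential backoff the slide calls for can be sketched in a few lines. The `ServerBusy` exception and `flaky_put` operation are stand-ins for an HTTP 503 from the storage service and a real storage call; delays and retry counts are illustrative.

```python
import random
import time

class ServerBusy(Exception):
    """Stand-in for an HTTP 503 'Server Busy' response."""

def with_backoff(op, max_tries=6, base=0.1, cap=5.0):
    """Retry op() with truncated exponential backoff plus jitter."""
    for attempt in range(max_tries):
        try:
            return op()
        except ServerBusy:
            if attempt == max_tries - 1:
                raise  # out of retries -- surface the 503
            delay = min(cap, base * (2 ** attempt))  # 0.1, 0.2, 0.4, ...
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids lockstep retries

calls = {"n": 0}

def flaky_put():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServerBusy()  # busy twice, then succeeds
    return "ok"

result = with_backoff(flaky_put)
```

The jitter matters at scale: without it, a fleet of workers that all saw the same 503 would retry in lockstep and hit the partition again at the same instant.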
PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Partitions and Partition Ranges
A table starts on one server, which serves the full key range:
  Server A: Table = Movies [Min – Max]
As load grows, the system splits the range across servers:
  Server A: Table = Movies [Min – Comedy)
  Server B: Table = Movies [Comedy – Max]
Key Selection: Things to Consider
Scalability:
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability
Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions:
• Transactions across a single partition
• Transaction semantics, and fewer round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A query can stop early and return a continuation token when:
• The maximum of 1,000 rows in a response is reached
• The query reaches a partition range boundary
• The query hits the maximum of 5 seconds to execute
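A correct client therefore loops until no token comes back. The sketch below simulates a server that returns at most 3 rows per call (the real service caps at 1,000); `fake_query` and its integer token are stand-ins for the service and its opaque continuation token.

```python
# Simulated paged table query: the server hands back a page plus an
# optional continuation token; the client loops until the token is gone.

ROWS = [f"entity-{i}" for i in range(8)]
PAGE = 3  # the real service caps responses at 1,000 rows

def fake_query(continuation=0):
    """Stand-in for a table query; the int token is a stand-in for the
    real service's opaque NextPartitionKey/NextRowKey headers."""
    page = ROWS[continuation:continuation + PAGE]
    next_token = continuation + PAGE if continuation + PAGE < len(ROWS) else None
    return page, next_token

def query_all():
    results, token = [], 0
    while token is not None:  # keep going until the server says "done"
        page, token = fake_query(token)
        results.extend(page)
    return results

all_rows = query_all()
```

Code that ignores the token silently processes only the first page, which is exactly the bug the slide is warning about.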
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
• Select a PartitionKey and RowKey that help scale
  • Distribute load by using a hash, etc., as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "Server Busy"
  • The system load-balances partitions to meet traffic needs
  • The load on a single partition may have exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?
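The work-ticket pattern just mentioned keeps the message under the 8 KB limit by putting the payload in blob storage and the reference in the queue. A minimal sketch, where `blob_store` and `q` are in-memory stand-ins for the two Azure services:

```python
import uuid

blob_store = {}  # stand-in for blob storage
q = []           # stand-in for an Azure queue

def submit_work(payload: bytes):
    """Producer: big data goes to a blob; a tiny ticket goes on the queue."""
    blob_name = f"work/{uuid.uuid4()}"
    blob_store[blob_name] = payload
    q.append({"blob": blob_name})  # the message is just a reference
    return blob_name

def process_next():
    """Consumer: follow the ticket to the blob, process, then clean up."""
    ticket = q.pop(0)
    data = blob_store[ticket["blob"]]
    result = len(data)              # stand-in for real processing
    del blob_store[ticket["blob"]]  # avoid orphaned blobs
    return result

submit_work(b"x" * 100_000)         # far larger than the 8 KB message limit
size = process_next()
```

Note the explicit cleanup step: if consumers crash between dequeue and delete, orphaned blobs accumulate, which is why the later recap slide calls for garbage-collecting them.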
Queue Terminology
Message Lifecycle
(Diagram: a queue holding Msg 1–Msg 4. A Web Role calls PutMessage to enqueue. A Worker Role calls GetMessage with a visibility timeout, processes the message, then calls RemoveMessage.)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Backoff Polling
Consider a backoff polling approach:
• Each empty poll increases the polling interval by 2x, up to a maximum
• A successful poll resets the interval back to 1
(Diagram: consumers C1 and C2 polling the queue with growing intervals)
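The polling discipline above reduces to a one-line interval update; the 1-second floor and 60-second ceiling below are illustrative choices, not service requirements.

```python
# Truncated exponential backoff polling: double the sleep after each empty
# poll, cap it, and reset to the minimum as soon as a message arrives.

MIN_INTERVAL, MAX_INTERVAL = 1, 60  # seconds; illustrative bounds

def next_interval(current, got_message):
    if got_message:
        return MIN_INTERVAL            # traffic is flowing -- poll eagerly
    return min(current * 2, MAX_INTERVAL)  # queue is empty -- back off

# Trace the empty-queue case: 1 -> 2 -> 4 -> ... truncated at 60.
intervals = []
interval = MIN_INTERVAL
for _ in range(8):
    interval = next_interval(interval, got_message=False)
    intervals.append(interval)
```

The payoff: idle workers stop burning storage transactions (which are billed per request) on a queue that has nothing for them, yet recover full responsiveness one poll after work reappears.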
Removing Poison Messages
(Diagram sequence, with producers P1 and P2 and consumers C1 and C2:)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2
13. C1: Delete(Q, msg 1) – msg 1 is treated as a poison message and removed
Queues Recap
• Make message processing idempotent – then there is no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage-collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers
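The dequeue-count rule from the recap can be sketched as follows. This simplified model re-enqueues a message immediately instead of modeling the visibility timeout, and the threshold of 3 is illustrative; the point is only the check against `MAX_DEQUEUES`.

```python
MAX_DEQUEUES = 3  # illustrative poison threshold

class SketchQueue:
    def __init__(self):
        self.msgs = []   # each entry: [body, dequeue_count]
        self.dead = []   # poison messages diverted for later inspection

    def put(self, body):
        self.msgs.append([body, 0])

    def get(self):
        """Dequeue, enforcing the poison-message threshold."""
        while self.msgs:
            msg = self.msgs.pop(0)
            msg[1] += 1
            if msg[1] > MAX_DEQUEUES:
                self.dead.append(msg[0])  # poison: stop retrying it
                continue
            # Simplification: in Azure the message would reappear only
            # after the visibility timeout expires.
            self.msgs.append(msg)
            return msg[0]
        return None

q = SketchQueue()
q.put("bad-input")
seen = [q.get() for _ in range(5)]  # consumers keep crashing on this message
```

Without the threshold, a message that crashes every consumer would circulate forever, soaking up one worker at a time.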
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage:
• http://blogs.msdn.com/windowsazurestorage
• http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O completion ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure or poor user experience from not having excess capacity against the cost of idling VMs
Performance Cost
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web apprsquos billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often leads to savings in other places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web apprsquos performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
(Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
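Point 1 is cheap to verify: gzip pays off dramatically on repetitive text output such as HTML or JSON. A quick sketch with Python's standard library:

```python
import gzip

# Repetitive markup, as a typical rendered page fragment would be.
html = b"<div class='row'>item</div>" * 1000

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)  # fraction of original size

# Round-tripping proves nothing is lost -- only bandwidth and storage.
restored = gzip.decompress(compressed)
```

On output like this the compressed body is a tiny fraction of the original, which is bandwidth (and storage) you stop paying for on every response.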
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• One of the most important software tools in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST) – needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the generally suggested application model:
  • Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task Flow: A Simple Split/Join Pattern
(Diagram: a splitting task fans out into BLAST tasks run in parallel, followed by a merging task.)
Leverage the multiple cores of one instance:
• Argument "-a" of NCBI BLAST
• Use 1/2/4/8 for small, medium, large, and extra-large instance sizes
Task granularity:
• Large partitions → load imbalance
• Small partitions → unnecessary overheads (NCBI BLAST startup overhead, data-transfer overhead)
• Best practice: profile with test runs and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long wait in case of an instance failure
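The split/join pattern above can be sketched independently of BLAST itself. Here `match_score` is a toy stand-in for an actual BLAST invocation (longest common prefix against a toy database), and partition size and worker count are the tuning knobs the slide discusses.

```python
from concurrent.futures import ThreadPoolExecutor

DATABASE = ["GATTACA", "CCGGTTAA", "ATATATAT"]

def match_score(query):
    """Toy stand-in for BLAST: best longest-common-prefix score."""
    def lcp(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n
    return query, max(lcp(query, d) for d in DATABASE)

def split(queries, partition_size):
    """Splitting task: cut the input sequences into partitions."""
    return [queries[i:i + partition_size]
            for i in range(0, len(queries), partition_size)]

def blast_partition(partition):
    """One worker's job: query a whole partition."""
    return [match_score(q) for q in partition]

def run(queries, partition_size=2, workers=4):
    partitions = split(queries, partition_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(blast_partition, partitions)  # parallel map
    merged = []
    for part in partial:  # merging task: join per-partition results
        merged.extend(part)
    return merged

results = run(["GATT", "CCGG", "ATAT", "TTTT"])
```

The granularity trade-off is visible in `partition_size`: one huge partition serializes the work (load imbalance), while one-sequence partitions pay the per-task startup overhead many times over.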
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource
AzureBLAST
(Architecture diagram: a Web Role hosts the Web Portal, Web Service, and job registration; a Job Management Role runs the Job Scheduler and Scaling Engine and dispatches work to a global dispatch queue consumed by Worker roles; an Azure Table holds the Job Registry; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a Database Updating Role refreshes the NCBI databases.)
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
The accepted job is stored in the job registry table:
• Fault tolerance – avoid in-memory state
(Diagram: Web Portal and Web Service feed job registration; the Job Scheduler and Scaling Engine consume jobs from the Job Registry.)
Demonstration
R. palustris as a Platform for H2 Production
Eric Schadt (Sage), Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5,000 proteins (~700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our approach:
• Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM) across four datacenters – US (2), Western and Northern Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment was submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances occurred, redistributed the load manually
End result:
• Total size of the output is ~230 GB
• The number of total hits is 1,764,579,487
• Started March 25th; the last task completed April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record pairs an "Executing" line with an "is done" line:
3/31/2010 6:14  RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25  RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44  RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02  RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g. the task failed to complete):
3/31/2010 8:22  RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50  RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
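The "something is wrong" detection above is a small pairing exercise: match each "Executing" line with its "is done" line and flag tasks that never finish. A sketch over the sample lines (the regexes assume the log shape shown above):

```python
import re

LOG = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started, durations = {}, {}
for line in LOG.splitlines():
    m = re.search(r"Executing the task (\d+)", line)
    if m:
        started[m.group(1)] = line  # remember until we see completion
        continue
    m = re.search(r"Execution of task (\d+) is done, it took ([\d.]+) mins", line)
    if m:
        durations[m.group(1)] = float(m.group(2))
        started.pop(m.group(1), None)

lost_tasks = sorted(started)  # started but never finished -- investigate these
```

Scanning millions of such lines is exactly how the update-domain and fault-domain incidents on the next slides were reconstructed after the fact.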
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in groups – this is an update domain in action
• ~30 minutes per group; ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks completed before the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." – Irish proverb
Computing Evapotranspiration (ET)
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J g⁻¹)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Penman-Monteith (1964):

    ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
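A direct transcription of the Penman-Monteith formula above; the parameter values in the demo calls are made up for illustration, and unit bookkeeping follows the slide's definitions.

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """ET per Penman-Monteith (1964), exactly as in the formula above.

    delta: Pa/K, r_n: W/m^2, rho_a: kg/m^3, c_p: J/(kg K), dq: Pa,
    g_a, g_s: m/s, gamma: Pa/K, lambda_v: J/g (illustrative default).
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative values only: closing stomata (smaller g_s) should cut ET.
et_open = penman_monteith(delta=145.0, r_n=400.0, rho_a=1.2, c_p=1013.0,
                          dq=1000.0, g_a=0.02, g_s=0.01)
et_closed = penman_monteith(delta=145.0, r_n=400.0, rho_a=1.2, c_p=1013.0,
                            dq=1000.0, g_a=0.02, g_s=0.002)
```

The stomatal conductivity gs sits in the denominator, which is why it is the "not so simple" input: it must be estimated per land-cover type across the whole catchment before the reduction stage can run.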
ET Synthesizes Imagery Sensors Models and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage:
• Optional second stage, visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
(Pipeline diagram: scientists submit requests via the AzureMODIS Service Web Role Portal → Request Queue → Download Queue → Data Collection Stage, pulling from source imagery download sites and source metadata → Reprojection Queue → Reprojection Stage → Reduction 1 Queue → Derivation Reduction Stage → Reduction 2 Queue → Analysis Reduction Stage → scientific results download.)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
(Flow: a <PipelineStage> Request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Generic Worker (Worker Role):
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status
(Flow: the Service Monitor dispatches to the <PipelineStage> Task Queue; Generic Workers dequeue tasks and read/write <Input> Data Storage.)
Example Pipeline Stage: Reprojection Service
(Flow: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue, which Generic Workers consume against swath source data storage.)
• Each entity in the job table specifies a single reprojection job request
• Each entity in the task table specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Stage-by-stage (via the AzureMODIS Service Web Role Portal; 20–100 workers unless noted):
• Data collection: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3,500 hours – $420 CPU, $60 download
• Derivation reduction: 5–7 GB, 55K files, 1,800 hours – $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1,800 hours – $216 CPU, $2 download, $9 storage
Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
12
Application Model Comparison
Machines Running IIS / ASP.NET
Machines Running Windows Services
Machines Running SQL Server
Ad Hoc Application Model
Web Role Instances
Worker Role Instances
Azure Storage: Blob, Queue, Table
SQL Azure
Windows Azure Application Model
Key Components
Fabric Controller
• Manages hardware and virtual machines for service
Compute
• Web Roles
  • Web application front end
• Worker Roles
  • Utility compute
• VM Roles
  • Custom compute role; you own and customize the VM
Storage
• Blobs
  • Binary objects
• Tables
  • Entity storage
• Queues
  • Role coordination
• SQL Azure
  • SQL in the cloud
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
  • The configuration definition describes the shape of a service
    • Role types
    • Role VM sizes
    • External and internal endpoints
    • Local storage
  • The configuration settings configure a service
    • Instance count
    • Storage keys
    • Application-specific settings
Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers, switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints
Suggested Application Model: Using queues for reliable messaging
Scalable Fault Tolerant Applications
Queues are the application glue
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and back-end servers
• Mask faults in worker roles (reliable messaging)
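The decoupling the slide describes can be sketched in a few lines: a front end only enqueues, workers only dequeue, and the two scale independently. This is a minimal in-memory simulation of the pattern, not the Azure Queue API.

```python
import queue
import threading

work_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def web_role_enqueue(job_ids):
    # The front end only enqueues work tickets; it never calls workers directly.
    for job_id in job_ids:
        work_queue.put(job_id)

def worker_role():
    # Workers pull messages at their own pace; adding more worker threads
    # scales the back end without touching the front end.
    while True:
        try:
            job_id = work_queue.get(timeout=0.1)
        except queue.Empty:
            return
        with results_lock:
            results.append(("processed", job_id))
        work_queue.task_done()

web_role_enqueue(range(10))
workers = [threading.Thread(target=worker_role) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Because the queue also buffers bursts, a slow back end degrades latency rather than dropping requests, which is the fault-masking property the slide refers to.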
Key Components – Compute: VM Roles
• Customized role
  • You own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using:
    • Base OS
    • Differences VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
  • At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   • Determine resource requirements
   • Create role images
2. Allocate resources
3. Prepare nodes
   • Place role images on nodes
   • Configure settings
   • Start roles
4. Configure load balancers
5. Maintain service health
   • If a role fails, restart the role based on policy
   • If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage At Massive Scale
Blob – massive files, e.g. videos, logs
Drive – use standard file system APIs
Tables – non-relational, but with few scale limits; use SQL Azure for relational data
Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
  • PutBlob
    • Inserts a new blob, overwrites an existing blob
  • GetBlob
    • Get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private
  • Default; will require the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
Block Blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page Blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
• You can upload a file in 'blocks'
  • Each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
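The upload/commit split described above can be sketched as a tiny simulation: blocks arrive in any order, and only the committed block list defines the blob. This is an illustrative model of the semantics, not the REST API itself.

```python
import hashlib

class BlockBlob:
    """A toy model of block-blob semantics (illustrative, not the Azure SDK)."""

    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged but not yet visible
        self.committed = b""    # the blob contents after the last commit

    def put_block(self, block_id, data):
        # Blocks may be uploaded in any order, even in parallel.
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # The committed block list, not the upload order, defines the blob.
        self.committed = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()

blob = BlockBlob()
chunks = [b"part1-", b"part2-", b"part3"]
ids = [hashlib.md5(c).hexdigest() for c in chunks]

# Upload out of order...
for i in (2, 0, 1):
    blob.put_block(ids[i], chunks[i])

# ...then commit in the intended order.
blob.put_block_list(ids)
```

This is why parallel uploaders work well with block blobs: ordering is deferred to the single commit step.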
Blocks
Bigmpg1 6 8 3 5 4 7 2
Bigmpg
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
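The 512-byte alignment rule above is easy to get wrong; a small validation helper makes it concrete. This check is illustrative of the rule the slide states, not of any SDK behavior.

```python
PAGE = 512  # page blobs read and write in 512-byte pages

def check_page_write(offset, length):
    # Sketch of the alignment rule: a page write must start and end on
    # 512-byte boundaries (offset and length both multiples of 512).
    if offset % PAGE or length % PAGE:
        raise ValueError("page writes must align to 512-byte boundaries")
    return True

# Aligned writes pass...
assert check_page_write(0, 512)
assert check_page_write(1024, 4096)
```

A client library would typically pad the final page of a file with zeros to satisfy this constraint.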
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
    • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letters and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
  Table: Movies (Star Wars, Star Trek, Fan Boys)
  Table: Customers (Brian H. Prince, Jason Argonaut, Bill Gates)
Hierarchy: Account → Table → Entity
Tables store entities. Entity schema can vary within the same table.
Windows Azure Tables
• Provides structured storage
  • Massively scalable tables
    • Billions of entities (rows) and TBs of data
    • Can use thousands of servers as traffic grows
  • Highly available & durable
    • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
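Since a table entity is just a property bag with three mandatory system properties, it can be modeled as a dictionary. A minimal sketch (the helper name is illustrative, not an SDK call) shows how entities with different schemas coexist in one table:

```python
import datetime

def make_entity(partition_key, row_key, **props):
    # Every entity carries Timestamp, PartitionKey, and RowKey;
    # all other properties are free-form and may differ per entity.
    entity = {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        "Timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    entity.update(props)
    return entity

# Two entities with different schemas can live in the same table.
movie = make_entity("Action", "The Bourne Ultimatum", ReleaseDate=2007)
customer = make_entity("1", "Customer-John Smith", Name="John Smith")
```

PartitionKey + RowKey together form the unique key and the only index, which is why key selection (covered later) dominates table design.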
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
• Different for each data type (blobs, entities, queues)
Every data object has a partition key
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
The partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
The system load balances
• Use exponential backoff on "Server Busy"
  • The system load balances to meet your traffic needs
  • Single-partition limits have been reached
Partition Keys In Each Abstraction
• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $3512
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $1000

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition
• Messages – Queue name: all messages for a single queue belong to the same partition
Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
[Diagram: Server 1, Server 2, and Server 3 each hold replicas of partitions P1, P2, …, Pn]
Scalability Targets
Storage account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single queue/table partition
• Up to 500 transactions per second
Single blob partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
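The exponential backoff recommended for '503 Server Busy' can be sketched as a retry wrapper. Names and delay constants are illustrative; delays are computed rather than slept so the sketch stays fast.

```python
import random

def with_backoff(op, max_retries=5, base=0.1, cap=30.0):
    # Retry `op` while it reports 503, doubling the (jittered) delay each
    # attempt up to a cap. `op` returns an HTTP-style status code.
    delays = []
    for attempt in range(max_retries):
        status = op()
        if status != 503:
            return status, delays
        delay = min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
        delays.append(delay)  # a real client would time.sleep(delay) here
    raise RuntimeError("giving up after repeated 503s")

# Simulate a partition that is busy twice, then recovers.
calls = iter([503, 503, 200])
status, delays = with_backoff(lambda: next(calls))
```

The jitter keeps a fleet of workers from retrying in lockstep against the same hot partition.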
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Partitions and Partition Ranges
Server A: Table = Movies [Min – Max]
Server A: Table = Movies [Min – Comedy)
Server B: Table = Movies [Comedy – Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency & speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
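The consequence of the limits above is that every range query must loop until no token is returned. A minimal paging sketch (the query function simulates the service; real responses carry the token in response headers):

```python
# A simulated table of 2500 rows; any of the three conditions above
# would cap a single response, here modeled as a 1000-row page limit.
DATA = [f"row{i}" for i in range(2500)]

def query_page(token=None, page_size=1000):
    # Stand-in for a table query: returns one page plus a continuation
    # token, or None when the result set is exhausted.
    start = token or 0
    page = DATA[start:start + page_size]
    next_token = start + page_size if start + page_size < len(DATA) else None
    return page, next_token

rows, token, pages = [], None, 0
while True:
    page, token = query_page(token)
    rows.extend(page)
    pages += 1
    if token is None:   # always loop until the token disappears
        break
```

Code that ignores the token silently returns a truncated result set, which is why the slide says "seriously".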
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
• Select a PartitionKey and RowKey that help scale
  • Distribute by using a hash, etc., as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "Server Busy"
  • The system load balances partitions to meet traffic needs
  • Load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?
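The work-ticket pattern mentioned above keeps the queue message small: the payload goes to blob storage, and the message carries only a reference. A minimal sketch with in-memory stand-ins for the two services (names illustrative, not SDK calls):

```python
import uuid

blob_store = {}   # stand-in for blob storage (large payloads)
msg_queue = []    # stand-in for the queue (small tickets)

def submit_work(payload: bytes):
    # Store the big payload in "blob storage" and enqueue only a ticket,
    # which easily fits under the 8 KB message limit.
    ticket = str(uuid.uuid4())
    blob_store[ticket] = payload
    msg_queue.append(ticket)
    return ticket

def process_next():
    # A worker dequeues the ticket, fetches the payload, and garbage-
    # collects the blob once the work is done.
    ticket = msg_queue.pop(0)
    payload = blob_store.pop(ticket)
    return len(payload)

submit_work(b"x" * 1_000_000)   # far larger than the 8 KB message limit
processed = process_next()
```

This also answers "why not a table": queues add visibility timeouts and at-least-once delivery, which tables do not provide.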
Queue Terminology
Message Lifecycle
[Diagram: a Web Role calls PutMessage to add Msg 1…4 to the Queue; Worker Roles call GetMessage (with a visibility timeout) to retrieve messages and RemoveMessage to delete them after processing]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll resets the interval back to 1
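The back-off policy above fits in one function: double on an empty poll up to a ceiling, reset on success. Constants are illustrative.

```python
def next_interval(current, got_message, floor=1.0, ceiling=60.0):
    # Truncated exponential back-off for queue polling: each empty poll
    # doubles the interval up to a ceiling; a successful poll resets it.
    if got_message:
        return floor
    return min(ceiling, current * 2)

# Trace the interval through a sequence of poll outcomes.
interval = 1.0
history = []
for got in [False, False, False, True, False]:
    interval = next_interval(interval, got)
    history.append(interval)
```

The truncation (ceiling) matters: without it, a long-idle worker could sleep so long that it reacts very slowly when traffic returns.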
Removing Poison Messages
[Diagram: producers P1, P2 enqueue to queue Q; consumers C1, C2 dequeue]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages
[Diagram: producers P1, P2; consumers C1, C2]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages
[Diagram: producers P1, P2; consumers C1, C2]
1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 becomes visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage-collect orphaned blobs
• Use message count to scale: dynamically increase/reduce workers
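The poison-message rule in the recap — drop a message once its dequeue count passes a threshold — can be sketched as a small dispatch function. The threshold and names are illustrative.

```python
MAX_DEQUEUE = 3  # illustrative threshold on a message's dequeue count

def handle(message, dequeue_count, process):
    # If a message keeps reappearing (its consumer crashed every time),
    # its dequeue count grows; past the threshold, treat it as poison.
    if dequeue_count > MAX_DEQUEUE:
        return "dead-lettered"   # e.g. park it in a blob for inspection
    try:
        process(message)
        return "deleted"         # normal path: delete after processing
    except Exception:
        return "retried"         # message becomes visible again after timeout

def crashy(_msg):
    # A processor that always fails, like C1/C2 in the diagrams above.
    raise RuntimeError("consumer crash")

outcomes = [handle("msg1", n, crashy) for n in (1, 2, 3, 4)]
```

Without the threshold, a single malformed message can pin a worker in an infinite crash-retry loop.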
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
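The deck's concrete recommendation is .NET 4's Task Parallel Library; the same data-parallel idea, sized to the VM's core count, can be sketched in Python with the standard thread pool:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def work(chunk):
    # A CPU-ish unit of work applied to one slice of the data.
    return sum(x * x for x in chunk)

data = list(range(10_000))
cores = os.cpu_count() or 1

# Data parallelism: one chunk per core, matching the slide's advice to
# keep active workers in line with the number of cores.
chunks = [data[i::cores] for i in range(cores)]
with ThreadPoolExecutor(max_workers=cores) as pool:
    total = sum(pool.map(work, chunks))
```

The analogous TPL call would be `Parallel.ForEach` or PLINQ over the chunks; the sizing principle (workers ≈ cores) is the same.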
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
  • Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off: the risk of failure or poor user experience from not having excess capacity vs. the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
  • Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
Pipeline: Uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → Compressed content
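The payoff of step 1 is easy to demonstrate with the standard library: for typical repetitive markup, gzip cuts both the bytes stored and the bytes sent over the wire. The sample content is illustrative.

```python
import gzip

# Repetitive HTML, the best case for gzip but representative of real markup.
html = b"<html><body>" + b"<p>hello cloud</p>" * 500 + b"</body></html>"

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)  # fraction of the original size
```

Because bandwidth and storage are both billed per byte, this one-line transformation reduces two line items at once, at a small CPU cost (the "trade compute for storage" point in step 2).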
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700 to 1000 CPU hours
• Sequence databases are growing exponentially
  • GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
  • 100 nodes means peak storage demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
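The input-segmentation strategy above is a classic split/join: partition the query sequences, run each partition independently, merge the hits. A minimal sketch with a stand-in for the BLAST invocation (`blast_stub` is illustrative, not NCBI-BLAST):

```python
def split(sequences, per_partition=100):
    # Fixed-size partitions; the per-partition size is a tuning knob
    # (the micro-benchmarks later pick 100 sequences per partition).
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]

def blast_stub(partition):
    # Stand-in for running BLAST on one partition of query sequences.
    return [f"hit:{seq}" for seq in partition]

sequences = [f"seq{i}" for i in range(250)]
partitions = split(sequences)

# Each partition is independent (pleasingly parallel); here they run
# serially, but each could be one queue message for one worker.
merged = [hit for p in partitions for hit in blast_stub(p)]
```

Database segmentation (the mpiBLAST approach) is harder precisely because the join step is no longer a simple concatenation: hit scores must be re-ranked across database fragments.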
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow
A simple split/join pattern
• Leverage the multiple cores of one instance
  • The "-a" argument of NCBI-BLAST
  • Set to 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
• Task granularity
  • Large partitions: load imbalance
  • Small partitions: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
  • Best practice: test runs to profile, and set the size to mitigate the overhead
• Value of the visibilityTimeout for each BLAST task
  • Essentially an estimate of the task run time
  • Too small: repeated computation
  • Too large: an unnecessarily long period of waiting in case of instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
…
Merging Task
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
…
Scaling Engine
(BLAST databases, temporary data, etc.)
Job Registry
NCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
…
Merging Task
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM) in four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, real working-instance time should be 6-8 days
  • Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe Data Center: in total 34,256 tasks processed
All 62 compute nodes lost tasks and then came back in groups: this is an update domain
• ~30 mins per group
• ~6 nodes in one group
Surviving Storage Failures
West Europe Datacenter: 30,976 tasks were completed before the job was killed
35 nodes experienced blob writing failures at the same time
A reasonable guess: the Fault Domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = Rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = Latent heat of vaporization (J/g)
Rn = Net radiation (W m⁻²)
cp = Specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = Dry air density (kg m⁻³)
δq = Vapor pressure deficit (Pa)
ga = Conductivity of air (inverse of ra) (m s⁻¹)
gs = Conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = Psychrometric constant (γ ≈ 66 Pa K⁻¹)
Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)
Penman-Monteith (1964)
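As a sketch, the combination equation above can be coded directly; the sample input values below are illustrative only, not taken from the MODIS pipeline:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv).

    Units as on the slide: Rn in W m^-2, λv in J/g, γ ≈ 66 Pa K^-1.
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Hypothetical daytime values, chosen only to exercise the formula:
et = penman_monteith_et(delta=145.0, r_n=400.0, rho_a=1.2,
                        c_p=1013.0, dq=800.0, g_a=0.02, g_s=0.01)
print(et > 0)   # → True
```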
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source archives: 5 TB (600K files)
FLUXNET curated sensor dataset: 30 GB (960 files)
FLUXNET curated field dataset: 2 KB (1 file)
NCEP/NCAR: ~100 MB (4K files)
Vegetative clumping: ~5 MB (1 file)
Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate result sinusoidal tiles
• Simple nearest neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors
[Pipeline diagram: Scientists submit requests through the AzureMODIS Service Web Role Portal (Request Queue) and download scientific results. The Data Collection Stage pulls from the Download Queue and Source Imagery Download Sites, using Source Metadata; the Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage are driven by the Reprojection Queue, Reduction 1 Queue, and Reduction 2 Queue, producing science results.]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks persisted in Tables
[Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue]
MODISAzure Architectural Big Picture (2/2)
All work actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus; Dispatch → <PipelineStage> Task Queue → GenericWorker (Worker Role) → <Input>Data Storage]
Example Pipeline Stage: Reprojection Service
[Diagram: Reprojection Request → Service Monitor (Worker Role) → Persist ReprojectionJobStatus (each entity specifies a single reprojection job request) → Job Queue → Parse & Persist ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile) → Dispatch → Task Queue → GenericWorker (Worker Role) → Reprojection Data Storage. Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile; query the ScanTimeList table to get the list of satellite scan times that cover a target tile; workers read from Swath Source Data Storage.]
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6 month project duration
• Small with respect to the people costs, even at graduate student rates
Per-stage figures (data size / files / compute / workers / cost):
• Data Collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers: $420 cpu, $60 download
• Derivation Reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 cpu, $1 download, $6 storage
• Analysis Reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 cpu, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest scale computer centers ever constructed and have the potential to be important to both large and small scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data parallel applications and can support many interesting "programming patterns", but tightly coupled low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute: VM Roles
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Tables Recap
- Queues: Their Unique Role in Building Reliable Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
12
Application Model Comparison
Ad Hoc Application Model: Machines Running IIS / ASP.NET; Machines Running Windows Services; Machines Running SQL Server
Windows Azure Application Model: Web Role Instances; Worker Role Instances; Azure Storage (Blob, Queue, Table); SQL Azure
Key Components
Fabric Controller
• Manages hardware and virtual machines for service
Compute
• Web Roles: web application front end
• Worker Roles: utility compute
• VM Roles: custom compute role; you own and customize the VM
Storage
• Blobs: binary objects
• Tables: entity storage
• Queues: role coordination
• SQL Azure: SQL in the cloud
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "Cloud Layer" on top of:
• Windows Server 2008
• A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
• Role types
• Role VM sizes
• External and internal endpoints
• Local storage
• The configuration settings configure a service:
• Instance count
• Storage keys
• Application-specific settings
Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware)
• Power-on automation devices
• Routers / switches
• Hardware load balancers
• Physical servers
• Virtual servers
• State transitions
• Current State
• Goal State
• Does what is needed to reach and maintain the goal state
• It's a perfect IT employee:
• Never sleeps
• Doesn't ever ask for a raise
• Always does what you tell it to do in configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web Front End
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
• Protected space on the local drive, considered volatile storage
• May communicate with outside services
• Azure Storage
• SQL Azure
• Other Web services
• Can expose external and internal endpoints
Suggested Application Model: Using queues for reliable messaging
Scalable, Fault Tolerant Applications
Queues are the application glue
• Decouple parts of the application, easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role
• You own the box
• How it works:
• Download "Guest OS" to Server 2008 Hyper-V
• Customize the OS as you need to
• Upload the differences VHD
• Azure runs your VM role using:
• Base OS
• Differences VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy and manage that diagram for you:
• Find hardware home
• Copy and launch your app binaries
• Monitor your app and the hardware
• In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
ServiceDefinition.csdef
ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
• Encrypted package of your code
• Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
• (which is better than six months)
Service Management API
• REST based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap…
• Lots of community and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage, At Massive Scale
Blob – massive files, e.g. videos, logs
Drive – use standard file system APIs
Tables – non-relational, but with few scale limits; use SQL Azure for relational data
Queues – facilitate loosely-coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and Blobs as you want
• Standard REST interface
• PutBlob: inserts a new blob, overwrites the existing blob
• GetBlob: get whole blob or a specific range
• DeleteBlob
• CopyBlob
• SnapshotBlob
• LeaseBlob
• Each Blob has an address:
• http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
• http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top level folder
• Has an unlimited capacity
• Can only contain BLOBs
Each container has an access level:
• Private (default; will require the account key to access)
• Full public read
• Public read only
Two Types of Blobs Under the Hood
Block Blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob
Page Blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'
• Each block has an id
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
[Diagram: Big.mpg uploaded as blocks 1 6 8 3 5 4 7 2, then committed into the blob Big.mpg]
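The block semantics above (stage blocks under IDs, then commit a block list that fixes their order) can be modeled with a toy class; the names are hypothetical, not the real Storage Client API:

```python
class ToyBlockBlob:
    """Toy model of block-blob semantics: stage blocks, then commit a block list."""

    def __init__(self):
        self._staged = {}        # uncommitted blocks, keyed by block id
        self._committed = []     # ordered list of (block_id, data)

    def put_block(self, block_id, data):
        # Staged only: not part of the blob until committed (and in the real
        # service, garbage-collected after about a week if never committed).
        self._staged[block_id] = data

    def put_block_list(self, block_ids):
        # The commit list fixes block order, regardless of upload order.
        self._committed = [(b, self._staged[b]) for b in block_ids]
        self._staged.clear()

    def content(self):
        return b"".join(data for _, data in self._committed)

blob = ToyBlockBlob()
blob.put_block("b2", b"world")          # blocks may arrive in any order
blob.put_block("b1", b"hello ")
blob.put_block_list(["b1", "b2"])       # commit decides the final order
print(blob.content())                   # → b'hello world'
```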
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1 minute exclusive write lock on a BLOB
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
• Example: mount a Page Blob as X:
• http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• Drive made durable through standard Page Blob replication
• Drive persists, as a Page Blob, even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letter and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table Name: Movies (Star Wars, Star Trek, Fan Boys)
• Table Name: Customers (Brian H. Prince, Jason Argonaut, Bill Gates)
Account → Table → Entity
Tables store entities. Entity schema can vary in the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
• Billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available & durable
• Data is replicated several times
• Familiar and easy to use API
• WCF Data Services and OData
• .NET classes and LINQ
• REST – with any platform or language
Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server side joins between tables
• Create custom indexes on the tables
• No server side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
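A toy model of the constraints above: every entity must carry PartitionKey, RowKey, and Timestamp, entities are addressed by the key pair, and otherwise the schema may vary per entity. Class and method names are illustrative, not the real Table service API:

```python
import time

class ToyTable:
    """Toy Azure-style table: entities keyed by (PartitionKey, RowKey)."""

    def __init__(self):
        self._rows = {}

    def insert(self, entity):
        # The required properties; everything else is free-form per entity.
        for required in ("PartitionKey", "RowKey"):
            if required not in entity:
                raise ValueError(f"missing {required}")
        entity.setdefault("Timestamp", time.time())
        self._rows[(entity["PartitionKey"], entity["RowKey"])] = entity

    def point_query(self, partition_key, row_key):
        # A point query (both keys supplied) is the most efficient lookup.
        return self._rows.get((partition_key, row_key))

movies = ToyTable()
movies.insert({"PartitionKey": "Action", "RowKey": "Fast & Furious",
               "ReleaseDate": 2009})
movies.insert({"PartitionKey": "Comedy", "RowKey": "Office Space",
               "ReleaseDate": 1999, "Director": "Mike Judge"})  # extra column is fine
print(movies.point_query("Action", "Fast & Furious")["ReleaseDate"])  # → 2009
```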
Windows Azure Queues
• Queues are performance efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• Controls entity locality
Partition key is the unit of scale
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
System load balances
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
Server Busy
• Use exponential backoff on "Server Busy"
• Our system load balances to meet your traffic needs
• Single partition limits have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition
PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $3512
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $1000
Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition
Container Name | Blob Name
image | annarbor bighouse.jpg
image | foxborough gillette.jpg
video | annarbor bighouse.jpg
Messages – Queue Name
• All messages for a single queue belong to the same partition
Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has written to all three replicas
• Reads are only load balanced to replicas in sync
[Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3]
Scalability Targets
Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
Partitions and Partition Ranges
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Initially one server holds the whole range:
• Server A: Table = Movies [Min - Max]
As load grows, the range is split across servers:
• Server A: Table = Movies [Min - Comedy)
• Server B: Table = Movies [Comedy - Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A query may stop early and return a continuation token at:
• A maximum of 1000 rows in a response
• The end of a partition range boundary
• A maximum of 5 seconds to execute the query
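The rule above (keep issuing requests until no continuation token comes back) can be sketched against a fake paged query; the function names and paging mechanics are illustrative:

```python
def fake_paged_query(rows, continuation=0, page_size=1000):
    """Return at most page_size rows plus a continuation token (or None)."""
    page = rows[continuation:continuation + page_size]
    end = continuation + page_size
    next_token = end if end < len(rows) else None
    return page, next_token

def fetch_all(rows):
    # Always loop on the continuation token: any single call may return a
    # partial page (row cap, partition range boundary, or execution time cap).
    results, token = [], 0
    while token is not None:
        page, token = fake_paged_query(rows, token)
        results.extend(page)
    return results

data = list(range(2500))          # forces three pages: 1000 + 1000 + 500
print(len(fetch_all(data)))       # → 2500
```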
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select PartitionKey and RowKey that help scale
• Avoid "append only" patterns; distribute by using a hash etc. as a prefix
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries on "Server busy"
• The system load balances partitions to meet traffic needs
• Or the load on a single partition has exceeded the limits
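One mitigation named above (prefixing keys with a stable hash so sequential "append only" keys spread across partitions) can be sketched as follows; the bucket count and key format are arbitrary choices:

```python
import hashlib

def spread_partition_key(natural_key, buckets=16):
    """Prefix the natural key with a stable hash bucket so sequential keys
    (timestamps, auto-increment ids) land on different partitions."""
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}-{natural_key}"

# Sixty consecutive timestamps now scatter across several bucket prefixes:
keys = [spread_partition_key(f"2010-12-07T10:00:{s:02d}") for s in range(60)]
prefixes = {k.split("-", 1)[0] for k in keys}
print(len(prefixes) > 1)   # → True
```

The trade-off: range queries over the natural key must now fan out across all buckets, which is why the slide pairs this with "expect continuation tokens for range queries".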
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• Loose coupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
[Diagram: a Web Role calls PutMessage to add messages (Msg 1 … Msg 4) to the Queue; Worker Roles call GetMessage (with a visibility timeout) to dequeue messages and RemoveMessage to delete them once processed]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a back-off polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1
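The polling scheme above can be sketched as a small helper: double the interval on each empty poll, reset on success, truncate at a cap. The base and cap values are arbitrary choices:

```python
def next_poll_interval(current, got_message, base=1.0, cap=60.0):
    """Truncated exponential back-off: 2x on empty poll, reset on success."""
    if got_message:
        return base                    # a successful poll resets the interval
    return min(current * 2, cap)       # an empty poll doubles it, up to the cap

interval = 1.0
for _ in range(8):                     # eight consecutive empty polls
    interval = next_poll_interval(interval, got_message=False)
print(interval)                                        # → 60.0 (truncated)
print(next_poll_interval(interval, got_message=True))  # → 1.0 (reset)
```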
Removing Poison Messages
Scenario, with producers P1/P2 and consumers C1/C2 on queue Q:
1. C1 calls GetMessage(Q, 30 s) and receives msg 1
2. C2 calls GetMessage(Q, 30 s) and receives msg 2
3. C2 consumes msg 2
4. C2 calls DeleteMessage(Q, msg 2)
5. C1 crashes
6. msg 1 becomes visible again 30 s after its dequeue
7. C2 calls GetMessage(Q, 30 s) and receives msg 1
8. C2 crashes
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 is restarted
11. C1 calls GetMessage(Q, 30 s) and receives msg 1
12. msg 1's DequeueCount is now > 2
13. C1 calls DeleteMessage(Q, msg 1), removing it as a poison message
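The sequence above ends with the consumer deleting msg 1 once its dequeue count passes a threshold. A minimal consumer-side sketch of that check, with illustrative names and threshold rather than the real queue API:

```python
def process(queue_entry, handler, max_dequeues=3):
    """Delete a message as poison once it has been dequeued too many times."""
    queue_entry["dequeue_count"] += 1
    if queue_entry["dequeue_count"] > max_dequeues:
        return "deleted-as-poison"     # stop retrying a message that keeps killing workers
    try:
        handler(queue_entry["body"])
        return "processed"
    except Exception:
        return "will-retry"            # message becomes visible again after the timeout

def crashing_handler(body):
    raise RuntimeError("worker crash while handling " + body)

msg = {"body": "msg1", "dequeue_count": 0}
outcomes = [process(msg, crashing_handler) for _ in range(4)]
print(outcomes)  # → ['will-retry', 'will-retry', 'will-retry', 'deleted-as-poison']
```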
Queues Recap
Make message processing idempotent
• No need to deal with failures
Do not rely on order
• Invisible messages result in out-of-order delivery
Use DequeueCount to remove poison messages
• Enforce a threshold on a message's dequeue count
Use a blob to store message data, with a reference in the message
• For messages > 8 KB
• Batch messages
• Garbage collect orphaned blobs
Use message count to scale
• Dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up CPU
• Balance between using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
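The deck's concurrency examples are .NET-specific (IO Completion Ports, the Task Parallel Library); as a rough analogue, a data-parallel sketch sized to the core count might look like this in Python, where the workload function is a stand-in:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def score(item):
    # Stand-in for a per-item unit of work (data parallelism).
    return item * item

# Match the worker count to available cores rather than oversubscribing.
workers = os.cpu_count() or 1
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(score, range(10)))
print(results[:4])   # → [0, 1, 4, 9]
```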
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the costs of having idling VMs
Performance vs. Cost
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
(Diagram: uncompressed content → compressed content, via Gzip/minify JavaScript, minify CSS, minify images)
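Point 1 is easy to try from any language; a small Python sketch of gzipping a response body (in practice web servers and frameworks usually do this via configuration):

```python
import gzip

def compress_response(body: bytes) -> bytes:
    # Gzip the response body; clients advertise support via the
    # Accept-Encoding: gzip request header
    return gzip.compress(body)

html = b"<html>" + b"hello world " * 500 + b"</html>"
compressed = compress_response(html)
# Repetitive markup compresses very well, cutting both bandwidth
# and storage costs
savings = 1 - len(compressed) / len(html)
```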
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile inside and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large-volume data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
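Query segmentation amounts to chunking the FASTA input by sequence; a rough Python sketch (a hypothetical helper for illustration, not AzureBLAST's actual splitter):

```python
def split_fasta(text, seqs_per_partition):
    # Group FASTA records (each starting with a '>' header line) into
    # partitions of N sequences; each partition becomes one parallel task.
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    return ["\n".join(records[i:i + seqs_per_partition])
            for i in range(0, len(records), seqs_per_partition)]

parts = split_fasta(">a\nAAA\n>b\nCCC\n>c\nGGG", 2)
```

Because each partition is queried independently against the same database, the only coordination needed is merging the per-partition results at the end.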
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task Flow: a simple split/join pattern
Leverage the multiple cores of one instance
• The "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
• Best practice: use test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
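The visibilityTimeout guidance above can be folded into a small rule of thumb; a hypothetical sketch based on profiling runs (the safety factor and cap are assumptions, not values from the slides):

```python
def choose_visibility_timeout(sample_runtimes_mins, safety_factor=1.5,
                              cap_mins=120):
    # Long enough that a healthy worker finishes before the message
    # reappears (avoiding repeated computation), short enough that a
    # failed instance's task becomes visible to another worker quickly.
    worst = max(sample_runtimes_mins)
    return min(cap_mins, worst * safety_factor)
```

For example, if profiling runs took 10, 19, and 17 minutes, this picks a 28.5-minute timeout.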
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size / instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
Worker
Worker
Worker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
(BLAST databases, temporary data, etc.)
Job Registry
NCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored into the job registry table
  • Fault tolerance: avoid in-memory states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time...
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place...
Understanding Azure by analyzing logs
A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
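Finding the "something is wrong" cases can be automated by pairing start and done records; a minimal Python sketch over log lines in the format shown above:

```python
import re

def find_incomplete_tasks(log_lines):
    # Pair 'Executing the task N' with 'Execution of task N is done';
    # tasks that start but never finish indicate a failure or restart.
    started, finished = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)
```

Run over the abnormal excerpt above, this flags task 251774, which started at 8:22 but never logged a completion.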
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in groups; this is an update domain
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and then the job was killed
35 nodes experienced blob-writing failures at the same time
A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.
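Transcribed numerically, the Penman-Monteith formula might look like the sketch below (symbols follow the slide's variable list; the default γ uses the slide's ≈66 Pa/K, while the λv default is an assumed value the slide does not supply):

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    # ET = (Delta*Rn + rho_a*cp*(dq)*ga) / ((Delta + gamma*(1 + ga/gs)) * lambda_v)
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

The structure makes the slide's point visible: the tricky terms are the conductivities ga and gs, which vary across a catchment and drive the ratio in the denominator.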
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks - recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables

(Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue)
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a GenericWorker (Worker Role)
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus; Dispatch → <PipelineStage> Task Queue → GenericWorker (Worker Role) → <Input>Data Storage)
Example Pipeline Stage: Reprojection Service

(Diagram: Reprojection Request → Service Monitor (Worker Role) → Persist ReprojectionJobStatus → Job Queue → Parse & Persist ReprojectionTaskStatus → Dispatch → Task Queue → GenericWorker (Worker Role) → Swath Source Data Storage, Reprojection Data Storage, ScanTimeList, SwathGranuleMeta)

• Each entity in the job table specifies a single reprojection job request
• Each entity in the task table specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers
$50 upload, $450 storage
400 GB, 45K files, 3500 hours, 20-100 workers
5-7 GB, 55K files, 1800 hours, 20-100 workers
<10 GB, ~1K files, 1800 hours, 20-100 workers
$420 CPU, $60 download
$216 CPU, $1 download, $6 storage
$216 CPU, $2 download, $9 storage
AzureMODIS Service Web Role Portal
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9, Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable, Fault-Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
12
Application Model Comparison
Machines Running IIS / ASP.NET
Machines Running Windows Services
Machines Running SQL Server
Ad Hoc Application Model
Web Role Instances
Worker Role Instances
Azure Storage: Blob, Queue, Table
SQL Azure
Windows Azure Application Model
Key Components
Fabric Controller
• Manages hardware and virtual machines for services
Compute
• Web Roles
  • Web application front end
• Worker Roles
  • Utility compute
• VM Roles
  • Custom compute role; you own and customize the VM
Storage
• Blobs
  • Binary objects
• Tables
  • Entity storage
• Queues
  • Role coordination
• SQL Azure
  • SQL in the cloud
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service
  • Role types
  • Role VM sizes
  • External and internal endpoints
  • Local storage
• The configuration settings configure a service
  • Instance count
  • Storage keys
  • Application-specific settings
Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers, switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in the configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints
Suggested Application Model: Using queues for reliable messaging

Scalable, Fault-Tolerant Applications
Queues are the application glue
• Decouple parts of the application; easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role
  • You own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differencing VHD
  • Azure runs your VM role using:
    • Base OS
    • Differencing VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
  • At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
  • (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, ...
• Lots of community- and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale
Blob
- Massive files, e.g., videos, logs
Drive
- Use standard file system APIs
Tables
- Non-relational, but with few scale limits
- Use SQL Azure for relational data
Queues
- Facilitate loosely-coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
  • PutBlob
    • Inserts a new blob, overwrites the existing blob
  • GetBlob
    • Gets a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs
Each container has an access level:
- Private
  - Default; will require the account key to access
- Full public read
- Public read only
Two Types of Blobs Under the Hood
• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
    • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
    • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'
  • Each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: big.mpg assembled from blocks 1, 6, 8, 3, 5, 4, 7, 2)
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount the Page Blob as X:\
    • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob
  • Drive made durable through standard Page Blob replication
  • Drive persists even when not mounted, as a Page Blob
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letters and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
  Table Name: Movies
    Entities: Star Wars, Star Trek, Fan Boys
  Table Name: Customers
    Entities: Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity
Tables store entities. Entity schema can vary in the same table.
Windows Azure Tables
• Provides structured storage
  • Massively scalable tables
    • Billions of entities (rows) and TBs of data
    • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational
Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
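A sketch of what "schema-free plus three required properties" means in practice (plain Python dicts standing in for REST entities; property names follow the slides' movie example):

```python
from datetime import datetime, timezone

def movie_entity(category, title, release_date):
    # Every Azure Table entity carries PartitionKey, RowKey, and Timestamp;
    # all other properties are schema-free and can vary per entity.
    return {
        "PartitionKey": category,   # unit of scale and locality
        "RowKey": title,            # unique within the partition
        "Timestamp": datetime.now(timezone.utc).isoformat(),
        "ReleaseDate": release_date,
    }

e = movie_entity("Action", "Fast & Furious", 2009)
```

Together, PartitionKey + RowKey act as the primary key, and the PartitionKey choice determines which server serves the entity, which is why the later key-selection slides matter.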
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
The partition key is the unit of scale
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
System load balancing
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
Server Busy
• Use exponential backoff on "Server Busy"
• Our system load balances to meet your traffic needs
• Single partition limits may have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has written to all three replicas
• Reads are only load balanced to replicas in sync

(Diagram: partitions P1, P2, ..., Pn replicated across Server 1, Server 2, and Server 3)
Scalability Targets
Storage Account
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
Partitions and Partition Ranges

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | ...       | 2009
Action                  | The Bourne Ultimatum     | ...       | 2007
...                     | ...                      | ...       | ...
Animation               | Open Season 2            | ...       | 2009
Animation               | The Ant Bully            | ...       | 2006
...                     | ...                      | ...       | ...
Comedy                  | Office Space             | ...       | 1999
...                     | ...                      | ...       | ...
SciFi                   | X-Men Origins: Wolverine | ...       | 2009
...                     | ...                      | ...       | ...
War                     | Defiance                 | ...       | 2008

Server A: Table = Movies [Min - Max]
After the range is split for load balancing:
Server A: Table = Movies [Min - Comedy)
Server B: Table = Movies [Comedy - Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency & speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
• Select a PartitionKey and RowKey that help scale
  • Distribute by using a hash, etc., as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • Server busy
  • Load balance partitions to meet traffic needs
  • Load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• This can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology / Message Lifecycle
From the diagram: a web role enqueues work with PutMessage; the queue holds the messages (Msg 1 … Msg 4); worker roles pull work with GetMessage (with a visibility timeout) and, once processing succeeds, call RemoveMessage.
PutMessage:
POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DeleteMessage:
DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
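The key semantic in the lifecycle above is that GetMessage hides a message for the visibility timeout rather than removing it; only an explicit delete (with the pop receipt) removes it. A toy in-memory model of that behavior (not the Azure API; times are passed in explicitly for clarity):

```python
import time


class InMemoryQueue:
    """Toy model of queue visibility-timeout semantics.

    get() hides a message for `timeout` seconds instead of removing it,
    so a crashed consumer's message reappears automatically; delete()
    is the only operation that removes a message for good.
    """

    def __init__(self):
        self._msgs = {}      # msg_id -> (body, visible_at)
        self._next_id = 0

    def put(self, body):
        self._msgs[self._next_id] = (body, 0.0)
        self._next_id += 1

    def get(self, timeout, now=None):
        now = time.monotonic() if now is None else now
        for mid, (body, visible_at) in sorted(self._msgs.items()):
            if visible_at <= now:
                # Hide the message until the timeout elapses.
                self._msgs[mid] = (body, now + timeout)
                return mid, body
        return None

    def delete(self, mid):
        self._msgs.pop(mid, None)
```

The returned `mid` plays the role of the pop receipt: the caller needs it to delete the message after processing.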
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll increases the interval by 2x, up to a maximum; a successful poll sets the interval back to 1.
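The back-off rule above fits in one function; `initial` and `maximum` are tunables (1 s and 60 s here are illustrative, not prescribed):

```python
def next_interval(current, hit, initial=1.0, maximum=60.0):
    """Truncated exponential back-off for queue polling.

    Each empty poll doubles the wait, truncated at `maximum`;
    a successful poll resets the wait to `initial`.
    """
    if hit:
        return initial
    return min(current * 2, maximum)
```

A polling loop would sleep `next_interval(...)` seconds between GetMessage calls, so an idle queue costs at most one transaction per `maximum` seconds.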
Removing Poison Messages
Scenario from the diagrams (producers P1, P2; consumers C1, C2; queue Q holds msg 1 and msg 2):
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2: treat it as a poison message
13. Delete(Q, msg 1)
Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use DequeueCount to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
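The poison-message rule from the recap is a small guard at the top of the worker loop. A sketch, where `process`, `dead_letter`, and `delete` are hypothetical callables for the app's own handling (the threshold of 3 is illustrative):

```python
MAX_DEQUEUE = 3


def handle(msg, process, dead_letter, delete):
    """Poison-message guard: a message that keeps reappearing has
    crashed its consumers repeatedly, so pull it aside (dead-letter it)
    instead of retrying it forever."""
    if msg["DequeueCount"] > MAX_DEQUEUE:
        dead_letter(msg)
        delete(msg)
        return "poisoned"
    process(msg)
    delete(msg)      # only after successful processing
    return "processed"
```

`DequeueCount` is the property the slides rely on: the service increments it each time the message is dequeued, so it counts prior failed attempts.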
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
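The slides' concrete tool is the .NET Task Parallel Library; as a language-neutral sketch of the data-parallelism idea (fan one job per item out across a pool sized to the instance's cores), here is a Python analogue using a thread pool. The worker count is an assumption: match it to the VM size (e.g., 8 for an extra-large instance).

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, workers=8):
    """Data parallelism: apply func to every item concurrently.

    workers should roughly match the core count of the role instance;
    oversubscribing far beyond the cores mostly adds scheduling
    overhead for CPU-bound work.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))   # preserves input order
```

Task parallelism is the same machinery with heterogeneous callables submitted via `pool.submit` instead of one function mapped over data.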
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of having idling VMs
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile; e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile.
Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places; sending fewer things also means your VM has time to do other tasks.
All of these tips have the side benefit of improving your web app's performance and user experience.
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed content vs. compressed content: gzip/minify JavaScript, minify CSS, minify images.)
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST); needs special result-reduction processing

Large volumes of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern: split the input sequences; query the partitions in parallel; merge the results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With special considerations: batch job management; task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans the input out into many BLAST tasks, and a merging task joins their results.

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead; data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure
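The split/join pattern above can be sketched in a few lines; the partition size is a tunable set by profiling (names here are illustrative, not from the AzureBLAST code):

```python
def split_input(sequences, per_partition=100):
    """Splitting task: chunk the input sequences into fixed-size
    partitions, one work ticket per partition. Larger partitions risk
    load imbalance; smaller ones pay repeated startup overhead."""
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]


def merge_results(partial_results):
    """Merging task: join the per-partition outputs back into one
    result set, in partition order."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged
```

In the real pipeline each partition would be enqueued as a queue message and the merge would run once all partitions report done.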
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources
AzureBLAST Architecture
Condensed from the architecture diagram:
• Web Role: web portal and web service; handles job registration
• Job Management Role: job scheduler and scaling engine; dispatches work through a global dispatch queue
• Worker roles: run the splitting task, the BLAST tasks, and the merging task
• Database updating Role: keeps the NCBI databases current
• Azure Table: job registry
• Azure Blob: NCBI databases, BLAST databases, temporary data, etc.
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

BLASTed ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments: each segment is submitted to one deployment as one job for execution, and each segment consists of smaller partitions
• When the load is imbalanced, redistribute it manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups; this is an update domain:
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is at work.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" (Irish proverb)

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; a big data reduction
• Some of the inputs are not so simple
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
• 20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

Pipeline flow (from the diagram): scientists submit requests through the AzureMODIS Service web role portal into the request queue; tiles move from the source imagery download sites through the download queue (data collection stage), the reprojection queue (reprojection stage), and the reduction 1 and reduction 2 queues (derivation and analysis reduction stages); scientific results are then available for download.

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The MODISAzure Service is the web role front door: it receives all user requests and queues each request to the appropriate download, reprojection, or reduction job queue
• The Service Monitor is a dedicated worker role: it parses all job requests into tasks (recoverable units of work); the execution status of all jobs and tasks is persisted in tables

Diagram: a <PipelineStage> request reaches the MODISAzure Service (web role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (worker role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a worker role (GenericWorker):
• Dequeues tasks created by the Service Monitor from the <PipelineStage> task queue
• Reads and writes the <Input> data storage
• Retries failed tasks 3 times
• Maintains all task status
Example Pipeline Stage: Reprojection Service
From the diagram: a reprojection request lands in the job queue, where each entity specifies a single reprojection job request. The Service Monitor (worker role) persists ReprojectionJobStatus, parses the job into tasks, and persists ReprojectionTaskStatus, where each entity specifies a single reprojection task (i.e., a single tile). Tasks are dispatched to the task queue and picked up by GenericWorker roles, which query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile, query the ScanTimeList table for the list of satellite scan times that cover a target tile, read the swath source data storage, and write the reprojection data to storage.
Costs for 1 US Year ET Computation
• Computational costs are driven by the data scale and the need to run the reduction multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures (reading the pipeline diagram in order):
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers: $420 CPU, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure: Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery, Sensors, Models, and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
Key Components

Fabric Controller
• Manages hardware and virtual machines for the service

Compute
• Web Roles: web application front end
• Worker Roles: utility compute
• VM Roles: custom compute role; you own and customize the VM

Storage
• Blobs: binary objects
• Tables: entity storage
• Queues: role coordination
• SQL Azure: SQL in the cloud
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "Cloud Layer" on top of Windows Server 2008 and a custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service: role types; role VM sizes; external and internal endpoints; local storage
• The configuration settings configure a service: instance count; storage keys; application-specific settings
Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware): power-on automation devices; routers; switches; hardware load balancers; physical servers; virtual servers
• State transitions: current state; goal state; does what is needed to reach and maintain the goal state
• It's a perfect IT employee: never sleeps; doesn't ever ask for a raise; always does what you tell it to do in the configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end:
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute on Windows Server 2008
• Background processing
• Each role can define an amount of local storage: protected space on the local drive, considered volatile storage
• May communicate with outside services: Azure Storage; SQL Azure; other web services
• Can expose external and internal endpoints
Suggested Application Model: Using Queues for Reliable Messaging
Scalable, fault-tolerant applications: queues are the application glue.
• Decouple parts of the application; easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role: you own the box
• How it works: download the "Guest OS" to Server 2008 Hyper-V; customize the OS as you need to; upload the differences VHD; Azure runs your VM role using the base OS plus the differences VHD
Application Hosting
'Grokking' the Service Model
• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
  • At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files: ServiceDefinition.csdef and ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files: an encrypted package of your code, and your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)
Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model: determine resource requirements; create role images
2. Allocate resources
3. Prepare nodes: place role images on nodes; configure settings; start roles
4. Configure load balancers
5. Maintain service health: if a role fails, restart the role based on policy; if a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable storage at massive scale:
• Blobs: massive files, e.g., videos, logs
• Drives: use standard file-system APIs
• Tables: non-relational, but with few scale limits; use SQL Azure for relational data
• Queues: facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob: inserts a new blob, overwrites an existing blob
  • GetBlob: get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (the default): requires the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood

Block blob:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks; each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blob:
• Targeted at random read/write workloads
• Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob

Blocks
• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: Big.mpg split into blocks 1, 6, 8, 3, 5, 4, 7, 2, then committed as Big.mpg.)
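The client side of the block upload above is just chunking plus ID generation; a sketch of those two steps (it stops short of the Put Block / Put Block List REST calls themselves):

```python
import base64


def split_into_blocks(data: bytes, block_size: int):
    """Carve a large payload into blocks; each would be sent with one
    Put Block call, and a final Put Block List commits them (in any
    order) into a single blob."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def make_block_ids(n: int):
    """Block IDs must be base64 strings of equal length within a blob;
    a zero-padded counter is a common convention (an assumption here,
    not mandated by the service)."""
    return [base64.b64encode(f"{i:08d}".encode("ascii")).decode("ascii")
            for i in range(n)]
```

Because uncommitted blocks are garbage-collected after a week, an interrupted upload can simply be resumed or abandoned without cleanup.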
Pages
• Similar to block blobs, but optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists, as a Page Blob, even when not mounted
Windows Azure Drive API
• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and handle the error if it doesn't exist
Table Structure
Account: MovieData
• Table "Movies" — entities: Star Wars, Star Trek, Fan Boys
• Table "Customers" — entities: Brian H. Prince, Jason Argonaut, Bill Gates

Hierarchy: Account → Table → Entity
Tables store entities. Entity schema can vary in the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available & durable: data is replicated several times
• Familiar and easy-to-use API: WCF Data Services and OData, .NET classes and LINQ, REST with any platform or language
Is not relational — cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
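An entity is therefore just a property bag with two required keys you choose plus a service-maintained timestamp. A sketch of constructing one (plain Python dict; in practice the storage client or OData layer does this for you):

```python
from datetime import datetime, timezone

def make_entity(partition_key, row_key, **props):
    """Every table entity carries PartitionKey and RowKey (chosen by
    you) plus a Timestamp maintained by the service; any extra
    properties can vary freely between entities in the same table."""
    entity = {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        # The service stamps this on write; set here only to illustrate.
        "Timestamp": datetime.now(timezone.utc).isoformat(),
    }
    entity.update(props)
    return entity

movie = make_entity("Action", "Fast & Furious", ReleaseDate=2009)
```

Note how two movies in the same table could carry entirely different extra properties — that is the "schema can vary" point above.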
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
• Partitioning is different for each data type (blobs, entities, queues); every data object has a partition key
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• The partition key controls entity locality
The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
On "Server Busy":
• Use exponential backoff
• The system load balances to meet your traffic needs
• A busy response means a single partition's limits have been reached
Partition Keys In Each Abstraction
• Entities – TableName + PartitionKey; entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

• Blobs – Container name + Blob name; every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

• Messages – Queue Name; all messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

(Diagram: partitions P1 … Pn each replicated across Server 1, Server 2, and Server 3.)
Scalability Targets
Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
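The recommended response to '503 Server Busy' can be captured in a small retry wrapper. A minimal sketch (the exception type is a stand-in for whatever your storage client raises on a 503):

```python
import random
import time

class ServerBusyError(Exception):
    """Stand-in for the storage client's '503 Server Busy' error."""

def with_backoff(op, max_attempts=6, base=0.1, cap=30.0):
    """Retry op() with truncated exponential backoff plus jitter —
    the suggested reaction when a partition's limits are hit."""
    for attempt in range(max_attempts):
        try:
            return op()
        except ServerBusyError:
            # Delay doubles each attempt, truncated at `cap`,
            # with jitter so clients don't retry in lockstep.
            delay = min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
            time.sleep(delay)
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

Combined with splitting hot data across more partition keys (or more storage accounts), this keeps transient throttling from surfacing as user-visible failures.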
PartitionKey (Category) | RowKey (Title)       | Timestamp | ReleaseDate
Action                  | Fast & Furious       | …         | 2009
Action                  | The Bourne Ultimatum | …         | 2007
…                       | …                    | …         | …
Animation               | Open Season 2        | …         | 2009
Animation               | The Ant Bully        | …         | 2006
PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
Comedy                  | Office Space              | …         | 1999
…                       | …                         | …         | …
SciFi                   | X-Men Origins: Wolverine  | …         | 2009
…                       | …                         | …         | …
War                     | Defiance                  | …         | 2008
PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
Action                  | Fast & Furious            | …         | 2009
Action                  | The Bourne Ultimatum      | …         | 2007
…                       | …                         | …         | …
Animation               | Open Season 2             | …         | 2009
Animation               | The Ant Bully             | …         | 2006
…                       | …                         | …         | …
Comedy                  | Office Space              | …         | 1999
…                       | …                         | …         | …
SciFi                   | X-Men Origins: Wolverine  | …         | 2009
…                       | …                         | …         | …
War                     | Defiance                  | …         | 2008
Partitions and Partition Ranges
Initially one server holds the whole key range:
• Server A: Table = Movies [Min – Max]
After the system splits the range for load balancing:
• Server A: Table = Movies [Min – Comedy)
• Server B: Table = Movies [Comedy – Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency & speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics; reduce round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.
Expect Continuation Tokens – Seriously
• Maximum of 1,000 rows in a response
• A token is returned at the end of a partition range boundary
• Maximum of 5 seconds to execute the query
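Because a response can stop early for any of these reasons, every table query loop has to drain continuation tokens. A generic sketch of that loop (the `execute_query` callable stands in for whatever client call you use):

```python
def query_all(execute_query):
    """Drain a table query that may return continuation tokens.

    execute_query(token) -> (rows, next_token); next_token is None
    when the result set is exhausted. The service returns at most
    1,000 rows per response and may also stop early at a partition
    boundary or at the 5-second execution limit."""
    token = None
    results = []
    while True:
        rows, token = execute_query(token)
        results.extend(rows)
        if token is None:
            return results
```

The common bug this prevents: treating a short first page as the complete result set.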
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select a PartitionKey and RowKey that help scale
• Distribute load by using a hash etc. as a prefix
• Avoid "append only" patterns
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• "Server Busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together; tight coupling leads to brittleness
• This can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML and are limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
(Diagram: a Web Role calls PutMessage to add Msg 1 … Msg 4 to the queue; Worker Roles call GetMessage with a visibility timeout, process the message, and then call RemoveMessage once processing completes.)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll doubles the interval (up to a cap), and a successful poll resets the interval back to 1.
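The interval rule above is one line of code; the sketch below makes it concrete (the 1-second floor and 60-second ceiling are illustrative choices):

```python
def next_poll_interval(current, got_message, floor=1.0, ceiling=60.0):
    """Truncated exponential back-off for queue polling:
    an empty poll doubles the interval (up to the ceiling);
    a successful poll resets it to the floor."""
    if got_message:
        return floor
    return min(ceiling, current * 2)
```

The payoff: an idle worker converges to one cheap poll per minute instead of hammering the queue (and paying per transaction), while a busy worker stays responsive.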
Removing Poison Messages
(Diagram: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue them.)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (continued)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (continued)
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use DequeueCount to remove poison messages
• Messages > 8 KB: use a blob to store the message data with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
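Putting the recap together — idempotent processing, delete-on-success, and a DequeueCount threshold for poison messages — a worker loop body looks roughly like this (the queue/message objects are hypothetical stand-ins for the storage client's):

```python
MAX_DEQUEUE = 3  # illustrative threshold

def process_one(queue, handler):
    """One iteration of a worker loop. A message whose dequeue count
    exceeds the threshold is treated as poison and deleted (or parked
    in a dead-letter store) instead of being retried forever."""
    msg = queue.get_message(visibility_timeout=30)
    if msg is None:
        return                      # queue empty; caller backs off
    if msg.dequeue_count > MAX_DEQUEUE:
        queue.delete_message(msg)   # poison: stop retrying it
        return
    handler(msg)                    # must be idempotent by design
    queue.delete_message(msg)       # done: remove before timeout expires
```

If the worker crashes between `handler` and `delete_message`, the message reappears after the visibility timeout — which is exactly why the handler must be idempotent.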
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
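In the deck's context you would reach for the Task Parallel Library; the same data-parallel idea — map one operation over a collection using a pool sized to the instance's cores — looks like this in Python (used for the runnable examples here):

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(doc):
    return len(doc.split())

docs = ["a b c", "d e", "f"]

# Data parallelism: the same operation applied to every element,
# scheduled across a worker pool (size it to the core count).
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(word_count, docs))

print(counts)  # [3, 2, 1]
```

Task parallelism is the same executor with *different* callables submitted via `pool.submit`, which is how you keep all cores of a large instance busy instead of paying for idle ones.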
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity and the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand the application's storage profile and how storage billing works
• Make service choices based on your app profile (e.g., SQL Azure has a flat fee while Windows Azure Tables charges per transaction)
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
(Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
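The compute-for-storage trade in point 2 is easy to see with a standard-library sketch: compress a repetitive JSON payload before writing it to blob storage or returning it to the browser:

```python
import gzip
import json

# A large, repetitive JSON payload — typical web-app output.
payload = json.dumps(
    {"rows": [{"id": i, "name": "item-%d" % i} for i in range(1000)]}
).encode("utf-8")

# Spend a little CPU to shrink what you store and send.
compressed = gzip.compress(payload)

# Round-trips losslessly; all modern browsers gunzip on the fly.
assert gzip.decompress(compressed) == payload
print(len(payload), "->", len(compressed))
```

For static content you can compress once at upload time and serve the compressed bytes directly, paying the CPU cost only once.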
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100× larger than the input
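The query-segmentation approach is a plain split/join: cut the input sequence list into partitions, BLAST each partition independently, then concatenate the hit lists. A minimal sketch of that structure (the actual BLAST invocation is omitted):

```python
def split_queries(sequences, partition_size):
    """Query segmentation: cut the input sequence list into
    partitions that worker roles can BLAST independently."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge_results(partial_results):
    """Join step: concatenating per-partition hit lists suffices when
    the database (not the query set) is shared by all workers."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged
```

Database segmentation (the mpiBLAST style) is harder precisely because the per-segment hits for one query must then be re-ranked against each other, not just concatenated.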
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern: split the input sequences, query partitions in parallel, merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations: batch job management; task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow: a simple split/join pattern
Leverage the multiple cores of one instance:
• argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes
Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
(Task flow: splitting task → BLAST tasks in parallel → merging task)
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources
AzureBLAST
(Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler, global dispatch queue, and scaling engine; Worker Roles execute tasks; a database-updating role refreshes the NCBI databases; Azure Tables hold the job registry, and Azure Blobs hold the BLAST databases, temporary data, etc. Task flow: splitting task → BLAST tasks in parallel → merging task.)
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
Authentication/authorization is based on Live ID.
The accepted job is stored into the job registry table:
• Fault tolerance: avoid in-memory states
Demonstration
R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time.
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically 100 billion sequence comparisons
Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4,000 cores across 475 extra-large VMs (8 cores per VM) in four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST; each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments; each is submitted to one deployment as one job for execution; each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually
(Diagram: per-deployment instance counts of roughly 50–62 VMs each.)
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place
Understanding Azure by Analyzing Logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
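Detecting the "something is wrong" case can be automated by pairing start and completion records and reporting tasks that never finished. A sketch (the log format is paraphrased from the slides; adjust the patterns to the real records):

```python
import re

# Start/done patterns matching the record shapes shown above.
START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(lines):
    """Return task IDs that were started but never reported done —
    e.g., the task was running when its node was rebooted."""
    started, finished = set(), set()
    for line in lines:
        m = START.search(line)
        if m:
            started.add(m.group(1))
        m = DONE.search(line)
        if m:
            finished.add(m.group(1))
    return started - finished

log = [
    "RD001 Executing the task 251523",
    "RD001 Execution of task 251523 is done, it took 10.9 mins",
    "RD001 Executing the task 251774",
]
print(unfinished_tasks(log))  # {'251774'}
```

Running this per node and bucketing the failures by time is essentially how the upgrade- and storage-failure patterns on the next slides become visible.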
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups — this is an update domain at work:
• ~30 mins per group
• ~6 nodes in one group
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed before the job was killed.
35 nodes experienced blob-writing failures at the same time — a reasonable guess: the fault domain is working.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" — Irish proverb
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.
Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
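The Penman-Monteith formula itself is a one-line computation once the inputs are reduced; the data pipeline exists to produce those inputs per pixel. A direct transcription (γ defaults to the ≈66 Pa/K constant from the variable list; the λv default of ~2260 J/g for water is an assumption for illustration):

```python
def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0, lambda_v=2260.0):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv),
    with units as given in the variable list above."""
    return (delta * Rn + rho_a * cp * dq * ga) / (
        (delta + gamma * (1.0 + ga / gs)) * lambda_v)
```

The hard part the slides call out — estimating the conductivities ga and gs across a catchment — happens upstream of this function.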
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
(Diagram: the AzureMODIS Service web role portal feeds a request queue, download queue, reprojection queue, and two reduction queues; source imagery comes from the download sites, and scientists download the scientific results.)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The MODISAzure Service is the Web Role front door: it receives all user requests and queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks — recoverable units of work — and persists the execution status of all jobs and tasks in Tables
(Diagram: <PipelineStage> requests flow through the MODISAzure Service (Web Role) into <PipelineStage> job queues; the Service Monitor (Worker Role) persists <PipelineStage>JobStatus, parses and persists <PipelineStage>TaskStatus, and dispatches to <PipelineStage> task queues.)
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
(Diagram: Generic Workers (Worker Roles) pull from the <PipelineStage> task queues and read from <Input> data storage.)
Example Pipeline Stage: Reprojection Service
(Diagram: a reprojection request enters the job queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus — each entity specifies a single reprojection job request — then parses and persists ReprojectionTaskStatus — each entity specifies a single reprojection task, i.e., a single tile — and dispatches to the task queue consumed by Generic Workers.)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
• Reprojection data and swath source data are held in storage
Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Approximate per-stage figures (data volume, files, compute, workers — cost):
• Data collection: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers — $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20–100 workers — $420 CPU, $60 download
• Derivation reduction: 5–7 GB, 55K files, 1800 hours, 20–100 workers — $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20–100 workers — $216 CPU, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of Windows Server 2008 and a custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
Key Components: Fabric Controller (continued)
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service: role types, role VM sizes, external and internal endpoints, local storage
• The configuration settings configure a service: instance count, storage keys, application-specific settings
Key Components: Fabric Controller (continued)
• Manages "nodes" and "edges" in the "fabric" (the hardware): power-on automation devices, routers, switches, hardware load balancers, physical servers, virtual servers
• State transitions: current state and goal state; it does what is needed to reach and maintain the goal state
• It's a perfect IT employee: never sleeps, doesn't ever ask for a raise, always does what you tell it to do in the configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end:
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute on Windows Server 2008
• Background processing
• Each role can define an amount of local storage: protected space on the local drive, considered volatile storage
• May communicate with outside services: Azure Storage, SQL Azure, other web services
• Can expose external and internal endpoints
Suggested Application Model: Using Queues for Reliable Messaging
Scalable, Fault-Tolerant Applications
Queues are the application glue:
• Decouple parts of the application to scale them independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
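The glue pattern the slides keep returning to is the work ticket: the web role enqueues a small ticket (a blob name, not the payload itself, since messages are capped at 8 KB), and a worker role dequeues it and fetches the real data from blob storage. A toy sketch with in-process stand-ins (names are illustrative):

```python
import queue

tickets = queue.Queue()                   # stands in for an Azure queue
blob_store = {"job-42": b"big payload"}   # stands in for blob storage

def web_role_submit(blob_name):
    # The ticket carries only a reference — easily under the 8 KB cap.
    tickets.put({"blob": blob_name})

def worker_role_step():
    ticket = tickets.get()
    data = blob_store[ticket["blob"]]     # fetch the actual work item
    return len(data)                      # ... process it ...

web_role_submit("job-42")
print(worker_role_step())  # 11
```

Because the two roles share nothing but the queue, either side can be scaled, restarted, or re-prioritized independently — the decoupling the bullets above describe.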
Key Components – Compute: VM Roles
• Customized role: you own the box
• How it works: download the "Guest OS" to Server 2008 Hyper-V, customize the OS as you need to, upload the differences VHD
• Azure runs your VM role using the base OS plus the differences VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
• Find a hardware home
• Copy and launch your app binaries
• Monitor your app and the hardware
• In case of failure, take action — perhaps even relocate your app
• At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
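As an illustration, a minimal ServiceConfiguration.cscfg might look like the sketch below. The role name, setting names, and account values are hypothetical; the namespace is the one the Visual Studio tooling generates for this file.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- ServiceConfiguration.cscfg: instance count and settings per role -->
<ServiceConfiguration serviceName="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebRole1">
    <Instances count="2" />
    <ConfigurationSettings>
      <!-- hypothetical application-specific setting -->
      <Setting name="DataConnectionString"
               value="DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=..." />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>
```

The matching ServiceDefinition.csdef declares the roles, endpoints, and setting names; the .cscfg supplies the values, so configuration can change without redeploying the package.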
Service Definition
Service Configuration
GUI
Double-click on the role name in the Azure project
Deploying to the cloud
• We can deploy from the portal or from script
• Visual Studio builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certificates for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   • Determine resource requirements
   • Create role images
2. Allocate resources
3. Prepare nodes
   • Place role images on nodes
   • Configure settings
   • Start roles
4. Configure load balancers
5. Maintain service health
   • If a role fails, restart the role based on policy
   • If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage, At Massive Scale
• Blob: massive files, e.g. videos, logs
• Drive: use standard file-system APIs
• Tables: non-relational, but with few scale limits; use SQL Azure for relational data
• Queues: facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob: inserts a new blob, overwrites the existing blob
  • GetBlob: get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs
• Each container has an access level:
  • Private (the default): requires the account key for access
  • Full public read
  • Public read-only
Two Types of Blobs Under the Hood
• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks in any order into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
Blocks
[Diagram: Big.mpg uploaded as blocks out of order (1, 6, 8, 3, 5, 4, 7, 2), then committed in sequence as Big.mpg]
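The upload-blocks-then-commit flow can be sketched in a few lines. This is a simulation of the semantics, not a call into the real REST API: in-memory dicts stand in for the service, and the block-ID scheme (fixed-width, Base64-encoded, as the service requires) is one reasonable choice.

```python
import base64

def split_into_blocks(data: bytes, block_size: int):
    """Split a payload into {block_id: bytes} plus the natural order,
    as a client would before issuing Put Block calls. Block IDs must
    be Base64-encoded and of equal length within a blob."""
    blocks, order = {}, []
    for i in range(0, len(data), block_size):
        block_id = base64.b64encode(f"{i // block_size:08d}".encode()).decode()
        blocks[block_id] = data[i:i + block_size]
        order.append(block_id)
    return blocks, order

def commit_block_list(blocks, block_list):
    """Simulate Put Block List: the committed blob is the concatenation
    of blocks in the order given, which need not be upload order."""
    return b"".join(blocks[bid] for bid in block_list)

payload = b"0123456789" * 5
blocks, order = split_into_blocks(payload, 16)
assert commit_block_list(blocks, order) == payload
```

Because the commit order is independent of upload order, blocks can be uploaded in parallel (as in the Big.mpg diagram) and assembled at the end.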
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table "Movies": Star Wars, Star Trek, Fan Boys
• Table "Customers": Brian H. Prince, Jason Argonaut, Bill Gates
Hierarchy: Account > Table > Entity
Tables store entities. Entity schema can vary in the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
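A sketch of the entity shape, in Python rather than the .NET classes the service ships with. The helper name is hypothetical, and the Timestamp is faked locally (in the real service it is system-maintained); the point is that the three system properties always exist while everything else can vary per entity.

```python
import datetime

def make_entity(partition_key: str, row_key: str, **properties):
    """Build a table-entity-like dict: PartitionKey + RowKey uniquely
    identify the entity; any extra properties may differ from entity
    to entity within the same table."""
    entity = {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        # System-maintained in the real service; faked here.
        "Timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    entity.update(properties)
    return entity

movie = make_entity("Action", "The Bourne Ultimatum", ReleaseDate=2007)
customer = make_entity("1", "Customer", Name="John Smith")  # different schema, same idea
assert movie["PartitionKey"] == "Action"
```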
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
• Partitioning is different for each data type (blobs, entities, queues)
• Every data object has a partition key
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• The partition key controls entity locality
The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
On "Server Busy":
• Use exponential backoff
• The system load balances to meet your traffic needs
• Or a single-partition limit has been reached
Partition Keys In Each Abstraction
• Entities: TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition.

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order-1 | | | $35.12
2 | Customer | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order-3 | | | $10.00

• Blobs: container name + blob name. Every blob and its snapshots are in a single partition.
• Messages: queue name. All messages for a single queue belong to the same partition.
Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync
[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3]
Scalability Targets
Storage account
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second
Single queue/table partition
• Up to 500 transactions per second
Single blob partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
Partitions and Partition Ranges
Example: a Movies table with PartitionKey = Category and RowKey = Title:

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Initially a single server can hold the whole table:
• Server A: Table = Movies, range [Min - Max]
As load grows, the system splits the table across servers along partition-key boundaries:
• Server A: Table = Movies, range [Min - Comedy)
• Server B: Table = Movies, range [Comedy - Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency and speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics, and fewer round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.
Expect Continuation Tokens – Seriously
• Maximum of 1,000 rows in a response
• Token returned at the end of a partition-range boundary
• Maximum of 5 seconds to execute the query
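Because any of the three limits above can truncate a response, query code must loop on the continuation token. A sketch of that loop, with a fake paged source standing in for the table service (the function names are illustrative, not the Storage Client Library API):

```python
def query_all(query_page):
    """Drain a paged query. `query_page` takes a continuation token
    and returns (rows, next_token), with next_token=None on the last
    page -- the shape the table service's 1000-row/5-second limits
    impose on every range query."""
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:
            return rows

# Fake paged source: 2500 rows served 1000 at a time.
DATA = list(range(2500))
def fake_page(token):
    start = token or 0
    nxt = start + 1000 if start + 1000 < len(DATA) else None
    return DATA[start:start + 1000], nxt

assert query_all(fake_page) == DATA
```

Code that forgets this loop silently processes only the first page, which is why the slide says "seriously."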
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select a PartitionKey and RowKey that help scale
• Avoid "append only" patterns: distribute by using a hash etc. as a prefix
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries on "Server Busy"
• The system load balances partitions to meet traffic needs
• Or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
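The work ticket pattern mentioned above can be sketched as follows: the large payload goes to blob storage, and the queue message carries only a small reference, the "ticket", comfortably under the 8 KB limit. In-memory containers stand in for the blob and queue services; the function names are illustrative.

```python
# Stand-ins for the blob and queue services.
blob_store = {}
queue = []

def submit_job(job_id: str, payload: bytes):
    """Producer side: store the real data in a blob, enqueue a ticket."""
    blob_name = f"jobs/{job_id}"
    blob_store[blob_name] = payload            # big payload lives in blob storage
    queue.append({"job": job_id, "blob": blob_name})  # ticket stays tiny (<8 KB)

def worker_step():
    """Consumer side: dequeue a ticket, fetch the payload by reference."""
    ticket = queue.pop(0)
    data = blob_store[ticket["blob"]]
    return ticket["job"], len(data)

submit_job("42", b"x" * 100_000)   # far larger than the 8 KB message limit
assert worker_step() == ("42", 100_000)
```

Remember to garbage-collect orphaned blobs whose tickets were processed or lost, a point the recap slide returns to.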
Queue Terminology

Message Lifecycle
[Diagram: a web role calls PutMessage to add messages (Msg 1 … Msg 4) to a queue; worker roles call GetMessage (with a visibility timeout) to retrieve messages and RemoveMessage to delete them once processed]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
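The GetMessage/DeleteMessage semantics in the REST exchange above can be modeled in a few lines: a dequeued message is not removed, it merely becomes invisible for the visibility timeout, and reappears if the consumer never deletes it. This toy class is a simulation of those semantics, not the real client library.

```python
import time

class VisibilityQueue:
    """Toy model of at-least-once delivery with a visibility timeout."""
    def __init__(self):
        self.messages = []              # each entry: [visible_at, body]

    def put(self, body):
        self.messages.append([0.0, body])

    def get(self, timeout, now=None):
        """Return the first visible message and hide it for `timeout` s."""
        now = time.monotonic() if now is None else now
        for m in self.messages:
            if m[0] <= now:
                m[0] = now + timeout    # invisible until the timeout expires
                return m
        return None

    def delete(self, msg):
        """RemoveMessage: only an explicit delete removes the message."""
        self.messages.remove(msg)

q = VisibilityQueue()
q.put("msg 1")
m = q.get(timeout=30, now=0.0)
assert m[1] == "msg 1"
assert q.get(timeout=30, now=10.0) is None          # hidden while "processing"
assert q.get(timeout=30, now=31.0)[1] == "msg 1"    # consumer crashed: it reappears
```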
Truncated Exponential Back-Off Polling
Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x, up to a cap
• A successful poll resets the interval back to 1
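A minimal sketch of the back-off schedule (the function name and the 60-second cap are illustrative choices; pick values to suit your workload):

```python
def backoff_intervals(polls, base=1.0, cap=60.0):
    """Truncated exponential back-off: every empty poll doubles the
    sleep interval up to `cap`; a successful poll resets it to `base`.
    `polls` is a sequence of booleans (True = a message was found)."""
    interval, out = base, []
    for got_message in polls:
        out.append(interval)
        interval = base if got_message else min(interval * 2, cap)
    return out

# Six empty polls ramp 1, 2, 4, 8, 16, 32, then the cap; a hit resets to 1.
assert backoff_intervals([False] * 6 + [True] + [False]) == [1, 2, 4, 8, 16, 32, 60, 1]
```

This keeps idle workers from hammering the queue (each GetMessage is a billable transaction) while still reacting quickly once traffic resumes.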
Removing Poison Messages
[Diagram: producers P1, P2 enqueue messages; consumers C1, C2 dequeue them]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (continued)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
12
2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed
1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1
2
6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue
30
13
12
13
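Steps 11-13 above are the poison-message guard: once a message's dequeue count exceeds a threshold, delete it instead of retrying forever. A simulated sketch (threshold and structure are illustrative; a real worker might also park the poison message in a blob or table for offline inspection):

```python
MAX_DEQUEUE = 2

def drain(queue, handler):
    """Process a queue; messages whose handler keeps crashing are
    removed once their DequeueCount exceeds MAX_DEQUEUE."""
    removed_poison = []
    while queue:
        msg = queue.pop(0)
        msg["DequeueCount"] += 1            # the service tracks this per dequeue
        if msg["DequeueCount"] > MAX_DEQUEUE:
            removed_poison.append(msg["body"])   # delete, don't retry
            continue
        try:
            handler(msg["body"])
        except Exception:
            queue.append(msg)               # visibility timeout expired: reappears
    return removed_poison

q = [{"body": "good", "DequeueCount": 0},
     {"body": "poison", "DequeueCount": 0}]

def handler(body):
    if body == "poison":
        raise RuntimeError("crash while processing")

assert drain(q, handler) == ["poison"]
```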
Queues Recap
• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
• http://blogs.msdn.com/windowsazurestorage
• http://azurescope.cloudapp.net
Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O completion ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure and poor user experience from not having excess capacity against the cost of idling VMs
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
Pipeline: Uncompressed content → Gzip → Minify JavaScript → Minify CSS → Minify images → Compressed content
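The CPU-for-bytes trade in point 1 is easy to see with the standard library (Python's gzip module here, just to illustrate the ratio; a web role would enable IIS compression rather than call gzip by hand):

```python
import gzip

# Repetitive markup, typical of generated HTML, compresses dramatically.
html = b"<html><body>" + b"<div class='row'>cell</div>" * 500 + b"</body></html>"
compressed = gzip.compress(html)

assert len(compressed) < len(html) // 10     # order-of-magnitude smaller on the wire
assert gzip.decompress(compressed) == html   # lossless round trip
```

Fewer bytes stored and fewer bytes transferred both show up directly on the bill, which is why compression pays twice.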
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications

NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700 to 1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
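The query-segmentation approach above is a plain split/join: partition the input sequences, fan the partitions out as independent tasks, then concatenate per-partition results. A sketch with stand-in functions (the 100-sequences-per-partition default anticipates the micro-benchmark finding later in the deck; `run_partition` is a placeholder for invoking NCBI-BLAST):

```python
def split_queries(sequences, partition_size=100):
    """Split input sequences into fixed-size partitions (the 'map' side)."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def run_partition(partition):
    """Stand-in for a worker role running NCBI-BLAST on one partition."""
    return [f"hit:{seq}" for seq in partition]

def merge(per_partition_results):
    """Join step: concatenate results; no cross-partition reduction needed."""
    return [hit for part in per_partition_results for hit in part]

seqs = [f"seq{i}" for i in range(250)]
parts = split_queries(seqs)
assert len(parts) == 3 and len(parts[0]) == 100
assert len(merge(run_partition(p) for p in parts)) == 250
```

Database segmentation (the mpiBLAST approach) is harder precisely because its join step is not a simple concatenation: hits from database shards must be re-ranked together.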
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task Flow
A simple split/join pattern.
Leverage the multiple cores of one instance:
• The "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for the small, medium, large and extra-large instance sizes
Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: test runs to profile, then set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of an instance failure
[Diagram: a splitting task fans out into many parallel BLAST tasks, which feed a merging task]
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size and instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource
AzureBLAST Architecture
[Diagram: a Web Role (web portal and web service) performs job registration; a Job Management Role (job scheduler plus scaling engine) dispatches tasks to worker roles through a global dispatch queue; Azure Tables hold the job registry and NCBI database metadata; Azure Blob storage holds the BLAST databases, temporary data, etc.; a database-updating role refreshes the NCBI databases]
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID.
The accepted job is stored into the job registry table:
• Fault tolerance: avoid in-memory state
[Diagram: the job portal (web portal and web service) registers jobs; the job scheduler and scaling engine persist them in the job registry]

Demonstration
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time.
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually
[Diagram: instance counts per deployment: 50, 62, 62, 62, 62, 62, 50, 62]
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, the real working instance time should be 6-8 days
  • Look into the log data to analyze what took place
Understanding Azure by Analyzing Logs
A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe datacenter: in total, 34,256 tasks processed.
All 62 compute nodes lost tasks and then came back in groups; this is an update domain at work:
• ~30 mins per group
• ~6 nodes in one group
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
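The Penman-Monteith formula is a straightforward pointwise computation once the inputs are assembled; the data reduction, not the arithmetic, is the hard part. A sketch, with purely illustrative input values (units follow the slide's definitions; λv is given here in J/kg rather than J/g):

```python
def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0, lam_v=2.45e6):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv).
    gamma ~ 66 Pa/K (psychrometric constant); lam_v ~ 2.45e6 J/kg
    (latent heat of vaporization). All inputs per the slide's symbols."""
    numerator = delta * Rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lam_v
    return numerator / denominator

# Illustrative values only, not a validated catchment calculation.
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2,
                     cp=1005.0, dq=1000.0, ga=0.02, gs=0.01)
assert et > 0.0
```

In MODISAzure this one-liner sits at the end of the pipeline: the map stages exist to produce per-pixel values of Rn, δq, ga and gs from imagery, sensor and field data.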
ET Synthesizes Imagery, Sensors, Models and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Diagram: scientists submit requests through the AzureMODIS service web role portal; work flows through the download, reprojection, reduction 1 and reduction 2 queues, pulling source imagery from download sites and source metadata, and producing scientific results for download]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, the recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
[Diagram: a <PipelineStage> request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue]
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: Generic Workers (Worker Roles) pull from the <PipelineStage> task queue, read <input> data storage, and persist <PipelineStage>TaskStatus]
Example Pipeline Stage: Reprojection Service
[Diagram: a reprojection request arrives at the Service Monitor, which persists ReprojectionJobStatus (each job-queue entity specifies a single reprojection job request), then parses and persists ReprojectionTaskStatus entries (each entity specifies a single reprojection task, i.e. a single tile) and dispatches them to the task queue. Generic Workers (Worker Roles) pull tasks, query the SwathGranuleMeta table for geo-metadata (e.g. boundaries) for each swath tile, query the ScanTimeList table for the list of satellite scan times that cover a target tile, read swath source data storage, and write reprojection data storage.]
Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Per-stage figures (via the AzureMODIS service web role portal):
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers ($50 upload, $450 storage)
• Reprojection stage: 400 GB, 45K files, 3,500 hours, 20-100 workers ($420 CPU, $60 download)
• Derivation reduction stage: 5-7 GB, 55K files, 1,800 hours, 20-100 workers ($216 CPU, $1 download, $6 storage)
• Analysis reduction stage: <10 GB, ~1K files, 1,800 hours, 20-100 workers ($216 CPU, $2 download, $9 storage)
Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Key Components: Fabric Controller
• Think of it as an automated IT department
• A "cloud layer" on top of:
  • Windows Server 2008
  • A custom version of Hyper-V called the Windows Azure Hypervisor
• Allows for automated management of virtual machines
• Its job is to provision, deploy, monitor, and maintain applications in data centers
• Applications have a "shape" and a "configuration"
• The configuration definition describes the shape of a service:
  • Role types
  • Role VM sizes
  • External and internal endpoints
  • Local storage
• The configuration settings configure a service:
  • Instance count
  • Storage keys
  • Application-specific settings
Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware):
  • Power-on automation devices
  • Routers, switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions:
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's the perfect IT employee:
  • Never sleeps
  • Never asks for a raise
  • Always does what you tell it to do in the configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end:
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services:
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints
Suggested Application Model: Using Queues for Reliable Messaging
Scalable, Fault-Tolerant Applications
Queues are the application glue:
• Decouple parts of the application, so it's easier to scale them independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role
  • You own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using:
    • The base OS
    • The differences VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find hardware homes
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action – perhaps even relocate your app
• At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
  • (which is better than six months)
Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap…
• Lots of community and MSFT-built tools around the API – easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage, At Massive Scale
• Blob – massive files, e.g. videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – gets the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private
  • The default; requires the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'
  • Each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
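The stage-then-commit flow above can be sketched as a small in-memory model. This is illustrative only (the real service is driven through the Put Block / Put Block List REST operations); the `BlockBlob` class and `block_id` helper are hypothetical names for this sketch.

```python
import base64

class BlockBlob:
    """In-memory sketch of a block blob: stage blocks, then commit a block list."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes; GC'd if never committed
        self.committed = []     # ordered (block_id, bytes) pairs forming the blob

    def put_block(self, block_id, data):
        # Blocks can be uploaded in any order, even in parallel.
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Committing an ordered list of IDs defines the final blob content.
        self.committed = [(bid, self.uncommitted[bid]) for bid in block_ids]
        self.uncommitted.clear()

    def content(self):
        return b"".join(data for _, data in self.committed)

def block_id(n):
    # Block IDs are base64-encoded; within a blob they must be equal length.
    return base64.b64encode(b"block-%06d" % n).decode()

blob = BlockBlob()
chunks = [b"AAAA", b"BBBB", b"CCCC"]
for i, chunk in enumerate(chunks):
    blob.put_block(block_id(i), chunk)
blob.put_block_list([block_id(i) for i in range(len(chunks))])
assert blob.content() == b"AAAABBBBCCCC"
```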
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a one-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
• The drive persists, as a Page Blob, even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
  Table Name: Movies
    Entities: Star Wars, Star Trek, Fan Boys
  Table Name: Customers
    Entities: Brian H. Prince, Jason Argonaut, Bill Gates
Account → Table → Entity
Tables store entities. Entity schema can vary within the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
The partition key is the unit of scale:
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
The system load balances:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
"Server Busy":
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• It may mean single-partition limits have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
(Servers 1, 2, and 3 each hold a copy of partitions P1, P2, …, Pn)
Scalability Targets
Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single queue/table partition:
• Up to 500 transactions per second
Single blob partition:
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
Partitions and Partition Ranges

Server A – Table = Movies [Min – Max]:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

After the partition range splits across servers:

Server A – Table = Movies [Min – Comedy):

PartitionKey (Category) | RowKey (Title)       | Timestamp | ReleaseDate
Action                  | Fast & Furious       | …         | 2009
Action                  | The Bourne Ultimatum | …         | 2007
…                       | …                    | …         | …
Animation               | Open Season 2        | …         | 2009
Animation               | The Ant Bully        | …         | 2006

Server B – Table = Movies [Comedy – Max]:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008
Key Selection: Things to Consider
Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency & speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions:
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
Continuation tokens appear on:
• A maximum of 1000 rows in a response
• The end of a partition range boundary
• A maximum of 5 seconds to execute the query
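The client-side pattern is a simple loop: keep issuing the query, passing back the returned token, until no token comes back. A minimal sketch, with `fake_query` standing in for a REST table query (the name and paging shape are illustrative, not the real API):

```python
def fake_query(rows, token=None, page_size=1000):
    """Stand-in for a table query: returns at most page_size rows plus a token."""
    start = token or 0
    page = rows[start:start + page_size]
    next_token = start + page_size if start + page_size < len(rows) else None
    return page, next_token

def query_all(rows):
    """Loop until the service stops returning a continuation token."""
    results, token = [], None
    while True:
        page, token = fake_query(rows, token)
        results.extend(page)
        if token is None:       # tokens can also appear at partition-range
            break               # boundaries or the 5-second execution limit
    return results

data = list(range(2500))
assert query_all(data) == data  # 3 round trips: 1000 + 1000 + 500 rows
```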
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
• Select a PartitionKey and RowKey that help scale
• Avoid "append only" patterns
  • Distribute by using a hash, etc., as a prefix
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "Server Busy"
  • The system load balances partitions to meet traffic needs
  • Load on a single partition may have exceeded the limits
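The "distribute by using a hash as a prefix" advice can be sketched as follows: prefixing an append-only natural key (e.g. a timestamp) with a short hash bucket spreads writes across partitions instead of hammering the last one. The bucket count and key format here are illustrative choices, not a prescribed scheme:

```python
import hashlib

NUM_BUCKETS = 16  # illustrative; pick based on expected write throughput

def partition_key(natural_key):
    """Prefix an append-only key with a stable hash bucket to spread load."""
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return "%02d_%s" % (bucket, natural_key)

# Sixty consecutive timestamps land in many different partitions,
# instead of all appending to the same "latest" partition.
keys = [partition_key("2010-12-07T10:%02d" % m) for m in range(60)]
buckets = {k.split("_", 1)[0] for k in keys}
assert len(buckets) > 1
```

A range query over a time window then has to fan out across the buckets (one query per bucket), which is the usual trade-off of this pattern.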
WCF Data Services
• Use a new context for each logical operation
  • AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
A Web Role calls PutMessage to add messages (Msg 1 … Msg 4) to the queue. A Worker Role calls GetMessage with a visibility timeout to dequeue a message; while the timeout runs, the message is invisible to other consumers. After processing, the Worker Role calls RemoveMessage to delete it for good.
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll increases the polling interval by 2x (up to a cap); a successful poll resets the interval back to 1.
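The back-off rule above can be sketched in a few lines. The interval values are illustrative; only the doubling-and-reset shape comes from the slide:

```python
MIN_INTERVAL = 1.0   # seconds between polls after a hit (illustrative)
MAX_INTERVAL = 60.0  # truncation cap on the back-off (illustrative)

def next_interval(current, got_message):
    """Double the poll interval on an empty poll; reset on a successful one."""
    if got_message:
        return MIN_INTERVAL
    return min(current * 2, MAX_INTERVAL)

# Simulate three empty polls, a hit, then another empty poll.
interval, history = MIN_INTERVAL, []
for got in [False, False, False, True, False]:
    interval = next_interval(interval, got)
    history.append(interval)
assert history == [2.0, 4.0, 8.0, 1.0, 2.0]
```

In a real worker role the loop would sleep for `interval` seconds between GetMessage calls, trading queue-transaction cost against dispatch latency.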
Removing Poison Messages
Producers (P1, P2) put messages on queue Q; consumers (C1, C2) process them:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1) – msg 1 is removed as a poison message
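The sequence above can be sketched as a small simulation: a message whose consumer keeps crashing reappears after each visibility timeout, and is deleted once its dequeue count exceeds a threshold. The queue model here is a plain list; names and the threshold are illustrative:

```python
MAX_DEQUEUE_COUNT = 2  # threshold from the scenario above (illustrative)

class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0

def process(queue, handler, poison):
    """Drain the queue, quarantining messages that repeatedly fail."""
    while queue:
        msg = queue.pop(0)
        msg.dequeue_count += 1
        if msg.dequeue_count > MAX_DEQUEUE_COUNT:
            poison.append(msg)     # delete / quarantine the poison message
            continue
        try:
            handler(msg)           # success would be followed by DeleteMessage
        except Exception:
            queue.append(msg)      # visibility timeout expires: msg reappears

def crashy_handler(msg):
    raise RuntimeError("consumer crashed on " + msg.body)

q, poison = [Message("msg1")], []
process(q, crashy_handler, poison)
assert len(poison) == 1 and poison[0].dequeue_count == 3
```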
Queues Recap
• No need to deal with failures – make message processing idempotent
• Invisible messages result in out-of-order delivery – do not rely on order
• Enforce a threshold on a message's dequeue count – use the dequeue count to remove poison messages
• Messages > 8 KB – use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Dynamically increase/reduce workers – use the message count to scale
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
• http://blogs.msdn.com/windowsazurestorage
• http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• The fundamental choice – fewer, larger VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • It's pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• A common mistake – splitting code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
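As a minimal sketch of the idea (fan work out over a pool sized to the VM's core count, analogous to using the Task Parallel Library in a worker role), here is a thread-pool version; `handle` is a stand-in for per-item work:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def handle(item):
    """Stand-in for per-item work (an I/O call, an alignment task, ...)."""
    return item * item

items = list(range(100))
# Size the pool to the instance's core count so the whole VM is used.
workers = os.cpu_count() or 4
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(handle, items))

assert results == [i * i for i in items]
```

For CPU-bound Python work a process pool (`ProcessPoolExecutor`) would be the closer analogue; the thread pool shown here matches the I/O-heavy case the slide describes.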
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• It is a trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of having idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
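To make point 1 concrete, here is a small sketch of gzipping a text payload before it goes over the wire; the sample HTML is made up, but the savings on repetitive markup are typical:

```python
import gzip

# A repetitive text payload, typical of generated HTML.
html = b"<html><body>" + b"<p>hello cloud</p>" * 500 + b"</body></html>"

compressed = gzip.compress(html)

# Text-heavy content usually shrinks dramatically, and the round
# trip is lossless, so the browser reconstructs the exact bytes.
assert len(compressed) < len(html) // 10
assert gzip.decompress(compressed) == html
```

In a web role this would be done by enabling dynamic compression in IIS rather than by hand, but the bandwidth arithmetic is the same.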
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model:
  • Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out to many BLAST tasks running in parallel, followed by a merging task.
• Leverage the multi-core capability of one instance
  • Argument "-a" of NCBI-BLAST
  • 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
• Task granularity:
  • Large partitions → load imbalance
  • Small partitions → unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
  • Best practice: use test runs to profile, and set the size to mitigate the overhead
• Value of visibilityTimeout for each BLAST task:
  • Essentially an estimate of the task run time
  • Too small → repeated computation
  • Too large → an unnecessarily long waiting period in case of instance failure
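The split/join pattern above can be sketched as follows. `blast_task` is a stand-in for running NCBI-BLAST over one partition (in a real run, each partition would be one queued task picked up by a worker role); the partition size follows the micro-benchmark finding quoted below:

```python
PARTITION_SIZE = 100  # sequences per task, per the micro-benchmarks

def split(sequences, size=PARTITION_SIZE):
    """Splitting task: cut the input into fixed-size partitions."""
    return [sequences[i:i + size] for i in range(0, len(sequences), size)]

def blast_task(partition):
    """Stand-in for one BLAST task over a single partition."""
    return ["hit:" + seq for seq in partition]

def merge(partial_results):
    """Merging task: concatenate results from all partitions."""
    return [hit for part in partial_results for hit in part]

seqs = ["seq%04d" % i for i in range(250)]
partitions = split(seqs)
assert len(partitions) == 3            # 100 + 100 + 50 sequences
hits = merge(blast_task(p) for p in partitions)
assert len(hits) == len(seqs)
```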
Micro-Benchmarks Inform Design
• Task size vs. performance
  • Benefit of the warm-cache effect
  • 100 sequences per partition is the best choice
• Instance size vs. performance
  • Super-linear speedup with larger-size worker instances
  • Primarily due to the memory capability
• Task size / instance size vs. cost
  • The extra-large instance generated the best and most economical throughput
  • Fully utilizes the resource
AzureBLAST
Architecture: a Web Role (web portal, web service, job registration) and a Job Management Role (job scheduler, scaling engine) coordinate Worker Roles through a global dispatch queue. Azure Tables hold the job registry; Azure Blobs hold the NCBI databases, BLAST databases, temporary data, etc.; a Database Updating Role keeps the reference databases current. Each job follows the task flow: splitting task → BLAST tasks (in parallel) → merging task.
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory state
(The job portal sits alongside the web service and job registration in the Web Role; the job scheduler and scaling engine consume jobs from the job registry.)
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sampling, running on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, real working instance time should be 6–8 days
  • Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise, something is wrong (e.g. the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
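The analysis above amounts to pairing each "Executing the task N" record with its "is done" record; unpaired tasks flag a failure or a node that went away mid-task. A minimal sketch (the log format matches the records shown; the function name is ours):

```python
import re

def unfinished_tasks(log_lines):
    """Return IDs of tasks that started but never logged completion."""
    started, finished = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return started - finished

log = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
assert unfinished_tasks(log) == {"251774"}
```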
Surviving System Upgrades
North Europe Data Center: in total, 34,256 tasks processed
• All 62 compute nodes lost tasks and then came back in a group – this is an update domain
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures
West Europe Datacenter: 30,976 tasks were completed, and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish proverb
Computing Evapotranspiration (ET)
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)
Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.
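The Penman-Monteith form can be written directly as a function. The input values below are made up for illustration only (they are plausible magnitudes, not field data); units follow the definitions above:

```python
GAMMA = 66.0  # psychrometric constant (Pa/K), per the definitions above

def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs, lambda_v, gamma=GAMMA):
    """ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)"""
    numerator = delta * Rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lambda_v
    return numerator / denominator

# Illustrative inputs only: Δ=145 Pa/K, Rn=400 W/m², ρa=1.2 kg/m³,
# cp=1005 J/(kg·K), δq=1000 Pa, ga=0.02 m/s, gs=0.01 m/s, λv=2.45e6 J/kg.
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2, cp=1005.0,
                     dq=1000.0, ga=0.02, gs=0.01, lambda_v=2.45e6)
assert et > 0.0
```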
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
1. Data collection (map) stage
   • Downloads requested input tiles from NASA FTP sites
   • Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
2. Reprojection (map) stage
   • Converts source tile(s) to intermediate-result sinusoidal tiles
   • Simple nearest-neighbor or spline algorithms
3. Derivation reduction stage
   • First stage visible to the scientist
   • Computes ET in our initial use
4. Analysis reduction stage
   • Optional second stage visible to the scientist
   • Enables production of science analysis artifacts such as maps, tables, and virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues request to appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables
<PipelineStage> Request
… <PipelineStage> JobStatus Persist
<PipelineStage> Job Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse & Persist <PipelineStage> TaskStatus
…
Dispatch <PipelineStage> Task Queue
MODISAzure Architectural Big Picture (2/2)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse & Persist <PipelineStage> TaskStatus
GenericWorker (Worker Role)
…
…
Dispatch <PipelineStage> Task Queue
…
<Input> Data Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
Reprojection JobStatus Persist
Parse & Persist Reprojection TaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (i.e. a single tile)
Query this table to get geo-metadata (e.g. boundaries) for each swath tile
Query this table to get the list of satellite scan times that cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and need to run reduction multiple times
• Storage costs driven by data scale and 6-month project duration
• Small with respect to the people costs, even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers
$50 upload, $450 storage
400 GB, 45K files, 3500 hours, 20-100 workers
5-7 GB, 55K files, 1800 hours, 20-100 workers
<10 GB, ~1K files, 1800 hours, 20-100 workers
$420 cpu, $60 download
$216 cpu, $1 download, $6 storage
$216 cpu, $2 download, $9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Provide valuable fault tolerance and scalability abstractions
• Clouds as amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press; Programming Windows Azure, O'Reilly Press; Bing: Channel 9 Windows Azure; Bing: Windows Azure Platform Training Kit – November Update; http://research.microsoft.com/azure; xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Key Components: Fabric Controller
• Manages "nodes" and "edges" in the "fabric" (the hardware)
  • Power-on automation devices
  • Routers / switches
  • Hardware load balancers
  • Physical servers
  • Virtual servers
• State transitions
  • Current state
  • Goal state
  • Does what is needed to reach and maintain the goal state
• It's a perfect IT employee
  • Never sleeps
  • Doesn't ever ask for a raise
  • Always does what you tell it to do in configuration definition and settings
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints
Suggested Application Model: Using queues for reliable messaging
Scalable Fault Tolerant Applications
Queues are the application glue
• Decouple parts of application, easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role
  • You own the box
• How it works:
  • Download "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using
    • Base OS
    • Differences VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy and manage that diagram for you
  • Find hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action. Perhaps even relocate your app
  • At all times the 'diagram' stays whole
Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • Encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
  • (which is better than six months)
Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure
1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If role fails, restart the role based on policy
   2. If node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage At Massive Scale
Blob – massive files, e.g. videos, logs
Drive – use standard file system APIs
Tables – non-relational, but with few scale limits; use SQL Azure for relational data
Queues – facilitate loosely-coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
  • PutBlob: inserts a new blob, overwrites the existing blob
  • GetBlob: get whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private (default; will require the account key to access)
• Full public read
• Public read only
Two Types of Blobs Under the Hood
Block blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
• Size limit: 200 GB per blob
Page blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
• You can upload a file in 'blocks'
  • Each block has an id
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
Blocks
[Diagram: Big.mpg split into blocks 1 6 8 3 5 4 7 2, committed back into Big.mpg]
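The block-blob pattern above (split into identified blocks, commit an ordered block list) can be sketched in a few lines. This is an in-memory stand-in to show the mechanics, not the Azure Storage API; the block size and ID format are invented for the example:

```python
# Illustrative sketch of the block-blob pattern: split a payload into blocks,
# give each a base64 block ID, then "commit" an ordered block list as the blob.
import base64

BLOCK_SIZE = 4  # tiny for illustration; real block blobs use much larger blocks


def split_into_blocks(data: bytes, size: int = BLOCK_SIZE):
    blocks = {}   # block_id -> bytes (blocks may be uploaded in any order)
    order = []    # the order we will commit them in
    for i in range(0, len(data), size):
        block_id = base64.b64encode(f"block-{i:08d}".encode()).decode()
        blocks[block_id] = data[i:i + size]
        order.append(block_id)
    return blocks, order


def commit_block_list(blocks, block_list):
    # The committed block list, not upload order, defines the final blob.
    return b"".join(blocks[bid] for bid in block_list)


blocks, order = split_into_blocks(b"Big.mpg contents here")
blob = commit_block_list(blocks, order)
```

Committing a different block list yields a different blob, which is how inserts, updates and removals of blocks work.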
Pages
• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob
• Call Put Blob to set max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount Page Blob as X:
    • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to drive are made durable to the Page Blob
  • Drive made durable through standard Page Blob replication
  • Drive persists even when not mounted, as a Page Blob
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letter and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account MovieData
Star WarsStar TrekFan Boys
Table Name Movies
Brian H PrinceJason ArgonautBill Gates
Table Name Customers
Account
Table
Entity
Tables store entities Entity schema can vary in the same table
Windows Azure Tables
• Provides structured storage
  • Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
• Different for each data type (blobs, entities, queues)
Every data object has a partition key
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality
Partition key is unit of scale
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for partition to be available on a different server
System load balances
• Use exponential backoff on "Server Busy"
  • Our system load balances to meet your traffic needs
  • Single partition limits have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey
• Entities with same PartitionKey value served from same partition
PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer – John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer – Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00
Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition
Messages – Queue Name
• All messages for a single queue belong to the same partition
Container Name Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg
Queue Message
jobs Message1
jobs Message2
workflow Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
Server 1 | Server 2 | Server 3
P1
P2
Pn
P1
P2
Pn
P1
P2
Pn
Scalability Targets
Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When limit is hit, app will see '503 Server Busy'; applications should implement exponential backoff
PartitionKey(Category)
RowKey(Title)
Timestamp ReleaseDate
Action Fast amp Furious hellip 2009
Action The Bourne Ultimatum hellip 2007
hellip hellip hellip hellip
Animation Open Season 2 hellip 2009
Animation The Ant Bully hellip 2006
PartitionKey(Category)
RowKey(Title)
Timestamp ReleaseDate
Comedy Office Space hellip 1999
hellip hellip hellip hellip
SciFi X-Men Origins Wolverine hellip 2009
hellip hellip hellip hellip
War Defiance hellip 2008
PartitionKey(Category)
RowKey(Title)
Timestamp ReleaseDate
Action Fast amp Furious hellip 2009
Action The Bourne Ultimatum hellip 2007
hellip hellip hellip hellip
Animation Open Season 2 hellip 2009
Animation The Ant Bully hellip 2006
hellip hellip hellip hellip
Comedy Office Space hellip 1999
hellip hellip hellip hellip
SciFi X-Men Origins Wolverine hellip 2009
hellip hellip hellip hellip
War Defiance hellip 2008
Partitions and Partition Ranges
Server BTable = Movies[Comedy - Max]
Server ATable = Movies[Min - Comedy)
Server ATable = Movies
[Min - Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A query may return a continuation token when:
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
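The loop for handling continuation tokens can be sketched generically: keep reissuing the query, passing the token back, until no token is returned. `paged_query` below is an invented stand-in for a table query that returns at most one page of rows, not a real storage call:

```python
# Sketch of the continuation-token loop. paged_query simulates a service that
# returns at most page_size rows plus a token for the remainder.

def paged_query(rows, token=None, page_size=1000):
    start = token or 0
    page = rows[start:start + page_size]
    next_token = start + page_size if start + page_size < len(rows) else None
    return page, next_token


def query_all(rows, page_size=1000):
    results, token = [], None
    while True:
        page, token = paged_query(rows, token, page_size)
        results.extend(page)
        if token is None:   # no continuation token: the result set is complete
            break
    return results
```

Stopping after the first page is the classic bug this slide warns about: a response with fewer than 1000 rows can still carry a token.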
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select PartitionKey and RowKey that help scale
Avoid "append only" patterns → distribute by using a hash etc. as prefix
Always handle continuation tokens → expect continuation tokens for range queries
"OR" predicates are not optimized → execute the queries that form the "OR" predicates as separate queries
Implement back-off strategy for retries → server busy: load balance partitions to meet traffic needs; load on single partition has exceeded the limits
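The "distribute by using a hash as prefix" tip can be sketched as follows. The bucket count and key format are invented for the illustration; the point is only that an append-only key (like a timestamp) gets a stable, spread-out prefix so writes hit many partitions:

```python
# Sketch: prepend a short, stable hash bucket to an append-only key so writes
# spread across partitions instead of hammering the last one.
import hashlib

NUM_BUCKETS = 16  # illustrative; tune to your partition count


def partition_key_for(row_key: str) -> str:
    digest = hashlib.md5(row_key.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{bucket:02d}-{row_key}"
```

Range queries now need one query per bucket (fanned out in parallel), which is the trade-off for avoiding the hot tail partition.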
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if entity is already being tracked
• Point query throws an exception if resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout)RemoveMessage
Msg 2Msg 1
Worker Role
Msg 2
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
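The Get/Delete lifecycle above can be modeled in a few lines: GetMessage hides a message for a visibility timeout instead of deleting it, and only an explicit DeleteMessage with the pop receipt removes it. This is a toy in-memory model to show the semantics, not the Azure queue API:

```python
# Minimal model of the queue message lifecycle: visibility timeout on get,
# explicit delete with pop receipt. Time is passed in explicitly for clarity.
import time
import uuid


class ToyQueue:
    def __init__(self):
        self._messages = []

    def put_message(self, body):
        self._messages.append({"id": str(uuid.uuid4()), "body": body,
                               "visible_at": 0.0, "pop_receipt": None})

    def get_message(self, timeout=30.0, now=None):
        now = time.time() if now is None else now
        for msg in self._messages:
            if msg["visible_at"] <= now:
                msg["visible_at"] = now + timeout      # invisible until timeout
                msg["pop_receipt"] = str(uuid.uuid4())  # fresh receipt per get
                return msg
        return None

    def delete_message(self, msg_id, pop_receipt):
        self._messages = [m for m in self._messages
                          if not (m["id"] == msg_id and m["pop_receipt"] == pop_receipt)]
```

If the worker crashes before deleting, the message simply reappears after the timeout, which is exactly the at-least-once guarantee the slides describe.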
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll increases the interval by 2x; a successful poll resets the interval back to 1.
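The truncated back-off rule is tiny when written out; a minimal sketch (the 1 s minimum and 60 s cap are illustrative choices):

```python
# Truncated exponential back-off polling: double the interval on each empty
# poll, cap it, and reset to the minimum after a successful poll.

MIN_INTERVAL = 1.0   # seconds; reset here on success
MAX_INTERVAL = 60.0  # truncation cap


def next_interval(current, got_message):
    if got_message:
        return MIN_INTERVAL
    return min(current * 2.0, MAX_INTERVAL)
```

A worker would sleep `next_interval(...)` seconds between GetMessage calls, so an idle queue costs few transactions while a busy one is polled every second.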
Removing Poison Messages
Producers (P1, P2) and consumers (C1, C2) on queue Q:
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
Removing Poison Messages
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1
12
11
12
62
C1
C2
Removing Poison Messages
1. Dequeue(Q, 30 s) → msg 1
2. Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
Queues Recap
• No need to deal with failures → make message processing idempotent
• Invisible messages result in out-of-order delivery → do not rely on order
• Enforce threshold on message's dequeue count → use dequeue count to remove poison messages
• Messages > 8 KB → use blob to store message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Dynamically increase/reduce workers → use message count to scale
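The dequeue-count guard from the walkthrough can be sketched as a tiny dispatch function; the threshold matches the slide's "DequeueCount > 2", while the dead-letter list stands in for wherever you park poison messages:

```python
# Sketch of poison-message handling: if a message keeps reappearing, assume it
# is poison, move it aside for offline inspection, and stop retrying it.

MAX_DEQUEUE_COUNT = 2  # threshold from the slide ("DequeueCount > 2")


def handle(message, process, dead_letter):
    if message["dequeue_count"] > MAX_DEQUEUE_COUNT:
        dead_letter.append(message)   # park it instead of crashing workers forever
        return "dead-lettered"
    process(message)                  # must be idempotent: may run more than once
    return "processed"
```

Combined with idempotent processing, this caps the damage a single bad message can do to a worker pool.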
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – split up code into multiple roles, each not using up CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if number of active processes exceeds number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
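The slide's data-parallel vs. task-parallel distinction (made there with the .NET 4 TPL) can be sketched with a stdlib thread pool in Python; the worker function and task lambdas are invented placeholders:

```python
# Data parallelism (same function over many items) next to task parallelism
# (different independent tasks) using one shared thread pool.
from concurrent.futures import ThreadPoolExecutor


def process_tile(tile_id):
    # Stand-in for per-item work (e.g. reprojecting one tile).
    return tile_id * tile_id


with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism: map one function across a collection.
    squares = list(pool.map(process_tile, range(8)))

    # Task parallelism: unrelated tasks submitted side by side.
    download = pool.submit(lambda: "downloaded")
    reproject = pool.submit(lambda: "reprojected")
    results = (download.result(), reproject.result())
```

Either way, the goal is the one the slide states: keep the cores of the VM you are already paying for busy.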
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between risk of failure / poor user experience due to not having excess capacity, and the costs of having idling VMs
Performance vs. Cost
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often leads to savings in other places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
[Chart: uncompressed vs. compressed content – Gzip, minified JavaScript, minified CSS, minified images]
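The compute-for-storage trade-off in point 2 is easy to see with the stdlib; the sample page content here is invented for illustration:

```python
# Spend CPU once on compression to shrink the payload that storage and
# bandwidth are billed on.
import gzip

page = b"<html><body>" + b"repetitive content " * 200 + b"</body></html>"
compressed = gzip.compress(page)
ratio = len(compressed) / len(page)   # fraction of original size
restored = gzip.decompress(compressed)
```

Text with any repetition (HTML, JSON, logs) typically shrinks dramatically, which is why gzipping output content pays off on both the storage and bandwidth lines of the bill.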
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result reduction processing
Large volume data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
bull Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
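The query-segmentation split/join pattern can be sketched generically. The partition size echoes the micro-benchmark finding quoted later (~100 sequences per partition), and `blast_partition` is an invented stand-in for running NCBI-BLAST on one partition via a queued worker task:

```python
# Query-segmentation sketch: partition the input sequences, fan the partitions
# out (here just a loop; in the real system, queued worker tasks), then merge
# the per-partition hit lists.

PARTITION_SIZE = 100  # sequences per partition, per the micro-benchmarks


def split(sequences, size=PARTITION_SIZE):
    return [sequences[i:i + size] for i in range(0, len(sequences), size)]


def blast_partition(partition):
    # Stand-in for running NCBI-BLAST over one partition of sequences.
    return [f"hit:{seq}" for seq in partition]


def merge(results_per_partition):
    merged = []
    for hits in results_per_partition:
        merged.extend(hits)
    return merged


partitions = split([f"seq{i}" for i in range(250)])
all_hits = merge(blast_partition(p) for p in partitions)
```

Choosing the partition size is the real design knob: too large and one slow partition dominates (load imbalance), too small and per-task overhead swamps the useful work.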
AzureBLAST Task-Flow
A simple Split/Join pattern
Leverage multi-core of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large and extra-large instance sizes
Task granularity
• Large partition → load imbalance
• Small partition → unnecessary overheads
  • NCBI-BLAST overhead
  • Data transferring overhead
Best practice: test runs to profile, and set size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long period of waiting in case of instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability
Task size / instance size vs. cost
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Blast databases temporary data etc)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track job's status and logs
Authentication/authorization based on Live ID
The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory states
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• against all NCBI non-redundant proteins: completed in 30 min
• against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• the database is also the input query
• the protein database is large (4.2 GB in size)
• 9,865,668 sequences to be queried in total
• theoretically 100 billion sequence comparisons
Performance estimation
• based on sample runs on one extra-large Azure instance
• would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
• each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• each segment is submitted to one deployment as one job for execution
• each segment consists of smaller partitions
• When load imbalance occurs, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• but based on our estimates, the real working instance time should be 6-8 days
• look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in groups: this is an update domain
• ~30 mins
• ~6 nodes in one group
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed and the job was killed
35 nodes experienced blob-writing failures at the same time
A reasonable guess: the fault domain was at work
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" (Irish proverb)
Computing Evapotranspiration (ET)
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source archives: 5 TB (600K files)
FLUXNET curated sensor dataset: 30 GB (960 files)
FLUXNET curated field dataset: 2 KB (1 file)
NCEP/NCAR: ~100 MB (4K files)
Vegetative clumping: ~5 MB (1 file)
Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• downloads requested input tiles from NASA FTP sites
• includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• converts source tile(s) to intermediate-result sinusoidal tiles
• simple nearest-neighbor or spline algorithms
Derivation reduction stage
• first stage visible to the scientist
• computes ET in our initial use
Analysis reduction stage
• optional second stage visible to the scientist
• enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Diagram: scientists submit requests through the AzureMODIS Service web-role portal; a request queue feeds a download queue (data collection stage, pulling from the source imagery download sites), a reprojection queue (reprojection stage), and reduction 1 and reduction 2 queues (derivation and analysis reduction stages); source metadata is tracked throughout, and scientific results are available for download]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
• receives all user requests
• queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
• parses all job requests into tasks, recoverable units of work
• execution status of all jobs and tasks is persisted in Tables
[Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue]
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue; Generic Worker roles consume the queue and read from <Input> data storage]
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
Example Pipeline Stage: Reprojection Service
[Diagram: a reprojection request flows into the job queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, where each entity specifies a single reprojection job request, parses and persists ReprojectionTaskStatus, where each entity specifies a single reprojection task (i.e., a single tile), and dispatches to the task queue consumed by Generic Worker roles; workers query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile, and the ScanTimeList table to get the list of satellite scan times that cover a target tile, reading from swath source data storage and writing to reprojection data storage]
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
[Diagram: the four-stage pipeline annotated with scale and cost figures]
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Creating a New Project
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end
• cloud web server
• web pages
• web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
• protected space on the local drive, considered volatile storage
• May communicate with outside services
• Azure Storage
• SQL Azure
• other web services
• Can expose external and internal endpoints
Suggested Application Model: Using Queues for Reliable Messaging
Scalable, Fault-Tolerant Applications
Queues are the application glue
• decouple parts of the application so they are easier to scale independently
• resource allocation: different priority queues and backend servers
• mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role
• you own the box
• How it works:
• download the "Guest OS" to Server 2008 Hyper-V
• customize the OS as you need to
• upload the differencing VHD
• Azure runs your VM role using
• the base OS
• the differencing VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you
• Find a hardware home
• Copy and launch your app binaries
• Monitor your app and the hardware
• In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
ServiceDefinition.csdef
ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double-click on the Role Name in the Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
• an encrypted package of your code
• your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
• (which is better than six months)
Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap…
• Lots of community- and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure.
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage At Massive Scale
Blob: massive files, e.g., videos, logs
Drive: use standard file-system APIs
Tables: non-relational, but with few scale limits; use SQL Azure for relational data
Queues: facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
• PutBlob: inserts a new blob, overwrites the existing blob
• GetBlob: get the whole blob or a specific range
• DeleteBlob
• CopyBlob
• SnapshotBlob
• LeaseBlob
• Each blob has an address:
• http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
• http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private: the default; requires the account key for access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
Block blob
• targeted at streaming workloads
• each blob consists of a sequence of blocks
• each block is identified by a Block ID
• size limit: 200 GB per blob
Page blob
• targeted at random read/write workloads
• each blob consists of an array of pages
• each page is identified by its offset from the start of the blob
• size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'
• each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
[Diagram: Big.mpg uploaded as out-of-order blocks 1-8, then committed into the blob in order]
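A minimal sketch of the block-upload bookkeeping: splitting data into blocks, assigning equal-length Base64 block IDs, and building the Put Block List body that commits them in the chosen order. The helper names and the 4 MB block size are illustrative, not part of any SDK:

```python
import base64

def make_block_id(index):
    # Block IDs must be Base64-encoded; keeping them equal length within
    # a blob is required, so we zero-pad the index.
    return base64.b64encode(f"block-{index:06d}".encode()).decode()

def block_list_xml(block_ids):
    # Body of a Put Block List request: commits the named blocks, in order.
    inner = "".join(f"<Latest>{bid}</Latest>" for bid in block_ids)
    return f'<?xml version="1.0" encoding="utf-8"?><BlockList>{inner}</BlockList>'

def split_into_blocks(data, block_size=4 * 1024 * 1024):
    # Chop the payload into fixed-size blocks for parallel upload.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"x" * (9 * 1024 * 1024))   # 9 MB -> 3 blocks
ids = [make_block_id(i) for i in range(len(blocks))]
xml = block_list_xml(ids)
```

Each block would be sent with a Put Block request, and the XML above as the final Put Block List; until that commit, the blocks are invisible and subject to the one-week garbage collection mentioned above.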
Pages
• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• use existing NTFS APIs to access a durable drive
• durability and survival of data on application failover
• enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
• example: mount a Page Blob as X:
• http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• the drive is made durable through standard Page Blob replication
• the drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
[Diagram: an Account (MovieData) contains Tables (Movies, holding Star Wars, Star Trek, and Fan Boys; Customers, holding Brian H. Prince, Jason Argonaut, and Bill Gates), and Tables contain Entities]
Tables store entities. Entity schema can vary within the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
• billions of entities (rows) and TBs of data
• can use thousands of servers as traffic grows
• Highly available and durable
• data is replicated several times
• Familiar and easy-to-use API
• WCF Data Services and OData
• .NET classes and LINQ
• REST, with any platform or language
Is not relational
Cannot:
• create foreign-key relationships between tables
• perform server-side joins between tables
• create custom indexes on the tables
• no server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
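The three mandatory properties can be checked with a few lines of code; `validate_entity` is a hypothetical helper, and the entity here is just a Python dict standing in for a table row:

```python
REQUIRED = {"PartitionKey", "RowKey", "Timestamp"}

def validate_entity(entity):
    """Check that a table entity carries the three mandatory properties."""
    missing = REQUIRED - entity.keys()
    if missing:
        raise ValueError(f"entity missing required properties: {sorted(missing)}")
    return True

movie = {
    "PartitionKey": "Action",              # groups entities into one partition
    "RowKey": "Fast & Furious",            # unique within the partition
    "Timestamp": "2009-04-03T00:00:00Z",   # maintained by the service
    "ReleaseDate": 2009,                   # extra properties can vary per entity
}
ok = validate_entity(movie)
```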
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
Every data object has a partition key
• different for each data type (blobs, entities, queues)
The partition key is the unit of scale
• a partition can be served by a single server
• the system load balances partitions based on traffic pattern
• controls entity locality
The system load balances
• load balancing can take a few minutes to kick in
• it can take a couple of seconds for a partition to become available on a different server
Server busy
• use exponential backoff on "Server Busy"
• the system load balances to meet your traffic needs
• single-partition limits have been reached
Partition Keys In Each Abstraction
Entities: TableName + PartitionKey
• entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs: Container name + Blob name
• every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages: Queue Name
• all messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync
[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3]
Scalability Targets
Storage account
• capacity: up to 100 TB
• transactions: up to a few thousand requests per second
• bandwidth: up to a few hundred megabytes per second
Single queue/table partition
• up to 500 transactions per second
Single blob partition
• throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When the limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff
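A sketch of the truncated exponential backoff called for above; the retry count and delay bounds are illustrative choices, not service-mandated values:

```python
def backoff_delays(max_retries=6, base=0.5, cap=30.0):
    """Truncated exponential backoff schedule for '503 Server Busy' retries:
    double the delay on each attempt, but never exceed the cap."""
    delays = []
    for attempt in range(max_retries):
        delays.append(min(cap, base * (2 ** attempt)))
    return delays

schedule = backoff_delays()   # delays (seconds) to sleep between retries
```

In a real client you would sleep for `schedule[attempt]` after each 503 (often with random jitter added) and give up once the schedule is exhausted.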
Partitions and Partition Ranges
The Movies table is keyed by PartitionKey (Category) and RowKey (Title), with Timestamp and ReleaseDate properties:

PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
Action                  | Fast & Furious            | …         | 2009
Action                  | The Bourne Ultimatum      | …         | 2007
…                       | …                         | …         | …
Animation               | Open Season 2             | …         | 2009
Animation               | The Ant Bully             | …         | 2006
…                       | …                         | …         | …
Comedy                  | Office Space              | …         | 1999
…                       | …                         | …         | …
SciFi                   | X-Men Origins: Wolverine  | …         | 2009
…                       | …                         | …         | …
War                     | Defiance                  | …         | 2008

Initially Server A serves the whole table, Movies [Min - Max]; under load the range splits, e.g., Server A serves Movies [Min - Comedy) and Server B serves Movies [Comedy - Max].
Key Selection: Things to Consider
Scalability
• distribute load as much as possible
• hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency and speed
• avoid frequent large scans
• parallelize queries
• point queries are most efficient
Entity group transactions
• transactions across a single partition
• transaction semantics, and reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A query may return a continuation token:
• maximum of 1000 rows in a response
• at the end of a partition range boundary
• maximum of 5 seconds to execute the query
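The safe pattern is to loop until no token comes back. A minimal sketch, with `fetch_page` standing in for whatever query call returns a (rows, continuation_token) pair:

```python
def query_all(fetch_page):
    """Drain a paged query; fetch_page takes the previous continuation
    token (None to start) and returns (rows, next_token)."""
    rows, token = [], None
    while True:
        page, token = fetch_page(token)
        rows.extend(page)
        if token is None:      # no token means the result set is complete
            break
    return rows

# Fake paged source standing in for a table query: three pages of results.
pages = {None: ([1, 2], "t1"), "t1": ([3], "t2"), "t2": ([4, 5], None)}
all_rows = query_all(lambda tok: pages[tok])
```

Code that stops after the first response silently drops everything past the first 1000 rows, the first partition boundary, or the 5-second limit.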
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select a PartitionKey and RowKey that help scale
• avoid "append only" patterns: distribute by using a hash etc. as a prefix
Always handle continuation tokens
• expect continuation tokens for range queries
"OR" predicates are not optimized
• execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• "server busy" means either the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
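One way to apply the hash-prefix tip: derive the PartitionKey by prefixing the natural, append-only key (here a date) with a hash bucket, so writes spread across partitions instead of hammering the newest one. The bucket count and helper name are illustrative:

```python
import hashlib

def hashed_partition_key(natural_key, buckets=16):
    """Prefix an append-only key with a deterministic hash bucket so
    consecutive keys land in different partitions."""
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}-{natural_key}"

# A month of daily keys now scatters across the bucket prefixes.
keys = {hashed_partition_key(f"2010-12-{d:02d}") for d in range(1, 31)}
```

Range queries then become one query per bucket, which is exactly the "parallelize queries" trade-off from the key-selection slide.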
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• tight coupling leads to brittleness
• this decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
• messages must be serializable as XML
• limited to 8 KB in size
• commonly use the work ticket pattern
• Why not simply use a table?
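The work ticket pattern keeps the queue message small: the message carries only a pointer to a blob holding the real payload. A sketch with hypothetical names (the account, container, and blob shown are made up):

```python
import json
import uuid

def make_work_ticket(container, blob_name):
    """Build a small queue message referencing a blob payload."""
    ticket = {
        "ticket_id": str(uuid.uuid4()),   # lets the worker log/track the unit of work
        "blob_url": f"http://myaccount.blob.core.windows.net/{container}/{blob_name}",
    }
    body = json.dumps(ticket)
    # Queue messages are limited to 8 KB, which a pointer easily fits.
    assert len(body.encode()) <= 8 * 1024
    return body

ticket = json.loads(make_work_ticket("inputs", "partition-0001.fasta"))
```

The worker dequeues the ticket, fetches the blob, does the work, and only then deletes the message, so a crash before completion lets another worker retry.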
Queue Terminology
Message Lifecycle
[Diagram: a Web Role calls PutMessage to add messages (Msg 1-4) to a queue; a Worker Role calls GetMessage (with a visibility timeout) to dequeue a message, processes it, and calls RemoveMessage to delete it]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach:
• each empty poll increases the polling interval by 2x
• a successful poll sets the interval back to 1
[Diagram: consumers C1 and C2 polling a queue at intervals that double after each empty poll]
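The polling rule fits in a few lines; the floor and ceiling values here are illustrative:

```python
def next_poll_interval(current, got_message, floor=1, ceiling=60):
    """Empty poll: double the interval, up to a ceiling.
    Successful poll: reset the interval to the floor."""
    if got_message:
        return floor
    return min(ceiling, current * 2)

intervals, current = [], 1
for got in [False, False, False, True, False]:
    current = next_poll_interval(current, got)
    intervals.append(current)
```

Three empty polls stretch the interval to 8, one message resets it to 1, and the next empty poll doubles it again, so idle queues stop burning transactions while busy queues are drained promptly.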
Removing Poison Messages
[Diagram: producers P1 and P2 enqueue messages 1 and 2 (each with dequeue count 0); consumers C1 and C2 dequeue them: 1. C1: GetMessage(Q, 30 s) → msg 1; 2. C2: GetMessage(Q, 30 s) → msg 2; both messages become invisible for 30 seconds and their dequeue counts become 1]
[Diagram, continued: 3. C2 consumed msg 2; 4. DeleteMessage(Q, msg 2); 5. C1 crashed; 6. msg 1 becomes visible again 30 s after its dequeue; 7. C2: GetMessage(Q, 30 s) → msg 1, whose dequeue count is now 2]
[Diagram, continued: 8. C2 crashed; 9. msg 1 becomes visible again 30 s after its dequeue; 10. C1 restarted; 11. C1: Dequeue(Q, 30 s) → msg 1; 12. its DequeueCount > 2, so 13. Delete(Q, msg 1): the poison message is removed once its dequeue count exceeds the threshold]
Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
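The dequeue-count threshold from the recap, sketched in a few lines; the threshold value and the dead-letter list are illustrative stand-ins for whatever store you route poison messages to:

```python
MAX_DEQUEUE = 3

def handle(message, dead_letter):
    """Route a message that keeps failing to a dead-letter store instead
    of retrying it forever; otherwise let the worker process it."""
    if message["dequeue_count"] > MAX_DEQUEUE:
        dead_letter.append(message)   # park it for offline inspection
        return "dead-lettered"
    return "process"

dead = []
fresh = handle({"id": 1, "dequeue_count": 1}, dead)
poison = handle({"id": 2, "dequeue_count": 4}, dead)
```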
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• may not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• in networking code, correct usage of NT I/O completion ports will let the kernel schedule the precise number of threads
• in .NET 4, use the Task Parallel Library
• data parallelism
• task parallelism
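The same data-parallel idea, sketched with Python's thread pool rather than the Task Parallel Library the slide names; `align` is a made-up stand-in for a per-item unit of work:

```python
from concurrent.futures import ThreadPoolExecutor

def align(sequence):
    # Stand-in for one unit of work (e.g. one BLAST query alignment).
    return sequence.upper()

sequences = ["acgt", "ttag", "gcat", "aacc"]

# Data parallelism: the same operation applied across the input set,
# with the pool keeping all workers (and thus the VM's cores) busy.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(align, sequences))
```

Task parallelism is the complementary shape: distinct operations (download, reproject, reduce) submitted to the pool as separate futures.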
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of having idling VMs
Performance vs. cost
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
(Chart: relative sizes of uncompressed vs. compressed content – gzip, minified JavaScript, minified CSS, minified images.)
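The payoff of point 1 is easy to demonstrate with the standard library. The sample payload below is invented, but repetitive markup of the kind web apps emit compresses dramatically:

```python
import gzip

# A hypothetical response body; HTML/JSON with repeated markup compresses well.
body = ("<div class='item'><span>result</span></div>" * 200).encode("utf-8")

compressed = gzip.compress(body)

print(len(body), len(compressed))           # compressed is a small fraction of the original
assert gzip.decompress(compressed) == body  # lossless: browsers decompress on the fly
```

Less data stored and less data on the wire, at the cost of a little CPU on a VM you are already paying for.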
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, aggregate storage traffic could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern.
• Leverage the multiple cores of one instance
  • The "-a" argument of NCBI-BLAST
  • 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
• Task granularity
  • Large partitions: load imbalance
  • Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
  • Best practice: use test runs to profile, and set the partition size to mitigate the overhead
• Value of visibilityTimeout for each BLAST task
  • Essentially an estimate of the task run time
  • Too small: repeated computation
  • Too large: unnecessarily long waiting period in case of instance failure
(Task flow: splitting task → BLAST tasks in parallel → merging task.)
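The split/join pattern above can be sketched generically. Here the BLAST step is replaced by a stand-in function, and partitions are processed in parallel; partition size is the granularity knob the slide discusses:

```python
from concurrent.futures import ThreadPoolExecutor

def split(sequences, partition_size):
    # Splitting task: break the input into fixed-size partitions.
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    # Stand-in for running NCBI-BLAST over one partition of query sequences.
    return [seq.upper() for seq in partition]

def merge(results):
    # Merging task: concatenate per-partition results in input order.
    return [hit for part in results for hit in part]

sequences = ["acgt", "ttga", "ccat", "gggc", "atat"]
partitions = split(sequences, 2)       # task granularity: 2 sequences per partition

with ThreadPoolExecutor() as pool:     # query partitions in parallel
    results = list(pool.map(blast_task, partitions))

merged = merge(results)
print(merged)
```

In AzureBLAST the pool is replaced by worker role instances pulling partitions from a queue, but the split/process/merge structure is the same.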
Micro-Benchmarks Inform Design
• Task size vs. performance
  • Benefit of the warm-cache effect
  • 100 sequences per partition is the best choice
• Instance size vs. performance
  • Super-linear speedup with larger worker instances
  • Primarily due to the memory capability
• Task size/instance size vs. cost
  • Extra-large instances generated the best and most economical throughput
  • Fully utilize the resource
AzureBLAST
(Architecture diagram: a Web Role hosts the web portal and the web service for job registration; a Job Management Role runs the job scheduler and scaling engine, tracking jobs in a job registry held in Azure Tables; worker instances pull tasks – splitting task → BLAST tasks in parallel → merging task – from a global dispatch queue; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a database-updating role refreshes the databases.)
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance:
• Submit jobs
• Track job status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory state
(Diagram: job portal → job registration web service → job registry table → job scheduler and scaling engine.)
Demonstration
R. palustris as a platform for H2 production
Eric Schadt (Sage); Sam Phattarasukol (Harwood Lab, UW)
• Blasted ~5000 proteins (700K sequences)
  • Against all NCBI non-redundant proteins: completed in 30 min
  • Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• 9,865,668 sequences in total to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 cores
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually
(Map: instance counts per deployment across the four datacenters.)
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, real working instance time should be 6–8 days
  • Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe datacenter: in total, 34,256 tasks processed
• All 62 compute nodes lost tasks and then came back in groups – this is an update domain
  • ~30 mins
  • ~6 nodes in one group
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish proverb
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
• 20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
• Data collection (map) stage
  • Downloads requested input tiles from NASA FTP sites
  • Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
• Reprojection (map) stage
  • Converts source tile(s) to intermediate-result sinusoidal tiles
  • Simple nearest-neighbor or spline algorithms
• Derivation reduction stage
  • First stage visible to scientists
  • Computes ET in our initial use
• Analysis reduction stage
  • Optional second stage visible to scientists
  • Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: scientists submit requests through the AzureMODIS Service web role portal; a request queue feeds a download queue for the data collection stage, which pulls from source imagery download sites; a reprojection queue feeds the reprojection stage; Reduction 1 and Reduction 2 queues feed the derivation and analysis reduction stages; source metadata is kept alongside, and scientific results are available for download.)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction job queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.)
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role.
• Generic Worker (Worker Role)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

(Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue; Generic Worker instances dequeue tasks and read/write <Input>Data storage.)
Example Pipeline Stage: Reprojection Service
(Diagram: a reprojection request enters the job queue; the Service Monitor persists ReprojectionJobStatus – each entity specifies a single reprojection job request – then parses and persists ReprojectionTaskStatus – each entity specifies a single reprojection task, i.e., a single tile – and dispatches to the task queue. Generic Workers query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile, and the ScanTimeList table for the list of satellite scan times that cover a target tile, then read swath source data storage and write reprojection data storage.)
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage workloads and costs (from the pipeline diagram, via the AzureMODIS Service web role portal):
• Data collection: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20–100 workers – $420 CPU, $60 download
• Derivation reduction: 5–7 GB, 55K files, 1800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Windows Azure Compute
Key Components – Compute: Web Roles
Web front end:
• Cloud web server
• Web pages
• Web services
You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints
Suggested Application Model: Using Queues for Reliable Messaging
Scalable, Fault-Tolerant Applications
Queues are the application glue:
• Decouple parts of the application; easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role
  • You own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differencing VHD
  • Azure runs your VM role using:
    • Base OS
    • Differencing VHD
Application Hosting
'Grokking' the Service Model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action – perhaps even relocate your app
  • At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the Cloud
• We can deploy from the portal or from script
• VS builds two files:
  • Encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes
  • (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
  1. Determine resource requirements
  2. Create role images
2. Allocate resources
3. Prepare nodes
  1. Place role images on nodes
  2. Configure settings
  3. Start roles
4. Configure load balancers
5. Maintain service health
  1. If a role fails, restart the role based on policy
  2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage, At Massive Scale
• Blob – massive files, e.g., videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private (default) – requires the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks; each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'
  • Each block has an ID
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
(Diagram: Big.mpg split into blocks 1–8, uploaded out of order, then committed in sequence to form Big.mpg.)
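The stage-then-commit protocol can be mimicked in a few lines: stage blocks under IDs in any order, then commit an ordered block list to define the blob. An in-memory sketch of these semantics, not the real storage API:

```python
import base64

class BlockBlobSketch:
    """In-memory mimic of Put Block / Put Block List semantics."""

    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes; GC'd if never committed
        self.blob = b""

    def put_block(self, block_id, data):
        # Blocks may be uploaded in any order, even in parallel.
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Committing an ordered list of IDs defines the blob's content.
        self.blob = b"".join(self.uncommitted[bid] for bid in block_ids)
        self.uncommitted.clear()

ids = [base64.b64encode(bytes([i])).decode() for i in range(4)]
blob = BlockBlobSketch()
for i in [2, 0, 3, 1]:                  # upload out of order
    blob.put_block(ids[i], f"chunk{i}".encode())
blob.put_block_list(ids)                # commit in the intended order
assert blob.blob == b"chunk0chunk1chunk2chunk3"
```

Because the commit is a separate step, large uploads can be parallelized and retried per block without corrupting the blob.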
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size; then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
    • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists even when not mounted, as a Page Blob
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table: Movies – entities: Star Wars, Star Trek, Fan Boys
• Table: Customers – entities: Brian H. Prince, Jason Argonaut, Bill Gates
Hierarchy: Account → Table → Entity
Tables store entities; entity schema can vary in the same table.
Windows Azure Tables
• Provides structured storage
  • Massively scalable tables: billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is Not Relational
Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
The partition key is the unit of scale:
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
The system load balances:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
On "Server Busy":
• Use exponential backoff
• The system load balances to meet your traffic needs
• Single-partition limits may have been reached
Partition Keys In Each Abstraction
• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

• Messages – Queue name: all messages for a single queue belong to the same partition

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.)
Scalability Targets
Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single queue/table partition:
• Up to 500 transactions per second
Single blob partition:
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
Partitions and Partition Ranges

Initially one server holds the whole table – Server A: Table = Movies [Min – Max]:

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

As traffic grows, the system splits the table across partition ranges.

Server A: Table = Movies [Min – Comedy):
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B: Table = Movies [Comedy – Max]:
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Key Selection: Things to Consider
Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions:
• Transactions across a single partition
• Transaction semantics and reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.
Expect Continuation Tokens – Seriously
A query returns a continuation token:
• At a maximum of 1000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds to execute the query
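A robust client therefore loops until no token comes back. A sketch of the paging loop against a stand-in query function (the real client library surfaces this through query continuations):

```python
def query_page(rows, token=None, page_size=1000):
    """Stand-in for a table query: returns (page, continuation_token)."""
    start = token or 0
    page = rows[start:start + page_size]
    next_token = start + page_size if start + page_size < len(rows) else None
    return page, next_token

def query_all(rows):
    # Always loop on the continuation token; never assume one response.
    results, token = [], None
    while True:
        page, token = query_page(rows, token)
        results.extend(page)
        if token is None:
            return results

entities = [{"RowKey": str(i)} for i in range(2500)]
assert len(query_all(entities)) == 2500   # 3 pages: 1000 + 1000 + 500
```

Code that stops after the first page silently drops data, which is why the slide says "seriously."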
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
• Select PartitionKey and RowKey that help scale
  • Distribute load by using a hash, etc., as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • Server busy: either partitions are being load balanced to meet traffic needs, or the load on a single partition has exceeded the limits
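The "hash as a prefix" tip counters append-only patterns such as monotonically increasing keys (timestamps), which pile all writes onto one partition. A sketch; the bucket count and key layout here are arbitrary choices:

```python
import hashlib

def bucketed_partition_key(natural_key, buckets=16):
    # Prefix the natural key with a stable hash bucket so that
    # monotonically increasing keys spread across many partitions.
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{natural_key}"

# Timestamps that would otherwise all land in one hot partition:
keys = [bucketed_partition_key(f"2010-12-07T10:00:{i:02d}") for i in range(100)]
prefixes = {k.split("_")[0] for k in keys}
print(sorted(prefixes))   # writes now spread across multiple partitions
```

The trade-off is that range queries over the natural key must now fan out across the buckets.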
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
(Diagram: a Web Role calls PutMessage to add messages (Msg 1–4) to the queue; Worker Roles call GetMessage with a timeout, which makes the message invisible to other consumers, and RemoveMessage to delete it once processed.)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1
(Diagram: consumers C1 and C2 polling the queue at growing intervals.)
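The polling rule is simple enough to state as a pure function; the cap value below is an arbitrary choice:

```python
def next_poll_interval(current, got_message, base=1.0, cap=60.0):
    """Truncated exponential back-off for queue polling."""
    if got_message:
        return base                   # success: reset to the base interval
    return min(current * 2.0, cap)    # empty poll: double, up to the cap

interval = 1.0
for _ in range(8):                    # a run of empty polls
    interval = next_poll_interval(interval, got_message=False)
print(interval)                       # truncated at the cap instead of 256.0
interval = next_poll_interval(interval, got_message=True)
print(interval)                       # back to 1.0 after a successful poll
```

This keeps transaction costs down on an idle queue (every GetMessage is billed) while staying responsive when work arrives.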
Removing Poison Messages
(Diagram: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue with a 30 s visibility timeout.)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (2)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages
340
Producers Consumers
P2
P1
12
2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed
1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1
2
6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue
30
13
12
13
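The dequeue-count threshold in the walk-through can be sketched as follows (in-memory stand-ins; in the real service the queue tracks the dequeue count on each GetMessage):

```python
MAX_DEQUEUE_COUNT = 3  # illustrative threshold

class Message:
    """In-memory stand-in for a queue message with its dequeue count."""
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0

def handle(msg, process, poison_store):
    """Consume one delivery of a message; once its dequeue count exceeds the
    threshold, park it as a poison message instead of processing it again."""
    msg.dequeue_count += 1              # the queue service tracks this per GetMessage
    if msg.dequeue_count > MAX_DEQUEUE_COUNT:
        poison_store.append(msg)        # set aside for offline inspection
        return "poisoned"
    process(msg.body)                   # may crash; message then reappears on the queue
    return "done"
```

Without the threshold, a message whose processing always crashes the consumer would circulate forever; with it, the message is diverted after a bounded number of redeliveries.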
Queues Recap
• Make message processing idempotent; no need to deal with failures
• Do not rely on order; invisible messages result in out-of-order delivery
• Enforce a threshold on a message's dequeue count; use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage-collect orphaned blobs
• Dynamically increase/reduce workers; use the message count to scale
Windows Azure Storage Takeaways

Data abstractions to build your applications:
Blobs – files and large objects
Drives – NTFS APIs for migrating applications
Tables – massively scalable structured storage
Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
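The TPL's distinction between data and task parallelism can be sketched with a thread pool; this is a rough Python analogue using `concurrent.futures` rather than .NET, purely for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def data_parallel(func, items, workers=4):
    """Data parallelism: apply the same function to every item in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))

def task_parallel(tasks, workers=4):
    """Task parallelism: run different functions concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(t) for t in tasks]
        return [f.result() for f in futures]
```

In both cases the pool size plays the role the scheduler plays in the TPL: sized to the cores, it keeps the whole VM busy without oversubscribing it.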
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app's profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • The service choice can make a big cost difference, depending on your app's profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs

Bandwidth costs are a huge part of any popular web app's billing profile.
Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places.
Sending fewer things also means your VM has time to do other tasks.
All of these tips have the side benefit of improving your web app's performance and user experience.
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
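Gzipping responses only when the client advertises support can be sketched in a few lines; the function name and header handling here are illustrative, not a specific framework's API:

```python
import gzip

def maybe_gzip(body: bytes, accept_encoding: str):
    """Gzip the response body when the client supports it, trading a little
    CPU for smaller bandwidth and storage bills; otherwise pass it through."""
    if "gzip" in accept_encoding.lower():
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}
```

For typical HTML, JavaScript, and CSS the compressed body is a small fraction of the original, which is exactly the bandwidth saving the slide is after.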
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
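Segmenting the input is simple because FASTA records are independent; a sketch of the split step (partition size and function name are illustrative; the merge step would concatenate the per-partition outputs):

```python
def split_fasta(text: str, seqs_per_partition: int):
    """Split a FASTA file into partitions of N sequences each, so partitions
    can be queried against the database in parallel and results merged later."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))  # close the previous record
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    return ["\n".join(records[i:i + seqs_per_partition])
            for i in range(0, len(records), seqs_per_partition)]
```

Each partition becomes one task on the queue; because queries never touch each other, no coordination is needed until the merge.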
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (ScienceCloud 2010), ACM, 21 June 2010.
AzureBLAST Task-Flow: a simple split/join pattern

Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads
  • NCBI-BLAST startup overhead
  • Data-transfer overhead
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure

(Task flow: a splitting task fans out to parallel BLAST tasks, which feed a merging task.)
Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource
AzureBLAST Architecture

(Architecture diagram: a Web Role exposes the web portal, web service, and job registration; a Job Management Role runs the job scheduler and scaling engine; worker instances pull work from a global dispatch queue; a database updating role keeps the NCBI databases current. Azure Tables hold the job registry, and Azure Blobs hold the BLAST databases, temporary data, etc. Within each job, a splitting task fans out to parallel BLAST tasks whose outputs feed a merging task.)
AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory state

(Diagram: the portal's web service handles job registration; the job scheduler and scaling engine pick jobs up from the job registry.)
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load becomes imbalanced, redistribute it manually

(Diagram: per-deployment instance counts across the four datacenters.)
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place
Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
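Spotting tasks that started but never finished can be automated by pairing the "Executing" and "is done" records; a sketch (the regexes follow the log format shown above, and the function name is illustrative):

```python
import re

EXEC_RE = re.compile(r"Executing the task (\d+)")
DONE_RE = re.compile(r"Execution of task (\d+) is done")

def incomplete_tasks(log_lines):
    """Return task IDs that were started but never logged as done,
    e.g. because the instance failed or was rebooted mid-task."""
    started, finished = set(), set()
    for line in log_lines:
        if m := EXEC_RE.search(line):
            started.add(m.group(1))
        elif m := DONE_RE.search(line):
            finished.add(m.group(1))
    return sorted(started - finished)
```

Run over the full job logs, this is how the lost tasks in the upgrade and storage-failure incidents below would surface.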
Surviving System Upgrades

North Europe Data Center: in total, 34,256 tasks processed.
All 62 compute nodes lost tasks and then came back in groups (~6 nodes per group, ~30 mins apart). This is an update domain at work.

Surviving Storage Failures

West Europe Data Center: 30,976 tasks were completed, and the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb
Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

• Lots of inputs; big data reduction
• Some of the inputs are not so simple
• Estimating resistance/conductivity across a catchment can be tricky
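The Penman-Monteith formula itself is a single expression once the inputs are in hand; a sketch (the default values for γ and λv are illustrative approximations, and the sample inputs in the usage below are invented, not from the MODIS data):

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2260.0):
    """Penman-Monteith ET: (delta*Rn + rho_a*cp*dq*ga) / ((delta + gamma*(1 + ga/gs)) * lambda_v).
    Units follow the slide's variable definitions; gamma ~ 66 Pa/K and
    lambda_v ~ 2260 J/g are rough defaults for illustration."""
    return (delta * r_n + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)
```

The hard part, as the slide notes, is not this arithmetic but producing defensible values of ga and gs across a whole catchment from imagery and sensor data.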
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
• 20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
(Pipeline diagram: scientists submit requests through the AzureMODIS web role portal; the request queue feeds the data collection stage, which pulls source imagery from the download sites; dedicated reprojection, reduction 1, and reduction 2 queues drive the reprojection, derivation reduction, and analysis reduction stages; source metadata is tracked along the way, and science results are available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Persists the execution status of all jobs and tasks in Tables

(Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches onto the <PipelineStage> Task Queue.)
MODISAzure Architectural Big Picture (2/2)

All work is actually done by a GenericWorker (Worker Role):
• Dequeues tasks created by the Service Monitor from the <PipelineStage> Task Queue
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: GenericWorker instances dequeue from the <PipelineStage> Task Queue and read from <Input> data storage.)
Example Pipeline Stage: Reprojection Service

(Diagram: a reprojection request is queued as a job; each Job Queue entity specifies a single reprojection job request. The Service Monitor (Worker Role) persists ReprojectionJobStatus, parses the job into tasks, persists ReprojectionTaskStatus, and dispatches onto the Task Queue, where each entity specifies a single reprojection task, i.e., a single tile. GenericWorker (Worker Role) instances query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile, query the ScanTimeList table for the list of satellite scan times that cover a target tile, and read swath source data from storage to produce the reprojection data.)
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage resources and costs (from the pipeline diagram):
• Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers. Costs: $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20–100 workers. Costs: $420 CPU, $60 download
• Derivation reduction stage: 5–7 GB, 55K files, 1800 hours, 20–100 workers. Costs: $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20–100 workers. Costs: $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Key Components – Compute: Web Roles

Web front end:
• Cloud web server
• Web pages
• Web services

You can create the following types:
• ASP.NET web roles
• ASP.NET MVC 2 web roles
• WCF service web roles
• Worker roles
• CGI-based web roles

Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
  • Protected space on the local drive, considered volatile storage
• May communicate with outside services
  • Azure Storage
  • SQL Azure
  • Other web services
• Can expose external and internal endpoints
Suggested Application Model: Using queues for reliable messaging

Scalable, Fault-Tolerant Applications

Queues are the application glue:
• Decouple parts of the application; easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role
  • You own the box
• How it works:
  • Download the "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
  • Azure runs your VM role using the base OS + the differences VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
  • At all times, the 'diagram' stays whole
Automated Service Management

Provide code + service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  • ServiceDefinition.csdef
  • ServiceConfiguration.cscfg
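A minimal sketch of what the two files contain (the service, role, and setting names here are invented for illustration; the definition file fixes the shape of the service, while the configuration file holds values that can change without redeploying):

```xml
<!-- ServiceDefinition.csdef: roles, endpoints, and the names of settings -->
<ServiceDefinition name="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="WebFrontEnd">
    <InputEndpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </InputEndpoints>
    <ConfigurationSettings>
      <Setting name="DataConnectionString" />
    </ConfigurationSettings>
  </WebRole>
  <WorkerRole name="BackgroundWorker" />
</ServiceDefinition>

<!-- ServiceConfiguration.cscfg: instance counts and setting values -->
<ServiceConfiguration serviceName="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebFrontEnd">
    <Instances count="2" />
    <ConfigurationSettings>
      <Setting name="DataConnectionString" value="UseDevelopmentStorage=true" />
    </ConfigurationSettings>
  </Role>
  <Role name="BackgroundWorker">
    <Instances count="1" />
  </Role>
</ServiceConfiguration>
```

Bumping the `Instances count` in the .cscfg is how a deployed service is scaled out without touching the package.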
GUI
Double-click on the role name in the Azure project.

Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric

The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   • Determine resource requirements
   • Create role images
2. Allocate resources
3. Prepare nodes
   • Place role images on nodes
   • Configure settings
   • Start roles
4. Configure load balancers
5. Maintain service health
   • If a role fails, restart the role based on policy
   • If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced

Durable Storage at Massive Scale
Blob – massive files, e.g., videos, logs
Drive – use standard file-system APIs
Tables – non-relational, but with few scale limits (use SQL Azure for relational data)
Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites an existing blob
  • GetBlob – get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  e.g., http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private – the default; will require the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood

Block blob:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a block ID
• Size limit: 200 GB per blob

Page blob:
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks in any order into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• You can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

(Diagram: big.mpg is uploaded as 8 blocks, possibly out of order, then committed into one blob.)
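The split-then-commit flow can be sketched without any network calls; in the real REST API the per-block upload is Put Block and the commit is Put Block List, while here the names and the 4 MB default are illustrative:

```python
import base64

BLOCK_SIZE = 4 * 1024 * 1024  # e.g. 4 MB per block (illustrative)

def make_block_list(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a payload into blocks, each with a base64 block ID as used by
    Put Block; the resulting ID list is what Put Block List would commit.
    IDs are zero-padded so every ID in the blob has the same length."""
    blocks = []
    for i in range(0, len(data), block_size):
        block_id = base64.b64encode(f"{i // block_size:08d}".encode()).decode()
        blocks.append((block_id, data[i:i + block_size]))
    return blocks

def commit(blocks):
    """Reassemble the blob in the committed order (any order is allowed)."""
    return b"".join(chunk for _, chunk in blocks)
```

Because blocks can be uploaded in parallel and committed in any order, this is also how large uploads are parallelized and resumed.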
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

A Windows Azure Drive is a Page Blob:
• Example: mount a Page Blob as X:\
  http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists even when not mounted, as a Page Blob
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure

An account holds tables, and a table holds entities. For example, account "MovieData" might contain a table "Movies" (entities: Star Wars, Star Trek, Fan Boys) and a table "Customers" (entities: Brian H. Prince, Jason Argonaut, Bill Gates).

Account → Table → Entity

Tables store entities. Entity schema can vary within the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational

Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning

Understanding partitioning is key to understanding performance.

Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
• Controls entity locality

The partition key is the unit of scale:
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern

The system load balances:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server

Server Busy:
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• A single partition's limits have been reached
Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

(Diagram: partitions P1…Pn replicated across Server 1, Server 2, and Server 3.)
Scalability Targets

Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition:
• Up to 500 transactions per second

Single blob partition:
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
Partitions and Partition Ranges

Initially a single server holds the entire table:

Server A: Table = Movies [Min – Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

As traffic grows, the system splits the table across partition ranges on different servers:

Server A: Table = Movies [Min – Comedy)
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B: Table = Movies [Comedy – Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Key Selection: Things to Consider

Scalability
• PartitionKey is critical for scalability
• Distribute load as much as possible
• Hot partitions can be load balanced

Query Efficiency & Speed
• Point queries are most efficient
• Avoid frequent large scans
• Parallelize queries

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
• Maximum of 1,000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
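Because any of the three conditions above can produce a continuation token, query code must always loop until no token is returned. A minimal sketch, where `query_page` is a hypothetical stand-in for one REST round trip against the Table service:

```python
def query_all_entities(query_page):
    """Drain a table query that may return continuation tokens.

    `query_page(token)` stands in for one service call: it returns
    (entities, next_token), where next_token is None when the result
    set is exhausted. The service may hand back a token for any of
    the reasons on the slide: the 1,000-row limit, a partition range
    boundary, or the 5-second execution limit.
    """
    results, token = [], None
    while True:
        entities, token = query_page(token)
        results.extend(entities)
        if token is None:   # no continuation token: we are done
            return results

# Toy paged source: three pages of fake entities keyed by token.
pages = {None: (["e1", "e2"], "t1"), "t1": (["e3"], "t2"), "t2": (["e4"], None)}
print(query_all_entities(lambda t: pages[t]))  # ['e1', 'e2', 'e3', 'e4']
```

The key point is that an empty page with a token is still not the end: only a missing token terminates the loop.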
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select a PartitionKey and RowKey that help scale
• Avoid "append only" patterns: distribute by using a hash etc. as a prefix
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "server busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
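The work ticket pattern mentioned above keeps queue messages small: the message carries only a reference (a "ticket") to the real payload, which lives in blob storage, so the 8 KB message limit never bites. A minimal sketch with in-memory stand-ins for the blob container and the queue (the Azure calls would replace the dict and list operations):

```python
import json
import uuid

blob_store = {}   # stand-in for a blob container
queue = []        # stand-in for an Azure queue

def submit_job(payload: bytes) -> str:
    """Producer: store the payload in a blob, enqueue a small work ticket."""
    blob_name = str(uuid.uuid4())
    blob_store[blob_name] = payload
    ticket = json.dumps({"blob": blob_name})  # well under the 8 KB limit
    queue.append(ticket)
    return blob_name

def process_next() -> bytes:
    """Consumer: dequeue a ticket, fetch the payload it points to."""
    ticket = json.loads(queue.pop(0))
    return blob_store[ticket["blob"]]

submit_job(b"render frame 42")
print(process_next())  # b'render frame 42'
```

In a real deployment the consumer would also delete the message and, eventually, garbage-collect the orphaned blob (see the Queues Recap later in the deck).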
Queue Terminology
Message Lifecycle
[Diagram: a Web Role calls PutMessage to add Msg 1–4 to the queue; Worker Roles call GetMessage (with a visibility timeout) to receive messages and RemoveMessage to delete them once processed]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a back-off polling approach:
• Each empty poll increases the interval by 2x
• A successful poll resets the interval back to 1

[Diagram: consumers C1 and C2 polling the queue with intervals growing from 1 toward 60]
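The back-off rule above fits in a few lines. A sketch, assuming intervals in seconds with an initial value of 1 and a ceiling of 60 (the ceiling is what makes the back-off "truncated"):

```python
def next_poll_interval(current: float, got_message: bool,
                       initial: float = 1.0, ceiling: float = 60.0) -> float:
    """Truncated exponential back-off for queue polling.

    Each empty poll doubles the interval, truncated at `ceiling`;
    a successful poll resets it to `initial`.
    """
    if got_message:
        return initial
    return min(current * 2, ceiling)

# Empty polls: 1 -> 2 -> 4 -> 8 -> 16 -> 32 -> 60 -> 60 (truncated),
# then a successful poll resets the interval to 1.
i = 1.0
for _ in range(7):
    i = next_poll_interval(i, got_message=False)
print(i)                                         # 60.0
print(next_poll_interval(i, got_message=True))   # 1.0
```

The same shape works for retrying "Server Busy" responses from table and blob storage, as recommended earlier in the deck.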
Removing Poison Messages

[Diagram: producers P1, P2 and consumers C1, C2 around a queue holding messages with dequeue counts]

Scenario 1 – normal consumption:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Scenario 2 – consumer crash:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Scenario 3 – poison message removed via dequeue count:
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
Queues Recap
• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
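The poison-message rule in the recap reduces to one guard on the dequeue count. A sketch, where `process` and `dead_letter` are hypothetical callbacks supplied by the application and the threshold of 3 is an illustrative choice:

```python
MAX_DEQUEUE_COUNT = 3  # illustrative threshold; tune per workload

def handle(message, dequeue_count, process, dead_letter):
    """Poison-message guard from the recap above.

    If a message has already been dequeued more than the threshold,
    assume it is poisonous: remove it (here: divert to `dead_letter`)
    instead of processing it again. Otherwise process it; if `process`
    raises, the message simply becomes visible again after its timeout.
    """
    if dequeue_count > MAX_DEQUEUE_COUNT:
        dead_letter(message)   # remove the poison message from rotation
        return "dead-lettered"
    process(message)
    return "processed"

poisoned, done = [], []
print(handle("m1", 1, done.append, poisoned.append))  # processed
print(handle("m2", 4, done.append, poisoned.append))  # dead-lettered
```

Storing dead-lettered messages (for example, in a blob or table) preserves them for later diagnosis instead of silently discarding them.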
Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You are paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
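The .NET 4 advice above is language-agnostic: data parallelism means applying the same operation to every item and letting a pool map work onto workers. A sketch of that shape, here in Python with a thread pool rather than the Task Parallel Library, purely for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def expensive(x: int) -> int:
    """Stand-in for a CPU- or I/O-heavy per-item task."""
    return x * x

items = list(range(10))

# Data parallelism: the same operation applied to every item,
# with the pool deciding how work maps onto worker threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(expensive, items))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Task parallelism is the dual shape: submit heterogeneous functions with `pool.submit(...)` and gather their futures, instead of mapping one function over a collection.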
Finding Good Code Neighbors
• Typically, code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you are scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• Trade-off between the risk of failure or a poor user experience from not having excess capacity, and the cost of idling VMs
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

[Diagram: uncompressed content vs. compressed content – gzip, minify JavaScript, minify CSS, minify images]
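Point 1 is easy to see in miniature: repetitive markup (the common case for generated HTML or JSON) compresses dramatically, and the round trip is lossless. A sketch using Python's standard gzip module:

```python
import gzip

# A repetitive payload, typical of generated HTML or JSON output.
body = b"<li>row</li>" * 1000  # 12,000 bytes

compressed = gzip.compress(body)

# Gzip shrinks repetitive text dramatically; the browser decompresses
# on the fly, so only the small form crosses the wire.
print(len(body), len(compressed))

# Lossless round trip: the client recovers the exact original bytes.
assert gzip.decompress(compressed) == body
```

In a real web role this would be handled by the web server's response compression rather than hand-rolled, but the bandwidth arithmetic is the same.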
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
• Needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage load could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow
A simple split/join pattern

Leverage the multi-core capability of one instance
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
• NCBI-BLAST overhead
• Data-transfer overhead
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long period of waiting in case of instance failure

[Diagram: splitting task → BLAST tasks (in parallel) → merging task]
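The split/join pattern above can be sketched in a few lines. This is an illustrative stand-in, not the AzureBLAST code: the 100-sequences-per-partition default is the figure the micro-benchmarks below found to work best, and the per-partition BLAST step is faked as an identity pass.

```python
def split_sequences(sequences, partition_size=100):
    """Splitting task: divide the input into fixed-size partitions.

    Partition size is the granularity knob from the slide: too large
    risks load imbalance, too small pays per-task overhead twice.
    """
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge_results(partial_results):
    """Merging task: concatenate per-partition hit lists."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged

seqs = [f"seq{i}" for i in range(250)]
parts = split_sequences(seqs)  # 3 partitions: 100 + 100 + 50

# Each partition would be queued for a worker to BLAST in parallel;
# here the per-partition work is simply passed through.
print(len(parts), len(merge_results(parts)))  # 3 250
```

In the real system, each partition becomes a queue message (a work ticket pointing at a blob), and the merging task runs once all partition results have landed.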
Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to memory capacity

Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resources
AzureBLAST Architecture

[Diagram: a Web Role hosts the Web Portal and Web Service (job registration); a Job Management Role contains the Job Scheduler and Scaling Engine, with the Job Registry and NCBI database metadata in Azure Tables; a global dispatch queue feeds the Worker instances; a Database Updating Role refreshes the NCBI databases; Azure Blob storage holds the BLAST databases, temporary data, etc.; each job flows as splitting task → BLAST tasks → merging task]
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state

[Diagram: Job Portal → Web Portal / Web Service → job registration → Job Registry → Job Scheduler → Scaling Engine]
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4,000 cores of instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
• Each segment was submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances occurred, the load was redistributed manually

[Diagram: deployments of 50 and 62 instances spread across the datacenters]
End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs

A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe datacenter: in total, 34,256 tasks processed
• All 62 compute nodes lost tasks and then came back in groups: this is an update domain
• ~30 mins per group
• ~6 nodes in one group
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
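The Penman-Monteith equation itself is a one-liner once the inputs are in hand; the hard part, as the slide says, is estimating the conductivities across a catchment. A direct transcription, with illustrative (not field-measured) inputs; the λv default of 2260 J/g is the usual value for water and is an assumption here:

```python
def penman_monteith(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2260.0):
    """Penman-Monteith ET as written on the slide:

        ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

    Units follow the slide (γ ≈ 66 Pa/K; λv in J/g).
    """
    numerator = delta * Rn + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative inputs only: Δ=145 Pa/K, Rn=400 W/m², ρa=1.2 kg/m³,
# cp=1005 J/(kg·K), δq=800 Pa, ga=0.02 m/s, gs=0.01 m/s.
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2,
                     c_p=1005.0, dq=800.0, g_a=0.02, g_s=0.01)
print(et > 0)  # True
```

In the pipeline that follows, this per-pixel computation is the reduction stage applied across terabytes of satellite tiles.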
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a request queue feeds the data collection stage (pulling from source imagery download sites, with source metadata recorded), a reprojection queue feeds the reprojection stage, and the Reduction 1 and Reduction 2 queues feed the derivation and analysis reduction stages; a download queue delivers scientific results back to the scientists]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door
• Receives all user requests
• Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]
MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; Generic Workers (Worker Role) dequeue tasks and read/write <Input> Data Storage]
Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; Generic Workers (Worker Role) process tasks against Reprojection Data Storage, the ScanTimeList and SwathGranuleMeta tables, and Swath Source Data Storage]

• Each entity in the job table specifies a single reprojection job request
• Each entity in the task table specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Key Components – Compute: Worker Roles
• Utility compute
• Windows Server 2008
• Background processing
• Each role can define an amount of local storage
• Protected space on the local drive, considered volatile storage
• May communicate with outside services
• Azure Storage
• SQL Azure
• Other Web services
• Can expose external and internal endpoints
Suggested Application Model: Using queues for reliable messaging

Scalable, Fault-Tolerant Applications
Queues are the application glue
• Decouple parts of your application, so they are easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role
• You own the box
• How it works:
• Download the "Guest OS" to Server 2008 Hyper-V
• Customize the OS as you need to
• Upload the differences VHD
• Azure runs your VM role using the base OS plus the differences VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
• Find a hardware home
• Copy and launch your app binaries
• Monitor your app and the hardware
• In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
ServiceDefinition.csdef
ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
• An encrypted package of your code
• Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced

Durable Storage at Massive Scale
• Blob – massive files, e.g., videos, logs
• Drive – use standard file-system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
• PutBlob: inserts a new blob, overwrites the existing blob
• GetBlob: get a whole blob or a specific range
• DeleteBlob
• CopyBlob
• SnapshotBlob
• LeaseBlob
• Each blob has an address:
http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
e.g., http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (default): requires the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood

Block blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob

Page blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks in any order into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• You can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Diagram: blocks of Big.mpg uploaded in the order 1, 6, 8, 3, 5, 4, 7, 2, then committed in order as Big.mpg]
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
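The 512-byte alignment rule above is the one clients most often trip over: a page write must start on a 512-byte boundary and cover a whole number of pages. A sketch of the validation a client might perform before issuing a Put Page request (illustrative only, not part of any SDK):

```python
PAGE_SIZE = 512  # page blobs operate on 512-byte boundaries

def is_valid_page_range(start: int, length: int) -> bool:
    """A page write must start on a 512-byte boundary and cover a
    whole number of pages (the alignment rule from the slide)."""
    return start % PAGE_SIZE == 0 and length > 0 and length % PAGE_SIZE == 0

print(is_valid_page_range(0, 512))     # True
print(is_valid_page_range(512, 1024))  # True
print(is_valid_page_range(100, 512))   # False: unaligned start
```

In practice, this means padding application data out to 512-byte multiples before writing, and trimming the padding on read.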
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud

• A Windows Azure Drive is a Page Blob
• Example: mount a Page Blob as X:
• http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table Name: Movies – Star Wars, Star Trek, Fan Boys
• Table Name: Customers – Brian H. Prince, Jason Argonaut, Bill Gates
Hierarchy: Account → Table → Entity
Tables store entities. Entity schema can vary within the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
• Billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available & durable
• Data is replicated several times
• Familiar and easy-to-use API
• WCF Data Services and OData
• .NET classes and LINQ
• REST – with any platform or language

Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
• Partitioning is different for each data type (blobs, entities, queues); every data object has a partition key
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
The partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
"Server Busy"
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order – 1             |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name: every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image          | annarbor/bighouse.jpg
  image          | foxborough/gillette.jpg
  video          | annarbor/bighouse.jpg

Messages – Queue name: all messages for a single queue belong to the same partition

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1
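The locality rule above — everything sharing a partition key is served from the same partition — can be sketched with an illustrative hash. This is an assumption-laden toy, not Azure's placement algorithm (which is dynamic and traffic-driven); it only shows the invariant that one key always maps to one server, so co-partitioned data can be read together and transacted together.

```python
import hashlib

def partition_for(partition_key: str, n_servers: int) -> int:
    """Toy placement: map a partition key to one of n servers.
    Azure's real placement is dynamic; the invariant illustrated here
    is only that the same key always lands on the same server."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_servers

# Entities sharing a PartitionKey are always co-located...
assert partition_for("Movies|Action", 3) == partition_for("Movies|Action", 3)
# ...while distinct keys are free to land anywhere in range.
assert 0 <= partition_for("Movies|Comedy", 3) < 3
```

The design consequence is the one the slides stress: pick partition keys that spread traffic across many values, because a single key can never be served by more than one server.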
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
[Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3]
Scalability Targets
Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
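The retry discipline recommended above can be sketched as a small wrapper. `ServerBusyError` is a hypothetical stand-in for an HTTP 503 from the storage service; the doubling-with-cap schedule and jitter are the standard pattern, not a specific Azure SDK API.

```python
import random
import time

class ServerBusyError(Exception):
    """Stand-in for a '503 Server Busy' response from storage."""

def with_backoff(op, max_attempts=6, base=0.5, cap=30.0):
    """Retry op() with truncated exponential backoff.
    Sleeps ~0.5s, 1s, 2s, ... (jittered, capped) between attempts."""
    for attempt in range(max_attempts - 1):
        try:
            return op()
        except ServerBusyError:
            delay = min(base * (2 ** attempt), cap)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids retry storms
    return op()  # final attempt lets the error propagate to the caller
```

Jitter matters at scale: if every worker retries on the same schedule, the retries themselves arrive as a synchronized burst and keep the partition busy.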
Partitions and Partition Ranges
A Movies table keyed by PartitionKey (Category) and RowKey (Title), with Timestamp and ReleaseDate properties:

  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  …                       | …                        | …         | …
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006
  …                       | …                        | …         | …
  Comedy                  | Office Space             | …         | 1999
  …                       | …                        | …         | …
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  …                       | …                        | …         | …
  War                     | Defiance                 | …         | 2008

Initially the whole table is served from one server:
  Server A: Table = Movies [Min – Max]
As traffic grows, the system splits the key range across servers:
  Server A: Table = Movies [Min – Comedy)
  Server B: Table = Movies [Comedy – Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously
A query returns a continuation token when it hits:
• a maximum of 1,000 rows in a response
• the end of a partition range boundary
• a maximum of 5 seconds to execute the query
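Because any of those three limits can fire, client code must always loop until the server stops handing back a token. A minimal, library-agnostic sketch (here `query_fn` is a hypothetical stand-in for a table-client call that returns a page of rows plus the next token, or `None` when the scan is complete):

```python
def query_all(query_fn, query_filter):
    """Drain a table query that may return continuation tokens.
    query_fn(filter, token) -> (rows, next_token_or_None); stand-in for
    a storage-client call that can stop early at 1,000 rows, at a
    partition boundary, or after 5 seconds."""
    rows, token = query_fn(query_filter, None)
    results = list(rows)
    while token is not None:       # keep asking until the server is done
        rows, token = query_fn(query_filter, token)
        results.extend(rows)
    return results
```

The common bug this guards against: treating the first page as the whole result set, which silently drops rows the moment a range query crosses a partition boundary.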
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select a PartitionKey and RowKey that help scale
• Distribute load by using a hash etc. as a prefix
Avoid "append only" patterns
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries on "Server Busy"
• Load balance partitions to meet traffic needs
• Load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
• Tight coupling leads to brittleness
• Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
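The work ticket pattern referenced above can be sketched as follows. `put_blob` and `put_message` are hypothetical stand-ins for storage-client calls: the queue message carries either the small payload itself or, when the 8 KB limit would be exceeded, just a "ticket" naming the blob that holds the real data.

```python
MESSAGE_LIMIT = 8 * 1024  # queue messages are limited to 8 KB

def enqueue_work(put_blob, put_message, job_id: str, payload: bytes):
    """Work-ticket pattern: large payloads go to blob storage and the
    queue message carries only a reference. put_blob/put_message are
    stand-ins for storage client calls."""
    if len(payload) <= MESSAGE_LIMIT:
        put_message(payload)                # small enough to ride in the message
    else:
        blob_name = "tickets/" + job_id
        put_blob(blob_name, payload)        # park the data in a blob...
        put_message(blob_name.encode())     # ...and enqueue just the ticket
```

The consumer does the reverse — dequeue, fetch the blob if the message is a ticket, process, then delete both; the recap slide's reminder to garbage-collect orphaned blobs applies when a worker dies between the two deletes.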
Queue Terminology
Message Lifecycle
[Diagram: Web Role → PutMessage → Queue (Msg 1, Msg 2, Msg 3, Msg 4) → Worker Roles, which call GetMessage (with a visibility timeout) and then RemoveMessage once processing succeeds]
Put a message:
POST http://myaccount.queue.core.windows.net/myqueue/messages

Get a message — response:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

Delete the message:
DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll increases the interval by 2x; a successful poll sets the interval back to 1.
[Diagram: consumers C1 and C2 polling a queue at widening intervals]
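The polling schedule above reduces transaction charges on an idle queue (every GetMessage is a billed transaction, even when it returns nothing). A minimal sketch of the interval rule — the floor and ceiling values are illustrative assumptions, not prescribed by the platform:

```python
def next_poll_interval(current: float, got_message: bool,
                       floor: float = 1.0, ceiling: float = 60.0) -> float:
    """Truncated exponential back-off polling: each empty poll doubles
    the wait (up to a ceiling); a successful poll resets it to the floor."""
    if got_message:
        return floor                 # traffic is flowing again: poll eagerly
    return min(current * 2.0, ceiling)  # queue looks idle: back off, but cap it
```

A worker loop would sleep `interval` seconds between GetMessage calls and feed each outcome back through this function.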
Removing Poison Messages
[Diagram: producers P1 and P2 feeding a queue; consumers C1 and C2 dequeuing with a 30-second visibility timeout]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
Queues Recap
• Make message processing idempotent – then there is no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use message count to scale – dynamically increase/reduce workers
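The poison-message rule from the walkthrough and the recap can be sketched as a consumer-side guard. `process`, `delete_message`, and `quarantine` are hypothetical callables standing in for the real handler and storage-client calls; the threshold of 2 mirrors the `DequeueCount > 2` check in the diagram.

```python
POISON_THRESHOLD = 2  # matches the 'DequeueCount > 2' check in the walkthrough

def handle(msg: dict, process, delete_message, quarantine):
    """Consume one dequeued message. A message whose dequeue count has
    exceeded the threshold keeps crashing its consumers, so it is
    quarantined and deleted instead of being processed yet again."""
    if msg["dequeue_count"] > POISON_THRESHOLD:
        quarantine(msg)        # e.g. log it or park it in a dead-letter table
        delete_message(msg)    # stop it from reappearing forever
        return
    process(msg)               # idempotent processing, per the recap
    delete_message(msg)        # only delete after successful processing
```

Deleting after processing (not before) is what makes the at-least-once guarantee work: a crash between the two steps just makes the message visible again.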
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance is not limited to one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure and poor user experience from not having excess capacity against the cost of idling VMs
Balance: performance vs. cost
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
[Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content]
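The payoff of gzipping output is easy to demonstrate with the standard library — repetitive markup like HTML compresses dramatically, and the round-trip is lossless, which is why every byte saved here is a byte saved on the bandwidth bill:

```python
import gzip

def compress_response(body: bytes) -> bytes:
    """Gzip an outgoing response body; all modern browsers can
    decompress on the fly (signalled via Content-Encoding: gzip)."""
    return gzip.compress(body)

html = b"<html>" + b"<li>hello azure</li>" * 500 + b"</html>"
packed = compress_response(html)
assert len(packed) < len(html)            # repetitive markup shrinks well
assert gzip.decompress(packed) == html    # lossless round-trip
```

In a real web role this sits behind content negotiation: only send gzipped bytes when the request's Accept-Encoding includes gzip.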
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
• Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
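The query-segmentation approach above — the one AzureBLAST takes — starts with nothing more than chopping the input sequence list into fixed-size partitions that workers can process independently. A minimal sketch (the 100-sequences-per-partition default is the value the deck's micro-benchmarks later identify as the sweet spot):

```python
def split_queries(sequences, per_partition=100):
    """Query segmentation: chop input sequences into fixed-size
    partitions that worker roles can BLAST in parallel. Partition size
    trades load balance (small) against per-task overhead (large)."""
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]
```

Because each partition is searched against the same database, the partial result files can simply be concatenated (or merged by score) in a final join step — the "special result reduction" is only needed when the database itself is segmented.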
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga. "AzureBlast: A Case Study of Developing Science Applications on the Cloud." In Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow
A simple Split/Join pattern: splitting task → BLAST tasks (in parallel) → merging task
Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of instance failure
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST Architecture
[Diagram: a Web Role (web portal and web service) performs job registration into a Job Management Role (job scheduler plus scaling engine), which dispatches tasks onto a global dispatch queue consumed by Worker instances. An Azure Table holds the Job Registry and NCBI database metadata; Azure Blob storage holds the BLAST databases, temporary data, etc.; a Database Updating Role keeps the NCBI databases current.]
Task flow: splitting task → BLAST tasks (in parallel) → merging task
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state
[Diagram: Job Portal (web portal, web service) → job registration → Job Scheduler, Scaling Engine, Job Registry]
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
• Each segment was submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances appeared, the load was redistributed manually
[Diagram: instance counts per deployment – 50, 62, 62, 62, 62, 62, 50, 62]

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
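Detecting the "something is wrong" case above is a simple log-mining exercise: a healthy task produces a matched "Executing"/"is done" pair, so any task ID that starts but never finishes flags a failed or pre-empted instance. A sketch, assuming the log line format shown in the slide:

```python
import re

START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(log_lines):
    """Scan worker logs for tasks that started but never reported
    completion -- the signature of a failed or pre-empted instance."""
    started, finished = set(), set()
    for line in log_lines:
        if m := START.search(line):
            started.add(m.group(1))
        elif m := DONE.search(line):
            finished.add(m.group(1))
    return started - finished
```

Run over the sample lines above, this flags task 251774 (started at 8:22, never completed) while 251895 pairs up and is ignored.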
Surviving System Upgrades
North Europe datacenter: in total, 34,256 tasks processed
All 62 compute nodes lost tasks and then came back in groups — this is an update domain
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, then the job was killed
35 nodes experienced blob-writing failures at the same time
A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" — Irish proverb

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration — evaporation through plant membranes — by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
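The Penman-Monteith formula above translates directly into code; this sketch just encodes the algebra with the slide's symbols, leaving unit handling to the caller (the defaults for λv and γ are illustrative order-of-magnitude values, not calibrated constants).

```python
def penman_monteith(delta, rn, rho_a, cp, dq, ga, gs,
                    lam_v=2.45e6, gamma=66.0):
    """Penman-Monteith (1964):
        ET = (delta*Rn + rho_a*cp*(dq)*ga) / ((delta + gamma*(1 + ga/gs)) * lam_v)
    Symbols follow the slide; consistent units are the caller's
    responsibility. Defaults for lam_v and gamma are illustrative."""
    numerator = delta * rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lam_v
    return numerator / denominator
```

In the MODISAzure pipeline this arithmetic is the easy part; the hard part, as the slide notes, is producing the conductivities ga and gs for every cell of a catchment from imagery, sensor, and model inputs.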
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; Request Queue → Download Queue → Data Collection Stage (pulling from source imagery download sites and source metadata) → Reprojection Queue → Reprojection Stage → Reduction 1 Queue → Derivation Reduction Stage → Reduction 2 Queue → Analysis Reduction Stage → scientific results download]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The MODISAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables
[Diagram: <PipelineStage> Request → MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue → Service Monitor (Worker Role), which parses & persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]

MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: Service Monitor (Worker Role) dispatches to the <PipelineStage> Task Queue → Generic Worker (Worker Role) → <Input> Data Storage]
Example Pipeline Stage: Reprojection Service
[Diagram: a Reprojection Request reaches the Service Monitor (Worker Role), which persists ReprojectionJobStatus, enqueues to the Job Queue, parses & persists ReprojectionTaskStatus, and dispatches to the Task Queue; Generic Workers (Worker Roles) consume tasks, reading Swath Source Data Storage and writing Reprojection Data Storage]
• Each job-status entity specifies a single reprojection job request
• Each task-status entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per stage (requests entering via the AzureMODIS Service Web Role Portal):
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Suggested Application Model: Using queues for reliable messaging
Scalable, Fault-Tolerant Applications
Queues are the application glue
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role
• You own the box
• How it works:
• Download the "Guest OS" to Server 2008 Hyper-V
• Customize the OS as you need to
• Upload the differences VHD
• Azure runs your VM role using the base OS plus the differences VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
• Find a hardware home
• Copy and launch your app binaries
• Monitor your app and the hardware
• In case of failure, take action – perhaps even relocate your app
• At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
ServiceDefinition.csdef
ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
• An encrypted package of your code
• Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)

Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
• Determine resource requirements
• Create role images
2. Allocate resources
3. Prepare nodes
• Place role images on nodes
• Configure settings
• Start roles
4. Configure load balancers
5. Maintain service health
• If a role fails, restart the role based on policy
• If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage, At Massive Scale
• Blob – massive files, e.g., videos, logs
• Drive – use standard file-system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private – default; requires the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
• Block Blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page Blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
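The block/commit flow above can be sketched locally. This is a toy model of the Put Block / Put Block List semantics, not the real storage API; the class and method names are illustrative:

```python
# Sketch of block-blob semantics: upload blocks in any order, then commit
# an ordered block list. Names are illustrative, not the real storage API.
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes; GC'd if never committed
        self.committed = b""

    def put_block(self, block_id: str, data: bytes) -> None:
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids: list) -> None:
        # The final blob is the concatenation of blocks in the order listed,
        # which need not be the order they were uploaded in.
        self.committed = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()

blob = BlockBlob()
blob.put_block("002", b"world")       # uploaded out of order
blob.put_block("001", b"hello ")
blob.put_block_list(["001", "002"])   # the commit decides the order
print(blob.committed)                 # b'hello world'
```

The point of the two-phase design is that a large upload can proceed in parallel, out of order, and with retries, while the commit remains a single atomic operation.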
Blocks
Big.mpg → blocks 1, 6, 8, 3, 5, 4, 7, 2 → committed blob Big.mpg
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 4 MB in size
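The alignment rule above is easy to get wrong; a minimal sketch of a range validator (an illustrative helper, not part of any SDK):

```python
# Sketch: validate a Put Page range against the 512-byte alignment rule
# described above. Purely illustrative helper, not part of any SDK.
PAGE_SIZE = 512

def valid_page_range(start: int, length: int) -> bool:
    """A page-blob write must start and end on 512-byte boundaries."""
    return start % PAGE_SIZE == 0 and length > 0 and length % PAGE_SIZE == 0

print(valid_page_range(0, 512))      # True
print(valid_page_range(1024, 4096))  # True
print(valid_page_range(100, 512))    # False: start is not aligned
```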
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
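Since leasing is REST-only, a sketch of what the acquire request looks like. The header names follow the storage REST docs as I recall them (`x-ms-lease-action`); the account, container, and blob names are made up, and the request is only built, never sent (a real call also needs a signed Authorization header):

```python
# Sketch of acquiring a blob lease over REST. The request is constructed
# but not sent; names are illustrative and the auth header is omitted.
import urllib.request

url = ("http://myaccount.blob.core.windows.net/"
       "mycontainer/myblob?comp=lease")
req = urllib.request.Request(url, method="PUT")
req.add_header("x-ms-lease-action", "acquire")   # acquire|renew|release|break
req.add_header("x-ms-version", "2009-09-19")

print(req.get_method())                      # PUT
print(req.get_header("X-ms-lease-action"))   # acquire
```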
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• Drive made durable through standard Page Blob replication
• Drive persists even when not mounted, as a Page Blob
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
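The FetchAttributes() guidance above amounts to "probe and treat the error as not-found". A hedged sketch of the pattern; the container object here is a stand-in, not the real StorageClient library:

```python
# Sketch of the guidance above: there is no Exists() call, so probe with
# a FetchAttributes-style metadata call and treat the error as 'not found'.
# The container/exception types are stand-ins, not the real client library.
class ContainerNotFound(Exception):
    pass

def container_exists(container) -> bool:
    try:
        container.fetch_attributes()   # cheap metadata round trip
        return True
    except ContainerNotFound:
        return False

class FakeContainer:
    def __init__(self, exists): self._exists = exists
    def fetch_attributes(self):
        if not self._exists:
            raise ContainerNotFound()

print(container_exists(FakeContainer(True)))   # True
print(container_exists(FakeContainer(False)))  # False
```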
Table Structure
Account MovieData
Star Wars / Star Trek / Fan Boys
Table Name Movies
Brian H. Prince / Jason Argonaut / Bill Gates
Table Name Customers
Account
Table
Entity
Tables store entities. Entity schema can vary within the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
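A minimal sketch of such an entity, using a plain dict. The helper and column names are illustrative; in practice the service assigns Timestamp server-side, and the varying extra columns illustrate the "schema can vary" point:

```python
# Sketch: every table entity carries the three required properties above,
# plus any application-defined columns (schema can vary per entity).
from datetime import datetime, timezone

def make_entity(partition_key: str, row_key: str, **columns) -> dict:
    return {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        # Illustrative only: the real service sets Timestamp on the server.
        "Timestamp": datetime.now(timezone.utc).isoformat(),
        **columns,
    }

movie = make_entity("Action", "Fast & Furious", ReleaseDate=2009)
other = make_entity("Action", "The Bourne Ultimatum")  # different columns, same table
print(movie["PartitionKey"], movie["RowKey"])
```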
Windows Azure Queues
• Queues are performance efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
• Different for each data type (blobs, entities, queues)
Every data object has a partition key
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality
Partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
System load balances
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• "Server Busy" also means single-partition limits have been reached
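The "exponential backoff on Server Busy" advice above can be sketched as a retry wrapper. The exception type and operation are stand-ins; a real client would trigger this on an HTTP 503:

```python
# Sketch of retry-with-exponential-backoff: double the delay after each
# 'server busy' failure. ServerBusy stands in for an HTTP 503 response.
import time

class ServerBusy(Exception):
    pass

def with_backoff(op, retries=5, base_delay=0.1, sleep=time.sleep):
    delay = base_delay
    for attempt in range(retries):
        try:
            return op()
        except ServerBusy:
            if attempt == retries - 1:
                raise                  # out of retries: surface the error
            sleep(delay)
            delay *= 2                 # exponential growth between attempts

# Fake operation that is busy twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServerBusy()
    return "ok"

print(with_backoff(flaky, sleep=lambda s: None))  # ok
```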
Partition Keys In Each Abstraction
• Entities with the same PartitionKey value are served from the same partition
Entities – TableName + PartitionKey
PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00
• Every blob and its snapshots are in a single partition
Blobs – Container name + Blob name
• All messages for a single queue belong to the same partition
Messages – Queue Name
Container Name Blob Name
image annarbor/bighouse.jpg
image foxborough/gillette.jpg
video annarbor/bighouse.jpg
Queue Message
jobs Message1
jobs Message2
workflow Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
Server 1 | Server 2 | Server 3 – each holds replicas of partitions P1, P2, …, Pn
Scalability Targets
Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Partitions and Partition Ranges
Server A: Table = Movies [Min – Max]
Server A: Table = Movies [Min – Comedy)
Server B: Table = Movies [Comedy – Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A continuation token is returned on:
• Maximum of 1000 rows in a response
• The end of a partition range boundary
• Maximum of 5 seconds to execute the query
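The token-handling loop the slide insists on can be sketched as follows. The paging function here is a stub standing in for a table query that returns (rows, continuation-token-or-None):

```python
# Sketch of the continuation-token pattern: keep reissuing the query with
# the returned token until the service stops handing one back.
def query_all(query_page):
    results, token = [], None
    while True:
        page, token = query_page(token)   # (rows, next_token or None)
        results.extend(page)
        if token is None:                 # no token -> no more pages
            return results

# Stub service that returns at most 2 rows per call.
DATA = ["r1", "r2", "r3", "r4", "r5"]
def fake_page(token):
    start = token or 0
    page = DATA[start:start + 2]
    nxt = start + 2 if start + 2 < len(DATA) else None
    return page, nxt

print(query_all(fake_page))  # ['r1', 'r2', 'r3', 'r4', 'r5']
```

Code that ignores the token silently sees only the first page, which is why the slide says "seriously".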
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select PartitionKey and RowKey that help scale
Avoid "append only" patterns
• Distribute by using a hash etc. as a prefix
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• Server busy: the system load balances partitions to meet traffic needs, or load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout) / RemoveMessage
Msg 2Msg 1
Worker Role
Msg 2
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
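The Get/Delete lifecycle above hinges on the visibility timeout. A toy in-memory model of the semantics (not the real service; the clock is simulated so the behavior is easy to see):

```python
# Toy model of the message lifecycle: GetMessage hides a message for the
# visibility timeout; if it is not deleted in time (e.g. the worker
# crashed), it becomes visible again for another worker.
class ToyQueue:
    def __init__(self):
        self.messages = {}        # id -> (body, invisible_until)
        self.clock = 0.0          # simulated time in seconds

    def put(self, mid, body):
        self.messages[mid] = (body, 0.0)

    def get(self, timeout=30.0):
        for mid, (body, until) in self.messages.items():
            if until <= self.clock:
                self.messages[mid] = (body, self.clock + timeout)
                return mid, body
        return None               # nothing visible right now

    def delete(self, mid):        # requires the id (cf. pop receipt)
        self.messages.pop(mid, None)

q = ToyQueue()
q.put("m1", "work item")
first = q.get(timeout=30.0)     # a worker takes the message...
print(q.get())                  # None: message is invisible to others
q.clock += 31.0                 # ...worker crashed, timeout elapsed
print(q.get())                  # the message is visible again
```

This is exactly why the slides say a message "can be processed at least once": the redelivery after a crash is the feature, and DeleteMessage is what ends the cycle.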
Truncated Exponential Back Off Polling
Consider a back-off polling approach: each empty poll increases the polling interval by 2x; a successful poll resets the interval back to 1.
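The policy above fits in a few lines; the base and cap values here are illustrative:

```python
# Sketch of truncated exponential back-off polling: every empty poll
# doubles the sleep interval up to a cap; a successful poll resets it.
def next_interval(current, got_message, base=1.0, cap=60.0):
    if got_message:
        return base                    # reset on success
    return min(current * 2, cap)       # double, truncated at the cap

i = 1.0
for _ in range(8):                     # eight empty polls in a row
    i = next_interval(i, got_message=False)
print(i)                                   # capped at 60.0
print(next_interval(i, got_message=True))  # back to 1.0
```

Unlike the retry back-off used for "Server Busy", this loop runs forever; the truncation keeps an idle worker from backing off into uselessness.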
Removing Poison Messages
Producers: P1, P2 – Consumers: C1, C2
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (2)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (3)
1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 sec after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 sec after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
Queues Recap
• Make message processing idempotent – no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use message count to scale – dynamically increase/reduce workers
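The poison-message rule in the recap can be sketched as a receipt-time check. The message shape and threshold are illustrative (the dequeue count itself is supplied by the queue service):

```python
# Sketch of the poison-message rule: check the message's dequeue count on
# receipt and give up once it crosses a threshold, instead of reprocessing
# a message that keeps crashing its consumer. 'msg' is a plain dict here.
MAX_DEQUEUE = 2

def handle(msg, process, dead_letter):
    if msg["dequeue_count"] > MAX_DEQUEUE:
        dead_letter.append(msg)        # park it for later inspection
        return "dropped"
    process(msg)
    return "processed"

dead = []
ok = handle({"body": "fine", "dequeue_count": 1}, lambda m: None, dead)
bad = handle({"body": "poison", "dequeue_count": 3}, lambda m: None, dead)
print(ok, bad, len(dead))  # processed dropped 1
```

This mirrors step 12 in the sequence above (DequeueCount > 2 → Delete): without the check, a poison message circulates between workers forever.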
Windows Azure Storage Takeaways
Data abstractions to build your applications:
Blobs – files and large objects
Drives – NTFS APIs for migrating applications
Tables – massively scalable structured storage
Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
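The deck recommends the .NET Task Parallel Library; as an analogue in a different language, here is a minimal data-parallel sketch using Python's standard-library thread pool (the worker function is a stand-in for real per-item work):

```python
# Analogue of TPL-style data parallelism using Python's stdlib thread pool.
from concurrent.futures import ThreadPoolExecutor

def expensive(x):
    return x * x   # stand-in for per-item work

with ThreadPoolExecutor(max_workers=4) as pool:
    # map distributes the items across the pool and preserves input order.
    results = list(pool.map(expensive, range(8)))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```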
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of having idling VMs
Performance vs. cost
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile.
Sending fewer things over the wire often means getting fewer things from storage.
Saving bandwidth costs often leads to savings in other places.
Sending fewer things means your VM has time to do other tasks.
All of these tips have the side benefit of improving your web app's performance and user experience.
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
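A quick way to see the gzip trade-off from point 1: typical HTML output is highly repetitive, so it compresses well (the markup below is a made-up sample):

```python
# Quick check of the gzip advice: repetitive markup compresses well.
import gzip

page = (b"<html><body>"
        + b"<div class='row'>hello</div>" * 200   # made-up repetitive HTML
        + b"</body></html>")
packed = gzip.compress(page)

# Compression costs CPU (point 2's trade-off) but saves bandwidth.
print(len(page), len(packed), len(packed) < len(page))
```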
Uncompressed Content
Compressed Content
Gzip / Minify JavaScript
Minify CSS / Minify Images
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
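The "segment the input" step above is just fixed-size partitioning of the query sequences. A sketch (the sequence names are made up; the micro-benchmarks later in the deck suggest ~100 sequences per partition):

```python
# Sketch of query segmentation: split input sequences into fixed-size
# partitions that workers can BLAST independently.
def segment(sequences, per_partition=100):
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]

seqs = [f"seq{i}" for i in range(250)]      # made-up sequence IDs
parts = segment(seqs)
print(len(parts), [len(p) for p in parts])  # 3 [100, 100, 50]
```

Each partition becomes one work-ticket on the dispatch queue; the merge step concatenates the per-partition result files.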
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern
Leverage multi-core on one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition → load imbalance
• Small partition → unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
Best practice: do test runs to profile, and set partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
  • Too small: repeated computation
  • Too large: unnecessarily long wait in case of instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
…
Scaling Engine
(BLAST databases, temporary data, etc.)
Job Registry
NCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs
Authentication/authorization based on Live ID
The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory state
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
• A total of 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should be
Otherwise, something is wrong (e.g., a task failed to complete)
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe Data Center; 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in groups – this is an update domain
~30 mins
~6 nodes in one group
Surviving Storage Failures
West Europe Datacenter; 30,976 tasks completed, and the job was killed
35 nodes experienced blob-writing failure at the same time
A reasonable guess: the fault domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dry. (Irish proverb)
Computing Evapotranspiration (ET)
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)
Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and transpiration, or evaporation through plant membranes, by plants.
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source archives: 5 TB (600K files)
FLUXNET curated sensor dataset (30 GB, 960 files)
FLUXNET curated field dataset: 2 KB (1 file)
NCEP/NCAR: ~100 MB (4K files)
Vegetative clumping: ~5 MB (1 file)
Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction Stage / Derivation Reduction Stage / Reprojection Stage
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
<PipelineStage> Request
… <PipelineStage>JobStatus
Persist <PipelineStage> Job Queue
MODISAzure Service (Web Role)
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
…
Dispatch <PipelineStage> Task Queue
MODISAzure Architectural Big Picture (2/2)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
GenericWorker (Worker Role)
…
…
Dispatch <PipelineStage> Task Queue
…
<Input> Data Storage
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
Example Pipeline Stage: Reprojection Service
Reprojection Request…
Service Monitor (Worker Role)
Persist ReprojectionJobStatus
Parse & Persist ReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMeta
Reprojection Data Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (i.e., a single tile)
Query this table to get geo-metadata (e.g., boundaries) for each swath tile
Query this table to get the list of satellite scan times that cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction Stage / Derivation Reduction Stage / Reprojection Stage
400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers
$50 upload, $450 storage
400 GB, 45K files, 3500 hours, 20–100 workers
5–7 GB, 55K files, 1800 hours, 20–100 workers
<10 GB, ~1K files, 1800 hours, 20–100 workers
$420 CPU, $60 download
$216 CPU, $1 download, $6 storage
$216 CPU, $2 download, $9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Scalable Fault Tolerant Applications
Queues are the application glue:
• Decouple parts of the application, so they are easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Key Components – Compute: VM Roles
• Customized role: you own the box
• How it works:
• Download the "Guest OS" to Server 2008 Hyper-V
• Customize the OS as you need to
• Upload the differences VHD
• Azure runs your VM role using the base OS plus the differences VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
• Find a hardware home
• Copy and launch your app binaries
• Monitor your app and the hardware
• In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole
Automated Service Management
Provide code + the service model:
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
ServiceDefinition.csdef
ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from a script
• VS builds two files:
• An encrypted package of your code
• Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)
Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage At Massive Scale
Blob: massive files, e.g. videos, logs
Drive: use standard file system APIs
Tables: non-relational, but with few scale limits; use SQL Azure for relational data
Queues: facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
• PutBlob: inserts a new blob, overwrites the existing blob
• GetBlob: gets the whole blob or a specific range
• DeleteBlob
• CopyBlob
• SnapshotBlob
• LeaseBlob
• Each blob has an address:
• http://&lt;storageaccount&gt;.blob.core.windows.net/&lt;Container&gt;/&lt;BlobName&gt;
• http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private (the default): requires the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
• Block Blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob
• Page Blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
• You can upload a file in "blocks"
• Each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• You can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
Blocks
[Diagram: Big.mpg uploaded as numbered blocks, then committed in order to form the final Big.mpg blob]
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
• Example: mount a Page Blob as X:
• http://&lt;accountname&gt;.blob.core.windows.net/&lt;containername&gt;/&lt;blobname&gt;
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table Name: Movies (entities: Star Wars, Star Trek, Fan Boys)
• Table Name: Customers (entities: Brian H. Prince, Jason Argonaut, Bill Gates)
Hierarchy: Account → Table → Entity
Tables store entities. Entity schema can vary in the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
• Billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available and durable
• Data is replicated several times
• Familiar and easy-to-use API
• WCF Data Services and OData
• .NET classes and LINQ
• REST, with any platform or language
Is not relational. You cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• Use server-side aggregates; there is no server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
The partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
"Server Busy" responses mean either the system is load balancing to meet your traffic needs, or single-partition limits have been reached:
• Use exponential backoff on "Server Busy"
Partition Keys In Each Abstraction
• Entities: TableName + PartitionKey. Entities with the same PartitionKey value are served from the same partition.

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order-1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order-3 | | | $10.00

• Blobs: Container name + Blob name. Every blob and its snapshots are in a single partition.

Container Name | Blob Name
image | annarborbighouse.jpg
image | foxboroughgillette.jpg
video | annarborbighouse.jpg

• Messages: Queue Name. All messages for a single queue belong to the same partition.

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3]
Scalability Targets
Storage account:
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second
Single queue/table partition:
• Up to 500 transactions per second
Single blob partition:
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.
Partitions and Partition Ranges
Example Movies table (PartitionKey = Category, RowKey = Title):

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Initially a single server holds the entire range:
Server A: Table = Movies [Min - Max]

Under load, the system splits the partition range across servers:
Server A: Table = Movies [Min - Comedy)
Server B: Table = Movies [Comedy - Max]
Key Selection: Things to Consider
Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions:
• Transactions across a single partition
• Transaction semantics and fewer round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.
Expect Continuation Tokens – Seriously
A query returns a continuation token when it hits any of:
• The maximum of 1,000 rows in a response
• The end of a partition range boundary
• The maximum of 5 seconds to execute the query
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select a PartitionKey and RowKey that help scale:
• Distribute by using a hash, etc. as a prefix
• Avoid "append only" patterns
Always handle continuation tokens:
• Expect continuation tokens for range queries
"OR" predicates are not optimized:
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries; "server busy" means either:
• Partitions are being load balanced to meet traffic needs, or
• The load on a single partition has exceeded the limits
WCF Data Services:
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• This can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
[Diagram: a Web Role calls PutMessage to add Msg 1-4 to the Queue; Worker Roles call GetMessage (with a visibility timeout) and RemoveMessage]

Sample REST exchange:

POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

&lt;?xml version="1.0" encoding="utf-8"?&gt;
&lt;QueueMessagesList&gt;
  &lt;QueueMessage&gt;
    &lt;MessageId&gt;5974b586-0df3-4e2d-ad0c-18e3892bfca2&lt;/MessageId&gt;
    &lt;InsertionTime&gt;Mon, 22 Sep 2008 23:29:20 GMT&lt;/InsertionTime&gt;
    &lt;ExpirationTime&gt;Mon, 29 Sep 2008 23:29:20 GMT&lt;/ExpirationTime&gt;
    &lt;PopReceipt&gt;YzQ4Yzg1MDIGM0MDFiZDAwYzEw&lt;/PopReceipt&gt;
    &lt;TimeNextVisible&gt;Tue, 23 Sep 2008 05:29:20 GMT&lt;/TimeNextVisible&gt;
    &lt;MessageText&gt;PHRlc3Q+dGdGVzdD4=&lt;/MessageText&gt;
  &lt;/QueueMessage&gt;
&lt;/QueueMessagesList&gt;

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll doubles the polling interval, and a successful poll resets the interval back to 1.
[Diagram: consumers C1 and C2 polling the queue, with intervals growing 1, 2, … up to a cap of 60]
Removing Poison Messages
[Diagram: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue with a 30 s visibility timeout]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Removing Poison Messages (2)
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after the dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Removing Poison Messages (3)
8. C2 crashed
9. msg 1 becomes visible again 30 s after the dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. msg 1's DequeueCount &gt; 2: treat it as a poison message
13. Delete(Q, msg 1)
Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages &gt; 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• There is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile:
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
[Diagram: uncompressed content passes through Gzip, JavaScript minification, CSS minification, and image minification to become compressed content]
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST); needs special result-reduction processing
Large data volumes:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern:
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out BLAST tasks, and a merging task joins their results.
Leverage the multiple cores of one instance:
• The "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes
Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the size to mitigate the overhead
Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of an instance failure
[Diagram: Splitting task → BLAST task, BLAST task, … → Merging task]
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size and instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources
AzureBLAST
[Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine, persisting the job registry in an Azure Table; tasks are dispatched through a global dispatch queue to Worker instances; a Database Updating Role refreshes the NCBI databases; Azure Blobs hold the BLAST databases, temporary data, etc.; each job follows the splitting task → BLAST tasks → merging task flow]
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
Authentication/authorization is based on Live ID.
The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state
[Diagram: the web portal and web service feed job registration; the job scheduler and scaling engine work off the job registry]
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment
Discovering homologs:
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST; each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution; each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually
[Map: deployments of 50-62 instances spread across the datacenters]
End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…
[Map: deployments of 50-62 instances spread across the datacenters]
Understanding Azure by analyzing logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise, something is wrong (e.g. the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe datacenter: in total, 34,256 tasks processed.
All 62 compute nodes lost tasks and then came back in groups: this is an update domain.
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and then the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" (Irish proverb)

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; a big data reduction
• Some of the inputs are not so simple
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
1. Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
2. Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
3. Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
4. Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Diagram: scientists submit requests through the AzureMODIS Service Web Role portal to a request queue; the data collection stage pulls from source imagery download sites via the download queue; the reprojection, derivation reduction, and analysis reduction stages communicate through the reprojection, reduction 1, and reduction 2 queues; source metadata and science results are held in storage for download]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door:
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role:
• Parses all job requests into tasks, the recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables
[Diagram: a &lt;PipelineStage&gt; Request enters the MODISAzure Service (Web Role), which persists &lt;PipelineStage&gt;JobStatus and enqueues to the &lt;PipelineStage&gt; Job Queue; the Service Monitor (Worker Role) parses and persists &lt;PipelineStage&gt;TaskStatus and dispatches to the &lt;PipelineStage&gt; Task Queue]
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role (the GenericWorker):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: the Service Monitor (Worker Role) parses and persists &lt;PipelineStage&gt;TaskStatus; GenericWorker (Worker Role) instances pull from the &lt;PipelineStage&gt; Task Queue and read from &lt;Input&gt; Data Storage]
Example Pipeline Stage: Reprojection Service
[Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches tasks to the Task Queue, which is consumed by GenericWorker (Worker Role) instances reading Reprojection Data Storage and Swath Source Data Storage]
• Each job entity specifies a single reprojection job request
• Each task entity specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage scale and cost (as labeled on the pipeline diagram):
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, &lt;10 workers: $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3,500 hours, 20-100 workers: $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1,800 hours, 20-100 workers: $216 CPU, $1 download, $6 storage
• Analysis reduction stage: &lt;10 GB, ~1K files, 1,800 hours, 20-100 workers: $216 CPU, $2 download, $9 storage
Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Key Components – Compute: VM Roles
• Customized role
  • You own the box
• How it works:
  • Download "Guest OS" to Server 2008 Hyper-V
  • Customize the OS as you need to
  • Upload the differences VHD
• Azure runs your VM role using:
  • Base OS
  • Differences VHD
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you:
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
• At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
• The platform identifies and allocates resources, deploys the service, and manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • Encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certificates for authentication
• Lets you create, delete, change, upgrade, swap…
• Lots of community and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage at Massive Scale
• Blob – massive files, e.g., videos, logs
• Drive – use standard file-system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private (default): requires the account key to access
• Full public read
• Public read-only
Two Types of Blobs Under the Hood
• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
Blocks
(Diagram: blocks 1–8 of Big.mpg uploaded out of order, then committed as Big.mpg)
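The upload-out-of-order, commit-in-order model above can be sketched as follows. This is a minimal in-memory illustration of the block-blob idea, not the real Azure client library: the BlockBlob class, put_block/put_block_list names, and block_id helper are all hypothetical stand-ins.

```python
import base64

class BlockBlob:
    """Hypothetical in-memory model of a block blob (not the Azure SDK)."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes; blocks may arrive in any order
        self.committed = b""

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: the ordered ID list defines the final blob contents.
        self.committed = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()

def block_id(n):
    # Block IDs are opaque strings; Base64 of a fixed-width index is a common choice.
    return base64.b64encode(f"{n:08d}".encode()).decode()

blob = BlockBlob()
data = b"abcdefgh"
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]
for n in (1, 3, 0, 2):                       # uploaded out of order
    blob.put_block(block_id(n), chunks[n])
blob.put_block_list([block_id(n) for n in range(4)])  # committed in order
assert blob.committed == data
```

The commit step is what makes modification cheap: replacing one block and re-committing the list rewrites only that block, not the whole blob.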
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a page blob
  • Example: mount a page blob as X:
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the page blob
• The drive is made durable through standard page blob replication
• The drive persists as a page blob even when not mounted
Windows Azure Drive API
• Create Drive – creates a page blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted page blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letters and page blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (page blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (page blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table name: Movies (entities: Star Wars, Star Trek, Fan Boys)
• Table name: Customers (entities: Brian H. Prince, Jason Argonaut, Bill Gates)
Account → Table → Entity
Tables store entities; entity schema can vary within the same table
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational
Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
The partition key is the unit of scale:
• A partition can be served by a single server
• The system load-balances partitions based on traffic patterns
• Controls entity locality
System load balancing:
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
Server busy:
• Use exponential backoff on "Server Busy"
• Either the system is load-balancing to meet your traffic needs
• Or single-partition limits have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey:
• Entities with the same PartitionKey value are served from the same partition
  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order – 1             |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order – 3             |              |                     | $10.00
Blobs – Container name + Blob name:
• Every blob and its snapshots are in a single partition
  Container Name | Blob Name
  image          | annarborbighouse.jpg
  image          | foxboroughgillette.jpg
  video          | annarborbighouse.jpg
Messages – Queue name:
• All messages for a single queue belong to the same partition
  Queue    | Message
  jobs     | Message1
  jobs     | Message2
  workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync
(Diagram: partitions P1…Pn replicated across Server 1, Server 2, and Server 3)
Scalability Targets
Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single queue/table partition:
• Up to 500 transactions per second
Single blob partition:
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
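The "503 Server Busy plus exponential backoff" advice above can be sketched as a generic retry wrapper. This is an illustration, not the storage client library: ServerBusyError, with_backoff, and flaky_put are all hypothetical names standing in for a 503 response and a storage call.

```python
import random
import time

class ServerBusyError(Exception):
    """Hypothetical stand-in for an HTTP 503 'Server Busy' response."""

def with_backoff(op, retries=6, base=0.5, cap=30.0):
    # Retry `op` on 503, doubling the wait each attempt up to `cap`,
    # with jitter so many clients don't retry in lock-step.
    for attempt in range(retries):
        try:
            return op()
        except ServerBusyError:
            delay = min(cap, base * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))
    raise RuntimeError("server still busy after retries")

# Example: an operation that is throttled twice, then succeeds.
calls = {"n": 0}
def flaky_put():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServerBusyError()
    return "ok"

assert with_backoff(flaky_put, base=0.01) == "ok"
```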
Partitions and Partition Ranges
A Movies table with PartitionKey = Category and RowKey = Title:
  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | …         | 2009
  Action                  | The Bourne Ultimatum     | …         | 2007
  Animation               | Open Season 2            | …         | 2009
  Animation               | The Ant Bully            | …         | 2006
  Comedy                  | Office Space             | …         | 1999
  SciFi                   | X-Men Origins: Wolverine | …         | 2009
  War                     | Defiance                 | …         | 2008
Initially a single server can serve the whole table:
  Server A: Table = Movies [Min – Max]
As traffic grows, the system splits the table by partition-key range:
  Server A: Table = Movies [Min – Comedy)
  Server B: Table = Movies [Comedy – Max]
Key Selection: Things to Consider
Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency & speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions:
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A query may return a continuation token instead of the full result set:
• Maximum of 1,000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
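The safe client pattern is a loop that keeps requesting pages until no token comes back. A minimal sketch, using a hypothetical FakeTable in place of the real table service (the query signature here is illustrative, not the SDK's):

```python
def query_all(table, page_size=1000):
    # Drain a query by always honoring continuation tokens: the service
    # may stop early (1,000-row cap, partition range boundary, or
    # 5-second budget) even when more rows match.
    results, token = [], None
    while True:
        page, token = table.query(token, page_size)
        results.extend(page)
        if token is None:
            return results

class FakeTable:
    """Hypothetical table service paging through matching rows."""
    def __init__(self, rows):
        self.rows = rows
    def query(self, token, page_size):
        start = token or 0
        end = min(start + page_size, len(self.rows))
        next_token = end if end < len(self.rows) else None
        return self.rows[start:end], next_token

t = FakeTable(list(range(2500)))        # 2,500 matches -> 3 pages
assert len(query_all(t)) == 2500
```

Forgetting this loop is the classic bug: the code works in testing (small result sets) and silently truncates results in production.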
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select a PartitionKey and RowKey that help scale:
• Distribute by using a hash, etc., as a prefix
• Avoid "append only" patterns
Always handle continuation tokens:
• Expect continuation tokens for range queries
"OR" predicates are not optimized:
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries on server busy:
• Either the system is load-balancing partitions to meet traffic needs
• Or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?
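The work-ticket pattern mentioned above keeps the queue message small by storing the real payload in blob storage and enqueuing only a reference. A minimal in-memory sketch (the blobs dict and ticket names are hypothetical stand-ins for blob storage and an Azure queue):

```python
import queue
import uuid

blobs = {}            # stand-in for blob storage
work = queue.Queue()  # stand-in for an Azure queue

def submit(payload: bytes):
    """Producer: big data goes to a blob; a tiny ticket goes on the queue."""
    ticket = str(uuid.uuid4())
    blobs[ticket] = payload      # payload can exceed the 8 KB message cap
    work.put(ticket)             # the message is just the reference
    return ticket

def worker():
    """Consumer: dereference the ticket, do the work, clean up."""
    ticket = work.get()
    payload = blobs[ticket]
    result = payload.upper()     # placeholder for the real processing
    del blobs[ticket]            # garbage-collect the blob once consumed
    return result

submit(b"render frame 42")
assert worker() == b"RENDER FRAME 42"
```

This also answers "why not a table": queues add visibility timeouts and at-least-once dispatch semantics that a table would force you to build yourself.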
Queue Terminology
Message Lifecycle
(Diagram: a Web Role calls PutMessage to enqueue Msg 1–4; a Worker Role calls GetMessage with a visibility timeout, processes the message, then calls RemoveMessage)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach:
• Each empty poll doubles the polling interval, up to a ceiling
• A successful poll sets the interval back to 1
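The truncated back-off rule above fits in a few lines. The function name and the 1 s / 60 s floor and ceiling are illustrative choices, not prescribed values:

```python
def next_interval(current, got_message, floor=1.0, ceiling=60.0):
    """Empty poll: double the interval (truncated at `ceiling`).
    Successful poll: snap back to the minimum interval."""
    if got_message:
        return floor
    return min(ceiling, current * 2)

i = 1.0
for _ in range(8):                    # eight empty polls in a row
    i = next_interval(i, False)       # 1 -> 2 -> 4 -> ... -> 60 (truncated)
assert i == 60.0
assert next_interval(i, True) == 1.0  # success resets to the floor
```

The truncation matters: without the ceiling, an idle queue would push the interval so high that new work sits unnoticed; without the reset, a busy queue would be polled too slowly.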
Removing Poison Messages
(Diagrams: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue with a 30 s visibility timeout)
Scenario 1 – normal processing:
1. GetMessage(Q, 30 s) → msg 1 (C1)
2. GetMessage(Q, 30 s) → msg 2 (C2)
Scenario 2 – a consumer crashes:
1. GetMessage(Q, 30 s) → msg 1 (C1)
2. GetMessage(Q, 30 s) → msg 2 (C2)
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1 (C2)
Scenario 3 – a poison message keeps crashing consumers:
1. Dequeue(Q, 30 s) → msg 1 (C1)
2. Dequeue(Q, 30 s) → msg 2 (C2)
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. Dequeue(Q, 30 s) → msg 1 (C2)
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. Dequeue(Q, 30 s) → msg 1 (C1)
12. DequeueCount > 2
13. Delete(Q, msg 1)
Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages can arrive out of order: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
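The dequeue-count threshold from the scenarios above can be sketched as a consumer loop. Everything here is a hypothetical in-memory model (Msg, FakeQueue, pump), not the storage client; the real service tracks DequeueCount for you.

```python
class Msg:
    def __init__(self, body):
        self.body, self.dequeue_count = body, 0

class FakeQueue:
    """Hypothetical in-memory queue that counts dequeues per message."""
    def __init__(self, bodies):
        self.msgs = [Msg(b) for b in bodies]
    def get_message(self, visibility_timeout=30):
        if not self.msgs:
            return None
        m = self.msgs[0]
        m.dequeue_count += 1
        return m
    def delete_message(self, msg):
        self.msgs.remove(msg)

MAX_DEQUEUE = 3

def pump(q, handler):
    """One iteration of a worker: drop poison messages past the threshold."""
    msg = q.get_message()
    if msg is None:
        return False
    if msg.dequeue_count > MAX_DEQUEUE:
        q.delete_message(msg)          # poison: remove (or quarantine/log)
        return True
    try:
        handler(msg)                   # handler must be idempotent
        q.delete_message(msg)
    except Exception:
        pass                           # message becomes visible again later
    return True

q = FakeQueue(["bad"])
def always_fails(m):
    raise RuntimeError("crash")
while pump(q, always_fails):
    pass
assert q.msgs == []                    # the poison message was removed
```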
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up the CPU vs. keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from lacking excess capacity, and the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
(Diagram: uncompressed content → Gzip; minify JavaScript, CSS, and images → compressed content)
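The compute-for-bandwidth tradeoff in steps 1–2 is easy to see with the standard gzip library; repetitive markup (the usual case for generated HTML) compresses dramatically:

```python
import gzip

# Highly repetitive HTML, typical of templated pages.
html = b"<html><body>" + b"<p>hello azure</p>" * 500 + b"</body></html>"

compressed = gzip.compress(html)

assert len(compressed) < len(html) // 10       # big reduction on repetitive text
assert gzip.decompress(compressed) == html     # lossless round trip
```

You pay a little CPU on every response and save on every byte stored and transferred, which is the direction the billing model rewards.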
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large-volume data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model:
  • Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow
A simple split/join pattern
Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes
Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of the visibility timeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
(Diagram: splitting task → BLAST tasks run in parallel → merging task)
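The split/join flow above can be sketched in a few lines. This is an illustration of the pattern only: blast_task is a stand-in for running NCBI-BLAST on one partition, and the 100-sequences-per-partition choice mirrors the micro-benchmark result reported on the next slide.

```python
from concurrent.futures import ThreadPoolExecutor

def split(sequences, per_partition=100):
    """Query segmentation: cut the input into fixed-size partitions."""
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]

def blast_task(partition):
    """Stand-in for one BLAST run over one input partition."""
    return [f"hit:{seq}" for seq in partition]

def merge(results):
    """Join: concatenate per-partition results in order."""
    return [hit for part in results for hit in part]

sequences = [f"seq{i}" for i in range(350)]
parts = split(sequences)                          # 100+100+100+50
with ThreadPoolExecutor(max_workers=4) as pool:   # workers run in parallel
    results = list(pool.map(blast_task, parts))
hits = merge(results)
assert len(parts) == 4 and len(hits) == 350
```

In AzureBLAST the pool of threads becomes a pool of worker-role instances fed from a queue, but the split → parallel map → merge shape is the same.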
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size / instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resources
AzureBLAST
(Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine against a job registry kept in Azure Tables; tasks are dispatched through a global dispatch queue to worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a database-updating role refreshes the NCBI databases; each job follows the splitting task → parallel BLAST tasks → merging task flow)
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state
(Diagram: web portal and web service → job registration → job scheduler → job registry → scaling engine)
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB in size)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances occur, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working-instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise, something is wrong (e.g., a task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in groups: this is an update domain
• ~30 mins per group
• ~6 nodes in one group
• 35 nodes experienced blob-writing failures at the same time
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and then the job was killed
A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." – Irish proverb
Computing Evapotranspiration (ET)
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J g⁻¹)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stomata (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)
Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)   (Penman-Monteith, 1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.
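The Penman-Monteith formula above is a single arithmetic expression per pixel; a direct sketch follows. The numeric inputs are illustrative mid-latitude daytime values, not data from the slide, and λv is taken in J/kg (2.45 MJ/kg near 20 °C) so the result comes out in kg m⁻² s⁻¹ of water.

```python
def penman_monteith(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2.45e6):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

    delta, gamma in Pa/K; Rn in W/m²; rho_a in kg/m³; c_p in J/(kg·K);
    dq in Pa; g_a, g_s in m/s; lambda_v in J/kg.
    """
    latent_heat_flux = (delta * Rn + rho_a * c_p * dq * g_a) \
                       / (delta + gamma * (1.0 + g_a / g_s))   # W/m²
    return latent_heat_flux / lambda_v                         # kg m⁻² s⁻¹

# Illustrative inputs (assumed values, not from the MODISAzure data):
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2, c_p=1004.0,
                     dq=1000.0, g_a=0.02, g_s=0.01)
assert et > 0   # on the order of 1e-4 kg m⁻² s⁻¹ for these inputs
```

In the pipeline, the hard part is not this arithmetic but assembling Δ, δq, ga, and gs per tile from the imagery, sensor, and field datasets listed on the next slide.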
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage:
• Optional second stage, visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
(Diagram: scientists submit requests via the AzureMODIS service web role portal → request queue → download queue → data collection stage (source imagery download sites) → reprojection queue → reprojection stage → reduction 1 queue → derivation reduction stage → reduction 2 queue → analysis reduction stage → scientific results download; source metadata is kept in tables)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, i.e. recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables
[Diagram: <PipelineStage> Requests flow to the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and queues work to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]
MODISAzure Architectural Big Picture (2/2)
• All work actually done by a Worker Role

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; Generic Workers (Worker Role) dequeue tasks and read/write <Input> Data Storage.]

• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
Example Pipeline Stage: Reprojection Service

[Diagram: Reprojection Requests are parsed by the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) via the Job Queue, and ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile) dispatched through the Task Queue to Generic Workers. Workers query the SwathGranuleMeta table for geo-metadata (e.g. boundaries) for each swath tile, and the ScanTimeList table for the list of satellite scan times that cover a target tile, reading from Swath Source Data Storage and writing Reprojection Data Storage.]
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

[Cost figures by pipeline stage, via the AzureMODIS Service Web Role Portal:
Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage
Total: $1420]
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Application Hosting
'Grokking' the service model
• Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate
• The service model is the same diagram written down in a declarative format
• You give the Fabric the service model and the binaries that go with each of those nodes
• The Fabric can provision, deploy, and manage that diagram for you
  • Find a hardware home
  • Copy and launch your app binaries
  • Monitor your app and the hardware
  • In case of failure, take action; perhaps even relocate your app
• At all times the 'diagram' stays whole
Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
  ServiceDefinition.csdef
  ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • Encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, ...
• Lots of community and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced

Durable Storage At Massive Scale
Blobs: massive files, e.g. videos, logs
Drives: use standard file system APIs
Tables: non-relational, but with few scale limits (use SQL Azure for relational data)
Queues: facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob: inserts a new blob, overwrites the existing blob
  • GetBlob: get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  e.g. http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs

Each container has an access level:
• Private (default): requires the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
• Block blob
  • Targeted at streaming workloads
  • Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
  • Size limit: 200 GB per blob
• Page blob
  • Targeted at random read/write workloads
  • Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
  • Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'; each block has an id
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Diagram: blocks of Big.mpg uploaded out of order, then committed into the final blob.]
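The two-phase block upload can be sketched with a toy in-memory model (this is not the Azure REST API; the class and method names are illustrative):

```python
class ToyBlockBlob:
    """Mimics the two-phase block blob upload: stage blocks by id,
    then commit an ordered block list to materialize the blob."""

    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes; GC'd if never committed
        self.blob = b""

    def put_block(self, block_id, data):
        # Blocks may arrive in any order, e.g. parallel uploads.
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # The commit names the blocks in their final order.
        self.blob = b"".join(self.uncommitted[b] for b in block_ids)

b = ToyBlockBlob()
b.put_block("b2", b"world")      # uploaded out of order
b.put_block("b1", b"hello ")
b.put_block_list(["b1", "b2"])   # committed in the intended order
```

The same shape underlies the real service: uncommitted blocks are invisible until the block list is committed, which is what makes parallel, out-of-order uploads safe.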
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a BLOB
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • Drive made durable through standard Page Blob replication
  • Drive persists even when not mounted, as a Page Blob
Windows Azure Drive API
• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure

[Diagram: a storage Account (MovieData) contains Tables; the table "Movies" holds Star Wars, Star Trek, Fan Boys, and the table "Customers" holds Brian H. Prince, Jason Argonaut, Bill Gates. Each Table contains Entities.]

Tables store entities. Entity schema can vary in the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language

Is not relational
Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance

Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• Controls entity locality

Partition key is the unit of scale
• A partition can be served by a single server
• System load balances partitions based on traffic pattern

System load balances
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

Server Busy
• Use exponential backoff on "Server Busy"
• Our system load balances to meet your traffic needs
• Means single-partition limits have been reached
Partition Keys In Each Abstraction
• Entities: TableName + PartitionKey; entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order-1               |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order-3               |              |                     | $10.00

• Blobs: Container name + Blob name; every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image          | annarbor/bighouse.jpg
  image          | foxborough/gillette.jpg
  video          | annarbor/bighouse.jpg

• Messages: Queue Name; all messages for a single queue belong to the same partition

  Queue    | Message
  jobs     | Message 1
  jobs     | Message 2
  workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

[Diagram: partitions P1, P2, ..., Pn each replicated across Server 1, Server 2, and Server 3.]
Scalability Targets
Storage Account
• Capacity: up to 100 TBs
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
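The recommended reaction to '503 Server Busy' can be sketched as a small retry helper (a toy: `ServerBusyError` stands in for the real 503 response, and the delays are kept short for illustration):

```python
import random
import time

class ServerBusyError(Exception):
    """Stand-in for a '503 Server Busy' response from storage."""

def with_backoff(op, max_retries=6, base=0.01):
    """Retry op() with truncated exponential backoff plus jitter,
    as suggested when single-partition limits are hit."""
    for attempt in range(max_retries):
        try:
            return op()
        except ServerBusyError:
            # Delay doubles each attempt; jitter avoids retry stampedes.
            time.sleep(base * (2 ** attempt) + random.uniform(0, base))
    raise RuntimeError("gave up after %d retries" % max_retries)

calls = {"n": 0}
def flaky():
    # Fails twice with 'server busy', then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServerBusyError()
    return "ok"

result = with_backoff(flaky)
```

The growing delay gives the system time to load balance the hot partition onto another server before the client hammers it again.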
[Example Movies table, shown whole and split across partition ranges:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | ...       | 2009
Action                  | The Bourne Ultimatum     | ...       | 2007
...                     | ...                      | ...       | ...
Animation               | Open Season 2            | ...       | 2009
Animation               | The Ant Bully            | ...       | 2006

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | ...       | 1999
...                     | ...                      | ...       | ...
SciFi                   | X-Men Origins: Wolverine | ...       | 2009
...                     | ...                      | ...       | ...
War                     | Defiance                 | ...       | 2008

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | ...       | 2009
Action                  | The Bourne Ultimatum     | ...       | 2007
...                     | ...                      | ...       | ...
Animation               | Open Season 2            | ...       | 2009
Animation               | The Ant Bully            | ...       | 2006
...                     | ...                      | ...       | ...
Comedy                  | Office Space             | ...       | 1999
...                     | ...                      | ...       | ...
SciFi                   | X-Men Origins: Wolverine | ...       | 2009
...                     | ...                      | ...       | ...
War                     | Defiance                 | ...       | 2008]
Partitions and Partition Ranges
• Initially: Server A serves Table = Movies [Min - Max]
• After load balancing: Server A serves Table = Movies [Min - Comedy); Server B serves Table = Movies [Comedy - Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A query response may stop short, returning a continuation token, for any of these reasons:
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
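A client must therefore loop until no token comes back. The drain loop can be sketched against a toy paginated query (the Azure client library exposes the token differently; `fake_query` only simulates the 1000-row limit):

```python
def query_all(query_page):
    """Drain a paginated query: keep issuing requests while the
    server returns a continuation token."""
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:
            return rows

# Toy server: 2500 rows, at most 1000 per response, offset as the token.
DATA = list(range(2500))

def fake_query(token):
    start = token or 0
    page = DATA[start:start + 1000]
    nxt = start + 1000 if start + 1000 < len(DATA) else None
    return page, nxt

result = query_all(fake_query)
```

Stopping after the first response would silently drop rows, which is why the slide says "seriously": the token can appear even for queries that "should" fit in one response, e.g. at a partition boundary.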
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select PartitionKey and RowKey that help scale: distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "Server busy" means partitions are being load balanced to meet traffic needs, or the load on a single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
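The work ticket pattern mentioned above sidesteps the 8 KB message limit: the payload goes to blob storage and only a small reference travels through the queue. A toy sketch (dicts and lists stand in for the blob and queue services):

```python
import uuid

blob_store = {}   # stand-in for blob storage
queue = []        # stand-in for an Azure queue

def enqueue_work(payload: bytes):
    """Store the large payload as a blob; enqueue only its name
    (the 'work ticket'), which easily fits in 8 KB."""
    blob_name = str(uuid.uuid4())
    blob_store[blob_name] = payload
    queue.append(blob_name)

def process_next():
    ticket = queue.pop(0)
    data = blob_store.pop(ticket)   # also garbage-collects the blob
    return len(data)

enqueue_work(b"x" * 100_000)        # far larger than 8 KB
n = process_next()
```

In a real system the consumer would delete the blob only after the message is successfully processed, and orphaned blobs would be garbage collected separately (as the Queues Recap slide advises).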
Queue Terminology

Message Lifecycle
[Diagram: a Web Role calls PutMessage to add messages (Msg 1..4) to the Queue; Worker Roles call GetMessage (with a visibility timeout) to retrieve messages and RemoveMessage to delete them once processed.]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a backoff polling approach: each empty poll increases the interval by 2x; a successful poll sets the interval back to 1.
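That polling rule fits in a few lines (the cap value and function name are illustrative):

```python
def next_poll_interval(current, got_message, cap=60.0):
    """Truncated exponential back-off polling: double the interval
    on an empty poll (up to cap), reset to 1 on success."""
    if got_message:
        return 1.0
    return min(current * 2, cap)

# Simulate a consumer: three empty polls, one hit, one empty poll.
intervals = []
cur = 1.0
for hit in [False, False, False, True, False]:
    cur = next_poll_interval(cur, hit)
    intervals.append(cur)
# intervals grows 2, 4, 8, snaps back to 1 on the hit, then 2 again
```

The truncation (the cap) keeps a long-idle worker from backing off so far that it reacts sluggishly when work finally arrives.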
Removing Poison Messages

[Diagram: producers P1, P2 and consumers C1, C2 on a queue; each message carries a dequeue count.]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (continued)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (continued)
1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 becomes visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
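The final two steps are the poison-message guard: the dequeue count exposes messages that keep crashing their consumers. A minimal sketch of that check (the threshold and the "deleted" outcome are illustrative; a real system might divert the message to a dead-letter store instead):

```python
import collections

Message = collections.namedtuple("Message", "body dequeue_count")

def handle(msg, process, max_dequeue=2):
    """Poison-message guard: if a message has already been dequeued
    more than max_dequeue times, delete it instead of reprocessing."""
    if msg.dequeue_count > max_dequeue:
        return "deleted"
    process(msg.body)
    return "processed"

seen = []
ok = handle(Message("good", 1), seen.append)       # normal message
poison = handle(Message("bad", 3), seen.append)    # crashed consumers twice+
```

Without this threshold, a message whose processing reliably crashes the worker would circulate forever, eating a 30-second visibility timeout on every pass.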
Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
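The deck's example is the .NET Task Parallel Library; the data-parallel shape it describes looks the same in any language. A Python analogue with a thread pool (the `score` function is a made-up stand-in for real per-item work):

```python
from concurrent.futures import ThreadPoolExecutor

def score(seq):
    # Stand-in for one unit of work, e.g. processing one query segment.
    return sum(ord(c) for c in seq)

sequences = ["ACGT", "GGCC", "TTAA", "ACGC"]

# Data parallelism: the same operation applied across input partitions,
# with the pool sizing the concurrency to the available workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(score, sequences))
```

`pool.map` preserves input order in its results, which makes the later "merge" step of a split/join pipeline trivial.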
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between performance and cost: the risk of failure or poor user experience from not having excess capacity, vs. the cost of idling VMs
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content vs. content compressed via Gzip, minified JavaScript, minified CSS, and minified images.]
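The payoff of point 1 is easy to demonstrate; repetitive markup compresses extremely well (the HTML snippet below is invented for illustration):

```python
import gzip

# A page dominated by repeated markup, as most HTML output is.
html = b"<html><body>" + b"<p>hello cloud</p>" * 500 + b"</body></html>"

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)   # typically a few percent here

# The round trip is lossless: the browser decompresses on the fly.
restored = gzip.decompress(compressed)
```

Fewer bytes on the wire means lower bandwidth charges, and often lower storage charges too if the compressed form is what gets cached.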
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result reduction processing

Large volume data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow
A simple Split/Join pattern: a splitting task fans out into BLAST tasks, and a merging task joins their results.

Leverage multi-core on one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI-BLAST startup overhead, data transfer overhead)
• Best practice: test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of instance failure
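The split/join skeleton above can be sketched in a few lines (`blast_stub` is a made-up stand-in for running NCBI-BLAST over one partition; the partition size of 100 echoes the micro-benchmark result):

```python
def split(sequences, partition_size=100):
    """Query segmentation: cut the input into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_stub(partition):
    # Stand-in for one BLAST task over one partition of queries.
    return [f"hit:{s}" for s in partition]

def merge(partial_results):
    """Join step: concatenate per-partition results when all are done."""
    return [r for part in partial_results for r in part]

seqs = [f"seq{i}" for i in range(250)]
partitions = split(seqs)                           # 3 partitions: 100, 100, 50
merged = merge(blast_stub(p) for p in partitions)  # in the cloud, each
                                                   # partition is a queued task
```

In AzureBLAST the middle step runs on worker instances pulled from a queue rather than in-process, but the split and merge boundaries are exactly these.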
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, backed by an Azure Table holding the Job Registry and NCBI databases; a global dispatch queue feeds the Worker instances; a Database updating Role refreshes the Azure Blob storage holding the BLAST databases, temporary data, etc. Tasks follow the split/join flow: a splitting task fans out into BLAST tasks, and a merging task joins the results.]
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time...
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.

Our Approach
• Allocated a total of ~4000 cores
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

[Map: VM counts per deployment across the datacenters.]
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6~8 days
• Look into the log data to analyze what took place...
Understanding Azure by analyzing logs

A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
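The "something is wrong" check amounts to pairing start records with done records; a sketch over the abnormal log excerpt above:

```python
import re

LOG = """\
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

# Tasks that started vs. tasks that reported completion.
started = set(re.findall(r"Executing the task (\d+)", LOG))
finished = set(re.findall(r"Execution of task (\d+) is done", LOG))

# Anything started but never finished was lost, e.g. to an instance failure.
never_finished = sorted(started - finished)
```

Run over the full experiment logs, this kind of pairing is what surfaced the update-domain and fault-domain events on the next slides.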
Surviving System Upgrades
North Europe Data Center: 34,256 tasks processed in total

[Chart: all 62 compute nodes lost tasks and then came back in groups of ~6 nodes, each outage lasting ~30 mins; this is an update domain.]
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." – Irish proverb
Computing Evapotranspiration (ET)
Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)
where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)
Estimating resistance/conductivity across a catchment can be tricky:
- Lots of inputs: big data reduction
- Some of the inputs are not so simple
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.
ET Synthesizes Imagery, Sensors, Models and Field Data
- NASA MODIS imagery source archives: 5 TB (600K files)
- FLUXNET curated sensor dataset: 30 GB (960 files)
- FLUXNET curated field dataset: 2 KB (1 file)
- NCEP/NCAR: ~100 MB (4K files)
- Vegetative clumping: ~5 MB (1 file)
- Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four Stage Image Processing Pipeline
Data collection (map) stage
- Downloads requested input tiles from NASA ftp sites
- Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
- Converts source tile(s) to intermediate result sinusoidal tiles
- Simple nearest neighbor or spline algorithms
Derivation reduction stage
- First stage visible to the scientist
- Computes ET in our initial use
Analysis reduction stage
- Optional second stage visible to the scientist
- Enables production of science analysis artifacts such as maps, tables, virtual sensors
[Architecture diagram: Scientists submit requests through the AzureMODIS Service Web Role Portal via a Request Queue; a Download Queue feeds the Data Collection Stage from the Source Imagery Download Sites (using Source Metadata); a Reprojection Queue feeds the Reprojection Stage; Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages; scientists download the scientific results.]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
- The ModisAzure Service is the Web Role front door
  - Receives all user requests
  - Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
- The Service Monitor is a dedicated Worker Role
  - Parses all job requests into tasks – recoverable units of work
  - Execution status of all jobs and tasks is persisted in Tables
[Diagram: a <PipelineStage> Request reaches the MODISAzure Service (Web Role), which persists <PipelineStage> JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage> TaskStatus and dispatches to the <PipelineStage> Task Queue.]
MODISAzure Architectural Big Picture (2/2)
- All work is actually done by a Worker Role
  - Dequeues tasks created by the Service Monitor
  - Retries failed tasks 3 times
  - Maintains all task status
[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage> TaskStatus and dispatches to the <PipelineStage> Task Queue; Generic Workers (Worker Roles) dequeue tasks and read <Input> Data Storage.]
Example Pipeline Stage: Reprojection Service
[Diagram: a Reprojection Request reaches the Service Monitor (Worker Role), which persists ReprojectionJobStatus via the Job Queue, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; Generic Workers (Worker Roles) consume the tasks, reading from Swath Source Data Storage and writing to Reprojection Data Storage.]
- Each ReprojectionJobStatus entity specifies a single reprojection job request
- Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
- Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
- Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
- Computational costs driven by data scale and the need to run reductions multiple times
- Storage costs driven by data scale and the 6-month project duration
- Small with respect to the people costs, even at graduate student rates
[Annotated pipeline diagram, with per-stage figures:]
Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers: $420 cpu, $60 download
Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 cpu, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 cpu, $2 download, $9 storage
Total: $1420
Observations and Experience
- Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
- Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
- Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
- They provide valuable fault tolerance and scalability abstractions
- Clouds act as an amplifier for familiar client tools and on-premise compute
- Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
- Getting started steps for developers
- Available research services
- Use cases on Azure for research
- Event announcements
- Detailed tutorials
- Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
- Simple benchmarks illustrating basic performance for compute and storage services
- Benchmarks for reference algorithms
- Best practice tips
- Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
'Grokking' the service model
- Imagine white-boarding out your service architecture, with boxes for nodes and arrows describing how they communicate
- The service model is the same diagram, written down in a declarative format
- You give the Fabric the service model and the binaries that go with each of those nodes
- The Fabric can provision, deploy, and manage that diagram for you:
  - Find a hardware home
  - Copy and launch your app binaries
  - Monitor your app and the hardware
  - In case of failure, take action; perhaps even relocate your app
  - At all times, the 'diagram' stays whole
Automated Service Management
Provide code + service model
- Platform identifies and allocates resources, deploys the service, manages service health
- Configuration is handled by two files:
  - ServiceDefinition.csdef
  - ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
- We can deploy from the portal or from script
- VS builds two files:
  - An encrypted package of your code
  - Your config file
- You must create an Azure account, then a service, and then you deploy your code
- Deployment can take up to 20 minutes (which is better than six months)
Service Management API
- REST-based API to manage your services
- X509 certs for authentication
- Lets you create, delete, change, upgrade, swap, …
- Lots of community and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   - Determine resource requirements
   - Create role images
2. Allocate resources
3. Prepare nodes
   - Place role images on nodes
   - Configure settings
   - Start roles
4. Configure load balancers
5. Maintain service health
   - If a role fails, restart the role based on policy
   - If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable storage, at massive scale:
- Blob: massive files, e.g., videos, logs
- Drive: use standard file system APIs
- Tables: non-relational, but with few scale limits (use SQL Azure for relational data)
- Queues: facilitate loosely-coupled, reliable systems
Blob Features and Functions
- Store large objects (up to 1 TB in size)
- You can have as many containers and blobs as you want
- Standard REST interface:
  - PutBlob: inserts a new blob, overwrites the existing blob
  - GetBlob: get the whole blob or a specific range
  - DeleteBlob
  - CopyBlob
  - SnapshotBlob
  - LeaseBlob
- Each blob has an address:
  http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  e.g., http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
- Similar to a top-level folder
- Has an unlimited capacity
- Can only contain blobs
Each container has an access level:
- Private (the default): requires the account key to access
- Full public read
- Public read only
Two Types of Blobs Under the Hood
Block blob
- Targeted at streaming workloads
- Each blob consists of a sequence of blocks
- Each block is identified by a Block ID
- Size limit: 200 GB per blob
Page blob
- Targeted at random read/write workloads
- Each blob consists of an array of pages
- Each page is identified by its offset from the start of the blob
- Size limit: 1 TB per blob
Blocks
- You can upload a file in 'blocks'; each block has an id
- Then commit those blocks, in any order, into a blob
- Final blob limited to 1 TB and up to 50,000 blocks
- Can modify a blob by inserting, updating, and removing blocks
- Blocks live for a week before being GC'd if not committed to a blob
- Optimized for streaming
[Diagram: big.mpg uploaded as blocks 1-8, out of order, then committed as big.mpg]
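The block-commit discipline above can be modeled in a few lines. This is a toy in-memory illustration of the semantics (staged blocks, then a commit list that fixes the order), not the Azure storage API; the class and method names are invented.

```python
# Toy model of a block blob: blocks are uploaded in any order under an id,
# and the commit list determines the final blob layout. Uncommitted blocks
# would eventually be garbage-collected in the real service.

class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, staged but not yet visible
        self.committed = []     # ordered list of (block id, bytes)

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit blocks in the order given, regardless of upload order.
        self.committed = [(bid, self.uncommitted[bid]) for bid in block_ids]
        self.uncommitted.clear()

    def content(self):
        return b"".join(data for _, data in self.committed)

blob = BlockBlob()
for bid, chunk in [("b2", b"world"), ("b1", b"hello ")]:  # uploaded out of order
    blob.put_block(bid, chunk)
blob.put_block_list(["b1", "b2"])  # the commit order defines the blob
```

The same mechanism is what allows parallel uploads: workers can push blocks concurrently, and a single commit at the end assembles them.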
Pages
- Similar to block blobs
- Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
- Call Put Blob to set the max size, then call Put Page
- All pages must align to 512-byte page boundaries
- Writes to page blobs happen in-place and are immediately committed to the blob
- The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
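A minimal sketch of those page-blob rules: the maximum size is declared up front, and writes land in place at 512-byte-aligned offsets. This is a local illustration of the semantics only; the real service exposes them via the Put Blob / Put Page REST operations.

```python
# Sketch of page-blob semantics: fixed declared size, in-place writes at
# 512-byte-aligned offsets. Illustrative only, not the Azure API.

PAGE = 512

class PageBlob:
    def __init__(self, max_size):
        if max_size % PAGE:
            raise ValueError("max size must be a multiple of 512 bytes")
        self.data = bytearray(max_size)  # sparse in the real service

    def put_page(self, offset, payload):
        # Both the offset and the payload length must align to page boundaries.
        if offset % PAGE or len(payload) % PAGE:
            raise ValueError("writes must align to 512-byte page boundaries")
        self.data[offset:offset + len(payload)] = payload  # in-place, committed

blob = PageBlob(4 * PAGE)
blob.put_page(2 * PAGE, b"x" * PAGE)  # random-access write to the third page
```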
BLOB Leases
- Creates a 1-minute exclusive write lock on a blob
- Operations: Acquire, Renew, Release, Break
- Must have the lease id to perform operations
- Can check the LeaseStatus property
- Currently can only be done through REST
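The lease discipline (a one-minute exclusive write lock identified by a lease id) can be illustrated locally. This is a sketch of the rules, not the REST protocol; the class, error strings, and `now` parameter (used to make time explicit) are invented for the example.

```python
# Local illustration of blob lease rules: while a lease is active, writes must
# present the current lease id; after expiry, writes succeed again.
import time
import uuid

LEASE_SECONDS = 60

class LeasedBlob:
    def __init__(self):
        self.lease_id = None
        self.lease_expires = 0.0

    def _lease_active(self, now):
        return self.lease_id is not None and now < self.lease_expires

    def acquire(self, now=None):
        now = time.time() if now is None else now
        if self._lease_active(now):
            raise RuntimeError("409: lease already present")
        self.lease_id = str(uuid.uuid4())
        self.lease_expires = now + LEASE_SECONDS
        return self.lease_id

    def write(self, lease_id, now=None):
        now = time.time() if now is None else now
        if self._lease_active(now) and lease_id != self.lease_id:
            raise RuntimeError("412: lease id mismatch")
        return "ok"

blob = LeasedBlob()
lid = blob.acquire(now=0.0)
```

Renew would push `lease_expires` forward; Release would clear `lease_id`; Break would clear it without requiring the id.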
Windows Azure Drive
- Provides a durable NTFS volume for Windows Azure applications to use
  - Use existing NTFS APIs to access a durable drive
  - Durability and survival of data on application failover
  - Enables migrating existing NTFS applications to the cloud
- A Windows Azure Drive is a Page Blob
  - Example: mount a Page Blob as X:\
    http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  - All writes to the drive are made durable to the Page Blob
  - The drive is made durable through standard Page Blob replication
  - The drive persists even when not mounted, as a Page Blob
Windows Azure Drive API
- Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
- Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
- Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
- Get Mounted Drives: returns the list of mounted drives; it consists of a list of the drive letter and Page Blob URLs for each mounted drive
- Unmount Drive: unmounts the drive and frees up the drive letter
- Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
- Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
- Manage connection strings/keys in .cscfg
- Do not share keys; wrap them with a service
- Have a strategy for accounts and containers
- You can assign a custom domain to your storage account
- There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
- Table Name: Movies (entities: Star Wars, Star Trek, Fan Boys)
- Table Name: Customers (entities: Brian H. Prince, Jason Argonaut, Bill Gates)
Hierarchy: Account → Table → Entity
Tables store entities. Entity schema can vary within the same table.
Windows Azure Tables
- Provides structured storage
- Massively scalable tables
  - Billions of entities (rows) and TBs of data
  - Can use thousands of servers as traffic grows
- Highly available and durable
  - Data is replicated several times
- Familiar and easy-to-use API
  - WCF Data Services and OData
  - .NET classes and LINQ
  - REST, with any platform or language
Is not relational
Cannot:
- Create foreign key relationships between tables
- Perform server-side joins between tables
- Create custom indexes on the tables
- No server-side Count(), for example
All entities must have the following properties:
- Timestamp
- PartitionKey
- RowKey
Windows Azure Queues
- Queues are performance efficient, highly available, and provide reliable message delivery
- Simple, asynchronous work dispatch
- Programming semantics ensure that a message can be processed at least once
- Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
Every data object has a partition key
- Different for each data type (blobs, entities, queues)
- Controls entity locality
The partition key is the unit of scale
- A partition can be served by a single server
- The system load balances partitions based on traffic pattern
The system load balances
- Load balancing can take a few minutes to kick in
- It can take a couple of seconds for a partition to become available on a different server
On "Server Busy", use exponential backoff; either the system is load balancing to meet your traffic needs, or single-partition limits have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey
- Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name
- Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue name
- All messages for a single queue belong to the same partition

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1
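The three partitioning rules above can be written down directly. This is an illustrative sketch (invented function names and sample data, mirroring the tables on this slide), useful for reasoning about which objects can land on the same server.

```python
# Sketch of how each abstraction derives its partition key, per the slide:
# entities by table name + PartitionKey, blobs by container + blob name,
# queue messages by queue name alone.

def entity_partition(table, entity):
    return (table, entity["PartitionKey"])

def blob_partition(container, blob_name):
    return (container, blob_name)  # a blob and its snapshots share a partition

def message_partition(queue_name):
    return (queue_name,)           # all messages in one queue share a partition

customers = [
    {"PartitionKey": "1", "RowKey": "Customer-John Smith"},
    {"PartitionKey": "1", "RowKey": "Order-1"},
    {"PartitionKey": "2", "RowKey": "Customer-Bill Johnson"},
]
partitions = {entity_partition("Customers", e) for e in customers}
```

Note the consequence for queues: a single queue is a single partition, which is why per-queue throughput is capped and busy systems spread work across multiple queues.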
Replication Guarantee
- All Azure Storage data exists in three replicas
- Replicas are created as needed
- A write operation is not complete until it has been written to all three replicas
- Reads are only load balanced to replicas in sync
[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3]
Scalability Targets
Storage account
- Capacity: up to 100 TBs
- Transactions: up to a few thousand requests per second
- Bandwidth: up to a few hundred megabytes per second
Single queue/table partition
- Up to 500 transactions per second
Single blob partition
- Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
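The recommended client reaction to '503 Server Busy' is retry with exponential backoff. A minimal sketch, with an invented `with_backoff` helper and a simulated server standing in for the storage service (delay values are illustrative):

```python
# Retry on 503 with exponentially growing delays. Illustrative only; a real
# client would sleep between attempts instead of just recording the delays.

def with_backoff(request, max_attempts=5, base_delay=0.5):
    """Retry `request` on 503, doubling the delay each time."""
    delays = []
    for attempt in range(max_attempts):
        status, body = request()
        if status != 503:
            return body, delays
        delays.append(base_delay * (2 ** attempt))  # 0.5, 1, 2, 4, ... seconds
        # time.sleep(delays[-1]) would go here in a real client
    raise RuntimeError("server still busy after retries")

# Simulated server: busy twice, then succeeds.
responses = iter([(503, None), (503, None), (200, "entity data")])
body, delays = with_backoff(lambda: next(responses))
```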
Partitions and Partition Ranges
A table is partitioned by ranges of PartitionKey. Initially one server holds the full range:

Server A: Table = Movies [Min - Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Under load, the system splits the range across servers:
Server A: Table = Movies [Min - Comedy), holding the Action and Animation rows
Server B: Table = Movies [Comedy - Max], holding the Comedy, SciFi, and War rows
Key Selection: Things to Consider
Scalability
- Distribute load as much as possible
- Hot partitions can be load balanced
- PartitionKey is critical for scalability
Query efficiency and speed
- Avoid frequent large scans
- Parallelize queries
- Point queries are most efficient
Entity group transactions
- Transactions across a single partition
- Transaction semantics, and reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A response can stop short and return a continuation token when:
- The maximum of 1000 rows in a response is reached
- The query hits the end of a partition range boundary
- The maximum of 5 seconds to execute the query elapses
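Because of those limits, every table query has to be written as a loop that keeps following the continuation token until none is returned. A local stand-in for the service makes the pattern concrete (the `query`/`query_all` names and the integer token are invented for illustration; the real token is opaque):

```python
# Paging loop over a simulated table service that returns at most 1000 rows
# per response plus a continuation token (None when the result set is done).

ROWS = [f"row-{i}" for i in range(2500)]
PAGE_LIMIT = 1000

def query(continuation=0):
    """Return up to 1000 rows and the next continuation token."""
    page = ROWS[continuation:continuation + PAGE_LIMIT]
    nxt = continuation + len(page)
    return page, (nxt if nxt < len(ROWS) else None)

def query_all():
    results, token = [], 0
    while token is not None:          # always handle continuation tokens
        page, token = query(token)
        results.extend(page)
    return results

all_rows = query_all()
```

Code that ignores the token silently processes only the first 1000 rows, which is the bug this slide is warning about.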
Tables Recap
- Efficient for frequently used queries
- Supports batch transactions
- Distributes load
- Select a PartitionKey and RowKey that help scale
  - Distribute by using a hash, etc. as a prefix
- Avoid "append only" patterns
- Always handle continuation tokens
  - Expect continuation tokens for range queries
- "OR" predicates are not optimized
  - Execute the queries that form the "OR" predicates as separate queries
- Implement a back-off strategy for retries
  - Server busy: either the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
- Use a new context for each logical operation
- AddObject/AttachTo can throw an exception if the entity is already being tracked
- A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
- Want roles that work closely together, but are not bound together
  - Tight coupling leads to brittleness
  - Decoupling can aid in scaling and performance
- A queue can hold an unlimited number of messages
  - Messages must be serializable as XML
  - Limited to 8 KB in size
  - Commonly use the work ticket pattern
- Why not simply use a table?
Queue Terminology
Message Lifecycle
[Diagram: a Web Role calls PutMessage to add Msg 1…Msg 4 to the Queue; Worker Roles call GetMessage (with a visibility timeout) to dequeue messages, and RemoveMessage to delete each one once processed.]
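The lifecycle in the diagram can be modeled locally: GetMessage hides a message for the visibility timeout rather than deleting it, and only an explicit RemoveMessage completes the cycle; if the consumer crashes, the message reappears. This is an illustrative sketch with invented names, not the queue API.

```python
# In-memory model of the GetMessage / RemoveMessage lifecycle with a
# visibility timeout. Time is passed explicitly to keep the sketch deterministic.

class Queue:
    def __init__(self):
        self.messages = []                # entries of [visible_at, body]

    def put(self, body):
        self.messages.append([0.0, body])

    def get(self, now, timeout=30.0):
        for entry in self.messages:
            if entry[0] <= now:           # message is visible
                entry[0] = now + timeout  # hide it for the visibility timeout
                return entry
        return None

    def remove(self, entry):
        self.messages.remove(entry)       # explicit delete completes the cycle

q = Queue()
q.put("msg 1")
q.put("msg 2")
first = q.get(now=0.0)        # worker A dequeues msg 1, hidden until t=30
second = q.get(now=0.0)       # worker B dequeues msg 2
q.remove(second)              # worker B finishes; msg 2 is gone for good
reappeared = q.get(now=31.0)  # worker A crashed; msg 1 is visible again
```

This is exactly the "at least once" guarantee mentioned above: a crashed consumer's message is redelivered, so processing must be idempotent.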
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
- Consider a backoff polling approach: each empty poll increases the polling interval by 2x
- A successful poll resets the interval back to 1
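That polling discipline is a few lines of code. The sketch below is illustrative (invented names; interval values chosen arbitrarily): each empty poll doubles the sleep interval up to a cap (hence "truncated"), and a successful poll resets it to the minimum.

```python
# Truncated exponential backoff for queue polling: double on empty polls,
# cap at a maximum, reset on success.

MIN_INTERVAL = 1.0
MAX_INTERVAL = 60.0

def next_interval(current, got_message):
    if got_message:
        return MIN_INTERVAL                   # reset on a successful poll
    return min(current * 2, MAX_INTERVAL)     # double, but truncate at the cap

# Simulate a run: five empty polls, then a message, then one more empty poll.
intervals, current = [], MIN_INTERVAL
for got in [False, False, False, False, False, True, False]:
    current = next_interval(current, got)
    intervals.append(current)
```

The point is cost: polling an empty queue is a billable storage transaction, so idle workers should poll less and less often.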
Removing Poison Messages
[Sequence diagrams: producers P1, P2 and consumers C1, C2 sharing one queue]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2
13. C1: Delete(Q, msg 1)
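The recovery sequence above hinges on the per-message dequeue count: a message that keeps reappearing because it crashes its consumers is eventually deleted instead of retried. A local sketch of that consumer-side rule (invented names; the threshold of 2 matches step 12 above):

```python
# Poison-message handling: track the dequeue count and give up past a threshold.

POISON_THRESHOLD = 2

class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0

def process(msg, handler, dead_letters):
    msg.dequeue_count += 1
    if msg.dequeue_count > POISON_THRESHOLD:
        dead_letters.append(msg.body)   # remove the poison message for good
        return "deleted"
    try:
        handler(msg.body)
        return "done"
    except Exception:
        return "will retry"             # message becomes visible again later

def crashing_handler(body):
    raise RuntimeError("simulated consumer crash")

dead = []
msg = Message("msg 1")
outcomes = [process(msg, crashing_handler, dead) for _ in range(3)]
```

Parking the failed body in a dead-letter store (here just a list) keeps it available for offline diagnosis instead of silently discarding it.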
Queues Recap
- Make message processing idempotent: then there is no need to deal with failures
- Do not rely on order: invisible messages result in out-of-order delivery
- Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
- For messages > 8 KB: use a blob to store the message data, with a reference in the message
  - Batch messages
  - Garbage collect orphaned blobs
- Use the message count to scale: dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
- Blobs: files and large objects
- Drives: NTFS APIs for migrating applications
- Tables: massively scalable structured storage
- Queues: reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
- Having the correct VM size can make a big difference in costs
- The fundamental choice: larger, fewer VMs vs. many smaller instances
- If you scale better than linearly across cores, larger VMs could save you money
- It is pretty rare to see linear scaling across 8 cores
- More instances may provide better uptime and reliability (more failures are needed to take your service down)
- The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
- 1 role instance == 1 VM running Windows
- 1 role instance != one specific task for your code
- You're paying for the entire VM, so why not use it?
- Common mistake: splitting code into multiple roles, each not using up its CPU
- Balance using up the CPU vs. having free capacity in times of need
- There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
- Spin up additional processes, each with a specific task or as a unit of concurrency
  - May not be ideal if the number of active processes exceeds the number of cores
- Use multithreading aggressively
  - In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  - In .NET 4, use the Task Parallel Library
    - Data parallelism
    - Task parallelism
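The deck's examples are .NET, but the same pool-based data-parallelism pattern translates directly: size a worker pool to the machine rather than dedicating a whole role instance to each task. An illustrative Python sketch (the `handle` function is a stand-in for a real unit of work; CPU-bound work would use a process pool instead of threads):

```python
# Data parallelism with a fixed-size worker pool, sized to the instance.
from concurrent.futures import ThreadPoolExecutor

def handle(item):
    # Stand-in for a unit of work (e.g., converting one tile or one request).
    return item * item

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle, range(8)))  # order-preserving map
```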
Finding Good Code Neighbors
- Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
- Find code that is intensive with different resources to live together
- Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
- Monitor your application and make sure you're scaled appropriately (not over-scaled)
- Spinning VMs up and down automatically is good at large scale
- Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
- Being too aggressive in spinning down VMs can result in poor user experience
- Trade off the risk of failure / poor user experience from not having excess capacity against the cost of idling VMs (performance vs. cost)
Storage Costs
- Understand your application's storage profile and how storage billing works
- Make service choices based on your app profile
  - E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  - The service choice can make a big cost difference based on your app profile
- Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
- Bandwidth costs are a huge part of any popular web app's billing profile
- Sending fewer things over the wire often means getting fewer things from storage
- Saving bandwidth costs often leads to savings in other places
- Sending fewer things means your VM has time to do other tasks
- All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   - All modern browsers can decompress on the fly
   - Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   - Use Portable Network Graphics (PNGs)
   - Crush your PNGs
   - Strip needless metadata
   - Make all PNGs palette PNGs
[Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content]
Best Practices Summary
- Doing 'less' is the key to saving costs
- Measure everything
- Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
- The most important software in bioinformatics
- Identifies similarity between bio-sequences
Computationally intensive
- Large number of pairwise alignment operations
- A BLAST run can take 700 ~ 1000 CPU hours
- Sequence databases are growing exponentially
- GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
- Segment the input
  - Segment processing (querying) is pleasingly parallel
- Segment the database (e.g., mpiBLAST)
  - Needs special result-reduction processing
Large volume of data:
- A normal BLAST database can be as large as 10 GB
- With 100 nodes, the peak storage bandwidth demand could reach 1 TB
- The output of BLAST is usually 10-100x larger than the input
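The input-segmentation approach can be shown in miniature: split the query sequences, "query" each partition independently, and merge the per-partition results. The scoring function below is a trivial stand-in for running NCBI BLAST, and all names and data are invented for illustration.

```python
# Query-segmentation split/join in miniature: each partition is an independent,
# pleasingly parallel unit of work, and results are concatenated at the end.

def split(sequences, partition_size):
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition, database):
    # Stand-in for running BLAST on one partition against the database:
    # here we just count exact matches instead of computing alignments.
    return [(seq, sum(1 for d in database if d == seq)) for seq in partition]

def merge(partial_results):
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged

database = ["ATG", "GGC", "ATG"]
queries = ["ATG", "TTT", "GGC", "ATG"]
partitions = split(queries, 2)
hits = merge(blast_partition(p, database) for p in partitions)
```

In AzureBLAST each partition becomes a work-ticket message on a queue, and the merge is a final reduction task.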
AzureBLAST
- A parallel BLAST engine on Azure
- Query-segmentation, data-parallel pattern:
  - Split the input sequences
  - Query the partitions in parallel
  - Merge the results together when done
- Follows the general suggested application model: Web Role + Queue + Worker
- With three special considerations:
  - Batch job management
  - Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out to many BLAST tasks, whose outputs feed a merging task.
Leverage the multiple cores of one instance
- Argument "-a" of NCBI-BLAST
- 1, 2, 4, 8 for the small, medium, large, and extra-large instance sizes
Task granularity
- Too large a partition: load imbalance
- Too small a partition: unnecessary overheads (NCBI-BLAST startup overhead, data transfer overhead)
- Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of the visibilityTimeout for each BLAST task
- Essentially an estimate of the task run time
- Too small: repeated computation
- Too large: an unnecessarily long wait in case of an instance failure
Micro-Benchmarks Inform Design
Task size vs. performance
- Benefit of the warm cache effect
- 100 sequences per partition is the best choice
Instance size vs. performance
- Super-linear speedup with larger worker instances
- Primarily due to the memory capacity
Task size / instance size vs. cost
- The extra-large instance generated the best and most economical throughput
- Fully utilizes the resource
AzureBLAST
[Architecture diagram: a Web Role hosts the Web Portal and Web Service, which handle job registration; a Job Management Role runs the Job Scheduler and Scaling Engine and records jobs in the Job Registry (Azure Table); a global dispatch queue feeds the Worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc., maintained by a database updating Role. A splitting task fans out to parallel BLAST tasks, which feed a merging task.]
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance
- Submit jobs
- Track a job's status and logs
- Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
- Fault tolerance: avoid in-memory states
[Diagram: the Job Portal's Web Portal / Web Service performs job registration into the Job Registry, feeding the Job Scheduler and Scaling Engine]
Demonstration
R. palustris as a platform for H2 production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment
Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB size)
• In total, 9,865,668 sequences to be queried
• Theoretically 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
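The headline estimate is a one-line unit conversion; as a check (numbers from the slide, conversion only):

```python
# 3,216,731 single-desktop minutes converted to years.
minutes = 3_216_731
years = minutes / (60 * 24 * 365)   # about 6.1 years on one desktop
```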
Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
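The anomaly-hunting step amounts to pairing "Executing" records with "done" records; a minimal Python sketch (the parsing is mine, on an abbreviated form of the slide's log format):

```python
def find_incomplete(log_lines):
    """Return task ids that have an 'Executing' record but no 'done' record."""
    started, done = set(), set()
    for line in log_lines:
        parts = line.split()
        if "Executing the task" in line:
            started.add(parts[-1])
        elif "Execution of task" in line:
            done.add(parts[parts.index("task") + 1])
    return started - done

log = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done",
]
missing = find_incomplete(log)   # task 251774 never completed
```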
Surviving System Upgrades
North Europe Data Center: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in a group: this is an update domain
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures
West Europe Datacenter: 30,976 tasks were completed and the job was killed
35 nodes experienced blob writing failure at the same time
A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" (Irish proverb)
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2)
Δ = Rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = Latent heat of vaporization (J g-1)
Rn = Net radiation (W m-2)
cp = Specific heat capacity of air (J kg-1 K-1)
ρa = Dry air density (kg m-3)
δq = Vapor pressure deficit (Pa)
ga = Conductivity of air (inverse of ra) (m s-1)
gs = Conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = Psychrometric constant (γ ≈ 66 Pa K-1)
Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.
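Transcribed into code, the combination equation above is a single expression; the sample inputs below are made-up round numbers chosen only to exercise the formula (they are not field data, and unit handling is left to the caller):

```python
def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0, lam_v=2450.0):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)."""
    return (delta * Rn + rho_a * cp * dq * ga) / (
        (delta + gamma * (1.0 + ga / gs)) * lam_v)

# Illustrative inputs only: Δ in Pa/K, Rn in W/m², δq in Pa, ga/gs in m/s.
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2, cp=1013.0,
                     dq=1000.0, ga=0.02, gs=0.01)
```

A quick sanity check on the structure: increasing net radiation Rn should increase ET, all else equal.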
ET Synthesizes Imagery, Sensors, Models and Field Data
NASA MODIS imagery source archives: 5 TB (600K files)
FLUXNET curated sensor dataset: 30 GB (960 files)
FLUXNET curated field dataset: 2 KB (1 file)
NCEP/NCAR: ~100 MB (4K files)
Vegetative clumping: ~5 MB (1 file)
Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate result sinusoidal tiles
• Simple nearest neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
Download Queue
Scientists
Science results
Analysis Reduction Stage
Derivation Reduction Stage
Reprojection Stage
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues request to appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks persisted in Tables
<PipelineStage> Request
…
<PipelineStage>JobStatus Persist
<PipelineStage>Job Queue
MODISAzure Service (Web Role)
Service Monitor (Worker Role)
Parse & Persist
<PipelineStage>TaskStatus
…
Dispatch
<PipelineStage>Task Queue
MODISAzure Architectural Big Picture (2/2)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse & Persist
<PipelineStage>TaskStatus
GenericWorker (Worker Role)
…
Dispatch
<PipelineStage>Task Queue
…
<Input>Data Storage
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
Example Pipeline Stage: Reprojection Service
Reprojection Request
…
Service Monitor (Worker Role)
ReprojectionJobStatus Persist
Parse & Persist
ReprojectionTaskStatus
GenericWorker (Worker Role)
…
Job Queue
…
Dispatch
Task Queue
Points to
…
ScanTimeList
SwathGranuleMeta
Reprojection Data Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (i.e., a single tile)
Query this table to get geo-metadata (e.g., boundaries) for each swath tile
Query this table to get the list of satellite scan times that cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and need to run reduction multiple times
• Storage costs driven by data scale and 6-month project duration
• Small with respect to the people costs, even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
Download Queue
Scientists
Analysis Reduction Stage
Derivation Reduction Stage
Reprojection Stage
Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers: $420 cpu, $60 download
Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 cpu, $1 download, $6 storage
Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 cpu, $2 download, $9 storage
AzureMODIS Service Web Role Portal
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large and small scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data parallel applications, and can support many interesting "programming patterns", but tightly coupled low-latency applications do not perform optimally on clouds today
• Provide valuable fault tolerance and scalability abstractions
• Clouds as amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research, Roger Barga, Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components: Fabric Controller
- Key Components: Fabric Controller (2)
- Key Components: Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable, Fault Tolerant Applications
- Key Components – Compute: VM Roles
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Durable Storage, At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection: Things to Consider
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST: Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure: Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery, Sensors, Models and Field Data
- MODISAzure: Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage: Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources: Cloud Research Community Site
- Resources: AzureScope
- Demonstration (2)
Automated Service Management
Provide code + service model
• Platform identifies and allocates resources, deploys the service, manages service health
• Configuration is handled by two files:
ServiceDefinition.csdef
ServiceConfiguration.cscfg
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
• Encrypted package of your code
• Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
• (which is better than six months)
Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure
1. Process service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If role fails, restart the role based on policy
   2. If node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage At Massive Scale
Blob – Massive files, e.g. videos, logs
Drive – Use standard file system APIs
Tables – Non-relational, but with few scale limits; use SQL Azure for relational data
Queues – Facilitate loosely-coupled, reliable systems
Blob Features and Functions
• Store Large Objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
• PutBlob
• Inserts a new blob, overwrites the existing blob
• GetBlob
• Get whole blob or a specific range
• DeleteBlob
• CopyBlob
• SnapshotBlob
• LeaseBlob
• Each blob has an address:
• http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
• http://movieconversion.blob.core.windows.net/originals/barga.mpg
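The addressing scheme above is mechanical enough to build with string formatting; a tiny helper using the slide's example values (the helper itself is illustrative, not part of any SDK):

```python
def blob_url(account, container, blob_name):
    """Build a blob address: http://<account>.blob.core.windows.net/<container>/<blob>."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob_name}"

url = blob_url("movieconversion", "originals", "barga.mpg")
```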
Containers
• Similar to a top level folder
• Has an unlimited capacity
• Can only contain blobs
Each container has an access level:
- Private
  - Default, will require the account key to access
- Full public read
- Public read only
Two Types of Blobs Under the Hood
• Block Blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob
• Page Blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'
• Each block has an id
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
[Diagram: Big.mpg assembled from blocks 1 6 8 3 5 4 7 2]
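The put-block / commit-block-list behavior above can be modeled in memory; this is a sketch of the semantics (a toy class, not the storage client library): blocks upload in any order, and the commit's ordered block list fixes the blob's content.

```python
class BlockBlob:
    """In-memory model of block-blob put/commit semantics."""
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes
        self.content = b""

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: assemble the blob from the named blocks, in the given order.
        self.content = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()   # uncommitted leftovers would be GC'd

blob = BlockBlob()
for bid, chunk in [("01", b"hello "), ("02", b"world"), ("03", b"!")]:
    blob.put_block(bid, chunk)
blob.put_block_list(["01", "02", "03"])
```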
Pages
• Similar to block blobs
• Optimized for random read/write operations and provide the ability to write to a range of bytes in a blob
• Call Put Blob, set max size; then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
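The 512-byte alignment rule is easy to get wrong; a minimal validator for it (a hypothetical helper, not an SDK function):

```python
PAGE = 512  # page blobs are addressed in 512-byte pages

def valid_page_write(start, length):
    """A page-blob write must start on a page boundary and span whole pages."""
    return start % PAGE == 0 and length > 0 and length % PAGE == 0

ok = valid_page_write(0, 1024)      # aligned: covers pages 0 and 1
bad = valid_page_write(100, 512)    # misaligned start offset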
BLOB Leases
• Creates a 1 minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
• Example: mount Page Blob as X:
• http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to drive are made durable to the Page Blob
• Drive made durable through standard Page Blob replication
• Drive persists even when not mounted, as a Page Blob
Windows Azure Drive API
• Create Drive – Creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – Allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – Takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – Returns the list of mounted drives; it consists of a list of the drive letter and Page Blob URLs for each mounted drive
• Unmount Drive – Unmounts the drive and frees up the drive letter
• Snapshot Drive – Allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – Provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
Table Name: Movies
  Star Wars
  Star Trek
  Fan Boys
Table Name: Customers
  Brian H. Prince
  Jason Argonaut
  Bill Gates
Account → Table → Entity
Tables store entities. Entity schema can vary in the same table.
Windows Azure Tables
• Provides Structured Storage
• Massively Scalable Tables
• Billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly Available & Durable
• Data is replicated several times
• Familiar and Easy to use API
• WCF Data Services and OData
• .NET classes and LINQ
• REST – with any platform or language

Is not relational
Can not:
• Create foreign key relationships between tables
• Perform server side joins between tables
• Create custom indexes on the tables
• No server side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
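To make the shape concrete: every entity carries the three system properties, and anything the service does not do server-side (such as Count()) falls to the client. A minimal in-memory stand-in (dictionaries, not the real table client; the movie data echoes the later partitioning example):

```python
def make_entity(partition_key, row_key, **props):
    """Entity = required system properties plus arbitrary schema-free properties."""
    return {"PartitionKey": partition_key, "RowKey": row_key,
            "Timestamp": "2010-12-07T00:00:00Z", **props}

table = [
    make_entity("Action", "Fast & Furious", ReleaseDate=2009),
    make_entity("Action", "The Bourne Ultimatum", ReleaseDate=2007),
    make_entity("Comedy", "Office Space", ReleaseDate=1999),
]
# No server-side Count(): counting a partition means scanning client-side.
action_count = sum(1 for e in table if e["PartitionKey"] == "Action")
```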
Windows Azure Queues
• Queues are performance efficient, highly available and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality
Partition key is unit of scale
System load balances
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for partition to be available on a different server
Server Busy
• Use exponential backoff on "Server Busy"
• Our system load balances to meet your traffic needs
• Single partition limits have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition
PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00
Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition
Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg
Messages – Queue Name
• All messages for a single queue belong to the same partition
Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has written to all three replicas
• Reads are only load balanced to replicas in sync
[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, Server 3]
Scalability Targets
Storage Account
• Capacity – Up to 100 TBs
• Transactions – Up to a few thousand requests per second
• Bandwidth – Up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
Partitions and Partition Ranges

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Server A: Table = Movies [Min - Max]
After load balancing splits the partition range:
Server A: Table = Movies [Min - Comedy)
Server B: Table = Movies [Comedy - Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
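In code, "expect continuation tokens" means the client must loop; here is the pattern in miniature (a toy paging function, with the 1000-row cap kept as the page size):

```python
def query(rows, token=0, page=1000):
    """Return at most `page` rows plus a token marking where to resume."""
    chunk = rows[token:token + page]
    next_token = token + page if token + page < len(rows) else None
    return chunk, next_token

rows = list(range(2500))
got, token = [], 0
while token is not None:          # keep querying until no continuation token
    chunk, token = query(rows, token)
    got.extend(chunk)
```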
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select PartitionKey and RowKey that help scale
Avoid "append only" patterns
• Distribute by using a hash etc. as prefix
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement back-off strategy for retries
• Server busy:
• Load balance partitions to meet traffic needs
• Load on single partition has exceeded the limits

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• Point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• This can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?

Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout)
RemoveMessage
Msg 2
Msg 1
Worker Role
Msg 2
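The lifecycle in the diagram hinges on the visibility timeout: GetMessage hides a message rather than removing it, and only an explicit delete removes it for good. Modeled with an explicit clock (a toy queue, not the storage client; the message id doubles as the pop receipt here):

```python
class Queue:
    """In-memory model of Put/Get/Delete with a visibility timeout."""
    def __init__(self):
        self.msgs = {}          # id -> (body, invisible_until)
        self.next_id = 0

    def put(self, body):
        self.msgs[self.next_id] = (body, 0)
        self.next_id += 1

    def get(self, now, timeout=30):
        for mid, (body, until) in sorted(self.msgs.items()):
            if until <= now:                     # visible?
                self.msgs[mid] = (body, now + timeout)  # hide, don't remove
                return mid, body
        return None

    def delete(self, mid):
        self.msgs.pop(mid, None)

q = Queue()
q.put("work item")
mid, _ = q.get(now=0)
reappeared = q.get(now=10)     # still invisible, nothing returned
visible_again = q.get(now=40)  # consumer never deleted it, so it's redelivered
```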
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a backoff polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1
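Those two rules, plus a cap (the "truncated" part), fit in one function; a sketch with a 60-second cap (the cap value is illustrative):

```python
def next_interval(interval, got_message, cap=60):
    """Double the polling interval on an empty poll (up to cap); reset on success."""
    if got_message:
        return 1
    return min(interval * 2, cap)

interval = 1
history = []
for got in [False, False, False, False, False, False, True]:
    interval = next_interval(interval, got)
    history.append(interval)
# intervals: 2, 4, 8, 16, 32, 60 (truncated), then back to 1
```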
Removing Poison Messages
Producers P1, P2 → Queue → Consumers C1, C2
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (2)
Producers P1, P2 → Queue → Consumers C1, C2
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (3)
Producers P1, P2 → Queue → Consumers C1, C2
1. Dequeue(Q, 30 sec) → msg 1
2. Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
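The consumer-side rule from the sequence above: every redelivery bumps the message's dequeue count, and once the count exceeds a threshold the consumer deletes the message instead of retrying the work. A sketch (the counting is done locally here for illustration; the real service tracks DequeueCount on the message):

```python
def handle(msg, dequeue_counts, threshold=2):
    """Process a delivered message, or delete it once it proves to be poison."""
    dequeue_counts[msg] = dequeue_counts.get(msg, 0) + 1
    if dequeue_counts[msg] > threshold:
        return "deleted"      # poison message: stop retrying
    return "processed"

counts = {}
outcomes = [handle("msg1", counts) for _ in range(3)]  # third delivery trips the threshold
```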
Queues Recap
Make message processing idempotent
• No need to deal with failures
Do not rely on order
• Invisible messages result in out-of-order delivery
Use dequeue count to remove poison messages
• Enforce a threshold on a message's dequeue count
Messages > 8 KB
• Use a blob to store message data, with a reference in the message
• Batch messages
• Garbage collect orphaned blobs
Use message count to scale
• Dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – Files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – Massively scalable structured storage
• Queues – Reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts, measure, and find what is ideal for you
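The trade-off can be put in numbers; the hourly rates below are hypothetical, chosen only so that eight 1-core instances cost the same as one 8-core VM: unless scaling across the large VM's cores is close to linear, the small instances deliver more throughput per dollar.

```python
def throughput_per_dollar(units, rate, scaling=1.0):
    """Compute units of work delivered per unit cost, given a scaling efficiency."""
    return units * scaling / rate

small = throughput_per_dollar(units=8, rate=8 * 0.12)       # 8 x 1-core VMs
xl_sublinear = throughput_per_dollar(8, 0.96, scaling=0.7)  # one 8-core VM, 70% scaling
xl_linear = throughput_per_dollar(8, 0.96, scaling=1.0)     # break-even only if linear
```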
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if number of active processes exceeds number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network IO-intensive, storage IO-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage IO-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between risk of failure/poor user experience due to not having excess capacity, and the costs of having idling VMs
Performance vs. Cost
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile
Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
Uncompressed Content → Gzip, Minify JavaScript, Minify CSS, Minify Images → Compressed Content
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
• Needs special result reduction processing
Large volume data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern
Leverage multi-core within one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
• Best practice: use test runs to profile, and set partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of an instance failure
(Task-flow diagram: a Splitting task fans out into BLAST tasks, which feed a Merging task.)
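The visibilityTimeout trade-off can be made concrete with a toy model of the queue's semantics (a sketch for illustration, not the real storage client):

```python
import time

class VisibilityQueue:
    """Toy model of Azure Queue visibility-timeout semantics: a message
    fetched with get() becomes invisible for the timeout, and reappears
    if the consumer does not delete() it in time (e.g., the worker crashed)."""

    def __init__(self):
        self._msgs = {}    # message id -> (body, time it becomes visible)
        self._next_id = 0

    def put(self, body):
        self._msgs[self._next_id] = (body, 0.0)
        self._next_id += 1

    def get(self, visibility_timeout):
        now = time.monotonic()
        for mid, (body, visible_at) in list(self._msgs.items()):
            if visible_at <= now:
                # Hide the message instead of removing it.
                self._msgs[mid] = (body, now + visibility_timeout)
                return mid, body
        return None

    def delete(self, mid):
        self._msgs.pop(mid, None)
```

A timeout shorter than the true task run time makes a healthy task's message reappear and be recomputed; a timeout much longer than the run time delays recovery when an instance actually fails.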
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST (architecture)
(Diagram: a Web Role hosts the Web Portal and Web Service for job registration. A Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work through a global dispatch queue to pools of worker instances. An Azure Table holds the Job Registry; Azure Blob storage holds the NCBI databases (BLAST databases, temporary data, etc.), kept current by a Database Updating Role. Each job runs as a split/join pipeline: a Splitting task fans out BLAST tasks, and a Merging task combines their results.)
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs
Authentication/authorization based on Live ID
The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory state
(Diagram: the Job Portal sits in the Web Role alongside the Web Service and job registration; behind it, the Job Scheduler, Scaling Engine, and Job Registry.)
Demonstration
R. palustris as a platform for H2 production
Eric Schadt (Sage), Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5000 proteins (~700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 cores: 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record looks like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
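A sketch of the kind of log analysis used here: find tasks that logged an "Executing" record but never a matching "done" record (illustrative Python):

```python
import re

def unfinished_tasks(log_lines):
    """Return the ids of tasks that were started ('Executing the task N')
    but never logged a matching 'Execution of task N is done' record."""
    started, done = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            done.add(m.group(1))
    return started - done
```

Running this over the records above would flag task 251774, which started at 8:22 but never completed on that node.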
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups (~6 nodes per group, over ~30 mins): this is an update domain at work.

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed and the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)
Computing Evapotranspiration (ET)
Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J g⁻¹)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
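A direct transcription of the Penman-Monteith formula for illustration (argument names follow the symbol list; consistent units are the caller's responsibility, with λv here in J g⁻¹ as defined above):

```python
def penman_monteith_et(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET: (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv).
    Defaults: γ ≈ 66 Pa/K and λv ≈ 2450 J/g, per the symbol list."""
    return (delta * Rn + rho_a * c_p * dq * g_a) / \
           ((delta + gamma * (1.0 + g_a / g_s)) * lambda_v)
```

The pipeline's job is to supply these inputs (radiation, humidity, conductivities) per pixel from the imagery and sensor datasets below, so the scalar formula itself is the easy part.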
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
(Pipeline diagram: the AzureMODIS Service web-role portal receives scientists' requests on a Request Queue; the Data Collection stage pulls source imagery from download sites via the Download Queue; the Reprojection stage, Derivation Reduction stage, and Analysis Reduction stage are driven by the Reprojection, Reduction 1, and Reduction 2 queues, consulting Source Metadata; scientists download the science results.)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the web-role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated worker role
  • Parses all job requests into tasks, i.e., recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
(Diagram: a <PipelineStage> request arrives at the MODISAzure Service (web role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (worker role) parses the job, persists <PipelineStage>TaskStatus, and dispatches to the <PipelineStage> task queue.)
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a GenericWorker (worker role)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status
(Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue; GenericWorker instances dequeue the tasks and read <Input> data storage.)
Example Pipeline Stage: Reprojection Service
(Diagram: a Reprojection Request enters the Job Queue, where each entity specifies a single reprojection job request. The Service Monitor (worker role) persists ReprojectionJobStatus, parses the job, persists ReprojectionTaskStatus, and dispatches to the Task Queue, where each entity specifies a single reprojection task (i.e., a single tile). GenericWorker instances dequeue the tasks; each task points to the SwathGranuleMeta table, queried for geo-metadata (e.g., boundaries) for each swath tile, and the ScanTimeList table, queried for the list of satellite scan times that cover a target tile. Workers read Swath Source Data Storage and write Reprojection Data Storage.)
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

(Annotated pipeline diagram, with costs per stage as drawn:)

Stage                | Data                    | Compute                          | Cost
Data Collection      | 400-500 GB, 60K files   | 10 MB/sec, 11 hours, <10 workers | $50 upload, $450 storage
Reprojection         | 400 GB, 45K files       | 3500 hours, 20-100 workers       | $420 CPU, $60 download
Derivation Reduction | 5-7 GB, 55K files       | 1800 hours, 20-100 workers       | $216 CPU, $1 download, $6 storage
Analysis Reduction   | <10 GB, ~1K files       | 1800 hours, 20-100 workers       | $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: "Channel 9 Windows Azure"
Bing: "Windows Azure Platform Training Kit – November Update"
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Service Definition
Service Configuration
GUI
Double click on Role Name in Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
  • An encrypted package of your code
  • Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)
Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API; easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: replicated, highly available, and load balanced
Durable Storage, At Massive Scale
• Blob – massive files, e.g., videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
  • PutBlob: inserts a new blob, overwrites the existing blob
  • GetBlob: get the whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
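The addressing scheme is simple enough to sketch (illustrative helper, not part of any SDK; public containers can be fetched with a plain HTTP GET, while private ones need a signed request):

```python
def blob_url(account, container, blob_name):
    """Build the REST address for a blob, mirroring the pattern above:
    http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>"""
    return f"http://{account}.blob.core.windows.net/{container}/{blob_name}"
```

For example, `blob_url("movieconversion", "originals", "barga.mpg")` reproduces the slide's sample address.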
Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private (default): requires the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
Block blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
  • Each block is identified by a Block ID
• Size limit: 200 GB per blob
Page blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
  • Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'
  • Each block has an ID
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
(Diagram: Big.mpg uploaded as blocks 1 6 8 3 5 4 7 2, then committed in order into Big.mpg.)
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount the Page Blob as X:\
    • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
  • All writes to the drive are made durable to the Page Blob
    • The drive is made durable through standard Page Blob replication
  • The drive persists even when not mounted
Windows Azure Drive API
• Create Drive: creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache: allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive: takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives: returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive: unmounts the drive and frees up the drive letter
• Snapshot Drive: allows the client application to create a backup of the drive (Page Blob)
• Copy Drive: provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table: Movies (entities: Star Wars; Star Trek; Fan Boys)
• Table: Customers (entities: Brian H. Prince; Jason Argonaut; Bill Gates)
The hierarchy is Account → Table → Entity.
Tables store entities. Entity schema can vary within the same table.
Windows Azure Tables
• Provides structured storage
  • Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language
Is not relational
Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
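A sketch of what this means for an entity (a plain dict for illustration, not the real SDK types; the property values are made up):

```python
# A hypothetical entity in a Movies table: the three required properties
# plus any user-defined ones; the schema may vary entity to entity.
movie = {
    "PartitionKey": "Action",             # chosen by you; the unit of scale
    "RowKey": "The Bourne Ultimatum",     # unique within the partition
    "Timestamp": "2010-12-07T00:00:00Z",  # maintained by the service
    "ReleaseDate": 2007,                  # user-defined property
}

REQUIRED = {"PartitionKey", "RowKey", "Timestamp"}
assert REQUIRED <= movie.keys()
```

PartitionKey + RowKey together form the entity's unique key, which is why their selection (covered below) matters so much for scalability.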
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
The partition key is the unit of scale
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
System load balancing
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
"Server Busy"
• Use exponential backoff on "Server Busy"
• Returned while the system load balances to meet your traffic needs, or when single-partition limits have been reached
Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.)
Scalability Targets
Storage account
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second
Single queue/table partition
• Up to 500 transactions per second
Single blob partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
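The '503 Server Busy' advice can be sketched as follows (illustrative Python; the `ServerBusyError` class stands in for an HTTP 503 from the storage service):

```python
import random
import time

class ServerBusyError(Exception):
    """Stand-in for an HTTP 503 'Server Busy' from the storage service."""

def with_backoff(op, max_retries=5, base=0.1, cap=30.0):
    """Retry op() on 'Server Busy', doubling the wait each attempt
    (truncated exponential backoff, with jitter to avoid lockstep retries)."""
    for attempt in range(max_retries):
        try:
            return op()
        except ServerBusyError:
            delay = min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
            time.sleep(delay)
    return op()  # final attempt; lets the error propagate to the caller
```

The jitter matters at scale: if many workers retry on a fixed schedule they hit the busy partition in lockstep and prolong the overload.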
Partitions and Partition Ranges

Initially one server can hold the whole table.
Server A: Table = Movies, range [Min – Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Under load, the system splits the table between servers along partition-key boundaries:

Server A: Table = Movies, range [Min – Comedy) (the Action and Animation partitions)
Server B: Table = Movies, range [Comedy – Max] (the Comedy, SciFi, and War partitions)
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency & speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A query returns a continuation token instead of the full result set when it hits any of:
• The maximum of 1000 rows in a response
• The end of a partition range boundary
• The maximum of 5 seconds to execute the query
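The resulting query loop can be sketched like this (the `page` callable is hypothetical, standing in for any query-with-continuation-token API):

```python
def query_all(query, page):
    """Drain a table query that may return continuation tokens.
    `page(query, token) -> (rows, next_token)` is a hypothetical callable;
    a None token signals that the result set is exhausted."""
    rows, token = [], None
    while True:
        batch, token = page(query, token)
        rows.extend(batch)
        if token is None:
            return rows
```

Code that assumes one response holds the whole result set silently drops rows whenever any of the three limits above is hit, which is why the slide says "seriously."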
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select a PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as a prefix
Avoid "append only" patterns
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• Server busy: partitions are being load balanced to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?
Queue Terminology / Message Lifecycle
(Diagram: a Web Role calls PutMessage to add messages (Msg 1 … Msg 4) to the Queue; Worker Roles call GetMessage (with a timeout), process the message, then call RemoveMessage.)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll increases the interval by 2x; a successful poll resets the interval back to 1.
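The polling policy can be sketched as a small generator (illustrative):

```python
def poll_intervals(polls, initial=1, cap=60):
    """Yield the wait (in seconds) before each poll: an empty poll doubles
    the interval (truncated at `cap`); a successful poll resets it to
    `initial`. `polls` is an iterable of booleans (True = got a message)."""
    interval = initial
    for got_message in polls:
        yield interval
        interval = initial if got_message else min(cap, interval * 2)
```

For a poll history of empty, empty, hit, empty this produces waits of 1, 2, 4, 1 seconds: idle queues are polled (and billed) less, while busy queues are drained promptly.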
Removing Poison Messages
(Diagram: producers P1 and P2 feed the queue; consumers C1 and C2 dequeue. Each message carries a 30 s visibility timeout and a dequeue count.)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (2)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1 (its dequeue count is now 2)
62
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
12
2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed
1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1
2
6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue
30
13
12
13
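The three-slide walkthrough boils down to a worker loop like this sketch (the queue client interface is hypothetical; the threshold matches the walkthrough, which deletes once DequeueCount > 2):

```python
MAX_DEQUEUE = 2  # matches the walkthrough: delete once DequeueCount > 2

def handle_one(queue, process):
    """One iteration of a worker loop with poison-message removal.
    `queue` is a hypothetical client exposing get_message()/delete_message(),
    with a dequeue_count attribute on each returned message."""
    msg = queue.get_message(visibility_timeout=30)
    if msg is None:
        return                      # queue empty
    if msg.dequeue_count > MAX_DEQUEUE:
        queue.delete_message(msg)   # poison: remove instead of retrying forever
        return
    process(msg)                    # if this crashes, the message reappears later
    queue.delete_message(msg)       # success: remove for good
```

Without the dequeue-count check, a message whose processing always crashes the worker would cycle through visibility timeouts forever.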
Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• Only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
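The slide's example is .NET 4's Task Parallel Library; as a language-neutral illustration of the data-parallelism idea, the same pattern in Python:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(fn, items, workers=8):
    """Data parallelism: apply fn to every item using a pool of workers,
    keeping all cores of the (paid-for) VM busy instead of one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))
```

Sizing `workers` to the instance's core count mirrors the earlier BLAST "-a" advice: 1, 2, 4, or 8 depending on VM size.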
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure / poor user experience from not having excess capacity, and the cost of having idling VMs
(Performance vs. cost)
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
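A quick sketch of point 1: for repetitive text output (HTML, JSON), gzip trades a little CPU for much smaller payloads. The sample page below is invented for illustration:

```python
import gzip

# Gzip-compressing repetitive text output before serving or storing it
# shrinks both bandwidth and storage bills.
page = ("<html><body>" + "<p>hello azure</p>" * 500 + "</body></html>").encode("utf-8")
compressed = gzip.compress(page)
print(f"{len(page)} bytes -> {len(compressed)} bytes")
```

In a web role the same effect comes from enabling IIS dynamic compression rather than compressing by hand.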
(Uncompressed content → gzip, minify JavaScript, minify CSS, minify images → compressed content)
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile inside and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
• Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage demand could reach 1 TB (100 nodes × a 10 GB database)
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model
• Web Role + Queue + Worker
• With three special considerations
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
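The query-segmentation pattern above can be sketched in a few lines. This is not the AzureBLAST code: `fake_blast` is a stand-in for invoking the real NCBI-BLAST binary on one partition.

```python
# Illustrative sketch of query segmentation: split the input sequences,
# process partitions independently, merge the per-partition results.
def split(sequences, partition_size):
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def fake_blast(partition):
    # Placeholder for running BLAST against the database for one partition.
    return [f"hit:{seq}" for seq in partition]

def merge(results):
    return [hit for part in results for hit in part]

sequences = [f"seq{i}" for i in range(10)]
partitions = split(sequences, 3)                 # 4 partitions: 3+3+3+1
hits = merge(fake_blast(p) for p in partitions)  # in AzureBLAST, done by workers
print(len(partitions), len(hits))
```

In the real system each partition becomes a queued task consumed by worker roles, and the merge is a separate merging task.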
AzureBLAST Task-Flow
A simple Split/Join pattern
Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition → load imbalance
• Small partition → unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long wait in case of an instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resources
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
(BLAST databases, temporary data, etc.)
Job Registry
NCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R. palustris as a platform for H2 production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment
Discovering homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM), four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise, something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
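The log-analysis idea can be sketched directly: pair each "Executing" record with its "done" record and flag tasks that never completed (the sample log below abbreviates the records shown on the slide):

```python
import re

# Flag tasks that started but never logged a completion record.
log = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
"""

started, finished = set(), set()
for line in log.splitlines():
    m = re.search(r"Executing the task (\d+)", line)
    if m:
        started.add(m.group(1))
    m = re.search(r"Execution of task (\d+) is done", line)
    if m:
        finished.add(m.group(1))

unfinished = started - finished
print(unfinished)
```

Runs that stall or take far longer than the estimate (like the 82-minute task above) show up the same way, by comparing timestamps within each pair.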
Surviving System Upgrades
North Europe datacenter: in total, 34,256 tasks processed
• All 62 compute nodes lost tasks and then came back in groups; this is an update domain
• ~30 mins, ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed
• 35 nodes experienced blob-writing failure at the same time
• A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)
Computing Evapotranspiration (ET)
Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
where:
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J g⁻¹)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)
Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs, big data reduction
• Some of the inputs are not so simple
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
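The Penman-Monteith formula above transcribes directly into code. The sketch below takes λv in J kg⁻¹ (≈2.45×10⁶; the slide lists J g⁻¹) and uses purely illustrative input values, not calibrated field data:

```python
# Penman-Monteith: ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0, lambda_v=2.45e6):
    """All inputs in SI units; lambda_v in J/kg (illustrative choice)."""
    numerator = delta * Rn + rho_a * cp * dq * ga
    denominator = (delta + gamma * (1.0 + ga / gs)) * lambda_v
    return numerator / denominator

# Invented sample inputs, roughly plausible magnitudes only.
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2,
                     cp=1005.0, dq=1000.0, ga=0.02, gs=0.01)
print(et)
```

In MODISAzure this arithmetic is the easy part; producing consistent Δ, Rn, ga, and gs grids from the imagery and sensor inputs is the bulk of the pipeline.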
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks persisted in Tables
<PipelineStage> Request
…
<PipelineStage>JobStatus Persist
<PipelineStage> Job Queue
MODISAzure Service (Web Role)
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
…
Dispatch <PipelineStage> Task Queue
MODISAzure Architectural Big Picture (2/2)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
GenericWorker (Worker Role)
…
Dispatch <PipelineStage> Task Queue
…
<Input> Data Storage
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatus Persist
Parse & Persist ReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMeta
Reprojection Data Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (i.e., a single tile)
Query this table to get geo-metadata (e.g., boundaries) for each swath tile
Query this table to get the list of satellite scan times that cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers
$50 upload, $450 storage
400 GB, 45K files, 3500 hours, 20-100 workers
5-7 GB, 55K files, 1800 hours, 20-100 workers
<10 GB, ~1K files, 1800 hours, 20-100 workers
$420 CPU, $60 download
$216 CPU, $1 download, $6 storage
$216 CPU, $2 download, $9 storage
AzureMODIS Service Web Role Portal
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Service Configuration
GUI
Double-click on the Role Name in the Azure Project
Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files:
• An encrypted package of your code
• Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes
• (which is better than six months)
Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure
1. Process the service model
1. Determine resource requirements
2. Create role images
2. Allocate resources
3. Prepare nodes
1. Place role images on nodes
2. Configure settings
3. Start roles
4. Configure load balancers
5. Maintain service health
1. If a role fails, restart the role based on policy
2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage, At Massive Scale
Blob – massive files, e.g. videos, logs
Drive – use standard file-system APIs
Tables – non-relational, but with few scale limits; use SQL Azure for relational data
Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
• PutBlob – inserts a new blob, overwrites the existing blob
• GetBlob – get the whole blob or a specific range
• DeleteBlob
• CopyBlob
• SnapshotBlob
• LeaseBlob
• Each blob has an address:
• http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
• http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private (default): requires the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
• Block blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob
• Page blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
(Example: Big.mpg uploaded as blocks 1 6 8 3 5 4 7 2, then committed as Big.mpg)
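The block-commit model can be simulated locally. The sketch below mirrors the Put Block / Put Block List operations in name only; it is not the Azure client library:

```python
# Local simulation of the block-blob model: upload named blocks, then
# commit a block list whose order defines the final blob.
uploaded = {}

def put_block(block_id, data):
    uploaded[block_id] = data          # staged but not yet part of any blob

def put_block_list(block_ids):
    # The committed blob is exactly the chosen blocks, in the chosen order.
    return b"".join(uploaded[b] for b in block_ids)

put_block("b1", b"big")
put_block("b2", b".mpg")
put_block("b3", b"-part")
blob = put_block_list(["b1", "b3", "b2"])  # commit in any order, any subset
print(blob)
```

This is why block blobs suit streaming uploads: blocks can arrive out of order and in parallel, and nothing is visible until the block list is committed.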
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
• Example: mount a Page Blob as X:
• http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists, as a Page Blob, even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letter and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
Table Name: Movies (Star Wars, Star Trek, Fan Boys)
Table Name: Customers (Brian H. Prince, Jason Argonaut, Bill Gates)
Account → Table → Entity
Tables store entities. Entity schema can vary in the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API
• WCF Data Services and OData
• .NET classes and LINQ
• REST – with any platform or language
Is Not Relational
Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
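A sketch of the entity model described above: every entity carries the three required properties, schemas may vary within one table, and entities sharing a PartitionKey live in the same partition. The sample data is invented:

```python
# Entities are property bags; only PartitionKey, RowKey, and Timestamp
# are required, and other properties can differ entity to entity.
movies = [
    {"PartitionKey": "Action", "RowKey": "Fast & Furious", "ReleaseDate": 2009},
    {"PartitionKey": "Action", "RowKey": "The Bourne Ultimatum", "ReleaseDate": 2007},
    {"PartitionKey": "Comedy", "RowKey": "Office Space", "Rating": "R"},  # different schema
]

# Entities with the same PartitionKey are served from the same partition.
partitions = {}
for e in movies:
    partitions.setdefault(e["PartitionKey"], []).append(e)
print(sorted(partitions))
```

(PartitionKey, RowKey) is effectively the primary key, so the partition grouping above is also the unit of load balancing and of entity-group transactions.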
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load-balances partitions based on traffic patterns
• Controls entity locality
The partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
The system load-balances
• Use exponential backoff on "Server Busy"
• The system load-balances to meet your traffic needs
• "Server Busy" can also mean single-partition limits have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition
PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00
Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition
Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg
Messages – Queue Name
• All messages for a single queue belong to the same partition
Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas in sync
(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3)
Scalability Targets
Storage account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single queue/table partition
• Up to 500 transactions per second
Single blob partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
Partitions and Partition Ranges
Initially, Server A serves the whole table: Table = Movies [Min – Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
After a split, the partition ranges are spread across servers:
Server A: Table = Movies [Min – Comedy)
Server B: Table = Movies [Comedy – Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability
Query efficiency and speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics and reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
• Maximum of 1,000 rows in a response
• At the end of a partition-range boundary
• Maximum of 5 seconds to execute the query
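Handling continuation tokens correctly is just a loop: reissue the query with the returned token until none comes back. Here is a sketch against a fake server that caps responses at 1,000 rows (the real token is opaque; an integer stands in for it here):

```python
# Fake table server: returns at most 1000 rows plus a continuation token.
ROWS = [f"row{i}" for i in range(2500)]

def query(token=None):
    start = token or 0
    page = ROWS[start:start + 1000]
    next_token = start + 1000 if start + 1000 < len(ROWS) else None
    return page, next_token

# Client side: loop until the server stops handing back a token.
results, token = [], None
while True:
    page, token = query(token)
    results.extend(page)
    if token is None:
        break
print(len(results))
```

Note that a response can also be short with a token still present (partition boundary, 5-second limit), so "fewer than 1,000 rows" is never a safe stopping condition; only the absence of a token is.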
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
• Select a PartitionKey and RowKey that help scale: distribute load by using a hash, etc., as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "Server Busy": either partitions are being load-balanced to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
• Tight coupling leads to brittleness
• This can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work-ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout)
RemoveMessage
Msg 2Msg 1
Worker Role
Msg 2
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
• Consider a back-off polling approach
• Each empty poll increases the interval by 2x
• A successful poll resets the interval back to 1
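The policy above fits in a few lines; the truncation cap of 64 is an assumption for illustration:

```python
# Truncated exponential back-off polling: every empty poll doubles the
# wait (up to a cap); a successful poll resets it to the minimum.
def next_interval(current, got_message, cap=64):
    if got_message:
        return 1
    return min(current * 2, cap)

interval = 1
history = []
for got in [False, False, False, True, False]:  # simulated poll outcomes
    interval = next_interval(interval, got)
    history.append(interval)
print(history)
```

In a worker role, `interval` would be the sleep time (in seconds, say) between GetMessage calls, so idle queues cost few transactions while busy queues are drained promptly.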
60
21
11
C1
C2
Removing Poison Messages
11
21
340
Producers Consumers
P2
P1
30
2 GetMessage(Q 30 s) msg 2
1 GetMessage(Q 30 s) msg 1
11
21
10
20
61
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
11
21
2 GetMessage(Q 30 s) msg 23 C2 consumed msg 24 DeleteMessage(Q msg 2)7 GetMessage(Q 30 s) msg 1
1 GetMessage(Q 30 s) msg 15 C1 crashed
11
21
6 msg1 visible 30 s after Dequeue30
12
11
12
62
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
12
2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed
1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1
2
6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue
30
13
12
13
Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
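The dequeue-count rule from the recap can be sketched as follows (the threshold, message shape, and `poison_bin` are invented for illustration; the real DequeueCount comes back with each GetMessage response):

```python
# Poison-message handling: if a message's dequeue count exceeds a
# threshold, delete it (optionally setting it aside for inspection)
# instead of letting it crash consumers forever.
MAX_DEQUEUE = 2

def handle(message, poison_bin):
    if message["dequeue_count"] > MAX_DEQUEUE:
        poison_bin.append(message["id"])  # remove instead of reprocessing
        return "deleted"
    return "processed"

poison = []
r1 = handle({"id": "msg1", "dequeue_count": 3}, poison)
r2 = handle({"id": "msg2", "dequeue_count": 1}, poison)
print(r1, r2, poison)
```

This is exactly the scenario in the diagrams above: a message that repeatedly reappears after its visibility timeout is assumed to be killing its consumers and is removed on the third delivery.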
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using much CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
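The slide names .NET-specific tools (I/O completion ports, the .NET 4 Task Parallel Library). As an illustration of the same data-parallel idea in Python with the standard library — the worker count and the `work` function are illustrative, not from the deck:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def work(item):
    # Stand-in for one unit of work (one "task" the role processes).
    return item * item

def run_parallel(items, workers=None):
    # Match the pool size to the core count, per the caveat above about
    # active workers exceeding the number of cores.
    workers = workers or (os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(work, items))
```

`pool.map` preserves input order, so this is the data-parallel case; task parallelism would submit heterogeneous callables with `pool.submit` instead.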
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing – they help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile.
Sending fewer things over the wire often means getting fewer things from storage.
Saving bandwidth costs often leads to savings in other places.
Sending fewer things means your VM has time to do other tasks.
All of these tips have the side benefit of improving your web app's performance and user experience.
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
(Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
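The compute-for-size trade-off above, illustrated with Python's standard `gzip` module (the sample payload is made up; level 9 is the slowest/smallest setting):

```python
import gzip

def compress(data: bytes) -> bytes:
    # Trade CPU for size: compresslevel=9 is slowest but smallest.
    return gzip.compress(data, compresslevel=9)

def decompress(data: bytes) -> bytes:
    return gzip.decompress(data)

# Repetitive text (like generated HTML/JS output) compresses very well.
payload = b"<div class='row'>hello</div>" * 1000
packed = compress(payload)
assert decompress(packed) == payload     # lossless round trip
assert len(packed) < len(payload) // 10  # far fewer bytes stored and sent
```

Every byte saved here is saved again on each download, which is why compression pays off on both the storage and bandwidth bills.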
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially – GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST) – needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage traffic could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern: split the input sequences, query the partitions in parallel, merge the results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations: batch job management, and task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern.
Leverage the multiple cores of one instance:
• the "-a" argument of NCBI-BLAST
• set to 1, 2, 4, 8 for small, medium, large and extra-large instance sizes
Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait before retry if an instance fails
(Task flow: splitting task → BLAST tasks run in parallel → merging task)
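The split/join task flow above, sketched in Python. The `align` function is a stand-in for invoking NCBI-BLAST on one partition (on Azure each partition would be a queued task picked up by a worker); names and the partition size are illustrative.

```python
def split(sequences, partition_size):
    # Splitting task: segment the input query sequences into partitions.
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def align(partition, database):
    # Stand-in for running NCBI-BLAST on one partition against `database`;
    # partitions are independent, so this step is pleasingly parallel.
    return [(seq, database) for seq in partition]

def merge(partial_results):
    # Merging task: concatenate per-partition results once all are done.
    return [hit for part in partial_results for hit in part]

sequences = [f"seq{i}" for i in range(10)]
parts = split(sequences, 3)                     # partitions of 3+3+3+1
results = merge(align(p, "nr") for p in parts)  # "nr" is a placeholder DB name
assert len(results) == len(sequences)
```

In the real system the splitter enqueues one message per partition, workers run `align` concurrently, and the merger fires only after every partition has reported completion.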
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size/instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resources
AzureBLAST
(Architecture diagram: a Web Role hosts the web portal, web service, job registration, job scheduler and scaling engine; accepted jobs go to a job registry in an Azure Table; tasks are dispatched through a global dispatch queue to worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a job management role and a database-updating role run alongside the workers. Each job flows through a splitting task, parallel BLAST tasks, and a merging task.)
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
Authentication/authorization is based on Live ID.
The accepted job is stored in the job registry table – for fault tolerance, avoid in-memory state.
Demonstration
R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time.
All-Against-All Experiment
Discovering homologs:
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM) across four datacenters – US (2), West Europe and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments; each segment is submitted to one deployment as one job for execution, and each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
(Diagram: deployments of 50–62 VMs each.)
End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record looks like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
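The analysis described above — pairing each "Executing" record with its "done" record and flagging tasks that never completed — can be sketched like this (log format per the sample; the regexes are assumptions about that format):

```python
import re

EXEC_RE = re.compile(r"(\S+ \S+) (\S+) Executing the task (\d+)")
DONE_RE = re.compile(r"Execution of task (\d+) is done")

def find_incomplete(lines):
    started, finished = {}, set()
    for line in lines:
        m = EXEC_RE.search(line)
        if m:
            started[m.group(3)] = m.group(1)  # task id -> start timestamp
        m = DONE_RE.search(line)
        if m:
            finished.add(m.group(1))
    # Tasks that were started but never reported completion.
    return {tid: ts for tid, ts in started.items() if tid not in finished}

log = [
    "3/31/2010 6:14 RD00155D3611B0 Executing the task 251523",
    "3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins",
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
assert find_incomplete(log) == {"251774": "3/31/2010 8:22"}
```

Task 251774 is the anomaly in the sample: it started but never logged completion, which is exactly the signature of the node losses discussed next.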
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups – this is an update domain at work (~30 mins, ~6 nodes in one group).

Surviving Storage Failures
West Europe datacenter: 30,976 tasks completed, and the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J g-1)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
ET Synthesizes Imagery, Sensors, Models and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage:
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
(Pipeline diagram: scientists submit requests through the AzureMODIS web-role portal; requests flow through a request queue and a download queue into the data collection stage, which pulls from the source imagery download sites; tiles then pass through the reprojection queue to the reprojection stage, the reduction 1 queue to the derivation reduction stage, and the reduction 2 queue to the analysis reduction stage; scientific results are then available for download.)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The MODISAzure Service (Web Role) is the front door: it receives all user requests and queues each request to the appropriate download, reprojection, or reduction job queue
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks – recoverable units of work
• The execution status of all jobs and tasks is persisted in Tables
(Diagram: a <PipelineStage> request is persisted as <PipelineStage>JobStatus, parsed and persisted as <PipelineStage>TaskStatus, and dispatched to the <PipelineStage> task queue.)
MODISAzure Architectural Big Picture (2/2)
All work is actually done by Generic Workers (Worker Roles), which:
• Dequeue tasks created by the Service Monitor
• Retry failed tasks 3 times
• Maintain all task status
(Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue; Generic Workers pull tasks from the queue and read from <Input>Data Storage.)
Example Pipeline Stage: Reprojection Service
(Diagram: a reprojection request enters the job queue; the Service Monitor persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches tasks to the task queue, from which Generic Workers pull work against reprojection data storage and swath source data storage.)
• Each job-queue entity specifies a single reprojection job request
• Each task-queue entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs are driven by the data scale and the need to run the reduction stages multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage                 | Data / files           | Compute                            | Cost
Data collection       | 400-500 GB, 60K files  | 10 MB/sec, 11 hours, <10 workers   | $50 upload, $450 storage
Reprojection          | 400 GB, 45K files      | 3500 hours, 20-100 workers         | $420 CPU, $60 download
Derivation reduction  | 5-7 GB, 55K files      | 1800 hours, 20-100 workers         | $216 CPU, $1 download, $6 storage
Analysis reduction    | <10 GB, ~1K files      | 1800 hours, 20-100 workers         | $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases of Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
GUI
Double-click on the role name in the Azure project.

Deploying to the cloud
• We can deploy from the portal or from script
• VS builds two files: an encrypted package of your code, and your config file
• You must create an Azure account, then a service, and then you deploy your code
• Deployment can take up to 20 minutes (which is better than six months)
Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale
• Blobs – massive files, e.g., videos, logs
• Drives – use standard file-system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  e.g., http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private (the default) – requires the account key for access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
Block blob:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks; each block is identified by a Block ID
• Size limit: 200 GB per blob
Page blob:
• Targeted at random read/write workloads
• Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks in any order into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• You can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
(Diagram: Big.mpg assembled by committing blocks that were uploaded out of order.)
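The block semantics above, illustrated with a small in-memory model (a toy stand-in, not the real storage API): blocks may arrive in any order, and the blob's content is fixed only by the order of the committed block list.

```python
class BlockBlob:
    """Toy model of a block blob: staged blocks plus an ordered commit."""
    def __init__(self):
        self.uncommitted = {}  # block id -> bytes (GC'd after ~a week if never committed)
        self.committed = b""

    def put_block(self, block_id, data):
        # Blocks can be uploaded in any order, even in parallel.
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: blob content is the blocks concatenated in the given order.
        self.committed = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()

blob = BlockBlob()
blob.put_block("b2", b"world")   # uploaded out of order
blob.put_block("b1", b"hello ")
blob.put_block_list(["b1", "b2"])
assert blob.committed == b"hello world"
```

Re-committing a different block list is how a block blob is modified: inserting, replacing, or dropping blocks without re-uploading the unchanged ones.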
Pages
• Similar to block blobs, but optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• You must have the lease ID to perform operations
• You can check the LeaseStatus property
• Currently can only be done through REST
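A sketch of the exclusive-write semantics described above, as a toy model (the real operations are the REST calls listed; `LeasedBlob` and its methods are illustrative, and Renew/Break and the 1-minute expiry are omitted):

```python
import itertools

class LeasedBlob:
    """Toy model of blob-lease semantics: while a lease is held,
    writes must present the current lease id."""
    _ids = itertools.count(1)

    def __init__(self):
        self.lease_id = None  # None: no lease held, blob is freely writable
        self.data = b""

    def acquire(self):
        # Acquire: a ~1-minute exclusive write lock in the real service.
        if self.lease_id is not None:
            raise RuntimeError("lease already held")
        self.lease_id = next(self._ids)
        return self.lease_id

    def write(self, data, lease_id=None):
        if self.lease_id is not None and lease_id != self.lease_id:
            raise PermissionError("write requires the current lease id")
        self.data = data

    def release(self, lease_id):
        if lease_id == self.lease_id:
            self.lease_id = None

blob = LeasedBlob()
lid = blob.acquire()
blob.write(b"v1", lease_id=lid)  # only the lease holder may write
blob.release(lid)
```

This is the pattern behind Windows Azure Drive below: the drive holds a lease on its backing page blob so only one VM instance writes at a time.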
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob; the drive is made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – lets an application specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives: the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – lets the client application create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence: call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table "Movies": Star Wars, Star Trek, Fan Boys
• Table "Customers": Brian H. Prince, Jason Argonaut, Bill Gates
Hierarchy: Account → Table → Entity.
Tables store entities; entity schema can vary within the same table.
Windows Azure Tables
• Provides structured storage: massively scalable tables with billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API: WCF Data Services and OData, .NET classes and LINQ, REST – with any platform or language

Is not relational. You cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• Run server-side aggregates – no server-side Count(), for example
All entities must have the following properties: Timestamp, PartitionKey, RowKey.
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
• The partition key is the unit of scale: it controls entity locality
• A partition can be served by a single server; the system load balances partitions based on traffic pattern
The system load balances:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
On "Server Busy":
• Use exponential backoff
• The system load balances to meet your traffic needs, but single-partition limits may have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey (entities with the same PartitionKey value are served from the same partition):

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order - 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order - 3             |              |                     | $10.00

Blobs – Container name + Blob name (every blob and its snapshots are in a single partition):

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue name (all messages for a single queue belong to the same partition):

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync
(Diagram: partitions P1…Pn replicated across Server 1, Server 2 and Server 3.)
Scalability Targets
Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single queue/table partition:
• Up to 500 transactions per second
Single blob partition:
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
Movies table (PartitionKey = Category, RowKey = Title):

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Partitions and Partition Ranges
• Initially a single server holds the whole table: Server A, Table = Movies [Min - Max]
• As load grows, the range is split across servers: Server A, Table = Movies [Min - Comedy); Server B, Table = Movies [Comedy - Max]
Key Selection: Things to Consider
Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions:
• Transactions across a single partition
• Transaction semantics, and fewer round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.
Entity group transactions
Expect Continuation Tokens ndash Seriously
Maximum of 1000 rows in a response
At the end of partition range boundary
Maximum of 1000 rows in a response
At the end of partition range boundary
Maximum of 5 seconds to execute the query
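The loop every table client needs, sketched against a stand-in paged-query function (`query_page` and its integer token are illustrative, not the real API, where the token is an opaque pair of header values):

```python
def query_page(rows, token=None, page_size=1000):
    # Stand-in for a table query: returns up to page_size rows plus a
    # continuation token when more remain (the real service may also
    # return a token at partition boundaries or after 5 s of execution).
    start = token or 0
    page = rows[start:start + page_size]
    next_token = start + page_size if start + page_size < len(rows) else None
    return page, next_token

def query_all(rows, page_size=1000):
    results, token = [], None
    while True:  # always handle continuation tokens
        page, token = query_page(rows, token, page_size)
        results.extend(page)
        if token is None:
            return results

data = list(range(2500))
assert query_all(data) == data  # fetched as 3 pages: 1000 + 1000 + 500
```

The key habit is structural: never treat a single response as the full result set, even when the query "should" return few rows, since a token can appear at any partition boundary.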
Tables Recap
• Efficient for frequently used queries; supports batch transactions; distributes load
• Select a PartitionKey and RowKey that help you scale; avoid "append only" patterns – distribute load by using a hash, etc., as a key prefix
• Always handle continuation tokens – expect them for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "Server Busy" – the system load balances partitions to meet traffic needs, and the load on a single partition may have exceeded the limits
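"Distribute by using a hash as prefix" can look like the sketch below. The bucket count and key format are illustrative assumptions; the point is that sequential natural keys (timestamps, incrementing IDs) get spread across partitions instead of appending to one hot partition.

```python
import hashlib

BUCKETS = 16  # illustrative; pick based on expected load

def partition_key(natural_key: str) -> str:
    # Stable hash bucket as a prefix: the same natural key always maps
    # to the same partition, but adjacent keys land in different ones.
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % BUCKETS
    return f"{bucket:02d}_{natural_key}"

keys = [partition_key(f"2010-12-07T10:00:{i:02d}") for i in range(100)]
assert len({k.split("_")[0] for k in keys}) > 1  # load spread over buckets
```

The trade-off: range queries over the natural key now have to fan out across all buckets (and, per the recap above, each of those sub-queries must still handle continuation tokens).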
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together but are not bound together – tight coupling leads to brittleness; decoupling can aid scaling and performance
• A queue can hold an unlimited number of messages; messages must be serializable as XML and are limited to 8 KB in size
• Commonly used with the work-ticket pattern
• Why not simply use a table?
Queue Terminology

Message Lifecycle
(Diagram: a Web Role calls PutMessage to add messages (Msg 1–4) to the queue; Worker Roles call GetMessage with a visibility timeout to receive a message, and RemoveMessage once processing completes.)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach:
• Each empty poll increases the interval by 2x
• A successful poll resets the interval back to 1
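The polling policy above in code. The cap value is an assumption — "truncated" means the interval stops growing at some maximum rather than doubling forever:

```python
def next_interval(current, got_message, initial=1.0, cap=32.0):
    # Empty poll: double the interval, truncated at `cap`.
    # Successful poll: reset to the initial interval.
    if got_message:
        return initial
    return min(current * 2, cap)

# Empty polls: 1 -> 2 -> 4 -> 8 -> 16 -> 32 -> 32 (truncated at the cap)
interval, seen = 1.0, []
for _ in range(6):
    interval = next_interval(interval, got_message=False)
    seen.append(interval)
assert seen == [2.0, 4.0, 8.0, 16.0, 32.0, 32.0]
assert next_interval(interval, got_message=True) == 1.0
```

A worker loop would sleep `interval` seconds between GetMessage calls, cutting transaction charges on idle queues while staying responsive once traffic returns.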
Removing Poison Messages
(Three diagram builds of the same scenario; producers P1, P2 put messages, consumers C1, C2 process them.)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2, so treat it as a poison message
13. C1: Delete(Q, msg 1)
Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
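Two of the recap items, idempotent processing and a dequeue-count threshold, combine naturally in one worker loop. An illustrative Python sketch (names are hypothetical; a real worker would read the dequeue count from the queue message itself):

```python
# Illustrative poison-message handling: once a message has been dequeued
# more times than the threshold, set it aside instead of retrying forever.
POISON_THRESHOLD = 3
dead_letter = []  # parked messages, kept for later inspection

def handle_message(msg: dict, process) -> str:
    """msg carries a 'dequeue_count', as an Azure queue message does."""
    if msg["dequeue_count"] > POISON_THRESHOLD:
        dead_letter.append(msg)   # remove from the queue, park for diagnosis
        return "poisoned"
    process(msg["body"])          # must be idempotent: it may run more than once
    return "processed"

status_ok = handle_message({"dequeue_count": 1, "body": "ok"}, process=lambda b: None)
status_bad = handle_message({"dequeue_count": 4, "body": "bad"}, process=lambda b: None)
```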
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
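As a rough, language-neutral illustration of the data-parallel pattern (sketched in Python here rather than the .NET Task Parallel Library; in CPython a CPU-bound workload would use a process pool instead of threads):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def work(n: int) -> int:
    """Stand-in for one independent unit of work."""
    return sum(i * i for i in range(n))

# Data parallelism: one task per input item, scheduled over the available
# cores. A thread pool is shown for brevity; CPU-bound work in CPython
# would use ProcessPoolExecutor, and .NET 4 code would use the TPL.
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    results = list(pool.map(work, [1_000] * 8))
```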
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of having idling VMs
[Diagram: performance vs. cost trade-off.]
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
[Diagram: uncompressed content passes through Gzip, JavaScript minification, CSS minification, and image minification to become compressed content.]
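Point 1 is cheap to verify: repetitive markup compresses dramatically. A small Python sketch using the stdlib gzip module:

```python
import gzip

# Repetitive HTML, typical of generated pages.
html = b"<html><body>" + b"<p>hello world</p>" * 500 + b"</body></html>"

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)  # far below 1 for markup like this
```

Every modern browser advertises gzip support via Accept-Encoding, so the server can apply this to all text output.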
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications

NCBI BLAST
BLAST (Basic Local Alignment Search Tool):
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
• Needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task Flow
A simple split/join pattern.
Leverage the multiple cores of one instance:
• The "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: profile with test runs and set the partition size to mitigate the overhead
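The query-segmentation split/join pattern can be sketched in a few lines of Python (the partition size and the stand-in per-partition work are illustrative, not AzureBLAST's actual code):

```python
def split(sequences: list, partition_size: int) -> list:
    """Query segmentation: carve the input into independent partitions."""
    return [sequences[i:i + partition_size]
            for i in range((0), len(sequences), partition_size)]

def merge(partial_results: list) -> list:
    """Join: concatenate per-partition results once all tasks finish."""
    return [hit for part in partial_results for hit in part]

seqs = [f"seq{i}" for i in range(10)]
parts = split(seqs, 4)  # 3 partitions: 4 + 4 + 2 sequences

# Each partition would normally be a queued BLAST task on its own worker;
# uppercasing stands in for running BLAST against one partition.
results = merge([[s.upper() for s in p] for p in parts])
```

Partition size is the knob the granularity bullets above describe: fewer, larger partitions reduce per-task overhead but risk load imbalance.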
Value of the visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
[Diagram: a splitting task fans out to many BLAST tasks, which feed a merging task.]
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources
AzureBLAST
[Architecture diagram: a Web Role exposes the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine over the job registry (Azure Table) and dispatches work through a global dispatch queue to pools of worker instances; a database-updating role refreshes the NCBI databases; Azure Blob storage holds the BLAST databases, temporary data, etc. Within a job, a splitting task fans out BLAST tasks to the workers and a merging task combines their results.]
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
The accepted job is stored into the job registry table:
• Fault tolerance – avoid in-memory state
[Diagram: the job portal fronts the web service and job registration; the job scheduler and scaling engine pick accepted jobs up from the job registry.]
Demonstration
R. palustris as a Platform for H2 Production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When loads become imbalanced, redistribute the load manually
[Diagram: instance counts per deployment, ranging from 50 to 62.]
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
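Finding failed tasks is then a matter of pairing "Executing" records with "is done" records. An illustrative Python sketch over lines like those above:

```python
import re

log_lines = [
    "3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...",
    "3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins",
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...",
]

started, finished = set(), set()
for line in log_lines:
    if m := re.search(r"Executing the task (\d+)", line):
        started.add(m.group(1))
    if m := re.search(r"Execution of task (\d+) is done", line):
        finished.add(m.group(1))

# Tasks that began but never logged completion: candidates for failure
# (or for a node lost to an upgrade, as the next slides show).
suspect = started - finished
```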
Surviving System Upgrades
North Europe Data Center: in total, 34,256 tasks processed.
All 62 compute nodes lost tasks and then came back in groups: this is an update domain.
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures
West Europe Data Center: 30,976 tasks were completed, and the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)
Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

Penman-Monteith (1964)
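A direct transcription of the Penman-Monteith form into Python (a sketch: the default λv and the sample magnitudes below are illustrative, not validated field values):

```python
def evapotranspiration(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """ET = (delta*Rn + rho_a*c_p*dq*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v)

    gamma defaults to ~66 Pa/K per the slide; lambda_v (latent heat of
    vaporization, J/g) is an illustrative default, not a measured value."""
    return (delta * Rn + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)

# Illustrative magnitudes only, chosen to exercise the formula:
et = evapotranspiration(delta=145.0, Rn=400.0, rho_a=1.2, c_p=1005.0,
                        dq=800.0, g_a=0.02, g_s=0.01)
```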
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration (evaporation through plant membranes) by plants.
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage:
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role portal; the request queue and download queue feed the data collection stage, which pulls from the source imagery download sites and source metadata; the reprojection queue feeds the reprojection stage; the reduction 1 and reduction 2 queues feed the derivation reduction and analysis reduction stages; scientific results are then available for download.]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables
[Diagram: a <PipelineStage> request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; Generic Workers (Worker Roles) dequeue the tasks and read from <Input> Data Storage.]
Example Pipeline Stage: Reprojection Service
[Diagram: a reprojection request is parsed by the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) and ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e., a single tile), then dispatches through the job and task queues to Generic Workers (Worker Roles). Workers query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile, and the ScanTimeList table for the list of satellite scan times that cover a target tile, reading from Swath Source Data Storage and writing Reprojection Data Storage.]
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
[Pipeline diagram annotated with per-stage scale and cost:
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage]
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Deploying to the Cloud
• We can deploy from the portal or from script
• VS builds two files:
• An encrypted package of your code
• Your config file
• You must create an Azure account, then a service, and then you deploy your code
• Can take up to 20 minutes (which is better than six months)

Service Management API
• REST-based API to manage your services
• X509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community and MSFT-built tools around the API – easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the "brain" behind Windows Azure:
1. Process the service model
• Determine resource requirements
• Create role images
2. Allocate resources
3. Prepare nodes
• Place role images on nodes
• Configure settings
• Start roles
4. Configure load balancers
5. Maintain service health
• If a role fails, restart the role based on policy
• If a node fails, migrate the role based on policy

Storage
Replicated, Highly Available, Load Balanced
Durable Storage, At Massive Scale
• Blob – massive files, e.g., videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
• PutBlob – inserts a new blob, overwrites an existing blob
• GetBlob – get the whole blob or a specific range
• DeleteBlob
• CopyBlob
• SnapshotBlob
• LeaseBlob
• Each blob has an address:
• http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
• http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private – the default; requires the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
Block blob:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob
Page blob:
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
Blocks
• You can upload a file in "blocks"; each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
[Diagram: blocks of Big.mpg uploaded out of order (1, 6, 8, 3, 5, 4, 7, 2) and committed in sequence to form Big.mpg.]
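The upload/commit protocol can be mimicked with a toy client: stage blocks in any order, then commit an ordered block list. A hypothetical Python sketch (this is not the real Storage Client Library API, just the shape of the protocol):

```python
import base64

class BlockBlob:
    """Toy stand-in for a block blob: staged blocks plus a commit step."""
    def __init__(self):
        self._staged = {}      # block id -> data, uncommitted
        self.committed = b""

    def put_block(self, block_id: str, data: bytes) -> None:
        self._staged[block_id] = data   # blocks may arrive in any order

    def put_block_list(self, ordered_ids: list) -> None:
        # Commit: the blob becomes the staged blocks joined in this order.
        self.committed = b"".join(self._staged[i] for i in ordered_ids)

blob = BlockBlob()
# Block IDs are base64 strings, as in the real service.
ids = [base64.b64encode(bytes([i])).decode() for i in range(3)]
for i, block_id in enumerate(ids):
    blob.put_block(block_id, f"chunk{i}".encode())
blob.put_block_list(ids)   # any commit order is allowed; this one is 0, 1, 2
```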
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
• Example: mount a Page Blob as X:\
• http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• The drive is made durable through standard Page Blob replication
• The drive persists, as a Page Blob, even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letters and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
[Diagram: a storage account (MovieData) contains tables; table "Movies" holds entities such as Star Wars, Star Trek, Fan Boys, and table "Customers" holds entities such as Brian H. Prince, Jason Argonaut, Bill Gates. Hierarchy: Account → Table → Entity.]
Tables store entities. Entity schema can vary within the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables: billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available and durable: data is replicated several times
• Familiar and easy-to-use API: WCF Data Services and OData; .NET classes and LINQ; REST – with any platform or language

Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
The partition key is the unit of scale:
• A partition can be served by a single server
• The system load-balances partitions based on traffic patterns
• Controls entity locality
The system load-balances:
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
On "Server Busy":
• Use exponential backoff
• The system load-balances to meet your traffic needs
• Or: single-partition limits have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey:
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name:
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue name:
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync
[Diagram: partitions P1 … Pn replicated across Server 1, Server 2, and Server 3.]
Scalability Targets
Storage account:
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single queue/table partition:
• Up to 500 transactions per second
Single blob partition:
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.
Partitions and Partition Ranges

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Initially Server A serves the entire table: Movies [Min – Max]. As load grows, the range is split: Server A serves Movies [Min – Comedy) and Server B serves Movies [Comedy – Max].
Key Selection: Things to Consider
Scalability:
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability
Query efficiency and speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions:
• Transactions across a single partition
• Transaction semantics, and reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.

Expect Continuation Tokens – Seriously
A response is capped at:
• A maximum of 1000 rows
• The end of a partition-range boundary
• A maximum of 5 seconds to execute the query
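Handling continuation tokens correctly means looping until the service stops returning one. An illustrative Python sketch, with a fake paged query standing in for the table service:

```python
def query_all(fetch_page):
    """Drain a query by always honoring continuation tokens."""
    results, token = [], None
    while True:
        page, token = fetch_page(token)   # a server may return <= 1000 rows
        results.extend(page)
        if token is None:                 # no token means the query is drained
            return results

# A fake three-page server: each call returns up to 2 rows plus the
# continuation token for the next call (None when exhausted).
data = list(range(5))
def fetch_page(token):
    start = token or 0
    page = data[start:start + 2]
    next_token = start + 2 if start + 2 < len(data) else None
    return page, next_token

rows = query_all(fetch_page)
```

Code that stops after the first page silently drops rows, even when far fewer than 1000 rows exist, because a token can also be issued at a partition-range boundary or the 5-second limit.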
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select a PartitionKey and RowKey that help scale:
• Avoid "append only" patterns
• Distribute by using a hash, etc., as a prefix
Always handle continuation tokens:
• Expect continuation tokens for range queries
"OR" predicates are not optimized:
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries:
• On "Server Busy": partitions are load-balanced to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
bull Use a new context for each logical operationbull AddObjectAttachTo can throw exception if entity is already being tracked
bull Point query throws an exception if resource does not exist Use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together; tight coupling leads to brittleness
• This can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML, and are limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
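The work ticket pattern can be sketched in plain Python; the `blobs` dict and `queue.Queue` below are stand-ins for Azure blob storage and an Azure queue, not SDK calls:

```python
import queue

blobs = {}                   # stand-in for blob storage
work_queue = queue.Queue()   # stand-in for an Azure queue (messages are tiny)

def submit_job(job_id, payload):
    blob_ref = f"jobs/{job_id}"
    blobs[blob_ref] = payload        # large payload goes into a blob
    work_queue.put(blob_ref)         # only the small "ticket" is enqueued

def worker():
    ticket = work_queue.get()        # GetMessage
    result = blobs[ticket].upper()   # fetch and "process" the payload
    work_queue.task_done()           # RemoveMessage after success
    return result

submit_job(1, "x" * 100_000)         # far larger than the 8 KB message limit
assert worker() == "X" * 100_000
```

The queue message stays well under the 8 KB limit because it carries only a reference; the heavy data lives in a blob.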
Queue Terminology
Message Lifecycle
A Web Role enqueues work with PutMessage (Msg 1 … Msg 4 sit in the Queue); Worker Roles retrieve messages with GetMessage (with a visibility timeout) and delete them with RemoveMessage once processing succeeds.
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a back-off polling approach: each empty poll increases the interval by 2x; a successful poll sets the interval back to 1.
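The truncated exponential back-off rule sketched in Python (the 1 s floor and 60 s cap are illustrative):

```python
def next_poll_interval(current, got_message, lo=1.0, hi=60.0):
    """Truncated exponential back-off: an empty poll doubles the interval
    (capped at `hi`); a successful poll resets it to `lo`."""
    return lo if got_message else min(current * 2, hi)

interval = 1.0
history = []
for got_message in [False, False, False, True, False]:
    interval = next_poll_interval(interval, got_message)
    history.append(interval)

assert history == [2.0, 4.0, 8.0, 1.0, 2.0]
```

This keeps per-transaction polling costs down on an idle queue while still reacting quickly once messages start arriving.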
Removing Poison Messages
Producers (P1, P2) enqueue messages; consumers (C1, C2) process them.
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (continued)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (continued)
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible again 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
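The poison-message handling shown above – retry until the dequeue count exceeds a threshold, then remove the message – sketched in pure Python (the message shape and `MAX_DEQUEUE` value are illustrative, not the Azure SDK):

```python
import queue

MAX_DEQUEUE = 3          # threshold on a message's dequeue count
poison = []              # diverted messages for offline inspection

def get_message(q):
    msg = q.get()
    msg["dequeue_count"] += 1        # the service tracks DequeueCount for you
    return msg

def handle(q, msg):
    if msg["dequeue_count"] > MAX_DEQUEUE:
        poison.append(msg)           # give up: remove the poison message
        return
    try:
        raise RuntimeError("crash")  # a message this consumer can never process
    except RuntimeError:
        q.put(msg)                   # visibility timeout expires, msg reappears

q = queue.Queue()
q.put({"body": "bad", "dequeue_count": 0})
while not q.empty():
    handle(q, get_message(q))

assert len(poison) == 1 and poison[0]["dequeue_count"] == MAX_DEQUEUE + 1
```

Without the threshold, a message that always crashes its consumer would circulate forever.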
Queues Recap
• Make message processing idempotent – no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
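A language-agnostic sketch of the same two patterns (the deck's examples assume .NET 4's Task Parallel Library; here Python's `concurrent.futures` stands in):

```python
from concurrent.futures import ThreadPoolExecutor

def expensive(n):            # stand-in for real CPU or I/O work
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism: the same operation mapped over a collection.
    results = list(pool.map(expensive, range(10)))

    # Task parallelism: independent tasks scheduled side by side.
    f1 = pool.submit(sum, range(100))
    f2 = pool.submit(max, range(100))

assert results == [n * n for n in range(10)]
assert (f1.result(), f2.result()) == (4950, 99)
```

Either way, the goal is the same as on Azure: keep all the cores of the VM you are already paying for busy.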
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure / poor user experience due to not having excess capacity, and the costs of having idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing – they help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile.
Sending fewer things over the wire often means getting fewer things from storage.
Saving bandwidth costs often leads to savings in other places.
Sending fewer things means your VM has time to do other tasks.
All of these tips have the side benefit of improving your web app's performance and user experience.
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
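As a rough illustration of point 1, Python's gzip module shows how well repetitive markup compresses (the sample page below is made up):

```python
import gzip

# A toy HTML page with the kind of repetition real markup has.
html = b"<html><body>" + b"<p>hello world</p>" * 500 + b"</body></html>"

packed = gzip.compress(html)
ratio = len(packed) / len(html)
print(f"{len(html)} -> {len(packed)} bytes ({ratio:.1%})")

# Repetitive markup compresses very well; round-trips losslessly.
assert len(packed) < len(html) // 10
assert gzip.decompress(packed) == html
```

The compute cost of compressing is usually tiny next to the bandwidth saved on every response.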
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST): needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • split the input sequences
  • query partitions in parallel
  • merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations: batch job management; task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out into many BLAST tasks, which a merging task then joins.
Leverage the multi-core capability of one instance:
• argument "-a" of NCBI-BLAST
• 1/2/4/8 for the small, medium, large, and extra-large instance sizes
Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting in case of instance failure
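The split/join pattern above in miniature (pure Python; `blast_task` is a stand-in for invoking NCBI-BLAST on one partition, and the partition size of 100 is illustrative):

```python
def split(sequences, partition_size=100):
    """Splitting task: cut the input into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    """Stand-in for running NCBI-BLAST over one partition of queries."""
    return [f"hit:{seq}" for seq in partition]

def merge(results):
    """Merging task: join the per-partition results back together."""
    return [hit for part in results for hit in part]

seqs = [f"seq{i}" for i in range(250)]
partitions = split(seqs)                         # 3 partitions: 100 + 100 + 50
hits = merge(blast_task(p) for p in partitions)
assert len(partitions) == 3 and len(hits) == 250
```

In AzureBLAST each `blast_task` call becomes a queue message picked up by a worker instance, so the partitions run in parallel.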
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources
AzureBLAST
• Web Role: web portal and web service for job registration
• Job Management Role: job scheduler and scaling engine; accepted jobs are recorded in the job registry (an Azure Table) and tasks are dispatched through a global dispatch queue
• Worker roles: execute the task flow (splitting task → BLAST tasks → merging task)
• Database-updating role: keeps the NCBI databases current
• Azure Blob: NCBI databases, BLAST databases, temporary data, etc.
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
Authentication/authorization is based on Live ID.
The accepted job is stored in the job registry table:
• Fault tolerance – avoid in-memory state
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (42 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances occur, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g. task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe Data Center: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups – this is an update domain (~6 nodes in one group, ~30 mins apart).
Surviving Storage Failures
West Europe Datacenter: 30,976 tasks were completed and the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.
MODISAzure Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." – Irish proverb
Computing Evapotranspiration (ET)
ET = water volume evapotranspired (m3 s-1 m-2); Δ = rate of change of saturation specific humidity with air temperature (Pa K-1); λv = latent heat of vaporization (J/g); Rn = net radiation (W m-2); cp = specific heat capacity of air (J kg-1 K-1); ρa = dry air density (kg m-3); δq = vapor pressure deficit (Pa); ga = conductivity of air (inverse of ra) (m s-1); gs = conductivity of plant stoma, air (inverse of rs) (m s-1); γ = psychrometric constant (γ ≈ 66 Pa K-1)
Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
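A sketch evaluating the Penman-Monteith formula above; the input values are illustrative, not measured data, and λv is expressed in J/kg for unit consistency:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2.45e6):
    """ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)."""
    return (delta * r_n + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)

# Illustrative mid-day values: Δ=145 Pa/K, Rn=400 W/m², ρa=1.2 kg/m³,
# cp=1005 J/(kg·K), δq=1000 Pa, ga=0.02 m/s, gs=0.01 m/s
et = penman_monteith_et(145.0, 400.0, 1.2, 1005.0, 1000.0, 0.02, 0.01)
assert et > 0.0   # water is being released, not absorbed

# Sanity check: more net radiation should mean more evapotranspiration.
assert penman_monteith_et(145.0, 800.0, 1.2, 1005.0, 1000.0, 0.02, 0.01) > et
```

In the MODISAzure pipeline this single formula is evaluated per pixel per day, which is what turns a simple equation into a big data-reduction problem.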
ET Synthesizes Imagery, Sensors, Models and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage:
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
Pipeline flow: scientists submit requests through the AzureMODIS Service Web Role Portal (request queue); source imagery is fetched from the download sites (download queue) in the data collection stage, then passes through the reprojection stage (reprojection queue), the derivation reduction stage (reduction 1 queue), and the analysis reduction stage (reduction 2 queue), with source metadata tracked throughout; science results are available for scientists to download.
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The MODISAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
Flow: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role (GenericWorker):
• Dequeues tasks created by the Service Monitor from the <PipelineStage> Task Queue
• Retries failed tasks 3 times
• Maintains all task status
• Reads from <Input>Data Storage
Example Pipeline Stage: Reprojection Service
A Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus (each entity specifies a single reprojection job request), parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile), and dispatches to the Task Queue consumed by GenericWorker (Worker Role) instances.
Supporting tables and storage:
• SwathGranuleMeta – query this table to get geo-metadata (e.g. boundaries) for each swath tile
• ScanTimeList – query this table to get the list of satellite scan times that cover a target tile
• Swath Source Data Storage and Reprojection Data Storage hold the source and reprojected tiles
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Per-stage figures:
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Service Management API
• REST-based API to manage your services
• X.509 certs for authentication
• Lets you create, delete, change, upgrade, swap, …
• Lots of community- and MSFT-built tools around the API – easy to roll your own
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced
Durable Storage at Massive Scale
• Blob – massive files, e.g. videos, logs
• Drive – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface:
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private – default; will require the account key to access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
Block blob:
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob
Page blob:
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
Blocks
• You can upload a file in "blocks"; each block has an ID
• Then commit those blocks in any order into a blob
• The final blob is limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
Example: Big.mpg assembled from blocks 1-8 (shown uploading as 1 6 8 3 5 4 7 2) and committed as Big.mpg.
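A toy model of block upload and commit (pure Python, not the storage API): blocks can arrive in any order; the committed block list defines the blob:

```python
import hashlib

blocks = {}   # uncommitted block store (real uncommitted blocks are GC'd after a week)

def put_block(data: bytes) -> str:
    block_id = hashlib.md5(data).hexdigest()   # any unique block ID works
    blocks[block_id] = data
    return block_id

def put_block_list(block_ids) -> bytes:
    # Committing assembles the blob in the order of the ID list,
    # regardless of the order the blocks were uploaded in.
    return b"".join(blocks[b] for b in block_ids)

payload = b"0123456789" * 8
chunks = [payload[i:i + 16] for i in range(0, len(payload), 16)]
ids = [put_block(c) for c in reversed(chunks)]   # upload out of order
blob = put_block_list(reversed(ids))             # commit in the right order
assert blob == payload
```

Separating upload from commit is what makes parallel and resumable uploads of large streaming files possible.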
Pages
• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap them with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table Name: Movies – entities: Star Wars, Star Trek, Fan Boys
• Table Name: Customers – entities: Brian H. Prince, Jason Argonaut, Bill Gates
Hierarchy: Account → Table → Entity
Tables store entities. Entity schema can vary within the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables:
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable: data is replicated several times
• Familiar and easy-to-use API:
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is Not Relational
Can not:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
• Controls entity locality
The partition key is the unit of scale:
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
The system load balances:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
"Server Busy":
• Use exponential back-off on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey (entities with the same PartitionKey value are served from the same partition):

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name (every blob and its snapshots are in a single partition):

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue name (all messages for a single queue belong to the same partition):

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
(Each partition P1, P2, …, Pn is replicated across Server 1, Server 2, and Server 3.)
Scalability Targets
Storage Account:
• Capacity – up to 100 TBs
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition:
• Up to 500 transactions per second
Single Blob Partition:
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential back-off.
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Partitions and Partition Ranges
Server B: Table = Movies [Comedy - Max]
Server A: Table = Movies [Min - Comedy)
Server A: Table = Movies [Min - Max]
Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.
Expect Continuation Tokens – Seriously
A query can return a continuation token:
• At a maximum of 1,000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds of query execution
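Because a token can arrive for any of the three reasons above — even with an empty page of results — the drain loop should key off the token, not off whether rows came back. A minimal sketch in Python; `query_page` is a placeholder for one round trip against the Table service, returning `(rows, next_token)` with `next_token` as `None` when the result set is exhausted:

```python
def query_all(query_page):
    """Drain a table query that may return continuation tokens.

    `query_page(token)` performs one request and returns (rows, next_token).
    Loop until the service stops handing back a token; a page may be empty
    and still carry a token (e.g. at a partition range boundary).
    """
    token = None
    results = []
    while True:
        rows, token = query_page(token)
        results.extend(rows)
        if token is None:
            return results
```

The same loop shape applies whether the token is carried in response headers (REST) or surfaced by a client library.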
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help scale; avoid "append only" patterns
• Distribute by using a hash, etc., as a prefix
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• "Server busy" means partitions are being load balanced to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
• Tight coupling leads to brittleness; loose coupling via queues can aid scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML and are limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology

Message Lifecycle

[Diagram: a Web Role calls PutMessage to add Msg 1–4 to the queue; Worker Roles call GetMessage (with a visibility timeout) to receive messages, and RemoveMessage to delete them once processed]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a back-off polling approach: each empty poll increases the polling interval by 2x (up to a cap), and a successful poll resets the interval back to the minimum.
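The back-off rule above can be sketched in a few lines of Python. This is a minimal illustration, not Azure SDK code; the interval constants and the `queue_client`/`handle` interfaces are placeholders for your own queue wrapper around the REST GetMessage call:

```python
import time

MIN_INTERVAL = 1.0   # seconds; reset here after a successful poll
MAX_INTERVAL = 60.0  # truncation cap for the back-off

def next_interval(current, got_message):
    """Next polling delay: double on an empty poll (capped), reset on success."""
    if got_message:
        return MIN_INTERVAL
    return min(current * 2, MAX_INTERVAL)

def poll(queue_client, handle, polls):
    """Drive `polls` iterations of truncated exponential back-off polling."""
    interval = MIN_INTERVAL
    for _ in range(polls):
        msg = queue_client.get_message()  # returns a message or None
        if msg is not None:
            handle(msg)
        interval = next_interval(interval, msg is not None)
        if msg is None:
            time.sleep(interval)  # back off only when the queue was empty
```

The truncation cap matters: without it, a long-idle queue would push the polling interval out indefinitely and make the first new message wait arbitrarily long.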
Removing Poison Messages

[Diagram: producers P1 and P2 feed queue Q; consumers C1 and C2 dequeue with a 30-second visibility timeout]

1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
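The final step — deleting a message whose dequeue count has crossed a threshold — is the poison-message guard. A hedged Python sketch of a consumer with that guard; `msg.dequeue_count`, `queue.delete`, and `dead_letter` are placeholder interfaces for your own wrapper (Azure's Queue REST API surfaces the count as `x-ms-dequeue-count`):

```python
POISON_THRESHOLD = 3  # matches the DequeueCount > 2 check in the scenario above

def process_queue_message(msg, queue, handle, dead_letter):
    """Consume one message with poison-message protection.

    Returns True if the message was processed, False if it was removed
    as poison. `handle` may raise; the message then simply becomes
    visible again when its timeout expires.
    """
    if msg.dequeue_count >= POISON_THRESHOLD:
        # The message has already crashed earlier consumers; stash it
        # for offline inspection instead of letting it poison the queue.
        dead_letter(msg)
        queue.delete(msg)
        return False
    handle(msg)          # do the actual work
    queue.delete(msg)    # delete only after successful processing
    return True
```

Deleting only after `handle` succeeds is what gives the at-least-once guarantee; the threshold check is what keeps "at least once" from becoming "forever" for a message that always crashes its consumer.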
Queues Recap
• Make message processing idempotent – then there is no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Use a blob to store message data, with a reference in the message – for messages > 8 KB; batch messages; garbage collect orphaned blobs
• Use message count to scale – dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts, measure, and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up the CPU vs. keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
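The data-parallel idea — split the input and map the same function over the pieces, one unit of concurrency per core — is what the Task Parallel Library's `Parallel.For`/PLINQ give you in .NET. A language-neutral sketch of the same pattern in Python (the chunking scheme and `process_chunk` workload are illustrative choices, not anything prescribed by the slide):

```python
from concurrent.futures import ThreadPoolExecutor
import os

def process_chunk(chunk):
    # Stand-in for real per-item work (parsing, transformation, I/O, ...)
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=None):
    """Data parallelism: partition the input and run the same function
    over the partitions concurrently, roughly one worker per core."""
    workers = workers or os.cpu_count() or 2
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))  # join the partial results
```

As the slide warns, oversubscription cuts both ways: more workers than cores helps I/O-bound chunks but only adds scheduling overhead for CPU-bound ones.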
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content → (Gzip; minify JavaScript; minify CSS; minify images) → compressed content]
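To see why point 1 pays for itself, it helps to measure: markup and script are highly repetitive, so gzip routinely shrinks them by an order of magnitude. A self-contained illustration with Python's standard `gzip` module (the sample page content is made up for the demonstration):

```python
import gzip

def gzip_payload(text: str) -> bytes:
    """Gzip a response body, trading a little CPU for bandwidth."""
    return gzip.compress(text.encode("utf-8"))

# Repetitive content (HTML, JSON, JavaScript) compresses dramatically.
page = "<div class='row'>item</div>\n" * 1000
compressed = gzip_payload(page)
ratio = len(compressed) / len(page.encode("utf-8"))  # compressed/original size
```

In a real role you would set `Content-Encoding: gzip` (or let IIS dynamic compression do it) rather than compress by hand, but the bandwidth arithmetic is the same.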
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• One of the most important pieces of software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially: GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST): needs special result-reduction processing
Large volumes of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak demand on storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query the partitions in parallel
  • Merge the results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.

AzureBLAST Task Flow: a simple split/join pattern

[Diagram: a splitting task fans out to many parallel BLAST tasks, whose outputs feed a merging task]

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST: 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: do test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of an instance failure
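The split/join task flow can be sketched compactly. This is an illustrative Python sketch, not the AzureBLAST code: `blast_one_partition` stands in for a worker role consuming one partition's queue message and running NCBI-BLAST on it, and the merge step is a plain concatenation of per-partition hit lists:

```python
def split_sequences(sequences, partition_size):
    """Query segmentation: fixed-size partitions of the input sequences.
    (The micro-benchmarks below settled on ~100 sequences per partition.)"""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def run_blast_job(sequences, blast_one_partition, partition_size=100):
    """Split/join driver for a query-segmentation BLAST job.

    In AzureBLAST each partition becomes a queue message picked up by a
    worker role; here the fan-out is simulated by a sequential loop over
    `blast_one_partition`, a placeholder for the real BLAST invocation.
    """
    partitions = split_sequences(sequences, partition_size)   # split
    results = [blast_one_partition(p) for p in partitions]    # map
    return [hit for part in results for hit in part]          # join/merge
```

Because query segmentation never splits the database, each worker can keep a local copy of it warm across partitions — the source of the cache effect noted in the benchmarks below.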
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resources
AzureBLAST

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine; a global dispatch queue feeds the Worker roles; a Database Updating Role refreshes the NCBI databases; an Azure Table holds the Job Registry; Azure Blob storage holds the BLAST databases, temporary data, etc.]
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
The accepted job is stored in the Job Registry table
• Fault tolerance: avoid in-memory state
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (~700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
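As a quick sanity check, the quoted minute count converts to single-machine years directly (a back-of-the-envelope calculation, using a 365-day year):

```python
minutes = 3_216_731                   # single-desktop estimate from the sampling runs
years = minutes / (60 * 24 * 365)     # minutes -> hours -> days -> years
# roughly 6.1 years of continuous compute on one machine
```

which is what makes the experiment infeasible without the kind of fan-out described next.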
Our Approach
• Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM), across four datacenters — US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment was submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances occurred, the load was redistributed manually

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record looks like:

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in groups — this is an update domain
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed before the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain was at work
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" — Irish proverb

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration — evaporation through plant membranes — by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
• ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m⁻²)
• cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
• ρa = dry air density (kg m⁻³)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s⁻¹)
• gs = conductivity of plant stoma air (inverse of rs) (m s⁻¹)
• γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
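Per pixel, the Penman-Monteith formula itself is a one-liner; the pipeline's work is assembling its inputs at scale. A direct transcription in Python, using the symbol list from the slide — the default values for γ and λv are common textbook approximations, not values taken from the MODISAzure pipeline:

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2.45e3):
    """Penman-Monteith evapotranspiration:

        ET = (delta*Rn + rho_a*c_p*dq*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v)

    delta: d(saturation specific humidity)/dT (Pa/K); r_n: net radiation (W/m^2);
    rho_a: dry air density (kg/m^3); c_p: specific heat of air (J/(kg K));
    dq: vapor pressure deficit (Pa); g_a, g_s: conductivities (m/s);
    gamma ~ 66 Pa/K; lambda_v: latent heat of vaporization (J/g).
    """
    return (delta * r_n + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)
```

Note how ga appears in both numerator and denominator: ET does not grow without bound with air conductivity, which is one reason the conductivity estimates the slide calls "tricky" matter so much.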
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal to a Request Queue; a Download Queue drives the Data Collection Stage against the source imagery download sites; a Reprojection Queue drives the Reprojection Stage; Reduction 1 and Reduction 2 Queues drive the Derivation and Analysis Reduction Stages; science results are downloaded by scientists]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks — recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Generic Worker (Worker Role) pulls from the <PipelineStage> Task Queue and reads <Input> Data Storage; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus]
Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request flows to the Service Monitor (Worker Role), which persists ReprojectionJobStatus — each entity specifies a single reprojection job request — and parses and persists ReprojectionTaskStatus — each entity specifies a single reprojection task, i.e. a single tile; Generic Workers (Worker Roles) dispatch from the Task Queue and read Swath Source Data Storage; the SwathGranuleMeta table is queried for geo-metadata (e.g. boundaries) for each swath tile; the ScanTimeList table is queried for the list of satellite scan times that cover a target tile]
Costs for 1 US Year ET Computation
• Computational costs are driven by the data scale and the need to run reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage                  Data volume                        Compute                       Cost
Data collection        400–500 GB, 60K files, 10 MB/sec   11 hours, <10 workers         $50 upload, $450 storage
Reprojection           400 GB, 45K files                  3500 hours, 20–100 workers    $420 CPU, $60 download
Derivation reduction   5–7 GB, 55K files                  1800 hours, 20–100 workers    $216 CPU, $1 download, $6 storage
Analysis reduction     <10 GB, ~1K files                  1800 hours, 20–100 workers    $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers

Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
The Secret Sauce – The Fabric
The Fabric is the 'brain' behind Windows Azure:
1. Process the service model
   1. Determine resource requirements
   2. Create role images
2. Allocate resources
3. Prepare nodes
   1. Place role images on nodes
   2. Configure settings
   3. Start roles
4. Configure load balancers
5. Maintain service health
   1. If a role fails, restart the role based on policy
   2. If a node fails, migrate the role based on policy
Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale
• Blobs – massive files, e.g. videos, logs
• Drives – use standard file system APIs
• Tables – non-relational, but with few scale limits; use SQL Azure for relational data
• Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
  • PutBlob – inserts a new blob, overwrites the existing blob
  • GetBlob – get a whole blob or a specific range
  • DeleteBlob
  • CopyBlob
  • SnapshotBlob
  • LeaseBlob
• Each blob has an address:
  • http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
  • e.g. http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Unlimited capacity
• Can only contain blobs
Each container has an access level:
• Private – the default; requires the account key for access
• Full public read
• Public read only
Two Types of Blobs Under the Hood
Block Blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks; each block is identified by a Block ID
• Size limit: 200 GB per blob
Page Blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'; each block has an ID
• Then commit those blocks, in any order, into a blob
• The final blob is limited to 200 GB and up to 50,000 blocks
• You can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming

[Diagram: blocks 1, 6, 8, 3, 5, 4, 7, 2 of Big.mpg are uploaded, then committed in order into the final blob]
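The upload-then-commit protocol maps onto two REST operations, Put Block and Put Block List. A hedged sketch that only *constructs* the request URL and commit body (authentication headers and the actual HTTP send are omitted, and the 8-digit block-ID scheme is an arbitrary choice; block IDs just need to be base64-encoded and of equal length within a blob):

```python
import base64
from urllib.parse import quote

def block_id(index: int) -> str:
    """Base64 block ID; all IDs in one blob must have the same length."""
    return base64.b64encode(f"{index:08d}".encode("ascii")).decode("ascii")

def put_block_url(blob_url: str, index: int) -> str:
    """URL for the Put Block operation (upload one block of data)."""
    return f"{blob_url}?comp=block&blockid={quote(block_id(index), safe='')}"

def block_list_body(n_blocks: int) -> str:
    """Body for Put Block List: commits the uploaded blocks, in the order
    listed here, into the final blob."""
    latest = "".join(f"<Latest>{block_id(i)}</Latest>" for i in range(n_blocks))
    return ('<?xml version="1.0" encoding="utf-8"?>'
            f"<BlockList>{latest}</BlockList>")
```

Because the commit is a separate step, a client can upload blocks in parallel and out of order, then impose the final ordering in the block list — the behavior the diagram above illustrates.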
Pages
• Similar to block blobs; optimized for random read/write operations, with the ability to write to a range of bytes in a blob
• Call Put Blob to set the maximum size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
  • Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount the Page Blob http://<accountname>.blob.core.windows.net/<containername>/<blobname> as X:
  • All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists, as a Page Blob, even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives: the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap access with a service
• Have a strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure

Account: MovieData
• Table Name: Movies — entities: Star Wars, Star Trek, Fan Boys
• Table Name: Customers — entities: Brian H. Prince, Jason Argonaut, Bill Gates

Account → Table → Entity
Tables store entities. Entity schema can vary within the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available and durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is Not Relational
Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load balances partitions based on traffic patterns
• Controls entity locality
The partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
The system load balances
• Use exponential backoff on "Server Busy"
• "Server Busy" means either the system is load balancing to meet your traffic needs, or single-partition limits have been reached
Partition Keys In Each Abstraction
• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId)   RowKey (RowKind)        Name           CreditCardNumber      OrderTotal
1                           Customer-John Smith     John Smith     xxxx-xxxx-xxxx-xxxx
1                           Order – 1                                                    $35.12
2                           Customer-Bill Johnson   Bill Johnson   xxxx-xxxx-xxxx-xxxx
2                           Order – 3                                                    $10.00

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition
bull All messages for a single queue belong to the same partitionMessages ndash Queue Name
Container Name Blob Name
image annarborbighousejpg
image foxboroughgillettejpg
video annarborbighousejpg
Queue Message
jobs Message1
jobs Message2
workflow Message1
Replication Guarantee
bull All Azure Storage data exists in three replicasbull Replicas are created as neededbull A write operation is not complete until it has
written to all three replicasbull Reads are only load balanced to replicas in
syncServer 1 Server 2 Server 3
P1
P2
Pn
P1
P2
Pn
P1
P2
Pn
Scalability Targets
Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
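The backoff guidance above reduces to simple retry logic. A minimal Python sketch (not Azure SDK code; `ServerBusyError` is a stand-in for an HTTP 503 "Server Busy" response):

```python
import random
import time

class ServerBusyError(Exception):
    """Stand-in for an HTTP 503 'Server Busy' response."""

def with_backoff(request, max_retries=6, base_delay=0.5, max_delay=30.0):
    """Retry a request, backing off exponentially on 'Server Busy'."""
    for attempt in range(max_retries):
        try:
            return request()
        except ServerBusyError:
            # Sleep base * 2^attempt, capped at max_delay, with jitter
            # so many clients do not retry in lockstep
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))
    return request()  # final attempt; any error now propagates
```

The jitter and the cap (truncation) keep a burst of throttled clients from hammering a busy partition in synchronized waves.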
Partitions and Partition Ranges

Server A: Table = Movies [Min – Max]

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

After the partition range splits across two servers:

Server A: Table = Movies [Min – Comedy)

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B: Table = Movies [Comedy – Max]

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
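Handling continuation tokens amounts to a drain loop. A Python sketch over a hypothetical `query_page(token)` function that returns up to 1000 rows plus an optional token (the real Table service passes tokens via response headers):

```python
def query_all(query_page):
    """Drain a paged query: keep requesting until no continuation token remains.

    query_page(token) returns (rows, next_token); next_token is None on the last page.
    """
    token = None
    results = []
    while True:
        rows, token = query_page(token)
        results.extend(rows)   # a page holds at most 1000 rows
        if token is None:      # no token means the query is complete
            return results

def make_source(pages):
    """Simulated paged source: tokens are just indexes into a page list."""
    def query_page(token):
        i = 0 if token is None else token
        next_token = i + 1 if i + 1 < len(pages) else None
        return pages[i], next_token
    return query_page
```

Note that a token can arrive even when fewer than 1000 rows were returned (partition boundary, 5-second limit), which is why the loop tests only the token, never the row count.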
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as prefix
Avoid "append only" patterns
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement back-off strategy for retries
• Server busy
• Load balance partitions to meet traffic needs
• Load on a single partition has exceeded the limits
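One way to avoid an append-only hot partition is the hash prefix suggested above: prepend a short, stable hash bucket to the natural key so writes spread across partitions. An illustrative sketch (the bucket count of 16 is an arbitrary choice):

```python
import hashlib

def prefixed_partition_key(natural_key, buckets=16):
    """Spread write load by prefixing the natural key with a stable hash bucket."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    # Zero-padded so keys still sort predictably within a bucket
    return f"{bucket:02d}_{natural_key}"
```

The trade-off: a range query over the natural key order now has to fan out across all buckets, so this suits append-heavy workloads more than scan-heavy ones.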
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• Loose coupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
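The work ticket pattern mentioned above keeps queue messages small: the message carries only a reference (the ticket) to the real payload, which lives in blob storage. A Python sketch with in-memory stand-ins for the queue and the blob store:

```python
import json
import queue
import uuid

blob_store = {}             # stand-in for blob storage
work_queue = queue.Queue()  # stand-in for an Azure queue

def submit_work(payload: bytes) -> str:
    """Store the (possibly large) payload in a blob; enqueue a small work ticket."""
    blob_name = str(uuid.uuid4())
    blob_store[blob_name] = payload
    ticket = json.dumps({"blob": blob_name})  # well under the 8 KB message limit
    work_queue.put(ticket)
    return blob_name

def process_next() -> bytes:
    """Worker side: dequeue a ticket, then fetch the payload it points to."""
    ticket = json.loads(work_queue.get())
    return blob_store[ticket["blob"]]
```

This is also the standard answer to "messages > 8 KB": the queue moves tickets, the blob store moves data.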
Queue Terminology

Message Lifecycle
[diagram: a Web Role calls PutMessage to add Msg 1–4 to the queue; Worker Roles call GetMessage (with a visibility timeout) to retrieve Msg 1 and Msg 2, process them, and then call RemoveMessage]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach:
• Each empty poll increases the interval by 2x
• A successful poll resets the interval back to 1
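The polling scheme above (double on empty, reset on success, truncate at a ceiling) fits in one function:

```python
def next_poll_interval(current, got_message, floor=1.0, ceiling=60.0):
    """Truncated exponential back-off polling.

    Empty poll -> double the interval (up to ceiling).
    Successful poll -> reset the interval back to the floor.
    """
    if got_message:
        return floor
    return min(ceiling, current * 2)
```

The ceiling bounds worst-case latency for a newly arrived message, while the doubling keeps an idle worker from burning transactions (and money) polling an empty queue.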
Removing Poison Messages
[diagram: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue them; each message carries a dequeue count]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1 (dequeue count now 2)
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1 (dequeue count now 3)
12. DequeueCount > 2
13. DeleteMessage(Q, msg 1): the poison message is removed
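The walkthrough above condenses into worker logic that checks the dequeue count before processing. A Python sketch (the `Message` class mirrors the queue service tracking a per-message dequeue count; the threshold of 2 matches the example):

```python
MAX_DEQUEUE = 2   # from the example: DequeueCount > 2 means poison

class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0

def handle(msg, process, dead_letter):
    """Process a message; divert it if it has already failed too many times."""
    msg.dequeue_count += 1            # the service increments this on each dequeue
    if msg.dequeue_count > MAX_DEQUEUE:
        dead_letter.append(msg.body)  # take the poison message out of circulation
        return "dead-lettered"
    process(msg.body)
    return "processed"
```

Without this check, a message whose processing always crashes the worker would reappear after every visibility timeout, forever.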
Queues Recap
Make message processing idempotent
• No need to deal with failures
Do not rely on order
• Invisible messages result in out-of-order delivery
Use dequeue count to remove poison messages
• Enforce a threshold on a message's dequeue count
Use a blob to store message data, with a reference in the message
• Messages > 8 KB
• Batch messages
• Garbage collect orphaned blobs
Use message count to scale
• Dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each barely using its CPU
• Balance using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
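In .NET 4 those two styles are what the Task Parallel Library offers; the same distinction looks like this in Python's `concurrent.futures` (an analogous sketch, not TPL itself):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism: the same operation applied across a collection
    squares = list(pool.map(square, range(8)))

    # Task parallelism: independent, heterogeneous tasks running concurrently
    f1 = pool.submit(sum, range(100))
    f2 = pool.submit(max, [3, 1, 4, 1, 5])
    totals = (f1.result(), f2.result())
```

The pool size plays the role the slide assigns to core count: more outstanding work than workers just queues up.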
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs
Performance vs. cost
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
[diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content]
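Point 1 is cheap to verify with the standard library: gzip typically shrinks repetitive HTML output dramatically, and the browser reconstructs it on the fly.

```python
import gzip

html = b"<html><body>" + b"<p>hello azure</p>" * 500 + b"</body></html>"

compressed = gzip.compress(html)        # what the server would send with Content-Encoding: gzip
restored = gzip.decompress(compressed)  # what the browser reconstructs

assert restored == html
ratio = len(compressed) / len(html)     # repetitive markup compresses very well
```

For bandwidth billing, what matters is `len(compressed)`: every byte not sent is a byte not paid for.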
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications

NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
• Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes mean the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model
• Web Role + Queue + Worker
• With three special considerations
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
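The query-segmentation pattern reduces to split/scatter/merge. A Python sketch of the data flow only; `blast_partition` is a stand-in for a worker running NCBI-BLAST over one partition of queries:

```python
def split(sequences, partition_size):
    """Splitting task: break the input sequences into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition):
    """Stand-in for a worker running BLAST over one partition of queries."""
    return [f"hits({seq})" for seq in partition]

def merge(partial_results):
    """Merging task: concatenate per-partition results in input order."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged

# Scatter: in AzureBLAST each partition becomes a queue task for a worker
results = merge([blast_partition(p) for p in split(["s1", "s2", "s3"], 2)])
```

Because each query is independent, the only coordination points are the split at the front and the merge at the back, which is what makes the pattern "pleasingly parallel".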
AzureBLAST Task-Flow
A simple split/join pattern
[diagram: a splitting task fans out into many BLAST tasks, followed by a merging task]

Leverage multi-core within one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
• NCBI-BLAST overhead
• Data transfer overhead
Best practice: test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
[architecture diagram: a Web Role hosts the web portal and web service; a Job Management Role runs job registration, the job scheduler, and the scaling engine against the job registry in Azure Tables; tasks flow through a global dispatch queue to Worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a database updating role refreshes the NCBI databases; work follows the splitting task → BLAST tasks → merging task flow]
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states
[diagram: the web portal and web service feed job registration; the job scheduler and scaling engine work from the job registry]
Demonstration
R. palustris as a platform for H2 production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
[diagram: deployments of 50 and 62 instances spread across the datacenters]
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g. a task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe Data Center: in total, 34,256 tasks processed
[chart: all 62 compute nodes lost tasks and then came back in a group; this is an update domain]
• ~30 mins per group
• ~6 nodes in one group
Surviving Storage Failures
West Europe Datacenter: 30,976 tasks were completed, and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the Fault Domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·δq·ga) / (λv·(Δ + γ·(1 + ga/gs)))

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
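As a sanity check on the formula, the Penman-Monteith terms plug in directly. A sketch with made-up illustrative inputs (unit handling is simplified; λv defaults to ~2450 J/g, a standard value for the latent heat of vaporization):

```python
def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0, lambda_v=2450.0):
    """ET = (Delta*Rn + rho_a*cp*dq*ga) / (lambda_v*(Delta + gamma*(1 + ga/gs)))"""
    numerator = delta * Rn + rho_a * cp * dq * ga
    denominator = lambda_v * (delta + gamma * (1.0 + ga / gs))
    return numerator / denominator

# Illustrative (made-up) inputs: Delta, Rn, rho_a, cp, dq, ga, gs
et = penman_monteith(145.0, 400.0, 1.2, 1005.0, 800.0, 0.02, 0.01)
```

The point of the pipeline is that ga and gs, the conductivities, are the hard part: they must be derived from imagery, sensor, and field data for every cell of the catchment.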
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a request queue and download queue drive the data collection stage, which pulls from source imagery download sites using source metadata; a reprojection queue drives the reprojection stage; reduction 1 and reduction 2 queues drive the derivation and analysis reduction stages; scientific results are then downloaded]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks persisted in Tables
[diagram: a <PipelineStage> Request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; Generic Worker (Worker Role) instances dequeue tasks and read from <Input>Data Storage]
Example Pipeline Stage: Reprojection Service
[diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, then parses and persists ReprojectionTaskStatus and dispatches to the Task Queue; Generic Worker (Worker Role) instances execute tasks against Reprojection Data Storage, drawing on Swath Source Data Storage]
• Each entity in the job status table specifies a single reprojection job request
• Each entity in the task status table specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage (per the pipeline diagram) | Data | Compute | Cost
Data collection | 400-500 GB, 60K files, 10 MB/sec | 11 hours, <10 workers | $50 upload, $450 storage
Reprojection | 400 GB, 45K files | 3500 hours, 20-100 workers | $420 CPU, $60 download
Derivation reduction | 5-7 GB, 55K files | 1800 hours, 20-100 workers | $216 CPU, $1 download, $6 storage
Analysis reduction | <10 GB, ~1K files | 1800 hours, 20-100 workers | $216 CPU, $2 download, $9 storage

Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable, Fault-Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure: Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources: Cloud Research Community Site
- Resources: AzureScope
- Resources: AzureScope (2)
- Demonstration (2)
- Slide 104
Storage: Replicated, Highly Available, Load Balanced

Durable Storage, At Massive Scale
Blob – massive files, e.g. videos, logs
Drive – use standard file system APIs
Tables – non-relational, but with few scale limits; use SQL Azure for relational data
Queues – facilitate loosely coupled, reliable systems
Blob Features and Functions
• Store large objects (up to 1 TB in size)
• You can have as many containers and blobs as you want
• Standard REST interface
• PutBlob: inserts a new blob, overwrites the existing blob
• GetBlob: get the whole blob or a specific range
• DeleteBlob
• CopyBlob
• SnapshotBlob
• LeaseBlob
• Each blob has an address:
• http://<storageaccount>.blob.core.windows.net/<Container>/<BlobName>
• http://movieconversion.blob.core.windows.net/originals/barga.mpg
Containers
• Similar to a top-level folder
• Has an unlimited capacity
• Can only contain blobs
Each container has an access level:
- Private (default; will require the account key to access)
- Full public read
- Public read only
Two Types of Blobs Under the Hood
Block Blob
• Targeted at streaming workloads
• Each blob consists of a sequence of blocks
• Each block is identified by a Block ID
• Size limit: 200 GB per blob
Page Blob
• Targeted at random read/write workloads
• Each blob consists of an array of pages
• Each page is identified by its offset from the start of the blob
• Size limit: 1 TB per blob
Blocks
• You can upload a file in 'blocks'
• Each block has an ID
• Then commit those blocks in any order into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
[diagram: big.mpg split into blocks 1-8, committed in order into the blob big.mpg]
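The stage-then-commit flow above can be mimicked with a dictionary of uncommitted blocks and an ordered commit list (an in-memory toy; the real service does this via the Put Block and Put Block List operations):

```python
class BlockBlob:
    """Toy model of a block blob: stage blocks by ID, then commit an ordered list."""

    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes; GC'd after a week if never committed
        self.committed = b""

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Blocks may be committed in any order, independent of upload order
        self.committed = b"".join(self.uncommitted[b] for b in block_ids)
        self.uncommitted.clear()

blob = BlockBlob()
for bid, chunk in [("b1", b"big"), ("b3", b".mpg"), ("b2", b"file")]:
    blob.put_block(bid, chunk)            # upload order: b1, b3, b2
blob.put_block_list(["b1", "b2", "b3"])   # commit in logical order
```

Separating upload from commit is what makes parallel and out-of-order uploads safe: nothing is visible until the block list is committed.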
Pages
• Similar to block blobs
• Optimized for random read/write operations; provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
• Provides a durable NTFS volume for Windows Azure applications to use
• Use existing NTFS APIs to access a durable drive
• Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
• Example: mount a Page Blob as X:\
• http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
• Drive made durable through standard Page Blob replication
• The drive persists as a Page Blob even when not mounted
Windows Azure Drive API
• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of a list of the drive letter and Page Blob URLs for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account: MovieData
• Table name: Movies – entities: Star Wars, Star Trek, Fan Boys
• Table name: Customers – entities: Brian H. Prince, Jason Argonaut, Bill Gates
Account → Table → Entity
Tables store entities; entity schema can vary in the same table
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
• Billions of entities (rows) and TBs of data
• Can use thousands of servers as traffic grows
• Highly available & durable
• Data is replicated several times
• Familiar and easy-to-use API
• WCF Data Services and OData
• .NET classes and LINQ
• REST – with any platform or language

Is Not Relational
Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
bull Programming semantics ensure that a message can be processed at least once
bull Access is provided via REST
Storage PartitioningUnderstanding partitioning is key to understanding
performance
bull Different for each data type (blobs entities queues)Every data object has a
partition key
bull A partition can be served by a single serverbull System load balances partitions based on traffic patternbull Controls entity locality
Partition key is unit of scale
bull Load balancing can take a few minutes to kick inbull Can take a couple of seconds for partition to be available on a
different serverSystem load balances
bull Use exponential backoff on ldquoServer Busyrdquobull Our system load balances to meet your traffic needsbull Single partition limits have been reached
Server Busy
Partition Keys In Each Abstraction

Entities: TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order - 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order - 3 | | | $10.00

Blobs: Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages: Queue name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync
[Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3]
Scalability Targets
Storage Account
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
Partitions and Partition Ranges

Server A: Table = Movies [Min - Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

When traffic grows, the system splits the range across servers:

Server A: Table = Movies [Min - Comedy)
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B: Table = Movies [Comedy - Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity Group Transactions
• Transactions within a single partition
• Transaction semantics reduce round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.
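The "distribute load" advice above is often implemented by prepending a stable hash prefix to a natural key, so that hot key ranges spread across partitions. A minimal Python sketch, assuming a hypothetical helper (`prefixed_partition_key` is not part of any Azure SDK; the bucket count is illustrative):

```python
import hashlib

def prefixed_partition_key(natural_key: str, buckets: int = 16) -> str:
    """Spread hot natural keys across `buckets` partitions by
    prepending a stable hash prefix (hypothetical helper, not an
    Azure storage API)."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    # The prefix is deterministic, so point lookups can recompute it.
    return f"{bucket:02d}_{natural_key}"

print(prefixed_partition_key("2010-12-07"))
```

Because the prefix is a pure function of the key, point queries stay cheap; range scans, however, must fan out across all buckets, which is the trade-off this pattern accepts.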
Expect Continuation Tokens. Seriously.
A continuation token is returned when:
• The response reaches the maximum of 1,000 rows
• The query crosses a partition range boundary
• The query takes more than 5 seconds to execute
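A query loop therefore has to keep requesting pages until no token comes back. The Python sketch below shows the shape of that loop; `query_page` is a stand-in for whatever client call your platform exposes, and the fake server simulates the 1,000-row page limit:

```python
def query_all(query_page):
    """Drain a paged table query by following continuation tokens.
    `query_page(token)` is a hypothetical stand-in for a storage
    client call that returns (rows, next_token)."""
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:          # no continuation token: done
            return rows

# Simulated server: 2,500 rows delivered at most 1,000 at a time.
DATA = list(range(2500))

def fake_page(token):
    start = token or 0
    page = DATA[start:start + 1000]
    nxt = start + 1000 if start + 1000 < len(DATA) else None
    return page, nxt

print(len(query_all(fake_page)))
```

Note that an empty page with a non-empty token is legal (e.g., at a partition range boundary), which is why the loop tests the token, not the page size.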
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Guidance:
• Select PartitionKey and RowKey values that help scale; avoid "append only" patterns, and distribute load by using a hash or similar as a key prefix
• Always handle continuation tokens; expect them for range queries
• "OR" predicates are not optimized; execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "server busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Loose coupling can aid scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML and are limited to 8 KB in size
  • Commonly used with the work-ticket pattern
• Why not simply use a table?
Queue Terminology

Message Lifecycle
[Diagram: a Web Role calls PutMessage to add Msg 1-4 to the queue; Worker Roles call GetMessage (with a visibility timeout) to receive messages and RemoveMessage to delete them once processed]
PutMessage:
POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DeleteMessage:
DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll increases the polling interval by 2x, up to some maximum; a successful poll sets the interval back to 1.
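The policy above fits in a few lines. A minimal Python sketch (the base interval and cap are illustrative values, not service-mandated constants):

```python
def next_interval(current, success, base=1.0, cap=60.0):
    """Truncated exponential backoff for queue polling: double the
    interval on every empty poll, truncate at `cap`, and reset to
    `base` on a successful dequeue."""
    if success:
        return base
    return min(current * 2, cap)

interval = 1.0
for got_message in [False, False, False, True, False]:
    interval = next_interval(interval, got_message)
    print(interval)
```

Doubling with a cap keeps idle workers from hammering the service (and from paying for transactions) while still reacting within one interval once work appears.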
Removing Poison Messages
[Diagram: producers P1 and P2 feed a queue; consumers C1 and C2 dequeue with a 30 s visibility timeout]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (continued)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (continued)
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
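The dequeue-count check in steps 12-13 is what stops a poison message from being retried forever. A toy Python model of that consumer loop (the queue, handler, and crash are all simulated; a real consumer would use the storage API's DequeueCount property):

```python
MAX_DEQUEUE = 2  # threshold from the slides: give up when DequeueCount > 2

def process_queue(queue, handler, poison):
    """Work-ticket loop (toy model of a queue consumer).
    Each queue entry is [dequeue_count, body]; a handler exception
    models a visibility-timeout expiry, so the message reappears."""
    while queue:
        msg = queue.pop(0)
        msg[0] += 1                      # message dequeued once more
        if msg[0] > MAX_DEQUEUE:
            poison.append(msg[1])        # poison: remove instead of retrying
            continue
        try:
            handler(msg[1])              # success implies DeleteMessage
        except RuntimeError:
            queue.append(msg)            # crash: message becomes visible again

def handler(body):
    if body == "crashes":
        raise RuntimeError("simulated worker crash")

dead = []
work = [[0, "good"], [0, "crashes"]]
process_queue(work, handler, dead)
print(dead)
```

In production the "poison" branch would typically log the message body somewhere durable before deleting it, so a human can diagnose why it never processed.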
Queues Recap
• Make message processing idempotent: then you do not need special failure handling
• Do not rely on order: invisible messages can result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• For messages larger than 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale: dynamically increase or reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
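The size decision reduces to a per-unit-of-work cost comparison once you have measured your scaling efficiency. A back-of-the-envelope sketch in Python (the hourly rates and the 70% efficiency figure are illustrative assumptions, not real Azure prices or measurements):

```python
def cost_per_unit_work(cores, hourly_rate, scaling_efficiency):
    """Relative cost of one unit of work on a VM with `cores` cores.
    `scaling_efficiency` < 1 models sub-linear scaling across cores.
    Rates are illustrative only."""
    throughput = cores * scaling_efficiency
    return hourly_rate / throughput

small = cost_per_unit_work(1, 0.12, 1.0)    # one small instance
xl = cost_per_unit_work(8, 0.96, 0.7)       # 8 cores, 70% efficient
print(small, xl)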
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not fully using its CPU
• Balance using up CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
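The data-parallel pattern the TPL enables (same operation applied to independent chunks) maps onto any thread-pool API. A Python analogue of the idea, shown only to illustrate the shape of the pattern, not the .NET API:

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(chunk):
    # CPU-light stand-in for real per-chunk work
    return sum(len(line.split()) for line in chunk)

lines = ["the quick brown fox"] * 1000
chunks = [lines[i:i + 250] for i in range(0, len(lines), 250)]

# Data parallelism: the same operation over each chunk, in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(word_count, chunks))

print(total)
```

Task parallelism is the dual: different operations (e.g., compress, upload, log) run concurrently, each as its own submitted task on the same pool.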
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• It is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference depending on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile.
Saving bandwidth costs often leads to savings in other places:
• Sending fewer things over the wire often means getting fewer things from storage
• Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web app's performance and user experience.
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
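Repetitive markup is exactly where gzip shines, which is why point 1 pays for both bandwidth and storage. A quick Python demonstration of the size reduction on a repetitive HTML fragment:

```python
import gzip

# Repetitive markup compresses extremely well; this is why gzipping
# all text output cuts both bandwidth and storage costs.
html = ("<tr><td>Defiance</td><td>2008</td></tr>\n" * 500).encode("utf-8")
compressed = gzip.compress(html)

ratio = len(compressed) / len(html)
print(f"{len(html)} -> {len(compressed)} bytes ({ratio:.1%})")
```

The exact ratio depends on the content, but table-like HTML routinely compresses by an order of magnitude or more.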
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile inside and out
Cloud Computing for eScience Applications

NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700 to 1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage load could reach 1 TB (each node pulling its own copy of the database)
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• A parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task Flow
A simple split/join pattern.
Leverage the multiple cores of one instance:
• The "-a" argument of NCBI-BLAST
• Set to 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
[Diagram: splitting task → BLAST tasks in parallel → merging task]
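The split/join pattern itself is compact. A Python sketch of the shape of AzureBLAST's query segmentation, with `fake_blast` as a hypothetical stand-in for running NCBI-BLAST on one partition (the real system fans tasks out through a queue to worker roles rather than a local pool):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_blast(partition):
    """Hypothetical stand-in for running NCBI-BLAST over one
    partition of the input; returns one 'hit' per sequence."""
    return [f"hit:{seq}" for seq in partition]

def split_join(sequences, partition_size):
    # Split: fixed-size partitions trade load balance against per-task overhead.
    parts = [sequences[i:i + partition_size]
             for i in range(0, len(sequences), partition_size)]
    # Query partitions in parallel (the worker-role fan-out).
    with ThreadPoolExecutor() as pool:
        results = pool.map(fake_blast, parts)
    # Join: merge partial results once all tasks are done.
    return [hit for part in results for hit in part]

merged = split_join([f"seq{i}" for i in range(350)], partition_size=100)
print(len(merged))
```

The `partition_size` knob is exactly the granularity trade-off discussed above: larger partitions mean fewer tasks but worse load balance, smaller ones mean more per-task overhead.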
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size and instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
[Architecture diagram: a Web Role hosts the Web Portal, Web Service, and job registration; a Job Management Role runs the Job Scheduler and Scaling Engine; Worker Roles pull work from a global dispatch queue; Azure Tables hold the Job Registry and NCBI database metadata; Azure Blobs hold the BLAST databases, temporary data, etc.; a Database Updating Role refreshes the NCBI databases. Task flow: splitting task → BLAST tasks in parallel → merging task]
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track job status and logs
Authentication/authorization is based on Live ID.
The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state
[Diagram: Job Portal → Web Service → job registration → Job Registry table; the Job Scheduler and Scaling Engine process registered jobs]
Demonstration
R. palustris as a platform for H2 production
Eric Schadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time.
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment was submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When loads became imbalanced, the load was redistributed manually

End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place
Understanding Azure by Analyzing Logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
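Spotting the anomaly programmatically is a set difference: any task ID with an "Executing" line but no matching "is done" line never completed. A minimal Python sketch over the log format shown above:

```python
import re

LOG = """3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
"""

def unfinished_tasks(log):
    """Task IDs that logged 'Executing' but never 'is done' - the
    signature of a failed or interrupted instance."""
    started = set(re.findall(r"Executing the task (\d+)", log))
    done = set(re.findall(r"Execution of task (\d+) is done", log))
    return started - done

print(unfinished_tasks(LOG))
```

Run over the full experiment logs, the same two-regex pass is what surfaces the update-domain and fault-domain incidents discussed next.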
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups; this is an update domain at work:
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain was at work.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.

Penman-Monteith (1964):

ET = (Δ Rn + ρa cp (δq) ga) / ((Δ + γ (1 + ga/gs)) λv)

where:
• ET = water volume evapotranspired (m^3 s^-1 m^-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m^-2)
• cp = specific heat capacity of air (J kg^-1 K^-1)
• ρa = dry air density (kg m^-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s^-1)
• gs = conductivity of plant stoma, air (inverse of rs) (m s^-1)
• γ = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
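The formula is trivial to evaluate once the inputs are in hand; the hard part, as the slide notes, is estimating the conductivities. A direct Python transcription (symbol names mirror the slide; the default γ and λv values and the sample inputs in the test are illustrative only):

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET as written on the slide:
    ET = (delta*Rn + rho_a*c_p*dq*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v)
    Defaults: gamma ~66 Pa/K (psychrometric constant), lambda_v ~2450 J/g
    (latent heat of vaporization); both are illustrative values."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

Note the structure: more net radiation (Rn) or a larger vapor pressure deficit (δq) raises ET, while a smaller stomatal conductance (gs) inflates the γ(1 + ga/gs) term in the denominator and suppresses it.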
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage:
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a request queue feeds the data collection stage, which pulls source imagery from download sites using source metadata; the reprojection queue feeds the reprojection stage; the reduction 1 and reduction 2 queues feed the derivation and analysis reduction stages; a download queue delivers scientific results to scientists]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, which are recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
[Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue]
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status
[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue; Generic Workers (Worker Roles) dequeue tasks and read <Input>Data Storage]
Example Pipeline Stage: Reprojection Service
[Diagram: a reprojection request is parsed by the Service Monitor (Worker Role), which persists ReprojectionJobStatus and ReprojectionTaskStatus and dispatches tasks via the job and task queues to Generic Workers (Worker Roles), which read swath source data storage and write reprojection data storage]
• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage | Data | Compute | Cost
Data collection | 400-500 GB, 60K files, 10 MB/sec | 11 hours, <10 workers | $50 upload, $450 storage
Reprojection | 400 GB, 45K files | 3,500 hours, 20-100 workers | $420 CPU, $60 download
Derivation reduction | 5-7 GB, 55K files | 1,800 hours, 20-100 workers | $216 CPU, $1 download, $6 storage
Analysis reduction | <10 GB, ~1K files | 1,800 hours, 20-100 workers | $216 CPU, $2 download, $9 storage

Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds can act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Table Structure
Account MovieData
Star WarsStar TrekFan Boys
Table Name Movies
Brian H PrinceJason ArgonautBill Gates
Table Name Customers
Account
Table
Entity
Tables store entities Entity schema can vary in the same table
Windows Azure Tables
bull Provides Structured Storagebull Massively Scalable Tablesbull Billions of entities (rows) and TBs of
databull Can use thousands of servers as traffic
grows
bull Highly Available amp Durablebull Data is replicated several times
bull Familiar and Easy to use APIbull WCF Data Services and ODatabull NET classes and LINQbull REST ndash with any platform or language
Is not relationalCan Not-bull Create foreign key relationships between tablesbull Perform server side joins between tablesbull Create custom indexes on the tablesbull No server side Count() for example
All entities must have the following propertiesbull Timestampbull PartitionKeybull RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
• Partitioning is different for each data type (blobs, entities, queues)
Every data object has a partition key
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• The partition key controls entity locality
Partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
On "Server Busy":
• Use exponential backoff
• The system load balances to meet your traffic needs
• A busy response means single-partition limits have been reached
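The "exponential backoff on Server Busy" advice can be sketched as a small retry wrapper. This is an illustrative sketch, not the storage client library's API; `ServerBusyError` is a hypothetical stand-in for a 503 "Server Busy" response.

```python
import time

class ServerBusyError(Exception):
    """Hypothetical stand-in for a storage 503 "Server Busy" response."""

def with_backoff(operation, max_retries=5, base=0.1, cap=30.0):
    """Retry `operation` with truncated exponential backoff: double the
    delay after each busy response, up to `cap` seconds."""
    for attempt in range(max_retries):
        try:
            return operation()
        except ServerBusyError:
            if attempt == max_retries - 1:
                raise  # give up after the last retry
            time.sleep(min(cap, base * (2 ** attempt)))
```

Any storage call (PutBlob, a table query, GetMessage) can be passed in as `operation`.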
Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $3512
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $1000

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarborbighouse.jpg
image          | foxboroughgillette.jpg
video          | annarborbighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3)
Scalability Targets
Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
Partitions and Partition Ranges

Server A – Table = Movies [Min - Max]:
PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

After the partition splits:

Server A – Table = Movies [Min - Comedy):
PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006

Server B – Table = Movies [Comedy - Max]:
PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A query response may be incomplete when:
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select PartitionKey and RowKey that help scale
• Distribute load by using a hash etc. as a key prefix
Avoid "append only" patterns
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• Server busy: the system load balances partitions to meet traffic needs
• A busy response means the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Loose coupling can aid scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly used with the work-ticket pattern
• Why not simply use a table?
Queue Terminology / Message Lifecycle

(Diagram: a Web Role calls PutMessage to add Msg 1-4 to the queue; Worker Roles call GetMessage (with a timeout) to dequeue messages and RemoveMessage to delete them after processing)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
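The Get/Delete round trip above can be sketched with a toy in-memory queue. This is a hypothetical stand-in for the REST calls, not the real service: a dequeued message only becomes invisible for the timeout window, and is removed for good only when deleted with its pop receipt.

```python
import time
import uuid

class ToyQueue:
    """In-memory sketch of the Azure queue message lifecycle."""
    def __init__(self):
        self.messages = []

    def put(self, text):
        """PutMessage: enqueue a new message."""
        self.messages.append({"id": str(uuid.uuid4()), "text": text,
                              "invisible_until": 0.0, "pop_receipt": None})

    def get(self, timeout=30):
        """GetMessage: return a visible message and hide it for `timeout` s."""
        now = time.time()
        for m in self.messages:
            if m["invisible_until"] <= now:
                m["invisible_until"] = now + timeout
                m["pop_receipt"] = str(uuid.uuid4())
                return m
        return None

    def delete(self, msg_id, pop_receipt):
        """DeleteMessage: permanently remove a message using its pop receipt."""
        self.messages = [m for m in self.messages
                         if not (m["id"] == msg_id and m["pop_receipt"] == pop_receipt)]
```

If the worker crashes before calling `delete`, the message simply reappears after the timeout, which is exactly the at-least-once guarantee described above.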
Truncated Exponential Back Off Polling
Consider a back-off polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1
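The polling rule above can be sketched as a small loop. `get_message`, `handle`, and `stop` are hypothetical callables standing in for the queue client, the work, and a shutdown signal.

```python
import time

def poll_queue(get_message, handle, max_interval=60.0, stop=lambda: False):
    """Truncated exponential back-off polling: each empty poll doubles
    the sleep interval (capped at max_interval); each successful poll
    resets it to 1 second."""
    interval = 1.0
    while not stop():
        msg = get_message()
        if msg is None:
            time.sleep(interval)                 # empty poll: wait...
            interval = min(interval * 2, max_interval)  # ...and back off
        else:
            handle(msg)
            interval = 1.0                       # success: poll eagerly again
```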
Removing Poison Messages

(Scenario: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue them)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2
13. C1: Delete(Q, msg 1); the poison message is removed instead of being processed yet again
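Step 12-13 above, enforcing a threshold on the dequeue count, can be sketched as follows. All callables are hypothetical stand-ins for a queue client; the `dequeue_count` field mirrors the service's DequeueCount.

```python
def process_with_poison_check(get_message, handle, delete, max_dequeue=2,
                              dead_letter=None):
    """A message that keeps reappearing (its consumer crashed while
    processing it) is treated as poison and deleted instead of being
    retried forever."""
    msg = get_message()
    if msg is None:
        return None
    if msg["dequeue_count"] > max_dequeue:
        if dead_letter is not None:
            dead_letter(msg)   # optionally keep it aside for inspection
        delete(msg)            # remove the poison message
        return "poisoned"
    handle(msg)
    delete(msg)                # normal lifecycle: process, then delete
    return "processed"
```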
Queues Recap
Make message processing idempotent
• No need to deal with failures
Do not rely on order
• Invisible messages can result in out-of-order delivery
Use DequeueCount to remove poison messages
• Enforce a threshold on a message's dequeue count
Messages > 8 KB? Use a blob to store the message data, with a reference in the message
• Batch messages
• Garbage collect orphaned blobs
Use message count to scale
• Dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • Pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
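The data-parallelism pattern above (what Parallel.ForEach / PLINQ provide in .NET 4) can be illustrated in language-neutral terms: apply the same operation to every element of a collection and let a worker pool keep the cores busy. This Python sketch only illustrates the pattern; `transcode` is a hypothetical unit of work.

```python
from concurrent.futures import ThreadPoolExecutor

def transcode(item):
    """Hypothetical stand-in for a unit of work (e.g. converting a file)."""
    return item * item

def process_all(items, workers=8):
    """Data parallelism: fan the same operation out over a collection,
    sized to the instance's core count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transcode, items))
```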
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
  • Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure/poor user experience from not having excess capacity against the cost of idling VMs
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Saving bandwidth costs often leads to savings in other places
  • Sending fewer things over the wire often means getting fewer things from storage
  • Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
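The gzip step above is a one-liner in most stacks; this sketch just makes the storage/bandwidth saving concrete on repetitive markup (it assumes the client advertised `Accept-Encoding: gzip`, and is not tied to any web framework).

```python
import gzip

def gzip_payload(text: str) -> bytes:
    """Gzip a response body before storing or sending it."""
    return gzip.compress(text.encode("utf-8"), compresslevel=6)

# Repetitive HTML, the common case for generated pages, compresses very well.
body = "<html>" + "<li>repetitive markup compresses well</li>" * 200 + "</html>"
compressed = gzip_payload(body)
```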
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications

NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months

Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
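Query segmentation, the split/join approach described above, reduces to two small functions: split the input sequences into independent partitions and concatenate the per-partition hit lists. This is a sketch of the pattern, not AzureBLAST's code; 100 sequences per partition is the value the deck's micro-benchmarks later recommend, treated here as a tunable.

```python
def split_queries(sequences, partition_size=100):
    """Split input sequences into partitions that can be BLASTed
    independently (pleasingly parallel)."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge_results(partial_results):
    """Join step: with query segmentation (unlike database segmentation)
    merging is just concatenation; no special reduction is needed."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged
```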
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow
A simple Split/Join pattern: a splitting task fans out into many BLAST tasks, followed by a merging task

Leverage the multi-core capability of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads
  • NCBI-BLAST overhead
  • Data transfer overhead
• Best practice: use test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of an instance failure
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability
Task size/instance size vs. cost
• The extra-large instance generated the best and the most economical throughput
• Fully utilizes the resource
AzureBLAST Architecture

(Diagram: a Web Role hosts the Web Portal, Web Service, and job registration; a Job Management Role runs the Job Scheduler and Scaling Engine; a global dispatch queue feeds Worker Roles; Azure Tables hold the Job Registry and NCBI database metadata; Azure Blobs hold the BLAST databases, temporary data, etc.; a Database Updating Role refreshes the databases. The task flow remains the split/join pattern: a splitting task, many BLAST tasks, and a merging task.)
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, real working instance time should be 6~8 days
  • Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise, something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
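Detecting the "something is wrong" case, a task that started but never logged completion, can be sketched with two regular expressions over the log text. The patterns are assumptions about the log format shown above.

```python
import re

def find_unfinished_tasks(log_lines):
    """Return task IDs that logged "Executing the task N" but never a
    matching "Execution of task N is done" line."""
    text = "\n".join(log_lines)
    started = set(re.findall(r"Executing the task (\d+)", text))
    done = set(re.findall(r"Execution of task (\d+) is done", text))
    return sorted(started - done)
```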
Surviving System Upgrades
North Europe Data Center: a total of 34,256 tasks processed
• All 62 compute nodes lost tasks and then came back in groups; this is an update domain at work
  • ~30 mins between groups
  • ~6 nodes in one group

Surviving Storage Failures
West Europe Data Center: 30,976 tasks were completed before the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" (Irish proverb)

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants

Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
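The Penman-Monteith formula translates directly into code; this sketch mirrors the equation term by term, and the default constants come from the definitions above (the test's sample inputs are illustrative, not from the deck).

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith evapotranspiration.
    delta: d(saturation specific humidity)/dT (Pa/K)
    r_n: net radiation (W/m^2); rho_a: dry air density (kg/m^3)
    c_p: specific heat of air (J/(kg K)); dq: vapor pressure deficit (Pa)
    g_a, g_s: conductivities of air and plant stoma (m/s)
    gamma: psychrometric constant (Pa/K); lambda_v: latent heat (J/g)."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```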
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a request queue feeds the data collection stage, which pulls from source imagery download sites; the reprojection, derivation reduction, and analysis reduction stages are fed by the reprojection, reduction 1, and reduction 2 queues; scientific results are made available through a download queue)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, which are recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue)
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role
• A Generic Worker (Worker Role):
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; Generic Workers (Worker Roles) dequeue tasks and read/write <Input>Data Storage)
Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request reaches the Service Monitor (Worker Role), which persists ReprojectionJobStatus via the Job Queue and parses & persists ReprojectionTaskStatus, dispatching to the Task Queue; Generic Workers (Worker Roles) read Swath Source Data Storage and write Reprojection Data Storage)

• Each entity in the job table specifies a single reprojection job request
• Each entity in the task table specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures (20-100 workers unless noted):
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload + $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours; $420 CPU + $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours; $216 CPU + $1 download + $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours; $216 CPU + $2 download + $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Blob Features and Functionsbull Store Large Objects (up to 1TB
in size)
bull You can have as many containers and Blobs as you want
bull Standard REST Interfacebull PutBlob
bull Inserts a new blob overwrites the existing blob
bull GetBlobbull Get whole blob or a specific range
bull DeleteBlobbull CopyBlobbull SnapshotBlobbull LeaseBlob
bull Each Blob has an addressbull httpltstorageaccountgtblobcorewindowsnetltContainergtltBlobNamegtbull httpmovieconversionblobcorewindowsnetoriginalsbargampg
Containers
bull Similar to a top level folderbull Has an unlimited capacitybull Can only contain BLOBs
Each container has an access level- Private
- Default will require the account key to access- Full public read- Public read only
Two Types of Blobs Under the Hood
bull Block Blob bull Targeted at streaming
workloadsbull Each blob consists of a
sequence of blocksbull Each block is identified by a Block
ID
bull Size limit 200GB per blob
bull Page Blob bull Targeted at random
readwrite workloadsbull Each blob consists of an
arrayof pagesbull Each page is identified by its offset
from the start of the blob
bull Size limit 1TB per blob
bull You can upload a file in lsquoblocksrsquobull Each block has an idbull Then commit those blocks in any order into a
blobbull Final blob limited to 1 TB and up to 50000
blocksbull Can modify a blob by inserting updating and
removing blocksbull Blocks live for a week before being GCrsquod if not
committed to a blobbull Optimized for streaming
Blocks
Bigmpg1 6 8 3 5 4 7 2
Bigmpg
Pagesbull Similar to block blobsbull Optimized for random readwrite operations and
provide the ability to write to a range of bytes in a blob
bull Call Put Blob set max size Then call Put Pagebull All pages must align 512-byte page boundariesbull Writes to page blobs happen in-place and are
immediately committed to the blobbull The maximum size for a page blob is 1 TB A
page written to a page blob may be up to 1 TB in size
BLOB Leases
bull Creates a 1 minute exclusive write lock on a BLOB
bull Operations Acquire Renew Release Break
bull Must have the lease id to perform operations
bull Can check LeaseStatus property
bull Currently can only be done through REST
Windows Azure Drive
bull Provides a durable NTFS volume for Windows Azure applications to usebull Use existing NTFS APIs to access a durable
drivebull Durability and survival of data on application failover
bull Enables migrating existing NTFS applications tothe cloud
bullA Windows Azure Drive is a Page Blobbull Example mount Page Blob as X
bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt
bull All writes to drive are made durable to the Page Blobbull Drive made durable through standard Page Blob
replicationbull Drive persists even when not mounted as a Page
Blob
Windows Azure Drive API
bull Create Drive - Creates a Page Blob formatted as a single partition NTFS volume VHD
bull Initialize Cache ndash Allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
bull Mount Drive ndash Takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
bull Get Mounted Drives ndash Returns the list of mounted drives It consists of a list of the drive letter and Page Blob URLs for each mounted drive
bull Unmount Drive ndash Unmounts the drive and frees up the drive letter bull Snapshot Drive ndash Allows the client application to create a backup of the
drive (Page Blob) bull Copy Drive ndash Provides the ability to copy a drive or snapshot to another
drive (Page Blob) name to be used as a readwritable drive
BLOB Guidance
bull Manage connection stringskeys in cscfgbull Do not share keys wrap with a servicebull Strategy for accounts and containersbull You can assign a custom domain to your storage
accountbull There is no method to detect container
existence call FetchAttributes() and detect the error if it doesnrsquot exist
Table Structure
Account MovieData
Star WarsStar TrekFan Boys
Table Name Movies
Brian H PrinceJason ArgonautBill Gates
Table Name Customers
Account
Table
Entity
Tables store entities Entity schema can vary in the same table
Windows Azure Tables
• Provides structured storage
  • Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
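The shape of a table entity can be sketched as plain dictionaries; this is a minimal illustration (not the real storage client API) of the three required system properties alongside varying application-defined properties:

```python
REQUIRED_PROPERTIES = {"PartitionKey", "RowKey", "Timestamp"}

def is_valid_entity(entity):
    """True only if the three system properties are present; everything
    else is an application-defined property and may differ per entity."""
    return REQUIRED_PROPERTIES.issubset(entity)

movie = {
    "PartitionKey": "Action",              # groups entities for locality/scale
    "RowKey": "Fast & Furious",            # unique within the partition
    "Timestamp": "2010-12-07T00:00:00Z",   # maintained by the service
    "ReleaseDate": 2009,                   # application-defined
}
customer = {"PartitionKey": "1", "RowKey": "Customer-John Smith",
            "Timestamp": "2010-12-07T00:00:00Z",
            "Name": "John Smith"}          # different schema, same table is fine
```

Note that the two entities carry different application properties; only the three system properties are fixed.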
Windows Azure Queues
• Queues are performance efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance

Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality

The partition key is the unit of scale
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

"Server Busy"
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Or single-partition limits have been reached
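The "use exponential backoff" advice above can be sketched as a small retry wrapper; a minimal sketch, where `ServerBusyError` is a stand-in for the 503 error a real storage client would surface:

```python
import random

class ServerBusyError(Exception):
    """Stand-in for the HTTP 503 'Server Busy' error a storage client surfaces."""

def call_with_backoff(operation, max_retries=6, base=0.5, cap=30.0,
                      sleep=lambda seconds: None):
    """Retry `operation` on ServerBusyError, doubling the delay each attempt
    (truncated at `cap`), plus jitter so many clients do not retry in lockstep."""
    for attempt in range(max_retries):
        try:
            return operation()
        except ServerBusyError:
            delay = min(cap, base * (2 ** attempt))
            sleep(delay + random.uniform(0, delay / 10))
    raise ServerBusyError("gave up after %d retries" % max_retries)

# Usage: an operation that is busy twice, then succeeds on the third try.
calls = {"n": 0}
def flaky_operation():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServerBusyError()
    return "ok"

result = call_with_backoff(flaky_operation)
```

The injectable `sleep` keeps the sketch testable; production code would simply pass `time.sleep`.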
Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

[Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.]
Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue / Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
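One way to "partition between multiple storage accounts" is to hash object names across a fixed set of accounts; a minimal sketch, where the account names are hypothetical:

```python
import hashlib

# Hypothetical storage account names; each has its own scalability targets,
# so spreading objects across them multiplies the aggregate limits.
ACCOUNTS = ["myappscale0", "myappscale1", "myappscale2", "myappscale3"]

def account_for(blob_name):
    """Deterministically map a blob name to one of several storage accounts.
    The same name always lands on the same account, so reads find their data."""
    digest = hashlib.md5(blob_name.encode("utf-8")).hexdigest()
    return ACCOUNTS[int(digest, 16) % len(ACCOUNTS)]

chosen = account_for("video/annarbor/bighouse.jpg")
```

The caveat is that a fixed modulus makes it painful to add accounts later; that is the usual trade-off of simple hash placement.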
Partitions and Partition Ranges

Server A – Table: Movies [Min – Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

After the partition range splits:

Server A – Table: Movies [Min – Comedy)

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006

Server B – Table: Movies [Comedy – Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008
Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
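Because any of the three conditions above can end a response early, a correct client always loops until the token is absent. A minimal sketch, with `query_page` standing in for a real table query:

```python
def query_all(query_page):
    """Drain a paged query. `query_page(token)` stands in for a table query
    that returns at most 1000 rows plus a continuation token (None when done).
    A page can come back short, or even empty, and still carry a token."""
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:     # only a missing token means "done"
            return rows

# Simulated server: 2500 matching rows delivered 1000 at a time.
DATA = list(range(2500))

def fake_query(token):
    start = token or 0
    page = DATA[start:start + 1000]
    next_token = start + 1000 if start + 1000 < len(DATA) else None
    return page, next_token

all_rows = query_all(fake_query)
```

Stopping when a page is short or empty, instead of when the token disappears, is the classic bug this slide warns about.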
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help scale
• Avoid "append only" patterns; distribute by using a hash, etc., as a prefix
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries on "Server Busy"
• Load balance partitions to meet traffic needs
• Load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
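The work ticket pattern mentioned above can be sketched with in-memory stand-ins for a blob container and a queue: the message carries only a small "ticket" (a blob name), while the large payload lives in blob storage, keeping the message well under the 8 KB limit.

```python
blob_store = {}   # stand-in for a blob container
queue = []        # stand-in for an Azure queue

def submit_work(job_id, payload):
    """Store the (possibly large) payload in blob storage, then enqueue a
    tiny work ticket that merely points at it."""
    blob_store[job_id] = payload
    queue.append({"ticket": job_id})

def process_one():
    """A worker dequeues a ticket, fetches the real work via the reference,
    and processes it (here, just measuring it)."""
    msg = queue.pop(0)                    # GetMessage
    payload = blob_store[msg["ticket"]]   # follow the work ticket
    return len(payload)

submit_work("job-1", b"x" * 100_000)      # 100 KB payload, tiny message
processed_bytes = process_one()
```

A real implementation would also delete the message after processing and garbage-collect orphaned blobs, as the recap slide later notes.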
Queue Terminology

Message Lifecycle
[Diagram: a Web Role calls PutMessage to add messages (Msg 1 … Msg 4) to a Queue; Worker Roles call GetMessage (with a visibility timeout) to retrieve messages and RemoveMessage to delete them once processed.]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling

Consider a backoff polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1
• The interval is truncated at a maximum (e.g., 60)
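The polling policy above reduces to a one-line interval update; a minimal sketch with illustrative bounds:

```python
def next_poll_interval(current, got_message, minimum=1.0, maximum=60.0):
    """Empty poll -> double the interval (truncated at `maximum`);
    successful poll -> snap back to the minimum."""
    if got_message:
        return minimum
    return min(maximum, current * 2)

# An idle queue backs the poller off 1 -> 2 -> 4 -> ... and holds at 60;
# the first delivered message resets it to 1.
interval = 1.0
for _ in range(10):
    interval = next_poll_interval(interval, got_message=False)
```

This keeps transaction costs low on idle queues (every poll is a billed transaction) while staying responsive under load.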
Removing Poison Messages

[Diagram sequence: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue them with a 30-second visibility timeout.]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after the dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after the dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. msg 1's DequeueCount > 2
13. C1: DeleteMessage(Q, msg 1); msg 1 is removed as a poison message
Queues Recap
• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
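The dequeue-count threshold from the recap can be sketched as a small guard in the worker loop; the threshold value and message shape here are illustrative:

```python
MAX_DEQUEUE_COUNT = 3   # policy choice: how many retries before "poison"

def handle(message, process, dead_letter):
    """If a message keeps reappearing (its dequeue count exceeds the
    threshold), treat it as poison: park it for inspection instead of
    processing it yet again. `message` is a dict sketch of a queue message."""
    if message["dequeue_count"] > MAX_DEQUEUE_COUNT:
        dead_letter(message)   # e.g., log it or copy it to a blob, then delete
        return "removed"
    process(message)
    return "processed"

handled, parked = [], []
fresh = {"id": 1, "dequeue_count": 1}
poison = {"id": 2, "dequeue_count": 4}
outcome_fresh = handle(fresh, process=handled.append, dead_letter=parked.append)
outcome_poison = handle(poison, process=handled.append, dead_letter=parked.append)
```

Without this guard, a message whose processing always crashes the worker would be retried forever, taking a worker down each time it becomes visible.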
Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
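The data-parallelism idea above (the slide's example is .NET's Task Parallel Library) can be illustrated with Python's thread pool: the same operation mapped over independent chunks, with the pool sized to the core count rather than one thread per item.

```python
from concurrent.futures import ThreadPoolExecutor

def checksum(chunk):
    """Stand-in for real per-chunk work (Adler-style modulus, illustrative)."""
    return sum(chunk) % 65521

# Eight independent chunks of work.
chunks = [list(range(i, i + 1000)) for i in range(0, 8000, 1000)]

# Data parallelism: map one function over many chunks on a bounded pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(checksum, chunks))
```

The bounded `max_workers` mirrors the slide's caution about exceeding the core count; task parallelism would instead submit different functions as separate futures.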
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in a poor user experience
• Trade-off between the risk of failure / poor user experience due to not having excess capacity, and the cost of having idling VMs
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
  • Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

Uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → Compressed content
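The gzip trade-off (CPU for bandwidth) is easy to see with the standard library; the HTML fragment below is a deliberately repetitive, illustrative payload:

```python
import gzip

# Markup is highly repetitive, so it compresses dramatically.
html = b"<html><body>" + b"<p>hello azure</p>" * 500 + b"</body></html>"
compressed = gzip.compress(html)

ratio = len(compressed) / len(html)   # bytes you no longer pay to serve
```

On realistic pages the savings are smaller than on this contrived input, but typically still large enough to dominate the extra CPU cost.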
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
  • GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
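Input segmentation, the pleasingly parallel option above, is just a chunking step; a minimal sketch with synthetic sequence names:

```python
def segment(sequences, partition_size):
    """Query segmentation: split the input sequences into fixed-size
    partitions that independent workers can BLAST in parallel; the
    per-partition results are simply concatenated (joined) afterwards."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

queries = ["seq%d" % i for i in range(1042)]   # illustrative input set
partitions = segment(queries, 100)             # ~100 sequences per partition
```

Partition size is the tuning knob: too large and slow partitions dominate (load imbalance), too small and per-task overhead dominates, which is exactly the granularity trade-off AzureBLAST profiles below.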
AzureBLAST
• A parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out into many parallel BLAST tasks, followed by a merging task.

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partition → load imbalance
• Small partition → unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
• Best practice: use test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long waiting period in case of instance failure
Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine; tasks are dispatched through a global dispatch queue to pools of Worker instances; a Database Updating Role refreshes the NCBI databases. Azure Tables hold the Job Registry; Azure Blobs hold the NCBI databases, BLAST databases, temporary data, etc. Each job follows the split/join task flow: a splitting task, parallel BLAST tasks, and a merging task.]
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons!

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment was submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances appeared, redistributed the load manually

End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, real working instance time should be 6–8 days
  • Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
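Finding the "something is wrong" cases above amounts to matching "Executing" records against "done" records; a minimal sketch over the anomalous excerpt:

```python
import re

LOG = """\
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

def unfinished_tasks(log_text):
    """A task is suspect if its 'Executing the task N' record is never
    matched by a corresponding 'Execution of task N is done' record."""
    started = set(re.findall(r"Executing the task (\d+)", log_text))
    done = set(re.findall(r"Execution of task (\d+) is done", log_text))
    return started - done

suspects = unfinished_tasks(LOG)
```

Run over the full run's logs, this kind of matching is what surfaced the node-loss and blob-failure patterns on the next slides.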
Surviving System Upgrades
North Europe datacenter: in total, 34,256 tasks processed
• All 62 compute nodes lost tasks and then came back in groups; this is an update domain at work
  • ~30 mins per group
  • ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and then the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain was at work
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" (Irish proverb)
Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
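The Penman-Monteith form above translates directly into code; the numeric inputs below are illustrative sample values (not field data), with λv ≈ 2260 J/g for water and γ ≈ 66 Pa/K as on the slide:

```python
def penman_monteith_et(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2260.0):
    """Penman-Monteith ET exactly as written above:
    ET = (delta*Rn + rho_a*c_p*dq*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v)
    Inputs follow the slide's units (Pa/K, W/m2, kg/m3, J/(kg K), Pa, m/s)."""
    numerator = delta * Rn + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative mid-latitude daytime values.
et = penman_monteith_et(delta=145.0,   # Pa/K, slope near 20 C
                        Rn=400.0,      # W/m2
                        rho_a=1.2,     # kg/m3
                        c_p=1005.0,    # J/(kg K)
                        dq=1000.0,     # Pa vapor pressure deficit
                        g_a=0.02,      # m/s
                        g_s=0.01)      # m/s
```

The hard part in MODISAzure is not this arithmetic but estimating ga and gs across a catchment, which is what drives the big data reduction.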
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

[Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a Request Queue feeds the Data Collection Stage, which pulls from source imagery download sites and consults source metadata; a Reprojection Queue feeds the Reprojection Stage; Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages; a Download Queue delivers scientific results to the scientists.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request flows into the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]

MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

[Diagram: Generic Workers (Worker Roles) pull from the <PipelineStage> Task Queue and read/write <Input> Data Storage.]
Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus (each entity specifies a single reprojection job request), parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e., a single tile), and dispatches to the Task Queue. Generic Workers (Worker Roles) consume tasks, querying the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile and the ScanTimeList table for the list of satellite scan times that cover a target tile, reading from Swath Source Data Storage and writing Reprojection Data Storage.]
Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run reductions multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Stage                | Data                   | Compute                           | Cost
Data collection      | 400-500 GB, 60K files  | 10 MB/sec, 11 hours, <10 workers  | $50 upload, $450 storage
Reprojection         | 400 GB, 45K files      | 3500 hours, 20-100 workers        | $420 CPU, $60 download
Derivation reduction | 5-7 GB, 55K files      | 1800 hours, 20-100 workers        | $216 CPU, $1 download, $6 storage
Analysis reduction   | <10 GB, ~1K files      | 1800 hours, 20-100 workers        | $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
bull Enables migrating existing NTFS applications tothe cloud
bullA Windows Azure Drive is a Page Blobbull Example mount Page Blob as X
bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt
bull All writes to drive are made durable to the Page Blobbull Drive made durable through standard Page Blob
replicationbull Drive persists even when not mounted as a Page
Blob
Windows Azure Drive API
bull Create Drive - Creates a Page Blob formatted as a single partition NTFS volume VHD
bull Initialize Cache ndash Allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
bull Mount Drive ndash Takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
bull Get Mounted Drives ndash Returns the list of mounted drives It consists of a list of the drive letter and Page Blob URLs for each mounted drive
bull Unmount Drive ndash Unmounts the drive and frees up the drive letter bull Snapshot Drive ndash Allows the client application to create a backup of the
drive (Page Blob) bull Copy Drive ndash Provides the ability to copy a drive or snapshot to another
drive (Page Blob) name to be used as a readwritable drive
BLOB Guidance
bull Manage connection stringskeys in cscfgbull Do not share keys wrap with a servicebull Strategy for accounts and containersbull You can assign a custom domain to your storage
accountbull There is no method to detect container
existence call FetchAttributes() and detect the error if it doesnrsquot exist
Table Structure
Account MovieData
Star WarsStar TrekFan Boys
Table Name Movies
Brian H PrinceJason ArgonautBill Gates
Table Name Customers
Account
Table
Entity
Tables store entities Entity schema can vary in the same table
Windows Azure Tables
• Provides structured storage
  • Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational; cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
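The three required properties and the "schema can vary" rule can be sketched in a few lines. This is an illustrative Python sketch with hypothetical entities (dicts standing in for table rows), not the .NET storage client:

```python
import datetime

# Two entities in the same hypothetical "Movies" table: their schemas differ,
# but every entity carries PartitionKey, RowKey, and Timestamp.
movie = {
    "PartitionKey": "Action",                      # groups related entities
    "RowKey": "Fast & Furious",                    # unique within the partition
    "Timestamp": datetime.datetime(2010, 12, 7),   # maintained by the service
    "ReleaseDate": 2009,
}
short_film = {
    "PartitionKey": "Animation",
    "RowKey": "Open Season 2",
    "Timestamp": datetime.datetime(2010, 12, 7),
    "RuntimeMinutes": 76,                          # property the other entity lacks
}

REQUIRED = {"PartitionKey", "RowKey", "Timestamp"}
for entity in (movie, short_film):
    assert REQUIRED <= entity.keys()               # required properties present
```

Note that (PartitionKey, RowKey) together act as the unique key; all other properties are free-form per entity.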
Windows Azure Queues
• Queues are performance efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.

Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality

Partition key is the unit of scale:
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

"Server Busy":
• Use exponential backoff on "Server Busy"
• The system may be load balancing to meet your traffic needs
• Or single-partition limits have been reached
Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Messages – Queue Name
• All messages for a single queue belong to the same partition
Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Queue    | Message
jobs     | Message 1
jobs     | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync

(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.)
Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
Partitions and Partition Ranges

Server A: Table = Movies [Min – Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

After the system splits the partition range across servers:

Server A: Table = Movies [Min – Comedy)

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006

Server B: Table = Movies [Comedy – Max]

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008
Key Selection: Things to Consider

Scalability:
• PartitionKey is critical for scalability
• Distribute load as much as possible
• Hot partitions can be load balanced

Query efficiency & speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions:
• Transactions across a single partition
• Transaction semantics; reduce round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.
Expect Continuation Tokens – Seriously

A continuation token is returned when:
• The response reaches the maximum of 1,000 rows
• The query reaches the end of a partition range boundary
• The query reaches the maximum of 5 seconds to execute
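Handling continuation tokens is just a drain loop. A minimal Python sketch of the pattern; `query_page` is a hypothetical stand-in for a table query call (not the actual storage client API) that returns up to 1,000 rows plus an opaque token, `None` when the result set is exhausted:

```python
def query_all(query_page, filter_expr):
    """Drain a paged query, following continuation tokens until exhausted."""
    rows, token = [], None
    while True:
        page, token = query_page(filter_expr, continuation=token)
        rows.extend(page)
        if token is None:      # no continuation token: all pages consumed
            return rows

# Fake service for illustration: 2,500 matching rows, served 1,000 at a time.
DATA = list(range(2500))

def fake_query_page(filter_expr, continuation=None):
    start = continuation or 0
    page = DATA[start:start + 1000]
    nxt = start + 1000 if start + 1000 < len(DATA) else None
    return page, nxt

assert query_all(fake_query_page, "all") == DATA   # three pages reassembled
```

Forgetting the loop silently truncates results at 1,000 rows, which is why the slide says "seriously".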
Tables Recap
• Efficient for frequently used queries; supports batch transactions; distributes load
• Select PartitionKey and RowKey that help scale
  • Distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens
  • Expect continuation tokens for range queries
• "OR" predicates are not optimized
  • Execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries
  • Server busy: the system may be load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Loose coupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology

Message Lifecycle
(Diagram: a Web Role calls PutMessage to add messages to the queue; Worker Roles call GetMessage (with a visibility timeout) to retrieve a message and RemoveMessage to delete it once processed.)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling

Consider a backoff polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll sets the interval back to 1
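The policy above fits in a small class. This is a minimal sketch, assuming a 1-second initial interval and a 60-second cap (the truncation point); the exact values on the original slide's diagram are assumptions here:

```python
class TruncatedBackoff:
    """Truncated exponential back-off: each empty poll doubles the wait,
    capped at a maximum; a successful poll resets to the initial interval."""

    def __init__(self, initial=1.0, maximum=60.0):
        self.initial, self.maximum = initial, maximum
        self.interval = initial

    def empty_poll(self):
        # Double the wait, but never exceed the cap (the "truncated" part).
        self.interval = min(self.interval * 2, self.maximum)
        return self.interval

    def got_message(self):
        # Success: resume aggressive polling.
        self.interval = self.initial
        return self.interval

b = TruncatedBackoff()
waits = [b.empty_poll() for _ in range(7)]
assert waits == [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]   # doubles, then caps
assert b.got_message() == 1.0                             # success resets
```

This keeps transaction costs down when the queue is idle (polls are billed) while staying responsive under load.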
Removing Poison Messages

(Diagram: producers P1, P2 enqueue; consumers C1, C2 dequeue.)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (continued)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (continued)
1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
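The sequence above (visibility timeout plus a DequeueCount threshold) can be simulated end to end. `FakeQueue` is an in-memory stand-in for the cloud queue, not the Azure API; a logical clock replaces real sleeping:

```python
class FakeQueue:
    """In-memory queue with visibility timeouts and per-message dequeue counts."""

    def __init__(self):
        self.now = 0.0
        self.msgs = []                              # dicts: body, count, visible_at

    def put(self, body):
        self.msgs.append({"body": body, "count": 0, "visible_at": 0.0})

    def get(self, visibility_timeout=30.0):
        for m in self.msgs:
            if m["visible_at"] <= self.now:
                m["count"] += 1                     # the DequeueCount
                m["visible_at"] = self.now + visibility_timeout
                return m
        return None

    def delete(self, m):
        self.msgs.remove(m)

MAX_DEQUEUE = 2

def process(q, handler):
    msg = q.get()
    if msg is None:
        return
    if msg["count"] > MAX_DEQUEUE:                  # poison: remove, don't retry
        q.delete(msg)
        return
    handler(msg["body"])                            # a crash here -> msg reappears
    q.delete(msg)

q = FakeQueue()
q.put("bad job")
for _ in range(MAX_DEQUEUE):                        # two attempts that "crash"
    q.get()                                         # dequeued, never deleted
    q.now += 30.0                                   # visibility timeout elapses
process(q, handler=lambda body: None)               # third dequeue trips threshold
assert q.msgs == []                                 # poison message removed
```

The key point: a consumer crash never loses a message, and a message that repeatedly defeats its consumers is eventually discarded rather than retried forever.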
Queues Recap
• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
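The deck's recommendation is the .NET Task Parallel Library; the same two patterns can be sketched in Python with `concurrent.futures` (an illustrative analogue, not the deck's code):

```python
from concurrent.futures import ThreadPoolExecutor

# Data parallelism: one operation applied across a collection of inputs.
def checksum(chunk):
    return sum(chunk) % 251

chunks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
with ThreadPoolExecutor(max_workers=4) as pool:
    sums = list(pool.map(checksum, chunks))        # same function, many inputs

# Task parallelism: independent tasks of different kinds run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    f_low = pool.submit(min, sums)                 # two different operations
    f_high = pool.submit(max, sums)
    low, high = f_low.result(), f_high.result()

assert len(sums) == 10 and low <= high
```

Either way, the pool size should track the core count the VM size actually provides, per the bullet about active processes exceeding cores.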
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity and the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content.)
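The payoff of point 1 is easy to demonstrate with Python's standard `gzip` module; the HTML payload here is made up for illustration:

```python
import gzip

# Repetitive markup, typical of generated HTML.
html = b"<html><body>" + b"<p>hello cloud</p>" * 500 + b"</body></html>"
compressed = gzip.compress(html)

# Gzip is lossless, and highly repetitive content shrinks dramatically,
# which is bandwidth (and often storage) you no longer pay for.
assert gzip.decompress(compressed) == html       # round trip is exact
assert len(compressed) < len(html) // 10         # large reduction on this input
```

Real pages compress less than this contrived one, but 60-80% savings on text content are common, and every modern browser inflates the response transparently.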
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications

NCBI BLAST

BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1,000 CPU hours
• Sequence databases are growing exponentially
  • GenBank doubled in size in about 15 months

Opportunities for Cloud Computing

It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.
AzureBLAST Task-Flow

A simple split/join pattern:
(Diagram: a splitting task fans out into many BLAST tasks that run in parallel; a merging task joins their results.)

Leverage the multi-core capability of one instance:
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: profile with test runs and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of instance failure
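The split/join pattern above is only a few functions. A Python sketch of the query-segmentation shape; `blast_task` is a hypothetical stand-in for invoking NCBI-BLAST on one partition, not the real tool:

```python
def split(sequences, partition_size):
    """Query segmentation: break the input into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition, database):
    """Stand-in for a worker running BLAST on one partition: here, a toy
    membership test instead of a real alignment."""
    return [(seq, seq in database) for seq in partition]

def merge(partial_results):
    """Join: concatenate per-partition results, preserving input order."""
    return [hit for partial in partial_results for hit in partial]

db = {"ACGT", "TTGA"}
queries = ["ACGT", "GGGG", "TTGA", "CCCC"]
partitions = split(queries, 2)                     # tune size per the profiling advice
results = merge(blast_task(p, db) for p in partitions)
assert [hit for _, hit in results] == [True, False, True, False]
```

In AzureBLAST each partition becomes a work-ticket message on the dispatch queue, and the merge runs once all task results have landed in blob storage.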
Micro-Benchmarks Inform Design

Task size vs. performance:
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource
AzureBLAST

(Architecture diagram: a Web Role hosts the web portal, web service, and job registration; accepted jobs are recorded in a job registry (Azure Table). A Job Management Role runs the job scheduler and scaling engine, dispatching work through a global dispatch queue to Worker instances. Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a Database Updating Role keeps the NCBI databases current. A splitting task fans BLAST tasks out to the workers, and a merging task joins the results.)
AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID

The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load becomes imbalanced, redistribute it manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, real working instance time should be 6–8 days
  • Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs

A normal log record looks like:

3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...
3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades

North Europe Data Center: in total, 34,256 tasks processed.
All 62 compute nodes lost tasks and then came back in groups; this is an update domain:
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed, and the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ  = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ  = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
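The Penman-Monteith formula itself is a one-liner once the inputs are in hand; the hard part, as noted, is estimating the conductivities. A direct Python transcription, with sample values that are purely illustrative (not field-calibrated data):

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2260.0):
    """Penman-Monteith ET: (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv).

    Arguments follow the symbol list above; defaults are the psychrometric
    constant (~66 Pa/K) and the latent heat of vaporization (~2260 J/g).
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative inputs: a warm day over vegetation (assumed values).
et = penman_monteith(delta=145.0, r_n=400.0, rho_a=1.2, c_p=1005.0,
                     dq=1000.0, g_a=0.02, g_s=0.01)
assert et > 0.0
```

In the MODISAzure pipeline this scalar computation runs per pixel per time step, which is why the reduction stage dominates the compute bill.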
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: scientists submit requests through the AzureMODIS Service web role portal; a request queue feeds a download queue for the data collection stage, which pulls from source imagery download sites; reprojection, reduction 1, and reduction 2 queues drive the reprojection, derivation reduction, and analysis reduction stages; source metadata is consulted throughout, and scientific results are downloaded at the end.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.)
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

(Diagram: Generic Worker (Worker Role) instances pull from the <PipelineStage> task queue, read <Input> data storage, and record <PipelineStage>TaskStatus.)
Example Pipeline Stage: Reprojection Service

(Diagram: a reprojection request enters the job queue; the Service Monitor persists ReprojectionJobStatus, then parses and persists ReprojectionTaskStatus and dispatches to the task queue; Generic Workers consume tasks, reading swath source data storage and writing reprojection data storage.)

• Each job-queue entity specifies a single reprojection job request
• Each task-queue entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload + $450 storage
Reprojection Stage: 400 GB, 45K files, 3,500 hours, 20-100 workers; $420 CPU + $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1,800 hours, 20-100 workers; $216 CPU + $1 download + $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1,800 hours, 20-100 workers; $216 CPU + $2 download + $9 storage

Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Windows Azure Drive
bull Provides a durable NTFS volume for Windows Azure applications to usebull Use existing NTFS APIs to access a durable
drivebull Durability and survival of data on application failover
bull Enables migrating existing NTFS applications tothe cloud
bullA Windows Azure Drive is a Page Blobbull Example mount Page Blob as X
bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt
bull All writes to drive are made durable to the Page Blobbull Drive made durable through standard Page Blob
replicationbull Drive persists even when not mounted as a Page
Blob
Windows Azure Drive API
bull Create Drive - Creates a Page Blob formatted as a single partition NTFS volume VHD
bull Initialize Cache ndash Allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
bull Mount Drive ndash Takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
bull Get Mounted Drives ndash Returns the list of mounted drives It consists of a list of the drive letter and Page Blob URLs for each mounted drive
bull Unmount Drive ndash Unmounts the drive and frees up the drive letter bull Snapshot Drive ndash Allows the client application to create a backup of the
drive (Page Blob) bull Copy Drive ndash Provides the ability to copy a drive or snapshot to another
drive (Page Blob) name to be used as a readwritable drive
BLOB Guidance
bull Manage connection stringskeys in cscfgbull Do not share keys wrap with a servicebull Strategy for accounts and containersbull You can assign a custom domain to your storage
accountbull There is no method to detect container
existence call FetchAttributes() and detect the error if it doesnrsquot exist
Table Structure
Account MovieData
Star WarsStar TrekFan Boys
Table Name Movies
Brian H PrinceJason ArgonautBill Gates
Table Name Customers
Account
Table
Entity
Tables store entities Entity schema can vary in the same table
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language
Is not relational
Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
• Different for each data type (blobs, entities, queues)
• Every data object has a partition key
  • A partition can be served by a single server
  • System load balances partitions based on traffic pattern
  • Controls entity locality
• Partition key is the unit of scale
  • Load balancing can take a few minutes to kick in
  • Can take a couple of seconds for a partition to become available on a different server
• "Server Busy" means the system is load balancing
  • The system load balances to meet your traffic needs
  • Or single-partition limits have been reached
  • Use exponential backoff on "Server Busy"
Partition Keys In Each Abstraction
• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition
• Messages – Queue name: all messages for a single queue belong to the same partition
Container Name | Blob Name
image | annarborbighousejpg
image | foxboroughgillettejpg
video | annarborbighousejpg

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3)
Scalability Targets
Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When the limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
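When a partition's limits are hit, the '503 Server Busy' response above should be absorbed by a retry loop with exponential backoff. A minimal sketch in Python (the `ServerBusyError` type is a stand-in for the storage client's 503 exception, not a real SDK class):

```python
import random
import time


class ServerBusyError(Exception):
    """Stand-in for the '503 Server Busy' response described above."""


def with_backoff(op, max_retries=6, base=0.5, cap=30.0):
    """Retry `op` on 'Server Busy', doubling the sleep each attempt
    (with jitter) and truncating it at `cap` seconds."""
    for attempt in range(max_retries):
        try:
            return op()
        except ServerBusyError:
            delay = min(cap, base * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))
    return op()  # final attempt; let the error propagate to the caller
```

The jitter keeps a fleet of workers from retrying in lockstep against the same hot partition.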
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Partitions and Partition Ranges
Server A: Table = Movies [Min – Max]
After the partition splits:
Server A: Table = Movies [Min – Comedy)
Server B: Table = Movies [Comedy – Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency & speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select a PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• "Server busy" means the system is load balancing partitions to meet traffic needs, or load on a single partition has exceeded the limits
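"Always handle continuation tokens" amounts to a drain loop: keep re-issuing the query with the returned token until the service stops handing one back. A generic sketch, where `query_page` is a hypothetical paging callable standing in for the table client (the real service surfaces the token via x-ms-continuation headers):

```python
def query_all(query_page):
    """Drain a table query that may return continuation tokens.

    `query_page(token)` returns (rows, next_token); next_token is None
    once the result set is exhausted. Remember: a token can arrive even
    for small results, e.g. at a partition range boundary or after the
    5-second execution limit.
    """
    token, rows = None, []
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:
            return rows
```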
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • This can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
(Diagram: Web Role → PutMessage → Queue [Msg 1, Msg 2, Msg 3, Msg 4] → Worker Roles: GetMessage (with timeout), then RemoveMessage)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a back-off polling approach: each empty poll increases the interval by 2x; a successful poll sets the interval back to 1.
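The polling rule above is a one-liner per step. A sketch of the interval calculation (the 1 s floor and 60 s ceiling are illustrative defaults, not prescribed values):

```python
def next_poll_interval(current, got_message, floor=1.0, ceiling=60.0):
    """Truncated exponential back-off for queue polling.

    An empty poll doubles the interval, truncated at `ceiling`;
    a successful poll resets it to `floor`.
    """
    if got_message:
        return floor
    return min(ceiling, current * 2)
```

A worker loop simply sleeps `next_poll_interval(...)` seconds between GetMessage calls, so an idle queue costs only one transaction per minute instead of one per second.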
Removing Poison Messages
Producers: P1, P2   Consumers: C1, C2
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (2)
Producers: P1, P2   Consumers: C1, C2
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (3)
Producers: P1, P2   Consumers: C1, C2
1. Dequeue(Q, 30 s) → msg 1
2. Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
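The walkthrough above boils down to one check in the worker loop: if a message's dequeue count exceeds a threshold, it is poison and must be deleted (or parked) rather than retried forever. A sketch with hypothetical queue objects (`.get`, `.delete`, `.dequeue_count` mirror the semantics shown, not a real SDK surface):

```python
POISON_THRESHOLD = 3  # illustrative; the walkthrough uses DequeueCount > 2


def handle_one(queue, poison_queue, process):
    """One iteration of a worker loop that removes poison messages."""
    msg = queue.get(visibility_timeout=30)
    if msg is None:
        return  # empty poll; caller backs off
    if msg.dequeue_count > POISON_THRESHOLD:
        poison_queue.put(msg.body)  # park for offline inspection
        queue.delete(msg)
        return
    process(msg.body)
    queue.delete(msg)  # delete only after successful processing
```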
Queues Recap
• Make message processing idempotent – no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use message count to scale – dynamically increase/reduce workers
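The "blob reference" tip for oversized payloads is the work-ticket pattern: write the data to a blob, enqueue only its name, and have the consumer delete the blob after download so no orphans accumulate. A sketch with hypothetical `queue` and `blob_container` stand-ins for the storage clients (method names are illustrative, not SDK calls):

```python
import json
import uuid

MAX_INLINE = 8 * 1024  # the 8 KB queue message limit


def enqueue_payload(queue, blob_container, payload: bytes):
    """Enqueue small payloads inline; spill large ones to a blob."""
    if len(payload) <= MAX_INLINE:
        queue.put(json.dumps({"inline": payload.decode("latin-1")}))
        return
    name = str(uuid.uuid4())
    blob_container.upload(name, payload)
    queue.put(json.dumps({"blob_ref": name}))  # the work ticket


def dequeue_payload(queue, blob_container) -> bytes:
    msg = json.loads(queue.get())
    if "blob_ref" in msg:
        data = blob_container.download(msg["blob_ref"])
        blob_container.delete(msg["blob_ref"])  # GC the spill blob
        return data
    return msg["inline"].encode("latin-1")
```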
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
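The data-parallelism bullet is the simplest to sketch: fan independent work items across a pool sized to the core count, so one role instance keeps all of its (paid-for) cores busy. The slide's example is .NET's Task Parallel Library; the same shape in Python, as a stand-in:

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_map(fn, items, workers=None):
    """Apply `fn` to each item concurrently, defaulting the pool size
    to the machine's core count (data parallelism in the TPL sense)."""
    workers = workers or (os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))
```

For CPU-bound Python work a process pool would be the better fit; the point here is the pattern, matching pool size to cores rather than oversubscribing.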
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the costs of having idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web app's performance and user experience.
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
(Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
  • Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (ScienceCloud 2010), ACM, 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern
• Leverage the multiple cores of one instance
  • Argument "-a" of NCBI-BLAST
  • 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
• Task granularity
  • Large partition → load imbalance
  • Small partition → unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
  • Best practice: do test runs to profile, and set the size to mitigate the overhead
• Value of visibilityTimeout for each BLAST task
  • Essentially an estimate of the task run time
  • Too small → repeated computation
  • Too large → unnecessarily long waiting period in case of instance failure
(Task flow: Splitting task → BLAST tasks in parallel → Merging task)
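The split/join pattern above is a few lines of partitioning logic: carve the input sequences into fixed-size partitions, run BLAST on each in parallel, then concatenate the per-partition outputs. A sketch (the 100-sequences-per-partition default anticipates the micro-benchmark result on the next slide; the merge step assumes order-preserving concatenation is an acceptable join, as it is for per-query BLAST output):

```python
def split_query(sequences, partition_size=100):
    """Split step: carve input sequences into fixed-size partitions,
    each of which becomes one queued BLAST task."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]


def merge_results(partial_results):
    """Join step: concatenate per-partition outputs in partition order."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged
```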
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
(BLAST databases, temporary data, etc.)
Job Registry / NCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID
The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically 100 billion sequence comparisons
Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 cores
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g. the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
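Spotting the anomaly above mechanically is a matter of pairing "Executing" records with their matching "is done" records and flagging tasks that never finished. A sketch of that log analysis (the regexes assume the cleaned record format shown above):

```python
import re

START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done, it took ([\d.]+)\s*mins")


def unfinished_tasks(log_lines):
    """Return the ids of tasks that were started but never reported done,
    the 'something is wrong' case in the slide."""
    started, finished = set(), set()
    for line in log_lines:
        m = DONE.search(line)
        if m:
            finished.add(m.group(1))
            continue
        m = START.search(line)
        if m:
            started.add(m.group(1))
    return started - finished
```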
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in a group – this is an update domain
• ~30 mins
• ~6 nodes in one group
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed
35 nodes experienced blob-writing failures at the same time
A reasonable guess: the fault domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." – Irish proverb
Computing Evapotranspiration (ET)
Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
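The Penman-Monteith formula above translates directly into code; this is the per-pixel arithmetic the reduction stage repeats across the whole imagery archive. A sketch (the default values for γ and λv are assumptions for illustration; units follow the slide's definitions):

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith (1964):

        ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

    gamma defaults to ~66 Pa/K (psychrometric constant); lambda_v to
    ~2450 J/g (latent heat of vaporization near 20 °C) – both
    illustrative defaults, not calibrated values.
    """
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```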
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
(Diagram: Scientists → Request Queue → AzureMODIS Service Web Role Portal → Download Queue → Data Collection Stage (Source Imagery Download Sites, Source Metadata) → Reprojection Queue → Reprojection Stage → Reduction 1 Queue → Derivation Reduction Stage → Reduction 2 Queue → Analysis Reduction Stage → Science results → Scientific Results Download)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables
(Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage> JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch → <PipelineStage> Task Queue)
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
(Diagram: Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch → <PipelineStage> Task Queue → GenericWorker (Worker Role) ↔ <Input> Data Storage)
Example Pipeline Stage: Reprojection Service
(Diagram: Reprojection Request → Service Monitor (Worker Role) → Persist ReprojectionJobStatus → Job Queue → Parse & Persist ReprojectionTaskStatus → Dispatch → Task Queue → GenericWorker (Worker Role) → Reprojection Data Storage; metadata tables: ScanTimeList, SwathGranuleMeta; Swath Source Data Storage)
• Each job-queue entity specifies a single reprojection job request
• Each task-queue entity specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Data collection stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20–100 workers – $420 CPU, $60 download
Derivation reduction stage: 5–7 GB, 55K files, 1800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Blocks
• You can upload a file in 'blocks'
• Each block has an id
• Then commit those blocks, in any order, into a blob
• Final blob limited to 1 TB and up to 50,000 blocks
• Can modify a blob by inserting, updating, and removing blocks
• Blocks live for a week before being GC'd if not committed to a blob
• Optimized for streaming
(Diagram: Big.mpg uploaded as blocks 1 6 8 3 5 4 7 2, then committed into Big.mpg)
Pages
• Similar to block blobs
• Optimized for random read/write operations, and provide the ability to write to a range of bytes in a blob
• Call Put Blob to set the max size, then call Put Page
• All pages must align to 512-byte page boundaries
• Writes to page blobs happen in-place and are immediately committed to the blob
• The maximum size for a page blob is 1 TB; a page written to a page blob may be up to 1 TB in size
BLOB Leases
• Creates a 1-minute exclusive write lock on a blob
• Operations: Acquire, Renew, Release, Break
• Must have the lease id to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST
Windows Azure Drive
bull Provides a durable NTFS volume for Windows Azure applications to usebull Use existing NTFS APIs to access a durable
drivebull Durability and survival of data on application failover
bull Enables migrating existing NTFS applications tothe cloud
bullA Windows Azure Drive is a Page Blobbull Example mount Page Blob as X
bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt
bull All writes to drive are made durable to the Page Blobbull Drive made durable through standard Page Blob
replicationbull Drive persists even when not mounted as a Page
Blob
Windows Azure Drive API
bull Create Drive - Creates a Page Blob formatted as a single partition NTFS volume VHD
bull Initialize Cache ndash Allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
bull Mount Drive ndash Takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
bull Get Mounted Drives ndash Returns the list of mounted drives It consists of a list of the drive letter and Page Blob URLs for each mounted drive
bull Unmount Drive ndash Unmounts the drive and frees up the drive letter bull Snapshot Drive ndash Allows the client application to create a backup of the
drive (Page Blob) bull Copy Drive ndash Provides the ability to copy a drive or snapshot to another
drive (Page Blob) name to be used as a readwritable drive
BLOB Guidance
bull Manage connection stringskeys in cscfgbull Do not share keys wrap with a servicebull Strategy for accounts and containersbull You can assign a custom domain to your storage
accountbull There is no method to detect container
existence call FetchAttributes() and detect the error if it doesnrsquot exist
Table Structure
Account MovieData
Star WarsStar TrekFan Boys
Table Name Movies
Brian H PrinceJason ArgonautBill Gates
Table Name Customers
Account
Table
Entity
Tables store entities Entity schema can vary in the same table
Windows Azure Tables
bull Provides Structured Storagebull Massively Scalable Tablesbull Billions of entities (rows) and TBs of
databull Can use thousands of servers as traffic
grows
bull Highly Available amp Durablebull Data is replicated several times
bull Familiar and Easy to use APIbull WCF Data Services and ODatabull NET classes and LINQbull REST ndash with any platform or language
Is not relationalCan Not-bull Create foreign key relationships between tablesbull Perform server side joins between tablesbull Create custom indexes on the tablesbull No server side Count() for example
All entities must have the following propertiesbull Timestampbull PartitionKeybull RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
Every data object has a partition key:
• Different for each data type (blobs, entities, queues)
The partition key is the unit of scale:
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
The system load balances:
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
Server Busy:
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Or single-partition limits have been reached
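The "exponential backoff on Server Busy" advice can be sketched as a small retry wrapper (a sketch under stated assumptions: `ServerBusy` stands in for an HTTP 503 response, and `with_backoff` is my own helper name, not an SDK call):

```python
import random
import time

class ServerBusy(Exception):
    """Stand-in for an HTTP 503 'Server Busy' response from storage."""

def with_backoff(operation, max_retries=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Retry `operation` with truncated exponential backoff whenever
    the service reports it is busy."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ServerBusy:
            if attempt == max_retries:
                raise
            # The delay doubles each attempt, truncated at `cap`; jitter
            # keeps many clients from retrying in lockstep.
            sleep(min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0))
```

Injecting `sleep` makes the policy easy to test and to tune separately from the storage calls it wraps.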
Partition Keys In Each Abstraction
• Entities – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

• Blobs – Container name + Blob name: every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarborbighouse.jpg
image | foxboroughgillette.jpg
video | annarborbighouse.jpg

• Messages – Queue name: all messages for a single queue belong to the same partition

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync
(Diagram: Server 1, Server 2, and Server 3 each hold replicas of partitions P1, P2, …, Pn.)
Scalability Targets
Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
Partitions and Partition Ranges

Server A: Table = Movies [Min – Max]

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

After the system splits the partition range across servers:

Server A: Table = Movies [Min – Comedy)

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B: Table = Movies [Comedy – Max]

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Key Selection: Things to Consider
Scalability:
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency & speed:
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions:
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.
Expect Continuation Tokens – Seriously
A query can return a continuation token:
• At a maximum of 1,000 rows in a response
• At the end of a partition range boundary
• At a maximum of 5 seconds of query execution
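Draining a query therefore means looping until the token is gone, not until a page comes back short. A minimal sketch (the `query_page` callable is an assumption standing in for whatever client call returns a page plus a token):

```python
def query_all(query_page):
    """Drain a table query that returns at most 1,000 rows per call.
    `query_page(token)` must return (rows, next_token); next_token is
    None once the result set is exhausted, mirroring the REST contract."""
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        # A page may be short or even empty (partition-range boundary,
        # 5-second limit) and still carry a token: loop on the token.
        if token is None:
            return rows
```

The key point the code encodes: an empty or short page is not a termination signal; only a missing continuation token is.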
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select a PartitionKey and RowKey that help scale:
• Distribute by using a hash, etc., as a prefix
• Avoid "append-only" patterns
Always handle continuation tokens:
• Expect continuation tokens for range queries
"OR" predicates are not optimized:
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries:
• Server busy means either partitions are being load balanced to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Loose coupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly used with the work-ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout) / RemoveMessage
Msg 2, Msg 1
Worker Role
Msg 2
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll increases the polling interval by 2x; a successful poll sets the interval back to 1.
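The doubling/reset rule above fits in one small function (the name `next_poll_interval` and the 60-second ceiling are my choices for illustration):

```python
def next_poll_interval(current, got_message, initial=1.0, maximum=60.0):
    """Truncated exponential back-off polling: each empty poll doubles
    the interval (up to `maximum`); a successful poll resets it to
    `initial` so a busy queue is drained promptly."""
    if got_message:
        return initial
    return min(maximum, current * 2)
```

A worker would call this after every GetMessage, sleeping for the returned interval before polling again; the truncation keeps an idle worker's latency bounded once traffic resumes.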
Removing Poison Messages
Producers P1 and P2 put messages on queue Q; consumers C1 and C2 process them.

Normal dequeue:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2

Consumer crash – the message reappears:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1

Poison message – use the dequeue count:
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible again 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
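The visibility-timeout and dequeue-count mechanics behind these scenarios can be modeled in a few lines (a toy model, not the service: `ToyQueue` and its method names are mine, and time is passed in explicitly rather than read from a clock):

```python
import heapq

class ToyQueue:
    """Minimal model of Azure queue semantics: GetMessage hides a
    message for `visibility_timeout` seconds instead of deleting it,
    and each dequeue bumps the message's DequeueCount."""
    def __init__(self):
        self._messages = []   # heap of (visible_at, id, body, dequeue_count)
        self._next_id = 0

    def put(self, body, now=0.0):
        heapq.heappush(self._messages, (now, self._next_id, body, 0))
        self._next_id += 1

    def get(self, now, visibility_timeout=30.0):
        if not self._messages or self._messages[0][0] > now:
            return None
        _, mid, body, count = heapq.heappop(self._messages)
        count += 1
        # Reinsert, hidden until the timeout elapses; if the consumer
        # never deletes it, it reappears: at-least-once delivery.
        heapq.heappush(self._messages, (now + visibility_timeout, mid, body, count))
        return {"id": mid, "body": body, "DequeueCount": count}

    def delete(self, mid):
        self._messages = [m for m in self._messages if m[1] != mid]
        heapq.heapify(self._messages)

q = ToyQueue()
q.put("msg 1")
msg = q.get(now=0.0)        # consumer crashes without deleting...
msg = q.get(now=31.0)       # ...so the message is visible again after 30 s
if msg["DequeueCount"] > 2: # poison-message threshold from the slide
    q.delete(msg["id"])
```

Running the poison-message scenario against this model makes the invariant concrete: a message only leaves the queue via an explicit delete, so a crashing consumer can never lose it, only delay it.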
Queues Recap
• Make message processing idempotent – then there is no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB – use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
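The trade-off can be made concrete with some back-of-the-envelope math (every number here is illustrative, not a real Azure price or throughput; the sub-linear scaling model is an assumption):

```python
import math

def vm_count_and_cost(load, per_core_rate, cores, efficiency, core_hour_price=0.12):
    """Sketch of the small-vs-large VM trade-off. Assume a VM with
    `cores` cores handles per_core_rate * cores * efficiency**(cores-1)
    requests/sec: each extra core is discounted by the scaling
    efficiency, which is what makes big VMs less attractive."""
    rate_per_vm = per_core_rate * cores * efficiency ** (cores - 1)
    vms = math.ceil(load / rate_per_vm)           # enough VMs to carry the load
    return vms, vms * cores * core_hour_price     # (instance count, $/hour)

# At 90% per-core scaling efficiency, eight 1-core VMs undercut 8-core VMs:
small = vm_count_and_cost(load=800, per_core_rate=100, cores=1, efficiency=0.9)
large = vm_count_and_cost(load=800, per_core_rate=100, cores=8, efficiency=0.9)
```

With these made-up numbers the small instances win on cost and give eight failure domains instead of three, which is exactly the uptime argument above; perfect linear scaling (efficiency = 1.0) would erase the gap.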
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using its CPU
• Balance using up the CPU vs. keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
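The data-parallel pattern the Task Parallel Library offers looks the same in any language: split the input, fan the pieces out across a pool sized to the core count, combine the partial results. A minimal sketch (the chunking scheme and `process` work function are placeholders for real per-chunk work):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process(chunk):
    # Stand-in for per-chunk work (hash, transform, upload, ...).
    return sum(chunk)

def parallel_sum(data, chunks=8):
    """Data parallelism: partition the input, map the pieces across a
    pool sized to the machine's core count, and reduce the results."""
    size = max(1, len(data) // chunks)
    pieces = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        return sum(pool.map(process, pieces))
```

For CPU-bound Python work a process pool would be the better fit; the structure (partition, map, reduce) is what carries over to the TPL's `Parallel.For` and PLINQ.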
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• There is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • The service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile.
Saving bandwidth costs often leads to savings in other places: sending fewer things over the wire often means getting fewer things from storage, and it means your VM has time to do other tasks.
All of these tips have the side benefit of improving your web app's performance and user experience.
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
(Chart: compressed vs. uncompressed content for Gzip, minified JavaScript, minified CSS, minified images.)
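A quick way to see what gzip buys before paying bandwidth for a response (the sample payload is invented; real savings depend on how repetitive the content is):

```python
import gzip

def gzip_ratio(payload: bytes) -> float:
    """Compressed size as a fraction of the original: the compute cost
    of gzip is traded for smaller responses and storage."""
    return len(gzip.compress(payload)) / len(payload)

# Repetitive markup (typical of HTML, CSS, JSON) compresses dramatically.
html = b"<html><body>" + b"<p>hello azure</p>" * 500 + b"</body></html>"
ratio = gzip_ratio(html)  # well under 10% for a payload like this
```

Already-compressed formats (JPEG, PNG, existing gzip) sit near a ratio of 1.0, which is why the slide treats image minimization as a separate step from gzipping text output.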
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A single BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially: GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
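Query segmentation, the approach AzureBLAST takes below, is attractive precisely because the join step is trivial. A sketch of the split and merge (function names are mine; a real partition would hold FASTA records rather than placeholders):

```python
def split_queries(sequences, partition_size):
    """Query segmentation: cut the input sequence list into fixed-size
    partitions that workers can align against the database
    independently of one another."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge_results(per_partition_hits):
    # Because each partition queried the *whole* database, merging is a
    # plain concatenation -- no reduction step, unlike database
    # segmentation (mpiBLAST), which must re-rank hits across shards.
    return [hit for hits in per_partition_hits for hit in hits]
```

The partition size is the tuning knob discussed under task granularity later: too large and workers idle unevenly, too small and per-task overhead dominates.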
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task Flow
A simple split/join pattern: Splitting task → BLAST task, BLAST task, BLAST task, BLAST task, … → Merging task
• Leverage the multiple cores of one instance
  • Argument "-a" of NCBI-BLAST: 1/2/4/8 for small, medium, large, and extra-large instance sizes
• Task granularity
  • Large partitions: load imbalance
  • Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
  • Best practice: use test runs to profile, and set the partition size to mitigate the overhead
• Value of visibilityTimeout for each BLAST task
  • Essentially an estimate of the task run time
  • Too small: repeated computation
  • Too large: unnecessarily long waiting period in case of instance failure
Micro-Benchmarks Inform Design
• Task size vs. performance
  • Benefit of the warm-cache effect
  • 100 sequences per partition is the best choice
• Instance size vs. performance
  • Super-linear speedup with larger worker instances
  • Primarily due to the memory capacity
• Task size / instance size vs. cost
  • The extra-large instance generated the best and most economical throughput
  • Fully utilizes the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
Worker
Worker
Worker
Global dispatch queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Azure Blob (BLAST databases, temporary data, etc.)
Job Registry / NCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
The accepted job is stored in the job registry table:
• Fault tolerance – avoid in-memory states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time.
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences in total to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load becomes imbalanced, redistribute it manually
End Result
• The total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, real working instance time should be 6–8 days
  • Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should look like this:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise, something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
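Detecting the "something is wrong" case mechanically amounts to pairing start and done records (a sketch against the log shape shown above; the patterns assume that exact phrasing):

```python
import re

START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(log_lines):
    """Scan worker logs for tasks that logged a start but never logged
    completion -- the signature of a crashed or preempted instance."""
    started, finished = set(), set()
    for line in log_lines:
        if m := DONE.search(line):
            finished.add(m.group(1))
        elif m := START.search(line):
            started.add(m.group(1))
    return sorted(started - finished)
```

Grouping these orphaned tasks by node and timestamp is what exposes the update-domain and fault-domain patterns on the next slides.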
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups: this is an update domain.
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and then the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·(δq)·ga) / (λv·(Δ + γ·(1 + ga/gs)))
where:
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stoma air (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)
Estimating the resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
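The Penman-Monteith relation above is a direct pointwise computation once the inputs are in hand (the hard part is producing them); a transcription, with the default γ and λv taken from the definitions and all sample magnitudes merely illustrative:

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET, units as defined above (gamma in Pa/K,
    lambda_v in J/g). The per-pixel reduction stage applies exactly
    this kind of formula across the reprojected imagery."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = lambda_v * (delta + gamma * (1.0 + g_a / g_s))
    return numerator / denominator
```

The conductivities ga and gs are the inputs the slide calls "not so simple": they come from the sensor, vegetation, and climate datasets synthesized on the next slide, not from the imagery alone.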
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
1. Data collection (map) stage
  • Downloads requested input tiles from NASA FTP sites
  • Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
2. Reprojection (map) stage
  • Converts source tile(s) to intermediate-result sinusoidal tiles
  • Simple nearest-neighbor or spline algorithms
3. Derivation reduction stage
  • First stage visible to the scientist
  • Computes ET in our initial use
4. Analysis reduction stage
  • Optional second stage visible to the scientist
  • Enables production of science analysis artifacts such as maps, tables, and virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
Download Queue
Scientists
Science results
Analysis Reduction Stage / Derivation Reduction Stage / Reprojection Stage
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx

MODISAzure Architectural Big Picture (1/2)
• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • The execution status of all jobs and tasks is persisted in Tables

Flow: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage> JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch to <PipelineStage> Task Queue
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

Flow: Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch to <PipelineStage> Task Queue → Generic Worker (Worker Role) → <Input> Data Storage
Example Pipeline Stage: Reprojection Service
Flow: Reprojection Request → Service Monitor (Worker Role), which persists the Reprojection JobStatus and parses & persists the Reprojection TaskStatus → Dispatch to the Task Queue → Generic Worker (Worker Role) → Reprojection Data Storage
• Each Job Queue entity specifies a single reprojection job request
• Each Task Queue entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
• Swath source data storage holds the input imagery
Costs for 1 US Year ET Computation
• Computational costs are driven by the data scale and the need to run the reduction multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
Per-stage figures and costs (from the pipeline diagram):
• Data collection: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3,500 hours, 20–100 workers – $420 CPU, $60 download
• Derivation reduction: 5–7 GB, 55K files, 1,800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1,800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage
AzureMODIS Service Web Role Portal
Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com

Contents:
- Windows Azure for Research, Roger Barga, Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components: Fabric Controller
- Key Components: Fabric Controller (2)
- Key Components: Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using Queues for Reliable Messaging
- Scalable, Fault-Tolerant Applications
- Key Components – Compute: VM Roles
- 'Grokking' the Service Model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the Cloud
- Service Management API
- The Secret Sauce – The Fabric
- Durable Storage, at Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is Not Relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys in Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection: Things to Consider
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back-Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a Platform for H2 Production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by Analyzing Logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure: Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery, Sensors, Models, and Field Data
- MODISAzure: Four-Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage: Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources: Cloud Research Community Site
- Resources: AzureScope
- Resources: AzureScope (2)
- Demonstration (2)
Pagesbull Similar to block blobsbull Optimized for random readwrite operations and
provide the ability to write to a range of bytes in a blob
bull Call Put Blob set max size Then call Put Pagebull All pages must align 512-byte page boundariesbull Writes to page blobs happen in-place and are
immediately committed to the blobbull The maximum size for a page blob is 1 TB A
page written to a page blob may be up to 1 TB in size
BLOB Leases
bull Creates a 1 minute exclusive write lock on a BLOB
bull Operations Acquire Renew Release Break
bull Must have the lease id to perform operations
bull Can check LeaseStatus property
bull Currently can only be done through REST
Windows Azure Drive
bull Provides a durable NTFS volume for Windows Azure applications to usebull Use existing NTFS APIs to access a durable
drivebull Durability and survival of data on application failover
bull Enables migrating existing NTFS applications tothe cloud
bullA Windows Azure Drive is a Page Blobbull Example mount Page Blob as X
bull httpltaccountnamegtblobcorewindowsnetltcontainernamegtltblobnamegt
bull All writes to drive are made durable to the Page Blobbull Drive made durable through standard Page Blob
replicationbull Drive persists even when not mounted as a Page
Blob
Windows Azure Drive API
bull Create Drive - Creates a Page Blob formatted as a single partition NTFS volume VHD
bull Initialize Cache ndash Allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
bull Mount Drive ndash Takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
bull Get Mounted Drives ndash Returns the list of mounted drives It consists of a list of the drive letter and Page Blob URLs for each mounted drive
bull Unmount Drive ndash Unmounts the drive and frees up the drive letter bull Snapshot Drive ndash Allows the client application to create a backup of the
drive (Page Blob) bull Copy Drive ndash Provides the ability to copy a drive or snapshot to another
drive (Page Blob) name to be used as a readwritable drive
BLOB Guidance
bull Manage connection stringskeys in cscfgbull Do not share keys wrap with a servicebull Strategy for accounts and containersbull You can assign a custom domain to your storage
accountbull There is no method to detect container
existence call FetchAttributes() and detect the error if it doesnrsquot exist
Table Structure
Account MovieData
Star WarsStar TrekFan Boys
Table Name Movies
Brian H PrinceJason ArgonautBill Gates
Table Name Customers
Account
Table
Entity
Tables store entities Entity schema can vary in the same table
Windows Azure Tables
bull Provides Structured Storagebull Massively Scalable Tablesbull Billions of entities (rows) and TBs of
databull Can use thousands of servers as traffic
grows
bull Highly Available amp Durablebull Data is replicated several times
bull Familiar and Easy to use APIbull WCF Data Services and ODatabull NET classes and LINQbull REST ndash with any platform or language
Is not relationalCan Not-bull Create foreign key relationships between tablesbull Perform server side joins between tablesbull Create custom indexes on the tablesbull No server side Count() for example
All entities must have the following propertiesbull Timestampbull PartitionKeybull RowKey
Windows Azure Queues
bull Queue are performance efficient highly available and provide reliable message deliverybull Simple asynchronous work
dispatch
bull Programming semantics ensure that a message can be processed at least once
bull Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance

Every data object has a partition key
• Different for each data type (blobs, entities, queues)

Partition key is the unit of scale
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality

System load balances
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server

Server Busy
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits have been reached
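The "exponential backoff on Server Busy" advice can be sketched as a small retry wrapper. `do_request` is a placeholder for any storage call; the retry count, base delay, and jitter are assumptions, not values from the slide.

```python
import random
import time

# Hedged sketch of exponential backoff on "Server Busy" (HTTP 503).
# `do_request` stands in for any storage operation and must return
# (status_code, body).
def with_backoff(do_request, max_retries=5, base_delay=0.5):
    for attempt in range(max_retries):
        status, body = do_request()
        if status != 503:                          # not "Server Busy"
            return status, body
        # Exponential backoff with jitter: ~0.5s, 1s, 2s, 4s, ...
        delay = base_delay * (2 ** attempt) * (0.5 + random.random())
        time.sleep(delay)
    return status, body                            # give up after max_retries
```

Giving the system a few minutes to load balance a hot partition is exactly what this waiting buys you.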
Partition Keys In Each Abstraction

Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order – 1             |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order – 3             |              |                     | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1
Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

(Diagram: partitions P1, P2, …, Pn each replicated across Server 1, Server 2, and Server 3.)
Scalability Targets

Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single Queue/Table Partition
• Up to 500 transactions per second

Single Blob Partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff.
PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008
Partitions and Partition Ranges

Initially one server holds the whole range:
Server A: Table = Movies [Min – Max]

After load balancing, the range is split across servers:
Server A: Table = Movies [Min – Comedy)
Server B: Table = Movies [Comedy – Max]
Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability

Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously

A query can return a continuation token in any of these cases:
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
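The loop that this rule implies can be sketched as follows. `query_page` is a stand-in for one REST round trip against the table service; the real client API is not shown on the slides.

```python
# Sketch of draining a table query that returns continuation tokens.
# `query_page(token)` stands in for one REST round trip; it returns
# (rows, next_token), where next_token is None once the result set ends.
# The service returns at most 1000 rows per response, and may hand back a
# token early at a partition boundary or after 5 seconds of execution --
# so the token must be checked even when fewer than 1000 rows came back.
def query_all(query_page):
    rows, token = query_page(None)
    results = list(rows)
    while token is not None:
        rows, token = query_page(token)
        results.extend(rows)
    return results
```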
Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help scale
• Distribute load by using a hash etc. as a prefix

Avoid "append only" patterns

Always handle continuation tokens
• Expect continuation tokens for range queries

"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries

Implement a back-off strategy for retries
• Server busy
• Load balance partitions to meet traffic needs
• Load on a single partition has exceeded the limits
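The "hash as a prefix" tip can be sketched like this: spread an append-only natural key (e.g. a timestamp) over N buckets so writes do not all land on one hot partition. The bucket count and key shape are illustrative assumptions.

```python
import hashlib

# Sketch of distributing load with a hash prefix on the PartitionKey.
# Append-only keys such as timestamps all hit the last partition; adding
# a stable hash bucket as a prefix spreads writes across `buckets`
# partitions. Readers must then fan queries out over all buckets.
def bucketed_partition_key(natural_key, buckets=16):
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return "{:02d}-{}".format(bucket, natural_key)
```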
WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?
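The work-ticket pattern mentioned above can be sketched as: store the large payload as a blob, and enqueue only a small reference ("ticket"). The in-memory `blob_store` and `queue` objects below are stand-ins for the real storage services.

```python
import uuid

# Sketch of the work-ticket pattern for payloads larger than the 8 KB
# message limit: the data goes to blob storage, the queue carries only a
# tiny reference. `blob_store` (dict) and `queue` (list) are stand-ins
# for the blob and queue services.
def enqueue_work(blob_store, queue, payload: bytes):
    ticket = str(uuid.uuid4())
    blob_store[ticket] = payload          # large data goes to blob storage
    queue.append(ticket)                  # the queue message stays tiny
    return ticket

def process_work(blob_store, queue, handler):
    ticket = queue.pop(0)                 # GetMessage
    handler(blob_store[ticket])           # do the work (keep it idempotent)
    del blob_store[ticket]                # garbage collect the orphaned blob
                                          # (DeleteMessage would follow here)
```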
Queue Terminology

Message Lifecycle

(Diagram: a Web Role calls PutMessage to add Msg 1–4 to the queue; a Worker Role calls GetMessage with a visibility timeout to dequeue a message, and RemoveMessage to delete it once processed.)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling

Consider a backoff polling approach: each empty poll increases the interval by 2x, up to some maximum; a successful poll sets the interval back to 1.
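The interval rule above fits in a few lines. The minimum of 1 and the cap of 60 (e.g. seconds) are assumptions for illustration; the slide only specifies "doubles on empty, resets on success".

```python
# Sketch of truncated exponential backoff polling: each empty poll
# doubles the interval up to a cap; a successful poll resets it to the
# minimum. Units and the 60-unit cap are illustrative assumptions.
def next_interval(current, got_message, minimum=1, maximum=60):
    if got_message:
        return minimum
    return min(current * 2, maximum)
```

A worker loop would sleep `next_interval(...)` between GetMessage calls, trading latency on an idle queue for fewer billable transactions.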
Removing Poison Messages

(Diagram: producers P1, P2 and consumers C1, C2 working against a queue, with 30-second visibility timeouts. The sequence below merges the three animation steps of the slides.)

1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1) – msg 1 is removed as a poison message
Queues Recap

• Make message processing idempotent – then there is no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use DequeueCount to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message
  • Batch messages
  • Garbage collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers
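The DequeueCount rule from the recap can be sketched as a guard in the worker loop. The threshold of 3 and the dead-letter list are illustrative assumptions; the slides only say "enforce a threshold on a message's dequeue count".

```python
# Sketch of poison-message handling: a message dequeued more than
# `threshold` times is assumed to be repeatedly crashing its consumer,
# so it is diverted to a dead-letter list (here a plain list) instead of
# being processed again.
def handle(message, dequeue_count, process, dead_letter, threshold=3):
    if dequeue_count > threshold:
        dead_letter.append(message)   # remove the poison message from circulation
        return "dead-lettered"
    process(message)                  # normal, idempotent processing
    return "processed"
```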
Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices

Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
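The slide's advice targets the .NET 4 Task Parallel Library; the same data-parallel idea can be sketched with Python's standard library: fan a function out over a collection using a pool sized to the instance's core count (the pool size of 8 is just the extra-large-instance core count used as an example).

```python
from concurrent.futures import ThreadPoolExecutor

# Data parallelism sketch: apply `fn` to every item using a worker pool.
# Sizing the pool to the core count mirrors the "use the whole VM" tip.
def parallel_map(fn, items, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))   # results keep input order

squares = parallel_map(lambda x: x * x, range(10))
```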
Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
  • Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs
Storage Costs

• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Saving bandwidth costs often leads to savings in other places
  • Sending fewer things over the wire often means getting fewer things from storage
  • Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Chart: uncompressed vs. compressed content sizes – gzip/minify JavaScript, minify CSS, minify images.)
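Tip 1 is a one-liner with Python's standard library; the sample HTML body below is only there to show that repetitive markup compresses well.

```python
import gzip

# Gzip-compressing output content, as in tip 1. Textual content (HTML,
# JSON, JavaScript) typically shrinks several-fold; browsers advertise
# support via "Accept-Encoding: gzip" and decompress on the fly.
def gzip_body(body: bytes) -> bytes:
    return gzip.compress(body)

html = b"<html>" + b"<p>hello azure</p>" * 500 + b"</html>"
compressed = gzip_body(html)
```

Every byte not sent is bandwidth not billed, which is the point of this section.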
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
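Input segmentation, the "pleasingly parallel" half of the story, can be sketched as splitting a FASTA file into fixed-size partitions of sequences, each of which can be BLASTed independently. The 100-sequences-per-partition figure echoes the micro-benchmarks later in this deck; the parsing below is deliberately minimal.

```python
# Sketch of query segmentation for BLAST: split FASTA text into
# partitions of `per_partition` sequences. A FASTA record starts with
# ">"; everything up to the next ">" belongs to one sequence.
def split_fasta(text, per_partition=100):
    records = [">" + r for r in text.split(">") if r.strip()]
    return [records[i:i + per_partition]
            for i in range(0, len(records), per_partition)]
```

Each partition would become one queue message (or work ticket) for a worker instance to process.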
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out into many BLAST tasks, followed by a merging task.

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes

Task granularity
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
• Best practice: do test runs to profile, and set the size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity

Task size/instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources
AzureBLAST Architecture

(Diagram: a Web Role hosts the Web Portal and Web Service for job registration. A Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work to Worker Roles via a global dispatch queue. Azure Tables hold the job registry; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc. A separate Database Updating Role refreshes the NCBI databases. Task flow: a splitting task fans out into BLAST tasks, followed by a merging task.)
AzureBLAST Job Portal

ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state

(Diagram: the Job Portal sits in front of the Web Service for job registration; the Job Scheduler and Scaling Engine consume the Job Registry.)
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…
All-Against-All Experiment
Discovering homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

(Diagram: node counts per deployment – 50, 62, 62, 62, 62, 62, 50, 62.)
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • Based on our estimates, real working instance time should be 6–8 days
  • Look into the log data to analyze what took place…
Understanding Azure by analyzing logs

A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
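The anomaly the slide points at – a task that started but never logged a completion line – is easy to detect mechanically. A sketch, assuming the log format shown in the examples above:

```python
import re

# Sketch of scanning worker logs for tasks that were started
# ("Executing the task N") but have no matching completion line
# ("Execution of task N is done"). Returns the set of suspect task IDs.
def unfinished_tasks(lines):
    started, done = set(), set()
    for line in lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            done.add(m.group(1))
    return started - done
```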
Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in groups – this is an update domain
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures

West Europe Data Center: 30,976 tasks were completed, and the job was killed
35 nodes experienced blob-writing failures at the same time
A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants
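The Penman-Monteith formula above transcribes directly into code; this is the per-pixel computation the derivation stage of the pipeline performs at scale. The default γ and λv values follow the slide's variable list; any sample inputs are illustrative only.

```python
# Direct transcription of the Penman-Monteith formula:
#   ET = (delta*Rn + rho_a*c_p*dq*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v)
# Units follow the slide's variable list (gamma ~ 66 Pa/K; lambda_v in J/g).
def penman_monteith(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    numerator = delta * Rn + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```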
ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: the AzureMODIS Service Web Role Portal takes scientists' requests into a Request Queue; the Data Collection Stage pulls source imagery from download sites via a Download Queue; the Reprojection Queue feeds the Reprojection Stage; the Reduction 1 and Reduction 2 Queues feed the Derivation and Analysis Reduction Stages; scientific results are then available for download.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus, then dispatches to the <PipelineStage> Task Queue.)

MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: GenericWorker (Worker Role) instances dequeue from the <PipelineStage> Task Queue and read/write <Input>Data Storage.)
Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker roles process the tasks against Reprojection Data Storage and Swath Source Data Storage.)

• Each job-table entity specifies a single reprojection job request
• Each task-table entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per stage (approximate mapping of the slide's figures to the pipeline stages):
• Data Collection Stage: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20–100 workers – $420 CPU, $60 download
• Derivation Reduction Stage: 5–7 GB, 55K files, 1800 hours, 20–100 workers – $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20–100 workers – $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
BLOB Leases

• Creates a 1-minute exclusive write lock on a BLOB
• Operations: Acquire, Renew, Release, Break
• Must have the lease ID to perform operations
• Can check the LeaseStatus property
• Currently can only be done through REST

Windows Azure Drive

• Provides a durable NTFS volume for Windows Azure applications to use
  • Use existing NTFS APIs to access a durable drive
  • Durability and survival of data on application failover
• Enables migrating existing NTFS applications to the cloud
• A Windows Azure Drive is a Page Blob
  • Example: mount a Page Blob as X:\
  • http://<accountname>.blob.core.windows.net/<containername>/<blobname>
• All writes to the drive are made durable to the Page Blob
  • The drive is made durable through standard Page Blob replication
  • The drive persists as a Page Blob even when not mounted

Windows Azure Drive API

• Create Drive – creates a Page Blob formatted as a single-partition NTFS volume VHD
• Initialize Cache – allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance
• Mount Drive – takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using
• Get Mounted Drives – returns the list of mounted drives; it consists of the drive letter and Page Blob URL for each mounted drive
• Unmount Drive – unmounts the drive and frees up the drive letter
• Snapshot Drive – allows the client application to create a backup of the drive (Page Blob)
• Copy Drive – provides the ability to copy a drive or snapshot to another drive (Page Blob) name, to be used as a read/writable drive
BLOB Guidance
bull Manage connection stringskeys in cscfgbull Do not share keys wrap with a servicebull Strategy for accounts and containersbull You can assign a custom domain to your storage
accountbull There is no method to detect container
existence call FetchAttributes() and detect the error if it doesnrsquot exist
Table Structure
Account MovieData
Star WarsStar TrekFan Boys
Table Name Movies
Brian H PrinceJason ArgonautBill Gates
Table Name Customers
Account
Table
Entity
Tables store entities Entity schema can vary in the same table
Windows Azure Tables
bull Provides Structured Storagebull Massively Scalable Tablesbull Billions of entities (rows) and TBs of
databull Can use thousands of servers as traffic
grows
bull Highly Available amp Durablebull Data is replicated several times
bull Familiar and Easy to use APIbull WCF Data Services and ODatabull NET classes and LINQbull REST ndash with any platform or language
Is not relationalCan Not-bull Create foreign key relationships between tablesbull Perform server side joins between tablesbull Create custom indexes on the tablesbull No server side Count() for example
All entities must have the following propertiesbull Timestampbull PartitionKeybull RowKey
Windows Azure Queues
bull Queue are performance efficient highly available and provide reliable message deliverybull Simple asynchronous work
dispatch
bull Programming semantics ensure that a message can be processed at least once
bull Access is provided via REST
Storage PartitioningUnderstanding partitioning is key to understanding
performance
bull Different for each data type (blobs entities queues)Every data object has a
partition key
bull A partition can be served by a single serverbull System load balances partitions based on traffic patternbull Controls entity locality
Partition key is unit of scale
bull Load balancing can take a few minutes to kick inbull Can take a couple of seconds for partition to be available on a
different serverSystem load balances
bull Use exponential backoff on ldquoServer Busyrdquobull Our system load balances to meet your traffic needsbull Single partition limits have been reached
Server Busy
Partition Keys In Each Abstraction
bull Entities w same PartitionKey value served from same partitionEntities ndash TableName +
PartitionKeyPartitionKey (CustomerId) RowKey
(RowKind)Name CreditCardNumber OrderTotal
1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx
1 Order ndash 1 $3512
2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx
2 Order ndash 3 $1000
bull Every blob and its snapshots are in a single partitionBlobs ndash Container name +
Blob name
bull All messages for a single queue belong to the same partitionMessages ndash Queue Name
Container Name Blob Name
image annarborbighousejpg
image foxboroughgillettejpg
video annarborbighousejpg
Queue Message
jobs Message1
jobs Message2
workflow Message1
Replication Guarantee
bull All Azure Storage data exists in three replicasbull Replicas are created as neededbull A write operation is not complete until it has
written to all three replicasbull Reads are only load balanced to replicas in
syncServer 1 Server 2 Server 3
P1
P2
Pn
P1
P2
Pn
P1
P2
Pn
Scalability Targets
Storage account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single queue/table partition
• Up to 500 transactions per second
Single blob partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
Partitions and Partition Ranges

Server A – Table = Movies [Min - Max]:

PartitionKey (Category) | RowKey (Title)            | Timestamp | ReleaseDate
Action                  | Fast & Furious            | …         | 2009
Action                  | The Bourne Ultimatum      | …         | 2007
…                       | …                         | …         | …
Animation               | Open Season 2             | …         | 2009
Animation               | The Ant Bully             | …         | 2006
…                       | …                         | …         | …
Comedy                  | Office Space              | …         | 1999
…                       | …                         | …         | …
SciFi                   | X-Men Origins: Wolverine  | …         | 2009
…                       | …                         | …         | …
War                     | Defiance                  | …         | 2008

After the partition splits, each server owns a key range:
Server A – Table = Movies [Min - Comedy): the Action … Animation rows
Server B – Table = Movies [Comedy - Max]: the Comedy … War rows
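The range split above can be modeled as a sorted key-range lookup. This is an illustrative sketch of how a range-partitioned table maps a PartitionKey to a serving node, not the actual Azure partition-map implementation; `RANGES` and `server_for` are hypothetical names.

```python
import bisect

# Each entry is (lowest key served, server name), sorted by key.
# After the Movies table splits at "Comedy":
#   [Min - Comedy) -> Server A,  [Comedy - Max] -> Server B
RANGES = [("", "Server A"), ("Comedy", "Server B")]

def server_for(partition_key):
    """Locate the server whose key range contains partition_key."""
    lows = [low for low, _ in RANGES]
    # bisect_right finds the first range whose low key exceeds the key;
    # the range just before it is the one that contains the key.
    return RANGES[bisect.bisect_right(lows, partition_key) - 1][1]
```

For example, `server_for("Action")` falls in the first range and `server_for("War")` in the second, which is why a scan across categories may cross servers and return continuation tokens at the range boundary.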
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query efficiency & speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information

Expect Continuation Tokens – Seriously
A query returns a continuation token at:
• A maximum of 1000 rows in a response
• The end of a partition range boundary
• A maximum of 5 seconds to execute the query
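A correct client therefore loops until no token comes back. The sketch below assumes a hypothetical `query_page(token)` helper that wraps one REST round trip and returns `(rows, next_token)`, mirroring the Table service's `x-ms-continuation-*` headers; the simulated service stands in for real pages.

```python
def query_all(query_page):
    """Drain a paged table query, following continuation tokens.

    `query_page(token)` is assumed to return (rows, next_token), with
    next_token None once the results are exhausted.
    """
    rows, token = query_page(None)
    results = list(rows)
    while token is not None:          # a token can arrive even with < 1000 rows
        page, token = query_page(token)
        results.extend(page)
    return results

# Simulated service that returns at most 2 rows per call.
DATA = ["r1", "r2", "r3", "r4", "r5"]
def fake_page(token):
    start = token or 0
    page = DATA[start:start + 2]
    nxt = start + 2 if start + 2 < len(DATA) else None
    return page, nxt

all_rows = query_all(fake_page)
```

Note the loop tests the token, not the page size: a short page with a token (e.g. at a partition range boundary or the 5-second limit) still has more data behind it.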
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select PartitionKey and RowKey that help scale
• Distribute by using a hash etc. as a prefix
Avoid "append only" patterns
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• Server busy
• Load balance partitions to meet traffic needs
• Load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness
• Loose coupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology

Message Lifecycle
(Diagram: a Web Role calls PutMessage to add Msg 1 … Msg 4 to the queue; Worker Roles call GetMessage with a visibility timeout to retrieve messages, and RemoveMessage once processing completes.)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
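The get/delete lifecycle shown in these requests can be simulated with a tiny in-memory model. This sketch only imitates the semantics (visibility timeout, pop receipt); it is not the queue service API, and `TinyQueue` is a made-up name.

```python
import itertools
import time

class TinyQueue:
    """In-memory sketch of the queue message lifecycle (not the real API)."""
    def __init__(self):
        self._msgs = []                    # [ [visible_at, pop_receipt, text] ]
        self._receipts = itertools.count()

    def put(self, text):
        self._msgs.append([0.0, None, text])

    def get(self, visibility_timeout=30.0, now=None):
        """Return (pop_receipt, text) of the first visible message, hiding it."""
        now = time.monotonic() if now is None else now
        for m in self._msgs:
            if m[0] <= now:
                m[0] = now + visibility_timeout   # invisible until the timeout
                m[1] = next(self._receipts)       # fresh pop receipt per dequeue
                return m[1], m[2]
        return None

    def delete(self, pop_receipt):
        self._msgs = [m for m in self._msgs if m[1] != pop_receipt]

q = TinyQueue()
q.put("work item 1")
receipt, text = q.get(visibility_timeout=30.0, now=0.0)
q.delete(receipt)        # processing succeeded: remove before the timeout expires
```

The key property this models: if the consumer crashes before calling delete, the message simply becomes visible again after the timeout, so no work ticket is ever lost.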
Truncated Exponential Back-Off Polling
Consider a back-off polling approach:
• Each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1
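The truncated back-off policy above reduces transaction charges on idle queues while keeping latency low when traffic returns. A minimal sketch (function and parameter names are illustrative):

```python
def next_poll_interval(interval, got_message, floor=1.0, ceiling=60.0):
    """Truncated exponential back-off for queue polling:
    empty poll -> double the interval (up to `ceiling`),
    successful poll -> reset to `floor`."""
    return floor if got_message else min(interval * 2, ceiling)

# 1 -> 2 -> 4 -> 8 on empty polls, reset to 1 on a hit, then 2 again.
intervals, interval = [], 1.0
for got in [False, False, False, True, False]:
    interval = next_poll_interval(interval, got)
    intervals.append(interval)
```

The "truncated" part is the ceiling: without it, a long-idle worker could back off so far that it takes minutes to notice new work.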
Removing Poison Messages
Producers P1, P2 enqueue messages; consumers C1, C2 dequeue with a 30-second visibility timeout:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1 (dequeue count now 2)
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1 (dequeue count now 3)
12. DequeueCount > 2
13. C1: Delete(Q, msg 1) – the poison message is removed instead of being retried forever
Queues Recap
Make message processing idempotent
• No need to deal with failures
Do not rely on order
• Invisible messages result in out-of-order delivery
Use dequeue count to remove poison messages
• Enforce a threshold on a message's dequeue count
Use a blob to store message data, with a reference in the message
• For messages > 8 KB
• Batch messages
• Garbage collect orphaned blobs
Use message count to scale
• Dynamically increase/reduce workers
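The dequeue-count threshold can be folded into a worker loop like the sketch below. The callback names (`get_message`, `move_to_poison`, etc.) are hypothetical; the real service exposes DequeueCount on each retrieved message.

```python
def process_queue(get_message, handle, delete, move_to_poison, max_dequeue=3):
    """Drain a queue, shunting repeatedly failing messages to a poison queue.

    `get_message()` is assumed to return (msg, dequeue_count) or None.
    """
    while True:
        item = get_message()
        if item is None:
            break
        msg, dequeue_count = item
        if dequeue_count > max_dequeue:
            move_to_poison(msg)       # give up: park it for offline inspection
            delete(msg)
            continue
        try:
            handle(msg)
            delete(msg)               # delete only after successful processing
        except Exception:
            pass                      # leave invisible; it reappears after the timeout

# Simulated queue: "bad" always fails, so its dequeue count keeps climbing.
pending = [("good", 1), ("bad", 1), ("bad", 2), ("bad", 3), ("bad", 4)]
poison, deleted = [], []

def get_message():
    return pending.pop(0) if pending else None

def handle(msg):
    if msg == "bad":
        raise RuntimeError("cannot process")

process_queue(get_message, handle, delete=deleted.append, move_to_poison=poison.append)
```

Without the threshold, the "bad" message would cycle between invisible and visible forever, wasting one visibility-timeout slot per retry.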
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using much CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
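The task-parallel pattern (the slides point to the .NET 4 Task Parallel Library) can be sketched in Python with a worker pool: one independent task per chunk of work, scheduled across the instance's cores. `checksum` is a made-up stand-in for a real unit of work.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def checksum(chunk):
    """Stand-in for one independent unit of work (a 'task')."""
    return sum(chunk) % 251

chunks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]

# Task parallelism: one task per chunk, scheduled over a worker pool
# sized to the VM's core count.
with ThreadPoolExecutor(max_workers=os.cpu_count() or 4) as pool:
    results = list(pool.map(checksum, chunks))
```

For CPU-bound Python work a process pool would be the closer analogue (threads share one interpreter lock); the structure of the pattern is the same either way.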
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity (performance) and the cost of having idling VMs
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
(Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
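The bandwidth win from gzipping markup is easy to demonstrate; repetitive HTML/JS/CSS compresses dramatically. The page content here is an arbitrary example, not from the deck:

```python
import gzip

# Repetitive markup (typical of HTML/JS/CSS) compresses extremely well.
page = b"<html><body>" + b"<p>Hello, cloud!</p>" * 200 + b"</body></html>"
compressed = gzip.compress(page)

ratio = len(compressed) / len(page)   # bytes on the wire vs. bytes generated
```

Real pages rarely compress this far, but 60-80% reductions on text content are common, which translates directly into the bandwidth-billing savings described above.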
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications

NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
• Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model
• Web Role + Queue + Worker
• With three special considerations
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow
A simple Split/Join pattern: a splitting task fans the input out to many BLAST tasks, and a merging task joins their results.
Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
• NCBI-BLAST overhead
• Data-transfer overhead
Best practice: use test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of instance failure
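The query-segmentation (split) step amounts to chunking the input sequences into fixed-size partitions, one per BLAST task. A minimal sketch (the 100-sequences-per-partition figure is the sweet spot reported in the micro-benchmarks below; names here are illustrative):

```python
def partition(sequences, size):
    """Split input sequences into fixed-size partitions (the last may be short)."""
    return [sequences[i:i + size] for i in range(0, len(sequences), size)]

seqs = ["seq%d" % i for i in range(250)]
tasks = partition(seqs, 100)   # one BLAST task per partition
```

This is where the granularity trade-off lives: a larger `size` means fewer tasks and less queue/transfer overhead but worse load balance; a smaller `size` means the opposite.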
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size / instance size vs. cost
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource
AzureBLAST
(Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine against a job registry kept in an Azure Table; worker roles pull splitting, BLAST, and merging tasks from a global dispatch queue; Azure Blobs hold the NCBI databases, BLAST databases, and temporary data; a database-updating role keeps the NCBI databases current.)
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID
The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory states
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total 9,865,668 sequences to be queried
• Theoretically 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 cores
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and Northern Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6~8 days
• Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
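Detecting the anomaly programmatically is a matter of diffing "Executing" records against "done" records per task ID. A sketch, assuming the log format shown on the slide:

```python
import re

LOG = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started = set(re.findall(r"Executing the task (\d+)", LOG))
finished = set(re.findall(r"Execution of task (\d+) is done", LOG))
lost = started - finished   # tasks that began but never reported completion
```

Applied to the full job logs, this kind of scan is what surfaced the update-domain and fault-domain events on the next two slides.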
Surviving System Upgrades
North Europe Data Center: in total 34,256 tasks processed
• All 62 compute nodes lost tasks and then came back in a group – this is an update domain
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures
West Europe Data Center: 30,976 tasks completed, and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
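The Penman-Monteith formula translates directly into a function. This is an illustrative transcription of the equation only; the sample input values below are arbitrary, not field data, and the λv default assumes ~2450 J/g for the latent heat of vaporization.

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

    Units follow the variable list on the slide; gamma defaults to the
    psychrometric constant (~66 Pa/K).
    """
    return (delta * r_n + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)

# Arbitrary example inputs, purely to exercise the formula.
et = penman_monteith(delta=145.0, r_n=400.0, rho_a=1.2, c_p=1005.0,
                     dq=800.0, g_a=0.02, g_s=0.01)
```

The computationally hard part of MODISAzure is not this arithmetic but producing consistent gridded inputs (Rn, ga, gs, δq) for every cell, which is what the pipeline stages below do.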
ET Synthesizes Imagery, Sensors, Models and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, virtual sensors
(Diagram: scientists submit requests through the AzureMODIS Service Web Role portal; work flows through the download, reprojection, reduction 1, and reduction 2 queues, from the source imagery download sites to scientific results download.)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables
(Diagram: a <PipelineStage> Request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
(Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus; GenericWorker (Worker Role) instances dispatch from the <PipelineStage> Task Queue and read <Input>Data Storage.)
Example Pipeline Stage: Reprojection Service
(Diagram: a Reprojection Request enters the Job Queue; the Service Monitor persists ReprojectionJobStatus, then parses and persists ReprojectionTaskStatus; GenericWorker instances dispatch from the Task Queue and work against Reprojection Data Storage and Swath Source Data Storage.)
• Each job entity specifies a single reprojection job request
• Each task entity specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Per-stage figures (from the pipeline diagram, via the AzureMODIS Service Web Role portal):
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage
Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
bull Programming semantics ensure that a message can be processed at least once
bull Access is provided via REST
Storage PartitioningUnderstanding partitioning is key to understanding
performance
bull Different for each data type (blobs entities queues)Every data object has a
partition key
bull A partition can be served by a single serverbull System load balances partitions based on traffic patternbull Controls entity locality
Partition key is unit of scale
bull Load balancing can take a few minutes to kick inbull Can take a couple of seconds for partition to be available on a
different serverSystem load balances
bull Use exponential backoff on ldquoServer Busyrdquobull Our system load balances to meet your traffic needsbull Single partition limits have been reached
Server Busy
Partition Keys In Each Abstraction
bull Entities w same PartitionKey value served from same partitionEntities ndash TableName +
PartitionKeyPartitionKey (CustomerId) RowKey
(RowKind)Name CreditCardNumber OrderTotal
1 Customer-John Smith John Smith xxxx-xxxx-xxxx-xxxx
1 Order ndash 1 $3512
2 Customer-Bill Johnson Bill Johnson xxxx-xxxx-xxxx-xxxx
2 Order ndash 3 $1000
bull Every blob and its snapshots are in a single partitionBlobs ndash Container name +
Blob name
bull All messages for a single queue belong to the same partitionMessages ndash Queue Name
Container Name Blob Name
image annarborbighousejpg
image foxboroughgillettejpg
video annarborbighousejpg
Queue Message
jobs Message1
jobs Message2
workflow Message1
Replication Guarantee
bull All Azure Storage data exists in three replicasbull Replicas are created as neededbull A write operation is not complete until it has
written to all three replicasbull Reads are only load balanced to replicas in
syncServer 1 Server 2 Server 3
P1
P2
Pn
P1
P2
Pn
P1
P2
Pn
Scalability TargetsStorage Account
bull Capacity ndash Up to 100 TBsbull Transactions ndash Up to a few thousand requests per secondbull Bandwidth ndash Up to a few hundred megabytes per second
Single QueueTable Partition
bull Up to 500 transactions per second
To go above these numbers partition between multiple storage accounts and partitions
When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff
Single Blob Partition
bull Throughput up to 60 MBs
PartitionKey(Category)
RowKey(Title)
Timestamp ReleaseDate
Action Fast amp Furious hellip 2009
Action The Bourne Ultimatum hellip 2007
hellip hellip hellip hellip
Animation Open Season 2 hellip 2009
Animation The Ant Bully hellip 2006
PartitionKey(Category)
RowKey(Title)
Timestamp ReleaseDate
Comedy Office Space hellip 1999
hellip hellip hellip hellip
SciFi X-Men Origins Wolverine hellip 2009
hellip hellip hellip hellip
War Defiance hellip 2008
PartitionKey(Category)
RowKey(Title)
Timestamp ReleaseDate
Action Fast amp Furious hellip 2009
Action The Bourne Ultimatum hellip 2007
hellip hellip hellip hellip
Animation Open Season 2 hellip 2009
Animation The Ant Bully hellip 2006
hellip hellip hellip hellip
Comedy Office Space hellip 1999
hellip hellip hellip hellip
SciFi X-Men Origins Wolverine hellip 2009
hellip hellip hellip hellip
War Defiance hellip 2008
Partitions and Partition Ranges
Server BTable = Movies[Comedy - Max]
Server ATable = Movies[Min - Comedy)
Server ATable = Movies
[Min - Max]
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
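Handling continuation tokens amounts to a drain loop. A minimal sketch, independent of any SDK: `query_page` is a hypothetical callable that takes a token (or None) and returns a page of rows plus the next token.

```python
def fetch_all(query_page):
    """Drain a segmented table query. The service may hand back a
    continuation token even with fewer than 1000 rows, e.g. at a
    partition range boundary or after the 5-second execution limit,
    so loop until the token is None."""
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:        # no continuation token: query complete
            return rows

# Hypothetical paged source with three pages of results.
def fake_query(token):
    pages = {None: ([1, 2], 'a'), 'a': ([3], 'b'), 'b': ([4, 5], None)}
    return pages[token]
```

Code that stops after the first page silently drops data, which is why the slide says "seriously".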
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select PartitionKey and RowKey that help scale
• Avoid "append only" patterns – distribute by using a hash etc. as a prefix
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• "Server busy" means the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
• Tight coupling leads to brittleness; loose coupling can aid scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML, and are limited to 8 KB in size
• Commonly used with the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
[Diagram: a Web Role calls PutMessage to place messages (Msg 1–4) on a queue; Worker Roles call GetMessage (with a visibility timeout) to retrieve a message and RemoveMessage to delete it once processed]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
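The get/delete lifecycle above can be modeled in a few lines. This toy queue (an assumption for illustration, not the real service) shows the key semantic: GetMessage hides a message for the visibility timeout rather than removing it, and only DeleteMessage (with the pop receipt) removes it for good.

```python
import time

class ToyQueue:
    """Minimal model of the queue message lifecycle."""
    def __init__(self):
        self.messages = {}   # id -> [text, visible_at]
        self._next = 0
    def put(self, text):
        self.messages[self._next] = [text, 0.0]
        self._next += 1
    def get(self, timeout=30.0, now=None):
        """Return (id, text) of the first visible message, hiding it
        for `timeout` seconds; the id acts as the pop receipt."""
        now = time.time() if now is None else now
        for mid, rec in sorted(self.messages.items()):
            if rec[1] <= now:
                rec[1] = now + timeout
                return mid, rec[0]
        return None
    def delete(self, mid):
        self.messages.pop(mid, None)
```

If the worker crashes before calling `delete`, the message simply reappears after the timeout, which is what makes the pattern fault tolerant.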
Truncated Exponential Back Off Polling
Consider a back-off polling approach: each empty poll increases the polling interval by 2x, and a successful poll sets the interval back to 1.
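A sketch of that polling loop, under the assumption that `queue_get` returns a message or None; for clarity it records the intervals it would sleep instead of calling `time.sleep`.

```python
def poll(queue_get, handle, idle_polls=8, base=1.0, cap=60.0):
    """Poll a queue, doubling the interval after each empty poll
    (truncated at `cap`) and resetting to `base` after a hit.
    Returns the intervals used on empty polls, for illustration;
    real code would time.sleep(interval) at each empty poll."""
    interval, used = base, []
    for _ in range(idle_polls):
        msg = queue_get()
        if msg is None:
            used.append(interval)
            interval = min(cap, interval * 2)   # empty poll: back off 2x
        else:
            handle(msg)
            interval = base                     # success: reset to base
    return used
```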
Removing Poison Messages
[Diagram: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue with a 30-second visibility timeout]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: DeleteMessage(Q, msg 1) – the poison message is removed
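The dequeue-count check in the sequence above can be sketched as follows. The queue and message classes are toy stand-ins (the fake queue re-delivers an undeleted message, as a visibility timeout would); the threshold of 2 mirrors the walkthrough.

```python
class Msg:
    def __init__(self, text):
        self.text, self.dequeue_count = text, 0

class FakeQueue:
    """Toy stand-in: get() re-delivers the same message (as after a
    visibility timeout) and bumps its dequeue count."""
    def __init__(self, msgs):
        self.msgs = list(msgs)
    def get(self):
        if not self.msgs:
            return None
        m = self.msgs[0]
        m.dequeue_count += 1
        return m
    def delete(self, m):
        self.msgs.remove(m)

def process_one(queue, dead_letter, handler, max_dequeue=2):
    """Pop one message; if it has been dequeued too many times it is
    poison, so divert it to a dead-letter store and delete it."""
    msg = queue.get()
    if msg is None:
        return
    if msg.dequeue_count > max_dequeue:
        dead_letter.append(msg.text)   # poison: take it out of the loop
        queue.delete(msg)
        return
    handler(msg.text)                  # may crash before the delete
    queue.delete(msg)
```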
Queues Recap
• Make message processing idempotent – no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use message count to scale – dynamically increase/reduce workers
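The blob-reference recommendation for oversized messages is the work ticket pattern in miniature. A sketch with hypothetical `queue_put`/`blob_put` callables standing in for the storage operations:

```python
import uuid

def enqueue_large(queue_put, blob_put, payload, limit=8 * 1024):
    """Work-ticket pattern: payloads over the queue's size limit go to
    blob storage, and only a small reference travels in the message."""
    if len(payload) <= limit:
        queue_put({'inline': payload})
    else:
        key = str(uuid.uuid4())          # blob name for the payload
        blob_put(key, payload)
        queue_put({'blob_ref': key})     # ticket pointing at the data
```

The consumer reverses the steps: fetch the blob named in the ticket, process it, then delete both message and blob (orphaned blobs from crashed consumers need periodic garbage collection).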
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each of which leaves its CPU mostly idle
• Balance using up the CPU against keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
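The slide names the .NET 4 Task Parallel Library; as a cross-language illustration of the same data-parallelism idea, here is a sketch using Python's `concurrent.futures` pool (an analogy, not the TPL itself):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(func, items, workers=4):
    """Data parallelism: apply `func` to every item using a pool of
    worker threads, one task per item, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))
```

The pattern is the same in the TPL (`Parallel.ForEach`, PLINQ): partition the data, fan the work across cores, and let the runtime schedule the threads.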
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile – e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
Uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
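The gzip trade-off is easy to see in a few lines using Python's standard `gzip` module (shown here as an illustration of the principle, not as the role's actual serving code):

```python
import gzip

def compress_response(body: bytes) -> bytes:
    """Gzip a response body before it leaves the role instance:
    a little CPU traded for smaller storage and bandwidth bills."""
    return gzip.compress(body)

# Repetitive markup, like most HTML, compresses very well.
page = b"<html>" + b"hello azure " * 500 + b"</html>"
small = compress_response(page)
```

In a web role the same effect usually comes for free by enabling IIS dynamic/static compression rather than compressing by hand.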
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially; GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST) – needs special result reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern: split the input sequences, query partitions in parallel, merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations: batch job management, and task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (ScienceCloud 2010), ACM, 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern
Leverage the multiple cores of one instance
• The "-a" argument of NCBI-BLAST: 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partitions cause load imbalance; small partitions cause unnecessary overheads (NCBI-BLAST startup, data transfer)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation; too large: an unnecessarily long wait in case of instance failure
[Task-flow diagram: a splitting task fans out into parallel BLAST tasks, whose outputs feed a merging task]
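The split/join task-flow can be sketched as follows. This is a schematic, not AzureBLAST itself: `run_blast` is a hypothetical stand-in for invoking NCBI-BLAST on one partition, and the partition size is the tunable granularity discussed above.

```python
from concurrent.futures import ThreadPoolExecutor

def split(sequences, partition_size=100):
    """Splitting task: cut the input query sequences into partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def run_blast(partition):
    # Hypothetical stand-in for the BLAST task; the real worker would
    # shell out to NCBI-BLAST on the partition and collect its hits.
    return ["hit:" + seq for seq in partition]

def blast_all(sequences, workers=4):
    """Split, query the partitions in parallel, then merge (join)."""
    parts = split(sequences)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(run_blast, parts))
    return [hit for part in partial for hit in part]   # merging task
```

In AzureBLAST the pool is replaced by worker role instances pulling partitions from a queue, but the split/process/merge shape is the same.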
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size / instance size vs. cost
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource
AzureBLAST
[Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler, scaling engine, and global dispatch queue; Worker Roles execute the splitting, BLAST, and merging tasks; a database updating role refreshes the NCBI databases; an Azure Table holds the job registry, and Azure Blobs hold the BLAST databases, temporary data, etc.]
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• An accepted job is stored in the job registry table – fault tolerance by avoiding in-memory state
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
• Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments; each segment is submitted to one deployment as one job for execution, and consists of smaller partitions
• When load imbalances, redistribute the load manually
[Diagram: per-deployment VM allocation of 50–62 extra-large instances]
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
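Spotting the anomaly above is a small log-mining exercise: a task that logged "Executing" but never "done" marks a worker that failed or was rebooted mid-task. A sketch, assuming the log format shown on the slide:

```python
import re

LOG = """\
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

def unfinished_tasks(log_text):
    """Return task ids that started but never logged completion."""
    started = set(re.findall(r"Executing the task (\d+)", log_text))
    done = set(re.findall(r"Execution of task (\d+) is done", log_text))
    return sorted(started - done)
```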
Surviving System Upgrades
North Europe datacenter: a total of 34,256 tasks processed
All 62 compute nodes lost tasks and then came back in groups – this is an update domain
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks completed, then the job was killed
35 nodes experienced blob writing failures at the same time
A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
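The Penman-Monteith formula above translates directly into code. The default values for γ and λv below are assumptions for illustration (γ from the slide; λv a typical latent heat of vaporization for water); the caller must supply inputs in consistent units.

```python
def penman_monteith(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2260.0):
    """Penman-Monteith ET:
    ET = (delta*Rn + rho_a*c_p*dq*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v)
    Defaults: gamma ~ 66 Pa/K (psychrometric constant), lambda_v ~ 2260 J/g
    (assumed value for water); units must match the data sources."""
    return (delta * Rn + rho_a * c_p * dq * g_a) / \
           ((delta + gamma * (1.0 + g_a / g_s)) * lambda_v)
```

The per-pixel arithmetic is trivial; the hard part, as the slide notes, is producing the conductivity and radiation inputs across a whole catchment.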
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Pipeline diagram: scientists submit requests through the AzureMODIS service web role portal; a request queue and download queue feed the data collection stage, which pulls from the source imagery download sites; the reprojection queue feeds the reprojection stage; reduction 1 and reduction 2 queues feed the derivation and analysis reduction stages, which read source metadata and produce science results for scientific results download]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The MODISAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables
[Diagram: a <PipelineStage> request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; Generic Workers (Worker Roles) dequeue the tasks and read <Input> Data Storage]
Example Pipeline Stage: Reprojection Service
[Diagram: a reprojection request is queued to the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; Generic Workers (Worker Roles) process tasks against Swath Source Data Storage and Reprojection Data Storage]
• Each entity in the job table specifies a single reprojection job request
• Each entity in the task table specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Stage                | Data                   | Compute                          | Cost
Data collection      | 400-500 GB, 60K files  | 10 MB/sec, 11 hours, <10 workers | $50 upload + $450 storage
Reprojection         | 400 GB, 45K files      | 3500 hours, 20-100 workers       | $420 CPU + $60 download
Derivation reduction | 5-7 GB, 55K files      | 1800 hours, 20-100 workers       | $216 CPU + $1 download + $6 storage
Analysis reduction   | <10 GB, ~1K files      | 1800 hours, 20-100 workers       | $216 CPU + $2 download + $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Replication Guarantee
bull All Azure Storage data exists in three replicasbull Replicas are created as neededbull A write operation is not complete until it has
written to all three replicasbull Reads are only load balanced to replicas in
syncServer 1 Server 2 Server 3
P1
P2
Pn
P1
P2
Pn
P1
P2
Pn
Scalability TargetsStorage Account
bull Capacity ndash Up to 100 TBsbull Transactions ndash Up to a few thousand requests per secondbull Bandwidth ndash Up to a few hundred megabytes per second
Single QueueTable Partition
bull Up to 500 transactions per second
To go above these numbers partition between multiple storage accounts and partitions
When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff
Single Blob Partition
bull Throughput up to 60 MBs
PartitionKey(Category)
RowKey(Title)
Timestamp ReleaseDate
Action Fast amp Furious hellip 2009
Action The Bourne Ultimatum hellip 2007
hellip hellip hellip hellip
Animation Open Season 2 hellip 2009
Animation The Ant Bully hellip 2006
PartitionKey(Category)
RowKey(Title)
Timestamp ReleaseDate
Comedy Office Space hellip 1999
hellip hellip hellip hellip
SciFi X-Men Origins Wolverine hellip 2009
hellip hellip hellip hellip
War Defiance hellip 2008
PartitionKey(Category)
RowKey(Title)
Timestamp ReleaseDate
Action Fast amp Furious hellip 2009
Action The Bourne Ultimatum hellip 2007
hellip hellip hellip hellip
Animation Open Season 2 hellip 2009
Animation The Ant Bully hellip 2006
hellip hellip hellip hellip
Comedy Office Space hellip 1999
hellip hellip hellip hellip
SciFi X-Men Origins Wolverine hellip 2009
hellip hellip hellip hellip
War Defiance hellip 2008
Partitions and Partition Ranges
Server BTable = Movies[Comedy - Max]
Server ATable = Movies[Min - Comedy)
Server ATable = Movies
[Min - Max]
Key Selection Things to Consider
bullDistribute load as much as possiblebullHot partitions can be load balancedbullPartitionKey is critical for scalability
See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information
bull Avoid frequent large scansbull Parallelize queriesbull Point queries are most efficient
bullTransactions across a single partitionbullTransaction semantics amp Reduce round trips
Scalability
Query Efficiency amp Speed
Entity group transactions
Expect Continuation Tokens ndash Seriously
Maximum of 1000 rows in a response
At the end of partition range boundary
Maximum of 1000 rows in a response
At the end of partition range boundary
Maximum of 5 seconds to execute the query
Tables Recapbull Efficient for frequently used queriesbull Supports batch transactionsbull Distributes load
Select PartitionKey and RowKey that help scale
Avoid ldquoAppend onlyrdquo patterns
Always Handlecontinuation tokens
ldquoORrdquo predicates are not optimized
Implement back-offstrategy for retries
bull Distribute by using a hash etc as prefix
bull Expect continuation tokens for range queries
bull Execute the queries that form the ldquoORrdquo predicates as separate queries
bull Server busybull Load balance partitions to meet traffic needsbull Load on single partition has exceeded the limits
WCF Data Services
bull Use a new context for each logical operationbull AddObjectAttachTo can throw exception if entity is already being tracked
bull Point query throws an exception if resource does not exist Use IgnoreResourceNotFoundException
QueuesTheir Unique Role in Building Reliable Scalable Applicationsbull Want roles that work closely together but are not
bound togetherbull Tight coupling leads to brittlenessbull This can aid in scaling and performance
bull A queue can hold an unlimited number of messagesbull Messages must be serializable as XMLbull Limited to 8KB in sizebull Commonly use the work ticket pattern
bull Why not simply use a table
Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout)RemoveMessage
Msg 2Msg 1
Worker Role
Msg 2
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
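The pop receipt returned by GET is what authorizes the later DELETE. A short standard-library sketch that pulls the MessageId and PopReceipt out of a response shaped like the one above (body simplified to the elements we need) and builds the matching DELETE URL:

```python
import xml.etree.ElementTree as ET

RESPONSE = """<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>"""

def delete_url(account, queue, xml_body):
    # Parse the GET response and build the DELETE request URL.
    msg = ET.fromstring(xml_body).find("QueueMessage")
    message_id = msg.findtext("MessageId")
    pop_receipt = msg.findtext("PopReceipt")
    return ("http://{0}.queue.core.windows.net/{1}/messages/{2}"
            "?popreceipt={3}").format(account, queue, message_id, pop_receipt)
```

A real client would URL-encode the pop receipt; it is omitted here because this receipt contains no reserved characters.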
Truncated Exponential Back Off Polling
Consider a back-off polling approach: each empty poll increases the interval by 2x; a successful poll sets the interval back to 1.
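The policy above is easy to state in code. A sketch in which the sleeps are recorded rather than slept and the queue is scripted, so the loop terminates and can be checked:

```python
def poll_loop(get_message, handle, polls, max_interval=60):
    """Truncated exponential back-off polling.

    Empty poll: double the sleep interval (truncated at max_interval).
    Successful poll: reset the interval back to 1 second.
    `polls` bounds the iterations; in real code this loop runs forever
    and actually sleeps for `interval` seconds on each empty poll.
    """
    interval, sleeps = 1, []
    for _ in range(polls):
        msg = get_message()
        if msg is None:
            sleeps.append(interval)                  # would be time.sleep(interval)
            interval = min(interval * 2, max_interval)
        else:
            handle(msg)
            interval = 1                             # reset on success
    return sleeps

# Queue that is empty for 4 polls, yields one message, then is empty again.
script = iter([None, None, None, None, "work", None])
handled = []
sleeps = poll_loop(lambda: next(script), handled.append, polls=6)
```

The recorded sleeps grow 1, 2, 4, 8 while the queue is empty, then reset to 1 after the message is handled.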
Removing Poison Messages
Producers (P1, P2) → queue → consumers (C1, C2)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (2)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (3)
1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
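Steps 12–13 are the poison-message guard: a message whose dequeue count exceeds a threshold is deleted (or diverted for inspection) instead of crashing yet another consumer. A sketch with a simulated queue that tracks the dequeue count in-process, the way real Azure queue messages carry a DequeueCount for exactly this purpose:

```python
class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0

def process(queue, handler, poison, max_dequeue=3):
    """Drain a list of Messages; divert any message dequeued too often.

    Every get increments dequeue_count. Once the count exceeds the
    threshold, the message is deleted into the poison list instead of
    being handed to the (possibly crashing) handler yet again.
    """
    while queue:
        msg = queue[0]
        msg.dequeue_count += 1
        if msg.dequeue_count > max_dequeue:
            poison.append(queue.pop(0))   # delete the poison message
            continue
        try:
            handler(msg)
            queue.pop(0)                  # DeleteMessage on success
        except Exception:
            pass                          # crash: message stays and reappears

poisoned, handled = [], []

def handler(msg):
    if msg.body == "bad":
        raise RuntimeError("consumer crash")
    handled.append(msg.body)

q = [Message("good-1"), Message("bad"), Message("good-2")]
process(q, handler, poisoned)
```

The "bad" message crashes the handler three times, trips the threshold on its fourth dequeue, and is removed without blocking the messages behind it.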
Queues Recap
• Make message processing idempotent – no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB – batch messages, or use a blob to store the message data with a reference in the message; garbage-collect orphaned blobs
• Use message count to scale – dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using much CPU
• Balance between using up CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
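The Task Parallel Library is .NET-specific, but the data-parallel shape it encourages (the same operation mapped over every item, scheduled across a pool sized to the core count you measured) looks the same in any language. A sketch using Python's standard thread pool; `score` is a made-up per-item task:

```python
from concurrent.futures import ThreadPoolExecutor

def score(seq):
    """Stand-in for a per-item unit of work (e.g. one alignment)."""
    return sum(ord(c) for c in seq)

def parallel_map(items, workers=4):
    # Data parallelism: identical work on independent items, so the
    # pool can schedule them freely; map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, items))

results = parallel_map(["ACGT", "TTTT"])
```

For CPU-bound Python work a process pool (`ProcessPoolExecutor`) is the closer analogue, since threads share one interpreter lock; the calling pattern is identical.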
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory intensive, CPU intensive, network I/O intensive, storage I/O intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity, and the cost of having idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing – they help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
(Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
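The bandwidth win from point 1 is easy to demonstrate with the standard library. Exact sizes depend on the zlib version, so the sketch only checks that compression wins heavily on repetitive markup and that the round trip is lossless:

```python
import gzip

# Repetitive HTML, the typical best case for gzip on web output.
html = (b"<html><body>"
        + b"<p>The quick brown fox jumps over the lazy dog.</p>" * 200
        + b"</body></html>")

compressed = gzip.compress(html)   # what you'd serve with Content-Encoding: gzip
ratio = len(compressed) / len(html)

assert gzip.decompress(compressed) == html   # browsers decompress on the fly
```

On markup like this the compressed body is a small fraction of the original, which is bandwidth you do not pay for and the user does not wait for.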
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST) – needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
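Query segmentation, the first of the two strategies above, is a plain split/scatter/gather over the input sequences. A toy sketch of the shape; the `blast` function is a stand-in for invoking NCBI-BLAST on one partition, not a real alignment:

```python
def split(sequences, partition_size):
    """Split the input query sequences into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast(partition):
    """Stand-in for running NCBI-BLAST on one partition of queries."""
    return [(seq, len(seq)) for seq in partition]   # fake per-query 'hits'

def run(sequences, partition_size=100):
    # Scatter: each partition could go to a separate worker role...
    partial = [blast(p) for p in split(sequences, partition_size)]
    # ...gather: merging is trivial because partitions are independent.
    return [hit for part in partial for hit in part]

hits = run(["ACGT", "AC", "ACGTACGT"], partition_size=2)
```

Because each query is scored independently against the database, the merge step is a simple concatenation, which is exactly why this pattern is called "pleasingly parallel".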
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern: splitting task → BLAST tasks (in parallel) → merging task
Leverage multi-core on one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of instance failure
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
• Web Role: web portal and web service for job registration
• Job Management Role: job scheduler and scaling engine
• Worker roles, fed from a global dispatch queue: splitting task → BLAST tasks → merging task
• Database updating role
• Azure Table: job registry
• Azure Blob: NCBI databases (BLAST databases, temporary data, etc.)
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored into the job registry table
  • Fault tolerance: avoid in-memory states
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise, something is wrong (e.g. the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
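Pairing "Executing" lines with their "done" lines mechanically surfaces the lost tasks. A sketch over simplified records (timestamps shortened; the message format mirrors the records above):

```python
import re

LOG = """\
6:14 RD00155D3611B0 Executing the task 251523
6:25 RD00155D3611B0 Execution of task 251523 is done
8:22 RD00155D3611B0 Executing the task 251774
9:50 RD00155D3611B0 Executing the task 251895
11:12 RD00155D3611B0 Execution of task 251895 is done"""

def unfinished_tasks(log):
    """Return ids of tasks that were started but never logged as done."""
    started, done = set(), set()
    for line in log.splitlines():
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            done.add(m.group(1))
    return sorted(started - done)

lost = unfinished_tasks(LOG)
```

In the sample, task 251774 is started but never finishes, which is exactly the signature of the instance failures and upgrades discussed on the next slides.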
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in a group – this is an update domain
• ~30 mins per group, ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish proverb
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):
ET = (Δ Rn + ρa cp (δq) ga) / ((Δ + γ (1 + ga/gs)) λv)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors
(Pipeline: source imagery download sites → download queue → data collection stage → reprojection queue → reprojection stage → reduction 1 queue → derivation reduction stage → reduction 2 queue → analysis reduction stage → science results; scientists submit requests and download scientific results through the AzureMODIS Service Web Role Portal and its request queue; source metadata is kept alongside)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables
(Flow: <PipelineStage> Request → MODISAzure Service (Web Role) → persist <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → parse & persist <PipelineStage>TaskStatus → dispatch to <PipelineStage> Task Queue)
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
• GenericWorker (Worker Role) dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
(Flow: Service Monitor parses & persists <PipelineStage>TaskStatus → dispatch to <PipelineStage> Task Queue → GenericWorker reads <Input> Data Storage)
Example Pipeline Stage: Reprojection Service
(Flow: Reprojection Request → Service Monitor (Worker Role) → persist ReprojectionJobStatus → Job Queue → parse & persist ReprojectionTaskStatus → dispatch → Task Queue → GenericWorker (Worker Role) → Reprojection Data Storage)
• Each entity in the job table specifies a single reprojection job request
• Each entity in the task table specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
• Swath source data storage holds the input swaths
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per stage (via the AzureMODIS Service Web Role Portal):
Data collection stage | 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers | $50 upload, $450 storage
Reprojection stage | 400 GB, 45K files, 3500 hours, 20-100 workers | $420 CPU, $60 download
Derivation reduction stage | 5-7 GB, 55K files, 1800 hours, 20-100 workers | $216 CPU, $1 download, $6 storage
Analysis reduction stage | <10 GB, ~1K files, 1800 hours, 20-100 workers | $216 CPU, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Provide valuable fault tolerance and scalability abstractions
• Clouds as amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope (2)
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
BLOB Guidance
• Manage connection strings/keys in .cscfg
• Do not share keys; wrap with a service
• Strategy for accounts and containers
• You can assign a custom domain to your storage account
• There is no method to detect container existence; call FetchAttributes() and detect the error if it doesn't exist
Table Structure
Account → Table → Entity
Account: MovieData
• Table "Movies": Star Wars, Star Trek, Fan Boys
• Table "Customers": Brian H. Prince, Jason Argonaut, Bill Gates
Tables store entities. Entity schema can vary in the same table.
Windows Azure Tables
• Provides structured storage
• Massively scalable tables
  • Billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST – with any platform or language

Is not relational. Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
Partition key is the unit of scale
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
System load balances
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
"Server Busy"
• Use exponential backoff on "Server Busy"
• The system load balances to meet your traffic needs
• Single-partition limits have been reached
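One way to keep a partition from going hot (the "distribute by using a hash as prefix" tip from the recap slides) is to derive the PartitionKey from a stable hash of a natural key, so sequential inserts spread across partition ranges. A sketch; the two-digit bucket prefix is an illustrative convention, not an Azure requirement:

```python
import hashlib

def partition_key(natural_key, buckets=16):
    """Prefix a natural key with a stable hash bucket.

    Sequential keys ('order-0001', 'order-0002', ...) would otherwise
    land in one hot partition; the prefix spreads them over `buckets`
    ranges. md5 (rather than Python's hash()) keeps the bucket stable
    across processes and restarts.
    """
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    return "{0:02d}-{1}".format(bucket, natural_key)

keys = [partition_key("order-%04d" % i) for i in range(1000)]
prefixes = {k.split("-")[0] for k in keys}
```

The trade-off is that range queries over the natural key now require one query per bucket (and continuation-token handling for each), which is the usual price of eliminating an append-only hot spot.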
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message 1
jobs | Message 2
workflow | Message 1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
(Server 1, Server 2, Server 3 each hold the partitions P1, P2, … Pn)
Scalability Targets
Storage account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single queue/table partition
• Up to 500 transactions per second
Single blob partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When the limit is hit, the app will see a '503 Server Busy'; applications should implement exponential backoff
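The recommended reaction to '503 Server Busy' can be wrapped once and reused for every storage call. A sketch in which the retry delays are collected rather than slept, so the behavior is checkable; the doubling schedule is illustrative:

```python
def with_backoff(operation, max_attempts=5, base_delay=1.0):
    """Retry `operation` on 'server busy', doubling the delay each time.

    `operation` raises RuntimeError while the partition is over its
    limits. Returns (result, delays_used); in real code each delay
    would be passed to time.sleep before the next attempt.
    """
    delays = []
    for attempt in range(max_attempts):
        try:
            return operation(), delays
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise                     # out of attempts: surface the error
            delays.append(base_delay * (2 ** attempt))   # 1, 2, 4, 8, ...

# Simulated service that is busy twice, then succeeds.
calls = {"n": 0}
def flaky_put():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RuntimeError("503 Server Busy")
    return "ok"

result, delays = with_backoff(flaky_put)
```

Adding a random jitter to each delay is a common refinement, since it keeps many throttled clients from retrying in lockstep.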
Partitions and Partition Ranges
Server A – Table = Movies [Min – Max]:

PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

After load balancing splits the partition range:

Server A – Table = Movies [Min – Comedy):
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B – Table = Movies [Comedy – Max]:
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Key Selection: Things to Consider
• Distribute load as much as possible – hot partitions can be load balanced; PartitionKey is critical for scalability
• Avoid frequent large scans; parallelize queries; point queries are most efficient
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
bullTransactions across a single partitionbullTransaction semantics amp Reduce round trips
Scalability
Query Efficiency amp Speed
Entity group transactions
Expect Continuation Tokens ndash Seriously
Maximum of 1000 rows in a response
At the end of partition range boundary
Maximum of 1000 rows in a response
At the end of partition range boundary
Maximum of 5 seconds to execute the query
Tables Recapbull Efficient for frequently used queriesbull Supports batch transactionsbull Distributes load
Select PartitionKey and RowKey that help scale
Avoid ldquoAppend onlyrdquo patterns
Always Handlecontinuation tokens
ldquoORrdquo predicates are not optimized
Implement back-offstrategy for retries
bull Distribute by using a hash etc as prefix
bull Expect continuation tokens for range queries
bull Execute the queries that form the ldquoORrdquo predicates as separate queries
bull Server busybull Load balance partitions to meet traffic needsbull Load on single partition has exceeded the limits
WCF Data Services
bull Use a new context for each logical operationbull AddObjectAttachTo can throw exception if entity is already being tracked
bull Point query throws an exception if resource does not exist Use IgnoreResourceNotFoundException
QueuesTheir Unique Role in Building Reliable Scalable Applicationsbull Want roles that work closely together but are not
bound togetherbull Tight coupling leads to brittlenessbull This can aid in scaling and performance
bull A queue can hold an unlimited number of messagesbull Messages must be serializable as XMLbull Limited to 8KB in sizebull Commonly use the work ticket pattern
bull Why not simply use a table
Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout)RemoveMessage
Msg 2Msg 1
Worker Role
Msg 2
POST httpmyaccountqueuecorewindowsnetmyqueuemessages
HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20
ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt
DELETEhttpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a backoff polling approach Each empty poll
increases interval by 2x
A successful sets the interval back to 1
60
21
11
C1
C2
Removing Poison Messages
11
21
340
Producers Consumers
P2
P1
30
2 GetMessage(Q 30 s) msg 2
1 GetMessage(Q 30 s) msg 1
11
21
10
20
61
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
11
21
2 GetMessage(Q 30 s) msg 23 C2 consumed msg 24 DeleteMessage(Q msg 2)7 GetMessage(Q 30 s) msg 1
1 GetMessage(Q 30 s) msg 15 C1 crashed
11
21
6 msg1 visible 30 s after Dequeue30
12
11
12
62
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
12
2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed
1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1
2
6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue
30
13
12
13
Queues Recap
bullNo need to deal with failuresMake messageprocessing idempotent
bull Invisible messages result in out of orderDo not rely on order
bullEnforce threshold on messagersquos dequeue countUse Dequeue count to remove poison messages
bullMessages gt 8KBbullBatch messagesbullGarbage collect orphaned blobs
bullDynamically increasereduce workers
Use blob to storemessage data with
reference in message
Use message countto scale
bullNo need to deal with failures
bull Invisible messages result in out of order
bullEnforce threshold on messagersquos dequeue count
bullDynamically increasereduce workers
Windows Azure Storage TakeawaysData abstractions to build your applications
Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at
httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet
Best Practices
Picking the Right VM Size
bull Having the correct VM size can make a big difference in costs
bull Fundamental choice ndash larger fewer VMs vs many smaller instances
bull If you scale better than linear across cores larger VMs could save you money
bull Pretty rare to see linear scaling across 8 cores
bull More instances may provide better uptime and reliability (more failures needed to take your service down)
bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
- 1 role instance == 1 VM running Windows
- 1 role instance != one specific task for your code
- You're paying for the entire VM, so why not use it?
- Common mistake: splitting code into multiple roles, each not using much CPU
- Balance using up CPU vs. having free capacity in times of need
- There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
- Spin up additional processes, each with a specific task or as a unit of concurrency
  - May not be ideal if the number of active processes exceeds the number of cores
- Use multithreading aggressively
  - In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  - In .NET 4, use the Task Parallel Library
    - Data parallelism
    - Task parallelism
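The same data- vs. task-parallelism split the TPL offers can be illustrated outside .NET; here is a minimal Python sketch using a thread pool (the function names and data are made up for the example).

```python
from concurrent.futures import ThreadPoolExecutor

def normalize(record):
    """One element-wise operation, applied uniformly (data parallelism)."""
    return record.strip().lower()

def count_words(text):
    return len(text.split())

records = ["  Alpha", "BETA  ", " Gamma "]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism: the same operation over every element of a dataset.
    cleaned = list(pool.map(normalize, records))

    # Task parallelism: different, independent operations run concurrently.
    f1 = pool.submit(count_words, "the quick brown fox")
    f2 = pool.submit(sorted, cleaned)
    word_count, ordered = f1.result(), f2.result()

print(cleaned)       # ['alpha', 'beta', 'gamma']
print(word_count)    # 4
```

The distinction matters for structuring worker-role code: data parallelism partitions one large input across cores, while task parallelism keeps cores busy with unrelated work items.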
Finding Good Code Neighbors
- Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
- Find code that is intensive with different resources to live together
- Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
- Monitor your application and make sure you're scaled appropriately (not over-scaled)
- Spinning VMs up and down automatically is good at large scale
- Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
- Being too aggressive in spinning down VMs can result in a poor user experience
- Trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)
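One way to act on the "use message count to scale" advice while respecting the warning about aggressive spin-down is a decision rule with asymmetric behavior: scale up to the backlog immediately, scale down one instance at a time. All thresholds below are assumptions for illustration.

```python
def target_instance_count(queue_length, current, msgs_per_instance=100,
                          min_instances=2, max_instances=20):
    """Pick a worker count from the queue backlog, scaling down gently.

    Thresholds are illustrative. VMs take minutes to boot, so scale up
    eagerly; shed at most one instance per decision to protect the
    user experience when load returns."""
    desired = max(min_instances, -(-queue_length // msgs_per_instance))  # ceil div
    desired = min(desired, max_instances)
    if desired < current:
        return current - 1   # gentle spin-down
    return desired

# A backlog of 750 messages with 3 workers: jump straight to 8 workers.
up = target_instance_count(750, current=3)
# An empty queue with 8 workers: drop only one instance per decision pass.
down = target_instance_count(0, current=8)
```

Running this rule periodically (e.g., once a minute) converges down slowly and up quickly, trading a little idle cost for headroom.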
Storage Costs
- Understand an application's storage profile and how storage billing works
- Make service choices based on your app profile
  - E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  - Service choice can make a big cost difference based on your app profile
- Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile.
Saving bandwidth costs often leads to savings in other places:
- Sending fewer things over the wire often means getting fewer things from storage
- Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web app's performance and user experience.
Compressing Content
1. Gzip all output content
- All modern browsers can decompress on the fly
- Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
- Use Portable Network Graphics (PNGs)
- Crush your PNGs
- Strip needless metadata
- Make all PNGs palette PNGs
(Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
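The payoff of point 1 is easy to demonstrate; the sketch below gzips a repetitive HTML payload, with Python's gzip module standing in for whatever compression layer the web server uses.

```python
import gzip

# Repetitive markup, typical of generated HTML, compresses extremely well.
html = b"<html><body>" + b"<p>Hello, cloud!</p>" * 200 + b"</body></html>"
compressed = gzip.compress(html)

ratio = len(compressed) / len(html)
print(f"{len(html)} -> {len(compressed)} bytes (ratio {ratio:.2%})")
small_enough = len(compressed) < len(html) // 10   # >90% saved on this payload
```

Every byte removed here is saved twice: once on the storage read and once on the wire to the browser.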
Best Practices Summary
- Doing 'less' is the key to saving costs
- Measure everything
- Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
- The most important software in bioinformatics
- Identifies similarity between bio-sequences
Computationally intensive:
- Large number of pairwise alignment operations
- A BLAST run can take 700-1000 CPU hours
- Sequence databases are growing exponentially; GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
- Segment the input; segment processing (querying) is pleasingly parallel
- Segment the database (e.g., mpiBLAST); needs special result-reduction processing
Large volume of data:
- A normal BLAST database can be as large as 10 GB
- 100 nodes means the peak storage bandwidth could reach 1 TB
- The output of BLAST is usually 10-100x larger than the input
AzureBLAST
- Parallel BLAST engine on Azure
- Query-segmentation, data-parallel pattern:
  - Split the input sequences
  - Query partitions in parallel
  - Merge results together when done
- Follows the general suggested application model: Web Role + Queue + Worker
- With special considerations:
  - Batch job management
  - Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans the input out to parallel BLAST tasks, and a merging task joins their results.
Leverage the multiple cores of one instance:
- Argument "-a" of NCBI-BLAST
- Set to 1, 2, 4, or 8 for the small, medium, large, and extra-large instance sizes
Task granularity:
- Too large a partition: load imbalance
- Too small a partition: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
- Best practice: do test runs to profile, and set the size to mitigate the overhead
Value of the visibilityTimeout for each BLAST task:
- Essentially an estimate of the task run time
- Too small: repeated computation
- Too large: an unnecessarily long wait in case of instance failure
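The split/join task flow reduces to a few lines when sketched locally. The blast_task below is a stub for one worker's NCBI-BLAST run; in AzureBLAST the partitions travel through the dispatch queue to separate worker instances rather than a local thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

PARTITION_SIZE = 100   # sequences per partition, the micro-benchmark sweet spot

def split(sequences, size=PARTITION_SIZE):
    """Splitting task: fan the input sequences out into fixed-size partitions."""
    return [sequences[i:i + size] for i in range(0, len(sequences), size)]

def blast_task(partition):
    """Stub for one BLAST task: align every sequence in its partition."""
    return [f"{seq}:hit" for seq in partition]

def merge(partials):
    """Merging task: join the per-partition outputs, preserving order."""
    return [hit for partial in partials for hit in partial]

sequences = [f"seq{i}" for i in range(250)]
partitions = split(sequences)                  # 3 partitions: 100 + 100 + 50
with ThreadPoolExecutor() as pool:             # stand-in for worker instances
    partials = list(pool.map(blast_task, partitions))
hits = merge(partials)
```

The partition size is the tuning knob the slide describes: it trades load balance against per-task overhead.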
Micro-Benchmarks Inform Design
Task size vs. performance:
- Benefit of the warm-cache effect
- 100 sequences per partition is the best choice
Instance size vs. performance:
- Super-linear speedup with larger worker instances
- Primarily due to the memory capacity
Task size / instance size vs. cost:
- The extra-large instance generated the best and the most economical throughput
- Fully utilizes the resources
AzureBLAST
(Diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler and scaling engine, persisting the job registry in an Azure Table; tasks flow through a global dispatch queue to the worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a database-updating role keeps the databases current. Each job follows the split/join task flow: splitting task, parallel BLAST tasks, merging task.)
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance:
- Submit jobs
- Track a job's status and logs
Authentication/authorization based on Live ID.
The accepted job is stored into the job registry table:
- Fault tolerance: avoid in-memory states
(Diagram: the job portal fronts the web service for job registration, the job scheduler, the scaling engine, and the job registry.)
Demonstration
R. palustris as a platform for H2 production
Eric Schadt (SAGE); Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences):
- Against all NCBI non-redundant proteins: completed in 30 min
- Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
- Discover the interrelationships of known protein sequences
"All against all" query:
- The database is also the input query
- The protein database is large (4.2 GB)
- In total, 9,865,668 sequences to be queried
- Theoretically, 100 billion sequence comparisons
Performance estimation:
- Based on sampling runs on one extra-large Azure instance
- Would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists.
Our Approach
- Allocated a total of ~4,000 instances
  - 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe, and North Europe
- 8 deployments of AzureBLAST
  - Each deployment has its own co-located storage service
- Divided the 10 million sequences into multiple segments
  - Each segment is submitted to one deployment as one job for execution
  - Each segment consists of smaller partitions
- When load imbalances appear, redistribute the load manually
(Deployment sizes: 50 or 62 instances each.)
End Result
- Total size of the output result is ~230 GB
- The number of total hits is 1,764,579,487
- Started on March 25th; the last task completed on April 8th (10 days of compute)
- But based on our estimates, real working instance time should be 6-8 days
- Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should be:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
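Spotting the "something is wrong" case can be automated: collect the task ids that logged an "Executing" line but never a matching "is done" line. The snippet below uses a small inline sample in the slide's log format; the exact parsing is an assumption about that format.

```python
import re

LOG = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

def incomplete_tasks(log_text):
    """Return task ids that logged 'Executing' but never 'is done'."""
    started, finished = set(), set()
    for line in log_text.splitlines():
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)

print(incomplete_tasks(LOG))   # ['251774'] -- likely lost to an instance failure
```

Running this over the full logs is how failures like the update-domain and fault-domain events below become visible.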
Surviving System Upgrades
North Europe datacenter: in total, 34,256 tasks processed.
- All 62 compute nodes lost tasks and then came back in a group: this is an update domain
- ~30 mins; ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed.
- 35 nodes experienced blob-writing failures at the same time
- A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration (evaporation through plant membranes) by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

- ET = water volume evapotranspired (m3 s-1 m-2)
- Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
- λv = latent heat of vaporization (J/g)
- Rn = net radiation (W m-2)
- cp = specific heat capacity of air (J kg-1 K-1)
- ρa = dry air density (kg m-3)
- δq = vapor pressure deficit (Pa)
- ga = conductivity of air (inverse of ra) (m s-1)
- gs = conductivity of plant stoma air (inverse of rs) (m s-1)
- γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
- Lots of inputs; big data reduction
- Some of the inputs are not so simple
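The Penman-Monteith formula transcribes directly into code. The input magnitudes below are illustrative assumptions, not field data; the check only exercises the qualitative behavior that more net radiation drives more ET.

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2260.0):
    """ET = (delta*Rn + rho_a*c_p*dq*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v)

    Units follow the slide's definitions; gamma ~ 66 Pa/K, and the latent
    heat of vaporization of water lambda_v ~ 2260 J/g."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative inputs (assumed magnitudes, not measured values).
base = dict(delta=145.0, rho_a=1.2, c_p=1005.0, dq=800.0, g_a=0.02, g_s=0.01)
et_low = penman_monteith(r_n=200.0, **base)
et_high = penman_monteith(r_n=400.0, **base)
```

In MODISAzure, the hard part is not this arithmetic but estimating ga and gs (and the other inputs) per pixel across a catchment, which is what the imagery pipeline below exists to do.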
ET Synthesizes Imagery, Sensors, Models, and Field Data
- NASA MODIS imagery source archives: 5 TB (600K files)
- FLUXNET curated sensor dataset: 30 GB (960 files)
- FLUXNET curated field dataset: 2 KB (1 file)
- NCEP/NCAR: ~100 MB (4K files)
- Vegetative clumping: ~5 MB (1 file)
- Climate classification: ~1 MB (1 file)
- 20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
- Downloads requested input tiles from NASA FTP sites
- Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
- Converts source tile(s) to intermediate-result sinusoidal tiles
- Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
- First stage visible to scientists
- Computes ET in our initial use
Analysis reduction stage:
- Optional second stage visible to scientists
- Enables production of science analysis artifacts such as maps, tables, and virtual sensors
(Diagram: scientists submit requests through the AzureMODIS Service web role portal; a request queue feeds the download queue for the data collection stage, which pulls imagery from source download sites and records source metadata; the reprojection queue, reduction 1 queue, and reduction 2 queue drive the reprojection, derivation reduction, and analysis reduction stages; science results are available for scientific results download.)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
- The ModisAzure Service is the Web Role front door
  - Receives all user requests
  - Queues each request to the appropriate Download, Reprojection, or Reduction job queue
- The Service Monitor is a dedicated Worker Role
  - Parses all job requests into tasks – recoverable units of work
  - Execution status of all jobs and tasks is persisted in Tables
(Diagram: a <PipelineStage> request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.)
MODISAzure Architectural Big Picture (2/2)
- All work is actually done by a GenericWorker (Worker Role)
  - Dequeues tasks created by the Service Monitor
  - Retries failed tasks 3 times
  - Maintains all task status
(Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus; GenericWorker instances pull from the <PipelineStage> task queue and read <Input> data storage.)
Example Pipeline Stage: Reprojection Service
(Diagram: a reprojection request enters the job queue; the Service Monitor (worker role) persists ReprojectionJobStatus, then parses and persists ReprojectionTaskStatus, dispatching work to the task queue consumed by GenericWorker instances, which read swath source data storage and write reprojection data.)
- Each job entity specifies a single reprojection job request
- Each task entity specifies a single reprojection task (i.e., a single tile)
- Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
- Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
- Computational costs driven by data scale and the need to run reduction multiple times
- Storage costs driven by data scale and the 6-month project duration
- Small with respect to the people costs, even at graduate-student rates

Approximate per-stage figures (from the pipeline diagram):
- Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
- Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
- Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
- Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage
Total: $1,420
Observations and Experience
- Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
- Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
- Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
- Clouds provide valuable fault tolerance and scalability abstractions
- Clouds act as an amplifier for familiar client tools and on-premise compute
- Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
- Getting-started steps for developers
- Available research services
- Use cases on Azure for research
- Event announcements
- Detailed tutorials
- Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
- Simple benchmarks illustrating basic performance for compute and storage services
- Benchmarks for reference algorithms
- Best-practice tips
- Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: "Channel 9 Windows Azure"
Bing: "Windows Azure Platform Training Kit" – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
1 GetMessage(Q 30 s) msg 15 C1 crashed
11
21
6 msg1 visible 30 s after Dequeue30
12
11
12
62
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
12
2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed
1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1
2
6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue
30
13
12
13
Queues Recap
bullNo need to deal with failuresMake messageprocessing idempotent
bull Invisible messages result in out of orderDo not rely on order
bullEnforce threshold on messagersquos dequeue countUse Dequeue count to remove poison messages
bullMessages gt 8KBbullBatch messagesbullGarbage collect orphaned blobs
bullDynamically increasereduce workers
Use blob to storemessage data with
reference in message
Use message countto scale
bullNo need to deal with failures
bull Invisible messages result in out of order
bullEnforce threshold on messagersquos dequeue count
bullDynamically increasereduce workers
Windows Azure Storage TakeawaysData abstractions to build your applications
Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at
httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet
Best Practices
Picking the Right VM Size
bull Having the correct VM size can make a big difference in costs
bull Fundamental choice ndash larger fewer VMs vs many smaller instances
bull If you scale better than linear across cores larger VMs could save you money
bull Pretty rare to see linear scaling across 8 cores
bull More instances may provide better uptime and reliability (more failures needed to take your service down)
bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it
bull Common mistake ndash split up code into multiple roles each not using up CPU
bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest
Exploiting Concurrencybull Spin up additional processes each with a specific task or as a
unit of concurrency
bull May not be ideal if number of active processes exceeds number of cores
bull Use multithreading aggressively
bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads
bull In NET 4 use the Task Parallel Library
bull Data parallelism
bull Task parallelism
Finding Good Code Neighborsbull Typically code falls into one or more of these categories
bull Find code that is intensive with different resources to live togetherbull Example distributed network caches are typically network-
and memory-intensive they may be a good neighbor for storage IO-intensive code
MemoryIntensive
CPUIntensive
Network IO Intensive Storage IO Intensive
Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not
over-scaled)
bull Spinning VMs up and down automatically is good at large scale
bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
bull Being too aggressive in spinning down VMs can result in poor user experience
bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs
Performance Cost
Storage Costs
bullUnderstand an applicationrsquos storage profile and how storage billing works
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web apprsquos billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often lead to savings inother places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web apprsquos performance and user experience
Compressing Content
1Gzip all output content
bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better
compression and freedom from patented algorithms
2Tradeoff compute costs for storage size
3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
GzipMinify JavaScript
Minify CCSMinify Images
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700 ~ 1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
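A minimal sketch of the query-segmentation idea described above (function names are illustrative, not AzureBLAST's actual API): split the input sequences into partitions that workers process independently, then merge the per-partition results.

```python
# Sketch of the query-segmentation data-parallel pattern.
def split_queries(sequences, partition_size=100):
    """Yield fixed-size partitions of the input; each becomes one task."""
    for i in range(0, len(sequences), partition_size):
        yield sequences[i:i + partition_size]

def merge_results(per_partition_results):
    """Concatenate per-partition hit lists back into one result set."""
    merged = []
    for hits in per_partition_results:
        merged.extend(hits)
    return merged

# 250 inputs with partition size 100 -> 3 independent tasks (100, 100, 50).
parts = list(split_queries(list(range(250)), partition_size=100))
```

Because each partition is queried against the same database, no cross-partition communication is needed until the final merge, which is what makes this "pleasingly parallel".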
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the generally suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.
AzureBLAST Task-Flow: a simple split/join pattern
Leverage the multi-core capacity of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition → load imbalance
• Small partition → unnecessary overheads
  • NCBI-BLAST overhead
  • Data transfer overhead
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small → repeated computation
• Too large → unnecessarily long wait in case of instance failure
(Task-flow diagram: a splitting task fans out to many BLAST tasks running in parallel, followed by a merging task.)
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size/instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource
AzureBLAST
(Architecture diagram: a Web Role exposes the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, dispatching work through a global dispatch queue to pools of Workers; an Azure Table holds the Job Registry; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a Database Updating Role keeps the NCBI databases current. The task flow is the same split/join pattern: a splitting task fans out to parallel BLAST tasks, followed by a merging task.)
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state
Demonstration
R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists.
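A quick arithmetic check of the quoted estimate, and of the cluster size used later in the experiment:

```python
# Sanity-checking the scale claims: the quoted 3,216,731 minutes of
# single-machine time converts to roughly 6.1 years, and 475 extra-large
# VMs at 8 cores each is 3,800 cores (~4,000 instances' worth of cores).
minutes = 3_216_731
years = minutes / (60 * 24 * 365)   # about 6.1

cores = 475 * 8                     # 3800
```
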
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually
(Map: instances allocated per deployment across the datacenters: 50, 62, 62, 62, 62, 62, 50, 62.)
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, real working instance time should be 6~8 days
  • Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
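Detecting the abnormal case can be automated by pairing "Executing" records with "done" records; a sketch (log format simplified for illustration):

```python
import re

# Sketch: find tasks that started but never logged a completion record --
# the failure signature in the abnormal log excerpt above.
def incomplete_tasks(log_lines):
    started, finished = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return started - finished

log = [
    "RD00155D3611B0 Executing the task 251523",
    "RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins",
    "RD00155D3611B0 Executing the task 251774",   # no completion record
]
```
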
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups: this is an update domain at work (~6 nodes per group, ~30 mins apart).
Surviving Storage Failures
West Europe datacenter: 30,976 tasks completed before the job was killed.
35 nodes experienced blob-write failures at the same time.
A reasonable guess: the fault domain mechanism is working.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)
Computing Evapotranspiration (ET)

Penman-Monteith (1964):

  ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
  ET = water volume evapotranspired (m3 s-1 m-2)
  Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
  λv = latent heat of vaporization (J/g)
  Rn = net radiation (W m-2)
  cp = specific heat capacity of air (J kg-1 K-1)
  ρa = dry air density (kg m-3)
  δq = vapor pressure deficit (Pa)
  ga = conductivity of air (inverse of ra) (m s-1)
  gs = conductivity of plant stoma air (inverse of rs) (m s-1)
  γ = psychrometric constant (γ ≈ 66 Pa K-1)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
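The Penman-Monteith formula transcribes directly into code; this is illustrative only, with plausible mid-day sample values rather than calibrated data (λv is given in J/kg here to match the kg-based density and heat capacity):

```python
# Direct transcription of the Penman-Monteith formula:
#   ET = (delta*Rn + rho_a*c_p*dq*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v)
def penman_monteith(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2.45e6):
    """ET from net radiation, humidity gradient, and conductivities (SI units)."""
    return (delta * Rn + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)

# Plausible mid-day inputs: Rn = 400 W/m^2, vapor pressure deficit 1000 Pa,
# aerodynamic and stomatal conductivities ~0.02 and ~0.01 m/s.
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2, c_p=1004.0,
                     dq=1000.0, g_a=0.02, g_s=0.01)
```

More net radiation should mean more evapotranspiration, which is an easy property to spot-check against the formula.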
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
(Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; a Request Queue feeds a Download Queue for the Data Collection Stage, which pulls from source imagery download sites using source metadata; a Reprojection Queue feeds the Reprojection Stage; Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages; scientific results are then available for download.)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, the recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
(Diagram: a <PipelineStage> Request flows to the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role
• The Generic Worker (Worker Role):
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status
(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; Generic Workers (Worker Roles) dequeue tasks and read <Input>Data Storage.)
Example Pipeline Stage: Reprojection Service
(Diagram: a Reprojection Request flows to the Service Monitor (Worker Role), which persists ReprojectionJobStatus via the Job Queue, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; Generic Workers (Worker Roles) consume tasks that point into Reprojection Data Storage, the SwathGranuleMeta table, the ScanTimeList table, and Swath Source Data Storage.)
• Each job-queue entity specifies a single reprojection job request
• Each task-queue entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Stage                 | Data & files                    | Compute                    | Cost
Data Collection       | 400-500 GB, 60K files, 10 MB/s  | 11 hours, <10 workers      | $50 upload, $450 storage
Reprojection          | 400 GB, 45K files               | 3500 hours, 20-100 workers | $420 CPU, $60 download
Derivation Reduction  | 5-7 GB, 55K files               | 1800 hours, 20-100 workers | $216 CPU, $1 download, $6 storage
Analysis Reduction    | <10 GB, ~1K files               | 1800 hours, 20-100 workers | $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Windows Azure Tables
• Provides structured storage
  • Massively scalable tables: billions of entities (rows) and TBs of data
  • Can use thousands of servers as traffic grows
• Highly available & durable
  • Data is replicated several times
• Familiar and easy-to-use API
  • WCF Data Services and OData
  • .NET classes and LINQ
  • REST, with any platform or language
Is not relational. Cannot:
• Create foreign key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example
All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
  • Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
Partition key is the unit of scale
• A partition can be served by a single server
• The system load balances partitions based on traffic pattern
• Controls entity locality
System load balancing
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
Server Busy
• Use exponential backoff on "Server Busy"
• Either the system is load balancing to meet your traffic needs, or single-partition limits have been reached
Partition Keys In Each Abstraction
Entities - TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
1                         | Order-1               |              |                     | $35.12
2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2                         | Order-3               |              |                     | $10.00

Blobs - Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image          | annarbor/bighouse.jpg
image          | foxborough/gillette.jpg
video          | annarbor/bighouse.jpg

Messages - Queue Name
• All messages for a single queue belong to the same partition

Queue    | Message
jobs     | Message1
jobs     | Message2
workflow | Message1
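A minimal sketch of the customer/order key scheme in the entity table above, using plain dicts in place of the real storage client; the helper names are illustrative:

```python
# Sketch: choose keys so a customer's record and its orders share a
# PartitionKey, keeping them in one partition for locality and for
# entity-group transactions.
def customer_row(customer_id, name):
    return {"PartitionKey": str(customer_id),
            "RowKey": f"Customer-{name}",
            "Name": name}

def order_row(customer_id, order_id, total):
    return {"PartitionKey": str(customer_id),
            "RowKey": f"Order-{order_id}",
            "OrderTotal": total}

c = customer_row(1, "John Smith")
o = order_row(1, 1, 35.12)
same_partition = c["PartitionKey"] == o["PartitionKey"]   # True
```
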
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync
(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.)
Scalability Targets
Storage Account
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When the limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.
Partitions and Partition Ranges

Server A - Table = Movies [Min - Max]:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

After load balancing, the range splits across two servers:

Server A - Table = Movies [Min - Comedy):

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006

Server B - Table = Movies [Comedy - Max]:

PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens - Seriously
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
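Any of those limits can cut a query short and hand back a continuation token; the loop every query should use can be sketched like this (`query_page` is a stand-in for the real storage client call):

```python
# Sketch: always drain continuation tokens. A query may stop early at the
# 1000-row cap, a partition boundary, or the 5-second budget, returning a
# token for the remainder.
def query_all(query_page):
    """Keep issuing the query until no continuation token is returned."""
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:
            return rows

# Fake paged source standing in for the table service: three pages.
pages = {None: ([1, 2], "t1"), "t1": ([3], "t2"), "t2": ([4, 5], None)}
result = query_all(lambda tok: pages[tok])   # [1, 2, 3, 4, 5]
```

Code that ignores the token silently sees only the first page, which is why the slide stresses this so heavily.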
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Tips:
• Select a PartitionKey and RowKey that help scale; avoid "append only" patterns by distributing with a hash etc. as a prefix
• Always handle continuation tokens; expect them for range queries
• "OR" predicates are not optimized; execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries on "Server Busy": either the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?
Queue Terminology

Message Lifecycle
(Diagram: a Web Role puts messages onto the Queue (Msg 1 … Msg 4); Worker Roles call GetMessage with a timeout, process the message, and then call RemoveMessage.)

PutMessage:

  POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

  HTTP/1.1 200 OK
  Transfer-Encoding: chunked
  Content-Type: application/xml
  Date: Tue, 09 Dec 2008 21:04:30 GMT
  Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

  <?xml version="1.0" encoding="utf-8"?>
  <QueueMessagesList>
    <QueueMessage>
      <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
      <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
      <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
      <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
      <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
      <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
    </QueueMessage>
  </QueueMessagesList>

RemoveMessage:

  DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll increases the interval by 2x; a successful poll sets the interval back to 1.
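The back-off rule can be sketched in a few lines (the cap of 64 is an illustrative choice, hence "truncated"):

```python
# Sketch: truncated exponential back-off polling. Empty polls double the
# sleep interval up to a cap; a successful dequeue resets it to 1.
def next_interval(current, got_message, cap=64):
    if got_message:
        return 1
    return min(current * 2, cap)

interval = 1
history = []
for got in [False, False, False, True, False]:
    interval = next_interval(interval, got)
    history.append(interval)
# history == [2, 4, 8, 1, 2]
```

This keeps idle workers from hammering the queue (and paying for the transactions) while still reacting quickly once messages start arriving.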
Removing Poison Messages
(Producers P1, P2; consumers C1, C2)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (continued)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (continued)
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
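The dequeue-count safeguard in the final steps can be sketched as follows (the threshold and the dict-based message are illustrative stand-ins for the real queue client):

```python
# Sketch: remove poison messages by enforcing a dequeue-count threshold,
# instead of letting a bad message crash consumers forever.
MAX_DEQUEUE = 2

def handle(message, process, dead_letter):
    """Delete a message that keeps reappearing; otherwise process it."""
    if message["dequeue_count"] > MAX_DEQUEUE:
        dead_letter.append(message)   # park it for offline inspection
        return "deleted"
    process(message)
    return "processed"

dead = []
poison = {"id": 1, "dequeue_count": 3}
status = handle(poison, process=lambda m: None, dead_letter=dead)  # "deleted"
```
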
Queues Recap
• Make message processing idempotent: then there is no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs - files and large objects
• Drives - NTFS APIs for migrating applications
• Tables - massively scalable structured storage
• Queues - reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
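The data-parallel idea carries over to any language; here is a sketch with Python's standard worker pool (the .NET analogue on the slide is the Task Parallel Library):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: data parallelism with a worker pool, so one role instance
# keeps all of its cores busy instead of idling on a single thread.
def expensive(x):
    return x * x            # stand-in for real per-item work

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(expensive, range(8)))
```

`pool.map` preserves input order, so the results line up with the inputs even though the work ran concurrently.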
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not
over-scaled)
bull Spinning VMs up and down automatically is good at large scale
bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
bull Being too aggressive in spinning down VMs can result in poor user experience
bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs
Performance Cost
Storage Costs
bullUnderstand an applicationrsquos storage profile and how storage billing works
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web apprsquos billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often lead to savings inother places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web apprsquos performance and user experience
Compressing Content
1Gzip all output content
bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better
compression and freedom from patented algorithms
2Tradeoff compute costs for storage size
3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
GzipMinify JavaScript
Minify CCSMinify Images
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences
Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel
bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing
Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach
to 1TB
bull The output of BLAST is usually 10-100x larger than the input
AzureBLAST
bull Parallel BLAST engine on Azure
bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done
bull Follows the general suggested application model bull Web Role + Queue + Worker
bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific
Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010
AzureBLAST Task-FlowA simple SplitJoin pattern
Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size
Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead
Best Practice test runs to profiling and set size to mitigate the overhead
Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best
choice
Instance size vs Performancebull Super-linear speedup with larger size
worker instancesbull Primarily due to the memory capability
Task SizeInstance Size vs Costbull Extra-large instance generated the best
and the most economical throughputbull Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Blast databases temporary data etc)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs
AuthenticationAuthorization based on Live ID
The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory
states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW
Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less
than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically ~100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.
Our Approach
• Allocated ~4,000 cores in total: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment was submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalance appeared, the load was redistributed manually
[Deployment map: per-deployment instance counts of 50-62.]
End Result
• Total size of the output result is ~230 GB
• 1,764,579,487 total hits
• Started March 25th; the last task completed April 8th (10 days of compute)
• Based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place
Understanding Azure by analyzing logs

A normal log record looks like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
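Flagging the anomaly above (a task started but never reported done) is mechanical. A sketch, assuming log lines of the form shown with punctuation restored (the `LOG` sample and regexes are illustrative):

```python
import re

LOG = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

def unfinished_tasks(log_text):
    """Return task ids that were started but never reported done."""
    started, done = set(), set()
    for line in log_text.splitlines():
        if m := re.search(r"Executing the task (\d+)", line):
            started.add(m.group(1))
        if m := re.search(r"Execution of task (\d+) is done", line):
            done.add(m.group(1))
    return sorted(started - done)

print(unfinished_tasks(LOG))   # ['251774']
```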
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total.
All 62 compute nodes lost their tasks and then came back in groups of ~6 nodes, roughly 30 minutes apart: this is an update domain at work.
Surviving Storage Failures
West Europe datacenter: 30,976 tasks completed before the job was killed.
35 nodes experienced blob-write failures at the same time; a reasonable guess is that a fault domain was involved.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." - Irish proverb
Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

  ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

  ET = water volume evapotranspired (m^3 s^-1 m^-2)
  Δ  = rate of change of saturation specific humidity with air temperature (Pa K^-1)
  λv = latent heat of vaporization (J/g)
  Rn = net radiation (W m^-2)
  cp = specific heat capacity of air (J kg^-1 K^-1)
  ρa = dry air density (kg m^-3)
  δq = vapor pressure deficit (Pa)
  ga = conductivity of air, the inverse of ra (m s^-1)
  gs = conductivity of plant stoma air, the inverse of rs (m s^-1)
  γ  = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
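As a direct transcription of the Penman-Monteith formula above (the sample input values are illustrative only, and consistent unit bookkeeping is left to the caller):

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET.
    delta    : d(saturation specific humidity)/dT   [Pa/K]
    r_n      : net radiation                        [W/m^2]
    rho_a    : dry air density                      [kg/m^3]
    c_p      : specific heat capacity of air        [J/(kg K)]
    dq       : vapor pressure deficit               [Pa]
    g_a, g_s : conductivities of air / plant stoma  [m/s]
    gamma    : psychrometric constant (~66 Pa/K)
    lambda_v : latent heat of vaporization          [J/g]
    """
    return (delta * r_n + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)

# illustrative (not measured) inputs for a warm, well-watered canopy
et = penman_monteith_et(delta=145, r_n=400, rho_a=1.2, c_p=1005,
                        dq=1000, g_a=0.02, g_s=0.01)
```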
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
Scale: 20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Pipeline diagram: scientists submit requests via the AzureMODIS service web role portal; a request queue feeds a download queue (data collection stage, pulling from source imagery download sites), then a reprojection queue (reprojection stage), then reduction 1 and reduction 2 queues (derivation and analysis reduction stages); source metadata is consulted throughout, and scientific results are downloaded at the end.]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)

• The MODISAzure service is the web role front door
  • Receives all user requests
  • Queues each request to the appropriate download, reprojection, or reduction job queue
• The Service Monitor is a dedicated worker role
  • Parses all job requests into tasks - recoverable units of work
  • Persists the execution status of all jobs and tasks in tables

[Diagram: a <PipelineStage> request enters the MODISAzure service (web role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (worker role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.]
MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a GenericWorker worker role
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

[Diagram: the Service Monitor (worker role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue, which GenericWorker (worker role) instances consume, reading from <Input> data storage.]
Example Pipeline Stage: Reprojection Service

[Diagram: a reprojection request flows through the job queue to the Service Monitor (worker role), which persists ReprojectionJobStatus and ReprojectionTaskStatus and dispatches tasks to the task queue consumed by GenericWorker (worker role) instances; workers read swath source data storage and write reprojection data storage, consulting the SwathGranuleMeta and ScanTimeList tables.]

• Each job-queue entity specifies a single reprojection job request
• Each task-queue entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation

• Computational costs are driven by the data scale and the need to run reductions multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Stage by stage:
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers - $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers - $420 compute, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers - $216 compute, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers - $216 compute, $2 download, $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds can act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press. Programming Windows Azure, O'Reilly Press. Bing: Channel 9 Windows Azure. Bing: Windows Azure Platform Training Kit - November Update. http://research.microsoft.com/azure  xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds - Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components - Compute: Web Roles
- Key Components - Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components - Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce - The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Is not relational
Cannot:
• Create foreign-key relationships between tables
• Perform server-side joins between tables
• Create custom indexes on the tables
• No server-side Count(), for example

All entities must have the following properties:
• Timestamp
• PartitionKey
• RowKey
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple, asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
Storage Partitioning
Understanding partitioning is key to understanding performance.

Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• Controls entity locality

Partition key is the unit of scale
• A partition can be served by a single server
• The system load-balances partitions based on traffic pattern

The system load-balances for you
• Load balancing can take a few minutes to kick in
• It can take a couple of seconds for a partition to become available on a different server
• Use exponential backoff on "Server Busy": either the system is load balancing to meet your traffic needs, or a single partition's limits have been reached
Partition Keys In Each Abstraction

Entities - TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

  PartitionKey (CustomerId) | RowKey (RowKind)      | Name         | CreditCardNumber    | OrderTotal
  1                         | Customer-John Smith   | John Smith   | xxxx-xxxx-xxxx-xxxx |
  1                         | Order - 1             |              |                     | $35.12
  2                         | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
  2                         | Order - 3             |              |                     | $10.00

Blobs - Container name + Blob name
• Every blob and its snapshots are in a single partition

  Container Name | Blob Name
  image          | annarbor/bighouse.jpg
  image          | foxborough/gillette.jpg
  video          | annarbor/bighouse.jpg

Messages - Queue name
• All messages for a single queue belong to the same partition

  Queue    | Message
  jobs     | Message1
  jobs     | Message2
  workflow | Message1
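The three partitioning rules above can be captured in a few lines (a toy model; the function and key names are illustrative, not the storage service's internals):

```python
def partition_key(kind, **names):
    """Derive the partition identity for each storage abstraction.
    Objects with the same partition identity are served by one partition."""
    if kind == "entity":      # table entities: TableName + PartitionKey
        return (names["table"], names["partition_key"])
    if kind == "blob":        # a blob and its snapshots: Container + Blob
        return (names["container"], names["blob"])
    if kind == "message":     # all messages in a queue: Queue name
        return (names["queue"],)
    raise ValueError(f"unknown kind: {kind}")

# all messages in one queue land in a single partition...
assert partition_key("message", queue="jobs") == \
       partition_key("message", queue="jobs")
# ...while two blobs in the same container may be in different partitions
assert partition_key("blob", container="image", blob="a.jpg") != \
       partition_key("blob", container="image", blob="b.jpg")
```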
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load-balanced to replicas that are in sync

[Diagram: Server 1, Server 2, and Server 3 each hold replicas of partitions P1, P2, ..., Pn.]
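The write/read guarantee above can be modeled with a toy in-memory store (this is a conceptual sketch of the contract, not how the storage service is implemented):

```python
class ReplicatedStore:
    """Toy model: a write acknowledges only after all three replicas have
    applied it, so a read from any in-sync replica returns the value."""
    def __init__(self):
        self.replicas = [{}, {}, {}]

    def write(self, key, value):
        for replica in self.replicas:   # not complete until all 3 written
            replica[key] = value
        return True                     # ack only after the loop finishes

    def read(self, key, replica_index):
        # reads are only routed to in-sync replicas, so any index works
        return self.replicas[replica_index][key]

store = ReplicatedStore()
store.write("p1", "data")
assert all(store.read("p1", i) == "data" for i in range(3))
```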
Scalability Targets

Storage account
• Capacity: up to 100 TB
• Transactions: up to a few thousand requests per second
• Bandwidth: up to a few hundred megabytes per second

Single queue/table partition
• Up to 500 transactions per second

Single blob partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions.
When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.
Partitions and Partition Ranges

Initially a single server may serve the entire table:

Server A: Table = Movies [Min - Max]
  PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
  Action                  | Fast & Furious           | ...       | 2009
  Action                  | The Bourne Ultimatum     | ...       | 2007
  ...                     | ...                      | ...       | ...
  Animation               | Open Season 2            | ...       | 2009
  Animation               | The Ant Bully            | ...       | 2006
  ...                     | ...                      | ...       | ...
  Comedy                  | Office Space             | ...       | 1999
  ...                     | ...                      | ...       | ...
  SciFi                   | X-Men Origins: Wolverine | ...       | 2009
  ...                     | ...                      | ...       | ...
  War                     | Defiance                 | ...       | 2008

Under load, the system splits the partition range across servers:

Server A: Table = Movies [Min - Comedy)
  Action    | Fast & Furious       | ... | 2009
  Action    | The Bourne Ultimatum | ... | 2007
  ...       | ...                  | ... | ...
  Animation | Open Season 2        | ... | 2009
  Animation | The Ant Bully        | ... | 2006

Server B: Table = Movies [Comedy - Max]
  Comedy | Office Space             | ... | 1999
  ...    | ...                      | ... | ...
  SciFi  | X-Men Origins: Wolverine | ... | 2009
  ...    | ...                      | ... | ...
  War    | Defiance                 | ... | 2008
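Range partitioning like this is just an ordered lookup over split points. A sketch with the stdlib's `bisect` (the boundary and server names mirror the example above and are purely illustrative):

```python
import bisect

# one split point between two ranges: [Min - "Comedy") and ["Comedy" - Max]
boundaries = ["Comedy"]
servers = ["Server A", "Server B"]

def server_for(partition_key):
    """Locate the server whose key range contains this PartitionKey."""
    return servers[bisect.bisect_right(boundaries, partition_key)]

print(server_for("Action"))   # Server A  ("Action" < "Comedy")
print(server_for("Comedy"))   # Server B  (the split range is [Comedy - Max])
print(server_for("War"))      # Server B
```

Adding more split points to `boundaries` (with a matching server list) models further splits as traffic grows.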
Key Selection: Things to Consider

Scalability
• Distribute load as much as possible
• Hot partitions can be load-balanced
• PartitionKey is critical for scalability

Query efficiency and speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient

Entity group transactions
• Transactions across a single partition
• Transaction semantics reduce round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.
Expect Continuation Tokens - Seriously
A continuation token can be returned whenever:
• The response already contains the maximum of 1000 rows
• The query reached a partition range boundary
• The query hit the maximum of 5 seconds of execution time
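Because tokens can appear even for small result sets, every table query should run inside a drain loop. A sketch, where `query_page(token) -> (rows, next_token)` is a hypothetical stand-in for one paged table request (not an actual client-library API):

```python
def query_all(query_page):
    """Drain a paged query that may return continuation tokens.
    `query_page(token)` returns (rows, next_token); next_token is None
    when the result set is exhausted. Always loop - a token can show up
    for partition boundaries or the 5-second limit, not just 1000 rows."""
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:
            return rows

# toy server: 2500 rows, at most 1000 returned per request
DATA = list(range(2500))
def fake_query_page(token):
    start = token or 0
    page = DATA[start:start + 1000]
    next_token = start + 1000 if start + 1000 < len(DATA) else None
    return page, next_token

print(len(query_all(fake_query_page)))   # 2500
```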
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Select a PartitionKey and RowKey that help scale
• Avoid "append only" patterns: distribute load by using a hash or similar as a key prefix
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "Server busy" means either partitions are being load-balanced to meet traffic needs or the load on a single partition has exceeded its limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together but are not bound together
• Tight coupling leads to brittleness; decoupling via queues can aid scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML and are limited to 8 KB in size
• Commonly used with the work ticket pattern
• Why not simply use a table?
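The work ticket pattern above can be sketched with stdlib stand-ins for the queue and blob store (`submit`/`worker` and the dict-backed storage are illustrative, not Azure SDK calls):

```python
import queue
import uuid

blob_store = {}               # stands in for blob storage
work_queue = queue.Queue()    # stands in for an Azure queue (8 KB messages)

def submit(payload: bytes) -> str:
    """Park the (possibly large) payload in blob storage and enqueue only
    a small reference to it - the 'work ticket'."""
    blob_name = f"job-{uuid.uuid4()}"
    blob_store[blob_name] = payload
    work_queue.put(blob_name)          # tiny message, well under 8 KB
    return blob_name

def worker() -> bytes:
    ticket = work_queue.get()          # GetMessage
    data = blob_store[ticket]          # fetch the real payload
    result = data.upper()              # ...do the actual work...
    del blob_store[ticket]             # garbage-collect the orphaned blob
    return result

submit(b"a" * 100_000)                 # far larger than one queue message
result = worker()                      # b"A" * 100_000
```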
Queue Terminology and Message Lifecycle

[Diagram: a web role calls PutMessage to add messages (Msg 1..4) to the queue; worker roles call GetMessage with a visibility timeout to dequeue a message and RemoveMessage to delete it once processed.]
PutMessage request:

POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DeleteMessage request:

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll increases the polling interval by 2x, up to a maximum; a successful poll resets the interval back to 1.
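The back-off rule above fits in a few lines (the base and cap values are illustrative defaults):

```python
def poll_intervals(outcomes, base=1, cap=64):
    """Truncated exponential back-off: each empty poll doubles the sleep
    interval up to `cap`; a successful poll resets it to `base`.
    `outcomes` is a sequence of booleans (True = a message was returned);
    the returned list is the interval used before each poll."""
    interval, schedule = base, []
    for got_message in outcomes:
        schedule.append(interval)
        interval = base if got_message else min(interval * 2, cap)
    return schedule

# six empty polls, then a message, then one more empty poll
print(poll_intervals([False] * 6 + [True, False]))
# [1, 2, 4, 8, 16, 32, 64, 1]
```

In a real worker the interval would feed a sleep between GetMessage calls; the cap ("truncation") keeps a long-idle queue from backing off indefinitely.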
Removing Poison Messages (1/3)

[Diagram: producers P1 and P2 feed a queue of messages (each carrying a dequeue count) consumed by workers C1 and C2.]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (2/3)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (3/3)
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
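Step 12 above - check the dequeue count, and delete rather than reprocess past a threshold - is the whole poison-message guard. A sketch (the callbacks and threshold are illustrative, not a specific client API):

```python
MAX_DEQUEUE_COUNT = 3

def handle(queue_get, queue_delete, process):
    """Poison-message guard: if a message keeps reappearing (its dequeue
    count exceeds the threshold), delete it instead of processing again.
    queue_get() returns (body, dequeue_count) or None."""
    msg = queue_get()
    if msg is None:
        return None
    body, dequeue_count = msg
    if dequeue_count > MAX_DEQUEUE_COUNT:
        queue_delete(body)             # remove the poison message
        return ("poisoned", body)
    process(body)                      # normal path: process, then delete
    queue_delete(body)
    return ("done", body)

deleted = []
status = handle(lambda: ("msg1", 4), deleted.append, lambda body: None)
print(status)    # ('poisoned', 'msg1')
```

A production variant would typically also park the poisoned body somewhere (a "dead letter" table or blob) for later inspection instead of discarding it outright.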
Queues Recap
• Make message processing idempotent: then there is no need to deal with failures specially
• Do not rely on order: invisible messages can result in out-of-order delivery
• Use the dequeue count to remove poison messages: enforce a threshold on each message's dequeue count
• For messages larger than 8 KB, use a blob to store the message data with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale: dynamically increase or reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs - files and large objects
• Drives - NTFS APIs for migrating applications
• Tables - massively scalable structured storage
• Queues - reliable delivery of messages

Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up CPU against having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O completion ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
• Exploit both data parallelism and task parallelism
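The slide's example is .NET's Task Parallel Library; the same task-parallel idea in a language-neutral sketch - one independent task per work item, scheduled across a thread pool (the checksum workload is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

def checksum(block: bytes) -> str:
    """One independent unit of work per input block."""
    return hashlib.sha256(block).hexdigest()

# eight independent blocks -> eight tasks the pool spreads over its threads
blocks = [bytes([i]) * 1_000_000 for i in range(8)]

with ThreadPoolExecutor(max_workers=8) as pool:
    digests = list(pool.map(checksum, blocks))
```

Note the earlier caveat still applies: more in-flight tasks than cores mostly buys queuing, not speed, for CPU-bound work.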
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure or poor user experience from not having excess capacity against the cost of idling VMs
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app's profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • The service choice can make a big cost difference depending on your profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth often leads to savings in other places
• Sending fewer things also means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

[Diagram: uncompressed content → Gzip, minified JavaScript, minified CSS, minified images → compressed content]
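How much Gzip buys on typical markup is easy to see (the sample HTML is illustrative; real pages with repetitive tags compress similarly well):

```python
import gzip

# repetitive markup, as most generated HTML is
html = b"<html><body>" + b"<p>hello azure</p>" * 500 + b"</body></html>"

compressed = gzip.compress(html)
print(len(html), "->", len(compressed))   # prints original vs. compressed size

# the round trip is lossless - this is the "compute for bandwidth" trade
assert gzip.decompress(compressed) == html
```

Here a few CPU milliseconds shrink both the bytes stored and the bytes billed on every response.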
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications

NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially: GenBank doubled in size in about 15 months
Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means peak storage traffic could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
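The query-segmentation approach sketched above is a plain split/join: partition the input, run each partition independently, merge in order (here `blast_partition` is a stand-in for a real NCBI-BLAST invocation, and 100 sequences/partition echoes the benchmark result later in the deck):

```python
def split(sequences, partition_size=100):
    """Query segmentation: cut the input into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition):
    """Stand-in for running NCBI-BLAST on one partition on one worker."""
    return [f"hits-for-{seq}" for seq in partition]

def merge(per_partition_results):
    """Join step: concatenate per-partition results in input order."""
    return [hit for part in per_partition_results for hit in part]

seqs = [f"seq{i}" for i in range(250)]
partitions = split(seqs)                              # sizes 100, 100, 50
merged = merge(blast_partition(p) for p in partitions)
print(len(partitions), len(merged))                   # 3 250
```

Because no partition depends on another, the middle step maps directly onto queue-dispatched worker roles.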
AzureBLAST
• A parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query the partitions in parallel
  • Merge the results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations, including batch job management and task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern

Leverage the multiple cores of each instance
• The "-a" argument of NCBI-BLAST
• Set to 1, 2, 4, or 8 for small, medium, large, and extra-large instance sizes
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Windows Azure Queues
• Queues are performance-efficient, highly available, and provide reliable message delivery
• Simple asynchronous work dispatch
• Programming semantics ensure that a message can be processed at least once
• Access is provided via REST
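The at-least-once guarantee comes from the visibility-timeout mechanics: a fetched message is hidden, not removed, and reappears unless it is explicitly deleted. A minimal in-memory sketch of these semantics (this is a toy model for illustration, not the Azure queue API):

```python
import time

class ToyQueue:
    """Toy model of Azure-queue-style at-least-once delivery.
    A message fetched with a visibility timeout becomes invisible,
    then reappears unless it is explicitly deleted."""
    def __init__(self):
        self._msgs = {}   # message id -> (body, visible_at)
        self._next = 0

    def put(self, body):
        self._msgs[self._next] = (body, 0.0)
        self._next += 1

    def get(self, visibility_timeout, now=None):
        now = time.time() if now is None else now
        for mid, (body, visible_at) in sorted(self._msgs.items()):
            if visible_at <= now:
                # Hide the message for `visibility_timeout` seconds.
                self._msgs[mid] = (body, now + visibility_timeout)
                return mid, body
        return None

    def delete(self, mid):
        self._msgs.pop(mid, None)

q = ToyQueue()
q.put("work item")
mid, body = q.get(visibility_timeout=30, now=100.0)
# Consumer crashes without deleting: the message is invisible until t=130...
assert q.get(visibility_timeout=30, now=120.0) is None
# ...then reappears, so it is processed at least once.
mid2, body2 = q.get(visibility_timeout=30, now=131.0)
q.delete(mid2)
```

This is why handlers should be idempotent: the same message body may be handed out more than once.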
Storage Partitioning
Understanding partitioning is key to understanding performance.
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
Partition key is the unit of scale
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
• Controls entity locality
System load balances
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
Server Busy
• Use exponential backoff on "Server Busy"
• Our system load balances to meet your traffic needs
• Single partition limits have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $35.12
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $10.00

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
[Diagram: partitions P1…Pn replicated across Server 1, Server 2, and Server 3]
Scalability Targets
Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions.
When the limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.
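A retry wrapper for the "503 Server Busy" case can be sketched as follows. This is a hedged illustration: `ServerBusyError` and the `op` callable are hypothetical stand-ins for whatever storage call and error your client library surfaces.

```python
import random
import time

class ServerBusyError(Exception):
    """Stand-in for a storage response of '503 Server Busy'."""

def with_backoff(op, max_retries=6, base=0.5, cap=30.0, sleep=time.sleep):
    """Retry `op` with truncated exponential backoff when the
    service reports it is busy."""
    delay = base
    for attempt in range(max_retries):
        try:
            return op()
        except ServerBusyError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Jittered wait, truncated at `cap`; double it each retry.
            sleep(min(cap, delay) * random.uniform(0.5, 1.5))
            delay *= 2

# Simulated partition that is busy twice before accepting the request.
calls = []
def flaky_put():
    calls.append(1)
    if len(calls) < 3:
        raise ServerBusyError
    return "ok"

assert with_backoff(flaky_put, sleep=lambda s: None) == "ok"
assert len(calls) == 3
```

The jitter keeps many clients from retrying in lockstep against the same hot partition.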
Partitions and Partition Ranges

Server A – Table = Movies [Min - Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

After the partition range splits under load:

Server A – Table = Movies [Min - Comedy)
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B – Table = Movies [Comedy - Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.
Expect Continuation Tokens – Seriously
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
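Since any of these limits can cut a result set short, every table query loop should drain continuation tokens. A sketch of the pattern (here `query_page` is a hypothetical helper standing in for one REST round trip; the real service carries the token in the `x-ms-continuation-NextPartitionKey` / `NextRowKey` response headers):

```python
def query_all(query_page):
    """Drain a table query that may return continuation tokens.
    `query_page(token)` returns (rows, next_token); next_token is
    None once the result set is exhausted."""
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:
            return rows

# Toy paging source: at most 1000 rows per response, as the service enforces.
data = list(range(2500))
def fake_page(token):
    start = token or 0
    page = data[start:start + 1000]
    nxt = start + 1000 if start + 1000 < len(data) else None
    return page, nxt

assert query_all(fake_page) == data
```

Note that a response can even contain zero rows and still carry a token, so the loop must key off the token, not the page size.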
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select PartitionKey and RowKey that help scale
Avoid "append only" patterns
• Distribute by using a hash etc. as a prefix
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• "Server Busy": partitions are load balanced to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
• Tight coupling leads to brittleness
• Loose coupling can aid in scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
[Diagram: a Web Role puts messages (Msg 1…Msg 4) onto the Queue with PutMessage; Worker Roles fetch them with GetMessage (with a visibility timeout) and, once processed, RemoveMessage]

PutMessage:

POST http://myaccount.queue.core.windows.net/myqueue/messages

GetMessage response:

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

RemoveMessage:

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a back-off polling approach:
• Each empty poll increases the interval by 2x, up to a ceiling
• A successful poll sets the interval back to 1
[Diagram: consumers C1 and C2 polling the queue at growing intervals]
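The polling schedule described above can be sketched as a small function; the 1-second floor and 60-second ceiling here are illustrative assumptions, not service-mandated values:

```python
def poll_intervals(results, floor=1.0, ceiling=60.0):
    """Truncated exponential back-off polling: each empty poll doubles
    the wait (truncated at `ceiling`); a successful poll resets it to
    `floor`. `results` is a sequence of booleans standing in for
    'the queue had a message on this poll'."""
    interval, waits = floor, []
    for got_message in results:
        waits.append(interval)  # wait this long before the poll
        interval = floor if got_message else min(ceiling, interval * 2)
    return waits

# Five empty polls, a hit, then one more empty poll:
# the interval grows 1, 2, 4, 8, 16, 32, then resets to 1 after success.
assert poll_intervals([False]*5 + [True] + [False]) == [1, 2, 4, 8, 16, 32, 1]
```

The ceiling bounds the worst-case latency for picking up new work; the reset keeps a busy queue drained at full speed.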
Removing Poison Messages
[Diagram: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue with a 30 s visibility timeout. The slide animates the following sequence:]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2, so msg 1 is treated as a poison message
13. C1: DeleteMessage(Q, msg 1)
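Step 12 above is the key idea: the service increments a message's dequeue count on every GetMessage, so a consumer can spot a message that keeps crashing its handlers and remove it. A sketch of that consumer loop (the four callables are hypothetical stand-ins for queue and storage calls, and the threshold of 3 is an illustrative choice):

```python
POISON_THRESHOLD = 3

def process_queue(get_message, delete_message, handle, quarantine):
    """Dequeue messages; once a message's dequeue count crosses the
    threshold, quarantine and delete it instead of retrying forever."""
    while True:
        msg = get_message()              # the service bumps dequeue_count
        if msg is None:
            return                       # queue drained
        if msg["dequeue_count"] > POISON_THRESHOLD:
            quarantine(msg)              # e.g. log it or park it in a blob
            delete_message(msg)
            continue
        handle(msg)
        delete_message(msg)

# Toy run: one healthy message and one that has already failed 4 times.
pending = [{"body": "good", "dequeue_count": 1},
           {"body": "bad", "dequeue_count": 4}]
handled, parked = [], []
process_queue(lambda: pending.pop(0) if pending else None,
              lambda m: None,
              lambda m: handled.append(m["body"]),
              lambda m: parked.append(m["body"]))
assert handled == ["good"] and parked == ["bad"]
```

Quarantining (rather than silently deleting) preserves the bad message for later diagnosis.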
Queues Recap
Make message processing idempotent
• No need to deal with failures
Do not rely on order
• Invisible messages result in out-of-order delivery
Use the dequeue count to remove poison messages
• Enforce a threshold on a message's dequeue count
Use a blob to store message data, with a reference in the message
• For messages > 8 KB
• Batch messages
• Garbage collect orphaned blobs
Use message count to scale
• Dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
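The slide's advice is framed around the .NET Task Parallel Library; the same data-parallel "map" pattern can be sketched in Python with `concurrent.futures` (a pool of workers sized to the instance's core count, applying one function across many independent inputs):

```python
from concurrent.futures import ThreadPoolExecutor
import os

# A stand-in unit of work: any pure function applied per input chunk.
def checksum(chunk):
    return sum(chunk) % 97

chunks = [list(range(i, i + 1000)) for i in range(0, 8000, 1000)]

# Size the pool to the VM's cores so the instance is not left idle.
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    results = list(pool.map(checksum, chunks))

assert len(results) == 8
assert results == [checksum(c) for c in chunks]
```

(For CPU-bound Python work a process pool would be the better fit; the point here is the pattern of fanning independent work units across all cores, which is what the TPL does for .NET code.)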
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
[Diagram: uncompressed content → Gzip, minified JavaScript, minified CSS, minified images → compressed content]
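A quick sketch of the payoff with Python's standard-library `gzip` (the HTML payload here is made up for illustration):

```python
import gzip

# Repetitive markup, typical of generated HTML.
html = b"<html><body>" + b"<p>hello azure</p>" * 500 + b"</body></html>"

compressed = gzip.compress(html)

# The bytes served (and the egress bandwidth billed) shrink accordingly,
# and the content round-trips losslessly.
assert len(compressed) < len(html) // 10
assert gzip.decompress(compressed) == html
```

In a web role this is usually a one-line configuration (enabling dynamic compression in IIS) rather than hand-written code; the sketch just shows why it pays.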
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
• Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
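The query-segmentation pattern boils down to a split/join: cut the input into fixed-size partitions, run each as an independent task, and concatenate the outputs. A minimal sketch (partition size and names are illustrative, not AzureBLAST's actual code):

```python
def split_query(sequences, partition_size):
    """Split the input sequences into fixed-size partitions; each
    partition becomes one independent BLAST task."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge(results_per_partition):
    """Join: concatenate per-partition results once every task is done."""
    merged = []
    for part in results_per_partition:
        merged.extend(part)
    return merged

seqs = [f"seq{i}" for i in range(250)]
parts = split_query(seqs, 100)   # e.g. 100 sequences per partition
assert [len(p) for p in parts] == [100, 100, 50]
assert merge(parts) == seqs      # split/join round-trips the input
```

Choosing the partition size is the interesting knob, as the next slide's granularity discussion makes clear.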
AzureBLAST Task-Flow
A simple Split/Join pattern.
Leverage the multi-core capability of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
• NCBI-BLAST overhead
• Data-transfer overhead
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of an instance failure
[Diagram: a splitting task fans out to many BLAST tasks, which a merging task joins]
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size/instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resource
AzureBLAST
[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, tracking jobs in the Job Registry (Azure Table); Worker instances pull BLAST tasks from a global dispatch queue; a Database Updating Role refreshes the NCBI databases; Azure Blob storage holds the BLAST databases, temporary data, etc. A splitting task fans out BLAST tasks that a merging task joins.]
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID.
An accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state
[Diagram: the Web Portal and Web Service feed job registration into the Job Scheduler, Job Portal, Scaling Engine, and Job Registry]
Demonstration
R. palustris as a platform for H2 production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment
Discovering homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total 9,865,668 sequences to be queried
• Theoretically 100 billion sequence comparisons
Performance estimation
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
[Map: instance counts per deployment – 50, 62, 62, 62, 62, 62, 50, 62]
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6~8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise, something is wrong (e.g. a task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe data center: in total 34,256 tasks processed
• All 62 compute nodes lost tasks and then came back in groups: this is an update domain
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J g-1)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Diagram: Scientists submit requests through the AzureMODIS Service Web Role Portal; a Request Queue feeds the Data Collection Stage, which pulls from Source Imagery Download Sites via a Download Queue and records Source Metadata; a Reprojection Queue feeds the Reprojection Stage; Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages; science results are available for Scientific Results Download]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables
[Diagram: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue, from which GenericWorker (Worker Role) instances pull tasks and read/write <Input>Data Storage]
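The GenericWorker's dequeue/execute/retry behavior can be sketched as a small loop. This is an illustration of the pattern, not MODISAzure's actual code; the callables stand in for queue and table operations, and status strings are made up:

```python
MAX_ATTEMPTS = 3

def run_worker(task_queue, execute, record_status):
    """Dequeue a task, run it, and re-enqueue on failure up to 3
    attempts, persisting task status (as the Tables do) each time."""
    while task_queue:
        task = task_queue.pop(0)
        try:
            execute(task)
            record_status(task["id"], "done")
        except Exception:
            task["attempts"] = task.get("attempts", 0) + 1
            if task["attempts"] < MAX_ATTEMPTS:
                task_queue.append(task)          # retry later
                record_status(task["id"], "retrying")
            else:
                record_status(task["id"], "failed")

statuses = []
def flaky(task):
    if task["id"] == "t2":
        raise RuntimeError("always fails")

run_worker([{"id": "t1"}, {"id": "t2"}], flaky,
           lambda tid, s: statuses.append((tid, s)))
assert statuses == [("t1", "done"), ("t2", "retrying"),
                    ("t2", "retrying"), ("t2", "failed")]
```

Persisting status on every transition is what makes tasks "recoverable units of work": a replacement worker can pick up where a failed one left off.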
Example Pipeline Stage: Reprojection Service
[Diagram: Reprojection Requests enter the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue, from which GenericWorker (Worker Role) instances pull tasks that point into Reprojection Data Storage and Swath Source Data Storage]
• Each job entity specifies a single reprojection job request
• Each task entity specifies a single reprojection task (i.e. a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g. boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage | Data | Compute | Cost
Data Collection | 400-500 GB, 60K files, 10 MB/sec | 11 hours, <10 workers | $50 upload, $450 storage
Reprojection | 400 GB, 45K files | 3500 hours, 20-100 workers | $420 CPU, $60 download
Derivation Reduction | 5-7 GB, 55K files | 1800 hours, 20-100 workers | $216 CPU, $1 download, $6 storage
Analysis Reduction | <10 GB, ~1K files | 1800 hours, 20-100 workers | $216 CPU, $2 download, $9 storage
Total | | | $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Storage Partitioning
Understanding partitioning is key to understanding performance
Every data object has a partition key
• Different for each data type (blobs, entities, queues)
• Controls entity locality
Partition key is the unit of scale
• A partition can be served by a single server
• System load balances partitions based on traffic pattern
System load balancing
• Load balancing can take a few minutes to kick in
• Can take a couple of seconds for a partition to become available on a different server
Server Busy
• Use exponential backoff on "Server Busy"
• Returned while the system load balances to meet your traffic needs, or when single-partition limits have been reached
Partition Keys In Each Abstraction
Entities – TableName + PartitionKey
• Entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId) | RowKey (RowKind) | Name | CreditCardNumber | OrderTotal
1 | Customer-John Smith | John Smith | xxxx-xxxx-xxxx-xxxx |
1 | Order – 1 | | | $3512
2 | Customer-Bill Johnson | Bill Johnson | xxxx-xxxx-xxxx-xxxx |
2 | Order – 3 | | | $1000

Blobs – Container name + Blob name
• Every blob and its snapshots are in a single partition

Container Name | Blob Name
image | annarbor/bighouse.jpg
image | foxborough/gillette.jpg
video | annarbor/bighouse.jpg

Messages – Queue Name
• All messages for a single queue belong to the same partition

Queue | Message
jobs | Message1
jobs | Message2
workflow | Message1
Replication Guarantee
• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas in sync
[Diagram: Server 1, Server 2, Server 3, each holding partitions P1, P2, …, Pn]
Scalability Targets
Storage Account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second
Single Queue/Table Partition
• Up to 500 transactions per second
Single Blob Partition
• Throughput up to 60 MB/s
To go above these numbers, partition between multiple storage accounts and partitions
When a limit is hit, the app will see '503 Server Busy'; applications should implement exponential backoff
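The "503 Server Busy" guidance above is usually packaged as a retry wrapper around every storage call. A minimal Python sketch of the idea (the `ServerBusyError` and `flaky` names are illustrative stand-ins, not part of any Azure SDK):

```python
import random
import time

class ServerBusyError(Exception):
    """Stand-in for a 503 Server Busy response."""

def with_backoff(operation, max_retries=6, base_delay=0.1, max_delay=4.0):
    """Retry `operation` with truncated exponential backoff on Server Busy."""
    for attempt in range(max_retries):
        try:
            return operation()
        except ServerBusyError:
            if attempt == max_retries - 1:
                raise
            # Delay doubles each attempt, truncated at max_delay, with jitter
            # so many clients do not retry in lockstep.
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay * random.uniform(0.5, 1.0))

# Example: an operation that fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServerBusyError()
    return "ok"

print(with_backoff(flaky))  # → ok
```

The jitter and the cap matter: without them, a hot partition keeps getting hammered at exactly the moment the system is trying to load balance it.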
Partitions and Partition Ranges

Server A: Table = Movies [Min – Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006
… | … | … | …
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008

Server A: Table = Movies [Min – Comedy)
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Action | Fast & Furious | … | 2009
Action | The Bourne Ultimatum | … | 2007
… | … | … | …
Animation | Open Season 2 | … | 2009
Animation | The Ant Bully | … | 2006

Server B: Table = Movies [Comedy – Max]
PartitionKey (Category) | RowKey (Title) | Timestamp | ReleaseDate
Comedy | Office Space | … | 1999
… | … | … | …
SciFi | X-Men Origins: Wolverine | … | 2009
… | … | … | …
War | Defiance | … | 2008
Key Selection: Things to Consider
Scalability
• PartitionKey is critical for scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
• Maximum of 1000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
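Because any of these limits can end a response early, a range query must loop until no continuation token comes back. A language-neutral sketch, with a hypothetical `query_page(token)` standing in for the real table query call:

```python
def query_all(query_page):
    """Drain a paged query: keep following continuation tokens until done.

    `query_page(token)` is a stand-in for a table query that returns
    (rows, next_token); next_token is None when there is nothing left.
    """
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:   # no continuation token -> all rows fetched
            return rows

# Simulated server: at most 2 "rows" per response (the real limit is 1000).
data = list(range(5))
def fake_page(token):
    start = token or 0
    page = data[start:start + 2]
    nxt = start + 2 if start + 2 < len(data) else None
    return page, nxt

print(query_all(fake_page))  # → [0, 1, 2, 3, 4]
```

Code that treats the first page as the whole result silently drops rows once a table grows past a partition boundary.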
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select PartitionKey and RowKey that help scale
Avoid "append only" patterns
• Distribute load by using a hash etc. as a prefix
Always handle continuation tokens
• Expect continuation tokens for range queries
"OR" predicates are not optimized
• Execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries
• "Server busy" means partitions are being load balanced to meet traffic needs, or the load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
• Tight coupling leads to brittleness
• Loose coupling can aid scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML
• Limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
[Diagram: Web Role → PutMessage → Queue (Msg 1, Msg 2, Msg 3, Msg 4) → Worker Roles: GetMessage (with timeout), then RemoveMessage]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
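The REST exchange above is the get → process → delete lifecycle: GetMessage hides the message for the visibility timeout and hands back a pop receipt, and DeleteMessage must present that receipt. A toy in-memory model of those semantics (this is an illustration of the protocol, not the Azure client library):

```python
import time
import uuid

class MiniQueue:
    """Toy model of queue semantics: get() hides a message for `timeout`
    seconds and returns a pop receipt; delete() must present that receipt."""
    def __init__(self):
        self.messages = []   # list of dicts

    def put(self, text):
        self.messages.append({"text": text, "visible_at": 0.0, "receipt": None})

    def get(self, timeout=30.0):
        now = time.time()
        for m in self.messages:
            if m["visible_at"] <= now:
                m["visible_at"] = now + timeout   # hide for the timeout window
                m["receipt"] = uuid.uuid4().hex   # fresh pop receipt per dequeue
                return m["text"], m["receipt"]
        return None                               # nothing currently visible

    def delete(self, receipt):
        for m in self.messages:
            if m["receipt"] == receipt:
                self.messages.remove(m)
                return True
        return False   # stale receipt: the message was re-dequeued elsewhere

q = MiniQueue()
q.put("work ticket 1")
text, receipt = q.get(timeout=30)
# ... process the work ticket ...
q.delete(receipt)       # the message is gone only after an explicit delete
print(len(q.messages))  # → 0
```

The key property: if the consumer crashes before the delete, the message reappears after the timeout, which is exactly what the poison-message slides below rely on.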
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll increases the polling interval by 2x (truncated at a maximum); a successful poll sets the interval back to 1.
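The polling rule can be stated in a few lines; the `floor` and `ceiling` bounds are assumed parameters for illustration:

```python
def next_poll_interval(current, got_message, floor=1.0, ceiling=60.0):
    """Truncated exponential back-off polling:
    empty poll -> double the interval (up to `ceiling`);
    successful poll -> reset to the minimum interval."""
    if got_message:
        return floor
    return min(current * 2, ceiling)

# Eight empty polls in a row: 1 -> 2 -> 4 -> ... capped at the ceiling.
interval = 1.0
for _ in range(8):
    interval = next_poll_interval(interval, got_message=False)
print(interval)                                        # → 60.0
print(next_poll_interval(interval, got_message=True))  # → 1.0
```

This keeps idle workers from burning storage transactions (each empty GetMessage is billed) while still reacting quickly once traffic resumes.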
Removing Poison Messages
[Diagram: producers P1, P2; consumers C1, C2; queue holding messages with dequeue counts]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages
[Diagram: producers P1, P2; consumers C1, C2]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages
[Diagram: producers P1, P2; consumers C1, C2]
1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
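Step 12 above is the whole trick: threshold the dequeue count so a message that crashes its consumer every time does not loop forever. A sketch with hypothetical `queue_get` and `dead_letter` hooks (not SDK calls):

```python
def handle(queue_get, process, dead_letter, max_dequeue=3):
    """Guard against poison messages: a message whose dequeue count keeps
    growing (its consumer crashes mid-processing every time) must not be
    retried forever. `queue_get` returns (message, dequeue_count) or None."""
    item = queue_get()
    if item is None:
        return "empty"
    message, dequeue_count = item
    if dequeue_count > max_dequeue:
        dead_letter(message)   # park it (e.g., in a blob/table) for inspection
        return "poisoned"
    process(message)
    return "processed"

dead = []
status = handle(lambda: ("bad ticket", 4), lambda m: None, dead.append)
print(status, dead)  # → poisoned ['bad ticket']
```

Parking, rather than deleting, the poison message preserves the evidence you need to debug why processing kept crashing.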
Queues Recap
Make message processing idempotent
• No need to deal with failures
Do not rely on order
• Invisible messages result in out-of-order processing
Use dequeue count to remove poison messages
• Enforce a threshold on a message's dequeue count
Use a blob to store message data, with a reference in the message
• For messages > 8 KB
• Batch messages
• Garbage collect orphaned blobs
Use message count to scale
• Dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
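The slide names .NET-specific tools (I/O completion ports, the .NET 4 Task Parallel Library); the same data-parallel idea, sketched with Python's standard thread pool purely for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def checksum(block):
    """Stand-in for per-item work (data parallelism over independent blocks)."""
    return sum(block) % 251

blocks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]

# Fan the independent blocks out across a pool of workers, keeping the
# pool size near the core count so threads do not oversubscribe the VM.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(checksum, blocks))

print(len(results))  # → 10
```

The design point is the same as on the slide: size the worker pool to the cores you are paying for, and let the scheduler keep them all busy.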
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often leads to savings in other places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
[Diagram: Uncompressed Content → Gzip, Minify JavaScript, Minify CSS, Minify Images → Compressed Content]
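Point 1 is a one-liner in most stacks; a Python illustration of why gzip pays off on repetitive markup:

```python
import gzip

html = b"<html><body>" + b"<p>hello azure</p>" * 200 + b"</body></html>"

# Gzip-compress a response body; all modern browsers accept
# Content-Encoding: gzip and decompress on the fly.
compressed = gzip.compress(html)

print(len(compressed) < len(html))          # → True
assert gzip.decompress(compressed) == html  # lossless round-trip
```

The CPU spent compressing is billed, but for text-heavy responses it is usually far cheaper than the bandwidth it saves.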
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
• Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model
• Web Role + Queue + Worker
• With three special considerations
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow
A simple Split/Join pattern
Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
• NCBI-BLAST overhead
• Data-transfer overhead
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of an instance failure
[Diagram: Splitting task → BLAST tasks … → Merging task]
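The Split/Join pattern above, reduced to its skeleton (the `blast_task` body is a stand-in for the real NCBI-BLAST invocation, and the partition size of 100 echoes the micro-benchmark finding on the next slide):

```python
def split(sequences, partition_size=100):
    """Splitting task: cut the input into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    """Stand-in for one worker's BLAST run over a partition of sequences."""
    return [f"hit:{seq}" for seq in partition]

def merge(partial_results):
    """Merging task: concatenate the per-partition results."""
    return [hit for part in partial_results for hit in part]

sequences = [f"seq{i}" for i in range(250)]
partitions = split(sequences)
results = merge(blast_task(p) for p in partitions)
print(len(partitions), len(results))  # → 3 250
```

In the real system each partition becomes one queue message, so the work tickets, not this driver loop, carry the parallelism.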
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to memory capacity
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
[Diagram: Web Role (Web Portal, Web Service, Job registration, Job Scheduler) → global dispatch queue → Worker instances; Job Management Role (Scaling Engine); Database updating Role; Azure Table (Job Registry, NCBI databases); Azure Blob (BLAST databases, temporary data, etc.); Splitting task → BLAST tasks → Merging task]
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states
[Diagram: Job Portal → Web Service (job registration) → Job Registry; Job Scheduler, Scaling Engine]
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western Europe, and Northern Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances occur, redistribute the load manually
[Diagram: per-deployment instance counts: 50, 62, 62, 62, 62, 62, 50, 62]
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working-instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
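The "look into log data" step amounts to diffing started vs. completed task IDs. A sketch over the failure example above (the log format mirrors the records shown, cleaned up for parsing):

```python
import re

log = """\
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started = set(re.findall(r"Executing the task (\d+)", log))
finished = set(re.findall(r"Execution of task (\d+) is done", log))

# A task that starts but never logs completion points at a failure
# (instance crash, system upgrade, storage error, ...).
print(sorted(started - finished))  # → ['251774']
```

Run over the full logs, this kind of diff is what surfaced the update-domain and fault-domain patterns on the next slides.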
Surviving System Upgrades
North Europe Data Center: in total 34,256 tasks processed
All 62 compute nodes lost tasks and then came back in groups; this is an update domain
• ~30 mins
• ~6 nodes in one group
Surviving Storage Failures
West Europe Datacenter: 30,976 tasks completed, and the job was killed
35 nodes experienced blob-writing failures at the same time
A reasonable guess: the fault domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
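For reference, the Penman-Monteith relation transcribed term by term into code. The default constants follow the slide's variable list (γ ≈ 66 Pa K-1; λv ≈ 2257 J/g is an assumed typical value), and the sample inputs are illustrative only, not calibrated catchment values:

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2257.0):
    """Penman-Monteith ET:
    ET = (delta*Rn + rho_a*c_p*dq*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v)
    delta: d(saturation specific humidity)/dT (Pa/K), r_n: net radiation (W/m^2),
    rho_a: dry air density (kg/m^3), c_p: specific heat of air (J/(kg*K)),
    dq: vapor pressure deficit (Pa), g_a/g_s: air/stomatal conductivities (m/s)."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative midday values, not real catchment data.
et = penman_monteith(delta=145.0, r_n=400.0, rho_a=1.2, c_p=1005.0,
                     dq=800.0, g_a=0.02, g_s=0.01)
print(et > 0)  # → True
```

The point the slide makes stands out in code too: the formula itself is one line, but ga and gs are the hard part, since they must be estimated from imagery, sensors, and models across the whole catchment.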
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction Stage / Derivation Reduction Stage / Reprojection Stage
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables
[Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage>JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue]
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: Service Monitor (Worker Role): Parse & Persist <PipelineStage>TaskStatus → Dispatch → <PipelineStage> Task Queue → GenericWorker (Worker Role) → <Input> Data Storage]
Example Pipeline Stage Reprojection Service
[Diagram: Reprojection Request → Service Monitor (Worker Role): Persist ReprojectionJobStatus, Parse & Persist ReprojectionTaskStatus → Job Queue → Dispatch → Task Queue → GenericWorker (Worker Role) → points to ScanTimeList, SwathGranuleMeta, Reprojection Data Storage]
Each entity specifies a single reprojection job request
Partition Keys In Each Abstraction

• Tables – TableName + PartitionKey: entities with the same PartitionKey value are served from the same partition

PartitionKey (CustomerId)   RowKey (RowKind)       Name          CreditCardNumber      OrderTotal
1                           Customer-John Smith    John Smith    xxxx-xxxx-xxxx-xxxx
1                           Order – 1                                                  $35.12
2                           Customer-Bill Johnson  Bill Johnson  xxxx-xxxx-xxxx-xxxx
2                           Order – 3                                                  $10.00

• Blobs – Container name + Blob name: every blob (and its snapshots) is in a single partition

Container Name   Blob Name
image            annarbor/bighouse.jpg
image            foxborough/gillette.jpg
video            annarbor/bighouse.jpg

• Queues – Queue name: all messages for a single queue belong to the same partition

Queue      Message
jobs       Message1
jobs       Message2
workflow   Message1
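A minimal sketch of how the three abstractions above map data onto partitions. The helper functions are hypothetical illustrations, not part of any Azure SDK:

```python
# Hypothetical helpers showing the partition identity of each abstraction,
# per the table above. Not real SDK calls.

def table_partition(table_name: str, partition_key: str) -> str:
    # All entities sharing a PartitionKey live in one partition.
    return f"{table_name}|{partition_key}"

def blob_partition(container: str, blob_name: str) -> str:
    # A blob and its snapshots form a single partition.
    return f"{container}|{blob_name}"

def queue_partition(queue_name: str) -> str:
    # Every message in a queue shares one partition.
    return queue_name

# Both the Customer row and the Order rows for CustomerId "1"
# map to the same partition, so they can share a transaction:
assert table_partition("Customers", "1") == table_partition("Customers", "1")
```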
Replication Guarantee

• All Azure Storage data exists in three replicas
• Replicas are created as needed
• A write operation is not complete until it has been written to all three replicas
• Reads are only load balanced to replicas that are in sync

(Diagram: partitions P1, P2, …, Pn replicated across Server 1, Server 2, and Server 3.)
Scalability Targets

Storage account
• Capacity – up to 100 TB
• Transactions – up to a few thousand requests per second
• Bandwidth – up to a few hundred megabytes per second

Single queue/table partition
• Up to 500 transactions per second

Single blob partition
• Throughput up to 60 MB/s

To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential back-off.
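The "503 Server Busy" advice can be sketched as a retry loop that doubles its delay after each busy response. `retry_with_backoff` and `do_request` are hypothetical names, not storage client APIs:

```python
# Sketch: exponential back-off when storage returns "503 Server Busy".
# do_request() stands in for one storage call returning (status, body).

def retry_with_backoff(do_request, max_attempts=5, base_delay=0.5):
    """Call do_request() until it stops returning 503, doubling the wait."""
    delay = base_delay
    for _ in range(max_attempts):
        status, body = do_request()
        if status != 503:
            return status, body
        # In real code: time.sleep(delay + random jitter); jitter keeps
        # many clients from retrying in lockstep.
        delay *= 2
    return 503, None
```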
Partitions and Partition Ranges

A single server can serve the entire table:

Server A: Table = Movies [Min – Max]

PartitionKey (Category)   RowKey (Title)             Timestamp   ReleaseDate
Action                    Fast & Furious             …           2009
Action                    The Bourne Ultimatum       …           2007
…                         …                          …           …
Animation                 Open Season 2              …           2009
Animation                 The Ant Bully              …           2006
…                         …                          …           …
Comedy                    Office Space               …           1999
…                         …                          …           …
SciFi                     X-Men Origins: Wolverine   …           2009
…                         …                          …           …
War                       Defiance                   …           2008

Under load, the table is split into partition ranges served by different servers:

Server A: Table = Movies [Min – Comedy)

PartitionKey (Category)   RowKey (Title)             Timestamp   ReleaseDate
Action                    Fast & Furious             …           2009
Action                    The Bourne Ultimatum       …           2007
…                         …                          …           …
Animation                 Open Season 2              …           2009
Animation                 The Ant Bully              …           2006

Server B: Table = Movies [Comedy – Max]

PartitionKey (Category)   RowKey (Title)             Timestamp   ReleaseDate
Comedy                    Office Space               …           1999
…                         …                          …           …
SciFi                     X-Men Origins: Wolverine   …           2009
…                         …                          …           …
War                       Defiance                   …           2008
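The range split above is easy to model: each server owns a half-open range of PartitionKey values, and a lookup walks the sorted split points. The server names are illustrative, not service behavior you can observe directly:

```python
# Sketch of range partitioning as in the Movies example. split_points lists
# the boundaries between ranges; servers[i] owns range i.
import bisect

split_points = ["Comedy"]           # [Min, "Comedy") -> Server A
servers = ["Server A", "Server B"]  # ["Comedy", Max) -> Server B

def server_for(partition_key: str) -> str:
    # bisect_right finds how many boundaries the key is at or past.
    return servers[bisect.bisect_right(split_points, partition_key)]

print(server_for("Action"))   # Server A
print(server_for("Comedy"))   # Server B (boundary key starts the new range)
print(server_for("War"))      # Server B
```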
Key Selection: Things to Consider

Scalability
• PartitionKey is critical for scalability
• Distribute load as much as possible
• Hot partitions can be load balanced

Query efficiency & speed
• Point queries are most efficient
• Parallelize queries
• Avoid frequent large scans

Entity group transactions
• Transactions across a single partition
• Transaction semantics & reduced round trips

See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information.
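One common way to "distribute load as much as possible" is to prefix the natural key with a short, stable hash bucket, so sequential inserts land on different partitions. The helper below is a hypothetical sketch, not an SDK utility:

```python
# Sketch: spread an "append only" key pattern (e.g. date-ordered keys)
# across partitions by prefixing a deterministic hash bucket.
import hashlib

def distributed_partition_key(natural_key: str, buckets: int = 16) -> str:
    """Return e.g. '07-2010-12-09' for natural_key '2010-12-09'."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}-{natural_key}"
```

Range queries then have to fan out across the buckets, which is the trade-off this pattern accepts in exchange for write scalability.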
Expect Continuation Tokens – Seriously

A query returns a continuation token when it hits any of these limits:
• Maximum of 1,000 rows in a response
• The end of a partition range boundary
• Maximum of 5 seconds to execute the query
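Every table client therefore needs a paging loop that keeps issuing the query until the service stops returning a token. `query_page` below is a hypothetical stand-in for the REST call:

```python
# Sketch of the continuation-token paging loop.
# query_page(token) -> (rows, next_token or None) stands in for one
# table query round trip.

def fetch_all(query_page):
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:       # no continuation token: result set complete
            return rows
```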
Tables Recap

• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

Guidance:
• Select a PartitionKey and RowKey that help scale – distribute load by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens – expect them for range queries
• "OR" predicates are not optimized – execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries – "server busy" means the load on a single partition has exceeded the limits; load balance partitions to meet traffic needs
WCF Data Services

• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications

• Want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling can aid scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly used with the work ticket pattern
• Why not simply use a table?
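The work ticket pattern keeps queue messages small: the message is just a "ticket" pointing at the real work item stored in a blob. The sketch below uses in-memory stand-ins for the blob and queue services; the function names are illustrative:

```python
# Work ticket pattern: enqueue a small reference, keep the large payload
# in blob storage, and have the worker dereference the ticket.
from collections import deque

blob_store = {}     # stand-in for blob storage
queue = deque()     # stand-in for an Azure queue

def submit_work(job_id: str, payload: bytes) -> None:
    blob_store[job_id] = payload    # large payload goes to blob storage
    queue.append(job_id)            # only the small ticket is enqueued

def worker_step() -> bytes:
    job_id = queue.popleft()        # GetMessage
    payload = blob_store[job_id]    # fetch the real work via the ticket
    result = payload.upper()        # ... process ...
    del blob_store[job_id]          # garbage-collect the orphaned blob
    return result                   # DeleteMessage would follow here
```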
Queue Terminology
Message Lifecycle

(Diagram: a Web Role calls PutMessage to add Msg 1-4 to a queue; Worker Roles call GetMessage with a visibility timeout to receive a message, and RemoveMessage to delete it once processed.)
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling

Consider a back-off polling approach: each empty poll doubles the polling interval (truncated at a maximum), and a successful poll resets the interval back to 1.
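The policy above reduces to one small function; the 60-second cap is an illustrative choice, not a service constant:

```python
# Truncated exponential back-off for queue polling: double the interval on
# each empty poll, cap it, and reset to 1 on success. Intervals in seconds.

MAX_INTERVAL = 60   # illustrative cap

def next_interval(interval: int, got_message: bool) -> int:
    if got_message:
        return 1                               # reset on a successful poll
    return min(interval * 2, MAX_INTERVAL)     # double, truncated at the cap
```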
Removing Poison Messages

(Diagram: producers P1 and P2 put messages on queue Q; consumers C1 and C2 poll it.)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (2)

(Diagram: producers P1, P2; consumers C1, C2; queue Q.)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (3)

(Diagram: producers P1, P2; consumers C1, C2; queue Q.)
1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
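Steps 11-13 above are the poison-message check: once a message's dequeue count crosses a threshold, delete it instead of reprocessing. A minimal in-memory sketch (the queue and poison store are stand-ins for the real services):

```python
# Poison-message removal via a dequeue-count threshold, mirroring
# steps 11-13 of the scenario above.
from collections import deque

class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0

def get_message(queue, poison, threshold=2):
    """Pop the next message; divert it once DequeueCount exceeds threshold."""
    while queue:
        msg = queue.popleft()
        msg.dequeue_count += 1
        if msg.dequeue_count > threshold:
            poison.append(msg)      # Delete(Q, msg) in the real service;
            continue                # keep a copy aside for inspection
        return msg
    return None
```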
Queues Recap

• Make message processing idempotent – no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers
Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
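The deck's examples assume .NET's Task Parallel Library; an analogous data-parallelism sketch in Python, using only the standard library, looks like this:

```python
# Data parallelism: map a per-item function over a collection using a
# pool of worker threads, keeping all of the VM's cores busy.
from concurrent.futures import ThreadPoolExecutor

def process(item: int) -> int:
    return item * item          # stand-in for real per-item work

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, range(8)))   # fan out, collect in order

print(results)   # [0, 1, 4, 9, 16, 25, 36, 49]
```

For CPU-bound Python work a process pool would be the better fit; the thread pool here keeps the sketch minimal.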
Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• Trade-off between the risk of failure/poor user experience from not having excess capacity and the cost of idling VMs (performance vs. cost)
Storage Costs

• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Saving bandwidth costs often leads to savings in other places:
  • Sending fewer things over the wire often means getting fewer things from storage
  • Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content)
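Point 2, trading compute for bytes on the wire, is easy to see with the standard library's gzip module on a repetitive HTML-like payload:

```python
# Gzip a repetitive payload and compare sizes; markup compresses very well.
import gzip

payload = b"<div class='row'>hello azure</div>" * 200
compressed = gzip.compress(payload)

print(len(payload), len(compressed))        # compressed is far smaller
assert gzip.decompress(compressed) == payload   # lossless round trip
```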
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700-1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input – segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST) – needs special result-reduction processing

Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
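Input segmentation, the pleasingly parallel option above, amounts to chopping a FASTA file into fixed-size partitions of sequences, each becoming one task. The sketch below uses deliberately minimal FASTA parsing; the 100-sequences-per-partition default follows the micro-benchmark result cited later in the deck:

```python
# Query segmentation: split FASTA input into partitions of at most
# per_partition sequences, one partition per parallel BLAST task.

def split_queries(fasta_text: str, per_partition: int = 100):
    sequences, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):            # FASTA header starts a new record
            if current:
                sequences.append("\n".join(current))
            current = [line]
        elif line.strip():
            current.append(line)
    if current:
        sequences.append("\n".join(current))
    # Chop the sequence list into partitions of at most per_partition records.
    return ["\n".join(sequences[i:i + per_partition])
            for i in range(0, len(sequences), per_partition)]
```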
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (ScienceCloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow

A simple split/join pattern: a splitting task fans out into many BLAST tasks, and a merging task joins their results.

• Leverage the multiple cores of one instance
  • Argument "-a" of NCBI-BLAST
  • Set to 1/2/4/8 for the small, medium, large, and extra-large instance sizes
• Task granularity
  • Large partitions → load imbalance
  • Small partitions → unnecessary overheads (NCBI-BLAST startup overhead, data transfer overhead)
  • Best practice: profile with test runs and set the partition size to mitigate the overhead
• Value of visibilityTimeout for each BLAST task
  • Essentially an estimate of the task run time
  • Too small → repeated computation
  • Too large → an unnecessarily long wait in case of instance failure
Micro-Benchmarks Inform Design

• Task size vs. performance
  • Benefit of the warm cache effect
  • 100 sequences per partition is the best choice
• Instance size vs. performance
  • Super-linear speedup with larger worker instances
  • Primarily due to the memory capability
• Task size / instance size vs. cost
  • Extra-large instances generated the best and most economical throughput
  • Fully utilize the resource
AzureBLAST (2)

(Architecture diagram: a Web Role hosts the web portal and web service for job registration; a Job Management Role runs the job scheduler, scaling engine, and database-updating role; worker roles pull splitting, BLAST, and merging tasks from a global dispatch queue; Azure Tables hold the job registry, and Azure Blobs hold the NCBI databases, BLAST databases, temporary data, etc.)
AzureBLAST Job Portal

• An ASP.NET program hosted by a web role instance
  • Submit jobs
  • Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored into the job registry table
  • Fault tolerance: avoid in-memory states

(Diagram: the job portal's web service handles job registration into the job registry; the job scheduler and scaling engine consume registered jobs.)
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

• Blasted ~5,000 proteins (700K sequences)
  • Against all NCBI non-redundant proteins: completed in 30 min
  • Against ~5,000 proteins from another strain: completed in less than 30 sec
• AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs

• Discover the interrelationships of known protein sequences
• An "all against all" query
  • The database is also the input query
  • The protein database is large (42 GB)
  • In total, 9,865,668 sequences to be queried
  • Theoretically, 100 billion sequence comparisons
• Performance estimation
  • Based on sample runs on one extra-large Azure instance
  • Would require 3,216,731 minutes (6.1 years) on one desktop
• Experiments at this scale are usually infeasible for most scientists
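As a sanity check on the figure above, converting 3,216,731 minutes of serial compute into years:

```python
# Quick arithmetic check of the desktop estimate (365-day years).
minutes = 3_216_731
years = minutes / (60 * 24 * 365)
print(round(years, 1))   # 6.1
```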
Our Approach

• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load is imbalanced, redistribute it manually

(Map: instance counts per deployment across the datacenters – 50, 62, 62, 62, 62, 62, 50, 62.)
End Result

• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • But based on our estimates, the real working instance time should be 6-8 days
  • Look into the log data to analyze what took place…
Understanding Azure by analyzing logs

A normal log record should be:

  3/31/2010 6:14  RD00155D3611B0  Executing the task 251523
  3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it took 10.9 mins
  3/31/2010 6:25  RD00155D3611B0  Executing the task 251553
  3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it took 19.3 mins
  3/31/2010 6:44  RD00155D3611B0  Executing the task 251600
  3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

  3/31/2010 8:22  RD00155D3611B0  Executing the task 251774
  3/31/2010 9:50  RD00155D3611B0  Executing the task 251895
  3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it took 82 mins
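The "something is wrong" case can be found mechanically by pairing each "Executing the task N" line with its "Execution of task N is done" line; unpaired tasks never completed. The regexes below assume the log format shown above:

```python
# Find tasks that started but never logged completion.
import re

def unfinished_tasks(log_lines):
    started, done = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            done.add(m.group(1))
    return started - done
```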
Surviving System Upgrades

North Europe datacenter: in total, 34,256 tasks processed
• All 62 compute nodes lost tasks and then came back in groups – this is an update domain
• ~30 mins per group
• ~6 nodes in one group

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed, and the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

    ET = [Δ·Rn + ρa·cp·(δq)·ga] / [(Δ + γ·(1 + ga/gs))·λv]

where
  ET = water volume evapotranspired (m3 s-1 m-2)
  Δ  = rate of change of saturation specific humidity with air temperature (Pa K-1)
  λv = latent heat of vaporization (J/g)
  Rn = net radiation (W m-2)
  cp = specific heat capacity of air (J kg-1 K-1)
  ρa = dry air density (kg m-3)
  δq = vapor pressure deficit (Pa)
  ga = conductivity of air (inverse of ra) (m s-1)
  gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
  γ  = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
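Evaluating the Penman-Monteith expression itself is a one-liner once the inputs are in hand; the hard part, as the slide notes, is estimating the conductivities. The input values below are made-up but physically plausible placeholders, not numbers from the MODISAzure pipeline:

```python
# Direct evaluation of the Penman-Monteith expression above.

def penman_monteith(delta, R_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """ET = [Δ·Rn + ρa·cp·(δq)·ga] / [(Δ + γ·(1 + ga/gs))·λv]"""
    numerator = delta * R_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative inputs only (midday over a vegetated surface, roughly):
et = penman_monteith(delta=145.0, R_n=400.0, rho_a=1.2, c_p=1005.0,
                     dq=1000.0, g_a=0.02, g_s=0.01)
print(et)
```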
ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline

1. Data collection (map) stage
   • Downloads requested input tiles from NASA FTP sites
   • Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
2. Reprojection (map) stage
   • Converts source tile(s) to intermediate-result sinusoidal tiles
   • Simple nearest neighbor or spline algorithms
3. Derivation reduction stage
   • First stage visible to scientists
   • Computes ET in our initial use
4. Analysis reduction stage
   • Optional second stage visible to scientists
   • Enables production of science analysis artifacts such as maps, tables, and virtual sensors

(Diagram: scientists submit requests through the AzureMODIS Service web role portal; a request queue feeds the data collection stage, which pulls from source imagery download sites; download, reprojection, reduction 1, and reduction 2 queues chain the stages, using source metadata; scientists download the science results.)

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)

• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

(Diagram: <PipelineStage> requests enter the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.)
MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role. The Generic Worker (Worker Role):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue, from which Generic Workers dequeue tasks and read <Input> data storage.)
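The Generic Worker's dequeue-and-retry behavior can be sketched as a small loop; the queue and status table below are in-memory stand-ins, and the function names are illustrative:

```python
# Generic Worker loop: dequeue a task, run it, and re-enqueue a failing
# task until it has been attempted 3 times, then mark it failed.
from collections import deque

MAX_ATTEMPTS = 3

def worker_loop(task_queue: deque, run_task, task_status: dict):
    while task_queue:
        task_id, attempts = task_queue.popleft()
        try:
            run_task(task_id)
            task_status[task_id] = "done"
        except Exception:
            if attempts + 1 < MAX_ATTEMPTS:
                task_queue.append((task_id, attempts + 1))  # retry later
            else:
                task_status[task_id] = "failed"
```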
Example Pipeline Stage: Reprojection Service

(Diagram: a reprojection request enters the job queue; the Service Monitor persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the task queue; Generic Workers consume tasks and read the swath source data storage.)

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get the geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation

• Computational costs are driven by the data scale and the need to run the reduction multiple times
• Storage costs are driven by the data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3,500 hours, 20-100 workers – $420 CPU, $60 download
Derivation reduction stage: 5-7 GB, 55K files, 1,800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1,800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure

• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net

• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Replication Guarantee
bull All Azure Storage data exists in three replicasbull Replicas are created as neededbull A write operation is not complete until it has
written to all three replicasbull Reads are only load balanced to replicas in
syncServer 1 Server 2 Server 3
P1
P2
Pn
P1
P2
Pn
P1
P2
Pn
Scalability TargetsStorage Account
bull Capacity ndash Up to 100 TBsbull Transactions ndash Up to a few thousand requests per secondbull Bandwidth ndash Up to a few hundred megabytes per second
Single QueueTable Partition
bull Up to 500 transactions per second
To go above these numbers partition between multiple storage accounts and partitions
When limit is hit app will see lsquo503 server busyrsquo applications should implement exponential backoff
Single Blob Partition
bull Throughput up to 60 MBs
PartitionKey(Category)
RowKey(Title)
Timestamp ReleaseDate
Action Fast amp Furious hellip 2009
Action The Bourne Ultimatum hellip 2007
hellip hellip hellip hellip
Animation Open Season 2 hellip 2009
Animation The Ant Bully hellip 2006
PartitionKey(Category)
RowKey(Title)
Timestamp ReleaseDate
Comedy Office Space hellip 1999
hellip hellip hellip hellip
SciFi X-Men Origins Wolverine hellip 2009
hellip hellip hellip hellip
War Defiance hellip 2008
PartitionKey(Category)
RowKey(Title)
Timestamp ReleaseDate
Action Fast amp Furious hellip 2009
Action The Bourne Ultimatum hellip 2007
hellip hellip hellip hellip
Animation Open Season 2 hellip 2009
Animation The Ant Bully hellip 2006
hellip hellip hellip hellip
Comedy Office Space hellip 1999
hellip hellip hellip hellip
SciFi X-Men Origins Wolverine hellip 2009
hellip hellip hellip hellip
War Defiance hellip 2008
Partitions and Partition Ranges
Server BTable = Movies[Comedy - Max]
Server ATable = Movies[Min - Comedy)
Server ATable = Movies
[Min - Max]
Key Selection Things to Consider
bullDistribute load as much as possiblebullHot partitions can be load balancedbullPartitionKey is critical for scalability
See httpwwwmicrosoftpdccom2009SVC09 and httpazurescopecloudappnet for more information
bull Avoid frequent large scansbull Parallelize queriesbull Point queries are most efficient
bullTransactions across a single partitionbullTransaction semantics amp Reduce round trips
Scalability
Query Efficiency amp Speed
Entity group transactions
Expect Continuation Tokens ndash Seriously
Maximum of 1000 rows in a response
At the end of partition range boundary
Maximum of 1000 rows in a response
At the end of partition range boundary
Maximum of 5 seconds to execute the query
Tables Recapbull Efficient for frequently used queriesbull Supports batch transactionsbull Distributes load
Select PartitionKey and RowKey that help scale
Avoid ldquoAppend onlyrdquo patterns
Always Handlecontinuation tokens
ldquoORrdquo predicates are not optimized
Implement back-offstrategy for retries
bull Distribute by using a hash etc as prefix
bull Expect continuation tokens for range queries
bull Execute the queries that form the ldquoORrdquo predicates as separate queries
bull Server busybull Load balance partitions to meet traffic needsbull Load on single partition has exceeded the limits
WCF Data Services
bull Use a new context for each logical operationbull AddObjectAttachTo can throw exception if entity is already being tracked
bull Point query throws an exception if resource does not exist Use IgnoreResourceNotFoundException
QueuesTheir Unique Role in Building Reliable Scalable Applicationsbull Want roles that work closely together but are not
bound togetherbull Tight coupling leads to brittlenessbull This can aid in scaling and performance
bull A queue can hold an unlimited number of messagesbull Messages must be serializable as XMLbull Limited to 8KB in sizebull Commonly use the work ticket pattern
bull Why not simply use a table
Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout)RemoveMessage
Msg 2Msg 1
Worker Role
Msg 2
POST httpmyaccountqueuecorewindowsnetmyqueuemessages
HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20
ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt
DELETEhttpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a backoff polling approach Each empty poll
increases interval by 2x
A successful sets the interval back to 1
60
21
11
C1
C2
Removing Poison Messages
11
21
340
Producers Consumers
P2
P1
30
2 GetMessage(Q 30 s) msg 2
1 GetMessage(Q 30 s) msg 1
11
21
10
20
61
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
11
21
2 GetMessage(Q 30 s) msg 23 C2 consumed msg 24 DeleteMessage(Q msg 2)7 GetMessage(Q 30 s) msg 1
1 GetMessage(Q 30 s) msg 15 C1 crashed
11
21
6 msg1 visible 30 s after Dequeue30
12
11
12
62
C1
C2
Removing Poison Messages (3)
1. C1: Dequeue(Q, 30 sec) → msg 1
2. C2: Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
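The check in steps 12-13, inspecting the dequeue count before processing, is the core of the poison-message pattern. A minimal sketch (threshold and names are illustrative, not an Azure API):

```python
POISON_THRESHOLD = 3  # arbitrary threshold; tune per workload

def handle(msg, process, dead_letter):
    """Consume one message; divert it if its dequeue count shows it has
    already defeated earlier consumers (the poison-message pattern)."""
    if msg["dequeue_count"] > POISON_THRESHOLD:
        # In a real service, persist to a blob or table for later inspection.
        dead_letter.append(msg)
        return "poisoned"
    process(msg)
    return "processed"
```

Without this guard, a message that crashes every consumer would cycle through the queue forever, each reappearance taking down another worker.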
Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use message count to scale
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
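The size trade-off reduces to simple arithmetic when, as was roughly true of Azure's rate card, an 8-core VM costs the same per hour as eight 1-core VMs. A sketch under that assumption (prices and function names hypothetical):

```python
def pick_vm_strategy(scaling_efficiency_8core):
    """Compare one 8-core VM against eight 1-core VMs (illustrative only).

    Assumes per-core-linear pricing, so both options cost the same per hour
    and throughput per dollar is decided purely by how well the workload
    scales across the 8 cores of the larger VM.

    scaling_efficiency_8core: achieved speedup on 8 cores divided by the
    ideal linear speedup (1.0 = perfectly linear, >1.0 = super-linear)."""
    throughput_large = 8 * scaling_efficiency_8core  # one big VM
    throughput_small = 8 * 1.0                       # eight independent VMs
    return "larger VM" if throughput_large > throughput_small else "more small VMs"
```

Under these assumptions, the larger VM only wins on cost when scaling is better than linear, which matches the slide's observation that this is rare across 8 cores.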
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
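The data-parallel style the Task Parallel Library encourages looks much the same in other languages. A Python sketch using the standard library (for CPU-bound Python work you would swap in ProcessPoolExecutor; the worker function here is a stand-in):

```python
from concurrent.futures import ThreadPoolExecutor
import os

def process_item(x):
    return x * x  # stand-in for real per-item work

def run_data_parallel(items):
    """Data parallelism: apply the same operation to many items, sizing the
    pool to the core count so active workers don't exceed the cores."""
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order in the results
        return list(pool.map(process_item, items))
```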
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off: the risk of failure or poor user experience from not having excess capacity vs. the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Saving bandwidth costs often leads to savings in other places:
• Sending fewer things over the wire often means getting fewer things from storage
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
[Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content]
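Gzipping output is a one-liner in most stacks. A Python sketch using the standard library (the function name is invented; in a real web app the framework usually sets the Content-Encoding header for you):

```python
import gzip

def gzip_payload(text: str) -> bytes:
    """Gzip a response body; browsers decompress transparently when the
    response carries 'Content-Encoding: gzip'. Level 6 is a common balance
    of compute cost vs. size."""
    return gzip.compress(text.encode("utf-8"), compresslevel=6)
```

On repetitive text (HTML, JSON, JavaScript) the compressed payload is typically a small fraction of the original, which saves both bandwidth and storage.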
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input; segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST); needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (ScienceCloud 2010), ACM, 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out into many BLAST tasks, followed by a merging task.
Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes
Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure
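The split/join pattern itself is tiny. A toy sketch (function names invented; a partition size of 100 sequences is used as a plausible default):

```python
def split_query(sequences, partition_size=100):
    """Split step: partition the input sequences into fixed-size chunks;
    each chunk becomes one queued BLAST task."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge_results(per_partition_results):
    """Join step: concatenate per-partition hit lists in partition order."""
    merged = []
    for hits in per_partition_results:
        merged.extend(hits)
    return merged
```

All the real engineering lives around this skeleton: queueing the chunks, surviving worker failures via the visibility timeout, and only running the merge once every partition has reported in.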
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size/instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources
AzureBLAST
[Architecture diagram:
• Web Role: web portal, web service, job registration
• Job Management Role: job scheduler, scaling engine, global dispatch queue
• Worker roles: execute the splitting task, the BLAST tasks, and the merging task
• Database updating role
• Azure Table: job registry
• Azure Blob: NCBI databases, BLAST databases, temporary data, etc.]
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory state
Demonstration
R. palustris as a platform for H2 production
Eric Schadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time.
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sample runs on one extra-large Azure instance
• The full experiment would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists.
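As a quick sanity check on that estimate, converting the sampled minute count into years of serial compute:

```python
minutes = 3_216_731                  # sampled single-desktop estimate from the text
years = minutes / (60 * 24 * 365)    # minutes -> years (525,600 min per year)
# roughly 6.1 years of serial compute for the full all-against-all run
```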
Our Approach
• Allocated a total of ~4,000 compute cores:
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service (roughly 50-62 extra-large instances per deployment)
• Divided the 10 million sequences into multiple segments:
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place
Understanding Azure by Analyzing Logs
A normal log record pair should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise, something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
Surviving System Upgrades
North Europe Data Center: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups (~6 nodes per group, over ~30 mins). This is an update domain at work.
Surviving Storage Failures
West Europe Datacenter: 30,976 tasks were completed and the job was killed.
35 nodes experienced blob-writing failures at the same time; a reasonable guess is that the Fault Domain was at work.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J g-1)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
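Given the inputs, the Penman-Monteith computation itself is a one-liner; the hard part is estimating the conductivities. A sketch of the slide's formulation, with γ ≈ 66 Pa/K and λv ≈ 2450 J/g used only as illustrative defaults:

```python
def penman_monteith_et(delta, R_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET following the slide's formulation.

    delta: d(saturation specific humidity)/dT (Pa/K); R_n: net radiation
    (W/m^2); rho_a: dry air density (kg/m^3); c_p: specific heat of air
    (J/kg/K); dq: vapor pressure deficit (Pa); g_a, g_s: conductivities
    of air and plant stomata (m/s). Defaults are typical textbook values."""
    numerator = delta * R_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

The formula behaves as expected: more net radiation or a larger vapor pressure deficit raises ET, while closing stomata (smaller g_s) lowers it.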
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage:
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Pipeline diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; work flows through the request, download, reprojection, reduction 1, and reduction 2 queues across the data collection, reprojection, derivation reduction, and analysis reduction stages, drawing on source imagery download sites and source metadata, with scientific results available for download.]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door:
• Receives all user requests
• Queues requests to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role:
• Parses all job requests into tasks, the recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables
[Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.]
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role.
• The GenericWorker (Worker Role):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue; GenericWorker instances dequeue tasks and read/write <Input> data storage.]
Example Pipeline Stage: Reprojection Service
[Diagram: a reprojection request is persisted as ReprojectionJobStatus; the Service Monitor (Worker Role) parses it into ReprojectionTaskStatus entities and dispatches work through the job and task queues to GenericWorker (Worker Role) instances.
• Each job-queue entity specifies a single reprojection job request
• Each task-queue entity specifies a single reprojection task (i.e., a single tile)
• The SwathGranuleMeta table is queried for geo-metadata (e.g., boundaries) for each swath tile
• The ScanTimeList table is queried for the list of satellite scan times that cover a target tile
• Workers read from swath source data storage and write to reprojection data storage]
Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run the reduction stages multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage-by-stage (data, time, workers; cost):
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3,500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1,800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1,800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage
Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: "Channel 9 Windows Azure"
Bing: "Windows Azure Platform Training Kit - November Update"
http://research.microsoft.com/azure
xcgngage@microsoft.com
QueuesTheir Unique Role in Building Reliable Scalable Applicationsbull Want roles that work closely together but are not
bound togetherbull Tight coupling leads to brittlenessbull This can aid in scaling and performance
bull A queue can hold an unlimited number of messagesbull Messages must be serializable as XMLbull Limited to 8KB in sizebull Commonly use the work ticket pattern
bull Why not simply use a table
Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout)RemoveMessage
Msg 2Msg 1
Worker Role
Msg 2
POST httpmyaccountqueuecorewindowsnetmyqueuemessages
HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20
ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt
DELETEhttpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a backoff polling approach Each empty poll
increases interval by 2x
A successful sets the interval back to 1
60
21
11
C1
C2
Removing Poison Messages
11
21
340
Producers Consumers
P2
P1
30
2 GetMessage(Q 30 s) msg 2
1 GetMessage(Q 30 s) msg 1
11
21
10
20
61
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
11
21
2 GetMessage(Q 30 s) msg 23 C2 consumed msg 24 DeleteMessage(Q msg 2)7 GetMessage(Q 30 s) msg 1
1 GetMessage(Q 30 s) msg 15 C1 crashed
11
21
6 msg1 visible 30 s after Dequeue30
12
11
12
62
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
12
2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed
1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1
2
6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue
30
13
12
13
Queues Recap
bullNo need to deal with failuresMake messageprocessing idempotent
bull Invisible messages result in out of orderDo not rely on order
bullEnforce threshold on messagersquos dequeue countUse Dequeue count to remove poison messages
bullMessages gt 8KBbullBatch messagesbullGarbage collect orphaned blobs
bullDynamically increasereduce workers
Use blob to storemessage data with
reference in message
Use message countto scale
bullNo need to deal with failures
bull Invisible messages result in out of order
bullEnforce threshold on messagersquos dequeue count
bullDynamically increasereduce workers
Windows Azure Storage TakeawaysData abstractions to build your applications
Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at
httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet
Best Practices
Picking the Right VM Size
bull Having the correct VM size can make a big difference in costs
bull Fundamental choice ndash larger fewer VMs vs many smaller instances
bull If you scale better than linear across cores larger VMs could save you money
bull Pretty rare to see linear scaling across 8 cores
bull More instances may provide better uptime and reliability (more failures needed to take your service down)
bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it
bull Common mistake ndash split up code into multiple roles each not using up CPU
bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest
Exploiting Concurrencybull Spin up additional processes each with a specific task or as a
unit of concurrency
bull May not be ideal if number of active processes exceeds number of cores
bull Use multithreading aggressively
bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads
bull In NET 4 use the Task Parallel Library
bull Data parallelism
bull Task parallelism
Finding Good Code Neighborsbull Typically code falls into one or more of these categories
bull Find code that is intensive with different resources to live togetherbull Example distributed network caches are typically network-
and memory-intensive they may be a good neighbor for storage IO-intensive code
MemoryIntensive
CPUIntensive
Network IO Intensive Storage IO Intensive
Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not
over-scaled)
bull Spinning VMs up and down automatically is good at large scale
bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
bull Being too aggressive in spinning down VMs can result in poor user experience
bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs
Performance Cost
Storage Costs
bullUnderstand an applicationrsquos storage profile and how storage billing works
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web apprsquos billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often lead to savings inother places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web apprsquos performance and user experience
Compressing Content
1Gzip all output content
bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better
compression and freedom from patented algorithms
2Tradeoff compute costs for storage size
3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
GzipMinify JavaScript
Minify CCSMinify Images
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences
Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel
bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing
Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach
to 1TB
bull The output of BLAST is usually 10-100x larger than the input
AzureBLAST
bull Parallel BLAST engine on Azure
bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done
bull Follows the general suggested application model bull Web Role + Queue + Worker
bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific
Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010
AzureBLAST Task-Flow: a simple Split/Join pattern
Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: profile with test runs and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait before another worker picks up the task after an instance failure
[Diagram: Splitting task → BLAST tasks (in parallel) → Merging task]
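The split/join flow above can be sketched as follows; `blast_task` is a hypothetical stand-in for invoking NCBI-BLAST on one partition, not the AzureBLAST code:

```python
from concurrent.futures import ThreadPoolExecutor

def split(sequences, partition_size):
    """Splitting task: cut the input into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    """Hypothetical stand-in for running NCBI-BLAST on one partition."""
    return [f"hit:{seq}" for seq in partition]

def run_query(sequences, partition_size=100):
    """Split/join: fan partitions out to workers, then merge in order."""
    merged = []
    with ThreadPoolExecutor() as pool:
        for result in pool.map(blast_task, split(sequences, partition_size)):
            merged.extend(result)   # map preserves input order
    return merged
```

In AzureBLAST the fan-out happens through a queue and worker instances rather than an in-process pool, but the shape is the same.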
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size/instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources
AzureBLAST
[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine against a Job Registry kept in Azure Tables; a global dispatch queue feeds the Worker roles; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a Database Updating Role keeps the databases fresh; a Splitting task fans out to parallel BLAST tasks whose outputs are combined by a Merging task]
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored into the job registry table (fault tolerance: avoid in-memory state)
[Diagram: the Job Portal's Web Portal and Web Service pass job registrations to the Job Scheduler, which records them in the Job Registry; the Scaling Engine adjusts workers]
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (42 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists
Our Approach
• Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load becomes imbalanced, redistribute it manually
[Map: VM allocation per deployment: 50, 62, 62, 62, 62, 62, 50, 62]
End Result
• Total size of the output is ~230 GB
• 1,764,579,487 total hits
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
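A log scan of this kind is straightforward to script; a hedged sketch assuming the line format shown above:

```python
import re

EXEC = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(log_lines):
    """Return the ids of tasks that were started but never reported done."""
    started, finished = set(), set()
    for line in log_lines:
        m = EXEC.search(line)
        if m:
            started.add(m.group(1))
        m = DONE.search(line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)
```

Tasks that appear in the result either failed, were re-queued after a visibility timeout, or ran on a node that was taken down by an upgrade.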
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in a group: this is an update domain
• ~30 mins per group, ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed before the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" (Irish proverb)
Computing Evapotranspiration (ET)
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air, inverse of ra (m s-1)
gs = conductivity of plant stoma air, inverse of rs (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

Penman-Monteith (1964)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.
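As a sanity check of the formula, a small Python transcription of Penman-Monteith using the variable list above (the default values for γ and λv are illustrative, not taken from the deck):

```python
def penman_monteith(delta, rn, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith ET; arguments follow the variable list above.

    gamma defaults to ~66 Pa/K and lambda_v to ~2450 J/g, both
    illustrative round numbers for near-surface conditions.
    """
    numerator = delta * rn + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator
```

Plugging in plausible magnitudes confirms the expected behavior: ET grows with net radiation Rn and shrinks as stomatal conductivity gs closes down.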
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Pipeline diagram: Scientists submit requests through the AzureMODIS Service Web Role Portal; a Request Queue feeds the Download Queue of the Data Collection Stage, which pulls from Source Imagery Download Sites using Source Metadata; the Reprojection Queue feeds the Reprojection Stage; the Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages; science results are available for Scientific Results Download]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, the recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and places the job on the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches work to the <PipelineStage> Task Queue]
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Generic Worker (Worker Role)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status
[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; Generic Worker (Worker Role) instances dequeue tasks and read <Input>Data Storage]
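The dequeue-and-retry behavior can be sketched generically (a simplified stand-in, not the MODISAzure code; `handler` is whatever work the task requires):

```python
def process_with_retries(task, handler, max_attempts=3):
    """Try a task up to max_attempts times, then report it as failed.

    Mirrors the Generic Worker policy above: transient failures are
    retried, persistent failures are surfaced in the task status.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return ("done", handler(task), attempt)
        except Exception:
            continue  # transient failure: try again
    return ("failed", None, max_attempts)
```

In the real system the retry count and final status would be persisted to the task status table so a job is never silently lost.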
Example Pipeline Stage: Reprojection Service
[Diagram: a Reprojection Request reaches the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) and parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e. a single tile); tasks are dispatched through the Job Queue and Task Queue to Generic Worker (Worker Role) instances; workers query the SwathGranuleMeta table for geo-metadata (e.g. boundaries) for each swath tile and the ScanTimeList table for the list of satellite scan times that cover a target tile, then read Swath Source Data Storage and write Reprojection Data Storage]
Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run the reduction stages multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload + $450 storage
Reprojection Stage: 400 GB, 45K files, 3,500 CPU hours, 20-100 workers: $420 CPU + $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1,800 CPU hours, 20-100 workers: $216 CPU + $1 download + $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1,800 CPU hours, 20-100 workers: $216 CPU + $2 download + $9 storage

Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press / Programming Windows Azure, O'Reilly Press / Bing: Channel 9 Windows Azure / Bing: Windows Azure Platform Training Kit (November Update) / http://research.microsoft.com/azure / xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Partitions and Partition Ranges

A single server can hold the entire table:

Server A: Table = Movies [Min - Max]
PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006
…                       | …                        | …         | …
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008

Or the partition range can be split across two servers:

Server A: Table = Movies [Min - Comedy)
PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Action                  | Fast & Furious           | …         | 2009
Action                  | The Bourne Ultimatum     | …         | 2007
…                       | …                        | …         | …
Animation               | Open Season 2            | …         | 2009
Animation               | The Ant Bully            | …         | 2006

Server B: Table = Movies [Comedy - Max]
PartitionKey (Category) | RowKey (Title)           | Timestamp | ReleaseDate
Comedy                  | Office Space             | …         | 1999
…                       | …                        | …         | …
SciFi                   | X-Men Origins: Wolverine | …         | 2009
…                       | …                        | …         | …
War                     | Defiance                 | …         | 2008
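Range partitioning of this sort can be sketched with a sorted boundary list (illustrative only, not the actual Azure partition master):

```python
import bisect

def partition_server(key, boundaries, servers):
    """Map a PartitionKey to the server owning its range.

    boundaries holds the lowest key of each server after the first:
    ["Comedy"] splits [Min, Max] into [Min, "Comedy") on servers[0]
    and ["Comedy", Max] on servers[1], matching the tables above.
    """
    return servers[bisect.bisect_right(boundaries, key)]
```

Rebalancing a hot range is then just inserting a new boundary and assigning the new sub-range to another server.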
Key Selection: Things to Consider
Scalability
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
Query Efficiency & Speed
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
Entity group transactions
• Transactions across a single partition
• Transaction semantics; reduce round trips
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
Expect Continuation Tokens – Seriously
A query returns a continuation token when it hits:
• A maximum of 1,000 rows in a response
• The end of a partition range boundary
• A maximum of 5 seconds of query execution time
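A drain loop that honors continuation tokens might look like this sketch; `query_page` is a hypothetical stand-in for the real table client:

```python
def query_all(query_page):
    """Drain a table query by following continuation tokens.

    query_page(token) -> (rows, next_token); next_token is None on
    the last page, mirroring the 1,000-row page limit above.
    """
    rows, token = [], None
    while True:
        page, token = query_page(token)
        rows.extend(page)
        if token is None:
            return rows
```

The point of the slide is that this loop is mandatory: even a small result set can return a token at a partition range boundary.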
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load

• Select PartitionKey and RowKey that help scale
• Avoid "append only" patterns: distribute by using a hash etc. as a prefix
• Always handle continuation tokens: expect them for range queries
• "OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries: "server busy" means the load on a single partition has exceeded the limits and partitions are being load balanced to meet traffic needs

WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• You want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling through queues can aid scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?
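The work-ticket pattern mentioned above can be sketched with in-memory stand-ins for the blob and queue services (names hypothetical):

```python
import uuid

blob_store = {}   # stand-in for the blob service
queue = []        # stand-in for the queue service

def enqueue_work(payload: bytes):
    """Park the (possibly >8 KB) payload in blob storage and enqueue
    only a small ticket that references it."""
    blob_name = str(uuid.uuid4())
    blob_store[blob_name] = payload
    queue.append({"blob": blob_name})        # the ticket stays tiny

def dequeue_work() -> bytes:
    """Worker side: pop a ticket, fetch and release the payload."""
    ticket = queue.pop(0)
    return blob_store.pop(ticket["blob"])    # garbage-collect the blob
```

This keeps every queue message well under the 8 KB limit regardless of payload size.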
Queue Terminology

Message Lifecycle
[Diagram: a Web Role calls PutMessage to place Msg 1-4 on the Queue; Worker Roles call GetMessage (with a visibility timeout) to receive messages and RemoveMessage to delete them once processed]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x
• A successful poll resets the interval back to 1
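The polling policy fits in one function; a sketch (interval units are arbitrary, and the ceiling of 60 is an illustrative cap):

```python
def next_interval(current, got_message, floor=1, ceiling=60):
    """Truncated exponential back-off: double on an empty poll,
    reset to the floor on a successful one, cap at the ceiling."""
    if got_message:
        return floor
    return min(current * 2, ceiling)
```

Without the cap, an idle consumer would eventually poll so rarely that it reacts far too slowly when work arrives; without the reset, a busy queue would still be polled at the backed-off rate.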
Removing Poison Messages
[Diagrams: producers P1 and P2 feed a queue consumed by C1 and C2; each message carries a dequeue count]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after its dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
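A consumer-side guard based on the dequeue count, as a sketch (in the real service, DequeueCount is maintained by the queue; the dict here is a stand-in for a message):

```python
def handle(msg, handler, max_dequeue=3):
    """Poison-message guard: if a message has already been dequeued
    max_dequeue times without being deleted, drop it rather than
    letting it crash yet another consumer."""
    msg["dequeue_count"] += 1
    if msg["dequeue_count"] > max_dequeue:
        return "deleted-as-poison"
    handler(msg)
    return "processed"
```

A production variant would also copy the poison message to a dead-letter store for later inspection instead of discarding it outright.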
Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages larger than 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages
Easy to use via the Storage Client Library

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• The fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• A common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
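The deck recommends .NET's Task Parallel Library; the same data-parallel shape in Python, as an illustrative sketch (a pool sized to the core count rather than one worker per item):

```python
from concurrent.futures import ThreadPoolExecutor
import os

def parallel_map(fn, items):
    """Data parallelism: apply fn to every item using a pool sized
    to the machine's core count, mirroring the TPL guidance above."""
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))
```

For CPU-bound Python work a process pool would be the better fit; the thread pool here just illustrates the pattern of matching concurrency to cores.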
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• The trade-off is between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs: performance vs. cost
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
  • Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1Gzip all output content
bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better
compression and freedom from patented algorithms
2Tradeoff compute costs for storage size
3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
GzipMinify JavaScript
Minify CCSMinify Images
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences
Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel
bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing
Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach
to 1TB
bull The output of BLAST is usually 10-100x larger than the input
AzureBLAST
bull Parallel BLAST engine on Azure
bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done
bull Follows the general suggested application model bull Web Role + Queue + Worker
bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific
Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010
AzureBLAST Task-FlowA simple SplitJoin pattern
Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size
Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead
Best Practice test runs to profiling and set size to mitigate the overhead
Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best
choice
Instance size vs Performancebull Super-linear speedup with larger size
worker instancesbull Primarily due to the memory capability
Task SizeInstance Size vs Costbull Extra-large instance generated the best
and the most economical throughputbull Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Blast databases temporary data etc)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs
AuthenticationAuthorization based on Live ID
The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory
states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW
Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less
than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences
ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried
bull Theoretically 100 billion sequence comparisons
Performance estimationbull Based on the sampling-running on one extra-large Azure
instancebull Would require 3216731 minutes (61 years) on one desktop
This scale of experiments usually are infeasible to most scientists
Our Approachbull Allocated a total of ~4000 instances
bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe
bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service
bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions
bull When load imbalances redistribute the load manually
50
6262 62
6262
5062
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks persisted in Tables
<PipelineStage> Request
… <PipelineStage>JobStatus
Persist <PipelineStage>Job Queue
MODISAzure Service (Web Role)
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
…
Dispatch <PipelineStage>Task Queue
MODISAzure Architectural Big Picture (2/2)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
GenericWorker (Worker Role)
…
…
Dispatch <PipelineStage>Task Queue
…
<Input>Data Storage
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
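The GenericWorker's dequeue-and-retry behavior described above can be modeled simply (illustrative Python, not the MODISAzure source; the 3-retry limit follows the slide, everything else is a hypothetical stand-in):

```python
MAX_RETRIES = 3

def run_task(task, attempts, worker):
    """Try a task up to MAX_RETRIES times, recording status per attempt."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            result = worker(task)
            attempts.append((task, attempt, "ok"))
            return result
        except Exception:
            attempts.append((task, attempt, "failed"))
    return None  # task remains marked failed after MAX_RETRIES tries

calls = {"n": 0}
def flaky(task):
    """Simulated worker that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return f"{task} done"

history = []
result = run_task("reproject tile 42", history, flaky)
```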
Example Pipeline Stage Reprojection Service
Reprojection Request …
Service Monitor (Worker Role)
ReprojectionJobStatus Persist
Parse & Persist ReprojectionTaskStatus
GenericWorker (Worker Role)
…
Job Queue
…
Dispatch
Task Queue
Points to
…
ScanTimeList
SwathGranuleMeta / Reprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (i.e., a single tile)
Query this table to get geo-metadata (e.g., boundaries) for each swath tile
Query this table to get the list of satellite scan times that cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reductions multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
Data collection: 400-500 GB, 60K files; 10 MB/sec, 11 hours, <10 workers; $50 upload + $450 storage
Reprojection: 400 GB, 45K files; 3,500 CPU hours, 20-100 workers; $420 CPU + $60 download
Derivation reduction: 5-7 GB, 55K files; 1,800 CPU hours, 20-100 workers; $216 CPU + $1 download + $6 storage
Analysis reduction: <10 GB, ~1K files; 1,800 CPU hours, 20-100 workers; $216 CPU + $2 download + $9 storage
AzureMODIS Service Web Role Portal
Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press; Programming Windows Azure, O'Reilly Press; Bing: Channel 9 Windows Azure; Bing: Windows Azure Platform Training Kit – November Update; http://research.microsoft.com/azure; xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Key Selection Things to Consider
• Distribute load as much as possible
• Hot partitions can be load balanced
• PartitionKey is critical for scalability
See http://www.microsoftpdc.com/2009/SVC09 and http://azurescope.cloudapp.net for more information
• Avoid frequent large scans
• Parallelize queries
• Point queries are most efficient
• Transactions across a single partition
• Transaction semantics & reduced round trips
Scalability
Query Efficiency & Speed
Entity group transactions
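The "distribute by hash prefix" advice for PartitionKey selection can be sketched as follows (illustrative Python, not the Azure SDK; the function and bucket count are hypothetical):

```python
import hashlib

def make_partition_key(user_id: str, buckets: int = 16) -> str:
    """Prefix the natural key with a short hash bucket so writes
    spread across partitions instead of hitting one hot partition."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}_{user_id}"

# Sequential IDs land in different partitions rather than one hot one:
prefixes = {make_partition_key(f"user{i}").split("_")[0] for i in range(100)}
```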
Expect Continuation Tokens – Seriously
• Maximum of 1,000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
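The continuation-token handling described above boils down to a drain-the-token loop. A minimal sketch (generic Python against a hypothetical `query_page` helper, not the real Storage Client Library):

```python
def query_all(query_page, query):
    """Keep issuing the query until the service stops returning
    a continuation token; each page holds at most 1000 rows."""
    results, token = [], None
    while True:
        page, token = query_page(query, continuation=token)
        results.extend(page)
        if token is None:  # no token means the result set is complete
            return results

# Fake paged backend for illustration: 2500 rows, 1000 per page.
def fake_page(query, continuation=None):
    start = continuation or 0
    rows = list(range(start, min(start + 1000, 2500)))
    nxt = start + 1000 if start + 1000 < 2500 else None
    return rows, nxt
```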
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Select PartitionKey and RowKey that help scale: distribute by using a hash etc. as a prefix
Avoid "append only" patterns
Always handle continuation tokens: expect continuation tokens for range queries
"OR" predicates are not optimized: execute the queries that form the "OR" predicates as separate queries
Implement a back-off strategy for retries on "server busy": load balance partitions to meet traffic needs when load on a single partition has exceeded the limits
WCF Data Services
• Use a new context for each logical operation
• AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
• Tight coupling leads to brittleness; loose coupling aids scaling and performance
• A queue can hold an unlimited number of messages
• Messages must be serializable as XML and are limited to 8 KB in size
• Commonly use the work ticket pattern
• Why not simply use a table?
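The work ticket pattern mentioned above keeps the queue message small: it carries only a ticket (an ID pointing at blob or table data), never the payload itself. A sketch with plain Python stand-ins (not Azure APIs):

```python
import json
import queue

blob_store = {}             # stand-in for blob storage
work_queue = queue.Queue()  # stand-in for an Azure queue

def submit(job_id: str, payload: bytes) -> None:
    """Store the large payload in blob storage; enqueue a small ticket."""
    blob_store[job_id] = payload
    ticket = json.dumps({"job_id": job_id})  # well under the 8 KB limit
    work_queue.put(ticket)

def process_next():
    """Dequeue a ticket and fetch the real payload by reference."""
    ticket = json.loads(work_queue.get())
    data = blob_store[ticket["job_id"]]
    return ticket["job_id"], len(data)
```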
Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout) / RemoveMessage
Msg 2, Msg 1
Worker Role
Msg 2
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll increases the polling interval by 2x, truncated at a maximum; a successful poll sets the interval back to 1.
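Truncated exponential back-off polling, sketched (illustrative Python; the 60-second cap is an assumption, not from the deck):

```python
def next_interval(current: float, got_message: bool, cap: float = 60.0) -> float:
    """Empty poll: double the wait, truncated at `cap` seconds.
    Successful poll: reset to 1 second."""
    if got_message:
        return 1.0
    return min(current * 2, cap)

# Simulate a run of empty polls followed by a hit:
interval = 1.0
for _ in range(8):  # 8 empty polls: 2, 4, 8, 16, 32, 60, 60, 60
    interval = next_interval(interval, got_message=False)
interval = next_interval(interval, got_message=True)  # back to 1.0
```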
Removing Poison Messages
Producers: P1, P2. Consumers: C1, C2.
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (2)
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (3)
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
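The DequeueCount guard that finally removes the poison message can be sketched like this (plain Python model of the queue semantics, not the Azure SDK; the threshold of 2 follows the slide, the dead-letter list is a hypothetical stand-in):

```python
POISON_THRESHOLD = 2

def handle(message, dequeue_count, process, dead_letter):
    """Dead-letter messages that have been dequeued too often;
    otherwise try to process them."""
    if dequeue_count > POISON_THRESHOLD:
        dead_letter.append(message)  # take the poison message out of rotation
        return "dead-lettered"
    process(message)
    return "processed"

dead = []

def crashy(msg):
    raise RuntimeError("simulated consumer crash")

# Third delivery of the same message exceeds the threshold,
# so the crashing handler is never invoked again:
status = handle("msg 1", dequeue_count=3, process=crashy, dead_letter=dead)
```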
Queues Recap
• Make message processing idempotent: no need to deal with failures
• Do not rely on order: invisible messages result in out-of-order delivery
• Use dequeue count to remove poison messages: enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data with a reference in the message; batch messages; garbage collect orphaned blobs
• Use message count to scale: dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
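The data-parallelism point above, sketched with a worker pool sized to the core count (the deck's context is .NET's Task Parallel Library; this Python version is an analogous illustration, not that API):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_tile(tile_id: int) -> int:
    """Stand-in for per-item work (e.g., reprojecting one tile)."""
    return tile_id * tile_id

# Size the pool to the available cores so the number of active
# workers does not exceed them.
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    results = list(pool.map(process_tile, range(10)))
```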
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• Trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs
Performance vs. Cost
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile: e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
Uncompressed content becomes compressed content via: Gzip, minified JavaScript, minified CSS, minified images
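The compute-for-storage trade in point 2 is easy to see with Python's built-in gzip module (illustrative; the sample page is made up):

```python
import gzip

# Repetitive markup (like HTML/CSS/JS) compresses very well.
page = b"<html><body>" + b"<div class='row'>hello</div>" * 500 + b"</body></html>"
packed = gzip.compress(page)

# Serving `packed` with "Content-Encoding: gzip" saves both
# bandwidth and storage at the cost of a little CPU.
ratio = len(packed) / len(page)
```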
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700 to 1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
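Query segmentation (splitting the input FASTA into independent partitions) is the data-parallel pattern described above. A minimal sketch (illustrative Python; the partition size of 100 echoes the micro-benchmark result later in the deck, everything else is hypothetical):

```python
def split_fasta(text: str, per_partition: int = 100):
    """Split a FASTA string into partitions of `per_partition` sequences;
    each partition can be queried against the database independently."""
    records = [">" + r for r in text.split(">") if r.strip()]
    return [records[i:i + per_partition]
            for i in range(0, len(records), per_partition)]

# 250 toy sequences -> 3 partitions: 100 + 100 + 50
fasta = "".join(f">seq{i}\nACGT\n" for i in range(250))
parts = split_fasta(fasta)
```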
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
• split the input sequences
• query partitions in parallel
• merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With special considerations: batch job management; task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010
AzureBLAST Task-Flow: a simple split/join pattern
Leverage the multiple cores of one instance:
• the "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity:
• Large partitions: load imbalance
• Small partitions: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task:
• essentially an estimate of the task run time
• too small: repeated computation
• too large: unnecessarily long wait in case of instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size / instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
…
Scaling Engine
(BLAST databases, temporary data, etc.)
Job Registry / NCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table:
• Fault tolerance: avoid in-memory state
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
BLASTed ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sampling runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working-instance time should be 6-8 days
• Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
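Spotting an "Executing" line with no matching "done" line, as in the second excerpt, is a simple log scan (illustrative Python; the helper name is hypothetical):

```python
import re

def unfinished_tasks(log_lines):
    """Return task IDs that were started but never reported done."""
    started, done = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            done.add(m.group(1))
    return started - done

log = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
missing = unfinished_tasks(log)  # task 251774 never completed
```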
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in a group: this is an update domain
• ~30 mins
• ~6 nodes in one group
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed before the job was killed
35 nodes experienced blob-writing failures at the same time
A reasonable guess: the Fault Domain is working
Expect Continuation Tokens – Seriously
• Maximum of 1,000 rows in a response
• At the end of a partition range boundary
• Maximum of 5 seconds to execute the query
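The rule above can be sketched in Python. This is a toy stand-in, not the real storage SDK: `query_segment` imitates a service that returns at most 1,000 rows plus a continuation token, and `query_all` shows the loop client code must always run.

```python
def query_segment(rows, token=0, page_size=1000):
    """Stand-in for a table query: one page of results plus a token
    for the next page, or None when the result set is exhausted."""
    page = rows[token:token + page_size]
    next_token = token + page_size if token + page_size < len(rows) else None
    return page, next_token

def query_all(rows):
    """Drain every page -- the pattern client code must always follow,
    since any query can come back with a continuation token."""
    results, token = [], 0
    while token is not None:
        page, token = query_segment(rows, token)
        results.extend(page)
    return results
```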
Tables Recap
• Efficient for frequently used queries
• Supports batch transactions
• Distributes load
Guidance:
• Select a PartitionKey and RowKey that help you scale; distribute by using a hash etc. as a prefix
• Avoid "append only" patterns
• Always handle continuation tokens; expect them for range queries
• "OR" predicates are not optimized; execute the queries that form the "OR" predicates as separate queries
• Implement a back-off strategy for retries; "server busy" means the load on a single partition has exceeded the limits, and partitions are load balanced to meet traffic needs
WCF Data Services
• Use a new context for each logical operation; AddObject/AttachTo can throw an exception if the entity is already being tracked
• A point query throws an exception if the resource does not exist; use IgnoreResourceNotFoundException
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling through a queue aids scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML and are limited to 8 KB in size
  • Commonly used with the work ticket pattern
• Why not simply use a table?
Queue Terminology
Message Lifecycle
[diagram: a Web Role calls PutMessage to add Msg 1-4 to the Queue; Worker Roles call GetMessage (with a visibility timeout) to receive messages and RemoveMessage to delete them once processed]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
• Consider a back-off polling approach: each empty poll increases the interval by 2x
• A successful poll sets the interval back to 1
[diagram: consumers C1 and C2 polling the queue at intervals growing from 1 up to a cap of 60]
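The back-off rule above is small enough to sketch directly. `poll_intervals` is a hypothetical helper, not an Azure API: it traces the polling interval after each poll outcome (True means a message was received), doubling on empty polls and truncating at a 60-second cap.

```python
def poll_intervals(outcomes, minimum=1, maximum=60):
    """Trace the interval after each poll.

    outcomes: sequence of booleans, True = the poll returned a message.
    Empty poll -> double the interval (truncated at `maximum`);
    successful poll -> reset to `minimum`.
    """
    interval, trace = minimum, []
    for got_message in outcomes:
        interval = minimum if got_message else min(interval * 2, maximum)
        trace.append(interval)
    return trace
```

Three empty polls followed by a hit give intervals 2, 4, 8, then a reset to 1; a long idle stretch pins the interval at the 60-second cap instead of growing without bound.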
Removing Poison Messages
11
21
340
Producers Consumers
P2
P1
30
2 GetMessage(Q 30 s) msg 2
1 GetMessage(Q 30 s) msg 1
11
21
10
20
61
C1
C2
Removing Poison Messages (2)
[diagram continues: producers P1, P2; consumers C1, C2]
1. GetMessage(Q, 30 s) → msg 1 (C1)
2. GetMessage(Q, 30 s) → msg 2 (C2)
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1 (C2)
Removing Poison Messages (3)
[diagram continues: producers P1, P2; consumers C1, C2]
1. Dequeue(Q, 30 s) → msg 1 (C1)
2. Dequeue(Q, 30 s) → msg 2 (C2)
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. Dequeue(Q, 30 s) → msg 1 (C2)
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. Dequeue(Q, 30 s) → msg 1 (C1)
12. DequeueCount > 2
13. Delete(Q, msg 1)
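The dequeue-count threshold in step 12 can be sketched as follows. `handle`, `process`, `delete`, and `dead_letter` are hypothetical stand-ins rather than Azure SDK calls, and the service-maintained DequeueCount is modeled as a plain dict field.

```python
MAX_DEQUEUE = 2  # threshold from the scenario above

def handle(msg, process, delete, dead_letter):
    """Process one dequeued message, removing it if it is poison.

    msg: dict with the service-maintained 'dequeue_count' property.
    A message seen more than MAX_DEQUEUE times is assumed to be
    crashing its consumers, so it is parked and deleted instead of
    being processed again.
    """
    if msg["dequeue_count"] > MAX_DEQUEUE:
        dead_letter(msg)   # park it somewhere for offline inspection
        delete(msg)        # so it stops reappearing on the queue
        return
    process(msg)
    delete(msg)            # only after successful processing
```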
Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages can result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use DequeueCount to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
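The "messages > 8 KB" tip above can be sketched with a plain list standing in for the queue and a dict standing in for the blob container (neither is the real Azure API): oversized payloads go to blob storage and the queue carries only a work ticket referencing them.

```python
import uuid

LIMIT = 8 * 1024  # the 8 KB queue message limit

def put_work_item(queue, blobs, payload):
    """Enqueue a payload directly, or via a blob reference if too big."""
    if len(payload) <= LIMIT:
        queue.append(payload)
    else:
        name = str(uuid.uuid4())
        blobs[name] = payload              # store the real data in a blob
        queue.append("blob:" + name)       # work ticket referencing it

def get_payload(msg, blobs):
    """Resolve a dequeued message back to its payload."""
    return blobs[msg[5:]] if msg.startswith("blob:") else msg
```

Garbage collection of orphaned blobs (the last sub-tip) would be a separate sweep over `blobs` for names no longer referenced by any queue message.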
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • It is pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using much CPU
• Balance using up CPU vs. keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
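The data-parallelism point can be illustrated with Python's standard-library pool in place of the .NET Task Parallel Library; `parallel_map` is a hypothetical helper, not part of any Azure SDK, showing the idea of keeping roughly one worker per core busy inside a single role instance.

```python
from concurrent.futures import ThreadPoolExecutor
import os

def parallel_map(func, items, workers=None):
    """Apply func to each item using a pool sized to the core count.

    Keeping the pool near the number of cores is the same rule as
    'may not be ideal if active processes exceed the number of cores'.
    """
    workers = workers or os.cpu_count() or 4
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))
```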
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure or poor user experience from lacking excess capacity against the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Saving bandwidth costs often leads to savings in other places
  • Sending fewer things over the wire often means getting fewer things from storage
  • Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade compute costs for storage size: turn uncompressed content into compressed content by gzipping and minifying JavaScript, and minifying CSS and images
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
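Tip 1 can be demonstrated with Python's standard-library gzip. The sample HTML below is made up, but repetitive markup like this is exactly the kind of output that compresses well.

```python
import gzip

def compress(body: bytes) -> bytes:
    """Gzip a response body before it leaves the VM."""
    return gzip.compress(body)

# Illustrative payload: 1,000 copies of a small markup fragment.
html = b"<div class='row'>cell</div>" * 1000
small = compress(html)
ratio = len(small) / len(html)   # bytes actually sent / bytes generated
```

Every modern browser advertises `Accept-Encoding: gzip`, so the decompression cost lands on the client; the server trades a little CPU for a large cut in bandwidth and storage billed.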
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
• Computationally intensive
  • Large number of pairwise alignment operations
  • A single BLAST run can take 700-1,000 CPU hours
  • Sequence databases are growing exponentially; GenBank has doubled in size in about 15 months
Opportunities for Cloud Computing
• It is easy to parallelize BLAST
  • Segment the input: segment processing (querying) is pleasingly parallel
  • Segment the database (e.g., mpiBLAST): needs special result-reduction processing
• Large volume of data
  • A normal BLAST database can be as large as 10 GB
  • With 100 nodes, peak storage traffic could reach 1 TB
  • The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern: a splitting task produces partitions, BLAST tasks process them in parallel, and a merging task combines the results.
• Leverage the multiple cores of one instance
  • Argument "-a" of NCBI-BLAST: 1/2/4/8 for small, medium, large, and extra-large instance sizes
• Task granularity
  • Too large a partition: load imbalance
  • Too small a partition: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
  • Best practice: use test runs to profile, and set the size to mitigate the overhead
• Value of visibilityTimeout for each BLAST task
  • Essentially an estimate of the task run time
  • Too small: repeated computation
  • Too large: unnecessarily long wait in case of instance failure
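The split/join flow above can be sketched as plain functions. `blast_task` is a stand-in for invoking NCBI BLAST on one partition, and the 100-sequence partition size echoes the micro-benchmark result elsewhere in this deck.

```python
def split(sequences, per_partition=100):
    """Splitting task: fixed-size partitions of the input sequences."""
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]

def blast_task(partition):
    """Stand-in for running NCBI BLAST on one partition of queries."""
    return [f"hit:{seq}" for seq in partition]

def merge(results):
    """Merging task: join per-partition outputs in input order."""
    return [hit for part in results for hit in part]
```

In AzureBLAST the partitions would be dispatched through the queue to worker instances rather than processed in a local loop.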
Micro-Benchmarks Inform Design
• Task size vs. performance
  • Benefit of the warm cache effect
  • 100 sequences per partition is the best choice
• Instance size vs. performance
  • Super-linear speedup with larger worker instances
  • Primarily due to the memory capability
• Task size / instance size vs. cost
  • Extra-large instances generated the best and the most economical throughput
  • Fully utilize the resource
AzureBLAST
[diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine against the Job Registry (an Azure Table); a global dispatch queue feeds Worker Roles running the splitting task, the parallel BLAST tasks, and the merging task; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a Database Updating Role refreshes the NCBI databases]
AzureBLAST Job Portal
• ASP.NET program hosted by a web role instance
  • Submit jobs
  • Track job status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory state
[diagram: Job Portal → Web Service → job registration → Job Registry; the Job Scheduler and Scaling Engine consume registered jobs]
Demonstration
R. palustris as a Platform for H2 Production
Eric Schadt (SAGE) and Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time.
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
• "All against all" query
  • The database is also the input query
  • The protein database is large (4.2 GB)
  • In total, 9,865,668 sequences to be queried
  • Theoretically, 100 billion sequence comparisons
• Performance estimation
  • Based on sampling runs on one extra-large Azure instance
  • Would require 3,216,731 minutes (6.1 years) on one desktop
• Experiments at this scale are usually infeasible for most scientists
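A quick sanity check of the single-desktop estimate (the minute count is from the slide; only the unit conversion is shown):

```python
# 3,216,731 desktop-minutes expressed in years.
minutes_on_one_desktop = 3_216_731
years = minutes_on_one_desktop / (60 * 24 * 365)   # minutes -> years
```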
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
[diagram: deployments of 50-62 VMs each across the four datacenters]
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
  • Based on our estimates, real working instance time should be 6-8 days
  • Look into the log data to analyze what took place
Understanding Azure by Analyzing Logs
A normal log record pairs an "Executing" entry with a completion entry:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
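The pairing check described above can be sketched with two regular expressions: any task that logged "Executing" but never logged a matching "is done" was lost, for example to a node failure or upgrade. The log lines here follow the slide's format.

```python
import re

def incomplete_tasks(lines):
    """Return task ids that started but never logged completion."""
    started, finished = set(), set()
    for line in lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return started - finished
```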
Surviving System Upgrades
North Europe datacenter: in total, 34,256 tasks processed.
All 62 compute nodes lost tasks and then came back in groups; this is an update domain at work:
• ~30 mins per group
• ~6 nodes in one group
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed.
35 nodes experienced blob writing failures at the same time.
A reasonable guess: the fault domain was at work.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" (Irish proverb)
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ Rn + ρa cp (δq) ga) / ((Δ + γ (1 + ga/gs)) λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
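A numeric sketch of the Penman-Monteith form, ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ(1 + ga/gs))·λv); every input value below is an illustrative assumption, not data from the MODISAzure study.

```python
def penman_monteith(delta, Rn, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2.45e6):
    """Penman-Monteith ET: (delta*Rn + rho_a*c_p*dq*g_a)
    divided by ((delta + gamma*(1 + g_a/g_s)) * lambda_v)."""
    return (delta * Rn + rho_a * c_p * dq * g_a) / \
           ((delta + gamma * (1 + g_a / g_s)) * lambda_v)

# Made-up but physically plausible mid-latitude daytime values.
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2, c_p=1005.0,
                     dq=1000.0, g_a=0.02, g_s=0.01)
```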
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
1. Data collection (map) stage
  • Downloads requested input tiles from NASA FTP sites
  • Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
2. Reprojection (map) stage
  • Converts source tile(s) to intermediate-result sinusoidal tiles
  • Simple nearest-neighbor or spline algorithms
3. Derivation reduction stage
  • First stage visible to scientists
  • Computes ET in our initial use
4. Analysis reduction stage
  • Optional second stage visible to scientists
  • Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[diagram: scientists submit requests via the AzureMODIS Service Web Role Portal; a Request Queue and Download Queue feed the Data Collection Stage, which pulls from source imagery download sites using source metadata; the Reprojection Queue feeds the Reprojection Stage; the Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages; science results are available for download]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The MODISAzure Service is the Web Role front door
  • Receives all user requests
  • Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, the recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
[diagram: a <PipelineStage> Request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status
[diagram: Generic Worker (Worker Role) instances pull from the <PipelineStage> Task Queue and read/write <Input> Data Storage]
Example Pipeline Stage: Reprojection Service
[diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, then parses and persists ReprojectionTaskStatus and dispatches to the Task Queue consumed by Generic Workers (Worker Roles), which read swath source data and write reprojection data to storage]
• Each job-status entity specifies a single reprojection job request
• Each task-status entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Per-stage breakdown:
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3,500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1,800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1,800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage
Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit (November 2010 Update)
http://research.microsoft.com/azure
xcgngage@microsoft.com
Tables Recapbull Efficient for frequently used queriesbull Supports batch transactionsbull Distributes load
Select PartitionKey and RowKey that help scale
Avoid ldquoAppend onlyrdquo patterns
Always Handlecontinuation tokens
ldquoORrdquo predicates are not optimized
Implement back-offstrategy for retries
bull Distribute by using a hash etc as prefix
bull Expect continuation tokens for range queries
bull Execute the queries that form the ldquoORrdquo predicates as separate queries
bull Server busybull Load balance partitions to meet traffic needsbull Load on single partition has exceeded the limits
WCF Data Services
bull Use a new context for each logical operationbull AddObjectAttachTo can throw exception if entity is already being tracked
bull Point query throws an exception if resource does not exist Use IgnoreResourceNotFoundException
QueuesTheir Unique Role in Building Reliable Scalable Applicationsbull Want roles that work closely together but are not
bound togetherbull Tight coupling leads to brittlenessbull This can aid in scaling and performance
bull A queue can hold an unlimited number of messagesbull Messages must be serializable as XMLbull Limited to 8KB in sizebull Commonly use the work ticket pattern
bull Why not simply use a table
Queue Terminology
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout)RemoveMessage
Msg 2Msg 1
Worker Role
Msg 2
POST httpmyaccountqueuecorewindowsnetmyqueuemessages
HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20
ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt
DELETEhttpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a backoff polling approach Each empty poll
increases interval by 2x
A successful sets the interval back to 1
60
21
11
C1
C2
Removing Poison Messages
11
21
340
Producers Consumers
P2
P1
30
2 GetMessage(Q 30 s) msg 2
1 GetMessage(Q 30 s) msg 1
11
21
10
20
61
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
11
21
2 GetMessage(Q 30 s) msg 23 C2 consumed msg 24 DeleteMessage(Q msg 2)7 GetMessage(Q 30 s) msg 1
1 GetMessage(Q 30 s) msg 15 C1 crashed
11
21
6 msg1 visible 30 s after Dequeue30
12
11
12
62
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
12
2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed
1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1
2
6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue
30
13
12
13
Queues Recap
bullNo need to deal with failuresMake messageprocessing idempotent
bull Invisible messages result in out of orderDo not rely on order
bullEnforce threshold on messagersquos dequeue countUse Dequeue count to remove poison messages
bullMessages gt 8KBbullBatch messagesbullGarbage collect orphaned blobs
bullDynamically increasereduce workers
Use blob to storemessage data with
reference in message
Use message countto scale
bullNo need to deal with failures
bull Invisible messages result in out of order
bullEnforce threshold on messagersquos dequeue count
bullDynamically increasereduce workers
Windows Azure Storage TakeawaysData abstractions to build your applications
Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at
httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet
Best Practices
Picking the Right VM Size
bull Having the correct VM size can make a big difference in costs
bull Fundamental choice ndash larger fewer VMs vs many smaller instances
bull If you scale better than linear across cores larger VMs could save you money
bull Pretty rare to see linear scaling across 8 cores
bull More instances may provide better uptime and reliability (more failures needed to take your service down)
bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it
bull Common mistake ndash split up code into multiple roles each not using up CPU
bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest
Exploiting Concurrencybull Spin up additional processes each with a specific task or as a
unit of concurrency
bull May not be ideal if number of active processes exceeds number of cores
bull Use multithreading aggressively
bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads
bull In NET 4 use the Task Parallel Library
bull Data parallelism
bull Task parallelism
Finding Good Code Neighborsbull Typically code falls into one or more of these categories
bull Find code that is intensive with different resources to live togetherbull Example distributed network caches are typically network-
and memory-intensive they may be a good neighbor for storage IO-intensive code
MemoryIntensive
CPUIntensive
Network IO Intensive Storage IO Intensive
Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not
over-scaled)
bull Spinning VMs up and down automatically is good at large scale
bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
bull Being too aggressive in spinning down VMs can result in poor user experience
bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs
Performance Cost
Storage Costs
bullUnderstand an applicationrsquos storage profile and how storage billing works
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web apprsquos billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often lead to savings inother places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web apprsquos performance and user experience
Compressing Content
1Gzip all output content
bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better
compression and freedom from patented algorithms
2Tradeoff compute costs for storage size
3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
GzipMinify JavaScript
Minify CCSMinify Images
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences
Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input: segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST): needs special result-reduction processing
Large data volumes:
• A typical BLAST database can be as large as 10 GB
• With 100 nodes, peak aggregate storage bandwidth demand can reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• A parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern:
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (ScienceCloud 2010), ACM, 21 June 2010.
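The split/query/merge pattern can be sketched as follows. This is an illustrative skeleton, not AzureBLAST code: `blast_task` is a stand-in for invoking NCBI-BLAST on one partition, and a thread pool stands in for the worker roles:

```python
from concurrent.futures import ThreadPoolExecutor

def split_sequences(sequences, partition_size):
    """Splitting task: cut the input into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_task(partition):
    """Stand-in for running NCBI-BLAST on one partition
    (here we just return one fake hit per sequence)."""
    return [(seq, "hit") for seq in partition]

def merge_results(per_partition_results):
    """Merging task: concatenate per-partition outputs in order."""
    merged = []
    for part in per_partition_results:
        merged.extend(part)
    return merged

sequences = [f"seq{i}" for i in range(1000)]
partitions = split_sequences(sequences, 100)   # 100 sequences/partition, per the micro-benchmarks below
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(blast_task, partitions))
hits = merge_results(results)
assert len(partitions) == 10 and len(hits) == len(sequences)
```

Because each partition is independent, failures can be retried per partition, which is what makes the pattern a good fit for queue-based workers.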
AzureBLAST Task-Flow: a simple Split/Join pattern
Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes
Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of instance failure
[Figure: Split/Join task flow: a Splitting task fans out to parallel BLAST tasks, whose outputs feed a Merging task.]
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to memory capacity
Task size/instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
[Figure: AzureBLAST architecture: a Web Role (web portal, web service, job registration), a Job Management Role (job scheduler, scaling engine) feeding a global dispatch queue, pools of Worker instances, and a Database-updating Role. Azure Tables hold the job registry and NCBI database metadata; Azure Blob storage holds the BLAST databases, temporary data, etc. Each worker runs the Split/Join flow: a Splitting task, parallel BLAST tasks, and a Merging task.]
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID.
The accepted job is stored into the job registry table:
• Fault tolerance: avoid in-memory state
[Figure: job portal components: web portal, web service, job registration, job scheduler, scaling engine, job registry.]
Demonstration
R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually
[Figure: VM counts per deployment, 50-62 VMs each.]
End Result
• Total size of the output is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record pair looks like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
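The started-but-never-finished check can be automated. A minimal sketch, assuming the record format shown above (the regexes and sample lines are illustrative):

```python
import re

LOG = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

def incomplete_tasks(log_text):
    """Return tasks that have an 'Executing' record but no matching 'done' record."""
    started, finished = set(), set()
    for line in log_text.splitlines():
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return started - finished

assert incomplete_tasks(LOG) == {"251774"}
```

Flagged tasks (like 251774 above, which started at 8:22 and never completed) are the ones to correlate with upgrade and fault-domain events.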
Surviving System Upgrades
North Europe Data Center: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups: this is an update domain (~6 nodes in one group, ~30 mins per group).
Surviving Storage Failures
West Europe Datacenter: 30,976 tasks were completed when the job was killed.
35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain at work.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.
Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)
where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)
Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
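The formula transcribes directly into code. The input values below are assumed, illustrative magnitudes only (not validated field data), so treat this as a shape check on the equation rather than a model:

```python
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s, lam_v, gamma=66.0):
    """Penman-Monteith ET, with symbols as in the variable list above."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lam_v
    return numerator / denominator

# Illustrative daytime values (assumed for demonstration only).
et = penman_monteith(delta=145.0, r_n=400.0, rho_a=1.2, c_p=1005.0,
                     dq=1000.0, g_a=0.02, g_s=0.01, lam_v=2450.0)
assert et > 0.0  # positive net radiation and vapor deficit imply positive ET
```

The hard part in practice is not this arithmetic but producing ga and gs per pixel, which is what the imagery pipeline below exists to do.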
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors
[Figure: pipeline data flow: scientists submit requests through the AzureMODIS Service Web Role portal into a Request Queue; a Download Queue feeds the Data Collection Stage from source imagery download sites; Reprojection, Reduction 1, and Reduction 2 queues feed the Reprojection, Derivation Reduction, and Analysis Reduction stages; scientists download the science results.]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door:
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role:
  • Parses all job requests into tasks, i.e. recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
[Figure: a <PipelineStage> Request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role:
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status
[Figure: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue tasks and read <Input>Data Storage.]
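The GenericWorker's dequeue/retry loop can be sketched as below. The 3-attempt limit comes from the slide; the in-memory queue and the (task_id, payload, dequeue_count) tuple shape are assumptions for illustration:

```python
MAX_RETRIES = 3

def run_worker(queue, task_status, execute):
    """Dequeue tasks, run them, and record status; a task that keeps
    failing is retried up to MAX_RETRIES times, then marked failed."""
    while queue:
        task_id, payload, dequeue_count = queue.pop(0)
        try:
            execute(payload)
            task_status[task_id] = "done"
        except Exception:
            if dequeue_count + 1 >= MAX_RETRIES:
                task_status[task_id] = "failed"          # give up after 3 attempts
            else:
                queue.append((task_id, payload, dequeue_count + 1))  # retry later

status = {}
def always_fails(payload):
    raise RuntimeError("simulated task failure")
run_worker([("t1", "tile-42", 0)], status, always_fails)
assert status["t1"] == "failed"

ok = {}
run_worker([("t2", "tile-7", 0)], ok, lambda payload: None)
assert ok["t2"] == "done"
```

In the real service the retry count rides along with the queue message (its dequeue count) and status lands in Azure Tables rather than a dict.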
Example Pipeline Stage: Reprojection Service
[Figure: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker (Worker Role) instances consume tasks against Reprojection Data Storage and Swath Source Data Storage.]
• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Stage-by-stage (from the pipeline diagram):
• Data collection: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 compute, $60 download
• Derivation reduction: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 compute, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 compute, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Queues: Their Unique Role in Building Reliable, Scalable Applications
• Want roles that work closely together, but are not bound together
  • Tight coupling leads to brittleness
  • Decoupling aids scaling and performance
• A queue can hold an unlimited number of messages
  • Messages must be serializable as XML
  • Limited to 8 KB in size
  • Commonly use the work-ticket pattern
• Why not simply use a table?
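The work-ticket pattern mentioned above can be sketched as follows. Dicts and a list stand in for Azure Blob storage and a queue, and the ticket fields are invented for illustration; the point is that only a small pointer travels through the queue:

```python
import json
import uuid

blob_store = {}   # stand-in for Azure Blob storage
queue = []        # stand-in for an Azure queue

def enqueue_work(large_payload: bytes) -> str:
    """Store the large payload in a blob and enqueue only a small
    ticket (well under the 8 KB message limit) that points to it."""
    blob_name = str(uuid.uuid4())
    blob_store[blob_name] = large_payload
    ticket = json.dumps({"blob": blob_name, "kind": "render-job"})
    queue.append(ticket)
    return ticket

def process_next() -> bytes:
    """Worker side: dequeue the ticket and fetch the real data by reference."""
    ticket = json.loads(queue.pop(0))
    return blob_store[ticket["blob"]]

payload = b"x" * 1_000_000            # far larger than any queue message may be
ticket = enqueue_work(payload)
assert len(ticket) < 8 * 1024         # the message itself stays tiny
assert process_next() == payload
```

This is also why a queue beats a table for work distribution: the queue's visibility timeout and dequeue count give you delivery semantics a table does not.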
Queue Terminology
Message Lifecycle
[Figure: a Web Role PutMessage-s messages into a queue; Worker Roles GetMessage with a visibility timeout, process the message, and then RemoveMessage.]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach:
• Each empty poll increases the polling interval by 2x, up to a cap
• A successful poll resets the interval back to 1
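The doubling/reset rule can be sketched in a few lines; the cap of 64 and the initial interval of 1 are illustrative choices:

```python
def next_interval(current, success, initial=1, maximum=64):
    """Truncated exponential back-off: double on an empty poll,
    cap at `maximum`, reset to `initial` on a successful poll."""
    if success:
        return initial
    return min(current * 2, maximum)

interval = 1
observed = []
for got_message in [False, False, False, False, False, False, False, True, False]:
    interval = next_interval(interval, got_message)
    observed.append(interval)
assert observed == [2, 4, 8, 16, 32, 64, 64, 1, 2]
```

The cap keeps a mostly idle worker from polling itself to sleep, while the reset lets it react quickly once messages start flowing again.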
Removing Poison Messages
Producers P1 and P2 enqueue to queue Q; consumers C1 and C2 dequeue from it:
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: GetMessage(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: DeleteMessage(Q, msg 1)
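The DequeueCount guard above can be sketched as follows. The threshold of 2, the list-based queue, and the dead-letter list are illustrative stand-ins for the real queue service:

```python
POISON_THRESHOLD = 2

def handle_one(queue, dead_letters, process):
    """Process one message; a message seen more than POISON_THRESHOLD
    times is removed (here: moved to a dead-letter list) instead of
    being reprocessed forever. Queue entries are (body, dequeue_count)."""
    body, dequeue_count = queue.pop(0)
    dequeue_count += 1
    if dequeue_count > POISON_THRESHOLD:
        dead_letters.append(body)                # give up on the poison message
        return
    try:
        process(body)                            # on success the message is not re-queued
    except Exception:
        queue.append((body, dequeue_count))      # becomes visible again, count preserved

q, dead = [("msg1", 0)], []
def crashing_consumer(body):
    raise RuntimeError("consumer crashed")

while q:
    handle_one(q, dead, crashing_consumer)
assert dead == ["msg1"]                          # removed after the third dequeue
```

Without the guard, a message whose processing always crashes the consumer would circulate forever, starving real work.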
Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs: files and large objects
• Drives: NTFS APIs for migrating applications
• Tables: massively scalable structured storage
• Queues: reliable delivery of messages
Easy to use via the Storage Client Library.
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: larger, fewer VMs vs. many smaller instances
  • If you scale better than linearly across cores, larger VMs could save you money
  • It's pretty rare to see linear scaling across 8 cores
  • More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
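The slide names .NET 4's Task Parallel Library; as a language-neutral analogue, a worker pool mapping one function over many items captures the same data-parallel idea (the `transform` work item is invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def transform(tile):
    """Stand-in for one independent unit of work (data parallelism)."""
    return tile * tile

tiles = list(range(100))
# Let the pool schedule work across available threads rather than
# hand-managing one thread per item.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(transform, tiles))
assert results == [t * t for t in tiles]   # order of results is preserved
```

The design choice mirrors the slide's advice: keep the unit of concurrency small and uniform, and let a scheduler (TPL, a pool, or I/O completion ports) decide how many threads actually run.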
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in a poor user experience
• Trade-off between the risk of failure or a poor user experience (too little excess capacity) and the cost of idle VMs
Storage Costs
bullUnderstand an applicationrsquos storage profile and how storage billing works
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web apprsquos billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often lead to savings inother places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web apprsquos performance and user experience
Compressing Content
1Gzip all output content
bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better
compression and freedom from patented algorithms
2Tradeoff compute costs for storage size
3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
GzipMinify JavaScript
Minify CCSMinify Images
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences
Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel
bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing
Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach
to 1TB
bull The output of BLAST is usually 10-100x larger than the input
AzureBLAST
bull Parallel BLAST engine on Azure
bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done
bull Follows the general suggested application model bull Web Role + Queue + Worker
bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific
Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010
AzureBLAST Task-FlowA simple SplitJoin pattern
Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size
Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead
Best Practice test runs to profiling and set size to mitigate the overhead
Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best
choice
Instance size vs Performancebull Super-linear speedup with larger size
worker instancesbull Primarily due to the memory capability
Task SizeInstance Size vs Costbull Extra-large instance generated the best
and the most economical throughputbull Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Blast databases temporary data etc)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs
AuthenticationAuthorization based on Live ID
The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory
states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW
Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less
than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences
ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried
bull Theoretically 100 billion sequence comparisons
Performance estimationbull Based on the sampling-running on one extra-large Azure
instancebull Would require 3216731 minutes (61 years) on one desktop
This scale of experiments usually are infeasible to most scientists
Our Approachbull Allocated a total of ~4000 instances
bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe
bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service
bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions
bull When load imbalances redistribute the load manually
50
6262 62
6262
5062
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage: Reprojection Service
[Diagram: a Reprojection Request reaches the service, which persists ReprojectionJobStatus and enqueues to the Job Queue; the Service Monitor (Worker Role) parses and persists ReprojectionTaskStatus and dispatches to the Task Queue, from which GenericWorker (Worker Role) instances pull tasks; tasks point to the ScanTimeList and SwathGranuleMeta tables and to Reprojection Data Storage, backed by Swath Source Data Storage]
• Each entity in the job table specifies a single reprojection job request
• Each entity in the task table specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reductions multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
[Pipeline diagram with per-stage statistics and costs; stage-to-cost pairing follows the original layout:]
Stage                 Data         Files      Compute              Workers          Cost
Data collection       400-500 GB   60K files  10 MB/sec, 11 hours  <10 workers      $50 upload, $450 storage
Reprojection          400 GB       45K files  3500 hours           20-100 workers   $420 CPU, $60 download
Derivation reduction  5-7 GB       55K files  1800 hours           20-100 workers   $216 CPU, $1 download, $6 storage
Analysis reduction    <10 GB       ~1K files  1800 hours           20-100 workers   $216 CPU, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Queue Terminology
Message Lifecycle
[Diagram: a Web Role calls PutMessage to add Msg 1–4 to a Queue; Worker Roles call GetMessage (with a visibility timeout) to retrieve messages and RemoveMessage to delete them once processed]
POST http://myaccount.queue.core.windows.net/myqueue/messages

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec 2008 21:04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep 2008 23:29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep 2008 23:29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep 2008 05:29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dGdGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

DELETE http://myaccount.queue.core.windows.net/myqueue/messages/messageid?popreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back-Off Polling
Consider a back-off polling approach: each empty poll increases the polling interval by 2x, and a successful poll sets the interval back to 1.
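The back-off policy above is small enough to sketch directly (Python for illustration; `empty_poll` and `success` are hypothetical names for the two events, not part of any Azure API):

```python
class TruncatedBackoff:
    """Truncated exponential back-off for queue polling.

    Each empty poll doubles the wait interval, up to `cap`;
    a successful poll resets the interval back to `base`.
    """

    def __init__(self, base=1.0, cap=60.0):
        self.base, self.cap = base, cap
        self.interval = base

    def empty_poll(self):
        """Return how long to sleep after an empty GetMessage, then double."""
        wait = self.interval
        self.interval = min(self.interval * 2, self.cap)  # truncate at cap
        return wait

    def success(self):
        """A message arrived: reset the interval to the base."""
        self.interval = self.base
```

The worker's polling loop would sleep for `empty_poll()` seconds whenever GetMessage returns nothing, and call `success()` whenever a message is received, so an idle queue is polled at most once per `cap` seconds.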
Removing Poison Messages
[Diagram: producers P1 and P2 enqueue messages; consumers C1 and C2 dequeue them]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (2)
[Diagram: producers P1 and P2; consumers C1 and C2]
1. C1: GetMessage(Q, 30 s) → msg 1
2. C2: GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (3)
[Diagram: producers P1 and P2; consumers C1 and C2]
1. C1: Dequeue(Q, 30 s) → msg 1
2. C2: Dequeue(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. C2: Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. C2: Dequeue(Q, 30 s) → msg 1
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. C1: Dequeue(Q, 30 s) → msg 1
12. DequeueCount > 2
13. C1: Delete(Q, msg 1)
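The walkthrough above boils down to one rule: before processing, compare the message's dequeue count against a threshold and delete the message instead of retrying it forever. A minimal sketch, using a tiny in-memory queue to simulate visibility timeouts and dequeue counts rather than the real storage service:

```python
MAX_DEQUEUE = 2  # threshold on a message's dequeue count

class SimQueue:
    """In-memory stand-in for a cloud queue: tracks per-message
    visibility timeouts and dequeue counts (not the real storage API)."""

    def __init__(self):
        self.msgs = []  # dicts: text, visible_at, dequeue_count

    def put(self, text):
        self.msgs.append({"text": text, "visible_at": 0.0, "dequeue_count": 0})

    def get(self, visibility_timeout, now):
        """Return the first visible message, hiding it until the timeout."""
        for m in self.msgs:
            if m["visible_at"] <= now:
                m["visible_at"] = now + visibility_timeout
                m["dequeue_count"] += 1
                return m
        return None

    def delete(self, msg):
        self.msgs.remove(msg)

def process(q, now):
    """Dequeue one message; drop it as poison if it keeps reappearing."""
    msg = q.get(visibility_timeout=30, now=now)
    if msg is None:
        return None
    if msg["dequeue_count"] > MAX_DEQUEUE:
        q.delete(msg)  # poison message: give up instead of retrying forever
        return ("poisoned", msg["text"])
    # real work happens here; crashing before delete means a later retry
    return ("processing", msg["text"])
```

With a consumer that repeatedly crashes before deleting, the message reappears after each 30-second timeout until its dequeue count crosses the threshold and it is removed, exactly as in steps 6–13 above.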
Queues Recap
• Make message processing idempotent – then there is no need to deal with failures
• Do not rely on order – invisible messages result in out-of-order delivery
• Use the dequeue count to remove poison messages – enforce a threshold on a message's dequeue count
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Use the message count to scale – dynamically increase/reduce workers
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
  • Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
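The deck's examples are .NET-specific (I/O Completion Ports, the Task Parallel Library). As a language-neutral sketch of the data-parallelism idea, here is the same split-the-input-over-a-worker-pool shape in Python; `checksum` is a hypothetical stand-in for real per-partition work:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def checksum(chunk):
    """Stand-in for real CPU work applied to one partition of the data."""
    return sum(chunk) % 65521

def parallel_checksums(data, partitions=4):
    """Data parallelism: split the input into partitions and map the
    same work function over them on a pool of workers."""
    size = (len(data) + partitions - 1) // partitions
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(checksum, chunks))
```

Task parallelism is the same pool running *different* functions concurrently; either way the goal is keeping every core of the VM you are already paying for busy.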
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure and poor user experience from not having excess capacity against the cost of idling VMs
[Diagram: a balance between performance and cost]
Storage Costs
• Understand your application's storage profile and how storage billing works
• Make service choices based on your app's profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app's profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places
• Sending fewer things also means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
[Diagram: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content]
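A minimal sketch of point 1, trading a little CPU per request for a much smaller payload (Python's `gzip` module standing in for whatever compression layer the web tier actually uses):

```python
import gzip

def compress_response(body: bytes) -> bytes:
    """Gzip an HTTP response body; browsers decompress on the fly
    when the response carries 'Content-Encoding: gzip'."""
    return gzip.compress(body)

# Illustrative page: highly repetitive markup compresses dramatically.
page = b"<html><body>" + b"<p>hello azure</p>" * 500 + b"</body></html>"
small = compress_response(page)
ratio = len(small) / len(page)  # fraction of the original bytes sent
```

The CPU cost can be paid once rather than per request by caching the compressed bytes, which is the "trade compute for storage" point that follows.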
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the generally suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow
A simple split/join pattern
Leverage the multi-core capability of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition → load imbalance
• Small partition → unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
Best practice: profile with test runs and set the partition size to mitigate the overhead
Value of the visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure
[Diagram: a splitting task fans out to BLAST tasks, which feed a merging task]
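The split/join pattern above can be sketched in a few lines; the fixed partition size and the simple concatenating merge are illustrative simplifications, not the actual AzureBLAST code:

```python
def partition_queries(sequences, per_partition=100):
    """Query segmentation: split the input sequences into fixed-size
    partitions, one per BLAST task. Partition size trades load balance
    (large partitions) against per-task overhead (small partitions)."""
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]

def merge_results(partial_results):
    """Join step: combine per-partition hit lists once all BLAST
    tasks are done (a hypothetical, simplified merge)."""
    merged = []
    for hits in partial_results:
        merged.extend(hits)
    return merged
```

Each partition would become one queued task with a visibilityTimeout sized to the expected per-partition run time, which is exactly why the partition size and timeout are tuned together.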
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability
Task size / instance size vs. cost
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources
AzureBLAST (2)
[Architecture diagram: a Web Role hosts the Web Portal and Web Service; job registration hands jobs to the Job Scheduler in the Job Management Role, which records them in the Job Registry (an Azure Table) and dispatches work through a global dispatch queue to Worker instances; a Scaling Engine grows and shrinks the worker pool; a Database Updating Role refreshes the NCBI databases; Azure Blob storage holds the BLAST databases, temporary data, etc.; a splitting task fans out to BLAST tasks, which feed a merging task]
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states
[Diagram: the Job Portal feeds job registration through the Web Portal and Web Service to the Job Scheduler, Job Registry, and Scaling Engine]
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment
Discovering homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute the load manually
[Diagram: per-deployment VM counts – 62, 62, 62, 62, 62, 62, 50, 50]
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
[Diagram: per-deployment VM counts – 62, 62, 62, 62, 62, 62, 50, 50]
Understanding Azure by Analyzing Logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise, something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
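Spotting the anomaly above mechanically is a simple set difference over the log: any task id with an "Executing" record but no matching "done" record never completed. A sketch (the regexes assume exactly the record format shown, which may vary):

```python
import re

START = re.compile(r"Executing the task (\d+)")
DONE = re.compile(r"Execution of task (\d+) is done")

def unfinished_tasks(log_lines):
    """Return task ids that started but never logged completion --
    the anomaly the slide describes."""
    started, finished = set(), set()
    for line in log_lines:
        m = START.search(line)
        if m:
            started.add(m.group(1))
        m = DONE.search(line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)

log = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
```

Running this over the full worker logs is how the upgrade- and storage-failure patterns in the next slides were surfaced.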
Surviving System Upgrades
North Europe datacenter: in total, 34,256 tasks processed
[Chart: all 62 compute nodes lost tasks and then came back in groups – this is an update domain at work: ~30 mins per group, ~6 nodes in one group]
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed
[Chart: 35 nodes experienced blob-writing failures at the same time]
A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." – Irish proverb
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.
Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·δq·ga) / (λv·(Δ + γ·(1 + ga/gs)))
where:
• ET = water volume evapotranspired (m3 s-1 m-2)
• Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
• λv = latent heat of vaporization (J/g)
• Rn = net radiation (W m-2)
• cp = specific heat capacity of air (J kg-1 K-1)
• ρa = dry air density (kg m-3)
• δq = vapor pressure deficit (Pa)
• ga = conductivity of air (inverse of ra) (m s-1)
• gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
• γ = psychrometric constant (γ ≈ 66 Pa K-1)
Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
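As a numeric sanity check of the Penman-Monteith formula, here is a direct transcription (note λv is taken in J/kg here, i.e., 2.45e6, rather than the slide's J/g; the input values below are illustrative assumptions, not field data):

```python
def penman_monteith(delta, Rn, rho_a, cp, dq, ga, gs,
                    gamma=66.0, lambda_v=2.45e6):
    """Penman-Monteith evapotranspiration (mass flux per unit area).

    ET = (delta*Rn + rho_a*cp*dq*ga) / (lambda_v * (delta + gamma*(1 + ga/gs)))

    Units as on the slide: delta, dq, gamma in Pa (per K where noted);
    Rn in W/m^2; rho_a in kg/m^3; cp in J/(kg K); ga, gs in m/s;
    lambda_v in J/kg. Result is in kg of water per m^2 per second.
    """
    numerator = delta * Rn + rho_a * cp * dq * ga
    denominator = lambda_v * (delta + gamma * (1.0 + ga / gs))
    return numerator / denominator

# Illustrative mid-latitude daytime values (assumed, for demonstration only)
et = penman_monteith(delta=145.0, Rn=400.0, rho_a=1.2, cp=1013.0,
                     dq=1000.0, ga=0.02, gs=0.01)
```

The single-point evaluation is trivial; the hard part the slide alludes to is estimating ga and gs across a whole catchment, which is what drives the big data reduction.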
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
Message Lifecycle
Queue
Msg 1
Msg 2
Msg 3
Msg 4
Worker Role
Worker Role
PutMessage
Web Role
GetMessage (Timeout)RemoveMessage
Msg 2Msg 1
Worker Role
Msg 2
POST httpmyaccountqueuecorewindowsnetmyqueuemessages
HTTP11 200 OK Transfer-Encoding chunked Content-Type applicationxml Date Tue 09 Dec 2008 210430 GMT Server Nephos Queue Service Version 10 Microsoft-HTTPAPI20
ltxml version=10 encoding=utf-8gt ltQueueMessagesListgt ltQueueMessagegt ltMessageIdgt5974b586-0df3-4e2d-ad0c-18e3892bfca2ltMessageIdgt ltInsertionTimegtMon 22 Sep 2008 232920 GMTltInsertionTimegt ltExpirationTimegtMon 29 Sep 2008 232920 GMTltExpirationTimegt ltPopReceiptgtYzQ4Yzg1MDIGM0MDFiZDAwYzEwltPopReceiptgt ltTimeNextVisiblegtTue 23 Sep 2008 052920GMTltTimeNextVisiblegt ltMessageTextgtPHRlc3Q+dGdGVzdD4=ltMessageTextgt ltQueueMessagegt ltQueueMessagesListgt
DELETEhttpmyaccountqueuecorewindowsnetmyqueuemessagesmessageidpopreceipt=YzQ4Yzg1MDIGM0MDFiZDAwYzEw
Truncated Exponential Back Off Polling
Consider a backoff polling approach Each empty poll
increases interval by 2x
A successful sets the interval back to 1
60
21
11
C1
C2
Removing Poison Messages
11
21
340
Producers Consumers
P2
P1
30
2 GetMessage(Q 30 s) msg 2
1 GetMessage(Q 30 s) msg 1
11
21
10
20
61
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
11
21
2 GetMessage(Q 30 s) msg 23 C2 consumed msg 24 DeleteMessage(Q msg 2)7 GetMessage(Q 30 s) msg 1
1 GetMessage(Q 30 s) msg 15 C1 crashed
11
21
6 msg1 visible 30 s after Dequeue30
12
11
12
62
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
12
2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed
1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1
2
6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue
30
13
12
13
Queues Recap
bullNo need to deal with failuresMake messageprocessing idempotent
bull Invisible messages result in out of orderDo not rely on order
bullEnforce threshold on messagersquos dequeue countUse Dequeue count to remove poison messages
bullMessages gt 8KBbullBatch messagesbullGarbage collect orphaned blobs
bullDynamically increasereduce workers
Use blob to storemessage data with
reference in message
Use message countto scale
bullNo need to deal with failures
bull Invisible messages result in out of order
bullEnforce threshold on messagersquos dequeue count
bullDynamically increasereduce workers
Windows Azure Storage TakeawaysData abstractions to build your applications
Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at
httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet
Best Practices
Picking the Right VM Size
bull Having the correct VM size can make a big difference in costs
bull Fundamental choice ndash larger fewer VMs vs many smaller instances
bull If you scale better than linear across cores larger VMs could save you money
bull Pretty rare to see linear scaling across 8 cores
bull More instances may provide better uptime and reliability (more failures needed to take your service down)
bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance using up the CPU vs. having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
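The slide's advice is .NET-specific (TPL, I/O completion ports); as a language-neutral sketch, the same data-parallel pattern looks like this in Python, with the pool sized to the core count as the process-count caveat suggests.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def transform(item):
    # Stand-in for per-item work (CPU-bound or I/O-bound).
    return item * item


def process_batch(items, workers=None):
    """Data parallelism: apply the same operation to every item,
    keeping roughly one worker per core."""
    workers = workers or os.cpu_count()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, items))
```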
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs
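The flat-fee vs. per-transaction point is easy to quantify. The rates below are illustrative 2010-era figures, not authoritative prices; the point is that the app's transaction profile, not the data size, dominates the comparison.

```python
def table_storage_monthly(transactions, gb_stored,
                          per_10k_tx=0.01, per_gb=0.15):
    """Windows Azure Table-style billing: pay per transaction and per GB
    stored (illustrative rates, not authoritative)."""
    return transactions / 10_000 * per_10k_tx + gb_stored * per_gb


def sql_azure_monthly(flat_fee=9.99):
    """SQL Azure-style billing: flat monthly fee regardless of transactions."""
    return flat_fee


# Same 1 GB of data, wildly different transaction volumes:
chatty = table_storage_monthly(500_000_000, 1)  # chatty app: tables get expensive
quiet = table_storage_monthly(1_000_000, 1)     # quiet app: tables are very cheap
```

For the chatty profile the flat fee wins; for the quiet one, per-transaction billing is a fraction of the cost.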
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile.
Sending fewer things over the wire often means getting fewer things from storage.
Saving bandwidth costs often leads to savings in other places.
Sending fewer things means your VM has time to do other tasks.
All of these tips have the side benefit of improving your web app's performance and user experience.
Compressing Content
1. Gzip all output content
  • All modern browsers can decompress on the fly
  • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
  • Use Portable Network Graphics (PNGs)
  • Crush your PNGs
  • Strip needless metadata
  • Make all PNGs palette PNGs
(Diagram: uncompressed content is gzipped, and JavaScript, CSS, and images are minified, to produce compressed content.)
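Point 1 is a one-liner in most stacks; here is a minimal sketch of conditional gzip for an HTTP-style response (the size threshold and header handling are simplified, and a real server would also check the client's Accept-Encoding header).

```python
import gzip


def gzip_response(body: bytes, min_size=1000):
    """Gzip output content when it is large enough to benefit; modern
    browsers decompress on the fly when Content-Encoding: gzip is set."""
    if len(body) < min_size:
        return body, {}  # tiny payloads: compression overhead isn't worth it
    return gzip.compress(body), {"Content-Encoding": "gzip"}


page = b"<html>" + b"<p>hello azure</p>" * 500 + b"</html>"
smaller, headers = gzip_response(page)
```

Repetitive HTML like this compresses dramatically, which saves both bandwidth charges and transfer time.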
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large volume of data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.
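The query-segmentation pattern can be sketched independently of Azure. `blast_partition` below is a hypothetical stand-in for invoking NCBI-BLAST on one partition; in the real system each partition would travel through a queue to a worker role, and the merge would run once all partitions report done.

```python
def split_input(sequences, partition_size=100):
    """Query segmentation: cut the input sequences into partitions
    (AzureBLAST found ~100 sequences per partition worked best)."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]


def blast_partition(partition):
    # Stand-in for running NCBI-BLAST over one partition (illustrative).
    return [f"hit:{seq}" for seq in partition]


def run_azureblast_style(sequences):
    """Split / query-in-parallel / merge. Partitions are independent, so in
    the real system each becomes a queue message consumed by a worker."""
    partitions = split_input(sequences)
    results = map(blast_partition, partitions)  # conceptually: parallel workers
    merged = []
    for r in results:
        merged.extend(r)  # the merging task
    return merged
```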
AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out into parallel BLAST tasks, followed by a merging task.
Leverage the multi-core capability of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes
Task granularity:
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size / instance size vs. cost:
• The extra-large instance generated the best and the most economical throughput
• Fully utilizes the resources
AzureBLAST
(Architecture diagram: a Web Role hosts the web portal, web service, and job registration; a Job Management Role runs the job scheduler and scaling engine, persisting state in an Azure Table job registry; a database-updating role refreshes the NCBI databases; worker instances pull tasks from a global dispatch queue; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc. Each job follows the split/join task flow: a splitting task, parallel BLAST tasks, and a merging task.)
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• An accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory state
(Diagram: the job portal's web service performs job registration into the job registry; the job scheduler and scaling engine consume entries from it.)
Demonstration
R. palustris as a platform for H2 production
Eric Schadt (Sage), Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time...
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists.
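The slide's time estimate checks out with simple arithmetic; the cloud-side figure below assumes perfect speedup across all allocated cores, which is an idealization (the actual run, described later, took about 10 days).

```python
MIN_PER_YEAR = 60 * 24 * 365

desktop_minutes = 3_216_731                     # sampled estimate from the slide
desktop_years = desktop_minutes / MIN_PER_YEAR  # about 6.1 years on one desktop

cores = 475 * 8                                 # 475 XL VMs x 8 cores = 3800 cores
ideal_minutes = desktop_minutes / cores         # under perfect speedup: < 1 day
```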
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
(Map: instance counts per deployment: 50, 62, 62, 62, 62, 62, 50, 62.)
End Result
• Total size of the output is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place...
Understanding Azure by analyzing logs
A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
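Detecting the "something is wrong" case is a small log-scanning exercise: collect task ids that were started but never reported done. The regexes below assume the record format shown above; the sample log is the abnormal excerpt from the slide.

```python
import re

LOG = """\
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""


def unfinished_tasks(log_text):
    """Return task ids that were started ('Executing') but never reported
    done: the signature of a lost instance in the AzureBLAST logs."""
    started, finished = set(), set()
    for line in log_text.splitlines():
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return started - finished
```

Here task 251774 was started but never finished, so it would be flagged for re-execution.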
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total.
• All 62 compute nodes lost tasks and then came back in groups: this is an update domain at work
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed.
• 35 nodes experienced blob-write failures at the same time
• A reasonable guess: the fault domain is at work
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" (Irish proverb)

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration (evaporation through plant membranes) by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
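The Penman-Monteith relation transcribes directly into code. This sketch just evaluates the formula with the symbols listed on the slide; the default constants are typical values (not from the slide), and unit bookkeeping is left to the caller.

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2453.0):
    """Penman-Monteith evapotranspiration.
    delta Pa/K, r_n W/m^2, rho_a kg/m^3, c_p J/(kg K), dq Pa,
    g_a and g_s m/s, gamma Pa/K (~66), lambda_v J/g (~2453 at 20 C).
    Returns the evaporative flux implied by the inputs' units."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator


# Plausible mid-latitude daytime values (illustrative only):
et = penman_monteith_et(delta=145.0, r_n=400.0, rho_a=1.2, c_p=1013.0,
                        dq=1000.0, g_a=0.02, g_s=0.01)
```

Note how the stomatal conductance g_s sits only in the denominator: as stomata close (g_s → 0), modeled ET collapses, which is why estimating conductivity across a catchment matters so much.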
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
• 20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
(Diagram: scientists submit requests through the AzureMODIS Service web role portal; requests flow through the request queue and download queue to the data collection stage, which pulls from source imagery download sites and records source metadata; the reprojection queue feeds the reprojection stage, the Reduction 1 queue feeds the derivation reduction stage, and the Reduction 2 queue feeds the analysis reduction stage, which produces science results for scientific-results download.)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, which are recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
(Diagram: a <PipelineStage> request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue.)
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status
(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue; GenericWorker (Worker Role) instances dequeue the tasks and read from <Input> data storage.)
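The GenericWorker behavior (dequeue, execute, retry failing tasks up to 3 attempts, record status) can be sketched as a loop. The data structures here are plain Python stand-ins for the Azure task queue and status table, not the actual MODISAzure code.

```python
MAX_ATTEMPTS = 3


def generic_worker(task_queue, handler, status_table):
    """Sketch of the GenericWorker loop: dequeue a task, run it, requeue a
    failing task until it has been attempted MAX_ATTEMPTS times, then mark
    it failed (names and structures are illustrative)."""
    while task_queue:
        task = task_queue.pop(0)                 # dequeue the next task
        try:
            handler(task)
            status_table[task["id"]] = "done"    # persist success
        except Exception:
            task["attempts"] = task.get("attempts", 0) + 1
            if task["attempts"] < MAX_ATTEMPTS:
                task_queue.append(task)          # requeue for another attempt
            else:
                status_table[task["id"]] = "failed"  # give up, record it
```

Because every outcome lands in the status table, a monitor can later distinguish tasks that succeeded, failed permanently, or are still in flight.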
Example Pipeline Stage: Reprojection Service
(Diagram: a reprojection request enters the job queue, where each entity specifies a single reprojection job request. The Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the task queue, where each entity specifies a single reprojection task (i.e., a single tile). GenericWorker (Worker Role) instances point to the SwathGranuleMeta and ScanTimeList tables: query SwathGranuleMeta to get geo-metadata (e.g., boundaries) for each swath tile, and query ScanTimeList to get the list of satellite scan times that cover a target tile. Reprojection data is read from swath source data storage.)
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage breakdown (from the pipeline diagram):
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload + $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU + $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU + $1 download + $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU + $2 download + $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed
1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1
2
6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue
30
13
12
13
Queues Recap
bullNo need to deal with failuresMake messageprocessing idempotent
bull Invisible messages result in out of orderDo not rely on order
bullEnforce threshold on messagersquos dequeue countUse Dequeue count to remove poison messages
bullMessages gt 8KBbullBatch messagesbullGarbage collect orphaned blobs
bullDynamically increasereduce workers
Use blob to storemessage data with
reference in message
Use message countto scale
bullNo need to deal with failures
bull Invisible messages result in out of order
bullEnforce threshold on messagersquos dequeue count
bullDynamically increasereduce workers
Windows Azure Storage TakeawaysData abstractions to build your applications
Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at
httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet
Best Practices
Picking the Right VM Size
bull Having the correct VM size can make a big difference in costs
bull Fundamental choice ndash larger fewer VMs vs many smaller instances
bull If you scale better than linear across cores larger VMs could save you money
bull Pretty rare to see linear scaling across 8 cores
bull More instances may provide better uptime and reliability (more failures needed to take your service down)
bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it
bull Common mistake ndash split up code into multiple roles each not using up CPU
bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest
Exploiting Concurrencybull Spin up additional processes each with a specific task or as a
unit of concurrency
bull May not be ideal if number of active processes exceeds number of cores
bull Use multithreading aggressively
bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads
bull In NET 4 use the Task Parallel Library
bull Data parallelism
bull Task parallelism
Finding Good Code Neighborsbull Typically code falls into one or more of these categories
bull Find code that is intensive with different resources to live togetherbull Example distributed network caches are typically network-
and memory-intensive they may be a good neighbor for storage IO-intensive code
MemoryIntensive
CPUIntensive
Network IO Intensive Storage IO Intensive
Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not
over-scaled)
bull Spinning VMs up and down automatically is good at large scale
bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
bull Being too aggressive in spinning down VMs can result in poor user experience
bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs
Performance Cost
Storage Costs
bullUnderstand an applicationrsquos storage profile and how storage billing works
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web apprsquos billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often lead to savings inother places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web apprsquos performance and user experience
Compressing Content
1Gzip all output content
bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better
compression and freedom from patented algorithms
2Tradeoff compute costs for storage size
3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
GzipMinify JavaScript
Minify CCSMinify Images
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences
Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel
bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing
Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach
to 1TB
bull The output of BLAST is usually 10-100x larger than the input
AzureBLAST
bull Parallel BLAST engine on Azure
bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done
bull Follows the general suggested application model bull Web Role + Queue + Worker
bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific
Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010
AzureBLAST Task-FlowA simple SplitJoin pattern
Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size
Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead
Best Practice test runs to profiling and set size to mitigate the overhead
Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best
choice
Instance size vs Performancebull Super-linear speedup with larger size
worker instancesbull Primarily due to the memory capability
Task SizeInstance Size vs Costbull Extra-large instance generated the best
and the most economical throughputbull Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Blast databases temporary data etc)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs
AuthenticationAuthorization based on Live ID
The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory
states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW
Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less
than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences
ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried
bull Theoretically 100 billion sequence comparisons
Performance estimationbull Based on the sampling-running on one extra-large Azure
instancebull Would require 3216731 minutes (61 years) on one desktop
This scale of experiments usually are infeasible to most scientists
Our Approachbull Allocated a total of ~4000 instances
bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe
bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service
bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions
bull When load imbalances redistribute the load manually
50
6262 62
6262
5062
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
Download Queue
Scientists
Science results
Analysis Reduction Stage / Derivation Reduction Stage / Reprojection Stage
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks persisted in Tables
<PipelineStage> Request
<PipelineStage>JobStatus (Persist)
<PipelineStage>Job Queue
MODISAzure Service (Web Role)
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
Dispatch <PipelineStage>Task Queue
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
GenericWorker (Worker Role)
Dispatch <PipelineStage>Task Queue
<Input>Data Storage
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
Example Pipeline Stage: Reprojection Service
Reprojection Request
Service Monitor (Worker Role)
ReprojectionJobStatus (Persist)
Parse & Persist ReprojectionTaskStatus
GenericWorker (Worker Role)
Job Queue
Dispatch
Task Queue
Points to
ScanTimeList
SwathGranuleMeta
Reprojection Data Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (i.e., a single tile)
Query this table to get geo-metadata (e.g., boundaries) for each swath tile
Query this table to get the list of satellite scan times that cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
Download Queue
Scientists
Analysis Reduction Stage / Derivation Reduction Stage / Reprojection Stage
400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers
$50 upload, $450 storage
400 GB, 45K files, 3500 hours, 20-100 workers
5-7 GB, 55K files, 1800 hours, 20-100 workers
<10 GB, ~1K files, 1800 hours, 20-100 workers
$420 CPU, $60 download
$216 CPU, $1 download, $6 storage
$216 CPU, $2 download, $9 storage
AzureMODIS Service Web Role Portal
Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Removing Poison Messages
Producers P1, P2 and consumers C1, C2 share queue Q, which holds messages 1 and 2
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
Removing Poison Messages (2)
Producers P1, P2 and consumers C1, C2 share queue Q
1. GetMessage(Q, 30 s) → msg 1
2. GetMessage(Q, 30 s) → msg 2
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. GetMessage(Q, 30 s) → msg 1
Removing Poison Messages (3)
Producers P1, P2 and consumers C1, C2 share queue Q
1. Dequeue(Q, 30 sec) → msg 1
2. Dequeue(Q, 30 sec) → msg 2
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 visible 30 s after dequeue
7. Dequeue(Q, 30 sec) → msg 1
8. C2 crashed
9. msg 1 visible 30 s after dequeue
10. C1 restarted
11. Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
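Steps 12-13 above, deleting a message once its dequeue count passes a threshold, can be sketched as follows; the in-memory queue and crashing handler below are illustrative stand-ins, not the Azure Storage SDK.

```python
# Illustrative sketch of poison-message handling via a dequeue-count
# threshold; the in-memory deque below stands in for an Azure queue.
import collections

MAX_DEQUEUE_COUNT = 2

class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0

def process(queue, handler):
    """Pop one message; delete poison messages instead of retrying forever."""
    msg = queue.popleft()
    msg.dequeue_count += 1
    if msg.dequeue_count > MAX_DEQUEUE_COUNT:
        return "deleted-as-poison"       # steps 12-13: give up on the message
    try:
        handler(msg.body)
        return "processed"               # normal case: handler succeeded
    except Exception:
        queue.append(msg)                # simulate visibility timeout expiring
        return "requeued"

q = collections.deque([Message("bad payload")])
results = []

def crashing_handler(body):
    raise RuntimeError("consumer crashed on " + body)

while q:
    results.append(process(q, crashing_handler))
print(results)  # ['requeued', 'requeued', 'deleted-as-poison']
```

Without the threshold, the crashing handler would receive the same message forever; with it, the bad message is removed after two failed attempts.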
Queues Recap
• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data, with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
Windows Azure Storage Takeaways
Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
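The fewer-larger vs. many-smaller trade-off can be made concrete. This sketch assumes a made-up per-core hourly price and a scaling exponent `alpha`; neither reflects actual Azure pricing or any measured workload.

```python
# Illustrative sketch of the VM-size trade-off above. Assume throughput
# scales as cores**alpha (alpha < 1 means sub-linear scaling across cores).
# The hourly price is a placeholder proportional to core count.
def cost_per_unit_throughput(cores, price_per_core_hour=0.12, alpha=0.85):
    price = cores * price_per_core_hour        # price grows linearly with cores
    throughput = cores ** alpha                # but throughput usually doesn't
    return price / throughput

small = cost_per_unit_throughput(1)
xlarge = cost_per_unit_throughput(8)
# With sub-linear scaling, one 8-core VM costs more per unit of work than
# eight 1-core VMs; only super-linear scaling (alpha > 1) favors larger VMs.
print(small < xlarge)  # True
```

This is exactly why the slide says to experiment: `alpha` is an empirical property of your workload, not something you can guess.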
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up the CPU vs. having free capacity in times of need
• Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
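The bullets above assume .NET's Task Parallel Library; as a language-neutral sketch of the same data-parallel idea, here is the pattern with a worker pool (Python used purely for illustration, not the deck's stack).

```python
# Language-neutral sketch of the data-parallelism bullet above; the deck
# uses .NET's Task Parallel Library, this shows the same shape in Python.
from concurrent.futures import ThreadPoolExecutor

def score(item):
    # stand-in for per-item work (e.g., one alignment, one tile)
    return item * item

items = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    # data parallelism: the same operation applied to many items at once
    results = list(pool.map(score, items))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Task parallelism is the complementary shape: different operations submitted as independent futures rather than one operation mapped over a collection.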
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, or storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure/poor user experience due to not having excess capacity and the cost of having idling VMs
Performance vs. Cost
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
• Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
Uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
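A quick way to see the payoff of point 1: repetitive markup compresses dramatically. The page content below is a made-up placeholder, but the effect holds for typical HTML/JSON output.

```python
# Quick sketch of the "Gzip all output content" point: compressing
# repetitive text (HTML/JSON-like output) shrinks bytes sent and stored.
import gzip

page = ("<tr><td>row</td></tr>\n" * 500).encode("utf-8")
compressed = gzip.compress(page)
print(len(page), len(compressed))
assert len(compressed) < len(page) // 10   # highly repetitive markup compresses well
```

The same gzip output can usually be stored and served as-is, so the compute cost is paid once while the bandwidth saving is paid back on every download.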
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700~1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
• Needs special result-reduction processing
Large-volume data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow: a simple split/join pattern
Leverage the multi-core capability of one instance
• Argument "-a" of NCBI BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI BLAST overhead, data-transfer overhead)
• Best practice: test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of instance failure
Splitting task → BLAST task, BLAST task, BLAST task, BLAST task, … → Merging task
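The split/join task flow above can be sketched as follows; `blast_one_partition` is a hypothetical stand-in for invoking NCBI BLAST on one partition, not AzureBLAST's actual code.

```python
# Sketch of AzureBLAST's query-segmentation split/join pattern:
# split the input sequences, query partitions in parallel, merge results.
from concurrent.futures import ThreadPoolExecutor

def split(sequences, partition_size):
    """Splitting task: fixed-size partitions of the input queries."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_one_partition(partition):
    """Hypothetical stand-in for one BLAST task over a partition."""
    return [f"hit:{seq}" for seq in partition]

def merge(partial_results):
    """Merging task: concatenate per-partition hit lists."""
    return [hit for part in partial_results for hit in part]

sequences = [f"seq{i}" for i in range(10)]
partitions = split(sequences, partition_size=3)
with ThreadPoolExecutor() as pool:
    hits = merge(pool.map(blast_one_partition, partitions))
print(len(partitions), len(hits))  # 4 10
```

The `partition_size` knob is exactly the task-granularity trade-off above: too large causes load imbalance, too small pays per-task overhead repeatedly.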
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability
Task size/instance size vs. cost
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
Worker, Worker, Worker, …
Global dispatch queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
…
Scaling Engine
(BLAST databases, temporary data, etc.)
Job Registry / NCBI databases
Splitting task → BLAST task, BLAST task, BLAST task, BLAST task, … → Merging task
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R. palustris as a platform for H2 production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
BLASTed ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
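The single-desktop estimate above checks out arithmetically:

```python
# Sanity check of the single-desktop estimate quoted on the slide.
minutes = 3216731
years = minutes / (60 * 24 * 365)   # minutes per year = 525,600
print(round(years, 1))  # 6.1
```

So the "6.1 years" figure follows directly from the sampled per-task run time scaled to the full comparison count.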
Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM) across four data centers: US (2), Western Europe, and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances occur, redistribute the load manually
Instances per deployment: 50, 62, 62, 62, 62, 62, 50, 62
End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6~8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
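The anomaly detection described here, spotting a task that starts but never logs completion, can be sketched with a few lines of parsing; the excerpt reuses the log lines from the slide.

```python
# Sketch of the log analysis described above: pair "Executing" lines with
# their "is done" lines and flag tasks that never completed (e.g., lost to
# an upgrade or storage failure). Lines follow the slide's log format.
import re

log = """3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins"""

started, finished = set(), set()
for line in log.splitlines():
    m = re.search(r"Executing the task (\d+)", line)
    if m:
        started.add(m.group(1))
    m = re.search(r"Execution of task (\d+) is done", line)
    if m:
        finished.add(m.group(1))

lost = sorted(started - finished)
print(lost)  # ['251774']: started but never completed, so something went wrong
```

Applied across all 62 nodes' logs, this kind of pairing is what revealed the update-domain and fault-domain behavior shown on the preceding slides.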
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
61
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
11
21
2 GetMessage(Q 30 s) msg 23 C2 consumed msg 24 DeleteMessage(Q msg 2)7 GetMessage(Q 30 s) msg 1
1 GetMessage(Q 30 s) msg 15 C1 crashed
11
21
6 msg1 visible 30 s after Dequeue30
12
11
12
62
C1
C2
Removing Poison Messages
340
Producers Consumers
P2
P1
12
2 Dequeue(Q 30 sec) msg 23 C2 consumed msg 24 Delete(Q msg 2)7 Dequeue(Q 30 sec) msg 18 C2 crashed
1 Dequeue(Q 30 sec) msg 15 C1 crashed10 C1 restarted11 Dequeue(Q 30 sec) msg 112 DequeueCount gt 213 Delete (Q msg1)1
2
6 msg1 visible 30s after Dequeue9 msg1 visible 30s after Dequeue
30
13
12
13
Queues Recap
bullNo need to deal with failuresMake messageprocessing idempotent
bull Invisible messages result in out of orderDo not rely on order
bullEnforce threshold on messagersquos dequeue countUse Dequeue count to remove poison messages
bullMessages gt 8KBbullBatch messagesbullGarbage collect orphaned blobs
bullDynamically increasereduce workers
Use blob to storemessage data with
reference in message
Use message countto scale
bullNo need to deal with failures
bull Invisible messages result in out of order
bullEnforce threshold on messagersquos dequeue count
bullDynamically increasereduce workers
Windows Azure Storage TakeawaysData abstractions to build your applications
Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at
httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet
Best Practices
Picking the Right VM Size
bull Having the correct VM size can make a big difference in costs
bull Fundamental choice ndash larger fewer VMs vs many smaller instances
bull If you scale better than linear across cores larger VMs could save you money
bull Pretty rare to see linear scaling across 8 cores
bull More instances may provide better uptime and reliability (more failures needed to take your service down)
bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it
bull Common mistake ndash split up code into multiple roles each not using up CPU
bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
  • Data parallelism
  • Task parallelism
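The Task Parallel Library is a .NET facility; the same two patterns can be sketched in Python with `concurrent.futures` (an analogue, not the TPL itself). `map` over a collection is data parallelism; `submit`-ing unrelated tasks is task parallelism:

```python
from concurrent.futures import ThreadPoolExecutor

# Data parallelism: apply the same function to many items in parallel.
def word_count(doc):
    return len(doc.split())

docs = ["the quick brown fox", "jumps over", "the lazy dog"]
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(word_count, docs))          # one task per document

# Task parallelism: run different kinds of work concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    total_future = pool.submit(sum, counts)            # task A: total word count
    longest_future = pool.submit(max, docs, key=len)   # task B: longest document
total, longest = total_future.result(), longest_future.result()
```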
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure and poor user experience from not having excess capacity against the cost of idling VMs (performance vs. cost)
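A scaling policy based on queue length (the "use message count to scale" guidance from the queues recap) can be sketched in a few lines. The parameters here are illustrative assumptions, not recommended values:

```python
import math

def target_instances(queue_length, msgs_per_instance_per_min, drain_minutes,
                     min_instances=2, max_instances=20):
    """How many workers to run so the current backlog drains within drain_minutes.
    The floor keeps availability headroom; the ceiling controls cost."""
    needed = math.ceil(queue_length / (msgs_per_instance_per_min * drain_minutes))
    return max(min_instances, min(max_instances, needed))
```

Because VMs take minutes to boot, a real policy would also dampen oscillation (e.g., scale down only after the queue has stayed short for several sampling intervals).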
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
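The flat-fee vs. per-transaction choice is easy to quantify. The prices below are assumptions in the style of 2010-era rate cards (check current pricing before relying on them); the point is that transaction volume, not data size, can dominate the bill:

```python
# Hypothetical prices (assumptions, not an official rate card):
SQL_AZURE_FLAT = 9.99        # $/month, flat fee for a small database
TABLES_STORAGE = 0.15        # $/GB/month for Windows Azure Tables
TABLES_PER_10K_TXN = 0.01    # $ per 10,000 storage transactions

def monthly_table_cost(gb_stored, transactions):
    """Monthly Windows Azure Tables cost: storage plus per-transaction charges."""
    return gb_stored * TABLES_STORAGE + (transactions / 10_000) * TABLES_PER_10K_TXN

# A chatty app: only 1 GB of data, but 500M transactions per month.
chatty = monthly_table_cost(1, 500_000_000)
# A quiet app: the same 1 GB, but 1M transactions per month.
quiet = monthly_table_cost(1, 1_000_000)
```

Under these assumed prices the chatty profile costs hundreds of dollars on Tables while the flat fee stays constant, and the quiet profile is far cheaper on Tables; the same workload shape flips the right answer.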
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile.
Sending fewer things over the wire often means getting fewer things from storage.
Saving bandwidth costs often leads to savings in other places: sending fewer things means your VM has time to do other tasks.
All of these tips have the side benefit of improving your web app's performance and user experience.
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
[Chart: uncompressed vs. compressed content size – Gzip/minify JavaScript, minify CSS, minify images]
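Gzip's benefit on typical web payloads is easy to demonstrate with the standard library; repetitive JSON or HTML routinely shrinks by a large factor:

```python
import gzip
import json

# A typical JSON API response: repetitive field names compress very well.
payload = json.dumps(
    [{"id": i, "status": "ok", "region": "us-north"} for i in range(500)]
).encode("utf-8")

compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)   # fraction of original size after gzip
```

Every byte saved here is saved three times over: in storage reads, in bandwidth billed, and in time the VM spends serving the response.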
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large-volume data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud
Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern.
Leverage the multiple cores of one instance:
• Argument "-a" of NCBI-BLAST
• 1/2/4/8 for small, medium, large, and extra-large instance sizes
Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of an instance failure
[Diagram: splitting task → BLAST tasks run in parallel → merging task]
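The split/join pattern in the diagram can be sketched as follows. This is a schematic, not the AzureBLAST code: `blast_partition` is a stand-in for invoking NCBI-BLAST on one partition, and a thread pool stands in for the queue-fed worker instances:

```python
from concurrent.futures import ThreadPoolExecutor

def split(sequences, partition_size):
    """Splitting task: cut the input query sequences into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition):
    # Stand-in for running NCBI-BLAST on one partition of query sequences;
    # the real engine shells out to the BLAST executable against the database.
    return [(seq, f"hit-for-{seq}") for seq in partition]

def merge(per_partition_results):
    """Merging task: concatenate per-partition results in input order."""
    merged = []
    for part in per_partition_results:
        merged.extend(part)
    return merged

seqs = [f"seq{i}" for i in range(10)]
partitions = split(seqs, 3)                      # 4 partitions: 3 + 3 + 3 + 1
with ThreadPoolExecutor() as pool:               # partitions queried in parallel
    results = list(pool.map(blast_partition, partitions))
hits = merge(results)
```

The `partition_size` argument is exactly the granularity knob discussed above: too large and workers idle while one finishes, too small and per-invocation overhead dominates.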
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size/instance size vs. cost:
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
[Architecture diagram: a Web Role (web portal, web service, job registration, job scheduler, scaling engine) feeds a global dispatch queue consumed by worker instances; a Job Management Role and a Database-updating Role run alongside; Azure Table holds the job registry; Azure Blob holds the NCBI databases, BLAST databases, temporary data, etc.]
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
Authentication/authorization is based on Live ID.
The accepted job is stored into the job registry table:
• Fault tolerance – avoid in-memory state
[Diagram: the job portal in the Web Role, alongside the web service, job registration, job scheduler, scaling engine, and job registry]
Demonstration
R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time.
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query:
• The database is also the input query
• The protein database is large (42 GB in size)
• In total, 9,865,668 sequences to be queried
• Theoretically 100 billion sequence comparisons
Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually
End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by Analyzing Logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., a task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
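Detecting that anomaly (a task that started but never logged a "done" record) is a simple set difference over the log. A minimal sketch, assuming the record format shown on the slide:

```python
import re

LOG = """\
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

def incomplete_tasks(log_text):
    """Return task IDs that were started but never logged a matching 'done' record."""
    started, finished = set(), set()
    for line in log_text.splitlines():
        if m := re.search(r"Executing the task (\d+)", line):
            started.add(m.group(1))
        elif m := re.search(r"Execution of task (\d+) is done", line):
            finished.add(m.group(1))
    return sorted(started - finished)
```

Run over the full job logs, this immediately surfaces which instances lost work, and when, which is how the upgrade- and storage-failure patterns below were found.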
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total.
All 62 compute nodes lost tasks and then came back in groups – this is an update domain:
• ~30 mins per group
• ~6 nodes in one group
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed.
35 nodes experienced blob-writing failures at the same time.
A reasonable guess: the fault domain is working.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish proverb
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m^3 s^-1 m^-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m^-2)
cp = specific heat capacity of air (J kg^-1 K^-1)
ρa = dry air density (kg m^-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s^-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s^-1)
γ = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
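The Penman-Monteith equation itself is a one-liner once the inputs are in hand; the hard part, as the slide says, is estimating the conductivities. A direct transcription, with the slide's γ ≈ 66 Pa/K and a standard λv ≈ 2450 J/g as default values (the sample inputs in the usage are illustrative, not field data):

```python
def penman_monteith(delta, rn, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith (1964) evapotranspiration.
    delta: d(sat. specific humidity)/dT (Pa/K); rn: net radiation (W/m^2);
    rho_a: dry air density (kg/m^3); c_p: specific heat of air (J/(kg K));
    dq: vapor pressure deficit (Pa); g_a, g_s: air / stomatal conductivity (m/s);
    gamma: psychrometric constant (Pa/K); lambda_v: latent heat (J/g)."""
    numerator = delta * rn + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative mid-latitude daytime values (assumed, for demonstration only):
et = penman_monteith(delta=145.0, rn=400.0, rho_a=1.2, c_p=1013.0,
                     dq=1000.0, g_a=0.01, g_s=0.005)
```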
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Pipeline diagram: scientists interact with the AzureMODIS service web role portal; a request queue feeds a download queue into the data collection stage (pulling from the source imagery download sites and source metadata), then a reprojection queue into the reprojection stage, then reduction 1 and reduction 2 queues into the derivation and analysis reduction stages, producing science results for scientific-results download.]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The MODISAzure service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
[Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → persist to <PipelineStage> Job Queue → Service Monitor (Worker Role) → parse & persist <PipelineStage> Task Status → dispatch to <PipelineStage> Task Queue]
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role (GenericWorker)
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status
[Diagram: the Service Monitor (Worker Role) parses & persists <PipelineStage> Task Status and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue tasks and read <Input> Data Storage]
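The GenericWorker's retry policy can be sketched as a small state function. This is a schematic of the described behavior, not the MODISAzure code; the status strings and `MAX_ATTEMPTS` constant are illustrative:

```python
MAX_ATTEMPTS = 3  # matches the "retries failed tasks 3 times" policy above

def run_task(task, attempts_so_far, do_work):
    """Attempt one task. Returns (status, attempts): 'done' on success,
    'retry' to requeue a transient failure, 'failed' once the retry
    budget is exhausted (status would be persisted to the Tables)."""
    try:
        do_work(task)
        return ("done", attempts_so_far + 1)
    except Exception:
        attempts = attempts_so_far + 1
        if attempts >= MAX_ATTEMPTS:
            return ("failed", attempts)   # give up; record terminal failure
        return ("retry", attempts)        # task becomes visible on the queue again
```

Combined with per-task status rows in Tables, this makes each task a recoverable unit of work: a lost instance simply means its in-flight task reappears and is retried elsewhere.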
Example Pipeline Stage: Reprojection Service
• Each entity in the job table specifies a single reprojection job request
• Each entity in the task table specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
[Diagram: a Reprojection Request reaches the Service Monitor (Worker Role), which persists ReprojectionJobStatus via the Job Queue, parses & persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker (Worker Role) instances consume the tasks, referencing ScanTimeList, SwathGranuleMeta, Swath Source Data Storage, and Reprojection Data Storage]
Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Stage-by-stage (approximate):
• Data collection: 400–500 GB, 60K files, 10 MB/sec; 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection: 400 GB, 45K files; 3,500 hours, 20–100 workers; $420 CPU, $60 download
• Derivation reduction: 5–7 GB, 55K files; 1,800 hours, 20–100 workers; $216 CPU, $1 download, $6 storage
• Analysis reduction: <10 GB, ~1K files; 1,800 hours, 20–100 workers; $216 CPU, $2 download, $9 storage
Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: "Channel 9 Windows Azure"
Bing: "Windows Azure Platform Training Kit – November Update"
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Queues Recap
bullNo need to deal with failuresMake messageprocessing idempotent
bull Invisible messages result in out of orderDo not rely on order
bullEnforce threshold on messagersquos dequeue countUse Dequeue count to remove poison messages
bullMessages gt 8KBbullBatch messagesbullGarbage collect orphaned blobs
bullDynamically increasereduce workers
Use blob to storemessage data with
reference in message
Use message countto scale
bullNo need to deal with failures
bull Invisible messages result in out of order
bullEnforce threshold on messagersquos dequeue count
bullDynamically increasereduce workers
Windows Azure Storage TakeawaysData abstractions to build your applications
Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at
httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet
Best Practices
Picking the Right VM Size
bull Having the correct VM size can make a big difference in costs
bull Fundamental choice ndash larger fewer VMs vs many smaller instances
bull If you scale better than linear across cores larger VMs could save you money
bull Pretty rare to see linear scaling across 8 cores
bull More instances may provide better uptime and reliability (more failures needed to take your service down)
bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it
bull Common mistake ndash split up code into multiple roles each not using up CPU
bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest
Exploiting Concurrencybull Spin up additional processes each with a specific task or as a
unit of concurrency
bull May not be ideal if number of active processes exceeds number of cores
bull Use multithreading aggressively
bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads
bull In NET 4 use the Task Parallel Library
bull Data parallelism
bull Task parallelism
Finding Good Code Neighborsbull Typically code falls into one or more of these categories
bull Find code that is intensive with different resources to live togetherbull Example distributed network caches are typically network-
and memory-intensive they may be a good neighbor for storage IO-intensive code
MemoryIntensive
CPUIntensive
Network IO Intensive Storage IO Intensive
Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not
over-scaled)
bull Spinning VMs up and down automatically is good at large scale
bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
bull Being too aggressive in spinning down VMs can result in poor user experience
bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs
Performance Cost
Storage Costs
bullUnderstand an applicationrsquos storage profile and how storage billing works
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web apprsquos billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often lead to savings inother places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web apprsquos performance and user experience
Compressing Content
1Gzip all output content
bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better
compression and freedom from patented algorithms
2Tradeoff compute costs for storage size
3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
GzipMinify JavaScript
Minify CCSMinify Images
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences
Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel
bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing
Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach
to 1TB
bull The output of BLAST is usually 10-100x larger than the input
AzureBLAST
bull Parallel BLAST engine on Azure
bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done
bull Follows the general suggested application model bull Web Role + Queue + Worker
bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific
Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010
AzureBLAST Task-FlowA simple SplitJoin pattern
Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size
Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead
Best Practice test runs to profiling and set size to mitigate the overhead
Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best
choice
Instance size vs Performancebull Super-linear speedup with larger size
worker instancesbull Primarily due to the memory capability
Task SizeInstance Size vs Costbull Extra-large instance generated the best
and the most economical throughputbull Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Blast databases temporary data etc)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs
AuthenticationAuthorization based on Live ID
The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory
states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW
Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less
than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences
ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried
bull Theoretically 100 billion sequence comparisons
Performance estimationbull Based on the sampling-running on one extra-large Azure
instancebull Would require 3216731 minutes (61 years) on one desktop
This scale of experiments usually are infeasible to most scientists
Our Approachbull Allocated a total of ~4000 instances
bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe
bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service
bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions
bull When load imbalances redistribute the load manually
50
6262 62
6262
5062
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables

(Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)
MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue, which GenericWorker (Worker Role) instances consume, reading from <Input>Data Storage.)
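The dequeue-and-retry behavior can be sketched as a loop; this is a local stand-in (the real GenericWorker would rely on Azure queue visibility timeouts and the message's dequeue count, and the names here are ours):

```python
import queue

MAX_ATTEMPTS = 3  # failed tasks are retried 3 times

def run_worker(task_queue, handler, attempts, status):
    """Drain the task queue, re-enqueueing failures up to MAX_ATTEMPTS."""
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return
        attempts[task] = attempts.get(task, 0) + 1
        try:
            handler(task)
            status[task] = "done"       # task status persisted (in Tables)
        except Exception:
            if attempts[task] < MAX_ATTEMPTS:
                task_queue.put(task)    # make visible again: retried later
            else:
                status[task] = "failed"
```

Because the task, not the worker, carries the retry budget, any idle worker can pick up a failed task, which is what makes the pipeline resilient to node loss.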
Example Pipeline Stage: Reprojection Service

(Diagram: a Reprojection Request reaches the Service Monitor (Worker Role) via the Job Queue; each entity in ReprojectionJobStatus specifies a single reprojection job request, and each entity in ReprojectionTaskStatus specifies a single reprojection task, i.e. a single tile. The monitor persists both and dispatches to the Task Queue consumed by GenericWorker (Worker Role) instances. Workers use Reprojection Data Storage: query the SwathGranuleMeta table for geo-metadata (e.g. boundaries) of each swath tile, and the ScanTimeList table for the list of satellite scan times that cover a target tile; inputs come from Swath Source Data Storage.)
Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run reductions multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures (from the pipeline diagram):
• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 CPU, $60 download
• Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds can act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November 2010 Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research, Roger Barga, Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components: Fabric Controller
- Key Components: Fabric Controller (2)
- Key Components: Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable, Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage, At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection: Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure: Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery, Sensors, Models, and Field Data
- MODISAzure: Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage: Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources: Cloud Research Community Site
- Resources: AzureScope
- Resources: AzureScope (2)
- Demonstration (2)
- Slide 104
Queues Recap

• No need to deal with failures: make message processing idempotent
• Invisible messages result in out-of-order delivery: do not rely on order
• Enforce a threshold on a message's dequeue count: use the dequeue count to remove poison messages
• Messages > 8 KB: use a blob to store the message data with a reference in the message; batch messages; garbage-collect orphaned blobs
• Dynamically increase/reduce workers: use the message count to scale
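The first three rules combine into a small handler; a sketch with illustrative names (in practice the idempotency record would be a table or blob marker, and the dequeue count comes from the queue service):

```python
POISON_THRESHOLD = 3

def handle_message(msg_id, dequeue_count, work, processed, dead_letter):
    """Idempotent processing plus poison-message removal by dequeue count."""
    if dequeue_count > POISON_THRESHOLD:
        dead_letter.append(msg_id)      # pull the poison message aside
        return "poisoned"
    if msg_id in processed:
        return "duplicate"              # redelivered message: safe to skip
    work(msg_id)                        # side effects happen at most once
    processed.add(msg_id)
    return "processed"
```

Idempotency is what makes at-least-once delivery safe: a message that reappears after a worker crash is simply skipped.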
Windows Azure Storage Takeaways

Data abstractions to build your applications:
• Blobs – files and large objects
• Drives – NTFS APIs for migrating applications
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages

Easy to use via the Storage Client Library.

More info on Windows Azure Storage at:
http://blogs.msdn.com/windowsazurestorage
http://azurescope.cloudapp.net
Best Practices
Picking the Right VM Size

• Having the correct VM size can make a big difference in costs
• Fundamental choice – larger, fewer VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
  • Pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures needed to take your service down)
• Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
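Once you have measurements, the "experiment and measure" advice reduces to simple arithmetic; a sketch with made-up prices and throughputs (not Azure's actual rates):

```python
import math

def cheapest_config(target_rate, per_core_rate, options):
    """Pick the cheapest (cores, count) configuration meeting target_rate.

    options: (cores, hourly_price, efficiency) triples, where efficiency < 1
    models sub-linear scaling across cores (rarely linear across 8).
    """
    best = None
    for cores, price, eff in options:
        per_vm = per_core_rate * cores * eff
        count = math.ceil(target_rate / per_vm)
        cost = count * price
        if best is None or cost < best[0]:
            best = (cost, cores, count)
    return best
```

Plug in your own measured per-core throughput and scaling efficiency; the answer flips between sizes as efficiency drops, which is exactly why measurement beats guessing.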
Using Your VM to the Maximum

Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?

• Common mistake: splitting code into multiple roles, each not using up its CPU
• Balance between using up CPU and keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency

• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
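The two TPL flavors map onto any language's pool abstractions; a Python sketch (threads are used so the example stays in one process; CPU-bound work in CPython would use a process pool instead):

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(doc):
    return len(doc.split())

# Data parallelism: the same operation applied across a collection.
docs = ["a b c", "d e", "f"]
with ThreadPoolExecutor() as pool:
    counts = list(pool.map(word_count, docs))

# Task parallelism: independent, dissimilar units of work run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    total = pool.submit(sum, range(100))
    biggest = pool.submit(max, [4, 9, 2])
    results = (total.result(), biggest.result())
```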
Finding Good Code Neighbors

• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately

• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of having idling VMs (performance vs. cost)
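The queue backlog ("use message count to scale") is the usual control signal for this trade-off; a sketch of such a controller with illustrative bounds:

```python
import math

def target_workers(queue_length, msgs_per_worker_min, floor=1, ceiling=20):
    """Size the worker pool from the queue backlog.

    floor/ceiling are illustrative. A real controller should also damp
    scale-down: VMs take minutes to start, so shedding them too eagerly
    hurts user experience when load returns.
    """
    wanted = math.ceil(queue_length / msgs_per_worker_min)
    return max(floor, min(ceiling, wanted))
```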
Storage Costs

• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs
Saving Bandwidth Costs

• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage: saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content

1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs

(Chart: content size, uncompressed vs. compressed – gzip/minified JavaScript, minified CSS, minified images.)
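Point 1's payoff is easy to demonstrate: markup is repetitive, so gzip trades a little CPU for a large reduction in wire size:

```python
import gzip

html = b"<html><body>" + b"<p>hello, world</p>" * 200 + b"</body></html>"
compressed = gzip.compress(html)

# Repetitive markup compresses extremely well; the browser decompresses
# on the fly, so only the bytes on the wire (and the bill) shrink.
ratio = len(compressed) / len(html)
```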
Best Practices Summary

• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences

Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing

It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing

Large-volume data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST

• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud", in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
AzureBLAST Task-Flow: a simple Split/Join pattern

Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes

Task granularity
• Large partitions: load imbalance
• Small partitions: unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead

Value of the visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long waiting period in case of instance failure

(Diagram: a Splitting task fans out to BLAST tasks, which feed a Merging Task.)
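The split/join pattern itself is a few lines; a sketch (function names are ours, and the 100-sequence partition size echoes the micro-benchmark finding reported below):

```python
def split_queries(sequences, partition_size=100):
    """Query segmentation: fixed-size partitions become independent
    BLAST tasks, each enqueued for a worker."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge_results(per_partition_hits):
    """Join step: concatenate per-partition hit lists in partition order."""
    merged = []
    for hits in per_partition_hits:
        merged.extend(hits)
    return merged
```

The partition size is the tuning knob the slide describes: larger partitions risk load imbalance, smaller ones pay the per-task overhead more often.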
Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capability

Task size/instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST (2)

(Architecture diagram: a Web Portal and Web Service in a Web Role handle job registration; a Job Management Role contains the Job Scheduler and Scaling Engine, with job state in an Azure Table (the Job Registry); a global dispatch queue feeds the Worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a database-updating role refreshes the databases. A Splitting task fans BLAST tasks out to the workers, and a Merging Task joins the results.)
AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory state

(Diagram: the Job Portal fronts the Web Portal and Web Service; job registration writes to the Job Registry, which the Job Scheduler and Scaling Engine consume.)
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

BLASTed ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 instances
  • 475 extra-large VMs (8 cores per VM), in four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually

(Map: VMs per deployment – 50, 62, 62, 62, 62, 62, 50, 62.)
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working-instance time should be 6-8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs

A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
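Spotting the bad case mechanically is a matter of pairing "Executing" lines with their "done" lines; a sketch over the format above:

```python
import re

def unfinished_tasks(log_lines):
    """Return task ids that logged a start but never a completion."""
    started, finished = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return started - finished
```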
Surviving System Upgrades

North Europe datacenter: a total of 34,256 tasks processed.

All 62 compute nodes lost tasks and then came back in a group – this is an update domain
• ~30 mins per group
• ~6 nodes in one group
Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed before the job was killed.

35 nodes experienced blob-writing failures at the same time. A reasonable guess: the fault domain is working.
Windows Azure Storage TakeawaysData abstractions to build your applications
Blobs ndash Files and large objectsDrives ndash NTFS APIs for migrating applicationsTables ndash Massively scalable structured storageQueues ndash Reliable delivery of messages
Easy to use via the Storage Client Library
More info on Windows Azure Storage at
httpblogsmsdncomwindowsazurestoragehttpazurescopecloudappnet
Best Practices
Picking the Right VM Size
bull Having the correct VM size can make a big difference in costs
bull Fundamental choice ndash larger fewer VMs vs many smaller instances
bull If you scale better than linear across cores larger VMs could save you money
bull Pretty rare to see linear scaling across 8 cores
bull More instances may provide better uptime and reliability (more failures needed to take your service down)
bull Only real right answer ndash experiment with multiple sizes and instance counts in order to measure and find what is ideal for you
Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it
bull Common mistake ndash split up code into multiple roles each not using up CPU
bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest
Exploiting Concurrencybull Spin up additional processes each with a specific task or as a
unit of concurrency
bull May not be ideal if number of active processes exceeds number of cores
bull Use multithreading aggressively
bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads
bull In NET 4 use the Task Parallel Library
bull Data parallelism
bull Task parallelism
Finding Good Code Neighborsbull Typically code falls into one or more of these categories
bull Find code that is intensive with different resources to live togetherbull Example distributed network caches are typically network-
and memory-intensive they may be a good neighbor for storage IO-intensive code
MemoryIntensive
CPUIntensive
Network IO Intensive Storage IO Intensive
Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not
over-scaled)
bull Spinning VMs up and down automatically is good at large scale
bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
bull Being too aggressive in spinning down VMs can result in poor user experience
bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs
Performance Cost
Storage Costs
bullUnderstand an applicationrsquos storage profile and how storage billing works
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web apprsquos billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often lead to savings inother places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web apprsquos performance and user experience
Compressing Content
1Gzip all output content
bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better
compression and freedom from patented algorithms
2Tradeoff compute costs for storage size
3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
GzipMinify JavaScript
Minify CCSMinify Images
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700-1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
• Needs special result-reduction processing
Large-volume data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
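Input segmentation, the first parallelization strategy above, can be sketched in a few lines. This is an illustrative stand-in (the function and format handling are assumptions, not AzureBLAST code): split FASTA records into fixed-size partitions that independent workers can query in parallel.

```python
# Split FASTA text into partitions of at most N sequences each; each
# partition becomes one independent BLAST task.

def split_fasta(text, seqs_per_partition=100):
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">"):          # a header starts a new record
            if current:
                records.append("\n".join(current))
            current = [line]
        elif current:
            current.append(line)
    if current:
        records.append("\n".join(current))
    return ["\n".join(records[i:i + seqs_per_partition])
            for i in range(0, len(records), seqs_per_partition)]

fasta = "".join(f">seq{i}\nACGTACGT\n" for i in range(250))
parts = split_fasta(fasta)
print(len(parts))  # 250 sequences / 100 per partition -> 3 partitions
```

The default of 100 sequences per partition anticipates the micro-benchmark result reported later in this section.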
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model
• Web Role + Queue + Worker
• With three special considerations
• Batch job management
• Task parallelism on an elastic Cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow
A simple split/join pattern
Leverage the multi-core of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
• NCBI-BLAST overhead
• Data transfer overhead
• Best practice: use test runs to profile, and set the size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
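The visibilityTimeout trade-off above can be illustrated with the queue semantics alone. A toy sketch (not the Azure Queue API): a dequeued message stays invisible for the timeout, and if the worker has not deleted it by then, it reappears and gets processed again.

```python
# Toy model of queue visibility-timeout semantics.

def total_executions(task_runtime, visibility_timeout, worker_fails=False):
    """How many times a task ends up executing under simple retry rules."""
    if worker_fails:
        return 2  # message reappears after the timeout; another worker retries
    if visibility_timeout < task_runtime:
        # Timeout too small: the message becomes visible again mid-run,
        # so a second worker starts the same task -> repeated computation.
        return 2
    return 1

assert total_executions(task_runtime=20, visibility_timeout=30) == 1
assert total_executions(task_runtime=20, visibility_timeout=10) == 2  # too small
```

A too-large timeout never causes duplicate work, but after a real failure the message stays invisible for the whole period, which is the long-wait cost the slide mentions.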
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger-size worker instances
• Primarily due to the memory capability
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
(BLAST databases, temporary data, etc.)
Job Registry
NCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R. palustris as a platform for H2 production
Eric Schadt (Sage); Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment
Discovering homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
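The desktop estimate above checks out with a one-line unit conversion:

```python
# Sanity check: 3,216,731 minutes of single-desktop compute, in years.
minutes = 3_216_731
years = minutes / (60 * 24 * 365)
print(round(years, 1))  # -> 6.1
```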
Our Approach
• Allocated ~4000 cores in total: 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
(Figure: the 8 deployments across datacenters; instance counts per deployment: 50, 62, 62, 62, 62, 62, 50, 62)
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6-8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise, something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
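The kind of log analysis described above reduces to pairing each "Executing" record with its "done" record and flagging the tasks that never completed. A minimal sketch, with the log format assumed from the samples on this slide:

```python
# Pair start/finish records per task and report tasks that never finished.
import re

log = """\
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started, finished = set(), set()
for line in log.splitlines():
    if m := re.search(r"Executing the task (\d+)", line):
        started.add(m.group(1))
    elif m := re.search(r"Execution of task (\d+) is done", line):
        finished.add(m.group(1))

print(sorted(started - finished))  # -> ['251774']  (never completed)
```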
Surviving System Upgrades
North Europe Data Center: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in groups: this is an update domain
• ~30 mins
• ~6 nodes in one group

Surviving Storage Failures
West Europe Datacenter: 30,976 tasks completed, then the job was killed
35 nodes experienced blob-writing failures at the same time
A reasonable guess: the Fault Domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" (Irish proverb)

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration (evaporation through plant membranes) by plants

Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m^3 s^-1 m^-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m^-2)
cp = specific heat capacity of air (J kg^-1 K^-1)
ρa = dry air density (kg m^-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s^-1)
gs = conductivity of plant stoma (inverse of rs) (m s^-1)
γ = psychrometric constant (γ ≈ 66 Pa K^-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
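The Penman-Monteith relation transcribes directly into a function. The numeric inputs below are placeholder values chosen only to exercise the formula, not a validated example:

```python
# ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

def penman_monteith_et(delta, R_n, rho_a, c_p, dq, g_a, g_s, gamma, lambda_v):
    numerator = delta * R_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Placeholder inputs (assumptions, for illustration only):
et = penman_monteith_et(delta=145.0, R_n=400.0, rho_a=1.2, c_p=1013.0,
                        dq=1000.0, g_a=0.02, g_s=0.01, gamma=66.0,
                        lambda_v=2450.0)
print(et > 0)  # -> True
```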
ET Synthesizes Imagery, Sensors, Models, and Field Data
NASA MODIS imagery source archives: 5 TB (600K files)
FLUXNET curated sensor dataset: 30 GB (960 files)
FLUXNET curated field dataset: 2 KB (1 file)
NCEP/NCAR: ~100 MB (4K files)
Vegetative clumping: ~5 MB (1 file)
Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientists
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientists
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction Stage / Derivation Reduction Stage / Reprojection Stage
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues requests to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks (recoverable units of work)
• Execution status of all jobs and tasks is persisted in Tables

<PipelineStage> Request
…
<PipelineStage>JobStatus Persist
<PipelineStage>Job Queue
MODISAzure Service (Web Role)
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
…
Dispatch <PipelineStage>Task Queue
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
Service Monitor (Worker Role)
Parse & Persist <PipelineStage>TaskStatus
GenericWorker (Worker Role)
…
Dispatch <PipelineStage>Task Queue
…
<Input>Data Storage
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
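The GenericWorker behavior above (dequeue, execute, retry up to 3 times, record status) can be sketched as a plain loop. The queue, executor, and status table here are stand-ins, not the Azure SDK:

```python
# Sketch of a dispatch loop with 3-attempt retry and persisted status.

MAX_ATTEMPTS = 3

def run_worker(task_queue, execute, status_table):
    """Drain the queue, recording task status; retry each task up to 3x."""
    while task_queue:
        task = task_queue.pop(0)
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                execute(task)
                status_table[task] = "done"
                break
            except Exception:
                status_table[task] = f"failed (attempt {attempt})"
        # After 3 failed attempts the last recorded status remains failed.

# Tiny demo: one task that always fails, one that succeeds.
status = {}
def execute(task):
    if task == "bad-tile":
        raise RuntimeError("reprojection error")

run_worker(["tile-42", "bad-tile"], execute, status)
print(status)  # -> {'tile-42': 'done', 'bad-tile': 'failed (attempt 3)'}
```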
Example Pipeline Stage: Reprojection Service
Reprojection Request
…
Service Monitor (Worker Role)
ReprojectionJobStatus Persist
Parse & Persist ReprojectionTaskStatus
GenericWorker (Worker Role)
…
Job Queue
…
Dispatch
Task Queue
Points to
…
ScanTimeList
SwathGranuleMeta
Reprojection Data Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (i.e., a single tile)
Query this table to get geo-metadata (e.g., boundaries) for each swath tile
Query this table to get the list of satellite scan times that cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reductions multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns," but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds can act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope (2)
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Best Practices

Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice: fewer, larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer: experiment with multiple sizes and instance counts to measure and find what is ideal for you
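The "experiment and measure" advice reduces to comparing cost per unit of work across measured configurations. A minimal sketch; the prices and throughput numbers below are placeholders, not actual Azure rates:

```python
# Compare cost efficiency of two measured configurations.

def cost_per_unit(total_hourly_price, total_units_per_hour):
    return total_hourly_price / total_units_per_hour

xl = cost_per_unit(0.96, 900)              # one 8-core XL, measured 900 units/hr
smalls = cost_per_unit(8 * 0.12, 8 * 100)  # eight 1-core Smalls, 100 units/hr each

print(xl < smalls)  # -> True: the XL wins here (it scaled super-linearly)
```

If the measured XL throughput had been under 800 units/hour (linear or worse), the eight Smalls would win on cost, plus the reliability benefit of more instances.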
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance != one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake: splitting up code into multiple roles, each not using up its CPU
• Balance between using up CPU and having free capacity in times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports will let the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
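The data-parallelism idea above can be shown in a language-neutral sketch: fan a CPU-bound function out across worker processes so all cores stay busy (the .NET analogue the slide names is the Task Parallel Library; the workload here is an invented stand-in).

```python
# Fan a CPU-bound function across a process pool (one worker per core).
from concurrent.futures import ProcessPoolExecutor

def score(n):
    """Stand-in for a CPU-bound task (e.g., one alignment or one tile)."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [10_000] * 8
    with ProcessPoolExecutor() as pool:          # defaults to one worker per core
        results = list(pool.map(score, inputs))  # data parallelism
    print(len(results))  # -> 8
```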
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not
over-scaled)
bull Spinning VMs up and down automatically is good at large scale
bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
bull Being too aggressive in spinning down VMs can result in poor user experience
bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs
Performance Cost
Storage Costs
bullUnderstand an applicationrsquos storage profile and how storage billing works
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web apprsquos billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often lead to savings inother places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web apprsquos performance and user experience
Compressing Content
1Gzip all output content
bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better
compression and freedom from patented algorithms
2Tradeoff compute costs for storage size
3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
GzipMinify JavaScript
Minify CCSMinify Images
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences
Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel
bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing
Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach
to 1TB
bull The output of BLAST is usually 10-100x larger than the input
AzureBLAST
bull Parallel BLAST engine on Azure
bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done
bull Follows the general suggested application model bull Web Role + Queue + Worker
bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific
Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010
AzureBLAST Task-FlowA simple SplitJoin pattern
Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size
Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead
Best Practice test runs to profiling and set size to mitigate the overhead
Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best
choice
Instance size vs Performancebull Super-linear speedup with larger size
worker instancesbull Primarily due to the memory capability
Task SizeInstance Size vs Costbull Extra-large instance generated the best
and the most economical throughputbull Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Blast databases temporary data etc)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs
AuthenticationAuthorization based on Live ID
The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory
states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW
Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less
than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences
ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried
bull Theoretically 100 billion sequence comparisons
Performance estimationbull Based on the sampling-running on one extra-large Azure
instancebull Would require 3216731 minutes (61 years) on one desktop
This scale of experiments usually are infeasible to most scientists
Our Approachbull Allocated a total of ~4000 instances
bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe
bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service
bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions
bull When load imbalances redistribute the load manually
50
6262 62
6262
5062
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Picking the Right VM Size
• Having the correct VM size can make a big difference in costs
• Fundamental choice – fewer larger VMs vs. many smaller instances
• If you scale better than linearly across cores, larger VMs could save you money
• It is pretty rare to see linear scaling across 8 cores
• More instances may provide better uptime and reliability (more failures are needed to take your service down)
• The only real right answer – experiment with multiple sizes and instance counts to measure and find what is ideal for you
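The arithmetic behind that advice can be sketched quickly. The snippet below uses made-up hourly rates and a hypothetical 85% per-core scaling efficiency (not Azure's actual prices) to show why sub-linear scaling can make several small instances cheaper per unit of work than one large one:

```python
# Illustrative sketch (not Azure pricing): compare one 8-core VM against
# eight 1-core VMs when per-core scaling efficiency is sub-linear.
def throughput(cores: int, efficiency: float) -> float:
    """Relative throughput of one VM where each added core contributes
    `efficiency` times the work of the previous one."""
    return sum(efficiency ** i for i in range(cores))

def cost_per_unit_work(hourly_rate: float, cores: int, efficiency: float) -> float:
    return hourly_rate / throughput(cores, efficiency)

# Hypothetical rates: an 8-core VM billed at 8x the 1-core rate.
small = cost_per_unit_work(hourly_rate=0.12, cores=1, efficiency=1.0)
large = cost_per_unit_work(hourly_rate=0.96, cores=8, efficiency=0.85)

# With 85% per-core efficiency, the small instances do more work per dollar.
print(small < large)  # True
```

Swap in your own measured efficiency and rates; the crossover point is what the slide's "experiment and measure" advice is about.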
Using Your VM to the Maximum
Remember:
• 1 role instance == 1 VM running Windows
• 1 role instance ≠ one specific task for your code
• You're paying for the entire VM, so why not use it?
• Common mistake – splitting code into multiple roles, each not using up its CPU
• Balance using up CPU vs. keeping free capacity for times of need
• There are multiple ways to use your CPU to the fullest
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
• May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
• In networking code, correct usage of NT I/O Completion Ports lets the kernel schedule the precise number of threads
• In .NET 4, use the Task Parallel Library
• Data parallelism
• Task parallelism
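The two styles named above can be illustrated with a small stand-in; the deck targets .NET 4's Task Parallel Library, but the same split is sketched here with Python's standard thread pool:

```python
# Data vs. task parallelism, sketched with Python's stdlib thread pool
# (a stand-in for the .NET 4 Task Parallel Library the slide refers to).
from concurrent.futures import ThreadPoolExecutor

def data_parallel(items):
    """Data parallelism: the same operation applied to every element."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda x: x * x, items))

def task_parallel():
    """Task parallelism: independent, dissimilar tasks run concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(sum, range(10))        # one task: aggregate
        f2 = pool.submit(max, [3, 1, 4, 1, 5])  # another: search
        return f1.result(), f2.result()

print(data_parallel([1, 2, 3]))  # [1, 4, 9]
print(task_parallel())           # (45, 5)
```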
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive in different resources to live together
• Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• Trade off the risk of failure or poor user experience from lacking excess capacity against the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
• E.g. SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
Pipeline: uncompressed content → Gzip, minify JavaScript, minify CSS, minify images → compressed content
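As a concrete illustration of point 1, here is stdlib gzip applied to repetitive HTML-like output; the sample page and the 10x ratio are just this example's content, not a general guarantee:

```python
# Gzip-compressing repetitive text output shrinks both storage and
# bandwidth; exact ratios depend entirely on the content.
import gzip

page = (b"<html><body>" + b"<div class='row'>hello azure</div>" * 200
        + b"</body></html>")
packed = gzip.compress(page)

print(len(page), len(packed))
# This highly repetitive page compresses to well under a tenth of its size.
print(len(packed) < len(page) // 10)  # True
```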
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile inside and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive:
• Large number of pairwise alignment operations
• A BLAST run can take 700–1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST:
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g. mpiBLAST)
• Needs special result-reduction processing
Large-volume data:
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• A parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern:
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations:
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow
A simple Split/Join pattern:
• Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST: 1/2/4/8 for small, medium, large, and extra-large instance sizes
• Task granularity:
• Too large a partition: load imbalance
• Too small a partition: unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile, then set the partition size to mitigate the overhead
• Value of visibilityTimeout for each BLAST task:
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: an unnecessarily long wait in case of instance failure
Diagram: splitting task → BLAST tasks (in parallel) → merging task
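The Split/Join pattern above can be sketched in a few lines; `blast_partition` below is a toy stand-in for invoking NCBI-BLAST on one partition, and all names are illustrative:

```python
# Split/Join: split the input into fixed-size partitions, process each
# independently, then merge the per-partition results.
from concurrent.futures import ThreadPoolExecutor

def split(sequences, partition_size):
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def blast_partition(partition):
    # Stand-in for running NCBI-BLAST on one partition of query sequences.
    return [f"hit:{seq}" for seq in partition]

def split_join(sequences, partition_size=2):
    parts = split(sequences, partition_size)          # splitting task
    with ThreadPoolExecutor() as pool:
        results = pool.map(blast_partition, parts)    # parallel BLAST tasks
    return [hit for part in results for hit in part]  # merging task

print(split_join(["s1", "s2", "s3", "s4", "s5"]))
# ['hit:s1', 'hit:s2', 'hit:s3', 'hit:s4', 'hit:s5']
```

The `partition_size` knob is exactly the task-granularity trade-off the slide describes.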
Micro-Benchmarks Inform Design
Task size vs. performance:
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance:
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size / instance size vs. cost:
• The extra-large instance generated the best and most economical throughput
• Fully utilizes the resources
AzureBLAST (architecture)
• Web Role – web portal, web service, job registration, job scheduler
• Job Management Role – scaling engine, feeding a global dispatch queue
• Database updating Role
• Workers – execute the splitting task, the parallel BLAST tasks, and the merging task
• Azure Table – job registry, NCBI databases
• Azure Blob – BLAST databases, temporary data, etc.
AzureBLAST Job Portal
An ASP.NET program hosted by a web role instance:
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored into the job registry table
• Fault tolerance: avoid in-memory states
(Components shown: web portal, web service, job registration, job scheduler, scaling engine, job registry)
Demonstration
R. palustris as a platform for H2 production
Eric Schadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5000 proteins (700K sequences):
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment
Discovering homologs:
• Discover the interrelationships of known protein sequences
"All against All" query:
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation:
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (~6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists.
Our Approach
• Allocated a total of ~4000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When the load is imbalanced, redistribute it manually
[Map figure: per-deployment worker counts of 50–62 across the four datacenters]
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record is a matched pair of "Executing" and "is done" entries; otherwise something is wrong (e.g. the task failed to complete):

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

An abnormal trace – no "is done" record ever appears for task 251774:

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
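The check described above is mechanical to automate. A minimal sketch (the log field layout is assumed from the reconstructed samples shown):

```python
# Pair each "Executing" record with its "is done" record and flag tasks
# that never completed.
import re

def incomplete_tasks(log_lines):
    started, done = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            done.add(m.group(1))
    return sorted(started - done)

log = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
print(incomplete_tasks(log))  # ['251774']
```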
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
• All 62 compute nodes lost tasks and then came back in a group – this is an Update Domain
• ~30 mins per group
• ~6 nodes in one group
Surviving Storage Failures
West Europe datacenter: 30,976 tasks completed before the job was killed
• 35 nodes experienced blob-writing failures at the same time
• A reasonable guess: the Fault Domain is at work
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" – Irish proverb

Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs; big data reduction
• Some of the inputs are not so simple
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
• 20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage:
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage:
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage:
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage:
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
Pipeline flow: Scientists → AzureMODIS Service Web Role Portal → Request Queue → Download Queue → Data Collection Stage (pulls from Source Imagery Download Sites, records Source Metadata) → Reprojection Queue → Reprojection Stage → Reduction 1 Queue → Derivation Reduction Stage → Reduction 2 Queue → Analysis Reduction Stage → Science results → Scientific Results Download
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service (Web Role) is the front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks – recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables
Flow: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage> JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch → <PipelineStage> Task Queue
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a GenericWorker (Worker Role):
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
Flow: Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch → <PipelineStage> Task Queue → GenericWorker (Worker Role) ↔ <Input> Data Storage
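The retry rule can be sketched against queue semantics like Azure's, where an undeleted message reappears after its visibility timeout and carries a dequeue count. All names here are illustrative, not the MODISAzure code:

```python
# One worker iteration: process a dequeued task message; shelve it as a
# poison task once it has already been attempted MAX_RETRIES times.
MAX_RETRIES = 3

def handle(message, run_task, dead_letter):
    """Returns True if the message should be deleted from the queue."""
    if message["dequeue_count"] > MAX_RETRIES:
        dead_letter.append(message)   # poison task: record and stop retrying
        return True                   # delete so it never reappears
    try:
        run_task(message["task_id"])
        return True                   # success: delete the message
    except Exception:
        return False                  # leave it; it reappears after the timeout

dead = []
ok = handle({"task_id": 7, "dequeue_count": 1}, run_task=lambda t: None, dead_letter=dead)
print(ok, dead)  # True []
```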
Example Pipeline Stage: Reprojection Service
Flow: Reprojection Request → Service Monitor (Worker Role) → Persist ReprojectionJobStatus → Job Queue (each entity specifies a single reprojection job request) → Parse & Persist ReprojectionTaskStatus → Dispatch → Task Queue (each entity specifies a single reprojection task, i.e. a single tile) → GenericWorker (Worker Role) → Reprojection Data Storage
• ScanTimeList – query this table to get the list of satellite scan times that cover a target tile
• SwathGranuleMeta – query this table to get geo-metadata (e.g. boundaries) for each swath tile
• Swath Source Data Storage holds the source swath imagery
Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run the reduction multiple times
• Storage costs are driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
Per-stage figures (from the pipeline diagram; scientists submit via the AzureMODIS Service Web Role Portal):
• Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers – $50 upload, $450 storage
• Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers – $420 cpu, $60 download
• Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers – $216 cpu, $1 download, $6 storage
• Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers – $216 cpu, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Using Your VM to the MaximumRememberbull 1 role instance == 1 VM running Windowsbull 1 role instance = one specific task for your codebull Yoursquore paying for the entire VM so why not use it
bull Common mistake ndash split up code into multiple roles each not using up CPU
bull Balance between using up CPU vs having free capacity in times of needbull Multiple ways to use your CPU to the fullest
Exploiting Concurrencybull Spin up additional processes each with a specific task or as a
unit of concurrency
bull May not be ideal if number of active processes exceeds number of cores
bull Use multithreading aggressively
bull In networking code correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads
bull In NET 4 use the Task Parallel Library
bull Data parallelism
bull Task parallelism
Finding Good Code Neighborsbull Typically code falls into one or more of these categories
bull Find code that is intensive with different resources to live togetherbull Example distributed network caches are typically network-
and memory-intensive they may be a good neighbor for storage IO-intensive code
MemoryIntensive
CPUIntensive
Network IO Intensive Storage IO Intensive
Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not
over-scaled)
bull Spinning VMs up and down automatically is good at large scale
bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
bull Being too aggressive in spinning down VMs can result in poor user experience
bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs
Performance Cost
Storage Costs
bullUnderstand an applicationrsquos storage profile and how storage billing works
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web apprsquos billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often lead to savings inother places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web apprsquos performance and user experience
Compressing Content
1Gzip all output content
bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better
compression and freedom from patented algorithms
2Tradeoff compute costs for storage size
3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
GzipMinify JavaScript
Minify CCSMinify Images
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences
Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel
bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing
Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach
to 1TB
bull The output of BLAST is usually 10-100x larger than the input
AzureBLAST
bull Parallel BLAST engine on Azure
bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done
bull Follows the general suggested application model bull Web Role + Queue + Worker
bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific
Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010
AzureBLAST Task-FlowA simple SplitJoin pattern
Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size
Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead
Best Practice test runs to profiling and set size to mitigate the overhead
Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best
choice
Instance size vs Performancebull Super-linear speedup with larger size
worker instancesbull Primarily due to the memory capability
Task SizeInstance Size vs Costbull Extra-large instance generated the best
and the most economical throughputbull Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Blast databases temporary data etc)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs
AuthenticationAuthorization based on Live ID
The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory
states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW
Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less
than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences
ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried
bull Theoretically 100 billion sequence comparisons
Performance estimationbull Based on the sampling-running on one extra-large Azure
instancebull Would require 3216731 minutes (61 years) on one desktop
This scale of experiments usually are infeasible to most scientists
Our Approachbull Allocated a total of ~4000 instances
bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe
bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service
bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions
bull When load imbalances redistribute the load manually
50
6262 62
6262
5062
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." – Irish proverb

Computing Evapotranspiration (ET)
Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs; big data reduction
• Some of the inputs are not so simple

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.
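The slide gives only the formula; a direct transcription into code is below. This is a sketch: the default γ and λv values and the demo magnitudes are illustrative, not a validated parameterization.

```python
def evapotranspiration(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """Penman-Monteith, term for term as on the slide:
    ET = (Delta*Rn + rho_a*c_p*dq*g_a) / ((Delta + gamma*(1 + g_a/g_s)) * lambda_v)
    gamma ~ 66 Pa/K; lambda_v ~ 2450 J/g (latent heat of vaporization)."""
    return (delta * r_n + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)

# Demo with made-up magnitudes, just to exercise the formula:
et = evapotranspiration(delta=1.0, r_n=100.0, rho_a=1.2, c_p=1005.0,
                        dq=1000.0, g_a=0.01, g_s=0.01)
```

The pipeline's job is producing trustworthy inputs (especially ga and gs) for every cell of the catchment; the arithmetic itself is the easy part.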
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors
[Pipeline diagram: Scientists submit requests through the AzureMODIS Service Web Role Portal into the Request Queue; the Download Queue feeds the Data Collection Stage, which pulls from Source Imagery Download Sites and records Source Metadata; the Reprojection Queue feeds the Reprojection Stage; the Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages; science results flow back to Scientists via Scientific Results Download.]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables
[Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage> JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch → <PipelineStage> Task Queue]
MODISAzure Architectural Big Picture (2/2)
• All work actually done by a Worker Role
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status
[Diagram: the Service Monitor (Worker Role) parses & persists <PipelineStage> TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances pull tasks from the queue and read <Input> Data Storage]
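The GenericWorker loop just described (dequeue, execute, retry up to 3 times, persist status) can be sketched as follows; the in-memory list and dict are stand-ins for the real Task Queue and Azure Tables:

```python
MAX_RETRIES = 3   # matches "retries failed tasks 3 times" above

def run_worker(task_queue, execute, status_table):
    """Dequeue each task, execute it, and on failure re-enqueue it up to
    MAX_RETRIES times, persisting status after every attempt (the real
    service persists status to Azure Tables, not a dict)."""
    while task_queue:
        task = task_queue.pop(0)                 # stand-in for a queue dequeue
        try:
            execute(task)
            status_table[task["id"]] = "done"
        except Exception:
            task["retries"] = task.get("retries", 0) + 1
            if task["retries"] <= MAX_RETRIES:
                status_table[task["id"]] = "retrying"
                task_queue.append(task)          # make it visible again
            else:
                status_table[task["id"]] = "failed"

attempts, status = [], {}
run_worker([{"id": 42}], lambda t: attempts.append(t) or 1 / 0, status)
print(status, len(attempts))   # {42: 'failed'} 4  (1 attempt + 3 retries)
```

Persisting status on every transition is what makes a job resumable after the kinds of node losses shown in the upgrade and storage-failure logs earlier.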
Example Pipeline Stage: Reprojection Service
[Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses & persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker (Worker Role) instances execute tasks against Reprojection Data Storage and Swath Source Data Storage, with the ScanTimeList and SwathGranuleMeta tables pointing to the data]
• Each job entity specifies a single reprojection job request
• Each task entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

[Annotated pipeline diagram; figures paired with stages in the order they appear on the slide:
Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 cpu, $60 download
Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 cpu, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 cpu, $2 download, $9 storage
Total: $1420]
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premise compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit – November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research (Roger Barga, Architect)
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Exploiting Concurrency
• Spin up additional processes, each with a specific task or as a unit of concurrency
  • May not be ideal if the number of active processes exceeds the number of cores
• Use multithreading aggressively
  • In networking code, correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads
  • In .NET 4, use the Task Parallel Library
    • Data parallelism
    • Task parallelism
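For illustration, here is the data-parallelism vs. task-parallelism distinction sketched with Python's concurrent.futures; the deck's actual recommendation is the .NET 4 Task Parallel Library, so treat this only as an analogous example:

```python
from concurrent.futures import ThreadPoolExecutor

def word_lengths(words):
    """Data parallelism: one operation mapped over a collection
    (analogous to Parallel.ForEach / PLINQ in the .NET 4 TPL)."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(len, words))

def independent_tasks():
    """Task parallelism: distinct operations running concurrently
    (analogous to starting independent Tasks in the TPL)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        total = pool.submit(sum, range(100))     # one task...
        peak = pool.submit(max, range(100))      # ...and an unrelated one
        return total.result(), peak.result()

print(word_lengths(["azure", "worker", "role"]))   # [5, 6, 4]
print(independent_tasks())                         # (4950, 99)
```

Either style keeps all cores of a large VM busy without spinning up extra processes.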
Finding Good Code Neighbors
• Typically code falls into one or more of these categories: memory-intensive, CPU-intensive, network I/O-intensive, storage I/O-intensive
• Find code that is intensive with different resources to live together
  • Example: distributed network caches are typically network- and memory-intensive; they may be a good neighbor for storage I/O-intensive code
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
  • Spinning VMs up and down automatically is good at large scale
  • Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
  • Being too aggressive in spinning down VMs can result in poor user experience
• Trade off the risk of failure or poor user experience from lacking excess capacity against the cost of idling VMs (performance vs. cost)
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile
  • E.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
  • Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage
  • Saving bandwidth costs often leads to savings in other places
• Sending fewer things means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
[Diagram: Uncompressed Content → Gzip, Minify JavaScript, Minify CSS, Minify Images → Compressed Content]
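A quick way to see the storage-for-compute trade-off from point 2: compress a repetitive payload and compare sizes (the payload here is illustrative; real savings depend entirely on your content):

```python
import gzip

def gzip_ratio(payload: bytes) -> float:
    """Return compressed/original size: CPU spent compressing buys
    smaller storage and less bandwidth."""
    return len(gzip.compress(payload)) / len(payload)

# Markup is repetitive, so it compresses extremely well.
html = b"<div class='row'><span>cell</span></div>" * 1000
print(f"{gzip_ratio(html):.3f}")   # a small fraction of the original size
```

Since both storage and bandwidth are billed by the byte, a ratio like this compounds across every read and every response served.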
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700~1000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the general suggested application model
  • Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.
AzureBLAST Task-Flow
A simple split/join pattern
• Leverage the multi-core capability of one instance
  • Argument "-a" of NCBI-BLAST
  • 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
• Task granularity
  • Large partition: load imbalance
  • Small partition: unnecessary overheads (NCBI-BLAST overhead, data transfer overhead)
  • Best practice: do test runs to profile, and set the partition size to mitigate the overhead
• Value of visibilityTimeout for each BLAST task
  • Essentially an estimate of the task run time
  • Too small: repeated computation
  • Too large: unnecessarily long wait in case of instance failure
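The visibilityTimeout trade-off is easiest to see in a toy model of the queue semantics; this class is a simplified stand-in for Azure Queues, with an explicit tick clock instead of wall time:

```python
class VisibilityQueue:
    """Toy model of Azure Queue visibility-timeout semantics: a dequeued
    message is hidden for `timeout` ticks and reappears unless deleted."""
    def __init__(self):
        self.visible = []      # messages ready to be dequeued
        self.hidden = []       # (reappear_tick, message) pairs
        self.clock = 0

    def put(self, msg):
        self.visible.append(msg)

    def get(self, timeout):
        self.clock += 1        # one tick per poll, for simplicity
        # Messages whose visibility timeout expired become visible again.
        due = [m for t, m in self.hidden if t <= self.clock]
        self.hidden = [(t, m) for t, m in self.hidden if t > self.clock]
        self.visible.extend(due)
        if not self.visible:
            return None
        msg = self.visible.pop(0)
        self.hidden.append((self.clock + timeout, msg))
        return msg

    def delete(self, msg):
        # A worker deletes the message only after finishing the task.
        self.hidden = [(t, m) for t, m in self.hidden if m != msg]

q = VisibilityQueue()
q.put("blast-task-1")
first = q.get(timeout=2)    # a worker picks up the task
unseen = q.get(timeout=2)   # hidden: no other worker can take it
back = q.get(timeout=2)     # worker died without delete(): it reappears
print(first, unseen, back)  # blast-task-1 None blast-task-1
```

A timeout shorter than the real run time makes `back` reappear while the first worker is still computing (repeated work); a much longer one delays recovery after a genuine failure.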
[Diagram: Splitting task → BLAST task × N (in parallel) → Merging task]
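The split/join pattern above, reduced to its skeleton (the uppercase step is a stand-in for actually running BLAST on a partition):

```python
def split(sequences, partition_size):
    """Split stage: carve the input into fixed-size partitions
    (100 sequences per partition was the benchmark sweet spot)."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def merge(partial_results):
    """Join stage: concatenate per-partition results in order."""
    return [hit for part in partial_results for hit in part]

seqs = [f"seq{i}" for i in range(250)]
parts = split(seqs, 100)                            # partitions of 100, 100, 50
partials = [[s.upper() for s in p] for p in parts]  # stand-in for running BLAST
merged = merge(partials)
print(len(parts), len(merged))   # 3 250
```

In AzureBLAST each partition becomes one queued task, so partition size directly sets the task granularity discussed above.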
Micro-Benchmarks Inform Design
• Task size vs. performance
  • Benefit of the warm-cache effect
  • 100 sequences per partition is the best choice
• Instance size vs. performance
  • Super-linear speedup with larger-size worker instances
  • Primarily due to the memory capability
• Task size / instance size vs. cost
  • Extra-large instances generated the best and the most economical throughput
  • Fully utilize the resource
AzureBLAST
[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine, tracking work in Azure Tables (Job Registry) and dispatching through a global dispatch queue to Worker instances; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a Database Updating Role refreshes the NCBI databases. Work follows the split/join pattern: Splitting task → BLAST tasks → Merging task.]
AzureBLAST Job Portal
• ASP.NET program hosted by a web role instance
  • Submit jobs
  • Track a job's status and logs
• Authentication/authorization based on Live ID
• The accepted job is stored into the job registry table
  • Fault tolerance: avoid in-memory states
[Diagram: Job Portal → Web Service (job registration) → Job Registry; the Job Scheduler and Scaling Engine consume registered jobs]
Demonstration
R. palustris as a Platform for H2 Production
Eric Shadt (SAGE), Sam Phattarasukol (Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment
Discovering homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4000 cores
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When the load imbalances, redistribute it manually
[Diagram: instance counts across the 8 deployments – 50, 62, 62, 62, 62, 62, 50, 62]
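The segment/partition layout described above can be planned with simple ceiling arithmetic. A sketch: the sequence count, deployment count, and 100-sequence partition size come from the slides; everything else is illustrative:

```python
def plan_jobs(num_sequences, num_deployments, partition_size):
    """Divide the input into one segment (job) per deployment; each
    segment is made of fixed-size partitions (tasks)."""
    per_deployment = -(-num_sequences // num_deployments)   # ceiling division
    jobs = []
    for d in range(num_deployments):
        start = d * per_deployment
        count = min(per_deployment, num_sequences - start)
        if count <= 0:
            break
        jobs.append({"deployment": d,
                     "sequences": count,
                     "tasks": -(-count // partition_size)})
    return jobs

jobs = plan_jobs(9_865_668, 8, 100)   # figures from the slides above
print(len(jobs))                      # 8
```

Each job then runs against its own deployment's co-located storage, which is what keeps the aggregate storage bandwidth from becoming the bottleneck.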
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Finding Good Code Neighborsbull Typically code falls into one or more of these categories
bull Find code that is intensive with different resources to live togetherbull Example distributed network caches are typically network-
and memory-intensive they may be a good neighbor for storage IO-intensive code
MemoryIntensive
CPUIntensive
Network IO Intensive Storage IO Intensive
Scaling Appropriatelybull Monitor your application and make sure yoursquore scaled appropriately (not
over-scaled)
bull Spinning VMs up and down automatically is good at large scale
bull Remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running
bull Being too aggressive in spinning down VMs can result in poor user experience
bull Trade-off between risk of failurepoor user experience due to not having excess capacity and the costs of having idling VMs
Performance Cost
Storage Costs
bullUnderstand an applicationrsquos storage profile and how storage billing works
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web apprsquos billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often lead to savings inother places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web apprsquos performance and user experience
Compressing Content
1Gzip all output content
bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better
compression and freedom from patented algorithms
2Tradeoff compute costs for storage size
3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
GzipMinify JavaScript
Minify CCSMinify Images
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences
Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel
bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing
Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach
to 1TB
bull The output of BLAST is usually 10-100x larger than the input
AzureBLAST
bull Parallel BLAST engine on Azure
bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done
bull Follows the general suggested application model bull Web Role + Queue + Worker
bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific
Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010
AzureBLAST Task-FlowA simple SplitJoin pattern
Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size
Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead
Best Practice test runs to profiling and set size to mitigate the overhead
Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure
[Diagram: a splitting task fans out to parallel BLAST tasks, which feed a merging task.]
Micro-Benchmarks Inform Design

Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the larger memory capacity

Task size and instance size vs. cost
• Extra-large instances generated the best and the most economical throughput
• Fully utilize the resource
AzureBLAST

[Architecture diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine; an Azure Table holds the Job Registry; a global dispatch queue feeds Worker instances that run the splitting task, the parallel BLAST tasks, and the merging task; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc.; a Database Updating Role keeps the databases current.]
AzureBLAST Job Portal

An ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
• Authentication/authorization based on Live ID
• An accepted job is stored in the job registry table, for fault tolerance (avoid in-memory state)
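The fault-tolerance point (persist the accepted job first, keep no in-memory state) can be sketched as follows; `JobRegistry` and its status values are illustrative stand-ins for the Azure table, not the AzureBLAST schema:

```python
class JobRegistry:
    """Toy stand-in for the Azure table backing the job portal.

    The slide's point: persist the accepted job *before* anything else,
    so a role restart loses no in-memory state.
    """
    def __init__(self):
        self.rows = {}                    # job_id -> status

    def register(self, job_id):
        self.rows[job_id] = "Accepted"    # the durable write comes first

    def mark(self, job_id, status):
        self.rows[job_id] = status

    def pending(self):
        # After a restart, the scheduler rebuilds its work list from here.
        return [j for j, s in self.rows.items() if s == "Accepted"]

registry = JobRegistry()
registry.register("job-1")
registry.register("job-2")
registry.mark("job-1", "Running")
# A freshly restarted scheduler picks up only job-2.
```

Because every state transition lands in the table, any web or worker instance can crash and restart without losing accepted jobs.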
[Diagram: the Web Portal and Web Service (job registration) front the Job Scheduler and Scaling Engine in the Job Portal, backed by the Job Registry table.]
Demonstration
R. palustris as a platform for H2 production
Eric Schadt (Sage); Sam Phattarasukol (Harwood Lab, UW)

Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time.
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists.
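The arithmetic behind the single-desktop estimate is worth making explicit (the slide's figure reads as 6.1 years once the stripped decimal point is restored):

```python
minutes = 3_216_731                 # sampled single-desktop estimate from the slide
years = minutes / (60 * 24 * 365)   # minutes in a (non-leap) year
print(f"{years:.1f} years")         # ~6.1 years on one desktop
```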
Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST; each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments; each segment was submitted to one deployment as one job for execution; each segment consists of smaller partitions
• When load imbalances appeared, redistributed the load manually

[Chart: extra-large VMs allocated per deployment, 50 or 62 each.]
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• Based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place

[Chart: extra-large VMs allocated per deployment, 50 or 62 each.]
Understanding Azure by analyzing logs

A normal log record looks like this, with each "Executing" entry followed by a matching "done" entry:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
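Pairing "Executing" records with their "done" records mechanizes this check. A small parser (the regexes assume the log format shown above) flags tasks that never completed:

```python
import re

def unfinished_tasks(log_lines):
    """Return sorted task ids that have an 'Executing' record but no
    matching 'done' record, i.e. tasks that never completed."""
    started, finished = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)

log = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
print(unfinished_tasks(log))   # ['251774']: the task that was never completed
```

Running this over the full job logs is how the upgrade and storage incidents on the next slides were spotted.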
Surviving System Upgrades

North Europe datacenter: 34,256 tasks processed in total.
All 62 compute nodes lost their tasks and then came back in groups (~6 nodes per group, ~30 mins): this is an update domain at work.

Surviving Storage Failures

West Europe datacenter: 30,976 tasks were completed before the job was killed.
35 nodes experienced blob-write failures at the same time. A reasonable guess: the fault domain is working.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where:
ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Pipeline diagram: scientists submit requests through the AzureMODIS Service web role portal; a request queue feeds the download, reprojection, reduction 1, and reduction 2 queues; source imagery is pulled from download sites and flows through the data collection, reprojection, derivation reduction, and analysis reduction stages; science results are returned to scientists for download.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door: it receives all user requests and queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role: it parses all job requests into tasks (recoverable units of work) and persists the execution status of all jobs and tasks in Tables

[Diagram: a <PipelineStage> Request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]
MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role (the GenericWorker)
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue tasks and read/write <Input> Data Storage.]
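The GenericWorker's dequeue-and-retry behavior can be sketched with an in-memory queue; names and structure here are illustrative, not the MODISAzure source:

```python
MAX_RETRIES = 3   # the GenericWorker retries a failed task 3 times

def run_worker(task_queue, execute, task_status):
    """Toy GenericWorker loop: dequeue, execute, retry up to MAX_RETRIES,
    then record the final status."""
    while task_queue:
        task = task_queue.pop(0)
        entry = task_status.setdefault(task, {"attempts": 0, "state": "Queued"})
        try:
            execute(task)
            entry["state"] = "Done"
        except Exception:
            entry["attempts"] += 1
            if entry["attempts"] <= MAX_RETRIES:
                task_queue.append(task)      # requeue for another try
            else:
                entry["state"] = "Failed"    # give up after 3 retries

status = {}
calls = {"t1": 0}
def flaky(task):                             # fails twice, then succeeds
    calls[task] += 1
    if calls[task] < 3:
        raise RuntimeError("transient failure")

run_worker(["t1"], flaky, status)
print(status["t1"]["state"])                 # Done: succeeded on the third attempt
```

Keeping the retry count in the task-status record, rather than in worker memory, mirrors the architecture's rule that all task status lives in durable Tables.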
Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters the Job Queue, where each entity specifies a single reprojection job request; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue, where each entity specifies a single reprojection task (i.e., a single tile); GenericWorker (Worker Role) instances query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile and the ScanTimeList table for the list of satellite scan times that cover a target tile, then read Swath Source Data Storage and write Reprojection Data Storage.]
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage | Data and work | Cost
Data collection | 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers | $50 upload, $450 storage
Reprojection | 400 GB, 45K files, 3500 hours, 20-100 workers | $420 CPU, $60 download
Derivation reduction | 5-7 GB, 55K files, 1800 hours, 20-100 workers | $216 CPU, $1 download, $6 storage
Analysis reduction | <10 GB, ~1K files, 1800 hours, 20-100 workers | $216 CPU, $2 download, $9 storage

Total: $1,420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit (November 2010 Update)
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Scaling Appropriately
• Monitor your application and make sure you're scaled appropriately (not over-scaled)
• Spinning VMs up and down automatically is good at large scale
• Remember that VMs take a few minutes to come up, and cost ~$3 a day (give or take) to keep running
• Being too aggressive in spinning down VMs can result in a poor user experience
• There is a trade-off between the risk of failure or a poor user experience from not having excess capacity, and the cost of keeping VMs idling

Performance vs. cost
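A sketch of the trade-off, using the slide's rough $3/day figure; the 30-minute cooldown threshold is an illustrative choice, not a recommendation:

```python
VM_COST_PER_DAY = 3.0        # the slide's rough figure, give or take

def idle_cost(num_idle_vms, days):
    """Dollars spent keeping spare capacity running."""
    return num_idle_vms * VM_COST_PER_DAY * days

def should_scale_down(idle_minutes, cooldown_minutes=30):
    """Only spin a VM down after a sustained idle period: VMs take a few
    minutes to come back, so aggressive scale-down hurts user experience."""
    return idle_minutes >= cooldown_minutes

print(idle_cost(5, 30))       # 450.0: five spare VMs for a month
print(should_scale_down(10))  # False: too soon, keep the buffer
```

Putting a number on the idle side of the trade-off makes it easier to decide how much excess capacity the risk of a bad user experience is worth.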
Storage Costs
• Understand an application's storage profile and how storage billing works
• Make service choices based on your app profile: e.g., SQL Azure has a flat fee, while Windows Azure Tables charges per transaction
• Service choice can make a big cost difference based on your app profile
• Caching and compressing: they help a lot with storage costs
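A back-of-the-envelope comparison makes the point; the prices below are illustrative 2010-era assumptions, so check the current billing pages before relying on them:

```python
# Illustrative 2010-era prices (assumptions, not current billing):
SQL_AZURE_FLAT_PER_MONTH = 9.99   # flat monthly fee for a small SQL Azure database
TABLE_PRICE_PER_10K_TX = 0.01     # Windows Azure Tables per-transaction charge

def monthly_table_cost(transactions_per_month):
    return transactions_per_month / 10_000 * TABLE_PRICE_PER_10K_TX

def cheaper_service(transactions_per_month):
    """Pick the cheaper storage service for a given transaction volume."""
    if monthly_table_cost(transactions_per_month) < SQL_AZURE_FLAT_PER_MONTH:
        return "tables"
    return "sql"

print(cheaper_service(1_000_000))    # tables: $1/month in transactions
print(cheaper_service(20_000_000))   # sql: $20/month in transactions exceeds the flat fee
```

The crossover point depends entirely on the app's transaction profile, which is why measuring that profile comes first.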
Saving Bandwidth Costs
• Bandwidth costs are a huge part of any popular web app's billing profile
• Sending fewer things over the wire often means getting fewer things from storage, so saving bandwidth costs often leads to savings in other places
• Sending fewer things also means your VM has time to do other tasks
• All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content

1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms

2. Trade off compute costs for storage size

3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs

[Chart: uncompressed vs. compressed content; gzip, minified JavaScript, minified CSS, and minified images each shrink the payload.]
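A minimal sketch of point 1 using Python's standard gzip module; the payload here is synthetic:

```python
import gzip

# A synthetic, repetitive HTML payload, the kind gzip shrinks dramatically.
html = b"<html><body>" + b"<p>hello cloud</p>" * 500 + b"</body></html>"
compressed = gzip.compress(html)

print(len(html), "->", len(compressed))
# Browsers decompress on the fly, so serve the small version and save bandwidth.
```

Highly repetitive markup compresses extremely well; real pages see smaller but still substantial savings, paid for with a little CPU time on the VM.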
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Storage Costs
bullUnderstand an applicationrsquos storage profile and how storage billing works
bullMake service choices based on your app profilebull Eg SQL Azure has a flat fee while Windows Azure Tables charges per
transaction
bull Service choice can make a big cost difference based on your app profile
bull Caching and compressing They help a lot with storage costs
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app's billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often leads to savings in other places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web app's performance and user experience
Compressing Content
1. Gzip all output content
   • All modern browsers can decompress on the fly
   • Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
   • Use Portable Network Graphics (PNGs)
   • Crush your PNGs
   • Strip needless metadata
   • Make all PNGs palette PNGs
[Diagram: Uncompressed Content → Gzip, Minify JavaScript, Minify CSS, Minify Images → Compressed Content]
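The gzip step is easy to try from any language; a small Python sketch (the sample HTML body is invented for illustration):

```python
import gzip

# Gzip-compress a response body before sending it over the wire.
# Repetitive text (HTML, JSON, JavaScript) typically shrinks dramatically.
body = (b"<html><body>" + b"<div class='row'>hello azure</div>" * 500
        + b"</body></html>")

compressed = gzip.compress(body)
ratio = len(compressed) / len(body)
print(f"{len(body)} bytes -> {len(compressed)} bytes ({ratio:.1%})")

# The receiving browser decompresses transparently when the response
# carries the header: Content-Encoding: gzip
assert gzip.decompress(compressed) == body
```

Fewer bytes stored and sent is the whole point: the same compression pays off on the storage bill and the bandwidth bill at once.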
Best Practices Summary
Doing 'less' is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A single BLAST run can take 700-1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
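Query segmentation amounts to chunking the input into independently queryable pieces. A toy Python sketch (the `partition` helper and the fake sequence list are illustrative, not AzureBLAST code; the partition size of 100 echoes the micro-benchmark result quoted later):

```python
# Query segmentation in miniature: split an input set of sequences into
# fixed-size partitions that can be BLASTed independently and in parallel.

def partition(sequences, size=100):
    """Yield successive partitions of at most `size` sequences."""
    for i in range(0, len(sequences), size):
        yield sequences[i:i + size]

queries = [f">seq{i}" for i in range(250)]   # toy stand-in for FASTA records
parts = list(partition(queries))

print([len(p) for p in parts])   # three partitions: 100, 100, 50
```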
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • split the input sequences
  • query partitions in parallel
  • merge results together when done
• Follows the generally suggested application model
  • Web Role + Queue + Worker
• With special considerations
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.
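The Web Role + Queue + Worker shape can be sketched in a few lines. Here Python's in-process `queue.Queue` stands in for an Azure storage queue and a string operation stands in for NCBI-BLAST; all names are hypothetical and this is only the shape of the pattern, not the AzureBLAST implementation:

```python
import queue

task_queue = queue.Queue()   # stand-in for the global dispatch queue
results = []

def web_role_submit(job_input, partitions):
    # web role: split the input and enqueue one task per partition
    for part in partitions(job_input):
        task_queue.put(part)

def worker_role():
    # worker role: dequeue and process tasks until the queue is drained
    while not task_queue.empty():
        part = task_queue.get()
        results.append(f"blast({part})")   # placeholder for real work
        task_queue.task_done()

web_role_submit("query.fasta", lambda j: [f"{j}.{i}" for i in range(4)])
worker_role()
print(results)
```

In the real pattern many worker instances poll the same durable queue concurrently, which is where the elasticity comes from.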
AzureBLAST Task-Flow: a simple Split/Join pattern
Leverage the multiple cores of one instance
• the "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
  • NCBI-BLAST overhead
  • Data transfer overhead
• Best practice: use test runs to profile, and set partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• too small: repeated computation
• too large: unnecessarily long wait in case of instance failure
[Diagram: Splitting task → BLAST tasks (many, in parallel) → Merging task]
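The visibilityTimeout trade-off above can be simulated: a dequeued message is hidden, not deleted, and reappears if the worker overruns the timeout. A self-contained sketch with a simulated clock (not the real Azure queue API; class and message names are invented):

```python
# Sketch of queue visibility-timeout semantics. If the worker doesn't
# delete the message within the timeout (timeout too small, or the
# instance died), the message becomes visible again -- the repeated
# computation vs. fault tolerance trade-off described above.

class SimQueue:
    def __init__(self):
        self.messages = {}   # message -> time it becomes visible again
        self.now = 0.0       # simulated clock

    def put(self, msg):
        self.messages[msg] = 0.0   # visible immediately

    def get(self, visibility_timeout):
        for msg, visible_at in list(self.messages.items()):
            if visible_at <= self.now:
                # hide the message, don't remove it
                self.messages[msg] = self.now + visibility_timeout
                return msg
        return None

    def delete(self, msg):
        self.messages.pop(msg, None)

q = SimQueue()
q.put("blast-task-42")

msg = q.get(visibility_timeout=10)   # timeout set below the ~15s run time
q.now += 15                          # worker is still running past the timeout
duplicate = q.get(visibility_timeout=10)
print(msg, duplicate)                # the same task handed out twice
```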
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
[Diagram: Web Role (Web Portal, Web Service, Job registration) → Job Management Role (Job Scheduler, Scaling Engine) → global dispatch queue → Worker instances. Azure Table holds the Job Registry; Azure Blob holds the NCBI databases, BLAST databases, temporary data, etc.; a Database updating Role keeps the databases current. Task flow: Splitting task → BLAST tasks (in parallel) → Merging task]
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory states
[Diagram: Job Portal (Web Portal, Web Service) → Job registration → Job Scheduler and Scaling Engine → Job Registry]
Demonstration
R. palustris as a platform for H2 production (Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW)
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
Our Approach
• Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divide the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
[Diagram: VMs per deployment across the 8 deployments: six of 62 and two of 50]
End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise, something is wrong (e.g., a task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
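Flagging abnormal records like these is a small scripting job: pair each "Executing" line with its "is done" line per task id, and anything unpaired never completed. A sketch (the log text is paraphrased from the excerpt above; the parsing approach is illustrative, not the tool the authors used):

```python
import re

log = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started, finished = set(), set()
for line in log.splitlines():
    if m := re.search(r"Executing the task (\d+)", line):
        started.add(m.group(1))
    elif m := re.search(r"Execution of task (\d+) is done", line):
        finished.add(m.group(1))

# tasks that were started but never reported completion
print(sorted(started - finished))
```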
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in groups (~6 nodes per group, ~30 mins apart): this is an update domain at work
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed
35 nodes experienced blob writing failures at the same time
A reasonable guess: the Fault Domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
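A numerical reading of the Penman-Monteith form above, with illustrative mid-latitude daytime inputs (none of these values come from the MODISAzure datasets; λv is taken in J/kg here so the result lands in kg m⁻² s⁻¹, i.e., mm of water depth per second):

```python
# Plug illustrative values into ET = (Δ·Rn + ρa·cp·δq·ga) /
# ((Δ + γ·(1 + ga/gs)) · λv). All inputs are assumptions for the sketch.

delta = 145.0     # Pa/K, slope of the saturation curve
R_n   = 400.0     # W/m^2, net radiation
rho_a = 1.2       # kg/m^3, dry air density
c_p   = 1013.0    # J/(kg K), specific heat capacity of air
dq    = 1000.0    # Pa, vapor pressure deficit
g_a   = 0.02      # m/s, aerodynamic conductivity
g_s   = 0.01      # m/s, stomatal conductivity
gamma = 66.0      # Pa/K, psychrometric constant
lam_v = 2.45e6    # J/kg, latent heat of vaporization

ET = (delta * R_n + rho_a * c_p * dq * g_a) / \
     ((delta + gamma * (1 + g_a / g_s)) * lam_v)

print(f"ET = {ET:.2e} kg m^-2 s^-1  (~{ET * 86400:.2f} mm/day equivalent)")
```

The point of the sketch is the shape of the computation: per pixel it is cheap; the cost is in assembling the many gridded inputs, which is why the pipeline is dominated by data reduction.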
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Diagram: Scientists submit requests through the AzureMODIS Service Web Role Portal (Request Queue). Download Queue → Data Collection Stage (pulling from Source Imagery Download Sites, guided by Source Metadata) → Reprojection Queue → Reprojection Stage → Reduction 1 Queue → Derivation Reduction Stage → Reduction 2 Queue → Analysis Reduction Stage → science results available for Scientific Results Download]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks: recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
[Diagram: <PipelineStage> Request → MODISAzure Service (Web Role) → Persist <PipelineStage> JobStatus → <PipelineStage> Job Queue → Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch → <PipelineStage> Task Queue]
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: Service Monitor (Worker Role) → Parse & Persist <PipelineStage> TaskStatus → Dispatch → <PipelineStage> Task Queue → GenericWorker (Worker Role), which reads from <Input> Data Storage]
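The "retries failed tasks 3 times" behavior can be sketched as a dispatch loop that re-enqueues a failing task with an attempt count (hypothetical names and in-memory structures, not the GenericWorker source; a real queue service tracks the dequeue count for you):

```python
MAX_RETRIES = 3

def dispatch(tasks, run):
    """Run each task, re-enqueueing failures up to MAX_RETRIES attempts."""
    status = {}
    pending = [(t, 0) for t in tasks]   # (task, attempts so far)
    while pending:
        task, attempts = pending.pop(0)
        try:
            run(task)
            status[task] = "done"
        except Exception:
            if attempts + 1 < MAX_RETRIES:
                pending.append((task, attempts + 1))   # re-enqueue
            else:
                status[task] = "failed"   # give up: poison task
    return status

calls = {"t1": 0, "t2": 0}
def flaky(task):
    calls[task] += 1
    if task == "t2":
        raise RuntimeError("always fails")

result = dispatch(["t1", "t2"], flaky)
print(result)   # t1 done; t2 failed after 3 attempts
```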
Example Pipeline Stage: Reprojection Service
[Diagram: Reprojection Request → Service Monitor (Worker Role) → Persist ReprojectionJobStatus (each entity specifies a single reprojection job request) → Job Queue → Parse & Persist ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e., a single tile) → Dispatch → Task Queue → GenericWorker (Worker Role) → Reprojection Data Storage. The worker also points to: SwathGranuleMeta (query this table to get geo-metadata, e.g., boundaries, for each swath tile), ScanTimeList (query this table to get the list of satellite scan times that cover a target tile), and Swath Source Data Storage]
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reductions multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers → $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers → $420 cpu, $60 download
Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers → $216 cpu, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers → $216 cpu, $2 download, $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web apprsquos billing profile
Sending fewer things over the wire often means getting fewer things from storage
Saving bandwidth costs often lead to savings inother places
Sending fewer things means your VM has time to do other tasks
All of these tips have the side benefit of improving your web apprsquos performance and user experience
Compressing Content
1Gzip all output content
bull All modern browsers can decompress on the flybull Compared to Compress Gzip has much better
compression and freedom from patented algorithms
2Tradeoff compute costs for storage size
3Minimize image sizesbull Use Portable Network Graphics (PNGs)bull Crush your PNGsbull Strip needless metadatabull Make all PNGs palette PNGs
Uncompressed Content
Compressed Content
GzipMinify JavaScript
Minify CCSMinify Images
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences
Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel
bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing
Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach
to 1TB
bull The output of BLAST is usually 10-100x larger than the input
AzureBLAST
bull Parallel BLAST engine on Azure
bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done
bull Follows the general suggested application model bull Web Role + Queue + Worker
bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific
Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010
AzureBLAST Task-FlowA simple SplitJoin pattern
Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size
Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead
Best Practice test runs to profiling and set size to mitigate the overhead
Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best
choice
Instance size vs Performancebull Super-linear speedup with larger size
worker instancesbull Primarily due to the memory capability
Task SizeInstance Size vs Costbull Extra-large instance generated the best
and the most economical throughputbull Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Blast databases temporary data etc)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs
AuthenticationAuthorization based on Live ID
The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory
states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW
Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less
than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences
ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried
bull Theoretically 100 billion sequence comparisons
Performance estimationbull Based on the sampling-running on one extra-large Azure
instancebull Would require 3216731 minutes (61 years) on one desktop
This scale of experiments usually are infeasible to most scientists
Our Approachbull Allocated a total of ~4000 instances
bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe
bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service
bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions
bull When load imbalances redistribute the load manually
50
6262 62
6262
5062
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research; Roger Barga, Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components: Fabric Controller
- Key Components: Fabric Controller (2)
- Key Components: Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable, Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage, At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection: Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery, Sensors, Models, and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage: Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources: Cloud Research Community Site
- Resources: AzureScope
- Resources: AzureScope (2)
- Demonstration (2)
- Slide 104
Compressing Content
1. Gzip all output content
• All modern browsers can decompress on the fly
• Compared to Compress, Gzip has much better compression and freedom from patented algorithms
2. Trade off compute costs for storage size
3. Minimize image sizes
• Use Portable Network Graphics (PNGs)
• Crush your PNGs
• Strip needless metadata
• Make all PNGs palette PNGs
[Diagram: uncompressed content vs. compressed content – Gzip, minify JavaScript, minify CSS, minify images]
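Point 1 (and the compute-for-storage tradeoff in point 2) can be sketched with Python's standard gzip module; this is a generic illustration, not tied to any particular Azure SDK:

```python
import gzip

def gzip_payload(data: bytes) -> bytes:
    """Gzip-compress a response payload before storing or serving it.

    Browsers that send 'Accept-Encoding: gzip' decompress on the fly;
    remember to set 'Content-Encoding: gzip' on the response.
    """
    return gzip.compress(data, compresslevel=6)

html = b"<html>" + b"<p>repetitive markup compresses well</p>" * 200 + b"</html>"
packed = gzip_payload(html)
# Repetitive text shrinks dramatically: a little CPU (compute cost)
# buys a large saving in storage and bandwidth.
assert len(packed) < len(html)
assert gzip.decompress(packed) == html
```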
Best Practices Summary
• Doing 'less' is the key to saving costs
• Measure everything
• Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A BLAST run can take 700–1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
• Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
• Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• With 100 nodes, peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
• Split the input sequences
• Query partitions in parallel
• Merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With special considerations:
• Batch job management
• Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010.
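The split/query/merge pattern can be sketched generically; the FASTA handling and the 100-sequence partition size (chosen per the micro-benchmarks below) are illustrative assumptions, not AzureBLAST's actual code:

```python
def split_fasta(fasta_text: str, seqs_per_partition: int = 100):
    """Split a multi-sequence FASTA input into fixed-size partitions.

    Each partition becomes one BLAST task message on the dispatch
    queue; a merging task concatenates the per-partition results.
    """
    # A FASTA record starts at each '>' header line.
    records = [">" + r for r in fasta_text.split(">") if r.strip()]
    return [records[i:i + seqs_per_partition]
            for i in range(0, len(records), seqs_per_partition)]

# 250 input sequences split into partitions of at most 100 each.
fasta = "".join(f">seq{i}\nACGT\n" for i in range(250))
parts = split_fasta(fasta, seqs_per_partition=100)
assert [len(p) for p in parts] == [100, 100, 50]
```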
AzureBLAST Task-Flow: a simple split/join pattern
Leverage the multiple cores of one instance
• Argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead)
• Best practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of instance failure
[Task-flow diagram: a splitting task fans out to parallel BLAST tasks, which feed a merging task]
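The visibilityTimeout tradeoff above can be made concrete with a toy in-memory queue. This only simulates the semantics (a dequeued message stays invisible for the timeout, then reappears if not deleted); it is not Azure queue API code:

```python
import time

class ToyQueue:
    """Simulates queue visibility-timeout semantics."""
    def __init__(self):
        self._msgs = {}        # id -> [visible_at, body]
        self._next_id = 0

    def put(self, body):
        self._msgs[self._next_id] = [0.0, body]
        self._next_id += 1

    def get(self, visibility_timeout):
        now = time.monotonic()
        for mid, slot in self._msgs.items():
            if slot[0] <= now:
                slot[0] = now + visibility_timeout  # hide the message
                return mid, slot[1]
        return None

    def delete(self, mid):
        self._msgs.pop(mid, None)

q = ToyQueue()
q.put("blast-task-42")
mid, body = q.get(visibility_timeout=0.05)
# The worker "crashes" (never deletes); once the timeout elapses, the
# task becomes visible again and another worker can pick it up.
time.sleep(0.06)
assert q.get(visibility_timeout=0.05) is not None
```

Too small a timeout and a slow-but-healthy worker's task reappears and is computed twice; too large and a dead worker's task sits invisible for the whole period.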
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit from the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to memory capacity
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resources
AzureBLAST
[Architecture diagram: a Web Role (web portal and web service) registers jobs in a job registry (Azure Table); a Job Management Role (job scheduler and scaling engine) splits each job into BLAST tasks and places them on a global dispatch queue; workers consume the tasks; Azure Blob storage holds the NCBI/BLAST databases and temporary data; a database-updating role refreshes the NCBI databases; within a job, a splitting task fans out to parallel BLAST tasks followed by a merging task]
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state
[Diagram: the job portal sits in front of the web service, job registration, job scheduler, scaling engine, and job registry]
Demonstration
R. palustris as a platform for H2 production
Eric Schadt (Sage); Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
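As a sanity check on that estimate, converting the minute count to years:

```python
# The slide's single-desktop estimate, converted from minutes to years.
minutes = 3_216_731
years = minutes / (60 * 24 * 365.25)   # minutes per Julian year
assert round(years, 1) == 6.1
```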
Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When the load imbalances, redistribute it manually
End Result
• Total size of the output result is ~230 GB
• The number of total hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
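Pairing "Executing" records with their "done" records mechanically surfaces the anomaly; a minimal sketch over records like those above:

```python
import re

def unfinished_tasks(log_lines):
    """Return ids of tasks that started but never logged completion."""
    started, finished = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return started - finished

log = [
    "3/31/2010 8:22 RD00155D3611B0 Executing the task 251774",
    "3/31/2010 9:50 RD00155D3611B0 Executing the task 251895",
    "3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins",
]
# Task 251774 started but never finished on this node.
assert unfinished_tasks(log) == {"251774"}
```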
Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in groups; this is an update domain
• ~30 mins per group
• ~6 nodes in one group
Surviving Storage Failures
West Europe datacenter: 30,976 tasks completed before the job was killed
35 nodes experienced blob-writing failures at the same time
A reasonable guess: the fault domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration or evaporation through plant membranes by plants.
Penman-Monteith (1964):
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)
Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
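A direct transcription of the formula into code (variable names follow the definitions above; the default γ is the slide's 66 Pa K-1, the default λv is a nominal 2260 J/g, and the numbers in the sanity check are purely illustrative, not field data):

```python
def penman_monteith_et(delta, rn, rho_a, cp, dq, ga, gs,
                       gamma=66.0, lambda_v=2260.0):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv)."""
    return (delta * rn + rho_a * cp * dq * ga) / (
        (delta + gamma * (1.0 + ga / gs)) * lambda_v)

# Sanity check with toy values: with every input 1 except Rn = 2, and
# γ = λv = 1, the numerator is 1·2 + 1 = 3 and the denominator is
# (1 + 1·(1 + 1))·1 = 3, so ET = 1.
assert abs(penman_monteith_et(1, 2, 1, 1, 1, 1, 1,
                              gamma=1.0, lambda_v=1.0) - 1.0) < 1e-12
```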
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure Four Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Pipeline diagram: scientists submit requests via the AzureMODIS Service web role portal; a request queue feeds the download, reprojection, reduction 1, and reduction 2 queues driving the data collection, reprojection, derivation reduction, and analysis reduction stages; source imagery comes from NASA download sites, and scientists download the science results]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction job queue
• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks (recoverable units of work)
• Execution status of all jobs and tasks is persisted in Tables
[Diagram: a <PipelineStage> request enters the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> job queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> task queue]
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: GenericWorker (Worker Role) instances pull from the <PipelineStage> task queue dispatched by the Service Monitor and read/write <Input> data storage]
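The retry discipline can be sketched as a small dequeue-and-execute loop; the task body and failure handling here are illustrative stand-ins, not MODISAzure's actual worker code:

```python
MAX_RETRIES = 3

def run_task(task, execute):
    """Run one dequeued task; retry up to MAX_RETRIES times, then
    record it as permanently failed (its status row stays in Tables)."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return ("succeeded", execute(task), attempt)
        except Exception:
            if attempt == MAX_RETRIES:
                return ("failed", None, attempt)
            # Otherwise fall through and retry the same task.

# A task that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky(task):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient blob failure")
    return "reprojected tile"

status, result, attempts = run_task("tile-17", flaky)
assert (status, result, attempts) == ("succeeded", "reprojected tile", 3)
```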
Example Pipeline Stage: Reprojection Service
[Diagram: a reprojection request flows through the job queue to the Service Monitor (Worker Role), which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the task queue consumed by GenericWorker (Worker Role) instances that read swath source data storage and write reprojection data storage]
• Each job-status entity specifies a single reprojection job request
• Each task-status entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Data collection stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
Derivation reduction stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault-tolerance and scalability abstractions
• Clouds can act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Best Practices Summary
Doing lsquolessrsquo is the key to saving costs
Measure everything
Know your application profile in and out
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences
Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel
bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing
Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach
to 1TB
bull The output of BLAST is usually 10-100x larger than the input
AzureBLAST
bull Parallel BLAST engine on Azure
bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done
bull Follows the general suggested application model bull Web Role + Queue + Worker
bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific
Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010
AzureBLAST Task-FlowA simple SplitJoin pattern
Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size
Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead
Best Practice test runs to profiling and set size to mitigate the overhead
Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best
choice
Instance size vs Performancebull Super-linear speedup with larger size
worker instancesbull Primarily due to the memory capability
Task SizeInstance Size vs Costbull Extra-large instance generated the best
and the most economical throughputbull Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Blast databases temporary data etc)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs
AuthenticationAuthorization based on Live ID
The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory
states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW
Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less
than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences
ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried
bull Theoretically 100 billion sequence comparisons
Performance estimationbull Based on the sampling-running on one extra-large Azure
instancebull Would require 3216731 minutes (61 years) on one desktop
This scale of experiments usually are infeasible to most scientists
Our Approachbull Allocated a total of ~4000 instances
bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe
bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service
bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions
bull When load imbalances redistribute the load manually
50
6262 62
6262
5062
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Cloud Computing for eScience Applications
NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• The most important software in bioinformatics
• Identifies similarity between bio-sequences
Computationally intensive
• Large number of pairwise alignment operations
• A single BLAST run can take 700–1,000 CPU hours
• Sequence databases are growing exponentially
• GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result-reduction processing
Large volume of data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth could reach 1 TB
• The output of BLAST is usually 10–100x larger than the input
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation, data-parallel pattern
  • Split the input sequences
  • Query partitions in parallel
  • Merge results together when done
• Follows the generally suggested application model: Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), ACM, 21 June 2010.
AzureBLAST Task-Flow: a simple split/join pattern
Leverage the multiple cores of one instance
• The "-a" argument of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Too-large partitions: load imbalance
• Too-small partitions: unnecessary overheads
  • NCBI-BLAST overhead
  • Data-transfer overhead
• Best practice: profile with test runs, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait in case of an instance failure
[Diagram: Splitting task → BLAST tasks in parallel → Merging task]
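The split/join pattern with a visibility timeout can be sketched in Python. This is an illustrative stand-in, not AzureBLAST's actual code: `VisibilityQueue` is a hypothetical in-memory substitute for an Azure queue, and the `.upper()` call stands in for running NCBI-BLAST on a partition.

```python
import time
from collections import deque

class VisibilityQueue:
    """Toy stand-in for an Azure queue: dequeued messages become invisible
    for `visibility_timeout` seconds, then reappear if never deleted."""
    def __init__(self, visibility_timeout=30.0):
        self.visible = deque()
        self.invisible = {}          # msg id -> (message, reappear time)
        self.timeout = visibility_timeout
        self._next_id = 0

    def put(self, msg):
        self.visible.append((self._next_id, msg))
        self._next_id += 1

    def get(self):
        now = time.monotonic()
        # Messages whose timeout expired become visible again (task presumed failed).
        for mid, (msg, t) in list(self.invisible.items()):
            if now >= t:
                del self.invisible[mid]
                self.visible.append((mid, msg))
        if not self.visible:
            return None
        mid, msg = self.visible.popleft()
        self.invisible[mid] = (msg, now + self.timeout)
        return mid, msg

    def delete(self, mid):
        self.invisible.pop(mid, None)   # acknowledge: message is gone for good

def split(sequences, partition_size):
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

# Splitting task: enqueue one BLAST task per partition.
queue = VisibilityQueue(visibility_timeout=60.0)
sequences = [f"seq{i}" for i in range(10)]
for part in split(sequences, partition_size=3):
    queue.put(part)

# Workers: dequeue, "BLAST" the partition, delete the message only on success,
# so a worker that dies mid-task lets the message reappear after the timeout.
partial_results = []
while (item := queue.get()) is not None:
    mid, part = item
    partial_results.append([s.upper() for s in part])  # stand-in for a BLAST run
    queue.delete(mid)

# Merging task: join the partial results in one place.
merged = [hit for part in partial_results for hit in part]
```

Deleting the message only after the work succeeds is what makes the timeout trade-off real: too short and a slow task is redelivered and recomputed; too long and a crashed worker's task sits invisible.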
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm-cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to the memory capacity
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
[Architecture diagram: a Web Role hosts the Web Portal and Web Service (job registration, job scheduler); a Job Management Role runs the Scaling Engine and feeds a global dispatch queue consumed by Worker instances; a Database-updating Role keeps the NCBI databases current; Azure Tables hold the Job Registry; Azure Blob storage holds the NCBI databases, BLAST databases, temporary data, etc. Work flows as: Splitting task → BLAST tasks in parallel → Merging task.]
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs
Authentication/authorization based on Live ID
The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state
[Diagram: Job Portal → Web Service (job registration, job scheduler) → Scaling Engine, backed by the Job Registry.]
Demonstration
R. palustris as a platform for H2 production
Eric Schadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against all" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total, 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
Experiments at this scale are usually infeasible for most scientists
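The single-desktop estimate converts from minutes to years directly; a quick sanity check:

```python
minutes = 3_216_731
minutes_per_year = 60 * 24 * 365          # 525,600
years = minutes / minutes_per_year
print(round(years, 1))                    # about 6.1 years on one desktop
```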
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM), across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment is submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
[Map: instance counts per deployment (50 or 62 instances each) across the four datacenters]
End Result
• Total size of the output is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6–8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should be:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., the task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
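The check described above, finding tasks that logged a start but never a completion, can be sketched as a small log scan (the regexes and sample lines are illustrative, matching the record format shown above):

```python
import re

LOG = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
"""

started, finished = set(), set()
for line in LOG.splitlines():
    if m := re.search(r"Executing the task (\d+)", line):
        started.add(m.group(1))
    elif m := re.search(r"Execution of task (\d+) is done", line):
        finished.add(m.group(1))

# Tasks that began but never completed, e.g. lost to a node failure or upgrade.
suspect = started - finished
print(sorted(suspect))  # → ['251774']
```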
Surviving System Upgrades
North Europe datacenter: in total 34,256 tasks processed.
[Chart: all 62 compute nodes lost tasks and then came back in groups of ~6 nodes at a time, each group offline for ~30 mins. Each group is an update domain.]
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed before the job was killed.
[Chart: 35 nodes experienced blob-writing failures at the same time.]
A reasonable guess: the fault domain is at work.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies, and by transpiration, or evaporation through plant membranes, by plants.

Penman–Monteith (1964):
ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

ET = water volume evapotranspired (m³ s⁻¹ m⁻²)
Δ = rate of change of saturation specific humidity with air temperature (Pa K⁻¹)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m⁻²)
cp = specific heat capacity of air (J kg⁻¹ K⁻¹)
ρa = dry air density (kg m⁻³)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s⁻¹)
gs = conductivity of plant stoma, air (inverse of rs) (m s⁻¹)
γ = psychrometric constant (γ ≈ 66 Pa K⁻¹)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
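The Penman–Monteith formula translates directly into code. This is a sketch only: the default constants and the example input values below are illustrative assumptions, not values from the MODISAzure pipeline.

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """ET = (delta*Rn + rho_a*c_p*(dq)*g_a) / ((delta + gamma*(1 + g_a/g_s)) * lambda_v)
    Units follow the variable list above; gamma in Pa/K, lambda_v in J/g
    (2450 J/g is a typical value near 20 C, used here as an assumption)."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Made-up example inputs, chosen only to exercise the formula:
et = penman_monteith_et(delta=145.0, r_n=400.0, rho_a=1.2, c_p=1005.0,
                        dq=1000.0, g_a=0.02, g_s=0.01)
```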
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
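The four stages chain together as queue-driven map and reduce steps. A minimal sketch of that control flow, with plain in-memory queues and made-up function names standing in for the real services:

```python
from collections import deque

def run_pipeline(tile_requests):
    download_q = deque(tile_requests)
    reprojection_q, reduction1_q, reduction2_q = deque(), deque(), deque()

    # Data collection (map): fetch each requested tile.
    while download_q:
        tile = download_q.popleft()
        reprojection_q.append({"tile": tile, "raw": f"raw({tile})"})

    # Reprojection (map): convert each source tile to a sinusoidal tile.
    while reprojection_q:
        t = reprojection_q.popleft()
        reduction1_q.append({"tile": t["tile"],
                             "sinusoidal": f"reproj({t['raw']})"})

    # Derivation reduction: compute ET across all reprojected tiles.
    et = [f"ET({t['sinusoidal']})" for t in reduction1_q]
    reduction2_q.append(et)

    # Analysis reduction (optional): produce science artifacts such as maps.
    return {"artifact": f"map({len(reduction2_q[0])} tiles)"}

result = run_pipeline(["h08v05", "h09v05"])
```

The map stages fan out one message per tile, while the reduce stages consume a whole queue at once, which is why they appear as distinct queues in the architecture.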
[Diagram: scientists submit requests through the AzureMODIS Service Web Role Portal; requests flow from a Request Queue to a Download Queue (Data Collection Stage, pulling from source imagery download sites and source metadata), then to a Reprojection Queue (Reprojection Stage), then to Reduction 1 and Reduction 2 Queues (Derivation Reduction and Analysis Reduction Stages); scientific results are available for download.]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, the recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
[Diagram: a <PipelineStage> Request reaches the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a Worker Role
• The GenericWorker (Worker Role):
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status
[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue the tasks and read <Input>Data Storage.]
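The GenericWorker's policy of retrying a failed task up to 3 times before giving up can be sketched as follows. This is an illustrative stand-in, not the actual MODISAzure code; the dead-letter list and handler are hypothetical names.

```python
MAX_RETRIES = 3

def process_with_retries(task_queue, handler, dead_letter):
    """Pop tasks; on failure re-enqueue with an incremented retry count,
    and park tasks that failed MAX_RETRIES times (poison tasks)."""
    while task_queue:
        task = task_queue.pop(0)
        try:
            handler(task)
        except Exception:
            task["retries"] = task.get("retries", 0) + 1
            if task["retries"] >= MAX_RETRIES:
                dead_letter.append(task)    # give up; keep for inspection
            else:
                task_queue.append(task)     # retry later

# Example: a handler that always fails for one tile.
dead = []
queue = [{"tile": "h08v05"}, {"tile": "bad"}]

def handler(task):
    if task["tile"] == "bad":
        raise RuntimeError("reprojection failed")

process_with_retries(queue, handler, dead)
```

Capping retries is what keeps one bad tile from blocking the queue forever, while the recorded status makes the failure visible for later analysis.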
Example Pipeline Stage: Reprojection Service
[Diagram: a Reprojection Request enters the Job Queue, where each entity specifies a single reprojection job request; the Service Monitor (Worker Role) persists ReprojectionJobStatus, then parses and persists ReprojectionTaskStatus and dispatches to the Task Queue, where each entity specifies a single reprojection task (i.e., a single tile); GenericWorker (Worker Role) instances read from Swath Source Data Storage and write Reprojection Data Storage. The SwathGranuleMeta table is queried for geo-metadata (e.g., boundaries) for each swath tile; the ScanTimeList table is queried for the list of satellite scan times that cover a target tile.]
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reductions multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates
[Diagram: the pipeline stages annotated with per-stage resources and costs: 400–500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers ($50 upload, $450 storage); 400 GB, 45K files, 3500 hours, 20–100 workers ($420 CPU, $60 download); 5–7 GB, 55K files, 1800 hours, 20–100 workers ($216 CPU, $1 download, $6 storage); <10 GB, ~1K files, 1800 hours, 20–100 workers ($216 CPU, $2 download, $9 storage).]
Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault-tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
NCBI BLAST
BLAST (Basic Local Alignment Search Tool) bull The most important software in bioinformaticsbull Identify similarity between bio-sequences
Computationally intensivebull Large number of pairwise alignment operationsbull A BLAST running can take 700 ~ 1000 CPU hoursbull Sequence databases growing exponentiallybull GenBank doubled in size in about 15 months
Opportunities for Cloud Computing
It is easy to parallelize BLASTbull Segment the input bull Segment processing (querying) is pleasingly parallel
bull Segment the database (eg mpiBLAST)bull Needs special result reduction processing
Large volume databull A normal Blast database can be as large as 10GBbull 100 nodes means the peak storage bandwidth could reach
to 1TB
bull The output of BLAST is usually 10-100x larger than the input
AzureBLAST
bull Parallel BLAST engine on Azure
bull Query-segmentation data-parallel patternbull split the input sequencesbull query partitions in parallelbull merge results together when done
bull Follows the general suggested application model bull Web Role + Queue + Worker
bull With three special considerationsbull Batch job managementbull Task parallelism on an elastic CloudWei Lu Jared Jackson and Roger Barga AzureBlast A Case Study of Developing Science Applications on the Cloud in Proceedings of the 1st Workshop on Scientific
Cloud Computing (Science Cloud 2010) Association for Computing Machinery Inc 21 June 2010
AzureBLAST Task-FlowA simple SplitJoin pattern
Leverage multi-core of one instance bull argument ldquondashardquo of NCBI-BLASTbull 1248 for small middle large and extra large instance size
Task granularity bull Large partition load imbalance bull Small partition unnecessary overheadsbull NCBI-BLAST overheadbull Data transferring overhead
Best Practice test runs to profiling and set size to mitigate the overhead
Value of visibilityTimeout for each BLAST task bull Essentially an estimate of the task run time bull too small repeated computation bull too large unnecessary long period of waiting time in case of the instance failure
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best
choice
Instance size vs Performancebull Super-linear speedup with larger size
worker instancesbull Primarily due to the memory capability
Task SizeInstance Size vs Costbull Extra-large instance generated the best
and the most economical throughputbull Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Blast databases temporary data etc)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs
AuthenticationAuthorization based on Live ID
The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory
states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW
Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less
than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences
ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried
bull Theoretically 100 billion sequence comparisons
Performance estimationbull Based on the sampling-running on one extra-large Azure
instancebull Would require 3216731 minutes (61 years) on one desktop
This scale of experiments usually are infeasible to most scientists
Our Approachbull Allocated a total of ~4000 instances
bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe
bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service
bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions
bull When load imbalances redistribute the load manually
50
6262 62
6262
5062
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery, Sensors, Models, and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Opportunities for Cloud Computing
It is easy to parallelize BLAST
• Segment the input
  • Segment processing (querying) is pleasingly parallel
• Segment the database (e.g., mpiBLAST)
  • Needs special result reduction processing
Large volume data
• A normal BLAST database can be as large as 10 GB
• 100 nodes means the peak storage bandwidth demand could reach 1 TB
• The output of BLAST is usually 10-100x larger than the input
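The query-segmentation step above can be sketched directly: split the input FASTA text into fixed-size partitions that independent workers can query in parallel (the 100-sequences-per-partition default echoes the micro-benchmark finding later in the deck). A minimal illustrative sketch, not the AzureBLAST code:

```python
def split_fasta(text, seqs_per_partition=100):
    """Split FASTA text into partitions of at most `seqs_per_partition`
    sequences; each partition can be BLASTed by an independent worker."""
    records, current = [], []
    for line in text.splitlines():
        # A FASTA record starts at a '>' header line.
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return ["\n".join(records[i:i + seqs_per_partition])
            for i in range(0, len(records), seqs_per_partition)]
```

The join step is then a concatenation of per-partition result files in partition order.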
AzureBLAST
• Parallel BLAST engine on Azure
• Query-segmentation data-parallel pattern
  • split the input sequences
  • query partitions in parallel
  • merge results together when done
• Follows the general suggested application model: Web Role + Queue + Worker
• With three special considerations
  • Batch job management
  • Task parallelism on an elastic Cloud

Wei Lu, Jared Jackson, and Roger Barga, "AzureBlast: A Case Study of Developing Science Applications on the Cloud," in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010
AzureBLAST Task-Flow: a simple Split/Join pattern
Leverage the multiple cores of one instance
• argument "-a" of NCBI-BLAST
• 1, 2, 4, 8 for small, medium, large, and extra-large instance sizes
Task granularity
• Large partition: load imbalance
• Small partition: unnecessary overheads
  • NCBI-BLAST overhead
  • Data transfer overhead
• Best Practice: use test runs to profile, and set the partition size to mitigate the overhead
Value of visibilityTimeout for each BLAST task
• Essentially an estimate of the task run time
• Too small: repeated computation
• Too large: unnecessarily long wait before retry if an instance fails
[Diagram: a Splitting task fans the input out to parallel BLAST tasks, whose outputs feed a Merging task]
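The visibilityTimeout trade-off above is easiest to see against a toy queue with Azure-style semantics: a dequeued message is hidden for the timeout and reappears if it is not deleted. The class below is a hypothetical in-memory stand-in for the Azure queue API, not the real client:

```python
import time

class VisibilityQueue:
    """Minimal in-memory queue with Azure-style visibility-timeout
    semantics: a dequeued message is hidden for the timeout and
    reappears if it is not deleted in time (e.g., the worker died)."""

    def __init__(self):
        self._messages = {}  # message id -> (body, time it becomes visible)
        self._next_id = 0

    def put(self, body):
        self._messages[self._next_id] = (body, 0.0)
        self._next_id += 1

    def get(self, visibility_timeout):
        """Return (id, body) of one visible message and hide it, or None."""
        now = time.monotonic()
        for mid, (body, visible_at) in self._messages.items():
            if visible_at <= now:
                self._messages[mid] = (body, now + visibility_timeout)
                return mid, body
        return None

    def delete(self, mid):
        """Called by the worker after the task completes successfully."""
        self._messages.pop(mid, None)
```

Setting the timeout to an estimate of the BLAST task run time means a crashed instance's task reappears for another worker; a timeout shorter than the real run time makes a still-running task reappear and be computed twice.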
Micro-Benchmarks Inform Design
Task size vs. performance
• Benefit of the warm cache effect
• 100 sequences per partition is the best choice
Instance size vs. performance
• Super-linear speedup with larger worker instances
• Primarily due to memory capacity
Task size / instance size vs. cost
• Extra-large instances generated the best and most economical throughput
• Fully utilize the resource
AzureBLAST
[Diagram: a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler and Scaling Engine; tasks flow through a global dispatch queue to Worker instances; an Azure Table holds the Job Registry; Azure Blob storage holds the NCBI databases, BLAST databases, and temporary data; a Database updating Role refreshes the databases. A Splitting task fans out to parallel BLAST tasks whose results feed a Merging task]
AzureBLAST Job Portal
ASP.NET program hosted by a web role instance
• Submit jobs
• Track job status and logs
• Authentication/Authorization based on Live ID
• The accepted job is stored in the job registry table
  • Fault tolerance: avoid in-memory states
[Diagram: the Job Portal's Web Portal and Web Service register jobs in the Job Registry; the Job Scheduler and Scaling Engine pick them up]
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW
Blasted ~5,000 proteins (~700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec
AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences
"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• In total 9,865,668 sequences to be queried
• Theoretically, 100 billion sequence comparisons
Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (6.1 years) on one desktop
This scale of experiment is usually infeasible for most scientists
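The 6.1-year figure is just the unit conversion on the sampled estimate:

```python
minutes_on_one_desktop = 3_216_731          # extrapolated from sample runs
years = minutes_on_one_desktop / (60 * 24 * 365)  # minutes -> years
print(round(years, 1))  # -> 6.1
```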
Our Approach
• Allocated a total of ~4,000 instances
  • 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), Western and North Europe
• 8 deployments of AzureBLAST
  • Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment was submitted to one deployment as one job for execution
  • Each segment consists of smaller partitions
• When load imbalances, redistribute the load manually
[Map: the 8 deployments, sized 50-62 extra-large VMs each, across the four datacenters]
End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started March 25th; the last task completed April 8th (10 days of compute)
  • But based on our estimates, real working instance time should be 6-8 days
  • Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins
Otherwise something is wrong (e.g., task failed to complete):
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
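Pairing "Executing" with "is done" records can be mechanized; a sketch (the regexes assume the record layout shown above) that flags tasks which started but never completed:

```python
import re

def unfinished_tasks(log_lines):
    """Return task ids that have an 'Executing' record but no matching
    'is done' record -- a sign the task failed or the node was lost."""
    started, finished = set(), set()
    for line in log_lines:
        m = re.search(r"Executing the task (\d+)", line)
        if m:
            started.add(m.group(1))
        m = re.search(r"Execution of task (\d+) is done", line)
        if m:
            finished.add(m.group(1))
    return sorted(started - finished)
```

Run over the abnormal excerpt above, this flags task 251774, which was dispatched but never reported completion.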
Surviving System Upgrades
North Europe datacenter: in total 34,256 tasks processed
All 62 compute nodes lost tasks and then came back in a group - this is an update domain
[Chart: nodes dropped out ~6 at a time in groups (update domains), each outage lasting ~30 mins]
Surviving Storage Failures
West Europe datacenter: 30,976 tasks were completed, and the job was killed
35 nodes experienced blob writing failures at the same time
A reasonable guess: the Fault Domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry" - Irish Proverb
Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

ET = Water volume evapotranspired (m3 s-1 m-2)
Δ = Rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = Latent heat of vaporization (J/g)
Rn = Net radiation (W m-2)
cp = Specific heat capacity of air (J kg-1 K-1)
ρa = Dry air density (kg m-3)
δq = Vapor pressure deficit (Pa)
ga = Conductivity of air (inverse of ra) (m s-1)
gs = Conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = Psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
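The formula transcribes directly into code. This is a sketch using the slide's symbols and units; the sample values in the usage below are illustrative, not from the deck:

```python
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2450.0):
    """ET = (Delta*Rn + rho_a*c_p*(dq)*g_a) / ((Delta + gamma*(1 + g_a/g_s)) * lambda_v)
    delta: d(sat. specific humidity)/dT [Pa/K]; r_n: net radiation [W/m2];
    rho_a: dry air density [kg/m3]; c_p: specific heat of air [J/(kg K)];
    dq: vapor pressure deficit [Pa]; g_a, g_s: air/stomatal conductivity [m/s];
    gamma: psychrometric constant [Pa/K]; lambda_v: latent heat [J/g]."""
    return (delta * r_n + rho_a * c_p * dq * g_a) / \
           ((delta + gamma * (1.0 + g_a / g_s)) * lambda_v)
```

Closing stomata (smaller gs) raises the γ·(1 + ga/gs) term in the denominator and so lowers ET, as expected physically.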
ET Synthesizes Imagery, Sensors, Models, and Field Data
• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure Four Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA ftp sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors
[Diagram: Scientists submit requests through the AzureMODIS Service Web Role Portal; a Request Queue feeds the Download Queue (Data Collection Stage, pulling from the Source Imagery Download Sites and recording Source Metadata), then the Reprojection Queue (Reprojection Stage), then the Reduction 1 and Reduction 2 Queues (Derivation and Analysis Reduction Stages); scientific results are available for download]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks - recoverable units of work
  • Execution status of all jobs and tasks persisted in Tables
[Diagram: <PipelineStage> Requests enter the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue]
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: GenericWorker (Worker Role) instances take dispatched work from the <PipelineStage> Task Queue, read <Input>Data Storage, and persist <PipelineStage>TaskStatus]
Example Pipeline Stage: Reprojection Service
[Diagram: Reprojection Requests enter the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; GenericWorker (Worker Role) instances consume the tasks against Reprojection Data Storage]
• Each entity in the job table specifies a single reprojection job request
• Each entity in the task table specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
• Tasks point into the Swath Source Data Storage
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers -> $50 upload, $450 storage
Reprojection Stage: 400 GB, 45K files, 3,500 hours, 20-100 workers -> $420 cpu, $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1,800 hours, 20-100 workers -> $216 cpu, $1 download, $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1,800 hours, 20-100 workers -> $216 cpu, $2 download, $9 storage

Total: $1420
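As a sanity check, the cpu line items are consistent with the 2010-era Azure compute rate of $0.12 per instance-hour - a rate the slide does not state, so it is an assumption here:

```python
RATE = 0.12  # USD per instance-hour (assumed 2010 Azure pricing, not from the slide)
reprojection_cpu = round(3500 * RATE)  # 3,500 instance-hours
reduction_cpu = round(1800 * RATE)     # 1,800 instance-hours per reduction stage
print(reprojection_cpu, reduction_cpu)  # matches the $420 and $216 line items
```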
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site - http://research.microsoft.com/azure
• Getting started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
AzureBLAST Task-Flow: a simple Split/Join pattern

Leverage the multiple cores of one instance
• the "-a" argument of NCBI-BLAST
• set to 1, 2, 4, or 8 for the small, medium, large, and extra-large instance sizes

Task granularity
• too large a partition: load imbalance
• too small a partition: unnecessary overheads (NCBI-BLAST startup overhead, data transfer overhead)
• Best practice: profile with test runs and set the partition size to mitigate the overhead

Value of visibilityTimeout for each BLAST task
• essentially an estimate of the task run time
• too small: repeated computation
• too large: an unnecessarily long wait before another instance picks the task up after an instance failure
[Diagram: a Splitting task fans out to parallel BLAST tasks, which a Merging task joins.]
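The visibilityTimeout trade-off can be seen in a toy model. This is an illustrative sketch, not the Azure storage client API: `VisibilityQueue` and its methods are invented here to mimic the queue semantics the slide relies on (a dequeued message is hidden rather than deleted, and reappears if not deleted before the timeout expires).

```python
import time

class VisibilityQueue:
    """Toy queue with Azure-style visibility-timeout semantics."""
    def __init__(self):
        self._msgs = {}      # message id -> (payload, visible_at)
        self._next_id = 0

    def put(self, payload):
        self._msgs[self._next_id] = (payload, 0.0)
        self._next_id += 1

    def get(self, visibility_timeout):
        # Return the first visible message and hide it for the timeout.
        now = time.monotonic()
        for mid, (payload, visible_at) in self._msgs.items():
            if visible_at <= now:
                self._msgs[mid] = (payload, now + visibility_timeout)
                return mid, payload
        return None

    def delete(self, mid):
        self._msgs.pop(mid, None)

q = VisibilityQueue()
q.put("partition-0")
mid, task = q.get(visibility_timeout=0.05)  # timeout shorter than the task
time.sleep(0.1)                             # worker is still "running"
# The message became visible again, so a second worker would
# redo the same BLAST task: the "too small" failure mode above.
assert q.get(visibility_timeout=0.05) is not None
```

Deleting the message before the timeout expires (the normal completion path) prevents the duplicate dequeue.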
Micro-Benchmarks Inform Design

Task size vs. performance
• benefit of the warm-cache effect
• 100 sequences per partition is the best choice

Instance size vs. performance
• super-linear speedup with larger worker instances
• primarily due to the larger memory capacity

Task size / instance size vs. cost
• the extra-large instance gave the best and most economical throughput
• fully utilizes the resources
AzureBLAST

[Architecture diagram: a Web Role hosts the Web Portal, Web Service, and Job registration; a Job Management Role runs the Job Scheduler and Scaling Engine; tasks flow through a global dispatch queue to Worker instances; an Azure Table holds the Job Registry; a Database updating Role maintains the NCBI databases, BLAST databases, and temporary data in Azure Blob storage; a Splitting task fans work out to BLAST tasks that a Merging task joins.]
AzureBLAST Job Portal

ASP.NET program hosted by a web role instance
• Submit jobs
• Track a job's status and logs

Authentication/Authorization based on Live ID

The accepted job is stored in the job registry table
• Fault tolerance: avoid in-memory state
Demonstration
R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW

Blasted ~5,000 proteins (~700K sequences)
• Against all NCBI non-redundant proteins: completed in 30 min
• Against ~5,000 proteins from another strain: completed in less than 30 sec

AzureBLAST significantly saved computing time…
All-Against-All Experiment: Discovering Homologs
• Discover the interrelationships of known protein sequences

"All against All" query
• The database is also the input query
• The protein database is large (4.2 GB)
• 9,865,668 sequences to be queried in total
• Theoretically, 100 billion sequence comparisons

Performance estimation
• Based on sample runs on one extra-large Azure instance
• Would require 3,216,731 minutes (≈6.1 years) on one desktop

Experiments at this scale are usually infeasible for most scientists
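The desktop estimate converts to years with one line of arithmetic:

```python
# Reproduce the serial-runtime estimate quoted on the slide.
minutes = 3_216_731                  # estimated minutes on one desktop
years = minutes / (60 * 24 * 365)    # minutes in a (non-leap) year
assert round(years, 1) == 6.1
```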
Our Approach
• Allocated a total of ~4,000 instances
• 475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe
• 8 deployments of AzureBLAST
• Each deployment has its own co-located storage service
• Divided the 10 million sequences into multiple segments
• Each segment is submitted to one deployment as one job for execution
• Each segment consists of smaller partitions
• When load imbalances appear, redistribute the load manually
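The segment/partition scheme above can be sketched as follows; `partition` and its parameter names are illustrative, not taken from the AzureBLAST code.

```python
def partition(seq_ids, n_segments, partition_size):
    """Split sequence ids into per-deployment segments, each made of
    fixed-size partitions (the unit of work handed to one BLAST task)."""
    # Stride across the id list so each deployment gets a similar share.
    segments = [seq_ids[i::n_segments] for i in range(n_segments)]
    return [
        [seg[j:j + partition_size] for j in range(0, len(seg), partition_size)]
        for seg in segments
    ]

jobs = partition(list(range(1000)), n_segments=8, partition_size=100)
assert len(jobs) == 8                                   # one job per deployment
assert all(len(p) <= 100 for job in jobs for p in job)  # bounded task size
```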
End Result
• Total size of the output is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs

A normal log record looks like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise, something is wrong (e.g., the task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
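Anomalies of this kind can be found mechanically by pairing "Executing" records with their "done" records; the snippet below uses the log format as reconstructed on this slide.

```python
import re

LOG = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
"""

# A task is suspect if it was started but never reported completion.
started = set(re.findall(r"Executing the task (\d+)", LOG))
finished = set(re.findall(r"Execution of task (\d+) is done", LOG))
assert started - finished == {"251774"}  # started but never completed
```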
Surviving System Upgrades

North Europe Data Center: 34,256 tasks processed in total

All 62 compute nodes lost tasks and then came back in groups: this is an update domain
• ~30 mins per group
• ~6 nodes in one group
Surviving Storage Failures

West Europe Datacenter: 30,976 tasks were completed and the job was killed

35 nodes experienced a blob-writing failure at the same time

A reasonable guess: the Fault Domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Computing Evapotranspiration (ET)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
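The Penman-Monteith equation translates directly into code. The function below is a straight transcription of the formula with the slide's symbols; the numeric inputs in the example are illustrative placeholders, not values from the MODISAzure pipeline.

```python
def penman_monteith_et(delta, R_n, rho_a, c_p, dq, g_a, g_s, gamma, lambda_v):
    """ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs)) · λv).
    gamma is the psychrometric constant (≈ 66 Pa/K per the slide)."""
    return (delta * R_n + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    )

# Placeholder inputs, roughly in the units listed on the slide.
et = penman_monteith_et(delta=145.0, R_n=400.0, rho_a=1.2, c_p=1005.0,
                        dq=1000.0, g_a=0.02, g_s=0.01, gamma=66.0,
                        lambda_v=2.45e6)
assert et > 0.0
```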
ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
[Diagram: Scientists submit requests through the AzureMODIS Service Web Role Portal; a Request Queue feeds the Data Collection Stage, which pulls from Source Imagery Download Sites via a Download Queue and records Source Metadata; the Reprojection Queue feeds the Reprojection Stage, and the Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages; science results are available for Scientific Results Download.]
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)

• The ModisAzure Service is the Web Role front door
• Receives all user requests
• Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue

• The Service Monitor is a dedicated Worker Role
• Parses all job requests into tasks, which are recoverable units of work
• Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.]
MODISAzure Architectural Big Picture (2/2)

All work is actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue, from which GenericWorker (Worker Role) instances pull tasks and read <Input>Data Storage.]
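A minimal sketch of such a dispatch loop, assuming an in-memory deque and plain callables standing in for the real Azure task queue and status table (`run_worker` and the callback names are illustrative, not from the MODISAzure code):

```python
from collections import deque

MAX_ATTEMPTS = 3   # per the slide: failed tasks are retried 3 times

def run_worker(tasks, execute, status):
    """Dequeue tasks, run them, retry on failure, persist final state.
    `tasks` is a deque of task dicts; `status` maps task id -> state."""
    while tasks:
        task = tasks.popleft()
        try:
            execute(task)
            status[task["id"]] = "done"
        except Exception:
            task["attempts"] = task.get("attempts", 0) + 1
            if task["attempts"] < MAX_ATTEMPTS:
                tasks.append(task)              # back on the queue for retry
            else:
                status[task["id"]] = "failed"   # give up after 3 attempts

calls = []
def flaky(task):                                # fails the first two times
    calls.append(task["id"])
    if len(calls) < 3:
        raise RuntimeError("transient failure")

status = {}
run_worker(deque([{"id": "t1"}]), flaky, status)
assert status == {"t1": "done"} and len(calls) == 3
```

In the real service the retry count and task state live in the Azure status tables, so a recovered worker sees how often a task has already been attempted.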
Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request is handled by the Service Monitor (Worker Role), which persists ReprojectionJobStatus (each entity specifies a single reprojection job request) via the Job Queue, and parses and persists ReprojectionTaskStatus (each entity specifies a single reprojection task, i.e., a single tile); tasks are dispatched through the Task Queue to GenericWorker (Worker Role) instances, which query the SwathGranuleMeta table for geo-metadata (e.g., boundaries) for each swath tile and the ScanTimeList table for the list of satellite scan times that cover a target tile, reading Swath Source Data Storage and writing Reprojection Data Storage.]
Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers: $50 upload, $450 storage
Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers: $420 CPU, $60 download
Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers: $216 CPU, $1 download, $6 storage
Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers: $216 CPU, $2 download, $9 storage

Total: $1420
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• They provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Micro-Benchmarks Inform DesignTask size vs Performancebull Benefit of the warm cache effectbull 100 sequences per partition is the best
choice
Instance size vs Performancebull Super-linear speedup with larger size
worker instancesbull Primarily due to the memory capability
Task SizeInstance Size vs Costbull Extra-large instance generated the best
and the most economical throughputbull Fully utilize the resource
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Blast databases temporary data etc)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs
AuthenticationAuthorization based on Live ID
The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory
states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW
Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less
than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences
ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried
bull Theoretically 100 billion sequence comparisons
Performance estimationbull Based on the sampling-running on one extra-large Azure
instancebull Would require 3216731 minutes (61 years) on one desktop
This scale of experiments usually are infeasible to most scientists
Our Approachbull Allocated a total of ~4000 instances
bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe
bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service
bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions
bull When load imbalances redistribute the load manually
50
6262 62
6262
5062
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
AzureBLAST
Web Portal
Web Service
Job registration
Job Scheduler
WorkerWorker
WorkerWorker
WorkerWorker
Global dispatch
queue
Web Role
Azure Table
Job Management Role
Azure Blob
Database updating Role
helliphellip
Scaling Engine
Blast databases temporary data etc)
Job RegistryNCBI databases
BLAST task
Splitting task
BLAST task
BLAST task
BLAST task
hellip
Merging Task
AzureBLAST Job PortalASPNET program hosted by a web role instancebull Submit jobsbull Track jobrsquos status and logs
AuthenticationAuthorization based on Live ID
The accepted job is stored into the job registry tablebull Fault tolerance avoid in-memory
states
Web Portal
Web Service
Job registration
Job Scheduler
Job Portal
Scaling Engine
Job Registry
Demonstration
R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW
Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less
than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences
ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried
bull Theoretically 100 billion sequence comparisons
Performance estimationbull Based on the sampling-running on one extra-large Azure
instancebull Would require 3216731 minutes (61 years) on one desktop
This scale of experiments usually are infeasible to most scientists
Our Approachbull Allocated a total of ~4000 instances
bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe
bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service
bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions
bull When load imbalances redistribute the load manually
50
6262 62
6262
5062
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud

"You never miss the water till the well has run dry." (Irish proverb)

Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants.

Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)

where:
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky:
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
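The Penman-Monteith relation above translates directly into a function. A sketch with illustrative input values (the numbers below are not from the talk, only the defaults γ ≈ 66 Pa/K and λv ≈ 2260 J/g are standard):

```python
def penman_monteith_et(delta, rn, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2260.0):
    """Penman-Monteith ET as given on the slide.

    delta: Pa/K, rn: W/m^2, rho_a: kg/m^3, c_p: J/(kg K), dq: Pa,
    g_a, g_s: m/s, gamma: Pa/K, lambda_v: J/g.
    """
    return (delta * rn + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v)

# Illustrative mid-latitude daytime values (assumed, for the example only):
et = penman_monteith_et(delta=145.0, rn=400.0, rho_a=1.2, c_p=1005.0,
                        dq=1000.0, g_a=0.02, g_s=0.01)
print(et > 0)  # True: a positive evapotranspiration flux
```

Per-pixel evaluation of this expression over reprojected MODIS tiles is exactly the "derivation reduction" work the pipeline below performs at scale.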
ET Synthesizes Imagery, Sensors, Models, and Field Data

• NASA MODIS imagery source archives: 5 TB (600K files)
• FLUXNET curated sensor dataset: 30 GB (960 files)
• FLUXNET curated field dataset: 2 KB (1 file)
• NCEP/NCAR: ~100 MB (4K files)
• Vegetative clumping: ~5 MB (1 file)
• Climate classification: ~1 MB (1 file)

20 US years = 1 global year
MODISAzure: Four-Stage Image Processing Pipeline

Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile

Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms

Derivation reduction stage
• First stage visible to scientist
• Computes ET in our initial use

Analysis reduction stage
• Optional second stage visible to scientist
• Enables production of science analysis artifacts such as maps, tables, virtual sensors
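The four stages above chain map steps into a reduction through queues. A toy sketch using in-process queues in place of Azure Queues (stage logic and names are stand-ins, not the MODISAzure code):

```python
from queue import Queue

def run_pipeline(tile_requests):
    download_q, reprojection_q, reduction_q = Queue(), Queue(), Queue()

    for req in tile_requests:          # data collection (map): fetch inputs
        download_q.put(req)
    while not download_q.empty():      # reprojection (map): one tile in, one out
        tile = download_q.get()
        reprojection_q.put(f"sinusoidal({tile})")
    while not reprojection_q.empty():
        reduction_q.put(reprojection_q.get())

    batch = []                         # derivation reduction: many tiles -> one product
    while not reduction_q.empty():
        batch.append(reduction_q.get())
    return [("ET", sorted(batch))]

result = run_pipeline(["tileA", "tileB"])
print(result)  # [('ET', ['sinusoidal(tileA)', 'sinusoidal(tileB)'])]
```

The map stages scale out trivially (each queue message is independent), while the reduction stage must see all its inputs, which is why it runs last and is the first stage a scientist observes.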
[Pipeline diagram: Scientists interact with the AzureMODIS Service Web Role Portal to submit requests and download scientific results. A Request Queue feeds the Download Queue for the Data Collection Stage, which pulls from the Source Imagery Download Sites and records Source Metadata; the Reprojection Queue feeds the Reprojection Stage, and the Reduction 1 and Reduction 2 Queues feed the Derivation Reduction and Analysis Reduction Stages, which produce the science results.]

http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)

• ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks, i.e., recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage> JobStatus and enqueues the job on the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses the job, persists <PipelineStage> TaskStatus, and dispatches tasks to the <PipelineStage> Task Queue.]
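The parse-and-persist step above can be illustrated in a few lines, with dict-backed "tables" standing in for Azure Table storage (all names here are hypothetical, not the MODISAzure schema):

```python
# In-memory stand-ins for the <PipelineStage> JobStatus / TaskStatus tables.
job_status_table, task_status_table = {}, {}

def parse_and_persist(job_id, stage, tiles):
    """Parse one job request into per-tile tasks; persist status for both."""
    job_status_table[job_id] = {"stage": stage, "state": "Parsed"}
    dispatched = []
    for i, tile in enumerate(tiles):
        task_id = f"{job_id}-{i}"
        task_status_table[task_id] = {"tile": tile, "state": "Queued"}
        dispatched.append(task_id)       # would go on the <stage> Task Queue
    return dispatched

dispatched = parse_and_persist("job42", "Reprojection", ["h08v05", "h09v05"])
print(dispatched)  # ['job42-0', 'job42-1']
```

Because every task's status is persisted before dispatch, a task is a recoverable unit of work: if a worker dies mid-task, the status row still records what remains to be done.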
MODISAzure Architectural Big Picture (2/2)

• All work is actually done by a Worker Role
  • Dequeues tasks created by the Service Monitor
  • Retries failed tasks 3 times
  • Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage> TaskStatus and dispatches tasks to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue the tasks and read from <Input> Data Storage.]
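The GenericWorker's dequeue-and-retry loop described above, sketched with in-memory stand-ins for the task queue and status table (hypothetical helper, not the production code):

```python
def run_worker(tasks, execute, max_retries=3):
    """Attempt each task; retry a failure up to max_retries times."""
    status = {}
    for task in tasks:
        attempts = 0
        while True:
            try:
                execute(task)
                status[task] = "Done"
                break
            except Exception:
                attempts += 1
                if attempts > max_retries:
                    status[task] = "Failed"  # retries exhausted
                    break
    return status

calls = {"t1": 0}
def execute(task):
    if task == "t1":               # t1 succeeds on its second attempt
        calls["t1"] += 1
        if calls["t1"] < 2:
            raise RuntimeError("transient blob write failure")
    if task == "t2":               # t2 always fails: a poison task
        raise RuntimeError("poison task")

result = run_worker(["t1", "t2"], execute)
print(result)  # {'t1': 'Done', 't2': 'Failed'}
```

Capping retries at 3 is what keeps a poison task from blocking the queue forever; its Failed status row is then visible for diagnosis, as in the log analysis earlier.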
Example Pipeline Stage: Reprojection Service

[Diagram: Reprojection Requests enter the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches tasks to the Task Queue; GenericWorker (Worker Role) instances execute the tasks against Reprojection Data Storage and Swath Source Data Storage, consulting the ScanTimeList and SwathGranuleMeta tables.]

• Each job entity specifies a single reprojection job request
• Each task entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
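The two metadata lookups above can be illustrated with dict-backed stand-ins for the tables (all data and helper names below are invented for the example):

```python
# Stand-ins for the SwathGranuleMeta and ScanTimeList tables.
swath_granule_meta = {
    "swath1": {"bounds": (30.0, 35.0, -120.0, -115.0)},  # (S, N, W, E)
}
scan_time_list = {
    "h08v05": ["2010-03-31T06:14", "2010-03-31T06:25"],
}

def swaths_covering(lat, lon):
    """SwathGranuleMeta query: which swaths' boundaries contain this point?"""
    hits = []
    for swath, meta in swath_granule_meta.items():
        s, n, w, e = meta["bounds"]
        if s <= lat <= n and w <= lon <= e:
            hits.append(swath)
    return hits

def tasks_for_tile(tile):
    """ScanTimeList query: one reprojection input per scan covering the tile."""
    return [(tile, t) for t in scan_time_list.get(tile, [])]

print(swaths_covering(32.0, -118.0))  # ['swath1']
print(tasks_for_tile("h08v05"))
```

In the real service these would be partition-key queries against Azure Tables; the point is only that a reprojection task is fully determined by a tile plus the scan times and swath geometry the tables return.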
Costs for 1 US Year ET Computation

• Computational costs driven by data scale and the need to run the reduction multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Per-stage figures (read off the pipeline diagram):
• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 CPU, $60 download
• Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 CPU, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 CPU, $2 download, $9 storage

Total: $1420
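The dollar figures above are reproducible from 2010-era Windows Azure list prices (assumed here: $0.12 per compute-hour, $0.15 per GB-month of storage, $0.10/GB ingress, $0.15/GB egress), which is a useful sanity check on any cloud cost estimate:

```python
# Assumed 2010-era Windows Azure list prices (not stated on the slide).
COMPUTE_HR, STORAGE_GB_MONTH, INGRESS_GB, EGRESS_GB = 0.12, 0.15, 0.10, 0.15

reprojection_cpu = 3500 * COMPUTE_HR           # matches the slide's $420
reduction_cpu    = 1800 * COMPUTE_HR           # matches the slide's $216
upload           = 500 * INGRESS_GB            # ~500 GB ingest -> $50
storage          = 500 * STORAGE_GB_MONTH * 6  # 6 project months -> $450
download         = 400 * EGRESS_GB             # 400 GB out -> $60

print(round(reprojection_cpu, 2), round(reduction_cpu, 2),
      round(upload, 2), round(storage, 2), round(download, 2))
# 420.0 216.0 50.0 450.0 60.0
```

That the slide's per-stage numbers fall out of four unit prices underlines the talk's point: at this scale the dominant cost is people, not the cloud bill.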
Observations and Experience

• Clouds are the largest-scale computer centers ever constructed, and they have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers

Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples

Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit, November 2010 Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research, Roger Barga, Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components: Fabric Controller
- Key Components: Fabric Controller (2)
- Key Components: Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable, Fault Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage, At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection: Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure: Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery, Sensors, Models, and Field Data
- MODISAzure: Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage: Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources: Cloud Research Community Site
- Resources: AzureScope
- Resources: AzureScope (2)
- Demonstration (2)
- Slide 104
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Demonstration
R palustris as a platform for H2 productionEric Shadt SAGE Sam Phattarasukol Harwood Lab UW
Blasted ~5000 proteins (700K sequences)bull Against all NCBI non-redundant proteins completed in 30 minbull Against ~5000 proteins from another strain completed in less
than 30 sec
AzureBLAST significantly saved computing timehellip
All-Against-All ExperimentDiscovering Homologs bull Discover the interrelationships of known protein sequences
ldquoAll against Allrdquo querybull The database is also the input querybull The protein database is large (42 GB size)bull Totally 9865668 sequences to be queried
bull Theoretically 100 billion sequence comparisons
Performance estimationbull Based on the sampling-running on one extra-large Azure
instancebull Would require 3216731 minutes (61 years) on one desktop
This scale of experiments usually are infeasible to most scientists
Our Approachbull Allocated a total of ~4000 instances
bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe
bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service
bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions
bull When load imbalances redistribute the load manually
50
6262 62
6262
5062
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research (Roger Barga, Architect)
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components: Fabric Controller
- Key Components: Fabric Controller (2)
- Key Components: Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable, Fault-Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage, At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection: Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back-Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure: Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery, Sensors, Models, and Field Data
- MODISAzure: Four-Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage: Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources: Cloud Research Community Site
- Resources: AzureScope
- Resources: AzureScope (2)
- Demonstration (2)
- Slide 104
Our Approachbull Allocated a total of ~4000 instances
bull 475 extra-large VMs (8 cores per VM) four datacenters US (2) Western and North Europe
bull 8 deployments of AzureBLASTbull Each deployment has its own co-located storage service
bull Divide 10 million sequences into multiple segmentsbull Each will be submitted to one deployment as one job for executionbull Each segment consists of smaller partitions
bull When load imbalances redistribute the load manually
50
6262 62
6262
5062
End Resultbull Total size of the output result is ~230GB
bull The number of total hits is 1764579487
bull Started at March 25th the last task completed on April 8th (10 days compute)bull But based our estimates real working instance time should be 6~8 daybull Look into log data to analyze what took placehellip
50
6262 62
6262
5062
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (2/2)
All work actually done by a Worker Role
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus; GenericWorker (Worker Role) instances dequeue from the <PipelineStage> Task Queue and read/write <Input>Data Storage.]
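The GenericWorker loop described above (dequeue, execute, retry up to 3 times, record status) can be sketched like this; the queue, status store, and task body are stand-ins, not the real Azure APIs.

```python
# Hypothetical sketch of the GenericWorker loop: dequeue a task, execute
# it, and requeue on failure until the retry budget (3) is exhausted.
MAX_RETRIES = 3

def run_worker(task_queue, task_status, execute):
    while task_queue:
        task_id = task_queue.pop(0)
        try:
            execute(task_id)
            task_status[task_id]["state"] = "Done"
        except Exception:
            task_status[task_id]["retries"] += 1
            if task_status[task_id]["retries"] < MAX_RETRIES:
                task_queue.append(task_id)   # requeue for another attempt
            else:
                task_status[task_id]["state"] = "Failed"

# A task that always fails exhausts its retries and is marked Failed:
status = {"t1": {"state": "Queued", "retries": 0}}
def always_fails(task_id):
    raise RuntimeError("blob write failed")

run_worker(["t1"], status, always_fails)
print(status["t1"])  # {'state': 'Failed', 'retries': 3}
```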
Example Pipeline Stage: Reprojection Service
[Diagram: a Reprojection Request enters the Job Queue, where each entity specifies a single reprojection job request; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue, where each entity specifies a single reprojection task (i.e., a single tile); GenericWorker (Worker Role) instances point to Reprojection Data Storage. Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile; query the ScanTimeList table to get the list of satellite scan times that cover a target tile; source imagery lives in Swath Source Data Storage.]
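The two metadata lookups a reprojection task performs can be sketched as below. The real service queries Azure Tables (SwathGranuleMeta and ScanTimeList); lists of dicts, and all swath names, bounds, and scan times, are illustrative stand-ins here.

```python
# Stand-ins for the two Azure Tables a reprojection task consults.
swath_granule_meta = [
    {"swath": "A2004001.0530", "bounds": (30.0, -120.0, 40.0, -110.0)},
    {"swath": "A2004001.0535", "bounds": (35.0, -115.0, 45.0, -105.0)},
]
scan_time_list = [
    {"tile": "h08v05", "scan_times": ["05:30", "05:35"]},
    {"tile": "h09v05", "scan_times": ["05:35"]},
]

def bounds_of(swath):
    """Geo-metadata (boundaries) for one swath tile."""
    for row in swath_granule_meta:
        if row["swath"] == swath:
            return row["bounds"]
    return None

def scans_covering(tile):
    """Satellite scan times that cover a target sinusoidal tile."""
    for row in scan_time_list:
        if row["tile"] == tile:
            return row["scan_times"]
    return []

print(scans_covering("h08v05"))  # ['05:30', '05:35']
```

Each Task Queue entity then names one target tile, and the worker uses these lookups to find which source swaths to reproject into it.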
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run the reduction stages multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate-student rates

Stage | Data | Compute | Cost
Data collection | 400-500 GB, 60K files, 10 MB/sec | 11 hours, <10 workers | $50 upload, $450 storage
Reprojection | 400 GB, 45K files | 3500 hours, 20-100 workers | $420 cpu, $60 download
Derivation reduction | 5-7 GB, 55K files | 1800 hours, 20-100 workers | $216 cpu, $1 download, $6 storage
Analysis reduction | <10 GB, ~1K files | 1800 hours, 20-100 workers | $216 cpu, $2 download, $9 storage

Total: $1420
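As a sanity check on the cpu line items, a short calculation. The $0.12 per instance-hour rate is an assumption (the 2010 Windows Azure small-instance price); the slide itself lists only the resulting dollar figures.

```python
# Sanity-check the cpu cost line items, assuming a compute rate of
# $0.12 per instance-hour (assumed; not stated on the slide).
RATE = 0.12  # $/instance-hour

reprojection_cpu = 3500 * RATE   # reprojection stage: 3500 instance-hours
reduction_cpu    = 1800 * RATE   # each reduction stage: 1800 instance-hours

print(round(reprojection_cpu), round(reduction_cpu))  # 420 216
```

Both results match the $420 and $216 cpu figures in the cost table, which suggests the stage hour counts are total instance-hours across all workers.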
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed, and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users/communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications, and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com

Resources: AzureScope (2)
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press; Programming Windows Azure, O'Reilly Press; Bing: Channel 9 Windows Azure; Bing: Windows Azure Platform Training Kit – November Update; http://research.microsoft.com/azure; xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
End Result
• Total size of the output result is ~230 GB
• The total number of hits is 1,764,579,487
• Started on March 25th; the last task completed on April 8th (10 days of compute)
• But based on our estimates, the real working instance time should be 6-8 days
• Look into the log data to analyze what took place…
Understanding Azure by analyzing logs
A normal log record should look like:

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it took 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it took 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it took 17.27 mins

Otherwise something is wrong (e.g., a task failed to complete):

3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it took 82 mins
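The log analysis amounts to pairing each "Executing the task N" record with its matching "Execution of task N is done" record; tasks left unpaired (like task 251774 above) are the ones that never completed. A minimal sketch, using an excerpt of the records shown:

```python
import re

log = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done
"""

# Tasks that started vs. tasks that finished; the difference is suspect.
started  = set(re.findall(r"Executing the task (\d+)", log))
finished = set(re.findall(r"Execution of task (\d+) is done", log))

print(sorted(started - finished))  # ['251774']
```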
Surviving System Upgrades
North Europe Data Center: 34,256 tasks processed in total
All 62 compute nodes lost tasks and then came back in a group; each group is an update domain
~30 mins between groups; ~6 nodes in one group
Surviving Storage Failures
West Europe Data Center: 30,976 tasks completed before the job was killed
35 nodes experienced blob-writing failures at the same time
A reasonable guess: the fault domain is working
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." – Irish proverb
Computing Evapotranspiration (ET)

Penman-Monteith (1964):

ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs))·λv)

where
ET = water volume evapotranspired (m3 s-1 m-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m-2)
cp = specific heat capacity of air (J kg-1 K-1)
ρa = dry air density (kg m-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air (inverse of ra) (m s-1)
gs = conductivity of plant stoma, air (inverse of rs) (m s-1)
γ = psychrometric constant (γ ≈ 66 Pa K-1)

Estimating resistance/conductivity across a catchment can be tricky
• Lots of inputs: big data reduction
• Some of the inputs are not so simple
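The Penman-Monteith equation transcribes directly into code. The input values below are illustrative only (not from the slide); the defaults for γ and λv follow the legend above.

```python
# Direct transcription of the Penman-Monteith equation:
#   ET = (Δ·Rn + ρa·cp·δq·ga) / ((Δ + γ·(1 + ga/gs))·λv)
def penman_monteith(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                    gamma=66.0, lambda_v=2260.0):
    """delta, dq, gamma in Pa(/K); r_n in W m-2; lambda_v in J/g."""
    numerator = delta * r_n + rho_a * c_p * dq * g_a
    denominator = (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    return numerator / denominator

# Illustrative inputs (assumed, for demonstration only):
et = penman_monteith(delta=145.0, r_n=400.0, rho_a=1.2,
                     c_p=1005.0, dq=1000.0, g_a=0.02, g_s=0.01)
print(et > 0)  # True
```

The "tricky" parts the slide points to are ga and gs: the conductivities vary across a catchment and must themselves be estimated from the imagery and field data.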
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Understanding Azure by analyzing logs
A normal log record should be
Otherwise something is wrong (eg task failed to complete)
3312010 614 RD00155D3611B0 Executing the task 251523 3312010 625 RD00155D3611B0 Execution of task 251523 is done it took 109mins3312010 625 RD00155D3611B0 Executing the task 251553 3312010 644 RD00155D3611B0 Execution of task 251553 is done it took 193mins3312010 644 RD00155D3611B0 Executing the task 251600 3312010 702 RD00155D3611B0 Execution of task 251600 is done it took 1727 mins
3312010 822 RD00155D3611B0 Executing the task 251774
3312010 950 RD00155D3611B0 Executing the task 251895
3312010 1112 RD00155D3611B0 Execution of task 251895 is done it took 82 mins
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Surviving System Upgrades
North Europe Data Center totally 34256 tasks processed
All 62 compute nodes lost tasks and then came back in a group This is an
Update domain
~30 mins
~ 6 nodes in one group
35 Nodes experience blob writing failure at same time
Surviving Storage FailuresWest Europe Datacenter 30976 tasks are completed and job was killed
A reasonable guess the Fault Domain is working
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (1/2)
• MODISAzure Service is the Web Role front door
  - Receives all user requests
  - Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• Service Monitor is a dedicated Worker Role
  - Parses all job requests into tasks, recoverable units of work
  - Execution status of all jobs and tasks is persisted in Tables

[Diagram: a <PipelineStage> Request enters the <PipelineStage> Job Queue via the MODISAzure Service (Web Role); the Service Monitor (Worker Role) persists <PipelineStage>JobStatus, parses and persists <PipelineStage>TaskStatus, and dispatches work to the <PipelineStage> Task Queue.]
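The Service Monitor's parse-and-persist step can be sketched as below. The dict and deque are stand-ins for Azure Tables and Queues, and the job/task ids are hypothetical, not the service's real schema:

```python
# Sketch of the Service Monitor pattern: parse a job request into
# recoverable task units, persist a status row per task, then
# dispatch the task ids to the stage's task queue.
from collections import deque

job_status, task_status = {}, {}   # stand-ins for Azure Tables
task_queue = deque()               # stand-in for a <PipelineStage> Task Queue

def parse_and_dispatch(job_id, tiles):
    job_status[job_id] = "Parsing"
    for i, tile in enumerate(tiles):
        task_id = f"{job_id}-{i}"
        task_status[task_id] = {"tile": tile, "state": "Queued"}
        task_queue.append(task_id)
    job_status[job_id] = "Dispatched"

parse_and_dispatch("reproj-42", ["h08v05", "h09v05", "h10v05"])
print(len(task_queue), job_status["reproj-42"])  # 3 Dispatched
```

Because every task's status row is persisted before it is dispatched, a crashed worker loses at most one unit of work, which is what makes tasks "recoverable".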
MODISAzure Architectural Big Picture (2/2)
• All work is actually done by a GenericWorker (Worker Role)
  - Dequeues tasks created by the Service Monitor
  - Retries failed tasks 3 times
  - Maintains all task status

[Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; GenericWorker (Worker Role) instances dequeue tasks and read <Input> Data Storage.]
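The GenericWorker's dequeue-and-retry behavior can be sketched as follows; this is an illustrative loop under assumed names (the real role uses Azure Queue visibility timeouts rather than an in-memory deque):

```python
# Sketch of the GenericWorker loop: dequeue a task, attempt it,
# and retry a failing task up to 3 times before marking it failed.
from collections import deque

task_queue = deque(["t1", "t2"])
task_status = {"t1": {"tries": 0}, "t2": {"tries": 0}}
MAX_RETRIES = 3

def run(task_id):
    if task_id == "t2":                # t2 always fails in this sketch
        raise RuntimeError("transient storage error")

while task_queue:
    tid = task_queue.popleft()
    try:
        run(tid)
        task_status[tid]["state"] = "Done"
    except RuntimeError:
        task_status[tid]["tries"] += 1
        if task_status[tid]["tries"] < MAX_RETRIES:
            task_queue.append(tid)     # requeue for another attempt
        else:
            task_status[tid]["state"] = "Failed"

print(task_status["t1"]["state"], task_status["t2"]["state"])  # Done Failed
```

Capping retries at 3 is what keeps a poison task from circulating in the queue forever.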
Example Pipeline Stage: Reprojection Service

[Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue, from which GenericWorker (Worker Role) instances pull tasks that point into Reprojection Data Storage and Swath Source Data Storage.]

• Each ReprojectionJobStatus entity specifies a single reprojection job request
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
Costs for 1 US Year ET Computation
• Computational costs driven by data scale and the need to run reductions multiple times
• Storage costs driven by data scale and the 6-month project duration
• Small with respect to the people costs, even at graduate student rates

Per-stage data volumes and costs:
• Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage
• Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 cpu, $60 download
• Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers; $216 cpu, $1 download, $6 storage
• Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 cpu, $2 download, $9 storage

Total: $1420
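The cpu and storage figures are consistent with the 2010-era Azure rates of roughly $0.12 per small-instance compute hour and $0.15 per GB-month of storage; those rates are an assumption on my part, not stated on the slide:

```python
# Back-of-the-envelope check of the per-stage costs, assuming
# 2010 Azure rates (assumed, not from the slide).
COMPUTE_RATE = 0.12    # $/compute-hour (assumed)
STORAGE_RATE = 0.15    # $/GB-month (assumed)

print(round(3500 * COMPUTE_RATE))      # 420  -> Reprojection cpu
print(round(1800 * COMPUTE_RATE))      # 216  -> each reduction stage cpu
print(round(500 * STORAGE_RATE * 6))   # 450  -> ~500 GB held for 6 months
```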
Observations and Experience
• Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems
• Equally important, they can increase participation in research, providing needed resources to users and communities without ready access
• Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today
• Clouds provide valuable fault tolerance and scalability abstractions
• Clouds act as an amplifier for familiar client tools and on-premises compute
• Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit - November Update
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
Surviving Storage Failures
West Europe Datacenter: 35 nodes experienced blob writing failure at the same time; 30,976 tasks were completed and the job was killed.
A reasonable guess: the Fault Domain is working.
MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." – Irish Proverb
Computing Evapotranspiration (ET)

Penman-Monteith (1964):

$ET = \dfrac{\Delta R_n + \rho_a c_p\,(\delta q)\,g_a}{\left(\Delta + \gamma\,(1 + g_a/g_s)\right)\lambda_v}$

where
ET = water volume evapotranspired (m^3 s^-1 m^-2)
Δ = rate of change of saturation specific humidity with air temperature (Pa K^-1)
λv = latent heat of vaporization (J/g)
Rn = net radiation (W m^-2)
cp = specific heat capacity of air (J kg^-1 K^-1)
ρa = dry air density (kg m^-3)
δq = vapor pressure deficit (Pa)
ga = conductivity of air, the inverse of ra (m s^-1)
gs = conductivity of plant stoma air, the inverse of rs (m s^-1)
γ = psychrometric constant (γ ≈ 66 Pa K^-1)

• Lots of inputs; big data reduction
• Some of the inputs are not so simple
• Estimating resistance/conductivity across a catchment can be tricky
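Once the inputs are in hand, the equation itself is a single expression. A sketch, with the input values below chosen as illustrative placeholders rather than real FLUXNET or MODIS-derived data:

```python
# Penman-Monteith ET as written above. The default gamma matches the
# slide's psychrometric constant (~66 Pa/K); lambda_v ~2260 J/g is the
# latent heat of vaporization of water. Inputs are placeholders.
def penman_monteith_et(delta, r_n, rho_a, c_p, dq, g_a, g_s,
                       gamma=66.0, lambda_v=2260.0):
    """ET = (Δ·Rn + ρa·cp·(δq)·ga) / ((Δ + γ·(1 + ga/gs)) · λv)"""
    return (delta * r_n + rho_a * c_p * dq * g_a) / (
        (delta + gamma * (1.0 + g_a / g_s)) * lambda_v
    )

et = penman_monteith_et(delta=0.145, r_n=400.0, rho_a=1.2,
                        c_p=1004.0, dq=800.0, g_a=0.02, g_s=0.01)
print(et > 0)  # True
```

The hard part in practice is not this arithmetic but supplying ga and gs across a catchment, which is exactly the "tricky" estimation the slide calls out.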
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
MODISAzure Computing Evapotranspiration (ET) in the Cloud
You never miss the water till the well has run dryIrish Proverb
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Computing Evapotranspiration (ET)
ET = Water volume evapotranspired (m3 s-1 m-2) Δ = Rate of change of saturation specific humidity with air temperature(Pa K-1) λv = Latent heat of vaporization (Jg) Rn = Net radiation (W m-2)cp = Specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa)ga = Conductivity of air (inverse of ra) (m s-1)gs = Conductivity of plant stoma air (inverse of rs) (m s-1) γ = Psychrometric constant (γ asymp 66 Pa K-1)
Estimating resistanceconductivity across a catchment can be tricky
bull Lots of inputs big data reductionbull Some of the inputs are not so simple
119864119879= ∆119877119899 + 120588119886 119888119901ሺ120575119902ሻ119892119886(∆+ 120574ሺ1+ 119892119886 119892119904Τ ሻ)120582120592
Penman-Monteith (1964)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and transpiration or evaporation through plant membranes by plants
ET Synthesizes Imagery Sensors Models and Field Data
NASA MODIS imagery source
archives5 TB (600K files)
FLUXNET curated sensor dataset
(30GB 960 files)
FLUXNET curated field dataset2 KB (1 file)
NCEPNCAR ~100MB (4K files)
Vegetative clumping~5MB (1file)
Climate classification~1MB (1file)
20 US year = 1 global year
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources: Cloud Research Community Site
http://research.microsoft.com/azure
• Getting-started steps for developers
• Available research services
• Use cases on Azure for research
• Event announcements
• Detailed tutorials
• Technical papers
Email us with questions at xcgngage@microsoft.com
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance for compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
Demonstration
Azure in Action, Manning Press
Programming Windows Azure, O'Reilly Press
Bing: Channel 9 Windows Azure
Bing: Windows Azure Platform Training Kit (November 2010 Update)
http://research.microsoft.com/azure
xcgngage@microsoft.com
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute Web Roles
- Key Components – Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components – Compute VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applications
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (1/2)
- MODISAzure Architectural Big Picture (2/2)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
ET Synthesizes Imagery, Sensors, Models, and Field Data
NASA MODIS imagery source archives: 5 TB (600K files)
FLUXNET curated sensor dataset: 30 GB (960 files)
FLUXNET curated field dataset: 2 KB (1 file)
NCEP/NCAR: ~100 MB (4K files)
Vegetative clumping: ~5 MB (1 file)
Climate classification: ~1 MB (1 file)
20 US years = 1 global year
MODISAzure Four-Stage Image Processing Pipeline
Data collection (map) stage
• Downloads requested input tiles from NASA FTP sites
• Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stage
• Converts source tile(s) to intermediate-result sinusoidal tiles
• Simple nearest-neighbor or spline algorithms
Derivation reduction stage
• First stage visible to the scientist
• Computes ET in our initial use
Analysis reduction stage
• Optional second stage visible to the scientist
• Enables production of science analysis artifacts such as maps, tables, and virtual sensors
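The four stages above form a map/reduce chain connected by queues. A minimal sketch, using Python's standard-library queues as stand-ins for the Azure queues between stages (tile IDs and record shapes are hypothetical):

```python
from queue import Queue

# Hypothetical stand-ins for the Azure queues that connect the stages.
reprojection_q, reduction1_q, reduction2_q = Queue(), Queue(), Queue()

def data_collection_stage(tile_request):
    """Map stage: fetch a source tile and pass it on for reprojection."""
    source_tile = {"id": tile_request, "data": f"raw:{tile_request}"}
    reprojection_q.put(source_tile)

def reprojection_stage(source_tile):
    """Map stage: convert a source tile to a sinusoidal intermediate tile."""
    sinusoidal = {"id": source_tile["id"],
                  "data": source_tile["data"].replace("raw", "sinusoidal")}
    reduction1_q.put(sinusoidal)

def derivation_reduction_stage(tiles):
    """Reduce stage: combine intermediate tiles into a science result (e.g. ET)."""
    reduction2_q.put({"et": [t["id"] for t in tiles]})

# Drive two tile requests through the chain.
for req in ["h08v05", "h09v05"]:
    data_collection_stage(req)
while not reprojection_q.empty():
    reprojection_stage(reprojection_q.get())
derivation_reduction_stage([reduction1_q.get() for _ in range(reduction1_q.qsize())])
result = reduction2_q.get()
print(result)  # {'et': ['h08v05', 'h09v05']}
```

Because each hand-off goes through a queue, any stage can be scaled out independently by adding workers that drain its input queue.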
(Architecture diagram: Scientists submit requests through the AzureMODIS Service Web Role Portal; a Request Queue feeds the Download, Reprojection, Reduction 1, and Reduction 2 Queues, which drive the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages; source imagery comes from download sites with source metadata, and science results are available for scientific results download.)
http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx
MODISAzure Architectural Big Picture (1/2)
• The ModisAzure Service is the Web Role front door
  • Receives all user requests
  • Queues each request to the appropriate Download, Reprojection, or Reduction Job Queue
• The Service Monitor is a dedicated Worker Role
  • Parses all job requests into tasks – recoverable units of work
  • Execution status of all jobs and tasks is persisted in Tables
(Diagram: a <PipelineStage> Request arrives at the MODISAzure Service (Web Role), which persists <PipelineStage>JobStatus and enqueues to the <PipelineStage> Job Queue; the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue.)
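The Service Monitor's parse-and-persist step can be sketched as follows; plain dicts stand in for the JobStatus/TaskStatus Tables, a stdlib queue for the Task Queue, and all field names are hypothetical:

```python
import uuid
from queue import Queue

job_status = {}   # stands in for the <PipelineStage>JobStatus table
task_status = {}  # stands in for the <PipelineStage>TaskStatus table
task_queue = Queue()

def parse_and_persist(job_request):
    """Parse one job request into recoverable task units (one per tile),
    persist job and task status, and dispatch each task to the queue."""
    job_id = str(uuid.uuid4())
    job_status[job_id] = {"stage": job_request["stage"], "state": "Parsed"}
    for tile in job_request["tiles"]:
        task_id = f"{job_id}:{tile}"
        task_status[task_id] = {"tile": tile, "state": "Queued"}
        task_queue.put(task_id)
    return job_id

job_id = parse_and_persist({"stage": "Reprojection",
                            "tiles": ["h08v05", "h09v05"]})
print(task_queue.qsize())  # 2
```

Persisting every task's status before dispatch is what makes the units recoverable: a crashed worker's task can be found and re-run from the status table.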
MODISAzure Architectural Big Picture (2/2)
All work is actually done by a Worker Role:
• Dequeues tasks created by the Service Monitor
• Retries failed tasks 3 times
• Maintains all task status
(Diagram: the Service Monitor (Worker Role) parses and persists <PipelineStage>TaskStatus and dispatches to the <PipelineStage> Task Queue; Generic Workers (Worker Roles) dequeue tasks and read from <Input> Data Storage.)
Example Pipeline Stage: Reprojection Service
(Diagram: a Reprojection Request enters the Job Queue; the Service Monitor (Worker Role) persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches to the Task Queue; Generic Workers (Worker Roles) consume tasks that point to the ScanTimeList table, the SwathGranuleMeta table, Reprojection Data Storage, and Swath Source Data Storage.)
• Each job entity specifies a single reprojection job request
• Each task entity specifies a single reprojection task (i.e., a single tile)
• Query the SwathGranuleMeta table to get geo-metadata (e.g., boundaries) for each swath tile
• Query the ScanTimeList table to get the list of satellite scan times that cover a target tile
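The table lookups described above can be modeled with plain dicts standing in for the SwathGranuleMeta and ScanTimeList tables (all keys, bounds, and timestamps are illustrative):

```python
# Stand-ins for the Azure Tables a reprojection task consults.
swath_granule_meta = {  # geo-metadata (e.g. boundaries) per swath tile
    "swath-001": {"bounds": (30.0, -120.0, 40.0, -110.0)},
    "swath-002": {"bounds": (30.0, -110.0, 40.0, -100.0)},
}
scan_time_list = {      # satellite scan times covering each target tile
    "h08v05": ["2010-06-01T10:30", "2010-06-01T12:05"],
}

def build_reprojection_task(target_tile):
    """Resolve one reprojection task: look up the scan times covering the
    target tile, then gather geo-metadata for the contributing swath tiles."""
    scan_times = scan_time_list.get(target_tile, [])
    contributing = {s: m["bounds"] for s, m in swath_granule_meta.items()}
    return {"tile": target_tile, "scan_times": scan_times,
            "swaths": contributing}

task = build_reprojection_task("h08v05")
print(len(task["scan_times"]))  # 2
```

Keeping the metadata in tables rather than inside the queue message keeps each task entity small: the message only has to point at the rows to read.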
MODISAzure Four Stage Image Processing PipelineData collection (map) stagebull Downloads requested input
tiles from NASA ftp sitesbull Includes geospatial lookup for
non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
Reprojection (map) stagebull Converts source tile(s) to
intermediate result sinusoidal tiles
bull Simple nearest neighbor or spline algorithms
Derivation reduction stagebull First stage visible to scientistbull Computes ET in our initial use
Analysis reduction stagebull Optional second stage visible
to scientistbull Enables production of science
analysis artifacts such as maps tables virtual sensors
Reduction 1 Queue
Source Metadata
AzureMODIS Service Web Role Portal
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Science results
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
httpresearchmicrosoftcomen-usprojectsazureazuremodisaspx
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
MODISAzure Architectural Big Picture (12)
bull ModisAzure Service is the Web Role front doorbull Receives all user requestsbull Queues request to appropriate
Download Reprojection or Reduction Job Queue
bull Service Monitor is a dedicated Worker Rolebull Parses all job requests into tasks
ndash recoverable units of work bull Execution status of all jobs and
tasks persisted in Tables
ltPipelineStagegt Request
hellipltPipelineStagegtJobStatus
PersistltPipelineStagegtJob Queue
MODISAzure Service(Web Role)
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
hellip
DispatchltPipelineStagegtTask Queue
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
MODISAzure Architectural Big Picture (22)
All work actually done by a Worker Role
Service Monitor (Worker Role)
Parse amp PersistltPipelineStagegtTaskStatus
GenericWorker (Worker Role)
hellip
hellip
DispatchltPipelineStagegtTask Queue
hellip
ltInputgtData Storage
bull Dequeues tasks created by the Service Monitor
bull Retries failed tasks 3 timesbull Maintains all task status
Example Pipeline Stage Reprojection Service
Reprojection Requesthellip
Service Monitor (Worker Role)
ReprojectionJobStatusPersist
Parse amp PersistReprojectionTaskStatus
GenericWorker (Worker Role)
hellip
Job Queue
hellip
Dispatch
Task Queue
Points to
hellip
ScanTimeList
SwathGranuleMetaReprojection Data
Storage
Each entity specifies a single reprojection job request
Each entity specifies a single reprojection task (ie a single
tile)
Query this table to get geo-metadata (eg boundaries)
for each swath tile
Query this table to get the list of satellite scan times that
cover a target tile
Swath Source Data Storage
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research, Roger Barga, Architect
- The Million Server Datacenter
- HPC and Clouds – Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds – Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components: Fabric Controller
- Key Components: Fabric Controller (2)
- Key Components: Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components – Compute: Web Roles
- Key Components – Compute: Worker Roles
- Suggested Application Model: Using queues for reliable messaging
- Scalable, Fault-Tolerant Applications
- Key Components – Compute: VM Roles
- Slide 24
- 'Grokking' the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce – The Fabric
- Slide 33
- Durable Storage, At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection: Things to Consider
- Slide 54
- Tables Recap
- Queues: Their Unique Role in Building Reliable, Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R. palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure: Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery, Sensors, Models, and Field Data
- MODISAzure: Four Stage Image Processing Pipeline
- MODISAzure: Architectural Big Picture (1/2)
- MODISAzure: Architectural Big Picture (2/2)
- Example Pipeline Stage: Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources: Cloud Research Community Site
- Resources: AzureScope
- Resources: AzureScope (2)
- Demonstration (2)
- Slide 104
Example Pipeline Stage: Reprojection Service

[Architecture diagram] Reprojection requests arrive at a Service Monitor (Worker Role), which persists ReprojectionJobStatus, parses and persists ReprojectionTaskStatus, and dispatches work from a Job Queue onto a Task Queue consumed by Generic Workers (Worker Roles). The workers read from Swath Source Data Storage and write Reprojection Data back to storage.
• Each ReprojectionJobStatus entity specifies a single reprojection job request.
• Each ReprojectionTaskStatus entity specifies a single reprojection task (i.e., a single tile).
• The SwathGranuleMeta table is queried for geo-metadata (e.g., boundaries) for each swath tile.
• The ScanTimeList table is queried for the list of satellite scan times that cover a target tile.
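The Service Monitor's job-to-task fan-out can be sketched with an in-memory queue standing in for the Azure Task Queue; the function, field, and tile names below are illustrative, not taken from the MODISAzure code:

```python
from queue import Queue

def dispatch_reprojection_job(job, task_queue):
    """Fan one reprojection job request out into one queue message per
    tile, as the Service Monitor does when it feeds the Task Queue."""
    for tile in job["tiles"]:
        task_queue.put({"job_id": job["job_id"], "tile": tile})
    return len(job["tiles"])

task_queue = Queue()
n = dispatch_reprojection_job(
    {"job_id": "us-et-2009", "tiles": ["h08v05", "h09v05", "h10v05"]},
    task_queue,
)
print(n)  # 3 tile-sized tasks now wait for Generic Workers to pull
```

Because each message describes a single tile, any worker can process any task, and a failed task simply reappears on the queue for another worker, which is the fault-tolerance abstraction the deck highlights.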
Costs for 1 US Year ET Computation
• Computational costs are driven by data scale and the need to run the reductions multiple times.
• Storage costs are driven by data scale and the six-month project duration.
• Both are small with respect to the people costs, even at graduate-student rates.

[Pipeline diagram] Scientists submit work through the AzureMODIS Service Web Role Portal; the stages communicate via the Request, Download, Reprojection, Reduction 1, and Reduction 2 queues, drawing on Source Metadata and the Source Imagery Download Sites, with scientific results available for download.

Data Collection Stage: 400-500 GB, 60K files, 10 MB/sec, 11 hours, <10 workers ($50 upload, $450 storage)
Reprojection Stage: 400 GB, 45K files, 3500 hours, 20-100 workers ($420 CPU, $60 download)
Derivation Reduction Stage: 5-7 GB, 55K files, 1800 hours, 20-100 workers ($216 CPU, $1 download, $6 storage)
Analysis Reduction Stage: <10 GB, ~1K files, 1800 hours, 20-100 workers ($216 CPU, $2 download, $9 storage)
Total: $1420
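As a quick check, summing the per-stage line items above comes out close to the quoted total (the items sum to $1,430; the slide quotes roughly $1,420):

```python
# Per-stage line items in dollars, as listed on the cost slide.
stage_costs = {
    "data collection":      {"upload": 50, "storage": 450},
    "reprojection":         {"cpu": 420, "download": 60},
    "derivation reduction": {"cpu": 216, "download": 1, "storage": 6},
    "analysis reduction":   {"cpu": 216, "download": 2, "storage": 9},
}
total = sum(v for items in stage_costs.values() for v in items.values())
print(total)  # 1430
```

CPU dominates ($852 of the total), consistent with the observation that compute cost is driven by data scale and repeated reduction runs.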
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Costs for 1 US Year ET Computation
bull Computational costs driven by data scale and need to run reduction multiple times
bull Storage costs driven by data scale and 6 month project duration
bull Small with respect to the people costs even at graduate student rates
Reduction 1 Queue
Source Metadata
Request Queue
Scientific Results Download
Data Collection Stage
Source Imagery Download Sites
Reprojection Queue
Reduction 2 Queue
DownloadQueue
Scientists
Analysis Reduction StageDerivation Reduction Stage Reprojection Stage
400-500 GB60K files10 MBsec11 hourslt10 workers
$50 upload$450 storage
400 GB45K files3500 hours20-100 workers
5-7 GB55K files1800 hours20-100 workers
lt10 GB~1K files1800 hours20-100 workers
$420 cpu$60 download
$216 cpu$1 download$6 storage
$216 cpu$2 download$9 storage
AzureMODIS Service Web Role Portal
Total $1420
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Observations and Experiencebull Clouds are the largest scale computer centers ever constructed and have
the potential to be important to both large and small scale science problems
bull Equally import they can increase participation in research providing needed resources to userscommunities without ready access
bull Clouds suitable for ldquoloosely coupledrdquo data parallel applications and can support many interesting ldquoprogramming patternsrdquo but tightly coupled low-latency applications do not perform optimally on clouds today
bull Provide valuable fault tolerance and scalability abstractions
bull Clouds as amplifier for familiar client tools and on premise compute
bull Clouds services to support research provide considerable leverage for both individual researchers and entire communities of researchers
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Resources Cloud Research Community Sitehttpresearchmicrosoftcomazure bull Getting started steps for
developersbull Available research services bull Use cases on Azure for researchbull Event Announcementsbull Detailed tutorialsbull Technical papers
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Resources AzureScopehttpazurescopecloudappnet bull Simple benchmarks illustrating
basic performance for compute and storage services
bull Benchmarks for reference algorithms
bull Best Practice tipsbull Code Samples
Email us with questions at xcgngagemicrosoftcom
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom
- Windows Azure for Research Roger Barga Architect
- The Million Server Datacenter
- HPC and Clouds ndash Select Comparisons
- HPC Node Architecture
- HPC Interconnects
- Modern Data Center Network
- HPC Storage Systems
- HPC and Clouds ndash Select Comparisons (2)
- Slide 9
- Slide 10
- Application Model Comparison
- Application Model Comparison (2)
- Key Components
- Key Components Fabric Controller
- Key Components Fabric Controller (2)
- Key Components Fabric Controller (3)
- Creating a New Project
- Windows Azure Compute
- Key Components ndash Compute Web Roles
- Key Components ndash Compute Worker Roles
- Suggested Application Model Using queues for reliable messaging
- Scalable Fault Tolerant Applications
- Key Components ndash Compute VM Roles
- Slide 24
- lsquoGrokkingrsquo the service model
- Automated Service Management
- Service Definition
- Service Configuration
- GUI
- Deploying to the cloud
- Service Management API
- The Secret Sauce ndash The Fabric
- Slide 33
- Durable Storage At Massive Scale
- Blob Features and Functions
- Containers
- Two Types of Blobs Under the Hood
- Blocks
- Pages
- BLOB Leases
- Windows Azure Drive
- Windows Azure Drive API
- BLOB Guidance
- Table Structure
- Windows Azure Tables
- Is not relational
- Windows Azure Queues
- Storage Partitioning
- Partition Keys In Each Abstraction
- Replication Guarantee
- Scalability Targets
- Partitions and Partition Ranges
- Key Selection Things to Consider
- Slide 54
- Tables Recap
- Queues Their Unique Role in Building Reliable Scalable Applica
- Queue Terminology
- Message Lifecycle
- Truncated Exponential Back Off Polling
- Removing Poison Messages
- Removing Poison Messages (2)
- Removing Poison Messages (3)
- Queues Recap
- Windows Azure Storage Takeaways
- Slide 65
- Picking the Right VM Size
- Using Your VM to the Maximum
- Exploiting Concurrency
- Finding Good Code Neighbors
- Scaling Appropriately
- Storage Costs
- Saving Bandwidth Costs
- Compressing Content
- Best Practices Summary
- Cloud Computing for eScience Applications
- NCBI BLAST
- Opportunities for Cloud Computing
- AzureBLAST
- AzureBLAST Task-Flow
- Micro-Benchmarks Inform Design
- AzureBLAST (2)
- AzureBLAST Job Portal
- Demonstration
- R palustris as a platform for H2 production
- All-Against-All Experiment
- Our Approach
- End Result
- Understanding Azure by analyzing logs
- Surviving System Upgrades
- Surviving Storage Failures
- MODISAzure Computing Evapotranspiration (ET) in the Cloud
- Computing Evapotranspiration (ET)
- ET Synthesizes Imagery Sensors Models and Field Data
- MODISAzure Four Stage Image Processing Pipeline
- MODISAzure Architectural Big Picture (12)
- MODISAzure Architectural Big Picture (22)
- Example Pipeline Stage Reprojection Service
- Costs for 1 US Year ET Computation
- Observations and Experience
- Resources Cloud Research Community Site
- Resources AzureScope
- Resources AzureScope (2)
- Demonstration (2)
- Slide 104
-
Demonstration
Azure in Action Manning PressProgramming Windows Azure OrsquoReilly PressBing Channel 9 Windows AzureBing Windows Azure Platform Training Kit - November Updatehttpresearchmicrosoftcomazurexcgngagemicrosoftcom