NICTA, Disaster Recovery using OpenStack
DESCRIPTION
Jorke Odolphi, NICTA, Disaster Recovery Solution using OpenStack, Thurs, 3:50 pm session
TRANSCRIPT
Building a Disaster Recovery Solution using OpenStack
Jorke Odolphi
Principal Research Engineer
NICTA
@jorke
http://bionicvision.org.au/eye
The Team
Yuru – ‘cloud’, Gamilaraay People NSW
Problem
The cloud can fail.
Online businesses that rely on, and benefit most from, the cloud often lack the skills
to handle failure.
Disaster Recovery
the processes, policies and procedures related to preparing for the recovery or continuation of
technology infrastructure critical to an organisation after a natural or human-induced
disaster *
*according to wikipedia..
RPO
Recovery Point Objective
“maximum tolerable period in which data might be lost from an IT Service due to a Major
incident…” *
*according to wikipedia..
RTO
Recovery Time Objective
“duration of time and a service level within which a business process must be restored after
a disaster…” *
*according to wikipedia..
[Diagram: a Recovery Time Objective axis running from 0 downtime (realtime recovery/failover) to "sometime...", against a Recovery Point Objective axis running from realtime to "somewhere..."]
Our Goal
Without re-architecting your application:
– Provide a configurable warm standby solution,
– with a known, consistent RPO,
– reducing RTO,
– minimising business impact.
Goals and Challenges
Replicate application over to OpenStack in case of a disaster
– Preserve the running environment of the application, including:
• Compute instances
• Networks
• DNS
Minimise RTO and RPO AND cost!
mypizzashop.com.au (Public IP / Load Balanced)
Web front end: Apache/Nginx/IIS
app.mypizzashop.com.au (Private IP)
Application: Processing/memcache
db.mypizzashop.com.au (Private IP)
Database: MySQL/PostgreSQL/MSSQL
Architecting for DR in Cloud
Virtualise your servers
– snapshotting support in the hypervisor, primarily at the disk level
Use Dynamic DNS solutions
– E.g. Route 53, Anycast DNS
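As a concrete illustration of the dynamic-DNS piece, the sketch below builds the kind of low-TTL record update a failover would issue. The record name, IP addresses and helper function are hypothetical examples; with Amazon Route 53 a change batch like this would be passed to `change_resource_record_sets`.

```python
# Hypothetical sketch: on failover, repoint a low-TTL DNS A record from the
# primary cloud's IP to the warm standby. The function and values are
# illustrative, not a real deployment.

def failover_change_batch(record_name, standby_ip, ttl=60):
    """Build a Route 53-style UPSERT that redirects traffic to the standby."""
    return {
        "Comment": "DR failover to warm standby",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "TTL": ttl,  # keep TTL low so clients re-resolve quickly
                "ResourceRecords": [{"Value": standby_ip}],
            },
        }],
    }

batch = failover_change_batch("mypizzashop.com.au.", "203.0.113.10")
```

The low TTL is what makes the warm-standby approach workable: clients re-resolve within about a minute of the switch, which directly bounds the DNS contribution to RTO.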
Compatibility across IaaS Clouds

Cloud Provider | Framework  | Compute Instance | Object Store | Block Storage | Network | Security Group
AWS            | Custom     | ✓                | ✓            | ✓             | DHCP    | ✓
Rackspace      | Custom     | ✓                | ✓            | ✗             | STATIC  | ✗
Ninefold       | CloudStack | ✓                | ✓            | ✓             | DHCP    | ✓
TryStack       | OpenStack  | ✓                | ✓            | ✓             | DHCP    | ✓
HP Cloud       | OpenStack  | ✓                | ✓            | ✗             | DHCP    | ✓
• Replication from one cloud to another is NOT always possible
• Some clouds do not have all the technology pieces (e.g., Block Storage)
• Minimum requirements for replicating application servers:
  • Compute instance and persistent storage, such as object store or block storage
  • Snapshot service (to ensure point-in-time consistency)
  • Hypervisor support (e.g., PVGrub)
Overview of DR Process
AWS side: Take snapshot → Create volume → Partition → Send to storage
OpenStack side: Download from storage → Mount on new instance
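The flow above can be sketched as a simple pipeline. Every function here is a hypothetical placeholder standing in for the corresponding cloud API call, not part of the actual NICTA implementation:

```python
# Sketch of the DR replication pipeline: snapshot on the source cloud,
# ship the data via object storage, mount on OpenStack. All step
# callables are hypothetical stand-ins for real cloud API calls.

def replicate(instance_id, snapshot, create_volume, upload, download, mount):
    """Run the source-to-standby replication steps in order."""
    snap = snapshot(instance_id)      # AWS: point-in-time snapshot
    volume = create_volume(snap)      # materialise the snapshot as a volume
    blob = upload(volume)             # send volume data to object storage
    data = download(blob)             # fetch on the OpenStack side
    return mount(data)                # attach to a new OpenStack instance
```

Expressing the process as a chain of pluggable steps is one way to keep the pipeline portable: only the step implementations change when the source or target cloud does.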
Building DR using OpenStack
Progress:
– Deployed OpenStack in our NICTA lab
– Successfully replicated AWS compute instances to OpenStack
  • In the Rackspace OpenStack public cloud (private beta)
  • Instances created from a standard 64-bit EXT3 AWS OpenSuse image
Requirements:
– Xen support for PVGrub
– Write access to the partition table
– Network support
Problems
• Latency
• Point in time: log and replay / transactional
• How do modern databases handle broken transactions / problem disks? Rollback
Optimisations: Incremental Backup
Typical AWS system volume is around 10GB
Replication is tricky for large data volumes
– Initial backup:
• Send the whole data volume (unavoidable!)
• Optimise by compression and skipping empty space (0’s)
– Subsequent backups:
• Incremental – partition a volume into chunks and resend only the difference (the ‘delta’)
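The chunk-and-delta scheme above can be sketched as follows. The chunk size, hashing choice and function names are illustrative assumptions, not the talk's actual implementation:

```python
# Sketch of chunk-based incremental backup: hash each fixed-size chunk,
# skip all-zero chunks on the initial backup, and on later backups send
# only chunks whose digest changed. Details are illustrative assumptions.
import hashlib

def chunk_digests(volume, chunk_size):
    """SHA-256 digest of each fixed-size chunk of the volume."""
    return [hashlib.sha256(volume[i:i + chunk_size]).hexdigest()
            for i in range(0, len(volume), chunk_size)]

def delta(volume, previous_digests, chunk_size=4 * 1024 * 1024):
    """Return (chunk_index, chunk_bytes) pairs that need to be sent."""
    changed = []
    for i, digest in enumerate(chunk_digests(volume, chunk_size)):
        chunk = volume[i * chunk_size:(i + 1) * chunk_size]
        if i < len(previous_digests) and previous_digests[i] == digest:
            continue  # unchanged since the last backup
        if not previous_digests and chunk == b"\x00" * len(chunk):
            continue  # initial backup: skip empty (all-zero) space
        changed.append((i, chunk))
    return changed
```

After each backup, the sender keeps the digest list; the next run compares against it so only the delta crosses the WAN, which is what keeps the hourly sync affordable for large volumes.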
Large Data Transfer Across Cloud Datacenters: Why So Slow?
Optimisations: Large Data Transfer Across Cloud Datacenters for DR
Problem: Transferring large data volumes is slow
– Where is the bottleneck?
  • Reading from the source volume? YES!!
  • Transferring across LAN/WAN?
  • Writing to the destination volume?
Our solution:
– Rapidly cloning data volumes from snapshots
– Parallel transfers
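Since the read side of the source volume is the bottleneck, one way to apply the parallel-transfer idea is to copy several chunk streams concurrently. This is a minimal sketch with hypothetical I/O callables, not the measured implementation:

```python
# Sketch of parallel volume cloning: copy chunks over several concurrent
# streams so a single slow reader does not bound end-to-end throughput.
# read_chunk and write_chunk are hypothetical I/O callables.
from concurrent.futures import ThreadPoolExecutor

def clone_volume(read_chunk, write_chunk, n_chunks, workers=4):
    """Copy n_chunks chunks using `workers` concurrent copy streams."""
    def copy(i):
        write_chunk(i, read_chunk(i))  # each stream handles its own chunks
        return i
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(copy, range(n_chunks)))
```

Threads suit this workload because each copy is I/O-bound (network and disk), so several transfers can overlap even under Python's GIL.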
Data Transfer Evaluations
           Volume Scan (MB/s)   End-to-end Transfer (MB/s)
1 Clone    50                   40
4 Clones   190                  140
Reversing..
Point us to your instances
Replicate to new cloud/region
Automatically sync changes every hour
If the worst happens: failover