Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
TRANSCRIPT
About us
We are nerds!
Started working in Big Data for international companies
Founded a start-up a few years ago:
With colleagues working in related technical areas
And who also knew business stuff!
We’ve been participating in different Big Data projects
Introduction
“I already have HDFS replication and High Availability in my services, why would I need Disaster Recovery (or backup)?”
Concepts
High Availability (HA)
Protects from failing components: disks, servers, network
Is generally a “systems” issue
Redundant, doubles components
Generally has strict network requirements
Fully automated, immediate
Concepts
Backup
Allows you to go back to a previous state in time: daily, monthly, etc.
It is a “data” issue
Protects from accidental deletion or modification
Also used to check for unwanted modifications
Takes some time to restore
Concepts
Disaster Recovery
Allows you to work elsewhere
It is a “business” issue
Covers you against main site failures such as electric power or network outages, fires, floods or building damage
Similar to having insurance
Medium time to be back online
The ideal Disaster Recovery
High Availability for datacenters
Exact duplicate of the main site
Seamless operation (no changes required)
Same performance
Same data
This is often very expensive and sometimes downright impossible
DR considerations
So, can we build a cheap(ish) DR? We must evaluate some tradeoffs:
What’s the cost of the service not being available? (Murphy’s Law: accidents will happen when you are busiest)
Is all information equally important? Can we lose a small amount of data?
Can we wait until we recover certain data from backup?
Can I find other uses for the DR site?
DR considerations
Synchronous vs Asynchronous
Synchronous replication requires a FAST connection
Synchronous works at transaction level and is necessary for operational systems
Asynchronous replication converges over time
Asynchronous replication is not affected by network delays, nor does it introduce them
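As a rough illustration of the trade-off above, this Python sketch models write latency under both modes. The function names and the latency and throughput figures in the usage lines are assumptions chosen for illustration, not measurements.

```python
def sync_write_latency_ms(local_ms: float, round_trip_ms: float) -> float:
    """Synchronous: the write is acknowledged only after the remote site confirms it."""
    return local_ms + round_trip_ms


def async_write_latency_ms(local_ms: float, round_trip_ms: float) -> float:
    """Asynchronous: the write is acknowledged locally; the remote copy happens later."""
    return local_ms


def async_backlog(writes_per_s: float, replicated_per_s: float, seconds: float) -> float:
    """Writes still missing at the remote site after `seconds` of sustained load."""
    return max(0.0, (writes_per_s - replicated_per_s) * seconds)


if __name__ == "__main__":
    # Hypothetical figures: 2 ms local write, 40 ms round trip between sites.
    print(sync_write_latency_ms(2.0, 40.0))
    print(async_write_latency_ms(2.0, 40.0))
    print(async_backlog(1000, 800, 10))
```

In this model the asynchronous backlog is what the main site could lose in a disaster. It shrinks back to zero once the link can replicate faster than new writes arrive, which is the sense in which asynchronous replication converges over time.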
Big Data DR
Can’t generally be copied synchronously
No VM replication
Other DR rules apply:
Since it impacts users, someone is in charge of the “starting gun”
DNS and network changes to point clients to the DR site
Main types:
Storage replication
Dual ingestion
Storage replication
Similar to non-Big Data solutions, where central storage is replicated
Generally implemented using distcp and HDFS snapshots
Data is ingested in source cluster and then copied
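A minimal sketch of one snapshot-based copy cycle, assuming snapshots are enabled on the source directory. The helper names, cluster URIs and snapshot names are hypothetical; the code only builds the command lines so they can be reviewed or handed to a scheduler.

```python
from typing import List


def snapshot_cmd(path: str, name: str) -> List[str]:
    """Command that creates a named HDFS snapshot of `path`."""
    return ["hdfs", "dfs", "-createSnapshot", path, name]


def incremental_copy_cmd(src: str, dst: str, prev_snap: str, new_snap: str) -> List[str]:
    """Command that copies only what changed between two source snapshots.

    distcp accepts -diff only together with -update. The destination must
    already hold `prev_snap` and must not have changed since it was taken.
    """
    return ["hadoop", "distcp", "-update", "-diff", prev_snap, new_snap, src, dst]


if __name__ == "__main__":
    # Hypothetical cluster addresses and snapshot names.
    src = "hdfs://main-nn:8020/data"
    dst = "hdfs://dr-nn:8020/data"
    print(" ".join(snapshot_cmd("/data", "s2")))
    print(" ".join(incremental_copy_cmd(src, dst, "s1", "s2")))
```

After a successful copy, the same snapshot name would be created on the destination so that the next cycle has a common baseline on both sites.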
Storage replication
Administrative overhead:
Copy jobs must be scheduled
Metadata changes must be tracked
Good enough for data that comes from traditional ETLs, such as daily batches
Dual Ingestion
No files, just streams
Generally ingested from multiple outside sources through Kafka
Streams must be directed to both sites
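The idea of directing each stream to both sites can be sketched as follows. This is an illustration only: `DualWriter` is a hypothetical name, and the sinks are plain callables standing in for whatever producer client each site uses.

```python
from typing import Callable, Dict, List

Sink = Callable[[bytes], None]


class DualWriter:
    """Sends every record to all sites; keeps what a site failed to accept."""

    def __init__(self, sinks: Dict[str, Sink]) -> None:
        self.sinks = sinks
        # Records that could not be delivered, per site, for a later re-sync.
        self.pending: Dict[str, List[bytes]] = {name: [] for name in sinks}

    def send(self, record: bytes) -> None:
        for name, sink in self.sinks.items():
            try:
                sink(record)
            except ConnectionError:
                self.pending[name].append(record)


if __name__ == "__main__":
    main_site: List[bytes] = []
    dr_site: List[bytes] = []
    writer = DualWriter({"main": main_site.append, "dr": dr_site.append})
    writer.send(b"event-1")
    print(main_site, dr_site, writer.pending)
```

Keeping undelivered records per site is one possible design choice. It makes the gap between sites explicit, so a later consolidation step knows what to replay.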
Dual Ingestion
Adds complexity to apps
Nifi can be set up as a front-end to both endpoints
Data consistency must be checked
Can be automatically set up via monitoring
Consolidation processes (such as a monthly re-sync) might be needed
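One simple form such a consistency check could take is comparing record counts per partition between the two sites. The sketch below assumes the counts have already been collected by some monitoring job; the function name and the date-keyed partitions are hypothetical.

```python
from typing import Dict, List


def find_mismatches(main: Dict[str, int], dr: Dict[str, int]) -> List[str]:
    """Return the partitions whose record counts differ or exist on one site only."""
    keys = set(main) | set(dr)
    return sorted(k for k in keys if main.get(k) != dr.get(k))


if __name__ == "__main__":
    main_counts = {"2017-11-01": 100, "2017-11-02": 90}
    dr_counts = {"2017-11-01": 100, "2017-11-02": 88}
    # Partitions listed here are candidates for a re-sync.
    print(find_mismatches(main_counts, dr_counts))
```

Equal counts do not prove equal content, so a stricter check might compare checksums instead; counts are just the cheapest signal to monitor continuously.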
Others
Ingestion replication
Variant of the dual ingestion
A consumer is set up in the source Kafka that in turn writes to a destination Kafka
Bottleneck if the initial streams were generated by many producers
Mixed: Previous solutions are not mutually exclusive
Storage replication for batch processes’ results
Dual ingestion for streams
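The ingestion replication variant described above (a consumer on the source side that republishes to the destination) can be modelled in a few lines. Python lists stand in for the two Kafka logs here; the function name and the offset handling are assumptions for illustration, not a real client API.

```python
from typing import List


def replicate(source_log: List[bytes], dest_log: List[bytes], offset: int) -> int:
    """Copy records from source_log[offset:] to dest_log and return the new offset."""
    new_records = source_log[offset:]
    dest_log.extend(new_records)
    return offset + len(new_records)


if __name__ == "__main__":
    source: List[bytes] = [b"a", b"b"]
    destination: List[bytes] = []
    offset = replicate(source, destination, 0)
    source.append(b"c")  # new data arrives at the main site
    offset = replicate(source, destination, offset)
    print(destination, offset)
```

Everything written by the original producers passes through this single copy loop, which is why it can become the bottleneck mentioned above.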
Commercial offerings
Solutions that ease DR setup:
Cloudera BDR
Coordinates HDFS snapshots and copy
WANdisco Fusion
Continuous storage replication
Confluent Multi-site
Allows multi-site Kafka data replication
Tips
Big Data clusters have many nodes
Costly to replicate
Performance / Capacity tradeoff
We can use cheaper servers in DR, since we don’t expect to use them often
Tips
Document and test procedures
DR is rarely fully automated, so responsibilities and actions should be clearly defined
Plan for (at least) a yearly DR run
Track changes in software and configuration
Tips
Once you have a DR solution, other uses will surface
DR site can be used for backup
Maintain HDFS snapshots
DR data can be used for testing / reporting
Warning: it may alter stored data
Conclusions
Balance HA / Backup / DR as needed; they are not exclusive:
Different costs
Different impact
Big Data DR is different:
Dedicated hardware
No VMs, no storage arrays
Plan for DATA CENTRIC solutions