redundancy does not imply fault tolerance · 2017. 3. 3. · redundancy does not imply fault...
TRANSCRIPT
Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions
Aishwarya Ganesan, Ramnatthan Alagappan,
Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
Redundancy for Fault Tolerance
Replication can mask failures
- Network failures
- System crashes
- Machine reboots
Redundancy in Distributed Storage
Depend on local file systems to store data
How about partial storage faults?
File system may return corrupted data on reads
- disk block corruption
File system may return I/O error on a read
- latent sector error
We call these file-system faults
Do distributed storage systems use redundancy to recover from local file-system faults?
Our Study
Behavior of eight distributed systems in response to file-system faults
Broad spectrum of replication and consensus protocols
Replicated state machines
- ZooKeeper (uses ZAB for consensus)
- LogCabin, CockroachDB, and RethinkDB (use Raft for consensus)
Primary-backup replication
- MongoDB
- Redis
- Kafka (in-sync replicas for leader election)
Dynamo-style quorum
- Cassandra (decentralized, no leader/follower)
Fault model
A single fault to a single file-system block in a single node
Type of faults:
- corruptions
- read errors
- write errors
Common Expectation
Fault Model: A single fault to one block in only one replica
Redundancy would enable recovery from local file-system faults
Redundancy Does Not Imply Fault Tolerance
A single fault in one node can cause catastrophic outcomes
- data loss, corruption, unavailability, and spread of corruption to other intact replicas
[Table: outcomes observed per system. Columns: Silent corruption, Unavailability, Data loss, Reduced redundancy, Query failures. Rows: Redis, ZooKeeper, Cassandra, Kafka, RethinkDB, MongoDB, LogCabin, CockroachDB]
Why does Redundancy Not Imply Fault Tolerance?
Some fundamental problems across multiple systems, not just implementation bugs!
- Faults are often undetected locally: leads to harmful global effects
- On detection, crashing is the common action: redundancy underutilized
- Crash and corruption handling are entangled: loss of committed data
- Unsafe interaction between local behavior and global distributed protocols can spread corruption or data loss
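One way to read the second point: once a node can detect a bad block locally, it has a choice between crashing and repairing the block from a peer. The sketch below shows the repair path under simplifying assumptions (CRC32 checksums stored beside each record, peers that hold intact copies, and hypothetical names such as `Replica` and `read_with_repair`). It illustrates the idea of using redundancy for recovery; it does not describe the behavior of any of the eight systems.

```python
import zlib


class CorruptionDetected(Exception):
    """Raised when a stored record fails its checksum."""


class Replica:
    """Holds key -> (payload, crc32) pairs; a toy stand-in for one node."""

    def __init__(self, name):
        self.name = name
        self.records = {}

    def put(self, key, payload):
        self.records[key] = (payload, zlib.crc32(payload))

    def get_verified(self, key):
        payload, crc = self.records[key]
        if zlib.crc32(payload) != crc:
            raise CorruptionDetected("%s: bad checksum for %r" % (self.name, key))
        return payload


def read_with_repair(local, peers, key):
    """Serve a read locally; on detected corruption, copy a good record from a peer."""
    try:
        return local.get_verified(key)
    except CorruptionDetected:
        for peer in peers:
            try:
                payload = peer.get_verified(key)
            except CorruptionDetected:
                continue             # this peer is damaged too; try the next one
            local.put(key, payload)  # repair the local copy
            return payload
        raise                        # no intact copy anywhere: surface the fault


nodes = [Replica("n%d" % i) for i in range(3)]
for node in nodes:
    node.put("k", b"committed value")

# Corrupt the payload on n0 while leaving its stored checksum untouched.
_, crc = nodes[0].records["k"]
nodes[0].records["k"] = (b"c0mmitted value", crc)

print(read_with_repair(nodes[0], nodes[1:], "k"))
print(nodes[0].get_verified("k"))
```

A node that simply crashed at the `CorruptionDetected` point would leave the two intact copies unused, which is the sense in which redundancy is underutilized.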
Fundamental Problems: Summary
[Table: fundamental problems observed per system. Columns: Locally undetected faults? Crashing, common local action? Crash & corruption handling entangled? Unsafe interaction of local & global protocols? Redundancy underutilized for recovery? Rows: Redis, ZooKeeper, Cassandra, Kafka, RethinkDB, MongoDB, LogCabin, CockroachDB]
Outline
Introduction
Fault Injection
System Behavior Analysis
Major Results
Observations Across Systems
Conclusion
Fault Model
[Diagram: a client sends requests to Server 1, Server 2, and Server 3; each server reads and writes through its local file system]
A single fault to a single file-system block in a single node
Faults are injected only into user data, not file-system metadata
Fault for current run: server 1, block B1, read corruption
Fault for next run: server 1, block B1, read error
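The fault model above implies one experiment per combination of node, block, and fault type, with exactly one fault active in each run. A minimal sketch of that enumeration follows. The names `Fault` and `enumerate_runs`, the three-server cluster, and the two-block list are assumptions for illustration, not the authors' tooling.

```python
from dataclasses import dataclass
from itertools import product

# Fault types named in the talk's fault model.
FAULT_TYPES = ("corruption", "read error", "write error")


@dataclass(frozen=True)
class Fault:
    """One injected fault: a single block on a single node (hypothetical type)."""
    server: str
    block: str
    kind: str


def enumerate_runs(servers, blocks, kinds=FAULT_TYPES):
    """Yield one Fault per run; each run injects exactly one fault."""
    for server, block, kind in product(servers, blocks, kinds):
        yield Fault(server, block, kind)


runs = list(enumerate_runs(["server 1", "server 2", "server 3"], ["B1", "B2"]))
print(len(runs))   # 3 servers * 2 blocks * 3 fault types
print(runs[0])     # first run: server 1, block B1, corruption
print(runs[1])     # next run: server 1, block B1, read error
```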
Fault Model: ext4 and btrfs
[Diagram: a client sends requests to Server 1, Server 2, and Server 3, shown first with ext4 and then with btrfs as the local file system]
Ext4: disk corruption → corrupted data to applications
Btrfs: disk corruption → I/O error to applications
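The difference between the two outcomes can be illustrated with a toy model: one store returns whatever is on disk, while another keeps a per-block checksum and turns a mismatch into an I/O error. This is only a sketch under assumed names (`PlainStore`, `ChecksummedStore`). It is not how ext4 or btrfs are implemented, and CRC32 is used purely for illustration.

```python
import errno
import zlib


class PlainStore:
    """Returns block contents as found on disk, with no verification."""

    def __init__(self):
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class ChecksummedStore(PlainStore):
    """Stores a CRC32 per block and raises EIO when the data no longer matches."""

    def __init__(self):
        super().__init__()
        self.checksums = {}

    def write(self, block_id, data):
        super().write(block_id, data)
        self.checksums[block_id] = zlib.crc32(data)

    def read(self, block_id):
        data = super().read(block_id)
        if zlib.crc32(data) != self.checksums[block_id]:
            raise OSError(errno.EIO, "checksum mismatch on block %s" % block_id)
        return data


def corrupt_on_disk(store, block_id):
    """Simulate disk corruption by flipping bits beneath the store's write path."""
    original = store.blocks[block_id]
    store.blocks[block_id] = bytes([original[0] ^ 0xFF]) + original[1:]


plain, checked = PlainStore(), ChecksummedStore()
for store in (plain, checked):
    store.write("B1", b"hello")
    corrupt_on_disk(store, "B1")

print(plain.read("B1") != b"hello")   # corrupted data reaches the application
try:
    checked.read("B1")
except OSError as exc:
    print(exc.errno == errno.EIO)     # corruption surfaces as an I/O error
```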
Fault Injection Methodology
errfs - a FUSE file system to inject file-system faults
[Diagram: on each of Server 1, Server 2, and Server 3, errfs (FUSE FS) sits between the server and its local file system; a client issues a read]
Fault for current run: server 1, block B1, read corruption
- Server 1 issues read B1-B4 to errfs
- errfs issues read B1-B4 to the file system
- The file system returns B1-B4
- errfs returns B1’-B4 to the server
Local Behavior: Crash, Retry, Ignore faulty data, No detection/recovery
Global Effect: Corruption, Data loss, Unavailability
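The read path on the slide (the server asks for B1-B4, the injector forwards the read, then hands back B1’ in place of B1) can be mimicked in a few lines. The sketch below is not errfs and does not use FUSE. `FaultyReader`, the 4-byte block size, and the backing dictionary are all assumptions made to keep the example self-contained.

```python
import errno

BLOCK_SIZE = 4  # tiny block size, chosen only to keep the example readable


class FaultyReader:
    """Forwards block reads to a backing store and injects one configured fault."""

    def __init__(self, backing, fault_block=None, fault_kind=None):
        self.backing = backing          # dict: block name -> bytes
        self.fault_block = fault_block  # e.g. "B1"
        self.fault_kind = fault_kind    # "corruption" or "read error"

    def read(self, block_names):
        result = []
        for name in block_names:
            data = self.backing[name]   # the underlying read always succeeds
            if name == self.fault_block:
                if self.fault_kind == "read error":
                    raise OSError(errno.EIO, "injected read error on " + name)
                if self.fault_kind == "corruption":
                    data = bytes(b ^ 0xFF for b in data)  # B1 becomes B1'
            result.append(data)
        return result


disk = {"B%d" % i: bytes([i]) * BLOCK_SIZE for i in range(1, 5)}
wanted = ["B1", "B2", "B3", "B4"]

clean = FaultyReader(disk).read(wanted)
corrupted = FaultyReader(disk, "B1", "corruption").read(wanted)

print(clean[0], corrupted[0])        # only B1 differs
print(clean[1:] == corrupted[1:])    # B2-B4 are returned unchanged
```

Because the fault is applied on the way back to the caller, the data in `disk` stays intact, so the next run can inject a different fault against the same state.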
Outline
Introduction
Fault Injection
System Behavior Analysis
Major Results
Observations Across Systems
Conclusion
System Behavior Analysis
Behavior of eight distributed systems in response to file-system faults
Broad spectrum of replication and consensus protocols
Replicated state machines
- ZooKeeper (uses ZAB for consensus)
- LogCabin, CockroachDB, and RethinkDB (use Raft for consensus)
Primary-backup replication
- MongoDB
- Redis
- Kafka (in-sync replicas for leader election)
Dynamo-style quorum
- Cassandra (decentralized, no leader/follower)
An Example: Redis
Redis is a popular data structure store
[Diagram: a client writes to the leader, which has two followers; each of the three nodes stores an appendonlyfile and a redis_database file]
Redis: Behavior Analysis
Read Workload
[Table: local behavior and global effect for each on-disk structure, under corruption and read I/O error faults, on the leader (L) and a follower (F)]
On-disk Structures
- appendonlyfile.metadata
- appendonlyfile.data
- redis_database.block_0
- redis_database.metadata
- redis_database.userdata
Local Behavior: Crash, No Detection/No Recovery, Retry
Global Effect: Unavailability, Reduced Redundancy, Corruption, Write Unavailability, Correct
Unavailability
- No checksums to detect corruption
- Leader crashes due to failed deserialization
- No automatic failover - cluster unavailable
Corruption
- No checksums to detect corruption
- Leader returns corrupted data on queries
- Corruption propagation to followers
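The corruption case (no checksums, the leader serves corrupted data, and the corruption reaches followers) can be pictured with a toy model. Everything here is an assumption for illustration: the `Node` class, the record strings, and the `full_resync` step in which followers overwrite their state with the leader's. It is not Redis's replication code.

```python
class Node:
    """A toy replica whose state is a list of records (no checksums anywhere)."""

    def __init__(self, name, records):
        self.name = name
        self.records = list(records)

    def load(self, on_disk_records):
        # Without checksums, whatever comes back from disk is accepted as-is.
        self.records = list(on_disk_records)

    def query(self, index):
        return self.records[index]


def full_resync(leader, followers):
    """Assumed behavior: followers overwrite their state with the leader's."""
    for follower in followers:
        follower.records = list(leader.records)


good = ["SET a 1", "SET b 2"]
leader = Node("leader", good)
followers = [Node("f1", good), Node("f2", good)]

# The leader re-reads its file and one record comes back corrupted.
leader.load(["SET a 1", "SET b \x00"])
print(repr(leader.query(1)))          # the leader serves the corrupted record

full_resync(leader, followers)
print([f.query(1) == leader.query(1) for f in followers])  # intact copies are gone
```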
Other Systems
Metadata stores: ZooKeeper, LogCabin
Wide column store: Cassandra
Document stores: MongoDB
Distributed databases: RethinkDB, CockroachDB
Message Queues: Kafka
Outline
Introduction
Fault Injection
System Behavior Analysis
Major Results
Redundancy Does not Provide Fault Tolerance
Observations Across Systems
Conclusion
Redundancy Does not Provide Fault Tolerance
20
Redundancy Does not Provide Fault Tolerance
20
Redis Read
ZooKeeper Write
Kafka Read
RethinkDB Read
Cassandra ReadKafka Write
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
ZooKeeper WriteWrite Error
Kafka Read
RethinkDB ReadCorrupt
Cassandra ReadKafka Write
CorruptReadError Corrupt
ReadError
CorruptReadError
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
Corrupt
Cassandra ReadKafka Write
checkpoint
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
Corrupt
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
CorruptionCorrupt
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
Corruption
Data Loss
Corrupt
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
Corruption
Write Unavailability
Data Loss
Corrupt
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
Redundancy Does not Provide Fault Tolerance
[Figure: outcomes of single faults (Corrupt, ReadError, Write Error) injected at leader (L) and follower (F) nodes. Panels: Redis Read, ZooKeeper Write, Kafka Read, Kafka Write, RethinkDB Read, Cassandra Read.]
Files and blocks shown:
- Redis: aof.metadata, aof.data, rdb.metadata, rdb.userdata
- ZooKeeper: txn_head, log.tail
- Kafka: log.header, log.other, replication checkpoint
- RethinkDB: db.txn_head, db.txn_body, db.txn_tail, db.metablock
- Cassandra: sstable.block0, sstable.metadata, sstable.userdata, sstable.index
Outcomes shown: Corruption, Data Loss, Unavailability, Write Unavailability, Query Failure, Reduced Redundancy
Harmful global effects despite redundancy
Not simple implementation bugs - fundamental problems across multiple systems!
Outline
Introduction
Fault Injection
System Behavior Analysis
Major Results
Observations Across Systems
- Faults are Often Undetected Locally
- Crashing: Common Local Reaction
- Crash and Corruption Handling are Entangled
- Unsafe Interaction between Local and Global Protocols
Conclusion
Faults are Often Undetected Locally
Undetected faults may lead to harmful global effects
- Locally undetected corruption → global silent corruption
Cassandra: Locally Undetected Fault
- Client, Replica 1, Other Replicas; each replica stores key | value in an sstable
- sstable.userdata corrupted on Replica 1
- sstable compression = off: no checksums to detect corruption
- Client READ (R=1) from Replica 1 → CORRUPT
- Read Repair: the corrupted key | value reaches the other replicas
- READ from the other replicas → CORRUPT
Need for end-to-end integrity and error handling
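One way to picture the end-to-end integrity this slide calls for is sketched below: a checksum is stored with every value, verified on every read, and a repair step only propagates data that passed verification. This is a minimal illustration, assuming a CRC32 over the value bytes and plain dictionaries standing in for replicas; `put`, `get`, `read_repair`, and `CorruptRecord` are hypothetical names, not Cassandra APIs.

```python
import zlib


class CorruptRecord(Exception):
    """Raised when a stored value fails its checksum (hypothetical name)."""


def put(store, key, value):
    # Store the value together with a CRC32 computed at write time.
    store[key] = (zlib.crc32(value), value)


def get(store, key):
    # Verify the checksum before the value leaves the storage layer.
    crc, value = store[key]
    if zlib.crc32(value) != crc:
        raise CorruptRecord(key)
    return value


def read_repair(source, targets, key):
    # Propagate only a value that passed verification on the source replica.
    # A corrupt source raises CorruptRecord instead of spreading bad data.
    value = get(source, key)
    for store in targets:
        put(store, key, value)
```

With a check like this in place, the corrupted copy on Replica 1 would surface as a detected error on the read path, and the intact copies on the other replicas would not be overwritten.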
Crashing - Common Local Reaction
Many systems that reliably detect faults simply crash on encountering them
[Figure: Block Corruption during Read Workloads. Columns: L = Leader, F = Follower. Outcome shown: Crash.]
- MongoDB: collections.header, collections.metadata, collections.data, index, journal.header, journal.other, storage_bson, wiredtiger_wt
- ZooKeeper: epoch, epoch_tmp, myid, log.transaction_head, log.transaction_body, log.transaction_tail, log.remaining, log.tail
Crashing leads to reduced redundancy and imminent unavailability
Persistent fault -- requires manual intervention
Redundancy underutilized!
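As a contrast to crashing, the sketch below shows what using redundancy for recovery could look like: when a node detects that its local block is corrupt, it fetches a verified copy from a peer and repairs itself. This is only an illustration under simple assumptions (in-memory replicas, CRC32 checksums); the `Replica` class and `read_or_repair` function are hypothetical and do not describe how MongoDB or ZooKeeper behave.

```python
import zlib


class Replica:
    """In-memory stand-in for one node's block store (hypothetical)."""

    def __init__(self):
        self.blocks = {}  # block_id -> (crc32, data)

    def write(self, block_id, data):
        self.blocks[block_id] = (zlib.crc32(data), data)

    def read_verified(self, block_id):
        # Return the block only if it exists and matches its checksum.
        entry = self.blocks.get(block_id)
        if entry is None:
            return None
        crc, data = entry
        return data if zlib.crc32(data) == crc else None


def read_or_repair(local, peers, block_id):
    data = local.read_verified(block_id)
    if data is not None:
        return data
    # Local copy is missing or corrupt: recover from a peer instead of crashing.
    for peer in peers:
        data = peer.read_verified(block_id)
        if data is not None:
            local.write(block_id, data)  # overwrite the bad local copy
            return data
    raise IOError(f"block {block_id!r} unrecoverable on all replicas")
```

The node stays up and keeps its place in the replica set, so the fault neither reduces redundancy nor needs manual intervention as long as one intact copy exists.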
Crash and Corruption Handling are Entangled
Kafka Message Log
- Log entries 0, 1, 2; each entry stores a checksum and data
- Crash case: Append(log, entry 2) → Checksum mismatch
  - Action: Truncate log at 1
  - Lose uncommitted data
- Disk corruption case: log holds 0 1 2 → Checksum mismatch
  - Action: Truncate log at 0
  - Lose committed data!
Developers of LogCabin and RethinkDB agree entanglement is the problem
Need for discerning corruptions due to crashes from other types of corruptions
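The distinction the slide asks for can be sketched as a classification step that runs before any truncation. The sketch assumes that a crash during an append can damage only the last entry of the log, so a checksum mismatch followed by further entries is treated as disk corruption. The `Entry` type and `classify` function are hypothetical, and this is not Kafka's recovery code. A mismatch in the last entry is still ambiguous; the sketch treats it as a torn write, which a real system would need extra information (for example a commit marker) to confirm.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Entry:
    checksum: int
    data: bytes


def make_entry(data):
    # Build an entry whose checksum matches its data.
    return Entry(zlib.crc32(data), data)


def is_valid(entry):
    return zlib.crc32(entry.data) == entry.checksum


def classify(log):
    """Return ("ok", None), ("torn_tail", i), or ("corruption", i)."""
    for i, entry in enumerate(log):
        if is_valid(entry):
            continue
        if i == len(log) - 1:
            # Only the final entry can be a partially written append.
            return ("torn_tail", i)
        # Entries exist after the bad one, so this is not a torn append.
        return ("corruption", i)
    return ("ok", None)
```

Under this split, only a `torn_tail` result would justify truncation; a `corruption` result would instead call for recovering the entry from another replica.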
Unsafe Interaction between Local & Global Protocols
Kafka: Message log at Node 1 (entries 0 1 2)
Local Behavior
- Disk corruption → Checksum mismatch
- Action: Truncate log at 0
- Lose committed data!
Interaction with the global protocol
- Set of in-sync replicas: Node1, Other Nodes (each holding 0 1 2)
- Node1 with truncated log not removed from in-sync replicas
- Node 1 elected as leader; Other Nodes become followers
- Client READ → message:0 [Silent data loss]
- Followers: Truncate up to message 0
- Assertion failure
- Client WRITE (W=2) → Failure
Unsafe interaction between local behavior and leader election protocol leads to data loss and write unavailability
Need for synergy between local behavior and global protocol
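One form the synergy between local behavior and the global protocol could take is an eligibility check during leader election: a node whose log no longer covers every committed message is not allowed to lead. The sketch below illustrates the idea under simplified assumptions (each node reports the offset of its last local message, and the committed offset is known); `Node`, `eligible_leaders`, and `elect` are hypothetical names, and this is not Kafka's election protocol.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    log_end: int  # offset of the last message held locally


def eligible_leaders(in_sync, committed):
    # Only nodes that still hold every committed message may lead.
    return [n for n in in_sync if n.log_end >= committed]


def elect(in_sync, committed):
    candidates = eligible_leaders(in_sync, committed)
    if not candidates:
        raise RuntimeError("no replica holds all committed messages")
    # Prefer the candidate with the longest log.
    return max(candidates, key=lambda n: n.log_end)
```

In the scenario on the slide, Node 1 has truncated its log to message 0 while messages up to 2 are committed, so a check of this kind would exclude it and the followers would never be asked to truncate their own logs.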
Redundancy Does not Provide Fault Tolerance
[Figure: outcomes of single faults (Corrupt, ReadError, Write Error) injected at leader (L) and follower (F) nodes. Panels: Redis Read, ZooKeeper Write, Kafka Read, Kafka Write, RethinkDB Read, Cassandra Read. Outcomes shown: Corruption, Data Loss, Unavailability, Write Unavailability, Query Failure, Reduced Redundancy.]
Why does Redundancy Not Imply Fault Tolerance?
[Figure: results for Redis Read, ZooKeeper Write, Kafka Read, Kafka Write, RethinkDB Read, and Cassandra Read under Corrupt, ReadError, and Write Error faults at leader (L) and follower (F) nodes]
- Faults are often locally undetected
- Crashing on detecting faults is the common reaction
- Crash and corruption handling are entangled
- Unsafe interaction between local and global protocols
Not simple implementation bugs - fundamental problems across multiple systems!
Redundancy underutilized as a source of recovery
Fundamental Problems: Summary
[Table: one row per system, one column per problem; the per-cell marks are not recoverable from this transcript]
Columns:
- Locally undetected faults?
- Crashing - common local action?
- Crash & corruption handling entangled?
- Unsafe interaction of local & global protocols?
- Redundancy underutilized for recovery?
Rows: Redis, ZooKeeper, Cassandra, Kafka, RethinkDB, MongoDB, LogCabin, CockroachDB
More observations, results, and discussions in the paper …
Summary
We analyzed distributed storage reactions to single file-system faults
- Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB
Redundancy does not provide fault tolerance
A single fault in one node can cause catastrophic outcomes
- data loss, corruption, unavailability, and spread of corruption to other intact replicas
Some fundamental problems across multiple systems:
- Faults are often undetected locally – leads to harmful global effects
- On detection, crashing is the common action – redundancy underutilized
- Crash and corruption handling are entangled – loss of committed data
- Unsafe interaction between local behavior and global distributed protocols can spread corruption or data loss
Conclusion
Most distributed systems are not yet resilient
- Always detect faults – important in layered stacks on commodity hardware
- Detecting faults and not using redundancy to recover is undesirable
- Cannot always assume corruption to be caused by a crash
- Local behavior has implications for distributed systems
Our study provides directions for more robust distributed storage design
Our fault injection framework is available online: http://research.cs.wisc.edu/adsl/Software/cords/
Thank you!