StarFish: highly-available block storage
Eran Gabber
Jeff Fellin
Michael Flaster
Fengrui Gu
Bruce Hillyer
Wee Teck Ng
Banu Özden
Elizabeth Shriver
2003 USENIX Annual Technical Conference
Presenter: D00922019 林敬棋
Introduction
Important data need to be protected.
◦ Making replicas.
Replication on remote sites
◦ Reduces the amount of data lost in a failure.
◦ Decreases the time required to recover from a catastrophic site failure.
StarFish
A highly-available, geographically-dispersed block storage system.
◦ Does not require expensive dedicated communication lines to all replicas to achieve high availability.
◦ Achieves good performance even during recovery from a replica failure.
◦ Single-owner access semantics.
Architecture
StarFish consists of
◦ One Host Element (HE)
  Provides storage virtualization and a read cache.
◦ N Storage Elements (SEs)
  Q: write quorum size.
  Synchronous updates to a quorum of Q SEs, and asynchronous updates to the rest.
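The quorum-write rule above can be sketched in a few lines. This is a minimal, single-threaded illustration and not the StarFish implementation: the class names `StorageElement` and `HostElement`, the `drain()` method, and the choice of "the first Q SEs in the list" as the quorum are all assumptions made for the example. In a real system the quorum would be whichever Q SEs acknowledge first.

```python
from collections import deque


class StorageElement:
    """Hypothetical in-memory stand-in for an SE: block number -> data."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def apply(self, block, data):
        self.blocks[block] = data


class HostElement:
    """Hypothetical HE that writes synchronously to Q SEs, lazily to the rest."""

    def __init__(self, storage_elements, quorum):
        if not 1 <= quorum <= len(storage_elements):
            raise ValueError("quorum must satisfy 1 <= Q <= N")
        self.storage_elements = storage_elements
        self.quorum = quorum
        self.pending = deque()  # updates not yet applied to non-quorum SEs

    def write(self, block, data):
        # Synchronous part: the write returns only after Q SEs hold the data.
        # Assumption: the first Q SEs in the list form the quorum.
        for se in self.storage_elements[: self.quorum]:
            se.apply(block, data)
        # Asynchronous part: remaining SEs are updated later, in order.
        for se in self.storage_elements[self.quorum:]:
            self.pending.append((se, block, data))
        return self.quorum  # number of synchronous acknowledgements

    def drain(self):
        """Apply all queued asynchronous updates (stands in for background I/O)."""
        while self.pending:
            se, block, data = self.pending.popleft()
            se.apply(block, data)
```

With N = 3 and Q = 2, a write is visible on two SEs when `write()` returns and reaches the third only after `drain()` runs.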
Recommended Setup
N = 3, Q = 2
MAN: Metropolitan Area Network
WAN: Wide Area Network
Another Deployment
SE Recovery
Write log
◦ The HE keeps a circular buffer of recent writes.
◦ Each SE maintains a circular buffer of recent writes on a log disk.
Three types of recovery
◦ Quick recovery
◦ Replay recovery
◦ Full recovery
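A bounded write log and the choice between recovery types might look like the sketch below. The slides name the three recovery types but do not define them, so the mapping used here is an assumption: "quick" when the HE's log still covers the missed writes, "replay" when only an SE log does, and "full" when neither does and the whole volume must be copied. `WriteLog`, `since`, and `choose_recovery` are hypothetical names.

```python
from collections import deque


class WriteLog:
    """Circular buffer of recent writes; the oldest entries are overwritten."""

    def __init__(self, capacity):
        self.entries = deque(maxlen=capacity)  # (sequence number, block, data)
        self.next_seq = 0

    def append(self, block, data):
        self.entries.append((self.next_seq, block, data))
        self.next_seq += 1

    def since(self, last_applied_seq):
        """Writes after last_applied_seq, or None if the log has wrapped past them."""
        needed = last_applied_seq + 1
        if needed == self.next_seq:
            return []  # already up to date
        if not self.entries or self.entries[0][0] > needed:
            return None  # the first missing write has been overwritten
        return [entry for entry in self.entries if entry[0] >= needed]


def choose_recovery(he_log, se_log, last_applied_seq):
    """Pick the cheapest recovery that can still supply every missed write.

    Assumed mapping (not stated in the slides): HE log -> "quick",
    SE log disk -> "replay", neither -> "full".
    """
    if he_log.since(last_applied_seq) is not None:
        return "quick"
    if se_log.since(last_applied_seq) is not None:
        return "replay"
    return "full"
```

The longer an SE has been out of date, the further back its missed writes reach, and the more likely they have already been overwritten in the smaller log.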
Availability and Reliability
Assume that the failure and recovery processes of the network links and SEs are i.i.d. Poisson processes with combined mean failure and recovery rates of λ and μ per second.
Similarly, the HE fails and recovers as Poisson processes with rates λ_he and μ_he.
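Availability
The steady-state probability that at least Q SEs are available.
Derived from the standard machine repairman model.
$$A(Q, N) = \sum_{i=0}^{N-Q} \binom{N}{i} \frac{\rho^{i}}{(1+\rho)^{N}}, \qquad 1 \le Q \le N$$
where ρ = λ/μ and i counts the SEs that are down.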
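As a numerical check, the availability can be evaluated directly. The sketch below assumes the machine-repairman form A(Q, N) = Σ_{i=0}^{N-Q} C(N, i) ρ^i / (1+ρ)^N with ρ = λ/μ, so that a single SE is up with probability 1/(1+ρ). The function name `availability` is chosen here for illustration.

```python
from math import comb


def availability(q, n, rho):
    """Probability that at least q of n SEs are up.

    Assumes each SE is independently down with probability rho / (1 + rho),
    where rho = failure rate / recovery rate.
    """
    if not 1 <= q <= n:
        raise ValueError("need 1 <= Q <= N")
    if rho < 0:
        raise ValueError("rho must be non-negative")
    # Sum over i = number of failed SEs; at most n - q may be down.
    return sum(comb(n, i) * rho**i for i in range(n - q + 1)) / (1 + rho) ** n
```

For example, an SE availability of 99% corresponds to ρ = 1/99, and `availability(2, 3, 1 / 99)` then gives the probability that at least two of three SEs are up.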
Machine Repairman Model
Availability (cont.)
X★9: the number of 9s in an availability measure.
StarFish achieves a much higher availability when N = 2Q + 1.
For fixed N, availability decreases with larger quorum size.
◦ Increasing quorum size trades off availability for reliability.
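The effect of quorum size on the number of 9s can be tabulated with a short script. It assumes the machine-repairman availability A(Q, N) = Σ_{i=0}^{N-Q} C(N, i) ρ^i / (1+ρ)^N, and the helper `nines` is a hypothetical name that counts leading 9s by taking floor(-log10(1 - A)), with a small tolerance for floating-point rounding.

```python
import math
from math import comb


def availability(q, n, rho):
    """Probability that at least q of n SEs are up (machine-repairman form)."""
    if not 1 <= q <= n:
        raise ValueError("need 1 <= Q <= N")
    return sum(comb(n, i) * rho**i for i in range(n - q + 1)) / (1 + rho) ** n


def nines(a, tol=1e-6):
    """Number of leading 9s in availability a, e.g. 0.999 -> 3."""
    if not 0 <= a <= 1:
        raise ValueError("availability must lie in [0, 1]")
    if a == 1:
        return math.inf
    return math.floor(-math.log10(1 - a) + tol)


def sweep_quorum(n, rho):
    """Return [(Q, A(Q, n), nines)] for every quorum size 1..n."""
    return [(q, availability(q, n, rho), nines(availability(q, n, rho)))
            for q in range(1, n + 1)]


if __name__ == "__main__":
    for q, a, x in sweep_quorum(3, 1 / 99):
        print(f"Q={q}  A={a:.6f}  {x}*9")
```

For a fixed N the availability falls as Q grows, because fewer SE failures can be tolerated before writes stall.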
Reliability
The probability of no data loss.
The reliability increases with larger Q.
Two approaches
◦ Make Q > floor(N/2) and require at least Q SEs to be available.
  Reduces availability and performance.
◦ Read-only consistency
Read-only Consistency
Available in read-only mode during failure.
◦ Read-only mode obviates the need for Q SEs to be available to handle updates.
◦ Increases availability
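$$A_{ReadOnly}(Q, N) = \sum_{i=0}^{N-1} \binom{N}{i} \frac{\rho^{i}}{(1+\rho)^{N}(1+\rho_{he})} + \sum_{i=0}^{Q-1} \binom{Q}{i} \frac{\rho^{i}\,\rho_{he}}{(1+\rho)^{Q}(1+\rho_{he})}$$
$$A_{ReadOnly}(Q, N) = \frac{A(1, N)}{1+\rho_{he}} + \frac{\rho_{he}\,A(1, Q)}{1+\rho_{he}}$$
where ρ = λ/μ and ρ_he = λ_he/μ_he.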
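The read-only availability can be evaluated the same way. The sketch assumes A_ReadOnly(Q, N) = A(1, N)/(1+ρ_he) + ρ_he·A(1, Q)/(1+ρ_he), with A(Q, N) in the machine-repairman form: the HE is up with probability 1/(1+ρ_he) and then any one of the N SEs suffices, otherwise one of the Q SEs in the write quorum must be up. Function names are illustrative.

```python
from math import comb


def availability(q, n, rho):
    """Probability that at least q of n SEs are up (machine-repairman form)."""
    if not 1 <= q <= n:
        raise ValueError("need 1 <= Q <= N")
    return sum(comb(n, i) * rho**i for i in range(n - q + 1)) / (1 + rho) ** n


def availability_read_only(q, n, rho, rho_he):
    """Availability with read-only consistency, under the assumed formula."""
    if rho_he < 0:
        raise ValueError("rho_he must be non-negative")
    p_he_up = 1 / (1 + rho_he)
    p_he_down = rho_he / (1 + rho_he)
    # HE up: one live SE out of N is enough.
    # HE down: one live SE out of the Q quorum members is needed.
    return p_he_up * availability(1, n, rho) + p_he_down * availability(1, q, rho)
```

Under this formula, with ρ_he = 0 and ρ = 1/99 (99% SE availability), N = 3 gives 1 - 0.01³ = 0.999999 for every Q, since the second term vanishes.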
Availability with Read-only Consistency
Observations
If ρ_he = 0, availability is independent of Q.
◦ Can always recover from the HE.
As ρ_he increases, availability increases with Q.
The largest increase occurs from Q = 1 to Q = 2, and is bounded by 3/16 when ρ = 1.
◦ Diminishing gain after Q = 2.
◦ Suggests Q = 2 in a practical system.
Implementation
Performance Measurements
Compares StarFish with a direct-attached RAID unit.
Settings
Different network delays
◦ 1, 2, 4, 8, 23, 36, 65 ms
Different bandwidth limitations
◦ 31, 51, 62, 93, 124 Mb/s
Benchmarks
◦ Micro-benchmark
  Read hit
  Read miss
  Write
◦ PostMark
Effects of network delays and HE cache size
Near SE delay: 4 ms; far SE delay: 8 ms
No cache miss if HE cache size = 400 MB
Observation
A large HE cache improves performance.
◦ The HE can respond to more read requests without communicating with an SE.
  Does not change write requests.
◦ Especially beneficial when the local SE has significant delays.
With Q = 2 and a 400 MB cache, performance is not influenced by the delay to the local SE.
◦ Depends on the near SE.
Normal Operation and placement of the far SE
1-8: 1, 2, 4, 8 ms; 4-12: 4, 8, 12 ms; 23-65: 23, 36, 65 ms; 31-124: 31, 51, 62, 93, 124 Mb/s
Local SE delay: 0 ms
N = 3
Normal Operation and placement of the far SE (cont.)
N = 3, 8 threads
Observation
Performance is influenced mostly by two parameters
◦ Write quorum size
◦ Delay to the SE
StarFish can provide adequate performance when one of the SEs is placed in a remote location.
◦ At least 85% of the performance of a direct-attached RAID.
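Why these two parameters dominate can be seen from a simple model: a synchronous write completes once Q SEs have acknowledged it, so its network component is the Q-th smallest SE delay. The sketch below is a back-of-the-envelope model only. It ignores disk service time, bandwidth limits, and queuing, and the function name `write_latency_ms` is hypothetical.

```python
def write_latency_ms(se_delays_ms, quorum):
    """Network delay until `quorum` SEs have acknowledged a write.

    Assumes every SE is sent the write at the same instant and that each
    delay is the full time until that SE's acknowledgement arrives.
    """
    if not 1 <= quorum <= len(se_delays_ms):
        raise ValueError("need 1 <= Q <= N")
    # The write returns when the Q-th fastest acknowledgement arrives.
    return sorted(se_delays_ms)[quorum - 1]
```

With a local SE at 0 ms, a near SE at 4 ms, and a far SE at 8 ms, `write_latency_ms([0, 4, 8], 2)` is set by the near SE, so in this model the far SE can be moved further away without slowing writes as long as Q = 2.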
Recovery
Performance degrades more during full recovery.
Conclusion
The StarFish system reveals significant benefits from a third copy of the data at an intermediate distance.
A StarFish system with 3 replicas, a write quorum size of 2, and read-only consistency yields better than 99.9999% availability assuming individual Storage Element availability of 99%.