Geographically distributed Swift clusters
Alistair Coles, Swift core developer
[email protected]: acoles
Overview
• What is Swift?
• Geographically distributed clusters
  • What?
  • Why?
  • How?
• Erasure coded geographically distributed cluster
  • Swift now supports these!
  • …enabled by fragment duplication and composite rings
• Summary
What is Swift?
• object storage service
• REST API – Create, Read, Update, Delete
• Simple naming hierarchy
  • objects belong to containers
  • containers belong to accounts
Swift is durable
• Multiple replicas of every object (or erasure coding)
• The Ring always tries to disperse replicas across different devices and servers
[Diagram: the proxy server consults the Ring for PUT a/c/o under a 3-replica policy; replicas 0, 1 and 2 are placed on different servers.]
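The placement idea above can be pictured with a minimal sketch. It assumes a toy ring that hashes the object path to a partition and then walks a server list; the function names, the server names and the walk-the-list placement are illustrative assumptions, not Swift's ring builder, which also salts the hash and uses a precomputed assignment table.

```python
import hashlib


def partition_for(path, part_power):
    """Map an object path to a partition using the top bits of its MD5 hash."""
    digest = hashlib.md5(path.encode("utf-8")).digest()
    top32 = int.from_bytes(digest[:4], "big")
    return top32 >> (32 - part_power)


def place_replicas(partition, servers, replica_count):
    """Pick replica_count distinct servers, walking the list from the partition's slot."""
    if replica_count > len(servers):
        raise ValueError("need at least as many servers as replicas")
    start = partition % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(replica_count)]


# Hypothetical five-server cluster with a 3-replica policy.
servers = ["server-a", "server-b", "server-c", "server-d", "server-e"]
partition = partition_for("/a/c/o", part_power=8)
print(partition, place_replicas(partition, servers, replica_count=3))
```

Because the mapping depends only on the path and the ring data, any proxy holding the same ring reaches the same servers without a central lookup service.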
Swift is scalable
• The Ring always tries to balance load across all devices
• No centralized services
[Diagram: two proxy servers, each using the Ring, handle PUT a/c/o and PUT a2/c2/o2 under a 3-replica policy; the replicas of the two objects land on different sets of servers.]
Swift is highly available
• Write succeeds on quorum of replicas
• Missing replicas are updated asynchronously
[Diagram: PUT a/c/o @ t1 ‘my old data’ under a 3-replica policy; two replicas are written at t1 and the remaining replica is brought up to date by an async update.]
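A rough sketch of the quorum rule described above follows. A simple majority is assumed purely for illustration; Swift's own quorum calculation is policy-specific and may differ, particularly for even replica counts. The function names are hypothetical.

```python
def majority_quorum(replica_count):
    """Smallest number of successful writes that forms a simple majority."""
    return replica_count // 2 + 1


def handle_put(write_ok):
    """Return (client_sees_success, replica indexes left for the async update).

    write_ok holds one boolean per replica: True if that backend write succeeded.
    """
    successes = sum(1 for ok in write_ok if ok)
    missing = [index for index, ok in enumerate(write_ok) if not ok]
    return successes >= majority_quorum(len(write_ok)), missing


# Two of three replicas written: the PUT succeeds, replica 2 is repaired later.
print(handle_put([True, True, False]))
```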
Swift is eventually consistent
• Possibility to read stale data
• Async process makes data eventually consistent
[Diagram: PUT a/c/o @ t2 ‘my new data’ under a 3-replica policy updates two replicas to t2 while the third still holds t1; a GET a/c/o @ t2 + ∂ through another proxy server reads the t1 replica and returns ‘my old data’ until the async update completes.]
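The stale read above can be reproduced with a small model. It assumes each replica carries a timestamp and that the async process copies the newest version over the others; the data layout and function names are assumptions for illustration only.

```python
def newest(replicas):
    """Return the replica with the highest timestamp."""
    return max(replicas, key=lambda replica: replica["timestamp"])


def async_update(replicas):
    """Model the background process: every replica converges on the newest version."""
    latest = newest(replicas)
    return [dict(latest) for _ in replicas]


# State just after the PUT at t2: one replica was missed and still holds t1.
replicas = [
    {"timestamp": 2, "data": "my new data"},
    {"timestamp": 2, "data": "my new data"},
    {"timestamp": 1, "data": "my old data"},
]
print(replicas[2]["data"])                # a read that hits replica 2 is stale
print(async_update(replicas)[2]["data"])  # after the async update it is current
```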
Overview
• What is Swift?
• Geographically distributed clusters
  • What?
  • Why?
  • How?
• Erasure coded geographically distributed cluster
  • Swift now supports these!
  • … but there’s some new stuff to know about
    • Deep dive: erasure coding, fragment duplication, composite rings
• Summary
Geographically distributed clusters
• What?
  • Multiple physical locations
  • Typically connected by high latency/low bandwidth WAN
  • Copies of data in each location
  • Single namespace
• Why?
  • Disaster recovery
  • Data locality
• Also known as “Global clusters”, “Multi-region Swift”
Geographically distributed Swift clusters
• The Ring always tries to disperse replicas across different devices and servers …and regions
[Diagram: region1 and region2 connected by a WAN; a proxy server consults the Ring for PUT a/c/o under a 4-replica policy, and replicas 0 to 3 are spread across servers in both regions.]
Geographically distributed clusters - disaster recovery
• A 4-replica policy makes each region independently robust to a single device failure
[Diagram: region1 and region2 connected by a WAN under a 4-replica policy; a proxy server serves GET a/c/o from replicas 0 to 3 spread across both regions.]
Geographically distributed Swift clusters - data locality
• By default the ring will try to balance read load by choosing random replicas
[Diagram: region1 and region2 connected by a WAN under a 4-replica policy; GET a/c/o requests at t1, t2 and t3 are each served by a randomly chosen replica in either region.]
Read affinity – trade off load balancing for read performance
[Diagram: region1 and region2 connected by a WAN, each with its own proxy server and Ring; one proxy is set to read_affinity -> region1 and the other to read_affinity -> region2, so GET a/c/o requests at t1 to t4 always read a replica in the local region.]
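Read affinity as described above amounts to ordering candidate replicas so that local ones are tried first. The sketch below assumes nodes are plain dictionaries with a "region" key and shuffles before sorting so that load is still spread within a region; it is not the proxy server's actual sorting code, and the names are hypothetical.

```python
import random


def order_for_read(nodes, local_region, rng=random):
    """Return nodes with local-region replicas first, randomised within each group."""
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    # sorted() is stable, so the shuffle survives inside each region group.
    return sorted(shuffled, key=lambda node: node["region"] != local_region)


# Hypothetical 4-replica object with two replicas per region.
nodes = [
    {"region": 1, "server": "r1-s0"},
    {"region": 1, "server": "r1-s1"},
    {"region": 2, "server": "r2-s0"},
    {"region": 2, "server": "r2-s1"},
]
print([node["server"] for node in order_for_read(nodes, local_region=2)])
```

Remote replicas stay at the end of the list, so a read can still fall back across the WAN when no local replica responds.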
What if remote writes fail?
• Remote replicas are written to temporary locations.
• Async process moves them when remote region is available.
[Diagram: PUT a/c/o @ t1 arrives at a proxy server in region1 while region2 cannot be reached over the WAN; all four replicas are written at t1 in region1, those destined for region2 as temporarily misplaced replicas, and an async process later moves them to the remote region.]
Overwrite - replicas are eventually consistent
[Diagram: PUT a/c/o @ t1 ‘my old data’ is written to both regions; the WAN fails at t2; PUT a/c/o @ t3 ‘my new data’ is then written entirely within the local region, with the remote replicas held locally until an async move to the remote region. A GET a/c/o @ t4 through the other region's proxy server returns ‘my old data’ from temporarily stale t1 replicas.]
What if remote writes are slow?
“But isn’t this a terrible idea? All my writes will be slowed down by requests to the remote region!”
[Diagram: region1 and region2 connected by a WAN; PUT a/c/o under a 4-replica policy writes replicas 0 to 3 across both regions.]
Effect of remote region write time on PUTs (remote region servers artificially slowed)
[Chart: 10K object PUT time (s), 0 to 0.7, against elapsed time (secs), 0 to 100, with remote writes slowed by 100ms, 200ms, 400ms, 600ms and 800ms; post_quorum_timeout = 0.5.]
post_quorum_timeout puts an upper bound on the extra latency
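The bound shown in the chart can be pictured with a small latency model. It assumes the proxy answers the client either when every backend write has finished, or when a quorum has responded and post_quorum_timeout seconds have then elapsed, whichever comes first. The function name, the quorum value and the timings are illustrative assumptions.

```python
def put_latency(response_times, quorum, post_quorum_timeout):
    """Estimate client-visible PUT latency from per-replica write times (seconds)."""
    ordered = sorted(response_times)
    quorum_time = ordered[quorum - 1]  # moment the quorum-th response arrives
    slowest = ordered[-1]              # moment the last response arrives
    return min(slowest, quorum_time + post_quorum_timeout)


# Two fast local writes (50 ms) and two slowed remote writes, quorum of 2.
for remote in (0.1, 0.2, 0.4, 0.6, 0.8):
    times = [0.05, 0.05, remote, remote]
    print(remote, put_latency(times, quorum=2, post_quorum_timeout=0.5))
```

In this model the slow remote writes stop adding latency once they exceed the quorum time plus the timeout, which is the plateau the chart illustrates.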
Write affinity – temporarily trade off dispersion for write performance
• Remote replicas are initially written to the local region.
• Async process moves them to remote region.
[Diagram: a proxy server in region1 with write_affinity = region1 handles PUT a/c/o; all four replicas are first written to region1 servers, and an async process later moves the remote replicas across the WAN to region2.]
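The write-affinity behaviour above can be sketched as a node-selection step. The model assumes each replica has a primary node chosen by the ring plus a pool of spare ("handoff") nodes, and that remote primaries are swapped for local spares at write time; the data layout and function name are hypothetical, not the proxy server's implementation.

```python
def choose_write_nodes(primaries, handoffs, local_region):
    """Return one write target per replica, preferring nodes in the local region."""
    local_spares = [node for node in handoffs if node["region"] == local_region]
    targets = []
    for node in primaries:
        if node["region"] == local_region or not local_spares:
            targets.append(node)                 # write to the real primary
        else:
            targets.append(local_spares.pop(0))  # temporarily misplaced replica
    return targets


primaries = [
    {"region": 1, "server": "r1-s0"},
    {"region": 2, "server": "r2-s0"},
    {"region": 1, "server": "r1-s1"},
    {"region": 2, "server": "r2-s1"},
]
handoffs = [
    {"region": 1, "server": "r1-s2"},
    {"region": 1, "server": "r1-s3"},
    {"region": 2, "server": "r2-s2"},
]
print([node["server"] for node in choose_write_nodes(primaries, handoffs, 1)])
```

The replicas placed on spare nodes are the ones the async process still has to copy across the WAN, which is the cost the later workload considerations refer to.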
Write affinity performance improvement (remote region servers artificially slowed)
[Chart: 10K object PUT time (s), 0 to 0.7, against elapsed time (secs), 0 to 100, with remote writes slowed by 100ms, 200ms, 400ms, 600ms and 800ms; annotated “Write affinity enabled”.]
Write affinity – data is always available
[Diagram: a proxy server in region1 with write_affinity = region1 handles PUT a/c/o @ t1 ‘my data’, writing all replicas locally; a GET a/c/o @ t1 + ∂ through the region2 proxy server still returns ‘my data’ because reads fall back to the remote region across the WAN.]
Write affinity – trade off consistency for write performance
[Diagram: with write_affinity = region1, PUT a/c/o @ t1 ‘my data’ has already reached both regions; PUT a/c/o @ t2 ‘my new data’ is written only to region1 servers, pending an async move to the remote region. A GET a/c/o @ t2 + ∂ through the region2 proxy server returns ‘my data’ from temporarily stale t1 replicas.]
Workload considerations
• Use write affinity with care!
  • “No free lunch” - replicas have to be copied across WAN eventually
• Suitable workloads:
  • Moderate or bursty write rates
  • Non-immediate remote reads
  • E.g. replicating archive data across sites
• Unsuitable workloads:
  • Continuous high write rate
    • misplaced replicas will back up in local region
  • Immediate remote reads
    • reads will fetch data over WAN before async move has happened
Erasure coding – same durability, less storage
Example: 4 data fragments + 2 parity fragments
Any subset of 4 unique fragments is sufficient to reconstruct data
Requires only 1.5 x size of data to store all fragments
[Diagram: an Erasure Code (EC) encodes the data into fragments 0 to 5; decoding any four of them, e.g. fragments 0, 1, 3 and 5, reconstructs the data.]
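The "any k fragments are enough" property can be demonstrated with the simplest possible erasure code: k data fragments plus one XOR parity fragment. This is a toy 4+1 scheme chosen so the example stays self-contained; it is not the 4+2 scheme above, which needs a Reed-Solomon style code of the kind provided by an erasure coding library.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, k):
    """Split data into k equal fragments and append one XOR parity fragment."""
    frag_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(frag_len * k, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = frags[0]
    for frag in frags[1:]:
        parity = xor_bytes(parity, frag)
    return frags + [parity]


def decode(frags, k, length):
    """Rebuild data from a dict {fragment index: bytes} holding any k of the k+1 fragments."""
    missing = [i for i in range(k) if i not in frags]
    if len(missing) > 1 or (missing and k not in frags):
        raise ValueError("not enough unique fragments to reconstruct")
    frags = dict(frags)
    if missing:
        rebuilt = frags[k]  # start from parity, XOR in the surviving data fragments
        for i in range(k):
            if i != missing[0]:
                rebuilt = xor_bytes(rebuilt, frags[i])
        frags[missing[0]] = rebuilt
    return b"".join(frags[i] for i in range(k))[:length]


data = b"my data stored in Swift"
fragments = encode(data, k=4)
survivors = {i: frag for i, frag in enumerate(fragments) if i != 2}  # lose fragment 2
print(decode(survivors, k=4, length=len(data)))
```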
Erasure coding in Swift
• https://github.com/openstack/pyeclib - Python interface to liberasurecode
• https://github.com/openstack/liberasurecode - C library with pluggable Erasure Code backends
• 4 + 2 erasure coding requires approx. 50% of the storage of 3 replicas, with similar durability
[Diagram: under an EC 4+2 policy the proxy server erasure-codes the data into fragments 0 to 5 and uses the Ring to place each fragment on a different server.]
Erasure coding across regions
• The Ring always tries to disperse fragments across different devices and servers …and regions
[Diagram: region1 and region2 connected by a WAN; under an EC 4+2 policy the proxy server erasure-codes the data into fragments 0 to 5, which the Ring spreads across servers in both regions.]
Erasure coding across regions
With EC 4+2 we don’t have enough fragments in one region to reconstruct the data!
[Diagram: the same EC 4+2 placement across region1 and region2 over the WAN; each region holds only some of fragments 0 to 5.]
Erasure coding across regions requires more fragments
How about EC 4+6? It requires 2.5 x size of data, vs replication, which requires 4 x size of data for similar durability.
[Diagram: under an EC 4+6 policy the proxy server erasure-codes the data into fragments 0 to 9, which the Ring spreads across servers in region1 and region2 over the WAN.]
Erasure coding time increases with number of parity fragments
[Chart: relative compute time, 0 to 0.06, against number of parity fragments, 2 to 7; 4 data fragments, isa_l_rs_cauchy backend, 40MB object.]
EC duplication: more fragments, less compute
Each region has EC 4+1 fragments: requires 2.5 x size of data, vs replication, which requires 4 x size of data for similar durability.
[Diagram: under an EC 4+1 policy plus duplication the proxy server erasure-codes the data into fragments 0 to 4 and writes one copy of each fragment to servers in region1 and a duplicate copy to servers in region2 over the WAN.]
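The storage figures quoted for the different schemes follow from simple arithmetic: stored bytes are (data fragments + parity fragments) × duplication factor, divided by the number of data fragments. A short check, with hypothetical function names:

```python
from fractions import Fraction


def ec_overhead(data_fragments, parity_fragments, duplication=1):
    """Stored size as a multiple of the original object size for an EC scheme."""
    total = (data_fragments + parity_fragments) * duplication
    return Fraction(total, data_fragments)


def replication_overhead(replica_count):
    """Stored size as a multiple of the original object size for full replicas."""
    return Fraction(replica_count)


print("EC 4+2:", float(ec_overhead(4, 2)))                       # 1.5
print("EC 4+6:", float(ec_overhead(4, 6)))                       # 2.5
print("EC 4+1 duplicated x2:", float(ec_overhead(4, 1, 2)))      # 2.5
print("4 replicas:", float(replication_overhead(4)))             # 4.0
```

EC 4+6 and duplicated EC 4+1 store the same number of bytes; the difference is that the duplicated scheme only has to compute one parity fragment.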
But duplicates must be correctly dispersed…
We don’t have enough unique fragments in this region to reconstruct the data!
[Diagram: EC 4+1 policy plus duplication, with fragments 0 to 4 and their duplicates spread over region1 and region2 without regard to which copies share a region; one region ends up holding both copies of some fragments and fewer than four unique fragments.]
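The dispersion problem above reduces to counting unique fragment indexes per region. The sketch below uses made-up placements (not the ones in the slide diagram) and a hypothetical function name to show a layout that fails and one that works for a scheme with 4 data fragments.

```python
def regions_able_to_reconstruct(placement, data_fragments):
    """Map each region to True if it holds at least data_fragments unique indexes."""
    return {
        region: len(set(indexes)) >= data_fragments
        for region, indexes in placement.items()
    }


# Ten fragments (indexes 0-4, each stored twice) spread over two regions.
badly_dispersed = {"region1": [0, 0, 1, 1, 3], "region2": [2, 2, 3, 4, 4]}
well_dispersed = {"region1": [0, 1, 2, 3, 4], "region2": [0, 1, 2, 3, 4]}

print(regions_able_to_reconstruct(badly_dispersed, data_fragments=4))
print(regions_able_to_reconstruct(well_dispersed, data_fragments=4))
```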
Solution: EC duplication + composite rings
Each region has its own ‘component’ ring - this guarantees a set of unique fragments in each region.
[Diagram: PUT a/c/o under an EC 4+1 policy plus duplication; the proxy server erasure-codes the data into fragments 0 to 4, and region1 and region2 each store one complete set of fragments 0 to 4 on their own servers, connected by the WAN.]
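One way to picture why component rings give each region a full fragment set is the sketch below. It assumes the composite ring is the concatenation of per-region replica lists and that a node's fragment index is its position modulo the number of unique fragments; both are simplifying assumptions for illustration, and the names are hypothetical.

```python
def compose(component_rings):
    """Concatenate per-region replica lists into one composite replica list."""
    return [device for ring in component_rings for device in ring]


def fragments_by_region(composite, unique_fragments):
    """Group the fragment indexes each region would store."""
    placement = {}
    for node_index, device in enumerate(composite):
        fragment_index = node_index % unique_fragments  # duplicates reuse indexes
        placement.setdefault(device["region"], set()).add(fragment_index)
    return placement


# EC 4+1 with duplication: 5 unique fragments, 5 devices per component ring.
region1 = [{"region": 1, "server": f"r1-s{i}"} for i in range(5)]
region2 = [{"region": 2, "server": f"r2-s{i}"} for i in range(5)]
composite = compose([region1, region2])
print(fragments_by_region(composite, unique_fragments=5))
```

Because each component ring only contains devices from its own region and contributes exactly one position per unique fragment, every region ends up with indexes 0 to 4.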
Summary
• Swift enables you to build geographically distributed clusters
  • Good for disaster recovery and data locality
  • Tuning via options in the Swift proxy server
    • write_affinity, read_affinity
  • Understand your workloads!
• Erasure coded storage policies can also be geographically distributed (new in Swift 2.15.0)
  • Uses new features: EC fragment duplication and composite rings
Swift welcomes new users and contributors
• You can find us on freenode #openstack-swift
• Project: https://launchpad.net/swift
• Docs: https://docs.openstack.org/swift
• Code: https://github.com/openstack/swift