Untangling Cluster Management with Helix
DESCRIPTION
This talk was given by Kishore Gopalakrishna (Staff Software Engineer @ LinkedIn) at the 3rd ACM Symposium on Cloud Computing (SOCC 2012).
TRANSCRIPT
![Page 1: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/1.jpg)
Untangling Cluster Management with Helix
Helix team @ LinkedIn Kishore Gopalakrishna http://www.linkedin.com/in/kgopalak @kishoreg1980
![Page 2: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/2.jpg)
Outline
• What is Helix
• Use case 1: distributed data store
• Architecture
• Use case 2: consumer group
• Helix at LinkedIn
• Q&A
![Page 3: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/3.jpg)
What is Helix
A cluster management framework for distributed systems that uses a declarative state model
![Page 4: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/4.jpg)
Distributed system examples
![Page 5: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/5.jpg)
Motivation
A system starts out simple…
…but gets complex in the real world
…as you address real requirements
[Figure: Application, client library, System, Call Routing, Replica 1, Replica 2, …; Scale, Failover, Bootstrapping, …]
![Page 6: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/6.jpg)
Motivation
These are cluster management problems
Helix solves them once…
…so you can focus on your system
[Figure labels: Scale, Failover, Bootstrapping]
![Page 7: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/7.jpg)
Outline
• What is Helix
• Use case 1: distributed data store
• Architecture
• Use case 2: consumer group
• Helix at LinkedIn
• Q&A
![Page 8: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/8.jpg)
Use-Case: Distributed Data Store
Distributed
[Figure: Node 1, Node 2, and Node 3; partition P.1]
![Page 9: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/9.jpg)
Use-Case: Distributed Data Store
Distributed, Partitioned
[Figure: partitions P.1 to P.12 spread across Node 1, Node 2, and Node 3]
![Page 10: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/10.jpg)
Use-Case: Distributed Data Store
Distributed, Partitioned, Replicated
[Figure: partitions P.1 to P.12 spread across Node 1, Node 2, and Node 3, with replicas of partitions placed on other nodes]
![Page 11: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/11.jpg)
Partition Layout
• Highly available
• Master accepts writes
• Balanced distribution
[Figure: master and slave replicas of partitions P.1 to P.12 on Node 1, Node 2, and Node 3; legend: Master, Slave]
![Page 12: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/12.jpg)
Failover
[Figure: master and slave replicas of partitions P.1 to P.12 on Node 1, Node 2, and Node 3; P.1, P.2, P.3, and P.4 are called out separately; legend: Master, Slave]
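The failover behaviour pictured on this slide can be illustrated with a minimal Python sketch. It is not Helix code: the `layout` dictionary shape, the `fail_over` helper, and the tie-breaking rule (promote the slave on the node that currently masters the fewest partitions) are all assumptions made for illustration.

```python
from collections import Counter

def fail_over(layout, failed_node):
    """Drop every replica hosted on failed_node and promote a surviving
    slave for each partition that lost its master (hypothetical helper)."""
    # Masters currently hosted by each surviving node.
    masters = Counter(
        entry["master"] for entry in layout.values()
        if entry["master"] != failed_node
    )
    new_layout = {}
    for partition, entry in sorted(layout.items()):
        slaves = [node for node in entry["slaves"] if node != failed_node]
        master = entry["master"]
        if master == failed_node:
            if not slaves:
                raise RuntimeError(f"{partition} has no surviving replica")
            # Assumed policy: pick the least-loaded node, name as tie-breaker.
            master = min(slaves, key=lambda node: (masters[node], node))
            slaves.remove(master)
            masters[master] += 1
        new_layout[partition] = {"master": master, "slaves": slaves}
    return new_layout

layout = {
    "P.1": {"master": "Node 1", "slaves": ["Node 2", "Node 3"]},
    "P.2": {"master": "Node 1", "slaves": ["Node 2", "Node 3"]},
    "P.3": {"master": "Node 2", "slaves": ["Node 3", "Node 1"]},
    "P.4": {"master": "Node 3", "slaves": ["Node 1", "Node 2"]},
}
after = fail_over(layout, "Node 1")
```

In this toy layout the two partitions mastered on the failed node end up mastered on different survivors, so mastership stays balanced after the failure.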
![Page 13: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/13.jpg)
Add Capacity
[Figure: master and slave replicas of partitions P.1 to P.12 redistributed across Node 1, Node 2, Node 3, and a new Node 4; P.1, P.5, and P.9 are called out separately; legend: Master, Slave]
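Adding capacity while moving as few partitions as possible can be sketched as follows. This is an illustrative Python sketch, not the rebalancing algorithm Helix uses: `add_node` is a hypothetical helper that only tracks master placement and moves just enough partitions to give the new node its fair share.

```python
from collections import Counter

def add_node(masters, new_node):
    """masters maps partition -> node. Move only as many partitions to
    new_node as its fair share requires (hypothetical helper)."""
    assignment = dict(masters)
    nodes = sorted(set(assignment.values()) | {new_node})
    fair_share = len(assignment) // len(nodes)
    moved = []
    while len(moved) < fair_share:
        load = Counter(assignment.values())
        # Take the next partition from the most loaded existing node.
        donor = max(
            (node for node in nodes if node != new_node),
            key=lambda node: (load[node], node),
        )
        partition = max(p for p, node in assignment.items() if node == donor)
        assignment[partition] = new_node
        moved.append(partition)
    return assignment, moved

old_nodes = ["Node 1", "Node 2", "Node 3"]
masters = {f"P.{i}": old_nodes[(i - 1) % 3] for i in range(1, 13)}
assignment, moved = add_node(masters, "Node 4")
load_after = Counter(assignment.values())
```

With 12 partitions on three nodes, only three partitions move, and every node ends up with three masters.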
![Page 14: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/14.jpg)
Use-case requirements
• Partition constraints
  • 1 master per partition
  • Balance partitions across cluster
  • No single-point-of-failure: replicas on different nodes
• Handle failures: transfer mastership
• Elasticity
  • Distribute workload across added nodes
  • Minimize partition movement
• Meet SLAs
  • Throttle concurrent data movement
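The partition constraints above can be made concrete with a small placement sketch. It is a simple round-robin scheme written for illustration, not Helix's placement logic; the `place_partitions` helper and the choice of two slaves per partition are assumptions.

```python
from collections import Counter

def place_partitions(partitions, nodes, num_slaves=2):
    """Round-robin placement: one master plus num_slaves slaves per
    partition, each replica of a partition on a different node."""
    if num_slaves + 1 > len(nodes):
        raise ValueError("need at least num_slaves + 1 nodes")
    layout = {}
    for i, partition in enumerate(partitions):
        # Rotate the starting node so masters spread evenly over the nodes.
        replicas = [nodes[(i + k) % len(nodes)] for k in range(num_slaves + 1)]
        layout[partition] = {"master": replicas[0], "slaves": replicas[1:]}
    return layout

partitions = [f"P.{i}" for i in range(1, 13)]
nodes = ["Node 1", "Node 2", "Node 3"]
layout = place_partitions(partitions, nodes)
masters_per_node = Counter(entry["master"] for entry in layout.values())
```

Each partition gets exactly one master, no node holds two replicas of the same partition, and the 12 masters are split evenly across the three nodes.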
![Page 15: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/15.jpg)
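Declarative Problem Statement
State machine
– States: offline (O), slave (S), master (M)
– Transitions: O-S, S-O, S-M, M-S (the diagram labels the four transitions t1 to t4)
Constraints
– States: COUNT=1 for master, COUNT=2 for slave
– Transitions: $t_1 \le 5$
Objective
– Partition placement: $\text{minimize}\left(\max_{n_j \in N} M(n_j)\right)$ and $\text{minimize}\left(\max_{n_j \in N} S(n_j)\right)$, where $M(n_j)$ and $S(n_j)$ count the master and slave replicas on node $n_j$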
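A minimal Python sketch of the state machine and its state constraints, for illustration only. It does not use the Helix API. The mapping of COUNT=1 to master and COUNT=2 to slave, and the reading of COUNT as an upper bound per partition, are assumptions; `transition_path` and `satisfies_state_constraints` are hypothetical helpers.

```python
from collections import deque

# Legal transitions of the offline/slave/master model.
TRANSITIONS = {
    ("OFFLINE", "SLAVE"),   # O-S
    ("SLAVE", "OFFLINE"),   # S-O
    ("SLAVE", "MASTER"),    # S-M
    ("MASTER", "SLAVE"),    # M-S
}

# Assumed reading of the slide: at most 1 master and 2 slaves per partition.
MAX_PER_PARTITION = {"MASTER": 1, "SLAVE": 2}

def transition_path(current, target):
    """Breadth-first search for the shortest legal transition sequence."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return list(zip(path, path[1:]))
        for src, dst in sorted(TRANSITIONS):
            if src == path[-1] and dst not in seen:
                seen.add(dst)
                queue.append(path + [dst])
    raise ValueError(f"no legal path from {current} to {target}")

def satisfies_state_constraints(replica_states):
    """Check the per-partition state counts against the assumed limits."""
    return all(
        replica_states.count(state) <= limit
        for state, limit in MAX_PER_PARTITION.items()
    )

path = transition_path("OFFLINE", "MASTER")
```

Because there is no direct offline-to-master transition, a replica that should become master has to pass through the slave state first.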
![Page 16: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/16.jpg)
Generalizing cluster management
[Figure labels: STATE MACHINE, CONSTRAINTS, OBJECTIVE]
![Page 17: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/17.jpg)
Outline
• What is Helix
• Use case 1: distributed data store
• Architecture
• Use case 2: consumer group
• Helix at LinkedIn
• Q&A
![Page 18: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/18.jpg)
Helix Based System Roles
[Figure: Node 1, Node 2, and Node 3 holding replicas of partitions P.1 to P.12; labels: COMMAND, RESPONSE]
![Page 19: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/19.jpg)
Controller Execution Flow
P1: O-S, then P1: S-M
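The flow on this slide, where partition P1 is driven from offline to slave and then to master, can be sketched as a loop that compares a target state with the current state and issues one transition per round. This is an illustrative Python sketch and not the Helix controller: `NEXT_STATE`, `next_commands`, and `run_until_converged` are hypothetical names, and the participant's response is simulated by updating a dictionary.

```python
# Next hop toward a target state in the offline/slave/master model.
NEXT_STATE = {
    ("OFFLINE", "SLAVE"): "SLAVE",
    ("OFFLINE", "MASTER"): "SLAVE",    # must pass through SLAVE first
    ("SLAVE", "MASTER"): "MASTER",
    ("SLAVE", "OFFLINE"): "OFFLINE",
    ("MASTER", "SLAVE"): "SLAVE",
    ("MASTER", "OFFLINE"): "SLAVE",    # must pass through SLAVE first
}

def next_commands(target, current):
    """One transition command per replica whose state differs from target.
    Replicas missing from current are treated as OFFLINE."""
    commands = []
    for (partition, node), wanted in sorted(target.items()):
        have = current.get((partition, node), "OFFLINE")
        if have != wanted:
            commands.append((partition, node, have, NEXT_STATE[(have, wanted)]))
    return commands

def run_until_converged(target, current):
    """Issue commands round by round until current matches target."""
    current = dict(current)
    log = []
    while True:
        commands = next_commands(target, current)
        if not commands:
            return log, current
        for partition, node, src, dst in commands:
            # Simulated participant: applies the command and reports back.
            current[(partition, node)] = dst
            log.append(f"{partition}:{src[0]}{dst[0]}")

log, final_state = run_until_converged({("P1", "Node 1"): "MASTER"}, {})
```

For a single replica that starts offline and should be master, the log reads `P1:OS` followed by `P1:SM`, matching the two steps shown on the slide.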
![Page 20: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/20.jpg)
Controller fault tolerance
![Page 21: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/21.jpg)
Controller fault tolerance
![Page 22: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/22.jpg)
Participant Plug-in code
![Page 23: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/23.jpg)
Spectator Plug-in code
![Page 24: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/24.jpg)
Benefits
Cluster operations “just work”
– Bootstrapping
– Failover
– Add nodes
Global vs Local
– Helix Controller: global knowledge, makes cluster decisions
– Participant: local knowledge, follows orders
![Page 25: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/25.jpg)
Outline
• What is Helix
• Use case 1: distributed data store
• Architecture
• Use case 2: consumer group
• Helix at LinkedIn
• Q&A
![Page 26: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/26.jpg)
Consumer group
![Page 27: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/27.jpg)
Consumer group: Scaling
![Page 28: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/28.jpg)
Consumer group: Fault tolerance
![Page 29: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/29.jpg)
Consumer group: state model
![Page 30: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/30.jpg)
Outline
• What is Helix
• Use case 1: distributed data store
• Architecture
• Use case 2: consumer group
• Helix at LinkedIn
• Q&A
![Page 31: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/31.jpg)
Helix usage at LinkedIn (Pictures)
Espresso – a timeline-consistent, distributed data store
Databus – a change data capture service
Search as a Service – a multi-tenant service for multiple search applications
More planned
![Page 32: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/32.jpg)
Summary
• Generic framework
• Easy to use: declarative model
• Easy to operate
![Page 33: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/33.jpg)
Helix: Future Roadmap
• Features
  • Span multiple data centers
  • Load balancing
• Announcement
  • Open source: https://github.com/linkedin/helix
  • Apache incubation
  • New contributors
![Page 34: Untangling Cluster Management with Helix](https://reader033.vdocument.in/reader033/viewer/2022052619/55501f3ab4c90535638b53dd/html5/thumbnails/34.jpg)
Questions?