Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python

Cassandra Meetup

Upload: datastax-academy

Post on 06-Jan-2017

TRANSCRIPT

Page 1: Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python

Cassandra Meetup

Page 2

Monitoring C* Health at Scale

Jason Cacciatore @jasoncac

Page 3

How do we assess health?

● Node Level
○ dmesg errors
○ gossip status
○ thresholds (heap, disk usage, …)

● Cluster Level
○ ring aggregate
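The two levels above can be sketched in a few lines of Python. This is a minimal illustration, not the Netflix implementation: the `NodeSample` fields, the threshold values, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical thresholds; the slide only says heap and disk usage are checked.
HEAP_LIMIT = 0.85
DISK_LIMIT = 0.80


@dataclass
class NodeSample:
    name: str
    dmesg_errors: int    # number of kernel errors found in dmesg
    gossip_status: str   # e.g. "NORMAL" or "DOWN"
    heap_used: float     # fraction of heap in use, 0..1
    disk_used: float     # fraction of disk in use, 0..1


def node_problems(sample: NodeSample) -> List[str]:
    """Node level: list every check this node fails."""
    problems = []
    if sample.dmesg_errors > 0:
        problems.append("dmesg errors")
    if sample.gossip_status != "NORMAL":
        problems.append("gossip status " + sample.gossip_status)
    if sample.heap_used > HEAP_LIMIT:
        problems.append("heap over threshold")
    if sample.disk_used > DISK_LIMIT:
        problems.append("disk over threshold")
    return problems


def cluster_health(samples: List[NodeSample]) -> Tuple[Dict[str, List[str]], float]:
    """Cluster level: aggregate node results over the whole ring."""
    bad = {}
    for sample in samples:
        problems = node_problems(sample)
        if problems:
            bad[sample.name] = problems
    healthy_fraction = 1 - len(bad) / len(samples) if samples else 1.0
    return bad, healthy_fraction
```

Calling `cluster_health` on one sample per node returns the failing nodes with their reasons, plus the fraction of the ring that passed every check.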

Page 4

Scope

● Production + Test
● Hundreds of clusters
● Over 9,000 nodes

Page 5

Current Architecture

Page 6

So… what’s the problem?

● Cron system for monitoring is problematic
○ No state, just snapshot

● Stream processor is a better fit
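To illustrate why state matters, here is a small sketch of a check that a stateless snapshot cannot express: alert only when a node has been unhealthy for several consecutive events. The class name and the threshold of 3 are assumptions for the example, and this is plain Python rather than a Mantis job.

```python
from collections import defaultdict


class ConsecutiveFailureDetector:
    """Keeps per-node state across events, which a one-shot cron snapshot cannot do."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.streak = defaultdict(int)  # node -> current run of unhealthy events

    def on_event(self, node, healthy):
        """Process one health event; return True when an alert should fire."""
        if healthy:
            self.streak[node] = 0
            return False
        self.streak[node] += 1
        # Fire once, exactly when the streak reaches the threshold.
        return self.streak[node] == self.threshold
```

A single unhealthy reading no longer pages anyone; only a sustained run does, and a healthy event in between resets the count.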

Page 7

Mantis

● A reactive stream processing system that runs on Apache Mesos
○ Cloud native
○ Provides a flexible functional programming model
○ Supports job autoscaling
○ Deep integration into Netflix ecosystem

● Current Scale
○ ~350 jobs
○ 8 Million Messages/sec processed
○ 250 Gb/sec data processed

● QConNY 2015 talk on Netflix Mantis

Page 8

Modeled as Mantis Jobs

Page 9

Resources

Page 10

Real-time Dashboard

Page 11

Page 12

How can I try this?

● Priam - JMXNodeTool
● Stream Processor - Spark

Page 13

THANK YOU

Page 14

C* Gossip: the good, the bad and the ugly

Minh Do @timiblossom

Page 15

What is Gossip Protocol or Gossip?

● A peer-to-peer communication protocol in a distributed system
● Inspired by the form of gossip seen in human social networks
● Nodes spread out information to whichever peers they can contact
● Used in C* primarily as a membership protocol and for information sharing

Page 16

Gossip flow in C*

● At start, Gossiper loads seed addresses from configuration file into the gossip list
● doShadowRound on seeds
● Every 1s, gossip up to 3 nodes from the peers: a random peer, a seed peer, and an unreachable peer
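The per-second target selection can be sketched as follows. This is a simplified illustration: the function name and arguments are hypothetical, the inputs are assumed to exclude the local node, and the real Gossiper decides probabilistically whether to contact an unreachable peer or a seed, whereas this sketch always picks one when available.

```python
import random


def pick_gossip_targets(live, unreachable, seeds, rng=random):
    """Pick up to three peers for one gossip round.

    live, unreachable, seeds: sets of peer addresses (local node excluded).
    rng: any object with a choice() method, e.g. random or random.Random(seed).
    """
    targets = []
    if live:
        # One random live peer.
        targets.append(rng.choice(sorted(live)))
    if unreachable:
        # One unreachable peer, in case it has come back.
        targets.append(rng.choice(sorted(unreachable)))
    # One seed that has not already been chosen above.
    seed_candidates = sorted(set(seeds) - set(targets))
    if seed_candidates:
        targets.append(rng.choice(seed_candidates))
    return targets
```

Passing a seeded `random.Random` instance as `rng` makes a round reproducible, which is handy when testing.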

Page 17

C* Gossip round in 3 stages
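In Cassandra the three stages are carried by the GossipDigestSyn, GossipDigestAck and GossipDigestAck2 messages. The sketch below imitates that exchange with plain dictionaries. It is heavily simplified: each endpoint has a single `(version, value)` pair, whereas the real protocol tracks a generation plus per-state versions, and all function names here are made up for the example.

```python
def make_syn(state):
    """Stage 1 (SYN): the initiator sends only digests, endpoint -> version."""
    return {ep: ver for ep, (ver, _) in state.items()}


def make_ack(state, syn):
    """Stage 2 (ACK): the receiver returns what it knows that is newer,
    plus the endpoints for which the initiator is ahead."""
    newer_here = {ep: sv for ep, sv in state.items() if sv[0] > syn.get(ep, -1)}
    wanted = [ep for ep, ver in syn.items() if ver > state.get(ep, (-1, None))[0]]
    return newer_here, wanted


def make_ack2(state, wanted):
    """Stage 3 (ACK2): the initiator sends the full states that were requested."""
    return {ep: state[ep] for ep in wanted}


def merge(state, updates):
    """Apply remote states, keeping whichever version is higher."""
    for ep, (ver, val) in updates.items():
        if ver > state.get(ep, (-1, None))[0]:
            state[ep] = (ver, val)


def gossip_round(a, b):
    """Run one full round initiated by node a towards node b."""
    syn = make_syn(a)
    newer_on_b, wanted = make_ack(b, syn)
    merge(a, newer_on_b)
    merge(b, make_ack2(a, wanted))
```

After one round both sides hold the newest version of every endpoint either of them knew about.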

Page 18

How does Gossip help C*?

● Discover cluster topology (DC, Rack)
● Discover token owners
● Figure out peer statuses:
○ moving
○ leaving/left
○ normal
○ down
○ bootstrapping

● Exchange Schema version
● Share Load (used disk space)/Severity (CPU)
● Share Release version/Net version
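As an example of using the shared state, the sketch below takes one node's gossip view and reports schema disagreement and peers that are not in the normal state. The input shape is an assumption: a dictionary of peer address to application states keyed by names such as STATUS and SCHEMA, with the status value simplified to a bare word.

```python
from collections import defaultdict


def schema_groups(view):
    """Group peers by the schema version they advertise.

    More than one group means the cluster disagrees on the schema.
    """
    groups = defaultdict(list)
    for peer, states in sorted(view.items()):
        groups[states.get("SCHEMA")].append(peer)
    return dict(groups)


def peers_not_normal(view):
    """Return peers whose advertised status is anything other than NORMAL."""
    return sorted(p for p, states in view.items() if states.get("STATUS") != "NORMAL")
```

For instance, a view in which one peer advertises a different SCHEMA value yields two groups, which is a quick signal that a schema change has not propagated everywhere.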

Page 19

What does Gossip not do for C*?

● Detect crashes in Thrift or Native servers

● Manage cluster (need Priam or OpsCenter)

● Collect performance metrics: latencies, RPS, JVM stats, network stats, etc.

● Give C* admins a good sleep.

Page 20

Gossip race condition

Most Gossip issues and bugs are caused by incorrect logic in handling race conditions

● The larger the cluster, the higher the chance of a race condition

● Several C* components running in different threads can affect gossip status:
○ Gossiper
○ FailureDetector
○ Snitches
○ StorageService
○ InboundTcpConnection
○ OutboundTcpConnection
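A generic way to see the hazard: several threads update the same endpoint state, and a stale update that arrives late can overwrite a newer one unless the check and the write happen atomically. The sketch below is not Cassandra code; the `EndpointState` class is a made-up stand-in showing a versioned update guarded by a lock.

```python
import threading


class EndpointState:
    """Shared status for one peer, updated from several threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.status = "UNKNOWN"

    def update(self, version, status):
        """Apply an update only if it is newer than what is stored.

        The version check and the write sit under one lock, so no other
        thread can slip in between them (a check-then-act race).
        """
        with self._lock:
            if version <= self.version:
                return False  # stale update, ignore it
            self.version = version
            self.status = status
            return True
```

With this guard the final state is the highest version seen, no matter which thread runs last.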

Page 21

Pain - Gossip Can Inflict on C*

● CASSANDRA-6125: an example of a race condition
● CASSANDRA-10298: Gossiper does not properly clean out metadata for a dead peer, so the dead peer stays in the ring forever
● CASSANDRA-10371: dead nodes remain in gossip and prevent a replacement because FailureDetector is unable to evict a down node
● CASSANDRA-8072: unable to gossip to any seeds
● CASSANDRA-8336: shutdown issue in which peers resurrect a down node

Page 22

Pain - Gossip Can Inflict on C*, cont.

● CASSANDRA-8072 and CASSANDRA-7292: problems reusing the IP of a dead node on a new node
● CASSANDRA-10969: long-running cluster (over 1 yr) has a restart issue
● CASSANDRA-8768: issue upgrading to a newer version
● CASSANDRA-10321: gossip to dead nodes caused CPU usage to reach 100%
● A lemon node or an AWS network issue can keep one node from seeing another, producing a confusing gossip view

Page 23

What can we do?

● Rolling restart the C* cluster once in a while
● On AWS, when there is a gossip issue, try a reboot. If the gossip view is still bad, replace the node with a new instance

● Node assassination (unsafe and needs a repair/clean-up)

● Monitor network activities to take pre-emptive actions

● Search community for the issues reported in the system logs

● Fix it yourself

● Pray
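The rolling restart from the first bullet can be sketched as a loop that restarts one node at a time and waits for it to report healthy before moving on. The `restart_node` and `is_healthy` callables are hypothetical hooks you would supply (for example, wrapping your own tooling); nothing here is a real Cassandra or Priam API.

```python
import time


def rolling_restart(nodes, restart_node, is_healthy, poll_seconds=5.0, max_polls=60):
    """Restart nodes one by one, never touching the next until the last is healthy.

    restart_node(node): hypothetical callable that restarts one node.
    is_healthy(node): hypothetical callable returning True once the node is back.
    """
    done = []
    for node in nodes:
        restart_node(node)
        for _ in range(max_polls):
            if is_healthy(node):
                break
            time.sleep(poll_seconds)
        else:
            # Stop the rollout rather than take down a second node.
            raise RuntimeError(f"{node} did not become healthy; restarted so far: {done}")
        done.append(node)
    return done
```

Stopping on the first node that fails to recover keeps a bad rollout from taking out more than one replica at a time.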

Page 24

THANK YOU

Page 25

References

● https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf
● https://wiki.apache.org/cassandra/ArchitectureGossip

Page 26

Cassandra Tickler

@chriskalan

Page 27

When does repair fall down?

● Running LCS on an old version of C*
● Space issues
● Repair gets stuck

Page 28

Solution - Cassandra Tickler

Page 29

Solution - Cassandra Tickler

Page 30

Solution - Cassandra Tickler

https://github.com/ckalantzis/cassTickler
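The slide images did not survive in this transcript, so the following is only a sketch of the idea as I understand it, on the assumption that the tickler reads every row at consistency level ALL so that read repair brings the replicas back in sync. The function and its `read_at_all` argument are hypothetical; the keyspace, table and column in the comment are placeholders.

```python
def tickle(keys, read_at_all):
    """Read every key once at consistency level ALL and count the outcomes.

    keys: iterable of primary keys to touch.
    read_at_all(key): hypothetical callable that performs one read at CL ALL.
    With the DataStax Python driver it could wrap something like:
        stmt = SimpleStatement("SELECT * FROM ks.tbl WHERE id = %s",
                               consistency_level=ConsistencyLevel.ALL)
        session.execute(stmt, (key,))
    where ks.tbl and id are placeholder names.
    """
    touched = 0
    failed = []
    for key in keys:
        try:
            read_at_all(key)
            touched += 1
        except Exception:
            # A read at ALL fails if any replica is down; remember the key for a retry.
            failed.append(key)
    return touched, failed
```

Keys that fail can be fed back into `tickle` later, once every replica is reachable again.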

Page 31

THANK YOU