Monitoring and Troubleshooting a Real Time Pipeline
Alan Ngai, CTO/Co-Founder, OpsClarity
Businesses are Turning to Data-First Applications
Ad Network – Real-time Bidding
DDoS Attack Prevention
Fraud Detection
Internet of Things
Financial Services
Real-time Personalization
Data-First Application: Many Moving Parts!
DATA SOURCE, MESSAGE BROKER, STREAM PROCESSOR, DATA SINK, APPLICATIONS
DATA PIPELINE
ELASTIC INFRASTRUCTURE
BUSINESS LOGIC AS MICROSERVICES CODE
OpsClarity Runs on Data Pipelines
Real Time Topology
Real Time Health
Real Time Anomaly Detection
Characteristics of Data Pipelines
• Heterogeneous Components
• Extremely Complex
Diagram (METRIC STORM): Storm Master Host; Storm Worker Host with Supervisor Process; Topology; Executor; Spout Task; Bolt Tasks
Characteristics of Data Pipelines
• Heterogeneous Components
• Highly Interdependent
• Highly Complex
• Painful to Monitor and Debug
Put Data In One Place (don’t rely on this)
Kafka Web Console, Spark UI, Marvel (Elasticsearch), Ambari (Hadoop), Ganglia, Nagios
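One way to read this slide: instead of checking a separate console per component, normalise every reading into a common record and write it to a single store. The sketch below is only an illustration of that idea. `MetricPoint`, `MetricStore`, and the two collector functions are hypothetical names with hard-coded values; they are not APIs of Kafka, Spark, or any tool listed above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class MetricPoint:
    """One reading, in the same shape regardless of which tool produced it."""
    component: str    # e.g. "kafka", "spark"
    name: str         # e.g. "messages_in_per_sec"
    value: float
    timestamp: float  # seconds since the epoch


class MetricStore:
    """In-memory stand-in for a single, shared metrics backend (assumption)."""

    def __init__(self) -> None:
        self._points: List[MetricPoint] = []

    def write(self, points: Iterable[MetricPoint]) -> None:
        self._points.extend(points)

    def latest(self, component: str, name: str) -> float:
        matches = [p for p in self._points
                   if p.component == component and p.name == name]
        if not matches:
            raise KeyError(f"no data for {component}/{name}")
        return max(matches, key=lambda p: p.timestamp).value

    def components(self) -> List[str]:
        return sorted({p.component for p in self._points})


# Hypothetical collectors: a real one would query its own tool's endpoint.
def collect_kafka(now: float) -> List[MetricPoint]:
    return [MetricPoint("kafka", "messages_in_per_sec", 1200.0, now)]


def collect_spark(now: float) -> List[MetricPoint]:
    return [MetricPoint("spark", "batch_latency_ms", 850.0, now)]


Collector = Callable[[float], List[MetricPoint]]


def collect_all(store: MetricStore, collectors: List[Collector], now: float) -> None:
    """Poll every collector once and land the results in the shared store."""
    for collect in collectors:
        store.write(collect(now))


store = MetricStore()
collect_all(store, [collect_kafka, collect_spark], now=1000.0)
collect_all(store, [collect_kafka, collect_spark], now=1010.0)
```

With everything in one store, a question such as "what is the latest Kafka ingest rate next to the latest Spark batch latency?" becomes two calls to `store.latest(...)` rather than two browser tabs.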
Organize Your Concerns Horizontally
• Throughput: stuff per unit of time
• Latency: how long it takes to process stuff
• Error Rate: how frequently bad stuff happens
• Buffered: how much stuff is piled up
• Data Loss: how much stuff is being lost
• Duplication: how much stuff is being duplicated
Matters for all stages in a pipeline!
Matters for all business use cases too!
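The six horizontal concerns can be computed the same way for any stage, which is what makes them comparable across a pipeline. The following is a rough sketch under stated assumptions: each message carries a unique ID, and the stage can report which IDs it received, emitted, failed on, and still holds in its buffer. `StageWindow` and `horizontal_metrics` are hypothetical names introduced for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StageWindow:
    """What one stage saw during one observation window (assumed inputs)."""
    window_seconds: float
    ids_in: List[str]          # message IDs received by the stage
    ids_out: List[str]         # message IDs emitted by the stage
    ids_failed: List[str]      # message IDs that raised an error
    ids_buffered: List[str]    # message IDs still queued at window end
    latencies_ms: List[float]  # per-message processing time


def horizontal_metrics(w: StageWindow) -> Dict[str, float]:
    received = len(w.ids_in)
    # A message is "lost" if it came in but was neither emitted,
    # reported as failed, nor left waiting in the buffer.
    accounted = set(w.ids_out) | set(w.ids_failed) | set(w.ids_buffered)
    lost = set(w.ids_in) - accounted
    # Every emission of an ID beyond the first counts as a duplicate.
    duplicates = sum(n - 1 for n in Counter(w.ids_out).values() if n > 1)
    avg_latency = (sum(w.latencies_ms) / len(w.latencies_ms)
                   if w.latencies_ms else 0.0)
    return {
        "throughput_per_sec": len(w.ids_out) / w.window_seconds,
        "avg_latency_ms": avg_latency,
        "error_rate": len(w.ids_failed) / received if received else 0.0,
        "buffered": float(len(w.ids_buffered)),
        "data_loss": float(len(lost)),
        "duplication": float(duplicates),
    }


window = StageWindow(
    window_seconds=10.0,
    ids_in=["a", "b", "c", "d", "e"],
    ids_out=["a", "b", "b"],   # "b" was emitted twice
    ids_failed=["c"],
    ids_buffered=["d"],        # "e" is unaccounted for
    latencies_ms=[10.0, 20.0, 30.0],
)
metrics = horizontal_metrics(window)
```

Because the function takes only counts and IDs, the same call could in principle be made for a broker, a stream processor, or a sink, and the resulting dictionaries compared stage by stage.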
…And Also Vertically
Where to start?!?!
…And Also Vertically
• Data Health: throughput, latency, errors?
• Dependency Health: Are Kafka and ZooKeeper healthy?
• Service Health: Is the Storm Master healthy? Are there adequate resources in the cluster?
• Application: Are my application KPIs within normal range?
• Job/Topology Health: Is my job well distributed in the cluster? Are job counters normal?
• Node Service Health: Are all jobs running on this node normal?
• Node System Health: Are key system metrics (CPU, memory, network, disk I/O) normal?
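The vertical layers lend themselves to an ordered checklist: run one health check per layer, in the order listed, and see where the failures sit. The sketch below assumes each layer's question can be reduced to a function returning `True` or `False`. The check functions here are hard-coded placeholders, and the heuristic of starting the investigation at the last failing layer in the list is an assumption of this example, not something stated in the talk.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Layers in the order they appear on the slide.
LAYERS: List[str] = [
    "Data Health",
    "Dependency Health",
    "Service Health",
    "Application",
    "Job/Topology Health",
    "Node Service Health",
    "Node System Health",
]

Check = Callable[[], bool]  # returns True when the layer looks healthy


def evaluate(checks: Dict[str, Check]) -> List[Tuple[str, Optional[bool]]]:
    """Run checks in layer order; None marks a layer with no check registered."""
    results: List[Tuple[str, Optional[bool]]] = []
    for layer in LAYERS:
        check = checks.get(layer)
        results.append((layer, check() if check else None))
    return results


def lowest_unhealthy(results: List[Tuple[str, Optional[bool]]]) -> Optional[str]:
    """Return the last failing layer in slide order (assumed starting point)."""
    failing = [layer for layer, ok in results if ok is False]
    return failing[-1] if failing else None


# Placeholder checks; real ones would compare live metrics to thresholds.
checks: Dict[str, Check] = {
    "Data Health": lambda: False,          # e.g. latency above its threshold
    "Dependency Health": lambda: True,     # e.g. Kafka and ZooKeeper reachable
    "Service Health": lambda: True,
    "Job/Topology Health": lambda: False,  # e.g. job counters abnormal
    "Node System Health": lambda: True,
}

results = evaluate(checks)
suspect = lowest_unhealthy(results)
```

Layers without a registered check show up as `None` in `results`, which makes monitoring gaps visible instead of silently treating them as healthy.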
DEMO
What We Talked About
• Data-First Applications Are Becoming a Thing
• Monitoring Data-First Applications is Hard!
• Get Your Metrics In One Place
• Organize Your Data Horizontally and Vertically