High Throughput Analytics with Cassandra & Azure Charles Lamanna Principal Dev Lead @clamanna

Upload: planet-cassandra

Post on 15-Jan-2015


TRANSCRIPT

Page 1: High Throughput Analytics with Cassandra & Azure

High Throughput Analytics with Cassandra & Azure

Charles Lamanna, Principal Dev Lead

@clamanna

Page 2: High Throughput Analytics with Cassandra & Azure

MetricsHub: keep cloud services up and running for the lowest possible cost

Page 3: High Throughput Analytics with Cassandra & Azure

Live Status

Cost Awareness

Alerts and Notifications

Actions and Scaling


Page 4: High Throughput Analytics with Cassandra & Azure
Page 5: High Throughput Analytics with Cassandra & Azure

2000+ customers in 6 months

[Chart: customer growth; x-axis from 10/18/2012 to 6/25/2013, y-axis from 0 to 2500]

Page 6: High Throughput Analytics with Cassandra & Azure

Storing data

• 200M data points per hour
• 80,000 data points per second (peak)
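As a quick sanity check on these figures, the hourly volume can be converted to an average per-second rate and compared with the quoted peak. This is a minimal sketch of the arithmetic only; the variable names are illustrative and not from the talk.

```python
# Figures quoted on the slide.
points_per_hour = 200_000_000
peak_points_per_second = 80_000

# Average ingestion rate implied by the hourly volume.
avg_points_per_second = points_per_hour / 3600

# How far the quoted peak sits above that average.
peak_to_avg = peak_points_per_second / avg_points_per_second

print(f"average: {avg_points_per_second:,.0f} points/s")
print(f"peak is {peak_to_avg:.2f}x the average")
```

The average works out to roughly 55,556 points per second, so the quoted peak is about 1.44 times the average rate.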

Page 7: High Throughput Analytics with Cassandra & Azure

Planning for huge data ingestion rates

Requires high scale, real-time data

• 1,000 data points per minute per VM
• 12 data points per endpoint per minute

Aggregate, analyze and take actions based on this data stream (in near real-time)

Must be cheap, scalable and reliable

Page 8: High Throughput Analytics with Cassandra & Azure

Evaluated several technologies

• Aggregation in memory; good performance, bad COGS
• Rolling tables for aggs; good tooling/support, hard to scale
• Aggregation on write; easy to scale and good COGS

Page 9: High Throughput Analytics with Cassandra & Azure

Cassandra Upside

Scales fluidly
• Grows horizontally – double the nodes, double capacity
• Add / remove capacity / nodes with no downtime

Highly available
• No single point of failure
• Replication factor (i.e. hot copies) is just a config switch

Page 10: High Throughput Analytics with Cassandra & Azure

… and by the way

Little-to-no operations cost
• New nodes take minutes to set up
• Nodes just keep running for months on end

“Aggregate on write” – no jobs required!
• Distributed counters make it easy to do aggregates on write

… and a nice kicker: has *great* perf / COGS in Azure

Page 11: High Throughput Analytics with Cassandra & Azure

Architecture

68 virtual machines (PaaS and IaaS)

Page 12: High Throughput Analytics with Cassandra & Azure

[Architecture diagram. Components shown:]

• End User Web Browsers
• Monitored Customer Resources (e.g. websites; SQL databases)
• Monitored Virtual Machines
• Portal Web Role (3 instances)
• Web API Web Role (8 instances)
• Jobs Worker Role (24 instances)
• Cassandra VM Cluster (32 XL instances)
• Table Storage
• SQL Database
• Endpoints
• Replicated data in multiple datacenters

[Diagram group labels: Clients, PaaS, IaaS, Services]

Page 13: High Throughput Analytics with Cassandra & Azure

Avoiding state

• Application logic / code all lives on stateless machines

• Keeps it simple: decreases human operations cost

• Use Azure PAAS offerings (Web and Worker roles)

[Architecture diagram repeated with the PaaS label: Portal Web Role (3 instances), Web API Web Role (8 instances), Jobs Worker Role (24 instances), Cassandra VM Cluster (32 XL instances), Table Storage, SQL Database, Blob storage, Endpoints, Replicated data in multiple datacenters]

Page 14: High Throughput Analytics with Cassandra & Azure

Azure Cloud Services (PAAS)

• Scale horizontally (grew from 1 to 30+ instances)

• Managed by the platform (patched; coordinated recycling; failover; etc.)

• 1 click deployment from Visual Studio (with automatic load balancer swaps)

Web Role Worker Role

Page 15: High Throughput Analytics with Cassandra & Azure

Cassandra Cluster (IaaS)

• Maintains all state for metrics / time series data
• 32 XL Linux Virtual Machines

[Architecture diagram repeated with the IaaS label: Portal Web Role (3 instances), Web API Web Role (8 instances), Jobs Worker Role (24 instances), Cassandra VM Cluster (32 XL instances), Table Storage, SQL Database, Blob storage, Endpoints, Replicated data in multiple datacenters]

Page 16: High Throughput Analytics with Cassandra & Azure

32 nodes, 8 “pods” of 4 nodes

Page 17: High Throughput Analytics with Cassandra & Azure

Exposing the pods

• Each pod of 4 nodes has a single load-balanced endpoint
• Clients (on our stateless roles) treat the endpoints as a pool
• A client blacklists and skips an endpoint if it starts producing a lot of errors

[Diagram: pods of nodes, each exposed via a single endpoint]
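The pool-with-blacklist behavior described here could look roughly like the following. This is a minimal sketch, not the MetricsHub client: the class name, the `max_errors` threshold, the round-robin order, and the rule that a success resets the error count are all assumptions made for illustration.

```python
import itertools


class EndpointPool:
    """Round-robin over pod endpoints, skipping ones that error too often."""

    def __init__(self, endpoints, max_errors=3):
        self.endpoints = list(endpoints)
        self.max_errors = max_errors  # hypothetical blacklist threshold
        self.errors = {e: 0 for e in self.endpoints}
        self._cycle = itertools.cycle(self.endpoints)

    def is_blacklisted(self, endpoint):
        return self.errors[endpoint] >= self.max_errors

    def next_endpoint(self):
        # Look at each endpoint at most once per call.
        for _ in range(len(self.endpoints)):
            candidate = next(self._cycle)
            if not self.is_blacklisted(candidate):
                return candidate
        raise RuntimeError("all endpoints are blacklisted")

    def report_error(self, endpoint):
        self.errors[endpoint] += 1

    def report_success(self, endpoint):
        # Assumed recovery policy: a success clears the error count.
        self.errors[endpoint] = 0


pool = EndpointPool(["pod-1", "pod-2", "pod-3"], max_errors=2)
pool.report_error("pod-2")
pool.report_error("pod-2")  # pod-2 has now reached the threshold
picked = [pool.next_endpoint() for _ in range(4)]
print(picked)  # ['pod-1', 'pod-3', 'pod-1', 'pod-3']
```

The slides do not say how or when a blacklisted endpoint is tried again; in this sketch it stays skipped until something calls `report_success` for it, so a real client would need its own retry or timeout policy.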

Page 18: High Throughput Analytics with Cassandra & Azure

Where does the data go?

• Data files are on 16 mounted network-backed disks (*not* ephemeral disks)

• Data disks are geo-replicated (3 copies local; 1 remote) for “free” DR

• Azure data disks offer great throughput (VMs end up CPU bound)

Page 19: High Throughput Analytics with Cassandra & Azure

Our Column Families (CQL 3)

CREATE TABLE oneminute (
    rk text,
    ck text,
    cnt counter,
    sum counter,
    PRIMARY KEY (rk, ck)
);

Page 20: High Throughput Analytics with Cassandra & Azure

Updating values…

Realtime “average” values at any granularity, for any time window

update oneminute / tenminute / oneday
set sum = sum + {sample_value}, cnt = cnt + 1
where rk = '{customer+metric}' and ck = '{tags_and_timestamp}'
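To make the aggregate-on-write idea concrete, here is a small in-memory model of the same update, applied to all three tables for each incoming sample. It is a sketch under stated assumptions, not the production code: the bucket widths are inferred from the table names, and the `ck` layout (tags plus a zero-padded bucket start) and the sample row keys are hypothetical.

```python
from collections import defaultdict

# In-memory stand-in for the counter tables: table -> (rk, ck) -> [sum, cnt].
tables = {name: defaultdict(lambda: [0, 0])
          for name in ("oneminute", "tenminute", "oneday")}

# Bucket width in seconds for each table (assumed from the table names).
BUCKET_SECONDS = {"oneminute": 60, "tenminute": 600, "oneday": 86_400}


def record(rk, tags, timestamp, sample_value):
    """Apply one sample to every granularity, mirroring the counter update."""
    for name, width in BUCKET_SECONDS.items():
        bucket_start = timestamp - (timestamp % width)
        ck = f"{tags}:{bucket_start:010d}"  # hypothetical key layout
        row = tables[name][(rk, ck)]
        row[0] += sample_value  # sum = sum + sample_value
        row[1] += 1             # cnt = cnt + 1


def average(table, rk, ck):
    """Derive the average at read time from the two counters."""
    total, count = tables[table][(rk, ck)]
    return total / count


record("contoso/cpu", "vm1", 120, 40)
record("contoso/cpu", "vm1", 150, 60)
record("contoso/cpu", "vm1", 185, 80)
print(average("oneminute", "contoso/cpu", "vm1:0000000120"))  # 50.0
print(average("tenminute", "contoso/cpu", "vm1:0000000000"))  # 60.0
```

Because only a running sum and a count are stored, no background job has to recompute anything: the average for any bucket is a single division when the row is read.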

Page 21: High Throughput Analytics with Cassandra & Azure

Reading values…

*ONE* round trip to fetch a metric over time (e.g. CPU over past week)

select * from oneminute
where rk = '{customer_name}'
  and ck < '{metric_path_start}'
  and ck >= '{metric_path_end}'
order by ck desc;
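The single-round-trip read works because all rows for one `rk` are kept sorted by `ck`, so a time window is one contiguous slice. The sketch below models that with a sorted list and `bisect`; the keys and counter values are made up for illustration, and the bound names are mine rather than the placeholders used in the query.

```python
from bisect import bisect_left

# One partition (a single rk), with rows kept sorted by clustering key ck.
# Keys and counter values are invented for this example.
partition = [
    ("cpu:0000000060", {"sum": 90, "cnt": 3}),
    ("cpu:0000000120", {"sum": 100, "cnt": 2}),
    ("cpu:0000000180", {"sum": 80, "cnt": 1}),
    ("cpu:0000000240", {"sum": 150, "cnt": 3}),
]


def read_range(rows, lower_inclusive, upper_exclusive):
    """Return rows with lower <= ck < upper, highest ck first."""
    keys = [ck for ck, _ in rows]
    lo = bisect_left(keys, lower_inclusive)
    hi = bisect_left(keys, upper_exclusive)
    return list(reversed(rows[lo:hi]))


result = read_range(partition, "cpu:0000000120", "cpu:0000000240")
averages = [(ck, row["sum"] / row["cnt"]) for ck, row in result]
print(averages)  # [('cpu:0000000180', 80.0), ('cpu:0000000120', 50.0)]
```

Zero-padding the timestamp portion of the key is what makes lexicographic order on a text column match chronological order in this sketch.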

Page 22: High Throughput Analytics with Cassandra & Azure

Some hard lessons…

• Static private IPs are a must; otherwise, reboots / outages can confuse the cluster when nodes come back up

• Monitor performance carefully; once you tip over, it is hard to rebalance the cluster and add new nodes

• Fit the cluster to the platform: in Azure, match the Upgrade Domains / Fault Domains to preserve uptime during service maintenance / hardware failure

Page 23: High Throughput Analytics with Cassandra & Azure

Single node tests..

• 4 disks, RAID 0, no read cache

Workload (%write)   Ops / sec   Latency median   Latency 95th   Latency 99th
100%                20018       1.5              3.7            7.9
75%                 8361        85.9             376.6          584.8
25%                 5412        459.9            759.1          940.1

• 4 disks, RAID 0, read cache

Workload (%write)   Ops / sec   Latency median   Latency 95th   Latency 99th
100%                19208       1.5              3.8            7.9
100%                18543       1.5              3.6            7.9
100%                18563       1.4              3.6            8.2
75%                 7112        195.9            595.8          1099.6
75%                 7581        168.9            589.5          985.2
75%                 5149        256.5            774.0          1402.9
25%                 15358       23.0             110.2          309.1
25%                 3742        279.2            563.0          789.7
25%                 15376       22.1             98.8           293.3

(Each workload has three result sets in the source; what distinguishes them is not labeled.)

[Chart: JBOD vs RAID 0 for read-heavy workload; series jbod and RAID0, axis from 0 to 7000]

Page 24: High Throughput Analytics with Cassandra & Azure

Multi-node load tests..

• 4 Nodes; RF = 3 (Quorum)
• 8 Disks, RAID 0

Workload (%write)   Ops / sec   Latency median   Latency 95th   Latency 99th
100%                13638       1.9              4.9            24.0
75%                 3239        11.2             687.0          1099.3
25%                 1825        243.6            687.0          808.7
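For reference, the quorum size implied by RF = 3 follows from the standard majority rule (more than half of the replicas). The short sketch below just works through that arithmetic; the function name is illustrative.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a quorum read or write."""
    return replication_factor // 2 + 1


rf = 3
q = quorum(rf)
tolerated = rf - q  # replicas that can be down while a quorum is still reachable
print(q, tolerated)  # 2 1
```

With RF = 3, each quorum operation waits for 2 replicas, so one replica can be unavailable without failing requests, and any quorum read overlaps any quorum write on at least one replica.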

Page 25: High Throughput Analytics with Cassandra & Azure
Page 26: High Throughput Analytics with Cassandra & Azure

QUESTIONS & ANSWERS

Charles Lamanna

@clamanna