C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub...
DESCRIPTION
MetricsHub is a monitoring and scalability service for public clouds that lets companies continuously gather data from their systems and auto-scale their deployments to optimize service costs. Taking advantage of Cassandra's rapid ingestion rates, reliable replication model, and ease of deployment, MetricsHub can handle billions of data points per day. During this session, you will learn about the architecture supporting this service, which combines PaaS and IaaS on the Windows Azure platform.
TRANSCRIPT
![Page 1: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/1.jpg)
Optimizing the Public Cloud for Cost and Scalability with Cassandra
#CASSANDRA13
Charles Lamanna, Senior Development Lead, @clamanna
Ricardo Villalobos, Senior Cloud Architect, @ricvilla
![Page 2: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/2.jpg)
MetricsHub: keep services up and running for the lowest possible cost
![Page 3: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/3.jpg)
Live Status
Cost Awareness
Alerts and Notifications
Actions and Scaling
![Page 4: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/4.jpg)
![Page 5: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/5.jpg)
![Page 6: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/6.jpg)
Growth: 2000+ customers in 6 months
![Page 7: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/7.jpg)
Chart: Number of MetricsHub Customers, 10/18/2012 to 6/25/2013 (y-axis from 0 to 2500)
![Page 8: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/8.jpg)
Chart: Number of VMs Monitored by MetricsHub, 10/18/2012 to 6/25/2013 (y-axis from 0 to 9000)
![Page 9: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/9.jpg)
Chart: Number of MetricsHub Employees, 10/18/2012 to 6/25/2013 (y-axis from 0 to 8)
![Page 10: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/10.jpg)
Storing data: 200M data points per hour
![Page 11: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/11.jpg)
Planning for huge data ingestion rates
• MetricsHub requires high scale, real-time data:
  • 1,000 data points per minute per VM
  • 12 data points per endpoint per minute
  • 500+ data points per storage account per hour
• Need to aggregate, analyze and take actions based on this data stream (in near real-time)
• Must be cheap, scalable and reliable
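A back-of-the-envelope check on these rates can be sketched as follows. The per-resource rates come from the slide; the fleet sizes are hypothetical values chosen only for illustration and are not MetricsHub's actual numbers.

```python
# Per-resource rates taken from the slide above.
POINTS_PER_VM_PER_MINUTE = 1_000
POINTS_PER_ENDPOINT_PER_MINUTE = 12
POINTS_PER_STORAGE_ACCOUNT_PER_HOUR = 500  # the slide says "500+", so this is a floor


def points_per_hour(vms: int, endpoints: int, storage_accounts: int) -> int:
    """Lower-bound estimate of data points ingested per hour."""
    return (
        vms * POINTS_PER_VM_PER_MINUTE * 60
        + endpoints * POINTS_PER_ENDPOINT_PER_MINUTE * 60
        + storage_accounts * POINTS_PER_STORAGE_ACCOUNT_PER_HOUR
    )


# Hypothetical fleet, for illustration only.
estimate = points_per_hour(vms=3_000, endpoints=5_000, storage_accounts=1_000)
print(f"{estimate:,} data points per hour")
```

VM traffic dominates the total: at 1,000 points per minute, a single VM alone produces 60,000 points per hour.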
![Page 12: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/12.jpg)
Looked at Redis…
• Perform aggregation in memory (using INCR and other native operations)
• Flush aggregate data from Redis to persistent storage at a regular interval
• Is fast and powerful, and has a good OSS community
![Page 13: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/13.jpg)
… but it was fragile, and expensive for this use case
• RAM/Memory in the public cloud is *expensive* (but storage is *cheap*)
• Flushing the data requires complex coordination
• If we did not flush quickly enough – out of memory!
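The in-memory design and its failure mode can be illustrated with a minimal sketch. It uses plain Python dictionaries as a stand-in for Redis (no Redis client is involved), and the class name, the `max_keys` budget, and the dict-based store are all assumptions made for illustration.

```python
from collections import defaultdict


class InMemoryAggregator:
    """Stand-in for the Redis design: aggregate in memory, flush on an interval."""

    def __init__(self, max_keys: int):
        self.max_keys = max_keys        # hypothetical memory budget, counted in keys
        self.counts = defaultdict(int)  # plays the role of INCR-style counters
        self.sums = defaultdict(int)

    def record(self, key: str, value: int) -> None:
        if key not in self.counts and len(self.counts) >= self.max_keys:
            # The failure mode from the slide: flushing fell behind.
            raise MemoryError("aggregates were not flushed quickly enough")
        self.counts[key] += 1
        self.sums[key] += value

    def flush(self, store: dict) -> None:
        """Move aggregates to persistent storage (a dict here) and free memory."""
        for key, count in self.counts.items():
            old_sum, old_count = store.get(key, (0, 0))
            store[key] = (old_sum + self.sums[key], old_count + count)
        self.counts.clear()
        self.sums.clear()


store = {}
agg = InMemoryAggregator(max_keys=2)
agg.record("cpu", 40)
agg.record("cpu", 60)
agg.flush(store)
print(store)
```

Every writer has to agree on when `flush` runs and who runs it, which is the coordination burden the slide refers to.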
![Page 14: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/14.jpg)
Looked at SQL…
• Create tables for different time windows and granularities
• Roll over from table-to-table (and drop entire tables when the data expires)
• Update in place (for counters, min, max, etc.) in a reliable way
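A rough sketch of this table-per-window scheme, using Python's built-in `sqlite3` module as a stand-in for the SQL database. The table naming scheme, the column names, and the one-table-per-day window are assumptions for illustration; the slides do not show the actual schema.

```python
import sqlite3


def table_for(day: str) -> str:
    """Hypothetical naming scheme: one table per day of one-minute data."""
    return "oneminute_" + day.replace("-", "_")


def record(conn, day, metric, value):
    # The table name is interpolated, so `day` must come from trusted code.
    table = table_for(day)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(metric TEXT PRIMARY KEY, cnt INTEGER, total INTEGER)"
    )
    # Update in place: bump the counters if the row exists, else insert it.
    updated = conn.execute(
        f"UPDATE {table} SET cnt = cnt + 1, total = total + ? WHERE metric = ?",
        (value, metric),
    ).rowcount
    if updated == 0:
        conn.execute(f"INSERT INTO {table} VALUES (?, 1, ?)", (metric, value))


def expire(conn, day):
    """Expire a whole window at once by dropping its table."""
    conn.execute(f"DROP TABLE IF EXISTS {table_for(day)}")


conn = sqlite3.connect(":memory:")
record(conn, "2013-06-10", "cpu", 40)
record(conn, "2013-06-10", "cpu", 60)
print(conn.execute("SELECT cnt, total FROM oneminute_2013_06_10").fetchall())
```

Calling `expire(conn, "2013-06-10")` then removes the whole window in a single statement.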
![Page 15: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/15.jpg)
… but SQL did not fit
• Higher write than read volume pushed the boundaries of the servers
• Requires complex sharding after just a few dozen new customers
• Is possible, but not worth the operational cost
![Page 16: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/16.jpg)
Then we tried Cassandra (and never went back)
• Scales fluidly
  • Grows horizontally – double the nodes, double the capacity
  • Add / remove capacity / nodes with no downtime
• Highly available
  • No single point of failure
  • Replication factor (i.e. hot copies) is just a config switch
![Page 17: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/17.jpg)
… and by the way
• Little-to-no operations cost
  • New nodes take minutes to set up
  • Nodes just keep running for months on end
• “Aggregate on write” – no jobs required!
  • Atomic distributed counters make it easy to do aggregates on write
• …and a nice kicker: has *great* perf / COGS in Azure
![Page 18: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/18.jpg)
Architecture: 68 virtual machines (PaaS and IaaS)
![Page 19: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/19.jpg)
Architecture diagram:
• Clients: End User Web Browsers; Monitored Virtual Machines; Monitored Customer Resources (e.g. websites; SQL databases)
• PaaS: Portal Web Role (3 instances); Web API Web Role (8 instances); Jobs Worker Role (24 instances)
• IaaS: Cassandra VM Cluster (32 XL instances)
• Services: Table Storage; SQL Database; Blob storage
• Endpoints; replicated data in multiple datacenters
![Page 20: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/20.jpg)
Avoiding state
• Application logic / code all lives on stateless machines
• Keeps it simple: decreases human operations cost
• Use Azure PAAS offerings (Web and Worker roles)
![Page 21: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/21.jpg)
Windows Azure Cloud Services (PAAS)
• Scale horizontally (grew from 1 to 30+ instances)
• Managed by the platform (patched; coordinated recycling; failover; etc.)
• 1 click deployment from Visual Studio (with automatic load balancer swaps)
Diagram: Web Role; Worker Role
![Page 22: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/22.jpg)
Jobs Worker Role (24 instances)
• Runs recurring tasks to pull, generate and analyze data
• Jobs are synchronized and scheduled using Windows Azure Tables and Queues
![Page 23: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/23.jpg)
Web API Web Role (8 instances)
• RESTful endpoint for saving and reading custom metrics
• Highly concurrent, secure & scalable
![Page 24: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/24.jpg)
Portal Web Role (3 instances)
• Interface for our customers – shows trends, charts and issues
![Page 25: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/25.jpg)
Cassandra VM Cluster (32 XL instances)
• Maintains all state for metrics / time series data
![Page 26: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/26.jpg)
Windows Azure Virtual Machines (IaaS)
Management Portal
Scripting (Windows, Linux and Mac)
REST API
Starting a VM: select image and VM size, then a new disk is persisted in storage, then the VM boots from the new disk
![Page 27: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/27.jpg)
32 nodes, 8 “pods” of 4 nodes
![Page 28: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/28.jpg)
Exposing the pods
• Each pod of 4 nodes has a single load balanced endpoint
• Clients (on our stateless roles) treat the endpoints as a pool
• Clients blacklist and skip an endpoint if it starts producing a lot of errors
Diagram: one pod exposed via a single endpoint (port 9160), another exposed via a single endpoint (port 9161), and so on for the remaining pods
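The client-side pooling described here might look roughly like the sketch below. The class, the error threshold, the cooldown, and the host names are hypothetical; only the idea of skipping an endpoint that produces many errors, and the ports 9160 and 9161, come from the slide.

```python
import time


class EndpointPool:
    """Round-robin over pod endpoints, skipping ones that recently failed a lot."""

    def __init__(self, endpoints, max_errors=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.endpoints = list(endpoints)
        self.max_errors = max_errors              # errors before blacklisting
        self.cooldown_seconds = cooldown_seconds  # how long a blacklist lasts
        self.clock = clock
        self.errors = {e: 0 for e in self.endpoints}
        self.blacklisted_until = {e: float("-inf") for e in self.endpoints}
        self._next = 0

    def pick(self):
        """Return the next endpoint that is not currently blacklisted."""
        now = self.clock()
        for _ in range(len(self.endpoints)):
            endpoint = self.endpoints[self._next]
            self._next = (self._next + 1) % len(self.endpoints)
            if self.blacklisted_until[endpoint] <= now:
                return endpoint
        raise RuntimeError("all endpoints are blacklisted")

    def report_success(self, endpoint):
        self.errors[endpoint] = 0

    def report_error(self, endpoint):
        self.errors[endpoint] += 1
        if self.errors[endpoint] >= self.max_errors:
            self.blacklisted_until[endpoint] = self.clock() + self.cooldown_seconds
            self.errors[endpoint] = 0


# Hypothetical host names; one endpoint per pod.
pool = EndpointPool(["pod1.example.net:9160", "pod2.example.net:9161"])
print(pool.pick())
```

A cooldown rather than a permanent blacklist is a design choice of this sketch: it lets a pod rejoin the pool once it recovers, without manual intervention.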
![Page 29: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/29.jpg)
Where does the data go?
• Data files are on 8 mounted network backed disks (*not* ephemeral disks)
• Data disks are geo-replicated (3 copies local; 1 remote) for “free” DR
• Azure data disks offer great throughput (VMs end up CPU bound)
![Page 30: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/30.jpg)
Our Column Families (CQL 3)
CREATE TABLE oneminute (
rk text, ck text, cnt counter, sum counter, PRIMARY KEY (rk, ck)
);
![Page 31: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/31.jpg)
Updating values…
Realtime “average” values at any granularity, for any time window
update oneminute/tenminute/oneday
set sum = sum + {sample_value}, cnt = cnt + 1
where rk = '{customer_name}' and ck = '{metric_path}'
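One way to picture the "aggregate on write" fan-out is a helper that builds one counter update per granularity table. This is a sketch under assumptions: the function name and sample values are hypothetical, the statements use `?` bind markers as in CQL prepared statements, and executing them is left to whatever driver is in use. The slide shows the same template for every table and does not show how `ck` encodes the time bucket, so the value is passed through unchanged.

```python
TABLES = ("oneminute", "tenminute", "oneday")  # table names from the slide


def build_counter_updates(customer_name: str, metric_path: str, sample_value: int):
    """Return one (statement, parameters) pair per granularity table."""
    if not isinstance(sample_value, int):
        # CQL counter columns hold integers only.
        raise TypeError("counter increments must be integers")
    updates = []
    for table in TABLES:
        statement = (
            f"UPDATE {table} SET sum = sum + ?, cnt = cnt + 1 "
            "WHERE rk = ? AND ck = ?"
        )
        updates.append((statement, (sample_value, customer_name, metric_path)))
    return updates


# Hypothetical customer and metric path, for illustration only.
for statement, params in build_counter_updates("contoso", "cpu/2013-06-11T10:15", 42):
    print(statement, params)
```

Because each write only increments counters, no background job is needed to compute the aggregates later.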
![Page 32: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/32.jpg)
Reading values…
*ONE* round trip to fetch a metric over time (e.g. CPU over past week)
select * from oneminute
where rk = '{customer_name}' and ck < '{metric_path_start}' and ck >= '{metric_path_end}'
order by ck desc;
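Since each row stores a running `sum` and `cnt`, the "average" for a bucket is just their ratio, computed on the client after the single round trip. A minimal sketch, assuming the rows arrive as `(ck, sum, cnt)` tuples; that tuple shape and the sample values are hypothetical.

```python
def averages(rows):
    """Turn (ck, sum, cnt) rows into (ck, average) pairs, skipping empty buckets."""
    return [(ck, total / count) for ck, total, count in rows if count]


# Hypothetical rows, newest first as with "order by ck desc".
rows = [
    ("cpu/10:02", 150, 3),
    ("cpu/10:01", 90, 2),
    ("cpu/10:00", 0, 0),
]
print(averages(rows))
```

Buckets with a zero count are skipped rather than reported as zero, so a gap in the data is not mistaken for a real reading.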
![Page 33: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/33.jpg)
What’s next?
• Windows Azure Virtual Networks to connect / secure all of our resources (PaaS + IaaS + Services)
• Expand Cassandra cluster across datacenter boundaries for improved availability
• Integrate with more off-the-shelf Azure components to reduce operational overhead
![Page 34: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/34.jpg)
Diagram:
• Global Physical Infrastructure: servers / network / datacenters
• Automated; elastic; managed resources; usage based
• REST API + other services
• Compute: cloud services; VMs; websites
• Data management: SQL database; noSQL database; blob
• Networking: connect; virtual network; traffic manager
![Page 35: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos](https://reader038.vdocument.in/reader038/viewer/2022102921/54555e33af79590b088b4868/html5/thumbnails/35.jpg)