Moving Forwards with Cassandra!
Storing traffic data historically
21/10/2013 Pieter Callewaert
Be-Mobile│1
Be-Mobile: 1. Detection, 2. Aggregation, 3. Distribution
Detection:
• Road sensor data
• Probe vehicle data
• Public sourcing, traffic operators: Phone, SMS, Cameras, Apps, Web
• Other mobility data: Traffic centers, Public transport, Fuel, Parking
Aggregation (SmartMove Mobility Database):
• Sensor: Input API for sensors
• Icarus: Floating car data
• Editor: Manual data entry
• Connector: External data
Distribution:
• Navigation traffic services
• Traveler information services: Smartphone & (mobile) web, Radio & TV traffic services
• Traffic management services: Road traffic mgt services, Fleet & logistics traffic mgt, Traffic analysis & consulting
Be-Mobile is hiring!
http://www.be-mobile.be/about/careers
Be-Mobile NV Technologiepark 12b 9052 Ghent Belgium
www.be-mobile.be
Requirements…
February 2010
What data do we want to store?
Green dots: Nodes
Red dots: Super Nodes (connecting Links)
Blue lines: Segments
Orange lines: Links
2010: New project
We wanted to store our raw traffic data in a database so that it would be easy to query and to generate reports from.
Requirements (February 2010):
• 50 000 links stored every 15 minutes (± 4.8 million records each day)
• Low cost
• C# .NET
No problem: we already had a Microsoft SQL Server database and the experience needed to do this.
But requirements change…
Requirements (October 2010):
• 520 000 links (previously 50 000) stored every 5 minutes (previously 15): ± 150 million records each day, so the data size is 31x larger
• Very Low cost
• C#
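The record counts on these two requirement slides can be checked with a little arithmetic. A minimal sketch, assuming one record per link per storage interval (the function name is illustrative, not taken from our code base):

```python
MINUTES_PER_DAY = 24 * 60

def records_per_day(links: int, interval_minutes: int) -> int:
    """Records written per day when every link is stored once per interval."""
    return links * (MINUTES_PER_DAY // interval_minutes)

february = records_per_day(50_000, 15)   # February 2010 requirements
october = records_per_day(520_000, 5)    # October 2010 requirements

# Ratio between the two daily volumes, rounded to one decimal.
print(february, october, round(october / february, 1))
```

The ratio is where the "31x larger" figure comes from.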
Back to the drawing board
Preselection of possible contenders:
• Microsoft SQL Server: Relational database
• MongoDB: Document data store
• Apache Cassandra: Column family data store
Microsoft SQL Server
Current approach didn’t work. 2 options:
• Buy high end hardware
• Distribute load over multiple servers?
Pros:
• We have experience with SQL Server
• (Compression)
Cons:
• High hardware costs or high license costs…
• No experience with this volume of data in SQL Server
• Partitioning data over multiple servers can be tricky
MongoDB
MongoDB 1.6.x
Proof of concept ready in less than a day.
Pros:
• Backed by a company (10gen, now MongoDB Inc.),
• Open Source,
• Official C# driver,
• EASY!
Cons:
• Easier to scale beyond 1 server, but still not that straightforward,
• No (native) compression,
• 16 MB document limit forced us to make a more complex data model.
Apache Cassandra
Apache Cassandra 0.8/1.0
Took about 10 days to create a proof of concept.
Pros:
• Backed by a company (DataStax),
• Open Source,
• Scales automatically,
• Configuring replication is also easy,
• Compression (since 1.0).
Cons:
• Not easy to learn,
• Thrift interface,
• Data modeling was not easy.
• No official C# driver
How does Apache Cassandra work?
Introduction to Apache Cassandra
The basics
• Open-sourced by Facebook in 2008
• Marriage between Amazon Dynamo and Google BigTable
• No single point of failure (Dynamo)
• Consistent hashing for data distribution (Dynamo)
• BigTable data model
• A cluster is represented in a ring (of nodes)
• When a new node is added, it takes its place in the ring where needed.
• Other cool stuff:
• Multi datacenter setups
• Able to define how your replicated data is spread
• Native Hadoop/Pig support
• Able to define a time to live
• Terminology:
• Keyspace = (SQL) database
• Column family/table = (SQL) table
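To make the ring idea concrete, here is a simplified sketch of consistent hashing. It is not Cassandra's actual partitioner code: the `Ring` class, the node names, the token values and the use of MD5 are all assumptions chosen for illustration.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # token range used by this sketch (an assumption)

def token_for(key: str) -> int:
    """Hash a partition key to a position (token) on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

class Ring:
    def __init__(self):
        self._tokens = []   # node tokens, kept sorted
        self._owners = {}   # token -> node name

    def add_node(self, name: str, token: int) -> None:
        bisect.insort(self._tokens, token)
        self._owners[token] = name

    def owner_of(self, key: str) -> str:
        """First node whose token is >= the key's token, wrapping around."""
        index = bisect.bisect_left(self._tokens, token_for(key))
        return self._owners[self._tokens[index % len(self._tokens)]]

ring = Ring()
for i, name in enumerate(["node-a", "node-b", "node-c"]):
    ring.add_node(name, i * RING_SIZE // 3)

keys = [f"link-{n}" for n in range(1000)]
before = {key: ring.owner_of(key) for key in keys}

ring.add_node("node-d", RING_SIZE // 2)   # a new node joins the ring
after = {key: ring.owner_of(key) for key in keys}

moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys moved; all of them to node-d:",
      all(after[key] == "node-d" for key in moved))
```

Only the keys in the token range taken over by the new node change owner; every other key stays where it was, which is what makes adding nodes cheap.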
Data modeling was hard, until CQL3 came along…
Before:
Complex data models with column families.
Connection through the Thrift interface.
Hard to model your problem correctly.
After:
CQL3 is available over both Thrift and the native transport.
Easy to query (SELECT, INSERT, UPDATE, …)
You have to define a ‘static’ model, but you can use maps, sets and lists as column types to add dynamic columns.
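As an illustration of what CQL3 looks like, the sketch below keeps a few statements as plain Python strings and prints them. The table `link_measurements` and its columns are hypothetical; running the statements needs cqlsh or a client driver, which is deliberately left out here.

```python
# CQL3 statements as plain strings. Table and column names are hypothetical.
CREATE_TABLE = """
CREATE TABLE link_measurements (
    id bigint,
    datetime timestamp,
    speed int,
    extra map<text, text>,
    PRIMARY KEY (id, datetime)
)
"""

INSERT_ROW = """
INSERT INTO link_measurements (id, datetime, speed, extra)
VALUES (42, '2013-10-21 08:00:00', 87, {'source': 'fcd'})
"""

# A map column lets a row grow "dynamic columns" after the table is created.
ADD_DYNAMIC_VALUE = """
UPDATE link_measurements SET extra['quality'] = 'high'
WHERE id = 42 AND datetime = '2013-10-21 08:00:00'
"""

SELECT_ROWS = """
SELECT datetime, speed, extra FROM link_measurements WHERE id = 42
"""

STATEMENTS = [CREATE_TABLE, INSERT_ROW, ADD_DYNAMIC_VALUE, SELECT_ROWS]

for statement in STATEMENTS:
    print(statement.strip() + ";")
```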
Consistency? Replication?
Replication
Define the replication factor when creating the keyspace.
Consistency
With every read or write you can define a consistency level.
• ONE
• TWO
• THREE
• QUORUM
• LOCAL_QUORUM
• EACH_QUORUM
• ALL
(QUORUM: (replication_factor / 2) + 1, with the division rounded down)
* Example shown with virtual nodes set to 1
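The QUORUM formula is easiest to read with a few numbers filled in. A minimal sketch (the function name is ours, not part of any Cassandra API):

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond at consistency level QUORUM."""
    return replication_factor // 2 + 1

for rf in (1, 2, 3, 5):
    print(f"replication_factor={rf}: QUORUM={quorum(rf)}, "
          f"tolerates {rf - quorum(rf)} unavailable replica(s)")
```

With a replication factor of 3, a QUORUM read or write still succeeds when one replica is down.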
Awesome tools: cqlsh
Packaged with Apache Cassandra
Run your own queries on the data.
• Tab completion
• Colored view
• Perfect for your first steps with Apache Cassandra
• Allows tracing!
Awesome tools: nodetool
Packaged with Apache Cassandra
Nodetool: CLI-based administration tool
THE tool to use when operating a Cassandra cluster
• Lets you manage your cluster and see metrics, status, …
• See internals of your cluster
• Show off your stats…
Awesome tools: DataStax OpsCenter (Community)
Web-based front-end to monitor your cluster
Does it scale?
• In November 2011 Netflix published a blog post about a benchmark to test scalability.
• This was done with Apache Cassandra 0.8.6 on Amazon EC2 instances.
• Test was run on 48, 96, 144 and 288 nodes
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Apache Cassandra at Be-Mobile
Implementation
Current situation
In the meantime, the requirements changed again.
Requirements (September 2013):
• 520 000 links (previously 50 000) stored every 5 minutes (previously 15): ± 150 million records each day, so the data size is 31x larger
• Very Low cost
• C#
• On average 1.2 million segments stored every minute (± 1.73 billion records each day), kept for a maximum of 31 days.
Implementation
Data model v3
Thanks to CQL3 we were able to create an easy-to-understand data model.
Two almost identical data models for our segments and links.
• Data is partitioned by “id” and “date”
• “datetime” is the clustering key
• Data is sorted by “datetime” descending, so the newest data is always first
The segments table is defined with a default time to live for data.
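A table definition matching this description could look roughly like the sketch below, which builds the CQL text in Python. Only “id”, “date” and “datetime” come from the slide; the column types, the `speed` column, the table names and the 31-day TTL value (borrowed from the retention requirement) are assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def create_table_cql(table, ttl_days=None):
    """Build a CREATE TABLE statement for the layout described above."""
    # Newest rows first within each (id, date) partition.
    options = ["CLUSTERING ORDER BY (datetime DESC)"]
    if ttl_days is not None:
        # Table-level default TTL, expressed in seconds.
        options.append(f"default_time_to_live = {ttl_days * SECONDS_PER_DAY}")
    return (
        f"CREATE TABLE {table} (\n"
        "    id bigint,\n"           # hypothetical type
        "    date text,\n"           # hypothetical type, e.g. '2013-09-30'
        "    datetime timestamp,\n"
        "    speed int,\n"           # hypothetical payload column
        "    PRIMARY KEY ((id, date), datetime)\n"
        ") WITH " + "\n  AND ".join(options) + ";"
    )

print(create_table_cql("links"))
print(create_table_cql("segments", ttl_days=31))
```

Partitioning on both “id” and “date” keeps each partition to one day of data for one segment or link, so partitions stay bounded in size.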
Our cluster (since September 2013)
12 nodes (commodity hardware!)
• Intel Core i7-4770,
• 32GB RAM,
• 240 GB SSD,
• 2 x 2TB 7200 RPM HDD
Running Ubuntu Linux 12.04 with Apache Cassandra 2.0.1
Connection through our own API on each node, developed in C# with ServiceStack.
Cluster data size: 12 TB
Every minute, 1.2 million records are written in 5 seconds.
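These figures translate into a write rate as follows. A back-of-the-envelope sketch: the per-node number assumes writes are spread evenly over the 12 nodes and ignores replication, since the replication factor is not stated here.

```python
RECORDS_PER_BATCH = 1_200_000   # records written every minute
WRITE_SECONDS = 5               # time taken to write one batch
NODES = 12                      # cluster size

writes_per_second = RECORDS_PER_BATCH / WRITE_SECONDS
per_node = writes_per_second / NODES   # even spread assumed, replication ignored
records_per_day = RECORDS_PER_BATCH * 24 * 60

print(f"{writes_per_second:,.0f} records/s in total")
print(f"{per_node:,.0f} records/s per node")
print(f"{records_per_day:,} records per day")
```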
Results
11th of March 2013
12th of March 2013
Thanks! Questions?