Introduction to Cassandra Basics
DESCRIPTION
An introduction to some basic concepts and data modeling techniques in Cassandra.
TRANSCRIPT
Introduction to Cassandra
Nick Bailey
@nickmbailey
Monday, October 28, 13
©2012 DataStax
Who am I?
What’s DataStax?
On to the good stuff!
Why Cassandra?
Cluster Architecture
Node Architecture
Data Modeling
Wrap up
Why Cassandra?
Time for buzz words!
Big Data!
NoSQL!
Big Data
• Gartner: “...high-volume, high-velocity and high-variety...”
• 2 sides of ‘big data’
  • Analytics
  • Real-time
NoSQL
• A terrible label
• Covers a wide range of DBs
  • Cassandra
  • Redis
  • MongoDB
  • HBase
  • ...
Started by Facebook
Dynamo (Amazon) + Bigtable (Google)
Cassandra is great for...
• Massive, linear scaling (e.g. CERN hadron collider, Barracuda Networks)
• Extremely heavy writes (e.g. BlueMountain Capital – financial tick data)
• High availability (e.g. eBay, Eventbrite, Netflix, SoundCloud, HealthCare Anytime, Comcast, GoDaddy, Sony Entertainment Network)
http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
One size does not fit all
Polyglot persistence
More Resources
• PlanetCassandra.org
• Blog
• 5 minute interviews
Cluster Architecture
Data Distribution
[Figure: ring of nodes at tokens 0, 25, 50 and 75]
Hash_Function(Partition Key) >> Token
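The mapping from partition key to token to node can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the 0-99 token space and the MD5-based hash are made up to match the four tokens on the slide, and are not Cassandra's real partitioner or token range.

```python
import bisect
import hashlib

# Hypothetical four-node ring. Each node owns the range
# (previous node's token, its own token], wrapping around at the end.
RING = [(0, "node_a"), (25, "node_b"), (50, "node_c"), (75, "node_d")]
TOKENS = [token for token, _ in RING]
TOKEN_SPACE = 100  # toy token space chosen to match the slide


def token_for(partition_key: str) -> int:
    """Hash a partition key into the toy token space (MD5 is illustrative only)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


def owner_of(token: int) -> str:
    """Return the first node whose token is >= the given token, wrapping around."""
    index = bisect.bisect_left(TOKENS, token) % len(RING)
    return RING[index][1]


def node_for(partition_key: str) -> str:
    """Partition key -> token -> owning node."""
    return owner_of(token_for(partition_key))
```

Because the node is derived purely from the hash of the partition key, any node in the cluster can work out where a row lives without consulting a central directory.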
Replication
Failure Modes
Consistency Level
• Multiple options
  • ONE
  • QUORUM
  • ALL
  • LOCAL_QUORUM
  • ...
• Can be specified per request
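As a rough sketch of what the levels mean in terms of replica responses: a quorum is a majority of the replicas, and a read is guaranteed to overlap the latest write when the read and write response counts add up to more than the replication factor. The function names below are hypothetical, and LOCAL_QUORUM is left out because it depends on per-datacenter replication factors.

```python
def quorum(replication_factor: int) -> int:
    """A quorum is a majority of the replicas."""
    return replication_factor // 2 + 1


def required_responses(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for the request to succeed."""
    levels = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def read_overlaps_write(write_cl: str, read_cl: str, replication_factor: int) -> bool:
    """True when every read must touch at least one replica that saw the write."""
    w = required_responses(write_cl, replication_factor)
    r = required_responses(read_cl, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, QUORUM writes plus QUORUM reads overlap (2 + 2 > 3), while ONE plus ONE does not (1 + 1 <= 3), which is why the latter can return stale data.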
Quorum
Consistency
Write CL: ONE
Consistency
Read CL: ONE
Failure Types
• UnavailableException
  • Didn’t even try
• TimedOutException
  • Possible success or failure
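The difference between the two failures can be sketched with a toy coordinator. The exception classes here are local Python stand-ins that borrow the names from the slide, not classes from any driver, and the function and its parameters are hypothetical.

```python
class UnavailableException(Exception):
    """Stand-in: raised before any replica is contacted."""


class TimedOutException(Exception):
    """Stand-in: raised after the request was sent but too few replicas answered."""


def coordinate_write(required: int, replicas_alive: int, replicas_acked: int) -> str:
    """Toy coordinator logic for a single write.

    required       -- responses demanded by the consistency level
    replicas_alive -- replicas the coordinator believes are up
    replicas_acked -- replicas that answer before the timeout
    """
    if replicas_alive < required:
        # Not enough live replicas: the write is never attempted.
        raise UnavailableException("write was not attempted")
    # The write has now been sent to the live replicas. Some of them may
    # apply it even if their acknowledgement never arrives in time.
    if replicas_acked < required:
        raise TimedOutException("write may or may not have been applied")
    return "ok"
```

The practical consequence is that a write that fails as unavailable is safe to treat as not applied, while a timed-out write has to be treated as possibly applied.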
Multi DC
Gossip
• Manages cluster state
  • Nodes up/down
  • Nodes joining/leaving
• Decentralized
Snitch
• Responsible for determining cluster topology
• Tracks node responsiveness
• SimpleSnitch, PropertyFileSnitch, Ec2Snitch, etc.
Node Architecture
Write Path
[Diagram: a write is appended to the commit log on disk and applied to the memtable in memory; memtables are flushed to SSTables on disk]
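The write path can be sketched as a toy in-memory model. The class name, the flush threshold and the use of Python lists and dicts in place of real files are all assumptions made for illustration; real flushing and commit log recycling are considerably more involved.

```python
class ToyNode:
    """Very simplified model of one node's write path."""

    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # immutable, sorted "files" on disk, oldest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        # 1. Append to the commit log so the write survives a crash.
        self.commit_log.append((key, value))
        # 2. Apply the write to the memtable.
        self.memtable[key] = value
        # 3. Flush once the memtable is full.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a new sorted, immutable SSTable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        # Flushed data no longer needs to be replayed from the log.
        self.commit_log.clear()
```

Note that a write never reads or modifies an existing SSTable, which is a large part of why writes are cheap.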
Read Path
[Diagram: a read checks the memtable in memory and the SSTables on disk]
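A read may find versions of the same row in the memtable and in several SSTables, and the most recently written version wins. The sketch below assumes each stored value is a `(timestamp, value)` pair and uses plain dicts for the memtable and SSTables; the function name is hypothetical, and real reads use further optimizations that are not modeled here.

```python
def read(key, memtable, sstables):
    """Return the newest value for key across the memtable and all SSTables.

    memtable -- dict of key -> (timestamp, value)
    sstables -- list of dicts with the same shape
    """
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for sstable in sstables:
        if key in sstable:
            candidates.append(sstable[key])
    if not candidates:
        return None
    # Last write wins: pick the candidate with the highest timestamp.
    _, value = max(candidates, key=lambda candidate: candidate[0])
    return value
```

The more SSTables a row is spread across, the more places a read has to look, which is one reason the data model should be shaped around the reads you need.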
Data Modeling
CQL: Cassandra Query Language
Terminology
• Keyspace
• Table (Column Family)
• Row
• Column
• Partition Key
• Clustering Key (Optional)
For Example:
-- Simple replication, e.g. for a single data center:
CREATE KEYSPACE packagetracker
  WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
-- Or, as an alternative, replication across multiple data centers:
CREATE KEYSPACE packagetracker
  WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 2, 'dc2' : 2 };
CREATE TABLE events (
  package_id text,
  status_timestamp timestamp,
  location text,
  notes text,
  PRIMARY KEY (package_id, status_timestamp)
);
Constructs
Basic Data Types
• blob
• int
• text
• bigint (64-bit long)
• uuid
• etc
More Data Modeling Constructs
• Collections
  • map, set, list
• Time to live (TTL)
• Counters
• Secondary Indexes
Approaching Data Modeling
• Model your queries, not your data
  • Optimize your data model for reads
• Don’t be afraid to denormalize
• You will get it wrong, so iterate
An Example: User Logins
The Query
What are the last 10 locations nickmbailey logged in from?
SELECT time, location FROM logins WHERE user = 'nickmbailey' ORDER BY time DESC LIMIT 10;
• Partition Key: user
• Clustering Key: time
• Additional Columns: location
CREATE TABLE logins (
  user text,
  time timestamp,
  location text,
  PRIMARY KEY (user, time)
);
User         Time                 Location
nickmbailey  2013-07-19 09:22:18  Austin, Texas
nickmbailey  2013-07-19 14:49:27  Blacksburg, Virginia
jsmith       2013-07-20 07:59:34  Atlanta, Georgia
Partition key: User. Primary key: (User, Time).
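One way to picture how the primary key shapes storage is a map from partition key to a list of rows kept sorted by clustering key. The Python below is a toy model with hypothetical function names, not how Cassandra stores data internally; timestamps are plain strings in a format that happens to sort chronologically.

```python
import bisect


def insert_login(table, user, time, location):
    """Insert a row, keeping each partition sorted by the clustering key (time)."""
    rows = table.setdefault(user, [])
    index = bisect.bisect_left(rows, (time,))
    if index < len(rows) and rows[index][0] == time:
        # Same primary key (user, time): the new row overwrites the old one.
        rows[index] = (time, location)
    else:
        rows.insert(index, (time, location))


def last_locations(table, user, limit=10):
    """Mimic: SELECT time, location ... WHERE user = ? ORDER BY time DESC LIMIT ?"""
    rows = table.get(user, [])
    # Rows are already sorted, so the newest ones are simply the tail of the list.
    return list(reversed(rows[-limit:]))
```

The query touches exactly one partition and reads a contiguous slice of it, which is the access pattern this table was designed for.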
Time-series data
• By far, the most common data model
• Event logs
• Metrics
• Sensor Data
• Etc
Another Query
When was the last time nickmbailey logged in from San Francisco, California?
SELECT time FROM logins WHERE user = 'nickmbailey' AND location = 'San Francisco, California';
• location is not part of the primary key of logins, so this query cannot be answered efficiently from that table
User         Time                 Location
nickmbailey  2013-07-19 09:22:18  Austin, Texas
nickmbailey  2013-07-19 14:49:27  Blacksburg, Virginia
nickmbailey  2013-07-19 14:49:27  Austin, Texas
nickmbailey  2013-05-19 14:49:27  Austin, Texas
nickmbailey  2013-04-19 14:49:27  San Francisco, California
...          ...                  ...
jsmith       2013-07-20 07:59:34  Atlanta, Georgia
Another Query
When was the last time nickmbailey logged in from San Francisco, California?
SELECT time FROM logins_by_location WHERE user = 'nickmbailey' AND location = 'San Francisco, California';
CREATE TABLE logins_by_location (
  user text,
  time timestamp,
  location text,
  PRIMARY KEY (user, location)
);
User         Location                   Time
nickmbailey  Austin, Texas              2013-07-19 09:22:18
nickmbailey  Blacksburg, Virginia       2013-07-19 14:49:27
nickmbailey  San Francisco, California  2013-07-19 14:49:27
Denormalize
• Create materialized views of the same data to support different queries
• Storage space is cheap, Cassandra is fast
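In application code, denormalizing usually means writing the same event to every table that serves a query. The sketch below models the two tables from the login example as Python dicts keyed by their primary keys; the class and method names are hypothetical, and it assumes logins are recorded in time order.

```python
class LoginStore:
    """Toy model of two tables holding the same data under different keys."""

    def __init__(self):
        self.logins = {}              # (user, time) -> location
        self.logins_by_location = {}  # (user, location) -> time

    def record_login(self, user, time, location):
        # Table 1 serves "the last N locations for a user".
        self.logins[(user, time)] = location
        # Table 2 serves "when did this user last log in from a location".
        # Same primary key means overwrite, so the latest write is what remains.
        self.logins_by_location[(user, location)] = time

    def last_login_from(self, user, location):
        return self.logins_by_location.get((user, location))
```

Each query is then a single lookup by primary key in the table built for it, at the cost of one extra write per event.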
Debugging your data model
cqlsh> tracing on;
Now tracing requests.
cqlsh:foo> INSERT INTO test (a, b) VALUES (1, 'example');
Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9

 activity                            | timestamp    | source    | source_elapsed
-------------------------------------+--------------+-----------+----------------
                  execute_cql3_query | 00:02:37,015 | 127.0.0.1 |              0
                   Parsing statement | 00:02:37,015 | 127.0.0.1 |             81
                 Preparing statement | 00:02:37,015 | 127.0.0.1 |            273
   Determining replicas for mutation | 00:02:37,015 | 127.0.0.1 |            540
       Sending message to /127.0.0.2 | 00:02:37,015 | 127.0.0.1 |            779
   Messsage received from /127.0.0.1 | 00:02:37,016 | 127.0.0.2 |             63
                   Applying mutation | 00:02:37,016 | 127.0.0.2 |            220
                Acquiring switchLock | 00:02:37,016 | 127.0.0.2 |            250
              Appending to commitlog | 00:02:37,016 | 127.0.0.2 |            277
                  Adding to memtable | 00:02:37,016 | 127.0.0.2 |            378
    Enqueuing response to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 |            710
       Sending message to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 |            888
A note on Transactions
• In general, you want to design your data model so that you do not need them
• Cassandra 2.0 (the latest version at the time of this talk) has ‘Compare and swap’
  • An implementation of Paxos
  • ...IF NOT EXISTS;
  • ...IF column1 = 'value';
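The semantics of the two conditional forms can be sketched on a plain Python dict. This only models what the conditions mean on a single copy of the data; it says nothing about how Paxos reaches agreement between replicas, and the function names are hypothetical. Each function returns whether the change was applied.

```python
def insert_if_not_exists(table, key, row) -> bool:
    """Mimic INSERT ... IF NOT EXISTS: only insert when the key is absent."""
    if key in table:
        return False
    table[key] = dict(row)
    return True


def update_if(table, key, column, expected, new_value) -> bool:
    """Mimic UPDATE ... IF column = expected: only update when the check passes."""
    row = table.get(key)
    if row is None or row.get(column) != expected:
        return False
    row[column] = new_value
    return True
```

A typical use is claiming a unique username: the first insert is applied, and any later attempt for the same key is rejected instead of silently overwriting it.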
Try it out
CCM
• CCM - Cassandra Cluster Manager
  • https://github.com/pcmanus/ccm
• Warning: not lightweight
• Example:
  • ccm create test -v 2.0.1
  • ccm populate -n 3
  • ccm start
Clients
• cqlsh
  • Bundled with Cassandra
• Drivers
  • java: https://github.com/datastax/java-driver
  • python: https://github.com/datastax/python-driver
  • .net: https://github.com/datastax/csharp-driver
  • and more: http://www.datastax.com/download/clientdrivers
Get Help
• IRC: #cassandra on freenode
• Mailing Lists
• Stack Overflow
• DataStax Docs
  • http://www.datastax.com/docs
Questions?