Cassandra for Ruby/Rails Devs
TRANSCRIPT
Intro to Cassandra
Tyler Hobbs
History
Cassandra combines Dynamo (clustering) with BigTable (data model).
Originally built at Facebook for inbox search.
Clustering
Every node plays the same role:
– No masters, slaves, or special nodes
– No single point of failure
Consistent Hashing
[Diagram: a ring of nodes at token positions 0, 10, 20, 30, 40, 50]
Key: "www.google.com"
Consistent Hashing
md5("www.google.com") = 14
[Diagram: the key hashes to token 14 on the ring; with Replication Factor = 3, it is stored on three nodes]
Client can talk to any node
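The ring placement described above can be sketched in a few lines of Ruby. This is purely illustrative: the `Ring` class, the toy 0–59 token space, and the method names are inventions for this sketch, not Cassandra's implementation. A key hashes to a position on the ring, and with replication factor `rf` it lands on the next `rf` nodes clockwise (wrapping around).

```ruby
require 'digest'

# Toy consistent-hashing ring (illustrative only, not Cassandra's API).
class Ring
  def initialize(tokens)        # e.g. [0, 10, 20, 30, 40, 50]
    @tokens = tokens.sort
  end

  # Hash a key onto a small 0..59 token space for illustration.
  def position(key)
    Digest::MD5.hexdigest(key).to_i(16) % 60
  end

  # With replication factor rf, the key is stored on the next rf
  # nodes clockwise from its position, wrapping past the end.
  def replicas(key, rf)
    pos = position(key)
    start = @tokens.index { |t| t >= pos } || 0   # wrap to the first token
    @tokens.rotate(start).first(rf)
  end
end
```

For example, `Ring.new([0, 10, 20, 30, 40, 50]).replicas("www.google.com", 3)` returns three distinct tokens from the ring, regardless of where the key hashes.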
Scaling
[Diagram: ring with nodes at 0, 10, 20, 30, 50; RF = 2]
The node at 50 owns the red portion.
Add a new node at 40.
Node Failures
[Diagram: ring with nodes at 0, 10, 20, 30, 40, 50; RF = 2; when a node fails, the replicas of its ranges on other nodes remain available]
Consistency, Availability
Consistency
– Can I read stale data?
Availability
– Can I write/read at all?
Tunable Consistency
Consistency
N = Total number of replicas
R = Number of replicas read from (before the response is returned)
W = Number of replicas written to (before the write is considered a success)
W + R > N gives strong consistency
Consistency
W + R > N gives strong consistency
N = 3, W = 2, R = 2
2 + 2 > 3 ==> strongly consistent
Only 2 of the 3 replicas must be available.
Consistency
Tunable Consistency
– Specify N (Replication Factor) per data set
– Specify R, W per operation
– Quorum: N/2 + 1
  • R = W = Quorum
  • Strong consistency
  • Tolerate the loss of N – Quorum replicas
– R, W can also be 1 or N
Availability
Can tolerate the loss of:
– N – R replicas for reads
– N – W replicas for writes
CAP Theorem
During node or network failure, 100% availability and 100% consistency at the same time is not possible; trading one against the other is possible.
[Diagram: Cassandra positioned on the availability/consistency spectrum]
No single point of failure
Replication that works
Scales linearly
– 2x nodes = 2x performance
  • For both reads and writes
– Up to 100's of nodes
– See "Netflix: 1 million writes/sec on AWS"
Operationally simple
Multi-Datacenter Replication
Data Model
Comes from Google BigTable
Goals
– Commodity hardware
  • Spinning disks
– Handle data sets much larger than memory
  • Minimize disk seeks
– High throughput
– Low latency
– Durable
Column Families
Static
– Object data
– Similar to a table in a relational database
Dynamic
– Precomputed query results
– Materialized views
(these are just educational classifications)
Static Column Families
Users column family:
zznate  → password: *, name: Nate
driftx  → password: *, name: Brandon
thobbs  → password: *, name: Tyler
jbellis → password: *, name: Jonathan, site: riptano.com
Rows
– Each row has a unique primary key
– Sorted list of (name, value) tuples
  • Like an ordered hash
– The (name, value) tuple is called a "column"
Dynamic Column Families
Following column family:
[Diagram: row keys are users (zznate, driftx, thobbs, jbellis); each row's columns are the users they follow (driftx:, thobbs:, mdennis:, zznate:, pcmanus:, xedin:), with empty values]
Dynamic Column Families
Other Examples:
– Timeline of tweets by a user
– Timeline of tweets by all of the people a user is following
– List of comments sorted by score
– List of friends grouped by state
The Data API
RPC-based API
– github.com/twitter/cassandra
CQL (Cassandra Query Language)
– code.google.com/a/apache-extras.org/p/cassandra-ruby/
Inserting Data
INSERT INTO users (KEY, "name", "age") VALUES ("thobbs", "Tyler", 24);
Updating Data
Updates are the same as inserts:
INSERT INTO users (KEY, "age") VALUES ("thobbs", 34);
Or:
UPDATE users SET "age" = 34 WHERE KEY = "thobbs";
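The upsert semantics can be modeled with a toy in-memory column family (illustrative Ruby, not Cassandra): a column family is a hash of rows, each row a hash of columns, and an insert over an existing row simply merges in the new columns.

```ruby
# Toy column family: row key => { column name => value }.
users = Hash.new { |h, k| h[k] = {} }

# INSERT and UPDATE are the same operation: merge columns into the row.
def upsert(cf, key, columns)
  cf[key].merge!(columns)
end

upsert(users, "thobbs", "name" => "Tyler", "age" => 24)
upsert(users, "thobbs", "age" => 34)   # "updating" is just inserting again
users["thobbs"]                        # => {"name" => "Tyler", "age" => 34}
```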
Fetching Data
Whole row select:
SELECT * FROM users WHERE KEY = "thobbs";
Fetching Data
Explicit column select:
SELECT "name", "age" FROM users WHERE KEY = "thobbs";
Fetching Data
UPDATE letters SET 1='a', 2='b', 3='c', 4='d', 5='e' WHERE KEY = “key”;
SELECT 1..3 FROM letters WHERE KEY = “key”;
Get a slice of columns
Returns [(1, a), (2, b), (3, c)]
Fetching Data
Get a slice of columns:
SELECT FIRST 2 FROM letters WHERE KEY = "key";
Returns [(1, a), (2, b)]
SELECT FIRST 2 REVERSED FROM letters WHERE KEY = "key";
Returns [(5, e), (4, d)]
Fetching Data
Get a slice of columns:
SELECT 3..'' FROM letters WHERE KEY = "key";
Returns [(3, c), (4, d), (5, e)]
SELECT FIRST 2 REVERSED 4..'' FROM letters WHERE KEY = "key";
Returns [(4, d), (3, c)]
Deleting Data
Delete a whole row:
DELETE FROM users WHERE KEY = "thobbs";
Delete specific columns:
DELETE "age" FROM users WHERE KEY = "thobbs";
Secondary Indexes
Built-in basic indexes:
CREATE INDEX ageIndex ON users (age);
SELECT name FROM users WHERE age = 24 AND state = "TX";
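Conceptually, a secondary index is just another mapping from a column value back to row keys. A toy Ruby sketch of the query above, with hypothetical data (the zznate row values here are made up for illustration):

```ruby
users = {
  "thobbs" => { "age" => 24, "state" => "TX", "name" => "Tyler" },
  "zznate" => { "age" => 30, "state" => "TX", "name" => "Nate" },  # hypothetical values
}

# Build the index: age value => list of row keys with that age.
age_index = Hash.new { |h, k| h[k] = [] }
users.each { |key, row| age_index[row["age"]] << key }

# WHERE age = 24 AND state = "TX": look up the indexed column,
# then filter the candidate rows on the remaining predicate.
hits = age_index[24].select { |k| users[k]["state"] == "TX" }
# => ["thobbs"]
```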
Performance
Writes
– 10k – 30k per second per node
– Sub-millisecond latency
Reads
– 1k – 20k per second per node (depends on data set and caching)
– 0.1 to 10ms latency
Other Features
Distributed Counters
– Can support millions of high-volume counters
Excellent Multi-datacenter Support
– Disaster recovery
– Locality
Hadoop Integration
– Isolation of resources
– Hive and Pig drivers
Compression
What Cassandra Can't Do
Transactions
– Unless you use a distributed lock
– Atomicity, Isolation
– These aren't needed as often as you'd think
Limited support for ad-hoc queries
– Know what you want to do with the data
Not one-size-fits-all
– Use alongside an RDBMS
Problems you shouldn't solve with C*
Prototyping
Distributed Locking
Small datasets
– (When you don't need availability)
Complex graph processing
– Shallow graph queries work well, though
Fundamentally highly relational/transactional data
The sweet spot for Cassandra
Large dataset, low latency queries
Simple to medium complexity queries
– Key/value
– Time series, ordered data
– Lists, sets, maps
High Availability
The sweet spot for Cassandra
Social
– Texts, comments, check-ins, collaboration
Activity
– Feeds, timelines, clickstreams, logs, sensor data
Metrics
– Performance data over time
– CloudKick, DataStax OpsCenter
Text Search
– Inbox search at Facebook
ORMs
Poor integration: ORMs are not a natural fit for Cassandra
– In C*, we mainly care about queries, not objects
– Beyond simple K/V, the abstraction breaks down
Suggestion: don't waste time with an ORM
– C* will only be used for a specific subset of your data/queries
– Use the C* API directly in your model