Cassandra for Ruby/Rails Devs
TRANSCRIPT
Intro to Cassandra
Tyler Hobbs
History
Cassandra combines Dynamo (clustering) with BigTable (data model).
Originally built at Facebook for inbox search.
Clustering
Every node plays the same role:
– No masters, slaves, or special nodes
– No single point of failure
Consistent Hashing
[Diagram: a ring of nodes at token positions 0, 10, 20, 30, 40, 50]
Key: "www.google.com"
Consistent Hashing
md5("www.google.com") = 14
[Diagram: the key hashes to token 14 on the ring; with Replication Factor = 3, it is stored on three nodes]
Client can talk to any node
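The ring placement described above can be sketched in a few lines of Ruby. This is purely illustrative: the `Ring` class, the toy 0–59 token space, and the method names are inventions for this sketch, not Cassandra's implementation. A key hashes to a position on the ring, and with replication factor `rf` it lands on the next `rf` nodes clockwise (wrapping around).

```ruby
require 'digest'

# Toy consistent-hashing ring (illustrative only, not Cassandra's API).
class Ring
  def initialize(tokens)        # e.g. [0, 10, 20, 30, 40, 50]
    @tokens = tokens.sort
  end

  # Hash a key onto a small 0..59 token space for illustration.
  def position(key)
    Digest::MD5.hexdigest(key).to_i(16) % 60
  end

  # With replication factor rf, the key is stored on the next rf
  # nodes clockwise from its position, wrapping past the end.
  def replicas(key, rf)
    pos = position(key)
    start = @tokens.index { |t| t >= pos } || 0   # wrap to the first token
    @tokens.rotate(start).first(rf)
  end
end
```

For example, `Ring.new([0, 10, 20, 30, 40, 50]).replicas("www.google.com", 3)` returns three distinct tokens from the ring, regardless of where the key hashes.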
Scaling
[Diagram: ring with nodes at 0, 10, 20, 30, 50; RF = 2]
The node at 50 owns the red portion.
Add a new node at 40.
Node Failures
[Diagram: ring with nodes at 0, 10, 20, 30, 40, 50; RF = 2; when a node fails, the replicas of its ranges on other nodes remain available]
Consistency, Availability
Consistency
– Can I read stale data?
Availability
– Can I write/read at all?
Tunable Consistency
Consistency
N = Total number of replicas
R = Number of replicas read from (before the response is returned)
W = Number of replicas written to (before the write is considered a success)
W + R > N gives strong consistency
Consistency
W + R > N gives strong consistency
N = 3, W = 2, R = 2
2 + 2 > 3 ==> strongly consistent
Only 2 of the 3 replicas must be available.
Consistency
Tunable Consistency
– Specify N (Replication Factor) per data set
– Specify R, W per operation
– Quorum: N/2 + 1
  • R = W = Quorum
  • Strong consistency
  • Tolerate the loss of N – Quorum replicas
– R, W can also be 1 or N
Availability
Can tolerate the loss of:
– N – R replicas for reads
– N – W replicas for writes
CAP Theorem
During node or network failure, 100% availability and 100% consistency at the same time is not possible; trading one against the other is possible.
[Diagram: Cassandra positioned on the availability/consistency spectrum]
No single point of failure
Replication that works
Scales linearly
– 2x nodes = 2x performance
  • For both reads and writes
– Up to 100's of nodes
– See "Netflix: 1 million writes/sec on AWS"
Operationally simple
Multi-Datacenter Replication
Data Model
Comes from Google BigTable
Goals
– Commodity hardware
  • Spinning disks
– Handle data sets much larger than memory
  • Minimize disk seeks
– High throughput
– Low latency
– Durable
Column Families
Static
– Object data
– Similar to a table in a relational database
Dynamic
– Precomputed query results
– Materialized views
(these are just educational classifications)
Static Column Families
Users column family:
zznate  → password: *, name: Nate
driftx  → password: *, name: Brandon
thobbs  → password: *, name: Tyler
jbellis → password: *, name: Jonathan, site: riptano.com
Rows
– Each row has a unique primary key
– Sorted list of (name, value) tuples
  • Like an ordered hash
– The (name, value) tuple is called a "column"
Dynamic Column Families
Following column family:
[Diagram: row keys are users (zznate, driftx, thobbs, jbellis); each row's columns are the users they follow (driftx:, thobbs:, mdennis:, zznate:, pcmanus:, xedin:), with empty values]
Dynamic Column Families
Other Examples:
– Timeline of tweets by a user
– Timeline of tweets by all of the people a user is following
– List of comments sorted by score
– List of friends grouped by state
The Data API
RPC-based API
– github.com/twitter/cassandra
CQL (Cassandra Query Language)
– code.google.com/a/apache-extras.org/p/cassandra-ruby/
Inserting Data
INSERT INTO users (KEY, "name", "age") VALUES ("thobbs", "Tyler", 24);
Updating Data
Updates are the same as inserts:
INSERT INTO users (KEY, "age") VALUES ("thobbs", 34);
Or:
UPDATE users SET "age" = 34 WHERE KEY = "thobbs";
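The upsert semantics can be modeled with a toy in-memory column family (illustrative Ruby, not Cassandra): a column family is a hash of rows, each row a hash of columns, and an insert over an existing row simply merges in the new columns.

```ruby
# Toy column family: row key => { column name => value }.
users = Hash.new { |h, k| h[k] = {} }

# INSERT and UPDATE are the same operation: merge columns into the row.
def upsert(cf, key, columns)
  cf[key].merge!(columns)
end

upsert(users, "thobbs", "name" => "Tyler", "age" => 24)
upsert(users, "thobbs", "age" => 34)   # "updating" is just inserting again
users["thobbs"]                        # => {"name" => "Tyler", "age" => 34}
```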
Fetching Data
Whole row select:
SELECT * FROM users WHERE KEY = "thobbs";
Fetching Data
Explicit column select:
SELECT "name", "age" FROM users WHERE KEY = "thobbs";
Fetching Data
UPDATE letters SET 1='a', 2='b', 3='c', 4='d', 5='e' WHERE KEY = “key”;
SELECT 1..3 FROM letters WHERE KEY = “key”;
Get a slice of columns
Returns [(1, a), (2, b), (3, c)]
Fetching Data
Get a slice of columns:
SELECT FIRST 2 FROM letters WHERE KEY = "key";
Returns [(1, a), (2, b)]
SELECT FIRST 2 REVERSED FROM letters WHERE KEY = "key";
Returns [(5, e), (4, d)]
Fetching Data
Get a slice of columns:
SELECT 3..'' FROM letters WHERE KEY = "key";
Returns [(3, c), (4, d), (5, e)]
SELECT FIRST 2 REVERSED 4..'' FROM letters WHERE KEY = "key";
Returns [(4, d), (3, c)]
Deleting Data
Delete a whole row:
DELETE FROM users WHERE KEY = "thobbs";
Delete specific columns:
DELETE "age" FROM users WHERE KEY = "thobbs";
Secondary Indexes
Built-in basic indexes:
CREATE INDEX ageIndex ON users (age);
SELECT name FROM users WHERE age = 24 AND state = "TX";
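Conceptually, a secondary index is just another mapping from a column value back to row keys. A toy Ruby sketch of the query above, with hypothetical data (the zznate row values here are made up for illustration):

```ruby
users = {
  "thobbs" => { "age" => 24, "state" => "TX", "name" => "Tyler" },
  "zznate" => { "age" => 30, "state" => "TX", "name" => "Nate" },  # hypothetical values
}

# Build the index: age value => list of row keys with that age.
age_index = Hash.new { |h, k| h[k] = [] }
users.each { |key, row| age_index[row["age"]] << key }

# WHERE age = 24 AND state = "TX": look up the indexed column,
# then filter the candidate rows on the remaining predicate.
hits = age_index[24].select { |k| users[k]["state"] == "TX" }
# => ["thobbs"]
```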
Performance
Writes
– 10k – 30k per second per node
– Sub-millisecond latency
Reads
– 1k – 20k per second per node (depends on data set and caching)
– 0.1 to 10ms latency
Other Features
Distributed Counters
– Can support millions of high-volume counters
Excellent Multi-datacenter Support
– Disaster recovery
– Locality
Hadoop Integration
– Isolation of resources
– Hive and Pig drivers
Compression
What Cassandra Can't Do
Transactions
– Unless you use a distributed lock
– Atomicity, Isolation
– These aren't needed as often as you'd think
Limited support for ad-hoc queries
– Know what you want to do with the data
Not one-size-fits-all
– Use alongside an RDBMS
Problems you shouldn't solve with C*
Prototyping
Distributed Locking
Small datasets
– (When you don't need availability)
Complex graph processing
– Shallow graph queries work well, though
Fundamentally highly relational/transactional data
The sweet spot for Cassandra
Large dataset, low latency queries
Simple to medium complexity queries
– Key/value
– Time series, ordered data
– Lists, sets, maps
High Availability
The sweet spot for Cassandra
Social
– Texts, comments, check-ins, collaboration
Activity
– Feeds, timelines, clickstreams, logs, sensor data
Metrics
– Performance data over time
– CloudKick, DataStax OpsCenter
Text Search
– Inbox search at Facebook
ORMs
Poor integration: ORMs are not a natural fit for Cassandra
– In C*, we mainly care about queries, not objects
– Beyond simple K/V, the abstraction breaks down
Suggestion: don't waste time with an ORM
– C* will only be used for a specific subset of your data/queries
– Use the C* API directly in your model