![Page 1: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/1.jpg)
Learning Cassandra
Dave Gardner (@davegardnerisme)
![Page 2: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/2.jpg)
What I’m going to cover
• How to NoSQL
• Cassandra basics (dynamo and big table)
• How to use the data model in real life
![Page 3: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/3.jpg)
How to NoSQL
1. Find data store that doesn’t use SQL
2. Anything
3. Cram all the things into it
4. Triumphantly blog this success
5. Complain a month later when it bursts into flames
http://www.slideshare.net/rbranson/how-do-i-cassandra/4
![Page 4: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/4.jpg)
Choosing NoSQL
“NoSQL DBs trade off traditional features to better support new and emerging use cases”
http://www.slideshare.net/argv0/riak-use-cases-dissecting-the-solutions-to-hard-problems
![Page 5: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/5.jpg)
Choosing Cassandra: Tradeoffs
More widely used, tested and documented software
(MySQL: first OS release 1998)
For a relatively immature product
(Cassandra: first open-sourced in 2008)
![Page 6: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/6.jpg)
Choosing Cassandra: Tradeoffs
Ad-hoc querying
(SQL: join, group by, having, order)
For a rich data model with limited ad-hoc querying ability
(Cassandra makes you denormalise)
![Page 7: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/7.jpg)
Choosing NoSQL
“they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.”
Benjamin Black – NoSQL Tapes (at 30:15)
http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
![Page 8: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/8.jpg)
What do we get in return?
Proven horizontal scalability
Cassandra scales reads and writes linearly as new nodes are added
![Page 9: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/9.jpg)
Netflix benchmark: linear scaling
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
![Page 10: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/10.jpg)
What do we get in return?
High availability
Cassandra is fault-resistant with tunable consistency levels
![Page 11: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/11.jpg)
What do we get in return?
Low latency, solid performance
Cassandra has very good write performance
![Page 12: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/12.jpg)
Performance benchmark *
http://blog.cubrid.org/dev-platform/nosql-benchmarking/
* Add a pinch of salt
![Page 13: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/13.jpg)
What do we get in return?
Operational simplicity
Homogeneous cluster, no “master” node, no SPOF
![Page 14: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/14.jpg)
What do we get in return?
Rich data model
Cassandra is more than simple key-value – columns, composites, counters, secondary indexes
![Page 15: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/15.jpg)
How to NoSQL version 2
Learn about each solution
• What tradeoffs are you making?
• How is it designed?
• What algorithms does it use?
http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
![Page 16: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/16.jpg)
Amazon Dynamo + Google Big Table
From Dynamo:
• Consistent hashing
• Vector clocks *
• Gossip protocol
• Hinted handoff
• Read repair
http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

From Big Table:
• Columnar
• SSTable storage
• Append-only
• Memtable
• Compaction
http://labs.google.com/papers/bigtable-osdi06.pdf

* not in Cassandra
![Page 17: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/17.jpg)
The dynamo paper
[Diagram: six nodes, #1 to #6, and a client]
Tokens are integers from 0 to 2^127
![Page 18: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/18.jpg)
The dynamo paper
[Diagram: six nodes, #1 to #6; the client talks to a coordinator node]
Consistent hashing
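
The token ring idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cassandra's actual code: the `Ring` class, the evenly spaced tokens, the use of MD5 to derive a token, and the "next nodes clockwise" replica placement are all choices made for this example.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 127  # token range quoted on the slide


def token_for(key: str) -> int:
    """Hash a row key to a token (MD5 is an assumption for this sketch)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


class Ring:
    """Hypothetical ring: each node owns the token range ending at its token."""

    def __init__(self, node_names):
        count = len(node_names)
        self.nodes = list(node_names)
        # Evenly spaced tokens, one per node, in ascending order.
        self.tokens = [i * TOKEN_SPACE // count for i in range(count)]

    def replicas(self, key, replication_factor=3):
        """First node at or after the key's token, then the next RF - 1 nodes."""
        index = bisect.bisect_left(self.tokens, token_for(key))
        start = index % len(self.nodes)  # wrap around past the last token
        return [
            self.nodes[(start + i) % len(self.nodes)]
            for i in range(replication_factor)
        ]


ring = Ring(["#1", "#2", "#3", "#4", "#5", "#6"])
print(ring.replicas("f97be9cc-5255-4578-8813-76701c0945bd"))
```

Any node can run this calculation, which is why any node can act as coordinator for a request.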
![Page 19: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/19.jpg)
Consistency levels
How many replicas must respond to declare success?
![Page 20: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/20.jpg)
Consistency levels: read operations

| Level | Description |
|---|---|
| ONE | First response |
| QUORUM | N/2 + 1 replicas |
| LOCAL_QUORUM | N/2 + 1 replicas in local data centre |
| EACH_QUORUM | N/2 + 1 replicas in each data centre |
| ALL | All replicas |

http://wiki.apache.org/cassandra/API#Read
![Page 21: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/21.jpg)
Consistency levels: write operations

| Level | Description |
|---|---|
| ANY | One node, including hinted handoff |
| ONE | One node |
| QUORUM | N/2 + 1 replicas |
| LOCAL_QUORUM | N/2 + 1 replicas in local data centre |
| EACH_QUORUM | N/2 + 1 replicas in each data centre |
| ALL | All replicas |

http://wiki.apache.org/cassandra/API#Write
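
The quorum arithmetic is small enough to check by hand. The sketch below assumes N in the tables is the replication factor and uses integer division; the helper names are made up for this example. It also shows the usual rule of thumb that a read is guaranteed to overlap the latest write when read replicas plus write replicas exceed N.

```python
def quorum(replication_factor: int) -> int:
    """N/2 + 1, using integer division."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(quorum(rf))  # 2
print(read_overlaps_write(rf, quorum(rf), quorum(rf)))  # True: QUORUM writes + QUORUM reads
print(read_overlaps_write(rf, 1, 1))  # False: ONE + ONE may miss the latest write
```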
![Page 22: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/22.jpg)
The dynamo paper
RF = 3, CL = ONE
[Diagram: six nodes, #1 to #6; the client talks to a coordinator node]
![Page 23: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/23.jpg)
The dynamo paper
RF = 3, CL = QUORUM
[Diagram: six nodes, #1 to #6; the client talks to a coordinator node]
![Page 24: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/24.jpg)
The dynamo paper
RF = 3, CL = ONE
[Diagram: six nodes, #1 to #6; the client talks to a coordinator node, which also stores a hint]
![Page 25: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/25.jpg)
The dynamo paper
RF = 3, CL = ONE
[Diagram: six nodes, #1 to #6; the client talks to a coordinator node]
Read repair
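
Read repair can be pictured as follows: the coordinator compares the versions held by the replicas, answers with the newest, and pushes that version back to any replica that is stale or missing it. This is a simplified sketch with replicas modelled as plain dictionaries; `read_with_repair` is a hypothetical helper, and it repairs inline for simplicity, whereas a real coordinator at CL = ONE can answer the client before the repair completes.

```python
def read_with_repair(replicas, row_key, column):
    """Return the newest value and bring stale replicas up to date.

    Each replica is a dict mapping (row_key, column) -> (value, timestamp).
    """
    responses = [replica.get((row_key, column)) for replica in replicas]
    present = [response for response in responses if response is not None]
    if not present:
        return None
    newest = max(present, key=lambda value_and_ts: value_and_ts[1])
    for replica, response in zip(replicas, responses):
        if response != newest:
            # Stale or missing: write the winning version back.
            replica[(row_key, column)] = newest
    return newest[0]


node_a = {("user-1", "foo"): ("zing", 1001)}
node_b = {("user-1", "foo"): ("bar", 1000)}  # stale
node_c = {}                                  # missed the write entirely
print(read_with_repair([node_a, node_b, node_c], "user-1", "foo"))
```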
![Page 26: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/26.jpg)
The big table paper
• Sparse "columnar" data model
• SSTable disk storage
• Append-only commit log
• Memtable (buffer and sort)
• Immutable SSTable files
• Compaction
http://labs.google.com/papers/bigtable-osdi06.pdf
http://www.slideshare.net/geminimobile/bigtable-4820829
![Page 27: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/27.jpg)
The big table paper
[Diagram: a column is a name and a value, plus a timestamp]
![Page 28: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/28.jpg)
The big table paper
[Diagram: several columns side by side, each a name and a value]
We can have millions of columns *
* theoretically up to 2 billion
![Page 29: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/29.jpg)
The big table paper
[Diagram: a row is a row key followed by its columns, each a name and a value]
![Page 30: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/30.jpg)
The big table paper
[Diagram: a column family holds many rows, each a row key followed by its columns]
We can have billions of rows
![Page 31: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/31.jpg)
The big table paper
[Diagram: a write goes to the commit log (disk) and the memtable (memory); the memtable is flushed on a time/size trigger to SSTables on disk, which are immutable]
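
The write path in the diagram can be modelled as a toy store. This is a sketch only: `ColumnFamilyStore`, its size-based flush threshold, and the in-memory lists standing in for files are assumptions made for illustration. It also shows why compaction matters for reads: before compaction a read may have to look in several SSTables, afterwards only one.

```python
class ColumnFamilyStore:
    """Toy model: commit log + memtable in front of immutable sorted tables."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only; a list stands in for a file
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # immutable sorted tuples, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # sequential append for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Sort once, then never modify the resulting table again.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest table first
            for stored_key, value in sstable:
                if stored_key == key:
                    return value
        return None

    def compact(self):
        merged = {}
        for sstable in self.sstables:  # oldest first, so newer values win
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))]


store = ColumnFamilyStore()
for key, value in [("a", 1), ("b", 2), ("c", 3), ("a", 9), ("d", 4), ("e", 5)]:
    store.write(key, value)
store.compact()
print(store.read("a"), len(store.sstables))
```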
![Page 32: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/32.jpg)
Data model basics: conflict resolution
Per-column timestamp-based conflict resolution
http://cassandra.apache.org/
{ column: foo, value: bar, timestamp: 1000}
{ column: foo, value: zing, timestamp: 1001}
![Page 33: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/33.jpg)
Data model basics: conflict resolution
Per-column timestamp-based conflict resolution
http://cassandra.apache.org/
{ column: foo, value: bar, timestamp: 1000}
{ column: foo, value: zing, timestamp: 1001}
bigger timestamp
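
A minimal sketch of the rule, assuming each column version is a dictionary shaped like the ones on the slide. `merge_row` is a hypothetical helper; on equal timestamps it simply keeps the version it saw first, which is a simplification.

```python
def merge_row(left, right):
    """Merge two views of a row, keeping the bigger timestamp per column."""
    merged = dict(left)
    for name, column in right.items():
        current = merged.get(name)
        if current is None or column["timestamp"] > current["timestamp"]:
            merged[name] = column
    return merged


replica_1 = {"foo": {"value": "bar", "timestamp": 1000}}
replica_2 = {"foo": {"value": "zing", "timestamp": 1001}}
print(merge_row(replica_1, replica_2)["foo"]["value"])  # zing
```

Because the comparison is per column, two clients updating different columns of the same row never overwrite each other.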
![Page 34: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/34.jpg)
Data model basics: column ordering
Columns ordered at time of writing, according to Column Family schema
http://cassandra.apache.org/
{ column: zebra, value: foo, timestamp: 1000}
{ column: badger, value: foo, timestamp: 1001}
![Page 35: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/35.jpg)
Data model basics: column ordering
Columns ordered at time of writing, according to Column Family schema
http://cassandra.apache.org/
{ badger: foo, zebra: foo}
with AsciiType column schema
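
The effect of ordering at write time can be imitated with a sorted list. The `Row` class below is a hypothetical stand-in for a row in a column family with an ASCII comparator; plain Python string comparison is used as the ordering.

```python
import bisect


class Row:
    """Keeps column names sorted as they are inserted."""

    def __init__(self):
        self.names = []    # always sorted
        self.values = {}

    def insert(self, name, value):
        if name not in self.values:
            bisect.insort(self.names, name)  # place the name in sorted position
        self.values[name] = value

    def columns(self):
        return [(name, self.values[name]) for name in self.names]


row = Row()
row.insert("zebra", "foo")   # written first
row.insert("badger", "foo")  # written second, stored first
print(row.columns())  # [('badger', 'foo'), ('zebra', 'foo')]
```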
![Page 36: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/36.jpg)
Key point
Each “query” can be answered from a single slice of disk
(once compaction has finished)
![Page 37: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/37.jpg)
Data modeling – 1000ft introduction
• Start from your queries and work backwards
• Denormalise in the application(store data more than once)
http://www.slideshare.net/mattdennis/cassandra-data-modeling
http://blip.tv/datastax/data-modeling-workshop-5496906
![Page 38: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/38.jpg)
Pattern 1: not using the value
Storing that user X is in bucket Y
Row key: f97be9cc-5255-457…
Column name: foo
Value: 1 (we don’t really care about this)
https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/add.php#L53-58
![Page 39: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/39.jpg)
Pattern 1: not using the value
Q: is user X in bucket foo?
f97be9cc-5255-4578-8813-76701c0945bd: { bar: 1, foo: 1 }
06a6f1b0-fcf2-41d9-8949-fe2d416bde8e: { baz: 1, zoo: 1 }
503778bc-246f-4041-ac5a-fd944176b26d: { aaa: 1 }
A: single column fetch
![Page 40: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/40.jpg)
Pattern 1: not using the value
Q: which buckets is user X in?
f97be9cc-5255-4578-8813-76701c0945bd: { bar: 1, foo: 1 }
06a6f1b0-fcf2-41d9-8949-fe2d416bde8e: { baz: 1, zoo: 1 }
503778bc-246f-4041-ac5a-fd944176b26d: { aaa: 1 }
A: column slice fetch
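
Both questions map onto simple lookups when the column name carries the information. The sketch below models the `users` column family as nested dictionaries with the data from the slides; the function names are made up for this example and no Cassandra client is involved.

```python
users = {
    "f97be9cc-5255-4578-8813-76701c0945bd": {"bar": 1, "foo": 1},
    "06a6f1b0-fcf2-41d9-8949-fe2d416bde8e": {"baz": 1, "zoo": 1},
    "503778bc-246f-4041-ac5a-fd944176b26d": {"aaa": 1},
}


def in_bucket(user_id, bucket):
    """Single column fetch: does the column exist? The value is ignored."""
    return bucket in users.get(user_id, {})


def buckets_for(user_id):
    """Column slice fetch: all column names in the row, in sorted order."""
    return sorted(users.get(user_id, {}))


print(in_bucket("f97be9cc-5255-4578-8813-76701c0945bd", "foo"))  # True
print(buckets_for("06a6f1b0-fcf2-41d9-8949-fe2d416bde8e"))       # ['baz', 'zoo']
```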
![Page 41: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/41.jpg)
Pattern 1: not using the value
We could also use expiring columns to automatically delete columns N seconds after insertion
UPDATE users USING TTL = 3600
SET 'foo' = 1
WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'
![Page 42: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/42.jpg)
Pattern 2: counters
Real-time analytics to count clicks/impressions of ads in hourly buckets
Row key: 1
Column name: 2011103015-click
Value: 34
https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/adClick.php
![Page 43: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/43.jpg)
Pattern 2: counters
Increment by 1 using CQL
UPDATE ads
SET '2011103015-impression' = '2011103015-impression' + 1
WHERE KEY = '1'
![Page 44: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/44.jpg)
Pattern 2: counters
Q: how many clicks/impressions for ad 1 over time range?
Row key: 1
2011103015-click: 1
2011103015-impression: 3434
2011103016-click: 12
2011103016-impression: 5411
2011103017-click: 2
2011103017-impression: 345
A: column slice fetch, between column X and Y
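
Because the hour is the prefix of each column name and columns are stored sorted, a time range becomes a contiguous slice. The sketch below reproduces that with a sorted list and `bisect`, using the numbers from the slide; `column_slice` and `total_between` are hypothetical helpers, and the trailing `~` is just a character that sorts after the `-click` and `-impression` suffixes in ASCII.

```python
import bisect

ad_1 = {
    "2011103015-click": 1,
    "2011103015-impression": 3434,
    "2011103016-click": 12,
    "2011103016-impression": 5411,
    "2011103017-click": 2,
    "2011103017-impression": 345,
}


def column_slice(row, start, finish):
    """Return (name, value) pairs with start <= name <= finish, in order."""
    names = sorted(row)
    low = bisect.bisect_left(names, start)
    high = bisect.bisect_right(names, finish)
    return [(name, row[name]) for name in names[low:high]]


def total_between(row, first_hour, last_hour, kind):
    """Sum one kind of counter ('click' or 'impression') over an hour range."""
    columns = column_slice(row, first_hour, last_hour + "~")
    return sum(value for name, value in columns if name.endswith("-" + kind))


print(total_between(ad_1, "2011103015", "2011103016", "click"))       # 13
print(total_between(ad_1, "2011103015", "2011103016", "impression"))  # 8845
```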
![Page 45: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/45.jpg)
Pattern 3: time series
Store canonical reference of impressions and clicks
Row key: 20111030
Column name: <time UUID>
Value: {json}
http://rubyscale.com/2011/basic-time-series-with-cassandra/
Cassandra can order columns by time
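
A rough sketch of the layout: one row per day, one column per event, with a version 1 (time-based) UUID as the column name and JSON as the value. The `time_uuid`, `record` and `events_in_order` helpers and the in-memory `store` dictionary are assumptions for illustration; a real client library would normally generate the time UUID, and the comparator on the column family would do the sorting.

```python
import json
import uuid
from datetime import datetime, timezone

# 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def time_uuid(moment, clock_seq=0, node=0):
    """Build a version 1 UUID whose timestamp is taken from `moment`."""
    ts = int(moment.timestamp() * 10_000_000) + UUID_EPOCH_OFFSET
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000    # version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


def record(store, moment, event):
    """Store the event under the day's row key, keyed by a time UUID."""
    row_key = moment.strftime("%Y%m%d")
    store.setdefault(row_key, {})[time_uuid(moment)] = json.dumps(event)


def events_in_order(store, row_key):
    """Return the day's events sorted by the time embedded in the UUID."""
    row = store.get(row_key, {})
    return [json.loads(row[name]) for name in sorted(row, key=lambda u: u.time)]


store = {}
record(store, datetime(2011, 10, 30, 10, 0, tzinfo=timezone.utc), {"type": "click"})
record(store, datetime(2011, 10, 30, 9, 0, tzinfo=timezone.utc), {"type": "impression"})
print(events_in_order(store, "20111030"))
```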
![Page 46: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/46.jpg)
Pattern 4: object properties as columns
Store user properties such as name, email, etc.
Row key: f97be9cc-5255-457…
Column name: name
Value: Bob Foo-Bar
http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
![Page 47: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/47.jpg)
Anti-pattern 1: read-before-write
Instead store as independent columns and mutate individually
(see pattern 4)
![Page 48: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/48.jpg)
Anti-pattern 2: super columns
Friends don’t let friends use super columns.
http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
![Page 49: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/49.jpg)
Anti-pattern 3: OPP
The Order Preserving Partitioner unbalances your load and makes your life harder
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
![Page 50: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/50.jpg)
Recap: Data modeling
• Think about the queries, work backwards
• Don’t overuse single rows; try to spread the load
• Don’t use super columns
• Ask on IRC! #cassandra
![Page 51: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/51.jpg)
There’s more: Brisk
Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra
DataStax offer this functionality in their “Enterprise” product
http://www.datastax.com/products/enterprise
![Page 52: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/52.jpg)
Hive: SQL-like interface to Hadoop
CREATE EXTERNAL TABLE tempUsers
  (userUuid string, segmentId string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
  "cassandra.columns.mapping" = ":key,:column,:value",
  "cassandra.cf.name" = "users"
);

SELECT segmentId, count(1) AS total
FROM tempUsers
GROUP BY segmentId
ORDER BY total DESC;
![Page 53: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/53.jpg)
In conclusion
Cassandra is founded on sound design principles
![Page 54: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/54.jpg)
In conclusion
The data model is incredibly powerful
![Page 55: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/55.jpg)
In conclusion
CQL and a new breed of clients are making it easier to use
![Page 56: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/56.jpg)
In conclusion
Hadoop integration means we can analyse data directly from a Cassandra cluster
![Page 57: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/57.jpg)
In conclusion
There is a strong community and multiple companies offering professional support
![Page 58: Learning Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061218/54b7a2d94a795998738b46f9/html5/thumbnails/58.jpg)
Thanks
Learn more about Cassandra
meetup.com/Cassandra-London
Sample ad-targeting project on GitHub
https://github.com/davegardnerisme/we-have-your-kidneys
Watch videos from Cassandra SF 2011
http://www.datastax.com/events/cassandrasf2011/presentations
looking for a job?