meetup crash course: cassandra data modelling
Posted on 14-Apr-2017
Melbourne Cassandra Meetup
© 2015 DataStax. Use only with permission.
Crash Course: Cassandra Data Modelling
Erick Ramirez
DataStax Engineering
@flightc
27 August 2015
Welcome
• Modelling crash course
• Forget everything you know
• Informal session
• Please ask me questions
A refresher
• Gossip
• Partitions & hashing
• Replicas & snitches
• Client & coordinator
• Consistency level
A cluster
• Node - a Cassandra instance
• Rack - a logical group of nodes
• DC - a logical group of racks
• Cluster - a full set of nodes
Gossip
• New node gossips with seed nodes
• Happens every second
• Learns about other nodes
• Up/down status
• Node locations
Partitions & hashing
• Data is partitioned
• Partition key is hashed
hash(“DataStax”) = 9b036bd16dbe90073a
hash(“@flightc”) = 1668bf314257609f04
• Partition (token) range is -2^63 to 2^63 - 1
• Each node owns token [range]*
* vnodes = multiple owned tokens
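The hashing steps above can be sketched roughly as follows. This is an illustration only: MD5 stands in for the real partitioner hash (Cassandra's default partitioner uses Murmur3), and the four-node ring with evenly spaced tokens is hypothetical.

```python
import bisect
import hashlib

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token_for(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Assumption: MD5 is only a stand-in hash for this sketch.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def owner_of(token: int, ring: list) -> str:
    """Return the node that owns `token`.

    `ring` is a sorted list of (token, node) pairs. Each node owns the
    range that ends at its own token, so we look for the first node token
    that is >= the key's token and wrap around past the end of the ring.
    """
    tokens = [t for t, _ in ring]
    index = bisect.bisect_left(tokens, token)
    return ring[index % len(ring)][1]


# Hypothetical 4-node ring, one token per node (no vnodes).
RING = sorted((MIN_TOKEN + i * (2**64 // 4), f"node{i + 1}") for i in range(4))

for key in ("DataStax", "@flightc"):
    print(key, token_for(key), owner_of(token_for(key), RING))
```

With vnodes, each node would simply appear several times in the ring with different tokens.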
Replicas & snitches
• A replica is a copy of a partition
• 1st replica is the token owner
• Next replica is the “next” node in the ring
• A snitch tells the replication strategy the topology
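A minimal sketch of the "next node" placement described above, assuming a single DC and ignoring racks. The ring and its small token values are hypothetical; a topology-aware strategy would also use the rack and DC information that the snitch provides.

```python
import bisect


def replicas_for(token, ring, replication_factor):
    """Pick replicas by walking the ring clockwise from the token owner.

    `ring` is a sorted list of (token, node) pairs. The first replica is
    the token owner; each further replica is the next distinct node.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token)
    replicas = []
    for offset in range(len(ring)):
        node = ring[(start + offset) % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas


# Hypothetical ring with small tokens for readability.
RING = [(-100, "node1"), (-50, "node2"), (0, "node3"), (50, "node4")]

print(replicas_for(-60, RING, 3))
```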
Client & coordinator
• C* driver (client) chooses node
  - seed nodes
  - load-balancing policy
• Chosen node for request is coordinator
• Coordinator forwards the request to replicas according to the replication factor
• Each write is timestamped
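Write timestamps matter because they decide which value wins when replicas disagree. A simplified sketch of that "last write wins" rule follows; the `Cell` class and `reconcile` function are hypothetical names, and tie-breaking on equal timestamps is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    write_timestamp: int  # microseconds since the epoch


def reconcile(cells):
    """Return the cell with the highest write timestamp (last write wins)."""
    return max(cells, key=lambda cell: cell.write_timestamp)


# Two replicas hold different versions of the same column value.
older = Cell(value="old@example.com", write_timestamp=1_440_000_000_000_000)
newer = Cell(value="new@example.com", write_timestamp=1_440_000_000_500_000)
print(reconcile([older, newer]).value)
```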
Consistency level
• Number of replicas which must acknowledge a read or write
• Can vary per request
• Possible CLs include: ANY (writes only), ONE, QUORUM, LOCAL_QUORUM, ALL
• For writes, data is written to disk (commitlog)
• For reads, nodes send most recent copy of data
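As a worked example of the arithmetic behind consistency levels: a quorum is a majority of replicas, and a read is guaranteed to see the latest write when the read and write replica sets must overlap. The function names below are hypothetical, and only single-DC levels are covered (ANY and LOCAL_QUORUM are omitted).

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge a request."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return quorum(replication_factor)
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"level not covered by this sketch: {consistency_level}")


def read_overlaps_write(read_cl: str, write_cl: str, replication_factor: int) -> bool:
    """True when every read set must share a replica with every write set."""
    reads = required_acks(read_cl, replication_factor)
    writes = required_acks(write_cl, replication_factor)
    return reads + writes > replication_factor


print(quorum(3), read_overlaps_write("QUORUM", "QUORUM", 3))
```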
Modelling Cassandra
• CQL
• Tables & column families
• Rows & partitions
Modelling is a science
• Use tested methodologies
• Predictable results
Modelling is an art
• Sometimes, you need to improvise
• Massage schema to optimise
Data Modelling
• Collect & analyse data requirements
• Identify entities & relationships
• Identify queries
• Design schema
• Optimise!
Goals
• Very fast queries
• De-normalise
• Nest data
• Duplicate data
• Query-driven model
Modelling Cassandra
• Use Cassandra Query Language (CQL)
• Similar SQL-like approach
• DDL - CREATE, ALTER, DROP
• DML - SELECT, INSERT, UPDATE, DELETE
CREATE TABLE users (
  userid uuid,
  name text,
  email text,
  PRIMARY KEY (userid)
);
Tables & column families
• Table is a two-dimensional view of data
• A set of rows with a similar structure
• Table schema defines a set of columns and a primary key
• PK is a sequence of columns which uniquely identify a row
• Column family is a multi-dimensional data structure
• Rows are organised into partitions
• A partition has 1 or more rows
• Partition key is part of primary key used to uniquely identify a partition
Example - Table with single-row partitions
Example - Table with multi-row partitions
Keys, composites & clustering columns
• A simple partition key
PRIMARY KEY ( userid )
• Composite partition key
PRIMARY KEY ( (album_name, year) )
• Simple partition key with clustering columns
PRIMARY KEY ( userid, name, email )
• Composite partition key with clustering columns
PRIMARY KEY ( (album_name, year), title )
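To make the roles of the two key parts concrete, here is a small in-memory model (not how Cassandra stores data on disk): the partition key selects a partition, and the clustering columns identify and order rows inside it. The `Table` class and the sample album data are hypothetical.

```python
from collections import defaultdict


class Table:
    """Toy model of PRIMARY KEY ( (partition key...), clustering columns... )."""

    def __init__(self, partition_key, clustering_columns=()):
        self.partition_key = tuple(partition_key)
        self.clustering_columns = tuple(clustering_columns)
        # partition key values -> {clustering key values -> row}
        self.partitions = defaultdict(dict)

    def insert(self, row):
        pk = tuple(row[column] for column in self.partition_key)
        ck = tuple(row[column] for column in self.clustering_columns)
        # Same full primary key means same row, so this is an upsert.
        self.partitions[pk][ck] = row

    def read_partition(self, *pk_values):
        rows = self.partitions.get(pk_values, {})
        # Rows come back in clustering-column order.
        return [rows[ck] for ck in sorted(rows)]


# PRIMARY KEY ( (album_name, year), title )
tracks = Table(partition_key=("album_name", "year"), clustering_columns=("title",))
tracks.insert({"album_name": "Album X", "year": 2015, "title": "Track B"})
tracks.insert({"album_name": "Album X", "year": 2015, "title": "Track A"})
print([row["title"] for row in tracks.read_partition("Album X", 2015)])
```

With no clustering columns, every partition in this model holds exactly one row, which corresponds to the single-row partition case.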
Examples
Composite partition key
Composite partition key with clustering columns
Column families
• Distributed
• Sparse
Storage
[Figure labels: FAST SCAN, SLOW SCAN]
Physical storage layout
On-disk layout to 2D representation
Sizes
• Column family size is limited only by the size of the cluster
• Linear scaling - partitions are distributed
• Largest partition must fit on disk on a single node
• A single partition does not span multiple nodes
• Max cells per partition is 2 billion
• Max data size per cell (column value) is 2 GB
Query-driven modelling
• Find all performers and albums for a given track title
CREATE TABLE albums_by_track (
  track_title TEXT,
  performer TEXT,
  year INT,
  album_title TEXT,
  PRIMARY KEY ( track_title, performer, year, album_title )
);
• Find performer, genre & titles for a given album title & year
CREATE TABLE tracks_by_album (
  album_title TEXT,
  year INT,
  performer TEXT,
  genre TEXT,
  number INT,
  track_title TEXT,
  PRIMARY KEY ( (album_title, year), number )
);
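The idea behind these two tables can be sketched with plain dictionaries: the same track is written twice, once per query, so that each query reads exactly one partition. This is an in-memory illustration only, and the helper function names are hypothetical.

```python
from collections import defaultdict

# One structure per query: partition key -> {clustering key -> row}.
albums_by_track = defaultdict(dict)  # partition key: track_title
tracks_by_album = defaultdict(dict)  # partition key: (album_title, year)


def write_track(album_title, year, performer, genre, number, track_title):
    """Write the same fact to both tables (de-normalise and duplicate)."""
    albums_by_track[track_title][(performer, year, album_title)] = {
        "track_title": track_title, "performer": performer,
        "year": year, "album_title": album_title,
    }
    tracks_by_album[(album_title, year)][(number,)] = {
        "album_title": album_title, "year": year, "performer": performer,
        "genre": genre, "number": number, "track_title": track_title,
    }


def find_albums_for_track(track_title):
    """Query 1: all performers and albums for a given track title."""
    rows = albums_by_track.get(track_title, {})
    return [rows[key] for key in sorted(rows)]


def find_tracks_for_album(album_title, year):
    """Query 2: performer, genre and titles for a given album title and year."""
    rows = tracks_by_album.get((album_title, year), {})
    return [rows[key] for key in sorted(rows)]
```

Each lookup touches a single dictionary entry, and rows come back sorted by the clustering columns, which mirrors the "one table per query" approach.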
Partition per query
• Most efficient access pattern
• Query accesses only 1 partition
• Partition can be 1 or more rows
Partition+ per query
• Less efficient
• Not necessarily bad
• Query accesses 1+ partitions
Table scan, multi-table
• Not efficient at all - avoid!
• Query accesses all partitions in one or more tables
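The three access patterns (partition per query, partition+ per query, table scan) differ mainly in how many partitions a query has to touch. A toy comparison, using a dictionary as the table and hypothetical user data:

```python
def read_one_partition(table, key):
    """Partition per query: touches exactly 1 partition."""
    return list(table.get(key, [])), 1


def read_many_partitions(table, keys):
    """Partition+ per query: touches one partition per requested key."""
    rows = [row for key in keys for row in table.get(key, [])]
    return rows, len(keys)


def scan_table(table, predicate):
    """Table scan: touches every partition, whatever the result size."""
    rows = [row for partition in table.values() for row in partition if predicate(row)]
    return rows, len(table)


# Hypothetical table: partition key -> rows in that partition.
USERS = {
    "user1": [{"userid": "user1", "city": "Melbourne"}],
    "user2": [{"userid": "user2", "city": "Sydney"}],
    "user3": [{"userid": "user3", "city": "Melbourne"}],
}

print(read_one_partition(USERS, "user1")[1])
print(read_many_partitions(USERS, ["user1", "user2"])[1])
print(scan_table(USERS, lambda row: row["city"] == "Melbourne")[1])
```

In a real cluster the cost gap is larger than this suggests, since the partitions of a scanned table are spread across nodes.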
Nest data
• More efficient to get to partition and iterate through rows
Duplicate data
• Better than doing an expensive join
• Results are pre-computed & materialised
Query-driven model
• Each query has a corresponding table
• Tables are optimised for queries
• Tables return data in correct order
This is the beginning
Get trained
• Free instructor-led courses
• Free self-paced learning
• Free online resources
• Go to academy.datastax.com
Cassandra Summit 2015
• 5 reasons to join me in SF buff.ly/1JHl6Kw
• September 22-24
• Free general passes still available!