meetup crash course: cassandra data modelling

Melbourne Cassandra Meetup © 2015 DataStax. Use only with permission. Crash Course : Cassandra Data Modelling Erick Ramirez DataStax Engineering @flightc 27 August 2015

Melbourne Cassandra Meetup

© 2015 DataStax. Use only with permission.

Crash Course: Cassandra Data Modelling

Erick Ramirez

DataStax Engineering

@flightc

27 August 2015

Welcome

• Modelling crash course
• Forget everything you know
• Informal session
• Please ask me questions

A refresher

• Gossip
• Partitions & hashing
• Replicas & snitches
• Client & coordinator
• Consistency level

A cluster

• Node - a Cassandra instance
• Rack - a logical group of nodes
• DC - a logical group of racks
• Cluster - a full set of nodes

Gossip

• New node gossips with seed nodes
• Happens every second
• Learns about other nodes
• Up/down status
• Node locations

Partitions & hashing

• Data is partitioned
• Partition key is hashed

hash(“DataStax”) = 9b036bd16dbe90073a
hash(“@flightc”) = 1668bf314257609f04

• Partition range is -2^63 to 2^63 - 1
• Each node owns token [range]*

* vnodes = multiple owned tokens
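
The hashing and token ownership described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cassandra's implementation: SHA-256 from the standard library stands in for the Murmur3 hash used by Cassandra's default partitioner, and the three-node `RING`, the node names, and the helper names `token_for` and `owner_of` are all hypothetical.

```python
import bisect
import hashlib

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

# Hypothetical three-node ring: (token, node name), sorted by token.
RING = [
    (-6 * 10**18, "node1"),
    (0, "node2"),
    (6 * 10**18, "node3"),
]


def token_for(partition_key):
    """Hash a partition key to a signed 64-bit token.

    Assumption: SHA-256 stands in for Murmur3 purely for illustration.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def owner_of(token, ring):
    """Return the node that owns `token`.

    A node owns the range ending at its own token, so the owner is the
    first node whose token is >= the key's token. Tokens past the last
    node wrap around to the first node.
    """
    tokens = [t for t, _ in ring]
    index = bisect.bisect_left(tokens, token) % len(ring)
    return ring[index][1]


for key in ("DataStax", "@flightc"):
    token = token_for(key)
    print(key, token, owner_of(token, RING))
```

With vnodes, each node would simply appear several times in the ring with different tokens; the lookup stays the same.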

Replicas & snitches

• A replica is a copy of a partition
• 1st replica is token owner
• Next replica is “next” node
• A snitch tells Cassandra the topology, which is used to place replicas
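
The "token owner first, then the next node" rule can be sketched as a walk around the ring. This is a simplified illustration that ignores racks and data centres entirely (in other words, it ignores what the snitch reports); the four-node `RING` and the helper name `replicas_for` are hypothetical.

```python
import bisect

# Hypothetical four-node ring: (token, node name), sorted by token.
RING = [
    (-6 * 10**18, "node1"),
    (-2 * 10**18, "node2"),
    (2 * 10**18, "node3"),
    (6 * 10**18, "node4"),
]


def replicas_for(token, ring, replication_factor):
    """Return the nodes holding replicas for `token`.

    The first replica is the token owner; each further replica is the
    next distinct node clockwise on the ring. Topology is not considered.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    replicas = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas


print(replicas_for(0, RING, 3))
```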

Client & coordinator

• C* driver (client) chooses node
  - seed nodes
  - load-balancing policy
• Chosen node for request is coordinator
• Coordinator manages replication factor
• Each write is timestamped
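
Timestamps matter because they decide which value survives when several writes hit the same cell. A minimal sketch of that "newest timestamp wins" idea follows; the `Cell` class and `reconcile` function are hypothetical names for illustration, and the sketch does not model how ties between equal timestamps are broken.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    write_timestamp: int  # e.g. microseconds since the epoch


def reconcile(cells):
    """Pick the cell with the highest write timestamp."""
    return max(cells, key=lambda cell: cell.write_timestamp)


# Hypothetical versions of one cell, written at different times.
versions = [
    Cell("old@example.com", 1_000),
    Cell("new@example.com", 2_000),
]
print(reconcile(versions).value)
```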

Consistency level

• Number of replicas which must acknowledge a read or write
• Can vary per request
• Possible CLs include ANY (writes only), ONE, QUORUM, LOCAL_QUORUM, ALL
• For writes, data is written to disk (commitlog)
• For reads, nodes send most recent copy of data
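
The arithmetic behind consistency levels is small enough to sketch. The example below assumes a single data centre and covers only ONE, QUORUM and ALL; the function names are hypothetical. It also shows the usual rule of thumb that a read is guaranteed to overlap the latest write when read acknowledgements plus write acknowledgements exceed the replication factor.

```python
def quorum(replication_factor):
    """A quorum is a majority of the replicas."""
    return replication_factor // 2 + 1


def required_acks(consistency_level, replication_factor):
    """Replicas that must acknowledge a request at the given level."""
    levels = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    if consistency_level not in levels:
        raise ValueError(f"level not modelled in this sketch: {consistency_level}")
    return levels[consistency_level]


def read_overlaps_write(read_cl, write_cl, replication_factor):
    """True when every read is certain to reach a replica that took the write."""
    reads = required_acks(read_cl, replication_factor)
    writes = required_acks(write_cl, replication_factor)
    return reads + writes > replication_factor


print(quorum(3), read_overlaps_write("QUORUM", "QUORUM", 3))
```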

Modelling Cassandra

• CQL
• Tables & column families
• Rows & partitions

Modelling is a science

• Use tested methodologies
• Predictable results

Modelling is an art

• Sometimes, you need to improvise

• Massage schema to optimise

Data Modelling

• Collect & analyse data requirements
• Identify entities & relationships
• Identify queries
• Design schema
• Optimise!

Goals

• Very fast queries
• De-normalise
• Nest data
• Duplicate data
• Query-driven model

Modelling Cassandra

• Use Cassandra Query Language (CQL)
• SQL-like approach
• DDL - CREATE, ALTER, DROP
• DML - SELECT, INSERT, UPDATE, DELETE

CREATE TABLE users (
    userid uuid,
    name text,
    email text,
    PRIMARY KEY (userid)
);

Tables & column families

• Table is a two-dimensional view of data

• A set of rows with a similar structure

• Table schema defines a set of columns and a primary key

• PK is a sequence of columns which uniquely identify a row

• Column family is a multi-dimensional data structure

• Rows are organised into partitions

• A partition has 1 or more rows

• Partition key is the part of the primary key used to uniquely identify a partition

Example - Table with single-row partitions

Example - Table with multi-row partitions

Keys, composites & clustering columns

• A simple partition key
  PRIMARY KEY ( userid )

• Composite partition key
  PRIMARY KEY ( (album_name, year) )

• Simple partition key with clustering columns
  PRIMARY KEY ( userid, name, email )

• Composite partition key with clustering columns
  PRIMARY KEY ( (album_name, year), title )
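
The four PRIMARY KEY shapes above follow one rule: the first element is the partition key (wrapped in its own parentheses when composite), and anything after it is a clustering column. A small Python sketch of that rule follows, using a list to mirror the CQL clause; the function name `split_primary_key` is hypothetical.

```python
def split_primary_key(columns):
    """Split a PRIMARY KEY definition into (partition key, clustering columns).

    `columns` mirrors the CQL clause: the first element is either a single
    column name (simple partition key) or a tuple of names (composite
    partition key); the remaining elements are clustering columns.
    """
    first, *clustering = columns
    partition_key = (first,) if isinstance(first, str) else tuple(first)
    return partition_key, tuple(clustering)


# The four cases from the slide.
print(split_primary_key(["userid"]))
print(split_primary_key([("album_name", "year")]))
print(split_primary_key(["userid", "name", "email"]))
print(split_primary_key([("album_name", "year"), "title"]))
```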

Examples

Composite partition key

Composite partition key with clustering columns

Column families

• Distributed
• Sparse

Storage

[Diagram labels: FAST SCAN, SLOW SCAN]

Physical storage layout

On-disk layout to 2D representation

Sizes

• Column family size is limited only by the size of the cluster
• Linear scaling - partitions are distributed
• Largest partition must fit on disk on a single node
• A single partition does not span multiple nodes
• Max cells per partition is 2 billion
• Max data size per cell (column value) is 2GB
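
A quick back-of-the-envelope check against the 2 billion cell limit can be sketched as follows. The estimate counts one cell per non-primary-key column per row, which is a deliberate simplification for illustration rather than an exact account of how Cassandra stores data; the function names are hypothetical.

```python
MAX_CELLS_PER_PARTITION = 2_000_000_000  # limit quoted on the slide


def estimated_cells(rows_in_partition, total_columns, primary_key_columns):
    """Rough cell count: one cell per regular (non-key) column per row."""
    regular_columns = total_columns - primary_key_columns
    return rows_in_partition * regular_columns


def fits_cell_limit(rows_in_partition, total_columns, primary_key_columns):
    cells = estimated_cells(rows_in_partition, total_columns, primary_key_columns)
    return cells <= MAX_CELLS_PER_PARTITION


# A table with 6 columns, 3 of them in the primary key, and 12 rows
# in one partition (hypothetical figures).
print(estimated_cells(12, 6, 3), fits_cell_limit(12, 6, 3))
```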

Query-driven modelling

• Find all performers and albums for a given track title

CREATE TABLE albums_by_track (
    track_title TEXT,
    performer TEXT,
    year INT,
    album_title TEXT,
    PRIMARY KEY ( track_title, performer, year, album_title )
);

• Find performer, genre & titles for a given album title & year

CREATE TABLE tracks_by_album (
    album_title TEXT,
    year INT,
    performer TEXT,
    genre TEXT,
    number INT,
    track_title TEXT,
    PRIMARY KEY ( (album_title, year), number )
);
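
To make the one-table-per-query idea concrete, here is a rough in-memory analogy in Python. Each dictionary plays the role of one of the tables above: its key is the partition key and its sorted list of tuples stands in for rows ordered by clustering columns. The sample album, performer and track names are made up, and the sketch does not model Cassandra behaviour such as upserts.

```python
from collections import defaultdict

# Partition key: track_title. Rows ordered by (performer, year, album_title).
albums_by_track = defaultdict(list)
# Partition key: (album_title, year). Rows ordered by track number.
tracks_by_album = defaultdict(list)


def add_track(album_title, year, performer, genre, number, track_title):
    """Write the same fact to both tables (duplicate data, no joins)."""
    rows = albums_by_track[track_title]
    rows.append((performer, year, album_title))
    rows.sort()

    rows = tracks_by_album[(album_title, year)]
    rows.append((number, performer, genre, track_title))
    rows.sort()


# Made-up sample data, inserted out of order on purpose.
add_track("Album A", 2015, "Performer X", "Genre G", 2, "Track Two")
add_track("Album A", 2015, "Performer X", "Genre G", 1, "Track One")

# Each query reads exactly one partition of the table built for it.
print(albums_by_track["Track One"])
print(tracks_by_album[("Album A", 2015)])
```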

Partition per query

• Most efficient access pattern
• Query accesses only 1 partition
• Partition can be 1 or more rows

Partition+ per query

• Less efficient
• Not necessarily bad
• Query accesses 1+ partitions

Table scan, multi-table

• Not efficient at all - avoid!
• Query accesses all partitions in one or more tables
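
The three access patterns (partition per query, partition+ per query, table scan) differ only in how many partitions a query has to touch. A toy classifier in Python makes that explicit; the `TABLE` contents and the function name `access_pattern` are hypothetical.

```python
# Hypothetical table: partition key -> rows stored in that partition.
TABLE = {
    "p1": ["row1"],
    "p2": ["row2", "row3"],
    "p3": ["row4"],
}


def access_pattern(table, partition_keys=None):
    """Classify a query by the number of partitions it must read.

    `partition_keys=None` means the query does not restrict the
    partition key, so every partition has to be read.
    """
    if partition_keys is None:
        return "table scan", len(table)
    touched = len(set(partition_keys) & set(table))
    if touched <= 1:
        return "partition per query", touched
    return "partition+ per query", touched


print(access_pattern(TABLE, ["p2"]))
print(access_pattern(TABLE, ["p1", "p3"]))
print(access_pattern(TABLE))
```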

Nest data

• More efficient to get to partition and iterate through rows

Duplicate data

• Better than doing an expensive join
• Results are pre-computed & materialised

Query-driven model

• Each query has a corresponding table
• Tables are optimised for queries
• Tables return data in correct order

This is the beginning

Get trained

• Free instructor-led courses
• Free self-paced learning
• Free online resources
• Go to academy.datastax.com

Cassandra Summit 2015

• 5 reasons to join me in SF buff.ly/1JHl6Kw
• September 22-24
• Free general passes still available!

Thank you

Erick Ramirez @flightc