Cassandra: An Alien Technology That’s Not So Alien

Who am I?

• Brian Hess

– Sr. Product Manager for Analytics

• Math and CS background

– Distributed Systems

– Algorithms

• Government data mining research

• Data warehousing (Netezza)

– SQL and data mining and UDF fun

• Joined DataStax 1.5 years ago

– New to NoSQL and Cassandra

© 2015. All Rights Reserved.

Agenda

The Good

• Query Language

• Tooling

• Conceptual Data Model

The Different

• Application Methodology

• Connections/Drivers

• Data Modeling

The Dangerous

• Batches

• Secondary Indices

• Lightweight Transactions


Cassandra Query Language

• Very much like SQL

– SELECT, INSERT, DELETE

– WHERE clauses

– CREATE TABLE, GRANT, etc, etc

• SQL syntax for Cassandra operations

– Only supports Cassandra operations

– Not trying to cover the full SQL space

• Missing several things:

– JOIN, GROUP BY, windowed aggregates, subqueries, WITH, etc



• Pop Quiz: CQL or SQL?

1. SELECT a, b, c FROM myData;

2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3;

3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5;

4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3;

5. SELECT MAX(temp) FROM txn;

6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON (a.id=b.id);


Tabular Data Model

• Rows and Columns

– Like SQL, R data.frame, etc

• Strong schema

– Each column has a data type

– Collection and custom data types, including Map, Set, List, and UDTs

– Can mimic “old-style Thrift tables” with Map

• Most Cassandra tables really have a schema

– Most data really has a schema


Tooling

• cqlsh

– Command-line CQL interpreter

– Mainly used for management operations

• CREATE KEYSPACE, CREATE TABLE, etc

• GRANT, etc

– Singleton INSERTs

– Some light load/unload operations via COPY


Tooling

• DevCenter

– DataStax tool

– “Toad for Cassandra”


Data Modeling

• Start with the query

– How will you access this data?

– Then create the schema optimized for those queries

– Store it multiple times if you must – “Query Tables”

• Uniqueness – what is an overwrite versus an insert?

– Partition keys – for fast lookups

– Clustering Columns – for ranges, etc

– Mix and match for different query patterns – Users by ID, Users by Email, etc

• No joins

– So, no star schema

– Denormalize!
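
The points above can be sketched with a toy in-memory model. This is not how Cassandra stores data; the `QueryTable` class, the `save_user` helper, and the table and column names are all hypothetical, chosen only to show that the primary key decides whether a write is an insert or an overwrite, and that the same data is written once per query pattern.

```python
from collections import defaultdict


class QueryTable:
    """Toy table: partition key -> {clustering key -> row}. Illustration only."""

    def __init__(self, partition_col, clustering_col=None):
        self.partition_col = partition_col
        self.clustering_col = clustering_col
        self.partitions = defaultdict(dict)

    def upsert(self, row):
        # Same partition key and clustering key: overwrite. Otherwise: insert.
        pk = row[self.partition_col]
        ck = row[self.clustering_col] if self.clustering_col else None
        self.partitions[pk][ck] = dict(row)

    def lookup(self, pk):
        # Single-partition read; rows come back ordered by clustering key.
        part = self.partitions.get(pk, {})
        if self.clustering_col is None:
            return list(part.values())
        return [part[ck] for ck in sorted(part)]


# One query table per access pattern (hypothetical names).
users_by_id = QueryTable("user_id")
users_by_email = QueryTable("email")


def save_user(user):
    # Denormalize: write the same row to every query table.
    users_by_id.upsert(user)
    users_by_email.upsert(user)


save_user({"user_id": 1, "email": "a@example.com", "name": "Ann"})
save_user({"user_id": 1, "email": "a@example.com", "name": "Ann B."})  # overwrite

# Clustering column: many rows per partition, kept in sorted order.
txns = QueryTable("cust_id", "txn_id")
txns.upsert({"cust_id": 5, "txn_id": 2, "amount": 10})
txns.upsert({"cust_id": 5, "txn_id": 1, "amount": 7})
```

After these calls, `users_by_id.lookup(1)` holds a single row with the newer name, and `txns.lookup(5)` returns the two transactions ordered by `txn_id`.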


Application Methodology

• Instead of rolling back, we retry

• In SQL you try a transaction and, on error, roll back

– Leverages transaction isolation and rollback

• In Cassandra you try and, on error, try again

– Leverages idempotent data model and high availability

• “Well, did you want to write that to the database or not?”

– “Instead of Transactions and Rollback, we have Idempotency and Retry”
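
A minimal sketch of "idempotency and retry", assuming a write that may time out even though it was applied. `FlakyStore` and `write_with_retry` are hypothetical names for a toy model, not a driver API; the point is only that repeating an upsert keyed by the primary key leaves the same final state.

```python
import random


class FlakyStore:
    """Toy store whose writes sometimes time out AFTER being applied."""

    def __init__(self, failure_rate, seed=0):
        self.rows = {}
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)

    def write(self, key, value):
        self.rows[key] = value  # upsert: the write lands either way
        if self.rng.random() < self.failure_rate:
            raise TimeoutError("no acknowledgement from the store")


def write_with_retry(store, key, value, max_attempts=5):
    """Retry until acknowledged; safe only because the write is idempotent."""
    for attempt in range(1, max_attempts + 1):
        try:
            store.write(key, value)
            return attempt
        except TimeoutError:
            continue
    raise TimeoutError(f"gave up after {max_attempts} attempts")


store = FlakyStore(failure_rate=0.5, seed=42)
attempts = write_with_retry(store, ("sensor", 3), {"temp": 21.5}, max_attempts=50)
```

However many attempts it takes, `store.rows` ends up with exactly one row for the key. The same loop would not be safe for a non-idempotent write such as "add 1 to a counter", since each retry could apply the increment again.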


Drivers / Connections

• Multilanguage drivers

– Java, Python, C/C++, PHP, Ruby, etc

• Leverage the whole cluster

– Connect to all the nodes – connect to one and discover the others

– Load balancing – Round Robin, etc

– Smart routing of queries – “Token Aware Routing”

• Not JDBC / ODBC

– Cassandra-specific, but similar

– Cluster, Session, Statement, ResultSet, etc

– Different data types, etc

– Lower-level configurations – number of connections, paging size, etc
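
The idea behind token-aware routing can be sketched as a toy token ring. This is an illustration under stated assumptions, not driver code: `TokenRing` is a hypothetical class, each node gets a single token derived from its name, and MD5 stands in for Cassandra's default Murmur3 partitioner.

```python
import bisect
import hashlib


class TokenRing:
    """Toy token ring: each node owns the range ending at its token."""

    def __init__(self, nodes):
        # One token per node, derived from the node name (hypothetical scheme).
        self.ring = sorted((self._token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    @staticmethod
    def _token(value):
        # MD5 is a stand-in hash here, not Cassandra's partitioner.
        digest = hashlib.md5(str(value).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def owner(self, partition_key):
        # First node whose token is >= the key's token, wrapping around.
        i = bisect.bisect_left(self.tokens, self._token(partition_key))
        return self.ring[i % len(self.ring)][1]


nodes = ["node-a", "node-b", "node-c"]
ring = TokenRing(nodes)
target = ring.owner("cust_id=5")  # send the query straight to this node
```

Because the client can compute the owner itself, a query for a given partition key can go directly to a node that holds the data instead of taking an extra hop through an arbitrary coordinator.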


Batches

• These are not what you think they are

• How they work…

– Send all the statements to the coordinator

– If “logged”, the coordinator writes a batchlog for durability and replicates it

– The coordinator executes each statement – possibly involving other nodes

• They are not done as an isolated set of INSERTs

– Well, they are if they all update the same partition (nuance)

– Things will be seen as they are executed

• They do not speed things up

– The coordinator has to do all the work – latency increases and timeouts

– You still need to talk to multiple nodes for multiple INSERTs

– Logged batches require a lot of work by the coordinator – it maintains a batchlog for durability and replicates it

• But, they do serve a purpose

– Inter-table consistency (logged batches)

– Some bulk load performance (unlogged batches)
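
A rough sketch of why a multi-partition batch does not save work: the coordinator still has to reach every node that owns one of the partitions. `nodes_touched` is a hypothetical helper, and the CRC32-modulo placement is a toy stand-in for real token ownership.

```python
from collections import defaultdict
import zlib


def nodes_touched(batch, node_count):
    """Group batch statements by the (toy) node that owns each partition key."""
    by_node = defaultdict(list)
    for partition_key, statement in batch:
        # Toy placement: hash of the partition key modulo the node count.
        node = zlib.crc32(str(partition_key).encode()) % node_count
        by_node[node].append(statement)
    return dict(by_node)


# All statements hit the same partition: one node does the work.
single_partition = [(5, "insert txn 1"), (5, "insert txn 2"), (5, "insert txn 3")]

# One statement per partition: the work fans out across the cluster.
multi_partition = [(k, f"insert txn for customer {k}") for k in range(100)]

single_fanout = nodes_touched(single_partition, node_count=6)
multi_fanout = nodes_touched(multi_partition, node_count=6)
```

In this model the single-partition batch lands on one node, while the hundred-partition batch spreads over several nodes, all driven by one coordinator.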


Secondary Indices


• Secondary Indices need to consult every Cassandra node

– Cassandra is optimized for “single node queries”

– Okay for single-partition queries, but not really worth it

• Maintaining secondary indices is also an overhead

• Basically, they rarely make things faster in cases that matter

• Consider Materialized Views in Cassandra 3.0


Lightweight Transactions


• They are not SQL transactions

– They do not support rollback, etc

• They do allow for serialization and isolation

– Useful for operations like creating accounts

– INSERT INTO users(username, email) VALUES ('cassandra_fan', '[email protected]') IF NOT EXISTS

• They are costly

– Paxos algorithm – multiple passes/rounds

• But they do serve a purpose

– Use sparingly
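
The semantics of `IF NOT EXISTS` can be sketched as a compare-and-set on a toy table. This is a single-process illustration, not Cassandra's implementation: the hypothetical `Users` class uses a lock where Cassandra runs Paxos rounds across replicas, and the email addresses are made up.

```python
import threading


class Users:
    """Toy table with an insert-if-absent operation guarded by a lock."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def insert_if_not_exists(self, username, email):
        # The existence check and the write happen as one isolated step.
        with self._lock:
            if username in self._rows:
                # Not applied: the existing row wins and is returned.
                return False, self._rows[username]
            self._rows[username] = {"username": username, "email": email}
            return True, self._rows[username]


users = Users()
first = users.insert_if_not_exists("cassandra_fan", "fan@example.com")
second = users.insert_if_not_exists("cassandra_fan", "other@example.com")
```

The first call is applied and the second is rejected with the existing row, which mirrors how a conditional INSERT reports whether it was applied. There is nothing to roll back: the operation either takes effect or it does not.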


Thank you

Cassandra?


Cassandra!