cassandra: an alien technology that's not so alien
TRANSCRIPT
Cassandra: An Alien Technology That’s not so Alien
Who am I?
• Brian Hess
– Sr. Product Manger for Analytics
• Math and CS background
– Distributed Systems
– Algorithms
• Government data mining research
• Data warehousing (Netezza)
– SQL and data mining and UDF fun
• Joined Datastax 1.5 years ago
– New to NoSQL and Cassandra
© 2015. All Rights Reserved. 2
© 2015. All Rights Reserved. 3
© 2015. All Rights Reserved. 4
Agenda
The Good
• Query Language
• Tooling
• Conceptual Data
Model
The Different
• Application
Methodology
• Connections/Drivers
• Data Modeling
The Dangerous
• Batches
• Secondary Indices
• Lightweight
Transactions
© 2015. All Rights Reserved. 5
Cassandra Query Language
• Very much like SQL
– SELECT, INSERT, DELETE
– WHERE clauses
– CREATE TABLE, GRANT, etc, etc
• SQL syntax for Cassandra operations
– Only supports Cassandra operations
– Not trying to cover the full SQL space
• Missing several things:
– JOIN, GROUP BY, windowed aggregates, subqueries, WITH, etc
© 2015. All Rights Reserved. 6
Cassandra Query Language
• Pop Quiz: CQL or SQL: 1. SELECT a, b, c FROM myData;
2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3;
3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5;
4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3;
5. SELECT MAX(temp) FROM txn
6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON
(a.id=b.id)
© 2015. All Rights Reserved. 7
Cassandra Query Language
• Pop Quiz: CQL or SQL: 1. SELECT a, b, c FROM myData;
2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3;
3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5;
4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3;
5. SELECT MAX(temp) FROM txn
6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON
(a.id=b.id)
© 2015. All Rights Reserved. 8
Tabular Data Model
• Rows and Columns
– Like SQL, R data.frame, etc
• Strong schema
– Each column has a data type
– Custom data types including Map, Set, List, and UDTs
– Can mimic “old style thrift tables” with Map
• Most Cassandra tables really have a schema
– Most data really has a schema
© 2015. All Rights Reserved. 9
Tooling
• Cqlsh
– Command-line CQL interpreter
– Mainly used for management operations
• CREATE KEYSPACE, CREATE TABLE, etc
• GRANT, etc
– Singleton INSERTs
– Some light load/unload operations via COPY
© 2015. All Rights Reserved. 10
Tooling
• DevCenter
– DataStax tool
– “Toad for Cassandra”
© 2015. All Rights Reserved. 11
Data Modeling
• Start with the query – How will you access this data
– Then create the schema optimized for those queries
– Store it multiple times if you must – “Query Tables”
• Uniqueness – what is an overwrite versus an insert? – Partition keys – for fast lookups
– Clustering Columns – for ranges, etc
– Mix and match for different query patterns – Users by ID, Users by Email, etc
• No joins – So, no star schema
– Denormalize!
© 2015. All Rights Reserved. 12
Application Methodology
• Instead of rolling back, we retry
• In SQL you try a transaction, and if error, then rollback
– Leverages transaction isolation and rollback
• In Cassandra you try, and if error, try again
– Leverages idempotent data model and high availability
• “Well, did you want to write that to the database or not?”
– “Instead of Transactions and Rollback, we have Idempotency and Retry”
© 2015. All Rights Reserved. 13
Drivers / Connections
• Multilanguage drivers – Java, Python, C/C++, PHP, Ruby, etc
• Leverage the whole cluster – Connect to all the nodes – connect to one and discovering the others
– Load balancing – Round Robin, etc
– Smart routing of queries – “Token Aware Routing”
• Not JDBC / ODBC – Cassandra-specific, but similar
– Cluster, Session, Statement, ResultSet, etc
– Different data types, etc
– Lower-level configurations – number of connections, paging size, etc
© 2015. All Rights Reserved. 14
Batches
• These are not what you think they are
• How they work… – Send all the statements to the coordinator
– If “logged”, then a batchlog is written for durability – and replicates it
– Executes each statement – involves other nodes (possibly)
• They are not done as an isolated set of INSERTs – Well, they are if they all update the same partition (nuance)
– Things will be seen as they are executed
• They do not speed things up – The coordinator has to do all the work – latency increase and timeouts
– You still need to talk to multiple nodes for multiple INSERTs
– Logged batches require a lot of work by the coordinator • Maintain a batchlog for durability – and replicates it
• But, they do serve a purpose – Inter-table consistency (logged batches)
– Some bulk load performance (unlogged batches)
© 2015. All Rights Reserved. 15
Secondary Indices
• These are not what you think they are
• Secondary Indices need to consult every Cassandra node
– Cassandra is optimized for “single node queries”
– Okay for single-partition queries, but not really worth it
• Maintaining secondary indices is also an overhead
• Basically, they rarely make things faster in cases that matter
• Consider Materialized Views in Cassandra 3.0
© 2015. All Rights Reserved. 16
Lightweight Transactions
• These are not what you think they are
• They are not SQL transactions – They do not support roll-back, etc
• They do allow for serialization and isolation – Useful for operations like creating accounts
– INSERT INTO users(username, email) VALUES
('cassandra_fan', '[email protected]') IF NOT EXISTS
• They are costly – Paxos algorithm – multiple passes/rounds
• But they do serve a purpose – Use sparingly
© 2015. All Rights Reserved. 17
Thank you
Cassandra?
© 2015. All Rights Reserved. 19
Cassandra!
© 2015. All Rights Reserved. 20