cassandra 3 new features 2016

@doanduyhai

New Cassandra 3 Features
DuyHai DOAN
Apache Cassandra Evangelist

Who Am I ?
Duy Hai DOAN, Apache Cassandra Evangelist
•  talks, meetups, confs …
•  open-source projects (Achilles, Apache Zeppelin ...)
•  OSS Cassandra point of contact

☞ duy_hai.doan@datastax.com ☞ @doanduyhai

Datastax

•  Founded in April 2010

•  We contribute a lot to Apache Cassandra™

•  400+ customers (25 of the Fortune 100), 400+ employees

•  Headquartered in the San Francisco Bay Area

•  EU headquarters in London, offices in France and Germany

•  Datastax Enterprise = OSS Cassandra + extra features

Agenda

•  Materialized Views

•  User Defined Functions (UDF) and Aggregates (UDA)

•  JSON Syntax

•  New SASI full text search index

Materialized Views (MV)
•  Why ?
•  Gotchas

Why Materialized Views ?
•  Relieve the pain of manual denormalization

CREATE TABLE user(id int PRIMARY KEY, country text, …);

CREATE TABLE user_by_country( country text, id int, …, PRIMARY KEY(country, id));

CREATE TABLE user_by_country ( country text, id int, firstname text, lastname text, PRIMARY KEY(country, id));

Materialized View In Action
CREATE MATERIALIZED VIEW user_by_country AS
SELECT country, id, firstname, lastname
FROM user
WHERE country IS NOT NULL AND id IS NOT NULL
PRIMARY KEY(country, id);

Materialized Views Demo

Materialized View Performance
•  Write performance
   •  slower than normal write
   •  local lock + read-before-write cost (but paid only once for all views)
   •  for each base table update, worst case: mv_count x 2 (DELETE + INSERT) extra mutations for the views
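
The worst case above can be illustrated with a small in-memory simulation. This is only a sketch of the idea, not Cassandra's implementation: `update_base_row`, the dict-based `base` table and the `views` structure are hypothetical names made up for the example. It performs one read-before-write on the base row, then emits a DELETE for the stale view row (when the view key changed) and an INSERT for the new one.

```python
def update_base_row(base, views, key, new_row):
    """Apply one base-table write and return the view mutations it triggers.

    base  : dict mapping base primary key -> row dict (hypothetical stand-in)
    views : dict mapping view name -> {"primary_key": tuple, "rows": dict}
    """
    old_row = base.get(key)  # read-before-write, done once for all views
    mutations = []
    for name, view in views.items():
        pk_columns = view["primary_key"]
        new_pk = tuple(new_row.get(c) for c in pk_columns)
        if old_row is not None:
            old_pk = tuple(old_row.get(c) for c in pk_columns)
            # The view key changed: the stale view row must be removed.
            if old_pk != new_pk and old_pk in view["rows"]:
                del view["rows"][old_pk]
                mutations.append(("DELETE", name, old_pk))
        # Mirrors the "IS NOT NULL" restriction on view key columns.
        if None not in new_pk:
            view["rows"][new_pk] = dict(new_row)
            mutations.append(("INSERT", name, new_pk))
    base[key] = dict(new_row)
    return mutations


base = {}
views = {"user_by_country": {"primary_key": ("country", "id"), "rows": {}}}
first = update_base_row(base, views, 1, {"id": 1, "country": "FR"})
second = update_base_row(base, views, 1, {"id": 1, "country": "US"})
print(first)   # one INSERT for the new view row
print(second)  # DELETE of the old view row + INSERT of the new one
```

Changing `country` moves the row to another view partition, which is why the second write produces two view mutations for a single view.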

Materialized View Performance
•  Write performance vs manual denormalization
   •  MV better because no client-server network traffic for read-before-write
   •  MV better because less network traffic for multiple views (client-side BATCH)
•  Makes developer life easier → priceless

Materialized View Performance
•  Read performance vs secondary index
   •  MV better because single node read (secondary index can hit many nodes)
   •  MV better because single read path (secondary index = read index + read data)

Materialized Views Consistency
•  Consistency level

•  CL honoured for base table, ONE for MV + local batchlog

•  Weaker consistency guarantees for MV than for base table.

User Defined Functions (UDF)
•  Why ?
•  UDAs
•  Gotchas

Rationale
•  Push computation server-side
   •  save network bandwidth (1000 nodes!)
   •  simplify client-side code
   •  provide standard & useful functions (sum, avg …)
   •  accelerate analytics use-case (pre-aggregation for Spark)

How to create an UDF ?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
[keyspace.]functionName (param1 type1, param2 type2, …)
CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
RETURNS returnType
LANGUAGE language
AS $$
  // source code here
$$;

•  (param1 type1, param2 type2, …): param name to refer to in the code, type = Cassandra type
•  CALLED ON NULL INPUT: always called, null-check mandatory in code
•  RETURNS NULL ON NULL INPUT: if any input is null, function execution is skipped and null is returned
•  RETURNS returnType: Cassandra types
   •  primitives (boolean, int, …)
   •  collections (list, set, map)
   •  tuples
   •  UDT
•  LANGUAGE language: JVM supported languages
   •  Java, Scala
   •  Javascript (slow)
   •  Groovy, Jython, JRuby
   •  Clojure (JSR 223 impl issue)
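
The difference between the two null-handling clauses can be sketched in a few lines of Python. This is an illustration of the semantics only, not how Cassandra executes UDFs; `returns_null_on_null_input`, `max_of` and `max_of_null_safe` are hypothetical names chosen for the example.

```python
def returns_null_on_null_input(fn):
    """Emulates RETURNS NULL ON NULL INPUT: skip the body if any argument is null."""
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None  # function body is never executed
        return fn(*args)
    return wrapper


@returns_null_on_null_input
def max_of(a, b):
    # No null-check needed: the wrapper guarantees non-null inputs.
    return max(a, b)


def max_of_null_safe(a, b):
    """Emulates CALLED ON NULL INPUT: always invoked, so the body must null-check."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


print(max_of(1, None))            # None, body skipped
print(max_of(1, 2))               # 2
print(max_of_null_safe(1, None))  # 1, body handled the null itself
```

With RETURNS NULL ON NULL INPUT the function body stays simple; with CALLED ON NULL INPUT the author decides what a null argument means.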

UDF Demo

User Defined Aggregates (UDA)
•  Real use-case for UDF

•  Aggregation server-side → huge network bandwidth saving

•  Provide similar behavior for Group By, Sum, Avg etc …

How to create an UDA ?
CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
[keyspace.]aggregateName(type1, type2, …)
SFUNC accumulatorFunction
STYPE stateType
[FINALFUNC finalFunction]
INITCOND initCond;

•  (type1, type2, …): only type, no param name
•  SFUNC: accumulator function. Signature: accumulatorFunction(stateType, type1, type2, …) RETURNS stateType
•  STYPE: state type
•  FINALFUNC: optional final function. Signature: finalFunction(stateType)
•  INITCOND: initial state (a value of stateType)
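
How SFUNC, STYPE, INITCOND and FINALFUNC fit together can be sketched as a plain fold. The Python below is a minimal model of those semantics, assuming an average aggregate whose state is a (count, sum) pair; `run_aggregate`, `avg_state` and `avg_final` are hypothetical names, not Cassandra APIs.

```python
def run_aggregate(rows, sfunc, initcond, finalfunc=None):
    """Fold every row into the state, then apply the optional final function.

    rows      : iterable of argument tuples, one per selected row
    sfunc     : accumulator, called as sfunc(state, *row) -> new state
    initcond  : initial state
    finalfunc : optional, called as finalfunc(state) -> result
    """
    state = initcond
    for row in rows:
        state = sfunc(state, *row)
    return finalfunc(state) if finalfunc is not None else state


def avg_state(state, value):
    # State is a (count, sum) pair; null values are ignored.
    count, total = state
    if value is None:
        return state
    return (count + 1, total + value)


def avg_final(state):
    count, total = state
    return None if count == 0 else total / count


ages = [(20,), (30,), (40,)]
print(run_aggregate(ages, avg_state, (0, 0), avg_final))  # 30.0
print(run_aggregate(ages, avg_state, (0, 0)))             # (3, 90), raw state
```

The loop runs in one place over all selected rows, which is the same shape as the "not distributed" gotcha on the next slide.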

UDA Demo

Gotchas

•  UDA in Cassandra is not distributed !
•  Do not execute UDA on a large number of rows (10^6 for ex.)
   •  single fat partition
   •  multiple partitions
   •  full table scan
•  → Increase client-side timeout
   •  default Java driver timeout = 12 secs

Cassandra UDA or Apache Spark ?

 Consistency Level | Single/Multiple Partition(s) | Recommended Approach
-------------------+------------------------------+------------------------------------------------
 ONE               | Single partition             | UDA with token-aware driver because node local
 ONE               | Multiple partitions          | Apache Spark because distributed reads
 > ONE             | Single partition             | UDA because data-locality lost with Spark
 > ONE             | Multiple partitions          | Apache Spark definitely

JSON Syntax
•  Why ?
•  Example

Why JSON ?

•  JSON is a very good exchange format

•  But a terrible schema …

•  How to have best of both worlds ?
   •  use Cassandra schema
   •  convert rows to JSON format

JSON syntax for INSERT/UPDATE/DELETE

CREATE TABLE users (
    id text PRIMARY KEY,
    age int,
    state text
);

INSERT INTO users JSON '{"id": "user123", "age": 42, "state": "TX"}';

INSERT INTO users(id, age, state) VALUES('me', fromJson('20'), 'CA');

UPDATE users SET age = fromJson('25') WHERE id = fromJson('"me"');

DELETE FROM users WHERE id = fromJson('"me"');

JSON syntax for SELECT

> SELECT JSON * FROM users WHERE id = 'me';

 [json]
----------------------------------------
 {"id": "me", "age": 25, "state": "CA"}

> SELECT JSON age,state FROM users WHERE id = 'me';

 [json]
----------------------------------------
 {"age": 25, "state": "CA"}

> SELECT age, toJson(state) FROM users WHERE id = 'me';

 age | system.tojson(state)
-----+----------------------
  25 | "CA"
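
The "best of both worlds" idea (typed schema on the server, JSON on the wire) can be modelled in a few lines. The Python below is a hypothetical in-memory stand-in for the `users` table, not driver code: `USERS_SCHEMA`, `insert_json` and `select_json` are names invented for the sketch. Incoming JSON is checked against the declared column types, and rows are serialized back to JSON on read.

```python
import json

# Hypothetical stand-in for the schema of the `users` table shown above.
USERS_SCHEMA = {"id": str, "age": int, "state": str}


def insert_json(table, schema, payload):
    """Validate a JSON document against the schema, then store it as a row."""
    row = json.loads(payload)
    for column, value in row.items():
        if column not in schema:
            raise ValueError(f"unknown column: {column}")
        if value is not None and not isinstance(value, schema[column]):
            raise TypeError(f"column {column} expects {schema[column].__name__}")
    # Columns missing from the document are stored as null in this sketch.
    table[row["id"]] = {column: row.get(column) for column in schema}


def select_json(table, key, columns=None):
    """Serialize one row (or a subset of its columns) back to JSON."""
    row = table[key]
    selected = columns if columns is not None else list(row)
    return json.dumps({column: row[column] for column in selected})


users = {}
insert_json(users, USERS_SCHEMA, '{"id": "me", "age": 25, "state": "CA"}')
print(select_json(users, "me"))
print(select_json(users, "me", ["age", "state"]))
```

A document such as `{"id": "me", "age": "25"}` is rejected by the type check, which is the point of keeping the schema on the database side.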

JSON Syntax Demo

SASI index, the search is over!
•  Why ?
•  How ?
•  Who ?
•  Demo !
•  When ?

Why SASI ?
•  Searching (and full text search) was always a pain point for Cassandra
   •  limited search predicates (=, <=, <, > and >= only)
   •  limited scope (only on primary key columns)
•  Existing secondary index performance is poor
   •  reversed-index
   •  use Cassandra itself as index storage …
   •  limited predicate ( = ). Inequality predicate = full cluster scan 😱

How ?
•  New index structure = suffix trees

•  Extended predicates (=, inequalities, LIKE %)

•  Full text search (tokenizers, stop-words, stemming …)

•  Query Planner to optimize AND predicates

•  NO, we don’t use Apache Lucene
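
Why a suffix-based structure makes LIKE '%term%' cheap can be shown with a toy index. The sketch below is not SASI's on-disk format; it is a simplified sorted list of suffixes, and `build_suffix_index` and `like_contains` are hypothetical names. Every suffix of every indexed value points back to its row ids, so a "contains" query becomes a prefix lookup over the sorted suffixes.

```python
from bisect import bisect_left


def build_suffix_index(rows):
    """Index every suffix of every value. rows maps row id -> text value."""
    postings = {}
    for row_id, value in rows.items():
        for start in range(len(value)):
            postings.setdefault(value[start:], set()).add(row_id)
    suffixes = sorted(postings)  # sorted so that prefix matches are contiguous
    return suffixes, postings


def like_contains(index, needle):
    """Emulates LIKE '%needle%': prefix search over the sorted suffixes."""
    suffixes, postings = index
    matches = set()
    pos = bisect_left(suffixes, needle)
    while pos < len(suffixes) and suffixes[pos].startswith(needle):
        matches |= postings[suffixes[pos]]
        pos += 1
    return matches


names = {1: "john doe", 2: "jane doe", 3: "johnny"}
index = build_suffix_index(names)
print(like_contains(index, "ohn"))  # rows 1 and 3
print(like_contains(index, "doe"))  # rows 1 and 2
```

The trade-off is index size: storing all suffixes costs far more space than indexing whole values, which is the price paid for substring search.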

Who ?
•  Open source contribution by an engineering team from …

SASI Demo

When ?
•  Cassandra 3.5
•  Later
   •  support for OR clause : (aaa OR bbb) AND (ccc OR ddd)
   •  index on collections (Set, List, Map)

Comparison

SASI vs Solr/ElasticSearch ?
•  Cassandra is not a search engine !!! (database = durability)
•  always slower because 2 passes (SASI index read + original Cassandra data)
•  no scoring
•  no ordering (ORDER BY)
•  no grouping (GROUP BY) → Apache Spark for analytics

Still, SASI covers 80% of search use-cases and people are happy !

duy_hai.doan@datastax.com

https://academy.datastax.com/

Thank You
