Cassandra 3 new features @ GeeCON Kraków 2016

Cassandra 3.0 new features

DuyHai DOAN Apache Cassandra Evangelist


DuyHai DOAN (@doanduyhai), Kraków, 11-13 May 2016

Who Am I ?

Apache Cassandra Evangelist
•  talks, meetups, confs
•  open-source devs (Achilles, Apache Zeppelin)
•  OSS Cassandra point of contact

☞ duy_hai.doan@datastax.com
☞ @doanduyhai

Datastax
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 450+ employees
•  Headquartered in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  Datastax Enterprise = OSS Cassandra + extra features

Agenda
•  Materialized Views (MV)
•  User Defined Functions (UDF) & User Defined Aggregates (UDA)
•  JSON syntax
•  New SASI full text search

Materialized Views (MV)


Why Materialized Views ?
•  Relieve the pain of manual denormalization

CREATE TABLE user(
    id int PRIMARY KEY,
    country text,
    …);

CREATE TABLE user_by_country(
    country text,
    id int,
    …,
    PRIMARY KEY(country, id));

Materialized Views creation

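-- Manual denormalization: a second table that the application must keep in sync
CREATE TABLE user_by_country (
    country text,
    id int,
    firstname text,
    lastname text,
    PRIMARY KEY(country, id));

-- With a materialized view, Cassandra maintains the denormalized copy instead
CREATE MATERIALIZED VIEW user_by_country AS
    SELECT country, id, firstname, lastname
    FROM user
    WHERE country IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY(country, id);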
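A minimal end-to-end sketch of the view above, assuming a scratch keyspace named demo with single-node replication and a few made-up rows (the keyspace and the sample data are illustrative, not from the slides). It shows the application writing only to the base table and reading the view like any other table:

```cql
-- Assumption: throwaway keyspace; name and replication settings are illustrative.
CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;

-- Base table: the only table the application writes to.
CREATE TABLE IF NOT EXISTS user (
    id int PRIMARY KEY,
    country text,
    firstname text,
    lastname text
);

-- View keyed by country; every primary key column must be restricted to IS NOT NULL.
CREATE MATERIALIZED VIEW IF NOT EXISTS user_by_country AS
    SELECT country, id, firstname, lastname
    FROM user
    WHERE country IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (country, id);

-- Hypothetical sample rows.
INSERT INTO user (id, country, firstname, lastname) VALUES (1, 'FR', 'Jean', 'Dupont');
INSERT INTO user (id, country, firstname, lastname) VALUES (2, 'PL', 'Anna', 'Nowak');

-- Read by country through the view, with no client-side second table to maintain.
SELECT id, firstname, lastname FROM user_by_country WHERE country = 'FR';

-- Changing the view's partition key in the base table moves the row in the view
-- (the old view row is deleted and a new one is inserted).
UPDATE user SET country = 'PL' WHERE id = 1;
```

The view itself cannot be written to directly; all writes go through the base table.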

Materialized View Demo

Materialized Views Performance
•  Write performance
    •  slower than normal write
    •  local lock + read-before-write cost (but paid only once for all views)
    •  for each base table update, worst case: mv_count x 2 (DELETE + INSERT) extra mutations for the views

Materialized Views Performance
•  Write performance vs manual denormalization
    •  MV better because no client-server network traffic for read-before-write
    •  MV better because less network traffic for multiple views (client-side BATCH)
•  Makes developer life easier → priceless

Materialized Views Performance
•  Read performance vs secondary index
    •  MV better because single node read (secondary index can hit many nodes)
    •  MV better because single read path (secondary index = read index + read data)

Materialized Views Consistency
•  Consistency level
    •  CL honoured for base table, ONE for MV + local batchlog
•  Weaker consistency guarantees for MV than for base table

User Defined Functions (UDF)

Rationale
•  Push computation server-side
    •  save network bandwidth (1000 nodes!)
    •  simplify client-side code
    •  provide standard & useful functions (sum, avg …)
    •  accelerate analytics use-cases (pre-aggregation for Spark)

How to create a UDF ?

CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
    [keyspace.]functionName (param1 type1, param2 type2, …)
    CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
    RETURNS returnType
    LANGUAGE language
    AS $$
        // source code here
    $$;

•  param1, param2, …: param names to refer to in the code
•  type1, type2, …, returnType: Cassandra types
    •  primitives (boolean, int, …)
    •  collections (list, set, map)
    •  tuples
    •  UDT
•  CALLED ON NULL INPUT: always called. Null-check mandatory in code
•  RETURNS NULL ON NULL INPUT: if any input is null, function execution is skipped and null is returned
•  LANGUAGE: JVM supported languages
    •  Java, Scala
    •  Javascript (slow)
    •  Groovy, Jython, JRuby
    •  Clojure (JSR 223 impl issue)
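As an illustration of the syntax above, here is a small sketch of a Java UDF. The keyspace demo, the function name max_of, the table score and its rows are all hypothetical. It also assumes user defined functions have been enabled on the server (enable_user_defined_functions: true in cassandra.yaml), which is off by default:

```cql
-- Assumption: throwaway keyspace; name and replication settings are illustrative.
CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;

-- Hypothetical function: returns the larger of two ints.
-- RETURNS NULL ON NULL INPUT: if either argument is null the body is skipped
-- and null is returned, so no null-check is needed in the Java code.
CREATE OR REPLACE FUNCTION max_of(val1 int, val2 int)
    RETURNS NULL ON NULL INPUT
    RETURNS int
    LANGUAGE java
    AS $$
        return Math.max(val1, val2);
    $$;

-- Hypothetical table and rows to exercise the function.
CREATE TABLE IF NOT EXISTS score (
    player text PRIMARY KEY,
    round1 int,
    round2 int
);

INSERT INTO score (player, round1, round2) VALUES ('alice', 12, 30);
INSERT INTO score (player, round1, round2) VALUES ('bob', 25, 7);

-- The function runs on the server, once per returned row.
SELECT player, max_of(round1, round2) FROM score;
```

Declaring the function CALLED ON NULL INPUT instead would require the body to handle null arguments itself.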

UDF Demo

User Defined Aggregate (UDA)
•  Real use-case for UDF
•  Aggregation server-side → huge network bandwidth saving
•  Provide similar behavior to Group By, Sum, Avg etc …

How to create a UDA ?

CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
    [keyspace.]aggregateName(type1, type2, …)
    SFUNC accumulatorFunction
    STYPE stateType
    [FINALFUNC finalFunction]
    INITCOND initCond;

•  (type1, type2, …): only type, no param name
•  STYPE: state type
•  INITCOND: initial state (a value of the state type)
•  Accumulator function signature: accumulatorFunction(stateType, type1, type2, …) RETURNS stateType
•  Accumulator function ≈ foldLeft function
•  Optional final function signature: finalFunction(stateType)
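The foldLeft analogy can be made concrete with a small sketch. Everything named here (keyspace demo, table sensor_reading, function sum_state, aggregate total) is hypothetical, and UDFs are assumed to be enabled on the server. The accumulator receives the current state and one column value per row and returns the next state; the optional FINALFUNC is left out:

```cql
-- Assumption: throwaway keyspace; name and replication settings are illustrative.
CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;

-- Hypothetical table: one partition per sensor.
CREATE TABLE IF NOT EXISTS sensor_reading (
    sensor_id int,
    ts int,
    reading int,
    PRIMARY KEY (sensor_id, ts)
);

-- Accumulator: (state, value) -> new state.
-- CALLED ON NULL INPUT, so the null-check on the row value is mandatory.
CREATE OR REPLACE FUNCTION sum_state(state bigint, val int)
    CALLED ON NULL INPUT
    RETURNS bigint
    LANGUAGE java
    AS $$
        if (val == null) return state;
        return state + val;
    $$;

-- The aggregate lists only the argument type; the state starts at INITCOND
-- and is folded over every selected row.
CREATE OR REPLACE AGGREGATE total(int)
    SFUNC sum_state
    STYPE bigint
    INITCOND 0;

INSERT INTO sensor_reading (sensor_id, ts, reading) VALUES (1, 1, 10);
INSERT INTO sensor_reading (sensor_id, ts, reading) VALUES (1, 2, 32);

-- Restricted to a single partition, so the coordinator folds a bounded number of rows.
SELECT total(reading) FROM sensor_reading WHERE sensor_id = 1;
```

With the two sample rows, the state goes 0, then 10, then 42.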

UDA Demo

Gotchas
•  UDA in Cassandra is not distributed !
•  Do not execute UDA on a large number of rows (10^6 for ex.)
    •  single fat partition
    •  multiple partitions
    •  full table scan
•  → Increase client-side timeout
    •  default Java driver timeout = 12 secs

Cassandra UDA or Apache Spark ?

Consistency Level   Single/Multiple Partition(s)   Recommended Approach
ONE                 Single partition               UDA with token-aware driver because node local
ONE                 Multiple partitions            Apache Spark because distributed reads
> ONE               Single partition               UDA because data-locality lost with Spark
> ONE               Multiple partitions            Apache Spark definitely

JSON Syntax

Why JSON ?
•  JSON is a very good exchange format
•  But a terrible schema …
•  How to have best of both worlds ?
    •  use Cassandra schema
    •  convert rows to JSON format
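A short sketch of what "Cassandra schema + JSON format" looks like in practice, assuming the same hypothetical demo keyspace and user table used for illustration (the sample row is made up). The table keeps its typed schema; JSON is only the exchange format on the way in and out:

```cql
-- Assumption: throwaway keyspace; name and replication settings are illustrative.
CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;

CREATE TABLE IF NOT EXISTS user (
    id int PRIMARY KEY,
    country text,
    firstname text,
    lastname text
);

-- Write a row from a JSON document; keys map to column names and values
-- are validated against the column types.
INSERT INTO user JSON '{"id": 3, "country": "DE", "firstname": "Lena", "lastname": "Schmidt"}';

-- Read the row back as a single text column holding one JSON document.
SELECT JSON id, country, firstname, lastname FROM user WHERE id = 3;

-- Convert a single column to JSON while leaving the others as regular columns.
SELECT id, toJson(country) FROM user WHERE id = 3;
```

A key that does not match a column, or a value of the wrong type, makes the INSERT fail, which is where the schema pays off.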

JSON Syntax Demo

SASI full text search index

Why SASI ?
•  Searching (and full text search) was always a pain point for Cassandra
    •  limited search predicates (=, <=, <, > and >= only)
    •  limited scope (only on primary key columns)
•  Existing secondary index performance is poor
    •  reversed-index
    •  use Cassandra itself as index storage …
    •  limited predicate ( = ). Inequality predicate = full cluster scan 😱

How is it implemented ?
•  New index structure = suffix trees
•  Extended predicates (=, inequalities, LIKE %)
•  Full text search (tokenizers, stop-words, stemming …)
•  Query Planner to optimize AND predicates
•  NO, we don’t use Apache Lucene
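To make the LIKE predicates concrete, here is a minimal sketch of a SASI index, assuming a Cassandra version that ships SASI and the same hypothetical demo keyspace and user table (index name and sample row are illustrative). Only the mode option is set; tokenizers and analyzers are left at their defaults:

```cql
-- Assumption: throwaway keyspace; name and replication settings are illustrative.
CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;

CREATE TABLE IF NOT EXISTS user (
    id int PRIMARY KEY,
    country text,
    firstname text,
    lastname text
);

-- SASI is declared as a custom index; CONTAINS mode allows prefix, suffix
-- and substring matching on the column.
CREATE CUSTOM INDEX IF NOT EXISTS user_lastname_sasi ON user (lastname)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = {'mode': 'CONTAINS'};

INSERT INTO user (id, country, firstname, lastname) VALUES (1, 'FR', 'Jean', 'Dupont');

-- Prefix search.
SELECT id, lastname FROM user WHERE lastname LIKE 'Du%';

-- Substring search, only possible because of CONTAINS mode.
SELECT id, lastname FROM user WHERE lastname LIKE '%pon%';
```

Without an analyzer the match is case-sensitive, so '%PON%' would not find this row.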

Who made it ?
•  Open source contribution by a team of engineers from …

Full Text Search Demo

When is it available ?
•  Right now with Cassandra ≥ 3.5
    •  available in Cassandra 3.4 but with critical bugs
•  Later improvements
    •  index on collections (List, Set & Map)
    •  OR clause (WHERE (xxx OR yyy) AND zzz)
    •  != operator

SASI vs Search Engine
SASI vs Solr/ElasticSearch/Datastax Enterprise Search ?
•  Cassandra is not a search engine ! (database = durability)
•  always slower because 2 passes (SASI index read + original Cassandra data)
•  no scoring
•  no ordering (ORDER BY)
•  no grouping (GROUP BY) → Apache Spark for analytics

Thank You @doanduyhai

duy_hai.doan@datastax.com

https://academy.datastax.com/
