Cassandra 3 new features @ GeeCON Kraków 2016

Post on 16-Apr-2017

Cassandra 3.0 new features

DuyHai DOAN Apache Cassandra Evangelist

Duyhai DOAN (@doanduyhai) Kraków, 11-13 May 2016

Who Am I ?

Apache Cassandra Evangelist
•  talks, meetups, confs
•  open-source devs (Achilles, Apache Zeppelin)
•  OSS Cassandra point of contact

☞ duy_hai.doan@datastax.com
☞ @doanduyhai

Datastax
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 450+ employees
•  Headquarters in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  Datastax Enterprise = OSS Cassandra + extra features

Agenda
•  Materialized Views (MV)
•  User Defined Functions (UDF) & User Defined Aggregates (UDA)
•  JSON syntax
•  New SASI full text search

Materialized Views (MV)

Why Materialized Views ?
•  Relieve the pain of manual denormalization

CREATE TABLE user(
    id int PRIMARY KEY,
    country text,
    …);

CREATE TABLE user_by_country(
    country text,
    id int,
    …,
    PRIMARY KEY(country, id));

Materialized Views creation

Instead of maintaining the denormalized table by hand:

CREATE TABLE user_by_country (
    country text,
    id int,
    firstname text,
    lastname text,
    PRIMARY KEY(country, id));

declare it as a view on the base table:

CREATE MATERIALIZED VIEW user_by_country
    AS SELECT country, id, firstname, lastname
    FROM user
    WHERE country IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY(country, id);

Materialized View Demo

Materialized Views Performance
•  Write performance
    •  slower than a normal write
    •  local lock + read-before-write cost (but paid only once for all views)
    •  for each base table update, worst case: mv_count x 2 (DELETE + INSERT) extra mutations for the views
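
The worst-case figure above (mv_count x 2 extra mutations) can be illustrated with a small Python model. This is only a sketch of the idea, not Cassandra's implementation: the function names `view_mutations` and `worst_case_extra_mutations`, the dict-based rows and the sample data are all assumptions made for illustration.

```python
from typing import Optional


def view_mutations(old_row: Optional[dict], new_row: dict, view_key: tuple) -> list:
    """Mutations one view needs after one base-table update (simplified model).

    old_row is what the read-before-write found (None if the row is new).
    """
    mutations = []
    new_key = tuple(new_row.get(col) for col in view_key)
    if old_row is not None:
        old_key = tuple(old_row.get(col) for col in view_key)
        # The old view row must be removed only if its primary key changed.
        if old_key != new_key and None not in old_key:
            mutations.append(("DELETE", old_key))
    # View primary key columns must be non-null (IS NOT NULL in the view definition).
    if None not in new_key:
        mutations.append(("INSERT", new_key))
    return mutations


def worst_case_extra_mutations(mv_count: int) -> int:
    """Upper bound: every view needs one DELETE plus one INSERT."""
    return mv_count * 2


# Hypothetical rows: the user with id 1 moves from one country to another.
old = {"id": 1, "country": "FR"}
new = {"id": 1, "country": "PL"}
print(view_mutations(old, new, ("country", "id")))
print(worst_case_extra_mutations(3))
```

In this model the DELETE is needed only when a column of the view's primary key changes, which is why mv_count x 2 is a worst case and not the typical cost.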

Materialized Views Performance
•  Write performance vs manual denormalization
    •  MV better because no client-server network traffic for read-before-write
    •  MV better because less network traffic for multiple views (client-side BATCH)
    •  Makes developer life easier → priceless

Materialized Views Performance
•  Read performance vs secondary index
    •  MV better because single node read (secondary index can hit many nodes)
    •  MV better because single read path (secondary index = read index + read data)

Materialized Views Consistency
•  Consistency level
    •  CL honoured for base table, ONE for MV + local batchlog
•  Weaker consistency guarantees for MV than for base table

Q & A

User Defined Functions (UDF)

Rationale
•  Push computation server-side
    •  save network bandwidth (1000 nodes!)
    •  simplify client-side code
    •  provide standard & useful functions (sum, avg …)
    •  accelerate analytics use-cases (pre-aggregation for Spark)

How to create a UDF ?

CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
    [keyspace.]functionName (param1 type1, param2 type2, …)
    CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
    RETURNS returnType
    LANGUAGE language
    AS $$ // source code here $$;

•  param1, param2, …: parameter names to refer to in the code; type1, type2, …: Cassandra types
•  CALLED ON NULL INPUT: the function is always called; a null check is mandatory in the code
•  RETURNS NULL ON NULL INPUT: if any input is null, function execution is skipped and null is returned
•  Cassandra types
    •  primitives (boolean, int, …)
    •  collections (list, set, map)
    •  tuples
    •  UDT
•  LANGUAGE: JVM supported languages
    •  Java, Scala
    •  Javascript (slow)
    •  Groovy, Jython, JRuby
    •  Clojure (JSR 223 impl issue)
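
The difference between the two null-handling clauses can be modelled in a few lines of Python. This is a sketch of the semantics described above, not code that runs inside Cassandra: the decorator names and the two example functions (`max_of`, `max_or_other`) are hypothetical.

```python
def returns_null_on_null_input(func):
    """Model of RETURNS NULL ON NULL INPUT: skip the body if any argument is null."""
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return func(*args)
    return wrapper


def called_on_null_input(func):
    """Model of CALLED ON NULL INPUT: the body always runs, nulls included."""
    return func


@returns_null_on_null_input
def max_of(a, b):
    # No null check needed: the wrapper guarantees non-null arguments.
    return max(a, b)


@called_on_null_input
def max_or_other(a, b):
    # Null check is mandatory here, otherwise max() fails on None.
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


print(max_of(1, None))
print(max_or_other(1, None))
```

The first form keeps the function body simple; the second lets the function decide what a null input means, at the price of explicit checks.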

UDF Demo

User Defined Aggregate (UDA)
•  Real use-case for UDF
•  Aggregation server-side → huge network bandwidth saving
•  Provide similar behavior for Group By, Sum, Avg etc …

How to create a UDA ?

CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
    [keyspace.]aggregateName(type1, type2, …)
    SFUNC accumulatorFunction
    STYPE stateType
    [FINALFUNC finalFunction]
    INITCOND initCond;

•  (type1, type2, …): only types, no parameter names
•  STYPE: state type
•  INITCOND: initial state value
•  SFUNC: accumulator function, with signature accumulatorFunction(stateType, type1, type2, …) RETURNS stateType
    •  accumulator function ≈ foldLeft function
•  FINALFUNC: optional final function, with signature finalFunction(stateType)
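
The foldLeft analogy can be made concrete with `functools.reduce`. The sketch below models an average aggregate over a single column; it assumes one input value per row, and the names `make_aggregate`, `avg_state` and `avg_final` are hypothetical, chosen only to mirror SFUNC, INITCOND and FINALFUNC.

```python
from functools import reduce


def make_aggregate(sfunc, initcond, finalfunc=None):
    """Build an aggregate from an accumulator, an initial state and an optional final function."""
    def aggregate(values):
        # The accumulator is folded over the values, starting from the initial state.
        state = reduce(sfunc, values, initcond)
        return finalfunc(state) if finalfunc is not None else state
    return aggregate


def avg_state(state, value):
    """Accumulator: state is a (count, total) pair; null values are ignored."""
    count, total = state
    if value is None:
        return state
    return (count + 1, total + value)


def avg_final(state):
    """Final function: turn the (count, total) state into the result."""
    count, total = state
    return total / count if count else None


average = make_aggregate(avg_state, (0, 0), avg_final)
print(average([10, 20, 30]))
```

Without a final function the aggregate simply returns the last state, which is enough for aggregates such as a sum or a count.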

UDA Demo

Gotchas
•  UDA in Cassandra is not distributed !
•  Do not execute UDA on a large number of rows (10^6 for ex.)
    •  single fat partition
    •  multiple partitions
    •  full table scan
•  → Increase client-side timeout
    •  default Java driver timeout = 12 secs

Cassandra UDA or Apache Spark ?

Consistency Level   Single/Multiple Partition(s)   Recommended Approach
ONE                 Single partition               UDA with token-aware driver because node local
ONE                 Multiple partitions            Apache Spark because distributed reads
> ONE               Single partition               UDA because data-locality lost with Spark
> ONE               Multiple partitions            Apache Spark definitely
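
The table above can be read as a small decision function. The following Python sketch simply encodes its four rows; the function name `recommend` and its parameters are hypothetical, and the returned strings paraphrase the table.

```python
def recommend(consistency_level: str, partition_count: int) -> tuple:
    """Return (approach, reason) for a given consistency level and number of partitions."""
    if partition_count < 1:
        raise ValueError("partition_count must be >= 1")
    cl_is_one = consistency_level.upper() == "ONE"
    single_partition = partition_count == 1
    if single_partition and cl_is_one:
        return ("UDA", "node local with a token-aware driver")
    if single_partition:
        return ("UDA", "data locality is lost with Spark")
    if cl_is_one:
        return ("Apache Spark", "distributed reads")
    return ("Apache Spark", "definitely")


print(recommend("ONE", 1))
print(recommend("QUORUM", 50))
```

The approach depends only on the number of partitions; the consistency level changes the reason, not the choice.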

Q & A

JSON Syntax

Why JSON ?
•  JSON is a very good exchange format
•  But a terrible schema …
•  How to have best of both worlds ?
    •  use Cassandra schema
    •  convert rows to JSON format
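
The "best of both worlds" idea (a typed schema on the storage side, JSON only as the exchange format) can be sketched in Python. This models the principle, not Cassandra's own JSON support: `USER_SCHEMA` reuses the columns of the user_by_country example, and the helper names `validate`, `row_to_json` and `json_to_row` are hypothetical.

```python
import json

# Assumed schema, mirroring the columns of the user_by_country example.
USER_SCHEMA = {"id": int, "country": str, "firstname": str, "lastname": str}


def validate(row: dict, schema: dict) -> None:
    """Reject unknown columns and values whose type does not match the schema."""
    for column, value in row.items():
        if column not in schema:
            raise ValueError(f"unknown column: {column}")
        if value is not None and not isinstance(value, schema[column]):
            raise TypeError(f"column {column} expects {schema[column].__name__}")


def row_to_json(row: dict, schema: dict) -> str:
    """Convert a schema-checked row to its JSON representation."""
    validate(row, schema)
    return json.dumps(row, sort_keys=True)


def json_to_row(payload: str, schema: dict) -> dict:
    """Parse a JSON document and check it against the schema before accepting it."""
    row = json.loads(payload)
    validate(row, schema)
    return row


row = {"id": 1, "country": "PL", "firstname": "Jan", "lastname": "Kowalski"}
payload = row_to_json(row, USER_SCHEMA)
print(payload)
print(json_to_row(payload, USER_SCHEMA) == row)
```

The schema stays the source of truth: a JSON document with a misspelled column or a wrongly typed value is rejected instead of being stored silently.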

JSON Syntax Demo

SASI full text search index

Why SASI ?
•  Searching (and full text search) was always a pain point for Cassandra
    •  limited search predicates (=, <=, <, > and >= only)
    •  limited scope (only on primary key columns)
•  Existing secondary index performance is poor
    •  reversed-index
    •  use Cassandra itself as index storage …
    •  limited predicate ( = ). Inequality predicate = full cluster scan 😱

How is it implemented ?
•  New index structure = suffix trees
•  Extended predicates (=, inequalities, LIKE %)
•  Full text search (tokenizers, stop-words, stemming …)
•  Query Planner to optimize AND predicates
•  NO, we don’t use Apache Lucene
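
Why indexing suffixes makes LIKE-style predicates cheap can be shown with a toy example. The Python sketch below uses a sorted list of suffixes (a suffix array rather than a tree) and is in no way SASI's actual on-disk structure; the names `build_suffix_index` and `contains` and the sample terms are assumptions for illustration.

```python
from bisect import bisect_left


def build_suffix_index(terms):
    """Return a sorted list of (suffix, term) pairs covering every suffix of every term."""
    entries = {(term[i:], term) for term in terms for i in range(len(term))}
    return sorted(entries)


def contains(index, needle):
    """Terms containing needle, i.e. a model of LIKE '%needle%'."""
    # Any substring of a term is a prefix of one of its suffixes, so a
    # "contains" query becomes a prefix lookup in the sorted suffix list.
    start = bisect_left(index, (needle,))
    matches = set()
    for suffix, term in index[start:]:
        if not suffix.startswith(needle):
            break
        matches.add(term)
    return sorted(matches)


index = build_suffix_index(["cassandra", "spark", "sasi"])
print(contains(index, "sa"))
print(contains(index, "ark"))
```

The lookup is a binary search followed by a short scan, so the cost depends on the number of matches and not on the number of indexed terms.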

Who made it ?
•  Open source contribution by an engineering team from …

Full Text Search Demo

When is it available ?
•  Right now with Cassandra ≥ 3.5
    •  available in Cassandra 3.4 but with critical bugs
•  Later improvements
    •  index on collections (List, Set & Map)
    •  OR clause (WHERE (xxx OR yyy) AND zzz)
    •  != operator

SASI vs Search Engine
SASI vs Solr/ElasticSearch/Datastax Enterprise Search ?
•  Cassandra is not a search engine !!! (database = durability)
•  always slower because 2 passes (SASI index read + original Cassandra data)
•  no scoring
•  no ordering (ORDER BY)
•  no grouping (GROUP BY) → Apache Spark for analytics

Q & A

Thank You @doanduyhai

duy_hai.doan@datastax.com

https://academy.datastax.com/
