cassandra 3 new features 2016
Post on 08-Jan-2017
808 Views
Preview:
TRANSCRIPT
@doanduyhai
New Cassandra 3 FeaturesDuyHai DOANApache Cassandra Evangelist
@doanduyhai
Who Am I ?Duy Hai DOAN Apache Cassandra Evangelist• talks, meetups, confs …
• open-source projects (Achilles, Apache Zeppelin ...)
• OSS Cassandra point of contact•
☞ duy_hai.doan@datastax.com ☞ @doanduyhai
2
@doanduyhai
Datastax
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 400+ employees
• Headquarter in San Francisco Bay area
• EU headquarter in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
3
@doanduyhai
Agenda
4
• Materialized Views
• User Defined Functions (UDF) and Aggregates (UDA)
• JSON Syntax
• New SASI full text search index
@doanduyhai
Materialized Views (MV)• Why ? • Gotchas
@doanduyhai
Why Materialized Views ?• Relieve the pain of manual denormalization
CREATE TABLE user(id int PRIMARY KEY, country text, …);
CREATE TABLE user_by_country( country text, id int, …, PRIMARY KEY(country, id));
6
@doanduyhai
CREATE TABLE user_by_country ( country text, id int, firstname text, lastname text, PRIMARY KEY(country, id));
Materialzed View In ActionCREATE MATERIALIZED VIEW user_by_country AS SELECT country, id, firstname, lastnameFROM user WHERE country IS NOT NULL AND id IS NOT NULLPRIMARY KEY(country, id)
7
Materialized Views Demo 8
@doanduyhai
Materialized View Performance• Write performance
• slower than normal write• local lock + read-before-write cost (but paid only once for all views)• for each base table update, worst case: mv_count x 2 (DELETE + INSERT) extra
mutations for the views
9
@doanduyhai
Materialized View Performance• Write performance vs manual denormalization
• MV better because no client-server network traffic for read-before-write • MV better because less network traffic for multiple views (client-side BATCH)
• Makes developer life easier à priceless
10
@doanduyhai
Materialized View Performance• Read performance vs secondary index
• MV better because single node read (secondary index can hit many nodes)• MV better because single read path (secondary index = read index + read data)
11
@doanduyhai
Materialized Views Consistency• Consistency level
• CL honoured for base table, ONE for MV + local batchlog
• Weaker consistency guarantees for MV than for base table.
12
Q & A
! "
13
@doanduyhai
User Define Functions (UDF)• Why ? • UDAs • Gotchas
@doanduyhai
Rationale• Push computation server-side
• save network bandwidth (1000 nodes!)• simplify client-side code• provide standard & useful function (sum, avg …)• accelerate analytics use-case (pre-aggregation for Spark)
15
@doanduyhai
How to create an UDF ?CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURNS returnTypeLANGUAGE language AS $$ // source code here$$;
16
@doanduyhai
How to create an UDF ?CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURNS returnTypeLANGUAGE languageAS $$ // source code here$$;
Param name to refer to in the code Type = Cassandra type
17
@doanduyhai
How to create an UDF ?CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURNS returnTypeLANGUAGE language // jAS $$ // source code here$$;
Always called Null-check mandatory in code
18
@doanduyhai
How to create an UDF ?CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT RETURNS returnTypeLANGUAGE language // javAS $$ // source code here$$;
If any input is null, function execution is skipped and return null
19
@doanduyhai
How to create an UDF ?CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURNS returnType LANGUAGE languageAS $$ // source code here$$;
Cassandra types • primitives (boolean, int, …) • collections (list, set, map) • tuples • UDT
20
@doanduyhai
How to create an UDF ?CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURNS returnTypeLANGUAGE languageAS $$ // source code here$$;
JVM supported languages • Java, Scala • Javascript (slow) • Groovy, Jython, JRuby • Clojure ( JSR 223 impl issue)
21
UDF Demo
22
@doanduyhai
User Defined Aggregates (UDA)• Real use-case for UDF
• Aggregation server-side à huge network bandwidth saving
• Provide similar behavior for Group By, Sum, Avg etc …
23
@doanduyhai
How to create an UDA ?CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS][keyspace.]aggregateName(type1, type2, …)SFUNC accumulatorFunction STYPE stateType [FINALFUNC finalFunction]INITCOND initCond;
Only type, no param name
State type
Initial state type
24
@doanduyhai
How to create an UDA ?CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS][keyspace.]aggregateName(type1, type2, …)SFUNC accumulatorFunction STYPE stateType [FINALFUNC finalFunction]INITCOND initCond;
Accumulator function. Signature: accumulatorFunction(stateType, type1, type2, …)
RETURNS stateType
25
@doanduyhai
How to create an UDA ?CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS][keyspace.]aggregateName(type1, type2, …)SFUNC accumulatorFunction STYPE stateType [FINALFUNC finalFunction]INITCOND initCond;
Optional final function. Signature: finalFunction(stateType)
26
UDA Demo
27
@doanduyhai
Gotchas
28
• UDA in Cassandra is not distributed !
• Do not execute UDA on a large number of rows (106 for ex.) • single fat partition• multiple partitions• full table scan
• à Increase client-side timeout• default Java driver timeout = 12 secs
@doanduyhai
Cassandra UDA or Apache Spark ?
29
Consistency Level
Single/Multiple Partition(s)
Recommended Approach
ONE Single partition UDA with token-aware driver because node local
ONE Multiple partitions Apache Spark because distributed reads
> ONE Single partition UDA because data-locality lost with Spark
> ONE Multiple partitions Apache Spark definitely
Q & A
! "
30
@doanduyhai
JSON Syntax• Why ? • Example
@doanduyhai
Why JSON ?
32
• JSON is a very good exchange format
• But a terrible schema …
• How to have best of both worlds ?• use Cassandra schema• convert rows to JSON format
@doanduyhai
JSON syntax for INSERT/UPDATE/DELETE
33
CREATE TABLE users ( id text PRIMARY KEY,
age int, state text );
INSERT INTO users JSON '{"id": "user123", "age": 42, "state": "TX"}’;
INSERT INTO users(id, age, state) VALUES('me', fromJson('20'), 'CA');
UPDATE users SET age = fromJson('25’) WHERE id = fromJson('"me"');
DELETE FROM users WHERE id = fromJson('"me"');
@doanduyhai
JSON syntax for SELECT
34
> SELECT JSON * FROM users WHERE id = 'me';[json]
---------------------------------------- {"id": "me", "age": 25, "state": "CA”}
> SELECT JSON age,state FROM users WHERE id = 'me';[json]
---------------------------------------- {"age": 25, "state": "CA"}
> SELECT age, toJson(state) FROM users WHERE id = 'me'; age | system.tojson(state) -----+---------------------- 25 | "CA"
JSON Syntax Demo
35
Q & A
! "
36
@doanduyhai
SASI index, the search is over!• Why ? • How ? • Who ? • Demo ! • When ?
@doanduyhai
Why SASI ?• Searching (and full text search) was always a pain point for Cassandra
• limited search predicates (=, <=, <, > and >= only)• limited scope (only on primary key columns)
• Existing secondary index performance is poor• reversed-index• use Cassandra itself as index storage …
• limited predicate ( = ). Inequality predicate = full cluster scan 😱
38
@doanduyhai
How ?• New index structure = suffix trees
• Extended predicates (=, inequalities, LIKE %)
• Full text search (tokenizers, stop-words, stemming …)
• Query Planner to optimize AND predicates
• NO, we don’t use Apache Lucene
39
@doanduyhai
Who ?• Open source contribution by an engineers team from …
40
SASI Demo 41
@doanduyhai
When ?• Cassandra 3.5
• Later• support for OR clause : ( aaa OR bbb) AND (ccc OR ddd)• index on collections (Set, List, Map)
42
@doanduyhai
Comparison
43
SASI vs Solr/ElasticSearch ?• Cassandra is not a search engine !!! (database = durability) • always slower because 2 passes (SASI index read + original Cassandra data)• no scoring • no ordering (ORDER BY)• no grouping (GROUP BY) à Apache Spark for analytics
Still, SASI covers 80% of search use-cases and people are happy !
Q & A
! "
44
@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/
Thank You
45
top related