SASI, Cassandra on full text search ride

DuyHai DOAN, Apache Cassandra Evangelist

Upload: duyhai-doan

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Sasi, cassandra on full text search ride

SASI, Cassandra on full text search ride
DuyHai DOAN
Apache Cassandra Evangelist

Page 2: Sasi, cassandra on full text search ride

@doanduyhai

Who Am I?

Duy Hai DOAN, Apache Cassandra Evangelist

•  talks, meetups, confs

•  open-source devs (Achilles, Apache Zeppelin…)

•  OSS Cassandra point of contact
   ☞ [email protected]
   ☞ @doanduyhai

Page 3: Sasi, cassandra on full text search ride

DataStax

•  Founded in April 2010

•  We contribute a lot to Apache Cassandra™

•  400+ customers (25 of the Fortune 100), 450+ employees

•  Headquarters in the San Francisco Bay Area

•  EU headquarters in London, offices in France and Germany

•  DataStax Enterprise = OSS Cassandra + extra features

Page 4: Sasi, cassandra on full text search ride

SASI Index

•  What is SASI?
•  Distributed Index
•  Life-cycle
•  Query Planner

Page 5: Sasi, cassandra on full text search ride

What is SASI ?

Page 6: Sasi, cassandra on full text search ride

Who?

•  Open-source contribution by an engineering team

Page 7: Sasi, cassandra on full text search ride

How?

New secondary index re-designed from scratch

•  follows the SSTable life-cycle (flush, compaction)
•  new data structures
•  full text search options
•  no dependency on Apache Lucene

SASI = SSTable-Attached Secondary Index

Page 8: Sasi, cassandra on full text search ride

SASI Demo

Page 9: Sasi, cassandra on full text search ride

SASI Demo

Page 10: Sasi, cassandra on full text search ride

Distributed Index

Page 11: Sasi, cassandra on full text search ride

Index on user country

[Ring diagram: nodes A to H, with sample index entries held on individual nodes]

FR: user1, user102, …, user493
US: user54, user483, …, user938
FR: user87, user176, …, user987
FR: user17, user409, …, user787

Page 12: Sasi, cassandra on full text search ride

Distributed search query handling

[Ring diagram: nodes A to H, with a coordinator]

1st round: concurrency factor = 1

Page 13: Sasi, cassandra on full text search ride

Distributed search query handling

[Ring diagram: nodes A to H, with a coordinator]

Not enough results?

Page 14: Sasi, cassandra on full text search ride

Distributed search query handling

[Ring diagram: nodes A to H, with a coordinator]

2nd round: concurrency factor = 2

Page 15: Sasi, cassandra on full text search ride

Distributed search query handling

[Ring diagram: nodes A to H, with a coordinator]

Still not enough results?

Page 16: Sasi, cassandra on full text search ride

Distributed search query handling

[Ring diagram: nodes A to H, with a coordinator]

3rd round: concurrency factor = 4
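
The round-by-round behaviour on the slides above can be sketched as a small simulation. This is a minimal sketch, not Cassandra's coordinator code: the function name `run_rounds`, the per-node result lists, and the rule of simply doubling the concurrency factor after each round are assumptions read off the 1, 2, 4 progression shown on the slides.

```python
from typing import Dict, List, Tuple


def run_rounds(node_results: Dict[str, List[str]], limit: int) -> Tuple[List[str], List[int]]:
    """Query nodes in rounds, doubling the concurrency factor until `limit` rows are found.

    node_results maps a node name to the rows its local index would return
    (hypothetical data standing in for a real local index lookup).
    Returns the collected rows and the concurrency factor used in each round.
    """
    pending = list(node_results)      # nodes not yet queried, in ring order
    rows: List[str] = []
    factors: List[int] = []
    factor = 1                        # 1st round: concurrency factor = 1

    while pending and len(rows) < limit:
        batch, pending = pending[:factor], pending[factor:]
        for node in batch:            # a real coordinator would query these in parallel
            rows.extend(node_results[node])
        factors.append(factor)
        factor *= 2                   # not enough results: widen the next round

    return rows[:limit], factors


# Hypothetical 8-node ring; only two nodes hold matching rows.
ring: Dict[str, List[str]] = {name: [] for name in "ABCDEFGH"}
ring["C"] = ["user54"]
ring["F"] = ["user483", "user938"]

rows, factors = run_rounds(ring, limit=2)
print(rows, factors)

# A query that matches nothing ends up visiting every node.
empty_rows, empty_factors = run_rounds({name: [] for name in "ABCDEFGH"}, limit=1)
print(empty_rows, empty_factors)
```

In this sketch the query stops as soon as the limit is reached, which is why a LIMIT clause and a selective filter keep the number of rounds small.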

Page 17: Sasi, cassandra on full text search ride

Caveat 1: query with non-restrictive filters

[Ring diagram: nodes A to H, with a coordinator]

Hit all nodes ☹

Page 18: Sasi, cassandra on full text search ride

Caveat 1 solution: always use LIMIT

[Ring diagram: nodes A to H, with a coordinator]

SELECT * FROM …
WHERE ... LIMIT 1000

Page 19: Sasi, cassandra on full text search ride

Caveat 2: 1-to-1 index (user_email)

[Ring diagram: nodes A to H, with a coordinator]

WHERE user_email LIKE '%xxx%'

Not found

Page 20: Sasi, cassandra on full text search ride

Caveat 2: 1-to-1 index (user_email)

[Ring diagram: nodes A to H, with a coordinator]

WHERE user_email LIKE '%xxx%'

Still no result

Page 21: Sasi, cassandra on full text search ride

Caveat 2: 1-to-1 index (user_email)

[Ring diagram: nodes A to H, with a coordinator]

WHERE user_email LIKE '%xxx%'

At best 1 user found
At worst 0 users found

Page 22: Sasi, cassandra on full text search ride

Caveat 2 solution: use materialized views

For a 1-to-1 index/relationship, use materialized views instead

CREATE MATERIALIZED VIEW user_by_email AS
SELECT * FROM users
WHERE user_id IS NOT NULL AND user_email IS NOT NULL
PRIMARY KEY (user_email, user_id)
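
Why the view helps for a 1-to-1 relationship can be illustrated with plain dictionaries. This is only an analogy under stated assumptions: the `users` data, the `user_by_email` dictionary, and `find_by_email` are hypothetical stand-ins, not Cassandra APIs. The point is that re-keying the rows by email turns the search into a single-key lookup instead of a question put to every node.

```python
from typing import Dict, Optional

# Hypothetical base table: user_id -> row.
users: Dict[int, Dict[str, str]] = {
    1: {"user_email": "alice@example.com", "name": "Alice"},
    2: {"user_email": "bob@example.com", "name": "Bob"},
}

# The "view": the same rows re-keyed by user_email, so a lookup by email
# goes straight to one key instead of asking every node's local index.
user_by_email: Dict[str, Dict[str, object]] = {
    row["user_email"]: {"user_id": user_id, **row}
    for user_id, row in users.items()
    if row.get("user_email") is not None   # mirrors "user_email IS NOT NULL"
}


def find_by_email(email: str) -> Optional[Dict[str, object]]:
    """Single-key lookup; returns None when no user has this email."""
    return user_by_email.get(email)


print(find_by_email("bob@example.com"))
```

Note that a lookup of this kind answers exact matches only; it does not replace a LIKE '%xxx%' search.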

Page 23: Sasi, cassandra on full text search ride

Caveat 3: fetch all rows for analytics use-case

[Ring diagram: nodes A to H, with a coordinator]

Hit all nodes ☹

Page 24: Sasi, cassandra on full text search ride

Caveat 3 solution: use co-located Apache Spark

[Ring diagram: nodes A to H, each running a local index query]

•  Local index filtering in Cassandra
•  Aggregation in Spark

Page 25: Sasi, cassandra on full text search ride

Q & A

Page 26: Sasi, cassandra on full text search ride

SASI Life-cycle

Page 27: Sasi, cassandra on full text search ride

SASI Life-cycle: in-memory

[Diagram: write path, steps 1 to 3]

1.  Commit log (Commit log1, Commit log2, … Commit logn)
2.  MemTable in memory (MemTable Table1, MemTable Table2, … MemTable TableN)
3.  Index MemTable (Index MemTable1, Index MemTable2, … Index MemTableN)

ACK the client
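
The ordering on this slide can be sketched in a few lines. This is a minimal sketch under assumptions: `commit_log`, `memtable`, `index_memtable`, and `write` are hypothetical names, the indexed column is the user's country as in the earlier example, and plain Python containers stand in for Cassandra's real structures.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

commit_log: List[Tuple[str, str]] = []                    # step 1 target
memtable: Dict[str, str] = {}                             # step 2 target: user_id -> country
index_memtable: Dict[str, Set[str]] = defaultdict(set)    # step 3 target: country -> user ids


def write(user_id: str, country: str) -> str:
    """Apply one write in the order shown on the slide, then acknowledge it.

    Overwrites and deletes are ignored to keep the sketch short.
    """
    commit_log.append((user_id, country))     # 1. append to the commit log
    memtable[user_id] = country               # 2. update the table's MemTable
    index_memtable[country].add(user_id)      # 3. update the index MemTable
    return "ACK"                              # only now is the client acknowledged


ack = write("user_id4", "FR")
write("user_id1", "US")
write("user_id5", "FR")
print(ack, sorted(index_memtable["FR"]))
```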

Page 28: Sasi, cassandra on full text search ride

IndexMemtable

Index mode, data type   Data structure               Usage
PREFIX, text            Guava ConcurrentRadixTree    name LIKE 'John%'
CONTAINS, text          Guava ConcurrentSuffixTree   name LIKE '%John%', name LIKE '%ny'
PREFIX, other           JDK ConcurrentSkipListSet    age = 20, age >= 20 AND age <= 30
SPARSE, other           JDK ConcurrentSkipListSet    age = 20, age >= 20 AND age <= 30
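
The three kinds of lookup in the table can be illustrated with ordinary Python structures. This is a sketch under assumptions, not the SASI implementation: a sorted list with `bisect` stands in for the radix tree and the skip list, a dictionary of suffixes stands in for the suffix tree, and all function names are hypothetical.

```python
import bisect
from typing import Dict, List, Set


def prefix_search(sorted_terms: List[str], prefix: str) -> List[str]:
    """PREFIX mode on text: name LIKE 'John%' over a sorted list of terms."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches: List[str] = []
    for term in sorted_terms[start:]:      # matching terms are contiguous in sorted order
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


def build_suffix_index(terms: List[str]) -> Dict[str, Set[str]]:
    """CONTAINS mode: map every suffix of every term back to the term."""
    index: Dict[str, Set[str]] = {}
    for term in terms:
        for i in range(len(term)):
            index.setdefault(term[i:], set()).add(term)
    return index


def contains_search(suffix_index: Dict[str, Set[str]], fragment: str) -> Set[str]:
    """name LIKE '%John%': a term contains `fragment` if one of its suffixes starts with it."""
    found: Set[str] = set()
    for suffix, terms in suffix_index.items():
        if suffix.startswith(fragment):
            found |= terms
    return found


def range_search(sorted_values: List[int], low: int, high: int) -> List[int]:
    """Non-text index: age >= low AND age <= high via binary search on sorted values."""
    lo = bisect.bisect_left(sorted_values, low)
    hi = bisect.bisect_right(sorted_values, high)
    return sorted_values[lo:hi]


names = sorted(["Johnny", "John", "Jonathan", "Danny"])
suffixes = build_suffix_index(names)
ages = [18, 20, 25, 30, 41]

print(prefix_search(names, "John"))
print(contains_search(suffixes, "ny"))
print(range_search(ages, 20, 30))
```

The suffix dictionary grows with the total length of the indexed terms, which hints at why a CONTAINS index costs more memory than a PREFIX one.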

Page 29: Sasi, cassandra on full text search ride

SASI Life-cycle: flush to SSTable

[Diagram: flush path, step 4]

4.  Table1, Table2, Table3 in memory are flushed to SSTable1, SSTable2, SSTable3, each with its own OnDiskIndex1, OnDiskIndex2, OnDiskIndex3

Page 30: Sasi, cassandra on full text search ride

SASI Life-cycle: compaction

SSTable1 + SSTable2 + SSTable3 → New SSTable

OnDiskIndex1 + OnDiskIndex2 + OnDiskIndex3 → New OnDiskIndex

Page 31: Sasi, cassandra on full text search ride

OnDiskIndex Files

SSTable1: user_id4 FR, user_id1 US, user_id5 FR
OnDiskIndex1: FR, US

SSTable2: user_id3 UK, user_id2 DE
OnDiskIndex2: UK, DE

Page 32: Sasi, cassandra on full text search ride

OnDiskIndex Files

[Same SSTable and OnDiskIndex layout as on the previous slide]

Suffix Tree Data structures
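
The idea that each index file covers exactly one SSTable, and that compaction produces a new index for the new SSTable, can be sketched as follows. This is a simplified sketch under assumptions: rows are (user_id, country) pairs taken from the slide, the names `build_index` and `compact` are hypothetical, the new index is assumed to be derived from the rows of the new SSTable, and duplicate keys are resolved by list order rather than by write timestamp.

```python
from typing import Dict, List, Tuple

SSTable = List[Tuple[str, str]]          # (user_id, country) rows, simplified
OnDiskIndex = Dict[str, List[str]]       # country -> user ids, simplified


def build_index(sstable: SSTable) -> OnDiskIndex:
    """Build the index attached to one SSTable from that SSTable's rows only."""
    index: OnDiskIndex = {}
    for user_id, country in sstable:
        index.setdefault(country, []).append(user_id)
    return index


def compact(sstables: List[SSTable]) -> Tuple[SSTable, OnDiskIndex]:
    """Merge SSTables into a new one and build a fresh index for it."""
    latest: Dict[str, str] = {}
    for sstable in sstables:             # later SSTables win on duplicate keys (simplification)
        for user_id, country in sstable:
            latest[user_id] = country
    new_sstable = sorted(latest.items())
    return new_sstable, build_index(new_sstable)


sstable1 = [("user_id4", "FR"), ("user_id1", "US"), ("user_id5", "FR")]
sstable2 = [("user_id3", "UK"), ("user_id2", "DE")]

index1 = build_index(sstable1)           # covers FR and US only
index2 = build_index(sstable2)           # covers UK and DE only
new_sstable, new_index = compact([sstable1, sstable2])
print(index1, index2, new_index)
```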

Page 33: Sasi, cassandra on full text search ride

Q & A

Page 34: Sasi, cassandra on full text search ride

Query Planner

Page 35: Sasi, cassandra on full text search ride

Integrated query planner

Perform optimizations on predicates
1.  build predicates tree
2.  predicates push-down & re-ordering
3.  predicate fusions for != operator

Page 36: Sasi, cassandra on full text search ride

Query optimization example

WHERE age < 100 AND fname = 'p*' AND fname != 'pa*' AND age > 21

Page 37: Sasi, cassandra on full text search ride

Query optimization example

AND is associative and commutative

Page 38: Sasi, cassandra on full text search ride

Query optimization example

!= transformed to exclusion on range scan

Page 39: Sasi, cassandra on full text search ride

Query optimization example

AND is associative and commutative
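
The optimization steps in this example can be sketched as a tiny planner. This is an illustration under assumptions, not SASI's query planner: predicates are plain tuples, both name predicates are assumed to target the same column `fname`, and `ColumnPlan` and `plan` are hypothetical names. The sketch groups predicates per column (possible because AND is associative and commutative), fuses the two age bounds into one range, and records the != predicate as an exclusion applied during the prefix scan.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

Predicate = Tuple[str, str, Any]         # (column, operator, value)


@dataclass
class ColumnPlan:
    lower: Optional[Any] = None          # exclusive lower bound collected from '>'
    upper: Optional[Any] = None          # exclusive upper bound collected from '<'
    prefix: Optional[str] = None         # prefix scan collected from "= 'p*'"
    exclusions: List[str] = field(default_factory=list)  # prefixes skipped during the scan


def plan(predicates: List[Predicate]) -> Dict[str, ColumnPlan]:
    """Group AND-ed predicates by column and fuse them into one scan per column."""
    plans: Dict[str, ColumnPlan] = {}
    for column, op, value in predicates:     # input order does not matter
        p = plans.setdefault(column, ColumnPlan())
        if op == ">":
            p.lower = value if p.lower is None else max(p.lower, value)
        elif op == "<":
            p.upper = value if p.upper is None else min(p.upper, value)
        elif op == "=":
            p.prefix = str(value).rstrip("*")
        elif op == "!=":
            p.exclusions.append(str(value).rstrip("*"))
        else:
            raise ValueError(f"unsupported operator: {op}")
    return plans


predicates: List[Predicate] = [
    ("age", "<", 100),
    ("fname", "=", "p*"),
    ("fname", "!=", "pa*"),
    ("age", ">", 21),
]

plans = plan(predicates)
print(plans)
```

Reversing the predicate list yields the same plan, which is the practical meaning of the "associative and commutative" remark.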

Page 40: Sasi, cassandra on full text search ride

Q & A

Page 41: Sasi, cassandra on full text search ride

Some Benchmarks

Page 42: Sasi, cassandra on full text search ride

Hardware specs

13 bare-metal machines
•  6 CPU HT (12 vcores)
•  64 GB RAM
•  4 SSDs in RAID0 for a total of 1.5 TB

Data set
•  13 billion rows
•  1 numerical index with 36 distinct values
•  2 text indexes with 7 distinct values
•  1 text index with 3 distinct values

Page 43: Sasi, cassandra on full text search ride

Benchmark results

Page 44: Sasi, cassandra on full text search ride

Benchmark results

Page 45: Sasi, cassandra on full text search ride

Benchmark results

Page 46: Sasi, cassandra on full text search ride

Benchmark results

Page 47: Sasi, cassandra on full text search ride

Benchmark results

Full scan using server-side paging

Predicate count   Fetched rows   Query time in sec
1                 36 109 986     609
2                 2 781 492      330
3                 1 044 547      372
4                 360 334        116

Page 48: Sasi, cassandra on full text search ride

Take Away

Page 49: Sasi, cassandra on full text search ride

Conclusion

Is it available?
•  yes, in Cassandra 3.5

Future enhancements?
•  index on collections (List, Set & Map)!
•  OR clause (WHERE (xxx OR yyy) AND zzz)
•  != operator

Page 50: Sasi, cassandra on full text search ride

Conclusion

SASI vs Solr/ElasticSearch?
•  Cassandra is not a search engine!!! (database = durability)
•  always slower because of 2 passes (SASI index read + original Cassandra data)
•  no scoring
•  no ordering (ORDER BY)
•  no grouping (GROUP BY) → Apache Spark for analytics

Still, SASI covers 80% of search use-cases and people are happy!

Page 51: Sasi, cassandra on full text search ride


@doanduyhai

[email protected]

https://academy.datastax.com/

Thank You