Scalable Full-Text Search with DataStax Enterprise



Page 1: Scalable Full-Text Search with DataStax Enterprise

©2013 DataStax 1

Scalable Full-Text Search with DataStax Enterprise

Piotr Kołaczkowski, DataStax

[email protected] @pkolaczk

Page 2: Scalable Full-Text Search with DataStax Enterprise

DataStax Enterprise

Apache Cassandra + Apache Solr + Apache Hadoop + ...

Page 3: Scalable Full-Text Search with DataStax Enterprise


Apache Cassandra

A database system:

• distributed

• replicated & durable

• scalable

• fault-tolerant (no SPOF)

• highly available

Page 4: Scalable Full-Text Search with DataStax Enterprise


Apache Cassandra

A database:

• distributed

• replicated & durable

• scalable

• fault-tolerant (no SPOF)

• highly available

• data-center aware

[Diagram: data centers in the US and Europe]

Page 5: Scalable Full-Text Search with DataStax Enterprise


Apache Cassandra Storage Engine

• Wide tables, up to 2GB per row

• Fast writes

• Fast primary-key searches

• Durability (commit log)

• Secondary indexes

• No full text search

Page 6: Scalable Full-Text Search with DataStax Enterprise


Apache Solr

A database system specialized in searching text:

• Language-aware

• e.g. lowercasing ὈΔΥΣΕΎΣ produces ὀδυσεύς

• can do stemming, stop-word elimination, etc.

• Supports relevance scoring

• Supports complex range queries

• Centralized

Page 7: Scalable Full-Text Search with DataStax Enterprise

Apache Solr

[Diagram: Solr components: Update Handler (documents to index), Request Handler (search), Lucene and the Lucene index, Response Writer (search results).]

Page 8: Scalable Full-Text Search with DataStax Enterprise

Classic Partitioning with SPOF

[Diagram: a client, a router, one master and two slaves; the data is split into partitions 1 to 4.]

Page 9: Scalable Full-Text Search with DataStax Enterprise


Availability

“High availability implies that a single fault will not bring down your system. Not ‘we’ll recover quickly.’”

-- Ben Coverston: DataStax

“The biggest problem with failover is that you're almost never using it until it really hurts. It's like backups that you never test.”

-- Rick Branson: Instagram

Page 10: Scalable Full-Text Search with DataStax Enterprise

Fully Distributed, no SPOF

[Diagram: a client connected to a ring of nodes holding partitions (p1, p3, p6).]

Page 11: Scalable Full-Text Search with DataStax Enterprise


Partitioning

Primary key determines placement*

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru

johnny age: 12 gender: M

suzy age: 10 gender: F

Page 12: Scalable Full-Text Search with DataStax Enterprise

Partitioning

PK       MD5 Hash
jim      5e02739678...
carol    a9a0198010...
johnny   f4eb27cea7...
suzy     78b421309e...

The MD5* hash operation yields a 128-bit number for keys of any size.
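The hashing step on this slide can be sketched in Python with the standard `hashlib` module. This is only an illustration: the function name `md5_token` is hypothetical, and the sketch interprets the raw 16-byte digest as an unsigned big-endian integer, whereas the real partitioner may post-process the digest differently.

```python
import hashlib


def md5_token(key: str) -> int:
    """Map a primary key of any size to a 128-bit number via MD5.

    Assumption: the key is UTF-8 encoded and the digest is read as an
    unsigned big-endian integer. The slides do not specify these details.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()  # always 16 bytes
    return int.from_bytes(digest, byteorder="big")


if __name__ == "__main__":
    for key in ("jim", "carol", "johnny", "suzy"):
        # Print the token as 32 hex digits, the form used on the slide.
        print(f"{key:<8}{md5_token(key):032x}")
```

Whatever the key length, the token always fits in 128 bits, which is what lets the ring be divided into fixed token ranges.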

Page 13: Scalable Full-Text Search with DataStax Enterprise


The “Token Ring”

Page 14: Scalable Full-Text Search with DataStax Enterprise

Node   Start             End
A      0xc000000000..1   0x0000000000..0
B      0x0000000000..1   0x4000000000..0
C      0x4000000000..1   0x8000000000..0
D      0x8000000000..1   0xc000000000..0

PK       MD5 Hash
jim      5e02739678...
carol    a9a0198010...
johnny   f4eb27cea7...
suzy     78b421309e...
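A minimal sketch of how a token could be matched to its node, given the four ranges in the table above. The helper names (`token_from_prefix`, `owner`) are hypothetical, and because the slide elides most hex digits, the sketch pads the visible prefixes with zeros; only the leading digits matter for this four-node ring.

```python
from bisect import bisect_left

TOKEN_BITS = 128

# (end token, node) pairs taken from the slide's table. Node A's range wraps
# around the ring: it starts just after 0xc0... and ends at 0x00...
RING = [
    (0x00 << 120, "A"),
    (0x40 << 120, "B"),
    (0x80 << 120, "C"),
    (0xC0 << 120, "D"),
]
END_TOKENS = [end for end, _ in RING]


def token_from_prefix(hex_prefix: str) -> int:
    """Pad a truncated hex hash (as printed on the slide) to 128 bits."""
    return int(hex_prefix.ljust(TOKEN_BITS // 4, "0"), 16)


def owner(token: int) -> str:
    """Return the node whose (start, end] range contains the token."""
    index = bisect_left(END_TOKENS, token)  # first range ending at or after token
    return RING[index % len(RING)][1]       # past the last end: wrap around to A


if __name__ == "__main__":
    keys = {"jim": "5e02739678", "carol": "a9a0198010",
            "johnny": "f4eb27cea7", "suzy": "78b421309e"}
    for key, prefix in keys.items():
        print(key, "->", owner(token_from_prefix(prefix)))
```

With these ranges, jim and suzy fall on node C, carol on node D, and johnny wraps around to node A.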

Page 19: Scalable Full-Text Search with DataStax Enterprise


Replication

carol a9a0198010...

Page 22: Scalable Full-Text Search with DataStax Enterprise

Bringing Cassandra and Solr Together

Cassandra                          Solr
Keyspace + Table (Column Family)   Core
Row                                Document
Column                             Field

Page 23: Scalable Full-Text Search with DataStax Enterprise


Schema

<schema name="wikipedia" version="1.1">
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="text" class="solr.TextField"/>
  </types>
  <fields>
    <field name="part_id" type="string" indexed="true" stored="true"/>
    <field name="description" type="text" indexed="true" stored="true"/>
  </fields>
  <defaultSearchField>description</defaultSearchField>
  <uniqueKey>part_id</uniqueKey>
</schema>

schema.xml and solrconfig.xml are stored and distributed by Cassandra

Page 24: Scalable Full-Text Search with DataStax Enterprise

Data Mapping

Cassandra table:

part_id   description
2N2222    Low power bipolar NPN transistor
TL074     Low noise JFET-input operational amplifier
LM3886    High performance integrated audio power amplifier

Resulting documents, one per row:

field         value
part_id       2N2222
description   Low power bipolar NPN transistor

field         value
part_id       TL074
description   Low noise JFET-input operational amplifier

field         value
part_id       LM3886
description   High performance integrated audio power amplifier
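The row-to-document mapping shown on this slide can be illustrated with a few lines of Python. This is a conceptual sketch only: `row_to_document` is a hypothetical helper, and the real mapping happens inside DSE, not in client code.

```python
from typing import Dict, Sequence, Tuple

# Column names and rows copied from the slide's Cassandra table.
COLUMNS: Tuple[str, ...] = ("part_id", "description")
ROWS = [
    ("2N2222", "Low power bipolar NPN transistor"),
    ("TL074", "Low noise JFET-input operational amplifier"),
    ("LM3886", "High performance integrated audio power amplifier"),
]


def row_to_document(columns: Sequence[str], row: Sequence[str]) -> Dict[str, str]:
    """Each column becomes a field; the whole row becomes one document."""
    return dict(zip(columns, row))


DOCUMENTS = [row_to_document(COLUMNS, row) for row in ROWS]

if __name__ == "__main__":
    for doc in DOCUMENTS:
        print(doc)
```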

Page 25: Scalable Full-Text Search with DataStax Enterprise

DSE Search Architecture

[Diagram: a single DSE daemon contains an embedded Tomcat running Solr, and Cassandra. Cassandra's Secondary Index API connects to the Solr Update Handler. Storage: Solr indexes, C* data, C* commit log. Client interfaces: Solr REST, CQL, Thrift.]

Page 26: Scalable Full-Text Search with DataStax Enterprise

Inserting

[Diagram: a client, a coordinator node and three replica nodes on the ring of partitions (p1, p3, p6).]

Page 27: Scalable Full-Text Search with DataStax Enterprise

Inserting through Cassandra API

[Diagram: a CQL INSERT reaches the coordinator node (step 1) and continues through numbered steps 2 to 4 across the coordinator node and a replica node. Each node contains Cassandra, the SI API, and Solr with its Update Handler.]

Page 28: Scalable Full-Text Search with DataStax Enterprise

Inserting through Solr API

[Diagram: an HTTP POST reaches the coordinator node (step 1) and continues through numbered steps 2 to 5 across the coordinator node and a replica node. Each node contains Cassandra, the SI API, and Solr with its Update Handler.]

Page 29: Scalable Full-Text Search with DataStax Enterprise

Querying

[Diagram: a client, a coordinator node and three replica nodes on the ring of partitions (p1, p3, p6).]

Page 30: Scalable Full-Text Search with DataStax Enterprise


How many nodes to contact?

• We don't know the primary key

• Theory: contact at least one replica for every token range

• Cassandra contacts all nodes

• Our custom Solr SearchComponent does intelligent shard selection

Page 31: Scalable Full-Text Search with DataStax Enterprise


Querying through CQL

SELECT title FROM solr WHERE solr_query='title:natio*';

 title
--------------------------------------------------------------------------
 Bolivia national football team 2002
 List of French born footballers who have played for other national teams
 Lithuania national basketball team at Eurobasket 2009
 Bolivia national football team 2000
 Kenya national under-20 football team
 Bolivia national football team 1999
 Israel men's national inline hockey team
 Bolivia national football team 2001

Page 32: Scalable Full-Text Search with DataStax Enterprise

Querying through CQL

[Diagram: a CQL SELECT reaches the coordinator node (step 1), which contacts all nodes; numbered steps 2 to 6 span the coordinator node and a replica node. Each node contains Solr, Cassandra and the SI API.]

Page 33: Scalable Full-Text Search with DataStax Enterprise

Querying through Solr API

[Diagram: an HTTP GET reaches the coordinator node (step 1), where the CassandraSearchComponent contacts selected nodes; numbered steps 2 to 4 span the coordinator node and a replica node. Each node contains Solr, Cassandra and the SI API.]
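As a rough illustration of the HTTP GET in this slide, the sketch below builds a Solr select URL for the same query used in the CQL example (`title:natio*`). Several details are assumptions not stated in the slides: the host name, port 8983, the core name `wiki.solr` (keyspace.table), and the `/solr/<core>/select` path layout.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def solr_select_url(host: str, core: str, query: str,
                    port: int = 8983, rows: int = 10) -> str:
    """Build the URL for an HTTP GET against a Solr select handler.

    Assumptions: default port 8983 and a /solr/<core>/select path; adjust
    both for a real deployment.
    """
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"http://{host}:{port}/solr/{core}/select?{params}"


if __name__ == "__main__":
    # "localhost" and "wiki.solr" are placeholder values.
    url = solr_select_url("localhost", "wiki.solr", "title:natio*")
    print(url)
    # The query survives URL encoding unchanged:
    print(parse_qs(urlsplit(url).query)["q"])
```

Any node can receive the request and act as coordinator; which other nodes it contacts is decided on the server side.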

Page 34: Scalable Full-Text Search with DataStax Enterprise


Shard Selection Algorithm

• Tries to minimize the number of selected shards

optimum number of shards = ⌈N / RF⌉ (N nodes, replication factor RF)

• Tries to fetch data from the closest nodes

    • local node

    • nodes on the same rack

    • nodes in the same DC

• Balances the load

Page 35: Scalable Full-Text Search with DataStax Enterprise


Shard Selection Algorithm

1. Always select the local node first.

2. Select the closest node that covers the highest number of token ranges not yet covered.

3. Repeat the previous step until all ranges are covered.
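The three steps above amount to a greedy set-cover heuristic, which can be sketched as follows. This is one possible reading of the slide, not DSE's actual code: the function name `select_shards`, the numeric `distance` values (0 = local, 1 = same rack, 2 = same DC) and the tie-breaking rule (most newly covered ranges first, then closest) are all assumptions.

```python
from typing import Dict, List, Set


def select_shards(local: str,
                  coverage: Dict[str, Set[int]],
                  distance: Dict[str, int]) -> List[str]:
    """Greedily pick nodes until every token range is covered.

    coverage maps each node to the token ranges it holds a replica of;
    distance maps each node to a hypothetical proximity score (lower is closer).
    """
    all_ranges = set().union(*coverage.values())
    selected = [local]                      # step 1: local node first
    covered = set(coverage[local])
    while covered != all_ranges:            # step 3: repeat until covered
        candidates = [n for n in coverage if n not in selected]
        # step 2: most newly covered ranges wins; closer node breaks ties
        best = max(candidates,
                   key=lambda n: (len(coverage[n] - covered), -distance[n]))
        selected.append(best)
        covered |= coverage[best]
    return selected


# Example ring: 6 nodes, RF = 3, so node j holds ranges j, j-1 and j-2.
COVERAGE = {f"n{j}": {j, (j - 1) % 6, (j - 2) % 6} for j in range(6)}
DISTANCE = {"n0": 0, "n1": 1, "n2": 1, "n3": 2, "n4": 2, "n5": 2}

if __name__ == "__main__":
    print(select_shards("n0", COVERAGE, DISTANCE))
```

In this example the local node n0 covers ranges 0, 4 and 5, and n3 covers the remaining three, so two shards are queried, matching the optimum of ⌈6 / 3⌉ = 2.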

Page 36: Scalable Full-Text Search with DataStax Enterprise

Querying Shards

Solr does not support indexing 128-bit numbers

[Diagram: the Cassandra 128-bit token (bits 127..0) is split into two 64-bit fields (bits 63..0 each): _token_lhs and _token_rhs.]
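A minimal sketch of splitting a 128-bit token into two 64-bit halves, in the spirit of the `_token_lhs` / `_token_rhs` fields on this slide. The function names are hypothetical, and storing each half as a signed 64-bit value is an assumption; the slides do not show the exact encoding DSE uses.

```python
from typing import Tuple

MASK64 = (1 << 64) - 1


def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return value - (1 << 64) if value >= (1 << 63) else value


def split_token(token: int) -> Tuple[int, int]:
    """Split a 128-bit token into (lhs, rhs): upper and lower 64 bits."""
    if not 0 <= token < (1 << 128):
        raise ValueError("token must fit in 128 bits")
    return to_signed64(token >> 64), to_signed64(token & MASK64)


def join_token(lhs: int, rhs: int) -> int:
    """Inverse of split_token: rebuild the 128-bit token from its halves."""
    return ((lhs & MASK64) << 64) | (rhs & MASK64)


if __name__ == "__main__":
    token = (1 << 64) + 5
    print(split_token(token))   # upper half 1, lower half 5
```

Each half fits a 64-bit long field, a type Solr can index.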

Page 37: Scalable Full-Text Search with DataStax Enterprise

Querying Shards

[Diagram: a token range from start to end drawn on _token_lhs and _token_rhs axes, each running up to max.]

((+_token_lhs:3074457345618258602 +_token_rhs:[3074457345618258604 TO *])
 OR (+_token_lhs:[3074457345618258603 TO 6148914691236517204])
 OR (+_token_lhs:6148914691236517205 +_token_rhs:[* TO -3074457345618258602]))
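The three-clause structure of the query above can be reproduced with a short helper. This is a sketch under stated assumptions: `token_range_query` is a hypothetical name, the range is taken to be exclusive at the start and inclusive at the end, and the start's `_token_rhs` value (3074457345618258603) is inferred from the lower bound in the slide rather than given directly.

```python
from typing import Tuple

Halves = Tuple[int, int]  # (_token_lhs, _token_rhs)


def token_range_query(start: Halves, end: Halves) -> str:
    """Build a Solr query matching tokens in the range (start, end].

    Assumes start's lhs is smaller than end's lhs, as in the slide's
    example; wrap-around and overflow of rhs + 1 are not handled.
    """
    (s_lhs, s_rhs), (e_lhs, e_rhs) = start, end
    clauses = [
        # Same upper half as the start: lower half must be past the start.
        f"(+_token_lhs:{s_lhs} +_token_rhs:[{s_rhs + 1} TO *])",
        # Upper half strictly between start and end: any lower half matches.
        f"(+_token_lhs:[{s_lhs + 1} TO {e_lhs - 1}])",
        # Same upper half as the end: lower half must not exceed the end.
        f"(+_token_lhs:{e_lhs} +_token_rhs:[* TO {e_rhs}])",
    ]
    return "(" + " OR ".join(clauses) + ")"


if __name__ == "__main__":
    print(token_range_query(
        (3074457345618258602, 3074457345618258603),
        (6148914691236517205, -3074457345618258602),
    ))
```

Called with the values above, the helper produces the same query string as the slide, written on a single line.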

Page 38: Scalable Full-Text Search with DataStax Enterprise

Workload Separation

[Diagram: a client and a ring of nodes, four of them labeled C*.]

Replication factor: Solr: 2, Cassandra: 2

Page 39: Scalable Full-Text Search with DataStax Enterprise


Questions?

• http://www.datastax.com/docs

• http://www.datastax.com/products/enterprise

Cassandra Summit 2013, June 11-12, San Francisco, CA

http://www.datastax.com/company/news-and-events/events/cassandrasummit2013