SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Upload: anshum-gupta

Post on 16-Apr-2017


Page 1: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

The Fifth Elephant 2013, Bangalore, 12th July 2013

SolrCloud and NoSQL

Anshum Gupta

Page 2: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Who am I?

• Anshum Gupta
• Search and related stuff for around 8 years now
• Apache Lucene since 2006, Solr since 2010
• Currently:
• Helped launch the first AWS search service, CloudSearch.
• Places I’ve worked at:

Page 3: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Big Data

• Real Value = Process + Store + Search

• Search
- No longer expensive
- Affordable
- Necessity
- Can get as complicated as you’d want it to get.

[Figure: repeated “Loads of Data” labels, together with “Data” and “Search”]

Page 4: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

NoSQL Databases

• Wikipedia says: A NoSQL database provides a mechanism for storage and retrieval of data that use looser consistency models than traditional relational databases in order to achieve horizontal scaling and higher availability. Some authors refer to them as "Not only SQL" to emphasize that some NoSQL systems do allow SQL-like query language to be used.

• Non-traditional data stores
• Don’t use / aren’t designed around SQL
• May not give full ACID guarantees
- Offer other advantages, such as greater scalability, as a tradeoff
• Distributed, fault-tolerant architecture

Page 5: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

DB Rankings: Overall

Source: http://db-engines.com/en/ranking

Page 7: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

MongoDB

• Data Model: BSON
• Distributed Model: Sharded master-slave async replication.
• Consistency: Per table write lock.

• Search:
- Built-in full-text search, with large gaps compared to dedicated ‘search’ players.
- Alternate and popular solution: use another search solution along with MongoDB (Solr?). Consistency issues and more.

Page 8: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Cassandra

• Data Model: Column based data store.
• Distributed Model: Uses consistent hashing for distributed updates.
• Consistency: Timestamps for consistency.

• Search
- Lucandra: Lucene based search.
- Solandra: Solr based search.

Page 9: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Riak

• Implements principles from the Amazon Dynamo paper.

• Riak Search
- Distributed index and full-text search engine.
- Merge Index – Storage backend used by Riak Search. It’s a pure Erlang storage format and among other things uses the Apache Lucene file format.
- Riak Solr – Adds a subset of Apache Solr HTTP capabilities to Riak Search.

• Yokozuna
- “next generation of Riak Search that marries Riak with Apache Solr”.
- Sits alongside Riak.

Page 10: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

The story so far…

• Different approaches for:
- Data Model
- Distributed Update handling
- Consistency management

• Work reasonably well on different fronts as far as storage is concerned.

• Search:
- There’s barely anything native and in the core.
- (Almost) Everyone is trying to fuse together with Lucene/Solr.

Page 11: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Adding Search to NoSQL

• To begin with, wasn’t built for that
• Compromises
• Integration is the buzzword.
• Lucandra, Solandra… No strong contender yet.

Page 12: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Adding NoSQL to Search

• Already store documents
• With growing data, more intuitive for this to happen
• More intuitive = makes more sense = easier (perhaps)
• No key player as yet.

Page 14: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Apache Solr 4 at a glance

• Document Oriented NoSQL Search Server
- Data-format agnostic (JSON, XML, CSV, binary)
- Schema-less options (more coming soon)
• Distributed
- Multi-tenanted
• Fault Tolerant
- HA + No single points of failure
• Atomic Updates
• Optimistic Concurrency
• Near Real-time Search
• Full-Text search + Hit Highlighting
• Tons of specialized queries: Faceted search, grouping, pseudo-join, spatial search, functions

The desire for these features drove some of the “SolrCloud” architecture

Page 15: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

SolrCloud Design Goals

• Automatic Distributed Indexing
• HA for Writes
• Durable Writes
• Near Real-time Search
• Real-time get
• Optimistic Concurrency

Page 16: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

SolrCloud

• Distributed Indexing designed from the ground up to accommodate desired features

• CAP Theorem
- Consistency, Availability, Partition Tolerance (saying goes “choose 2”)
- Reality: Must handle P – the real choice is tradeoffs between C and A

• Ended up with a CP system (roughly)
- Value Consistency over Availability
- Eventual consistency is incompatible with optimistic concurrency
- Closest to MongoDB in architecture

• We still do well with Availability
- All N replicas of a shard must go down before we lose writability for that shard
- For a network partition, the “big” partition remains active (i.e. Availability isn’t “on” or “off”)

Page 17: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

SolrCloud

[Figure: shard1 and shard2, each with replica1, replica2, and replica3, next to a ZooKeeper quorum of five ZK nodes. A request to http://.../solr/collection1/query?q=awesome is served through load-balanced sub-requests to the shard replicas.]

ZooKeeper tree shown in the figure:

/configs
    /myconf
        solrconfig.xml
        schema.xml
/clusterstate.json
/aliases.json
/livenodes
    server1:8983/solr
    server2:8983/solr
/collections
    /collection1
        configName=myconf
        /shards
            /shard1
                server1:8983/solr
                server2:8983/solr
            /shard2
                server3:8983/solr
                server4:8983/solr

ZooKeeper holds cluster state
• Nodes in the cluster
• Collections in the cluster
• Schema & config for each collection
• Shards in each collection
• Replicas in each shard
• Collection aliases

Page 18: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Distributed Indexing

[Figure: a document update sent to http://.../solr/collection1/update. Shard1 holds Replica1 and Replica2; Shard2 holds Replica3 and Replica4. Legend: document update, leader, non-leading replica.]

• Update sent to any node
• Solr determines what shard the document is on, and forwards to shard leader
• Shard Leader versions document and forwards to all other shard replicas
• HA for updates (if one leader fails, another takes its place)

Page 19: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Optimistic Concurrency

• Conditional update based on document version

Exchange between client and Solr:

1. /get document
2. Modify document, retaining _version_
3. /update resulting document
4. Go back to step #1 if fail code=409
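The get/modify/update/retry loop on this slide can be sketched in Python against an in-memory stand-in for Solr. This is only an illustration under stated assumptions: `FakeSolr`, `VersionConflict`, and `update_with_retry` are hypothetical names, no HTTP is involved, and the exception plays the role of the 409 response.

```python
import itertools


class VersionConflict(Exception):
    """Stands in for Solr's HTTP 409 response (hypothetical name)."""


class FakeSolr:
    """In-memory stand-in for a Solr collection; not a real client."""

    def __init__(self):
        self._docs = {}
        self._versions = itertools.count(1)  # source of increasing versions

    def get(self, doc_id):
        # Return a copy so callers cannot mutate stored state directly.
        return dict(self._docs[doc_id])

    def update(self, doc):
        expected = doc.get("_version_")
        current = self._docs.get(doc["id"])
        # A supplied _version_ must match the stored one, otherwise reject.
        if expected is not None and (
            current is None or current["_version_"] != expected
        ):
            raise VersionConflict(doc["id"])
        stored = dict(doc, _version_=next(self._versions))
        self._docs[doc["id"]] = stored
        return stored["_version_"]


def update_with_retry(solr, doc_id, modify, max_attempts=5):
    for _ in range(max_attempts):
        doc = solr.get(doc_id)        # step 1: /get document
        modify(doc)                   # step 2: modify, retaining _version_
        try:
            return solr.update(doc)   # step 3: /update resulting document
        except VersionConflict:
            continue                  # step 4: conflict, go back to step 1
    raise RuntimeError("gave up after repeated version conflicts")
```

A caller would pass a function that edits the document in place, for example `update_with_retry(solr, "doc1", lambda d: d.update(views=d["views"] + 1))`.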

Page 20: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Distributed Query Requests

Distributed query across all shards in the collection
http://localhost:8983/solr/collection1/query?q=foo

Explicitly specify node addresses to load-balance across
shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
- Equivalent nodes in a list are separated by “|”
- Different phases of the same distributed request use the same node

Specify logical shards to search across
shards=NY,NJ,CT

Specify multiple collections to search across
collection=collection1,collection2

public CloudSolrServer(String zkHost)
- ZK-aware SolrJ Java client that load-balances across all nodes in the cluster
- Calculates where a document belongs and sends it directly to the shard leader (new)
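As a rough illustration of the request variants on this slide, the following Python sketch assembles the query URLs with the standard library. The helper names `build_query_url` and `shards_param` are made up for this example; the hosts and parameter values are the ones shown on the slide. Note that `urlencode` percent-encodes characters such as "|" and ",".

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/collection1/query"


def build_query_url(q, **params):
    """Return a query URL; extra keyword arguments become request parameters."""
    return BASE + "?" + urlencode({"q": q, **params})


def shards_param(shards):
    """Join equivalent nodes of one shard with '|' and distinct shards with ','."""
    return ",".join("|".join(nodes) for nodes in shards)


# Query across all shards of the collection.
all_shards = build_query_url("foo")

# Explicit node addresses, two equivalent nodes per shard.
explicit = build_query_url("foo", shards=shards_param([
    ["localhost:8983/solr", "localhost:8900/solr"],
    ["localhost:7574/solr", "localhost:7500/solr"],
]))

# Logical shards and multiple collections.
logical = build_query_url("foo", shards="NY,NJ,CT")
multi = build_query_url("foo", collection="collection1,collection2")
```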

Page 21: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Document Routing

[Figure: Hash Ring for numShards=4, router=compositeId]

Hash ranges on the ring, one per shard (shard1 to shard4): 00000000-3fffffff, 40000000-7fffffff, 80000000-bfffffff, c0000000-ffffffff

Indexing: id = BigCo!doc5 → (MurmurHash3) → 9f27 3c71 → shard1

Querying: q=my_query, shard.keys=BigCo! → (hash) → 9f27 0000 to 9f27 ffff → shard1
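The lookup on this slide can be sketched in Python as a range search over the hash ring. This is a simplified sketch, not Solr's router: it takes the hash value 9f273c71 straight from the slide instead of computing MurmurHash3, and only the pairing of shard1 with 80000000-bfffffff is implied by the slide's example. The other name-to-range pairings, and the function names, are assumptions.

```python
# Hash ranges from the slide (numShards=4). Only shard1's range is implied by
# the slide's example; the remaining name-to-range pairings are assumptions.
RANGES = {
    "shard1": (0x80000000, 0xBFFFFFFF),
    "shard2": (0xC0000000, 0xFFFFFFFF),
    "shard3": (0x00000000, 0x3FFFFFFF),
    "shard4": (0x40000000, 0x7FFFFFFF),
}


def shard_for_hash(doc_hash):
    """Return the shard whose inclusive range contains the 32-bit hash."""
    for name, (lo, hi) in RANGES.items():
        if lo <= doc_hash <= hi:
            return name
    raise ValueError(f"hash {doc_hash:#010x} is outside the ring")


def prefix_range(doc_hash):
    """Range of all hashes sharing the upper 16 bits (the 'BigCo!' part)."""
    top = doc_hash & 0xFFFF0000
    return top, top | 0x0000FFFF


def shards_for_range(lo, hi):
    """Shards whose range overlaps [lo, hi]; these are the ones to query."""
    return [name for name, (s_lo, s_hi) in RANGES.items()
            if lo <= s_hi and s_lo <= hi]


doc_hash = 0x9F273C71  # hash of "BigCo!doc5" as printed on the slide
target_shard = shard_for_hash(doc_hash)
query_shards = shards_for_range(*prefix_range(doc_hash))
```

Because every document whose id starts with "BigCo!" shares the same upper 16 bits, a query restricted with shard.keys=BigCo! only needs the shards overlapping that narrow range.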

Page 22: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Durable Writes

• Lucene flushes writes to disk on a “commit”
- Uncommitted docs are lost on a crash (at Lucene level)

• Solr 4 maintains its own transaction log
- Contains uncommitted documents
- Services real-time get requests
- Recovery (log replay on restart)
- Supports distributed “peer sync”

• Writes forwarded to multiple shard replicas
- A replica can go away forever w/o collection data loss
- A replica can do a fast “peer sync” if it’s only slightly out of date
- A replica can do a full index replication (copy) from a leader.
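To make the role of the transaction log concrete, here is a toy Python sketch of "log first, replay on restart". It is not how Solr stores its log: `TinyIndex` is a hypothetical class, the log is a plain JSON-lines file, and the `committed` dict merely stands in for the on-disk Lucene index.

```python
import json
import os
import tempfile


class TinyIndex:
    """Toy index with an append-only transaction log (illustrative only)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.committed = {}    # stands in for the committed Lucene index
        self.uncommitted = {}  # documents written since the last commit
        self._replay()

    def _replay(self):
        # Recovery: re-apply every logged update that was never committed.
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    doc = json.loads(line)
                    self.uncommitted[doc["id"]] = doc

    def add(self, doc):
        # Write to the log before acknowledging, so a crash cannot lose it.
        with open(self.log_path, "a") as log:
            log.write(json.dumps(doc) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.uncommitted[doc["id"]] = doc

    def realtime_get(self, doc_id):
        # Uncommitted documents are visible here before any commit.
        return self.uncommitted.get(doc_id) or self.committed.get(doc_id)

    def commit(self):
        self.committed.update(self.uncommitted)
        self.uncommitted.clear()
        open(self.log_path, "w").close()  # truncate the log after the commit


log_file = os.path.join(tempfile.mkdtemp(), "tlog.jsonl")
index = TinyIndex(log_file)
index.add({"id": "doc1", "title": "durable"})

# Simulate a crash before commit: a fresh instance replays the log.
recovered = TinyIndex(log_file)
```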

Page 23: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Collections API

Create a new document collection
http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4&replicationFactor=3

Actions: CREATE, DELETE, ALIAS, SPLITSHARD, DELETESHARD, RELOAD
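A small Python helper can assemble Collections API URLs like the CREATE call on this slide. The function name `collections_api_url` and its `host` default are assumptions for illustration; the action names and parameters are the ones that appear in this deck.

```python
from urllib.parse import urlencode


def collections_api_url(action, host="localhost:8983", **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"http://{host}/solr/admin/collections?{query}"


# The CREATE request from the slide.
create_url = collections_api_url(
    "CREATE", name="mycollection", numShards=4, replicationFactor=3
)

# The SPLITSHARD request used later in the deck.
split_url = collections_api_url(
    "SPLITSHARD", collection="mycollection", shard="Shard2"
)
```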

Page 24: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Solr 4.3: Seamless Online Shard Splitting

[Figure: Shard1, Shard2, and Shard3, each with a leader and a replica; Shard2 is split into Shard2_0 and Shard2_1 while an update arrives]

1. http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=Shard2
2. New sub-shards created in “construction” state
3. Leader starts forwarding applicable updates, which are buffered by the sub-shards
4. Leader index is split and installed on the sub-shards
5. Sub-shards apply buffered updates, then become “active” leaders, and the old shard becomes “inactive”
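The split sequence above can be sketched as a small Python simulation. It is a toy model under stated assumptions: documents are keyed by their hash, the parent range is divided into two equal halves, and `SubShard`, `split_range`, and `split_shard` are hypothetical names rather than Solr classes.

```python
def split_range(lo, hi):
    """Split an inclusive hash range into two contiguous halves."""
    mid = lo + (hi - lo) // 2
    return (lo, mid), (mid + 1, hi)


class SubShard:
    def __init__(self, name, hash_range):
        self.name = name
        self.lo, self.hi = hash_range
        self.state = "construction"  # step 2: created in construction state
        self.buffer = []
        self.docs = {}

    def covers(self, doc_hash):
        return self.lo <= doc_hash <= self.hi

    def receive(self, doc_hash, doc):
        if self.state == "construction":
            self.buffer.append((doc_hash, doc))  # step 3: buffer the update
        else:
            self.docs[doc_hash] = doc

    def install(self, parent_docs):
        # Step 4: take over the slice of the parent index in this range.
        self.docs = {h: d for h, d in parent_docs.items() if self.covers(h)}

    def activate(self):
        # Step 5: apply buffered updates, then become active.
        for doc_hash, doc in self.buffer:
            self.docs[doc_hash] = doc
        self.buffer.clear()
        self.state = "active"


def split_shard(parent_name, parent_range, parent_docs, incoming):
    subs = [SubShard(f"{parent_name}_{i}", r)
            for i, r in enumerate(split_range(*parent_range))]
    for doc_hash, doc in incoming:       # updates arriving during the split
        for sub in subs:
            if sub.covers(doc_hash):
                sub.receive(doc_hash, doc)
    for sub in subs:
        sub.install(parent_docs)
        sub.activate()
    return subs
```

Buffering while the parent index is being split is what keeps the operation online: updates that arrive mid-split are not lost, they are applied once the sub-shard has its share of the index.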

Page 25: SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Solr 4.4: Schemaless

• “Schemaless” normally means that the client(s) have an implicit schema.
• “No Schema” is impossible for anything based on Lucene
- A field must be indexed the same way across documents
• Dynamic fields: convention over configuration
- Only pre-define types of fields, not fields themselves
- No guessing. Any field name ending in _i is an integer
• “Guessed Schema” or “Type Guessing”
- For previously unknown fields, guess using JSON type as a hint
- Coming soon (4.4?) based on the Dynamic Schema work
• Many disadvantages to guessing
- Lose ability to catch field naming errors
- Can’t optimize based on types
- Guessing incorrectly means having to start over
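The difference between dynamic fields and type guessing can be illustrated with a short Python sketch. It is not Solr's implementation: only the `_i` suffix comes from the slide, the other suffixes and all function names are assumptions, and the type names are generic labels.

```python
import json

# Suffix conventions for dynamic fields. Only "_i" comes from the slide;
# the other suffixes are assumed here for illustration.
SUFFIX_TYPES = {"_i": "integer", "_s": "string", "_b": "boolean", "_f": "float"}


def type_from_name(field):
    """Dynamic fields: the name alone decides the type, no guessing."""
    for suffix, field_type in SUFFIX_TYPES.items():
        if field.endswith(suffix):
            return field_type
    return None


def guess_type(value):
    """Type guessing: use the JSON value's type as a hint."""
    if isinstance(value, bool):  # check bool first, it is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    return "string"


def infer_schema(docs):
    """Fix each field's type on first sight; later mismatches are errors."""
    schema = {}
    for doc in docs:
        for field, value in doc.items():
            field_type = type_from_name(field) or guess_type(value)
            if schema.setdefault(field, field_type) != field_type:
                raise TypeError(
                    f"{field}: indexed as {schema[field]}, got {field_type}"
                )
    return schema


docs = json.loads('[{"id": "a", "count_i": 3, "price": 10}]')
schema = infer_schema(docs)
```

Feeding a later document with `"price": 9.99` into the same run would raise, which mirrors the last bullet: a wrong first guess means starting over.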