Solr 4 The NoSQL Search Server Yonik Seeley May 30, 2013

Page 1: Solr  4 The  NoSQL  Search Server

Solr 4: The NoSQL Search Server

Yonik Seeley, May 30, 2013

Page 2: Solr  4 The  NoSQL  Search Server

NoSQL Databases

• Wikipedia says: A NoSQL database provides a mechanism for storage and retrieval of data that use looser consistency models than traditional relational databases in order to achieve horizontal scaling and higher availability. Some authors refer to them as "Not only SQL" to emphasize that some NoSQL systems do allow SQL-like query language to be used.

• Non-traditional data stores
  • Doesn’t use / isn’t designed around SQL
  • May not give full ACID guarantees
  • Offers other advantages, such as greater scalability, as a tradeoff

• Distributed, fault-tolerant architecture

Page 3: Solr  4 The  NoSQL  Search Server

Solr Cloud Design Goals

• Automatic Distributed Indexing
• HA for Writes
• Durable Writes
• Near Real-time Search
• Real-time get
• Optimistic Concurrency

Page 4: Solr  4 The  NoSQL  Search Server

Solr Cloud

• Distributed Indexing designed from the ground up to accommodate desired features

• CAP Theorem
  • Consistency, Availability, Partition Tolerance (saying goes “choose 2”)
  • Reality: must handle P; the real choice is a tradeoff between C and A

• Ended up with a CP system (roughly)
  • Value Consistency over Availability
  • Eventual consistency is incompatible with optimistic concurrency
  • Closest to MongoDB in architecture

• We still do well with Availability
  • All N replicas of a shard must go down before we lose writability for that shard
  • For a network partition, the “big” partition remains active (i.e. Availability isn’t “on” or “off”)

Page 5: Solr  4 The  NoSQL  Search Server


Solr 4

Page 6: Solr  4 The  NoSQL  Search Server

Solr 4 at a glance

• Document Oriented NoSQL Search Server
  • Data-format agnostic (JSON, XML, CSV, binary)
  • Schema-less options (more coming soon)

• Distributed
• Multi-tenanted

• Fault Tolerant
  • HA + No single points of failure

• Atomic Updates
• Optimistic Concurrency
• Near Real-time Search
• Full-Text search + Hit Highlighting
• Tons of specialized queries: Faceted search, grouping, pseudo-join, spatial search, functions

The desire for these features drove some of the “SolrCloud” architecture

Page 7: Solr  4 The  NoSQL  Search Server

Quick Start

1. Unzip the binary distribution (.ZIP file)
   Note: no “installation” required

2. Start Solr
   $ cd example
   $ java -jar start.jar

3. Go!
   Browse to http://localhost:8983/solr for the new admin interface

Page 8: Solr  4 The  NoSQL  Search Server


New admin UI

Page 9: Solr  4 The  NoSQL  Search Server

Add and Retrieve document

$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
  { "id" : "book1",
    "title" : "American Gods",
    "author" : "Neil Gaiman"
  }
]'

$ curl http://localhost:8983/solr/get?id=book1
{ "doc": {
    "id" : "book1",
    "author": "Neil Gaiman",
    "title" : "American Gods",
    "_version_": 1410390803582287872
  }
}

Note: no type of “commit” is necessary to retrieve documents via /get (real-time get)
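The same two requests can be prepared from Python. This is a minimal sketch using only the standard library; it assumes the Quick Start server at http://localhost:8983/solr, and the helper names build_update_request and build_get_url are made up for illustration. It builds the requests without sending them.

```python
import json
import urllib.parse
import urllib.request

SOLR_URL = "http://localhost:8983/solr"  # assumption: the Quick Start example server


def build_update_request(docs, base_url=SOLR_URL):
    """Build (but do not send) a JSON POST to the /update handler."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/update",
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


def build_get_url(doc_id, base_url=SOLR_URL):
    """Return the real-time get URL for a document id."""
    return f"{base_url}/get?" + urllib.parse.urlencode({"id": doc_id})
```

With a server running, passing the request to urllib.request.urlopen would send it; reading build_get_url("book1") right afterwards should need no commit, as the note above says.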

Page 10: Solr  4 The  NoSQL  Search Server

Simplified JSON Delete Syntax

• Single delete-by-id
  {"delete":"book1"}

• Multiple delete-by-id
  {"delete":["book1","book2","book3"]}

• Delete with optimistic concurrency
  {"delete":{"id":"book1", "_version_":123456789}}

• Delete by Query
  {"delete":{"query":"tag:category1"}}
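The four delete forms differ only in the shape of the value under "delete". A small sketch of a helper that produces each body; delete_command is a hypothetical name, not part of any Solr client.

```python
import json


def delete_command(ids=None, query=None, version=None):
    """Return the JSON body for one delete command in the simplified syntax."""
    if query is not None:
        if ids is not None or version is not None:
            raise ValueError("delete-by-query takes no ids or _version_")
        return json.dumps({"delete": {"query": query}})
    if ids is None:
        raise ValueError("give either ids or query")
    if version is not None:
        if not isinstance(ids, str):
            raise ValueError("_version_ applies to a single id")
        return json.dumps({"delete": {"id": ids, "_version_": version}})
    # A single id (str) or a list of ids is passed through unchanged.
    return json.dumps({"delete": ids})
```

The resulting string is what would go after -d in a curl call to /update.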

Page 11: Solr  4 The  NoSQL  Search Server

Atomic Updates

$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
  {"id"        : "book1",
   "pubyear_i" : { "add" : 2001 },
   "ISBN_s"    : { "add" : "0-380-97365-1"}
  }
]'

$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
  {"id"       : "book1",
   "copies_i" : { "inc" : 1},
   "cat"      : { "add" : "fantasy"},
   "ISBN_s"   : { "set" : "0-380-97365-0"},
   "remove_s" : { "set" : null }
  }
]'
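To make the three modifiers concrete, here is a toy in-memory model of what "add", "set" and "inc" do to a stored document. It is a simplified sketch, not Solr's implementation: it assumes "inc" on a missing field starts from 0 and that "add" on an existing single value turns it into a list, neither of which the slides state.

```python
def apply_atomic_update(stored, update):
    """Return a new document with the atomic-update modifiers applied."""
    doc = dict(stored)
    for field, value in update.items():
        if not isinstance(value, dict):
            doc[field] = value  # plain value, e.g. the "id" key
            continue
        for op, arg in value.items():
            if op == "set":
                if arg is None:
                    doc.pop(field, None)  # "set" to null removes the field
                else:
                    doc[field] = arg
            elif op == "inc":
                doc[field] = doc.get(field, 0) + arg  # assumed default of 0
            elif op == "add":
                current = doc.get(field)
                if current is None:
                    doc[field] = arg
                elif isinstance(current, list):
                    doc[field] = current + [arg]
                else:
                    doc[field] = [current, arg]
            else:
                raise ValueError(f"unknown modifier: {op}")
    return doc
```

Feeding it the two update bodies above, one after the other, leaves book1 with the new ISBN, one more copy, a "cat" value, and no "remove_s" field.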

Page 12: Solr  4 The  NoSQL  Search Server

Optimistic Concurrency

• Conditional update based on document version

Exchange between client and Solr:

1. /get document
2. Modify document, retaining _version_
3. /update resulting document
4. Go back to step #1 if the update fails with code=409
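The four steps form a retry loop. The sketch below runs that loop against an in-memory stand-in, so it needs no server; FakeSolr, VersionConflict and update_with_retry are hypothetical names, and the conflict exception stands in for an HTTP 409 response.

```python
class VersionConflict(Exception):
    """Stands in for an HTTP 409 response from /update."""


class FakeSolr:
    """In-memory stand-in for /get and /update; not a Solr client."""

    def __init__(self):
        self._docs = {}
        self._clock = 0

    def get(self, doc_id):
        doc = self._docs.get(doc_id)
        return dict(doc) if doc else None

    def update(self, doc):
        current = self._docs.get(doc["id"])
        supplied = doc.get("_version_", 0)
        if supplied > 1 and (current is None or current["_version_"] != supplied):
            raise VersionConflict(doc["id"])
        self._clock += 1
        # The +1 offset keeps generated versions above the special values 0 and 1.
        self._docs[doc["id"]] = {**doc, "_version_": self._clock + 1}


def update_with_retry(solr, doc_id, modify, max_attempts=5):
    """Steps 1-4: get, modify while keeping _version_, update, retry on conflict."""
    for _ in range(max_attempts):
        doc = solr.get(doc_id)   # 1. /get document
        modify(doc)              # 2. modify, retaining _version_
        try:
            solr.update(doc)     # 3. /update resulting document
            return doc
        except VersionConflict:
            continue             # 4. back to step 1 on a 409
    raise RuntimeError("gave up after repeated version conflicts")
```

Because modify runs again on a freshly fetched copy after each conflict, a competing write is never silently overwritten.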

Page 13: Solr  4 The  NoSQL  Search Server

Version semantics

_version_   Update Semantics
> 1         Document version must exactly match supplied _version_
1           Document must exist
< 0         Document must not exist
0           Don’t care (normal overwrite if exists)

• Specifying _version_ on any update invokes optimistic concurrency
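The table reads directly as a predicate. A minimal sketch, where version_check_passes is a hypothetical name and stored is None when the document does not exist:

```python
def version_check_passes(supplied, stored):
    """Apply the _version_ table; `stored` is None when the document is absent."""
    if supplied > 1:
        return stored is not None and stored == supplied
    if supplied == 1:
        return stored is not None
    if supplied < 0:
        return stored is None
    return True  # 0: don't care, normal overwrite
```

An update whose check fails is the case that comes back as a version conflict.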

Page 14: Solr  4 The  NoSQL  Search Server

Optimistic Concurrency Example

Get the document:

$ curl http://localhost:8983/solr/get?id=book2
{ "doc" : {
    "id":"book2",
    "title":["Neuromancer"],
    "author":"William Gibson",
    "copiesIn_i":7,
    "copiesOut_i":3,
    "_version_":123456789 }}

Modify and resubmit, using the same _version_:

$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[ { "id":"book2",
    "title":["Neuromancer"],
    "author":"William Gibson",
    "copiesIn_i":6,
    "copiesOut_i":4,
    "_version_":123456789 }]'

Alternately, specify the _version_ as a request parameter:

$ curl http://localhost:8983/solr/update?_version_=123456789 -H 'Content-type:application/json' -d […]

Page 15: Solr  4 The  NoSQL  Search Server

Optimistic Concurrency Errors

• HTTP Code 409 (Conflict) returned on version mismatch

$ curl -i http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[{"id":"book1", "author":"Mr Bean", "_version_":54321}]'

HTTP/1.1 409 Conflict
Content-Type: text/plain;charset=UTF-8
Transfer-Encoding: chunked

{
  "responseHeader":{
    "status":409,
    "QTime":1},
  "error":{
    "msg":"version conflict for book1 expected=12345 actual=1408814192853516288",
    "code":409}}

Page 16: Solr  4 The  NoSQL  Search Server


Schema

Page 17: Solr  4 The  NoSQL  Search Server

Schema REST API

• Restlet is now integrated with Solr

• Get a specific field
  curl http://localhost:8983/solr/schema/fields/price
  {"field":{
      "name":"price",
      "type":"float",
      "indexed":true,
      "stored":true }}

• Get all fields
  curl http://localhost:8983/solr/schema/fields

• Get Entire Schema!
  curl http://localhost:8983/solr/schema

Page 18: Solr  4 The  NoSQL  Search Server

Dynamic Schema

• Add a new field (Solr 4.4)
  curl -XPUT http://localhost:8983/solr/schema/fields/strength -d '
  {"type":"float", "indexed":"true"}
  '

• Works in distributed (cloud) mode too!

• Schema must be managed & mutable (not currently the default)
  <schemaFactory class="ManagedIndexSchemaFactory">
    <bool name="mutable">true</bool>
    <str name="managedSchemaResourceName">managed-schema</str>
  </schemaFactory>

Page 19: Solr  4 The  NoSQL  Search Server

Schemaless

• “Schemaless” usually means that the client(s) have an implicit schema

• “No Schema” is impossible for anything based on Lucene
  • A field must be indexed the same way across documents

• Dynamic fields: convention over configuration
  • Only pre-define types of fields, not fields themselves
  • No guessing. Any field name ending in _i is an integer

• “Guessed Schema” or “Type Guessing”
  • For previously unknown fields, guess using JSON type as a hint
  • Coming soon (4.4?) based on the Dynamic Schema work

• Many disadvantages to guessing
  • Lose ability to catch field naming errors
  • Can’t optimize based on types
  • Guessing incorrectly means having to start over
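The contrast between the two approaches can be sketched in a few lines. Only the _i and _s suffixes appear in this deck, so the table below stops there; the type names and the guessing rules are illustrative assumptions, not the actual schema or the planned feature.

```python
# Only the _i and _s suffixes appear in this deck; a real dynamic-field list
# lives in schema.xml and is not reproduced here. Type names are illustrative.
DYNAMIC_SUFFIXES = {"_i": "int", "_s": "string"}


def type_by_convention(field_name):
    """Dynamic fields: the suffix alone decides the type, no guessing."""
    for suffix, field_type in DYNAMIC_SUFFIXES.items():
        if field_name.endswith(suffix):
            return field_type
    raise KeyError(f"no dynamic field rule matches {field_name!r}")


def type_by_guess(value):
    """Type guessing: use the JSON value's type as a hint (hypothetical mapping)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"
```

A misspelled field such as "pubyer" raises an error under the convention, whereas guessing would accept it and create a new field. A year sent once as the string "2001" would likewise be guessed as a string.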

Page 20: Solr  4 The  NoSQL  Search Server


Solr Cloud

Page 21: Solr  4 The  NoSQL  Search Server

Solr Cloud

[Diagram: a collection with two shards (shard1, shard2), each with replica1, replica2 and replica3. A query to http://.../solr/collection1/query?q=awesome fans out as load-balanced sub-requests. A ZooKeeper quorum of five ZK nodes holds the tree below.]

/configs
  /myconf
    solrconfig.xml
    schema.xml
/clusterstate.json
/aliases.json
/livenodes
  server1:8983/solr
  server2:8983/solr
/collections
  /collection1
    configName=myconf
    /shards
      /shard1
        server1:8983/solr
        server2:8983/solr
      /shard2
        server3:8983/solr
        server4:8983/solr

ZooKeeper holds cluster state
• Nodes in the cluster
• Collections in the cluster
• Schema & config for each collection
• Shards in each collection
• Replicas in each shard
• Collection aliases

Page 22: Solr  4 The  NoSQL  Search Server

Distributed Indexing

[Diagram: an update to http://.../solr/collection1/update arrives at a node and is routed to shard1 or shard2]

• Update sent to any node
• Solr determines what shard the document is on, and forwards to shard leader
• Shard Leader versions document and forwards to all other shard replicas
• HA for updates (if one leader fails, another takes its place)
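The update path in the bullets above can be walked through with a toy model. Everything here is a sketch under stated assumptions: the class names are invented, the first replica is assumed to be the leader, version numbers are a simple counter, and zlib.crc32 is a stand-in hash rather than what Solr uses.

```python
import zlib


class Replica:
    def __init__(self, name):
        self.name = name
        self.docs = {}


class Shard:
    def __init__(self, name, replica_names):
        self.replicas = [Replica(f"{name}/{r}") for r in replica_names]
        self.leader = self.replicas[0]  # assumption: first replica leads
        self._next_version = 2

    def leader_update(self, doc):
        """Leader assigns the version, then forwards to the other replicas."""
        versioned = {**doc, "_version_": self._next_version}
        self._next_version += 1
        for replica in self.replicas:
            replica.docs[doc["id"]] = versioned
        return versioned

    def fail_leader(self):
        """HA: drop the failed leader and promote another replica."""
        self.replicas.remove(self.leader)
        self.leader = self.replicas[0]


class Cluster:
    def __init__(self, shards):
        self.shards = shards

    def update(self, doc):
        """Entry point on any node: pick the shard, forward to its leader."""
        # crc32 is a stand-in hash for this sketch, not what Solr uses.
        index = zlib.crc32(doc["id"].encode("utf-8")) % len(self.shards)
        return self.shards[index].leader_update(doc)
```

Since every replica already holds the versioned document, promoting a new leader loses nothing, which is the point of the last bullet.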

Page 23: Solr  4 The  NoSQL  Search Server

Collections API

Create a new document collection
http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4&replicationFactor=3

Delete a collection
http://localhost:8983/solr/admin/collections?action=DELETE&name=mycollection

Create an alias to a collection (or a group of collections)
http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=tri_state&collections=NY,NJ,CT
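These URLs are plain query strings, so they can be assembled programmatically. A small sketch using the standard library; collections_url is a hypothetical helper, and the action and parameter names are the ones shown on the slide.

```python
from urllib.parse import urlencode

ADMIN_URL = "http://localhost:8983/solr/admin/collections"


def collections_url(action, **params):
    """Build a Collections API URL; parameter names are passed through as-is."""
    # safe="," keeps comma-separated lists such as NY,NJ,CT readable.
    query = urlencode({"action": action, **params}, safe=",")
    return f"{ADMIN_URL}?{query}"
```

For example, collections_url("CREATEALIAS", name="tri_state", collections="NY,NJ,CT") reproduces the alias URL above.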

Page 24: Solr  4 The  NoSQL  Search Server


http://localhost:8983/solr/#/~cloud

Page 25: Solr  4 The  NoSQL  Search Server

Distributed Query Requests

Distributed query across all shards in the collection
http://localhost:8983/solr/collection1/query?q=foo

Explicitly specify node addresses to load-balance across
shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
• Equivalent nodes are separated by “|”
• Different phases of the same distributed request use the same node

Specify logical shards to search across
shards=NY,NJ,CT

Specify multiple collections to search across
collection=collection1,collection2

public CloudSolrServer(String zkHost)
• ZK aware SolrJ Java client that load-balances across all nodes in cluster
• Calculate where document belongs and directly send to shard leader (new)
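The shards parameter has two levels of separator: "," between shards and "|" between equivalent nodes of one shard. A minimal sketch of building and reading that value; both function names are hypothetical.

```python
def shards_param(shard_groups):
    """Join equivalent nodes with "|" and distinct shards with ","."""
    return ",".join("|".join(nodes) for nodes in shard_groups)


def parse_shards_param(value):
    """Inverse of shards_param: one list of equivalent nodes per shard."""
    return [group.split("|") for group in value.split(",")]
```

Each inner list is a set of interchangeable nodes for one shard, so a client can pick any one of them per request.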

Page 26: Solr  4 The  NoSQL  Search Server

Durable Writes

• Lucene flushes writes to disk on a “commit”
  • Uncommitted docs are lost on a crash (at Lucene level)

• Solr 4 maintains its own transaction log
  • Contains uncommitted documents
  • Services real-time get requests
  • Recovery (log replay on restart)
  • Supports distributed “peer sync”

• Writes forwarded to multiple shard replicas
  • A replica can go away forever w/o collection data loss
  • A replica can do a fast “peer sync” if it’s only slightly out of date
  • A replica can do a full index replication (copy) from a peer
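The role of the transaction log can be illustrated with a toy model: uncommitted documents live in memory and in the log, real-time get reads them before any commit, and a restart replays the log. TinyIndex is a hypothetical class, and the sketch simply assumes the log itself survives a crash.

```python
class TinyIndex:
    """Toy model: a committed index plus a transaction log of uncommitted docs."""

    def __init__(self):
        self.committed = {}  # survives a crash (flushed to disk on commit)
        self.tlog = []       # append-only log, assumed durable in this model
        self.pending = {}    # in-memory uncommitted docs, lost on a crash

    def add(self, doc):
        self.tlog.append(doc)
        self.pending[doc["id"]] = doc

    def realtime_get(self, doc_id):
        """Serve the newest copy, whether or not it has been committed."""
        if doc_id in self.pending:
            return self.pending[doc_id]
        return self.committed.get(doc_id)

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()
        self.tlog.clear()

    def crash_and_restart(self):
        """Drop in-memory state, then replay the log to recover it."""
        self.pending = {}
        for doc in self.tlog:
            self.pending[doc["id"]] = doc
```

Peer sync and full replication between replicas are outside this single-node sketch.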

Page 27: Solr  4 The  NoSQL  Search Server

Near Real Time (NRT) softCommit

• softCommit opens a new view of the index without flushing + fsyncing files to disk
  • Decouples update visibility from update durability

• commitWithin now implies a soft commit

• Current autoCommit defaults from solrconfig.xml:

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!--
<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>
-->
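Decoupling visibility from durability means there are two separate questions about a document: can a search see it, and is it on disk? The toy model below tracks the two independently. NrtIndex is a hypothetical class, and its open_searcher flag only mirrors the openSearcher setting in the config above in spirit.

```python
class NrtIndex:
    """Toy model separating what searches can see from what is on disk."""

    def __init__(self):
        self.buffered = []  # added, not yet visible
        self.visible = []   # what the current searcher sees
        self.on_disk = []   # what a hard commit has flushed

    def add(self, doc):
        self.buffered.append(doc)

    def soft_commit(self):
        """New searcher view; nothing is flushed or fsynced."""
        self.visible = self.visible + self.buffered
        self.buffered = []

    def hard_commit(self, open_searcher=False):
        """Flush to disk; only refresh the view if open_searcher is set."""
        self.on_disk = self.visible + self.buffered
        if open_searcher:
            self.soft_commit()
```

With open_searcher left at False, a hard commit makes documents durable without making them searchable, and a soft commit does the reverse.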

Page 28: Solr  4 The  NoSQL  Search Server

Document Routing

numShards=4
router=compositeId

[Diagram: hash ring divided into four ranges, one per shard (shard1, shard2, shard3, shard4): 00000000-3fffffff, 40000000-7fffffff, 80000000-bfffffff, c0000000-ffffffff]

Indexing:
id = BigCo!doc5  ->  (MurmurHash3)  ->  9f27 3c71  ->  shard1

Querying:
q=my_query
shard.keys=BigCo!  ->  (hash)  ->  9f27 0000 to 9f27 ffff  ->  shard1
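The routing arithmetic on this slide can be sketched without computing MurmurHash3 itself: the functions below take already-computed 32-bit hashes as input. The slide's example only confirms that 80000000-bfffffff belongs to shard1; the assignment of the other three ranges is an assumption, as are the function names.

```python
# Shard ranges as unsigned 32-bit values. Only shard1's range is confirmed
# by the worked example on the slide; the other three assignments are assumed.
SHARD_RANGES = {
    "shard1": (0x80000000, 0xBFFFFFFF),
    "shard2": (0xC0000000, 0xFFFFFFFF),
    "shard3": (0x00000000, 0x3FFFFFFF),
    "shard4": (0x40000000, 0x7FFFFFFF),
}


def composite_hash(prefix_hash, doc_hash):
    """Top 16 bits come from the prefix hash, bottom 16 from the rest of the id."""
    return (prefix_hash & 0xFFFF0000) | (doc_hash & 0x0000FFFF)


def shard_for(hash_value):
    """Return the shard whose range contains the hash."""
    for shard, (low, high) in SHARD_RANGES.items():
        if low <= hash_value <= high:
            return shard
    raise ValueError(f"hash {hash_value:#010x} is outside every range")


def prefix_range(prefix_hash):
    """All hashes that ids sharing a prefix such as "BigCo!" can map to."""
    top = prefix_hash & 0xFFFF0000
    return top, top | 0x0000FFFF


def shards_for_range(low, high):
    """Shards whose range overlaps [low, high]."""
    return [s for s, (lo, hi) in SHARD_RANGES.items() if lo <= high and low <= hi]
```

Because all ids with the same prefix share their top 16 bits, a query restricted with shard.keys only has to visit the shards overlapping that narrow range, which here is shard1 alone.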

Page 29: Solr  4 The  NoSQL  Search Server

Seamless Online Shard Splitting

[Diagram: Shard1, Shard2 and Shard3, each with a leader and a replica; Shard2 is split into sub-shards Shard2_0 and Shard2_1, which receive updates forwarded by the Shard2 leader]

1. http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=Shard2
2. New sub-shards created in “construction” state
3. Leader starts forwarding applicable updates, which are buffered by the sub-shards
4. Leader index is split and installed on the sub-shards
5. Sub-shards apply buffered updates then become “active” leaders and old shard becomes “inactive”
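Steps 2 to 5 can be traced with a toy model. The sketch assumes the parent's hash range is cut into two equal halves and represents documents as a mapping from hash to value; SubShard, split_range and split_shard are hypothetical names, and real splitting works on index files rather than dictionaries.

```python
def split_range(low, high):
    """Split an inclusive hash range into two halves (assumed equal split)."""
    mid = low + (high - low) // 2
    return (low, mid), (mid + 1, high)


class SubShard:
    def __init__(self, name, hash_range):
        self.name = name
        self.range = hash_range
        self.state = "construction"  # step 2
        self.docs = {}
        self.buffer = []

    def owns(self, doc_hash):
        return self.range[0] <= doc_hash <= self.range[1]


def split_shard(parent_docs, parent_range, updates_during_split):
    """parent_docs maps doc hash -> document; updates are (hash, document) pairs."""
    low_half, high_half = split_range(*parent_range)
    subs = [SubShard("Shard2_0", low_half), SubShard("Shard2_1", high_half)]
    # Step 3: the leader forwards applicable updates; sub-shards buffer them.
    for doc_hash, doc in updates_during_split:
        next(s for s in subs if s.owns(doc_hash)).buffer.append((doc_hash, doc))
    # Step 4: the leader's index is split and installed on the sub-shards.
    for doc_hash, doc in parent_docs.items():
        next(s for s in subs if s.owns(doc_hash)).docs[doc_hash] = doc
    # Step 5: apply buffered updates, then become active.
    for sub in subs:
        for doc_hash, doc in sub.buffer:
            sub.docs[doc_hash] = doc
        sub.buffer.clear()
        sub.state = "active"
    return subs
```

Buffering is what keeps the split online: updates that arrive while the index is being copied are applied on top of the installed data, so the newer version of a document wins.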

Page 30: Solr  4 The  NoSQL  Search Server


Questions?