Scaling search with SolrCloud
Scaling your search application with Solr Cloud
![Page 1: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/1.jpg)
Scaling with Solr Cloud
Saumitra Srivastav
Bangalore Apache Solr Group September-2014 Meetup
![Page 2: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/2.jpg)
What is Solr Cloud?
- a set of features that add distributed capabilities to Solr
- fault tolerance and high availability
- distributed indexing and search
- enables and simplifies horizontal scaling of a search index using sharding and replication
![Page 3: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/3.jpg)
Non-Cloud Single Node Deployment
- Machine (server) 1
  - Solr node (Jetty on port 8983)
    - Core 1: conf, data
    - Core 2: conf, data
    - ...
    - Core N: conf, data
![Page 4: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/4.jpg)
Use Solr Cloud for ...
- performance
- scalability
- high-availability
- simplicity
- elasticity
![Page 5: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/5.jpg)
Solr Cloud Glossary
- Cluster
- Node
- Shard
- Leader & Replica
- Overseer
- Collection
- Zookeeper
![Page 6: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/6.jpg)
High Level View
![Page 7: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/7.jpg)
Glossary
- Cluster - a set of Solr nodes
- Node - a JVM instance running Solr; also known as a Solr server
- Core
  - an individual Solr instance (represents a logical index)
  - multiple cores can run on a single node
![Page 8: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/8.jpg)
Glossary
- Collection
  - one or more documents grouped together in a single logical index
  - can be spread across multiple cores
- Shard
  - a logical section of a single collection
  - implemented as a core
- Replica
  - a copy of a shard or single logical index
  - used for failover or load balancing
![Page 9: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/9.jpg)
Glossary
- Leader
  - the main node for each shard, which routes document adds, updates, or deletes to the other replicas
  - if the leader goes down, a new node is elected to take its place
- Overseer
  - a single node in SolrCloud that is responsible for processing actions involving the entire cluster
  - if the overseer goes down, a new node is elected to take its place
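The failover behaviour described for the leader and the overseer can be pictured with a toy election. The sketch below is only an illustration in the spirit of the usual ZooKeeper election recipe (each candidate gets a sequence number, the lowest one leads); the class `ToyElection` and the node names are hypothetical and this is not Solr's actual election code.

```python
class ToyElection:
    """Toy election: the candidate with the lowest sequence number leads."""

    def __init__(self):
        self._next_seq = 0
        self._candidates = {}  # sequence number -> node name

    def join(self, node):
        # Each joining node receives the next sequence number.
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = node
        return seq

    def leave(self, node):
        # A node that goes down loses its candidacy.
        self._candidates = {
            seq: name for seq, name in self._candidates.items() if name != node
        }

    def leader(self):
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


election = ToyElection()
for name in ("node1", "node2", "node3"):
    election.join(name)

first_leader = election.leader()   # node1 joined first
election.leave("node1")            # simulate the leader going down
second_leader = election.leader()  # the next candidate takes over
print(first_leader, second_leader)
```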
![Page 10: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/10.jpg)
Zookeeper
- distributed coordination
- maintaining configuration information

Diagram: four Solr nodes (Solr Node 1 at 10.0.0.1:8983, Solr Node 2 at 10.0.0.2:8983, Solr Node 3 at 10.0.0.3:8983, Solr Node 4 at 10.0.0.4:8983) and ZooKeeper.
![Page 11: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/11.jpg)
Zookeeper
Diagram: a client, four Solr nodes (Solr Node 1 at 10.0.0.1:8983, Solr Node 2 at 10.0.0.2:8983, Solr Node 3 at 10.0.0.3:8983, Solr Node 4 at 10.0.0.4:8983) and a ZooKeeper quorum of three servers: zk-1:2181, zk-2:2182, zk-3:2183.
![Page 12: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/12.jpg)
Zookeeper - Central Configuration
![Page 13: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/13.jpg)
Zookeeper - distributed coordination
- Keep track of /live_nodes
- Collection metadata and replica state in /clusterstate.json
- Alias list in /aliases.json
- Leader election
![Page 14: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/14.jpg)
Collections
- A collection is a distributed index defined by:
  - a named configuration, stored in ZooKeeper
  - number of shards
  - replication factor: the number of copies of each document in the collection
  - document routing strategy: how documents get assigned to shards
![Page 15: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/15.jpg)
Collections API
localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=4&replicationFactor=2&maxShardsPerNode=1&createNodeSet=localhost:8933&collection.configName=collection1Config
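A request like the one above can also be assembled programmatically. The following is a minimal sketch that only builds the URL string and sends nothing over the network; the helper name `collections_api_url` and the base URL are assumptions made for illustration, and `createNodeSet` is left out to keep the example short.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, params):
    """Build a Collections API URL from an action and a dict of parameters."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


create_url = collections_api_url(
    "http://localhost:8983/solr",  # assumed address of one Solr node
    "CREATE",
    {
        "name": "collection1",
        "numShards": 4,
        "replicationFactor": 2,
        "maxShardsPerNode": 1,
        "collection.configName": "collection1Config",
    },
)
print(create_url)
```

Keeping the parameters in a dict makes it easy to reuse the same helper for other Collections API actions.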
![Page 16: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/16.jpg)
Collections
![Page 17: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/17.jpg)
Sharding
- A collection has a fixed number of shards
  - existing shards can be split
- When to shard?
  - Large number of docs
  - Large document sizes
  - Parallelization during indexing and queries
  - Data partitioning (custom hashing)
![Page 18: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/18.jpg)
Replication
- Why replicate?
  - High availability
  - Load balancing
- How does it work in SolrCloud?
  - Near-real-time, NOT master-slave
  - Leader forwards to replicas in parallel and waits for their responses
  - Error handling during indexing is tricky
![Page 19: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/19.jpg)
Indexing
![Page 20: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/20.jpg)
Indexing
1. Get cluster state from ZK
2. Route document directly to leader (hash on doc ID)
3. Persist document on durable storage (tlog)
4. Forward to healthy replicas
5. Acknowledge the successful write to the client
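Steps 3 to 5 of this write path can be sketched as a toy simulation. Everything here is hypothetical (the `ToyReplica` class, the `index_document` function, the in-memory list standing in for the tlog); steps 1 and 2 are assumed to have already picked the leader, and real error handling is far more involved.

```python
class ToyReplica:
    """In-memory stand-in for one replica of a shard."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.tlog = []  # stand-in for the transaction log on durable storage

    def write(self, doc):
        self.tlog.append(doc)
        return True


def index_document(doc, leader, replicas):
    # Step 3: the leader persists the document to its own tlog.
    leader.write(doc)
    # Step 4: forward to the healthy replicas only.
    acks = [replica.write(doc) for replica in replicas if replica.healthy]
    # Step 5: acknowledge once every contacted replica has answered.
    return all(acks)


leader = ToyReplica("leader")
replicas = [ToyReplica("replica1"), ToyReplica("replica2", healthy=False)]
acked = index_document({"id": "doc1"}, leader, replicas)
print(acked, [len(r.tlog) for r in replicas])
```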
![Page 21: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/21.jpg)
Querying
![Page 22: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/22.jpg)
Querying
1. The query client can be ZK-aware or can simply query via a load balancer
2. The client can send the query to any node in the cluster
3. The controller node distributes the query to one replica of each shard to identify the documents matching the query
4. The controller node sorts the results from step 3 and issues a second query for all fields for a page of results
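The two-phase flow can be illustrated with a small scatter-gather sketch. The per-shard data, scores and function names below are made up for the example; the point is only that the first phase moves ids and scores, and the second phase fetches stored fields for the documents that made the page.

```python
import heapq

# Hypothetical per-shard data: doc id -> (score, stored fields)
SHARDS = {
    "shard1": {"a": (0.9, {"title": "A"}), "b": (0.4, {"title": "B"})},
    "shard2": {"c": (0.7, {"title": "C"}), "d": (0.1, {"title": "D"})},
}


def phase_one(shard, rows):
    """Return only (score, id) pairs for the shard's top `rows` hits."""
    hits = [(score, doc_id) for doc_id, (score, _) in SHARDS[shard].items()]
    return heapq.nlargest(rows, hits)


def distributed_query(rows):
    # Phase 1: gather ids and scores from one replica of every shard.
    candidates = []
    for shard in SHARDS:
        for score, doc_id in phase_one(shard, rows):
            candidates.append((score, doc_id, shard))
    top = heapq.nlargest(rows, candidates)
    # Phase 2: fetch stored fields only for the documents on the page.
    return [(doc_id, SHARDS[shard][doc_id][1]) for _, doc_id, shard in top]


page = distributed_query(rows=2)
print(page)
```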
![Page 23: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/23.jpg)
Transaction Log (tlog)
- file where the raw documents are written for recovery purposes
- each node has its own tlog
- replayed on server restart after a non-graceful shutdown
- “rolled over” automatically on hard commit
- old one is closed and a new one is opened
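As a rough mental model of rollover and replay, consider the toy log below. The class `ToyTlog` is hypothetical and keeps everything in memory; it ignores how many old logs are retained and only shows that a hard commit closes the current log and that a replay covers what was written since.

```python
class ToyTlog:
    """Toy transaction log with rollover on hard commit."""

    def __init__(self):
        self.closed_logs = []  # logs that were rolled over
        self.current = []      # log receiving new writes

    def append(self, doc):
        self.current.append(doc)

    def hard_commit(self):
        # Roll over: close the current log and open a new one.
        self.closed_logs.append(self.current)
        self.current = []

    def replay(self):
        # After an unclean shutdown, re-apply what is in the open log.
        return list(self.current)


tlog = ToyTlog()
tlog.append("doc1")
tlog.hard_commit()
tlog.append("doc2")
to_replay = tlog.replay()
print(to_replay, len(tlog.closed_logs))
```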
![Page 24: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/24.jpg)
Transaction Log (tlog)
![Page 25: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/25.jpg)
Commits
- Hard Commit & Soft Commit
- Hard commits are about durability, soft commits are about visibility
- Further reading: https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
![Page 26: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/26.jpg)
What happens on hard commit?
- The tlog is truncated.
- A new tlog is started.
- Old tlogs will be deleted if there are more than 100 documents in newer tlogs.
- The current index segment is closed and flushed.
- Background segment merges may be initiated.
![Page 27: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/27.jpg)
What happens on soft commit?
- The tlog has NOT been truncated. It will continue to grow.
- New documents WILL be visible.
- some caches will have to be reloaded
- top-level caches will be invalidated.
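The durability versus visibility distinction can be captured in a toy model. The class `ToyIndex` is hypothetical; the `open_searcher` flag is modelled on Solr's `openSearcher` commit option, and the default of `False` here is an assumption chosen to keep the two kinds of commit clearly separate.

```python
class ToyIndex:
    """Toy index tracking what is searchable and what is flushed."""

    def __init__(self):
        self.added = []        # every document received, in order
        self.visible_upto = 0  # documents [0, visible_upto) are searchable
        self.flushed_upto = 0  # documents [0, flushed_upto) are in flushed segments

    def add(self, doc):
        self.added.append(doc)

    def soft_commit(self):
        # Visibility only: new documents become searchable.
        self.visible_upto = len(self.added)

    def hard_commit(self, open_searcher=False):
        # Durability: pending documents are flushed to segments.
        self.flushed_upto = len(self.added)
        if open_searcher:
            self.visible_upto = len(self.added)

    def search(self):
        return self.added[: self.visible_upto]

    def flushed(self):
        return self.added[: self.flushed_upto]


index = ToyIndex()
index.add("doc1")
index.hard_commit()
after_hard = (index.search(), index.flushed())
index.add("doc2")
index.soft_commit()
after_soft = (index.search(), index.flushed())
print(after_hard, after_soft)
```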
![Page 28: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/28.jpg)
Shard Splitting
- Can split shards into two sub-shards
- Live splitting. No downtime needed.
- Requests start being forwarded to sub-shards automatically
- Expensive operation: Use as required during low traffic
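Conceptually, a split divides one shard's hash range into two halves. The sketch below shows that arithmetic and builds a SPLITSHARD request URL without sending it; the collection name, shard name and node address are placeholders, and the exact boundaries Solr picks may differ from this simple midpoint.

```python
from urllib.parse import urlencode


def split_range(lo, hi):
    """Split the inclusive hash range [lo, hi] into two halves."""
    mid = lo + (hi - lo) // 2
    return (lo, mid), (mid + 1, hi)


left, right = split_range(0x00000000, 0x7FFFFFFF)
print(f"{left[0]:x}-{left[1]:x}", f"{right[0]:x}-{right[1]:x}")

# Placeholder names; only the URL string is built here.
split_url = "http://localhost:8983/solr/admin/collections?" + urlencode(
    {"action": "SPLITSHARD", "collection": "collection1", "shard": "shard1"}
)
print(split_url)
```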
![Page 29: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/29.jpg)
Overseer
- Persists collection state change events to ZooKeeper
- Controller for Collection API commands
- One per cluster (for all collections); elected using leader election
- Asynchronous (pub/sub messaging)
- Automated failover to a healthy node
- Can be assigned to a dedicated node
![Page 30: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/30.jpg)
Overseer
![Page 31: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/31.jpg)
Controlling data partitioning
- Shard vs Replicas
- Custom Routing
- Collection Aliasing
![Page 32: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/32.jpg)
Shard vs Replica
- More data? Add shards.
- More queries? Add replicas.
![Page 33: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/33.jpg)
Document Routing
- How to assign documents to shards
  - Default routing
  - Custom routing
- Routers
  - CompositeID
  - Implicit
![Page 34: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/34.jpg)
Default Routing
- Each shard covers a hash-range
- Hash doc-ID into 32-bit integer, map to range
- Leads to balanced (roughly) shards
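The idea of hashing a document ID into a 32-bit integer and mapping it to a shard's range can be sketched as follows. This is an illustration only: `zlib.crc32` is used as a stand-in 32-bit hash and is NOT the hash function Solr uses, so the shard chosen for a given ID will differ from a real cluster. The helper names are hypothetical.

```python
import zlib


def shard_ranges(num_shards):
    """Divide the 32-bit hash space into equal inclusive ranges."""
    size = 2**32 // num_shards
    ranges = []
    for i in range(num_shards):
        lo = i * size
        # The last range absorbs any remainder up to ffffffff.
        hi = (i + 1) * size - 1 if i < num_shards - 1 else 2**32 - 1
        ranges.append((lo, hi))
    return ranges


def shard_for(doc_id, ranges):
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, not Solr's
    for index, (lo, hi) in enumerate(ranges):
        if lo <= h <= hi:
            return index
    raise ValueError("hash outside the 32-bit space")


ranges = shard_ranges(2)
print([f"{lo:x}-{hi:x}" for lo, hi in ranges])
print({doc_id: shard_for(doc_id, ranges) for doc_id in ("bookdoc1", "magazinedoc1")})
```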
![Page 35: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/35.jpg)
Default Routing
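Diagram: the 32-bit hash of each document ID maps the document to a shard of the collection.
- Shard 1: 0 - 7fffffff
- Shard 2: 80000000 - ffffffff
- Documents: Document-1 (Id = bookdoc1), Document-2 (Id = magazinedoc1), Document-3 (Id = bookdoc2)
- Hash values shown: 858919514, 2516704228, 413288864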
![Page 36: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/36.jpg)
Default Routing - Querying
Diagram: Application sends q=soccer to a Collection of eight shards (Shard 1 to Shard 8).
![Page 37: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/37.jpg)
Custom Routing
- Route documents to specific shards
- based on a shard key component in the document ID
![Page 38: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/38.jpg)
Custom Routing
- Send documents with a prefix in the document ID
- The prefix in the ID is used to calculate the hash that determines the shard
- The prefix must be separated from the rest of the ID by an exclamation mark (!)
- Examples:
  1. Book!doc1
  2. Magazine!doc1
  3. Book!author!doc2
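One way to picture prefix-based routing is to take the upper 16 bits of the hash from the prefix and the lower 16 bits from the rest of the ID, so that all documents sharing a prefix land in the same narrow hash range. The sketch below makes several assumptions: `zlib.crc32` is a stand-in hash rather than Solr's, only the first prefix is considered (a three-part ID such as Book!author!doc2 is treated as prefix "Book"), and the shard count is a power of two so that shard boundaries align with the prefix bits.

```python
import zlib


def h32(text):
    return zlib.crc32(text.encode("utf-8"))  # stand-in hash, not Solr's


def composite_hash(doc_id):
    """Upper 16 bits from the prefix, lower 16 bits from the rest of the ID."""
    if "!" not in doc_id:
        return h32(doc_id)
    prefix, rest = doc_id.split("!", 1)
    return (h32(prefix) & 0xFFFF0000) | (h32(rest) & 0x0000FFFF)


def shard_for(doc_id, num_shards):
    # Equal-sized ranges over the 32-bit space; num_shards assumed a power of two.
    return composite_hash(doc_id) * num_shards // 2**32


for doc_id in ("Book!doc1", "Book!doc2", "Magazine!doc1"):
    print(doc_id, shard_for(doc_id, 4))
```

With this scheme, Book!doc1 and Book!doc2 always share a shard, while Magazine!doc1 may or may not land on the same one.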
![Page 39: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/39.jpg)
Custom Routing - Indexing
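Diagram: documents in the collection are assigned to shards by the prefix in their IDs.
- Shard 1: 0 - 7fffffff
- Shard 2: 80000000 - ffffffff
- Documents: Document-1 (Id = book!doc1), Document-2 (Id = magazine!doc1), Document-3 (Id = book!doc2)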
![Page 40: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/40.jpg)
Custom Routing - Querying
http://10.0.0.7:8983/solr/collection1/select?q=soccer&_route_=books!
http://10.0.0.7:8983/solr/collection1/select?q=soccer&_route_=books!,magazines!
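A routed query URL of this shape can be built with a small helper. The function name `routed_query_url` is hypothetical; it appends the trailing `!` to each route key, which is the form the compositeId router expects, and it only constructs the string without contacting a server.

```python
from urllib.parse import urlencode


def routed_query_url(base_url, collection, q, route_keys):
    """Build a select URL restricted to the shards owning the given keys."""
    route = ",".join(f"{key}!" for key in route_keys)
    # Keep "!" and "," unescaped so the URL matches the hand-written form.
    query = urlencode({"q": q, "_route_": route}, safe="!,")
    return f"{base_url}/{collection}/select?{query}"


url = routed_query_url(
    "http://10.0.0.7:8983/solr", "collection1", "soccer", ["books", "magazines"]
)
print(url)
```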
![Page 41: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/41.jpg)
Custom Routing - Querying
Diagram: Application sends q=soccer&_route_=books! to a Collection of eight shards (Shard 1 to Shard 8).
![Page 42: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/42.jpg)
Implicit Router
- A field to be used for routing can be defined when creating the collection
http://localhost:8983/solr/admin/collections?action=CREATE&name=articles&router.name=implicit&router.field=article-type
![Page 43: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/43.jpg)
Collection Aliasing
- allows you to set up a virtual collection that actually points to one or more real collections
- Virtual collection == alias
localhost:8983/solr/admin/collections?action=CREATEALIAS&name=alias-name&collections=collection-list
![Page 44: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/44.jpg)
Collection Aliasing
- Time-series data
Diagram: real collections June, July, Aug, Sep and Oct, with two aliases, last3months and latest, pointing into them.
![Page 45: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/45.jpg)
Collection Aliasing
Diagram: real collections June, July, Aug, Sep and Oct; alias last3months covers Aug, Sep and Oct; alias latest points to Oct.
localhost:8983/solr/admin/collections?action=CREATEALIAS&name=last3months&collections=aug,sep,oct
localhost:8983/solr/admin/collections?action=CREATEALIAS&name=latest&collections=oct
![Page 46: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/46.jpg)
Collection Aliasing
Diagram: a new real collection Nov is added after June, July, Aug, Sep and Oct; alias last3months now covers Sep, Oct and Nov; alias latest points to Nov.
localhost:8983/solr/admin/collections?action=CREATEALIAS&name=last3months&collections=sep,oct,nov
localhost:8983/solr/admin/collections?action=CREATEALIAS&name=latest&collections=nov
![Page 47: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/47.jpg)
Collection Aliasing
- Aliases can be:
  - updated on the fly
  - queried just like a normal collection
  - used for indexing as long as the alias points to a single collection
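For time-series data, moving the monthly aliases forward amounts to re-issuing CREATEALIAS with a new collection list. The sketch below generates those URLs from an ordered list of monthly collections; the helper names and the base URL are assumptions, and nothing is sent to a server.

```python
from urllib.parse import urlencode


def alias_url(alias, collections, base_url="http://localhost:8983/solr"):
    """Build a CREATEALIAS URL pointing `alias` at the given collections."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    # Keep the comma unescaped so the list reads like the hand-written URL.
    return f"{base_url}/admin/collections?" + urlencode(params, safe=",")


def rolling_alias_urls(months):
    """Point last3months at the newest three collections and latest at the newest."""
    return [alias_url("last3months", months[-3:]), alias_url("latest", months[-1:])]


urls = rolling_alias_urls(["june", "july", "aug", "sep", "oct", "nov"])
for url in urls:
    print(url)
```

Running the same function each time a new monthly collection is created keeps both aliases current without changing any client configuration.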
![Page 48: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/48.jpg)
Other Features
- Near-Real-Time Search
- Atomic Updates
- Optimistic Locking
- HTTPS
- Use HDFS for storing indexes
- Use MapReduce for building index
![Page 49: Scaling search with SolrCloud](https://reader034.vdocument.in/reader034/viewer/2022051412/548330d7b4af9f870d8b4967/html5/thumbnails/49.jpg)
Thanks
- Attributions:
  - Shalin Mangar’s slides on “SolrCloud: Searching Big Data”
  - Rafał Kuć’s slides on “Scaling Solr with SolrCloud”
- Connect:
  - https://www.linkedin.com/in/saumitras
  - @_saumitra_
- Join:
  - http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/