Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharma, BloomReach
Post on 08-Jul-2015
Solr Compute Cloud – An Elastic Solr Infrastructure
Nitin Sharma - Member of technical staff, BloomReach - nitin.sharma@bloomreach.com
Abstract
Scaling search platforms is an extremely hard problem:
• Serving hundreds of millions of documents
• Low latency
• High-throughput workloads
• Optimized cost
At BloomReach, we have implemented SC2, an elastic Solr infrastructure for big data applications that:
• Supports heterogeneous workloads while hosted in the cloud.
• Dynamically grows/shrinks search servers.
• Provides application- and pipeline-level isolation, NRT search and indexing.
• Offers latency guarantees and application-specific performance tuning.
• Provides high-availability features like cluster replacement, cross-data center support and disaster recovery.
About Us
BloomReach: BloomReach has developed a personalized discovery platform that features applications that analyze big data to make our customers’ digital content more discoverable, relevant and profitable.
Myself: I work on search platform scaling for BloomReach’s big data. My relevant experience and background includes scaling real-time services for latency-sensitive applications and building performance and search-quality metrics infrastructure for personalization platforms.
The BloomReach Personalized Discovery Platform
[Diagram: BloomReach’s applications (Organic Search, SNAP, Compass) built on content understanding]
Organic Search
What it does: Content optimization, management and measurement
Benefit: Enhanced discoverability and customer acquisition in organic search
SNAP
What it does: Personalized onsite search and navigation across devices
Benefit: Relevant and consistent onsite experiences for new and known users
Compass
What it does: Merchandising tool that understands products and identifies opportunities
Benefit: Prioritize and optimize online merchandising
Agenda
• BloomReach search use cases and architecture
• Old architecture and issues
• Scaling challenges
• Elastic SolrCloud architecture and benefits
• Lessons learned
BloomReach Search Use Cases
1. Front-end (serving) queries – Uptime and latency sensitive
2. Batch search pipelines – Throughput sensitive
3. Time-bound indexing requirements – Customer specific
4. Time-bound Solr config updates
BloomReach Search Architecture
[Diagram: MapReduce read pipelines 1..n and indexing pipelines 1..n share one Solr cluster and ZooKeeper ensemble with public API search traffic; the legend marks heavy, moderate and light load]
Throughput Issues…
● Heterogeneous read workload
● Same collection - different pipelines, different query patterns, different schedule
● Cache tuning is virtually impossible
● Larger pipeline starving the small ones
● Machine utilization determines throughput and stability of a pipeline at any point
● No isolation among jobs
Stability and Uptime Issues…
● Bad clients – bring down the cluster/degrade performance
● Bad queries (with heavy load) – render nodes unresponsive
● Garbage collection issues
● ZK stability issues (as we scale collections)
● CPU/load issues
● Higher number of concurrent pipelines, higher number of issues
Indexing Issues…
● Commit frequencies vary with indexer types
● Indexer runs that overlap with another pipeline degrade performance
● Indexer client leaks
● Too many stored fields
● Non-batch updates
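The "non-batch updates" issue above can be illustrated with a minimal sketch of client-side batching. The `send_batch` callable is a hypothetical stand-in for whatever call actually posts documents to the cluster, and the batch size of 500 is an arbitrary assumption, not a value from the talk.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(docs: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most batch_size documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def index_in_batches(
    docs: Iterable[dict],
    send_batch: Callable[[List[dict]], None],  # hypothetical update call
    batch_size: int = 500,                     # assumed value, tune per workload
) -> int:
    """Send docs one batch per request and return the number of requests."""
    requests = 0
    for batch in batched(docs, batch_size):
        send_batch(batch)
        requests += 1
    return requests


if __name__ == "__main__":
    sent: List[List[dict]] = []
    docs = ({"id": str(i)} for i in range(1200))
    print(index_in_batches(docs, sent.append), [len(b) for b in sent])
```

With 1200 documents this issues 3 requests instead of 1200, which is the effect batching is meant to have on the indexing path.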
Rethinking…
• Shared cluster for pipelines does not scale.
• Guaranteeing an uptime of 99.99%+ is non-trivial.
• Every job runs great in isolation. When you put them together, they fail.
• Running index-heavy load and read-heavy load together causes cluster performance issues.
• Any direct access to the production cluster threatens cluster stability (client leaks, bad queries etc.).
What if every pipeline had its own cluster?
Solr Compute Cloud (SC2)
• Elastic Infrastructure – Provision Solr Clusters on demand, on-the-fly.
• Create, Use, Terminate Model - Create a temporary cluster with necessary data, use it and throw it away.
• Technologies behind SC2 (built in-house):
  Cluster Management API - Dynamic cluster provisioning and resource allocation.
  Solr HAFT – High availability and data management library for SolrCloud.
• Isolation - Pipelines get their own cluster. One cannot disrupt another.
• Dynamic Scaling – Every pipeline can state its own replication requirements.
• Production Safeguard - No direct access. Safeguards from bad clients/access patterns.
• Cost Saving – Provision for the average; withstand peak with elastic growth.
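The cost-saving point (provision for the average, grow for the peak) can be made concrete with a small arithmetic sketch. The hourly demand figures and the baseline of 5 nodes are invented for illustration only; they are not numbers from BloomReach.

```python
from typing import Sequence


def static_node_hours(demand: Sequence[int]) -> int:
    """Node-hours when the cluster is sized for peak demand in every hour."""
    return max(demand) * len(demand) if demand else 0


def elastic_node_hours(demand: Sequence[int], baseline: int) -> int:
    """Node-hours when a baseline is always on and extra nodes are added
    only in the hours where demand exceeds it."""
    return sum(max(baseline, needed) for needed in demand)


if __name__ == "__main__":
    # Hypothetical nodes needed per hour: mostly quiet with one spike.
    demand = [4, 4, 4, 12, 4, 4]
    static = static_node_hours(demand)
    elastic = elastic_node_hours(demand, baseline=5)
    print(f"static={static} elastic={elastic} saving={1 - elastic / static:.0%}")
```

Under these assumed numbers the peak-sized cluster uses 72 node-hours and the elastic one 37; real savings depend entirely on how spiky the workload is.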
Solr Compute Cloud
[Diagram: Pipeline 1, the Solr Compute Cloud API, the Solr HAFT Service, a Solr cluster with its ZooKeeper ensemble, and the provisioned Solr cluster (Collection A, Replicas: 6); the HAFT service reads from one cluster and replicates to the other]
1. Read pipeline requests collection and desired replicas from SC2 API. Request: {Collection: A, Replica: 6}
2. SC2 API provisions cluster dynamically with needed setup (and streams Solr data).
3. SC2 calls HAFT service to replicate data from production to provisioned cluster.
4. Pipeline uses this cluster to run job.
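The request flow above can be sketched with in-memory stand-ins. `Sc2Api`, `HaftService`, `Cluster` and their methods are hypothetical names chosen for this illustration; the talk does not show the real SC2 or HAFT interfaces. Only the request shape `{Collection, Replica}` comes from the slide.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Cluster:
    cluster_id: str
    collection: str
    replicas: int
    has_data: bool = False


class HaftService:
    """Hypothetical stand-in that only records replication calls."""

    def __init__(self) -> None:
        self.replications: List[Tuple[str, str, str]] = []

    def replicate(self, collection: str, source: str, target: Cluster) -> None:
        self.replications.append((collection, source, target.cluster_id))
        target.has_data = True


class Sc2Api:
    """Hypothetical stand-in for the provisioning API."""

    def __init__(self, haft: HaftService) -> None:
        self._haft = haft
        self._ids = itertools.count(1)
        self.clusters: Dict[str, Cluster] = {}

    def provision(self, request: dict) -> Cluster:
        collection = request["Collection"]
        replicas = int(request["Replica"])
        if replicas < 1:
            raise ValueError("at least one replica is required")
        # Step 2: provision a cluster sized for this request.
        cluster = Cluster(f"sc2-{next(self._ids)}", collection, replicas)
        self.clusters[cluster.cluster_id] = cluster
        # Step 3: ask HAFT to copy the collection from production.
        self._haft.replicate(collection, "production", cluster)
        return cluster


if __name__ == "__main__":
    api = Sc2Api(HaftService())
    # Step 1: the pipeline states the collection and replica count it needs.
    cluster = api.provision({"Collection": "A", "Replica": 6})
    # Step 4: the pipeline would now run its job against this cluster.
    print(cluster)
```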
Solr Compute Cloud…
[Diagram: Pipeline 1, the Solr Compute Cloud API, the Solr HAFT Service, a Solr cluster with its ZooKeeper ensemble, and the provisioned Solr cluster (Collection A, Replicas: 6)]
1. Pipeline finishes running the job.
2. Pipeline calls SC2 API to terminate the cluster. Terminate: {Cluster}
3. SC2 terminates the cluster.
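One way a pipeline could make sure the terminate call always happens, even when the job fails, is to wrap the create, use, terminate cycle in a context manager. This is a sketch under assumptions: `InMemorySc2`, `provision` and `terminate` are hypothetical names, not the actual SC2 API.

```python
from contextlib import contextmanager
from typing import Iterator, Set


class InMemorySc2:
    """Hypothetical stand-in that only tracks which clusters are active."""

    def __init__(self) -> None:
        self.active: Set[str] = set()
        self._next_id = 0

    def provision(self, collection: str, replicas: int) -> str:
        self._next_id += 1
        cluster_id = f"{collection}-{self._next_id}"
        self.active.add(cluster_id)
        return cluster_id

    def terminate(self, cluster_id: str) -> None:
        self.active.discard(cluster_id)


@contextmanager
def temporary_cluster(api: InMemorySc2, collection: str, replicas: int) -> Iterator[str]:
    """Create a cluster, hand it to the caller, and always terminate it."""
    cluster_id = api.provision(collection, replicas)
    try:
        yield cluster_id
    finally:
        # Runs on success and on failure, so no cluster is left behind.
        api.terminate(cluster_id)


if __name__ == "__main__":
    api = InMemorySc2()
    with temporary_cluster(api, "A", replicas=6) as cluster_id:
        print("running job on", cluster_id, "active:", sorted(api.active))
    print("active after job:", sorted(api.active))
```

The design choice here is that cleanup lives with the lifecycle helper rather than with each job, so a crashing job cannot leak a cluster.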
Solr Compute Cloud – Read Pipeline View
[Diagram: each read pipeline sends its request to the Solr Compute Cloud API and gets its own Solr cluster; the Solr HAFT Service sits between these clusters and the production Solr cluster with its ZooKeeper ensemble]
Pipeline 1 - Request: {Collection: A, Replica: 6} - Solr Cluster, Collection A, Replicas: 6
Pipeline 2 - Request: {Collection: B, Replica: 2} - Solr Cluster, Collection B, Replicas: 2
Pipeline n - Request: {Collection: C, Replica: 1} - Solr Cluster, Collection C, Replicas: 1
Solr Compute Cloud – Indexing
[Diagram: the indexing pipeline, the Solr Compute Cloud API, the Solr HAFT Service, the production Solr cluster with its ZooKeeper ensemble, and the provisioned Solr cluster (labeled Collection A, Replicas: 6)]
1. Indexing pipeline requests collection and desired replicas from SC2 API. Request: {Collection: A, Replica: 2}
2. SC2 API provisions cluster dynamically with needed setup (and streams Solr data).
3. Indexer uses this cluster to index the data.
4. Indexer calls HAFT service to replicate the index from dynamic cluster to production.
5. HAFT service reads data from dynamic cluster and replicates to production Solr.
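The indexing flow can be sketched with plain dictionaries standing in for the two clusters. The function names are hypothetical, and treating replication as a full replacement of the production index is an assumption made for this illustration; the slides do not say whether HAFT replaces or merges.

```python
from typing import Dict, Iterable

Index = Dict[str, dict]  # document id -> document, a toy stand-in for a collection


def index_documents(docs: Iterable[dict], scratch: Index) -> None:
    """Step 3: index into the dynamically provisioned (scratch) cluster."""
    for doc in docs:
        scratch[doc["id"]] = doc


def replicate_index(source: Index, target: Index) -> int:
    """Steps 4-5: copy the finished index to the target, return its size.

    Assumption: replication replaces the target index wholesale.
    """
    target.clear()
    target.update(source)
    return len(target)


def run_indexing_job(docs: Iterable[dict], production: Index) -> int:
    scratch: Index = {}  # steps 1-2: stand-in for the provisioned cluster
    index_documents(docs, scratch)
    # Production is only touched after indexing has fully succeeded.
    return replicate_index(scratch, production)


if __name__ == "__main__":
    production: Index = {"old": {"id": "old"}}
    count = run_indexing_job([{"id": "1"}, {"id": "2"}], production)
    print(count, sorted(production))
```

Because all indexing work happens on the scratch index, a failing indexer leaves production as it was, which mirrors the isolation argument made in the talk.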
Solr Compute Cloud – Global View
[Diagram: indexing pipelines 1..n and read pipelines 1..n call the Solr Compute Cloud API (Provision: {Cluster}, Terminate: {Cluster}) and run their jobs on elastic clusters; the Solr HAFT Service replicates indexes between the elastic clusters and the production Solr cluster, which has its own ZooKeeper ensemble]
Solr Compute Cloud API
1. API to provision clusters on demand.
2. Dynamic cluster and resource allocation (includes cost optimization)
3. Track request state, cluster performance and cost.
4. Terminate long-running, runaway clusters.
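Points 3 and 4 of the list above (tracking cost and terminating runaway clusters) could look roughly like the following sketch. `ClusterRecord`, `ClusterTracker`, the per-cluster runtime limit and the hourly cost field are all assumptions made for illustration; the current time is passed in explicitly so the logic stays deterministic.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClusterRecord:
    cluster_id: str
    started_at: float   # seconds since an arbitrary epoch
    max_runtime: float  # seconds the owner asked for (assumed policy)
    hourly_cost: float  # hypothetical cost unit per hour


class ClusterTracker:
    """Hypothetical bookkeeping for provisioned clusters."""

    def __init__(self) -> None:
        self._records: Dict[str, ClusterRecord] = {}

    def register(self, record: ClusterRecord) -> None:
        self._records[record.cluster_id] = record

    def cost_so_far(self, cluster_id: str, now: float) -> float:
        record = self._records[cluster_id]
        return (now - record.started_at) / 3600.0 * record.hourly_cost

    def reap_runaways(self, now: float) -> List[str]:
        """Drop every cluster that outlived its limit and return their ids."""
        expired = sorted(
            cluster_id
            for cluster_id, record in self._records.items()
            if now - record.started_at > record.max_runtime
        )
        for cluster_id in expired:
            del self._records[cluster_id]  # a real system would also terminate it
        return expired


if __name__ == "__main__":
    tracker = ClusterTracker()
    tracker.register(ClusterRecord("a", started_at=0, max_runtime=3600, hourly_cost=2.0))
    tracker.register(ClusterRecord("b", started_at=0, max_runtime=10800, hourly_cost=2.0))
    print(tracker.cost_so_far("a", now=1800), tracker.reap_runaways(now=7200))
```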
Solr HAFT Service
1. High availability and fault tolerance
2. Home-grown technology
3. Open source :) (work in progress)
4. Features
• One-push disaster recovery
• High availability operations
  • Replace node
  • Add replicas
  • Repair collection
  • Collection versioning
• Cluster backup operations
  • Dynamic replica creation
  • Cluster clone
  • Cluster swap
  • Cluster state reconstruction
Solr HAFT Service – Functional View
[Diagram: the Solr HAFT Service groups its functions into Index Management Actions, High Availability Actions and Cluster Backup Operations, alongside Solr metadata, ZooKeeper metadata, verification and monitoring. Functions shown: Clone Alias, Clone Collections, Custom Commit, Node Replacement, Node Repair, Clone Cluster, Collection Versioning, Black Box Recording, Lucene Segment Optimize, Dynamic Replica Creation, Cluster Clone, Cluster Swap, Cluster State Reconstruction]
Disaster Recovery in New Architecture
[Diagram: the engineer on pager duty ("Brave Soul on Pager Duty") presses the push-button recovery control; the Solr HAFT Service builds a new Solr cluster with its own ZooKeeper ensemble from the old production Solr cluster; DNS then switches to the new cluster]
1. The engineer on pager duty clicks the recovery button.
2. Solr HAFT Service triggers cluster setup, state reconstruction, cluster clone and cluster swap.
3. Production DNS points to the new cluster.
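The push-button recovery sequence can be sketched as an ordered list of steps that stops at the first failure, so DNS is only switched once the new cluster is fully in place. The step names follow the slide; the runner and the callables are hypothetical and do not reflect how HAFT is actually implemented.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_recovery(steps: List[Step]) -> List[str]:
    """Run the steps in order and return the names that completed.

    Raises RuntimeError naming the failed step; later steps are not run.
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            raise RuntimeError(
                f"recovery stopped at '{name}'; completed: {completed}"
            ) from exc
        completed.append(name)
    return completed


if __name__ == "__main__":
    log: List[str] = []
    steps: List[Step] = [
        ("cluster setup", lambda: log.append("setup")),
        ("state reconstruction", lambda: log.append("state")),
        ("cluster clone", lambda: log.append("clone")),
        ("cluster swap", lambda: log.append("swap")),
        ("switch production DNS", lambda: log.append("dns")),
    ]
    print(run_recovery(steps))
```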
SC2 vs Non-SC2 (Stability Features)
Property Non-SC2 SC2
Linear Scalability for Heterogeneous Workload
Pipeline-Level Isolation
Dynamic Collection Scaling
Prevention from Bad Clients
Pipeline-Specific Performance
No Direct Access to Production Cluster
Can sleep at night? :)
SC2 vs Non-SC2 (Availability Features)
Property Non-SC2 SC2
Cross Data-Center Support
Cluster Cloning
Collection Versioning
One-Push Disaster Recovery
Repair API for Nodes/Collections
Node Replacement
Lessons Learned
1. Solr is a search platform. Do not use it as a database (for scans and lookups). Evaluate your stored fields.
2. Understand access patterns, QPS and queries in detail. Be careful when tuning caches.
3. Have access control for large-scale jobs that directly talk to your cluster. (Internal DDoS attacks are hard to track.)
4. Instrument every piece of infrastructure and collect metrics.
5. Build automated disaster recovery. You will need it. :)
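Lesson 3 (access control for large-scale jobs) could be approached, in its simplest form, with an admission gate that caps how many heavy jobs talk to a cluster at once. `JobGate` is a hypothetical name and a concurrency cap is only one possible policy; the talk does not describe the mechanism BloomReach uses.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class JobGate:
    """Admit at most max_concurrent_jobs at a time and reject the rest."""

    def __init__(self, max_concurrent_jobs: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent_jobs)

    @contextmanager
    def admit(self, job_name: str) -> Iterator[None]:
        # Non-blocking acquire: an over-limit job fails fast instead of queueing.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"cluster busy, rejected job {job_name!r}")
        try:
            yield
        finally:
            self._slots.release()


if __name__ == "__main__":
    gate = JobGate(max_concurrent_jobs=1)
    with gate.admit("pipeline-1"):
        try:
            with gate.admit("pipeline-2"):
                pass
        except RuntimeError as err:
            print(err)
```

Rejecting instead of queueing is a deliberate choice in this sketch: it makes the offending job visible immediately rather than letting load build up silently.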
Questions?
Thank You!
Nitin Sharma nitin.sharma@bloomreach.com https://www.linkedin.com/in/knitinsharma