solr compute cloud – an elastic solr infrastructure: presented by nitin sharma, bloomreach

Solr Compute Cloud – An Elastic Solr Infrastructure

Nitin Sharma - Member of technical staff, BloomReach - nitin.sharma@bloomreach.com

Abstract

Scaling search platforms is an extremely hard problem:

•  Serving hundreds of millions of documents
•  Low latency
•  High throughput workloads
•  Optimized cost

At BloomReach, we have implemented SC2, an elastic Solr infrastructure for big data applications that:

•  Supports heterogeneous workloads while hosted in the cloud.
•  Dynamically grows/shrinks search servers.
•  Provides application and pipeline level isolation, NRT search and indexing.
•  Offers latency guarantees and application-specific performance tuning.
•  Provides high-availability features like cluster replacement, cross-data center support, disaster recovery, etc.

About Us

BloomReach

BloomReach has developed a personalized discovery platform that features applications that analyze big data to make our customers’ digital content more discoverable, relevant and profitable.

Myself

I work on search platform scaling for BloomReach’s big data. My relevant experience and background include scaling real-time services for latency-sensitive applications and building performance and search-quality metrics infrastructure for personalization platforms.

The BloomReach Personalized Discovery Platform

BloomReach’s Applications

Organic Search

•  What it does: Content optimization, management and measurement
•  Benefit: Enhanced discoverability and customer acquisition in organic search

•  What it does: Personalized onsite search and navigation across devices
•  Benefit: Relevant and consistent onsite experiences for new and known users

Compass

•  What it does: Merchandising tool that understands products and identifies opportunities
•  Benefit: Prioritize and optimize online merchandising

Agenda

•  BloomReach search use cases and architecture
•  Old architecture and issues
•  Scaling challenges
•  Elastic SolrCloud architecture and benefits
•  Lessons learned

BloomReach Search Use Cases

1.  Front-end (serving) queries – Uptime and latency sensitive
2.  Batch search pipelines – Throughput sensitive
3.  Time bound indexing requirements – Customer specific
4.  Time bound Solr config updates

BloomReach Search Architecture

[Diagram: MapReduce read pipelines (Pipeline 1 … Pipeline n), indexing pipelines (Indexing 1 … Indexing n) and public API search traffic all access a single Solr cluster coordinated by a ZooKeeper ensemble. The legend marks connections as heavy, moderate or light load.]

Throughput Issues…

[Diagram: pipelines, indexers and public API search traffic sharing one Solr cluster and ZooKeeper ensemble.]

●  Heterogeneous read workload
●  Same collection - different pipelines, different query patterns, different schedule
●  Cache tuning is virtually impossible
●  Larger pipeline starving the small ones
●  Machine utilization determines throughput and stability of a pipeline at any point
●  No isolation among jobs

Stability and Uptime Issues…

[Diagram: pipelines, indexers and public API search traffic sharing one Solr cluster and ZooKeeper ensemble.]

●  Bad clients – bring down the cluster/degrade performance
●  Bad queries (with heavy load) – render nodes unresponsive
●  Garbage collection issues
●  ZK stability issues (as we scale collections)
●  CPU/load issues
●  Higher number of concurrent pipelines, higher number of issues

Indexing Issues…

[Diagram: pipelines, indexers and public API search traffic sharing one Solr cluster and ZooKeeper ensemble.]

●  Commit frequencies vary with indexer types
●  Indexers running during another pipeline hurt performance
●  Indexer client leaks
●  Too many stored fields
●  Non-batch updates
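The "non-batch updates" issue can be illustrated with a small batching helper. This is a minimal sketch, not BloomReach's indexer: the helper name `in_batches` and the batch size are assumptions, and the actual call that sends each batch to Solr is left out.

```python
from itertools import islice


def in_batches(docs, size):
    """Yield lists of at most `size` documents, so each update request carries many docs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


docs = [{"id": str(i)} for i in range(10)]
batches = list(in_batches(docs, size=4))
print(len(batches))  # 3 update requests instead of 10
```

Sending one request per batch rather than one per document reduces the number of update requests the cluster has to handle while another pipeline is reading from it.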

Rethinking…

•  Shared cluster for pipelines does not scale.
•  Guaranteeing 99.99%+ uptime is non-trivial.
•  Every job runs great in isolation. When you put them together, they fail.
•  Running index-heavy and read-heavy loads together causes cluster performance issues.
•  Any direct access to the production cluster puts cluster stability at risk (client leaks, bad queries, etc.).

What if every pipeline had its own cluster?

Solr Compute Cloud (SC2)

•  Elastic Infrastructure – Provision Solr clusters on demand, on-the-fly.
•  Create, Use, Terminate Model – Create a temporary cluster with necessary data, use it and throw it away.
•  Technologies behind SC2 (built in-house)
   Cluster Management API – Dynamic cluster provisioning and resource allocation.
   Solr HAFT – High availability and data management library for SolrCloud.
•  Isolation – Pipelines get their own cluster. One cannot disrupt another.
•  Dynamic Scaling – Every pipeline can state its own replication requirements.
•  Production Safeguard – No direct access. Safeguards from bad clients/access patterns.
•  Cost Saving – Provision for the average; withstand peak with elastic growth.

Solr Compute Cloud

1.  Read pipeline requests collection and desired replicas from SC2 API.
2.  SC2 API provisions cluster dynamically with needed setup (and streams Solr data).
3.  SC2 calls HAFT service to replicate data from production to provisioned cluster.
4.  Pipeline uses this cluster to run job.

[Diagram: Pipeline 1 sends Request: {Collection: A, Replica: 6} to the Solr Compute Cloud API. The Solr HAFT Service replicates Collection A from the Solr cluster (with its ZooKeeper ensemble) into a newly provisioned Solr cluster with 6 replicas.]

Solr Compute Cloud…

1.  Pipeline finishes running the job.
2.  Pipeline calls SC2 API to terminate the cluster.
3.  SC2 terminates the cluster.

[Diagram: Pipeline 1 sends Terminate: {Cluster} (step 2) to the Solr Compute Cloud API. The Solr cluster, ZooKeeper ensemble and Solr HAFT Service are also shown.]
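The create, use, terminate model maps naturally onto a context manager. The sketch below is only an illustration under stated assumptions: `FakeSC2Api`, its `provision` and `terminate` methods, and `temporary_cluster` are hypothetical names, and the in-memory dictionary stands in for real clusters. The slides do not show the actual SC2 API.

```python
from contextlib import contextmanager
from itertools import count


class FakeSC2Api:
    """In-memory stand-in for the SC2 API; the method names are hypothetical."""

    def __init__(self):
        self.clusters = {}
        self._ids = count(1)

    def provision(self, collection, replicas):
        # Stands in for: accept the request, set up a cluster, replicate data into it.
        cluster_id = f"cluster-{next(self._ids)}"
        self.clusters[cluster_id] = {"collection": collection, "replicas": replicas}
        return cluster_id

    def terminate(self, cluster_id):
        # Stands in for: throw the temporary cluster away.
        self.clusters.pop(cluster_id, None)


@contextmanager
def temporary_cluster(api, collection, replicas):
    """Create a cluster, hand it to the caller, and always terminate it afterwards."""
    cluster_id = api.provision(collection, replicas)
    try:
        yield cluster_id
    finally:
        api.terminate(cluster_id)


api = FakeSC2Api()
with temporary_cluster(api, collection="A", replicas=6) as cluster_id:
    # The pipeline would run its job against this cluster here.
    print(cluster_id, api.clusters[cluster_id])

print(api.clusters)  # the cluster has been terminated
```

Putting the terminate call in a `finally` block means a failed job still releases its cluster, which is one way to avoid leaving runaway clusters behind.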

Solr Compute Cloud – Read Pipeline View

[Diagram: each read pipeline sends its own request to the Solr Compute Cloud API, for example Pipeline 2 with Request: {Collection: B, Replica: 2} and Pipeline n with Request: {Collection: C, Replica: 1}. Each pipeline gets its own Solr cluster, populated from the production Solr cluster by the Solr HAFT Service. A ZooKeeper ensemble is also shown.]

Solr Compute Cloud – Indexing

1.  Indexing pipeline requests collection and desired replicas from SC2 API.
2.  SC2 API provisions cluster dynamically with needed setup (and streams Solr data).
3.  Indexer uses this cluster to index the data.
4.  Indexer calls HAFT service to replicate the index from dynamic cluster to production.
5.  HAFT service reads data from dynamic cluster and replicates to production Solr.

[Diagram: the indexing pipeline talks to the Solr Compute Cloud API. The Solr HAFT Service reads the index from the dynamic cluster (step 5) and replicates it to production. A ZooKeeper ensemble is also shown.]
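The point of the indexing flow is that production only ever receives a finished index. The following sketch illustrates that property with plain dictionaries standing in for the dynamic cluster and for production Solr. The function name `index_then_publish` is hypothetical and nothing here reflects the real HAFT replication mechanism.

```python
def index_then_publish(docs, production, collection):
    """Index into a scratch cluster, then publish the finished index in one step.

    `production` is a plain dict standing in for the production Solr cluster.
    """
    scratch = {}  # stands in for the dynamically provisioned cluster
    for doc in docs:
        # Step 3: the indexer writes only to the scratch cluster.
        scratch[doc["id"]] = doc
    # Steps 4-5: the finished index is read and replicated to production.
    production[collection] = dict(scratch)
    return len(scratch)


production = {"A": {"old": {"id": "old"}}}
try:
    # The second document has no "id", so indexing fails part-way through.
    index_then_publish([{"id": "1"}, {"title": "broken"}], production, "A")
except KeyError:
    pass
print(production)  # still the old index: the failed run never touched production
```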

Solr Compute Cloud – Global View

[Diagram: read pipelines 1 … n and indexing pipelines 1 … n send Provision: {Cluster} and Terminate: {Cluster} requests to the Solr Compute Cloud API, run their jobs on elastic clusters, and use the Solr HAFT Service to replicate indexes. A ZooKeeper ensemble is also shown.]

Solr Compute Cloud API

1.  API to provision clusters on demand.

2.  Dynamic cluster and resource allocation (includes cost optimization)

3.  Track request state, cluster performance and cost.

4.  Terminate long-running, runaway clusters.
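Item 4 above can be pictured as a periodic check over the tracked clusters. This is a minimal sketch under assumptions: the `ClusterRecord` fields, the per-cluster runtime limit and the function name `find_runaway` are all hypothetical, since the slides do not describe how SC2 decides that a cluster is a runaway.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterRecord:
    cluster_id: str
    started_at: float     # seconds since some epoch
    max_runtime_s: float  # allowed lifetime for this cluster (assumed policy)


def find_runaway(records, now):
    """Return the ids of clusters that have outlived their allowed runtime."""
    return [r.cluster_id for r in records if now - r.started_at > r.max_runtime_s]


records = [
    ClusterRecord("a", started_at=100.0, max_runtime_s=600.0),
    ClusterRecord("b", started_at=900.0, max_runtime_s=600.0),
]
print(find_runaway(records, now=1000.0))  # ['a']
```

Passing `now` in explicitly keeps the check deterministic and easy to test; a real service would read the clock and then call its terminate operation for each returned id.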

Solr HAFT Service

1.  High availability and fault tolerance
2.  Home-grown technology
3.  Open source :) (work in progress)
4.  Features

•  One push disaster recovery
•  High availability operations
   •  Replace node
   •  Add replicas
   •  Repair collection
   •  Collection versioning
•  Cluster backup operations
•  Dynamic replica creation
•  Cluster clone
•  Cluster swap
•  Cluster state reconstruction
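The slides do not show how HAFT implements these operations. As one hedged illustration of what an "add replicas" operation could build on, stock SolrCloud exposes an ADDREPLICA action in its Collections API. The helper name `add_replica_url` and the base URL are assumptions for this sketch, which only builds the request URL and does not contact a server.

```python
from urllib.parse import urlencode


def add_replica_url(base_url, collection, shard, node=None):
    """Build a Solr Collections API URL that asks for one more replica of a shard."""
    params = {"action": "ADDREPLICA", "collection": collection, "shard": shard}
    if node:
        # Optional: pin the new replica to a specific node.
        params["node"] = node
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


print(add_replica_url("http://localhost:8983/solr", "A", "shard1"))
```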

Solr HAFT Service

[Diagram: Solr HAFT Service components. Groups: Index Management Actions, High Availability Actions, Cluster Backup Operations, Verification, Monitoring. Labeled actions: Clone Alias, Clone Collections, Custom Commit, Node Replacement, Node Repair, Clone Cluster, Collection Versioning, Black Box Recording, Lucene Segment Optimize, Solr Metadata, Zookeeper Metadata.]

Solr HAFT Service – Functional View

Dynamic Replica Creation

Cluster Clone

Cluster Swap

Cluster State Reconstruction

Disaster Recovery in New Architecture

1.  The person on pager duty ("Brave Soul on Pager Duty") clicks the recovery button.
2.  Solr HAFT Service triggers cluster setup, state reconstruction, cluster clone and cluster swap.
3.  Production DNS points to the new cluster.

[Diagram: Push Button Recovery. The Solr HAFT Service rebuilds the old production Solr cluster and its ZooKeeper ensemble as a new Solr cluster with its own ZooKeeper ensemble.]
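The recovery sequence is an ordered list of steps in which the DNS switch should only happen once everything before it has succeeded. The sketch below shows that ordering with placeholder actions. The function name `run_recovery` and the step callables are hypothetical; the real HAFT operations are not shown in the slides.

```python
def run_recovery(steps):
    """Run (name, action) steps in order and stop at the first failure.

    Returns (completed step names, name of the failed step or None).
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception:
            return completed, name
        completed.append(name)
    return completed, None


log = []
steps = [
    ("cluster setup", lambda: log.append("setup")),
    ("state reconstruction", lambda: log.append("state")),
    ("cluster clone", lambda: log.append("clone")),
    ("cluster swap", lambda: log.append("swap")),
    ("point production DNS at new cluster", lambda: log.append("dns")),
]
completed, failed = run_recovery(steps)
print(completed, failed)
```

Because the loop returns at the first exception, a failed clone or swap leaves production DNS pointing at the old cluster.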

SC2 vs Non-SC2 (Stability Features)

Property Non-SC2 SC2

Linear Scalability for Heterogeneous Workload

Pipeline Level Isolation

Dynamic Collection Scaling

Prevention from Bad Clients

Pipeline Specific Performance

No Direct Access to Production Cluster

Can Sleep at Night? :)

SC2 vs Non-SC2 (Availability Features)

Property Non-SC2 SC2

Cross Data-Center Support

Cluster Cloning

Collection Versioning

One-Push Disaster Recovery

Repair API for Nodes/Collections

Node Replacement

Lessons Learned

1.  Solr is a search platform. Do not use it as a database (for scans and lookups). Evaluate your stored fields.
2.  Understand access patterns, QPS and queries in detail. Be careful when tuning caches.
3.  Have access control for large-scale jobs that directly talk to your cluster. (Internal DDoS attacks are hard to track.)
4.  Instrument every piece of infrastructure and collect metrics.
5.  Build automated disaster recovery. You will need it. :)
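Lesson 3 can be made concrete with a rate limiter placed in front of the cluster. The token bucket below is one common technique, offered here as an assumption: the slides do not say which form of access control BloomReach used. The class name and parameters are illustrative.

```python
class TokenBucket:
    """Allow at most `rate_per_s` requests per second, with bursts up to `burst`."""

    def __init__(self, rate_per_s, burst, now=0.0):
        self.rate = float(rate_per_s)
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = float(now)

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) may proceed."""
        elapsed = max(0.0, now - self.last)
        self.last = max(self.last, now)
        # Refill for the elapsed time, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_s=1, burst=2)
print([bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0)])  # [True, True, False, True]
```

Giving each large job its own bucket caps how much load any single internal client can put on the cluster, and the rejected requests are easy to count per job, which helps with tracking down the source of a spike.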

Questions?

Thank You!

Nitin Sharma nitin.sharma@bloomreach.com https://www.linkedin.com/in/knitinsharma
