solr lucene revolution 2014 - solr compute cloud - nitin

Solr Compute Cloud – An Elastic Solr Infrastructure

Nitin Sharma

- Member of technical staff, BloomReach

- [email protected]


Abstract

Scaling search platforms is an extremely hard problem:
• Serving hundreds of millions of documents
• Low latency
• High-throughput workloads
• Optimized cost

At BloomReach, we have implemented SC2, an elastic Solr infrastructure for big data applications that:
• Supports heterogeneous workloads while hosted in the cloud.
• Dynamically grows/shrinks search servers.
• Provides application- and pipeline-level isolation, NRT search and indexing.
• Offers latency guarantees and application-specific performance tuning.
• Provides high-availability features like cluster replacement, cross-data center support, disaster recovery, etc.

About Us

BloomReach

BloomReach has developed a personalized discovery platform that features applications that analyze big data to make our customers’ digital content more discoverable, relevant and profitable.

Myself

I work on search platform scaling for BloomReach’s big data. My relevant experience and background include scaling real-time services for latency-sensitive applications and building performance and search-quality metrics infrastructure for personalization platforms.

The BloomReach Personalized Discovery Platform

BloomReach’s Applications

Content understanding

Organic Search
• What it does: Content optimization, management and measurement
• Benefit: Enhanced discoverability and customer acquisition in organic search

SNAP
• What it does: Personalized onsite search and navigation across devices
• Benefit: Relevant and consistent onsite experiences for new and known users

Compass
• What it does: Merchandising tool that understands products and identifies opportunities
• Benefit: Prioritize and optimize online merchandising

Agenda

• BloomReach search use cases and architecture
• Old architecture and issues
• Scaling challenges
• Elastic SolrCloud architecture and benefits
• Lessons learned

BloomReach Search Use Cases

1. Front-end (serving) queries – Uptime and Latency sensitive

2. Batch search pipelines – Throughput sensitive

3. Time bound indexing requirements – Customer Specific

4. Time bound Solr config updates

BloomReach Search Architecture

Diagram: indexing pipelines (Indexing 1, 2 … n), MapReduce read pipelines (Pipeline 1, 2 … n) and public API search traffic all share one Solr cluster backed by a ZooKeeper ensemble. The legend distinguishes heavy, moderate and light load.

Throughput Issues…


● Heterogeneous read workload

● Same collection - different pipelines, different query patterns, different schedule

● Cache tuning is virtually impossible

● Larger pipelines starve the smaller ones

● Machine utilization determines throughput and stability of a pipeline at any point

● No isolation among jobs

Stability and Uptime Issues…


● Bad clients – bring down the cluster/degrade performance

● Bad queries (with heavy load) – render nodes unresponsive

● Garbage collection issues

● ZK stability issues (as we scale collections)

● CPU/load issues

● Higher number of concurrent pipelines, higher number of issues

Indexing Issues…


● Commit frequencies vary with indexer types

● Indexer runs that overlap with another pipeline hurt performance

● Indexer client leaks

● Too many stored fields

● Non-batch updates
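The slides name non-batch updates and varying commit frequencies as problems but do not show the remedy. As a minimal sketch of the general idea, the code below groups documents into fixed-size batches and commits once at the end. The callables `send_batch` and `commit` are hypothetical placeholders for whatever Solr client an indexer uses, and the batch size of 500 is an arbitrary illustration, not a BloomReach setting.

```python
from typing import Callable, Iterable, Iterator, List


def batched(docs: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size documents, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield batch


def index_in_batches(
    docs: Iterable[dict],
    send_batch: Callable[[List[dict]], None],  # hypothetical: one update request per call
    commit: Callable[[], None],                # hypothetical: one commit per run
    batch_size: int = 500,                     # arbitrary illustrative value
) -> int:
    """Send documents in batches, commit once, and return the request count."""
    requests = 0
    for batch in batched(docs, batch_size):
        send_batch(batch)
        requests += 1
    commit()
    return requests
```

With 1,050 documents and a batch size of 500, this issues three update requests and one commit instead of 1,050 individual updates.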

Rethinking…

• Shared cluster for pipelines does not scale.

• Guaranteeing an uptime of 99.99%+ is nontrivial.

• Every job runs great in isolation. When you put them together, they fail.

• Running index-heavy and read-heavy loads on the same cluster causes performance issues.

• Any direct access to the production cluster puts cluster stability at risk (client leaks, bad queries, etc.).

What if every pipeline had its own cluster?

Solr Compute Cloud (SC2)

• Elastic Infrastructure – Provision Solr Clusters on demand, on-the-fly.

• Create, Use, Terminate Model - Create a temporary cluster with necessary data, use it and throw it away.

• Technologies behind SC2 (built in-house)

Cluster Management API - Dynamic cluster provisioning and resource allocation.

Solr HAFT – High availability and data management library for SolrCloud.

• Isolation - Pipelines get their own cluster. One cannot disrupt another.

• Dynamic Scaling – Every pipeline can state its own replication requirements.

• Production Safeguard - No direct access. Safeguards from bad clients/access patterns.

• Cost Saving – Provision for the average; withstand peak with elastic growth.

Solr Compute Cloud

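Diagram: Pipeline 1 calls the Solr Compute Cloud API, which provisions a Solr cluster for Collection A with 6 replicas; the Solr HAFT service reads from the production Solr cluster (with its ZooKeeper ensemble) and replicates to the new cluster.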

1. Read pipeline requests collection and desired replicas from SC2 API.

2. SC2 API provisions cluster dynamically with needed setup (and streams Solr data).

3. SC2 calls HAFT service to replicate data from production to provisioned cluster.

4. Pipeline uses this cluster to run job.

Request sent in step 1: {Collection: A, Replica: 6}
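The provisioning steps on this slide can be sketched as a small in-memory model. This is not the SC2 API, which the slides do not publish; `FakeSc2Api`, `FakeHaftService`, `Cluster` and every method name are hypothetical stand-ins. Only the request shape `{Collection, Replica}` and the order of the steps come from the slide.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Cluster:
    cluster_id: str
    collection: str
    replicas: int
    has_data: bool = False


class FakeHaftService:
    """Hypothetical stand-in for the HAFT service; it only records calls."""

    def __init__(self) -> None:
        self.replicated: List[str] = []

    def replicate_from_production(self, cluster: Cluster) -> None:
        # Step 3: copy data from production to the provisioned cluster.
        self.replicated.append(cluster.cluster_id)
        cluster.has_data = True


class FakeSc2Api:
    """Hypothetical stand-in for the SC2 API."""

    def __init__(self, haft: FakeHaftService) -> None:
        self.haft = haft
        self.clusters: Dict[str, Cluster] = {}

    def provision(self, request: Dict[str, Any]) -> Cluster:
        # Step 1: the pipeline states the collection and replica count.
        replicas = int(request["Replica"])
        if replicas < 1:
            raise ValueError("at least one replica is required")
        # Step 2: provision a cluster sized for this request.
        cluster = Cluster(
            cluster_id=f"cluster-{len(self.clusters) + 1}",
            collection=str(request["Collection"]),
            replicas=replicas,
        )
        self.clusters[cluster.cluster_id] = cluster
        # Step 3: hand over to HAFT before the pipeline may use the cluster.
        self.haft.replicate_from_production(cluster)
        return cluster  # Step 4: the pipeline runs its job against this.


api = FakeSc2Api(FakeHaftService())
cluster = api.provision({"Collection": "A", "Replica": 6})
```

In this sketch the pipeline only receives a cluster after replication has been requested, which mirrors the ordering of steps 3 and 4.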

Solr Compute Cloud…


1. Pipeline finishes running the job.

2. Pipeline calls SC2 API to terminate the cluster.

3. SC2 terminates the cluster.

Request sent in step 2: Terminate: {Cluster}
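One way a pipeline could make sure termination always happens is to wrap the create, use, terminate model in a context manager. The sketch below is an assumption about how a client might be written, not BloomReach's implementation; `FakeSc2Client`, `provision`, `terminate` and `temporary_cluster` are hypothetical names.

```python
from contextlib import contextmanager
from typing import Iterator, List


class FakeSc2Client:
    """Hypothetical in-memory client that tracks active clusters."""

    def __init__(self) -> None:
        self.active: List[str] = []
        self._next_id = 1

    def provision(self, collection: str, replicas: int) -> str:
        cluster_id = f"{collection}-{self._next_id}"
        self._next_id += 1
        self.active.append(cluster_id)
        return cluster_id

    def terminate(self, cluster_id: str) -> None:
        self.active.remove(cluster_id)


@contextmanager
def temporary_cluster(
    client: FakeSc2Client, collection: str, replicas: int
) -> Iterator[str]:
    """Create a cluster, lend it to the caller, and always terminate it."""
    cluster_id = client.provision(collection, replicas)
    try:
        yield cluster_id
    finally:
        # Runs on success and on failure, so no cluster is left behind.
        client.terminate(cluster_id)
```

A job would then run inside `with temporary_cluster(client, "A", 6) as cluster_id:`, and the cluster is released even if the job raises.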

Solr Compute Cloud – Read Pipeline View

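Diagram: each read pipeline (Pipeline 1, 2 … n) requests its own cluster through the Solr Compute Cloud API, with the Solr HAFT service sitting between the production Solr cluster (and its ZooKeeper ensemble) and the provisioned clusters.
• Pipeline 2 – Request: {Collection: B, Replica: 2} → Solr cluster with Collection B, Replicas: 2
• Pipeline n – Request: {Collection: C, Replica: 1} → Solr cluster with Collection C, Replicas: 1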

Solr Compute Cloud – Indexing



1. Indexer requests collection and desired replicas from SC2 API.

2. SC2 API provisions cluster dynamically with needed setup (and streams Solr data).

3. Indexer uses this cluster to index the data.

4. Indexer calls HAFT service to replicate the index from dynamic cluster to production.

5. HAFT service reads data from dynamic cluster and replicates to production Solr.
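The indexing flow above can be illustrated with plain dictionaries standing in for Solr indexes. This is a toy model under stated assumptions: the dynamic cluster is assumed to start as a copy of production data (the slide only says Solr data is streamed), and replication is assumed to replace the production contents wholesale. All function names are hypothetical.

```python
from typing import Dict, Iterable

Index = Dict[str, dict]  # document id -> document; a stand-in for a Solr index


def index_documents(cluster: Index, docs: Iterable[dict]) -> None:
    """Step 3: the indexer writes to the dynamic cluster only."""
    for doc in docs:
        cluster[doc["id"]] = doc


def haft_replicate(source: Index, target: Index) -> int:
    """Steps 4-5: copy the finished index to the target; return its size."""
    target.clear()          # assumption: replication replaces the old index
    target.update(source)
    return len(target)


def run_indexing_job(production: Index, docs: Iterable[dict]) -> int:
    # Steps 1-2: a dynamic cluster is provisioned and seeded with data.
    dynamic_cluster: Index = dict(production)
    # Step 3: production is not touched while indexing is in progress.
    index_documents(dynamic_cluster, docs)
    # Steps 4-5: production sees the result only after indexing completes.
    return haft_replicate(dynamic_cluster, production)
```

The point of the arrangement is that commit-heavy indexing work never competes with production traffic; production only receives a finished index.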


Solr Compute Cloud – Global View

Diagram: indexing pipelines (1 … n) and read pipelines (1 … n) call the Solr Compute Cloud API to provision ({Cluster}) and terminate ({Cluster}) elastic clusters and run their jobs on them; the Solr HAFT service replicates indexes, and a ZooKeeper ensemble is shown alongside.

Solr Compute Cloud API

1. API to provision clusters on demand.

2. Dynamic cluster and resource allocation (includes cost optimization)

3. Track request state, cluster performance and cost.

4. Terminate long-running, runaway clusters.
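Items 3 and 4 suggest bookkeeping along the following lines. The slides do not say how SC2 decides that a cluster is a runaway, so the per-cluster runtime limit and hourly cost fields here are assumptions made for illustration, as are all names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ClusterRecord:
    cluster_id: str
    started_at: float      # seconds since the epoch
    max_runtime_s: float   # hypothetical per-cluster runtime limit
    hourly_cost: float     # hypothetical cost rate in arbitrary units


def find_runaway(clusters: Iterable[ClusterRecord], now: float) -> List[str]:
    """Return ids of clusters that have exceeded their runtime limit."""
    return [
        c.cluster_id
        for c in clusters
        if now - c.started_at > c.max_runtime_s
    ]


def accrued_cost(cluster: ClusterRecord, now: float) -> float:
    """Cost accumulated so far, prorated by the second."""
    return (now - cluster.started_at) / 3600.0 * cluster.hourly_cost
```

A periodic job could call `find_runaway` with the current time and pass each returned id to the terminate operation.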

Solr HAFT Service

1. High availability and fault tolerance

2. Home-grown technology

3. Open source (work in progress)

4. Features
• One-push disaster recovery
• High availability operations
  - Replace node
  - Add replicas
  - Repair collection
  - Collection versioning
• Cluster backup operations
  - Dynamic replica creation
  - Cluster clone
  - Cluster swap
  - Cluster state reconstruction

Solr HAFT Service – Functional View

Diagram labels:
• Function groups: Index Management Actions, High Availability Actions, Cluster Backup Operations, Verification, Monitoring
• Metadata: Solr Metadata, Zookeeper Metadata
• Functions: Clone Alias, Clone Collections, Custom Commit, Node Replacement, Node Repair, Clone Cluster, Collection Versioning, Black Box Recording, Lucene Segment Optimize, Dynamic Replica Creation, Cluster Clone, Cluster Swap, Cluster State Reconstruction

Disaster Recovery in New Architecture

Diagram: the brave soul on pager duty presses the push-button recovery control on the Solr HAFT service, which sits between the old production Solr cluster and a new Solr cluster (each with its own ZooKeeper ensemble); DNS then points at the new cluster.

1. The person on pager duty clicks the recovery button.

2. Solr HAFT Service triggers:
• Cluster setup
• State reconstruction
• Cluster clone
• Cluster swap

3. Production DNS is pointed at the new cluster.
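The recovery sequence can be sketched as an ordered list of actions followed by a DNS switch. The slide lists the four HAFT actions but does not state that they run strictly in sequence or how failures are handled; both are assumptions here, and `run_recovery` is a hypothetical helper.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_recovery(steps: List[Step], switch_dns: Callable[[], None]) -> List[str]:
    """Run recovery steps in order; switch DNS only if every step succeeds."""
    completed: List[str] = []
    for name, action in steps:
        action()  # an exception here aborts the run before DNS is touched
        completed.append(name)
    switch_dns()
    completed.append("dns switch")
    return completed
```

Keeping the DNS switch last means a failed clone or swap leaves traffic pointed at the old cluster in this model.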

SC2 vs Non-SC2 (Stability Features)

Property Non-SC2 SC2

Linear Scalability for Heterogeneous Workload

Pipeline Level Isolation

Dynamic Collection Scaling

Prevention from Bad Clients

Pipeline Specific Performance

No Direct Access to Production Cluster

Can Sleep at night?

SC2 vs Non-SC2 (Availability Features)

Property Non-SC2 SC2

Cross Data-Center Support

Cluster Cloning

Collection Versioning

One-Push Disaster Recovery

Repair API for Nodes/Collections

Node Replacement

Lessons Learned

1. Solr is a search platform. Do not use it as a database (for scans and lookups). Evaluate your stored fields.

2. Understand access patterns, QPS and queries in detail. Be careful when tuning caches.

3. Have access control for large-scale jobs that directly talk to your cluster. (Internal DDOS attacks are hard to track.)

4. Instrument every piece of infrastructure and collect metrics.

5. Build automated disaster recovery. (You will need it.)

Questions?

Thank You!

Nitin Sharma

[email protected]

https://www.linkedin.com/in/knitinsharma



