solr lucene revolution 2014 - solr compute cloud - nitin


Upload: bloomreacheng

Post on 17-Aug-2015


Category:

Engineering



TRANSCRIPT

Page 1: Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitin

Solr Compute Cloud – An Elastic Solr Infrastructure

Nitin Sharma

- Member of technical staff, BloomReach

- [email protected]


Abstract

Scaling search platforms is an extremely hard problem:
• Serving hundreds of millions of documents
• Low latency
• High throughput workloads
• Optimized cost

At BloomReach, we have implemented SC2, an elastic Solr infrastructure for big data applications that:
• Supports heterogeneous workloads while hosted in the cloud.
• Dynamically grows/shrinks search servers.
• Provides application- and pipeline-level isolation, NRT search and indexing.
• Offers latency guarantees and application-specific performance tuning.
• Provides high-availability features like cluster replacement, cross-data-center support, disaster recovery, etc.


About Us

BloomReach

BloomReach has developed a personalized discovery platform that features applications that analyze big data to make our customers’ digital content more discoverable, relevant and profitable.

Myself

I work on search platform scaling for BloomReach’s big data. My relevant experience and background include scaling real-time services for latency-sensitive applications and building performance and search-quality metrics infrastructure for personalization platforms.


The BloomReach Personalized Discovery Platform


BloomReach’s Applications

Organic Search
What it does: Content optimization, management and measurement
Benefit: Enhanced discoverability and customer acquisition in organic search

SNAP
What it does: Personalized onsite search and navigation across devices
Benefit: Relevant and consistent onsite experiences for new and known users

Compass
What it does: Merchandising tool that understands products and identifies opportunities
Benefit: Prioritize and optimize online merchandising

[Diagram label: Content understanding]


Agenda

• BloomReach search use cases and architecture
• Old architecture and issues
• Scaling challenges
• Elastic SolrCloud architecture and benefits
• Lessons learned


BloomReach Search Use Cases

1. Front-end (serving) queries – Uptime and Latency sensitive

2. Batch search pipelines – Throughput sensitive

3. Time bound indexing requirements – Customer Specific

4. Time bound Solr config updates


BloomReach Search Architecture

[Diagram: a single Solr cluster with its ZooKeeper ensemble, shared by MapReduce read pipelines (Pipeline 1 to n), indexing pipelines (Indexing 1 to n) and search traffic arriving through the public API. Legend: heavy, moderate and light load.]


Throughput Issues…

[Diagram: Pipeline 1 to n, Indexing 1 to n and public API search traffic all sharing one Solr cluster and ZooKeeper ensemble.]

● Heterogeneous read workload

● Same collection, but different pipelines, different query patterns, different schedules

● Cache tuning is virtually impossible

● Larger pipelines starve the smaller ones

● Machine utilization determines throughput and stability of a pipeline at any point

● No isolation among jobs
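The cache-tuning point can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the `LRUCache` class, the two workloads and the cache size of 64 are hypothetical and are not taken from Solr or from the talk. It compares the hit rate a small pipeline sees when it shares a cache with a large scanning pipeline against the hit rate it sees on its own.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache (hypothetical stand-in for a query cache)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, key):
        """Return True on a hit; on a miss, insert the key and evict if full."""
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return True
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return False


def small_pipeline_hit_rate(cache, small_queries, large_queries=None):
    """Hit rate seen by the small pipeline, optionally interleaved with a large one."""
    hits = 0
    for position, query in enumerate(small_queries):
        hits += cache.lookup(query)
        if large_queries is not None:
            cache.lookup(large_queries[position])
    return hits / len(small_queries)


# Assumed workloads: the small pipeline cycles over 50 hot queries,
# the large pipeline issues 5,000 distinct queries.
small = [f"small-{i % 50}" for i in range(5000)]
large = [f"large-{i}" for i in range(5000)]

shared_rate = small_pipeline_hit_rate(LRUCache(64), small, large)
isolated_rate = small_pipeline_hit_rate(LRUCache(64), small)
print(f"shared: {shared_rate:.2f}, isolated: {isolated_rate:.2f}")
```

With these assumed numbers the small pipeline's hot set fits the cache when it runs alone, but the large pipeline's scan evicts every entry before it is reused when the two share a cache.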


Stability and Uptime Issues…

[Diagram: Pipeline 1 to n, Indexing 1 to n and public API search traffic all sharing one Solr cluster and ZooKeeper ensemble.]

● Bad clients – bring down the cluster/degrade performance

● Bad queries (with heavy load) – render nodes unresponsive

● Garbage collection issues

● ZK stability issues (as we scale collections)

● CPU/load issues

● Higher number of concurrent pipelines, higher number of issues


Indexing Issues…

[Diagram: Pipeline 1 to n, Indexing 1 to n and public API search traffic all sharing one Solr cluster and ZooKeeper ensemble.]

● Commit frequencies vary with indexer types

● Indexer runs during another pipeline hurt performance

● Indexer client leaks

● Too many stored fields

● Non-batch updates
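As an illustration of the last point, a minimal sketch of client-side batching follows. The helper names and the `send_batch` callable are hypothetical; `send_batch` stands in for whatever call posts one list of documents to the indexing endpoint. The batch size of 500 is an arbitrary assumption.

```python
from itertools import islice


def batched(docs, batch_size):
    """Yield lists of at most batch_size documents from any iterable."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def index_in_batches(docs, send_batch, batch_size=500):
    """Send documents in batches and return the number of requests made.

    send_batch is a hypothetical callable that posts one list of documents.
    """
    requests_made = 0
    for batch in batched(docs, batch_size):
        send_batch(batch)
        requests_made += 1
    return requests_made


# Usage with an in-memory sink instead of a real indexing endpoint.
sent = []
docs = ({"id": str(i)} for i in range(1200))
request_count = index_in_batches(docs, sent.append, batch_size=500)
print(request_count, [len(batch) for batch in sent])
```

Here 1,200 documents go out in three requests rather than 1,200 single-document updates.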


Rethinking…

• Shared cluster for pipelines does not scale.

• Guaranteeing an uptime of 99.99%+ is non-trivial.

• Every job runs great in isolation. When you put them together, they fail.

• Running index-heavy and read-heavy loads on the same cluster causes performance issues.

• Any direct access to the production cluster threatens cluster stability (client leaks, bad queries, etc.).

What if every pipeline had its own cluster?


Solr Compute Cloud (SC2)

• Elastic Infrastructure – Provision Solr Clusters on demand, on-the-fly.

• Create, Use, Terminate Model - Create a temporary cluster with necessary data, use it and throw it away.

• Technologies behind SC2 (built in-house):

- Cluster Management API: dynamic cluster provisioning and resource allocation.

- Solr HAFT: high availability and data management library for SolrCloud.

• Isolation - Pipelines get their own cluster. One cannot disrupt another.

• Dynamic Scaling – Every pipeline can state its own replication requirements.

• Production Safeguard - No direct access. Safeguards from bad clients/access patterns.

• Cost Saving – Provision for the average; withstand peak with elastic growth.
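The create, use, terminate model maps naturally onto a scoped resource. The sketch below is an assumption-laden illustration, not the SC2 API: `FakeSC2Client`, `provision`, `terminate` and `temporary_cluster` are hypothetical names, and the client is an in-memory stand-in so the example runs without any infrastructure.

```python
import itertools
from contextlib import contextmanager


class FakeSC2Client:
    """In-memory stand-in for a cluster provisioning API (hypothetical)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.active = {}

    def provision(self, collection, replicas):
        cluster_id = f"cluster-{next(self._ids)}"
        self.active[cluster_id] = {"collection": collection, "replicas": replicas}
        return cluster_id

    def terminate(self, cluster_id):
        self.active.pop(cluster_id, None)


@contextmanager
def temporary_cluster(client, collection, replicas):
    """Create a cluster, hand it to the caller, and always throw it away."""
    cluster_id = client.provision(collection, replicas)
    try:
        yield cluster_id
    finally:
        # Runs even if the job raises, so failed jobs do not leak clusters.
        client.terminate(cluster_id)


client = FakeSC2Client()
with temporary_cluster(client, collection="A", replicas=6) as cluster_id:
    active_during_job = dict(client.active)  # the pipeline would run here
active_after_job = dict(client.active)
print(active_during_job, active_after_job)
```

Tying termination to scope exit is one way to keep abandoned clusters from accumulating cost when a job fails midway.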


Solr Compute Cloud

[Diagram: Pipeline 1, the Solr Compute Cloud API, the production Solr cluster with its ZooKeeper ensemble, and a provisioned Solr cluster (Collection A, Replicas: 6).]

1. Read pipeline requests collection and desired replicas from SC2 API.

2. SC2 API provisions cluster dynamically with needed setup (and streams Solr data).

3. SC2 calls HAFT service to replicate data from production to provisioned cluster.

4. Pipeline uses this cluster to run job.

[Diagram labels: step 1 carries Request: {Collection: A, Replica: 6}; step 3 is the Solr HAFT Service's Read and Replicate arrows.]
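The four steps can be sketched end to end with in-memory objects. Everything here is hypothetical scaffolding (`Cluster`, `provision`, `replicate`, `run_job`); only the request shape `{Collection: A, Replica: 6}` comes from the slide, and real replication would copy index files rather than Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    # collection name -> list of replicas, each replica a list of documents
    collections: dict = field(default_factory=dict)


def provision(request):
    """Step 2: create an empty cluster sized from the request."""
    replicas = [[] for _ in range(request["Replica"])]
    name = f"sc2-{request['Collection']}"
    return Cluster(name=name, collections={request["Collection"]: replicas})


def replicate(production, target, collection):
    """Step 3: copy the production index into every replica of the target."""
    source_docs = production.collections[collection][0]
    for replica in target.collections[collection]:
        replica.extend(source_docs)


def run_job(cluster, collection):
    """Step 4: a trivial read job that counts documents on one replica."""
    return len(cluster.collections[collection][0])


production = Cluster("production", {"A": [[{"id": 1}, {"id": 2}, {"id": 3}]]})
request = {"Collection": "A", "Replica": 6}   # step 1: the pipeline's request

cluster = provision(request)
replicate(production, cluster, "A")
doc_count = run_job(cluster, "A")
print(cluster.name, len(cluster.collections["A"]), doc_count)
```

The job only ever touches the provisioned cluster, which is the isolation property the slide describes.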


Solr Compute Cloud…

[Diagram: Pipeline 1, the Solr Compute Cloud API, the production Solr cluster with its ZooKeeper ensemble, and a provisioned Solr cluster (Collection A, Replicas: 6).]

1. Pipeline finishes running the job.

2. Pipeline calls SC2 API to terminate the cluster.

3. SC2 terminates the cluster.

[Diagram labels: step 2 carries Terminate: {Cluster}; the Solr HAFT Service is also shown.]


Solr Compute Cloud – Read Pipeline View

[Diagram: read pipelines request clusters through the Solr Compute Cloud API; the Solr HAFT Service, the production Solr cluster and its ZooKeeper ensemble are also shown.]

Pipeline 1: Request: {Collection: A, Replica: 6}, served by Solr Cluster (Collection A, Replicas: 6)

Pipeline 2: Request: {Collection: B, Replica: 2}, served by Solr Cluster (Collection B, Replicas: 2)

Pipeline n: Request: {Collection: C, Replica: 1}, served by Solr Cluster (Collection C, Replicas: 1)


Solr Compute Cloud – Indexing

[Diagram: the indexing pipeline, the Solr Compute Cloud API, the production Solr cluster with its ZooKeeper ensemble, and a provisioned Solr cluster (Collection A, Replicas: 6).]

1. Indexing pipeline requests collection and desired replicas from SC2 API.

2. SC2 API provisions cluster dynamically with needed setup (and streams Solr data).

3. Indexer uses this cluster to index the data.

4. Indexer calls HAFT service to replicate the index from dynamic cluster to production.

5. HAFT service reads data from dynamic cluster and replicates to production Solr.

[Diagram labels: step 1 carries Request: {Collection: A, Replica: 2}; the Solr HAFT Service's Read (step 5) and Replicate arrows connect the dynamic cluster and production.]
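A minimal sketch of the indexing flow, assuming the index is built entirely on the dynamic cluster and only the finished result is copied to production. The `build_index` and `publish` helpers are hypothetical, and a dict keyed by document id stands in for a real index; whether HAFT replaces or merges the production index is not stated in the talk, so the replace-on-publish behavior here is an assumption.

```python
def build_index(docs):
    """Step 3: index documents on the dynamic cluster (here: a dict keyed by id)."""
    index = {}
    for doc in docs:
        index[doc["id"]] = doc  # a later document with the same id wins
    return index


def publish(production, collection, built_index):
    """Steps 4-5: copy the finished index into production in one assignment."""
    production[collection] = dict(built_index)


production = {"A": {"1": {"id": "1", "title": "old"}}}

built = build_index([
    {"id": "1", "title": "new"},
    {"id": "2", "title": "added"},
])
title_before_publish = production["A"]["1"]["title"]  # production untouched so far

publish(production, "A", built)
title_after_publish = production["A"]["1"]["title"]
print(title_before_publish, title_after_publish, sorted(production["A"]))
```

The point of the split is that indexing load lands on the dynamic cluster, while production only sees the completed index.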


Solr Compute Cloud – Global View

[Diagram: read pipelines 1 to n and indexing pipelines 1 to n send Provision: {Cluster} and Terminate: {Cluster} requests to the Solr Compute Cloud API and run their jobs on elastic clusters; the Solr HAFT Service replicates indexes between the production Solr cluster (with its ZooKeeper ensemble) and the elastic clusters.]


Solr Compute Cloud API

1. API to provision clusters on demand.

2. Dynamic cluster and resource allocation (includes cost optimization)

3. Track request state, cluster performance and cost.

4. Terminate long-running, runaway clusters.
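Points 3 and 4 can be sketched as a small bookkeeping routine. This is an illustration only: the `ClusterRecord` fields, the per-cluster runtime budget and the flat hourly rate per node are assumptions, not details of the SC2 API.

```python
from dataclasses import dataclass


@dataclass
class ClusterRecord:
    cluster_id: str
    nodes: int
    started_at: float      # seconds since some epoch
    max_runtime_s: float   # runtime budget declared at provision time (assumed)


def find_runaways(records, now):
    """Return ids of clusters that have outlived their runtime budget."""
    return [r.cluster_id for r in records if now - r.started_at > r.max_runtime_s]


def cost_so_far(record, now, hourly_rate_per_node):
    """Accrued cost, assuming a flat hourly rate per node."""
    hours = (now - record.started_at) / 3600.0
    return record.nodes * hours * hourly_rate_per_node


records = [
    ClusterRecord("read-1", nodes=6, started_at=0.0, max_runtime_s=3600.0),
    ClusterRecord("index-1", nodes=2, started_at=0.0, max_runtime_s=10800.0),
]
now = 7200.0  # two hours after both clusters started

runaways = find_runaways(records, now)
read_cost = cost_so_far(records[0], now, hourly_rate_per_node=0.5)
print(runaways, read_cost)
```

A periodic job could pass the ids from `find_runaways` to whatever call terminates a cluster.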


Solr HAFT Service

1. High availability and fault tolerance

2. Home-grown technology

3. Open Source - (Work in progress)

4. Features
• One push disaster recovery
• High availability operations
• Replace node
• Add replicas
• Repair collection
• Collection versioning
• Cluster backup operations
• Dynamic replica creation
• Cluster clone
• Cluster swap
• Cluster state reconstruction


Solr HAFT Service – Functional View

[Diagram: functional blocks of the Solr HAFT Service.]

Groups: Index Management Actions, High Availability Actions, Cluster Backup Operations, Solr Metadata, Zookeeper Metadata, Verification, Monitoring

Functions: Clone Alias, Clone Collections, Custom Commit, Node Replacement, Node Repair, Clone Cluster, Collection Versioning, Black Box Recording, Lucene Segment Optimize, Dynamic Replica Creation, Cluster Clone, Cluster Swap, Cluster State Reconstruction


Disaster Recovery in New Architecture

[Diagram: the engineer on call ("Brave Soul on Pager Duty") presses Push Button Recovery on the Solr HAFT Service (1); the service works between the old production Solr cluster and a new Solr cluster, each with its own ZooKeeper ensemble (2); DNS is pointed at the new cluster (3).]

1. Guy on pager clicks the recovery button.

2. Solr HAFT Service triggers:
• Cluster setup
• State reconstruction
• Cluster clone
• Cluster swap

3. Production DNS points to the new cluster.
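One way to picture the push-button flow is as an ordered list of steps with the DNS switch held back until every step succeeds. This is a hedged sketch: the `recover` function and its return shape are hypothetical, the step names are taken from the slide, and the actions are placeholders that only record that they ran.

```python
def recover(steps, switch_dns):
    """Run recovery steps in order; switch DNS only if every step succeeds."""
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as error:
            # Stop at the first failure and leave DNS on the old cluster.
            return {"ok": False, "failed_step": name,
                    "completed": completed, "error": str(error)}
        completed.append(name)
    switch_dns()
    completed.append("dns_switch")
    return {"ok": True, "failed_step": None, "completed": completed, "error": None}


STEP_NAMES = ["cluster_setup", "state_reconstruction", "cluster_clone", "cluster_swap"]

# Happy path: every placeholder action just records that it ran.
log = []
steps = [(name, lambda name=name: log.append(name)) for name in STEP_NAMES]
result = recover(steps, switch_dns=lambda: log.append("dns_switch"))


# Failure path: the clone step raises, so DNS must not be switched.
def failing_clone():
    raise RuntimeError("clone failed")


failure_log = []
failing_steps = [
    ("cluster_setup", lambda: failure_log.append("cluster_setup")),
    ("state_reconstruction", lambda: failure_log.append("state_reconstruction")),
    ("cluster_clone", failing_clone),
    ("cluster_swap", lambda: failure_log.append("cluster_swap")),
]
failed = recover(failing_steps, switch_dns=lambda: failure_log.append("dns_switch"))
print(result["completed"], failed["failed_step"])
```

Keeping the DNS switch last means a half-built cluster never receives production traffic in this sketch.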


SC2 vs Non-SC2 (Stability Features)

Property Non-SC2 SC2

Linear Scalability for Heterogeneous Workload

Pipeline Level Isolation

Dynamic Collection Scaling

Prevention from Bad Clients

Pipeline Specific Performance

No Direct Access to Production Cluster

Can Sleep at night?


SC2 vs Non-SC2 (Availability Features)

Property Non-SC2 SC2

Cross Data-Center Support

Cluster Cloning

Collection Versioning

One-Push Disaster Recovery

Repair API for Nodes/Collections

Node Replacement


Lessons Learned

1. Solr is a search platform. Do not use it as a database (for scans and lookups). Evaluate your stored fields.

2. Understand access patterns, QPS and queries in detail. Be careful when tuning caches.

3. Have access control for large-scale jobs that directly talk to your cluster. (Internal DDoS attacks are hard to track.)

4. Instrument every piece of infrastructure and collect metrics.

5. Build automated disaster recovery. (You will need it.)
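As one possible reading of lesson 3, a per-client token bucket can cap how fast an internal job is allowed to query a cluster. The talk does not say which mechanism BloomReach uses, so this is a swapped-in, generic technique; the `TokenBucket` class and its parameters are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Generic token-bucket limiter (an assumed mechanism, not from the talk)."""

    def __init__(self, rate_per_s, burst, now=0.0):
        self.rate = rate_per_s        # tokens added per second
        self.capacity = burst         # maximum tokens that can accumulate
        self.tokens = float(burst)
        self.updated = now

    def allow(self, now):
        """Return True if one request may proceed at time `now` (seconds)."""
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per client: 2 queries per second with a burst of 3.
bucket = TokenBucket(rate_per_s=2.0, burst=3)
at_start = [bucket.allow(now=0.0) for _ in range(4)]
half_second_later = [bucket.allow(now=0.5) for _ in range(2)]
print(at_start, half_second_later)
```

Rejected requests also give a natural place to record which client was throttled, which ties in with lesson 4 on metrics.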