Hadoop-scale Search with Apache Solr
TRANSCRIPT
Shalin Shekhar Mangar, Lucidworks, April 24, 2015
Hadoop-scale Search with Solr
Viva La Evolución
10M+ total downloads
Solr is both established & growing
250,000+ monthly downloads
Largest community of developers.
2,500+ open Solr jobs.
Solr is the most widely used search solution on the planet.
Lucidworks: Unmatched Solr expertise.
1/3 of the active committers
70% of the open source code is committed
Lucene/Solr Revolution: the world's largest open source user
conference dedicated to Lucene/Solr.
Solr has tens of thousands of applications in production.
You use Solr everyday.
Solr in a Nutshell
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
Solr Key Features
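Several of the features above (full-text search, facets, highlighting) are exercised through Solr's standard HTTP select API. A minimal sketch, assuming a hypothetical Solr instance at localhost:8983 with a collection named `products` and fields `title`, `description`, and `category`:

```python
from urllib.parse import urlencode

# Hypothetical host and collection names, for illustration only.
SOLR_URL = "http://localhost:8983/solr/products/select"

params = {
    "q": "title:laptop",           # full-text search on a field
    "facet": "true",               # enable faceted navigation
    "facet.field": "category",     # facet on the category field
    "hl": "true",                  # enable highlighting
    "hl.fl": "title,description",  # fields to highlight
    "rows": 10,
}

query_url = SOLR_URL + "?" + urlencode(params)
print(query_url)
```

Sending an HTTP GET to that URL returns matching documents plus facet counts and highlighted snippets in one response.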
Why Hadoop & Solr?
I have Hadoop, why do I need Solr?
NoSQL front-end to Hadoop: enable fast, ad-hoc search across structured and unstructured big data
Empower users of all technical ability to interact with, and derive value from, big data — all using a natural language search interface (no MapReduce, Pig, SQL, etc.)
• Preliminary data exploration and analysis
• Near real-time indexing and querying
• Thousands of simultaneous, parallel requests
Share machine-learning insights created on Hadoop to a broad audience through an interactive medium
I have Solr, why do I need Hadoop?
Least expensive storage solution on the market
Leverage Hadoop processing power (MapReduce) to build indexes or send document updates to Solr
Store Solr indexes and transaction logs within HDFS
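Storing indexes in HDFS is configured in solrconfig.xml, as documented in the Solr reference guide; a sketch of the directory-factory settings, with a placeholder namenode address:

```xml
<!-- Store the index on HDFS, with the block cache enabled.
     The namenode address below is a placeholder. -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
</directoryFactory>

<!-- HDFS cannot use the default native locks -->
<lockType>${solr.lock.type:hdfs}</lockType>
```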
Augment Solr data by storing additional information for last-second retrieval in Hadoop
Case 0: Enterprise data deployment
Enterprise documents are stored in HDFS. The Lucidworks HDFS connector processes documents and sends them to SolrCloud. Users make ad-hoc, full-text queries across the full content of all documents in Solr, and retrieve source files directly from HDFS as necessary.
Sink documents into HDFS
• Documents can be migrated from other file storage systems via Flume/Kafka or other scripts
• MapReduce allows for batch processing of documents (e.g. OCR, NER, clustering, etc.)
Index document contents into Solr
• The Lucidworks Hadoop connector parses content from files using many different tools (Tika, GrokIngest, CSV mapping, Pig, etc.)
• Content and data are added to fields in a Solr document
• The resulting document is sent to Solr for indexing
• Users are empowered with ad-hoc, full-text search in Solr
• Provides standard search tools such as autocomplete, more-like-this, spellchecking, faceting, etc.
• Users only access HDFS as needed
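One common pattern for this setup: index the extracted text together with a pointer back to the source file in HDFS, so a hit in Solr can be resolved to the original document only when needed. A sketch with hypothetical field names (the actual schema is defined per deployment):

```python
import json

# Hypothetical document: searchable text plus a pointer to the HDFS source.
doc = {
    "id": "doc-42",
    "title": "Quarterly report",
    "content": "Extracted full text of the PDF ...",          # searchable text
    "hdfs_path": "hdfs://namenode:8020/docs/q3/report.pdf",   # source pointer
}

# Solr's JSON update format accepts a list of documents.
update_payload = json.dumps([doc])
print(update_payload)
```

POSTing that payload to the collection's update handler indexes the text; the `hdfs_path` field is only dereferenced when a user asks for the original file.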
Lucidworks + Hadoop
• Ingestion tools for various file formats, etc.
• Hive 2-way Load/Store support
• Pig Load/Store
• http://lucidworks.com/product/integrations/hadoop/
• (More on Spark in a bit)
What’s old is new again!
• Build/Store indexes in HDFS
• https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
• Block cache is your friend
Deployment and Security support
• https://github.com/LucidWorks/yarn-proto
• Slider and Ambari support coming soon
• Authz, Authc and Doc Filtering coming in May
Hadoop Basics
Case 1: Compliance
• Monitoring and customer service search for large-volume transactional data
• Initial Setup:
• 20 machines, 32 GB RAM, 800 GB SSD, 2 Solr nodes per machine
• Indexing from Kafka to Solr (Lucidworks Fusion)
• 14B+ docs indexed/searchable in POC (disk limited)
• Growth to 4B+ per day w/ 6 month life expectancy
Case 2: Web Analytics
• Large scale ad-hoc analytics over weblogs using Tableau as a front end BI tool for Solr
• Initial setup:
• 4 machines, 16 cores, 128 GB of RAM, 4x1 TB disks, several Solr nodes per machine
• Data originally in Hive
• POC: 10s of Billions of events growing to 160B+ per week
current_log_writer collection alias rolls over to a new transient collection every two hours; the shards in the transient collection are merged into the 2-hour shard and added to the daily collection
Connector writes to the collection alias, up to 50K docs / sec
Latest 2-hour shard gets built from merging shards at time bucket boundary
Multiple shards needed to support 50K writes per second
Every daily collection has 12 (or 24) shards, each covering 2-hour blocks of log messages
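The 2-hour bucketing above can be sketched as a small naming function. The `logs_<mon><day>` / `hNN` scheme is inferred from the slide's example names (logs_feb26, h02 ... h24) and is illustrative only:

```python
from datetime import datetime

def bucket_names(ts: datetime):
    """Map a log timestamp to its daily collection and 2-hour shard.

    Naming follows the slide's examples: shard hNN is the
    2-hour block that *ends* at hour NN (h02, h04, ..., h24).
    """
    daily = "logs_" + ts.strftime("%b%d").lower()   # e.g. logs_feb26
    end_hour = (ts.hour // 2 + 1) * 2               # block end: 2, 4, ..., 24
    shard = f"h{end_hour:02d}"
    return daily, shard

print(bucket_names(datetime(2015, 2, 26, 23, 15)))  # last block of the day
```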
Sample Architecture

[Diagram: the Fusion Logstash connector writes to the current_log_writer collection alias, which points at a transient collection (e.g. logs_feb26_h24) with Shards 1-4. At each time-bucket boundary the transient collection's shards are merged into the latest 2-hour shard (h02 ... h22, h24) of a daily collection (logs_feb01 ... logs_feb26). Every daily collection has 12 (or 24) shards, each covering 2-hour blocks of log messages. Replicas can be added to support higher query volume and fault tolerance.]
Sample Query Execution

[Diagram: the Fusion SiLK dashboard queries two collection aliases: recent_logs, which spans the daily collections (logs_feb01 ... logs_feb26), and todays_logs, which points at the current day's collection and rolls over to a new day automatically at the day boundary.]
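Rolling the todays_logs alias to a new day is a single Collections API call; CREATEALIAS overwrites an existing alias, so re-pointing it is atomic from the client's perspective. A sketch that builds the request (host name hypothetical):

```python
from datetime import date
from urllib.parse import urlencode

def rollover_url(day: date, solr: str = "http://localhost:8983") -> str:
    """Build the Collections API call that re-points the todays_logs
    alias at the new day's collection."""
    new_collection = "logs_" + day.strftime("%b%d").lower()  # e.g. logs_feb27
    params = {
        "action": "CREATEALIAS",
        "name": "todays_logs",
        "collections": new_collection,
    }
    return f"{solr}/solr/admin/collections?{urlencode(params)}"

print(rollover_url(date(2015, 2, 27)))
```

A scheduled job can call this at midnight after creating the new day's collection.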
Case 3: Lots of Users, Lots of Data
• Search of consumer cloud storage
• Key challenges: not all users are equal; users grow and change all the time
• Petabytes of data, millions of users, 1000’s of nodes
• 1000’s of collections while isolating access
• Learn more: https://www.youtube.com/watch?v=_Erkln5WWLw and http://www.slideshare.net/lucidworks/scaling-solrcloud-to-a-large-number-of-collections-shalin-shekhar-mangar
• Improve Zookeeper interactions and performance to handle thousands of collections
• Deep paging
• Split shards on arbitrary hash ranges
• Large scale testing
• Collection migration
Case 3: Key Solr Improvements
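The "split shards on arbitrary hash ranges" improvement is exposed through the Collections API's SPLITSHARD action and its `ranges` parameter (comma-separated hexadecimal hash ranges). A sketch building such a request; the collection name and range values are illustrative:

```python
from urllib.parse import urlencode

# Hypothetical collection and shard; ranges are hex, comma-separated,
# allowing an uneven split instead of the default 50/50 split.
params = {
    "action": "SPLITSHARD",
    "collection": "user_data",
    "shard": "shard1",
    "ranges": "0-1f4,1f5-3e8",
}
url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)
print(url)
```

Uneven splits are what make the multi-tenant case work: a hot tenant's hash range can be carved off into its own shard.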
Testing: Solr Scale Toolkit
https://github.com/LucidWorks/solr-scale-tk
[Diagram: test automation scripts (Python with Fabric, Boto, etc.) drive JMeter client nodes that send indexing and query requests to a Solr cluster of N x M nodes (Solr Node 1 on port 8983 ... Solr Node M on port 898X, each hosting multiple cores), deployed easily on custom AMIs with test data stored in Amazon S3. A three-node ZooKeeper ensemble coordinates SolrCloud traffic between all Solr nodes and ZK. Support services: Kafka (MQ / data integration), Logstash (log aggregation and analysis from the N x M Solr nodes), CollectD/SiLK (system and JMX monitoring of the N machines), and a DB for test results. Key point: each test defines the density of cores per node and the number of Solr nodes per machine, as well as the instance type and number of machines.]
• Power user search and recommendations over news content and engagement signals (shares, views, etc.) using Lucidworks Fusion
• Combines content and collaborative filtering approaches to calculate search boosts and “people who did X also did Y” results
• Data: ~10M events (POC) growing to 3-4B per month
Case 4: Signals for Search and Discovery
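The "people who did X also did Y" side of this can be sketched as a plain co-occurrence count over click signals. This is a toy stand-in for the aggregation jobs Fusion runs at scale; all names and data below are made up:

```python
from collections import Counter
from itertools import combinations

# Toy click signals: (user, document) pairs.
signals = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"), ("u2", "c"),
    ("u3", "a"), ("u3", "c"),
]

# Group documents by user.
by_user = {}
for user, doc in signals:
    by_user.setdefault(user, set()).add(doc)

# Count how often each pair of documents co-occurs in a user's history.
cooc = Counter()
for docs in by_user.values():
    for x, y in combinations(sorted(docs), 2):
        cooc[(x, y)] += 1

# Documents most often seen together with "a" become boost
# candidates for queries that match "a".
related_to_a = {pair: n for pair, n in cooc.items() if "a" in pair}
print(cooc.most_common())
```

In production the counts would be aggregated periodically and written back into Solr as boost fields, which is the role the scheduled Spark service described below takes on.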
• Lucidworks Fusion 1.4 (May ’15) will ship Apache Spark as a scheduled service for large scale aggregations, machine learning and more
• We’ve already seen 3x speedup in some tests
• Will ship with the ALS recommendation algorithm and Mahout algorithms
• Solr as a Spark RDD
• https://github.com/LucidWorks/spark-solr
• http://www.slideshare.net/thelabdude/apachecon-na-2015-spark-solr-integration
Next Level Signals
Fusion Architecture

[Diagram: millions of users query billions of docs through a REST proxy, with security woven throughout. Fusion services include Recs, Pipes, Metrics, NLP, Scheduler, Blobs, Admin, Connectors, Signals, and a Worker / Cluster Manager running Spark. Solr shards are stored on HDFS (optional). A ZooKeeper ensemble (ZK 1 ... ZK N) provides shared config management, leader election, and load balancing.]
• Native, pluggable Security in Solr (May)
• Numerous performance enhancements for shard replication
• New distributed query algorithm for large numbers of shards
• Advanced rule-based replica placement strategy
• Many new extensions for facets and analytics
• Percentiles (t-digest)
• Facet combinations
Roadmap
Next steps
Download Solr: http://lucene.apache.org/solr
Download Fusion: http://www.lucidworks.com/products/fusion
Contact Lucidworks: http://lucidworks.com/company/contact/
Contact me: [email protected] | http://twitter.com/shalinmangar
Bangalore Solr/Lucene Meetup: http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/