Hadoop-scale Search with Apache Solr
TRANSCRIPT
Shalin Shekhar Mangar, Lucidworks, April 24, 2015
Hadoop-scale Search with Solr
Viva La Evolución
10M+ total downloads
Solr is both established & growing
250,000+ monthly downloads
Largest community of developers.
2,500+ open Solr jobs.
Solr is the most widely used search solution on the planet.
Lucidworks: Unmatched Solr expertise.
1/3 of the active committers
70% of the open source code is committed
Lucene/Solr Revolution: the world's largest open source user
conference dedicated to Lucene/Solr.
Solr has tens of thousands of applications in production.
You use Solr everyday.
Solr in a Nutshell
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
Solr Key Features
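Several of the features above (full-text search, facets, highlighting) are exercised through Solr's standard HTTP select API. A minimal sketch, assuming a hypothetical Solr instance at localhost:8983 with a collection named `products` and fields `title`, `description`, and `category`:

```python
from urllib.parse import urlencode

# Hypothetical host and collection names, for illustration only.
SOLR_URL = "http://localhost:8983/solr/products/select"

params = {
    "q": "title:laptop",           # full-text search on a field
    "facet": "true",               # enable faceted navigation
    "facet.field": "category",     # facet on the category field
    "hl": "true",                  # enable highlighting
    "hl.fl": "title,description",  # fields to highlight
    "rows": 10,
}

query_url = SOLR_URL + "?" + urlencode(params)
print(query_url)
```

Sending an HTTP GET to that URL returns matching documents plus facet counts and highlighted snippets in one response.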
Why Hadoop & Solr?
I have Hadoop, why do I need Solr?
NoSQL front-end to Hadoop: enable fast, ad-hoc search across structured and unstructured big data
Empower users of all technical ability to interact with, and derive value from, big data — all using a natural language search interface (no MapReduce, Pig, SQL, etc.)
• Preliminary data exploration and analysis
• Near real-time indexing and querying
• Thousands of simultaneous, parallel requests
Share machine-learning insights created on Hadoop to a broad audience through an interactive medium
I have Solr, why do I need Hadoop?
Least expensive storage solution on the market
Leverage Hadoop processing power (MapReduce) to build indexes or send document updates to Solr
Store Solr indexes and transaction logs within HDFS
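Storing indexes in HDFS is configured in solrconfig.xml, as documented in the Solr reference guide; a sketch of the directory-factory settings, with a placeholder namenode address:

```xml
<!-- Store the index on HDFS, with the block cache enabled.
     The namenode address below is a placeholder. -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
</directoryFactory>

<!-- HDFS cannot use the default native locks -->
<lockType>${solr.lock.type:hdfs}</lockType>
```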
Augment Solr data by storing additional information for last-second retrieval in Hadoop
Case 0: Enterprise data deployment
Enterprise documents are stored in HDFS. The Lucidworks HDFS connector processes documents and sends them to SolrCloud. Users make ad-hoc, full-text queries across the full content of all documents in Solr, and retrieve source files directly from HDFS as necessary.
Sink documents into HDFS
• Documents can be migrated from other file storage systems via Flume/Kafka or other scripts
• MapReduce allows for batch processing of documents (e.g. OCR, NER, clustering, etc.)
Index document contents into Solr
• The Lucidworks Hadoop connector parses content from files using many different tools (Tika, GrokIngest, CSV mapping, Pig, etc.)
• Content and data are added to fields in a Solr document
• The resulting document is sent to Solr for indexing
• Users are empowered with ad-hoc, full-text search in Solr
• Provides standard search tools such as autocomplete, more-like-this, spellchecking, faceting, etc.
• Users only access HDFS as needed
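One common pattern for this setup: index the extracted text together with a pointer back to the source file in HDFS, so a hit in Solr can be resolved to the original document only when needed. A sketch with hypothetical field names (the actual schema is defined per deployment):

```python
import json

# Hypothetical document: searchable text plus a pointer to the HDFS source.
doc = {
    "id": "doc-42",
    "title": "Quarterly report",
    "content": "Extracted full text of the PDF ...",          # searchable text
    "hdfs_path": "hdfs://namenode:8020/docs/q3/report.pdf",   # source pointer
}

# Solr's JSON update format accepts a list of documents.
update_payload = json.dumps([doc])
print(update_payload)
```

POSTing that payload to the collection's update handler indexes the text; the `hdfs_path` field is only dereferenced when a user asks for the original file.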
Lucidworks + Hadoop
• Ingestion tools for various file formats, etc.
• Hive 2-way Load/Store support
• Pig Load/Store
• http://lucidworks.com/product/integrations/hadoop/
• (More on Spark in a bit)
What’s old is new again!
• Build/Store indexes in HDFS
• https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
• Block cache is your friend
Deployment and Security support
• https://github.com/LucidWorks/yarn-proto
• Slider and Ambari support coming soon
• Authz, Authc and Doc Filtering coming in May
Hadoop Basics
Case 1: Compliance
• Monitoring and customer service search for large-volume transactional data
• Initial Setup:
• 20 machines, 32 GB RAM, 800 GB SSD, 2 Solr nodes per machine
• Indexing from Kafka to Solr (Lucidworks Fusion)
• 14B+ docs indexed/searchable in POC (disk limited)
• Growth to 4B+ per day w/ 6 month life expectancy
Case 2: Web Analytics
• Large scale ad-hoc analytics over weblogs using Tableau as a front end BI tool for Solr
• Initial setup:
• 4 machines, 16 cores, 128 GB of RAM, 4x1 TB disks, several Solr nodes per machine
• Data originally in Hive
• POC: 10s of Billions of events growing to 160B+ per week
current_log_writer collection alias rolls over to a new transient collection every two hours; the shards in the transient collection are merged into the 2-hour shard and added to the daily collection
Connector writes to the collection alias, up to 50K docs / sec
Latest 2-hour shard gets built from merging shards at time bucket boundary
Multiple shards needed to support 50K writes per second
Every daily collection has 12 (or 24) shards, each covering 2-hour blocks of log messages
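The 2-hour bucketing above can be sketched as a small naming function. The `logs_<mon><day>` / `hNN` scheme is inferred from the slide's example names (logs_feb26, h02 ... h24) and is illustrative only:

```python
from datetime import datetime

def bucket_names(ts: datetime):
    """Map a log timestamp to its daily collection and 2-hour shard.

    Naming follows the slide's examples: shard hNN is the
    2-hour block that *ends* at hour NN (h02, h04, ..., h24).
    """
    daily = "logs_" + ts.strftime("%b%d").lower()   # e.g. logs_feb26
    end_hour = (ts.hour // 2 + 1) * 2               # block end: 2, 4, ..., 24
    shard = f"h{end_hour:02d}"
    return daily, shard

print(bucket_names(datetime(2015, 2, 26, 23, 15)))  # last block of the day
```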
Sample Architecture

[Diagram: the Fusion Logstash connector writes to the current_log_writer collection alias, which points at a transient collection (e.g. logs_feb26_h24) with Shards 1-4. At each time-bucket boundary the transient collection's shards are merged into the latest 2-hour shard (h02 ... h22, h24) of a daily collection (logs_feb01 ... logs_feb26). Every daily collection has 12 (or 24) shards, each covering 2-hour blocks of log messages. Replicas can be added to support higher query volume and fault tolerance.]
Sample Query Execution

[Diagram: the Fusion SiLK dashboard queries two collection aliases: recent_logs, which spans the daily collections (logs_feb01 ... logs_feb26), and todays_logs, which points at the current day's collection and rolls over to a new day automatically at the day boundary.]
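Rolling the todays_logs alias to a new day is a single Collections API call; CREATEALIAS overwrites an existing alias, so re-pointing it is atomic from the client's perspective. A sketch that builds the request (host name hypothetical):

```python
from datetime import date
from urllib.parse import urlencode

def rollover_url(day: date, solr: str = "http://localhost:8983") -> str:
    """Build the Collections API call that re-points the todays_logs
    alias at the new day's collection."""
    new_collection = "logs_" + day.strftime("%b%d").lower()  # e.g. logs_feb27
    params = {
        "action": "CREATEALIAS",
        "name": "todays_logs",
        "collections": new_collection,
    }
    return f"{solr}/solr/admin/collections?{urlencode(params)}"

print(rollover_url(date(2015, 2, 27)))
```

A scheduled job can call this at midnight after creating the new day's collection.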
Case 3: Lots of Users, Lots of Data
• Search of consumer cloud storage
• Key challenges: not all users are equal; users grow and change all the time
• Petabytes of data, millions of users, 1000’s of nodes
• 1000’s of collections while isolating access
• Learn more: https://www.youtube.com/watch?v=_Erkln5WWLw and http://www.slideshare.net/lucidworks/scaling-solrcloud-to-a-large-number-of-collections-shalin-shekhar-mangar
• Improve Zookeeper interactions and performance to handle thousands of collections
• Deep paging
• Split shards on arbitrary hash ranges
• Large scale testing
• Collection migration
Case 3: Key Solr Improvements
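The "split shards on arbitrary hash ranges" improvement is exposed through the Collections API's SPLITSHARD action and its `ranges` parameter (comma-separated hexadecimal hash ranges). A sketch building such a request; the collection name and range values are illustrative:

```python
from urllib.parse import urlencode

# Hypothetical collection and shard; ranges are hex, comma-separated,
# allowing an uneven split instead of the default 50/50 split.
params = {
    "action": "SPLITSHARD",
    "collection": "user_data",
    "shard": "shard1",
    "ranges": "0-1f4,1f5-3e8",
}
url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)
print(url)
```

Uneven splits are what make the multi-tenant case work: a hot tenant's hash range can be carved off into its own shard.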
Testing: Solr Scale Toolkit
https://github.com/LucidWorks/solr-scale-tk
[Diagram: test automation scripts (Python with Fabric, Boto, etc.) drive JMeter client nodes that send indexing and query requests to a Solr cluster of N x M nodes (Solr Node 1 on port 8983 ... Solr Node M on port 898X, each hosting multiple cores), deployed easily on custom AMIs with test data stored in Amazon S3. A three-node ZooKeeper ensemble coordinates SolrCloud traffic between all Solr nodes and ZK. Support services: Kafka (MQ / data integration), Logstash (log aggregation and analysis from the N x M Solr nodes), CollectD/SiLK (system and JMX monitoring of the N machines), and a DB for test results. Key point: each test defines the density of cores per node and the number of Solr nodes per machine, as well as the instance type and number of machines.]
• Power user search and recommendations over news content and engagement signals (shares, views, etc.) using Lucidworks Fusion
• Combines content and collaborative filtering approaches to calculate search boosts and “people who did X also did Y” results
• Data: ~10M events (POC) growing to 3-4B per month
Case 4: Signals for Search and Discovery
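The "people who did X also did Y" side of this can be sketched as a plain co-occurrence count over click signals. This is a toy stand-in for the aggregation jobs Fusion runs at scale; all names and data below are made up:

```python
from collections import Counter
from itertools import combinations

# Toy click signals: (user, document) pairs.
signals = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"), ("u2", "c"),
    ("u3", "a"), ("u3", "c"),
]

# Group documents by user.
by_user = {}
for user, doc in signals:
    by_user.setdefault(user, set()).add(doc)

# Count how often each pair of documents co-occurs in a user's history.
cooc = Counter()
for docs in by_user.values():
    for x, y in combinations(sorted(docs), 2):
        cooc[(x, y)] += 1

# Documents most often seen together with "a" become boost
# candidates for queries that match "a".
related_to_a = {pair: n for pair, n in cooc.items() if "a" in pair}
print(cooc.most_common())
```

In production the counts would be aggregated periodically and written back into Solr as boost fields, which is the role the scheduled Spark service described below takes on.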
• Lucidworks Fusion 1.4 (May ’15) will ship Apache Spark as a scheduled service for large scale aggregations, machine learning and more
• We’ve already seen 3x speedup in some tests
• Will ship with the ALS recommendation algorithm and Mahout algorithms
• Solr as a Spark RDD
• https://github.com/LucidWorks/spark-solr
• http://www.slideshare.net/thelabdude/apachecon-na-2015-spark-solr-integration
Next Level Signals
Fusion Architecture

[Diagram: millions of users query billions of docs through a REST proxy, with security woven throughout. Fusion services include Recs, Pipes, Metrics, NLP, Scheduler, Blobs, Admin, Connectors, Signals, and a Worker / Cluster Manager running Spark. Solr shards are stored on HDFS (optional). A ZooKeeper ensemble (ZK 1 ... ZK N) provides shared config management, leader election, and load balancing.]
• Native, pluggable Security in Solr (May)
• Numerous performance enhancements for shard replication
• New distributed query algorithm for large numbers of shards
• Advanced rule-based replica placement strategy
• Many new extensions for facets and analytics
• Percentiles (t-digest)
• Facet combinations
Roadmap
Next steps
Download Solr: http://lucene.apache.org/solr
Download Fusion: http://www.lucidworks.com/products/fusion
Contact Lucidworks: http://lucidworks.com/company/contact/
Contact me: [email protected] | http://twitter.com/shalinmangar
Bangalore Solr/Lucene Meetup: http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/