Real-Time Inverted Search - NYC Apache Solr/Lucene User Group, October 2014
DESCRIPTION
Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr’s full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.
TRANSCRIPT
REAL-TIME INVERTED SEARCH IN THE CLOUD USING LUCENE AND STORM
Joshua Conlin, Bryan Bende, James [email protected]
Problem Statement
Storm
Methodology
Results
Table of Contents
Booz Allen Hamilton
– Large consulting firm supporting many industries
• Healthcare, Finance, Energy, Defense
– Strategic Innovation Group
• Focus on innovative solutions that can be applied across industries
• Major focus on data science, big data, & information retrieval
• Multiple clients utilizing Solr for implementing search capabilities
• Explore Data Science
• Self-paced data science training, launching TODAY!
• https://exploredatascience.com
Who are we?
Client Applications & Architecture
[Architecture diagram: Ingest → SolrCloud → Web App]
Typical client applications allow users to:
• Query document index using Lucene syntax
• Filter and facet results
• Save queries for future use
How do we instantly notify users of new documents that match their
saved queries?
Constraints:
• Process documents in real-time, notify as soon as possible
• Scale with the number of saved queries (starting with tens of thousands)
• Result set of notifications must match saved queries
• Must not impact performance of the web application
• Data arrives at varying speeds and varying sizes
Problem Statement
1. Fork ingest to a second Solr instance, run stored queries periodically
– Pros: Easy to set up, works for a small amount of data & a small number of queries
– Cons: Bound by time to execute all queries
2. Same secondary Solr instance, but distribute queries to multiple servers
– Pros: Reduces query processing time by dividing across several servers
– Cons: Now writing custom code to distribute queries; possible synchronization issues ensuring each server executes queries against the same data
3. Give each server its own Solr instance and subset of queries
– Pros: Very scalable, only bound by number of servers
– Cons: Difficult to maintain, still writing custom code to distribute data and queries
Possible Solutions
Is there a way we can set up this system so that it’s:
• easy to maintain,
• easy to scale, and
• easy to synchronize?
Possible Solutions
• Integrate Solr and/or Lucene with a stream processing framework
• Process data in real-time; leverage a proven framework for distributed stream processing
Candidate Solution
[Architecture diagram: Ingest → SolrCloud → Web App, with Storm consuming the ingest stream and producing Notifications]
• Storm is an open source stream processing framework.
• It’s a scalable platform that lets you distribute processes across a cluster quickly
and easily.
• You can add more resources to your cluster and easily utilize those resources in
your processing.
Storm - Overview
• Nimbus – the control node for the cluster; distributes the topology through the cluster
• Supervisor – one on each machine in the cluster; controls the allocation of worker assignments on its machine
• Worker – JVM process for running topology components
Storm - Components
[Cluster diagram: one Nimbus node; three Supervisors, each managing four Workers]
• Topology – defines a running process, which includes all of the processes to be run, the connections between those processes, and their configuration
• Stream – the flow of data through a topology; it is an unbounded collection of tuples that is passed from process to process
• Storm has 2 types of processing units:
– Spout – the start of a stream; it can be thought of as the source of the data; that data can be read in however the spout wants—from a database, from a message queue, etc.
– Bolt – the primary processing unit for a topology; it accepts any number of streams, does whatever processing you’ve set it to do, and outputs any number of streams based on how you configure it (a minimal spout and bolt are sketched below)
Storm – Core Concepts
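To make spouts and bolts concrete, here is a minimal sketch against the backtype.storm API of the Storm 0.8/0.9 era this talk targets; the class names, the in-memory queue, and the "article" field are illustrative assumptions, not the presenters' code.

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// A spout is the source of a stream: Storm calls nextTuple() repeatedly
// and the spout emits tuples into the topology.
public class ArticleSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Queue<String> pending = new ConcurrentLinkedQueue<String>();

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void nextTuple() {
        String article = pending.poll();   // hypothetical in-memory source
        if (article != null) {
            collector.emit(new Values(article));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("article"));
    }
}

// A bolt consumes tuples, processes them, and may emit new tuples.
class PrintBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        System.out.println("received: " + tuple.getStringByField("article"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: emits nothing
    }
}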
• Stream Groupings – defines how topology processing units (spouts and bolts) are connected to each other; some common groupings are:
– All Grouping – stream is sent to all bolts
– Shuffle Grouping – stream is evenly distributed across bolts
– Fields Grouping – sends tuples that match on the designated “field” to the same bolt (wiring these groupings together is sketched below)
Storm – Core Concepts (continued)
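A hedged sketch of wiring a topology with these groupings, assuming the hypothetical ArticleSpout above and ExecutorBolt/NotificationBolt classes like those described later; component names, parallelism hints, and worker count are illustrative.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class NotificationTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("article-spout", new ArticleSpout(), 1);
        // All Grouping: every ExecutorBolt instance receives every document
        builder.setBolt("executor-bolt", new ExecutorBolt(), 8)
               .allGrouping("article-spout");
        // Shuffle Grouping: match results spread evenly over NotificationBolts
        builder.setBolt("notification-bolt", new NotificationBolt(), 8)
               .shuffleGrouping("executor-bolt");

        Config conf = new Config();
        conf.setNumWorkers(8);
        StormSubmitter.submitTopology("inverted-search", conf, builder.createTopology());
    }
}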
Storm - Parallelism
Source: http://storm.incubator.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
How can we use this framework to solve our problem?
How to Utilize Storm
Let Storm distribute the data and queries among processing nodes
…but we would still need to manage a Solr instance on each VM, and we would still need to ensure synchronization between query-processing bolts running on the same VM.
What if instead of having a Solr installation on each machine we ran
Solr in memory inside each of the processing bolts?
• Use a Storm spout to distribute new documents
• Use a Storm bolt to execute queries against an EmbeddedSolrServer with a RAMDirectory (a sketch of the in-memory setup follows the diagram below):
– Incoming documents added to the index
– Queries executed
– Documents removed from the index
• Use Storm bolt to process query results
How to Utilize Storm
[Diagram: Storm bolt containing an EmbeddedSolrServer backed by a RAMDirectory]
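As a rough illustration, a bolt might stand up an in-memory Solr 4.x core like this; the solr-home path and core name are assumptions, and the packaged solrconfig.xml is assumed to declare solr.RAMDirectoryFactory so the index never touches disk.

import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

public class EmbeddedSolrFactory {
    public static EmbeddedSolrServer create(String solrHome) {
        // solrconfig.xml under solrHome is assumed to contain:
        //   <directoryFactory class="solr.RAMDirectoryFactory"/>
        CoreContainer container = new CoreContainer(solrHome);
        container.load();
        return new EmbeddedSolrServer(container, "collection1");
    }
}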
This has several advantages:
• It removes the need to maintain a Solr instance on each VM.
• It’s easier to scale and more flexible; it doesn’t matter which Supervisor the bolts
get sent to, all the processing is self-contained.
• It removes the need to synchronize processing between bolts.
• Documents are volatile; we run existing queries over new data rather than new queries over stored data
Advantages
Execution Topology
[Topology diagram: three Data Spouts and one Query Spout feeding five Executor Bolts, which feed a Notification Bolt]
Data Spout – Receives incoming data files and sends them to every Executor Bolt (All Grouping)
Query Spout – Coordinates updates to queries
Executor Bolt – Loads and executes queries
Notification Bolt – Generates notifications based on results (Shuffle Grouping)
1. Queries are loaded into memory
2. Incoming documents are added to the Lucene index
3. Documents are processed when one of the following conditions is met:
a) The number of documents exceeds the max batch size
b) The time since the last execution is longer than the max interval
4. Matching queries and document UIDs are emitted
5. All documents are removed from the index
(A sketch of this batching logic follows the diagram below.)
Executor Bolt
[Diagram: Executor Bolt holding its Query List and batched Documents; numbered steps 1–4 culminate in emit()]
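A hedged sketch of this Executor Bolt logic against the same backtype.storm API; the thresholds mirror the trials below (1k docs or 60 seconds), while the index and query helpers are placeholder stubs, not the presenters' implementation. Note that execute() only runs when a tuple arrives, so the time condition is only checked on arrival; a tick tuple or timer would be needed to flush an idle index.

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class ExecutorBolt extends BaseRichBolt {
    private static final int MAX_BATCH_SIZE = 1000;     // flush after this many docs
    private static final long MAX_INTERVAL_MS = 60000;  // or after this much time

    private OutputCollector collector;
    private long lastExecution;
    private int pendingDocs;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.lastExecution = System.currentTimeMillis();
        loadQueries();                // 1. load saved queries into memory
    }

    public void execute(Tuple tuple) {
        addToIndex(tuple);            // 2. add incoming doc to the in-memory index
        pendingDocs++;

        long now = System.currentTimeMillis();
        if (pendingDocs >= MAX_BATCH_SIZE || now - lastExecution >= MAX_INTERVAL_MS) {
            // 3/4. run every saved query, emit (queryId, docUid) pairs
            for (String[] match : runAllQueries()) {
                collector.emit(new Values(match[0], match[1]));
            }
            clearIndex();             // 5. documents are volatile: drop after matching
            pendingDocs = 0;
            lastExecution = now;
        }
        collector.ack(tuple);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("queryId", "docUid"));
    }

    // Placeholder hooks standing in for the in-memory Lucene index and stored queries.
    private void loadQueries() {}
    private void addToIndex(Tuple tuple) {}
    private List<String[]> runAllQueries() { return Collections.emptyList(); }
    private void clearIndex() {}
}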
• Attempted to run Solr with an in-memory index inside a Storm bolt
• Solr 4.5 requires:
– http-client 4.2.3
– http-core 4.2.2
• Storm 0.8.2 & 0.9.0 require:
– http-client 4.1.1
– http-core 4.1
• Could exclude the libraries from the super jar and rely on storm/lib, but Solr expects SystemDefaultHttpClient from 4.2.3
• Could build Storm with newer versions of the libraries, but that’s not guaranteed to work
Solr In-Memory Processing Bolt Issues
Advantages:
• Fast, lightweight
• No dependency conflicts
• RAMDirectory backed
• Easy Solr-to-Lucene document conversion
• Solr schema based
Lucene In-Memory Processing Bolt
[Diagram: Storm bolt containing a Lucene index backed by a RAMDirectory]
1. Initialization
– Parse common Solr schema
– Replace Solr classes
2. Add Documents
– Convert SolrInputDocument to Lucene Document
– Add to index
Lucene In-Memory Processing Bolt
public void addDocument(SolrInputDocument doc) throws Exception {
    if (doc != null) {
        // convert the Solr document to a Lucene document and index it
        Document luceneDoc = solrDocumentConverter.convert(doc);
        indexWriter.addDocument(luceneDoc);
        indexWriter.commit();
    }
}

public Document convert(SolrInputDocument solrDocument) throws Exception {
    return DocumentBuilder.toDocument(solrDocument, indexSchema);
}
Read/parse/update the Solr schema file using StAX
Create an IndexSchema from the new Solr schema data (one way this might look with the Solr 4.x API is sketched below)
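For illustration, constructing an IndexSchema from a packaged Solr config directory might look like this under the Solr 4.x API; the directory path and file names are assumptions. The resulting schema is what DocumentBuilder.toDocument() above consumes.

import java.io.FileInputStream;
import org.apache.solr.core.SolrConfig;
import org.apache.solr.core.SolrResourceLoader;
import org.apache.solr.schema.IndexSchema;
import org.xml.sax.InputSource;

public class SchemaLoader {
    public static IndexSchema load(String confDir) throws Exception {
        // Read solrconfig.xml and schema.xml from the packaged config directory
        SolrResourceLoader loader = new SolrResourceLoader(confDir);
        SolrConfig solrConfig = new SolrConfig(loader, "solrconfig.xml", null);
        InputSource schemaSource = new InputSource(new FileInputStream(confDir + "/schema.xml"));
        return new IndexSchema(solrConfig, "schema.xml", schemaSource);
    }
}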
• Infrastructure:
– 8-node cluster on Amazon EC2
– Each VM has 2 cores and 8 GB of memory
• Data:
– 92,000 news article summaries
– Average file size: ~1 KB
• Queries:
– Generated 1 million sample queries
– Randomly selected terms from the document set
– Stored in MariaDB (username, query string)
– Query Executor Bolt can be configured to use any subset of these queries
Prototype Solution
• Metrics provided by Storm UI:
– Emitted: number of tuples emitted
– Transferred: number of tuples transferred (emitted * number of follow-on bolts)
– Acked: number of tuples acknowledged
– Execute Latency: timestamp when the execute function ends minus timestamp when execute is passed the tuple
– Process Latency: timestamp when ack is called minus timestamp when execute is passed the tuple
– Capacity: % of the time in the last 10 minutes the bolt spent executing tuples
• Many metrics are sampled and don’t always indicate problems
• A good measurement is comparing the number of tuples transferred from the spout to the number of tuples acknowledged in the bolt
– If the transferred number grows increasingly higher than the number of acknowledged tuples, the topology is not keeping up with the rate of data
Prototype Solution – Monitoring Performance
• 8 workers, 1 spout, 8 Query Executor Bolts, 8 Result Bolts
• Article spout emitting as fast as possible
• Query execution at 1k docs or 60 seconds elapsed time
• Increased number of queries on each trial: 10k, 50k, 100k, 200k, 300k, 400k, 500k
Trial Runs – First Attempt
Results:
• Articles emitted too fast for the bolts to keep up
• If data continued to stream at this rate, the topology would back up and drop tuples
[Topology diagram: 8 nodes with 4 worker slots each; one Query Bolt and one Result Bolt per node; Article Spout on Node 1]
• 8 workers, 1 spout, 8 Query Executor Bolts, 8 Result Bolts
• Article spout now places articles on a queue in a background thread every 100 ms
• Everything else the same…
Trial Runs – Second Attempt
Results:
• Topology performing much better, keeping up with the data flow for query sizes of 10k, 50k, 100k, and 200k
• Slows down around 300k queries, approx. 37.5k queries/bolt
[Topology diagram: 8 nodes with 4 worker slots each; one Query Bolt and one Result Bolt per node; Article Spout on Node 1]
• Each node has 4 worker slots, so let’s scale up
• 16 workers, 1 spout, 16 Query Executor Bolts, 8 Result Bolts
• Everything else the same…
Trial Runs – Third Attempt
Results:
• 300k queries now keeping up, no problem
• 400k doing OK…
• 500k backing up a bit
[Topology diagram: 8 nodes with 4 worker slots each; two Query Bolts and one Result Bolt per node; Article Spout on Node 1]
• Next logical step: 32 workers, 1 spout, 32 Query Executor Bolts
• Didn’t result in the anticipated performance gain; 500k still too much
• Hypothesizing that 2-core VMs might not be enough to get full performance from 4 worker slots
Trial Runs – Fourth Attempt
[Topology diagram: 8 nodes with 4 worker slots each; four Query Bolts and one Result Bolt per node; Article Spout on Node 1]
• The most important factor affecting performance is the relationship between data rate and number of queries
• Ideal Storm configuration is dependent on hardware executing the topology
• Optimal configuration resulted in 250 queries per second per bolt, 4k queries per second across the topology
• High level of performance from relatively small cluster
Trial Runs – Conclusions
• Low barrier to entry working with Storm
• Easy conversion of Solr indices to Lucene indices
• Simple integration between Lucene and Storm; Solr more complicated
• Configuration is key, tune topology to your needs
• Overall strategy appears to scale well for our use case, limited only by hardware
Conclusions
• Adjust the batch size on the query executor bolt
• Combine duplicate queries (between users) if your system has many duplicates
• Investigate additional optimizations during Solr-to-Lucene conversion
• Run topology with more complex queries (fielded, filtered, etc.)
• Investigate handling of bolt failure
• If the ratio of incoming data to queries were reversed, consider switching the groupings between the spouts and executor bolts
Future Considerations
Questions?
• Storm has moved to a top-level Apache project
– https://storm.incubator.apache.org/
– Released 0.9.1, 0.9.2, 0.9.3-rc1
– Newer releases resolve the classpath issue with EmbeddedSolrServer
– Improved Netty transport, new topology visualization
Updates Since Solr Lucene Revolution 2013
Source: http://storm.incubator.apache.org/2014/06/25/storm092-released.html
• Launch Storm clusters on Amazon Web Services
– storm-deploy - https://github.com/nathanmarz/storm-deploy
• Created before Storm moved to Apache, limited activity
• install-0.9.1 branch has updates to pull Storm from the Apache repo
• lein deploy-storm --start --name mycluster --branch master --commit v0.9.2-incubating
• Always launches m1.small - https://github.com/nathanmarz/storm-deploy/issues/67
– storm-deploy-alternative - https://github.com/KasperMadsen/storm-deploy-alternative
• Java alternative to storm-deploy
• Latest Apache Storm releases not supported yet, works with 0.8.2 and 0.9.0
– wirbelsturm - https://github.com/miguno/wirbelsturm
• Based on Vagrant and Puppet
• http://www.michael-noll.com/blog/2014/03/17/wirbelsturm-one-click-deploy-storm-kafka-clusters-with-vagrant-puppet/
• Steeper learning curve to get going
How can we test our topology at various scales with minimal setup?
• Make the topology independent of the Storm cluster
– Previous spout required data to be on the server where the spout is running
• Better approach - poll an external source for data (Redis, Kafka, etc.); a hypothetical Redis spout is sketched after the diagram below
– Previous executor bolt loaded queries from a database
• Better approach - package a file of queries into the topology jar
– Previous executor bolt expected a Solr config directory on the server
• Better approach - package the config into the topology jar, extract it from the classpath to disk on start-up
How can we test our topology at various scales with minimal setup?
[Diagram: Storm Cluster with a Redis Spout feeding an Executor Bolt (queries and SOLR_HOME packaged with the topology) and a Result Bolt]
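A hedged sketch of such a Redis-backed spout using the Jedis client; the host, port, and "articles" list key are assumptions. Polling an external queue means the topology no longer cares which server the spout lands on.

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import java.util.Map;
import redis.clients.jedis.Jedis;

public class RedisArticleSpout extends BaseRichSpout {
    private transient Jedis jedis;            // created in open(): spouts get serialized
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.jedis = new Jedis("redis-host", 6379);  // assumed host/port
    }

    public void nextTuple() {
        String article = jedis.lpop("articles");     // assumed queue key
        if (article == null) {
            // back off briefly when the queue is empty
            try { Thread.sleep(50); } catch (InterruptedException ignore) {}
        } else {
            collector.emit(new Values(article));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("article"));
    }
}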
• Presentation by Flax at Solr Lucene Revolution 2013 in Dublin
– Turning Search Upside Down: Using Lucene for Very Fast Stored Queries
– https://www.youtube.com/watch?v=rmRCsrJp2A8&list=UUKuRrzEQYP8pfCgCN8il4gQ
• Open sourced by Flax shortly after
– https://github.com/flaxsearch/luwak
• True inverted search solution
– Index queries
– Turn an incoming document into a query
– Determine which queries match that document
• Easy to integrate into existing Storm solution
• Clean API and documentation
Luwak
Monitor monitor = new Monitor(new LuceneQueryParser("field"), new TermFilteredPresearcher());

MonitorQuery mq = new MonitorQuery("query1", "field:text");
monitor.update(mq);

InputDocument doc = InputDocument.builder("doc1")
    .addField(textfield, document, new StandardTokenizer(Version.LUCENE_50))
    .build();

SimpleMatcher matches = monitor.match(doc, SimpleMatcher.FACTORY);
• How fast can we process all 92k articles with varying query sizes?
• Performance comparison outside of Storm, single-threaded Java process
• Solr & Lucene solutions batch docs
– Allow 1,000 docs to be added to the in-memory index
– Execute all queries, clear, start over
• Luwak evaluates one document at a time against the indexed queries
Performance Comparison
• Conclusion
– Storm = scalable stream processing framework
– Luwak = high-performance inverted search solution
– Luwak + Storm = scalable, high-performance inverted search solution!
• Contact Info
– [email protected] / Twitter @bbende
– [email protected] / Twitter @jmconlin
• Thanks for having us!
Wrap-Up