Real-Time Inverted Search - NYC Apache Solr/Lucene User Group, October 2014
DESCRIPTION
Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr’s full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.
TRANSCRIPT
REAL-TIME INVERTED SEARCH IN THE CLOUD USING LUCENE AND STORM
Joshua Conlin, Bryan Bende, James [email protected]
Problem Statement
Storm
Methodology
Results
Table of Contents
Booz Allen Hamilton
– Large consulting firm supporting many industries
• Healthcare, Finance, Energy, Defense
– Strategic Innovation Group
• Focus on innovative solutions that can be applied across industries
• Major focus on data science, big data, & information retrieval
• Multiple clients utilizing Solr for implementing search capabilities
• Explore Data Science
• Self-paced data science training, launching TODAY!
• https://exploredatascience.com
Who are we?
Client Applications & Architecture
[Architecture diagram: Ingest → SolrCloud → Web App]
Typical client applications allow users to:
• Query document index using Lucene syntax
• Filter and facet results
• Save queries for future use
How do we instantly notify users of new documents that match their
saved queries?
Constraints:
• Process documents in real-time, notify as soon as possible
• Scale with the number of saved queries (starting with tens of thousands)
• Result set of notifications must match saved queries
• Must not impact performance of the web application
• Data arrives at varying speeds and varying sizes
Problem Statement
1. Fork ingest to a second Solr instance, run stored queries periodically
– Pros: Easy to set up, works for a small amount of data & a small number of queries
– Cons: Bound by time to execute all queries
2. Same secondary Solr instance, but distribute queries to multiple servers
– Pros: Reduces query processing time by dividing across several servers
– Cons: Now writing custom code to distribute queries; possible synchronization issues ensuring each server executes queries against the same data
3. Give each server its own Solr instance and subset of queries
– Pros: Very scalable, only bound by number of servers
– Cons: Difficult to maintain, still writing custom code to distribute data and queries
Possible Solutions
Is there a way we can set up this system so that it’s:
• easy to maintain,
• easy to scale, and
• easy to synchronize?
Possible Solutions
• Integrate Solr and/or Lucene with a stream processing framework
• Process data in real-time; leverage a proven framework for distributed stream processing
Candidate Solution
[Architecture diagram: Ingest → SolrCloud → Web App, with Storm consuming the ingest stream and producing Notifications]
• Storm is an open source stream processing framework.
• It’s a scalable platform that lets you distribute processes across a cluster quickly
and easily.
• You can add more resources to your cluster and easily utilize those resources in
your processing.
Storm - Overview
• Nimbus – the control node for the cluster; distributes the topology through the cluster
• Supervisor – one on each machine in the cluster; controls the allocation of worker assignments on its machine
• Worker – JVM process for running topology components
Storm - Components
[Cluster diagram: one Nimbus node; three Supervisors, each managing four Workers]
• Topology – defines a running process, which includes all of the processes to be run, the connections between those processes, and their configuration
• Stream – the flow of data through a topology; it is an unbounded collection of tuples that is passed from process to process
• Storm has 2 types of processing units:
– Spout – the start of a stream; it can be thought of as the source of the data; that data can be read in however the spout wants—from a database, from a message queue, etc.
– Bolt – the primary processing unit for a topology; it accepts any number of streams, does whatever processing you’ve set it to do, and outputs any number of streams based on how you configure it (a minimal spout and bolt are sketched below)
Storm – Core Concepts
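To make spouts and bolts concrete, here is a minimal sketch against the backtype.storm API of the Storm 0.8/0.9 era this talk targets; the class names, the in-memory queue, and the "article" field are illustrative assumptions, not the presenters' code.

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// A spout is the source of a stream: Storm calls nextTuple() repeatedly
// and the spout emits tuples into the topology.
public class ArticleSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Queue<String> pending = new ConcurrentLinkedQueue<String>();

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void nextTuple() {
        String article = pending.poll();   // hypothetical in-memory source
        if (article != null) {
            collector.emit(new Values(article));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("article"));
    }
}

// A bolt consumes tuples, processes them, and may emit new tuples.
class PrintBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        System.out.println("received: " + tuple.getStringByField("article"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: emits nothing
    }
}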
• Stream Groupings – defines how topology processing units (spouts and bolts) are connected to each other; some common groupings are:
– All Grouping – stream is sent to all bolts
– Shuffle Grouping – stream is evenly distributed across bolts
– Fields Grouping – sends tuples that match on the designated “field” to the same bolt (wiring these groupings together is sketched below)
Storm – Core Concepts (continued)
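A hedged sketch of wiring a topology with these groupings, assuming the hypothetical ArticleSpout above and ExecutorBolt/NotificationBolt classes like those described later; component names, parallelism hints, and worker count are illustrative.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class NotificationTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("article-spout", new ArticleSpout(), 1);
        // All Grouping: every ExecutorBolt instance receives every document
        builder.setBolt("executor-bolt", new ExecutorBolt(), 8)
               .allGrouping("article-spout");
        // Shuffle Grouping: match results spread evenly over NotificationBolts
        builder.setBolt("notification-bolt", new NotificationBolt(), 8)
               .shuffleGrouping("executor-bolt");

        Config conf = new Config();
        conf.setNumWorkers(8);
        StormSubmitter.submitTopology("inverted-search", conf, builder.createTopology());
    }
}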
Storm - Parallelism
Source: http://storm.incubator.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
How can we use this framework to solve our problem?
How to Utilize Storm
Let Storm distribute the data and queries among processing nodes
…but we would still need to manage a Solr instance on each VM, and we would still need to ensure synchronization between query-processing bolts running on the same VM.
What if instead of having a Solr installation on each machine we ran
Solr in memory inside each of the processing bolts?
• Use a Storm spout to distribute new documents
• Use a Storm bolt to execute queries against an EmbeddedSolrServer with a RAMDirectory (a sketch of the in-memory setup follows the diagram below):
– Incoming documents added to the index
– Queries executed
– Documents removed from the index
• Use Storm bolt to process query results
How to Utilize Storm
[Diagram: Storm bolt containing an EmbeddedSolrServer backed by a RAMDirectory]
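As a rough illustration, a bolt might stand up an in-memory Solr 4.x core like this; the solr-home path and core name are assumptions, and the packaged solrconfig.xml is assumed to declare solr.RAMDirectoryFactory so the index never touches disk.

import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

public class EmbeddedSolrFactory {
    public static EmbeddedSolrServer create(String solrHome) {
        // solrconfig.xml under solrHome is assumed to contain:
        //   <directoryFactory class="solr.RAMDirectoryFactory"/>
        CoreContainer container = new CoreContainer(solrHome);
        container.load();
        return new EmbeddedSolrServer(container, "collection1");
    }
}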
This has several advantages:
• It removes the need to maintain a Solr instance on each VM.
• It’s easier to scale and more flexible; it doesn’t matter which Supervisor the bolts
get sent to, all the processing is self-contained.
• It removes the need to synchronize processing between bolts.
• Documents are volatile; we run existing queries over new data rather than new queries over stored data
Advantages
Execution Topology
[Topology diagram: three Data Spouts and one Query Spout feeding five Executor Bolts, which feed a Notification Bolt]
Data Spout – Receives incoming data files and sends them to every Executor Bolt (All Grouping)
Query Spout – Coordinates updates to queries
Executor Bolt – Loads and executes queries
Notification Bolt – Generates notifications based on results (Shuffle Grouping)
1. Queries are loaded into memory
2. Incoming documents are added to the Lucene index
3. Documents are processed when one of the following conditions is met:
a) The number of documents exceeds the max batch size
b) The time since the last execution is longer than the max interval
4. Matching queries and document UIDs are emitted
5. All documents are removed from the index
(A sketch of this batching logic follows the diagram below.)
Executor Bolt
[Diagram: Executor Bolt holding its Query List and batched Documents; numbered steps 1–4 culminate in emit()]
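A hedged sketch of this Executor Bolt logic against the same backtype.storm API; the thresholds mirror the trials below (1k docs or 60 seconds), while the index and query helpers are placeholder stubs, not the presenters' implementation. Note that execute() only runs when a tuple arrives, so the time condition is only checked on arrival; a tick tuple or timer would be needed to flush an idle index.

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class ExecutorBolt extends BaseRichBolt {
    private static final int MAX_BATCH_SIZE = 1000;     // flush after this many docs
    private static final long MAX_INTERVAL_MS = 60000;  // or after this much time

    private OutputCollector collector;
    private long lastExecution;
    private int pendingDocs;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.lastExecution = System.currentTimeMillis();
        loadQueries();                // 1. load saved queries into memory
    }

    public void execute(Tuple tuple) {
        addToIndex(tuple);            // 2. add incoming doc to the in-memory index
        pendingDocs++;

        long now = System.currentTimeMillis();
        if (pendingDocs >= MAX_BATCH_SIZE || now - lastExecution >= MAX_INTERVAL_MS) {
            // 3/4. run every saved query, emit (queryId, docUid) pairs
            for (String[] match : runAllQueries()) {
                collector.emit(new Values(match[0], match[1]));
            }
            clearIndex();             // 5. documents are volatile: drop after matching
            pendingDocs = 0;
            lastExecution = now;
        }
        collector.ack(tuple);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("queryId", "docUid"));
    }

    // Placeholder hooks standing in for the in-memory Lucene index and stored queries.
    private void loadQueries() {}
    private void addToIndex(Tuple tuple) {}
    private List<String[]> runAllQueries() { return Collections.emptyList(); }
    private void clearIndex() {}
}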
• Attempted to run Solr with an in-memory index inside a Storm bolt
• Solr 4.5 requires:
– http-client 4.2.3
– http-core 4.2.2
• Storm 0.8.2 & 0.9.0 require:
– http-client 4.1.1
– http-core 4.1
• Could exclude the libraries from the super jar and rely on storm/lib, but Solr expects SystemDefaultHttpClient from 4.2.3
• Could build Storm with newer versions of the libraries, but that’s not guaranteed to work
Solr In-Memory Processing Bolt Issues
Advantages:
• Fast, lightweight
• No dependency conflicts
• RAMDirectory backed
• Easy Solr-to-Lucene document conversion
• Solr schema based
Lucene In-Memory Processing Bolt
[Diagram: Storm bolt containing a Lucene index backed by a RAMDirectory]
1. Initialization
– Parse common Solr schema
– Replace Solr classes
2. Add Documents
– Convert SolrInputDocument to Lucene Document
– Add to index
Lucene In-Memory Processing Bolt
public void addDocument(SolrInputDocument doc) throws Exception {
    if (doc != null) {
        // convert the Solr document to a Lucene document and index it
        Document luceneDoc = solrDocumentConverter.convert(doc);
        indexWriter.addDocument(luceneDoc);
        indexWriter.commit();
    }
}

public Document convert(SolrInputDocument solrDocument) throws Exception {
    return DocumentBuilder.toDocument(solrDocument, indexSchema);
}
Read/parse/update the Solr schema file using StAX
Create an IndexSchema from the new Solr schema data (one way this might look with the Solr 4.x API is sketched below)
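For illustration, constructing an IndexSchema from a packaged Solr config directory might look like this under the Solr 4.x API; the directory path and file names are assumptions. The resulting schema is what DocumentBuilder.toDocument() above consumes.

import java.io.FileInputStream;
import org.apache.solr.core.SolrConfig;
import org.apache.solr.core.SolrResourceLoader;
import org.apache.solr.schema.IndexSchema;
import org.xml.sax.InputSource;

public class SchemaLoader {
    public static IndexSchema load(String confDir) throws Exception {
        // Read solrconfig.xml and schema.xml from the packaged config directory
        SolrResourceLoader loader = new SolrResourceLoader(confDir);
        SolrConfig solrConfig = new SolrConfig(loader, "solrconfig.xml", null);
        InputSource schemaSource = new InputSource(new FileInputStream(confDir + "/schema.xml"));
        return new IndexSchema(solrConfig, "schema.xml", schemaSource);
    }
}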
• Infrastructure:
– 8-node cluster on Amazon EC2
– Each VM has 2 cores and 8 GB of memory
• Data:
– 92,000 news article summaries
– Average file size: ~1 KB
• Queries:
– Generated 1 million sample queries
– Randomly selected terms from the document set
– Stored in MariaDB (username, query string)
– Query Executor Bolt can be configured to use any subset of these queries
Prototype Solution
• Metrics provided by Storm UI:
– Emitted: number of tuples emitted
– Transferred: number of tuples transferred (emitted * number of follow-on bolts)
– Acked: number of tuples acknowledged
– Execute Latency: timestamp when the execute function ends minus timestamp when execute is passed the tuple
– Process Latency: timestamp when ack is called minus timestamp when execute is passed the tuple
– Capacity: % of the time in the last 10 minutes the bolt spent executing tuples
• Many metrics are sampled and don’t always indicate problems
• A good measurement is comparing the number of tuples transferred from the spout to the number of tuples acknowledged in the bolt
– If the transferred number grows increasingly higher than the number of acknowledged tuples, the topology is not keeping up with the rate of data
Prototype Solution – Monitoring Performance
• 8 workers, 1 spout, 8 Query Executor Bolts, 8 Result Bolts
• Article spout emitting as fast as possible
• Query execution at 1k docs or 60 seconds elapsed time
• Increased number of queries on each trial: 10k, 50k, 100k, 200k, 300k, 400k, 500k
Trial Runs – First Attempt
Results:
• Articles emitted too fast for the bolts to keep up
• If data continued to stream at this rate, the topology would back up and drop tuples
[Topology diagram: 8 nodes with 4 worker slots each; one Query Bolt and one Result Bolt per node; Article Spout on Node 1]
• 8 workers, 1 spout, 8 Query Executor Bolts, 8 Result Bolts
• Article spout now places articles on a queue in a background thread every 100 ms
• Everything else the same…
Trial Runs – Second Attempt
Results:
• Topology performing much better, keeping up with the data flow for query sizes of 10k, 50k, 100k, and 200k
• Slows down around 300k queries, approx. 37.5k queries/bolt
[Topology diagram: 8 nodes with 4 worker slots each; one Query Bolt and one Result Bolt per node; Article Spout on Node 1]
• Each node has 4 worker slots, so let’s scale up
• 16 workers, 1 spout, 16 Query Executor Bolts, 8 Result Bolts
• Everything else the same…
Trial Runs – Third Attempt
Results:
• 300k queries now keeping up, no problem
• 400k doing OK…
• 500k backing up a bit
[Topology diagram: 8 nodes with 4 worker slots each; two Query Bolts and one Result Bolt per node; Article Spout on Node 1]
• Next logical step: 32 workers, 1 spout, 32 Query Executor Bolts
• Didn’t result in the anticipated performance gain; 500k still too much
• Hypothesizing that 2-core VMs might not be enough to get full performance from 4 worker slots
Trial Runs – Fourth Attempt
[Topology diagram: 8 nodes with 4 worker slots each; four Query Bolts and one Result Bolt per node; Article Spout on Node 1]
• The most important factor affecting performance is the relationship between data rate and number of queries
• Ideal Storm configuration is dependent on hardware executing the topology
• Optimal configuration resulted in 250 queries per second per bolt, 4k queries per second across the topology
• High level of performance from relatively small cluster
Trial Runs – Conclusions
• Low barrier to entry working with Storm
• Easy conversion of Solr indices to Lucene indices
• Simple integration between Lucene and Storm; Solr more complicated
• Configuration is key, tune topology to your needs
• Overall strategy appears to scale well for our use case, limited only by hardware
Conclusions
• Adjust the batch size on the query executor bolt
• Combine duplicate queries (between users) if your system has many duplicates
• Investigate additional optimizations during Solr-to-Lucene conversion
• Run topology with more complex queries (fielded, filtered, etc.)
• Investigate handling of bolt failure
• If the ratio of incoming data to queries were reversed, consider switching the groupings between the spouts and executor bolts
Future Considerations
Questions?
• Storm has moved to a top-level Apache project
– https://storm.incubator.apache.org/
– Released 0.9.1, 0.9.2, 0.9.3-rc1
– Newer releases resolve the classpath issue with EmbeddedSolrServer
– Improved Netty transport, new topology visualization
Updates Since Solr Lucene Revolution 2013
Source: http://storm.incubator.apache.org/2014/06/25/storm092-released.html
• Launch Storm clusters on Amazon Web Services
– storm-deploy - https://github.com/nathanmarz/storm-deploy
• Created before Storm moved to Apache, limited activity
• install-0.9.1 branch has updates to pull Storm from the Apache repo
• lein deploy-storm --start --name mycluster --branch master --commit v0.9.2-incubating
• Always launches m1.small - https://github.com/nathanmarz/storm-deploy/issues/67
– storm-deploy-alternative - https://github.com/KasperMadsen/storm-deploy-alternative
• Java alternative to storm-deploy
• Latest Apache Storm releases not supported yet, works with 0.8.2 and 0.9.0
– wirbelsturm - https://github.com/miguno/wirbelsturm
• Based on Vagrant and Puppet
• http://www.michael-noll.com/blog/2014/03/17/wirbelsturm-one-click-deploy-storm-kafka-clusters-with-vagrant-puppet/
• Steeper learning curve to get going
How can we test our topology at various scales with minimal setup?
• Make the topology independent of the Storm cluster
– Previous spout required data to be on the server where the spout is running
• Better approach - poll an external source for data (Redis, Kafka, etc.); a hypothetical Redis spout is sketched after the diagram below
– Previous executor bolt loaded queries from a database
• Better approach - package a file of queries into the topology jar
– Previous executor bolt expected a Solr config directory on the server
• Better approach - package the config into the topology jar, extract it from the classpath to disk on start-up
How can we test our topology at various scales with minimal setup?
[Diagram: Storm Cluster with a Redis Spout feeding an Executor Bolt (queries and SOLR_HOME packaged with the topology) and a Result Bolt]
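A hedged sketch of such a Redis-backed spout using the Jedis client; the host, port, and "articles" list key are assumptions. Polling an external queue means the topology no longer cares which server the spout lands on.

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import java.util.Map;
import redis.clients.jedis.Jedis;

public class RedisArticleSpout extends BaseRichSpout {
    private transient Jedis jedis;            // created in open(): spouts get serialized
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.jedis = new Jedis("redis-host", 6379);  // assumed host/port
    }

    public void nextTuple() {
        String article = jedis.lpop("articles");     // assumed queue key
        if (article == null) {
            // back off briefly when the queue is empty
            try { Thread.sleep(50); } catch (InterruptedException ignore) {}
        } else {
            collector.emit(new Values(article));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("article"));
    }
}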
• Presentation by Flax at Solr Lucene Revolution 2013 in Dublin
– Turning Search Upside Down: Using Lucene for Very Fast Stored Queries
– https://www.youtube.com/watch?v=rmRCsrJp2A8&list=UUKuRrzEQYP8pfCgCN8il4gQ
• Open sourced by Flax shortly after
– https://github.com/flaxsearch/luwak
• True inverted search solution
– Index queries
– Turn an incoming document into a query
– Determine which queries match that document
• Easy to integrate into existing Storm solution
• Clean API and documentation
Luwak
Monitor monitor = new Monitor(new LuceneQueryParser("field"), new TermFilteredPresearcher());

MonitorQuery mq = new MonitorQuery("query1", "field:text");
monitor.update(mq);

InputDocument doc = InputDocument.builder("doc1")
    .addField(textfield, document, new StandardTokenizer(Version.LUCENE_50))
    .build();

SimpleMatcher matches = monitor.match(doc, SimpleMatcher.FACTORY);
• How fast can we process all 92k articles with varying query sizes?
• Performance comparison outside of Storm, single-threaded Java process
• Solr & Lucene solutions batch docs
– Allow 1,000 docs to be added to the in-memory index
– Execute all queries, clear, start over
• Luwak evaluates one document at a time against the indexed queries
Performance Comparison
• Conclusion
– Storm = scalable stream processing framework
– Luwak = high-performance inverted search solution
– Luwak + Storm = scalable, high-performance inverted search solution!
• Contact Info
– [email protected] / Twitter @bbende
– [email protected] / Twitter @jmconlin
• Thanks for having us!
Wrap-Up