Scaling up Solr 4.1 to Power Big Search in Social Media Analytics

Scaling Solr 4 to Power Big Search in Social Media Analytics

Timothy Potter, Architect, Big Data Analytics, Dachis Group / Co-author of Solr in Action

Upload: lucenerevolution

Post on 17-Dec-2014






Presented by Timothy Potter, Architect, Big Data Analytics, Dachis Group

My presentation focuses on how we implemented Solr 4.1 to be the cornerstone of our social marketing analytics platform. Our platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes. Combined with our Hadoop cluster, we have achieved throughput rates greater than 8,000 documents per second. Our index currently contains more than 500,000,000 documents and is growing by 3 to 4 million documents per day.

The presentation will include details about:

• Designing a SolrCloud cluster for scalability and high availability using sharding and replication with Zookeeper

• Operations concerns like how to handle a failed node and monitoring

• How we deal with indexing big data from Pig/Hadoop as an example of using the CloudSolrServer in SolrJ and managing searchers for high indexing throughput

• Example uses of key features like real-time gets, atomic updates, custom hashing, and distributed facets

Attendees will come away from this presentation with a real-world use case that proves Solr 4.1 is scalable, stable, and production ready. (Note: we are in production on 18 nodes in EC2 with a recent nightly build off the branch_4x.)


Page 1: Scaling up solr 4.1 to power big search in social media analytics

Scaling Solr 4 to Power Big Search in Social Media Analytics


Timothy Potter Architect, Big Data Analytics, Dachis Group / Co-author Solr In Action

Page 2: Scaling up solr 4.1 to power big search in social media analytics

® 2011 Dachis Group.

• Anyone running SolrCloud in production today?

• Who is running pre-Solr 4 version in


• Who has fired up Solr 4.x in SolrCloud


• Personal interest – who has purchased Solr in Action in MEAP?

Audience poll

Page 3: Scaling up solr 4.1 to power big search in social media analytics


• Gain insights into the key design decisions you need to make when using SolrCloud

Wish I knew back then ...

• Solr 4 feature overview in context

• Zookeeper

• Distributed indexing

• Distributed search

• Real-time GET

• Atomic updates

• A day in the life ...

• Day-to-day operations

• What happens if you lose a node?

Goals of this talk

Page 4: Scaling up solr 4.1 to power big search in social media analytics


Our business intelligence platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes.

About Dachis Group

Page 5: Scaling up solr 4.1 to power big search in social media analytics


Page 6: Scaling up solr 4.1 to power big search in social media analytics


• In production on 4.2.0

• 18 shards ~ 33M docs / shard, 25GB on disk per shard

• Multiple collections

• ~620 Million docs in main collection (still growing)

• ~100 Million docs in 30-day collection

• Inherent Parent / Child relationships (tweet and re-tweets)

• ~5M atomic updates to existing docs per day

• Batch-oriented updates

• Docs come in bursts from Hadoop; 8,000 docs/sec

• 3-4M new documents per day (deletes too)

• Business Intelligence UI, low(ish) query volume

Solution Highlights
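As a rough sanity check on these figures, the per-shard and cluster-wide numbers can be derived from the totals on the slide. The sketch below only does arithmetic on the quoted values; it assumes documents and indexing load are spread evenly across shards, and it counts one copy of each shard (the replication factor is not stated on the slide).

```python
# Figures quoted on the slide.
total_docs = 620_000_000      # main collection
num_shards = 18
gb_per_shard = 25             # on-disk size of one copy of a shard
peak_docs_per_sec = 8_000     # burst rate from Hadoop

# Assumption: documents and indexing load are spread evenly over the shards.
docs_per_shard = total_docs / num_shards
index_gb_one_copy = num_shards * gb_per_shard
docs_per_sec_per_shard = peak_docs_per_sec / num_shards

print(f"{docs_per_shard / 1e6:.1f}M docs per shard")
print(f"{index_gb_one_copy} GB for one copy of every shard")
print(f"{docs_per_sec_per_shard:.0f} docs/sec per shard leader at peak")
```

The result is slightly above the ~33M docs per shard quoted on the slide, which is consistent with the main collection still growing.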

Page 7: Scaling up solr 4.1 to power big search in social media analytics


• Scalability

Scale-out: sharding and replication

A little scale-up too: Fast disks (SSD), lots of RAM!

• High-availability

Redundancy: multiple replicas per shard

Automated fail-over: automated leader election

• Consistency

Distributed queries must return consistent results

Accepted writes must be on durable storage

• Simplicity - wip

Self-healing, easy to set up and maintain, able to troubleshoot

• Elasticity - wip

Add more replicas per shard at any time

Split large shards into two smaller ones

Pillars of my ideal search solution

Page 8: Scaling up solr 4.1 to power big search in social media analytics


Nuts and Bolts

Nice tag cloud!

Page 9: Scaling up solr 4.1 to power big search in social media analytics


1. Zookeeper needs at least 3 nodes to establish quorum with fault tolerance. Embedded is only for evaluation purposes; you need to deploy a stand-alone ensemble for production

2. Every Solr core creates ephemeral “znodes” in Zookeeper which automatically disappear if the Solr process crashes

3. Zookeeper pushes notifications to all registered “watchers” when a znode changes; Solr caches cluster state

4. Zookeeper provides “recipes” for solving common problems faced when building distributed systems, e.g. leader election

5. Zookeeper provides centralized configuration distribution, leader election, and cluster state notifications

Zookeeper in a nutshell
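The "at least 3 nodes" rule follows from majority quorum: an ensemble stays available only while more than half of its nodes are up. A minimal sketch of that arithmetic (the function names are made up for illustration):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 5):
    print(n, "nodes: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

A 2-node ensemble tolerates no failures, so 3 is the smallest size that survives the loss of a node, and odd sizes give the most fault tolerance per node.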

Page 10: Scaling up solr 4.1 to power big search in social media analytics


• Number and size of indexed fields

• Number of documents

• Update frequency

• Query complexity

• Expected growth

• Budget

Number of shards?

Yay for shard splitting in 4.3 (SOLR-3755)!

Page 11: Scaling up solr 4.1 to power big search in social media analytics


We use Uwe Schindler’s advice on 64-bit Linux:

<directoryFactory name="DirectoryFactory"



java -Xmx4g ...

(hint: rest of our RAM goes to the OS to load index in memory mapped I/O)

Small cache sizes with aggressive eviction – spread GC penalty out over time vs. all at once every time you open a new searcher

<filterCache class="solr.LFUCache" size="50"
             initialSize="50" autowarmCount="25"/>

Index Memory Management

Page 12: Scaling up solr 4.1 to power big search in social media analytics


• Not a master

• Leader is a replica (handles queries)

• Accepts update requests for the shard

• Increments the _version_ on the new or updated doc

• Sends updates (in parallel) to all


Leader = Replica + Additional Work

Page 13: Scaling up solr 4.1 to power big search in social media analytics


Don’t let your tlogs get too big – use “hard” commits with openSearcher=“false”

Distributed Indexing

[Diagram labels: View of cluster state from Zk; CloudSolrServer “smart client”; Get URLs of current leaders?; Hash on docID; Set the _version_; tlog on each node; Node 1 and Node 2 hosting Shard 1 Leader, Shard 1 Replica, Shard 2 Leader, Shard 2 Replica (2 shards with 1 replica each)]

8,000 docs / sec to 18 shards
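The "hash on docID" step can be sketched as follows. This is only an illustration: Solr's own router uses MurmurHash3 and per-shard hash ranges kept in the cluster state, whereas the sketch substitutes `zlib.crc32` and equal-width ranges so that it runs with the Python standard library alone. The names `shard_for` and `group_by_shard` are hypothetical.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard index by slicing the 32-bit hash space
    into num_shards equal ranges. crc32 is a stand-in for Solr's MurmurHash3."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # unsigned 32-bit value in Python 3
    return h * num_shards // 2**32


def group_by_shard(doc_ids, num_shards: int) -> dict:
    """Group a batch so each sub-batch can go straight to one shard leader."""
    batches = defaultdict(list)
    for doc_id in doc_ids:
        batches[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(batches)


batch = [f"tweet-{i}" for i in range(1000)]
batches = group_by_shard(batch, num_shards=18)
print({shard: len(ids) for shard, ids in sorted(batches.items())})
```

Because the mapping depends only on the ID, every update to the same document lands on the same shard, which is what lets the leader assign a consistent _version_.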

Page 14: Scaling up solr 4.1 to power big search in social media analytics


Send query request to any node

Two-stage process

1. Query controller sends query to all shards and merges results

One host per shard must be online or queries fail

2. Query controller sends 2nd query to all shards with documents in the merged result set to get requested fields

Solr client applications built for 3.x do not need to change (our query code still uses SolrJ 3.6)

JOINs / Grouping need custom hashing

Distributed search

[Diagram labels: View of cluster state from Zk; Get URLs of all live nodes; Query controller (or just a load balancer works too); Node 1 and Node 2 hosting Shard 1 Leader, Shard 1 Replica, Shard 2 Leader, Shard 2 Replica; get fields]
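The two-stage process can be illustrated with a small in-memory simulation. This is a sketch, not Solr's implementation: the shards are plain dictionaries, scores are precomputed, and the names `stage_one` and `distributed_search` are hypothetical.

```python
import heapq


def stage_one(shard_docs, rows):
    """Stage 1 on one shard: return only (score, id) for its best `rows` docs."""
    return heapq.nlargest(
        rows, ((doc["score"], doc_id) for doc_id, doc in shard_docs.items())
    )


def distributed_search(shards, rows):
    """Query controller: merge per-shard top hits, then fetch stored fields
    only from the shards that own a document in the merged result set."""
    candidates = []
    for shard_name, shard_docs in shards.items():
        for score, doc_id in stage_one(shard_docs, rows):
            candidates.append((score, doc_id, shard_name))
    top = heapq.nlargest(rows, candidates)

    results = []
    for score, doc_id, shard_name in top:  # stage 2: get requested fields
        fields = shards[shard_name][doc_id]["fields"]
        results.append({"id": doc_id, "score": score, **fields})
    return results


SHARDS = {
    "shard1": {"t1": {"score": 0.9, "fields": {"text": "first"}},
               "t2": {"score": 0.2, "fields": {"text": "second"}}},
    "shard2": {"t3": {"score": 0.7, "fields": {"text": "third"}},
               "t4": {"score": 0.5, "fields": {"text": "fourth"}}},
}
results = distributed_search(SHARDS, rows=2)
print([r["id"] for r in results])
```

Splitting the work this way keeps stage 1 cheap, since only IDs and scores cross the network, and stored fields are fetched for the final result set only.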

Page 15: Scaling up solr 4.1 to power big search in social media analytics


Search by daily activity volume

Drive analysis that measures the impact of a social message over time ... Company posts a tweet on Monday, how much activity around that message on Thursday?

Page 16: Scaling up solr 4.1 to power big search in social media analytics


Problem: Find all documents that had activity on a specific day

• tweets that had retweets or YouTube videos that had comments

• Use Solr join support to find parent documents by matching on child criteria

fq=_val_:"{!join from=echo_grouping_id_s to=id}day_tdt:[2013-05-01T00:00:00Z TO 2013-05-02T00:00:00Z}" ...

... But joins don’t work in distributed queries and are probably too slow anyway

Solution: Index daily activity into multi-valued fields. Use real-time GET to look up document by ID to get the current daily volume fields



daily_volume_tdtm: [2013-05-01, 2013-05-02] <= doc has child signals on May 1 and 2

daily_volume_ssm: 2013-05-01|99, 2013-05-02|88 <= stored only field, doc had 99 child signals on May 1, 88 on May 2

daily_volume_s: 13050288|13050199 <= flattened multi-valued field for sorting using a custom ValueSource

Atomic updates and real-time get
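A minimal sketch of what such an update and lookup might look like on the wire, using the field names from the slide. The base URL, collection name, and helper names are assumptions made for illustration; the `add` modifier and the `/get` handler are standard Solr 4 features. Nothing is sent over the network here; the code only builds the request pieces.

```python
import json
from urllib.parse import urlencode


def atomic_add_day(doc_id: str, day_iso: str, count: int) -> dict:
    """Build an atomic update that appends one day of activity to the
    multi-valued fields instead of re-sending the whole document."""
    return {
        "id": doc_id,
        "daily_volume_tdtm": {"add": day_iso},
        "daily_volume_ssm": {"add": f"{day_iso[:10]}|{count}"},
    }


def realtime_get_url(base_url: str, doc_id: str) -> str:
    """URL for a real-time GET, which returns the latest version of a doc
    by ID even if it has not been committed to a searcher yet."""
    return f"{base_url}/get?{urlencode({'id': doc_id})}"


# Hypothetical collection URL and document ID.
base = "http://localhost:8983/solr/signals"
payload = atomic_add_day("tweet-1", "2013-05-01T00:00:00Z", 99)
print(realtime_get_url(base, "tweet-1"))
print(json.dumps([payload]))  # body that would be POSTed to {base}/update
```

Atomic updates rebuild the document from its stored fields, which is why the later advice to store all fields matters.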

Page 17: Scaling up solr 4.1 to power big search in social media analytics


Will it work? Definitely!

Search can be addicting to your organization; the queries we tested for 6 months ago vs. what we have today are vastly different


Buy RAM – OOMs and aggressive garbage collection cause many issues

Give RAM from ^ to the OS – MMapDirectory

Need a disaster recovery process in addition to SolrCloud replication; helps with migrating to new hardware too

Use Jetty ;-)

Store all fields! Atomic updates are a life saver

Lessons learned

Page 18: Scaling up solr 4.1 to power big search in social media analytics


Schema will evolve – we thought we understood our data model but have since added at least 10 new fields and deprecated some too

Partition if you can! e.g. 30-day collection

We don't optimize – segment merging works great

Size your staging environment so that shards have about as many docs and the same resources as prod. I have many more nodes in prod but my staging servers have roughly the same number of docs per shard, just fewer shards.

Don’t be afraid to customize Solr! It’s designed to be customized with plug-ins

• ValueSource is very powerful

• Check out PostFilters:

{!frange l=1 u=1 cost=200 cache=false}imca(53313,employee)

Lessons learned cont.

Page 19: Scaling up solr 4.1 to power big search in social media analytics


• Backups

• Monitoring

Replicas serving queries?

All replicas report same number of docs?

Zookeeper health

New searcher warm-up time

• Configuration update process

Our solrconfig.xml changes frequently – see Solr’s

• Upgrade Solr process (it’s moving fast right now)

• Recover failed replica process

• Add new replica

• Kill the JVM on OOM (from Mark Miller)



Minimum DevOps Reqts
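The "all replicas report same number of docs?" check can be reduced to a comparison once the counts have been collected. The sketch below assumes the counts were gathered separately, for example by querying each core directly with distrib=false and reading numFound; the function name, core names, and sample numbers are made up for illustration.

```python
def out_of_sync_shards(counts: dict) -> dict:
    """Return the shards whose replicas disagree on document count.
    `counts` maps shard name -> {replica name: numFound}."""
    return {
        shard: replicas
        for shard, replicas in counts.items()
        if len(set(replicas.values())) > 1
    }


# Hypothetical counts collected from each core.
observed = {
    "shard1": {"node1": 33_000_120, "node2": 33_000_120},
    "shard2": {"node2": 32_998_004, "node1": 32_997_950},
}
print(out_of_sync_shards(observed))
```

During heavy indexing, counts can differ briefly between commits, so a check like this is most meaningful when run after a commit or when a mismatch persists across several runs.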

Page 20: Scaling up solr 4.1 to power big search in social media analytics


Nodes will crash! (ephemeral znodes)

Or, sometimes you just need to restart a

JVM (rolling restarts to upgrade)

Peer sync via update log (tlog)

100 updates else ...

Good ol’ Solr replication from leader to


Node recovery
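The recovery decision on this slide can be read as a simple threshold: a replica that missed only a few updates catches up from a peer's update log, and anything further behind falls back to full index replication. A sketch of that rule follows; treating the boundary as inclusive is an assumption, and the names are hypothetical.

```python
PEER_SYNC_WINDOW = 100  # number of recent updates quoted on the slide


def recovery_strategy(missed_updates: int) -> str:
    """Pick how a restarted replica catches up with its leader."""
    if missed_updates <= PEER_SYNC_WINDOW:  # assumption: boundary is inclusive
        return "peer_sync"    # replay the missed updates from the tlog
    return "replication"      # copy the index from the leader


for missed in (0, 100, 101, 50_000):
    print(missed, "->", recovery_strategy(missed))
```

With bursts of 8,000 docs/sec, a node that is down for even a short time during indexing will be past the window, so full replication is the common case for this workload.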

Page 21: Scaling up solr 4.1 to power big search in social media analytics


• Moving to a near real-time streaming model using Storm

• Buying more RAM per node

• Looking forward to shard splitting as it has become difficult to re-index 600M docs

• Re-building the index with DocValues

• We've had shards get out of sync after major failure – resolved it by going back to raw data and doing a key by key comparison of what we expected to be in the index and re-indexing any missing docs.

• Custom hashing to put all docs for a specific brand in the same


Roadmap / Futures
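The key by key comparison used to repair out-of-sync shards amounts to a set difference between the IDs expected from the raw data and the IDs actually present in the index. A minimal sketch, assuming both ID lists fit in memory (at 600M docs the real job would more likely run as a distributed batch job); the function name and sample IDs are hypothetical.

```python
def missing_doc_ids(expected_ids, indexed_ids):
    """IDs present in the raw data but absent from the index."""
    return sorted(set(expected_ids) - set(indexed_ids))


# Hypothetical IDs: raw data vs. what one shard actually returned.
expected = ["tweet-1", "tweet-2", "tweet-3", "tweet-4"]
indexed = ["tweet-1", "tweet-3"]

to_reindex = missing_doc_ids(expected, indexed)
print(to_reindex)
```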

Page 22: Scaling up solr 4.1 to power big search in social media analytics


If you find yourself in this situation, buy more RAM!

Obligatory lolcats slide

Page 23: Scaling up solr 4.1 to power big search in social media analytics


Timothy Potter


twitter: @thelabdude