solr power ftw: powering nosql the world over

26
Solr Power FTW Alex Pinkin @apinkin #solrnosql

Upload: alex-pinkin

Post on 20-May-2015

811 views

Category:

Technology


2 download

DESCRIPTION

Solr is an open source, Lucene based search platform originally developed by CNET and used by the likes of Netflix, Yelp, and StubHub which has been rapidly growing in popularity and features during the last few years. Learn how Solr can be used as a Not Only SQL (NoSQL) database along the lines of Cassandra, Memcached, and Redis. NoSQL data stores are regularly described as non-relational, distributed, internet-scalable and are used at both Facebook and Digg. This presentation will quickly cover the fundamentals of NoSQL data stores, the basics of Lucene, and what Solr brings to the table. Following that we will dive into the technical details of making Solr your primary query engine on large scale web applications, thus relegating your traditional relational database to little more than a simple key store. Real solutions to problems like handling four billion requests per month will be presented. We'll talk about sizing and configuring the Solr instances to maintain rapid response times under heavy load. We'll show you how to change the schema on a live system with tens of millions of documents indexed while supporting real-time results. And finally, we'll answer your questions about ways to work around the lack of transactions in Solr and how you can do all of this in a highly available solution.

TRANSCRIPT

Page 1: Solr Power FTW: Powering NoSQL the World Over

Solr Power FTW

Alex Pinkin @apinkin #solrnosql

Page 2: Solr Power FTW: Powering NoSQL the World Over

What Will I Cover?

● Who I am

● What Bazaarvoice does

● SOLR and NoSQL

● Can SOLR handle 20K queries per second?

● Lessons learned: large scale multi data center deployment

● Conclusion

Page 3: Solr Power FTW: Powering NoSQL the World Over

Alex Pinkin

● Software Engineering Lead, Data Infrastructure team,Bazaarvoice

● Loves to play with SQL and NoSQL

@apinkin

Page 4: Solr Power FTW: Powering NoSQL the World Over

Bazaarvoice

● Bazaarvoice is a software as a service company powering user generated contentsuch as ratings and reviews on thousands of web sites

● 5 billion page views per month

● 230 billion impressions

● 75 million UGC

Page 5: Solr Power FTW: Powering NoSQL the World Over

NoSQL ?

Page 6: Solr Power FTW: Powering NoSQL the World Over

SQL vs NoSQL

Page 7: Solr Power FTW: Powering NoSQL the World Over

NoSQL is Not Only SQL

● Departs from relational model

● No fixed schema

● No joins

● Eventual consistency is OK

● Scale horizontally

Page 8: Solr Power FTW: Powering NoSQL the World Over

Types of NoSQL

● Key-value (Redis, Riak, Voldemort)

● Document (MongoDB, CouchDB)

● Graph (Neo4J, FlockDB)

● Column family (Cassandra, HBase)

Page 9: Solr Power FTW: Powering NoSQL the World Over

SOLR as NoSQL

● Non-relational model - Check

● No fixed schema - Check (dynamic fields)

● No joins - Check (denormalization)

● Horizontal scaling - Check (with work)

Page 10: Solr Power FTW: Powering NoSQL the World Over

SOLR stats - Bazaarvoice

Documents 250 MM

Index size 200 GB

QPS, avg 2,350

QPS, max 10,200

Response time, avg 12 ms

Servers 6+20

Page 11: Solr Power FTW: Powering NoSQL the World Over

SOLR Case Study

Page 12: Solr Power FTW: Powering NoSQL the World Over

SOLR Case Study

Page 13: Solr Power FTW: Powering NoSQL the World Over

Life Before SOLR

● Indexes for sorting and filtering

● Aggregate tables for stats

● Nightly jobs

● Bugs...

Page 14: Solr Power FTW: Powering NoSQL the World Over

Enter SOLR

● Index content and product catalog● De-normalization ● Filtering and sorting● Index every 15 minutes (20 seconds NRT)

Page 15: Solr Power FTW: Powering NoSQL the World Over

SOLR - Statistics

● COUNT, SUM, AVG, MIN, MAX(StatsComponent)

● Stored fields

● Whenever content changes, re-calc stats for all affected subjects

Page 16: Solr Power FTW: Powering NoSQL the World Over

Scaling reads - Replication

Page 17: Solr Power FTW: Powering NoSQL the World Over

Replication - Multiple Data Centers

Page 18: Solr Power FTW: Powering NoSQL the World Over

Replication - Multiple Data Centers

Chatty if using multiple cores Relay

● Core auto-warming disabled● Connection wait and read timeouts increased● Replication poll interval increased (15 min)● Compression enabled

...<str name="httpConnTimeout">20000</str><str name="httpReadTimeout">65000</str><str name="pollInterval">00:15:00</str><str name="compression">internal</str>...

Page 19: Solr Power FTW: Powering NoSQL the World Over

SOLR Cloud - Bazaarvoice version

● Multiple cores (100+ per server)

● Re-balance indexes across cores and servers○ Automatic○ Manual

● Deployment map stored in MySQL○ Host - Core - Partition○ Statistics

● Partition lifecycle

Page 20: Solr Power FTW: Powering NoSQL the World Over

Schema Changes

Re-indexing is time consuming for large indexes

Process

1. Full re-index off-line2. Incremental indexing prior to release3. Incremental indexing after the release

Bottleneck: reading from MySQL

Goal: Transparent re-indexing

Page 21: Solr Power FTW: Powering NoSQL the World Over

Performance Tuning

● Heap size ● Cache sizing● Auto-warming● Stored fields● Merge factor ● Commit frequency● Optimize frequency

Process: Simulate and measure● Replay logs● Analyze metrics● Monitor GC

Page 22: Solr Power FTW: Powering NoSQL the World Over

Performance Tuning - GC

# Java memory usage settings# Force the NewSize to be larger than the JVM typically allocates.# In practice, the JVM has been allocating an extremely small Young generation which objects to be prematurely promoted to the Tenured generationJAVA_MEM_OPTS="-Xms27g -Xmx27g -XX:NewRatio=8"

# -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps --> Turn on GC Logging# -XX:+UseConcMarkSweepGC --> Use the concurrent collector# -XX:+CMSIncrementalMode --> Incremental mode for the concurrent collector# -XX:+CMSIncrementalPacing --> Let the JVM adjust the amount of incremental collection JAVA_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=55 -XX:ParallelGCThreads=8 -XX:SurvivorRatio=4"

Page 23: Solr Power FTW: Powering NoSQL the World Over

SOLR Performance - Summary

● SOLR loves RAM!

● Log replay

● Same config, same hardware

● Get the most out of one instance

SOLR

Page 24: Solr Power FTW: Powering NoSQL the World Over

Conclusion - SOLR Strengths

● Lightning fast given enough RAM

● Good scale out supportincluding multi-data center

● Great community

Page 25: Solr Power FTW: Powering NoSQL the World Over

Conclusion - SOLR's Gaps

● Not fully elastic

● Real time takes work

● Secondary data store = sync overhead

● Schema changes

Page 26: Solr Power FTW: Powering NoSQL the World Over

Questions

@apinkin