Real-Time Searching of Big Data with Solr and Hadoop
Rod Cope, CTO & Founder, OpenLogic, Inc.


Page 1: Real-Time Searching of Big Data with Solr and Hadoop

Real-Time Searching of Big Data with Solr and Hadoop
Rod Cope, CTO & Founder
OpenLogic, Inc.

Page 2: Real-Time Searching of Big Data with Solr and Hadoop

Agenda

Introduction
The Problem
The Solution
Details
Final Thoughts
Q & A

Page 3: Real-Time Searching of Big Data with Solr and Hadoop

Introduction

Rod Cope
CTO & Founder of OpenLogic
25 years of software development experience
IBM Global Services, Anthem, General Electric

OpenLogic
Open Source Support, Governance, and Scanning Solutions
Certified library w/SLA support on 500+ Open Source packages
Over 200 Enterprise customers

Page 4: Real-Time Searching of Big Data with Solr and Hadoop

The Problem

“Big Data”
All the world’s Open Source software
Metadata, code, indexes
Individual tables contain many terabytes
Relational databases aren’t scale-free

Growing every day
Need real-time random access to all data
Long-running and complex analysis jobs

Page 5: Real-Time Searching of Big Data with Solr and Hadoop

The Solution

Hadoop, HBase, and Solr
Hadoop – distributed file system
HBase – “NoSQL” data store – column-oriented
Solr – search server based on Lucene
All are scalable, flexible, fast, well-supported, and used in production environments

And a supporting cast of thousands…
Stargate, MySQL, Rails, Redis, Resque, Nginx, Unicorn, HAProxy, Memcached, Ruby, JRuby, CentOS, …

Page 6: Real-Time Searching of Big Data with Solr and Hadoop

Solution Architecture

[Architecture diagram: a web browser, scanner client, and Maven client reach the system over the Internet; the application LAN runs Nginx & Unicorn, Ruby on Rails, Resque workers, and a Maven repository; the data LAN hosts Solr, HBase (accessed via Stargate), MySQL, and Redis, with live replication links (one marked 3x) between components. Caching and load balancing not shown.]

Page 7: Real-Time Searching of Big Data with Solr and Hadoop

Hadoop/HBase Implementation

Private cloud
100+ CPU cores
100+ terabytes of disk
Machines don’t have identity
Add capacity by plugging in new machines

Why not EC2?
Great for computational bursts
Expensive for long-term storage of Big Data
Not yet consistent enough for mission-critical usage of HBase

Page 8: Real-Time Searching of Big Data with Solr and Hadoop

Public Clouds and Big Data

Amazon EC2
EBS storage
100TB * $0.10/GB/month = $120k/year

Double Extra Large instances
13 EC2 compute units, 34.2GB RAM
20 instances * $1.00/hr * 8,760 hrs/yr = $175k/year

3-year reserved instances
20 * $4k = $80k up front to reserve
(20 * $0.34/hr * 8,760 hrs/yr * 3 yrs) / 3 = $86k/year to operate

Totals for 20 virtual machines
1st year cost: $120k + $80k + $86k = $286k
2nd & 3rd year costs: $120k + $86k = $206k
Average: ($286k + $206k + $206k) / 3 = $232k/year

Page 9: Real-Time Searching of Big Data with Solr and Hadoop

Private Clouds and Big Data

Buy your own
20 * Dell servers w/12 CPU cores, 32GB RAM, 5TB disk = $160k
Over 33 EC2 compute units each
Total: $53k/year (amortized over 3 years)

Page 10: Real-Time Searching of Big Data with Solr and Hadoop

Public Clouds are Expensive for Big Data

Amazon EC2
20 instances * 13 EC2 compute units = 260 EC2 compute units
Cost: $232k/year

Buy your own
20 machines * 33 EC2 compute units = 660 EC2 compute units
Cost: $53k/year
Does not include hosting & maintenance costs

Don’t think system administration goes away
You still “own” all the instances – monitoring, debugging, support

Page 11: Real-Time Searching of Big Data with Solr and Hadoop

Getting Data out of HBase

HBase is “NoSQL”
Think hash table, not relational database
Scanning vs. querying

How do I find my data if the primary key won’t cut it?
Solr to the rescue

Very fast, highly scalable search server with built-in sharding and replication – based on Lucene
Dynamic schema, powerful query language, faceted search, accessible via a simple REST-like web API w/XML, JSON, Ruby, and other data formats
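To make the contrast concrete, here is a minimal sketch against the 0.90-era HBase client API and Solr's HTTP interface (the table name, column family, row key, and query field are illustrative, not from this deck): a direct HBase Get works only when you already know the row key, while an arbitrary-field search goes through Solr.

import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class FindData {
  public static void main(String[] args) throws Exception {
    // 1. HBase: fast random access, but only by row key (think hash table).
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "source_files");                // hypothetical table
    Result row = table.get(new Get(Bytes.toBytes("openlogic/foo.c")));
    byte[] raw = row.getValue(Bytes.toBytes("content"), Bytes.toBytes("raw"));
    System.out.println("bytes in row: " + (raw == null ? 0 : raw.length));
    table.close();

    // 2. Solr: ad hoc search on any indexed field over a REST-like HTTP API.
    String q = URLEncoder.encode("code:\"strcpy\"", "UTF-8");       // hypothetical indexed field
    URL url = new URL("http://solr1:8080/solr/master/select?q=" + q + "&rows=10&wt=json");
    InputStream in = url.openStream();
    System.out.println(new Scanner(in, "UTF-8").useDelimiter("\\A").next());
    in.close();
  }
}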


Page 12: Real-Time Searching of Big Data with Solr and Hadoop

Solr Sharding

Query any server – it executes the same query against all other servers in the group (see the example below)
Returns aggregated result to the original caller

Async replication (slaves poll their masters)
Can use repeaters if replicating across data centers

OpenLogic
Solr farm, sharded, cross-replicated, fronted with HAProxy
Load balanced writes across masters, reads across masters and slaves

Billions of lines of code in HBase, all indexed in Solr for real-time search in multiple ways
Over 20 Solr fields indexed per source file
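As an illustration of that fan-out (a sketch with made-up host and core names; solr-lb stands in for an HAProxy front end), a Solr 1.4-style distributed search is driven by the shards request parameter: whichever core receives the request queries every listed shard and merges the results.

import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;

public class ShardedQuery {
  public static void main(String[] args) throws Exception {
    // One entry per master core, host:port/path with no http:// prefix.
    String shards = "solr1:8080/solr/coreA,solr2:8080/solr/coreB,solr3:8080/solr/coreC";
    // Send the query to the load balancer; the receiving core fans it out and aggregates.
    URL url = new URL("http://solr-lb:8080/solr/coreA/select"
        + "?q=code:strcpy&rows=10&wt=json&shards=" + shards);
    InputStream in = url.openStream();
    System.out.println(new Scanner(in, "UTF-8").useDelimiter("\\A").next());
    in.close();
  }
}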


Page 13: Real-Time Searching of Big Data with Solr and Hadoop

Solr Implementation – Sharding + Replication

[Diagram: 26 machines, each running one master Solr core (A on Machine 1, B on Machine 2, C on Machine 3, …, Z on Machine 26) and one slave core replicating another machine’s master (Z′, A′, B′, …, Y′ respectively); redundant HAProxy instances sit in front of the masters and slaves.]

Page 14: Real-Time Searching of Big Data with Solr and Hadoop

Solr Implementation – Sharding + Replication

[Same sharding + replication diagram as the previous slide.]

Page 15: Real-Time Searching of Big Data with Solr and Hadoop

Write Example

[Same diagram: a write is load balanced through HAProxy to one of the master cores and then pulled by its slave via replication.]

Page 16: Real-Time Searching of Big Data with Solr and Hadoop

Read Example

[Same diagram: a read is load balanced through HAProxy across master and slave cores.]

Page 17: Real-Time Searching of Big Data with Solr and Hadoop

Delete Example

[Same diagram: a delete is routed through HAProxy to a master core and propagated to its slave via replication.]

Page 18: Real-Time Searching of Big Data with Solr and Hadoop

Write Example – Failover

[Same diagram, illustrating failover of a write when a master core is unavailable.]

Page 19: Real-Time Searching of Big Data with Solr and Hadoop

Read Example – Failover

[Same diagram, illustrating failover of a read when a core is unavailable.]

Page 20: Real-Time Searching of Big Data with Solr and Hadoop

Configuration is Key

Many moving parts
It’s easy to let typos slip through
Consider automated configuration via Chef, Puppet, or similar

Pay attention to the details
Operating system – max open files, sockets, and other limits (example below)
Hadoop and HBase configuration
http://wiki.apache.org/hadoop/Hbase/Troubleshooting
Solr merge factor and norms

Don’t starve HBase or Solr for memory
Swapping will cripple your system
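As one concrete example of the OS limits called out above, a hedged sketch of commonly recommended settings for the user running Hadoop/HBase/Solr (the user name and values are illustrative, not from this deck):

# /etc/security/limits.conf – raise open-file and process limits for the cluster user
hadoop  -  nofile  32768
hadoop  -  nproc   32000

# verify from that user's shell
ulimit -n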


Page 21: Real-Time Searching of Big Data with Solr and Hadoop

Commodity Hardware

“Commodity hardware” != 3-year-old desktop
Dual quad-core, 32GB RAM, 4+ disks
Don’t bother with RAID on Hadoop data disks
Be wary of non-enterprise drives
Expect ugly hardware issues at some point

Page 22: Real-Time Searching of Big Data with Solr and Hadoop

OpenLogic’s Hadoop and Solr Deployment

Dual quad-core and dual hex-core Dell boxes
32-64GB RAM
ECC (highly recommended by Google)

6 x 2TB enterprise hard drives
RAID 1 on two of the drives
OS, Hadoop, HBase, Solr, NFS mounts (be careful!), job code, etc.
Key “source” data backups
Hadoop datanode gets remaining drives

Redundant enterprise switches
Dual- and quad-gigabit NICs

Page 23: Real-Time Searching of Big Data with Solr and Hadoop

Expect Things to Fail – A Lot

Hardware
Power supplies, hard drives

Operating System
Kernel panics, zombie processes, dropped packets

Software Servers
Hadoop datanodes, HBase regionservers, Stargate servers, Solr servers

Your Code and Data
Stray Map/Reduce jobs, strange corner cases in your data leading to program failures

Page 24: Real-Time Searching of Big Data with Solr and Hadoop

Cutting Edge

Hadoop
SPOF around the NameNode, append functionality

HBase
Backup, replication, and indexing solutions in flux

Solr
Several competing solutions around cloud-like scalability and fault-tolerance, including ZooKeeper and Hadoop integration
No clear winner, none quite ready for production

Page 25: Real-Time Searching of Big Data with Solr and Hadoop

Loading Big Data

Experiment with different Solr merge factors
During huge loads, it can help to use a higher factor for load performance
Minimize index manipulation gymnastics
Start with something like 25

When you’re done with the massive initial load/import, turn it back down for search performance
Minimize number of queries
Start with something like 5
Example:
curl 'http://solr1:8080/solr/master/update?optimize=true&maxSegments=5'
This can take a few minutes, so you might need to adjust various timeouts

Note that a small merge factor will hurt indexing performance if you need to do massive loads on a frequent basis or continuous indexing
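For reference, a hedged sketch of where the merge factor lives in a Solr 1.4/3.x-era solrconfig.xml (the value mirrors the load-time suggestion above; exact element placement varies by version):

<!-- solrconfig.xml (illustrative) -->
<indexDefaults>
  <!-- higher values favor indexing throughput; lower values favor search -->
  <mergeFactor>25</mergeFactor>
</indexDefaults>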


Page 26: Real-Time Searching of Big Data with Solr and Hadoop

Loading Big Data (cont.)

Test your write-focused load balancing
Look for large skews in Solr index size
Note: you may have to commit, optimize, write again, and commit before you can really tell

Make sure your replication slaves are keeping up
Using identical hardware helps
If index directories don’t look the same, something is wrong
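One quick sanity check (a sketch; the hosts, port, and core name are made up) is to compare document counts on a master core and its slave after a commit; a persistent gap, like index directories of very different sizes, suggests replication is falling behind.

import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Compare doc counts on a master core and its slave replica.
public class ReplicationCheck {
  static long numFound(String coreUrl) throws Exception {
    URL url = new URL(coreUrl + "/select?q=*:*&rows=0&wt=json");
    InputStream in = url.openStream();
    String body = new Scanner(in, "UTF-8").useDelimiter("\\A").next();   // slurp response
    in.close();
    Matcher m = Pattern.compile("\"numFound\":(\\d+)").matcher(body);
    if (!m.find()) throw new IllegalStateException("no numFound from " + coreUrl);
    return Long.parseLong(m.group(1));
  }

  public static void main(String[] args) throws Exception {
    long master = numFound("http://solr1:8080/solr/coreA");   // master shard
    long slave  = numFound("http://solr2:8080/solr/coreA");   // its slave replica
    System.out.printf("master=%d slave=%d lag=%d%n", master, slave, master - slave);
  }
}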


Page 27: Real-Time Searching of Big Data with Solr and Hadoop

Loading Big Data (cont.)

Don’t commit to Solr too frequently
It’s easy to auto-commit or commit after every record
Doing this hundreds of times per second will take Solr down, especially if you have serious warm-up queries configured
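The warning above is about committing per record; one common way to cap commit frequency instead is size/time-based auto-commit. A hedged solrconfig.xml sketch (the thresholds are illustrative, not from this deck):

<!-- solrconfig.xml: commit at most every 10,000 docs or 60 seconds, whichever comes first -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime> <!-- milliseconds -->
  </autoCommit>
</updateHandler>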

Avoid putting large values in HBase (> 5MB)
Works, but may cause instability and/or performance issues
Rows and columns are cheap, so use more of them instead
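One way to follow that advice (a sketch against the 0.90-era HBase client API; the table, column family, and chunk size are hypothetical) is to split an oversized blob across several columns in the same row:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Store a large byte[] as many small column values instead of one huge cell.
public class ChunkedWrite {
  private static final int CHUNK = 1024 * 1024;   // 1MB chunks, well under the ~5MB caution

  public static void store(String rowKey, byte[] blob) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "source_files");               // hypothetical table
    try {
      Put put = new Put(Bytes.toBytes(rowKey));
      int chunks = (blob.length + CHUNK - 1) / CHUNK;
      for (int i = 0; i < chunks; i++) {
        int from = i * CHUNK;
        int to = Math.min(from + CHUNK, blob.length);
        // One column per chunk: content:chunk_0, content:chunk_1, ...
        put.add(Bytes.toBytes("content"), Bytes.toBytes("chunk_" + i),
                Arrays.copyOfRange(blob, from, to));
      }
      put.add(Bytes.toBytes("content"), Bytes.toBytes("chunk_count"), Bytes.toBytes(chunks));
      table.put(put);
    } finally {
      table.close();
    }
  }
}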


Page 28: Real-Time Searching of Big Data with Solr and Hadoop

Loading Big Data (cont.)

Don’t use a single machine to load the cluster
You might not live long enough to see it finish

At OpenLogic, we spread raw source data across many machines and hard drives via NFS
Be very careful with NFS configuration – it can hang machines

Load data into HBase via Hadoop map/reduce jobs (sketch below)
Turn off the WAL for much better performance
put.setWriteToWAL(false)

Index in Solr as you go
Good way to test your load balancing, write schemes, and replication setup
This will find your weak spots!
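A minimal sketch of such a load job against the 0.90-era HBase MapReduce API (the table name, column family, and tab-separated input layout are hypothetical); the put.setWriteToWAL(false) call is the one named above:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Map-only job: each input line ("rowkey<TAB>content") becomes one row in HBase.
public class BulkLoad {
  static class LoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      if (fields.length < 2) return;                       // skip malformed lines
      Put put = new Put(Bytes.toBytes(fields[0]));
      put.setWriteToWAL(false);                            // skip the WAL for load speed
      put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"), Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase bulk load");
    job.setJarByClass(BulkLoad.class);
    job.setMapperClass(LoadMapper.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Wires up TableOutputFormat and the output table; null reducer + 0 reduces = map-only writes.
    TableMapReduceUtil.initTableReducerJob("source_files", null, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}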


Page 29: Real-Time Searching of Big Data with Solr and Hadoop

Scripting Languages Can Help

Writing data loading jobs can be tedious
Scripting is faster and easier than writing Java
Great for system administration tasks and testing
The standard HBase shell is based on JRuby
Very easy Map/Reduce jobs with JRuby and Wukong
Used heavily at OpenLogic

Productivity of Ruby
Power of the Java Virtual Machine
Ruby on Rails, Hadoop integration, GUI clients

Page 30: Real-Time Searching of Big Data with Solr and Hadoop

Java (27 lines)

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Filter {
  public static void main( String[] args ) {
    List list = new ArrayList();
    list.add( "Rod" );
    list.add( "Neeta" );
    list.add( "Eric" );
    list.add( "Missy" );

    Filter filter = new Filter();
    List shorts = filter.filterLongerThan( list, 4 );
    System.out.println( shorts.size() );
    Iterator iter = shorts.iterator();
    while ( iter.hasNext() ) {
      System.out.println( iter.next() );
    }
  }

  public List filterLongerThan( List list, int length ) {
    List result = new ArrayList();
    Iterator iter = list.iterator();
    while ( iter.hasNext() ) {
      String item = (String) iter.next();
      if ( item.length() <= length ) {
        result.add( item );
      }
    }
    return result;
  }
}

Page 31: Real-Time Searching of Big Data with Solr and Hadoop

Scripting languages (4 lines)

Groovy

list = ["Rod", "Neeta", "Eric", "Missy"]
shorts = list.findAll { name -> name.size() <= 4 }
println shorts.size
shorts.each { name -> println name }

-> 2
-> Rod Eric

JRuby

list = ["Rod", "Neeta", "Eric", "Missy"]
shorts = list.find_all { |name| name.size <= 4 }
puts shorts.size
shorts.each { |name| puts name }

-> 2
-> Rod Eric

Page 32: Real-Time Searching of Big Data with Solr and Hadoop


Not Possible Without Open Source


Page 33: Real-Time Searching of Big Data with Solr and Hadoop

Not Possible Without Open Source

Hadoop, HBase, Solr
Apache, Tomcat, ZooKeeper, HAProxy
Stargate, JRuby, Lucene, Jetty, HSQLDB, Geronimo
Apache Commons, JUnit
CentOS
Dozens more

Too expensive to build or buy everything

Page 34: Real-Time Searching of Big Data with Solr and Hadoop

Final Thoughts

You can host big data in your own private cloud
Tools are available today that didn’t exist a few years ago
Fast to prototype – production readiness takes time
Expect to invest in training and support

HBase and Solr are fast
100+ random queries/sec per instance
Give them memory and stand back

HBase scales, Solr scales (to a point)
Don’t worry about outgrowing a few machines
Do worry about outgrowing a rack of Solr instances
Look for ways to partition your data other than “automatic” sharding

Page 35: Real-Time Searching of Big Data with Solr and Hadoop


Q & A

Any questions for [email protected]

* Unless otherwise credited, all images in this presentation are either open source project logos or were licensed from BigStockPhoto.com