trihug: lucene solr hadoop

Where It All Began

Using Apache Hadoop for Search with Apache Lucene and Solr

Lucid Imagination, Inc.

Topics

Search

What is:

Apache Lucene?

Apache Nutch?

Apache Solr?

Where does Hadoop (ecosystem) fit?

Indexing

Search

Search 101

Search tools are designed for dealing with fuzzy data

Works well with structured and unstructured data

Performs well when dealing with large volumes of data

Many apps don’t need the limits that databases place on content

Search fits well alongside a DB too

Given a user’s information need (the query), find and, optionally, score content relevant to that need

Many different ways to solve this problem, each with tradeoffs

What does “relevant” mean?

Search 101

Indexing

Finds and maps terms and documents

Conceptually similar to a book index

At the heart of fast search/retrieve

Relevance

Vector Space Model (VSM) for relevance

Common across many search engines

Apache Lucene is a highly optimized implementation of the VSM
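The two ideas on this slide, an index that maps terms to documents and a vector space score for relevance, can be sketched together in a few lines of Python. This is a toy illustration only: the whitespace `analyze` step, the `log(N/df) + 1` weighting and the plain cosine similarity are assumptions chosen for brevity, not Lucene's actual scoring formula.

```python
import math
from collections import Counter, defaultdict


def analyze(text):
    """Toy analysis step: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency in that document}."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(analyze(text)).items():
            postings[term][doc_id] = tf
    return postings


def search(query, postings, num_docs):
    """Rank documents by cosine similarity between TF-IDF vectors."""
    idf = {t: math.log(num_docs / len(p)) + 1.0 for t, p in postings.items()}

    # Squared length of every document vector, accumulated term by term.
    doc_norm_sq = defaultdict(float)
    for term, plist in postings.items():
        for doc_id, tf in plist.items():
            doc_norm_sq[doc_id] += (tf * idf[term]) ** 2

    # Query vector, restricted to terms that exist in the index.
    q_vec = {t: tf * idf[t] for t, tf in Counter(analyze(query)).items() if t in postings}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))

    # Only documents sharing at least one term with the query get a score.
    dot = defaultdict(float)
    for term, q_weight in q_vec.items():
        for doc_id, tf in postings[term].items():
            dot[doc_id] += q_weight * tf * idf[term]

    ranked = sorted(
        ((d / (q_norm * math.sqrt(doc_norm_sq[doc_id])), doc_id) for doc_id, d in dot.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [doc_id for _, doc_id in ranked]


docs = {
    "a": "solr search server",
    "b": "hadoop batch processing",
    "c": "search search relevance",
}
postings = build_index(docs)
print(search("search", postings, len(docs)))
```

Only the postings for the query terms are visited at search time, which is why the inverted index sits at the heart of fast retrieval.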

Lucene is a mature, high performance Java API to provide search capabilities to applications

Supports indexing, searching and a number of other commonly used search features (highlighting, spell checking, etc.)

Not a crawler and doesn’t know anything about Adobe PDF, MS Word, etc.

Created in 1997 and now part of the Apache Software Foundation

Important to note that Lucene does not have distributed index (shard) support

Nutch is an ASF project aimed at providing large scale crawling, indexing and searching using Lucene and other technologies

Mike Cafarella and Doug Cutting originally created Hadoop as part of Nutch based on the Google paper by Dean and Ghemawat

http://labs.google.com/papers/mapreduce.html

Only much later did it spin out to become the Hadoop that we all know

In other words, Hadoop was born from the need to scale search crawling and indexing

Originally used Lucene for search/indexing, now uses Solr

Solr is the Lucene-based search server providing the infrastructure required for most users to work with Lucene

Without knowing Java!

Also provides:

Easy setup and configuration

Faceting

Highlighting

Replication/Sharding

Lucene Best Practices

http://search.lucidimagination.com

Lucene Basics

Content is modeled via Documents and Fields

Content can be text, integers, floats, dates, custom

Analysis can be employed to alter content before indexing

Searches are supported through a wide range of Query options

Keyword

Phrases

Wildcards, other
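As a rough illustration of the concepts above (fields, analysis before indexing, and keyword, phrase and wildcard queries), here is a small Python sketch. It does not use Lucene's API; the stop word list, the regular-expression tokenizer and all function names are hypothetical stand-ins.

```python
import fnmatch
import re

STOPWORDS = {"a", "an", "and", "of", "the"}  # hypothetical stop list


def analyze(text):
    """Analysis: lowercase, tokenize on letters/digits, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def keyword_match(tokens, term):
    """Keyword query: the term appears anywhere in the field."""
    return term.lower() in tokens


def phrase_match(tokens, phrase):
    """Phrase query: the analyzed phrase appears as consecutive tokens."""
    wanted = analyze(phrase)
    if not wanted:
        return False
    last_start = len(tokens) - len(wanted)
    return any(tokens[i:i + len(wanted)] == wanted for i in range(last_start + 1))


def wildcard_match(tokens, pattern):
    """Wildcard query: any token matches a shell-style pattern such as 'qu*'."""
    return any(fnmatch.fnmatchcase(t, pattern.lower()) for t in tokens)


# A document is modeled as a set of named fields.
document = {"title": "The Quick Brown Fox", "body": "A fox jumps over the lazy dog"}
title_tokens = analyze(document["title"])

print(title_tokens)
print(keyword_match(title_tokens, "Fox"))
print(phrase_match(title_tokens, "brown fox"))
print(wildcard_match(title_tokens, "qu*"))
```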

Quick Solr Demo

Pre-reqs:

Apache Ant 1.7.x

svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk

cd solr-trunk/solr/

ant example

cd example

java -jar start.jar

cd exampledocs; java -jar post.jar *.xml

http://localhost:8983/solr/browse
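Once the example documents are posted, the index can also be queried over HTTP without writing any Java. The sketch below only builds a request URL and does not contact the server. It assumes the example exposes a `/select` handler under `http://localhost:8983/solr` and that the example documents carry a `cat` field; `q`, `rows`, `wt`, `facet` and `facet.field` are standard Solr request parameters, while the helper name is hypothetical.

```python
from urllib.parse import urlencode


def solr_select_url(base, query, rows=10, facet_field=None):
    """Build a Solr select URL; faceting is added only when a field is given."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base}/select?{urlencode(params)}"


# Assumed base URL and field name, taken from the demo setup above.
url = solr_select_url("http://localhost:8983/solr", "ipod", facet_field="cat")
print(url)
```

Pasting the printed URL into a browser (or fetching it with any HTTP client) should return matching documents plus facet counts, provided the assumptions about the handler path and field hold for your checkout.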

Anatomy of a Distributed Search System

[Diagram: Input Docs, Application, Sharding Alg., Indexers (Shard[0] … Shard[n]), Searchers (Shard[0] … Shard[n]), Fan In/Out, Coordination Layer]
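The fan in/out step of a distributed search can be sketched as follows: the query is sent to every shard, each shard returns its own top hits, and the partial lists are merged into one global ranking. The in-memory shards and the term-count "score" are toy assumptions; a real system would do the fan-out over the network and in parallel.

```python
import heapq


def search_shard(shard, query, k):
    """Return one shard's top-k hits as (score, doc_id) pairs.

    `shard` is a dict of doc_id -> text; the score is a plain term count.
    """
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(query.lower())
        if score:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def distributed_search(shards, query, k=3):
    """Fan out to every shard, then fan in by merging the partial top-k lists."""
    partial = [search_shard(shard, query, k) for shard in shards]   # fan out
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))  # fan in
    return [doc_id for _, doc_id in merged]


shards = [
    {"a": "solr search", "b": "search search search"},
    {"c": "search lucene search", "d": "hadoop"},
]
print(distributed_search(shards, "search", k=2))
```

Each shard only needs to return k hits, because no document outside a shard's local top k can appear in the global top k.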

Sharding Algorithm

Good document distribution across shards is important

Simple approach:

hash(id) % numShards

Fine if the number of shards doesn’t change or reindexing is easy
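A short sketch shows why the simple approach breaks down when the shard count changes. The document ids and the use of MD5 as the hash are assumptions for illustration (Python's built-in `hash()` is salted per process for strings, so a stable hash is used instead).

```python
import hashlib


def stable_hash(doc_id):
    """Deterministic hash of a string id, stable across processes."""
    return int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)


def shard_for(doc_id, num_shards):
    """Simple sharding: hash(id) % numShards."""
    return stable_hash(doc_id) % num_shards


doc_ids = [f"doc-{i}" for i in range(10000)]  # hypothetical ids

# Count documents whose shard changes when a fifth shard is added.
moved = sum(1 for d in doc_ids if shard_for(d, 4) != shard_for(d, 5))
fraction_moved = moved / len(doc_ids)
print(f"{fraction_moved:.0%} of documents change shard when going from 4 to 5 shards")
```

With modulo placement, most documents land on a different shard after the change, which in practice means reindexing nearly everything.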

Better:

Consistent Hashing

http://en.wikipedia.org/wiki/Consistent_hashing

Also key: how to deal with the shape/size of the cluster changing
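For comparison, here is a minimal consistent hashing sketch. It places several virtual points per shard on a hash ring and assigns each document to the next point clockwise. The class name, the shard names, the MD5 hash and the choice of 100 virtual points per shard are all illustrative assumptions.

```python
import bisect
import hashlib


def stable_hash(key):
    """Deterministic hash of a string, stable across processes."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place `vnodes` points for this node on the ring."""
        for i in range(self.vnodes):
            point = stable_hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def node_for(self, key):
        """Walk clockwise from the key's position to the next point."""
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


doc_ids = [f"doc-{i}" for i in range(10000)]  # hypothetical ids
ring = HashRing([f"shard-{i}" for i in range(4)])
before = {d: ring.node_for(d) for d in doc_ids}

ring.add_node("shard-4")
after = {d: ring.node_for(d) for d in doc_ids}

moved = [d for d in doc_ids if before[d] != after[d]]
print(f"{len(moved) / len(doc_ids):.0%} of documents moved, all to the new shard")
```

Only the documents claimed by the new shard move; everything else stays where it was, so growing the cluster no longer forces a near-complete reindex.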

Hadoop and Search

Much of the Hadoop ecosystem is useful for search related functionality

Indexing

Process of adding documents to inverted index to make them searchable

In most cases, batch-oriented and embarrassingly parallel, so Hadoop Core can help

Search

Query the index and return documents and other info (facets, etc.) related to the result set

Subsecond response time usually required

ZooKeeper, Avro and others are still useful

Indexing (Lucene)

Hadoop ships with contrib/index

Almost no documentation, but…

Good example of map-side indexing

Mapper does analysis and creates in memory index which is written out to segments

Indexes merged on the reduce side

Katta

http://katta.sourceforge.net

Shard management, distributed search, etc.

Both give you a large amount of control, but you have to build out the whole search framework around them
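The map-side indexing pattern described for contrib/index can be simulated in plain Python to show the data flow: each mapper analyzes its input split and builds an in-memory segment, and the reducer merges the segments into one index. This is not the contrib/index API or a Hadoop job; the function names and the dict-based "segments" are assumptions made for illustration.

```python
from collections import defaultdict


def map_build_segment(split):
    """Mapper: analyze the documents of one input split into an in-memory segment.

    `split` is a list of (doc_id, text); the segment maps term -> set of doc ids.
    """
    segment = defaultdict(set)
    for doc_id, text in split:
        for term in text.lower().split():
            segment[term].add(doc_id)
    return segment


def reduce_merge_segments(segments):
    """Reducer: merge per-mapper segments into a single index."""
    merged = defaultdict(set)
    for segment in segments:
        for term, doc_ids in segment.items():
            merged[term] |= doc_ids
    return {term: sorted(doc_ids) for term, doc_ids in merged.items()}


# Two hypothetical input splits, each handled by a separate mapper.
splits = [
    [(1, "Hadoop scales indexing"), (2, "Solr serves search")],
    [(3, "Lucene powers Solr search"), (4, "Hadoop batch jobs")],
]
segments = [map_build_segment(split) for split in splits]   # map phase
index = reduce_merge_segments(segments)                     # reduce phase
print(index["solr"], index["hadoop"])
```

Because each split is processed independently, the map phase parallelizes trivially, which is the sense in which indexing is embarrassingly parallel.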

Indexing (Solr)

https://issues.apache.org/jira/browse/SOLR-1301

Map side formats the documents

Reduce-side indexing

Creates indexes on local file system (outside of HDFS) and copies to default FS (HDFS, etc.)

Manually install index into a Solr core once built

https://issues.apache.org/jira/browse/SOLR-1045

Map-side indexing

Incomplete, but based on Hadoop contrib/index

Write a distributed Update Handler to handle on the server side

Indexing (Nutch to Solr)

Use Nutch to crawl content, Solr to index and serve

Doesn’t support indexing to Solr shards just yet

Need to write/use Solr distributed Update Handler

Still useful for smaller crawls (< 100M pages)

http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/

Searching

Hadoop Core is not all that useful for distributed search

Exception: Hadoop RPC layer, possibly

Exception: Log analysis, etc. for search related items

Other Hadoop ecosystem tools are useful:

Apache ZooKeeper (more in a moment)

HDFS – storage of shards (pull down to local disk)

Avro, Thrift, Protocol Buffers (serialization utilities)

ZooKeeper and Search

ZooKeeper is a centralized service for coordination, configuration, naming and distributed synchronization

In the context of search, it’s useful for:

Sharing configuration across nodes

Maintaining status about shards

Up/down/latency/rebalancing and more

Coordinating searches across shards/load balancing

ZooKeeper and Search (Practical)

Katta employs ZooKeeper for search coordination, etc.

Query distribution, status, etc.

Solr Cloud

All the benefits of Solr + ZooKeeper for coordinating distributed capabilities

Query distribution, configuration sharing, status, etc.

About to be committed to Solr trunk

http://wiki.apache.org/solr/SolrCloud

Other Search Related Tasks

Log Analysis

Query analytics
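Query analytics over search logs is essentially a counting problem, the kind of job that maps naturally onto Hadoop. The sketch below runs the same logic in a single process. The tab-separated log format (timestamp, query, number of hits) is a hypothetical one invented for this example.

```python
from collections import Counter


def query_stats(log_lines):
    """Count how often each query was issued and how often it returned no hits.

    Each line is assumed to be: timestamp <TAB> query <TAB> number_of_hits
    """
    counts = Counter()
    zero_hits = Counter()
    for line in log_lines:
        _timestamp, query, hits = line.rstrip("\n").split("\t")
        normalized = query.strip().lower()
        counts[normalized] += 1
        if int(hits) == 0:
            zero_hits[normalized] += 1
    return counts, zero_hits


# Hypothetical log entries.
log = [
    "2010-10-01T10:00:00\tsolr\t120",
    "2010-10-01T10:00:05\tHadoop\t80",
    "2010-10-01T10:00:09\tsolr\t120",
    "2010-10-01T10:00:12\tzookeper\t0",
]
counts, zero_hits = query_stats(log)
print(counts.most_common(2))
print(list(zero_hits))
```

Frequent queries are candidates for caching or relevance tuning, and frequent zero-hit queries often point at missing content or spelling variants.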
