trihug: lucene solr hadoop
Post on 26-Jan-2015
Where It All Began
Using Apache Hadoop for Search with Apache Lucene and Solr
Lucid Imagination, Inc.
Topics
Search
What is:
Apache Lucene?
Apache Nutch?
Apache Solr?
Where does Hadoop (ecosystem) fit?
Indexing
Search
Other
Search 101
Search tools are designed for dealing with fuzzy data
Works well with structured and unstructured data
Performs well when dealing with large volumes of data
Many apps don’t need the limits that databases place on content
Search fits well alongside a DB too
Given a user’s information need (the query), find and, optionally, score content relevant to that need
Many different ways to solve this problem, each with tradeoffs
What does “relevant” mean?
Search 101
Indexing
Finds and maps terms and documents
Conceptually similar to a book index
At the heart of fast search/retrieve
Relevance
Vector Space Model (VSM) for relevance
Common across many search engines
Apache Lucene is a highly optimized implementation of the VSM
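To make the two ideas on this slide concrete, here is a minimal sketch of an inverted index with vector-space ranking. It is plain Python and not Lucene's actual scoring formula; the function names (tokenize, build_index, search) and the particular TF-IDF weighting are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for real analysis)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: term frequency}, like a book index with counts."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    """Rank documents by cosine similarity between TF-IDF vectors."""
    def idf(term):
        # Rarer terms get a higher weight (assumed smoothing, for illustration).
        return math.log(1 + num_docs / len(index[term]))

    # Squared length of every document vector, used for normalization.
    norms = defaultdict(float)
    for term, postings in index.items():
        weight = idf(term)
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * weight) ** 2

    # Only query terms that occur in the index can contribute to a score.
    q_terms = Counter(t for t in tokenize(query) if t in index)
    q_norm = math.sqrt(sum((qtf * idf(t)) ** 2 for t, qtf in q_terms.items()))

    # Dot product of query and document vectors, one posting list at a time.
    scores = defaultdict(float)
    for term, qtf in q_terms.items():
        weight = idf(term)
        for doc_id, tf in index[term].items():
            scores[doc_id] += (qtf * weight) * (tf * weight)

    ranked = [
        (doc_id, dot / (math.sqrt(norms[doc_id]) * q_norm))
        for doc_id, dot in scores.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Only the posting lists of the query terms are visited at search time, which is why the index sits at the heart of fast retrieval.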
Lucene is a mature, high-performance Java API that provides search capabilities to applications
Supports indexing, searching and a number of other commonly used search features (highlighting, spell checking, etc.)
Not a crawler and doesn’t know anything about Adobe PDF, MS Word, etc.
Created in 1997 and now part of the Apache Software Foundation
Important to note that Lucene does not have distributed index (shard) support
Nutch
ASF project aimed at providing large scale crawling, indexing and searching using Lucene and other technologies
Mike Cafarella and Doug Cutting originally created Hadoop as part of Nutch based on the Google paper by Dean and Ghemawat
http://labs.google.com/papers/mapreduce.html
Only much later did it spin out to become the Hadoop that we all know
In other words, Hadoop was born from the need to scale search crawling and indexing
Nutch originally used Lucene for search/indexing, now uses Solr
Solr
Solr is the Lucene-based search server providing the infrastructure required for most users to work with Lucene
Without knowing Java!
Also provides:
Easy setup and configuration
Faceting
Highlighting
Replication/Sharding
Lucene Best Practices
http://search.lucidimagination.com
Lucene Basics
Content is modeled via Documents and Fields
Content can be text, integers, floats, dates, custom
Analysis can be employed to alter content before indexing
Searches are supported through a wide range of Query options
Keyword
Terms
Phrases
Wildcards, other
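As a rough illustration of the model above (documents with fields, analysis before indexing, and several query types), the following sketch uses plain Python rather than the Lucene API. The helper names (analyze, term_match, phrase_match, wildcard_match) are hypothetical, and real Lucene queries run against an index instead of scanning tokens.

```python
import fnmatch
import re


def analyze(text):
    """Analysis step: lowercase the content and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def term_match(tokens, term):
    """Term query: the analyzed field contains the single term."""
    return term.lower() in tokens


def phrase_match(tokens, phrase):
    """Phrase query: the phrase's terms appear adjacent and in order."""
    words = analyze(phrase)
    n = len(words)
    return n > 0 and any(
        tokens[i:i + n] == words for i in range(len(tokens) - n + 1)
    )


def wildcard_match(tokens, pattern):
    """Wildcard query: '*' matches any run of characters, '?' exactly one."""
    return any(fnmatch.fnmatchcase(tok, pattern.lower()) for tok in tokens)


# A document is a set of named fields; only the text field is analyzed here.
doc = {"title": "Apache Solr Search Server", "year": 2010}
title_tokens = analyze(doc["title"])
```

With these tokens, term_match(title_tokens, "Solr"), phrase_match(title_tokens, "search server") and wildcard_match(title_tokens, "sea*") all match, while the reversed phrase "server search" does not.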
Quick Solr Demo
Pre-reqs:
Apache Ant 1.7.x
SVN
svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
cd solr-trunk/solr/
ant example
cd example
java -jar start.jar
cd exampledocs; java -jar post.jar *.xml
http://localhost:8983/solr/browse
Anatomy of a Distributed Search System
[Diagram] Indexing path: Input Docs → Sharding Alg. → Indexers → Shard[0] … Shard[n]
[Diagram] Search path: Users → Application → Coordination Layer (Fan In/Out) → Searchers → Shard[0] … Shard[n]
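The Fan In/Out step of the coordination layer can be sketched as a scatter-gather: send the query to every shard, take each shard's top hits, and merge them into one ranked list. This is an assumed, simplified model; search_shard is a hypothetical stand-in for a real searcher, and its scoring (counting query-term occurrences) is only there to produce numbers to merge.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query, k):
    """Hypothetical searcher: score = number of query-term occurrences."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            hits.append((score, doc_id))
    # Each shard only returns its own k best hits.
    return heapq.nlargest(k, hits)


def distributed_search(shards, query, k=10):
    """Fan the query out to every shard, then fan in by merging the top-k lists."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, query, k), shards))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [(doc_id, score) for score, doc_id in merged]
```

Because every shard returns at most k hits, the coordinator merges at most k × numShards candidates regardless of index size. Note that scores are only comparable across shards if each shard scores consistently, which this toy scorer does by construction.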
Sharding Algorithm
Good document distribution across shards is important
Simple approach:
hash(id) % numShards
Fine if the number of shards doesn’t change or reindexing is easy
Better:
Consistent Hashing
http://en.wikipedia.org/wiki/Consistent_hashing
Also key: how to deal with the shape/size of the cluster changing
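A minimal sketch of both approaches, to show why the shape of the cluster matters. The names stable_hash, mod_shard and ConsistentHashRing, the shard labels, and the choice of MD5 with 100 ring points per shard are all assumptions for illustration, not something the slides or Solr prescribe.

```python
import bisect
import hashlib


def stable_hash(key):
    """Deterministic hash (Python's built-in hash() of a str varies per process)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def mod_shard(doc_id, num_shards):
    """The simple approach: hash(id) % numShards."""
    return stable_hash(doc_id) % num_shards


class ConsistentHashRing:
    """Each shard owns many points on a ring; a document goes to the next point."""

    def __init__(self, shards, replicas=100):
        self._ring = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, doc_id):
        # First ring point past the document's hash, wrapping around at the end.
        idx = bisect.bisect(self._points, stable_hash(doc_id)) % len(self._ring)
        return self._ring[idx][1]


doc_ids = [f"doc{i}" for i in range(1000)]
old_ring = ConsistentHashRing([f"shard{i}" for i in range(4)])
new_ring = ConsistentHashRing([f"shard{i}" for i in range(5)])

# Count how many documents change shards when growing from 4 to 5 shards.
moved_mod = sum(mod_shard(d, 4) != mod_shard(d, 5) for d in doc_ids)
moved_ring = [d for d in doc_ids if old_ring.shard_for(d) != new_ring.shard_for(d)]
```

With modulo hashing, most documents land on a different shard after the change, so the whole collection effectively has to be reindexed. On the ring, only the documents claimed by the new shard move, which is roughly one fifth of them in this setup.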
Hadoop and Search
Much of the Hadoop ecosystem is useful for search-related functionality
Indexing
Process of adding documents to an inverted index to make them searchable
In most cases, batch-oriented and embarrassingly parallel, so Hadoop Core can help
Search
Query the index and return documents and other info (facets, etc.) related to the result set
Subsecond response time usually required
ZooKeeper, Avro and others are still useful
Indexing (Lucene)
Hadoop ships with contrib/index
Almost no documentation, but…
Good example of map-side indexing
Mapper does analysis and creates an in-memory index which is written out to segments
Indexes merged on the reduce side
Katta
http://katta.sourceforge.net
Shard management, distributed search, etc.
Both give you a large amount of control, but you have to build out all the search framework around it
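The map-side indexing pattern described for contrib/index can be sketched without Hadoop: each mapper analyzes its input split and builds a small in-memory index segment, and the reduce side merges the segments. This is not the contrib/index code; the function names and the dictionary-based segment format are assumptions for illustration.

```python
from collections import defaultdict


def analyze(text):
    """Minimal analysis: lowercase and split on whitespace."""
    return text.lower().split()


def map_build_segment(split):
    """Mapper: analyze one input split and build an in-memory index segment."""
    segment = defaultdict(set)
    for doc_id, text in split:
        for term in analyze(text):
            segment[term].add(doc_id)
    return segment


def reduce_merge_segments(segments):
    """Reducer: merge the per-mapper segments into one index with sorted postings."""
    merged = defaultdict(set)
    for segment in segments:
        for term, doc_ids in segment.items():
            merged[term] |= doc_ids
    return {term: sorted(doc_ids) for term, doc_ids in merged.items()}


# Two input splits, each handled by its own (simulated) mapper.
splits = [
    [(1, "Hadoop indexing"), (2, "Solr search")],
    [(3, "Lucene search indexing")],
]
index = reduce_merge_segments(map_build_segment(split) for split in splits)
```

The mappers never share state, which is what makes the job embarrassingly parallel; all cross-split work is deferred to the merge.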
Indexing (Solr)
https://issues.apache.org/jira/browse/SOLR-1301
Map side formats
Reduce-side indexing
Creates indexes on local file system (outside of HDFS) and copies to default FS (HDFS, etc.)
Manually install index into a Solr core once built
https://issues.apache.org/jira/browse/SOLR-1045
Map-side indexing
Incomplete, but based on Hadoop contrib/index
Write a distributed Update Handler to handle on the server side
Indexing (Nutch to Solr)
Use Nutch to crawl content, Solr to index and serve
Doesn’t support indexing to Solr shards just yet
Need to write/use Solr distributed Update Handler
Still useful for smaller crawls (< 100M pages)
http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
Searching
Hadoop Core is not all that useful for distributed search
Exception: Hadoop RPC layer, possibly
Exception: Log analysis, etc. for search-related items
Other Hadoop ecosystem tools are useful:
Apache ZooKeeper (more in a moment)
HDFS – storage of shards (pull down to local disk)
Avro, Thrift, Protocol Buffers (serialization utilities)
ZooKeeper and Search
ZooKeeper is a centralized service for coordination, configuration, naming and distributed synchronization
In the context of search, it’s useful for:
Sharing configuration across nodes
Maintaining status about shards
Up/down/latency/rebalancing and more
Coordinating searches across shards/load balancing
ZooKeeper and Search (Practical)
Katta employs ZooKeeper for search coordination, etc.
Query distribution, status, etc.
Solr Cloud
All the benefits of Solr + ZooKeeper for coordinating distributed capabilities
Query distribution, configuration sharing, status, etc.
About to be committed to Solr trunk
http://wiki.apache.org/solr/SolrCloud
Other Search-Related Tasks
Log Analysis
Query analytics
Related Searches
Relevance assessments
Classification and Clustering
Mahout – http://mahout.apache.org
HBase and other stores for documents
Avro, Thrift, Protocol Buffers for serialization of objects across the wire
Resources
http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
http://hadoop.apache.org
http://nutch.apache.org
http://lucene.apache.org
http://www.lucidimagination.com