Building a distributed search system with Apache Hadoop and Lucene

Academic Year 2012-2013

Uploaded by Mirko Calvaresi, 07-Dec-2014

DESCRIPTION

• Big Data Problem
• Map and Reduce approach: Apache Hadoop
• Distributing a Lucene index using Hadoop
• Measuring Performance
• Conclusion

TRANSCRIPT

Page 1: Building a distributed search system with Hadoop and Lucene

 

Building a distributed search system with Apache Hadoop and Lucene

      

Academic Year 2012-2013

Page 2: Building a distributed search system with Hadoop and Lucene

Outline

• Big Data Problem
• Map and Reduce approach: Apache Hadoop
• Distributing a Lucene index using Hadoop
• Measuring Performance
• Conclusion

Mirko Calvaresi, "Building a distributed search system with Apache Hadoop and Lucene"

Page 3: Building a distributed search system with Hadoop and Lucene

“Big Data”

This work analyzes the technological challenge of managing and administering information at a global scale, on the order of terabytes (10^12 bytes) or petabytes (10^15 bytes), and growing at an exponential rate.

• Facebook processes 2.5 billion pieces of content per day.
• YouTube: 72 hours of video uploaded per minute.
• Twitter: 50 million tweets per day.


Page 4: Building a distributed search system with Hadoop and Lucene

Multitier architecture vs Cloud computing


[Diagram: a classic multitier architecture (Client → Front End Servers → Database Servers, real-time processing) contrasted with a cloud architecture (Client → Front End Servers → Cloud, real-time processing plus asynchronous data analysis).]

Page 5: Building a distributed search system with Hadoop and Lucene

Apache Hadoop architecture


A Hadoop cluster scales computation capacity, storage capacity, and I/O bandwidth simply by adding commodity servers.

Page 6: Building a distributed search system with Hadoop and Lucene

HDFS: the distributed file system

• Files are stored as sets of (large) blocks
  – Default block size: 64 MB (the ext4 default is 4 kB)
  – Blocks are replicated for durability and availability

• The namespace is managed by a single name node
  – Actual data transfer happens directly between client and data node
  – Pros and cons of this decision?
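This division of labor can be sketched in plain Java (a toy simulation, not the Hadoop API; all names and block placements are illustrative): the name node serves only block metadata, and the client then reads the bytes directly from a data node holding a replica.

```java
import java.util.List;
import java.util.Map;

// Toy simulation of the HDFS read path (illustrative only, not the Hadoop API).
public class HdfsReadSketch {
    // Name node: keeps only metadata, never touches file contents.
    static final Map<String, List<Integer>> fileToBlocks = Map.of(
            "foo.txt", List.of(3, 9, 6),
            "bar.data", List.of(2, 4));
    // Each block is replicated on several data nodes for availability.
    static final Map<Integer, List<String>> blockToDataNodes = Map.of(
            3, List.of("dn1", "dn4"),
            9, List.of("dn2", "dn3"),
            6, List.of("dn1", "dn3"),
            2, List.of("dn2", "dn4"),
            4, List.of("dn3", "dn4"));

    // Client asks the name node which block is block #i of a file and where
    // a replica lives; the actual read then goes to that data node directly.
    static String locateBlock(String file, int index) {
        int blockId = fileToBlocks.get(file).get(index);
        return blockToDataNodes.get(blockId).get(0); // pick the first replica
    }

    public static void main(String[] args) {
        // Block #2 (0-based index 1) of foo.txt is block 9, held on dn2 and dn3.
        System.out.println(locateBlock("foo.txt", 1)); // dn2
    }
}
```

The pro of a single name node is simplicity and fast metadata lookups held in memory; the con is that it is a single point of failure and a scalability ceiling for the namespace.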

[Diagram: the client asks the name node for block #2 of foo.txt; the name node's metadata (foo.txt: blocks 3, 9, 6; bar.data: blocks 2, 4) points the client to a data node, from which it reads block 9 directly. Each block is replicated across several data nodes.]


Page 7: Building a distributed search system with Hadoop and Lucene

Map and Reduce

The computation takes a set of input key/value pairs and produces a set of output key/value pairs.
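As a minimal illustration of that key/value flow (plain Java, no Hadoop dependencies; the method names are mine), here is the classic word-count job: map emits (word, 1) for every word, the shuffle groups the intermediate pairs by key, and reduce sums each group.

```java
import java.util.*;

// Word count expressed as map -> shuffle -> reduce over key/value pairs.
public class WordCountSketch {
    // Map: one input record (a line) -> a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+"))
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        return out;
    }

    // Reduce: one key plus all its grouped values -> a single count.
    static int reduce(List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("the cat", "the dog and the cat");
        // Shuffle: group the intermediate pairs by key (sorted for readability).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
        grouped.forEach((word, counts) ->
                System.out.println(word + " " + reduce(counts)));
    }
}
```

In real Hadoop the framework performs the shuffle between distributed mapper and reducer tasks; only the map and reduce functions are user code.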


Page 8: Building a distributed search system with Hadoop and Lucene

Recap: Map Reduce approach

[Diagram: input data is split across several Mappers; the intermediate (key, value) pairs are redistributed by key in "The Shuffle"; Reducers then aggregate them into the output data.]


Page 9: Building a distributed search system with Hadoop and Lucene

Map and Reduce: where it is applicable


• Distributed "grep"
• Count of URL access frequency
• Reverse web-link graph
• Term vector per host
• Reduce an n-level graph into a redundant hash table
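The first of these fits the model trivially. In a plain-Java sketch (the pattern and class names are illustrative), each map task filters its own input split against a regular expression and emits matching lines; the reduce is just the identity that concatenates the partial outputs.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Distributed grep: map filters, reduce merely passes the matches through.
public class GrepSketch {
    // Each mapper scans its own input split independently.
    static List<String> map(List<String> split, Pattern p) {
        return split.stream()
                    .filter(line -> p.matcher(line).find())
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> split = List.of("error: disk full", "ok", "error: timeout");
        System.out.println(map(split, Pattern.compile("error")));
    }
}
```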

Page 10: Building a distributed search system with Hadoop and Lucene

Implementation: distributing a Lucene index using Map and Reduce 

The scope of the implementation is to:

1. populate a distributed Lucene index using the HDFS cluster
2. distribute queries and retrieve results using Map and Reduce


Page 11: Building a distributed search system with Hadoop and Lucene

Apache Lucene: indexing


Apache Lucene is the de facto standard in the open source community for textual search.

Document
  Field(type) -> Value
  Field(type) -> Value
  Field(type) -> Value
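At its core, Lucene turns those fields into an inverted index. A toy version in plain Java (not the Lucene API; class and method names are mine) maps each term to the ids of the documents containing it:

```java
import java.util.*;

// Toy inverted index: term -> sorted set of document ids containing it.
public class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Tokenize the text and add this document's id to each term's posting list.
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+"))
            if (!term.isEmpty())
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // Lookup is a single map access, independent of the corpus size.
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(),
                                     Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addDocument(1, "Hadoop and Lucene");
        idx.addDocument(2, "Lucene indexing");
        System.out.println(idx.search("lucene")); // both documents contain it
    }
}
```

Lucene adds analyzers, stored fields, and positional data on top of this basic structure, but the posting-list idea is the same.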

Page 12: Building a distributed search system with Hadoop and Lucene

Apache Lucene: searching


In Lucene's vector space model, each document is a vector. A measure of relevance is the angle θ between the document vector and the query vector.
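In practice the relevance score is computed as the cosine of that angle; a minimal plain-Java sketch with toy term weights (not Lucene's actual scoring code):

```java
// Cosine similarity between a document vector and a query vector:
// cos(theta) = (d . q) / (|d| * |q|).
// A higher cosine means a smaller angle, hence a more relevant document.
public class CosineSketch {
    static double cosine(double[] d, double[] q) {
        double dot = 0, nd = 0, nq = 0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * q[i];   // dot product
            nd  += d[i] * d[i];   // squared norm of the document vector
            nq  += q[i] * q[i];   // squared norm of the query vector
        }
        return dot / (Math.sqrt(nd) * Math.sqrt(nq));
    }

    public static void main(String[] args) {
        double[] doc   = {2, 1, 0}; // toy term weights for a document
        double[] query = {1, 1, 0}; // toy term weights for a query
        System.out.printf("%.3f%n", cosine(doc, query));
    }
}
```

Real Lucene scoring derives the weights from term frequency and inverse document frequency (tf-idf) rather than using raw counts.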

Page 13: Building a distributed search system with Hadoop and Lucene

Distributing Lucene indexes using Hadoop


Indexing
[Diagram: a Lucene Indexer Job reads a PDF document archive and writes Index 1, Index 2, and Index 3 to the HDFS cluster. Map phase: creates and populates each index. Reduce phase: none.]

Searching
[Diagram: a Lucene Search Job takes a search filter (a list of Lucene restrictions); map tasks query the indexes stored on the HDFS cluster, and the sort, combine, and reduce steps produce the result set. Map phase: queries the indexes. Reduce phase: merges and orders the result set.]
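The search side can be sketched in plain Java (toy shards and scores, not the thesis implementation): each map task searches one index shard and returns its partial hit list, and the reduce merges the partial lists ordered by descending score.

```java
import java.util.*;
import java.util.stream.Collectors;

// Map: query one index shard; Reduce: merge and order partial results by score.
public class ShardedSearchSketch {
    record Hit(String docId, double score) {}

    // One map task per shard: returns that shard's hits for the query.
    // Toy scoring: a document "matches" if its id contains the query string.
    static List<Hit> mapSearch(Map<String, Double> shard, String query) {
        return shard.entrySet().stream()
                .filter(e -> e.getKey().contains(query))
                .map(e -> new Hit(e.getKey(), e.getValue()))
                .collect(Collectors.toList());
    }

    // Reduce: concatenate the partial lists and sort by descending score.
    static List<Hit> reduceMerge(List<List<Hit>> partials) {
        return partials.stream().flatMap(List::stream)
                .sorted(Comparator.comparingDouble(Hit::score).reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> shard1 = Map.of("doc-hadoop-1", 0.8, "doc-misc", 0.1);
        Map<String, Double> shard2 = Map.of("doc-hadoop-2", 0.9);
        List<List<Hit>> partials = List.of(
                mapSearch(shard1, "hadoop"), mapSearch(shard2, "hadoop"));
        reduceMerge(partials).forEach(h ->
                System.out.println(h.docId() + " " + h.score()));
    }
}
```

The key design point mirrors the slide: searching is embarrassingly parallel across shards, while the only global work is the final merge-and-order step in the reduce.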

Page 14: Building a distributed search system with Hadoop and Lucene

Measuring Performance

The entire execution time can be formally defined as:

[equation omitted in the transcript]

while the time of a single map (or reduce) phase is:

[equation omitted in the transcript]

where α is the percentage of reduce tasks still ongoing after map-phase completion.
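The transcript does not preserve the original formulas; the following is a plausible reconstruction (an assumption consistent with the definition of α above, not the author's exact equations), using the common MapReduce cost model in which reduce work overlaps the map phase except for the fraction α:

```latex
% Total job time: setup cost, the map phase, and the fraction alpha of the
% reduce phase that cannot overlap the map phase.
T_{job} = T_{setup} + T_{map} + \alpha \, T_{reduce}

% Time of a single map (or reduce) phase: the n_{tasks} tasks run in waves
% of n_{slots} parallel slots, each task taking an average time t_{task}.
T_{phase} = \left\lceil \frac{n_{tasks}}{n_{slots}} \right\rceil \, \bar{t}_{task}
```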

Page 15: Building a distributed search system with Hadoop and Lucene

Measuring Performance


With 4 or more data nodes, the setup cost of the Hadoop infrastructure is amortized.

Page 16: Building a distributed search system with Hadoop and Lucene

Measuring Performance (Word Count)


Having a single big file speeds Hadoop up considerably: performance is not really determined by the quantity of data but by how many splits are added to the HDFS.
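The point can be made concrete with the split arithmetic (a sketch; the block size and file sizes are illustrative, and it assumes the default behavior of one split per block with no combining of small files): one 1 GB file yields 16 map tasks at a 64 MB block size, while the same gigabyte spread over 1024 small files yields 1024 map tasks, because each file contributes at least one split.

```java
// Number of input splits (hence map tasks) for a set of files, assuming
// one split per 64 MB block and no combining of small files.
public class SplitCountSketch {
    static final long BLOCK = 64L * 1024 * 1024; // 64 MB HDFS block size

    static long splits(long[] fileSizes) {
        long n = 0;
        for (long size : fileSizes)
            n += (size + BLOCK - 1) / BLOCK; // ceil(size / blockSize), >= 1 per file
        return n;
    }

    public static void main(String[] args) {
        long GB = 1024L * 1024 * 1024;
        long[] oneBigFile = {GB};
        long[] manySmall = new long[1024];
        java.util.Arrays.fill(manySmall, 1024L * 1024); // 1 MB each
        System.out.println(splits(oneBigFile)); // 16
        System.out.println(splits(manySmall)); // 1024
    }
}
```

Since each split carries per-task scheduling and JVM startup overhead, the many-small-files layout pays that overhead 64 times more often for the same data volume.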

Page 17: Building a distributed search system with Hadoop and Lucene

Job Detail Page

[Screenshot: the job detail page of the Hadoop web UI, showing the tasks queue and the tasks currently running.]


Page 18: Building a distributed search system with Hadoop and Lucene

Conclusion


 

What
• Analysis of the current status of open source technologies
• Analysis of the potential applications for the web
• Implemented a fully working Hadoop architecture
• Designed a web portal based on the previous architecture

Objectives
• Explore the Map and Reduce approach to analyze unstructured data
• Measure performance and understand the Apache Hadoop framework

Outcomes
• Setup of the entire architecture in my company environment (Innovation Engineering)
• Main benefits in the indexing phase
• Poor impact on the search side (for standard query formats)
• In general, major benefits when the HDFS is populated by a relatively small number of big (GB-sized) files