Information Retrieval Theory: A Case Study Involving Apache Lucene + Solr, a Distributed Search Engine

By Alok Dhamanaskar and Manuel Correa

Post on 07-Dec-2014


Outline

●Problem Description

●About Lucene, Solr, and Hadoop HDFS

●Solution: Implementation

●Data tested

●Demo

●Conclusions

●Questions

Problem Description

●Search large data sets spanning thousands of documents, databases, etc. with a simple query

●Plain SQL does not readily support full-text search across multiple fields with relevance ranking and other data-mining features

●Data may include geospatial information, which calls for operations such as searching by distance and by buffer

●Most data-intensive applications also demand high availability and persistent storage
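To make "searching by distance and by buffer" concrete, here is a minimal sketch in plain Python, assuming a buffer is a circle of a given radius around a point. The helper names, the sample records, and the use of the haversine formula are illustrative assumptions, not the project's code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_buffer(points, center, radius_km):
    """Return the points lying at most radius_km away from center."""
    lat0, lon0 = center
    return [p for p in points
            if haversine_km(lat0, lon0, p["lat"], p["lon"]) <= radius_km]


# Hypothetical sample records (not taken from the project's data set)
facilities = [
    {"name": "Clinic A", "lat": 33.95, "lon": -83.38},
    {"name": "Fire Station B", "lat": 33.75, "lon": -84.39},
]

nearby = within_buffer(facilities, center=(33.96, -83.37), radius_km=10)
print([f["name"] for f in nearby])
```

A search engine with spatial support performs this kind of filtering inside the index instead of scanning every record, which is what makes it practical on large data sets.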

About Lucene, Solr, and Hadoop HDFS

●Lucene: Java indexing and search engine

○Ranked searching

○Query types: phrase, wildcard, proximity

○Searching by document field

○Sorting

●Solr:

○Web application built on top of the Lucene engine

○RESTful interfaces for searching, indexing, deleting, etc.

○Extends Lucene: geospatial search, schema integration, monitoring, index sharding

●Hadoop HDFS

○Distributed file system
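The query types listed above map onto Lucene query syntax, which Solr accepts over HTTP through its select handler. The following is a minimal sketch that only builds request URLs; the host, the core name "facilities", and the field name "name" are hypothetical.

```python
from urllib.parse import urlencode

# Lucene query syntax for the query types listed above.
# The field name "name" is an assumption made for illustration.
QUERIES = {
    "phrase": '"fire station"',        # exact phrase
    "wildcard": "hosp*",               # matches hospital, hospice, ...
    "proximity": '"child support"~5',  # both terms within 5 positions
    "field": "name:clinic",            # search restricted to one field
}


def solr_select_url(core_url, q, rows=10, sort=None):
    """Build a URL for Solr's /select handler with a JSON response."""
    params = {"q": q, "rows": rows, "wt": "json"}
    if sort:
        params["sort"] = sort  # e.g. "name asc"
    return f"{core_url}/select?{urlencode(params)}"


BASE = "http://localhost:8983/solr/facilities"  # hypothetical core URL

for kind, q in QUERIES.items():
    print(kind, solr_select_url(BASE, q, sort="name asc"))
```

Sending any of these URLs with an HTTP GET to a running Solr instance would return the ranked matches; the sketch stops at URL construction so that it runs without a server.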

Project Implementation

●Installation of Hadoop (Cloudera distribution)

●Integration with the file system through FUSE

●Setup of Solr instances

●Data preparation in the databases (Oracle and SQL Server)

●Indexing of data from the databases into Solr

●Distributed search implementation in Solr

●Development of a Solr client web application
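The database-to-Solr indexing step can be sketched as follows. This is not necessarily how the project did it: the sketch assumes rows are read with a plain SQL query and posted as JSON to Solr's update handler, and it uses an in-memory SQLite table as a stand-in for Oracle or SQL Server. The table, column names, and core URL are hypothetical, and the request is built but not sent.

```python
import json
import sqlite3
from urllib.request import Request


def rows_to_solr_docs(cursor):
    """Turn every row of an executed query into a dict keyed by column name."""
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def build_update_request(core_url, docs):
    """Build (without sending) a POST that adds docs and commits them."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{core_url}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# In-memory SQLite stands in for the real databases (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hospitals (id TEXT PRIMARY KEY, name TEXT, location TEXT)")
conn.execute("INSERT INTO hospitals VALUES ('h1', 'Example Hospital', '33.95,-83.38')")

cur = conn.execute("SELECT id, name, location FROM hospitals")
docs = rows_to_solr_docs(cur)

req = build_update_request("http://localhost:8983/solr/facilities", docs)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would submit the documents to a running Solr core whose schema defines matching fields.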

Solution: Implementation
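One way to run a distributed search across several Solr instances is the `shards` request parameter: the instance that receives the query forwards it to every listed shard and merges the results. The sketch below only builds such a URL; the shard addresses, ports, and core name are hypothetical and do not describe the project's actual topology.

```python
from urllib.parse import urlencode


def distributed_select_url(coordinator_url, shard_addresses, q, rows=10):
    """Build a /select URL asking one Solr instance to query all given shards."""
    params = {
        "q": q,
        "rows": rows,
        "wt": "json",
        # Comma-separated host:port/path entries, without the http:// prefix
        "shards": ",".join(shard_addresses),
    }
    return f"{coordinator_url}/select?{urlencode(params)}"


# Hypothetical shard layout: two Solr instances on one machine
SHARDS = [
    "localhost:8983/solr/facilities",
    "localhost:7574/solr/facilities",
]

url = distributed_select_url(
    "http://localhost:8983/solr/facilities", SHARDS, "name:clinic"
)
print(url)
```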

Data tested

●Public Data from the Carl Vinson Institute of Government - ITOS

○10 schemas indexed

○More than 500 columns indexed, including location information

○Approximately 200,000 documents created

○More than 15,000,000 data items indexed for each document

○Information related to: government buildings, clinics, hospitals, fire stations, teen centers, service facilities, shelters, child support offices, historical resources, and archaeological sites

DEMO

●Hadoop HDFS pseudo-distributed implementation

●HDFS mounted through FUSE

●Solr instance configuration

●Solr client web application

Conclusions

●HDFS offers highly available storage for the index documents

●Solr offers a lightweight solution for implementing a powerful search engine

●Solr is a "cheap" solution for implementing a basic geospatial search engine

●Solr's RESTful API makes it easy to integrate with any enterprise system
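As an illustration of the last two points, a basic radius search in Solr is an ordinary HTTP request that uses the `geofilt` filter together with the `sfield`, `pt`, and `d` parameters. This minimal sketch builds such a request; the core URL and the spatial field name "location" are hypothetical.

```python
from urllib.parse import urlencode


def geofilt_url(core_url, sfield, lat, lon, radius_km, q="*:*", rows=10):
    """Build a /select URL that keeps documents within radius_km of a point."""
    params = {
        "q": q,
        "fq": "{!geofilt}",       # circular distance filter
        "sfield": sfield,         # spatial field holding "lat,lon" values
        "pt": f"{lat},{lon}",     # center point
        "d": radius_km,           # radius in kilometres
        "sort": "geodist() asc",  # nearest documents first
        "rows": rows,
        "wt": "json",
    }
    return f"{core_url}/select?{urlencode(params)}"


geo_url = geofilt_url(
    "http://localhost:8983/solr/facilities",  # hypothetical core URL
    sfield="location",                        # hypothetical field name
    lat=33.96,
    lon=-83.37,
    radius_km=10,
)
print(geo_url)
```

Any client able to issue an HTTP GET and parse JSON can consume the response, which is what keeps integration with other systems simple.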

Questions?