ads final project
Information Retrieval Theory: A case study involving Apache Lucene + Solr: A distributed Search Engine
By Alok Dhamanaskar
Manuel Correa
Outline
●Problem Description
●About Lucene, Solr, and Hadoop HDFS
●Solution: Implementation
●Data tested
●Demo
●Conclusions
●Questions
Problem Description
●Search large data sets spanning thousands of documents, databases, etc. with a simple query
●Plain SQL does not offer ranked full-text search across multiple fields, or other data mining operations
●Data might include geospatial data, requiring operations such as searching by distance and buffers
●Also, most data-intensive applications demand high availability and persistence
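To make "searching by distance and buffers" concrete, here is a minimal sketch that treats a buffer as a circle around a center point and filters points by great-circle (haversine) distance. It assumes a spherical Earth; the names haversine_km and within_buffer are illustrative only and do not come from Lucene or Solr.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_buffer(center, points, radius_km):
    """Keep only the (lat, lon) points inside a circular buffer around center."""
    return [
        p for p in points
        if haversine_km(center[0], center[1], p[0], p[1]) <= radius_km
    ]
```

A search engine with spatial support would normally evaluate such a filter against its index instead of scanning every point as this sketch does.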
About Lucene, Solr, and Hadoop HDFS
●Lucene: Java indexing and search engine
  o Ranked searching
  o Query types: phrase, wildcard, proximity
  o Searching by document field
  o Sorting
●Solr:
  o Web application that interacts with the Lucene engine
  o RESTful interfaces for searching, indexing, deleting, etc.
  o Extends Lucene: geospatial search, schema integration, monitoring, index sharding
●Hadoop HDFS
  o Distributed file system
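The query types listed above map onto the Lucene query syntax, and Solr accepts such queries over HTTP. The sketch below builds phrase, wildcard, and proximity query strings and a select URL using the q, rows, sort, and wt parameters. The helper names, the field name, and the base URL are assumptions for illustration; they are not taken from the project.

```python
from urllib.parse import urlencode


def phrase(field, text):
    """Phrase query: the words must appear together, in order."""
    return f'{field}:"{text}"'


def wildcard(field, prefix):
    """Wildcard query: match any term starting with prefix."""
    return f"{field}:{prefix}*"


def proximity(field, text, slop):
    """Proximity query: the words must occur within slop positions of each other."""
    return f'{field}:"{text}"~{slop}'


def select_url(base_url, query, rows=10, sort=None):
    """Build a URL for Solr's select handler, asking for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    if sort:
        params["sort"] = sort
    return f"{base_url}/select?{urlencode(params)}"
```

For instance, select_url("http://localhost:8983/solr", proximity("name", "fire station", 5)) would request documents whose hypothetical name field contains "fire" and "station" within five positions of each other.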
Project Implementation
●Implementation of Hadoop (Cloudera version)
●Integration with the FS through FUSE
●Setup of Solr instances
●Data manipulation in DB (Oracle and SQL Server)
●Data indexing from database to Solr
●Distributed Search implementation in Solr
●Solr Client Web Application development
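As a rough illustration of the distributed search step: Solr lets one instance fan a query out to several index shards through the shards request parameter, then merges the ranked results. The sketch below builds such a request and mimics the merge on hand-made hit lists. The host names and helper names are hypothetical, and the merge is a simplified stand-in, not Solr's actual implementation.

```python
import heapq
from itertools import islice
from urllib.parse import urlencode


def distributed_select_url(coordinator_url, shard_hosts, query, rows=10):
    """Build a select URL that asks the coordinator to query every listed shard."""
    params = {"q": query, "rows": rows, "shards": ",".join(shard_hosts)}
    return f"{coordinator_url}/select?{urlencode(params)}"


def merge_shard_hits(per_shard_hits, rows=10):
    """Merge hit lists (each already sorted by descending score) into one top-rows list."""
    merged = heapq.merge(*per_shard_hits, key=lambda hit: hit["score"], reverse=True)
    return list(islice(merged, rows))
```

Each entry in shard_hosts is given as host:port/path without a scheme, for example "localhost:8983/solr".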
Solution: Implementation
Data tested
●Public Data from the Carl Vinson Institute of Government - ITOS
  o 10 schemas indexed
  o More than 500 columns indexed, including location information
  o Approximately 200,000 documents created
  o More than 15,000,000 data items indexed for each document
  o Information related to: Government buildings, clinics, hospitals, fire stations, teen centers, service facilities, shelters, child support offices, historical resources, and archaeological sites
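To illustrate how a database row with location information might become a Solr document, here is a minimal sketch. The column names (id, lat, lon), the schema name, and the location field are assumptions, not the project's actual schema. The "lat,lon" string follows the format Solr's point-based spatial field types accept, and the JSON array payload assumes a Solr version whose update handler takes a list of documents.

```python
import json


def row_to_doc(schema, row):
    """Turn one database row (a dict) into a flat Solr document."""
    # Prefix the id with the schema name so ids stay unique across schemas.
    doc = {"id": f"{schema}-{row['id']}", "schema": schema}
    for column, value in row.items():
        if column in ("id", "lat", "lon") or value is None:
            continue  # skip columns handled separately and empty values
        doc[column] = value
    if row.get("lat") is not None and row.get("lon") is not None:
        doc["location"] = f"{row['lat']},{row['lon']}"
    return doc


def update_payload(docs):
    """Serialize documents as a JSON array for an HTTP POST to the update handler."""
    return json.dumps(docs)
```

The payload would still need to be posted to the Solr instance and followed by a commit before the documents become searchable.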
DEMO
●Hadoop HDFS pseudo-distributed implementation
●HDFS mountable with Fuse
●Solr instances configuration
●Solr Client Web application
Conclusions
●HDFS offers high availability to store index documents
●Solr offers a lightweight solution for implementing a powerful search engine
●Solr is a "cheap" solution for implementing a basic geospatial search engine
●Solr's RESTful API makes it easy to integrate with any enterprise system
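As a small illustration of the integration point, a client only needs to parse the JSON that Solr returns. The sketch below pulls the hit count and documents out of a response body, which Solr nests under "response" as "numFound" and "docs". The function name is hypothetical, and any response passed to it in practice would come from a live Solr instance.

```python
import json


def extract_hits(response_text):
    """Return (total_matches, documents) from a Solr JSON response body."""
    body = json.loads(response_text)
    response = body.get("response", {})
    return response.get("numFound", 0), response.get("docs", [])
```

Because the function only needs the response text, any HTTP client in any enterprise stack can supply it.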
Questions?