TRANSCRIPT
Building a Search Engine Using Apache Lucene/Solr
Road Map
• Problem Definition
• A Basic Search Engine Pipeline
• Meet Lucene
• Lucene API Examples
• Lucene Wrappers (Apache Solr, ElasticSearch, Regain, etc.)
• Applied Lucene (Real Examples)
Problem Definition
You have a farm of data, and you want it to be searchable.
Analogy: searching for a needle in a haystack while more hay keeps being added to the stack!
• SQL database cons (> 500,000,000 records …)
• Scalability
• Decentralization
A Basic Search Engine Pipeline
• Crawling: grabbing the data
• Parsing [optional]: understanding the data
• Indexing: building the holding structure
• Ranking: sorting the data
• Searching: reading that holding structure
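To make the pipeline concrete, here is a minimal, self-contained sketch in plain Java (no Lucene; all names are illustrative, and ranking is omitted): it "crawls" an in-memory list of documents, tokenizes them, builds a tiny inverted index, and answers a one-term query.

```java
import java.util.*;

public class MiniPipeline {
    // Indexing: map each token to the sorted set of doc ids containing it.
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Parsing/analysis: lowercase and split on whitespace.
            for (String token : docs.get(docId).toLowerCase().split("\\s+")) {
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Crawling: a real engine would fetch documents; here they are in memory.
        List<String> docs = List.of(
                "The quick brown fox",
                "the lazy dogs",
                "a quick search engine");
        Map<String, Set<Integer>> index = buildIndex(docs);
        // Searching: read the holding structure.
        System.out.println(index.get("quick")); // prints [0, 2]
    }
}
```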
Behind the scenes: analysis, tokenization, query parsing, boosting, calculating term vectors, token filtration, index inversion, etc.
What is Lucene?
• Doug Cutting (Lucene 1999, Nutch 2003, Hadoop 2006)
• Free, Java information retrieval library
• Handles the application side: indexing and searching
• High performance, backed by a decade of research
• Heavily supported, simple to customize
• No dependencies
What Lucene Ain't
• A complete search engine
• An application
• A crawler
• A document filter/recognizer
Lucene Roles (diagram): your application gathers rich documents, parses them, and makes Lucene Documents to index; on the other side, a search UI in your search app (e.g. a webapp) searches that index. Lucene itself covers only the indexing and searching boxes.
Lucene Strength Points
• Simple API
• Speed
• Concurrency
• Smart indexing (incremental)
• Near-real-time search
• Vector space search
• Heavily used, supported
Lucene Query Types
• Single term vs. multi-term: +name:camel +type:animal
• Wildcard queries: text:wonder*
• Fuzzy queries: room~0.8
• Range queries: date:[25/5/2000 TO *]
• Grouped queries: text:animal AND small
• Proximity queries: "hamlet macbeth"~10
• Boosted queries: hamlet^5.0 AND macbeth
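A grouped query like "animal AND small" ultimately boils down to intersecting the postings lists of the two terms. A minimal illustrative sketch in plain Java (not Lucene's actual query execution; the postings data is made up):

```java
import java.util.*;

public class AndQuery {
    // AND = set intersection of the two terms' doc-id sets.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.retainAll(b);
        return out;
    }

    public static void main(String[] args) {
        // Assumed postings: term -> doc ids (illustrative data only).
        Map<String, Set<Integer>> postings = Map.of(
                "animal", Set.of(0, 2, 3),
                "small",  Set.of(2, 3, 5));
        System.out.println(and(postings.get("animal"), postings.get("small"))); // prints [2, 3]
    }
}
```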
API Sample I (Indexing)

private IndexWriter writer;

public Indexer(String indexDir) throws IOException {
    Directory dir = FSDirectory.open(new File(indexDir));
    writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_CURRENT),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
}

public void close() throws IOException {
    writer.close();
}

public void index(String dataDir, FileFilter filter) throws Exception {
    File[] files = new File(dataDir).listFiles(filter); // only index files accepted by the filter
    for (File f : files) {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        doc.add(new Field("filename", f.getName(),
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
    }
}
Indexing Pipeline (Simplified, diagram): a Document goes to the DocumentWriter, whose analysis chain (Tokenizer, then TokenFilters) produces tokens that are added to the inverted index.
Analysis Basic Types

"The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer:     [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer:       [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer:   [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

"XY&Z Corporation - [email protected]"
WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [[email protected]]
SimpleAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:       [xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:   [xy&z] [corporation] [[email protected]]
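The first three analyzers above can be approximated in a few lines of plain Java. This is an illustrative sketch, not Lucene's actual implementation; in particular, the stop-word list here is a tiny assumed subset of Lucene's default English list.

```java
import java.util.*;
import java.util.stream.*;

public class AnalyzerSketch {
    // A tiny assumed subset of Lucene's default English stop words.
    static final Set<String> STOP = Set.of("the", "a", "an", "and", "of", "to");

    // WhitespaceAnalyzer-like: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // SimpleAnalyzer-like: split on non-letters, lowercase everything.
    static List<String> simple(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // StopAnalyzer-like: SimpleAnalyzer plus stop-word removal.
    static List<String> stop(String text) {
        return simple(text).stream()
                .filter(t -> !STOP.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String s = "The quick brown fox jumped over the lazy dogs";
        System.out.println(whitespace(s)); // keeps "The" and "the"
        System.out.println(simple(s));     // all lowercase
        System.out.println(stop(s));       // "the" removed
    }
}
```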
The Inverted Index (In a nutshell)
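In a nutshell: for every term, the index stores a postings list recording which documents contain the term and at which positions (positions are what make proximity queries possible). A minimal illustrative sketch in plain Java (the names are mine, not Lucene's):

```java
import java.util.*;

public class InvertedIndex {
    // term -> (docId -> positions of that term inside the doc)
    final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Tokenize on whitespace and record each token's position.
    void add(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Term lookup: which docs contain this term?
    Set<Integer> docsContaining(String term) {
        return postings.getOrDefault(term.toLowerCase(), Map.of()).keySet();
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(0, "to be or not to be");
        idx.add(1, "to search is to find");
        System.out.println(idx.postings.get("to"));   // prints {0=[0, 4], 1=[0, 3]}
        System.out.println(idx.docsContaining("be")); // prints [0]
    }
}
```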
API Sample II (Searching)

public void search(String indexDir, String q) throws IOException, ParseException {
    Directory dir = FSDirectory.open(new File(indexDir));
    IndexSearcher is = new IndexSearcher(dir, true);

    QueryParser parser = new QueryParser("contents",
        new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse(q);
    TopDocs hits = is.search(query, 10);
    System.err.println("Found " + hits.totalHits + " document(s)");

    for (int i = 0; i < hits.scoreDocs.length; i++) {
        ScoreDoc scoreDoc = hits.scoreDocs[i];
        Document doc = is.doc(scoreDoc.doc);
        System.out.println(doc.get("filename"));
    }

    is.close();
}
Index Update
• Lucene doesn't have an in-place update mechanism. So?
• Incremental indexing (index merging)
• Delete + Add = Update
• Index optimization
API Sample III (Deleting)

Via IndexReader:
  void deleteDocument(int docNum)
    Deletes the document numbered docNum.
  int deleteDocuments(Term term)
    Deletes all documents that have the given term indexed.

Via IndexWriter:
  void deleteAll()
    Deletes all documents in the index.
  void deleteDocuments(Query query)
    Deletes the document(s) matching the provided query.
  void deleteDocuments(Query[] queries)
    Deletes the document(s) matching any of the provided queries.
  void deleteDocuments(Term term)
    Deletes the document(s) containing the term.
  void deleteDocuments(Term[] terms)
    Deletes the document(s) containing any of the terms.
Some Statistics
• Dependent on Lucene.NET (a .NET port of Lucene)

Local testing (index and search on the same device):

Dataset Size | Indexing (min., index size) | Retrieval (ms)  | Opt. (min.)
4.3 GB       | ~32, 180 MB                 | ~50 -> 300      | 0.2
40 GB        | ~360, 2.6 GB                | ~100 -> 3000    | 3.2

Over-network testing (file server holds the index; standalone searching workstations):

Dataset Size | Indexing (min., index size) | Retrieval (ms)  | Opt. (min.)
4.3 GB       | X, 180 MB                   | ~300 -> 700     | X
40 GB        | X, 2.6 GB                   | ~400 -> 4500    | X
Lucene Wrappers (Apache Solr)
• A Java wrapper over Lucene
• A web application that can be deployed on any servlet container (Apache Tomcat, Jetty)
• A REST service
• Has an administration interface
• Built-in integration with Apache Tika (a repository of parsers)
• Scalable
• Integration with Apache Hadoop, Apache Cassandra
Solr Administration Interface
Solr Architecture (The Big Picture)
Note: responses can be JSON, PHP, Python, etc., not only XML.
Communication with Solr (Sending Docs)
• Direct connection OR through APIs (SolrJ, SolrNET)

// make a connection to the Solr server
SolrServer server = new HttpSolrServer("http://localhost:8080/solr/");
// prepare a doc
final SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", 1);
doc1.addField("firstName", "First Name");
doc1.addField("lastName", "Last Name");
final Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc1);
// add docs to Solr
server.add(docs);
server.commit();
Communication with Solr (Searching)

final SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.addSortField("firstName", SolrQuery.ORDER.asc);
final QueryResponse rsp = server.query(query);
final SolrDocumentList solrDocumentList = rsp.getResults();
for (final SolrDocument doc : solrDocumentList) {
    final String firstName = (String) doc.getFieldValue("firstName");
    final String id = (String) doc.getFieldValue("id");
}
Some Statistics

Note 1: We're sending HTTP POST requests to the Solr server; that can add a lot of overhead compared with the pure Lucene.NET model.
Note 2: On a server receiving requests from everywhere, OS-level queuing can add some delay depending on the queuing strategy.

Dataset Size | Indexing (min., index size) | Retrieval (ms)  | Opt. (min.)
4.3 GB       | ~39.5, 169 MB               | ~300 -> 3000    | 0.203
40 GB        | ~400 (not accurate), 40 GB  | ~300 -> 10000   | ~7 (not accurate)
Lucene/Solr Users
• Instagram (geo-search API)
• Netflix (generic search feature)
• SourceForge (generic search feature)
• Eclipse (documentation search)
• LinkedIn (recently, job search)
• Krugle (source-code search)
• Wikipedia (recently, generic content search)
References
• Lucene in Action, 2nd Edition (Manning)
• Lucene main website
• Another presentation on SlideShare
Thank You