Design a full-text search engine for a website based on Lucene
Presented by: Lijia Li, Yingyu Wu, Xiao Zhu
Outline
• Introduction
• Our goal
• System architecture
• Conclusion and future work
• Demo
Introduction
• As the network has developed, the amount of information on the Internet has grown explosively, which makes the target information harder to find. Search engines bring great convenience to people looking for information, and the Internet has become an indispensable tool.
Our goal
• In this project, our goal is to implement a full-text retrieval engine based on Lucene.
Full-text retrieval engine
• A full-text search engine uses full-text retrieval technology: it indexes the entire text of each document and searches that index.
• Features:
  (1) Unstructured index file database
  (2) Flexible retrieval methods
  (3) Support for natural-language retrieval
  (4) Retrieval efficiency
System Architecture
• The search engine provides a search service to users. It has two main parts: online and offline.
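[Figure: system architecture. Online part: Users enter a keyword in the User Interface, which sends a request to the Search module (analyzer, result sorting); the Search module searches the Index File. Offline part: the crawler collects webpages from the website into the Website database, and the Index module builds the Index File from it.]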
Why Lucene
• Index file format independent of the application platform
• Inverted index
• Object-oriented system architecture
• Chinese analyzers (SmartChineseAnalyzer, IKAnalyzer)
• A powerful set of query implementations (RangeQuery, FuzzyQuery, ...)
• Open source
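The inverted index named in the list above can be illustrated with a minimal Java sketch. It is not Lucene's implementation: the class name `InvertedIndexSketch`, the simple tokenizer, and the assumption that documents are added in increasing id order are all illustrative choices.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index (hypothetical, not Lucene's): maps each term to the ids of the documents containing it. */
final class InvertedIndexSketch {
    // term -> ascending list of document ids (a "posting list")
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /** Splits the text on anything that is not a letter or digit and records term -> docId. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Assumption: documents arrive in increasing id order, so checking
            // the last entry is enough to avoid duplicate ids.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    /** Returns the ids of the documents that contain the term (empty if none). */
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<Integer>emptyList());
    }
}
```

Looking up a term is a single map access, independent of how many documents were indexed, which is the property that makes this structure attractive for search.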
Web Crawler
[Figure: architecture of web crawler. Components: collection of start URLs, URL analysis, get robots.txt, analyze robots.txt, unprocessed URL queue, URL page fetch module, Internet, page database, page analysis module, extract links.]
Work flow of web crawler
1. Put the initial URLs into the unprocessed URL queue.
2. Take a URL from the head of the queue.
3. Download the page at that URL.
4. Extract the hyperlinks from the downloaded page.
5. Add the extracted hyperlinks to the unprocessed URL queue.
6. Check whether the unprocessed URL queue is empty: if yes, the program terminates; otherwise loop back to step 2.
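The crawler work flow above can be sketched as a queue-driven loop in Java. To keep the sketch self-contained, the network is replaced by a hypothetical in-memory map from each URL to the links on its page. The `seen` set is an added assumption not stated in the slides; without it, pages that link to each other would be queued forever.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

final class CrawlLoopSketch {
    /**
     * Runs the crawl loop over a hypothetical in-memory "web"
     * (URL -> links found on that page) and returns the URLs in fetch order.
     */
    static List<String> crawl(List<String> seeds, Map<String, List<String>> web) {
        Queue<String> unprocessed = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();           // assumption: skip URLs already queued
        for (String seed : seeds) {                   // step 1: enqueue the initial URLs
            if (seen.add(seed)) {
                unprocessed.add(seed);
            }
        }
        List<String> fetched = new ArrayList<>();
        while (!unprocessed.isEmpty()) {              // step 6: stop when the queue is empty
            String url = unprocessed.remove();        // step 2: take the head of the queue
            List<String> links = web.get(url);        // step 3: stand-in for downloading the page
            if (links == null) {
                continue;                             // page could not be fetched
            }
            fetched.add(url);
            for (String link : links) {               // step 4: extract the hyperlinks
                if (seen.add(link)) {
                    unprocessed.add(link);            // step 5: enqueue links not seen before
                }
            }
        }
        return fetched;
    }
}
```

Because the queue is first-in, first-out, pages are visited in breadth-first order from the start URLs.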
Index
[Figure: indexing work flow. Steps shown: take a set of documents to be indexed; check whether each document is already indexed and, if so, whether the date of the index is earlier than the creation date; determine the type of the document; check whether a parser for the same type exists; call the corresponding document parser to parse the document; read and analyze the document; build the index file.]
Document indexing steps
1. Create an IndexWriter instance
// create: true builds a new index, false appends to an existing one
IndexWriter writer = new IndexWriter(indexPath, analyzer, create, maxFieldLength);
2. Create a Document record
Document doc = new Document();
3. Add a Field object to the Document record
doc.add(new Field(name, tokenStream));
4. Write the Document record to the index
writer.addDocument(doc);
5. Close the IndexWriter object to end indexing
writer.close();
Flow chart of searching
1. Accept the search string from the user.
2. QueryParser analyzes the search string and outputs a Query object.
3. Set up the Searcher.
4. The IndexSearcher object searches the index file for related documents.
5. Output the related documents.
Example: user input: “大连理工 计算机” (Dalian University of Technology, computer), “america ohio”. After QueryParser: “大连理工” AND “计算机”, “america” AND “ohio”.
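The AND semantics of the search flow can be illustrated without Lucene. The sketch below is a stand-in for QueryParser and IndexSearcher, not their actual behavior: it splits the search string on whitespace and intersects the document-id sets of the terms. The class name `AndQuerySketch` and the term-to-ids map are hypothetical, and the index keys are assumed to be lower-cased already.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class AndQuerySketch {
    /** Returns the ids of the documents that contain every whitespace-separated term of the query. */
    static Set<Integer> search(String query, Map<String, Set<Integer>> index) {
        Set<Integer> result = null;
        for (String term : query.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;                              // blank query yields no terms
            }
            // Assumption: index keys are lower-cased, so the query term is normalized the same way.
            Set<Integer> docs = index.getOrDefault(
                    term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);          // first term: start from its documents
            } else {
                result.retainAll(docs);                // AND: keep only documents that also match
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }
}
```

With an index where "america" maps to documents {1, 2, 3} and "ohio" maps to {2, 3, 5}, the query "america ohio" returns {2, 3}.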
Highlight search keywords
1. Get the position of the search keyword in the text.
2. Extract the text fragment around the keyword, based on that position.
3. Use HTML and CSS attributes to highlight the keyword.
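The three highlighting steps above can be sketched in plain Java. This is not the project's code or Lucene's highlighter: the class name `HighlightSketch`, the `context` parameter (characters kept on each side of the match), and the CSS class name `hit` are assumptions for illustration, and only the first match is handled.

```java
final class HighlightSketch {
    /** Escapes the characters that would otherwise be interpreted as HTML markup. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Returns an HTML fragment around the first case-insensitive match of the keyword,
     * with the match wrapped in a span, or null if the keyword does not occur.
     */
    static String highlight(String text, String keyword, int context) {
        if (keyword.isEmpty()) {
            return null;
        }
        // Step 1: find the position of the keyword.
        int pos = -1;
        for (int i = 0; i + keyword.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, keyword, 0, keyword.length())) {
                pos = i;
                break;
            }
        }
        if (pos < 0) {
            return null;
        }
        // Step 2: cut the fragment around that position.
        int end = pos + keyword.length();
        int from = Math.max(0, pos - context);
        int to = Math.min(text.length(), end + context);
        // Step 3: wrap the keyword in HTML; the "hit" class is styled with CSS elsewhere.
        return escape(text.substring(from, pos))
                + "<span class=\"hit\">" + escape(text.substring(pos, end)) + "</span>"
                + escape(text.substring(end, to));
    }
}
```

Escaping the fragment before adding the span keeps markup in the page text from being rendered as part of the result snippet.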
Conclusion and future work
• Through this project we learned how to use a web crawler and Lucene to implement a full-text search engine.
• Future work: running on Hadoop
•Thank you!