Introduction of Search Engine
By Jinglun
Agenda
• Overview
• Spider
• Analysis
• Index
• Search
Search Engine
Architecture
Spider
Web Characteristics
• Bowtie
• Diameter of WWW
• Volatile
• Semi-Structured
Problems & Solutions
• Depth-First Traversal vs Breadth-First Traversal
• Duplication
• Optimize download
• Page Storage
• Concurrency Control and robots.txt
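The breadth-first traversal and duplicate-elimination points above can be sketched as follows. This is a minimal illustration, not the crawler described in the slides: `bfs_crawl`, `fetch_links` and the toy link graph `TOY_WEB` are hypothetical names, and a real spider would fetch pages over the network and respect robots.txt before each download.

```python
from collections import deque


def bfs_crawl(seed, fetch_links, max_pages=100):
    """Visit pages breadth-first, skipping URLs that were already seen."""
    seen = {seed}              # duplicate filter: every URL ever queued
    frontier = deque([seed])   # FIFO queue gives breadth-first order
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Toy link graph standing in for the web (hypothetical page names).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d", "a"],
    "c": ["d"],
    "d": [],
}

crawl_order = bfs_crawl("a", lambda url: TOY_WEB.get(url, []))
print(crawl_order)
```

Replacing `popleft()` with `pop()` would turn the same loop into a depth-first traversal, which is the trade-off the first bullet refers to.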
Architecture
Analysis
• Structuring
• Replica Detection
• Word Segmentation
• PageRank
Structuring
Replica Detection
Word Segmentation
• Forward Maximum Matching Method
• Backward Maximum Matching Method
• N-Gram
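The two maximum matching methods can be illustrated with a small dictionary-based sketch. The function names, the `max_len` parameter and the sample dictionary are assumptions made for this example; unknown characters fall back to single-character tokens.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                j -= size
                break
    tokens.reverse()
    return tokens


# Hypothetical dictionary for the illustration.
WORDS = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"
print(forward_max_match(sentence, WORDS))
print(backward_max_match(sentence, WORDS))
```

On this sentence the forward scan greedily takes "研究生" and strands "命", while the backward scan yields "研究 / 生命 / 起源", which shows why the two directions are often compared.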
PageRank
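A minimal power-iteration sketch of the ranking idea, assuming the commonly used damping factor of 0.85 and a tiny hypothetical link graph. The function name `pagerank`, its parameters and the graph are illustrative, not taken from the slides; every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph: a -> b, c; b -> c; c -> a.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(GRAPH)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page "c" ends up ranked highest here because it is linked from both other pages.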
Architecture
Index
Data Structure
Build Index
Delta Encoding
• a, b, c => a, b – a, c – b
• Case1: docid
0, 2, 3, 5, 6, 10 … =>
0, 2, 1, 2, 1, 4 …
• Case2: term
apple, applet, application, banana … =>
<0, apple>, <5, t>, <4, ication>,
<0, banana> …
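Both cases can be sketched in a few lines. The function names are hypothetical; the first stores gaps between sorted docids, the second stores each term as the length of the prefix it shares with the previous term plus the remaining suffix.

```python
def delta_encode(doc_ids):
    """Replace each sorted docid with its gap from the previous one."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def delta_decode(gaps):
    """Rebuild the docids by accumulating the gaps."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def front_encode(terms):
    """Encode sorted terms as (shared prefix length, suffix) pairs."""
    encoded = []
    previous = ""
    for term in terms:
        shared = 0
        limit = min(len(previous), len(term))
        while shared < limit and previous[shared] == term[shared]:
            shared += 1
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def front_decode(encoded):
    """Rebuild each term from the previous term's prefix and the suffix."""
    terms = []
    previous = ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        terms.append(previous)
    return terms


print(delta_encode([0, 2, 3, 5, 6, 10]))
print(front_encode(["apple", "applet", "application", "banana"]))
```

Small gaps and short suffixes are what make the variable-length integer encoding that follows effective.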
Vint
• Highest bit is '1': more bytes follow; highest bit is '0': last byte
• 0~127: 1 byte, 0xxxxxxx
• 128~2^14-1: 2 bytes
xxxxxxxyyyyyyy => 1xxxxxxx, 0yyyyyyy
• 2^14~2^21-1: 3 bytes
• 2^21~2^28-1: 4 bytes
• 2^28~2^32-1: 5 bytes
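A minimal sketch of this variable-length encoding, assuming the byte order shown in the two-byte example (most significant 7-bit group first) and a continuation flag in the highest bit of every byte except the last. The names `vint_encode` and `vint_decode` are hypothetical; real index formats may order the groups differently.

```python
def vint_encode(value):
    """Encode a non-negative integer in 7-bit groups, high group first."""
    if value < 0:
        raise ValueError("vint_encode expects a non-negative integer")
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()
    # Set the highest bit on every byte except the final one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vint_decode(data):
    """Decode one integer from the start of a byte sequence."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:   # highest bit clear: this was the last byte
            break
    return value


for number in (5, 127, 128, 2**14, 2**32 - 1):
    print(number, len(vint_encode(number)))
```

The byte counts printed by the loop follow the ranges in the list above, so small delta-encoded gaps cost a single byte each.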
Build Index for 5GB of Text
Search
Classic Models
• Boolean Model
• Vector Space Model
• Probabilistic Model
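As an illustration of the vector space model, the sketch below scores documents by the cosine of the angle between raw term-frequency vectors. This is a simplification under stated assumptions: the function name and sample documents are hypothetical, and production systems usually weight terms (for example with TF-IDF) rather than using raw counts.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between two term-frequency vectors."""
    query_vec = Counter(query_terms)
    doc_vec = Counter(doc_terms)
    dot = sum(query_vec[term] * doc_vec[term] for term in query_vec)
    query_norm = math.sqrt(sum(v * v for v in query_vec.values()))
    doc_norm = math.sqrt(sum(v * v for v in doc_vec.values()))
    if query_norm == 0 or doc_norm == 0:
        return 0.0
    return dot / (query_norm * doc_norm)


# Hypothetical tokenized documents.
DOCS = {
    "d1": "search engine index search".split(),
    "d2": "web spider crawl".split(),
}
query = "search index".split()
scores = {name: cosine_similarity(query, terms) for name, terms in DOCS.items()}
print(sorted(scores, key=scores.get, reverse=True))
```

In contrast, the Boolean model would only report whether a document contains the query terms, without producing a ranking.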
Work Flow
Optimize Merging
• Skip list
• Binary search
• Multiway merging
• Map/Hash Map
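Two of these options can be sketched for the common case of intersecting sorted posting lists (an AND query). The function names are hypothetical: the first is a plain two-pointer merge, the second looks up each docid of the shorter list in the longer one with binary search.

```python
from bisect import bisect_left


def intersect_merge(left, right):
    """Two-pointer intersection of two sorted docid lists."""
    i = j = 0
    result = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


def intersect_binary(short_list, long_list):
    """Probe the long list with binary search for each short-list docid."""
    result = []
    lo = 0
    for doc_id in short_list:
        # Both lists are sorted, so the search can resume from lo.
        lo = bisect_left(long_list, doc_id, lo)
        if lo == len(long_list):
            break
        if long_list[lo] == doc_id:
            result.append(doc_id)
    return result


postings_a = [1, 3, 5, 7, 9, 11]
postings_b = [3, 4, 5, 11, 20]
print(intersect_merge(postings_a, postings_b))
print(intersect_binary([3, 11], postings_a))
```

The merge costs time proportional to the sum of both lengths, while the binary-search variant tends to pay off when one list is much shorter than the other.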
Automatic Summary
Thanks!
Backup
Bowtie Websites