![Page 1: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/1.jpg)
Introduction of Search Engine
By Jinglun
![Page 2: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/2.jpg)
Agenda
• Overview
• Spider
• Analysis
• Index
• Search
![Page 3: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/3.jpg)
Search Engine
![Page 4: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/4.jpg)
Architecture
![Page 5: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/5.jpg)
Spider
![Page 6: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/6.jpg)
Web Characteristics
• Bowtie
• Diameter of WWW
• Volatile
• Semi-Structured
![Page 7: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/7.jpg)
Problems & Solutions
• Depth-First Traversal vs Breadth-First Traversal
• Duplication
• Optimize download
• Page Storage
• Concurrency Control and robots.txt
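The traversal-order and duplication bullets above can be sketched with a queue-based frontier and a set of already-seen URLs. This is a minimal illustration, not the crawler described in the slides: `crawl_bfs`, `get_links`, and the toy link graph are hypothetical names, and fetching, storage, politeness, and robots.txt handling are left out.

```python
from collections import deque

def crawl_bfs(seeds, get_links, max_pages=100):
    """Visit pages breadth-first, never enqueueing the same URL twice.

    get_links is a caller-supplied function (an assumption of this sketch)
    that returns the outgoing links of a URL.
    """
    frontier = deque(seeds)
    seen = set(seeds)          # duplicate elimination by URL
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()   # frontier.pop() would give depth-first order
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Toy link graph standing in for the web.
graph = {"a": ["b", "c"], "b": ["d", "a"], "c": ["d"], "d": []}
visited = crawl_bfs(["a"], lambda u: graph.get(u, []))
print(visited)  # ['a', 'b', 'c', 'd']
```

Breadth-first order tends to reach pages close to the seeds first, which is one common reason crawlers prefer it over depth-first traversal.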
![Page 8: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/8.jpg)
Architecture
![Page 9: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/9.jpg)
Analysis
• Structuring
• Replica Detection
• Word Segmentation
• PageRank
![Page 10: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/10.jpg)
Structuring
![Page 11: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/11.jpg)
Replica Detection
![Page 12: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/12.jpg)
Word Segmentation
• Forward Maximum Matching Method
• Backward Maximum Matching Method
• N-Gram
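A small sketch of the two maximum matching methods listed above, assuming a plain set of dictionary words. The function names, the toy vocabulary, and the fallback to single characters for unknown text are assumptions of this example, not details from the slides.

```python
def forward_max_match(text, vocabulary, max_len=None):
    """Scan left to right, taking the longest dictionary word at each position."""
    if max_len is None:
        max_len = max((len(w) for w in vocabulary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens

def backward_max_match(text, vocabulary, max_len=None):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    if max_len is None:
        max_len = max((len(w) for w in vocabulary), default=1)
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                j -= size
                break
    tokens.reverse()
    return tokens

vocab = {"研究", "研究生", "生命", "的", "起源"}
print(forward_max_match("研究生命的起源", vocab))   # ['研究生', '命', '的', '起源']
print(backward_max_match("研究生命的起源", vocab))  # ['研究', '生命', '的', '起源']
```

The two directions can disagree on the same input, as here, which is why they are often run together and compared.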
![Page 13: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/13.jpg)
PageRank
![Page 14: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/14.jpg)
Architecture
![Page 15: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/15.jpg)
Index
![Page 16: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/16.jpg)
Data Structure
![Page 17: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/17.jpg)
Build Index
![Page 18: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/18.jpg)
Delta Encoding
• a, b, c => a, b – a, c – b
• Case1: docid
0, 2, 3, 5, 6, 10 … =>
0, 2, 1, 2, 1, 4 …
• Case2: term
apple, applet, application, banana … =>
<0, apple>, <5, t>, <4, ication>, <0, banana> …
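Both cases can be reproduced in a few lines. The sketch below is an illustration with hypothetical function names. Case 1 stores the gap to the previous docid, and Case 2 stores the length of the prefix shared with the previous term plus the remaining suffix. Note that "apple" and "applet" share five characters, so the pair for "applet" is <5, t>.

```python
import os

def delta_encode(doc_ids):
    """Replace each sorted docid with its gap from the previous one."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    """Rebuild docids by accumulating the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def front_encode(terms):
    """Encode sorted terms as (shared prefix length, suffix) pairs."""
    pairs, prev = [], ""
    for term in terms:
        shared = len(os.path.commonprefix([prev, term]))
        pairs.append((shared, term[shared:]))
        prev = term
    return pairs

def front_decode(pairs):
    """Rebuild terms from (shared prefix length, suffix) pairs."""
    terms, prev = [], ""
    for shared, suffix in pairs:
        prev = prev[:shared] + suffix
        terms.append(prev)
    return terms

print(delta_encode([0, 2, 3, 5, 6, 10]))
# [0, 2, 1, 2, 1, 4]
print(front_encode(["apple", "applet", "application", "banana"]))
# [(0, 'apple'), (5, 't'), (4, 'ication'), (0, 'banana')]
```

Small gaps and short suffixes are what make the variable-length integer encoding on the next slide effective.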
![Page 19: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/19.jpg)
Vint
• Highest bit of a byte is ‘1’: more bytes follow; highest bit is ‘0’: last byte
• 0~127: 1 byte, 0xxxxxxx
• 128~2^14-1: 2 bytes, xxxxxxxyyyyyyy => 1xxxxxxx, 0yyyyyyy
• 2^14~2^21-1: 3 bytes
• 2^21~2^28-1: 4 bytes
• 2^28~2^32-1: 5 bytes
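A minimal encoder and decoder for this scheme, following the byte layout in the two-byte example above (high-order 7-bit group first, top bit set on every byte except the last). The function names are hypothetical, and some real implementations write the low-order group first instead, so treat the byte order here as an assumption taken from the slide.

```python
def vint_encode(n):
    """Encode a non-negative integer as 7-bit groups with a continuation bit."""
    if n < 0:
        raise ValueError("vint_encode expects a non-negative integer")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()  # high-order group first, as in the slide's example
    # Set the top bit on every byte except the last one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def vint_decode(data, pos=0):
    """Decode one integer starting at pos; return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:  # top bit clear: this was the last byte
            return value, pos

print(vint_encode(127).hex())       # 7f
print(vint_encode(128).hex())       # 8100
print(len(vint_encode(2**32 - 1)))  # 5
```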
![Page 20: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/20.jpg)
Build Index for 5GB of Text
![Page 21: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/21.jpg)
Search
![Page 22: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/22.jpg)
Classic Models
• Boolean Model
• Vector Space Model
• Probabilistic Model
![Page 23: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/23.jpg)
Workflow
![Page 24: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/24.jpg)
Optimize Merging
• Skip list
• Binary search
• Multiway merging
• Map/Hash Map
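Two of the techniques above can be sketched on sorted docid lists: binary search to intersect a short posting list with a long one (an AND query), and a heap-based multiway merge to union several lists (an OR query). The function names are hypothetical, and the sketch assumes each list is sorted and free of duplicates.

```python
import heapq
from bisect import bisect_left

def intersect(a, b):
    """Intersect two sorted docid lists by binary-searching the longer one."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    result, lo = [], 0
    for doc_id in short:
        # Search only the part of the long list not yet passed.
        lo = bisect_left(long_, doc_id, lo)
        if lo == len(long_):
            break
        if long_[lo] == doc_id:
            result.append(doc_id)
            lo += 1
    return result

def union(*posting_lists):
    """Merge any number of sorted docid lists, dropping repeated docids."""
    merged = []
    for doc_id in heapq.merge(*posting_lists):
        if not merged or merged[-1] != doc_id:
            merged.append(doc_id)
    return merged

print(intersect([2, 5, 10], [0, 2, 3, 5, 6, 10]))  # [2, 5, 10]
print(union([0, 3, 6], [2, 3], [5, 6, 10]))        # [0, 2, 3, 5, 6, 10]
```

Binary search pays off when one list is much shorter than the other; for lists of similar length a linear two-pointer merge is usually just as good.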
![Page 25: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/25.jpg)
Automatic Summarization
![Page 26: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/26.jpg)
Thanks!
![Page 27: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/27.jpg)
Backup
![Page 28: Introduction of search engine](https://reader034.vdocument.in/reader034/viewer/2022052413/55990dd71a28ab002c8b4583/html5/thumbnails/28.jpg)
Bowtie Websites