Introduction of Search Engine By Jinglun

Upload: jinglun-li

Post on 05-Jul-2015

Page 1: Introduction of search engine

Introduction of Search Engine

By Jinglun

Page 2: Introduction of search engine

Agenda

• Overview

• Spider

• Analysis

• Index

• Search

Page 3: Introduction of search engine

Search Engine

Page 4: Introduction of search engine

Architecture

Page 5: Introduction of search engine

Spider

Page 6: Introduction of search engine

Web Characteristics

• Bowtie

• Diameter of WWW

• Volatile

• Semi-Structured

Page 7: Introduction of search engine

Problems & Solutions

• Depth-First Traversal vs Breadth-First Traversal

• Duplication

• Optimize download

• Page Storage

• Concurrency Control and robots.txt
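
A minimal sketch of three of the points above: breadth-first traversal, duplicate suppression with a "seen" set, and a robots.txt check. It assumes a hypothetical in-memory link graph (`LINKS`) in place of real downloads, and the names `crawl_bfs`, `ROBOTS_TXT` and `demo-spider` are illustrative only. Download optimization, page storage and concurrency are left out.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical link graph standing in for fetched pages: URL -> outgoing links.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/private/x"],
    "http://example.com/b": ["http://example.com/", "http://example.com/c"],
    "http://example.com/c": [],
    "http://example.com/private/x": [],
}

# Hypothetical robots.txt content for the site above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl_bfs(seed, links, robots_txt, agent="demo-spider"):
    """Return URLs in the order a breadth-first spider would visit them."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    seen = {seed}             # every URL ever queued, to avoid duplicates
    frontier = deque([seed])  # FIFO queue gives breadth-first order
    visited = []
    while frontier:
        url = frontier.popleft()
        if not robots.can_fetch(agent, url):
            continue          # respect Disallow rules
        visited.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return visited


if __name__ == "__main__":
    for url in crawl_bfs("http://example.com/", LINKS, ROBOTS_TXT):
        print(url)
```

Swapping `popleft()` for `pop()` would turn the queue into a stack and give a depth-first traversal instead.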

Page 8: Introduction of search engine

Architecture

Page 9: Introduction of search engine

Analysis

• Structuring

• Replica Detection

• Word Segmentation

• PageRank

Page 10: Introduction of search engine

Structuring

Page 11: Introduction of search engine

Replica Detection
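
The slide does not say which replica detection method is used. As one common possibility, the sketch below compares pages by the overlap of their word shingles (Jaccard similarity). The function names and the shingle size `k=3` are assumptions made for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    page1 = "search engines crawl the web and build an index"
    page2 = "search engines crawl the web and build a big index"
    # A score near 1.0 suggests the two pages are near-duplicates.
    print(jaccard(shingles(page1), shingles(page2)))
```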

Page 12: Introduction of search engine

Word Segmentation

• Forward Maximum Matching Method

• Backward Maximum Matching Method

• N-Gram
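
The two maximum matching methods listed above can be sketched as follows. Both greedily take the longest dictionary word, scanning from the left or from the right. The toy vocabulary and the unspaced English input are assumptions chosen so the two directions disagree; unknown single characters are emitted as-is.

```python
def forward_max_match(text, vocab, max_len=None):
    """Segment text left to right, always taking the longest dictionary word."""
    max_len = max_len or max(map(len, vocab))
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character
                out.append(piece)
                i += size
                break
    return out


def backward_max_match(text, vocab, max_len=None):
    """Segment text right to left, always taking the longest dictionary word."""
    max_len = max_len or max(map(len, vocab))
    out, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                out.append(piece)
                j -= size
                break
    out.reverse()
    return out


if __name__ == "__main__":
    vocab = {"the", "them", "men", "end", "in", "dine", "here"}
    print(forward_max_match("themendinehere", vocab))
    print(backward_max_match("themendinehere", vocab))
```

On this input the forward pass commits to "them" and leaves a stray "e", while the backward pass recovers "the / men / dine / here", which is why the two directions are often run together and compared.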

Page 13: Introduction of search engine

PageRank
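
A minimal sketch of PageRank by power iteration, assuming the usual damping factor of 0.85 and a fixed number of iterations. The graph, the function name and the handling of pages without out-links (their rank is spread over all pages) are illustrative choices, not taken from the slide.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank for a graph given as {page: [pages it links to]}."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # Dangling page: distribute its rank evenly over all pages.
                share = damping * rank[u] / n
                for v in nodes:
                    new[v] += share
        rank = new
    return rank


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```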

Page 14: Introduction of search engine

Architecture

Page 15: Introduction of search engine

Index

Page 16: Introduction of search engine

Data Structure

Page 17: Introduction of search engine

Build Index
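
The slide gives no detail on the index layout. Assuming the usual inverted index (term to posting list of document ids and positions), building it can be sketched as below. Whitespace tokenization and the name `build_index` are simplifying assumptions.

```python
from collections import defaultdict


def build_index(docs):
    """Build {term: [(doc_id, [positions]), ...]} with postings sorted by doc_id."""
    postings = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id].lower().split()):
            postings[term].setdefault(doc_id, []).append(pos)
    # Sort terms (dictionary order) and postings (doc_id order).
    return {term: sorted(by_doc.items()) for term, by_doc in sorted(postings.items())}


if __name__ == "__main__":
    docs = {0: "search engine index", 1: "web search", 2: "index the web"}
    for term, plist in build_index(docs).items():
        print(term, plist)
```

Keeping each posting list sorted by document id is what makes the gap encoding and list merging on the later slides possible.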

Page 18: Introduction of search engine

Delta Encode

• a, b, c => a, b – a, c – b

• Case1: docid

0, 2, 3, 5, 6, 10 … =>

0, 2, 1, 2, 1, 4 …

• Case2: term

apple, applet, application, banana … =>

<0, apple>, <5, t>, <4, ication>,

<0, banana> …
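
Both cases can be sketched in a few lines. For document ids, each value is replaced by its gap from the previous one; for sorted terms, each term is stored as the number of leading characters it shares with the previous term plus the remaining suffix. The function names are illustrative.

```python
def delta_encode(doc_ids):
    """Replace each sorted doc id by its gap from the previous id."""
    out, prev = [], 0
    for d in doc_ids:
        out.append(d - prev)
        prev = d
    return out


def delta_decode(gaps):
    """Rebuild doc ids from gaps by running sum."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out


def front_encode(terms):
    """Store each sorted term as (shared prefix length with previous term, suffix)."""
    out, prev = [], ""
    for term in terms:
        n = 0
        while n < min(len(prev), len(term)) and prev[n] == term[n]:
            n += 1
        out.append((n, term[n:]))
        prev = term
    return out


def front_decode(pairs):
    """Rebuild terms from (prefix length, suffix) pairs."""
    out, prev = [], ""
    for n, suffix in pairs:
        prev = prev[:n] + suffix
        out.append(prev)
    return out


if __name__ == "__main__":
    ids = [0, 2, 3, 5, 6, 10]
    terms = ["apple", "applet", "application", "banana"]
    print(delta_encode(ids))
    print(front_encode(terms))
    assert delta_decode(delta_encode(ids)) == ids
    assert front_decode(front_encode(terms)) == terms
```

Small gaps and short suffixes are what make the variable-length integer format worthwhile.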

Page 19: Introduction of search engine

Vint

• Highest bit is ‘1’: more bytes follow; highest bit is ‘0’: last byte

• 0~127: 1 byte, 0xxxxxxx

• 128~2^14-1: 2 bytes

xxxxxxxyyyyyyy => 1xxxxxxx, 0yyyyyyy

• 2^14~2^21-1: 3 bytes

• 2^21~2^28-1: 4 bytes

• 2^28~2^32-1: 5 bytes
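
A minimal sketch of the byte layout shown above: the value is cut into 7-bit groups, the high-order group is written first, and every byte except the last has its highest bit set. The byte order follows the slide; some widely used variable-length integer formats write the low-order group first instead. The function names are illustrative.

```python
def vint_encode(n):
    """Encode a non-negative integer as a variable-length byte string."""
    if n < 0:
        raise ValueError("vint_encode expects a non-negative integer")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()  # high-order 7-bit group first, as on the slide
    # Set the highest bit on every byte except the last one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vint_decode(data, pos=0):
    """Decode one integer starting at pos; return (value, next position)."""
    value = 0
    while True:
        b = data[pos]
        pos += 1
        value = (value << 7) | (b & 0x7F)
        if not b & 0x80:  # highest bit 0 marks the last byte
            return value, pos


if __name__ == "__main__":
    for n in (0, 127, 128, 2**14 - 1, 2**14, 2**32 - 1):
        encoded = vint_encode(n)
        assert vint_decode(encoded) == (n, len(encoded))
        print(n, len(encoded), encoded.hex())
```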

Page 20: Introduction of search engine

Build Index for 5GB of Text

Page 21: Introduction of search engine

Search

Page 22: Introduction of search engine

Classic Models

• Boolean Model

• Vector Space Model

• Probabilistic Model
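
To make one of these models concrete, here is a small sketch of the vector space model only: documents and the query become TF-IDF weighted term vectors, and documents are ranked by cosine similarity to the query. The weighting formula (`log(N/df) + 1`), the whitespace tokenizer and all names are assumptions; the slide does not specify them.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return ({doc_id: {term: weight}}, {term: idf}) for a {doc_id: text} mapping."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(tokenized)
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vecs = {d: {t: c * idf[t] for t, c in Counter(toks).items()}
            for d, toks in tokenized.items()}
    return vecs, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank(query, docs):
    """Return ids of matching documents, best match first."""
    vecs, idf = tf_idf_vectors(docs)
    q = {t: c * idf[t] for t, c in Counter(query.lower().split()).items() if t in idf}
    scored = [(cosine(q, vec), d) for d, vec in vecs.items()]
    return [d for s, d in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]


if __name__ == "__main__":
    docs = {1: "apple banana", 2: "banana banana cherry", 3: "cherry date"}
    print(rank("banana", docs))
```

By contrast, the Boolean model would only report which documents contain the query terms, without ordering them.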

Page 23: Introduction of search engine

Work Flow

Page 24: Introduction of search engine

Optimize Merging

• Skip list

• Binary search

• Multiway merging

• Map/Hash Map
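
As an illustration of the binary search and multiway merging ideas above, the sketch below intersects sorted posting lists (an AND query). It walks the shorter list and binary-searches the longer one, never searching behind the last match; for several lists it starts from the shortest. The function names are hypothetical.

```python
from bisect import bisect_left


def intersect(a, b):
    """Intersect two sorted doc-id lists using binary search on the longer one."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    out, lo = [], 0
    for doc_id in short:
        # Search only the part of the long list not yet passed.
        lo = bisect_left(long_, doc_id, lo)
        if lo == len(long_):
            break
        if long_[lo] == doc_id:
            out.append(doc_id)
    return out


def intersect_many(lists):
    """Intersect any number of sorted lists, shortest first to shrink work early."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)
    result = list(ordered[0])
    for other in ordered[1:]:
        result = intersect(result, other)
        if not result:
            break
    return result


if __name__ == "__main__":
    print(intersect_many([[1, 2, 3, 5, 8], [2, 5, 9], [0, 2, 4, 5, 6]]))
```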

Page 25: Introduction of search engine

Automatic Summary

Page 26: Introduction of search engine

Thanks!

Page 27: Introduction of search engine

Backup

Page 28: Introduction of search engine

Bowtie Websites