introduction of search engine

Post on 05-Jul-2015


Introduction of Search Engine

By Jinglun

Agenda

• Overview

• Spider

• Analysis

• Index

• Search

Search Engine

Architecture

Spider

Web Characteristics

• Bowtie

• Diameter of WWW

• Volatile

• Semi-Structured

Problems & Solutions

• Depth-First Traversal vs Breadth-First Traversal

• Duplication

• Optimize download

• Page Storage

• Concurrency Control and robots.txt
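A minimal sketch of how breadth-first traversal, duplicate checking, and robots.txt can fit together. The in-memory `WEB` dictionary, the `get_links` callback, and the `demo-bot` agent name are hypothetical stand-ins for real downloading and link extraction; only `urllib.robotparser` comes from the Python standard library.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def crawl_bfs(seed, get_links, robots, agent="demo-bot", limit=100):
    """Visit pages breadth-first, skipping duplicates and disallowed URLs."""
    seen = {seed}          # duplicate check: each URL is queued at most once
    queue = deque([seed])  # FIFO queue gives breadth-first order
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue       # respect robots.txt rules
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory "web" standing in for real HTTP downloads.
WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c", "http://example.com/"],
    "http://example.com/b": ["http://example.com/private/x", "http://example.com/c"],
    "http://example.com/c": [],
    "http://example.com/private/x": [],
}

robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

pages = crawl_bfs("http://example.com/", lambda u: WEB.get(u, []), robots)
```

Replacing `queue.popleft()` with `queue.pop()` would turn the same loop into a depth-first traversal, which is the trade-off the first bullet refers to.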

Architecture

Analysis

• Structuring

• Replica Detection

• Word Segmentation

• PageRank

Structuring

Replica Detection
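The slide does not name a specific method. One common approach, shown here purely as an assumption, is to compare pages by the overlap of their word shingles (Jaccard similarity). The function names and the shingle size `k=3` are illustrative choices.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two texts based on their shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


similarity = jaccard("the quick brown fox jumps", "the quick brown fox leaps")
```

Pages whose similarity exceeds a chosen threshold would be treated as replicas; the threshold itself is a tuning decision not given in the source.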

Word Segmentation

• Forward Maximum Matching Method

• Backward Maximum Matching Method

• N-Gram
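The three methods above can be sketched as follows. This is a minimal illustration, assuming a small hand-made vocabulary and a maximum word length of 4 characters; both are hypothetical and a real segmenter would load a full dictionary.

```python
def forward_max_match(text, vocab, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in vocab:  # fall back to a single character
                out.append(word)
                i += size
                break
    return out


def backward_max_match(text, vocab, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    out, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            word = text[j - size:j]
            if size == 1 or word in vocab:
                out.append(word)
                j -= size
                break
    return out[::-1]


def char_ngrams(text, n=2):
    """Dictionary-free alternative: every run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


vocab = {"研究", "研究生", "生命", "起源"}  # hypothetical toy dictionary
fwd = forward_max_match("研究生命起源", vocab)
bwd = backward_max_match("研究生命起源", vocab)
```

On this input the forward scan yields 研究生 / 命 / 起源 while the backward scan yields 研究 / 生命 / 起源, which shows why the two directions are often compared against each other.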

PageRank
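A minimal sketch of the usual power-iteration formulation. The damping factor 0.85, the iteration count, and the four-link toy graph are assumptions for illustration, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this toy graph page C collects links from both A and B, so it ends up with the highest rank, followed by A and then B.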

Architecture

Index

Data Structure

Build Index
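The slide content is not in the transcript. Assuming the data structure in question is an inverted index (a term dictionary pointing to sorted lists of document IDs), building one can be sketched like this. Whitespace tokenization and the sample documents are simplifications.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() keeps one posting per document; IDs arrive in increasing order.
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)
    return dict(index)


index = build_index(["apple banana", "banana cherry", "apple cherry banana"])
```

Because documents are processed in ID order, each posting list comes out already sorted, which is what later compression and merging steps rely on.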

Delta Encoding

• a, b, c => a, b – a, c – b

• Case1: docid

0, 2, 3, 5, 6, 10 … =>

0, 2, 1, 2, 1, 4 …

• Case2: term

apple, applet, application, banana … =>

<0, apple>, <5, t>, <4, ication>,

<0, banana> …
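Both cases can be sketched in a few lines. The function names are illustrative; the inputs are the ones from the slide.

```python
def delta_encode(ids):
    """Store each sorted ID as the gap from its predecessor."""
    prev, out = 0, []
    for x in ids:
        out.append(x - prev)
        prev = x
    return out


def delta_decode(gaps):
    """Rebuild the original IDs with a running sum."""
    total, out = 0, []
    for g in gaps:
        total += g
        out.append(total)
    return out


def front_encode(terms):
    """Store each sorted term as (shared prefix length, remaining suffix)."""
    prev, out = "", []
    for t in terms:
        n = 0
        while n < min(len(prev), len(t)) and prev[n] == t[n]:
            n += 1
        out.append((n, t[n:]))
        prev = t
    return out


gaps = delta_encode([0, 2, 3, 5, 6, 10])
coded = front_encode(["apple", "applet", "application", "banana"])
```

The prefix length is counted in characters against the previous term, so `applet` after `apple` is stored as `(5, "t")` and `application` after `applet` as `(4, "ication")`.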

VInt (variable-length integer)

• Highest bit is ‘1’: more bytes follow. Highest bit is ‘0’: this is the last byte.

• 0~127: 1 byte, 0xxxxxxx

• 128~2^14-1: 2 bytes

xxxxxxxyyyyyyy => 1xxxxxxx, 0yyyyyyy

• 2^14~2^21-1: 3 bytes

• 2^21~2^28-1: 4 bytes

• 2^28~2^32-1: 5 bytes
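A minimal sketch of this encoding, following the byte layout shown on the slide (most significant 7-bit group first, continuation bit set on every byte except the last). Some index formats store the low-order group first instead, so the byte order here is an assumption tied to the slide. The function names are illustrative.

```python
def vint_encode(n):
    """Encode a non-negative integer into 7-bit groups with a continuation bit."""
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()  # most significant group first, as on the slide
    # Set the high bit on every byte except the last one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vint_decode(data):
    """Decode one integer from the start of a byte sequence."""
    n = 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if not b & 0x80:  # high bit 0 marks the last byte
            break
    return n


encoded = vint_encode(128)
```

Small values such as delta-encoded document gaps usually fit in a single byte, which is why the two techniques are typically combined.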

Build Index for 5GB of Text

Search

Classic Models

• Boolean Model

• Vector Space Model

• Probabilistic Model
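As an illustration of the vector space model, the sketch below ranks documents by cosine similarity between TF-IDF vectors. The weighting formula (raw term count times `log(N/df) + 1`), whitespace tokenization, and the sample documents are assumptions; many other TF-IDF variants exist.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector per document plus the IDF table."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def search(query, docs):
    """Return IDs of matching documents, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(query.lower().split()).items()}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for s, i in sorted(scores, reverse=True) if s > 0]


hits = search("apple", ["apple pie recipe", "banana bread recipe", "apple apple cider"])
```

Unlike the Boolean model, which only says whether a document matches, this produces a ranking: the document that mentions the query term more prominently comes first.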

Work Flow

Optimize Merging

• Skip list

• Binary search

• Multiway merging

• Map/Hash Map
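Merging here means intersecting the sorted posting lists of the query terms. The sketch below shows a plain two-pointer merge, a binary-search variant that jumps ahead in the longer list, and a multiway version that starts from the shortest list. Binary search stands in for skip pointers; the function names are illustrative.

```python
from bisect import bisect_left


def intersect_merge(a, b):
    """Two-pointer intersection of two sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_bsearch(short, long):
    """Look up each ID of the short list in the long one by binary search."""
    out, lo = [], 0
    for x in short:
        lo = bisect_left(long, x, lo)  # never search before the last position
        if lo == len(long):
            break
        if long[lo] == x:
            out.append(x)
    return out


def intersect_many(lists):
    """Multiway intersection, shortest list first (expects at least one list)."""
    lists = sorted(lists, key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect_bsearch(result, other)
    return result


common = intersect_many([[1, 2, 3, 4, 5, 6], [2, 4, 6], [4, 6, 8]])
```

Starting from the shortest list keeps the intermediate result small, so each later step has fewer IDs to look up.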

Automatic Summary

Thanks!

Backup

Bowtie Websites