web search engines – browsing services search engine services web pages bag of words two semantics...
![Page 1: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/1.jpg)
Search Engine Technology
![Page 2: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/2.jpg)
Outline
• How search engines work
• Information retrieval research and institutions
![Page 3: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/3.jpg)
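Web Search Engines
• Definition: a system that lets users submit a query, retrieves the web pages relevant to it, and returns them as a ranked result list.
• Index construction methods
– Manual indexing
– Automatic indexing
• System architecture
– Centralized architecture
– Distributed architecture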
![Page 4: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/4.jpg)
![Page 5: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/5.jpg)
![Page 6: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/6.jpg)
Diagram labels: two service extremes (Browsing Services, Search Engine Services) and two semantics extremes (Web Pages, Bag of Words), each pair marked "???".
![Page 7: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/7.jpg)
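Three-stage workflow of a search engine
• Crawling
– Batch crawling, incremental crawling; crawl targets, crawl strategies
• Preprocessing
– Keyword extraction; duplicate page elimination; link analysis; indexing
• Serving
– Query modes and matching; result ranking; document summaries
Crawl → Organize → Serve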
![Page 8: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/8.jpg)
Search engine system flow
![Page 9: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/9.jpg)
Tianwang search engine system flow
![Page 10: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/10.jpg)
Distributed Web crawling system architecture
Diagram labels: scheduling module; coordinator process (node) and crawler process, repeated for each node (……).
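The diagram does not say how URLs are divided among the nodes. One common approach, assumed here and not confirmed by the slides, is to hash the host name so that every URL of a site goes to the same node. `node_for_url` and `dispatch` are hypothetical names used only for this sketch.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url: str, num_nodes: int) -> int:
    """Pick a crawl node for a URL by hashing its host name (assumed policy)."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_nodes


def dispatch(urls, num_nodes):
    """Split a batch of URLs into one queue per node."""
    queues = [[] for _ in range(num_nodes)]
    for url in urls:
        queues[node_for_url(url, num_nodes)].append(url)
    return queues


if __name__ == "__main__":
    batch = [
        "http://www.pku.edu.cn/",
        "http://www.pku.edu.cn/about",
        "http://www.somewhere.cn/index.html",
    ]
    for node_id, queue in enumerate(dispatch(batch, 3)):
        print(node_id, queue)
```

Keeping a whole host on one node makes it easier to rate-limit requests per site, at the cost of uneven load when a few hosts are very large.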
![Page 11: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/11.jpg)
Tianwang storage format
version: 1.0 // version number
url: http://www.pku.edu.cn/ // URL
origin: http://www.somewhere.cn/ // original URL
date: Tue, 15 Apr 2003 08:13:06 GMT // time of harvest
ip: 162.105.129.12 // IP address
unzip-length: 30233 // if included, the data must be compressed
length: 18133 // data length
// a blank line
XXXXXXXX // the data part follows
XXXXXXXX
….
XXXXXXXX // data end
// insert a new line
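A reader for this record layout might look like the sketch below. It assumes that `length` counts the stored bytes, that header keys are case-insensitive, and that compressed data uses zlib; the slide states none of these. `read_record` is a hypothetical name.

```python
import io
import zlib


def read_record(stream):
    """Read one record from a binary stream.

    Returns (headers, data), or None when no further record is found.
    """
    headers = {}
    while True:
        line = stream.readline()
        if not line:
            return None  # end of input
        text = line.decode("utf-8", errors="replace").rstrip("\r\n")
        if text == "":
            if headers:
                break  # the blank line closes the header block
            continue  # skip the newline that follows the previous record
        key, _, value = text.partition(":")
        headers[key.strip().lower()] = value.strip()
    data = stream.read(int(headers["length"]))
    if "unzip-length" in headers:
        # Assumption: the compressed payload is a zlib stream.
        data = zlib.decompress(data)
    return headers, data


if __name__ == "__main__":
    body = b"<html>hello</html>"
    raw = (
        b"version: 1.0\n"
        b"url: http://www.pku.edu.cn/\n"
        b"length: " + str(len(body)).encode() + b"\n"
        b"\n" + body + b"\n"
    )
    print(read_record(io.BytesIO(raw)))
```

Splitting each header at the first colon keeps values such as URLs and dates, which contain colons themselves, intact.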
![Page 12: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/12.jpg)
File Organizations (Indexes)
• Choices for accessing data during query evaluation
• Scan the entire collection
– Typical in early (batch) retrieval systems
– Computational and I/O costs are O(characters in collection)
– Practical for only “small” text collections
– Large memory systems make scanning feasible
• Use indexes for direct access
– Evaluation time O(query term occurrences in collection)
– Practical for “large” collections
– Many opportunities for optimization
• Hybrids: Use small index, then scan a subset of the collection
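The difference between scanning and direct access can be seen on a toy collection. This is a minimal sketch with whitespace tokenisation; `scan`, `build_index` and `lookup` are hypothetical names, and the hybrid option is not shown.

```python
def scan(docs, term):
    """Full scan: every token of every document is examined for each query."""
    return [doc_id for doc_id, text in enumerate(docs)
            if term in text.lower().split()]


def build_index(docs):
    """One pass over the collection, done once before any query arrives."""
    index = {}
    for doc_id, text in enumerate(docs):
        for token in set(text.lower().split()):
            index.setdefault(token, []).append(doc_id)
    return index


def lookup(index, term):
    """Direct access: the cost depends on the term's postings, not the collection."""
    return index.get(term, [])


if __name__ == "__main__":
    docs = ["the cat sat", "the dog", "cat and dog"]
    index = build_index(docs)
    print(scan(docs, "cat"), lookup(index, "cat"))
```

Both functions return the same document ids; the index pays its cost up front so that each query touches only the entries for its own terms.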
![Page 13: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/13.jpg)
Indexes
• What should the index contain?
• Database systems index primary and secondary keys
– This is the hybrid approach
– Index provides fast access to a subset of database records
– Scan subset to find solution set
• IR problem: cannot predict the keys that people will use in queries
– Every word in a document is a potential search term
• IR solution: index by all keys (words), i.e. full-text indexes
![Page 14: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/14.jpg)
Index Contents
• The contents depend upon the retrieval model
• Feature presence/absence
– Boolean
– Statistical (tf, df, ctf, doclen, maxtf)
– Often about 10% the size of the raw data, compressed
• Positional
– Feature location within document
– Granularities include word, sentence, paragraph, etc.
– Coarse granularities are less precise, but take less space
– Word-level granularity about 20-30% the size of the raw data, compressed
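The statistical features listed above can be gathered in one pass. The sketch assumes the usual readings of the abbreviations: tf is a term's count within a document, df the number of documents containing it, ctf its count over the whole collection, doclen the document length in tokens, and maxtf the largest tf in a document. `collect_stats` is a hypothetical name.

```python
from collections import Counter


def collect_stats(docs):
    """Return (tf, df, ctf, doclen, maxtf) for a list of document strings."""
    tf = [Counter(text.lower().split()) for text in docs]  # per-document counts
    df = Counter()   # number of documents that contain each term
    ctf = Counter()  # total occurrences of each term in the collection
    for counts in tf:
        df.update(counts.keys())
        ctf.update(counts)
    doclen = [sum(counts.values()) for counts in tf]
    maxtf = [max(counts.values(), default=0) for counts in tf]
    return tf, df, ctf, doclen, maxtf


if __name__ == "__main__":
    tf, df, ctf, doclen, maxtf = collect_stats(["a b a", "b c"])
    print(tf[0]["a"], df["b"], ctf["b"], doclen, maxtf)
```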
![Page 15: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/15.jpg)
Indexes: Implementation
• Common implementations of indexes
– Bitmaps
– Signature files
– Inverted files
• Common index components
– Dictionary (lexicon)
– Postings
  • document ids
  • word positions
No positional data indexed
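A document-level inverted file (no positions) with the two components named above can be sketched as follows. The layout is an assumption made for illustration: the lexicon maps each term to an offset and a document count, and the postings are one flat list of document ids. `build_inverted_file` and `postings_for` are hypothetical names.

```python
def build_inverted_file(docs):
    """Return (lexicon, postings) for a list of document strings.

    lexicon:  term -> (offset into postings, number of documents)
    postings: flat list of document ids, grouped by term
    """
    term_docs = {}
    for doc_id, text in enumerate(docs, start=1):
        for token in text.lower().split():
            ids = term_docs.setdefault(token, [])
            if not ids or ids[-1] != doc_id:  # record each document once per term
                ids.append(doc_id)
    lexicon, postings = {}, []
    for term in sorted(term_docs):
        lexicon[term] = (len(postings), len(term_docs[term]))
        postings.extend(term_docs[term])
    return lexicon, postings


def postings_for(lexicon, postings, term):
    """Fetch the posting list of a term through the lexicon."""
    if term not in lexicon:
        return []
    offset, count = lexicon[term]
    return postings[offset:offset + count]


if __name__ == "__main__":
    docs = ["pease porridge hot", "pease porridge cold", "porridge in the pot"]
    lexicon, postings = build_inverted_file(docs)
    print(postings_for(lexicon, postings, "porridge"))
```

Separating a small lexicon from a large postings list reflects the usual reason for the split: the lexicon can stay in memory while postings are read on demand.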
![Page 16: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/16.jpg)
Inverted Files
![Page 17: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/17.jpg)
Inverted Files
![Page 18: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/18.jpg)
Word-Level Inverted File
![Page 19: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/19.jpg)
Inverted Search Algorithm
1. Find query elements (terms) in the lexicon
2. Retrieve postings for each lexicon entry
3. Manipulate postings according to the retrieval model
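The three steps can be sketched for a Boolean AND query. The index here is assumed to be a plain dictionary from term to a sorted list of document ids; `intersect` and `boolean_and` are hypothetical names.

```python
def intersect(a, b):
    """Merge two sorted posting lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def boolean_and(index, query):
    terms = query.lower().split()
    # Step 1: find the query terms in the lexicon.
    if not terms or any(term not in index for term in terms):
        return []
    # Step 2: retrieve the postings for each lexicon entry (shortest first).
    lists = sorted((index[term] for term in terms), key=len)
    # Step 3: manipulate the postings; for Boolean AND this is an intersection.
    result = lists[0]
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result


if __name__ == "__main__":
    index = {"porridge": [1, 2], "pot": [2, 5], "hot": [1, 4]}
    print(boolean_and(index, "porridge pot"))
```

Starting with the shortest list keeps the intermediate result small. Other retrieval models change only step 3.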
![Page 20: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/20.jpg)
Word-Level Inverted File
Query:
1. porridge & pot (BOOL)
2. “porridge pot” (BOOL)
3. porridge pot (VSM)
lexicon posting
Answer
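The slide's figure is not part of the transcript, so the sketch below runs the three queries on an assumed stand-in corpus: six lines of the "Pease porridge" nursery rhyme, one document per line. The VSM part uses a simplified tf·idf sum without length normalisation, not a full cosine measure, and all function names are hypothetical.

```python
import math
import re
from collections import defaultdict

# Assumed toy corpus; the slide's own example is not in the transcript.
DOCS = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}


def build_positional_index(docs):
    """Word-level inverted file: term -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        for pos, token in enumerate(tokens, start=1):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def bool_and(index, t1, t2):
    """Query 1: documents that contain both terms."""
    return sorted(set(index.get(t1, {})) & set(index.get(t2, {})))


def phrase(index, t1, t2):
    """Query 2: documents where t2 occurs immediately after t1."""
    hits = []
    for doc_id in bool_and(index, t1, t2):
        following = set(index[t2][doc_id])
        if any(pos + 1 in following for pos in index[t1][doc_id]):
            hits.append(doc_id)
    return hits


def vsm_rank(index, terms, num_docs):
    """Query 3: rank documents by a simplified tf * idf sum."""
    scores = defaultdict(float)
    for term in terms:
        plist = index.get(term, {})
        if not plist:
            continue
        idf = math.log(num_docs / len(plist))
        for doc_id, positions in plist.items():
            scores[doc_id] += len(positions) * idf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    index = build_positional_index(DOCS)
    print(bool_and(index, "porridge", "pot"))
    print(phrase(index, "porridge", "pot"))
    print(vsm_rank(index, ["porridge", "pot"], len(DOCS)))
```

On this corpus the Boolean query matches only document 2, while the phrase query matches nothing because "in the" separates the two words there. The ranked query also returns documents that contain just one of the terms, which is the practical difference between the Boolean and vector space readings of the same words.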
![Page 21: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/21.jpg)
Outline
• How search engines work
• Information retrieval research and institutions
![Page 22: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/22.jpg)
A Brief History of Modern Information Retrieval
• In 1945, Vannevar Bush published "As We May Think" in The Atlantic Monthly.
• In the 1960s, Gerard Salton and his students developed the SMART system.
• The Cranfield evaluations were carried out by Cyril Cleverdon.
• The 1970s and 1980s saw many developments built on the advances of the 1960s.
• In 1992, the Text Retrieval Conference (TREC) was launched.
• The algorithms developed in IR were employed for searching the Web from 1996.
![Page 23: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/23.jpg)
Cluster \ Year 71 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 00 01 02 Total
Databases, NL Interfaces 8 4 1 6 5 10 1 3 5 2 5 2 4 1 3 1 1 2 2 66
General ! 5 2 9 2 9 5 7 10 10 6 10 6 2 5 8 6 2 2 4 3 1 4 2 5 1 126
Models 1 2 1 1 4 1 2 1 2 1 2 2 2 2 2 3 1 30
Question answering 1 1 1 1 1 1 1 1 4 4 1 17
Syntactic phrases & SDR 1 1 1 2 1 6 3 3 2 3 2 1 1 2 1 1 3 1 1 1 37
Conceptual IR, KB IR 1 4 4 1 3 3 4 3 5 7 5 1 6 3 5 3 2 3 4 1 3 2 1 1 75
Compression 1 1 2 2 1 1 1 3 1 1 1 2 1 18
Clustering 2 1 1 2 3 3 2 1 2 1 1 2 1 1 3 26
Relevance feedback 1 1 1 2 1 1 1 1 2 4 3 1 2 1 1 1 1 25
Inverted files & Implementations 1 1 1 2 1 3 1 2 1 1 1 3 18
Term weighting 1 3 2 1 2 1 1 5 3 3 1 2 1 1 1 1 1 1 31
Message understanding & TDT 1 1 1 3 2 3 4 2 4 5 5 31
Filtering 1 1 1 1 1 4 1 1 1 1 2 3 18
Hypertext IR, Multiple evidence 1 3 1 1 2 1 2 2 2 1 4 3 1 5 2 2 33
Image retrieval 1 1 1 1 1 2 1 1 9
Probabilistic & Language models 1 1 1 3 1 3 4 2 2 3 2 1 3 1 3 3 34
Boolean & extended Boolean 1 2 1 1 1 1 1 1 1 10
Japanese & Chinese IR 1 1 2 3 2 3 1 1 14
DBMS & IR 1 1 1 1 1 5
Users & Search 2 3 3 2 2 4 3 2 2 3 1 3 3 1 1 2 1 38
Visualisation 1 1 1 1 1 2 1 1 2 1 12
Signature files 1 1 1 2 2 1 1 9
Distributed IR 1 2 1 2 1 1 3 1 1 3 4 2 1 1 24
Evaluation 3 4 4 2 1 7 2 3 8 34
Topic distillation & Linkage retrieval 1 3 3 2 9
Latent semantic indexing 1 1 1 2 1 6
Text categorisation 1 3 3 3 1 3 1 3 3 2 23
Document summarisation 2 2 2 3 3 12
Cross lingual 1 3 3 1 1 3 4 16
Clustering of SIGIR papers by topic vs. year
![Page 24: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/24.jpg)
Question answering
![Page 25: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/25.jpg)
Clustering
![Page 26: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/26.jpg)
Inverted files & Implementations
![Page 27: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/27.jpg)
Message understanding & TDT
![Page 28: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/28.jpg)
Filtering
![Page 29: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/29.jpg)
Hypertext IR, Multiple evidence
![Page 30: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/30.jpg)
Probabilistic & Language models
![Page 31: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/31.jpg)
Distributed IR
![Page 32: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/32.jpg)
Evaluation
![Page 33: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/33.jpg)
Topic distillation & Linkage retrieval
![Page 34: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/34.jpg)
Text categorisation
![Page 35: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/35.jpg)
Document summarisation
![Page 36: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/36.jpg)
Same table as on page 35. Cluster called out on this slide: Cross lingual
![Page 37: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/37.jpg)
IR-Related Research and Institutions
• CIIR, University of Massachusetts
• LTI, Carnegie Mellon University
• The Stanford University DB Group
• Microsoft Research Asia
• TREC
• Peking University, Network Lab, Tianwang Group
![Page 38: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/38.jpg)
Introduction to Lemur
• http://www-2.cs.cmu.edu/~lemur/
![Page 39: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/39.jpg)
Lemur Toolkit
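• Goal: a research system to facilitate language modeling (LM) and IR research
– ad hoc retrieval, distributed retrieval, cross-language IR, summarization, filtering, and classification
• Features:
– Indexing of large-scale document databases
– Building simple language models
– Implementing retrieval systems based on language models and several other retrieval models
• Implementation:
– C and C++
– Unix / Windows
– Current version: 3.1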
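To make the "simple language model" idea concrete, here is a minimal query-likelihood sketch in Python. It is not Lemur's API or code: the function names, the toy documents, and the choice of Jelinek-Mercer smoothing with `lam=0.5` are all assumptions made for illustration.

```python
import math
from collections import Counter

def build_models(docs):
    """Return per-document term counts and collection-wide term counts."""
    doc_counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    collection = Counter()
    for counts in doc_counts.values():
        collection.update(counts)
    return doc_counts, collection

def query_log_likelihood(query, counts, collection, lam=0.5):
    """log P(query | document) under a unigram model.

    Jelinek-Mercer smoothing (an assumption here) mixes the document
    estimate with the collection estimate using weight `lam`.
    """
    doc_len = sum(counts.values())
    coll_len = sum(collection.values())
    score = 0.0
    for term in query.lower().split():
        p_doc = counts[term] / doc_len if doc_len else 0.0
        p_coll = collection[term] / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:  # term never seen anywhere in the collection
            return float("-inf")
        score += math.log(p)
    return score

def rank(query, docs, lam=0.5):
    """Rank document names by descending query log-likelihood."""
    doc_counts, collection = build_models(docs)
    scored = [(query_log_likelihood(query, c, collection, lam), name)
              for name, c in doc_counts.items()]
    return [name for _, name in sorted(scored, key=lambda x: (-x[0], x[1]))]

# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "language models for information retrieval",
    "d2": "indexing large document collections",
    "d3": "retrieval models and language",
}
print(rank("language retrieval", docs))
```

Shorter documents that contain both query terms score higher here because each term takes a larger share of the document's probability mass; other smoothing methods would weigh this differently.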
![Page 40: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/40.jpg)
MRA: Towards Next Generation Web Search
• From Pages to Blocks
– Analyze the Web at finer granularity
• From Surface Web to Deep Web
– Unleash the huge assets of high-value information
• From Unstructured to Structured
– Provide well-organized results
• From Relevance to Intelligence
– Contribute knowledge discovery with search
• From Desktop Search to Mobile Search
– Bridge physical world search to digital world search
![Page 41: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/41.jpg)
The Stanford Univ. DB Group
• WebBase
– Crawling, storage, indexing, and querying of large collections of Web pages
• Digital Libraries
– Infrastructure and services for creating, disseminating, sharing, and managing information
![Page 42: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/42.jpg)
TREC Conference
• Established in 1992 to evaluate large-scale IR
– Retrieving documents from a gigabyte collection
• Has run continuously since then
– TREC 2004 (13th) meeting is in November
• Run by NIST's Information Access Division
• Probably the best-known IR evaluation setting
– Started with 25 participating organizations in the 1992 evaluation
– In 2003, there were 93 groups from 22 different countries
• Proceedings available online (http://trec.nist.gov)
– Overview of TREC 2003 at http://trec.nist.gov/pubs/trec12/papers/OVERVIEW.12.pdf
![Page 43: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/43.jpg)
TREC General Format
• TREC consists of IR research tracks
– Ad hoc, routing, confusion (scanned documents, speech recognition), video, filtering, multilingual (cross-language, Spanish, Chinese), question answering, novelty, high precision, interactive, Web, database merging, NLP, …
• Each track works on roughly the same model
– November: track approved by the TREC community
– Winter: track's members finalize the format for the track
– Spring: researchers train systems based on the specification
– Summer: researchers carry out the formal evaluation
  • Usually a "blind" evaluation: researchers do not know the answers
– Fall: NIST carries out the evaluation
– November: group meeting (TREC) to find out:
  • How well your site did
  • How others tackled the problem
– Many tracks are run by volunteers outside of NIST (e.g. Web)
• "Coopetition" model of evaluation
– Successful approaches generally adopted in the next cycle
![Page 44: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/44.jpg)
TREC Tracks
![Page 45: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/45.jpg)
Summary of VLC/Web Track evaluation 1996 - 2003
![Page 46: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/46.jpg)
Tianwang Group @PKU
[Figure: timeline 1996, 1999, 2000, 2002, 2004, with alternating "experience" and "requirement" cycles. Key ideas shown: Web pages; FTP files; Web pages grow exponentially; preserve vanishing web resources; easier preserve; mass system. Milestones: Tianwang 1.0, Bingle 1.0, Tianwang 2.0, Web InfoMall 1.0, CDAL 1.0, Web InfoMall 2.0, World MEMEX.]
![Page 47: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/47.jpg)
http://www.infomall.cn/
![Page 48: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/48.jpg)
![Page 49: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/49.jpg)
![Page 50: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/50.jpg)
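CWT100g construction timeline
[Figure: timeline with dates 2004.2.1, 6.16, 10.8-20, 11.3, 11.10 and stages idea, document, query, pooling, judgment, …; four stages carry a check mark (√).]
"One small step for me, one giant leap for mankind!"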
![Page 51: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/51.jpg)
![Page 52: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/52.jpg)
Peking University 燕穹 (Yanqiong) data sharing status as of 2004-12-20
2.5/8.8 = 28.4%
![Page 53: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/53.jpg)
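Participating teams that submitted results

| Team | TEAM NAME | TD-RUNS | NPHP-RUNS |
|---|---|---|---|
| APEX Lab, Shanghai Jiao Tong University | APEX | 5 | 5 |
| Institute of Computer Science and Technology, Peking University | ANS | 3 | 2 |
| TRS (company) | TRS | 5 | 2 |
| South China University of Technology, Mumian Team 1 | MUMIAN1 | 3 | 1 |
| South China University of Technology, Mumian Team 2 | MUMIAN2 | 2 | 1 |
| Database Application Lab, School of Computer Science, South China University of Technology | SCUTDB | 5 | 5 |
| High School Affiliated to Fujian Normal University | WLL | ? | ? |

WLL lists a single run count of 1; the transcript does not preserve which column it belongs to.

Note: the pool also includes retrieval results from five search engines: google, yisou, baidu, sogou, and zhongsou.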
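As a rough illustration of how a judgment pool like the one in the note is typically formed, the sketch below takes the union of the top-ranked pages from every submitted run (team runs and search-engine result lists alike) for each topic. The run names, topic IDs, page IDs, and pool depth are hypothetical; the depth actually used for this evaluation is not stated here.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` results of every run, grouped by topic.

    `runs` maps a run name to {topic_id: ranked list of page IDs}.
    Only pages in the resulting pool would be shown to assessors.
    """
    pool = {}
    for run in runs.values():
        for topic, ranked_pages in run.items():
            pool.setdefault(topic, set()).update(ranked_pages[:depth])
    return pool

# Hypothetical runs: two team submissions plus one search-engine result list.
runs = {
    "team_run_1": {"T1": ["p3", "p1", "p7"], "T2": ["p2", "p9"]},
    "team_run_2": {"T1": ["p1", "p4", "p3"], "T2": ["p9", "p5"]},
    "search_engine_a": {"T1": ["p8", "p1"]},
}

pool = build_pool(runs, depth=2)
for topic in sorted(pool):
    print(topic, sorted(pool[topic]))
```

Adding search-engine results to the pool widens the set of judged pages, which helps when only a handful of teams submit runs.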
![Page 54: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/54.jpg)
Evaluation results
[Figure panels: topic distillation; navigational search]
TIANWANG_RUN is included for reference only.
![Page 55: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/55.jpg)
Summary
• How search engines work
• IR-related research and institutions
![Page 56: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/56.jpg)
Thank you!
![Page 57: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/57.jpg)
Vector Space Model
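• Document d and query q are represented as two m-dimensional vectors in the vector space. Each dimension is weighted by TF·IDF, and similarity is measured by the cosine of the angle between the two vectors (using the raw tf and idf formulas):

$$
\cos(Q, D) = \frac{\sum_{t \in Q} (f_{d,t} \cdot idf_t)\,(f_{q,t} \cdot idf_t)}{W_d \, W_q}
= \frac{1}{W_d \, W_q} \sum_{t \in Q} f_{d,t} \, f_{q,t} \left( \log \frac{N}{df_t} \right)^{2}
$$

$$
W_d = \sqrt{\sum_{t} \left( f_{d,t} \cdot idf_t \right)^{2}}, \qquad
W_q = \sqrt{\sum_{t} \left( f_{q,t} \cdot idf_t \right)^{2}}
$$

Here $f_{d,t}$ and $f_{q,t}$ are the frequencies of term $t$ in $d$ and $q$, $N$ is the number of documents, $df_t$ is the number of documents containing $t$, and $idf_t = \log(N / df_t)$. Because $W_q$ is the same for every document, ranking by $\frac{1}{W_d} \sum_{t \in Q} f_{d,t} \, f_{q,t} \, idf_t^{2}$ gives the same order.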
![Page 58: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/58.jpg)
Query Answer
• 1. porridge & pot (BOOL)
– d2
• 2. "porridge pot" (BOOL)
– null
• 3. porridge pot (VSM)
– d2 > d1 > d5
– Next page
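The three answers above can be reproduced with a short Python sketch. The document collection is not part of this section, so the code assumes the six-line "Pease porridge" rhyme that is commonly used for this example, with one line per document; that collection, the whitespace tokenization, and all function names are assumptions. The weighting is raw tf times log(N/df), with cosine similarity for ranking.

```python
import math
from collections import Counter

# Assumed collection (not taken from this section): one rhyme line per document.
DOCS = {
    "d1": "pease porridge hot pease porridge cold",
    "d2": "pease porridge in the pot",
    "d3": "nine days old",
    "d4": "some like it hot some like it cold",
    "d5": "some like it in the pot",
    "d6": "nine days old",
}
TOKENS = {name: text.split() for name, text in DOCS.items()}
N = len(TOKENS)
# Document frequency: number of documents containing each term.
DF = Counter(term for toks in TOKENS.values() for term in set(toks))

def idf(term):
    """Raw idf, log(N / df); 0 for terms absent from the collection."""
    return math.log(N / DF[term]) if DF[term] else 0.0

def weights(tokens):
    """TF-IDF weight per term: raw term frequency times idf."""
    return {t: f * idf(t) for t, f in Counter(tokens).items()}

def norm(w):
    return math.sqrt(sum(x * x for x in w.values()))

def cosine(query_tokens, doc_tokens):
    wq, wd = weights(query_tokens), weights(doc_tokens)
    denom = norm(wq) * norm(wd)
    if denom == 0.0:
        return 0.0
    return sum(wq[t] * wd.get(t, 0.0) for t in wq) / denom

def boolean_and(terms):
    """Documents containing every query term."""
    return sorted(n for n, toks in TOKENS.items() if all(t in toks for t in terms))

def phrase(terms):
    """Documents containing the terms as an adjacent sequence."""
    k = len(terms)
    return sorted(n for n, toks in TOKENS.items()
                  if any(toks[i:i + k] == terms for i in range(len(toks) - k + 1)))

def vsm_rank(terms):
    """Documents with a non-zero cosine score, best first."""
    scored = [(cosine(terms, toks), n) for n, toks in TOKENS.items()]
    return [n for s, n in sorted(scored, key=lambda x: (-x[0], x[1])) if s > 0]

query = ["porridge", "pot"]
print(boolean_and(query))  # Boolean AND
print(phrase(query))       # phrase query
print(vsm_rank(query))     # vector space ranking
```

Under these assumptions d1 contains "porridge" twice but not "pot", so it fails the Boolean AND yet still ranks second in the vector space model, which is the difference the slide is pointing at.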
![Page 59: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/59.jpg)
CIIR-Center for Intelligent Information Retrieval @UMASS
• One of the leading research groups in IR
– improving the probabilistic models
– first description of a retrieval system based on statistical language models
– introduced and improved a number of techniques for text and query representation
– automatically representing databases and combining local searches for DIR
– first high-capacity probabilistic filtering architecture
– defined and evaluated the first versions of event detection and tracking software
– earliest research on ranking and representation techniques for Asian languages
– first approaches to information extraction that emphasized learning
– novel techniques for indexing images and video
![Page 60: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/60.jpg)
CIIR cont.
• Research
– more than 500 journal and refereed conference papers over the past 12 years (52 submissions in 2003)
• Industrial and government collaboration
– INQUERY
– licensed our software to nearly 300 sites
• Education
– 20 Ph.D.s, 29 M.S.
– 123/145, 34/4 graduate/undergraduate
![Page 61: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/61.jpg)
CIIR cont.
• Personnel
– Faculty: 4 (W. BRUCE CROFT)
– Technical personnel: 10
– Graduate students: 34/10
• Groups
– IESL: Information Extraction and Synthesis Laboratory
– IR: Information Retrieval Laboratory
– MIR: Multimedia Indexing and Retrieval Laboratory
• The CIIR is currently concentrating on the unsolved long-term research problems that underlie effective information retrieval
– text representation
– query acquisition
– retrieval models
![Page 62: Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???](https://reader036.vdocument.in/reader036/viewer/2022081414/5513d3a95503463a298b525a/html5/thumbnails/62.jpg)
LTI: Language Technologies Institute @CMU
• Machine Translation, Natural Language Processing, Speech, and Information Retrieval
• IR Projects (Jamie Callan and Yiming Yang)
– Adaptive Information Filtering
– Distributed Information Retrieval / Federated Search
– Email Classification and Prioritization
– Minerva: Web Mining for Question Answering
– MuchMore: Translingual Information Retrieval
– JAVELIN: Open-Domain Question Answering