Text Information Retrieval and Applications – Advanced Topics
By J. H. Wang, May 27, 2009
![Page 1: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/1.jpg)
1
Text Information Retrieval and Applications – Advanced Topics
By J. H. Wang
May 27, 2009
![Page 2: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/2.jpg)
2
Outline
• Advanced Retrieval Technologies
– Cross-Language Information Retrieval
– Multimedia Information Retrieval
– Semantic Retrieval
• Applications to IR
– Advanced Google
– Meta Search
– Search Result Clustering
![Page 3: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/3.jpg)
3
Advanced Retrieval Technologies
• Cross-Language Information Retrieval (CLIR)
• Multimedia IR (image, speech, music, video)
• Semantic retrieval (XML, Semantic Web)
![Page 4: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/4.jpg)
4
Cross-Language Information Retrieval
• Cross Language Information Retrieval (CLIR) -- A technology enabling users to query in one language and retrieve relevant documents written or indexed in another language
![Page 5: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/5.jpg)
5
Cross Language Web Search
• A technology enabling users to query in one language and retrieve relevant Web pages written or indexed in another language
![Page 6: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/6.jpg)
6
Why “Cross-Language”?
• Source: Global Reach (global-reach.biz/globstats)
![Page 7: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/7.jpg)
7
Internet World Users by Language
![Page 8: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/8.jpg)
8
Top Ten Languages Used in the Web
Source: Internet World Stats (Mar. 31, 2009)

| Top Ten Languages in the Internet | Internet Users by Language | Internet Penetration by Language | Growth in Internet (2000-2008) | Internet Users % of Total | World Population for this Language (2008 Estimate) |
|---|---|---|---|---|---|
| English | 463,790,410 | 37.2 % | 226.7 % | 29.1 % | 1,247,862,351 |
| Chinese | 321,361,613 | 23.5 % | 894.8 % | 20.1 % | 1,365,138,028 |
| Spanish | 130,775,144 | 32.0 % | 619.3 % | 8.2 % | 408,760,807 |
| Japanese | 94,000,000 | 73.8 % | 99.7 % | 5.9 % | 127,288,419 |
| French | 73,609,362 | 17.8 % | 503.4 % | 4.6 % | 414,043,695 |
| Portuguese | 72,555,800 | 29.7 % | 857.7 % | 4.5 % | 244,080,690 |
| German | 65,243,673 | 67.7 % | 135.5 % | 4.1 % | 96,402,666 |
| Arabic | 41,396,600 | 14.2 % | 1,545.2 % | 2.6 % | 291,073,346 |
| Russian | 38,000,000 | 27.0 % | 1,125.8 % | 2.4 % | 140,702,094 |
| Korean | 36,794,800 | 51.9 % | 93.3 % | 2.3 % | 70,944,739 |
| Top 10 languages | 1,337,527,402 | 30.4 % | 329.2 % | 83.8 % | 4,406,296,835 |
| Rest of the languages | 258,742,706 | 11.2 % | 424.5 % | 16.2 % | 2,303,732,235 |
| World total | 1,596,270,108 | 23.8 % | 342.2 % | 100.0 % | 6,710,029,070 |

[Chart: Top Ten Languages Used in the Web (number of Internet users by language)]
More and more non-English users!
![Page 9: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/9.jpg)
9
Web Content
[Chart: Internet hosts (millions, axis from 0.1 to 100.0) by language, estimated by domain: English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish]
Source: Network Wizards Internet Domain Survey (Jan. 1999)
More and more non-English pages
![Page 10: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/10.jpg)
10
Chart of Web Content (by Language)
• Total Web pages: 313 B
– English 68.4%
– Japanese 5.9%
– German 5.8%
– Chinese 3.9%
– French 3.0%
– Spanish 2.4%
– Russian 1.9%
– Italian 1.6%
– Portuguese 1.4%
– Korean 1.3%
– Other 4.6%
[Source: Vilaweb.com, as quoted by eMarketer (Feb. 2001)]
![Page 11: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/11.jpg)
11
Language Percent of Public Sites
• English 72%
• German 7%
• Japanese 6%
• Spanish 3%
• French 3%
• Italian 2%
• Dutch 2%
• Chinese 2%
• Korean 1%
• Portuguese 1%
• Russian 1%
• Polish 1%
[Source: OCLC, 2002]
![Page 12: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/12.jpg)
12
Web Users and Pages (10 years ago)
| Area | Users | Web Pages | Time |
|---|---|---|---|
| World-wide | 150M | 800M | 7/99 |
| China | 4M | 2.5~3M | 7/99 |
| Taiwan | 4M | 3M | 7/99 |

Challenge of scalability!
Total users: 800M
Chinese users: 110M, including 87M (CN), 4.9M (HK), 11.6M (TW), 2.9M (MY), 2.14M (SG), 1.5M (US), and others.
Source: Global Reach, 2004
![Page 13: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/13.jpg)
13
Number of Chinese Web Pages
10,030,000,000 pages
Scalability problem!
![Page 14: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/14.jpg)
14
Number of Web Pages
The world’s largest search engine ?
Billions Of Textual Documents Indexed
December 1995-September 2003
KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista.
Source: Search Engine Watch (Nov. 2004)
| Search Engine | Reported Size | Page Depth |
|---|---|---|
| Google | 8.1 billion | 101K |
| MSN | 5.0 billion | 150K |
| Yahoo | 4.2 billion (estimate) | 500K |
| Ask Jeeves | 2.5 billion | 101K+ |
![Page 15: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/15.jpg)
15
Number of Web Pages
• Estimated size:
– Web pages in the world: 19.2 billion pages (indexed by Yahoo as of August 2005)
– Websites in the world: 70,392,567 websites (indexed by Netcraft as of August 2005)
– Web pages per website: 273 (19.2 billion / 70,392,567, rounded to the nearest whole number)
• Updated estimate:
– 231,510,169 distinct websites (as found by the Netcraft Web Server Survey in April 2009) [Source: http://news.netcraft.com/archives/web_server_survey.html]
– 63.2 billion pages (231,510,169 websites × 273 pages per website)
[Source: http://www.boutell.com/newfaq/misc/sizeofweb.html]
![Page 16: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/16.jpg)
16
Number of Web Pages
• 1 trillion unique URLs (We knew the web was big, by Jesse Alpert & Nissan Hajaj, Software Engineers, Web Search Infrastructure Team, 25 July 2008)
• 19,200,000,000 pages (Mayer, Tim, 8 August 2005, Our Blog is Growing Up And So Has Our Index)
• 320,000,000 pages (World Wide Web is 320 million and growing, BBC News Sci/Tech, 3 April 1998.)
• 1,000,000,000 pages (Internet. How much information? 2000. Regents of the University of California.)
• 800,000,000 pages (Maran, Ruth, and Paul Whitehead. "Web Pages." Internet and World Wide Web Simplified, 3rd ed. Foster City: IDG Books Worldwide, 1999. )
• 8,034,000,000 pages (Miller, Colleen. web sites: number of pages. NEC Research, IDC.)
[Source: http://hypertextbook.com/facts/2007/LorantLee.shtml]
![Page 17: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/17.jpg)
17
Challenge of Cross-Language Web Search
• Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup
• 81% of the search terms could not be found in common English-Chinese translation dictionaries
– Examples: 中央處理器 (CPU), 電子商務 (E-commerce), 個人數位助理 (PDA), 雅虎 (Yahoo), 太空總署 (NASA), 星際大戰 (Star Wars), 非典型肺炎 (SARS), …
![Page 18: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/18.jpg)
18
Challenge
• Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup
• 81% of the search terms could not be found in common English-Chinese translation dictionaries
• How can effective translations be found automatically for query terms that are not in a dictionary?
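The dictionary-lookup approach and its out-of-dictionary problem can be sketched as follows. This is only an illustration: the toy dictionary and the function name `translate_query` are assumptions, not part of any system named in the slides.

```python
# Toy bilingual dictionary; the entries are illustrative assumptions.
BILINGUAL_DICT = {
    "porcelain": ["瓷器", "瓷", "陶瓷"],
    "museum": ["博物館"],
}


def translate_query(query, dictionary):
    """Look up each query term; return (translations, unknown_terms)."""
    translations = {}
    unknown = []
    for term in query.lower().split():
        if term in dictionary:
            translations[term] = dictionary[term]
        else:
            # Out-of-dictionary term: needs another translation source.
            unknown.append(term)
    return translations, unknown


found, missing = translate_query("porcelain SARS", BILINGUAL_DICT)
print(found)
print(missing)
```

Terms that end up in `missing` are exactly the ones a Web-mining approach would have to handle.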
![Page 19: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/19.jpg)
19
Query Translation & CLIR in DL
[Diagram: a Chinese query (瓷器) is sent to mono-lingual document search over Chinese digital libraries]
Possible global use
![Page 20: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/20.jpg)
20
Query Translation & CLIR in DL
[Diagram: a Chinese query (瓷器) is sent to mono-lingual document search over Chinese digital libraries; the English query "Porcelain" has no path to them, marked "?"]
Need for CLIR services
![Page 21: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/21.jpg)
21
Query Translation & CLIR in DL
[Diagram: query translation maps the English query "Porcelain" to the Chinese queries 瓷器 / 瓷 / 陶瓷, which are sent to mono-lingual document search over Chinese digital libraries]
![Page 22: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/22.jpg)
22
Query Translation & CLIR in DL
[Diagram: query translation maps the English query "Porcelain" to the Chinese queries 瓷器 / 瓷 / 陶瓷, which are sent to mono-lingual document search over Chinese digital libraries]
Translation dictionaries are not cost-effective to construct
![Page 23: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/23.jpg)
23
Query Translation & CLIR in DL
[Diagram: query translation, now drawing on the Web, maps the English query "Porcelain" to the Chinese queries 瓷器 / 瓷 / 陶瓷, which are sent to mono-lingual document search over Chinese digital libraries]
Taking the Web as an online corpus to deal with the translation of unknown terms
![Page 24: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/24.jpg)
24
Query Translation & CLIR in DL
[Diagram: query translation, drawing on the Web, maps the English query "National Palace Museum" to 故宮 / 故宮博物院, which are sent to mono-lingual document search over Chinese digital libraries]
Online term translation suggestions
![Page 25: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/25.jpg)
25
Query Translation & CLIR in DL
[Diagram: English/Japanese/Korean queries pass through query translation, which uses translation lexicons auto-generated from the Web, and reach mono-lingual document search over Chinese digital libraries as Chinese queries such as 瓷器 / 瓷 / 陶瓷]
![Page 26: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/26.jpg)
26
CLIR
• Conventional approach to query translation
– Parallel documents as the corpus
– Assume long queries
• Problems of CLIR in digital libraries
– No corpus for cross-lingual training
– Short queries
– "Out-of-dictionary" terms (e.g., proper nouns, new terminologies, …)

| English Terminology | Chinese Translation |
|---|---|
| mechanical strain | 機械應變 |
| viscous damping | 黏滯阻尼 |
| Richard Feynman | 費曼 |
| Hypoplastic Left Heart Syndrome | 左心發育不全症候群 |
| NII Japan | 國立情報學研究所 |
| SARS | 嚴重急性呼吸道症候群 |
| Extracorporeal Shock Wave Lithotripsy | 震波碎石 |
| Da Vinci | 達文西 |
![Page 27: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/27.jpg)
27
Translation Lexicon Construction for CLIR
• To use the Web as the corpus for query translation
– Web mining techniques
  • Anchor-text-based [ACM TOIS '04, ACM TALIP '02]
  • Search-result-based [JCDL '04]
• To extract terms from real document collections as possible queries
– Term extraction method [SIGIR '97]
![Page 28: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/28.jpg)
28
Web Mining Approach to Term Translation Extraction
• LiveTrans: http://wkd.iis.sinica.edu.tw/LiveTrans/
[Diagram: the source query "Academia Sinica" is sent to the LiveTrans engine, which mines anchor texts and search results from the Web and returns the target translations 中央研究院 / 中研院]
![Page 29: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/29.jpg)
29
National Palace Museum vs. 故宮博物院: Search-Result Page
• Mixed-language characteristic in Chinese pages
• How to extract translation candidates?
• Which candidates to choose?
[Figure label: noise in the search-result page]
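One simple way to get translation candidates out of mixed-language search-result snippets is to collect the Chinese character n-grams that appear next to the source term and rank them by how many snippets contain them. The sketch below does only that; the snippets are made up for illustration, and plain frequency ranking is a stand-in for the similarity estimation the slides refer to, not the method of any cited paper.

```python
import re
from collections import Counter

# Runs of CJK unified ideographs.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

# Hypothetical search-result snippets for the source query.
SNIPPETS = [
    "國立故宮博物院 National Palace Museum 官方網站",
    "故宮博物院 (National Palace Museum) 參觀資訊",
    "台北故宮 National Palace Museum 門票",
]


def candidate_translations(query, snippets, min_len=2, max_len=6, top_k=3):
    """Rank Chinese n-grams by the number of snippets that contain both
    the n-gram and the source query."""
    counts = Counter()
    for snippet in snippets:
        if query.lower() not in snippet.lower():
            continue
        seen = set()  # count each n-gram once per snippet
        for run in CJK_RUN.findall(snippet):
            for n in range(min_len, max_len + 1):
                for i in range(len(run) - n + 1):
                    seen.add(run[i:i + n])
        counts.update(seen)
    return counts.most_common(top_k)


print(candidate_translations("National Palace Museum", SNIPPETS))
```

Frequent but unrelated n-grams (the noise) would also surface with real data, which is why a stronger similarity measure is needed to choose among candidates.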
![Page 30: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/30.jpg)
30
Yahoo vs. 雅虎 -- Anchor-Text Set
• Anchor text (link text)
– The descriptive text of a link on a Web page
• Anchor-text set
– A set of anchor texts pointing to the same page (URL)
– Multilingual translations
  • Yahoo / 雅虎 / 야후
  • America / 美国 / アメリカ
• Anchor-text-set corpus
– A collection of anchor-text sets
[Diagram: pages in America, Taiwan, China, Japan, and Korea link to http://www.yahoo.com with anchor texts such as "Yahoo Search Engine", 美国雅虎, 雅虎搜尋引擎, "Yahoo! America", 야후-USA, and アメリカの Yahoo!]
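The anchor-text-set idea can be illustrated with a few lines of code: group anchor texts by the URL they point to, then look at which other anchor texts share a set with a given term. The link list below is illustrative (the last entry is an assumption added for contrast), and the function names are hypothetical.

```python
from collections import defaultdict

# (anchor text, target URL) pairs, as a Web spider might collect them.
LINKS = [
    ("Yahoo Search Engine", "http://www.yahoo.com"),
    ("美国雅虎", "http://www.yahoo.com"),
    ("雅虎搜尋引擎", "http://www.yahoo.com"),
    ("야후-USA", "http://www.yahoo.com"),
    ("アメリカの Yahoo!", "http://www.yahoo.com"),
    ("Academia Sinica", "http://www.sinica.edu.tw"),  # illustrative extra link
]


def build_anchor_text_sets(links):
    """Map each URL to the set of anchor texts pointing to it."""
    sets = defaultdict(set)
    for text, url in links:
        sets[url].add(text)
    return dict(sets)


def cooccurring_anchor_texts(term, anchor_sets):
    """Anchor texts that share a set with `term` but do not contain it."""
    needle = term.lower()
    related = set()
    for texts in anchor_sets.values():
        if any(needle in t.lower() for t in texts):
            related |= {t for t in texts if needle not in t.lower()}
    return related


corpus = build_anchor_text_sets(LINKS)
print(cooccurring_anchor_texts("Yahoo", corpus))
```

The co-occurring texts are only raw material; extracting the actual translation (雅虎, 야후) from them still requires term extraction and scoring.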
![Page 31: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/31.jpg)
31
Term Translation Extraction from Different Resources
[Diagram: the source query ("National Palace Museum") is sent to a search engine, which returns search-result pages, while a Web spider builds an anchor-text corpus; term extraction and similarity estimation over both resources produce the target translations 國立故宮博物院, 故宮, 故宮博物院]
32
LiveTrans: Cross-language Web Search
![Page 33: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/33.jpg)
33
More Examples
![Page 34: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/34.jpg)
34
More Examples
![Page 35: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/35.jpg)
35
Multimedia IR
• Different forms of information need
• Image retrieval
• Speech information retrieval
• Music information retrieval
• Video information retrieval
![Page 36: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/36.jpg)
36
Image Retrieval
• Content-based
– Query by image content
  • Query by example (以圖找圖, "finding images with an image")
– Similarity in visual features
  • Color, texture, shape, …
– Relevance feedback
• Text-based
– Annotation
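Query by example on a color feature can be sketched with coarse color histograms compared by histogram intersection. This is a minimal, dependency-free illustration; the tiny "images" are lists of RGB tuples invented for the example, and histogram intersection is one common choice among many, not a method named in the slides.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Normalized histogram over quantized (r, g, b) values."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[idx] += 1
    return [h / len(pixels) for h in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_by_example(query_pixels, collection):
    """Return image names ordered from most to least similar."""
    q = color_histogram(query_pixels)
    scored = [(histogram_intersection(q, color_histogram(p)), name)
              for name, p in collection.items()]
    return [name for _, name in sorted(scored, reverse=True)]


WHITE, BLUE, GREEN = (250, 250, 250), (10, 10, 200), (20, 200, 20)
COLLECTION = {
    "snow": [WHITE] * 9 + [BLUE],   # hypothetical image data
    "forest": [GREEN] * 10,
}
QUERY = [WHITE] * 8 + [BLUE] * 2

print(rank_by_example(QUERY, COLLECTION))
```

Texture and shape features would be added as further vectors and combined with the color score in a fuller system.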
![Page 37: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/37.jpg)
37
Content-Based Image Retrieval (CBIR)
• Example systems
– CIRES (Content-based Image Retrieval System): http://amazon.ece.utexas.edu/~qasim/research.htm
– SIMPLIcity: http://www-db.stanford.edu/IMAGE/
– National Museum of History: http://210.201.141.12/cgi-bin/cbir-query.cgi?tid=-1
– …
![Page 38: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/38.jpg)
38
Relevance Feedback (RF)
[Figure: a query image and the similar images retrieved without relevance feedback (no RF)]
Source: Dr. Cheng
![Page 39: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/39.jpg)
39
Similar Images Using Relevance Feedback
[Figure: a query image and the similar images retrieved using RF]
![Page 40: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/40.jpg)
40
Automatic Image Annotation
[Figure: image banks with annotations such as "white bear snow tundra polar bears snow fight" and "polar bear ice snow"; keywords for an unlabeled image ("Keywords?") are to be derived through visual similarity]
Problem 1
![Page 41: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/41.jpg)
41
Spoken Document Retrieval
• Spoken document retrieval
– Indexing speech messages using speech recognition
– Retrieving relevant messages for a text/speech query
• Techniques
– Document processing: acoustic change detection, speech/non-speech detection, Mandarin/non-Mandarin detection, story segmentation, speaker recognition/clustering
– Speech recognition
– Indexing/retrieval
![Page 42: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/42.jpg)
42
SoVideo
http://slam.iis.sinica.edu.tw/demo.htm
![Page 43: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/43.jpg)
43
Music Information Retrieval
• Finding a song by similar melody
– Query by singing
– Query by humming
• Singer identification
– Background noise
– Singer voice model
• Demo:
– http://slam.iis.sinica.edu.tw/demo.htm
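Finding a song by similar melody can be approximated by comparing melodic contours rather than exact pitches, so that a hummed query in a different key still matches. The sketch below encodes pitch sequences as up/down/repeat steps and compares them with edit distance. The pitch lists (MIDI note numbers) are illustrative, and this contour-plus-edit-distance scheme is a simple stand-in, not the technique of the demo system linked in the slides.

```python
def contour(pitches):
    """Encode a pitch sequence as U (up), D (down), R (repeat) steps."""
    steps = []
    for prev, cur in zip(pitches, pitches[1:]):
        steps.append("U" if cur > prev else "D" if cur < prev else "R")
    return "".join(steps)


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def find_song(hummed_pitches, songs):
    """Return the song whose contour is closest to the hummed query."""
    query = contour(hummed_pitches)
    return min(songs, key=lambda name: edit_distance(query, contour(songs[name])))


SONGS = {  # opening notes as MIDI numbers (illustrative)
    "Twinkle": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64],
}
HUMMED = [55, 55, 62, 62, 64, 64, 62]  # same tune, transposed down

print(find_song(HUMMED, SONGS))
```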
![Page 44: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/44.jpg)
44
Video Information Retrieval
• Differences from CBIR
– Temporal information
– Structural organization
– Complexity of the querying system
• Techniques
– Video segmentation
– Keyframe identification
![Page 45: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/45.jpg)
45
Semantic Retrieval
• HTML vs. XML
• Semantic Web (Agent, Ontology, RDF)
![Page 46: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/46.jpg)
46
Common Language of the Web
• HTML
– Link: Pi → Pj
– URL (URI), anchor text
  • Part-of
[Example: the anchor texts "NTU" and "National Taiwan University" pointing to http://www.ntu.edu.tw/]
![Page 47: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/47.jpg)
47
Link Analysis – Hubs & Authorities in PageRank
[Diagram: a link graph whose nodes carry example scores (100, 9, 53, 50, 50, 50, 3, 3, 3)]
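As a rough illustration of link analysis, the following sketch computes PageRank by power iteration on a tiny graph. The graph, the damping factor of 0.85, and the iteration count are assumptions chosen for the example; the scores it produces are probabilities that sum to 1, not the values shown in the slide's diagram.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


GRAPH = {"A": ["C"], "B": ["C"], "C": ["A"]}  # hypothetical link graph
ranks = pagerank(GRAPH)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C ranks highest because two pages link to it, and A benefits from being the only page C links to.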
![Page 48: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/48.jpg)
48
Current Web Search
• Keyword-based search (e.g., Google)
– Full text indexing
– Page authority (link analysis)
– Page popularity (query log and users' clicks)
• Problems
– Not specific
  • Data in pages have no semantic annotations
  • Example: Yo-Yo Ma's most recent CD
– No topic disambiguation
  • Documents on different topics are mixed together
  • Example: Yo-Yo Ma's CDs, concerts, biography, gossip, …
![Page 49: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/49.jpg)
49
Search on Semantic Web
• Metadata search
– To increase precision and flexibility
• Topic-based search
– To help contextualize queries and overlay results in terms of a knowledge base
![Page 50: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/50.jpg)
50
XML (Extensible Markup Language)
• More flexible tags
• DTD (Document Type Definition)
– Definition of the tags
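To make the idea of flexible, self-describing tags concrete, here is a small sketch: a DTD that defines a made-up `catalog` vocabulary, an XML document using it, and a query that keyword search alone cannot answer precisely (an artist's most recent CD). All element names, album titles, and years are hypothetical. The DTD is shown only for reference; Python's standard `xml.etree.ElementTree` parser does not validate against it.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD: the definition of the tags (not validated here).
CATALOG_DTD = """
<!ELEMENT catalog (cd*)>
<!ELEMENT cd (artist, title, year)>
<!ELEMENT artist (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
"""

# Hypothetical data; titles and years are placeholders.
CATALOG_XML = """
<catalog>
  <cd><artist>Yo-Yo Ma</artist><title>Album A</title><year>2005</year></cd>
  <cd><artist>Yo-Yo Ma</artist><title>Album B</title><year>2008</year></cd>
  <cd><artist>Someone Else</artist><title>Album C</title><year>2009</year></cd>
</catalog>
"""


def most_recent_cd(xml_text, artist):
    """Return the title of the artist's newest CD, or None if there is none."""
    root = ET.fromstring(xml_text.strip())
    cds = [cd for cd in root.findall("cd") if cd.findtext("artist") == artist]
    if not cds:
        return None
    newest = max(cds, key=lambda cd: int(cd.findtext("year")))
    return newest.findtext("title")


print(most_recent_cd(CATALOG_XML, "Yo-Yo Ma"))
```

Because the year is tagged as a year, the query can compare values instead of matching keywords.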
![Page 51: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/51.jpg)
51
XML Search
• XML Text Search Engines
– Amberfish (Etymon)
– X3 (X-cubed) (DocSoft)
– UltraSeek (Verity)
• XML Structured Query Engines
– Fxgrep
– Cheshire II (UC Berkeley)
• XML Query Languages
– XQuery (W3C XMLQuery)
– XQL
– XML-QL
![Page 52: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/52.jpg)
52
Semantic Web
• "The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." -- Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001
![Page 53: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/53.jpg)
53
Semantic Web
[Diagram: the Semantic Web as several agents working over RDF and a shared ontology]
![Page 54: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/54.jpg)
54
Semantic Web
• RDF (Resource Description Framework)
– Common language
• Ontology
– Knowledge representation
• Agent
![Page 55: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/55.jpg)
55
Why Semantic Web?
• Standardizing knowledge sharing and reusability on the Web
• Interoperable (independent of devices and platforms)
• Machine readable—enabling intelligent processing of information
![Page 56: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/56.jpg)
56
An Example of Semantic Relation
[Diagram: semantic relations among author, work, and publisher, with edges labeled "written by" and "publish"]
![Page 57: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/57.jpg)
57
What is a Software Agent?
• A paradigm shift of information utilization from direct manipulation to indirect access and delegation
• A kind of middleware between information demand (client) and information supply (server)
• Software that is autonomous, personalized, adaptive, mobile, communicative, social, and capable of making decisions
![Page 58: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/58.jpg)
58
What is Ontology?
• An ontology is a formal and explicit specification of a shared conceptualization of a domain of interest (T. Gruber)
– Formal semantics
– Consensus of terms
– Machine readable and processable
– Model of the real world
– Domain specific
![Page 59: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/59.jpg)
59
What is Ontology?(2)
• Generalization of
– Entity relationship diagrams
– Object database schemas
– Taxonomies
– Thesauri
• Conceptualization contains phenomena like
– Concepts/classes/frames/entity types
– Constraints
– Axioms, rules
![Page 60: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/60.jpg)
60
Agents and Ontology
• Agents must have domain knowledge to solve domain-specific problems
• Agents must have common sharable ontology to communicate and share knowledge with each other
• The common sharable ontology must be represented in a standard format so that all software agents can understand and communicate
![Page 61: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/61.jpg)
61
Agents and Semantic Web
• The Semantic Web provides structure for the meaningful content of Web pages, so that software agents roaming from page to page can carry out sophisticated tasks
– An agent coming to a clinic's web page will know that Dr. Henry works at the clinic on Monday, Wednesday and Friday without having the full intelligence to understand the text…
– The assumption is that Dr. Henry makes the page using an off-the-shelf tool, along with the resources listed on the Physical Therapy Association's site
![Page 62: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/62.jpg)
62
Knowledge Representation on the Web
• The challenge of the Web is to provide a language to express both data and rules for reasoning about the data [meta-data] that allows rules from any existing knowledge representation system to be exported onto the Web
• Adding logic to the Web means using rules to make inferences, choose actions, and answer questions. The logic must be powerful enough to be useful, but not so complicated that agents can be led into considering a paradox
![Page 63: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/63.jpg)
63
Language Layers on the Web
[Diagram: language layers on the Web, including HTML, XML, XHTML, SMIL, RDF, PICS, DC, declarative languages (OIL, DAML+Ont), DAML-L (logic), and Trust]
Semantic Web infrastructure is built on the RDF data model
![Page 64: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/64.jpg)
64
Languages on the Web
• HTML + URL
• XML + DTD (Document Type Definition)
• RDF + RDF Schema
![Page 65: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/65.jpg)
65
Statements: RDF
• The basic structure of RDF is object-attribute-value
• In terms of a labeled graph: [O] –A→ [V]
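The object-attribute-value structure can be mimicked with plain tuples, which is enough to show how statements are stored and matched. This is a sketch only: the identifiers such as `book:1` and the attribute names are made up, and no RDF library or serialization syntax is involved.

```python
# Each statement is an (object, attribute, value) triple.
TRIPLES = [
    ("http://www.ntu.edu.tw/", "title", "National Taiwan University"),
    ("book:1", "written_by", "author:1"),       # hypothetical identifiers
    ("publisher:1", "publish", "book:1"),
]


def match(triples, obj=None, attr=None, value=None):
    """Return the triples that agree with every given (non-None) field."""
    return [
        (o, a, v)
        for (o, a, v) in triples
        if (obj is None or o == obj)
        and (attr is None or a == attr)
        and (value is None or v == value)
    ]


# Who or what is connected to book:1, in either direction?
print(match(TRIPLES, obj="book:1"))
print(match(TRIPLES, value="book:1"))
```

Because the value of one statement can be the object of another, a set of triples forms the labeled graph described above.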
![Page 66: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/66.jpg)
66
Semantic Web Search Engine
• Swoogle: http://swoogle.umbc.edu/ [CIKM 2004]
• SHOE (Simple HTML Ontology Extensions): http://www.cs.umd.edu/projects/plus/SHOE/search/
• SWSE: http://www.swse.org/
• http://www.semanticwebsearch.com/
![Page 67: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/67.jpg)
67
Applications to IR
• Advanced Google
• Meta Search
• Search Result Clustering
![Page 68: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/68.jpg)
68
What do Users Really Want?
• Topic-based vs. keyword-based
– "NTU"
• How to improve current search engines?
• Resources about search engines
– Search Engine Watch: http://searchenginewatch.com/
– Research Buzz: http://researchbuzz.com/
![Page 69: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/69.jpg)
69
Advanced Google
• Is Google good enough?
– "NTU"
– "NTU university"
– "NTU university Singapore"
• More and more services
– Google Web, Image, News, Video, Google Desktop Search, …
– Google Groups, Gmail, Google Talk, Google Calendar, …
– Google Mobile, Google SMS, Google Local, …
– Google Print (Book Search), Google Maps, Google Earth, …
– Google Scholar, Translate, Finance, Docs, Reader, …
• More about Google services
– http://www.google.com/options/
– Google Labs: http://labs.google.com/
![Page 70: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/70.jpg)
70
More Types of Document Search
• Google: Web, Image, News, Groups, Desktop (Office, mail), …
• Microsoft: +Lookout (mail)
• Yahoo: +Stata (mail), +Adobe (PDF)
![Page 71: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/71.jpg)
71
Searching Different Media
• Multimedia search: MP3, blog, messenger, mobile, …
– Baidu.com: MP3, image, news, …
– Singingfish.com (AOL): audio/video, …
– GoFish.com: audio, video, mobile, games
– AllTheWeb.com: pictures, audio, video, …
• Blog search engines
– Daypop, Bloogz, Waypath, …
• A9.com (by Amazon)
– Books, movies, …
– Bookmark, history, discover, diary
• Mobissimo.com
– Airfare search, hotel search
• Yahoo-OCLC toolbar: library search
– Searching Open WorldCat (OCLC union catalog)
![Page 72: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/72.jpg)
72
Different Forms of Presentation
• Clusty.com (by Vivisimo)
– Clustering engine
• Snap.com (by Idealab)
– Sorting by popularity, satisfaction, Web popularity, Web satisfaction, domain, …
• Alexa.com (by Amazon)
– Average user review ratings, …
• Visualization
– TouchGraph Google Browser: http://www.touchgraph.com/TGGoogleBrowser.html
– Kartoo.com: a visual meta search engine
– Girafa
– ConceptSpace
– LostGoggles (formerly MoreGoogle): thumbnail preview
![Page 73: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/73.jpg)
73
Focused Search Engines
• Scirus: http://scirus.landingzone.nl
– For scientific information only
• Google Scholar: http://scholar.google.com/
– For scholarly literature
![Page 74: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/74.jpg)
74
Some Google Hacks and Searching Tricks
• References:
– Tara Calishain and Rael Dornfest, "Google Hacks," O'Reilly
– Kevin Hemenway and Tara Calishain, "Spidering Hacks," O'Reilly
– http://douweosinga.com/projects/googlehacks
– Tara Calishain, "Web Search Garage," Prentice Hall
– Chris Sherman, "Google Power: Unleash the Full Potential of Google," McGraw Hill
![Page 75: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/75.jpg)
75
Further Utilizing Google…
• Google API: http://www.google.com/apis/
– 1,000 automated queries per day
• Google Hacks
– Google Talk
– Word Color
– Google Battle
– Google Date
– Google Best Time to Visit
– Google Protocol
• …
![Page 76: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/76.jpg)
76
Meta (Federated) Search
• To search several individual search engines and their databases of web pages simultaneously
– Ixquick, Metacrawler, Dogpile, …
• Clustering meta-searchers
– Vivisimo, KillerInfo, …
• Meta-search engines for deep digging
– SurfWax, Copernic Agent, …
![Page 77: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/77.jpg)
77
Meta Search Engine
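[Diagram: the user sends a query to the meta search engine, which forwards it to search engines SE1, SE2, …, SEn, each searching the Web]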
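The merging step of a meta search engine can be sketched as rank fusion over the result lists returned by the individual engines. The example uses reciprocal rank fusion, which is one common choice and not a method named in the slides; the result lists, the example.org URLs, and the constant `k=60` are all assumptions.

```python
def fuse(result_lists, k=60):
    """Merge ranked lists: each URL scores 1 / (k + rank) per list it appears in."""
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))


def url(name):
    return "http://example.org/" + name  # placeholder URLs


# Hypothetical results from three engines (SE1, SE2, SEn) for one query.
SE1 = [url("a"), url("b"), url("c")]
SE2 = [url("b"), url("a"), url("d")]
SEN = [url("b"), url("e")]

merged = fuse([SE1, SE2, SEN])
for position, u in enumerate(merged, start=1):
    print(position, u)
```

A URL returned by several engines accumulates score from each, so agreement between engines pushes a result up the merged list.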
![Page 78: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/78.jpg)
78
Search Result Clustering
• Why search result clustering?
• Why is SRC different from document clustering?
– In the assessment of an algorithm's quality
– Precision and recall vs. user-oriented, subjective assessment
![Page 79: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/79.jpg)
79
Example of Search Result Clustering
[Diagram: results for the ambiguous query "NTU?" grouped into clusters: National Taiwan University, NTU Hospital, and Nanyang Technological University, Singapore]
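A very small version of search result clustering groups result titles by word overlap. The sketch below uses Jaccard similarity against the first member of each cluster (a greedy, single-pass scheme); the result titles, the 0.3 threshold, and the algorithm itself are illustrative assumptions rather than what any of the listed engines does.

```python
import re


def tokens(text):
    """Lowercased word set of a title or snippet."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_results(snippets, threshold=0.3):
    """Assign each snippet to the most similar cluster leader, or start a new cluster."""
    clusters = []  # each cluster is a list; its first item is the leader
    for snippet in snippets:
        best, best_score = None, threshold
        for cluster in clusters:
            score = jaccard(tokens(snippet), tokens(cluster[0]))
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([snippet])
        else:
            best.append(snippet)
    return clusters


RESULTS = [  # hypothetical result titles for the query "NTU"
    "National Taiwan University",
    "NTU Hospital",
    "Nanyang Technological University, Singapore",
    "National Taiwan University Library",
    "NTU Hospital outpatient services",
    "Nanyang Technological University admissions",
]

for group in cluster_results(RESULTS):
    print(group)
```

Whether such groups are useful is ultimately judged by users, which is why the assessment is subjective rather than based on precision and recall alone.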
![Page 80: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/80.jpg)
80
Example Clustering Search Engines
• Vivisimo.com
– Clusty.com
• WebClust.com
• KillerInfo.com
• InfoNetWare.com
• SnakeT (Snippet Aggregation for Knowledge ExTraction): http://roquefort.unipi.it/
– A hierarchical clustering engine for snippets
• Mooter.com
• …
![Page 81: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/81.jpg)
81
Example on Vivisimo
![Page 82: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/82.jpg)
82
Vivisimo (cont.)
![Page 83: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/83.jpg)
83
Clusty.com
![Page 84: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/84.jpg)
84
InfoNetWare.com
![Page 85: Text Information Retrieval and Applications – Advanced Topics](https://reader036.vdocument.in/reader036/viewer/2022062315/5681505e550346895dbe5fff/html5/thumbnails/85.jpg)
85
Thanks for Your Attention!