Data-oriented Content Query System: Searching for Data into Text on the Web
Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang
WSDM 2010, New York, USA
Many Web applications try to exploit the “content” of Web pages.
Web Info Extraction
Typed Entity Search
Web-based Q/A
In most cases, what we really want are not pages, but the information units inside.
Specialized Information Extractors
Web Information Extraction (WIE) (Marius 2006, Cafarella 2005, Etzioni 2004)
Pattern: “X is CEO of Y”
Company    CEO
Google     Eric Schmidt
IBM        S. Palmisano
…          …
Limitation
• Focus on simple patterns.
• Lack of interactivity.
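The pattern “X is CEO of Y” above can be illustrated with a minimal sketch. This is not the implementation of any cited WIE system; the regular expression, the `NAME` rule (runs of capitalized tokens) and the function name `extract_ceo_pairs` are assumptions made for illustration only.

```python
import re

# Assumption: X and Y are runs of capitalized tokens (initials such as "S." allowed).
NAME = r"[A-Z][\w.]*(?:\s+[A-Z][\w.]*)*"
# Hypothetical regex rendering of the slide's pattern "X is CEO of Y".
CEO_PATTERN = re.compile(rf"(?P<ceo>{NAME})\s+is\s+CEO\s+of\s+(?P<company>{NAME})")


def extract_ceo_pairs(sentences):
    """Return (company, ceo) pairs for every match of the fixed pattern."""
    pairs = []
    for sentence in sentences:
        for match in CEO_PATTERN.finditer(sentence):
            pairs.append((match.group("company"), match.group("ceo")))
    return pairs


sentences = [
    "Eric Schmidt is CEO of Google",
    "S. Palmisano is CEO of IBM",
    "Google's chief executive is Eric Schmidt",  # same fact, different phrasing: no match
]
print(extract_ceo_pairs(sentences))
```

The third sentence states the same kind of fact but is missed, which is the sense in which a fixed, simple pattern is limiting.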
Web-based Question Answering (WQA) (Wu 2007, Lin 2003, Brill 2002)
Who is CEO of Dell?
Keywords: “CEO Dell”
Parse top-k results
Michael Dell
Limitation
• Relies only on the top-k pages to retrieve the answer.
Typed-Entity Search (TES) (Cheng 2007, Cafarella 2007, Chakrabarti 2006)
Amazon Phone
……
0.60
0.80
0.90
Ranked Entity List
But … where is the CEO?
Limitation
• Limited number of data types
• Lack of flexibility
Data-oriented Content Query System
Web Info Extraction
Typed Entity Search
Web-based QA
Requirements
1. Extensible Data Types
2. Flexible Contextual Patterns
3. Customizable Scoring
Input: CQL (Content Query Language)
Output
Entity Search
Web QA
Data-oriented Content Query System
What we need: Relational Model
Person
Organization
Location
Number
What we need: Relational Model
Find the population of China
FROM #number
WHERE pattern(…)
GROUP BY #number
ORDER BY conf()
China has a population of 1.3 billion
China with its population of 1.3 billion people
China is established in 1949.
Shanghai is the largest city with 15 million inhabitants in China
1.3 billion 15 million
1. 1.3 billion
2. 15 million
…
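A minimal sketch of how the query clauses could play out on the four example sentences follows. The slides elide the actual pattern and do not define conf(), so both are stand-ins here: the context keywords, the number regex, and the use of a supporting-sentence count as confidence are all assumptions, and `find_numbers` is a hypothetical name.

```python
import re
from collections import Counter

# Toy corpus: the four example sentences from the slide.
SENTENCES = [
    "China has a population of 1.3 billion",
    "China with its population of 1.3 billion people",
    "China is established in 1949.",
    "Shanghai is the largest city with 15 million inhabitants in China",
]

# Assumption: a #number instance is a numeral optionally followed by a scale word.
NUMBER = re.compile(r"\d+(?:\.\d+)?(?:\s+(?:thousand|million|billion))?")
# Assumption: stand-in for the elided pattern(…). A sentence qualifies when it
# mentions the entity and at least one of these hypothetical context keywords.
CONTEXT_KEYWORDS = ("population", "inhabitants")


def find_numbers(entity, sentences):
    """Mimic FROM #number WHERE ... GROUP BY #number ORDER BY conf() on a toy corpus."""
    support = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        if entity.lower() not in lowered:
            continue
        if not any(keyword in lowered for keyword in CONTEXT_KEYWORDS):
            continue
        for match in NUMBER.finditer(sentence):
            support[match.group()] += 1  # GROUP BY the number's surface form
    # Assumption: conf() is approximated by the count of supporting sentences.
    return support.most_common()


print(find_numbers("China", SENTENCES))
# [('1.3 billion', 2), ('15 million', 1)]
```

Under these assumptions "1949" is filtered out because its sentence has no context keyword, and "1.3 billion" outranks "15 million" because two sentences support it.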
What we need: Relational Model
Table: Number, Location, Person
View:
Number: Population, Phone, Price
Location: Capital, Headquarter
Person: Professor, CEO, President
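One way to picture the table/view analogy is a small type registry in which each view is a specialized type defined over a base type. This is only a sketch: the class names, the `keywords` field, and the `views_of` helper are hypothetical, and assigning Capital and Headquarter to Location and Professor, CEO and President to Person is a reading of the slide layout rather than something the slides state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataType:
    """A base data type, playing the role of a 'table'."""
    name: str


@dataclass(frozen=True)
class View:
    """A specialized data type defined over a base type, like a SQL view."""
    name: str
    base: DataType
    keywords: tuple  # hypothetical context keywords, not taken from the slides


NUMBER = DataType("number")
LOCATION = DataType("location")
PERSON = DataType("person")

VIEWS = [
    View("population", NUMBER, ("population",)),
    View("phone", NUMBER, ("phone",)),
    View("price", NUMBER, ("price",)),
    View("capital", LOCATION, ("capital",)),
    View("headquarter", LOCATION, ("headquarter",)),
    View("professor", PERSON, ("professor",)),
    View("ceo", PERSON, ("ceo",)),
    View("president", PERSON, ("president",)),
]


def views_of(base):
    """List the names of all views defined over the given base type."""
    return [view.name for view in VIEWS if view.base == base]


print(views_of(NUMBER))
```

Adding a new view is then a one-line registration against an existing base type, which is the kind of extensibility the first requirement asks for.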
Index Layer
Parsing Layer
Index Selection Module
Execution Tree
INPUT: SELECT … FROM … WHERE …
OUTPUT
Index Design: Special Inverted Index
• Contextual Index
• Join Index
Query Optimization: Graph Coverage Problem
Data Type Repository
Data Type Definition
Experimental Result
• Speed improvement: 6-10 times
• Space overhead: around 2 times the original corpus size.
Data-oriented Content Query System
Web Info Extraction
Typed Entity Search
Web-based Q/A