data-oriented content query system: searching for data into text on the web
DESCRIPTION
Data-oriented Content Query System: Searching for Data into Text on the Web. Mianwei Zhou, Kevin Chen-Chuan Chang Department of Computer Science UIUC. Many Web Applications Try to Exploit the “Content” of Web Pages. - PowerPoint PPT PresentationTRANSCRIPT
1
Data-oriented Content Query System: Searching for Data into Text on the Web
Mianwei Zhou, Kevin Chen-Chuan ChangDepartment of Computer ScienceUIUC
Many Web Applications Try to Exploit the “Content” of Web Pages.
Web Info Extractio
n
Typed Entity Search
Web-based
Q/A
In most cases, what we really want are not pages, but the information units inside.
? ?
2
3
Content-Exploiting Application 1:
Web Information Extraction
Specialized Information Extractors
Web Information Extraction (WIE)(Marius 2006, Cafarella 2005, Etzioni 2004)
Pattern: “#Number
people die of #Disease each
year”
Disease DeathInfluenza 63730 Penumonia
61776
… …Limitation
•Focus on simple patterns.
•Lack of interactivity.
Content-Exploiting Application 2:
Web-based Question AnsweringWeb-based Question Answering (WQA)
(Wu 2007, Lin 2003, Brill 2002)How many
people die from seasonal flue each year in
US? Keywords:“seasonal flu death”
Parse Top-k results
Around 36,000
Limitation
•Only rely on top-k pages to
retrieve the answer.
4
Content-Exploiting Application 3:
Typed-Entity SearchTyped-Entity Search (TES)
(Cheng 2007, Cafarella 2007, Chakrabarti 2006)
Amazon Phone
……0.60
0.80
0.90
Ranked Entity List
But … Where is Professor
Limitation
•Limited Number of Data Type
•Lack of Flexibility
5
General System for Querying Text “Content”, Much Like How DBMS Supports Data Application
? ?Data-oriented Content Query System
Web Info Extractio
n
Typed Entity Search
Web-based QA
Requirements1. Extensible Data Types2. Flexible Contextual
Patterns3. Customizable Scoring
System FrameworkInput: CQL
Output: Table
Input: CQL (Content Query Language)
OutputExpert Users
Entity Searc
hWeb QA
Common Users
Data-oriented Content Query System
Why Relational Model:Occurrence->Tuple
What we need Relational Model
Person OrganizationLocation
Number
Number
Person
Organization
Location
What we need
Why Relational Model: Content Query->Relational Operation
Relational Model
Find the population of China
WHERE pattern(…)
GROUP BY #number
ORDER BY conf()
FROM #number
China has a population of 1.3 billion
China with its population of 1.3 billion people
China is established in 1949.
Shanghai is the largest city with 15 million inhabitants in China
1.3 billion 15 million
1. 1.3 billion
2. 15 million
…
What we need
Why Relational ModelExtensible Data Type->View
Relational Model
Number
Location
Person
PopulationPhonePrice
CapitalHeadquarter
ProfessorCEOPresident
TableView
Number
population
price
phone
Main Technique: Highly Efficient and Scalable Index Structure
Index Layer
Parsing Layer
Index Selection Module
Execution Tree
INPUTSELECT
…FROM …WHERE
…
OUTPUT
Index DesignSpecial Inverted
Index•Contextual Index•Join Index
Query OptimizationGraph Coverage
Problem
Data Type
Repository
Data Type Definition
Experimental Result
• Speed improvement: 6-10
times•Space overhead: Around 2
times original corpus size.
Data-oriented Content Query System: Supporting Ad-hoc, Interactive “Content” Search.
Data-oriented Content Query System
Web Info Extraction
Typed Entity Search
Web-based Q/A
Interactive
Ad-hoc
14
Q & AThank You!