TRANSCRIPT
Supporting the Automatic Construction of Entity Aware
Search Engines
Lorenzo Blanco, Valter Crescenzi,
Paolo Merialdo, Paolo Papotti
Dipartimento di Informatica e Automazione, Università degli Studi Roma Tre
Introduction
• A huge number of web sites publish pages based on data stored in databases
• Each of these pages often contains information about a single instance of a conceptual entity
[Diagram: the BasketballPlayer entity with attributes name, birthdate, college, weight, and height]
Introduction
[Screenshots of example player pages from http://www.nba.com/ and http://sports.espn.go.com/]
• We developed a system that:
  - takes as input a small set of sample pages from distinct web sites
  - automatically discovers pages containing data about other instances of the conceptual entity exemplified by the input samples
Overall Approach
• Given a bunch of sample pages
• crawl the web sites of the sample pages to gather other pages offering the same type of information
• extract a set of keywords that describe the underlying entity
• do
  - launch web searches to find other sources with pages that contain instances of the target entity
  - analyze the results to filter out irrelevant pages
  - crawl the new sources to gather new pages
• while new pages are found
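The loop above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: every component (site crawler, description extractor, identifier extractor, web search, result filter) is injected as a function, since each one is only described later in the talk.

```python
def discover_pages(sample_pages, crawl_site, extract_description,
                   extract_identifiers, web_search, matches):
    """Sketch of the discovery loop; every component is injected as a function."""
    pages = set()
    for sample in sample_pages:                 # crawl the seed sites
        pages |= crawl_site(sample)
    description = extract_description(pages)    # keywords shared by the templates
    new_pages = set(pages)
    while new_pages:                            # repeat while new pages are found
        idents = extract_identifiers(new_pages)
        hits = {h for i in idents for h in web_search(i, description)}
        relevant = {h for h in hits if matches(h, description)}
        new_pages = {p for h in relevant for p in crawl_site(h)} - pages
        pages |= new_pages
    return pages
```

Injecting toy components over a tiny two-site corpus shows the fixpoint behaviour: the loop stops as soon as an iteration discovers no page it has not already seen.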
Instance Identifiers: Alan Anderson, Mike Doucet, Ricky Dixon, Quentin Leday, Jarrett Lee, …
Site Crawler
• Goal: given one sample page, crawl its site to discover as many pages as possible that offer the same information
• A crawling algorithm scans the web site looking for pages that share the same structure as the input sample page
• The crawler also computes a set of strings representing meaningful identifiers for the entity instances (e.g. the athletes' names)
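The talk does not spell out the structural similarity measure at this point; a common, simple proxy is to compare the sets of root-to-node tag paths of two pages with the Jaccard coefficient. The sketch below is an assumed stand-in for illustration, not the authors' exact measure.

```python
from html.parser import HTMLParser

class TagPaths(HTMLParser):
    """Collects the set of root-to-node tag paths of an HTML document."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def structural_similarity(html_a, html_b):
    """Jaccard coefficient of the two pages' tag-path sets (1.0 = same structure)."""
    def paths(html):
        parser = TagPaths()
        parser.feed(html)
        return parser.paths
    a, b = paths(html_a), paths(html_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0
```

Two pages generated from the same template score close to 1 even when their data values differ, which is what lets the crawler keep template-mates of the sample page while discarding index and navigation pages.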
Crawling the seed sites
Crawler: intuition
• Given a sample page, the system explores the site structure looking for pages that work as indexes to "similar" pages
• The similarity between pages is measured by analyzing their structure
Extraction of the entity description
• On a web site, different instances of the same conceptual entity are likely to share a characterizing set of keywords
• These keywords usually appear in the page template
Extraction of the entity description
[Screenshot: keywords highlighted in a player page template (PPG, RPG, APG, EFF, Born, Height, Weight, College, Years Pro, photos, Buy, photo, E-mail)]
Extraction of the entity description
• For each known website we extract from its template a set of keywords
• The entity description is a set of keywords built by combining these sets
• We favour the more frequent terms
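One plausible reading of "combining these sets, favouring the more frequent terms" is to rank each term by the number of sites whose template contains it and keep the top-ranked ones. A minimal sketch; the `top_k` cutoff is an assumed parameter, not from the talk:

```python
from collections import Counter

def entity_description(site_keyword_sets, top_k=10):
    """Rank terms by how many sites' templates contain them; keep the top k."""
    counts = Counter(term for kws in site_keyword_sets for term in set(kws))
    return [term for term, _ in counts.most_common(top_k)]
```

Terms that recur across independent sites (attribute labels such as "Height") rise to the top, while site-specific navigation terms ("Buy", "E-mail") are pushed out of the description.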
Template Extraction: intuition
• To extract the terms of the template of a set of pages (from the same web site), the system analyzes the frequencies of the tokens (inspired by Arasu & Garcia-Molina, SIGMOD 2003)
Template Extraction: intuition
page 1:
<HTML>
 <DIV><A>Home</A><P>Sport!</P></DIV>
 <DIV><B>Weight</B><I>97</I></DIV>
 <DIV><B>Height</B><I>180</I></DIV>
 <DIV><B>Profile</B><SPAN>The career ...</SPAN></DIV>
</HTML>

page 2:
<HTML>
 <DIV><A>Home</A><P>Sport!</P></DIV>
 <DIV><B>Weight</B><I>136</I></DIV>
 <DIV><B>Height</B><I>212</I></DIV>
 <DIV><B>Profile</B><SPAN>Giant... Height...</SPAN></DIV>
</HTML>

Occurrences of "Height": page 1 at /html/body/div[3]/b; page 2 at /html/body/div[3]/b and /html/body/div[4]/span
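A drastically simplified version of the frequency idea (the cited EXALG-style analysis is far more sophisticated, using token roles and occurrence paths): treat a token as template material when it occurs with the same, nonzero frequency in every page of the set. Note that this naive variant would miss "Height" in the example above, since it also occurs in the free text of page 2; the path-based analysis is what resolves such cases.

```python
import re
from collections import Counter

def template_tokens(pages):
    """Tokens that occur with identical, nonzero frequency in every page."""
    counters = [Counter(re.findall(r"\w+", page)) for page in pages]
    shared = set(counters[0])
    for c in counters[1:]:
        shared &= set(c)                  # must occur in every page
    return {t for t in shared if len({c[t] for c in counters}) == 1}
```

Data values (97 vs. 136) differ across pages and are filtered out, while labels shared by both pages survive as template candidates.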
Launches searches on the Web
• For each entity identifier, the system launches one search on the web to discover new target pages
• To focus the searches, the query includes the entity description
Example query: Michael Jordan (identifier) + pts height weight min ast (entity description)
• We compute and check the template of each result
• Pages whose template contains terms that match the keywords of the entity description are considered instances of the entity
  - only a percentage of the terms is taken into account
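The filtering rule, as stated, accepts a result page when at least a given percentage of the entity-description terms appears among its template terms. A sketch with an assumed `min_fraction` threshold; varying this percentage is exactly what the experiments measure against recall and precision:

```python
def is_entity_page(page_template_terms, description, min_fraction=0.5):
    """Accept a page when at least min_fraction of the entity-description
    terms appear among the terms of its template."""
    if not description:
        return False
    matched = sum(1 for term in description if term in page_template_terms)
    return matched / len(description) >= min_fraction
```

A low threshold favours recall (more candidate sources pass); a high threshold favours precision (only pages whose templates closely match the description survive).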
Experiments
• We ran some experiments to analyze the approach. We focused on the sport domain, looking for pages containing data about the following entities:
  - Basketball player
  - Soccer player
  - Hockey player
  - Golf player
• We chose the sport domain because it is easy to:
  - interpret published data
  - evaluate the precision of results
Experiments: extracted entity descriptions
• All the terms can reasonably represent attribute names for the corresponding player entity
Experiments: using entity descriptions
• % of terms (used in filtering the Google results) vs. recall & precision
• 500 pages from 10 soccer web sites
• Google returned about 15,000 pages distributed over 4,000 distinct web sites
Experiments: pages found
• “Hockey player” entity
• 2 iterations of the cycle
• > 12,000 pages found
• > 5,000 distinct instances
Related work
• Our method is inspired by DIPRE (S. Brin, WebDB, 1998)
• Focused crawlers (S. Chakrabarti et al., Computer Networks, 1999)
  - typically rely on text classifiers to determine the relevance of the visited pages to the target topic
  - there are analogies with our approach, but we look for pages containing instances of an entity
• CIMPLE (A. Doan et al., SIGIR, 2006)
  - builds a platform to support the information needs of a virtual community
  - an expert is needed to provide relevant sources and design the E-R model of the domain of interest
Conclusions and future work
• We populated an entity aware search engine for sport fans using the facilities of Google Co-op: http://flint.dia.uniroma3.it/ (Demo section)
• To improve the entity description we are working on a probabilistic model to dynamically compute a weight for the terms of the page templates
• We are investigating the usage of automatic wrapping techniques to extract, mine and integrate data from the web pages collected by the proposed approach
Thank you!