
Supporting the Automatic Construction of Entity Aware Search Engines

Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti

Dipartimento di Informatica e Automazione, Università degli Studi Roma Tre

Introduction

• A huge number of web sites publish pages based on data stored in databases

• Each of these pages often contains information about a single instance of a conceptual entity

[Diagram: a BasketballPlayer entity with attributes name, birthdate, college, weight, height]

Introduction

http://www.nba.com/ http://sports.espn.go.com/

• We developed a system that:
- takes as input a small set of sample pages from distinct web sites
- automatically discovers pages containing data about other instances of the conceptual entity exemplified by the input samples

Overall Approach

• Given a bunch of sample pages

• crawl the web sites of the sample pages to gather other pages offering the same type of information

• extract a set of keywords that describe the underlying entity

• do
- launch web searches to find other sources with pages that contain instances of the target entity
- analyze the results to filter out irrelevant pages
- crawl the new sources to gather new pages

• while new pages are found
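The loop above can be sketched in Python. All four callables (crawl_site, extract_description, search, is_relevant) are hypothetical stand-ins for the system's components, not an API the authors define:

```python
def discover_pages(seed_pages, crawl_site, extract_description, search, is_relevant):
    """Sketch of the overall approach (hypothetical signatures).

    crawl_site(page)         -> (pages, identifiers) found on that page's site
    extract_description(pgs) -> set of keywords describing the entity
    search(ident, desc)      -> candidate result pages from one web search
    is_relevant(page, desc)  -> True if the page's template matches the description
    """
    pages, identifiers = set(), set()
    # 1. Crawl the sites of the sample pages.
    for seed in seed_pages:
        new_pages, new_ids = crawl_site(seed)
        pages |= new_pages
        identifiers |= new_ids
    # 2. Extract the entity description from the gathered pages.
    description = extract_description(pages)
    # 3. Search / filter / crawl, while new pages are found.
    while True:
        found = set()
        for ident in identifiers:
            for result in search(ident, description):
                if is_relevant(result, description) and result not in pages:
                    found.add(result)
        if not found:
            break
        for page in found:
            new_pages, new_ids = crawl_site(page)
            pages |= new_pages | {page}
            identifiers |= new_ids
    return pages
```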

Overall Approach

• Given a bunch of sample pages• crawl the web sites of the sample pages to

gather other pages offering the same type of information

• extract a set of keywords that describe the underlying entity

• do- launch web searches to find other sources with

pages that contain instances of the target entity- analyze the results to filter out irrelevant pages - crawl the new sources to gather new pages

• while new pages are found

Instance identifiers: Alan Anderson, Mike Doucet, Ricky Dixon, Quentin Leday, Jarrett Lee, …

Site Crawler

• Goal: given one sample page, crawl its site to discover as many pages as possible that offer the same kind of information

• A crawling algorithm scans the web site toward pages sharing the same structure as the input sample page

• The crawler also computes a set of strings representing meaningful identifiers for the entity instances (e.g. the athletes' names)

Crawling the seed sites

• Given a sample page, the system explores the site structure looking for pages that act as indexes to "similar" pages

• The similarity between pages is measured by analyzing their structure

Crawler: intuition
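One plausible way to instantiate the structural similarity used by the crawler is a Jaccard measure over the sets of root-to-tag paths of two pages; the slides do not fix the exact measure, so this sketch is an assumption:

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Collects the set of root-to-tag paths occurring in a page."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def structural_similarity(html_a, html_b):
    """Jaccard similarity between the tag-path sets of two pages."""
    sets = []
    for html in (html_a, html_b):
        collector = PathCollector()
        collector.feed(html)
        sets.append(collector.paths)
    a, b = sets
    return len(a & b) / len(a | b) if a | b else 1.0
```

Two detail pages generated from the same template score close to 1, while a detail page and an unrelated index page share few paths and score low.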

Overall Approach (recap): extract a set of keywords that describe the underlying entity

Extraction of the entity description

• On a web site, different instances of the same conceptual entity are likely to share a characterizing set of keywords

• These keywords usually appear in the page template

Extraction of the entity description

[Slide: a sample player page with template terms highlighted: PPG, RPG, APG, EFF, Born, Height, Weight, College, Years Pro, photos, Buy, photo, E-mail]

Extraction of the entity description

• For each known web site we extract a set of keywords from its template

• The entity description is a set of keywords built by combining these sets

• We favour the more frequent terms
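A minimal sketch of combining the per-site keyword sets, favouring frequent terms. The slides only say frequent terms are favoured; keeping a term when it appears in at least half of the sites is an assumed concrete rule:

```python
from collections import Counter

def entity_description(site_keyword_sets, min_fraction=0.5):
    """Combine per-site template keyword sets into one entity description.

    A term is kept if it appears in at least min_fraction of the known
    sites (the threshold is an assumption, not taken from the slides).
    """
    counts = Counter()
    for keywords in site_keyword_sets:
        # Count each site at most once per term.
        counts.update({kw.lower() for kw in keywords})
    n = len(site_keyword_sets)
    return {term for term, c in counts.items() if c / n >= min_fraction}
```

Site-specific noise such as "Buy" or "E-mail" appears on few sites and is dropped, while attribute names like "Height" survive.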

Template Extraction: intuition

• To extract the terms of the template of a set of pages (from the same web site), the system analyzes the frequencies of the tokens

(inspired by Arasu & Garcia-Molina, SIGMOD 2003)

Template Extraction: intuition

page 1:
<HTML> <DIV><A>Home</A><P>Sport!</P></DIV> <DIV><B>Weight</B><I>97</I></DIV> <DIV><B>Height</B><I>180</I></DIV> <DIV><B>Profile</B><SPAN>The career ...</SPAN></DIV> </HTML>

page 2:
<HTML> <DIV><A>Home</A><P>Sport!</P></DIV> <DIV><B>Weight</B><I>136</I></DIV> <DIV><B>Height</B><I>212</I></DIV> <DIV><B>Profile</B><SPAN>Giant... Height...</SPAN></DIV> </HTML>

The token "Height" occurs at /html/body/div[3]/b in both pages (a template occurrence), while in page 2 it also occurs at /html/body/div[4]/span, inside instance-specific text.
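The intuition above can be sketched as follows: collect (path, token) pairs per page, and keep the tokens whose pair occurs in every page of the site. This is a simplified reading of the frequency-based idea (paths here omit sibling indices such as div[3], an assumption of the sketch):

```python
from collections import Counter
from html.parser import HTMLParser

class TokenPaths(HTMLParser):
    """Records (path, token) pairs for the text tokens of a page."""
    def __init__(self):
        super().__init__()
        self.stack, self.pairs = [], set()
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
    def handle_data(self, data):
        path = "/".join(self.stack)
        for token in data.split():
            self.pairs.add((path, token))

def template_tokens(pages):
    """Tokens occurring with the same path in every page are template terms."""
    counts = Counter()
    for html in pages:
        tp = TokenPaths()
        tp.feed(html)
        counts.update(tp.pairs)
    return {token for (path, token), c in counts.items() if c == len(pages)}
```

On the two sample pages above, "Weight" and "Height" occur at the same paths in both pages and are classified as template terms; the instance values (97, 136, …) are not.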

Overall Approach (recap): launch web searches to find other sources with pages that contain instances of the target entity

Launching searches on the Web

• For each entity identifier, the system launches one search on the web to discover new target pages

• To focus the searches, the query includes the entity description

Example query: "Michael Jordan" (identifier) + pts height weight min ast (entity description)

• We compute and check the template of each result

• The pages whose template contains terms that match the set of keywords of the entity description are considered instances of the entity
- only a percentage of the terms is taken into account
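The filtering step can be sketched as a threshold on the fraction of entity-description terms found in a result's template. The slides only say that "a percentage of the terms" is required; the 0.6 threshold below is an assumed value:

```python
def matches_entity(page_template_terms, entity_description, threshold=0.6):
    """Keep a search result if its template contains at least a given
    fraction of the entity-description keywords (threshold is assumed)."""
    terms = {t.lower() for t in page_template_terms}
    desc = {t.lower() for t in entity_description}
    if not desc:
        return False
    return len(desc & terms) / len(desc) >= threshold
```

A player page whose template contains most of the attribute names passes the filter; an unrelated shop page does not.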

Overall Approach (recap): crawl the new sources to gather new pages, and repeat while new pages are found

Experiments

• We ran some experiments to analyze the approach. We focused on the sport domain, looking for pages containing data about the following entities:
- Basketball player
- Soccer player
- Hockey player
- Golf player

• We chose the sport domain because it is easy to:
- interpret the published data
- evaluate the precision of the results

Experiments: extracted entity descriptions

• All the terms can reasonably represent attribute names for the corresponding player entity

Experiments: using entity descriptions

• % of terms (used in the filtering of Google results) vs recall & precision

• 500 pages from 10 soccer web sites

• Google returned about 15,000 pages distributed over 4,000 distinct web sites

Experiments: pages found

• “Hockey player” entity
• 2 iterations of the cycle
• > 12,000 pages found
• > 5,000 distinct instances

Related work

• Our method is inspired by DIPRE (S. Brin, WebDB, 1998)

• Focused crawlers (S. Chakrabarti et al., Computer Networks, 1999)
- Typically rely on text classifiers to determine the relevance of the visited pages to the target topic
- There are analogies, but we look for pages containing instances of an entity

• CIMPLE (A. Doan et al., SIGIR, 2006)
- Builds a platform to support the information needs of a virtual community
- An expert is needed to provide relevant sources and design the E-R model of the domain of interest

Conclusions and future work

• We populated an entity aware search engine for sport fans, using the facilities of Google Co-op: http://flint.dia.uniroma3.it/ (Demo section)

• To improve the entity description, we are working on a probabilistic model to dynamically compute a weight for the terms of the page templates

• We are investigating the use of automatic wrapping techniques to extract, mine, and integrate data from the web pages collected by the proposed approach

Thank you!
