TRANSCRIPT
Supporting the Automatic Construction of Entity Aware
Search Engines
Lorenzo Blanco, Valter Crescenzi,
Paolo Merialdo, Paolo Papotti
Dipartimento di Informatica e Automazione, Università degli Studi Roma Tre
Introduction
• A huge number of web sites publish pages based on data stored in databases
• Each of these pages often contains information about a single instance of a conceptual entity
[Diagram: the BasketballPlayer entity with attributes name, birthdate, college, weight, and height]
Introduction
[Screenshots of example player pages from http://www.nba.com/ and http://sports.espn.go.com/]
• We developed a system that:
  - takes as input a small set of sample pages from distinct web sites
  - automatically discovers pages containing data about other instances of the conceptual entity exemplified by the input samples
Overall Approach
• Given a bunch of sample pages
• crawl the web sites of the sample pages to gather other pages offering the same type of information
• extract a set of keywords that describe the underlying entity
• do
  - launch web searches to find other sources with pages that contain instances of the target entity
  - analyze the results to filter out irrelevant pages
  - crawl the new sources to gather new pages
• while new pages are found
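The loop above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: every component (site crawler, description extractor, identifier extractor, web search, result filter) is injected as a function, since each one is only described later in the talk.

```python
def discover_pages(sample_pages, crawl_site, extract_description,
                   extract_identifiers, web_search, matches):
    """Sketch of the discovery loop; every component is injected as a function."""
    pages = set()
    for sample in sample_pages:                 # crawl the seed sites
        pages |= crawl_site(sample)
    description = extract_description(pages)    # keywords shared by the templates
    new_pages = set(pages)
    while new_pages:                            # repeat while new pages are found
        idents = extract_identifiers(new_pages)
        hits = {h for i in idents for h in web_search(i, description)}
        relevant = {h for h in hits if matches(h, description)}
        new_pages = {p for h in relevant for p in crawl_site(h)} - pages
        pages |= new_pages
    return pages
```

Injecting toy components over a tiny two-site corpus shows the fixpoint behaviour: the loop stops as soon as an iteration discovers no page it has not already seen.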
Instance Identifiers: Alan Anderson, Mike Doucet, Ricky Dixon, Quentin Leday, Jarrett Lee, …
Site Crawler
• Goal: given one sample page, crawl its site to discover as many pages as possible that offer the same information
• A crawling algorithm scans the web site looking for pages that share the same structure as the input sample page
• The crawler also computes a set of strings representing meaningful identifiers for the entity instances (e.g. the athletes' names)
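The talk does not spell out the structural similarity measure at this point; a common, simple proxy is to compare the sets of root-to-node tag paths of two pages with the Jaccard coefficient. The sketch below is an assumed stand-in for illustration, not the authors' exact measure.

```python
from html.parser import HTMLParser

class TagPaths(HTMLParser):
    """Collects the set of root-to-node tag paths of an HTML document."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def structural_similarity(html_a, html_b):
    """Jaccard coefficient of the two pages' tag-path sets (1.0 = same structure)."""
    def paths(html):
        parser = TagPaths()
        parser.feed(html)
        return parser.paths
    a, b = paths(html_a), paths(html_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0
```

Two pages generated from the same template score close to 1 even when their data values differ, which is what lets the crawler keep template-mates of the sample page while discarding index and navigation pages.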
Crawling the seed sites
Crawler: intuition
• Given a sample page, the system explores the site structure looking for pages that work as indexes to "similar" pages
• The similarity between pages is measured by analyzing their structure
Extraction of the entity description
• On a web site, different instances of the same conceptual entity are likely to share a characterizing set of keywords
• These keywords usually appear in the page template
Extraction of the entity description
[Screenshot: keywords highlighted in a player page template (PPG, RPG, APG, EFF, Born, Height, Weight, College, Years Pro, photos, Buy, photo, E-mail)]
Extraction of the entity description
• For each known website we extract from its template a set of keywords
• The entity description is a set of keywords built by combining these sets
• We favour the more frequent terms
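One plausible reading of "combining these sets, favouring the more frequent terms" is to rank each term by the number of sites whose template contains it and keep the top-ranked ones. A minimal sketch; the `top_k` cutoff is an assumed parameter, not from the talk:

```python
from collections import Counter

def entity_description(site_keyword_sets, top_k=10):
    """Rank terms by how many sites' templates contain them; keep the top k."""
    counts = Counter(term for kws in site_keyword_sets for term in set(kws))
    return [term for term, _ in counts.most_common(top_k)]
```

Terms that recur across independent sites (attribute labels such as "Height") rise to the top, while site-specific navigation terms ("Buy", "E-mail") are pushed out of the description.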
Template Extraction: intuition
• To extract the terms of the template of a set of pages (from the same web site), the system analyzes the frequencies of the tokens (inspired by Arasu & Garcia-Molina, SIGMOD 2003)
Template Extraction: intuition
page 1:
<HTML>
 <DIV><A>Home</A><P>Sport!</P></DIV>
 <DIV><B>Weight</B><I>97</I></DIV>
 <DIV><B>Height</B><I>180</I></DIV>
 <DIV><B>Profile</B><SPAN>The career ...</SPAN></DIV>
</HTML>

page 2:
<HTML>
 <DIV><A>Home</A><P>Sport!</P></DIV>
 <DIV><B>Weight</B><I>136</I></DIV>
 <DIV><B>Height</B><I>212</I></DIV>
 <DIV><B>Profile</B><SPAN>Giant... Height...</SPAN></DIV>
</HTML>

Occurrences of "Height": page 1 at /html/body/div[3]/b; page 2 at /html/body/div[3]/b and /html/body/div[4]/span
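A drastically simplified version of the frequency idea (the cited EXALG-style analysis is far more sophisticated, using token roles and occurrence paths): treat a token as template material when it occurs with the same, nonzero frequency in every page of the set. Note that this naive variant would miss "Height" in the example above, since it also occurs in the free text of page 2; the path-based analysis is what resolves such cases.

```python
import re
from collections import Counter

def template_tokens(pages):
    """Tokens that occur with identical, nonzero frequency in every page."""
    counters = [Counter(re.findall(r"\w+", page)) for page in pages]
    shared = set(counters[0])
    for c in counters[1:]:
        shared &= set(c)                  # must occur in every page
    return {t for t in shared if len({c[t] for c in counters}) == 1}
```

Data values (97 vs. 136) differ across pages and are filtered out, while labels shared by both pages survive as template candidates.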
Launches searches on the Web
• For each entity identifier, the system launches one search on the web to discover new target pages
• To focus the searches, the query includes the entity description
Example query: Michael Jordan (identifier) + pts height weight min ast (entity description)
• We compute and check the template of each result
• Pages whose template contains terms that match the keywords of the entity description are considered instances of the entity
  - only a percentage of the terms is taken into account
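The filtering rule, as stated, accepts a result page when at least a given percentage of the entity-description terms appears among its template terms. A sketch with an assumed `min_fraction` threshold; varying this percentage is exactly what the experiments measure against recall and precision:

```python
def is_entity_page(page_template_terms, description, min_fraction=0.5):
    """Accept a page when at least min_fraction of the entity-description
    terms appear among the terms of its template."""
    if not description:
        return False
    matched = sum(1 for term in description if term in page_template_terms)
    return matched / len(description) >= min_fraction
```

A low threshold favours recall (more candidate sources pass); a high threshold favours precision (only pages whose templates closely match the description survive).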
Experiments
• We ran some experiments to analyze the approach. We focused on the sport domain, looking for pages containing data about the following entities:
  - Basketball player
  - Soccer player
  - Hockey player
  - Golf player
• We chose the sport domain because it is easy to:
  - interpret published data
  - evaluate the precision of results
Experiments: extracted entity descriptions
• All the terms can reasonably represent attribute names for the corresponding player entity
Experiments: using entity descriptions
• % of terms (used in filtering the Google results) vs. recall & precision
• 500 pages from 10 soccer web sites
• Google returned about 15,000 pages distributed over 4,000 distinct web sites
Experiments: pages found
• “Hockey player” entity
• 2 iterations of the cycle
• > 12,000 pages found
• > 5,000 distinct instances
Related work
• Our method is inspired by DIPRE (S. Brin, WebDB, 1998)
• Focused crawlers (S. Chakrabarti et al., Computer Networks, 1999)
  - typically rely on text classifiers to determine the relevance of the visited pages to the target topic
  - there are analogies with our approach, but we look for pages containing instances of an entity
• CIMPLE (A. Doan et al., SIGIR, 2006)
  - builds a platform to support the information needs of a virtual community
  - an expert is needed to provide relevant sources and design the E-R model of the domain of interest
Conclusions and future work
• We populated an entity aware search engine for sport fans using the facilities of Google Co-op: http://flint.dia.uniroma3.it/ (Demo section)
• To improve the entity description we are working on a probabilistic model to dynamically compute a weight for the terms of the page templates
• We are investigating the usage of automatic wrapping techniques to extract, mine and integrate data from the web pages collected by the proposed approach
Thank you!