supporting the automatic construction of entity aware search engines lorenzo blanco, valter...

25
Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco , Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica e Automazione Università degli Studi Roma Tre

Upload: hector-obrien

Post on 16-Jan-2016

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Supporting the Automatic Construction of Entity Aware

Search Engines

Lorenzo Blanco, Valter Crescenzi,

Paolo Merialdo, Paolo Papotti

Dipartimento di Informatica e AutomazioneUniversità degli Studi Roma Tre

Page 2: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Introduction

• A huge number of web sites publish pages based on data stored in databases

• Each of these pages often contains information about a single instance of a conceptual entity

name birthdate

college

BasketballPlayer

weight height

Page 3: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Introduction

http://www.nba.com/ http://sports.espn.go.com/

Page 4: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

• We developed a system that: • taking as input a small set of sample pages

from distinct web sites• automatically discovers pages containing

data about other instances of the conceptual entity exemplified by the input samples

Introduction

Page 5: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Overall Approach

• Given a bunch of sample pages• crawl the web sites of the sample pages to

gather other pages offering the same type of information

• extract a set of keywords that describe the underlying entity

• do- launch web searches to find other sources with

pages that contain instances of the target entity- analyze the results to filter out irrelevant pages - crawl the new sources to gather new pages

• while new pages are found

Page 6: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Overall Approach

• Given a bunch of sample pages• crawl the web sites of the sample pages to

gather other pages offering the same type of information

• extract a set of keywords that describe the underlying entity

• do- launch web searches to find other sources with

pages that contain instances of the target entity- analyze the results to filter out irrelevant pages - crawl the new sources to gather new pages

• while new pages are found

Page 7: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Instance IdentifiersAlan AndersonMike DoucetRicky DixonQuentin LedayJarrett Lee…

site Crawler

• Goal: given one sample page, crawl its site to discover as many pages as possible that offer the same information

• A crawling algorithm scans the web site toward pages sharing the same structure of the input sample page

• The crawler also computes a set of strings representing meaningful identifiers for the entity instances (e.g. the athletes' names)

Crawling the seed sites

Page 8: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

• Given a sample page, the system explores the site structure looking for pages that work as indexes to "similar" pages

• The similarity between pages is measured analyzing their structure

Crawler: intuition

Page 9: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Overall Approach

• Given a bunch of sample pages• crawl the web sites of the sample pages to

gather other pages offering the same type of information

• extract a set of keywords that describe the underlying entity

• do- launch web searches to find other sources with

pages that contain instances of the target entity- analyze the results to filter out irrelevant pages - crawl the new sources to gather new pages

• while new pages are found

Page 10: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Extraction of the entity description

• On a web site, different instances of the same conceptual entity are likely to share a characterizing set of keywords

• It is usual that these keywords appear in the page template

Page 11: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Extraction of the entity description

PPGPPG

RPGRPG

APGAPG

EFFEFF

BornBorn

HeightHeight

WeightWeight

CollegeCollege

Years ProYears Pro

photosphotosBuyBuyphotophotoE-mailE-mail

Page 12: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Extraction of the entity description

• For each known website we extract from its template a set of keywords

• The entity description is a set of keywords built combining these sets

• We favour the more frequent terms

Page 13: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Template Extraction: intuitionTemplate Extraction: intuition

• To extract the terms of the template of a set of pages (from the same web site) the system analyzes the frequencies of the tokens

(inspired by Arasu&Garcia-Molina, Sigmod 2003)

Page 14: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Template Extraction: intuitionTemplate Extraction: intuition

<HTML> <DIV><A>Home</A><P>Sport!</P> </DIV> <DIV> <B>Weight</B><I>97</I> </DIV> <DIV> <B>Height</B><I>180</I> </DIV> <DIV> <B>Profile</B> <SPAN>The career ... </SPAN> </DIV></HTML>

<HTML> <DIV><A>Home</A><P>Sport!</P> </DIV> <DIV> <B>Weight</B><I>136</I> </DIV> <DIV> <B>Height</B><I>212</I> </DIV> <DIV> <B>Profile</B> <SPAN>Giant... Height... </SPAN> </DIV></HTML>

page 1 page 2

/html/body/div[3]/b /html/body/div[3]/b/html/body/div[4]/span

Page 15: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Overall Approach

• Given a bunch of sample pages• crawl the web sites of the sample pages to

gather other pages offering the same type of information

• extract a set of keywords that describe the underlying entity

• do- launch web searches to find other sources with

pages that contain instances of the target entity- analyze the results to filter out irrelevant pages - crawl the new sources to gather new pages

• while new pages are found

Page 16: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

• For each entity identifier, the system launches one search on the web to discover new target pages

• To focus the searches, the query includes the entity description

Launches searches on the WebLaunches searches on the Web

(identifier)Michael Jordan+

pts height weight

min ast

(entitydescription)

Page 17: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

• We compute and check template of each result

• The pages whose template contains terms that match with the set of keywords of the entity description are considered as instances of the entity- only a percentage of the

terms is taken into account

Page 18: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Overall Approach

• Given a bunch of sample pages• crawl the web sites of the sample pages to

gather other pages offering the same type of information

• extract a set of keywords that describe the underlying entity

• do- launch web searches to find other sources with

pages that contain instances of the target entity- analyze the results to filter out irrelevant pages - crawl the new sources to gather new pages

• while new pages are found

Page 19: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

ExperimentsExperiments

• We run some experiments to analyze the approach. We focused on the sport domain, looking for pages containing data about the following entities:- Basketball player- Soccer player- Hockey player- Golf player

• The sport domain as it is easy to:- interpret published data- evaluate precision of results

Page 20: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Experiments: Experiments: extracted entity descriptionsextracted entity descriptions

• All the terms can reasonably represent attribute names for the corresponding player entity

Page 21: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Experiments: Experiments: using entity descriptionsusing entity descriptions

• % of terms (used in the filtering of Google results) vs recall & precision

• 500 pages from 10 soccer web sites • Google returned about 15.000 pages distribuited over

4.000 distinct web sites

Page 22: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Experiments: pages found

• “Hockey player” entity• 2 iterations of the cycle• > 12,000 pages found• > 5,000 distinct instances

Page 23: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Related work

• Our method is inspired by DIPRE (S.Brin, WebDB, 1998)

• Focus crawlers (S.Chakrabarti et al., Computer Networks, 1999)- Typically rely on text classifiers to determine the

relevance of the visited pages to the target topic- Analogies, but we look for pages containing

instances of an entity

• CIMPLE (A.Doan et al., SIGIR, 2006)- Building a platform to support the information needs

of a virtual community- An expert is needed to provide relevant sources and

design the E-R model of the domain of interest

Page 24: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Conclusions and future workConclusions and future work

• We populated an entity aware search engine for sport fans. We used the facilities of Google Co-op:http://flint.dia.uniroma3.it/ (Demo section)

• To improve the entity description we are working on a probabilistic model to dynamically compute a weight for the terms of the page templates

• We are investigating the usage of automatic wrapping techniques to extract, mine and integrate data from the web pages collected by the proposed approach

Page 25: Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica

Thank you!