Semex: a Platform for Personal Information Management and Integration
Alon Halevy
University of Washington (On Sabbatical @ Stanford & Transformic Inc.)
April 15, 2005
NY Area DB/IR Day
Joint work with: Luna Dong, Jayant Madhavan
Did You mean: DB and IR, DB or IR?
Personal information management: Pushes the limits on DB&IR.
Demonstrate some DB&IR issues through the Semex Project.
NSF starting to get interested in PIM: First brainstorming workshop in January. People from DB, IR, HCI, Psychology.
What is PIM?
Questions I Can’t Answer
Find my VLDB04 paper, and the PowerPoint (maybe in an attachment?).
Find emails from my Californian friends.
Which paper by Ken Ross did I cite in my latest SIGMOD paper?
What quarter was Mary in my class and what grade did she get?
Which experiment did I run with NF1 and which emails discussed them?
Why?
HTML, Mail & calendar, Papers, Files, Presentations
Information is organized by application, not by any semantically meaningful logical organization.
Vannevar Bush said this in 1945: Personal Memex.
Figure labels (associations): OriginatedFrom, EarlyVersion, PublishedIn, ConfHomePage, ExperimentOf, PaperAbout, BudgetOf, Sender, Recipient, CourseGradeIn, AddressOf, Attached, Cites, PresentationFor, CoAuthor, FrequentEmailer, HomePage
Association queries
Miller: Barton Miller, R. Miller
R. Miller: Articles, Contact info
Article: “Data driven understanding and refinement of schema mapping”
IsCitedBy / Cites
Article: “The Piazza Peer-data Management Project”
PIM vs. Web Search
"But there's a fundamental difference between searching a universe of documents created by strangers and searching your own personal library. When you're freewheeling through ideas that you yourself have collated -- particularly when you'd long ago forgotten about them -- there's something about the experience that seems uncannily like freewheeling through the corridors of your own memory. It feels like thinking."
Steven Johnson, New York Times, January 30, 2005
Semex Over-arching Goals
Create an ‘AHA!’ experience with a PIM system
  “How did I ever live without this?”
  Extensible to arbitrary associations.
Leverage the PIM environment and knowledge to increase productivity in other tasks
  Build a platform for <your cool stuff here>
Leveraging Semex: On-the-Fly Data Integration
Who published at SIGMOD but was not recently on the PC?
Partial Success of DI
EII: Enterprise Information Integration
  Starting to catch on.
  See SIGMOD-05 industrial paper for good perspectives.
Mostly in applications such as:
  Customer Relationship Management
  Portal construction
  Frequently occurring queries.
Still quite an effort to set up an integration scenario.
On-The-Fly Integration
Schema diagram labels: Conference, PC, Presentation, Person, Paper; OrganizedBy, publishedIn, servesOn, Author, presentedIn
Who published at SIGMOD but was not recently on the PC?
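A minimal sketch of how the question above could be answered once publication and PC data sit under one domain model. The dictionaries, person names, years, and the function name are all made up for illustration; nothing here is Semex's actual interface.

```python
# Toy instances for the publishedIn and servesOn associations.
# All names and years below are hypothetical.
published_in = {  # person -> set of (conference, year)
    "alice": {("SIGMOD", 2004)},
    "bob": {("SIGMOD", 2003), ("VLDB", 2004)},
    "carol": {("VLDB", 2004)},
}
serves_on = {  # person -> set of (conference, year) of PC service
    "alice": {("SIGMOD", 2004)},
    "bob": {("SIGMOD", 1999)},
}


def published_but_not_recent_pc(conf, since_year):
    """People who published at `conf` but were not on its PC since `since_year`."""
    authors = {p for p, venues in published_in.items()
               if any(c == conf for c, _ in venues)}
    recent_pc = {p for p, pcs in serves_on.items()
                 if any(c == conf and y >= since_year for c, y in pcs)}
    # The query is a set difference over the two associations.
    return sorted(authors - recent_pc)


print(published_but_not_recent_pc("SIGMOD", 2002))
```

The hard part the slides point at is not this set difference but getting external conference data mapped into the model quickly enough to ask it.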
Outline
Semex (open) architecture
The glue: reference reconciliation
Current research “what if’s”, and challenges:
  Malleable schemas
  On-the-fly information integration
  Association queries and indexing
  Visualizations of personal information
  More challenges
System Architecture
Data sources: Word, Excel, PPT, PDF, Bibtex, Latex, Email, Contacts
Components: association extractors, reference reconciliation, semi-structured domain model repository
Association sources: Simple, Extracted, External, Defined
IR/DB Themes
Axiom: the desktop is the database.
Need to manage any kind of data: Once you touch it, it’s managed!
Schema? Sure! A bit here and there.
Association queries
Reference reconciliation
Halevy
Multi-class Reconciliation
Article: a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
         a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
Venue: c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
       c2=(“ACM SIGMOD”, “1978”, null)
Person: p1=(“Robert S. Epstein”, null)
        p2=(“Michael Stonebraker”, null)
        p3=(“Eugene Wong”, null)
        p4=(“Epstein, R.S.”, null)
        p5=(“Stonebraker, M.”, null)
        p6=(“Wong, E.”, null)
Reference Reconciliation
Input: A set of references R
Output: A partitioning over R, such that
  Each partition refers to a single real-world entity – high precision
  Different partitions refer to different entities – high recall
Reference Reconciliation Results
Article: a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
         a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
Venue: c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
       c2=(“ACM SIGMOD”, “1978”, null)
Person: p1=(“Robert S. Epstein”, null)
        p2=(“Michael Stonebraker”, null)
        p3=(“Eugene Wong”, null)
        p4=(“Epstein, R.S.”, null)
        p5=(“Stonebraker, M.”, null)
        p6=(“Wong, E.”, null)
        p7=(“Eugene Wong”, “[email protected]”)
        p8=(null, “[email protected]”)
        p9=(“mike”, “[email protected]”)
Novel Challenges
1. Multiple classes
2. Limited information
3. Multi-value attributes
4. Lack of training data
Applying Traditional Record Linkage Algorithm
Chart: #(Person Partitions) (y-axis, 1750 to 3350) by Evidence (x-axis: 1 to 4).
Data labels: 3159, 1409.
Person references: 24076. Real-world persons: 1750.
Main Ideas [Dong et al., SIGMOD 05]
Leverage the context (network) of the references.
Propagate reconciliation decisions between different classes.
Enrich references as we go along.
Enforce some integrity constraints.
I. Exploiting Context Information
Associated Reference I – Contact list
  p5=(“Stonebraker, M.”, null, {p4, p6})
  p8=(null, “[email protected]”, {p7})
  p6=p7
Associated Reference II – Authored articles
  p2=(“Michael Stonebraker”, null)
  p5=(“Stonebraker, M.”, null)
  p2 and p5 authored the same article
Cross-attribute similarity – Name&email
  p5=(“Stonebraker, M.”, null)
  p8=(null, “[email protected]”)
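A rough sketch of what a cross-attribute (name vs. email) comparison might look like. The heuristic, the function name, and the address `stonebraker@example.edu` are assumptions for illustration only; the addresses in this transcript are obfuscated, and the actual similarity function used in Semex is not shown on the slide.

```python
import re


def name_email_compatible(name, email):
    """Crude check: does the email's local part contain a long token of the name?"""
    local = email.split("@", 1)[0].lower()
    tokens = [t for t in re.split(r"[^a-z]+", name.lower()) if t]
    # Initials such as "m" are too short to count as evidence on their own.
    return any(len(t) >= 3 and t in local for t in tokens)


# Hypothetical address standing in for the obfuscated one on the slide.
print(name_email_compatible("Stonebraker, M.", "stonebraker@example.edu"))
print(name_email_compatible("Wong, E.", "stonebraker@example.edu"))
```

A check like this lets a reference with only a name and a reference with only an email be compared at all, which plain attribute-wise matching cannot do.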
I. Exploiting Context Information
Chart: #(Person Partitions) (y-axis, 1750 to 3350) by Evidence (x-axis: Attr-wise, Name&Email, Article, Contact).
Data labels: 3159, 2169, 2169, 2096; also 1409, 346.
Person references: 24076. Real-world persons: 1750.
Merging Propagation
Article: a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
         a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
Venue: c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
       c2=(“ACM SIGMOD”, “1978”, null)
Person: p1=(“Robert S. Epstein”, null)
        p2=(“Michael Stonebraker”, null)
        p3=(“Eugene Wong”, null)
        p4=(“Epstein, R.S.”, null)
        p5=(“Stonebraker, M.”, null)
        p6=(“Wong, E.”, null)
II. Merging Propagation
Chart: #(Person Partitions) (y-axis, 1750 to 3350) by Evidence (x-axis: Attr-wise, Name&Email, Article, Contact).
Legend: Traditional, Propagation.
Data labels: 3159, 2169, 2169, 2096 and 3159, 2146, 2135, 2022.
Person references: 24076. Real-world persons: 1750.
III. Reference Enrichment
p2=(“Michael Stonebraker”, null, {p1,p3})
p8=(null, “[email protected]”, {p7})
p9=(“mike”, “[email protected]”, null)
P8-9 =(“mike”, “[email protected]”, {p7})
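A minimal sketch of the enrichment step in the example above: once p8 and p9 are judged to be the same person, their attribute values are pooled so that later comparisons see more evidence. The dictionary layout, the function name, and the placeholder `addr-1` (standing in for the obfuscated address, assumed to be shared by both references) are illustrative choices, not the Semex data model.

```python
def enrich(ref_a, ref_b):
    """Merge two references judged to be the same person, keeping every known value."""
    return {key: ref_a[key] | ref_b[key] for key in ("names", "emails", "contacts")}


# "addr-1" is a placeholder; the transcript does not show the real addresses.
p8 = {"names": set(), "emails": {"addr-1"}, "contacts": {"p7"}}
p9 = {"names": {"mike"}, "emails": {"addr-1"}, "contacts": set()}

p8_9 = enrich(p8, p9)
print(p8_9)
```

The merged reference now carries a name, an email, and a contact, so it can be compared with p2 on more than one attribute.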
III. Reference Enrichment
Chart: #(Person Partitions) (y-axis, 1750 to 3350) by Evidence (x-axis: Attr-wise, Name&Email, Article, Contact).
Legend: Traditional, Merge, Propagation.
Data labels: 3159, 2169, 2169, 2096 and 3169, 2036, 2036, 1910.
Person references: 24076. Real-world persons: 1750.
Overall Results
Chart: #(Person Partitions) (y-axis, 1750 to 3350) by Evidence (x-axis: Attr-wise, Name&Email, Article, Contact).
Legend: Traditional, Merge, Propagation, Full.
Data labels: 3159, 2169, 2169, 2096 and 3169, 2002, 1990, 1873; also 1409, 125, 346.
Person references: 24076. Real-world persons: 1750.
The Dependency Graph
Reference similarity nodes: (a1, a2), (p1, p4), (p2, p5), (p3, p6), (c1, c2)
Attribute similarity nodes: (“Distributed…”, “Distributed …”), (“169-180”, “169-180”), (“Robert S. Epstein”, “Epstein, R.S.”), (“Michael Stonebraker”, “Stonebraker, M.”), (“Eugene Wong”, “Wong, E.”), (“ACM …”, “ACM SIGMOD”), (“1978”, “1978”)
Propagation I to V
The same dependency graph shown in five steps, with nodes marked Reconciled or Similar.
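A toy sketch of propagating reconciliation decisions over a dependency graph with a worklist. The additive boost, the threshold, the initial scores, and the edge set are all assumptions chosen for illustration; they are not the scoring scheme of the SIGMOD 05 paper.

```python
from collections import deque


def propagate(scores, edges, boost=0.2, threshold=0.8):
    """Worklist propagation: a reconciled node raises the score of its dependents.

    scores: node -> initial similarity in [0, 1]
    edges:  node -> list of nodes whose similarity depends on it
    """
    scores = dict(scores)
    reconciled = set()
    queue = deque(n for n, s in scores.items() if s >= threshold)
    while queue:
        node = queue.popleft()
        if node in reconciled:
            continue
        reconciled.add(node)
        for dep in edges.get(node, []):
            if dep in reconciled:
                continue
            scores[dep] = min(1.0, scores[dep] + boost)
            if scores[dep] >= threshold:
                queue.append(dep)
    return reconciled


# Hypothetical starting scores for pairs from the running example.
scores = {"a1=a2": 0.9, "p1=p4": 0.7, "p2=p5": 0.7, "p3=p6": 0.7,
          "c1=c2": 0.7, "p1=p5": 0.1}
# Reconciling the two articles supports reconciling their authors and venues.
edges = {"a1=a2": ["p1=p4", "p2=p5", "p3=p6", "c1=c2"]}
print(sorted(propagate(scores, edges)))
```

In this toy run only the article pair starts above the threshold; the author and venue pairs are reconciled because of it, while the unrelated pair p1=p5 is left alone.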
Comparison with UAI Methods
Propagating similarities between classes was investigated with UAI methods (e.g., Russell and Pasula):
  Fit everything into a probabilistic model of the domain.
Our approach exploits the dependencies, but does not enforce a model.
DB/IR Themes
How do we model an application domain that involves both structured and unstructured data?
Malleable Schemas
Most DB/IR work is on seamless querying:
  Integration after the fact.
But what about the modeling phase?
  How can we design applications that manipulate both kinds of data?
Domains where:
  Border between two types is not clear or evolving,
  There is no obvious structure,
  Structure is not known at modeling time,
  Complete structure would be too complicated for users.
A Different Example: Web Data Integration
Building a meta-search engine for classifieds sites on the web.
Modeling the class RealEstate.
  Realize: subclasses are a messy proposition.
  Instead: describe subclasses by keywords.
Malleable Schemas: Keywords as Schema Constructs
Web: modeling the class RealEstate.
  Realize: subclasses are a messy proposition.
  Instead: describe subclasses by keywords.
PIM: modeling property Participant.
  Realize: there are many shades of participation.
  Instead: describe variants with keywords.
Key point: keywords are seen as replacements for some schema constructs.
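One way to picture keywords standing in for subclasses, as a small sketch: the class is modeled once, and what would have been a subclass hierarchy becomes a set of keywords attached to each instance. The listings, keyword choices, and function name are invented for illustration.

```python
# One modeled class (RealEstate) with keywords instead of subclasses.
# All listings and keywords are made up.
listings = [
    {"class": "RealEstate", "keywords": {"condo", "downtown", "rental"}},
    {"class": "RealEstate", "keywords": {"house", "waterfront", "sale"}},
    {"class": "Car", "keywords": {"sedan", "sale"}},
]


def select(items, cls, keywords):
    """Structured condition on the class, keyword condition in place of a subclass."""
    wanted = set(keywords)
    return [it for it in items if it["class"] == cls and wanted <= it["keywords"]]


print(select(listings, "RealEstate", ["sale"]))
```

The query mixes a schema construct (the class) with keywords, so new "subclasses" such as waterfront houses need no schema change.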
Importing External Data
Schema diagram labels: Conference, PC, Presentation, Person, Paper; OrganizedBy, publishedIn, servesOn, Author, presentedIn
Who published at SIGMOD but was not recently on the PC?
Importing Data w/Background Knowledge
We know a lot about the domain model and its possible instances:
  Schema matching is easier: [a la Doan et al.]
  Reference reconciliation is easier: [Etzioni and Perkowitz, 95]
  Wrapper construction: [Kushmerick et al, 97]
Leverage:
  The user’s past actions
  Colleagues with the same data needs
Challenge: matching relationships (in addition to attributes).
Help When Looking at Data
View external sources from my perspective:
  Highlight people I know (and why) on a web page
Fill in blanks:
  Tell me which people may be missing in a spreadsheet I’ve received
  Suggest other names (papers, etc.) when I’m creating a list.
Association Querying
Find objects not referenced in the query:
  Ask for Semex: Get Luna Dong, Jayant Madhavan
Learn interesting/useful association paths:
  Co-author, collaborator, relatedProject
Rank lists intelligently (lineage)
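A small sketch of the first point, returning objects that do not themselves match the keyword but are associated with something that does. The object ids, the untyped edges, the extra person, and the function name are assumptions; a real association query would also use edge labels and ranking.

```python
from collections import deque

# Toy association graph; ids and edges are hypothetical.
objects = {
    "proj:semex": {"class": "Project", "name": "Semex"},
    "per:luna": {"class": "Person", "name": "Luna Dong"},
    "per:jayant": {"class": "Person", "name": "Jayant Madhavan"},
    "per:other": {"class": "Person", "name": "Some Other Person"},
}
edges = [("per:luna", "proj:semex"), ("per:jayant", "proj:semex")]


def associated(keyword, want_class, max_hops=1):
    """Names of `want_class` objects within `max_hops` of any keyword match."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    start = [o for o, attrs in objects.items()
             if keyword.lower() in attrs["name"].lower()]
    seen = set(start)
    queue = deque((o, 0) for o in start)
    hits = []
    while queue:
        obj, hops = queue.popleft()
        if hops == max_hops:
            continue
        for nxt in sorted(neighbors.get(obj, ())):
            if nxt in seen:
                continue
            seen.add(nxt)
            if objects[nxt]["class"] == want_class:
                hits.append(objects[nxt]["name"])
            queue.append((nxt, hops + 1))
    return sorted(hits)


print(associated("Semex", "Person"))
```

Neither person's name contains the keyword; they are found only through their association with the matching project.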
Views on Personal Information
Semex enables multiple views on personal information:
  See everything about a project
  View the progression of a project or paper
  Activity clustering [Mitchell does for email]
  Pointers to external resources
From PIM to G(roup)IM
We would like to share:
  Subsets of our data
  Fragments of the domain model
Create personal profiles for:
  Better web search, online shopping, ad placement.
Manage information along a social network.
To share or not to share?
Summary
The goal of Semex is to bring the benefits of data management to the desktop
  Needs to be invisible!
  Automatically create associations between data items.
Fundamental challenges to DB/IR:
  Manage everything
  Exploit schema(s) when you have them
  Model data flexibly
  Support new types of queries.
“The most profound technologies are those that disappear”. Mark Weiser
Some References
Overview: CIDR 2005
Reference Reconciliation: SIGMOD 2005
A cool demo: SIGMOD 2005
The website: http://data.cs.washington.edu/semex/
NSF PIM Workshop: http://pim.ischool.washington.edu/