Post on 20-Dec-2015
TRANSCRIPT
Issues in Bridging DB & IR
Announcements:
- Next class: Interactive Review (come prepared)
- Homework III solutions online
- Demos tomorrow (instructions will be mailed by the end of the class)
4/29
First, some discussion of BibFinder: how queries are mapped, etc.
CEAS Online Evaluations
You can do them at https://intraweb.eas.asu.edu/eval
- Will be available until the end of day May 5th (so the exam is unfettered by what you might think about it)
- Instructors get to see it only after the grades have been given (so you don't need to feel compelled to be particularly nice)
- Your feedback would be appreciated (especially the written comments). Last semester I got 2,196 words of comments; let us see if we can break the record ;-)

"...please be both enthusiastic and adamant about this in your classes"
  - Associate Dean for Academic Affairs
The popularity of the Web brings two broad challenges to databases:

Integration of autonomous data sources
- Data/information integration
- Technically has to handle heterogeneous data too, but we will sort of assume that the sources are "quasi-relational"

Supporting heterogeneous data (combining DB/IR)
- This can be tackled in the presence of a single database
- The issues are:
  - How to do effective querying in the presence of structured and text data (e.g. the Stuff I've Seen project)
  - How to support IR-style querying on DBs, because users seem to know IR/keyword-style querying better (notice the irony here: we said structure is good because it supports structured querying)
  - How to support imprecise queries
DB vs. IR

DB:
- Allows structured querying
- Queries and results (tuples) are different objects
- Soundness & completeness expected
- User is expected to know what she is doing

IR:
- Only supports unstructured querying
- Queries and results are both documents!
- High precision & recall is hoped for
- User is expected to be a dunderhead
Some specific problems:
1. How to handle textual attributes?
2. How to support keyword-based querying?
3. How to handle imprecise queries? (Ullas Nambiar's work)
1. Handling text fields in data tuples
Often you have database relations some of whose fields are "textual". E.g. a movie database which has, in addition to year, director, etc., a column called "Review" that is unstructured text.
Normal DB operations ignore this unstructured stuff (you can't join over it). SQL sometimes supports a "Contains" constraint (e.g. "give me movies that contain 'Rotten' in the review").
Soft Joins: WHIRL [Cohen]
We can extend the notion of joins to "similarity joins", where similarity is measured in terms of vector similarity over the text attributes. The join tuples are then output in ranked form, with the rank proportional to the similarity.
A neat idea, but it does have some implementation difficulties: most tuples in the cross-product will have non-zero similarities, so we need query processing that will somehow produce only the highly ranked tuples.
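The similarity-join idea can be sketched in a few lines. This is a toy illustration, not Cohen's actual WHIRL implementation: the function names, the simple TF-IDF weighting, and the `threshold` cutoff are all assumptions, and a real system would avoid materializing the full cross-product.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build TF-IDF bag-of-words vectors for a list of text fields."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}  # smoothed IDF
    return [{w: tf * idf[w] for w, tf in d.items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_join(r, s, key, threshold=0.1):
    """Soft join on textual attribute `key`: pairs are ranked by
    vector similarity instead of being matched by exact equality."""
    rv = tfidf_vectors([t[key] for t in r])
    sv = tfidf_vectors([t[key] for t in s])
    pairs = [(cosine(u, v), a, b)
             for u, a in zip(rv, r) for v, b in zip(sv, s)]
    return sorted((p for p in pairs if p[0] >= threshold),
                  key=lambda p: p[0], reverse=True)
```

Note the implementation difficulty from the slide shows up directly here: the list comprehension enumerates the whole cross-product, which is exactly what a practical engine must avoid.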
2. Supporting keyword search on databases
How do we answer a query like “Soumen Sunita”?
Issues:
- The schema is normalized (not everything is in one table)
- How to rank multiple tuples which contain the keywords?
What BANKS Does
The whole DB is seen as a directed graph (edges correspond to foreign keys). Answers are subgraphs, ranked by edge weights.
BANKS: Keyword Search in DB
12/9/2002 7
Basic Model
Database modeled as a graph:
- Nodes = tuples
- Edges = references between tuples (foreign key, other kinds of relationships)
- Edges are directed
[Figure: example graph fragment in which a "MultiQuery Optimization" paper node is linked through "writes"/"author" edges to the author nodes S. Sudarshan, Prasan Roy, and Charuta]
BANKS: Keyword search…
Motivation
Keyword search of documents on the Web has been enormously successful: simple and intuitive, no need to learn any query language.
Database querying using keywords is desirable:
- SQL is not appropriate for casual users
- Form interfaces are cumbersome: they require a separate form for each type of query (confusing for casual users of Web information systems) and are not suitable for ad hoc queries
Edge Weight
- Weight of a forward edge is based on the schema (e.g. citation link weights > writes link weights)
- Weight of a backward edge = indegree of the node it points into
[Figure: example graph with edge weights 1 and 3]
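The graph model and edge-weight rules above can be sketched as follows. This is a minimal illustration of a BANKS-style data graph, not the actual system; the `FORWARD_WEIGHT` table and the exact backward-weight rule are assumptions based on the slide.

```python
from collections import defaultdict

# Hypothetical schema-based weights: citation links weigh more
# than writes links, as in the slide's example.
FORWARD_WEIGHT = {"cites": 2.0, "writes": 1.0}

class TupleGraph:
    """Minimal sketch of the BANKS data graph: nodes are tuples,
    edges follow foreign-key references between them."""

    def __init__(self):
        self.fwd = defaultdict(list)     # node -> [(target, weight)]
        self.indegree = defaultdict(int)

    def add_reference(self, src, dst, link_type):
        # Forward edge: weight comes from the schema (the link type).
        self.fwd[src].append((dst, FORWARD_WEIGHT[link_type]))
        self.indegree[dst] += 1

    def backward_weight(self, node):
        # Backward edges are weighted by the indegree of the node,
        # so heavily referenced hub nodes do not dominate answers.
        return float(self.indegree[node])
```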
BANKS Query Result Example
Result of “Soumen Sunita”
3. Supporting Imprecise Queries
There is an increasing number of Web-accessible databases (e.g. bibliographies, reservation systems, department catalogs, etc.), with support for precise queries only (exactly matching tuples).
Difficulty in extracting the desired information:
- Limited query capabilities provided by form-based query interfaces
- Lack of schema/domain information
- Increasing complexity of types of data, e.g. hypertext, images, etc.
Oftentimes the user wants 'about the same' instead of 'exact':
- Bibliography search: find similar publications
- Want cars priced 'around' $7000
Solution: provide answers closely matching the query constraints.
Relaxing queries
It is obvious how to relax certain types of attribute values; e.g. price=7000 is approximately the same as price=7020. But how do we relax categorical attributes? How should we relax Make=Honda? Two possible approaches:
- Assume that domain-specific information about the similarity of values is available (difficult to satisfy in practice)
- Attempt to derive the similarity between attribute values directly from the data

Question: how do we compute the similarity between "Make=Honda" and "Make=Chevrolet"?
Idea: compare the set of all tuples where Make=Honda to the set of all tuples where Make=Chevrolet:
- Consider each set of tuples as a vector of bags (where the bags correspond to the individual attributes)
- Use IR similarity techniques to compare the vectors
Finding similarities between attribute values
5/4
Challenges in answering Imprecise Queries
We introduce IQE (Imprecise Query Engine), which:
- Uses the query workload to identify other precise queries
- Extracts additional tuples satisfying a query by issuing similar precise queries
- Measures distance between queries using answerset similarity
Challenges:
- Extracting additional tuples with minimal domain knowledge
- Estimating similarity with minimal user input
Answerset Similarity
Answerset A(Q): the set of all answer tuples of query Q over relation R.
Query similarity: Sim(Q1, Q2) := Sim(A(Q1), A(Q2))
Measuring answerset similarity:
- Relational model: exact match between tuples captures only complete overlap
- Vector space model: matching keywords also detects partial overlaps
Problem: a vector-space-model representation for answersets. Answer: the SuperTuple.
Answerset for Q(Author=Widom):
  (Widom, Stream...., VLDB, 2002)
  (Widom, Optimize...., ICDE, 1998)

Answerset for Q(Author=Ullman):
  (Ullman, Optimize..., PODS, 1998)
  (Ullman, Mining..., VLDB, 2000)

ST(Q_Author=Widom):
  Co-author:  R. Motwani:3, Molina:6, ...
  Title:      warehouse:5, optimizing:2, streams:6, ...
  Conference: SIGMOD:3, VLDB:4, ...
  Year:       2000:6, 1999:5, ...
Similarity Measures
Jaccard similarity metric with bag semantics:
  SimJ(Q1, Q2) = |Q1 ∩ Q2| / |Q1 ∪ Q2|
Doc-Doc Similarity: equal importance to all attributes; the supertuple is considered as a single bag of keywords:
  Simdoc-doc(Q1, Q2) = SimJ(ST_Q1, ST_Q2)
Weighted-Attribute Similarity: weights assigned to attributes signify their importance to the user:
  Simwatr(Q1, Q2) = Σi wi × SimJ(ST_Q1(Ai), ST_Q2(Ai))
ST(Q_Author=Ullman):
  Co-author:  C. Li:5, R. Motwani:7, ...
  Title:      Data-mining:3, optimizing:5, ...
  Conference: SIGMOD:5, VLDB:5, ...
  Year:       2000:6, 1999:5, ...

ST(Q_Author=Widom):
  Co-author:  R. Motwani:3, Molina:6, ...
  Title:      warehouse:5, optimizing:2, streams:6, ...
  Conference: SIGMOD:3, VLDB:4, ...
  Year:       2000:6, 1999:5, ...
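The three measures can be sketched directly from the definitions, using Python `Counter`s as keyword bags. This is a minimal sketch: `jaccard_bag`, `sim_doc_doc`, and `sim_weighted_attr` are illustrative names, and a supertuple is assumed to be a dict mapping each attribute to its bag.

```python
from collections import Counter

def jaccard_bag(b1, b2):
    """Jaccard similarity with bag (multiset) semantics."""
    inter = sum((b1 & b2).values())  # min of counts per keyword
    union = sum((b1 | b2).values())  # max of counts per keyword
    return inter / union if union else 0.0

def sim_doc_doc(st1, st2):
    """Doc-Doc: treat the whole supertuple as a single bag of keywords."""
    all1 = sum(st1.values(), Counter())
    all2 = sum(st2.values(), Counter())
    return jaccard_bag(all1, all2)

def sim_weighted_attr(st1, st2, weights):
    """Weighted-Attribute: weighted sum of per-attribute bag Jaccards."""
    return sum(w * jaccard_bag(st1[a], st2[a]) for a, w in weights.items())
```

For instance, on the fully listed Conference bags of the two example supertuples (SIGMOD:3, VLDB:4 for Widom vs. SIGMOD:5, VLDB:5 for Ullman), the bag Jaccard is (3+4)/(5+5) = 0.7.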
Empirical Evaluation
Goal: evaluate the efficiency and effectiveness of our approach.
Setup:
- A database system extending the bibliography mediator BibFinder, projecting the relation Publications(Author, Title, Conference, Journal, Year)
- Query log consisting of 10K precise queries
- User study: 3 graduate students; 90 test queries, 30 chosen by each student
- Platform: Java 2 on a Linux server (Intel Celeron 2.2 GHz, 512 MB)

                         Time     Size
  Supertuple Generation  126 sec  21 MB
  Similarity Estimation  10 hrs   6 MB
Answering an Imprecise Query
Estimating query similarity: for each q ∈ Qlog, compute Sim(q, q') for all q' ∈ Qlog:
  Simdoc-doc(q, q') = SimJ(ST_q, ST_q')
  Simwatr(q, q') = Σi wi × SimJ(ST_q(Ai), ST_q'(Ai))
Extracting similar answers: given a query Q,
  1. Map Q to a query q ∈ Qlog
  2. Identify the 'k' queries most similar to q
  3. Execute the 'k' new queries
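The extraction steps above can be sketched as a small loop. `sim` and `execute` stand in for the precomputed similarity table and a wrapper that issues a precise query; both, like the pooling of the base query's answers with the rest, are assumptions of this sketch.

```python
def answer_imprecise(query, qlog, sim, execute, k=3):
    """Sketch of the IQE answering loop: map the imprecise query to
    its nearest precise query in the log, then also run the k most
    similar logged queries and pool all of their answers."""
    # Step 1: map Q to the closest query in the workload.
    base = max(qlog, key=lambda q: sim(query, q))
    # Step 2: identify the k logged queries most similar to it.
    similar = sorted((q for q in qlog if q != base),
                     key=lambda q: sim(base, q), reverse=True)[:k]
    # Step 3: execute the queries and collect the answer tuples.
    answers = []
    for q in [base] + similar:
        answers.extend(execute(q))
    return answers
```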
Some Results

  Imprecise Query                    Top Relevant Queries
  1. Title="web-based learning"      Title="e learning"; Title="web technology"; Conference="WISE"
  2. Title="Information Extraction"  Title="Information filtering"; Title="text mining"; Title="relevance feedback"
  3. Author="Abiteboul"              Author="vianu"; Author="Dan Suciu"
Relevance of Suggested Answers
[Chart: relevance estimation error per query, comparing Doc-Doc Similarity against Weighted-Attribute Similarity]
Are the results precise? Average error in relevance estimation is around 25%
User Study – Summary
[Chart: relevance of top-10 answers for each of the 3 users, Doc-Doc vs. Weighted-attribute, in the range of roughly 75-85%]
- Precision for the top-10 related queries is above 75%
- The Doc-Doc similarity measure dominates Weighted-attribute similarity
Lessons:
- Queries with popular keywords are difficult
- Efficiently and effectively capturing user interest is difficult
- A solution requiring less user input is more acceptable
What's Next?
Open issues:
- The most similar query may not be present in the workload
- Answers to a similar query will have varying similarity depending on the affected attributes
Solution: given an imprecise query, generate the most similar query; use attribute importance and value-value similarity to order the tuples.
Challenges:
- Estimating attribute importance
- Estimating value-value similarity
Learning the Semantics of the data
Estimating value-value similarity
Similarity between values of a categorical attribute A:
  Sim(v11, v12) = Σi wi × Sim(Co-related_value(Ai, v11), Co-related_value(Ai, v12)), where Ai ∈ Attributes(R), Ai ≠ A
For numerical attributes, use Euclidean distance.
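The estimate can be sketched by materializing the co-related value bags and reusing bag Jaccard as the inner Sim. This is a toy illustration: the function names are made up, the inner similarity measure is one plausible choice, and the attribute weights are assumed to be given.

```python
from collections import Counter

def corelated_bag(tuples, attr, anchor_attr, value):
    """Bag of `attr` values co-occurring with anchor_attr == value."""
    return Counter(t[attr] for t in tuples if t[anchor_attr] == value)

def jaccard_bag(b1, b2):
    """Jaccard similarity with bag (multiset) semantics."""
    inter = sum((b1 & b2).values())
    union = sum((b1 | b2).values())
    return inter / union if union else 0.0

def value_similarity(tuples, anchor_attr, v1, v2, weights):
    """Sim(v1, v2) as the weighted sum of similarities between the
    co-related value bags of the other attributes."""
    return sum(w * jaccard_bag(corelated_bag(tuples, a, anchor_attr, v1),
                               corelated_bag(tuples, a, anchor_attr, v2))
               for a, w in weights.items())
```

On a toy car table, this would compare Make=Honda with Make=Chevrolet by comparing the Models, Prices, Colors, etc. that each make co-occurs with.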
Use the model of the database (AFDs, keys, value correlations) to identify an implicit structure for the tuple, and show the other tuples that least break that structure.
Example: CarDb(Make, Model, Year, Price, Mileage, Location, Color)
Approximate keys:
- {Model, Mileage, Location} uniquely decides 90% of the cars in the DB
- {Model, Mileage, Color} uniquely decides 84% of the cars in the DB
Approximate Functional Dependencies (AFDs):
- Model -> Make
- Year -> Price
- Mileage -> Year
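Both quantities above can be measured by simple counting over the tuples. A minimal sketch, with illustrative names: `key_support` is the fraction of tuples an attribute set uniquely identifies (an approximate key if this is high, e.g. the 90% above), and `afd_confidence` takes the fraction of tuples agreeing with the majority right-hand-side value per left-hand-side group; one common way to score an AFD, assumed here.

```python
from collections import Counter, defaultdict

def key_support(tuples, attrs):
    """Fraction of tuples uniquely identified by the attribute set."""
    counts = Counter(tuple(t[a] for a in attrs) for t in tuples)
    unique = sum(1 for t in tuples
                 if counts[tuple(t[a] for a in attrs)] == 1)
    return unique / len(tuples)

def afd_confidence(tuples, lhs, rhs):
    """Confidence of the AFD lhs -> rhs: for each lhs value, count
    the tuples agreeing with its most common rhs value."""
    groups = defaultdict(Counter)
    for t in tuples:
        groups[t[lhs]][t[rhs]] += 1
    agree = sum(c.most_common(1)[0][1] for c in groups.values())
    return agree / len(tuples)
```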
Query relaxation