Post on 20-Dec-2015
TRANSCRIPT
Issues in Bridging DB & IR
Announcements:
- Next class: Interactive Review (come prepared)
- Homework III solutions online
- Demos tomorrow (instructions will be mailed by the end of the class)
4/29
First, some discussion of BibFinder: how queries are mapped, etc.
CEAS Online Evaluations
You can do them at https://intraweb.eas.asu.edu/eval
- Will be available until the end of day May 5th (so the exam is unfettered by what you might think about it)
- Instructors get to see it only after the grades have been given (so you don't need to feel compelled to be particularly nice)
- Your feedback would be appreciated (especially the written comments). Last semester I got 2,196 words of comments; let us see if we can break the record ;-)

"...please be both enthusiastic and adamant about this in your classes"
  - Associate Dean for Academic Affairs
The popularity of the Web brings two broad challenges to databases:

Integration of autonomous data sources
- Data/information integration
- Technically has to handle heterogeneous data too, but we will sort of assume that the sources are "quasi-relational"

Supporting heterogeneous data (combining DB/IR)
- This can be tackled in the presence of a single database
- The issues are:
  - How to do effective querying in the presence of structured and text data (e.g. the Stuff I've Seen project)
  - How to support IR-style querying on DBs, because users seem to know IR/keyword-style querying better (notice the irony here: we said structure is good because it supports structured querying)
  - How to support imprecise queries
DB vs. IR

DB:
- Allows structured querying
- Queries and results (tuples) are different objects
- Soundness & completeness expected
- User is expected to know what she is doing

IR:
- Only supports unstructured querying
- Queries and results are both documents!
- High precision & recall is hoped for
- User is expected to be a dunderhead
Some specific problems:
1. How to handle textual attributes?
2. How to support keyword-based querying?
3. How to handle imprecise queries? (Ullas Nambiar's work)
1. Handling text fields in data tuples
Often you have database relations some of whose fields are "textual". E.g. a movie database which has, in addition to year, director, etc., a column called "Review" that is unstructured text.
Normal DB operations ignore this unstructured stuff (you can't join over it). SQL sometimes supports a "Contains" constraint (e.g. "give me movies that contain 'Rotten' in the review").
Soft Joins: WHIRL [Cohen]
We can extend the notion of joins to "similarity joins", where similarity is measured in terms of vector similarity over the text attributes. The join tuples are then output in ranked form, with the rank proportional to the similarity.
A neat idea, but it does have some implementation difficulties: most tuples in the cross-product will have non-zero similarities, so we need query processing that will somehow produce only the highly ranked tuples.
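The similarity-join idea can be sketched in a few lines. This is a toy illustration, not Cohen's actual WHIRL implementation: the function names, the simple TF-IDF weighting, and the `threshold` cutoff are all assumptions, and a real system would avoid materializing the full cross-product.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build TF-IDF bag-of-words vectors for a list of text fields."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}  # smoothed IDF
    return [{w: tf * idf[w] for w, tf in d.items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_join(r, s, key, threshold=0.1):
    """Soft join on textual attribute `key`: pairs are ranked by
    vector similarity instead of being matched by exact equality."""
    rv = tfidf_vectors([t[key] for t in r])
    sv = tfidf_vectors([t[key] for t in s])
    pairs = [(cosine(u, v), a, b)
             for u, a in zip(rv, r) for v, b in zip(sv, s)]
    return sorted((p for p in pairs if p[0] >= threshold),
                  key=lambda p: p[0], reverse=True)
```

Note the implementation difficulty from the slide shows up directly here: the list comprehension enumerates the whole cross-product, which is exactly what a practical engine must avoid.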
2. Supporting keyword search on databases
How do we answer a query like “Soumen Sunita”?
Issues:
- The schema is normalized (not everything is in one table)
- How to rank multiple tuples which contain the keywords?
What BANKS Does
The whole DB is seen as a directed graph (edges correspond to foreign keys). Answers are subgraphs, ranked by edge weights.
BANKS: Keyword Search in DB
12/9/2002 7
Basic Model
Database modeled as a graph:
- Nodes = tuples
- Edges = references between tuples (foreign key, other kinds of relationships)
- Edges are directed
[Figure: example graph fragment in which a "MultiQuery Optimization" paper node is linked through "writes"/"author" edges to the author nodes S. Sudarshan, Prasan Roy, and Charuta]
BANKS: Keyword search…
Motivation
Keyword search of documents on the Web has been enormously successful: simple and intuitive, no need to learn any query language.
Database querying using keywords is desirable:
- SQL is not appropriate for casual users
- Form interfaces are cumbersome: they require a separate form for each type of query (confusing for casual users of Web information systems) and are not suitable for ad hoc queries
Edge Weight
- Weight of a forward edge is based on the schema (e.g. citation link weights > writes link weights)
- Weight of a backward edge = indegree of the node it points into
[Figure: example graph with edge weights 1 and 3]
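The graph model and edge-weight rules above can be sketched as follows. This is a minimal illustration of a BANKS-style data graph, not the actual system; the `FORWARD_WEIGHT` table and the exact backward-weight rule are assumptions based on the slide.

```python
from collections import defaultdict

# Hypothetical schema-based weights: citation links weigh more
# than writes links, as in the slide's example.
FORWARD_WEIGHT = {"cites": 2.0, "writes": 1.0}

class TupleGraph:
    """Minimal sketch of the BANKS data graph: nodes are tuples,
    edges follow foreign-key references between them."""

    def __init__(self):
        self.fwd = defaultdict(list)     # node -> [(target, weight)]
        self.indegree = defaultdict(int)

    def add_reference(self, src, dst, link_type):
        # Forward edge: weight comes from the schema (the link type).
        self.fwd[src].append((dst, FORWARD_WEIGHT[link_type]))
        self.indegree[dst] += 1

    def backward_weight(self, node):
        # Backward edges are weighted by the indegree of the node,
        # so heavily referenced hub nodes do not dominate answers.
        return float(self.indegree[node])
```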
BANKS Query Result Example
Result of “Soumen Sunita”
3. Supporting Imprecise Queries
There is an increasing number of Web-accessible databases (e.g. bibliographies, reservation systems, department catalogs, etc.), with support for precise queries only (exactly matching tuples).
Difficulty in extracting the desired information:
- Limited query capabilities provided by form-based query interfaces
- Lack of schema/domain information
- Increasing complexity of types of data, e.g. hypertext, images, etc.
Oftentimes the user wants 'about the same' instead of 'exact':
- Bibliography search: find similar publications
- Want cars priced 'around' $7000
Solution: provide answers closely matching the query constraints.
Relaxing queries
It is obvious how to relax certain types of attribute values; e.g. price=7000 is approximately the same as price=7020. But how do we relax categorical attributes? How should we relax Make=Honda? Two possible approaches:
- Assume that domain-specific information about the similarity of values is available (difficult to satisfy in practice)
- Attempt to derive the similarity between attribute values directly from the data

Question: how do we compute the similarity between "Make=Honda" and "Make=Chevrolet"?
Idea: compare the set of all tuples where Make=Honda to the set of all tuples where Make=Chevrolet:
- Consider each set of tuples as a vector of bags (where the bags correspond to the individual attributes)
- Use IR similarity techniques to compare the vectors
Finding similarities between attribute values
5/4
Challenges in answering Imprecise Queries
We introduce IQE (Imprecise Query Engine), which:
- Uses the query workload to identify other precise queries
- Extracts additional tuples satisfying a query by issuing similar precise queries
- Measures distance between queries using answerset similarity
Challenges:
- Extracting additional tuples with minimal domain knowledge
- Estimating similarity with minimal user input
Answerset Similarity
Answerset A(Q): the set of all answer tuples of query Q over relation R.
Query similarity: Sim(Q1, Q2) := Sim(A(Q1), A(Q2))
Measuring answerset similarity:
- Relational model: exact match between tuples captures only complete overlap
- Vector space model: matching keywords also detects partial overlaps
Problem: a vector-space-model representation for answersets. Answer: the SuperTuple.
Answerset for Q(Author=Widom):
  (Widom, Stream...., VLDB, 2002)
  (Widom, Optimize...., ICDE, 1998)

Answerset for Q(Author=Ullman):
  (Ullman, Optimize..., PODS, 1998)
  (Ullman, Mining..., VLDB, 2000)

ST(Q_Author=Widom):
  Co-author:  R. Motwani:3, Molina:6, ...
  Title:      warehouse:5, optimizing:2, streams:6, ...
  Conference: SIGMOD:3, VLDB:4, ...
  Year:       2000:6, 1999:5, ...
Similarity Measures
Jaccard similarity metric with bag semantics:
  SimJ(Q1, Q2) = |Q1 ∩ Q2| / |Q1 ∪ Q2|
Doc-Doc Similarity: equal importance to all attributes; the supertuple is considered as a single bag of keywords:
  Simdoc-doc(Q1, Q2) = SimJ(ST_Q1, ST_Q2)
Weighted-Attribute Similarity: weights assigned to attributes signify their importance to the user:
  Simwatr(Q1, Q2) = Σi wi × SimJ(ST_Q1(Ai), ST_Q2(Ai))
ST(Q_Author=Ullman):
  Co-author:  C. Li:5, R. Motwani:7, ...
  Title:      Data-mining:3, optimizing:5, ...
  Conference: SIGMOD:5, VLDB:5, ...
  Year:       2000:6, 1999:5, ...

ST(Q_Author=Widom):
  Co-author:  R. Motwani:3, Molina:6, ...
  Title:      warehouse:5, optimizing:2, streams:6, ...
  Conference: SIGMOD:3, VLDB:4, ...
  Year:       2000:6, 1999:5, ...
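The three measures can be sketched directly from the definitions, using Python `Counter`s as keyword bags. This is a minimal sketch: `jaccard_bag`, `sim_doc_doc`, and `sim_weighted_attr` are illustrative names, and a supertuple is assumed to be a dict mapping each attribute to its bag.

```python
from collections import Counter

def jaccard_bag(b1, b2):
    """Jaccard similarity with bag (multiset) semantics."""
    inter = sum((b1 & b2).values())  # min of counts per keyword
    union = sum((b1 | b2).values())  # max of counts per keyword
    return inter / union if union else 0.0

def sim_doc_doc(st1, st2):
    """Doc-Doc: treat the whole supertuple as a single bag of keywords."""
    all1 = sum(st1.values(), Counter())
    all2 = sum(st2.values(), Counter())
    return jaccard_bag(all1, all2)

def sim_weighted_attr(st1, st2, weights):
    """Weighted-Attribute: weighted sum of per-attribute bag Jaccards."""
    return sum(w * jaccard_bag(st1[a], st2[a]) for a, w in weights.items())
```

For instance, on the fully listed Conference bags of the two example supertuples (SIGMOD:3, VLDB:4 for Widom vs. SIGMOD:5, VLDB:5 for Ullman), the bag Jaccard is (3+4)/(5+5) = 0.7.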
Empirical Evaluation
Goal: evaluate the efficiency and effectiveness of our approach.
Setup:
- A database system extending the bibliography mediator BibFinder, projecting the relation Publications(Author, Title, Conference, Journal, Year)
- Query log consisting of 10K precise queries
- User study: 3 graduate students; 90 test queries, 30 chosen by each student
- Platform: Java 2 on a Linux server (Intel Celeron 2.2 GHz, 512 MB)

                         Time     Size
  Supertuple Generation  126 sec  21 MB
  Similarity Estimation  10 hrs   6 MB
Answering an Imprecise Query
Estimating query similarity: for each q ∈ Qlog, compute Sim(q, q') for all q' ∈ Qlog:
  Simdoc-doc(q, q') = SimJ(ST_q, ST_q')
  Simwatr(q, q') = Σi wi × SimJ(ST_q(Ai), ST_q'(Ai))
Extracting similar answers: given a query Q,
  1. Map Q to a query q ∈ Qlog
  2. Identify the 'k' queries most similar to q
  3. Execute the 'k' new queries
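The extraction steps above can be sketched as a small loop. `sim` and `execute` stand in for the precomputed similarity table and a wrapper that issues a precise query; both, like the pooling of the base query's answers with the rest, are assumptions of this sketch.

```python
def answer_imprecise(query, qlog, sim, execute, k=3):
    """Sketch of the IQE answering loop: map the imprecise query to
    its nearest precise query in the log, then also run the k most
    similar logged queries and pool all of their answers."""
    # Step 1: map Q to the closest query in the workload.
    base = max(qlog, key=lambda q: sim(query, q))
    # Step 2: identify the k logged queries most similar to it.
    similar = sorted((q for q in qlog if q != base),
                     key=lambda q: sim(base, q), reverse=True)[:k]
    # Step 3: execute the queries and collect the answer tuples.
    answers = []
    for q in [base] + similar:
        answers.extend(execute(q))
    return answers
```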
Some Results

  Imprecise Query                    Top Relevant Queries
  1. Title="web-based learning"      Title="e learning"; Title="web technology"; Conference="WISE"
  2. Title="Information Extraction"  Title="Information filtering"; Title="text mining"; Title="relevance feedback"
  3. Author="Abiteboul"              Author="vianu"; Author="Dan Suciu"
Relevance of Suggested Answers
[Chart: relevance estimation error per query, comparing Doc-Doc Similarity against Weighted-Attribute Similarity]
Are the results precise? Average error in relevance estimation is around 25%
User Study – Summary
[Chart: relevance of top-10 answers for each of the 3 users, Doc-Doc vs. Weighted-attribute, in the range of roughly 75-85%]
- Precision for the top-10 related queries is above 75%
- The Doc-Doc similarity measure dominates Weighted-attribute similarity
Lessons:
- Queries with popular keywords are difficult
- Efficiently and effectively capturing user interest is difficult
- A solution requiring less user input is more acceptable
What's Next?
Open issues:
- The most similar query may not be present in the workload
- Answers to a similar query will have varying similarity depending on the affected attributes
Solution: given an imprecise query, generate the most similar query; use attribute importance and value-value similarity to order the tuples.
Challenges:
- Estimating attribute importance
- Estimating value-value similarity
Learning the Semantics of the data
Estimating value-value similarity
Similarity between values of a categorical attribute A:
  Sim(v11, v12) = Σi wi × Sim(Co-related_value(Ai, v11), Co-related_value(Ai, v12)), where Ai ∈ Attributes(R), Ai ≠ A
For numerical attributes, use Euclidean distance.
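The estimate can be sketched by materializing the co-related value bags and reusing bag Jaccard as the inner Sim. This is a toy illustration: the function names are made up, the inner similarity measure is one plausible choice, and the attribute weights are assumed to be given.

```python
from collections import Counter

def corelated_bag(tuples, attr, anchor_attr, value):
    """Bag of `attr` values co-occurring with anchor_attr == value."""
    return Counter(t[attr] for t in tuples if t[anchor_attr] == value)

def jaccard_bag(b1, b2):
    """Jaccard similarity with bag (multiset) semantics."""
    inter = sum((b1 & b2).values())
    union = sum((b1 | b2).values())
    return inter / union if union else 0.0

def value_similarity(tuples, anchor_attr, v1, v2, weights):
    """Sim(v1, v2) as the weighted sum of similarities between the
    co-related value bags of the other attributes."""
    return sum(w * jaccard_bag(corelated_bag(tuples, a, anchor_attr, v1),
                               corelated_bag(tuples, a, anchor_attr, v2))
               for a, w in weights.items())
```

On a toy car table, this would compare Make=Honda with Make=Chevrolet by comparing the Models, Prices, Colors, etc. that each make co-occurs with.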
Use the model of the database (AFDs, keys, value correlations) to identify an implicit structure for the tuple, and show the other tuples that least break that structure.
Example: CarDb(Make, Model, Year, Price, Mileage, Location, Color)
Approximate keys:
- {Model, Mileage, Location} uniquely decides 90% of the cars in the DB
- {Model, Mileage, Color} uniquely decides 84% of the cars in the DB
Approximate Functional Dependencies (AFDs):
- Model -> Make
- Year -> Price
- Mileage -> Year
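Both quantities above can be measured by simple counting over the tuples. A minimal sketch, with illustrative names: `key_support` is the fraction of tuples an attribute set uniquely identifies (an approximate key if this is high, e.g. the 90% above), and `afd_confidence` takes the fraction of tuples agreeing with the majority right-hand-side value per left-hand-side group; one common way to score an AFD, assumed here.

```python
from collections import Counter, defaultdict

def key_support(tuples, attrs):
    """Fraction of tuples uniquely identified by the attribute set."""
    counts = Counter(tuple(t[a] for a in attrs) for t in tuples)
    unique = sum(1 for t in tuples
                 if counts[tuple(t[a] for a in attrs)] == 1)
    return unique / len(tuples)

def afd_confidence(tuples, lhs, rhs):
    """Confidence of the AFD lhs -> rhs: for each lhs value, count
    the tuples agreeing with its most common rhs value."""
    groups = defaultdict(Counter)
    for t in tuples:
        groups[t[lhs]][t[rhs]] += 1
    agree = sum(c.most_common(1)[0][1] for c in groups.values())
    return agree / len(tuples)
```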
Query relaxation