Page 1:

Indexing Dataspaces

Presenter: Aviv Alon
Seminar in Databases (236826)

Page 2:

Dataspaces

Dataspaces are collections of heterogeneous and partially unstructured data.

Page 3:

Dataspaces – Why do we need them?

Looking for an architect with good reviews and cheap materials?

Return “Architect B” as an instance.

Page 4:

Main Problem

How to effectively query and search a dataspace.

Consider queries that are keyword-based but also structure-aware.

Page 5:

Indexing Heterogeneous Data

An inverted list: each row represents a keyword and each column represents a data item from the data sources.
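The row-and-column view of an inverted list can be sketched in a few lines of Python. This is only an illustration: the text attached to each instance and the helper names `build_inverted_list` and `as_row` are assumptions, not content from the slides.

```python
from collections import defaultdict

# Hypothetical data items: instance id -> text found in that item.
documents = {
    "a1": "Birch clustering",
    "c1": "Sigmod 1996",
    "p1": "Tian",
}

def build_inverted_list(docs):
    """Map each lowercased keyword to the set of instances containing it."""
    index = defaultdict(set)
    for instance, text in docs.items():
        for keyword in text.lower().split():
            index[keyword].add(instance)
    return index

def as_row(index, keyword, columns):
    """Render one matrix row: 1 if the keyword occurs in the column's instance."""
    hits = index.get(keyword, set())
    return [1 if col in hits else 0 for col in columns]

index = build_inverted_list(documents)
print(as_row(index, "birch", ["a1", "c1", "p1"]))  # [1, 0, 0]
```

Storing each row as a set of instances rather than a dense bit vector is a design choice of this sketch; the matrix view is recovered on demand by `as_row`.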

Page 6:

Indexing Heterogeneous Data

We model the data as a set of triples. Each triple has one of two forms:

◦ (instance, attribute, value), for example: (“Architect B”, name, ‘Shalom’)
◦ (instance, association, instance), for example: (“Architect B”, worksWith, “Architect A”)
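One possible way to hold such a triple base in Python is sketched below. The `Triple` type, its field names and the `is_association` flag are assumptions made for illustration; the slides only fix the two triple shapes and the two example triples.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str          # instance id
    label: str            # attribute name or association name
    obj: str              # text value (attribute) or instance id (association)
    is_association: bool  # assumed flag separating the two triple forms

triples = [
    Triple("Architect B", "name", "Shalom", False),
    Triple("Architect B", "worksWith", "Architect A", True),
]

# Split the triple base into the two forms described on the slide.
attribute_triples = [t for t in triples if not t.is_association]
association_triples = [t for t in triples if t.is_association]

print(attribute_triples[0].obj)       # Shalom
print(association_triples[0].label)   # worksWith
```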

Page 7:

Indexing Heterogeneous Data

We also model:

Page 8:

Example

Person instances: p1, p2, p3

Article instance: a1

Conference instance: c1

Attributes firstName, lastName and nickName are sub-attributes of name.

Association contactAuthor is a sub-association of author.

Page 9:

Predicate queries

Set of predicates of the form (v, {K1, ..., Kn})
◦ v - an attribute or association label
◦ {K1, ..., Kn} - a keyword set

Example 1: (title, ‘Birch’), an attribute predicate

Page 10:

Predicate queries

Set of predicates of the form (v, {K1, ..., Kn})
◦ v - an attribute or association label
◦ {K1, ..., Kn} - a keyword set

Example 2: (publishedIn, ‘1996 Sigmod’), an association predicate

Page 11:

Neighborhood keyword queries

Set of keywords K1, ..., Kn. The query returns:
◦ relevant instances
◦ associated instances

Example: ‘Birch’
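The intended result of a neighborhood keyword query can be illustrated by computing it directly from the triples, without any index. This is a sketch under assumptions: the title value and the publishedIn triple are illustrative, associations are treated as undirected, and `neighborhood_query` is a hypothetical helper.

```python
# Illustrative triples; the attribute value is an assumption.
attribute_triples = [("a1", "title", "Birch")]
association_triples = [
    ("p1", "authoredPaper", "a1"),
    ("p2", "authoredPaper", "a1"),
    ("a1", "publishedIn", "c1"),
]

def neighborhood_query(keywords, attr_triples, assoc_triples):
    """Return (relevant, associated) instance sets for the given keywords."""
    wanted = {k.lower() for k in keywords}
    # Relevant: instances that contain a keyword in some attribute value.
    relevant = {
        instance
        for instance, _attribute, value in attr_triples
        if wanted & set(value.lower().split())
    }
    # Associated: instances linked to a relevant instance (either direction,
    # which is an assumption of this sketch).
    associated = set()
    for subject, _association, target in assoc_triples:
        if subject in relevant:
            associated.add(target)
        if target in relevant:
            associated.add(subject)
    return relevant, associated - relevant

relevant, associated = neighborhood_query(
    ["Birch"], attribute_triples, association_triples
)
print(sorted(relevant))    # ['a1']
print(sorted(associated))  # ['c1', 'p1', 'p2']
```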

Page 12:

Existing methods

Build a separate index for each attribute to support structured queries on structured data.
◦ Con: significant overhead to the index structure

Create an inverted list to support keyword search on unstructured data.
◦ Con: does not allow specifications on structure

Page 13:

Proposed solution

Capture both text values and structural information using an extended inverted list.

The index augments the text terms in the inverted list with labels denoting the structural aspects of the data, such as attribute tags and associations between data items.

Page 14:

Inverted Lists - Example

With a plain inverted list, we cannot tell that “Tian” occurs as p1’s name and p3’s lastName.

Page 15:

Indexing structure outline

Indexing Attributes
◦ Attribute inverted lists (ATIL)

Indexing Associations
◦ Attribute-association inverted lists (AAIL)

Indexing hierarchies
◦ Attribute inverted lists with duplication (Dup-ATIL)
◦ Attribute inverted lists with hierarchies (Hier-ATIL)
◦ Hybrid attribute inverted list (Hybrid-ATIL)

Page 16:

Indexing Attributes

Attribute inverted lists (ATIL): whenever the keyword k appears in a value of the attribute a, there is a row in the inverted list for k//a//.

Example: keyword = 1996, attribute = year

               a1  c1  p1  p2  p3
1996//year//    0   1   0   0   0

Page 18:

Attribute inverted lists (ATIL)

To answer an attribute predicate query (A, {K1, ..., Kn}), we search for {K1//A//, ..., Kn//A//}.

Example: (lastName, ‘Tian’)

“tian//lastName//”

The search yields p3.
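Building an ATIL and answering an attribute predicate against it might look like the following Python sketch. The attribute values are illustrative assumptions chosen to agree with the example (“Tian” is p1's name and p3's lastName), and returning the union over the keyword set is also an assumption about the query semantics.

```python
from collections import defaultdict

# (instance, attribute, value) triples; values are illustrative assumptions.
attribute_triples = [
    ("p1", "name", "Tian"),
    ("p3", "lastName", "Tian"),
    ("c1", "year", "1996"),
]

def build_atil(triples):
    """Index each keyword k of attribute a under the row key 'k//a//'."""
    index = defaultdict(set)
    for instance, attribute, value in triples:
        for keyword in value.lower().split():
            index[f"{keyword}//{attribute}//"].add(instance)
    return index

def attribute_query(index, attribute, keywords):
    """Look up K1//A//, ..., Kn//A// and return the union of the matching rows."""
    result = set()
    for keyword in keywords:
        result |= index.get(f"{keyword.lower()}//{attribute}//", set())
    return result

atil = build_atil(attribute_triples)
print(attribute_query(atil, "lastName", ["Tian"]))  # {'p3'}
```

Note that the query (name, ‘Tian’) against this index returns only p1, which is the limitation the hierarchy-aware variants address.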

Page 20:

Indexing Associations

Attribute-association inverted lists (AAIL):

Example: keyword = Birch, association = authoredPaper with p1, p2

                         a1  c1  p1  p2  p3
Birch//authoredPaper//    0   0   1   1   0

Page 21:

Indexing Associations

Attribute-association inverted lists (AAIL)

Page 22:

Attribute-association inverted lists (AAIL)

To answer an association predicate query (R, {K1, ..., Kn}), we search for {K1//R//, ..., Kn//R//}.

Example: (author, ‘Raghu’)

“raghu//author//”
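The association rows of an AAIL could be built as in the sketch below: for a triple (s, R, o), each keyword found in the attribute values of o is indexed under k//R// for s. The text values, the author triple and the helper names are assumptions for illustration, and only the association rows are built here (the attribute rows would be added as in an ATIL).

```python
from collections import defaultdict

# Illustrative triples; the text values are assumptions.
attribute_triples = [
    ("a1", "title", "Birch"),
    ("p2", "name", "Raghu"),
]
association_triples = [
    ("p1", "authoredPaper", "a1"),
    ("p2", "authoredPaper", "a1"),
    ("a1", "author", "p2"),
]

def build_association_rows(attr_triples, assoc_triples):
    """For (s, R, o), index every keyword of o's attribute values under 'k//R//' for s."""
    keywords_of = defaultdict(set)
    for instance, _attribute, value in attr_triples:
        keywords_of[instance].update(value.lower().split())
    index = defaultdict(set)
    for subject, association, target in assoc_triples:
        for keyword in keywords_of.get(target, set()):
            index[f"{keyword}//{association}//"].add(subject)
    return index

def association_query(index, association, keywords):
    """Look up K1//R//, ..., Kn//R// and return the union of the matching rows."""
    result = set()
    for keyword in keywords:
        result |= index.get(f"{keyword.lower()}//{association}//", set())
    return result

rows = build_association_rows(attribute_triples, association_triples)
print(sorted(association_query(rows, "authoredPaper", ["Birch"])))  # ['p1', 'p2']
print(sorted(association_query(rows, "author", ["Raghu"])))         # ['a1']
```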

Page 23:

Indexing hierarchies

For the query (name, ‘Tian’), we wish to return instances p1 and p3, rather than only p1.

Page 24:

A Naïve method

To answer the query (name, ‘Tian’), we can search for:

“tian//name// OR tian//firstName// OR tian//lastName// OR tian//nickName//”

Can be very expensive!
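The naïve expansion can be sketched as one ATIL lookup per attribute in the hierarchy below the queried attribute. The `PARENT` map encodes the example hierarchy; the hand-built index and the function names are assumptions for illustration.

```python
# Attribute hierarchy from the example: child -> parent.
PARENT = {"firstName": "name", "lastName": "name", "nickName": "name"}

def descendants(attribute):
    """Return the attribute followed by all of its (transitive) sub-attributes."""
    found = [attribute]
    for child, parent in PARENT.items():
        if parent == attribute:
            found.extend(descendants(child))
    return found

def naive_query(atil, attribute, keyword):
    """Union one ATIL lookup per attribute in the hierarchy below `attribute`."""
    result = set()
    for attr in descendants(attribute):
        result |= atil.get(f"{keyword.lower()}//{attr}//", set())
    return result

# A tiny hand-built ATIL (row key -> instances), assumed for illustration.
atil = {"tian//name//": {"p1"}, "tian//lastName//": {"p3"}}

print(descendants("name"))  # ['name', 'firstName', 'lastName', 'nickName']
print(sorted(naive_query(atil, "name", "Tian")))  # ['p1', 'p3']
```

The number of lookups grows with the size of the sub-attribute hierarchy, which is why this approach can be expensive.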

Page 26:

Indexing Attributes

Attribute inverted lists with duplication (Dup-ATIL):

Example: attribute = name, sub-attribute = nickName

                    a1  c1  p1  p2  p3
Jeff//name//         0   0   0   0   1
Jeff//nickName//     0   0   0   0   1

Page 27:

Index with Duplication

Attribute inverted lists with duplication (Dup-ATIL)

Page 28:

Attribute inverted lists with duplication (Dup-ATIL)

To answer an attribute predicate query (A, {K1, ..., Kn}), we search for {K1//A//, ..., Kn//A//}.

Example: (name, ‘Tian’)

“tian//name//”

The search yields both p3 and p1.
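A Dup-ATIL can be sketched by indexing every keyword under its own attribute and under each ancestor attribute. The triples below are illustrative assumptions consistent with the example, and `ancestors_and_self` and `build_dup_atil` are hypothetical helper names.

```python
from collections import defaultdict

# Attribute hierarchy from the example: child -> parent.
PARENT = {"firstName": "name", "lastName": "name", "nickName": "name"}

def ancestors_and_self(attribute):
    """Return the attribute followed by its chain of ancestors up to the root."""
    chain = [attribute]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def build_dup_atil(triples):
    """Duplicate each keyword row for the attribute and every ancestor of it."""
    index = defaultdict(set)
    for instance, attribute, value in triples:
        for keyword in value.lower().split():
            for attr in ancestors_and_self(attribute):
                index[f"{keyword}//{attr}//"].add(instance)
    return index

# Illustrative triples; the values are assumptions.
triples = [
    ("p1", "name", "Tian"),
    ("p3", "lastName", "Tian"),
    ("p3", "nickName", "Jeff"),
]
dup_atil = build_dup_atil(triples)

print(sorted(dup_atil["tian//name//"]))  # ['p1', 'p3']
print(sorted(dup_atil))
# ['jeff//name//', 'jeff//nickName//', 'tian//lastName//', 'tian//name//']
```

Three triples produce four rows here; the duplication grows with the depth of the hierarchy path above each leaf attribute.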

Page 29:

Dup-ATIL (cont.)

Pro: simple query answering.

Con: may considerably expand the size of the index because of the duplication, especially when:
◦ there are long paths from the root attribute to the leaf attributes
◦ most values in the triple base belong to leaf attributes

Page 31:

Index with Hierarchy Path

Attribute inverted lists with hierarchies (Hier-ATIL):

Example: attribute = name, sub-attribute = nickName

                          a1  c1  p1  p2  p3
Jeff//name//nickName//     0   0   0   0   1

Page 32:

Attribute inverted lists with hierarchies (Hier-ATIL)

To answer an attribute predicate query (A, {K1, ..., Kn}), we search for {K1//a0//...//am//*, ..., Kn//a0//...//am//*}, where a0//...//am is the hierarchy path for attribute A.

Example: (name, ‘Tian’)

“tian//name//*”

The search yields both p3 and p1.
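A Hier-ATIL and its prefix search can be sketched with sorted row keys and a binary search for the first key at or after the prefix. The triples are illustrative assumptions, the helper names are hypothetical, and re-sorting the keys on every query is a simplification; a real index would keep its rows in sorted order.

```python
import bisect
from collections import defaultdict

# Attribute hierarchy from the example: child -> parent.
PARENT = {"firstName": "name", "lastName": "name", "nickName": "name"}

def hierarchy_path(attribute):
    """Return the path a0//...//am from the root attribute down to `attribute`."""
    path = [attribute]
    while path[0] in PARENT:
        path.insert(0, PARENT[path[0]])
    return "//".join(path)

def build_hier_atil(triples):
    """Index each keyword once, under its attribute's full hierarchy path."""
    index = defaultdict(set)
    for instance, attribute, value in triples:
        for keyword in value.lower().split():
            index[f"{keyword}//{hierarchy_path(attribute)}//"].add(instance)
    return index

def prefix_search(index, prefix):
    """Union all rows whose key starts with `prefix` (the part before the '*')."""
    keys = sorted(index)  # simplification: sorted on every call
    start = bisect.bisect_left(keys, prefix)
    result = set()
    for key in keys[start:]:
        if not key.startswith(prefix):
            break
        result |= index[key]
    return result

# Illustrative triples; the values are assumptions.
triples = [
    ("p1", "name", "Tian"),
    ("p3", "lastName", "Tian"),
    ("p3", "nickName", "Jeff"),
]
hier_atil = build_hier_atil(triples)

print(sorted(hier_atil))
# ['jeff//name//nickName//', 'tian//name//', 'tian//name//lastName//']
print(sorted(prefix_search(hier_atil, "tian//name//")))  # ['p1', 'p3']
```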

Page 33:

Hier-ATIL (cont.)

Pro: does not increase the number of indexed keywords (although it can lengthen many of them).
◦ Real indexing systems typically record a keyword only by its difference from the previous keyword.

Con: a predicate query is answered by transforming it into a prefix search, which can be more expensive than a keyword search.

Page 35:

Hybrid Index – Why?

Dup-ATIL is more suitable for cases where a keyword occurs in many attributes with common ancestors.

Hier-ATIL is more suitable for cases where a keyword occurs in only a few attributes with common ancestors.

Hybrid indexing combines the strengths of both methods.

Page 36:

Hybrid Index

Hybrid attribute inverted list (Hybrid-ATIL): an inverted list that can answer any prefix search by reading no more than t rows.

                           a1  c1  p1  p2  p3
Jeff//name//nickName//      0   0   0   0   1
Jie//name//firstName//      0   0   0   0   1
Tian//name////              0   0   1   0   1
Tian//name//lastName//      0   0   0   0   1

Tian//name//// is a summary row. Tian//name//lastName// is shadowed by Tian//name//.

Page 37:

Hybrid Index

To answer a prefix query of the form k//a0//...//am//*, we look at all the rows with prefix k//a0//...//am//, except those shadowed by summary rows.

Example: (name, ‘Tian’), t = 1

“tian//name//*”

The prefix search is answered after reading 1 row and yields both p1 and p3.
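The idea of summary rows can be illustrated with a deliberately simplified Python sketch. It is not the construction algorithm from the original work: it adds a summary row for every prefix whose plain rows number more than t, keeps the shadowed rows in place, and stores the summary for prefix P under P followed by an extra "//", which is an assumption based on the Tian//name//// row in the example table. All function names are hypothetical.

```python
# Hier-ATIL rows (row key -> instances), written by hand to match the example table.
hier_atil = {
    "jeff//name//nickName//": {"p3"},
    "jie//name//firstName//": {"p3"},
    "tian//name//": {"p1"},
    "tian//name//lastName//": {"p3"},
}

SUMMARY_SUFFIX = "//"  # assumed marker: the summary row for prefix P is P + "//"

def query_prefixes(key):
    """List every prefix k//a0//...//ai// that a query could use to reach this row."""
    parts = key.rstrip("/").split("//")  # [keyword, a0, ..., am]
    return ["//".join(parts[:i]) + "//" for i in range(2, len(parts) + 1)]

def build_hybrid(index, t):
    """Copy the index and add a summary row wherever a prefix covers more than t rows."""
    hybrid = {key: set(instances) for key, instances in index.items()}
    prefixes = {p for key in index for p in query_prefixes(key)}
    for prefix in prefixes:
        rows = [key for key in index if key.startswith(prefix)]
        if len(rows) > t:
            hybrid[prefix + SUMMARY_SUFFIX] = set().union(*(index[k] for k in rows))
    return hybrid

def hybrid_prefix_search(hybrid, prefix):
    """Return (instances, rows_read); a summary row, if present, is the only row read."""
    summary = hybrid.get(prefix + SUMMARY_SUFFIX)
    if summary is not None:
        return set(summary), 1
    result, rows_read = set(), 0
    # Simplification: scan the dict; a real index would seek in sorted order.
    for key, instances in hybrid.items():
        if key.startswith(prefix) and not key.endswith("////"):
            result |= instances
            rows_read += 1
    return result, rows_read

hybrid = build_hybrid(hier_atil, t=1)
instances, rows_read = hybrid_prefix_search(hybrid, "tian//name//")
print(sorted(instances), rows_read)  # ['p1', 'p3'] 1
```

When no summary row exists for a prefix, at most t plain rows match it by construction, so either branch reads no more than t rows.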

Page 38:

Neighborhood Keyword Queries

We build the Keyword Inverted List (KIL), which is essentially a Hybrid-AAIL.

Page 39:

Neighborhood Keyword Queries

We build the Keyword Inverted List (KIL), which is essentially a Hybrid-AAIL.

Example: “Birch”, t = 1

“birch//*”

The prefix search is answered after reading 1 row and yields p1, p2, a1 and c1.

Page 40:

Experimental Evaluation

Associations between disparate items on the desktop:
◦ LaTeX and BibTeX files
◦ Word documents
◦ PowerPoint presentations
◦ emails and contacts
◦ webpages in the web cache

The instances and associations are stored in an RDF file. The size of the file is 52.4 MB.

Page 41:

Experimental Evaluation

Figure panels:
◦ Attribute clauses, no sub-attributes
◦ Attribute clauses, with sub-attributes
◦ Association clauses

Page 42:

Observations about the results

Data set: 105,320 objects, 300,354 attributes, 468,402 associations.

Predicate query: 15.2 ms. Neighborhood keyword query: 224.3 ms (with no more than 5 keywords).

Answering queries using the KIL was very efficient!

Effectiveness of hybrid indexing: answering queries with and without sub-attributes consumed a similar amount of time.

Page 43:

Comparison of methods

Compared with KIL (on average):

The Naïve method
◦ query-answering time increased by a factor of 15.9

XML Index (SepIL)
◦ query-answering time increased by a factor of 2

Page 44:

Conclusions

Main Contributions:
◦ An indexing method that is designed to support flexible querying over dataspaces
◦ Extended inverted lists to capture both text and structure of data

Future Work:
◦ Extend the index to support value heterogeneity and investigate appropriate ranking algorithms

Page 45:

The End

Questions?