+ efficient network aware search in collaborative tagging sihem amer yahia, michael benedikt, laks...

36
+ Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish Chawla CSE 6339 Spring 2009

Upload: clifton-osborne

Post on 20-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+

Efficient network aware search in collaborative taggingSihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich

Presented by: Ashish ChawlaCSE 6339 Spring 2009

Page 2: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Overview Opportunity:

Explore keyword search in a context where query results are determined by opinion of network of taggers related to a seeker.

Incorporate social behavior into processing search queries

Network Aware Search Results determined by opinion of network.

Existing top-k are too space intensive Dependence of scores on seeker’s network

Investigate clustering seekers based on behavior of networks

Del.icio.us datasets were used for experiments

2

Page 3: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Introduction

What is Network Aware Search?

Examples: Flickr, YouTube, del.icio.us, photo tagging on Facebook

Users contribute content

annotate items (photos, videos, URLs, …) with tags form social networks

friends/family, interest-based need help discovering relevant content

What is Relevance of an item?

3

Page 4: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+What is Network-Aware Search?

4

Page 5: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Claims

Define what is network-aware search.

Improvise top-k algorithms to Network-Aware Search, by using score upper-bounds and EXACT strategy.

Refine score upper-bounds based on the user’s network and tagging behavior

5

Page 6: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Data Model

Roger, i1, musicRoger, i3, musicRoger, i5, sports…Hugo, i1, musicHugo, i22, music…Minnie, i2, sports…Linda, i2, footballLinda, i28, news…

Tagged(user u,item i,tag t)

Taggers = u TaggedSeekers = u Link

Link(u1,v1): directed edgeNetwork (u) = { v | Link (u,v) }For seeker u1ε Seekers, Network(u1) = neighbors of u1

Link (user u, user v)

6

Page 7: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+What are Scores?

Query is a set of tags Q = {t1,t2,…,tn} example: fashion, www, sports, artificial

intelligence

For a seeker u, a tag t, and a item I (Score per tag)score(i,u,t) = f(|Network(u) {v, |Tagged(v,i,t)}|)

Overall Score of the queryscore(i,u,Q) = g(score(i,u,t1), score(i,u,t2),…, score(i,u,

tn))

f and g are monotone, where f = COUNT, g = SUM

7

Page 8: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Problem Statement

Given a user query Q = t1 … tn and a number k, we want to efficiently determine the top k items ie: k items with the highest overall

score

8

Page 9: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Standard Top-k Processing

Q = {t1,t2,…,tn}

Inverted lists per tag, IL1, IL2, … ILn, sorted on scores

score (i) = g(score(i, IL1), score(i, IL2) , …, score(i, IL3))

Intuition high-scoring items are close to the top of most lists

Fagin-style processing: NRA (no random access)

access all lists sequentially in parallel

maintain a heap sorted on partial scores

stop when partial score of kth item > best case score of unseen/incomplete items

9

Page 10: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+

Item 25Item 25

0.60.6

Item 17Item 17

0.60.6

Item 83Item 83

0.90.9

item78item78

0.50.5

item38item38

0.60.6

item17item17

0.70.7

item83item83

0.40.4

item14item14

0.60.6

item61item61

0.30.3

item17item17

0.30.3

item5item5

0.60.6

item81item81

0.20.2

item21item21

0.20.2

item83item83

0.50.5

item65item65

0.10.1

item91item91

0.10.1

item21item21

0.30.3

item10item10

0.10.1

item44item44

0.10.1

Item 83Item 83

[0.9, 2.1][0.9, 2.1]

Item 17Item 17

[0.6, 2.1][0.6, 2.1]

item 25item 25

[0.6, 2.1][0.6, 2.1]

worst score

best-score

Min top-2 score : 0.6

Threshold (Max of unseen tuples): 2.1

Pruning Candidates:Min top-2 < best score of candidate

Stopping ConditionThreshold < min top-2 ?

List 1 List 2 List 3 Candidates

0.6+0.6+0.9=2.1

NRA10

Page 11: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+

item 25item 25

0.60.6

item 17item 17

0.60.6

item 83item 83

0.90.9

item 78item 78

0.50.5

item 38item 38

0.60.6

item 17item 17

0.70.7

item 83item 83

0.40.4

item 14item 14

0.60.6

item 61item 61

0.30.3

item 17item 17

0.30.3

item 5item 5

0.60.6

item 81item 81

0.20.2

item 21item 21

0.20.2

item 83item 83

0.50.5

item 65item 65

0.10.1

item 91item 91

0.10.1

item 21item 21

0.30.3

item 10item 10

0.10.1

item 44item 44

0.10.1

worst score

best-score

Min top-2 score : 0.9

Threshold (Max of unseen tuples): 1.8

Pruning Candidates:Min top-2 < best score of candidate

Stopping ConditionThreshold < min top-2 ?

item 17item 17

[1.3, [1.3, 1.81.8]]

item 83item 83

[0.9, [0.9, 2.02.0]]

item 25item 25

[0.6, [0.6, 1.91.9]]

item 38item 38

[0.6, [0.6, 1.81.8]]

item 78item 78

[0.5, [0.5, 1.81.8]]

List 1 List 2 List 3 Candidates

NRA11

Page 12: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+

item 25item 25

0.60.6

item 17item 17

0.60.6

item 83item 83

0.90.9

item 78item 78

0.50.5

item 38item 38

0.60.6

item 17item 17

0.70.7

item 83item 83

0.40.4

item 14item 14

0.60.6

item 61item 61

0.30.3

item 17item 17

0.30.3

item 5item 5

0.60.6

item 81item 81

0.20.2

item 21item 21

0.20.2

item 83item 83

0.50.5

item 65item 65

0.10.1

item 91item 91

0.10.1

item 21item 21

0.30.3

item 10item 10

0.10.1

item 44item 44

0.10.1

worst score

best-score

item 83item 83

[1.3, [1.3, 1.91.9]]

item 17item 17

[1.3, [1.3, 1.91.9]]

item 25item 25

[0.6, [0.6, 1.51.5]]

item 78item 78

[0.5, [0.5, 1.41.4]]

Min top-2 score : 1.3

Threshold (Max of unseen tuples): 1.3

Pruning Candidates:Min top-2 < best score of candidate

Stopping ConditionThreshold < min top-2 ?

no more new items can get into top-2

but, extra candidates left in queue

List 1 List 2 List 3 Candidates

NRA12

Page 13: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+

item 25item 25

0.60.6

item 17item 17

0.60.6

item 83item 83

0.90.9

item 78item 78

0.50.5

item 38item 38

0.60.6

item 17item 17

0.70.7

item 83item 83

0.40.4

item 14item 14

0.60.6

item 61item 61

0.30.3

item 17item 17

0.30.3

item 5item 5

0.60.6

item 81item 81

0.20.2

item 21item 21

0.20.2

item 83item 83

0.50.5

item 65item 65

0.10.1

item 91item 91

0.10.1

item 21item 21

0.30.3

item 10item 10

0.10.1

item 44item 44

0.10.1

worst score

best-score

Min top-2 score : 1.3

Threshold (Max of unseen tuples): 1.1

Pruning Candidates:Min top-2 < best score of candidate

Stopping ConditionThreshold < min top-2 ?

no more new items can get into top-2

but, extra candidates left in queue

item 17item 17

1.61.6

item 83item 83

[1.3, [1.3, 1.91.9]]

item 25item 25

[0.6, [0.6, 1.41.4]]

List 1 List 2 List 3 Candidates

NRA13

Page 14: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+

item 25item 25

0.60.6

item 17item 17

0.60.6

item 83item 83

0.90.9

item 78item 78

0.50.5

item 38item 38

0.60.6

item 17item 17

0.70.7

item 83item 83

0.40.4

item 14item 14

0.60.6

item 61item 61

0.30.3

item 17item 17

0.30.3

item 5item 5

0.60.6

item 81item 81

0.20.2

item 21item 21

0.20.2

item 83item 83

0.50.5

item 65item 65

0.10.1

item 91item 91

0.10.1

item 21item 21

0.30.3

item 10item 10

0.10.1

item 44item 44

0.10.1

Min top-2 score : 1.6

Threshold (Max of unseen tuples): 0.8

Pruning Candidates:Min top-2 < best score of candidate

item 83item 83

1.81.8

item 17item 17

1.61.6

List 1 List 2 List 3 Candidates

NRA14

Page 15: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+

NRA performs only sorted accesses (SA) (No Random Access)

Random access (RA) lookup actual (final) score of an item often very useful

Problems with NRA high bookkeeping overhead for “high” values of k, gain in even access

cost not significant

NRA15

Page 16: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+TA

item 25

0.6

item 17

0.6

item 83

0.9

item 78

0.5

item 38

0.6

item 17

0.7

item 83

0.4

item 14

0.6

item 61

0.3

item 17

0.3

item 5

0.6

item 81

0.2

item 21

0.2

item 83

0.5

item 65

0.1

item 91

0.1

item 21

0.3

item 10

0.1

item 44

0.1lis

ts s

ort

ed b

y

score

List 1 List 2 List 3

(a1 , a2 , a3)

16

Page 17: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+TA

item 25

0.6

item 17

0.6

item 83

0.9

item 78

0.5

item 38

0.6

item 17

0.7

item 83

0.4

item 14

0.6

item 61

0.3

item 17

0.3

item 5

0.6

item 81

0.2

item 21

0.2

item 83

0.5

item 65

0.1

item 91

0.1

item 21

0.3

item 10

0.1

item 44

0.1

lists

sort

ed b

y

score

item 83

1.8

item 17

1.6

item 25

0.6

Candidatesmin top-2 score: 1.6maximum score for unseen items: 2.1

TA Algorithm: round 1

List 1 List 2 List 3

read one item from every list

Random access

Random access

17

Page 18: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Computing Exact Scores: Naïve

item score

i7

i1i2i3i4i5i6

i816

736562403918

16

seeker Jane

i7

i5i9i2i6i5i8

i3

seeker Ann

10

533630151410

5

scoreitem

tag = photositem score

i7

i1i8i4i2i3i6

i915

302927252320

13

seeker Jane

i4

i5i2i8i7i1i6

i3

seeker Ann

60

998078757263

50

scoreitem

tag = music

Typical: Maintain single inverted list per (seeker, tag), items ordered by score

+ can use standard top-k algorithms-- high space overhead

18

Page 19: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Computing Score Upper-Bounds Space saving strategy.

Maintain entries of the form (item,itemTaggers) where itemTaggers are all taggers who tagged the item with the tag.

Here every item is stored at most once.

Q now: what score to store with each entry? We store the maximum score that an item can have across

all possible seekers.

This is Global Upper-Bound strategy

Limitation: Time to dynamically computing exact scores at query time.

19

Page 20: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Score Upper-Bounds

Global Upper-Bound (GUB): 1 list per tag

tag = music

item taggers upper-bound

i6

i1i2i3i5i4i9

i7i8

Miguel,…Kath, …Sam, …Miguel, …Peter, …Jane, …Mary, …Miguel, …Kath, …

18

736562534036

1616

all seekers

+ low space overhead-- item upper-bounds, and list order(!) may differ from EXACT for most users-- time to dynamically computing exact scores at query time.

How do we do top-k processing with score upper-bounds?

Q: what score to store with each entry?

We store the maximum score that an item can have across all possible seekers.

20

Page 21: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Top-k with Score Upper-Bounds

gNRA - “generalized no random access” access all lists sequentially in parallel maintain a heap with partial exact scores stop when partial exact score of kth item > highest

possible score from unseen/incomplete items (computed using current list upper-bounds)

21

Page 22: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+gNRA – NRA Generalization

22

Page 23: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+gTA – TA Generalization

23

Page 24: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Performance of Global Upper Bound (GUB) and Exact

Space overhead total # number of entries in all inverted lists

Query processing time # of cursor moves

24

GUB Exactspace (IL entries)

74K 63M

time 479-18K 13 - 189 space baseline

time baseline

Page 25: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Clustering and Query-Processing

We want to reduce the distance between score upper-bound and the exact score.

Greater the distance, more processing may be required

Core idea Cluster users into groups and compute upper-

bound for the group.

Intuition group users whose behavior is similar

25

Page 26: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Clustering Seekers Cluster the seekers based on similarity in their

scores (because score of an item depends on the network).

Form an inverted list ILt,C for every tag t and cluster C (the score of an item being the maximum score over all seekers in the cluster).

Query processing for Q = t1.. tn and seeker u, we First find the cluster C(u) And then perform aggregation over the

collection

Global Upper-Bound (GUB) is where all seekers fall into the same cluster.

26

Page 27: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Clustering Seekers

assign each seeker to a cluster

compute an inverted list per cluster ub(i,t,C) = maxuC|Network(u) {v|Tagged(v,i,tj)}|

+ tighter bounds, item order usually closer to EXACT order than in Global Upper-Bound

-- space overhead still high (trade-off)

27

item taggersupper-bound

chanel

pumagucciadidasdieselversace nike

prada

Miguel,…Kath, …Sam, …Miguel, …Peter, …Jane, …Mary, …Chris, …

18

736562534036

16

Global Upper-Bound

item taggersupper-bound

gucciversacechanelpradapuma

Bob,…Peter, …Mary, …Chris, …Alice, …

6540181610

Example of Clusters

C1: seekers Bob & Alice

item taggersupper-bound

pumaadidasdieselnike

Miguel,…Sam, …Miguel, …Jane, …

73625336

gucci Kath, … 5

C2: seekers Sam & Miguel

Page 28: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+How do we cluster seekers?

Finding a cluster that minimizes worst, average computation time of top-k algorithms is NP-hard.

Proofs by reduction from independent task scheduling problem and minimum sum of squares problem

Authors present some heuristics Use some form of Normalized Discounted Cumulative Gain

(NDCG) which is a measure of the quality of a clustered list for a given seeker and keyword.

The metric compares the ideal (exact score) order in inverted lists with actual (score upper-bound) order

28

Page 29: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+NDCG - Example

i docID Log I –base2

Ranking

Rank/log i – base 2

Ideal ranking

Ideal ranking/log i –base 2

1 D1 0.00 3N/A 3N/A2 D2 1.00 2 2.00 3.00 3.003 D3 1.58 3 1.89 2.00 1.264 D4 2.00 0 0.00 2.00 1.005 D5 2.32 1 0.43 1.00 0.436 D6 2.58 2 0.77 0.00 0.00Cumulative Gain (CG) 9.49 5.10 5.69Distributive CG 8.10Ideal DCG 8.69Normalized CG (nDCG) 8.10/8.69 = 0.93

29

Page 30: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Clustering Taggers

For each tag t we partition the taggers into separate clusters.

We form inverted list and an item i in the list for cluster C gets the score as maxu ϵ seekers |Network(u) ∩ C ∩ {v1 | Tagged(v1,i,t)}|

How to cluster taggers? Graph with nodes as taggers and an edge exists between nodes v1

and v2 iff:

Items(v1,t) ∩ Items(v2,t) ≥ threshold

30

Page 31: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Clustering Seekers Metrics

Space Global Upper Bound has the lowest overhead. ASC and NCT achieve an order of magnitude improvement

in space overhead over Exact.

Time Both gNRA and gTA outperform Global Upper-bound. ASC outperforms NCT on both sequential and total accesses

in all cases for gTA and in all cases except one for gNRA. Inverted lists are shorter Score upper-bound order similar to exact score order for

many users

Average % improvement over Global Upper-Bound Normalized Cut: 38-72% Ratio Association 67-87%

31

Page 32: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Clustering Seekers

Cluster-Seekers improves query execution time over GUB by at least an order of magnitude, for all queries and all users

32

Page 33: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Clustering Taggers

Space Overhead is significantly lower than that of Exact and of

Cluster-Seekers

Time Best Case: all taggers relevant to a seeker will reside in a

single cluster Worst Case: All taggers will reside in separate clusters.

Idea: cluster taggers based on overlap in tagging assign each tagger to a cluster compute cluster upper-bounds: ub(i,t,C) = maxuSeekers, vC|Network(u) {v |Tagged(v,i,tj)}|

33

Page 34: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Clustering Taggers

34

Page 35: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+Conclusion and Next Steps

Cluster-Taggers worked best for seekers whose network fell into at most 3 * #tags clusters For others, query execution time degraded due to the

number of inverted lists that had to be processed

For these seekers Cluster-Taggers outperformed Cluster-Seekers in all cases Cluster-Taggers outperforms Global Upper-Bound by 94-

97%, in all cases.

Extended traditional top-k algorithms

Achieved a balance between time and space consumption.

35

Page 36: + Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish

+

Questions?WebCT/[email protected]

Thank You!