+ efficient network aware search in collaborative tagging sihem amer yahia, michael benedikt, laks...

+

Efficient network aware search in collaborative taggingSihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich

Presented by: Ashish ChawlaCSE 6339 Spring 2009

+Overview Opportunity:

Explore keyword search in a context where query results are determined by opinion of network of taggers related to a seeker.

Incorporate social behavior into processing search queries

Network Aware Search Results determined by opinion of network.

Existing top-k are too space intensive Dependence of scores on seeker’s network

Investigate clustering seekers based on behavior of networks

Del.icio.us datasets were used for experiments

2

+Introduction

What is Network Aware Search?

Examples: Flickr, YouTube, del.icio.us, photo tagging on Facebook

Users contribute content

annotate items (photos, videos, URLs, …) with tags form social networks

friends/family, interest-based need help discovering relevant content

What is Relevance of an item?

3

+What is Network-Aware Search?

4

+Claims

Define what is network-aware search.

Improvise top-k algorithms to Network-Aware Search, by using score upper-bounds and EXACT strategy.

Refine score upper-bounds based on the user’s network and tagging behavior

5

+Data Model

Roger, i1, musicRoger, i3, musicRoger, i5, sports…Hugo, i1, musicHugo, i22, music…Minnie, i2, sports…Linda, i2, footballLinda, i28, news…

Tagged(user u,item i,tag t)

Taggers = u TaggedSeekers = u Link

Link(u1,v1): directed edgeNetwork (u) = { v | Link (u,v) }For seeker u1ε Seekers, Network(u1) = neighbors of u1

Link (user u, user v)

6

+What are Scores?

Query is a set of tags Q = {t1,t2,…,tn} example: fashion, www, sports, artificial

intelligence

For a seeker u, a tag t, and a item I (Score per tag)score(i,u,t) = f(|Network(u) {v, |Tagged(v,i,t)}|)

Overall Score of the queryscore(i,u,Q) = g(score(i,u,t1), score(i,u,t2),…, score(i,u,

tn))

f and g are monotone, where f = COUNT, g = SUM

7

+Problem Statement

Given a user query Q = t1 … tn and a number k, we want to efficiently determine the top k items ie: k items with the highest overall

score

8

+Standard Top-k Processing

Q = {t1,t2,…,tn}

Inverted lists per tag, IL1, IL2, … ILn, sorted on scores

score (i) = g(score(i, IL1), score(i, IL2) , …, score(i, IL3))

Intuition high-scoring items are close to the top of most lists

Fagin-style processing: NRA (no random access)

access all lists sequentially in parallel

maintain a heap sorted on partial scores

stop when partial score of kth item > best case score of unseen/incomplete items

9

+

Item 25Item 25

0.60.6

Item 17Item 17

0.60.6

Item 83Item 83

0.90.9

item78item78

0.50.5

item38item38

0.60.6

item17item17

0.70.7

item83item83

0.40.4

item14item14

0.60.6

item61item61

0.30.3

item17item17

0.30.3

item5item5

0.60.6

item81item81

0.20.2

item21item21

0.20.2

item83item83

0.50.5

item65item65

0.10.1

item91item91

0.10.1

item21item21

0.30.3

item10item10

0.10.1

item44item44

0.10.1

Item 83Item 83

[0.9, 2.1][0.9, 2.1]

Item 17Item 17

[0.6, 2.1][0.6, 2.1]

item 25item 25

[0.6, 2.1][0.6, 2.1]

worst score

best-score

Min top-2 score : 0.6

Threshold (Max of unseen tuples): 2.1

Pruning Candidates:Min top-2 < best score of candidate

Stopping ConditionThreshold < min top-2 ?

List 1 List 2 List 3 Candidates

0.6+0.6+0.9=2.1

NRA10

+

item 25item 25

0.60.6

item 17item 17

0.60.6

item 83item 83

0.90.9

item 78item 78

0.50.5

item 38item 38

0.60.6

item 17item 17

0.70.7

item 83item 83

0.40.4

item 14item 14

0.60.6

item 61item 61

0.30.3

item 17item 17

0.30.3

item 5item 5

0.60.6

item 81item 81

0.20.2

item 21item 21

0.20.2

item 83item 83

0.50.5

item 65item 65

0.10.1

item 91item 91

0.10.1

item 21item 21

0.30.3

item 10item 10

0.10.1

item 44item 44

0.10.1

worst score

best-score





item 17item 17

[1.3, [1.3, 1.81.8]]

item 83item 83

[0.9, [0.9, 2.02.0]]

item 25item 25

[0.6, [0.6, 1.91.9]]

item 38item 38

[0.6, [0.6, 1.81.8]]

item 78item 78

[0.5, [0.5, 1.81.8]]


NRA11

+

item 25item 25

0.60.6

item 17item 17

0.60.6

item 83item 83

0.90.9

item 78item 78

0.50.5

item 38item 38

0.60.6

item 17item 17

0.70.7

item 83item 83

0.40.4

item 14item 14

0.60.6

item 61item 61

0.30.3

item 17item 17

0.30.3

item 5item 5

0.60.6

item 81item 81

0.20.2

item 21item 21

0.20.2

item 83item 83

0.50.5

item 65item 65

0.10.1

item 91item 91

0.10.1

item 21item 21

0.30.3

item 10item 10

0.10.1

item 44item 44

0.10.1

worst score

best-score

item 83item 83

[1.3, [1.3, 1.91.9]]

item 17item 17

[1.3, [1.3, 1.91.9]]

item 25item 25

[0.6, [0.6, 1.51.5]]

item 78item 78

[0.5, [0.5, 1.41.4]]





no more new items can get into top-2

but, extra candidates left in queue


NRA12

+

item 25item 25

0.60.6

item 17item 17

0.60.6

item 83item 83

0.90.9

item 78item 78

0.50.5

item 38item 38

0.60.6

item 17item 17

0.70.7

item 83item 83

0.40.4

item 14item 14

0.60.6

item 61item 61

0.30.3

item 17item 17

0.30.3

item 5item 5

0.60.6

item 81item 81

0.20.2

item 21item 21

0.20.2

item 83item 83

0.50.5

item 65item 65

0.10.1

item 91item 91

0.10.1

item 21item 21

0.30.3

item 10item 10

0.10.1

item 44item 44

0.10.1

worst score

best-score





no more new items can get into top-2

but, extra candidates left in queue

item 17item 17

1.61.6

item 83item 83

[1.3, [1.3, 1.91.9]]

item 25item 25

[0.6, [0.6, 1.41.4]]


NRA13

+

item 25item 25

0.60.6

item 17item 17

0.60.6

item 83item 83

0.90.9

item 78item 78

0.50.5

item 38item 38

0.60.6

item 17item 17

0.70.7

item 83item 83

0.40.4

item 14item 14

0.60.6

item 61item 61

0.30.3

item 17item 17

0.30.3

item 5item 5

0.60.6

item 81item 81

0.20.2

item 21item 21

0.20.2

item 83item 83

0.50.5

item 65item 65

0.10.1

item 91item 91

0.10.1

item 21item 21

0.30.3

item 10item 10

0.10.1

item 44item 44

0.10.1




item 83item 83

1.81.8

item 17item 17

1.61.6


NRA14

+

NRA performs only sorted accesses (SA) (No Random Access)

Random access (RA) lookup actual (final) score of an item often very useful

Problems with NRA high bookkeeping overhead for “high” values of k, gain in even access

cost not significant

NRA15

+TA

item 25

0.6

item 17

0.6

item 83

0.9

item 78

0.5

item 38

0.6

item 17

0.7

item 83

0.4

item 14

0.6

item 61

0.3

item 17

0.3

item 5

0.6

item 81

0.2

item 21

0.2

item 83

0.5

item 65

0.1

item 91

0.1

item 21

0.3

item 10

0.1

item 44

0.1lis

ts s

ort

ed b

y

score

List 1 List 2 List 3

(a1 , a2 , a3)

16

+TA

item 25

0.6

item 17

0.6

item 83

0.9

item 78

0.5

item 38

0.6

item 17

0.7

item 83

0.4

item 14

0.6

item 61

0.3

item 17

0.3

item 5

0.6

item 81

0.2

item 21

0.2

item 83

0.5

item 65

0.1

item 91

0.1

item 21

0.3

item 10

0.1

item 44

0.1

lists

sort

ed b

y

score

item 83

1.8

item 17

1.6

item 25

0.6

Candidatesmin top-2 score: 1.6maximum score for unseen items: 2.1

TA Algorithm: round 1

List 1 List 2 List 3

read one item from every list

Random access

Random access

17

+Computing Exact Scores: Naïve

item score

i7

i1i2i3i4i5i6

i816

736562403918

16

seeker Jane

i7

i5i9i2i6i5i8

i3

seeker Ann

10

533630151410

5

scoreitem

tag = photositem score

i7

i1i8i4i2i3i6

i915

302927252320

13

seeker Jane

i4

i5i2i8i7i1i6

i3

seeker Ann

60

998078757263

50

scoreitem

tag = music

Typical: Maintain single inverted list per (seeker, tag), items ordered by score

+ can use standard top-k algorithms-- high space overhead

18

+Computing Score Upper-Bounds Space saving strategy.

Maintain entries of the form (item,itemTaggers) where itemTaggers are all taggers who tagged the item with the tag.

Here every item is stored at most once.

Q now: what score to store with each entry? We store the maximum score that an item can have across

all possible seekers.

This is Global Upper-Bound strategy

Limitation: Time to dynamically computing exact scores at query time.

19

+Score Upper-Bounds

Global Upper-Bound (GUB): 1 list per tag

tag = music

item taggers upper-bound

i6

i1i2i3i5i4i9

i7i8

Miguel,…Kath, …Sam, …Miguel, …Peter, …Jane, …Mary, …Miguel, …Kath, …

18

736562534036

1616

all seekers

+ low space overhead-- item upper-bounds, and list order(!) may differ from EXACT for most users-- time to dynamically computing exact scores at query time.

How do we do top-k processing with score upper-bounds?

Q: what score to store with each entry?

We store the maximum score that an item can have across all possible seekers.

20

+Top-k with Score Upper-Bounds

gNRA - “generalized no random access” access all lists sequentially in parallel maintain a heap with partial exact scores stop when partial exact score of kth item > highest

possible score from unseen/incomplete items (computed using current list upper-bounds)

21

+gNRA – NRA Generalization

22

+gTA – TA Generalization

23

+Performance of Global Upper Bound (GUB) and Exact

Space overhead total # number of entries in all inverted lists

Query processing time # of cursor moves

24

GUB Exactspace (IL entries)

74K 63M

time 479-18K 13 - 189 space baseline

time baseline

+Clustering and Query-Processing

We want to reduce the distance between score upper-bound and the exact score.

Greater the distance, more processing may be required

Core idea Cluster users into groups and compute upper-

bound for the group.

Intuition group users whose behavior is similar

25

+Clustering Seekers Cluster the seekers based on similarity in their

scores (because score of an item depends on the network).

Form an inverted list ILt,C for every tag t and cluster C (the score of an item being the maximum score over all seekers in the cluster).

Query processing for Q = t1.. tn and seeker u, we First find the cluster C(u) And then perform aggregation over the

collection

Global Upper-Bound (GUB) is where all seekers fall into the same cluster.

26

+Clustering Seekers

assign each seeker to a cluster

compute an inverted list per cluster ub(i,t,C) = maxuC|Network(u) {v|Tagged(v,i,tj)}|

+ tighter bounds, item order usually closer to EXACT order than in Global Upper-Bound

-- space overhead still high (trade-off)

27

item taggersupper-bound

chanel

pumagucciadidasdieselversace nike

prada

Miguel,…Kath, …Sam, …Miguel, …Peter, …Jane, …Mary, …Chris, …

18

736562534036

16

Global Upper-Bound


gucciversacechanelpradapuma

Bob,…Peter, …Mary, …Chris, …Alice, …

6540181610

Example of Clusters

C1: seekers Bob & Alice


pumaadidasdieselnike

Miguel,…Sam, …Miguel, …Jane, …

73625336

gucci Kath, … 5

C2: seekers Sam & Miguel

+How do we cluster seekers?

Finding a cluster that minimizes worst, average computation time of top-k algorithms is NP-hard.

Proofs by reduction from independent task scheduling problem and minimum sum of squares problem

Authors present some heuristics Use some form of Normalized Discounted Cumulative Gain

(NDCG) which is a measure of the quality of a clustered list for a given seeker and keyword.

The metric compares the ideal (exact score) order in inverted lists with actual (score upper-bound) order

28

+NDCG - Example

i docID Log I –base2

Ranking

Rank/log i – base 2

Ideal ranking

Ideal ranking/log i –base 2

1 D1 0.00 3N/A 3N/A2 D2 1.00 2 2.00 3.00 3.003 D3 1.58 3 1.89 2.00 1.264 D4 2.00 0 0.00 2.00 1.005 D5 2.32 1 0.43 1.00 0.436 D6 2.58 2 0.77 0.00 0.00Cumulative Gain (CG) 9.49 5.10 5.69Distributive CG 8.10Ideal DCG 8.69Normalized CG (nDCG) 8.10/8.69 = 0.93

29

+Clustering Taggers

For each tag t we partition the taggers into separate clusters.

We form inverted list and an item i in the list for cluster C gets the score as maxu ϵ seekers |Network(u) ∩ C ∩ {v1 | Tagged(v1,i,t)}|

How to cluster taggers? Graph with nodes as taggers and an edge exists between nodes v1

and v2 iff:

Items(v1,t) ∩ Items(v2,t) ≥ threshold

30

+Clustering Seekers Metrics

Space Global Upper Bound has the lowest overhead. ASC and NCT achieve an order of magnitude improvement

in space overhead over Exact.

Time Both gNRA and gTA outperform Global Upper-bound. ASC outperforms NCT on both sequential and total accesses

in all cases for gTA and in all cases except one for gNRA. Inverted lists are shorter Score upper-bound order similar to exact score order for

many users

Average % improvement over Global Upper-Bound Normalized Cut: 38-72% Ratio Association 67-87%

31

+Clustering Seekers

Cluster-Seekers improves query execution time over GUB by at least an order of magnitude, for all queries and all users

32

+Clustering Taggers

Space Overhead is significantly lower than that of Exact and of

Cluster-Seekers

Time Best Case: all taggers relevant to a seeker will reside in a

single cluster Worst Case: All taggers will reside in separate clusters.

Idea: cluster taggers based on overlap in tagging assign each tagger to a cluster compute cluster upper-bounds: ub(i,t,C) = maxuSeekers, vC|Network(u) {v |Tagged(v,i,tj)}|

33

+Clustering Taggers

34

+Conclusion and Next Steps

Cluster-Taggers worked best for seekers whose network fell into at most 3 * #tags clusters For others, query execution time degraded due to the

number of inverted lists that had to be processed

For these seekers Cluster-Taggers outperformed Cluster-Seekers in all cases Cluster-Taggers outperforms Global Upper-Bound by 94-

97%, in all cases.

Extended traditional top-k algorithms

Achieved a balance between time and space consumption.

35

+

Questions?WebCT/[email protected]

Thank You!

+ efficient network aware search in collaborative tagging sihem amer yahia, michael benedikt, laks...

Documents