Integrating Semantics-Based Access Mechanisms with P2P File Systems


Page 1: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Integrating Semantics-Based Access Mechanisms with P2P File Systems

Yingwu Zhu, Honghao Wang and Yiming Hu

Page 2: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Outline

- Background
- System Design
- Related Work
- Conclusions and Future Work

Page 3: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Background

Current P2P file systems (e.g., CFS and PAST):
- Layer FS functionalities on a distributed hash table (DHT), e.g., Chord or Pastry
- Do not support semantics-based access, because DHTs support only exact-match lookups

Page 4: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Background

| Layer | Responsibility |
|-------|----------------|
| FS    | Stores/retrieves file objects into/from the DHT; presents a file system interface to applications/users |
| DHT   | Supports a hash-table interface of get(fileID) and put(fileID, file) |

Table: Software layering in a P2P file system
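The layering above can be sketched in Python as follows. This is a minimal, single-process illustration and not the CFS or PAST API: `InMemoryDHT` and `SimpleFS` are hypothetical names, a dict stands in for the distributed hash table, and deriving the fileID from a SHA-1 of the path is an assumption made only for the example.

```python
import hashlib


class InMemoryDHT:
    """Stand-in for a DHT: a hash-table interface with exact-match keys only."""

    def __init__(self):
        self._table = {}

    def put(self, file_id, file):
        self._table[file_id] = file

    def get(self, file_id):
        # Exact-match lookup: returns None unless file_id equals a stored key.
        return self._table.get(file_id)


class SimpleFS:
    """FS layer: maps a file-system style interface onto DHT put/get."""

    def __init__(self, dht):
        self.dht = dht

    @staticmethod
    def file_id(path):
        # Assumption for illustration: the fileID is the SHA-1 of the path.
        return hashlib.sha1(path.encode("utf-8")).hexdigest()

    def write(self, path, data):
        self.dht.put(self.file_id(path), data)

    def read(self, path):
        return self.dht.get(self.file_id(path))


fs = SimpleFS(InMemoryDHT())
fs.write("/docs/p2p.txt", b"peer-to-peer file systems")
```

Reading back `/docs/p2p.txt` returns the stored bytes, while a near-miss such as `/docs/p2p` returns nothing: the hash of a slightly different key is unrelated, which is why exact-match lookups alone cannot answer similarity queries.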

Page 5: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Motivation

A problem of P2P file systems:
- They support only exact-match lookups given a file object identifier fileID
  - get(fileID): retrieves the file corresponding to the fileID
  - put(fileID, file): stores the file with the fileID as a DHT key
- Extending exact-match lookups to semantic access is non-trivial

Page 6: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Motivation

A challenge for P2P file systems:
- Provide convenient access to a vast amount of information
- E.g., provide semantics-based search capabilities to efficiently locate semantically close files for browsing, purging, etc.

Page 7: Integrating Semantics-Based Access Mechanisms with P2P File Systems

System Design

- Targeted Application
- System Architecture
- Semantic Indexing and Locating
- Evaluation

Page 8: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Targeted Application

- Semantic search is expressed in natural language.
  - Query: “locate files similar to f1”
  - The query results are materialized via semantic directories
- Not a simple keyword match: “locate files with k1, k2 and k3”*

*k1, k2 and k3 are three distinct keywords

Page 9: Integrating Semantics-Based Access Mechanisms with P2P File Systems

System Architecture

- Extends a P2P file system to support semantics-based access
- Major components:
  - Semantic Extractor Registry
  - Semantic Indexing and Locating Utility

Page 10: Integrating Semantics-Based Access Mechanisms with P2P File Systems

System Architecture

[Figure: Major components of the system architecture. Boxes shown: Application/User, FS, Extractor Registry, Semantic Indexing and Locating Utility, DHT.]

Page 11: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Semantic Extractor Registry

- A set of semantic extractors
- Leverage IR algorithms such as VSM and LSI
- Represent a file as a semantic vector (SV), typically 200-300 keywords
- Semantically close files have similar SVs
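As a rough illustration of what an extractor produces, the sketch below builds a semantic vector as the set of a file's most frequent terms. It is a simplified stand-in in the spirit of VSM, not LSI and not the extractors used in the system; the function name `extract_sv`, the stop-word list and the tokenization rule are all assumptions.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list for the example.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


def extract_sv(text, max_keywords=300):
    """Return up to max_keywords of the most frequent non-stop-word terms."""
    terms = [t for t in re.findall(r"[a-z0-9]+", text.lower())
             if t not in STOP_WORDS]
    return {term for term, _ in Counter(terms).most_common(max_keywords)}


sv1 = extract_sv("Peer-to-peer file systems layer a file system on a DHT.")
sv2 = extract_sv("A DHT-based peer-to-peer file system.")
shared = sv1 & sv2  # keywords common to both files
```

Two files on the same topic end up with overlapping keyword sets, which is the property the indexing utility relies on.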

Page 12: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Semantic Indexing and Locating Utility

- Provides semantics-based indexing and retrieval capabilities
- Relies on the property of locality sensitive hash functions (LSH)
- Derives a small number of semantic identifiers (semIDs) from a file’s SV as the DHT keys for indexing and locating

Page 13: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Semantic Indexing and Locating Utility

Goals:
- The indices of semantically close files are clustered on the same peer nodes with high probability (nearly 100%)
- Efficiently locate semantically close files by searching a small number of peer nodes (e.g., 20)

Page 14: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Locality Sensitive Hashing

- A family of hash functions $F$ is locality sensitive if, for $h \in F$ operating on two sets $A$ and $B$, we have:
  $\Pr_{h \in F}[h(A) = h(B)] = \mathrm{sim}(A, B)$
- Min-wise independent permutations are LSH, with the similarity function:
  $\mathrm{sim}(A, B) = |A \cap B| / |A \cup B|$
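The property can be checked empirically with a small Python sketch. It draws random permutations of the union of two sets, takes the first element of each set under the permutation as its hash value, and compares the observed collision rate with the similarity. The function names, the example sets, the trial count and the seed are assumptions for illustration.

```python
import random


def jaccard(a, b):
    """sim(A, B) = |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)


def estimate_collision_rate(a, b, trials=2000, seed=0):
    """Fraction of random permutations under which the min-hash values agree."""
    rng = random.Random(seed)
    universe = sorted(a | b)  # elements outside the union cannot be a minimum
    hits = 0
    for _ in range(trials):
        order = universe[:]
        rng.shuffle(order)  # one random permutation of the universe
        rank = {x: i for i, x in enumerate(order)}
        # h(S) = the element of S that comes first under the permutation
        if min(a, key=rank.__getitem__) == min(b, key=rank.__getitem__):
            hits += 1
    return hits / trials


A = set(range(0, 100))
B = set(range(50, 150))
print(jaccard(A, B), estimate_collision_rate(A, B))
```

The two printed values should be close: the hashes collide exactly when the first element of the permuted union falls in the intersection, which happens with probability |A ∩ B| / |A ∪ B|.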

Page 15: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Semantic Indexing

Given a file’s SV:

- Step 1: Derive a small number of semIDs from the SV using LSH
- Step 2: Index the file by using these semIDs as the DHT keys

Page 16: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Semantic Indexing

Using n groups of m hash functions.

Results:
- The indices of semantically close files are hashed to the same peers with probability $1 - (1 - p^m)^n$
- p is expected to be high for semantically close files, and so is this probability

*p = sim(f1, f2), the similarity between the two files’ SVs
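The reasoning behind the expression: all m hash functions in a group agree with probability p^m, so a group fails to agree with probability 1 - p^m, and at least one of the n groups agrees with probability 1 - (1 - p^m)^n. The short Python sketch below evaluates it for the (m, n) settings that appear in the evaluation; the function name and the value p = 0.8 are assumptions chosen only to show the trend.

```python
def cluster_probability(p, m, n):
    """Probability that at least one of n groups of m hash functions agrees."""
    return 1 - (1 - p ** m) ** n


# p = 0.8 is an assumed similarity, used only for illustration.
for m in (5, 2):
    row = [round(cluster_probability(0.8, m, n), 3) for n in (5, 10, 15, 20)]
    print(f"m={m}: {row}")
```

Increasing n raises the probability, while increasing m lowers it, since more hash functions must agree within a group.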

Page 17: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Semantic Indexing

Given a file’s SV A:

    proc sem_index (A)
        convert A into A’;                // A’ is a set of integers obtained using SHA-1
        for each g[j] do                  // g[j] is one of n groups of hash functions
            semID[j] = 0;
            for each h[i] in g[j] do      // g[j] has m hash functions
                semID[j] ^= h[i](A’);     // ^ is the XOR operation
            endfor
        endfor
        for each semID[j] do
            // semantic indexing
            insert the tuple <semID[j], fileID, A> into the DHT with semID[j] as the DHT key
        endfor
    endproc
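A runnable Python sketch of the procedure above, under stated assumptions: the hash functions are taken to be min-hashes of the form min((a*x + b) mod PRIME), a common approximation of min-wise independent permutations that the slides do not specify; a dict of lists stands in for the DHT; and the helper names (`to_int_set`, `make_groups`, `sem_ids`) are hypothetical.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # modulus for the illustrative hash family (assumption)


def to_int_set(sv):
    """Convert A into A': map each keyword to an integer via SHA-1."""
    return {int.from_bytes(hashlib.sha1(k.encode("utf-8")).digest()[:8], "big") % PRIME
            for k in sv}


def make_groups(n, m, seed=42):
    """n groups of m (a, b) pairs; each pair defines one min-hash function."""
    rng = random.Random(seed)
    return [[(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(m)]
            for _ in range(n)]


def sem_ids(sv, groups):
    """Derive one semID per group by XOR-ing the group's m min-hash values."""
    a_prime = to_int_set(sv)
    ids = []
    for group in groups:
        sem_id = 0
        for a, b in group:
            sem_id ^= min((a * x + b) % PRIME for x in a_prime)
        ids.append(sem_id)
    return ids


def sem_index(dht, file_id, sv, groups):
    """Insert <semID, fileID, SV> under each semID key."""
    for sem_id in sem_ids(sv, groups):
        dht.setdefault(sem_id, []).append((sem_id, file_id, frozenset(sv)))


groups = make_groups(n=5, m=2)
dht = {}
sem_index(dht, "f1", {"p2p", "dht", "semantic", "index"}, groups)
sem_index(dht, "f2", {"p2p", "dht", "semantic", "index"}, groups)
```

Because f1 and f2 have identical SVs here, every semID matches and both index entries land under the same keys; files with merely similar SVs would share only some of their semIDs.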

Page 18: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Semantic Locating

Given a query’s SV:

- Step 1: Derive a small number of semIDs from the SV using LSH
- Step 2: Locate those semantically close files by using these semIDs as the DHT keys

Goal: answer a query by consulting only a small number of peer nodes
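The two locating steps can be sketched in Python as below. To keep the example short, the semID derivation is a stand-in (one salted SHA-1 min-hash per group, i.e. m = 1), a dict stands in for the DHT, and ranking candidates by set similarity is an assumption about how results might be ordered; none of these details are taken from the slides.

```python
import hashlib


def sem_ids(sv, n=10):
    """Stand-in semID derivation (assumption): one salted SHA-1 min-hash per group."""
    return [min(hashlib.sha1(f"{g}:{kw}".encode("utf-8")).hexdigest() for kw in sv)
            for g in range(n)]


def sem_index(dht, file_id, sv):
    """Index a file under each of its semIDs."""
    for key in sem_ids(sv):
        dht.setdefault(key, []).append((file_id, frozenset(sv)))


def sem_locate(dht, query_sv, top_k=5):
    """Look up each semID as an exact-match DHT key, then rank the candidates."""
    query = frozenset(query_sv)
    candidates = {}
    for key in sem_ids(query):
        for file_id, sv in dht.get(key, []):  # one DHT get per semID
            candidates[file_id] = len(query & sv) / len(query | sv)
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


dht = {}
sem_index(dht, "A", {"p2p", "dht", "lookup", "chord", "pastry"})
sem_index(dht, "B", {"p2p", "dht", "lookup", "chord", "overlay"})
sem_index(dht, "Z", {"recipe", "flour", "sugar", "oven", "butter"})
results = sem_locate(dht, {"p2p", "dht", "lookup", "chord", "pastry"})
```

The query issues only n exact-match lookups, so at most n peers are consulted regardless of how many files are indexed; an unrelated file such as Z shares no semID with the query and is never touched.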

Page 19: Integrating Semantics-Based Access Mechanisms with P2P File Systems

[Figure: Demonstration of semantic indexing and locating. Labels shown: User1, User2 and peer nodes; indexing of A, B and C; semantic locating with the query "locate files similar to D"; responses "A, B", "A, B, C" and "NULL". A, B, C and D are semantically close files.]

Page 20: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Evaluation

- Load distribution of semantic indexing
  - Semantic indices per peer node
- Performance of semantic locating
  - Percentage of semantically close files that can be located (recall)

Page 21: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Semantic Indexing

[Figure: Load distribution when the system indexes 10,000 files. x-axis: number of peer nodes; y-axis: number of file indexes per node.]

Page 22: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Semantic Indexing

[Figure: Load distribution in a 1000 node system. x-axis: number of indexed files (x1000); y-axis: number of file indexes per node.]

Page 23: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Perf. of Semantic Locating

Recall for different values of m (rows) and n (columns):

| m \ n | 5   | 10  | 15   | 20   |
|-------|-----|-----|------|------|
| 5     | 84% | 92% | 94%  | 96%  |
| 2     | 94% | 99% | 100% | 100% |

[1] Apply n groups of m hash functions
[2] Percentage of files located (128-byte fingerprint limit as a SV)
[3] m and n determine the performance of semantic locating

Page 24: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Related Work

- P2P file systems like CFS and PAST
- Exact-match lookups in DHTs
- Traditional semantic file systems like SFS and HAC
- IR algorithms such as VSM and LSI
- LSH and its related applications (e.g., the nearest neighbor problem, cached data location in databases)

Page 25: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Conclusions

- The first step toward supporting semantics-based access in P2P file systems
- LSH-based semantic indexing and locating approach
  - Imposes small storage overhead (several MBs per node)
  - Efficiency: answers a query by consulting a small number of peers (e.g., 20)
  - Approximate results, but acceptable

Page 26: Integrating Semantics-Based Access Mechanisms with P2P File Systems

Future Work

- Query consistency and refinement
- Evaluation using IR workloads (e.g., TREC data sets)