Integrating Semantics-Based Access Mechanisms with P2P File Systems
Integrating Semantics-Based Access Mechanisms with P2P File Systems
Yingwu Zhu, Honghao Wang and Yiming Hu
Outline
Background
System Design
Related Work
Conclusions and Future Work
Background
Current P2P file systems (e.g., CFS and PAST)
Layer FS functionalities on a distributed hash table (DHT), e.g., Chord, Pastry
Do not support semantics-based access, because DHTs support only exact-match lookups
Background
Layer: Responsibility
FS: Stores/retrieves file objects into/from the DHT; presents a file system interface to applications/users
DHT: Supports a hash-table interface of get(fileID) and put(fileID, file)

Software layering in a P2P file system
A problem of P2P file systems
Supports only exact-match lookups given a file object identifier fileID
get(fileID): retrieves the file corresponding to the fileID
put(fileID, file): stores the file with the fileID as a DHT key
Extending exact-match lookups to semantic access is non-trivial (a minimal interface sketch follows)
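A minimal sketch of the exact-match get/put interface, assuming a Python dict stands in for the DHT (the ToyDHT class and names are illustrative, not the CFS/PAST API):

import hashlib

class ToyDHT:
    """Illustrative stand-in for a DHT: a hash table spread over peer nodes."""
    def __init__(self):
        self.store = {}

    def put(self, file_id: str, data: bytes):
        self.store[file_id] = data       # stored under the exact key

    def get(self, file_id: str):
        return self.store.get(file_id)   # retrieval requires the exact same key

dht = ToyDHT()
file_id = hashlib.sha1(b"report.txt contents").hexdigest()
dht.put(file_id, b"report.txt contents")
print(dht.get(file_id) is not None)      # True: exact-match lookup succeeds
print(dht.get("files about reports"))    # None: a semantic query key finds nothing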
Motivation
A challenge to P2P file systems: provide convenient access to a vast amount of information
E.g., provide semantics-based search capabilities to efficiently locate semantically close files for browsing, purging, etc.
System Design
Targeted Application
System Architecture
Semantic Indexing and Locating
Evaluation
Targeted Application
Semantic search is expressed in natural language
Query: “locate files similar to f1”
The query results are materialized via semantic directories
Not a simple keyword match: “locate files with k1, k2 and k3”
*k1, k2 and k3 are three distinct keywords
System Architecture
Extends a P2P file system to support semantics-based access
Major components
Semantic Extractor Registry
Semantic Indexing and Locating Utility
System Architecture
[Figure: Major components of the system architecture — Application/User, FS, Semantic Extractor Registry, Semantic Indexing and Locating Utility, DHT]
Semantic Extractor Registry
A set of semantic extractors
Leverage IR algorithms: VSM and LSI
Represent a file as a semantic vector (SV), typically 200-300 keywords
Semantically close files have similar SVs (see the sketch below)
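A minimal sketch of what an extractor might produce, assuming a plain vector-space model via scikit-learn's TfidfVectorizer as a stand-in for the paper's VSM/LSI extractors; the 300-keyword cap mirrors the 200-300 keyword SV above, and the documents are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "peer to peer file system built on a distributed hash table",
    "a distributed hash table underlies this peer to peer storage system",
    "recipe for baking sourdough bread at home",
]

# Keep at most 300 keywords per SV, mirroring the 200-300 keyword figure above.
vectorizer = TfidfVectorizer(max_features=300, stop_words="english")
svs = vectorizer.fit_transform(docs)          # one semantic vector (SV) per file

# Semantically close files (docs 0 and 1) get similar SVs; the unrelated doc does not.
print(cosine_similarity(svs[0], svs[1]))      # relatively high
print(cosine_similarity(svs[0], svs[2]))      # near zero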
Semantic Indexing and Locating Utility
Provides semantics-based indexing and retrieval capabilities
Relies on the property of Locality Sensitive Hash Functions (LSH)
Derives a small number of semantic identifiers (semIDs) from a file’s SV as the DHT keys for indexing and locating
Goals
The indices of semantically close files are clustered on the same peer nodes with high probability (nearly 100%)
Efficiently locate semantically close files by searching a small number of peer nodes (e.g., 20)
Semantic Indexing and Locating Utility
Locality Sensitive Hashing
A family of hash functions F is locality sensitive if, for h drawn from F and any two sets A and B:
Pr[h(A) = h(B)] = sim(A, B)
Min-wise independent permutations are LSH, with the similarity function sim(A, B) = |A ∩ B| / |A ∪ B| (a sketch follows)
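A minimal sketch of min-wise hashing, assuming hash functions of the form (a·x + b) mod p stand in for the random permutations (an illustrative construction, not the paper's exact hash family); the fraction of matching minimum values approximates sim(A, B):

import random

P = (1 << 61) - 1                       # large prime for the hash family

def make_minhash(num_hashes=100, seed=42):
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(num_hashes)]
    def signature(items):
        # one minimum value per hash function, i.e. one min-wise "permutation" each
        return [min((a * hash(x) + b) % P for x in items) for a, b in coeffs]
    return signature

signature = make_minhash()
A = {"p2p", "dht", "file", "system", "chord"}
B = {"p2p", "dht", "file", "system", "pastry"}

sig_a, sig_b = signature(A), signature(B)
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print(estimate, len(A & B) / len(A | B))     # the estimate tracks the Jaccard similarity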
Semantic Indexing
Given a file’s SV
Step 1: Derive a small number of semIDs from the SV using LSH
Step 2: Index the file by using these semIDs as the DHT keys
Semantic Indexing
Using n groups of m hash functions
Results:
The indices of semantically close files are hashed to the same peers with probability 1 - (1 - p^m)^n
p is expected to be high for semantically close files, so is the probability (see the numbers below)
*p = sim(f1, f2), the similarity between two files’ SVs
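A quick check of the 1 - (1 - p^m)^n probability for some illustrative p values (the m and n pairs echo the evaluation table later in the deck):

def collision_prob(p, m, n):
    """Probability that two files with SV similarity p share at least one semID
    when n groups of m hash functions are used."""
    return 1 - (1 - p ** m) ** n

for p in (0.9, 0.7, 0.3):                # illustrative SV similarities
    for m, n in ((2, 10), (5, 20)):
        print(f"p={p} m={m} n={n}: {collision_prob(p, m, n):.3f}")
# High p (semantically close files) drives the probability toward 1;
# larger m suppresses collisions between dissimilar files, larger n boosts recall.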
Semantic Indexing
Given a file’s SV A:

proc sem_index(A) {
  convert A into A';                    \\ A' is a set of integers obtained via SHA-1
  for each g[j] do                      \\ g[j] is one of the n groups of hash functions
    semID[j] = 0;
    for each h[i] in g[j] do            \\ g[j] has m hash functions
      semID[j] ^= h[i](A');             \\ ^ is the XOR operation
    endfor
  endfor
  for each semID[j] do
    insert the tuple <semID[j], fileID, A> into the DHT by using
      semID[j] as the DHT key           \\ semantic indexing
  endfor
endproc
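A runnable sketch of the same procedure under stated assumptions: a Python dict stands in for the DHT, and each h[i] is a salted SHA-1 min-hash over the SV keywords (the slide does not pin down the exact hash family, so these details are illustrative):

import hashlib
from collections import defaultdict

dht = defaultdict(list)                       # stand-in for the DHT: semID -> indexed tuples

def sha1_int(data: str) -> int:
    return int.from_bytes(hashlib.sha1(data.encode()).digest()[:8], "big")

def h(i: int, j: int, a_prime: set) -> int:
    """One min-wise style hash: salt SHA-1 with the (group, function) pair, take the minimum."""
    return min(sha1_int(f"{j}:{i}:{x}") for x in a_prime)

def sem_index(file_id: str, sv: list, n: int = 10, m: int = 2):
    """Index a file's semantic vector (a list of keywords) under n semIDs."""
    a_prime = {sha1_int(kw) for kw in sv}     # A': keywords mapped to integers via SHA-1
    for j in range(n):                        # n groups ...
        sem_id = 0
        for i in range(m):                    # ... of m hash functions each
            sem_id ^= h(i, j, a_prime)        # XOR the m hash values into one semID
        dht[sem_id].append((file_id, sv))     # insert <semID, fileID, SV>, keyed by semID

sem_index("f1", ["p2p", "dht", "semantic", "index", "file"])
sem_index("f2", ["p2p", "dht", "semantic", "lookup", "file"])
print(sum(len(v) > 1 for v in dht.values()), "semIDs shared by f1 and f2")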
Semantic Locating
Given a query’s SV
Step 1: Derive a small number of semIDs from the SV using LSH
Step 2: Locate the semantically close files by using these semIDs as the DHT keys
Goal: answer a query by consulting only a small number of peer nodes (see the sketch below)
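A matching sketch of semantic locating, reusing the illustrative dht, sha1_int and h helpers from the indexing sketch above (again an assumption, not the paper's exact procedure): the query's SV is hashed to the same semIDs, and only the buckets for those keys are consulted.

def sem_locate(query_sv: list, n: int = 10, m: int = 2):
    """Return candidate files whose semIDs collide with the query's semIDs."""
    a_prime = {sha1_int(kw) for kw in query_sv}
    candidates = {}
    for j in range(n):                           # derive the same n semIDs as sem_index
        sem_id = 0
        for i in range(m):
            sem_id ^= h(i, j, a_prime)
        for file_id, sv in dht.get(sem_id, []):  # consult only the buckets for these keys
            candidates[file_id] = sv
    return candidates

print(sorted(sem_locate(["p2p", "dht", "semantic", "index", "file"])))
# Expected to contain f1 (and very likely f2) without scanning the whole DHT.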
[Figure: Demonstration of Semantic Indexing and Locating — files A, B and C are indexed onto peer nodes via their semIDs]

Semantic Locating
Query: locate files similar to D
A, B, C and D are semantically close files
[Figure: User1 and User2 issue the query and retrieve the indexed files from a small number of peer nodes]
Evaluation
Load distribution of semantic indexing
Semantic indices per peer node
Performance of semantic locating
Percentage of semantically close files that can be located (recall)
Semantic IndexingSemantic Indexing
[Figure: Load distribution when the system indexes 10,000 files — x-axis: number of peer nodes; y-axis: number of file indexes per node]
Semantic Indexing
[Figure: Load distribution in a 1000-node system — x-axis: number of indexed files (x1000); y-axis: number of file indexes per node]
Perf. of Semantic Locating
Recall [2] for n groups of m hash functions [1][3]:

m \ n    5      10     15     20
5        84%    92%    94%    96%
2        94%    99%    100%   100%

[1] Apply n groups of m hash functions
[2] Percentage of files located (128-byte fingerprint limit as a SV)
[3] m and n determine the performance of semantic locating
Related Work
P2P file systems like CFS and PAST
Exact-match lookups in DHTs
Traditional semantic file systems like SFS and HAC
IR algorithms such as VSM and LSI
LSH and its related applications (e.g., the nearest neighbor problem, cached data location in databases)
Conclusions
A first step toward supporting semantics-based access in P2P file systems
LSH-based semantic indexing and locating approach
Imposes small storage overhead (several MBs per node)
Efficient: answers a query by consulting a small number of peers (e.g., 20)
Approximate results, but acceptable
Future Work
Query consistency and refinement
Evaluation using IR workloads (e.g., TREC data sets)