abstract - uc homepageshomepages.uc.edu/~carrahle/images/nsfpresentation15.pdfhashing orrphash,...
TRANSCRIPT
RPHash Algorithm
Phase 2
Data: X = {x1, ..., xn}, xk ∈ Rm - data vectorsL - vector hash idsC - top hash idsM = {C , [0, ...0]}forall the xk ∈ X do
forall the ci ∈ C do
if L[k] ∩M[i ][0] 6= 0 then
∆ = M[k]− xk M[k] = M[k] + ∆/count;L[k].add(M[i][0])
end
end
end
Result: Kmeans(M[1],K )
Phase 1
Data: K - number of clusters;X = {x1, ..., xn}, xk ∈ Rm - data vectors;C - a lossy counter;H - LSH Function;P = {p1, ...pn} - set of projection matrices;L = {{∅}...};C = {∅};forall the xk ∈ X̃ do
forall the pi ∈ P̃ do
x̃k ←√
m
dpᵀixk ;
t = H(x̃k);L[k][i] = t;C.add(t);
end
end
Result: C.top(K)
ScalableBigDataClusteringbyRandomProjectionHashingLee Carraher, Anindya Moitra, and Philip A Wilsey (PI)
Dept of Electrical Engineering & Computing Systems, University of Cincinnati
AbstractRPHash is a streaming algorithm for data clustering based on frequent itemset counting of locality sensitive hash (LSH)collisions generated by multi-probe, randomly projected data vectors. The proposed algorithm called Random ProjectionHashing or RPHash, trades sub-optimal computationally performance for improved memory e�ciency over otherclustering algorithms. Memory e�ciency, a premium in streaming and distributed algorithms allows RPHash to be run ona variety of streaming data environments. Two Additional features, temporal weighting and privacy are also providedintrinsically by RPHash. By performing multiple projections utilizing recent advances in the fast johnson lindenstrausstransform (FJLT) we are able to attain a temporally weighted, streaming clustering algorithm with log-linear complexitygrowth and intrinsic data security.
Current Methods
Density Based Random Projection Subset Clustering Graph ClusteringSpectral Clustering
DBScan (1996) Fuzzy KmeansCLARANS (2002) Proclus (1999) Canopy Clustering(2000) Latent Dirichlet (2002)CLIQUE (2005) RPHash (this work) Streaming Kmeans (2011)
Applications
Bio-Informatics
Medical imaging
Computer Vision
Recommender Systems
Social Network Analysis
Market research
Key Components
Multi-Probe Random Projection
Fast Johnson-Lindenstrauss Transform
Locality Sensitive Hashing
Universally Generative Lattices
Lossy Frequency Counting
JL-Lemma
(1− ε)‖u − u′‖2 ≤ ‖f (u)− f (u′)‖2 ≤ (1 + ε)‖u − u
′‖2
Key Features
Scalability through Naive Parallelism
Predictable Run Time
Low Inter-Process Communication
Data Security
Cloud Based Resource Allocation
Locality Sensitive Hash Function
let H = {h : S → U} is (r1, r2, p1, p2)−sensitive if for any
u, v ∈ S1 if d(u, v) ≤ r1 then PrH[h(u) = h(v)] ≥ p12 if d(u, v) > r2 then PrH[h(u) = h(v)] ≤ p2
Security Via Random Projection
Increased Data Security and HIPAA Compliance Concerns make Commercial
Cloud Computing an Expensive Resource
Random Projection, In RPHash can Provide Signi�cant Resistance to
De-Anonymization
The Probability of Re-association under Random Projection is exponentially
dependent upon the di�erence in the projection dimension and original dimension.
20 30 40 50 60 70 80
Difference in number of Dimensions (d)
−0.2
0.0
0.2
0.4
0.6
0.8
1.0
P(N
N((x·R)·R̂
−1)=
x)
Probability of Re-associating aProjected Vector
power law fit
avg w/ err (10 runs)
Early Results
0.5 1 1.5 2 2.5
Variance
0.5
0.6
0.7
0.8
0.9
1
1.1
PR
Accura
cy
Projected KMeans
Projected Leech
Multi Projected Leech
Precision-Recall vs Inter-Cluster Variance
0 0.5 1 1.5 2 2.5
Variance
0
1
2
3
4
Tim
e (
s)
Projected KMeans
Projected Leech
Multi Projected Leech
Time Vs Inter-Cluster Variance
Contact
Open source available at: github.com/wilseypa/rphashPhilip A. Wilsey (PI): [email protected],
Lee carraher: [email protected],Anindya Moitra: [email protected]