scalablebigdataclusteringbyrandomprojectionhashing · rph k-m s-w dst r-s base scalability results...
Post on 23-Sep-2019
1 Views
Preview:
TRANSCRIPT
Noise Robustness
Noise injection tests
20,000 points
10 random multivariate Gaussian
distributions
dimension = 759
Good performance in presence of noise
as high as 60%
s
Scalable Big Data Clustering by Random Projection HashingLee Carraher, Anindya Moitra, Sayantan Dey and Philip A Wilsey (PI)
Dept of Electrical Engineering & Computing Systems, University of Cincinnati
Security Via Random Projection
Increased Data Security and HIPAA Compliance Concerns make Commercial Cloud Computing an Expensive
Resource
Random Projection, In RPHash can Provide Signi�cant Resistance to De-Anonymization
The Probability of Re-association under Random Projection is exponentially dependent upon the di�erence in
the projection dimension and original dimension.
Contact
Philip A. Wilsey (PI): wilseypa@gmail.com,Lee Carraher: leecarraher@gmail.com,Anindya Moitra: moitraaa@mail.uc.eduSayantan Dey: deysn@mail.uc.edu
Probability of Vector
De-Anonymization
s(v , v ′) = ||v , v ′||2∀{v , v ′} ∈ V ,∃v̂ ∈ V : s(v , v ′) >s(v̂ , v)
where v̂ 6= v
Pr(NN((v · R) · R̂−1,V ) = v) . 1
||V ||
RPHash Algorithm Diagram
Figure:Streaming RPHash Data Diagram.
Streaming RPHash Algorithm
Online Steps
x̃ :=√
m
dpᵀx ; // Random Projection
t := H(x̃) ; // LSH Hashing
C .add(t); // Update Item Count
if t ∈ M.keys then
M[t].wadd(x) ; // Update Centroid
else
M[t] =new centroid(x) ; // Add New Centroid
end
M.remove(T.pop()) ; // Remove Least Support
O�ine Steps
O�ineWeightedClusterer(M.items)
Applications
Bio-Informatics
Medical imaging
Computer Vision
Recommender Systems
Social Network Analysis
Market research
AbstractRPHash provides a solution to the approximate k-means clustering problem for very large distributed datasets. Distributeddata models have gained popularity in recent years following the e�orts of commercial, academic and governmentorganizations, to make data more widely accessible. Due to the sheer volume of available data, in-memory single-corecompu- tation quickly becomes infeasible, requiring distributed multi- processing. Our solution achieves comparableclustering perfor- mance to other popular clustering algorithms, with improved overall complexity growth while beingamenable to distributed processing frameworks such as Map-Reduce. Our solution also maintains certain guaranteesregarding data privacy de-anonymization.
Key Features
Scalability through Naive Parallelism
Predictable Run Time
Low Inter-Process Communication
Data Security
Cloud Based Resource Allocation
Key Components
Multi-Probe Random Projection
Fast Johnson-Lindenstrauss Transform
Locality Sensitive Hashing
Universally Generative Lattices
Lossy Frequency Counting
Clustering Performance Results
Clustering Algorithms
Reservoir Sampling (R-S)
DStream (DSt)
Damped Sliding Window (S-W)
Streaming K-Means (KM)
Streaming RPHash (RPH)
ExternalPerformance Metrics
Adjusted Rand Index [-1,1]
Cluster Purity [0,1]
Variation of Information [0,∞]
0.050.250.550.85
ARI
better, range: [-1, 1]
worse
R-S DSt S-W K-M RPH
0.4750.6250.7750.925
Puri
ty
better, range: [0, 1]
worse
0 4000 8000 12000 16000 20000Progression of Input Stream Vectors
112500337500562500787500
WC
SSE
worse, range: (0, )
better
RPH K-M S-W DSt R-S Base
Scalability Results
Clustering Algorithms
Reservoir Sampling (R-S)
DStream (DSt)
Damped Sliding Window (S-W)
Streaming K-Means (KM)
Streaming RPHash (RPH)
Runtime Performance Metrics
Average Runtime Duration (seconds)
Estimated Memory Usage (MB)
Multi Core Speedup
RPHash exhibits near linear parallel speedup.
Few sequential bottlenecks .
Figure: Parallel Speedup as a Function of Number of Processors.
0 1 2 3 4 5 6
# Processors
0
1
2
3
4
5
Sp
eed
up
Rati
o
y=0.855x, R2=.996
RPHash Distributed Spark Implementation
Spark implementation of RPHash is based on 2-Pass RPHash.
Streaming RPHash Spark version in development.
Human Activity Recognition DatasetMetric 2Pass RPHash Baseline
Silhouette Width 0.1191 0.06171
WCSSE 191922.4 200951.7
Dunn Index 0.085822 0.087755UJIIndoor Localization DatasetMetric 2Pass RPHash Baseline
Silhouette Width -0.01337 0.06522
WCSSE 10534218 10234389
Dunn Index 0.000 0.000695
RPHash Data StructuresData
s k - number of clusters
x ∈ Rm - data vector from stream
H(·) - LSH Function radius=r , dim=d
P = {p1, ...pn} - m × d projection matrix
M- lsh_key → centroid map
C - cm-sketch data structure
T - CM-Sketch based εk bounded priority queue
Open Source Software Repositories
Source available at:
github.com/wilseypa/rphash-java
Sub-Projects:
github.com/wilseypa/rphash-golang
github.com/wilseypa/rphash-reference-implementation
github.com/wilseypa/rphash-spark
175287
122
Run
time
(sec
)
R-S DSt S-W K-M RPH
100
281
506
759
949
1139
1424
1709
2563
3204
3844
4805
5767
Number of Dimensions
37112187262
Mem
ory
(MB
)
2267
112157
Scal
ed
WC
SSE
worse, range: (0, )
better
Base R-S DSt S-W K-M RPH
0.01250.16250.33750.5125
Silh
ouet
te
better, range: [-1, 1]
worse
0.1250.3750.6250.875
ARI
better, range: [-1, 1]
worse
0 2 5 7 10 15 20 25 30 40 50 60 75Percent of Noise
0.31250.53750.76250.9875
Puri
ty
better, range: [0, 1]
worse
top related