abstract - uc homepageshomepages.uc.edu/~carrahle/images/nsfpresentation15.pdfhashing orrphash,...

1
= { , ..., } R = { , [ , ... ]} [] [ ][ ] 6= Δ= [] - []= [ ]+Δ/ ( [], ) = { , ..., } R H P = { , ... } = {{}...} = {} e e ˜ p | = H(˜) Scalable Big Data Clustering by Random Projection Hashing Lee Carraher, Anindya Moitra, and Philip A Wilsey (PI) Dept of Electrical Engineering & Computing Systems, University of Cincinnati ( - )k - 0 k ≤k () - ( 0 )k ( + )k - 0 k H = { : } ( , , , )- , ( , ) H [()= ( )] ( , ) > H [()= ( )] 20 30 40 50 60 70 80 Difference in number of Dimensions (d) -0.2 0.0 0.2 0.4 0.6 0.8 1.0 P (NN ((x · R) · ˆ R -1 )= x) Probability of Re-associating a Projected Vector power law fit avg w/ err (10 runs) 0.5 1 1.5 2 2.5 Variance 0.5 0.6 0.7 0.8 0.9 1 1.1 PR Accuracy Projected KMeans Projected Leech Multi Projected Leech Precision-Recall vs Inter-Cluster Variance 0 0.5 1 1.5 2 2.5 Variance 0 1 2 3 4 Time (s) Projected KMeans Projected Leech Multi Projected Leech Time Vs Inter-Cluster Variance

Upload: others

Post on 10-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Abstract - UC Homepageshomepages.uc.edu/~carrahle/images/NSFPresentation15.pdfHashing orRPHash, trades sub-optimal computationally performance for improved memory e ciency over other

RPHash Algorithm

Phase 2

Data: X = {x1, ..., xn}, xk ∈ Rm - data vectorsL - vector hash idsC - top hash idsM = {C , [0, ...0]}forall the xk ∈ X do

forall the ci ∈ C do

if L[k] ∩M[i ][0] 6= 0 then

∆ = M[k]− xk M[k] = M[k] + ∆/count;L[k].add(M[i][0])

end

end

end

Result: Kmeans(M[1],K )

Phase 1

Data: K - number of clusters;X = {x1, ..., xn}, xk ∈ Rm - data vectors;C - a lossy counter;H - LSH Function;P = {p1, ...pn} - set of projection matrices;L = {{∅}...};C = {∅};forall the xk ∈ X̃ do

forall the pi ∈ P̃ do

x̃k ←√

m

dpᵀixk ;

t = H(x̃k);L[k][i] = t;C.add(t);

end

end

Result: C.top(K)

ScalableBigDataClusteringbyRandomProjectionHashingLee Carraher, Anindya Moitra, and Philip A Wilsey (PI)

Dept of Electrical Engineering & Computing Systems, University of Cincinnati

AbstractRPHash is a streaming algorithm for data clustering based on frequent itemset counting of locality sensitive hash (LSH)collisions generated by multi-probe, randomly projected data vectors. The proposed algorithm called Random ProjectionHashing or RPHash, trades sub-optimal computationally performance for improved memory e�ciency over otherclustering algorithms. Memory e�ciency, a premium in streaming and distributed algorithms allows RPHash to be run ona variety of streaming data environments. Two Additional features, temporal weighting and privacy are also providedintrinsically by RPHash. By performing multiple projections utilizing recent advances in the fast johnson lindenstrausstransform (FJLT) we are able to attain a temporally weighted, streaming clustering algorithm with log-linear complexitygrowth and intrinsic data security.

Current Methods

Density Based Random Projection Subset Clustering Graph ClusteringSpectral Clustering

DBScan (1996) Fuzzy KmeansCLARANS (2002) Proclus (1999) Canopy Clustering(2000) Latent Dirichlet (2002)CLIQUE (2005) RPHash (this work) Streaming Kmeans (2011)

Applications

Bio-Informatics

Medical imaging

Computer Vision

Recommender Systems

Social Network Analysis

Market research

Key Components

Multi-Probe Random Projection

Fast Johnson-Lindenstrauss Transform

Locality Sensitive Hashing

Universally Generative Lattices

Lossy Frequency Counting

JL-Lemma

(1− ε)‖u − u′‖2 ≤ ‖f (u)− f (u′)‖2 ≤ (1 + ε)‖u − u

′‖2

Key Features

Scalability through Naive Parallelism

Predictable Run Time

Low Inter-Process Communication

Data Security

Cloud Based Resource Allocation

Locality Sensitive Hash Function

let H = {h : S → U} is (r1, r2, p1, p2)−sensitive if for any

u, v ∈ S1 if d(u, v) ≤ r1 then PrH[h(u) = h(v)] ≥ p12 if d(u, v) > r2 then PrH[h(u) = h(v)] ≤ p2

Security Via Random Projection

Increased Data Security and HIPAA Compliance Concerns make Commercial

Cloud Computing an Expensive Resource

Random Projection, In RPHash can Provide Signi�cant Resistance to

De-Anonymization

The Probability of Re-association under Random Projection is exponentially

dependent upon the di�erence in the projection dimension and original dimension.

20 30 40 50 60 70 80

Difference in number of Dimensions (d)

−0.2

0.0

0.2

0.4

0.6

0.8

1.0

P(N

N((x·R)·R̂

−1)=

x)

Probability of Re-associating aProjected Vector

power law fit

avg w/ err (10 runs)

Early Results

0.5 1 1.5 2 2.5

Variance

0.5

0.6

0.7

0.8

0.9

1

1.1

PR

Accura

cy

Projected KMeans

Projected Leech

Multi Projected Leech

Precision-Recall vs Inter-Cluster Variance

0 0.5 1 1.5 2 2.5

Variance

0

1

2

3

4

Tim

e (

s)

Projected KMeans

Projected Leech

Multi Projected Leech

Time Vs Inter-Cluster Variance

Contact

Open source available at: github.com/wilseypa/rphashPhilip A. Wilsey (PI): [email protected],

Lee carraher: [email protected],Anindya Moitra: [email protected]