scalablebigdataclusteringbyrandomprojectionhashing · rph k-m s-w dst r-s base scalability results...

Noise Robustness

Noise injection tests

20,000 points

10 random multivariate Gaussian

distributions

dimension = 759

Good performance in presence of noise

as high as 60%

Scalable Big Data Clustering by Random Projection HashingLee Carraher, Anindya Moitra, Sayantan Dey and Philip A Wilsey (PI)

Dept of Electrical Engineering & Computing Systems, University of Cincinnati

Security Via Random Projection

Increased Data Security and HIPAA Compliance Concerns make Commercial Cloud Computing an Expensive

Resource

Random Projection, In RPHash can Provide Signi�cant Resistance to De-Anonymization

The Probability of Re-association under Random Projection is exponentially dependent upon the di�erence in

the projection dimension and original dimension.

Contact

Philip A. Wilsey (PI): wilseypa@gmail.com,Lee Carraher: leecarraher@gmail.com,Anindya Moitra: moitraaa@mail.uc.eduSayantan Dey: deysn@mail.uc.edu

Probability of Vector

De-Anonymization

s(v , v ′) = ||v , v ′||2∀{v , v ′} ∈ V ,∃v̂ ∈ V : s(v , v ′) >s(v̂ , v)

where v̂ 6= v

Pr(NN((v · R) · R̂−1,V ) = v) . 1

||V ||

RPHash Algorithm Diagram

Figure:Streaming RPHash Data Diagram.

Streaming RPHash Algorithm

Online Steps

x̃ :=√

dpᵀx ; // Random Projection

t := H(x̃) ; // LSH Hashing

C .add(t); // Update Item Count

if t ∈ M.keys then

M[t].wadd(x) ; // Update Centroid

M[t] =new centroid(x) ; // Add New Centroid

M.remove(T.pop()) ; // Remove Least Support

O�ine Steps

O�ineWeightedClusterer(M.items)

Applications

Bio-Informatics

Medical imaging

Computer Vision

Recommender Systems

Social Network Analysis

Market research

AbstractRPHash provides a solution to the approximate k-means clustering problem for very large distributed datasets. Distributeddata models have gained popularity in recent years following the e�orts of commercial, academic and governmentorganizations, to make data more widely accessible. Due to the sheer volume of available data, in-memory single-corecompu- tation quickly becomes infeasible, requiring distributed multi- processing. Our solution achieves comparableclustering performance to other popular clustering algorithms, with improved overall complexity growth while beingamenable to distributed processing frameworks such as Map-Reduce. Our solution also maintains certain guaranteesregarding data privacy de-anonymization.

Key Features

Scalability through Naive Parallelism

Predictable Run Time

Low Inter-Process Communication

Data Security

Cloud Based Resource Allocation

Key Components

Multi-Probe Random Projection

Fast Johnson-Lindenstrauss Transform

Locality Sensitive Hashing

Universally Generative Lattices

Lossy Frequency Counting

Clustering Performance Results

Clustering Algorithms

Reservoir Sampling (R-S)

DStream (DSt)

Damped Sliding Window (S-W)

Streaming K-Means (KM)

Streaming RPHash (RPH)

ExternalPerformance Metrics

Adjusted Rand Index [-1,1]

Cluster Purity [0,1]

Variation of Information [0,∞]

0.050.250.550.85

better, range: [-1, 1]

R-S DSt S-W K-M RPH

0.4750.6250.7750.925

better, range: [0, 1]

0 4000 8000 12000 16000 20000Progression of Input Stream Vectors

112500337500562500787500

worse, range: (0, )

better

RPH K-M S-W DSt R-S Base

Scalability Results

Clustering Algorithms

Reservoir Sampling (R-S)

DStream (DSt)

Damped Sliding Window (S-W)

Streaming K-Means (KM)

Streaming RPHash (RPH)

Runtime Performance Metrics

Average Runtime Duration (seconds)

Estimated Memory Usage (MB)

Multi Core Speedup

RPHash exhibits near linear parallel speedup.

Few sequential bottlenecks .

Figure: Parallel Speedup as a Function of Number of Processors.

0 1 2 3 4 5 6

# Processors

y=0.855x, R2=.996

RPHash Distributed Spark Implementation

Spark implementation of RPHash is based on 2-Pass RPHash.

Streaming RPHash Spark version in development.

Human Activity Recognition DatasetMetric 2Pass RPHash Baseline

Silhouette Width 0.1191 0.06171

WCSSE 191922.4 200951.7

Dunn Index 0.085822 0.087755UJIIndoor Localization DatasetMetric 2Pass RPHash Baseline

Silhouette Width -0.01337 0.06522

WCSSE 10534218 10234389

Dunn Index 0.000 0.000695

RPHash Data StructuresData

s k - number of clusters

x ∈ Rm - data vector from stream

H(·) - LSH Function radius=r , dim=d

P = {p1, ...pn} - m × d projection matrix

M- lsh_key → centroid map

C - cm-sketch data structure

T - CM-Sketch based εk bounded priority queue

Open Source Software Repositories

Source available at:

github.com/wilseypa/rphash-java

Sub-Projects:

github.com/wilseypa/rphash-golang

github.com/wilseypa/rphash-reference-implementation

github.com/wilseypa/rphash-spark

175287

R-S DSt S-W K-M RPH

Number of Dimensions

37112187262

112157

worse, range: (0, )

better

Base R-S DSt S-W K-M RPH

0.01250.16250.33750.5125

0.1250.3750.6250.875

0 2 5 7 10 15 20 25 30 40 50 60 75Percent of Noise

0.31250.53750.76250.9875

better, range: [0, 1]

scalablebigdataclusteringbyrandomprojectionhashing · rph k-m s-w dst r-s base scalability results...

Documents

256 ds-5 arm dstream user...

54195 single handle bathroom centerset faucets...

dst marblenatural

b1 k dst e2 b1 s-c px px px - la...

dst carriers

dst storyboard1

dst collection

dst cover...

dst-24b/pci dst-24b/pci+ dst-24b/pci(2.0) dst-24b ... -...

dst catalog

dst concept

dst brochure

diagnostic scan tool (dst) instruction manual iv...

ebp in eaxftp.mi.fu-berlin.de/pub/schiller/ti ii ch04 isa...

dst directory

dst s eries -hid · phone (661) 233-2000 fax (661) 233-2001...

rackmount distribution panel models: dst-20a...

dyslexia screen webinar - pearson assessment: … · dst-j...

inland healthcare ii dst 1031 · 2018-10-05 · healthcare...

supported by rg s&t commission, mumbai & dst, ministry of...