
“Study on Parallel SVM Based on MapReduce”

Kuei-Ti Lu, 03/12/2015

Support Vector Machine (SVM)

• Used for
– Classification
– Regression

• Applied in
– Network intrusion detection
– Image processing
– Text classification
– …

libSVM

• Library for support vector machines
• Integrates different types of SVMs

Types of SVMs Supported by libSVM

• For support vector classification
– C-SVC
– Nu-SVC

• For support vector regression
– Epsilon-SVR
– Nu-SVR

• For distribution estimation
– One-class SVM
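As a point of reference, a minimal sketch of reaching these variants from Python, assuming scikit-learn (whose svm classes wrap libSVM; the class names below are scikit-learn's, not libSVM's own API):

```python
# scikit-learn's svm module wraps libSVM; one class per variant above.
from sklearn.svm import SVC          # C-SVC (classification)
from sklearn.svm import NuSVC        # Nu-SVC (classification)
from sklearn.svm import SVR          # Epsilon-SVR (regression)
from sklearn.svm import NuSVR        # Nu-SVR (regression)
from sklearn.svm import OneClassSVM  # one-class SVM (distribution estimation)
```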

C-SVC

• Goal: Find the separating hyperplane that maximizes the margin

• Support vectors: data points closest to the separating hyperplane

C-SVC

• Primal form

$$\min_{w,\,b,\,\xi}\;\frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_{i}$$

$$\text{s.t.}\quad y_{i}\bigl(w^{T}\phi(x_{i}) + b\bigr)\ge 1-\xi_{i},\qquad \xi_{i}\ge 0,\qquad i=1,\dots,n$$

• Dual form (derived using Lagrange multipliers)

$$\min_{\alpha}\;\frac{1}{2}\sum_{i,j}\alpha_{i}\alpha_{j}y_{i}y_{j}\,k(x_{i},x_{j}) - \sum_{i}\alpha_{i}$$

$$\text{s.t.}\quad \sum_{i}y_{i}\alpha_{i}=0,\qquad 0\le\alpha_{i}\le C,\qquad i=1,\dots,l$$
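As an illustration of these definitions, a minimal C-SVC run, assuming scikit-learn's libSVM-backed SVC and a tiny toy data set (not from the paper). After training, support_vectors_ holds the points closest to the separating hyperplane and dual_coef_ the products y_i·α_i from the dual:

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn's C-SVC, backed by libSVM

# A tiny, linearly separable toy set (illustrative only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(C=1.0, kernel="linear").fit(X, y)
print(clf.support_vectors_)  # data points closest to the separating hyperplane
print(clf.dual_coef_)        # y_i * alpha_i for each support vector
```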

Speedup

• Computation and storage requirements increase rapidly as the number of training vectors (also called training samples or training points) increases

• Need efficient algorithms and implementations to apply SVMs to large-scale data mining

• => Parallel SVM

Parallel SVM Methods

• Message Passing Interface (MPI)
– Efficient for computation-intensive problems
• Ex. Simulation

• MapReduce
– Can be used for data-intensive problems

• …

Other Speedup Techniques

• Chunking: iteratively optimize subsets of the training data until the global optimum is reached
– Ex. Sequential Minimal Optimization (SMO)
• Uses a chunk size of 2 vectors (sketched after this list)

• Eliminate non-support vectors early
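A minimal sketch of SMO's chunk-size-2 step, assuming a linear kernel, labels in {−1, +1}, and the simplified heuristic of picking the second multiplier at random (libSVM's production SMO uses more careful working-set selection):

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5):
    """Simplified SMO for a linear-kernel C-SVC; y must be in {-1, +1}.

    Optimizes two Lagrange multipliers at a time (a chunk of 2 vectors)
    while holding the others fixed, until no multiplier changes.
    """
    n = X.shape[0]
    K = X @ X.T                       # linear kernel matrix
    alpha, b, passes = np.zeros(n), 0.0, 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]    # prediction error on x_i
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = np.random.choice([k for k in range(n) if k != i])
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                if y[i] != y[j]:     # box constraints for the pair
                    L, H = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]  # second derivative
                if eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i] \
                     - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j] \
                     - y[j] * (alpha[j] - aj_old) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2.0
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X               # primal weights (linear kernel only)
    return w, b, alpha
```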

This Paper’s Approach

1. Partition & distribute data to nodes
2. Map class: Train each subSVM to find the support vectors for its subset of the data
3. Reduce class: Combine the support vectors of each 2 subSVMs
4. If more than 1 subSVM remains, go to 2 (see the sketch below)
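A single-process sketch of steps 1-4, assuming scikit-learn's libSVM-backed SVC as the subSVM trainer; in the paper the map and reduce calls run on Twister across real nodes, while here they run serially to show the data flow:

```python
import numpy as np
from sklearn.svm import SVC  # libSVM-backed subSVM trainer

def sub_svm_map(X, y, C=1.0):
    """Map step: train a subSVM and keep only its support vectors."""
    clf = SVC(C=C, kernel="rbf").fit(X, y)  # assumes both classes present
    return X[clf.support_], y[clf.support_]

def merge_reduce(part_a, part_b):
    """Reduce step: combine the support vectors of 2 subSVMs."""
    (Xa, ya), (Xb, yb) = part_a, part_b
    return np.vstack([Xa, Xb]), np.concatenate([ya, yb])

def cascade_svm(X, y, n_nodes=8, C=1.0):
    """Steps 1-4: partition, map, reduce, iterate until 1 SVM remains."""
    chunks = np.array_split(np.random.permutation(len(y)), n_nodes)
    parts = [(X[i], y[i]) for i in chunks]                     # step 1
    while len(parts) > 1:                                      # step 4
        parts = [sub_svm_map(Xp, yp, C) for Xp, yp in parts]   # step 2
        parts = [merge_reduce(parts[i], parts[i + 1])          # step 3
                 for i in range(0, len(parts) - 1, 2)]
        # n_nodes is assumed to be a power of 2, so pairs come out even
    return SVC(C=C, kernel="rbf").fit(*parts[0])  # final global SVM
```

With n_nodes = 1 the loop is skipped and this reduces to training a single classical SVM on all the data.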

Twister

• Supports iterative MapReduce

• More efficient than Hadoop or Dryad/DryadLINQ for iterative MapReduce

Computation Complexity

• With N training samples split evenly across m nodes, each first-layer subSVM trains on N/m samples; training a subproblem of n samples is modeled as roughly O(n²)
• Support vectors are combined over about log₂ m layers, so subproblem sizes at layer i are on the order of N/2^i
• Data transfer adds an O(n_trans) term

Evaluations

• Number of nodes
• Training time
• Accuracy = # correctly predicted data / # total testing data × 100% (see the helper below)
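The accuracy metric from this list, as a small helper (illustrative names only, not from the paper):

```python
def accuracy(y_true, y_pred):
    """# correctly predicted data / # total testing data * 100%."""
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

# Example: 3 of 4 test labels predicted correctly -> 75.0
print(accuracy([1, -1, 1, 1], [1, -1, -1, 1]))
```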

Adult Data Analysis

• Binary classification
• Correlation between attribute variable X and class variable Y used to select attributes

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_{X}\sigma_{Y}} = \frac{E\bigl[(X-\mu_{X})(Y-\mu_{Y})\bigr]}{\sigma_{X}\sigma_{Y}}$$
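A sketch of correlation-based attribute selection using the formula above; pearson_corr and select_attributes are illustrative names, not from the paper:

```python
import numpy as np

def pearson_corr(x, y):
    """rho_{X,Y} = E[(X - mu_X)(Y - mu_Y)] / (sigma_X * sigma_Y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

def select_attributes(X, y, k):
    """Keep the k attribute columns most correlated (in magnitude) with Y."""
    scores = [abs(pearson_corr(X[:, j], y)) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]
```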

Adult Data Analysis

• Computation cost concentrates on training
• Data transfer time cost is minor
• Last-layer computation time depends on α and β instead of the number of nodes (only 1 node at the last layer)
• Feature selection reduces computation greatly but reduces accuracy only slightly

Forest Cover Type Classification

• Multiclass classification
– Use k(k − 1)/2 binary SVMs as a k-class SVM
– 1 binary SVM for each pair of classes
– Use maximum voting to determine the class (see the sketch below)
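A sketch of the pairwise scheme with maximum voting, assuming scikit-learn's SVC and non-negative integer class labels (scikit-learn's SVC in fact applies this one-vs-one strategy internally; it is spelled out here to mirror the slide):

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def train_one_vs_one(X, y, C=1.0):
    """Train one binary SVM per pair of classes: k(k - 1)/2 models."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = SVC(C=C, kernel="rbf").fit(X[mask], y[mask])
    return models

def predict_max_voting(models, X):
    """Each pairwise SVM votes for one class; the most-voted class wins."""
    votes = np.stack([m.predict(X) for m in models.values()])
    return np.array([np.bincount(col).argmax() for col in votes.T])
```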

Forest Cover Type Classification

• Correlation between attribute variable X and class variable Y used to select attributes

• Attribute variables are normalized to [0, 1]

$$x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
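A column-wise version of the normalization above (a sketch; constant columns are mapped to 0 to avoid division by zero):

```python
import numpy as np

def min_max_normalize(X):
    """Scale each attribute column to [0, 1]: (x - x_min) / (x_max - x_min)."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constant columns
    return (X - x_min) / span
```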

Forest Cover Type Classification

• Last-layer computation time depends on α and β instead of the number of nodes (only 1 node at the last layer)
• Feature selection reduces computation greatly but reduces accuracy only slightly

Heart Disease Classification

• Binary classification
• Data replicated different numbers of times to compare results for different sample sizes (see the sketch below)
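A sketch of the replication step, assuming NumPy arrays (the paper does not say how replication was implemented):

```python
import numpy as np

def replicate(X, y, times):
    """Stack `times` copies of the data to emulate a larger sample size."""
    return np.tile(X, (times, 1)), np.tile(y, times)
```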

Heart Disease Classification

• When the sample size is too big, the data cannot be processed on 1 node because of the memory constraint
• Training time decreases little once the number of nodes exceeds 8

Conclusion

• Classical SVM is impractical for large-scale data
• Parallel SVM is needed
• This paper proposes a model based on iterative MapReduce
• The model is shown to be efficient for data-intensive problems

References

[1] Z. Sun and G. Fox, “Study on Parallel SVM Based on MapReduce,” in PDPTA, Las Vegas, NV, 2012.
[2] C. Lin et al., “Anomaly Detection Using LibSVM Training Tools,” in ISA, Busan, Korea, 2008, pp. 166-171.

Q & A
