

E6893 Big Data Analytics Lecture 3:

Big Data Analytics Algorithms — I

Ching-Yung Lin, Ph.D.

Adjunct Professor, Dept. of Electrical Engineering and Computer Science

September 20, 2019


Analytics 1 — Recommendation


Key Components of Mahout in Hadoop


Key Components of Spark MLlib


Spark ML Basic Statistics

• Correlation: calculating the correlation between two series of data is a common operation in statistics.

■ Pearson's correlation

■ Spearman's correlation


Example of Popular Similarity Measurements

• Pearson correlation similarity
• Euclidean distance similarity
• Cosine measure similarity
• Spearman correlation similarity
• Tanimoto coefficient similarity (Jaccard coefficient)
• Log-likelihood similarity


Pearson Correlation Similarity

Data: an example user–item preference table in which some preference values are missing.


On Pearson Similarity

Three problems with Pearson similarity:

1. It does not take into account the number of items on which two users' preferences overlap (e.g., an overlap of only 2 items can yield a correlation of 1; more overlapping items is not necessarily rewarded).

2. If two users overlap on only one item, no correlation can be computed.

3. The correlation is undefined if either series of preference values is identical (zero variance).

Adding Weighting.WEIGHTED as the second parameter of the constructor causes the resulting correlation to be pushed toward 1.0 or -1.0, depending on how many points are used.
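A minimal sketch (plain Scala, not the Mahout API) of Pearson similarity computed only over co-rated items, which makes problems 2 and 3 concrete:

```scala
// Pearson correlation over the items two users have both rated.
def pearson(u: Map[String, Double], v: Map[String, Double]): Option[Double] = {
  val items = (u.keySet intersect v.keySet).toSeq
  if (items.size < 2) return None                   // problem 2: only one overlapping item
  val xs = items.map(u); val ys = items.map(v)
  val mx = xs.sum / xs.size; val my = ys.sum / ys.size
  val cov = (xs zip ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val sx  = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
  val sy  = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
  if (sx == 0.0 || sy == 0.0) None                  // problem 3: identical preference values
  else Some(cov / (sx * sy))
}
```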


Spearman Correlation Similarity Example for ties

Spearman correlation is the Pearson correlation computed on the relative ranks of the preference values (tied values receive averaged ranks).


Basic Spark Data format

Data: 1.0, 0.0, 3.0

• Dense vector: straightforward, store every value in order.

• Sparse vector: the vector size, the indices of the non-zero entries, and the non-zero values.

• Sparse vector: the vector size and a sequence of (index, value) pairs for the non-zero entries.
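A minimal sketch of the three representations with the standard Spark vector API (org.apache.spark.ml.linalg):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Dense vector (1.0, 0.0, 3.0): store every value.
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Sparse vector: size, indices of the non-zero entries, and their values.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Sparse vector: size and a sequence of (index, value) pairs.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
```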


Correlation Example in Spark

Data (four rows, four features each):

1.0, 0.0, 0.0, -2.0
4.0, 5.0, 0.0, 3.0
6.0, 7.0, 0.0, 8.0
9.0, 0.0, 0.0, 1.0
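The slide's code itself is not captured here; a minimal sketch of running Spark ML's Correlation on the four rows above (assuming a SparkSession named `spark`) would be:

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

import spark.implicits._   // assumes a SparkSession named `spark` is in scope

// The four data rows above, stored as vectors (sparse where mostly zero).
val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)
val df = data.map(Tuple1.apply).toDF("features")

// Pearson (default) and Spearman correlation matrices over the four features.
val Row(pearsonMatrix: Matrix)  = Correlation.corr(df, "features").head
val Row(spearmanMatrix: Matrix) = Correlation.corr(df, "features", "spearman").head
println(s"Pearson correlation matrix:\n$pearsonMatrix")
```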


Euclidean Distance Similarity

Similarity = 1 / (1 + d), where d is the Euclidean distance between the two users' preference vectors.


Cosine Similarity

Cosine similarity and Pearson similarity give the same result when the data are mean-centered (mean = 0).
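In symbols (standard definitions):

$$\cos(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}, \qquad r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$

Pearson correlation is cosine similarity applied to the mean-centered vectors, so the two coincide when $\bar{x} = \bar{y} = 0$.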


Caching User Similarity

Spearman correlation similarity is time-consuming to compute. Use caching, i.e., remember each user–user similarity that was previously computed instead of recomputing it.


Tanimoto (Jaccard) Coefficient Similarity

Discard preference values

Tanimoto similarity is the same as Jaccard similarity, but Tanimoto distance is not the same as Jaccard distance.
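A minimal sketch (plain Scala, not a library API) of the coefficient computed on the sets of items two users have expressed a preference for, with the rating values discarded:

```scala
// Tanimoto / Jaccard similarity: |intersection| / |union| of the two item sets.
def tanimoto(a: Set[String], b: Set[String]): Double = {
  val intersection = (a intersect b).size.toDouble
  val union        = (a union b).size.toDouble
  if (union == 0.0) 0.0 else intersection / union
}

// Example: the users share 2 of the 4 distinct items they rated => 0.5
val sim = tanimoto(Set("item1", "item2", "item3"), Set("item2", "item3", "item4"))
```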


Log-Likelihood Similarity

Assesses how unlikely it is that the overlap between the two users is just due to chance.
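The statistic behind it is not shown in the captured slide text; the standard log-likelihood ratio (G-test) over the 2x2 table of "rated by user A / not rated by A" versus "rated by user B / not rated by B" has the form

$$G^2 = 2 \sum_i O_i \ln \frac{O_i}{E_i}$$

where $O_i$ and $E_i$ are the observed and the independence-expected cell counts; a larger $G^2$ means the observed overlap is harder to explain by chance.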


Performance measurements

Using GroupLens data (http://grouplens.org): the 10-million-rating MovieLens dataset. Evaluation scores by similarity measure:

• Spearman: 0.8
• Tanimoto: 0.82
• Log-likelihood: 0.73
• Euclidean: 0.75
• Pearson (weighted): 0.77
• Pearson: 0.89


Spark ML Basic Statistics

• Hypothesis testing: hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant. Spark ML currently supports Pearson's chi-squared (χ²) tests for independence.

• ChiSquareTest conducts Pearson's independence test for every feature against the label.


Chi-Square Tests

http://math.hws.edu/javamath/ryan/ChiSquare.html
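For reference, the statistic compared against the χ² distribution is the standard Pearson form:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ are the observed counts in each cell of the contingency table and $E_i$ are the counts expected under the null hypothesis of independence (row total times column total divided by the grand total).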





We would reject the null hypothesis that there is no relationship between location and type of malaria. Our data tell us there is a relationship between type of malaria and location.


Chi-Square Tests in Spark
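The code on this slide is not captured; a minimal example with the standard Spark ML ChiSquareTest API (illustrative toy data, SparkSession `spark` assumed):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest

import spark.implicits._   // assumes a SparkSession named `spark` is in scope

// Toy data: a binary label and two numeric features.
val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (0.0, Vectors.dense(3.5, 40.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)
val df = data.toDF("label", "features")

// Pearson's independence test of every feature against the label.
val test = ChiSquareTest.test(df, "features", "label").head
println(s"p-values:           ${test.getAs[Vector](0)}")
println(s"degrees of freedom: ${test.getSeq[Int](1).mkString("[", ",", "]")}")
println(s"test statistics:    ${test.getAs[Vector](2)}")
```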


Spark ML Clustering


Example: clustering

Feature Space


Clustering


Clustering — on feature plane


Clustering example


Steps on clustering


Making initial cluster centers


K-means clustering


HelloWorld clustering scenario result


Testing different distance measures


Manhattan and Cosine distances


Tanimoto distance and weighted distance


Results comparison


Sample code for K-means clustering in Spark
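The exact code from the slide is not captured in this transcript; a minimal sketch with the standard Spark ML KMeans API (placeholder data path, SparkSession `spark` assumed):

```scala
import org.apache.spark.ml.clustering.KMeans

// Load feature vectors (placeholder path, LIBSVM format).
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Train k-means with k = 2 clusters and a fixed seed for reproducibility.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)

// Assign every point to its nearest cluster and inspect the learned centers.
val predictions = model.transform(dataset)
println("Cluster centers:")
model.clusterCenters.foreach(println)
```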


vectorization example

Feature indices: 0 = weight, 1 = color, 2 = size.
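A hypothetical sketch of turning one record into a feature vector using the index assignment above; the numeric encodings chosen for color and size are assumptions for illustration only:

```scala
import org.apache.spark.ml.linalg.Vectors

// One record, encoded with index 0 = weight, 1 = color, 2 = size.
val weight = 0.23   // e.g., kilograms
val color  = 1.0    // assumed encoding, e.g., red -> 1.0, green -> 0.0
val size   = 0.6    // assumed normalized size in [0, 1]

val features = Vectors.dense(weight, color, size)   // [0.23, 1.0, 0.6]
```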


Canopy clustering to estimate the number of clusters

Tell the algorithm what size clusters to look for, and it will find the number of clusters of approximately that size. The algorithm uses two distance thresholds; this prevents any point close to an already-existing canopy from becoming the center of a new canopy.
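A minimal sketch of the idea (plain Scala, not the Mahout or Spark implementation; the two thresholds are written t1 > t2 following the usual convention):

```scala
// Canopy clustering: each pass picks a remaining point as a canopy center,
// gathers the points within t1 as loose members, and removes from further
// consideration the points within t2, so they can never start a new canopy.
def canopies(points: Seq[Array[Double]], t1: Double, t2: Double): Seq[(Array[Double], Seq[Array[Double]])] = {
  require(t1 > t2)
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  var remaining = points
  val result = scala.collection.mutable.ListBuffer.empty[(Array[Double], Seq[Array[Double]])]
  while (remaining.nonEmpty) {
    val center  = remaining.head
    val members = remaining.filter(p => dist(center, p) < t1)   // loosely belong to this canopy
    result += ((center, members))
    remaining = remaining.filterNot(p => dist(center, p) < t2)  // too close to become a new center
  }
  result.toList
}
// result.size gives a rough estimate of the number of clusters (k) for k-means.
```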


Other clustering algorithms

Hierarchical clustering


Different clustering approaches


Spark ML Classification and Regression


Spark ML Classification and Regression


Classification — definition


Classification example: using SVM to recognize a Toyota Camry

ML — Support Vector Machine: points in a feature space, with positive SVs and negative SVs.

Non-ML — hand-written rules:
Rule 1. Symbol has something like a bull's head.
Rule 2. Big black portion in front of the car.
Rule 3. ... ????


Classification example: using SVM to recognize a Toyota Camry

ML — Support Vector Machine: a feature space with positive SVs and negative SVs; the new image is classified as a Camry with P(Camry) > 0.95.


When to use Big Data System for classification?


The advantage of using Big Data System for classification


How does a classification system work?


Key terminology for classification


Input and Output of a classification model


Four types of values for predictor variables


Sample data that illustrates all four value types


Supervised vs. Unsupervised Learning


Work flow in a typical classification project


Classification Example 1 — Color-Fill

Position looks promising as the predictor variable, especially position along the x-axis. Shape seems to be irrelevant. The target variable is the "color-fill" label.


Classification Example 2 — Color-Fill (another feature)


Fundamental classification algorithms

Examples of fundamental classification algorithms:

• Naive Bayesian
• Complementary Naive Bayesian
• Stochastic Gradient Descent (SGD)
• Random Forest
• Support Vector Machines


Choose algorithm


Stochastic Gradient Descent (SGD)
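For reference, the update at the heart of SGD uses one (randomly chosen) training example at a time instead of the gradient over the whole training set:

$$w \leftarrow w - \eta \,\nabla_w \,\ell(w; x_i, y_i)$$

where $\eta$ is the learning rate and $\ell$ is the loss on the single example $(x_i, y_i)$.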


Characteristic of SGD


Support Vector Machine (SVM)

Maximize the boundary distances (the margin); the model remembers the "support vectors" that define the boundary. Nonlinear kernels allow non-linear decision boundaries.


Example SVM code in Spark
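The slide's code is not reproduced in the transcript; a minimal sketch using the standard spark.mllib SVMWithSGD API (placeholder data path, SparkContext `sc` assumed):

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// Load LIBSVM-formatted labeled points (placeholder path) and split train / test.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)

// Train a linear SVM with stochastic gradient descent.
val numIterations = 100
val model = SVMWithSGD.train(training.cache(), numIterations)

// Score the held-out set and report the area under the ROC curve.
val scoreAndLabels = test.map(point => (model.predict(point.features), point.label))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"Area under ROC = ${metrics.areaUnderROC()}")
```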


Naive Bayes

Training set: a table of labeled examples (not reproduced in this transcript).

Classifier using Gaussian distribution assumptions for each feature.

Test set: a new, unlabeled sample ==> classified as female.
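The worked numbers from the slide are not captured here; the rule it applies is the standard Gaussian naive Bayes classifier:

$$\hat{c} = \arg\max_{c}\; P(c)\,\prod_i \mathcal{N}\!\left(x_i \mid \mu_{i,c}, \sigma^2_{i,c}\right)$$

Each feature $x_i$ is treated as conditionally independent given the class $c$, with per-class mean $\mu_{i,c}$ and variance $\sigma^2_{i,c}$ estimated from the training set; the test sample is assigned the class with the larger posterior, here "female".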


Random Forest

Random forest uses a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features.
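A minimal sketch of a random forest classifier with the standard Spark ML API (placeholder data path, SparkSession `spark` assumed):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Load labeled feature vectors and split into training / test sets.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(20)                  // number of trees in the forest
  .setFeatureSubsetStrategy("sqrt") // random subset of features at each split

val model = rf.fit(train)
val predictions = model.transform(test)

val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(predictions)
println(s"Test accuracy = $accuracy")
```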


Adaboost Example

• Adaboost [Freund and Schapire 1996]: constructing a "strong" learner as a linear combination of weak learners (the resulting classifier is written out after this list).

- Start with a uniform distribution ("weights") over training examples (the weights tell the weak learning algorithm which examples are important).

- Obtain a weak classifier from the weak learning algorithm, h_t : X → {−1, +1}.

- Increase the weights on the training examples that were misclassified.

- (Repeat.)
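For reference (standard AdaBoost, not copied from the slide): after $T$ rounds the strong classifier is the weighted vote of the weak classifiers,

$$H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right), \qquad \alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$$

where $\varepsilon_t$ is the weighted error of $h_t$ on round $t$; examples misclassified by $h_t$ have their weights increased before the next round, which is what the steps above describe.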


Example — User Modeling using Time-Sensitive Adaboost

• Obtain a simple classifier on each feature, e.g., by setting a threshold on a parameter or by binary inference on an input parameter.

• The system classifies whether a person is interested in a new document via Adaptive Boosting (Adaboost):

■ The final classifier is a linear weighted combination of single-feature classifiers.

■ Given the single-feature simple classifiers, weights are assigned to the training samples based on whether a sample is correctly or incorrectly classified. <== Boosting.

■ Classifiers are considered sequentially; the weights selected for previously considered classifiers affect the weights to be selected for the remaining classifiers. <== Adaptive.

■ According to the summed error of each simple classifier, a weight is assigned to it. The final classifier is then the weighted linear combination of these simple classifiers.

• Our new Time-Sensitive Adaboost algorithm:

■ In the standard AdaBoost algorithm, all samples are regarded as equally important at the beginning of the learning process.

■ We propose a time-adaptive AdaBoost algorithm that assigns larger weights to the latest training samples.

People select apples according to their shapes, sizes, other people’s interest, etc.

Each attribute is a simple classifier used in Adaboost.


Time-Sensitive Adaboost [Song, Lin, et al. 2005]


Evaluate the model

AUC (ranges from 0 to 1):
• 1: perfect
• 0: perfectly wrong
• 0.5: random

Confusion matrix
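A minimal sketch of computing both with standard Spark APIs, assuming `predictions` is the DataFrame returned by a fitted binary classifier's transform on a test set (with label, rawPrediction, and prediction columns):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Area under the ROC curve.
val auc = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
  .evaluate(predictions)
println(s"AUC = $auc")

// Confusion matrix from (prediction, label) pairs.
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))
println(new MulticlassMetrics(predictionAndLabels).confusionMatrix)
```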


Homework #1 (Due 10/4/2019, 5pm)

See the TA's instructions.


Questions?
