Least-Mean-Square Training of Cluster-Weighted-Modeling
National Taiwan University
Department of Computer Science and Information Engineering
Outline
• Introduction of CWM
• Least-Mean-Square Training of CWM
• Experiments
• Summary
• Future work
• Q&A
Cluster-Weighted Modeling (CWM)
• CWM is a supervised learning model which is based on the joint probability density estimation of a set of input and output (target) data.
• The joint probability is expanded into clusters which describe local subspaces well. Each local Gaussian expert can have its own local function (constant, linear, or quadratic).
• The global (nonlinear) model can be constructed by combining all the local models.
• The resulting model has transparent local structures and meaningful parameters.
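The expansion described above can be written out explicitly. In the standard cluster-weighted modeling formulation, the joint density over M clusters is (notation assumed; p(x|c_m) is a local Gaussian, f(x, β_m) the cluster's local function):

```latex
p(\mathbf{x}, y) = \sum_{m=1}^{M} p(y \mid \mathbf{x}, c_m)\, p(\mathbf{x} \mid c_m)\, p(c_m)
```

Here p(y|x, c_m) is a Gaussian centered on the local function f(x, β_m), and p(c_m) is the cluster prior.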
Prediction calculation
• Conditional forecast: the expected output given the input.
• Conditional error (output uncertainty): the expected output covariance given the input.
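The formulas for these two quantities did not survive transcription; under the standard CWM joint-density expansion they take the following form (notation assumed):

```latex
\langle y \mid \mathbf{x} \rangle
  = \frac{\sum_{m=1}^{M} f(\mathbf{x}, \beta_m)\, p(\mathbf{x} \mid c_m)\, p(c_m)}
         {\sum_{m=1}^{M} p(\mathbf{x} \mid c_m)\, p(c_m)}

\langle \sigma_y^2 \mid \mathbf{x} \rangle
  = \frac{\sum_{m=1}^{M} \bigl[\sigma_{y,m}^2
        + \bigl(f(\mathbf{x}, \beta_m) - \langle y \mid \mathbf{x} \rangle\bigr)^2\bigr]\,
        p(\mathbf{x} \mid c_m)\, p(c_m)}
         {\sum_{m=1}^{M} p(\mathbf{x} \mid c_m)\, p(c_m)}
```

where σ²_{y,m} is the output variance of cluster m.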
Training (EM Algorithm)
• Objective function: log-likelihood function.
• Initialization: cluster means via k-means; variances set to the maximal range of each dimension; priors set to 1/M, where M is the predetermined number of clusters.
• E-step: evaluate the posterior probability.
• M-step: update the cluster means and the prior probabilities.
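The E-step and M-step update equations were lost in transcription; the standard CWM/EM updates they refer to are (notation assumed):

```latex
% E-step: posterior responsibility of cluster m for sample (x_n, y_n)
p(c_m \mid \mathbf{x}_n, y_n)
  = \frac{p(y_n \mid \mathbf{x}_n, c_m)\, p(\mathbf{x}_n \mid c_m)\, p(c_m)}
         {\sum_{l=1}^{M} p(y_n \mid \mathbf{x}_n, c_l)\, p(\mathbf{x}_n \mid c_l)\, p(c_l)}

% M-step: cluster means and prior probabilities
\boldsymbol{\mu}_m
  = \frac{\sum_{n=1}^{N} p(c_m \mid \mathbf{x}_n, y_n)\, \mathbf{x}_n}
         {\sum_{n=1}^{N} p(c_m \mid \mathbf{x}_n, y_n)},
\qquad
p(c_m) = \frac{1}{N} \sum_{n=1}^{N} p(c_m \mid \mathbf{x}_n, y_n)
```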
M-step (cont.)
• Define the cluster-weighted expectation.
• Update the cluster-weighted covariance matrices.
• Update the cluster parameters which maximize the data likelihood.
• Update the output covariance matrices.
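The equations for these updates were lost; in the standard CWM formulation they are (notation assumed; f_i denotes the i-th basis term of the local model, e.g. 1 and x for a local linear model):

```latex
% cluster-weighted expectation of a quantity \theta(\mathbf{x}, y)
\langle \theta \rangle_m
  = \frac{\sum_{n} \theta(\mathbf{x}_n, y_n)\, p(c_m \mid \mathbf{x}_n, y_n)}
         {\sum_{n} p(c_m \mid \mathbf{x}_n, y_n)}

% cluster-weighted covariance and output variance
\mathbf{\Sigma}_m
  = \langle (\mathbf{x} - \boldsymbol{\mu}_m)(\mathbf{x} - \boldsymbol{\mu}_m)^{\mathrm{T}} \rangle_m,
\qquad
\sigma_{y,m}^2 = \langle [\,y - f(\mathbf{x}, \beta_m)\,]^2 \rangle_m

% local model parameters maximizing the data likelihood
\beta_m = \mathbf{B}^{-1}\mathbf{c},
\qquad \text{where } B_{ij} = \langle f_i f_j \rangle_m,\quad c_i = \langle y\, f_i \rangle_m
```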
Least-Mean-Square Training of CWM
• To train CWM's model parameters from a least-squares perspective.
• To minimize the squared-error function of CWM's training result and find another solution with better accuracy.
• To find another solution when CWM is trapped in local minima.
• To apply supervised selection of cluster centers instead of an unsupervised method.
LMS Learning Algorithm
• The instantaneous error produced by sample n is the difference between the target and the prediction.
• The prediction formula is the CWM conditional forecast.
• A softmax function is used to constrain the prior probabilities to have values between 0 and 1 and to sum to 1.
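One consistent reconstruction of the missing formulas, with γ_m an unconstrained softmax parameter per cluster (symbol assumed):

```latex
e_n = y_n - \hat{y}(\mathbf{x}_n),
\qquad
\hat{y}(\mathbf{x})
  = \frac{\sum_{m=1}^{M} f(\mathbf{x}, \beta_m)\, p(\mathbf{x} \mid c_m)\, p(c_m)}
         {\sum_{m=1}^{M} p(\mathbf{x} \mid c_m)\, p(c_m)},
\qquad
p(c_m) = \frac{e^{\gamma_m}}{\sum_{l=1}^{M} e^{\gamma_l}}
```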
LMS Learning Algorithm (cont.)
• The derivation of the gradients.
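The gradient expressions themselves were lost. As a sketch: writing the squared error as E_n = ½e_n² and the normalized cluster weight as w_m(x) = p(x|c_m)p(c_m) / Σ_l p(x|c_l)p(c_l) (notation assumed), the chain rule gives, for the local-model parameters,

```latex
\frac{\partial E_n}{\partial \beta_m}
  = -\,e_n\, w_m(\mathbf{x}_n)\,
    \frac{\partial f(\mathbf{x}_n, \beta_m)}{\partial \beta_m}
```

Analogous expressions for μ_m, Σ_m, and the softmax parameters γ_m follow by differentiating w_m; gradient descent then updates each parameter opposite its gradient with a learning rate η.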
LMS CWM Learning Algorithm
• Initialization: initialize the model parameters using CWM's training result.
• Iterate until convergence:
    For n = 1:N
      Estimate the error
      Estimate the gradients
      Update the parameters
    End
• E-step and M-step: parameters can also be refined by EM between LMS passes.
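The loop above can be sketched in code. This is a minimal, runnable illustration of the LMS refinement idea on a 1-D toy problem, not the paper's exact algorithm: symbols (mu, var, beta, gamma) and the learning rate eta are assumptions, only the local-linear coefficients are updated, and the EM interleaving is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4                                  # number of clusters
x = rng.uniform(-3, 3, 200)
y = np.sin(x)                          # toy target (the "simple Sin function")

mu = np.linspace(-3, 3, M)             # cluster means (normally from CWM/EM)
var = np.full(M, 1.0)                  # cluster variances
beta = np.zeros((M, 2))                # local linear models: y = b0 + b1*x
gamma = np.zeros(M)                    # softmax parameters for the priors
eta = 0.01                             # learning rate (assumed)

def predict(xn):
    prior = np.exp(gamma) / np.exp(gamma).sum()           # softmax prior
    px = np.exp(-0.5 * (xn - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    w = prior * px
    w = w / w.sum()                                        # normalized weights
    f = beta[:, 0] + beta[:, 1] * xn                       # local linear outputs
    return (w * f).sum(), w, f

for epoch in range(200):
    for xn, yn in zip(x, y):
        yhat, w, f = predict(xn)
        err = yn - yhat                                    # instantaneous error
        # gradient steps on the local coefficients only (chain rule
        # through the weighted sum; cluster/prior updates omitted here)
        beta[:, 0] += eta * err * w
        beta[:, 1] += eta * err * w * xn

mse = np.mean([(yn - predict(xn)[0]) ** 2 for xn, yn in zip(x, y)])
```

Starting from zero coefficients (MSE ≈ 0.5 on this target), the stochastic updates drive the blended local-linear fit toward the sine curve, illustrating how the LMS pass refines an initialized CWM.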
Experiments
• A simple sin function.
• LMS-CWM has a better interpolation result.
Mackey-Glass Chaotic Time Series Prediction
• 1000 data points: the first 500 points are taken as the training set, the last 500 points as the test set.
• Single-step prediction
• Input: [s(t), s(t-6), s(t-12), s(t-18)]
• Output: s(t+85)
• Local linear model
• Number of clusters: 30
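The delay-embedding setup above can be sketched as follows. The series here is a synthetic stand-in (the actual Mackey-Glass data is not reproduced), and the exact train/test split indices are an assumption:

```python
import numpy as np

# placeholder for the 1000-point series (not real Mackey-Glass data)
s = np.sin(0.1 * np.arange(1000))

lags = [0, 6, 12, 18]                   # input: [s(t), s(t-6), s(t-12), s(t-18)]
horizon = 85                            # output: s(t+85)
t0 = max(lags)                          # first usable time index
t1 = len(s) - horizon                   # last usable time index (exclusive)

X = np.stack([[s[t - l] for l in lags] for t in range(t0, t1)])
Y = s[t0 + horizon : t1 + horizon]

# first half for training, second half for testing, as in the slides
half = len(X) // 2
X_train, Y_train = X[:half], Y[:half]
X_test, Y_test = X[half:], Y[half:]
```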
Results (2)
• Learning curve (figure: CWM vs. LMS-CWM)

MSE            CWM          LMS-CWM
Test set       0.0008027    0.0004480
Training set   0.0006568    0.0004293
Local Minima
• The initial locations of the four clusters.
• The resulting centers' locations after each training session of CWM and LMS-CWM.
Summary
• An LMS learning method for CWM is presented.
• It may lose the benefits of data density estimation and of characterizing the data.
• It provides an alternative training option.
• Parameters can be trained by EM and LMS alternately, combining the advantages of both.
• LMS-CWM learning can be viewed as a refinement of CWM if prediction accuracy is the main concern.
Future work
• Regularization.
• Comparison between different models (from theoretical and performance points of view).