Page 1: Principal Component Regression & Canonical …rcc.imdpune.gov.in/Training/SASCOF12/CPT_Dayone/PCR_CCA...• Climate Predictability tool (CPT) is an easy-to-use software for making

Principal Component Regression & Canonical Correlation Analysis

Nachiketa Acharya [email protected]

Big Thanks to Dr. Simon Mason

Training Workshop on Seasonal Prediction of Southwest Monsoon Rainfall: 16th – 18th April, 2018


2 Seasonal Forecasting Using the Climate Predictability Tool

Making seasonal forecasts for the monsoon

There are two main methods in use (and in practice we often combine the two, and/or use a hybrid of them).

I: Models of past statistics – teleconnection

[Figure: predictors, labelled EAST ASIA PR ANOMALY; WARM WATER VOLUME; N. ATL SST ANOMALY; EQ. SE INDIAN OCEAN SST ANOMALY; NORTH WEST EUROPE TEMP ANOMALY; NINO 3.4 SST ANOMALY; NATL PR ANOMALY; NCPAC U850 ANOMALY]

Courtesy: IMD


Making seasonal forecasts for the monsoon

II: Models of the physics – causation

• Climate models are computer codes based on fundamental laws of physics.

• However, the output from these ensemble prediction systems cannot be used directly; it requires further calibration in order to produce reliable forecasts.


Seasonal forecasting tool

• The Climate Predictability Tool (CPT) is easy-to-use software for making seasonal forecasts using either empirical predictors or the outputs of GCMs.

• Developed and maintained by Dr. Simon Mason.

• CPT is available for Windows 95+ and as a batch version for Linux.


How does CPT make a forecast?


Options for making seasonal forecasts in CPT

• Multiple Linear Regression

• Principal Component Regression

• Canonical Correlation Analysis


Multiple Linear Regression

[Figure panels: Area-average MAM rainfall for Thailand; Ocean-based ENSO indices]

MAM rainfall over Thailand can be predicted using a single predictor such as February NIÑO4 SSTs.

$\hat{y} = \beta_0 + \beta_1 \,\mathrm{NINO4}$, with $\beta_0 = 340$ mm, $\beta_1 = 50$, and $r = 0.48$

A simple linear regression equation for predicting rainfall has two parameters:

• constant: how much rainfall we can expect on average when the value of the predictor is 0.

• coefficient: how much we can expect rainfall to increase or decrease when the predictor increases by 1.
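The two parameters can be illustrated with a small sketch. The data below are synthetic stand-ins, not the Thailand record, and the "true" values 300 and 40 are arbitrary assumptions chosen only so that the fit has something to recover.

```python
import numpy as np

# Synthetic stand-in data (NOT observations): a predictor index and rainfall in mm.
rng = np.random.default_rng(0)
index = rng.normal(0.0, 1.0, size=50)                      # hypothetical predictor values
rain = 300.0 + 40.0 * index + rng.normal(0.0, 60.0, 50)    # hypothetical rainfall (mm)

# Least-squares fit of rain = constant + coefficient * index.
# np.polyfit returns the highest-degree term first, so the slope comes before the intercept.
coefficient, constant = np.polyfit(index, rain, deg=1)
r = np.corrcoef(index, rain)[0, 1]

print(f"constant    = {constant:.1f} mm (expected rainfall when the predictor is 0)")
print(f"coefficient = {coefficient:.1f} mm change per unit increase in the predictor")
print(f"correlation = {r:.2f}")
```

The fitted constant and coefficient will not equal 300 and 40 exactly, because the noise term makes every finite sample slightly different.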


Multiple Linear Regression

In CPT the MLR (multiple linear regression) option allows for more than one predictor:

$Y = b_0 + \sum_{i=1}^{n} b_i X_i$

where:

Y = dependent variable

Xi = independent variables

bi = regression coefficients

n = number of independent variables
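As a rough illustration of the model with n = 3 predictors, the sketch below fits the coefficients with NumPy. The data and the coefficient values are made up for the example; this is not how CPT is implemented, only the same equation solved by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years, n_predictors = 40, 3
X = rng.normal(size=(n_years, n_predictors))            # hypothetical predictors X_1..X_n
b_true = np.array([2.0, -1.0, 0.5])                      # made-up regression coefficients
Y = 10.0 + X @ b_true + rng.normal(0.0, 0.5, n_years)   # made-up dependent variable

# A leading column of ones makes the first fitted value the constant b_0.
design = np.column_stack([np.ones(n_years), X])
b_hat, *_ = np.linalg.lstsq(design, Y, rcond=None)

b0, b = b_hat[0], b_hat[1:]
Y_fit = b0 + X @ b                                       # fitted values from the equation

print("b_0 =", round(float(b0), 2))
print("b_i =", np.round(b, 2))
```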


Let’s have some equations! The Multiple Regression Model

The coefficients are estimated by least squares, that is, by minimizing the sum of squared differences between the observed and fitted values of Y.
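The least-squares criterion can be written out as follows. The notation is an assumption for illustration: j indexes the m years (cases) in the training data, and X_{ij} is the value of predictor i in year j; Y, X_i, b_i and n are as defined for the MLR equation.

```latex
S(b_0, b_1, \dots, b_n) = \sum_{j=1}^{m} \Bigl( Y_j - b_0 - \sum_{i=1}^{n} b_i X_{ij} \Bigr)^2
```

The fitted coefficients are the values of b_0, ..., b_n for which S is smallest.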


Problems with Multiple Linear Regression

Multiplicity: too many predictors from which to choose.

With more than a handful of candidate predictors, the probability of including at least one spurious predictor (and therefore of subsequently making a bad prediction) becomes very high.
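A back-of-envelope sketch of why this probability grows so quickly. It assumes each candidate is screened with an independent significance test at level alpha, which is a simplification (real candidate predictors are rarely independent); the function name is hypothetical.

```python
# Chance that at least one of k unrelated candidate predictors passes a
# significance test at level alpha purely by chance (assumes independent tests).
def prob_at_least_one_spurious(k: int, alpha: float = 0.05) -> float:
    return 1.0 - (1.0 - alpha) ** k

for k in (1, 5, 10, 20, 50):
    print(f"{k:3d} candidates -> probability {prob_at_least_one_spurious(k):.2f}")
```

Under these assumptions, with 10 candidates there is already about a 40% chance that at least one looks significant by accident.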


Problems with Multiple Linear Regression

• Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy.

• When two X variables are highly correlated, they both convey essentially the same information.

• In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data.

• When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value.


Multicollinearity: Example

MAM Feb5ˆ 340 NINO40y MAM Jan3ˆ 344 NINO48y

MAM Jan Febˆ 332 NINO4 NIN131 O475y

For the first half of the data (1961 – 1985) only:

MAM Jan Febˆ 330 NINO4 NIN17 O419y

Predicting MAM 1961 – 2010 rainfall for Thailand from NIÑO4 SSTs:

Correlation between NINO4Jan and NINO4Feb is 0.97.

This is a clear example of coefficient estimates changing erratically in response to small changes in the model or the data, caused by the strong correlation among the predictors.
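The same behaviour can be reproduced with synthetic data. In the sketch below the two predictors are constructed to be correlated at roughly 0.97; the variable names, the noise levels and the "true" relationship are all assumptions for illustration, not the Thailand data. The variance inflation factor, 1 / (1 - r^2) for two predictors, is a standard diagnostic that is not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x_feb = rng.normal(size=n)                              # hypothetical February index
x_jan = x_feb + 0.25 * rng.normal(size=n)               # nearly the same signal a month earlier
y = 300.0 + 40.0 * x_feb + rng.normal(0.0, 60.0, n)     # made-up rainfall (mm)

def fit(predictors, target):
    """Least-squares fit; returns [constant, coefficient_1, ...]."""
    design = np.column_stack([np.ones(len(target)), predictors])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs

both = np.column_stack([x_jan, x_feb])
r = np.corrcoef(x_jan, x_feb)[0, 1]
vif = 1.0 / (1.0 - r**2)        # variance inflation factor for two predictors

print("correlation between predictors:", round(float(r), 2), " VIF:", round(float(vif), 1))
print("Feb only           :", np.round(fit(x_feb, y), 1))
print("Jan + Feb, all data:", np.round(fit(both, y), 1))
print("Jan + Feb, 1st half:", np.round(fit(both[:25], y[:25]), 1))
```

Comparing the last two printed lines shows how much the individual Jan and Feb coefficients move when only half of the data is used, even though the single-predictor fit is comparatively stable.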


Principal Components Regression

• Principal components regression is just like standard regression except that the independent variables are principal components rather than the original X variables.

• Principal components regression (PCR) is a method for combating multicollinearity, and it can give better estimation and prediction than ordinary least squares when the predictors are strongly correlated.

What is principal components analysis?

• Principal components are new variables that are linear combinations of the X’s. The new variables are not correlated with each other.

• The principal components transformation is equivalent to a rotation of axes.

Credit: Dave Garen

PCA was invented in 1901 by Karl Pearson. It was later independently developed (and named) by Harold Hotelling in the 1930s.
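A minimal sketch of these two properties, using synthetic correlated predictors and the singular value decomposition (one common way to compute principal components; not necessarily what CPT does internally). All variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_years = 40
common = rng.normal(size=n_years)
# Four made-up X variables that share a common signal, so they are strongly correlated.
X = np.column_stack([common + 0.3 * rng.normal(size=n_years) for _ in range(4)])

Xc = X - X.mean(axis=0)                      # anomalies (column means removed)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T                              # each column: weights of one linear combination
scores = Xc @ loadings                       # the principal components (new variables)

corr_x = np.corrcoef(Xc, rowvar=False)       # original variables: large off-diagonal values
corr_pc = np.corrcoef(scores, rowvar=False)  # principal components: off-diagonals ~ 0

print(np.round(corr_x, 2))
print(np.round(corr_pc, 2))
# loadings.T @ loadings is the identity matrix, which is what "rotation of axes" means here.
print(np.round(loadings.T @ loadings, 2))
```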


Principal components

• Principal components analysis is specifically designed as a data reduction technique.

• Principal components analysis (PCA) is sometimes also known as empirical orthogonal function (EOF) analysis.


Understanding PCA in a simple way


Diagrammatic Representation of Principal Components


Selecting PCs

How many of the new variables should be retained to represent the total variability of the original variables adequately? A stopping rule is required to identify at which point additional principal components are no longer required.
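One simple stopping rule, sketched below, keeps the smallest number of components whose cumulative explained variance reaches a chosen target. The 90% target, the function name and the synthetic data are assumptions; this is only one of several possible rules and is not presented as the rule CPT applies.

```python
import numpy as np

def n_components_for_variance(X, target=0.90):
    """Smallest number of PCs whose cumulative explained variance reaches `target`."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)          # singular values, largest first
    explained = s**2 / np.sum(s**2)                  # variance fraction per component
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(s)), cumulative

# Synthetic field: two strong patterns plus weak noise at 10 locations over 40 years.
rng = np.random.default_rng(4)
t1, t2 = rng.normal(size=(2, 40))
p1, p2 = rng.normal(size=(2, 10))
X = 3.0 * np.outer(t1, p1) + 1.5 * np.outer(t2, p2) + 0.3 * rng.normal(size=(40, 10))

k, cumulative = n_components_for_variance(X, target=0.90)
print("components retained:", k)
print("cumulative variance:", np.round(cumulative, 3))
```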


Visualization of Principal Components: Example from SST

Scores and loadings for first principal component of February 1961 – 2000 sea-surface temperatures.

A principal component is a weighted sum of a set of original variables, with the weights set so that the principal component has maximum variance.


Principal Components

The score indicates how intensely developed the loading pattern is for each year.


Principal Components

Scores and loadings for the second principal component of February 1961 – 2000 sea-surface temperatures.

Separate patterns (“modes”) of variability can be defined. We can use just a few of these modes to represent the SST variability throughout the domain.
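The idea of representing a whole field with a few modes can be sketched with a synthetic "SST-like" array (years by grid points). The field is built from two made-up patterns plus noise, so the numbers say nothing about real SSTs; the helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_years, n_grid = 40, 200
patterns = rng.normal(size=(2, n_grid))                        # two made-up spatial patterns
amplitudes = rng.normal(size=(n_years, 2)) * np.array([3.0, 1.5])
field = amplitudes @ patterns + 0.2 * rng.normal(size=(n_years, n_grid))

anom = field - field.mean(axis=0)
U, s, Vt = np.linalg.svd(anom, full_matrices=False)
# Rows of Vt are the loading patterns; columns of U * s are the yearly scores.

def reconstruct(k):
    """Rebuild the anomaly field from the first k modes only."""
    return (U[:, :k] * s[:k]) @ Vt[:k]

def variance_explained(k):
    residual = anom - reconstruct(k)
    return 1.0 - np.sum(residual**2) / np.sum(anom**2)

for k in (1, 2, 5):
    print(f"{k} mode(s): {variance_explained(k):.3f} of the variance")
```

Here 200 grid-point series are summarised by two score series with little loss, which is the data reduction that makes the later regression step tractable.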


Principal Component Regression: Math (boring)


Principal Component

The principal components are orthogonal to each other, that is, they are mutually uncorrelated.

Elimination of Principal Components.


Transformation Back to the Original Variables:
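The slide's equations did not survive, so the following is only a generic sketch of the three steps named here (compute principal components, eliminate the trailing ones, transform the regression back to the original variables). The data, the choice of k = 2 retained components and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n_years, n_pred, k = 40, 6, 2
common = rng.normal(size=(n_years, 1))
X = common + 0.3 * rng.normal(size=(n_years, n_pred))        # strongly correlated predictors
y = 5.0 + 2.0 * X.mean(axis=1) + rng.normal(0.0, 0.5, n_years)

x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean

# 1. Principal components of the predictors.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# 2. Elimination: keep only the first k components.
V_k = Vt[:k].T                    # loadings, shape (n_pred, k)
Z = Xc @ V_k                      # scores, shape (n_years, k)

# 3. Ordinary regression of y on the retained scores.
gamma, *_ = np.linalg.lstsq(Z, yc, rcond=None)

# 4. Transformation back to coefficients on the original variables.
beta = V_k @ gamma
intercept = y_mean - x_mean @ beta

pred_from_pcs = y_mean + Z @ gamma
pred_from_x = intercept + X @ beta      # same predictions, expressed in the original X
print("coefficients on the original variables:", np.round(beta, 3))
```

Both prediction vectors are identical; the back-transformation only changes how the model is written, not what it predicts.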


Selecting Models in CPT

• MLR can be used when there is one predictor or a very small number of predictors (independent of each other).

• PCR can be used to address problems with MLR that arise when there are many predictors.

• But what if there are many predictands?


Canonical Correlation Analysis



Some terminology of CCA


Visualization of Canonical Correlation Analysis: Example from SST and Rain

[Figure: CCA mode 1, r = 0.73. Panels: Feb SSTs, 1961-2000; MAM rainfall]


Canonical Correlation Analysis

[Figure: CCA mode 2, r = 0.67. Panels: Feb SSTs, 1961-2000; MAM rainfall]


Selecting Models in CPT

• MLR can be used when there is one predictor or a very small number of predictors (independent of each other).

• PCR can be used to address problems with MLR that arise when there are many predictors.

• CCA can be used if there are many predictors AND many predictands, but it can also be used even if there are only a few of each.
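For readers who want to see what CCA computes, here is a compact sketch on synthetic data with 5 predictors and 4 predictands that share one signal. It uses a QR-plus-SVD formulation, which is one standard way to obtain canonical correlations and requires more years than variables in each set; with gridded fields each set is normally reduced to a few principal components first. The names and data are assumptions, and this is not a description of CPT's internals.

```python
import numpy as np

rng = np.random.default_rng(7)
n_years = 40
shared = rng.normal(size=n_years)                                    # signal common to both sets
X = np.outer(shared, rng.normal(size=5)) + rng.normal(size=(n_years, 5))   # predictors
Y = np.outer(shared, rng.normal(size=4)) + rng.normal(size=(n_years, 4))   # predictands

Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

# Orthonormal bases for the two sets, then an SVD of their cross-product.
Qx, _ = np.linalg.qr(Xc)
Qy, _ = np.linalg.qr(Yc)
U, canon_corr, Vt = np.linalg.svd(Qx.T @ Qy)

x_variates = Qx @ U        # canonical variates of the predictors (one column per mode)
y_variates = Qy @ Vt.T     # canonical variates of the predictands

print("canonical correlations:", np.round(canon_corr, 2))
check = np.corrcoef(x_variates[:, 0], y_variates[:, 0])[0, 1]
print("correlation of the mode-1 variates:", round(float(check), 2))
```

The canonical correlations come out in decreasing order, so mode 1 is the pair of patterns whose time series are most strongly correlated.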


Making probability forecasts in CPT

• Seasonal forecasts are generally expressed as tercile probabilities.

• Let’s do a hands-on exercise to understand this.

Example data: 10.13038, 27.59568, 13.42799, 13.96082, 21.76947, 16.92497, 18.6818, 25.95358, 30.46833, 18.02041, 23.27678, 17.61698, 22.29597, 24.39998, 13.83134, 22.74837, 26.01102, 20.92308, 37.29841, 13.91443, 12.6294, 2.501207, 29.10483, 28.67083, 19.20107, 28.98476, 21.83703, 22.9079, 21.12945, 24.39952

Mean = 21.0205 and SD = 7.0933. Based on percentiles we can estimate the tercile thresholds: lower bound (33rd percentile) = 18.3511, upper bound (67th percentile) = 23.8382.
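The numbers above can be checked with a few lines of Python. The thresholds are taken here as the midpoints between the 10th and 11th, and between the 20th and 21st, sorted values; that convention is an assumption, chosen because it reproduces the quoted bounds (other percentile conventions give slightly different values).

```python
import statistics

data = [
    10.13038, 27.59568, 13.42799, 13.96082, 21.76947, 16.92497, 18.6818,
    25.95358, 30.46833, 18.02041, 23.27678, 17.61698, 22.29597, 24.39998,
    13.83134, 22.74837, 26.01102, 20.92308, 37.29841, 13.91443, 12.6294,
    2.501207, 29.10483, 28.67083, 19.20107, 28.98476, 21.83703, 22.9079,
    21.12945, 24.39952,
]

ordered = sorted(data)
n = len(ordered)                                              # 30 values: three groups of 10

# Tercile thresholds as midpoints between neighbouring sorted values.
lower = (ordered[n // 3 - 1] + ordered[n // 3]) / 2
upper = (ordered[2 * n // 3 - 1] + ordered[2 * n // 3]) / 2

mean = statistics.fmean(data)
sd = statistics.stdev(data)                                   # sample SD (divides by n - 1)

below = sum(x < lower for x in data)
above = sum(x > upper for x in data)
near = n - below - above

print(f"mean = {mean:.4f}, SD = {sd:.4f}")
print(f"lower threshold = {lower:.4f}, upper threshold = {upper:.4f}")
print(f"counts: below = {below}, near = {near}, above = {above}")
```

By construction each category contains 10 of the 30 values, which is what makes the climatological probability of each tercile one third.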


Plot the frequency and probabilities

[Figure panels: Histogram; Probability Distribution Function based on Normal Distribution]

$P(X < 18.35)$ and $P(X > 23.83) = 1 - P(X < 23.83)$


• What would our guess be if no forecast were available?

– The outcome could fall in any of the three categories.

• So what are the chances of getting below normal?

– 33%

• Above normal?

– 33%

• Near normal?

– 33%

• So a forecast that always gives a 33% probability to each of the three categories is what we call “no skill”.


• Now suppose that in one year our models give a mean forecast of 30.

• By various methods we calculate the spread (standard deviation); let us take it to be 6.

• We then generate the forecast PDF.

[Figure: climatological PDF (blue) and forecast PDF (red). Forecast tercile probabilities: below normal 2%, near normal 14%, above normal 84%]
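A sketch of how such tercile probabilities can be computed, assuming the forecast distribution is normal with mean 30 and standard deviation 6 and using the thresholds 18.3511 and 23.8382 from the example data. The function names are assumptions, and the slide's own figures may have been produced with rounded thresholds.

```python
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution with the given mean and SD."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))

LOWER, UPPER = 18.3511, 23.8382          # tercile thresholds from the example data

def tercile_probabilities(mean, sd):
    below = normal_cdf(LOWER, mean, sd)
    above = 1.0 - normal_cdf(UPPER, mean, sd)
    near = 1.0 - below - above
    return below, near, above

climatology = tercile_probabilities(21.0205, 7.0933)   # normal fit to the example data
forecast = tercile_probabilities(30.0, 6.0)            # forecast mean 30, spread 6

for name, probs in (("climatology", climatology), ("forecast", forecast)):
    below, near, above = probs
    print(f"{name:12s} below = {below:.0%}  near = {near:.0%}  above = {above:.0%}")
```

The forecast probabilities come out close to, though not identical with, the rounded values shown on the slide, and the climatological ones are roughly a third each (not exactly, because the thresholds are empirical percentiles rather than quantiles of the fitted normal curve).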


SO WE LEARNED ABOUT THE METHODS IN CPT. WHAT DO THEY REALLY MEAN?

SORRY, I’M NOT PREPARED FOR IN-DEPTH QUESTIONS

Questions?

web: iri.columbia.edu

@climatesociety

…/climatesociety