University of Colorado
Department of Civil, Environmental and Architectural Engineering

CVEN-6833 Advanced Data Analysis Techniques

Homework Set No. 4 Date Feb. 3, 2006

1. Annual average sea surface temperature (SST) anomalies covering the globe and Palmer Drought Severity Index (PDSI) values covering the USA are available on the class page http://civil.colorado.edu/~balajir/~apipatta/CVEN-6833/multivariate

You wish to identify the space-time patterns of variability of the annual SST. To this end,

(i) Perform a PCA on the SST anomalies data

Plot the Eigen-variance spectrum for the first 40 modes

Document your observations.

The sea surface temperature (SST) anomaly data give the departure from the station mean of the sea surface temperature measured at 1207 stations around the world from 1925 to 2003. Here we consider data from the 488 stations in the Pacific Ocean.

Principal component analysis (PCA) rotates the data matrix to find the axes of maximum variability. The axes that account for most of the variation in the data are the principal components, or leading modes. The plots above show the fraction of the variance in the data explained by each mode. Roughly 30% of the variability is accounted for by the first mode, and the first ten modes account for approximately 80%. The first three modes, analyzed in further detail below, account for more than half of the total variance in the data.
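
The eigen-variance spectrum described above can be sketched as follows. This is only an illustration, not the code used for the homework: the function name `variance_spectrum`, the years-by-stations array layout, and the synthetic data standing in for the SST anomalies are all assumptions.

```python
import numpy as np

def variance_spectrum(data):
    """Fraction of total variance explained by each PCA mode.

    data: array of shape (n_years, n_stations); each row is one year.
    (This layout is an assumption made for the sketch.)
    """
    # Remove the time mean at each station so we work with anomalies.
    anomalies = data - data.mean(axis=0)
    # The squared singular values of the anomaly matrix are proportional
    # to the eigenvalues of the station-by-station covariance matrix.
    singular_values = np.linalg.svd(anomalies, compute_uv=False)
    eigenvalues = singular_values**2
    return eigenvalues / eigenvalues.sum()

# Synthetic stand-in for the Pacific SST anomalies (79 years, 488 stations):
# one oscillating spatial pattern plus noise. Not real data.
rng = np.random.default_rng(0)
years, stations = 79, 488
pattern = rng.standard_normal(stations)
signal = np.outer(np.sin(2 * np.pi * np.arange(years) / 4.0), pattern)
sst = signal + 0.5 * rng.standard_normal((years, stations))

fractions = variance_spectrum(sst)
cumulative = np.cumsum(fractions)
print("first three modes:", fractions[:3])
print("cumulative fraction after ten modes:", cumulative[9])
```

Plotting `fractions[:40]` against the mode number would give the spectrum for the first 40 modes requested in the problem statement.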

Plot the leading three spatial and temporal (PCs) modes of variability

Document your observations.

The contour plots above show the first three eigenvectors of the covariance matrix of the Pacific SST data. The covariance matrix is an N x N symmetric matrix that describes how the data at each station co-varies with that at all other stations over the length of the record. A positive covariance means that, on average, two stations tend to vary from their means in the same direction: warmer years at station A are likely to also be warmer at station B. A negative covariance indicates the opposite: a warm year at station C tends to correspond to a cool year at station D. These eigenvectors are also known as Empirical Orthogonal Functions (EOFs).

Each eigenvector, or EOF, contains one value for each station. This value is some sort of “mixture” of the covariance values for that station with all the others, though its physical meaning and units are unclear to me. The EOF as a whole identifies an axis through the covariance data that can indicate stable patterns or oscillations in the data. The contours in the plots above connect stations with equal values from the first three EOFs.

As mentioned before, the significance of the actual values and their units is not entirely clear to me, but I believe that the contour lines identify regions where the data tend to change in phase. The plot of EOF1 shows three distinct regions of variability: the North Pacific, the South Pacific, and the Southeast United States. Since we are dealing with mean annual temperature anomalies (rather than actual temperatures), we shouldn’t expect to see any seasonal pattern in the data. EOF1 is the El Nino Southern Oscillation (ENSO), accounting for about 30% of the variance in the Pacific Ocean SST anomalies. Modes two and three account for about 15% and 5% of the variability, respectively. Their physical meaning is unclear to me.
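
A small sketch may help with the question of what the EOF values are. Under the usual convention (assumed here, since the homework does not state which normalization was used), each EOF is a unit-length eigenvector of the covariance matrix, so its entries are dimensionless weights, one per station, and the overall sign of each EOF is arbitrary. The matching PC is the projection of the anomalies onto the EOF, so it carries the units of the data. The function name `eofs_and_pcs` and the random test data are hypothetical.

```python
import numpy as np

def eofs_and_pcs(data, n_modes=3):
    """Leading EOFs (spatial patterns) and PCs (time series).

    data: array of shape (n_years, n_stations), an assumed layout.
    """
    anomalies = data - data.mean(axis=0)
    n_years = anomalies.shape[0]
    # Station-by-station covariance matrix (N x N, symmetric).
    cov = anomalies.T @ anomalies / (n_years - 1)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1][:n_modes]
    eofs = eigenvectors[:, order]   # columns are unit-length EOFs
    pcs = anomalies @ eofs          # one time series per mode
    return eigenvalues[order], eofs, pcs

# Synthetic example (not the SST data): 79 years, 40 stations.
rng = np.random.default_rng(0)
data = rng.standard_normal((79, 40)) * np.linspace(3.0, 0.5, 40)
eigenvalues, eofs, pcs = eofs_and_pcs(data)

# The EOFs are orthonormal, and the variance of each PC equals its eigenvalue.
print(np.round(eofs.T @ eofs, 6))
print(pcs.var(axis=0, ddof=1), eigenvalues)
```

Because the sign of an eigenvector is arbitrary, a positive contour in an EOF map only means "in phase with the PC", not "warmer than average" by itself.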

Document your observations.

The plots above show the values of the three leading principal components (PCs) vs. time in years after 1925. Each PC represents the time evolution of the corresponding EOF (plotted previously). In PC1, we can see a peak every 3-5 years, supporting the conclusion that the first mode of variability (once the seasonal cycle is removed) is ENSO. PC2 shows approximately decadal cycles, and PC3 is a mess. The cycles become more evident in part (ii), where we use spectral analysis to identify the dominant frequencies in the principal components.

(ii) Investigate the dominant periodicities of the first three leading PCs by performing a spectral analysis on the PCs.

Document your observations.

Cycles in periodic data are much easier to discern by looking at the spectrum. Peaks in the spectrum show us the dominant frequencies of the patterns that appear in the EOF contour plots. In PC1, there are several peaks at frequencies lower than 1 cycle per decade and a group of peaks in the range of approximately 1 cycle every three to six years. This higher-frequency group is ENSO. The spectrum of PC2 shows power at frequencies lower than 1 cycle per decade, and PC3 shows a number of low-power peaks at many frequencies, consistent with our earlier observation that it is a mess.
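
One simple way to estimate such a spectrum is a raw periodogram. The sketch below is an assumption about the approach, not the method actually used for the plots (which may have been smoothed or tapered); the function name `periodogram` and the synthetic 4-year cycle are illustrative only.

```python
import numpy as np

def periodogram(series, dt=1.0):
    """Raw periodogram of an evenly sampled series.

    dt is the sampling interval (1 year for annual PCs).
    Returns frequencies in cycles per unit time and the power at each.
    """
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                      # remove the mean before the FFT
    n = x.size
    power = np.abs(np.fft.rfft(x))**2 / n
    freqs = np.fft.rfftfreq(n, d=dt)
    return freqs[1:], power[1:]           # drop the zero-frequency term

# Synthetic PC with a 4-year cycle plus noise (not the real PC1).
t = np.arange(79)
rng = np.random.default_rng(1)
pc = np.sin(2 * np.pi * t / 4.0) + 0.3 * rng.standard_normal(t.size)

freqs, power = periodogram(pc)
peak_period = 1.0 / freqs[np.argmax(power)]
print("dominant period (years):", peak_period)
```

With only 79 annual values, the frequency resolution is 1/79 cycles per year, so decadal and longer periods are resolved by just a handful of frequency bins.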

(iii) Repeat (i) and (ii) on the PDSI data.

Document your observations.

The Palmer Drought Severity Index (PDSI) is a measure of the moisture anomaly relative to local normal conditions. Positive values indicate wetter-than-average years and negative values indicate drier-than-average years. Here we analyze data from 154 North American stations for the years 1925-2003. In the plots above, we can see that the first three modes account for approximately 27%, 14%, and 9% of the variance, respectively. Cumulatively, they account for about half of the total variance in the PDSI data.

Because the EOFs are derived from the covariance matrix, it is difficult to know whether the values in the eigenvectors represent wetter or drier than average conditions. Patterns are evident in all of the first three EOFs, but I don’t know how to interpret them in terms of drought. In an ideal world, I would contour the data for each year and create an animation where each year is a frame. Then the nature of the patterns we see in the EOFs should become evident.

The first EOF shows a locus in the Upper Great Plains. Since the Plains are subject to periodic severe drought, I would guess that this is what we are seeing here. The fact that there appear to be two loci is probably due to interpolation and the contour interval. EOF2 is focused in the Southwest U.S. The Southwest tends to be wet during El Nino years, and that is probably the pattern shown in EOF2. EOF3 shows strong loci in the Ohio River Valley and Southern California. This could also be an El Nino effect, with wet years in Southern California corresponding to dry years in the eastern U.S. and vice versa.

The patterns that appear in the principal components and their spectra call into question the conclusion that the second mode is related to El Nino, since neither PC1 nor PC2 shows much power at the frequency of one cycle per 3-5 years expected for an El Nino driven cycle.

PC1 shows a very strong Great Plains drought cycle with power at decadal and longer recurrence intervals. That seems consistent with intuition. PC2 shows a weaker cycle centered in West Texas with a period of close to 20 years. PC3 shows variation at decadal and longer periods and also at the El Nino frequency (the spike at about 0.18 cycles/year, a period of roughly 5.5 years). We should be able to get a better idea of the correlation between the SST data and PDSI data in the next section.

2. You want to identify the joint patterns of variability of PDSI and SST. One way to do this is to (i) relate the leading PCs of SST with those from PDSI – scatterplot and Local Polynomial fits.

The scatter plots above relate each of the SST principal components to each of the PDSI principal components. By comparing the correlation and the sum of squared residuals, we can get a better idea of which mode of variation in sea surface temperature is correlated with which mode of variation in the Palmer Drought Severity Index. The best linear fit, based on the residuals, is between SST PC1 and PDSI PC3. This pair also has one of the strongest correlations. The relationship supports our hypothesis in the previous section that PDSI EOF3, which shows out-of-phase patterns in Southern California and the Ohio River Valley, is related to the first principal component of SST, ENSO. The negative correlation suggests that strong El Nino conditions are correlated with wet years in the Southwest and drought in the eastern U.S. The correlation maps in the next section will support this conclusion.
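
The comparison of correlation and residuals for one pair of PCs can be sketched as below. This covers only the straight-line fit; the local polynomial fits asked for in the problem are not reproduced here. The function name `linear_relation` and the synthetic series are hypothetical stand-ins for SST PC1 and PDSI PC3.

```python
import numpy as np

def linear_relation(x, y):
    """Correlation, least-squares line, and sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    slope, intercept = np.polyfit(x, y, 1)   # degree-1 least-squares fit
    residuals = y - (slope * x + intercept)
    return r, slope, intercept, float(np.sum(residuals**2))

# Synthetic pair with a built-in negative relationship (not real PCs).
rng = np.random.default_rng(2)
sst_pc = rng.standard_normal(79)
pdsi_pc = -0.5 * sst_pc + 0.2 * rng.standard_normal(79)

r, slope, intercept, ssr = linear_relation(sst_pc, pdsi_pc)
print("r =", r, "slope =", slope, "sum of squared residuals =", ssr)
```

Looping this over all nine SST/PDSI pairs of the three leading PCs would give the table of correlations and residual sums compared in the text. Note that the sum of squared residuals depends on the variance of the PDSI PC being fitted, so it is best compared alongside the correlation rather than on its own.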

(ii) Correlate the leading PC of PDSI with the SST data, thus producing correlation maps and inferring the patterns.

We have already concluded that the first PC of SST is the El Nino Southern Oscillation, and we hypothesized in the previous two sections that ENSO is correlated with the third PC of the PDSI. The map above correlates the first PC of SST with the actual PDSI data. We see a negative correlation in the Southwest and a positive correlation in the Ohio River Valley. El Nino years are typically wet in the Southwest and apparently dry in the eastern U.S.

The correlation of the leading PC of the drought index with the sea surface temperature data is more difficult to interpret. The previous comparison of SST PC1 with the PDSI showed an anti-correlation in the Southwest with a magnitude of 0.4. Most of the correlations in this map are much weaker, suggesting that the influence of sea surface temperature on drought cycles in the upper Great Plains is complex.
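
A correlation map of this kind is just the Pearson correlation between one PC time series and the record at every station, which can then be contoured. A minimal sketch, assuming a years-by-stations array layout; the function name `correlation_map` and the synthetic field are hypothetical.

```python
import numpy as np

def correlation_map(pc, field):
    """Pearson correlation of one PC with the series at every station.

    pc: array of shape (n_years,)
    field: array of shape (n_years, n_stations), an assumed layout
    Returns one correlation per station.
    """
    pc = np.asarray(pc, dtype=float)
    field = np.asarray(field, dtype=float)
    # Standardize both, then average the products over the years.
    pc_std = (pc - pc.mean()) / pc.std()
    field_std = (field - field.mean(axis=0)) / field.std(axis=0)
    return pc_std @ field_std / pc.size

# Synthetic example: a PC that influences 154 stations with varying weights.
rng = np.random.default_rng(3)
pc = rng.standard_normal(79)
weights = np.linspace(-1.0, 1.0, 154)
field = np.outer(pc, weights) + rng.standard_normal((79, 154))

cmap = correlation_map(pc, field)
print("range of correlations:", cmap.min(), cmap.max())
```

With 79 years of data, correlations much smaller in magnitude than about 0.2 are hard to distinguish from zero, which is worth keeping in mind when reading the weaker map.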

3. Can you throw out some thoughts on multivariate forecasting using the PCA? You may wish to try on a small region of PDSI.

Principal component analysis identifies patterns in data. If a few patterns are strong enough to account for the majority of the variation in the data, the weaker components can be neglected and the dimensionality of the problem is effectively reduced. Correlating principal components from different data sets, as we did above with the sea surface temperatures and the drought index, can help identify possible causal links. In the analysis above, we found that the leading mode of variation in the sea surface temperature is correlated with a wet-dry oscillation between the Southwest U.S. and the Ohio River Valley. This is supported by the correlation map and by power at similar frequencies in the principal components. This is clearly a very simplistic analysis, supported by inferences made from outside information, but the basic idea is clear. There are related patterns in climate data that can be identified with the techniques demonstrated here. Once the apparent cause-and-effect relationships are identified, we can make forecasts such as: El Nino increases the likelihood of drought in the East and rain in the desert Southwest.
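
One way such a forecast could be set up is sketched below: regress a PDSI PC on an SST PC from an earlier year, then map the predicted PC back to the stations through the corresponding EOF. Everything here is an assumption for illustration. The one-year lag, the function names `fit_lagged_regression` and `forecast_field`, and the synthetic series are not from the analysis above, which only examined same-year relationships.

```python
import numpy as np

def fit_lagged_regression(predictor_pc, target_pc, lag=1):
    """Least-squares line relating the target PC to the predictor PC
    observed `lag` years earlier (lag must be at least 1)."""
    x = np.asarray(predictor_pc, dtype=float)[:-lag]
    y = np.asarray(target_pc, dtype=float)[lag:]
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def forecast_field(latest_predictor, slope, intercept, target_eof, target_mean):
    """Predict the target PC, then rebuild the station values from
    the single retained mode: mean + PC * EOF."""
    pc_forecast = slope * latest_predictor + intercept
    return target_mean + pc_forecast * target_eof

# Synthetic example: the target PC follows the predictor with a 1-year lag.
rng = np.random.default_rng(4)
n_years, n_stations = 79, 154
sst_pc = rng.standard_normal(n_years)
pdsi_pc = np.zeros(n_years)
pdsi_pc[1:] = -0.6 * sst_pc[:-1] + 0.1 * rng.standard_normal(n_years - 1)

eof = rng.standard_normal(n_stations)
eof /= np.linalg.norm(eof)               # unit-length spatial pattern
station_mean = np.zeros(n_stations)      # anomalies have zero mean

slope, intercept = fit_lagged_regression(sst_pc, pdsi_pc, lag=1)
next_year = forecast_field(sst_pc[-1], slope, intercept, eof, station_mean)
print("fitted slope:", slope, "forecast shape:", next_year.shape)
```

A forecast built this way captures only the variance carried by the retained mode, so its skill is bounded by how much of the PDSI variance that mode explains and by how strong the lagged relationship really is.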