Multivariate Data Analysis in Omics Research Diverging Alternative Splicing Fingerprints Identified in Thoracic Aortic Aneurysm Sanela Kjellqvist, PhD WABI RNAseq course 2017-11-08



Multivariate Data Analysis in Omics Research

Diverging Alternative Splicing Fingerprints Identified in Thoracic Aortic Aneurysm

Sanela Kjellqvist, PhD
WABI RNAseq course

2017-11-08


Outline
• Why multivariate data analysis?
• Multivariate statistics
  – Different analyses
  – Data preprocessing
• Alternative splicing in thoracic aortic aneurysm
  – Thoracic aortic aneurysm
  – Study setup
  – Aim of the study
  – Results
  – Summary
• Today’s exercise


WHY MULTIVARIATE DATA ANALYSIS?


Development of Classical Statistics – 1930s
• Multiple regression
• Canonical correlation
• Linear discriminant analysis
• Analysis of variance

Assumptions:
• Independent X variables
• Many more observations than variables
• Regression analysis one Y at a time
• No missing data

Tables are long and lean

[Figure: data table with N rows and K columns]


Today’s data
• RNA-Seq, array, LC-MS/MS, GC/MS or NMR data
• Problems
  – Many variables
  – Few observations
  – Noisy data
  – Missing data
  – Multiple responses
• Implications
  – High degree of correlation
  – Difficult to analyse with conventional methods
• Data ≠ Information
  – Need ways to extract information from the data
  – Need reliable, predictive information
  – Ignore random variation (noise)

[Figure: data table with N rows and K columns]


Poor Methods of Data Analysis

[Figure: variables X1, X2, X3 and responses Y1, Y2, Y3]

• Plot pairs of variables
  – Tedious, impractical
  – Risk of spurious correlations
  – Risk of missing information
• Select a few variables and use multiple linear regression (MLR)
  – Throwing away information
  – Assumes no ‘noise’ in X
  – One Y at a time
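The risk of spurious correlations when there are many variables and few observations can be illustrated with pure noise. The sketch below is not from the slides; it assumes NumPy, and the sizes (10 observations, 200 variables) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 10, 200                  # few observations, many variables (assumed sizes)
X = rng.normal(size=(n_obs, n_vars))     # pure noise "variables"
y = rng.normal(size=n_obs)               # pure noise "response"

# Pearson correlation of every column of X with y
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

max_abs_r = np.abs(r).max()
print(f"largest |r| among {n_vars} noise variables: {max_abs_r:.2f}")
```

Although no variable is related to y, scanning 200 of them one at a time will almost always turn up at least one that looks strongly correlated.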


A Better Way...
• Multivariate analysis by projection
  – Looks at ALL the variables together
  – Avoids loss of information
  – Finds underlying trends = “latent variables”
  – More stable models
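What "latent variables" means can be shown with simulated data. In this sketch (not from the slides; NumPy assumed, all sizes and the noise level are illustrative), 50 correlated variables are generated from only two hidden trends, and a projection recovers that almost all variation lies in two directions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_vars, n_latent = 20, 50, 2      # assumed sizes for illustration

T_true = rng.normal(size=(n_obs, n_latent))     # hidden trends
P_true = rng.normal(size=(n_latent, n_vars))    # how each variable responds to them
X = T_true @ P_true + 0.1 * rng.normal(size=(n_obs, n_vars))  # observed data + noise

Xc = X - X.mean(axis=0)                         # mean centre
s = np.linalg.svd(Xc, compute_uv=False)         # singular values
explained = s**2 / np.sum(s**2)                 # fraction of variation per component
top2 = explained[:2].sum()
print(f"first two components explain {top2:.1%} of the variation")
```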


Fundamental Data Analysis Objectives

Overview
• Trends
• Outliers
• Quality control
• Biological diversity
• Patient monitoring

Discrimination
• Discriminating between groups
• Biomarker candidates
• Comparing studies or instrumentation

Regression
• Comparing blocks of omics data
• Metab vs Proteomic vs Genomic
• Omic vs medical
• Prediction


MULTIVARIATE STATISTICS


Different methods
• Principal component analysis (PCA)
• Partial least squares to latent structures analysis (PLS)
• Orthogonal partial least squares to latent structures analysis (OPLS)
• PLS-DA
• OPLS-DA
• K-means clustering
• Hierarchical clustering
• Biplot analysis
• Canonical correlation analysis


What is a projection?

Principal component analysis (PCA)
• Algebraically
  – Summarizes the information in the observations as a few new (latent) variables
• Geometrically
  – The swarm of points in a K-dimensional space (K = number of variables) is approximated by a (hyper)plane and the points are projected onto that plane.


PCA – Geometric Interpretation

[Figure: observations plotted in the space of variables x1, x2, x3, with component directions t1 and t2]

Fit first principal component (line describing maximum variation)

Add second component (accounts for next largest amount of variation), at right angles to the first (orthogonal)

Each component goes through the origin


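PCA – Geometric Interpretation

[Figure: the N × K data matrix X plotted in the space of variables x1, x2, x3; Comp 1 and Comp 2 span a plane with co-ordinates t1, t2; one label reads “Distance to Model”]

Points are projected down onto a plane with co-ordinates t1, t2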
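The projection onto the plane, and the part of each point left outside it, can be sketched in a few lines. This assumes NumPy and toy data; the distance computed here is the plain Euclidean length of each residual row, which is an illustrative stand-in and not necessarily the exact "Distance to Model" statistic reported by SIMCA.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3)) * np.array([3.0, 2.0, 0.3])  # toy data, 3 variables
Xc = X - X.mean(axis=0)                                   # mean centre

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T                        # directions of Comp 1 and Comp 2 (3 x 2)
T = Xc @ P                          # co-ordinates t1, t2 of each point on the plane
E = Xc - T @ P.T                    # what is left of each point off the plane
dist = np.linalg.norm(E, axis=1)    # unscaled distance of each observation to the plane
```

Because the plane and the residual are perpendicular, the squared length of each centred observation equals the squared length of its scores plus its squared distance to the plane.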


Loadings

How do the principal components relate to the original variables?

Look at the angles between PCs and variable axes

[Figure: Comp 1 in the space of x1, x2, x3, making angles α1, α2, α3 with the axes; scores t1, t2 shown beside the N × K matrix X]


Loadings

Take cos(α) for each axis

Loadings vector p’ – one for each principal component

One value per variable

[Figure: Comp 1 makes angles α1, α2, α3 with axes x1, x2, x3; the loading vector p’1 holds cos(α1), cos(α2), cos(α3); scores t1, t2 shown beside the N × K matrix X]
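The statement that the loadings are the cosines of the angles between a component and the variable axes can be checked numerically. A minimal sketch, assuming NumPy and toy three-variable data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3)) @ np.array([[2.0, 1.0, 0.5],
                                         [0.0, 1.0, 0.5],
                                         [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
p1 = Vt[0]                          # loading vector of component 1 (unit length)

axes = np.eye(3)                    # unit vectors along x1, x2, x3
cosines = axes @ p1 / (np.linalg.norm(axes, axis=1) * np.linalg.norm(p1))
angles_deg = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
```

Since the loading vector has unit length, the squared cosines sum to 1, and each entry of `p1` is exactly the cosine of the angle to the corresponding axis.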


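Principal component analysis (PCA)
• PCA compresses the X data block into a number, A, of orthogonal components
• Variation seen in the score vector t can be interpreted from the corresponding loading vector p

PCA model: $X = t_1 p_1^T + t_2 p_2^T + \dots + t_A p_A^T + E = T P^T + E$

[Figure: PCA decomposes X into a score matrix T and a loading matrix P^T, each with components 1…A]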
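The PCA model can be reproduced with a singular value decomposition. This is a minimal sketch assuming NumPy; the function name `pca` and the random data are illustrative, and software such as SIMCA may use a different algorithm (for example NIPALS) to reach the same decomposition.

```python
import numpy as np

def pca(X, n_components):
    """Return scores T, loadings P and residuals E of mean-centred X."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # score vectors t1..tA as columns
    P = Vt[:n_components].T                      # loading vectors p1..pA as columns
    E = Xc - T @ P.T                             # residual matrix
    return T, P, E

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 6))
T, P, E = pca(X, n_components=2)
Xc = X - X.mean(axis=0)

# Sum of rank-one terms t_a p_a^T, as written in the model
rank_one_sum = sum(np.outer(T[:, a], P[:, a]) for a in range(2))
```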


Recognition of molecular quasi-species (evolving units) in enzyme evolution by PCA

Emrén, L., Kurtovic, S., Runarsdottir, A., Larsson, A-K., & Mannervik, B. (2006) Proc Natl Acad Sci USA, 103, 10866-10870
Kurtovic, S., & Mannervik, B. (2009) Biochemistry, 48, 9330-9339


Orthogonal partial least squares to latent structure – Discriminant analysis (OPLS-DA)


Orthogonal partial least squares to latent structure – Discriminant analysis (OPLS-DA)

[Figure: OPLS relating the data matrix X to Y, with observations labelled Class 1 and Class 2]


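OPLS with single Y / modelling and prediction

[Figure: X is split into a ’Y-predictive’ part (score t1, loading p1^T) and a ’Y-orthogonal’ part (scores T_O, loadings P_O^T); y is linked to t1 through u1 and q1^T]

OPLS model:
$X = t_1 p_1^T + T_O P_O^T + E$
$Y = t_1 q_1^T + F$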
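The split of X into a Y-predictive and a Y-orthogonal part can be sketched for a single y and a single orthogonal component. This is a simplified illustration, assuming NumPy and already centred data; the function name and test data are hypothetical, and production software handles several orthogonal components, scaling and cross-validation that are left out here. For OPLS-DA, y would be a class indicator (for example 0 and 1) before centring.

```python
import numpy as np

def opls_single_y(X, y):
    """One predictive and one Y-orthogonal component; X and y already centred."""
    w = X.T @ y
    w /= np.linalg.norm(w)              # predictive weight vector
    t = X @ w
    p = X.T @ t / (t @ t)
    w_o = p - (w @ p) * w               # part of p that is not aligned with w
    w_o /= np.linalg.norm(w_o)
    t_o = X @ w_o                       # Y-orthogonal scores
    p_o = X.T @ t_o / (t_o @ t_o)       # Y-orthogonal loadings
    X_f = X - np.outer(t_o, p_o)        # X with the orthogonal variation removed
    t1 = X_f @ w                        # Y-predictive scores
    p1 = X_f.T @ t1 / (t1 @ t1)         # Y-predictive loadings
    q1 = y @ t1 / (t1 @ t1)             # y loading
    E = X_f - np.outer(t1, p1)          # residuals of X
    return t1, p1, q1, t_o, p_o, E

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 8))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=20)
X = X - X.mean(axis=0)
y = y - y.mean()
t1, p1, q1, t_o, p_o, E = opls_single_y(X, y)
F = y - t1 * q1                         # residuals of y
```

By construction the orthogonal scores `t_o` carry variation in X that has no linear relation to y, which is why they can be set aside when interpreting the predictive component.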


Data Preprocessing – Scaling
• PCA and other methods are scale dependent
  – Is the size of a variable important?
• Scaling weight is 1/SD for each variable, i.e. divide each variable by its standard deviation
  – Unit Variance Scaling
• Variance of scaled variables = 1
• Many other kinds of scaling exist

[Figure: UV scaling; each column of X is given a weight ws = 1/SD]
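Unit variance scaling is a one-line operation per variable. A minimal sketch, assuming NumPy; the function name and example numbers are illustrative, and mean centring is included here as an assumption because the data in this course are both centred and UV scaled before analysis. A variable with zero variance would need to be removed first.

```python
import numpy as np

def uv_scale(X):
    """Mean-centre each column and divide it by its standard deviation."""
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)       # sample standard deviation per variable
    return (X - mean) / sd

# Two variables on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0],
              [4.0, 6000.0]])
Xws = uv_scale(X)
```

After scaling, both columns have variance 1, so the second variable no longer dominates the model simply because its numbers are larger.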


Cross-Validation
• Data are divided into G groups (default in SIMCA-P is 7) and a model is generated for the data devoid of one group
• The deleted group is predicted by the model ⇒ partial PRESS (Predictive Residual Sum of Squares)
• This is repeated G times and then all partial PRESS values are summed to form overall PRESS
• If a new component enhances the predictive power compared with the previous PRESS value then the new component is retained
• PCA cross-validation is done in two phases and several deletion rounds:
  – first removal of observations (rows)
  – then removal of variables (columns)
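The G-group procedure can be sketched for a model that predicts a response y. The example below uses principal component regression as a simple stand-in; it is not the algorithm SIMCA uses, the interleaved assignment of rows to groups is an assumption, and the two-phase row and column deletion used for PCA itself is not shown. NumPy is assumed, and Q2 is computed against the sum of squares of y because the model predicts y.

```python
import numpy as np

def press_pcr(X, y, n_components, n_groups=7):
    """PRESS of principal component regression under G-group cross-validation."""
    n = X.shape[0]
    press = 0.0
    for g in range(n_groups):
        test = np.arange(n) % n_groups == g        # every 7th row forms one group
        Xtr, ytr = X[~test], y[~test]
        x_mean, y_mean = Xtr.mean(axis=0), ytr.mean()
        _, _, Vt = np.linalg.svd(Xtr - x_mean, full_matrices=False)
        P = Vt[:n_components].T                    # loadings from the kept rows only
        T = (Xtr - x_mean) @ P
        b, *_ = np.linalg.lstsq(T, ytr - y_mean, rcond=None)
        y_hat = ((X[test] - x_mean) @ P) @ b + y_mean
        press += np.sum((y[test] - y_hat) ** 2)    # partial PRESS of this group
    return press

rng = np.random.default_rng(5)
n_obs = 28
z = rng.normal(size=(n_obs, 2)) * np.array([3.0, 1.0])    # two hidden trends
X = z @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(n_obs, 10))
y = z[:, 1] + 0.1 * rng.normal(size=n_obs)                # y follows the weaker trend

press_by_a = {a: press_pcr(X, y, a) for a in (1, 2, 3)}
ss_y = np.sum((y - y.mean()) ** 2)
q2_by_a = {a: 1.0 - p / ss_y for a, p in press_by_a.items()}
print(q2_by_a)
```

In this simulated case the second component lowers PRESS and would be retained; whether a third is kept depends on whether PRESS keeps falling.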


Model Diagnostics
• Fit or R2
  – Residuals of matrix E pooled column-wise
  – Explained variation
  – For whole model or individual variables
  – $\mathrm{RSS} = \sum (\text{observed} - \text{fitted})^2$
  – $R^2 = 1 - \mathrm{RSS}/\mathrm{SSX}$
• Predictive Ability or Q2
  – Leave out 1/7th of the data in turn
  – ‘Cross Validation’
  – Predict each missing block of data in turn
  – Sum the results
  – $\mathrm{PRESS} = \sum (\text{observed} - \text{predicted})^2$
  – $Q^2 = 1 - \mathrm{PRESS}/\mathrm{SSX}$

[Figure labels: Fit, Prediction]

Stop when Q2 starts to drop
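The fit statistic can be computed directly from the definitions above. A minimal sketch, assuming NumPy and random data; the helper `q2` only restates the formula and takes a PRESS value obtained from a separate cross-validation.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(15, 5))
Xc = X - X.mean(axis=0)
ssx = np.sum(Xc ** 2)                        # total sum of squares of centred X

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
r2_cum = []
for a in range(1, len(s) + 1):
    fitted = (U[:, :a] * s[:a]) @ Vt[:a]     # T P^T using the first a components
    rss = np.sum((Xc - fitted) ** 2)         # RSS = sum of squared residuals in E
    r2_cum.append(1.0 - rss / ssx)           # R2 = 1 - RSS/SSX

def q2(press, ss_total):
    """Q2 = 1 - PRESS/SS, with PRESS taken from cross-validation."""
    return 1.0 - press / ss_total
```

R2 can only increase as components are added and reaches 1 when all are included, which is why Q2, not R2, is used to decide when to stop.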


ALTERNATIVE SPLICING IN THORACIC AORTIC ANEURYSM

Kurtovic, Paloschi, Folkersen, Gottfries, Franco-Cereceda, Eriksson (2011) Molecular Medicine, 17; 665-675


Thoracic aortic aneurysm (TAA)

• Monogenic
  – Marfan syndrome
  – Loeys-Dietz syndrome

• Aneurysm associated with bicuspid aortic valve (BAV)

• Idiopathic thoracic aortic aneurysm


Outline of the study
• Biopsies are collected from both non-dilated and dilated aorta during valve replacement surgery and reconstruction of the dilated aorta, respectively
• Media from ascending aorta
• RNA
  – Affymetrix Human Exon 1.0 ST microarrays (in this study 81 patients)
  – RNAseq (30 patients)
• Protein
  – HiRiEF iTRAQ LC-MS/MS
  – 2D gel electrophoresis followed by iTRAQ LC-MS/MS

[Figure: non-dilated and dilated aorta]


Aim of the study
• Alternative splicing in transforming growth factor-β (TGFβ) signaling pathway
• TGFβ pathway is known to be important in aortic aneurysm
• Are there any alternatively spliced genes in the TGFβ pathway?
• Is alternative splicing an important mechanism in thoracic aortic aneurysm (TAA)?
• How do we analyze alternative splicing?


Affymetrix exon array design

PSR – probe selection region

[Figure: gene structure with exons and introns labelled]


Preprocessing of data
• Probe set core level
• Unique hybridization target
• Robust multichip average (RMA) normalized
• Splice Index calculated (in case of exon level analysis)

  $n_{i,j,k} = \dfrac{e_{i,j,k}}{g_{j,k}}$

  i = exon, j = sample, k = gene, e = exon signal, g = gene signal
• Unit variance scaled and mean centered data prior to MVA
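The Splice Index normalizes each exon signal by the signal of its gene in the same sample. A minimal sketch for one gene, assuming NumPy and hypothetical linear-scale signals; if the signals are on a log2 scale, as RMA output usually is, the ratio would correspond to a difference instead.

```python
import numpy as np

def splice_index(exon_signal, gene_signal):
    """n[i, j] = e[i, j] / g[j] for the exons i of one gene across samples j."""
    exon_signal = np.asarray(exon_signal, dtype=float)
    gene_signal = np.asarray(gene_signal, dtype=float)
    return exon_signal / gene_signal     # gene signal is broadcast across exons

# Hypothetical signals: 3 exons of one gene measured in 2 samples
e = np.array([[100.0, 200.0],
              [ 50.0, 200.0],
              [100.0,  50.0]])
g = np.array([100.0, 200.0])
n = splice_index(e, g)
```

An index well below 1 in only some samples, such as the third exon in the second sample here, is the kind of pattern that suggests an exon is being skipped rather than the whole gene being down-regulated.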


Alternative splicing pattern in the TGFβ pathway is different between dilated and non-dilated aorta
• TAV (tricuspid aortic valve) and BAV together
• 81 patients included
• 614 exons included
• Good model
• Good separation between the two groups

[Figure panels: Non-supervised PCA; Supervised OPLS-DA]


Alternative splicing pattern in the TGFβ pathway is different between dilated and non-dilated aorta
• Only TAV patients
• 29 patients included
• 614 exons included
• Good model
• Good separation between the two groups

[Figure panels: Non-supervised PCA; Supervised OPLS-DA]


Alternative splicing pattern in the TGFβ pathway is different between dilated and non-dilated aorta
• Only BAV patients
• 52 patients included
• 614 exons included
• Good model
• Good separation between the two groups

[Figure panels: Non-supervised PCA; Supervised OPLS-DA]


Alternatively spliced exons are present in both TAV and BAV groups of patients


Alternative splicing analysis of all exons in the human genome reveals the importance of TGFβ pathway exons


Gene expression patterns of differentially spliced genes


Summary
• TGFβ pathway exons clearly important according to an overall exon level analysis
• Dilated and non-dilated aortas show different alternative splicing patterns in the TGFβ pathway, in both TAV and BAV patients
• Exons responsible for the diverging alternative splicing fingerprints in TGFβ pathway identified
• Implies that dilatation in TAV has different underlying molecular mechanisms compared to BAV patients
• New methods for analyzing array data


Today during the exercise
• PCA and OPLS-DA
• Thoracic aortic aneurysm data set
• Exon level expression Affymetrix arrays
• Compare two different phenotypes and subphenotypes