multivariatedata analysisin omics research · multivariatedata analysisin omics research...
TRANSCRIPT
Multivariate DataAnalysis inOmicsResearch
Diverging AlternativeSplicing FingerprintsIdentified inThoracic Aortic Aneurysm
SanelaKjellqvist,PhDWABIRNAseq course
2017-11-08
Outline• Why multivariate dataanalysis?• Multivariate statistics
– Differentanalyses– Datapreprocessing
• Alternativesplicing inthoracic aortic aneurysm– Thoracic aortic aneurysm– Study setup– Aim of thestudy– Results– Summary
• Today’s exercise
WHYMULTIVARIATE DATAANALYSIS?
DevelopmentofClassicalStatistics–1930s
• Multipleregression• Canonicalcorrelation• Lineardiscriminantanalysis• Analysisofvariance
Assumptions:
• IndependentXvariables
• Manymoreobservationsthanvariables
• RegressionanalysisoneYatatime
• Nomissingdata
N
K
Tablesarelongandlean
Today’sdata• RNASeq,Array,LC-MS/MS,GC/MSor
NMRdata
• Problems– Manyvariables– Fewobservations– Noisydata– Missingdata– Multipleresponses
• Implications– Highdegreeofcorrelation– Difficulttoanalysewith
conventionalmethods
• Data¹ Information– Needwaystoextractinformation
fromthedata– Needreliable,predictive
information– Ignorerandomvariation(noise)
N
K
PoorMethodsofDataAnalysis
X1 Y1 Y2Y3X2 X3
• Plotpairsofvariables– Tedious,impractical– Riskofspuriouscorrelations– Riskofmissinginformation
• SelectafewvariablesanduseMLR– Throwingawayinformation– Assumesno‘noise’inX– OneYatatime
ABetterWay...• MultivariateanalysisbyProjection
– LooksatALLthevariablestogether– Avoidslossofinformation– Findsunderlyingtrends=“latentvariables”– Morestablemodels
FundamentalDataAnalysisObjectives
Overview Discrimination Regression
TrendsOutliersQuality ControlBiological DiversityPatient Monitoring
Discriminating between groupsBiomarker candidatesComparing studies or instrumentation
Comparing blocks of omics dataMetab vs Proteomic vsGenomicOmic vs medicalPrediction
MULTIVARIATE STATISTICS
Differentmethods• Principalcomponentanalysis(PCA)• Partialleast squares tolatentstructuresanalysis(PLS)• Orthogonalpartialleastsquarestolatentstructuresanalysis(OPLS)
• PLS-DA• OPLS-DA• K-meansclustering• Hierarchical clustering• Biplotanalysis• Canonicalcorrelationanalysis
What isaprojection?
Principalcomponentanalysis(PCA) • Algebraically
–Summarizestheinformationintheobservationsasafewnew(latent)variables
• Geometrically– TheswarmofpointsinaKdimensionalspace(K=numberofvariables)isapproximatedbya(hyper)planeandthepointsareprojectedonthatplane.
PCA- GeometricInterpretation
x2
x3
x1
t1
Fit first principal component (line describing maximum variation)
t2
Add second component (accounts for next largest amount of variation) and is at right angles to first - orthogonal
Each component goes through origin
12
PCA- GeometricInterpretation
x2
x3
x1
X
Points are projected down onto a plane with co-ordinates t1, t2
Comp 1
t1
Comp 2
t2
K
N
“Distance to Model”
13
Loadings
x2
x3
x1
How do the principal components relate to the original variables?
Look at the angles betweenPCs and variable axes
t1 t2
XK
N
Comp 1
α2
α3
α1
14
Loadings
x2
x3
x1
Take cos(α) for each axis
Loadings vector p’ - one for each principal component
One value per variable
Comp 1
t1 t2
p’1
α2
α3
α1
cos(α1)
cos(α2)
cos(α3)
XK
N
15
Principalcomponentanalysis(PCA)• PCAcompresstheX datablockintoA numberoforthogonal
components• Variationseeninthescorevectort canbeinterpretedfrom
thecorrespondingloadingvectorp
PT
X1…A
1…A
T
X = t1p1T+ t2p2
T +…+tApAT +E = TPT + EPCAModel
PCA
Recognition of molecular quasi-species(evolving units)inenzyme evolutionbyPCA
Emrén,L.,Kurtovic,S.,Runarsdottir,A.,Larsson,A-K.,&Mannervik,B.(2006)ProcNatlAcadSciUSA,103,10866-10870Kurtovic,S,&MannervikB(2009)Biochemistry,48,9330-9339
Orthogonal partialleast squares to latentstructure –Discriminant analysis (OPLS-DA)
Orthogonal partialleast squares to latentstructure –Discriminant analysis (OPLS-DA)
X OPLSY
Class 1
Class 2
OPLSwithsingleY/modellingandprediction
p1T
XTO
POT
y
’Y-predictive’’Y-orthogonal’
1 11…
1… 1 1q1T
t1 u1OPLS
X = t1p1T + TOPO
T + EOPLSModel Y = t1qT
1 + F
DataPreprocessing – Scaling• PCAandothermethodsarescaledependent
– Isthesizeofavariableimportant?
• Scalingweightis1/SDforeachvariablei.e.divideeachvariablebyitsstandarddeviation– UnitVarianceScaling
• Varianceofscaledvariables=1• Manyotherkindsofscalingexist
Xws
1/SD
UV scaling
Cross-Validation• DataaredividedintoGgroups(defaultin
SIMCA-Pis7)andamodelisgeneratedforthedatadevoidofonegroup
• ThedeletedgroupispredictedbythemodelÞpartialPRESS(PredictiveResidualSumofSquares)
• ThisisrepeatedGtimesandthenallpartialPRESSvaluesaresummedtoformoverallPRESS
• Ifanewcomponentenhancesthepredictivepowercomparedwith thepreviousPRESSvaluethen thenewcomponentisretained
• PCA cross-validation is done in two phases and several deletion rounds: – first removal of
observations (rows)– then removal of variables
(columns)
22
ModelDiagnostics• FitorR2
– ResidualsofmatrixEpooledcolumn-wise– Explainedvariation– Forwholemodelorindividualvariables
– RSS=Σ (observed- fitted)2
– R2 =1- RSS/SSX
• PredictiveAbilityorQ2
– Leaveout1/7th datainturn– ‘CrossValidation’– Predicteachmissingblockofdatainturn– Sumtheresults
– PRESS=Σ (observed- predicted)2
– Q2 =1– PRESS/SSX
Fit
PredictionStopwhenQ2 startstodrop
23
ALTERNATIVESPLICING INTHORACIC AORTIC ANEURYSM
Kurtovic,Paloschi,Folkersen,Gottfries,Franco-Cereceda,Eriksson(2011)MolecularMedicine,17;665-675
Thoracic aortic aneurysm (TAA)
• Monogenic– Marfan syndrome– Loeys Dietz
• Aneurysm associated with bicuspid aortic valve (BAV)
• Idiopathic thoracic aortic aneurysm
Outline of thestudy
• Biopsiesarecollectedfrombothnon-dilatedanddilatedaortaduringvalvereplacementsurgeryandreconstructionofthedilatedaortarespectively
• Mediafromascending aorta• RNA
– Affymetrix humanexon 1.0STmicroarrays (inthis study 81patients)
– RNAseq (30patients)
• Protein– HiRiEF iTRAQ LC-MS/MS– 2Dgelelectrophoresis followed by
iTRAQ LC-MS/MS
Non-dilated Dilated
Aim of thestudy
• Alternativesplicing intransforminggrowthfactor-β(TGFβ)signalingpathway
• TGFβ pathway isknown to beimportant inaorticaneurysm
• Are there any alternatively spliced genesintheTGFβpathway?
• Isalternativesplicing animportant mechanism inthoracic aortic aneurysm (TAA)?
• How dowe analyze alternativesplicing?
Affymetrix exon array design
PSR– probeselectionregion
Exons
Introns
Preprocessing of data• Probe setcore level• Unique hybridization target• Robustmultichipaverage (RMA)normalized• Splice Indexcalculated (incase of exon level analysis)
i=exonj=samplek=genee=exon signalg=genesignal
• Unit variance scaled andmean centered datapriorto MVA
𝑛𝑖,𝑗 ,𝑘 =𝑒𝑖,𝑗 ,𝑘𝑔𝑗 ,𝑘
AlternativesplicingpatternintheTGFβpathwayisdifferentbetweendilatedandnon-dilatedaorta
• TAVandBAVtogether• 81patientsincluded• 614exons included• Good model• Good separationbetween thetwo groups
Non-supervisedPCA SupervisedOPLS-DA
AlternativesplicingpatternintheTGFβpathwayisdifferentbetweendilatedandnon-dilatedaorta
• Only TAVpatients• 29patientsincluded• 614exons included• Good model• Good separationbetween thetwo groups
Non-supervisedPCA SupervisedOPLS-DA
AlternativesplicingpatternintheTGFβpathwayisdifferentbetweendilatedandnon-dilatedaorta
Non-supervisedPCA SupervisedOPLS-DA
• OnlyBAVpatients• 52patientsincluded• 614exonsincluded• Goodmodel• Goodseparationbetweenthetwogroups
AlternativelysplicedexonsarepresentinbothTAVandBAVgroupsofpatients
AlternativesplicinganalysisofallexonsinthehumangenomerevealstheimportanceofTGFβpathwayexons
Geneexpressionpatternsofdifferentiallysplicedgenes
Summary• TGFβ pathway exons clearly important according to anoverallexon
level analysis• Dilated andnon-dilated aortasshowdifferentalternativesplicing
patterns indilated andnon-dilated tissues with respect to TAVandBAVinTGFβ pathway
• Exons responsible forthediverging alternativesplicing fingerprints inTGFβ pathway identified
• Implies that dilatationinTAVhasdifferentunderlying molecularmechanisms compared to BAVpatients
• Newmethods foranalyzing array data
Todayduringtheexercise
• PCAandOPLS-DA• Thoracicaorticaneurysmdataset• ExonlevelexpressionAffymetrix arrays• Comparetwodifferentphenotypesandsubphenotypes