fehrman nat gen 2014 - journal club
Fehrmann et al., Nat Gen 2014. Gene expression analysis identifies global gene dosage sensitivity in cancer
Giovanni JC 30 March 2015
What is a PCA?
• PCA is a technique to reduce a dataset with 3+ variables to two or a few dimensions
• Examples:
– a dataset of individuals' age, height, weight, etc.
– a dataset of gene expression
[Figure: plots of a dataset with axes height, age and weight]
What is a PCA projection?
• In a PCA we rotate the 3+ dimensional data cloud, trying to find the best “projection” for observing separation between data points
• Implementation:
– Find a line (PC1) along which the data points are most spread out, i.e. the direction explaining most of the variance
– Find a second line (PC2) orthogonal to the first, to explain most of the remaining variance
[Figure: data points with the PC1 and PC2 axes]
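The two steps above can be sketched with NumPy. This is a minimal illustration, not the implementation used in the paper: the toy dataset (age, height, weight) and the function name `pca` are assumptions made for the example, and the PCs are obtained from a singular value decomposition of the centered data.

```python
import numpy as np

def pca(data):
    """PCA of a samples-by-variables array via SVD of the centered data."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions (PC1, PC2, ...), mutually orthogonal.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Fraction of the total variance explained by each PC.
    explained = s**2 / np.sum(s**2)
    # Coordinates of every sample on the new axes.
    scores = centered @ vt.T
    return vt, explained, scores

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 individuals, weight partly driven by height.
age = rng.normal(40, 12, 200)
height = rng.normal(170, 10, 200)
weight = 70 + 0.9 * (height - 170) + rng.normal(0, 5, 200)
data = np.column_stack([age, height, weight])

components, explained, scores = pca(data)
print("variance explained by each PC:", np.round(explained, 3))
print("PC1 and PC2 are orthogonal:", abs(components[0] @ components[1]) < 1e-9)
```

Plotting the first two columns of `scores` gives the 2D projection discussed in this slide.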
Variance explained by each PC
PC coefficients
• The PCA will produce a new set of data “axes”, called Principal Components (PC)
• Each PC is a combination of the original variables, each multiplied by a coefficient
[Figure: expression of genes 1 to 5, each multiplied by an eigenvector coefficient (* 5.4, * 3.2, * -0.4, * 0.0, * -0.2); columns PC1 to PC4]
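As a small worked example of "a combination of the original variables, multiplied by a coefficient": the coefficients below are the ones shown in the slide (the slide does not say which PC they belong to), while the expression values of the sample are hypothetical.

```python
import numpy as np

# Eigenvector coefficients from the slide, one per gene (genes 1 to 5).
coefficients = np.array([5.4, 3.2, -0.4, 0.0, -0.2])

# Hypothetical (centered) expression values of the five genes in one sample.
expression = np.array([1.0, 0.5, 2.0, 3.0, -1.0])

# The score of the sample on the PC is the coefficient-weighted sum
# of its expression values.
pc_score = float(coefficients @ expression)
print("PC score of the sample:", round(pc_score, 2))
```

Gene 4 has coefficient 0.0, so its expression does not influence the score, whatever its value.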
Interpreting each PC
● Depending on which variables contribute to a PC, we can give a biological interpretation
– If weight and height contribute to PC1 while age does not, then PC1 describes the “size” of the individual
● In gene expression, PCs can represent a set of genes expressed in the same transcription profile
– Thus we rename PCs as Transcriptional Components (TCs)
Gene Expression Dataset
• Expression data from Gene Expression Omnibus (Affymetrix, 4 datasets)
• Quality Control:
– a PCA is applied to each dataset, obtaining a PC explaining 80-90% of data variance
– This PC can be interpreted as probe- or platform-specific variance, independent of the sample
– All the samples that have a correlation <0.75 with this PC are removed, as they are considered low quality
• Final dataset:
– Human small: 17,309 samples
– Human large: 37,427
– Mouse: 17,081
– Rat: 6,023
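The quality control step can be sketched as follows. This is only an illustration under assumptions: the function name `qc_filter`, the toy data and the way the dominant PC is computed (eigendecomposition of the sample-by-sample correlation matrix) are choices made for this example; only the 0.75 threshold comes from the slide.

```python
import numpy as np

def qc_filter(expr, min_corr=0.75):
    """expr: probes-by-samples array. Returns (keep_mask, correlations, explained)."""
    # Sample-by-sample correlation matrix and its eigendecomposition.
    corr = np.corrcoef(expr, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    first = eigvecs[:, -1]                   # eigenvector of the dominant PC
    explained = eigvals[-1] / eigvals.sum()
    # Profile of the dominant PC across probes.
    z = (expr - expr.mean(axis=0)) / expr.std(axis=0)
    pc_profile = z @ first
    # Correlation of every sample with that profile.
    r = np.array([np.corrcoef(expr[:, j], pc_profile)[0, 1]
                  for j in range(expr.shape[1])])
    # The sign of an eigenvector is arbitrary: orient it like most samples.
    if np.median(r) < 0:
        r = -r
    return r >= min_corr, r, explained

rng = np.random.default_rng(3)
n_probes = 500
# Hypothetical data: 20 good samples share a probe-specific baseline,
# 2 low-quality samples are unrelated noise.
probe_baseline = rng.normal(8.0, 2.0, n_probes)
good = probe_baseline[:, None] + rng.normal(0.0, 0.5, (n_probes, 20))
bad = rng.normal(8.0, 2.0, (n_probes, 2))
expr = np.hstack([good, bad])

keep, r, explained = qc_filter(expr)
print("samples kept:", int(keep.sum()), "of", expr.shape[1])
```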
Copy number data and sample annotation
• 470 tumor samples with array CGH data (Agilent), analyzed with DNACopy
– 51 ERBB2-amplified breast cancer, 173 inflammatory breast cancer, 246 multiple myeloma
• Sample annotation: text-mining to determine cancer/cell line/normal samples
Number of probes and genes
Datasets for Gene Set Enrichment Analysis
PCA implementation
• Each of the 4 datasets was analyzed separately
• PCA done on the n-by-n correlation matrix, instead of the covariance matrix
– Reduces noise produced by samples with high variance
• Goal of the PCA is to identify Transcriptional Components, i.e. sets of genes expressed in the same conditions
Parameters of the PCA
• TC size: order of the component
– How much of the gene expression variance is represented by the TC
• TC setting: score of the component in a given sample
– How much the expression profile represented by a TC is active in the sample
• TC wiring: PC coefficient
– For every gene and for every expression profile (TC), how much the expression is supposed to change
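The three parameters can be read off a PCA on the correlation matrix, as in this sketch. The function name `transcriptional_components` and the toy data are assumptions for illustration; "size" is taken here as the fraction of variance attached to each component, which also determines its order.

```python
import numpy as np

def transcriptional_components(expr):
    """expr: samples-by-genes array. Returns (size, setting, wiring)."""
    # Standardizing every gene makes this a PCA on the correlation matrix.
    z = (expr - expr.mean(axis=0)) / expr.std(axis=0)
    corr = (z.T @ z) / z.shape[0]            # genes-by-genes correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]        # largest component first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    size = eigvals / eigvals.sum()           # variance represented by each TC
    wiring = eigvecs                         # genes x TCs: coefficient per gene
    setting = z @ eigvecs                    # samples x TCs: score per sample
    return size, setting, wiring

rng = np.random.default_rng(4)
# Hypothetical data: 50 samples, 6 genes, the first three driven by one factor.
shared = rng.normal(size=(50, 1))
loadings = np.array([[1.0, 0.8, 0.6, 0.0, 0.0, 0.0]])
expr = shared @ loadings + rng.normal(scale=0.5, size=(50, 6))

size, setting, wiring = transcriptional_components(expr)
print("TC size:", np.round(size, 3))
print("TC1 wiring:", np.round(wiring[:, 0], 2))
```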
How many Transcriptional Components are there?
● About 300 in Human small, 600 in Human large, …
● 2,206 TCs across all datasets
● The robustly estimated TCs (Cronbach's alpha > 0.7) captured 79-90% of the variance
Do the TCs have biological meaning?
● All the TCs had at least one gene set enriched (GSEA), meaning that they represent biological phenomena
Are the TCs different across the four datasets?
● Human small is very similar to Human large
● Mouse is similar to Rat
● Overall the most robust components are similar in all four datasets
TC3 represents genes expressed in the brain
A TC-based gene network
● Constructed a gene regulation network with 19,997 genes
– Two genes are connected if they are in the same TC (co-expressed)
● This network can be used to predict gene function using “guilt-by-association”
– A gene involved in a TC where 100 other genes are associated with apoptosis is probably also associated with it
Guilt by association
● Used the 2,206 TCs from the 4 datasets
● Calculated a GSEA Z-score in each TC for each gene set
● A gene with unknown function is associated with a gene set if, across the TCs, the gene's eigenvector coefficients are correlated with the GSEA Z-scores of that gene set
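The guilt-by-association idea can be sketched as below. Everything here is hypothetical toy data, and the per-TC gene set score is a simplified stand-in (scaled mean coefficient of the member genes), not a real GSEA Z-score.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tcs, n_genes = 200, 30

# Hypothetical eigenvector coefficients: rows are TCs, columns are genes.
wiring = rng.normal(size=(n_tcs, n_genes))
# Genes 0 to 10 follow a common (hypothetical) pathway signal across the TCs.
pathway = rng.normal(size=n_tcs)
wiring[:, :11] += pathway[:, None]

# Annotated gene set: genes 0 to 9. Gene 10 has "unknown function".
members = np.arange(10)
# Stand-in for the GSEA Z-score of the gene set in each TC (assumption).
set_score = wiring[:, members].mean(axis=1) * np.sqrt(len(members))

def association(gene):
    """Correlation, across TCs, of a gene's coefficients with the set scores."""
    return float(np.corrcoef(wiring[:, gene], set_score)[0, 1])

print("gene 10 (follows the pathway):", round(association(10), 2))
print("gene 20 (unrelated):", round(association(20), 2))
```

A gene that is not a member of the set but follows the same pattern across the TCs gets a high correlation, which is the basis for predicting its function.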
Genes with similar function to BRCA1 and BRCA2
FEN1 is co-expressed with BRCA1 and BRCA2
The role of FEN1 in homologous recombination was not confirmed in mammals
Involvement of FEN1 in Homologous Recombination
1b: siRNA silencing of FEN1
2c, top: if homologous recombination occurs, GFP is expressed
FEN1 inhibition reduces homologous recombination
2d: chemical inhibition of FEN1 with MTT
2e: decrease of HR after inhibition of FEN1
Inhibition of FEN1 and PARP1 increases DNA breaks
2f: PARP1 inhibition
2g: higher number of DNA breaks when both PARP1 and FEN1 are inhibited
2h: higher sensitivity to PARP1 inhibition
Identification of unstable samples
● A subset of human samples showed enrichment for genes mapping to the same chromosome band
● This is the effect of large SCNAs in cancer tissue or cancer cell lines
Autocorrelation between TC and chromosome position
Autocorrelation: the eigenvector coefficient of a gene is correlated with those of its neighbors on the chromosome
i.e. the expression of a gene is correlated with that of its neighbors
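A minimal sketch of the autocorrelation measure, assuming the genes are already sorted by chromosome position. The lag-1 Pearson correlation and the toy profiles are choices made for this example; the exact statistic used in the paper may differ.

```python
import numpy as np

def lag1_autocorrelation(values):
    """Correlation between each gene's value and that of the next gene."""
    values = np.asarray(values, dtype=float)
    return float(np.corrcoef(values[:-1], values[1:])[0, 1])

rng = np.random.default_rng(2)
n_genes = 400

# Hypothetical component with no positional structure.
random_tc = rng.normal(size=n_genes)

# Hypothetical component where genes 100 to 199 (one contiguous region)
# are shifted together, as expected for a copy number gain.
scna_tc = rng.normal(scale=0.5, size=n_genes)
scna_tc[100:200] += 2.0

print("no positional structure:", round(lag1_autocorrelation(random_tc), 2))
print("contiguous altered region:", round(lag1_autocorrelation(scna_tc), 2))
```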
Identification of SCNAs from expression profiles
● Used 18,713 samples with no SCNAs to determine 718 non-genetic TCs, which were then applied to the other 18,714 samples
● SCNA levels were correlated with the residual expression (not explained by the TCs), which accounts for 28% of the variation
● This 28% of variation is called the Functional Genomic mRNA profile (FGM) and represents variation in gene expression that diverges from the physiological status
Identification of potential SCNA events from expression profiles
Functional Genomic mRNA profile
● FGM: Functional Genomic mRNA profile
– The portion of expression that cannot be explained by the 718 physiological non-genetic PCs
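The "portion of expression that cannot be explained" can be pictured as a residual after projecting the expression onto the non-genetic components. The sketch below assumes centered expression and components with orthonormal columns; the function name `fgm_profile`, the 3 toy components (instead of 718) and the toy data are hypothetical.

```python
import numpy as np

def fgm_profile(expr, non_genetic_tcs):
    """expr: samples-by-genes; non_genetic_tcs: genes-by-k, orthonormal columns."""
    scores = expr @ non_genetic_tcs          # activity of each TC in each sample
    explained = scores @ non_genetic_tcs.T   # expression explained by the TCs
    return expr - explained                  # residual: the FGM profile

rng = np.random.default_rng(5)
n_genes, k = 50, 3
# Hypothetical non-genetic TCs (orthonormal columns obtained from a QR step).
tcs, _ = np.linalg.qr(rng.normal(size=(n_genes, k)))

# Four samples whose expression is fully explained by the TCs...
physiological = rng.normal(size=(4, k)) @ tcs.T
# ...except sample 0, which has a gained region covering genes 10 to 19.
scna_effect = np.zeros((4, n_genes))
scna_effect[0, 10:20] = 1.0
expr = physiological + scna_effect

fgm = fgm_profile(expr, tcs)
print("mean FGM in the gained region:", round(float(fgm[0, 10:20].mean()), 2))
print("mean |FGM| in unaffected samples:", round(float(np.abs(fgm[1:]).mean()), 2))
```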
● 20 trisomy samples clearly showed higher FGM expression
● In 470 cancer samples, FGM levels correlated with SCNA levels (aCGH), explaining 86% of variance
Most genes are dosage-sensitive to chromosome arm duplications
● They did another PCA on the FGM profile data, for every chromosome arm
– Describing if there are changes in the expression of all the genes in a chromosome arm, not due to physiological constraints (718 TCs)
● The PC1, representing the most prominent FGM pattern, described a complete duplication or deletion of the arm
● 91% of the probes were dosage-sensitive to the complete duplication/deletion of a chromosome arm
More on dosage sensitivity
● Fig 4b: highly expressed genes are more dosage-sensitive
● Fig 4c: similar patterns are observed with an eQTL meta-analysis
FGM profiling of 16,172 tumor samples
● Data preparation:
– Excluded cell lines (text mining + similar TC profile)
– Excluded genetically identical samples and related individuals (based on similarity of eQTL expression) (234 mix-ups)
– Kept only samples with high genomic instability (high autocorrelation), i.e. potential cancer samples
Hierarchical clustering of FGM
Most cancer types show samples with similarly altered expression
Some cancers have similar alteration patterns
Amplifications and deletions in the regions involved in the FGM profiles
● Used DNACopy to determine whether the regions in FGM profiles in cancer are amplified or deleted, based on change of expression patterns (no aCGH data)
Distribution of genomic instability
● Genomic instability: autocorrelation between the expression of a gene and that of its neighbors
– i.e. the tendency of a sample to have a high number of regions with altered expression, likely to be amplified/deleted
Higher genomic instability corresponds to lower survival and higher grade
Distribution of genomic instability across genome and genes
Samples where CDKN2A and ERBB2 have altered expression
Summary
● Used PCA to obtain 2,206 expression components
● Of these, 718 represent physiological non-genetic expression profiles
● The expression not explained by these 718 TCs (FGM profile) can be explained by SCNA alterations
● Most genes are dosage-sensitive, at least for arm-level alterations