coding’and’noncoding’sequence’...
TRANSCRIPT
Coding and noncoding sequence variants across cancer types
Ekta Khurana, Ph.D. Assistant Professor
Weill Cornell Medical College New York, NY
Email: [email protected]
@ekta_khurana
1
Outline
• Coding & Noncoding sequence variants in cancer • How to idenMfy drivers: coding variants • Complex modes of acMon of noncoding variants • ComputaMonal approach to idenMfy noncoding drivers: FunSeq
• ICGC/TCGA consorMum for ‘Pan Cancer Analysis of Whole Genomes’
• Precision Medicine
3
First cancer WGS,
Ley, Mardis, et al., Nature, 2008
~500 cancer WGS Alexandrov, et al., Nature,
2013
> 2000 WGS from ICGC/TCGA
2014
Seq Universe
TCGA (The Cancer Genome Atlas) endpoint: ~2.5
Petabytes
SRA >1 petabyte
Star forma)on 100K Genomes England ADSP
16 TB
220 TB
46 TB
31 TB
19 TB
TCGA 910Terabytes
in CGHub
NHLBI ESP
32 TB
NHGRI LSSP
ARRA AuMsm
Clinical
miRNA
9 TB
[slide courtesy of Heidi Sofia, NHGRI] 4
Genomic variants idenMfied from sequencing
5
ATGAACTGCAATTTCCAGAAGCATGCACCCTTGGAAG - - - TCTA"
ATGAACTGCAAATTCCAGAAGCATG - - - - CTTGGAAGAGTTCTA"SNP
Human Ref.
DeleMon InserMon Small Indels < 50 bp
InserMons
DeleMons
DuplicaMons Inversion
Large structural variants
Human Ref.
An average human genome contains ~3 million inherited variants.
6
Cancer cells contain thousands
of mutaZons relaZve to the normal cells
Where is Waldo ? What are the key mutaMons with funcMonal impact ?
7
Most frequently mutated genes in different cancer types
8
Source: InternaMonal Cancer Genome ConsorMum (dcc.icgc.org)
ComputaMonal methods to predict funcMonal impact of missense mutaMons
Gnad et al, BMC Genomics, 2013
Outline
• Coding & Noncoding sequence variants in cancer • How to idenMfy drivers: coding variants • Complex modes of acZon of noncoding variants • ComputaMonal approach to idenMfy noncoding drivers: FunSeq
• ICGC/TCGA consorMum for ‘Pan Cancer Analysis of Whole Genomes’
• Precision Medicine
Noncoding elements in the genome
11 Ecker, Nature – News and Views, 2012 Enhancers,
Insulators, Silencers
Cross-‐species sequence alignments
Compile all connecMons to build regulatory and enhancer-‐promoter
networks
ConnecMng regulatory elements to target coding
genes
Open chromaMn regions (DHS peaks)/
Histone modificaMons or TF
binding (ChIP-‐Seq peaks)
TF binding moMf
ubiquitous Brain-‐sp.
Heart-‐sp.
Liver-‐sp.
Brain
Heart
Liver
TGAATA
TGATTA
TGATTA
TGATTG
TGATTG
TTATTG
CGCCAT
CGCCAT
CGCCTT
CGCTTT
CGCTTT
CGCTTT
TGAGGG
TGATTA
TGATGG
Human
Chimp
Mouse
Regulatory elements idenMfied
Target gene
Very low or no signal
from funcMonal genomics assays
from evoluMonary conservaMon only
TF TF TF TF TF
Identifying cis-regulatory
regions
Khurana et al, Nature Rev Genet, under review
13
MB: medulloblastoma DLBC: B cell lymphoma STAD: gastric BRCA: breast PAAD: pancreaMc PRAD: prostate LIHC: liver PA: pilocyMc Astrocytoma LUAD: Lung adenocarcinoma
Khurana et al, Nature Rev Genet, under review
Somatic variants from WGS: most are in noncoding
regions
14
CGGAGG
CGGAAG mRNA
Gain-of-motif TF
TF 0.0
1.0
2.0WT
Mutated
Loss-of-motif
Loss-of-motif
TATCTAT X
TF
TATTTAT TF
WT
Mutated
T0.0
1.0
2.0
CGT
ATG
TGCA5G
CTA
CTG
ATT10AG
TGA
G
ACT
CGAT
ACT15GTA
C
Altered binding effects 2.0
Promoter Gene
X
• MYB moZf created & drives TAL1 overexpression in T-‐ALL (Mansour et
al, Science, 2014) • TERT promoter
Modes of action of noncoding variants: transcription factor binding disruption
Killela et al, PNAS, 2013 Horn et al, Science, 2013 Huang et al, Science, 2013
TERT promoter is frequently mutated in many different cancer types"
Modes of action of noncoding variants: Enhancer hijacking by genomic rearrangements in medulloblastoma (Northcott et al, Nature, 2014)
16
Modes of action of noncoding variants: ncRNAs & their binding sites
17
Figure'3D'
miRNA&binding&sites& miRNAs&
5’& 3’&
Muta4ons&in&miRNA&
binding&sites&
X
Increased&gene&expression&
CDS&mRNA&
XX
X
XX
Transla4onal&repression&
Figure'3E'
5’&
5’&
3’&
PGENE&competes&for&miRNA&binding&
PGENE&dele4on&
CDS&mRNA&
PGENE&3’&
CDS&mRNA°rada4on&
OR'
Ribosome&
miR31 binding sites mutated, overexpression of androgen receptor
in prostate cancer"
PTEN pseudogene deletion leads to downregulation of PTEN "
Lin et al, Cancer Res, 2013 Poliseno et al, Nature, 2010 Khurana et al, Nature Rev Genet, under review
Outline
• Coding & Noncoding sequence variants in cancer
• How to identify drivers: coding variants • Complex modes of action of noncoding
variants • Computational approach to identify
noncoding drivers: FunSeq • ICGC/TCGA consortium for ‘Pan Cancer
Analysis of Whole Genomes’ • Precision Medicine
19
EvoluZonary conservaZon -‐ Typically defined by comparison across species
ConservaZon among humans -‐ DepleMon of common variants/Enrichment of rare variants
1
2
4
3
Estimating negative selection
Common variant
Rare variants
Enrichment of rare SNPs as a metric for negative selection
20
0.55
0.60
0.65
0.70
0.4
0.5
0.6
0.7
0.8
0.9
A
Frac
tion
of ra
re S
NPs
(non
syn)
All C
odin
g
LOF-
tol.
Rece
ssive
GW
AS
Dom
inan
t
Esse
ntia
l
Canc
er
B
Frac
tion
of ra
re S
NPs
DHS Coding
Bone
Urot
heliu
mEm
br s
tem
cel
lBr
east
Bloo
dPa
ncr d
uct
Panc
reas
Skin
Eye
Live
rM
uscle
Bloo
d ve
ssel
Hear
tLu
ngBl
astu
laM
yom
etriu
mEp
ithel
ium
Pros
tate
Brai
nEm
br lu
ngG
ingi
val
Kidn
eySp
inal
cor
dFo
resk
inCo
nnec
tive
Adip
ose
Live
rCe
rvix
Trac
hea
Sple
enPr
osta
teEs
opha
gus
Lung
Test
isPl
acen
taBl
adde
rCo
lon
Hear
tTh
yroi
dTh
ymus
Kidn
eyBr
ain
Ova
ry
(rare=derived allele freq < 0.5%)
• Depletion of common polymorphisms in regions under selection
Negative selection restricts the allele frequency of deleterious mutations.
• Results for coding genes consistent with known phenotypic impacts
• Other metrics for selection • EvoluMonary conservaMon
(e.g. GERP) • SNP density
(confounded by mutaMon rate)
LOF-‐tol (Loss-‐of-‐funcZon tolerant): least negaZve selecZon Cancer: most selecZon Khurana et al., Science, 2013
Differential selective constraints among sub-categories of noncoding elements
21 Khurana et al., Science, 2013
Negative selection and tissue-specificity of coding and noncoding regions
22
q Ubiquitously expressed genes and bound regions show stronger selection
q Differences in constraints amongst tissues q Constraints in coding genes and regulatory genes are
correlated across tissues
Specific categories with statistically significant enrichment of rare SNPs: randomization and multiple
hypothesis correction • Generate null for each category by sliding elements across the
genome 1000 times
• Compare the actual fraction of rare variants in the category to null distribution and call it under significant selection accepting a low type 1 error rate (α=0.2%)
• Total number of categories discovered falsely under selection (False discovery rate, FDR)
E(R0), Expectation of false positives amongst all categories tested = α*RA
RA = Number of categories tested R = Number of categories identified to be under selection
At α=0.2%, FDR = 1.3% • Using these criteria, 102 categories are under significantly strong
selection 23
Category 1 Category 2 Category 3 Category 1’ Category 2’ Category 3’
FDR = E(R0 )
R
Which noncoding categories are under very strong “coding-like” selection ?
q Top categories among 700 categories
q Binding peaks of some general TFs (eg FAM48A)
q Core moMfs of some TF families (eg JUN, GATA)
q DHS sites in spinal cord and connecMve Mssue
~0.4% genomic coverage (~ top 25)
~0.02% genomic coverage (top 5)
~400-‐fold
~40-‐fold
Enrichment of know disease-‐causing mutaMons from Human Gene MutaMon
database
Khurana et al., Science, 2013 24
Building human regulatory network from linear noncoding annotations
Peak Calling (ChIP-Seq) Assigning TF binding sites to targets Filtering high confidence edges
~28K proximal edges
PotenMal Distal Edge
Strong Proximal Edge
TF TF
Nodes 119 TFs and ~9000
target genes Edges
28,000 interacMons
Using correlaMon with expression data Gerstein¶…..Khurana¶…., Nature, 2012 (¶ co-first authors) Yip et al, Genome Res, 2012 Fu et al, Genome Biol, 2014 25
Distal edges for ~17K genes
26
IdenZficaZon of noncoding candidate drivers among somaZc variants: FunSeq
Khurana et al., Science, 2013
• Core score – Each feature gets a weight
FunSeq2: weighted scoring scheme
27 Fu et al., Genome Biology, 2014
• Feature weight -‐ Weighted with mutaMon paperns in natural polymorphisms
(features frequently observed weighed less) -‐ entropy based method
!! = 1+ !!!"#!!! + 1− !! !"#! 1− !! !!!
!!"#$%! = ! !!!!!!!"!!"#$%&$'!!"#$%&"'!
28
HOT region
SensiMve region
Polymorphisms
Genome
p = probability of the feature overlapping natural polymorphisms
Feature weight:
For a variant:
wdp
Identification of noncoding candidate drivers among somatic variants: Example"
Validation of a candidate driver identified in prostate cancer sample in WDR74 gene promoter!!q Sanger sequencing in 19 additional samples confirms the
recurrence "
""""""""""""""""""""
q WDR74 shows increased expression in tumor samples"
benign" PCs"
29
IdenZfied candidate drivers with high FunSeq scores in 607 whole-‐genomes
Type Sample Coding (all samples) Noncoding (all samples)
ALL 1 17/87 3/7653
AML 7 13/87 10/3325
Breast 119 1430/6495 3128/638835
CLL 28 41/338 124/51406
Liver 88 1464/6257 3188/843489
Lung_Adeno 24 2255/9479 5291/1428263
Lymphoma_B 24 241/1212 529/126186
Medulloblastoma 100 282/1461 226/123387
Pancreas 15 129/965 179/111044
PilocyMc_A 101 13/103 15/10453
Stomach 100 5739/20374 10829/1891465
30
• Can be further prioriMzed, e.g. gene expression, luciferase reporter assays, etc.
Fu et al, Genome Biology, 2014 Data from Alexandrov et al, Nature, 2013
Outline
• Coding & Noncoding sequence variants in cancer • How to idenMfy drivers: coding variants • Complex modes of acMon of noncoding variants • ComputaMonal approach to idenMfy noncoding drivers: FunSeq
• ICGC/TCGA consorZum for ‘Pan Cancer Analysis of Whole Genomes’
• Precision Medicine
ICGC/TCGA collaboraZve effort called ‘PCAWG’
• Pan Cancer Analysis of ~2500 whole genomes (tumor-‐normal pairs)
• ~1500 samples have RNA-‐Seq data • ~1400 samples have methylaMon data • ~25 cancer types • All the processed data and variant calls will be available to public as a resource once the project is completed
Working groups of PCAWG • SomaMc mutaMon calling, structural variant calling • MutaMons (drivers) in noncoding regions, integrate networks &
pathways • IntegraMon of transcriptome & genome • IntegraMon of epigenome & genome • MutaMonal signatures & processes • Germline cancer genome • EvoluMon & heterogeneity • Molecular subtypes & classificaMon • Mitochondrial • Pathogens • Portals & visualizaMon • TranslaMon to the clinic
Outline
• Coding & Noncoding sequence variants in cancer • How to idenMfy drivers: coding variants • Complex modes of acMon of noncoding variants • ComputaMonal approach to idenMfy noncoding drivers: FunSeq
• ICGC/TCGA consorMum for ‘Pan Cancer Analysis of Whole Genomes’
• Precision Medicine
What is precision medicine ?
• Drugs for specific mutaMons in paMent’s cancer
• Example: BRAF gene is frequently mutated in melanoma (V600E). Vemurafenib is a BRAF inhibitor and received FDA approval in 2011.
Source: "3OG7" by Boghog2 -‐ Own work. Licensed under Public Domain via Wikimedia Commons
35
Flaherty KT et al. N Engl J Med 2010;363:809-‐819.
Current challenges of precision medicine
• No drugs available for many mutated proteins • Resistance to drugs
37
38
Acknowledgements Yale
Yao Fu, Xinmeng Mu (Broad), Jieming Chen, Lucas Lochovsky, Arif Harmanci, Alexej Abyzov
(Mayo), Suganthi Balasubramanian, Cristina Sisu, Declan Clarke, Mike Wilson, Yong Kong, Mark
Gerstein
Sanger Vincenza Colonna, Yuan Chen, Yali Xue, Chris
Tyler-Smith
Cornell Steven Lipkin, Jishnu Das, Robert Fragoza,
Xiaomu Wei, Haiyuan Yu
WCMC
Andrea Sboner, Dimple Chakravarty, Naoki Kitabayashi, Vaja Liluashvili,
Zeynep H. Gümüş, Mark A. Rubin
In FunSeq2 Zhu Liu, Shaoke Lou, Jason Bedford, Kevin Yip
U of Michigan Hyun Min Kang U of Geneva
Tuuli Lappalainen (NYGC), Emmanouil T. Dermitzakis
Baylor Daniel Challis, Uday Evani, Donna
Muzny, Fuli Yu, Richard Gibbs EBI
Kathryn Beal, Laura Clarke, Fiona Cunningham, Paul Flicek, Javier Herrero, Graham R. S. Ritchie
Boston College Erik Garrison, Gabor Marth
Mass Gen Hospital Kasper Lage, Daniel G. MacArthur,
Tune H. Pers Rutgers
Jeffrey A. Rosenfeld
FuncZonal InterpretaZon
Group ~50 people
Khurana lab • Priyanka Dhingra
(Postdoctoral associate) • RecruiZng more
postdocs and students