Regulatory element discovery for developmental time series
Joint work with Xuejing Li, Chris Wiggins, Valerie Reinke
Christina Leslie
Computational Biology Program, Sloan-Kettering Institute
Memorial Sloan-Kettering Cancer Center
http://cbio.mskcc.org
Regulatory networks in development
• Reinke lab: genome-wide expression for C. elegans developmental time series + germ cell/gametogenesis mutants
• Problem: decipher regulatory networks governing germline- and sex-regulated genes
Previous work: MEDUSA in yeast
• Predict up/down expression of target genes from promoter + regulator expression
• Learns from a set of mRNA expression experiments without clustering
• Problem: high correlation of nearby time points, many regulator profiles
Sequence to expression profile
• Can we learn mapping from promoter sequence to full expression trajectory (with some level of statistical significance)?
• Retain some properties of MEDUSA:
– No clustering of expression profiles
– Learn motifs de novo from promoters by building from k-mers
…AGCTATGCCATCGACTGCTCCA…
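A minimal sketch of how a promoter could be turned into k-mer counts, the raw material for a motif vector. The function name `kmer_counts` and the choice k = 3 are illustrative assumptions, not taken from the talk; the sequence is the fragment shown on the slide with the flanking "…" left out.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in a promoter sequence (hypothetical helper)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Fragment from the slide; k = 3 is an arbitrary choice for illustration.
promoter = "AGCTATGCCATCGACTGCTCCA"
counts = kmer_counts(promoter, 3)
```

Stacking such count vectors over a fixed k-mer vocabulary, one row per gene, would give a genes-by-motifs matrix.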
Regression problem
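• Idea: learn latent factors $T = XW$ that “explain” $Y$
• Then regress $X \approx T P^\top$, $Y \approx T Q^\top$, or equivalently $Y \approx XB$ where $B = W Q^\top$
• $X$ is $G \times M$: row $g$ = motif vector (k-mer counts) for gene $g$
• $Y$ is $G \times E$: row $g$ = expression profile for gene $g$
• Columns $w_i$ of $W$ = weight vectors
• Columns of $P$, $Q$ = loadings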
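A small numerical sketch of the factor regression, assuming genes are rows of both matrices so that the prediction is X times B. The weights W here are random placeholders (PLS would choose them), the loadings are fit by ordinary least squares, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
G, M, E, K = 50, 8, 4, 2            # genes, k-mers, expression columns, factors (toy sizes)
X = rng.normal(size=(G, M))         # stand-in for the motif matrix
Y = rng.normal(size=(G, E))         # stand-in for the expression matrix
W = rng.normal(size=(M, K))         # placeholder weights; PLS would learn these

T = X @ W                                    # latent factors, G x K
P = np.linalg.lstsq(T, X, rcond=None)[0].T   # X loadings, M x K, so X ~ T P^T
Q = np.linalg.lstsq(T, Y, rcond=None)[0].T   # Y loadings, E x K, so Y ~ T Q^T
B = W @ Q.T                                  # regression coefficients, M x E

# Predicting through the factors or through B gives the same result.
same_prediction = np.allclose(X @ B, T @ Q.T)
```

The identity holds for any W because T Q^T = X W Q^T; the quality of the fit is what depends on how W is chosen.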
First step: PLS regression
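• Sequentially build latent factors $t_i = X w_i$:
– Maximize covariance between factors and $Y$
– Constrain $t_1, \dots, t_K$ to be uncorrelated
• SIMPLS: for $i = 1, \dots, K$
$$w_i = \arg\max_w \; w^\top X^\top Y Y^\top X w \qquad \big(\text{in 1D case: } \arg\max_w \operatorname{Cov}(Y, Xw)^2\big)$$
subject to
$$w_i^\top w_i = 1, \qquad t_i^\top t_j = w_i^\top X^\top X w_j = 0, \quad j = 1, \dots, i-1$$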
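The sequential optimization above can be sketched in NumPy as follows. This is an illustrative deflation scheme in the spirit of SIMPLS, not the implementation used in the talk: each weight vector is the top left singular vector of the cross-product matrix after projecting out the directions X^T X w_j of earlier factors. The function name and the random test data are assumptions.

```python
import numpy as np

def simpls_weights(X, Y, n_factors):
    """Return unit weight vectors W (columns) and uncorrelated factors T = X W."""
    X = X - X.mean(axis=0)                  # center so inner products act as covariances
    Y = Y - Y.mean(axis=0)
    S = X.T @ Y                             # cross-product matrix, M x E
    V = np.zeros((X.shape[1], 0))           # orthonormal basis of span{X^T X w_j}
    W = []
    for _ in range(n_factors):
        S_d = S - V @ (V.T @ S)             # enforce w orthogonal to earlier X^T X w_j
        U, _, _ = np.linalg.svd(S_d, full_matrices=False)
        w = U[:, 0]                         # maximizes w^T S S^T w under the constraints
        v = X.T @ (X @ w)                   # new constraint direction X^T X w
        v = v - V @ (V.T @ v)               # Gram-Schmidt against earlier directions
        V = np.column_stack([V, v / np.linalg.norm(v)])
        W.append(w)
    W = np.column_stack(W)
    return W, X @ W

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 10))          # toy data, not the worm data set
Y_demo = rng.normal(size=(60, 4))
W_demo, T_demo = simpls_weights(X_demo, Y_demo, n_factors=3)
gram = T_demo.T @ T_demo                    # off-diagonal entries should vanish
```

Because each new w is orthogonal to every earlier X^T X w_j, the factors come out mutually uncorrelated without any explicit deflation of X.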
Equivalent formulation
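• Learn latent factors $t_i = X w_i$ and $u_i = Y c_i$ for both predictor and response variables
– $w_i$ and $c_i$ chosen to maximize $\operatorname{Cov}(t_i, u_i)$
– for $i = 1, \dots, K$
$$w_i, c_i = \arg\max_{w, c} \; w^\top X^\top Y c$$
subject to
$$w_i^\top w_i = c_i^\top c_i = 1, \qquad t_i^\top t_j = w_i^\top X^\top X w_j = 0, \quad j = 1, \dots, i-1$$
• $w_i$ = motif weight vector, $c_i$ = expression weight vector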
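A short derivation sketch of why the two-vector formulation agrees with the SIMPLS objective. It uses only the Cauchy-Schwarz inequality and sets aside the orthogonality constraint, which restricts w in the same way in both versions.

```latex
% For a fixed unit vector w, the best unit vector c is aligned with Y^T X w:
\max_{c^\top c = 1} \; w^\top X^\top Y c
  = \lVert Y^\top X w \rVert,
\qquad
c^\star = \frac{Y^\top X w}{\lVert Y^\top X w \rVert}.

% Substituting c^\star leaves a problem in w alone; squaring does not move the maximizer:
\arg\max_{w^\top w = 1} \lVert Y^\top X w \rVert^2
  = \arg\max_{w^\top w = 1} \; w^\top X^\top Y Y^\top X w .
```

For the first factor, where no orthogonality constraint applies, the pair (w, c) is the leading left and right singular vector pair of the matrix X^T Y.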
Next steps: sparsity, graph Laplacian
• For regularization and interpretability of weight vectors, want
– sparsity in $w$: want most components to be 0
$$\sum_k |w_k| \le b_1$$
– smoothness in $w$: define graph on set of k-mers, with edge $k \sim l$ if corresponding k-mers are close in Hamming distance
$$\sum_{k \sim l} (w_k - w_l)^2 \le b_2$$
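A toy sketch of the two regularizers: an L1 norm for sparsity and a graph-smoothness term summed over k-mer pairs that are close in Hamming distance. The threshold of one mismatch, the helper names, and the weight values are assumptions for illustration; the slide does not say how "close" is defined.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatched positions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))

def kmer_graph_edges(kmers, max_dist=1):
    """Edges (i, j) between k-mers within max_dist mismatches (assumed threshold)."""
    return [(i, j) for i, j in combinations(range(len(kmers)), 2)
            if hamming(kmers[i], kmers[j]) <= max_dist]

def l1_norm(w):
    """Sparsity term: sum of absolute weights."""
    return sum(abs(x) for x in w)

def smoothness(w, edges):
    """Graph Laplacian term: squared weight differences across edges."""
    return sum((w[i] - w[j]) ** 2 for i, j in edges)

# Toy k-mers and weights, made up for illustration (not results from the talk).
kmers = ["GATAA", "GATAC", "ACGTG", "ACGTC", "TTTTT"]
w = [1.0, 0.5, -2.0, -2.0, 0.0]
edges = kmer_graph_edges(kmers)
sparsity_term = l1_norm(w)
smoothness_term = smoothness(w, edges)
```

The smoothness term is small when neighboring k-mers carry similar weights, which is what lets related sequence variants act together as one motif.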
Preliminary results: worm time series
• Reinke data: ~9000 genes, 12 time points (3 replicates), wild type germline development
• Gene sets, from mutant expression data:
– Sperm genes: high expression in spermatogenesis
– Oocyte genes: high expression in oogenesis
• Motif matrix: filter k-mers based on expected counts
Standard PLS
• 10-fold c.v. on held-out genes
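A minimal sketch of how genes might be partitioned for 10-fold cross-validation, so that each gene is held out exactly once. The function name, the shuffling seed, and the round number of 9000 genes are illustrative assumptions; the talk does not describe how the folds were built.

```python
import random

def kfold_gene_splits(n_genes, n_folds=10, seed=0):
    """Yield (train, held_out) lists of gene indices, one pair per fold."""
    idx = list(range(n_genes))
    random.Random(seed).shuffle(idx)                    # random assignment of genes to folds
    folds = [idx[f::n_folds] for f in range(n_folds)]   # near-equal fold sizes
    for f in range(n_folds):
        held_out = folds[f]
        train = [g for j, fold in enumerate(folds) if j != f for g in fold]
        yield train, held_out

splits = list(kfold_gene_splits(9000))   # 9000 is a stand-in gene count
```

In each fold the model would be fit on the training genes only and its predicted expression profiles compared with the measured profiles of the held-out genes.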
Regularized PLS
• 10-fold c.v. on held-out genes
Regularized PLS
• Sperm/oocyte gene sets: largest chi-square reduction for the 3rd/1st latent factor, respectively
Interpretation of factor weights
• To infer motifs relevant for an expression pattern:
– Latent factors ti = Xwi and ui = Yci for both predictors and response variables
– wi and ci chosen to maximize Cov(ti, ui)
• ci gives weights over time points: interpret as expression pattern
• wi gives weights over motifs: highly weighted motifs relevant for this expression pattern
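A small sketch of reading motifs off a weight vector by ranking k-mers on the size of their weights. Ranking by absolute value is an assumption about what "highly weighted" means, and the weights below are toy values, not results from the talk.

```python
def top_kmers(kmers, weights, n=50):
    """Return the n (k-mer, weight) pairs with the largest absolute weight."""
    ranked = sorted(zip(kmers, weights), key=lambda kw: abs(kw[1]), reverse=True)
    return ranked[:n]

# Toy vocabulary and made-up weights for illustration only.
vocab = ["GATAA", "ACGTG", "TTTTT", "CCCCC"]
w_i = [0.9, -0.7, 0.05, 0.0]
top = top_kmers(vocab, w_i, n=2)
```

The matching ci vector would then say which time points the selected k-mers are associated with.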
Sperm genes
• c3 correlated with sperm gene expression, consistent with drop in chi-square
Motif graph for sperm genes
• Top 50 k-mer graph for w3, clusters around GATAA (ELT-1) and ACGTG (bHLH)
Oocyte genes
• Oocyte genes correlate with c1 pattern
Oocyte motif map
• Top 50 k-mer graph for w1, log(p) vs weight
Some related work
• Zhang et al, 2008: PCA in Y for motif discovery
• Naughton et al, 2006: algorithmic motif search using graph representation
• Beer and Tavazoie, 2004; Segal et al, 2002: sequence to expression via clustering