Large scale computational motif finding
Tim Hubbard
The ENCODE Project 10 years after the human genome project
20th July, CRG, Barcelona
10 years after Human Genome Project
Human genome race
won by public project
open access for all
Coming next: • 1000 Genomes Project
• International Cancer Genome Consortium (ICGC)
• UK10K (Wellcome Trust Sanger Inst.)
• Personal Genomes
International agreement on data release
“All human genomic sequence information should be freely available and in the public domain in order to encourage research and development and to maximise its benefit to society.”
The Bermuda Statement, February 1996
Assemblies of 1-2 kb are deposited in public database (GenBank, EBI) every 24 hours
No patents are filed
Rise of “openness”
• Open Biological data
• Open Scientific Databases
• Literature: Open Access Publishing
• Software: Open Source
• Healthcare R&D: Public Development Partnerships
• Transparent Government: Open Data
• Research is driven by openness, Internet
• Business models can be based on it (e.g. IBM Linux)
US: www.data.gov
UK: www.data.gov.uk
Manual annotation: Havana
Experimental: Lausanne
Computational: Ensembl
UCSC
Michael Brent (WashU)
Mark Gerstein (Yale)
Manolis Kellis (MIT)
Roderic Guigo (CRG)
Consensus CDS project (CCDS)
NCBI/RefSeq
EBI-SIB/UniProt
ENCODE project: GENCODE consensus human gene set
GENCODE [ENCODE]
Transcription Tom Gingeras/ENCODE
Structural Biology EU Biosapiens
Nomenclature HGNC
1) De novo motif inference with NestedMICA
2) S. cerevisiae + JASPAR 2010 as a motif inference benchmark
• Avoid motif match score cutoffs
• Compare motifs to a gold standard motif set
De novo gene regulatory motif inference and validation in S. cerevisiae
• Stochastic: nested sampling (Monte Carlo).
• Bayesian: compute likelihood of sequences given a background model, motifs and their occupancy.
• Parallel & distributed: 32-40 CPUs.
• Designed for large scale sequence analysis.
• There’s a GUI for it: iMotifs (http://bit.ly/imotifs)
Inferring motifs by nested sampling with NestedMICA
http://bit.ly/imotifs
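The "Bayesian" bullet above can be illustrated with a minimal sketch of a sequence likelihood given a background model, one motif and its occupancy. It assumes a zero-order background and at most one motif occurrence per sequence; both are simplifications chosen for illustration and are not NestedMICA's actual model. The names `BACKGROUND` and `log_likelihood` are hypothetical.

```python
import math

# Assumed zero-order background frequencies (illustrative values only).
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def log_likelihood(seq, pwm, occupancy, background=BACKGROUND):
    """Return log P(seq | background, motif, occupancy).

    Assumption: the sequence holds zero motif instances with probability
    (1 - occupancy), or exactly one instance at a uniformly chosen start
    position with probability occupancy.
    pwm is a list of columns, each a dict base -> probability.
    """
    width = len(pwm)
    log_bg = sum(math.log(background[base]) for base in seq)
    n_starts = len(seq) - width + 1
    if n_starts < 1:
        # Sequence too short to hold the motif: background only.
        return log_bg

    # Likelihood ratio (motif vs. background) for each possible start.
    ratios = []
    for start in range(n_starts):
        ratio = 1.0
        for offset, column in enumerate(pwm):
            base = seq[start + offset]
            ratio *= column[base] / background[base]
        ratios.append(ratio)

    mean_ratio = sum(ratios) / n_starts
    return log_bg + math.log((1.0 - occupancy) + occupancy * mean_ratio)


# Hypothetical 3-column motif that prefers "TGA".
TGA = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
```

With this sketch, a sequence containing TGA is more likely under a non-zero occupancy than under occupancy 0, while a sequence with no plausible match is less likely. A sampler such as nested sampling would explore motif columns and occupancies to find models with high likelihood.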
The Tompa et al. (2005) motif discovery method assessment
• Most comprehensive motif inference algorithm benchmark to date: 13 methods
• Principle:
1) Create synthetic promoter sequences with planted binding site sequences (TRANSFAC).
2) An expert is given a sequence set to run their algorithm on and predicts one motif and its binding site matches.
3) Summary statistics of the motif matches (binding site predictions) used to assess algorithms.
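Step 3 can be sketched as follows: nucleotide-level summary statistics computed from planted and predicted binding site intervals on one sequence. The statistic names (`nSn`, `nPPV`, `nPC`) follow common usage for sensitivity, positive predictive value and performance coefficient; the function name and the half-open interval convention are assumptions made for this example.

```python
def nucleotide_stats(planted_sites, predicted_sites):
    """Compare predicted binding sites against planted ones, per nucleotide.

    Both arguments are lists of (start, end) half-open intervals on the
    same sequence (an assumed convention for this sketch).
    """
    planted = {pos for start, end in planted_sites for pos in range(start, end)}
    predicted = {pos for start, end in predicted_sites for pos in range(start, end)}

    tp = len(planted & predicted)   # planted positions that were predicted
    fn = len(planted - predicted)   # planted positions that were missed
    fp = len(predicted - planted)   # predicted positions with no planted site

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "nSn": ratio(tp, tp + fn),       # sensitivity
        "nPPV": ratio(tp, tp + fp),      # positive predictive value
        "nPC": ratio(tp, tp + fn + fp),  # performance coefficient
    }
```

For example, a planted site at (10, 20) and a prediction at (15, 25) share 5 of 10 positions each way, giving `nSn` = 0.5, `nPPV` = 0.5 and `nPC` = 1/3. Note that every one of these numbers depends on which matches the expert chose to report.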
Problems with the Tompa et al. (2005) assessment
1) One motif is predicted. What real eukaryotic promoter sequence set contains a single motif?
2) Motif significance score cutoff chosen by the expert affects conclusions.
3) Can we claim we know which genomic positions transcription factors really bind? Assessing summary statistics of planted motif matches is not really connected to the process of TF binding.
Alternative Evaluation Strategy Compare motifs, not their genome matches
1) Choose a random set of 1,000 known protein coding genes which have orthologs in yeasts.
2) Extract 200 bases upstream of TSS from the 1,000 genes. Total number of bases = 200,000 nt.
3) Learn 200 motifs from the 200,000nt sequence set.
4) Compare the inferred 200 motifs (PWMs) to the non-redundant JASPAR 2010 fungal motif set.
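Step 4 needs a way to compare two PWMs directly. The slides do not state which metric was used, so the following is only a sketch under stated assumptions: columns are (A, C, G, T) probability tuples, the distance is the mean squared Euclidean distance over overlapping columns, and the best ungapped alignment in either orientation is taken. All function names and the `min_overlap` parameter are hypothetical.

```python
def column_distance(p, q):
    """Squared Euclidean distance between two (A, C, G, T) columns."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def reverse_complement(pwm):
    """Reverse the column order and swap A<->T, C<->G in each column."""
    return [tuple(reversed(column)) for column in reversed(pwm)]


def aligned_distance(m1, m2, min_overlap=3):
    """Smallest mean column distance over all ungapped offsets."""
    best = float("inf")
    for offset in range(-(len(m2) - min_overlap), len(m1) - min_overlap + 1):
        pairs = [
            (m1[i], m2[i - offset])
            for i in range(len(m1))
            if 0 <= i - offset < len(m2)
        ]
        if len(pairs) < min_overlap:
            continue
        distance = sum(column_distance(p, q) for p, q in pairs) / len(pairs)
        best = min(best, distance)
    return best


def motif_distance(m1, m2, min_overlap=3):
    """Distance between two PWMs, allowing either strand orientation."""
    return min(
        aligned_distance(m1, m2, min_overlap),
        aligned_distance(m1, reverse_complement(m2), min_overlap),
    )
```

Computing `motif_distance` for every inferred motif against every JASPAR motif gives a distance matrix, and no genome scanning or match score cutoff is involved at any point.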
S. cerevisiae + JASPAR 2010 as a motif inference benchmark
• JASPAR 2010 fungal motif set: A non-redundant set of 177 motifs of S. cerevisiae.
• Majority of yeast TFs covered by the set (plus some heterodimers).
• Manually curated, open access database.
Significant matches to inferred motifs in JASPAR 2010
[Figure; y-axis: Number of JASPAR motifs matched]
Inferred (NMICA) JASPAR 2010
Reciprocal best matches between inferred motifs and JASPAR 2010
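A reciprocal best match can be sketched as follows, assuming a precomputed distance matrix with one row per inferred motif and one column per JASPAR motif. The function name and matrix layout are illustrative; the slides do not specify how the distances were obtained.

```python
def reciprocal_best_matches(dist):
    """Return (inferred_index, reference_index) pairs that pick each other.

    dist[i][j] is the distance between inferred motif i and reference
    motif j (smaller means more similar). A pair is reported only if j is
    the closest reference motif to i AND i is the closest inferred motif
    to j.
    """
    n_rows = len(dist)
    n_cols = len(dist[0])

    best_ref_for_inferred = [
        min(range(n_cols), key=lambda j: dist[i][j]) for i in range(n_rows)
    ]
    best_inferred_for_ref = [
        min(range(n_rows), key=lambda i: dist[i][j]) for j in range(n_cols)
    ]

    return [
        (i, j)
        for i, j in enumerate(best_ref_for_inferred)
        if best_inferred_for_ref[j] == i
    ]
```

In the matrix `[[0.1, 0.9, 0.8], [0.2, 0.7, 0.6], [0.9, 0.3, 0.5]]`, inferred motifs 0 and 1 both prefer reference 0, but reference 0 prefers inferred motif 0, so only (0, 0) and (2, 1) are reciprocal.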
Are high-scoring motif hits enriched in targets of the expected TF?
Difference in maximum motif match bit scores, 500 bp to 0 bp upstream of TSS. Source for TF-target associations: YEASTRACT (curated database, evidence from small number of
Example: NMICA motif 80 = REB1 (MA0363)
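The cutoff-free comparison can be sketched like this: take the single best log-odds score (in bits) per upstream sequence, then compare target genes against non-target genes. Summarising the two groups by the difference of medians is an assumption made for this example, as are the function names; the PWM is assumed to carry pseudocounts so that no probability is zero.

```python
import math
from statistics import median


def max_bit_score(seq, pwm, background):
    """Best log2-odds motif match score over all start positions in seq.

    pwm is a list of dicts base -> probability (no zeros); background is
    a dict base -> probability.
    """
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        sum(
            math.log2(pwm[k][seq[start + k]] / background[seq[start + k]])
            for k in range(width)
        )
        for start in range(len(seq) - width + 1)
    )


def score_difference(target_seqs, other_seqs, pwm, background):
    """Median max score in TF targets minus median max score elsewhere."""
    target_scores = [max_bit_score(s, pwm, background) for s in target_seqs]
    other_scores = [max_bit_score(s, pwm, background) for s in other_seqs]
    return median(target_scores) - median(other_scores)
```

Because each sequence contributes only its maximum score, no threshold has to be chosen to decide what counts as a "hit"; a positive difference suggests that the motif matches better upstream of the expected TF's targets.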
152 / 200 of NMICA’s inferred motifs are more conserved across yeasts than random intergenic sequence.
PhastCons scores from a 6-way MULTIZ multiple alignment of yeasts (UCSC):
S. cerevisiae, S. paradoxus, S. kudriavzevii, S. bayanus, S. castellii, S. kluyveri
Fraction of inferred motifs with higher than expected conservation
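One way to arrive at a fraction such as 152 / 200 (76%) is to compare each motif's mean conservation score with an empirical null built from random intergenic windows. The slides do not describe the exact test, so this is a sketch under assumptions: the per-motif and null mean PhastCons scores are already computed, and the `alpha` threshold and function names are hypothetical.

```python
def empirical_p_value(observed, null_values):
    """Fraction of null values at least as large as observed.

    The +1 terms keep the estimate above zero for a finite null sample.
    """
    at_least = sum(1 for value in null_values if value >= observed)
    return (at_least + 1) / (len(null_values) + 1)


def fraction_conserved(motif_means, null_means, alpha=0.05):
    """Return (fraction, names) of motifs more conserved than the null.

    motif_means maps motif name -> mean conservation score over its best
    matches; null_means lists mean scores of random intergenic windows.
    """
    conserved = [
        name
        for name, mean_score in motif_means.items()
        if empirical_p_value(mean_score, null_means) <= alpha
    ]
    return len(conserved) / len(motif_means), conserved
```

With a null of 100 evenly spaced scores from 0.00 to 0.99, motifs with mean scores 0.995 and 0.97 pass at `alpha=0.05`, while a motif at 0.5 does not.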
Overlap between NMICA motifs with high multi-species conservation and motifs with low SNP rate between S. cerevisiae strains:
64 / 71 = 90%
SNPs with error rate < 10e-6 from the multiple alignment of 37 S. cerevisiae strains
Summary
1. Introduced a motif- and TF-centric assessment of motif inference algorithm performance.
2. Avoid motif bit score cutoffs when you can!
3. NestedMICA was the best performer among the tested methods.
4. Many close matches to known TFBS patterns were found, but other likely functional motifs are also present.
5. Want to see your method compared alongside these? Contact [email protected]
Acknowledgements
• Matias Piipari
• Thomas Down
• Sanger Institute IT
• Wellcome Trust