
Large scale computational motif finding

Tim Hubbard

The ENCODE Project 10 years after the human genome project

20th July, CRG, Barcelona

10 years after Human Genome Project

Human genome race

won by public project

open access for all

Coming next:

•  1000 Genomes Project

•  International Cancer Genome Consortium (ICGC)

•  UK10K (Wellcome Trust Sanger Inst.)

•  Personal Genomes

International agreement on data release

“All human genomic sequence information should be freely available and in the public domain in order to encourage research and development and to maximise its benefit to society.”

The Bermuda Statement, February 1996

Assemblies of 1-2 kb are deposited in public databases (GenBank, EBI) every 24 hours

No patents are filed

Rise of “openness”

•  Open Biological data

•  Open Scientific Databases

•  Literature: Open Access Publishing

•  Software: Open Source

•  Healthcare R&D: Public Development Partnerships

•  Transparent Government: Open Data

•  Research is driven by openness, Internet

•  Business models can be based on it (e.g. IBM Linux)

US: www.data.gov

UK: www.data.gov.uk

Manual annotation: Havana

Experimental: Lausanne

Computational: Ensembl

UCSC

Michael Brent (WashU)

Mark Gerstein (Yale)

Manolis Kellis (MIT)

Roderic Guigo (CRG)

Consensus CDS project (CCDS)

NCBI/RefSeq

EBI-SIB/UniProt

ENCODE project: GENCODE consensus human gene set

GENCODE [ENCODE]

Transcription Tom Gingeras/ENCODE

Structural Biology EU Biosapiens

Nomenclature HGNC

1) De novo motif inference with NestedMICA

2) S. cerevisiae + JASPAR 2010 as a motif inference benchmark

•  Avoid motif match score cutoffs

•  Compare motifs to a gold standard motif set

De novo gene regulatory motif inference and validation in S. cerevisiae

•  Stochastic: nested sampling (Monte Carlo).

•  Bayesian: compute likelihood of sequences given a background model, motifs and their occupancy.

•  Parallel & distributed: 32-40 CPUs.

•  Designed for large scale sequence analysis.

•  There’s a GUI for it: iMotifs (http://bit.ly/imotifs)

Inferring motifs by nested sampling with NestedMICA

http://bit.ly/imotifs
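As a rough illustration of the "Bayesian" bullet above, the sketch below computes the likelihood of one sequence given a background model, one motif and its occupancy. It assumes a zeroth-order background and at most one motif occurrence per sequence; these simplifications, and the names `sequence_likelihood` and `occupancy`, are hypothetical and are not taken from NestedMICA itself.

```python
def background_prob(seq, bg):
    """Probability of seq under a zeroth-order (independent bases) background."""
    p = 1.0
    for base in seq:
        p *= bg[base]
    return p


def sequence_likelihood(seq, pwm, bg, occupancy):
    """Likelihood of seq with zero or one motif occurrence (illustrative only).

    pwm: list of dicts, one per motif column, mapping base -> probability.
    bg: dict mapping base -> background probability.
    occupancy: assumed prior probability that the sequence contains the motif.
    """
    width = len(pwm)
    p_bg = background_prob(seq, bg)
    n_positions = len(seq) - width + 1
    if n_positions <= 0:
        # Sequence too short to hold the motif: background only.
        return p_bg
    p_with_motif = 0.0
    for start in range(n_positions):
        # Replace the background factors in this window with motif factors.
        ratio = 1.0
        for j in range(width):
            base = seq[start + j]
            ratio *= pwm[j][base] / bg[base]
        # Each start position is assumed equally likely a priori.
        p_with_motif += p_bg * ratio / n_positions
    return (1.0 - occupancy) * p_bg + occupancy * p_with_motif
```

With `occupancy = 0` the result reduces to the background probability; a sequence containing a good match to the motif scores higher than one without as soon as `occupancy > 0`.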

The Tompa et al. (2005) motif discovery method assessment

•  Most comprehensive motif inference algorithm benchmark to date: 13 methods

•  Principle:

1)  Create synthetic promoter sequences with planted binding site sequences (TRANSFAC).

2)  An expert for each method is given a sequence set to run the algorithm on, and predicts one motif and its binding site matches.

3)  Summary statistics of the motif matches (binding site predictions) are used to assess the algorithms.
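To make step 3 concrete, here is a minimal sketch of nucleotide-level summary statistics of the kind such an assessment reports (sensitivity and positive predictive value over planted versus predicted site positions). The function names and the half-open `(start, end)` interval convention are assumptions made for this example.

```python
def covered_positions(sites):
    """Expand half-open (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in sites:
        positions.update(range(start, end))
    return positions


def nucleotide_level_stats(known_sites, predicted_sites):
    """Nucleotide-level sensitivity (nSn) and positive predictive value (nPPV)."""
    known = covered_positions(known_sites)
    predicted = covered_positions(predicted_sites)
    n_tp = len(known & predicted)   # planted and predicted
    n_fn = len(known - predicted)   # planted but missed
    n_fp = len(predicted - known)   # predicted but not planted
    n_sn = n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0
    n_ppv = n_tp / (n_tp + n_fp) if (n_tp + n_fp) else 0.0
    return {"nTP": n_tp, "nFN": n_fn, "nFP": n_fp, "nSn": n_sn, "nPPV": n_ppv}
```

For a planted site at `(10, 20)` and a prediction at `(15, 25)`, half of the planted bases are recovered and half of the predicted bases are correct, so both statistics are 0.5.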

Problems with the Tompa et al. (2005) assessment

1)  One motif is predicted. What real eukaryotic promoter sequence set contains a single motif?

2)  The motif significance score cutoff chosen by the expert affects the conclusions.

3)  Can we claim to know which genomic positions transcription factors really bind? Summary statistics of planted motif matches are not really connected to the process of TF binding.

Alternative Evaluation Strategy: compare motifs, not their genome matches

1)  Choose a random set of 1,000 known protein coding genes which have orthologs in yeasts.

2) Extract 200 bases upstream of TSS from the 1,000 genes. Total number of bases = 200,000nt.

3) Learn 200 motifs from the 200,000nt sequence set.

4) Compare the inferred 200 motifs (PWMs) to the non-redundant JASPAR 2010 fungal motif set.
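Step 4 compares motifs directly rather than their genome matches. The slides do not state which motif distance was used, so the sketch below assumes a simple one: the mean squared Euclidean distance between aligned PWM columns, minimised over all ungapped offsets and both strand orientations. All function names and the `min_overlap` parameter are hypothetical.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def column_distance(col_a, col_b):
    """Squared Euclidean distance between two PWM columns (base -> probability)."""
    return sum((col_a[b] - col_b[b]) ** 2 for b in BASES)


def reverse_complement(pwm):
    """Reverse the column order and swap each base for its complement."""
    return [{COMPLEMENT[b]: col[b] for b in BASES} for col in reversed(pwm)]


def aligned_distance(pwm_a, pwm_b, min_overlap=4):
    """Smallest mean column distance over all ungapped offsets."""
    best = float("inf")
    first = -(len(pwm_b) - min_overlap)
    last = len(pwm_a) - min_overlap
    for offset in range(first, last + 1):
        # Column i of pwm_a is paired with column i - offset of pwm_b.
        pairs = [(pwm_a[i], pwm_b[i - offset])
                 for i in range(len(pwm_a))
                 if 0 <= i - offset < len(pwm_b)]
        if len(pairs) >= min_overlap:
            mean = sum(column_distance(a, b) for a, b in pairs) / len(pairs)
            best = min(best, mean)
    return best


def motif_distance(pwm_a, pwm_b, min_overlap=4):
    """Distance between two motifs, allowing either strand orientation."""
    return min(aligned_distance(pwm_a, pwm_b, min_overlap),
               aligned_distance(pwm_a, reverse_complement(pwm_b), min_overlap))
```

A distance of this kind can be computed for every inferred motif against every JASPAR motif, which avoids choosing a match score cutoff on genomic hits.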

S. cerevisiae + JASPAR 2010 as a motif inference benchmark

•  JASPAR 2010 fungal motif set: A non-redundant set of 177 motifs of S. cerevisiae.

•  Majority of yeast TFs covered by the set (plus some heterodimers).

•  Manually curated, open access database.


Significant matches to inferred motifs in JASPAR 2010

[Figure: y-axis shows the number of JASPAR motifs matched]

Reciprocal best matches between inferred motifs and JASPAR 2010

[Figure panels: Inferred (NMICA), JASPAR 2010]
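A reciprocal best match pairs an inferred motif with a reference motif only when each is the other's closest counterpart. The sketch below shows that selection on a precomputed distance matrix; the matrix layout and the function name are assumptions for illustration, and any motif distance could supply the values.

```python
def reciprocal_best_matches(distances):
    """Return (inferred_index, reference_index) pairs that are mutual best hits.

    distances[i][j] is the distance between inferred motif i and
    reference motif j (smaller means more similar).
    """
    if not distances or not distances[0]:
        return []
    n_inferred = len(distances)
    n_reference = len(distances[0])

    # Closest reference motif for each inferred motif.
    best_reference = []
    for i in range(n_inferred):
        best_reference.append(min(range(n_reference), key=lambda j: distances[i][j]))

    # Closest inferred motif for each reference motif.
    best_inferred = []
    for j in range(n_reference):
        best_inferred.append(min(range(n_inferred), key=lambda i: distances[i][j]))

    return [(i, j) for i, j in enumerate(best_reference) if best_inferred[j] == i]
```

In the example matrix `[[0.1, 0.9], [0.8, 0.2], [0.3, 0.4]]`, the third inferred motif prefers reference motif 0, but reference motif 0 prefers the first inferred motif, so only the pairs `(0, 0)` and `(1, 1)` are reciprocal.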

Are high-scoring motif hits enriched in targets of the expected TF?

Difference in maximum motif match bit scores, 500bp-0bp upstream of TSS. Source for TF-target associations: YEASTRACT (curated database, evidence from small number of

Example: NMICA motif 80 = REB1 (MA0363)
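One possible reading of the "difference in maximum motif match bit scores" is sketched below: take the best log-odds score (in bits) of a motif anywhere in each upstream region, then compare the mean of that maximum between annotated targets of the TF and other genes. Whether the original analysis used a mean, and a uniform background, is not stated in the slides; both are assumptions here, as are the function names.

```python
import math


def max_bit_score(seq, pwm, bg):
    """Best log2-odds score of the motif over all windows of seq.

    Returns -inf when seq is shorter than the motif.
    """
    width = len(pwm)
    best = float("-inf")
    for start in range(len(seq) - width + 1):
        score = sum(math.log2(pwm[j][seq[start + j]] / bg[seq[start + j]])
                    for j in range(width))
        best = max(best, score)
    return best


def mean_max_score_difference(target_seqs, other_seqs, pwm, bg):
    """Mean maximum bit score in targets minus mean maximum bit score in others."""
    target_scores = [max_bit_score(s, pwm, bg) for s in target_seqs]
    other_scores = [max_bit_score(s, pwm, bg) for s in other_seqs]
    return (sum(target_scores) / len(target_scores)
            - sum(other_scores) / len(other_scores))
```

Because only the maximum score per region is used, no match score cutoff has to be chosen; a positive difference indicates stronger best hits upstream of the annotated targets.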

Fraction of inferred motifs with higher than expected conservation

152 / 200 (76%) of NMICA’s inferred motifs are more conserved across yeasts than random intergenic sequence.

PhastCons scores from a 6-way MULTIZ multiple alignment of yeasts (UCSC):

S. cerevisiae, paradoxus, kudriavzevii, bayanus, castellii, kluyveri
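A count such as 152 / 200 can be produced by comparing each motif's conservation at its hit positions against a baseline from random intergenic sequence. The slides do not say how "more conserved than random" was decided, so the sketch below assumes the simplest rule: a motif counts if the mean PhastCons score over its hits exceeds the mean over random intergenic windows. The function name and input layout are hypothetical.

```python
def fraction_more_conserved(motif_scores, random_scores):
    """Count motifs whose mean conservation exceeds the random baseline.

    motif_scores: dict mapping motif name -> list of PhastCons scores at its hits.
    random_scores: list of PhastCons scores from random intergenic windows.
    Returns (number above baseline, total motifs, fraction).
    """
    baseline = sum(random_scores) / len(random_scores)
    n_higher = 0
    for scores in motif_scores.values():
        if sum(scores) / len(scores) > baseline:
            n_higher += 1
    total = len(motif_scores)
    return n_higher, total, n_higher / total
```

A stricter version would replace the mean comparison with a significance test against many random samples; this sketch only shows how the fraction itself is tallied.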

Overlap between NMICA motifs with high multi-species conservation and motifs with low SNP rate between S. cerevisiae strains:

64 / 71 = 90%

SNPs with error rate < 10e-6 from the multiple alignment of 37 S. cerevisiae strains

Summary

1.  Introduced a motif- and TF-centric assessment of motif inference algorithm performance.

2.  Avoid motif bit score cutoffs when you can!

3.  NestedMICA was the best performer among the tested methods.

4.  Many close matches to known TFBS patterns found, but other likely functional motifs also present.

5.  Want to see your method compared alongside these? Contact [email protected]

Acknowledgements

• Matias Piipari

•  Thomas Down

•  Sanger Institute IT

• Wellcome Trust
