Large scale computational motif finding Tim Hubbard The ENCODE Project 10 years after the human genome project 20th July, CRG, Barcelona

Post on 11-Aug-2020


Page 1: Large scale computational motif finding · ENCODE project: GENCODE consensus human gene set GENCODE [ENCODE] Transcription Tom Gingeras/ENCODE Structural Biology EU Biosapiens Nomenclature

Large scale computational motif finding

Tim Hubbard

The ENCODE Project 10 years after the human genome project

20th July, CRG, Barcelona

Page 2

10 years after Human Genome Project

Page 3

Human genome race

won by public project

open access for all

Page 4

Coming next:

•  1000 Genomes Project

•  International Cancer Genome Consortium (ICGC)

•  UK10K (Wellcome Trust Sanger Inst.)

•  Personal Genomes

Page 5

International agreement on data release

“All human genomic sequence information should be freely available and in the public domain in order to encourage research and development and to maximise its benefit to society.”

The Bermuda Statement, February 1996

Assemblies of 1-2 kb are deposited in public databases (GenBank, EBI) every 24 hours

No patents are filed

Page 6

Rise of “openness”

•  Open Biological data

•  Open Scientific Databases

•  Literature: Open Access Publishing

•  Software: Open Source

•  Healthcare R&D: Public Development Partnerships

•  Transparent Government: Open Data

•  Research is driven by openness and the Internet

•  Business models can be based on it (e.g. IBM Linux)

Page 7

US: www.data.gov

Page 8

UK: www.data.gov.uk

Page 9
Page 10

Manual annotation: Havana

Experimental: Lausanne

Computational: Ensembl

UCSC Michael Brent (WashU)

Mark Gerstein (Yale) Manolis Kellis (MIT)

Roderic Guigo (CRG)

Consensus CDS project (CCDS)

NCBI/RefSeq

EBI-SIB/UniProt

ENCODE project: GENCODE consensus human gene set

GENCODE [ENCODE]

Transcription Tom Gingeras/ENCODE

Structural Biology EU Biosapiens

Nomenclature HGNC

Page 11

1) De novo motif inference with NestedMICA

2) S. cerevisiae + JASPAR 2010 as a motif inference benchmark

•  Avoid motif match score cutoffs

•  Compare motifs to a gold standard motif set

De novo gene regulatory motif inference and validation in S. cerevisiae

Page 12

•  Stochastic: nested sampling (Monte Carlo).

•  Bayesian: compute likelihood of sequences given a background model, motifs and their occupancy.

•  Parallel & distributed: 32-40 CPUs.

•  Designed for large scale sequence analysis.

•  There’s a GUI for it: iMotifs (http://bit.ly/imotifs)

Inferring motifs by nested sampling with NestedMICA
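
To make the "likelihood of sequences given a background model, motifs and their occupancy" bullet concrete, here is a minimal sketch of such a likelihood for a single motif. It is not NestedMICA's actual model: the zero-order background, the toy 3-column weight matrix, the "at most one site per sequence" assumption and all names (BACKGROUND, MOTIF, sequence_likelihood) are hypothetical choices made for illustration only.

```python
import math

# Hypothetical zero-order background model (independent base frequencies).
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

# Hypothetical 3-column weight matrix: one base distribution per motif position.
MOTIF = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
]


def background_likelihood(seq):
    """Probability of the sequence if every base comes from the background."""
    return math.prod(BACKGROUND[base] for base in seq)


def sequence_likelihood(seq, motif, occupancy):
    """Mixture likelihood: with probability (1 - occupancy) the sequence is
    pure background; with probability occupancy it holds exactly one motif
    site at a uniformly chosen offset (an assumption of this sketch)."""
    width = len(motif)
    n_offsets = len(seq) - width + 1
    bg = background_likelihood(seq)
    if n_offsets <= 0:
        return bg  # sequence too short to contain the motif
    with_motif = 0.0
    for start in range(n_offsets):
        site = seq[start:start + width]
        flanks = seq[:start] + seq[start + width:]
        site_prob = math.prod(col[base] for col, base in zip(motif, site))
        with_motif += background_likelihood(flanks) * site_prob
    with_motif /= n_offsets
    return (1.0 - occupancy) * bg + occupancy * with_motif
```

A sampler such as nested sampling would repeatedly propose changes to the motif columns and occupancy and evaluate a likelihood of this general kind over the whole sequence set; the sketch only shows the evaluation step.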

Page 13

http://bit.ly/imotifs

Page 14

The Tompa et al. (2005) motif discovery method assessment

•  Most comprehensive motif inference algorithm benchmark to date: 13 methods

•  Principle:

1)  Create synthetic promoter sequences with planted binding site sequences (TRANSFAC).

2)  Each expert is given a sequence set to run their algorithm on and predicts one motif and its binding site matches.

3)  Summary statistics of the motif matches (binding site predictions) used to assess algorithms.
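
As an illustration of step 3, the sketch below computes two nucleotide-level summary statistics (sensitivity and positive predictive value) from planted and predicted site intervals. The half-open (start, end) interval convention and the function names are assumptions of this sketch, not a reproduction of the assessment's own code.

```python
def covered_positions(sites):
    """Expand half-open (start, end) intervals into a set of base positions."""
    positions = set()
    for start, end in sites:
        positions.update(range(start, end))
    return positions


def nucleotide_level_stats(planted, predicted):
    """Return (sensitivity, positive predictive value) at nucleotide level."""
    known = covered_positions(planted)
    called = covered_positions(predicted)
    tp = len(known & called)   # planted bases that were predicted
    fn = len(known - called)   # planted bases that were missed
    fp = len(called - known)   # predicted bases outside any planted site
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv
```

For example, a planted site at (10, 20) and a prediction at (15, 25) share five bases, so both statistics come out at 0.5. Note that the set of predicted intervals depends on the score cutoff used to call matches.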

Page 15

Problems with the Tompa et al. (2005) assessment

1)  One motif is predicted. What real eukaryotic promoter sequence set contains a single motif?

2)  Motif significance score cutoff chosen by the expert affects conclusions.

3) Can we claim we know which genomic positions transcription factors really bind? Assessing summary statistics of planted motif matches is not really connected to the process of TF binding.

Page 16

Alternative Evaluation Strategy

Compare motifs, not their genome matches

Page 17

1)  Choose a random set of 1,000 known protein coding genes which have orthologs in yeasts.

2) Extract 200 bases upstream of TSS from the 1,000 genes. Total number of bases = 200,000nt.

3) Learn 200 motifs from the 200,000nt sequence set.

4) Compare the inferred 200 motifs (PWMs) to the non-redundant JASPAR 2010 fungal motif set.

S. cerevisiae + JASPAR 2010 as a motif inference benchmark
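
Step 2 of the benchmark can be sketched as follows. This is an illustrative helper only: the 0-based TSS coordinate, the strand symbols and the function names are assumptions, and the slides do not say how the sequences were actually extracted.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_region(chrom_seq, tss, strand, length=200):
    """Return up to `length` bases immediately upstream of a 0-based TSS,
    read 5'->3' on the gene's own strand (truncated at chromosome ends)."""
    if strand == "+":
        start = max(0, tss - length)
        return chrom_seq[start:tss]
    # On the minus strand, "upstream" lies at higher genomic coordinates.
    end = min(len(chrom_seq), tss + 1 + length)
    return reverse_complement(chrom_seq[tss + 1:end])


# Size of the training set described above: 1,000 genes x 200 bases each.
TOTAL_BASES = 1000 * 200  # 200,000 nt
```

Regions near a chromosome end come back shorter than 200 bases, so the true total can fall slightly below 200,000 nt unless such genes are excluded.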

Page 18

•  JASPAR 2010 fungal motif set: A non-redundant set of 177 motifs of S. cerevisiae.

•  Majority of yeast TFs covered by the set (plus some heterodimers).

•  Manually curated, open access database.

S. cerevisiae + JASPAR 2010 as a motif inference benchmark

Page 19

Significant matches to inferred motifs in JASPAR 2010

[Figure; y-axis: Number of JASPAR motifs matched]

Page 20

Significant matches to inferred motifs in JASPAR 2010

Inferred (NMICA) JASPAR 2010

Page 21

Reciprocal best matches between inferred motifs and JASPAR 2010
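
A reciprocal best match can be defined as a pair in which each motif is the other's closest counterpart. The sketch below assumes a precomputed matrix of motif-to-motif distances (smaller means more similar); which distance measure was used in the talk is not stated in the slides, and the function name is hypothetical.

```python
def reciprocal_best_matches(distances):
    """distances[i][j] is the distance between inferred motif i and
    reference motif j. Returns (i, j) pairs that are mutual best matches."""
    if not distances:
        return []
    n_inferred = len(distances)
    n_reference = len(distances[0])
    # Closest reference motif for every inferred motif.
    best_ref = [
        min(range(n_reference), key=lambda j: distances[i][j])
        for i in range(n_inferred)
    ]
    # Closest inferred motif for every reference motif.
    best_inferred = [
        min(range(n_inferred), key=lambda i: distances[i][j])
        for j in range(n_reference)
    ]
    return [(i, j) for i, j in enumerate(best_ref) if best_inferred[j] == i]
```

Requiring the match in both directions is stricter than a one-way best hit: several inferred motifs may all point at the same reference motif, but only one of them can be that reference motif's own best match.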

Page 22
Page 23

Are high-scoring motif hits enriched in targets of expected TFs?

Page 24

Are high-scoring motif hits enriched in targets of the expected TF?

Difference in maximum motif match bit scores, 500bp-0bp upstream of TSS. Source for TF-target associations: YEASTRACT (curated database, evidence from small number of
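
One way to read "difference in maximum motif match bit scores" is sketched below: score every window of a promoter with a log-odds (bit) matrix, keep the maximum per promoter, and compare the average maximum between the TF's known targets and other genes. This interpretation, the single-strand scan and all function names are assumptions; the slides do not spell out the exact statistic. Matrix entries must be non-zero (e.g. after adding pseudocounts).

```python
import math


def log_odds_matrix(pwm, background):
    """Convert per-column base probabilities to log2 odds against background."""
    return [
        {base: math.log2(col[base] / background[base]) for base in "ACGT"}
        for col in pwm
    ]


def max_match_score(seq, log_odds):
    """Best bit score over all windows of one strand; no cutoff is applied."""
    width = len(log_odds)
    scores = [
        sum(col[base] for col, base in zip(log_odds, seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    ]
    return max(scores) if scores else float("-inf")


def mean_max_score_difference(target_promoters, other_promoters, log_odds):
    """Mean of per-promoter maximum scores: targets minus non-targets."""
    targets = [max_match_score(s, log_odds) for s in target_promoters]
    others = [max_match_score(s, log_odds) for s in other_promoters]
    return sum(targets) / len(targets) - sum(others) / len(others)
```

Because each promoter contributes its best score rather than a count of hits above a threshold, no match score cutoff has to be chosen.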

Page 25

Example: NMICA motif 80 = REB1 (MA0363)

Are high-scoring motif hits enriched in targets of the expected TF?

Difference in maximum motif match bit scores, 500bp-0bp upstream of TSS. Source for TF-target associations: YEASTRACT (curated database, evidence from small number of

Page 26

152 / 200 of NMICA’s inferred motifs are more conserved across yeasts than random intergenic sequence.

PhastCons scores from a 6-way MULTIZ multiple alignment of yeasts (UCSC):

S. cerevisiae, S. paradoxus, S. kudriavzevii, S. bayanus, S. castellii, S. kluyveri

Page 27

Fraction of inferred motifs with higher than expected conservation

PhastCons scores from a 6-way MULTIZ multiple alignment of yeasts (UCSC):

S. cerevisiae, S. paradoxus, S. kudriavzevii, S. bayanus, S. castellii, S. kluyveri

Page 28

Overlap between NMICA motifs with high multi-species conservation & motifs with low SNP rate between S. cerevisiae strains.

64 / 71 = 90%

SNPs with error rate < 10e-6 from the multiple alignment of 37 S. cerevisiae strains

Page 29

Summary

1.  Introduced a motif- and TF-centric assessment of motif inference algorithm performance.

2.  Avoid motif bit score cutoffs when you can!

3.  NestedMICA was the best performer among the tested methods.

4.  Many close matches to known TFBS patterns found, but other likely functional motifs also present.

5.  Want to see your method compared alongside these? Contact [email protected]

Page 30

Acknowledgements

• Matias Piipari

•  Thomas Down

•  Sanger Institute IT

• Wellcome Trust