laboratory of computational biology eead.csic.es/compbio

Laboratory of Computational Biology, http://www.eead.csic.es/compbio. Estación Experimental de Aula Dei, CSIC, Zaragoza, Spain. Álvaro Sebastian Yagüe, [email protected]. The relation between amino-acid substitutions in the interface of transcription factors and their recognized DNA motifs. February 2, 2010 - V National Conference BIFI 2011.


Page 1: Laboratory of Computational Biology eead.csic.es/compbio

Laboratory of Computational Biology
http://www.eead.csic.es/compbio

Estación Experimental de Aula Dei
CSIC, Zaragoza, Spain

Álvaro Sebastian Yagüe
[email protected]

The relation between amino-acid substitutions in the interface of transcription factors and their recognized DNA motifs

February 2, 2010 - V National Conference BIFI 2011

Page 2: Laboratory of Computational Biology eead.csic.es/compbio

Content index

• DNA recognition and binding

• 3D footprinting

• footprintDB database

• alignment of DNA motifs

• alignment of protein interfaces

Page 3: Laboratory of Computational Biology eead.csic.es/compbio

DNA recognition and binding

Page 4: Laboratory of Computational Biology eead.csic.es/compbio

DNA-binding proteins

DNA-binding proteins contain DNA-binding domains and thus have a specific or general affinity for single- or double-stranded DNA.

[Figure: lac repressor, with residues Tyr 7, Tyr 12 and Tyr 17 labelled]

Jones CE, Olson OM: Sequence-specific DNA-protein interaction: the lac repressor. J Theor Biol 64:323-332, 1977.

Page 5: Laboratory of Computational Biology eead.csic.es/compbio

DNA-binding proteins

DNA-binding proteins contain DNA-binding domains and thus have a specific or general affinity for single- or double-stranded DNA.

[Figure: lac repressor, with residues Tyr 7, Tyr 12 and Tyr 17 labelled]

Lewis M, Chang G, Horton NC, Kercher MA, Pace HC, Schumacher MA, Brennan RG, Lu P: Crystal structure of the lactose operon repressor and its complexes with DNA and inducer. Science 271:1247-1254, 1996.

Page 9: Laboratory of Computational Biology eead.csic.es/compbio

3D footprinting

Page 10: Laboratory of Computational Biology eead.csic.es/compbio

Methods for studying protein-DNA interactions

Method | Advantages | Limitations
Nitrocellulose filter binding assay | Relatively simple handling | No localisation of binding site
Footprinting assays | Technical simplicity | Incomplete binding frequently results in unclear footprint
Methylation interference | Combined analysis of binding site and effect of epigenetic variations | Very complex workflow
Electrophoretic mobility shift assay (EMSA) | Technically simple assay that permits semi-quantitative studies | In complex analyses, no immediate information on binding sites or proteins involved
Chromatin immunoprecipitation (ChIP) | Applicable also for in vivo analyses | Relies very strongly on antibody specificity
DNA adenine methyltransferase identification (DamID) | In vivo detection | Requirement of exogenous fusion proteins
Surface plasmon resonance (SPR) | Real-time recording of association and dissociation | No high throughput
Systematic evolution of ligands by exponential enrichment (SELEX) | Enables in vitro selection of optimal binding partners | Only selection of best binding events
Yeast one-hybrid system | In vivo assay | Very complex system
DNA microarrays | High throughput | Analysis process for individual proteins
Protein microarrays | High throughput | Monomer-specificity
Proximity ligation | Highly specific and sensitive down to single-molecule detection | Complex sample preparation
Atomic force microscopy, X-ray crystallography, nuclear magnetic resonance | High-resolution structural information | No use for definition of interaction pairs or identification of genomic locations

Helwa R, Hoheisel JD: Analysis of DNA-protein interactions: from nitrocellulose filter binding assays to microarray studies. Anal Bioanal Chem 398:2551-2561.

Page 11: Laboratory of Computational Biology eead.csic.es/compbio

3D Footprinting

3D footprinting is a computational technique developed in our lab that annotates DNA-binding interfaces by analyzing published 3D structures from the PDB.

http://floresta.eead.csic.es/3dfootprint/

3D-footprint calculated interface for 1D5Y:

Interface residues for 1d5y_A TF: 32,34,35,37,38

Page 12: Laboratory of Computational Biology eead.csic.es/compbio

footprintDB

Page 13: Laboratory of Computational Biology eead.csic.es/compbio

footprintDB

We have designed, implemented and curated a database with more than 3000 unique DNA-binding proteins (mostly transcription factors, TFs) and 4000 Position Weight Matrices (PWMs) extracted from the literature and other repositories.

The DNA-binding interface residues of the TF sequences in footprintDB are annotated by aligning those sequences with 3D-footprint templates.

Page 14: Laboratory of Computational Biology eead.csic.es/compbio

footprintDB

Database | Description | TFs | PWMs
TRANSFAC | Data on transcription factors, their experimentally proven binding sites, their positional weight matrices and regulated genes. | 367 | 608
JASPAR CORE | Curated, non-redundant set of profiles, derived from published collections of experimentally defined transcription factor binding sites for eukaryotes. | 443 | 465
RegulonDB | Curated data of the transcriptional regulatory network of Escherichia coli K12. | 70 | 70
3D-footprint | Database of DNA-binding protein structures that is updated weekly with Protein Data Bank complexes. | 1006 | 1225
AthaMap | Genome-wide map of potential transcription factor and small RNA binding sites in Arabidopsis thaliana. | 42 | 48
Drosophila CTFM | Motif models reported in 51 primary references in the form of PWMs for 56 Drosophila melanogaster transcription factors. | 59 | 62
ZIFDB | Repository of information on C2H2 zinc fingers and engineered zinc-finger arrays. | 858 | 873
ZifBASE | An extensive collection of various natural and engineered zinc finger proteins. | 139 | 144
AGRIS | Resource of Arabidopsis promoter sequences, transcription factors and their target genes. | 53 | 53
UniPROBE | Repository of experimental data from universal protein binding microarray (PBM) experiments. | 296 | 437
PLACE | Database of motifs found in plant cis-acting regulatory DNA elements, all from previously published reports. | 28 | 480

Page 15: Laboratory of Computational Biology eead.csic.es/compbio

footprintDB

footprintDB predicts:

1. Transcription factors which bind a specific DNA site or motif

2. DNA motifs likely to be recognised by a specific DNA-binding protein

Page 18: Laboratory of Computational Biology eead.csic.es/compbio

alignment of protein interfaces

Page 19: Laboratory of Computational Biology eead.csic.es/compbio

Alignment of protein interfaces

The rationale behind footprintDB is the observation that proteins which recognize a similar DNA motif most often have a similar set of residues at the interface.

DNA motif ~ TF interface
yCAATTAws ~ RKRTQNTK
-yaATTAam ~ RRRIQNTK
-yAATTArg ~ RRRIQNAK
-TAATTArc ~ RRRIQNAK
-tmATTAAs ~ KRRIQNMK

Page 20: Laboratory of Computational Biology eead.csic.es/compbio

Alignment of protein interfaces

Noyes et al. have recently shown that homeodomain binding specificities depend on the interface residues involved in DNA motif recognition.

Noyes MB, Christensen RG, Wakabayashi A, Stormo GD, Brodsky MH, Wolfe SA: Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. Cell 133:1277-1289, 2008.

Page 21: Laboratory of Computational Biology eead.csic.es/compbio

Alignment of protein interfaces

Interface alignment with footprintDB annotated interfaces:

yCAATTAws ~ RKRTQNTK
-yaATTAam ~ RRRIQNTK
-TAATTArc ~ RRRIQNAK
-tmATTAAs ~ KRRIQNMK

Unknown homeodomain protein
Homeodomain interface residues: RRRIQNAK
Predicted DNA binding motif: TAATTArc
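The lookup on this slide can be sketched in a few lines of Python. This is only an illustration: the slides do not say how aligned interfaces are scored, so the fraction of identical residues is used here as a stand-in, and the names `interface_identity` and `predict_motif` are hypothetical. The interface/motif pairs are the four shown on the slide.

```python
# Interface/motif pairs copied from the slide (leading "-" is alignment padding).
ANNOTATED = [
    ("RKRTQNTK", "yCAATTAws"),
    ("RRRIQNTK", "-yaATTAam"),
    ("RRRIQNAK", "-TAATTArc"),
    ("KRRIQNMK", "-tmATTAAs"),
]


def interface_identity(a, b):
    """Fraction of identical residues between two pre-aligned interfaces.

    Assumption: plain identity; the talk does not specify the scoring scheme.
    """
    if len(a) != len(b):
        raise ValueError("interfaces must already be aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def predict_motif(query, annotated=ANNOTATED):
    """Return (motif, score) of the annotated interface most similar to query."""
    best_interface, best_motif = max(
        annotated, key=lambda pair: interface_identity(query, pair[0])
    )
    return best_motif.strip("-"), interface_identity(query, best_interface)


print(predict_motif("RRRIQNAK"))
```

For the unknown homeodomain on the slide, the query interface RRRIQNAK matches an annotated interface exactly, so the sketch returns the motif TAATTArc.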

Page 22: Laboratory of Computational Biology eead.csic.es/compbio

Alignment of protein interfaces

ROC curves show that interface alignments improve DNA motif predictions in comparison with BLAST scores.

Scoring of aligned protein interfaces is more accurate at predicting which DNA motif an unknown DNA-binding protein binds than other scoring methods such as local alignment.

[Figure: ROC curves for homeodomains and bZIPs]

Page 23: Laboratory of Computational Biology eead.csic.es/compbio

alignment of DNA motifs

Page 24: Laboratory of Computational Biology eead.csic.es/compbio

DNA motif alignment issues

• Three alignment combinations: ATC / GTT ; ATC / AAC ; GAT / GTT
  Longer calculation time and higher false-positive rate than a single pairwise alignment.

• Different motif sizes: TgAGt / ackrTGACGTCAycra
  Not a big issue if the score is divided by the number of aligned nucleotides.

• Small motifs are prone to spurious high-scoring alignments because of the small nucleotide alphabet: AGt / CGT
  High similarity thresholds are required, particularly for individual zinc fingers, which usually recognize 3 nt.
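The three combinations in the first bullet are consistent with comparing the two motifs directly and with either one reverse-complemented (AAC is the reverse complement of GTT, and GAT of ATC). A minimal Python sketch of that reading follows; the function names are hypothetical and only unambiguous A/C/G/T letters are handled, not IUPAC codes.

```python
# Complement table for unambiguous nucleotides only (assumption: no IUPAC codes).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def strand_combinations(a, b):
    """The three pairings to align: direct, b reversed, a reversed."""
    return [
        (a, b),
        (a, reverse_complement(b)),
        (reverse_complement(a), b),
    ]


print(strand_combinations("ATC", "GTT"))
```

Each pairing needs its own alignment, which is where the extra calculation time and the extra chances of a spurious hit come from.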

Page 25: Laboratory of Computational Biology eead.csic.es/compbio

DNA motif alignment issues

• Complex motifs (multimeric proteins): ackrTGACGTCAycra / rTGACwmAGCA
  They are not easy to align, and heteromultimers might bind different sites.

• A single motif for TFs with multiple DNA-binding domains
  It might not be possible to know which domain binds to each submotif.

• TFs with different annotated motifs
  As a result of different oligomeric conformations or experimental approaches.

• Motifs with very low information content: akaTTrchhaAhcw
  These might be genuine or result from low-resolution experiments; they are a source of false-positive (FP) hits.

Page 26: Laboratory of Computational Biology eead.csic.es/compbio

Alignment of DNA motifs

Some families of transcription factors and their singularities:

Family | Motifs | Multimeric | Multidomain
Homeodomain | TAATkr, TGAyA | Sometimes | Unusual
Basic helix-loop-helix (bHLH) | CACGTG, CAsshG | Always (homodimers, heterodimers) | Never
Basic leucine zipper (bZIP) | CACGTG, -ACGT-, TGAGTC | Always (homodimers, heterodimers) | Never
MYB | GkTwGkTr | Usual (multimers) | Usual
High mobility group (HMG) | mTT(T)GwT, TTATC, ATTCA | Sometimes | Unusual
GAGA | GAGA | Never | Never
Fork head | TrTTTr | Unusual | Never
Fungal Zn(2)-Cys(6) binuclear cluster | CGG | Usual (homodimers) | Never
Ets | GGAw | Usual (homodimers, heterodimers, multimers) | Never
Rel homology domain (RHD) | GGnnwTyCC | Always (homodimers, heterodimers) | Never
Interferon regulatory factor | AAnnGAAA | Always (homodimers, heterodimers, multimers) | Never

Page 27: Laboratory of Computational Biology eead.csic.es/compbio

Alignment of DNA motifs

Motifs are aligned with an ungapped Smith-Waterman algorithm, and motif similarity is calculated as the sum of the Pearson correlation coefficients of the aligned motif positions.

G C C
G A C

Similarity: (1 + 0 + 1) / 3 = 2 / 3 = 0.67
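As a rough illustration of ungapped alignment on consensus strings, the Python sketch below slides one motif along the other and keeps the offset with the highest fraction of identical positions. It is a simplified stand-in, not the Smith-Waterman implementation used in the talk: the function name and the `min_overlap` parameter are assumptions, comparison is case-insensitive, and IUPAC ambiguity codes are treated as plain letters.

```python
def best_ungapped_alignment(a, b, min_overlap=2):
    """Slide b along a without gaps; return (identity ratio, offset, overlap).

    offset is the position in a where b[0] is placed (negative if b starts
    before a). The ratio is matches divided by the number of aligned columns.
    """
    a, b = a.upper(), b.upper()
    best = (0.0, 0, 0)
    for offset in range(-(len(b) - min_overlap), len(a) - min_overlap + 1):
        start_a = max(0, offset)
        start_b = max(0, -offset)
        overlap = min(len(a) - start_a, len(b) - start_b)
        if overlap < min_overlap:
            continue
        matches = sum(a[start_a + i] == b[start_b + i] for i in range(overlap))
        ratio = matches / overlap
        if ratio > best[0]:
            best = (ratio, offset, overlap)
    return best


ratio, offset, overlap = best_ungapped_alignment("GCC", "GAC")
print(round(ratio, 2), offset, overlap)
```

For GCC against GAC the best placement is the direct one, with 2 of 3 positions identical, matching the 0.67 on the slide. Dividing by the overlap is what keeps motifs of different sizes comparable, and a minimum overlap guards against very short overlaps scoring 1.0 by chance.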

Page 28: Laboratory of Computational Biology eead.csic.es/compbio

Alignment of DNA motifs

Motifs are aligned with an ungapped Smith-Waterman algorithm, and motif similarity is calculated as the sum of the Pearson correlation coefficients of the aligned motif positions.

Motif GCC:

    A  C  G  T
01  0  0  6  0  G
02  1  4  0  1  C
03  0  4  0  2  C

Motif GAC:

    A  C  G  T
01  0  0  3  1  G
02  3  1  0  0  A
03  0  4  0  0  C

Pearson correlation coefficient, position 1:

$$
r_1 = \frac{(0-1.5)(0-1)+(0-1.5)(0-1)+(6-1.5)(3-1)+(0-1.5)(1-1)}{\sqrt{(0-1.5)^2+(0-1.5)^2+(6-1.5)^2+(0-1.5)^2}\,\sqrt{(0-1)^2+(0-1)^2+(3-1)^2+(1-1)^2}} = 0.94
$$

Similarity = r_1 + r_2 + r_3 = 0.94 + 0.14 + 0.87 = 1.95
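The worked example on this slide can be reproduced with a short Python sketch that computes the Pearson correlation of each pair of aligned matrix rows and sums them. The two matrices are copied from the slide; the function names are hypothetical, and returning 0.0 for a row with no variance is an assumption, since the slides do not cover that case.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = sqrt(sum((a - mean_x) ** 2 for a in x)) * sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    # Assumption: a flat row (zero variance) contributes no similarity.
    return num / den if den else 0.0


def motif_similarity(pwm_a, pwm_b):
    """Sum of per-position Pearson coefficients of two aligned matrices."""
    return sum(pearson(row_a, row_b) for row_a, row_b in zip(pwm_a, pwm_b))


# Rows are positions 01-03, columns are A, C, G, T (values from the slide).
GCC = [[0, 0, 6, 0], [1, 4, 0, 1], [0, 4, 0, 2]]
GAC = [[0, 0, 3, 1], [3, 1, 0, 0], [0, 4, 0, 0]]

print([round(pearson(a, b), 2) for a, b in zip(GCC, GAC)])
print(round(motif_similarity(GCC, GAC), 2))
```

The per-position values round to 0.94, 0.14 and 0.87, and their sum to 1.95, in agreement with the slide.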

Page 29: Laboratory of Computational Biology eead.csic.es/compbio

Alignment of DNA motifs

4900 individual TRANSFAC DNA sites were aligned with their corresponding DNA motifs (PWMs), yielding a mean similarity of 0.70.

P0  A  C  G  T
01  2  0  4  0  G
02  1  0  4  1  G
03  0  6  0  0  C
04  2  0  0  4  T
05  0  0  0  6  T
06  0  6  0  0  C
07  0  6  0  0  C
08  3  0  0  3  W
09  1  4  1  0  C

AGCTTCCTC
GGCATCCAG
GTCTTCCTA
AGCTTCCAC
GGCATCCAC
GACTTCCTC

DNA motifs have a large variability.

Half of the DNA sites share less than 0.70 similarity with their motif.

* Note: a slide explaining how similarity is calculated is still missing.
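The count matrix on this slide follows directly from the six aligned sites. The Python sketch below rebuilds it by counting each nucleotide per column; `count_matrix` is a hypothetical helper name, and the sites are the six 9-nt sequences from the slide.

```python
# The six aligned binding sites shown on the slide.
SITES = [
    "AGCTTCCTC",
    "GGCATCCAG",
    "GTCTTCCTA",
    "AGCTTCCAC",
    "GGCATCCAC",
    "GACTTCCTC",
]


def count_matrix(sites, alphabet="ACGT"):
    """Return one [A, C, G, T] count row per aligned position."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [
        [sum(site[pos] == nt for site in sites) for nt in alphabet]
        for pos in range(length)
    ]


for position, row in enumerate(count_matrix(SITES), start=1):
    print(f"{position:02d}", *row)
```

The printed rows reproduce the A, C, G, T columns of the matrix above, for example 2 0 4 0 at position 01 and 3 0 0 3 at position 08.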
Page 30: Laboratory of Computational Biology eead.csic.es/compbio

Alignment of DNA motifs

4900 individual TRANSFAC DNA sites were aligned against random footprintDB motifs, yielding a mean similarity of 0.47.

[Figure: the site AGCTTCCTC next to an empty 9-position matrix (columns A C G T, positions 01-09) marked "?"]

Individual DNA sites and motifs can yield moderate similarities by chance.

Page 31: Laboratory of Computational Biology eead.csic.es/compbio

Alignment of DNA motifs

Which motif similarity threshold should we use to match DNA sites with their motifs?

AGCTTCCTC

P0  A  C  G  T
01  2  0  4  0  G
02  1  0  4  1  G
03  0  6  0  0  C
04  2  0  0  4  T
05  0  0  0  6  T
06  0  6  0  0  C
07  0  6  0  0  C
08  3  0  0  3  W
09  1  4  1  0  C

0.47 < ? < 0.70

Page 32: Laboratory of Computational Biology eead.csic.es/compbio

Alignment of DNA motifs

[Figure: ROC curve, TPR versus FPR, with the similarity range 0.55-0.60 marked]

A ROC curve interpolated from the TPR and FPR of the TRANSFAC alignments shows that motif similarity thresholds between 0.60 and 0.55 cover a sensitivity (TPR) range of 0.71-0.80 and a specificity (1-FPR) range of 0.88-0.74.
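How a similarity threshold turns into one point of a ROC curve can be sketched as follows in Python. The function names are hypothetical and the two score lists are invented toy values, not data from the talk: "true" scores stand for sites aligned with their own motif, "random" scores for sites aligned with unrelated motifs.

```python
def tpr_fpr(true_scores, random_scores, threshold):
    """TPR and FPR when every score >= threshold is called a match."""
    tp = sum(score >= threshold for score in true_scores)
    fp = sum(score >= threshold for score in random_scores)
    return tp / len(true_scores), fp / len(random_scores)


def roc_points(true_scores, random_scores, thresholds):
    """One (threshold, TPR, FPR) tuple per threshold."""
    return [(t, *tpr_fpr(true_scores, random_scores, t)) for t in thresholds]


# Hypothetical toy scores, NOT results from the presentation.
true_scores = [0.92, 0.81, 0.74, 0.66, 0.58, 0.52, 0.41]
random_scores = [0.71, 0.57, 0.49, 0.44, 0.38, 0.31, 0.22]

for threshold, tpr, fpr in roc_points(true_scores, random_scores, [0.60, 0.55]):
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  specificity={1 - fpr:.2f}")
```

Lowering the threshold raises sensitivity and lowers specificity, which is the trade-off the 0.55-0.60 range on the slide is balancing.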

* Note: explain the identity ratio on the same slide as similarity.
Page 33: Laboratory of Computational Biology eead.csic.es/compbio

Thanks for your attention

Page 34: Laboratory of Computational Biology eead.csic.es/compbio

Laboratory of Computational Biology

Estación Experimental de Aula Dei / CSIC

Av. Montañana 1.005

50059 Zaragoza (Spain)

Tel.: +34 976716089

Web: http://www.eead.csic.es/compbio/

Page 35: Laboratory of Computational Biology eead.csic.es/compbio

Questions?