modeling splice site and transcription factor binding site variation by information theory

Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory Peter K. Rogan, Ph.D. St. Jude’s Children’s Research Hospital Memphis, TN May 15, 2003

Page 1: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Peter K. Rogan, Ph.D.

St. Jude’s Children’s Research Hospital

Memphis, TN

May 15, 2003

Page 2: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Background

• Information theory provides general solutions to the problem of how to recognize members of a group of related nucleic acid (or protein) sequences.

Page 3: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory
Page 4: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

• The average information of a related set of sequences, Rsequence, represents the total sequence conservation:

$$R_{sequence} = \sum_{l} \left\{ 2 - \left[ -\sum_{b} f(b,l)\,\log_2 f(b,l) + e(n(l)) \right] \right\}$$

where f(b,l) is the frequency of each base b at position l, and e(n(l)) is a correction for the small sample size n at position l.

Schneider et al. J. Mol. Biol. 1984
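The Rsequence formula can be sketched in a few lines of Python. This is a minimal illustration, not the Delila implementation: the toy alignment is made up, and the small-sample correction uses the common approximation e(n) = 3 / (2 ln(2) n) for four bases, which is an assumption here (the slide does not say how e(n(l)) was computed).

```python
import math
from collections import Counter

def small_sample_correction(n):
    """Approximate e(n) for a 4-letter alphabet (assumed form, see lead-in)."""
    return 3.0 / (2.0 * math.log(2) * n)

def rsequence(sites):
    """Sum over positions l of 2 - [H(l) + e(n(l))], in bits."""
    total = 0.0
    for column in zip(*sites):          # one column per position l
        n = len(column)
        counts = Counter(column)
        # Uncertainty H(l) = -sum_b f(b,l) log2 f(b,l); absent bases contribute 0.
        uncertainty = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - (uncertainty + small_sample_correction(n))
    return total

# Toy aligned sites, invented for illustration only.
toy_sites = ["tataat", "tataat", "tacaat", "gataat"]
print(round(rsequence(toy_sites), 2))
```

A fully conserved position contributes almost 2 bits and a position with equal base frequencies contributes about 0 bits, which is why the total measures conservation.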

Page 5: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Sequence Logo

Conservation and diversity among related binding sites can be visualized using a sequence logo.

The area under the logo is Rsequence, the average information of the binding site.

Page 6: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Definition of Individual Information

• The individual information, Ri, of a single member of a sequence family is the dot product of that sequence vector and a weight matrix, Riw(b,l), based on the base frequencies at each position of the sequence:

$$R_i(j) = \sum_{l} \sum_{b=a}^{t} s(b,l,j)\, R_{iw}(b,l) \quad \text{(bits per site } j\text{)}$$

where s(b,l,j) is 1 if base b occurs at position l of sequence j, and 0 otherwise.
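As a rough sketch of the dot product above: because the sequence vector selects exactly one base per position, Ri reduces to looking up one weight per position and summing. The weight formula Riw(b,l) = 2 - (-log2 f(b,l) + e(n(l))) and the approximate e(n) used below are assumptions about how the matrix is built; the alignment is a toy example.

```python
import math

BASES = "acgt"

def weight_matrix(sites):
    """Assumed form: Riw(b,l) = 2 - (-log2 f(b,l) + e(n)), one dict per position."""
    n = len(sites)
    e_n = 3.0 / (2.0 * math.log(2) * n)      # approximate small-sample correction
    matrix = []
    for column in zip(*sites):
        weights = {}
        for b in BASES:
            f = column.count(b) / n
            # Bases never seen at this position get -inf in this sketch.
            weights[b] = 2.0 - (-math.log2(f) + e_n) if f > 0 else float("-inf")
        matrix.append(weights)
    return matrix

def individual_information(seq, matrix):
    """Ri of one sequence: pick the weight of the observed base at each position."""
    return sum(weights[b] for b, weights in zip(seq, matrix))

toy_sites = ["tataat", "tataat", "tacaat", "gataat"]   # invented data
matrix = weight_matrix(toy_sites)
ri_values = [individual_information(s, matrix) for s in toy_sites]
mean_ri = sum(ri_values) / len(ri_values)
print([round(r, 2) for r in ri_values], round(mean_ri, 2))
```

With this construction the mean of the Ri values over the sites used to build the matrix equals Rsequence for the same sites.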

Page 7: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Distribution of Individual Information for related binding sites

The average of the set of Ri values for a family of sequences is Rsequence.

Page 8: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Second law of thermodynamics

$$-k_B T \ln 2 \ge \frac{q}{R}$$

q: heat dissipated; T: temperature; R: information; kB: Boltzmann constant

[Figure: an HLH protein, free and bound to wild-type DNA (q < 0 => R > 0), compared with a DNA mutation or unrelated sequence (q > 0 => R < 0).]
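To give the bound a sense of scale, the quantity kB T ln 2 can be evaluated numerically. The temperature of 37 deg. C is an assumption chosen for illustration; the slide does not specify one.

```python
import math

K_B = 1.380649e-23       # Boltzmann constant, J/K
N_A = 6.02214076e23      # Avogadro constant, 1/mol
T_KELVIN = 310.15        # assumed temperature: 37 deg. C

# Energy scale per bit of information, per molecule and per mole.
joule_per_bit = K_B * T_KELVIN * math.log(2)
kcal_per_mol_per_bit = joule_per_bit * N_A / 4184.0

print(f"{joule_per_bit:.3e} J per bit")
print(f"{kcal_per_mol_per_bit:.3f} kcal/mol per bit")
```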

Page 9: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Sequence Walker Definition

Among related sequences having a common function, functional sites can be distinguished from non-sites with the sequence walker. (E. coli Fis protein)

[Figure: sequence walker; vertical scale in bits, from -4 to 2.]

Page 10: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Sequence Walker Application I

The matrix can be scanned along a “test sequence” until...

Ri = -6.7 bits at position 179 of the sequence. The Z score is -5.4.

[Figure: sequence walker; vertical scale in bits, from -4 to 2.]

Page 11: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Sequence Walker Application II

… a green bar indicates a potential binding site

Ri = 9.2 bits at position 180 of the sequence. The Z score is 0.3.

[Figure: sequence walker; vertical scale in bits, from -4 to 2.]
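The scanning step on these two slides can be sketched as a sliding window. The 3-position matrix below is hypothetical (it is not the Fis model), positions are 0-based, and the mean (8.36 bits) and standard deviation (2.79 bits) are back-calculated from the two (Ri, Z) pairs quoted on the slides under the assumption that Z = (Ri - mean) / SD; they are not reported values.

```python
def scan(sequence, matrix):
    """Return (start, Ri) for every window of the sequence the matrix can cover."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        hits.append((start, sum(col[b] for b, col in zip(window, matrix))))
    return hits

def z_score(ri, mean, sd):
    """Standard score of one Ri value relative to the model's Ri distribution."""
    return (ri - mean) / sd

# Hypothetical weight matrix in bits, for illustration only.
TOY_MATRIX = [
    {"a": 1.5, "c": -2.0, "g": -2.0, "t": -1.0},
    {"a": -3.0, "c": -3.0, "g": -3.0, "t": 1.9},
    {"a": 1.2, "c": -1.0, "g": -2.5, "t": -0.5},
]

hits = scan("ccatagg", TOY_MATRIX)
candidates = [(pos, ri) for pos, ri in hits if ri > 0]   # Ri > 0: potential site
print(candidates)

# Back-calculated mean and SD (assumptions, see lead-in).
print(round(z_score(9.2, 8.36, 2.79), 1), round(z_score(-6.7, 8.36, 2.79), 1))
```

Windows with Ri below zero are treated as non-sites, which matches the contrast between position 179 and position 180 on the slides.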

Page 12: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

mRNA splicing

[Figure: a gene with exons 1, 2, and 3 separated by IVS 1 and IVS 2. Transcription of the DNA gives hnRNA; splicing at the donor and acceptor sites gives the mature mRNA (exons 1-2-3) or an alternative mRNA (exons 1-3).]

Page 13: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Splice Site Model Building

• We extracted coordinates of unique donor and acceptor splice sites of known genes from the given strand of the 10/7/00 Human Genome Working Draft.

• Valid splice junctions were evaluated by information theory (Ri > 0) and the Ri(b,l) matrix was computed.

• This process was iterated (~ 10 cycles) until all sites evaluated with the matrix had Ri > 0.
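The refinement loop described in these bullets might look like the sketch below: build the matrix, drop sites with Ri <= 0, and repeat until every remaining site has Ri > 0. The weight formula, the approximate e(n), and the toy sites are assumptions for illustration; the real procedure worked on genome coordinates and may differ in detail.

```python
import math

BASES = "acgt"

def build_matrix(sites):
    """Assumed weights: Riw(b,l) = 2 - (-log2 f(b,l) + e(n))."""
    n = len(sites)
    e_n = 3.0 / (2.0 * math.log(2) * n)
    matrix = []
    for column in zip(*sites):
        weights = {}
        for b in BASES:
            f = column.count(b) / n
            weights[b] = 2.0 + math.log2(f) - e_n if f > 0 else float("-inf")
        matrix.append(weights)
    return matrix

def ri(seq, matrix):
    return sum(weights[b] for b, weights in zip(seq, matrix))

def refine(sites, max_cycles=10):
    """Iterate until all sites evaluated with their own matrix have Ri > 0."""
    current = list(sites)
    for cycle in range(1, max_cycles + 1):
        matrix = build_matrix(current)
        kept = [s for s in current if ri(s, matrix) > 0]
        if not kept:
            raise ValueError("no sites with Ri > 0 remain")
        if len(kept) == len(current):
            return current, matrix, cycle     # converged
        current = kept
    return current, build_matrix(current), max_cycles

# Toy input: nine donor-like sites plus one misaligned sequence (invented data).
toy_sites = ["caggtaagt"] * 9 + ["tttccccca"]
refined, model, cycles = refine(toy_sites)
print(len(refined), cycles)
```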

Page 14: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Semi-automated Splice Site Model Refinement

Parameters                Acc (+ strand)  Acc (- strand)  Acc_total  Acc (1992)
Starting set (n)          86,068          84,076          170,144    1,744
Refined model (n)         53,985          54,101          108,079    1,744
Site coordinates          [-25, 2]        [-25, 2]        [-25, 2]   [-25, 2]
Rsequence                 7.45            7.41            7.42       8.87
Standard deviation        3.47            3.47            3.47       4.58
Ri of consensus sequence  22.93           22.78           22.88      21.68

Parameters                Don (+ strand)  Don (- strand)  Don_total  Don (1992)
Starting set (n)          86,221          84,229          170,450    1,799
Refined model (n)         56,286          55,491          111,772    1,799
Site coordinates          [-3, 6]         [-3, 6]         [-3, 6]    [-10, 10]
Rsequence                 6.73            6.74            6.74       8.01
Standard deviation        2.36            2.33            2.34       3.29
Ri of consensus sequence  11.80           11.80           11.79      15.18

• ~ 1/3 of exon-intron junctions are misaligned in the draft, owing to the rapid alignment procedures used (i.e., BLAT).

Page 15: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Splice junction logos: (+) strand

Page 16: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Ri analysis of sequence variation at binding sites

• Effects of mutations

• Effects of polymorphisms

• Detection of cryptic sites

• Relationship between information content and phenotype

Page 17: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Comparison of the binding energies of normal and variant splice junctions:

$$\frac{G_{wt}}{G_{v}} = 2^{\Delta R_i}$$

where ΔRi = the difference between the respective Ri values,
Gwt = free energy of the natural binding site,
Gv = free energy of the variant binding site.

The fold difference in binding the normal vs. the variant site is Gwt/Gv.
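As a worked illustration of this relation (the ΔRi values below are chosen only to show the arithmetic), each bit of difference corresponds to a two-fold difference in binding:

$$\Delta R_i = 1 \text{ bit} \;\Rightarrow\; 2^{1} = 2\text{-fold}, \qquad \Delta R_i = 0.5 \text{ bits} \;\Rightarrow\; 2^{0.5} \approx 1.41\text{-fold}, \qquad \Delta R_i = 2.3 \text{ bits} \;\Rightarrow\; 2^{2.3} \approx 4.92\text{-fold}$$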

Page 18: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

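mRNA splicing mutations (*, ^)

[Figure: a gene with exons 1, 2, and 3 separated by IVS 1 and IVS 2, shown as DNA and hnRNA with donor and acceptor sites, and with mutations marked by * and ^. Mutant forms: leaky or no wild-type mRNA; exon skipping (*), giving exons 1-3; or cryptic splicing (^), giving an altered exons 1-2-3 product.]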

Page 19: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

The minimum information required for donor site recognition

A temperature-sensitive mutation in COL3A1 results in 50% exon skipping and Ehlers-Danlos syndrome, Type VII. Splicing is impaired at 39 deg. C and restored at 30 deg. C, which is consistent with weak binding by the U1 spliceosome.

Page 20: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Cryptic splicing mutations

A C->T mutation in exon 3 of the iduronidate synthetase gene activates a cryptic donor site upstream of the natural donor site.

Page 21: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Mechanism of exon recognition

[Figure: an exon in the mRNA (5’ to 3’) flanked by its acceptor and donor sites. Binding sites: U2 spliceosome + U2AF at the acceptor; U1 spliceosome at the donor.]

Page 22: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

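Mechanism of exon recognition: cryptic splicing mutation

[Figure: an exon in the mRNA (5’ to 3’) with its natural acceptor, natural donor, and an activated cryptic donor. Binding sites: U2 spliceosome + U2AF (2a) and U1 spliceosome. One donor site is labeled “Recognized”; the other is labeled “Either not recognized or to lesser degree”.]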

Page 23: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Mild (or leaky) splicing mutation

Page 24: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

CFTR Polymorphism (5T, 7T, 9T)

Splicing among 3 common alleles that differ in length in the polymorphic polythymidine tract of the IVS 8 acceptor of the CFTR gene. The shortest allele (top walker) shows 90% skipping of exon 9 and is associated with congenital absence of the vas deferens. Individuals with the two longer alleles have a normal phenotype, although the 7T allele produces less mRNA than the 9T allele.

[Figure: sequence walkers for the three alleles, with population frequencies (PopFreq) labeled 60%, 35%, and 5%.]

Page 25: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Prediction of clinical phenotypes

• Hereditary non-polyposis colon cancer
• Hemophilia A and B
• Atherosclerosis

Page 26: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Predicting Phenotype of HNPCC Splicing Mutations by Information Analysis

The Lynch I form of HNPCC is confined to the colon, but the more severe Lynch II type shows multi-organ involvement. The HNPCC phenotype is hypothesized to be related to the amount of normal and abnormal MLH1 and MSH2 mRNA present, as predicted from the individual information in mutant splice sites.

Mutant splice sites (n=31) in these genes contained significantly less information than the cognate natural sites. Each of the Lynch I mutations had Ri values >2.4 bits, which is consistent with reduction (not abolition) of mRNA. Lynch I and II phenotypes were distinguishable by their Ri values for all but 3 Lynch II mutations (with 2.4 to 4.8 bits).

[Figure: panels for Lynch I mutations and Lynch II mutations.]

Page 27: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Statistical analysis: HNPCC

Hypothesis: Ri values will be highest for normal splice sites, intermediate for Lynch I, and lowest for Lynch II syndrome.

The medians for these three groups are different and in the correct order, and there are some outliers in the two Lynch mutation groups. The three groups have significantly different Ri values {Kruskal-Wallis χ² (df=2) = 17.9833, P = 0.0001}.

Each of the groups is different from the others based on pairwise comparisons with the Wilcoxon rank-sum test:

Group comparison               Corrected rank-sum normal (Z) statistic   P
Lynch I vs. Normal variants    2.68                                      0.0072
Lynch II vs. Normal variants   3.73                                      0.0002
Lynch I vs. Lynch II           2.17                                      0.03

Page 28: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Results are consistent with MSH2 -/- and MSH2 +/- transgenic mouse phenotypes. Increased proliferation induces widespread DNA replication errors, which are repaired normally until DNA repair systems are saturated (Cancer Res. 62:2092, 2002).

Mismatch repair machinery is activated by DNA damaging agents (Nature 399:806, 1999; PNAS 96:10704, 1999).

Page 29: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Relating Information Content of F8C and F9 Splicing Mutations and Bleeding Phenotypes

To predict severity of hemophilia, mutations in the factor VIII (F8C) or factor IX (F9) genes were analyzed for changes in Ri:

The receiver operating characteristic curve discriminated mildly or moderately from severely reduced protein activity for values 2.4 bits or Ri < 7 bits (P = .001).

Using these thresholds:
- 91% of mutations with severely reduced protein expression were correctly identified (n = 45; P < 0.001).
- 86% of mutations associated with severe bleeding and all mutations with moderate bleeding symptoms were correctly identified (n = 22; P < .0009).

Page 30: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Information Content of Splicing Mutations in Lipid Metabolizing Genes vs. Phenotype

                        Phenotype*
                        Dyslipidemia              Reduction in protein level or activity
Ri value cutoff (bits)  Mild   Average  Severe    Mild   Average  Severe
< 2.4                   0/15   10/15    5/15      1/9    7/9      1/9
> 2.4                   2/5    3/5      0/5       2/3    1/3      0/3

*Fraction is the number of mutations in category / total number above or below 2.4 bits. Mutant genes included APOAII, APOB, APOCII, APOE, CBS, CETP, LCAT, LIPA, LDLR, and LPL.

Page 31: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Generating information models of eukaryotic transcription factor cis-regulatory binding sites

Unique challenges:

• Variant sequences are not obvious
• Requires experimental determination and validation
• Effect of ascertainment bias
  – in published sites
  – in SELEX-generated sites
• Binding of a protein does not necessarily signify that it activates (or represses) transcription

Page 32: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Greek Hereditary Persistence of Fetal Hemoglobin (HBGA, -119G>A)

(A) Mutation in the CCAAT box of the A-gamma globin gene results in 1.4-fold increased expression of fetal globin mRNA into adulthood. The CCAAT box protein binding site is strengthened by 0.5 bits (or 1.41-fold) over wild type. (B) The binding site logo and distribution of Ri values of 171 binding sites in the Transfac Database (www.biobase.de) are indicated. Models of NF-E2, GATA1, and GATA2 protein binding sites were also constructed, but sites were not found in this interval (not shown).

[Figure: sequence walkers labeled 6.8 bits and 7.3 bits.]

Page 33: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

The Transcription Factor Binding Site Problem:

Bias in Models Derived from TRANSFAC data towards Consensus Sequences*

*Consensus sequences have the strongest binding, but are often not representative of the majority of sites.

Page 34: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Model development strategy

Page 35: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Refinement of the Pregnane X Receptor (PXR/RXRα) binding site model

Initial PXR/RXR Model. Published PXR/RXR binding sites (n=15; and flanking sequences) were multiply aligned by minimization of uncertainty. The -2 to +20 interval contained most of the information, was consistent with published binding studies, and was therefore used to define the site.

Page 36: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Competition Curves for Novel PXREs Identified by Model 1

To quantify the relative affinity of PXR/RXR, band density was plotted versus pmol competitor to determine the concentration of competitor required to deplete PXR/RXRα binding to the CYP3A4 proximal PXRE by 50%. Relative binding was normalized to the band intensity of the reactions with no added competitor as 100%.

Page 37: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Comparison of predicted and measured binding affinities for novel PXR/RXRα sites identified with the initial model

Predicted fold differences in binding were closer to densitometrically-determined differences when these weaker sites were added in Model 2.

                                 Ri (bits)          Minimum theoretical     Observed change
                                                    change in affinity      in affinity
Gene     Position (rel. to ATG)  Model 1  Model 2   Model 1  Model 2        (EMSA)
CYP3A4   -270                    17.3     18.0
CYP2B6   -8572                   15.0     17.9      4.92     1.07           4.4
UGT1A3   -6930                   10.9     17.2      84.4     1.74           4.4
UGT1A3   -8040                   10.7     16.5      97.0     2.83           3.7
UGT1A6   -9216                   9.9      14.3      168.9    13.0           29.6

[The original table also had a PXRE column showing the Model 2 derived walker for each site.]
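The "minimum theoretical change in affinity" columns appear to follow from the Ri columns as 2 raised to the Ri difference from the CYP3A4 site at -270. Treating that site as the reference is an assumption here, though it is consistent with the competition assay described earlier and with every value in the table. A short check:

```python
import math

# Ri values (bits) of the assumed reference site, CYP3A4 at -270, from the table.
REFERENCE = {"model1": 17.3, "model2": 18.0}

# (gene, position, Ri Model 1, Ri Model 2, tabulated fold Model 1, tabulated fold Model 2)
ROWS = [
    ("CYP2B6", -8572, 15.0, 17.9, 4.92, 1.07),
    ("UGT1A3", -6930, 10.9, 17.2, 84.4, 1.74),
    ("UGT1A3", -8040, 10.7, 16.5, 97.0, 2.83),
    ("UGT1A6", -9216, 9.9, 14.3, 168.9, 13.0),
]

def fold_change(ri_reference, ri_site):
    """Minimum fold difference in affinity implied by an Ri difference in bits."""
    return 2.0 ** (ri_reference - ri_site)

computed = []
for gene, pos, ri1, ri2, tab1, tab2 in ROWS:
    f1 = fold_change(REFERENCE["model1"], ri1)
    f2 = fold_change(REFERENCE["model2"], ri2)
    computed.append((f1, f2))
    print(f"{gene} {pos}: Model 1 {f1:.2f} (table {tab1}), Model 2 {f2:.2f} (table {tab2})")
```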

Page 38: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Model 2 Characteristics

[Figure: (A) Alignment of published + validated PXREs; (B) Histogram; (C) Sequence logo.]

Page 39: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Scans of CYP3A4 and CYP2B6 promoters

Each promoter was scanned with PXR/RXR model 2. Ri values are plotted versus the position of the PXRE in the CYP3A4 gene or the CYP2B6 gene. Ri values of sites on the antisense strand are shown upside down. Previously characterized PXR binding sites identified by the model are indicated in color.

Page 40: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Activation of the CYP2B6 Distal PXRE

Transient transfections with CYP2B6 and control CYP3A4 PXRE fusion constructs. Rifampin induced luciferase activity 4- to 5-fold in cells cotransfected with an expression plasmid for human PXR and CYP2B6-dPXRE(2X)-luc, and 2- to 3-fold in cells cotransfected with CYP3A4-pPXRE(2X)-luc. Rifampin had no effect on luciferase activity in cells transfected with the enhancerless reporter.

Average luciferase activity ± SD of three replicates from 3 independent transfections is shown.

Page 41: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

PXR/RXR Model 3

Weaker binding sites from well established PXR/RXRα target gene promoters (Ri < Rsequence) were validated and introduced into Model 3.

Page 42: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Novel validated binding sites in Model 4

Site name              Site name - Ri(b,l) matrix              Ri (bits)
CYP3A4-pPXRE(0/10G)    NG_000004.a148729g.a148739g             15.1
CYP3A-dNR1(0/10G)      NG_000004.t141178c.t141168c             16.8
CYP3A7-dNR2(0/10G)     NG_000004.a190205g.a190215g             17.6
CYP2B6-dPXRE(10G)      CYP2B6.a1446g                           16.2
UGT1A3b(0/10G)         AF297093.t137695c.t137685c              18.3
UGT1A3a(0/10G)         AF297093.a138805g.a138815g              14.9
GSTM1(0/10G)           AC000031.6.a1959g.a1969g                12.0
UGT1A1gtNR1(0/10G)     AF297093.1.t171676c.t171666c            7.1
UGT1A1b(0/10G)         AF297093.1.t165761c.t165751c            14.0
FMO4b(10G)             AL031274.1.a57947g                      11.0
catalase(0/10G)        AL035079.14.t43503g.a43513g             14.6
NOS2A(1A)              chr17_27002541-27012540.c8336t          12.9
NOS2A(11A)             chr17_27002541-27012540.c8326t          10.5
MAOBd(0/10G)           Z95125.t36576c.t36566c                  11.1

These 14 binding sites are not present in the Nov 02 human genome draft!
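The matrix site names in this table look like a sequence identifier followed by one or more single-base substitutions written as reference base, coordinate, variant base (for example t171676c). That reading is an assumption; the slide does not define the notation. Under that assumption, a small parser could split the names like this:

```python
import re

# One substitution token: reference base, coordinate, variant base (assumed notation).
SUBSTITUTION = re.compile(r"^([acgt])(\d+)([acgt])$")

def parse_site_name(name):
    """Split 'ID.sub1.sub2' into the identifier and a list of (ref, pos, alt)."""
    parts = name.split(".")
    substitutions = []
    # Peel substitution tokens off the end; identifiers may themselves contain dots.
    while parts:
        match = SUBSTITUTION.match(parts[-1])
        if match is None:
            break
        ref, pos, alt = match.groups()
        substitutions.append((ref, int(pos), alt))
        parts.pop()
    substitutions.reverse()
    return ".".join(parts), substitutions

for example in ("AF297093.1.t171676c.t171666c", "CYP2B6.a1446g"):
    print(parse_site_name(example))
```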

Page 43: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Possible significance of novel sites

• Not present in reference sequence, but they are polymorphisms or mild mutations
  – Advantage is that binding is not abrogated, but reduced, i.e., gene is less PXR/RXR responsive.
  – Possible “wobble” code for regulatory elements
• Ancestral binding sequence present in primate lineage
  – PXR/RXR mutation rate is slower than cis-regulatory element; protein retains ability to recognize sequences that are no longer present
  – This could explain why heterologous cross-species transfections are faithfully regulated.

Page 44: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Development of a Xenobiotic biosensor based on the information theory-derived optimal site

[Figure: bar chart of Firefly RLU/Renilla RLU (scale 0 to 14) for PXREv2-OPT(2X)-luc and CYP3A4-pPXRE(2X)-luc, each treated with DMSO or 10 µM Rifampin.]

HepG2 cells were transiently transfected with 100 ng luciferase reporter, 5 ng pRL-CMV and 25 ng pSG5-hPXRΔATG with Lipofectamine Plus. After treatment for 24 hours with 10 µM Rifampin or 0.1% DMSO (solvent), cells were harvested and Dual-luciferase assays were performed. Results are the average of three separate wells transfected and treated in parallel.

Page 45: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Architecture of the Delila Genome System

Page 46: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Performance metrics

Page 47: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Histogram of binding site strengths for sites in genome scan >10 bits

Page 48: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Delila-Genome Visualization Tools

Page 49: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Visualization of successive genome scans of PXR/RXRα binding site models

Page 50: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Monitoring PXR/RXR refinement through complete genome promoter scans

Page 51: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Development and Experimental Refinement of NFkB p65/p50 Binding Site Model

Panel 1. Logos for NFkB p50/p65 binding sites. (A) Model 2 based on 55 published and 8 experimentally determined binding sites. (B) Model 3 based on 55 published and 20 experimentally determined binding sites. Insets are histogram distributions of Ri values of sites comprising each model.

Page 52: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

CYP2D6 Promoter Mutation Analysis of NFkB p65/p50 Binding Site

[Figure: sequence walkers for the CYP2D6 C and G alleles: “C allele” 3.3 bits; “G allele” -0.8 bits.]

The -1496C allele contains a weak p50/p65 site (–1495 to –1508; Ri = 3.3 bits) that is abolished (Ri < 0) in the G variant. These alleles each also contain p50 homodimer binding sites on opposite strands; however, the C allele is predicted to bind with greater affinity (3.5 vs. 2.7 bits; 1.6 fold difference). The higher CYP2D6 activity observed for the –1496G allele may be due to reduced binding and repression of CYP2D6 expression by NF-kB p50 homodimers.

Page 53: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Future efforts

• Automate binding site validation

• Genomic signature of PXR/RXRα – target genes

• (Hypothesis-based microarray studies of ligand-induced gene expression)

Page 54: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Automated binding site validation: microtiter plate immunoassay

• Covalently link reference oligo to plate
• Bind synthetic PXR/RXRα ± competitor oligo*
• Bind 1° RXRα (or PXR) antibody
• Detect with 2° antibody/HRP
• (Automated with Biomek 2000 workstation)

*Competitor oligos are detected in PXR/RXRα target genes and exhibit Ri values that are within ±2 bits of the reference oligo.

Page 55: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Genomic analysis to identify genes regulated by transcription factors:

• Requires robust binding site model
• Genomic signature should delineate differences between regulated and constitutively expressed genes:
  – Define promoter interval
  – Binding site strength
  – Densities of sites
  – Organization of sites

Page 56: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

“NF-kB binding sites” in gene promoters

[Figure: Ri (0 to 16) versus position (-10000 to 0) for genes regulated by NF-kB and unregulated genes. Legend: Ri-reg (n=8), Ri-unreg (n=3). The -400 bp position is marked.]

Page 57: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

NF-kB Binding Sites in Upregulated Genes

[Figure: NF-kB binding sites for promoters of upregulated genes scanned by model 3. Ri (0 to 16) versus position (-400 to 0). Legend: INF-beta, LCAM, E-Selectin, Lymphotoxin, TNF-alpha, IL-2, GM-CSF, Urokinase. Annotation: Ri = 4.0.]

Page 58: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

“NF-kB binding sites” in genes not known to be regulated by NF-kB

[Figure: Ri (0 to 16) versus position (-400 to 0). Legend: GAPDS, GAPD, VEGF. Annotation: Ri = 1.3.]

Page 59: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Criteria for scanning chromosomes 21/22 with NF-kB Model 3:

• Average information threshold of >4 bits. Of 548 promoter intervals (400 bp each), the mean Ri value of sites exceeded the threshold in 138 promoters on the transcribed strand and 137 on the antisense strand. 37% of the genes on chromosome 21 would be NF-kB targets!! Also, multiple weak binding sites with low Ri values can falsely exclude genes containing strong binding sites. This genomic signature has very LOW specificity.

• Eliminate promoters with only weak binding sites (Ri < Rsequence). This signature identifies a smaller set of genes: 11 and 19, respectively, on chromosomes 21 and 22. Several expected cytokine genes are not identified with this genomic signature. These criteria introduce bias towards the consensus sequence (or an incomplete model). This approach appears to lack adequate sensitivity.
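The two candidate signatures can be contrasted with a small sketch. The promoter names, the Ri lists, and the Rsequence value of 10 bits are all hypothetical, chosen only to show how each rule behaves; they are not results from the chromosome 21/22 scans.

```python
def mean_threshold_signature(site_ris, threshold=4.0):
    """First criterion: mean Ri of the sites in a promoter exceeds the threshold."""
    return bool(site_ris) and sum(site_ris) / len(site_ris) > threshold

def strong_site_signature(site_ris, rsequence):
    """Second criterion: keep a promoter only if some site has Ri >= Rsequence."""
    return any(ri >= rsequence for ri in site_ris)

HYPOTHETICAL_RSEQUENCE = 10.0   # assumed value, not the real NF-kB model's

promoters = {
    "one_strong_many_weak": [12.0, 0.5, 0.8, 1.0],   # mean 3.575 bits
    "two_moderate": [4.5, 5.0],                       # mean 4.75 bits
}

for name, ris in promoters.items():
    print(name,
          mean_threshold_signature(ris),
          strong_site_signature(ris, HYPOTHETICAL_RSEQUENCE))
```

In this toy case the mean rule drops the promoter that contains a strong site because its weak sites pull the average down, while the strong-site rule drops the promoter that has only moderate sites, which mirrors the specificity and sensitivity problems noted above.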

Page 60: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Genomic signature determination for PXR/RXR with machine learning approach

[Figure: flowchart. Components: promoter region input (true positives and true negatives for training/validation; unknowns for prediction); genome scan; frequency distribution of binding strengths and distances from TSS; Markov Cluster Algorithm; clusters of sites; hybrid neural network; positive/negative prediction; experimental confirmation (positive or negative), with confirmed results added back for refinement of the genomic signature.]

Page 61: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Predictions of Binding Strength Network

• Network Input: Frequency distributions of binding sites based on 5 bit-wide bins

• Trained with 15 PXR/RXR responsive and 15 non-responsive promoter regions

• Results of testing 9 positive and 22 negative promoter regions:
  – <TP,FP,TN,FN> = <7,4,18,2>
  – Sensitivity = 77.8%
  – Specificity = 81.8%
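The reported sensitivity and specificity follow directly from the confusion counts on the slide. A quick check, using the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)):

```python
def sensitivity(tp, fn):
    """Fraction of truly positive promoter regions that were predicted positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of truly negative promoter regions that were predicted negative."""
    return tn / (tn + fp)

# Counts from the slide: <TP, FP, TN, FN> = <7, 4, 18, 2>
tp, fp, tn, fn = 7, 4, 18, 2

sens = 100 * sensitivity(tp, fn)
spec = 100 * specificity(tn, fp)
print(f"Sensitivity = {sens:.1f}%, Specificity = {spec:.1f}%")
```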

Page 62: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

In conclusion...

• Genetic variation in binding sites can be comprehensively modeled by information theory.

• Information is related to binding energy and can be used to rank-order binding strengths.

• Beware of experimental bias towards strong binding sites. Information theory can be used to develop and refine binding site models that are representative of the range of binding strengths found in the genome.

• Robust binding site models are a prerequisite for accurate mutation/polymorphism analysis and for comprehensive identification of binding sites in the genome.

Page 63: Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Contributors

Children’s Mercy Hospital and Clinics:
• Sashidar Gadiraju, Stan Svojanovsky
• J. Steven Leeder, Carrie Vyhlidal, Ivy Hurwitz

SICE, University of Missouri-Kansas City:
• Deendayal Dinakarpandian, Saumil Mehta

St. Jude’s Children’s Research Hospital:
• Erin Schuetz

University of Hamburg:
• Yskert von Kodolitsch

NCI:
• Tom Schneider

Support

Merck Genome Research Foundation
PHS ES10855-02