
Computational Analysis of Tissue Specificity: Decoding Promoters

Chris Stoeckert, Ph.D.

Center for Bioinformatics & Dept. of Genetics, University of Pennsylvania

Nov. 17, 2004, Department of Physiology Seminar Series

University of Kentucky

What is the code for determining where (and when) a gene is expressed?

http://molbio.info.nih.gov/molbio/gcode.html

[Figure: promoters carrying different arrangements of TFBS1, TFBS2, TFBS3 and TFBS4, each driving Expression]

TFBS = transcription factor binding site

Goal is to Identify Combinations of TFBS (cis-Regulatory Modules or CRMs) that Specify Tissue Expression

From Wasserman & Sandelin, NRG 2004

A Genomics Unified Schema approach to understanding gene expression

Dave Barkan, Jonathan Crabtree, Shailesh Date, Steve Fischer, Bindu Gajria, Thomas Gan, Greg Grant, Hongxian He, John Iodice, Li Li, Junmin Liu, Matt Mailman, Elisabetta Manduchi, Joan Mazzarelli, Debbie Pinney, Angel Pizarro, Mike Saffitz, Jonathan Schug, Chris Stoeckert, Trish Whetzel

Computational Biology and Informatics Laboratory (CBIL), Penn Center for Bioinformatics

Stem Cell Gene Anatomy Project

Beta Cell Biology Consortium

Plasmodium Genome Resource

Allgenes (human and mouse DoTS)

GUS

Core, SRES, TESS, RAD, DoTS

Oracle RDBMS

Object Layer for Data Loading

Java Servlets

GUS is an open source project

• Sanger Institute

• U. Georgia

• Flora Centromere Database

• U. Chicago

• U. Penn

• U. Toronto

• Phytophthora sojae genome, Virginia Bioinformatics Institute

GUS (Genomics Unified Schema) http://www.gusdb.org

Namespace   Domain                    Features
RAD         Gene Expression           MIAME/MAGE-OM
DoTS        Sequence and annotation   EST clusters and gene models
Core        Data Provenance           Documentation
Sres        Shared Resources          Ontologies
TESS        Gene Regulation           TFBS organization

RAD

EST clustering and assembly: DoTS

Genomic alignment and comparative sequence analysis

Identify shared TF binding sites: TESS

BioMaterial annotation: SRES

DoTS integrates sequence annotation including where expressed


kidney, mammary gland, brain, liver, colon, lung, retina, spinal cord, rhabdomyosarcoma cell line

brain, liver, kidney, lung, melanocyte

embryo, fetus, kidney, limb, retina, salivary gland

brain, rhabdomyosarcoma cell line, kidney

Sorbs1: sorbin and SH3 domain containing 1
- GO molecular function: actin binding and protein kinase binding
- GO cellular component: actin cytoskeletal stress fibers

RAD Contains Detailed Expression Experiments Including Tissue Surveys

TESS Allows You to Find Potential TFBS

But there are too many potential sites!

Promoter Features Related to Tissue-Specificity as Measured by Shannon Entropy

Jonathan Schug (1), Winfried-Paul Schuller (2), Claudia Kappen (2), J. Michael Salbaum (2), Maja Bucan (3), Christian J. Stoeckert Jr. (1)

1. Center for Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA

2. Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska, 68198, USA

3. Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA

What is a Liver-Specific Gene?

*http://expression.gnf.org/

Assessing Tissue Specificity of Genes Using Shannon Entropy

Shannon entropy is a measure of the uniformity of a discrete probability distribution. Given a set of T tissues, H ranges from 0 for a gene expressed in a single tissue to lg T for a gene expressed uniformly in all T tissues. It works well as a measure of overall tissue-specificity.

To measure specificity to a particular tissue, we combine entropy H and the relative expression level in that tissue to get Q. Q = 0 for a tissue when the gene is expressed only in that tissue, and Q = 2 lg T for any tissue when expression is uniform across all T tissues.

(a) Very specific liver expression: H=1.6 and Qliver = 2.2, 98612_at cytochrome p450

(b) Near uniform expression : H=4.3 and Qliver=10.2, 104391_s_at Clcn7 chloride channel 7
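
The two scores can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes that relative expression p_t is a tissue's level divided by the gene's total across tissues, and that Q = H - lg(p_t), a form that gives Q = 0 for a gene expressed in a single tissue and Q = 2 lg T under uniform expression. The function names are illustrative.

```python
import math

def relative_expression(levels):
    """Turn raw expression levels across tissues into a probability distribution."""
    total = sum(levels)
    return [x / total for x in levels]

def shannon_entropy(p):
    """H = -sum(p_t * lg p_t); tissues with zero expression contribute nothing."""
    return sum(-x * math.log2(x) for x in p if x > 0)

def q_score(p, t):
    """Assumed form Q = H - lg(p_t); infinite if the gene is absent from tissue t."""
    if p[t] == 0:
        return math.inf
    return shannon_entropy(p) - math.log2(p[t])

# Toy profiles over T = 8 tissues (made-up numbers, for illustration only).
uniform = relative_expression([5.0] * 8)
specific = relative_expression([0, 0, 12.0, 0, 0, 0, 0, 0])
print(shannon_entropy(uniform), q_score(uniform, 2))
print(shannon_entropy(specific), q_score(specific, 2))
```

Low Q therefore requires both a peaked profile (low H) and a large share of the expression in the tissue of interest.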

Agreement between Microarrays and ESTs on Tissue Specificity

Specificity Characteristics of Tissues

Tissue       Probe Set ID   H     Q     RefSeq      Description
Amygdala     96055_at       3.2   5.8   NM_031161   cholecystokinin
Amygdala     93178_at       2.7   5.8   NM_019867   neuronal guanine nucleotide exchange factor
Amygdala     93273_at       3.7   5.8   NM_009221   synuclein, alpha
Amygdala     92943_at       3.5   6.0   NM_008165   glutamate receptor, ionotropic, AMPA1 (alpha 1)
Amygdala     95436_at       3.3   6.1   NM_009215   somatostatin
Lymph Node   98406_at       2.7   4.0   NM_013653   chemokine (C-C motif) ligand 5
Lymph Node   98063_at       1.6   4.1   -           glycosylation dependent cell adhesion molecule 1
Lymph Node   99446_at       2.5   4.1   NM_007641   membrane-spanning 4-domains, subfamily A, member 1
Lymph Node   92741_g_at     3.3   4.5   -           immunoglobulin heavy chain 4 (serum IgG1)
Lymph Node   102940_at      2.8   4.6   NM_008518   lymphotoxin B
Liver        94777_at       1.3   2.1   -           albumin 1
Liver        101287_s_at    1.6   2.2   NM_010005   cytochrome P450, 2d10
Liver        99269_g_at     1.5   2.2   NM_019911   tryptophan 2,3-dioxygenase
Liver        100329_at      1.4   2.3   NM_009246   serine protease inhibitor 1-4
Liver        94318_at       1.6   2.3   NM_013475   apolipoprotein H

CpG Islands are Associated with the Start Sites of Genes with Wide-Spread Expression

[Figure: fraction of promoters with a CpG island (y-axis, 0 to 0.9) versus entropy (x-axis, 0 to 6), for human and mouse]

CpG island = minimum 200 bp, C+G > 0.6, obs./expect. >= 0.5
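
The CpG island criteria can be written as a simple check. This is a sketch under stated assumptions: the observed/expected ratio is taken to be the commonly used (number of CpG x length) / (number of C x number of G), which the slide does not spell out, and the test is applied to one whole sequence rather than to a sliding window. The function names are illustrative.

```python
def cpg_stats(seq):
    """Return (C+G fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Assumed obs/exp definition: (#CpG * N) / (#C * #G).
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def is_cpg_island(seq, min_len=200, min_gc=0.6, min_obs_exp=0.5):
    """Apply the thresholds from the slide: >= 200 bp, C+G > 0.6, obs/exp >= 0.5."""
    if len(seq) < min_len:
        return False
    gc_fraction, obs_exp = cpg_stats(seq)
    return gc_fraction > min_gc and obs_exp >= min_obs_exp

print(is_cpg_island("CG" * 100))  # GC-rich, CpG-dense toy sequence
print(is_cpg_island("AT" * 100))  # AT-only toy sequence
```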

Tissue-Specific and Non-Specific Promoters Have Distinct Base Compositions

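[Figure: base composition of promoters, shown separately for CpG+ and CpG- promoters, in two groups: multi-tissue (H >= 4.4) and tissue-specific (H <= 3.5)]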

TATA Boxes are Associated with Tissue-Specific Genes

[Figure: % with TATAA box (y-axis, 0 to 90) by Q-value bin (x-axis: 0-2, 2-4, 4-6, 6-8, 8-10, >10), for human and mouse. Bar annotations: (7/9), (8/9), (4/8), (8/28), (16/80), (3/8), (10/28), (16/80), (4/31), (2/27), (9/35), (3/27). p-values shown: p h = 0.13, p m = 0.15; p h = 0.00007, p m = 0.00087; p h = 0.00005, p m = 0.00001]

Genes with TATAA box: human 18.8%, mouse 22.9%

Functional relationships of promoter classes based on over-represented GO terms (EASE)

[Table: over-represented GO Cellular Component and Biological Process terms per promoter class; the original also has Human Only and Mouse Only columns]

CGI-/TATA+
• Cellular Component: extracellular, extracellular space; microsome, vesicular fraction; intermediate filament (cytoskeleton)
• Biological Process: response to stimulus; organismal physiological process, inflammatory response, innate immune response, cell motility, defense response, response to pest/pathogen/parasite, response to wounding, response to biotic stimulus, cell-cell signaling, morphogenesis, digestion, muscle contraction; chemotaxis, taxis, response to chemical substance, response to abiotic stimulus, muscle development

CGI+/TATA-
• Cellular Component: cell, cytoplasm, intracellular, mitochondrion; nucleus, ribonucleoprotein complex
• Biological Process: nucleobase, nucleoside, nucleotide and nucleic acid metabolism; intracellular transport; metabolism; protein transport; intracellular protein transport; RNA processing; RNA metabolism; cell cycle; mitotic cell cycle

CGI-/TATA-
• Cellular Component: (integral to) (plasma) membrane; extracellular, extracellular space
• Biological Process: organismal physiological process, defense response, immune response, response to biotic stimulus, response to stimulus; response to pest/pathogen/parasite, cell communication, response to wounding, cellular defense response, signal transduction; complement activation, complement activation (classical pathway), humoral defense mechanism (sensu Vertebrata), humoral immune response

First Clues: TATA Box indicates Tissue Specific; CpG indicates Wide Spread Expression

Additional clues: CpG-/TATA+ indicates high expression and secreted proteins, while CpG+/TATA- indicates cellular and mitochondrial proteins.

Pattern Analysis of Pancreas Gene Promoters

Guang (Gary) Chen, Jonathan Schug

Identifying TFBMs – Method Pipeline

[Diagram: method pipeline]
• Shannon Entropy applied to the GNF Gene Expression Atlas: Gene Lists with Tissue Specificity
• DBTSS: Sequences around Transcription Start Sites
• Teiresias: Patterns
• Pattern Clustering: Pattern Clusters (PWM)
• Represent Seqs with PWMs: Gene Clusters
• Gene Ontology (GO): GO Category Analysis
• Comparative Genome Analysis

Starting with a gene expression tissue survey, pancreas-specific genes with common TFBS and biological processes are identified

Tissue Specific Regulatory Modules Associated with GO Biological Process

Methods & Resources (Cont.)

– DBTSS: Database of Transcriptional Start Sites
• Based on 400,225 and 580,209 human and mouse full-length cDNA sequences, DBTSS contains the genomic positions of the transcriptional start sites and the adjacent promoters for 8,793 and 6,875 human and mouse genes, respectively. http://dbtss.hgc.jp/
• Yutaka Suzuki, Riu Yamashita, Kenta Nakai and Sumio Sugano (2002). DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res. 30: 328-331.

– Pancreas genes are chosen based on efforts to understand pancreatic development and function (EPConDB)
• 500bp upstream for preliminary study
• 159 human (mouse) pancreas-specific genes (Qislet <7, positive (p)) & 159 human (mouse) ubiquitous genes (Qislet >10, negative (n))

– This approach can be applied to any tissue to study the tissue specificity of transcription factor binding motifs (TFBMs) & Modules


Method - Pattern Discovery - Teiresias

Teiresias Patterns

• A Teiresias pattern P is an <L,W> pattern (with L ≤ W) if P contains at least L residues and every subpattern of P containing L residues is at most W symbols in length.

Pattern   <L,W>
ACTGGC    <L=?, W=6>   <L=?, W=4>   <L=6, W=6>
A.C.GT    <L=?, W=6>   <L=?, W=4>   <L=4, W=6>

*Rigoutsos, I. and A. Floratos, Combinatorial Pattern Discovery in Biological Sequences: the TEIRESIAS Algorithm. Bioinformatics, 14(1), January 1998.
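
The <L,W> condition can be checked mechanically. The sketch below implements only the definition as worded on this slide (at least L residues, and every run of L consecutive residues spans at most W symbols); it is not the TEIRESIAS discovery algorithm. It assumes "." is the wildcard, that the dotted slide example reads A.C.GT, and it does not handle bracketed residue classes. The function name is illustrative.

```python
def is_lw_pattern(pattern, L, W, wildcard="."):
    """Check the <L,W> density condition for a pattern of residues and wildcards."""
    positions = [i for i, ch in enumerate(pattern) if ch != wildcard]
    if L > W or len(positions) < L:
        return False
    # A pattern is assumed to begin and end with a residue.
    if pattern[0] == wildcard or pattern[-1] == wildcard:
        return False
    # Every subpattern holding exactly L residues must be at most W symbols long.
    return all(
        positions[i + L - 1] - positions[i] + 1 <= W
        for i in range(len(positions) - L + 1)
    )

print(is_lw_pattern("ACTGGC", 6, 6))
print(is_lw_pattern("A.C.GT", 4, 6))
```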

Identifying TFBMs - Pattern Distribution

With 117 human pancreas-specific genes (Qpancreas <6.5, positive (p)) and 117 human ubiquitous genes (Qpancreas >10, negative (n)), roughly 90,000 patterns were discovered in the 1kb+/200bp- promoter region. Patterns with ∆p-n >20 (in blue box) are more likely to be pancreas-specific.

Each point represents a pattern with occurrence in positive data set (y-axis) and negative data set (x-axis)

For each pattern (x-axis), the occurrence difference ∆p-n (y-axis) between positive (Q<6.5) and negative (Q>10) data set

Method - Pattern Clustering

[Diagram: Patterns → Smith-Waterman → distance of pattern pair → Hierarchical or K-Median clustering (with number of clusters) → Pattern Clusters (PWM)]

Results - Pattern Clustering

Clustering Results (human, ∆p-n>20, 72 patterns)

Identifying TFBMs

72 patterns (human, ∆p-n >20) were grouped into 18 pattern clusters, 6 of which matched known binding sites when searched against TRANSFAC.

Identified known binding sites associated with human pancreas genes

AP2ALPHA, MEF2, SRY, NKX62, CAP_01, HOXA3

Identifying TFBM

By conducting comparative genomic analysis, some discovered TFBMs are conserved between human and mouse pancreas orthologs

[Figure: conserved motifs AP2ALPHA, MEF2, NKX62, CAP_01, HOXA3]

Gene Clustering - Based on TFBMs

Pancreas-specific genes can be clustered according to the presence or absence of conserved promoter motifs

Upstream sequences can be characterized by pattern occurrences, which can then be used to calculate pairwise similarities between sequences. For simplicity, we used a boolean model that records the presence or absence of 7 conserved patterns. Centered Pearson correlation was used to calculate similarity, and 117 pancreas-specific genes (Q<6.5) were grouped into 10 clusters with hierarchical clustering.
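
The boolean model and the similarity measure can be sketched as follows. This is a minimal illustration, not the authors' code: the motif names are the seven column labels shown for Cluster 6, the per-gene hit sets are made up, and treating a constant vector as having zero correlation is a convention chosen here. A distance such as 1 - r could then be passed to any hierarchical clustering routine.

```python
import math

MOTIFS = ["AP2ALPHA", "MEF2", "NKX62", "HOXA3", "CCTGTT", "CTGCTC", "CAP"]

def presence_vector(hits, motifs=MOTIFS):
    """Boolean model: 1 if the motif occurs in the promoter, else 0."""
    return [1 if m in hits else 0 for m in motifs]

def centered_pearson(x, y):
    """Pearson correlation after subtracting each vector's mean."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # convention for constant vectors (all present or all absent)
    return cov / math.sqrt(vx * vy)

# Hypothetical motif hits for three promoters.
gene_a = presence_vector({"AP2ALPHA", "MEF2", "CAP"})
gene_b = presence_vector({"AP2ALPHA", "MEF2", "CAP"})
gene_c = presence_vector({"NKX62", "HOXA3"})
print(centered_pearson(gene_a, gene_b))
print(centered_pearson(gene_a, gene_c))
```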

Cluster 6 (motif columns: AP2ALPHA, MEF2, NKX62, HOXA3, CCTGTT, CTGCTC, CAP)

RefSeq      Locus   Name    Description
NM_001504   2833    CXCR3   chemokine (C-X-C motif) receptor 3
NM_000380   7507    XPA     xeroderma pigmentosum, complementation group A
NM_000278   5076    PAX2    paired box gene 2
NM_003987   5076
NM_003988   5076
NM_003989   5076
NM_003990   5076
NM_002728   5553    PRG2    proteoglycan 2 (natural killer cell activator)
NM_013230   934     CD24    CD24 antigen (small cell lung carcinoma cluster 4 antigen)

Gene Clustering – GO Category

Assign Gene Clusters to GO Category

To interpret clustering results, we used EASE to find the significant biological features of a gene cluster of interest through the GO Biological Process.

Cluster   GO Biological Process   Gene Name   Description
c2        Digestion               AMY1A       amylase, alpha 1A; salivary
c2        Digestion               CEL         carboxyl ester lipase, bile salt-dependent lipase, cholesterol esterase; fetoacinar pancreatic protein
c2        Digestion               CLPS        colipase, pancreatic
c2        Digestion               CTRB1       chymotrypsinogen B1
c2        Digestion               CTRL        chymotrypsin-like
c4        catabolism              CPA2        carboxypeptidase A2 (pancreatic)
c4        catabolism              DHPS        deoxyhypusine synthase
c4        catabolism              MEPA1       meprin A, alpha (PABA peptide hydrolase)
c4        catabolism              CPB1        carboxypeptidase B1
c4        catabolism              ELA3A       elastase 3A, pancreatic (protease E)
c4        catabolism              ELA2A       pancreatic elastase IIA
c6        response to stimulus    CD24        CD24 antigen (small cell lung carcinoma cluster 4 antigen)
c6        response to stimulus    CXCR3       chemokine (C-X-C motif) receptor 3
c6        response to stimulus    PAX2        paired box gene 2
c6        response to stimulus    PRG2        proteoglycan 2, bone marrow (natural killer cell activator, eosinophil granule major basic protein)
c6        response to stimulus    XPA         xeroderma pigmentosum, complementation group A
c8        phosphorus metabolism   PDGFRA      platelet-derived growth factor receptor, alpha polypeptide
c8        phosphorus metabolism   PRDX4       thioredoxin peroxidase
c8        phosphorus metabolism   PTP4A3      protein tyrosine phosphatase type IVA, member 3
c9        Transport               SLC12A3     solute carrier family 12
c9        Transport               CACNA1E     calcium channel, voltage-dependent, alpha 1E subunit
c9        Transport               TCOF1       Treacher Collins-Franceschetti syndrome 1
c9        Transport               SLC35A3     solute carrier family 35 (UDP-N-acetylglucosamine (UDP-GlcNAc)

More Clues: Known and novel TFBS found associated with genes expressed in the pancreas

• See conservation of sites between human and mouse
• Associated with digestion, catabolism, and response to stimulus GO biological processes

Discovering regulatory modules by creating profiles for Gene Ontology Biological Processes based on tissue-specificity scores

Elisabetta Manduchi, Jonathan Schug

If we focus on biological processes that are predominantly taking place in a given tissue, can we identify regulatory modules common to genes involved in these processes?

[Diagram: Tissue, Biological Process, Genes]

For a given tissue survey, we attach “tissue-specificity” profiles to gene sets defined by GO BPs, based on the ranked lists of genes in each tissue according to increasing Q.

• To this end, we use an Enrichment Score (ES) in the spirit of that described in Mootha et al. (2003), as a measure of tissue-specificity for that gene set.

• The ES turns out to be equivalent (i.e. equal up to a multiplicative constant) to a Kolmogorov-Smirnov statistic.
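
A running-sum enrichment score of this kind can be sketched as follows. This is a minimal illustration in the style of Mootha et al. (2003), not necessarily the exact statistic used here: the step sizes sqrt((N-G)/G) for a gene in the set and -sqrt(G/(N-G)) otherwise are assumed, with N genes in the ranked list and G of them in the gene set, and the list is assumed to be ranked by increasing Q so that the most tissue-specific genes come first. The function name is illustrative.

```python
import math

def enrichment_score(ranked_genes, gene_set):
    """Maximum of a running sum over the ranked list (Kolmogorov-Smirnov-like)."""
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in gene_set)
    if g == 0 or g == n:
        return 0.0  # no contrast between the set and the rest of the list
    hit = math.sqrt((n - g) / g)    # step up when the gene belongs to the set
    miss = -math.sqrt(g / (n - g))  # step down otherwise; the steps sum to zero
    running = best = 0.0
    for gene in ranked_genes:
        running += hit if gene in gene_set else miss
        best = max(best, running)
    return best

# Toy ranking by increasing Q, with a hypothetical GO BP gene set.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(enrichment_score(ranked, {"g1", "g2"}))  # set concentrated at the top
print(enrichment_score(ranked, {"g5", "g6"}))  # set concentrated at the bottom
```

A gene set whose members cluster at the top of a tissue's ranked list gets a high score for that tissue, and the vector of scores across tissues forms the set's tissue-specificity profile.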

Application to a Human Tissue Survey

• The following results refer to the application of the methods described above to the GeneNote tissue survey:
– 12 tissues in duplicate on the HGU95 Affymetrix chip set (Av2, B-E).
• We looked at the 2316 GO BPs that we could map to probe sets (using version 1.5.1 of the Bioconductor GO and hgu95XXX metadata R packages).

GO BPs having significantly specific profiles for each tissue can be identified

[Figure panels: significant in liver; significant in heart and skeletal muscle]

Excerpt of cluster of GO BPs based on their tissue-specificity profiles (up in spinal cord/brain)

Focusing on steroid metabolism

A. After mapping probe sets to RefSeqs and retrieving their upstream sequences from DBTSS, we assembled a set of 63 promoter sequences, which was our positive set.

B. We generated 5 negative sets, each consisting of 315 sequences, by randomly scrambling each of the positive set sequences.

C. We ranked each of 666 Transcription Factor Binding Sites (TFBSs) from TRANSFAC (represented by position matrices) in terms of their ability, measured by average ROC area, to discriminate between the positive set and the negative sets.

D. We then selected high ranking TFBSs from (C) and high ranking TFBSs from an independent study focusing on liver specificity and formed all possible pairs between these two sets.

E. These pairs were ranked according to their discriminative ability and on the basis of the distance between their components in the positive hits. Optimal parameters (distance and individual TFBS match scores) were selected for each pair scoring at the top.

F. By assessing the performance over a test set composed of mouse promoter sequences, we found 2 candidate CRMs (involving 3 and, respectively, 4 TFBSs) with an over-representation of steroid metabolism genes.
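
Steps B and C can be sketched as below. This is a simplified illustration, not the authors' procedure: `score` is a hypothetical callable standing in for a position-matrix match score on a sequence, each negative set here holds one scramble per positive sequence (the slide reports 315 sequences per negative set), and the ROC area is computed as the fraction of positive/negative pairs ranked correctly, with ties counted as one half.

```python
import random

def scramble(seq, rng):
    """Shuffle the bases of a sequence, preserving its composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def roc_area(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def average_roc_area(score, positives, n_negative_sets=5, seed=0):
    """Average ROC area of one scoring function over several scrambled negative sets."""
    rng = random.Random(seed)
    pos_scores = [score(s) for s in positives]
    areas = []
    for _ in range(n_negative_sets):
        neg_scores = [score(scramble(s, rng)) for s in positives]
        areas.append(roc_area(pos_scores, neg_scores))
    return sum(areas) / len(areas)

# Toy scoring function: 1 if the (made-up) site TATAAA is present, else 0.
toy_score = lambda s: 1.0 if "TATAAA" in s else 0.0
print(average_roc_area(toy_score, ["GCGCTATAAAGCGC", "CCTATAAAGGCCGG"]))
```

Ranking every matrix by this average gives the ordering from which high-ranking TFBSs are drawn in step D.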

Focusing on steroid metabolism

Example of production hits to steroid metabolism mouse promoter sequences

No. mouse promoter sequences: 6875. Of these, 50 belong to genes mapping to steroid metabolism.

No. production hits: 257. Of these, 8 belong to genes mapping to steroid metabolism.

Production TFBSs: {FOXD3_01, GKLF_01, HFH1_01, MADSA_Q2}

Parameters:
• max distance = 130
• FOXD3_01 min score = 9.934705
• GKLF_01 min score = 10.815614
• HFH1_01 min score = 9.442617
• MADSA_Q2 min score = 8.246301

[Figure: TFBS hits positioned relative to the TSS; green = forward strand, red = reverse strand, shading indicates strength]
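
The degree of over-representation implied by the hit counts (8 of 257 hits versus 50 of 6875 promoters) can be worked out directly. The sketch below computes the fold enrichment and, as an illustrative choice not reported on the slide, a one-sided hypergeometric tail probability. The function names are illustrative.

```python
from math import comb

def fold_enrichment(hits_in_set, hits, set_size, total):
    """Ratio of the set's share among hits to its share among all promoters."""
    return (hits_in_set / hits) / (set_size / total)

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n items are drawn without replacement from N, K of them marked."""
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return favourable / comb(N, n)

# Counts from the slide: 6875 promoters, 50 in steroid metabolism, 257 hits, 8 of them in the set.
print(round(fold_enrichment(8, 257, 50, 6875), 2))
print(hypergeom_tail(8, 257, 50, 6875))
```

By chance alone about 257 x 50 / 6875, or roughly 1.9, of the hits would be expected to map to steroid metabolism, against the 8 observed.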

More Clues: We can identify candidate CRMs from top-ranking GO Biological Processes for tissues

Identified a candidate CRM for steroid metabolism.

Summary

• GUS is a functional genomics database system used by a growing number of sites for genome and expression projects.

• Using expression data in GUS and entropy-based metrics, we can rank genes according to their tissue-specificity, learn promoter properties, and associate functional roles.

• In addition to general properties of tissue-specific promoters, we are beginning to identify combinations of motifs (i.e., regulatory modules) associated with expression in specific tissues.

Future Directions

• Refine analysis from genes to transcripts
• Refine analysis from organs to cells
• Apply approach to splicing
• Apply approach to developmental stage and differentiation state

Our goal is to make inferences of the form: "The gene set G shows specificity for tissue T and is regulated by module M in this context".

http://www.cbil.upenn.edu
