Computational Analysis of Tissue Specificity: Decoding Promoters
Chris Stoeckert, Ph.D.
Center for Bioinformatics & Dept. of Genetics
University of Pennsylvania
Nov. 17, 2004
Department of Physiology Seminar Series
University of Kentucky
What is the code for determining where (and when) a gene is expressed?
http://molbio.info.nih.gov/molbio/gcode.html
[Figure: promoters carrying different combinations of TFBS1, TFBS2, TFBS3, and TFBS4; label: Expression]
TFBS = transcription factor binding site
Goal is to Identify Combinations of TFBS (cis-Regulatory Modules or
CRMs) that Specify Tissue Expression
From Wasserman & Sandelin, NRG 2004
A Genomics Unified Schema approach to understanding gene expression
Dave Barkan, Jonathan Crabtree, Shailesh Date, Steve Fischer, Bindu Gajria, Thomas Gan, Greg Grant, Hongxian He, John Iodice, Li Li, Junmin Liu, Matt Mailman, Elisabetta Manduchi, Joan Mazzarelli, Debbie Pinney, Angel Pizarro, Mike Saffitz, Jonathan Schug, Chris Stoeckert, Trish Whetzel
Computational Biology and Informatics Laboratory (CBIL), Penn Center for Bioinformatics
Stem Cell Gene Anatomy Project
Beta Cell Biology Consortium
Plasmodium Genome Resource
Allgenes (human and mouse DoTS)
GUS
Core, SRES, TESS, RAD, DoTS
Oracle RDBMS
Object Layer for Data Loading
Java Servlets
GUS is an open source project
Sanger Institute
U. Georgia
Flora Centromere Database
U. Chicago
U. Penn
U. Toronto
Phytophthora sojae genome, Virginia Bioinformatics Institute
GUS (Genomics Unified Schema)
http://www.gusdb.org

Namespace   Domain                    Features
RAD         Gene Expression           MIAME/MAGE-OM
DoTS        Sequence and annotation   EST clusters and gene models
Core        Data Provenance           Documentation
Sres        Shared Resources          Ontologies
TESS        Gene Regulation           TFBS organization
RAD EST clustering and assembly
DoTS
Genomic alignment and comparative sequence analysis
Identify shared TF binding sites
TESS
BioMaterial annotation SRES
DoTS integrates sequence annotation including where expressed
kidney, mammary gland, brain, liver, colon, lung, retina, spinal cord, rhabdomyosarcoma cell line
brain, liver, kidney, lung, melanocyte
embryo, fetus, kidney, limb, retina, salivary gland
brain, rhabdomyosarcoma cell line, kidney
Sorbs1: sorbin and SH3 domain containing 1
- GO molecular function: actin binding and protein kinase binding
- GO cellular component: actin cytoskeletal stress fibers
RAD Contains Detailed Expression Experiments Including Tissue Surveys
TESS Allows You to Find Potential TFBS
But there are too many potential sites!
Promoter Features Related to Tissue-Specificity as Measured by Shannon Entropy
Jonathan Schug1, Winfried-Paul Schuller2, Claudia Kappen2, J. Michael Salbaum2, Maja Bucan3, Christian
J. Stoeckert Jr.1
1. Center for Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA
2. Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska, 68198, USA
3. Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA
What is a Liver-Specific Gene?
*http://expression.gnf.org/
Assessing Tissue Specificity of Genes Using Shannon Entropy
Shannon entropy is a measure of the uniformity of a discrete probability distribution. Given a set of T tissues, H ranges from 0 for a gene expressed in a single tissue to lg T for a gene expressed uniformly in all T tissues. It works well as a measure of overall tissue-specificity.
To measure specificity to a particular tissue, we combine the entropy H and the relative expression level in that tissue to get Q. Q = 0 for a tissue when the gene is expressed only in that tissue, and Q = 2 lg T for a typical tissue under uniform expression.
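As a rough sketch of the two scores, the Python below computes H and Q from a list of expression levels. It assumes Q = H - lg(p_t), where p_t is the gene's relative expression in tissue t; that form is an assumption chosen because it reproduces the endpoints stated above (Q = 0 for single-tissue expression, Q = 2 lg T for uniform expression). The function names and the toy expression vectors are illustrative only.

```python
import math

def tissue_entropy(expression):
    """Shannon entropy H (bits) of one gene's expression across tissues."""
    total = sum(expression)
    probs = [x / total for x in expression if x > 0]
    return sum(-p * math.log2(p) for p in probs)

def tissue_q(expression, tissue_index):
    """Tissue-specific score, assuming Q = H - log2(p_t)."""
    total = sum(expression)
    p_t = expression[tissue_index] / total
    if p_t == 0:
        return math.inf  # gene not detected in this tissue
    return tissue_entropy(expression) - math.log2(p_t)

# Toy data: one gene seen in a single tissue, one expressed uniformly (T = 4).
single = [0.0, 0.0, 5.0, 0.0]
uniform = [2.0, 2.0, 2.0, 2.0]
print(tissue_entropy(single), tissue_q(single, 2))
print(tissue_entropy(uniform), tissue_q(uniform, 0))
```

Lower Q means stronger specificity for that tissue, so ranking genes by increasing Q within a tissue puts the most tissue-specific genes first.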
(a) Very specific liver expression: H = 1.6 and Qliver = 2.2, 98612_at cytochrome p450
(b) Near uniform expression: H = 4.3 and Qliver = 10.2, 104391_s_at Clcn7 chloride channel 7
Agreement between Microarrays and ESTs on Tissue Specificity
Specificity Characteristics of Tissues
Tissue      Probe Set ID  H    Q    RefSeq     Description
Amygdala    96055_at      3.2  5.8  NM_031161  cholecystokinin
Amygdala    93178_at      2.7  5.8  NM_019867  neuronal guanine nucleotide exchange factor
Amygdala    93273_at      3.7  5.8  NM_009221  synuclein, alpha
Amygdala    92943_at      3.5  6.0  NM_008165  glutamate receptor, ionotropic, AMPA1 (alpha 1)
Amygdala    95436_at      3.3  6.1  NM_009215  somatostatin
Lymph Node  98406_at      2.7  4.0  NM_013653  chemokine (C-C motif) ligand 5
Lymph Node  98063_at      1.6  4.1  -          glycosylation dependent cell adhesion molecule 1
Lymph Node  99446_at      2.5  4.1  NM_007641  membrane-spanning 4-domains, subfamily A, member 1
Lymph Node  92741_g_at    3.3  4.5  -          immunoglobulin heavy chain 4 (serum IgG1)
Lymph Node  102940_at     2.8  4.6  NM_008518  lymphotoxin B
Liver       94777_at      1.3  2.1  -          albumin 1
Liver       101287_s_at   1.6  2.2  NM_010005  cytochrome P450, 2d10
Liver       99269_g_at    1.5  2.2  NM_019911  tryptophan 2,3-dioxygenase
Liver       100329_at     1.4  2.3  NM_009246  serine protease inhibitor 1-4
Liver       94318_at      1.6  2.3  NM_013475  apolipoprotein H
CpG Islands are Associated with the Start Sites of Genes with Wide-Spread Expression
[Figure: fraction of promoters with a CpG island (y-axis, 0 to 0.9) versus entropy (x-axis, 0 to 6), for human and mouse]
CpG island = minimum 200 bp, C+G > 0.6, obs./expect. >=0.5
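A minimal sketch of how the three criteria above could be applied to a single candidate window. The observed/expected CpG ratio is taken as (number of CG dinucleotides x window length) / (number of C x number of G), the common Gardiner-Garden and Frommer form; whether the analysis used exactly this ratio, or scanned promoters with a sliding window, is not stated on the slide, so both are assumptions. The function name and test sequences are made up.

```python
def is_cpg_island(seq, min_len=200, min_gc=0.6, min_obs_exp=0.5):
    """Apply the slide's thresholds to one window of DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n < min_len:
        return False
    c = seq.count("C")
    g = seq.count("G")
    if c == 0 or g == 0:
        return False
    cpg = seq.count("CG")             # observed CpG dinucleotides
    gc_fraction = (c + g) / n         # C+G content
    obs_exp = cpg * n / (c * g)       # observed / expected CpG (assumed form)
    return gc_fraction > min_gc and obs_exp >= min_obs_exp

print(is_cpg_island("CG" * 100))          # GC-rich and CpG-rich window
print(is_cpg_island("CCCCCGGGGG" * 20))   # GC-rich but CpG-depleted window
```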
Tissue-Specific and Non-Specific Promoters Have Distinct Base Compositions
CpG+ CpG-
Multi-Tissue: H >= 4.4
Tissue-Specific: H <= 3.5
TATA Boxes are Associated with Tissue-Specific Genes
[Figure: % of genes with TATAA box (y-axis, 0 to 90) by Q-value bin (x-axis: 0-2, 2-4, 4-6, 6-8, 8-10, >10), for human and mouse]
p-values shown on the plot (h = human, m = mouse): p h = 0.13, p m = 0.15; p h = 0.00007, p m = 0.00087; p h = 0.00005, p m = 0.00001
Counts shown on the plot: (7/9), (8/9), (4/8), (8/28), (16/80), (3/8), (10/28), (16/80), (4/31), (2/27), (9/35), (3/27)
Genes with TATAA box overall: human 18.8%, mouse 22.9%
Columns: Cellular Component and Biological Process (with Human Only and Mouse Only sub-columns)

CGI-/TATA+
  Cellular Component: extracellular, extracellular space; microsome, vesicular fraction; intermediate filament (cytoskeleton)
  Biological Process: response to stimulus; organismal physiological process, inflammatory response, innate immune response, cell motility, defense response, response to pest/pathogen/parasite, response to wounding, response to biotic stimulus, cell-cell signaling, morphogenesis, digestion, muscle contraction; chemotaxis, taxis, response to chemical substance, response to abiotic stimulus, muscle development

CGI+/TATA-
  Cellular Component: cell, cytoplasm, intracellular, mitochondrion; nucleus, ribonucleoprotein complex
  Biological Process: nucleobase, nucleoside, nucleotide and nucleic acid metabolism; intracellular transport; metabolism; protein transport; intracellular protein transport; RNA processing; RNA metabolism; cell cycle; mitotic cell cycle

CGI-/TATA-
  Cellular Component: (integral to) (plasma) membrane; extracellular, extracellular space
  Biological Process: organismal physiological process, defense response, immune response, response to biotic stimulus, response to stimulus; response to pest/pathogen/parasite, cell communication, response to wounding, cellular defense response, signal transduction; complement activation, complement activation (classical pathway), humoral defense mechanism (sensu Vertebrata), humoral immune response
Functional relationships of promoter classes based on over-represented GO terms (EASE)
First Clues: TATA Box indicates Tissue-Specific Expression; CpG indicates Wide-Spread Expression
Additional clues: CpG-/TATA+ indicates high expression and secreted proteins, while CpG+/TATA- indicates cellular and mitochondrial proteins.
Pattern Analysis of Pancreas Gene Promoters
Guang (Gary) Chen, Jonathan Schug
Shannon Entropy
GNF Gene Expression Atlas
Gene Lists with Tissue Specificity
DBTSS
Sequences around Transcription Start Sites
Teiresias
Pattern Clusters (PWM)
Represent Seqs with PWMs
Gene Clusters
Gene Ontology (GO)
GO Category Analysis
Patterns
Pattern Clustering
Comparative Genome Analysis
Identifying TFBMs – Method Pipeline
Starting with a gene expression tissue survey, pancreas-specific genes with common TFBS and biological processes are identified
Tissue Specific Regulatory Modules
Associated with GO Biological Process
– DBTSS: Database of Transcriptional Start Sites
  • Based on 400,225 human and 580,209 mouse full-length cDNA sequences, DBTSS contains the genomic positions of the transcriptional start sites and the adjacent promoters for 8,793 human and 6,875 mouse genes. http://dbtss.hgc.jp/
  Yutaka Suzuki, Riu Yamashita, Kenta Nakai and Sumio Sugano (2002). DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res. 30: 328-331.
– Pancreas genes are chosen based on efforts to understand pancreatic development and function (EPConDB)
  • 500 bp upstream for preliminary study
  • 159 human (mouse) pancreas-specific genes (Qislet < 7, positive (p)) and 159 human (mouse) ubiquitous genes (Qislet > 10, negative (n))
– This approach can be applied to any tissue to study the tissue specificity of transcription factor binding motifs (TFBMs) and modules
Methods & Resources (Cont.)
• A Teiresias pattern P is an <L,W> pattern (with L ≤ W) if P contains at least L residues and every subpattern of P containing L residues is at most W symbols in length.
Pattern   <L,W>
ACTGGC    <L=?, W=6>  <L=?, W=4>  <L=6, W=6>
A.C.GT    <L=?, W=6>  <L=?, W=4>  <L=4, W=6>
Method - Pattern Discovery - Teiresias
Teiresias Patterns
*Rigoutsos, I. and A. Floratos, Combinatorial Pattern Discovery in Biological Sequences: the TEIRESIAS Algorithm. Bioinformatics, 14(1), January 1998.
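The <L,W> condition as worded on the slide can be checked mechanically. The sketch below is not the TEIRESIAS algorithm itself (which discovers patterns); it only tests whether a given pattern satisfies the condition. It assumes "." is the wildcard symbol and reads "subpattern containing L residues" as a run of exactly L consecutive residues, measured from the first to the last of them. The function name is illustrative.

```python
def is_lw_pattern(pattern, L, W, wildcard="."):
    """Check the <L,W> condition as worded on the slide."""
    if L < 1 or L > W:
        raise ValueError("need 1 <= L <= W")
    # Positions of residues (non-wildcard symbols) in the pattern.
    positions = [i for i, ch in enumerate(pattern) if ch != wildcard]
    if len(positions) < L:
        return False
    # Every run of L consecutive residues must span at most W symbols.
    for start in range(len(positions) - L + 1):
        span = positions[start + L - 1] - positions[start] + 1
        if span > W:
            return False
    return True

print(is_lw_pattern("ACTGGC", 6, 6))
print(is_lw_pattern("A.C.GT", 4, 6))
print(is_lw_pattern("A.C.GT", 4, 4))
```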
Identifying TFBMs - Pattern Distribution
With 117 human pancreas-specific genes (Qpancreas < 6.5, positive (p)) and 117 human ubiquitous genes (Qpancreas > 10, negative (n)), roughly 90,000 patterns were discovered in the 1kb+/200bp- promoter region. Patterns with ∆p-n > 20 (in blue box) are more likely to be pancreas-specific.
Each point represents a pattern with occurrence in positive data set (y-axis) and negative data set (x-axis)
For each pattern (x-axis), the occurrence difference ∆p-n (y-axis) between positive (Q<6.5) and negative (Q>10) data set
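The occurrence difference ∆p-n can be sketched as follows. The slides do not say whether "occurrence" counts sequences or individual matches; this example assumes it is the number of sequences containing the pattern at least once, and it treats a pattern as a regular expression in which "." matches any base. The function names and the toy sequences are hypothetical.

```python
import re

def sequences_with_pattern(pattern, sequences):
    """Number of sequences containing the pattern at least once."""
    regex = re.compile(pattern)  # '.' in the pattern acts as a wildcard
    return sum(1 for seq in sequences if regex.search(seq))

def occurrence_difference(pattern, positive, negative):
    """Delta p-n: occurrences in the positive set minus the negative set."""
    return (sequences_with_pattern(pattern, positive)
            - sequences_with_pattern(pattern, negative))

# Toy promoter fragments standing in for the positive and negative sets.
positive = ["AACTGGCTT", "GGACTGGCA", "TTTT"]
negative = ["ACAGGT", "CCCC", "ACTGGC"]
for pattern in ("ACTGGC", "A.TGG"):
    print(pattern, occurrence_difference(pattern, positive, negative))
```

Patterns would then be kept when this difference exceeds a chosen cutoff (20 in the analysis above).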
Method - Pattern Clustering
Patterns
Smith-Waterman
Distance of pattern pair
Hierarchical
K-Median
Num of Cluster
Pattern Clusters (PWM)
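One way to turn Smith-Waterman local alignment into a pairwise pattern distance is sketched below. The scoring values (match 2, mismatch -1, linear gap -1) and the normalization by the best possible score of the shorter pattern are assumptions made for illustration; the slides do not give the scoring scheme or how the distance was derived. The resulting distances could feed any hierarchical or K-median clustering routine.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two patterns (linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def pattern_distance(a, b, match=2):
    """Distance in [0, 1]: 0 when the shorter pattern aligns perfectly."""
    if not a or not b:
        raise ValueError("patterns must be non-empty")
    max_score = match * min(len(a), len(b))
    return 1.0 - smith_waterman_score(a, b, match=match) / max_score

print(pattern_distance("CCTGTT", "CTGTT"))
print(pattern_distance("CCTGTT", "CTGCTC"))
```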
Results - Pattern Clustering
Clustering Results (human, ∆p-n>20, 72 patterns)
Identifying TFBMs
72 patterns (Human, ∆p-n >20) were clustered to 18 pattern clusters and 6 of them were identified as known ones by searching TRANSFAC.
Identified known binding sites associated with human pancreas genes
AP2ALPHA MEF2 SRY
NKX62 CAP_01 HOXA3
AP2ALPHA MEF2 NKX62
CAP_01
Identifying TFBM
By conducting comparative genomic analysis, some discovered TFBMs are conserved between Human & Mouse pancreas Orthologs
HOXA3
Gene Clustering - Based on TFBMs
Pancreas-specific genes can be clustered according to presence or absence of conserved promoter motifs
Upstream sequences can be characterized by pattern occurrences, which can then be used to calculate pairwise similarities between sequences. For simplicity, we used a boolean model that records only the presence or absence of each of the 7 conserved patterns. Centered Pearson correlation was used to calculate similarity, and the 117 pancreas-specific genes (Q < 6.5) were grouped into 10 clusters with hierarchical clustering.
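A small sketch of the boolean model and the centered Pearson similarity described above. The seven motif names are taken from the cluster table; the presence/absence vectors for the three genes are invented for illustration and do not reflect real promoter content. Returning 0.0 for a constant vector is a choice made here to avoid division by zero.

```python
import math

MOTIFS = ["AP2ALPHA", "MEF2", "NKX62", "HOXA3", "CCTGTT", "CTGCTC", "CAP"]

def centered_pearson(x, y):
    """Pearson correlation after subtracting each vector's mean."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den if den else 0.0  # constant vector: no defined correlation

# Hypothetical boolean profiles: 1 = motif present upstream, 0 = absent.
profiles = {
    "geneA": [1, 0, 1, 0, 0, 1, 0],
    "geneB": [1, 0, 1, 0, 0, 1, 0],
    "geneC": [0, 1, 0, 1, 1, 0, 1],
}
print(centered_pearson(profiles["geneA"], profiles["geneB"]))
print(centered_pearson(profiles["geneA"], profiles["geneC"]))
```

A distance such as 1 minus this correlation could then be passed to a hierarchical clustering routine.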
Cluster 6
Motif columns: AP2ALPHA MEF2 NKX62 HOXA3 CCTGTT CTGCTC CAP

RefSeq     Locus  Name   Description
NM_001504  2833   CXCR3  chemokine (C-X-C motif) receptor 3
NM_000380  7507   XPA    xeroderma pigmentosum, complementation group A
NM_000278  5076   PAX2   paired box gene 2
NM_003987  5076
NM_003988  5076
NM_003989  5076
NM_003990  5076
NM_002728  5553   PRG2   proteoglycan 2 (natural killer cell activator)
NM_013230  934    CD24   CD24 antigen (small cell lung carcinoma cluster 4 antigen)
Gene Clustering – GO Category
Assign Gene Clusters to GO Category
To interpret clustering results, we used EASE to find the significant biological features of a gene cluster of interest through the GO Biological Process.
cluster  GO Biological Process  Gene Name  Description
c2       Digestion              AMY1A      amylase, alpha 1A; salivary
                                CEL        carboxyl ester lipase, bile salt-dependent lipase, cholesterol esterase; fetoacinar pancreatic protein
                                CLPS       colipase, pancreatic
                                CTRB1      chymotrypsinogen B1
                                CTRL       chymotrypsin-like
c4       catabolism             CPA2       carboxypeptidase A2 (pancreatic)
                                DHPS       deoxyhypusine synthase
                                MEPA1      meprin A, alpha (PABA peptide hydrolase)
                                CPB1       carboxypeptidase B1
                                ELA3A      elastase 3A, pancreatic (protease E)
                                ELA2A      pancreatic elastase IIA
c6       response to stimulus   CD24       CD24 antigen (small cell lung carcinoma cluster 4 antigen)
                                CXCR3      chemokine (C-X-C motif) receptor 3
                                PAX2       paired box gene 2
                                PRG2       proteoglycan 2, bone marrow (natural killer cell activator, eosinophil granule major basic protein)
                                XPA        xeroderma pigmentosum, complementation group A
c8       phosphorus metabolism  PDGFRA     platelet-derived growth factor receptor, alpha polypeptide
                                PRDX4      thioredoxin peroxidase
                                PTP4A3     protein tyrosine phosphatase type IVA, member 3
c9       Transport              SLC12A3    solute carrier family 12
                                CACNA1E    calcium channel, voltage-dependent, alpha 1E subunit
                                TCOF1      Treacher Collins-Franceschetti syndrome 1
                                SLC35A3    solute carrier family 35 (UDP-N-acetylglucosamine (UDP-GlcNAc)
More Clues: Known and novel TFBS found associated with genes expressed in the pancreas
See conservation of sites between human and mouse
Associated with digestion, catabolism, and response to stimulus GO biological processes
Discovering regulatory modules by creating profiles for Gene
Ontology Biological Processes based on tissue-specificity scores
Elisabetta Manduchi, Jonathan Schug
If we focus on biological processes that are predominantly taking place in a given tissue, can we identify regulatory modules common to genes involved in these processes?
Tissue
Biological Process
Genes
For a given tissue survey, we attach “tissue-specificity” profiles to gene sets defined by GO BPs, based on the ranked lists of genes in each tissue according to increasing Q.
• To this end, we use an Enrichment Score (ES) in the spirit of that described in Mootha et al. (2003), as a measure of tissue-specificity for that gene set.
• The ES turns out to be equivalent (i.e. equal up to a multiplicative constant) to a Kolmogorov-Smirnov statistic.
• The following results refer to the application of the methods described above to the GeneNote tissue survey:
  – 12 tissues in duplicate on the HGU95 Affymetrix chip set (Av2, B-E).
• We looked at the 2316 GO BPs that we could map to probe sets (using version 1.5.1 of the Bioconductor GO and hgu95XXX metadata R packages).
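The enrichment score and its link to a Kolmogorov-Smirnov statistic can be sketched as below. The running-sum weights follow the form used in Mootha et al. (2003): add sqrt((N-G)/G) for a gene in the set and subtract sqrt(G/(N-G)) otherwise, with ES as the maximum of the running sum. Whether the analysis here used exactly these weights is an assumption ("in the spirit of"). With these weights, ES equals sqrt(G(N-G)) times the one-sided KS statistic, which the example checks on a made-up ranked list.

```python
import math

def enrichment_score(ranked_genes, gene_set):
    """Maximum of a Mootha-style running sum down a ranked gene list."""
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in gene_set)
    if g == 0 or g == n:
        raise ValueError("gene set must be a proper, non-empty subset")
    hit = math.sqrt((n - g) / g)
    miss = math.sqrt(g / (n - g))
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += hit if gene in gene_set else -miss
        best = max(best, running)
    return best

def ks_statistic(ranked_genes, gene_set):
    """One-sided KS statistic: max of (hit fraction - miss fraction)."""
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in gene_set)
    hits, misses, best = 0, 0, 0.0
    for gene in ranked_genes:
        if gene in gene_set:
            hits += 1
        else:
            misses += 1
        best = max(best, hits / g - misses / (n - g))
    return best

# Toy list ranked by increasing Q in one tissue; the gene set is hypothetical.
ranked = list("abcdefgh")
gene_set = {"a", "b", "d"}
es = enrichment_score(ranked, gene_set)
ks = ks_statistic(ranked, gene_set)
print(es, ks, math.sqrt(3 * 5) * ks)
```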
Application to a Human Tissue Survey
GO BPs having significantly specific profiles for each tissue can be identified
significant in liver
significant in heart and skeletal muscle
Excerpt of cluster of GO BPs based on their tissue-specificity profiles (up in spinal cord/brain)
Focusing on steroid metabolism
A. After mapping probe sets to RefSeqs and retrieving from DBTSS their upstream sequences, we assembled a set of 63 promoter sequences, which was our positive set.
B. We generated 5 negative sets, each consisting of 315 sequences, by randomly scrambling each of the positive set sequences.
C. We ranked each of 666 Transcription Factor Binding Sites (TFBSs) from TRANSFAC, represented by position matrices, in terms of their ability (measured by average ROC area) to discriminate between the positive set and the negative sets.
D. We then selected high ranking TFBSs from (C) and high ranking TFBSs from an independent study focusing on liver specificity and formed all possible pairs between these two sets.
E. These pairs were ranked according to their discriminative ability and on the basis of the distance between their components in the positive hits. Optimal parameters (distance and individual TFBS match scores) were selected for each pair scoring at the top.
F. By assessing the performance over a test set composed of mouse promoter sequences, we found 2 candidate CRMs (involving 3 and, respectively, 4 TFBSs) with an over-representation of steroid metabolism genes.
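Steps B and C above can be illustrated with a short sketch: scrambling a sequence to build a negative control, and computing an ROC area from per-sequence scores. The ROC area is computed here through its rank interpretation (the probability that a positive sequence outscores a negative one, ties counted as one half). How each promoter was scored against a TRANSFAC matrix is not shown on the slides, so the scores below are placeholders, and all function names are illustrative.

```python
import random

def scramble(seq, rng):
    """Shuffle the bases of a sequence, preserving its composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def roc_area(pos_scores, neg_scores):
    """ROC area as P(positive score > negative score), ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

rng = random.Random(0)
promoter = "TATAAAGGCTGACCTTGAACT"   # made-up promoter fragment
control = scramble(promoter, rng)    # one scrambled negative

# Placeholder best-match scores for one TFBS matrix on each sequence set.
pos_scores = [9.1, 7.4, 8.8, 6.0]
neg_sets = [[5.2, 6.1, 7.9, 4.4], [6.5, 5.0, 8.9, 3.7]]
average_area = sum(roc_area(pos_scores, neg) for neg in neg_sets) / len(neg_sets)
print(control, average_area)
```

Averaging the area over the negative sets mirrors the "average ROC area" used to rank the TFBSs.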
Focusing on steroid metabolism
Example of production hits to steroid metabolism
mouse promoter sequences
No. mouse promoter sequences: 6875. Of these 50 belong to genes mapping to steroid metabolism.
No. production hits: 257. Of these 8 belong to genes mapping to steroid metabolism.
TSS
Production
TFBSs: {FOXD3_01, GKLF_01, HFH1_01, MADSA_Q2}
Parameters:
max distance=130
FOXD3_01 min score=9.934705
GKLF_01 min score=10.815614
HFH1_01 min score=9.442617
MADSA_Q2 min score=8.246301
green=forward strand
red=reverse strand
shading indicates strength
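A hedged sketch of how a promoter might be tested against this production. The thresholds and the maximum distance are the parameters listed above; the interpretation that all four sites must fall within a span of at most 130 bp is an assumption (the slides do not say whether the limit applies to the whole module or to adjacent sites), and the hit positions and scores are invented.

```python
from itertools import product

MIN_SCORES = {
    "FOXD3_01": 9.934705,
    "GKLF_01": 10.815614,
    "HFH1_01": 9.442617,
    "MADSA_Q2": 8.246301,
}
MAX_DISTANCE = 130

def is_module_hit(hits, min_scores=MIN_SCORES, max_distance=MAX_DISTANCE):
    """True if every TFBS has a passing site and all fit within max_distance.

    hits maps a TFBS name to a list of (position, score) matches.
    """
    candidates = []
    for tfbs, threshold in min_scores.items():
        positions = [pos for pos, score in hits.get(tfbs, []) if score >= threshold]
        if not positions:
            return False
        candidates.append(positions)
    # Try every choice of one site per TFBS (assumed whole-module span).
    return any(max(combo) - min(combo) <= max_distance
               for combo in product(*candidates))

# Hypothetical matches on one promoter (positions relative to the sequence start).
example_hits = {
    "FOXD3_01": [(100, 10.2)],
    "GKLF_01": [(150, 11.0), (400, 12.3)],
    "HFH1_01": [(180, 9.5)],
    "MADSA_Q2": [(220, 8.3)],
}
print(is_module_hit(example_hits))
```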
More Clues: We can identify candidate CRMs from top-ranking GO Biological Processes for tissues
Identified a candidate CRM for steroid metabolism.
Summary
• GUS is a functional genomics database system used by a growing number of sites for genome and expression projects.
• Using expression data in GUS and entropy-based metrics, we can rank genes according to their tissue-specificity, learn promoter properties, and associate functional roles.
• In addition to general properties of tissue-specific promoters, we are beginning to identify combinations of motifs (i.e., regulatory modules) associated with expression in specific tissues.
Future Directions
• Refine analysis from genes to transcripts
• Refine analysis from organs to cells
• Apply approach to splicing
• Apply approach to developmental stage and differentiation state
Our goal is to make inferences of the form: "The gene set G shows specificity for tissue T and is regulated by module M in this context".
http://www.cbil.upenn.edu