biological modeling promoter and gene set analysis
7/29/2019 Biological Modeling Promoter and Gene Set Analysis
BIOLOGICAL MODELING PROMOTER AND GENE SET ANALYSIS
A. Whole Genome Gene Expression Data
B. Gene set enrichment analysis (GSEA)
AIM: To identify and retrieve gene expression data (GEO database) for the genes of
interest, and to use GSEA to identify how gene regulation differs between the experimental and
control conditions.
Objective:
Activation of viral pattern sensors, such as RIG-I and MDA5, in dendritic cells triggers a
transcriptional program that includes the production of Type I interferons (IFNs). These
antiviral cytokines contribute to both induction and regulation of innate and adaptive antiviral
mechanisms. IRF family transcription factors play a central role in the extensive genetic
reprogramming that follows viral infection, and in the production of IFN. Pathogenic viruses
attempt to subvert normal immune function through the expression of interferon antagonists.
For example, IRF3 activation and IFN-B expression are blocked by the NS1 protein of
influenza. In this lab we will investigate the genetic program that operates during the immune
response to Newcastle disease virus (NDV). NDV is an avian virus that is able to stimulate
innate immunity and DC maturation, but lacks the ability to evade the human interferon
response. This response thus models the operation of an uninhibited antiviral regulatory
network. Genome-scale gene expression studies will be accessed. We will see how gene set
enrichment analysis (GSEA) can be used to infer the important regulatory motifs and gene
pathways that are unique to a functioning antiviral response. The TRANSFAC database will
also be accessed to extract position weight matrices of relevant TF binding sites on IFNB1,
and the promoter sequence will then be examined using the UCSC Genome Browser.
Procedure 2:
Premise: We are interested in finding the transcription factors driving the gene expression
changes observed post-infection compared to other physical stress (in this instance the
stress of injection without infective particles).
Go to PubMed
Search Keywords: antiviral response cascade
Look at the paper by Zaslavsky et al., "Antiviral response dictated by choreographed
cascade of transcription factors." To the right of the abstract you will find a tab labeled All
links from this paper.
Follow the link GEO DataSets.
Alternatively, you can search the GEO website (Gene Expression Omnibus at NCBI) for
the first authors (Zaslavsky or Hershberg) in the Query Datasets field.
From this page, information about gene expression can be explored. We would like to
choose a time point by which we expect some expression changes following infection.
We also require that both our experiment and our control have samples at this time point.
Click on the more link following the dataset summary, and then on more in the list of Samples.
We can now see which time points are available in this data set. Notice that the time point
6 hours post infection includes 4 NDV-infected samples and 4 controls (AF). We will
download a text file (.txt extension) of the entire dataset by clicking Series Matrix File(s)
at the bottom of the page. We will then unzip the file and open it in Excel to remove the
unneeded data from other time points and to format it for use in the next steps.
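The column filtering described above can also be scripted instead of done by hand in Excel. The following is a minimal Python sketch, assuming the usual series matrix layout (metadata lines starting with "!", a !Sample_title line, and a data table between !series_matrix_table_begin and !series_matrix_table_end). The function names and the toy sample titles such as "AF_6h_rep1" are hypothetical; check the real titles in the downloaded file before choosing a filter.

```python
import csv
import io

def read_series_matrix(handle):
    """Return (sample_titles, header, rows) from a GEO series matrix file."""
    titles, header, rows = [], None, []
    in_table = False
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        tag = fields[0]
        if tag == "!Sample_title":
            titles = fields[1:]
        elif tag == "!series_matrix_table_begin":
            in_table = True
        elif tag == "!series_matrix_table_end":
            in_table = False
        elif in_table:
            if header is None:
                header = fields  # first table row: ID_REF plus sample IDs
            else:
                rows.append(fields)
    return titles, header, rows

def keep_samples(titles, header, rows, wanted):
    """Keep the probe-ID column plus every sample whose title satisfies wanted()."""
    cols = [0] + [i + 1 for i, t in enumerate(titles) if wanted(t)]
    pick = lambda r: [r[i] for i in cols]
    return pick(header), [pick(r) for r in rows]

# Toy input with invented titles and values, standing in for the real file.
demo = io.StringIO(
    '!Series_title\t"toy series"\n'
    '!Sample_title\t"AF_6h_rep1"\t"NDV_6h_rep1"\t"NDV_8h_rep1"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM1"\t"GSM2"\t"GSM3"\n'
    '"probe_a"\t5.1\t9.3\t9.9\n'
    '!series_matrix_table_end\n'
)
titles, header, rows = read_series_matrix(demo)
header6, rows6 = keep_samples(titles, header, rows, lambda t: "_6h_" in t)
print(header6)
print(rows6)
```

For the real download, the StringIO object would be replaced by the unzipped file opened with open(path, newline="").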
Procedure:
Go to the GSEA website. Hit the Downloads link. Log in with the email provided by the
instructor (or create a user), then press Launch.
GSEA works with specific file formats. You must create expression dataset and
phenotype label files:
1. To make the expression dataset file, open the file you downloaded from GEO. Erase
all header lines other than !Sample_title. Erase the last row as well. Use the
information from !Sample_title to delete all the columns that do not contain time
point 6 data, then erase this row as well. Change the title of the column with probe
set ids to NAME. Insert a column to the right of NAME and title it
DESCRIPTION. Write na in the first row of DESCRIPTION, copy it, and select all
rows by clicking on the first and last rows while holding down the Shift key.
Paste na into all rows.
2. To make the phenotype file, open a new Excel file and enter the following 3 lines:
8 2 1
# AF NDV
0 1 0 1 0 1 0 1
Save as a tab-delimited text file with a .cls extension.
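The two files from steps 1 and 2 can also be generated programmatically. The sketch below is one possible way to do it in Python, assuming the probe table has already been reduced to the 6 hour samples. The function names to_gsea_txt and to_cls are hypothetical helpers, not part of GSEA, and the probe values are toy data.

```python
def to_gsea_txt(header, rows):
    """Format a probe table as GSEA TXT text: NAME, DESCRIPTION, then samples."""
    lines = ["\t".join(["NAME", "DESCRIPTION"] + header[1:])]
    for row in rows:
        # Every probe gets the placeholder description "na", as in step 1.
        lines.append("\t".join([row[0], "na"] + row[1:]))
    return "\n".join(lines) + "\n"

def to_cls(labels, class_names):
    """Format a categorical CLS phenotype file from per-sample class indices."""
    if set(labels) != set(range(len(class_names))):
        raise ValueError("labels must use every class index 0..k-1")
    first = f"{len(labels)} {len(class_names)} 1"   # samples, classes, always 1
    second = "# " + " ".join(class_names)
    third = " ".join(str(x) for x in labels)
    return "\n".join([first, second, third]) + "\n"

# Toy expression table (invented values) and the phenotype labels from step 2.
expression_text = to_gsea_txt(["ID_REF", "GSM1", "GSM2"],
                              [["probe_a", "5.1", "9.3"]])
phenotype_text = to_cls([0, 1, 0, 1, 0, 1, 0, 1], ["AF", "NDV"])
print(phenotype_text, end="")
```

The CLS text here is space-separated; the GSEA format documentation describes both spaces and tabs as acceptable separators. The order of labels on the third line must match the order of the sample columns in the expression file, so the alternating 0 1 pattern is only right if the AF and NDV columns really alternate.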
Now that we have our datasets in the correct format, we can load them into GSEA. Go to the
load data tab and drag and drop the two files you created into the grey box. Hit "load these files!"
to load them. If you get an error, look through the previous steps and make sure you formatted
your two files correctly. Switch to the Run GSEA tab.
Choose the Expression Dataset you made from the GEO data.
Choose gseaftp.broadinstitute.org://pub/gsea/gene_sets/c3.tft.v2.5.symbols.gmt as a Geneset
database.
Change number of permutations to 10
Choose the NDV vs AF phenotype
Change Permutation type to - gene_set
Look in the GEO webpage to determine which chip platform was used in the
experiment and select the correct Chip Platform. (it should be -
gseaftp.broadinstitute.org://pub/gsea/annotations/HG_U133_Plus_2.chip)
Expand Basic Fields
Change Enrichment statistic to Classic
Hit Run
When the running indicator (on the bottom left-hand side of the window) changes to
Success, click on it. This should open the details of the analysis in your web browser. In
some operating systems (for instance Linux), clicking Success to see the results may
produce an error. In that case, go to the default output folder (look for it in Options >
Preferences) and open the index.html file with a browser.
Look around a bit; try Snapshot and enrichment results in HTML.
In the latter page, click on the Details link that belongs to the IRF1 matrix (V$IRF1).
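To get a feel for what the Classic enrichment statistic measures, the following is a simplified Python sketch of an unweighted running-sum enrichment score: walking down the ranked gene list, the sum steps up for each gene in the set and down for each gene outside it, and the score is the largest deviation from zero. This is an illustration under stated assumptions, not GSEA's implementation: the function name and gene names are hypothetical, and the ranking metric, normalization and permutation-based significance that GSEA reports are all omitted.

```python
def classic_enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set."""
    members = set(gene_set)
    n_hit = sum(1 for g in ranked_genes if g in members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need at least one hit and one miss")
    running, best = 0.0, 0.0
    for g in ranked_genes:
        # Equal-sized steps: up for set members, down for all other genes.
        running += 1.0 / n_hit if g in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Toy ranked list, most up-regulated in NDV first (invented gene names).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = classic_enrichment_score(ranked, {"g1", "g2", "g5"})
es_bottom = classic_enrichment_score(ranked, {"g5", "g6"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list gives a positive score and a set concentrated at the bottom gives a negative one, which is the sign convention to keep in mind when reading the enrichment plots.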
RETRIEVE AND VERIFY THE PROMOTER REGION AND THE SEQUENCE OF THE
IFNB1 GENE (NCBI DATABASE)
In the results of the previous analysis, click on enrichment results in HTML.
Click on the Details link that belongs to the IRF1 matrix (V$IRF1).
Near the top of the page you will find IFNB1.
Click the Entrez link, which will take you to the relevant page of the Entrez database.
Choose the entry for IFNB1 in humans.
Follow the link See IFNB1 in MapViewer. Choose the link dl.
"Strand" box - Select "minus" strand from the pull-down menu because the IFNB1 gene is
on the minus strand (note how the arrow next to IFNB1 on MapViewer points up).
In the "adjust by" box in the "to" line - Increase the sequence information displayed by
entering "+2K". Click Change Region/Strand (note how the coordinates change and the strand
designation becomes "-"). Click Display.
BLAST the extended sequence against the human genome and verify that it is indeed the
sequence of IFNB1 (make sure to include the line that starts with ">").
Hit the back button in the browser, copy the sequence into a text editor, and save it as a .fasta file.
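Selecting the minus strand means the displayed sequence is the reverse complement of the reference (plus) strand. The short Python sketch below shows that relationship and a basic check that saved FASTA text starts with a ">" header line. The helper names and the toy sequence are hypothetical; the real input would be the IFNB1 sequence saved above.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the sequence as read 5' to 3' on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]

def read_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    return lines[0][1:], "".join(lines[1:]).upper()

# Toy record, not the IFNB1 sequence.
name, sequence = read_fasta(">toy\nacgt\nTTAA\n")
print(name, sequence, reverse_complement(sequence))
```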
TRANSCRIPTION FACTOR TARGET SITE PREDICTION FOR THE PROMOTER
REGION OF IFNB1
JASPAR
Choose the JASPAR CORE Vertebrata
Press Toggle so that all JASPAR matrix models are selected.
Paste the Human IFNB1 gene and promoter in the text area.
Hit Scan
Find the predicted target sites for IRF1.
JASPAR attempts to make non-redundant target site matrices, so we do not see general IRF
binding sites. We do find potential target sites for both IRF1 and IRF2.
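Matrix-based scanning of this kind can be illustrated with a small Python sketch: base counts per position are converted to log-odds scores against a uniform background, and each window of the sequence is scored by summing the matching entries. The count matrix below is an invented 4-column toy, not the JASPAR IRF1 or IRF2 model, and the function names, pseudocount and threshold are assumptions. JASPAR's own scoring and cut-offs may differ, and a fuller scanner would also check the reverse strand.

```python
import math

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Return (start, score) for every forward-strand window at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in "ACGT" for b in window):
            continue  # skip windows containing N or other symbols
        score = sum(matrix[i][b] for i, b in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Invented toy matrix that favours the motif GAAA.
toy_counts = [{"G": 8}, {"A": 8}, {"A": 8}, {"A": 8}]
pwm = log_odds_matrix(toy_counts)
hits = scan("TTGAAACC", pwm, threshold=5.0)
print(hits)
```

Lowering the threshold returns more, weaker matches, which is one reason a permissive scan reports many more candidate sites than are likely to be functional.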
USE THE UCSC GENOME BROWSER TO LEARN MORE ABOUT THE IFNB1 GENE
AND THE IRF1 TF TARGET SITE
Open the UCSC genome browser and click the "Genomes" link at the top.
Make sure that the hg16 build is selected in the assembly box (July 2003
(NCBI34/hg16)), then type IFNB1 in the search box.
Add 2000 bp on both sides of the gene.
Under the regulation section, unhide the annotation TFBS Conserved (TFBS =
Transcription Factor Binding Site).
Note the TFs identified, especially the IRF1/IRF2 binding site at
chr9:21068021-21068033.
Notice how we do not find the same binding sites in both types of analysis. Many fewer
sites appear in this view, and only one of the IRF1 sites identified by JASPAR also comes
up here.
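Comparing the two analyses comes down to asking which predicted sites overlap a conserved site on the same chromosome. A minimal Python sketch of that comparison follows, assuming inclusive coordinates as shown in the browser position box. The helper names are hypothetical, and the two "toy_hits" intervals are invented placeholders, not JASPAR output; only the chr9:21068021-21068033 position comes from the browser step above.

```python
def parse_position(text):
    """Parse a position such as 'chr9:21068021-21068033' into (chrom, start, end)."""
    chrom, span = text.replace(",", "").split(":")
    start, end = (int(x) for x in span.split("-"))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end

def pad(region, flank):
    """Extend a region by flank bp on both sides, never going below position 1."""
    chrom, start, end = region
    return chrom, max(1, start - flank), end + flank

def overlaps(a, b):
    """True if two inclusive intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

conserved_site = parse_position("chr9:21068021-21068033")
# Invented placeholder intervals standing in for predicted sites.
toy_hits = [("chr9", 21068025, 21068036), ("chr9", 21069000, 21069011)]
shared = [h for h in toy_hits if overlaps(h, conserved_site)]
print(shared)
print(pad(("chr9", 5000, 6000), 2000))
```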
As an alternative to the above steps, we can find the region we analyzed in step 4 by
clicking on Blat, pasting in the sequence we used in Match, and running the search. In this
way we also verify that our identification of IFNB1 in the NCBI database is correct.
Result:
The given procedure was followed to identify the IFNB1 promoter region and to perform gene set
analysis by various computational methods.