biological modeling promoter and gene set analysis

7/29/2019 Biological Modeling Promoter and Gene Set Analysis


BIOLOGICAL MODELING PROMOTER AND GENE SET ANALYSIS

A. Whole Genome Gene Expression Data

B. Gene set enrichment analysis (GSEA)

AIM: To identify and retrieve gene expression data (GEO database) for the genes of interest, and to use GSEA to identify how gene regulation differs between the experimental and control conditions.

Objective:

Activation of viral pattern sensors, such as RIG-I and MDA5, in dendritic cells (DCs) triggers a transcriptional program that includes the production of Type I interferons (IFNs). These antiviral cytokines contribute to both the induction and the regulation of innate and adaptive antiviral mechanisms. IRF family transcription factors play a central role in the extensive genetic reprogramming that follows viral infection, and in the production of IFN. Pathogenic viruses attempt to subvert normal immune function by expressing interferon antagonists. For example, IRF3 activation and IFN-β expression are blocked by the NS1 protein of influenza. In this lab we will investigate the genetic program that operates during the immune response to Newcastle disease virus (NDV). NDV is an avian virus that is able to stimulate innate immunity and DC maturation, but lacks the ability to evade the human interferon response. This response thus models the operation of an uninhibited antiviral regulatory network. Genome-scale gene expression studies will be accessed. We will see how gene set enrichment analysis (GSEA) can be used to infer the important regulatory motifs and gene pathways that are unique to a functioning antiviral response. The TRANSFAC database will also be accessed to extract position weight matrices of relevant TF binding sites on IFNB1, and the promoter sequence will then be scanned using the UCSC Genome Browser.

Procedure 2:



Premise: We are interested in finding the transcription factors that drive the gene expression changes observed after infection, as compared with those caused by other physical stress (in this instance, the stress of injection without infective particles).

Go to PubMed.

Search keywords: antiviral response cascade

Look at the paper by Zaslavsky et al., "Antiviral response dictated by choreographed cascade of transcription factors" (PMID 20164420). To the right of the abstract you will find a tab labeled All links from this paper.

Follow the link GEO DataSets.

Alternatively, you can search the Gene Expression Omnibus (GEO) website at NCBI for the first authors (Zaslavsky or Hershberg) in the Query DataSets field.

From this page, information about gene expression can be explored. We would like to choose a time point by which we expect some expression changes following infection. We also require that both our experiment and our control have samples at this time point.

Click on the more link following the dataset summary, and then on more in the list of Samples.

We can now see which time points are available in this data set. Notice that the time point 6 hours post-infection includes 4 NDV-infected samples and 4 controls (AF). We will download a text file (.txt extension) of the entire dataset (series GSE18791) by clicking Series Matrix File(s) at the bottom of the page. We will then unzip the file and open it in Excel to remove the unneeded data from other time points and to format it for use in the next steps.
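The column filtering described above can also be done in a script instead of Excel. The following is a minimal sketch, not part of the original protocol. It assumes the usual series matrix layout (metadata lines starting with "!", and a data table between !series_matrix_table_begin and !series_matrix_table_end). The function name filter_series_matrix and the "6h" title test in the usage note are hypothetical; check the actual !Sample_title values before relying on any pattern.

```python
import csv

def filter_series_matrix(lines, keep):
    """Keep only the sample columns whose !Sample_title passes keep().

    Returns (accessions, titles, rows); each row is [probe_id, value, ...].
    """
    titles, cols, accessions, rows = [], None, [], []
    in_table = False
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields:
            continue
        tag = fields[0]
        if tag == "!Sample_title":
            titles = fields[1:]
        elif tag == "!series_matrix_table_begin":
            in_table = True
        elif tag == "!series_matrix_table_end":
            in_table = False
        elif in_table and cols is None:
            # First table row is the header: "ID_REF" plus one accession per sample.
            cols = [i for i, title in enumerate(titles, start=1) if keep(title)]
            accessions = [fields[i] for i in cols]
        elif in_table:
            rows.append([fields[0]] + [fields[i] for i in cols])
    kept_titles = [titles[i - 1] for i in (cols or [])]
    return accessions, kept_titles, rows
```

A call might look like filter_series_matrix(gzip.open(path, "rt"), lambda t: "6h" in t), where path points at the downloaded series matrix file and the "6h" test is only a placeholder for however the 6-hour samples are actually labeled.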

Procedure:

Go to the GSEA website. Hit the Downloads link. Log in with the email provided by the instructor (or create a user), then press Launch.

GSEA works with specific file formats. You must create expression dataset and phenotype label files:

1. To make the expression dataset file, open the file you downloaded from GEO. Erase all header lines other than !Sample_title. Erase the last row as well. Use the information from !Sample_title to delete all the columns that do not contain time point 6 data, then erase this row as well. Change the title of the column with probe set ids to NAME. Insert a column to the right of NAME and title it DESCRIPTION. Write na in the first row of DESCRIPTION, copy it, and select all rows by clicking on the first and last rows while holding down the Shift key. Paste na into all rows. Save the file as tab-delimited text (.txt).
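For reference, the layout that step 1 produces can be written out programmatically. This is a small sketch under the assumption that the probe rows have already been reduced to the time point 6 columns; to_gsea_txt is a hypothetical helper name, and the sample names passed in are whatever column headers remain after the header rows are erased.

```python
def to_gsea_txt(sample_names, rows):
    """Format rows of [probe_id, value, ...] as a tab-delimited GSEA expression dataset."""
    out = ["\t".join(["NAME", "DESCRIPTION"] + list(sample_names))]
    for probe_id, *values in rows:
        # Every probe gets the placeholder description "na", as in the manual procedure.
        out.append("\t".join([probe_id, "na"] + [str(v) for v in values]))
    return "\n".join(out) + "\n"
```

Writing the returned string to a file with a .txt extension gives the same result as the Excel steps: a NAME column, a DESCRIPTION column filled with na, and one column per retained sample.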

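2. To make the phenotype file, open a new Excel file and enter the following 3 lines:

8 2 1

# AF NDV

0 1 0 1 0 1 0 1

The first line gives the number of samples (8), the number of classes (2), and a constant 1. The second line names the classes, and the third assigns each sample column, in order, to a class. Save as a tab-delimited text file with a .cls extension.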
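The three lines of the phenotype file can be derived from a list of class labels, one per sample column. The sketch below is illustrative only; build_cls is a hypothetical helper, and the alternating AF/NDV order in the usage note simply reproduces the example above. The labels must follow the actual column order of your expression dataset, which may differ.

```python
def build_cls(labels):
    """Build categorical .cls text from one class label per sample, in column order."""
    classes = list(dict.fromkeys(labels))  # unique class names, in order of first appearance
    index = {name: i for i, name in enumerate(classes)}
    return "\n".join([
        f"{len(labels)} {len(classes)} 1",
        "# " + " ".join(classes),
        " ".join(str(index[name]) for name in labels),
    ]) + "\n"
```

Calling build_cls(["AF", "NDV"] * 4) yields exactly the three lines shown in step 2.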

Now that we have our datasets in the correct format, we can load them into GSEA. Go to the Load data tab and drag and drop the two files you created into the grey box. Hit "load these files!" to load them. If you get an error, look through the previous steps and make sure you formatted your two files correctly. Switch to the Run GSEA tab.

Choose the Expression Dataset you made from the GEO data.

Choose gseaftp.broadinstitute.org://pub/gsea/gene_sets/c3.tft.v2.5.symbols.gmt as the gene set database.

Change Number of permutations to 10.

Choose the NDV vs AF phenotype.

Change Permutation type to gene_set.

Look at the GEO webpage to determine which chip platform was used in the experiment and select the correct Chip Platform (it should be gseaftp.broadinstitute.org://pub/gsea/annotations/HG_U133_Plus_2.chip).

Expand Basic Fields.

Change Enrichment statistic to Classic.

Hit Run.
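To make the Classic enrichment statistic more concrete: it corresponds to the unweighted running-sum (Kolmogorov-Smirnov-like) score, in which every gene in the set contributes an equal step up and every other gene an equal step down as we walk the ranked list. The sketch below illustrates that idea only; it is not GSEA's own code, it computes no permutation-based significance, and classic_enrichment_score is a hypothetical name.

```python
def classic_enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set.

    ranked_genes: genes ordered from most to least correlated with the phenotype.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, of the ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero, with its sign
    return best
```

A set concentrated at the top of the ranked list gives a score near +1, and a set concentrated at the bottom gives a score near -1.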

When the running indicator (on the bottom left-hand side of the window) changes to Success, click on it. This should open the details of the analysis in your web browser. In some operating systems (for instance Linux), clicking Success produces an error. In that case, go to the default output folder (look for it in Options > Preferences) and open the index.html file with a browser.

Look around a bit; try Snapshot and enrichment results in HTML.

In the latter page, click on the Details link that belongs to the IRF1 matrix (V$IRF1).

RETRIEVE AND VERIFY THE PROMOTER REGION AND THE SEQUENCE OF THE IFNB1 GENE (NCBI DATABASE)

In the results of the previous analysis, click on enrichment results in HTML.

Click on the Details link that belongs to the IRF1 matrix (V$IRF1).

Near the top of the page you will find IFNB1.

Click the Entrez link, which will take you to the relevant page of the Entrez database.

Choose the entry for IFNB1 in humans.

Follow the link See IFNB1 in MapViewer. Choose the link dl.

In the "Strand" box, select "minus" from the pull-down menu, because the IFNB1 gene is on the minus strand (note how the arrow next to IFNB1 on MapViewer points up).

In the "adjust by" box on the "to" line, increase the sequence information displayed by entering "+2K". Click Change Region/Strand (note how the coordinates change and the strand designation becomes "-"). Click Display.

BLAST the extended sequence against the human genome and verify that it is indeed the sequence of IFNB1 (make sure to include the line that starts with ">").

Hit the back button in the browser, copy the sequence into a text editor, and save it with a .fasta extension.
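Because IFNB1 lies on the minus strand, the sequence displayed after selecting "minus" is the reverse complement of the plus-strand reference. The sketch below shows that relationship and how a FASTA record (the ">" header line followed by the sequence) is laid out. The function names and the 70-character line width are illustrative choices, not something the protocol prescribes.

```python
# Translation table for complementing DNA, preserving case and the ambiguity code N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the opposite-strand sequence, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def to_fasta(header, seq, width=70):
    """Format one FASTA record: a '>' header line, then the sequence wrapped at width."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"
```

Applying reverse_complement twice returns the original sequence, which is a quick sanity check when it is unclear which strand a saved file represents.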

TRANSCRIPTION FACTOR TARGET SITE PREDICTION FOR THE PROMOTER REGION OF IFNB1

JASPAR

Choose the JASPAR CORE Vertebrata collection.

Press Toggle so that all JASPAR matrix models are selected.

Paste the human IFNB1 gene and promoter sequence in the text area.

Hit Scan.

Find the different predicted IRF1 target sites.

JASPAR attempts to keep its target site matrices non-redundant, so we do not see general IRF binding sites. We do find potential target sites for both IRF1 and IRF2.
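Conceptually, a matrix scan slides each matrix along the sequence and scores every window. The following is a simplified sketch of log-odds scoring with a position weight matrix. It is not JASPAR's implementation; it assumes a uniform background and a pseudocount of 1, it scans the given strand only, and the three-column matrix in the usage note is a made-up toy, not a real IRF1 profile.

```python
import math

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2((column.get(base, 0) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Yield (start, site, score) for each window scoring at least threshold (0-based start)."""
    width = len(matrix)
    sequence = sequence.upper()
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        if any(base not in "ACGT" for base in site):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(column[base] for column, base in zip(matrix, site))
        if score >= threshold:
            yield start, site, score
```

For example, with the toy counts [{"A": 8}, {"C": 8}, {"G": 8}] and a threshold of 0, scanning "TTACGTT" reports a single site, ACG, at offset 2. Raising or lowering the threshold changes how many candidate sites are reported, so different tools and cutoffs can give different site lists for the same sequence.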

USE THE UCSC GENOME BROWSER TO LEARN MORE ABOUT THE IFNB1 GENE AND THE IRF1 TF TARGET SITE

Open the UCSC Genome Browser and click the "Genomes" link at the top.

Make sure that the hg16 build is selected in the assembly box (July 2003 (NCBI34/hg16)), then type IFNB1 in the search box.

Add 2000 bp on both sides of the gene.

Under the Regulation section, unhide the annotation track TFBS Conserved (TFBS = Transcription Factor Binding Site).

Note the TFs identified, especially the IRF1/IRF2 binding site at chr9:21068021-21068033.
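Browser positions such as chr9:21068021-21068033 are 1-based and inclusive, so this site spans 13 bases. The sketch below parses such a position string and pads it by a flank on both sides, mirroring the "add 2000 bp on both sides" step. parse_region and pad_region are hypothetical helper names, not browser functions.

```python
import re

def parse_region(text):
    """Parse 'chrN:start-end' (1-based, inclusive) into (chrom, start, end)."""
    match = re.fullmatch(r"(chr\w+):([\d,]+)-([\d,]+)", text.strip())
    if not match:
        raise ValueError(f"not a region: {text!r}")
    start, end = (int(value.replace(",", "")) for value in match.group(2, 3))
    return match.group(1), start, end

def pad_region(region, flank):
    """Extend a region by flank bases on each side, never going below position 1."""
    chrom, start, end = region
    return chrom, max(1, start - flank), end + flank
```

The site length is end - start + 1, which is a useful check when comparing a browser annotation against the width of the matrix that predicted it.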

Notice that we do not find the same binding sites in both types of analysis. Far fewer sites appear in this view, and only one of the IRF1 sites identified by JASPAR also comes up here.

As an alternative to the above steps, we can find the region we analyzed in step 4 by clicking on Blat, pasting in the sequence we used for the binding-site scan, and running the search. In this way we also verify that our identification of IFNB1 in the NCBI database is correct.
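Comparing the two analyses by hand means translating an offset in the scanned sequence into genome coordinates and checking whether it overlaps a browser annotation. The sketch below shows that bookkeeping for a sequence taken from the minus strand. It is only valid if the scanned region and the annotation are in the same assembly, and the region bounds in the usage note are invented for illustration.

```python
def minus_strand_to_genome(offset, width, region_start, region_end):
    """Map a 0-based offset in a minus-strand sequence to 1-based inclusive genome coordinates.

    The first base of the minus-strand sequence corresponds to region_end on the genome.
    """
    end = region_end - offset
    return end - width + 1, end

def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]
```

For instance, with a hypothetical scanned region of 21068001 to 21068100, a 12-base hit at offset 70 maps to 21068019-21068030 and overlaps the annotated site at 21068021-21068033, whereas a hit at offset 0 maps to 21068089-21068100 and does not.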

Result:

The given procedure was followed to identify the IFNB1 promoter region and to perform gene set analysis using various computational methods.