Analysis of High-throughput Gene Expression Profiling


Upload: anabel-atkins

Post on 22-Dec-2015


Analysis of High-throughput Gene Expression Profiling

Why Measure Gene Expression?

1. Determines which genes are induced/repressed in response to a developmental phase or to an environmental change.
2. Sets of genes whose expression rises and falls under the same condition are likely to have a related function.
3. Features such as a common regulatory motif can be detected within co-expressed genes.
4. A pattern of gene expression may be used as an indicator of abnormal cellular regulation.

• A useful tool for cancer diagnosis

Why Measure Gene Expression on a Large Scale?

Traditional vs. High-throughput Approaches

Techniques Used to Detect Gene Expression Level

• Microarray (single or dual channel) (high-throughput)

• SAGE (high-throughput)

• EST/cDNA library (high-throughput)

• Northern Blots

• Subtractive hybridisation

• Differential hybridisation

• Representational difference analysis (RDA)

• DNA/RNA Fingerprinting (RAP-PCR)

• Differential Display (DD-PCR)

• aCGH: array CGH (DNA level)

Basic Information of Microarray, SAGE and cDNA Library

(DNA) Microarray

1. Developed around 1987.
2. Employs methods previously exploited in the immunoassay context: specific binding and marking techniques.
3. Two types of probes:

Format I: probe cDNA (500~5,000 bases long) is immobilized on a solid surface such as glass; widely considered to have been developed at Stanford University; traditionally called DNA microarrays.

Format II: an array of oligonucleotide probes (20~80-mer oligos) is synthesized either in situ (on-chip) or by conventional synthesis followed by on-chip immobilization; developed at Affymetrix, Inc. Many companies manufacture oligonucleotide-based chips using alternative in situ synthesis or deposition technologies. Historically called DNA chips.

Microarray

• Single Channel: sub-type classification

• Dual Channel: differential expression gene screening

• Tissue microarray

• Protein microarray

• ……

Array CGH

• Detects DNA copy number variation using a microarray approach

• A hot topic in recent research, especially in cancer research

Microarray Analysis

gene discovery

pattern discovery

inferences about biological processes

classification of biological processes

Which genes are up-regulated, down-regulated, co-regulated, not-regulated?

SAGE

• Experimental technique designed to gain a quantitative measure of gene expression.

• ~10-20 base “tags” are produced (immediately adjacent to the 3’ end of the 3’ most NlaIII restriction site).

• The SAGE technique measures not the expression level of a gene, but quantifies a "tag" which represents the transcription product of a gene.
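The tag definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: NlaIII recognizes CATG, the tag length is fixed at 10 bases (the slide gives a range of about 10-20), and the function name and example sequence are hypothetical.

```python
def sage_tag(cdna, tag_length=10, site="CATG"):
    """Return the tag immediately 3' of the 3'-most NlaIII site, or None."""
    sequence = cdna.upper()
    # rfind locates the last (3'-most) occurrence of the recognition site.
    position = sequence.rfind(site)
    if position == -1:
        return None  # no NlaIII site, so this transcript yields no tag
    start = position + len(site)
    tag = sequence[start:start + tag_length]
    # Discard tags that run off the end of the sequence.
    return tag if len(tag) == tag_length else None


# Hypothetical transcript containing two CATG sites; only the last one counts.
example_cdna = "AAACATGCCCCCCCCCCTTTCATGACGTACGTACGGGAAAA"
example_tag = sage_tag(example_cdna)
```

Counting how often each tag occurs in a library then gives the quantity that SAGE actually measures.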

SAGE

Tags are isolated and concatemerized.

Relative expression levels can be compared between cells in different states.

SAGEmap (http://cgap.nci.nih.gov)

SAGE: comparing two relational libraries

EST library (UniGene)

Gene expression info from Unigene Library

An Example of In-house EST Library Analysis

The Algorithms and Challenges of High-throughput Gene Expression Analysis

Seeing is believing?

No, need to correct errors.

SAGE:

• A typical experiment requires ~30,000 gene expression comparisons, in which a normal and a diseased cell are compared.

• The results depend on the size and reliability of the SAGE libraries.

• Statistical measures are used to filter out candidate genes and reduce the dimensionality of the data, but it is tedious and time-consuming to tune these measures until a good set is found.

SAGE

• TPM: a simple normalization method
TPM = Count × 1,000,000 / TotalCount

• Bayesian approach http://cancerres.aacrjournals.org/cgi/content/full/59/21/5403
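As a minimal sketch of the TPM formula, the snippet below normalizes two libraries of different sizes so that their tag counts become comparable. The tag names and counts are made up for illustration.

```python
def tags_per_million(tag_counts):
    """Scale raw tag counts to tags per million: count * 1,000,000 / total."""
    total = sum(tag_counts.values())
    if total == 0:
        raise ValueError("library contains no tags")
    return {tag: count * 1_000_000 / total for tag, count in tag_counts.items()}


# Hypothetical libraries: the tumor library was sequenced twice as deeply.
normal = {"TAG_A": 30, "TAG_B": 70, "TAG_C": 900}
tumor = {"TAG_A": 120, "TAG_B": 80, "TAG_C": 1800}

normal_tpm = tags_per_million(normal)
tumor_tpm = tags_per_million(tumor)

# Ratio of normalized abundances for one tag (tumor relative to normal).
tag_a_ratio = tumor_tpm["TAG_A"] / normal_tpm["TAG_A"]
```

In this example the raw count of TAG_A is four times higher in the tumor library, but after normalization the ratio is two, because half of the raw difference comes from library size.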

Microarray: Sources of errors

• systematic

• random

[Figure: log signal intensity plotted against log RNA abundance]

Sources of Errors (Cont.)

• Printing and/or tip problems

• Labeling and dye effects (differing amounts of RNA labeled between the 2 channels)

• Differences in the power of the two lasers (or other scanner problems)

• Difference in DNA concentration on arrays (plate effects)

• Spatial biases in ratios across the surface of the microarray due to uneven hybridization

• cDNA array cannot distinguish alternatively spliced forms

Errors that cannot be corrected by statistics

• Competitive hybridization of different targets on the chip

• Failure to distinguish different splicing forms

• Misinterpretation of time course data when there are not sufficient points

• Misinterpretation of relative intensity

Does clustered time course really mean co-expression?

Picture taken from http://genomics.stanford.edu/yeast/additional_figures_link.html

Yes, you can study a known system (such as the cell cycle) this way; but what about unknown systems?

Normalization by iterative linear regression

fit a line (y=mx+b) to the data set

set aside outliers (residuals > 2 x s.e.)

repeat until r² changes by < 0.001

then apply slope and intercept to the original dataset

D Finkelstein et al. http://www.camda.duke.edu/CAMDA00/abstracts.asp
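The four steps above can be sketched as follows. This is not the cited authors' code. It assumes that "s.e." means the residual standard error of the fit, and that "applying" the slope and intercept means rescaling the second channel so the fitted line becomes y = x. All names and data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b, r squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r2


def iterative_fit(xs, ys, tolerance=0.001, max_iterations=50):
    """Refit after setting aside outliers until r squared stabilizes."""
    points = list(zip(xs, ys))
    slope, intercept, r2 = fit_line(xs, ys)
    for _ in range(max_iterations):
        residuals = [y - (slope * x + intercept) for x, y in points]
        # Residual standard error (assumed meaning of "s.e." on the slide).
        se = math.sqrt(sum(r * r for r in residuals) / (len(points) - 2))
        kept = [p for p, r in zip(points, residuals) if abs(r) <= 2 * se]
        if len(kept) < 3 or len(kept) == len(points):
            break  # nothing left to trim, or too few points to refit
        new_slope, new_intercept, new_r2 = fit_line(*zip(*kept))
        converged = abs(new_r2 - r2) < tolerance
        points, slope, intercept, r2 = kept, new_slope, new_intercept, new_r2
        if converged:
            break
    return slope, intercept


# Hypothetical log intensities: channel y = 2*x + 1, plus one gross outlier.
channel_x = [float(i) for i in range(1, 21)]
channel_y = [2 * x + 1 for x in channel_x]
channel_y[9] += 30

slope, intercept = iterative_fit(channel_x, channel_y)
# Apply the fit to the ORIGINAL data set, outlier included.
normalized_y = [(y - intercept) / slope for y in channel_y]
```

Because the outlier is set aside before the final fit, it no longer distorts the slope and intercept, yet it remains in the normalized data and still stands out as a candidate differentially expressed spot.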

[Figure: ratio {log2 (Cy5 / Cy3)} plotted against average signal {log2 (Cy3 + Cy5)/2}, with a Loess function fit line and a reference line at 0]

Normalization (Curvilinear)

G Tseng et al., NAR 2001
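A rough sketch of curvilinear normalization is shown below. It is a simplified, single-pass local linear fit with tricube weights, not a full Loess implementation and not the procedure of the cited paper. It assumes the common convention M = log2(Cy5/Cy3) and A = (log2 Cy3 + log2 Cy5)/2, which may differ slightly from the axis label on the slide. Function names and data are hypothetical.

```python
import math


def local_linear_fit(a_values, m_values, a0, span=0.5):
    """Estimate the trend of M at A = a0 with tricube-weighted regression."""
    n = len(a_values)
    k = min(n, max(3, math.ceil(span * n)))
    # Bandwidth: distance to the k-th nearest point (guarding against zero).
    bandwidth = sorted(abs(a - a0) for a in a_values)[k - 1] or 1e-12
    sw = swa = swm = swaa = swam = 0.0
    for a, m in zip(a_values, m_values):
        u = abs(a - a0) / bandwidth
        if u >= 1.0:
            continue
        w = (1 - u ** 3) ** 3  # tricube weight
        sw += w
        swa += w * a
        swm += w * m
        swaa += w * a * a
        swam += w * a * m
    denominator = sw * swaa - swa * swa
    if abs(denominator) < 1e-12:
        return swm / sw  # degenerate neighborhood: use the weighted mean
    slope = (sw * swam - swa * swm) / denominator
    intercept = (swm - slope * swa) / sw
    return intercept + slope * a0


def normalize_ratios(cy3, cy5, span=0.5):
    """Subtract the fitted intensity-dependent trend from each log ratio."""
    m_values = [math.log2(r / g) for g, r in zip(cy3, cy5)]
    a_values = [(math.log2(g) + math.log2(r)) / 2 for g, r in zip(cy3, cy5)]
    return [m - local_linear_fit(a_values, m_values, a, span)
            for a, m in zip(a_values, m_values)]


# Hypothetical spots with a purely intensity-dependent dye bias.
cy3 = [2.0 ** g for g in range(4, 16)]
cy5 = [2.0 ** (1.2 * g + 0.5) for g in range(4, 16)]
normalized_m = normalize_ratios(cy3, cy5)
```

In this toy data set every spot's log ratio is explained by the intensity trend alone, so the normalized ratios come out at essentially zero. Real data would leave residual ratios for the genes that genuinely differ between channels.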

After Normalization ……

• Differentially Expressed (DE) gene screening
– T-test
– T-statistics
– SVM

• Clustering
– Hierarchical
– SOM
– K-means

• Network (Pathway) analysis
– BioCarta, KEGG, GO databases
– Bayesian network learning
– Topology
– …
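As an illustration of the first item, DE screening with a t-statistic can be sketched as below. The slide does not say which variant is meant; this sketch assumes the unequal-variance (Welch) form, and the gene names and log2 expression values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch t-statistic for the difference in means (group_b - group_a)."""
    difference = mean(group_b) - mean(group_a)
    standard_error = math.sqrt(variance(group_a) / len(group_a)
                               + variance(group_b) / len(group_b))
    if standard_error == 0:
        # No within-group variation: the statistic is undefined or unbounded.
        return 0.0 if difference == 0 else math.copysign(math.inf, difference)
    return difference / standard_error


# Hypothetical normalized log2 expression: (normal replicates, tumor replicates).
expression = {
    "GENE_UP": ([5.0, 5.2, 4.8], [8.0, 8.2, 7.8]),
    "GENE_FLAT": ([6.0, 6.1, 5.9], [6.0, 5.9, 6.1]),
    "GENE_DOWN": ([9.0, 9.4, 8.6], [7.0, 7.4, 6.6]),
}

t_scores = {gene: welch_t(normal, tumor)
            for gene, (normal, tumor) in expression.items()}
# Rank genes by the absolute size of the statistic, largest first.
ranked = sorted(t_scores, key=lambda gene: abs(t_scores[gene]), reverse=True)
```

A P value for each gene would then come from the t distribution, and with thousands of genes those P values need a multiple-testing correction before any cut-off is applied.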

Bioinformatics challenges

1. data management

2. utilizing data from multiple experiments

3. utilizing data from multiple groups

* with different technologies

* with only processed data available

Integrated Bioinformatics Analysis of Gene Expression Profiling

Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression

Daniel R. Rhodes et al., PNAS, 2004(101), 9309-9314

T-test Q values (estimated false discovery rates) were calculated as

Q = (P × n) / i

where P is the P value, n is the total number of genes, and i is the sorted rank of the P value.
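A small worked sketch of the Q value calculation follows. It takes Q = P × n / i, the form the variable definitions above suggest, and ranks P values in ascending order starting at 1. The P values are hypothetical.

```python
def q_values(p_values):
    """Estimate a false discovery rate per gene as Q = P * n / i."""
    n = len(p_values)
    # Indices of the genes ordered from smallest to largest P value.
    order = sorted(range(n), key=lambda index: p_values[index])
    q = [0.0] * n
    for rank, index in enumerate(order, start=1):
        q[index] = p_values[index] * n / rank
    return q


# Hypothetical P values for four genes, in their original (unsorted) order.
example_p = [0.01, 0.04, 0.03, 0.5]
example_q = q_values(example_p)
```

The direct formula does not force Q to increase with rank: here the gene with P = 0.04 (rank 3) gets a smaller Q than the gene with P = 0.03 (rank 2). Procedures such as Benjamini-Hochberg usually add a monotonicity step, which this sketch leaves out.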

Cont. Meta-Profiling.

The purpose of meta-profiling is to address the hypothesis that a selected set of differential expression signatures shares a significant intersection of genes (a meta-signature), thus inferring a biological relatedness.
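The idea of a meta-signature can be sketched as a simple overlap count. This sketch covers only the intersection step and omits the test of whether the overlap is larger than expected by chance, which the hypothesis above requires. The study and gene names are hypothetical.

```python
from collections import Counter


def meta_signature(signatures, min_signatures):
    """Return genes present in at least `min_signatures` of the signatures."""
    counts = Counter(gene
                     for genes in signatures.values()
                     for gene in set(genes))
    return {gene for gene, count in counts.items() if count >= min_signatures}


# Hypothetical differential expression signatures from three studies.
signatures = {
    "study_1": {"GENE_A", "GENE_B", "GENE_C"},
    "study_2": {"GENE_A", "GENE_C", "GENE_D"},
    "study_3": {"GENE_A", "GENE_E"},
}

shared_by_two = meta_signature(signatures, min_signatures=2)
shared_by_all = meta_signature(signatures, min_signatures=3)
```

Lowering the threshold gives a larger but less specific gene set, so the choice of `min_signatures` trades sensitivity against confidence.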

67 genes were screened by meta-analysis

Integrated Cancer Gene Expression Map

7 genes were discovered by the system

THANX!!