b ioinformatics dr. aladdin hamwiehkhalid al-shamaa abdulqader jighly 2010-2011 lecture 8 analyzing...

34
BIOINFORMATICS Dr. Aladdin Hamwieh Khalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data po University lty of technical engineering rtment of Biotechnology

Upload: cuthbert-sanders

Post on 23-Dec-2015

214 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

BIOINFORMATICS

Dr. Aladdin Hamwieh Khalid Al-shamaa

Abdulqader Jighly

2010-2011

Lecture 8Analyzing Microarray

Data

Aleppo UniversityFaculty of technical engineeringDepartment of Biotechnology

Page 2: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

Microarray can monitor many genes at once, a DNA microarray is an inert, solid, flat and transparent surface (e.g.: a microscopic slide) onto which 20,000 to 60,000 short DNA probes of specified sequences are orderly tethered. Each probe corresponds to a particular short section of a gene. So a single gene is covered by several probes which span different parts of the gene sequence.

Page 3: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

REPOSITORIES OF MICROARRAY STUDIES

Due to the large use of microarrays, data repositories have flourished world-wide. Three of the largest databases of gene expression are:

1.The Gene Expression Omnibus (GEO)2.National Center for Biotechnology

Information (NCBI)3.Stanford Microarray Data Base (SMD)And for PLANTS

Plant Expression databasePLEXdb

Page 4: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

DNA microarrays measure the RNA abundance with either 1 channel (one color) or 2 channels (two colors).

Affymetrix GeneChip has 1 channel and use either fluorescent red dye Cy5 or green fluorescent dye, Cy3

Stanford microarrays measure by competitive hybridization the relative expression under a given condition (fluorescent red dye Cy5) compared to its control (labeled with a green fluorescent dye, Cy3) (Two channels)

Page 5: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

Biological questionDifferentially expressed genesSample class prediction etc.

Testing

Biological verification and interpretation

Microarray experiment

Estimation

Experimental design

Image analysis

Normalization

Clustering Discrimination

R, G

16-bit TIFF files

(Rfg, Rbg), (Gfg, Gbg)

Page 6: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

Biological questionDifferentially expressed genesSample class prediction etc.

Testing

Biological verification and interpretation

Microarray experiment

Estimation

Experimental design

Image analysis

Normalization

Clustering Discrimination

R, G

16-bit TIFF files

(Rfg, Rbg), (Gfg, Gbg)

Page 7: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

VIDEO

Page 8: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

Biological questionDifferentially expressed genesSample class prediction etc.

Testing

Biological verification and interpretation

Microarray experiment

Estimation

Experimental design

Image analysis

Normalization

Clustering Discrimination

R, G

16-bit TIFF files

(Rfg, Rbg), (Gfg, Gbg)

Page 9: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

MICROARRAY EXPERIMENT

1. Isolate mRNA2. Make labelled cDNA library3. Apply your DNA on the slide4. Scan the slide5. Purify the picture6. Extract the data7. Analyse your data

Page 10: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

RESULTS

The colors denote the degree of expression in the

experimental versus the control cells.

Gene not expressed in control or in experimental

cells

Only incontrol

cells

Mostly incontrol

cells

Only inexperimental

cells

Mostly inexperimental

cells

Same inboth cells

Page 11: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical
Page 12: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

Let us talk about the analysis and the mathematical problems:

Now we have a lot of pictures which contain a huge information so:1- we have to purify the picture2- we have to extract our data.

Page 13: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

Biological questionDifferentially expressed genesSample class prediction etc.

Testing

Biological verification and interpretation

Microarray experiment

Estimation

Experimental design

Image analysis

Normalization

Clustering Discrimination

R, G

16-bit TIFF files

(Rfg, Rbg), (Gfg, Gbg)

Page 14: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

IMAGE ANALYSIS

The raw data from a cDNA microarray experiment consist of pairs of image files, 16-bit TIFFs, one for each of the dyes.

Image analysis is required to extract measures of the red and green fluorescence intensities for each spot on the array.

Page 15: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

STEPS IN IMAGE ANALYSIS

1. Addressing. Estimate location of spot centers.

2. Segmentation. Classify pixels as foreground (signal) or background.

3. Information extraction. For each spot on the array and each dye

• foreground intensities;• background intensities; • quality measures.

Page 16: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

WHY DO WE CALCULATE THE BACKGROUND INTENSITIES?

Motivation behind background adjustment: A spot’s measured fluorescence intensity includes a contribution that is not specifically due to the hybridization of the target to the probe, but to something else, e.g. the chemical treatment of the slide, autofluorescence etc. Want to estimate and remove this unwanted contribution.

Page 17: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

QUANTIFICATION OF EXPRESSION

For each spot on the slide we calculate

Red intensity = Rfg - Rbg

fg = foreground, bg = background, and

Green intensity = Gfg – Gbg

Page 18: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

CDNA GENE EXPRESSION DATA

Genes

mRNA samples

Gene expression level of gene 5 in mRNA sample 4

= log2( Red intensity / Green intensity)

sample1 sample2 sample3 sample4 sample5 …1 0.46 0.30 0.80 1.51 0.00 ...2 -0.10 0.49 0.24 0.06 0.46 ...3 0.15 0.74 0.04 0.10 0.20 ...4 -0.45 -1.03 -0.79 -0.56 -0.32 ...5 -0.06 1.06 1.35 1.09 -1.09 ...

Data on p genes for n samples

down-regulated gene

Up-regulated gene

unchangedexpression

Page 19: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

Biological questionDifferentially expressed genesSample class prediction etc.

Testing

Biological verification and interpretation

Microarray experiment

Estimation

Experimental design

Image analysis

Normalization

Clustering Discrimination

R, G

16-bit TIFF files

(Rfg, Rbg), (Gfg, Gbg)

Page 20: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

NORMALIZATION

Why?To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples for example:

1. Dyes activity2. Dyes quantity3. scanning parameters4. location on the array5. Air bubbles

Page 21: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

SELF-SELF HYBRIDIZATIONS

How do we know it is necessary?

By examining self-self hybridizations, we label one sample from the same tissue with two dyes Cy3 , Cy5 so We find dye biases.

Page 22: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

Biological questionDifferentially expressed genesSample class prediction etc.

Testing

Biological verification and interpretation

Microarray experiment

Estimation

Experimental design

Image analysis

Normalization

Clustering Discrimination

R, G

16-bit TIFF files

(Rfg, Rbg), (Gfg, Gbg)

Page 23: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

HOMOGENEITY AND SEPARATION PRINCIPLES

Homogeneity: Elements within a cluster are close to each other

Separation: Elements in different clusters are further apart from each other

…clustering is not an easy task!

Given these points a clustering algorithm might make two distinct clusters as follows

Page 24: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

BAD CLUSTERING

This clustering violates both Homogeneity and Separation principles

Close distances from points in separate clusters

Far distances from points in the same cluster

Page 25: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

GOOD CLUSTERING

This clustering satisfies both Homogeneity and Separation principles

Page 26: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

CLUSTERING TECHNIQUES Agglomerative: Start with every element in its

own cluster, and iteratively join clusters together

Divisive: Start with one cluster and iteratively divide it into smaller clusters

Hierarchical: Organize elements into a tree, leaves represent genes and the length of the pathes between leaves represents the distances between genes. Similar genes lie within the same subtrees

Page 27: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

HIERARCHICAL CLUSTERING

12

3

4

6

5

789

7 9 84 51 2 3 6

Page 28: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

HIERARCHICAL CLUSTERING ALGORITHM1. Hierarchical Clustering (d , n)2. Form n clusters each with one element3. Construct a graph T by assigning one vertex to each cluster4. while there is more than one cluster5. Find the two closest clusters C1 and C2

6. Merge C1 and C2 into new cluster C with |C1| +|C2| elements7. Compute distance from C to all other clusters8. Add a new vertex C to T and connect to vertices C1 and C2

9. Remove rows and columns of d corresponding to C1 and C2

10. Add a row and column to d corrsponding to the new cluster C

11. return TThe algorithm takes a nxn distance matrix d of pairwise distances between points as an input.

Page 29: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

K-MEANS CLUSTERING PROBLEM: FORMULATION Input: A set, V, consisting of n points and a

parameter k Output: A set X consisting of k points

(cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X

Page 30: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

1-MEANS CLUSTERING PROBLEM: AN EASY CASE

Input: A set, V, consisting of n points

Output: A single points x (cluster center) that minimizes the squared error distortion d(V,x) over all possible choices of x

Page 31: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

1-MEANS CLUSTERING PROBLEM: AN EASY CASE Input: A set, V, consisting of n points

Output: A single points x (cluster center) that minimizes the squared error distortion d(V,x) over all possible choices of x

1-Means Clustering problem is easy.

However, it becomes very difficult (NP-complete) for more than one center.

An efficient heuristic method for K-Means clustering is the Lloyd algorithm

Page 32: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

0

1

2

3

4

5

0 1 2 3 4 5

expression in condition 1

exp

ressio

n in

co

nd

itio

n 2

x1

x2

x3

K-MEANS CLUSTERING: LLOYD ALGORITHM

Page 33: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

K-MEANS CLUSTERING: LLOYD ALGORITHM1. Lloyd Algorithm2. Arbitrarily assign the k cluster centers3. while the cluster centers keep changing4. Assign each data point to the cluster Ci

corresponding to the closest cluster representative (center) (1 ≤ i ≤

k)5. After the assignment of all data points,

compute new cluster representatives according to the center of gravity of

each cluster, that is, the new cluster representative is

∑v \ |C| for all v in C for every cluster C

*This may lead to merely a locally optimal clustering.

Page 34: B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly 2010-2011 Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical

THANK YOU