b ioinformatics dr. aladdin hamwiehkhalid al-shamaa abdulqader jighly 2010-2011 lecture 8 analyzing...

BIOINFORMATICS

Dr. Aladdin Hamwieh Khalid Al-shamaa

Abdulqader Jighly

2010-2011

Lecture 8Analyzing Microarray

Data

Aleppo UniversityFaculty of technical engineeringDepartment of Biotechnology

Microarray can monitor many genes at once, a DNA microarray is an inert, solid, flat and transparent surface (e.g.: a microscopic slide) onto which 20,000 to 60,000 short DNA probes of specified sequences are orderly tethered. Each probe corresponds to a particular short section of a gene. So a single gene is covered by several probes which span different parts of the gene sequence.

REPOSITORIES OF MICROARRAY STUDIES

Due to the large use of microarrays, data repositories have flourished world-wide. Three of the largest databases of gene expression are:

1.The Gene Expression Omnibus (GEO)2.National Center for Biotechnology

Information (NCBI)3.Stanford Microarray Data Base (SMD)And for PLANTS

Plant Expression databasePLEXdb

DNA microarrays measure the RNA abundance with either 1 channel (one color) or 2 channels (two colors).

Affymetrix GeneChip has 1 channel and use either fluorescent red dye Cy5 or green fluorescent dye, Cy3

Stanford microarrays measure by competitive hybridization the relative expression under a given condition (fluorescent red dye Cy5) compared to its control (labeled with a green fluorescent dye, Cy3) (Two channels)

Biological questionDifferentially expressed genesSample class prediction etc.

Testing

Biological verification and interpretation

Microarray experiment

Estimation

Experimental design

Image analysis

Normalization

Clustering Discrimination

R, G

16-bit TIFF files

(Rfg, Rbg), (Gfg, Gbg)

MICROARRAY EXPERIMENT

1. Isolate mRNA2. Make labelled cDNA library3. Apply your DNA on the slide4. Scan the slide5. Purify the picture6. Extract the data7. Analyse your data

RESULTS

The colors denote the degree of expression in the

experimental versus the control cells.

Gene not expressed in control or in experimental

cells

Only incontrol

cells

Mostly incontrol

cells

Only inexperimental

cells

Mostly inexperimental

cells

Same inboth cells

Let us talk about the analysis and the mathematical problems:

Now we have a lot of pictures which contain a huge information so:1- we have to purify the picture2- we have to extract our data.


Testing



Estimation

Experimental design

Image analysis

Normalization


R, G

16-bit TIFF files


IMAGE ANALYSIS

The raw data from a cDNA microarray experiment consist of pairs of image files, 16-bit TIFFs, one for each of the dyes.

Image analysis is required to extract measures of the red and green fluorescence intensities for each spot on the array.

STEPS IN IMAGE ANALYSIS

1. Addressing. Estimate location of spot centers.

2. Segmentation. Classify pixels as foreground (signal) or background.

3. Information extraction. For each spot on the array and each dye

• foreground intensities;• background intensities; • quality measures.

WHY DO WE CALCULATE THE BACKGROUND INTENSITIES?

Motivation behind background adjustment: A spot’s measured fluorescence intensity includes a contribution that is not specifically due to the hybridization of the target to the probe, but to something else, e.g. the chemical treatment of the slide, autofluorescence etc. Want to estimate and remove this unwanted contribution.

QUANTIFICATION OF EXPRESSION

For each spot on the slide we calculate

Red intensity = Rfg - Rbg

fg = foreground, bg = background, and

Green intensity = Gfg – Gbg

CDNA GENE EXPRESSION DATA

Genes

mRNA samples

Gene expression level of gene 5 in mRNA sample 4

= log2( Red intensity / Green intensity)

sample1 sample2 sample3 sample4 sample5 …1 0.46 0.30 0.80 1.51 0.00 ...2 -0.10 0.49 0.24 0.06 0.46 ...3 0.15 0.74 0.04 0.10 0.20 ...4 -0.45 -1.03 -0.79 -0.56 -0.32 ...5 -0.06 1.06 1.35 1.09 -1.09 ...

Data on p genes for n samples

down-regulated gene

Up-regulated gene

unchangedexpression


Testing



Estimation

Experimental design

Image analysis

Normalization


R, G

16-bit TIFF files


NORMALIZATION

Why?To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples for example:

1. Dyes activity2. Dyes quantity3. scanning parameters4. location on the array5. Air bubbles

SELF-SELF HYBRIDIZATIONS

How do we know it is necessary?

By examining self-self hybridizations, we label one sample from the same tissue with two dyes Cy3 , Cy5 so We find dye biases.


Testing



Estimation

Experimental design

Image analysis

Normalization


R, G

16-bit TIFF files


HOMOGENEITY AND SEPARATION PRINCIPLES

Homogeneity: Elements within a cluster are close to each other

Separation: Elements in different clusters are further apart from each other

…clustering is not an easy task!

Given these points a clustering algorithm might make two distinct clusters as follows

BAD CLUSTERING

This clustering violates both Homogeneity and Separation principles

Close distances from points in separate clusters

Far distances from points in the same cluster

GOOD CLUSTERING

This clustering satisfies both Homogeneity and Separation principles

CLUSTERING TECHNIQUES Agglomerative: Start with every element in its

own cluster, and iteratively join clusters together

Divisive: Start with one cluster and iteratively divide it into smaller clusters

Hierarchical: Organize elements into a tree, leaves represent genes and the length of the pathes between leaves represents the distances between genes. Similar genes lie within the same subtrees

HIERARCHICAL CLUSTERING

12

3

4

6

5

789

7 9 84 51 2 3 6

HIERARCHICAL CLUSTERING ALGORITHM1. Hierarchical Clustering (d , n)2. Form n clusters each with one element3. Construct a graph T by assigning one vertex to each cluster4. while there is more than one cluster5. Find the two closest clusters C1 and C2

6. Merge C1 and C2 into new cluster C with |C1| +|C2| elements7. Compute distance from C to all other clusters8. Add a new vertex C to T and connect to vertices C1 and C2

9. Remove rows and columns of d corresponding to C1 and C2

10. Add a row and column to d corrsponding to the new cluster C

11. return TThe algorithm takes a nxn distance matrix d of pairwise distances between points as an input.

K-MEANS CLUSTERING PROBLEM: FORMULATION Input: A set, V, consisting of n points and a

parameter k Output: A set X consisting of k points

(cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X

1-MEANS CLUSTERING PROBLEM: AN EASY CASE

Input: A set, V, consisting of n points

Output: A single points x (cluster center) that minimizes the squared error distortion d(V,x) over all possible choices of x

1-MEANS CLUSTERING PROBLEM: AN EASY CASE Input: A set, V, consisting of n points

Output: A single points x (cluster center) that minimizes the squared error distortion d(V,x) over all possible choices of x

1-Means Clustering problem is easy.

However, it becomes very difficult (NP-complete) for more than one center.

An efficient heuristic method for K-Means clustering is the Lloyd algorithm

0

1

2

3

4

5

0 1 2 3 4 5

expression in condition 1

exp

ressio

n in

co

nd

itio

n 2

x1

x2

x3

K-MEANS CLUSTERING: LLOYD ALGORITHM

K-MEANS CLUSTERING: LLOYD ALGORITHM1. Lloyd Algorithm2. Arbitrarily assign the k cluster centers3. while the cluster centers keep changing4. Assign each data point to the cluster Ci

corresponding to the closest cluster representative (center) (1 ≤ i ≤

k)5. After the assignment of all data points,

compute new cluster representatives according to the center of gravity of

each cluster, that is, the new cluster representative is

∑v \ |C| for all v in C for every cluster C

*This may lead to merely a locally optimal clustering.

THANK YOU

b ioinformatics dr. aladdin hamwiehkhalid al-shamaa abdulqader jighly 2010-2011 lecture 8 analyzing...

Documents