Future data scientists also need to be skilled in statistics, and to be able to tell stories with data so that it is understandable to a variety of people.

DNA Microarray and Array Data Analysis

What is a DNA Microarray?

A DNA microarray is a new technology for measuring the levels of the RNA gene products of a living cell.

A microarray chip is a rectangular chip on which a grid of DNA spots is laid out. These spots form a two-dimensional array.

Each spot in the array contains millions of copies of some DNA strand, bonded to the chip.

Chips are made tiny so that only a small amount of RNA is needed from the experimental cells.

DNA Microarray

Many applications in both basic and clinical research: determining the role a gene plays in a pathway, disease, diagnostics and pharmacology, …

There are several main platforms for performing microarray analyses:
  cDNA arrays (generic, multiple manufacturers)
  Oligonucleotide arrays (GeneChips) (Affymetrix)
  BeadArray (BeadChip) (Illumina)
  cDNA membranes (radioactive detection)

cDNA Microarray

Spot cloned cDNAs onto a glass/nylon microscope slide (usually PCR-amplified segments of plasmids)

Complementary hybridization:
  CTAGCAGG   actual gene
  GATCGTCC   cDNA (reverse transcriptase)
  CUAGCAGG   mRNA
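The base pairing in the three sequences above can be checked with a few lines of Python. This is only an illustrative sketch: the function and table names are made up here, and, like the slide, it aligns the strands position by position and ignores 5'/3' orientation.

```python
# Base-pairing tables: DNA against DNA, and an mRNA template against cDNA.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
RNA_TO_CDNA = {"A": "T", "U": "A", "C": "G", "G": "C"}

def complement(strand, pairing):
    """Return the base-by-base complement of `strand` under `pairing`."""
    return "".join(pairing[base] for base in strand)

def transcribe(gene_strand):
    """The mRNA carries the gene sequence with U in place of T."""
    return gene_strand.replace("T", "U")

gene = "CTAGCAGG"
mrna = transcribe(gene)                 # what the cell produces
cdna = complement(mrna, RNA_TO_CDNA)    # what reverse transcriptase produces
print(mrna)  # CUAGCAGG
print(cdna)  # GATCGTCC
```

Because the cDNA is complementary to the mRNA, it is also complementary to the gene sequence itself, which is what lets it hybridize.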

Label 2 mRNA samples with 2 different colors of fluorescent dye: control vs. experimental

Mix the two labeled mRNAs and hybridize to the chip

Make two scans, one for each color

Combine the images to calculate ratios of the amounts of each mRNA that bind to each spot
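The ratio step can be sketched as follows. The intensities are invented for illustration, the use of a log2 scale is a common convention rather than something the slides state, and the `floor` parameter is a hypothetical guard against taking the logarithm of zero.

```python
import math

def log2_ratios(test_intensities, control_intensities, floor=1.0):
    """Per-spot log2(test / control); values below `floor` are raised to it."""
    ratios = []
    for test, ctrl in zip(test_intensities, control_intensities):
        ratios.append(math.log2(max(test, floor) / max(ctrl, floor)))
    return ratios

# Hypothetical intensities for four spots, one list per scan (color).
control = [200.0, 1500.0, 800.0, 50.0]
test = [400.0, 1500.0, 200.0, 50.0]
print(log2_ratios(test, control))  # [1.0, 0.0, -2.0, 0.0]
```

On this scale a value of 1.0 means the spot is twice as bright in the experimental sample, and -2.0 means it is four times dimmer.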

Spotted Microarray Process
[Figure: spotted microarray process with control (CTRL) and test (TEST) samples]

cDNA Array Experiment Movie
http://www.bio.davidson.edu/courses/genomics/chip/chip.html



Affymetrix
  Uses 25-base oligos synthesized in place on a chip (20 pairs of oligos for each gene)
  cRNA is labeled and scanned in a single “color”: one sample per chip
  Can have as many as 760,000 probes on a chip
  Arrays get smaller every year (more genes)
  Chips are expensive
  Proprietary system: “black box” software, can only use their chips

Affymetrix GeneChip® Probe Arrays
[Figure: GeneChip probe array (1.28 cm) and image of a hybridized probe array. Each probe cell or feature (24~50 µm) contains millions of copies of a specific oligonucleotide probe. In a hybridized probe cell, the single-stranded, fluorescently labeled cRNA target is bound to the oligonucleotide probe.]

GeneChip® Human Gene 1.0 ST Array

Affymetrix Genome Arrays

Microarray Data Analysis

Data processing and visualization
Supervised learning
  Feature selection
  Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Infer gene interactions in pathways and networks
Gene regulatory region predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…

Microarrays: An Example

Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al., Science, v. 286, 1999
72 examples (38 train, 34 test), about 7,000 probes
Well-studied (CAMDA-2000), good test example

[Figure: ALL and AML samples]
Visually similar, but genetically very different

Normalization

Need to scale the red sample so that the overall intensities for each chip are equivalent

[Figure: two scatter plots, Sample 1 vs. control and Sample 2 vs. control]

What can we tell from the two plots?

Normalization

To ensure the data are comparable, normalization attempts to correct for the following variables:
  Number of cells in the sample
  Total RNA isolation efficiency
  Signal measurement sensitivity
  …

Can use simple or complicated math:
  Normalization by global scaling (bring each image to the same average brightness)
  Normalization by sectors
  Normalization to housekeeping genes
  …

Active research area
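Global scaling, the simplest of the options listed, can be sketched as below. The chip names and intensities are hypothetical, and using the average of the per-chip means as the default target is an assumption; any common target value would serve.

```python
def global_scale(chips, target=None):
    """Scale each chip so that its mean intensity equals `target`.

    `chips` maps a chip name to a list of probe intensities. If `target`
    is None, the average of the per-chip means is used.
    """
    means = {name: sum(values) / len(values) for name, values in chips.items()}
    if target is None:
        target = sum(means.values()) / len(means)
    return {
        name: [v * target / means[name] for v in values]
        for name, values in chips.items()
    }

# Hypothetical data: chip_b is uniformly twice as bright as chip_a.
chips = {"chip_a": [100.0, 200.0, 300.0], "chip_b": [200.0, 400.0, 600.0]}
scaled = global_scale(chips)
print(scaled["chip_a"])  # [150.0, 300.0, 450.0]
print(scaled["chip_b"])  # [150.0, 300.0, 450.0]
```

A single multiplicative factor per chip removes overall brightness differences, but it cannot correct intensity-dependent or spatial effects, which is where the sector-based and housekeeping-gene approaches come in.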

[Figure: two log-log scatter plots titled "AML vs ALL", one for SP22 vs. SP23 and one for SP33 vs. SP34; both axes run from 1 to 100000]








…

Feature selection

Probe            AML1     AML2     AML3     ALL1     ALL2     ALL3
D21869_s_at      170.7    55.0     43.7     5.5      807.9    1283.5
D25233cds_at     605      31.0     629.2    441.7    95.3     205.6
D25543_at        2148.7   2303.0   1915.5   49.2     96.3     89.8
L03294_g_at      241.8    721.5    77.2     66.1     107.3    132.5
J03960_at        774.5    3439.8   614.3    556      14.4     12.9
M81855_at        1087     1283.7   1372.1   1469     4611.7   3211.8
L14936_at        212.6    2848.5   236.2    260.5    2650.9   2192.2
L19998_at        367      3.2      661.7    629.4    151      193.9
L19998_g_at      65.2     56.9     29.6     434.0    719.4    565.2
AB017912_at      1813.7   9520.6   2404.3   3853.1   6039.4   4245.7
AB017912_g_at    385.4    2396.8   363.7    419.3    6191.9   5617.6
U86635_g_at      83.3     470.9    52.3     3272.5   3379.6   5174.6
…                …        …        …        …        …        …

Feature selection

Probe            AML1     AML2     AML3     ALL1     ALL2     ALL3     p-value
D21869_s_at      170.7    55.0     43.7     5.5      807.9    1283.5   0.243
D25233cds_at     605      31.0     629.2    441.7    95.3     205.6    0.487
D25543_at        2148.7   2303.0   1915.5   49.2     96.3     89.8     0.0026
L03294_g_at      241.8    721.5    77.2     66.1     107.3    132.5    0.332
J03960_at        774.5    3439.8   614.3    556      14.4     12.9     0.260
M81855_at        1087     1283.7   1372.1   1469     4611.7   3211.8   0.178
L14936_at        212.6    2848.5   236.2    260.5    2650.9   2192.2   0.626
L19998_at        367      3.2      661.7    629.4    151      193.9    0.941
L19998_g_at      65.2     56.9     29.6     434.0    719.4    565.2    0.022
AB017912_at      1813.7   9520.6   2404.3   3853.1   6039.4   4245.7   0.963
AB017912_g_at    385.4    2396.8   363.7    419.3    6191.9   5617.6   0.236
U86635_g_at      83.3     470.9    52.3     3272.5   3379.6   5174.6   0.022
…                …        …        …        …        …        …        …

Hypothesis Testing

A null hypothesis is a hypothesis set up to be nullified in order to support an alternative hypothesis.

Hypothesis testing assesses the viability of the null hypothesis given a set of experimental data.

Example: Test whether the time to respond to a tone is affected by the consumption of alcohol
  Hypothesis: $\mu_1 - \mu_2 = 0$
  $\mu_1$ is the mean time to respond after consuming alcohol
  $\mu_2$ is the mean time to respond otherwise

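Z-test

Theorem: If the $x_i$, $i = 1, \ldots, n$, are independent and normally distributed with mean $\mu$ and variance $\sigma^2$, then $U = \sum_i a_i x_i$ has a normal distribution with mean $E(U) = \mu \sum_i a_i$ and variance $D(U) = \sigma^2 \sum_i a_i^2$. In particular, $\bar{x} = \sum_i x_i / n \sim N(\mu, \sigma^2/n)$.

Z test: $H: \mu = \mu_0$ ($\mu_0$ and $\sigma_0$ are known; assume $\sigma = \sigma_0$)

Example: What would one conclude about the null hypothesis that a sample of N = 46 with a mean of 104 could reasonably have been drawn from a population with the parameters $\mu = 100$ and $\sigma = 8$? Use $\alpha = 0.05$, two-tailed.

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}} = \frac{8}{\sqrt{46}} = 1.18, \qquad z_{obt} = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{104 - 100}{1.18} = 3.39$$

$z_{crit} = \pm 1.96$. Since $|z_{obt}| > 1.96$, reject the null hypothesis.

Note: z follows a normal distribution N(0, 1)

http://davidmlane.com/hyperstat/z_table.html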
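The worked z-test example (N = 46, sample mean 104, population mean 100, population standard deviation 8, two-tailed at the 0.05 level) can be reproduced with the Python standard library instead of a z table. The function name and return layout below are illustrative choices.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample z-test with a known population sigma."""
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # critical value from N(0, 1)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, z_crit, p_value, abs(z) > z_crit

z, z_crit, p_value, reject = z_test(sample_mean=104, mu0=100, sigma=8, n=46)
print(round(z, 2), round(z_crit, 2), reject)  # 3.39 1.96 True
```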


But in practice, $\sigma_0$ is often unknown.

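T-test

One-sample test: $x_1, \ldots, x_n \sim N(\mu, \sigma^2)$

$$H: \mu = \mu_0, \qquad t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Two-sample test: $x_1, \ldots, x_n \sim N(\mu_1, \sigma_1^2)$, $y_1, \ldots, y_m \sim N(\mu_2, \sigma_2^2)$

$$H: \mu_1 = \mu_2, \qquad t = \frac{\bar{x} - \bar{y}}{s_{\bar{x}-\bar{y}}}, \qquad s_{\bar{x}-\bar{y}} = \sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}, \qquad s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

with $s_y^2$ defined analogously from the $y_i$.

$s_{\bar{x}-\bar{y}}$ is the standard error of the difference, assuming $\sigma_1$ and $\sigma_2$ are different.

http://www.stat.tamu.edu/~west/applets/tdemo.html

Distribution of the test statistic under the null hypothesis:

$$H: \mu = \mu_0: \quad t \sim t(n-1)$$

$$H: \mu_1 = \mu_2: \quad t \sim t(n+m-2)$$

(When the two variances are not assumed equal, the degrees of freedom of the two-sample test are commonly approximated with the Welch-Satterthwaite formula instead of $n+m-2$.)

William Sealy Gosset (1876-1937)
(Guinness Brewing Company)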

P-value

Does a particular gene have the same expression level in ALL and AML?

Probe            AML1     AML2     AML3     ALL1     ALL2     ALL3     p-value
D25543_at        2148.7   2303.0   1915.5   49.2     96.3     89.8     0.0026
L03294_g_at      241.8    721.5    77.2     66.1     107.3    132.5    0.332
…                …        …        …        …        …        …        …

Data processing
Feature selection
  T-test
  Based on the fold change
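Fold-change ranking is the simpler of the two feature-selection criteria. A minimal sketch follows, using the six values listed for probe D25543_at. Defining the fold change as the log2 ratio of group means is one common convention and is an assumption here, since the slides do not say which definition they use.

```python
import math

def log2_fold_change(group_a, group_b):
    """log2 of the ratio of the group means (group_a over group_b)."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2(mean_a / mean_b)

# Probe D25543_at: three AML and three ALL samples.
aml_values = [2148.7, 2303.0, 1915.5]
all_values = [49.2, 96.3, 89.8]
lfc = log2_fold_change(aml_values, all_values)
print(round(lfc, 2))
```

A large absolute value flags the probe as a candidate feature. Unlike the t-test, this criterion ignores the spread within each group.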

Matlab ttest

[H,P] = ttest2(X,Y)

Tests whether the means of the samples X and Y are statistically different.

H returns 0 or 1: 0 means the null hypothesis (that the means are the same) is not rejected, 1 means it is rejected

P returns the p-value of the test
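For readers without MATLAB, the two-sample test can be sketched in plain Python. This is not a port of `ttest2`: it is Welch's unequal-variance form of the test, and the p-value is obtained by numerically integrating the t density with Simpson's rule, which is an implementation shortcut chosen here to avoid external libraries. All function names are illustrative.

```python
import math

def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n, m = len(x), len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / m
    var_x = sum((v - mean_x) ** 2 for v in x) / (n - 1)
    var_y = sum((v - mean_y) ** 2 for v in y) / (m - 1)
    vx, vy = var_x / n, var_y / m          # squared standard errors
    t = (mean_x - mean_y) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, df

def t_pdf(u, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(u * u / df))

def two_tailed_p(t, df, steps=20000):
    """Two-tailed p-value via Simpson's rule on the density over [0, |t|]."""
    upper = abs(t)
    h = upper / steps                      # `steps` must be even
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_mass = 2 * total * h / 3       # probability of |T| <= |t|
    return max(0.0, 1.0 - central_mass)

# Probe D25543_at: three AML and three ALL samples.
aml_values = [2148.7, 2303.0, 1915.5]
all_values = [49.2, 96.3, 89.8]
t_stat, df = welch_t(aml_values, all_values)
p_value = two_tailed_p(t_stat, df)
print(round(t_stat, 1), round(df, 2), round(p_value, 4))
```

With these inputs the p-value should come out near the 0.0026 listed for this probe, which suggests (but does not prove) that the tabulated values were computed with the unequal-variance form.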








…

Nearest Neighbor Classification
[Figure: scatter plots of Feature 1 vs. Feature 2 showing AML training samples (M), ALL training samples (L), and a test sample that is labeled according to its nearest neighbor]
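Nearest neighbor classification assigns a test sample the label of the closest training sample. A minimal sketch follows. The two-feature training points are invented for illustration, and Euclidean distance is assumed.

```python
import math

def nearest_neighbor(train, query):
    """Return the label of the training point closest to `query`.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    best_label, best_dist = None, math.inf
    for features, label in train:
        dist = math.dist(features, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical (feature 1, feature 2) values for six training samples.
train = [((1.0, 1.2), "ALL"), ((1.5, 0.8), "ALL"), ((0.9, 1.9), "ALL"),
         ((4.0, 4.2), "AML"), ((4.5, 3.7), "AML"), ((3.8, 4.9), "AML")]
print(nearest_neighbor(train, (1.2, 1.0)))  # ALL
print(nearest_neighbor(train, (4.1, 4.0)))  # AML
```

Extending this to k nearest neighbors means keeping the k smallest distances and taking a majority vote over their labels.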

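Distance Issues

■ Euclidean distance

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$$

■ Pearson distance

$$d(\mathbf{x}, \mathbf{y}) = 1 - \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}}$$

[Figure: expression profiles of gene1 to gene4 (g1 to g4) at time0 to time3; the vertical axis runs from 0 to 400]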
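The two distances can disagree about which profiles are similar. In the sketch below the profiles are invented for illustration: `g2` has the same shape as `g1` at three times the amplitude, while `g3` has the opposite trend but a similar magnitude.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 minus the Pearson correlation of the two profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                     * sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / norm

# Hypothetical expression profiles over four time points.
g1 = [10.0, 20.0, 30.0, 40.0]
g2 = [30.0, 60.0, 90.0, 120.0]   # same shape as g1, larger scale
g3 = [40.0, 30.0, 20.0, 10.0]    # opposite trend, similar magnitude

print(euclidean_distance(g1, g2), euclidean_distance(g1, g3))
print(pearson_distance(g1, g2), pearson_distance(g1, g3))
```

Euclidean distance puts `g3` closer to `g1`, whereas Pearson distance puts `g2` at distance 0 and `g3` at the maximum of 2, so the choice of metric determines which genes look co-regulated.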

Cross-validation
http://en.wikipedia.org/wiki/Cross-validation_(statistics)







…

Genetic Algorithm for Feature Selection
[Figure: sample (clear cell RCC, etc.), raw measurement data, and the resulting feature vector (f1, f2, f3, f4, f5) = pattern]

Why Genetic Algorithm?

Assume 2,000 relevant genes and 20 important discriminator genes (features).

Cost of an exhaustive search for the optimal set of features?

$$C(n,k) = \frac{n!}{k!\,(n-k)!}, \qquad C(2000, 20) = \frac{2000!}{20!\,1980!} \ge 100^{20} = 10^{40}$$

If it takes one femtosecond ($10^{-15}$ second) to evaluate a set of features, it takes more than $3 \times 10^{17}$ years to find the optimal solution on the computer.
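The arithmetic behind this estimate is easy to check. The sketch below uses the exact binomial coefficient from the standard library and the same lower bound of 100^20; the year length of 365.25 days is an assumption.

```python
import math

n_genes, n_features = 2000, 20
subsets = math.comb(n_genes, n_features)               # exact number of feature sets
lower_bound = (n_genes // n_features) ** n_features    # 100**20 == 10**40

seconds_per_eval = 1e-15                               # one femtosecond
seconds_per_year = 365.25 * 24 * 3600
years_at_bound = lower_bound * seconds_per_eval / seconds_per_year

print(subsets >= lower_bound)     # True
print(f"{years_at_bound:.2e}")    # 3.17e+17
```

The exact count is far larger than the bound, so the real exhaustive search would take correspondingly longer.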

Evolutionary Methods

Based on the mechanics of Darwinian evolution
  The evolution of a solution is loosely based on biological evolution
Population of competing candidate solutions
  Chromosomes (a set of features)
Genetic operators (mutation, recombination, etc.) generate new candidate solutions
Selection pressure directs the search
  Those that do well survive (selection) to form the basis for the next set of solutions.

A Simple Evolutionary Algorithm
[Figure: cycle of Selection, Genetic Operators, and Evaluation]

Genetic Operators

Crossover
  Parents: 10 30 50 70 and 20 40 60 80
  Randomly selected crossover point: 10 30 | 50 70 and 20 40 | 60 80

Mutation
  10 30 62 80
  Randomly selected mutation site

Recombination is intended to produce promising individuals.

Mutation maintains population diversity, preventing premature convergence.
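The two operators can be sketched with the chromosomes from the slide. Passing the crossover point and the mutation site as arguments keeps the example reproducible; in an actual run they would be drawn at random. The function names are illustrative.

```python
import random

def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two parents after index `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(chromosome, site, new_gene):
    """Return a copy of `chromosome` with the gene at `site` replaced."""
    mutated = list(chromosome)
    mutated[site] = new_gene
    return mutated

parent_a, parent_b = [10, 30, 50, 70], [20, 40, 60, 80]
child_a, child_b = one_point_crossover(parent_a, parent_b, point=2)
print(child_a, child_b)                       # [10, 30, 60, 80] [20, 40, 50, 70]
print(mutate(child_a, site=2, new_gene=62))   # [10, 30, 62, 80]

# Randomly selected crossover point, as a GA would choose it.
rng = random.Random(0)
random_point = rng.randrange(1, len(parent_a))
```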

Genetic Algorithm
[Figure: genetic algorithm loop in five numbered steps over a population of chromosomes, each a set of five gene features (for example g20, g7, g5, g2, g100). If the best chromosome is good enough, stop; if not good enough, continue with the next generation.]

GA Fitness

At the core of any optimization approach is the function that measures the quality of a solution or optimization.

Called: objective function, fitness function, error function, measure, etc.

Encoding

The most difficult and most important part of any GA

Encode so that illegal solutions are not possible

Encode to simplify the “evolutionary” processes, e.g. reduce the size of the search space

Most GAs use a binary encoding of a solution, but other schemes are possible

Genetic Algorithm/K-Nearest Neighbor Algorithm
[Figure: Microarray Database, Feature Selection (GA), Classifier (kNN)]
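A GA/kNN combination can be sketched as a wrapper: each chromosome is a set of feature indices, and its fitness is the accuracy of a nearest neighbor classifier restricted to those features. Everything below is an assumption made for illustration rather than the pipeline behind the slides: the synthetic data, leave-one-out accuracy with k = 1 as the fitness, truncation selection, and the parameter defaults. It also assumes at least two features per chromosome and more features than chromosome slots.

```python
import math
import random

def loo_accuracy(samples, labels, features):
    """Leave-one-out 1-NN accuracy using only the chosen feature indices."""
    correct = 0
    for i, query in enumerate(samples):
        best_j, best_d = None, math.inf
        for j, other in enumerate(samples):
            if i == j:
                continue
            d = sum((query[f] - other[f]) ** 2 for f in features)
            if d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(samples)

def ga_select(samples, labels, k, pop_size=20, generations=30,
              mutation_rate=0.2, seed=0):
    """Evolve chromosomes of k distinct feature indices (k >= 2)."""
    rng = random.Random(seed)
    n_features = len(samples[0])

    def fitness(chromosome):
        return loo_accuracy(samples, labels, chromosome)

    population = [rng.sample(range(n_features), k) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) == 1.0:          # good enough: stop
            break
        survivors = population[: pop_size // 2]    # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            point = rng.randrange(1, k)            # crossover point
            head = a[:point]
            tail = [g for g in b if g not in head][: k - point]
            child = head + tail                    # no duplicate features
            if rng.random() < mutation_rate:       # mutation
                unused = [f for f in range(n_features) if f not in child]
                child[rng.randrange(k)] = rng.choice(unused)
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

# Synthetic data: 10 noise features, of which features 3 and 7 carry the class.
data_rng = random.Random(1)
labels = ["AML"] * 6 + ["ALL"] * 6
samples = []
for label in labels:
    row = [data_rng.gauss(0.0, 1.0) for _ in range(10)]
    shift = 5.0 if label == "AML" else -5.0
    row[3] += shift
    row[7] += shift
    samples.append(row)

best = ga_select(samples, labels, k=2)
print(sorted(best), loo_accuracy(samples, labels, best))
```

The crossover here filters out repeated indices so that every child remains a legal chromosome, which is one way of following the encoding advice that illegal solutions should not be possible.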



Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Gene regulatory region predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…

Unsupervised learning

Supervised methods
  Can only validate or reject hypotheses
  Cannot lead to discovery of unexpected partitions

Unsupervised learning
  No prior knowledge is used
  Explore structure of data on the basis of similarities

DEFINITION OF THE CLUSTERING PROBLEM

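CLUSTER ANALYSIS YIELDS DENDROGRAM
[Figure: dendrogram; the vertical axis is T (resolution)]

BUT WHAT ABOUT THE OKAPI?

Agglomerative Hierarchical Clustering
[Figure: five points (1 to 5) and their dendrogram with leaf order 5, 2, 4, 1, 3; the vertical axis is the distance between joined clusters]

Initially: each point = cluster
At each step, merge the pair of nearest clusters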

Need to define the distance between the new cluster and the other clusters.

Single Linkage: distance between closest pair.

Complete Linkage: distance between farthest pair.

Average Linkage (UPGMA): average distance between all pairs, or distance between cluster centers

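The merge loop and the three linkage rules can be sketched as follows. This is a deliberately naive version that recomputes every pairwise distance at each step. The points are invented for illustration, and "average" here means the mean over all cross-cluster pairs rather than the distance between cluster centers.

```python
import math

def agglomerative(points, linkage="average"):
    """Merge the nearest pair of clusters until one cluster remains.

    Returns the merge history as (members_a, members_b, distance) tuples,
    where members are indices into `points`.
    """
    def cluster_distance(a, b):
        pair_dists = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":
            return min(pair_dists)                 # closest pair
        if linkage == "complete":
            return max(pair_dists)                 # farthest pair
        return sum(pair_dists) / len(pair_dists)   # average over all pairs

    clusters = [(i,) for i in range(len(points))]  # initially: each point = cluster
    merges = []
    while len(clusters) > 1:
        d, a, b = min(
            (cluster_distance(a, b), a, b)
            for idx, a in enumerate(clusters)
            for b in clusters[idx + 1:]
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, d))
    return merges

# Hypothetical two-dimensional points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
merges = agglomerative(points, linkage="single")
for a, b, d in merges:
    print(a, b, round(d, 2))
```

The recorded merge distances are the heights at which branches join in the dendrogram. Rerunning with `linkage="complete"` or `"average"` changes those heights and can change the merge order.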

Hierarchical Clustering - Summary

Results depend on distance update method

Greedy iterative process

NOT robust against noise

No inherent measure to identify stable clusters

Average Linkage (UPGMA) – the most widely used clustering method in gene expression analysis

Cluster both genes and samples

Samples should cluster together based on experimental design

Often a way to catch labelling errors or heterogeneity in samples

[Figure: heat map; Nature 2002, breast cancer]

Centroid methods – K-means

Data points at $X_i$, $i = 1, \ldots, N$

Centroids at $Y_\nu$, $\nu = 1, \ldots, K$

Assign data point $i$ to centroid $\nu$: $S_i = \nu$

Cost E:

$$E(S_1, S_2, \ldots, S_N; Y_1, \ldots, Y_K) = \sum_{i=1}^{N} \sum_{\nu=1}^{K} \delta(S_i, \nu)\,(X_i - Y_\nu)^2$$

Minimize E over $S_i$, $Y_\nu$

K-means

“Guess” K=3

Iteration = 0: Start with random positions of centroids.

Iteration = 1: Assign each data point to closest centroid.

Iteration = 2: Move centroids to center of assigned points

Iteration = 3: Move centroids to center of assigned points

K-means - Summary

Iterate till minimal cost
Fast algorithm: compute distances from data points to centroids
Result depends on initial centroids’ position
Must preset K
Fails for “non-spherical” distributions
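The assign-then-move iteration can be sketched as below. Starting the centroids at randomly chosen data points is an assumption (the slides only say "random positions"), and the sample points and the `seed` parameter are illustrative.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Alternate assignment and centroid update until assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centroids
    assignment = None
    for _ in range(iterations):
        # Assign each data point to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:       # nothing changed: stop
            break
        assignment = new_assignment
        # Move each centroid to the center of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                        # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    cost = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
    return centroids, assignment, cost

# Hypothetical points forming three loose groups.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
          (5.0, 5.0), (5.4, 4.8), (4.9, 5.5),
          (9.0, 0.0), (9.3, 0.4), (8.8, 0.2)]
centroids, assignment, cost = kmeans(points, k=3)
print(assignment, round(cost, 3))
```

Running it with different `seed` values is a quick way to see the dependence on the initial centroid positions: some starts settle at a higher cost than others.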

Issues in Cluster Analysis

A lot of clustering algorithms
A lot of distance/similarity metrics
Which clustering algorithm runs faster and uses less memory?
How many clusters after all?
Are the clusters stable?
Are the clusters meaningful?

Which Clustering Method Should I Use?

What is the biological question?
Do I have a preconceived notion of how many clusters there should be?
How strict do I want to be? Split or join?
Can a gene be in multiple clusters?
Hard or soft boundaries between clusters?
