future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people.

Upload: alexandrina-stone

Post on 16-Jan-2016

DNA Microarray and Array Data Analysis

What Is a DNA Microarray?

A DNA microarray is a new technology for measuring the levels of the RNA gene products of a living cell.

A microarray chip is a rectangular chip on which a grid of DNA spots is imposed. These spots form a two-dimensional array.

Each spot in the array contains millions of copies of a particular DNA strand, bonded to the chip.

Chips are made tiny so that only a small amount of RNA is needed from the experimental cells.

DNA Microarray

Many applications in both basic and clinical research: determining the role a gene plays in a pathway, disease, diagnostics, pharmacology, …

The main platforms for performing microarray analyses:
  cDNA arrays (generic, multiple manufacturers)
  Oligonucleotide arrays (GeneChips) (Affymetrix)
  BeadArray (BeadChip) (Illumina)
  cDNA membranes (radioactive detection)

cDNA Microarray

Spot cloned cDNAs onto a glass/nylon microscope slide (usually PCR-amplified segments of plasmids).

Complementary hybridization:
  -- CTAGCAGG  actual gene
  -- GATCGTCC  cDNA (reverse transcriptase)
  -- CUAGCAGG  mRNA

Label 2 mRNA samples with 2 different colors of fluorescent dye: control vs. experimental.

Mix the two labeled mRNAs and hybridize to the chip.

Make two scans, one for each color.

Combine the images to calculate ratios of the amounts of each mRNA that bind to each spot.
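The ratio step in the last line can be sketched in a few lines of Python. This is only an illustration: the intensity values are made up, the function name `log_ratios` is hypothetical, and real pipelines also subtract background and normalize before taking ratios. Ratios are reported on a log2 scale, a common convention that makes up- and down-regulation symmetric.

```python
import math

def log_ratios(test_intensity, control_intensity):
    """Return log2(test / control) per spot; positive means higher in the test sample."""
    return [math.log2(t / c) for t, c in zip(test_intensity, control_intensity)]

# Hypothetical scanned intensities for four spots (not from the slides).
test = [800.0, 200.0, 50.0, 400.0]
control = [200.0, 200.0, 400.0, 100.0]

ratios = log_ratios(test, control)
print(ratios)
```

A value of 0 means the two samples bind equally to that spot; each unit up or down is a two-fold change.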

Spotted Microarray Process

[Figure: spotted microarray process with control (CTRL) and test (TEST) samples]

cDNA Array Experiment Movie

http://www.bio.davidson.edu/courses/genomics/chip/chip.html

Affymetrix

Uses 25-base oligos synthesized in place on a chip (20 pairs of oligos for each gene).

cRNA is labeled and scanned in a single “color”: one sample per chip.

Can have as many as 760,000 probes on a chip.

Arrays get smaller every year (more genes).

Chips are expensive.

Proprietary system: “black box” software, can only use their chips.

Affymetrix GeneChip® Probe Arrays

[Figure: a GeneChip probe array (1.28 cm) and an enlarged hybridized probe cell (24~50 µm), with an image of the hybridized probe array. Each probe cell or feature contains millions of copies of a specific oligonucleotide probe. Labels: oligonucleotide probe; single-stranded, fluorescently labeled cRNA target.]

GeneChip® Human Gene 1.0 ST Array

Affymetrix Genome Arrays

Microarray Data Analysis

Data processing and visualization
Supervised learning
  Feature selection
  Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Infer gene interactions in pathways and networks
Gene regulatory region prediction based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases

Microarrays: An Example

Leukemia: Acute Lymphoblastic (ALL) vs. Acute Myeloid (AML), Golub et al., Science, v. 286, 1999

72 examples (38 train, 34 test), about 7,000 probes

Well studied (CAMDA-2000), good test example

[Figure: ALL and AML samples]

Visually similar, but genetically very different

Normalization

Need to scale the red sample so that the overall intensities for each chip are equivalent.

[Figure: two scatter plots, Sample 1 vs. control and Sample 2 vs. control]

What can we tell from the two plots?

Normalization

To ensure the data are comparable, normalization attempts to correct for the following variables:
  Number of cells in the sample
  Total RNA isolation efficiency
  Signal measurement sensitivity
  …

Can use simple or complicated math:
  Normalization by global scaling (bring each image to the same average brightness)
  Normalization by sectors
  Normalization to housekeeping genes
  …

Active research area
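Global scaling, the simplest of the methods listed, can be sketched as follows. This is a minimal illustration with made-up intensities; the function name `global_scale` and the choice of target (the average of the chip means) are assumptions, not something the slides specify.

```python
def global_scale(chips, target=None):
    """Rescale each chip so that its mean intensity equals a common target."""
    means = [sum(chip) / len(chip) for chip in chips]
    if target is None:
        # Assumed default: bring every chip to the average of the chip means.
        target = sum(means) / len(means)
    return [[value * target / m for value in chip] for chip, m in zip(chips, means)]

# Two hypothetical chips; the second is uniformly twice as bright as the first.
chips = [[100.0, 200.0, 300.0], [200.0, 400.0, 600.0]]

scaled = global_scale(chips)
scaled_means = [sum(chip) / len(chip) for chip in scaled]
print(scaled_means)
```

After scaling, both chips have the same average brightness, so a probe-by-probe comparison no longer reflects the overall brightness difference between the two images.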

[Figure: scatter plot of SP22 vs. SP23 (AML vs. ALL), both axes on a log scale from 1 to 100,000]

[Figure: scatter plot of SP33 vs. SP34 (AML vs. ALL), both axes on a log scale from 1 to 100,000]

Microarray Data Analysis

Data processing and visualization
Supervised learning
  Feature selection
  Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Infer gene interactions in pathways and networks
Gene regulatory region prediction based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases

Feature selection

Probe            AML1     AML2     AML3     ALL1     ALL2     ALL3     p-value
D21869_s_at      170.7    55.0     43.7     5.5      807.9    1283.5   0.243
D25233cds_at     605      31.0     629.2    441.7    95.3     205.6    0.487
D25543_at        2148.7   2303.0   1915.5   49.2     96.3     89.8     0.0026
L03294_g_at      241.8    721.5    77.2     66.1     107.3    132.5    0.332
J03960_at        774.5    3439.8   614.3    556      14.4     12.9     0.260
M81855_at        1087     1283.7   1372.1   1469     4611.7   3211.8   0.178
L14936_at        212.6    2848.5   236.2    260.5    2650.9   2192.2   0.626
L19998_at        367      3.2      661.7    629.4    151      193.9    0.941
L19998_g_at      65.2     56.9     29.6     434.0    719.4    565.2    0.022
AB017912_at      1813.7   9520.6   2404.3   3853.1   6039.4   4245.7   0.963
AB017912_g_at    385.4    2396.8   363.7    419.3    6191.9   5617.6   0.236
U86635_g_at      83.3     470.9    52.3     3272.5   3379.6   5174.6   0.022
…                …        …        …        …        …        …        …

Hypothesis Testing

The null hypothesis is a hypothesis set up to be nullified in order to support an alternative hypothesis.

Hypothesis testing assesses the viability of the null hypothesis for a set of experimental data.

Example: Test whether the time to respond to a tone is affected by the consumption of alcohol.

Hypothesis: µ1 - µ2 = 0
  µ1 is the mean time to respond after consuming alcohol
  µ2 is the mean time to respond otherwise

Z-test

Theorem: If the $x_i$, $i = 1, \dots, n$, are independent and normally distributed with mean $\mu$ and variance $\sigma^2$, then $U = \sum_i a_i x_i$ is normally distributed with mean $E(U) = \mu \sum_i a_i$ and variance $D(U) = \sigma^2 \sum_i a_i^2$. In particular, $\sum_i x_i / n \sim N(\mu, \sigma^2/n)$.

Z test: H: $\mu = \mu_0$ ($\mu_0$ and $\sigma_0$ are known; assume $\sigma = \sigma_0$)

Example: What would one conclude about the null hypothesis that a sample of N = 46 with a mean of 104 could reasonably have been drawn from a population with the parameters $\mu = 100$ and $\sigma = 8$? Use $\alpha = 0.05$, two-tailed.

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}} = \frac{8}{\sqrt{46}} = 1.18, \qquad z_{obt} = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{104 - 100}{1.18} = 3.39$$

Note: z follows a normal distribution N(0, 1).

$z_{crit} = \pm 1.96$. Since $|z_{obt}| > 1.96$, reject the null hypothesis.
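The worked example can be checked numerically. The sketch below uses only Python's standard library; the function name `z_test` is hypothetical, and the two-tailed p-value is an addition that the slide itself does not report.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma0, n):
    """Return the z statistic and its two-tailed p-value for H: mu = mu0."""
    standard_error = sigma0 / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error
    p_two_tailed = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed

# Numbers from the example: N = 46, sample mean 104, mu = 100, sigma = 8.
z, p = z_test(104, 100, 8, 46)

# Two-tailed critical value at alpha = 0.05.
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)
reject = abs(z) > z_crit
print(round(z, 2), round(z_crit, 2), reject)
```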

Z-test

Z test: H: µ = µ0 (µ0 and σ0 are known; assume σ = σ0)

But in practice, σ0 is often unknown.

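T-test

One-sample test:

$$x_1, \dots, x_n \sim N(\mu, \sigma^2), \qquad H: \mu = \mu_0$$

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Two-sample test:

$$x_1, \dots, x_n \sim N(\mu_1, \sigma_1^2), \qquad y_1, \dots, y_m \sim N(\mu_2, \sigma_2^2), \qquad H: \mu_1 = \mu_2$$

$$t = \frac{\bar{x} - \bar{y}}{s_{\bar{x}-\bar{y}}}, \qquad s_{\bar{x}-\bar{y}} = \sqrt{\frac{s_1^2}{n} + \frac{s_2^2}{m}}, \qquad s_1^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

$s_{\bar{x}-\bar{y}}$ is the standard error of the difference, assuming $\sigma_1$ and $\sigma_2$ are different; $s_2^2$ is defined analogously from the $y_i$.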

T-test

One-sample test: under H: $\mu = \mu_0$, $t \sim t(n - 1)$.

Two-sample test: under H: $\mu_1 = \mu_2$, $t \sim t(n + m - 2)$, where n and m are the two sample sizes. This holds when the two variances are equal; with unequal variances the degrees of freedom are usually approximated (Welch's correction).

William Sealy Gosset (1876-1937) (Guinness Brewing Company)

P-value

Does a particular gene have the same expression level in ALL and AML?

Probe            AML1     AML2     AML3     ALL1     ALL2     ALL3     p-value
D25543_at        2148.7   2303.0   1915.5   49.2     96.3     89.8     0.0026
L03294_g_at      241.8    721.5    77.2     66.1     107.3    132.5    0.332
…                …        …        …        …        …        …        …

[Figure: ALL and AML samples]

Data processing

Feature selection
  T-test
  Based on the fold change
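Fold-change selection can be illustrated with one probe from the table. The sketch below is an assumption-laden example: the function name `log2_fold_change` and the two-fold cutoff are hypothetical choices, and the slides do not say which threshold was used.

```python
import math

def log2_fold_change(group_a, group_b):
    """log2 of the ratio of group means; positive means higher in group_a."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2(mean_a / mean_b)

# Expression values for probe D25543_at, taken from the table.
aml = [2148.7, 2303.0, 1915.5]
all_samples = [49.2, 96.3, 89.8]

lfc = log2_fold_change(aml, all_samples)

# Hypothetical cutoff: keep probes that change at least two-fold (|log2 FC| >= 1).
selected = abs(lfc) >= 1.0
print(round(lfc, 2), selected)
```

Fold change ignores the within-group spread, which is why it is usually paired with a test such as the t-test rather than used alone.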

Matlab ttest

[H,P] = ttest2(X,Y)

Tests whether the means of the two samples X and Y are statistically different.

H returns 0 or 1: 1 means the null hypothesis (that the means are the same) is rejected at the default 5% significance level, and 0 means it is not rejected.

P returns the p-value of the test.
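For readers without MATLAB, the two-sample test can be sketched in plain Python. This is not the `ttest2` implementation: it is the unequal-variance (Welch) form of the test, and the p-value is obtained by numerically integrating the t density, a shortcut chosen here to avoid third-party libraries. To my understanding, `ttest2` assumes equal variances unless `'Vartype','unequal'` is passed, so the two may differ. All function names below are hypothetical.

```python
import math
from statistics import mean, variance

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=20000):
    """Two-tailed p-value via Simpson integration of the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)

def welch_t_test(x, y):
    """Return (t, degrees of freedom, two-tailed p) without assuming equal variances."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df, two_tailed_p(t, df)

# Probe D25543_at from the feature selection table.
aml = [2148.7, 2303.0, 1915.5]
all_samples = [49.2, 96.3, 89.8]
t_stat, df, p_value = welch_t_test(aml, all_samples)
print(round(t_stat, 2), round(df, 2), round(p_value, 4))
```

For this probe the p-value comes out near the 0.0026 listed in the table, which suggests the table was computed with the unequal-variance form.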

Microarray Data Analysis

Data processing and visualization
Supervised learning
  Feature selection
  Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Infer gene interactions in pathways and networks
Gene regulatory region prediction based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases

Nearest Neighbor Classification

[Figure: three scatter plots of Feature 1 vs. Feature 2. Legend: M = AML, L = ALL, plus a test sample to be classified from its nearest neighbors.]
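The idea in the figure can be sketched as a small k-nearest-neighbor classifier. The coordinates are invented for illustration, and `knn_classify` with k = 3 and Euclidean distance is an assumed setup rather than the one used in the slides.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature samples: "L" = ALL, "M" = AML.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.5)]
labels = ["L", "L", "L", "M", "M", "M"]

prediction_a = knn_classify(points, labels, (2.0, 2.0))
prediction_b = knn_classify(points, labels, (6.0, 7.0))
print(prediction_a, prediction_b)
```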

Distance Issues

Euclidean distance:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$$

Pearson distance:

$$d(\mathbf{x}, \mathbf{y}) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}$$

This expression is the Pearson correlation coefficient r; a distance is commonly obtained from it as 1 - r.

[Figure: expression profiles of gene1 to gene4 (g1 to g4) over time0 to time3; vertical axis from 0 to 400]
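The difference between the two measures shows up when two genes have the same profile shape at different magnitudes. The sketch below uses invented profiles; the function names are hypothetical, and the distance is taken as 1 - r, which is a common convention rather than something the slide states.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient r of two equal-length profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                            * sum((b - mean_y) ** 2 for b in y))
    return numerator / denominator

# Hypothetical time courses: g2 has the same shape as g1 but ten times the level.
g1 = [10.0, 20.0, 15.0, 30.0]
g2 = [100.0, 200.0, 150.0, 300.0]

d_euclid = euclidean_distance(g1, g2)
r = pearson_correlation(g1, g2)
d_pearson = 1.0 - r  # assumed convention for turning r into a distance
print(round(d_euclid, 1), round(d_pearson, 6))
```

The Euclidean distance is large because the expression levels differ, while the Pearson-based distance is essentially zero because the two profiles rise and fall together.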

Cross-validation

http://en.wikipedia.org/wiki/Cross-validation_(statistics)
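As a concrete case, leave-one-out cross-validation holds out each sample in turn, trains on the rest, and checks the prediction for the held-out sample. The sketch below pairs it with a one-nearest-neighbor rule on invented one-dimensional data; both function names and all values are hypothetical.

```python
def nearest_neighbor_label(train, query):
    """1-NN on one-dimensional data: return the label of the closest training value."""
    value, label = min(train, key=lambda item: abs(item[0] - query))
    return label

def leave_one_out_accuracy(samples):
    """Hold out each sample once, predict it from the others, and average the hits."""
    correct = 0
    for i, (value, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]
        if nearest_neighbor_label(training, value) == label:
            correct += 1
    return correct / len(samples)

# Hypothetical (expression value, class) pairs; the last AML sample sits near the ALL group.
samples = [(1.0, "ALL"), (1.2, "ALL"), (0.8, "ALL"),
           (5.0, "AML"), (5.5, "AML"), (2.9, "AML")]

accuracy = leave_one_out_accuracy(samples)
print(accuracy)
```

Because every prediction is made on a sample the classifier has not seen, the estimate is less optimistic than accuracy measured on the training data itself.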

Microarray Data Analysis

Data processing and visualization
Supervised learning
  Feature selection
  Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Gene regulatory region prediction based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases

Genetic Algorithm for Feature Selection

[Diagram: Sample (clear cell RCC, etc.) → raw measurement data → feature vector (f1, f2, f3, f4, f5) = pattern]

Why Genetic Algorithm?

Assuming 2,000 relevant genes, 20 important discriminator genes (features).

Cost of an exhaustive search for the optimal set of features?

$$C(n, k) = \frac{n!}{k!\,(n-k)!}, \qquad C(2000, 20) = \frac{2000!}{20!\,1980!} \ge 100^{20} = 10^{40}$$

If it takes one femtosecond ($10^{-15}$ second) to evaluate a set of features, it takes more than $3 \times 10^{17}$ years to find the optimal solution on the computer.
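The estimate is easy to reproduce. The sketch below follows the slide in using the lower bound (n/k)^k rather than the exact count; the year length of 365.25 days is an assumption.

```python
import math

n_genes, n_selected = 2000, 20

# Exact number of 20-gene subsets, and the slide's lower bound (n/k)^k = 100^20.
subsets = math.comb(n_genes, n_selected)
lower_bound = (n_genes // n_selected) ** n_selected

# Time to evaluate the lower-bound number of subsets at one femtosecond each.
seconds = lower_bound * 1e-15
years = seconds / (365.25 * 24 * 3600)

print(f"{subsets:.2e} subsets, at least {years:.2e} years")
```

The exact count is far larger than the bound, so the true exhaustive search time is longer still.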

Evolutionary Methods

Based on the mechanics of Darwinian evolution: the evolution of a solution is loosely based on biological evolution.

Population of competing candidate solutions: chromosomes (a set of features).

Genetic operators (mutation, recombination, etc.) generate new candidate solutions.

Selection pressure directs the search: those that do well survive (selection) to form the basis for the next set of solutions.

A Simple Evolutionary Algorithm

[Diagram: a loop connecting Selection, Genetic Operators, and Evaluation]
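The loop can be sketched as follows. This is a toy under stated assumptions, not the algorithm from the slides: the population size, mutation rate, truncation selection, and the fitness function (counting how many of five "informative" gene indices a chromosome contains) are all invented for illustration.

```python
import random

def evolve(fitness, n_genes, subset_size, pop_size=20, generations=30, seed=0):
    """Evolve gene subsets; returns the best chromosome and the best fitness per generation."""
    rng = random.Random(seed)
    population = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)   # evaluation
        history.append(fitness(ranked[0]))
        survivors = ranked[: pop_size // 2]                       # selection
        children = []
        while len(survivors) + len(children) < pop_size:          # genetic operators
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, subset_size)
            child = mother[:cut] + father[cut:]                   # one-point crossover
            if rng.random() < 0.3:                                # occasional mutation
                child[rng.randrange(subset_size)] = rng.randrange(n_genes)
            children.append(child)
        population = survivors + children
    return max(population, key=fitness), history

# Hypothetical fitness: how many of the "informative" genes 0..4 the subset contains.
informative = {0, 1, 2, 3, 4}

def toy_fitness(chromosome):
    return len(informative & set(chromosome))

best, history = evolve(toy_fitness, n_genes=50, subset_size=5)
print(sorted(best), history[0], history[-1])
```

Because the better half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next. In a real application the fitness would be something like classifier accuracy on the selected genes.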

Genetic Operators

Crossover

Parents: 10 30 50 70 and 20 40 60 80

Each parent is cut at a randomly selected crossover point (10 30 | 50 70 and 20 40 | 60 80) and the tails are exchanged.

Mutation

10 30 62 80 (one value changed at a randomly selected mutation site)

Recombination is intended to produce promising individuals.

Mutation maintains population diversity, preventing premature convergence.
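Both operators are short enough to write out. In the sketch below the crossover point and mutation site are passed in explicitly so the result is reproducible; in a real run they would be drawn at random. The function names are hypothetical, and the numbers are the ones shown on the slide.

```python
def one_point_crossover(parent_a, parent_b, point):
    """Exchange the tails of two parents after the given crossover point."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(chromosome, site, new_value):
    """Return a copy of the chromosome with one position replaced."""
    mutated = list(chromosome)
    mutated[site] = new_value
    return mutated

child_a, child_b = one_point_crossover([10, 30, 50, 70], [20, 40, 60, 80], point=2)
mutant = mutate(child_a, site=2, new_value=62)
print(child_a, child_b, mutant)
```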

Genetic Algorithm

[Figure: genetic algorithm cycle in five numbered steps. A population of chromosomes, each a short list of gene indices, is evaluated. If a chromosome is good enough, stop; if not good enough, a new population is generated and the cycle repeats.]

GA Fitness

At the core of any optimization approach is the function that measures the quality of a solution or optimization.

Called:
  Objective function
  Fitness function
  Error function
  Measure
  etc.

Encoding

The most difficult and most important part of any GA

Encode so that illegal solutions are not possible

Encode to simplify the “evolutionary” processes, e.g. reduce the size of the search space

Most GAs use a binary encoding of a solution, but other schemes are possible

Genetic Algorithm/K-Nearest Neighbor Algorithm

[Diagram: microarray database, feature selection (GA), and classifier (kNN)]

Microarray Data Analysis

Data processing and visualization
Supervised learning
  Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Gene regulatory region prediction based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…

Unsupervised learning

Supervised methods
  Can only validate or reject hypotheses
  Cannot lead to discovery of unexpected partitions

Unsupervised learning
  No prior knowledge is used
  Explore structure of data on the basis of similarities

DEFINITION OF THE CLUSTERING PROBLEM

CLUSTER ANALYSIS YIELDS DENDROGRAM

T (RESOLUTION)

BUT WHAT ABOUT THE OKAPI ?

Agglomerative Hierarchical Clustering

[Figure: five points (1 to 5) and the resulting dendrogram with leaf order 5, 2, 4, 1, 3; the vertical axis shows the distance between joined clusters.]

Initially, each point is its own cluster. At each step, merge the pair of nearest clusters.

Need to define the distance between the new cluster and the other clusters:
  Single linkage: distance between the closest pair.
  Complete linkage: distance between the farthest pair.
  Average linkage (UPGMA): average distance between all pairs; a variant uses the distance between cluster centers.
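The merge procedure and the three linkage rules can be sketched directly. This naive version recomputes all pairwise distances at every step, which is fine for a handful of points but not for thousands of genes; the points and function names are invented for illustration.

```python
import math
from itertools import combinations

def linkage_distance(cluster_a, cluster_b, points, method):
    """Distance between two clusters (lists of point indices) under a linkage rule."""
    pair_distances = [math.dist(points[i], points[j])
                      for i in cluster_a for j in cluster_b]
    if method == "single":
        return min(pair_distances)
    if method == "complete":
        return max(pair_distances)
    if method == "average":
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")

def agglomerate(points, method="average"):
    """Merge the nearest pair of clusters until one remains; return the merge history."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage_distance(
                       clusters[pair[0]], clusters[pair[1]], points, method))
        height = linkage_distance(clusters[a], clusters[b], points, method)
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Five hypothetical points: two tight pairs and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
merges = agglomerate(points, method="average")
for left, right, height in merges:
    print(left, right, round(height, 3))
```

Each tuple in the merge history corresponds to one join in the dendrogram, with the height giving the distance between the joined clusters.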

Hierarchical Clustering - Summary

Results depend on distance update method

Greedy iterative process

NOT robust against noise

No inherent measure to identify stable clusters

Average Linkage (UPGMA) – the most widely used clustering method in gene expression analysis

Cluster both genes and samples

Samples should cluster together based on experimental design.

Often a way to catch labelling errors or heterogeneity in samples.

Heat map

[Figure: heat map of breast cancer gene expression data (Nature, 2002)]

Centroid methods – K-means

Data points at $X_i$, $i = 1, \dots, N$

Centroids at $Y_\nu$, $\nu = 1, \dots, K$

Assign data point i to centroid $\nu$: $S_i = \nu$

Cost E:

$$E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \sum_{\nu=1}^{K} \delta(S_i, \nu)\,(X_i - Y_\nu)^2$$

Minimize E over $S_i$, $Y_\nu$

K-means

“Guess” K = 3

Start with random positions of centroids.

Assign each data point to the closest centroid.

Move centroids to the center of assigned points.

Iterate until minimal cost.

[Figures: cluster assignments at iterations 0, 1, 2, and 3]
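The steps above can be sketched in a few lines. The data points are invented, and the initial centroids are fixed rather than random so that the run is reproducible; the function name `k_means` and the stopping rule (stop when assignments no longer change) are assumptions.

```python
import math

def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and centroid updates until assignments stop changing."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iterations):
        # Assign each data point to the closest centroid.
        new_assignment = [min(range(len(centroids)),
                              key=lambda k: math.dist(p, centroids[k]))
                          for p in points]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each centroid to the center of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(coord) / len(members) for coord in zip(*members))
    cost = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
    return assignment, centroids, cost

# Three hypothetical groups of points and a guess of K = 3 starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0),
          (1.0, 9.0), (2.0, 9.0), (1.0, 8.0)]
initial = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

assignment, centroids, cost = k_means(points, initial)
print(assignment, round(cost, 3))
```

With a different starting guess the algorithm can settle on a different partition, which is why several restarts are often compared by their final cost.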

K-means - Summary

Fast algorithm: compute distances from data points to centroids

Result depends on initial centroids’ position

Must preset K

Fails for “non-spherical” distributions

Issues in Cluster Analysis

A lot of clustering algorithms
A lot of distance/similarity metrics
Which clustering algorithm runs faster and uses less memory?
How many clusters after all?
Are the clusters stable?
Are the clusters meaningful?

Which Clustering Method Should I Use?

What is the biological question?
Do I have a preconceived notion of how many clusters there should be?
How strict do I want to be? Split or join?
Can a gene be in multiple clusters?
Hard or soft boundaries between clusters?