future data scientists also need to be skilled in statistics, and to be able to tell stories with...
TRANSCRIPT
![Page 1: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/1.jpg)
Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people.
![Page 2: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/2.jpg)
DNA Microarray and Array Data Analysis
![Page 3: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/3.jpg)
![Page 4: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/4.jpg)
What is a DNA Microarray?
A DNA microarray is a new technology for measuring the levels of the RNA gene products of a living cell.
A microarray chip is a rectangular chip on which a grid of DNA spots is imposed. These spots form a two-dimensional array.
Each spot in the array contains millions of copies of a particular DNA strand, bonded to the chip.
Chips are made tiny so that only a small amount of RNA is needed from the experimental cells.
![Page 5: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/5.jpg)
DNA Microarray
Many applications in both basic and clinical research
  determining the role a gene plays in a pathway or disease, diagnostics, pharmacology, …
There are several main platforms for performing microarray analyses.
  cDNA arrays (generic, multiple manufacturers)
  Oligonucleotide arrays (GeneChips) (Affymetrix)
  BeadArray (BeadChip) (Illumina)
  cDNA membranes (radioactive detection)
![Page 6: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/6.jpg)
cDNA Microarray
Spot cloned cDNAs onto a glass/nylon microscope slide
  usually PCR-amplified segments of plasmids
Complementary hybridization
  -- CTAGCAGG actual gene
  -- GATCGTCC cDNA (reverse transcriptase)
  -- CUAGCAGG mRNA
Label 2 mRNA samples with 2 different colors of fluorescent dye -- control vs. experimental
Mix the two labeled mRNAs and hybridize to the chip
Make two scans, one for each color
Combine the images to calculate ratios of the amounts of each mRNA that bind to each spot
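The complementary-hybridization example on this slide can be reproduced in a few lines. This is only an illustrative sketch: the function names `transcribe` and `reverse_transcribe` are made up for this example, and the cDNA is written base by base in the same left-to-right alignment the slide uses, not reversed into 5'→3' order.

```python
# Base-pairing rules for DNA; RNA uses U in place of T.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def transcribe(gene: str) -> str:
    """mRNA carrying the same sequence as the gene strand shown (T -> U)."""
    return gene.replace("T", "U")


def reverse_transcribe(mrna: str) -> str:
    """Base-by-base DNA complement of an mRNA, kept in the slide's alignment."""
    return "".join(DNA_COMPLEMENT[base] for base in mrna.replace("U", "T"))


gene = "CTAGCAGG"
mrna = transcribe(gene)
cdna = reverse_transcribe(mrna)
print(gene, mrna, cdna)  # CTAGCAGG CUAGCAGG GATCGTCC
```

The cDNA pairs with the labeled sample at every position, which is why it can act as the probe spotted on the slide.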
![Page 7: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/7.jpg)
Spotted Microarray Process
[Figure: spotted microarray process with control (CTRL) and test (TEST) samples]
![Page 8: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/8.jpg)
cDNA Array Experiment Movie
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
![Page 9: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/9.jpg)
Affymetrix
Uses 25-base oligos synthesized in place on a chip (20 pairs of oligos for each gene)
cRNA labeled and scanned in a single “color”
  one sample per chip
Can have as many as 760,000 probes on a chip
Arrays get smaller every year (more genes)
Chips are expensive
Proprietary system: “black box” software, can only use their chips
![Page 10: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/10.jpg)
Affymetrix GeneChip® Probe Arrays
[Figure: GeneChip probe array (1.28 cm), a hybridized probe cell (24~50 µm) and an image of a hybridized probe array. Labels: oligonucleotide probe; single-stranded, fluorescently labeled cRNA target.]
Each probe cell or feature contains millions of copies of a specific oligonucleotide probe
![Page 11: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/11.jpg)
GeneChip® Human Gene 1.0 ST Array
![Page 12: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/12.jpg)
Affymetrix Genome Arrays
![Page 13: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/13.jpg)
![Page 14: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/14.jpg)
Microarray Data Analysis
Data processing and visualization
Supervised learning
  Feature selection
  Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Infer gene interactions in pathways and networks
Gene regulatory region predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…
![Page 15: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/15.jpg)
Microarrays: An Example
Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al., Science, v.286, 1999
72 examples (38 train, 34 test), about 7,000 probes
Well-studied (CAMDA-2000), good test example
[Figure: ALL and AML]
Visually similar, but genetically very different
![Page 16: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/16.jpg)
Normalization
Need to scale the red sample so that the overall intensities for each chip are equivalent
[Figure: two scatter plots, Sample 1 vs. control and Sample 2 vs. control]
What can we tell from the two plots?
![Page 17: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/17.jpg)
Normalization
To ensure the data are comparable, normalization attempts to correct for the following variables:
  Number of cells in the sample
  Total RNA isolation efficiency
  Signal measurement sensitivity
  …
Can use simple or complicated math
  Normalization by global scaling (bring each image to the same average brightness)
  Normalization by sectors
  Normalization to housekeeping genes
  …
Active research area
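Of the options listed, global scaling is the simplest to write down. The sketch below is one possible reading of "bring each image to the same average brightness": every intensity on a chip is multiplied by a single factor so that the chip mean matches a target mean. The intensities and the function name `global_scale` are hypothetical.

```python
from statistics import mean


def global_scale(chip, target_mean):
    """Rescale one chip's intensities so that their mean equals target_mean."""
    factor = target_mean / mean(chip)
    return [value * factor for value in chip]


# Hypothetical raw intensities for the same four probes on two chips.
control = [100.0, 200.0, 300.0, 400.0]
sample = [150.0, 300.0, 450.0, 600.0]  # uniformly brighter than the control

sample_scaled = global_scale(sample, mean(control))
print(sample_scaled)
```

A single factor leaves the ratios between probes on the same chip unchanged, so it can only remove chip-wide effects such as overall labeling or scanning brightness.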
![Page 18: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/18.jpg)
[Figure: SP22 vs. SP23 (AML vs ALL), scatter plot with both axes on a logarithmic scale from 1 to 100000]
![Page 19: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/19.jpg)
[Figure: SP33 vs. SP34 (AML vs ALL), scatter plot with both axes on a logarithmic scale from 1 to 100000]
![Page 20: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/20.jpg)
Microarray Data Analysis
Data processing and visualization
Supervised learning
  Feature selection
  Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Infer gene interactions in pathways and networks
Gene regulatory region predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…
![Page 21: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/21.jpg)
Feature selection

| Probe | AML1 | AML2 | AML3 | ALL1 | ALL2 | ALL3 |
|---|---|---|---|---|---|---|
| D21869_s_at | 170.7 | 55.0 | 43.7 | 5.5 | 807.9 | 1283.5 |
| D25233cds_at | 605 | 31.0 | 629.2 | 441.7 | 95.3 | 205.6 |
| D25543_at | 2148.7 | 2303.0 | 1915.5 | 49.2 | 96.3 | 89.8 |
| L03294_g_at | 241.8 | 721.5 | 77.2 | 66.1 | 107.3 | 132.5 |
| J03960_at | 774.5 | 3439.8 | 614.3 | 556 | 14.4 | 12.9 |
| M81855_at | 1087 | 1283.7 | 1372.1 | 1469 | 4611.7 | 3211.8 |
| L14936_at | 212.6 | 2848.5 | 236.2 | 260.5 | 2650.9 | 2192.2 |
| L19998_at | 367 | 3.2 | 661.7 | 629.4 | 151 | 193.9 |
| L19998_g_at | 65.2 | 56.9 | 29.6 | 434.0 | 719.4 | 565.2 |
| AB017912_at | 1813.7 | 9520.6 | 2404.3 | 3853.1 | 6039.4 | 4245.7 |
| AB017912_g_at | 385.4 | 2396.8 | 363.7 | 419.3 | 6191.9 | 5617.6 |
| U86635_g_at | 83.3 | 470.9 | 52.3 | 3272.5 | 3379.6 | 5174.6 |
| … | … | … | … | … | … | … |
![Page 22: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/22.jpg)
Feature selection

| Probe | AML1 | AML2 | AML3 | ALL1 | ALL2 | ALL3 | p-value |
|---|---|---|---|---|---|---|---|
| D21869_s_at | 170.7 | 55.0 | 43.7 | 5.5 | 807.9 | 1283.5 | 0.243 |
| D25233cds_at | 605 | 31.0 | 629.2 | 441.7 | 95.3 | 205.6 | 0.487 |
| D25543_at | 2148.7 | 2303.0 | 1915.5 | 49.2 | 96.3 | 89.8 | 0.0026 |
| L03294_g_at | 241.8 | 721.5 | 77.2 | 66.1 | 107.3 | 132.5 | 0.332 |
| J03960_at | 774.5 | 3439.8 | 614.3 | 556 | 14.4 | 12.9 | 0.260 |
| M81855_at | 1087 | 1283.7 | 1372.1 | 1469 | 4611.7 | 3211.8 | 0.178 |
| L14936_at | 212.6 | 2848.5 | 236.2 | 260.5 | 2650.9 | 2192.2 | 0.626 |
| L19998_at | 367 | 3.2 | 661.7 | 629.4 | 151 | 193.9 | 0.941 |
| L19998_g_at | 65.2 | 56.9 | 29.6 | 434.0 | 719.4 | 565.2 | 0.022 |
| AB017912_at | 1813.7 | 9520.6 | 2404.3 | 3853.1 | 6039.4 | 4245.7 | 0.963 |
| AB017912_g_at | 385.4 | 2396.8 | 363.7 | 419.3 | 6191.9 | 5617.6 | 0.236 |
| U86635_g_at | 83.3 | 470.9 | 52.3 | 3272.5 | 3379.6 | 5174.6 | 0.022 |
| … | … | … | … | … | … | … | … |
![Page 23: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/23.jpg)
Hypothesis Testing
![Page 24: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/24.jpg)
Hypothesis Testing
The null hypothesis is a hypothesis set up to be nullified in order to support an alternative hypothesis.
Hypothesis testing tests the viability of the null hypothesis against a set of experimental data.
Example: Test whether the time to respond to a tone is affected by the consumption of alcohol
  Hypothesis: µ1 - µ2 = 0
  µ1 is the mean time to respond after consuming alcohol
  µ2 is the mean time to respond otherwise
![Page 25: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/25.jpg)
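Z-test
Theorem: If the $x_i$, $i = 1, \dots, n$, are independent and each has a normal distribution with mean $\mu$ and variance $\sigma^2$, then $U = \sum_i a_i x_i$ has a normal distribution with mean $E(U) = \mu \sum_i a_i$ and variance $D(U) = \sigma^2 \sum_i a_i^2$. In particular,

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \sim N\!\left(\mu, \frac{\sigma^2}{n}\right).$$

Z test: $H: \mu = \mu_0$ ($\mu_0$ and $\sigma_0$ are known, assume $\sigma = \sigma_0$)
What would one conclude about the null hypothesis that a sample of N = 46 with a mean of 104 could reasonably have been drawn from a population with the parameters µ = 100 and σ = 8? Use α = 0.05, two-tailed.
Note: z follows a normal distribution N(0, 1)

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}} = \frac{8}{\sqrt{46}} = 1.18, \qquad z_{obt} = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{104 - 100}{1.18} = 3.39$$

$$z_{crit} = \pm 1.96$$

Since $|z_{obt}| > 1.96$, reject the null hypothesis.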
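The worked example on this slide can be checked numerically. The sketch below uses only Python's standard library (`statistics.NormalDist`); the function name `z_test` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n):
    """Return (z, two-tailed p-value) for H: mu = mu0 with known sigma."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    p_two_tailed = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


z_obt, p = z_test(sample_mean=104, mu0=100, sigma=8, n=46)
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-tailed, alpha = 0.05
reject = abs(z_obt) > z_crit
print(f"z_obt = {z_obt:.2f}, z_crit = {z_crit:.2f}, reject = {reject}")
# z_obt = 3.39, z_crit = 1.96, reject = True
```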
![Page 26: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/26.jpg)
Z-test
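Theorem: If the $x_i$, $i = 1, \dots, n$, are independent and each follows a normal distribution with mean $\mu$ and variance $\sigma^2$, then $U = \sum_i a_i x_i$ has a normal distribution with mean $E(U) = \mu \sum_i a_i$ and variance $D(U) = \sigma^2 \sum_i a_i^2$. In particular,

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \sim N\!\left(\mu, \frac{\sigma^2}{n}\right).$$

Z test: $H: \mu = \mu_0$ ($\mu_0$ and $\sigma_0$ are known, assume $\sigma = \sigma_0$)
But in practice $\sigma_0$ is often unknown.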
![Page 27: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/27.jpg)
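T-test

One-sample test: $x_1, \dots, x_n \sim N(\mu, \sigma^2)$

$$H: \mu = \mu_0, \qquad t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Two-sample test: $x_1, \dots, x_n \sim N(\mu_1, \sigma_1^2)$, $y_1, \dots, y_m \sim N(\mu_2, \sigma_2^2)$

$$H: \mu_1 = \mu_2, \qquad t = \frac{\bar{x} - \bar{y}}{s_{\bar{x}-\bar{y}}}, \qquad s_{\bar{x}-\bar{y}} = \sqrt{\frac{s_1^2}{n} + \frac{s_2^2}{m}}, \qquad s_1^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

($s_2^2$ is defined in the same way from the $y_i$.)

$s_{\bar{x}-\bar{y}}$: standard error of the difference
Assuming $\sigma_1$ and $\sigma_2$ are different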
![Page 28: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/28.jpg)
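T-test

One-sample test, under $H: \mu = \mu_0$: $t \sim t(n - 1)$

Two-sample test, under $H: \mu_1 = \mu_2$: $t \sim t(n_1 + n_2 - 2)$, where $n_1$ and $n_2$ are the sizes of the two samples (exact when the two variances are equal)

William Sealy Gosset (1876-1937)
(Guinness Brewing Company)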
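As a rough illustration of the one-sample and two-sample t statistics, the sketch below computes both with the standard library. The two-sample version does not pool the variances, and its degrees of freedom use the Welch-Satterthwaite approximation; that choice is an assumption of this sketch, not something the slides specify. The data are the D25543_at row of the feature-selection table.

```python
from math import sqrt
from statistics import mean, variance


def one_sample_t(x, mu0):
    """t = (mean(x) - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    n = len(x)
    return (mean(x) - mu0) / sqrt(variance(x) / n), n - 1


def two_sample_t(x, y):
    """t = (mean(x) - mean(y)) / sqrt(s1^2/n + s2^2/m), variances not pooled.

    Degrees of freedom follow the Welch-Satterthwaite approximation
    (an assumption of this sketch).
    """
    n, m = len(x), len(y)
    vx, vy = variance(x) / n, variance(y) / m
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, df


# Probe D25543_at: three AML and three ALL measurements.
aml = [2148.7, 2303.0, 1915.5]
all_ = [49.2, 96.3, 89.8]
t, df = two_sample_t(aml, all_)
print(f"t = {t:.2f}, df = {df:.2f}")
```

Turning t and the degrees of freedom into a p-value needs the CDF of the t distribution, which the standard library does not provide; a statistics package would normally be used for that step.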
![Page 29: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/29.jpg)
P-value
Does a particular gene have the same expression level in ALL and AML?
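| Probe | AML1 | AML2 | AML3 | ALL1 | ALL2 | ALL3 | p-value |
|---|---|---|---|---|---|---|---|
| D25543_at | 2148.7 | 2303.0 | 1915.5 | 49.2 | 96.3 | 89.8 | 0.0026 |
| L03294_g_at | 241.8 | 721.5 | 77.2 | 66.1 | 107.3 | 132.5 | 0.332 |
| … | … | … | … | … | … | … | … |

[Figure: ALL and AML]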
![Page 30: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/30.jpg)
Data processing
Feature selection
  T-test
  Based on the fold change
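Fold-change selection can be sketched as follows. Here the fold change is taken as the ratio of group means on a log2 scale, and the 2-fold cut-off is a hypothetical threshold chosen for illustration; neither detail comes from the slides. The data are the D25543_at row of the feature-selection table.

```python
from math import log2
from statistics import mean


def log2_fold_change(group_a, group_b):
    """log2 of the ratio of group means; positive when group_a is higher."""
    return log2(mean(group_a) / mean(group_b))


# Probe D25543_at: three AML and three ALL measurements.
aml = [2148.7, 2303.0, 1915.5]
all_ = [49.2, 96.3, 89.8]

lfc = log2_fold_change(aml, all_)
# Hypothetical cut-off: keep probes that change at least 2-fold either way.
selected = abs(lfc) >= 1.0
print(f"log2 fold change = {lfc:.2f}, selected = {selected}")
```

Unlike the t-test, a fold-change filter ignores the spread within each group, so the two criteria can rank probes quite differently.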
![Page 31: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/31.jpg)
Matlab ttest
[H,P] = ttest2(X,Y)
Determines whether the means of the samples in X and Y are statistically different.
H returns 0 or 1: 0 means the null hypothesis (that the means are the same) is not rejected, 1 means it is rejected.
P returns the p-value of the test.
![Page 32: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/32.jpg)
Microarray Data Analysis
Data processing and visualization
Supervised learning
  Feature selection
  Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Infer gene interactions in pathways and networks
Gene regulatory region predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…
![Page 33: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/33.jpg)
Nearest Neighbor Classification
[Figure: three scatter plots of Feature 1 vs. Feature 2. M = AML, L = ALL, plus a test sample to be classified.]
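A minimal sketch of nearest-neighbor classification in a two-feature space follows. The coordinates are hypothetical, the labels follow the figure legend ("L" = ALL, "M" = AML), and the use of k = 3 with Euclidean distance is an assumption made for the example.

```python
from collections import Counter
from math import dist


def knn_classify(train, labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature samples: "L" = ALL, "M" = AML.
train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),  # ALL group
         (6.0, 6.5), (7.0, 6.0), (6.5, 7.5)]  # AML group
labels = ["L", "L", "L", "M", "M", "M"]

print(knn_classify(train, labels, (2.0, 2.0)))  # L
print(knn_classify(train, labels, (6.0, 7.0)))  # M
```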
![Page 34: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/34.jpg)
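Distance Issues

Euclidean distance

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$$

Pearson distance, based on the correlation coefficient

$$r(\mathbf{x}, \mathbf{y}) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}}$$

(the distance is commonly taken as $1 - r$)

[Figure: expression of gene1 to gene4 (g1 to g4) at time0 to time3; vertical axis from 0 to 400]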
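The difference between the two measures is easiest to see on profiles that share a shape but differ in magnitude. In the sketch below the profiles are hypothetical, and the Pearson distance is taken as 1 minus the correlation coefficient, which is a common convention assumed here.

```python
from math import sqrt
from statistics import mean


def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for profiles with identical shape."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return 1.0 - cov / norm


# Hypothetical expression profiles over four time points.
g1 = [10.0, 20.0, 30.0, 40.0]
g2 = [100.0, 200.0, 300.0, 400.0]  # same shape as g1, ten times the magnitude
g3 = [40.0, 30.0, 20.0, 10.0]      # opposite trend, similar magnitude

print(euclidean(g1, g2), pearson_distance(g1, g2))
print(euclidean(g1, g3), pearson_distance(g1, g3))
```

Euclidean distance puts g1 closer to g3, while Pearson distance puts g1 closer to g2, so the choice of metric decides which genes count as neighbors.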
![Page 35: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/35.jpg)
Cross-validation
http://en.wikipedia.org/wiki/Cross-validation_(statistics)
![Page 36: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/36.jpg)
Microarray Data Analysis
Data processing and visualization
Supervised learning
  Feature selection
  Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Gene regulatory region predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…
![Page 37: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/37.jpg)
Genetic Algorithm for Feature Selection
[Figure: sample (clear cell RCC, etc.); raw measurement data; feature vector (f1, f2, f3, f4, f5) = pattern]
![Page 38: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/38.jpg)
Why Genetic Algorithm?
Assuming 2,000 relevant genes, 20 important discriminator genes (features).
Cost of an exhaustive search for the optimal set of features?

$$C(n, k) = \frac{n!}{k!\,(n-k)!}$$

$$C(2000, 20) = \frac{2000!}{20!\,1980!} \ge 100^{20} = 10^{40}$$

If it takes one femtosecond ($10^{-15}$ second) to evaluate a set of features, it takes more than $3 \times 10^{17}$ years to find the optimal solution on the computer.
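The arithmetic behind this estimate can be checked directly. The sketch below computes the exact binomial coefficient with `math.comb` and converts the slide's lower bound of 10^40 evaluations into years; the 365.25-day year is an assumption of the example.

```python
from math import comb

n_genes, n_features = 2000, 20
subsets = comb(n_genes, n_features)                  # exact C(2000, 20)
lower_bound = (n_genes // n_features) ** n_features  # (n/k)^k = 100^20 = 10^40

seconds_per_evaluation = 1e-15                       # one femtosecond, as on the slide
seconds_per_year = 365.25 * 24 * 3600
years_for_bound = lower_bound * seconds_per_evaluation / seconds_per_year

print(subsets >= lower_bound)
print(f"{years_for_bound:.2e} years for 10^40 evaluations")
```

Because the exact count is larger than the bound, the true exhaustive-search time is longer still.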
![Page 39: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/39.jpg)
Evolutionary Methods
Based on the mechanics of Darwinian evolution
  The evolution of a solution is loosely based on biological evolution
Population of competing candidate solutions
  Chromosomes (a set of features)
Genetic operators (mutation, recombination, etc.) generate new candidate solutions
Selection pressure directs the search
  Those that do well survive (selection) to form the basis for the next set of solutions.
![Page 40: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/40.jpg)
A Simple Evolutionary Algorithm
[Figure: cycle of Selection, Genetic Operators and Evaluation]
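The evaluation, selection and genetic-operator cycle can be sketched as a short loop. Everything specific here is an assumption for illustration: the truncation selection that keeps the fitter half, the one-point crossover, the single point mutation per child, and the toy fitness that counts how many hypothetical "informative" genes a chromosome contains.

```python
import random


def evolve(fitness, n_genes, subset_size, pop_size=30, generations=40, seed=0):
    """Minimal evolutionary loop: evaluation -> selection -> genetic operators."""
    rng = random.Random(seed)
    population = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluation and selection: keep the fitter half as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, subset_size)   # one-point crossover
            child = a[:cut] + b[cut:]
            site = rng.randrange(subset_size)     # point mutation
            child[site] = rng.randrange(n_genes)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Hypothetical fitness: number of "informative" genes in the chromosome.
informative = {3, 17, 42, 58, 99}


def toy_fitness(chromosome):
    return len(informative & set(chromosome))


best = evolve(toy_fitness, n_genes=100, subset_size=5)
print(sorted(best), toy_fitness(best))
```

This toy version allows a gene to appear twice in a chromosome; the fitness simply gives no extra credit for the duplicate. In a real feature-selection run the fitness would be something like the accuracy of a classifier trained on the selected genes.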
![Page 41: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/41.jpg)
Genetic Operators
Crossover
  Parents: 10 30 50 70 and 20 40 60 80
  Randomly selected crossover point: 10 30 | 50 70 and 20 40 | 60 80
  Offspring: 10 30 60 80 and 20 40 50 70
Mutation
  10 30 62 80 (randomly selected mutation site)
Recombination is intended to produce promising individuals.
Mutation maintains population diversity, preventing premature convergence.
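The two operators can be written as small pure functions. In this sketch the crossover point and mutation site are passed in explicitly so that the slide's example is reproducible; in a real GA both would be drawn at random. The function names are made up for the example.

```python
def crossover(parent_a, parent_b, point):
    """One-point crossover: swap the tails of two chromosomes at `point`."""
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]


def mutate(chromosome, site, new_gene):
    """Return a copy of `chromosome` with the gene at `site` replaced."""
    mutated = list(chromosome)
    mutated[site] = new_gene
    return mutated


a, b = [10, 30, 50, 70], [20, 40, 60, 80]
child_1, child_2 = crossover(a, b, point=2)
print(child_1, child_2)                       # [10, 30, 60, 80] [20, 40, 50, 70]
print(mutate(child_1, site=2, new_gene=62))   # [10, 30, 62, 80]
```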
![Page 42: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/42.jpg)
Genetic Algorithm
[Figure: genetic algorithm loop in five numbered steps over a population of candidate gene sets (for example g20 g7 g5 g2 g100). The loop repeats while the result is "not good enough" and stops once it is "good enough".]
![Page 43: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/43.jpg)
GA Fitness
At the core of any optimization approach is the function that measures the quality of a solution or optimization.
Called:
  Objective function
  Fitness function
  Error function
  measure
  etc.
![Page 44: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/44.jpg)
Encoding
The most difficult and most important part of any GA
Encode so that illegal solutions are not possible
Encode to simplify the “evolutionary” processes, e.g. reduce the size of the search space
Most GAs use a binary encoding of a solution, but other schemes are possible
![Page 45: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/45.jpg)
Genetic Algorithm/K-Nearest Neighbor Algorithm
[Figure: microarray database, feature selection (GA) and classifier (kNN)]
![Page 46: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/46.jpg)
Microarray Data Analysis
Data processing and visualization
Supervised learning
  Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Gene regulatory region predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…
![Page 47: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/47.jpg)
Unsupervised learning
Supervised methods
  Can only validate or reject hypotheses
  Cannot lead to discovery of unexpected partitions
Unsupervised learning
  No prior knowledge is used
  Explore structure of data on the basis of similarities
![Page 48: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/48.jpg)
DEFINITION OF THE CLUSTERING PROBLEM
![Page 49: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/49.jpg)
CLUSTER ANALYSIS YIELDS DENDROGRAM
T (RESOLUTION)
![Page 50: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/50.jpg)
BUT WHAT ABOUT THE OKAPI ?
![Page 51: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/51.jpg)
Agglomerative Hierarchical Clustering
[Figure: five points (1 to 5) and the resulting dendrogram with leaf order 5 2 4 1 3; vertical axis: distance between joined clusters]
Initially: each point = cluster
At each step, merge the pair of nearest clusters
Need to define the distance between the new cluster and the other clusters.
  Single Linkage: distance between closest pair.
  Complete Linkage: distance between farthest pair.
  Average Linkage (UPGMA): average distance between all pairs, or distance between cluster centers
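The merge procedure and the three linkage rules can be sketched in plain Python. The points are hypothetical, and the brute-force search over all cluster pairs is chosen for readability rather than speed; "average" here means the mean over all cross-cluster pairs.

```python
from math import dist


def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters of points under the chosen linkage."""
    pairwise = [dist(p, q) for p in cluster_a for q in cluster_b]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {method}")


def agglomerate(points, method="average"):
    """Merge the two nearest clusters until one remains; return the merge log."""
    clusters = [[p] for p in points]  # initially: each point = cluster
    merges = []
    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(
            pairs,
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        height = linkage_distance(clusters[i], clusters[j], method)
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
for left, right, height in agglomerate(points, "single"):
    print(left, "+", right, f"at distance {height:.2f}")
```

The recorded merge heights are what a dendrogram plots on its vertical axis, so rerunning with "complete" or "average" shows how the tree changes with the linkage rule.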
![Page 52: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/52.jpg)
Hierarchical Clustering - Summary
Results depend on distance update method
Greedy iterative process
NOT robust against noise
No inherent measure to identify stable clusters
Average Linkage (UPGMA) – the most widely used clustering method in gene expression analysis
![Page 53: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/53.jpg)
Cluster both genes and samples
Samples should cluster together based on experimental design
Often a way to catch labelling errors or heterogeneity in samples
![Page 54: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/54.jpg)
[Figure: heat map, breast cancer, Nature 2002]
![Page 55: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/55.jpg)
Centroid methods – K-means
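Data points at $X_i$, $i = 1, \dots, N$
Centroids at $Y_\nu$, $\nu = 1, \dots, K$
Assign data point $i$ to centroid $\nu$: $S_i = \nu$
Cost E:

$$E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \sum_{\nu=1}^{K} \delta(S_i, \nu)\,(X_i - Y_\nu)^2$$

where $\delta(S_i, \nu) = 1$ if $S_i = \nu$ and 0 otherwise.
Minimize E over $S_i$, $Y_\nu$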
![Page 56: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/56.jpg)
K-means
“Guess” K=3
![Page 57: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/57.jpg)
Start with random positions of centroids.
K-means
Iteration = 0
![Page 58: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/58.jpg)
K-means
Iteration = 1
Start with random positions of centroids.
Assign each data point to closest centroid.
![Page 59: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/59.jpg)
K-means
Iteration = 2
Start with random positions of centroids.
Assign each data point to closest centroid.
Move centroids to center of assigned points
![Page 60: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/60.jpg)
K-means
Iteration = 3
Start with random positions of centroids.
Assign each data point to closest centroid.
Move centroids to center of assigned points
Iterate till minimal cost
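The four steps above can be sketched as follows. The data are hypothetical, the initial centroids are drawn from the data points with a fixed seed, and the loop stops when the assignment no longer changes, which is one reasonable reading of "iterate till minimal cost" (it reaches a local minimum, not necessarily the global one).

```python
import random
from math import dist


def k_means(points, k, iterations=100, seed=0):
    """Alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start with random positions of centroids
    assignment = None
    for _ in range(iterations):
        # Assign each data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # nothing changed: local minimum of the cost
            break
        assignment = new_assignment
        # Move each centroid to the center of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    cost = sum(dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
    return assignment, centroids, cost


# Hypothetical data: three separated groups, so we "guess" K = 3.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
assignment, centroids, cost = k_means(points, k=3)
print(assignment, cost)
```

Running it with different seeds can give different partitions, since the outcome depends on where the centroids start.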
![Page 61: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/61.jpg)
K-means - Summary
Fast algorithm: compute distances from data points to centroids
Result depends on initial centroids’ position
Must preset K
Fails for “non-spherical” distributions
![Page 62: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/62.jpg)
Issues in Cluster Analysis
A lot of clustering algorithms
A lot of distance/similarity metrics
Which clustering algorithm runs faster and uses less memory?
How many clusters after all?
Are the clusters stable?
Are the clusters meaningful?
![Page 63: Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people](https://reader030.vdocument.in/reader030/viewer/2022012904/56649e1a5503460f94b074c2/html5/thumbnails/63.jpg)
Which Clustering Method Should I Use?
What is the biological question?
Do I have a preconceived notion of how many clusters there should be?
How strict do I want to be? Split or join?
Can a gene be in multiple clusters?
Hard or soft boundaries between clusters