
Non-linear DA and Clustering

Stat 600

Nonlinear DA

• We discussed LDA, where our discriminant boundary was linear.
• Now, let's consider scenarios where it could be non-linear.
• We will discuss:
  – QDA
  – RDA
  – MDA
• As before, all these methods aim to MINIMIZE the probability of misclassification.

QDA

• Difference from LDA: QDA allows the variance-covariance matrix for each class to be different.
• Hence the boundaries are curvilinear (quadratic) in nature.
• However, the requirements are more stringent, since we need to estimate a variance-covariance matrix for each class.
• Hence (see the R sketch after this list):
  – The inverse matrices need to exist.
  – The number of predictors must be much smaller than the number of observations within a class.
  – There must be no collinearity among the predictors.
  – If predictors are discrete, it does not work well.
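A minimal sketch of fitting QDA in R, assuming the MASS package and the built-in iris data (neither is part of the original slides):

library(MASS)                          # provides lda() and qda()
# QDA on iris: one variance-covariance matrix is estimated per species
fit.qda <- qda(Species ~ ., data = iris)
pred    <- predict(fit.qda, iris)$class
mean(pred != iris$Species)             # apparent (training) misclassification rate

Swapping qda() for lda() in the same call is an easy way to compare the linear and quadratic boundaries on the same data.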

RDA: Regularized DA

• Introduced by Friedman (1989).
• The idea is that it is a compromise between LDA and QDA: each class covariance matrix is shrunk toward the pooled covariance matrix,

  \tilde{\Sigma}_l(\lambda) = \lambda \Sigma_l + (1 - \lambda)\, \Sigma,

  so that \lambda = 1 recovers QDA and \lambda = 0 recovers LDA.
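A minimal sketch of this shrinkage step in R, assuming a user-chosen lambda and using iris purely for illustration (the klaR package offers a full rda() implementation, if that route is preferred):

# Regularized covariance for class l: a convex combination of the class
# covariance and a pooled covariance (lambda = 1 ~ QDA, lambda = 0 ~ LDA)
reg.cov <- function(S.class, S.pooled, lambda) {
  lambda * S.class + (1 - lambda) * S.pooled
}
X      <- iris[, 1:4]
S.l    <- cov(X[iris$Species == "setosa", ])   # class-specific covariance
S.pool <- cov(X)                               # crude stand-in for the pooled within-class covariance
reg.cov(S.l, S.pool, lambda = 0.5)             # the compromise covariance matrix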

MDA: Mixture Discriminant Analysis

• Extension of LDA, introduced by Hastie and Tibshirani (1996).
• Like LDA, it assumes the same variance structure for all the classes, but within each class it allows a mixture of multivariate normals to model the mean.
• The subclass-specific distributions are combined into a single class density by forming a per-class mixture.
• Suppose D_{lk} is the discriminant function for the kth subclass of the lth class; the overall discriminant for the lth class is proportional to the weighted sum of the discriminant functions of its subclasses,

  D_l(x) \propto \sum_{k=1}^{L_l} \pi_{lk} D_{lk}(x),

  where the \pi_{lk} are the subclass mixing proportions.
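A hedged sketch using the mda package (which accompanies Hastie and Tibshirani's method); the number of subclasses here is an arbitrary illustrative choice:

library(mda)                           # mixture discriminant analysis
# MDA on iris: each species is modelled as a mixture of 2 Gaussian subclasses
fit.mda <- mda(Species ~ ., data = iris, subclasses = 2)
predict(fit.mda, iris)[1:5]            # predicted classes for the first five rows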

Other topics

• Neural Networks
• Support Vector Machines
• Flexible DA

• But before we do that, let's take a quick look at unsupervised learning.

Predicting Class

• We talked about predicting class in situations where the class is known.

• Let's now consider scenarios where the classes are UNKNOWN.
• This is also called unsupervised learning.
• The idea is to predict the class structure of a data set when NOTHING is known about the classes.

What is Clustering?

• Clustering is an EXPLORATORY statistical technique used to break up a data set into smaller groups or “clusters” with the idea that objects within a cluster are similar and objects in different clusters are different.

• It uses distance measures between units within a group and across groups to decide which units fall in which group.

Data in Clustering

• Generally we have data on several variables on each individual.

• We could cluster the individuals, so that individuals with similar values on the variables are grouped together.

• We could also cluster the variables, grouping together variables that behave similarly across the individuals.

• Fundamentally an exploratory tool, clustering is nonetheless firmly embedded in many biologists’ minds as the statistical method for the analysis of data.

Why Cluster Samples?

• Clustering leads to readily interpretable figures and can be helpful for identifying patterns in time or space, especially artifacts!

• There are very few formal theories about clustering, though intuitively the idea is:
  – clusters should show internal cohesion and external isolation.
• Time-course experiments are often clustered to see if there are developmental similarities.
• Useful for visualization.
• Generally considered appropriate in typical clinical experiments.

Clustering

• How is “closeness” decided?

• For clustering we generally need two ingredients:

• Distance: the measure used to quantify how far apart two observations are (this looks at the distance between individual observations).

• Linkage: the rule for condensing each group of observations into a single representative quantity (the technique used to define the distance between groups of observations).

Clustering: preliminaries

• Distance or similarity measures (geometric distances):
  – L1 (Manhattan): d_1(x, y) = \sum_i |x_i - y_i|
  – L2 (Euclidean, ruler distance): d_2(x, y) = \left[ \sum_i (x_i - y_i)^2 \right]^{1/2} = \left[ (x - y)'(x - y) \right]^{1/2}
  – Standardized ruler distance: \left[ (z_1 - z_2)'(z_1 - z_2) \right]^{1/2}, computed on the standardized observations z_1, z_2
  – Mahalanobis distance: \left[ (x - y)' \Sigma^{-1} (x - y) \right]^{1/2}
  – Correlation distance: 1 - r, where r is the correlation coefficient.
• CAN HAVE WEIGHTED VERSIONS OF THESE.
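A short R sketch of these distance choices on a small hypothetical data matrix (observations in rows):

set.seed(1)
x <- matrix(rnorm(20), nrow = 5)          # 5 hypothetical observations, 4 variables
d.man <- dist(x, method = "manhattan")    # L1 distance
d.euc <- dist(x, method = "euclidean")    # L2 (ruler) distance
d.std <- dist(scale(x))                   # standardized ruler distance
d.cor <- as.dist(1 - cor(t(x)))           # correlation distance between observations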

Clustering: preliminaries

Linkage (each choice corresponds to a method option of hclust(), sketched below):
• Average linkage: the distance between two groups of points is the average of all pairwise distances.
• Median linkage: the distance between two groups of points is the median of all pairwise distances.
• Centroid method: the distance between two groups of points is the distance between the centroids of the two groups.
• Single linkage: the distance between two groups is the smallest of all pairwise distances.
• Complete linkage: the distance between two groups is the largest of all pairwise distances.
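These linkage rules map directly onto the method argument of hclust() in R; a brief sketch on hypothetical data:

set.seed(1)
x <- matrix(rnorm(20), nrow = 5)          # small hypothetical data set
d <- dist(x)                              # Euclidean distances between observations
hc.avg <- hclust(d, method = "average")   # average linkage
hc.med <- hclust(d, method = "median")    # median linkage (conventionally on squared distances)
hc.cen <- hclust(d, method = "centroid")  # centroid method (conventionally on squared distances)
hc.sin <- hclust(d, method = "single")    # single linkage
hc.com <- hclust(d, method = "complete")  # complete linkage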

Types of Clustering

• Hierarchical and non-hierarchical methods:

• Non-hierarchical (partitioning): start with an initial set of cluster seed points and then build clusters around those points, using one of the distance measures. If a cluster becomes too large, it can be split into smaller ones.

• Hierarchical: observed data points are grouped into clusters in a nested sequence of groups.

Non-hierarchical: Partitioning methods

Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.

Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize within cluster sums of squares.

Issues: one needs to know the seeds and the number of clusters to start off with. If one lets the computer pick the seeds, the order of entry of the data may make a difference (see the k-means sketch below).
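A minimal k-means sketch in R, a standard partitioning method; k and the data are illustrative, and nstart is one way to reduce the dependence on the starting seeds:

set.seed(2)
x  <- matrix(rnorm(100), ncol = 2)        # 50 hypothetical observations, 2 variables
km <- kmeans(x, centers = 3, nstart = 25) # k = 3 clusters; nstart tries several random seed sets
km$cluster                                # cluster membership for each observation
km$tot.withinss                           # total within-cluster sum of squares (the criterion minimized)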

Hierarchical methods

• Hierarchical clustering methods produce a tree, or dendrogram (often built with single-link clustering).

• They avoid specifying how many clusters are appropriate by providing a partition for each k, obtained by cutting the tree at some level.

• The tree can be built in two distinct ways (see the sketch after this list):
  – bottom-up: agglomerative clustering
  – top-down: divisive clustering
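A hedged sketch of both directions using the cluster package (loaded later in these slides): agnes() is agglomerative and diana() is divisive; the data here are hypothetical.

library(cluster)
set.seed(3)
x  <- matrix(rnorm(40), ncol = 2)     # 20 hypothetical observations, 2 variables
ag <- agnes(x, method = "average")    # bottom-up (agglomerative), average linkage
dv <- diana(x)                        # top-down (divisive)
par(mfrow = c(1, 2))
plot(ag, which.plots = 2)             # dendrogram from agnes
plot(dv, which.plots = 2)             # dendrogram from diana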

Partitioning vs. Hierarchical

• Partitioning:
  Advantages
  – Optimal for certain criteria.
  – Genes are automatically assigned to clusters.
  Disadvantages
  – Need the initial k.
  – Often requires long computation times.
  – All genes are forced into a cluster.

• Hierarchical:
  Advantages
  – Faster computation.
  – Visual.
  Disadvantages
  – Unrelated genes are eventually joined.
  – Rigid: cannot correct later for erroneous decisions made earlier.
  – Hard to define clusters.

Bottom-up: Agglomerative Method

• This is the most commonly used method and produces the famous tree diagram.
• Start with n clusters (one per observation).
• At each step, merge the two closest clusters, using a measure of between-cluster dissimilarity that reflects the shape of the clusters.
• The distance between clusters is defined by the linkage method used (e.g., with complete linkage, the distance is the distance between the furthest pair of points in the two clusters).

Example

• Suppose we have 5 obs with a distance matrix given by:

      1     2     3     4     5
1     –   .31   .43   .47   .23
2           –   .48   .47   .33
3                 –   .37   .46
4                       –   .45
5                             –

Example

• First we have 5 clusters: C0 = {[1],[2],[3],[4],[5]}.
• Since 1 and 5 have the smallest distance (.23), they are combined, giving C1 = {[1,5],[2],[3],[4]}.
• Then C2 = {[1,5],[2],[3,4]},
• then C3 = {[1,5,2],[3,4]},
• and finally C4 = {[1,5,2,3,4]} (see the R sketch below).
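A sketch that enters this distance matrix into R and clusters it with hclust(); note that the exact merge sequence shown in the dendrogram depends on the linkage chosen:

# Fill the lower triangle of the example distance matrix (column by column)
m <- matrix(0, 5, 5)
m[lower.tri(m)] <- c(.31, .43, .47, .23,   # d(1,2), d(1,3), d(1,4), d(1,5)
                     .48, .47, .33,        # d(2,3), d(2,4), d(2,5)
                     .37, .46,             # d(3,4), d(3,5)
                     .45)                  # d(4,5)
d  <- as.dist(m)                           # convert to a dist object
hc <- hclust(d, method = "single")         # single linkage; other linkages can be tried
plot(hc)                                   # dendrogram of the five observations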

Dendrograms

• The dendrogram should be interpreted with care: remember that each branch of the dendrogram is really like a mobile and can rotate without altering the mathematical structure of the tree.

• Neighboring nodes are “close” ONLY if they lie on the same branch.

• It has been proposed that one should slice the tree and look at the clusters produced therein. However, WHERE to cut the tree is subjective, and there is no consensus about this (a cutree() sketch follows).

• Issue: mistakes made early have no way of being corrected later in this approach.
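A small sketch of slicing a tree with cutree() in R, using the built-in USArrests data purely for illustration; the choice of k or of the cut height remains the subjective step:

hc <- hclust(dist(USArrests))    # built-in data, Euclidean distance, complete linkage
cutree(hc, k = 4)                # cut the tree into 4 clusters
cutree(hc, h = 100)              # or cut at a chosen height instead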

Some remarks on clustering - I

• Simplistically, clustering cannot fail. That is, every clustering method will return clusters, whether the data are organized in clusters or not.

• Clustering helps to group / order information and is a visualization tool for learning about the data. However, clustering results do not provide any kind of “proof” of anything.

Some remarks on clustering - II

• One of the more paradoxical aspects of clustering is that in biology it gets used even when class labels are available, instead of a discrimination method.

• The idea is: it is somehow seen as less “biased” to demonstrate the ability of the data to produce the class differences without using class labels.

• When the inferred clusters largely coincide with the known classes, this is thought to “validate” the class labels.

• The illogicality and inefficiency of this process does not seem to have become widely appreciated. One sees different “classifiers” (e.g. different gene sets) compared w.r.t their ability to separate known classes, simply by inspecting the clustering they produce, rather than by building classifiers.

library(stats)
library(cluster)
my.data = read.table("cluster.csv", header = TRUE, sep = ",")
# clustering using correlation distance, complete linkage
clust.cor = hclust(as.dist(1 - cor(my.data)), method = "complete")
# clustering using Euclidean distance, average linkage
clust.euc = hclust(dist(t(my.data)), method = "average")
# clustering using Manhattan distance, average linkage
clust.man = hclust(dist(t(my.data), method = "manhattan"), method = "average")
par(mfrow = c(1, 3))
plclust(clust.cor)
plclust(clust.euc)
plclust(clust.man)

Dendrograms from R

[Figure: three dendrograms of the samples c1–c10, one per panel: hclust(*, "complete") on as.dist(1 - cor(my.data)); hclust(*, "average") on dist(t(my.data)); hclust(*, "average") on dist(t(my.data), method = "manhattan"). Height is on the vertical axis of each panel.]