
Machine Learning: Algorithms and Applications
Floriano Zini
Free University of Bozen-Bolzano, Faculty of Computer Science
Academic Year 2011-2012
Lecture 11: 21 May 2012

Unsupervised Learning (cont…)

Slides courtesy of Bing Liu: www.cs.uic.edu/~liub/WebMiningBook.html


Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary

Mixed attributes
- The distance functions we have seen are for data with all numeric attributes, all nominal attributes, etc.
- In many practical cases, data has different types of attributes, drawn from the following six:
  - interval-scaled
  - ratio-scaled
  - symmetric binary
  - asymmetric binary
  - nominal
  - ordinal
- Clustering a data set involving mixed attributes is a challenging problem


Convert to a single type
- One common way of dealing with mixed attributes is to:
  1. Choose a dominant attribute type
  2. Convert the other types to this type
- E.g., if most attributes in a data set are interval-scaled:
  - we convert ordinal and ratio-scaled attributes to interval-scaled attributes
  - it is also appropriate to treat symmetric binary attributes as interval-scaled attributes

Convert to a single type (cont …)
- It does not make much sense to convert a nominal attribute or an asymmetric binary attribute to an interval-scaled attribute
  - but it is frequently done in practice, by assigning numbers to the values according to some hidden ordering, e.g., the prices of the fruits
- Alternatively, a nominal attribute can be converted to a set of (symmetric) binary attributes, which are then treated as numeric attributes, as in the sketch below
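As a minimal illustration of this conversion (the attribute and its values are hypothetical, not from the slides):

```python
def nominal_to_binary(values):
    """Expand a nominal attribute into a set of symmetric binary (0/1)
    attributes, one per category, which can then be treated as numeric."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Hypothetical example: a 'fruit' attribute with three categories
categories, encoded = nominal_to_binary(["apple", "banana", "apple", "cherry"])
print(categories)  # ['apple', 'banana', 'cherry']
print(encoded)     # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```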


Combining individual distances
- This approach computes individual attribute distances and then combines them
- A combination formula, proposed by Gower, is

  dist(\mathbf{x}_i, \mathbf{x}_j) = \frac{\sum_{f=1}^{r} \delta_{ij}^{f} d_{ij}^{f}}{\sum_{f=1}^{r} \delta_{ij}^{f}}    (4)

  - The distance dist(\mathbf{x}_i, \mathbf{x}_j) is between 0 and 1
  - r is the number of attributes
  - \delta_{ij}^{f} indicates whether attribute f contributes to the distance:

    \delta_{ij}^{f} =
    \begin{cases}
      0 & \text{if } x_{if} \text{ or } x_{jf} \text{ is missing} \\
      0 & \text{if attribute } f \text{ is asymmetric and } x_{if} = x_{jf} = 0 \\
      1 & \text{otherwise}
    \end{cases}

  - d_{ij}^{f} is the distance contributed by attribute f, in the range [0, 1]

Combining individual distances (cont …)
- If f is a binary or nominal attribute:

  d_{ij}^{f} =
  \begin{cases}
    1 & \text{if } x_{if} \neq x_{jf} \\
    0 & \text{otherwise}
  \end{cases}

  - Distance (4) then reduces to:
    - equation (3) of lecture 10 if all attributes are nominal
    - the simple matching distance (1) of lecture 10 if all attributes are symmetric binary
    - the Jaccard distance (2) of lecture 10 if all attributes are asymmetric
- If f is interval-scaled:

  d_{ij}^{f} = \frac{|x_{if} - x_{jf}|}{R_f}, \qquad R_f = \max(f) - \min(f)

  - R_f is the value range of f
  - If all attributes are interval-scaled, distance (4) reduces to the Manhattan distance, assuming that all attribute values are standardized
- Ordinal and ratio-scaled attributes are converted to interval-scaled attributes and handled in the same way (see the sketch below)
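A minimal Python sketch of distance (4) under these definitions; the per-attribute type tags and the use of None for missing values are my own representation, not from the slides:

```python
def gower_distance(xi, xj, attr_types, attr_ranges):
    """Distance (4): a normalized mix of per-attribute distances.

    attr_types[f] is one of 'interval', 'nominal', 'symmetric', 'asymmetric';
    attr_ranges[f] is R_f = max(f) - min(f) for interval-scaled attributes
    (None otherwise); missing values are represented as None.
    """
    num = den = 0.0
    for f, ftype in enumerate(attr_types):
        a, b = xi[f], xj[f]
        # delta_ij^f = 0: skip missing values and asymmetric 0/0 matches
        if a is None or b is None:
            continue
        if ftype == "asymmetric" and a == 0 and b == 0:
            continue
        if ftype == "interval":
            d = abs(a - b) / attr_ranges[f]  # d_ij^f = |x_if - x_jf| / R_f
        else:                                # binary or nominal attributes:
            d = 0.0 if a == b else 1.0       # simple matching
        num += d                             # sum of delta_ij^f * d_ij^f
        den += 1.0                           # sum of delta_ij^f
    return num / den if den else 0.0

# Hypothetical mixed record: (age, colour, purchased-flag)
types, ranges = ["interval", "nominal", "asymmetric"], [10.0, None, None]
print(gower_distance((3.0, "red", 1), (7.0, "blue", 0), types, ranges))
# (0.4 + 1 + 1) / 3 = 0.8
```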


Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary

How to choose a clustering algorithm
- Clustering research has a long history
  - A vast collection of algorithms is available
  - We only introduced several main algorithms
- Choosing the "best" algorithm is challenging
  - Every algorithm has limitations and works well with certain data distributions
  - It is very hard, if not impossible, to know what distribution the application data follow
    - The data may not fully follow any "ideal" structure or distribution required by the algorithms
  - One also needs to decide how to standardize the data, how to choose a suitable distance function, and how to select other parameter values


How to choose a clustering algorithm (cont …)
- Due to these complexities, the common practice is to:
  1. run several algorithms, using different distance functions and parameter settings
  2. carefully analyze and compare the results
- The interpretation of the results must be based on:
  - insight into the meaning of the original data
  - knowledge of the algorithms used
- Clustering is highly application dependent and, to a certain extent, subjective (personal preferences)

Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary


Cluster Evaluation: hard problem
- The quality of a clustering is very hard to evaluate because we do not know the correct clusters
- Some methods are used:
  - User inspection
    - A panel of experts inspects the resulting clusters and scores them
      - study centroids and spreads
      - examine rules (e.g., from a decision tree) that describe the clusters
      - for text documents, one can inspect by reading
    - The final score is the average of the individual scores
    - Manual inspection is labor intensive and time consuming

Cluster evaluation: ground truth
- We use some labeled data (as for classification)
  - Assumption: each class is a cluster
- Let the classes in the data D be C = (c_1, c_2, …, c_k)
  - The clustering method produces k clusters, which divide D into k disjoint subsets D_1, D_2, …, D_k
- After clustering, a confusion matrix is constructed (see the sketch below)
  - From the matrix, we compute various measures: entropy, purity, precision, recall, and F-score
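A small sketch of this construction, assuming the clustering and the ground truth are given as parallel label sequences (the representation and names are illustrative):

```python
from collections import Counter, defaultdict

def confusion_counts(cluster_ids, class_labels):
    """counts[i][c] = number of points of class c that ended up in cluster D_i."""
    counts = defaultdict(Counter)
    for i, c in zip(cluster_ids, class_labels):
        counts[i][c] += 1
    return counts

# Hypothetical example: 6 points, 2 clusters, classes 'a' and 'b'
counts = confusion_counts([1, 1, 1, 2, 2, 2], ["a", "a", "b", "b", "b", "a"])
print(dict(counts))  # {1: Counter({'a': 2, 'b': 1}), 2: Counter({'b': 2, 'a': 1})}
```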


Evaluation measures: Entropy
- For each cluster D_i, we can measure the entropy as

  entropy(D_i) = -\sum_{j=1}^{k} \Pr_i(c_j) \log_2 \Pr_i(c_j)

  - \Pr_i(c_j): proportion of class c_j in cluster D_i
- The entropy of the whole clustering is

  entropy_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|} \, entropy(D_i)

  - |D_i| / |D| is the weight of cluster D_i, proportional to its size (see the sketch below)
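A minimal sketch of the two formulas, reusing the confusion counts sketched above (the helper names are mine):

```python
import math

def cluster_entropy(counter):
    """entropy(D_i) = -sum_j Pr_i(c_j) * log2 Pr_i(c_j)."""
    n = sum(counter.values())
    return -sum((m / n) * math.log2(m / n) for m in counter.values() if m > 0)

def total_entropy(counts, n_total):
    """entropy_total(D) = sum_i |D_i|/|D| * entropy(D_i)."""
    return sum((sum(c.values()) / n_total) * cluster_entropy(c)
               for c in counts.values())
```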

Evaluation measures: purity
- Purity measures the extent to which a cluster contains only one class of data:

  purity(D_i) = \max_j \Pr_i(c_j)

- The purity of the whole clustering is

  purity_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|} \, purity(D_i)

  - |D_i| / |D| is the weight of cluster D_i, proportional to its size
- Precision, recall, and F-measure can be computed as well, based on the class that is most frequent in the cluster (see the sketch below)
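And the purity counterparts, in the same illustrative style:

```python
def cluster_purity(counter):
    """purity(D_i) = max_j Pr_i(c_j): share of the dominant class."""
    return max(counter.values()) / sum(counter.values())

def total_purity(counts, n_total):
    """purity_total(D) = sum_i |D_i|/|D| * purity(D_i)."""
    return sum((sum(c.values()) / n_total) * cluster_purity(c)
               for c in counts.values())
```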

Page 9: Machine Learning: Algorithms and Applicationszini/ML/slides/ml_2012_lecture_11.pdf · Commonly used to compare different clustering algorithms ! A real-life data set for clustering

21/05/12  

9  

An example
- We can use the total entropy or purity to compare:
  - different clustering results from the same algorithm
  - different algorithms
- Precision, recall, and F-measure can be computed as well for each cluster
  - The precision of Science in cluster 1 is 0.89, the recall is 0.83, and the F-measure is thus 0.86 (the arithmetic is spelled out below)
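To make the arithmetic explicit, the F-measure here is the harmonic mean of precision and recall:

  F = \frac{2PR}{P + R} = \frac{2 \cdot 0.89 \cdot 0.83}{0.89 + 0.83} \approx 0.86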

A remark about ground truth evaluation
- Commonly used to compare different clustering algorithms
- A real-life data set for clustering has no class labels
  - Thus, although an algorithm may perform very well on some labeled data sets, there is no guarantee that it will perform well on the actual application data at hand
- The fact that it performs well on some labeled data sets does give us some confidence in the quality of the algorithm
- This evaluation method is said to be based on external data or information


Evaluation based on internal information
- Intra-cluster cohesion (compactness):
  - Cohesion measures how near the data points in a cluster are to the cluster centroid
  - The sum of squared error (SSE) is a commonly used measure (see the sketch below)
- Inter-cluster separation (isolation):
  - Separation means that different cluster centroids should be far away from one another
- In most applications, expert judgments are still the key
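A minimal sketch of both quantities, assuming clusters are lists of numeric vectors with known centroids (all names are illustrative):

```python
import math

def sse(cluster, centroid):
    """Cohesion: sum of squared distances from each point to its centroid."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in cluster)

def min_separation(centroids):
    """Separation: the smallest pairwise distance between cluster centroids."""
    return min(math.dist(a, b)
               for i, a in enumerate(centroids) for b in centroids[i + 1:])

# Hypothetical 2-D example
print(sse([(1.0, 2.0), (2.0, 1.0)], (1.5, 1.5)))             # 1.0
print(min_separation([(0.0, 0.0), (3.0, 4.0), (9.0, 0.0)]))  # 5.0
```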

Indirect evaluation
- In some applications, clustering is not the primary task, but is used to help perform another task
- We can use the performance on the primary task to compare clustering methods
- For instance, in an application the primary task is to provide recommendations on book purchasing to online shoppers
  - If we can cluster shoppers according to their features, we might be able to provide better recommendations
  - We can evaluate different clustering algorithms based on how well they help with the recommendation task
  - Here, we assume that the recommendation quality can be reliably evaluated


Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary

Summary
- Clustering has a long history and is still an active research area
  - There are a huge number of clustering algorithms
  - More are still coming every year
- We only introduced several main algorithms; there are many others, e.g.:
  - density-based algorithms, sub-space clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc.
- Clustering is hard to evaluate, but very useful in practice
  - This partially explains why a large number of clustering algorithms are still being devised every year
- Clustering is highly application dependent and to some extent subjective


Reinforcement Learning

These slides are adapted from slides by Tom Mitchell, as modified by Liviu Ciortuz

Introduction
- Supervised learning is the simplest and most studied type of learning
- How can an agent learn behaviors when it doesn't have a teacher to tell it how to perform?
  - The agent has a task to perform
  - It takes some actions in the world
  - At some later point, it gets feedback telling it how well it did on performing the task
  - The agent performs the same task over and over again
- This problem is called reinforcement learning:
  - The agent gets positive reward for tasks done well
  - The agent gets negative reward for tasks done poorly


Introduction (cont…)
- The goal is to get the agent to act in the world so as to maximize its rewards
- The agent has to figure out what it did that made it get the reward/punishment
  - This is known as the credit assignment problem
- Reinforcement learning can be used to train computers to do many tasks, such as:
  - playing board games
  - job shop scheduling
  - controlling robots
  - flight/taxi scheduling
  - …

Overview
- Task: control learning
  - Make an autonomous agent (robot) perform actions, observe the consequences, and learn a control strategy
- The Q-learning algorithm
  - Acquires optimal control strategies from delayed rewards, even when the agent has no prior knowledge of the effects of its actions on the environment
- Reinforcement learning is related to dynamic programming, which is used to solve optimization problems
  - While DP assumes that the agent/program knows the effects (and rewards) of all its actions, in RL the agent has to experiment in the real world


Reinforcement Learning Problem
- Example: playing Backgammon (TD-Gammon [Tesauro, 1995]); immediate reward:
  - +100 if win
  - -100 if lose
  - 0 otherwise
- Target function to learn: \pi : S \to A
- Goal: maximize the discounted return (sketched below)

  r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots, \qquad 0 \le \gamma < 1
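A one-line sketch of this objective (the reward sequence and the value of γ are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """r_0 + gamma*r_1 + gamma^2*r_2 + ... with 0 <= gamma < 1."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Backgammon-style reward stream: nothing until a +100 win at step 2
print(discounted_return([0, 0, 100], gamma=0.9))  # 0.9**2 * 100 = 81.0
```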

Control learning characteristics


Learning Sequential Control Strategies Using Markov Decision Processes

Agent’s Learning Task