
Page 1:

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 1: REVISION LECTURE

Gabriela Ochoa, Nadarajen Veerapen, Fabio Daolio

Page 2:

EXAM

• Date: Thursday 15 December, 14:00 – 15:30 (Room: 2A13); 1.5 hour exam
• Attempt BOTH questions.
  - Q1: Search (25 Marks)
  - Q2: Machine Learning (25 Marks)
• The distribution of marks among the parts of each question is indicated.

Page 3:

SOLVING PROBLEMS BY SEARCHING

• Problem-solving agents decide what to do by finding sequences of actions that lead to desirable states.
• What is a problem and what is a solution?
  - Problem: a goal and a set of means to achieve it
  - Solution: a sequence of actions that achieves that goal
• Given a precise definition of a problem, it is possible to construct a search process for finding solutions.

Page 4:

EXAMPLE: ROMANIA
[Figure: road map of Romania (Google Maps)]

Page 5:

PROBLEM FORMULATION
More formally, a problem is defined by these main components (a small code sketch follows below):
1. Initial state where the agent starts: e.g., "at Arad"
2. Actions available to the agent
   - e.g., Arad → Zerind, Arad → Sibiu, etc.
3. Goal test: determines whether a given state is a goal state
   - explicit, e.g., x = "at Bucharest"
   - implicit, e.g., Checkmate(x)
4. Path cost: an additive function that assigns a numeric cost to each path; it reflects the agent's performance measure
   - e.g., sum of distances, number of actions executed, etc.
   - c(x, a, y) is the step cost, assumed to be ≥ 0
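These components map directly onto a small data structure. A minimal Python sketch of the Romania route-finding problem (a partial map with illustrative distances; the helper names are ours, not the lecture's):

# Partial Romania road map: available actions and their step costs (illustrative distances).
ROADS = {
    "Arad":           {"Zerind": 75, "Sibiu": 140, "Timisoara": 118},
    "Sibiu":          {"Arad": 140, "Fagaras": 99, "Rimnicu Vilcea": 80},
    "Fagaras":        {"Sibiu": 99, "Bucharest": 211},
    "Rimnicu Vilcea": {"Sibiu": 80, "Pitesti": 97},
    "Pitesti":        {"Rimnicu Vilcea": 97, "Bucharest": 101},
}

INITIAL_STATE = "Arad"

def actions(state):
    """Actions available in a state: here, the neighbouring cities."""
    return list(ROADS.get(state, {}))

def goal_test(state):
    """Explicit goal test: are we at Bucharest?"""
    return state == "Bucharest"

def step_cost(state, action):
    """c(x, a, y): the road distance, always >= 0."""
    return ROADS[state][action]

def path_cost(path):
    """Additive path cost: the sum of step costs along a sequence of cities."""
    return sum(step_cost(a, b) for a, b in zip(path, path[1:]))

print(path_cost(["Arad", "Sibiu", "Fagaras", "Bucharest"]))  # 140 + 99 + 211 = 450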

Page 6:

SEARCH ALGORITHMS

• Uninformed search strategies can find solutions to problems by systematically generating new states and testing them against the goal (e.g., BFS and DFS).
• Informed search strategies use some problem-specific knowledge.
• That knowledge is given by an evaluation function that returns a number describing the desirability (or lack thereof) of expanding a node. Examples: best-first search, greedy search, A* (a sketch follows below).
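The informed strategies differ only in the evaluation function used to order the frontier. A minimal sketch of generic best-first search over the ROADS map above, keeping the frontier in a priority queue (the structure and names are illustrative assumptions):

import heapq

def best_first_search(initial, actions, step_cost, goal_test, f):
    """Expand the node with the lowest value of the evaluation function f(g, state)."""
    frontier = [(f(0, initial), 0, initial, [initial])]   # (priority, cost so far, state, path)
    explored = set()
    while frontier:
        _, g, state, path = heapq.heappop(frontier)
        if goal_test(state):
            return path, g
        if state in explored:
            continue
        explored.add(state)
        for successor in actions(state):
            g2 = g + step_cost(state, successor)
            heapq.heappush(frontier, (f(g2, successor), g2, successor, path + [successor]))
    return None, float("inf")

# Greedy search orders by f = h(state); A* orders by f = g + h(state),
# assuming some heuristic estimate h (e.g. straight-line distance to the goal).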

Page 7:

Shaded nodes: expanded nodes

Outlined nodes: generated but not expanded


Page 8:

BREADTH-FIRST SEARCH

• Expand shallowest unexpanded node
• Implementation:
  - frontier is a FIFO queue, i.e., new successors go at end (see the sketch below)
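A minimal graph-search sketch of BFS with a FIFO frontier (reusing the ROADS helpers assumed above):

from collections import deque

def breadth_first_search(initial, actions, goal_test):
    """Expand the shallowest unexpanded node; the frontier is a FIFO queue of paths."""
    if goal_test(initial):
        return [initial]
    frontier = deque([[initial]])      # new successors go at the end
    explored = {initial}
    while frontier:
        path = frontier.popleft()      # shallowest path first
        for successor in actions(path[-1]):
            if successor in explored:
                continue
            explored.add(successor)
            new_path = path + [successor]
            if goal_test(successor):
                return new_path
            frontier.append(new_path)
    return None

# e.g. breadth_first_search("Arad", actions, goal_test) returns the path with the fewest actions.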

Pages 9–11:

BREADTH-FIRST SEARCH (continued)
[Figures: successive steps of the breadth-first expansion on the example tree; the bullet text on these slides repeats Page 8.]

Page 12:

OPTIMISATION PROBLEMS ARE EVERYWHERE!

• Logistics, transportation, supply chain management
• Manufacturing, production lines
• Timetabling
• Cutting & packing
• Computer networks and telecommunications
• Software (search-based software engineering, SBSE)

Page 13:

HILL-CLIMBING SEARCH
Like climbing a mountain in thick fog with amnesia.

• Best improvement (gradient descent, greedy hill-climbing): choose the maximally improving neighbour.
• First improvement: choose the first improving move found.
• Local optimum: no other solution in the neighbourhood has better fitness.
(A sketch of both pivoting rules follows below.)
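A minimal sketch of the two rules, assuming a fitness function to maximise and a neighbours() function that returns a list of candidate solutions (both names are illustrative):

import random

def hill_climbing(solution, neighbours, fitness, best_improvement=True, max_iters=10_000):
    """Keep moving to an improving neighbour until a local optimum is reached."""
    for _ in range(max_iters):
        current_fit = fitness(solution)
        candidates = neighbours(solution)
        if best_improvement:
            # Best improvement: evaluate the whole neighbourhood, take the maximally improving move.
            best = max(candidates, key=fitness, default=None)
            if best is None or fitness(best) <= current_fit:
                return solution                    # local optimum: no better neighbour
            solution = best
        else:
            # First improvement: scan the neighbourhood and take the first improving move found.
            random.shuffle(candidates)
            for candidate in candidates:
                if fitness(candidate) > current_fit:
                    solution = candidate
                    break
            else:
                return solution                    # no improving move found: local optimum
    return solution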

Page 14:

HILL-CLIMBING SEARCH
Problem: depending on the initial state, it can get stuck in local maxima.

Page 15:

ITERATED LOCAL SEARCH

Procedure Iterated Local Search (ILS)
    s = initialise()
    s = hill_climbing(s)
    while not termination_criterion():
        r = s                        # remember the current local optimum
        s = perturbation(s)          # diversification: jump out of the current basin
        s = hill_climbing(s)         # intensification: climb to a new local optimum
        if fitness(s) < fitness(r):  # acceptance criterion: keep the better of the two
            s = r
    return s

• Key idea: use two stages
  - local search for reaching local optima (intensification)
  - a perturbation stage for escaping local optima (diversification)
• Acceptance criterion: controls the balance between diversification and intensification.

Page 16:

Artificial Intelligence (CSC9YE) Revision - Machine Learning - Decision Trees

Fabio Daolio

Page 17:

Definition, from (T. Mitchell 1997):

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Page 18:

Learning Paradigms: what kind of experience, what kind of tasks?

• Supervised Learning: the program is presented with a series of input-output examples and learns a function that maps inputs to outputs.
  - regression
  - classification
• Unsupervised Learning: the program is presented with a series of inputs and learns how they are organised.
  - clustering (or segmentation)
  - dimensionality reduction
• Reinforcement Learning: the program learns to determine the ideal behaviour based on feedback from the environment, rewards or punishments.
  - game playing
  - on-line control

Page 19:

Supervised Learning setting: outcome measurements and predictor measurements are available.

Data: a list of observations of the form L = {<X, y>}

X: n × p feature matrix / design matrix, with rows <x_i1, x_i2, ..., x_ip> for i = 1, ..., n
  - n samples / examples / instances
  - p features / predictors / covariates

y: n × 1 target vector / labels (y_1, ..., y_n)
  - regression: continuous values
  - classification: finite set of types

Problem: learn y = f(X). (A small illustration follows below.)
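In code this is simply a feature matrix paired with a target vector; a minimal illustrative sketch with made-up numbers:

import numpy as np

# n = 4 samples, p = 3 features: the design matrix X and the target vector y.
X = np.array([[0.2, 1.5, 3.0],
              [0.7, 0.3, 2.1],
              [1.1, 2.2, 0.4],
              [0.9, 1.8, 1.7]])      # shape (n, p)
y = np.array([0, 1, 1, 0])           # shape (n,): class labels for a classification task
# For regression, y would instead hold continuous values, e.g. np.array([3.2, 1.1, 0.7, 2.5]).
print(X.shape, y.shape)              # (4, 3) (4,)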

Page 20:

A Binary Classification Task
Features <x1, x2> ∈ R², labels y ∈ {class1, class2}, n = 30.
How to automatically find a mapping f from (x1, x2) to y?
[Figure: scatter plot of the 30 points in the (x1, x2) unit square, coloured by class.]

Page 21:

Base model: predict the majority class (minimise the misclassification error)
[Figure: the same scatter plot; a single root node predicts class1, covering 22 class1 and 8 class2 points.]

Page 22:

Divide and Conquer: recursively partition and assign a base model to each partition
[Figure: the scatter plot split at x2 = 0.43 and x2 = 0.77, with the corresponding tree:]
  - root (22 class1, 8 class2): split on x2 < 0.43
    - yes: leaf class1 (13, 0)
    - no: node (9, 8), split on x2 >= 0.77
      - yes: leaf class1 (7, 0)
      - no: leaf class2 (2, 8)

Page 23:

Divide and Conquer (continued)
[Figure: the same partition as on the previous slide, shown on the scatter plot.]

Page 24:

Divide and Conquer: recursively partition and assign a base model to each partition
[Figure: a further split at x1 >= 0.7 inside the middle region, with the corresponding tree:]
  - root (22 class1, 8 class2): split on x2 < 0.43
    - yes: leaf class1 (13, 0)
    - no: node (9, 8), split on x2 >= 0.77
      - yes: leaf class1 (7, 0)
      - no: node (2, 8), split on x1 >= 0.7
        - yes: leaf class1 (2, 0)
        - no: leaf class2 (0, 8)

Page 25:

Decision Tree: recursive binary splitting (things to notice)

• the target y is approximated by a piecewise constant function
• the feature space X is partitioned into disjoint regions
• the goal is to find partitions that minimise the prediction error
• it is computationally infeasible to consider all possible partitions
• recursive binary splitting is a top-down, greedy procedure:
  - splits are defined by a split variable and a split point
  - at any step, all possible splits in the data are tested
  - the split that yields the most "pure" nodes is chosen
• the splitting could continue until all nodes are "pure"...

Page 26:

Tree Building Algorithm (code from G. Louppe 2014)

function BuildDecisionTree(L)
    Create node t
    if the stopping criterion is met for t then
        Assign a model ŷ_t to node t
    else
        Find the split on L that maximises the impurity decrease:
            s* = argmax_s [ i(t) − p_L · i(t_s^L) − p_R · i(t_s^R) ]
        Partition L into L_tL ∪ L_tR according to s*
        t_L = BuildDecisionTree(L_tL)
        t_R = BuildDecisionTree(L_tR)
    end if
    return t
end function
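In practice a library implementation is normally used rather than hand-rolled recursion; a minimal sketch with scikit-learn (the library and the toy data are assumptions, not part of the lecture):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy two-feature, two-class data standing in for the lecture's example.
X, y = make_classification(n_samples=30, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Greedy recursive binary splitting with Gini impurity, grown to a small depth.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["x1", "x2"]))  # the learned splits and leaf classes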

Page 27:

Measuring Node Impurity (for a binary classification task; figure from Hastie et al. 2009)

[Figure 9.3 from Hastie et al. (2009): node impurity measures for two-class classification, as a function of the proportion p in class 2; cross-entropy has been scaled to pass through (0.5, 0.5). All three measures are similar, but cross-entropy and the Gini index are differentiable, hence more amenable to numerical optimisation, and more sensitive to changes in the node probabilities than the misclassification rate.]

If p is the proportion of samples of the other class in node t:
  - Misclassification rate: i(t) = 1 − max(p, 1 − p)
  - Gini index: i(t) = 2p(1 − p)
  - Cross-entropy: i(t) = [−p log(p) − (1 − p) log(1 − p)] / (2 log 2)
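A quick numeric sketch of these three measures (following the slide's scaling of the cross-entropy):

import math

def misclassification(p):
    return 1 - max(p, 1 - p)

def gini(p):
    return 2 * p * (1 - p)

def cross_entropy(p):
    if p in (0.0, 1.0):        # define 0 * log(0) = 0
        return 0.0
    return (-p * math.log(p) - (1 - p) * math.log(1 - p)) / (2 * math.log(2))

for p in (0.0, 0.25, 0.5):
    print(p, misclassification(p), gini(p), round(cross_entropy(p), 3))
# At p = 0.5 all three measures equal 0.5; at p = 0 or p = 1 all are 0 (a "pure" node).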

Page 28:

Classification And Regression Trees

By swapping the impurity function and the leaf model, decision trees can be used to solve classification and regression tasks:

classification:
  - y symbolic, discrete, e.g., Y = {class1, class2}
  - ŷ = argmax_{c ∈ Y} p(c | t), i.e. the majority class in node t
  - i(t) = entropy(t) or i(t) = gini(t)

regression:
  - y numeric, continuous
  - ŷ = mean(y | t), i.e. the point average in node t
  - i(t) = (1 / n_t) Σ_{(x, y) ∈ L_t} (y − ŷ_t)², i.e. the mean squared error

Page 29:

A Simple Regression Tree

<x, y> continuous variables, n = 20.

[Figure: scatter plot of the 20 points with the fitted tree:]
  - root: mean 19.5, n=20; split on x < 418
    - yes: node 14.8, n=14; split on x >= 154
      - yes: leaf 11.7, n=9
      - no: leaf 20.5, n=5
    - no: node 30.5, n=6; split on x < 460
      - yes: leaf 24.4, n=3
      - no: leaf 36.5, n=3

Page 30:

Model Selection on tree parameters
[Figure: the regression-tree fit on the same <x, y> data for maximum depth 1, 2, 3 and 4; deeper trees give a finer piecewise-constant approximation.]

Page 31:

Stopping condition: e.g., max depth or min samples
[Figure: the trees grown at maximum depth 1, 2, 3 and 4. At depth 1 there is a single split (x < 418; leaf means 14.8 with n=14 and 30.5 with n=6); each additional level refines the leaves further, down to single-sample leaves at depth 4.]

Page 32:

Recall: Underfitting and Overfitting
The goal of the model is to minimise the prediction error on unseen data.
[Figure: training-set and test-set MSE as a function of the tree's maximum depth (1 to 4).]

• Overly complex trees are likely to overfit the training data:
  - to avoid this, tune the stopping criteria (or post-hoc prune)
  - cross-validation can be used for model selection (a sketch follows below)
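A minimal sketch of cross-validating the maximum depth (scikit-learn and the toy data are assumptions, not the lecture's own code):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 500, size=(20, 1))                      # toy 1-D regression data
y = 15 + 20 * (X[:, 0] > 418) + rng.normal(0, 2, size=20)  # a step plus noise

for depth in (1, 2, 3, 4):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"max_depth={depth}  cross-validated MSE={mse:.2f}")
# Pick the depth with the lowest cross-validated error rather than the lowest training error.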

Page 33:

Recall: Bias and Variance
Models with low bias and low variance have lower expected prediction error.
[Figure: the classic bull's-eye diagram contrasting low/high bias with low/high variance.]

Page 34:

Bias and Variance of a Regression Tree

• Decision trees have, in general, low bias but high variance:
  - to reduce variance, combine the predictions of several trees!
    (see bagging and ensembles of randomised trees; a sketch follows below)
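A minimal sketch of reducing variance by averaging many trees, each fitted on a bootstrap sample (scikit-learn's bagging ensemble is an illustrative choice):

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Average the predictions of 100 trees, each trained on a bootstrap resample of the data.
bagged_trees = BaggingRegressor(DecisionTreeRegressor(),  # low-bias, high-variance base learner
                                n_estimators=100,
                                bootstrap=True,
                                random_state=0)
# bagged_trees.fit(X, y); bagged_trees.predict(X_new)  -- with X, y as in the earlier sketches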

Page 35:

References

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer.

Louppe, G. (2014). Understanding Random Forests: From Theory to Practice. PhD thesis, Université de Liège, Liège, Belgium.

Page 36:

Artificial Intelligence (CSC9YE) Revision – Machine Learning – Clustering

Page 37:

Unsupervised Learning

• Unsupervised learning: no labeled examples, no training set.
• We want to find interesting things about a set of data. Is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations?
• This means grouping and separating data points at the same time.
• We need a way to measure how (dis)similar the data points are, for example with the Euclidean distance.
• It is intrinsically more difficult than supervised learning because there is no gold standard (like an outcome variable) and no single objective (like test set accuracy).

Page 38:

Two Clustering Methods

• In K-means clustering, we seek to partition the observations into a pre-specified number of clusters k.
• In hierarchical clustering, we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a dendrogram, that allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to n.

Page 39:

K-means: An Optimisation Problem

• Minimise within-cluster variation.
• Algorithm (see the sketch after this list):
  1. Randomly select k points. These serve as initial cluster centroids for the observations.
  2. Assign each observation to the cluster whose centroid is closest.
  3. Iterate until the cluster assignments stop changing:
     3.1 For each of the k clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
     3.2 Assign each observation to the cluster whose centroid is closest.
• Properties:
  - This algorithm is guaranteed to decrease the value of the objective. However, it is not guaranteed to give the global minimum.
  - The algorithm may get stuck in a local optimum.
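A minimal NumPy sketch of these steps (illustrative code, not the lecture's own):

import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """K-means: alternate closest-centroid assignment and centroid recomputation."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random initial centroids
    assignment = None
    for _ in range(n_iters):
        # steps 2 / 3.2: assign each observation to the cluster whose centroid is closest
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = distances.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break                                               # assignments stopped changing
        assignment = new_assignment
        # step 3.1: recompute each centroid as the mean of the observations assigned to it
        for j in range(k):
            if np.any(assignment == j):
                centroids[j] = X[assignment == j].mean(axis=0)
    return assignment, centroids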

Page 40:

K-means Algorithm
K-means with k=2; centroids chosen at random in the initial step (here they coincide with points 6 and 10).

[Figure: the 10 points plotted in the plane with the two initial centroids.]

Distances to the initial centroids:
point    c1     c2
 1      6.08   5.39
 2      5.10   5.10
 3      4.24   3.16
 4      2.24   5.39
 5      1.00   6.40
 6      0.00   7.21
 7      7.28   6.08
 8      6.08   5.00
 9      8.06   3.61
10      7.21   0.00

Page 41:

K-means Algorithm
Last step: no change in centroids.

[Figure: the 10 points with the final centroids c1 and c2.]

Distances to the final centroids:
point    c1     c2
 1      3.34   7.96
 2      2.34   7.30
 3      1.86   4.51
 4      0.69   5.94
 5      2.67   5.77
 6      2.91   6.77
 7      7.47   2.51
 8      6.07   1.68
 9      7.03   1.35
10      5.01   3.58

Page 42:

Hierarchical Clustering

• Hierarchical clustering does not require that we commit to a particular choice of k.
• Bottom-up or agglomerative clustering: a dendrogram (a tree) is built starting from the leaves and combining clusters up to the trunk.
• Algorithm (a short code sketch follows below):
  1. Start with each point in its own cluster.
  2. Repeat until all points are in a single cluster:
     - Identify the closest two clusters and merge them.
• Similarity between clusters: for single/complete/average linkage, compute all pairwise distances between the observations in cluster A and the observations in cluster B, and record the smallest/largest/average of these distances.
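A minimal agglomerative-clustering sketch with SciPy (the library and the example coordinates are assumptions, not the lecture's data):

import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Ten illustrative 2-D points (placeholders for the lecture's example).
X = np.array([[1, 1], [2, 1], [4, 3], [4, 5], [4, 7], [5, 8],
              [9, 8], [8, 7], [9, 4], [7, 1]], dtype=float)

Z = linkage(X, method="single")                    # bottom-up merges with single linkage
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
print(labels)
# dendrogram(Z) draws the tree of merges (requires matplotlib).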

Page 43:

Hierarchical Clustering
Example using single linkage.
[Figure: the 10 example points, labelled 1–10, plotted in the plane.]

Page 44:

Hierarchical Clustering

Distance matrix (lower triangle):
       1     2     3     4     5     6     7     8     9    10
 1    0.0
 2    1.0   0.0
 3    3.6   2.8   0.0
 4    4.0   3.0   2.2   0.0
 5    6.0   5.0   3.6   2.0   0.0
 6    6.1   5.1   4.2   2.2   1.0   0.0
 7   10.0   9.2   6.4   7.2   6.3   7.3   0.0
 8    8.6   7.8   5.0   5.8   5.1   6.1   1.4   0.0
 9    8.6   8.1   5.4   7.1   7.1   8.1   3.2   2.8   0.0
10    5.4   5.1   3.2   5.4   6.4   7.2   6.1   5.0   3.6   0.0

Clusters: {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}

Page 45:

Hierarchical Clustering

Single-linkage merges (d = merge distance, k = number of clusters):
• d = 0.0, k = 10: {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}. Start with each observation as one cluster.
• d = 1.0, k = 8: {1,2}, {3}, {4}, {5,6}, {7}, {8}, {9}, {10}. Merge {1} and {2} as well as {5} and {6}, since they are the closest: d(1,2)=1 and d(5,6)=1.
• d = 1.4, k = 7: {1,2}, {3}, {4}, {5,6}, {7,8}, {9}, {10}. Merge {7} and {8}, since they are the closest: d(7,8)=1.4.
• d = 2.0, k = 6: {1,2}, {3}, {4,5,6}, {7,8}, {9}, {10}. Merge {4} and {5,6}, since 4 and 5 are the closest: d(4,5)=2.0.
• d = 2.2, k = 5: {1,2}, {3,4,5,6}, {7,8}, {9}, {10}. Merge {3} and {4,5,6}, since 3 and 4 are the closest: d(3,4)=2.2.
• d = 2.8, k = 3: {1,2,3,4,5,6}, {7,8,9}, {10}. Merge {1,2} and {3,4,5,6} as well as {7,8} and {9}, since 2 and 3 as well as 8 and 9 are the closest: d(2,3)=2.8 and d(8,9)=2.8.
• d = 3.2, k = 2: {1,2,3,4,5,6,10}, {7,8,9}. Merge {1,2,3,4,5,6} and {10}, since 3 and 10 are the closest: d(3,10)=3.2.
• d = 3.6, k = 1: {1,2,3,4,5,6,7,8,9,10}. Merge the remaining two clusters: d(9,10)=3.6.

Page 46:

Hierarchical Clustering
[Figure: the 10 example points plotted in the plane.]

Page 47:

Hierarchical Clustering
[Figure: single linkage cluster dendrogram; height axis from 1.0 to 3.5, leaves ordered 9, 7, 8, 10, 1, 2, 3, 4, 5, 6.]