
Today’s Topics

• Chapter 2 in One Slide

• Chapter 18: Machine Learning (ML)

• Creating an ML Dataset

– “Fixed-length feature vectors”

– Relational/graph-based examples

• HW0 (due in one week)

• Getting ‘Labeled’ Training Examples

• Train/Tune/Test Sets

• N-fold Cross Validation

9/8/15 CS 540 - Fall 2015 (Shavlik©), Lecture 2, Week 1


The Big AI Picture – Chapter 2

[Figure: an AI “Agent” interacting with its Environment in a cycle: 1: Sense, 2: Reason, 3: Act, 4: Get Feedback, 5: Learn]

The study of ‘agents’ that exist in an environment and perceive, act, and learn


What Do You Think Machine Learning Means?

Given:

Do:


Throughout the semester, think of what is missing in current ML, compared to human learning


What is Learning?

“Learning denotes changes in the system that … enable the system to do the same task … more effectively the next time.”

- Herbert Simon

“Learning is making useful changes in our minds.”

- Marvin Minsky

But remember, cheese and wine get better over time but don’t learn!

Supervised Machine Learning: Task Overview

[Figure: Real World → Feature Space, via Feature “Design” (usually done by humans); Feature Space → Concepts/Classes/Decisions, via Classifier Construction (done by learning algorithm)]

Standard Approach for Constructing an ML Dataset for a Task

Step 1: Choose a feature space

– We will use fixed-length feature vectors

– Choose N features

– Each feature has Vi possible values

– Each example is represented by a vector of N feature values (is a point in the feature space), eg <red, 50, round> for the features <color, weight, shape>

Step 2: Collect examples (“I/O” pairs)

[Figure: the chosen features define a space, here with axes color, shape, and weight]
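As a rough illustration of the two steps, the sketch below stores each example as a fixed-length tuple plus a label. The feature names come from the <color, weight, shape> example; the legal values, the helper `make_example`, and the two sample rows are assumptions made up for this sketch, not part of the course material.

```python
from typing import NamedTuple, Tuple

# Hypothetical feature space with N = 3 features, named after the slide's
# <color, weight, shape> example. The legal values are assumptions.
FEATURE_VALUES = {
    "color": {"red", "blue", "green"},
    "weight": range(0, 501),      # simplified to whole numbers in [0, 500]
    "shape": {"round", "square"},
}

class Example(NamedTuple):
    features: Tuple   # fixed-length vector: one value per feature
    label: bool       # the output half of the "I/O" pair

def make_example(color, weight, shape, label):
    """Build one labeled example, rejecting values outside the feature space."""
    vector = (color, weight, shape)
    for name, value in zip(FEATURE_VALUES, vector):
        if value not in FEATURE_VALUES[name]:
            raise ValueError(f"{value!r} is not a legal value for {name}")
    return Example(vector, label)

# Step 2: collect examples (made-up rows, for illustration only).
dataset = [
    make_example("red", 50, "round", True),
    make_example("blue", 120, "square", False),
]
```

Every example has the same length N, so the whole dataset can be viewed as a table with one row per example.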


Another View of Std ML Datasets - a Single Table (2D array)

            Feature 1   Feature 2   . . .   Feature N   Output Category

Example 1   0.0         small       . . .   red         true

Example 2   9.3         medium      . . .   red         false

Example 3   8.2         small       . . .   blue        false

. . .

Example M   5.7         medium      . . .   green       true

Standard Feature Types for representing training examples – a source of “domain knowledge”

• Nominal (including Boolean)

– No ordering among possible values

eg, color ∈ {red, blue, green} (vs. color = 1000 Hertz)

• Linear (or Ordered)

– Possible values of the feature are totally ordered

eg, size ∈ {small, medium, large} ← discrete

weight ∈ [0…500] ← continuous

• Hierarchical (not commonly used)

– Possible values are partially ordered in an ISA hierarchy

eg, shape:

closed
    polygon
        triangle
        square
    continuous
        circle
        ellipse

Keep your eye out for places where domain knowledge is (or should be) used in ML
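One possible way to encode the three feature types is sketched below. The values are the ones on the slide; the helper names `size_less_than` and `is_a`, and the choice of a child-to-parent dictionary for the ISA hierarchy, are assumptions for illustration.

```python
# Nominal: no ordering, so only equality and membership tests are meaningful.
COLORS = {"red", "blue", "green"}

# Linear/ordered: position in the list gives the total order.
SIZES = ["small", "medium", "large"]

def size_less_than(a, b):
    """True if size `a` comes before size `b` in the total order."""
    return SIZES.index(a) < SIZES.index(b)

# Hierarchical: child -> parent links of the ISA hierarchy for shape.
# "closed" is the root, so it has no entry.
ISA_PARENT = {
    "polygon": "closed",
    "continuous": "closed",
    "triangle": "polygon",
    "square": "polygon",
    "circle": "continuous",
    "ellipse": "continuous",
}

def is_a(value, ancestor):
    """True if `value` equals `ancestor` or lies below it in the hierarchy."""
    while value is not None:
        if value == ancestor:
            return True
        value = ISA_PARENT.get(value)   # None once we step past the root
    return False
```

The hierarchy is only a partial order: `is_a("triangle", "circle")` and `is_a("circle", "triangle")` are both false, whereas any two sizes can always be compared.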


A Richer Testbed: The Internet Movie Database (IMDB)

IMDB richly represents data

Note: each movie is potentially represented by a graph of a different size

Figure from David Jensen of UMass


Learning with Data in Multiple Tables (Relational ML) – not covered in cs540

[Figure: four tables: Patients, Previous Mammograms, Previous Blood Tests, Prev. Rx]

Key challenge: different amount of data for each patient

HW0 – Reading in a Dataset

• Due in one week (most HWs will have two weeks between when assigned and when due)

• The Thoracic Surgery Dataset (original version)


Getting Labeled Examples

• The ‘Achilles Heel’ of ML

• Often ‘experts’ label

– eg ‘books I like’ or ‘patients that should get drug X’

• ‘Time will tell’ concepts

– wait a month and see if medical treatment worked or stock appreciated over a year

• Use of Amazon Mechanical Turk

– ‘the crowd’

• Need representative examples, especially good ‘negative’ (counter) examples


If it is Free, You are the Product


Google is using authentication (as a human) as a way to get labeled data for their ML algorithms!


IID and Other Assumptions

• We are assuming examples are IID: independent and identically distributed

• We are ignoring temporal dependencies (covered in time-series learning)

• We assume the ML algo has no say in which examples it gets (covered in active learning)

Data arrives in any order


Train/Tune/Test Sets: A Pictorial Overview

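[Figure: a collection of classified examples (here each column is an example) is divided into training examples and testing examples. The training examples are further divided into a train’ set and a tune set. The ML algo generates solutions from the train’ set, and the tune set is used to select the best. The testing examples are used to estimate the resulting classifier’s expected accuracy on future examples.]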
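The train/tune/test idea can be sketched in a few lines. Everything here is an assumption made for illustration: the 20% split fractions, the function names, and the convention that a “candidate” is a function mapping a train’ set to a classifier, with examples stored as (input, label) pairs.

```python
import random

def split_train_tune_test(examples, test_frac=0.2, tune_frac=0.2, seed=0):
    """Shuffle, set aside a test set, then split the rest into train' and tune."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test, training = shuffled[:n_test], shuffled[n_test:]
    n_tune = int(len(training) * tune_frac)
    tune, train = training[:n_tune], training[n_tune:]
    return train, tune, test

def accuracy(classifier, examples):
    """Fraction of (input, label) pairs the classifier gets right."""
    return sum(classifier(x) == y for x, y in examples) / len(examples)

def choose_classifier(candidates, train, tune, test):
    """Generate solutions on train', select the best on tune, score on test."""
    models = [fit(train) for fit in candidates]            # generate solutions
    best = max(models, key=lambda m: accuracy(m, tune))    # select best
    return best, accuracy(best, test)   # estimate of accuracy on future examples
```

The test set is touched only once, after the choice has been made, so the final number is not biased by the selection step.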


N-fold Cross Validation

Can be used to

1) estimate future accuracy (via test sets)

2) choose parameter settings (via tuning sets)

Method

1) Randomly permute examples

2) Divide into N bins

3) Train on N - 1 bins, measure accuracy on bin ‘left out’

4) Compute average accuracy on held-out sets

[Figure: the examples divided into Fold 1, Fold 2, Fold 3, Fold 4, Fold 5]
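The four steps of the method can be written down roughly as follows. This is a minimal sketch, not the course's reference implementation: `train_fn` and `accuracy_fn` are hypothetical callables supplied by the caller, and slicing with a stride is just one convenient way to form the bins.

```python
import random

def make_folds(examples, n_folds, seed=0):
    """Steps 1 and 2: randomly permute the examples, then divide into N bins."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

def cross_validate(examples, n_folds, train_fn, accuracy_fn, seed=0):
    """Steps 3 and 4: hold out each bin in turn, then average the accuracies."""
    folds = make_folds(examples, n_folds, seed)
    scores = []
    for i, held_out in enumerate(folds):
        # Train on the other N - 1 bins.
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train)
        # Measure accuracy on the bin that was left out.
        scores.append(accuracy_fn(model, held_out))
    return sum(scores) / len(scores)
```

To choose a parameter setting, one would call `cross_validate` once per setting on the training data only and keep the setting with the highest average.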


Dealing with Data that Comes from Larger Objects

• Assume examples are sentences contained in books

• Or web pages from computer science depts

• Or short DNA sequences from genes

• (Usually) need to cross validate on the LARGER objects

Eg, first partition books into N folds, then collect sentences from a fold’s books

[Figure: Sentences in Books, with the books grouped into Fold1 and Fold2]
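A sketch of the “partition the larger objects first” idea is shown below. The function name `folds_by_larger_object` and the `group_of` callback (which maps an example to the book, department, or gene it came from) are assumptions for illustration; group identifiers are assumed to be sortable.

```python
import random
from collections import defaultdict

def folds_by_larger_object(examples, group_of, n_folds, seed=0):
    """Assign whole groups (eg books) to folds, then collect their examples."""
    groups = defaultdict(list)
    for ex in examples:
        groups[group_of(ex)].append(ex)

    # Partition the LARGER objects, not the individual examples.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    folds = [[] for _ in range(n_folds)]
    for i, gid in enumerate(group_ids):
        folds[i % n_folds].extend(groups[gid])   # all of a book's sentences together
    return folds
```

Because every sentence from a given book lands in the same fold, no book contributes to both the training bins and the held-out bin of any fold.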