data-driven modeling: lecture 02

Data-driven modeling APAM E4990 Jake Hofman Columbia University January 30, 2012 Jake Hofman (Columbia University) Data-driven modeling January 30, 2012 1 / 23

Page 1: Data-driven modeling: Lecture 02

Data-driven modeling

APAM E4990

Jake Hofman

Columbia University

January 30, 2012

Page 2: Data-driven modeling: Lecture 02

Outline

1 Digit recognition

2 Image classification

3 Acquiring image data

Page 3: Data-driven modeling: Lecture 02

Digit recognition

Classification is a supervised learning task in which we aim to predict the correct label for an example given its features

[Figure: example digit images with labels 0 5 4 1 4 9]

e.g. determine which digit {0, 1, ..., 9} is depicted in each image

Page 4: Data-driven modeling: Lecture 02

Digit recognition

Determine which digit {0, 1, ..., 9} is depicted in each image

Page 5: Data-driven modeling: Lecture 02

Images as arrays

Grayscale images ↔ 2-d arrays of M × N pixel intensities

Page 6: Data-driven modeling: Lecture 02

Images as arrays

Grayscale images ↔ 2-d arrays of M × N pixel intensities

Represent each image as a “vector of pixels”, flattening the 2-d array of pixels to a 1-d vector
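
The flattening step can be sketched as follows. This is a minimal illustration, assuming NumPy and a tiny made-up array in place of a real digit image; the variable names are hypothetical and not taken from the course code.

```python
import numpy as np

# Hypothetical 3 x 4 grayscale "image" (M = 3 rows, N = 4 columns);
# a real digit image would be loaded from disk instead.
image = np.arange(12, dtype=float).reshape(3, 4)

# Flatten row by row into a single vector of M * N pixel intensities.
pixels = image.reshape(-1)

print(image.shape)
print(pixels.shape)
```

Each image then becomes one row of a feature matrix, so a set of images of the same size can be compared pixel by pixel.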

Page 7: Data-driven modeling: Lecture 02

k-nearest neighbors classification

k-nearest neighbors: memorize training examples, predict labels using labels of the k closest training points

Intuition: nearby points have similar labels
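
A minimal sketch of this idea in plain Python, assuming Euclidean distance and a majority vote (the slides do not specify either choice). The function name `knn_predict` and the toy data are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=1):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every memorized training point.
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k closest training points, nearest first.
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote over their labels (ties go to the label encountered first).
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters labeled 'a' and 'b'.
train_points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
train_labels = ['a', 'a', 'b', 'b']

print(knn_predict(train_points, train_labels, (0.2, 0.1), k=3))
```

There is no training step beyond storing the examples; all of the work happens at prediction time.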

Page 8: Data-driven modeling: Lecture 02

k-nearest neighbors classification

Small k gives a complex boundary, large k results in coarse averaging

Page 9: Data-driven modeling: Lecture 02

k-nearest neighbors classification

Evaluate performance on a held-out test set to assess generalization error
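
Held-out evaluation might look like the sketch below. It assumes synthetic 2-d data, an 80/20 split, and a 1-nearest-neighbor rule with Euclidean distance; none of these specifics come from the slides, and `nearest_label` is a hypothetical helper.

```python
import math
import random

def nearest_label(train, query):
    """1-nearest-neighbor: return the label of the closest (point, label) pair."""
    return min(train, key=lambda pair: math.dist(pair[0], query))[1]

# Hypothetical toy data: two well-separated 2-d clusters labeled 0 and 1.
rng = random.Random(0)
data = [((rng.gauss(c, 0.5), rng.gauss(c, 0.5)), label)
        for label, c in ((0, 0.0), (1, 5.0)) for _ in range(50)]

# Shuffle, then hold out the last 20% as a test set the classifier never sees.
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Accuracy on the held-out examples estimates how well the classifier generalizes.
correct = sum(nearest_label(train, x) == y for x, y in test)
accuracy = correct / len(test)
print(f"held-out accuracy = {accuracy:.3f}")
```

Measuring accuracy on the training points themselves would be uninformative here, since with k = 1 every training point is its own nearest neighbor.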

Page 10: Data-driven modeling: Lecture 02

Digit recognition

Simple digit classifier with k=1 nearest neighbors

./classify_digits.py

Page 11: Data-driven modeling: Lecture 02

Outline

1 Digit recognition

2 Image classification

3 Acquiring image data

Page 12: Data-driven modeling: Lecture 02

Image classification

Determine if an image is a landscape or headshot

[Figure: two example images, labeled 'landscape' and 'headshot']

Represent each image with a binned RGB intensity histogram

Page 13: Data-driven modeling: Lecture 02

Images as arrays

Color images ↔ 3-d arrays of M × N × 3 RGB pixel intensities

import matplotlib.image as mpimg

I = mpimg.imread('chairs.jpg')

Page 15: Data-driven modeling: Lecture 02

Intensity histograms

Disregard all spatial information and simply count pixels by intensity (e.g. lots of pixels with bright green and dark blue)

Page 16: Data-driven modeling: Lecture 02

Intensity histograms

How many bins for pixel intensities?

Too many bins give a noisy, overly complex representation of the data, while too few bins result in an overly simple one
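
One way to build such a feature vector is sketched below. It assumes NumPy, 8-bit intensities in 0..255, and one histogram per color channel concatenated into a single vector; the slides do not say whether the course code bins channels separately or jointly, and `rgb_histogram` is a hypothetical name.

```python
import numpy as np

def rgb_histogram(image, bins=16):
    """Concatenate per-channel intensity histograms of an M x N x 3 image."""
    features = []
    for channel in range(3):
        # Count pixels per intensity bin, ignoring where they sit in the image.
        counts, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        features.append(counts)
    return np.concatenate(features)

# Hypothetical random 8-bit image standing in for a real photo.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 48, 3), dtype=np.uint8)

# The feature vector has 3 * bins entries, so the bin count sets its resolution.
for bins in (4, 16, 64):
    print(bins, rgb_histogram(image, bins).shape)
```

Images of different sizes map to vectors of the same length, which is what makes them comparable with a distance-based classifier.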

Page 17: Data-driven modeling: Lecture 02

Image classification

Classify

./classify_flickr.py 16 9 flickr_headshot flickr_landscape

Change in performance on test set with number of neighbors

k = 1, accuracy = 0.7125

k = 3, accuracy = 0.7425

k = 5, accuracy = 0.7725

k = 7, accuracy = 0.7650

k = 9, accuracy = 0.7500
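
A sweep over k like the one above can be sketched as follows. The data here are synthetic 2-d points, not Flickr histograms, so the printed accuracies will not match the slide; `knn_predict`, the cluster parameters, and the 150/50 split are all assumptions made for illustration.

```python
from collections import Counter
import math
import random

def knn_predict(train, query, k):
    """Majority vote among the k training (point, label) pairs closest to `query`."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical toy data: two overlapping 2-d clusters, so some errors are expected.
rng = random.Random(0)
data = [((rng.gauss(c, 1.0), rng.gauss(c, 1.0)), label)
        for label, c in (("headshot", 0.0), ("landscape", 1.5)) for _ in range(100)]
rng.shuffle(data)
train, test = data[:150], data[150:]

# Test-set accuracy for each odd k (odd values avoid ties between two classes).
results = {}
for k in (1, 3, 5, 7, 9):
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    results[k] = correct / len(test)
    print(f"k = {k}, accuracy = {results[k]:.4f}")
```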

Page 18: Data-driven modeling: Lecture 02

Outline

1 Digit recognition

2 Image classification

3 Acquiring image data

Page 19: Data-driven modeling: Lecture 02

Simple screen scraping

One-liner to download ESL digit data

wget -Nr --level=1 --no-parent http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/zip.digits

Page 20: Data-driven modeling: Lecture 02

Simple screen scraping

One-liner to scrape images from a webpage

wget -O- http://bit.ly/zxy0jN |
tr ''\''"=' '\n' |
egrep '^http.*(png|jpg|gif)' |
xargs wget

• get page source

• translate quotes and = to newlines

• match urls with image extensions

• download qualifying images
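
The middle two steps of the pipeline can be mirrored in Python, as in the rough sketch below. It works on a made-up HTML string rather than a fetched page, and `image_urls` is a hypothetical helper, not part of the course code.

```python
import re

def image_urls(html):
    """Return URL-like tokens in `html` that mention an image extension."""
    # Like `tr`: split the page source on single quotes, double quotes, and '='.
    tokens = re.split(r"['\"=]", html)
    # Like `egrep`: keep tokens that start with http and contain png, jpg, or gif.
    pattern = re.compile(r"^http.*(png|jpg|gif)")
    return [t for t in tokens if pattern.search(t)]

# Hypothetical page source; a real page would be fetched first.
html = '<a href="http://example.com/about.html"><img src="http://example.com/cat.jpg"></a>'
print(image_urls(html))
```

As with the shell version, this is pattern matching on raw text rather than HTML parsing, so it only finds absolute URLs that appear inside quoted attributes.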

Page 22: Data-driven modeling: Lecture 02

“cat flickr | xargs wget”?

Page 23: Data-driven modeling: Lecture 02

Flickr API

Page 24: Data-driven modeling: Lecture 02

YQL: SELECT * FROM Internet [1]

http://developer.yahoo.com/yql

[1] http://oreillynet.com/pub/e/1369

Page 25: Data-driven modeling: Lecture 02

YQL: Console

http://developer.yahoo.com/yql/console

Page 26: Data-driven modeling: Lecture 02

YQL + Python

Python function for public YQL queries

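import json
from urllib.parse import urlencode    # Python 3 locations of urlencode/urlopen
from urllib.request import urlopen

# assumed value: the public YQL endpoint (the constant is not defined on the slide)
YQL_PUBLIC = 'http://query.yahooapis.com/v1/public/yql'

def yql_public(query, env=False):
    # build dictionary of GET parameters
    params = {'q': query, 'format': 'json'}
    if env:
        params['env'] = env

    # escape query
    query_str = urlencode(params)

    # fetch results
    url = '%s?%s' % (YQL_PUBLIC, query_str)
    result = urlopen(url)

    # parse json and return
    return json.load(result)['query']['results']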
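
To see what the escaping step produces, the sketch below builds a request URL without fetching anything. It uses a placeholder endpoint on example.com rather than the real service, and a sample query with no API key; both are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Placeholder endpoint (hypothetical) so that nothing is actually requested.
YQL_PUBLIC = "http://example.com/yql"
params = {"q": "select * from flickr.photos.interestingness(20)", "format": "json"}

# urlencode percent-escapes reserved characters and joins the pairs with '&'.
url = "%s?%s" % (YQL_PUBLIC, urlencode(params))
print(url)
```

Spaces in the query become `+` and characters such as `*` and parentheses are percent-encoded, which is why the raw query string cannot simply be pasted into the URL.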

Page 27: Data-driven modeling: Lecture 02

YQL + Python + Flickr

Fetch info for “interestingness” photos

./simpleyql.py 'select * from flickr.photos.interestingness(20) where api_key="..."'

Download thumbnails for photos tagged with “vivid”

./download_flickr.py vivid 500 <api_key>
