1 machine learning spring 2013 rong jin. 2 cse847 machine learning instructor: rong jin office...

Machine Learning

Spring 2013

Rong Jin

CSE847 Machine Learning
Instructor: Rong Jin
Office Hour: Tuesday 4:00pm-5:00pm
TA: Qiaozi Gao, Thursday 4:00pm-5:00pm

Textbook
• Machine Learning
• The Elements of Statistical Learning
• Pattern Recognition and Machine Learning
• Many subjects are from papers

Web site: http://www.cse.msu.edu/~cse847

Requirements
• ~10 homework assignments
• Course project
  Topic: visual object recognition
  Data: over one million images with extracted visual features
  Objective: build a classifier that automatically identifies the class of objects in images
• Midterm exam & final exam

Goal
• Familiarize you with the state of the art in machine learning
  Breadth: many different techniques
  Depth: project
  Hands-on experience
• Develop the machine learning way of thinking
  Learn how to model real-world problems with machine learning techniques
  Learn how to deal with practical issues

Course Outline

Theoretical Aspects

• Information Theory

• Optimization Theory

• Probability Theory

• Learning Theory

Practical Aspects

• Supervised Learning Algorithms

• Unsupervised Learning Algorithms

• Important Practical Issues

• Applications

Today’s Topics
• Why machine learning?
• Example: learning to play backgammon
• General issues in machine learning

Why Machine Learning?
• Past: most computer programs were written by hand
• Future: computers should be able to program themselves through interaction with their environment

Recent Trends
• Recent progress in algorithms and theory
• Growing flood of online data
• Computational power is available
• Growing industry

Big Data Challenge

• 2.7 zettabytes (1 zettabyte = $10^{21}$ bytes) of data exist in the digital universe today.

• A huge amount of data is generated on the Internet every minute:
  • YouTube users upload 48 hours of video
  • Facebook users share 684,478 pieces of content
  • Instagram users share 3,600 new photos

Source: http://www.visualnews.com/2012/06/19/how-much-data-created-every-minute/

Big Data Challenge

High-dimensional data appears in many applications of machine learning

Fine-grained visual classification [1]
• 250,000 features

Why Data Size Matters?

Matrix completion
• Classification, clustering, recommender systems

Why Data Size Matters?
• The matrix can be perfectly recovered provided the number of observed entries is at least $O(rn\log^2 n)$

Why Data Size Matters?
• The recovery error can be arbitrarily large if the number of observed entries is below $O(rn\log n)$

Why Data Size Matters?

[Figure: axis showing the number of observed entries, with the thresholds $O(rn\log n)$ and $O(rn\log^2 n)$ marked and a region labeled "Unknown"]
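To get a rough feel for how small these thresholds are relative to the full matrix, the sketch below evaluates both expressions for one matrix size. It assumes an n × n matrix of rank r, natural logarithms, and all constants hidden by the big-O notation set to 1; none of these choices come from the slides, and the function name is hypothetical.

```python
import math


def observed_entry_thresholds(n: int, r: int) -> dict:
    """Evaluate r*n*log(n) and r*n*log(n)^2 next to the n*n total.

    Assumption: every constant hidden by the big-O notation is 1 and the
    logarithm is natural, so the numbers are only orders of magnitude.
    """
    log_n = math.log(n)
    return {
        "lower": r * n * log_n,        # below this, recovery error can be large
        "upper": r * n * log_n ** 2,   # at or above this, recovery is possible
        "total": n * n,                # number of entries in the full matrix
    }


thresholds = observed_entry_thresholds(n=1_000_000, r=10)
fraction = thresholds["upper"] / thresholds["total"]
print(f"upper threshold covers {fraction:.4%} of all entries")
```

Under these assumptions, both thresholds are a small fraction of the full matrix once n is large, which is why recovery from few observations is interesting at all.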

Alibaba Small and Micro Financial Services

• Financing is difficult to access for small & medium businesses

• Minimum loan

• Tedious loan approval procedure

• Low approval rate

• Long cycle

• Completely big data driven

• Leverage e-commerce data for financial services

• Insurance contracts have a year-on-year growth rate of 100%.
• Over 1 billion contracts in 2013
• Over 100 million contracts in one day on November 11, 2013

[Figure: overall rate of compensation, axis ticks at 40%, 80%, and 120%]

Shipping Insurance for Returned Products

Shipping Insurance for Returned Products

Fixed rate
• Uniform 5% fixed rate
• Simple, easy to explain

Data based pricing
• Actuarial approach
• Solely based on historical data and demographics
• Pricing model based on a few parameters
• Relatively accurate

Dynamic pricing
• Machine learned model
• Millions of features, real time pricing
• Highly accurate

Three Niches for Machine Learning
• Data mining: using historical data to improve decisions
  Medical records → medical knowledge
• Software applications that are difficult to program by hand
  Autonomous driving
  Image classification
• User modeling
  Automatic recommender systems

Typical Data Mining Task

Given:
• 9147 patient records, each describing a pregnancy and birth
• Each patient record contains 215 features

Task:
• Predict classes of future patients at high risk for Emergency Cesarean Section

Data Mining Results

One of 18 learned rules:
If   no previous vaginal delivery, and
     abnormal 2nd trimester ultrasound, and
     malpresentation at admission
Then probability of Emergency C-Section is 0.6

Credit Risk Analysis

Learned Rules:
If   Other-Delinquent-Account > 2, and
     Number-Delinquent-Billing-Cycles > 1
Then Profitable-Customer? = no

If   Other-Delinquent-Account = 0, and
     (Income > $30K or Years-of-Credit > 3)
Then Profitable-Customer? = yes
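The two credit rules can be read as a small decision procedure. The sketch below is one way to encode them; the function and parameter names are hypothetical, and returning None when neither rule fires is an assumption, since the slide does not say what happens in that case.

```python
from typing import Optional


def profitable_customer(other_delinquent_accounts: int,
                        delinquent_billing_cycles: int,
                        income: float,
                        years_of_credit: float) -> Optional[str]:
    """Apply the two learned rules; return "yes", "no", or None."""
    # Rule 1: several delinquent accounts and repeated delinquent cycles.
    if other_delinquent_accounts > 2 and delinquent_billing_cycles > 1:
        return "no"
    # Rule 2: no delinquent accounts, plus enough income or credit history.
    if other_delinquent_accounts == 0 and (income > 30_000 or years_of_credit > 3):
        return "yes"
    # Assumption: neither rule covers this customer, so make no prediction.
    return None


print(profitable_customer(0, 0, income=40_000, years_of_credit=1))
```

A learned rule set like this usually covers only part of the input space, which is why the sketch keeps "no prediction" separate from "yes" and "no".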

Programs too Difficult to Program By Hand

ALVINN drives 70mph on highways

Visual object recognition

Classify Bird Images

[Figure: positive and negative example images, with train and test sets]

Image Retrieval using Texts

Software that Models Users

History

Description: A homicide detective and a fire marshal must stop a pair of murderers who commit videotaped crimes to become media darlings.

Rating:

Description: Benjamin Martin is drawn into the American revolutionary war against his will when a brutal British commander kills his son.

Rating:

Description: A biography of sports legend Muhammad Ali, from his early days to his days in the ring.

Rating:

What to Recommend?

Description: A high-school boy is given the chance to write a story about an up-and-coming rock band as he accompanies it on their concert tour.

Recommend: ?

Description: A young adventurer named Milo Thatch joins an intrepid group of explorers to find the mysterious lost continent of Atlantis.

Recommend: ?

Netflix Contest

Relevant Disciplines
• Artificial Intelligence
• Statistics (particularly Bayesian statistics)
• Computational complexity theory
• Information theory
• Optimization theory
• Philosophy
• Psychology
• …

What is the Learning Problem

Learning = improving with experience at some task
• Improve over task T
• With respect to performance measure P
• Based on experience E

Example: Learning to Play Backgammon
• T: Play backgammon
• P: % of games won in world tournament
• E: opportunity to play against itself

Backgammon

• More than $10^{20}$ states (boards)
• The best human players see only a small fraction of all boards during their lifetime
• Searching is hard because of the dice (branching factor > 100)

TD-Gammon by Tesauro (1995)

• Trained by playing against itself
• Now approximately equal to the best human player

Learn to Play Chess
• Task T: Play chess
• Performance P: Percent of games won in the world tournament
• Experience E:

• What experience?
• How shall it be represented?
• What exactly should be learned?
• What specific algorithm should be used to learn it?

Choose a Target Function

Goal:
• Policy $\pi: b \to m$

Choice of value function:
• $V: (b, m) \to \mathbb{R}$
• $V: b \to \mathbb{R}$

b = board

$\mathbb{R}$ = real values

Value Function V(b): Example Definition
• If b is a final board that is won: $V(b) = 1$
• If b is a final board that is lost: $V(b) = -1$
• If b is not a final board: $V(b) = E[V(b^*)]$, where $b^*$ is the final board reached after playing optimally

Representation of Target Function V(b)
• Same value for each board: no learning
• Lookup table (one entry for each board): no generalization
• In between: summarize experience into
  • Polynomials
  • Neural networks

Example: Linear Feature Representation

Features:
• $p_b(b)$, $p_w(b)$ = number of black (white) pieces on board b
• $u_b(b)$, $u_w(b)$ = number of unprotected black (white) pieces
• $t_b(b)$, $t_w(b)$ = number of black (white) pieces threatened by the opponent

Linear function:
$V(b) = w_0 p_b(b) + w_1 p_w(b) + w_2 u_b(b) + w_3 u_w(b) + w_4 t_b(b) + w_5 t_w(b)$

Learning: Estimation of parameters w0, …, w5
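As a minimal sketch, the linear value function is just a weighted sum over the six board features. The feature counts and weights below are made-up numbers for illustration only, and `linear_value` is a hypothetical helper name.

```python
from typing import Sequence


def linear_value(weights: Sequence[float], features: Sequence[float]) -> float:
    """Compute V(b) as the sum of w_i * f_i over all board features."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))


# Feature order: p_b, p_w, u_b, u_w, t_b, t_w (illustrative counts only).
features = [12, 10, 2, 3, 1, 4]
# Hypothetical weights w_0..w_5; learning is what would estimate these.
weights = [1.0, -1.0, -0.5, 0.5, -0.5, 0.5]

value = linear_value(weights, features)
print(value)
```

With this representation, learning reduces to choosing six numbers, regardless of how many boards exist.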

Tuning Weights

Given: board b
• Predicted value $V(b)$
• Desired value $V^*(b)$

Calculate $\mathrm{error}(b) = V^*(b) - V(b)$

For each board feature $f_i$:
$w_i \leftarrow w_i + c \cdot \mathrm{error}(b) \cdot f_i$

Stochastically minimizes $\sum_b (V^*(b) - V(b))^2$

Gradient Descent Optimization
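The weight-tuning step can be sketched as a least-mean-squares update. This sketch assumes the signed error V*(b) − V(b), which is what a gradient step on the squared error requires (the factor of 2 is absorbed into the step size c). The function name, step size, and training board are illustrative assumptions.

```python
from typing import List, Sequence


def lms_update(weights: Sequence[float], features: Sequence[float],
               target: float, c: float = 0.001) -> List[float]:
    """One stochastic gradient step on (target - V(b))^2 for one board."""
    predicted = sum(w * f for w, f in zip(weights, features))
    error = target - predicted  # signed error, assumed here
    # Move each weight in proportion to its feature and the error.
    return [w + c * error * f for w, f in zip(weights, features)]


# Illustrative single training board with desired value V*(b) = 1.
features = [12, 10, 2, 3, 1, 4]
weights = [0.0] * 6
for _ in range(200):
    weights = lms_update(weights, features, target=1.0, c=0.001)

prediction = sum(w * f for w, f in zip(weights, features))
print(round(prediction, 6))
```

The step size matters: if c times the squared norm of the feature vector is too large, the updates overshoot instead of converging, so c is kept small here.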

Obtain Boards

• Random boards
• Beginners' play
• Professionals' play

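Obtain Target Values
• Person provides value $V(b)$
• Play until termination. If the outcome is
  Win: $V(b) \leftarrow 1$ for all boards
  Loss: $V(b) \leftarrow -1$ for all boards
  Draw: $V(b) \leftarrow 0$ for all boards
• Play one move: $b \to b'$, then $V(b) \leftarrow V(b')$
• Play n moves: $b \to b' \to \dots \to b^{(n)}$, then $V(b) \leftarrow V(b^{(n)})$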
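Two of these options can be sketched over the sequence of boards seen in one game: labeling every board with the final outcome, or labeling each board with the current estimate of the board n moves later. Both function names are hypothetical, and clamping to the last board near the end of the game is an assumption not stated on the slide.

```python
from typing import List, Sequence

OUTCOME_VALUES = {"win": 1.0, "loss": -1.0, "draw": 0.0}


def targets_from_outcome(num_boards: int, outcome: str) -> List[float]:
    """Assign the final game outcome as the target for every board."""
    return [OUTCOME_VALUES[outcome]] * num_boards


def targets_from_lookahead(current_values: Sequence[float], n: int = 1) -> List[float]:
    """Use the current estimate of the board n moves ahead as the target.

    Assumption: boards with fewer than n successors use the last board.
    """
    last = len(current_values) - 1
    return [current_values[min(i + n, last)] for i in range(len(current_values))]


# Illustrative current estimates V(b) for four successive boards in a game.
estimates = [0.1, 0.4, 0.2, 1.0]
print(targets_from_outcome(len(estimates), "win"))
print(targets_from_lookahead(estimates, n=1))
```

The outcome-based targets need a finished game, while the lookahead targets can be computed after every move at the cost of relying on the current, possibly inaccurate, estimates.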

A General Framework

[Diagram: Mathematical Modeling (Statistics) + Finding Optimal Parameters (Optimization) → Machine Learning]

Important Issues in Machine Learning

Obtaining experience
• How to obtain experience? Supervised learning vs. unsupervised learning
• How many examples are enough? PAC learning theory

Learning algorithms
• Which algorithms can approximate the target function well, and when?
• How does the complexity of a learning algorithm affect its learning accuracy?
• Is the target function learnable?

Representing inputs
• How to represent the inputs?
• How to remove irrelevant information from the input representation?
• How to reduce the redundancy of the input representation?
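For the question "How many examples are enough?", PAC learning theory gives concrete answers in simple settings. The sketch below uses the standard bound for a finite hypothesis class and a learner that outputs a hypothesis consistent with the training data: m ≥ (1/ε)(ln|H| + ln(1/δ)). The slides only name PAC theory, so the choice of this particular bound and the function name are assumptions made for illustration.

```python
import math


def pac_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Examples sufficient for error <= epsilon with probability >= 1 - delta.

    Assumes a finite hypothesis class and a consistent learner.
    """
    if hypothesis_count < 1 or not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("need |H| >= 1 and epsilon, delta in (0, 1)")
    bound = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)


# Illustrative setting: 2**10 hypotheses, 10% error, 95% confidence.
print(pac_sample_bound(2 ** 10, epsilon=0.1, delta=0.05))
```

The bound grows only logarithmically in the number of hypotheses but linearly in 1/ε, so demanding higher accuracy is what drives the number of examples up.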
