Minnesota Casualty Actuarial Symposium
Machine Learning: What we learned from our first Coursera course
Nathan Hubbell, Laura Johnson, Patrick Fillmore, Stephen Segroves
January 22nd, 2013
Agenda
1. MOOC Overview - Nathan
2. Machine Learning Concepts - Patrick
3. Machine Learning in Practice – Stephen
4. Other Learnings from Machine Learning - Laura
5. Q&A
MOOC Overview
Nathan Hubbell
MOOC Overview
• MOOC – Massive Open Online Course
• Big MOOC Names
  – Feb, 2011: Udacity (Stanford)
  – April, 2012: Coursera (Stanford)
  – March, 2012: edX (Harvard, MIT, Berkeley)
  – Sept, 2006: Khan Academy
• Features
  – Open access
  – Scalability
  – Discussion Boards
Source: http://en.wikipedia.org/wiki/Massive_open_online_course
MOOCs in the News
“Coursera doubles university count to 33, now hosts over 200 courses for over 1.3 million students”
The Next Web Insider, September 19th, 2012
• On Udacity’s site:
  – “… But that seems to be a willful misreading of the regulation (which seems silly in the first place). Coursera isn't a degree mill. It's not about earning the degree, it's about actually learning. Minnesota's interpretation of the law is fairly ridiculous. It basically means that anyone who wants to access online educational material in Minnesota is limited by the state determining what it considers okay.”
• Slate.com: Larry Pogemiller, director of the MN Office of Higher Education:
  – “Obviously, our office encourages lifelong learning and wants Minnesotans to take advantage of educational materials available on the Internet, particularly if they’re free. No Minnesotan should hesitate to take advantage of free, online offerings from Coursera.”
MOOCsperience
• Class Structure
  – 10 Week Course
  – 2-3 hours of video content per week
  – Wiki-based Course Notes
  – Questions? Discussion Forum
• Homework
  – Review Questions: Quick 5-question / 10 minutes
  – Programming Exercises: 1 – 4 hours
The content of the following slides is drawn heavily from the Coursera Machine Learning class: https://www.coursera.org/#course/ml
Machine Learning Concepts
Patrick Fillmore
What is Machine Learning?
“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”
- Tom Mitchell, American computer scientist and E. Fredkin University Professor at Carnegie Mellon University
Task / Experience / Performance
What is Machine Learning?
Ratemaking
• Task: Predict Future Losses
• Experience: Policy Loss History
• Performance: Actual Losses / Loss Ratio
Reserving
• Task: Predict Future Development
• Experience: Loss Development History
• Performance: Final / Predicted Ultimates
Machine Learning Techniques
Data Driven Modeling
• Familiar
  – Linear Regression / Linear Models
  – Logistic Regression / GLMs
• Unfamiliar
  – Supervised Learning
    – Regularization
    – Neural Networks
    – Support Vector Machines
  – Unsupervised Learning
    – Principal Component Analysis
    – Clustering
    – Recommender Systems
  – Many More!
• Not Machine Learning Algorithms
  – Judgmental selection of LDFs
  – Risk/Reinsurance Models
Linear Regression
Weight = Height * Factor + Intercept
[Figure: Human Height vs. Weight scatter plot; x-axis Height (Inches), y-axis Weight (Pounds)]
Hypothesis: $y = h_\theta(x) = \theta_0 + \theta_1 x$
Linear Regression: Cost Function
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
[Figure: Human Height vs. Weight scatter plot; x-axis Height (Inches), y-axis Weight (Pounds)]
How to find a good fit? Cost Function!
Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{SSE}{2m}$
Fitting Goal: minimize J
Use Gradient Descent!
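As a rough illustration of the cost function above, the sketch below computes J for a straight-line hypothesis. The function names and the three height/weight pairs are made up for this example and are not taken from the slides.

```python
def hypothesis(theta0, theta1, x):
    """Straight-line prediction: h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared-error cost J = SSE / (2m) over the m training examples."""
    m = len(xs)
    sse = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return sse / (2 * m)


# Hypothetical data (heights in inches, weights in pounds), for illustration only.
heights = [64.0, 68.0, 72.0]
weights = [120.0, 140.0, 160.0]

perfect_fit_cost = cost(-200.0, 5.0, heights, weights)  # line passes through every point
worse_fit_cost = cost(-200.0, 4.0, heights, weights)    # flatter line misses every point

print(perfect_fit_cost, worse_fit_cost)
```

A line that passes through every point has zero cost; any other line has a positive cost, which is what the fitting procedure tries to drive down.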
Linear Regression: Minimize Cost (Gradient Descent)
1. Start with a θ; determine cost
Iteration   θ      J(θ)
0           1.00   5,488,884
[Figure: Cost Function vs. Theta]
Linear Regression: Minimize Cost (Gradient Descent)
2. Determine how J changes with θ (dJ/dθ)
Iteration   θ      J(θ)        dJ/dθ
0           1.00   5,488,884   -165.30
[Figure: Cost Function vs. Theta]
Linear Regression: Minimize Cost (Gradient Descent)
3. Calculate a new θ
$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \frac{dJ}{d\theta}$, where α = learning rate = .01
Iteration   θ      J(θ)        dJ/dθ
0           1.00   5,488,884   -165.30
1           2.65
[Figure: Cost Function vs. Theta]
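The update rule can be checked against the numbers in the table. This is a minimal sketch; the function name `gradient_step` is chosen for the example.

```python
def gradient_step(theta, gradient, learning_rate):
    """One gradient descent update: new theta = old theta - alpha * dJ/dtheta."""
    return theta - learning_rate * gradient


# Iteration 0 from the table: theta = 1.00, dJ/dtheta = -165.30, alpha = 0.01.
new_theta = gradient_step(1.00, -165.30, 0.01)
print(round(new_theta, 2))  # 2.65, the theta shown for iteration 1
```

Because the gradient is negative, subtracting it moves θ to the right, toward the bottom of the cost curve.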
Linear Regression: Minimize Cost (Gradient Descent)
4. Iterate until Convergence
Iteration   θ      J(θ)        dJ/dθ
0           1.00   5,488,884   -165.30
1           2.65     581,450    -52.98
2           3.18      77,351    -16.98
3           3.35      25,569     -5.44
4           3.41      20,250     -1.74
5           3.42      19,704     -0.56
6           3.43
[Figure: Cost Function vs. Theta]
Final θ: 3.43
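The four steps can be put together in a short loop. The sketch below is an assumption-laden illustration, not the presenters' code: it fits a single parameter θ in a hypothetical model h(x) = θ·x, uses made-up data, and stops when θ changes by less than a small tolerance.

```python
def gradient(theta, xs, ys):
    """dJ/dtheta for J = (1/2m) * sum((theta * x - y)^2)."""
    m = len(xs)
    return sum((theta * x - y) * x for x, y in zip(xs, ys)) / m


def gradient_descent(xs, ys, theta=1.0, learning_rate=0.01,
                     tolerance=1e-9, max_iterations=10_000):
    """Repeat the gradient and update steps until theta stops changing."""
    for _ in range(max_iterations):
        new_theta = theta - learning_rate * gradient(theta, xs, ys)
        if abs(new_theta - theta) < tolerance:
            return new_theta  # converged
        theta = new_theta
    return theta  # iteration budget exhausted


# Hypothetical data lying exactly on y = 3x, so theta should end up close to 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]
theta_hat = gradient_descent(xs, ys)
print(round(theta_hat, 4))
```

As in the table, the steps shrink as the gradient approaches zero, so most of the movement happens in the first few iterations.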
Cost Function – One Parameter
[Figure: Cost Function vs. Theta, for Theta from 3 to 4]
Cost Function – Two Parameters
Linear Regression: GD vs. Normal Equations
[Figure: Human Height vs. Weight scatter plot with fitted line y = 3.4327x - 106.03; x-axis Height (Inches), y-axis Weight (Pounds)]
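For a single predictor, the least-squares line can also be computed directly rather than iteratively, which is what the normal equations provide. The sketch below uses the standard closed-form slope and intercept; the function name and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data (heights in inches, weights in pounds), for illustration only.
heights = [64.0, 68.0, 72.0]
weights = [120.0, 140.0, 160.0]
intercept, slope = fit_line(heights, weights)
print(intercept, slope)
```

Both approaches minimize the same cost, so on a well-behaved problem gradient descent should approach the same line that the direct calculation gives.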
Why discuss Gradient Descent at all?
• Basic fitting algorithm for Machine Learning
• Many other Systems/Models use Gradient Descent
Andrew Ng: If you understand gradient descent and can implement it, you can use optimized software to solve problems, and are ahead of many of the people working on this stuff in this field.
Neural Networks
[Figure: neural network diagram with four layers, Layer 1 through Layer 4]
Selecting Model Structure (The Right Machine for the Job)
Bias/variance: How would you fit this model?
[Figure: Price vs. Size scatter plot]
Bias vs. Variance
[Figure: three Price vs. Size fits]
• High bias (underfit)
• High variance (overfit)
• “Just right”
Cross Validation
[Figure: DATA feeding the Model]
DATA is split into three sets:
• Training: Model Fit
• Validation: Model Structure
• Testing (Holdout): Final Model Testing
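One simple way to produce the three sets is to shuffle the data once and cut it at fixed points. This is a sketch only: the 60/20/20 proportions, the function name and the fixed seed are assumptions for the example, not something stated on the slide.

```python
import random


def split_data(rows, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle rows and cut them into training, validation and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    training = shuffled[:n_train]                           # used to fit the model
    validation = shuffled[n_train:n_train + n_validation]   # used to choose model structure
    testing = shuffled[n_train + n_validation:]             # holdout, used once at the end
    return training, validation, testing


training, validation, testing = split_data(range(100))
print(len(training), len(validation), len(testing))
```

Keeping the testing set untouched until the final model is chosen is what makes its error an honest estimate of performance on new data.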
33
Regularization
33
High variance(overfit)
Price
Size
Machine Learning in Practice: Cluster Analysis
Stephen Segroves
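Supervised Learning
Training set: $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$ (labeled examples)
Unsupervised Learning
Training set: $\{x^{(1)}, \dots, x^{(m)}\}$ (no labels)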
Applications of Clustering
Market Segmentation / Customer Profiling
Territory Grouping
Social Network Analysis
Clustering: K-Means Algorithm
Randomly initialize $K$ cluster centroids $\mu_1, \dots, \mu_K$
Repeat {
    for $i$ = 1 to $m$
        $c^{(i)}$ := index (from 1 to $K$) of cluster centroid closest to $x^{(i)}$
    for $k$ = 1 to $K$
        $\mu_k$ := average (mean) of points assigned to cluster $k$
}
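The two alternating steps above can be sketched in plain Python. The function names, the choice to initialize centroids from randomly sampled data points, the stopping rule and the toy points are all assumptions made for this illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iterations=100, seed=0):
    """Return (assignments, centroids) after alternating the two K-means steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization from the data
    assignments = None
    for _ in range(max_iterations):
        # Cluster assignment step: index of the closest centroid for each point.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the centroids will not move
        assignments = new_assignments
        # Move centroid step: mean of the points assigned to each cluster.
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
assignments, centroids = k_means(points, k=2)
print(assignments)
```

Which group receives label 0 depends on the random initialization; only the grouping itself is meaningful.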
Clustering: K-Means Algorithm
Potential Issues: Local Optima
Potential Solution: Local Optima
For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get $c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K$.
    Compute cost function (distortion) $J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K)$
}
Pick clustering that gave lowest cost
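The restart loop can be sketched as follows. It is a self-contained illustration with a compact K-means of its own; the function names, the fixed number of inner iterations, the seed and the toy points are assumptions for the example. Distortion is taken here as the average squared distance from each point to its assigned centroid.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def run_k_means(points, k, rng, iterations=50):
    """Compact K-means for one random start; returns (assignments, centroids)."""
    centroids = rng.sample(points, k)
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:
                centroids[j] = tuple(sum(d) / len(members) for d in zip(*members))
    return assignments, centroids


def distortion(points, assignments, centroids):
    """Average squared distance from each point to its assigned centroid."""
    total = sum(squared_distance(p, centroids[c]) for p, c in zip(points, assignments))
    return total / len(points)


def best_of_restarts(points, k, restarts=100, seed=0):
    """Run K-means from many random starts and keep the lowest-cost result."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        assignments, centroids = run_k_means(points, k, rng)
        cost = distortion(points, assignments, centroids)
        if best is None or cost < best[0]:
            best = (cost, assignments, centroids)
    return best


# Hypothetical 2-D points forming three well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
best_cost, best_assignments, best_centroids = best_of_restarts(points, k=3)
print(best_cost, best_assignments)
```

A single unlucky start can place two centroids in the same group and leave two groups merged; keeping the lowest-distortion run across restarts makes that outcome unlikely to be the one reported.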
Other Learnings from the Machine Learning Course
Laura Johnson
This course was a great way to learn – WHY?
• Structure and foundation given
  – 58,000 students across the world across multiple disciplines
    • Well laid out web site
    • Discussion forums, wikis, etc.
  – Basic building blocks provided
• Technical enhancements to recorded sessions
  – Notes – color!!
  – Captions / transcript
  – Speed control
  – “interactive” feedback
Coursera Look and Feel - Structure
Coursera – Teaching using Building Blocks
Coursera Technical Enhancements – Notes, Captions, Speed
Coursera Technical Enhancements – Notes in Color!!
Coursera Technical Enhancements - Feedback
Machine Learning MOOC Recommendations
• Time
  – Only take one MOOC at a time!
  – Do the homework on time
• Software Required
  – Google Chrome, Firefox, IE9
  – Octave (free Matlab alternative)
  – Text Editor (UltraEdit, SublimeText, TextWrangler)
• Suggested Prerequisites
  – Linear Algebra
  – Some Programming Experience a Plus
• Team up!
• Final comments on Machine Learning:
  – Data: GIGO
  – Half science / half art
Questions?