Download - Just the basics_strata_2013
Photo by mikebaird, www.flickr.com/photos/mikebaird
Just the Basics: Core Data Science Skills
William Cukierski, [email protected]
Ben [email protected]
JUST the basics
We mean the basics!
– Ask dumb questions (we’ll give dumb answers)
– We can’t be comprehensive, but we can omit pretense and jargon
– Expect a little Python, R, Matlab, Excel, command line, hand-waving
Pronounced Kah-gull (as in waggle), not Kegel (as in bagel)
Before we get started
You’ll need a Kaggle account: www.kaggle.com/account/register
Create a team for the competition: www.kaggle.com/c/just-the-basics-strata-2013
Add (Strata) to the end of your team name, e.g. William Cukierski (Strata)
Agenda: Preliminaries, Identifying a Problem, Performing the Analysis, Visualizing the Solution, Contest
Will background
Physics & Biomedical Engineering
– Studied machine learning for diagnosis of pathology images
– Constantly reinventing sophomore-level CS concepts
Former “successful” machine learning competitor
– Successful?
  • Finished near top?
  • Got me a job?
  • Fooled people into believing I understand stats (a.k.a. “data scientist”)
Ben background
Biomedical Engineering & Electrical Engineering
– Applied machine learning to improve brain-computer interface
– Software development in various languages / domains
Machine learning competitions
– Top finishes in many 2010-2011
– Teamed up with Will on several
– Switched to the dark side, spent much of the past year designing competitions at Kaggle
Driving a Brain-Controlled Wheelchair
The unfortunate hype of modern analytics
• BIG DATA!
• Every second 6.2 trillion exabytes of data are being collected
• Need shared vocabulary, shared protocols
• Need to leverage
  – weather reports
  – surveys
  – text documents
  – human genomes
  – regulatory information
  – cell phone logs
  – satellite surveillance
  – etc., etc., etc.
What do we do about it?
• Create committees, consortiums, taxonomies, platforms, frameworks, clouds
• Create acronyms for our committees, consortiums, taxonomies, platforms, frameworks, clouds
• Go to conferences to promote and learn about our acronym’d things
• And if time permits and the mood strikes? Work
I’m ready to leave now
Big Data Barry
Lives by the Shirky Principle: preserving the problem to which he is the solution
Favorite talking points: data provenance, data warehousing, data privacy, data regulations, data silos, need for standards, need for standards on standards of standards, lack of data correctness, need for communication
Source: http://mojette.deviantart.com/
Listen, I’ve been in this field for 22 years. The Bayesian guys in the modeling group are never gonna talk to the IT guys because they don’t speak the same language. In my 22 years of experience, what we need are tighter standards around what the processes should be for requesting data, how that data should be stored, and who should have access to the data. Also privacy. Privacy is a thing about which I have no clue, but nonetheless I’m compelled to steamroll even the most benign use of our data for anything beyond occupying a database. Oh, and speaking of databases and my 22 years of experience, we need stricter governance about the schemas and policies that inform the ways the data gets federated, so the model guys will stop trying to implement things that’ll never work.…
Seriously, guys, let me out
The plight of the data scientist
Job description:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
Job reality:
Data Scientist (n.): Person who is worse at statistics than any statistician and worse at software engineering than any software engineer.
This problem can only be solved by an 8th-order kernel projection onto an orthonormal space of homoscedastic eigentensors
The boss is going to have my neck if I can’t get this Hadoop iPhone app ready in time for Strata
I’m making an Excel VBA script to access our Oracle database and find the mean of the revenue column
Data science (noun): Statistics done wrong
Data science
The application of scientific experimentation (hypothesis testing, model generation, statistical analysis) in problem-agnostic ways.
Not data science
{infographics, apps, site architecture, sending JSON thingies around, Javascript frameworks, web analytics, plotting tweets on maps, cloud storage, domains that end in .io, any idea/thing/product that touches data}
Agenda: Preliminaries, Identifying a Problem, Performing the Analysis, Visualizing the Solution, Contest
Figure: gain versus sophistication, from most to least sophisticated
Analytics
– Optimization: What’s the best that can happen?
– Predictive modeling: What will happen next?
– Forecasting/extrapolation: What if these trends continue?
– Statistical analysis: Why is this happening?
Access and reporting
– Alerts: What actions are needed?
– Query/drill down: What exactly is the problem?
– Ad hoc reports: How many, how often, where?
– Standard reports: What happened?
Source: Competing on Analytics, Davenport/Harris, 2007
When to use data
Asking specific questions is mostly harmless
– How many users bought shampoo X at store Y last quarter?
Prediction is not a free lunch
– Being data-driven and wrong is easy and bad
– Fancy models should serve fancy questions
  • Don’t forecast something that can be measured
Human knowledge precedes machine knowledge
– Sometimes black boxes work
– Often, they don’t: earthquakes, finance models, etc.
Human experts are good at generalization
Human experts are bad at
– Accurate predictions
– Estimating the uncertainty of their predictions
– Making the same prediction under the same evidence
– Updating predictions in the face of new evidence
– Ignoring unrelated evidence
http://www.nytimes.com/interactive/science/rock-paper-scissors.html
We need to teach the computer to generalize
    laptop:~ wcuk$ RUN IT’S A BEAR
    -bash: BEAR: threat not found
…without overfitting
    laptop:~ wcuk$ RUN IT’S A BEAR
    run: Must specify one of –black –grizzly –teddy
    laptop:~ wcuk$ RUN IT’S A BEAR -grizzly
    run: Are you sure you want to run? (y/n) y
    run: Enter the bear’s name: Rupert
    run: Is it Rupert with the scar on his ear? He’s cool. He’s more of a salmon kind of bear. (y/n): n
    run:...RUN!!!!!!!
“If you wish to make an apple pie from scratch, you must first invent the universe.” – Carl Sagan
Storing data
Binary, Text, Database
Reading data into a useful format
We overcomplicate storage and formats
– Databases are quite often a bad choice
– Most data science is a batch process on tabular data
– Your debugging cycle should be fast
Why text?
– Simple
– Universal
– Fast (to read/write/debug)
– Transparent
Most data is not useful for scientific experimentation
– Too “macro” (lacking causal detail)
– Meant for human consumption
Structured data is not always machine ready

Game 1
    Seat 1: Solracca ($95.30 in chips)
    Seat 2: BrickT63 ($127.10 in chips)
    Seat 3: sven160482 ($184.30 in chips)
    Seat 4: Adelantez ($103 in chips)
    Seat 6: manfred zeal ($155.50 in chips)
    Solracca: posts small blind $0.50
    BrickT63: posts big blind $1
    *** HOLE CARDS ***
    sven160482: raises $1 to $2
    Adelantez: raises $5.50 to $7.50
    manfred zeal: folds
    Solracca: folds
    BrickT63: folds
    sven160482: folds
    Uncalled bet ($5.50) returned to Adelantez
    Adelantez collected $5.50 from pot
    *** SUMMARY ***
    Total pot $5.50 | Rake $0
    Seat 4: Adelantez collected ($5.50)

Game 2
    Seat 1: Kingcovey ($108.65 in chips)
    Seat 3: VoronIN_exe ($119.80 in chips)
    Seat 4: ehle123 ($104 in chips)
    Seat 5: MercuriusAA ($107.60 in chips)
    Seat 6: budapestkin ($133.15 in chips)
    budapestkin: posts small blind $0.50
    Kingcovey: posts big blind $1
    *** HOLE CARDS ***
    VoronIN_exe: raises $2 to $3
    ehle123: folds
    MercuriusAA: folds
    budapestkin: calls $2.50
    Kingcovey: folds
    *** FLOP *** [7c Tc Ks]
    budapestkin: checks
    VoronIN_exe: bets $4.45
    budapestkin: calls $4.45
    *** TURN *** [7c Tc Ks] [8c]
    budapestkin: checks
    VoronIN_exe: checks
    *** RIVER *** [7c Tc Ks 8c] [Kc]
    budapestkin: bets $11
    VoronIN_exe: folds
    Uncalled bet ($11) returned to budapestkin
    budapestkin collected $15.15 from pot
    *** SUMMARY ***
    Total pot $15.90 | Rake $0.75
    Seat 6: budapestkin collected ($15.15)
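To make the "not machine ready" point concrete, here is a minimal sketch of pulling one tabular slice (the seat lines) out of a hand history like the two games above. It assumes each event sits on its own line and that seat lines always follow the `Seat N: name ($X in chips)` pattern seen in the sample. The function name `parse_seats` and the output columns are illustrative, not something from the talk.

```python
import re

# Assumed pattern, based only on the sample lines, e.g.
# "Seat 1: Solracca ($95.30 in chips)"
SEAT_RE = re.compile(r"Seat (\d+): (.+?) \(\$([\d.]+) in chips\)")

def parse_seats(hand_text):
    """Return one (seat, player, chips) tuple per seat line in a hand history."""
    rows = []
    for line in hand_text.splitlines():
        match = SEAT_RE.match(line.strip())
        if match:
            seat, player, chips = match.groups()
            rows.append((int(seat), player, float(chips)))
    return rows

sample = """Seat 1: Solracca ($95.30 in chips)
Seat 4: Adelantez ($103 in chips)
Seat 6: manfred zeal ($155.50 in chips)
Solracca: posts small blind $0.50
Seat 4: Adelantez collected ($5.50)"""

for row in parse_seats(sample):
    print(row)
```

Note that the summary line `Seat 4: Adelantez collected ($5.50)` also starts with "Seat" but is skipped, because it lacks the "in chips" suffix. Small traps like this are why hand-rolled parsers tend to be brittle.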
A word of caution on scraping
• Scraping is time intensive, unleveraged, brittle
• Before you code, research existing libraries!
  – Will solve 95% of the problems you don’t even know you will have
  – E.g. web scraping using python’s BeautifulSoup
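    import re
    import urllib.request
    from bs4 import BeautifulSoup

    # uniqify() and getStuff() are helpers not shown on the slide; they are
    # assumed to de-duplicate a list and to download a URL to a local file.
    page = urllib.request.urlopen("http://www.kaggle.com/competitions")
    soup = BeautifulSoup(page.read(), "html.parser")
    allLinks = uniqify(soup.find_all('a'))
    for link in allLinks:
        href = link.get('href')
        # Competition pages live under /c/; skip anchors that have no href
        if href and re.search('^/c/.*', href):
            # "/c/some-name" -> "_c_some-name.zip" -> "some-name.zip"
            fileName = (href.replace('/', '_') + ".zip")[3:]
            getStuff(fileName, "http://www.kaggle.com" + href + "/publicleaderboarddata.zip")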
Excel
Excel has a time and place
– Looking at data
– Pivot tables
– Quick plots to verify things
Never:
– Pass spreadsheets around
– “Code” in Excel
– Create workflows that require copy/pasting data around
Agenda: Preliminaries, Identifying a Problem, Performing the Analysis, Visualizing the Solution, Contest
Command line
Glossary
features = attributes = independent variables
targets = gold standard = ground truth = dependent variable(s)
training set = data & targets used to train a model
validation set = data & targets used as feedback in model training
test set = separate data & targets used only to evaluate the model
cross validation = partitioning the training set to estimate how well a model will generalize
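The glossary terms can be sketched in a few lines of plain Python. This is only one way to do it: the 60/20/20 proportions, the fixed seed, and the helper names `train_validation_test_split` and `k_fold_indices` are assumptions made for illustration.

```python
import random

def train_validation_test_split(rows, validation_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and cut them into training, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_validation = int(len(shuffled) * validation_frac)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test

def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, held_out_indices) pairs for k-fold cross validation."""
    indices = list(range(n_rows))
    for fold in range(k):
        held_out = indices[fold::k]  # every k-th row, starting at `fold`
        held_out_set = set(held_out)
        train = [i for i in indices if i not in held_out_set]
        yield train, held_out

rows = list(range(100))
train, validation, test = train_validation_test_split(rows)
print(len(train), len(validation), len(test))

for train_idx, held_out_idx in k_fold_indices(10, k=5):
    print("held out:", held_out_idx)
```

In cross validation, each row is held out exactly once, so averaging the score over the folds gives an estimate of generalization without touching the test set.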
Train
Test
Read, Feature Extraction, Learn
Generalize
Bayes theorem
How to update beliefs in the face of evidence?
For proposition A and evidence B:
– P(A) = prior (belief in A)
– P(B) = evidence
– P(A | B) = posterior (belief in A given B)
– P(B | A) = likelihood

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

$$P(\text{female} \mid \text{long hair}) = \frac{P(\text{long hair} \mid \text{female})\,P(\text{female})}{P(\text{long hair})}$$
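A small worked version of the long-hair example, as a sketch. The three input probabilities are made-up numbers chosen only to show the arithmetic; they are not from the talk or from any data.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical numbers, for illustration only
p_female = 0.5              # prior: P(female)
p_long_given_female = 0.6   # likelihood: P(long hair | female)
p_long_given_male = 0.1     # P(long hair | not female)

# The evidence term P(long hair) comes from the law of total probability
p_long = p_long_given_female * p_female + p_long_given_male * (1 - p_female)

p_female_given_long = posterior(p_long_given_female, p_female, p_long)
print(round(p_female_given_long, 3))
```

With these assumed inputs, observing long hair moves the belief in "female" from the prior of 0.5 to roughly 0.86.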
R
MATLAB
Agenda: Preliminaries, Identifying a Problem, Performing the Analysis, Visualizing the Solution, Contest
Visualization
Speak the language of your audience
– Use simple plots
– Use units that matter (dollars, time, widgets)
– Include the units!
– Don’t use acronyms!
Most visualization should be internal facing (am I doing this right?) and not external facing (hey check this out!)
• Plotting raw features
• Looking for outliers, anomalies, correlation
• Verifying feature selection or dimensionality reduction
• Looking at manifold density
• Looking at class separation
• Babysitting model performance
• Looking for optima
• Watching for sensitivity to initial conditions, perturbations
• Summarizing
• Checking the result is reasonable
• Comparisons to the alternative
Your job is to solve a problem
– Sell the message, not the graphic
Avoid chartjunk
“The purpose of decoration varies — to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.” –Edward Tufte
source: http://i.dailymail.co.uk/i/pix/2012/03/21/article-2118152-124602BE000005DC-0_964x528.jpg
source: http://www.fivethirtyeight.com/2009/10/older-and-wealthier-people-are-more.html
Election fraud: 2D histograms of the number of units for a given voter turnout (x axis) and the percentage of votes (y axis) for the winning party
source: http://www.pnas.org/content/early/2012/09/20/1210722109.abstract
ggplot2
Agenda: Preliminaries, Identifying a Problem, Performing the Analysis, Visualizing the Solution, Contest
Make a spam detector
The data represents a corpus of emails. Some are spam and some are normal.
• Due to time constraints, feature extraction is done for you:
  – train.csv - contains 600 emails x 100 features
  – train_labels.csv – contains the 600 training labels (1 = spam, 0 = normal)
  – test.csv - contains 4000 emails x 100 features
• Submit a file with each of the 4000 predictions on a separate line (in the same order as test.csv).
  – No header is necessary
  – Predictions can be continuous numbers or 0/1 labels
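The submission format described above (one prediction per line, no header, same order as test.csv) can be sketched as follows. The helper name `write_submission`, the file name, and the constant 0.5 placeholder predictions are assumptions for illustration; a real entry would replace the placeholder with model output.

```python
import os
import tempfile

EXPECTED_ROWS = 4000  # number of emails in test.csv, per the task description

def write_submission(predictions, path, expected_rows=EXPECTED_ROWS):
    """Write one prediction per line, in order, with no header row."""
    if len(predictions) != expected_rows:
        raise ValueError(
            "expected %d predictions, got %d" % (expected_rows, len(predictions))
        )
    with open(path, "w") as handle:
        for value in predictions:
            handle.write("%s\n" % value)

# Placeholder predictions: the same score for every email (not a real model)
predictions = [0.5] * EXPECTED_ROWS

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "submission.csv")
    write_submission(predictions, path)
    with open(path) as handle:
        print(sum(1 for _ in handle), "lines written")
```

Checking the row count before writing catches the most common submission mistake, a file whose length does not match the test set.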
How the leaderboard works

Full data set, split into training and test rows:

    Return%  ProductID  Dept       Price   MFR     Set
    1.94     54323      Household  54.95   USA     Training
    0.023    92356      Household  9.95    USA     Training
    0.8      78023      Computer   4.5     China   Training
    0.01     12340      Audio      109.99  China   Training
    0.41     31240      Audio      29.99   Taiwan  Training
    0.97     12351      Hardware   54.95   Mexico  Training
    0.0115   90141      Hardware   4.99    USA     Training
    0.4      81240      Hardware   6.55    Taiwan  Training
    0.03     14896      Computer   211.99  Korea   Training
    0.205    62132      Computer   1100    USA     Training
    1.6878   54323      Audio      34.99   USA     Training
    0.0345   92356      Audio      7.99    USA     Training
    0.64     78023      Household  229.9   Brazil  Training
    0.72     12340      Audio      19.95   Mexico  Test
    0.41     31240      Computer   6.99    Taiwan  Test
    1.94     54323      Hardware   11.99   Taiwan  Test
    0.023    92356      Household  2.05    USA     Test
    0.08     78023      Computer   99.99   USA     Test
    2.09     12340      Computer   129.99  China   Test
    1.1      31240      Audio      18.99   China   Test

Test rows as competitors see them. The hidden Return% values are the solution (“ground truth”):

    Return%  ProductID  Dept       Price   MFR
    ?        12340      Audio      19.95   Mexico
    ?        31240      Computer   6.99    Taiwan
    ?        54323      Hardware   11.99   Taiwan
    ?        92356      Household  2.05    USA
    ?        78023      Computer   99.99   USA
    ?        12340      Computer   129.99  China
    ?        31240      Audio      18.99   China

Submission (predicted Return% for the test rows):

    Return%  ProductID  Dept       Price   MFR
    0.03     12340      Audio      19.95   Mexico
    1.298    31240      Computer   6.99    Taiwan
    0.94     54323      Hardware   11.99   Taiwan
    0.04     92356      Household  2.05    USA
    0.36     78023      Computer   99.99   USA
    1.2      12340      Computer   129.99  China
    0.02     31240      Audio      18.99   China

Public Leaderboard | Private Leaderboard
Area under the receiver-operating characteristic curve
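One way to compute this metric, sketched in plain Python: the area under the ROC curve equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy labels and scores below are illustrative assumptions, and this is not necessarily how the leaderboard computes it.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative counts as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data: 1 = spam, 0 = normal
labels = [1, 0, 1, 0, 0]
continuous = [0.9, 0.2, 0.6, 0.55, 0.1]
binary = [1 if s >= 0.5 else 0 for s in continuous]

print(roc_auc(labels, continuous))
print(roc_auc(labels, binary))
```

In this toy case the continuous scores rank every spam email above every normal one, while thresholding at 0.5 creates ties and lowers the score. Because AUC only looks at ranking, continuous predictions keep information that 0/1 labels throw away.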
Example Model
Think about
• Missing values
• Noise
• Combinations of features
• Transformations of features (e.g. log)
• Combinations of methods
• Overfitting
• Binary vs. continuous predictions
• How good is a good spam detector?