![Page 1: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/1.jpg)
Photo by mikebaird, www.flickr.com/photos/mikebaird
Just the Basics: Core Data Science Skills
William Cukierski, [email protected]
Ben [email protected]
![Page 2: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/2.jpg)
JUST the basics
We mean the basics
– Ask dumb questions (we’ll give dumb answers)
– We can’t be comprehensive, but we can omit pretense and jargon
– Expect a little Python, R, Matlab, Excel, command line, hand-waving
![Page 3: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/3.jpg)
Pronounced Kah-gull (as in waggle), not Kegel (as in bagel)
![Page 4: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/4.jpg)
Before we get started
You’ll need a Kaggle account
www.kaggle.com/account/register

Create a team for the competition
www.kaggle.com/c/just-the-basics-strata-2013
Add (Strata) to the end of your team name
e.g. William Cukierski (Strata)
![Page 5: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/5.jpg)
Agenda:
– Preliminaries
– Identifying a Problem
– Performing the analysis
– Visualizing the Solution
– Contest
![Page 6: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/6.jpg)
Will background
Physics & Biomedical Engineering
– Studied machine learning for diagnosis of pathology images
– Constantly reinventing sophomore-level CS concepts
Former “successful” machine learning competitor
– Successful?
  • Finished near top?
  • Got me a job?
  • Fooled people into believing I understand stats (a.k.a. “data scientist”)
![Page 7: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/7.jpg)
Ben Background
Biomedical Engineering & Electrical Engineering
– Applied machine learning to improve brain-computer interface
– Software development in various languages / domains
Machine learning competitions
– Top finishes in many 2010-2011
– Teamed up with Will on several
– Switched to the dark side, spent much of the past year designing competitions at Kaggle
Driving a Brain-Controlled Wheelchair
![Page 8: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/8.jpg)
The unfortunate hype of modern analytics
• BIG DATA!
• Every second 6.2 trillion exabytes of data are being collected
• Need shared vocabulary, shared protocols
• Need to leverage
  – weather reports
  – surveys
  – text documents
  – human genomes
  – regulatory information
  – cell phone logs
  – satellite surveillance
  – etc.
  – etc.
  – etc.
![Page 9: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/9.jpg)
What do we do about it?
• Create committees, consortiums, taxonomies, platforms, frameworks, clouds
• Create acronyms for our committees, consortiums, taxonomies, platforms, frameworks, clouds
• Go to conferences to promote and learn about our acronym’d things
• And if time permits and the mood strikes? Work
![Page 10: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/10.jpg)
![Page 11: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/11.jpg)
I’m ready to leave now
![Page 12: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/12.jpg)
Big Data Barry
Lives by the Shirky Principle: preserving the problem to which he is the solution
Favorite talking points:
Data provenance, data warehousing, data privacy, data regulations, data silos, need for standards, need for standards on standards of standards, lack of data correctness, need for communication
Source: http://mojette.deviantart.com/
![Page 13: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/13.jpg)
Listen, I’ve been in this field for 22 years. The Bayesian guys in the modeling group are never gonna talk to the IT guys because they don’t speak the same language. In my 22 years of experience, what we need are tighter standards around what the processes should be for requesting data, how that data should be stored, and who should have access to the data. Also privacy. Privacy is a thing about which I have no clue, but nonetheless I’m compelled to steamroll even the most benign use of our data for anything beyond occupying a database. Oh, and speaking of databases and my 22 years of experience, we need stricter governance about the schemas and policies that inform the ways the data gets federated, so the model guys will stop trying to implement things that’ll never work…
![Page 14: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/14.jpg)
Seriously, guys, let me out
![Page 15: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/15.jpg)
The plight of the data scientist

Job description:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

Job reality:
Data Scientist (n.): Person who is worse at statistics than any statistician and worse at software engineering than any software engineer.
![Page 16: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/16.jpg)
![Page 17: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/17.jpg)
This problem can only be solved by an 8th-order kernel projection onto an orthonormal space of homoscedastic eigentensors

The boss is going to have my neck if I can’t get this Hadoop iPhone app ready in time for Strata

I’m making an Excel VBA script to access our Oracle database and find the mean of the revenue column

Data science (noun): Statistics done wrong
![Page 18: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/18.jpg)
Data science
The application of scientific experimentation (hypothesis testing, model generation, statistical analysis) in problem-agnostic ways.

Not data science
{infographics, apps, site architecture, sending JSON thingies around, Javascript frameworks, web analytics, plotting tweets on maps, cloud storage, domains that end in .io, any idea/thing/product that touches data}
![Page 19: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/19.jpg)
![Page 20: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/20.jpg)
Analytics
– Optimization: What’s the best that can happen?
– Predictive modeling: What will happen next?
– Forecasting/extrapolation: What if these trends continue?
– Statistical analysis: Why is this happening?
Access and reporting
– Alerts: What actions are needed?
– Query/drill down: What exactly is the problem?
– Ad hoc reports: How many, how often, where?
– Standard reports: What happened?
Axes: Gain, Sophistication
Source: Competing on Analytics, Davenport/Harris, 2007
![Page 21: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/21.jpg)
When to use data
Asking specific questions is mostly harmless
– How many users bought shampoo X at store Y last quarter?
Prediction is not a free lunch
– Being data-driven and wrong is easy and bad
– Fancy models should serve fancy questions
  • Don’t forecast something that can be measured
Human knowledge precedes machine knowledge
– Sometimes black boxes work
– Often, they don’t: earthquakes, finance models, etc.
![Page 22: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/22.jpg)
When to use data
Human experts are good at generalization

Human experts are bad at
– Accurate predictions
– Estimating the uncertainty of their predictions
– Making the same prediction under the same evidence
– Updating predictions in the face of new evidence
– Ignoring unrelated evidence
![Page 23: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/23.jpg)
http://www.nytimes.com/interactive/science/rock-paper-scissors.html
![Page 24: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/24.jpg)
We need to teach the computer to generalize
laptop:~ wcuk$ RUN IT’S A BEAR
-bash: BEAR: threat not found
![Page 25: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/25.jpg)
…without overfitting
laptop:~ wcuk$ RUN IT’S A BEAR
run: Must specify one of –black –grizzly –teddy
laptop:~ wcuk$ RUN IT’S A BEAR -grizzly
run: Are you sure you want to run? (y/n) y
run: Enter the bear’s name: Rupert
run: Is it Rupert with the scar on his ear? He’s cool. He’s more of a salmon kind of bear. (y/n): n
run:...RUN!!!!!!!
![Page 26: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/26.jpg)
“If you wish to make an apple pie from scratch, you must first invent the universe.” – Carl Sagan
Storing data
– Binary
– Text
– Database
![Page 27: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/27.jpg)
Reading data into a useful format
We overcomplicate storage and formats
– Databases are quite often a bad choice
– Most data science is a batch process on tabular data
– Your debugging cycle should be fast
Why text?
– Simple
– Universal
– Fast (to read/write/debug)
– Transparent
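To make the "why text" point concrete, here is a minimal sketch of batch-processing tabular text with Python's standard `csv` module. The column names and values are made up for illustration; a real file on disk would be opened with `open(path, newline="")` instead of the in-memory stand-in used here.

```python
import csv
import io

# Hypothetical stand-in for a small text file; columns and values are invented.
SAMPLE = "user_id,store,amount\n1,A,9.95\n2,B,4.50\n3,A,20.00\n"

def read_table(handle):
    """Read delimited text into a list of dicts, one per row."""
    return list(csv.DictReader(handle))

rows = read_table(io.StringIO(SAMPLE))

# A typical batch step: total the amount column per store.
total_by_store = {}
for row in rows:
    store = row["store"]
    total_by_store[store] = total_by_store.get(store, 0.0) + float(row["amount"])

print(total_by_store)
```

Because the input is plain text, the same data can be inspected with any editor or command-line tool when something looks wrong.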
![Page 28: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/28.jpg)
Most data is not useful for scientific experimentation
– Too “macro” (lacking causal detail)
– Meant for human consumption
![Page 29: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/29.jpg)
Structured data is not always machine ready

Game 1
Seat 1: Solracca ($95.30 in chips)
Seat 2: BrickT63 ($127.10 in chips)
Seat 3: sven160482 ($184.30 in chips)
Seat 4: Adelantez ($103 in chips)
Seat 6: manfred zeal ($155.50 in chips)
Solracca: posts small blind $0.50
BrickT63: posts big blind $1
*** HOLE CARDS ***
sven160482: raises $1 to $2
Adelantez: raises $5.50 to $7.50
manfred zeal: folds
Solracca: folds
BrickT63: folds
sven160482: folds
Uncalled bet ($5.50) returned to Adelantez
Adelantez collected $5.50 from pot
*** SUMMARY ***
Total pot $5.50 | Rake $0
Seat 4: Adelantez collected ($5.50)

Game 2
Seat 1: Kingcovey ($108.65 in chips)
Seat 3: VoronIN_exe ($119.80 in chips)
Seat 4: ehle123 ($104 in chips)
Seat 5: MercuriusAA ($107.60 in chips)
Seat 6: budapestkin ($133.15 in chips)
budapestkin: posts small blind $0.50
Kingcovey: posts big blind $1
*** HOLE CARDS ***
VoronIN_exe: raises $2 to $3
ehle123: folds
MercuriusAA: folds
budapestkin: calls $2.50
Kingcovey: folds
*** FLOP *** [7c Tc Ks]
budapestkin: checks
VoronIN_exe: bets $4.45
budapestkin: calls $4.45
*** TURN *** [7c Tc Ks] [8c]
budapestkin: checks
VoronIN_exe: checks
*** RIVER *** [7c Tc Ks 8c] [Kc]
budapestkin: bets $11
VoronIN_exe: folds
Uncalled bet ($11) returned to budapestkin
budapestkin collected $15.15 from pot
*** SUMMARY ***
Total pot $15.90 | Rake $0.75
Seat 6: budapestkin collected ($15.15)
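A hand history like this is structured for people, not for a model. The following is a rough sketch, not the presenters' code, of turning a few of the Game 1 lines into records with regular expressions. The patterns and field names are assumptions that cover only the line types in the sample string.

```python
import re

# A few lines copied from the Game 1 hand history on the slide.
HAND = """Seat 1: Solracca ($95.30 in chips)
Seat 2: BrickT63 ($127.10 in chips)
Seat 4: Adelantez ($103 in chips)
Seat 6: manfred zeal ($155.50 in chips)
Solracca: posts small blind $0.50
sven160482: raises $1 to $2
Adelantez: raises $5.50 to $7.50
manfred zeal: folds
"""

# Assumed patterns; a real parser would need more cases (flop, summary, etc.).
SEAT_RE = re.compile(r"^Seat (\d+): (.+) \(\$([\d.]+) in chips\)$")
ACTION_RE = re.compile(
    r"^(.+?): (folds|checks|calls|bets|raises|posts small blind|posts big blind)\b(.*)$"
)

def parse_hand(text):
    """Return (seats, actions) as lists of dicts, one dict per matching line."""
    seats, actions = [], []
    for line in text.splitlines():
        seat = SEAT_RE.match(line)
        if seat:
            seats.append({"seat": int(seat.group(1)),
                          "player": seat.group(2),
                          "chips": float(seat.group(3))})
            continue
        action = ACTION_RE.match(line)
        if action:
            # Dollar amounts that follow the action verb, in order of appearance.
            amounts = [float(a) for a in re.findall(r"\$([\d.]+)", action.group(3))]
            actions.append({"player": action.group(1),
                            "action": action.group(2),
                            "amounts": amounts})
    return seats, actions

seats, actions = parse_hand(HAND)
```

Even this small parser has to cope with player names that contain spaces, which is the kind of detail that makes "structured" data not machine ready.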
![Page 30: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/30.jpg)
A word of caution on scraping
• Scraping is time intensive, unleveraged, brittle
• Before you code, research existing libraries!
  – Will solve 95% of the problems you don’t even know you will have
  – E.g. web scraping using Python’s BeautifulSoup

    import re
    from urllib.request import urlopen  # Python 3 replacement for urllib2
    from bs4 import BeautifulSoup

    # uniqify() and getStuff() are helper functions that are not shown on the slide.
    page = urlopen("http://www.kaggle.com/competitions")
    soup = BeautifulSoup(page.read(), "html.parser")
    allLinks = soup.find_all('a')
    allLinks = uniqify(allLinks)
    for link in allLinks:
        href = link.get('href') or ''  # some anchors have no href
        if re.search('^/c/.*', href):
            # "/c/name" -> "_c_name.zip" -> "name.zip"
            fileName = (href.replace('/', '_') + ".zip")[3:]
            getStuff(fileName, "http://www.kaggle.com" + href + "/publicleaderboarddata.zip")
![Page 31: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/31.jpg)
Excel has a time and place
– Looking at data
– Pivot tables
– Quick plots to verify things
Never:
– Pass spreadsheets around
– “Code” in Excel
– Create workflows that require copy/pasting data around
![Page 32: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/32.jpg)
Excel
![Page 33: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/33.jpg)
![Page 34: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/34.jpg)
Command line
![Page 35: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/35.jpg)
Glossary
features = attributes = independent variables
targets = gold standard = ground truth = dependent variable(s)
training set = data & targets used to train a model
validation set = data & targets used as feedback in model training
test set = separate data & targets used only to evaluate the model
cross validation = partitioning the training set to estimate how well a model will generalize
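The glossary terms can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a prescribed recipe: the 20% test fraction, the 5 folds, and the function names are all assumptions.

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out a test set used only for final evaluation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold(train_indices, k=5):
    """Yield (fit, validation) index lists; each training row is validated once."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation

train_idx, test_idx = split_indices(100)
validated = []
for fit_idx, val_idx in k_fold(train_idx, k=5):
    # A real run would fit a model on fit_idx and score it on val_idx here,
    # then average the k validation scores to estimate generalization.
    validated.extend(val_idx)
```

The test indices never enter the cross-validation loop, which is what keeps the final evaluation honest.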
![Page 36: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/36.jpg)
[Diagram: Train and Test pipelines. Read → Feature Extraction → Learn (Train) / Generalize (Test)]
![Page 37: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/37.jpg)
Bayes theorem
How to update beliefs in the face of evidence?
For proposition A and evidence B:
– P(A) = prior (belief in A)
– P(B) = evidence
– P(A | B) = posterior (belief in A given B)
– P(B | A) = likelihood

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

$$P(\text{female} \mid \text{long hair}) = \frac{P(\text{long hair} \mid \text{female})\,P(\text{female})}{P(\text{long hair})}$$
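A quick numeric pass through the long-hair example may help. The three input probabilities below are invented for illustration and do not come from the slides. The evidence term P(long hair) is expanded with the law of total probability over female and not female.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Bayes theorem for a yes/no proposition A and evidence B.

    prior                = P(A)
    likelihood           = P(B | A)
    likelihood_given_not = P(B | not A)
    """
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)  # P(B)
    return likelihood * prior / evidence

# Made-up numbers: P(female) = 0.5, P(long hair | female) = 0.75,
# P(long hair | male) = 0.15.
p_female_given_long_hair = posterior(0.5, 0.75, 0.15)
print(round(p_female_given_long_hair, 3))
```

With these assumed inputs, seeing long hair moves the belief in "female" from the 0.5 prior to roughly 0.83.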
![Page 38: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/38.jpg)
R
![Page 39: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/39.jpg)
MATLAB
![Page 40: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/40.jpg)
![Page 41: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/41.jpg)
![Page 42: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/42.jpg)
Visualization
Speak the language of your audience
– Use simple plots
– Use units that matter (dollars, time, widgets)
– Include the units!
– Don’t use acronyms!

Most visualization should be internal facing (am I doing this right?) and not external facing (hey check this out!)
![Page 43: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/43.jpg)
• Plotting raw features
• Looking for outliers, anomalies, correlation

• Verifying feature selection or dimensionality reduction
• Looking at manifold density
• Looking at class separation

• Babysitting model performance
• Looking for optima
• Watching for sensitivity to initial conditions, perturbations

• Summarizing
• Checking the result is reasonable
• Comparisons to the alternative
![Page 44: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/44.jpg)
Your job is to solve a problem
– Sell the message, not the graphic
Avoid chartjunk
“The purpose of decoration varies – to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.” – Edward Tufte
![Page 45: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/45.jpg)
source: http://i.dailymail.co.uk/i/pix/2012/03/21/article-2118152-124602BE000005DC-0_964x528.jpg
![Page 46: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/46.jpg)
source: http://www.fivethirtyeight.com/2009/10/older-and-wealthier-people-are-more.html
![Page 47: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/47.jpg)
Election fraud: 2D histograms of the number of units for a given voter turnout (x axis) and the percentage of votes (y axis) for the winning party
source: http://www.pnas.org/content/early/2012/09/20/1210722109.abstract
![Page 48: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/48.jpg)
ggplot2
![Page 49: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/49.jpg)
![Page 50: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/50.jpg)
Make a spam detector
The data represents a corpus of emails. Some are spam and some are normal.
• Due to time constraints, feature extraction is done for you:
  – train.csv - contains 600 emails x 100 features
  – train_labels.csv - contains the 600 training labels (1 = spam, 0 = normal)
  – test.csv - contains 4000 emails x 100 features
• Submit a file with each of the 4000 predictions on a separate line (in the same order as test.csv).
  – No header is necessary
  – Predictions can be continuous numbers or 0/1 labels
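As a starting point, the whole round trip (read features, score the test rows, write one prediction per line) might look like the sketch below. It is only a sketch under assumptions: the files are taken to be header-less numeric CSVs, the nearest-centroid score is a deliberately crude placeholder model rather than the presenters' example, and tiny made-up arrays stand in for the real files.

```python
import csv
import os
import tempfile

def read_matrix(path):
    """Read a CSV of numbers into a list of float rows (assumes no header row)."""
    with open(path, newline="") as f:
        return [[float(v) for v in row] for row in csv.reader(f) if row]

def centroid(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def spam_scores(train, labels, test):
    """Continuous score: larger means closer to the spam centroid than the normal one."""
    spam = centroid([r for r, y in zip(train, labels) if y == 1])
    normal = centroid([r for r, y in zip(train, labels) if y == 0])

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [sq_dist(r, normal) - sq_dist(r, spam) for r in test]

def write_submission(scores, path):
    """One prediction per line, in the same order as the test rows, no header."""
    with open(path, "w") as f:
        for s in scores:
            f.write(f"{s}\n")

# Made-up stand-ins; with the contest files this would be, for example,
# train = read_matrix("train.csv") and test = read_matrix("test.csv").
train = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
labels = [0, 0, 1, 1]
test = [[0.1, 0.0], [1.0, 0.9]]

scores = spam_scores(train, labels, test)
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
write_submission(scores, out_path)
```

Continuous scores are written as-is because the contest accepts continuous numbers as well as 0/1 labels.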
![Page 51: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/51.jpg)
How the leaderboard works

| Return% | ProductID | Dept | Price | MFR |
|---|---|---|---|---|
| 1.94 | 54323 | Household | 54.95 | USA |
| 0.023 | 92356 | Household | 9.95 | USA |
| 0.8 | 78023 | Computer | 4.5 | China |
| 0.01 | 12340 | Audio | 109.99 | China |
| 0.41 | 31240 | Audio | 29.99 | Taiwan |
| 0.97 | 12351 | Hardware | 54.95 | Mexico |
| 0.0115 | 90141 | Hardware | 4.99 | USA |
| 0.4 | 81240 | Hardware | 6.55 | Taiwan |
| 0.03 | 14896 | Computer | 211.99 | Korea |
| 0.205 | 62132 | Computer | 1100 | USA |
| 1.6878 | 54323 | Audio | 34.99 | USA |
| 0.0345 | 92356 | Audio | 7.99 | USA |
| 0.64 | 78023 | Household | 229.9 | Brazil |
| 0.72 | 12340 | Audio | 19.95 | Mexico |
| 0.41 | 31240 | Computer | 6.99 | Taiwan |
| 1.94 | 54323 | Hardware | 11.99 | Taiwan |
| 0.023 | 92356 | Household | 2.05 | USA |
| 0.08 | 78023 | Computer | 99.99 | USA |
| 2.09 | 12340 | Computer | 129.99 | China |
| 1.1 | 31240 | Audio | 18.99 | China |
![Page 52: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/52.jpg)
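How the leaderboard works

| Set | Return% | ProductID | Dept | Price | MFR |
|---|---|---|---|---|---|
| Training | 1.94 | 54323 | Household | 54.95 | USA |
| Training | 0.023 | 92356 | Household | 9.95 | USA |
| Training | 0.8 | 78023 | Computer | 4.5 | China |
| Training | 0.01 | 12340 | Audio | 109.99 | China |
| Training | 0.41 | 31240 | Audio | 29.99 | Taiwan |
| Training | 0.97 | 12351 | Hardware | 54.95 | Mexico |
| Training | 0.0115 | 90141 | Hardware | 4.99 | USA |
| Training | 0.4 | 81240 | Hardware | 6.55 | Taiwan |
| Training | 0.03 | 14896 | Computer | 211.99 | Korea |
| Training | 0.205 | 62132 | Computer | 1100 | USA |
| Training | 1.6878 | 54323 | Audio | 34.99 | USA |
| Training | 0.0345 | 92356 | Audio | 7.99 | USA |
| Training | 0.64 | 78023 | Household | 229.9 | Brazil |
| Test | 0.72 | 12340 | Audio | 19.95 | Mexico |
| Test | 0.41 | 31240 | Computer | 6.99 | Taiwan |
| Test | 1.94 | 54323 | Hardware | 11.99 | Taiwan |
| Test | 0.023 | 92356 | Household | 2.05 | USA |
| Test | 0.08 | 78023 | Computer | 99.99 | USA |
| Test | 2.09 | 12340 | Computer | 129.99 | China |
| Test | 1.1 | 31240 | Audio | 18.99 | China |

Solution “Ground Truth”: the Return% values of the Test rows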
![Page 53: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/53.jpg)
How the leaderboard works

Training: the same 13 rows with known Return% (1.94 through 0.64).

Test (Return% hidden; the Solution “Ground Truth” is shown as ?):

| Return% | ProductID | Dept | Price | MFR |
|---|---|---|---|---|
| ? | 12340 | Audio | 19.95 | Mexico |
| ? | 31240 | Computer | 6.99 | Taiwan |
| ? | 54323 | Hardware | 11.99 | Taiwan |
| ? | 92356 | Household | 2.05 | USA |
| ? | 78023 | Computer | 99.99 | USA |
| ? | 12340 | Computer | 129.99 | China |
| ? | 31240 | Audio | 18.99 | China |
![Page 54: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/54.jpg)
How the leaderboard works

Training: the same 13 rows with known Return% (1.94 through 0.64).

Test, with the Submission (predicted Return%) filled in:

| Return% (Submission) | ProductID | Dept | Price | MFR |
|---|---|---|---|---|
| 0.03 | 12340 | Audio | 19.95 | Mexico |
| 1.298 | 31240 | Computer | 6.99 | Taiwan |
| 0.94 | 54323 | Hardware | 11.99 | Taiwan |
| 0.04 | 92356 | Household | 2.05 | USA |
| 0.36 | 78023 | Computer | 99.99 | USA |
| 1.2 | 12340 | Computer | 129.99 | China |
| 0.02 | 31240 | Audio | 18.99 | China |
![Page 55: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/55.jpg)
How the leaderboard works

Training: the same 13 rows with known Return% (1.94 through 0.64).

Test, with the Submission (predicted Return%) filled in:

| Return% (Submission) | ProductID | Dept | Price | MFR |
|---|---|---|---|---|
| 0.03 | 12340 | Audio | 19.95 | Mexico |
| 1.298 | 31240 | Computer | 6.99 | Taiwan |
| 0.94 | 54323 | Hardware | 11.99 | Taiwan |
| 0.04 | 92356 | Household | 2.05 | USA |
| 0.36 | 78023 | Computer | 99.99 | USA |
| 1.2 | 12340 | Computer | 129.99 | China |
| 0.02 | 31240 | Audio | 18.99 | China |

The Test rows are divided between the Public Leaderboard and the Private Leaderboard.
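The public/private split can be sketched with the numbers from these slides. Two things below are assumptions, not something the slides state: which test rows feed which leaderboard, and the use of mean absolute error as the metric for this toy Return% example.

```python
# Return% for the 7 test rows, copied from the slides.
ground_truth = [0.72, 0.41, 1.94, 0.023, 0.08, 2.09, 1.1]
submission = [0.03, 1.298, 0.94, 0.04, 0.36, 1.2, 0.02]

# Hypothetical split of the test rows between the two leaderboards.
public_rows = [0, 1, 2]
private_rows = [3, 4, 5, 6]

def mean_absolute_error(truth, predicted, rows):
    """Average absolute difference over the selected row positions."""
    return sum(abs(truth[i] - predicted[i]) for i in rows) / len(rows)

public_score = mean_absolute_error(ground_truth, submission, public_rows)
private_score = mean_absolute_error(ground_truth, submission, private_rows)
print(f"public: {public_score:.3f}  private: {private_score:.3f}")
```

The same submission gets two different scores, which is why tuning against the public leaderboard alone can mislead.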
![Page 56: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/56.jpg)
Area under the receiver-operating characteristic curve
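One way to read this metric: it is the probability that a randomly chosen spam email gets a higher score than a randomly chosen normal email, with ties counting as half. The sketch below computes it by brute force over all such pairs. The function name and the toy inputs are illustrative, and a library routine would be faster on large data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (labels are 0 or 1)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

Only the ordering of the scores matters, so any monotonic rescaling of the predictions leaves the result unchanged.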
![Page 57: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/57.jpg)
Example Model
![Page 58: Just the basics_strata_2013](https://reader031.vdocument.in/reader031/viewer/2022031903/55a4927f1a28abad7c8b45c0/html5/thumbnails/58.jpg)
Think about
• Missing values
• Noise
• Combinations of features
• Transformations of features (e.g. log)
• Combinations of methods
• Overfitting
• Binary vs. continuous predictions
• How good is a good spam detector?
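On the "binary vs. continuous predictions" item: under a ranking metric such as area under the ROC curve, thresholding scores into 0/1 throws away ordering information. The toy labels, scores, and the 0.5 threshold below are made up to illustrate the effect, and the pairwise AUC helper is a simple sketch rather than a library routine.

```python
def auc(labels, scores):
    """Pairwise AUC: share of (spam, normal) pairs ranked correctly, ties = 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = spam) and continuous model scores.
labels = [0, 0, 0, 1, 1, 1]
continuous = [0.1, 0.2, 0.4, 0.3, 0.45, 0.9]

# The same predictions forced to 0/1 at an arbitrary 0.5 threshold.
binary = [1 if s >= 0.5 else 0 for s in continuous]

auc_continuous = auc(labels, continuous)
auc_binary = auc(labels, binary)
print(auc_continuous, auc_binary)
```

In this toy case the continuous scores rank 8 of the 9 spam/normal pairs correctly, while the thresholded version turns most pairs into ties and scores lower.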