Data Science: The Main Course @ KCDC 2016
TRANSCRIPT
![Page 1: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/1.jpg)
DATA SCIENCE: THE MAIN COURSE
I Can Science Data, and So Can You!
Arthur Doler @arthurdoler
![Page 2: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/2.jpg)
TITANIUM SPONSORS
Platinum Sponsors
Gold Sponsors
![Page 3: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/3.jpg)
HOW MANY APPETIZERS HAVE YOU EATEN?
![Page 4: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/4.jpg)
Sources: Mediawiki, Publicdomainpictures.net
![Page 5: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/5.jpg)
SO WE’RE SKIPPING RIGHT TO THE MAIN COURSE
YOU HAVE THE DATA
YOU HAVE THE POWER
![Page 6: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/6.jpg)
Sources: Mattel, he-manreviewed.net
![Page 7: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/7.jpg)
WHAT’S FOR DINNER
![Page 8: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/8.jpg)
Picking your problem
Using Knitr/R Markdown
Building a linear predictor
Making a predictive, repeatable document
![Page 9: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/9.jpg)
WHAT’S NOT FOR DINNER
![Page 10: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/10.jpg)
Learning R
Exhaustive discussion of statistics
Exhaustive discussion of regression modeling
Ways to run R in production
![Page 11: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/11.jpg)
STEP 0: KNOW YOUR RECIPE FOR REPEATABILITY
Learn to Knit you some R
![Page 12: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/12.jpg)
knitr ≈ Sweave + cacheSweave + pgfSweave + weaver + animation::saveLatex +
R2HTML::RweaveHTML + highlight::HighlightWeaveLatex + 0.2 * brew + 0.1 *
SweaveListingUtils + more
![Page 13: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/13.jpg)
Source: Reddit
![Page 14: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/14.jpg)
R Code
Markup
R Code
Markup
Markup
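A minimal sketch of the idea on this slide, written in Python rather than the talk's R and not using knitr itself: a report is a sequence of code and markup chunks, and the numbers in the output come from running the code each time the document is built. The `weave` function and the chunk contents are hypothetical names made up for illustration.

```python
import contextlib
import io


def weave(chunks):
    """Render (kind, text) chunks into one report string.

    kind is "markup" (copied through unchanged) or "code" (executed, with
    its source and any printed output placed in the report).
    """
    namespace = {}  # shared by all code chunks, so later chunks see earlier variables
    parts = []
    for kind, text in chunks:
        if kind == "markup":
            parts.append(text)
        elif kind == "code":
            buffer = io.StringIO()
            with contextlib.redirect_stdout(buffer):
                exec(text, namespace)
            # Echo the source, prefixed so it stands apart from the markup.
            parts.append("\n".join("> " + line for line in text.splitlines()))
            output = buffer.getvalue().rstrip()
            if output:
                parts.append(output)
        else:
            raise ValueError(f"unknown chunk kind: {kind!r}")
    return "\n".join(parts)


# Same chunk order as the slide: code, markup, code, markup, markup.
report = weave([
    ("code", "values = [2, 4, 9]"),
    ("markup", "The mean below is computed from the data, not typed in by hand:"),
    ("code", "print(sum(values) / len(values))"),
    ("markup", "Change the data and rebuild, and the number changes with it."),
    ("markup", "That is what makes the document repeatable."),
])
print(report)
```

Because every code chunk runs in one shared namespace, the second chunk can use `values` from the first, which is roughly how a woven document carries state from chunk to chunk.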
![Page 15: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/15.jpg)
WHAT?! WHY IS THIS A GOOD IDEA?
![Page 16: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/16.jpg)
Do you love me?
Y N
![Page 17: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/17.jpg)
LET’S GO FIND THAT RECIPE!
Source: Reddit
![Page 18: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/18.jpg)
STEP 1: SHOP FOR YOUR INGREDIENTS
Finding the question to ask
![Page 19: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/19.jpg)
WHAT ARE YOU TRYING TO DO?
![Page 20: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/20.jpg)
Finding or proving a correlation
Looking for outliers
Building a predictive model
![Page 21: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/21.jpg)
LET’S BUILD A LINEAR PREDICTIVE MODEL
![Page 22: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/22.jpg)
![Page 23: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/23.jpg)
Source: Wikipedia
![Page 24: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/24.jpg)
WHAT ARE YOUR VARIABLES?
![Page 25: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/25.jpg)
• Material Category
• Material ID
• Time-to-Incapacitation
• 1000 / Time-To-Incapacitation
• Carbon Monoxide
• Hydrogen Cyanide
• Hydrogen Sulfide
• Hydrochloric Acid
• Hydrobromic Acid
• Nitrogen Dioxide
• Sulfur Dioxide
![Page 26: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/26.jpg)
WHAT DO YOU CARE ABOUT?
![Page 27: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/27.jpg)
![Page 28: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/28.jpg)
FORMULATE YOUR QUESTION
![Page 29: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/29.jpg)
LET’S HEAD TO THE STORE!
Source: Reddit
![Page 30: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/30.jpg)
STEP 2: GET YOUR MISE EN PLACE
Dividing your data
![Page 31: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/31.jpg)
WHERE IS THE VALUE IN A PREDICTIVE MODEL?
![Page 32: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/32.jpg)
WE BUILD OUR MODEL WITH A TRAINING SET
![Page 33: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/33.jpg)
PARTITIONING YOUR DATA PREVENTS OVERTRAINING
![Page 34: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/34.jpg)
²⁄³ Training
¹⁄³ Test

½ Training
¼ Test
¼ Validation
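The two splits on this slide can be sketched as follows. This is an illustrative Python version, not the talk's R code; the `partition` helper, the seed value, and the 12 stand-in rows are all assumptions made for the example. Fixing the seed keeps the split the same on every run, which matters for a repeatable report.

```python
import random


def partition(rows, fractions, seed=2016):
    """Shuffle rows reproducibly and cut them into consecutive parts.

    fractions gives the share of each part and must sum to 1; the last
    part takes whatever remains so that no row is dropped.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # private generator, fixed seed
    parts = []
    start = 0
    for fraction in fractions[:-1]:
        end = start + round(fraction * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])
    return parts


rows = list(range(12))  # stand-in for 12 data rows

# Two-way split: 2/3 training, 1/3 test.
train, test = partition(rows, [2 / 3, 1 / 3])

# Three-way split: 1/2 training, 1/4 test, 1/4 validation.
train3, test3, validation3 = partition(rows, [1 / 2, 1 / 4, 1 / 4])

print(len(train), len(test))
print(len(train3), len(test3), len(validation3))
```

Shuffling before cutting avoids a split that simply follows the order of the file, for example one where all rows of a material category land in the same part.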
![Page 35: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/35.jpg)
LET’S MEASURE EVERYTHING OUT!
Source: Reddit
![Page 36: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/36.jpg)
STEP 3: COOK UP YOUR PREDICTOR
Training your model
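As a rough illustration of what training a linear predictor involves, the sketch below fits a straight line by ordinary least squares on a training set and then measures its error on rows it has not seen. It is written in Python with a single predictor for brevity; the talk's model is built in R and may use several predictors. The function names and all numbers are made up for the example.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(intercept, slope, xs, ys):
    """Root mean squared error of the line's predictions on (xs, ys)."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up numbers for illustration only, not the talk's data set.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.1, 4.9, 7.2, 8.8]
test_x = [5.0, 6.0]
test_y = [11.0, 13.1]

# Fit on the training rows only, then score on both sets.
intercept, slope = fit_line(train_x, train_y)
print("intercept:", intercept, "slope:", slope)
print("training RMSE:", rmse(intercept, slope, train_x, train_y))
print("test RMSE:", rmse(intercept, slope, test_x, test_y))
```

A test error far above the training error is the usual sign of the overtraining that partitioning is meant to expose.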
![Page 37: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/37.jpg)
ONE WARNING FIRST
DO YOU NEED TO UNDERSTAND YOUR PREDICTOR?
![Page 38: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/38.jpg)
LET’S GO COOK UP THE MODEL!
Source: Reddit
![Page 39: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/39.jpg)
WHY DID 1000/TIME_TO_INCAPACITATION WORK BETTER?
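The transcript does not record the speaker's answer. One plausible reading, offered here as an assumption, is that time-to-incapacitation falls off roughly in inverse proportion to gas concentration, so its reciprocal is close to linear in concentration while the raw time is not. The Python sketch below shows the effect on made-up values that follow an exact inverse relationship; `r_squared` is a hypothetical helper, and none of the numbers come from the talk's data.

```python
def r_squared(xs, ys):
    """Coefficient of determination of the least squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Made-up values: time is 60 / concentration, so concentration * time is constant.
concentration = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
time_to_incap = [60.0 / c for c in concentration]

# The transformed column from the variable list: 1000 / Time-to-Incapacitation.
rate = [1000.0 / t for t in time_to_incap]

print("raw time vs concentration:", r_squared(concentration, time_to_incap))
print("1000 / time vs concentration:", r_squared(concentration, rate))
```

On these values a straight line fits the transformed column almost perfectly and the raw column noticeably worse, which is the pattern a linear model rewards.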
![Page 40: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/40.jpg)
STEP 3A: TRIM THE FAT
Eliminating Outliers
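The transcript does not say which rule the talk uses to pick outliers. As one common option, the Python sketch below applies the 1.5 × IQR rule (Tukey's fences) to a single column; the `split_outliers` helper, the threshold, and the sample values are assumptions for illustration only.

```python
import statistics


def split_outliers(values, k=1.5):
    """Separate values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    kept = [v for v in values if low <= v <= high]
    dropped = [v for v in values if v < low or v > high]
    return kept, dropped


# Made-up measurements with one value far from the rest.
measurements = [10, 12, 11, 13, 12, 11, 48, 12]
kept, dropped = split_outliers(measurements)
print("kept:", kept)
print("dropped:", dropped)
```

Returning the dropped values alongside the kept ones makes it easy to list in the report exactly which rows were trimmed and why.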
![Page 41: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/41.jpg)
LET’S GO CUT!
Source: Reddit
![Page 42: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/42.jpg)
STEP 4: GARNISH WITH GRAPHICS
Adding visualizations to your report
![Page 43: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/43.jpg)
plot ggplot2
Source: Wikimedia
![Page 44: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/44.jpg)
LET’S FINISH UP THAT REPORT!
Source: Reddit
![Page 45: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/45.jpg)
1. Know Your Recipe for Repeatability
2. Shop for Your Column Ingredients
3. Get Your Data Divided
4. Cook Up Your Predictor
   1. Trim the Outlier Fat
5. Garnish with Graphics
![Page 46: Data Science: The Main Course @ KCDC 2016](https://reader036.vdocument.in/reader036/viewer/2022070518/58e586c01a28abbf5d8b61c7/html5/thumbnails/46.jpg)
QUESTIONS?
Source: Reddit