© 2016 ID Analytics, Inc. Confidential
Primary IDA Color Palette
R0 G53 B95
Navy Blue
R0
G114 B156
Blue
R122 G134 B142
Light Gray
R102 G49 B144
Purple
Orange
R191 G191 B191
R84 G148 B63
Green
R106 G154 B194
Light Blue
Secondary Color Palette - Medium
R47
G163 B255
R250 G187 B119
R147 G201 B129
R42 G197 B255
R154 G155 B157
R165 G113 B206
R165 G194 B218
Gray
R216 G216 B216
Navy Blue Blue
Light Gray Purple
Orange
Green Light Blue
Gray
R247 G143 B30
So You Want to be a Data Scientist
Stephen Coggeshall April 21, 2016
© 2016 ID Analytics, Inc. Confidential
Primary IDA Color Palette
R0 G53 B95
Navy Blue
R0
G114 B156
Blue
R122 G134 B142
Light Gray
R102 G49 B144
Purple
Orange
R191 G191 B191
R84 G148 B63
Green
R106 G154 B194
Light Blue
Secondary Color Palette - Medium
R47
G163 B255
R250 G187 B119
R147 G201 B129
R42 G197 B255
R154 G155 B157
R165 G113 B206
R165 G194 B218
Gray
R216 G216 B216
Navy Blue Blue
Light Gray Purple
Orange
Green Light Blue
Gray
R247 G143 B30
Outline
2
• What, why big data
• What is data science
• What are the necessary skills
• Description of the interview process
• Description of the typical work for data scientists
© 2016 ID Analytics, Inc. Confidential
Primary IDA Color Palette
R0 G53 B95
Navy Blue
R0
G114 B156
Blue
R122 G134 B142
Light Gray
R102 G49 B144
Purple
Orange
R191 G191 B191
R84 G148 B63
Green
R106 G154 B194
Light Blue
Secondary Color Palette - Medium
R47
G163 B255
R250 G187 B119
R147 G201 B129
R42 G197 B255
R154 G155 B157
R165 G113 B206
R165 G194 B218
Gray
R216 G216 B216
Navy Blue Blue
Light Gray Purple
Orange
Green Light Blue
Gray
R247 G143 B30
WHAT IS BIG DATA Big Data is a term that describes large volumes of high velocity, complex and variable data that require advance techniques and technologies to enable the capture, storage, distribution, management and analysis of the information.
3
© 2016 ID Analytics, Inc. Confidential
Primary IDA Color Palette
R0 G53 B95
Navy Blue
R0
G114 B156
Blue
R122 G134 B142
Light Gray
R102 G49 B144
Purple
Orange
R191 G191 B191
R84 G148 B63
Green
R106 G154 B194
Light Blue
Secondary Color Palette - Medium
R47
G163 B255
R250 G187 B119
R147 G201 B129
R42 G197 B255
R154 G155 B157
R165 G113 B206
R165 G194 B218
Gray
R216 G216 B216
Navy Blue Blue
Light Gray Purple
Orange
Green Light Blue
Gray
R247 G143 B30
DATA SIZES ARE GROWING SUBSTANTIALLY
4
0
10,000
20,000
30,000
40,000
50,000
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
Global digital data (in Exabytes)
Sources: IDC
By 2020, digital records will be 44 times larger than in 2009
0
10
20
30
40
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
The Digital Universe: 50-fold Growth from the Beginning of 2010 to the end of 2020
(000
’s)
(Exabytes)
Sources: IDC’s Digital Universe, sponsored by EMC, December 2012
0
400
800
1,200
1,600
2,000
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
Enterprise Unstructured DataEnterprise Structured Data
Exabytes
Data growth exhibits “Moore’s Law” doubling every 1 to 2 years
1970 1980 1990 2000 2010Stor
ed D
igita
l Inf
orm
atio
n(E
xaby
tes)
• Digital data is growing at 62% yoy vs. structured data at 22%• 80% of this new data is digital data that is
complex to analyze in its native structure
Web Application Data
Business Transaction Data
78% CAGR 2011–2016
0.6 1.32.4
4.26.9
10.8
0
4
8
12
16
2011 2012 2013 2014 2015 2016
Exabytes per Month
0.24 0.51.2
2.2
3.8
6.3
01234567
2010 2011 2012 2013 2014 2015Tera
byte
s per
mon
th (M
M)
0
10,000
20,000
30,000
40,000
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
Exab
ytes
Worldwide Corporate Data Growth
Structured data Unstructured Text
IT.com analyses the80% of data growth:
U nstructured text
Sources: IDC, The Digital Universe 2010
Huge Growth of Data
© 2016 ID Analytics, Inc. Confidential
Primary IDA Color Palette
R0 G53 B95
Navy Blue
R0
G114 B156
Blue
R122 G134 B142
Light Gray
R102 G49 B144
Purple
Orange
R191 G191 B191
R84 G148 B63
Green
R106 G154 B194
Light Blue
Secondary Color Palette - Medium
R47
G163 B255
R250 G187 B119
R147 G201 B129
R42 G197 B255
R154 G155 B157
R165 G113 B206
R165 G194 B218
Gray
R216 G216 B216
Navy Blue Blue
Light Gray Purple
Orange
Green Light Blue
Gray
R247 G143 B30
5
DATA IS COLLECTED EVERYWHERE
Proliferation of Sensors (IOT)
© 2016 ID Analytics, Inc. Confidential
Primary IDA Color Palette
R0 G53 B95
Navy Blue
R0
G114 B156
Blue
R122 G134 B142
Light Gray
R102 G49 B144
Purple
Orange
R191 G191 B191
R84 G148 B63
Green
R106 G154 B194
Light Blue
Secondary Color Palette - Medium
R47
G163 B255
R250 G187 B119
R147 G201 B129
R42 G197 B255
R154 G155 B157
R165 G113 B206
R165 G194 B218
Gray
R216 G216 B216
Navy Blue Blue
Light Gray Purple
Orange
Green Light Blue
Gray
R247 G143 B30
• Why - exponential growth of data and compute power
• What is data science? – The scientific discipline around typically large and difficult data sets
to solve complex business problems
• It is applied in many industries, including – Financial services – Healthcare – Insurance – Telecomunication – Social networks – Gaming – Government
Data Science
6
© 2016 ID Analytics, Inc. Confidential
Primary IDA Color Palette
R0 G53 B95
Navy Blue
R0
G114 B156
Blue
R122 G134 B142
Light Gray
R102 G49 B144
Purple
Orange
R191 G191 B191
R84 G148 B63
Green
R106 G154 B194
Light Blue
Secondary Color Palette - Medium
R47
G163 B255
R250 G187 B119
R147 G201 B129
R42 G197 B255
R154 G155 B157
R165 G113 B206
R165 G194 B218
Gray
R216 G216 B216
Navy Blue Blue
Light Gray Purple
Orange
Green Light Blue
Gray
R247 G143 B30
Risk Management • Scores for credit, fraud, bankruptcy, actuarial… Marketing • Product offers, pricing optimization, customer analysis… Financial Engineering • Econometrics, derivative pricing, algorithmic trading… Social Networks • Network analysis, recommender systems, sentiment analysis… Operations Optimization • Production lines, chemical processes, manufacturing… Transportation • Traffic flow, smart grid, route/price optimization, logistics… Search • Online searches, document matching, link analysis… Health • Bioinformatics, drug discovery, image analysis… …
Examples of Data Science Applications
7
© 2016 ID Analytics, Inc. Confidential
Primary IDA Color Palette
R0 G53 B95
Navy Blue
R0
G114 B156
Blue
R122 G134 B142
Light Gray
R102 G49 B144
Purple
Orange
R191 G191 B191
R84 G148 B63
Green
R106 G154 B194
Light Blue
Secondary Color Palette - Medium
R47
G163 B255
R250 G187 B119
R147 G201 B129
R42 G197 B255
R154 G155 B157
R165 G113 B206
R165 G194 B218
Gray
R216 G216 B216
Navy Blue Blue
Light Gray Purple
Orange
Green Light Blue
Gray
R247 G143 B30
Business Analytics Hierarchy
Descriptive BI dashboards, reporting What’s going on?
Explanatory Why are things happening?
BI/BA analysis, slicing data different ways, diagnostics
Predictive What will happen? Forecasts, predictive models, machine learning
Prescriptive Optimization of actions, recommendations
What should I do with my current
businesses?
Strategic What new business opportunities do I
have?
New products & services around existing data, new data to acquire
Valu
e
Question How and what
© 2016 ID Analytics, Inc. Confidential
Primary IDA Color Palette
R0 G53 B95
Navy Blue
R0
G114 B156
Blue
R122 G134 B142
Light Gray
R102 G49 B144
Purple
Orange
R191 G191 B191
R84 G148 B63
Green
R106 G154 B194
Light Blue
Secondary Color Palette - Medium
R47
G163 B255
R250 G187 B119
R147 G201 B129
R42 G197 B255
R154 G155 B157
R165 G113 B206
R165 G194 B218
Gray
R216 G216 B216
Navy Blue Blue
Light Gray Purple
Orange
Green Light Blue
Gray
R247 G143 B30
• Strong math and statistics
• Strong programming & software best practices
• Passion for data
• Machine learning methods
• Big data technologies (scalable file systems and computing
architectures)
• Scientific exploration, discovery and testing discipline
• Curiosity, skepticism, detail oriented
• Desire to learn
Desired Data Science Background and Skills
9
© 2016 ID Analytics, Inc. Confidential
Primary IDA Color Palette
R0 G53 B95
Navy Blue
R0
G114 B156
Blue
R122 G134 B142
Light Gray
R102 G49 B144
Purple
Orange
R191 G191 B191
R84 G148 B63
Green
R106 G154 B194
Light Blue
Secondary Color Palette - Medium
R47
G163 B255
R250 G187 B119
R147 G201 B129
R42 G197 B255
R154 G155 B157
R165 G113 B206
R165 G194 B218
Gray
R216 G216 B216
Navy Blue Blue
Light Gray Purple
Orange
Green Light Blue
Gray
R247 G143 B30
• Lots of data manipulation, cleaning, exploration, analysis, discovery. Frequently deal with messy data
• Problem definition: – Fully understand business problem and goals – Understand details of the underlying data – Frame the problem: time series? predictive model?
simulation? structure discovery? – What data is available? What are the inputs/outputs?
• Build preliminary models, iterate with business • Finalize, implement • Monitor, rebuild as needed
Typical Data Scientist Work
10
© 2016 ID Analytics, Inc. Confidential
Primary IDA Color Palette
R0 G53 B95
Navy Blue
R0
G114 B156
Blue
R122 G134 B142
Light Gray
R102 G49 B144
Purple
Orange
R191 G191 B191
R84 G148 B63
Green
R106 G154 B194
Light Blue
Secondary Color Palette - Medium
R47
G163 B255
R250 G187 B119
R147 G201 B129
R42 G197 B255
R154 G155 B157
R165 G113 B206
R165 G194 B218
Gray
R216 G216 B216
Navy Blue Blue
Light Gray Purple
Orange
Green Light Blue
Gray
R247 G143 B30
• Solid foundation in basics: math, statistics, programming • Buy a few books. Possibles:
– Elements of Statistical Learning, Hastie, Tibshirani, Friedman – Pattern Recognition and Machine Learning, Bishop – Pattern Classification, Duda, Hart, Stork – Foundations of Predictive Analytics, Wu, Coggeshall
• Many excellent courses on Coursera. Chose the technical ones.
• Internship if possible, but not necessary. • Now you know the basics and can converse. Interview
for beginner position.
How to Get Into Data Science as a Career
11
© 2016 ID Analytics, Inc. Confidential
Primary IDA Color Palette
R0 G53 B95
Navy Blue
R0
G114 B156
Blue
R122 G134 B142
Light Gray
R102 G49 B144
Purple
Orange
R191 G191 B191
R84 G148 B63
Green
R106 G154 B194
Light Blue
Secondary Color Palette - Medium
R47
G163 B255
R250 G187 B119
R147 G201 B129
R42 G197 B255
R154 G155 B157
R165 G113 B206
R165 G194 B218
Gray
R216 G216 B216
Navy Blue Blue
Light Gray Purple
Orange
Green Light Blue
Gray
R247 G143 B30
• Resumes selected for interviews • Typically two phone interviews
– Discuss resume, questions about ML, data, statistics, programming, math…
• In-house interview – Usually starts with a presentation, typically dissertation
work – Series of individual and group interviews, ~45 mins – Same Q’s as phone interview, but in more detail
• The group meets next day, go around room, – Technically adept, smart, experience, culture fit?
Description of the Interview Process
12
© 2016 ID Analytics, Inc. Confidential
Primary IDA Color Palette
R0 G53 B95
Navy Blue
R0
G114 B156
Blue
R122 G134 B142
Light Gray
R102 G49 B144
Purple
Orange
R191 G191 B191
R84 G148 B63
Green
R106 G154 B194
Light Blue
Secondary Color Palette - Medium
R47
G163 B255
R250 G187 B119
R147 G201 B129
R42 G197 B255
R154 G155 B157
R165 G113 B206
R165 G194 B218
Gray
R216 G216 B216
Navy Blue Blue
Light Gray Purple
Orange
Green Light Blue
Gray
R247 G143 B30
• Sensors everywhere, data collection everywhere, compute power growing exponentially
• Machine learning algorithms continue to grow in power and complexity – Rise in deep learning, convolutional neural nets, distributed/
scalable machine learning
• Businesses relying more and more on data science
• The world is awash in data. Need people to transform data into intelligence into actions.
Summary: Data Science is a Surging Field!
13