Big Data Talent in Academic and Industry R&D
TRANSCRIPT
A Confluence of Big Data Skills in
Academic and Industry R&D
Bill Howe, PhD
Associate Director
University of Washington eScience Institute
The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
“All across our campus, the process of discovery will
increasingly rely on researchers’ ability to extract
knowledge from vast amounts of data… In order to
remain at the forefront, UW must be a leader in
advancing these techniques and technologies, and in
making [them] accessible to researchers in the
broadest imaginable range of fields.”
2005-2008
In other words:
• Data-intensive research will be ubiquitous
• It’s about intellectual infrastructure and software infrastructure,
not only computational infrastructure
http://escience.washington.edu
A 5-year, US $37.8 million cross-institutional
collaboration to create a data science environment
2014
“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying to figure out how to make people click on ads.”
-- Jeff Hammerbacher, co-founder, Cloudera
5/7/2015 Bill Howe, UW 6
Jake Vanderplas
…the new breed of scientist must be a broadly trained expert in statistics, in computing, in algorithm-building, in software design
The skills required to be a successful scientific
researcher are increasingly indistinguishable from
the skills required to be successful in industry.
Jake Vanderplas
“Data Science” is not the only example…
• Strong math + PhD → Quant, on Wall Street
• Strong “data” + PhD → Data Scientist, anywhere
Industry: increased statistical rigor and data-driven decision-making
Academia: increased sophistication in the use and development of software
Maximiliaan Schillebeeckx, Brett Maricque & Cory Lewis
Nature Biotechnology 31, 938–941 (2013) doi:10.1038/nbt.2706
WHAT SKILLS ARE NEEDED?
Drew Conway’s Data Science Venn Diagram
“I worry that the Data Scientist role is like
the mythical “webmaster” of the 90s:
master of all trades.”
-- Aaron Kimball, CTO of Zymergen,
formerly CTO of Wibidata, formerly
co-founder of Cloudera
What to look for in data science skills:
• tools vs. principles
• desktop vs. cloud
• data structures vs. statistics
• hackers vs. analysts
Cambrian Explosion of Big Data Systems
tools principles
What are the abstractions of
data science?
“Data Jujitsu”
“Data Wrangling”
“Data Munging”
Translation: “We have no idea what
this is all about”
tools principles
1850s: matrices and linear algebra (today: engineers and scientists)
1950s: arrays and custom algorithms (today: C/Fortran performance junkies)
1950s: s-expressions and pure functions (today: language purists)
1960s: objects and methods (today: software engineers)
1970s: files and scripts (today: system administrators)
1970s: relations and relational algebra (today: industry data pros)
1980s: data frames and functions (today: statisticians)
2000s: key-value pairs + one of the above (today: NoSQL hipsters)
But what are the abstractions of
data science?
tools principles
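Two of the abstractions above can express the same computation. A minimal sketch with made-up data: one group-average computed first in 1970s "files and scripts" style (a single pass with mutable accumulators), then in 2000s "key-value pairs" style (map, shuffle by key, reduce):

```python
from collections import defaultdict

records = [("a", 1.0), ("a", 3.0), ("b", 10.0)]  # hypothetical (group, value) pairs

# Script style: one pass over the records, mutable accumulators.
sums, counts = defaultdict(float), defaultdict(int)
for group, value in records:
    sums[group] += value
    counts[group] += 1
loop_avg = {g: sums[g] / counts[g] for g in sums}

# Key-value style: map to (key, (value, 1)), shuffle by key, then reduce.
mapped = [(g, (v, 1)) for g, v in records]      # map
shuffled = defaultdict(list)
for key, pair in mapped:                        # shuffle: group by key
    shuffled[key].append(pair)
kv_avg = {k: sum(v for v, _ in pairs) / sum(n for _, n in pairs)
          for k, pairs in shuffled.items()}     # reduce

assert loop_avg == kv_avg == {"a": 2.0, "b": 10.0}
```

Same answer either way; what differs is which abstraction the programmer reasons in, and which systems can execute it at scale.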
“80% of analytics is sums and averages.”
-- Aaron Kimball, Wibidata
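Those sums and averages are exactly what SQL's aggregate functions give you. A minimal sketch, using Python's built-in sqlite3 with a hypothetical sales table:

```python
import sqlite3

# Hypothetical data: "sums and averages" via a SQL GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("west", 10.0), ("west", 20.0), ("east", 5.0)])

rows = conn.execute(
    "SELECT region, SUM(amount), AVG(amount) FROM sales "
    "GROUP BY region ORDER BY region").fetchall()
# rows == [('east', 5.0, 5.0), ('west', 30.0, 15.0)]
```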
data structures statistics
“The intuition behind this ought to be very simple: Mr. Obama
is maintaining leads in the polls in Ohio and other states that
are sufficient for him to win 270 electoral votes.”
Nate Silver, Oct. 26, 2012
“…the argument we’re making is exceedingly simple. Here it
is: Obama’s ahead in Ohio.”
Nate Silver, Nov. 2, 2012
“The bar set by the competition was invitingly low. Someone could
look like a genius simply by doing some fairly basic research into
what really has predictive power in a political campaign.”
Nate Silver, Nov. 10, 2012
[Figures: FiveThirtyEight election forecasts vs. Daily Beast coverage; photo of Nate Silver (source: Randy Stewart)]
data structures statistics
Data Science Workflow
1) Preparing to run a model: gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging. “80% of the work” -- Aaron Kimball
2) Running the model. Academia puts far too much emphasis on this step.
3) Interpreting the results. “The other 80% of the work”
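The wrangling verbs above are mundane but dominate the work. A minimal sketch with made-up messy input (stray whitespace, a missing value, a non-numeric entry), using only the standard library:

```python
import csv
import io

# Hypothetical messy input: whitespace, a missing age, a non-numeric age.
raw = """name, age
 Alice ,30
Bob,
Carol,notanumber
 Dave , 45
"""

cleaned = []
for row in csv.DictReader(io.StringIO(raw), skipinitialspace=True):
    name = row["name"].strip()          # shaping: trim stray whitespace
    try:
        age = int(row["age"].strip())   # casting: rejects "" and "notanumber"
    except (ValueError, AttributeError):
        continue                        # filtering: drop unusable records
    cleaned.append((name, age))

# cleaned == [("Alice", 30), ("Dave", 45)]
```

Two of four records survive; in real projects, deciding what "unusable" means is itself most of the effort.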
data structures statistics
Problem
How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
data structures statistics
“[This was hard] due to the large amount of data (e.g. data indexes for data retrieval,
dissection into data blocks and processing steps, order in which steps are performed
to match memory/time requirements, file formats required by software used).
In addition we actually spend quite some time in iterations fixing problems with
certain features (e.g. capping ENCODE data), testing features and feature products
to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs
human-derived variants)
So roughly 50% of the project was testing and improving the model, 30% figuring out
how to do things (engineering) and 20% getting files and getting them into the right
format.
I guess in total [I spent] 6 months [on this project].”
At least 3 months on issues of
scale, file handling, and feature
engineering.
Martin Kircher, Genome Sciences

Why?
3k NSF postdocs in 2010
$50k / postdoc
at least 50% overhead
maybe $75M annually
at NSF alone?
desktop vs. cloud

With “manual” approaches, you can comfortably handle…
…up to 1 GB (volume)
…up to 10 data sources (variety)
…up to 1% churn/day (velocity)
…up to 1% bad data (veracity)
…up to 10 collaborators

But we’re seeing a 10x-100x increase in every dimension, even under modest assumptions.
The US faces a shortage of 140,000 to 190,000 people “with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
-- McKinsey Global Institute
hackers analysts
Where do you store your data?
src: Conversations with Research Leaders (2008)
src: Faculty Technology Survey (2011)
My computer: 87%
External device (hard drive, thumb drive): 66%
Department-managed server: 41%
Server managed by research group: 27%
External (non-UW) data center: 12%
Department-managed data center: 6%
Other: 5%
Lewis et al 2011
Conversations with DS Hiring Managers
• “How to ask the right questions and communicate
results”
– DS: "I tried three methods, two didn't work, achieved 80% accuracy."
– Manager: “Ok, so….what do we do?”
• “Can you properly tell a story with the data, and
properly persuade people?”
• "For my team, engineering/stats skills need to be
good, not great."
hackers analysts
If I had to pick 2…
• Experimental Design
– How to design a statistical test?
– How to interpret significance of a test?
– A/B tests
– More complicated sampling methods
– Sources of bias
– Skewed data
• SQL and Databases
– Mentioned in nearly every DS job description
– Why? Easy scalability, production data sources, IT integration
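The experimental-design pick can be made concrete in a few lines. A hedged sketch, with made-up conversion counts, of a two-sided two-proportion z-test for a simple A/B test, using only the standard library:

```python
from math import sqrt, erf

def ab_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test: did variant B convert differently from A?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))        # standard normal CDF
    p_value = 2 * (1 - phi(abs(z)))
    return z, p_value

# Made-up example: 120/1000 conversions for A, 160/1000 for B.
z, p = ab_ztest(120, 1000, 160, 1000)
# p < 0.05 here, so this difference would be called significant at the 5% level
```

Interpreting that p-value correctly (and knowing when the sampling, bias, or skew assumptions behind it break down) is exactly the experimental-design skill the slide is asking for.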
http://cds.nyu.edu/ http://bids.berkeley.edu/ http://escience.washington.edu/
http://escience.washington.edu
Data Scientist and Research Scientist positions available
Who We Are | Join Us