The Art of Data Analysis
DESCRIPTION
I conduct workshops on The Art of Data Analysis for corporate clients and at conferences. I recently ran the workshop at the Fifth Elephant, a conference on data in Bangalore. These are the slides I used for that workshop. For corporate clients, I develop custom case studies relevant to their company or industry. For more details, contact me at karthik DOT shashidhar AT gmail DOT com
© Karthik Shashidhar
Introduction
Six-step process
Common Pitfalls
Case Study
Why do you need this workshop?
We are moving to an increasingly data-driven world
The ability to use data for day-to-day decision-making can be a significant competitive advantage
This workshop equips managers with basic tools for dealing with data
Who needs this workshop?
Sales Managers
What is the optimal level of sales commissions to maximize profitability?
Production Managers
How do we set daily production targets given the probabilities of line shutdowns?
HR Managers
What are the factors that determine employee attrition?
This workshop is suitable for personnel in middle to senior management roles across functions
Frame a clear and concise problem statement
Break down your problem into smaller problems, and then use those to generate hypotheses
Gather, clean and prepare data
Test hypotheses. In the process, generate additional hypotheses
Consolidate results to solve the main problem
Make the data tell a story
A structured, iterative approach to data-driven decision making
The Rs. 32 Poverty Line
Based on data from the 66th NSSO Survey, the Planning Commission fixed the “Poverty Line” at Rs. 32 per person per day for people living in urban areas. This has led to much controversy and protests. The Prime Minister has asked for your inputs. What do you recommend?
How would you frame the problem statement for this one?
• Your client may not have framed the question precisely. You need to do that job and frame a precise problem statement
• “Solving this problem” should tell you everything you want to know from your analysis
• Be concise, so that you stay focused on answering your question
• Frame your question such that it has an objective answer. Yes/No questions or questions with numerical answers are preferred
Has the poverty line been set too low at Rs. 32 per day?
• This problem statement has an objective answer (yes/no)
• Solving it is necessary and sufficient to answer the question our client (the PM) is asking
• The question directly addresses the situation (people complaining that the poverty line has been set too low)
• This problem statement is to the point and doesn’t take on additional responsibilities (such as defining an alternate poverty line)
What problems do we need to solve in order to solve the main problem?
• The set of “level two problems” must be precise and complete, in that:
  • The combined solutions of all level two problems lead to the solution of the main problem
  • The solution of each level two problem directly impacts the main problem
• Once again, it is key to frame problems concisely and with objective answers
• We need not stop at two levels. Some level two problems might require the solution of deeper problems. Add them to the list of sub-problems
What do we need to know to answer “Has the poverty line been set too low at Rs. 32 per day?”
• How is “poverty line” defined?
• What are the implications of the poverty line?
• What is the distribution of income in India?
• Does the distribution of income vary across states? If it varies significantly, does it make sense to have a state-wise poverty line?
• What are the essential goods that most people need?
• For a given income level, what essential goods can a person afford?
Problems generate sub-problems, and some of these will lead to hypotheses.
• Hypothesis 1: There is a significant difference in income levels across states
• Hypothesis 2: Essential goods are those that the poorest people consume. Also, their use flattens out as income goes up
Some problems, however, are direct and don’t need hypotheses. Some are qualitative, while others need data
• Question 1: How is “poverty line” defined?
  • The poverty line is the minimum income level that is deemed adequate
  • If a family is “below poverty line”, it qualifies for additional state benefits
• Question 2: What is the distribution of incomes in each state?
• Question 3: Is there some kind of threshold on the proportion of the population that can be below the poverty line?
What data do you need here?
• It is important to frame the problem and break it down into components before listing data requirements; otherwise the available data could bias you
• Define data requirements in a general fashion, to allow you to easily access proxies
• Remember to gather data that both answers your questions and will allow you to test your hypotheses
Once you’ve identified data requirements, identify sources and gather data
• Here we need
  • Distribution of a measure of income for India
  • Distribution of a measure of income for each state
  • Spending patterns for different income levels
  • Data on household sizes in different states
• The National Sample Survey Organization (NSSO) conducts surveys on income and expenditure every 5 years, so we could perhaps use this
• However, income data gathered from surveys are notoriously unreliable
• The poor have little savings, so their total consumption is a better indicator of income than the reported income data
Data cleaning is an ugly but important step
• It is important to make sure names in data procured from different sources match
  • For example, some government sites say “AndhraPradesh”, while others say “Andhra Pradesh”. A join on the state name fails in that case
• If the data set is small, go through it once to check the numbers for consistency. For example, if you have data on percentages, make sure they add up to 100%
• For larger data sets, try to write scripts to do basic cleaning
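The two checks mentioned above (matching names across sources, and percentages adding up to 100%) could be scripted roughly as follows. This is a minimal sketch: the helper names `normalize_name` and `percentages_add_up`, the tolerance, and all the sample values are assumptions made for illustration, not part of the workshop material.

```python
import re

def normalize_name(name: str) -> str:
    """Lower-case a name and keep only letters, so that 'AndhraPradesh'
    and 'Andhra Pradesh' map to the same join key."""
    return re.sub(r"[^a-z]", "", name.lower())

def percentages_add_up(values, tolerance=0.5):
    """Return True if the percentages sum to 100 within a rounding tolerance."""
    return abs(sum(values) - 100.0) <= tolerance

# Hypothetical rows from two sources that spell the state name differently.
source_a = {"AndhraPradesh": 1200}
source_b = {"Andhra Pradesh": 4.1}

# Re-key both sources on the normalized name, then join on the shared keys.
a = {normalize_name(k): v for k, v in source_a.items()}
b = {normalize_name(k): v for k, v in source_b.items()}
joined = {key: (a[key], b[key]) for key in a.keys() & b.keys()}
print(joined)

# Consistency check on made-up percentage breakdowns.
print(percentages_add_up([40.0, 35.5, 24.5]))
print(percentages_add_up([40.0, 35.5, 20.0]))
```

Stripping everything except letters is a blunt rule; for real data it may merge names that should stay distinct, so an explicit lookup table of known spellings can be the safer choice.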
Understand and prepare data before you dive into analysis
• Get a general feel for the numbers before getting into the analysis
• Simple visualization techniques such as scatter plots and density plots help
• Use simple summary statistics (mean, median, SD, quartiles) to get a better feel for the data
• Check out what different functional forms of your data look like
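As a rough illustration of getting a feel for the numbers, the sketch below computes the summary statistics listed above and then looks at one alternative functional form (the logarithm). The spending figures are invented for the example and are not survey data.

```python
import math
import statistics

# Hypothetical monthly per-capita spending figures (Rs.), for illustration only.
spending = [620, 700, 780, 850, 900, 960, 1100, 1300, 1800, 4200]

mean = statistics.mean(spending)
median = statistics.median(spending)
sd = statistics.stdev(spending)                    # sample standard deviation
q1, q2, q3 = statistics.quantiles(spending, n=4)   # the three quartile cut points

print(f"mean={mean:.0f} median={median:.0f} sd={sd:.0f}")
print(f"quartiles: {q1:.0f}, {q2:.0f}, {q3:.0f}")

# A different functional form: taking logs compresses the long right tail,
# which pulls the mean and the median closer together.
log_spending = [math.log(x) for x in spending]
print(f"mean of logs={statistics.mean(log_spending):.2f}, "
      f"median of logs={statistics.median(log_spending):.2f}")
```

A mean well above the median, as in this made-up sample, is a quick hint that the distribution is skewed and that a plot is worth drawing before any model is fitted.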
While testing hypotheses, be on the lookout for anything interesting/unusual
• It is impossible to generate all possible hypotheses before you begin the analysis
• Usually, as you test out some hypotheses, something in the data will stand out which will lead to further hypotheses
• It is fine to add these hypotheses as you go; this is what makes the process iterative
• However, one needs to be careful to not stray from the original objective – each new hypothesis should directly tie in to the original question
Consolidate results
• Build up your case in a bottom-up manner
• Sometimes different pieces of analysis can throw up contradictory inferences. Check, and reconcile before you integrate
• Make sure all components of the solution that you required are available
• Don’t include a result in the final analysis unless it makes a definite contribution to the final solution
Use graphics intelligently!
• A picture is worth a thousand words, so use clear and easy-to-read visualizations to communicate your findings
• Use visualizations that make the solution self-evident, rather than something that requires a lot of explanation
• Use your graphics to communicate, not to confuse. If a graphic is likely to confuse, it is better to leave it out
• Sometimes all it takes to solve the problem is to visualize the data from a different perspective!
This graphic shows the decile in which Rs. 32 per day (Rs. 960 per month) would fall in each state
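The calculation behind a graphic like this can be sketched as follows: convert the daily line to a monthly figure, then look up which decile of each state's distribution it falls in. The state names and decile cut-offs below are made-up placeholders, not NSSO figures, and `decile_of` is a hypothetical helper.

```python
from bisect import bisect_left

DAILY_LINE_RS = 32
DAYS_PER_MONTH = 30  # the slide's conversion: Rs. 32 per day = Rs. 960 per month
monthly_line = DAILY_LINE_RS * DAYS_PER_MONTH

def decile_of(value, upper_bounds):
    """Return the 1-based decile containing `value`.

    `upper_bounds` holds the nine cut-offs that separate the ten deciles,
    in ascending order; a value equal to a cut-off counts in the lower decile.
    """
    return bisect_left(upper_bounds, value) + 1

# Hypothetical decile cut-offs of monthly per-capita expenditure (Rs.).
# These numbers are invented for illustration and are NOT survey results.
cutoffs_by_state = {
    "State A": [500, 620, 730, 850, 980, 1150, 1400, 1800, 2600],
    "State B": [900, 1100, 1300, 1550, 1800, 2100, 2500, 3200, 4500],
}

for state, cutoffs in cutoffs_by_state.items():
    print(state, "decile:", decile_of(monthly_line, cutoffs))
```

Using consumption deciles rather than income deciles follows the earlier point that consumption is the more reliable measure for the poor.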
Data-driven inference is fraught with pitfalls. Drawing the wrong conclusion from data is easier than drawing the right one.
Beware of outliers
Correlation does not imply causality
Start with getting a feel for the data
Don’t simply throw everything into the mix
Beware of anecdotal evidence
Don’t overfit models
Contradictory inferences from same data
Don’t over-complicate graphics
Models can misbehave
Graphics can deceive
Outliers can significantly distort inferences
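A small sketch of how a single outlier can distort a summary: the mean moves a lot while the median barely changes. The income figures are hypothetical and chosen only to make the effect visible.

```python
import statistics

# Hypothetical household incomes (Rs. thousand per month).
# The last value is a deliberate outlier.
incomes = [18, 20, 21, 22, 23, 24, 25, 500]

mean_with = statistics.mean(incomes)
mean_without = statistics.mean(incomes[:-1])
median_with = statistics.median(incomes)
median_without = statistics.median(incomes[:-1])

print(f"mean:   {mean_without:.1f} without the outlier, {mean_with:.1f} with it")
print(f"median: {median_without:.1f} without the outlier, {median_with:.1f} with it")
```

Whether to drop, cap, or keep such a point depends on the question being asked; the sketch only shows why it needs to be noticed first.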
“Throwing everything into the mix” may not always produce an accurate model
It could lead to multicollinearity, for example
According to this regression, the tallest person should have an extremely large right foot and a tiny left foot! That makes no sense!
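One quick way to spot this problem before running the regression is to check how strongly the predictors are correlated with each other. The sketch below uses invented left-foot and right-foot lengths (not the slide's data) and the two-predictor form of the variance inflation factor, VIF = 1 / (1 - r^2); `pearson_r` is a hypothetical helper written out by hand.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical foot lengths (cm) for seven people, for illustration only.
left_foot = [24.1, 25.0, 25.8, 26.5, 27.2, 28.0, 28.9]
right_foot = [24.0, 25.2, 25.7, 26.6, 27.1, 28.2, 28.8]

r = pearson_r(left_foot, right_foot)

# With exactly two predictors, each one's variance inflation factor is 1 / (1 - r^2).
vif = 1.0 / (1.0 - r ** 2)

print(f"correlation between predictors: {r:.3f}")
print(f"variance inflation factor: {vif:.0f}")
```

A commonly quoted rule of thumb treats a VIF above about 10 as a warning sign. When two predictors carry nearly the same information, keeping only one of them (or their average) usually gives a model that is easier to explain.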
Over-fitting can lead to spurious models
It helps to keep your models as simple as possible. A simple rule of thumb – a good model is one that can be easily explained in simple English
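To make the over-fitting point concrete, the sketch below fits two models to the same five made-up points: a straight line, and a degree-4 polynomial that passes through every point exactly. Both are then asked to predict a sixth, held-out point. The data and the helper names `fit_line` and `fit_interpolating_polynomial` are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial that passes exactly through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

# Hypothetical training data: roughly y = 2x with some noise.
train_x = [0, 1, 2, 3, 4]
train_y = [0.0, 2.5, 3.5, 6.5, 8.0]
held_out_x, held_out_y = 5, 10.0   # a new point the models have not seen

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

print("line prediction:      ", round(line(held_out_x), 2))
print("polynomial prediction:", round(poly(held_out_x), 2))
print("actual value:         ", held_out_y)
```

The polynomial has zero error on the training points, yet it chases the noise and extrapolates badly, while the line (which can be explained in one sentence) stays close to the held-out value.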
Diving into model fitting without first understanding the data can lead to suboptimal results
People are prone to doing regressions without actually looking at the data. Here, a simple linear regression gives a reasonable fit (R^2 = 42%). However, a simple scatter plot would suggest a clear Y = 1/X kind of relationship, which the regression completely misses
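The effect can be reproduced with a few lines of code. The sketch below uses synthetic data that follows Y = 1/X exactly (not the data behind the slide, so the R^2 values will differ from the 42% quoted) and compares a straight-line fit on X with a straight-line fit on the transformed predictor 1/X. `r_squared` is a hypothetical helper.

```python
def r_squared(xs, ys):
    """R^2 of a simple least-squares line y = a + b*x.

    For a one-predictor linear fit this equals the squared Pearson correlation.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Synthetic data following y = 1/x exactly, for illustration only.
x = list(range(1, 11))
y = [1.0 / v for v in x]

r2_raw = r_squared(x, y)                                # line fitted on x
r2_transformed = r_squared([1.0 / v for v in x], y)     # line fitted on 1/x

print(f"R^2 using x:   {r2_raw:.2f}")
print(f"R^2 using 1/x: {r2_transformed:.2f}")
```

A scatter plot of the raw data would suggest the transformation straight away, which is the reason for looking at the data before fitting anything.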
Contradictory inferences can be derived from the same data
Choice of axes and scales can have a significant impact on the message your graphic conveys
[Figure: two panels, both with an x-axis from 0 to 16; the y-axis runs from 0 to 160 in the first panel and from 80 to 150 in the second]
Correlation does not imply causality
Mistaking correlation for causality can lead to hilarious conclusions
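A quick sketch of how this can happen: any two quantities that both grow over time will be strongly correlated, whether or not one has anything to do with the other. The two series below are invented for illustration, and `pearson_r` is a hypothetical helper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical yearly figures for two unrelated quantities that both rise over time.
years = list(range(2001, 2011))
mobile_subscriptions = [5, 9, 14, 22, 30, 41, 55, 68, 83, 100]   # made-up index
cinema_ticket_price = [40, 44, 47, 52, 58, 63, 70, 76, 83, 90]   # made-up Rs.

r = pearson_r(mobile_subscriptions, cinema_ticket_price)
print(f"correlation: {r:.3f}")
```

The high correlation here comes from the shared upward trend over the years, not from one series driving the other, so it says nothing about cause and effect.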
Readers get turned off by overly complicated graphics
Anecdotal or insufficient data can lead to false conclusions
A model is just that: a model. It is not a substitute for reality
The Art of Data Analysis will be further illustrated by means of a detailed Case Study relevant to your company/industry
For a half-day workshop on The Art of Data Analysis (including a case study), contact Karthik Shashidhar at