the art of data analysis

The Art Of Data Analysis Karthik Shashidhar Quant Consultant [email protected] © Karthik Shashidhar

Upload: karthik-shashidhar

Post on 09-May-2015

DESCRIPTION

I conduct workshops on The Art of Data Analysis for corporate clients and at conferences. I recently did the workshop at the Fifth Elephant, a conference on Data in Bangalore. These are the slides I used for that workshop. For corporate clients, I custom develop case studies that are relevant to their company/industry. For more details, contact me at karthik DOT shashidhar AT gmail DOT com

TRANSCRIPT

Page 1: The art of data analysis

The Art Of Data Analysis

Karthik Shashidhar, Quant Consultant, [email protected]

Page 2: The art of data analysis

Introduction

Six-step process

Common Pitfalls

Case Study

Page 3: The art of data analysis

Why do you need this workshop?

We are moving to an increasingly data-driven world

Ability to use data for day-to-day decision-making can prove to be a massive competitive advantage

This workshop equips managers with basic tools for dealing with data

Page 4: The art of data analysis

Who needs this workshop?

Sales Managers: What is the optimal level of sales commissions in order to maximize profitability?

Production Managers: How do we set daily production targets given the probabilities of line shutdowns?

HR Managers: What are the factors that determine employee attrition?

This workshop is suitable for personnel in middle to senior management roles across functions

Page 5: The art of data analysis

Page 6: The art of data analysis

Frame a clear and concise problem statement

Break down your problem into smaller problems, and then use those to generate hypotheses

Gather, clean and prepare data

Test hypotheses. In the process, generate additional hypotheses

Consolidate results to solve the main problem

Make the data tell a story

A structured, iterative approach to data-driven decision making

Page 7: The art of data analysis

Page 8: The art of data analysis

The Rs. 32 Poverty Line

Based on data from the 66th NSSO Survey, the Planning Commission fixed the “Poverty Line” at Rs. 32 per person per day for people living in urban areas. This has led to much controversy and protests. The Prime Minister has asked for your inputs. What do you recommend?

Page 9: The art of data analysis

Frame a clear and concise problem statement

Break down your problem into smaller problems, and then use those to generate hypotheses

Gather, clean and prepare data

Test hypotheses. In the process, generate additional hypotheses

Consolidate results to solve the main problem

Make the data tell a story

For your reference

Page 10: The art of data analysis

How would you frame the problem statement for this one?

• Your client may not have framed the question precisely. You need to do that job and frame a precise problem statement

• “Solving this problem” should tell you everything you want to know from your analysis

• Be concise, so that you stay focused on answering your question

• Frame your question such that it has an objective answer. Yes/No questions or questions with numerical answers are preferred

Page 11: The art of data analysis

Has the poverty line been set too low at Rs. 32 per day?

• This problem statement has an objective answer (yes/no)

• Solving this is both necessary and sufficient to answer the question our client (the PM) is asking

• The question directly addresses the situation (people complaining that the poverty line has been set too low)

• This problem statement is to the point and doesn’t take on additional responsibilities (such as defining an alternate poverty line)

Page 12: The art of data analysis

What problems do we need to solve in order to solve the main problem?

• The set of “level two problems” must be precise and complete, in that:

  • The combination of the solutions of all level two problems leads to the solution of the main problem

  • The solution of each level two problem directly impacts the main problem

• Once again, it is key to frame problems concisely and with objective answers

• We need not stop at two levels. Some level two problems might require solution of deeper problems. Add them to the list of sub-problems

Page 13: The art of data analysis

What do we need to know to answer “Has the poverty line been set too low at Rs. 32 per day?”

• How is “poverty line” defined?

• What are the implications of the poverty line?

• What is the distribution of income in India?

• Does the distribution of income vary across states? If it varies significantly, does it make sense to have a state-wise poverty line?

• What are the essential goods that most people need?

• For a given income level, what essential goods can a person afford?

Page 14: The art of data analysis

Problems generate sub-problems, and some of these will lead to hypotheses.

• Hypothesis 1: There is a significant difference in income levels across states

• Hypothesis 2: Essential goods are those that the poorest people consume. Also, their use flattens out as income goes up

Page 15: The art of data analysis

Some problems, however, are direct, and don’t need hypotheses. Some are qualitative while others need data

• Question 1: How is “poverty line” defined?

  • Poverty line is the minimum income level that is deemed adequate

  • If a family is “below poverty line” it qualifies for additional state benefits

• Question 2: What is the distribution of incomes in each state?

• Question 3: Is there a threshold on the proportion of the population that can be below the poverty line?

Page 16: The art of data analysis

What data do you need here?

• It is important to frame the problem and break it down into components before listing data requirements; otherwise the available data could bias you

• Define data requirements in general terms, so that you can easily substitute proxies

• Remember to gather data that both answers your questions and will allow you to test your hypotheses

Page 17: The art of data analysis

Once you’ve identified data requirements, identify sources and gather data

• Here we need:

  • Distribution of a measure of income for India

  • Distribution of a measure of income for each state

  • Spending patterns for different income levels

  • Data on household sizes in different states

Page 18: The art of data analysis

Once you’ve identified data requirements, identify sources and gather data

• The National Sample Survey Organization (NSSO) conducts surveys every 5 years about income and expenditure, so we could perhaps use this

• However, income data gathered from surveys are notoriously unreliable

• The poor have little savings, so their total consumption is a better indicator of income than the reported income data

Page 19: The art of data analysis

Data cleaning is an ugly but important step

• It is important to make sure names in data procured from different sources match

  • For example, some government sites say “AndhraPradesh”, while others say “Andhra Pradesh”. A join on the state name fails unless the two are reconciled

• If the data set is small, go through it once to check the numbers for consistency. For example, if you have data on percentages, make sure they add up to 100%

• For larger data sets, try writing scripts to do basic cleaning
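A basic cleaning script of the kind suggested above might look like the following minimal sketch. The function names, the normalization rule (lowercase letters only) and the placeholder rows are assumptions made for illustration; they are not taken from the workshop.

```python
import re


def normalize_state(name: str) -> str:
    """Build a join key: lowercase letters only, no spaces or punctuation."""
    return re.sub(r"[^a-z]", "", name.lower())


def shares_sum_to_100(shares, tolerance=0.5):
    """Check that a list of percentages adds up to 100, within a rounding tolerance."""
    return abs(sum(shares) - 100.0) <= tolerance


# Placeholder rows from two hypothetical sources that spell state names differently.
source_a = {"AndhraPradesh": 1, "Tamil Nadu": 2}
source_b = {"Andhra Pradesh": 10, "TamilNadu": 20}

a_by_key = {normalize_state(k): v for k, v in source_a.items()}
b_by_key = {normalize_state(k): v for k, v in source_b.items()}

# Join on the normalized key instead of the raw spelling.
joined = {k: (a_by_key[k], b_by_key[k]) for k in a_by_key if k in b_by_key}

print(joined)
print(shares_sum_to_100([40, 35, 25]), shares_sum_to_100([40, 35, 20]))
```

Stripping everything except letters is a blunt rule; real data may need an explicit lookup table for abbreviations and renamed states.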

Page 20: The art of data analysis

Understand and prepare data before you dive into analysis

• Get a general feel for the numbers before getting into the analysis

• Simple visualization techniques such as scatter plots and density plots help

• Use simple summary statistics (mean, median, SD, quartiles) to get a better feel for the data

• Check what different functional forms of your data (for example, logarithms or reciprocals) look like
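As a rough illustration of getting a feel for the numbers, the sketch below computes the summary statistics listed above for the raw values and for their logarithms. The spending figures are made up for the example, and `summarize` is a hypothetical helper, not part of the workshop material.

```python
import math
import statistics


def summarize(values):
    """Return mean, median, standard deviation and quartiles of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "q1": q1,
        "q3": q3,
    }


# Made-up monthly spending figures (Rs.), skewed by a few large values.
spending = [600, 750, 800, 900, 960, 1100, 1500, 2400, 5000]

raw = summarize(spending)
logged = summarize([math.log(v) for v in spending])  # one alternative functional form

print(raw)
print(logged)
```

A mean well above the median, as in the raw figures here, is a quick sign of a right-skewed distribution, which is one reason to look at the log form as well.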

Page 21: The art of data analysis

While testing hypotheses, be on the lookout for anything interesting/unusual

• It is impossible to generate all possible hypotheses before you begin the analysis

• Usually, as you test out some hypotheses, something in the data will stand out which will lead to further hypotheses

• It is fine to generate these hypotheses as you go; this is what makes the process iterative

• However, one needs to be careful to not stray from the original objective – each new hypothesis should directly tie in to the original question

Page 22: The art of data analysis

Consolidate results

• Build up your case in a bottom-up manner

• Sometimes different pieces of analysis can throw up contradictory inferences. Check, and reconcile before you integrate

• Make sure all the components that the solution requires are available

• Don’t include a result in the final analysis unless it makes a definite contribution to the final solution

Page 23: The art of data analysis

Use graphics intelligently!

• A picture is worth a thousand words, so use clear and easy-to-read visualizations to communicate your findings

• Use visualizations that make the solution self-evident, rather than something that requires a lot of explanation

• Use your graphics to communicate, not to confuse. If the intent of a graphic is to confuse, it is better to leave out that graphic

• Sometimes all it takes to solve the problem is to visualize the data from a different perspective!

Page 24: The art of data analysis

This graphic shows the decile in which Rs. 32 per day (Rs. 960 per month) would fall in each state
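The lookup behind a graphic like this can be sketched as follows: for each state, find the decile of the monthly spending distribution into which Rs. 960 (Rs. 32 × 30 days) falls. The per-state figures below are invented placeholders rather than NSSO data, and `decile_of` is a hypothetical helper.

```python
from bisect import bisect_left

POVERTY_LINE_PER_MONTH = 32 * 30  # Rs. 960, as on the slide


def decile_of(threshold, values):
    """Return the decile (1 to 10) of `values` into which `threshold` falls."""
    ordered = sorted(values)
    below = bisect_left(ordered, threshold)  # observations strictly below the threshold
    # Integer arithmetic avoids floating-point surprises at decile boundaries.
    return min((below * 10) // len(ordered) + 1, 10)


# Invented monthly per-person spending samples for two hypothetical states.
samples = {
    "State A": [500, 700, 800, 900, 1000, 1200, 1500, 2000, 3000, 5000],
    "State B": [300, 400, 450, 500, 550, 600, 700, 800, 900, 1100],
}

deciles = {state: decile_of(POVERTY_LINE_PER_MONTH, values) for state, values in samples.items()}
print(deciles)
```

With real survey data each observation would normally carry a sampling weight, which this unweighted sketch ignores.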

Page 25: The art of data analysis

Page 26: The art of data analysis

Data-driven inference is fraught with pitfalls. Drawing the wrong conclusion out of data is easier than drawing the right conclusion.

Beware of Outliers

Correlation does not imply causality

Start with getting a feel for the data

Don’t simply throw everything into the mix

Beware of anecdotal evidence

Don’t overfit models

Contradictory inferences from same data

Don’t over-complicate graphics

Models can misbehave

Graphics can deceive

Page 27: The art of data analysis

Outliers can significantly distort inferences
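A small, made-up example shows the effect: one extreme value drags the mean far from the bulk of the data, while the median barely moves. The numbers are illustrative only.

```python
import statistics

# Made-up values; the last entry of `with_outlier` is a single extreme observation.
typical = [20, 22, 25, 27, 30]
with_outlier = typical + [1000]

print(statistics.mean(typical), statistics.median(typical))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

Comparing the mean with the median is a cheap first check for outliers before fitting anything.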

Page 28: The art of data analysis

“Throwing everything into the mix” may not always produce an accurate model

Page 29: The art of data analysis

According to this regression, the tallest person should have an extremely large right foot and a tiny left foot! That makes no sense!

It could lead to multicollinearity, for example
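One way to spot this problem before fitting is to check how strongly the predictors are correlated with each other. The sketch below does this for two made-up foot-length series and computes the variance inflation factor, which for a pair of predictors is 1 / (1 - r²). The data and helper names are assumptions for illustration, not the regression from the slide.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up left and right foot lengths (cm) for six people.
left_foot = [24.0, 25.1, 26.0, 27.2, 28.1, 29.0]
right_foot = [24.1, 25.0, 26.1, 27.1, 28.0, 29.1]

r = pearson(left_foot, right_foot)
vif = 1.0 / (1.0 - r ** 2)  # variance inflation factor for a pair of predictors

print(round(r, 4), round(vif, 1))
```

When two predictors carry almost the same information, dropping one of them (or using their average) usually gives a model that is easier to explain.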

Page 30: The art of data analysis

It helps to keep your models as simple as possible. A simple rule of thumb – a good model is one that can be easily explained in simple English

Over-fitting can lead to spurious models
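The sketch below illustrates over-fitting on made-up data: six noisy points around a straight line are fitted once with a straight line and once with a degree-5 polynomial that passes through every point. The underlying trend, the noise pattern and the helper names are all assumptions chosen for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial through every training point (degree len(xs) - 1)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def max_abs_error(model, xs, ys):
    return max(abs(model(x) - y) for x, y in zip(xs, ys))


def trend(x):
    return 2 * x  # assumed "true" relationship for this illustration


train_x = [0, 1, 2, 3, 4, 5]
noise = [1, -1, 1, -1, 1, -1]  # made-up measurement noise
train_y = [trend(x) + e for x, e in zip(train_x, noise)]

held_out_x = [0.5, 4.5]
held_out_y = [trend(x) for x in held_out_x]

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

print("train error:", max_abs_error(line, train_x, train_y), max_abs_error(poly, train_x, train_y))
print("held-out error:", max_abs_error(line, held_out_x, held_out_y), max_abs_error(poly, held_out_x, held_out_y))
```

The polynomial fits the training points perfectly but does much worse than the straight line on the held-out points, because it has fitted the noise rather than the trend.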

Page 31: The art of data analysis

People are prone to doing regressions without actually looking at the data. Here, a simple linear regression gives a reasonable fit (R^2 = 42%). However, a simple scatter plot would suggest a clear Y = 1/X kind of relationship which the regression completely misses out on

Diving into model fitting without first understanding the data can lead to suboptimal results
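To make the point concrete, the sketch below fits a straight line to made-up data that follows Y = 1/X exactly, first against X and then against 1/X, and compares the two R^2 values. This is not the data set behind the slide, so the numbers will differ from the 42% quoted above; `r_squared` is a hypothetical helper.

```python
def r_squared(xs, ys):
    """R^2 of a simple linear regression of ys on xs (squared Pearson correlation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


# Made-up data that follows Y = 1/X exactly.
x = [float(i) for i in range(1, 11)]
y = [1.0 / v for v in x]

r2_linear = r_squared(x, y)                         # regress Y on X
r2_inverse = r_squared([1.0 / v for v in x], y)     # regress Y on 1/X

print(round(r2_linear, 3), round(r2_inverse, 3))
```

The straight line on X looks like a passable fit on paper, yet a scatter plot or the transformed regression shows that the relationship is captured far better by 1/X.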

Page 32: The art of data analysis

Contradictory inferences can be derived from the same data

Page 33: The art of data analysis

[Figure: two charts sharing an x-axis that runs from 0 to 16; the y-axis runs from 0 to 160 in the first chart and from 80 to 150 in the second]

Choice of axes and scales can have a significant impact on the message your graphic conveys

Page 34: The art of data analysis

Correlation does not imply causality

Page 35: The art of data analysis

Mistaking correlation for causality can lead to hilarious conclusions
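One common source of misleading correlations is a shared trend: two series that both grow over time will look strongly correlated even if neither drives the other. The sketch below uses two made-up yearly series to show this; comparing year-on-year changes instead of levels is one possible sanity check, not a test for causality.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def changes(series):
    """Year-on-year differences of a series."""
    return [b - a for a, b in zip(series, series[1:])]


# Two made-up yearly series that both happen to trend upwards.
series_a = [100, 106, 109, 116, 120, 124, 131, 134, 141, 145]
series_b = [20, 21, 25, 26, 28, 31, 32, 35, 36, 39]

r_levels = pearson(series_a, series_b)
r_changes = pearson(changes(series_a), changes(series_b))

print(round(r_levels, 3), round(r_changes, 3))
```

The levels are almost perfectly correlated, but the year-on-year changes move in opposite directions, which suggests the apparent link comes from the common upward trend.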

Page 36: The art of data analysis

Readers get turned off by overly complicated graphics

Page 37: The art of data analysis

Anecdotal/insufficient data can lead to false conclusions
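A quick simulation gives a sense of why small samples mislead: averages computed from a handful of observations swing far more from sample to sample than averages computed from many. The population, sample sizes and seed below are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulated population: 10,000 values around 100 with a spread of 20 (arbitrary).
population = [rng.gauss(100, 20) for _ in range(10_000)]


def spread_of_sample_means(sample_size, repeats=200):
    """Standard deviation of the means of `repeats` random samples."""
    means = [statistics.mean(rng.sample(population, sample_size)) for _ in range(repeats)]
    return statistics.stdev(means)


small_sample_spread = spread_of_sample_means(5)
large_sample_spread = spread_of_sample_means(500)

print(small_sample_spread, large_sample_spread)
```

A conclusion drawn from five observations could easily flip with the next five, which is the risk of relying on anecdotes.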

Page 38: The art of data analysis

A model is just that: a model. It is not a substitute for reality

Page 39: The art of data analysis

The Art of Data Analysis will be further illustrated by means of a detailed Case Study relevant to your company/industry

For a half-day workshop on The Art of Data Analysis (including a case study), contact Karthik Shashidhar at [email protected]