Post on 17-Oct-2014
Lies, Damn Lies and Anti-Statistics
Alan McSweeney
May 18, 2010
Objective
• Introduce the concept of distorting “anti-statistics”, illustrate how “anti-statistics” can be identified and define how statistics should be constructed to yield insight and meaning
Statistics
• A statistic has two roles - primary and secondary
− Primary - to summarise and describe the data while preserving information and reducing the volume of raw data
− Secondary - to provide and enable insight
• Where an alleged statistic does not perform these functions it is an “anti-statistic”
− Distorting the underlying information (raw data), either deliberately or accidentally
− Not providing insight or providing an inaccurate view of the underlying information
• Most people are scared of large sets of numbers
− Anti-statistics exploit this fear
Statistics and Anti-Statistics
• Statistics
− Descriptive
− Insightful
− Informative
− Enlightening
• Anti-Statistics
− Distorting
− Promoting Misinterpretation
− Misinformative
− Concealing
Statistics - Primary Function
• To describe the data while preserving information and reducing the volume of raw data
• This means taking a large amount of raw data, producing descriptive summaries while not losing or distorting the underlying raw data
• This is the more important function of a statistic
Statistics - Secondary Function
• To provide and enable insight
• By reducing the volume of raw data, you can gain insight into what the data means
− Enabling you to see the wood for the trees, know the amount and type of wood and make decisions about the use of the wood
• This is the secondary function, and it only applies if the primary function is satisfied
Data, Information, Knowledge and Action Cycle
• Good statistics provide information that creates knowledge and enables correct actions
[Diagram: a cycle running Data → Information → Knowledge → Action → Data]
Information – Lots of It
Sample Information
• 4,000 numbers representing the annual salaries of individuals
− Sample data only
• 100% of the information is available here
• Very hard to see patterns, understand the situation, gain insight and make effective decisions and understand their consequences
• The numbers do not lie but they are innocent creatures and can be made to lie
• Need techniques that extract meaning and provide insight without losing the information the data represents
Statistics
• I can take all this …
• … And give you one derived number (average)
− 107941.931
Statistic
• 4,000 numbers reduced to 1
• Reduced the amount of data by 99.975% (another “statistic”)
• But I have lost information
• Average value of 107941.931 is at best a simplistic view of the data and at worst a distortion that misrepresents the source data
• If I use the average without looking to understand the raw data in more detail I am potentially creating a distortion
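The loss of information can be illustrated with a minimal sketch. The two salary lists below are hypothetical and are not the sample data from the slides; they only show that very different sets of numbers can collapse to the same average.

```python
from statistics import mean

# Hypothetical salary lists, invented for illustration only.
equal_pay = [50_000, 50_000, 50_000, 50_000]
unequal_pay = [10_000, 10_000, 10_000, 170_000]

# Both lists collapse to the same single number.
print(mean(equal_pay), mean(unequal_pay))

# Reducing 4,000 values to 1 removes 99.975% of the numbers.
reduction = (1 - 1 / 4000) * 100
print(f"{reduction:.3f}%")
```

Anyone given only the average cannot tell which of the two lists it came from.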
More Statistics
• Be careful what statistics are used
• Do not generate statistics just because you can
• The use of statistics can give a false impression of certainty or meaning where there is none
Statistic | Value | Description
Average | 107941.93 | Sum of all the values divided by the number of values
Standard Deviation | 59904.19 | A measure of how widely values are dispersed from the average value
Kurtosis | 0.112 | Value that describes the relative peakedness or flatness of a distribution, where a positive value indicates a relatively peaked distribution and a negative value indicates a relatively flat distribution
Skewness | 0.731 | A measure of the asymmetry of a distribution around the average, where a positive value indicates a distribution with an asymmetric tail extending toward more positive values and a negative value indicates a distribution with an asymmetric tail extending toward more negative values
Mode | 23958 | The most frequently occurring value
Median | 97909.5 | The number in the middle, where half the numbers have values that are greater than the median and half have values that are less; also called the 50th percentile
Interpreting the Statistics
• I now know that the data is clustered at lower values and has a heavy tail, indicating a small number of people earning large salaries
Statistic | Value | Interpretation
Average | 107941.93 | The average is higher than the median, indicating that the data is dispersed unequally towards higher values
Standard Deviation | 59904.19 | The high standard deviation indicates the underlying data is spread across a wide range of values
Kurtosis | 0.112 | The positive value indicates that there is a peak in the data
Skewness | 0.731 | The positive value indicates a distribution with an unequal and heavy tail extending toward higher values
Mode | 23958 | In a large set of data where only a small number of data values are the same, this is meaningless
Median | 97909.5 | When the median is less than the average, the data is unequally distributed with a heavy tail extending toward higher values
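The statistics in the table above can be computed along the following lines. This is a sketch on a small hypothetical list of salaries (in thousands), not the 4,000-value sample. The skewness and kurtosis use simple population moment formulas, which is an assumption of this sketch; some spreadsheet functions apply small-sample corrections and will give slightly different values.

```python
from statistics import mean, median, mode, pstdev

# Hypothetical salaries (in thousands), chosen to have a long upper tail.
salaries = [20, 25, 30, 30, 35, 40, 60, 120]

avg = mean(salaries)
mid = median(salaries)
most_common = mode(salaries)
spread = pstdev(salaries)  # population standard deviation

# Moment-based shape measures (population form, an assumption of this sketch).
n = len(salaries)
m2 = sum((x - avg) ** 2 for x in salaries) / n
m3 = sum((x - avg) ** 3 for x in salaries) / n
m4 = sum((x - avg) ** 4 for x in salaries) / n
skewness = m3 / m2 ** 1.5
excess_kurtosis = m4 / m2 ** 2 - 3

print("average:", avg, "median:", mid, "mode:", most_common)
print("standard deviation:", round(spread, 2))
print("skewness:", round(skewness, 3), "kurtosis:", round(excess_kurtosis, 3))
```

As with the sample data, the average sits above the median and the skewness is positive, both of which point to a tail of high values.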
Let’s Take a Look at the Data
[Chart: histogram of Number of People (0 to 60) against Annual Salary (0 to 300000)]
Let’s Take a Look at the Data
• Characteristics
− Increases quickly from zero
− Distribution skewed to the right (positive skew): most values sit on the left and the tail extends to the right
− Clustered around lower values
− Gradual drop from peak
− Heavy tail
• This type of data distribution is very common
[Chart: the same histogram of Number of People against Annual Salary, annotated with the characteristics listed above]
Statistics
• The usefulness of a statistic depends on the underlying data
• Average really only makes sense when the data is symmetrically/equally distributed
− Otherwise, the average is distorted because of unequal distribution of data
• Deviation also really only makes sense when the data is symmetrically distributed
[Chart: example of a symmetric distribution; horizontal axis from -5 to about 4.7, vertical axis from 0 to 0.4]
Statistics
• Be careful of obscure statistics such as Kurtosis and Skewness
• They have a use but the meaning is quite specific and may not be appropriate
Descriptive Statistics
• Look for statistics that contain
− Measures of data location and clustering
− Measures of dispersion and variability
− Measures of association
• Look at the underlying data, how it was collected, what it measures
− If the data is of poor quality or measures the wrong values, any derived information will have very limited worth
• There are lots of statistics that can be produced from the raw data
− Produce only meaningful statistics
− Do not throw statistics at the data
Some Common Descriptive and Summarising Statistics
Statistic Type | Statistic | Description
Data Location and Clustering | Average | Simple average
Data Location and Clustering | Weighted Average | Average of values weighted according to a value such as their importance
Data Location and Clustering | Truncated/Interpercentile Average | Average of centralised subset of data
Data Location and Clustering | Median | The 50th percentile
Data Location and Clustering | Mode | The most commonly occurring value
Dispersion, Variability and Shape | Variance | Measure of the amount of variation within the data
Dispersion, Variability and Shape | Standard Deviation | Square root of the Variance
Dispersion, Variability and Shape | Range | The spread of the data values
Dispersion, Variability and Shape | Skewness | Measure of the asymmetry of the distribution of the data
Dispersion, Variability and Shape | Kurtosis | Measure of the "peakedness" and the length of the tail of the distribution of the data
Dispersion, Variability and Shape | Percentiles | Value below which a certain percent of the data fall
Association | Correlation | Correlation has a specific meaning that may not be relevant to the data
Another Look at the Sample Data
• This shows the salary at or below which each cumulative percentage of the people surveyed falls
[Chart: Annual Salary (0 to 320000) plotted against Percentage Earning Up to Salary Amount (0% to 100%)]
Another Look at the Sample Data
[Charts: Percentage of People and Number of People in each Salary Range]
Salary Range | Number of People | Percentage of People
0 - 10000 | 24 | 0.6%
10000 - 20000 | 83 | 2.1%
20000 - 30000 | 133 | 3.3%
30000 - 40000 | 193 | 4.8%
40000 - 50000 | 237 | 5.9%
50000 - 60000 | 268 | 6.7%
60000 - 70000 | 283 | 7.1%
70000 - 80000 | 285 | 7.1%
80000 - 90000 | 280 | 7.0%
90000 - 100000 | 267 | 6.7%
100000 - 110000 | 249 | 6.2%
110000 - 120000 | 230 | 5.8%
120000 - 130000 | 209 | 5.2%
130000 - 140000 | 187 | 4.7%
140000 - 150000 | 166 | 4.2%
150000 - 160000 | 146 | 3.7%
160000 - 170000 | 128 | 3.2%
170000 - 180000 | 112 | 2.8%
180000 - 190000 | 96 | 2.4%
190000 - 200000 | 84 | 2.1%
200000 - 210000 | 67 | 1.7%
210000 - 220000 | 55 | 1.4%
220000 - 230000 | 47 | 1.2%
230000 - 240000 | 38 | 1.0%
240000 - 250000 | 32 | 0.8%
250000 - 260000 | 27 | 0.7%
260000 - 270000 | 22 | 0.6%
270000 - 280000 | 20 | 0.5%
280000 - 290000 | 17 | 0.4%
290000 - 300000 | 15 | 0.4%
Percentiles
• The Nth percentile of a set of data is the value below which N percent of the data lies
• Median = 50th percentile
− Value below which 50% of data lies
• Quartiles are the 25th, 50th and 75th percentiles
• Percentiles are useful in summarising data
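As a small illustration of these definitions, quartiles can be computed with Python's standard `statistics` module. The list of values is hypothetical. Note that several slightly different conventions exist for interpolating percentiles, so other tools may return marginally different cut points for the same data.

```python
from statistics import median, quantiles

# Hypothetical values used only for illustration.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 asks for the three cut points that split the data into four equal parts.
q1, q2, q3 = quantiles(values, n=4)
print(q1, q2, q3)

# The second quartile is the 50th percentile, that is, the median.
print(q2 == median(values))
```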
Percentiles for Sample Data
• This … • … becomes this …
• 4,000 numbers reduced to 10 numbers
− 10% of people earn 38,332 or less
− 20% of people earn 54,834 or less
− 10% of people earn between 192,871 and 299,433
• Successfully reduced the volume of data while preserving more information
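The percentile figures quoted here can be approximately reproduced from the salary-range counts shown in the earlier "Number of People" chart. The sketch below assumes salaries are spread evenly within each 10,000-wide range, which is an assumption of this example and not something the slides state, so the results are estimates and will not match the exact figures. The function name `percentile_from_bins` is hypothetical.

```python
# Counts per 10,000-wide salary range, copied from the "Number of People" chart.
counts = [24, 83, 133, 193, 237, 268, 283, 285, 280, 267,
          249, 230, 209, 187, 166, 146, 128, 112, 96, 84,
          67, 55, 47, 38, 32, 27, 22, 20, 17, 15]
BIN_WIDTH = 10_000
total = sum(counts)  # number of people in the sample


def percentile_from_bins(p):
    """Estimate the p-th percentile, assuming an even spread within each range."""
    target = total * p / 100
    running = 0
    for i, count in enumerate(counts):
        if running + count >= target:
            fraction = (target - running) / count
            return (i + fraction) * BIN_WIDTH
        running += count
    return len(counts) * BIN_WIDTH


# Midpoint-based estimate of the average, for comparison with the median.
mean_estimate = sum(c * (i + 0.5) * BIN_WIDTH for i, c in enumerate(counts)) / total

for p in (10, 20, 50, 90):
    print(f"{p}th percentile is roughly {percentile_from_bins(p):,.0f}")
print(f"average is roughly {mean_estimate:,.0f}")
```

The estimates land close to the figures quoted on the slides, and the estimated average sits well above the estimated median, as expected for data with a heavy upper tail.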
Anti-Statistics
• Unfortunately, anti-statistics are everywhere
• Take a number of general forms or types such as
− Statement based on measurement of incorrect value
− Statement without scale or reference
− Statement based on grouping of categories (with possible distortion of categories)
− Statement based on inaccurate or unspecified association or correlation
Sample Type 1 Anti-Statistic
• Chimpanzee DNA is 99.7% the same as Human DNA
• What does this statement mean?
− Do chimpanzees make cars/houses/PCs/etc. that are 99.7% as good as those made by humans?
• If the statement is true then what is being measured may be invalid, such as
− 000000000000000000000000 and 000000000000000000000001
− These numbers are 99% the same based on the length of the lines in their characters
− Or a lot of DNA is not involved in the development process and this is being included in measurements
− Or a small change in DNA has a substantial impact on what is produced
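The point about measuring the wrong thing can be sketched with the two digit strings above. The comparison below counts matching character positions, which is only one possible similarity measure and not necessarily the one used on the slide; it is contrasted with comparing the values the strings represent.

```python
a = "000000000000000000000000"
b = "000000000000000000000001"

# Character-by-character comparison: the share of positions that match.
matches = sum(x == y for x, y in zip(a, b))
char_similarity = matches / len(a) * 100

# Comparison by what the strings represent: their numeric values.
value_a, value_b = int(a), int(b)

print(f"{char_similarity:.1f}% of characters match")
print(f"values represented: {value_a} and {value_b}")
```

The strings are almost identical as sequences of characters, yet one represents nothing and the other represents something, so a high similarity score says little about the difference that matters.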
Sample Type 2 Anti-Statistic
• Statements of the form
− X is the greatest cause of Y, such as
• Car crashes are the greatest cause of deaths among males in their 20s and 30s
• Meaningless because there is no scale or reference point
• Statement creates an impression of scale and severity that is at best not justified or at worst incorrect
• Take a look at the underlying life expectancy data
Type 2 Anti-Statistic
• Probability of a person dying within a year at each year of life
• Probability of a person dying within a year for first 35 years
[Charts: Probability of Dying Within One Year plotted against age, first from 0 to 105 years (vertical axis 0 to 0.6), then from 0 to 35 years (vertical axis 0 to 0.0045)]
Type 2 Anti-Statistic
• The underlying life expectancy data shows that young people have very little chance of dying
• Death rates are uniformly very low after the first year of life until about age 50
• So a statement such as
− Car crashes are the greatest cause of deaths among males in their 20s and 30s
• Will inevitably be true because nothing else really kills young males
− Death due to illness is uncommon among this group so any other cause will dominate
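A rough sketch of why "greatest cause" needs a scale: the death rates below are invented purely for illustration and are not taken from the life expectancy data. They show one cause ranking first in a group where very few people die, while the identical rate goes unnoticed in a group where many do.

```python
# Hypothetical annual deaths per 100,000 people, invented purely for illustration.
deaths_per_100k = {
    "males aged 20-39": {"car crashes": 20, "illness": 15, "other": 15},
    "males aged 70-79": {"car crashes": 20, "illness": 2500, "other": 480},
}

for group, causes in deaths_per_100k.items():
    top_cause = max(causes, key=causes.get)
    total = sum(causes.values())
    share = causes[top_cause] / total * 100
    print(f"{group}: greatest cause = {top_cause} "
          f"({share:.0f}% of deaths, {causes[top_cause]} per 100,000 people)")
```

In this made-up example the risk from car crashes is the same in both groups; it is the "greatest cause" for the younger group only because everything else is rare.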
Sample Type 3 Anti-Statistic
• Statements of the form
− N% of people do/have done X at least N times/with defined frequency
− Typically arise as the results of tendentious surveys designed to create a false impression of severity
• Such as
− 75% of people admit to X up to N times a year (no indication of how the 75% is spread across the range of 1 to N times)
− 65% of people admit to having a negative experience up to N times due to X (no indication of the spread of negative experiences across the range of 1 to N)
• Generally a result of combining the responses to two or more questions or categories
− How often have you done/experienced X?
• Once
• Twice
• Three times
• …
Type 3 Anti-Statistic
• How often have you done/experienced X?
− Once: 45%
− Twice: 10%
− Three times: 8%
− 4-8 times: 5%
− 8-12 times: 2%
• Total of these is 70%
• Statement that 70% of people have done/experienced X up to 12 times a year distorts the distribution of the underlying data that is skewed towards lower rates of occurrence
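The effect of combining categories can be sketched directly from the response shares listed on this slide (as listed, the five shares add up to 70). The variable names are illustrative only.

```python
# Response shares as listed on the slide (percent of all people surveyed).
responses = {
    "Once": 45,
    "Twice": 10,
    "Three times": 8,
    "4-8 times": 5,
    "8-12 times": 2,
}

headline = sum(responses.values())  # the single combined figure
once_share = responses["Once"] / headline * 100
three_or_fewer = (responses["Once"] + responses["Twice"]
                  + responses["Three times"]) / headline * 100

print(f"Headline: {headline}% have done X up to 12 times")
print(f"{once_share:.0f}% of those did it only once")
print(f"{three_or_fewer:.0f}% of those did it three times or fewer")
```

The combined headline figure hides that the great majority of the people it counts sit at the low end of the range.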
Sample Type 4 Anti-Statistic
• Statements of the form
− Taking/doing A makes you N% more likely to be/experience B
• Two issues
− Causation – is there a real causal relationship
− Degree of causation – how strong is the causal relationship
• An association does not imply a causation
− A might cause B
− B might cause A
− A might cause B and B might cause A
− A might cause C that might cause B
− A might cause C that might cause D … that might cause B
− A might cause C that might cause B and A might cause D that might negatively influence B, but A-C-B causation is greater than A-D-B negative causation
− Measuring error
− Random data that was skewed
− Deliberate or malicious misrepresentation
• Cause might be partial or contributory
• Be careful of any statement of a relationship that does not demonstrate how causation happens
Association and Causation Scenarios
[Diagrams: causation scenarios between A and B, some direct and some passing through C and D, drawn as arrows labelled "Causes or Influences", with one path labelled "Negatively Causes or Influences"]
Association and Causation
• Very common scenario where an association or causation is asserted
− A takes or does D
− Taking or doing D affects or causes B
Association and Causation
• The real association or causation is actually along the lines of:
− A takes or does D
− A is a member of a group C
− Members of group C have a greater tendency to take or do D
− Members of group C also take or do E
− Taking or doing E affects or causes B
− Taking or doing D has little or no effect or influence on B or even negatively impacts B
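This scenario can be sketched with hypothetical counts. In the made-up data below, members of group C experience B more often (standing in for the effect of E) and are also more likely to take D, while D itself has no effect on B within either group. All names and numbers are assumptions for illustration.

```python
# Hypothetical counts: (group, takes_D) -> (people, people_experiencing_B).
# Within each group the rate of B is identical whether or not D is taken.
people = {
    ("in group C", True): (80, 40),
    ("in group C", False): (20, 10),
    ("not in group C", True): (20, 2),
    ("not in group C", False): (80, 8),
}


def rate(rows):
    """Share of people experiencing B across the given rows."""
    total = sum(n for n, _ in rows)
    with_b = sum(b for _, b in rows)
    return with_b / total


takers = rate([v for (_, d), v in people.items() if d])
non_takers = rate([v for (_, d), v in people.items() if not d])
print(f"Overall: {takers:.0%} of D takers experience B vs {non_takers:.0%} of non-takers")

for group in ("in group C", "not in group C"):
    with_d = rate([people[(group, True)]])
    without_d = rate([people[(group, False)]])
    print(f"{group}: {with_d:.0%} with D vs {without_d:.0%} without D")
```

Looked at as a whole, people who take D appear far more likely to experience B; split by group membership, D makes no difference at all.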
Type 4 Anti-Statistic
• Occurs very frequently
• A percentage association can give a false sense of certainty
− It only measures how loose or tight the association is
• Often misrepresents the degree of causation
• Unless the precise nature of the causative relationship can be defined, take the claim with a large dose of salt
Summary
• Statistics are designed to provide insight without distorting the meaning of the underlying data or losing information
• Anti-statistics are used to distort the underlying data to create false impressions
• So there are Lies, Damn Lies and Anti-Statistics