
    ETC1000/ETW1000/ETX9000 Business and Economic Statistics

    LECTURE NOTES

    Topic 1: Knowing What is Happening

1. Introduction: What is Econometrics and Business Statistics (and Why Should I Learn It)?

Econometrics and business statistics is about getting the best information from data. When you have good information, you can use it to make good decisions. The "econo" and "business" parts mean that the data is economic or business data, and/or the decision is an economic or business decision.

How can data be used in business and economic decision making? Consider some decisions that business and government may need to make:

Business decisions
e.g. A telecommunications company is trying to decide whether to branch out into a new demographic of mobile phone users, targeting teenagers. Will it be economical? Will teenagers take them up?

Economic policy decisions
e.g. The government is trying to decide whether to continue to support children/young adults who have grown up in state care after they turn 18. How do these young adults survive after they leave care? Do they end up without jobs and in poor health once the support is removed, and thus incur a higher cost to the government than if they were supported until they were 25?

    How do we answer these questions in order to make such decisions?

Business managers and policymakers need data: they need evidence which can help them understand the way things are, weigh up options, identify problems, look at consequences, evaluate performance, etc. It is not enough to go on intuition, or on what is commonly accepted wisdom. Often this can be wrong or misleading.

    What can you expect to get out of this topic?

    Learn how to present and summarise data in a meaningful way.

    Learn how to use data to make informed decisions.

    Develop critical and analytical thinking about data and decisions made from it.

    2. Collecting Relevant Data

    2.1 Recording Business Activities

With so much business activity taking place via computers, businesses already have data being collected on many aspects of their activities: sales, costs, etc. This historical data can be invaluable in picking trends and planning for the future.

e.g. A telecommunications company may be interested in how their mobile phone sales are going: how many phones are sold per month, and has this begun to


slow over the last few years as the market has become saturated? Is there a clear pattern of busy months and quiet months?

    Here is the data they were able to obtain from their business records.

[Figure: Time series plot of Mobile Phone Sales (Thousands), quarterly from Q1-1996 to Q3-2006; vertical axis 0 to 800.]

From this plot of sales over time, we can see that mobile phone sales have changed a lot over the last 10 years. In the December quarter of 2005, sales were more than twice those in the December quarter of 1996. This would suggest there is a general increase in sales over the 10 years. Looking a little more closely at the plot, we can see that there are some regular peaks in sales in the December quarter of each year, followed by a couple of low-sales quarters, a phenomenon known as seasonality. The peak in the December quarter of each year would suggest that mobile phones tend to be a common Christmas present in Australia!

    2.2 Surveys and Sampling

A sample survey is a very common source of data. Surveys can be a useful way of gaining specific information.

e.g. Tourism: You want to assess customer satisfaction with the service they receive at your tourist resort.

Retail sales: You want to understand spending patterns of people in a particular product area so that you can target advertising better.

Local Council: You want to assess community needs so that you can ensure that facilities being provided are appropriate.

    There are two big issues in collecting survey data.

    (1) Survey Design

The results of a survey are only as good as the design of the questions / form. For example, the wording of questions is vital.

A bad question:
In light of recent claims that the Premier has engaged in corrupt practices, who do you think is the more honest leader: the Premier or the Leader of the Opposition?


    Wording designed to influence towards a particular outcome.

This could be worded better:
Who do you think is the more honest leader: the Premier or the Leader of the Opposition?

    (2) Sample Design

Obviously you cannot survey everyone in the population, so you need to take a sample: it would take too long to do the surveys, not everyone would be willing to participate, and the time and cost in processing would be too great.

e.g. You might ask a small group of visitors to your tourist resort to complete a questionnaire at the end of their visit. The hope is that this small sample of respondents will be representative of the wider population of visitors to the resort.

We need to ensure that our sample is representative of the population. When a SAMPLE produces results which are not representative of the POPULATION, we say that the sample is BIASED.

    In particular, there are 2 sources of bias in sample design.

    Selection Bias

e.g. If I choose my sample of tourists according to my impressions of people, I might choose those who look friendly and willing to participate; they would be more likely to say yes to completing the questionnaire. BUT they could give a distorted picture of customer satisfaction. I may end up choosing all the people who had a good time, who had no complaints, and not select any grumpy, unhappy people. The results of my survey will then show things to be better than they actually are.

    This bias is known as SELECTION BIAS.

    How do we ensure that our sample is representative?

The key is RANDOMNESS: selecting the sample in some kind of random way. This reduces the possibility that the sample may not represent the population well.

e.g. Each hour, randomly choose a number between 1 and 60. If in this hour I draw No. 11, say, then at 11 minutes past the hour, I ask the next customer who comes through the door to complete the questionnaire.
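As a rough illustration of such a scheme (outside the Excel workflow used in these notes), here is a minimal Python sketch; the function name and the 8-hour day are purely for illustration:

```python
import random

def minutes_to_sample(n_hours, seed=None):
    """For each of n_hours, draw a random minute (1-60) at which to
    approach the next customer who walks in."""
    rng = random.Random(seed)
    return [rng.randint(1, 60) for _ in range(n_hours)]

# e.g. an 8-hour opening day: one randomly chosen minute per hour
print(minutes_to_sample(8, seed=1))
```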

    Non-Response Bias

Even if I take a random sample to avoid selection bias, I can still encounter a second source of bias: non-response bias.

Whenever you do a survey, not all those who you ask to complete the survey will respond. Non-response can be a problem in two ways:

One, it makes the sample smaller than we might like (e.g. you may post out a survey to 200 customers and only get 30 responses; a sample of size 30 is pretty small).


Secondly, there may be a bias inherent in who responds and who doesn't. Suppose, for example, I ask for feedback on this subject via a voluntary survey: students can complete the survey if they wish. What might happen is that only very unhappy students bother completing the survey, because they have something to complain about. Those who are generally happy don't bother filling out the survey. So my survey results will be biased: they will suggest things are much worse than they actually are.

    This bias is known as NON-RESPONSE BIAS.

    3. Summarising Data in Meaningful Ways

    We need to summarise data in ways that are appropriate for the type of data you have:

Numerical or Quantitative Data: Data that takes a numerical value.

    Numerical or quantitative data can be of two types:

    (1) Discrete

e.g. Number of people in a household, number of times you've been to the doctor in the last year, number of children in school, etc.

    (2) Continuous

e.g. Your height, the value of the Consumer Price Index (CPI), the unemployment rate, etc.

Categorical or Qualitative Data: Data that do not take on numerical values, but can be classified into distinct categories (e.g. country of birth, gender, day of the week, etc.)

    3.1 Tables and Charts for Numerical Data

Suppose you have data on income levels of every household in a particular suburb: there are 20,000 data points. You want some easy way of capturing the characteristics of household incomes in that suburb. A snapshot of the data is given below.


The first thing to do with any sort of data is work out the question that you want to answer. For example, using this data you may be interested in what the income distribution is for this suburb, and you could ask a question like "are households in this suburb mostly affluent, or is there an uneven distribution of income?" One of the best ways of answering this question using a table is by creating a FREQUENCY DISTRIBUTION: a table which shows the number of households earning income within particular ranges.

e.g. [Table: frequency distribution of annual household income by class range]

That is, we look at the incomes of these 20,000 households and count the number of households with incomes in each class range. So in this case, there are 14 households who earn between $10,001 and $20,000 per annum. In tutorials next week you will learn how to automatically create frequency distributions like this in Excel using Data Analysis, Histogram from the Data tab.

We can also take this information and present it in a more visually appealing manner as a HISTOGRAM. e.g.


A histogram will be produced by Excel if you tick Chart Output in Data Analysis, Histogram.


    Make sure your histogram is well presented:

    Headings and labels for axes, including units of data.

No gaps between the bars; this is not the default Excel output, and you will see in tutorials how to fix this.

    BUT: reading from charts tends to be more approximate than reading from tables.

The problem with the table and chart above is that in their current form, the numbers are not particularly meaningful. Is 14 households a large or a small number? A more meaningful table or chart would be one that presents the frequency column as a percentage of the 20,000 households.

e.g. [Table: the same frequency distribution with frequencies expressed as percentages of the 20,000 households]

This allows us to say things like "0.07% of households earn between $10,001 and $20,000 p.a."

    Some points about creating frequency distributions:

Don't have too few classes (ranges are too broad, and useful information is lost), but don't have too many (too much detail, hard to form overall impressions). Somewhere between 5 and 15 is normal, depending on how much original data you have, and how much it varies.

Select sensible class boundaries: nice round numbers (you will find that Excel's automatic class (bin) ranges don't do this, and so it's better to enter your own).

Giving cumulative frequencies and cumulative percentages is also a useful addition. It allows us to say things like "20.15% of households earn $40,000 or less p.a."


e.g. [Table: the frequency distribution extended with cumulative frequency and cumulative percentage columns]
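If you ever want to check a frequency distribution outside Excel, here is a minimal Python/pandas sketch of the same idea. The income values are randomly generated stand-ins for the real 20,000 observations, and the $10,000-wide bins are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 20,000 household incomes held in Excel.
rng = np.random.default_rng(0)
income = pd.Series(rng.gamma(shape=2.0, scale=22_000, size=20_000), name="Income")

# Class boundaries: nice round numbers, $10,000 wide.
bins = np.arange(0, income.max() + 10_000, 10_000)
freq = pd.cut(income, bins=bins).value_counts().sort_index()

table = pd.DataFrame({
    "Frequency": freq,
    "Percent": 100 * freq / freq.sum(),
})
table["Cumulative %"] = table["Percent"].cumsum()
print(table.head())
```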

    3.2 Tables and Charts for Categorical Data

Suppose we had some qualitative information on a bunch of people in our suburb, including information on whether the individual has been diagnosed with particular medical conditions. Specifically, individuals were asked to indicate their primary medical condition out of Asthma, Cancer, Depression, Diabetes, Heart Disease or None of Above. A snapshot of the data is shown below.

Since individuals indicate only one medical condition, they can be categorised into one of 6 categories related to medical condition: Asthma, Cancer, Depression, Diabetes, Heart Disease or None of Above.


How can we organise and summarise this categorical data? We can also create a frequency distribution to see how many people are in each category. In Excel, you would use a different function: from the Insert tab, select Pivot Table. You will look at pivot tables in tutorials.

    The frequency distribution for this example is as follows:

Medical Condition     Sum of Victorians
Asthma                              628
Cancer                              575
Depression                          720
Diabetes                            322
Heart Disease                       348
None of Above                      4871
Grand Total                        7464

From this table we can see that, for example, 322 individuals out of the 7,464 respondents in the sample have diabetes as their primary medical condition.

A BAR CHART is the most common way of presenting this kind of categorical data. It is easy to read and interpret.

[Figure: Bar chart of Medical Conditions in Victoria; horizontal axis: Diagnosis (Asthma, Cancer, Depression, Diabetes, Heart Disease, None of Above); vertical axis: Number of Individuals, 0 to 6000.]

Note here that our categories are in alphabetical order. Since the categories have no natural ordering, it really wouldn't matter if we, say, put diabetes first. You may even like to order the categories from most frequent to least frequent; this special presentation, called a PARETO CHART, helps to distinguish the "vital few" from the "trivial many". Be sure to use bars and not a line graph which joins the categories, as with no natural ordering it doesn't make sense to join the dots!

As with the numerical data case, this information is often more useful in percentages. This is quick to do in Excel, once the pivot table has been constructed: right-click on the table, select Field Settings and choose Options. You can then Show data as "% of column", and it will automatically change the frequencies to percentages.


Medical Condition     Sum of Victorians
Asthma                            8.41%
Cancer                            7.70%
Depression                        9.65%
Diabetes                          4.31%
Heart Disease                     4.66%
None of Above                    65.26%
Grand Total                     100.00%

    So, 4.31% of the individuals reported diabetes as their primary medical condition.
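The same univariate counts and percentages can be reproduced outside Excel. Here is a minimal pandas sketch, rebuilt from the category totals in the table above (variable names are illustrative):

```python
import pandas as pd

# Rebuild the category counts from the table above as raw labels (one per person).
conditions = pd.Series(
    ["Asthma", "Cancer", "Depression", "Diabetes", "Heart Disease", "None of Above"]
).repeat([628, 575, 720, 322, 348, 4871])

counts = conditions.value_counts()                         # frequency distribution
percents = conditions.value_counts(normalize=True) * 100   # percentage version

# value_counts sorts from most to least frequent, which is the Pareto ordering.
print(pd.DataFrame({"Count": counts, "Percent": percents.round(2)}))
```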

Since the categories are mutually exclusive (individuals can only be in one category) and exhaustive (there are no other possible categories; the total is 100%), a PIE CHART can be used to present the same information:

[Figure: Pie chart of Medical Conditions in Victoria, with segments for Asthma, Cancer, Depression, Diabetes, Heart Disease and None of Above.]

The pie chart is often popular: it is visually more appealing, and also shows nicely how the overall population is divided up into its various categories. However, the bar chart is easier to read accurately; it is easier to judge the length of bars in a bar chart than angles/areas in a pie chart.

    3.3 Tables for Bivariate Data

So far we have been looking at just one characteristic of interest for the data: medical condition. i.e. We have been analysing univariate data. Univariate data often provides a description that is too simplistic and can even be misleading if there are other factors at work behind the univariate categories. It may be more useful to look at two characteristics, and look at frequencies in each pair-wise category. i.e. We want to look at bivariate data.

Let's consider an example. Suppose the government wants to find out more about the people with each medical condition, so that they can come up with policies aimed at reducing the prevalence of the condition.


e.g. If people who exercise more have a lower incidence of illness, the government may push for people to become more active. (After all, expenditure on health care is an important component of the government budget, and so it is in the government's best interests, both financially and socially, to monitor and reduce the prevalence of medical conditions.)

We have information on how much exercise each individual does. So, we have data for each individual on medical condition AND on exercise: Moderate to Frequent Exercise or Minimal Exercise. For each individual we have a pair of data points: we have bivariate data.

How do we present and summarise bivariate data? We use a CONTINGENCY TABLE. A contingency table is a two-way frequency distribution. It can be produced as a pivot table in Excel. You'll create some pivot tables in tutorials next week.

    e.g.

Sum of Victorians                             Exercise
Medical Condition    Moderate to Frequent Exercise   Minimal Exercise   Grand Total
Asthma                                         254                374           628
Cancer                                         172                403           575
Depression                                     249                471           720
Diabetes                                        80                242           322
Heart Disease                                   74                274           348
None of Above                                 1861               3010          4871
Grand Total                                   2690               4774          7464

You will notice that we have split the medical condition frequency distribution into categories based on exercise. For example, we can see the same total values that we saw in the univariate table: out of the sample of 7,464 people, there were 322 people with diabetes as their primary medical condition. Further, 242 people both suffer from Diabetes AND do Minimal Exercise.


Why would this bivariate presentation be more useful? Because it allows us to see whether there are any differences in medical conditions across the different categories of exercise. In particular, the bivariate table suggests that a large number of those with Diabetes do Minimal Exercise. This information would be much more interesting to the government: it indicates a possible relationship between diabetes and exercise, and suggests a policy angle that the government might take to reduce the prevalence of diabetes. By allowing us to see a second dimension to the data, the bivariate pivot table gives us a much richer picture of the data than the univariate pivot table.

You will notice that in each category of medical condition, there are many more people who do Minimal Exercise. But we need to be careful here: because there are more people in the Minimal Exercise group in total, naturally there will be more people in this category with medical conditions. Again, we need to use percentages.

    We could use the following % of row pivot table:

Sum of Victorians                             Exercise
Medical Condition    Moderate to Frequent Exercise   Minimal Exercise   Grand Total
Asthma                                      40.45%             59.55%       100.00%
Cancer                                      29.91%             70.09%       100.00%
Depression                                  34.58%             65.42%       100.00%
Diabetes                                    24.84%             75.16%       100.00%
Heart Disease                               21.26%             78.74%       100.00%
None of Above                               38.21%             61.79%       100.00%
Grand Total                                 36.04%             63.96%       100.00%

From this table, we can see that 75.16% of people with Diabetes as their primary medical condition do Minimal Exercise.

BUT: this doesn't really tell us anything more meaningful than the previous contingency table, given that we are interested in whether more exercise improves health. Again, there are more people in the minimal exercise group, thus the percentage will always be largest in the minimal exercise column.

    To answer our question, we should use the % of column contingency table:

Sum of Victorians                             Exercise
Medical Condition    Moderate to Frequent Exercise   Minimal Exercise   Grand Total
Asthma                                       9.44%              7.83%         8.41%
Cancer                                       6.39%              8.44%         7.70%
Depression                                   9.26%              9.87%         9.65%
Diabetes                                     2.97%              5.07%         4.31%
Heart Disease                                2.75%              5.74%         4.66%
None of Above                               69.18%             63.05%        65.26%
Grand Total                                100.00%            100.00%       100.00%

This pivot table effectively takes into account (controls for / standardises according to) differences in the population size of each exercise category.

From this table, we can say things like "5.07% of the people who do Minimal Exercise have Diabetes as their primary medical condition". Comparing this with the rate of Diabetes in the overall sample (4.31%, as shown in the Grand Total column and as we saw in the univariate table), this would suggest a connection between Diabetes and the level of exercise a person undertakes.


If we believe that lack of exercise contributes to the development of diabetes, a strategy for the government might be to promote a more active lifestyle amongst people in the suburb.
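For those who prefer to work outside Excel, a contingency table and its "% of column" version can be built with pandas crosstab. The tiny data set and the column names below are purely illustrative, not the real survey file:

```python
import pandas as pd

# Tiny illustrative data set; the real survey has 7,464 rows, and these
# column names ("Medical Condition", "Exercise") are assumptions.
df = pd.DataFrame({
    "Medical Condition": ["Diabetes", "Asthma", "None of Above", "Diabetes", "Cancer"],
    "Exercise": ["Minimal", "Moderate to Frequent", "Minimal", "Minimal",
                 "Moderate to Frequent"],
})

counts = pd.crosstab(df["Medical Condition"], df["Exercise"], margins=True)
col_pct = pd.crosstab(df["Medical Condition"], df["Exercise"], normalize="columns") * 100

print(counts)            # two-way frequency distribution (contingency table)
print(col_pct.round(1))  # "% of column": controls for the size of each exercise group
```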

    3.4 Descriptive (Summary) Statistics

When we have a set of numerical or quantitative data, it is common to try and summarise characteristics of the data with what we call summary measures.

Getting some idea of the general characteristics of the data is a good starting point. We can do this quite quickly and easily in Excel using the Data tab, Data Analysis, Descriptive Statistics, and selecting the Summary Statistics box.

    Using the data on 20,000 households in our suburb, we get the following output:

It is worth knowing how these statistics are calculated, what they mean and their limitations. But before we start, let's introduce some notation:

    Consider the following representation:

$$\sum_{i=1}^{n} X_i$$

This means "sum the Xs from $X_1$ to $X_n$". The capital Greek letter $\Sigma$, pronounced "sigma", is used widely in mathematics and statistics as shorthand for "sum a set of values as described by the notation following it". The $i$ indicates the element/observation number, while the value specified below $\Sigma$ is the term to begin with and the value above is the term to finish with.

That is:

$$\sum_{i=1}^{5} X_i = X_1 + X_2 + X_3 + X_4 + X_5$$

    and


$$\sum_{i=3}^{4} X_i^2 = X_3^2 + X_4^2.$$

    You will have some practice with summation operators in tutorials next week.
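If it helps, the notation can also be checked with a few lines of Python; the data values below are arbitrary:

```python
X = [3, 1, 4, 1, 5]   # X_1, ..., X_5 (arbitrary values)

# sum_{i=1}^{5} X_i  (Python lists are 0-indexed, so X_i is X[i - 1])
total = sum(X[i - 1] for i in range(1, 6))          # 3 + 1 + 4 + 1 + 5 = 14

# sum_{i=3}^{4} X_i^2
sq_total = sum(X[i - 1] ** 2 for i in range(3, 5))  # 4**2 + 1**2 = 17

print(total, sq_total)
```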

    Now, back to the summary statistics:

    (1) Mean

The MEAN is the arithmetic average of the n data points:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{or} \quad \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

So, for our 20,000 data points the formula becomes:

$$\bar{X} = \frac{\sum_{i=1}^{20000} X_i}{20000} \quad \text{or} \quad \bar{X} = \frac{X_1 + X_2 + \cdots + X_{20000}}{20000}$$

The mean is the most common measure of central tendency. Generally it is a good measure. BUT it can be affected by a few extreme values.

e.g. If you are looking at average household income in a particular suburb, there may be one household that earns an extremely high income which pushes up the mean, so that it gives a misleading picture of where incomes of most households in the suburb sit.
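A quick illustration of this sensitivity, using made-up incomes for a handful of households plus one extreme earner:

```python
import numpy as np

# Made-up incomes for a small street; the last household is an extreme earner.
incomes = np.array([42_000, 45_000, 47_000, 50_000, 52_000, 1_500_000])

print(np.mean(incomes))    # about 289,333: dragged up by the one extreme value
print(np.median(incomes))  # 48,500: closer to what a "typical" household earns
```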

(2) Standard Error - ignore

    (3) Median

The MEDIAN is the middle number when the data is ordered from smallest to biggest. 50% of values are below it, and 50% above it.

The median has an advantage over the mean in that it is not affected by a few extreme values. It does a good job of telling us what the typical income of a household is in that particular suburb. BUT the median can also be misleading: it takes no account of how the data is distributed around the middle.

e.g. Consider 5 housing properties that are up for sale in Clayton with prices at $270,000; $320,000; $460,000; $470,000 and $480,000. In Wantirna South, 5 similar housing properties up for sale are priced at $450,000; $450,000; $460,000; $850,000 and $1,000,000.

Both suburbs have a median price of $460,000, so we would say a typical housing property has the same price in each suburb. But clearly, most housing properties in Wantirna South are more costly than those in Clayton. The median doesn't give the full picture. Information about the range of property prices within each suburb does not enter into the computation of the median.


    (4) Mode

The MODE is the most frequently occurring value. It is sometimes a useful summary measure, for either numerical or categorical data.

e.g. You may have data on the number of people in each household in a suburb. The mode is 4, telling us the most common household size is 4 people.

Sometimes the mode is of little interest if there are few repeated values. It needs data which, by nature, has frequent repeats.

    (5) Standard Deviation

    and

    (6) Sample Variance

The VARIANCE and STANDARD DEVIATION are the most commonly used measures of variation or spread in numerical data. Often it is interesting to know how spread out the data is.

e.g. We usually have an average result in this subject of around 65%, which is very good. But we also need to know how spread out the results are: does virtually everyone get between 55 and 75 (i.e. a good pass rate, but hard to score well), or are results more spread out (indicating that a significant number fail, and some score very high marks)?

    The variance measures the average of the squared variation about the mean:

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}.$$

i.e. We are looking for some indication of how much the data varies around the mean. If the data is all close to the mean, then $(X_i - \bar{X})$ will be small for all the values (that is, all $i$), and hence the average of their squares will be small. Conversely, data which varies greatly above and below the mean will have big $(X_i - \bar{X})$ values, hence a big variance.

N.B. Why divide by n - 1 and not n? This is a sample variance, and is being used to estimate a population variance. It turns out that dividing by n - 1 gives a better estimate of the population variance than dividing by n; we'll think more about this in the next topic.

Standard deviation is just the square root of the variance. It is much easier to interpret than the variance:

Strictly speaking, s is the square root of the average of the squared deviations from the mean.

But a more understandable interpretation is: some X values are above the mean; others are below the mean. s is an estimate of the average amount that the Xs vary from the mean, either above or below. This isn't exactly correct, but it's at least understandable!
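Here is a short Python check of the sample variance and standard deviation formulas, using a handful of made-up exam marks; note the n - 1 divisor (ddof=1 in NumPy):

```python
import numpy as np

marks = np.array([55, 62, 58, 71, 66, 49, 75, 68])  # made-up exam marks

n = len(marks)
xbar = marks.mean()
s2 = ((marks - xbar) ** 2).sum() / (n - 1)  # sample variance: divide by n - 1
s = np.sqrt(s2)                             # standard deviation

# NumPy gives the same answers when told to use the n - 1 divisor (ddof=1).
assert np.isclose(s2, marks.var(ddof=1))
assert np.isclose(s, marks.std(ddof=1))
print(round(s2, 2), round(s, 2))
```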

    (7) Kurtosis - ignore


    (8) Skewness

SKEWNESS tells us about the shape of the distribution. In particular, it tells us about how the data is distributed around the mean. Consider three typical histograms, each with the same mean and standard deviation:

[Figure: three histograms with the same mean and standard deviation: (upper) symmetric, zero skewness; (lower left) negative skewness; (lower right) positive skewness.]

Skewness describes the degree and direction of asymmetry in the data. The first (upper-most) histogram represents a symmetric distribution. The skewness measure would be zero in this case. The histogram is identical either side of the mean.

The skewed portion is the long, thin part of the curve. The lower left-hand-side histogram is what we call negatively skewed, giving a negative number for this skewness measure. There is a long left-hand tail, formed by a scattering of some very small values, but the bulk of the data sits further to the right. This data might represent exam marks in this subject: most people score around 55%-80%, with very few above this. But there is a long tail of people to the left who get below 55%; marks as low as 10% or 20% do happen.

The lower right-hand-side histogram is positively skewed: a long tail to the right. This data could be, perhaps, household income levels. Most earn $30,000 to $50,000, with very few earning much less than this. But there is a long tail of households to the right who earn much more: $60,000, $80,000, $100,000 or more.



    (9) Range

The RANGE is another measure of spread or variation. It is simply the difference between the maximum and minimum values in our data set.

    (10) Minimum

    The smallest value in our data set.

    (11) Maximum

    The largest value in our data set.

    (12) Sum

The sum of all values in the data set; not usually very interesting.

    (13) Count

    The number of values in the data set.

Note that the bigger the sample is, the closer we are to having the population. So, if we only have a few values to summarise, we may have a sample that is not representative of the true population, and consequently our descriptive statistics may not be meaningful.
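As a cross-check on Excel's Descriptive Statistics output, most of these summary measures can be produced in a few lines of pandas. The income values below are made up:

```python
import pandas as pd

income = pd.Series([42_000, 45_000, 47_000, 50_000, 52_000, 120_000])  # made-up values

summary = pd.Series({
    "Mean": income.mean(),
    "Median": income.median(),
    "Standard Deviation": income.std(),  # sample standard deviation (n - 1 divisor)
    "Sample Variance": income.var(),
    "Skewness": income.skew(),
    "Range": income.max() - income.min(),
    "Minimum": income.min(),
    "Maximum": income.max(),
    "Sum": income.sum(),
    "Count": income.count(),
})
print(summary.round(2))
```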

    3.5 Analysis of Variance (ANOVA)

We have just seen how summary statistics can be calculated from numerical data: e.g. we saw that mean income in our suburb was $44,395. But what if we had bivariate data in this case: e.g. for each household, we have data on income (a numerical variable) and also data on household type (a categorical variable). That is, a snapshot of the data would look like this:

With this sort of data, we could calculate summary statistics for income for each different household type. To do this using Excel's Descriptive Statistics tool, we would need to first sort the data by household type and rearrange it as follows (just the first 5 rows are shown to save space):


We can then obtain summary statistics for each of these household types as presented in the columns. If there are a different number of observations in each category, then you'll need to run the descriptive statistics for each category separately.

Here we see that mean household income is highest among couples without children, and lowest amongst single adults without children. Generally, however, there is quite a large degree of variation in incomes for households comprising couples, compared to singles (this would, of course, be due to the fact that couples could have 1 or 2 working adults). We can get a similar sort of output in one go using ANOVA: Single Factor in Excel's Data Analysis tool. ANOVA stands for ANALYSIS OF VARIANCE, and it is primarily used to compare the mean of a numerical variable across groups defined by a categorical variable. In our case, we'd be comparing mean income across each household type group.

Here's the output we get when we do this:


Notice that the first block of summary output gives some (but not all) of the statistics that we obtained from the Descriptive Statistics. We could interpret these as we did previously.

The second block of output performs what is known as Analysis of Variance or ANOVA. When we perform ANOVA, we are essentially trying to figure out what the different sources of variation are in the data. In other words, why are all the incomes not exactly the same? They clearly are not all the same: there is variation in household incomes.

Here's how it works. Income varies. It varies by an amount given by its variance, the sum of squared deviations from its overall mean:

$$\text{Total Variation in Income } (X) = \text{Var}(X) = \sum_{i=1}^{n} (X_i - \bar{X})^2$$

In Excel's ANOVA output, the total variation in income is given in the SS Total column ($7.488 \times 10^{12}$).

    This total variation in income is then decomposed into two parts (sources of variation):

    (1) Between Groups Variation

This is the variation that is explained by membership to the groups; in this case, it is the variation in income that can be attributed to household structure.

A measure of the amount of Between Groups Variation is the Between Groups Sum of Squares (under the SS column in the Excel output). It is a measure of how much the group means vary from the overall mean:

$$\sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2$$

where: $k$ is the number of groups, $n_j$ is the number of observations in group $j$, $\bar{X}_j$ is the mean of group $j$, and $\bar{X}$ is the overall mean.

    (2) Within Groups Variation

This is the variation that is not explained by membership to the groups. That is, it is the variation in income due to other factors (e.g. education of household members), including chance or randomness.

$$\sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$$

where: $X_{ij}$ is the $i$th observation on $X$ in group $j$.

We can compare the amount of explained variation with the amount of unexplained variation. But to do this we need to convert the above into average measures. We do this by dividing by the df column. The result is the MS column.

The MS Between Groups in our example is $6.704 \times 10^{11}$ while the MS Within Groups is $2.739 \times 10^{8}$: the variation in income attributable to household structure is more than 2400


times the amount due to other factors (the F column gives us this ratio). This certainly seems like the explained part is much larger than the unexplained part, so we would consider household structure to be an important factor in determining household income. In particular, it suggests that mean household income is not the same across groups defined by household type.

So, how much larger does the MS Between Groups need to be than the MS Within Groups for us to conclude that membership to the groups is an important source of variation? The amount differs depending on a couple of factors: how many observations we have and how many groups there are. We'll tell you how this works in Topic 3, but for now, the ratio is given in the column F crit. That is, if F is bigger than F crit in the Excel output, then we'd conclude that the MS Between Groups is indeed bigger than the MS Within Groups, and thus membership to the groups is an important source of variation.

    In our example, the Excel output gives us:

$$F = \frac{MS \text{ Between Groups}}{MS \text{ Within Groups}} = \frac{SS \text{ Between Groups} \,/\, df \text{ Between Groups}}{SS \text{ Within Groups} \,/\, df \text{ Within Groups}} = 2447.751$$

and:

$$F \text{ crit} = 2.605$$

So since $F = 2447.751 > 2.605$, we say that the explained variation is much bigger than the unexplained variation, and therefore household structure is indeed an important factor in determining household income.

    We will build on this idea further in Topic 3.
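For those curious how the ANOVA numbers come about, here is a minimal Python sketch of the single-factor calculation. The three groups and their incomes are made up for illustration, not the 20,000-household data used above:

```python
import numpy as np
from scipy import stats

# Made-up incomes for three household-type groups (the real data has 20,000
# households and more groups than this).
groups = [
    np.array([90_000, 110_000, 95_000, 120_000]),  # e.g. couples, no children
    np.array([70_000, 85_000, 60_000, 75_000]),    # e.g. couples with children
    np.array([30_000, 40_000, 35_000, 25_000]),    # e.g. single adults
]

all_x = np.concatenate(groups)
grand_mean = all_x.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

k, n = len(groups), len(all_x)
ms_between = ss_between / (k - 1)  # divide each SS by its df to get the MS
ms_within = ss_within / (n - k)
F = ms_between / ms_within
F_crit = stats.f.ppf(0.95, k - 1, n - k)  # 5% critical value, like Excel's "F crit"

print(round(F, 2), round(F_crit, 2))
# scipy's built-in one-way ANOVA gives the same F statistic:
print(round(stats.f_oneway(*groups).statistic, 2))
```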

    3.6 Statistics for More than One Variable: Making Appropriate Comparisons

To make meaningful comparisons between different variables, e.g. incomes in different suburbs, malnutrition rates among children in different developing countries, prices of shares in different industries, etc., it is important to have standardised data sets/units.

    Specifically, this might involve dividing the variable of interest by:

The total, to convert magnitudes into percentages (to allow for different total magnitudes).

e.g. Shares in Macquarie Bank are much more profitable than shares in Westpac Bank: Macquarie shares grew by $3.30 this year, while Westpac shares only grew by $1.26.

But: at the beginning of the year, a share in Macquarie cost $69.00, while a share in Westpac cost $26.52. We should convert the price growth into a percentage return in order to make an appropriate comparison of the more profitable investment:

Return on Macquarie share = $3.30 / $69.00 = 0.048, i.e. 4.8%.
Return on Westpac share = $1.26 / $26.52 = 0.048, i.e. 4.8%.
So in fact, even though the price increases are different in dollar terms, both shares earned the same return as a percentage of the original investment.
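The same arithmetic in a couple of lines of Python:

```python
# Price growth in dollars versus the return as a percentage of the starting price.
macquarie_start, macquarie_growth = 69.00, 3.30
westpac_start, westpac_growth = 26.52, 1.26

print(round(100 * macquarie_growth / macquarie_start, 1))  # 4.8 (%)
print(round(100 * westpac_growth / westpac_start, 1))      # 4.8 (%)
```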


    The consumer price index (CPI) (to remove the effects of inflation)

e.g. Working families have never been better off. Average weekly earnings have gone up 400% in the last 25 years.

[Figure: Average Weekly Earnings, Total Earnings, All Employees, $/week, 1981 to 2005; vertical axis 0 to 900.]

But: as well as wages, prices have gone up over the 25-year period, so a fair comparison of how much better off families are should adjust for the effects of inflation. i.e. can they actually buy any more with these extra dollars?

The CPI is an index that measures the cost of living, and it is often used to remove the effects of inflation from monetary data spanning across time.

When we divide nominal wages (the dollar amount you get on your pay sheet) by the CPI, we get what is known as the REAL WAGE. It is obtained by the following:

$$\text{Real Wage} = \frac{\text{Nominal Wage}}{\text{CPI}} \times 100$$

The real wage tells us what our wage would be if prices had remained fixed at what they were in the base period for the CPI (in Australia this is currently 1989/90). Changes reflect a capacity to buy more or fewer items, hence the term "real" wage.
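A quick numerical illustration, with a made-up nominal wage and CPI value:

```python
# Real wage = (nominal wage / CPI) x 100, using made-up figures.
nominal_wage = 800.0  # dollars per week
cpi = 150.0           # price index, base period (1989/90) = 100

real_wage = nominal_wage / cpi * 100
print(round(real_wage, 2))  # 533.33: the wage expressed in base-period dollars
```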


[Figure: Average Weekly Earnings, Total Earnings, All Employees, Nominal and Real, $/week, 1981 to 2005; vertical axis 0 to 900.]

    Population (to adjust for different sized populations).

e.g. Economic growth, as measured by the growth rate in real Gross Domestic Product (GDP), has been 3.2% this past year.

But: population has grown by around 2%, so actual per capita growth has been much smaller.

e.g. China's energy use is growing 10 times faster than Australia's.

But: population is 50 times bigger, so per capita, Australia's energy use is growing faster.

    We capture this by calculating per capita energy consumption:

    per capita energy consumption = consumption / population.

This tells us consumption per person. Growth in this variable over time indicates actual growth in energy use per person, rather than growth in the combination of energy use and population.

    3.7 Summary

Looking at these descriptive statistics and graphs is often a useful first step in getting a feel for the data you are looking at. Let's go through an example to show how the tools you have learnt in this topic can be used to report and draw conclusions about a set of data.

Suppose we are interested in evaluating a job search program undertaken by Centrelink. We have data on 100 individuals, some of whom took part in the job search program, and some who didn't. We also have data on whether the individual found work within a 6-month period, and if so, their annual wage.

Here's a snapshot of the data:


Individual   Participate in Job Search Training   Become Employed   Income
 1           no                                   no                -
 2           no                                   no                -
 3           yes                                  yes               43230
 4           no                                   yes               27635
 5           no                                   yes               62046
 6           no                                   yes               45157
 7           yes                                  yes               55970
 8           no                                   no                -
 9           no                                   no                -
10           no                                   no                -
11           no                                   no                -
12           no                                   no                -
13           no                                   no                -
14           no                                   no                -
15           no                                   yes               66181
16           no                                   no                -
17           no                                   no                -
18           no                                   yes               49030
19           no                                   yes               50117
20           yes                                  yes               47100

First, we could use a pivot table to see whether those who participated in the program were more likely to get a job than those who didn't.

    Employed?

    Participate in Job Search Training? No Yes Grand Total

    No 50 26 76

    Yes 1 23 24

    Grand Total 51 49 100

From this table, we can see that 24% of the individuals participated in the job search program, and virtually all of them (23 out of 24, or 96%) found work. Amongst those who did not participate in the training, only 26/76 or 34% found work.

These proportions would be easier to see as a % of Row, as by dividing by the row totals we effectively account for the different number of people in each group (participants in the training and non-participants).

That is, we'd construct:

    Employed?

    Participate in Job Search Training? No Yes Grand Total

    No 66% 34% 100%

    Yes 4% 96% 100%

    Grand Total 51% 49% 100%


But just because there was a high success rate in employment doesn't mean those people who got jobs have high-paying jobs. We can look at some descriptive statistics of income for the two groups: job search training participants and non-participants. The output is given below:

                      Participants in Job Search Training   Non-Participants
Mean                  39314.87                               41034.78
Standard Error        2861.283                               3643.973
Median                42519.76                               40000
Mode                  #N/A                                   40000
Standard Deviation    13722.23                               18580.69
Sample Variance       1.88E+08                               3.45E+08
Kurtosis              -0.36927                               3.286039
Skewness              -0.67963                               1.57031
Range                 45866.31                               81156.34
Minimum               10819.11                               18843.66
Maximum               56685.42                               100000
Sum                   904242.1                               1066904
Count                 23                                     26

Note that these descriptive statistics were calculated for the employed only: the Counts are the same as the total number of employed in each participation group.

    The alternative output to the above is the ANOVA output below:

    Anova: Single Factor

    SUMMARY

    Groups Count Sum Average Variance

    Participate 23 904242.12 39314.87478 188299665.2

    Do Not Participate 26 1066904.166 41034.7756 345241991

    ANOVA

    Source of Variation SS df MS F P-value F crit

    Between Groups 36100391.21 1 36100391.21 0.132829645 0.717150714 4.047099759

    Within Groups 12773642409 47 271779625.7

    Total 12809742801 48

The statistics highlight that mean income is similar across the two groups: participants earn on average $39,315 per annum, while the non-participants earn slightly more on average at $41,035 per annum. In particular, the ratio of explained variation to unexplained variation is too small for participation in the program to be deemed an important source of variation in income (the ratio of 0.133 is much smaller than the critical value of 4.047).

The medians are also somewhat similar, with 50% of participants earning above and below $42,520 and non-participants, 50% above and below $40,000. No mode is available for income of participants (there are no 2 incomes exactly the same in this group), whereas an income of $40,000 comes up most often amongst non-participants, a value quite close to the mean and median.


There is some difference in variation between the 2 groups, with incomes of participants varying around the mean by $13,722 on average, while incomes of non-participants tend to vary more around their mean: on average, incomes vary by $18,581 above and below the mean. This characteristic can also be seen in the values for the range, minimum and maximum: incomes of participants range from $10,819 to $56,685 per annum, while those for non-participants span a much larger range: $18,843 to $100,000. Incomes of participants tend to vary symmetrically and more tightly around the mean than incomes of non-participants: for non-participants, most incomes sit around the $40,000 mark, but a few particularly large values like $100,000 do happen. These few large values drag out the mean and give the distribution a moderately large skew to the right (hence the positive skewness coefficient of 1.57).

    Summing up the findings of the analysis, we could conclude:

    The job search program seems to be successful in getting people jobs.

The jobs participants get do not seem to be more highly paid than those the non-participants get.

There is much more variation in income outcomes of non-participants compared to participants, but most people found jobs with incomes around the $40,000 mark.

N.B. What determined who participated in the program and who didn't?