final assessment qrm


STUDENT EXAMINATION NUMBER: Y1401956

MODULE NO: MAN00029M
MODULE TITLE: Quantitative Methods & Data Analysis
Module Tutor: Dr. Harry Venables
Essay Title: Final assessment

Word Count: 2688


Task 1

Before any manipulation of the data, the Data View and Variable View in SPSS must comply with SPSS's rules so that the output is computed properly.

Name | Label | Values | Type | New type | Rationale
Obs | Id number | None | Nominal | Scale | Variable items that are not measurable but are numeric, like ID numbers and phone numbers (can also be Nominal).
Gender | Gender | 0=Female, 1=Male | Nominal | Nominal | Variable items are all numbers that represent categories and have no order to them, e.g. 1-Blue Car, 2-Cat, 0-Male, 11-Female, etc.
Age | Age (years) | None | Nominal | Scale | Variable items are all measurable numbers, e.g. height in cm.
Status | Marital Status | 1=Single, 2=Married, 3=Divorced, 4=Widowed | Nominal | Nominal | Variable items are all numbers that represent categories and have no order to them, e.g. 1-Blue Car, 2-Cat, 0-Male, 11-Female, etc.
Occupation | Occupation | 1=Student, 2=Employed, 3=Self-employed, 4=Retired | Nominal | Nominal | Variable items are all numbers that represent categories and have no order to them, e.g. 1-Blue Car, 2-Cat, 0-Male, 11-Female, etc.
AvgMonthlySpending | Average Monthly Spend (GBP) | None | Nominal | Scale (custom currency) | Variable items are all measurable monetary values.
MonthlyVisits | Number of Monthly visits | None | Nominal | Scale | Variable items are all measurable numbers, e.g. height in cm.
Distance | Distance Travelled (miles) | None | Nominal | Scale | Variable items are all measurable numbers, e.g. height in cm.
Car | Vehicle Ownership | 0=No, 1=Yes | Nominal | Nominal | Variable items are all numbers that represent categories and have no order to them, e.g. 1-Blue Car, 2-Cat, 0-Male, 11-Female, etc.
Appreciation | Customer Appreciation | 1=Very Low, 2=Low, 3=Indifferent, 4=High, 5=Very High | Nominal | Ordinal | Variable items are some form of ranking or order, e.g. Likert scale values 1-5, 1-7.

Task 2

A. The bar chart shows the target consumers of the FreshCo retail centre in terms of two characteristics: marital status and occupation. A cross-tabulation (Table 2.1) is therefore used to analyse the two variables together and to produce an appropriate bar chart.

Table 2.1

Table 2.2

Table 2.2 shows that there are 201 respondents and 2 modes.

From this chart it can be concluded that most of FreshCo’s consumers are employed (116 out of 201 respondents) and that the largest marital-status group is married (89 out of 201 respondents). The modal attributes are therefore ‘married’ and ‘employed’, because they occur most frequently (Appendix 2) (Field, 2009:21).
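The cross-tabulation and the modes can also be sketched outside SPSS. The following is a minimal illustration in Python, assuming a small made-up sample rather than the FreshCo data; the function names `crosstab` and `modes` are hypothetical helpers, not SPSS procedures.

```python
from collections import Counter


def crosstab(rows, cols):
    """Count how often each (row category, column category) pair occurs."""
    return Counter(zip(rows, cols))


def modes(values):
    """Return every category tied for the highest frequency, sorted by name."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(k for k, v in counts.items() if v == top)


# Hypothetical mini-sample for illustration only (not the FreshCo data).
status = ["Married", "Single", "Married", "Divorced", "Married", "Single"]
occupation = ["Employed", "Student", "Employed", "Self-employed",
              "Retired", "Employed"]

table = crosstab(status, occupation)
print(table[("Married", "Employed")])     # cell count for one combination
print(modes(status), modes(occupation))   # most frequent category per variable
```

Each variable has its own mode, which is how a table with two attribute variables can report two modes.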

B. The target-consumer analysis uses the same characteristics as before, status and occupation, and the differences between them, but now in relation to car ownership.


Divorced respondents who are employed or self-employed, and widowed respondents who are retired, are the groups that do not own a car. Divorced respondents, especially the self-employed sub-group, form the largest group without a car. Employed married respondents, by contrast, form the largest group of car owners. Single students are the second largest group of car owners, and single employed respondents are the smallest.

C. Given that the majority of FreshCo’s customers are married, employed car owners, potential issues such as an insufficient number of parking spaces could arise. Convenient opening hours could also make a significant difference for working individuals. Marital status can also indicate the presence of children and therefore a need for family facilities on site, such as playgrounds and food courts.


Task 3

A. Consumer spending in regard to consumer characteristics.

Extreme values (outliers) occur for male students and self-employed females. Outliers are values that deviate markedly from the rest of the responses. In this case three respondents report unusually high average monthly spending. The numbers next to the outliers indicate the row number of the respondent (SAGE, 2015). In the self-employed female group one person has higher monthly spending (£522.59) than the rest. In the male student group two respondents spend more than the rest of the group (£219.84 and £225.84).
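The way a box plot flags such values can be sketched with the common 1.5 x IQR rule. This is only an illustration in Python: the data are made up, the function name `boxplot_outliers` is hypothetical, and the quartile method used here is an assumption that may differ slightly from the one SPSS applies.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical spending values for one group (not the FreshCo data).
spending = [10, 12, 13, 14, 15, 16, 18, 50]
print(boxplot_outliers(spending))
```

A value is flagged only relative to its own group, which is why the same amount can be an outlier for students but unremarkable for employed respondents.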

The medians differ across occupations, whereas across gender they are not noticeably different (the boxes overlap). Employed and self-employed respondents, both male and female, spend more than the other groups (the difference appears significant because their confidence intervals do not overlap). There is also some difference between the employed and self-employed groups, because their boxes barely overlap (for employed and self-employed women the difference in spending is smaller, because the boxes overlap slightly). The lower median of the self-employed group shows lower spending than in the employed group, whose median is higher. The interquartile ranges differ slightly in length and position, which indicates different dispersion of the data in the two groups (the range of the self-employed group is smaller) (Field, 2009:100-2).

There is no clear difference between the expenditure of students and retired individuals, because their boxes overlap. However, the spread of the students’ interquartile range differs by gender: the box for male students is smaller, and male students are less likely than female students to spend larger amounts (Ibid).

B. Average monthly expenditure according to level of appreciation.


The average expenditure of customers with ‘very high’ appreciation differs markedly from the rest, because its median and box (including confidence intervals) lie far from the others and do not overlap with them. The interquartile ranges of the ‘very low’ and ‘low’ appreciation groups are very different from all the others, which indicates a wider dispersion of data (Field, 2009:101-2).

Some box plots show skewness and a lack of symmetry in the data, which needs to be examined more closely through descriptive statistics.


The descriptive statistics show that the means, as well as the medians, of the ‘very low’, ‘low’, ‘indifferent’ and ‘high’ appreciation groups are not markedly different from each other. However, the mean of the ‘very high’ appreciation group (197.8941), like its median (199.9549), is considerably lower and differs from the rest.

The standard deviation also differs for the ‘very low’ and ‘low’ groups, which shows numerically that their values are spread more widely around the mean. The interquartile ranges of both these groups also differ markedly from the rest. We can also observe slight positive skewness for the ‘high’ and ‘indifferent’ groups and slight negative skewness for the rest, which indicates slight asymmetry in the distribution of the data (Field, 2009:19).

Customers that tend to spend the least amount have the highest customer

appreciation. Customers that spend the most are indifferent or have low or very low

appreciation.

Task 4: Distribution of customers’ monthly expenditure.


According to the histogram, the data for average monthly spending is not normally distributed. We observe a flat distribution with a negative skew. The bars fall outside the normal curve and split clearly into two groups. The Normal Q-Q plot, a chart of the observed values plotted against the values expected under a normal distribution, shows some deviation towards the tails. The data points lie fairly far from the line and even cross it, which shows that the distribution is not normal. Only around the spending values of 180.000 and 5300.00 do the data follow the normal line; in general, the values do not follow the normal distribution. The detrended plot is another view of the same data with the trend line removed, and it shows the departure from normality even more clearly.

The box plot does not show any outliers. The median is not centrally placed within the box, so the middle section of the data is not symmetrically distributed.

Distribution of distance travelled

The distance travelled data seems closer to a normal distribution. However, the Normal Q-Q plot shows a slight deviation towards the end of the tail. The detrended Normal Q-Q plot gives a closer look, which reveals that the data is not as normally distributed as it first appears. The box plot has three outliers. The median is almost centrally placed, so the middle section of the data is almost symmetrically distributed.

Normality is hard to judge without carrying out a test of normality.

From the table we can see skewness in both cases, which indicates a departure from normality; it is stronger in the first case (-.900) than in the second (.612).
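For reference, a skewness statistic of this kind can be computed by hand. The sketch below uses the adjusted Fisher-Pearson sample skewness, on the assumption that this is the variant reported in the descriptives table; the data and the function name `sample_skewness` are hypothetical.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness (bias-corrected g1)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three observations are needed")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


# Hypothetical data: a long right tail gives positive skewness,
# a long left tail gives negative skewness, symmetry gives zero.
print(sample_skewness([1, 1, 1, 10]))
print(sample_skewness([1, 10, 10, 10]))
print(sample_skewness([1, 2, 3, 4, 5]))
```

A negative value such as -.900 therefore points to a longer left tail, and a positive value such as .612 to a longer right tail.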


According to the Shapiro-Wilk test (which is more reliable), Sig. (p < 0.05) shows that the data for both average monthly spending and distance travelled is not normally distributed. The null hypothesis here is that the data does not differ from a normal distribution. The p-value is smaller than the level of significance: in this case .000 and .001 are smaller than 0.05 (5%). The null hypothesis is therefore rejected, and we conclude that the data is not normally distributed and that distance travelled deviates from normality less than the monthly spending data.

Task 5: Significance of age-gender difference.

a. It is assumed that the data is normally distributed, which suggests a parametric test in the form of an independent-samples t-test. A t-test of this kind is used when there are “two experimental conditions and different participants used in each condition” (Field, 2009:334).

The null hypothesis (H0) is that there is no significant difference in mean age between the genders. The alternative hypothesis (H1) is that there is a significant difference in mean age between the genders.

The p-value indicates the level of probability at which we reject or fail to reject the hypothesis (Ibid), and it has to be linked to the direction of the hypothesis being tested. The first step is the test of equality of variances: if p < 0.05 (5%), H0 is rejected and the variances differ significantly; if p > 0.05 (5%), H0 is not rejected. After the analysis of variances, the second step is the analysis of means. If the first test does not show a significant difference, so that its H0 is not rejected, we should look at the first row of the Independent Samples Test table.
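The two rows of the Independent Samples Test table correspond to two versions of the t statistic. The sketch below shows, under stated assumptions, how each could be computed: `pooled_t` stands for the "equal variances assumed" row and `welch_t` for the "equal variances not assumed" row. Both function names and the data are hypothetical, and the p-value lookup is left out.

```python
import math
import statistics


def pooled_t(a, b):
    """t statistic and degrees of freedom assuming equal variances."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    df = na + nb - 2
    pooled = ((na - 1) * va + (nb - 1) * vb) / df
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, df


def welch_t(a, b):
    """t statistic and Welch-Satterthwaite df without assuming equal variances."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se2 = va / na + vb / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical ages for two groups (not the FreshCo data).
group1 = [1.0, 2.0, 3.0, 4.0, 5.0]
group2 = [2.0, 4.0, 6.0, 8.0, 10.0]
print(pooled_t(group1, group2))
print(welch_t(group1, group2))
```

The sign of t follows the order of subtraction (group 1 minus group 2), so a negative value means that the second group has the higher mean.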

B.


There were 126 female respondents and 75 male respondents. According to the group statistics, males have a higher average age (41.53 years) than females (37.65 years).

To see whether the variances differ between the groups, we look at the p-value (Sig.) of Levene’s test. In this case p = .248 > 0.05 (5%), so we do not reject the null hypothesis (H0): there is no significant difference between the variances of the groups. Accordingly, we look at the first row (Equal variances assumed) of the t-test for equality of means; the second row is disregarded (Field, 2009:340). Here p = .000 < 0.05 (Sig. 2-tailed), so we reject the null hypothesis for the means. This means that the group means differ, so we have to look at the mean difference.

To conclude, there is a significant difference in the means but not in the variances. The significance value shows that average age differs between the sexes. The mean difference is negative, which means that group 2 (males) has the higher mean age.


A normality test was also carried out to support the t-test and to give more detail on the average age of each sex and the difference between these averages. The test below supports the choice of a parametric test over a non-parametric test, on the grounds that the distribution is normal.


The normality table supports the assumption that the data is normally distributed (and that the t-test is appropriate). H0 is that the data does not differ significantly from a normal distribution. The p-values for both males and females are greater than 5% (p = .511 > 0.05 and p = .135 > 0.05), so normality is not rejected. The charts below also support the normality of the distribution, which means that the t-test was used appropriately.


Task 6

The task investigates customer feedback in terms of the customers’ level of appreciation and compares the level of appreciation across the genders. An appropriate test of association is the Chi-square test (for two or more samples), which is used to test whether one attribute variable depends on another, that is, to measure the relationship between attribute variables (usually nominal and ordinal variables that can be grouped or ranked) (Venables, 2015, w3 p3).


Firstly, because we have two unrelated samples, we need to produce a Crosstabs table and state the null hypothesis (H0) and the alternative hypothesis (H1).

H0: customer appreciation does not depend on gender (gender does not influence the customer appreciation level).

H1: customer appreciation depends on gender.

The count, or observed frequency, is the number of respondents in each combination of groups. The expected count, or expected frequency, is calculated in the table from the row and column totals. The expected frequency in each cell has to be higher than 5 to avoid misleading results, so that there are no issues with the counts (Bryman, Cramer, 2009:155).

In the table above the standardised residuals are within +/- 3, which suggests that no cell deviates strongly from its expected count.

From the table alone, however, it is hard to tell whether the customer appreciation level depends on gender, because the number of female respondents (126 in total) is higher than the number of male respondents (75 in total). The dependency is therefore not evident without the Chi-square test.


For the Pearson Chi-square test p = .505 > 0.05 (5%), so we do not reject H0 and conclude that customer appreciation does not depend on gender. The two variables are independent of one another.
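The expected counts and the Chi-square statistic can be reproduced by hand from a crosstab. The following is a minimal sketch with made-up counts, not the FreshCo table; the function names are hypothetical, and the p-value helper relies on a closed form that is valid only when the degrees of freedom are even (as they are for a 2 x 5 table, where df = 4).

```python
import math


def chi_square_independence(observed):
    """Return (statistic, df, expected counts) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count = row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


def chi_square_p_even_df(stat, df):
    """Upper-tail p-value of the chi-square distribution for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = stat / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Hypothetical 2 x 3 table of counts (rows: gender, columns: rating level).
observed = [[10, 20, 30],
            [20, 20, 20]]
stat, df, expected = chi_square_independence(observed)
print(stat, df, chi_square_p_even_df(stat, df))
```

If the resulting p-value is above 0.05, H0 of independence is not rejected, which mirrors the decision rule applied to the SPSS output.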

Task 7: Customer behaviour

A. Measuring variables against each other.


Correlation indicates the direction and strength of the relationship between variables. It shows the interdependence of variables and distinguishes direct, null and inverse relationships. Each point represents a respondent’s position in relation to the two variables being measured (Bryman, Cramer, 2009:212).

In this matrix plot most panels show scattered patterns with random distribution and at most a weak form of correlation, except for one case with an obvious inverse, curvilinear relationship (Bryman, Cramer, 2009:215). The empty diagonal corresponds to each variable plotted against itself, which would be a perfect correlation (r = 1). In the lower triangle (which mirrors the upper triangle) we can see a potentially strong correlation between Distance Travelled and Number of Monthly Visits, because the points are very close together. The rest of the data has random patterns without any direction, which indicates weak or no correlation. To conclude, the SPSS output suggests that as the distance travelled decreases, the number of monthly visits increases.

B. Before applying Pearson’s correlation test we should make sure that the relationship is linear. According to the scatter matrix plot, the two variables (Monthly Visits and Distance Travelled) have a curvilinear relationship (the shape of the relationship is not straight and curves at some point), so it is non-linear; thus, “it is not appropriate to apply a measure of linear correlation like Pearson’s r” (Bryman, Cramer, 2009:214).

To use the Pearson test, the relationship should be linear and the two selected variables should be normally distributed. We therefore first need to transform the independent variable to a logarithmic scale in order to perform a valid Pearson correlation test (Ibid), and to test the assumption of normal distribution. Otherwise, the outcome could be unreliable and show errors.

Testing normality (appendix 2)

According to the Shapiro-Wilk test of normality, p < 0.05 for both distance travelled (Sig. = .001) and number of monthly visits (Sig. = .000), so neither variable is normally distributed. This rejects the null hypothesis, which states that the data does not differ from a normal distribution.

In this case the two variables are neither linearly related nor normally distributed. Even after the logarithmic transformation, the test might not provide a meaningful outcome.

Performing linear correlation

To measure correlation we have to explore covariance, which indicates how variables vary together. Pearson’s correlation coefficient (r) is a standardised measure of covariance. If r = 1, there is a perfect positive correlation between the two variables x and y. If r > 0, there is a direct relationship between x and y; if r = 0, there is no linear relationship between them; and if r < 0, there is an inverse relationship between them. We also use the Pearson test because the data is continuous (Venables, 2015, l4, p2).

As the table shows, there is an inverse relationship between the two variables, since r < 0. The relationship is also strong, because r = -.871 (close to -1) (Bryman, Cramer, 2009:217).
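To make the link between covariance, Pearson's r and the logarithmic transformation concrete, the sketch below computes r directly from its definition. The data are made up so that visits fall linearly with the logarithm of distance; they are not the FreshCo data, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical curvilinear data: visits = 10 - 2 * log2(distance).
distance = [1, 2, 4, 8, 16]
visits = [10, 8, 6, 4, 2]

r_raw = pearson_r(distance, visits)
r_log = pearson_r([math.log(d) for d in distance], visits)
print(r_raw, r_log)
```

In this constructed case the raw correlation understates the strength of the relationship, while the log-transformed predictor gives a perfectly linear, inverse relationship; real data would of course be noisier.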


Regressions

Regression analyses the cause-effect relationship between multiple variables, taking into account the accuracy of measures and outliers (Ibid, 229). The null hypothesis is that the regression is not significant. The alternative hypothesis is that the regression is significant.

A large R square value indicates that the regression model fits the data well; a small value indicates poor explanation (Field, 2009:268). Here R2 = .556, meaning that the model explains about 55.6% of the variance, which suggests that the regression line fits the scatter plot reasonably well.

The ANOVA table shows that there is a relationship between the two variables, because the result is significant (p < 5%). Thus, the results in the coefficients table are valid and the model is appropriate. SSM is large (and SSR smaller), which means that the model explains the variable’s behaviour better than its mean does.


H0 here is that the constant does not play a significant role within the model. The same applies to the number of visits. The p-value for both is less than 5%, so the null hypothesis is rejected in each case and we accept the model’s prediction. The β coefficient indicates that each additional mile travelled is associated with a decrease in the number of visits (-.574 visits per mile).
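The quantities discussed above (slope, constant, SSM, SSR and R square) can be illustrated with a small least-squares sketch. The data and the function name `fit_line` are hypothetical and serve only to show how the pieces relate; they are not the FreshCo results.

```python
def fit_line(x, y):
    """Least-squares line y = intercept + slope * x, plus R squared."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Total variation around the mean (SST) splits into the part the model
    # explains (SSM) and the residual part it does not explain (SSR).
    sst = sum((b - my) ** 2 for b in y)
    ssr = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ssm = sst - ssr
    return slope, intercept, ssm / sst


# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept, r_squared = fit_line(x, y)
print(slope, intercept, r_squared)
```

The slope is read as the change in the outcome per one-unit change in the predictor, and R square as the share of SST accounted for by SSM.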

C. Multiple regression analysis

A small R square value indicates that the regression model gives a poor explanation of the data. Here R2 = .011, which is small and shows that the regression model does not fit the data or the scatter plot (Ibid).


The ANOVA test shows that the coefficients table is not reliable: the p-value is greater than 5%, which indicates that the regression model is not significant.

The β coefficients cannot be taken into account, because each of the p-values is higher than 5%; we therefore do not reject the null hypothesis that age, distance and number of visits do not play a significant role within the model. Other independent variables should be introduced in order to predict customers’ monthly expenditure.


Appendices

Appendix 1

Appendix 2


Works Cited

Bryman, A. and Cramer, D., 2009, Quantitative Data Analysis with SPSS 14, 15 & 16: A Guide for Social Scientists.

Field, A., 2009, Discovering Statistics Using SPSS, 3rd edition, SAGE Publications Ltd.

SAGE Publications, 2015, Identifying and Addressing Outliers, Module 5. Available at: http://www.sagepub.com/upm-data/52387_MOD_5.pdf (Accessed on 10/05/2015).

Venables, H., 2015, Quantitative Methods and Data Analysis (i) (MAN00029M), P/G Module, University of York.
