Final assessment QRM
STUDENT EXAMINATION NUMBER: Y1401956
MODULE NO: MAN00029M
MODULE TITLE: Quantitative Methods & Data Analysis
Module Tutor: Dr. Harry Venables
Essay Title: Final assessment
Word Count: 2688
Task 1
Before any manipulation of the data, the Data View and Variable View in SPSS
must be set up according to the rules so that SPSS computes its output properly.
Name | Label | Values | Type | New type | Rationale
Obs | Id number | None | Nominal | Scale | Variable items that are not measurable but are numeric, like ID numbers and phone numbers (can also be Nominal).
Gender | Gender | 0=Female, 1=Male | Nominal | Nominal | Variable items are all numbers that represent categories and have no order to them, e.g. 1-Blue Car, 2-Cat, 0-Male, 11-Female, etc.
Age | Age (years) | None | Nominal | Scale | Variable items are all measurable numbers, e.g. height in cm.
Status | Marital Status | 1=Single, 2=Married, 3=Divorced, 4=Widowed | Nominal | Nominal | Variable items are all numbers that represent categories and have no order to them, e.g. 1-Blue Car, 2-Cat, 0-Male, 11-Female, etc.
Occupation | Occupation | 1=Student, 2=Employed, 3=Self-employed, 4=Retired | Nominal | Nominal | Variable items are all numbers that represent categories and have no order to them, e.g. 1-Blue Car, 2-Cat, 0-Male, 11-Female, etc.
AvgMonthlySpending | Average Monthly Spend (GBP) | None | Nominal | Scale (custom currency) | Variable items are all measurable monetary values.
MonthlyVisits | Number of Monthly visits | None | Nominal | Scale | Variable items are all measurable numbers, e.g. height in cm.
Distance | Distance Travelled (miles) | None | Nominal | Scale | Variable items are all measurable numbers, e.g. height in cm.
Car | Vehicle Ownership | 0=No, 1=Yes | Nominal | Nominal | Variable items are all numbers that represent categories and have no order to them, e.g. 1-Blue Car, 2-Cat, 0-Male, 11-Female, etc.
Appreciation | Customer Appreciation | 1=Very Low, 2=Low, 3=Indifferent, 4=High, 5=Very High | Nominal | Ordinal | Variable items are numbers that represent some form of ranking or order, e.g. Likert scale values 1-5, 1-7.
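The coding scheme in the table can be sketched outside SPSS as a small lookup structure. The Python below is only an illustration, not SPSS syntax: the names `VALUE_LABELS`, `MEASURE` and `decode` are hypothetical, and only some of the variables are included.

```python
# Value labels and measurement levels copied from the Task 1 table.
# The dictionary and function names are illustrative, not SPSS identifiers.
VALUE_LABELS = {
    "Gender": {0: "Female", 1: "Male"},
    "Status": {1: "Single", 2: "Married", 3: "Divorced", 4: "Widowed"},
    "Car": {0: "No", 1: "Yes"},
    "Appreciation": {1: "Very Low", 2: "Low", 3: "Indifferent",
                     4: "High", 5: "Very High"},
}

MEASURE = {
    "Gender": "Nominal",
    "Status": "Nominal",
    "Car": "Nominal",
    "Appreciation": "Ordinal",
    "Age": "Scale",
    "Distance": "Scale",
}


def decode(variable, code):
    """Return the label for a coded value, or the raw value if no labels exist."""
    return VALUE_LABELS.get(variable, {}).get(code, code)


print(decode("Status", 2))      # Married
print(decode("Age", 41))        # 41 (scale variables carry no value labels)
print(MEASURE["Appreciation"])  # Ordinal
```

Keeping the labels separate from the measurement level mirrors the Values and Type columns of the Variable View.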
Task 2
A. The bar chart identifies the target consumers of the FreshCo retail centre by
two consumer characteristics: status and occupation. A cross-tabulation
(Table 2.1) is therefore used to analyse the two variables and produce an
appropriate bar chart.
Table 2.1
Table 2.2
Table 2.2 shows that there are 201 respondents and 2 modes.
From this chart it can be concluded that the largest groups of FreshCo’s
consumers are employed (116 out of 201 respondents) and married (89 out of 201
respondents). Thus, the bimodal attribute is married and employed, because these
categories occur most frequently (appendix 2) (Field, 2009:21).
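The cross-tabulation and the modes can be illustrated with a short Python sketch. The respondent list below is hypothetical (the real survey has 201 respondents); it only shows how counts per combination and the most frequent category of each variable are obtained.

```python
from collections import Counter

# Hypothetical (status, occupation) pairs, not the survey data.
respondents = [
    ("Married", "Employed"), ("Married", "Employed"), ("Single", "Student"),
    ("Married", "Retired"), ("Single", "Employed"), ("Divorced", "Self-employed"),
]

# Cross-tabulation: count of each (status, occupation) combination.
crosstab = Counter(respondents)

# Mode of each variable taken separately.
status_mode = Counter(s for s, _ in respondents).most_common(1)[0]
occupation_mode = Counter(o for _, o in respondents).most_common(1)[0]

print(crosstab[("Married", "Employed")])  # 2
print(status_mode)                        # ('Married', 3)
print(occupation_mode)                    # ('Employed', 3)
```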
B. The target-consumer analysis keeps the previous characteristics, status and
occupation, and examines the differences between them with regard to car
ownership.
Divorced respondents who are either employed or self-employed, and widowed
respondents who are retired, are the groups that don’t own a car. Among the
divorced, the self-employed sub-group is the largest group that doesn’t own a car.
By contrast, employed and married respondents form the largest group that owns a
car. Single students are the second largest, and single employed respondents are
the smallest group to own a car.
C. Given that the majority of FreshCo’s customers are married, employed car
owners, potential issues such as an insufficient number of parking spaces could
arise. Also, convenient opening hours at the retail centre could make a
significant difference for working individuals. Marital status can also indicate
the presence of children and the need for children’s facilities such as
playgrounds and food courts on the site.
Task 3
A. Consumer spending in regard to consumer characteristics.
Extreme values (outliers) occur for male students and self-employed females.
Outliers are extreme values that deviate from the rest of the responses. In this
case three respondents have outlying answers on average monthly spending. The
numbers next to the outliers indicate the row number of the respondent
(SAGE, 2015). In the self-employed female group one person has higher monthly
spending (GBP 522.59) than the rest. In the male student group two respondents
spend more than the rest of the group (GBP 219.84 and GBP 225.84).
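How a box plot decides that a value is an outlier can be sketched as follows. SPSS flags values lying more than 1.5 box lengths beyond the edge of the box; the sketch applies the same 1.5 x IQR rule, but uses Python's default quartile method, which may differ slightly from the hinges SPSS computes. Apart from 522.59, which is the value reported above, the spending figures are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default 'exclusive' method)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical monthly spending for one group; 522.59 is the reported outlier.
spending = [150, 160, 165, 170, 175, 180, 185, 190, 195, 522.59]

print(iqr_outliers(spending))  # [522.59]
```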
Medians are dispersed in terms of occupation, whereas in terms of gender the
medians are not significantly different (they overlap). Both employed and
self-employed males and females spend more money than the other groups (the
difference is significant because their confidence intervals don’t overlap).
There is also some difference between the employed and self-employed groups
because the boxes of these groups barely overlap (for employed and self-employed
women there is less difference in spending because the boxes overlap slightly).
The lower median position of the self-employed group shows lower spending than in
the employed group, which has a higher median position. The interquartile ranges
are of slightly different length and have different positions, which indicates
different dispersion of data between the two groups (the self-employed group is
smaller) (Field, 2009:100-2).
There are no significant differences between the expenditure of students and
retired individuals because their boxes overlap. However, the spread of the
student interquartile range differs across gender. The male group is smaller and
less likely to spend more money than the female group (Ibid).
B. Average monthly expenditure according to level of appreciation.
The average expenditure for customers with ‘very high’ customer appreciation
differs significantly from the rest because its median and box (incl. confidence
intervals) are far from the others and don’t overlap with them. The interquartile
ranges of the ‘very low’ and ‘low’ appreciation groups are very different from
all the rest, which indicates a wider dispersion of data (Field, 2009:101-2).
Some box plots show skewness and a lack of symmetry in the data, which needs to
be examined more closely through a statistical test.
The descriptive statistics show that the means as well as the medians of ‘low’,
‘very low’, ‘indifferent’ and ‘high’ appreciation are not significantly different
from each other. However, the mean (197.8941) as well as the median (199.9549) of
‘very high’ appreciation is significantly lower and differs from the rest.
The standard deviation also differs for ‘very low’ and ‘low’, which shows
numerically that values in these groups deviate more widely from the mean. The
interquartile range of both these groups also differs significantly from the rest.
We can also observe slight positive skewness for the ‘high’ and ‘indifferent’
groups and slight negative skewness for the rest, which indicates slight
asymmetry of the data distribution (Field, 2009:19).
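The sign of the skewness statistic can be illustrated with a minimal sketch. It uses the simple moment coefficient (third central moment divided by the 1.5 power of the second), which is an assumption made for brevity: SPSS reports a bias-adjusted version, so the magnitudes will not match its output exactly. The three data sets are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (no bias adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]           # no skew
right_tailed = [1, 2, 2, 3, 10]       # long right tail: positive skew
left_tailed = [-10, -3, -2, -2, -1]   # long left tail: negative skew

for name, data in [("symmetric", symmetric),
                   ("right-tailed", right_tailed),
                   ("left-tailed", left_tailed)]:
    print(name, round(skewness(data), 3))
```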
Customers that tend to spend the least amount have the highest customer
appreciation. Customers that spend the most are indifferent or have low or very low
appreciation.
Task 4: Distribution of customers’ monthly expenditure.
According to the histogram, the data for average monthly spending is not
normally distributed. We observe a flat distribution with a negative skew. The
bars fall outside the normal curve and are visibly split in two. In the Normal
Q-Q Plot we see some deviation towards the tail. A Normal Q-Q Plot is a chart of
the observed values plotted against the values expected under a normal
distribution. The data values lie quite far from the line and even cross it,
which shows that the distribution is not normal. Only the spending data around
the 180.000 and 5300.00 spending values follows the normal distribution.
Generally, the values don’t follow the normal distribution. The Detrended Normal
Q-Q plot is another view of the first plot that removes the trend line. It shows
the departure from normality even more closely.
The box plot doesn’t show any outliers. The central section of the data is not
symmetrically distributed because the median is not centrally placed in the box.
Distribution of distance travelled
The Distance travelled data seems more normally distributed. However, if we look
at the Normal Q-Q Plot then we can see a slight deviation towards the end of the
tail. The Detrended Normal Q-Q plot gives a closer look, which reveals that the
data is not as normally distributed as it appears. The box plot has three
outliers. The median is almost centrally placed, so the central section of the
data is almost symmetrically distributed.
Normality of the distribution is hard to judge without carrying out a test of
normality.
From the table we can spot skewness, which indicates a non-normal distribution
in both cases, more so in the first case (-.900) than in the other (.612).
According to the Shapiro-Wilk test (which is more reliable), Sig. (p < 0.05)
shows that the data for both average monthly spending and distance travelled is
not normally distributed. The null hypothesis here is that the given data does
not differ from a normal distribution. The hypothesis test rejects it because the
p-value is less than the level of significance: .000 and .001 are both smaller
than 5% (0.05). Therefore, the null hypothesis is rejected and we conclude that
the data is not normally distributed, and that Distance travelled deviates less
from normality than the monthly spending data.
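The decision rule applied to the Sig. column can be written out as a short sketch. The function name `normality_decision` is hypothetical; the two p-values are the ones quoted above, and the 5% level is the one used throughout this report.

```python
ALPHA = 0.05  # significance level used in the report


def normality_decision(p_value, alpha=ALPHA):
    """Apply the decision rule for H0: the data follow a normal distribution."""
    if p_value < alpha:
        return "reject H0: data deviate from normality"
    return "do not reject H0: no evidence against normality"


# Sig. values as reported (SPSS prints values below .0005 as .000).
reported = {"Average monthly spending": 0.000, "Distance travelled": 0.001}

for variable, p in reported.items():
    print(variable, "->", normality_decision(p))
```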
Task 5: Significance of age-gender difference.
a. It is assumed that the data is normally distributed, which suggests a
parametric test in the form of a T-test. The T-test is used when there are “two
experimental conditions and different participants used in each condition” (Field,
2009:334).
The null hypothesis (H0) is that there is no significant difference in mean age
between the genders. The alternative hypothesis (H1) is that there is a
significant difference in mean age between the genders.
The p-value indicates the level of probability at which we reject or do not
reject the hypothesis (Ibid). The p-value has to be linked to the direction of the
hypothesis we are testing. The first step is the analysis of variances: if
p < 0.05 (5%), H0 is rejected and the variances differ significantly; if
p > 0.05 (5%), H0 is not rejected. The second step is the analysis of means. If
the first test doesn’t show a significant difference in variances, so that its H0
is not rejected, then we should look at the first row of the Independent Samples
Test.
B.
There were 126 female respondents and 75 male respondents. According to the
group statistics, males have a higher average age (41.53 years) than females
(37.65 years).
To see whether the variances differ between the groups we should look at the
p-value (Sig.) of Levene’s test. In this case p = .248 > 0.05 (5%), so we do not
reject the null hypothesis (H0). Thus, there is no significant difference between
the variances of the groups. Accordingly, we look at the first row (Equal
variances assumed) of the T-test for equality of means. The second row is
disregarded (Field, 2009:340). P = .000 < 5% (Sig. 2-tailed), so we reject the
null hypothesis for the means. This means that there is a difference in the mean
between the groups, so we have to look at the Mean Difference value.
To conclude, there is a significant difference in the mean but not in the
variance. The significance measure shows that there is a difference in average
age between the sexes. The mean difference is negative, which means that group 2
(males) has the higher mean age.
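The "equal variances assumed" row corresponds to a pooled-variance t statistic, which can be sketched as below. The ages are hypothetical, chosen only so that the second group has the higher mean; the sketch shows why the mean difference then comes out negative. It does not compute the p-value, which SPSS obtains from the t distribution with the stated degrees of freedom.

```python
import math
from statistics import mean, variance


def pooled_t(sample_a, sample_b):
    """Return (mean difference, t statistic, degrees of freedom), equal variances assumed."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    diff = mean(sample_a) - mean(sample_b)
    return diff, diff / std_error, df


# Hypothetical ages: group 1 = females, group 2 = males.
females = [30, 35, 38, 40, 42]
males = [36, 40, 42, 45, 47]

diff, t_stat, df = pooled_t(females, males)
print(diff, df)  # negative difference: group 2 has the higher mean
print(round(t_stat, 3))
```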
A normality test was also carried out to support the T-test and to give more
detail on the average age of each sex and the differences between these averages.
The test below supports the choice of a parametric test over a non-parametric
test on the grounds of normality of distribution.
The normality table supports the assumption that the data is normally distributed
(and that the T-test is appropriate). H0 is that the data does not differ
significantly from a normal distribution. The p-values for both males and females
are bigger than 5% (p = .511 > 0.05 and p = .135 > 0.05), so normality is not
rejected. The charts below also support the normality of the distribution, which
means that the T-test was used appropriately.
Task 6
The task investigates customer feedback in terms of the customers’ level of
appreciation. It also compares the level of appreciation across the genders of
consumers. An appropriate test of association is the Chi-square test (for two or
more samples), which is used to test whether one attribute variable depends on
another; it investigates the relationship among attribute variables, usually
nominal and ordinal variables that can be grouped or ranked
(Venables, 2015, w3 p3).
Firstly, because we have two unrelated samples we need to make a Crosstabs table
and state the null hypothesis (H0) and the alternative hypothesis (H1).
H0 - customer appreciation does not depend on gender (gender does not influence
the customer appreciation level).
H1 - customer appreciation depends on gender.
The count, or observed frequency, is the result from the variable groups. The
expected count, or expected frequency, is calculated in the table from the row
and column totals. Expected frequencies in each cell have to be higher than 5 to
avoid misleading results (Bryman, Cramer, 2009:155).
In the table above the standardised residuals are within +/- 3, which supports
the reliability of the test and its normality.
According to the table alone, however, it is hard to tell whether the customer
appreciation level depends on gender because the number of female respondents
(126 in total) is higher than the number of male respondents (75 in total). So,
the dependency is not evident without the Chi-square test.
Looking at the Pearson Chi-square test, p = .505 > 0.05 (5%), so we do not
reject H0 and conclude that customer appreciation does not depend on gender. The
two variables are independent of one another.
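The way expected counts follow from the row and column totals, and how they feed the Chi-square statistic, can be sketched as follows. The 2 x 2 table is hypothetical and smaller than the gender by appreciation table analysed here; it only illustrates the calculation.

```python
def expected_counts(observed):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Pearson Chi-square: sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))


# Hypothetical table: rows = gender, columns = low / high appreciation.
observed = [[30, 20],
            [20, 30]]

print(expected_counts(observed))  # [[25.0, 25.0], [25.0, 25.0]]
print(chi_square(observed))       # 4.0
```

The degrees of freedom are (rows - 1) x (columns - 1), which is 1 for this small table; the p-value is then read from the Chi-square distribution.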
Task 7: Customer behaviour
A. Measuring variables against each other.
Correlation indicates the direction and strength of the relationship between
variables. It shows the interdependence of variables and distinguishes direct,
null and inverse relationships. Each point represents a respondent’s position in
relation to the two variables being measured (Bryman, Cramer, 2009:212).
In this matrix plot the majority of panels show scattered patterns with a random
distribution and some weak form of correlation, except for one case with an
obvious inverse, curvilinear (negative) relationship (Bryman, Cramer, 2009:215).
The diagonal with no data represents each variable plotted against itself, which
would indicate perfect correlation (where the correlation coefficient equals 1).
If we look at the lower triangle (which mirrors the upper triangle) we can see a
potentially strong correlation between the Distance Travelled and Number of
Monthly Visits data because the scatter is very tight. The rest of the data has
random patterns and distribution without any direction, which indicates weak
correlation or a lack of it. To conclude, from the SPSS output we can interpret
that as distance travelled decreases, monthly visits increase.
B. Before applying Pearson’s correlation test we should make sure that the
relationship is linear. According to the scatter matrix plot, the two variables
(Monthly Visits and Distance Travelled) have a curvilinear relationship (the
shape of the relationship is not straight and curves at some point), so it is
non-linear; thus, “it is not appropriate to apply a measure of linear correlation
like Pearson’s r” (Bryman, Cramer, 2009:214).
In order to use the Pearson test the relationship should be linear and the two
selected variables should be normally distributed. Firstly, we need to transform
an independent variable onto a logarithmic scale to perform a valid Pearson
correlation test (Ibid) and test the assumption of normal distribution.
Otherwise, the outcome could be unreliable and contain errors.
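Why a logarithmic transformation can straighten a curvilinear relationship is illustrated by the sketch below. The data is hypothetical and deliberately idealised (visits fall by a fixed amount each time the distance doubles), and base 2 is chosen only to keep the arithmetic exact; the survey data will not behave this cleanly.

```python
import math


def successive_slopes(xs, ys):
    """Slope between each pair of neighbouring points; constant slopes mean a straight line."""
    points = list(zip(xs, ys))
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]


# Hypothetical curvilinear data: visits drop by 3 each time distance doubles.
distance = [1, 2, 4, 8, 16]
visits = [20 - 3 * math.log2(d) for d in distance]

raw_slopes = successive_slopes(distance, visits)
log_slopes = successive_slopes([math.log2(d) for d in distance], visits)

print(raw_slopes)  # [-3.0, -1.5, -0.75, -0.375]  slope keeps changing: curved
print(log_slopes)  # [-3.0, -3.0, -3.0, -3.0]     constant slope: linear
```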
Testing normality (appendix 2)
According to the Shapiro-Wilk test of normality, p < 0.05 shows that the data
for both average distance travelled (Sig. = .001) and number of monthly visits
(Sig. = .000) is not normally distributed. This rejects the null hypothesis,
which states that the given data does not differ from a normal distribution.
In this case the two variables are neither linearly related nor normally
distributed. Despite the logarithmic transformation, the test might not provide
a meaningful outcome.
Performing linear correlation
In order to measure correlation we have to explore covariance, which indicates
how variables vary together. Pearson’s correlation coefficient (P) is a
standardised measure of covariance. If P=1, there is a perfect positive
correlation between the two variables x and y. If P>0, there is a direct
relationship between the variables x and y. If P=0, there is no linear
relationship between the variables; and if P<0, there is an inverse relationship
between the variables. We also use the Pearson test because the data is
continuous (Venables, 2015, l4, p2).
As we can observe in the table, there is an inverse relationship between the two
variables since P<0. They also have a strong negative relationship because
P = -.871 (close to -1) (Bryman, Cramer, 2009:217).
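The link between covariance and Pearson’s coefficient can be sketched as follows: the coefficient is the covariance divided by the product of the two standard deviations. The distance and visit values are hypothetical and are not the survey data, so the result is not the -.871 reported by SPSS.

```python
import math


def pearson(xs, ys):
    """Pearson's coefficient = covariance / (sd of x * sd of y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)


# Hypothetical data: visits fall as distance rises.
distance = [1, 2, 3, 4, 5]
visits = [12, 9, 8, 4, 3]

print(round(pearson(distance, visits), 3))  # negative and close to -1: strong inverse relationship
```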
Regressions
Regression analyses the cause-effect relationship between multiple variables,
taking into account the accuracy of measures and outliers (Ibid, 229).
The null hypothesis is that the regression is not significant. The alternative
hypothesis is that the regression is significant.
A large R square value indicates that the regression model fits the data; a
small value indicates poor explanation (Field, 2009:268). R2=.556, which
suggests that the regression model fits the data and the regression line fits the
scatter plot.
If we look at the ANOVA test there is a relationship between the two variables
because the difference is significant (P<5%). Thus, the results in the
coefficients table are valid and the model is appropriate. SSM is large (SSR is
smaller), which means that the model is able to explain the variable’s behaviour
better than its mean.
H0 here is that the constant does not play a significant role within the model.
The same applies to the number of visits. The p-value for both is less than 5%,
so the hypothesis is rejected and we accept the model’s prediction. Accordingly,
β indicates that each additional mile travelled is associated with a decrease in
the number of visits (-.574 visits per mile).
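How the slope, the constant and R square of a simple linear regression are obtained can be sketched as below. The data is hypothetical, so the slope is not the -.574 reported above; R square is computed as 1 - SSR / SST, which equals the share of variation explained by the model.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((y - mean_y) ** 2 for y in ys)                  # SST
    ss_resid = sum((y - (intercept + slope * x)) ** 2
                   for x, y in zip(xs, ys))                        # SSR
    return slope, intercept, 1 - ss_resid / ss_total


# Hypothetical data: distance travelled (miles) and monthly visits.
distance = [1, 2, 3, 4, 5]
visits = [12, 9, 8, 4, 3]

slope, intercept, r_squared = fit_line(distance, visits)
print(round(slope, 2), round(intercept, 2), round(r_squared, 3))
```

A negative slope is read the same way as β above: each extra mile is associated with fewer visits.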
C. Multiple regression analysis
A small R square value indicates that the regression model gives a poor
explanation of the data and is unlikely to fit it. R2=.011, which is small and
suggests that the regression line doesn’t fit the data or the scatter plot (Ibid).
The ANOVA test shows that the coefficients table is not reliable because the
p-value is bigger than 5%, which indicates that the regression model is not
significant.
β cannot be taken into account because each of the p-values is higher than 5%,
so we do not reject the null hypothesis that age, distance and number of visits
don’t play a significant role within the model. Other independent variables
should be introduced in order to predict customers’ monthly expenditure.
Appendices
Appendix 1
Appendix 2
Works Cited
Bryman A., Cramer D., 2009, Quantitative Data Analysis with SPSS 14, 15 & 16: A
Guide for Social Scientists.
Field A., 2009, Discovering Statistics Using SPSS, 3rd edition, SAGE Publications
Ltd.
SAGE Publications, 2015, Identifying and Addressing Outliers, Module 5.
Available at: http://www.sagepub.com/upm-data/52387_MOD_5.pdf (Accessed on
10/05/2015).
Venables, 2015, Quantitative Methods and Data Analysis (i) (MAN00029M) P/G
Module, University of York.