prac 1 1725 revised


Upload: dalip-bhadare

Post on 15-Aug-2015


Fig 1: Histogram of travel times (x-axis: Minutes, y-axis: Density)

Fig 2: Histogram: Hours worked (x-axis: Hours, y-axis: Density)

Fig 3: Histogram: Those who do paid work (x-axis: Hours, y-axis: Density)

Dalip Bhadare 200704052 Dr Arief Gusnanto

MATH1725 Practical 1: Analysis of student data

Introduction

This practical uses data collected from maths students enrolled in MATH1725 at the University of Leeds. For each student the data record gender, the time taken to travel to university, and the number of hours worked during the week. The data are plotted as histograms and boxplots, which are then analysed and commented on. The analysis is supported by comparing the mean and the spread (standard deviation) of the data.

Fig 1 is a histogram of the time it takes each student to travel to the university. Each response was rounded to the nearest 5 minutes. If the break points were placed on the multiples of 5, anyone who answered 0 minutes would be put in the same 0-5 minute bin as those who answered 5 minutes. To avoid this, break points of 2.5, 7.5, 12.5, 17.5 and so on are used, so that the first column in Fig 1 shows only those who answered 0 minutes, that is, those who live on campus.
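The effect of the break points can be checked numerically. The following is a minimal Python sketch and not the method used in the practical: the `travel` list is made-up data, `bin_counts` is a hypothetical helper, and the first offset break of -2.5 is an assumption (the report only lists the breaks from 2.5 upwards).

```python
def bin_counts(values, breaks):
    """Count values in bins (breaks[i], breaks[i+1]]; the first bin also includes breaks[0]."""
    counts = [0] * (len(breaks) - 1)
    for v in values:
        for i in range(len(breaks) - 1):
            lower_ok = v > breaks[i] or (i == 0 and v == breaks[0])
            if lower_ok and v <= breaks[i + 1]:
                counts[i] += 1
                break
    return counts

# Hypothetical responses, each already rounded to the nearest 5 minutes.
travel = [0, 0, 5, 5, 5, 10, 10, 15, 20, 25]

# Breaks on the multiples of 5: answers of 0 and 5 share the first bin.
on_fives = [0, 5, 10, 15, 20, 25]
# Breaks offset by 2.5 (first break of -2.5 is assumed): 0 gets its own bin.
offset = [-2.5, 2.5, 7.5, 12.5, 17.5, 22.5, 27.5]

print(bin_counts(travel, on_fives))
print(bin_counts(travel, offset))
```

With the offset breaks every possible response sits at the centre of its own bin, so no answer falls on a bin boundary.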

The histogram shows that most students live within twenty minutes of the university and that 2.5-7.5 minutes is the modal group: its bar has the largest area, and in a density histogram area represents frequency.
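The link between density, area and frequency can be sketched as follows. This is an illustrative Python example only; the `breaks` and `counts` values are hypothetical and are not taken from the student data.

```python
# Hypothetical bin counts for equal-width bins of 5 minutes.
breaks = [-2.5, 2.5, 7.5, 12.5, 17.5, 22.5, 27.5]
counts = [2, 3, 2, 1, 1, 1]

total = sum(counts)

# Density = relative frequency / bin width, so the bar areas sum to 1.
densities = [c / (total * (hi - lo)) for c, lo, hi in zip(counts, breaks, breaks[1:])]

# Area of each bar = density * width = relative frequency of that bin.
areas = [d * (hi - lo) for d, lo, hi in zip(densities, breaks, breaks[1:])]

# The modal group is the bin whose bar has the largest area.
modal_index = max(range(len(areas)), key=lambda i: areas[i])
modal_bin = (breaks[modal_index], breaks[modal_index + 1])
print(modal_bin)
```

When all bins have the same width, as here, the tallest bar is also the bar with the largest area; the distinction only matters for unequal bin widths.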

The next histogram (Fig 2) shows the number of hours worked by each student. This time the responses were given to the nearest hour, so the break points of the bins are 0.5, 1.5, 2.5 and so on. It is clear from this histogram that most students do no work, so for comparison Fig 3 shows a histogram of only those who do paid work. Fig 3 shows that almost everyone who works does around 15 hours or less, which is to be expected given that they also attend university during the day. However, according to this histogram, a few students spend about as much time at work as they do at university!



Fig 5: Scatter plot of travel time to worked hours (x-axis: Hours per week, y-axis: Minutes)


The summary statistics for both travel times and work hours are shown in Fig 4. A summary is harder to read than a histogram because the individual responses cannot be seen, but it does give each quartile and the mean. The third quartile is 25 minutes while the maximum is 120 minutes, which supports the impression from Fig 1 that few students (at most a quarter) take longer than 25 minutes to travel to university. The standard deviation of 15.6 shows that the data are not spread too widely around the mean. The summary statistics for work hours strongly support Fig 2 in showing that most students do not work at all, which is why all the quartiles are zero and the mean is very low. The small standard deviation shows that the data lie close to the mean.

A more useful comparison is to find the summary statistics for only those who did paid work and compare them with Fig 3. Among those who work, the sample mean is 9.941 hours, which is close to the median of 8.0 hours. The small standard deviation again shows that the data lie close to the mean, and the upper quartile agrees with Fig 3 that most of those who work do less than 13.5 hours.
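The contrast between the two sets of summary statistics can be reproduced in miniature. The sketch below uses Python's standard `statistics` module on a hypothetical `hours` list; the numbers are invented for illustration and the quartile method ("inclusive") is an assumption, since the report does not say how its quartiles were computed.

```python
import statistics

# Hypothetical weekly hours for twenty students; most do no paid work.
hours = [0] * 16 + [4, 8, 10, 16]

def summarise(values):
    """Return the quartiles, mean and sample standard deviation of values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }

everyone = summarise(hours)
workers = summarise([h for h in hours if h > 0])  # keep only those who work

print(everyone)
print(workers)
```

Because more than three quarters of the made-up students report zero hours, all three quartiles for `everyone` are zero, while the `workers` summary describes only the non-zero responses.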

Fig 5 shows a scatter plot of travel time against hours worked per week. There is no clear correlation between the two variables, but the plot does show that few students work (supported by the large peak in Fig 2) and that, of those who do, very few work more than 15 hours, as suggested by Figs 3 and 4. The scatter plot also supports Fig 1 in showing that almost everyone lives within 60 minutes of the university.
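A visual impression of "no clear correlation" can be backed up with a correlation coefficient. The report does not compute one, so the following is only a sketch of how the Pearson coefficient could be calculated; `pearson_r` is a hypothetical helper and the paired data are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (hours worked, travel minutes) pairs for eight students.
hours = [0, 0, 0, 0, 8, 12, 0, 15]
minutes = [5, 20, 10, 60, 15, 5, 25, 20]

r = pearson_r(hours, minutes)
print(round(r, 3))
```

A value of `r` near 0 is consistent with no linear relationship, while values near 1 or -1 indicate a strong one.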

The boxplot of travel times (Fig 6) again supports Fig 1 in showing that most students take 20 minutes or less to travel to university. It also shows that the median is about 15 minutes, which cannot be read from the histogram in Fig 1. The boxplot shows the upper and lower quartiles as well, giving a visual sense of how the middle half of the data is spread around the median, which again cannot be seen from the histogram. In addition, the boxplot labels anyone taking longer than 45 minutes as an outlier, which supports the view from Fig 1. However, the boxplot does not show any frequencies; it gives only a summarised view of where the data lie, whereas in Fig 1 the area of each bin shows the frequency of each group. From Fig 1 we can see that the group with the biggest area is those taking 5 minutes to travel to university, and this cannot be seen from Fig 6.

Fig 7 compares the travel times of each gender, but there is no clear difference. In fact the two are very similar, with the females having a slightly larger range.
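The outlier labelling in a boxplot usually follows a fixed rule. A common convention flags any point more than 1.5 times the interquartile range beyond the quartiles; whether the boxplots in this practical used exactly that rule is an assumption. The sketch below applies it to invented travel times, with `tukey_fences` as a hypothetical helper.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical travel times in minutes, rounded to the nearest 5.
travel = [0, 5, 5, 10, 10, 15, 15, 20, 25, 25, 30, 60, 120]

lower, upper = tukey_fences(travel)
outliers = [t for t in travel if t < lower or t > upper]
print(lower, upper, outliers)
```

Since responses are rounded to the nearest 5 minutes, an upper fence that falls between two multiples of 5 means every response above the lower multiple is drawn as an outlier.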

Fig 8 compares male and female travel times. There are no major differences, but the first bin shows that more males live on campus, while a slightly higher mean and a larger frequency density at 40+ minutes for females may suggest that slightly more females take longer to travel. The two histograms are very similar, which is backed up by the fact that the two groups have the same interquartile range and median.

Fig 9 suggests that more females work than males do. This can be seen from the frequency density bars for female worked hours, which are slightly larger than the male ones. It is also backed up by the fact that the mean for female worked hours is almost an hour more than the male mean.

Fig 6, Fig 7

Fig 8: Histogram: Travel times - male; Histogram: Travel times - female (x-axis: Minutes, y-axis: Density)

Fig 9: Histogram: Hours worked - male; Histogram: Hours worked - female (x-axis: Hours, y-axis: Density)


Fig 10 shows histograms of the sample mean for different sample sizes, using the data on hours worked. The histogram for sample size n=10 is not as close to the normal distribution as might have been expected. However, Fig 10 shows that as the sample size increases from n=10 to n=20 and then n=40, the histogram looks more and more like the normal distribution.

The mean of each histogram is similar whatever the sample size, and is close to the mean of the original data in Fig 4. The variance, however, gets smaller as the sample size increases, meaning that the sample means are less spread out for larger samples.
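The shrinking variance can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of the practical: the `population` list is invented, samples are drawn with replacement, and the number of repeats (2000) and the random seed are arbitrary choices.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population of weekly hours: most students do no paid work.
population = [0] * 80 + [4, 6, 8, 8, 10, 10, 12, 12, 15, 15,
                         16, 18, 20, 20, 24, 25, 30, 30, 35, 35]

def sample_means(pop, n, repeats=2000):
    """Draw `repeats` samples of size n (with replacement) and return their means."""
    return [sum(random.choices(pop, k=n)) / n for _ in range(repeats)]

means = {}
variances = {}
for n in (10, 20, 40):
    m = sample_means(population, n)
    means[n] = statistics.mean(m)          # centre of the sample means
    variances[n] = statistics.variance(m)  # spread of the sample means
    print(n, round(means[n], 2), round(variances[n], 2))
```

In theory the variance of the sample mean is the population variance divided by n, so doubling the sample size should roughly halve the variance while leaving the centre unchanged.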

Fig 10: Histogram of v10; Histogram of v20; Histogram of v40 (x-axis: v10, v20, v40; y-axis: Frequency)

Conclusion

To conclude, there are a number of ways to summarise and analyse the data collected, but each leaves something out: the mean cannot be read from a histogram, and the frequency density and overall distribution cannot be seen from a boxplot. The best approach is to produce all of the plots and take the key features from each one. The histograms show that the overall distribution of travel times is skewed to the right, as the frequency density gets smaller as time increases. The scatter plot allows us to observe any correlation between variables, and in this case there was none between travel time and hours worked. The boxplot gives the median and interquartile range, as well as pointing out any outliers. The data can then be re-examined for differences between the sexes; in this case there was no major difference in travel times, but women were observed to work more hours, as can be seen from the histograms in Fig 9. Finally, by taking samples of different sizes we can evaluate whether the distribution of the sample mean is close to the normal distribution. Fig 10 suggests that as the sample size increases the histogram tends towards the normal distribution, so a larger sample is probably more reliable than a smaller one.