
Team Final Project

Business 304

Business Statistics

Instructor: Fang Fang

Class: M, W 11:30-12:50

Group:

Cody Britton

Gregory Ortiz

Stephano Bonham

Carlos Fierro

TITLE: “What will it take for you to ride the ‘Sprinter’?”


Team Statistical Analysis

Background Demographics-CSUSM

To conduct a sound research or statistical analysis on a particular topic, one must understand the population from which the sample data are drawn. San Diego County is one of the fastest-growing regions in California. Within its borders, San Diego County has many well-known and renowned colleges, one of which is California State University San Marcos (CSUSM). CSUSM is a densely populated and diverse university. The school is home to 8,734 students, 5,000 of whom are full-time students (QuickStats). Of this population, 49.88% are 22 years or younger, 21.90% are ages 23-25, 18.92% are ages 26-35, and 9.30% are 36 or older (QuickStats). Aside from age, CSUSM can also be broken down by gender. According to QuickStats for 2007, Cal State San Marcos enrolled 3,256 male students and 5,478 female students, or 37.3% and 62.7% respectively (QuickStats). California State University San Marcos remains one of the largest destination schools, and people commute from every direction: it enrolls students from as far south as southern San Diego and as far north as San Bernardino and Los Angeles counties. As a commuter school, CSUSM is also diverse in its ethnic breakdown. Roughly 3.3% of CSUSM students are African American, 11.2% Asian/Pacific Islander, 21.1% Latino, 1% Native American, 50% white, and 13.4% other (QuickStats).


Introduction

Test Hypothesis:

“Those who produce a lower annual income are more inclined to ride the

Sprinter if price for parking rises”

Our survey was central to our statistical analysis, and the questions we asked proved useful in supporting our research question. We began by asking respondents their gender and age. For a sample to be accurate, its demographics must closely resemble those of the population. Gender and age also matter to our research because respondents who are younger or female may feel less safe using public transportation than those who are male or older. Age is important for a second reason: older respondents are often more financially stable and therefore do not feel the strain of paying for parking as much as a younger individual with a lower income would. Most importantly, a sample that accurately represents the population (CSUSM demographics) gives our research a more accurate and definite meaning than a sample that does not resemble the population from which it came.

The next questions, which were also a key part of our research, covered ethnicity and level of education. Again, a sample that closely resembles the population of CSUSM is key to our research: a sample that matches both the ethnic makeup of the college and the education level of its students will have a smaller sampling error and fewer discrepancies in the results. Some of the general questions we asked were about respondents' jobs, pay, hours worked, hours at school, and whether or not they pay for their own school and parking. These questions are important because a person who spends more hours at school may work less and therefore be less able to pay for school and parking than a person who works more and is paid more. These are all key points in our statistical data and are useful in testing our research hypothesis.

We asked some other general questions in order to understand respondents' commuting patterns and transportation preferences. These questions matter because someone who does not know of the Sprinter or has never ridden it is less likely to think the Sprinter is a good idea and less likely to consider riding it if parking prices go up. Asking respondents whether they use other forms of transportation lets us, the researchers, see whether that usage is a response to the high cost of parking.

Plan

Our research plan is to survey a sample of 100 students who attend California State University San Marcos. This sample will be used to test our research hypothesis. The survey consists of questions that gather information on students' qualitative and general characteristics. Each member of our research team will administer 25 surveys to 25 random students who attend CSUSM; because we have four researchers, the total number of surveys administered will be 100. Splitting the work this way means that no one person is responsible for collecting all of the data, which reduces the chance that a single collector's bias affects the sample. We will not distribute the survey using stratified random sampling, in which subgroups are identified. Our team feels that, to get the best representation of the population in the sample, the survey must be handed out using Simple Random Sampling, in which every possible sample of a specific size has an equal chance of being selected.
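As a rough illustration of simple random sampling, the sketch below draws 100 students from a roster so that every possible sample of that size is equally likely. The numbered roster and the fixed seed are assumptions made for the example; our team handed out paper surveys in person rather than drawing names from a list.

```python
import random

POPULATION_SIZE = 8734   # CSUSM enrollment cited from QuickStats
SAMPLE_SIZE = 100        # number of surveys planned

def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Pick sample_size distinct IDs; every subset of that size is equally likely."""
    rng = random.Random(seed)
    roster = range(1, population_size + 1)  # hypothetical student ID numbers
    return rng.sample(roster, sample_size)

sample_ids = draw_simple_random_sample(POPULATION_SIZE, SAMPLE_SIZE, seed=304)
print(len(sample_ids), "students selected, for example:", sorted(sample_ids)[:5])
```

Sampling without replacement guarantees that no student is selected twice.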

The theorem on which we hope to base our research is the Central Limit Theorem (CLT). This theorem states that “for simple random samples of n observations taken from a population with mean µ and standard deviation σ, regardless of the population’s distribution, provided the sample size is sufficiently large, the distribution of the sample means will be approximately normal with a mean equal to the population mean and a standard deviation equal to the population standard deviation divided by the square root of the sample size. The larger the sample size, the better the approximation to the normal distribution” (Pearson, 647).
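The theorem can be checked informally with a small simulation. The sketch below builds a skewed, made-up population (not our survey data), repeatedly draws samples of 100, and compares the mean and spread of the sample means with µ and σ/√n. The exponential shape, the seeds, and the number of trials are all assumptions chosen for illustration.

```python
import random
import statistics

def sample_means(population, n, trials, seed=0):
    """Return the mean of each of `trials` simple random samples of size n."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n)) for _ in range(trials)]

# Hypothetical skewed population of 8,734 "incomes" (exponential, mean near 27,000).
pop_rng = random.Random(1)
population = [pop_rng.expovariate(1 / 27000) for _ in range(8734)]

mu = statistics.fmean(population)
sigma = statistics.pstdev(population)
means = sample_means(population, n=100, trials=2000)

print("population mean:", round(mu), " mean of sample means:", round(statistics.fmean(means)))
print("sigma / sqrt(n):", round(sigma / 100 ** 0.5), " std dev of sample means:", round(statistics.stdev(means)))
```

Even though the population is far from normal, the two printed pairs should agree closely, which is the behavior the theorem describes.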

Prediction

Our prediction is that there will be a positive linear correlation between annual income and the dollar amount by which parking prices must rise before a respondent decides to ride the Sprinter. We feel that as respondents' income goes up, they are less likely to commute by Sprinter if prices go up by a given amount X. This prediction is based on the idea that individuals with a higher annual income are less likely to feel the financial strain of a rising parking permit price and are therefore less likely to switch to the Sprinter (San Diego County Rail System).

Methodology

Research Procedure and Outcomes:

• Our team used a survey of General and Qualitative questions

We chose a survey because it was the best way to gather a variety of information from our sample without spending a large amount of time. A survey is both timely and cost effective: our team spent roughly $10 on printing and around 2-4 labor hours administering it. Compared with personal interviews or collecting data by phone, a survey is a cheaper and faster way of gathering information. Given the time and money our team had, there was no other way we could have gathered data from 100 respondents.

• Our team chose a sample of 100 to best represent the population

Our team chose a sample of 100 individuals because it is large enough to capture many diverse outcomes. Our sample represents about one percent of the population (100 of 8,734 students). This is a good sample size because sampling error can be controlled: the larger the sample, the better it represents the population and the lower the sampling error.

• Our team used Simple Random Sampling Techniques


This method was used in order to reduce potential bias. We wanted the sample to reflect the population as closely as possible, and we felt this was best achieved through simple random sampling, in which the surveys were administered randomly, with no reference to subgroups or clusters.

• Our team split up the surveys administered

By having each team member administer 25 surveys, we were able to limit the effect of any bias an individual researcher had. For instance, one researcher may feel intimidated by older people, making that researcher less likely to hand a survey to someone in their 30s or 40s. Dividing up the surveys reduced this effect. In addition, because the surveys were divided up, the data were not collected in a single place. Collecting data from a variety of places helps ensure a sample that is more representative of the population.

• When the data were compiled, our team entered them into an Excel database.

Having the data in an Excel spreadsheet made it easier for our team to compute key measures such as linear correlation, regression, mean, variance, and standard deviation, and to run hypothesis tests.

• Our team then proceeded to conduct a Correlation analysis, in which a Scatter Plot was used to check whether the variables in our Hypothesis were correlated.

The Scatter Plot was used in our research to examine the correlation because it makes it very easy to see whether or not two variables are correlated with one another.

• Dealing with unfinished surveys

Our team did not have a real issue with unfinished surveys. All 100 were administered and filled out. Although a couple of questions went unanswered, we were still able to conduct a proper analysis. The unanswered questions actually worked to our benefit, because the population data we used also contained no-response categories. The no-response answers in our sample were therefore useful in replicating the statistics of our population.

Result

This chart shows Population Ethnicity. The Pie Chart is broken down into subcategories that measure percentage of population. These subcategories include: Caucasian, African American, Asian/Pacific Islander, Latino, Native American, Other, and Non Resident or No Response


This chart shows Sample Ethnicity. The Pie Chart is broken down into subcategories that measure percentage of our generated sample. These subcategories include: Caucasian, African American, Asian/Pacific Islander, Latino, Native American, Other, and Non Resident or No Response

This chart shows the percentage of gender distribution within the sample. The subcategories are broken up into males and females. The percentage is out of a sample of 100.

This chart shows the percentage of gender distribution within the population. The subcategories are broken up into males and females. The percentage is out of a population of 8,734 students.


This chart shows the percentage of age distribution within the sample. The subcategories are broken up into 22 or younger, 23-25, 26-35, and 36 or older. The percentage is out of a sample of 100.

This chart shows Population Age Distribution. The Pie Chart is broken down into subcategories that measure percentage of our population. These subcategories include: 22 or younger, 23-25, 26-35, and 36 or older.

As the charts show, we were able to generate a sample whose gender distribution is proportionate to the population's gender distribution. Both the gender distribution and the ethnicity distribution were representative of the entire population. The age distribution of our sample also proved to be representative of our population, further reducing the error in our sample analysis and research. Now that we have a sample that is representative of the population, we can proceed to analyze the hypothesis.

Sample Mean and Variance

Numerical Analysis:

SAMPLE                                              Mean        Variance
Age                                                 23          10
Units at CSUSM                                      11.43       10.85363636
Annual Income                                       27058       160073773
Price increase on Parking to Influence Decision     141.06      8850.339
Days at School                                      2.16        0.499
Hours Work                                          36.52       48.717
Commute Miles                                       12.82       103.68

This chart shows the breakdown of our sample's numerical data. The questions that produced numerical or ordinal data were compiled in an Excel spreadsheet, and both the mean and the variance were computed from the compiled data. The average age in our sample was 23 years. The sample variance is found by summing the squared distances from the mean and dividing by the sample size minus one; in our case the sample variance of age was 10. The average number of units taken at CSUSM was 11.43, with a variance of 10.85. Average annual income was around $27,000, and the average parking price increase that respondents said would influence their decision to ride the Sprinter was 141 dollars. On average, these students spent a little over 2 days at school and worked about 36.5 hours at their job, and they commuted an average of 12-13 miles per trip to CSUSM.
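The mean and sample variance described above can be sketched in a few lines. The eight ages below are made up for illustration and are not our survey responses; the point is only the n - 1 divisor in the variance.

```python
import statistics

def sample_variance(values):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

ages = [19, 21, 22, 23, 23, 24, 25, 27]  # hypothetical ages, not the survey data

mean_age = sum(ages) / len(ages)
var_age = sample_variance(ages)
print(mean_age, var_age)  # prints 23.0 6.0
```

The hand-written function gives the same result as `statistics.variance`, which also uses the n - 1 divisor.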

Research Variables

Our data show a positive linear correlation between annual income and the dollar increase in parking price needed to influence a respondent to travel by Sprinter. As income increases, the dollar amount it takes to change a person’s decision to commute by Sprinter also increases. Therefore, the higher the income, the less likely an individual is to ride the Sprinter if parking prices rise. For example, a person with an annual income of 20,000 dollars may decide to ride the Sprinter if parking prices at CSUSM rise another 100 dollars, whereas an individual who makes 60,000 dollars annually may not decide to take the Sprinter until the price of parking rises 300 dollars. On average, then, an individual who makes less is inclined to commute by the new Sprinter at a smaller increase in parking prices. Our team feels this is because those who make less feel the financial strain of a rising parking price more than a person who makes substantially more money annually, and are therefore more inclined to escape that strain by substituting the Sprinter for driving.

Correlation coefficient=.803154

Covariance= 946398.5

After creating a scatter plot to show the relationship between the two variables, our team computed the correlation coefficient of variables X and Y. The correlation coefficient between annual income and the dollar increase in parking price needed to influence commuting by Sprinter was relatively high, which surprised our team: it was .803154. The correlation coefficient measures the strength of the linear relationship between two variables. It ranges from -1.0 to +1.0. A correlation of +1.0 indicates a perfect positive linear relationship, whereas a correlation of 0.0 indicates no linear relationship. A correlation coefficient of .803154 indicates that our variables show a strong positive linear correlation.

Covariance matrix (Excel output):

             Column 1     Column 2
Column 1     1.58E+08
Column 2     946398.5     8761.836

Correlation matrix (Excel output):

             Column 1     Column 2
Column 1     1
Column 2     0.803154     1

The correlation coefficient was calculated using the equation

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

where $\rho$ is the Correlation Coefficient, $\sigma_{xy}$ is the Covariance between variables X and Y, $\sigma_x$ is the standard deviation of variable X, and lastly, $\sigma_y$ is the standard deviation of variable Y.

Our Equation for Correlation Coefficient:

$$\rho = \frac{946398.5}{1178353} = .803154$$
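The arithmetic behind this equation can be re-traced from figures already reported. The sketch below takes the covariance of 946398.5 and converts the two sample variances from the mean and variance table to the n-divisor form that matches the 1.58E+08 and 8761.836 entries in the covariance matrix. That conversion is our reading of the spreadsheet output and should be treated as an assumption.

```python
import math

n = 100
cov_xy = 946398.5                    # covariance between income and price increase
sample_var_income = 160073773        # from the mean/variance table (n - 1 divisor)
sample_var_price = 8850.339          # from the mean/variance table (n - 1 divisor)

# Assumption: the covariance matrix divides by n, so rescale the sample variances.
var_income = sample_var_income * (n - 1) / n   # about 1.58E+08
var_price = sample_var_price * (n - 1) / n     # about 8761.836

denominator = math.sqrt(var_income) * math.sqrt(var_price)
rho = cov_xy / denominator

print(round(denominator), round(rho, 6))
```

The denominator comes out near 1178353 and the ratio near .803154, in line with the figures in our equation.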

The covariance is denoted $\sigma_{xy}$. “The covariance can be positive or negative depending on whether the two variables move together in the same or opposite directions. If the covariance is zero or close to zero, this implies that the two variables do not move closely together” (Pearson, 164). In our case the covariance was positive (946398.5), indicating that the two variables move in the same direction. Because the size of the covariance depends on the units of the variables, the strength of the relationship is judged from the correlation coefficient rather than from the covariance alone.

Regression Analysis

SUMMARY OUTPUT

Regression Statistics

Multiple R            0.80315394
R Square              0.64505625
Adjusted R Square     0.64143437
Standard Error        7576.07767
Observations          100

ANOVA

              df     SS              MS           F            Significance F
Regression    1      10222402220     1.02E+10     178.1001     9.03398E-24
Residual      98     5624901380      57396953
Total         99     15847303600

                Coefficients     Standard Error     t Stat       P-value      Lower 95%       Upper 95%
Intercept       11821.585        1370.19735         8.627651     1.14E-13     9102.473059     14540.69701
X Variable 1    108.013717       8.093695259        13.34541     9.03E-24     91.95204289     124.0753917

The spreadsheet output above shows the regression analysis for our sample data. In a regression analysis we are trying to describe the relationship between variables X and Y, which in this case are annual income and the dollar amount the price of parking must rise in order for someone to ride the Sprinter. One thing the regression analysis is useful for is determining the estimated regression equation. The equation is

$$\hat{y} = b_0 + b_1 x$$

where $b_0$ is the unbiased estimate of the regression intercept and $b_1$ is the unbiased estimate of the regression slope. In our case the estimated equation of the regression line between variables X and Y is

$$\hat{y} = 11821.585 + 108.013717\,x$$

Note that the intercept and the sums of squares in the output are in units of annual income (the Total sum of squares, 15,847,303,600, is 100 times the income variance of 1.58E+08 in the covariance matrix). This indicates that in this Excel run annual income was entered as the Y range and the price increase as the X range, so $\hat{y}$ in the equation above is estimated annual income and $x$ is the price increase.

The Multiple R value in the regression output is the correlation coefficient. This value is .803, which signifies that variables X and Y have a high positive correlation. The R Square value is Multiple R squared. This value is .64505, and it is obtained from

$$r^2 = \frac{SSR}{SST}$$

This value is the portion of the total variation in the dependent variable that is explained by its relationship with the independent variable. It means that 64.5% of the variation in the dollar increase in the price of parking needed to influence a customer's decision to commute by Sprinter can be explained by its linear relationship with annual income (with a single predictor, R Square is the same whichever of the two variables is treated as dependent).

The column labeled df gives the degrees of freedom. In our sample the residual degrees of freedom are 98: because we have a sample of 100 and the degrees of freedom are n-2, the value is 100-2=98. The column labeled SS gives the Sum of Squares Regression (SSR), the Sum of Squares Residual (SSE), and the Sum of Squares Total (SST). The Sum of Squares Regression and Sum of Squares Residual are 10,222,402,220 and 5,624,901,380 respectively. SSR represents the portion of the total sum of squares that is explained by the regression line, and SSE represents the portion of the total sum of squares in the dependent variable that is not explained by the least squares regression line. Adding these two numbers gives the SST, or Sum of Squares Total, of 15,847,303,600.
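The pieces of the regression output fit together in a way that can be checked from the reported summary figures alone. The sketch below is such a check, not a rerun of the Excel analysis. One assumption is needed to reproduce the coefficients: annual income is treated as y and the price increase as x. That reading is inferred from the size of the intercept and of SST and is not stated in the spreadsheet.

```python
# Reported figures from the Excel output and the mean/variance table.
ssr = 10_222_402_220        # Sum of Squares Regression
sse = 5_624_901_380         # Sum of Squares Residual
n = 100

cov_xy = 946398.5           # covariance of income and price increase
var_price = 8761.836        # variance of price increase (covariance matrix)
mean_income = 27058
mean_price = 141.06

sst = ssr + sse                          # total variation
r_squared = ssr / sst                    # share of variation explained
f_stat = (ssr / 1) / (sse / (n - 2))     # MS regression / MS residual

# Assumption: y = annual income, x = price increase.
slope = cov_xy / var_price
intercept = mean_income - slope * mean_price

print(sst, round(r_squared, 5), round(f_stat, 2))
print(round(slope, 3), round(intercept, 1))
```

Under that assumption the slope and intercept land on the reported 108.01 and 11821.6, and R Square and F match the output.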


The Standard Error in the Regression Statistics (7576.08) is the standard error of the estimate. It measures the dispersion of the dependent variable about the regression line at each value of the independent variable, in the original units of the dependent variable. The standard error of the regression slope is 8.09. A relatively low value here indicates that the slope estimate is less variable from sample to sample. The Lower 95% and Upper 95% values give the 95% confidence interval for each regression coefficient rather than for the X and Y variables themselves. With 95% confidence the intercept lies between 9,102 and 14,541, and the slope lies between 91.95 and 124.08.

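Hypothesis Testing (Correlation, Significance of Regression Slope)

Hypothesis:

$$H_O: \rho = 0 \quad \text{(No Correlation)}$$

$$H_A: \rho \neq 0$$

$$\alpha = 0.05$$

D.F = n-2 = (100-2) = 98

Reject Region: $\alpha/2 = 0.025$ in each tail

$-t_{0.025} = -1.9840$, $t_{0.025} = 1.9840$

Decision Rule

If $t > t_{\alpha/2} = 1.9840$, reject $H_O$. If $t < -t_{\alpha/2} = -1.9840$, reject $H_O$. Otherwise, do not reject $H_O$.

Calculated T-Value is

$$t = \frac{r}{\sqrt{\dfrac{1-r^2}{n-2}}} = \frac{.80315394}{\sqrt{\dfrac{.35494375}{98}}} = 13.34$$

Because 13.34 > 1.9840, reject $H_O$.

Based on the sample evidence, we conclude there is a significant positive linear relationship between annual income and the dollar amount the price of parking must increase to influence riding the Sprinter.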
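The test statistic for the correlation can be recomputed directly from r and n. This is a minimal sketch using the reported correlation and the critical value of 1.9840 quoted above; the critical value is taken as given rather than looked up in code.

```python
import math

r = 0.80315394        # sample correlation coefficient
n = 100               # sample size
t_critical = 1.9840   # two-tailed critical value used above (alpha = 0.05, 98 d.f.)

t_stat = r / math.sqrt((1 - r ** 2) / (n - 2))
reject_null = abs(t_stat) > t_critical

print(round(t_stat, 2), reject_null)
```

The statistic comes out a little above 13.3, far beyond the critical value, so the null hypothesis of no correlation is rejected.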

Significance Test of Regression Slope

Hypothesis:

$$H_O: \beta_1 = 0$$

$$H_A: \beta_1 \neq 0$$

$$\alpha = .05$$

D.F = n-2 = (100-2) = 98

If $t > t_{\alpha/2} = 1.9840$, reject $H_O$. If $t < -t_{\alpha/2} = -1.9840$, reject $H_O$. Otherwise, do not reject $H_O$.

The calculated t is:

$$t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{108.01 - 0}{8.09} = 13.35$$

Because 13.35 > 1.9840, we should reject the null hypothesis and conclude the true slope is not zero. Thus, the simple linear model relating annual income to the dollar amount the price of parking must increase to influence an individual to ride the Sprinter is useful in explaining the variation in the dependent variable.
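The slope test can be re-traced from the coefficient table. The sketch below uses the unrounded slope and standard error from the Excel output and also checks a property of simple linear regression: the square of the slope's t statistic equals the F statistic in the ANOVA table.

```python
b1 = 108.013717        # estimated slope (X Variable 1 coefficient)
s_b1 = 8.093695259     # standard error of the slope
beta1_null = 0         # slope under the null hypothesis
t_critical = 1.9840    # two-tailed critical value used above
f_reported = 178.1001  # F statistic from the ANOVA table

t_slope = (b1 - beta1_null) / s_b1
reject_null = abs(t_slope) > t_critical

print(round(t_slope, 2), round(t_slope ** 2, 1), reject_null)
```

With the unrounded inputs the statistic is about 13.35, matching the t Stat of 13.34541 in the output, and its square agrees with the reported F.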

Estimation of Population Mean (Annual Income)

With 95% confidence

$$\bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}} = 27058 \pm 1.96\,\frac{12652.03}{10} = (24{,}578.20,\ 29{,}537.80)$$


With 90% confidence

$$\bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}} = 27058 \pm 1.645\,\frac{12652.03}{10} = (24{,}976.74,\ 29{,}139.26)$$

Estimation of Population Mean (Dollar Value Price of Parking must increase to influence decision to take Sprinter)

With 95% confidence

$$\bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}} = 141.06 \pm 1.96\,\frac{94.076}{10} = (122.62,\ 159.50)$$

With 90% confidence

$$\bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}} = 141.06 \pm 1.645\,\frac{94.076}{10} = (125.58,\ 156.54)$$

The values above give the range in which each population mean is estimated to fall, at both the 90% and 95% confidence levels. These ranges tell us where the mean would fall based on the statistics from our compiled sample data.
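The interval arithmetic can be reproduced with a small helper. The sketch below uses the sample means and standard deviations reported above, together with the critical values 1.96 and 1.645 that appear in our calculations. These are the large-sample normal values; using t values for 99 degrees of freedom instead would make each interval slightly wider. The helper name `mean_interval` is our own and is only for illustration.

```python
import math

def mean_interval(mean, sample_sd, n, critical):
    """Return (lower, upper) for mean +/- critical * sample_sd / sqrt(n)."""
    margin = critical * sample_sd / math.sqrt(n)
    return (mean - margin, mean + margin)

n = 100
income_95 = mean_interval(27058, 12652.03, n, 1.96)
income_90 = mean_interval(27058, 12652.03, n, 1.645)
price_95 = mean_interval(141.06, 94.076, n, 1.96)
price_90 = mean_interval(141.06, 94.076, n, 1.645)

for label, (low, high) in [("income 95%", income_95), ("income 90%", income_90),
                           ("price 95%", price_95), ("price 90%", price_90)]:
    print(f"{label}: ({low:,.2f}, {high:,.2f})")
```

The 90% intervals are narrower than the 95% intervals because less confidence is demanded of them.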

Conclusion

Our research team conducted a statistical analysis of a population by analyzing data from a sample of 100 individuals and applying the results to the population from which the sample came. We were able to compute both the mean and the variance of the sample variables. These statistical measures proved very useful when relating two of the variables to each other. Our research team found the best research topic to be whether there is a positive linear correlation between an individual's annual income and the dollar amount the price of parking must increase in order to influence the decision to ride the Sprinter. Our team set out to conduct both a correlation analysis and a regression analysis. We found that there was in fact a positive linear correlation between these two variables. The regression analysis also supported the conclusion that the relationship between the two variables is strong. After the correlation and regression analyses were conducted, our team was able to estimate the population means with both 90% and 95% confidence. These intervals give us ranges in which the population means of these variables are expected to fall.

How might this be useful?

One might ask, “How might this be useful?” The truth is that our research findings have a great deal of practical significance. One group that might find our research useful is the Parking and Transit center at Cal State San Marcos. By reviewing our data, they would be able to tell the average maximum amount most CSUSM students would pay for a parking pass before turning to other transportation services. They may also use our average annual income figure to inform their finances. Most importantly, the parking and transit offices at CSUSM would be able to maximize their revenues by charging the largest amount possible before actually losing customers. This would help create price stability. It may also prove beneficial to the new parking structure that is set to be built at CSUSM within the next few years. Conducting a sample analysis of the school helps researchers get an idea of what people can afford and, in our case, what they are willing to pay.

Works Cited

Faculty. “QUICKSTAT: Cougar Review.” 2006-2007. Accessed April 26, 2008. <www.csusm.edu/ipa>

Groebner, David F. “Business Statistics: Fourth Edition.” 2007. McGraw Hill, New York.

Research. “CalStateSanMarcos. Student Statistics.” April 28, 2008. <http://sanmarcoscalifornia.stateuniversity.com/>

Sample Research Data Analysis - Sample Survey, 100 individuals, conducted April 10-27, 2008.
