Team Final Project
Business 304
Business Statistics
Instructor: Fang Fang
Class: M, W 11:30-12:50
Group:
Cody Britton
Gregory Ortiz
Stephano Bonham
Carlos Fierro
TITLE: “What will it take for you to ride the ‘Sprinter’?”
Team Statistical Analysis
Background Demographics-CSUSM
In order to conduct research and statistical analysis on a particular topic, one must
first understand the population from which the sample data is drawn. San Diego County
is one of the fastest growing regions in California. Within its borders, San Diego
County has many well-known and renowned colleges, one of which is California State
University San Marcos (CSUSM). CSUSM is a densely populated and diverse university.
The school is home to 8,734 students, 5,000 of whom are full-time students
(QuickStats). Of this population, 49.88% are 22 years or younger, 21.90% are ages
23-25, 18.92% are ages 26-35, and 9.30% are 36 or older (QuickStats). Aside from age
demographics, CSUSM can also be broken down by gender. According to QuickStats for
2007, Cal State San Marcos enrolled 3,256 male students and 5,478 female students,
which put its gender breakdown at 37.3% and 62.7% respectively (QuickStats).
California State University San Marcos remains one of the largest destination
schools. People commute from every direction. The school enrolls students from as far
south as southern San Diego and as far north as San Bernardino and Los Angeles
counties. As a commuter school, CSUSM is diverse in its ethnic breakdown. Roughly 3.3%
of CSUSM is African American, 11.2% is Asian/Pacific Islander, 21.1% is Latino, 1% is
Native American, 50% is white, and 13.4% is other (QuickStats).
Introduction
Test Hypothesis:
“Those who earn a lower annual income are more inclined to ride the
Sprinter if the price for parking rises”
Our survey was crucial to our statistical analysis. The questions we asked our
sample proved very useful in supporting our research question. We began by asking
respondents their gender and age. For a sample to be accurate, its demographics must
closely resemble the population demographics. Gender and age also matter to our
research because those who are younger or female may feel less safe using public
transportation than those who are male or older. Age is also critical because older
individuals are often more financially stable and therefore do not feel the financial
strain of paying for parking as much as a younger individual with a lower income
would. Most importantly, with a sample that accurately represents the population
(CSUSM demographics), our research will have a more accurate and definite meaning
than it would with a sample that did not resemble the population from which it came.
The next questions, which were a key part of our research, asked about ethnicity
and level of education. Again, having a sample that closely resembles the population
of CSUSM is key to our research. A sample that resembles both the ethnicity and the
level of education of the college will have a smaller sampling error and fewer
distortions in the results we are trying to establish. Some of the general questions
we asked were about respondents' job, pay, hours worked, hours at school, and whether
or not they pay for their own school and parking. These questions are all very
important to our research because a person who spends more hours at school may work
less and therefore be less able to pay for school and parking than a person who works
more and is paid more. These are all key points in our statistical data and are
useful in testing our research hypothesis.
Other general questions were asked in order to view the commuting patterns and
transportation preferences of the individuals. These questions are also important
because someone who does not know of the Sprinter or has never ridden it is less
likely to think the Sprinter is a good idea, or to consider riding it if parking
prices went up. Asking respondents whether they use other forms of transportation
allows us, the researchers, to see whether that usage is a response to the high cost
of parking.
Plan
The plan of our research method is to survey a sample of 100 students who attend
California State University San Marcos. This sample will be used to test our research
hypothesis. The survey will be composed of different questions that gather
information on the students' qualitative and general characteristics. Each member of
our research team will administer 25 surveys to 25 random students who attend CSUSM.
Because we have four researchers, the total number of administered surveys will be
100. This helps keep the sample random and ensures that no one person is responsible
for collecting the data, which reduces the chance that the data is subject to a
single collector's bias. The survey will not be distributed using stratified random
sampling, in which subgroups are identified. Our team feels that in order to get the
best representation of the population in the sample, the survey must be handed out
using Simple Random Sampling, in which every possible sample of a specific size has
an equal chance of being selected.
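To make the definition of simple random sampling concrete, here is a minimal Python sketch. It assumes a hypothetical roster with one ID per enrolled student (8,734 IDs, matching the QuickStats enrollment figure); the roster, the function name, and the seed are illustrative assumptions, not part of the team's actual procedure, which relied on handing out surveys on campus.

```python
import random


def draw_simple_random_sample(population_ids, sample_size, seed=None):
    """Draw sample_size distinct IDs so that every subset of that size is equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no student is picked twice.
    return rng.sample(population_ids, sample_size)


# Hypothetical roster: IDs 1..8734 stand in for the enrolled students.
roster = list(range(1, 8735))
sample = draw_simple_random_sample(roster, 100, seed=304)
print(len(sample), "students selected")
```

Drawing from a complete roster is what guarantees the equal-chance property; a survey handed out in person only approximates it.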
The theorem on which we hope to base our research is the Central Limit Theorem
(CLT). This theorem states that “for simple random samples of n observations taken
from a population with mean µ and standard deviation σ, regardless of the
population’s distribution, provided the sample size is sufficiently large, the
distribution of the sample means will be approximately normal with a mean equal to the
population mean and a standard deviation equal to the population standard deviation
divided by the square root of the sample size. The larger the sample size, the better
the approximation to the normal distribution” (Pearson, 647).
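The theorem can be checked with a small simulation. The sketch below is an illustration only: it assumes a skewed (exponential) population with mean 1 and standard deviation 1, which is not the survey data, and draws 2,000 samples of size 100. The standard deviation of the sample means should then be close to σ/√n = 0.1.

```python
import random
import statistics


def simulate_sample_means(sample_size, num_samples, seed=0):
    """Return the means of num_samples samples drawn from an exponential population."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        # expovariate(1.0) has population mean 1 and standard deviation 1.
        draws = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(draws))
    return means


means = simulate_sample_means(sample_size=100, num_samples=2000)
mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
expected_sd = 1.0 / 100 ** 0.5  # sigma / sqrt(n)
print(round(mean_of_means, 3), round(sd_of_means, 3), expected_sd)
```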
Prediction
Our prediction is that there will be a positive linear correlation between annual
income and the dollar amount parking prices must rise before a respondent decides to
ride the Sprinter. We feel that as respondents' income goes up, they are less likely
to commute by Sprinter for a given price increase of X dollars. This prediction is
based on the idea that individuals who have a higher annual income are less likely to
feel the financial strain and burden of a rising parking permit price and therefore
will be less inclined to ride the Sprinter (the San Diego County rail system).
Methodology
Research Procedure and Outcomes:
• Our team used a survey of General and Qualitative questions
We chose a survey because it was the best way to gather a variety of information
from our sample without spending a large amount of time. A survey is both timely and
cost effective. Our team spent roughly $10 on printing and around 2-4 labor hours
administering the survey. Compared to personal interviews or gathering data by phone,
a survey is a cost-effective and timely way of gathering information. Given the time
and money our team had, there was no other practical way to gather data from 100
respondents.
• Our team chose a sample of 100 to best represent the population
Our team chose a sample of 100 individuals because it is large enough to obtain many
diverse outcomes. Our sample represents about one percent of the population. This is a
good sample size because the larger the sample, the better it represents the
population and the lower the sampling error.
• Our team used Simple Random Sampling Techniques
This method was used in order to reduce potential bias. We wanted the sample to
reflect the population as closely as possible, and we felt this was best achieved
through the Simple Random Sampling technique, in which the surveys were administered
randomly with no reference to subgroups or clusters.
• Our team split up the surveys administered
By having each team member administer 25 surveys, we were able to limit the effect of
any one individual's bias. For instance, one researcher may feel intimidated by older
people, making him less likely to hand a survey to someone in their 30s or 40s.
Dividing up the surveys reduced this effect. In addition, dividing up the surveys
meant the data was not collected in a single place. Collecting data from a variety of
places helps ensure that the sample is more representative of the population.
• When the data was compiled, our team entered it into an Excel spreadsheet.
Having the data in an Excel spreadsheet made it easier for our team to compute key
measures such as Linear Correlation, Regression, Mean, Variance, and Standard
Deviation, and to run the Hypothesis Tests.
• Our team then proceeded to conduct a Correlation analysis, in which a
Scatter Plot was used to see whether the variables in our Hypothesis were correlated.
The Scatter Plot was used in our research to examine the correlation because it makes
it very easy to see whether or not two variables are correlated with one another.
• Dealing with unfinished surveys
Honestly, our team did not have a real issue with unfinished surveys. All 100 were
administered and filled out. Apart from a couple of unanswered questions, we were
still able to conduct a proper analysis. The unanswered questions actually worked to
our benefit because the population data we used also contained no-response
categories. The no-response answers in our sample were therefore useful in mirroring
the statistics of our population.
Result
This chart shows Population Ethnicity. The Pie Chart is broken down into subcategories that measure percentage of population. These subcategories include: Caucasian, African American, Asian/Pacific Islander, Latino, Native American, Other, and Non Resident or No Response
This chart shows Sample Ethnicity. The Pie Chart is broken down into subcategories that measure percentage of our generated sample. These subcategories include: Caucasian, African American, Asian/Pacific Islander, Latino, Native American, Other, and Non Resident or No Response
This chart shows the percentage of gender distribution within the sample. The subcategories are broken up into males and females. The percentage is out of a sample of 100.
This chart shows the percentage of gender distribution within the population. The subcategories are broken up into males and females. The percentage is out of a population of 8,734 students.
This chart shows the percentage of age distribution within the sample. The subcategories are broken up into 22 or younger, 23‐25, 26‐35, and 36 or older. The percentage is out of a sample of 100.
This chart shows Population Age Distribution. The Pie Chart is broken down into subcategories that measure percentage of our population. These subcategories include: 22 or younger, 23‐25, 26‐35, and 36 or older.
As you can see from our generated sample and the fixed population, we were able to
generate a sample gender distribution that is proportionate to the population gender
distribution. Both the gender distribution and the ethnicity distribution were
representative of the entire population. Now that we have a sample that is
representative of the population, we can proceed to analyze the hypothesis. The age
distribution of our sample also proved to be representative of our population, which
further reduces the error in our sample analysis and research.
Sample Mean and Variance
Numerical Analysis:
SAMPLE                                             Mean      Variance
Age                                                23        10
Units at CSUSM                                     11.43     10.85363636
Annual Income                                      27058     160073773
Price increase on Parking to Influence Decision    141.06    8850.339
Days at School                                     2.16      0.499
Hours Work                                         36.52     48.717
Commute Miles                                      12.82     103.68
This table shows the breakdown of our sample's numerical data. The questions that
had either numerical or ordinal answers were compiled in an Excel spreadsheet, and
both the mean and variance were computed from the compiled data. The average age in
our sample was 23 years. The sample variance is found by summing the squared
deviations from the mean and dividing by the sample size minus one; for age, the
sample variance was 10. The average number of units taken at CSUSM was 11.43, with a
variance of 10.85. Average annual income was around $27,000, and the average price
increase on parking at CSUSM that respondents said would influence their decision to
ride the Sprinter was about 141 dollars. These students spent an average of just over
2 days at school and worked about 36.5 hours at their job. On average, our
respondents commuted 12-13 miles per trip to CSUSM.
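The mean and sample variance calculations can be sketched in a few lines of Python. The five ages below are hypothetical values chosen for illustration; they are not the survey responses, which are not reproduced in this report.

```python
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


ages = [21, 22, 23, 24, 25]  # hypothetical ages, for illustration only
mean_age = statistics.fmean(ages)
var_age = sample_variance(ages)
print(mean_age, var_age)
```

The hand-written function agrees with `statistics.variance`, which also uses the n - 1 divisor.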
Research Variables
The scatter plot shows a positive linear correlation between annual income and the
dollar amount parking must increase to influence traveling by Sprinter. As income
increases, the dollar amount it takes to change a person’s decision to commute by
Sprinter increases. Therefore, the higher the income, the less likely an individual
is to ride the Sprinter if parking prices rise. For example, a person with an annual
income of $20,000 may decide to ride the Sprinter if parking prices at CSUSM rise
another 100 dollars, whereas an individual who makes $60,000 annually may decide to
take the Sprinter only if the price of parking rises 300 dollars. On average, an
individual who makes less is inclined to commute by the new Sprinter after a smaller
increase in parking prices. Our team feels this is because those who make less feel
the pressure and financial strain of a rising parking price more than a person who
makes substantially more money annually, and are therefore more inclined to escape
that strain by switching to a substitute: commuting by Sprinter.
Correlation coefficient=.803154
Covariance= 946398.5
Covariance matrix (Column 1 = annual income, Column 2 = parking price increase):
            Column 1    Column 2
Column 1    1.58E+08
Column 2    946398.5    8761.836
Correlation matrix:
            Column 1    Column 2
Column 1    1
Column 2    0.803154    1
After creating a scatter plot to show the relationship between the two variables,
our team wanted to find the correlation coefficient of variables X and Y. The
correlation coefficient between annual income and the dollar amount parking must
increase to influence commuting by Sprinter was relatively high at .803154, which
surprised our team. The correlation coefficient measures the strength of the linear
relationship between two variables. It ranges from -1.0 to +1.0. A correlation of
+1.0 indicates a perfect positive linear relationship, -1.0 indicates a perfect
negative linear relationship, and 0.0 indicates no linear relationship. A correlation
coefficient of .803154 indicates that our variables have a strong positive linear
correlation.
The correlation coefficient was calculated using the equation
$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$, where $\rho$ is the Correlation
Coefficient, $\sigma_{xy}$ is the Covariance between variables X and Y, $\sigma_x$ is
the standard deviation of variable X, and lastly, $\sigma_y$ is the standard
deviation of variable Y.
Our Equation for Correlation Coefficient: $\rho = \frac{946398.5}{1178353} = .803154$
The covariance is denoted $\sigma_{xy}$. “The covariance can be positive or negative
depending on whether the two variables move together in the same or opposite
directions. If the covariance is zero or close to zero, this implies that the two
variables do not move closely together” (Pearson, 164). In our case the covariance
was large and positive (946398.5), indicating that the variables move in the same
direction. Because the size of the covariance depends on the units of the variables,
the strength of the relationship is judged from the correlation coefficient rather
than from the covariance itself.
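The correlation coefficient can be reproduced from the reported figures with a short Python sketch. One assumption is made explicit here: the covariance matrix appears to use the population divisor n rather than n - 1, so the sample variances from the Sample Mean and Variance table are rescaled by (n - 1)/n before use. Under that assumption the rescaled price variance matches the 8761.836 shown in the matrix.

```python
import math

n = 100
sample_var_income = 160073773.0  # Annual Income variance from the sample table
sample_var_price = 8850.339      # parking price increase variance from the sample table
covariance = 946398.5            # covariance reported in the matrix

# Assumption: the matrix values use divisor n, so convert the sample variances.
pop_var_income = sample_var_income * (n - 1) / n
pop_var_price = sample_var_price * (n - 1) / n

# rho = covariance / (sigma_x * sigma_y)
rho = covariance / (math.sqrt(pop_var_income) * math.sqrt(pop_var_price))
print(round(pop_var_price, 3), round(rho, 6))
```

Because the correlation divides out the units, it would be the same whether income were measured in dollars or thousands of dollars, while the covariance would change by a factor of 1,000.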
Regression Analysis
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.80315394
R Square             0.64505625
Adjusted R Square    0.64143437
Standard Error       7576.07767
Observations         100

ANOVA
              df    SS             MS          F           Significance F
Regression    1     10222402220    1.02E+10    178.1001    9.03398E-24
Residual      98    5624901380     57396953
Total         99    15847303600

              Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept     11821.585      1370.19735       8.627651   1.14E-13   9102.473059   14540.69701
X Variable 1  108.013717     8.093695259      13.34541   9.03E-24   91.95204289   124.0753917
In the spreadsheet output above, one can view the regression analysis for our sample
data. In a regression analysis we are trying to describe the relationship between
variables X and Y, which in this case are annual income and the dollar amount the
price of parking must rise in order for someone to ride the Sprinter. One thing the
regression analysis is useful for is determining the estimated regression equation.
The equation is $\hat{y} = b_0 + b_1 x$, where $b_0$ is the unbiased estimate of the
regression intercept and $b_1$ is the unbiased estimate of the regression slope. In
our case the estimated equation of the regression line is
$\hat{y} = 11821.585 + 108.013717x$. Note that these coefficients are in dollars of
annual income, which indicates that the spreadsheet was run with annual income as Y
and the parking price increase as X. The Multiple R value in the regression output
represents the correlation coefficient. This value is .803, which signifies that X
and Y have a strong positive correlation. The R Square value is the Multiple R
squared, .64505. It is obtained as $r^2 = \frac{SSR}{SST}$ and is the portion of the
total variation in the dependent variable that is explained by its relationship with
the independent variable. Here it means that 64.5% of the variation in the dollar
increase in the price of parking needed to influence a customer's decision to commute
by Sprinter can be explained by its linear relationship with annual income.
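The estimated regression equation and the R Square value can be checked with a short Python sketch using only numbers from the summary output. One reading of the output is assumed and should be treated as such: x is taken to be the parking price increase and the fitted value to be annual income, because plugging in the sample mean price increase (141.06) returns roughly the sample mean income (27,058).

```python
B0 = 11821.585    # intercept from the regression output
B1 = 108.013717   # slope from the regression output


def predict(x):
    """Fitted value from the estimated regression line y-hat = b0 + b1 * x."""
    return B0 + B1 * x


SSR = 10222402220  # Sum of Squares Regression
SST = 15847303600  # Sum of Squares Total
r_squared = SSR / SST

# Assumption: x is the parking price increase; 141.06 is its sample mean.
fitted_at_mean_x = predict(141.06)
print(round(r_squared, 5), round(fitted_at_mean_x, 2))
```

A least squares line always passes through the point of means, which is why the fitted value at the mean of x lands on the mean of the other variable.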
The column labeled df gives the degrees of freedom. In our case the residual degrees
of freedom is 98: because we have a sample of 100 and the degrees of freedom is n-2,
we get 100-2 = 98. The column labeled SS gives the Sum of Squares Regression (SSR),
the Sum of Squares Residual (SSE), and the Sum of Squares Total (SST). The Sum of
Squares Regression and the Sum of Squares Residual are 10,222,402,220 and
5,624,901,380 respectively. SSR represents the portion of the total sum of squares
that is explained by the regression line, and SSE represents the amount of the total
sum of squares in the dependent variable not explained by the least squares
regression line. Adding these two numbers gives the SST, or Sum of Squares Total, of
15,847,303,600.
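The relationships among the ANOVA entries can be verified with a small Python sketch. It uses the SS values and sample size from the summary output; the variable names are illustrative.

```python
import math

SSR = 10222402220.0  # Sum of Squares Regression
SSE = 5624901380.0   # Sum of Squares Residual
n = 100

df_regression = 1
df_residual = n - 2

SST = SSR + SSE                       # total sum of squares
MSR = SSR / df_regression             # mean square regression
MSE = SSE / df_residual               # mean square residual
f_statistic = MSR / MSE
standard_error_estimate = math.sqrt(MSE)
print(SST, round(MSE), round(f_statistic, 4), round(standard_error_estimate, 2))
```

The square root of the mean square residual reproduces the Standard Error line of the Regression Statistics, and MSR/MSE reproduces the F value.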
The Standard Error value measures the dispersion of the dependent variable about its
mean value at each value of the independent variable, in the original units of the
dependent variable. The value given for the standard error of the regression slope is
8.09. A relatively low value here indicates that the slope estimate is less variable
from sample to sample. The Lower 95% and Upper 95% values give the 95% confidence
interval for each regression coefficient. With 95% confidence, the intercept falls
between 9,102 and 14,541, and the slope falls between 91.95 and 124.08.
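The Lower 95% and Upper 95% columns can be approximated by hand as estimate ± t × standard error. The Python sketch below assumes the rounded critical value t = 1.9840 for 98 degrees of freedom used elsewhere in this report, so its results differ slightly in the later decimals from the spreadsheet, which works with an unrounded critical value.

```python
T_CRITICAL = 1.9840  # two-tailed critical t for 98 degrees of freedom, alpha = 0.05


def coefficient_interval(estimate, std_error, t_critical=T_CRITICAL):
    """Confidence interval for a regression coefficient: estimate +/- t * standard error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin


slope_low, slope_high = coefficient_interval(108.013717, 8.093695259)
intercept_low, intercept_high = coefficient_interval(11821.585, 1370.19735)
print(round(slope_low, 2), round(slope_high, 2))
print(round(intercept_low), round(intercept_high))
```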
Hypothesis Testing (Correlation, Significance of Regression Slope)
Hypothesis:
$H_0: \rho = 0$ (No Correlation)
$H_A: \rho \neq 0$
$\alpha = 0.05$
D.F. = n-2 = (100-2) = 98
Rejection regions: $\alpha/2 = 0.025$ in each tail, with critical values
$-t_{0.025} = -1.9840$ and $t_{0.025} = 1.9840$
Calculated T-Value is
$t = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}} = \frac{.80315394}{\sqrt{\frac{.35494375}{98}}} = 13.34$
Decision Rule
Because 13.34 > 1.9840, reject $H_0$.
Based on the sample evidence, we conclude there is a significant positive linear
relationship between annual income and the dollar amount the price of parking must
increase to influence riding the Sprinter.
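The calculated t-value for the correlation test can be reproduced with a few lines of Python. The function name is illustrative; the inputs are the correlation coefficient, sample size, and critical value reported in this section.

```python
import math


def correlation_t_statistic(r, n):
    """t = r / sqrt((1 - r^2) / (n - 2)) for testing H0: rho = 0."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


t_stat = correlation_t_statistic(0.80315394, 100)
reject_null = abs(t_stat) > 1.9840  # two-tailed test at alpha = 0.05, 98 d.f.
print(round(t_stat, 4), reject_null)
```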
Significance Test of Regression Slope
Hypothesis:
$H_0: \beta_1 = 0$
$H_A: \beta_1 \neq 0$
$\alpha = .05$
Decision rule: If $t > t_{\alpha/2} = 1.9840$, reject $H_0$. If
$t < -t_{\alpha/2} = -1.9840$, reject $H_0$. Otherwise, do not reject $H_0$.
D.F. = n-2 = (100-2) = 98
The calculated t is:
$t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{108.01 - 0}{8.09} = 13.35$
Because 13.35 > 1.9840, we reject the null hypothesis and conclude the true slope is
not zero. Thus, the simple linear regression relating annual income and the dollar
amount the price of parking must increase to influence an individual to ride the
Sprinter is useful in explaining the variation in the dependent variable.
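The slope test and its decision rule can be sketched in Python as follows. The function names are illustrative. The sketch uses the unrounded slope and standard error from the regression output, so it returns about 13.345 rather than the 13.35 obtained from the rounded inputs 108.01 and 8.09.

```python
def slope_t_statistic(b1, se_b1, beta1_null=0.0):
    """t = (b1 - beta1) / s_b1 for testing H0: beta1 = beta1_null."""
    return (b1 - beta1_null) / se_b1


def decide(t_stat, t_critical=1.9840):
    """Two-tailed decision rule at alpha = 0.05 with 98 degrees of freedom."""
    return "reject H0" if abs(t_stat) > t_critical else "do not reject H0"


t_slope = slope_t_statistic(108.013717, 8.093695259)
print(round(t_slope, 4), decide(t_slope))
```

In simple linear regression this t-value equals the t-value of the correlation test, which is why both tests lead to the same decision.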
Estimation of Population Mean (Annual Income)
With 95% confidence
$\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}} = 27058 \pm 1.96\left(\frac{12652.03}{10}\right)$ = (24,578.20 to 29,537.80)
With 90% confidence
$\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}} = 27058 \pm 1.645\left(\frac{12652.03}{10}\right)$ = (24,976.74 to 29,139.26)
Estimation of Population Mean (Dollar Value Price of Parking must increase to
influence decision to take Sprinter)
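With 95% confidence
$\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}} = 141.06 \pm 1.96\left(\frac{94.076}{10}\right)$ = (122.62 to 159.50)
With 90% confidence
$\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}} = 141.06 \pm 1.645\left(\frac{94.076}{10}\right)$ = (125.58 to 156.54)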
The values above give the range in which each population mean is estimated to fall
at both the 90% and 95% confidence levels. These ranges tell us where the mean would
fall based on the statistics from our compiled sample data.
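The interval estimates can be recomputed with a short Python sketch. It uses the sample means, the standard deviations (the square roots of the sample variances), and the large-sample critical values 1.96 and 1.645 that appear in the calculations above; the function name is illustrative.

```python
import math


def mean_interval(x_bar, s, n, critical_value):
    """Confidence interval for a mean: x_bar +/- critical_value * s / sqrt(n)."""
    margin = critical_value * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin


income_95 = mean_interval(27058, 12652.03, 100, 1.96)
income_90 = mean_interval(27058, 12652.03, 100, 1.645)
price_95 = mean_interval(141.06, 94.076, 100, 1.96)
price_90 = mean_interval(141.06, 94.076, 100, 1.645)

for label, (low, high) in [("income 95%", income_95), ("income 90%", income_90),
                           ("price 95%", price_95), ("price 90%", price_90)]:
    print(f"{label}: {low:.2f} to {high:.2f}")
```

The 90% intervals are narrower than the 95% intervals because a smaller critical value is multiplied by the same standard error.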
Conclusion
As you can see, our research team successfully conducted a statistical analysis of
a population by taking statistical data from a sample of 100 individuals and applying
it to the population from which it came. We were able to compute both the mean and
variance of the sample variables. These statistical measures proved very useful when
relating two of the variables to each other. Our research team found the best
research topic to be whether or not there is a positive linear correlation between
individuals' annual income and the dollar amount the price of parking must increase
in order to influence the decision to ride the Sprinter. Our team set out to conduct
both a correlation analysis and a regression analysis. We found that there was in
fact a positive linear correlation between these two variables. The regression
analysis also showed that the relationship between the two variables is strong and
statistically significant. After the correlation and regression analyses were
conducted, our team was able to estimate the population means at both the 90% and 95%
confidence levels. These intervals give us a range in which the population mean of
each variable is expected to fall.
How might this be useful?
One might ask, “How might this be useful?” The truth is that our research
hypothesis has a great deal of significance. One group that might find our research
useful is the Parking and Transit center at Cal State San Marcos. By reviewing our
data, they would be able to tell the average maximum amount most CSUSM students would
pay for a parking pass before using other transportation services. They may also use
our average annual income figure to base their finances around. Most importantly, the
parking and transit offices at CSUSM would be able to maximize their revenues by
charging the largest amount possible before actually losing customers. This would
help create price stability. It may also prove beneficial for the new parking
structure that is set to be built at CSUSM within the next few years. Conducting a
sample analysis of the school helps researchers get an idea of what people can afford
and, in our case, what they are willing to pay.
Works Cited
Faculty. “QUICKSTAT: Cougar Review.” 2006-2007. Accessed April 26, 2008.
<www.csusm.edu/ipa>
Groebner, David F. “Business Statistics: Fourth Edition.” 2007. McGraw Hill,
New York.
Research. “CalStateSanMarcos. Student Statistics.” April 28, 2008.
<http://sanmarcoscalifornia.stateuniversity.com/>
Sample Research Data Analysis. Sample survey of 100 individuals, conducted
April 10-27, 2008.