
Page 1: Quantitative Technique

Personal Learning Paper On

Quantitative Techniques

Prepared By:

Submitted To: Prof. Priyabrata Nayak

Page 2: Quantitative Technique

Contents

Introduction
History
Data Collecting
Arranging Data
  Using Data Array
  Using Frequency Distribution
Central Tendency
Dispersion
Skewness
Measures of Central Tendency
  Mean
  Median
  Mode
Probability
Probability Distribution
Discrete Probability Distribution

Page 3: Quantitative Technique

Contents

Sampling
Central Limit Theorem
Standard Error
Sample Size
Estimation
  Point Estimation
  Interval Estimation
Correlation
Coefficient of Determination

Page 4: Quantitative Technique

Introduction to Statistics

The word statistics means different things to different people.

To a football fan, statistics are rushing, passing, and first-down numbers; to the Chargers' coach in the second example, statistics is the chance that the Giants will throw a pass over center.

To the manager of a power station, statistics are the amounts of pollution being released into the atmosphere.

To the Food and Drug Administrator in our third example, statistics is the likely percentage of undesirable effects in the population using the new prostate drug.

To the Community Bank in the fourth example, statistics is the chance that Sarah will repay her loan on time.

Each of these people is using the word correctly, yet each person uses it in a different way. All of them are using statistics to help them make decisions.

Page 5: Quantitative Technique

History

The word statistik comes from the Italian word statista (meaning "statesman"). It was first used by Gottfried Achenwall (1719-1772), a professor at Marlborough and Göttingen. Dr. E. A. W. Zimmerman introduced the word statistics into England. Its use was popularized by Sir John Sinclair in his work Statistical Account of Scotland 1791-1799. Long before the eighteenth century, however, people had been recording and using data.

Page 6: Quantitative Technique

Data Collecting

Statisticians select their observations so that all relevant groups are represented in the data. To determine the potential market for a new product, for example, analysts may study 100 consumers in a certain geographical area. Analysts must be certain that this group contains people representing variables such as income level, race, education, and neighborhood.

Past data is used to make decisions about the future.

Page 7: Quantitative Technique

Arranging Data using the Data Array

The data array is one of the simplest ways to present data. It arranges the data in ascending or descending order.

Table 1-1: Sample of Daily Production in Yards of 30 Carpet Looms

16.2 15.8 15.8 15.8 16.3 15.6
15.7 16.0 16.2 16.1 16.8 16.0
16.4 15.2 15.9 15.9 15.9 16.8
15.4 15.7 15.9 16.0 16.3 16.0
16.4 16.6 15.6 15.6 16.9 16.3

Table 1-2: Data Array of Daily Production in Yards of 30 Carpet Looms

15.2 15.7 15.9 16.0 16.2 16.4
15.4 15.7 15.9 16.0 16.3 16.6
15.6 15.8 15.9 16.0 16.3 16.8
15.6 15.8 15.9 16.1 16.3 16.8
15.6 15.8 16.0 16.2 16.4 16.9

Table 1-1 contains the raw data, and Table 1-2 rearranges it in a data array in ascending order.

Advantages of the Data Array:
We can quickly locate the lowest and highest values in the data.
We can easily divide the data into sections.
We can see whether any values appear more than once in the array.
We can observe the distance between succeeding values in the data.
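The sorting step behind Tables 1-1 and 1-2 can be sketched in a few lines of Python, using the loom figures from Table 1-1:

```python
# Sorting the raw loom data from Table 1-1 into a data array (Table 1-2).
raw = [16.2, 15.8, 15.8, 15.8, 16.3, 15.6,
       15.7, 16.0, 16.2, 16.1, 16.8, 16.0,
       16.4, 15.2, 15.9, 15.9, 15.9, 16.8,
       15.4, 15.7, 15.9, 16.0, 16.3, 16.0,
       16.4, 16.6, 15.6, 15.6, 16.9, 16.3]

array = sorted(raw)          # ascending data array
print(array[0], array[-1])   # lowest and highest values: 15.2 16.9
```

With the data in array form, the lowest and highest values are simply the first and last elements, illustrating the first advantage listed above.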

Page 8: Quantitative Technique

Arranging Data using the Frequency Distribution

In statistics, a frequency distribution is a graph or data set organized to show the frequency of occurrence of each possible outcome of a repeatable event observed many times.

Simple examples are election returns and test scores listed by percentile. A frequency distribution can be graphed as a histogram or pie chart. For large data sets, the stepped graph of a histogram is often approximated by the smooth curve of a distribution function (called a density function when normalized so that the area under the curve is 1).

The famed bell curve or normal distribution is the graph of one such function. Frequency distributions are particularly useful in summarizing large data sets and assigning probabilities.
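A frequency distribution can be built by counting how many observations fall in each class interval. As a sketch, here the Table 1-1 loom data is grouped into classes of width 0.3 yards starting at 15.2 (the interval choice is an assumption for illustration, not taken from the text):

```python
from collections import Counter

# Group the Table 1-1 loom data into class intervals of width 0.3 yards.
data = [16.2, 15.8, 15.8, 15.8, 16.3, 15.6, 15.7, 16.0, 16.2, 16.1,
        16.8, 16.0, 16.4, 15.2, 15.9, 15.9, 15.9, 16.8, 15.4, 15.7,
        15.9, 16.0, 16.3, 16.0, 16.4, 16.6, 15.6, 15.6, 16.9, 16.3]

# Work in tenths of a yard so the class boundaries stay exact.
freq = Counter(round((x - 15.2) * 10) // 3 for x in data)
for i in sorted(freq):
    lo = 15.2 + 0.3 * i
    print(f"{lo:.1f} to under {lo + 0.3:.1f}: {freq[i]}")
```

The printed counts form the stepped shape that a histogram would display; with many more observations and narrower classes, the steps approach the smooth density curve described above.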

Page 9: Quantitative Technique

Central Tendency

Measure that indicates the typical or central value of a distribution. The mean and the median are examples of measures of central tendency.

Dispersion

A term used in statistics that refers to the spread of a set of values around a mean or average level. In finance, dispersion is used to measure the volatility of different types of investment strategies. Returns that have wide dispersion are generally seen as riskier because they have a higher probability of closing dramatically lower than the mean. In practice, standard deviation is the tool generally used to measure the dispersion of returns.

Page 10: Quantitative Technique

Skewness

The degree to which a distribution departs from symmetry about its mean value.

In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. Roughly speaking, a distribution has positive skew (right-skewed) if the right (higher-value) tail is longer or fatter, and negative skew (left-skewed) if the left (lower-value) tail is longer or fatter. The two are often confused, since most of the mass of a right-skewed (or left-skewed) distribution is to the left (or right) of its respective tail.

Measures of Central Tendency

The three most common measures of central tendency are the mean, the median, and the mode.

Page 11: Quantitative Technique

Measures of Central Tendency

Arithmetic Mean

The arithmetic mean is the most common measure of central tendency. It is simply the sum of the numbers divided by the number of numbers. The symbol μ is used for the mean of a population and M for the mean of a sample. The formula for μ is:

μ = ΣX / N

where ΣX is the sum of all the numbers in the sample and N is the number of numbers in the sample. As an example, the mean of the numbers 1, 2, 3, 6, 8 is (1 + 2 + 3 + 6 + 8) / 5 = 20 / 5 = 4, regardless of whether the numbers constitute the entire population or just a sample from the population.

The table, Number of touchdown passes, shows the number of touchdown (TD) passes thrown by each of the 31 teams in the National Football League in the 2000 season. The mean number of touchdown passes thrown is

μ = ΣX / N = 634 / 31 = 20.4516

Number of touchdown passes

37 33 33 32 29 28 28 23
22 22 22 21 21 21 20 20
19 19 18 18 18 18 16 15
14 14 14 12 12 9 6

Although the arithmetic mean is not the only "mean" (there is also a geometric mean), it is by far the most commonly used. Therefore, if the term "mean" is used without specifying whether it is the arithmetic mean, the geometric mean, or some other mean, it is assumed to refer to the arithmetic mean.
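The worked example above can be checked directly in Python:

```python
# Verifying the worked example: mean touchdown passes for the 31 NFL teams.
td = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
      19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]

mean = sum(td) / len(td)
print(len(td), sum(td), round(mean, 4))  # 31 teams, sum 634, mean 20.4516
```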

Page 12: Quantitative Technique

Measures of Central Tendency

Median

The median is also a frequently used measure of central tendency. The median is the midpoint of a distribution: the same number of scores is above the median as below it. For the data in the table, Number of touchdown passes, there are 31 scores. The 16th highest score (which equals 20) is the median, because there are 15 scores below it and 15 scores above it. The median can also be thought of as the 50th percentile. Let's return to the made-up example of the quiz on which you made a three, discussed previously in the module Introduction to Central Tendency and shown in Table 2.

Three possible datasets for the 5-point make-up quiz

Student       Dataset 1   Dataset 2   Dataset 3
You           3           3           3
John's        3           4           2
Maria's       3           4           2
Shareecia's   3           4           2
Luther's      3           5           1

For Dataset 1, the median is 3, the same as your score. For Dataset 2, the median is 4; therefore, your score is below the median, which means you are in the lower half of the class. Finally, for Dataset 3, the median is 2; for this dataset, your score is above the median and therefore in the upper half of the distribution.

Computation of the median: When there is an odd number of numbers, the median is simply the middle number. For example, the median of 2, 4, and 7 is 4. When there is an even number of numbers, the median is the mean of the two middle numbers. Thus, the median of the numbers 2, 4, 7, 12 is (4 + 7) / 2 = 5.5.
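The computation rule above translates directly into a short function:

```python
def median(xs):
    """Midpoint of a distribution: the middle value when n is odd,
    or the mean of the two middle values when n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

print(median([2, 4, 7]))      # 4  (odd count: middle number)
print(median([2, 4, 7, 12]))  # 5.5  (even count: mean of 4 and 7)
```

Applied to Dataset 2 above (scores 3, 4, 4, 4, 5), the function returns 4, matching the text.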

Page 13: Quantitative Technique

Measures of Central Tendency

Mode

The mode is the most frequently occurring value. For the data in the table, Number of touchdown passes, the mode is 18, since more teams (4) had 18 touchdown passes than any other number of touchdown passes. With continuous data, such as response time measured to many decimals, the frequency of each value is one, since no two scores will be exactly the same (see the discussion of continuous variables). Therefore, the mode of continuous data is normally computed from a grouped frequency distribution. The grouped frequency distribution table shows a grouped frequency distribution for the target response time data. Since the interval with the highest frequency is 600-700, the mode is the middle of that interval (650).

Grouped frequency distribution

Range Frequency

500-600 3

600-700 6

700-800 5

800-900 5

900-1000 0

1000-1100 1
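The grouped-mode rule above, applied to the table, can be sketched as:

```python
# Mode of a grouped frequency distribution: the midpoint of the class
# with the highest frequency (data from the table above).
groups = {(500, 600): 3, (600, 700): 6, (700, 800): 5,
          (800, 900): 5, (900, 1000): 0, (1000, 1100): 1}

lo, hi = max(groups, key=groups.get)  # interval with the highest frequency
mode = (lo + hi) / 2
print(mode)  # 650.0 -- the middle of the 600-700 interval
```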

Page 14: Quantitative Technique

Probability

Probability theory is the mathematical study of phenomena characterized by randomness or uncertainty. More precisely, probability is used for modelling situations in which an experiment, realized under the same circumstances, produces different results (typically throwing a die or tossing a coin). Mathematicians and actuaries think of probabilities as numbers in the closed interval from 0 to 1, assigned to "events" whose occurrence or failure to occur is random. Probabilities P(A) are assigned to events A according to the probability axioms. The probability that an event A occurs given the known occurrence of an event B is the conditional probability of A given B; its numerical value is P(A|B) = P(A ∩ B) / P(B) (as long as P(B) is nonzero). If the conditional probability of A given B is the same as the ("unconditional") probability of A, then A and B are said to be independent events. That this relation between A and B is symmetric may be seen more readily by realizing that it is the same as saying P(A ∩ B) = P(A)P(B) when A and B are independent events.
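The definition of conditional probability can be checked by enumerating a small sample space. The two events chosen here (sum of two fair dice is 7; first die shows 3) are an illustrative assumption, not taken from the text:

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of throwing two fair dice.
space = list(product(range(1, 7), repeat=2))

A = {(a, b) for a, b in space if a + b == 7}  # sum is seven
B = {(a, b) for a, b in space if a == 3}      # first die shows 3

P = lambda E: Fraction(len(E), len(space))    # probability by counting
p_a_given_b = P(A & B) / P(B)                 # P(A|B) = P(A and B) / P(B)

print(p_a_given_b)          # 1/6
print(p_a_given_b == P(A))  # True: A and B are independent
```

Because P(A|B) equals P(A), these two events are independent, and indeed P(A ∩ B) = P(A)P(B) here.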

Page 15: Quantitative Technique

Probability Distribution

Outcomes of an experiment and their probabilities of occurrence. If the experiment were repeated any number of times, the same probabilities should also repeat. For example, the probability distribution for the possible number of heads from two tosses of a fair coin would be as follows:

Number of Heads   Tosses                       Probability of Event
0                 (tail, tail)                 0.25
1                 (head, tail), (tail, head)   0.50
2                 (head, head)                 0.25

In mathematics and statistics, a probability distribution, more properly called a probability distribution function, assigns to every interval of the real numbers a probability, so that the probability axioms are satisfied. In technical terms, a probability distribution is a probability measure whose domain is the Borel algebra on the reals. A probability distribution is a special case of the more general notion of a probability measure, which is a function that assigns probabilities satisfying the Kolmogorov axioms to the measurable sets of a measurable space. Additionally, some authors define a distribution generally as the probability measure induced by a random variable X on its range: the probability of a set B is P(X⁻¹(B)). However, the discussion here covers only probability measures over the real numbers.

Page 16: Quantitative Technique

Discrete Probability Distribution

The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. Such a success/failure experiment is also called a Bernoulli experiment or Bernoulli trial. In fact, when n = 1, then the binomial distribution is the Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance.

Example

A typical example is the following: assume 5% of the population is green-eyed. You pick 500 people randomly. The number of green-eyed people you pick is a random variable X which follows a binomial distribution with n = 500 and p = 0.05 (when picking the people with replacement).
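The binomial probabilities for this example can be computed from the standard formula P(X = k) = C(n, k) p^k (1-p)^(n-k):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a binomial random variable: C(n, k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# With n = 1 the binomial reduces to the Bernoulli distribution:
print(binom_pmf(1, 1, 0.05))  # 0.05

# Green-eyed example from the text: n = 500, p = 0.05.
print(round(binom_pmf(25, 500, 0.05), 4))  # probability of exactly 25 successes
```

Summing binom_pmf(k, 500, 0.05) over k = 0..500 gives 1, as any probability distribution must.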

Page 17: Quantitative Technique

Sampling

In many disciplines, there is often a need to describe the characteristics of some large entity, such as the air quality in a region, the prevalence of smoking in the general population, or the output from a production line of a pharmaceutical company. Due to practical considerations, it is impossible to assay the entire atmosphere, interview every person in the nation, or test every pill. Sampling is the process whereby information is obtained from selected parts of an entity, with the aim of making general statements that apply to the entity as a whole, or an identifiable part of it. Opinion pollsters use sampling to gauge political allegiances or preferences for brands of commercial products, whereas water quality engineers employed by public health departments will take samples of water to make sure it is fit to drink. The process of drawing conclusions about the larger entity based on the information contained in a sample is known as statistical inference.

There are several advantages to using sampling rather than conducting measurements on an entire population. An important advantage is the considerable savings in time and money that can result from collecting information from a much smaller population. When sampling individuals, the reduced number of subjects that need to be contacted may allow more resources to be devoted to finding and persuading nonresponders to participate. The information collected using sampling is often more accurate, as greater effort can be expended on the training of interviewers, more sophisticated and expensive measurement devices can be used, repeated measurements can be taken, and more detailed questions can be posed.

Page 18: Quantitative Technique

Sampling

Definitions

The term "target population" is commonly used to refer to the group of people or entities (the "universe") to which the findings of the sample are to be generalized. The "sampling unit" is the basic unit (e.g., person, household, pill) around which a sampling procedure is planned. For instance, if one wanted to apply sampling methods to estimate the prevalence of diabetes in a population, the sampling unit would be persons, whereas households would be the sampling unit for a study to determine the number of households in which one or more persons were smokers. The "sampling frame" is any list of all the sampling units in the target population. Although a complete list of all individuals in a population is rarely available, alphabetic listings of residents in a community or of registered voters are examples of sampling frames.

Page 19: Quantitative Technique

Central Limit Theorem

A central limit theorem is any of a set of weak-convergence results in probability theory. They all express the fact that a sum of many independent, identically distributed random variables tends to be distributed according to a particular "attractor distribution". The most important and famous result is the Central Limit Theorem, which states that if the sum of the variables has a finite variance, then it will be approximately normally distributed. Since many real processes yield distributions with finite variance, this explains the ubiquity of the normal distribution.

Standard Error

In statistics, the standard error of a measurement, value, or quantity is the estimated standard deviation of the process by which it was generated, including adjusting for sample size. In other words, the standard error is the standard deviation of the sampling distribution of a sample statistic (such as the sample mean, sample proportion, or sample correlation).
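Both ideas can be seen in a small simulation. This sketch assumes a Uniform(0, 1) population (a deliberately non-normal choice, not from the text): sample means cluster around the population mean, with a spread close to the theoretical standard error σ/√n:

```python
import random
import statistics

# Draw many samples of size n from Uniform(0, 1) and record each sample mean.
random.seed(1)
n, trials = 50, 2000
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(trials)]

observed_se = statistics.stdev(means)            # spread of the sample means
theoretical_se = (1 / 12) ** 0.5 / n ** 0.5      # sigma of Uniform(0,1) is 1/sqrt(12)
print(round(observed_se, 4), round(theoretical_se, 4))
```

The two numbers agree closely, and a histogram of `means` would show the bell shape the Central Limit Theorem predicts, even though the underlying population is flat.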

Page 20: Quantitative Technique

Sample Size

Sample size, usually designated N, is the number of repeated measurements in a statistical sample. They are used to estimate a parameter, a descriptive quantity of some population. N determines the precision of that estimate: larger N gives smaller error bounds of estimation. A typical statement is that one can be 95% sure the true parameter is within ±B of the estimate, where B is an error bound that decreases with increasing N. Such a bounded estimate is referred to as a confidence interval for that parameter.

Estimation

Estimation is the calculated approximation of a result that is usable even if input data may be incomplete, uncertain, or noisy. In statistics, see estimation theory and estimators. In mathematics, approximation or estimation typically means finding upper or lower bounds of a quantity that cannot readily be computed precisely. While initial results may be unusably uncertain, recursively feeding output back as input can refine results until they are approximately accurate, complete, and noise-free.
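The relation between N and the error bound B can be made concrete. For a 95% confidence level the large-sample bound is B = 1.96 σ/√N; the population standard deviation σ = 10 used here is an assumed value for illustration:

```python
# Error bound B = z * sigma / sqrt(N) for a 95% confidence level.
sigma, z = 10.0, 1.96

for n in (25, 100, 400):
    b = z * sigma / n ** 0.5
    print(n, round(b, 2))  # 25 -> 3.92, 100 -> 1.96, 400 -> 0.98
```

Note that quadrupling the sample size halves the error bound: precision improves only with the square root of N.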

Page 21: Quantitative Technique

Point Estimation

In statistics, point estimation involves the use of sample data to calculate a single value (known as a statistic) which serves as a "best guess" for an unknown (fixed or random) population parameter. More formally, it is the application of a point estimator to the data. Point estimation should be contrasted with Bayesian methods of estimation, where the goal is usually to compute (perhaps to an approximation) the posterior distributions of parameters and other quantities of interest. The contrast here is between estimating a single point (point estimation) and estimating a weighted set of points (a probability density function).

Interval Estimation

In statistics, interval estimation is the use of sample data to calculate an interval of possible (or probable) values of an unknown population parameter. The most prevalent forms of interval estimation are confidence intervals (a frequentist method) and credible intervals (a Bayesian method).
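A minimal sketch of both ideas together: the sample mean as a point estimate, and a z-based 95% confidence interval around it as an interval estimate. The data values here are made up for illustration:

```python
import statistics

# Made-up sample of ten measurements.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.1]

point = statistics.fmean(sample)                    # point estimate of the mean
se = statistics.stdev(sample) / len(sample) ** 0.5  # standard error of the mean
interval = (point - 1.96 * se, point + 1.96 * se)   # 95% interval estimate
print(round(point, 3), tuple(round(x, 3) for x in interval))
```

Strictly, a t multiplier would be more appropriate than 1.96 for a sample this small; the z value is used here only to match the 95% error-bound discussion above.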

Page 22: Quantitative Technique

Regression Analysis

In statistics, regression analysis is used to model relationships between variables and determine the magnitude of those relationships. The models can be used to make predictions.

Introduction

Regression analysis models the relationship between one or more response variables (also called dependent variables, explained variables, predicted variables, or regressands; usually named Y) and the predictors (also called independent variables, explanatory variables, control variables, or regressors; usually named X1, ..., Xp). Multivariate regression describes models that have more than one response variable.

Types of regression

Simple and multiple linear regression

Simple linear regression and multiple linear regression are related statistical methods for modeling the relationship between two or more random variables using a linear equation. Simple linear regression refers to a regression on two variables, while multiple regression refers to a regression on more than two variables. Linear regression assumes the best estimate of the response is a linear function of some parameters (though not necessarily linear in the predictors).

Nonlinear regression models

If the relationship between the variables being analyzed is not linear in the parameters, a number of nonlinear regression techniques may be used to obtain a more accurate regression.
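A minimal sketch of simple linear regression by least squares, fitting y = a + b·x to made-up (x, y) pairs (the data are illustrative, not from the text):

```python
# Least-squares fit of y = a + b*x to made-up data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope: covariance of x and y divided by variance of x.
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx                       # intercept: line passes through (mx, my)

print(round(a, 3), round(b, 3))       # fitted intercept and slope
print(round(a + b * 6.0, 2))          # prediction for x = 6
```

The fitted line can then be used for prediction, as the introduction above notes.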

Page 23: Quantitative Technique

Correlation

Degree of relationship between business and economic variables such as cost and volume. Correlation analysis measures how consistently the value of one variable changes when the value of the other changes; by itself it establishes association, not cause and effect. A prediction can be made based on the relationship uncovered. An example is the effect of advertising on sales. The degree of correlation is measured statistically by the coefficient of determination (r-squared).

Coefficient of Determination

A statistical measure of goodness of fit. It measures how good the estimated regression equation is, and is designated r² (read as r-squared). The higher the r-squared, the more confidence one can have in the equation. Statistically, the coefficient of determination represents the proportion of the total variation in the y variable that is explained by the regression equation. It has a range of values between 0 and 1. It is computed as

r² = explained variation / total variation = Σ(ŷ − ȳ)² / Σ(y − ȳ)²
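The formula above (explained variation over total variation) can be sketched on made-up data, fitting the least-squares line first and then comparing the variation of the fitted values to the variation of the observed values:

```python
# r-squared = explained variation / total variation, on made-up (x, y) data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx

fitted = [a + b * x for x in xs]      # y-hat values from the regression line
r2 = sum((f - my) ** 2 for f in fitted) / sum((y - my) ** 2 for y in ys)
print(round(r2, 4))                   # close to 1: nearly all variation explained
```

An r² near 1 means the regression equation explains almost all of the variation in y; an r² near 0 means it explains almost none.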