data sampling and probability

DATA SAMPLING AND PROBABILITY Avjinder Singh Kaler and Kristi Mai

Upload: avjinder-avi-kaler

Post on 29-Jan-2018


TRANSCRIPT

Page 1: Data sampling and probability

DATA SAMPLING AND PROBABILITY Avjinder Singh Kaler and Kristi Mai

Page 2: Data sampling and probability

Multiplication Rule: Complements and Conditional Probability

Counting

Types of Sampling Methods

Summarizing Data

Statistical Graphs

Probability Distributions

Normal and Standard Normal Distribution

Page 3: Data sampling and probability

A conditional probability of an event is a probability obtained with the additional information that some other event has already occurred.

$P(B \mid A)$ denotes the conditional probability of event B occurring, given that event A has already occurred. It can be found by dividing the probability of events A and B both occurring by the probability of event A:

$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$$

Page 4: Data sampling and probability

Refer to Table 4-1 to find the following:

a) If 1 of the 1000 test subjects is randomly selected, find the probability that the subject had a positive test result, given that the subject actually uses drugs. That is, find P(positive test result | subject uses drugs).

b) If 1 of the 1000 test subjects is randomly selected, find the probability that the subject actually uses drugs, given that he or she had a positive test result. That is, find P(subject uses drugs | positive test result).

Page 5: Data sampling and probability

Solution:

a) $$P(\text{positive test result} \mid \text{subject uses drugs}) = \frac{P(\text{subject uses drugs and had a positive test result})}{P(\text{subject uses drugs})} = \frac{44/1000}{50/1000} = \frac{44}{50} = 0.88$$

b) $$P(\text{subject uses drugs} \mid \text{positive test result}) = \frac{P(\text{subject uses drugs and had a positive test result})}{P(\text{positive test result})} = \frac{44/1000}{134/1000} = \frac{44}{134} \approx 0.328$$

Table 4-1 Pre-Employment Drug Screening Results

| | Positive Test Result | Negative Test Result |
|---|---|---|
| Subject Uses Drugs | 44 (True Positive) | 6 (False Negative) |
| Subject Is Not a Drug User | 90 (False Positive) | 860 (True Negative) |
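The two conditional probabilities can be checked directly from the counts in Table 4-1. The following is a minimal Python sketch; the dictionary layout, the `prob` helper, and the variable names are illustrative assumptions, not part of the original slides.

```python
# Counts taken from Table 4-1 (pre-employment drug screening results).
counts = {
    ("uses drugs", "positive"): 44,   # true positive
    ("uses drugs", "negative"): 6,    # false negative
    ("no drugs", "positive"): 90,     # false positive
    ("no drugs", "negative"): 860,    # true negative
}
total = sum(counts.values())

def prob(predicate):
    """Probability that one randomly selected subject satisfies predicate."""
    return sum(n for key, n in counts.items() if predicate(key)) / total

p_user = prob(lambda k: k[0] == "uses drugs")
p_positive = prob(lambda k: k[1] == "positive")
p_user_and_positive = prob(lambda k: k == ("uses drugs", "positive"))

# P(B | A) = P(A and B) / P(A)
p_positive_given_user = p_user_and_positive / p_user
p_user_given_positive = p_user_and_positive / p_positive

print(round(p_positive_given_user, 3))
print(round(p_user_given_positive, 3))
```

Swapping the conditioning event changes the denominator (50 drug users versus 134 positive results), which is why the two answers differ so much.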

Page 6: Data sampling and probability

For a sequence of two events in which the first event can occur $m$ ways and the second event can occur $n$ ways, the events together can occur a total of $m \cdot n$ ways.

Example:

For a two-character code consisting of a letter followed by a digit, the number of different possible codes is $26 \cdot 10 = 260$.
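As a quick check of the counting rule, the codes can be enumerated explicitly. This is a small Python sketch; using uppercase letters and the name `codes` are assumptions made for illustration.

```python
from itertools import product
from string import ascii_uppercase, digits

# Every (letter, digit) pair is one possible two-character code.
codes = ["".join(pair) for pair in product(ascii_uppercase, digits)]

print(len(codes))                              # m * n codes in total
print(len(ascii_uppercase) * len(digits))      # 26 * 10
```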

Page 7: Data sampling and probability

The factorial symbol ! denotes the product of decreasing positive whole numbers.

For example, $4! = 4 \cdot 3 \cdot 2 \cdot 1 = 24$.

By special definition, 0! = 1.

Page 8: Data sampling and probability

n! is the number of different permutations (order counts) in which n different items can be arranged when all n of them are selected. (This factorial rule reflects the fact that the first item may be selected in n different ways, the second item may be selected in n – 1 ways, and so on.)

Example:

The number of ways that the five letters {a, b, c, d, e} can be arranged is as

follows: 5! = 5 ∙ 4 ∙ 3 ∙ 2 ∙ 1 = 120
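The factorial rule can be confirmed by listing every arrangement of the five letters. A minimal Python sketch follows; the variable names are illustrative.

```python
from itertools import permutations
from math import factorial

letters = ["a", "b", "c", "d", "e"]

# Enumerate every ordering of all five letters.
arrangements = list(permutations(letters))

print(factorial(4))                                   # 4 * 3 * 2 * 1
print(factorial(0))                                   # 1 by special definition
print(len(arrangements) == factorial(len(letters)))   # enumeration matches n!
```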

Page 9: Data sampling and probability

Requirements:

1. There are n different items available. (This rule does not apply if some of

the items are identical to others.)

2. We select r of the n items (without replacement).

3. We consider rearrangements of the same items to be different sequences.

(The permutation of ABC is different from CBA and is counted separately.)

If the preceding requirements are satisfied, the number of permutations (or sequences) of r items selected from n available items (without replacement) is

$${}_nP_r = \frac{n!}{(n-r)!}$$

Page 10: Data sampling and probability

If the five letters {a, b, c, d, e} are available and three of them are to be selected without replacement, the number of different permutations is as follows:

$${}_nP_r = \frac{n!}{(n-r)!} = \frac{5!}{(5-3)!} = 60$$
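The same count can be obtained three ways in Python: from the formula, from `math.perm` (available in Python 3.8 and later), and by enumeration. This is a sketch with illustrative variable names.

```python
from itertools import permutations
from math import factorial, perm

n, r = 5, 3

by_formula = factorial(n) // factorial(n - r)   # n! / (n - r)!
by_library = perm(n, r)                          # built-in permutation count
sequences = list(permutations("abcde", r))       # every ordered selection of 3 letters

print(by_formula, by_library, len(sequences))
# Order counts: ('a', 'b', 'c') and ('c', 'b', 'a') are separate sequences.
print(("a", "b", "c") in sequences, ("c", "b", "a") in sequences)
```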

Page 11: Data sampling and probability

Requirements:

1. There are n items available, and some items are identical to others.

2. We select all of the n items (without replacement).

3. We consider rearrangements of distinct items to be different sequences.

If the preceding requirements are satisfied, and if there are $n_1$ alike, $n_2$ alike, . . . , $n_k$ alike, the number of permutations (or sequences) of all items selected without replacement is

$$\frac{n!}{n_1! \, n_2! \cdots n_k!}$$

Page 12: Data sampling and probability

If the 10 letters {a, a, a, a, b, b, c, c, d, e} are available and all 10 of them are to be selected without replacement, the number of different permutations is as follows:

$$\frac{n!}{n_1! \, n_2! \cdots n_k!} = \frac{10!}{4! \, 2! \, 2!} = \frac{3{,}628{,}800}{24 \cdot 2 \cdot 2} = 37{,}800$$
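A short Python sketch of the same calculation, counting how often each letter repeats and dividing out those repeats. The names are illustrative, and `math.prod` assumes Python 3.8 or later. The letters d and e appear once each, so they contribute factors of 1! = 1.

```python
from collections import Counter
from math import factorial, prod

letters = list("aaaabbccde")
counts = Counter(letters)            # a: 4, b: 2, c: 2, d: 1, e: 1

# Denominator n1! * n2! * ... * nk!
denominator = prod(factorial(c) for c in counts.values())
arrangements = factorial(len(letters)) // denominator

print(denominator)
print(arrangements)
```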

Page 13: Data sampling and probability

Requirements:

1. There are n different items available.

2. We select r of the n items (without replacement).

3. We consider rearrangements of the same items to be the same. (The

combination of ABC is the same as CBA.)

If the preceding requirements are satisfied, the number of combinations of r items selected from n different items is

$${}_nC_r = \frac{n!}{(n-r)! \, r!}$$

Page 14: Data sampling and probability

In the Pennsylvania Match 6 Lotto, winning the jackpot requires you select six different numbers from 1 to 49. The winning numbers may be drawn in any order. Find the probability of winning if one ticket is purchased.

Number of combinations:

$${}_nC_r = \frac{n!}{(n-r)! \, r!} = \frac{49!}{43! \, 6!} = 13{,}983{,}816$$

$$P(\text{winning}) = \frac{1}{13{,}983{,}816}$$
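The lottery calculation can be reproduced with `math.comb` (Python 3.8 and later). This is a minimal sketch; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

# Order does not matter, so count combinations of 6 numbers chosen from 49.
combinations_count = comb(49, 6)

# One ticket covers exactly one of the equally likely combinations.
p_win = Fraction(1, combinations_count)

print(combinations_count)
print(float(p_win))
```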

Page 15: Data sampling and probability

When different orderings of the same items are to be counted separately, we have a permutation problem, but when different orderings are not to be counted separately, we have a combination problem.

Permutations are for lists (order matters) and combinations are for groups (order doesn’t matter).

Page 16: Data sampling and probability

Data – collections of observations, such as measurements, genders,

or survey responses

Population – the complete collection of all individuals to be studied

Sample – a sub-collection of members selected from the population, from which the data come

Census – the collection of data from every member of the population

Page 17: Data sampling and probability

planning studies, designing experiments, and obtaining data

organizing, summarizing, analyzing, interpreting, drawing conclusions about, and presenting data

Page 18: Data sampling and probability

The Gallup corporation collected data from 1013 adults in the United States. Results showed that 66% of the respondents worried about identity theft.

The population consists of all 241,472,385 adults in the United States.

The sample consists of the 1013 polled adults.

The objective is to use the sample data as a basis for drawing a conclusion about the whole population.

Page 19: Data sampling and probability

Simple random sample

Random sample

Systematic sampling

Convenience sampling

Stratified sampling

Cluster sampling

Page 20: Data sampling and probability

Simple random sample: A sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen.

Page 21: Data sampling and probability

Random sample: Members from the population are selected in such a way that each individual member in the population has an equal chance of being selected.

Page 22: Data sampling and probability

Systematic sampling: Select some starting point and then select every kth element in the population.

Page 23: Data sampling and probability

Convenience sampling: Use results that are easy to get.

Page 24: Data sampling and probability

Stratified sampling: Subdivide the population into at least two different subgroups (or strata) so that subjects within the same subgroup share the same characteristics, then draw a sample from each subgroup.

Page 25: Data sampling and probability

Cluster sampling: Divide the population area into sections (or clusters). Then randomly select some of those clusters and choose all members from the selected clusters.
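The sampling methods described on the preceding pages can be sketched with Python's standard `random` module. Everything specific here is an assumption made for illustration: the population of 100 member IDs, the sample sizes, the two strata, and the clusters of 10.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 member IDs

# Simple random sample: every possible sample of size 10 is equally likely.
simple_random = random.sample(population, 10)

# Systematic sampling: random starting point, then every kth element.
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Convenience sampling: take whatever is easiest to reach (here, the first 10).
convenience = population[:10]

# Stratified sampling: split into strata, then sample from each stratum.
strata = {
    "low": [m for m in population if m <= 50],
    "high": [m for m in population if m > 50],
}
stratified = [m for group in strata.values() for m in random.sample(group, 5)]

# Cluster sampling: split into clusters, pick clusters at random, keep all members.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
cluster_sample = [m for c in random.sample(clusters, 2) for m in c]

print(simple_random, systematic, stratified, cluster_sample, sep="\n")
```

Note the contrast between the last two: stratified sampling draws some members from every subgroup, while cluster sampling draws all members from some subgroups.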

Page 26: Data sampling and probability

When working with large data sets, it is often helpful to organize and summarize data by constructing a table called a frequency distribution.

Page 27: Data sampling and probability

A frequency distribution shows how a data set is partitioned among all of several categories (or classes) by listing all of the categories along with the number (frequency) of data values in each of them.

It lists all categories/classes and the number of observations in each category/class.

Page 28: Data sampling and probability

| IQ Score | Frequency |
|---|---|
| 50-69 | 2 |
| 70-89 | 33 |
| 90-109 | 35 |
| 110-129 | 7 |
| 130-149 | 1 |

Lower class limits are the smallest numbers that can actually belong to different classes (50, 70, 90, 110, and 130 in this table).

Page 29: Data sampling and probability

Upper class limits are the largest numbers that can actually belong to different classes (69, 89, 109, 129, and 149 in the IQ score frequency distribution).

Page 30: Data sampling and probability

Class boundaries are the numbers used to separate classes, but without the gaps created by class limits. For the IQ score frequency distribution, the class boundaries are 49.5, 69.5, 89.5, 109.5, 129.5, and 149.5.

Page 31: Data sampling and probability

Class midpoints are the values in the middle of the classes and can be found by adding the lower class limit to the upper class limit and dividing the sum by 2:

$$\text{class midpoint} = \frac{\text{lower limit} + \text{upper limit}}{2}$$

For the IQ score frequency distribution, the class midpoints are 59.5, 79.5, 99.5, 119.5, and 139.5.

Page 32: Data sampling and probability

Class width is the difference between two consecutive lower class limits or two consecutive lower class boundaries. For the IQ score frequency distribution, every class has a width of 20.
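The class limits, boundaries, midpoints, and width for the IQ score frequency distribution can all be derived from the list of classes. The following Python sketch uses illustrative names and assumes the classes are given as (lower limit, upper limit) pairs.

```python
# Classes of the IQ score frequency distribution as (lower limit, upper limit).
classes = [(50, 69), (70, 89), (90, 109), (110, 129), (130, 149)]

lower_limits = [lo for lo, _ in classes]
upper_limits = [hi for _, hi in classes]

# Midpoint = (lower limit + upper limit) / 2
midpoints = [(lo + hi) / 2 for lo, hi in classes]

# Boundaries sit halfway across the gap between one class and the next.
gap = classes[1][0] - classes[0][1]          # 70 - 69 = 1
boundaries = [classes[0][0] - gap / 2] + [hi + gap / 2 for _, hi in classes]

# Width = difference between consecutive lower class limits.
width = lower_limits[1] - lower_limits[0]

print(midpoints)
print(boundaries)
print(width)
```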

Page 33: Data sampling and probability

A relative frequency distribution (or percentage frequency distribution) includes the same class limits as a frequency distribution, but the frequency of a class is replaced with a relative frequency (a proportion) or a percentage frequency (a percent).

$$\text{relative frequency} = \frac{\text{class frequency}}{\text{sum of all frequencies}}$$

$$\text{percentage frequency} = \frac{\text{class frequency}}{\text{sum of all frequencies}} \times 100\%$$

Page 34: Data sampling and probability

| IQ Score | Frequency | Relative Frequency |
|---|---|---|
| 50-69 | 2 | 2.6% |
| 70-89 | 33 | 42.3% |
| 90-109 | 35 | 44.9% |
| 110-129 | 7 | 9.0% |
| 130-149 | 1 | 1.3% |

Page 35: Data sampling and probability

Cumulative Frequencies

| IQ Score | Frequency | Cumulative Frequency |
|---|---|---|
| 50-69 | 2 | 2 |
| 70-89 | 33 | 35 |
| 90-109 | 35 | 70 |
| 110-129 | 7 | 77 |
| 130-149 | 1 | 78 |
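The relative and cumulative frequency columns can be recomputed from the raw frequencies. This is a minimal Python sketch with illustrative names; percentages are rounded to one decimal place to match the tables.

```python
from itertools import accumulate

frequencies = {"50-69": 2, "70-89": 33, "90-109": 35, "110-129": 7, "130-149": 1}
total = sum(frequencies.values())

# Percentage frequency = class frequency / sum of all frequencies * 100%
percent = {c: round(100 * f / total, 1) for c, f in frequencies.items()}

# Cumulative frequency = running total of the class frequencies.
cumulative = dict(zip(frequencies, accumulate(frequencies.values())))

for c in frequencies:
    print(c, frequencies[c], f"{percent[c]}%", cumulative[c])
```

The last cumulative frequency always equals the total number of observations.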

Page 36: Data sampling and probability

The frequencies start low, then increase to higher frequencies until reaching a maximum, and then decrease to low again.

The distribution is approximately symmetric, with frequencies preceding the maximum being roughly a mirror image of those that follow the maximum.

Page 37: Data sampling and probability

Numerical in nature

Consists of numbers representing counts or measurements

Have a unit and can be used arithmetically

Quantitative data can be further described by distinguishing between discrete and continuous types.

Examples:

• The weights of supermodels

• The ages of respondents

Page 38: Data sampling and probability

Discrete data: the number of possible values is either finite or ‘countable’ (i.e. the possible values can be listed as 0, 1, 2, 3, . . .).

Example:

The number of eggs that a hen lays

Page 39: Data sampling and probability

Continuous data: infinitely many possible values that correspond to some continuous scale that covers a range of values without gaps, interruptions, or jumps

Example:

The amount of milk that a cow produces, e.g. 2.343115 gallons per day

Page 40: Data sampling and probability

Categorical (qualitative) data: consists of names or labels (representing categories)

Example:

• The gender (male/female) of professional athletes.

• Shirt numbers on professional athletes' uniforms, which are substitutes for names.

Page 41: Data sampling and probability

• Uses bars of equal width to show frequencies of categorical, or qualitative, data

• Vertical scale represents frequencies or relative frequencies.

• Horizontal scale identifies the different categories of qualitative data.

Page 42: Data sampling and probability

A multiple bar graph has two or more sets of bars and is used to

compare two or more data sets.

Page 43: Data sampling and probability

Pareto chart: a bar graph for qualitative data, with the bars arranged in descending order according to frequencies

Page 44: Data sampling and probability

Pie chart: a graph depicting qualitative data as slices of a circle, in which the size of each slice is proportional to the frequency count

Page 45: Data sampling and probability

a variable (typically represented by 𝑥) that has a single numerical value, determined by chance, for each outcome of a given procedure

Can be discrete or continuous – just like data

Page 46: Data sampling and probability

A Discrete Random Variable has either a finite number of values or a countable number of values, where “countable” refers to the fact that there might be infinitely many values, but that they result from a counting process.

Continuous Random Variable has infinitely many values, and those values can be associated with measurements on a continuous scale without gaps or interruptions.

Page 47: Data sampling and probability

a description that gives the probability for each value of the random variable

often expressed in the format of a graph, table, or formula

Note:

If a probability is very small, it is represented as 0+ in tables

(i.e. it is very small, yet positive)

Page 48: Data sampling and probability

1. There is a numerical random variable x and its values are associated with corresponding probabilities.

2. The sum of all probabilities must be 1: $\sum P(x) = 1$

3. Each probability value must be between 0 and 1 inclusive: $0 \le P(x) \le 1$

Page 49: Data sampling and probability

The probability histogram is very similar to a relative frequency histogram, but the vertical scale shows probabilities.

Page 50: Data sampling and probability

According to the range rule of thumb, most values should lie within 2 standard deviations of the mean.

We can therefore identify “unusual” values by determining if they lie outside these limits:

Maximum usual value = $\mu + 2\sigma$

Minimum usual value = $\mu - 2\sigma$

Page 51: Data sampling and probability

We found for families with two children, the mean number of girls is 1.0 and the standard deviation is 0.7 girls.

Use those values to find the maximum and minimum usual values for the number of girls.

Solution:

maximum usual value = $\mu + 2\sigma = 1.0 + 2(0.7) = 2.4$

minimum usual value = $\mu - 2\sigma = 1.0 - 2(0.7) = -0.4$
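The mean, standard deviation, and usual-value limits can be recomputed from a probability distribution. The sketch below assumes, as an illustration, that boys and girls are equally likely and births are independent, which gives P(0) = 0.25, P(1) = 0.50, P(2) = 0.25 for the number of girls in two births. That table is an assumption of this example and is not shown in the slides.

```python
from math import sqrt

# Assumed distribution of x = number of girls among two children.
distribution = {0: 0.25, 1: 0.50, 2: 0.25}

# Requirements for a probability distribution.
valid = (
    all(0 <= p <= 1 for p in distribution.values())
    and abs(sum(distribution.values()) - 1) < 1e-12
)

# Mean and standard deviation of a discrete random variable.
mean = sum(x * p for x, p in distribution.items())
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
sd = sqrt(variance)

# Range rule of thumb: usual values lie within 2 standard deviations of the mean.
max_usual = mean + 2 * sd
min_usual = mean - 2 * sd

print(valid, round(mean, 1), round(sd, 1))
print(round(max_usual, 1), round(min_usual, 1))
```

The unrounded standard deviation is about 0.707; rounding it to 0.7 first, as the slides do, leads to the same limits to one decimal place.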

Page 52: Data sampling and probability

Rare Event Rule for Inferential Statistics

If, under a given assumption (such as the assumption that a coin is fair), the probability of a particular observed event (such as 992 heads in 1000 tosses of a coin) is extremely small, we conclude that the assumption is probably not correct.

Page 53: Data sampling and probability

Using Probabilities to Determine When Results Are Unusual

Unusually high number of successes: x successes among n trials is an unusually high number of successes if $P(x \text{ or more}) \le 0.05$.

Unusually low number of successes: x successes among n trials is an unusually low number of successes if $P(x \text{ or fewer}) \le 0.05$.
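One way to apply these criteria is with exact binomial tail probabilities. The sketch below is an illustration under the assumption that trials are independent with a fixed success probability p (a binomial model, which the slides do not state explicitly); the function names are hypothetical. It uses the coin example from the Rare Event Rule.

```python
from fractions import Fraction
from math import comb

def prob_at_least(x, n, p):
    """Exact P(x or more successes) in n independent trials with success probability p."""
    p = Fraction(p)
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))

def prob_at_most(x, n, p):
    """Exact P(x or fewer successes) in n independent trials with success probability p."""
    p = Fraction(p)
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(0, x + 1))

fair = Fraction(1, 2)

# 992 heads in 1000 tosses of a coin assumed to be fair.
p_992_or_more = prob_at_least(992, 1000, fair)
print(float(p_992_or_more) <= 0.05)            # unusually high

# Smaller case: 8 or more heads in 10 tosses.
print(float(prob_at_least(8, 10, fair)))       # 56/1024, just above 0.05
```

Exact fractions avoid floating-point underflow when n is large; the result is converted to a float only for the comparison with 0.05.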

Page 54: Data sampling and probability

A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:

1. The total area under the curve must equal 1.

2. Every point on the curve must have a vertical height that is 0 or greater. (That is, the curve cannot fall below the x-axis.)

Page 55: Data sampling and probability

Because the total area under the density curve is equal to 1, there is a correspondence between area and probability.

Page 56: Data sampling and probability

A continuous random variable has a uniform distribution if its values are spread evenly over the range of possibilities. The graph of a uniform distribution results in a rectangular shape.

Page 57: Data sampling and probability

Given the uniform distribution illustrated, find the probability that a randomly selected voltage level is greater than 124.5 volts.

[Figure: uniform density curve. The shaded area represents voltage levels greater than 124.5 volts.]
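For a uniform distribution, a probability is just the area of a rectangle. The figure with the actual range of voltages is not reproduced in this transcript, so the sketch below assumes, purely for illustration, that voltages are uniform between 123.0 and 125.0 volts. Those endpoints and the function name are hypothetical.

```python
def uniform_prob_greater(x, low, high):
    """P(X > x) for X uniform on [low, high]: width of the shaded strip times the height."""
    height = 1 / (high - low)              # total area under the density must equal 1
    width = max(0.0, min(high, high) - max(x, low))
    return width * height

# Hypothetical endpoints; the transcript does not give the real ones.
low, high = 123.0, 125.0
p = uniform_prob_greater(124.5, low, high)
print(p)
```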

Page 58: Data sampling and probability

A continuous R.V. has a normal distribution if it has a graph that is symmetric and bell-shaped and if the R.V. can be described by the following equation:

$$f(x) = \frac{e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}}{\sigma\sqrt{2\pi}}$$
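The normal density, $f(x) = e^{-\frac{1}{2}((x-\mu)/\sigma)^2} / (\sigma\sqrt{2\pi})$, can be evaluated numerically to confirm that it behaves like a density curve. This Python sketch uses an illustrative function name and a simple Riemann sum over the interval from -8 to 8 (an arbitrary choice wide enough to capture essentially all of the area).

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal density curve at x."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

# Approximate the total area under the standard normal curve.
n_steps = 16000
step = 16 / n_steps
area = sum(normal_pdf(-8 + i * step) * step for i in range(n_steps))

print(normal_pdf(0))                                   # peak height of the standard normal curve
print(round(area, 6))                                  # total area under the curve
print(normal_pdf(1.3, 1, 2), NormalDist(1, 2).pdf(1.3))  # agrees with the standard library
```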

Page 59: Data sampling and probability

The standard normal distribution is a normal probability distribution with μ = 0 and σ = 1. The total area under its density curve is equal to 1.

Page 60: Data sampling and probability

Represents how much a given value, 𝑥, deviates/varies from the center of a set of data

This value can help to assess how “extreme” a particular data value is based on the distribution the value is supposed to follow

This score can also be used to convert sample data (sample statistics) to a measure of relative standing so that we can compare samples to one another.

Basic “Idea” Behind Formulas for Z-Scores:

$$Z = \frac{\text{the value} - \text{the center of the data}}{\text{a deviation measure}}$$

Page 61: Data sampling and probability

If the z-score is positive (+), the specific value falls above the center value.

If the z-score is negative (-), the specific value falls below the center value.

“Usual” values have z-scores between -2 and 2.

“Unusual” values have z-scores less than -2 or greater than 2.
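The z-score idea and the usual/unusual cutoffs can be written as two small functions. This is a sketch with hypothetical function names; the example reuses the mean (1.0) and standard deviation (0.7) for the number of girls in families with two children.

```python
def z_score(value, center, deviation):
    """How many deviation units the value lies above (+) or below (-) the center."""
    return (value - center) / deviation

def is_unusual(z):
    """Unusual values have z-scores less than -2 or greater than 2."""
    return z < -2 or z > 2

z = z_score(2, 1.0, 0.7)   # two girls among two children
print(round(z, 2), is_unusual(z))
```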

Page 62: Data sampling and probability

We can find areas (probabilities) for different regions under a normal model using StatCrunch.

Page 63: Data sampling and probability

A bone mineral density test can be helpful in identifying the presence of osteoporosis.

The result of the test is commonly measured as a z score, which has a normal distribution with a mean of 0 and a standard deviation of 1.

A randomly selected adult undergoes a bone density test.

Find the probability that the result is a reading less than 1.27.

Page 64: Data sampling and probability

The probability of a randomly selected adult having a bone density reading less than 1.27 is 0.8980.

$P(z < 1.27) = 0.8980$

Page 65: Data sampling and probability

Using the same bone density test, find the probability that a randomly selected person has a result above –1.00 (which is considered to be in the “normal” range of bone density readings).

The probability of a randomly selected adult having a bone density above –1 is 0.8413.

Page 66: Data sampling and probability

A bone density reading between –1.00 and –2.50 indicates the subject has osteopenia. Find this probability.

The probability of a randomly selected adult having osteopenia is 0.1525.
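The slides find these areas with StatCrunch. As an alternative illustration, the three bone density probabilities can be approximated with `statistics.NormalDist` from the Python standard library (Python 3.8 and later). Software values may differ in the fourth decimal place from values obtained by subtracting rounded table entries.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)   # standard normal distribution

p_below = z.cdf(1.27)                       # P(z < 1.27)
p_above = 1 - z.cdf(-1.00)                  # P(z > -1.00)
p_between = z.cdf(-1.00) - z.cdf(-2.50)     # P(-2.50 < z < -1.00)

print(round(p_below, 4))
print(round(p_above, 4))
print(round(p_between, 4))
```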

Page 67: Data sampling and probability

$P(a < z < b)$ denotes the probability that the z score is between a and b.

$P(z > a)$ denotes the probability that the z score is greater than a.

$P(z < a)$ denotes the probability that the z score is less than a.

Page 68: Data sampling and probability

Finding the 95th Percentile

The 95th percentile is the z score with an area of 5% (or 0.05) to its right: z = 1.645. (The z score will be positive.)

Page 69: Data sampling and probability

Using the same bone density test, find the bone density score that separates the bottom 2.5% and the score that separates the top 2.5%.

Page 70: Data sampling and probability

For the standard normal distribution, a critical value is a z score separating unlikely values from those that are likely to occur.

Notation:

The expression $z_\alpha$ denotes the z score with an area of $\alpha$ to its right.

Page 71: Data sampling and probability

Find the value of $z_{0.025}$.

The notation $z_{0.025}$ is used to represent the z score with an area of 0.025 to its right.

Referring back to the bone density example, $z_{0.025} = 1.96$.
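Going from an area back to a z score uses the inverse of the cumulative distribution. A minimal sketch with `statistics.NormalDist.inv_cdf` (Python 3.8 and later) follows; the variable names are illustrative. Because $z_\alpha$ has area $\alpha$ to its right, it has area $1 - \alpha$ to its left.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

z_95th = z.inv_cdf(0.95)        # 95th percentile: 5% of the area lies to the right
z_top = z.inv_cdf(1 - 0.025)    # critical value with area 0.025 to its right
z_bottom = z.inv_cdf(0.025)     # score separating the bottom 2.5%

print(round(z_95th, 3))
print(round(z_top, 2), round(z_bottom, 2))
```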

Page 72: Data sampling and probability

• Complete HW1 and HW2 on MLP