William B. Vogt, 45-733: Lecture 1 (11/03/22)

Post on 19-Dec-2015


TRANSCRIPT

  • Slide 1
  • 6/4/2015 William B. Vogt 1 45-733: lecture 1
  • Slide 2
  • Topics: administrative matters; what is statistics and why should you care (chapter 1); presenting data (chapter 2).
  • Slide 3
  • Administrative: Instructor. Bill Vogt. Email: [email protected]@andrew.cmu.edu. Office: Hamburg Hall, 2116D. Office phone: (412) 268-1843. Office hours: Tuesday and Thursday 5-6pm, or by appointment.
  • Slide 4
  • Administrative: Grading. Homework, midterm, and final have equal weight. Cooperation: unlimited cooperation on homework; exams are open book, open notes, etc.; no cooperation on exams.
  • Slide 5
  • Administrative: Software. Everything may be done with Excel; you may use any software you like. Office hours: Tuesday 5-6pm; Thursday 5-6pm; others by appointment.
  • Slide 6
  • Administrative: Web site: http://www.andrew.cmu.edu/course/45-733/index.htm (lecture PowerPoint slides are available by clicking on the relevant date's topic). Homework: distributed via the web site, solutions also; due in class according to the syllabus; returned in class at the next meeting.
  • Slide 7
  • Administrative: Special class meeting. February 19, 8-9:50pm, GSIA 152; sections M and F meet simultaneously. Office hours: Tuesday 5-6pm; Thursday 5-6pm; others by appointment.
  • Slide 8
  • What is statistics? Systematic methods to analyze and present numerical information; or, a systematic way of discussing both our knowledge and our ignorance arising from numerical information.
  • Slide 9
  • Who cares about statistics? Increasing relevance of numerical data. Importance of correctly assessing information: use numerical data to construct good estimates; know the limitations of the estimates.
  • Slide 10
  • What is statistics: Systematic. "Most Americans like our product" vs. "59% ± 0.6% prefer our product to our leading competitors'". "The economy will contract" vs. "GDP will contract by 2.2% ± 0.4%". "Women are more likely to buy our product" vs. "Women are 15% ± 3% more likely to buy our product than are men".
  • Slide 11
  • What is statistics: Analyze. We want to know about a population. Real populations: population of people in the US; population of our customers; population of units off of our production line. Imaginary populations: population of ways the economy might work; population of ways consumers might react to our new product.
  • Slide 12
  • What is statistics: Analyze. What we want to know about a population: how big or small is some quantity? Examples: average income; % who approve of G.W. Bush; market share of Intel in x86 PC processors.
  • Slide 13
  • What is statistics: Analyze. What we want to know about a population: does a quantity differ between groups? Examples: average income of Northern vs. Southern households; average income of Target vs. Walmart shoppers; market share of Intel in desktop vs. mobile x86 PC processors.
  • Slide 14
  • What is statistics: Analyze. What we want to know about a population: how are two or more variables related? As income rises, how much does consumption of Starbucks coffee rise? As family size rises, how do sensitivities to price and advertising change? As people age, how does their sensitivity to advertising change?
  • Slide 15
  • What is statistics: Analyze. How can we know these things? Collect a census: accurate information on the whole population. Ask everyone in the US their income; audit their answers carefully. This is always expensive and often impossible. Imaginary populations? Parallel universes?
  • Slide 16
  • What is statistics: Analyze. How can we know these things? Collect a sample. Sample: a few members of a population. Accurate information on the sample: ask 100 people in the US their income; audit their answers carefully? But knowing all about the sample ≠ knowing all about the population!
  • Slide 17
  • What is statistics: Analyze. Going from sample to population, we hope for: a good description of the sample; an estimate (informed guess) of what we want to know about the population; a statement about how far off our estimate might be.
  • Slide 18
  • What is statistics: Analyze. Population and sample, an example. What is the average household income in the US? Phone survey of 100 completed households, asking their incomes. Sample = {$50K, $23K, …, $180K}. Average sampled income is, say, $53K. My estimate of US average household income is $53K, and I am 95% sure that it is in the range $53K ± $3K.
  • Slide 19
  • What is statistics? Description of a sample. Analysis: estimation of quantities of interest for a population (levels, differences, relationships). Analysis: statement of how far off the estimates might be.
  • Slide 20
  • Data description. The topic of chapter 2 is describing data. This is the part of the definition of statistics in which we describe our data. This is also the part of our goals in sampling in which we describe our sample accurately. The topic of the rest of the book/course will be analysis.
  • Slide 21
  • Data Description: Population and Sample. Population: all of the relevant people/units you are interested in. For example: population of people in the US; population of our customers; population of light bulbs from our production facility.
  • Slide 22
  • Data Description: Population and Sample. Sample: a subset of a population; only a few of the units we are interested in. For example: survey of 100 people in the US; survey of 5 of our customers; 1 in 100,000 of the light bulbs from our production facility.
  • Slide 23
  • Data Description: Dataset. A dataset is just a group of numbers measuring something. It could be a population; it could be a sample. Examples: a list of all the incomes of all the households in the US; a list of the incomes of 5 of our customers; a list of the time to failure of 500 light bulbs from our production facility.
  • Slide 24
  • Data Description: Dataset. Notation: Dataset = {4,5,12,3,…,0}; Dataset = $\{x_1, x_2, x_3, \ldots, x_N\}$. $x_i$ = any arbitrary one of the elements in our dataset. $N$ = the number of elements (observations) in our dataset.
  • Slide 25
  • Data Description: Dataset. Notation example: Dataset = {4,5,12,3}; $x_1 = 4$; $x_3 = 12$; $N = 4$.
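The notation example above can be mirrored in code. This is only an illustrative sketch: the helper `x` is a hypothetical name, introduced here to bridge the slides' 1-based subscripts and Python's 0-based list indexing.

```python
# The slide's example dataset.
dataset = [4, 5, 12, 3]


def x(i, data=dataset):
    """Return x_i using the slides' 1-based numbering (hypothetical helper)."""
    return data[i - 1]  # Python lists start at index 0, so shift by one


N = len(dataset)  # number of observations

print(x(1), x(3), N)
```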
  • Slide 26
  • Data Description: Measures of central tendency. Measures of central tendency measure where the middle of the data is. They are useful if you want to know what an average or typical member of your sample/population looks like.
  • Slide 27
  • Data Description: Measures of central tendency. Mean: also known as the average. Calculated by adding up all the observations and dividing by the number of observations: $\bar{x} = (x_1 + x_2 + \cdots + x_N)/N$.
  • Slide 28
  • Data Description: Measures of central tendency. Mean: can also be written $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$.
  • Slide 29
  • Data Description: Measures of central tendency. Mean example: Dataset = {53,45,23,19,87}. Mean = (53+45+23+19+87)/5 = 45.4.
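The mean calculation can be checked with a few lines of Python. This is a minimal sketch rather than course material (the course itself suggests Excel); the variable names are illustrative.

```python
from statistics import mean

dataset = [53, 45, 23, 19, 87]

# Add up all the observations and divide by the number of observations.
by_hand = sum(dataset) / len(dataset)

# The standard library's mean() performs the same calculation.
from_library = mean(dataset)

print(by_hand, from_library)
```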
  • Slide 30
  • Data Description: Measures of central tendency. Median: also known as the 50th percentile. It is the point in the data where half of the observations are greater and half are lesser. Calculated by sorting the data and choosing the middle value.
  • Slide 31
  • Data Description: Measures of central tendency. Median example: Dataset = {53,45,23,19,87}. Dataset sorted = {19,23,45,53,87}. Median = 45.
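The sort-and-pick-the-middle recipe for the median can be sketched as follows. The direct indexing assumes an odd number of observations, as in the example; the library call handles both odd and even cases.

```python
from statistics import median

dataset = [53, 45, 23, 19, 87]

# Sort the data, then choose the middle value.
ordered = sorted(dataset)
middle = ordered[len(ordered) // 2]  # valid as written only when N is odd

# statistics.median() also averages the two middle values when N is even.
from_library = median(dataset)

print(ordered, middle, from_library)
```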
  • Slide 33
  • Data Description: Measures of central tendency. Percentiles: the 25th percentile is the point at which 25% of the data are lesser and 75% of the data are greater. The 75th percentile is the point at which 75% of the data are lesser and 25% of the data are greater. The Yth percentile is the point at which Y% of the data are lesser and (100-Y)% of the data are greater. The median is the 50th percentile.
  • Slide 34
  • Data Description: Measures of central tendency. Percentiles, calculation: sort the dataset. The 25th percentile is observation (N+1)/4, or 0.25*(N+1). The 75th percentile is observation 3*(N+1)/4, or 0.75*(N+1). The Yth percentile is observation (Y/100)*(N+1). Use interpolation if (Y/100)*(N+1) is not a whole number.
  • Slide 35
  • Data Description: Measures of central tendency. Percentiles, calculation: use interpolation if (Y/100)*(N+1) is not a whole number. If there are 10 observations and you want the 44th percentile, 0.44*(10+1) = 4.84. So the 44th percentile will be the number 84% of the way between observations 4 and 5: 44th percentile = $0.16\,x_4 + 0.84\,x_5$.
  • Slide 36
  • Data Description: Measures of central tendency. Percentiles example: Dataset = {45,23,110,19,87,36,100}. Sorted dataset = {19,23,36,45,87,100,110}. 25th percentile = 23. 75th percentile = 100. 50th percentile (median) = 45.
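The (Y/100)*(N+1) rule with interpolation can be sketched as a small function. The name `percentile` and the clamping at the ends of the data are assumptions made for this illustration; the slides do not say what to do when the position falls outside 1..N. The ten-value list `ten_obs` is hypothetical data used only to exercise the 44th-percentile case.

```python
def percentile(data, y):
    """Y-th percentile via the (Y/100)*(N+1) position rule (illustrative sketch)."""
    ordered = sorted(data)
    n = len(ordered)
    position = (y / 100) * (n + 1)  # 1-based position in the sorted data
    lower = int(position)           # observation just below the position
    fraction = position - lower     # how far to move toward the next observation
    # Assumption: clamp to the smallest/largest value outside the range 1..N.
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


dataset = [45, 23, 110, 19, 87, 36, 100]
p25, p50, p75 = (percentile(dataset, y) for y in (25, 50, 75))
print(p25, p50, p75)

# Hypothetical 10-observation dataset: position 0.44 * 11 = 4.84,
# so the result lies 84% of the way from observation 4 to observation 5.
ten_obs = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
p44 = percentile(ten_obs, 44)
print(p44)
```

Note that spreadsheet and library percentile functions may use different position rules, so their results can differ slightly from this one on small datasets.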
  • Slide 38
  • Data Description: Measures of central tendency. Mode: the most common value in the dataset. Might think of it as the most typical value. Example: Dataset = {53,45,45,23,19,87,100}; Mode = 45. Example: Dataset = {53,45,45,23,19,87,87}; Mode = 45 and 87 (the data are bimodal).
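Both mode examples can be reproduced with Python's standard library. This sketch assumes Python 3.8 or later, where `statistics.multimode` is available; it returns every value tied for most common, so it covers the bimodal case too.

```python
from statistics import multimode

one_mode = [53, 45, 45, 23, 19, 87, 100]
two_modes = [53, 45, 45, 23, 19, 87, 87]

# multimode() returns a list of all values that occur most often.
modes_a = multimode(one_mode)
modes_b = multimode(two_modes)

print(modes_a)
print(modes_b)
```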
  • Slide 39
  • Data Description: Measures of dispersion. Measures of dispersion tell us how spread out our data are. Compare: Dataset 1 = {53,45,23,19,87}; Dataset 2 = {44,47,43,45,48}. Both have a mean of 45.4. Dataset 1 is more spread out, however.
  • Slide 40
  • Data Description: Measures of dispersion. Let's display the datasets graphically. [Figure: DS1 and DS2 plotted on a number line marked at 19, 45, and 87.]
  • Slide 41
  • Data Description: Measures of dispersion. A good measure of dispersion will, for example, be bigger for DS1 than for DS2. [Figure: DS1 and DS2 plotted on a number line marked at 19, 45, and 87.]
  • Slide 42
  • Data Description: Measures of dispersion. Average deviation: one way to think about dispersion is to ask how far, on average, points are from the mean. Call $d_i$ the deviation from the mean: $d_i = x_i - \bar{x}$, where $\bar{x}$ is the mean of the dataset.
  • Slide 43
  • Data Description: Measures of dispersion. Average deviation: Dataset 1 = {53,45,23,19,87}. Deviations 1 = {7.6,-0.4,-22.4,-26.4,41.6}. Maybe the average of the $d_i$ will be a good measure of dispersion: (7.6-0.4-22.4-26.4+41.6)/5 = 0. Hmmm.
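The zero above is not a coincidence of this dataset. Writing $\bar{x}$ for the mean and $d_i = x_i - \bar{x}$ for each deviation (notation assumed here for illustration), a short derivation shows the average deviation is zero for any dataset:

```latex
\frac{1}{N}\sum_{i=1}^{N} d_i
  = \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \bar{x}\right)
  = \frac{1}{N}\sum_{i=1}^{N} x_i - \frac{1}{N}\cdot N\bar{x}
  = \bar{x} - \bar{x}
  = 0
```

Positive and negative deviations always cancel exactly, which is why a useful measure of dispersion has to remove the signs first.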
  • Slide 48
  • Data Description: Measures of dispersion. Mean absolute deviation example: Dataset 2 = {44,47,43,45,48}. Deviations 2 = {-1.4,1.6,-2.4,-0.4,2.6}. Absolute Dev 2 = {1.4,1.6,2.4,0.4,2.6}. MAD = (1.4+1.6+2.4+0.4+2.6)/5 = 1.68.
  • Slide 49
  • Data Description: Measures of dispersion. Mean absolute deviation. [Figure: DS1 and DS2 plotted on a number line marked at 19, 45, and 87.] DS1: MAD = 19.68. DS2: MAD = 1.68.
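The two MAD values can be reproduced with a short function that follows the worked example: take each deviation from the mean, drop its sign, and average. The function name `mean_absolute_deviation` is an illustrative choice, not something from the course.

```python
def mean_absolute_deviation(data):
    """Average distance of the observations from their mean (illustrative sketch)."""
    m = sum(data) / len(data)
    # abs() discards the sign so negative deviations no longer cancel positive ones.
    return sum(abs(x - m) for x in data) / len(data)


ds1 = [53, 45, 23, 19, 87]
ds2 = [44, 47, 43, 45, 48]

mad1 = mean_absolute_deviation(ds1)
mad2 = mean_absolute_deviation(ds2)

# Rounded for display; floating-point arithmetic may leave tiny residues.
print(round(mad1, 2), round(mad2, 2))
```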
  • Slide 50
  • Data Description: Measures of dispersion. Variance: solves the problem of negative deviations in a different way. Variance calls for the deviations to be squared: $\text{Variance} = \frac{1}{N}\sum_{i=1}^{N} d_i^2$, where $d_i$ is the deviation of observation $i$ from the mean.
  • Slide 51
  • Data Description: Measures of dispersion. Variance example: Dataset 1 = {53,45,23,19,87}. Deviations 1 = {7.6,-0.4,-22.4,-26.4,41.6}. Squared Dev 1 = {57.76,0.16,501.76,696.96,1730.56}. Variance = (57.76+0.16+501.76+696.96+1730.56)/5 = 597.44.
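The variance example can be checked the same way. This sketch divides by N, matching the slide's arithmetic; in Python's standard library that corresponds to `statistics.pvariance`, whereas `statistics.variance` divides by N-1 and would give a different number.

```python
from statistics import pvariance

ds1 = [53, 45, 23, 19, 87]

# Square each deviation from the mean, then average over all N observations.
m = sum(ds1) / len(ds1)
squared_deviations = [(x - m) ** 2 for x in ds1]
by_hand = sum(squared_deviations) / len(ds1)

# pvariance() is the divide-by-N ("population") variance.
from_library = pvariance(ds1)

print(round(by_hand, 2), round(from_library, 2))
```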