Overview of Univariate Samples
Geog 210C
Introduction to Spatial Data Analysis
Chris Funk
Lecture 2
C. Funk Geog 210C Spring 2011
Examples of Filtering
[Figure: time series plot labeled "No Smoothing"; x-axis 0 to 200, y-axis -4 to 4]
Random Data?
[Figure: time series plot labeled "Random Data"; x-axis 0 to 200, y-axis -4 to 5]
Two Temperature Time-series
[Figure: two time series, Western Pac T and Global GISS T; x-axis 1880 to 2020, y-axis -0.8 to 1]
Temperature Time-series as Z-scores
[Figure: two z-score time series, Western Pac T-Z and Global GISS T-Z; x-axis 1880 to 2020, y-axis -4 to 4]
Some Definitions (1)
Statistical population versus sample:
Population: total set of elements/measurements that could be (hypothetically) observed in a study, e.g., all U.S. college students
Sample: subset of elements/measurements from population, e.g., college students in western U.S.
Population variables:
Characteristics that describe a population, e.g., age or height of all college students in the U.S.
Population parameters versus sample statistics:
Parameters: summary measures that describe a population variable, e.g., average age of college students in the U.S.
Statistics: summary measures that describe a sample variable, e.g., average age of college students in western U.S.
Some Definitions (2)
Statistical sampling:
Procedure of getting a representative sample of a population, e.g., a random visit of all U.S. colleges
Random sample = sample in which every individual in the population has the same chance of being included
Descriptive statistics:
Procedure of determining sample statistics, e.g., determination of the average student age of all randomly visited colleges
Statistical inference:
Procedure of making statements regarding population parameters from sample statistics, e.g., average student age of all randomly visited colleges = average age of college students in the U.S.?
Statistical estimate:
Best (educated) guess about the value of a population parameter
Hypothesis testing:
Procedure of determining whether sample data support a hypothesis that specifies the value (or range of values) of a certain population parameter
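The distinction between a population parameter and a sample statistic can be illustrated with a minimal sketch. The population below is entirely made up (random ages between 17 and 30), and the sample size of 200 is an arbitrary assumption.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 10,000 students (made-up values).
population = [random.randint(17, 30) for _ in range(10_000)]

# Simple random sample: every individual has the same chance of inclusion.
sample = random.sample(population, k=200)

mu = statistics.fmean(population)   # population parameter
m = statistics.fmean(sample)        # sample statistic, an estimate of mu
print(f"population mean = {mu:.2f}, sample mean = {m:.2f}")
```

Statistical inference is then the step of making a statement about `mu` when only `m` is available.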
Some Notation
Sum of n values (outcomes) from a variable X:
Sum of linear combination of n pairs of values (one set belongs to variable X, the other to variable Y ):
Sum of product of n values of two variables X and Y :
Sum of a constant k:
Sum of n values of a variable X each raised to a power w:
Product of n values of a variable X:
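In standard summation and product notation, the six quantities listed above can be written as follows. The constants a and b in the linear combination are assumed names, chosen only for illustration.

```latex
\begin{align*}
\sum_{i=1}^{n} x_i &= x_1 + x_2 + \dots + x_n \\
\sum_{i=1}^{n} (a x_i + b y_i) &= a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i \\
\sum_{i=1}^{n} x_i y_i &= x_1 y_1 + x_2 y_2 + \dots + x_n y_n \\
\sum_{i=1}^{n} k &= n k \\
\sum_{i=1}^{n} x_i^{w} &= x_1^{w} + x_2^{w} + \dots + x_n^{w} \\
\prod_{i=1}^{n} x_i &= x_1 \cdot x_2 \cdots x_n
\end{align*}
```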
Sample Histogram
Setting: Consider 10 hypothetical sample values:
Definition: bar graph of the number of sample values (counts) falling within a set of classes (bins)
Estimated relative frequency table:
Histogram shape depends on the number and width of bins: use non-overlapping, equal intervals with simple bounds
Rule of thumb for number of classes: 5 × log10(number of data)
For a density histogram, total area of bars = 1
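A minimal sketch of the histogram construction, assuming ten made-up sample values and bounds of 2 and 10 chosen by hand as "simple bounds". It applies the rule of thumb for the number of bins and shows that the bar areas of the density histogram sum to 1.

```python
import math

# Ten made-up sample values (illustrative only).
data = [2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.5, 6.1, 7.3, 9.7]

n = len(data)
n_bins = max(1, round(5 * math.log10(n)))   # rule of thumb: 5 x log10(n)
lo, hi = 2.0, 10.0                           # simple bounds enclosing the data (assumed)
width = (hi - lo) / n_bins                   # equal, non-overlapping intervals

counts = [0] * n_bins
for x in data:
    # Bin index; the top edge is folded into the last bin.
    k = min(int((x - lo) / width), n_bins - 1)
    counts[k] += 1

rel_freq = [c / n for c in counts]           # estimated relative frequencies
density = [f / width for f in rel_freq]      # bar heights of a density histogram

for k in range(n_bins):
    left = lo + k * width
    print(f"[{left:4.1f}, {left + width:4.1f}): count={counts[k]}, "
          f"rel. freq={rel_freq[k]:.1f}, density={density[k]:.4f}")
```

Changing `n_bins` or the bounds changes the apparent shape, which is the dependence on bin number and width noted above.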
Histogram Shape Characteristics
Peaked or not:
Number of peaks:
Symmetric or not:
Sample Cumulative Distribution Function (CDF)
Definition: proportion of sample values less than, or equal to, any given datum value $x_i$
≈ estimated probability that any sample value chosen at random is ≤ $x_i$
Ranked sample data and their estimated relative frequency:
No bin width enters the construction of a CDF
Constructing a Sample CDF
Flowchart:
1. sort the n sample data, x-values, in ascending order
2. construct a set of n probability values, p-values, as:
there are different ways to construct the p-values: the most widely used is actually
this accommodates values below the lowest sample x-value and beyond the largest x-value
3. plot the sorted x-values against the corresponding p-values
Sample CDF = look-up table of sorted x-values versus p-values
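The three steps can be sketched as follows. The data are made up, and the p-values use the plotting position p_i = i/(n + 1), which is an assumption here: it is one common choice that leaves room below the smallest and above the largest sample value, but other conventions exist.

```python
# Made-up sample values (illustrative only).
data = [5.0, 2.1, 9.7, 3.4, 6.1]

x_sorted = sorted(data)                       # step 1: ascending order
n = len(x_sorted)
p = [i / (n + 1) for i in range(1, n + 1)]    # step 2 (assumed): p_i = i / (n + 1)

def sample_cdf(x):
    """Proportion of sample values less than or equal to x (step function, i/n)."""
    return sum(1 for v in x_sorted if v <= x) / n

# Step 3: the look-up table of sorted x-values versus p-values.
for xi, pi in zip(x_sorted, p):
    print(f"x = {xi:4.1f}   p = {pi:.3f}")
```

Note that `sample_cdf` follows the plain definition (proportion of values ≤ x), while the table uses the assumed plotting position; the two differ only in how the endpoints are treated.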
Increasing the Resolution of a Sample CDF
Objective: construct a non-parametric (or is it multi-parametric?) sample CDF, i.e., without fitting any parametric function (e.g., Gaussian) to that CDF
Flowchart:
1. choose a smallest possible x-value, $x_{\min}$, and a largest possible x-value, $x_{\max}$
2. associate probability $p_{\min} = 0$ with $x_{\min}$ and $p_{\max} = 1$ with $x_{\max}$
3. linearly interpolate between pairs of x- and p-values to construct a piecewise linear sample CDF; there are variants on how to interpolate between such pairs
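A minimal sketch of the piecewise linear sample CDF, under stated assumptions: the data and the bounds `x_min` and `x_max` are made up, and the interior p-values again use the assumed plotting position i/(n + 1).

```python
from bisect import bisect_right

# Made-up sample values and assumed bounds (illustrative only).
data = [2.1, 3.4, 5.0, 6.1, 9.7]
x_min, x_max = 0.0, 12.0                      # step 1: smallest/largest possible x

xs = sorted(data)
n = len(xs)
# Knots of the CDF: (x_min, 0), the sample pairs (x_i, i/(n+1)), and (x_max, 1).
knots_x = [x_min] + xs + [x_max]
knots_p = [0.0] + [i / (n + 1) for i in range(1, n + 1)] + [1.0]   # step 2

def cdf(x):
    """Piecewise linear sample CDF (step 3)."""
    if x <= x_min:
        return 0.0
    if x >= x_max:
        return 1.0
    j = bisect_right(knots_x, x) - 1          # segment with knots_x[j] <= x < knots_x[j+1]
    t = (x - knots_x[j]) / (knots_x[j + 1] - knots_x[j])
    return knots_p[j] + t * (knots_p[j + 1] - knots_p[j])

print(cdf(5.0))    # value at a sample point
print(cdf(4.2))    # interpolated value between two sample points
```

Linear interpolation is only one of the possible variants; the choice of `x_min` and `x_max` controls how the tails behave.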
Quantiles
Definition: sample value $x_p$ corresponding to a specific cumulative relative frequency value p
Famous quantiles: min: $x_{0.0}$, lower quartile: $x_{0.25}$, median: $x_{0.5}$, upper quartile: $x_{0.75}$, max: $x_{1.0}$
e.g., the upper quartile is the number (in data units) with 75% of data being less than or equal to this value
Percentiles: $x_{0.01}, x_{0.02}, \ldots, x_{0.98}, x_{0.99}$
Deciles: $x_{0.1}, x_{0.2}, \ldots, x_{0.8}, x_{0.9}$
Quantiles away from the extremes (e.g., median, quartiles) are not sensitive to extreme values (outliers)
The graph above constitutes the sample quantile function, i.e., the inverse of the sample CDF
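The quantile definitions can be tried out with a short sketch. The helper `quantile` is hypothetical and uses one assumed convention (linear interpolation at position p × (n − 1) in the ranked data); statistical packages offer several others. The data are made up.

```python
import math

def quantile(data, p):
    """Sample quantile x_p by linear interpolation between ranked values.

    Assumed convention: position p * (n - 1) in the sorted data, so that
    p = 0 gives the minimum and p = 1 the maximum. Other conventions exist.
    """
    xs = sorted(data)
    pos = p * (len(xs) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

data = [2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.5, 6.1, 7.3, 9.7]   # made-up values
for name, p in [("min", 0.0), ("lower quartile", 0.25), ("median", 0.5),
                ("upper quartile", 0.75), ("max", 1.0)]:
    print(f"{name:>14}: x_{p} = {quantile(data, p):.3f}")

# Replacing the largest value by an extreme outlier leaves the median unchanged.
with_outlier = data[:-1] + [970.0]
print(quantile(data, 0.5), quantile(with_outlier, 0.5))
```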
Measures of Central Tendency
Mid-range: arithmetic average of highest and lowest data: $(x_{\max} + x_{\min})/2$
Mode: most frequently occurring value in data set
Median: datum value that divides data set into two halves; also defined as the 50th percentile: $x_{0.5}$
Mean: arithmetic average of data set
sample mean:
population mean:
Expressed in data units
Also, the sample mean is an estimate of the population mean: $m_x = \hat{\mu}_x$
Most appropriate measure of central tendency depends on distribution shape
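A minimal sketch of the four measures, using made-up data with one deliberately large value to show how differently the measures react to it.

```python
import statistics

data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 30]        # made-up values with one outlier

mid_range = (max(data) + min(data)) / 2        # (x_max + x_min) / 2
mode = statistics.mode(data)                   # most frequent value
median = statistics.median(data)               # 50th percentile
mean = statistics.fmean(data)                  # arithmetic average

print(f"mid-range = {mid_range}, mode = {mode}, median = {median}, mean = {mean}")
```

With this skewed sample the mid-range is pulled furthest toward the outlier and the mean is pulled somewhat, while the median and mode stay with the bulk of the data.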
Measures of Dispersion (1)
Range: difference between highest and lowest data: $x_{\max} - x_{\min}$
Interquartile range: difference between upper and lower quartiles: $x_{0.75} - x_{0.25}$
Mean absolute deviation from mean: average absolute difference between each datum and the mean:
Median absolute deviation from median: median absolute difference between each datum and the median: $\mathrm{median}\,|x_i - x_{0.5}|$
Variance: average squared difference between any datum and the mean
sample variance:
population variance:
the sample variance is an estimate of the population variance
When will we care?
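The dispersion measures above can be computed as in the sketch below. The data are made up; the quartiles use the "inclusive" method of Python's `statistics.quantiles`, and the variance divides by n, both of which are assumptions rather than conventions confirmed by the slides.

```python
import statistics

data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 30]        # made-up values with one outlier
n = len(data)
mean = statistics.fmean(data)
median = statistics.median(data)

value_range = max(data) - min(data)                        # x_max - x_min
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                                              # x_0.75 - x_0.25
mean_abs_dev = sum(abs(x - mean) for x in data) / n        # about the mean
median_abs_dev = statistics.median(abs(x - median) for x in data)
variance = statistics.pvariance(data)                      # divides by n (assumed)

print(value_range, iqr, mean_abs_dev, median_abs_dev, variance)
```

The single outlier dominates the range and the variance, while the interquartile range and the median absolute deviation barely notice it.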
Measures of Dispersion (2)
Variance: alternative definition: difference between the average of the squared data and the squared mean
sample variance:
population variance:
Coefficient of variation: ratio of the standard deviation to the mean
sample coefficient:
population coefficient:
Choosing alternative measures of dispersion:
summary statistics involving squared values are sensitive to outliers
summary statistics based on quantiles are robust to outliers
coefficient of variation: useful for comparing spread of different data sets
Variance is expressed in data units SQUARED
The coefficient of variation is UNIT-LESS
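A short sketch checking that the two variance definitions agree and that the coefficient of variation is unit-less. The data are made up, and the variance is assumed to divide by n (the form for which the alternative definition holds exactly).

```python
import math
import statistics

data = [2.0, 3.0, 3.0, 4.0, 5.0, 5.0, 5.0, 6.0, 7.0, 30.0]   # made-up values
n = len(data)
mean = statistics.fmean(data)

# Definition: average squared difference from the mean (dividing by n, assumed).
var_def = sum((x - mean) ** 2 for x in data) / n
# Alternative: average of the squared data minus the squared mean.
var_alt = sum(x * x for x in data) / n - mean ** 2

cv = math.sqrt(var_def) / mean        # unit-less ratio of std. deviation to mean

# Rescaling the data (e.g., a change of units) scales the variance by the
# factor squared but leaves the coefficient of variation unchanged.
scaled = [100.0 * x for x in data]
cv_scaled = statistics.pstdev(scaled) / statistics.fmean(scaled)

print(var_def, var_alt, cv, cv_scaled)
```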
Centering/Standardizing or Normalizing Data
Normalizing data to zero mean and unit variance (i.e. 1) allows more meaningful comparison of different data sets
Standardization procedure:
1. compute mean $\mu_x$ and standard deviation $\sigma_x$ of data set
2. subtract the mean from each datum: $x_i - \mu_x$
3. divide by the standard deviation: $z_i = (x_i - \mu_x)/\sigma_x$
NOTE – Only applicable for normal data!
Normalization procedure for non-normal data:
1. Fit appropriate distribution
2. Translate data into percentiles
3. Translate percentiles into quantiles from a standard normal distribution ($\mu_x = 0$, $\sigma_x = 1$)
Normalized data are unit free; shape of distribution does not change (e.g., modes remain the same)
EWX Example
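The standardization procedure can be sketched as below. The helper `standardize` and the temperature values are hypothetical, and the standard deviation is assumed to be the population form (dividing by n).

```python
import statistics

def standardize(data):
    """Return z-scores: subtract the mean, then divide by the std. deviation."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)       # population form (divides by n), assumed
    return [(x - mu) / sigma for x in data]

temps_c = [12.0, 15.5, 9.0, 20.0, 17.5]   # made-up temperatures in deg C
temps_f = [c * 9 / 5 + 32 for c in temps_c]

z_c = standardize(temps_c)
z_f = standardize(temps_f)
print([round(z, 3) for z in z_c])
```

The same series expressed in degrees Celsius and Fahrenheit yields identical z-scores, which is what makes standardized series comparable across units.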
An Alternative Histogram Transformation (1)
Objective: transform original data set of x-values into a new data set of z-values with an arbitrary CDF
Flowchart:
1. construct piecewise linear CDF $\hat{F}_X(x_p)$ of x-values and target CDF $F_Z(z_p)$ (with or without interpolation)
2. find quantile $z_p$ of target CDF that corresponds to same quantile $x_p$ of sample CDF
3. Forward transformation: $z_p = F_Z^{-1}(p) = F_Z^{-1}[\hat{F}_X(x_p)]$
An Alternative Histogram Transformation (2)
Transformation characteristics:
one-to-one (bijective)
non-linear and rank preserving; known as histogram matching in digital image processing (histogram equalization when the target is uniform)
can match any target CDF; that target CDF can be another sample CDF or a parametric CDF
Inverse transformation: $x_p = \hat{F}_X^{-1}(p) = \hat{F}_X^{-1}[F_Z(z_p)]$
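The forward and inverse transformations can be sketched with a standard normal target CDF. Several choices here are assumptions made for illustration: the made-up data without ties, the plotting position i/(n + 1) for the sample CDF, the standard normal target, and an inverse that is a plain table look-up without interpolation.

```python
from statistics import NormalDist

target = NormalDist(mu=0.0, sigma=1.0)       # target CDF F_Z: standard normal (assumed choice)

data = [5.0, 2.1, 9.7, 3.4, 6.1]             # made-up x-values, no ties
xs = sorted(data)
n = len(xs)
p_of = {x: (i + 1) / (n + 1) for i, x in enumerate(xs)}   # sample CDF value at each datum

def forward(x):
    """Forward: target quantile at the sample CDF value of x (x must be in the sample)."""
    return target.inv_cdf(p_of[x])

def inverse(z):
    """Inverse: sample value whose tabulated p is nearest to the target CDF at z."""
    p = target.cdf(z)
    return min(xs, key=lambda x: abs(p_of[x] - p))

z = [forward(x) for x in data]
back = [inverse(v) for v in z]
print([round(v, 3) for v in z])
print(back)
```

The z-values keep the ranking of the x-values, and the inverse recovers the original data, which reflects the one-to-one, rank-preserving character of the transformation.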
Quantile-Quantile (Q-Q) Plots
Graph for comparing the shapes of two distributions
Procedure:
1. rank both data sets from smallest to largest value
2. compute quantiles of each data set
3. cross-plot each quantile pair
Example:
Interpretation: a straight plot aligned with the 45° line implies two similar distribution shapes
EWX Example
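The Q-Q procedure can be sketched without any plotting library by computing the quantile pairs directly. The helper `quantile` is hypothetical and uses an assumed interpolation convention; both data sets are made up, with the second a shifted copy of the first.

```python
import math

def quantile(data, p):
    """Sample quantile by linear interpolation between ranked values (assumed convention)."""
    xs = sorted(data)                         # step 1: rank the data
    pos = p * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

# Two made-up data sets; b has the same shape as a, shifted by 1.
a = [2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.5, 6.1, 7.3, 9.7]
b = [x + 1.0 for x in a]

probs = [k / 10 for k in range(1, 10)]        # the deciles x_0.1 ... x_0.9
qq_pairs = [(quantile(a, p), quantile(b, p)) for p in probs]   # steps 2 and 3

for qa, qb in qq_pairs:
    print(f"{qa:6.3f}  {qb:6.3f}")
```

Here every pair lies on a straight line parallel to the 45° line and offset by 1: the shapes match, and the offset reveals the difference in location. Passing `qq_pairs` to any scatter-plot routine would give the usual Q-Q plot.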