Data Screening and Adjustments - UMass Amherst
Post on 27-May-2020
1
Data Screening and Adjustments
Purpose:
• Detect and correct data errors
• Detect and treat missing data
• Detect and handle insufficiently sampled variables (e.g., rare species)
• Conduct transformations and standardizations
• Detect and handle outliers
2
Data Screening for Errors
• Examine summary statistics (e.g., n, mean, min, max) and check for irregularities
[Figure: summary-statistics output annotated "Unrealistic value?" and "Where did all the data go?"]
Action: correct errors in the raw data
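A minimal sketch of this kind of screening, using only the Python standard library. The variable `cover`, its values, and the 0-100 plausible range are hypothetical and chosen only for illustration.

```python
import math

def summarize(values):
    """Return n, missing count, mean, min, and max for one variable."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    n = len(present)
    return {
        "n": n,
        "missing": len(values) - n,
        "mean": sum(present) / n if n else float("nan"),
        "min": min(present) if n else float("nan"),
        "max": max(present) if n else float("nan"),
    }

# Hypothetical percent-cover variable; 250 lies outside the assumed 0-100 range.
cover = [12.0, 35.5, None, 250.0, 8.0]
stats = summarize(cover)
out_of_range = [v for v in cover if v is not None and not 0 <= v <= 100]
print(stats)
print(out_of_range)
```

A maximum above the plausible range, or an n much smaller than expected, points back to the raw data for correction.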
3
Data Screening for Missing Data
• Evaluate the amount and pattern of missing data and take corrective action, if needed
Action: replace with prior knowledge; insert means or medians; use regression to estimate values
E.g., median replacement
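Median replacement can be sketched as follows. The function name `fill_with_median` and the sample values are illustrative assumptions; missing values are represented here as `None`.

```python
from statistics import median

def fill_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate the median from
    m = median(observed)
    return [m if v is None else v for v in values]

# Illustrative variable with two missing records.
abundance = [6.50, 6.50, 4.27, 6.50, None, None]
filled = fill_with_median(abundance)
print(filled)
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it for skewed variables.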
4
Data Screening for Sufficiency
• Check for and drop "insufficient" variables
  ◦ E.g., rare species in community datasets
Sufficiency is the extent to which each variable (e.g., each species' ecological character) is accurately and meaningfully described by the data.
E.g., species with very few records are not likely to be accurately placed in ecological space. You must decide at what level of frequency of occurrence you want to accept the "message" and eliminate species below this level.
5
Data Screening for Sufficiency
• Other issues:
  ◦ Influence of abundant generalists in community datasets
    Abundant generalists define strong dimensions of the data cloud that have no meaningful pattern on them. They can overwhelm the message of rarer species in some types of analysis. You must decide whether to include or exclude these "dominant" species.
  ◦ Variables with too little variation (i.e., no signature)
    Variables with too little variation have no meaningful pattern (or influence) and are therefore unnecessary.
6
Data Screening for Sufficiency
[Figure: species occurrence in a typical community dataset, marking rare species, dominant species, and the 5%, median, and 95% occurrence levels]
7
Data Screening for Sufficiency: Some Rules of Thumb
• Drop "insufficient" variables (species) and conduct sensitivity analysis
  ◦ Rare species (e.g., <5% occurrence)
  ◦ Too little variability (e.g., <5-10% CV)
[Figure annotations: "Too few occurrences?", "Too little variation?"]
8
Data Screening for Sufficiency: Some Rules of Thumb
• Drop "abundant generalist" species and conduct sensitivity analysis
  ◦ Dominant species (e.g., >95% occurrence)
[Figure annotation: "Too ubiquitous?"]
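The occurrence rules of thumb can be sketched as a simple screen on frequency of occurrence. This uses the six-site example matrix from these slides; the function names and the default cutoffs (5% and 95%) are assumptions taken from the rules above, not a prescribed procedure.

```python
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]
SPECIES = ["A", "B", "C", "D", "E", "F"]

def occurrence(matrix):
    """Fraction of sample units (rows) in which each species (column) occurs."""
    n_sites = len(matrix)
    return [sum(1 for x in col if x > 0) / n_sites for col in zip(*matrix)]

def screen(matrix, names, rare=0.05, dominant=0.95):
    """Return (rare species, dominant species) under the given cutoffs."""
    occ = occurrence(matrix)
    rare_spp = [s for s, f in zip(names, occ) if f < rare]
    dominant_spp = [s for s, f in zip(names, occ) if f > dominant]
    return rare_spp, dominant_spp

rare_spp, dominant_spp = screen(RAW, SPECIES)
print(rare_spp, dominant_spp)
```

With only six sites no species falls below 5% occurrence, while species E occurs at every site. Rerunning the later analysis with and without the flagged species is the sensitivity analysis the slide calls for.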
Log transformation:
Site      A     B     C     D     E     F  Total
1      0.30  0.30  0.30  0.60  0.60  0.30   2.41
2      0.48  0.48  0.70  0.85  0.85  0.00   3.34
3      1.04  1.04  1.32  1.49  1.49  0.00   6.39
4      0.60  0.60  0.48  0.30  0.30  0.00   2.28
5      0.00  0.00  0.00  0.00  0.30  0.00   0.30
6      0.00  0.00  0.00  0.00  1.32  0.00   1.32
Total  2.42  2.42  2.80  3.24  4.86  0.30  16.05

Column z-score standardization:
Site       A      B      C      D      E      F  Total
1      -0.44  -0.44  -0.45  -0.31  -0.59   2.04  -0.20
2      -0.18  -0.18  -0.06  -0.06  -0.35  -0.41  -1.23
3       1.94   1.94   2.00   2.00   1.64  -0.41   9.12
4       0.09   0.09  -0.32  -0.49  -0.76  -0.41  -1.80
5      -0.71  -0.71  -0.58  -0.57  -0.76  -0.41  -3.73
6      -0.71  -0.71  -0.58  -0.57   0.82  -0.41  -2.16
Total      0      0      0      0      0      0      0

Raw data matrix:
Site     A    B    C    D    E    F  Total
1        1    1    1    3    3    1     10
2        2    2    4    6    6    0     20
3       10   10   20   30   30    0    100
4        3    3    2    1    1    0     10
5        0    0    0    0    1    0      1
6        0    0    0    0   20    0     20
Total   16   16   27   40   61    1    161
9
Data Transformations & Standardizations
Purpose:
• Statistical
  ◦ Improve assumptions of normality, linearity, homogeneity of variance, etc.
  ◦ Make units of variables comparable when measured on different scales.
• Ecological
  ◦ Make ecological distance measures work better.
  ◦ Reduce effect of total quantity in sample units, to put focus on relative quantities.
  ◦ Equalize (or otherwise alter) the relative importance of variables (e.g., common and rare species).
  ◦ Emphasize informative variables (species) at the expense of uninformative variables (species).
10
Data Transformations & Standardizations
Transformations are applied to each element of the data matrix, independent of the other elements.
Standardizations adjust matrix elements by a row or column standard (e.g., max, sum, etc.).
Log transformation: $b_{ij} = \log(x_{ij} + 1)$
Column z-score standardization: $b_{ij} = (x_{ij} - \bar{x}_j) / s_j$
Binary (presence/absence) transformation:
Site    A   B   C   D   E   F  Total
1       1   1   1   1   1   1      6
2       1   1   1   1   1   0      5
3       1   1   1   1   1   0      5
4       1   1   1   1   1   0      5
5       0   0   0   0   1   0      1
6       0   0   0   0   1   0      1
Total   4   4   4   4   6   1     23
11
Monotonic Transformations
When to Transform?
• To adjust for highly skewed variables
• To better meet assumptions of statistical test (e.g., normality, constant variance, etc.)
• To emphasize presence/absence (nonquantitative) signature
Which Transformation?
• Depends on type of data
• Whichever works best
12
Monotonic Transformations
Binary presence/absence transformation: $b_{ij} = x_{ij}^{0}$ (power), with $0^0$ taken as 0 so that absences stay 0
Acceptable domain of x: all
Range of f(x): 0 and 1 only
• Converts quantitative data into nonquantitative data
• Applicable for species data
• Most useful when there is little quantitative information present
• Can be a severe transformation
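A minimal sketch of the binary transformation on the example matrix. It tests `x > 0` directly rather than raising to the power 0, because Python evaluates `0 ** 0` as 1, which would turn absences into presences.

```python
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

# 1 where the species is present, 0 where it is absent.
binary = [[1 if x > 0 else 0 for x in row] for row in RAW]

row_totals = [sum(row) for row in binary]        # species richness per site
col_totals = [sum(col) for col in zip(*binary)]  # number of sites occupied
print(row_totals, col_totals)
```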
13
Monotonic Transformations
Log transformation: $b_{ij} = \log(x_{ij} + 1)$
Acceptable domain of x: >0
Range of f(x): all
• Compresses high values and spreads low values by expressing values as orders of magnitude
• Useful when there is a high degree of variation, the ratio of largest to smallest value is >10, or the data are highly positively skewed
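A minimal sketch of the log transformation. The base is an assumption: the slides write only "log", but the tabled values for the example matrix (e.g., 1 becomes 0.30) match base 10, so `math.log10` is used here.

```python
import math

RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

# Adding 1 before taking the log keeps zeros at zero: log10(0 + 1) = 0.
logged = [[math.log10(x + 1) for x in row] for row in RAW]

for row in logged:
    print([round(v, 2) for v in row])
```

The raw values span 0 to 30, while the transformed values span 0 to about 1.49, which illustrates the compression of high values.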
14
Monotonic Transformations
Log transformation: $b_{ij} = \log(x_{ij} + 1)$
[Figure: log transformation example]
Square root transformation:
Site      A     B     C      D      E     F  Total
1      1.00  1.00  1.00   1.73   1.73  1.00   7.46
2      1.41  1.41  2.00   2.45   2.45  0.00   9.73
3      3.16  3.16  4.47   5.48   5.48  0.00  21.75
4      1.73  1.73  1.41   1.00   1.00  0.00   6.88
5      0.00  0.00  0.00   0.00   1.00  0.00   1.00
6      0.00  0.00  0.00   0.00   4.47  0.00   4.47
Total  7.31  7.31  8.89  10.66  16.13  1.00  51.29
[Figure: Power Transformations; b plotted against x (0 to 100), with curves labeled p = 1/2, 1/3, 1/4, 1/5, and 1/10]
15
Monotonic Transformations
Square root transformation: $b_{ij} = x_{ij}^{1/2}$ (power)
Acceptable domain of x: ≥0
Range of f(x): ≥0
• Similar in effect to, but less dramatic than, the log transformation
• Often used with count (meristic) data; e.g., when mean equals the variance (Poisson distribution)
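The square root is one member of the power family, so a single helper covers both. In this sketch the helper name `power_transform` and its `exponent` argument are assumptions; the exponent is passed directly (0.5 for the square root).

```python
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def power_transform(matrix, exponent):
    """Raise every element to the given exponent (requires x >= 0)."""
    return [[x ** exponent for x in row] for row in matrix]

root = power_transform(RAW, 0.5)     # square root
fourth = power_transform(RAW, 0.25)  # smaller exponent, stronger compression

print(round(root[2][4], 2), round(fourth[2][4], 2))
```

For the largest raw value (30), the square root gives about 5.48 and the fourth root about 2.34, showing that smaller exponents compress high values more.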
16
Monotonic Transformations
Power family transformation: $b_{ij} = x_{ij}^{1/p}$
Acceptable domain of x: ≥0
Range of f(x): ≥0
• Different exponents change the effect of the transformation; the smaller the exponent, the more compression applied to high values
• Flexible transformation useful for a wide variety of data
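Column total standardization:
Site      A     B     C     D     E     F  Total
1      0.06  0.06  0.04  0.08  0.05  1.00   1.29
2      0.13  0.13  0.15  0.15  0.10  0.00   0.65
3      0.63  0.63  0.74  0.75  0.49  0.00   3.23
4      0.19  0.19  0.07  0.03  0.02  0.00   0.49
5      0.00  0.00  0.00  0.00  0.02  0.00   0.02
6      0.00  0.00  0.00  0.00  0.33  0.00   0.33
Total     1     1     1     1     1     1      6

Arcsine square root transformation (applied to the column-total-standardized values):
Site       A      B     C      D      E     F   Total
1       0.16   0.16  0.12   0.18   0.14  1.00    1.76
2       0.23   0.23  0.25   0.25   0.20  0.00    1.17
3       0.58   0.58  0.66   0.67   0.49  0.00    2.98
4       0.29   0.29  0.18   0.10   0.08  0.00    0.93
5       0.00   0.00  0.00   0.00   0.08  0.00    0.08
6       0.00   0.00  0.00   0.00   0.39  0.00    0.39
Total  1.256  1.256  1.21  1.198  1.392     1  7.3125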
17
Monotonic Transformations
Power family transformation: $b_{ij} = x_{ij}^{1/p}$
18
Monotonic Transformations
Arcsine square root transformation: $b_{ij} = (2/\pi) \sin^{-1}(x_{ij}^{1/2})$
Acceptable domain of x: 0-1
Range of f(x): 0-1
• Spreads ends of the scale while compressing the middle for proportion data
• Useful for proportion data with positive skew (can use arcsine transformation for negative skew)
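A minimal sketch of the arcsine square root transformation for a single proportion. The function name and the range check are assumptions added for illustration; the $2/\pi$ factor rescales the result to the 0-1 range, as in the formula above.

```python
import math

def arcsine_sqrt(p):
    """Arcsine square root of a proportion p in [0, 1], rescaled to [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("arcsine square root is defined for proportions in [0, 1]")
    return (2 / math.pi) * math.asin(math.sqrt(p))

# Proportions from the column-total-standardized example (species A).
for p in (0.0, 1 / 16, 10 / 16, 1.0):
    print(round(p, 4), round(arcsine_sqrt(p), 2))
```

Small proportions are stretched (1/16 = 0.0625 maps to about 0.16), which is the "spreading of the ends" described above.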
19
Monotonic Transformations
Arcsine square root transformation: $b_{ij} = (2/\pi) \sin^{-1}(x_{ij}^{1/2})$
20
Monotonic Transformations: Some Rules of Thumb
• Consider binary (presence/absence) transformation when:
  ◦ percent zeros high (say >50%)
  ◦ number of distinct values low (say <10)
  ◦ beta diversity high (say >5)
• Use a log or square root transformation for "highly" skewed data or data ranging over several (>2) orders of magnitude
• Use arcsine square root transformation for proportion data
• If applied to a related variable set (e.g., species), then use the same transformation (e.g., log) so that all are scaled the same; otherwise, transform independently
Row max standardization:
Site      A     B     C     D     E     F  Total
1      0.33  0.33  0.33  1.00  1.00  0.33   3.33
2      0.33  0.33  0.67  1.00  1.00  0.00   3.33
3      0.33  0.33  0.67  1.00  1.00  0.00   3.33
4      1.00  1.00  0.67  0.33  0.33  0.00   3.33
5      0.00  0.00  0.00  0.00  1.00  0.00   1.00
6      0.00  0.00  0.00  0.00  1.00  0.00   1.00
Total  2.00  2.00  2.33  3.33  5.33  0.33  15.33

Column z-score standardization (these values are consistent with a standard deviation computed with divisor n rather than n - 1):
Site       A      B      C      D      E      F  Total
1      -0.48  -0.48  -0.50  -0.34  -0.65   2.24  -0.22
2      -0.19  -0.19  -0.07  -0.06  -0.38  -0.45  -1.35
3       2.13   2.13   2.19   2.19   1.80  -0.45  10.00
4       0.10   0.10  -0.35  -0.53  -0.83  -0.45  -1.97
5      -0.77  -0.77  -0.64  -0.63  -0.83  -0.45  -4.09
6      -0.77  -0.77  -0.64  -0.63   0.89  -0.45  -2.36
Total      0      0      0      0      0      0      0

Row z-score standardization:
Site       A      B      C      D      E      F  Total
1      -0.71  -0.71  -0.71   1.41   1.41  -0.71   0.00
2      -0.60  -0.60   0.30   1.21   1.21  -1.51   0.00
3      -0.60  -0.60   0.30   1.21   1.21  -1.51   0.00
4       1.21   1.21   0.30  -0.60  -0.60  -1.51   0.00
5      -0.45  -0.45  -0.45  -0.45   2.24  -0.45   0.00
6      -0.45  -0.45  -0.45  -0.45   2.24  -0.45   0.00
Total   -1.6   -1.6   -0.7  2.329  7.695  -6.12      0
21
Standardizations
When to Standardize?
• To place on equal footing highly unequal sample units or variables (species)
• To better represent the patterns of interest
Which Standardization?
• Depends on objective (sample or variable adjustment) and statistical technique (ordination, cluster, etc.)
• Which standard (variance, totals, max, etc.) makes sense?
Example (row max): $b_{ij} = x_{ij} / \max(x_i)$
22
Standardizations
• Standardizations adjust matrix elements by a row or column standard (e.g., max, sum, etc.).
• All standardizations can be applied to either rows or columns (or both)
Column z-score: $b_{ij} = (x_{ij} - \bar{x}_j) / s_j$
Row z-score: $b_{ij} = (x_{ij} - \bar{x}_i) / s_i$
23
Column or Row Standardizations?
Column standardization
• When the principal concern is to adjust for differences (e.g., variances, total abundance, ubiquity) among variables (species) in order to place them on equal footing.
• When the focus is on the profile across sample units.
Row standardization
• When the principal concern is to adjust for differences (e.g., total abundance, diversity) among sample units in order to place them on equal footing.
• When the focus is on the profile within a sample unit.
24
Common Standardizations
• Total: divide by margin total
• Max: divide by margin maximum
• Range: standardize values to range 0-1
• Frequency: divide by margin total and multiply by number of non-zero items, so that the average of non-zero items is 1
• Hellinger: square root of method=total
• Normalization: make margin sums of squares equal 1
• Standardize: scale to zero mean and unit variance (z-scores)
• Chi.square: divide by row sums and square root of column sums, and adjust for square root of matrix total
25
Standardizations
Column z-score standardization: $b_{ij} = (x_{ij} - \bar{x}_j) / s_j$
Acceptable domain of x: all
Range of f(x): all
• Converts data to z-scores (mean = 0, variance = 1)
• Commonly used to place variables on equal footing
• Essential when variables have different scales or units of measurement
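A minimal sketch of the column z-score standardization on the example matrix. It assumes the sample standard deviation (divisor n - 1, as computed by `statistics.stdev`); the guard that maps a zero-variance column to zeros is an added assumption, since such a column has no defined z-score.

```python
from statistics import mean, stdev

RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def column_zscores(matrix):
    """Subtract each column's mean and divide by its sample standard deviation."""
    cols = list(zip(*matrix))
    centers = [mean(c) for c in cols]
    spreads = [stdev(c) for c in cols]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, centers, spreads)]
        for row in matrix
    ]

z = column_zscores(RAW)
for row in z:
    print([round(v, 2) for v in row])
```

Every column of `z` sums to zero, so the species are placed on a common scale regardless of their raw abundances.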
26
Standardizations
Column total standardization: $b_{ij} = x_{ij} / \sum x_j$
Acceptable domain of x: ≥0
Range of f(x): 0-1
• Commonly used with species data to adjust for unequal abundances among species
• Equalizes areas under curves of species response profiles
• Relative abundance profiles of samples depend on species' relative abundances across all sites
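A minimal sketch of the column total standardization. The function name is an assumption, as is the choice to leave an all-zero column as zeros rather than divide by zero.

```python
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def column_total(matrix):
    """Divide each element by its column (species) total."""
    totals = [sum(col) for col in zip(*matrix)]
    return [[x / t if t else 0.0 for x, t in zip(row, totals)] for row in matrix]

ct = column_total(RAW)
print([round(sum(col), 2) for col in zip(*ct)])  # each species now sums to 1
```

Species F, recorded once at a single site, receives the value 1 at that site, the same total weight as the abundant species E.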
Column max standardization:
Site      A     B     C     D     E     F  Total
1      0.10  0.10  0.05  0.10  0.10  1.00   1.45
2      0.20  0.20  0.20  0.20  0.20  0.00   1.00
3      1.00  1.00  1.00  1.00  1.00  0.00   5.00
4      0.30  0.30  0.10  0.03  0.03  0.00   0.77
5      0.00  0.00  0.00  0.00  0.03  0.00   0.03
6      0.00  0.00  0.00  0.00  0.67  0.00   0.67
Total  1.60  1.60  1.35  1.33  2.03  1.00   8.92
[Figure: frequency versus abundance (count) for Species A and Species B, in three panels: unstandardized data, Column Total Standardization, and Column Max Standardization]
27
Standardizations
Column max standardization: $b_{ij} = x_{ij} / \max(x_j)$
Acceptable domain of x: ≥0
Range of f(x): 0-1
• Similar to column total, except:
  ◦ Equalizes heights of peaks of species response curves
  ◦ Based on extreme values, which can introduce noise
  ◦ Can exacerbate importance of rare species
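A minimal sketch of the column max standardization. The function name and the handling of an all-zero column (left as zeros) are assumptions.

```python
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def column_max(matrix):
    """Divide each element by its column (species) maximum."""
    maxima = [max(col) for col in zip(*matrix)]
    return [[x / m if m else 0.0 for x, m in zip(row, maxima)] for row in matrix]

cm = column_max(RAW)
print([max(col) for col in zip(*cm)])  # every species now peaks at 1.0
```

Because only the single largest value per species sets the scale, one unusual count changes every standardized value in that column.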
28
Standardizations
[Figure annotations: column total standardization equalizes the area under the curve; column max standardization equalizes the peaks of the curves]
Row total standardization:
Site      A     B     C     D     E     F  Total
1      0.10  0.10  0.10  0.30  0.30  0.10      1
2      0.10  0.10  0.20  0.30  0.30  0.00      1
3      0.10  0.10  0.20  0.30  0.30  0.00      1
4      0.30  0.30  0.20  0.10  0.10  0.00      1
5      0.00  0.00  0.00  0.00  1.00  0.00      1
6      0.00  0.00  0.00  0.00  1.00  0.00      1
Total  0.60  0.60  0.70  1.00  3.00  0.10      6
29
Standardizations
Row total standardization: $b_{ij} = x_{ij} / \sum x_i$
Acceptable domain of x: ≥0
Range of f(x): 0-1
• Commonly used with species data to adjust for unequal abundances among sample units
• Equalizes areas under curves of sample unit profiles
• Shifts emphasis to relative abundance within a sample unit
• Relative abundance profiles of samples are independent
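A minimal sketch of the row total standardization, which converts each sample unit to relative abundances. The function name and the handling of an empty row (left as zeros) are assumptions.

```python
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def row_total(matrix):
    """Divide each element by its row (sample unit) total."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out

rt = row_total(RAW)
print(rt[1] == rt[2])  # sites 2 and 3 differ fivefold in total but share one profile
```

Sites 2 and 3 become identical after standardization, as do sites 5 and 6, because only the within-site proportions are retained.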
30
Standardizations
Row max standardization: $b_{ij} = x_{ij} / \max(x_i)$
Acceptable domain of x: ≥0
Range of f(x): 0-1
• Similar to row total, except:
  ◦ Equalizes heights of peaks of sample unit profiles
  ◦ Based on extreme values, which can introduce noise
Step 1, column max standardization:
Site      A     B     C     D     E     F  Total
1      0.10  0.10  0.05  0.10  0.10  1.00   1.45
2      0.20  0.20  0.20  0.20  0.20  0.00   1.00
3      1.00  1.00  1.00  1.00  1.00  0.00   5.00
4      0.30  0.30  0.10  0.03  0.03  0.00   0.77
5      0.00  0.00  0.00  0.00  0.03  0.00   0.03
6      0.00  0.00  0.00  0.00  0.67  0.00   0.67
Total  1.60  1.60  1.35  1.33  2.03  1.00   8.92

Step 2, row total standardization:
Site      A     B     C     D     E     F  Total
1      0.01  0.01  0.01  0.01  0.01  0.10   0.15
2      0.01  0.01  0.01  0.01  0.01  0.00   0.05
3      0.01  0.01  0.01  0.01  0.01  0.00   0.05
4      0.03  0.03  0.01  0.00  0.00  0.00   0.08
5      0.00  0.00  0.00  0.00  0.03  0.00   0.03
6      0.00  0.00  0.00  0.00  0.03  0.00   0.03
Total  0.06  0.06  0.04  0.03  0.10  0.10   0.39
31
Standardizations
Wisconsin double standardization
Acceptable domain of x: ≥0
Range of f(x): 0-1
• First standardize by species (column) maxima, then by row totals
• Equalizes emphasis among sample units and among species
• Appealing, but comes at the cost of diminishing the intuitive meaning of individual data values
Steps shown: $b_{ij}$ = column max, then $b_{ij}$ = row total
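A minimal sketch of the two-step procedure. It assumes that the row totals in the second step are taken from the column-max-standardized matrix, so that every row of the result sums to 1; the helper names are illustrative.

```python
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def column_max(matrix):
    """Step 1: divide each element by its column (species) maximum."""
    maxima = [max(col) for col in zip(*matrix)]
    return [[x / m if m else 0.0 for x, m in zip(row, maxima)] for row in matrix]

def row_total(matrix):
    """Step 2: divide each element by its row (sample unit) total."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out

wisconsin = row_total(column_max(RAW))
for row in wisconsin:
    print([round(v, 2) for v in row])
```

The order matters: applying the row totals first and the column maxima second would generally give a different matrix.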
32
Standardizations: Some Rules of Thumb
• Effect of standardization on analysis depends on variability among rows and/or columns
33
Standardizations: Some Rules of Thumb
• Consider row standardizations for species data sets, commonly:
  ◦ Row normalize (Euclidean distance (ED) = chord distance)
  ◦ Row chi.square (ED = chi.square distance of CA/CCA)
  ◦ Row total (ED = species profile distance)
  ◦ Row hellinger (ED = Hellinger distance)
(Legendre and Gallagher 2001)
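The idea behind these pairings is that a plain Euclidean distance computed on the standardized rows equals the named ecological distance. A minimal sketch for the Hellinger case, where the standardization is the square root of the row-total proportions; the helper names are assumptions.

```python
import math

RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def hellinger_rows(matrix):
    """Square root of each element divided by its row total."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([math.sqrt(x / total) if total else 0.0 for x in row])
    return out

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

h = hellinger_rows(RAW)
d_12 = euclidean(h[0], h[1])  # sites 1 and 2: different species proportions
d_23 = euclidean(h[1], h[2])  # sites 2 and 3: identical proportions
print(round(d_12, 3), round(d_23, 3))
```

Sites 2 and 3 differ fivefold in total abundance but have the same proportions, so their distance after this standardization is zero.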
34
Standardizations: Some Rules of Thumb
• Consider column standardizations to "equalize" variables measured in different units and scales, commonly:
  ◦ Column standardize (z-scores = zero mean and unit variance): $b_{ij} = (x_{ij} - \bar{x}_j) / s_j$
  ◦ Column normalize (uncentered, with unit sum of squares)
  ◦ Column total (col sums = 1)
  ◦ Column range (col range 0-1)
35
Standardizations: Some Rules of Thumb
• Standardizations may not matter depending on subsequent analysis, e.g.:
  ◦ Principal components of a correlation matrix has a built-in column standardization
  ◦ Correspondence analysis of a species data set has essentially a built-in chi-square standardization
• No theoretical basis for selecting the "best" standardization; should justify on biological grounds and perhaps conduct sensitivity analysis
36
Data Screening for Outliers
• What are outliers?
  ◦ Sample units with extreme values for individual variables (univariate outliers) or sample units with unusual combinations of values for more than one variable (multivariate outliers).
• Why worry about outliers?
  ◦ Outliers can have a large effect on the outcome of an analysis and therefore can lead to erroneous conclusions.

      AMGO   AMRO   BAEA   BCCH
81    6.50   4.91     NA     NA
82    6.50     NA     NA     NA
83    4.27   4.30     NA   4.29
84    6.50     NA     NA     NA
85      NA     NA     NA   5.44
87      NA     NA     NA     NA
89      NA     NA     NA   5.44
90      NA     NA     NA     NA
91      NA     NA  12.73     NA
37
Data Screening for Outliers
• Univariate outliers:
  ◦ Examine sample standard deviation scores on each variable separately.
[Figure: extreme observations flagged where standard deviation scores > 3]
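A minimal sketch of the univariate screen. The data are hypothetical (19 typical values and one extreme value), and the cutoff of 3 follows the figure annotation. Note that with n observations no standard deviation score can exceed (n - 1)/sqrt(n), so the cutoff of 3 can only be reached when n is at least 11.

```python
from statistics import mean, stdev

def sd_scores(values):
    """Standard deviation (z) score of each observation on one variable."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical variable: 19 typical values and one extreme value (12.7).
values = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 5.2,
          4.8, 5.1, 5.0, 4.9, 5.0, 5.1, 4.9, 5.0, 5.2, 12.7]

scores = sd_scores(values)
flagged = [i for i, z in enumerate(scores) if abs(z) > 3]
print(flagged)  # positions of candidate outliers
```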
38
Data Screening for Outliers
• Multivariate outliers:
  ◦ Examine deviations of the sample average distances to other samples.
[Figure: extreme observations flagged where standard deviation scores > 3]
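A minimal sketch of this approach: compute each sample's average distance to all other samples, then convert those averages to standard deviation scores. Euclidean distance and the synthetic points are assumptions for illustration; any ecological distance measure could be substituted.

```python
import math
from statistics import mean, stdev

def average_distances(points):
    """Mean Euclidean distance from each sample to every other sample."""
    avgs = []
    for i, p in enumerate(points):
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        avgs.append(mean(dists))
    return avgs

def flag_multivariate(points, cutoff=3.0):
    """Indices of samples whose average distance is an extreme SD score."""
    avgs = average_distances(points)
    m, s = mean(avgs), stdev(avgs)
    return [i for i, a in enumerate(avgs) if s > 0 and abs(a - m) / s > cutoff]

# Synthetic data: 15 samples on a small grid plus one distant sample.
points = [(i % 4, i // 4) for i in range(15)] + [(30, 30)]
flagged = flag_multivariate(points)
print(flagged)
```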
39
Data Screening for Outliers
• Multivariate outliers:
  ◦ Examine each sample's Mahalanobis distance ($D^2$) to the group of remaining samples.
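For reference, the usual definition of the squared Mahalanobis distance, written in notation assumed here: $\mathbf{x}_i$ is the vector of values for sample $i$, and $\bar{\mathbf{x}}_{(-i)}$ and $\mathbf{S}_{(-i)}$ are the mean vector and covariance matrix of the remaining samples.

```latex
D_i^2 = \left(\mathbf{x}_i - \bar{\mathbf{x}}_{(-i)}\right)^{\top}
        \mathbf{S}_{(-i)}^{-1}
        \left(\mathbf{x}_i - \bar{\mathbf{x}}_{(-i)}\right)
```

Because the covariance matrix must be invertible, this requires more samples than variables and no perfectly collinear variables.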
40
Data Screening for Outliers
• Multivariate outliers:
  ◦ Examine results of subsequent analyses for extreme values (e.g., isolated points in ordination plots, single-member clusters in cluster analysis, etc.)
[Figure: ordination plot with axes PC1, PC2, and PC3]
41
Data Screening for Outliers: Some Rules of Thumb
• Examine data at all stages of analysis (i.e., input data, transformed/standardized data, ecological distance matrix, results of analysis) for extreme values
• Be aware of potential impact of extreme values in chosen analysis
• Delete extreme values only if justifiable on ecological grounds
• Conduct sensitivity analysis