Data Screening and Adjustments - UMass Amherst

Purpose:

- Detect and correct data errors
- Detect and treat missing data
- Detect and handle insufficiently sampled variables (e.g., rare species)
- Conduct transformations and standardizations
- Detect and handle outliers

Data Screening for Errors

- Examine summary statistics (e.g., n, mean, min, max) and check for irregularities

[Figure callouts: "Unrealistic value?" and "Where did all the data go?"]

Action: correct errors in the raw data
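The summary-statistics check can be sketched in a few lines of Python. This is only an illustration: the column names `sp_a` and `sp_b`, the values, and the helper `summarize` are hypothetical, chosen to mimic one unrealistic value and one column with no data.

```python
import math

def summarize(column):
    """Return n, mean, min, and max of the non-missing values in one column."""
    values = [v for v in column if v is not None and not math.isnan(v)]
    n = len(values)
    if n == 0:
        return {"n": 0, "mean": None, "min": None, "max": None}
    return {"n": n, "mean": sum(values) / n, "min": min(values), "max": max(values)}

# Hypothetical data: one count column with a suspicious entry (999)
# and one column that was never filled in.
data = {
    "sp_a": [1, 2, 10, 3, 0, 999],
    "sp_b": [None, None, None, None, None, None],
}

summary = {name: summarize(col) for name, col in data.items()}
for name, stats in summary.items():
    print(name, stats)
```

A maximum far above the other values, or n = 0, is the kind of irregularity that should send you back to the raw data.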

Data Screening for Missing Data

- Evaluate the amount and pattern of missing data and take corrective action, if needed

Action: replace with prior knowledge; insert means or medians; use regression to estimate values

e.g., Median replacement
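Median replacement might look like the following minimal sketch. It assumes missing values are coded as `None`; the function name and the example column are hypothetical.

```python
from statistics import median

def replace_missing_with_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # nothing to estimate from; leave the column as-is
    fill = median(observed)
    return [fill if v is None else v for v in column]

counts = [1, 2, None, 3, None, 10]  # hypothetical column with two gaps
filled = replace_missing_with_median(counts)
print(filled)
```

The median is less sensitive to extreme values than the mean, which matters for skewed count data.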

Data Screening for Sufficiency

- Check for and drop “insufficient” variables
  - E.g., rare species in community datasets

Sufficiency is the extent to which each variable (e.g., each species’ ecological character) is accurately and meaningfully described by the data.

E.g., species with very few records are not likely to be accurately placed in ecological space. You must decide at what frequency of occurrence you are willing to accept the ‘message’, and eliminate species below this level.

- Other issues:
  - Influence of abundant generalists in community datasets. Abundant generalists define strong dimensions of the data cloud that have no meaningful pattern on them. They can overwhelm the message of rarer species in some types of analysis. You must decide whether to include or exclude these “dominant” species.
  - Variables with too little variation (i.e., no signature). Variables with too little variation have no meaningful pattern (or influence) and are therefore unnecessary.

[Figure: typical community dataset, showing rare species and dominant species, with median, 5%, and 95% occurrence marked. Callouts: "Too few occurrences?" and "Too little variation?"]

Data Screening for Sufficiency: Some Rules of Thumb

- Drop “insufficient” variables (species) and conduct sensitivity analysis
  - Rare species (e.g., <5% occurrence)
  - Too little variability (e.g., <5-10% CV)
- Drop “abundant generalist” species and conduct sensitivity analysis
  - Dominant species (e.g., >95% occurrence)

[Figure callout: "Too ubiquitous?"]
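These rules of thumb can be sketched as a simple screening function. The thresholds are the ones quoted above. The function name, the flag labels, and the use of the sample standard deviation for the CV are assumptions made for illustration. The matrix is the six-site, six-species raw data matrix used in the examples.

```python
from statistics import mean, stdev

SPECIES = ["A", "B", "C", "D", "E", "F"]
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def screen_species(matrix, names, min_occ=0.05, max_occ=0.95, min_cv=5.0):
    """Flag columns that are too rare, too ubiquitous, or nearly constant."""
    flags = {}
    n_sites = len(matrix)
    for j, name in enumerate(names):
        col = [row[j] for row in matrix]
        occurrence = sum(1 for v in col if v > 0) / n_sites  # proportion of sites occupied
        avg = mean(col)
        cv = 100.0 * stdev(col) / avg if avg > 0 else 0.0    # coefficient of variation, %
        reasons = []
        if occurrence < min_occ:
            reasons.append("rare")
        if occurrence > max_occ:
            reasons.append("ubiquitous")
        if cv < min_cv:
            reasons.append("low variation")
        flags[name] = reasons
    return flags

flags = screen_species(RAW, SPECIES)
print(flags)
```

With only six sites, no species can fall below 5% occurrence, so only species E (present at every site) is flagged here. Rerunning with different thresholds is a simple form of sensitivity analysis.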

Data Transformations & Standardizations

Purpose:

- Statistical
  - Improve assumptions of normality, linearity, homogeneity of variance, etc.
  - Make units of variables comparable when measured on different scales.
- Ecological
  - Make ecological distance measures work better.
  - Reduce the effect of total quantity in sample units, to put the focus on relative quantities.
  - Equalize (or otherwise alter) the relative importance of variables (e.g., common and rare species).
  - Emphasize informative variables (species) at the expense of uninformative variables (species).

Transformations are applied to each element of the data matrix, independent of the other elements. Standardizations adjust matrix elements by a row or column standard (e.g., max, sum, etc.).

Raw Data Matrix

Site      A      B      C      D      E      F   Total
1         1      1      1      3      3      1      10
2         2      2      4      6      6      0      20
3        10     10     20     30     30      0     100
4         3      3      2      1      1      0      10
5         0      0      0      0      1      0       1
6         0      0      0      0     20      0      20
Total    16     16     27     40     61      1     161

Log Transformation: $b_{ij} = \log(x_{ij} + 1)$ (values shown are base-10 logs)

Site      A      B      C      D      E      F   Total
1      0.30   0.30   0.30   0.60   0.60   0.30    2.41
2      0.48   0.48   0.70   0.85   0.85   0.00    3.34
3      1.04   1.04   1.32   1.49   1.49   0.00    6.39
4      0.60   0.60   0.48   0.30   0.30   0.00    2.28
5      0.00   0.00   0.00   0.00   0.30   0.00    0.30
6      0.00   0.00   0.00   0.00   1.32   0.00    1.32
Total  2.42   2.42   2.80   3.24   4.86   0.30   16.05

Column Z-score Standardization: $b_{ij} = (x_{ij} - \bar{x}_j) / s_j$

Site      A      B      C      D      E      F   Total
1     -0.44  -0.44  -0.45  -0.31  -0.59   2.04   -0.20
2     -0.18  -0.18  -0.06  -0.06  -0.35  -0.41   -1.23
3      1.94   1.94   2.00   2.00   1.64  -0.41    9.12
4      0.09   0.09  -0.32  -0.49  -0.76  -0.41   -1.80
5     -0.71  -0.71  -0.58  -0.57  -0.76  -0.41   -3.73
6     -0.71  -0.71  -0.58  -0.57   0.82  -0.41   -2.16
Total     0      0      0      0      0      0       0

When to Transform?

- To adjust for highly skewed variables
- To better meet the assumptions of a statistical test (e.g., normality, constant variance, etc.)
- To emphasize the presence/absence (nonquantitative) signature

Which Transformation?

- Depends on the type of data
- Whichever works best

Monotonic Transformations

Binary (presence/absence) Transformation: $b_{ij} = x_{ij}^{0}$ (power), with $0^{0}$ taken as 0

Acceptable domain of x: all
Range of f(x): 0 and 1 only

- Converts quantitative data into nonquantitative data
- Applicable for species data
- Most useful when there is little quantitative information present
- Can be a severe transformation

Binary transformation of the raw data matrix:

Site      A      B      C      D      E      F   Total
1         1      1      1      1      1      1       6
2         1      1      1      1      1      0       5
3         1      1      1      1      1      0       5
4         1      1      1      1      1      0       5
5         0      0      0      0      1      0       1
6         0      0      0      0      1      0       1
Total     4      4      4      4      6      1      23
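The binary transformation can be sketched as follows, using the raw data matrix from the examples. The function name is hypothetical. Testing `value > 0` is an assumption that sidesteps the `0 ** 0` question: Python evaluates `0 ** 0` as 1, which is not what a presence/absence coding wants.

```python
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def to_presence_absence(matrix):
    """Map every positive abundance to 1 and every zero to 0."""
    return [[1 if value > 0 else 0 for value in row] for row in matrix]

binary = to_presence_absence(RAW)
row_totals = [sum(row) for row in binary]        # species richness per site
col_totals = [sum(col) for col in zip(*binary)]  # number of sites occupied per species
print(row_totals, col_totals)
```

After the transformation, site 3 (100 individuals) and site 4 (10 individuals) are identical, which shows how severe the transformation can be.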

Log Transformation: $b_{ij} = \log(x_{ij} + 1)$

Acceptable domain of x: > 0 for $\log(x)$; adding 1 extends this to x ≥ 0, so that zeros map to 0
Range of f(x): all

- Compresses high values and spreads low values by expressing values as orders of magnitude
- Useful when there is a high degree of variation, when the ratio of largest to smallest value is >10, or when the data are highly positively skewed
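A minimal sketch of the log transformation applied to the raw data matrix from the examples. Base 10 is assumed here because it reproduces the tabulated values (e.g., log10(3 + 1) = 0.60). The function name is hypothetical.

```python
import math

RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def log_transform(matrix):
    """Apply b = log10(x + 1) to every element; zeros stay at zero."""
    return [[math.log10(value + 1) for value in row] for row in matrix]

logged = log_transform(RAW)

# Compression of high values: the largest/smallest nonzero ratio shrinks.
raw_ratio = 30 / 1
log_ratio = logged[2][3] / logged[0][0]
print(round(logged[0][3], 2), round(logged[2][3], 2), round(log_ratio, 2))
```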

[Figure: Power Transformations; x-axis from 0 to 100; example curve labeled p = 1/10]

Square Root Transformation: $b_{ij} = x_{ij}^{1/2}$ (power)

Acceptable domain of x: ≥ 0
Range of f(x): ≥ 0

- Similar in effect to, but less dramatic than, the log transformation
- Often used with count (meristic) data; e.g., when the mean equals the variance (Poisson distribution)

Square root transformation of the raw data matrix:

Site      A      B      C      D      E      F   Total
1      1.00   1.00   1.00   1.73   1.73   1.00    7.46
2      1.41   1.41   2.00   2.45   2.45   0.00    9.73
3      3.16   3.16   4.47   5.48   5.48   0.00   21.75
4      1.73   1.73   1.41   1.00   1.00   0.00    6.88
5      0.00   0.00   0.00   0.00   1.00   0.00    1.00
6      0.00   0.00   0.00   0.00   4.47   0.00    4.47
Total  7.31   7.31   8.89  10.66  16.13   1.00   51.29

Power Family Transformation: $b_{ij} = x_{ij}^{p}$

Acceptable domain of x: ≥ 0
Range of f(x): ≥ 0

- Different exponents change the effect of the transformation; the smaller the exponent, the more compression applied to high values
- Flexible transformation useful for a wide variety of data
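The effect of the exponent can be illustrated with a short sketch on the raw data matrix from the examples. The function name is hypothetical, and the restriction to p > 0 is an assumption made so that zeros stay at zero.

```python
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def power_transform(matrix, p):
    """Apply b = x ** p elementwise; p = 0.5 is the square root transformation."""
    if p <= 0:
        raise ValueError("this sketch assumes an exponent p > 0")
    return [[value ** p for value in row] for row in matrix]

square_root = power_transform(RAW, 0.5)
tenth_root = power_transform(RAW, 0.1)

# The largest raw value (30) is compressed more strongly by the smaller exponent.
print(round(square_root[2][3], 2), round(tenth_root[2][3], 2))
```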

Arcsine Square Root Transformation: $b_{ij} = (2/\pi) \sin^{-1}(x_{ij}^{1/2})$

Acceptable domain of x: 0-1
Range of f(x): 0-1

- Spreads the ends of the scale while compressing the middle for proportion data
- Useful for proportion data with positive skew (can use the arcsine transformation for negative skew)
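A minimal sketch of the arcsine square root transformation for a single proportion. The function name and the explicit domain check are assumptions for illustration. The 2/π factor rescales the result to the range 0-1.

```python
import math

def arcsine_sqrt(proportion):
    """Return (2/pi) * asin(sqrt(x)) for a proportion x in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("arcsine square root is defined for proportions in [0, 1]")
    return (2.0 / math.pi) * math.asin(math.sqrt(proportion))

for x in (0.0, 0.01, 0.25, 0.5, 1.0):
    print(x, round(arcsine_sqrt(x), 3))
```

Small proportions are stretched (0.25 maps to about 0.333), while 0, 0.5, and 1 stay where they are.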

Transformations: Some Rules of Thumb

- Use a log or square root transformation for “highly” skewed data or data ranging over several (>2) orders of magnitude
- Use the arcsine square root transformation for proportion data
- Consider the binary (presence/absence) transformation when:
  - the percentage of zeros is high (say >50%)
  - the number of distinct values is low (say <10)
  - beta diversity is high (say >5)
- If applied to a related variable set (e.g., species), use the same transformation (e.g., log) so that all are scaled the same; otherwise, transform independently

When to Standardize?

- To place highly unequal sample units or variables (species) on equal footing
- To better represent the patterns of interest

Which Standardization?

- Depends on the objective (sample or variable adjustment) and the statistical technique (ordination, cluster analysis, etc.)
- Which standard (variance, totals, max, etc.) makes sense?

Standardizations

- Standardizations adjust matrix elements by a row or column standard (e.g., max, sum, etc.).
- All standardizations can be applied to either rows or columns (or both)

Row max: $b_{ij} = x_{ij} / \max(x_i)$

Column z-score: $b_{ij} = (x_{ij} - \bar{x}_j) / s_j$

Site      A      B      C      D      E      F   Total
1     -0.48  -0.48  -0.50  -0.34  -0.65   2.24   -0.22
2     -0.19  -0.19  -0.07  -0.06  -0.38  -0.45   -1.35
3      2.13   2.13   2.19   2.19   1.80  -0.45   10.00
4      0.10   0.10  -0.35  -0.53  -0.83  -0.45   -1.97
5     -0.77  -0.77  -0.64  -0.63  -0.83  -0.45   -4.09
6     -0.77  -0.77  -0.64  -0.63   0.89  -0.45   -2.36
Total     0      0      0      0      0      0       0

Row z-score: $b_{ij} = (x_{ij} - \bar{x}_i) / s_i$

Site      A      B      C      D      E      F   Total
1     -0.71  -0.71  -0.71   1.41   1.41  -0.71    0.00
2     -0.60  -0.60   0.30   1.21   1.21  -1.51    0.00
3     -0.60  -0.60   0.30   1.21   1.21  -1.51    0.00
4      1.21   1.21   0.30  -0.60  -0.60  -1.51    0.00
5     -0.45  -0.45  -0.45  -0.45   2.24  -0.45    0.00
6     -0.45  -0.45  -0.45  -0.45   2.24  -0.45    0.00
Total  -1.6   -1.6   -0.7  2.329  7.695  -6.12       0

The values in these two tables are consistent with standard deviations computed with divisor n.

Column or Row Standardizations?

Column Standardization

- When the principal concern is to adjust for differences (e.g., variances, total abundance, ubiquity) among variables (species) in order to place them on equal footing.
- When the focus is on the profile across sample units.

Row Standardization

- When the principal concern is to adjust for differences (e.g., total abundance, diversity) among sample units in order to place them on equal footing.
- When the focus is on the profile within a sample unit.

Common Standardizations

- Total: divide by margin total
- Max: divide by margin maximum
- Range: standardize values to range 0-1
- Frequency: divide by margin total and multiply by number of non-zero items, so that the average of non-zero items is 1
- Hellinger: square root of the Total standardization
- Normalization: make margin sums of squares equal 1
- Standardize: scale to zero mean and unit variance (z-scores)
- Chi.square: divide by row sums and square root of column sums, and adjust for square root of matrix total


Column Z-score Standardization: $b_{ij} = (x_{ij} - \bar{x}_j) / s_j$

Acceptable domain of x: all
Range of f(x): all

- Converts data to z-scores (mean = 0, variance = 1)
- Commonly used to place variables on equal footing
- Essential when variables have different scales or units of measurement
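Column z-scores can be sketched with the standard library as follows, using the raw data matrix from the examples. The function name is hypothetical. The sample standard deviation (divisor n - 1) is an assumption; with it, species F at site 1 comes out at 2.04, while a divisor of n would give 2.24.

```python
from statistics import mean, stdev

RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def column_zscores(matrix):
    """Center each column on its mean and divide by its sample standard deviation."""
    columns = list(zip(*matrix))
    centers = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]
    if any(s == 0 for s in spreads):
        raise ValueError("a constant column has no z-score; drop it first")
    return [[(v - m) / s for v, m, s in zip(row, centers, spreads)]
            for row in matrix]

z = column_zscores(RAW)
print([round(v, 2) for v in z[0]])
```

Every column now sums to zero, so abundant and scarce species contribute on the same scale.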

Column Total Standardization: $b_{ij} = x_{ij} / \sum x_j$

Acceptable domain of x: ≥ 0
Range of f(x): 0-1

- Commonly used with species data to adjust for unequal abundances among species
- Equalizes areas under curves of species response profiles
- Relative abundance profiles of samples depend on species’ relative abundances across all sites

Column Max Standardization: $b_{ij} = x_{ij} / \max(x_j)$

- Similar to column total, except:
  - Equalizes heights of peaks of species response curves
  - Based on extreme values, which can introduce noise
  - Can exacerbate importance of rare species

Column max standardization of the raw data matrix:

Site      A      B      C      D      E      F   Total
1      0.10   0.10   0.05   0.10   0.10   1.00    1.45
2      0.20   0.20   0.20   0.20   0.20   0.00    1.00
3      1.00   1.00   1.00   1.00   1.00   0.00    5.00
4      0.30   0.30   0.10   0.03   0.03   0.00    0.77
5      0.00   0.00   0.00   0.00   0.03   0.00    0.03
6      0.00   0.00   0.00   0.00   0.67   0.00    0.67
Total  1.60   1.60   1.35   1.33   2.03   1.00    8.92

[Figure: response curves for Species A and Species B, axis "Abundance (count)" with ticks 0 to 10; panels include "Column Total Standardization" (equalizes area under curve) and "Column Max Standardization" (equalizes peaks of curves)]
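The contrast between column total and column max standardization can be sketched with one helper that takes the column standard as a function. The helper name is hypothetical. The matrix is the raw data matrix from the examples.

```python
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def standardize_columns(matrix, standard):
    """Divide each element by a per-column standard such as sum or max."""
    divisors = [standard(col) for col in zip(*matrix)]
    return [[v / d if d else 0.0 for v, d in zip(row, divisors)]
            for row in matrix]

by_total = standardize_columns(RAW, sum)  # every column sums to 1
by_max = standardize_columns(RAW, max)    # every column peaks at 1

print([round(v, 2) for v in by_total[0]])
print([round(v, 2) for v in by_max[0]])
```

Species F, recorded once, scores 1.00 at site 1 under both standards, which illustrates how these standardizations can inflate the importance of rare species.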

Row Total Standardization: $b_{ij} = x_{ij} / \sum x_i$

- Commonly used with species data to adjust for unequal abundances among sample units
- Equalizes areas under curves of sample unit profiles
- Shifts emphasis to relative abundance within a sample unit
- Relative abundance profiles of samples are independent

Row total standardization of the raw data matrix:

Site      A      B      C      D      E      F   Total
1      0.10   0.10   0.10   0.30   0.30   0.10       1
2      0.10   0.10   0.20   0.30   0.30   0.00       1
3      0.10   0.10   0.20   0.30   0.30   0.00       1
4      0.30   0.30   0.20   0.10   0.10   0.00       1
5      0.00   0.00   0.00   0.00   1.00   0.00       1
6      0.00   0.00   0.00   0.00   1.00   0.00       1
Total  0.60   0.60   0.70   1.00   3.00   0.10       6
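Row total standardization can be sketched as below, again on the raw data matrix from the examples. The function name is hypothetical, and returning zeros for an empty sample unit is an assumption.

```python
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def row_total_standardize(matrix):
    """Convert each sample unit (row) to relative abundances that sum to 1."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([v / total if total else 0.0 for v in row])
    return result

profiles = row_total_standardize(RAW)
for site, row in enumerate(profiles, start=1):
    print(site, [round(v, 2) for v in row])
```

Sites 2 and 3 differ five-fold in total abundance but end up with the same profile, and sites 5 and 6 likewise: the emphasis has shifted to relative abundance within each sample unit.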

Row Max Standardization: $b_{ij} = x_{ij} / \max(x_i)$

- Similar to row total, except:
  - Equalizes heights of peaks of sample unit profiles
  - Based on extreme values, which can introduce noise

Wisconsin Double Standardization

- First standardize by species (column) maxima, then by row totals
- Equalizes emphasis among sample units and among species
- Appealing, but comes at the cost of diminishing the intuitive meaning of individual data values

Step 1, column max:

Site      A      B      C      D      E      F   Total
1      0.10   0.10   0.05   0.10   0.10   1.00    1.45
2      0.20   0.20   0.20   0.20   0.20   0.00    1.00
3      1.00   1.00   1.00   1.00   1.00   0.00    5.00
4      0.30   0.30   0.10   0.03   0.03   0.00    0.77
5      0.00   0.00   0.00   0.00   0.03   0.00    0.03
6      0.00   0.00   0.00   0.00   0.67   0.00    0.67
Total  1.60   1.60   1.35   1.33   2.03   1.00    8.92

Step 2, row total:

Site      A      B      C      D      E      F   Total
1      0.01   0.01   0.01   0.01   0.01   0.10    0.15
2      0.01   0.01   0.01   0.01   0.01   0.00    0.05
3      0.01   0.01   0.01   0.01   0.01   0.00    0.05
4      0.03   0.03   0.01   0.00   0.00   0.00    0.08
5      0.00   0.00   0.00   0.00   0.03   0.00    0.03
6      0.00   0.00   0.00   0.00   0.03   0.00    0.03
Total  0.06   0.06   0.04   0.03   0.10   0.10    0.39

As printed, the step 2 values appear to equal the step 1 values divided by the raw-data row totals (10, 20, 100, 10, 1, 20), not by the step 1 row totals.
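A minimal sketch of the two-step procedure, using the raw data matrix from the examples. It assumes the common definition in which the second step divides by the row totals of the max-standardized matrix, so that every row sums to 1. The function name is hypothetical.

```python
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def wisconsin_double(matrix):
    """Standardize by column maxima, then by the resulting row totals."""
    col_max = [max(col) for col in zip(*matrix)]
    step1 = [[v / m if m else 0.0 for v, m in zip(row, col_max)]
             for row in matrix]
    step2 = []
    for row in step1:
        total = sum(row)  # row total of the max-standardized values (assumption)
        step2.append([v / total if total else 0.0 for v in row])
    return step2

double = wisconsin_double(RAW)
for site, row in enumerate(double, start=1):
    print(site, [round(v, 2) for v in row])
```

Under this definition species F accounts for about 0.69 of site 1, even though it contributes a single individual out of 10, which shows how the intuitive meaning of individual values is lost.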

Standardizations: Some Rules of Thumb

- The effect of standardization on an analysis depends on the variability among rows and/or columns


Standardizations: Some Rules of Thumb

- Consider row standardizations for species data sets, commonly:
  - Row normalize (Euclidean distance (ED) = chord distance)
  - Row chi.square (ED = chi-square distance of CA/CCA)
  - Row total (ED = species profile distance)
  - Row hellinger (ED = Hellinger distance) (Legendre and Gallagher 2001)
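The link between a row standardization and the resulting distance can be sketched for two of these cases: Euclidean distance on row-normalized data (chord distance) and on Hellinger-standardized data (Hellinger distance). The function names are hypothetical. The matrix is the raw data matrix from the examples.

```python
import math

RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def row_normalize(matrix):
    """Scale each row to unit length (row sum of squares = 1)."""
    out = []
    for row in matrix:
        length = math.sqrt(sum(v * v for v in row))
        out.append([v / length if length else 0.0 for v in row])
    return out

def row_hellinger(matrix):
    """Square root of the row-total standardized values."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([math.sqrt(v / total) if total else 0.0 for v in row])
    return out

normalized = row_normalize(RAW)
hellinger = row_hellinger(RAW)

raw_distance = math.dist(RAW[1], RAW[2])                    # sites 2 and 3, raw counts
chord_distance = math.dist(normalized[1], normalized[2])    # ED after row normalize
hellinger_distance = math.dist(hellinger[0], hellinger[4])  # sites 1 and 5
print(round(raw_distance, 2), round(chord_distance, 2), round(hellinger_distance, 2))
```

Sites 2 and 3 are far apart in raw Euclidean distance but have a chord distance of zero, because they share the same relative composition.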

Standardizations: Some Rules of Thumb

- Consider column standardizations to “equalize” variables measured in different units and scales, commonly:
  - Column standardize (z-scores = zero mean and unit variance): $b_{ij} = (x_{ij} - \bar{x}_j) / s_j$
  - Column normalize (uncentered; column sums of squares = 1)
  - Column total (col sums = 1)
  - Column range (col range 0-1)

Standardizations: Some Rules of Thumb

- Standardizations may not matter, depending on the subsequent analysis, e.g.:
  - Principal components analysis of a correlation matrix has a built-in column standardization
  - Correspondence analysis of a species data set has essentially a built-in chi-square standardization
- There is no theoretical basis for selecting the “best” standardization; the choice should be justified on biological grounds, perhaps with a sensitivity analysis

Data Screening for Outliers

- What are outliers?
  - Sample units with extreme values for individual variables (univariate outliers) or sample units with an unusual combination of values for more than one variable (multivariate outliers).
- Why worry about outliers?
  - Outliers can have a large effect on the outcome of an analysis and therefore can lead to erroneous conclusions.

- Univariate outliers:
  - Examine sample standard deviation scores on each variable separately.

Extreme observations: standard deviation scores > 3

      AMGO   AMRO   BAEA   BCCH
81    6.50   4.91     NA     NA
82    6.50     NA     NA     NA
83    4.27   4.30     NA   4.29
84    6.50     NA     NA     NA
85      NA     NA     NA   5.44
87      NA     NA     NA     NA
89      NA     NA     NA   5.44
90      NA     NA     NA     NA
91      NA     NA  12.73     NA
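The univariate screen can be sketched as follows. The 20 abundance values are hypothetical. A larger sample is used because, with the sample standard deviation, no score can exceed (n - 1)/sqrt(n), which is about 2.04 for six sample units and so can never pass a threshold of 3. The function names are also hypothetical.

```python
from statistics import mean, stdev

def sd_scores(values):
    """Standard deviation scores: (x - mean) / sample standard deviation."""
    center = mean(values)
    spread = stdev(values)
    return [(v - center) / spread for v in values]

def flag_extremes(values, threshold=3.0):
    """Return (index, score) pairs whose absolute SD score exceeds the threshold."""
    return [(i, round(z, 2))
            for i, z in enumerate(sd_scores(values))
            if abs(z) > threshold]

# Hypothetical abundances for one species across 20 sample units.
abundance = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6, 5, 4, 6, 5, 5, 4, 6, 5, 5, 60]
extremes = flag_extremes(abundance)
print(extremes)
```

Only the last sample unit (index 19, value 60) is flagged. Running the same screen on each variable in turn gives a table of extreme observations like the one above.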

- Multivariate outliers:
  - Examine deviations of the sample average distances to other samples. Extreme observations: standard deviation scores > 3.
  - Examine each sample’s Mahalanobis distance to the group of remaining samples. [Figure: plot with axes PC1 and PC2]
  - Examine results of subsequent analyses for extreme values (e.g., isolated points in ordination plots, single-member clusters in cluster analysis, etc.)

Data Screening for Outliers: Some Rules of Thumb

- Examine data at all stages of analysis (i.e., input data, transformed/standardized data, ecological distance matrix, results of analysis) for extreme values
- Be aware of the potential impact of extreme values in the chosen analysis
- Delete extreme values only if justifiable on ecological grounds
- Conduct sensitivity analysis
