multivariate outlier detection

Outlier Identification in National Resources Inventory and Theoretical Extensions to Nondifferentiable Survey Estimators

Jianqiang Wang

Major Professor: Jean Opsomer

Committee: Wayne A. Fuller

Song X. Chen

Dan Nettleton

Dimitris Margaritis

Outline

• Introduction

• Notation and assumptions

• Mean and median-based inference

• Variance estimation

• Simulation study

• Application in National Resources Inventory

• Theoretical extensions

National Resources Inventory (1)

• The National Resources Inventory (NRI) is a longitudinal survey of natural resources on non-Federal land in the U.S.

• Conducted by the USDA NRCS, in cooperation with CSSM at Iowa State University.

• Produces a longitudinal database containing numerous agro-environmental variables for scientific investigation and policy-making.

• Information was updated every 5 years before 1997, and annually afterwards through a partially overlapping subsampling design.

National Resources Inventory (2)

• Covers various aspects of land use and farming practice, and environmentally important variables like wetland status and soil erosion.

• Measures both level and change over time in these variables.

• Primary mode of data collection is a combination of aerial photography and field collection.

• Outliers arise from errors in data collection or processing, or from real points that themselves behave abnormally.

Outlier identification for a longitudinal survey

• Identify outliers for periodically updated data.

• Build outlier identification rules on previous years' data and use the rules to flag current observations.

Observed years 2001-2005: training set (2001, 2002, 2003), test set (2003, 2004, 2005).

Target variables

• Non-pseudo core points with soil erosion in years 2001-2005.

• Training set variables: broad use, land use, C factor, support practice factor, slope, slope length and USLE loss in years 2001, 2002 and 2003.

• USLE loss represents the potential long-term soil loss in tons/acre:

USLELOSS = R * K * LS * C * P
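The product form can be illustrated with a minimal Python sketch. It assumes the usual USLE reading of the factors (R rainfall erosivity, K soil erodibility, LS slope length and steepness, C cover management, P support practice). The function name `usle_loss` and the numeric inputs are made up for illustration and are not NRI data.

```python
def usle_loss(r, k, ls, c, p):
    """Potential long-term soil loss (tons/acre) as the product R * K * LS * C * P."""
    factors = {"R": r, "K": k, "LS": ls, "C": c, "P": p}
    for name, value in factors.items():
        if value < 0:
            raise ValueError(f"{name} factor must be non-negative, got {value}")
    return r * k * ls * c * p


# Hypothetical factor values for one point in one year (not NRI data).
loss = usle_loss(r=150.0, k=0.3, ls=1.2, c=0.2, p=1.0)
# The same point with the C factor entered ten times too large.
loss_bad_c = usle_loss(r=150.0, k=0.3, ls=1.2, c=2.0, p=1.0)
print(round(loss, 2), round(loss_bad_c, 2))
```

Because the loss is a product, a tenfold slip in any single factor carries through as a tenfold change in the computed loss.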

Point classification

b.u. = broad use code

b.u.  Point type              b.u.  Point type
1     Cultivated cropland     7     Urban and built-up land
2     Noncultivated cropland  8     Rural transportation
3     Pastureland             9     Small water areas
4     Rangeland               10    Large water areas
5     Forest land             11    Federal land
6     Minor land              12    CRP

Initial partitioning

• Initial partitioning uses geographical association and broad use category.

   ◦ Partition national data into state-wise categories.

   ◦ Collapse northeastern states.

   ◦ Partition each region based on broad use sequence into (1,1,1), (2,2,2), (3,3,3), (12,12,12) and points with broad use change.

   ◦ Merge points with the same broad use change pattern, say (2,2,3) or (1,1,12).
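The partitioning steps can be sketched as a grouping by region and three-year broad use sequence. This is only an illustration: the slides do not say which states are collapsed, so the `COLLAPSED_REGION` mapping, the function name `partition_points`, and the toy records are all assumptions.

```python
from collections import defaultdict

# Assumption: which states count as "northeastern" is not given in the slides;
# this mapping is a made-up placeholder.
COLLAPSED_REGION = {"CT": "Northeast", "RI": "Northeast", "MA": "Northeast"}


def partition_points(points, collapsed_region=COLLAPSED_REGION):
    """Group point ids by (region, broad use sequence over the three years)."""
    cells = defaultdict(list)
    for point_id, state, broad_use_seq in points:
        # States listed in the mapping share one region; others stay on their own.
        region = collapsed_region.get(state, state)
        # Points with the same sequence, constant or changing, land in the same cell.
        cells[(region, tuple(broad_use_seq))].append(point_id)
    return dict(cells)


# Toy records: (point id, state, broad use in three consecutive years).
points = [
    ("a", "IA", (1, 1, 1)),
    ("b", "IA", (1, 1, 1)),
    ("c", "IA", (2, 2, 3)),
    ("d", "CT", (3, 3, 3)),
    ("e", "RI", (3, 3, 3)),
    ("f", "IA", (2, 2, 3)),
]
cells = partition_points(points)
for key, ids in cells.items():
    print(key, ids)
```

Keying on the full sequence handles the constant sequences and the change patterns with one rule, since two points share a change pattern exactly when their sequences match.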

Source of outlyingness

• Flagged 1% of the points in the training set, and compared test distances with the 99% quantile of the training distances.

• Source of outlyingness:

\[ e_{\nu,i} = \frac{\hat{\Sigma}_\nu^{-1/2} (\hat{\mu}_\nu - y_i)}{\| \hat{\Sigma}_\nu^{-1/2} (\hat{\mu}_\nu - y_i) \|} \]
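A minimal sketch of this flagging rule follows, under simplifying assumptions. It uses the unweighted sample mean and covariance as stand-ins for the location and scatter estimates, whereas a survey application would presumably use design-weighted versions. The function names `fit_rule` and `score` and the simulated data are hypothetical. The unit vector returned for each point is the standardized difference from the center divided by its norm, so its largest components indicate which coordinates drive the distance.

```python
import numpy as np


def fit_rule(train, flag_fraction=0.01):
    """Estimate center, inverse square root of the covariance, and cutoff from training data."""
    center = train.mean(axis=0)
    cov = np.cov(train, rowvar=False)
    # Symmetric inverse square root via the eigendecomposition of the covariance.
    eigval, eigvec = np.linalg.eigh(cov)
    inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    train_dist = np.linalg.norm((center - train) @ inv_sqrt, axis=1)
    # Cutoff is the (1 - flag_fraction) quantile of the training distances.
    cutoff = np.quantile(train_dist, 1.0 - flag_fraction)
    return center, inv_sqrt, cutoff


def score(points, center, inv_sqrt, cutoff):
    """Return distances, flags, and unit vectors of the standardized differences."""
    standardized = (center - points) @ inv_sqrt
    dist = np.linalg.norm(standardized, axis=1)
    # Guard against division by zero for a point sitting exactly at the center.
    direction = standardized / np.where(dist > 0, dist, 1.0)[:, None]
    return dist, dist > cutoff, direction


# Simulated data: the last test point is shifted far out in the second coordinate.
rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 3))
test = np.vstack([rng.normal(size=(5, 3)), [[0.0, 12.0, 0.0]]])

center, inv_sqrt, cutoff = fit_rule(train)
dist, flagged, direction = score(test, center, inv_sqrt, cutoff)
print(flagged)
print(np.round(direction[-1], 2))
```

For the shifted point, the second component of the unit vector dominates, which points a reviewer to the variable responsible for the flag.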

Analysis of flagged points

• Agricultural specialists analyzed the flagged points, grouped by suspicious variable.

• C factor: almost all points were considered suspicious.

   ◦ Data entry errors

   ◦ Invalid entries: C factor = 1 for hayland, pastureland or CRP

   ◦ Unusual levels or trends in relation to land use

(0.013, 0.13, 0.013, 0.013, 0.013)

(0.011, 0.06, 0.11, 0.003, 0.003)

Analysis of flagged points

• P factor: all points are candidates for review because of the change over time.

• Slope length: all points were flagged because of the level, not the change over time.

(1.0, 1.0, 1.0, 0.6, 1.0)

Nondifferentiable survey estimators

• The sample distance distribution \hat{D}_{\nu,d}(\hat{\mu}_\nu) is a nondifferentiable function of the estimated location parameter.

• A general class of survey estimators:

\[ \hat{T}(\hat{\lambda}) = \frac{1}{N} \sum_{i \in S_\nu} \frac{1}{\pi_i} h(y_i; \hat{\lambda}) \]

with corresponding population quantity

\[ T_N(\lambda_N) = \frac{1}{N} \sum_{i=1}^{N} h(y_i; \lambda_N), \]

where h is not necessarily differentiable.

• A direct Taylor linearization may not be applicable; again use a differentiable limiting function T(\gamma) = \lim_{N \to \infty} T_N(\gamma), with derivative \zeta(\gamma).

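Asymptotics

• Under certain regularity conditions,

\[ n^{*1/2} \left[ V(\hat{T}(\hat{\lambda})) \right]^{-1/2} \left( \hat{T}(\hat{\lambda}) - T_N(\lambda_N) \right) \Big| \mathcal{F} \stackrel{d}{\to} N(0, 1), \]

where

\[ V(\hat{T}(\hat{\lambda})) = \left( 1, [\zeta(\lambda_N)]^T \right) V(\bar{z}_\pi) \begin{pmatrix} 1 \\ \zeta(\lambda_N) \end{pmatrix}. \]

• The extra variance due to estimating the unknown parameter may or may not be negligible.

• Propose a kernel estimator to estimate the unknown derivative.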
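The variance expression on this slide is a quadratic form: the vector (1, derivative) applied on both sides of the covariance matrix of the weighted mean vector. The sketch below evaluates that quadratic form for a given matrix and derivative. How the matrix and the derivative are estimated is outside the sketch; the function name `linearized_variance` and the numbers are illustrative assumptions.

```python
import numpy as np


def linearized_variance(v_zbar, zeta):
    """Quadratic form (1, zeta^T) V (1, zeta^T)^T for a (1 + p) x (1 + p) matrix V."""
    a = np.concatenate(([1.0], np.atleast_1d(zeta).astype(float)))
    v = np.asarray(v_zbar, dtype=float)
    if v.shape != (a.size, a.size):
        raise ValueError("V must be square with dimension 1 + len(zeta)")
    return float(a @ v @ a)


# Made-up 2 x 2 covariance matrix of the mean vector (illustrative numbers only).
v = np.array([[4.0, 1.0],
              [1.0, 9.0]])
print(linearized_variance(v, zeta=0.0))   # only the first diagonal entry contributes
print(linearized_variance(v, zeta=-0.5))  # cross term and parameter term enter
```

When the derivative is zero, the result reduces to the first diagonal entry, which is the case where estimating the parameter adds no variance. A nonzero derivative brings in the cross term and the parameter's own variance.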

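Estimating distribution function using auxiliary information

• Ratio model:

\[ y_i = R x_i + \epsilon_i, \quad \epsilon_i \sim N(0, x_i \sigma^2) \]

• Use \hat{R} x_i as a substitute for y_i, where

\[ \hat{R} = \frac{\sum_{i \in S_\nu} y_i / \pi_i}{\sum_{i \in S_\nu} x_i / \pi_i}. \]

• Difference estimator:

\[ \hat{T}(\hat{R}) = \frac{1}{N} \left\{ \sum_{i \in S_\nu} \frac{I(y_i \le t)}{\pi_i} + \left[ \sum_{i \in U} I(\hat{R} x_i \le t) - \sum_{i \in S_\nu} \frac{I(\hat{R} x_i \le t)}{\pi_i} \right] \right\} \]

• The extra variance due to estimating the ratio is negligible (RKM, 1990).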
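A rough sketch of the difference estimator follows, assuming the auxiliary variable x is known for every population unit and the inclusion probabilities are known for the sampled units. The function names and the simulated population (generated from the ratio model with R = 2) are illustrative assumptions, not NRI data.

```python
import numpy as np


def ratio_estimate(y_s, x_s, pi_s):
    """Design-weighted ratio: sum(y_i / pi_i) / sum(x_i / pi_i) over the sample."""
    return np.sum(y_s / pi_s) / np.sum(x_s / pi_s)


def difference_cdf(t, y_s, x_s, pi_s, x_pop):
    """Difference estimator of the fraction of population y-values at or below t."""
    r_hat = ratio_estimate(y_s, x_s, pi_s)
    n_pop = x_pop.size
    # Weighted sample count of observed values at or below t.
    weighted_y = np.sum((y_s <= t) / pi_s)
    # Population count of proxy values r_hat * x at or below t.
    pop_proxy = np.sum(r_hat * x_pop <= t)
    # Weighted sample count of the same proxy indicator.
    weighted_proxy = np.sum((r_hat * x_s <= t) / pi_s)
    return (weighted_y + pop_proxy - weighted_proxy) / n_pop


# Simulated population following y = 2x + error with variance proportional to x.
rng = np.random.default_rng(1)
x_pop = rng.uniform(1.0, 10.0, size=1000)
y_pop = 2.0 * x_pop + rng.normal(size=1000) * np.sqrt(x_pop)

# Simple random sample of 100 units, so every inclusion probability is 0.1.
sample = rng.choice(1000, size=100, replace=False)
pi_s = np.full(100, 100 / 1000)

estimate = difference_cdf(10.0, y_pop[sample], x_pop[sample], pi_s, x_pop)
truth = np.mean(y_pop <= 10.0)
print(estimate, truth)
```

If the sample is the whole population with all inclusion probabilities equal to 1, the two proxy terms cancel and the estimator returns the population distribution function exactly, which is a convenient sanity check.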

Estimating a fraction below an estimated quantity

• Estimate the fraction of households in poverty when the poverty line is drawn at 60% of the median income:

\[ \hat{T}(\hat{q}) = \frac{1}{N} \sum_{i \in S_\nu} \frac{1}{\pi_i} I(y_i \le 0.6 \hat{q}) \]

with population quantity

\[ T_N(q_N) = \frac{1}{N} \sum_{i=1}^{N} I(y_i \le 0.6 q_N). \]

• Assume that \lim_{N \to \infty} T_N(\gamma) = F_Y(0.6\gamma); the extra variance then depends on \partial F_Y(0.6\gamma) / \partial \gamma.
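The poverty example can be sketched in a few lines. The weighted median below uses one common convention (the smallest value whose cumulative weight share reaches one half), which is an assumption rather than the definition used in the talk. The function names and the five toy incomes are made up.

```python
import numpy as np


def weighted_quantile(y, w, prob):
    """Smallest y value whose cumulative weight share reaches prob."""
    order = np.argsort(y)
    y_sorted, w_sorted = y[order], w[order]
    cum_share = np.cumsum(w_sorted) / np.sum(w_sorted)
    return y_sorted[np.searchsorted(cum_share, prob)]


def poverty_fraction(y_s, pi_s, n_pop, share_of_median=0.6):
    """Weighted fraction of units at or below share_of_median times the estimated median."""
    w = 1.0 / pi_s
    q_hat = weighted_quantile(y_s, w, 0.5)
    fraction = np.sum(w * (y_s <= share_of_median * q_hat)) / n_pop
    return fraction, q_hat


# Toy sample: five incomes, each drawn with inclusion probability 0.5 from N = 10.
incomes = np.array([10.0, 20.0, 30.0, 40.0, 100.0])
pi_s = np.full(5, 0.5)
fraction, q_hat = poverty_fraction(incomes, pi_s, n_pop=10)
print(q_hat, fraction)
```

The indicator inside the sum jumps whenever the estimated median moves the poverty line across an observed income, which is why the estimator is not a differentiable function of the estimated median.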

Concluding remarks

• Proposed an estimator for the subpopulation distance distribution and demonstrated its statistical properties.

• Application in a large-scale longitudinal survey.

• Theoretical extensions to nondifferentiable survey estimators.

Thank you
