multivariate outlier detection

Post on 22-Jan-2017


1

Outlier Identification in National Resources Inventory and Theoretical Extensions to Nondifferentiable Survey Estimators

Jianqiang Wang

Major Professor: Jean Opsomer

Committee: Wayne A. Fuller

Song X. Chen

Dan Nettleton

Dimitris Margaritis

2

Outline

• Introduction

• Notation and assumptions

• Mean and median-based inference

• Variance estimation

• Simulation study

• Application in National Resources Inventory

• Theoretical extensions

3

National Resources Inventory (1)

• The National Resources Inventory is a longitudinal survey of natural resources on non-Federal land in the U.S.

• Conducted by the USDA NRCS, in co-operation with CSSM at Iowa State University.

• Produces a longitudinal database containing numerous agro-environmental variables for scientific investigation and policy-making.

• Information was updated every 5 years before 1997 and annually thereafter, through a partially overlapping subsampling design.

4

National Resources Inventory (2)

• Covers various aspects of land use, farming practice, and environmentally important variables such as wetland status and soil erosion.

• Measures both level and change over time in these variables.

• Primary mode of data collection is a combination of aerial photography and field collection.

• Outliers arise from errors in data collection or processing, or from real points that themselves behave abnormally.

5

Outlier identification for a longitudinal survey

• Identify outliers for periodically updated data.

• Build outlier identification rules on previous years’ data and use the rules to flag current observations.

Observed years: 2001-2005

Training set: (2001, 2002, 2003)

Test set: (2003, 2004, 2005)

6

Target variables

• Non-pseudo core points with soil erosion in years 2001-2005.

• Training set variables: broad use, land use, C factor, support practice factor, slope, slope length and USLE loss in years 2001, 2002 and 2003.

• USLE loss represents the potential long-term soil loss in tons/acre:

USLELOSS = R * K * LS * C * P
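Because USLE loss is a plain product, an error in any one factor carries through proportionally. A minimal sketch follows. The factor meanings (R rainfall erosivity, K soil erodibility, LS slope length and steepness, C cover management, P support practice) are the standard USLE reading, and every numeric value is made up for illustration.

```python
def usle_loss(r, k, ls, c, p):
    """Return USLE soil loss as the product R * K * LS * C * P."""
    return r * k * ls * c * p


# Hypothetical factor values for a single point (not NRI data).
base = usle_loss(r=150.0, k=0.3, ls=1.2, c=0.2, p=1.0)

# The same point with the C factor entered ten times too large.
shifted = usle_loss(r=150.0, k=0.3, ls=1.2, c=2.0, p=1.0)

print(round(base, 2), round(shifted, 2))
```

A tenfold slip in C gives a tenfold USLE loss, which is one reason to screen the factors and the loss jointly.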

7

Point classification

b.u.  Point type               b.u.  Point type
1     Cultivated cropland      7     Urban and built-up land
2     Noncultivated cropland   8     Rural transportation
3     Pastureland              9     Small water areas
4     Rangeland                10    Large water areas
5     Forest land              11    Federal land
6     Minor land               12    CRP

8

Initial partitioning

• Initial partitioning uses geographical association and broad use category.

  - Partition national data into state-wise categories.

  - Collapse northeastern states.

  - Partition each region, based on broad use sequence, into (1,1,1), (2,2,2), (3,3,3), (12,12,12) and points with broad use change.

  - Merge points with the same broad use change pattern, e.g. (2,2,3), (1,1,12).
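The partitioning steps can be sketched as a keying function. This is an illustration only and rests on two assumptions: the map of collapsed states is hypothetical, since the slides do not list which states were merged, and points with a broad use change are pooled by pattern across regions, which is one possible reading of the merge step.

```python
from collections import defaultdict

# Hypothetical collapsing map; the actual list of merged states is not given.
COLLAPSED_REGION = {"CT": "Northeast", "MA": "Northeast", "RI": "Northeast"}


def partition_key(state, broad_use_seq):
    """Return the partition cell for one point."""
    region = COLLAPSED_REGION.get(state, state)
    seq = tuple(broad_use_seq)
    if len(set(seq)) == 1:
        # No broad use change: one cell per region and broad use, e.g. (1, 1, 1).
        return (region, seq)
    # Broad use change: pool points that share the same pattern (assumed
    # here to be pooled across regions).
    return ("change", seq)


def partition(points):
    """Group point ids by cell; points are (point_id, state, broad_use_seq)."""
    cells = defaultdict(list)
    for point_id, state, seq in points:
        cells[partition_key(state, seq)].append(point_id)
    return dict(cells)


# Made-up points for illustration.
points = [
    ("p1", "IA", (1, 1, 1)),
    ("p2", "IA", (2, 2, 3)),
    ("p3", "CT", (3, 3, 3)),
    ("p4", "MA", (3, 3, 3)),
    ("p5", "NE", (2, 2, 3)),
]
cells = partition(points)
for key, ids in cells.items():
    print(key, ids)
```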

9

Source of outlyingness

• Flag 1% of the points on the training set, and compare test distances with the 99% quantile of the training distances.

• Source of outlyingness:

\[
e_{\nu,i} = \frac{\hat{\Sigma}_{\nu}^{-1/2}(\mu_{\nu} - y_i)}{\left\| \hat{\Sigma}_{\nu}^{-1/2}(\mu_{\nu} - y_i) \right\|}
\]
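A rough sketch of the flagging rule and of the direction vector e follows. It is simplified on purpose: the scatter matrix is taken to be diagonal (so the inverse square root is a per-variable scaling), the location is the unweighted mean, and survey weights are ignored. None of these simplifications come from the slides, and the data are made up.

```python
import math
import statistics


def fit(train):
    """Column means and variances of the training rows (diagonal scatter assumed)."""
    cols = list(zip(*train))
    mu = [statistics.mean(c) for c in cols]
    var = [statistics.pvariance(c) for c in cols]
    return mu, var


def standardized_residual(y, mu, var):
    """Sigma^{-1/2} (mu - y) for a diagonal Sigma."""
    return [(m - v) / math.sqrt(s) for v, m, s in zip(y, mu, var)]


def distance(y, mu, var):
    """Norm of the standardized residual."""
    return math.sqrt(sum(z * z for z in standardized_residual(y, mu, var)))


def direction(y, mu, var):
    """Unit vector e; components large in absolute value name the suspicious variables."""
    z = standardized_residual(y, mu, var)
    norm = math.sqrt(sum(c * c for c in z))
    return [c / norm for c in z] if norm > 0 else [0.0] * len(z)


def flag(train, test, q=0.99):
    """Flag test rows whose distance exceeds the q-quantile of training distances."""
    mu, var = fit(train)
    d_train = sorted(distance(y, mu, var) for y in train)
    # Simple order-statistic quantile; other quantile definitions are possible.
    cutoff = d_train[min(len(d_train) - 1, math.ceil(q * len(d_train)) - 1)]
    return [(distance(y, mu, var) > cutoff, direction(y, mu, var)) for y in test]


# Made-up two-variable training data and two test points.
train = [(float(i % 10), float((i * 7) % 10)) for i in range(100)]
res = flag(train, [(4.5, 40.0), (4.0, 5.0)])
for is_outlier, e in res:
    print(is_outlier, [round(c, 3) for c in e])
```

For the first test point nearly all of e sits on the second variable, which is how the direction vector points a reviewer to the variable behind the flag.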

10

Analysis of flagged points

• Agricultural specialists analyzed identified points by suspicious variables.

• C factor: almost all points were considered suspicious.

  - Data entry errors

  - Invalid entries: C factor = 1 for hayland, pastureland or CRP

  - Unusual levels or trends in relation to land use

(0.013, 0.13, 0.013, 0.013, 0.013)

(0.011, 0.06, 0.11, 0.003, 0.003)
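The review on this slide was done by specialists. As a hypothetical automated pre-screen, the two kinds of C factor problems could be checked roughly as below. The land type labels, the function names and the 5-fold threshold are all assumptions made for illustration, not rules from the study.

```python
import statistics

# Land types for which the slide treats a C factor of 1 as invalid.
NO_UNIT_C_FACTOR = {"hayland", "pastureland", "CRP"}


def invalid_c_factor(land_type, c_factor):
    """True when a C factor of exactly 1 is recorded for a listed land type."""
    return land_type in NO_UNIT_C_FACTOR and c_factor == 1.0


def spike_years(series, ratio=5.0):
    """Indices of values more than `ratio`-fold away from the series median.

    The 5-fold default is an arbitrary illustration.
    """
    med = statistics.median(series)
    flagged = []
    for i, value in enumerate(series):
        lo, hi = min(value, med), max(value, med)
        if hi > ratio * lo:
            flagged.append(i)
    return flagged


print(spike_years((0.013, 0.13, 0.013, 0.013, 0.013)))
print(spike_years((0.011, 0.06, 0.11, 0.003, 0.003)))
```

In the first sequence only the second year stands out, which is consistent with a misplaced decimal point. The second sequence shows a trend rather than a single slip.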

11

Analysis of flagged points

• P factor: all points are candidates for review because of the change over time.

• Slope length: all points were flagged because of the level, not change over time.

(1.0, 1.0, 1.0, 0.6, 1.0)

12

Nondifferentiable survey estimators

• The sample distance distribution \(\hat{D}_{\nu,d}(\mu_{\nu})\) is a nondifferentiable function of the estimated location parameter.

• A general class of survey estimators:

\[
\hat{T}(\hat{\lambda}) = \frac{1}{N} \sum_{i \in S_{\nu}} \frac{1}{\pi_i} h(y_i; \hat{\lambda})
\]

with corresponding population quantity

\[
T_N(\lambda_N) = \frac{1}{N} \sum_{i=1}^{N} h(y_i; \lambda_N),
\]

where \(h\) is not necessarily differentiable.

• A direct Taylor linearization may not be applicable; again use a differentiable limiting function \(T(\gamma) = \lim_{N \to \infty} T_N(\gamma)\), with derivative \(\zeta(\gamma)\).
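A small sketch of the general class, with an indicator as the nondifferentiable h. The sample values, the inclusion probabilities and the function names are made up for illustration.

```python
def t_hat(sample, h, lam_hat, n_pop):
    """(1/N) * sum over sampled units of h(y_i; lam_hat) / pi_i.

    sample: (y_i, pi_i) pairs, with pi_i the inclusion probability.
    """
    return sum(h(y, lam_hat) / pi for y, pi in sample) / n_pop


def indicator_below(y, lam):
    """A nondifferentiable h: 1 if y <= lam, else 0."""
    return 1.0 if y <= lam else 0.0


# Made-up sample; the inverse inclusion probabilities sum to N = 12.
sample = [(1.0, 0.5), (2.0, 0.5), (3.0, 0.25), (10.0, 0.25)]
N = 12

just_below = t_hat(sample, indicator_below, 2.999, N)
at_three = t_hat(sample, indicator_below, 3.0, N)
print(just_below, at_three)
```

The estimate jumps as the plugged-in parameter crosses an observed value, so it is a step function of the parameter and has no Taylor expansion in it. This is why the limiting function is used instead.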

13

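Asymptotics

• Under certain regularity conditions,

\[
n^{*1/2} \left[ V(\hat{T}(\hat{\lambda})) \right]^{-1/2} \left( \hat{T}(\hat{\lambda}) - T_N(\lambda_N) \right) \Big| \mathcal{F} \xrightarrow{d} N(0, 1),
\]

where

\[
V(\hat{T}(\hat{\lambda})) = \left( 1, [\zeta(\lambda_N)]^{T} \right) V(\bar{z}_{\pi}) \begin{pmatrix} 1 \\ \zeta(\lambda_N) \end{pmatrix}.
\]

• The extra variance due to estimating the unknown parameter may or may not be negligible.

• Propose a kernel estimator for the unknown derivative.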

14

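Estimating distribution function using auxiliary information

• Ratio model:

\[
y_i = R x_i + \epsilon_i, \qquad \epsilon_i \sim N(0, x_i \sigma^2)
\]

• Use \(\hat{R} x_i\) as a substitute for \(y_i\), where

\[
\hat{R} = \frac{\sum_{S_{\nu}} y_i / \pi_i}{\sum_{S_{\nu}} x_i / \pi_i}.
\]

• Difference estimator:

\[
\hat{T}(\hat{R}) = \frac{1}{N} \left\{ \sum_{S_{\nu}} \frac{1}{\pi_i} I(y_i \le t) + \left[ \sum_{U} I(\hat{R} x_i \le t) - \sum_{S_{\nu}} \frac{1}{\pi_i} I(\hat{R} x_i \le t) \right] \right\}
\]

• The extra variance due to estimating the ratio is negligible (RKM, 1990).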
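The ratio and the difference estimator translate directly into code. A minimal sketch with made-up data: the auxiliary variable x is assumed known for every population unit, and the function names are illustrative.

```python
def ratio_estimate(sample):
    """Sample ratio: sum(y_i / pi_i) / sum(x_i / pi_i); items are (y, x, pi)."""
    num = sum(y / pi for y, x, pi in sample)
    den = sum(x / pi for y, x, pi in sample)
    return num / den


def difference_cdf(sample, population_x, t):
    """Difference estimator of the distribution function at t."""
    n_pop = len(population_x)
    r = ratio_estimate(sample)
    # Weighted sample count of y values at or below t.
    ht_y = sum((y <= t) / pi for y, x, pi in sample)
    # Fitted values R * x at or below t: population count and its sample estimate.
    pop_fit = sum(r * x <= t for x in population_x)
    ht_fit = sum((r * x <= t) / pi for y, x, pi in sample)
    return (ht_y + (pop_fit - ht_fit)) / n_pop


# Made-up population of six x values; three units sampled with pi = 0.5.
population_x = [1, 2, 3, 4, 5, 6]
sample = [(2.0, 1, 0.5), (6.5, 3, 0.5), (9.5, 5, 0.5)]

r_hat = ratio_estimate(sample)
f_hat = difference_cdf(sample, population_x, t=6.2)
print(r_hat, round(f_hat, 4))
```

The bracketed correction uses the known population x values to adjust the weighted sample count.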

15

Estimating a fraction below an estimated quantity

• Estimate the fraction of households in poverty when the poverty line is drawn at 60% of the median income:

\[
\hat{T}(\hat{q}) = \frac{1}{N} \sum_{S_{\nu}} \frac{1}{\pi_i} I(y_i \le 0.6\,\hat{q})
\]

with population quantity

\[
T_N(q_N) = \frac{1}{N} \sum_{i=1}^{N} I(y_i \le 0.6\,q_N).
\]

• Assume that \(\lim_{N \to \infty} T_N(\gamma) = F_Y(0.6\gamma)\); the extra variance depends on \(\partial F_Y(0.6\gamma) / \partial \gamma\).
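A minimal sketch of the poverty-rate estimator. The median is estimated here by inverting the weighted empirical distribution function, which is an assumption since the slides do not say how the median is estimated. The incomes and inclusion probabilities are made up.

```python
def weighted_median(sample):
    """Smallest y whose cumulative weight 1/pi reaches half of the total weight."""
    items = sorted(sample)
    total = sum(1.0 / pi for y, pi in items)
    cum = 0.0
    for y, pi in items:
        cum += 1.0 / pi
        if cum >= total / 2:
            return y
    return None  # empty sample


def poverty_rate(sample, n_pop, share=0.6):
    """Weighted fraction of units with y at or below share * estimated median."""
    q_hat = weighted_median(sample)
    return sum((y <= share * q_hat) / pi for y, pi in sample) / n_pop


# Made-up incomes with inclusion probability 0.5 each; N = 10.
sample = [(10.0, 0.5), (20.0, 0.5), (30.0, 0.5), (40.0, 0.5), (100.0, 0.5)]

q_hat = weighted_median(sample)
rate = poverty_rate(sample, n_pop=10)
print(q_hat, rate)
```

The estimated median enters through the indicator threshold, so its sampling error moves the poverty line itself. That is the source of the extra variance term.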

16

Concluding remarks

• Proposed an estimator for subpopulation distance distribution and demonstrated its statistical properties.

• Application in a large-scale longitudinal survey.

• Theoretical extensions to nondifferentiable survey estimators.

17

Thank you
