1
Outlier Identification in National Resources Inventory and Theoretical Extensions to Nondifferentiable Survey Estimators
Jianqiang Wang
Major Professor: Jean Opsomer
Committee: Wayne A. Fuller
Song X. Chen
Dan Nettleton
Dimitris Margaritis
2
Outline
- Introduction
- Notation and assumptions
- Mean and median-based inference
- Variance estimation
- Simulation study
- Application in National Resources Inventory
- Theoretical extensions
3
National Resources Inventory (1)
- The National Resources Inventory is a longitudinal survey of natural resources on non-Federal land in the U.S.
- Conducted by the USDA NRCS, in co-operation with CSSM at Iowa State University.
- Produces a longitudinal database containing numerous agro-environmental variables for scientific investigation and policy-making.
- Information was updated every 5 years before 1997 and annually thereafter, through a partially overlapping subsampling design.
4
National Resources Inventory (2)
- Covers various aspects of land use, farming practice, and environmentally important variables such as wetland status and soil erosion.
- Measures both level and change over time in these variables.
- Primary mode of data collection is a combination of aerial photography and field collection.
- Outliers arise from errors in data collection or processing, or from real points that themselves behave abnormally.
5
Outlier identification for a longitudinal survey
- Identify outliers for periodically updated data.
- Build outlier identification rules on previous years’ data and use the rules to flag current observations.
Observed years: 2001-2005
Training set: (2001, 2002, 2003)
Test set: (2003, 2004, 2005)
6
Target variables
- Non-pseudo core points with soil erosion in years 2001-2005.
- Training set variables: broad use, land use, C factor, support practice factor, slope, slope length and USLE loss in years 2001, 2002 and 2003.
- USLE loss represents the potential long-term soil loss in tons/acre:
USLELOSS = R * K * LS * C * P
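Because USLE loss is a plain product of factors, an error in any single factor scales the loss proportionally. The sketch below illustrates this with hypothetical factor values (none of the numbers come from the NRI); the function name `usle_loss` is an assumption made for illustration.

```python
def usle_loss(r, k, ls, c, p):
    """Potential long-term soil loss (tons/acre) as the product R * K * LS * C * P."""
    return r * k * ls * c * p

# Hypothetical factor values for a single point (not NRI data).
base = usle_loss(r=150.0, k=0.3, ls=1.2, c=0.2, p=1.0)

# Same point with the C factor entered as 1.0 instead of 0.2,
# an illustrative data entry error.
bad = usle_loss(r=150.0, k=0.3, ls=1.2, c=1.0, p=1.0)

# The loss is inflated by the same ratio as the mis-entered factor.
inflation = bad / base
print(base, bad, inflation)
```

A multiplicative structure like this is one reason a single mis-keyed factor can make a point stand out in the USLE loss variable as well.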
7
Point classification
b.u.  Point type               b.u.  Point type
1     Cultivated cropland      7     Urban and built-up land
2     Noncultivated cropland   8     Rural transportation
3     Pastureland              9     Small water areas
4     Rangeland                10    Large water areas
5     Forest land              11    Federal land
6     Minor land               12    CRP
8
Initial partitioning
- Initial partitioning uses geographical association and broad use category.
  - Partition national data into state-wise categories.
  - Collapse northeastern states.
  - Partition each region based on broad use sequence into (1,1,1), (2,2,2), (3,3,3), (12,12,12) and points with broad use change.
  - Merge points with the same broad use change pattern, say (2,2,3) or (1,1,12).
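The partitioning steps above can be sketched as a keying function. This is a minimal illustration under stated assumptions: the set of states collapsed into a northeastern region is hypothetical (the slides do not list them), and points with a broad use change are assumed to be pooled across regions by change pattern, which is one reading of the merge step.

```python
from collections import defaultdict

# Hypothetical set of states to collapse; the slides do not list them.
NORTHEAST = {"CT", "MA", "ME", "NH", "RI", "VT"}

def region_of(state):
    """Collapse the (assumed) northeastern states into one region."""
    return "Northeast" if state in NORTHEAST else state

def cell_key(state, broad_use):
    """Return the partition cell for a point from its state and broad use sequence."""
    seq = tuple(broad_use)
    if len(set(seq)) == 1:
        # Constant broad use, e.g. (1,1,1): partition within region.
        return (region_of(state), seq)
    # Broad use change, e.g. (2,2,3): assumed pooled across regions by pattern.
    return ("pooled", seq)

def partition(points):
    """Group point ids by partition cell."""
    cells = defaultdict(list)
    for point_id, state, broad_use in points:
        cells[cell_key(state, broad_use)].append(point_id)
    return dict(cells)

# Made-up points: (id, state, broad use in three consecutive years).
points = [
    ("p1", "IA", (1, 1, 1)),
    ("p2", "IA", (1, 1, 1)),
    ("p3", "VT", (3, 3, 3)),
    ("p4", "ME", (3, 3, 3)),
    ("p5", "IA", (2, 2, 3)),
    ("p6", "NE", (2, 2, 3)),
]
cells = partition(points)
for key, ids in cells.items():
    print(key, ids)
```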
9
Source of outlyingness
- Flagged 1% of the points on the training set, and compared test distances with the 99% quantile of the training distances.
- Source of outlyingness:
$$e_{\nu,i} = \frac{\hat\Sigma_\nu^{-1/2}(\mu_\nu - y_i)}{\|\hat\Sigma_\nu^{-1/2}(\mu_\nu - y_i)\|}$$
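The flagging rule and the direction vector on this slide can be sketched as follows. This is an illustration only: it uses simulated data, an unweighted sample mean and covariance (the survey-weighted, mean- or median-based estimators of the talk are not reproduced), and hypothetical helper names `inv_sqrt` and `standardized_residuals`.

```python
import numpy as np

def inv_sqrt(sigma):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(sigma)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def standardized_residuals(y, mu, sigma):
    """Rows are Sigma^{-1/2} (mu - y_i); valid because inv_sqrt is symmetric."""
    return (mu - y) @ inv_sqrt(sigma)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 3))   # simulated training data
test = rng.normal(size=(200, 3))    # simulated test data
test[0] = [0.0, 8.0, 0.0]           # planted anomaly in the second variable

# Unweighted location and scatter from the training set (an assumption).
mu = train.mean(axis=0)
sigma = np.cov(train, rowvar=False)

z_train = standardized_residuals(train, mu, sigma)
z_test = standardized_residuals(test, mu, sigma)
d_train = np.linalg.norm(z_train, axis=1)
d_test = np.linalg.norm(z_test, axis=1)

# Rule built on the training set: the 99% quantile of training distances.
cutoff = np.quantile(d_train, 0.99)
flagged = d_test > cutoff

# Unit direction vectors for flagged test points; the component with the
# largest absolute value points to the variable driving the outlyingness.
e = z_test[flagged] / d_test[flagged, None]
suspect_variable = np.abs(e).argmax(axis=1)
print(cutoff, flagged.sum(), suspect_variable)
```

Each row of `e` has unit length, so its components can be compared across points regardless of how far the point lies from the center.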
10
Analysis of flagged points
- Agricultural specialists analyzed identified points by suspicious variables.
- C factor: almost all points were considered suspicious.
  - Data entry errors
  - Invalid entries: C factor = 1 for hayland, pastureland or CRP
  - Unusual levels or trends in relation to land use
(0.013, 0.13, 0.013, 0.013, 0.013)
(0.011, 0.06, 0.11, 0.003, 0.003)
11
Analysis of flagged points
- P factor: all points are candidates for review because of the change over time.
- Slope length: all points were flagged because of the level, not the change over time.
(1.0, 1.0, 1.0, 0.6, 1.0)
12
Nondifferentiable survey estimators
- The sample distance distribution $\hat D_{\nu,d}(\mu_\nu)$ is a nondifferentiable function of the estimated location parameter.
- A general class of survey estimators:
$$\hat T(\hat\lambda) = \frac{1}{N}\sum_{i\in S_\nu}\frac{1}{\pi_i}\,h(y_i;\hat\lambda),$$
with corresponding population quantity
$$T_N(\lambda_N) = \frac{1}{N}\sum_{i=1}^{N} h(y_i;\lambda_N).$$
Here $h$ is not necessarily differentiable.
- A direct Taylor linearization may not be applicable; again use a differentiable limiting function $T(\gamma) = \lim_{N\to\infty} T_N(\gamma)$, with derivative $\zeta(\gamma)$.
13
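Asymptotics
- Under certain regularity conditions,
$$n^{*1/2}\,\bigl[V(\hat T(\hat\lambda))\bigr]^{-1/2}\,\bigl(\hat T(\hat\lambda) - T_N(\lambda_N)\bigr) \,\Big|\, \mathcal{F} \xrightarrow{d} N(0,1),$$
where
$$V(\hat T(\hat\lambda)) = \bigl(1, [\zeta(\lambda_N)]^T\bigr)\, V(\bar z_\pi) \begin{pmatrix} 1 \\ \zeta(\lambda_N) \end{pmatrix}.$$
- The extra variance due to estimating the unknown parameter may or may not be negligible.
- Propose a kernel estimator to estimate the unknown derivative.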
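The variance expression on this slide can be expanded to show where the extra variance enters. As a sketch, assume (this partition is not shown on the slide) that $V(\bar z_\pi)$ is split conformably with $(1, [\zeta(\lambda_N)]^T)$ into blocks $V_{11}$, $V_{12}$, $V_{21} = V_{12}^T$ and $V_{22}$:

```latex
V(\bar z_\pi) =
\begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix},
\qquad
V(\hat T(\hat\lambda))
  = V_{11}
  + 2\,[\zeta(\lambda_N)]^{T} V_{21}
  + [\zeta(\lambda_N)]^{T} V_{22}\,\zeta(\lambda_N).
```

Under this partition, the last two terms are the contribution from estimating the unknown parameter. They vanish when $\zeta(\lambda_N) = 0$, which is one way the extra variance can be negligible.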
14
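Estimating distribution function using auxiliary information
- Ratio model:
$$y_i = R x_i + \epsilon_i, \qquad \epsilon_i \sim N(0, x_i\sigma^2)$$
- Use $\hat R x_i$ as a substitute of $y_i$, where
$$\hat R = \frac{\sum_{S_\nu} y_i/\pi_i}{\sum_{S_\nu} x_i/\pi_i}.$$
- Difference estimator:
$$\hat T(\hat R) = \frac{1}{N}\Bigl\{\sum_{S_\nu}\frac{1}{\pi_i} I(y_i \le t) + \Bigl[\sum_{U} I(\hat R x_i \le t) - \sum_{S_\nu}\frac{1}{\pi_i} I(\hat R x_i \le t)\Bigr]\Bigr\}$$
- The extra variance due to estimating the ratio is negligible (RKM, 1990).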
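The difference estimator on this slide can be sketched in a few lines. The data below are made up, and the function names `ratio_estimate` and `difference_cdf` are hypothetical; the code follows the slide's formula term by term, with the estimated ratio plugged in.

```python
def ratio_estimate(y_s, x_s, pi_s):
    """Estimated ratio: weighted sample total of y over weighted sample total of x."""
    num = sum(yi / p for yi, p in zip(y_s, pi_s))
    den = sum(xi / p for xi, p in zip(x_s, pi_s))
    return num / den

def difference_cdf(t, y_s, x_s, pi_s, x_pop):
    """Difference estimator of the distribution function at t."""
    n_pop = len(x_pop)
    r_hat = ratio_estimate(y_s, x_s, pi_s)
    # Weighted sample count of observed y_i at or below t.
    ht_y = sum((yi <= t) / p for yi, p in zip(y_s, pi_s))
    # Population count of predicted values r_hat * x_i at or below t.
    pop_pred = sum(r_hat * xi <= t for xi in x_pop)
    # Weighted sample count of the same predicted indicator.
    ht_pred = sum((r_hat * xi <= t) / p for xi, p in zip(x_s, pi_s))
    return (ht_y + pop_pred - ht_pred) / n_pop

# Made-up population of auxiliary values and a sample with equal
# inclusion probabilities of 0.5.
x_pop = [1, 2, 3, 4, 5, 6, 7, 8]
x_s = [3, 4, 6, 7]
y_s = [6.5, 7.5, 12.5, 13.5]
pi_s = [0.5, 0.5, 0.5, 0.5]

r_hat = ratio_estimate(y_s, x_s, pi_s)
f_hat = difference_cdf(7.0, y_s, x_s, pi_s, x_pop)
print(r_hat, f_hat)
```

The bracketed correction uses the auxiliary variable on the whole population, so the estimator benefits from `x_pop` even though `y` is observed only on the sample.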
15
Estimating a fraction below an estimated quantity
- Estimate the fraction of households in poverty when the poverty line is drawn at 60% of the median income:
$$\hat T(\hat q) = \frac{1}{N}\sum_{S_\nu}\frac{1}{\pi_i} I(y_i \le 0.6\,\hat q),$$
with population quantity
$$T_N(q_N) = \frac{1}{N}\sum_{i=1}^{N} I(y_i \le 0.6\,q_N).$$
- Assume that $\lim_{N\to\infty} T_N(\gamma) = F_Y(0.6\gamma)$; the extra variance depends on $\dfrac{\partial F_Y(0.6\gamma)}{\partial\gamma}$.
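The poverty example can be sketched numerically. The incomes and inclusion probabilities below are made up, and all function names are hypothetical. The derivative estimate uses a symmetric finite difference of the weighted distribution function, which amounts to a uniform kernel; it is a stand-in chosen for illustration, not necessarily the kernel estimator proposed in the talk.

```python
def weighted_quantile(y, w, prob):
    """Smallest y whose cumulative weight share reaches prob."""
    pairs = sorted(zip(y, w))
    total = sum(w)
    cum = 0.0
    for yi, wi in pairs:
        cum += wi
        if cum >= prob * total:
            return yi
    return pairs[-1][0]

def weighted_cdf(t, y, w, n_pop):
    """Weighted estimate of the fraction of the population with y <= t."""
    return sum(wi for yi, wi in zip(y, w) if yi <= t) / n_pop

def poverty_fraction(y, pi, n_pop, share=0.6):
    """Fraction below share * (estimated median), plus the estimated median."""
    w = [1.0 / p for p in pi]
    q_hat = weighted_quantile(y, w, 0.5)
    return weighted_cdf(share * q_hat, y, w, n_pop), q_hat

def derivative_estimate(y, pi, n_pop, gamma, bandwidth, share=0.6):
    """Finite-difference (uniform kernel) estimate of dF_Y(share * gamma) / d gamma."""
    w = [1.0 / p for p in pi]
    upper = weighted_cdf(share * (gamma + bandwidth), y, w, n_pop)
    lower = weighted_cdf(share * (gamma - bandwidth), y, w, n_pop)
    return (upper - lower) / (2.0 * bandwidth)

# Made-up household incomes and inclusion probabilities; weights sum to N = 32.
y = [12, 18, 25, 33, 40, 48, 55, 70, 90, 150]
pi = [0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25]
n_pop = 32

frac, q_hat = poverty_fraction(y, pi, n_pop)
slope = derivative_estimate(y, pi, n_pop, gamma=q_hat, bandwidth=20.0)
print(q_hat, frac, slope)
```

The bandwidth of 20 is arbitrary; with real data it would need to be chosen with care, since the derivative estimate is sensitive to it.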
16
Concluding remarks
- Proposed an estimator for the subpopulation distance distribution and demonstrated its statistical properties.
- Application in a large-scale longitudinal survey.
- Theoretical extensions to nondifferentiable survey estimators.
17
Thank you