Loss Functions for Detecting Outliers in Panel Data
Charles D. Coleman, Thomas Bryan, and Jason E. Devine, U.S. Census Bureau. Prepared for the Spring 2000 meetings of the Federal-State Cooperative Program for Population Estimates, Los Angeles, CA, March 2000.
![Page 1: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/1.jpg)
Loss Functions for Detecting Outliers in Panel Data
Charles D. Coleman
Thomas Bryan
Jason E. Devine
U.S. Census Bureau
Prepared for the Spring 2000 meetings of the Federal-State Cooperative Program for Population Estimates, Los Angeles, CA, March 2000
![Page 2: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/2.jpg)
Panel Data
A.k.a. “longitudinal data.”
$x_{it}$:
– $i$ indexes cross-sectional units, which retain their identities over time. Examples: geographic areas, persons, households, companies, autos.
– $t$ indexes time, which can be chronological or nominal.
– Chronological time measures the time elapsed between two dates.
– Nominal time indexes different sets of estimates and can also index true values.
![Page 3: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/3.jpg)
Notation
• $B_i$ is the base value for unit $i$.
• $F_i$ is the “future” value for unit $i$.
• $F_{it}$ is the future value for unit $i$ at time $t$.
• $B_i, F_i, F_{it} > 0$.
• $\epsilon_i = |F_i - B_i|$ is the absolute difference for unit $i$.
• Subscripts are dropped when not needed.
![Page 4: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/4.jpg)
What is an Outlier?
“[An outlier is] an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.”
D.M. Hawkins, Identification of Outliers, 1980, p. 1.
![Page 5: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/5.jpg)
Meaning of an Outlier
• Either
– An indication of a problem with the data generation process.
• Or
– A true, but unusual, statement about reality.
![Page 6: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/6.jpg)
Loss Functions
• Motivations: The $\epsilon_i$ come from unknown distributions. We want to compare multiple size classes on the same basis.
• $L(F_i; B_i) \equiv \ell(\epsilon_i, B_i)$ is the loss function for observation $i$.
• Loss functions measure “badness.”
• Loss functions produce rankings of observations to be examined.
• Loss functions are empirically based, except for one special case in nominal time.
![Page 7: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/7.jpg)
Assumption 1
Loss is symmetric in error:
$L(B + \epsilon; B) = L(B - \epsilon; B)$
![Page 8: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/8.jpg)
Assumption 2
Loss increases in difference:
$\partial \ell / \partial \epsilon > 0$
![Page 9: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/9.jpg)
Assumption 3
Loss decreases in base value:
$\partial \ell / \partial B < 0$
![Page 10: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/10.jpg)
Property 1
The loss associated with a given absolute percentage difference ($|\epsilon / B|$) increases in $B$.
![Page 11: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/11.jpg)
Simplest Loss Function
$L(F; B) = |F - B| B^q$ (1a)
or
$\ell(\epsilon, B) = \epsilon B^q$ (1b)
with
$0 > q > -1$.
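Loss function (1a) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `simple_loss`, the default q = -0.5, and the sample numbers are all assumptions made for the example.

```python
def simple_loss(future, base, q=-0.5):
    """Loss (1a): |F - B| * B**q, for positive F and B and -1 < q < 0."""
    if base <= 0 or future <= 0:
        raise ValueError("base and future values must be positive")
    if not -1.0 < q < 0.0:
        raise ValueError("q must lie strictly between -1 and 0")
    return abs(future - base) * base ** q


# The same 10% difference on a small and a large base (illustrative numbers).
small = simple_loss(110.0, 100.0)        # 10 * 100**-0.5
large = simple_loss(11_000.0, 10_000.0)  # 1000 * 10000**-0.5

# The same absolute difference of 10 on a larger base.
same_diff_bigger_base = simple_loss(1_010.0, 1_000.0)
```

Here `large` exceeds `small`, as Property 1 requires for a fixed percentage difference, while `same_diff_bigger_base` is below `small`, as Assumption 3 requires for a fixed absolute difference.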
![Page 12: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/12.jpg)
Loss as Weighted Combination of Absolute Difference and Absolute Percentage Difference
$\tilde{L}(F; B) = |F - B|^r \left| \frac{F - B}{B} \right|^s$
• This generates a loss function with $q = -s/(r + s)$.
• An infinite number of pairs $(r, s)$ correspond to any given $q$.
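The step from the weighted combination to $q$ can be spelled out as follows. This is a sketch that assumes $r, s > 0$ (the slide does not state the signs) and writes $\epsilon = |F - B|$.

```latex
\begin{align*}
\tilde{L}(F;B) &= |F-B|^{r}\left|\frac{F-B}{B}\right|^{s} = \epsilon^{\,r+s}\, B^{-s},\\
\tilde{L}(F;B)^{1/(r+s)} &= \epsilon\, B^{-s/(r+s)} = \epsilon B^{q},
\qquad q = -\frac{s}{r+s}.
\end{align*}
```

Raising to the power $1/(r+s)$ is an increasing transformation, so it leaves the ranking of observations unchanged. Under the assumed signs, $q$ falls between $-1$ and $0$, and any pair $(cr, cs)$ with $c > 0$ gives the same $q$.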
![Page 13: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/13.jpg)
Outlier Criterion
• An outlier is declared whenever $L(F; B) \equiv \ell(\epsilon, B) > C$.
• $C$ is the “critical value.”
• $C$ can be determined in advance, or as a function of the data (e.g., a quantile or a multiple of a scale measure).
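The criterion can be sketched as below, with the critical value either fixed in advance or derived from the data. The choice of three times the median loss as the data-driven rule, the function names, and the sample pairs are assumptions for illustration only.

```python
import statistics


def loss(future, base, q=-0.5):
    """Loss (1a): |F - B| * B**q, with B > 0 and -1 < q < 0."""
    return abs(future - base) * base ** q


def flag_outliers(pairs, q=-0.5, critical=None, multiple=3.0):
    """Flag (base, future) pairs whose loss exceeds the critical value C.

    If `critical` is None, C is set to `multiple` times the median loss,
    one possible "multiple of a scale measure" (an assumption of this sketch).
    Returns the list of flags and the critical value used.
    """
    losses = [loss(future, base, q) for base, future in pairs]
    if critical is None:
        critical = multiple * statistics.median(losses)
    return [value > critical for value in losses], critical


# Hypothetical (base, future) pairs.
pairs = [(100.0, 104.0), (1_000.0, 1_030.0), (10_000.0, 10_200.0),
         (500.0, 650.0), (20_000.0, 20_100.0)]
flags_data, c_data = flag_outliers(pairs)                  # C from the data
flags_fixed, c_fixed = flag_outliers(pairs, critical=1.5)  # C fixed in advance
```

With the data-driven rule only the pair (500, 650) is flagged; lowering C to 1.5 also flags (10000, 10200).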
![Page 14: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/14.jpg)
Loss Function Variants
• Time-Invariant Loss Function
• Signed Loss Function
• Nominal Time
![Page 15: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/15.jpg)
Time-Invariant Loss Function
• Idea: Compare multiple dates of data on the same basis.
• Time need not be a round number.
• $L(F_{it}; B_i, t) = |F_{it} - B_i| B_i^{tq}$
• Property 1 is satisfied as long as $t < -1/q$.
• Thus, the useful horizon is limited.
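A minimal sketch of the time-invariant loss with the horizon check built in. The function name, the requirement that t be positive, and the sample values are assumptions of this example.

```python
def time_invariant_loss(future_t, base, t, q=-0.5):
    """Time-invariant loss |F_it - B_i| * B_i**(t*q).

    Property 1 holds only while t < -1/q, so values of t outside
    (0, -1/q) are rejected (t > 0 is an assumption of this sketch).
    """
    if not -1.0 < q < 0.0:
        raise ValueError("q must lie strictly between -1 and 0")
    horizon = -1.0 / q
    if not 0.0 < t < horizon:
        raise ValueError(f"t must lie in (0, {horizon}) for Property 1 to hold")
    return abs(future_t - base) * base ** (t * q)


# At t = 1 this reduces to the simplest loss function (1a).
one_period = time_invariant_loss(110.0, 100.0, t=1.0)
# Time need not be a round number.
one_and_a_half = time_invariant_loss(110.0, 100.0, t=1.5)
```

With q = -0.5 the horizon is t < 2; a smaller |q|, such as q = -0.1, extends it to t < 10.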
![Page 16: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/16.jpg)
Signed Loss Function
• Idea: Account for the direction and magnitude of loss.
$S(F; B) = (F - B) B^q$
• Can use asymmetric critical values and “q”s:
– Declare outliers whenever
$S^+(F; B) = (F - B) B^{q_+} > C_+$
or
$S^-(F; B) = (F - B) B^{q_-} < C_-$
with $C_+ \neq -C_-$, $q_+ \neq q_-$.
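The asymmetric rule can be sketched as follows. The parameter names and default values are hypothetical, chosen so that increases are flagged more readily than decreases; the sketch also assumes the upper critical value is positive and the lower one negative.

```python
def signed_loss(future, base, q):
    """Signed loss S(F; B) = (F - B) * B**q."""
    return (future - base) * base ** q


def is_signed_outlier(future, base, q_plus=-0.5, q_minus=-0.5,
                      c_plus=2.0, c_minus=-3.0):
    """Declare an outlier when S+ > C+ or S- < C-.

    All defaults are illustrative; q_plus and q_minus may differ as well.
    Assumes c_plus > 0 > c_minus.
    """
    s_plus = signed_loss(future, base, q_plus)
    s_minus = signed_loss(future, base, q_minus)
    return s_plus > c_plus or s_minus < c_minus


up_50 = is_signed_outlier(450.0, 400.0)    # S+ = 2.5, above C+ = 2.0
down_50 = is_signed_outlier(350.0, 400.0)  # S- = -2.5, not below C- = -3.0
down_70 = is_signed_outlier(330.0, 400.0)  # S- = -3.5, below C- = -3.0
```

With these settings an increase of 50 on a base of 400 is an outlier, while a decrease of the same size is not.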
![Page 17: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/17.jpg)
Nominal Time
• Compare 2 sets of estimates; one set can be the actual values, $A_i$.
• Assumptions:
– Unbiased: $E B_i = E F_i = A_i$.
– Proportionate variance: $\operatorname{Var}(B_i) = \operatorname{Var}(F_i) = \sigma^2 A_i$.
• $q = -1/2$.
• Either set of estimates can be used for $B_i$, $F_i$.
– Exception: $A_i$ can only be substituted for $B_i$.
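One way to see why these assumptions lead to $q = -1/2$ is sketched below. It writes $\sigma^2$ for the constant of proportionality in the variance assumption and additionally assumes that the two sets of estimates are uncorrelated, which the slide does not state.

```latex
\begin{align*}
\operatorname{Var}(F_i - B_i) &= \operatorname{Var}(F_i) + \operatorname{Var}(B_i) = 2\sigma^2 A_i
  && \text{(assuming } \operatorname{Cov}(F_i, B_i) = 0\text{)},\\
\operatorname{Var}\!\left(\frac{F_i - B_i}{\sqrt{A_i}}\right) &= 2\sigma^2
  && \text{(the same for every unit } i\text{)},\\
\frac{|F_i - B_i|}{\sqrt{B_i}} &= |F_i - B_i|\, B_i^{-1/2}
  && \text{(using } B_i \text{ in place of the unknown } A_i\text{)}.
\end{align*}
```

Under these assumptions, dividing the difference by the square root of the base puts every unit on a common scale, which is loss function (1a) with $q = -1/2$. If the actual values $A_i$ are available and used as the base, the variance of the difference is $\sigma^2 A_i$ and the same scaling applies.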
![Page 18: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/18.jpg)
How to Use: No Preexisting Outlier Criteria
• Start with $q = -0.5$.
– Adjust by increments of 0.1 to get a “good” distribution of outliers.
• Alternative: Start with $q = \log(\text{range})/25 - 1$, where range is the range of the data (Bryan, 1999).
– Can adjust.
![Page 19: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/19.jpg)
How to Use: Preexisting Discrete Outlier Criteria
• Start with a schedule of critical pairs $(\epsilon_j, B_j)$.
– These pairs (approximately) satisfy the equation $\epsilon B^q = C$ for some $q$ and $C$. They are the cutoffs between outliers and nonoutliers.
• Run the regression $\log \epsilon_j = -q \log B_j + K$.
• Then, $C = e^K$.
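The regression step can be sketched with ordinary least squares written out by hand, using natural logarithms so that C is the exponential of the intercept. The function name and the schedule of critical pairs are hypothetical; the schedule is generated to satisfy the cutoff equation exactly with q = -0.5 and C = 3, so the fit should recover those values.

```python
import math


def fit_q_and_c(critical_pairs):
    """Fit log(eps_j) = -q * log(B_j) + K by ordinary least squares.

    `critical_pairs` holds (eps_j, B_j) tuples with positive entries.
    Returns (q, C), where C = exp(K).
    """
    critical_pairs = list(critical_pairs)
    xs = [math.log(base) for _, base in critical_pairs]
    ys = [math.log(eps) for eps, _ in critical_pairs]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # estimates -q
    intercept = y_bar - slope * x_bar  # estimates K
    return -slope, math.exp(intercept)


# Hypothetical schedule built from eps * B**-0.5 = 3.
schedule = [(3.0 * base ** 0.5, base)
            for base in (100.0, 1_000.0, 10_000.0, 100_000.0)]
q_hat, c_hat = fit_q_and_c(schedule)
```

A real schedule of cutoffs would satisfy the equation only approximately, so the fitted q and C would be estimates rather than exact values.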
![Page 20: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/20.jpg)
Loss Functions and GIS
• Loss functions can be used with GIS to focus analyst’s attention on problem areas.
• Maps compare tax method county population estimates to unconstrained housing unit method estimates.
• q = –0.5 in loss function map.
![Page 21: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/21.jpg)
Map 1: Absolute Differences between the Two Sets of Population Estimates
Legend (persons): 0 - 5000; 5000 - 25000; 25000 - 50000; Over 50000; No Data
Note: The tax method estimates are the base.
![Page 22: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/22.jpg)
Map 2: Absolute Percent Differences between the Two Sets of Population Estimates
Legend (percent): 0 - 5; 5 - 10; 10 - 20; Above 20; No Data
Note: The tax method estimates are the base.
![Page 23: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/23.jpg)
Map 3: Loss Function Values
Legend (loss): 0 - 1000; 1000 - 2000; 2000 - 4000; Above 4000; No Data
Note: The tax method estimates are the base.
![Page 24: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/24.jpg)
Outliers Classified by Another Variable
• $D_i$ is a function of 2 successive observations.
• $R_i$ is the “reference” variable, used to classify outliers.
• Start with a schedule of critical pairs $(D_j, R_j)$.
• Run the regression $\log D_j = a + \beta \log R_j$.
• Then, $L(D, R) = D R^b$ with $b = -\beta$, and $C = e^a$.
![Page 25: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/25.jpg)
What to Do with Negative Data
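• From Coleman and Bryan (2000):
$L(F, B) = \begin{cases} |F - B| (|F| + |B|)^q, & B \neq 0 \text{ or } F \neq 0, \\ 0, & B = F = 0. \end{cases}$
$S(F, B) = \begin{cases} (F - B)(|F| + |B|)^q, & B \neq 0 \text{ or } F \neq 0, \\ 0, & B = F = 0. \end{cases}$
• $0 > q > -1$. Suggest $q \approx -0.5$.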
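The two functions for data that may be zero or negative can be sketched as follows. The function names and sample values are assumptions for illustration; the formulas follow the slide, with |F| + |B| taking the place of the base.

```python
def loss_any_sign(future, base, q=-0.5):
    """L(F, B) = |F - B| * (|F| + |B|)**q, defined as 0 when F = B = 0."""
    scale = abs(future) + abs(base)
    if scale == 0:
        return 0.0
    return abs(future - base) * scale ** q


def signed_loss_any_sign(future, base, q=-0.5):
    """S(F, B) = (F - B) * (|F| + |B|)**q, defined as 0 when F = B = 0."""
    scale = abs(future) + abs(base)
    if scale == 0:
        return 0.0
    return (future - base) * scale ** q


# Illustrative values: a swing from +50 to -50, and the all-zero case.
swing = loss_any_sign(-50.0, 50.0)                # 100 * 100**-0.5
signed_swing = signed_loss_any_sign(-50.0, 50.0)  # -100 * 100**-0.5
both_zero = loss_any_sign(0.0, 0.0)
```

Checking for a zero scale first avoids raising zero to a negative power, which Python rejects with a `ZeroDivisionError`.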
![Page 26: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/26.jpg)
Summary
• Defined panel data.
• Defined outliers.
• Created several types of loss functions to detect outliers in panel data.
• Loss functions are empirical (except for nominal time).
• Showed several applications, including GIS.
![Page 27: Loss Functions for Detecting Outliers in Panel Data](https://reader037.vdocument.in/reader037/viewer/2022110210/56812eb8550346895d945b21/html5/thumbnails/27.jpg)
URL for Presentation
http://chuckcoleman.home.dhs.org/fscpela.ppt