Anomaly Detection for Scientific Data Mark Schwabacher NASA ARC, Code TI (formerly IC, TC) ROSES Code S & T Workshop February 17, 2005


Anomaly Detection for Scientific Data

Mark Schwabacher
NASA ARC, Code TI (formerly IC, TC)

ROSES Code S & T Workshop
February 17, 2005

What is Anomaly Detection?

• Seek to find parts of the data (“anomalies”) that are different from the rest of the data

• “Supervised” approaches use examples of anomalies; “unsupervised” approaches do not.

How can Anomaly Detection be Applied to Scientific Data?

• Examples:
  – Data from Earth-observing satellites
  – Data from telescopes

• Direct scientists’ attention to anomalies – could lead to scientific discoveries

• Detect errors, so they can be corrected

Example Earth Science Application: Vegetation Data

• Joint work with Ranga Myneni of Boston University
• Used Leaf Area Index (LAI) & Fraction of Absorbed Photosynthetically Active Radiation (FPAR) from the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument aboard the Terra and Aqua satellites

Results

• Used MODIS data from one time point at 4 km resolution (7.7 million pixels within Earth’s land area)
• Used 4 variables: LAI, FPAR, QA, and latitude
• Used an unsupervised, distance-based anomaly detection algorithm
• The #1 outlier was in northern Russia and the #2 outlier was in southern New Zealand
• Both points had unusually high LAI and FPAR values for their latitudes
• Investigation revealed a bug in the software that produced the LAI and FPAR products
• Error was corrected, and new versions of the data were made available to the scientific community.

Algorithm Used: Orca (Distance-Based Outliers)

The main idea is to find points in low density regions of the feature space

$$ P(x) \approx \frac{k}{N V} $$

[Figure: a sphere of radius d centered on the point x]

• V is the total volume within radius d
• N is the total number of examples
• k is the number of examples in the sphere
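The density estimate above can be tried on a toy data set. The following is a minimal sketch, not part of the original slides: the function names `ball_volume` and `density_estimate` are hypothetical, Euclidean distance is assumed, and the point x itself is counted among the k examples when it belongs to the data.

```python
import math

def ball_volume(radius, dim):
    """Volume of a dim-dimensional ball with the given radius."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim

def density_estimate(x, data, d):
    """Estimate P(x) as k / (N * V), where k counts the examples within distance d of x."""
    n = len(data)                                      # N: total number of examples
    k = sum(1 for p in data if math.dist(x, p) <= d)   # k: examples inside the sphere
    v = ball_volume(d, len(x))                         # V: volume within radius d
    return k / (n * v)

# Four points packed together and one isolated point (illustrative data only).
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
dense = density_estimate((0.0, 0.0), data, d=0.5)
sparse = density_estimate((5.0, 5.0), data, d=0.5)
print(dense > sparse)
```

The isolated point gets the lower estimate, which is what marks it as a candidate outlier under this definition.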

Joint work with Stephen Bay of ISLE

Orca Algorithm

• Based on nested loops
  – For each example, find its nearest neighbors with a sequential scan
• Modified with a pruning rule
  – While performing the sequential scan,
    • Keep track of closest neighbors found so far
    • Prune examples once the neighbors found so far indicate that the example cannot be a top outlier
• Worst case O(N^2) distance computations
• In practice, runs in nearly linear time
• Can handle millions of data points
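The nested loop with pruning can be sketched as follows. This is an illustrative reading of the bullets above, not Orca's actual implementation: the function name `top_outliers` and its parameters are hypothetical, the outlier score is assumed to be the Euclidean distance to the k-th nearest neighbor, and the scan order is shuffled on the assumption that a random order lets the cutoff rise quickly.

```python
import heapq
import math
import random

def top_outliers(data, k=3, n_top=2, seed=0):
    """Return (score, index) for the n_top points farthest from their k-th nearest neighbor."""
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)   # assumed: random scan order
    cutoff = 0.0   # score of the weakest top outlier found so far
    top = []       # min-heap of (score, index), at most n_top entries
    for i in order:
        nearest = []   # negated distances: a max-heap of the k closest neighbors so far
        pruned = False
        for j in order:                  # sequential scan over the data
            if i == j:
                continue
            dist = math.dist(data[i], data[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -dist)
            elif dist < -nearest[0]:
                heapq.heapreplace(nearest, -dist)
            # With k neighbors known, -nearest[0] is an upper bound on the final score.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True            # cannot be a top outlier, stop scanning
                break
        if pruned or len(nearest) < k:
            continue
        score = -nearest[0]
        if len(top) < n_top:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_top:
            cutoff = top[0][0]
    return sorted(top, reverse=True)

# A tight 5 x 5 grid plus two far-away points (illustrative data only).
data = [(x * 0.1, y * 0.1) for x in range(5) for y in range(5)]
data += [(10.0, 10.0), (-8.0, 3.0)]
result = top_outliers(data, k=3, n_top=2)
print([index for _, index in result])
```

Pruning never changes the answer, because a pruned example's score is already bounded below the cutoff. It only shortens the inner loop, and in the worst case every scan still runs to completion.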

Conclusions

• Anomaly detection algorithms can find previously unknown anomalies in large scientific data sets
• Could lead to scientific discoveries or correction of errors
• Different algorithms find qualitatively different anomalies, so it is worth running multiple algorithms
• I presented one algorithm (Orca) that runs in nearly linear time, so it can be applied to very large data sets

Pruning

Outliers based on distance to the 3rd nearest neighbor (k=3)

[Figure: a sphere of radius d centered on the point x, shown beside a table of example records processed by sequential scan]

39  State-gov         77516   Bachelors     13
50  Self-emp-not-inc  83311   Bachelors     13
38  Private           215646  HS-grad       9
53  Private           234721  11th          7
28  Private           338409  Bachelors     13
37  Private           284582  Masters       14
49  Private           160187  9th           5
52  Self-emp-not-inc  209642  HS-grad       9
31  Private           45781   Masters       14
42  Private           159449  Bachelors     13
37  Private           280464  Some-college  10
30  State-gov         141297  Bachelors     13
23  Private           122272  Bachelors     13
32  Private           205019  Assoc-acdm    12
40  Private           121772  Assoc-voc     11
34  Private           245487  7th-8th       4
25  Self-emp-not-inc  176756  HS-grad       9
32  Private           186824  HS-grad       9
38  Private           28887   11th          7
43  Self-emp-not-inc  292175  Masters       14
40  Private           193524  Doctorate     16
54  Private           302146  HS-grad       9
35  Federal-gov       76845   9th           5
43  Private           117037  11th          7
59  Private           109015  HS-grad       9
56  Local-gov         216851  Bachelors     13
19  Private           168294  HS-grad       9
54  ?                 180211  Some-college  10
39  Private           367260  HS-grad       9

d is the distance to the 3rd nearest neighbor for the weakest top outlier
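To make the cutoff d concrete, the sketch below scores a handful of points by brute force and reads off the score of the weakest top outlier. It is an illustration under assumptions, not the slide's own code: the helper names `kth_nn_scores` and `cutoff_for_top` are hypothetical, the data is made up, and Euclidean distance is assumed.

```python
import math

def kth_nn_scores(data, k=3):
    """Score each point by the distance to its k-th nearest neighbor (brute force)."""
    scores = []
    for i, p in enumerate(data):
        dists = sorted(math.dist(p, q) for j, q in enumerate(data) if j != i)
        scores.append(dists[k - 1])
    return scores

def cutoff_for_top(scores, n_top):
    """Score of the weakest of the n_top highest-scoring points (the radius d)."""
    return sorted(scores, reverse=True)[n_top - 1]

# Four clustered points and two isolated ones (illustrative data only).
data = [(0, 0), (1, 0), (0, 1), (1, 1), (6, 0), (0, 9)]
scores = kth_nn_scores(data, k=3)
d = cutoff_for_top(scores, n_top=2)
print(scores, d)
```

Any example that already has 3 neighbors closer than d during the scan cannot displace a current top outlier, which is why the scan for that example can stop early.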