
Spatio-Temporal Outlier Detection in Precipitation Data

Elizabeth Wu, Wei Liu, Sanjay Chawla

The University of Sydney, Australia

SensorKDD 2008

Sunday, 24th August, 2008


Outline

• What is a spatio-temporal outlier?

• Motivation

• Previous Work

• Contributions

• Our Approach

• Future Work


What is a Spatio-Temporal Outlier?

• “A spatio-temporal object whose thematic attribute values are significantly different from those of other spatially and temporally referenced objects in its spatial and/or temporal neighborhoods.”

– Cheng and Li (2006)

[Figure: five 5 x 5 spatial grids, one for each time period t=1 to t=5]

What is a spatio-temporal object?

• “A time-evolving spatial object whose evolution or ‘history’ is represented by a set of instances (o_id, s_i, t_i) where the spacestamp s_i is the location of object o_id at timestamp t_i.” - Theodoridis et al. (1999)

• Simply put, once time is added as a third axis to the x and y co-ordinates:

• A 2D region becomes a 3D region

• A point becomes a line

Data

• South American precipitation data (NOAA)

• 10 years (1995-2004)

• 2.5° x 2.5° grid cells

• 31 latitude x 23 longitude divisions

• 713 grid cells in total

• 2,609,580 possible data values

• Missing data – spatially and temporally

• El Niño Southern Oscillation Data (NOAA)

• Southern Oscillation Index (SOI)

• Measures the difference in sea-level air pressure between Tahiti and Darwin

• The lower the score, the more intense an El Niño event

Figure: Stations used to produce gridded precipitation fields

Motivation

• Why would we be interested in moving outlier regions in precipitation data?

• Knowing the location, time and duration of past extreme precipitation events helps to understand and prepare for future events.

• We can analyse how different phenomena interact.

• E.g. ENSO and precipitation.


Previous Work

• Spatial Scan Statistics

• Used to find spatial outliers

• Cluster detection using the spatial scan statistic in spatio-temporal point data (Iyengar, 2004)

• Exact-Grid and Approx-Grid (Agarwal et al., 2006)

• Uses the Kulldorff Spatial Scan Statistic

• Finds the highest discrepancy region (by location and size) in a spatial grid dataset.

• Spatio-temporal outlier detection (Birant and Kut, 2006)

• Limited to finding outliers over a single time period.

[Figure: axes are x co-ordinate, y co-ordinate and time]

Contributions

• Extended Exact-Grid and Approx-Grid to find the top-k outliers in a single time period.

• Developed the Outstretch and RecurseNodes algorithms to find outliers that repeatedly appear over several time periods.

• Applied the approach to South American precipitation data.

• Analysed the behaviour of the outliers against the El Niño Southern Oscillation (ENSO).

Our Approach

1. Find the top-k outliers in a spatial grid for each time period.

• Extend the Exact-Grid and Approx-Grid algorithms.

2. Use Outstretch to find spatial outliers which extend over several time periods.

3. Use RecurseNodes to extract the sequences from the Outstretch tree.


Finding the top-k outliers

• Find every possible region size and shape in the grid.

• Compute each region’s discrepancy value; a higher value marks a more significant outlier.

• Our extension keeps track of the top-k regions rather than just the top-1.

[Figure: a grid region bounded by left, right, top and bottom lines]
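Keeping the top-k regions rather than the top-1 can be sketched with a bounded min-heap. This is a minimal Python illustration, not the authors' implementation: the function name `top_k_regions` and the toy `candidates` list are assumptions, and the discrepancy values are made up for the example.

```python
import heapq
from itertools import count

def top_k_regions(scored_regions, k):
    """Return the k highest-discrepancy (d, region) pairs, best first."""
    heap = []       # min-heap: the weakest retained region sits at heap[0]
    tie = count()   # tie-breaker so that region objects are never compared
    for d, region in scored_regions:
        item = (d, next(tie), region)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif d > heap[0][0]:
            # The new region beats the weakest retained one: swap them.
            heapq.heapreplace(heap, item)
    return [(d, region) for d, _, region in sorted(heap, reverse=True)]

# Hypothetical (discrepancy, region label) pairs.
candidates = [(0.45, "A"), (0.10, "B"), (0.54, "C"), (0.51, "D"), (0.02, "E")]
best = top_k_regions(candidates, k=3)
print(best)
```

With k = 1 this reduces to tracking only the single highest-discrepancy region.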


Kulldorff Scan Statistic

• Uses two values:

1. Measurement – number of incidences of an event

• E.g. in how many cells is precipitation extreme?

• $M$ – for the whole dataset

• $m(p)$ – for the cell $p$

• $m_R = \sum_{p \in R} m(p) / M$

2. Baseline – total population at risk

• I.e. how many cells have we recorded values for?

• $B$ – for the whole dataset

• $b(p)$ – for the cell $p$

• $b_R = \sum_{p \in R} b(p) / B$

• We find the discrepancy for a local region $R$ by substitution into:

• When $m_R > b_R$: $d(m_R, b_R) = m_R \log\frac{m_R}{b_R} + (1 - m_R) \log\frac{1 - m_R}{1 - b_R}$

• Otherwise: $d(m_R, b_R) = 0$

Kulldorff Scan Statistic: Example

• $M = 6$ = total number of cells with “1” in the entire grid

• $\sum_{p \in R} m(p) = 4$ = total number of cells with “1” in R

• $m_R = \sum_{p \in R} m(p) / M = 4/6 \approx 0.67$

• $B = 16$ = total number of cells in the entire grid

• $\sum_{p \in R} b(p) = 4$ = total number of cells in R

• $b_R = \sum_{p \in R} b(p) / B = 4/16 = 0.25$

• Result: $d(m_R, b_R) = 0.3836$ (using the natural logarithm)

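Measurement grid m(p), where R is the 2 x 2 block of 1s in the top-left corner:

y=4   1 1 0 0
y=3   1 1 0 0
y=2   0 0 0 1
y=1   0 0 1 0
x=    1 2 3 4

Baseline grid b(p):

y=4   1 1 1 1
y=3   1 1 1 1
y=2   1 1 1 1
y=1   1 1 1 1
x=    1 2 3 4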
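The worked example can be checked numerically. The sketch below is a minimal Python version of the discrepancy function under the definitions on these slides; the function name `kulldorff_discrepancy`, the zero-baseline guard and the row/column indexing of R are assumptions made for illustration.

```python
import math

def kulldorff_discrepancy(m_r, b_r):
    """d(m_R, b_R) as defined on the slide; 0 unless m_R > b_R."""
    if m_r <= b_r or b_r <= 0.0:   # the b_r guard is an added assumption
        return 0.0
    d = m_r * math.log(m_r / b_r)
    if m_r < 1.0:                  # the second term vanishes when m_R = 1
        d += (1.0 - m_r) * math.log((1.0 - m_r) / (1.0 - b_r))
    return d

# Grids from the example, listed top row first.
m = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
b = [[1] * 4 for _ in range(4)]

# Region R: the 2 x 2 block of 1s (first two rows, first two columns).
cells = [(i, j) for i in range(2) for j in range(2)]
M = sum(map(sum, m))                       # 6
B = sum(map(sum, b))                       # 16
m_r = sum(m[i][j] for i, j in cells) / M   # 4/6
b_r = sum(b[i][j] for i, j in cells) / B   # 4/16
d = kulldorff_discrepancy(m_r, b_r)
print(round(d, 4))
```

The result agrees with the slide's value of 0.3836 only when `math.log` is the natural logarithm.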


Finding the top-k outliers: Exact-Grid

[Animation: left, right, top and bottom sweep lines step across the grid, one region per slide]

• For each pair of left and right lines, the top and bottom lines keep moving until all regions between the left and right lines have been examined.

• The left and right lines then move and the process repeats: the top and bottom lines again define all possible areas between them.

• Continue until all regions have been examined.
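The sweep amounts to enumerating every axis-aligned rectangle in the grid and scoring each one. A minimal Python sketch under that reading follows. It substitutes 2D prefix sums for the running sums of the original Exact-Grid, reports only the single best region, and assumes M > 0 and B > 0; the names `prefix_sums`, `rect_sum` and `exact_grid_best` are illustrative and not taken from Agarwal et al.

```python
import math

def discrepancy(m_r, b_r):
    """Kulldorff discrepancy; 0 unless m_r > b_r."""
    if m_r <= b_r or b_r <= 0.0:
        return 0.0
    d = m_r * math.log(m_r / b_r)
    if m_r < 1.0:
        d += (1.0 - m_r) * math.log((1.0 - m_r) / (1.0 - b_r))
    return d

def prefix_sums(grid):
    """ps[i][j] = sum of grid[0..i-1][0..j-1]."""
    rows, cols = len(grid), len(grid[0])
    ps = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            ps[i + 1][j + 1] = grid[i][j] + ps[i][j + 1] + ps[i + 1][j] - ps[i][j]
    return ps

def rect_sum(ps, top, left, bottom, right):
    """Sum over rows top..bottom and columns left..right (inclusive)."""
    return (ps[bottom + 1][right + 1] - ps[top][right + 1]
            - ps[bottom + 1][left] + ps[top][left])

def exact_grid_best(m, b):
    """Score all rectangles; return (best discrepancy, (top, left, bottom, right))."""
    rows, cols = len(m), len(m[0])
    pm, pb = prefix_sums(m), prefix_sums(b)
    M, B = pm[rows][cols], pb[rows][cols]
    best = (0.0, None)
    for left in range(cols):                  # left sweep line
        for right in range(left, cols):       # right sweep line
            for top in range(rows):           # top sweep line
                for bottom in range(top, rows):   # bottom sweep line
                    m_r = rect_sum(pm, top, left, bottom, right) / M
                    b_r = rect_sum(pb, top, left, bottom, right) / B
                    d = discrepancy(m_r, b_r)
                    if d > best[0]:
                        best = (d, (top, left, bottom, right))
    return best

# The 4 x 4 example grid from the Kulldorff slide, top row first.
m = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
b = [[1] * 4 for _ in range(4)]
best_d, best_region = exact_grid_best(m, b)
print(best_d, best_region)
```

The four nested loops visit on the order of n^4 rectangles for an n x n grid, and the prefix sums make each visit constant time.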


Finding the top-k outliers: Approx-Grid

• Reduces the time complexity of the algorithm by using only two sweep lines and finding the interval that maximises the discrepancy function

• See the Agarwal et al. (2006) paper for details.

[Figure: top and bottom sweep lines over the grid]

• m(I,j) stores the sum of the m(p)’s for each column

• For each move of a sweep line, run the Linear1D algorithm to find the interval that maximises the discrepancy function

Finding the top-k outliers: Considerations

• Overlapping Regions

• Overlap types

• Chain effect

• One option: Union Solution

[Figure: overlapping regions labelled d=0.45, d=0.51 and d=0.54]

• Chosen Option: Allow a percentage of overlap

[Figure: two overlapping regions labelled d=0.45 and d=0.51]

• If this overlap is less than allowable_overlap %, then keep both regions in the top-k list.
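The slides do not say how the overlap percentage is measured. The Python sketch below assumes it is the shared cells as a percentage of the smaller region, and that higher-discrepancy regions are considered first; `overlap_percent`, `filter_overlaps` and the example rectangles are hypothetical.

```python
def area(r):
    """Number of cells in region r = (left, bottom, right, top), inclusive."""
    return (r[2] - r[0] + 1) * (r[3] - r[1] + 1)

def overlap_percent(a, b):
    """Shared cells as a percentage of the smaller region (an assumption)."""
    w = min(a[2], b[2]) - max(a[0], b[0]) + 1
    h = min(a[3], b[3]) - max(a[1], b[1]) + 1
    if w <= 0 or h <= 0:
        return 0.0
    return 100.0 * w * h / min(area(a), area(b))

def filter_overlaps(scored, allowable_overlap):
    """Keep a region only if it overlaps every already kept region by less
    than allowable_overlap percent; higher discrepancies are tried first."""
    kept = []
    for d, region in sorted(scored, reverse=True):
        if all(overlap_percent(region, other) < allowable_overlap
               for _, other in kept):
            kept.append((d, region))
    return kept

scored = [
    (0.51, (0, 0, 3, 3)),   # 16 cells
    (0.45, (3, 3, 5, 5)),   # 9 cells, shares one cell with the first region
    (0.40, (0, 0, 2, 2)),   # lies entirely inside the first region
]
kept = filter_overlaps(scored, allowable_overlap=20)
print(kept)
```

Measuring overlap against the smaller region means a small region nested inside a large one counts as 100% overlap, so it is dropped.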


Outstretch

• Outstretch – find the paths of the outliers over time.

[Figure: five 5 x 5 spatial grids, one for each time period t=1 to t=5]

Outstretch

• Use Outstretch to find spatial outliers which extend over several time periods.

• Check the same region (slightly stretched to cover more area) in the next time period, to see if another outlier lies in the region.

• If one does, it is considered to be part of the spatio-temporal outlier, which is now extended over an additional time period.

• Store in a tree data structure.

[Figure: a region (dark green) stretched by r=2 grid cells. In the next time period, we check whether any outliers fall in the stretched area.]
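The stretch-and-check step can be sketched in Python as follows. The slides do not specify whether an outlier must lie wholly inside the stretched region or merely overlap it; this sketch assumes full containment, and the `Region` type, the 10 x 10 grid and the function names are illustrative.

```python
from typing import NamedTuple

class Region(NamedTuple):
    """Axis-aligned block of grid cells, bounds inclusive."""
    left: int
    bottom: int
    right: int
    top: int

def stretch(region, r, width, height):
    """Grow the region by r cells on every side, clipped to the grid."""
    return Region(max(region.left - r, 0),
                  max(region.bottom - r, 0),
                  min(region.right + r, width - 1),
                  min(region.top + r, height - 1))

def falls_inside(inner, outer):
    """True if inner lies wholly within outer (assumed reading of 'fall in')."""
    return (outer.left <= inner.left and inner.right <= outer.right
            and outer.bottom <= inner.bottom and inner.top <= outer.top)

# An outlier found at t=1 in a hypothetical 10 x 10 grid, stretched by r=2.
outlier_t1 = Region(3, 3, 4, 4)
stretched = stretch(outlier_t1, r=2, width=10, height=10)

# Two candidate outliers found at t=2.
near = Region(5, 5, 6, 6)
far = Region(6, 6, 8, 8)
print(stretched, falls_inside(near, stretched), falls_inside(far, stretched))
```

A larger r tolerates faster-moving outliers but also links more unrelated regions across time periods.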


Outstretch

• Store outliers found over subsequent time periods in a tree data structure.

Node    Num Children    Children
1,1     1               {2,2}
1,2     3               {2,2}, {2,3}, {2,4}
1,3     1               {2,1}
1,4     1               {2,4}
2,1     1               {3,2}
2,2     2               {3,1}, {3,4}
2,3     1               {3,3}
2,4     0               -
3,1     0               -
3,2     0               -
3,3     0               -
3,4     0               -

[Figure: tree with nodes 1,1 to 1,4 for t=1, nodes 2,1 to 2,4 for t=2 and nodes 3,1 to 3,4 for t=3]

Outstretch

• Stretch the top-k outliers from t=1 by r (their spatial neighbourhood).

[Figure: tree nodes 1,1 to 1,4 for the four stretched outliers at t=1]

• From the top-k in t=2, find those which fall inside the stretched region from the previous period, t=1.

[Figure: tree extended with nodes 2,1 to 2,4 for the outliers at t=2]

• Stretch the new outliers from t=2 and find the outliers from t=3 that fall in the newly stretched regions.

[Figure: tree extended with nodes 3,1 to 3,4 for the outliers at t=3]
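The period-by-period linking can be sketched as a small tree builder in Python. This is an illustrative reading of the steps, not the authors' Outstretch code: nodes are keyed as (time period, outlier index) as in the slides' table, regions are hypothetical (left, bottom, right, top) rectangles, and `follows` assumes a child must lie wholly inside its parent stretched by r.

```python
def follows(parent, child, r=1):
    """True if child lies inside parent stretched by r cells (assumed rule)."""
    pl, pb, pr, pt = parent
    cl, cb, cr, ct = child
    return pl - r <= cl and cr <= pr + r and pb - r <= cb and ct <= pt + r

def build_outstretch_tree(periods, r=1):
    """periods[t-1] holds the top-k regions of time period t.
    Returns {(t, i): [(t+1, j), ...]} linking each outlier to its children."""
    tree = {}
    for t, regions in enumerate(periods, start=1):
        for i in range(1, len(regions) + 1):
            tree[(t, i)] = []
    for t in range(1, len(periods)):
        for i, parent in enumerate(periods[t - 1], start=1):
            for j, child in enumerate(periods[t], start=1):
                if follows(parent, child, r):
                    tree[(t, i)].append((t + 1, j))
    return tree

# Hypothetical top-k regions for three time periods.
periods = [
    [(0, 0, 1, 1), (5, 5, 6, 6)],   # t=1
    [(1, 1, 2, 2), (8, 8, 9, 9)],   # t=2
    [(2, 2, 3, 3)],                 # t=3
]
tree = build_outstretch_tree(periods, r=1)
print(tree)
```

Storing children by key rather than nesting nodes lets one outlier at t+1 appear under several parents, as {2,2} and {2,4} do in the slides' table.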


RecurseNodes

• Now that we’ve stored all the sequences in the tree, how do we get them out?

• Use RecurseNodes to extract the sequences from the Outstretch tree.

Node    Num Children    Children
1,1     1               {2,2}
1,2     3               {2,2}, {2,3}, {2,4}
1,3     1               {2,1}
1,4     1               {2,4}
2,1     1               {3,2}
2,2     2               {3,1}, {3,4}
2,3     1               {3,3}
2,4     0               -
3,1     0               -
3,2     0               -
3,3     0               -
3,4     0               -

RecurseNodes

• Start at {1,1}

• We notice it has a child {2,2}

• Check {2,2}

• We notice {2,2} has two children {3,1} and {3,4}.

• Check {3,1} first.

• {3,1} has no children. Stop and store sequence: [ {1,1}, {2,2}, {3,1} ]

• Now check {3,4}.

• {3,4} has no children. Stop and store sequence: [ {1,1}, {2,2}, {3,4} ]

• And so on…
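The walk above is a depth-first traversal that emits one sequence per root-to-leaf path. A minimal Python sketch follows, using the child lists from the slides' table; the function name `recurse_nodes` mirrors the slide title but the code is an illustration, and starting only from the t=1 nodes is an assumption.

```python
def recurse_nodes(tree, node, path=()):
    """Return every sequence that starts at node and ends at a childless node."""
    path = path + (node,)
    children = tree.get(node, [])
    if not children:
        return [list(path)]      # no children: stop and store the sequence
    sequences = []
    for child in children:
        sequences.extend(recurse_nodes(tree, child, path))
    return sequences

# Child lists copied from the table; keys are (time period, outlier index).
tree = {
    (1, 1): [(2, 2)],
    (1, 2): [(2, 2), (2, 3), (2, 4)],
    (1, 3): [(2, 1)],
    (1, 4): [(2, 4)],
    (2, 1): [(3, 2)],
    (2, 2): [(3, 1), (3, 4)],
    (2, 3): [(3, 3)],
    (2, 4): [], (3, 1): [], (3, 2): [], (3, 3): [], (3, 4): [],
}

roots = [node for node in tree if node[0] == 1]
sequences = [seq for root in roots for seq in recurse_nodes(tree, root)]
for seq in sequences:
    print(seq)
```

The first two sequences printed are the ones derived by hand from {1,1} in the walk above.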


Results: Exact vs. Approx-Grid Top-k

Outlier Discovery – Time Taken

• Exact-Grid Top-k, $O(n^4 k)$: 229s

• Approx-Grid Top-k, $O(n^3 k)$: 35s

[Figure: Length and number of outliers found]

• Exact-Grid Top-k finds longer sequences than Approx-Grid Top-k

• Approx-Grid Top-k is faster than Exact-Grid Top-k

Results: Mean discrepancy of Exact-Grid Top-k sequences and the mean SOI

• Notice that some of the discrepancies at the centre time period are higher during the more intense El Niño event

• This suggests that precipitation extremes are more extreme during an El Niño event.

Results: Mean discrepancy of Approx-Grid Top-k sequences and the mean SOI

• We also find extreme extremes in the Approx-Grid Top-k sequences

Future Work

• Evaluate against other metrics (besides SOI), such as Sea Surface Temperature (SST)

• Point data

• Other data, e.g. other precipitation data.

Conclusion

• Our contributions:

• Top-k extension to the Exact-Grid and Approx-Grid algorithms

• Outlier sequence discovery over time

• Evaluation using precipitation data

• Comparison of results with the Southern Oscillation Index (SOI)

• Results showed:

• More extreme extreme values during El Niño periods

• These can be found with both the Exact-Grid and Approx-Grid algorithms

Questions

• Please ask
