
Page 1: Semi-Supervised Time Series Classification

Semi-Supervised Time Series Classification

Li Wei, Eamonn Keogh

University of California, Riverside
{wli, eamonn}@cs.ucr.edu

Page 2: Semi-Supervised Time Series Classification


Figure 1: A) A sample of text written by George Washington. B) The word “Alexandria” after having its slant removed. C) A time series created by tracing the upper profile of the word (Image courtesy of Raghavan Manmatha, used with permission)

There has been a recent explosion of interest in indexing handwritten documents. Note that simply treating the words as “time series” (see Figure 1) is an extremely competitive approach for classifying (and thus indexing) handwritten documents.
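To make the representation concrete, here is a minimal sketch of one way such an upper-profile time series could be extracted. It assumes a binarized, slant-corrected word image (nonzero pixels are ink) and is only an illustration, not the authors' exact preprocessing pipeline.

```python
import numpy as np

def upper_profile(word_image):
    """Trace the upper profile of a binarized word image (as in Figure 1C).

    word_image : 2-D numpy array, nonzero where ink is present, row 0 at the top.
    Returns one value per column: the height (from the bottom of the image)
    of the topmost ink pixel, or 0 for empty columns.
    """
    n_rows, n_cols = word_image.shape
    profile = np.zeros(n_cols)
    for col in range(n_cols):
        ink_rows = np.flatnonzero(word_image[:, col])
        if ink_rows.size:
            profile[col] = n_rows - ink_rows[0]  # topmost ink pixel, measured upward
    return profile
```

The resulting one-dimensional sequence can then be compared with other word profiles using any standard time series distance measure.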

Handwriting classifiers must be trained on each individual’s particular handwriting. However, obtaining labeled data for every word, for every individual, is very expensive as measured in human time. A semi-supervised approach in which a user annotates just a few training examples would therefore have great utility.

Indexing of Handwritten Documents

Page 3: Semi-Supervised Time Series Classification

[Figure 2 graphic: labeled training instances M1, M2, F1 and the instance to be classified, marked “?”; unlabeled instances U1–U8. Figure 3 graphic: the same labeled instances with U1–U8 incorporated as F2–F5 and M3–M6]

Figure 3: The small dataset of labeled instances shown in Figure 2 has been augmented by incorporating the previously unlabeled examples. Now the instance to be classified (marked with “?”) is closest to F5, and is correctly classified

Figure 2: A simple example to motivate semi-supervised classification. The instance to be classified (marked with “?”) is actually an F (female) but happens to be closer to an M (male) in this small dataset of labeled instances

Unlabeled data do contain information which can help classification.

For example, in Figure 2 we need to classify the instance marked with “?”, which clearly belongs to the F (female) class. However, this particular image happens to show the actor in a pose that is very similar to one of the M (male) instances, M1, and the instance is thus misclassified. Note that F1 is a very close match to the unlabeled instance U4, so we could simply relabel U4 as F2 and add it to our dataset of labeled instances. In fact, the basic tenet of semi-supervised learning is that we can do this repeatedly, and thus end up with the situation shown in Figure 3.

Value of Unlabeled Data

Page 4: Semi-Supervised Time Series Classification

Training the Classifier

[Figure 4 graphics, panels A–C: the positive and negative classes, the single positively labeled example, and the examples added in the first, second, and seventeenth iterations]

Figure 4: Semi-supervised training on a simple two-class dataset yields much higher accuracy than a naive k-nearest-neighbor classifier

We let the classifier teach itself by its own predictions.

For example, in Figure 4 we have a two-class dataset in which initially only one example is known to be positive (the solid square in subplot A). In subplot B we can see the chaining effect of semi-supervised learning: a newly labeled positive example helps us label other positive examples, and so on. Eventually all positive examples are correctly classified. In contrast, if we simply assign the seventeen nearest neighbors of the single labeled example to the positive class, we get very poor accuracy (see subplot C).
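A minimal sketch of this self-training loop, assuming Euclidean distance as the distance measure and a user-supplied number of iterations (the function names are illustrative, not the authors' implementation):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two equal-length time series."""
    return np.linalg.norm(a - b)

def self_train(positives, unlabeled, n_iterations):
    """Repeatedly move the unlabeled instance closest to the labeled
    positive set into that set, trusting the classifier's own prediction.

    positives : list of 1-D numpy arrays (initially labeled positive series)
    unlabeled : list of 1-D numpy arrays (pool of unlabeled series)
    """
    positives, unlabeled = list(positives), list(unlabeled)
    for _ in range(n_iterations):
        if not unlabeled:
            break
        # Find the unlabeled instance nearest to any currently labeled positive.
        best_idx, best_dist = None, np.inf
        for i, u in enumerate(unlabeled):
            d = min(euclidean(u, p) for p in positives)
            if d < best_dist:
                best_idx, best_dist = i, d
        # Label it positive and add it to the training set.
        positives.append(unlabeled.pop(best_idx))
    return positives, unlabeled
```

The chaining effect in Figure 4B arises because each newly added positive becomes a potential nearest neighbor for the remaining unlabeled positives.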

Page 5: Semi-Supervised Time Series Classification

Stopping Heuristic I

[Figure 5 graphics: precision-recall breakeven point (top) and distance between the closest pair in P (bottom), each plotted against the number of iterations; annotations mark “find the closest pair”, “adding positive examples”, and “moving into negative space”]

Figure 5: Change in the minimal nearest neighbor distance of the labeled set on the ECG dataset

Because we do not know the ground truth of the data, it is very hard (if not impossible) to know the true performance of the classifier. Fortunately, the distance statistics give us a hint about how well the classifier is doing.

In Figure 5, we can see that the minimal nearest neighbor distance decreases dramatically in the first few iterations, stabilizes for a relatively long time, and then drops again. Interestingly, the precision-recall breakeven point achieved by the classifier has a corresponding trend of increasing, stabilizing, and decreasing.
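The statistic being tracked is simply the distance between the closest pair of instances in the labeled positive set. A minimal sketch, again assuming Euclidean distance:

```python
import numpy as np

def min_nn_distance(positives):
    """Minimal nearest neighbor distance of the labeled positive set P,
    i.e. the distance between its closest pair of instances."""
    best = np.inf
    for i in range(len(positives)):
        for j in range(i + 1, len(positives)):
            best = min(best, np.linalg.norm(positives[i] - positives[j]))
    return best
```

Recording this value after every iteration of the self-training loop produces a curve like the one in Figure 5.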

Page 6: Semi-Supervised Time Series Classification

Stopping Heuristic II

[Figure 6 graphics, panels A–D: the sample dataset plotted in two dimensions, with positive and negative instances, the closest pair in the labeled positive set highlighted, and the point at which a negative instance is added into the labeled positive set]

Figure 6: A sample dataset shown in two-dimensional space. A) Initially the two solid (red) squares are labeled as positive. B) At some point the closest pair in the positive set has been added to the labeled positive set. C) A negative instance is added into the labeled positive set. D) The closest pair in the labeled positive set changes to two negative instances

In hindsight, the phenomenon in Figure 5 is not surprising.

In the first few iterations, the labeled positive set is relatively small. By adding more positive examples to it, the space gets denser, and as a result, the minimal nearest neighbor distance decreases. At some point, the closest pair of positive examples is incorporated into the labeled set, and the minimal nearest neighbor distance becomes the distance between them. However, if a negative example is mistakenly labeled as positive, chances are high that we will keep adding negative examples, because the negative space is much denser than the positive space. Thus we will see another drop in the minimal nearest neighbor distance of the positive set. Figure 6 illustrates the process on a small sample dataset.
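One possible way to turn this observation into a stopping rule is sketched below. The plateau estimate and the drop threshold are assumptions made for illustration, not the authors' exact criterion.

```python
import numpy as np

def choose_stopping_iteration(history, drop_fraction=0.5):
    """Pick a stopping iteration from the per-iteration minimal nearest
    neighbor distances: stop just before the distance falls well below
    its plateau, which suggests negatives are leaking into the positive set.

    history : sequence of min_nn_distance values, one per iteration.
    drop_fraction : assumed threshold relative to the plateau level.
    """
    history = np.asarray(history, dtype=float)
    plateau = np.median(history)                   # rough estimate of the stable level
    below = np.flatnonzero(history < drop_fraction * plateau)
    if below.size == 0:
        return len(history)                        # no second drop observed
    return int(below[0])                           # stop just before the drop
```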

Page 7: Semi-Supervised Time Series Classification

ECG Dataset

[Figure 7 graphic: precision-recall breakeven point vs. number of iterations]

Figure 7: Classification performance on ECG Dataset

Page 8: Semi-Supervised Time Series Classification

Word Spotting Dataset

[Figure 8 graphic: precision-recall breakeven point vs. number of iterations]

[Figure 9 graphic: the ranks of Image 19 and Image 595 over the iterations, annotated “more likely to be classified as ‘the’” and “less likely to be classified as ‘the’”]

Figure 9: Ranking changes of two instances in Word Spotting dataset during semi-supervised training

Figure 8: Classification performance on Word Spotting Dataset

Page 9: Semi-Supervised Time Series Classification

Gun Dataset

[Figure 10 graphic: precision-recall breakeven point vs. number of iterations]

Figure 10: Classification performance on Gun Dataset

Page 10: Semi-Supervised Time Series Classification

Wafer Dataset

[Figure 11 graphic: precision-recall breakeven point vs. number of iterations]

Figure 11: Classification performance on Wafer Dataset

Page 11: Semi-Supervised Time Series Classification

Yoga Dataset

[Figure 13 graphic: precision-recall breakeven point vs. number of iterations]

[Figure 12 graphic: a shape outline and the time series obtained from it]

Figure 13: Classification performance on Yoga Dataset

Figure 12: Shapes can be converted to time series. The distance from every point on the profile to the center is measured and treated as the Y-axis of a time series
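A minimal sketch of this shape-to-series conversion, assuming the outline is available as an ordered list of (x, y) points and taking the “center” to be the centroid of those points (the z-normalization at the end is an added assumption to make shapes of different sizes comparable):

```python
import numpy as np

def contour_to_series(contour_points):
    """Convert a closed 2-D outline into a time series by measuring the
    distance from every point on the profile to the shape's center."""
    pts = np.asarray(contour_points, dtype=float)
    center = pts.mean(axis=0)                      # centroid of the outline
    distances = np.linalg.norm(pts - center, axis=1)
    # Z-normalize so that shapes of different sizes become comparable.
    return (distances - distances.mean()) / distances.std()
```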