Hamburg, 2015 Roman Rader
Method of Potential Function as Feature Choice Criterion in Alpha Procedure
Roman Rader, National Technical University of Ukraine “Kyiv Polytechnical Institute”, Ukraine
Scientific Advisor: Prof. Dr.-Ing. Tatjana Lange
Contents
● Overview of Alpha Procedure
● Separation power
● Alternative to separation power using Potential Function
● Comparison with the original method, with visualization
Intro
First, let's take a high-level overview of the Alpha Procedure method.
The Alpha Procedure is a pattern recognition algorithm.
Its most important advantages are that it:
– is non-parametric,
– can significantly reduce the feature space,
– operates in 2D and 3D spaces, which makes it convenient to visualize the learning and recognition process.
Alpha Procedure: Input data
The AP is a supervised learning method, so it requires input feature vectors pk = (x1, x2, ..., xn) that have already been classified by a “trainer” into two classes, let's call them A and B. Since the method is non-parametric, no other data needs to be provided.
# p1 p2 p3 class
1 X1,1 X1,2 X1,3 A
2 X2,1 X2,2 X2,3 B
3 X3,1 X3,2 X3,3 A
4 X4,1 X4,2 X4,3 B
$\{(x_1^1, x_2^1, \dots, x_n^1, C_1),\ (x_1^2, x_2^2, \dots, x_n^2, C_2),\ \dots,\ (x_1^k, x_2^k, \dots, x_n^k, C_k)\}$

$x_1 = (x_{11}, x_{12}, \dots, x_{1n})$
$x_2 = (x_{21}, x_{22}, \dots, x_{2n})$
$x_3 = (x_{31}, x_{32}, \dots, x_{3n})$
$x_4 = (x_{41}, x_{42}, \dots, x_{4n})$
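As a concrete illustration, input of this form can be represented in code. A minimal sketch (the feature values below are made up for illustration):

```python
import numpy as np

# Each row is one training object (x_k1, x_k2, x_k3);
# labels[k] is the class assigned by the "trainer".
X = np.array([
    [0.2, 1.5, 3.1],   # object 1 -> A
    [2.4, 0.3, 1.8],   # object 2 -> B
    [0.5, 1.7, 2.9],   # object 3 -> A
    [2.1, 0.1, 2.0],   # object 4 -> B
])
labels = np.array(['A', 'B', 'A', 'B'])

# Being non-parametric, the method needs nothing else as input.
```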
Alpha Procedure: Algorithm
The method is based on step-by-step selection of the most “powerful” feature.
If we represent each feature as an axis and place all training samples on it, we can choose the one that separates the points best (the way of determining this is described later). That is the most “powerful” feature.
Alpha Procedure: Algorithm
1. From the given features we select the one that separates the data best; it becomes our basis feature and the current repère axis, f0.
[Figure: samples projected onto axes p1 and p2; p1 has a smaller intersection area, so it separates the data better and we use it as the repère axis, f0 = p1]
Alpha Procedure: Algorithm
2. Now let's build a set of 2D spaces using f0 as the first axis and each of the remaining features as the second axis.
[Figure: plane with axes f0 and fk]
Alpha Procedure: Algorithm
2. Let's create a new axis that goes through the origin and turn it around the origin by an angle α.
At each step of the rotation, we project the points of the plane onto this axis.
[Figure: plane with axes f0 and fk and the rotating axis]
Alpha Procedure: Algorithm
2. The goal of this step is to find the best pair of second-axis feature and rotation angle.
In the end we will have n−1 (feature, angle) pairs, so using the “power” metric of a data axis we can choose the best pair; after this stage the two features are projected onto the new axis. This new axis becomes our first repère vector, f1.
[Figure: plane with axes f0 and fk; the axis rotated by α1 becomes f1]
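The angle search in this step can be sketched as follows. This is a simplified illustration, not the authors' implementation; `overlap_fraction` is a stand-in for whatever separation-power metric is used (lower is better):

```python
import numpy as np

def best_rotation(f0_vals, fk_vals, power, n_angles=180):
    """Rotate an axis through the origin of the (f0, fk) plane and
    return the angle whose 1-D projection separates the data best."""
    best_alpha, best_score = 0.0, np.inf
    for alpha in np.linspace(0.0, np.pi, n_angles, endpoint=False):
        # Projection of every point onto the axis at angle alpha.
        proj = f0_vals * np.cos(alpha) + fk_vals * np.sin(alpha)
        score = power(proj)
        if score < best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score

# Toy data: class A low on f0, class B high; fk is pure noise.
f0 = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 1.2])
fk = np.array([0.5, -0.3, 0.1, 0.4, -0.2, 0.0])
is_a = np.array([True, True, True, False, False, False])

def overlap_fraction(proj):
    """Share of points inside the interval where the class ranges overlap."""
    lo = max(proj[is_a].min(), proj[~is_a].min())
    hi = min(proj[is_a].max(), proj[~is_a].max())
    return float(((proj >= lo) & (proj <= hi)).mean()) if hi >= lo else 0.0

alpha, score = best_rotation(f0, fk, overlap_fraction)
```

Here the best axis coincides with f0 itself, since fk carries no class information.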
Alpha Procedure: Algorithm
3. In the next step we use f1 as the basis axis.
[Figure: the data projected onto the new axis f1]
Alpha Procedure: Algorithm
3. In the next step we use f1 as the basis axis and repeat the same procedure as in step 2: walk through all the remaining features, build a 2D space with each of them, and rotate a new axis around the origin.
[Figure: plane with axes f1 and fk]
Alpha Procedure: Algorithm
4, 5, … . In the following steps we repeat the previous one until the data is separated or no more features remain.
[Figure: plane with axes f1 and fk; the axis rotated by α2 becomes f2]
Alpha Procedure
As described above, the Alpha Procedure is based on geometric transformations of the space that are quite easy to follow: no matter how many features are given, the method operates only on 2D spaces.
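The whole procedure can then be sketched as a greedy loop. This is a simplified sketch under assumptions: the separation-power metric here is a plain class-overlap fraction and the angle grid is coarse; the metric actually used is discussed in the following sections.

```python
import numpy as np

def overlap(proj, labels):
    """Separation-power stand-in: share of points in the class overlap."""
    a, b = proj[labels == 'A'], proj[labels == 'B']
    lo, hi = max(a.min(), b.min()), min(a.max(), b.max())
    return float(((proj >= lo) & (proj <= hi)).mean()) if hi >= lo else 0.0

def best_rotation(axis, fk, labels, n_angles=180):
    """Best angle for folding feature fk into the current repere axis."""
    angles = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    scores = [overlap(axis * np.cos(t) + fk * np.sin(t), labels)
              for t in angles]
    i = int(np.argmin(scores))
    return angles[i], scores[i]

def alpha_procedure(X, labels):
    """Greedy repere construction, always working in 2-D subspaces."""
    remaining = set(range(X.shape[1]))
    f0 = min(remaining, key=lambda j: overlap(X[:, j], labels))
    axis, steps = X[:, f0].astype(float), [(f0, 0.0)]
    remaining.remove(f0)
    while remaining and overlap(axis, labels) > 0:
        cands = [(j, *best_rotation(axis, X[:, j], labels))
                 for j in remaining]
        j, t, _ = min(cands, key=lambda c: c[2])
        axis = axis * np.cos(t) + X[:, j] * np.sin(t)
        remaining.remove(j)
        steps.append((j, t))
    return axis, steps

# Tiny demo: feature 0 already separates the classes perfectly.
X = np.array([[0.0, 5.0], [0.1, 4.0], [1.0, 5.2], [1.1, 4.1]])
labels = np.array(['A', 'A', 'B', 'B'])
axis, steps = alpha_procedure(X, labels)
```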
Method of Potential Function
Let's digress from the Alpha Procedure and look at the Potential Function.
Assume, as previously, that our problem is to classify objects into classes A and B. The input data contains the objects' features and the “teacher's” classification:
$x = (x_1, x_2, \dots, x_n)$
$\{(x_1^1, x_2^1, \dots, x_n^1, C_1),\ (x_1^2, x_2^2, \dots, x_n^2, C_2),\ \dots,\ (x_1^k, x_2^k, \dots, x_n^k, C_k)\}$
Method of Potential Function
In the geometric interpretation, objects are points of a space X with coordinates $(x_1, x_2, \dots, x_n)$.
Assume the space $X = \mathbb{R}^n$, so the features are $x_i \in \mathbb{R},\ i \in 1..n$.
Then the solution of this problem is a scalar field $\Phi = \Phi(x),\ \Phi \in \mathbb{R}$, which is positive if the point should be classified as class A and negative if it should be classified as class B:
$C(x) = \begin{cases} A, & \Phi(x) \ge 0 \\ B, & \Phi(x) < 0 \end{cases}$
Method of Potential Function
Let's introduce a function, the so-called “kernel”: $K(x, x^*),\ x, x^* \in X$.
For a fixed point x*, this function assigns a value to every point of the space X. In physics, such functions are called potential functions, which is the origin of the method's name. The function is defined on the whole space X but depends on the location of the signal source.
The figure below shows an example of a potential function with the signal source at the point 0.
Method of Potential Function
Now let's introduce the functions
$K_A(x) = \sum_{x_j \in A} K(x, x_j)$
$K_B(x) = \sum_{x_j \in B} K(x, x_j)$
In the geometric interpretation, the value of each of these functions at a given point x is the superposition of the potentials of all points of the corresponding class.
Given the properties of the potential function K, in the resulting plots of KA and KB densely located points magnify the potentials of their neighbours and form a common region on the plot.
Method of Potential Function
We now have functions that define the “power” of each class at a point, so we can introduce the recognition function
$\Phi(x) = K_A(x) - K_B(x)$
where x is the feature vector.
This function, which is actually a scalar field, is the solution of our problem.
Having a method of calculating Φ(x), we can predict the class of an arbitrary object x.
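A minimal sketch of Φ(x) in code, assuming for concreteness the kernel K = 1/(1 + a·ρ²) that is introduced later in this talk (the training points below are made up):

```python
import numpy as np

def kernel(x, x_star, a=1.0):
    """K(x, x*) = 1 / (1 + a * |x - x*|^2): the potential at x of a
    unit "charge" located at x*, decaying with squared distance."""
    d2 = float(np.sum((np.asarray(x, float) - np.asarray(x_star, float)) ** 2))
    return 1.0 / (1.0 + a * d2)

def phi(x, a_objects, b_objects, a=1.0):
    """Phi(x) = K_A(x) - K_B(x): superposition of class potentials."""
    return (sum(kernel(x, xj, a) for xj in a_objects)
            - sum(kernel(x, xj, a) for xj in b_objects))

def classify(x, a_objects, b_objects, a=1.0):
    """Positive field -> class A, negative -> class B."""
    return 'A' if phi(x, a_objects, b_objects, a) >= 0 else 'B'

A_train = [[0.0], [0.2]]
B_train = [[2.0], [2.2]]
```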
Method of Potential Function
$\Phi(x) = K_A(x) - K_B(x)$
Separation Power
Now let's return to the Alpha Procedure. As mentioned, it is based on choosing the best axis in terms of data separation quality. Let's elaborate on how the original Alpha Procedure does this.
Separation Power
To calculate which feature separates the data better, the Alpha Procedure offers a straightforward way of calculating the separation power: we find the intersection area, the region where objects cannot be unambiguously classified by putting a “separation point” between the A-class cloud and the B-class cloud.
It can be defined as
$F(p_q) = \frac{\omega_q}{l}$
where l is the overall number of objects and $\omega_q$ is the number of objects in the intersection area.
[Figure: two point clouds on an axis with their intersection area marked]
Separation Power
For the example in the figure, $\omega_q = 3$ and $l = 10$, so F = 3/10 = 0.3.
[Figure: two point clouds on an axis; three objects lie in the intersection area]
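The computation reads directly off the definition. A sketch (the intersection interval here is taken as the overlap of the two class ranges, and the sample is constructed so that ωq = 3 and l = 10, matching the figure's numbers):

```python
import numpy as np

def separation_power(axis_vals, labels):
    """F(p_q) = w_q / l: share of objects inside the intersection area."""
    a = axis_vals[labels == 'A']
    b = axis_vals[labels == 'B']
    lo = max(a.min(), b.min())    # left edge of the overlap interval
    hi = min(a.max(), b.max())    # right edge
    if hi < lo:                   # the clouds are fully separated
        return 0.0
    w_q = int(np.sum((axis_vals >= lo) & (axis_vals <= hi)))
    return w_q / len(axis_vals)

vals = np.array([0.0, 1.0, 2.0, 5.2, 5.5, 5.0, 6.0, 7.0, 8.0, 9.0])
labels = np.array(['A'] * 5 + ['B'] * 5)
```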
Separation Power
Regardless of the way it is calculated, the idea of the function F as a metric of separation quality gives us the ability to find other, possibly better, methods while keeping the same “interface”.
So let's fix that $0 \le F(p_q) \le 1$.
Potential Function as Separation Power
Let's consider a way to use the Potential Function as the separation power.
For each axis at each step, let's find the separation function $\Phi = \Phi(x)$. In our case it is a function of one argument, because our data is defined on a single axis.
To calculate Φ we have to define the kernel K(x, x*). It determines how a point influences the potential depending on the distance from its location.
Let's introduce the distance
$\rho = \rho(x, x^*) = |x - x^*|$
and take the kernel of the potential function to be
$K(\rho) = \frac{1}{1 + a\rho^2}$
Potential Function Shape
For the kernel function $K(\rho) = \frac{1}{1 + a\rho^2}$ the parameter a must be determined. It defines the shape of the potential function: the greater its value, the less influence objects exert on their neighbours, and the more tightly the plot of the potential function fits the original data.
The Alpha Procedure is a non-parametric method, and we want to keep it that way, so automatic determination of the kernel shape will be very helpful.
Potential Function Shape
Let's see how the parameter a influences the kernel shape $K(\rho) = \frac{1}{1 + a\rho^2}$.
[Figure: kernel plots for a = 0.5, a = 1 and a = 5]
Potential Function Shape
To determine the parameter, we have to estimate how accurately the kernel separates the data while also preventing overfitting.
In this study we used cross-validation: the optimal kernel is the one that makes the fewest recognition errors on the test dataset.
Let's introduce the recognition error functions
$\xi_A(\Phi) = \sum_{x_j \in A} \theta(-\Phi(x_j))$
$\xi_B(\Phi) = \sum_{x_j \in B} \theta(\Phi(x_j))$
where $\theta(x)$ is the Heaviside step function: $\theta(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases}$
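These error counters translate directly into code. A sketch (note that, exactly as in the formulas above, the boundary Φ = 0 is counted as an error for class A even though the classification rule assigns it to A):

```python
import numpy as np

def heaviside(x):
    """theta(x): 0 for x < 0, 1 for x >= 0."""
    return np.where(np.asarray(x) >= 0, 1, 0)

def recognition_error(phi_vals, labels):
    """xi(Phi) = xi_A(Phi) + xi_B(Phi)."""
    phi_vals = np.asarray(phi_vals)
    xi_a = int(np.sum(heaviside(-phi_vals[labels == 'A'])))  # A with Phi <= 0
    xi_b = int(np.sum(heaviside(phi_vals[labels == 'B'])))   # B with Phi >= 0
    return xi_a + xi_b

# Toy values of Phi at four points: one A and one B are misclassified.
phis = np.array([2.0, -1.0, 0.5, -0.3])
labels = np.array(['A', 'A', 'B', 'B'])
```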
Potential Function Shape
Then the overall recognition error of a function Φ with kernel K is
$\xi(\Phi) = \xi_A(\Phi) + \xi_B(\Phi)$
Now let's treat the kernel parameter a as a parameter of K: $K = K(\rho, a)$. Then the recognition function becomes $\Phi = \Phi(x, a)$, and the recognition error function becomes
$\xi_X(\Phi, a) = \sum_{x_j \in X} \theta(-\Phi(x_j, a))$
Potential Function Shape
Now let's fix the function Φ by currying; since the method of recognition does not matter for the recognition error function, we get
$\xi(a) = \xi_A(a) + \xi_B(a)$
Then the problem of finding the best parameter a reduces to finding the minimum of the function $\xi(a)$:
$a = \underset{a \in [a_{min},\, a_{max}]}{\arg\min}\ \xi(a)$
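A sketch of this parameter search. The talk uses cross-validation; for brevity this sketch scores each candidate a on a single held-out split and scans a log-spaced grid (both are simplifications, and all data below is made up):

```python
import numpy as np

def phi_1d(x, train_a, train_b, a):
    """1-D recognition function with kernel K(rho) = 1 / (1 + a*rho^2)."""
    pot = lambda pts: np.sum(1.0 / (1.0 + a * (x - pts) ** 2))
    return pot(train_a) - pot(train_b)

def xi(a, train_a, train_b, test_x, test_labels):
    """xi(a): misclassified held-out points for kernel parameter a."""
    wrong = 0
    for x, lab in zip(test_x, test_labels):
        pred = 'A' if phi_1d(x, train_a, train_b, a) >= 0 else 'B'
        wrong += (pred != lab)
    return wrong

def best_parameter(train_a, train_b, test_x, test_labels,
                   a_min=0.01, a_max=100.0, n=50):
    """a = argmin of xi(a) over [a_min, a_max], by grid search."""
    grid = np.geomspace(a_min, a_max, n)
    return min(grid, key=lambda a: xi(a, train_a, train_b,
                                      test_x, test_labels))

train_a, train_b = np.array([0.0, 0.2]), np.array([2.0, 2.2])
test_x, test_labels = np.array([0.1, 2.1]), ['A', 'B']
a_opt = best_parameter(train_a, train_b, test_x, test_labels)
```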
Potential Function: Conclusion
● The shape of the kernel function must be determined: $a = \underset{a \in [a_{min},\, a_{max}]}{\arg\min}\ \xi(a)$
● Having the parameter a, we have everything needed to build the recognition function Φ with the potential function for a specific data axis.
● Then we can calculate the separation power of the data using the recognition function Φ, which is now based on the potential function, so we use the recognition error function ξ to determine the better feature.
● All other steps of the Alpha Procedure algorithm remain the same, so we kept the “interface” untouched and changed only the internal way of determining the better feature.
Study
We have described how to incorporate the Potential Function into the Alpha Procedure's feature choice algorithm. Now let's see how it influences the method in general.
Study
● The potential function is able to split our data in a more complicated way than just into two partitions.
● The two-partition division covers only the case with one separation point between two clouds.
Study: Separation of complex data
Let's consider this case:
the blue circles here can represent any class that is described by a two-sided inequality.
Study: Separation of complex data
The figure shows a generated data sample that demonstrates how the PF can split the data into three partitions.
[Figure: three partitions on an axis with the intersection areas marked]
Study: Separation of complex data
● Here we have three partitions: left and right, where the potential function is positive, and a central one, where the potential function is negative.
● This approach gives the Alpha Procedure the flexibility to solve complex problems where the data is arranged on the axes in a complex way, and hence to return more accurate predictions.
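A small generated example of this three-partition behaviour (the data and kernel width are made up): class B sits between two clusters of class A, which no single separation point can handle, yet Φ changes sign twice along the axis.

```python
import numpy as np

a_pts = np.array([0.0, 0.3, 0.6, 4.0, 4.3, 4.6])   # two A clusters
b_pts = np.array([2.0, 2.2, 2.4])                  # B cloud in the middle

def phi(x, a=2.0):
    """Phi(x) = K_A(x) - K_B(x) with K(rho) = 1 / (1 + a*rho^2)."""
    pot = lambda pts: np.sum(1.0 / (1.0 + a * (x - pts) ** 2))
    return pot(a_pts) - pot(b_pts)

# Positive (A) on both outer clusters, negative (B) in the middle.
signs = [bool(phi(x) >= 0) for x in (0.3, 2.2, 4.3)]
```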
Study: Robustness
● Outliers can become a big problem for the quality of separation.
● Let's consider the “Banknote authentication Data Set”: https://archive.ics.uci.edu/ml/datasets/banknote+authentication
On this data an outlier extended the intersection area, and the number of wrongly classified objects is 210.
Study: Robustness
● Now let's use the potential function.
● In this case the number of wrongly classified objects is 151.
This means the separation quality increased by 28%:
$\frac{210 - 151}{210} \cdot 100\% = 28\%$
Study: Robustness
● This happened because an outlier influences the value of the potential function only at its own location and its neighbours. So we cannot break a cloud of objects just by putting one outlier in the middle of it.
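A small synthetic check of this robustness claim (the clouds and kernel width below are made up, not the banknote data): planting one mislabelled outlier in the middle of the opposite cloud inflates the interval-based intersection count dramatically, while the potential-function error count moves by a single object, the outlier itself.

```python
import numpy as np

a_pts = np.linspace(-1.0, 1.0, 20)      # class A cloud
b_pts = np.linspace(3.0, 5.0, 20)       # class B cloud
a_out = np.concatenate([a_pts, [4.0]])  # plus one A-labelled outlier in B

def overlap_count(a, b):
    """w_q of the original metric: objects inside the overlap interval."""
    lo, hi = max(a.min(), b.min()), min(a.max(), b.max())
    if hi < lo:
        return 0
    allv = np.concatenate([a, b])
    return int(np.sum((allv >= lo) & (allv <= hi)))

def pf_errors(a, b, k=2.0):
    """Misclassified points under Phi(x) = K_A(x) - K_B(x)."""
    phi = lambda x: (np.sum(1.0 / (1.0 + k * (x - a) ** 2))
                     - np.sum(1.0 / (1.0 + k * (x - b) ** 2)))
    return int(sum(phi(x) < 0 for x in a) + sum(phi(x) >= 0 for x in b))
```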
Study: Robustness
● Let's consider the same data without the outlier, separated with the original separation power.
● As we see, just one outlier added about 17.6% of wrongly classified objects:
$\frac{210 - 173}{210} \cdot 100\% \approx 17.6\%$
Study: Robustness
● On the other hand, without the outlier the potential function returned 149 errors, almost the same count as with the outlier:
$\frac{151 - 149}{151} \cdot 100\% \approx 1.3\%$
● This means that the Potential Function based separation power is much more robust than the original method.
Study: Overfitting
● Let's consider the “Wine Data Set”:
http://archive.ics.uci.edu/ml/machine-learning-databases/wine/
The data on the x0 axis of this dataset is not dense, and the algorithm chose a rather thin shape for the kernel function. Hence the plot of the potential function is clearly overfitted.
● A wider kernel would be better for overall recognition quality, but recognition on this specific axis would be worse.
Study: Overfitting
This shows that there are ways to enhance the method by improving the calculation of the kernel parameters.
Conclusions: Original separation power

Pros:
● Can uniquely determine the quality of separation of the data on an axis into two clouds
● Easy implementation
● Reliable (doesn't require parametrization to work properly)

Cons:
● Doesn't work on non-linearly separable datasets
● Not robust: single outliers can break the calculations
Conclusions: Potential Method

Pros:
● Can separate complex data dispositions, like segmented clouds
● Robust: outliers can hardly influence the error function value
● On most datasets where the original method worked well, this method also works and chooses the same axes as basis; the order of the axes is also the same or similar

Cons:
● Prone to overfitting
● The implementation is O(x²), while the original method is O(x)
● The potential function tries to separate the data as well as it can, but doesn't show real results if the kernel width (its parameters) is wrong
References
● Aizerman M.A., Braverman E.M., Rozonoer L.I.: Методы потенциальных функций в теории обучения машин (The method of potential functions in the theory of machine learning). Nauka, Moscow, 1970.
● Vasil'ev V.I., Lange T., Baranoff A.E.: Interpretation of fuzzy terms (in Russian). VIII Mezhdunarodnaya Konferenziya KDS-99, 1999, Kiazjaveli (Crimea).
● Vasil'ev V.I.: The reduction principle in problems of revealing regularities (in Russian). Cybernetics and System Analysis 5, Part I: 69–81, 2003; Part II: 7–16, 2004.