
1 Unsupervised Learning, K-means and Derivative algorithms

Virginia de Sa (desa at cogsci)

2 Unsupervised Learning

No target data required

Extract structure (density estimates, cluster memberships, or a reduced-dimensional representation) from the data

3 Unsupervised algorithms are often forms of Hebbian Learning

Hebbian learning refers to modifying the strength of a connection according to a function of the input and output activity (often simply their product).

It is based on a rule specified by the Canadian psychologist Donald Hebb in his 1949 book “The Organization of Behavior”:

“When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.” (Hebb, 1949) (figure below from http://www.qub.ac.uk/mgt/intsys/nnbiol.html)
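As a concrete illustration, here is a minimal sketch of the plain product form of the rule (the learning rate, the linear output unit, and the variable names are illustrative assumptions, not something specified on the slides):

```python
import numpy as np

def hebbian_update(w, x, y, eta=0.01):
    """One Hebbian step: strengthen the connection in proportion to the
    product of presynaptic activity x and postsynaptic activity y."""
    return w + eta * y * x

# toy usage with a single linear output unit
rng = np.random.default_rng(0)
w = rng.normal(size=3)   # weights from 3 inputs to one output
x = rng.normal(size=3)   # input (presynaptic) activities
y = w @ x                # output (postsynaptic) activity
w = hebbian_update(w, x, y)
```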

4 Data Compression

We might want to compress data from high-dimensional spaces for several reasons:

• to enable us (and also machine learning algorithms) to better see relationships

• for more efficient storage and transmission of information (gzip, jpg)

We want to do this while preserving as much of the useful information as possible. (Of course, how “useful” is determined is critical.)

Clustering and PCA are different methods of dimensionality reduction.

5 PCA and Clustering

PCA represents a point using a smaller number of dimensions. The directions are the directions of greatest variance in the data.

Clustering represents a point using prototype points.

6 K-means

a simple but effective clustering algorithm

partitions the data into K disjoint sets (clusters)

iterative batch algorithm

• Start with an initial guess of the k centers µ(j)

• Let S(j) be the set of all points closest to µ(j)

• Update µ(j) = (1/Nj) ∑n∈S(j) x(n)

• Repeat until there is no change in the means (a minimal sketch of this loop appears below)
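A minimal NumPy sketch of this batch loop (the data matrix X, the choice of k, and the random initialization are placeholders, not prescribed by the slides):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Batch K-means: assign every point to its nearest center,
    then recompute each center as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial guess of the k centers
    for _ in range(max_iters):
        # S(j): for every point, the index of its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # no change in the means
            break
        centers = new_centers
    return centers, assign
```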

7-16 K-means (figure slides illustrating the algorithm; the images are not preserved in this transcript)

17 K-means

a simple but effective clustering algorithm

partitions the data into K disjoint sets (clusters)

iterative batch algorithm

• Start with an initial guess of the k centers µ(j)

• Let S(j) be the set of all points closest to µ(j)

• Update µ(j) = (1/Nj) ∑n∈S(j) x(n)

• Repeat until there is no change in the means

18 Stochastic K-means = Competitive Learning

Find the weight vector w(j) that minimizes ||w(j) − x(n)|| (the weight closest to the pattern)

and move it closer to the pattern

∆w(j) = η(t)(x(n) − w(j))

decrease learning rate with time

(Figure: inputs x1, x2, x3, x4 connected through the weight matrix W to the competitive units.)
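A sketch of this online (stochastic) update in NumPy; the 1/(1+t) learning-rate schedule and the initialization are illustrative choices, not taken from the slides:

```python
import numpy as np

def competitive_learning(X, k, epochs=10, seed=0):
    """Stochastic K-means: for each pattern, move only the closest
    weight vector toward it, with a learning rate that decays over time."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=k, replace=False)].copy()
    t = 0
    for _ in range(epochs):
        for x in rng.permutation(X):
            eta = 1.0 / (1.0 + t)                          # decreasing learning rate η(t)
            j = np.argmin(np.linalg.norm(W - x, axis=1))   # winner: closest w(j)
            W[j] += eta * (x - W[j])                       # ∆w(j) = η(t)(x − w(j))
            t += 1
    return W
```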

19 Competitive Learning

(Figure: the patterns have been augmented and normalized and hence lie on a two-dimensional sphere in three dimensions; likewise, the weights of the three cluster centers have been normalized. The red curves show the trajectory of the weight vectors, which start at the red points and end at the center of a cluster. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 John Wiley & Sons.)

20-36 Competitive Learning (figure slides; the images are not preserved in this transcript)

37 Kohonen Feature Mapping

Update the neighbours (in output topography) as well as the winner. If y∗ refers to the winning output neuron then we update the weights

∆w(k) = η(t) Λ(|y(k) − y∗|, t) (x − w(k))

The window (neighbourhood) function Λ decreases with time.
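A sketch of this update for output units arranged on a 1-D line, with a Gaussian neighbourhood whose width shrinks over time (the particular decay schedules and the Gaussian form of Λ are illustrative assumptions):

```python
import numpy as np

def som_1d(X, n_units=20, epochs=20, seed=0):
    """Kohonen map with output units on a line: the winner and its
    topographic neighbours are all pulled toward each input."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=n_units, replace=False)].copy()
    positions = np.arange(n_units)                   # output positions y(k)
    for epoch in range(epochs):
        eta = 0.5 * (1.0 - epoch / epochs)           # learning rate decreases with time
        sigma = max(n_units / 2 * (1.0 - epoch / epochs), 1.0)   # window width shrinks with time
        for x in rng.permutation(X):
            winner = np.argmin(np.linalg.norm(W - x, axis=1))    # y*
            lam = np.exp(-(positions - winner) ** 2 / (2 * sigma ** 2))  # Λ(|y(k) − y*|, t)
            W += eta * lam[:, None] * (x - W)        # ∆w(k) = η(t) Λ (x − w(k))
    return W
```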

38 Kohonen Feature Mapping

(Figure: a mapping from a two-dimensional disk source space onto a one-dimensional line of target units is learned; at the state shown, the sensed point causes the winning unit's source point, and, because of the window function, those of its neighbours, to move toward it. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 John Wiley & Sons.)

39 Kohonen Feature Mapping

(Figure: the window function in one dimension (left) and two dimensions (right); in each case the weights at the maximally active unit receive the largest update, while units more distant in the target space receive smaller updates. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 John Wiley & Sons.)

40 Kohonen Feature Mapping

(Figure: snapshots of the map as a topologically ordered mapping develops; the number of pattern presentations is listed. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 John Wiley & Sons.)

41 Kohonen Feature Mapping

(Figure: the correspondence between a two-dimensional square source space and the grid of target units, shown by linking each target grid point to the source point that maximally excites it. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 John Wiley & Sons.)

42 Kohonen Feature Mapping

(Figure: an unlucky random initialization of the weights and sequence of patterns can leave a kink in the map; even extensive further training does not remove the kink. In such cases, learning should be restarted with randomized weights and possibly a wider window function and slower decay in learning. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 John Wiley & Sons.)

43 More examples

http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/JavaPaper/node24.html

44 Some SOM applets

applet from rfhs8012.fh-regensburg.de/ saj39122/jfroehl/diplom/e-sample.html

applet from www.patol.com/java/fill/index.html

applet from www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html

45 Let's look at a visual cortex example

Obermayer1990.pdf http://www.pnas.org/cgi/reprint/87/21/8345

46 Neural Gas – Learn the Topology

http://www.ki.inf.tu-dresden.de/ fritzke/FuzzyPaper/node6.html

47 Aside: related supervised algorithms (Kohonen's Learning Vector Quantization)

Supervised methods for moving cluster centers (they make use of the given class labels)

Can have more than one center per class.

Move centers to reduce the number of misclassified patterns.

Various flavours exist. LVQ2.1 minimizes the number of misclassified patterns.

48 LVQ2.1 Learning rule

Let w(i) and w(j) be the two closest codebook vectors

Only if exactly one of w(i) and w(j) belongs to the correct class and min(||x − w(i)||/||x − w(j)||, ||x − w(j)||/||x − w(i)||) < s (x lies within a window of the border region) do we apply the following (the rules below assume w(i) is from the correct class; switch the rules if not):

w(i) = w(i) + ε(x−w(i))

w(j) = w(j) − ε(x−w(j))
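A sketch of a single LVQ2.1 step. The window test below is written in Kohonen's usual form, min(di/dj, dj/di) > s with s a little below 1 (expressing that x lies near the border between the two closest codebook vectors); the threshold, learning rate, and names are illustrative:

```python
import numpy as np

def lvq21_step(W, labels, x, y, s=0.7, eps=0.05):
    """One LVQ2.1 update on the codebook matrix W (one row per codebook
    vector, with class labels in `labels`) for a pattern x of class y."""
    d = np.linalg.norm(W - x, axis=1)
    i, j = np.argsort(d)[:2]                        # the two closest codebook vectors
    in_window = min(d[i] / d[j], d[j] / d[i]) > s   # x lies near their border
    exactly_one_correct = (labels[i] == y) != (labels[j] == y)
    if in_window and exactly_one_correct:
        if labels[j] == y:                          # make w(i) the correct-class vector
            i, j = j, i
        W[i] += eps * (x - W[i])                    # pull the correct codebook toward x
        W[j] -= eps * (x - W[j])                    # push the incorrect codebook away
    return W
```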

49 Improved LVQ2.1 Learning rule

Let w(i) and w(j) be the two closest codebook vectors

Only if exactly one of w(i) and w(j) belongs to the correct class and min(||x − w(i)||/||x − w(j)||, ||x − w(j)||/||x − w(i)||) < s(t) (x lies within a window of the border region that decreases with time) do we apply the following (the rules below assume w(i) is from the correct class; switch the rules if not):

w(i) = w(i) + ε (x − w(i)) / ||x − w(i)||

w(j) = w(j) − ε (x − w(j)) / ||x − w(j)||

50 LVQ2.1 in 2-D

w(i) = w(i) + ε (x − w(i)) / ||x − w(i)||

w(j) = w(j) − ε (x − w(j)) / ||x − w(j)||

w(i) is from the correct class, w(j) from an incorrect class

(Figure illustrating the updates in two dimensions; only the labels x1, x2 and the codebook labels ya, yb, yc, yd, ye are preserved in the transcript.)

51 LVQ in 1-D

(Figure: the class-conditional quantities P(CA)p(x|CA) and P(CB)p(x|CB) plotted along x, with the Class A and Class B decision regions marked; arrows indicate the force to the left (<--) and the force to the right (-->) on the codebook vectors under LVQ 2.0 and LVQ 2.1.)

52 LVQ in 1-D, Separable Distributions

(Figure: the same construction for separable class distributions: P(CA)p(x|CA) and P(CB)p(x|CB), the Class A and Class B decision regions, and the leftward and rightward forces on the codebook vectors under LVQ 2.0 and LVQ 2.1.)

53 Problem with K-means

What will happen here?

54 Solution

Model the clusters as Gaussians, learn the covariance ellipses from the data, and use the probabilities associated with the Gaussian densities to determine membership.

55 Mixture of Gaussians (MOG) = A softer k-means

Model the data as coming from a mixture of Gaussians, where you don't know which Gaussian generated which data point

Each Gaussian cluster has an associated proportion or prior probability πk

p(x) = ∑_{k=1}^{c} πk pk(x)

In the mixture-of-Gaussians case,

pk(x) ∼ N(µ(k), Σk)

pk(x) = |2πΣk|^(−1/2) exp( −(x − µ(k))^T Σk^(−1) (x − µ(k)) / 2 )

mixture models can be generalized
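A sketch of evaluating this density with SciPy's multivariate normal (the two components and their parameters below are toy values chosen only for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# toy 2-component mixture in 2-D
pis = np.array([0.3, 0.7])                                # priors π_k (sum to 1)
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]        # means µ(k)
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]    # covariances Σ_k

def mixture_density(x):
    """p(x) = Σ_k π_k N(x; µ(k), Σ_k)"""
    return sum(pi * multivariate_normal(mu, cov).pdf(x)
               for pi, mu, cov in zip(pis, mus, covs))

print(mixture_density(np.array([1.0, 1.0])))
```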

56 MOG Solution

Normalize the probabilities to determine the responsibility of each cluster for each data point (soft responsibility).

rk(x(n)) = πk pk(x(n)) / ∑i πi pi(x(n))

Now solve similarly to the k-means solution: recompute the mean, covariance, and overall weighting for each cluster, with each data point contributing weight according to its responsibility. Then iterate as in k-means.

µ(k) = ∑n rk(x(n)) x(n) / ∑n rk(x(n))

Σk = ∑n rk(x(n)) (x(n) − µ(k))² / ∑n rk(x(n))

πk = ∑n rk(x(n)) / ∑i ∑n ri(x(n))
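One iteration of these updates written out with NumPy and SciPy (a sketch only: initialization, convergence checks, and numerical safeguards are omitted, and the variable names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mog_em_step(X, pis, mus, covs):
    """One soft iteration: compute responsibilities for every point,
    then re-estimate the means, covariances, and mixing proportions."""
    N, k = len(X), len(pis)
    # E-step: r_k(x(n)) = π_k p_k(x(n)) / Σ_i π_i p_i(x(n))
    R = np.column_stack([pis[j] * multivariate_normal(mus[j], covs[j]).pdf(X)
                         for j in range(k)])
    R /= R.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means, covariances, and priors
    Nk = R.sum(axis=0)
    mus = [R[:, j] @ X / Nk[j] for j in range(k)]
    covs = [(R[:, j, None] * (X - mus[j])).T @ (X - mus[j]) / Nk[j] for j in range(k)]
    pis = Nk / N
    return pis, mus, covs
```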


57 Issues with MOG

Quite sensitive to initial conditions (see applet)

it’s a good idea to initialize with k-means

There are a large number of parameters. We can reduce the number of parameters by either of the following (both are sketched below):

a) constraining Gaussians to have diagonal covariance matrices

b) constraining Gaussians to have the same covariance matrix
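Both constraints can be applied directly to a list of estimated covariance matrices; a minimal sketch (the helper names are illustrative):

```python
import numpy as np

def diagonalize(covs):
    """(a) keep only each cluster's per-dimension variances."""
    return [np.diag(np.diag(S)) for S in covs]

def tie(covs, weights):
    """(b) give every cluster the same covariance matrix,
    here a weighted average of the individual estimates."""
    shared = sum(w * S for w, S in zip(weights, covs)) / sum(weights)
    return [shared.copy() for _ in covs]
```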

58 Gaussians (slides by Andrew W. Moore)

The following slides are from Andrew W. Moore's "Gaussians" tutorial (Associate Professor, School of Computer Science, Carnegie Mellon University; Copyright © 2001, Andrew W. Moore; source repository: http://www.cs.cmu.edu/~awm/tutorials).

Gaussians in Data Mining: • Why we should care • The entropy of a PDF • Univariate Gaussians • Multivariate Gaussians • Bayes Rule and Gaussians • Maximum Likelihood and MAP using Gaussians

59-65 Clustering with Gaussian Mixtures (figures from Andrew W. Moore's slides, Copyright © 2001, 2004, Andrew W. Moore)

Snapshots of the mixture-of-Gaussians fit after the first, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations, followed by a slide showing some bio-assay data.