
Page 1: Probability & Data Mining

Intelligent Data Analysis & Data Mining, a.k.a. DM2 2007/2008. Alfredo Vellido

Probability & Data Mining 3. Bayesian Neural Networks & Latent models

Page 2: Probability & Data Mining

Contents of the course (hopefully)

1. Introduction to DM and its methodologies
2. Visual DM: Exploratory DM through visualization
3. Pattern recognition 1
4. Pattern recognition 2
5. Feature extraction
6. Feature selection
7. Error estimation
8. Linear classifiers, kernels and SVMs
9. Probability in Data Mining
10. Latency, generativity, manifolds and all that SML
11. Applications of GTM: from medicine to ecology
12. DM Case studies

Page 3: Probability & Data Mining

Bayes & supervised neural networks

Page 4: Probability & Data Mining

Recap: Bayesian Neural Networks

“Probability theory should lie at the foundation of any learning algorithm; otherwise we risk that the reasoning performed in it be inconsistent in some cases” (heuristics)

“Embedding probability theory into machine learning techniques requires modeling assumptions to be made explicit; it also automatically satisfies the likelihood principle and provides a natural framework to handle uncertainty.” (what type of uncertainty?)

“Probability theory is ideally suited as a theoretical foundation for pattern recognition”

Page 5: Probability & Data Mining

Step back: plain Neural Networks
From a simple perceptron to the MLP

MLP: Two- and three-layered networks as universal approximators.

$$ y_k(\mathbf{x};\mathbf{w}) = \tilde{g}\left( \sum_{j=0}^{M} w_{kj}^{(2)}\, g\left( \sum_{i=0}^{d} w_{ji}^{(1)} x_i \right) \right) $$
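To make the reconstructed formula concrete, here is a minimal sketch (not part of the original slides) of a two-layer MLP forward pass; the layer sizes, the tanh hidden activation and the identity output are illustrative assumptions:

```python
import numpy as np

def mlp_forward(x, W1, W2):
    """Two-layer MLP: y_k = g~( sum_j W2[k,j] * g( sum_i W1[j,i] * x_i ) ).

    g = tanh for the hidden layer, g~ = identity for a regression output.
    A constant 1 is appended to x and to the hidden activations to play
    the role of the bias weights w_{j0} and w_{k0}.
    """
    x = np.append(x, 1.0)                # input plus bias unit
    hidden = np.tanh(W1 @ x)             # hidden activations g(.)
    hidden = np.append(hidden, 1.0)      # hidden plus bias unit
    return W2 @ hidden                   # linear output g~(.)

rng = np.random.default_rng(0)
d, M, K = 3, 5, 2                        # inputs, hidden units, outputs
W1 = rng.normal(scale=0.5, size=(M, d + 1))
W2 = rng.normal(scale=0.5, size=(K, M + 1))
print(mlp_forward(rng.normal(size=d), W1, W2))
```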

Page 6: Probability & Data Mining

Step back: plain Neural Networks

Page 7: Probability & Data Mining

Bayesian Neural Networks (3)

The Bayesian formalism framework works at different levels of the development of a supervised neural network:

Inference of the weights:

$$ p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\, p(\mathbf{w})}{p(D)} $$

Inference of the hyperparameters:

$$ p(\alpha, \beta \mid D) = \frac{p(D \mid \alpha, \beta)\, p(\alpha, \beta)}{p(D)} $$

Page 8: Probability & Data Mining

Bayesian Neural Networks (7)

$$ p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\, p(\mathbf{w})}{p(D)} $$

$$ p(\mathbf{w}) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W) $$

The Bayesian formalism framework works at different levels of the development of a supervised neural network:

We start from an a priori distribution of the weights

… where $E_W = \frac{1}{2}\sum_{i=1}^{W} w_i^2$ … this corresponds to the use of a

weight decay regularizer
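A small numerical check of this claim (illustrative, not from the slides): for the isotropic Gaussian prior above, the negative log-prior is exactly α E_W plus a constant, i.e. a weight-decay penalty; all values below are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

alpha = 0.1
w = np.random.default_rng(1).normal(size=10)   # some weight vector

E_W = 0.5 * np.sum(w ** 2)                     # E_W = (1/2) sum_i w_i^2

# Gaussian prior p(w) = (1/Z_W(alpha)) exp(-alpha * E_W),
# i.e. each w_i ~ N(0, 1/alpha), with Z_W(alpha) = (2*pi/alpha)^(W/2).
Z_W = (2 * np.pi / alpha) ** (len(w) / 2)
neg_log_prior = alpha * E_W + np.log(Z_W)

# Compare with the Gaussian density evaluated directly:
direct = -multivariate_normal(mean=np.zeros(len(w)),
                              cov=np.eye(len(w)) / alpha).logpdf(w)
print(neg_log_prior, direct)   # identical up to floating point
```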

Page 9: Probability & Data Mining

Bayesian Neural Networks (8)

$$ p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\, p(\mathbf{w})}{p(D)} $$

$$ p(D \mid \mathbf{w}) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D) $$

The Bayesian formalism framework works at different levels of the development of a supervised neural network:

…starting from an expression for the evidence or likelihood

… where $E_D = \frac{1}{2}\sum_{n=1}^{N} \{\, y(\mathbf{x}^n;\mathbf{w}) - t^n \,\}^2$ … and assuming that the

targets have been generated by a smooth function with added Gaussian noise (β!) …

Page 10: Probability & Data Mining

The Bayesian formalism framework works at different levels of the development of a supervised neural network:

… then, the a posteriori probability is proportional to:

… Notice (!!!) that the maximization is almost equivalent to the ML solution for large data sets …

Bayesian Neural Networks (9)

$$ p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\, p(\mathbf{w})}{p(D)} \qquad \text{(posterior} \propto \text{likelihood} \times \text{prior)} $$

$$ p(\mathbf{w} \mid D) \propto \exp\big( -(\beta E_D + \alpha E_W) \big) $$

Page 11: Probability & Data Mining

The Bayesian formalism framework works at different levels of the development of a supervised neural network:

… In what sense do we say that the maximization is almost equivalent to the ML solution for large data sets?

The most probable (likely) solution for the weights, or w_MP, corresponds to the minimum of

… or minimum error

… and maximum posterior probability

Bayesian Neural Networks (10)

$$ p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\, p(\mathbf{w})}{p(D)} \qquad p(\mathbf{w} \mid D) \propto \exp\big( -(\beta E_D + \alpha E_W) \big) = \exp(-S(\mathbf{w})) $$

$$ S(\mathbf{w}) = \beta E_D + \alpha E_W $$

$$ S(\mathbf{w}) = \frac{\beta}{2} \sum_{n=1}^{N} \{\, y(\mathbf{x}^n;\mathbf{w}) - t^n \,\}^2 + \frac{\alpha}{2} \sum_{i=1}^{W} w_i^2 $$
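As a sketch of what w_MP minimizes, here is the regularized error S(w) in code, assuming a toy linear "network" and made-up data (none of this is from the slides):

```python
import numpy as np

def S(w, X, t, net, alpha, beta):
    """Regularized error S(w) = beta*E_D + alpha*E_W.

    E_D = (1/2) sum_n (y(x^n; w) - t^n)^2   (data misfit)
    E_W = (1/2) sum_i w_i^2                 (weight decay)
    Minimizing S(w) yields the most probable weights w_MP.
    """
    E_D = 0.5 * np.sum((net(X, w) - t) ** 2)
    E_W = 0.5 * np.sum(w ** 2)
    return beta * E_D + alpha * E_W

# Toy example: a "network" that is just a linear model y = X @ w.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.1 * rng.normal(size=20)
linear_net = lambda X, w: X @ w

print(S(w_true, X, t, linear_net, alpha=0.01, beta=100.0))
```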

Page 12: Probability & Data Mining

The Bayesian formalism framework works at different levels of the development of a supervised neural network:

Distribution of the network outputs

REMEMBER:

Error bars in regression

Two terms: one from the intrinsic noise, another from the width of the posterior distribution of the weights

Bayesian Neural Networks (11)

$$ p(t \mid \mathbf{x}, D) = \int p(t \mid \mathbf{x}, \mathbf{w})\, p(\mathbf{w} \mid D)\, d\mathbf{w} $$

$$ p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\, p(\mathbf{w})}{p(D)} $$

$$ p(t \mid \mathbf{x}, D) \approx \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{\big(t - y_{MP}(\mathbf{x})\big)^2}{2\sigma^2} \right), \qquad \sigma^2(\mathbf{x}) = \frac{1}{\beta} + \mathbf{g}^{\mathrm{T}} \mathbf{A}^{-1} \mathbf{g} $$

with $\mathbf{g} = \nabla_{\mathbf{w}}\, y(\mathbf{x};\mathbf{w})\big|_{\mathbf{w}_{MP}}$ and $\mathbf{A} = \nabla\nabla S(\mathbf{w}_{MP})$.
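A sketch of the two error-bar terms, again for a toy linear model where the gradient g and the Hessian A of S(w) are available exactly; all names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 0.01, 100.0
X = rng.normal(size=(20, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

# For a linear "network" y = x.w the Hessian of S(w) is exact:
A = beta * X.T @ X + alpha * np.eye(3)           # A = Hessian of S at w_MP
w_MP = beta * np.linalg.solve(A, X.T @ t)        # minimizer of S(w)

x_new = rng.normal(size=3)
g = x_new                                        # g = grad_w y(x; w) (exact here)
y_pred = x_new @ w_MP
sigma2 = 1.0 / beta + g @ np.linalg.solve(A, g)  # noise term + posterior-width term
print(f"prediction {y_pred:.3f} +/- {np.sqrt(sigma2):.3f}")
```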

Page 13: Probability & Data Mining

The Bayesian formalism framework works at different levels of the development of a supervised neural network:

Variance (β) and regularization (α) hyperparameters…

...can be calculated by differentiating the evidence with respect to the hyperparameters, obtaining iterative formulae for their estimation

Bayesian Neural Networks (12)

$$ p(\alpha, \beta \mid D) = \frac{p(D \mid \alpha, \beta)\, p(\alpha, \beta)}{p(D)} $$

$$ p(D \mid \alpha, \beta) = \int p(D \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)\, d\mathbf{w} $$

$$ S(\mathbf{w}) = \frac{\beta}{2} \sum_{n=1}^{N} \{\, y(\mathbf{x}^n;\mathbf{w}) - t^n \,\}^2 + \frac{\alpha}{2} \sum_{i=1}^{W} w_i^2 $$
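The iterative formulae alluded to here are, in the standard evidence framework (MacKay; Bishop Ch. 10), α ← γ/(2E_W) and β ← (N−γ)/(2E_D), with γ = Σ_i λ_i/(λ_i+α) counting the well-determined parameters. A sketch on a toy linear model (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
N, W = X.shape

alpha, beta = 1.0, 1.0                      # crude initialization
for _ in range(50):
    A = beta * X.T @ X + alpha * np.eye(W)  # Hessian of S(w)
    w_MP = beta * np.linalg.solve(A, X.T @ t)
    E_W = 0.5 * w_MP @ w_MP
    E_D = 0.5 * np.sum((X @ w_MP - t) ** 2)
    # eigenvalues lambda_i of beta * X^T X; gamma = well-determined parameters
    lam = np.linalg.eigvalsh(beta * X.T @ X)
    gamma = np.sum(lam / (lam + alpha))
    alpha = gamma / (2 * E_W)               # re-estimate regularization
    beta = (N - gamma) / (2 * E_D)          # re-estimate inverse noise variance

print(alpha, beta, w_MP)   # beta should approach 1/0.1^2 = 100
```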

Page 14: Probability & Data Mining

Remember Over- and under-fitting

A STEP ASIDE: OVERFITTING
When fitting a model to noisy data (ALWAYS), we make the assumption that the data have been generated from some “TRUE” model by making predictions at given values of the inputs, then adding some amount of noise to each point, where the noise is drawn from a normal distribution with an unknown variance. Our task is to discover both this model and the width of the noise distribution. In doing so, we aim for a compromise between bias, where our model does not follow the right trend in the data (and so does not match well with the underlying truth), and variance, where our model fits the data points too closely, fitting the noise rather than trying to capture the true distribution. These two extremes are known as underfitting and overfitting.
IMPORTANT!: the number of parameters in a model; the higher, the more accurately the model can fit the data. If the number of parameters in our model is larger than that of the “true one”, then we risk overfitting, and if our model contains fewer parameters than the truth, we could underfit.

Page 15: Probability & Data Mining

Over- and under-fitting (2)
A STEP ASIDE: OVERFITTING

The illustration shows how increasing the number of parameters in the model can result in overfitting. The 9 data points are generated from a cubic polynomial, which contains 4 parameters (the true model), with added noise. We can see that by selecting candidate models containing more parameters than the truth, we can reduce, and even eliminate, any mismatch between the data points and our model. This occurs when the number of parameters is the same as the number of data points (an 8th order polynomial has 9 parameters).
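A minimal reproduction of this setup (illustrative; the exact cubic and noise level are made up): 9 points from a noisy cubic, fitted with 4 and with 9 parameters. The 8th-order fit drives the training error to (nearly) zero while the test error grows:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(-1, 1, 9)                                  # 9 data points
t = 1 - 2*x + 0.5*x**2 + 3*x**3 + 0.2*rng.normal(size=9)   # cubic truth + noise

for degree in (3, 8):
    coeffs = np.polyfit(x, t, degree)          # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x) - t) ** 2)
    # fresh test points drawn from the same truth:
    x_test = rng.uniform(-1, 1, 200)
    t_test = (1 - 2*x_test + 0.5*x_test**2 + 3*x_test**3
              + 0.2*rng.normal(size=200))
    test_err = np.mean((np.polyval(coeffs, x_test) - t_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```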

Page 16: Probability & Data Mining

Bayesian Neural Networks (13)
What else does this Bayesian formalism offer us?

Variable selection: Automatic Relevance Determination (ARD): regularization terms are associated with each network input. The a priori distribution of the weights is defined as:

$$ p(\mathbf{w}) = \frac{1}{Z_W} \exp\left( -\frac{1}{2} \sum_{c} \alpha_c \sum_{i \in c} w_i^2 \right) $$

Automatic regularization: the regularization coefficients associated with irrelevant inputs are inferred to have high values, which makes the corresponding weights tend to 0. This can be interpreted as soft input pruning. Inspection of the final values {α_c} indicates the relative relevance of each variable.
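A sketch of the ARD idea on a toy linear model with one α_c per input (data, names and the update schedule are illustrative; the updates are the standard MacKay-style re-estimates): irrelevant inputs end up with large α_c and near-zero weights.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 100
X = rng.normal(size=(N, 3))
t = 2.0 * X[:, 0] + 0.05 * rng.normal(size=N)   # only input 0 is relevant

alpha = np.ones(3)                               # one ARD coefficient per input
beta = 1.0
for _ in range(100):
    A = beta * X.T @ X + np.diag(alpha)          # Hessian with per-input priors
    w_MP = beta * np.linalg.solve(A, X.T @ t)
    gamma = 1.0 - alpha * np.diag(np.linalg.inv(A))  # well-determinedness per weight
    alpha = gamma / (w_MP ** 2 + 1e-12)          # ARD re-estimate (MacKay)
    alpha = np.minimum(alpha, 1e8)               # cap to keep the demo numerically tame
    beta = (N - gamma.sum()) / np.sum((X @ w_MP - t) ** 2)

print("alpha:", alpha)   # alphas for inputs 1 and 2 blow up -> soft pruning
print("w_MP :", w_MP)
```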

Page 17: Probability & Data Mining

Bayesian Neural Networks (14)
Bayesian formalism in a nutshell

1. Initialization of hyperparameters α and β. Initialization of weights with values obtained from the prior p(w|α).
2. Network training using standard optimization techniques to minimize S(w) = βE_D + αE_W.
3. Insert, every few iterations of the algorithm, the re-evaluation of α and β, which entails the evaluation of the Hessian of the error function, A = ∇∇S(w), and the extraction of its eigenvalues.
[Optionally, repeat the three previous steps to limit the negative effect of local minima of the error function.]
4. Repeat the previous steps for a sample of different possible models or, alternatively, use a “committee of networks” or select models with high evidence to define a “mixture of experts”.

Page 18: Probability & Data Mining

Bayesians

C.M. Bishop, 1995. Neural Networks for Pattern Recognition (Ch. 10). Oxford University Press.

C.M. Bishop, 2006. Pattern Recognition and Machine Learning. Springer Verlag.

D.J. MacKay, 2002. Information Theory, Inference & Learning Algorithms. Cambridge Univ. Press.
www.inference.phy.cam.ac.uk/mackay/itila/book.html

E.T. Jaynes, 2003. Probability Theory: The Logic of Science. Cambridge University Press.
http://omega.albany.edu:8008/JaynesBook.html

Tom Loredo, 2000. Bayesian Inference, a Practical Primer (tutorial).
http://astrosun.tn.cornell.edu/staff/loredo/bayes/tjl.html

Page 19: Probability & Data Mining

Let’s hold the Bayesian horses… for a while

Page 20: Probability & Data Mining

Latency, Projection and Generativity

(or … towards unsupervised Bayesian Neural Networks)

Page 21: Probability & Data Mining

Latent models

Page 22: Probability & Data Mining

What is a latent model?

First of all… what is a latent variable?

According to the WIKIPEDIA…

“Latent variables, as opposed to observable variables, are those variables that cannot be directly observed but are rather inferred from other variables that can be observed and directly measured.

Examples of latent variables include quality of life, business confidence, morale, happiness, conservatism. Latent variables are also called hypothetical variables or hypothetical constructs. The use of latent variables is common in social sciences and, to an extent, in the economics domain. The exact definition of latent variables varies in different domains.”

Page 23: Probability & Data Mining

What is a latent model? (2)

First of all… what is a latent variable?

Still in WIKIPEDIA …

“One advantage of using latent variables is that it reduces the dimensionality of data. A large number of observable variables can be aggregated to represent an underlying concept, making it easier for human beings to understand and assimilate information.”

Haven’t we heard all this before?

Page 24: Probability & Data Mining

Visualization of multivariate data

How to deal with multivariate and usually high-dimensional data? … It depends:

Low dimensionality (1-3D)
Medium dimensionality (4-10D)
High dimensionality (>10D)

Page 25: Probability & Data Mining

Visualization of multivariate data: low-medium dimensionality (<10D)

Spatial coordinates: 3D requires interactivity.

Further pre-cognitive elements allow us to add dimensions:

color, movement, shape, …

Fancy solutions: glyphs (Chernoff faces, stick-figures, whiskers...)

Page 26: Probability & Data Mining
Page 27: Probability & Data Mining

Visualization of multivariate data: high dimensional data

How do we visualize high dimensional data without losing much on the way?

Some alternatives are relatively simple, others are not

Eliminate redundant or uninformative dimensions (variables) … at least the problem is alleviated. FEATURE SELECTION: Lluis …

Divide & Conquer: create multiple simultaneous low-dimensional visualizations.

Then … latent and projection methods!

Page 28: Probability & Data Mining

What is a latent model? (3)
“Latent Variable Models and Factor Analysis” by David Bartholomew, Martin Knott (2nd ed., 1999)

“LVMs provide an important tool for the analysis of multivariate data […] a conceptual framework within which many disparate methods can be unified and a base from which new methods can be developed”

“A statistical model specifies the joint distribution of a set of random variables – P(X) = P(x_1, x_2, …, x_D) – and it becomes a LVM when some of the variables are unobservable.”
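To ground this definition, a minimal generative sketch of the most familiar continuous LVM, factor analysis, where the observed x depend on an unobserved z (all dimensions and parameters below are made up):

```python
import numpy as np

rng = np.random.default_rng(7)
D, q, N = 5, 2, 5000          # observed dims, latent dims, samples

W = rng.normal(size=(D, q))   # factor loadings
mu = rng.normal(size=D)       # mean of the observed variables
psi = 0.1 * np.ones(D)        # diagonal noise variances

# Generative story: latent z is never observed, only x is.
z = rng.normal(size=(N, q))                                # z ~ N(0, I)
x = z @ W.T + mu + rng.normal(size=(N, D)) * np.sqrt(psi)  # x|z ~ N(Wz+mu, Psi)

# The model implies P(x) = N(mu, W W^T + Psi); check the covariance empirically:
print(np.allclose(np.cov(x.T), W @ W.T + np.diag(psi), atol=0.2))
```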

Page 29: Probability & Data Mining

What is a latent model? (4)
“Latent Variable Models and Factor Analysis” by David Bartholomew, Martin Knott (2nd ed., 1999)

“… the interesting question concerns why LVs should be introduced into a model…”

– One reason … is to reduce dimensionality … “if the information contained in the interrelationships … can be conveyed … in a much smaller set, our ability to see the structure in the data will be much improved.”

– Another reason … “is that latent quantities figure prominently in many fields to which statistical methods are applied … social sciences … business and marketing …”

What is “Business confidence”? And what is “General intelligence”? … and “Perception of risk”?

“… as if they were measurable quantities but for which no measuring instruments exist …”

Page 30: Probability & Data Mining

What is a latent model? (Example)

www.cc.gatech.edu/gvu/user_surveys/

9th GVU’s WWW users survey, produced by the Graphics, Visualization & Usability Center (Kehoe and Pitkow, 1998).

Self-selection of respondents. Internet experience above the population average, but demographic variables well balanced. Marketing-wise, the over-representation of experienced users could be beneficial.

Page 31: Probability & Data Mining

What is a latent model? (Example)

ORIGINAL FACTORS

FACTOR | DESCRIPTION | ATTRIBUTES
1 | Shopping experience: Compatibility | Control and convenience
2 | Environmental control: Consumer risk perception | Trust and security
3 | Customer service | Responsiveness and empathy; Information richness
4 | Affordability | --
5 | Shopping experience: Effort | Ease of use
6 | Product perception: Variety | Information richness
7 | Customer service / Consumer risk: Performance risk | Assurance and reliability
8 | Consumer risk: Image risk | Elitism
9 | Shopping experience / Customer service: Effort | Responsiveness and empathy

SELECTED FACTORS

FACTOR | DESCRIPTION | ATTRIBUTES
1 | Shopping experience: Compatibility | Control and convenience
2 | Environmental control: Consumer risk perception | Trust and security
3 | Affordability | --
4 | Shopping experience: Effort | Ease of use
5 | Shopping experience: Customer service | Responsiveness and empathy

Page 32: Probability & Data Mining

What is a latent model? (5)
“Latent Variable Models and Factor Analysis” by David Bartholomew, Martin Knott (2nd ed., 1999)

“… we prefer to speak of LVs since this accurately conveys the idea of something underlying the observed …”

“… A LV can be real in the sense that it could, in principle, be measured. For example, Personal Wealth …”

“… Business Confidence is not something which exists in the sense that Personal Wealth does …”

“… Much of the philosophical debate which takes place on LVMs centres on REIFICATION …”

REALISTS vs. INSTRUMENTALISTS

Page 33: Probability & Data Mining

What is a latent model? (6)

LATENT VARIABLES (rows) vs. OBSERVED VARIABLES (columns):

                    | OBSERVED: CATEGORICAL  | OBSERVED: CONTINUOUS
LATENT: CATEGORICAL | Latent Class Analysis  | Latent Profile Analysis
LATENT: CONTINUOUS  | Latent Trait Analysis  | Factor Analysis

LATENT TRAIT ANALYSIS: Tino, Kaban, Sun (2004). A Generative Probabilistic Approach to Visualizing Sets of Symbolic Sequences. Proceedings of the ACM SIGKDD.