
PROBABILITY AND STATISTICS

ESP Project Machine Learning and Graphical Modeling

Anh Khoi Vu-Nguyen

Professor Ivan T. Ivanov


Abstract

In today's society there is an endless stream of information being processed. With each passing day, data across many domains become more complex and harder to interpret. For instance, how do services such as Hotmail or Gmail distinguish spam from legitimate messages? How do face detection and optical character recognition work? It is bewildering to think that a machine can categorize a handwritten letter when every person writes so differently.

Although these tasks appear extremely complex at first glance, the stepping stones to the actual process are quite beautiful and deceptively simple in nature. It is analogous to being struck with awe at the first sight of a castle or a painting, and then seeing the construction process from beginning to end: the feat is no doubt still impressive, yet it seems far more within the realm of reality and not something born of science fiction.

As such, the aim of this short research project is to establish an early foundation in Machine Learning and Graphical Modelling.


Introduction

Machine Learning is the study of data-driven methods capable of mimicking, understanding and aiding human and biological information processing tasks. In other words, it is a field of study that applies computer algorithms so that machines can carry out tasks that would otherwise be too time consuming or simply too complex. Some of these tasks include medical diagnosis, fraud detection and weather prediction. They span a multitude of different fields, and all of them can in principle be tackled by human effort. However, human physical and mental effort, known as "labour" in the social sciences, is a precious resource, and machine learning is therefore a vital field of study for advancing technological growth.

The emphasis of machine learning is not only to devise algorithms that allow machines to learn automatically, but ultimately to teach them to a level where they no longer require human intervention. This is why the machine learning paradigm is often referred to as programming by example. A common misconception is that the computers are programmed to solve the task at hand directly. That is not quite correct: the computers are instead taught methods for approaching problems. In other words, the computer establishes its own solution based on the examples and scenarios that we provide. It is important to highlight this difference, because if the computers do not actually learn, they exist only for a single specific task and their behaviour remains static. They would not be able to handle new situations, and we could no longer call them "intelligent systems". Tasks concerning language or vision would be nearly impossible to tackle without the ability to learn and adapt.

An article on theoretical machine learning nicely summarizes a typical learning problem in a simple diagram.


From the diagram it can be seen that a typical learning problem starts from a collection of 'labeled' examples, which are used to build the actual learning algorithm. From then on, the machine interprets new data by attempting to classify it and link it to the examples it has already seen. Based on the results of these predicted classifications, the machine adds the new information back into its model; it gains a new example and grows in knowledge, so that when it faces a similar scenario in the future it can more easily extract the information it needs. Combine a single machine that grows in knowledge with a whole network of computers, and the results are astounding.
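To make this loop concrete, the following minimal Python sketch illustrates the "programming by example" idea with a toy nearest-neighbour learner. The feature values, labels and the NearestNeighbourLearner class are invented purely for illustration and are not part of any particular library.

# Minimal sketch of the learning loop described above.
# The features and labels are made up; a real spam filter would use far
# richer features and a more careful update rule.

def distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class NearestNeighbourLearner:
    def __init__(self, labelled_examples):
        # Stage 1: start from a set of labelled examples.
        self.memory = list(labelled_examples)

    def classify(self, features):
        # Stage 2: link a new example to the closest known example.
        _, label = min(self.memory, key=lambda ex: distance(ex[0], features))
        return label

    def learn(self, features, label):
        # Stage 3: fold the newly labelled example back into the model.
        self.memory.append((features, label))

# Hypothetical training data: (number of links, exclamation marks) -> label
examples = [((9, 7), "spam"), ((8, 5), "spam"), ((1, 0), "ham"), ((0, 1), "ham")]
learner = NearestNeighbourLearner(examples)

message = (7, 6)
label = learner.classify(message)   # predicted: "spam"
learner.learn(message, label)       # the machine "grows in knowledge"
print(label)

A real system would of course only fold a prediction back into its memory once the label had been confirmed in some way; the sketch only shows the shape of the loop.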

In short, machine learning is a field of research that aims to extract useful information from data in a given setting through automatic methods. It is a very broad field with ties to Artificial Intelligence (AI), statistics, mathematics, physics, theoretical computer science and systems biology, the last of which will be discussed towards the end of this report. Compelling as machine learning is, it is difficult to develop the topic further without first going into its fundamental elements.

More specifically, much of machine learning rests on graphical modelling: the algorithms behind machine learning create a connection between probability and graph theory. Graphical models are a flexible class of statistical models that combine the fundamentals of probability with the intuitive representation of relationships given by graphs. The two largest subsets of graphical models are Bayesian networks and Markov networks.

Examples of a Bayesian network and a Markov network, respectively, are shown in the figure.


Bayesian Networks

To begin with, Bayesian networks, sometimes referred to as belief networks, focus strongly on modelling rather than inference. They emphasize building independence statements among random variables by using graphs; in particular, Bayesian networks encode conditional dependencies using a special graph known as a directed acyclic graph (DAG). A Bayesian network involves a structured factorisation of two or more random variables defined on a probability space; in other words, it is a structured factorisation of the joint probability distribution of the given random variables.

A common example of a Bayesian network is the probabilistic relationship underlying a medical test: the network models the probabilistic relationship between symptoms and a disease. At the heart of Bayesian networks we use relationships between random variables to constrain the setting and ultimately reduce the number of variable interactions that must be specified, because it is not only impractical but often impossible to enumerate the total sample space of an event that may involve millions of variables.

Fundamentally this is done through the definition of conditional probability and the use of Bayes' theorem. For instance, suppose the security alarm of a house sounds within thirty minutes of installation, and that only two events could be responsible for the alarm sounding: either the house was burgled or the alarm malfunctioned.

Let

p(B) = probability that the house was burgled

p(M) = probability that the alarm malfunctioned

p(M,B) = probability that the alarm malfunctioned and the house was burgled at the same time

Notice that the joint distribution p(M,B) involves a total of 2^2 = 4 states; however, by relying on the definition of conditional probability, the number of states that need to be specified can be reduced to 3.

By the application of conditional probability:

p(M,B) = p(M|B)p(B) = p(B|M)p(M)

In this situation there are two possible breakdowns of the probability that the alarm malfunctioned and the house was burgled at the same time, and in either case the number of states that must be specified has been reduced from 4 to 3. In general, for a distribution over n binary random variables, the number of states that need to be specified can always be reduced to at most 2^n - 1.
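As a rough illustration of this counting argument, the short Python sketch below rebuilds the four joint states of the alarm example from the three specified numbers; the probability values themselves are made up.

# Minimal sketch of the alarm example above; the numbers are invented.
# The full joint over the binary variables B (burgled) and M (malfunction)
# has 2**2 = 4 states, but specifying p(B) and p(M|B) needs only 3 numbers.

p_B = {True: 0.001, False: 0.999}                      # 1 free parameter
p_M_given_B = {True: {True: 0.05, False: 0.95},        # 2 free parameters
               False: {True: 0.02, False: 0.98}}

def p_joint(m, b):
    """p(M,B) = p(M|B) p(B), the factorisation used in the text."""
    return p_M_given_B[b][m] * p_B[b]

# The four joint states are recovered from the three specified parameters
# (the remaining entries follow because probabilities sum to one).
for b in (True, False):
    for m in (True, False):
        print(f"p(M={m}, B={b}) = {p_joint(m, b):.6f}")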

The purpose of this trivial example is to highlight the importance of reducing the amount of work required to perform inference. At the same time, it shows that the number of states that need to be specified grows exponentially with the number of variables. Although only one state was removed in this case, the goal is ultimately to remove as many states as possible that are not needed to perform inference or to draw logical conclusions.

More generally, a discrete Bayesian network is a distribution of the form

p(x1,...,xn) = ∏i p(xi | pa(xi))

Another way to view a Bayesian network is through a directed acyclic graph (DAG), where the ith node in the graph corresponds to the factor p(xi | pa(xi)).

An example of a DAG can be seen in the figure above. The key point of a DAG is that every edge is a directed arrow from one node to another, with no directed cycles. Nodes that are formed from other nodes are known as children or descendants, while the nodes that form them are referred to as parents or ancestors; in the tree DAG above, node B is a child of node A, its parent. In the general Bayesian network distribution given above, pa(xi) denotes the parental variables of the variable xi. The notion of a parent-child relationship between nodes is crucial to reducing the number of states, and is therefore vital when setting up the graph to perform Bayesian inference.
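The factorisation p(x1,...,xn) = ∏i p(xi | pa(xi)) can be written out directly in code. The tiny three-node network below (A with children B and C) and its conditional probability tables are hypothetical, chosen only to show how the parent sets pa(xi) enter the product.

# Minimal sketch of the factorisation p(x1,...,xn) = prod_i p(xi | pa(xi)).
# The network A -> B, A -> C and its tables are made up for illustration.

parents = {"A": [], "B": ["A"], "C": ["A"]}

# Conditional probability tables, indexed by (value, parent values).
cpt = {
    "A": {(True, ()): 0.3, (False, ()): 0.7},
    "B": {(True, (True,)): 0.9, (False, (True,)): 0.1,
          (True, (False,)): 0.2, (False, (False,)): 0.8},
    "C": {(True, (True,)): 0.6, (False, (True,)): 0.4,
          (True, (False,)): 0.1, (False, (False,)): 0.9},
}

def joint(assignment):
    """Evaluate the joint probability of a full assignment, e.g. {"A": True, ...}."""
    prob = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        prob *= cpt[var][(assignment[var], pa_values)]
    return prob

print(joint({"A": True, "B": True, "C": False}))  # 0.3 * 0.9 * 0.4 = 0.108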

The importance of a directed acyclic graph lies in its ability to model many different kinds of information. As seen from the DAG above, there is always an ordered sequence that must be followed, which gives it the advantages of both graphical and textual representations thanks to the underlying order of the events.

Although Bayesian networks generally reduce the number of states by establishing dependencies, the general equation presented earlier can be manipulated to represent any Bayesian network by using Bayes' theorem, while still retaining the properties of a DAG.

Bayes' theorem is as follows:

p(A|B) = p(B|A) p(A) / p(B)

Although the theorem seems deceptively simple, it follows directly from the basic axioms of probability. On an interesting albeit unrelated note, Sir Harold Jeffreys, a mathematician, statistician, geophysicist and astronomer, once wrote that Bayes' theorem "is to the theory of probability what Pythagoras' theorem is to geometry". Without further ado, by repeated application of Bayes' theorem (the chain rule of probability), any joint distribution can be written in the Bayesian network form p(x1,...,xn) = ∏i p(xi | pa(xi)):

p(x1,...,xn) = p(x1 | x2,...,xn) p(x2,...,xn)

= p(x1 | x2,...,xn) p(x2 | x3,...,xn) p(x3,...,xn)

= ∏i p(xi | xi+1,...,xn)

It is important to note that any probability distribution can be represented by a Bayesian network with an associated directed acyclic graph. Sometimes, however, the Bayesian network yields no new information; such networks are referred to as "cascade" DAGs.


The figure above is a cascade DAG. A cascade representation tells us nothing valuable about the independence of the events; in this case it simply represents the distribution p(x1,x2,x3).

On the other hand, directed acyclic graphs are often very useful when conditional independence exists within the probability distribution. For instance, the following DAG represents the same probability distribution p(x1,x2,x3) as the earlier cascade graph, but due to the 'orientation' of the graph we can apply conditional independence and rewrite the distribution as

p(x1,x2,x3) = p(x3|x1,x2) p(x1) p(x2)

As such, whenever we have access to a directed acyclic graph that is not in cascade form, we can extract valuable information (conditional independence) and simplify the expression.

Before closing this section, it is worth mentioning the importance of a node positioned so that the arrows from its neighbours (not its descendants) point towards it. Such a node is known as a collider, and it allows us to draw further information about the distribution. In the following figure, conditioning on X3 (a collider) makes X1 and X2 graphically dependent on one another.

However, if X3 is marginalised over, X1 and X2 are unconditionally independent, because there is no information about X3.

Colliders are important for the notion of directional separation, also called "d"-separation. Essentially, having information on a collider can release further information about its parents or its descendants, depending on the scenario. In other words, colliders can reduce the total number of states through dependencies read off a given directed acyclic graph.
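This "explaining away" behaviour at a collider can be checked numerically. The following sketch assumes a hypothetical collider X1 -> X3 <- X2 in which X3 is simply the logical OR of its parents; the prior probabilities are made up for illustration.

# Minimal numerical sketch of "explaining away" at a collider X1 -> X3 <- X2.
# The priors are invented; X3 is true whenever either cause is true.
from itertools import product

p_x1, p_x2 = 0.1, 0.1   # independent binary causes

def p(x1, x2, x3):
    """Joint p(x1,x2,x3) with X3 deterministically equal to (x1 or x2)."""
    prior = (p_x1 if x1 else 1 - p_x1) * (p_x2 if x2 else 1 - p_x2)
    return prior if x3 == (x1 or x2) else 0.0

def conditional(query, evidence):
    """p(query | evidence) by brute-force enumeration over the joint."""
    num = den = 0.0
    for x1, x2, x3 in product([False, True], repeat=3):
        state = {"x1": x1, "x2": x2, "x3": x3}
        if all(state[k] == v for k, v in evidence.items()):
            pr = p(x1, x2, x3)
            den += pr
            if all(state[k] == v for k, v in query.items()):
                num += pr
    return num / den

print(conditional({"x1": True}, {"x2": True}))               # equals p(X1): marginally independent
print(conditional({"x1": True}, {"x3": True}))               # raised: X3 gives information about X1
print(conditional({"x1": True}, {"x3": True, "x2": True}))   # lowered again: X2 explains X3 away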

The purpose of this section on Bayesian networks was to show how they relate back to machine learning. In short, a Bayesian network represents a factorisation of a probability distribution into conditional probabilities. Bayesian networks always correspond to a DAG, which helps us determine whether variables are conditionally independent or not: if two variables are independent in a Bayesian network, then they are independent in any distribution consistent with that network. In other words, Bayesian networks are a visual, graphical method that relies on the application of Bayes' theorem to limit the number of events, and they ultimately aid in the construction of learning algorithms in machine learning.


Markov Networks

Before relating the importance of inference to machine learning, there is one more graphical model of importance. Markov networks (MNs) are graphical models that resemble Bayesian networks in that both focus on modelling rather than inference. Markov networks also have their own factorisation of the joint probability distribution: they factorise through objects known as potentials. Potentials (φ) are non-negative functions of a set of variables, here taken to be strictly greater than 0.

For instance, a Markov network factorises a distribution differently than a Bayesian network would:

p(a,b,c) = (1/Z) φ(a,b) φ(b,c)

Notice that factorisation via the Markov network removes the conditional probability factors found in the factorisation of a Bayesian network. Instead, Markov networks are defined as a product of potentials, each representing the relative mass of probability within a clique. Furthermore, to ensure that the distribution sums to 1, we need a normalisation constant, in this case 1/Z, where

Z = Σa,b,c φ(a,b) φ(b,c)
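A minimal sketch of this factorisation, with made-up potential values, shows how the normalisation constant Z turns a product of potentials into a proper distribution.

# Minimal sketch of p(a,b,c) = (1/Z) * phi_ab(a,b) * phi_bc(b,c);
# the potential values below are invented for illustration.
from itertools import product

def phi_ab(a, b):
    return 2.0 if a == b else 0.5     # favours a and b agreeing

def phi_bc(b, c):
    return 3.0 if b == c else 1.0     # favours b and c agreeing

states = [False, True]

# Normalisation constant Z: sum of the potential product over all states.
Z = sum(phi_ab(a, b) * phi_bc(b, c) for a, b, c in product(states, repeat=3))

def p(a, b, c):
    return phi_ab(a, b) * phi_bc(b, c) / Z

# The normalised values sum to one, as required of a distribution.
print(sum(p(a, b, c) for a, b, c in product(states, repeat=3)))  # 1.0
print(p(True, True, True))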

Similarly to Bayesian networks, Markov networks have a general equation describing any such distribution as a factorisation of potentials:

p(x1,...,xn) = (1/Z) ∏c φc(Xc)

where the product runs over the cliques c of the graph and Xc denotes the variables in clique c.

Another difference between Bayesian and Markov networks is that all the connections in a Markov network are undirected. As such, the general equation above represents an undirected graph with at most N cliques, where a clique is a fully connected subset of nodes (all members of a clique are neighbours). Because the links are undirected, a MN can represent dependencies that a Bayesian network cannot, such as cyclic dependencies.
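If the networkx package is available, the maximal cliques of an undirected graph, i.e. the scopes over which the potentials φ are defined, can be listed directly; the small graph below is invented for illustration.

# Minimal sketch of reading the cliques of an undirected graph, which are the
# scopes of the Markov network potentials. Assumes networkx is installed.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])

# Each maximal clique is a fully connected subset of nodes and would be
# assigned one potential phi_c in the factorisation above.
for clique in nx.find_cliques(G):
    print(sorted(clique))   # e.g. ['a', 'b', 'c'] and ['c', 'd']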

A Markov network and its factorised potentials can be seen in the following example.


There are three noteworthy Markov properties that define a Markov network.

First, there is the local Markov property, which states that a variable is conditionally independent of all other variables given its neighbours. For positive potentials:

p(x | all other variables) = p(x | ne(x))

where ne(x) represents the neighbours of the variable x.

The second is the pairwise Markov property: any two non-adjacent variables are conditionally independent given all other variables,

x ⊥ y | (all other variables), whenever x and y are not neighbours.

The final property is the global Markov property: any two subsets of variables are conditionally independent given a separating subset,

XA ⊥ XB | XS, whenever XS separates XA from XB in the graph.

An important theorem known as the Hammersley-Clifford theorem allows us to represent positive probability distributions as Markov networks. More specifically, the theorem states that any probability distribution with positive mass or density that satisfies one of the Markov properties can be factorised over the cliques of its undirected graph G. Essentially, the Hammersley-Clifford theorem shows that the factorisation property holds for any undirected graph with positive potentials.


For instance, a trivial distribution over the undirected chain graph x1 - x2 - x3 can be factorised simply because it is an undirected graph with positive potentials:

p(x1,x2,x3) = p(x1|x2,x3) p(x2,x3)

= p(x1|x2) p(x2,x3)     (since x1 ⊥ x3 | x2, from the graph)

= φ12(x1,x2) φ23(x2,x3)

In short, whether a distribution is given in terms of its conditional independence (Markov) properties with respect to a graph or as a factorisation into clique potentials, we can switch between the two views interchangeably by using the Hammersley-Clifford theorem, provided that the potentials are positive.

Chain Graphs

A chain graph (CG) is a graphical model that combines both Markov and Bayesian network properties. The graph of a CG has both directed and undirected links, but no directed cycles (no looping). Chain graphs benefit from having both potentials and colliders, allowing them to express conditional independence statements that neither Markov nor Bayesian networks can express alone. As such, chain graphs can be viewed as the unification and generalization of both Markov and Bayesian networks.

The general distribution for a chain graph is similar to those of Markov and Bayesian networks:

p(x1,...,xn) = ∏τ∈T p(Xτ | pa(Xτ))

where T is the set of chain components and Xτ are the variables associated with chain component τ.


Inference

Bayesian and Markov networks are just two of the many graphical models that depict conditional dependence and independence between random variables. Most graphical models are useful either for modelling or for inference. Inference applies when we already have some knowledge about the behaviour of the variables within the distribution. More specifically, once we have a graphical model, inference consists of evidence propagation and model validation in order to derive logical conclusions. Evidence propagation, sometimes called belief propagation or belief updating, is the study of how the parameters of the model are affected by new evidence and beliefs. This is usually done by altering the specification of the soft evidence, or by changing variables so that they become hard evidence.

Soft evidence, or uncertain evidence, is evidence under which the variable can still be in more than one state, while hard evidence is evidence under which we are sure that the variable resides in exactly one state.
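Entering hard evidence can be sketched as clamping the observed variable and renormalising what remains of the joint table. The two-variable table below is hypothetical and only meant to show the mechanics.

# Minimal sketch of entering hard evidence into a small discrete model:
# clamp the observed variable and renormalise. The joint table is made up.
joint = {  # p(Burgled, Alarm) over two binary variables
    (True, True): 0.0009, (True, False): 0.0001,
    (False, True): 0.0100, (False, False): 0.9890,
}

def posterior_given_hard_evidence(var_index, value):
    """Posterior over the remaining states after observing one variable."""
    kept = {s: p for s, p in joint.items() if s[var_index] == value}
    z = sum(kept.values())                      # probability of the evidence
    return {s: p / z for s, p in kept.items()}  # renormalised beliefs

# Hard evidence: the alarm (index 1) is known to have sounded.
print(posterior_given_hard_evidence(1, True))
# Soft evidence would instead reweight the two alarm states
# rather than removing one of them outright.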

Factor graphs (FGs) are another type of graphical model, one that focuses mainly on inference algorithms. Factor graphs are the result of the graph conversions required to perform evidence propagation. Essentially, a factor graph is an undirected graph connecting variables and their factors.

Given a function, the general 'equation' for a factor graph is

f(x1,...,xn) = ∏j fj(Xj)

where each fj is a factor that has its own node (usually represented by a square in the graph) and Xj is the subset of variables it depends on. Similarly, to represent a distribution we need only add a normalisation constant Z:

p(X) = (1/Z) ∏j fj(Xj),  where Z = ΣX ∏j fj(Xj)

In this case X represents all the variables within the distribution; without the normalisation constant, the probabilities would not sum to 1.

An example of a factor graph is shown in the figure: the square in the middle represents the factor, which in this case is also the potential φ(a,b,c).


Learning

Ultimately, returning to our original premise, graphical models and inference are simply means to establish learning algorithms. Fitting graphical models, or learning through graphical models, is a term often used in artificial intelligence. The process breaks down into two main stages: structure learning and parameter learning.

Structure Learning

Structure learning involves identifying a distribution as close as possible to the correct one in the probability space. In other words, it involves identifying structured objects instead of single labels or real values. Structure learning consists of three families of approaches: constraint-based, score-based and hybrid algorithms.

Constraint-based algorithms use statistical tests to learn conditional independence relationships from the data, where the statistical tests can simply be hypothesis tests, a standard method of statistical inference. Constraint-based algorithms are mainly used for Bayesian networks, although they have recently been applied to Markov networks as well. An example application can be found in physical chemistry, in testing whether or not certain bodies obey Newton's equations of motion.

Score-based algorithms treat learning as a "model selection problem" and assign a score based on how well a structure matches the data. The score is generated through a statistical measure known as the goodness of fit.

An example of a goodness-of-fit equation is Pearson's chi-squared statistic:

χ² = Σi (Oi - Ei)² / Ei

where Oi are the observed counts, Ei are the counts expected under the candidate structure, and χ² serves as the score.
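As a rough illustration, the sketch below computes such a chi-squared score for made-up observed and expected counts; a score-based search would compare scores of this kind across candidate structures.

# Minimal sketch of a chi-squared goodness-of-fit score; the counts are invented.
observed = [18, 22, 40, 20]          # observed counts in four cells
expected = [25.0, 25.0, 25.0, 25.0]  # counts expected under a candidate structure

chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_squared)   # lower values indicate a better fit of structure to data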

Hybrid algorithms, as the name suggests, are simply a combination of the constraint-based and score-based approaches; both statistical tests and score functions are used in the construction of the algorithm.

For structure learning it is important to add that a number of basic assumptions are made. These assumptions simplify the task and can be important in order to properly define the global distribution of X, and they apply regardless of whether the network is Bayesian or Markov.


Parameter Learning

Parameter learning is done after structure learning, by which time the structure is already known. As its name suggests, parameter learning focuses on learning the parameters of the given structure.

Systems Biology

With the developments in graphical models over the last century, graphical modelling has become increasingly popular due to its expressiveness. In particular, graphical models are heavily applied in genetics and systems biology. Systems biology is an "emerging approach applied to biomedical and biological scientific research". It is interesting to note that with each passing year the field deals with less and less biology and more with mathematics; it can even be argued that it now involves more statistics than biology, mainly because systems biology requires extensive use of mathematical and computational models, which in turn inspires new mathematical theory. The goal of systems biology is to model and discover new properties of cells, tissues and organisms. These typically involve, for instance, metabolic networks, which determine the physiological and biochemical properties of a cell.

The figure above shows a transcriptional regulatory network of Mycobacterium tuberculosis, a pathogenic bacterial species and the main cause of tuberculosis. Evidently the network is difficult to analyze without more information, but this is what the end product of a network or graphical model looks like in a realistic application. If we were to decompose such graphs down to their basic foundations, we would see properties that resemble graphical models such as Markov or Bayesian networks.

Generally, graphical models in systems biology are used to describe and identify interdependencies among genes and gene products. The amount of data, parameters and variables in these problems is very large, which is why graphical modelling is needed in systems biology in order to perform inference, in particular Bayesian inference.

Although most genetic networks are undirected graphs, there are times when directed conditional dependencies are important. As a result, Bayesian network inference is used when time-dependent variables are incorporated into the systems biology model. Bayesian networks are therefore important for drawing logical conclusions about causal relations when time matters and when independence or dependence becomes an issue.

Conclusion

Briefly put, machine learning is an ever-changing field that relies heavily on graphical models to construct the elementary operations required to establish an "intelligent system". A system is deemed intelligent when it is capable of adapting and learning in new environments. These learning algorithms are constructed mainly via graphical models: popular examples include, but are not limited to, Bayesian and Markov networks, as well as the special cases of chain graphs and factor graphs. Once the models are established, inference can be performed to draw logical conclusions and to begin constructing the learning algorithms, where learning is a two-step process consisting of structure learning and parameter learning, in that order.

Machine learning is a broad field practised in many other disciplines, such as systems biology and genetics. Systems biology is arguably one of the more important areas where the application of graphical models can be seen firsthand. There is no doubt that the future of machine learning will always go hand in hand with technological growth.


Works Cited

Barber, David. Bayesian Reasoning and Machine Learning. Cambridge: Cambridge University Press, 2011. Print.

"Bayesian Network Learning Structure." <http://www.cs.iastate.edu/~jtian/cs573/WWW/Lectures/lecture12-BN-4-scored-based-learning-2up.pdf>.

"Complex Systems and Networks Lab." <http://cosnet.bifi.es/research-lines/systems-biology>.

"Factor graph." Wikipedia. Wikimedia Foundation. <http://en.wikipedia.org/wiki/Factor_graph>.

"Graphical models and inference algorithms." Graphical Models. <http://www.psi.toronto.edu/index.php?q=graphical%20models>.

Scutari and Strimmer. Introduction to Graphical Modelling. John Wiley & Sons, 2011.

"Structured prediction." Wikipedia. Wikimedia Foundation, 12 Apr. 2013. Web. 26 Dec. 2013. <http://en.wikipedia.org/wiki/Structured_prediction>.

"Systems biology." Wikipedia. Wikimedia Foundation. <http://en.wikipedia.org/wiki/systems_biology>.

"Theoretical Machine Learning." Princeton University, Department of Computer Science. n.d. Web. <http://www.cs.princeton.edu/courses/archive/spr

Mitchell, Margaret. "Use of Directed Acyclic Graph Analysis in Generating Instructions for Multiple Users." Language Technology Group. <http://crpit.com/confpapers/CRPITV9Mitchell.pdf>.