Transcript
Page 1: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Understanding your data with Bayesian networks

(in python)

Bartek Wilczyński

University of Warsaw

PyData Silicon Valley, May 5th 2014

Page 2:

Are you confused enough?

Or should I confuse you a bit more?

Image from xkcd.org/552/

Page 3:

Data show: Confused students score better!

Data from Eric Mazur

Page 4:

There may be factors we haven't thought about

● Maybe confusion helps with learning?

● Or maybe there is an alternative explanation?

● As long as these are just cartoon models – we cannot really rule out any structure

Diagram: two alternative models. Either one with Paying attention, Being confused and Correct answer, or one with only Being confused and Correct answer.

Page 5:

What do I mean by data?

Sex  Age    Smoking    Stress  Lung    Heart   Feel
M    0-20   never      N       no      no      great
F    70+    sometimes  N       minor   no      OK
M    50-70  daily      Y       no      severe  not-so-well
M    20-50  daily      N       no      minor   OK
F    70+    never      N       no      minor   great
F    20-50  sometimes  Y       severe  minor   not-so-well
F    20-50  never      Y       no      no      great
M    20-50  sometimes  N       minor   no      great
M    50-70  never      Y       severe  no      OK
F    0-20   never      N       no      severe  OK
M    20-50  daily      Y       no      no      OK
M    0-20   daily      N       no      no      not-so-well
M    20-50  never      N       minor   no      OK
...  ...    ...        ...     ...     ...     ...

Page 6:

Network of connections

Smoking (daily, sometimes, never)

Age (0-20, 20-50, 50-70, 70+)

Stressful job (yes, no)

Lung problems (no, minor, severe)

Heart problems (no, minor, severe)

Sex (male, female)

How did you feel this morning? (great, OK, not-so-well, terrible)

Page 7:

What is a Bayesian Network?

● A directed acyclic graph

● with nodes representing random variables

● and edges between nodes representing dependencies (not necessarily causal)

● Each edge is directed from a parent to a child, so all nodes with edges pointing to a given node constitute its set of parents

● Each variable is associated with a value domain and a probability distribution conditional on its parents' values
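The definition above can be made concrete with a small sketch. It stores a network as a mapping from each node to its set of parents and checks the "no cycles" requirement with a standard topological-sort argument (Kahn's algorithm). The function name `is_acyclic` and the example networks are illustrative assumptions, not part of the talk or of any library.

```python
from collections import deque

def is_acyclic(parents):
    """Return True if the graph given as {node: set of parents} has no directed cycle."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    remaining = {n: set(parents.get(n, ())) for n in nodes}  # parents not yet removed
    children = {n: set() for n in nodes}
    for child, ps in remaining.items():
        for p in ps:
            children[p].add(child)
    # Kahn's algorithm: repeatedly remove nodes that have no remaining parents.
    queue = deque(n for n, ps in remaining.items() if not ps)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            remaining[child].discard(node)
            if not remaining[child]:
                queue.append(child)
    # Every node gets removed exactly when there is no directed cycle.
    return removed == len(nodes)

# Hypothetical structure for the student example: one common parent.
student_net = {
    "paying_attention": set(),
    "being_confused": {"paying_attention"},
    "correct_answer": {"paying_attention"},
}
# Not a valid Bayesian network: a and b are each other's parents.
cyclic_net = {"a": {"b"}, "b": {"a"}}

print(is_acyclic(student_net), is_acyclic(cyclic_net))
```

The parent-set representation is convenient because the conditional distribution of each variable is indexed by exactly that set.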

Page 8:

Back to our confused students

● Let us consider our model of confused students

● We can consider the model with an additional variable

● We need to have data on the additional variable to be predictive

● Sometimes we need to use “wrong” models if they are predictive

Diagram: Paying attention is the parent of both Being confused and Correct answer.

Being confused, given Paying attention:

              Paying attention
              yes    no
confused      80%    0%
not confused  20%    100%

Correct answer, given Paying attention:

           Paying attention
           yes    no
correct    50%    20%
incorrect  50%    80%
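As a rough illustration of how these two tables produce the "confused students score better" effect without any direct link between confusion and correctness, the sketch below multiplies the conditional probabilities into a joint distribution and then conditions on confusion. The prior `P_ATTENTION = 0.5` is an assumption made only for this example; the slides do not give it.

```python
# Conditional probability tables from the slide (the parent is "paying attention").
P_CONFUSED = {True: 0.8, False: 0.0}  # P(confused | paying attention)
P_CORRECT = {True: 0.5, False: 0.2}   # P(correct | paying attention)
P_ATTENTION = 0.5                     # ASSUMPTION: prior not given in the talk

def joint(attention, confused, correct):
    """P(attention, confused, correct) as a product of the local distributions."""
    p = P_ATTENTION if attention else 1 - P_ATTENTION
    p *= P_CONFUSED[attention] if confused else 1 - P_CONFUSED[attention]
    p *= P_CORRECT[attention] if correct else 1 - P_CORRECT[attention]
    return p

def p_correct_given_confused(confused):
    """P(correct | confused), summing out the unobserved attention variable."""
    numerator = sum(joint(a, confused, True) for a in (True, False))
    denominator = sum(joint(a, confused, c)
                      for a in (True, False) for c in (True, False))
    return numerator / denominator

print(p_correct_given_confused(True))   # confused students
print(p_correct_given_confused(False))  # students who are not confused
```

Under the assumed prior, confused students answer correctly about 50% of the time and the others about 25%, even though in this model confusion has no influence on the answer once attention is known.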

Page 9:

Can we find the “best” Bayesian Network?

● Given a dataset with observations, we can try to find the “best” network topology (i.e. the best collection of parent sets)

● In order to do it automatically we need a scoring function to define what we mean by “best”

● A scoring function is useful if it can be written as a sum over variables, so that the best network consists of the best parent set for each variable (modulo acyclicity)
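A minimal sketch of such a decomposable score follows. It uses a log-likelihood with a BIC-style penalty as a stand-in; this is an illustrative choice, not necessarily the exact BDe or MDL formula used in practice, and the names `local_score` and `network_score` are hypothetical. As a simplification, the parameter count uses only the parent configurations observed in the data.

```python
import math
from collections import Counter

def local_score(data, child, parents):
    """Penalized log-likelihood of one variable given a candidate parent set."""
    parents = tuple(parents)
    config = lambda row: tuple(row[p] for p in parents)
    joint_counts = Counter((config(row), row[child]) for row in data)
    parent_counts = Counter(config(row) for row in data)
    # Maximum-likelihood log-probability of the child column given its parents.
    loglik = sum(n * math.log(n / parent_counts[cfg])
                 for (cfg, _), n in joint_counts.items())
    # BIC-style penalty (simplified: observed parent configurations only).
    child_values = {row[child] for row in data}
    n_params = len(parent_counts) * (len(child_values) - 1)
    return loglik - 0.5 * math.log(len(data)) * n_params

def network_score(data, parent_sets):
    """Decomposable score: a plain sum of per-variable local scores."""
    return sum(local_score(data, child, ps) for child, ps in parent_sets.items())

# Toy data in which y always equals x.
data = [{"x": i % 2, "y": i % 2} for i in range(8)]
with_parent = local_score(data, "y", ["x"])
without_parent = local_score(data, "y", [])
total = network_score(data, {"x": [], "y": ["x"]})
print(with_parent, without_parent, total)
```

Because the total is a sum, each variable's parent set can be optimized separately, as long as something else guarantees that the combined graph stays acyclic.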

Page 10:

How to find the best network?

● There are generally three main approaches to defining BN scores:

– Bayesian statistics, e.g. BDe (Herskovits et al. '95)

– Information Theoretic, e.g. MDL (Lam et al. '94)

– Hypothesis testing, e.g. MMPC (Salehi et al. '10)

● There are also hybrid approaches, like the recent MIT (de Campos '06) approach that uses information theory and hypothesis testing

● We have two issues:

– There are exponentially many potential parent sets

– The desired network needs to have no cycles

● The second issue is more important and makes the problem NP-complete (Chickering '96)
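To give a feel for the first issue, the sketch below enumerates and counts the candidate parent sets of a single variable. The optional `max_parents` limit is an assumption added for illustration (a common way to keep the search tractable), and both function names are hypothetical.

```python
from itertools import combinations
from math import comb

def candidate_parent_sets(variables, child, max_parents=None):
    """Yield every subset of the other variables, optionally capped in size."""
    others = [v for v in variables if v != child]
    limit = len(others) if max_parents is None else min(max_parents, len(others))
    for size in range(limit + 1):
        for subset in combinations(others, size):
            yield frozenset(subset)

def count_parent_sets(n_variables, max_parents=None):
    """Number of candidate parent sets for one variable among n_variables."""
    n_others = n_variables - 1
    limit = n_others if max_parents is None else min(max_parents, n_others)
    return sum(comb(n_others, size) for size in range(limit + 1))

# The seven variables of the health example from the earlier slides.
variables = ["Sex", "Age", "Smoking", "Stress", "Lung", "Heart", "Feel"]
all_sets = list(candidate_parent_sets(variables, "Feel"))
print(len(all_sets), count_parent_sets(7), count_parent_sets(7, max_parents=2))
print(count_parent_sets(50))  # grows as 2**(n-1) without a size limit
```

With seven variables each node has 2**6 = 64 possible parent sets, which is manageable; with 50 variables the unrestricted count is 2**49 per node.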

Page 11:

Cycles are not always a problem

● Dynamic Bayesian Networks are a variant of BN models that describe temporal dependencies

● We can safely assume that the causal links only go forward in time

● That breaks the problem of cycles as we now have two versions of each variable: “before” and “after”

Diagram: variables X1, X2 and X3, each drawn in two copies, one at time t and one at time t+1.
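The "before" and "after" idea can be sketched as follows: consecutive observations of a time series are paired into rows with two copies of every variable, and every dependency is drawn from a time-t copy to a time-t+1 copy. The naming scheme (`X1@t`, `X1@t+1`) and both helper functions are assumptions made for this illustration.

```python
def unroll(series):
    """Turn time-ordered observations into rows with 'before' and 'after' copies."""
    rows = []
    for before, after in zip(series, series[1:]):
        row = {f"{name}@t": value for name, value in before.items()}
        row.update({f"{name}@t+1": value for name, value in after.items()})
        rows.append(row)
    return rows

def unroll_edges(edges):
    """Redirect each dependency so that it points from time t to time t+1."""
    return [(f"{src}@t", f"{dst}@t+1") for src, dst in edges]

# A feedback loop that would be illegal in a static Bayesian network.
cyclic_edges = [("X1", "X2"), ("X2", "X3"), ("X3", "X1")]
dbn_edges = unroll_edges(cyclic_edges)
sources = {src for src, _ in dbn_edges}
targets = {dst for _, dst in dbn_edges}

series = [{"X1": 0, "X2": 1}, {"X1": 1, "X2": 1}, {"X1": 1, "X2": 0}]
rows = unroll(series)
print(dbn_edges)
print(rows)
```

Since every edge leaves a time-t node and enters a time-t+1 node, no node is both a source and a target, so the unrolled graph cannot contain a cycle.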

Page 12:

Different types of variables

● Another common situation is when we have different types of variables

● We may know that only certain types of connections are causal

● Or we may be interested only in certain types of connections

● This breaks the cycles as well

Diagram: three types of variables: Mutations, Protein expression, Diseases.

Page 13:

BNFinder – Python library for Bayesian Networks

● A library for identification of optimal Bayesian Networks

● Works under the assumption that acyclicity is guaranteed by external constraints (disjoint sets of variables or dynamic networks)

● Fast and efficient (relatively)

Page 14:

Example 1 – the simplest possible

Page 15:

Now, parallelize!

● Since we have external constraints on acyclicity, we can search for parent sets independently

● This leads to a simple parallelization scheme and good efficiency
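The scheme can be sketched in a few lines. This is not BNFinder's implementation or API; it is a toy exhaustive search with a BIC-style stand-in score, where `children` and `candidates` are assumed to be disjoint sets of variables (the external acyclicity constraint). The `map_fn` parameter is a hypothetical hook: any map-like function works, so the per-variable tasks can be handed to a pool.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from itertools import combinations

def local_score(data, child, parents):
    """BIC-style penalized log-likelihood (an illustrative stand-in score)."""
    config = lambda row: tuple(row[p] for p in parents)
    joint_counts = Counter((config(row), row[child]) for row in data)
    parent_counts = Counter(config(row) for row in data)
    loglik = sum(n * math.log(n / parent_counts[cfg])
                 for (cfg, _), n in joint_counts.items())
    n_params = len(parent_counts) * (len({row[child] for row in data}) - 1)
    return loglik - 0.5 * math.log(len(data)) * n_params

def best_parent_set(data, candidates, max_parents, child):
    """Exhaustively score all allowed parent sets of a single child."""
    options = (ps for k in range(max_parents + 1)
               for ps in combinations(candidates, k))
    return child, max(options, key=lambda ps: local_score(data, child, ps))

def learn_structure(data, children, candidates, max_parents=2, map_fn=map):
    """Each child is an independent task, so any map-like function can run them."""
    task = partial(best_parent_set, data, candidates, max_parents)
    return dict(map_fn(task, children))

# Toy data: y copies a, z copies b; a and b are the only allowed parents.
data = [{"a": a, "b": b, "y": a, "z": b}
        for a in (0, 1) for b in (0, 1) for _ in range(2)]

serial = learn_structure(data, ["y", "z"], ["a", "b"])
with ThreadPoolExecutor(max_workers=2) as pool:
    pooled = learn_structure(data, ["y", "z"], ["a", "b"], map_fn=pool.map)
print(serial, pooled)
```

A thread pool is used here only to keep the example simple to run; for CPU-bound scoring on multiple cores, the `map` method of a `concurrent.futures.ProcessPoolExecutor` could be passed in the same way.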

Page 16:

Bonn et al., Nat. Genet., 2012

Page 17:

Active Inactive

Page 18:

Making the training set for “activity” variable

Page 19:

Handling continuous data

Page 20:

Network model

Page 21:
Page 22:

Does it provide useful predictions?

• 12 positive and 4 negative predictions tested

• >90% success (1 error)

Page 23:

Some more continuous data with perturbations

Page 24:

• 8008 enhancers compiled from 15 ChIP experiments (almost 20k binding peaks)

• Activity data for ~140 enhancers divided into

– 3 tissues (MESO, VM, SM)

– 5 stages (4-6, 7-8, 9-10, 11-12, 13-16)

• Gene expression data for 5082 genes from the BDGP database

Wilczynski et al., PLoS Comp. Biol., 2012

Page 25:
Page 26:

Predictions validated: 19/20 correct stage, 10/20 correct tissue

Page 27:

Summary

● Bayesian Networks can provide predictive models based on conditional probability distributions

● BNFinder is an effective tool for finding optimal networks given tabular data. And it's open source!

● It can be used as a command-line tool or as a library

● It can use continuous data as well as discrete

● Can be run in parallel on multiple cores (with good efficiency)

● Convenience functions (cross-validation, ROC plots) included

http://launchpad.net/bnfinder

Page 28:

Thanks!

● Norbert Dojer

● Alina Frolova

● Paweł Bednarz

● Agnieszka Podsiadło

● Questions?

