From Genes to Populations: The Intelligent Data Analysis of Biological Data

Post on 07-Jan-2016

TRANSCRIPT

Page 1: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

From Genes to Populations: The Intelligent Data Analysis of Biological Data

Allan Tucker
School of Information Systems, Computing and Mathematics, Brunel University, London, UB8 3PH, UK

Pêches et Océans Canada / Fisheries and Oceans Canada

Moorfields Eye Hospital

Page 2: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

The Data Explosion

“We are drowning in information, but starving for knowledge” John Naisbitt

• Advance of IT and the Internet
• Massive increase in ability to:
  • Record: Electronic records and forms
  • Store: Data Warehouses
  • Analyse: Data Mining and Visualisation
• Risk of Information Overload

Page 3: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Intelligent Data Analysis

• IDA attempts to deal with the data explosion to discover patterns and knowledge from data
• Typical analysis tasks:
  • Clustering
  • Classification
  • Feature Selection
  • Prediction and Forecasting

Page 4: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Overlap with Statistics

“Statistics is the art to collect, to display, to analyze, and to interpret data in order to gain new knowledge.” Sachs 1999

“... statistics, that is, the mathematical treatment of reality ...” Hannah Arendt

“There are lies, damned lies, and statistics.” Benjamin Disraeli

Page 5: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Clustering (unsupervised learning)

Page 6: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Classification (supervised learning)

[Figure: scatterplot of samples on two gene axes, NM_008695 and NM_013720, with points labelled Diseased or Control]

Page 7: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Feature Selection

[Figure: scatterplots from different features of the same dataset]

Page 8: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Bayesian Networks

• An IDA method to model a domain using probabilities
• Easily interpreted by non-statisticians
• Can be used to combine existing knowledge with data
• Essentially use independence assumptions to model the joint distribution of a domain

Page 9: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Bayesian Networks

• Simple 2 variable Joint Distribution

P(Gene, Disease)

            Gene    ¬Gene
Disease     0.89    0.01
¬Disease    0.03    0.07

• Can use it to ask many useful questions
• But requires k^N probabilities (N variables with k states each)
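As an illustration of the questions a joint table can answer, the sketch below computes a marginal and two conditional probabilities from the four values on this slide. The dictionary layout and function names are assumptions made for the example, not part of the original talk.

```python
# Joint distribution P(Gene, Disease) from the slide, keyed by (gene, disease).
joint = {
    (True, True): 0.89,    # Gene, Disease
    (False, True): 0.01,   # no Gene, Disease
    (True, False): 0.03,   # Gene, no Disease
    (False, False): 0.07,  # no Gene, no Disease
}


def marginal_disease(disease):
    """P(Disease = disease), obtained by summing the joint over Gene."""
    return sum(p for (g, d), p in joint.items() if d == disease)


def disease_given_gene(disease, gene):
    """P(Disease = disease | Gene = gene) from the definition of conditioning."""
    p_gene = sum(p for (g, d), p in joint.items() if g == gene)
    return joint[(gene, disease)] / p_gene


p_disease = marginal_disease(True)                       # 0.89 + 0.01
p_disease_if_gene = disease_given_gene(True, True)       # 0.89 / (0.89 + 0.03)
p_disease_if_no_gene = disease_given_gene(True, False)   # 0.01 / (0.01 + 0.07)
```

With two binary variables the table has 2^2 = 4 entries; the table size grows exponentially with the number of variables, which is the cost the slide points to.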

Page 10: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Bayesian Network for Toy Domain

Structure: Gene A → Gene C ← Gene B; Gene C → Gene D; Gene C → Gene E

P(A) = .001    P(B) = .002

A  B  P(C)
T  T  .95
T  F  .94
F  T  .29
F  F  .001

C  P(D)        C  P(E)
T  .90         T  .70
F  .05         F  .01
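A short sketch of how the toy network factorises the joint distribution as P(A) P(B) P(C|A,B) P(D|C) P(E|C). The probabilities are those on the slide; which of the two child tables belongs to Gene D and which to Gene E is read from the flattened slide text and should be treated as an assumption, as should all Python names.

```python
from itertools import product

p_a = 0.001
p_b = 0.002
p_c_given_ab = {
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}
# Assumed pairing of the two child tables (ambiguous in the slide text).
p_d_given_c = {True: 0.90, False: 0.05}
p_e_given_c = {True: 0.70, False: 0.01}


def bernoulli(p_true, value):
    """Probability of a binary value given the probability that it is True."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|C) P(E|C)."""
    return (
        bernoulli(p_a, a)
        * bernoulli(p_b, b)
        * bernoulli(p_c_given_ab[(a, b)], c)
        * bernoulli(p_d_given_c[c], d)
        * bernoulli(p_e_given_c[c], e)
    )


states = [True, False]
total = sum(joint(*values) for values in product(states, repeat=5))
# Marginal P(C = True), summing out every other variable.
p_c = sum(joint(a, b, True, d, e) for a, b, d, e in product(states, repeat=4))

network_parameters = 1 + 1 + 4 + 2 + 2   # one number per row of each table
full_table_entries = 2 ** 5              # explicit joint over five binary genes
```

The network needs 10 numbers where the explicit joint table has 32 entries, which is the saving that the independence assumptions buy.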

Page 11: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Bayesian Networks

• Use algorithms to learn structure and parameters from data
• Or build by hand (priors)

• Also continuous nodes (density functions)

Page 12: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Bayesian Networks for Classification & Feature Selection

Node that represents the class label attached to the data

Page 13: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Dynamic Bayesian Networks for Forecasting

• Nodes represent variables at distinct time slices
• Links between nodes over time
• Can be used to forecast into the future
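The simplest dynamic Bayesian network has a single variable with a link from each time slice to the next. The sketch below forecasts by pushing a belief through that link repeatedly. The state names and transition probabilities are hypothetical and chosen only for illustration.

```python
# Hypothetical transition table: P(state at t+1 | state at t).
transition = {
    "healthy": {"healthy": 0.9, "diseased": 0.1},
    "diseased": {"healthy": 0.0, "diseased": 1.0},
}


def step(belief):
    """Advance a distribution over states by one time slice."""
    following = {state: 0.0 for state in transition}
    for current, p_current in belief.items():
        for nxt, p_move in transition[current].items():
            following[nxt] += p_current * p_move
    return following


def forecast(belief, steps):
    """Forecast the state distribution a number of slices into the future."""
    for _ in range(steps):
        belief = step(belief)
    return belief


today = {"healthy": 1.0, "diseased": 0.0}
in_three_slices = forecast(today, 3)
```

Real DBNs have many variables per slice and links between different variables over time, but forecasting follows the same pattern of repeated propagation.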

Page 14: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Biological Data

• Microbiology (bioinformatics):
  • Genes, parallel sequencing
• Biological / Clinical (systems biology, medical informatics):
  • Cell Models, Clinical Tests
• Population (Ecoinformatics?):
  • Data from species: biomass etc.

Page 15: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Some of our projects in:

1. Genes: UCL & Leiden University
• Identifying Genes relevant to conditions (MD)
• Identifying Genes common across organisms

2. Biological & Clinical: Brunel & Moorfields
• Modelling vesicles within cells for controlling osteoblasts
• Develop model to forecast early glaucoma based on differing clinical tests

3. Population: Kew & DFO, Canada
• Identifying ideal germination conditions for seeds
• Identifying key species in different oceans

Page 16: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

1 Microarray Data

Page 17: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Microarray Data

• Major source of data for gene expression activity

• Technology takes measurements over 1000s of genes simultaneously

• Gene Regulatory Networks (GRNs) model how genes interact

• Eliciting reliable GRNs from data is key to understanding biological mechanisms

Page 18: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Aims

• Reliability issues that surround microarray gene expression data

• Can we build GRN models that have enhanced performance, based on a richer and/or broader collection of data than a single microarray dataset?

Page 19: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Aims

• Three main threads of research:

• Text-based knowledge from the body of scientific literature integrated into the reverse-engineering process as prior knowledge for Bayesian network models to improve resulting GRN models

• Take advantage of multiple publicly available microarray gene expression datasets that have been generated in similar biological studies

• Expand this idea to explore biological mechanisms that are consistent between different biological models with increasing complexity (and between different species)

Page 20: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

a) Literature-based priors for gene regulatory networks

• Literature Prior calculated from profiles which are generated using software that converts the number of times two concepts are discussed within publications

• Convert it to a Prior Probability = correlation falling within a 2 tailed confidence interval

• Incorporated into scoring metric when learning networks

(2008) Jelier R, et al. Literature-based concept profiles for gene annotation: The issue of weighting. Int. J. Med. Inform.; 77:354-362.

(2009) Steele, E., Tucker, A., 't Hoen, P.A.C. and Schuemie, M.J., Literature-Based Priors for Gene Regulatory Networks, Bioinformatics 25 (14) : 1768-1774

Page 21: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Experiments

• Learn Bayesian networks from data
• Given known biological structures, test using ROC analysis:
  • True Positives: links that are correctly identified
  • False Positives: links that are incorrectly identified
  • False Negatives: links that are missed
  • True Negatives: links that are correctly left out
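The four counts above can be computed by comparing the learnt links with the known structure. The sketch below treats every ordered pair of distinct nodes as a candidate link; that convention, the example edges and all names are assumptions for illustration.

```python
from itertools import permutations


def confusion_counts(nodes, true_edges, learned_edges):
    """Count TP, FP, FN and TN over all possible directed links."""
    candidates = set(permutations(nodes, 2))
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    tp = len(learned_edges & true_edges)                 # correctly identified
    fp = len(learned_edges - true_edges)                 # incorrectly identified
    fn = len(true_edges - learned_edges)                 # missed
    tn = len(candidates - true_edges - learned_edges)    # correctly left out
    return tp, fp, fn, tn


nodes = ["A", "B", "C", "D", "E"]
true_edges = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E")]
learned_edges = [("A", "C"), ("C", "D"), ("D", "E")]

tp, fp, fn, tn = confusion_counts(nodes, true_edges, learned_edges)
sensitivity = tp / (tp + fn)            # true positive rate
false_positive_rate = fp / (fp + tn)
```

Repeating this at different thresholds on link confidence gives the points of a ROC curve.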

Page 22: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Yeast and E. coli

• Issues with circularity when validating

Page 23: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

b) Consensus Bayesian Networks

• Different platforms involve different biases:

e.g. Oligonucleotide arrays estimate the absolute value of expression, whereas cDNA arrays measure relative differences between genes.

• Previous research established that comparing datasets using standard normalisation is not straightforward

• An attempt to combine multiple microarray data sources through post-learning aggregation

Steele, E. Tucker A. “Consensus and Meta-analysis regulatory networks for combining multiple microarray gene expression datasets”, Journal of Biomedical Informatics 41(6), pp 914-926 , 2008
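One simple form of post-learning aggregation is to learn a network from each dataset separately and keep the links that enough of those networks agree on. The sketch below shows that voting idea only; the threshold rule, the example edges and the function name are assumptions and may differ from the method in the cited paper.

```python
from collections import Counter


def consensus_network(networks, min_support):
    """Keep every link that appears in at least min_support input networks."""
    counts = Counter(edge for edges in networks for edge in set(edges))
    return {edge for edge, n in counts.items() if n >= min_support}


# Hypothetical networks learnt from three separate microarray datasets.
network_1 = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
network_2 = [("g1", "g2"), ("g2", "g3"), ("g1", "g4")]
network_3 = [("g1", "g2"), ("g3", "g4"), ("g4", "g5")]

consensus = consensus_network([network_1, network_2, network_3], min_support=2)
```

Raising `min_support` trades recall for confidence: fewer links survive, but each is backed by more datasets.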

Page 24: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Consensus Bayes Networks

Page 25: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

E. coli

Page 26: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Yeast

Page 27: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

How to select best input networks?

• Prediction – Train a network on one dataset
• Test it on the other sets (Independent Data)

• As opposed to Cross Validation (testing on the same dataset)

Page 28: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

c) Models of Increasing Complexity

Specification of three muscle differentiation datasets

(2010) Anvar, S.Y., t' Hoen, P.A.C. and Tucker, A., The Identification of Informative Genes from Multiple Datasets with Increasing Complexity, BMC Bioinformatics 11 : 32

Page 29: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

MIC

• Select one dataset for training

• Others become test sets

• Score mean and variance of SSE using CV and independent test sets

• Use these to rank genes
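A minimal sketch of the ranking step, assuming each gene already has a list of SSE values collected from cross-validation folds and independent test sets. The slide does not say how mean and variance are combined, so ordering by mean first and variance second is an assumption, as are the gene names and numbers.

```python
from statistics import fmean, pvariance


def rank_genes(sse_by_gene):
    """Order genes by (mean SSE, variance of SSE), lowest first."""
    summary = {
        gene: (fmean(errors), pvariance(errors))
        for gene, errors in sse_by_gene.items()
    }
    return sorted(summary, key=lambda gene: summary[gene])


# Hypothetical SSE values per gene across CV folds and independent test sets.
sse_by_gene = {
    "gene_1": [0.20, 0.30, 0.25],
    "gene_2": [0.10, 0.90, 0.50],
    "gene_3": [0.20, 0.20, 0.20],
}

ranking = rank_genes(sse_by_gene)
```

Genes near the top of the ranking are those predicted both accurately and consistently across datasets.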

Page 30: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

MIC - Datasets

• All concerned with the differentiation of cells into the muscle (Myogenic) lineage

• In-vitro system mimics the formation of new muscle fibres in-vivo

• Cao uses embryonic fibroblasts, others use tumor cell line that has the potential for differentiation into different lineages (mainly muscle and bone)

• Cao uses MyoD and MyoG to force cell differentiation (others use serum starvation)

• Sartorelli includes different treatments that affect timing and efficiency

Page 31: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

MIC

Select genes using one dataset (black) at a time and compare average CV error rate of BN classifier learnt on same dataset and validated on the other two datasets independently (grey).

Cao does well on CV but overfits
Tomczak does well on both

Page 32: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

MIC

• Select 100 informative (KS test), and 50 uninformative genes.
• Train BN classifier on Tomczak and test on Sartorelli.
• Rank genes according to average error rate.
• Score average improvement or deterioration of Myogenesis-Related, Top 100 and 50 random selected genes in Sartorelli
• Compare our method with rankings generated by concordance model.
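The slide names a KS test for picking informative genes. Assuming this means the two-sample Kolmogorov-Smirnov test applied to a gene's expression values in the two classes, the statistic can be sketched as below. The expression values are hypothetical, and a full test would also convert the statistic to a p-value.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))

    def ecdf(sample, x):
        # Fraction of the sample that is less than or equal to x.
        return sum(value <= x for value in sample) / len(sample)

    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


# Hypothetical expression values for one gene in two classes.
informative = ks_statistic([0.1, 0.2, 0.3, 0.4], [0.6, 0.7, 0.8, 0.9])
uninformative = ks_statistic([0.1, 0.4, 0.6, 0.9], [0.2, 0.3, 0.7, 0.8])
```

A gene whose two class distributions barely overlap scores close to 1, while a gene that looks the same in both classes scores close to 0.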

Page 33: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

MIC Conclusions

• Predictive and consistent genes across independent datasets are more likely to be fundamentally involved in the biological process under study
• Results imply that gene regulatory networks identified in simpler systems can be used to model more complex biological systems

Page 34: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Inter-species Mechanisms

Page 35: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Inter-species Mechanisms

Page 36: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

2 Medical Data

Page 37: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Eye Disease: VF and HRT Data

• Progressive loss of the field of vision is characteristic of many eye diseases
• Glaucoma is a leading cause of irreversible blindness in the world.
• VF Data: sensitivity of field of vision
• HRT Data: anatomical info of retina

Page 38: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

a) Classification of Early Glaucoma

1. Expert Knowledge
2. Clinical Decision based on VF Tests
3. Clinical Decision based on HRT Image Tests

Can we combine these to improve the detection of the early onset of glaucoma?

(2010) Ceccon, S., Garway-Heath, D., Crabb, D. and Tucker, A., Investigations of Clinical Metrics and Anatomical Expertise with Bayesian Network Models for Classification in Early Glaucoma, Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications (SUEMA 2010), held at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2010)

Page 39: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

BN Classification of Early Glaucoma

1) Learnt from Control Data only
2) Built from Anatomical Knowledge
3) Learnt based on MRA HRT Test
4) Learnt based on AGIS VF Test

Page 40: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

BN Classification of Early Glaucoma

- Different networks capture different features (AGIS vs MRA)
- Anatomy network is better in finding converters
- Control-based network is better in finding controls

[Figure: P plotted against TIME for CONVERTERS, with series labelled SAGIS, SMRA, SANATOMY and SCONTROLS]

Page 41: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Modelling Clinical Data

• Biomedical studies often involve data sampled from a cross-section of a population

• Collecting medical information on patients suffering from a particular disease and controls

• These studies show a “snapshot” of the disease process but disease is inherently temporal:

• Previously healthy people can develop a disease over time going through different stages of severity

• If we want to model the development of such processes, we usually require longitudinal data (expensive)

Page 42: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

b) Pseudo Time-Series for CS Data

Tucker, A. and Garway-Heath, D., The Pseudo Temporal Bootstrap for Predicting Glaucoma from Cross-Sectional Visual Field Data, IEEE Transactions on IT in Biomedicine 14 (1) : 79-85 , 2010

Page 43: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Pseudo Time-Series Models

• Ordering labelled CS data based upon Minimum Spanning Trees & PQ-Trees (Rifkin et al. 2000)

• Treat ordered data as “Pseudo Time-Series” to build temporal models (Tucker et al., 2009)

• Here we use hidden variables to discover disease states (and transitions) within these pseudo time-series
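To make the ordering idea concrete, the sketch below builds a minimum spanning tree over cross-sectional samples and reads off the tree path from a chosen healthy sample to a chosen diseased sample as a pseudo time-series. This is a simplification under stated assumptions: it omits the PQ-tree step and any resampling, and the sample values, start and end points, and function names are all hypothetical.

```python
import math
from collections import deque


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns adjacency lists."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_parent = [-1] * n
    best_dist[0] = 0.0
    adjacency = {i: [] for i in range(n)}
    for _ in range(n):
        # Add the sample closest to the tree built so far.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_parent[u] >= 0:
            adjacency[u].append(best_parent[u])
            adjacency[best_parent[u]].append(u)
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_parent[v] = d, u
    return adjacency


def tree_path(adjacency, start, end):
    """Unique path between two nodes of a tree, found by breadth-first search."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


# Hypothetical two-feature samples; index 0 is healthy, index 4 is severe.
samples = [(0.0, 0.1), (1.0, 0.0), (2.1, 0.2), (3.0, 0.1), (4.2, 0.0)]
tree = minimum_spanning_tree(samples)
pseudo_series = tree_path(tree, start=0, end=4)
```

The resulting index sequence can then be treated as a time-ordered series for fitting a temporal model.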

Page 44: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Discovered State Transitions

• Our algorithm unlabels the known healthy / disease states (used to build the Pseudo TS)
• Uses EM to relearn an increasing no. of hidden states
• The discovered states and their trajectories show:
  • Stable healthy state (4)
  • Stable disease state (1)
  • Glaucoma in HRT only (3)
  • Glaucoma in VF only (2)

[Figure: discovered states and trajectories, running from Healthy to Severe Disease]

Page 45: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Applicable to any clinical CS study?

[Figure: plot for the breast cancer data]

Breast Cancer: Found key variable with ‘tipping point’

Page 46: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Applicable to any clinical CS study?

[Figure: plot for the Parkinson’s disease data]

Parkinson’s Disease: Found cluster of controls with mild symptoms

Page 47: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Conclusions

• We explore how to build time-series models from cross-sectional data

• Here we use a simple incremental approach to discover hidden states and the transitions between them

• Demonstrate on glaucoma test data from two different sources

• Transitory and stable states are found that relate to known anatomical and clinical expectations

Page 48: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

3 Population Data

Page 49: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

3 Models of Population

• Genetics and disease impact on individual level

• But also on the population level
  • Spread of disease
  • Biological variation amongst a population

Page 50: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

a) The Millennium SeedBank

• RBG, Kew banking seeds for 35 years
• MSB established for 10 years
• 152 partner institutions in 54 countries worldwide
• Collected and stored >47,000 collections representing >24,000 species

Page 51: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

The Problem

• Large, growing backlog of data
• Optimum germination conditions & simplest to apply – for users
• Can we integrate GIS with SB DB?
• How best to exploit the data – focus on UK
• What methods can solve these problems?
  • Feature Selection
  • Classification
  • Explanation

Page 52: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Results: Classifiers – Performance

Page 53: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Results: Classifiers – Decision Tree

Page 54: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Decision Tree Interpretation

• Some subtrees hard to clarify, others generate quite reasonable hypotheses:

• Rainfall and altitude, which seem to fit the rough split of highland and lowland regions

• Cluster of FAILs for Umbill. before middle of August. Interesting to see why these conditions set up wrong in experiments

• Large cluster of FAILs for Cyperaceae at higher annual rainfall in the tree. Need to explore what it is in our applied treatments that is not resulting in successful germination.

Page 55: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Results: Classifiers – Bayes Net

Page 56: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Results: Classifiers – Bayes Net

Page 57: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Bayes Net Interpretation

• Markov Blanket includes all variables: all offer some improvement in prediction of germination success

• BN offers the advantage of making ‘what if’ queries by entering observations into the model:

• a very recognisable pattern now emerging from analysis at Kew that agrees with the network: Where a pre-treatment is necessary at all, and it is applied, there is nevertheless a relatively high probability of failure

Page 58: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Conclusions

• Millennium SeedBank project collated data on germination test conditions for 1000s of species

• Now need to focus on explaining underlying relationships between conditions and germination success

• Carried out the initial stage here

• Now need to specialise algorithms

Page 59: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

b) Fish Population Modelling

Page 60: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Data

• Northern Gulf (region a)
• Biomass data collected at different locations
• 100s of different species
• From 1960s until present day
• Massively complex foodwebs:
  • Fish predating others, cannibalism, competing for resources, unmeasured variables

Page 61: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Results 7: Feature Selection with Bootstrap to identify “cod collapse”

[Figure: species rankings from two feature selection approaches, a wrapper method using BNs and a filter method using Log Likelihood; Redfish is labelled]
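A rough sketch of filter-style feature selection with the bootstrap: resample the years with replacement, score each species against cod, and record how often each species ranks top. Absolute correlation stands in for the log likelihood score named on the slide, and the biomass series, species names and parameters are hypothetical.

```python
import math
import random
from collections import Counter


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return abs(sxy) / math.sqrt(sxx * syy)


def bootstrap_selection(features, target, n_boot=200, top_k=1, seed=0):
    """Fraction of bootstrap resamples in which each feature ranks in the top_k."""
    rng = random.Random(seed)
    n = len(target)
    wins = Counter()
    for _ in range(n_boot):
        rows = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        y = [target[i] for i in rows]
        scores = {
            name: abs_correlation([series[i] for i in rows], y)
            for name, series in features.items()
        }
        for name in sorted(scores, key=scores.get, reverse=True)[:top_k]:
            wins[name] += 1
    return {name: wins[name] / n_boot for name in features}


# Hypothetical yearly biomass values.
cod = [10, 9, 9, 8, 7, 5, 4, 3, 3, 2, 2, 1]
features = {
    "species_a": [1, 2, 3, 3, 4, 6, 6, 8, 8, 9, 10, 10],   # tracks the cod decline
    "species_b": [5, 3, 6, 4, 5, 3, 6, 4, 5, 3, 6, 4],     # unrelated fluctuation
}
frequencies = bootstrap_selection(features, cod)
```

Species that are selected in most resamples are less likely to be artefacts of a particular set of years.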

Page 62: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Results : Feature Selection

Change in Correlation of interactions between cod and high ranking species before and after 1990:

[Figure: pre 1990 and post 1990 correlation with cod for white hake, thorny skate, sea raven, haddock, silver hake, witch flounder, redfish* and shrimp*]
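The comparison on this slide can be sketched by splitting two series at 1990 and computing the correlation on each side. The biomass values below are hypothetical, and placing 1990 itself in the "after" period is an assumption.

```python
import math


def correlation(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def split_correlation(years, cod, other, break_year=1990):
    """Correlation with cod before break_year and from break_year onwards."""
    before = [(c, o) for y, c, o in zip(years, cod, other) if y < break_year]
    after = [(c, o) for y, c, o in zip(years, cod, other) if y >= break_year]
    pre = correlation([c for c, _ in before], [o for _, o in before])
    post = correlation([c for c, _ in after], [o for _, o in after])
    return pre, post


# Hypothetical biomass series for cod and one other species.
years = [1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993]
cod = [8, 9, 10, 9, 4, 3, 2, 1]
other = [4.0, 4.5, 5.0, 4.5, 6.0, 7.0, 8.0, 9.0]

pre_1990, post_1990 = split_correlation(years, cod, other)
```

A sign change between the two periods, as in this constructed example, is the kind of shift in interaction the slide highlights.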

Page 63: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Fitting Dynamic Models

Learning DBNs with latent state variable

LSS = 5.0106


Fluctuation: Early Indicator of Collapse?

Page 64: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Examining DBN Net

Exploring dynamic links:

[Figure: DBN with nodes Cod, Hakes, Haddock, White Hake, Redfish, Witch Flounder, Shrimp and Thorny Skate]

Page 65: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Linear Dynamic System

• Instead of hidden state, continuous var:

• Could be interpreted as measure of fishing? Predator population (e.g. seals)? Water temperature?

[Figure: continuous hidden variable over time, annotated at 1984, 1987 (white fur ban), 1991 and 1997 (white fur hunt)]

Page 66: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Conclusions

• Potential of IDA models for predicting fish biomass data
• Dynamic models for capturing the complexity of foodwebs
• Latent variable analysis to explore unmeasured variables (climate change, fishing, legal changes)

Page 67: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Summary

• Intelligent Data Analysis
  • What it is
  • What it can be used for
• Brief Overview of existing research
  • Biological Level (Microarray)
  • Medical / Clinical Level (Disease Progression)
  • Population Level (Marine biomass / Seed)
• What next?
  • Linking the levels?
  • Impact of Microbiological models in clinic?
  • Impact of disease models on populations?

Page 68: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Caveats to IDA

Data Quality ✓
Spurious Correlations ✓
Over-fitting ✓
“Black Box” Modelling ✓
Over-reliance – slave to the data ?
“Can’t see the wood for the trees” ?

Page 69: From Genes to Populations: The Intelligent Data Analysis of  Biological Data

Thanks for listening!

Pêches et Océans Canada / Fisheries and Oceans Canada

Symposium for IDA, Porto, Portugal: Deadline May

IDA Medicine and Pharmacology, Bled, Slovenia: Deadline April