Chapter 8: Data Analysis, Modelling and Knowledge Discovery in Bioinformatics

Nik Kasabov - Evolving Connectionist Systems, 12/16/2002

Prof. Nik Kasabov, [email protected], http://www.kedri.info


Overview

• Bioinformatics - an area of information growth and emergence of knowledge

• Dynamic DNA and RNA sequence data analysis and knowledge discovery

• Gene expression data analysis, rule extraction, and disease profiling

• Fuzzy evolving clustering of genes according to their time-course expression

• Protein secondary structure prediction

• Dynamic cell modelling


Biology Basics

• DNA (deoxyribonucleic acid) is a chemical chain, present in the nucleus of each cell of an organism

• The whole process of DNA transcription, gene translation, and protein production is continuous and evolves over time

• RNA (ribonucleic acid) has a structure similar to that of DNA; among its bases, uracil (U) takes the place of thymine (T)

• Genes are complex chemical structures that cause dynamic transformation of one substance into another during the whole life of an individual, as well as over the life of the human population across many generations

• Modelling these interactions, learning about them and extracting knowledge is a major goal of bioinformatics


Bioinformatics

• The first draft of the human genome is complete; the challenge now is to process the vast amount of dynamic information and to create intelligent systems for prediction and knowledge discovery at different levels of life, from cells to whole organisms and species.

• Bioinformatics is concerned with the application of the methods of the information sciences to the analysis, modelling and knowledge discovery of biological processes in living organisms


Bioinformatics

The central dogma of molecular biology states that DNA is transcribed into RNA, which is translated into proteins.

Fig. 8.1: A schematic representation of the central dogma of molecular biology: from DNA to RNA (transcription) and from RNA to proteins (translation).


Life-long Learning & Evolution

• Through evolution and selection processes (e.g. natural selection), genes are slowly modified over many generations of populations of individuals.

• Evolutionary processes imply the development of generations of populations of individuals, where crossover, mutation and selection of individuals based on fitness criteria are applied in addition to the learning processes of each individual

• A biological system evolves its structure and functionality through both the life-long learning of an individual and the evolution of populations of many such individuals


Computational Modelling in Molecular Biology

• There are five main phases of information processing and problem solving in most bioinformatics systems:

1. Data collection, e.g. collecting biological samples and processing them.

2. Feature analysis and feature extraction

3. Modelling the problem

4. Knowledge discovery in silico

5. Verifying the discovered knowledge in vitro and in vivo


Computational Modelling in Molecular Biology

• Some of the modelling techniques (decision trees, KBNN) allow knowledge, e.g. rules, to be extracted from the models and used for explanation or for knowledge discovery.

• For large data sets and for continuously incoming data streams that require the model and the system to adapt rapidly to new data, it is more appropriate to use on-line, knowledge-based techniques, and ECOS in particular, as demonstrated in this chapter.

• Many problems in bioinformatics require solutions in the form of a dynamic, learning, knowledge-based system

• An ultimate task for bioinformatics would be predicting the development of an organism from its DNA code


Dynamic DNA & RNA Sequence Analysis

• Analysis of a DNA sequence and identifying promoter regions

• Identify splice junction (E/I, or I/E, or None):


On-line learning of ribosome binding site data (Fig. 8.3)

[Figure: two panels over a common horizontal axis running from 0 to 1600. Top panel: desired and actual output (vertical axis -0.5 to 1). Bottom panel: number of rule nodes (vertical axis 0 to 100).]


Identify intron/exon splice junction

EXTRACTION OF RULES:

Rule1: if ----------------------------AGGT-AG------------------------- then [EI]

Rule8: if ------------------T------T-CAG------------------------------ then [IE]
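
The extracted rules read as position-wise patterns in which "-" means "any nucleotide". A minimal sketch of how such a pattern could be applied to a sequence window follows. The patterns are shortened, hypothetical stand-ins in the style of Rule1 and Rule8, and the function names are illustrative, not part of the original system.

```python
def matches(pattern: str, window: str) -> bool:
    """Return True if every non-'-' position of the pattern equals the window."""
    if len(pattern) != len(window):
        return False
    return all(p == "-" or p == w for p, w in zip(pattern, window))


def classify(window: str, rules: list[tuple[str, str]]) -> str:
    """Return the label of the first matching rule, or 'None' if no rule fires."""
    for pattern, label in rules:
        if matches(pattern, window):
            return label
    return "None"


# Shortened, hypothetical patterns (the rules on the slide span a longer window).
RULES = [
    ("---AGGT-AG---", "EI"),
    ("-T--T-CAG----", "IE"),
]

for window in ("CCCAGGTAAGTTT", "ATCCTCCAGGGGG", "AAAAAAAAAAAAA"):
    print(window, "->", classify(window, RULES))
```

A crisp match like this is the simplest reading of the rules; a fuzzy or connectionist model would grade the match rather than accept or reject it.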


Gene Expression Data: Biological Perspective

• Microarray equipment is widely used at present to evaluate the level of gene expression in a tissue or in a living cell.

• Each point (pixel, cell) in a microarray represents the level of expression of a single gene

• Because of the heterogeneity of a disease, microarray analysis might not identify unique markers (e.g. a single gene) of clinical utility, but identifying clusters of gene expression (profiles) is likely to give a more sensitive prediction of the biological state of the disease

• Gene expression clustering has been used to distinguish normal colon samples from tumours within a 6,500-gene set.

• Another example of profiling developed in this chapter is the distinction between two subtypes of leukaemia, namely AML and ALL.


Gene Expression Data Analysis

• A gene profile is a pattern of expression of a number of genes that is typical for all, or for some, of the known samples of a particular disease.

• A disease profile would look like:

» IF (gene g1 is highly expressed) AND (gene g37 is low expressed) AND (gene g134 is very highly expressed) THEN most probably this is cancer type C (123 out of the 130 available samples have this profile)

• A new sample can be matched against existing gene profiles and, based on similarity, it can be predicted with a certain probability whether the patient is in an early phase of a disease or is at risk of developing the disease in the future.
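
To make the profile idea concrete, here is a small sketch that checks whether a sample fits the example profile and computes how many samples support it. The cut-off values that turn a normalised expression value into "low", "high" or "very high" are assumptions made for illustration; the slide does not specify them.

```python
def level(value: float) -> str:
    """Map a normalised expression value in [0, 1] to a label (assumed cut-offs)."""
    if value < 0.33:
        return "low"
    if value < 0.66:
        return "medium"
    if value < 0.9:
        return "high"
    return "very high"


# The example profile from the slide, written as gene -> required level.
PROFILE = {"g1": "high", "g37": "low", "g134": "very high"}


def fits(sample: dict[str, float], profile: dict[str, str]) -> bool:
    """True if every gene in the profile has the required level in the sample."""
    return all(level(sample[gene]) == wanted for gene, wanted in profile.items())


def support(samples: list[dict[str, float]], profile: dict[str, str]) -> float:
    """Fraction of samples that fit the profile."""
    return sum(fits(s, profile) for s in samples) / len(samples)


# Synthetic data reproducing the 123-out-of-130 figure quoted above.
typical = {"g1": 0.8, "g37": 0.1, "g134": 0.95}
atypical = {"g1": 0.2, "g37": 0.7, "g134": 0.5}
samples = [typical] * 123 + [atypical] * 7
print(f"support = {support(samples, PROFILE):.3f}")
```

The support value is what the slide expresses as "123 out of the 130 available samples"; a graded (fuzzy) match would replace the crisp `level` function.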


Gene expression data analysis, modelling and knowledge discovery

• Goal: identify a gene or a group of genes associated with the state of the cell (tissue), e.g. cancer.

• A large number of genes (approx. 30,000) are expressed in a microarray (in vitro) from a single tissue.

• It is difficult to find consistent patterns of gene expression for a class of tissue

• After all, microarray data is just a snapshot, a few microseconds long, of what is happening in the cell

• Genes interact; how do we find out about that?

• A growing number of examples and growing complexity


Fuzzy representation of gene expression data
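
One common way to obtain a fuzzy representation is to map each normalised expression value to membership degrees in a few overlapping fuzzy sets. The sketch below assumes three triangular sets (Low, Medium, High) centred at 0, 0.5 and 1; the shape, number and placement of the sets are illustrative assumptions, not taken from the slide.

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzify(x: float) -> dict[str, float]:
    """Membership degrees of a normalised expression value in Low/Medium/High."""
    x = min(max(x, 0.0), 1.0)  # clip to the assumed [0, 1] range
    return {
        "Low": triangular(x, -0.5, 0.0, 0.5),
        "Medium": triangular(x, 0.0, 0.5, 1.0),
        "High": triangular(x, 0.5, 1.0, 1.5),
    }


for value in (0.0, 0.25, 0.75, 1.0):
    print(value, fuzzify(value))
```

With this placement the degrees always sum to 1, so a value such as 0.75 is read as partly Medium and partly High rather than being forced into one class.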


Gene Profiling Methodology

• Phases:

1. Microarray data pre-processing.

2. Selecting a set of significant differentially expressed genes across the classes.

3. Finding subsets of (a) under-expressed genes and (b) over-expressed genes among those selected in the previous step.

4. Clustering the gene sets from (3) to reveal preliminary profiles of jointly over-expressed/under-expressed genes across the classes.

5. Building a classification model and extracting rules that define the profiles for each class.
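
Phases 2 and 3 can be sketched for a two-class problem as follows. The slide does not say which statistic is used to rank genes; the signal-to-noise ratio below is an assumed choice, and the gene names and values are synthetic.

```python
from statistics import mean, stdev


def signal_to_noise(a: list[float], b: list[float]) -> float:
    """(mean_a - mean_b) / (std_a + std_b); a small constant avoids division by zero."""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b) + 1e-12)


def select_genes(expr: dict[str, list[float]], labels: list[int], k: int):
    """Rank genes by |score|, keep the top k, and split them by sign.

    expr maps a gene name to its expression values, one per sample;
    labels holds the class (1 or 0) of each sample.
    """
    scores = {}
    for gene, values in expr.items():
        class1 = [v for v, y in zip(values, labels) if y == 1]
        class0 = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = signal_to_noise(class1, class0)
    top = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:k]
    over = [g for g in top if scores[g] > 0]    # over-expressed in class 1
    under = [g for g in top if scores[g] < 0]   # under-expressed in class 1
    return over, under


expr = {
    "gA": [0.90, 0.80, 0.85, 0.10, 0.20, 0.15],
    "gB": [0.10, 0.20, 0.15, 0.90, 0.80, 0.85],
    "gC": [0.50, 0.40, 0.60, 0.50, 0.60, 0.40],
}
labels = [1, 1, 1, 0, 0, 0]
print(select_genes(expr, labels, k=2))
```

The two returned lists are the inputs to the clustering step (phase 4).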


Case Study: Gene Profiling of Colon Cancer using EFuNN

• Rule 1: IF M24902 (High 0.988) and H13238 (Low 0.991) and H16758 (High 0.995) and X90908 (Low 0.992) and T55255 (Low 0.998) THEN COLON CANCER (High 1.0) (receptive field 0.5; examples explained by the rule 23/40)

• Rule 2: IF T71662 (Low 0.984) and X76383 (High 0.985) and X54938 (Low 0.989) and H88522 (Low 0.987) and H92523 (High 0.989) THEN NORMAL TISSUE (High 1.0) (receptive field 0.19; examples explained by this rule 13/22; thresholds used: 0.98 for the condition membership degrees and 0.95 for the conclusion membership degrees)

• These are two of the 12 extracted rules that reveal some conditions for colon cancer against normal tissue. Each rule represents a sub-class (cluster) of one of the two classes.
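
The thresholds quoted with Rule 2 suggest how a compact rule is obtained: only condition elements whose membership degree reaches 0.98 are kept, and the rule is reported only if its conclusion degree reaches 0.95. The sketch below illustrates that pruning step under this reading. The data layout and function name are assumptions, and `GENE_X` is a made-up weak condition added to show pruning.

```python
def extract_rule(conditions, conclusion, cond_thr=0.98, concl_thr=0.95):
    """Keep conditions with degree >= cond_thr; return None for a weak conclusion.

    conditions: list of (gene, label, degree); conclusion: (class_name, degree).
    """
    kept = [(g, lab, d) for g, lab, d in conditions if d >= cond_thr]
    class_name, degree = conclusion
    if not kept or degree < concl_thr:
        return None
    body = " and ".join(f"{g} ({lab} {d})" for g, lab, d in kept)
    return f"IF {body} THEN {class_name} (High {degree})"


# Degrees for Rule 2 as quoted on the slide, plus one hypothetical weak condition.
conditions = [
    ("T71662", "Low", 0.984),
    ("X76383", "High", 0.985),
    ("X54938", "Low", 0.989),
    ("H88522", "Low", 0.987),
    ("H92523", "High", 0.989),
    ("GENE_X", "High", 0.62),  # hypothetical; falls below the 0.98 threshold
]
rule = extract_rule(conditions, ("NORMAL TISSUE", 1.0))
print(rule)
print(f"coverage = {13 / 22:.2f}")  # 13 of 22 examples explained by the rule
```

Lowering `cond_thr` keeps more genes in each rule and gives longer, less selective rules.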


Disease Profiling Through Rule Extraction from EFuNN

Rule extraction from EFuNNs:

» Input space restricted to genes with high significance (e.g. 98 genes for the colon cancer data set (Alon et al))

» Rule extraction after learning in an EFuNN

» Rules represent disease profiles

» Proper visualization for a better understanding


Dynamic modeling and knowledge discovery from gene expression data of 14 cancer types

• A continuous flow of data

• An adaptive “mother model” is created and updated over time: new data; new genes; new classes

• At any time, an “optimal simple model” is extracted and analyzed

• Rules are extracted and genes are analyzed

• Example: Ramaswamy’s data (PNAS, January 2002) of 14 types of cancer

• Future work: dynamic modeling of gene interaction networks and cell development prognosis


Using Evolving Self-organising Maps (ESOM) for clustering of time-course gene expression data

On-line clustering of time-course gene expression data by ESOMs (Da Deng and N. Kasabov, 2002, Neurocomputing)
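
The general idea of on-line, evolving clustering can be sketched in a few lines: each incoming expression profile either updates its nearest prototype or, if no prototype is close enough, becomes a new one. This is a deliberately simplified illustration and not the published ESOM algorithm (it omits, for example, any links between neighbouring nodes); the distance threshold `eps` and learning rate `lr` are assumed parameters.

```python
import math


def online_cluster(stream, eps=0.5, lr=0.2):
    """One pass over the data: update the nearest prototype or create a new one."""
    prototypes: list[list[float]] = []
    for x in stream:
        if prototypes:
            nearest = min(range(len(prototypes)),
                          key=lambda j: math.dist(prototypes[j], x))
            if math.dist(prototypes[nearest], x) <= eps:
                # Move the winning prototype a small step towards the input.
                prototypes[nearest] = [p + lr * (xi - p)
                                       for p, xi in zip(prototypes[nearest], x)]
                continue
        # No prototype within eps: the input itself becomes a new prototype.
        prototypes.append(list(x))
    return prototypes


# Four synthetic time-course profiles (three time points each).
profiles = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (1.0, 1.0, 1.0), (0.9, 1.0, 1.0)]
print(online_cluster(profiles))
```

Because the data are seen once and the number of prototypes grows only when needed, the same loop can keep running as new profiles arrive.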


Amino Acid codons

Table 8.6: The codons of each of the 20 amino acids. The first column represents the first base in the triplet, the first row the second base, and the last column the last base.


Protein Structure Prediction

• The mRNA is translated by ribosomes into proteins

• A protein is a sequence of amino acids, each of them defined by a group of 3 nucleotides (a codon)

• 20 amino acids altogether (A, C-H, I, K-N, P-T, V, W, Y)

• Initiation and stop codons

• Proteins have complex structures:

» Primary (linear)
» Secondary (3D, defining functionality)
» Tertiary (high-level energy minimisation packing)
» Quaternary (interaction between molecules)

• The Protein Data Bank (www.rcsb.org): 100,000 hits a day on average
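
The codon-to-amino-acid mapping described above can be written out directly. The sketch below builds the standard genetic code table, transcribes a DNA coding strand into mRNA, and translates the mRNA codon by codon until a stop codon is met. The function names and the example sequence are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...; '*' marks a stop.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_strand: str) -> str:
    """DNA coding strand -> mRNA (thymine T is written as uracil U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA into one-letter amino acids, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCTTAA")
print(mrna, "->", translate(mrna))
print(sorted(set(AMINO) - {"*"}))  # the 20 amino-acid letters
```

The set of letters printed at the end matches the list A, C-H, I, K-N, P-T, V, W, Y given above.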


Protein Structure Prediction

• Predicting the secondary structure from the primary structure

• Segments of a protein can have different shapes:

» Helix
» Sheet
» Coil (loop)

• An ANN is trained on existing data to predict the shape of an arbitrary new segment, using a window of 13 amino acids

• 273 inputs, 3 outputs; 18,000 examples for training

• Research done mainly by Mike Watts in collaboration with Natural Selection Inc., based in La Jolla, California.
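
The figure of 273 inputs is consistent with a window of 13 positions encoded with 21 binary inputs each (13 x 21 = 273). The sketch below assumes that encoding, with one input per amino acid plus a 21st "padding" symbol for positions that fall outside the sequence; the slide does not state how the 21 values per position were actually defined, and the example sequence is arbitrary.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 one-letter codes
PAD = "-"                             # assumed 21st symbol for off-sequence positions
SYMBOLS = AMINO_ACIDS + PAD
WINDOW = 13


def encode_window(sequence: str, centre: int) -> list[int]:
    """One-hot encode the 13-residue window centred on position `centre`."""
    half = WINDOW // 2
    vector: list[int] = []
    for pos in range(centre - half, centre + half + 1):
        symbol = sequence[pos] if 0 <= pos < len(sequence) else PAD
        one_hot = [0] * len(SYMBOLS)
        one_hot[SYMBOLS.index(symbol)] = 1
        vector.extend(one_hot)
    return vector


inputs = encode_window("MKTAYIAKQR", centre=0)
print(len(inputs), sum(inputs))  # vector length and number of active inputs
```

Each such vector would be paired with a 3-valued target (helix, sheet or coil) for the residue at the centre of the window.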


Towards comprehensive EI for bioinformatics applications

• Hybrid models

• Using all available information (gene expression, biological, clinical, etc.) → comprehensive simulation systems

[Figure: an evolving model of a cell. Labels: cell parameters; system parameters; DNA data of a living cell; RNA data; protein data; existing databases (DNA, genes, proteins, metabolic networks); new knowledge extracted; output information.]


Dynamic Cell Modelling

• “The cell is never conquered until its total behaviour is understood, and the total behaviour of the cell is never understood until it is modelled and simulated.” (Tomita, 2001)

• Computer modelling of processes in living cells is an extremely difficult task.

» The processes in a cell are dynamic and depend on many variables, some of them related to a changing environment.
» The processes of DNA transcription and protein translation are not fully understood.

• Several cell models have been created and experimented with

• A starting point for dynamic modelling of a cell would be dynamic modelling of a single gene regulation process

• The next step in dynamic cell modelling would be to model the regulation of more genes, hopefully a large set of genes


Genetic networks and reverse engineering

• Genetic networks (GN) describe the regulatory interactions between genes

• Reverse engineering: from gene expression data to GN.

• It is assumed that gene expression data reflects the underlying genetic regulatory network

• Genes co-expressed over time: either one regulates the other, or both are regulated by the same other genes

• What is the time unit?

• Appropriate data needed

• Validation procedure

• Correct interpretation of the models may generate new biological knowledge
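
A simple first look at co-expression over time is the correlation between two expression time courses, optionally with a time lag so that x(t) is compared with y(t + lag). The sketch below uses the Pearson correlation as an assumed measure and synthetic data; as the bullets above note, a high value does not by itself say which gene regulates which.

```python
from statistics import mean


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def lagged_correlation(x: list[float], y: list[float], lag: int = 1) -> float:
    """Correlation between x(t) and y(t + lag), with lag counted in time steps."""
    if lag == 0:
        return pearson(x, y)
    return pearson(x[:-lag], y[lag:])


# Synthetic example: gene_y follows gene_x with a delay of one time step.
gene_x = [0.1, 0.5, 0.9, 0.5, 0.1, 0.5]
gene_y = [0.3, 0.1, 0.5, 0.9, 0.5, 0.1]
print("lag 0:", round(lagged_correlation(gene_x, gene_y, 0), 3))
print("lag 1:", round(lagged_correlation(gene_x, gene_y, 1), 3))
```

The choice of lag is where the question "what is the time unit?" becomes concrete: the result depends on how the sampling interval relates to the regulatory delay.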


Evolving fuzzy neural networks for GRN modeling

G(t) → EFuNN → G(t+dt)

• On-line, incremental learning of a GN

• Adding new inputs/outputs (new genes)

• The rule nodes capture clusters of input genes that are related to the output genes

• Rules can be extracted that explain the relationship between G(t) and G(t+dt), e.g.:

» IF g13(t) is High (0.87) and g23(t) is Low (0.9) THEN g87(t+dt) is High (0.6) and g103(t+dt) is Low

• Varying the threshold gives stronger or weaker patterns of relationship
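
Learning the mapping from G(t) to G(t+dt) presupposes training examples built from consecutive measurements. A minimal sketch of that data preparation follows; the function name, the `step` parameter (dt expressed in sampling steps) and the values are illustrative assumptions.

```python
def make_pairs(series: list[list[float]], step: int = 1):
    """Turn a time course G(t0), G(t1), ... into (G(t), G(t + dt)) training pairs."""
    return [(series[t], series[t + step]) for t in range(len(series) - step)]


# Synthetic time course for two genes measured at four time points.
time_course = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.4, 0.6],
    [0.7, 0.3],
]
for g_now, g_next in make_pairs(time_course):
    print(g_now, "->", g_next)
```

An on-line learner would receive these pairs one at a time, in time order, instead of as a finished list.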


DENFIS: Dynamic, evolving neuro-fuzzy inference systems for GN modeling (IEEE Trans. FS, April 2002)

• G(t) -> gj(t+dt)

• Dynamic partitioning of the input space

• Takagi-Sugeno fuzzy rules, e.g.:

if G1 is (0.63 0.70 0.76) and
   G2 is (0.71 0.77 0.84) and
   G3 is (0.71 0.77 0.84) and
   G4 is (0.59 0.66 0.72)
then Gy = 1.84 - 1.26 X1 - 1.22 X2 + 0.58 X3 - 0.03 X4
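
The example rule can be evaluated numerically. The sketch below assumes that each triple is a triangular membership function (left foot, peak, right foot), that X1..X4 denote the input values of G1..G4, and that the conditions are combined with a product; none of these details is stated on the slide. In a complete Takagi-Sugeno system the outputs of all activated rules are combined as a weighted average using their firing degrees; only the single rule shown is evaluated here.

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Antecedent triples and consequent coefficients copied from the rule above.
ANTECEDENT = [(0.63, 0.70, 0.76), (0.71, 0.77, 0.84),
              (0.71, 0.77, 0.84), (0.59, 0.66, 0.72)]
BIAS = 1.84
COEFFS = [-1.26, -1.22, 0.58, -0.03]


def firing_degree(x: list[float]) -> float:
    """Degree to which the input satisfies all four conditions (product, assumed)."""
    degree = 1.0
    for xi, (a, b, c) in zip(x, ANTECEDENT):
        degree *= triangular(xi, a, b, c)
    return degree


def consequent(x: list[float]) -> float:
    """First-order (linear) consequent of the rule."""
    return BIAS + sum(w * xi for w, xi in zip(COEFFS, x))


peaks = [0.70, 0.77, 0.77, 0.66]
print(firing_degree(peaks), round(consequent(peaks), 4))
```

At the peaks of the four membership functions the rule fires fully and its linear part gives the predicted expression level for that region of the input space.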


Summary

• Modelling biological processes aims at creating models that trace these processes over time.

• The models should reveal the steps of development, the metamorphoses that occur at different points in time, and the “trajectories” of the developed patterns.

• Biological processes evolve dynamically and require appropriate techniques, such as evolving connectionist systems.


Further Readings

• Computational Molecular Biology (Pevzner, 2001).
• Applications of neural network methods, mainly multilayer perceptrons and self-organising maps, in the general area of genome informatics (Wu and McLarty, 2000).
• Microarray gene technologies (Schena, 2000).
• Data mining in biotechnology (Persidis, 2000).
• Application of the theory of complex systems for dynamic gene modelling (Bar-Yam, 1997).
• Computational modelling of genetic and biochemical networks (Bower and Bolouri, 2001).
• Dynamic modelling of the regulation of a large set of genes (Somogyi et al, 2001; D’haeseleer et al, 2000).
• Methodology for gene expression profiling (Futschik et al, 2002; Futschik, 2002).
• Using fuzzy neural networks and evolving fuzzy neural networks in bioinformatics (Kasabov, Futschik and Middlemiss, 2000).
• Fuzzy clustering for gene expression analysis (Futschik and Kasabov, 2002).
• Artificial neural filters for pattern recognition in protein sequences (Schneider and Wrede, 1993).
• Dynamic models of the cell (Schaff and Loew, 1999; Tomita et al, 1999; Kohn and Dimitrov, 2000).