Alexandros Kanterakis 17-5-2005 Heraklion Crete



Presentation Outline

• DNA and Microarray Experiments
• From Genomic to Post-Genomic Informatics
• Combined Clinico-Genomic Knowledge Discovery
• Towards Reliable Gene-Markers: Supervised Gene Selection
• Discovery of Co-Regulated Genes: A Clustering Approach
• The MineGene System and Implementation Issues
• Future Work


DNA Microarrays

• Devices that can estimate, in parallel, the expression of many thousands of genes.
• Their invention in 1995 brought a revolution in molecular biology and medicine, as well as in the pharmaceutical and biotechnology industries.
• They are mainly used to estimate the differential expression of genes acquired from tissues in various states and conditions, allowing practical comparisons between a sample's genotype profile and an arbitrary phenotype attribute or clinical observation.


DNA Microarray Experiments

• Microarray experiments consist of numerous steps, each of which includes a variety of procedures, protocols and data.
• Most of these steps and procedures are governed by specific guidelines, annotations and ontologies.
• It is crucial for a laboratory to record, maintain and publish these data in modern information systems.
• The final outcome of this procedure is the gene expression matrix: a 2D matrix containing the expression of each gene in each sample. Genes and samples are accompanied by covariate information.
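As a rough illustration of the gene expression matrix described on this slide, the sketch below stores a genes-by-samples matrix together with covariate information. All gene names, sample names, values and covariates are invented for the example, and the row/column orientation is an assumption.

```python
# Hypothetical toy data: names, values and covariates are invented.
genes = ["gene_A", "gene_B", "gene_C"]
samples = ["s1", "s2", "s3", "s4"]

# Gene expression matrix: one row per gene, one column per sample.
expression = [
    [2.1, 1.9, 0.4, 0.3],
    [0.5, 0.7, 0.6, 0.5],
    [0.2, 0.1, 1.8, 2.2],
]

# Covariate information accompanying samples and genes.
sample_covariates = {
    "s1": {"prognosis": "good"},
    "s2": {"prognosis": "good"},
    "s3": {"prognosis": "bad"},
    "s4": {"prognosis": "bad"},
}
gene_covariates = {g: {"annotation": "unknown"} for g in genes}


def gene_profile(gene):
    """Expression of one gene across all samples (a matrix row)."""
    return dict(zip(samples, expression[genes.index(gene)]))


def sample_profile(sample):
    """Expression of all genes in one sample (a matrix column)."""
    j = samples.index(sample)
    return {g: row[j] for g, row in zip(genes, expression)}
```

Keeping the covariates in separate lookups keyed by gene or sample name leaves the numeric matrix free for the mining steps that follow.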


From Genomic to Post-Genomic Informatics

Forms of Genomic Informatics:

• Sequence Databases: There are three major co-operating databases (EMBL, GenBank, DNA Data Bank of Japan) containing millions of sequences with billions of nucleotides from several organisms, and they are growing exponentially.
• Secondary Sequence Databases: Suitable for microarray experiments. They contain better annotation and meta-information. Examples: UniGene, TIGR, RefSeq.
• Genomic Databases: Examine sequences for microarrays from a genomic perspective. They contain gene names and annotations (rather than gene sequences) organized per organism. Examples: Ensembl, CMR (Microbial Genomes).
• Gene Expression Databases


Gene Expression Databases

Gene expression databases provide data management for the data generated by gene expression experiments. Their main purposes are to:

• Handle gene expression data:
– Store, retrieve and update data.
– Analyze data.
• Publish:
– Verify, compare, expand and improve findings.
– Develop novel data analysis methods.
• Provide a Laboratory Information Management System (LIMS):
– Record every step of the experimental process as it happens (experiments, dates, protocols used, experimental parameters).
– Provide data reproducibility.
– Standardize microarray experiments.
• Flow data seamlessly between the different components. Ideally it should be possible to replace any component without affecting the other parts of the flow.

In many respects gene expression databases are inherently more complex than sequence databases.


The Microarray Gene Expression Data Society (MGED)

MGED is a group of researchers whose aim is to establish standards for microarray data annotation and to enable the creation of public databases for microarray data.

MGED’s work is arranged into four working groups:

• MIAME. Minimum Information About a Microarray Experiment. Formulates the information that must be recorded about a microarray experiment in order to describe and share the experiment.
• Ontologies. Determines ontologies for describing microarray experiments and the samples used with microarrays (available in RDF, OWL and DAML).
– Other ontologies used in gene expression databases (GEDs) are taxonomic and gene ontologies.
• MAGE. Formulates the object model (MAGE-OM), exchange language (MAGE-ML) and software modules (MAGE-stk) for implementing microarray software.
• Transformations. Determines recommendations for describing the methods used for transformations, normalizations and standardizations of microarray data.


Expression Database Comparison

Objective: Analyze existing microarray gene expression databases for their ability to serve as an integrated environment for a laboratory, as part of the PrognoChip project. The selected candidates are widely known, open source systems: BASE and ArrayExpress (in cooperation with FORTH-ISL).

Supporting Standards
– BASE (selected): Supports MAGE-ML extraction. Did not support experiment MAGE-ML submission.
– ArrayExpress: Problems with MAGE-ML submission and extraction.

Consensus / Supporting community
– BASE (selected): Mailing list, active community. Online documentation.
– ArrayExpress: Mailing list, active community. Better online documentation.

Installation / Software maintenance
– BASE (selected): Light-weight and robust underlying RDBMS (MySQL). Reasonable hardware requirements.
– ArrayExpress: Tricky and problematic installation and tuning (Oracle). Extreme hardware requirements.

Provided tools / Extensions
– BASE (selected): Basic analysis tools. Integrated plug-in schema (through the PHP language).
– ArrayExpress: Perl language (obsolete?). Analysis through Expression Profiler.

Interface supplied / Usability / Security
– BASE (selected): Includes LIMS with graphic interface. Basic security schema.
– ArrayExpress: No graphic submission tool. More sophisticated security schema.


Applications of Genomics: The “New Genomics”

• In the USA, projections suggest that 40% of those alive today will be diagnosed with some form of cancer at some point in their lives.
• By 2010, that number will have climbed to 50%.
• Today it is known that 9 of the 10 leading causes of mortality have genetic components.
• This aspect of genetics has to consider diseases caused partly by mutations in specific genes (e.g., breast cancer, colon cancer, diabetes, Alzheimer disease) or prevented by mutations in genes (e.g., HIV, atherosclerosis, some forms of cancer).
• These conditions are common enough to directly affect virtually everyone, giving genetics a large role in healthcare and in society.


Genomic Medicine and Healthcare

Genomic medicine will change healthcare by:

• providing knowledge of individual genetic predispositions via microarray and other technologies:
– individualized screening (e.g., mammography schedule).
– individualized behavior changes (e.g., informed dietary choices).
– presymptomatic medical therapies.
• creating pharmacogenomics:
– individualized medication based on genetically determined variation in effects and side effects.
– new medications for specific genotypic disease subtypes.
• allowing genetic engineering.
• providing a better understanding of non-genetic (environmental) factors in health and disease.
• emphasizing health maintenance rather than disease treatment.
• creating a fundamental understanding of the etiology of many diseases, even “non-genetic” diseases.


Integrating Clinical and Genomic Information

• Most genetic contributions to common disease identified so far come from low-frequency, high-penetrance alleles (e.g., BRCA1, BRCA2, HNPCC).

• On a population level, most genetic contributions to common disease come from high-frequency, low-penetrance alleles (e.g., APC, Alzheimer disease, HIV/AIDS resistance).

• What causes these low-penetrance alleles to be expressed appears to be a complex process that has to include environmental factors.

• Thus, clinical observations are closely correlated with specific alleles during the expression of these diseases.


Integrated Clinico-Genomic Knowledge Discovery: A Scenario

Individualized medicine is to be realized through procedures, protocols and guidelines in the context of integrated and synergic clinico-genomic decision-making scenarios.

Such a scenario is presented for the case of cancer; the same scenario may be conceptualized and appropriately extended to other diseases.

The 5-step scenario illustrates the key processes, namely: collection of samples, phenotyping, genotyping, the transition from phenotypes to genotypes, and the transition from genotypes to phenotypes.

• Step 1. Collection of samples
Tissue samples are extracted from specific cancer patients. Each tissue sample is appropriately treated and stored in order to preserve RNA expression.


Integrated Clinico-Genomic Knowledge Discovery: A Scenario

[Figure: Step 1 of 5]


Integrated Clinico-Genomic Knowledge Discovery: A Scenario

• Step 2. Phenotyping
– Characterization of samples: Collected samples are assigned to various clinico-histopathological types and stages.
– Classification of samples: Samples are assigned to different phenotypical profiles (e.g., phenotypes F1 and F2), which may include: age, habits and environmental factors, family history, tumour type, medical-imaging parameters, …

During this procedure we build various phenotypes, such as:

            Phenotype F1               Phenotype F2
Domain 1    Good prognosis             Bad prognosis
Domain 2    Responds to chemotherapy   Does not respond to chemotherapy
Domain 3    Metastasis occurred        No metastasis occurred


Integrated Clinico-Genomic Knowledge Discovery: A Scenario

[Figure: Step 2 of 5]


Integrated Clinico-Genomic Knowledge Discovery: A Scenario

• Step 3. Genotyping
– Using microarray technology, the molecular profiles of the samples are extracted.
– Using fundamental molecular biology knowledge we may assess relevant molecular pathways (e.g., genetic networks). Such knowledge will help in the identification of validated and more refined genotypes.


Integrated Clinico-Genomic Knowledge Discovery: A Scenario

• Step 4. From Phenotypes to Genotypes
– Data-mining operations (gene selection) are applied to the acquired gene-expression matrix to identify potential discriminatory genes, for example genes that distinguish between the two identified phenotypes.
– These genes compose the molecular signature (or gene markers) of the respective phenotypes.


Integrated Clinico-Genomic Knowledge Discovery: A Scenario

[Figure: Steps 3 and 4 of 5]


Integrated Clinico-Genomic Knowledge Discovery: A Scenario

• Step 5. From Genotypes to Phenotypes
– The decision-making process described above may be initiated the other way around, towards the establishment of more fundamental knowledge.
– Applying data-mining operations again (e.g., clustering), we are able to identify clusters of samples based on their gene-expression profiles.
– These clusters represent potentially interesting genotypes, e.g., genotypes G1 and G2.
– In the course of a diagnostic, prognostic or therapeutic decision-making process, each as yet untreated patient may be assigned to the corresponding genotypical class (i.e., to the discovered cluster genotype to which the patient belongs).
– Then, with the aid of a supervised predictive learning operation (e.g., decision trees), the disease can be re-classified on the phenotypical level, a fundamental task in clinical research for combating major diseases.


Integrated Clinico-Genomic Knowledge Discovery: A Scenario

[Figure: Step 5 of 5]


Gene Expression Data Mining

• Gene expression database mining is used to identify intrinsic patterns and relationships in gene expression data.

• Traditionally, molecular biology has concentrated on the study of a single gene or very few genes in a research project.

• With genomes being sequenced, this is now changing into a so-called systems approach, where new research questions can be studied, such as:
– How many genes are expressed in different cell types?
– Which genes are expressed in all cell types?
– What are the functional roles of these genes?
– How is a group of genes regulated?
– Which genes are involved in a specific phenotype?

• We make a distinction between two types of analysis tasks: gene selection and gene clustering.


Towards Reliable Gene-Markers: Supervised Gene Selection

Although biological experiments vary considerably in their design, the data generated by microarray experiments can be viewed as a matrix of expression levels, organized by samples versus genes.

Microarray gene expression experiments are organized in four basic types:

• A comparison of two biological samples.
• A comparison of two biological conditions, each represented by a set of replicate samples.
• A comparison of multiple biological conditions.
• Analysis of covariate information.


A Novel Gene Selection Approach: Methodology and Algorithms

We present a novel gene-selection methodology composed of four main modules and based on the discretisation of gene-expression data:


Discretization of Gene-Expression Data

• In most cases, we are confronted with the problem of selecting genes that discriminate between two classes (e.g., diseases, disease states, treatment outcome, recurrence of disease; in other words, phenotypes). It is therefore convenient to follow a two-interval discretisation of gene-expression patterns.

• A general statement of the two-interval discretisation problem follows, together with the steps that solve it.

Given: A sorted vector of numbers $V = \langle n_1, n_2, \ldots, n_k \rangle$, where each number in $V$ is assigned to one of two classes.

Find: A number $\mu$, $n_1 \le \mu \le n_k$, that splits the numbers in $V$ into two intervals, $[n_1, \mu)$ and $[\mu, n_k]$, and best discriminates between the two classes. Best discrimination is decided according to a specified criterion.


Discretization of Gene-Expression Data

• Step 1

For every pair of consecutive numbers $n_i, n_{i+1}$ in $V$ their midpoint, $\mu_i = (n_i + n_{i+1})/2$, is computed, and the corresponding ordered vector of midpoint numbers is formed: $M = \langle \mu_1, \mu_2, \ldots, \mu_{k-1} \rangle$.

• Step 2

For each $\mu \in M$ the well-known information gain metric is computed:

$$IG(V, \mu) = Entropy(V) - \sum_{u \in \{l, h\}} \frac{|V_u|}{|V|} \, Entropy(V_u)$$

where the sets $V_l$ and $V_h$ include the numbers from $V$ which are less than $\mu$ and higher than (or equal to) $\mu$, respectively.
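Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the MineGene code: the function names and the toy values are assumptions, and entropy is computed in bits (base-2 logarithm), which the slide does not specify.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def midpoints(values):
    """Step 1: midpoints of all consecutive pairs of the sorted values."""
    ordered = sorted(values)
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]


def information_gain(values, labels, mu):
    """Step 2: information gain of splitting the values at mu.

    Values below mu form the 'low' set, values >= mu the 'high' set.
    """
    low = [c for v, c in zip(values, labels) if v < mu]
    high = [c for v, c in zip(values, labels) if v >= mu]
    n = len(labels)
    return (entropy(labels)
            - (len(low) / n) * entropy(low)
            - (len(high) / n) * entropy(high))


# Toy expression values of one gene and the class of each sample.
values = [1.0, 2.0, 4.0, 6.0]
labels = ["N", "N", "P", "P"]
gains = {mu: information_gain(values, labels, mu) for mu in midpoints(values)}
```

In this toy case the midpoint 3.0 separates the two classes perfectly, so its information gain equals the entropy of the whole vector.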


Discretization of Gene-Expression Data

• Step 3

The midpoint that exhibits the maximum information gain,

$$\mu_{max} = \arg\max_{\mu \in M} IG(V, \mu),$$

is considered as the gene’s expression value which, when used as a split point, exhibits the best discrimination between the classes.

This point is selected to assign the gene’s expression values to the nominal ‘l’ow or ‘h’igh values (i.e., less than $\mu_{max}$, and higher than or equal to $\mu_{max}$, respectively).
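A compact sketch of Step 3 follows: it scans the midpoints of one gene, keeps the one with the highest information gain and then maps each expression value to 'l' or 'h'. The names `best_split` and `discretise` are illustrative, entropy in bits is an assumption, and ties between midpoints are resolved in favour of the smallest one, a choice the slides do not address.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_split(values, labels):
    """Return (mu_max, gain): the midpoint with maximum information gain."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    n = len(pairs)
    base = _entropy([c for _, c in pairs])
    best_mu, best_gain = None, -1.0
    for a, b in zip(xs, xs[1:]):
        mu = (a + b) / 2
        low = [c for v, c in pairs if v < mu]
        high = [c for v, c in pairs if v >= mu]
        gain = base - len(low) / n * _entropy(low) - len(high) / n * _entropy(high)
        if gain > best_gain:  # strict '>' keeps the first (smallest) midpoint on ties
            best_mu, best_gain = mu, gain
    return best_mu, best_gain


def discretise(values, mu):
    """Map each expression value to 'l' (below mu) or 'h' (mu or above)."""
    return ["l" if v < mu else "h" for v in values]


values = [6.0, 1.0, 4.0, 2.0]   # toy expression values of one gene
labels = ["P", "N", "P", "N"]   # class of each sample
mu_max, gain = best_split(values, labels)
pattern = discretise(values, mu_max)
```

Applying `best_split` and `discretise` to every row of the expression matrix would give the per-gene discretised representation used by the later ranking step.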


Discretization of Gene-Expression Data, an overview


Discretization of Gene-Expression Data

The aforementioned discretisation process is applied independently on each gene in the training set. The final result is a discretised expression-value representation / transform of each gene:


Gene Ranking

For each discretised gene we count the number of ‘h’s and ‘l’s that occur in the respective samples. Assume that each sample is assigned to one of two classes, P and N. The following quantities are computed:

$H_{g,P}$ = number of ‘h’ values for gene g assigned to class P

$L_{g,P}$ = number of ‘l’ values for gene g assigned to class P

$H_{g,N}$ = number of ‘h’ values for gene g assigned to class N

$L_{g,N}$ = number of ‘l’ values for gene g assigned to class N


Gene Ranking

The formula below computes a rank for each gene that measures the power of the gene to distinguish between the two classes:

$$r_g = H_{g,P} \times L_{g,N} - H_{g,N} \times L_{g,P}$$

For a completely distinguishing gene, where all of its values for class P are ‘h’ and all of its values for class N are ‘l’, $L_{g,P} = H_{g,N} = 0$ and $r_g$ takes its maximum positive value. In this case the gene is considered to be descriptive of (associated with) class P.

The gene remains completely distinguishing in the inverse case, where $H_{g,P} = L_{g,N} = 0$ and $r_g$ takes its minimum negative value. In this case the gene is considered descriptive of class N.

The gene ranking formula encompasses and expresses:
(a) a polarity characteristic
(b) the descriptive power of the gene with respect to the present disease-state classes
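The counting and ranking steps can be illustrated as below. The sketch assumes the rank has the product form r_g = H_{g,P}·L_{g,N} − H_{g,N}·L_{g,P}; the operators of the formula did not survive in this transcript, so that form is an assumption that merely agrees with the two extreme cases described on the slide. Function names are illustrative.

```python
def gene_counts(pattern, classes):
    """Count 'h'/'l' values of one discretised gene within classes P and N."""
    counts = {("h", "P"): 0, ("l", "P"): 0, ("h", "N"): 0, ("l", "N"): 0}
    for value, cls in zip(pattern, classes):
        counts[(value, cls)] += 1
    return counts


def gene_rank(pattern, classes):
    """Rank of a gene under the ASSUMED product form of the formula."""
    c = gene_counts(pattern, classes)
    return c[("h", "P")] * c[("l", "N")] - c[("h", "N")] * c[("l", "P")]


classes = ["P", "P", "P", "N", "N"]             # toy training classes
descriptive_of_p = ["h", "h", "h", "l", "l"]    # all P high, all N low
descriptive_of_n = ["l", "l", "l", "h", "h"]    # the inverse case
mixed = ["h", "l", "h", "l", "h"]

ranks = [gene_rank(g, classes)
         for g in (descriptive_of_p, descriptive_of_n, mixed)]
```

Under this assumption the sign of the rank carries the polarity (P or N) and its magnitude the descriptive power, with extremes of plus or minus the product of the two class sizes.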


Gene Grouping

By gene grouping we group genes that have similar ranking. First we estimate the value $g$:

[Equation: definition of $g$ in terms of MaxRank, MinRank and $n$]

MaxRank and MinRank are the maximum and minimum ranking of the genes, respectively, as computed in the previous step.

Gene i is assigned to a group $O_i$ according to the formula:

[Equation: piecewise definition of $O_i$ in terms of $R_i$, $g$ and $k$]

$R_i$ is the ranking of gene i, and k is an integer variable.


Greedy gene-groups elimination

[Figure: ranked gene groups, from positive groups p1, p2, p3, p4, p5 to negative groups n5, n4, n3, n2, n1]

Step 1. Initialisation

During greedy gene-groups elimination, we initially consider all groups as identifiers and we assess the predictive power of the selected genes.

Step 2. Choose what to eliminate

We then choose to eliminate one of:

A. The last Positive Group …

B. The last Negative Group…

C. Both of them…

Step 3. Estimation of prediction ability.

We assess the predictive ability of the selected genes in cases A, B and C and keep the best predictive set (say C). Steps 2 and 3 are repeated until accuracy no longer increases.
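The elimination loop can be sketched as follows. This is an illustration under stated assumptions: groups are given as two lists ordered from strongest to weakest, so the "last" group is the final list element; `score` is a hypothetical callback standing in for whatever accuracy estimate is used; and at least one group is always kept.

```python
from typing import Callable, List, Sequence, Tuple

Score = Callable[[Sequence[str], Sequence[str]], float]


def greedy_group_elimination(pos_groups: Sequence[str],
                             neg_groups: Sequence[str],
                             score: Score) -> Tuple[List[str], List[str], float]:
    """Drop the last positive group, the last negative group, or both,
    for as long as the score of the remaining groups keeps improving."""
    pos, neg = list(pos_groups), list(neg_groups)
    best = score(pos, neg)                      # Step 1: start from all groups
    while True:
        candidates = []
        if pos:
            candidates.append((pos[:-1], neg))          # case A
        if neg:
            candidates.append((pos, neg[:-1]))          # case B
        if pos and neg:
            candidates.append((pos[:-1], neg[:-1]))     # case C
        # Assumption: never evaluate an empty selection.
        candidates = [c for c in candidates if c[0] or c[1]]
        if not candidates:
            break
        scored = [(score(p, n), p, n) for p, n in candidates]
        top, top_pos, top_neg = max(scored, key=lambda item: item[0])
        if top <= best:                         # Step 3: stop when no improvement
            break
        best, pos, neg = top, top_pos, top_neg
    return pos, neg, best


def toy_score(pos, neg):
    """Invented score that peaks at two positive and three negative groups."""
    return -(abs(len(pos) - 2) + abs(len(neg) - 3))


kept_pos, kept_neg, final = greedy_group_elimination(
    ["p1", "p2", "p3", "p4", "p5"], ["n1", "n2", "n3", "n4", "n5"], toy_score)
```

In practice `score` would be replaced by an accuracy estimate of the classifier built from the kept groups, for example on held-out training samples.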


Greedy gene-groups addition

[Figure: ranked gene groups, from positive groups p1, p2, p3, p4, p5 to negative groups n5, n4, n3, n2, n1]

Step 1. Initialisation

During greedy gene-groups addition, we initially consider no groups of identifiers at all.

Step 2. Choose what to add

We then choose to add one of:

A. The first Positive Group…

B. The first Negative Group…

C. Both of them

Step 3. Estimation of prediction ability.

We assess the predictive ability of the selected genes in cases A, B and C and keep the best predictive set (say C). Steps 2 and 3 are repeated until accuracy no longer increases.
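The addition variant mirrors the elimination loop but grows the selection from nothing. As before this is only a sketch: the two lists are assumed to be ordered from strongest to weakest group, so the "first" unused group is the next list element, and `score` is a hypothetical accuracy callback.

```python
from typing import Callable, List, Sequence, Tuple

Score = Callable[[Sequence[str], Sequence[str]], float]


def greedy_group_addition(pos_groups: Sequence[str],
                          neg_groups: Sequence[str],
                          score: Score) -> Tuple[List[str], List[str], float]:
    """Add the next positive group, the next negative group, or both,
    for as long as the score of the selected groups keeps improving."""
    n_pos, n_neg = 0, 0                 # Step 1: start with no groups
    best = float("-inf")
    while True:
        candidates = []
        if n_pos < len(pos_groups):
            candidates.append((n_pos + 1, n_neg))        # case A
        if n_neg < len(neg_groups):
            candidates.append((n_pos, n_neg + 1))        # case B
        if n_pos < len(pos_groups) and n_neg < len(neg_groups):
            candidates.append((n_pos + 1, n_neg + 1))    # case C
        if not candidates:
            break
        scored = [(score(pos_groups[:p], neg_groups[:n]), p, n)
                  for p, n in candidates]
        top, p, n = max(scored, key=lambda item: item[0])
        if top <= best:                 # Step 3: stop when no improvement
            break
        best, n_pos, n_neg = top, p, n
    return list(pos_groups[:n_pos]), list(neg_groups[:n_neg]), best


def toy_score(pos, neg):
    """Invented score that peaks at two positive and three negative groups."""
    return -(abs(len(pos) - 2) + abs(len(neg) - 3))


added_pos, added_neg, final = greedy_group_addition(
    ["p1", "p2", "p3", "p4", "p5"], ["n1", "n2", "n3", "n4", "n5"], toy_score)
```

Both greedy variants can stop in a local optimum; on this toy score they happen to reach the same selection.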


Samples Class Prediction

During class prediction we have a set of selected genes along with their identifiers, as computed in the previous steps:

• A new unclassified sample enters.
• Keep only the values of the selected genes.
• Discretise the new sample according to the midpoints.
• Assess the predictive power of each selected gene:
– For positive genes: (HighPos – LowPos) / #Pos
– For negative genes: (HighNeg – LowNeg) / #Neg
• Estimate the sum of the products of the predictive power of each gene and the discretised value of the sample. The estimation is done separately for positive and negative genes.
• In the illustrated example the unclassified sample is assigned to class Pos because C1 > C2 (C1 and C2 being the sums for the positive and negative genes, respectively), and the process continues with the next unclassified sample.


Sample Class Prediction

The previous process can be modeled in the following formula:

$$C_s = \arg\max_{P,N} \left\{ \sum_{g \in R_P} \mathrm{sign}(E_{s,g} - \mu_{max,g}) \, \frac{H_{g,P} - L_{g,P}}{|P|}, \; \sum_{g \in R_N} \mathrm{sign}(E_{s,g} - \mu_{max,g}) \, \frac{H_{g,N} - L_{g,N}}{|N|} \right\}$$

$C_s$ is the class that will be assigned to unclassified sample s.

$R_P, R_N$ are the sets of positive-ranked genes and negative-ranked genes, respectively.

$\mu_{max,g}$ is the midpoint of gene g.

$E_{s,g}$ is the expression value of unclassified sample s at gene g.

$|P|, |N|$ are the total numbers of positive and negative training samples, respectively.
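The prediction rule can be sketched as follows. The `GeneModel` record and the function names are illustrative; a sample value equal to the midpoint is treated as 'high' (+1), matching the discretisation rule, and ties between the two sums go to class N, a choice the slides do not specify.

```python
from typing import Dict, NamedTuple, Tuple


class GeneModel(NamedTuple):
    """Per-gene information kept from training (illustrative structure)."""
    midpoint: float   # split point of the gene
    high: int         # 'h' count among training samples of the gene's class
    low: int          # 'l' count among training samples of the gene's class


def class_score(sample: Dict[str, float],
                genes: Dict[str, GeneModel],
                class_size: int) -> float:
    """Sum over genes of (+1 or -1 side of the sample) x predictive power."""
    total = 0.0
    for name, model in genes.items():
        side = 1 if sample[name] >= model.midpoint else -1
        total += side * (model.high - model.low) / class_size
    return total


def predict(sample: Dict[str, float],
            pos_genes: Dict[str, GeneModel],
            neg_genes: Dict[str, GeneModel],
            n_pos: int, n_neg: int) -> Tuple[str, float, float]:
    """Return the predicted class and the two sums (C1, C2)."""
    c1 = class_score(sample, pos_genes, n_pos)
    c2 = class_score(sample, neg_genes, n_neg)
    return ("P" if c1 > c2 else "N"), c1, c2


# Toy model: one positive gene and one negative gene, 3 P and 2 N samples.
pos_genes = {"g1": GeneModel(midpoint=3.0, high=3, low=0)}
neg_genes = {"g2": GeneModel(midpoint=2.0, high=2, low=0)}

first = predict({"g1": 5.0, "g2": 1.0}, pos_genes, neg_genes, 3, 2)
second = predict({"g1": 1.0, "g2": 4.0}, pos_genes, neg_genes, 3, 2)
```

The difference between the two sums can serve as the prediction strength mentioned on the next slide.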


Sample Class Prediction

• As with the gene-ranking formula, this formula also encompasses a polarity characteristic. In addition, it provides the strength with which the sample is predicted to belong to one of the two classes, so that strong (or weak) predictions can be distinguished.

• This strength can be used to tackle domains with more than two classes (multi-class prediction): Let S be an unclassified sample that belongs to a domain with c classes. We also assume that we have selected g genes as our discriminant attributes. We apply the predictor described above to each class in turn; that is, we estimate the prediction strength of S belonging to each one of the c classes. Finally we assign the sample S to the class with the best prediction score.


Experimental Evaluation

We applied the introduced gene-selection and samples classification methodology on eight real-world gene-expression domain studies that are pioneers in their fields:


Experimental Evaluation

Summarization of the results of applying the introduced gene-selection and sample classification/prediction method:


Discovery of Co-Regulated Genes: A Clustering Approach

• By comparing gene-expression profiles and forming clusters, we can hypothesize that the respective genes are co-regulated and possibly functionally related.

• The discovery of gene function may help identify genes involved in particular molecular pathways, and thereby ease the modelling and exploration of metabolic pathways (i.e., metabolomics).

• Clustering of genes may reveal gene families, i.e., metagenes, and their potential linkage with combined clinical features, a task which is too difficult to achieve when we are confronted with the huge number of available genes (~25000-30000 for the human case).


A Graph Theoretic Clustering (GTC)

• We present a novel Graph Theoretic Clustering (GTC) approach to the clustering of microarray gene expression profile data. The approach is based on:
– The arrangement of the genes in a weighted graph
– The construction of the graph’s Minimum Spanning Tree
– An algorithm that recursively partitions the tree

• Main advantages of the method:
– Domain background knowledge can be utilized in order to compute distances between objects.
– No need to specify the number of clusters in advance.
– Hierarchical clustering.


Step 1: Fully Connected Graph

Compute the distances between all gene expression profiles and construct the fully connected graph:

• Distances may be simple or more domain specific (e.g., Euclidean, Pearson, Mahalanobis).

• Or they may come from a completely arbitrary, external source of information. This characteristic makes the whole data analysis process more ‘knowledgeable’ in the sense that established domain knowledge guides the clustering process.
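A minimal sketch of this step is given below: it fills a symmetric matrix with the pairwise distances between gene expression profiles, using either a Euclidean distance or a Pearson-based one. Taking 1 − r as the Pearson distance and treating constant profiles as uncorrelated are assumptions made for the example, not choices stated in the slides.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation (assumed convention for a Pearson distance)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: a constant profile is treated as uncorrelated
    return 1.0 - cov / (norm_a * norm_b)


def distance_matrix(profiles, dist=euclidean):
    """Edge weights of the fully connected graph as a symmetric matrix."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dist(profiles[i], profiles[j])
    return d


profiles = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]   # toy gene profiles
weights = distance_matrix(profiles)
```

Because `dist` is a parameter, a distance derived from external domain knowledge can be plugged in without changing the rest of the sketch.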


Step 2: Minimum Spanning Tree Construction

The minimum spanning tree (MST) of the fully connected weighted graph of the objects is constructed. The formed MST contains exactly n-1 edges.

• The MST preserves the shortest distances between the genes. This guarantees that objects lying in ‘close areas’ of the tree exhibit low distances.
• Finding the ‘right’ cuts of the tree could result in a reliable grouping of the genes.
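One standard way to build the MST of a fully connected graph is Prim's algorithm, sketched below on a distance matrix. The slides do not say which MST algorithm is used, so the choice of Prim's algorithm is an assumption made for illustration.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix.

    Returns the n-1 tree edges as (node_a, node_b, weight) tuples.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0               # grow the tree from node 0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree with the cheapest connecting edge.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        # Update the cheapest connections of the remaining nodes.
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


# Toy example: four genes placed on a line at positions 0, 1, 3 and 7.
positions = [0.0, 1.0, 3.0, 7.0]
dist = [[abs(a - b) for b in positions] for a in positions]
tree = minimum_spanning_tree(dist)
```

This dense-matrix version runs in O(n²) time, which fits a fully connected graph where every pair of genes has an edge.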


Step 3: Iterative MST partition

At each node in the hierarchical tree formed so far, each of the edges in the corresponding node’s sub-MST is cut. With each cut a binary split of the genes is formed. If the current node includes n genes then n-1 such splits are formed. The two sub-clusters formed by the binary split, plus the clusters formed so far, compose a potential partition.
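Enumerating the candidate binary splits of a sub-MST can be sketched as follows: each edge is removed in turn and the two resulting connected components are collected. The function name and the edge representation (pairs of node ids) are assumptions for the example; scoring the splits is left out.

```python
def binary_splits(nodes, edges):
    """Return one (component_a, component_b) pair per tree edge removed."""
    splits = []
    for cut in range(len(edges)):
        # Adjacency lists of the tree without the cut edge.
        adjacency = {v: [] for v in nodes}
        for k, (a, b) in enumerate(edges):
            if k != cut:
                adjacency[a].append(b)
                adjacency[b].append(a)
        # Depth-first search from one endpoint of the removed edge.
        start = edges[cut][0]
        seen = {start}
        stack = [start]
        while stack:
            u = stack.pop()
            for w in adjacency[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        splits.append((sorted(seen), sorted(set(nodes) - seen)))
    return splits


# Toy sub-MST: a chain of four genes 0 - 1 - 2 - 3.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3)]
splits = binary_splits(nodes, edges)
```

A tree with n nodes yields n-1 splits, one per edge, matching the count given above.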


Step 4. Best Split

For each binary split we compute a category utility (CU) that indicates the division ability of the split. The more compact the clusters formed, the higher the CU.

[Equation: category utility (CU)]

where K is the number of clusters formed so far, $\sigma_{ik}$ is the standard deviation for sample i in class k, and $\sigma_{iP}$ is the standard deviation for attribute i of all the genes participating in the clustering.

The split that exhibits the highest CU is selected as the best partition of the genes in the current node.

J. Yoo and S. Yoo. “Concept Formation in Numeric Domains”. Proceedings of Computer Science Conference, pp. 36-41, Nashville, TN, March 1995.


Step 5: Iteration and termination criterion

Each new cutting point found on the tree divides the tree into two sub-trees: the left and the right.

The best cut of each of these two sub-trees is found as described in steps 3 and 4.

In order to decide what the new cut will be, four possibilities have to be examined.

To decide which possibility is the proper one we estimate the CU of each.


Time Complexity

The time and space complexity of calculating all distances of n genes with F samples is $O(n^2 F)$. When dealing with real-domain problems the number of computed distances may reach the order of $10^{11}$.

Even though this complexity can be handled by contemporary computers in terms of time, it is very hard to handle in terms of space.

In order to overcome this bottleneck we introduce a heuristic that significantly reduces the number of computed distances:

We assume that the maximum degree of the nodes of the computed MST is less than a constant value, say t. This hypothesis comes from the belief that the data has a minimum sparseness. Thus the MST of a fully connected graph cannot have a node with degree greater than t. This reduces the space complexity to $O(n t F)$, even though it increases the time complexity, as the burden of sorting the distances of each node has been added.
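One way to read this heuristic is to keep, for every gene, only the edges to its t nearest neighbours, so that at most n·t distances are stored instead of all pairs. The sketch below follows that reading; it is an interpretation, not the MineGene implementation. It uses Euclidean distance and a bounded heap selection (`heapq.nsmallest`) instead of a full sort.

```python
import heapq
import math


def nearest_neighbour_edges(profiles, t):
    """Keep only the edges from each gene to its t nearest neighbours.

    Returns a dict mapping (i, j) with i < j to the distance; at most
    n * t entries are stored instead of n * (n - 1) / 2.
    """
    n = len(profiles)
    edges = {}
    for i in range(n):
        # Distances from gene i to every other gene, generated lazily so
        # that the full distance matrix is never held in memory.
        candidates = ((math.dist(profiles[i], profiles[j]), j)
                      for j in range(n) if j != i)
        for d, j in heapq.nsmallest(t, candidates):
            edges[(min(i, j), max(i, j))] = d
    return edges


# Toy profiles: four genes on a line at positions 0, 1, 3 and 7.
profiles = [[0.0], [1.0], [3.0], [7.0]]
sparse = nearest_neighbour_edges(profiles, t=1)
```

With a small t the resulting sparse graph may be disconnected, in which case a spanning tree over all genes does not exist; t has to be chosen large enough for the data at hand.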


Experimental Evaluation on Gene-Expression Data Clustering

Large-scale temporal gene-expression mapping of Central Nervous System development (112 genes; 9 developmental time-points). Wen et al., PNAS 95, 334-339, January 1998.

[Figure: GTC clustering tree with the clusters c12 (w1), c112 (w4), c1112 (w2), c2 (w5) and c1111 (w3).]

Page 48

GTC is:
• Well-formed
• Reliable
• Stable

Clusters almost identical to Wen:
• w4 - c112: LATE
• w2 - c1112: EARLY_MID
• w3 - c1111: EARLY_MID_C
• w5 - c2: Constant
• w1 - c12: EARLY

The same using GTC-VDM

GTC: Comparison & Interpretation of Results

[Figure (a): side-by-side expression plots of the Wen clusters and the GTC clusters (y-axis 0.00 to 1.00; x-axis time points E11, E13, E15, E18, E21, P0, P7, P14, A). Panel pairs: w1 / c12, w2 / c1112, w3 / c1111, w4 / c112, w5 / c2. The panels also carry the labels c1 (w5), C2111 (w3), C2112 (w2), C22 (w1), C212 (w4).]

[Figure (b): Indicative Patterns: EARLY, EARLY_MID, EARLY_MID_C, Constant, LATE.]

Page 49

The MineGene System: Implementation Issues

• MineGene is a collection of Machine Learning / Data Mining algorithms and heuristics for intelligent processing of gene expression data produced by DNA Microarray experiments.

• It is designed and implemented to serve as a plug-in for a gene expression database.

• It implements (among others) all the methods presented.

Page 50

MineGene’s Pathway

There is not yet a standard method for microarray gene expression data analysis, only some general guidelines that have recently started to form.

These guidelines describe a sequential procedure: a pathway that starts after data acquisition and ends with the construction of a predictor or a clustering mechanism, depending on whether we are performing supervised or unsupervised data analysis.

Page 51

Class hierarchy of MineGene

MineGene should:
• Act as a plug-in in a gene expression database.
• Be composed of several components with certain relationships between them, as algorithms belonging to the same family share common attributes.
• Utilize a Graphical User Interface.

Thus, Object Oriented Programming via C++.

Page 52

MineGene’s GUI

Page 53

MineGene supports:
• Filtering methods:
– Remove NaN (Not a Number) values.
– Remove non-significant genes (according to the Wilcoxon rank-sum test).
– Read from an external resource (study genes).
• Ranking methods:
– According to Entropy (as presented)
– According to Standard Deviation (Signal to Noise): $\frac{\mu_a - \mu_b}{\sigma_a + \sigma_b}$, where $\mu$ and $\sigma$ are the mean and standard deviation of a gene's expression in classes a and b
– According to Significance (Wilcoxon rank-sum test)
– According to an external resource (file)
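The signal-to-noise ranking can be sketched as follows. This is a minimal illustration, assuming the common form of the score (difference of the class means divided by the sum of the class standard deviations) and two classes labelled "a" and "b". The function names and the input layout are hypothetical and do not come from MineGene.

```python
import math
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """Signal-to-noise score of one gene between classes a and b.

    Each class needs at least two samples so that stdev is defined.
    """
    diff = mean(values_a) - mean(values_b)
    spread = stdev(values_a) + stdev(values_b)
    if spread == 0:
        # Assumption: no variation at all means either no signal or a
        # perfect separation, reported as a signed infinity.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / spread


def rank_genes(expression, labels):
    """Return gene names sorted by decreasing absolute signal-to-noise.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list of class labels ("a" or "b"), one per sample
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, c in zip(values, labels) if c == "a"]
        b = [v for v, c in zip(values, labels) if c == "b"]
        scores[gene] = signal_to_noise(a, b)
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
```

Sorting by the absolute value puts genes that are strongly over-expressed in either class at the top. A ranking that favours only one direction would sort by the signed score instead.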

Page 54

MineGene Supports:
• Grouping Methods:
– According to the method presented.
– No grouping at all.
• Gene Selection Methods:
– ADD / DEL methods presented
– A priori gene or group selection

Page 55

MineGene Supports
• Prediction Methods:
– Discretisation (presented before) for dual or multiclass domains.
– Support Vector Machines (through libsvm)
– K-Nearest Neighbours (KNN)
– K-Means

Page 56

MineGene Supports
• Clustering through GTC (as presented)
– MST, Distance and Category Utility methods selection
– Heuristics for a-priori cluster size.
– Options to cluster an arbitrary tree, to use external distances and to cluster an arbitrary graph (not fully connected).
– Option to visualize clustering in .jpg format through GraphViz.

Page 57

MineGene Supports
• Study comparison: A study contains the genes selected by an external work. These are compared with the genes found by our study and the common genes are exported.
• Validation: Leave One Out Cross Validation is supported (currently extended).
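Leave One Out Cross Validation can be sketched in a few lines. The example below is illustrative only: it holds out each sample once, trains on the remaining ones and reports the fraction of correct predictions. The nearest-neighbour classifier and all function names are assumptions chosen to keep the sketch self-contained. Any of the prediction methods listed earlier could be passed in its place.

```python
import math


def nearest_neighbour(train_x, train_y, sample):
    """Predict the label of `sample` as that of its closest training sample."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], sample))
    return train_y[best]


def leave_one_out_accuracy(samples, labels, classify=nearest_neighbour):
    """Hold out each sample once, train on the rest, count correct predictions."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)
```

Here `samples` is a list of expression profiles (one list of values per sample) and `labels` holds the corresponding class of each sample.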

Page 58

MineGene Supports
• Study clustering

When we perform clustering, our outcomes can be compared with an external clustering. The similarity of two clusterings can be assessed by:

$$E_{CL} = \sum_{i=1}^{\#CL} \frac{\#CL_i}{n} \, E(CL_i)$$

$$E(CL_i) = -\sum_{j=1}^{\#Cl} P(Cl_{ij}) \log P(Cl_{ij})$$

$$P(Cl_{ij}) = \frac{\#Cl_{ij}}{\#CL_i}$$

where $\#CL$ is the total number of clusters produced by our algorithm, $\#Cl$ is the total number of external clusters, $n$ is the total number of genes, $\#CL_i$ is the number of genes contained in cluster i of our algorithm, and $\#Cl_{ij}$ is the number of genes of external cluster j that belong to cluster i of our algorithm.
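The comparison can be sketched as a size-weighted entropy: for each of our clusters we measure how its genes spread over the external clusters, and the per-cluster entropies are averaged with weights proportional to cluster size. The sketch below rests on stated assumptions: the natural logarithm, the conventional minus sign of the entropy, and weighting by cluster size over the total number of genes. The function name and the input layout are hypothetical.

```python
import math
from collections import Counter


def clustering_entropy(ours, external):
    """Weighted entropy of an external clustering inside our clusters.

    ours, external: dicts mapping gene name -> cluster label.
    Returns 0.0 when each of our clusters lies inside a single external
    cluster; larger values mean less agreement.
    """
    genes = [g for g in ours if g in external]  # genes present in both
    n = len(genes)
    members = {}
    for g in genes:
        members.setdefault(ours[g], []).append(g)

    total = 0.0
    for cluster_genes in members.values():
        size = len(cluster_genes)  # genes in our cluster i
        # genes of our cluster i falling in each external cluster j
        counts = Counter(external[g] for g in cluster_genes)
        entropy = -sum((c / size) * math.log(c / size) for c in counts.values())
        total += (size / n) * entropy
    return total
```

Note that the measure is not symmetric: swapping the two arguments asks how our clusters spread over the external ones, which generally gives a different value.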

Page 59

Future Work
• Porting to other well-known analysis tools, e.g. as an R package (standard in Bioinformatics).
• Inclusion in an Integrated Clinico-Genomics Environment (not a standalone application or a Gene Expression Database).
• Include visualization methods.
• Support of clinico-genomic knowledge-discovery scenarios.
…

Page 60

Integrated Clinico-Genomics with MineGene: A Multi-Strategy Data Mining Approach

Clustering: Clusters of Genes; Means of Clusters = Meta-Genes

Association Rules: Interesting associations between Clinical-Parameters and Meta-Genes = Interesting Clinical Profiles/Categories (ER+ & PR+ & AGE > 40 & GOOD-prognosis VS. ER+ & PR+ & AGE > 40 & BAD-prognosis)

Gene-Selection: Select discriminant genes that distinguish between the discovered clinical profiles

ER+ & PR+ & AGE > 40 & MG-1=High & MG-2=Low => GOOD-prognosis (> 5 yrs)
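The meta-gene idea (the mean of each gene cluster) can be sketched as below. This is an illustration only. The function name, the input layout and the cluster names MG-1 and MG-2 used in the usage note are hypothetical, and how a meta-gene would then be discretised into High or Low is not covered.

```python
def meta_genes(expression, clusters):
    """Average the expression profiles of the genes in each cluster.

    expression: dict mapping gene name -> list of values, one per sample
    clusters:   dict mapping cluster name -> non-empty list of gene names
    Returns a dict mapping cluster name -> mean profile (the meta-gene).
    """
    result = {}
    for name, genes in clusters.items():
        profiles = [expression[g] for g in genes]
        # zip(*profiles) walks the samples; each column holds one sample's
        # values across all genes of the cluster
        result[name] = [sum(col) / len(col) for col in zip(*profiles)]
    return result
```

Calling `meta_genes(expression, {"MG-1": ["g1", "g2"], "MG-2": ["g3"]})` yields one averaged profile per cluster, which can then be treated like an ordinary feature next to the clinical parameters.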

Page 61

THANK YOU!

?